Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Subsampling plays a crucial role in tackling problems associated with the storage and statistical learning of massive datasets. However, most existing subsampling methods are model-based, which means their performance can drop significantly when the underlying model is misspecified. Such an issue calls for model-free subsampling methods that are robust under diverse model specifications. Recently, several model-free subsampling methods have been developed. However, the computing time of these methods grows explosively with the sample size, making them impractical for handling massive data. In this article, an efficient model-free subsampling method is proposed, which segments the original data into regular data blocks and obtains subsamples from each data block by a data-driven subsampling method. Compared with existing model-free subsampling methods, the proposed method has a significant speed advantage and performs more robustly for datasets with complex underlying distributions. As demonstrated in simulation experiments, the proposed method is an order of magnitude faster than other commonly used model-free subsampling methods when the sample size of the original dataset reaches the order of 10^7. Moreover, simulation experiments and case studies show that the proposed method is more robust than other model-free subsampling methods under diverse model specifications and subsample sizes.
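A minimal sketch of the block-then-subsample idea described above, not the authors' exact data-driven rule: rows are split into contiguous blocks, and within each block the points nearest to k-means centroids are retained as a model-free, space-filling stand-in for the block-level subsampling step. The function name and the k-means substitute are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

def blockwise_modelfree_subsample(X, n_blocks, n_per_block, random_state=0):
    """Return row indices of a model-free subsample built block by block."""
    X = np.asarray(X, dtype=float)
    selected = []
    # assumes each block holds at least n_per_block rows
    for block in np.array_split(np.arange(len(X)), n_blocks):
        km = KMeans(n_clusters=n_per_block, n_init=10,
                    random_state=random_state).fit(X[block])
        # keep the data point closest to each centroid (space-filling stand-in)
        d = np.linalg.norm(X[block][:, None, :] - km.cluster_centers_[None, :, :], axis=2)
        selected.extend(block[np.unique(np.argmin(d, axis=0))])
    return np.asarray(selected)

# Example usage: sub_idx = blockwise_modelfree_subsample(X, n_blocks=20, n_per_block=50)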
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Subsampling is an effective approach to address computational challenges associated with massive datasets. However, existing subsampling methods do not consider model uncertainty. In this article, we investigate the subsampling technique for the Akaike information criterion (AIC) and extend the subsampling method to the smoothed AIC model-averaging framework in the context of generalized linear models. By correcting the asymptotic bias of the maximized subsample objective function used to approximate the Kullback–Leibler divergence, we derive the form of the AIC based on the subsample. We then provide a subsampling strategy for the smoothed AIC model-averaging estimator and study the corresponding asymptotic properties of the loss and the resulting estimator. A practically implementable algorithm is developed, and its performance is evaluated through numerical experiments on both real and simulated datasets.
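As a rough illustration of the smoothed-AIC weighting step on a subsample, the hedged sketch below fits several candidate GLMs to a uniform subsample, computes the naive subsample AIC for each (the article's asymptotic bias correction is omitted), and turns the AIC values into smoothed-AIC model-averaging weights. The helper name and the uniform subsample are assumptions.

import numpy as np
import statsmodels.api as sm

def smoothed_aic_weights(X, y, candidates, r, family=sm.families.Binomial(), rng=None):
    """candidates: list of column-index lists defining candidate GLMs."""
    rng = np.random.default_rng(rng)
    idx = rng.choice(len(y), size=r, replace=False)   # uniform subsample of size r
    aics, fits = [], []
    for cols in candidates:
        Xs = sm.add_constant(X[np.ix_(idx, cols)])
        res = sm.GLM(y[idx], Xs, family=family).fit()
        aics.append(res.aic)                          # naive subsample AIC
        fits.append(res)
    aics = np.asarray(aics)
    w = np.exp(-0.5 * (aics - aics.min()))            # smoothed-AIC weights
    return w / w.sum(), fits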
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Nonuniform subsampling methods are effective in reducing the computational burden and maintaining estimation efficiency for massive data. Existing methods mostly focus on subsampling with replacement due to its high computational efficiency. If the data volume is so large that nonuniform subsampling probabilities cannot be calculated all at once, then subsampling with replacement is infeasible to implement. This article solves this problem using Poisson subsampling. We first derive optimal Poisson subsampling probabilities in the context of quasi-likelihood estimation under the A- and L-optimality criteria. For a practically implementable algorithm with approximated optimal subsampling probabilities, we establish the consistency and asymptotic normality of the resultant estimators. To deal with the situation in which the full data are stored in different blocks or at multiple locations, we develop a distributed subsampling framework, in which statistics are computed simultaneously on smaller partitions of the full data. Asymptotic properties of the resultant aggregated estimator are investigated. We illustrate and evaluate the proposed strategies through numerical experiments on simulated and real datasets. Supplementary materials for this article are available online.
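The hedged sketch below illustrates the mechanics of Poisson subsampling for a logistic regression: a pilot fit supplies approximate subsampling probabilities (here the familiar |y - p_hat| * ||x|| form, a surrogate for the quasi-likelihood A-/L-optimal probabilities derived in the article), each observation is kept by an independent Bernoulli draw, and inverse-probability weights are returned for downstream weighted estimation. The function name, the pilot size, and the binary-response setting are assumptions; in a truly massive-data setting the score and draw steps could be run block by block.

import numpy as np
from sklearn.linear_model import LogisticRegression

def poisson_subsample(X, y, r, pilot_size=1000, rng=None):
    """y is assumed binary (0/1); returns selected indices and inverse-probability weights."""
    rng = np.random.default_rng(rng)
    # pilot estimate on a small uniform subsample (assumed to contain both classes)
    pilot = rng.choice(len(y), size=min(pilot_size, len(y)), replace=False)
    p = LogisticRegression(max_iter=1000).fit(X[pilot], y[pilot]).predict_proba(X)[:, 1]
    h = np.abs(y - p) * np.linalg.norm(X, axis=1)      # surrogate optimality scores
    pi = np.minimum(1.0, r * h / h.sum())              # expected subsample size is about r
    keep = rng.random(len(y)) < pi                     # independent Bernoulli (Poisson) draws
    return np.where(keep)[0], 1.0 / pi[keep]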
Disparity-through-time analyses can be used to determine how morphological diversity changes in response to mass extinctions, and to investigate the drivers of morphological change. These analyses are routinely applied to palaeobiological datasets, yet although there is much discussion about how to best calculate disparity, there has been little consideration of how taxa should be sub-sampled through time. Standard practice is to group taxa into discrete time bins, often based on stratigraphic periods. However, this can introduce biases when bins are of unequal size, and implicitly assumes a punctuated model of evolution. In addition, many time bins may have few or no taxa, meaning that disparity cannot be calculated for the bin and making it harder to complete downstream analyses. Here we describe a different method to complement the disparity-through-time tool-kit: time-slicing. This method uses a time-calibrated phylogenetic tree to sample disparity-through-time at any fixed point in...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Hierarchical data analysis is crucial in various fields for making discoveries. The linear mixed model is often used for modeling hierarchical data, but its parameter estimation is computationally expensive, especially with big data. Subsampling techniques have been developed to address this challenge. However, most existing subsampling methods assume homogeneous data and do not consider the possible heterogeneity in hierarchical data. To address this limitation, we develop a new approach called group-orthogonal subsampling (GOSS) for selecting informative subsets of hierarchical data that may exhibit heterogeneity. GOSS selects subdata with balanced data size among groups and combinatorial orthogonality within each group, resulting in subdata that are D- and A-optimal for building linear mixed models. Estimators of parameters trained on GOSS subdata are consistent and asymptotically normal. GOSS is shown to be numerically appealing via simulations and a real data application. Theoretical proofs, R codes, and supplementary numerical results are accessible online as Supplementary Materials.
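A loose sketch of the group-balanced selection idea, not the GOSS algorithm itself: subdata sizes are equalized across groups, and within each group points are stratified by the sign pattern of their centered covariates, taking the largest-norm point from each stratum in round-robin order as a crude stand-in for the combinatorial-orthogonality criterion. All names and the stratification rule are assumptions.

import numpy as np
import pandas as pd

def goss_like_subsample(df, group_col, x_cols, n_total):
    groups = df[group_col].unique()
    per_group = n_total // len(groups)                 # balanced allocation across groups
    chosen = []
    for g in groups:
        block = df[df[group_col] == g]
        Z = block[x_cols].to_numpy(float)
        Z = Z - Z.mean(axis=0)                         # center within the group
        norms = np.linalg.norm(Z, axis=1)
        strata = {}                                    # stratify by covariate sign pattern
        for i, s in enumerate(map(tuple, (Z >= 0).astype(int))):
            strata.setdefault(s, []).append(i)
        for s in strata:
            strata[s].sort(key=lambda i: -norms[i])    # largest-norm points first
        picked, quota = [], min(per_group, len(block))
        while len(picked) < quota:                     # round-robin over sign patterns
            for s in list(strata):
                if strata[s]:
                    picked.append(strata[s].pop(0))
                if len(picked) == quota:
                    break
        chosen.append(block.iloc[picked])
    return pd.concat(chosen)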
This paper studies subsampling hypothesis tests for panel data that may be nonstationary, cross-sectionally correlated, and cross-sectionally cointegrated. The subsampling approach provides approximations to the finite sample distributions of the tests without estimating nuisance parameters. The tests include panel unit root and cointegration tests as special cases. The number of cross-sectional units is assumed to be finite and that of time-series observations infinite. It is shown that subsampling provides asymptotic distributions that are equivalent to the asymptotic distributions of the panel tests. In addition, the tests using critical values from subsampling are shown to be consistent. The subsampling methods are applied to panel unit root tests. The panel unit root tests considered are Levin, Lin, and Chu's (2002) t-test; Im, Pesaran, and Shin's (2003) averaged t-test; and Choi's (2001) inverse normal test. Simulation results regarding the subsampling panel unit root tests and some existing unit root tests for cross-sectionally correlated panels are reported. In using the subsampling approach to examine the real exchange rates of the G7 countries and a group of 26 OECD countries, we find only mixed support for the purchasing power parity (PPP) hypothesis. We then examine a panel of 17 developed stock market indexes and likewise find only mixed empirical support for relative mean reversion of these indexes with respect to the US stock market index.
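To make the subsampling recipe concrete, the sketch below illustrates it for a single time series rather than a panel (a simplifying assumption): a Dickey-Fuller-type t-statistic is recomputed on every overlapping block of length b, and an empirical quantile of the block statistics serves as the critical value, with no nuisance parameters estimated. The function names and the choice of statistic are illustrative, not taken from the paper.

import numpy as np

def df_tstat(y):
    """Dickey-Fuller t-statistic from regressing diff(y) on a constant and lagged y."""
    dy, ylag = np.diff(y), y[:-1]
    X = np.column_stack([np.ones_like(ylag, dtype=float), ylag])
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    s2 = resid @ resid / (len(dy) - 2)
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    return beta[1] / se

def subsampling_critical_value(y, b, alpha=0.05):
    """Left-tail critical value from the statistic recomputed on all blocks of length b."""
    stats = np.array([df_tstat(y[i:i + b]) for i in range(len(y) - b + 1)])
    return np.quantile(stats, alpha)

# Reject the unit root at level alpha if df_tstat(y) < subsampling_critical_value(y, b)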
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In recent years, large networks have been routinely used to represent data from many scientific fields. Statistical analysis of these networks, such as estimation and hypothesis testing, has received considerable attention. However, most of the methods proposed in the literature are computationally expensive for large networks. In this article, we propose a subsampling-based method to reduce the computational cost of estimation and two-sample hypothesis testing. The idea is to divide the network into smaller subgraphs with an overlap region, then draw inference based on each subgraph, and finally combine the results. We first develop the subsampling method for random dot product graph models and establish theoretical consistency of the proposed method. Then we extend the subsampling method to a more general setup and establish similar theoretical properties. We demonstrate the performance of our methods through simulation experiments and real data analysis. The code is available in the following GitHub repository: https://github.com/kchak19/SubsampleTestingNetwork. Supplementary materials for this article are available online.
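A rough sketch of the divide-and-combine idea for a random dot product graph, with illustrative choices that are not taken from the article: the node set is split into overlapping subsets, each induced subgraph is embedded by adjacency spectral embedding, a scalar functional (here the mean estimated edge probability) is computed per subgraph, and the subgraph estimates are averaged.

import numpy as np

def ase(A, d):
    """Adjacency spectral embedding into d dimensions."""
    vals, vecs = np.linalg.eigh(A)
    top = np.argsort(np.abs(vals))[::-1][:d]
    return vecs[:, top] * np.sqrt(np.abs(vals[top]))

def subsampled_density_estimate(A, d=2, n_parts=4, overlap=0.2, rng=None):
    rng = np.random.default_rng(rng)
    n = A.shape[0]
    perm = rng.permutation(n)                 # random node order
    size = int(n / n_parts * (1 + overlap))   # subgraph size with overlap
    estimates = []
    for k in range(n_parts):
        nodes = perm[int(k * n / n_parts):int(k * n / n_parts) + size]
        Xhat = ase(A[np.ix_(nodes, nodes)], d)
        P = np.clip(Xhat @ Xhat.T, 0.0, 1.0)  # estimated edge-probability matrix
        estimates.append(P.mean())            # scalar functional per subgraph
    return float(np.mean(estimates))          # combine subgraph estimates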
High-velocity, large-scale data streams have become pervasive. Frequently, the associated labels for such data prove costly to measure and are not always available upfront. Consequently, the analysis of such data poses a significant challenge. In this article, we develop a method that addresses this challenge by employing an online subsampling procedure and a multinomial logistic model for efficient analysis of high-velocity, large-scale data streams. Our algorithm is designed to sequentially update parameter estimation based on the A-optimality criterion. Moreover, it significantly increases computational efficiency while imposing minimal storage requirements. Theoretical properties are rigorously established to quantify the asymptotic behavior of the estimator. The method’s efficacy is further demonstrated through comprehensive numerical studies on both simulated and real-world datasets. Supplementary materials for this article are available online.
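The hedged sketch below conveys the flavor of online subsampling on a stream, not the article's algorithm: each arriving batch is thinned by Poisson sampling with probabilities proportional to a per-observation score-norm surrogate under the current multinomial logistic fit (the A-optimality criterion used in the article also involves the information matrix), and the model is refit on the retained, inverse-probability-weighted pool. The class name, batch interface, and surrogate score are assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression

class OnlineSubsampler:
    """Keep roughly `expected_per_batch` points from each arriving batch."""

    def __init__(self, expected_per_batch=200, seed=0):
        self.r = expected_per_batch
        self.model = LogisticRegression(max_iter=1000)
        self.X_pool, self.y_pool, self.w_pool = [], [], []
        self.fitted = False
        self.rng = np.random.default_rng(seed)

    def update(self, X, y):
        if not self.fitted:
            h = np.ones(len(y))                          # first batch: uniform probabilities
        else:
            P = self.model.predict_proba(X)
            E = np.zeros_like(P)
            for j, c in enumerate(self.model.classes_):  # one-hot encode observed labels
                E[:, j] = (y == c).astype(float)
            h = np.linalg.norm(E - P, axis=1) * np.linalg.norm(X, axis=1)
        pi = np.minimum(1.0, self.r * h / h.sum())
        keep = self.rng.random(len(y)) < pi              # Poisson thinning of the batch
        self.X_pool.append(X[keep])
        self.y_pool.append(y[keep])
        self.w_pool.append(1.0 / pi[keep])
        self.model.fit(np.vstack(self.X_pool), np.concatenate(self.y_pool),
                       sample_weight=np.concatenate(self.w_pool))
        self.fitted = True
        return self.model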
https://spdx.org/licenses/CC0-1.0.html
Studying the genetic population structure of species can reveal important insights into several key evolutionary, historical, demographic, and anthropogenic processes. One of the most important statistical tools for inferring genetic clusters is the program STRUCTURE. Recently, several papers have pointed out that STRUCTURE may show a bias when the sampling design is unbalanced, resulting in spurious joining of underrepresented populations and spurious separation of overrepresented populations. Suggestions to overcome this bias include subsampling and changing the ancestry model, but the performance of these two methods has not yet been tested on actual data. Here, I use a dataset of twelve high-alpine plant species to test whether unbalanced sampling affects the STRUCTURE inference of population differentiation between the European Alps and the Carpathians. For four of the twelve species, subsampling of the Alpine populations (to match the sample size between the Alps and the Carpathians) resulted in a drastically different clustering than the full dataset. On the other hand, STRUCTURE results with the alternative ancestry model were indistinguishable from the results with the default model. Based on these results, the subsampling strategy seems a more viable approach to overcome the bias than the alternative ancestry model. However, subsampling is only possible when there is an a priori expectation of what constitutes the main clusters. Though these results do not mean that the use of STRUCTURE should be discarded, they do indicate that users of the software should be cautious about the interpretation of the results when sampling is unbalanced.
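A small sketch of the subsampling strategy discussed here, assuming the genotype table is a pandas DataFrame with a population-label column (a hypothetical layout): individuals from overrepresented populations are randomly down-sampled so that every population contributes the same number of individuals before running STRUCTURE.

import pandas as pd

def balance_populations(df, pop_col="population", random_state=1):
    """Randomly down-sample every population to the size of the smallest one."""
    n_min = df[pop_col].value_counts().min()
    return (df.groupby(pop_col, group_keys=False)
              .apply(lambda g: g.sample(n=n_min, random_state=random_state)))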
Bone and antler are important raw materials for tool manufacture in many cultures, past and present. The modification of osseous features which takes place during artifact manufacture frequently makes it difficult to identify either the bone element or the host animal, which can limit our understanding of the cultural, economic, and/or symbolic factors which influence raw material acquisition and use. While biomolecular approaches can provide taxonomic identifications of bone or antler artifacts, these methods are frequently destructive, raising concerns about invasive sampling of culturally important artifacts or belongings. Collagen peptide mass fingerprinting (Zooarchaeology by Mass Spectrometry or ZooMS) can provide robust taxonomic identifications of bone and antler artifacts. While the ZooMS method commonly involves destructive subsampling, minimally-invasive sampling techniques based on the triboelectric effect have also been proposed. In this paper, we compare three previously proposed minimally-invasive sampling methods (forced bag, eraser, and polishing film) on an assemblage of 15 bone artifacts from the pre-contact site EjTa-4, a large midden complex located on Calvert Island, British Columbia, Canada. We compare the results of the minimally-invasive methods to 10 fragmentary remains sampled using the conventional destructive ZooMS method. We assess the reliability and effectiveness of these methods by comparing MALDI-TOF spectral quality, the number of diagnostic and high molecular weight peaks, as well as the taxonomic resolution reached after identification. We find that coarse fiber-optic polishing films are the most effective of the minimally-invasive techniques compared in this study, and that the spectral quality produced by this minimally-invasive method was not significantly different from the conventional destructive method. Our results suggest that this minimally-invasive sampling technique for ZooMS can be successfully applied to culturally significant artifacts, providing comparable taxonomic identifications to the conventional, destructive ZooMS method.
Methods: Fifteen bone artifacts were sampled for ZooMS using three different minimally-invasive techniques: forced bag, eraser, and coarse polishing film. Sampling was conducted following modified versions of the procedures outlined in Fiddyment et al. (2015) for the eraser method (E), in McGrath et al. (2019) for the forced bag method (B), and in Kirby et al. (2020) for the polishing disc method (P). Sampling and extraction blanks were included for all three methods. Additionally, 10 bone fragments from Calvert Island were analyzed using the destructive acid (0.6 M HCl) demineralisation method (Buckley et al., 2009; as modified in McGrath et al., 2019). Six bone objects were tested using two different types of ultra-fine polishing film: coarse (fiber optic polishing film disc with aluminum oxide grit size of 30 µm) (P1) and fine (aluminum oxide grit size 6 µm) (P2). Following individual sampling procedures, all samples were gelatinized in AmBic at 65℃ for one hour; 50 µL of the resulting supernatant was removed, and the remaining pellet was stored in the freezer. Samples were incubated overnight (12–18 hours) at 37℃ with 0.4 µg of trypsin. The trypsin was deactivated using 1 µL of 5% TFA solution. Collagen in the samples was purified and desalted using Pierce C18S Tips.
Each purified sample was spotted in triplicate, along with calibration standards, onto a 384-spot Bruker MALDI ground steel target plate using 1 µL of sample and 1 µL of α-cyano-hydroxycinnamic acid matrix. The samples were run on a Bruker ultraflex III MALDI TOF/TOF mass spectrometer with a Nd:YAG smart beam laser at the University of York in York, UK, and a SNAP averaging algorithm was used to obtain monoisotopic masses (C 4.9384, N 1.3577, O 1.4773, S 0.0417, H 7.7583). Raw spectral data has been uploaded here.
https://spdx.org/licenses/CC0-1.0.html
Taxonomy is the very first step of most biodiversity studies, but how confident can we be in the taxonomic-systematic exercise? One may hypothesise that the more material, the better the taxonomic delineation, because the more accurate the description of morphological variability. As rarefaction curves assess the degree of knowledge on taxonomic diversity through sampling effort, we aim to test the impact of sampling effort on species delineation by subsampling a given assemblage. To do so, we use an abundant and morphologically diverse conodont fossil record. Within the assemblage, we first recognize four well-established morphospecies, but about 80% of the specimens share diagnostic characters of these morphospecies. We quantify these diagnostic characters on the sample using geometric morphometrics, and assess the number of morphometric groups, i.e. morphospecies, using ordination and cluster analyses. Then we gradually subsample the assemblage in two ways (randomly and by mimicking taxonomist work) and redo the ‘ordination + clustering’ protocol to appraise how the number of clusters evolves with sampling effort. We observe that the number of delineated morphospecies decreases as the number of specimens increases, whatever the subsampling method, resulting mostly in fewer morphospecies than expected. This rather counter-intuitive influence of sampling effort on species delineation highlights the complexity of taxonomic work. It indicates that new morphotaxa should not be erected based on small samples, and encourages researchers to extensively illustrate, measure, and quantitatively compare their material to better constrain the morphological variability of a clade, and so to better characterize and delineate morphospecies.
Methods: Please refer to the Material and Methods section of the publication Guenser et al., "When less is more and more is less: the impact of sampling effort on species delineation".
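As a hedged sketch of the ‘ordination + clustering’ protocol applied to subsamples (the publication's exact ordination and clustering choices may differ), the code below uses PCA for ordination and Gaussian mixture models with BIC to choose the number of clusters, repeating the procedure over random subsamples of decreasing size to track how the inferred number of morphospecies changes. Function names and parameter defaults are illustrative.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def n_clusters_by_bic(X, k_max=8, n_axes=2, random_state=0):
    """Ordination by PCA, then the BIC-best number of Gaussian mixture clusters."""
    scores = PCA(n_components=n_axes).fit_transform(X)
    bics = [GaussianMixture(k, random_state=random_state).fit(scores).bic(scores)
            for k in range(1, k_max + 1)]
    return int(np.argmin(bics)) + 1

def clusters_vs_sampling_effort(X, sizes, n_rep=20, rng=None):
    """Average inferred cluster count for each subsample size."""
    rng = np.random.default_rng(rng)
    return {n: float(np.mean([n_clusters_by_bic(X[rng.choice(len(X), n, replace=False)])
                              for _ in range(n_rep)]))
            for n in sizes}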
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Modern statistical analysis often encounters massive datasets with ultrahigh-dimensional features. In this work, we develop a subsampling approach for feature screening with massive datasets. The approach is implemented by repeated subsampling of massive data and can be used for analysis tasks under memory constraints. To conduct the procedure, we first calculate an R-squared screening measure (and related sample moments) based on subsamples. Second, we consider three methods to combine the local statistics. In addition to the simple average method, we design a jackknife debiased screening measure and an aggregated moment screening measure. Both approaches reduce the bias of the subsampling screening measure and therefore increase the accuracy of the feature screening. Last, we consider a novel sequential sampling method that is more computationally efficient than the traditional random sampling method. The theoretical properties of the three screening measures under both sampling schemes are rigorously discussed. Finally, we illustrate the usefulness of the proposed method with an airline dataset containing 32.7 million records.
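The sketch below contrasts two of the combination strategies mentioned above on hypothetical subsample chunks: the simple average of local marginal R-squared statistics and an aggregated-moment version that pools sufficient statistics before forming R-squared (the jackknife-debiased variant is omitted here). The function name and chunk interface are assumptions.

import numpy as np

def screen_by_subsampling(chunks):
    """chunks: iterable of (X_k, y_k) subsample pairs; returns two screening measures."""
    local = []
    n, sy, syy = 0, 0.0, 0.0
    sx = sxx = sxy = None
    for Xk, yk in chunks:
        # local marginal R-squared on this subsample
        Xc, yc = Xk - Xk.mean(0), yk - yk.mean()
        local.append((Xc * yc[:, None]).mean(0) ** 2 / (Xc.var(0) * yc.var() + 1e-12))
        # accumulate sufficient moments for the aggregated version
        if sx is None:
            sx, sxx, sxy = (np.zeros(Xk.shape[1]) for _ in range(3))
        n += len(yk)
        sx += Xk.sum(0); sxx += (Xk ** 2).sum(0); sxy += (Xk * yk[:, None]).sum(0)
        sy += yk.sum(); syy += (yk ** 2).sum()
    simple_average = np.mean(local, axis=0)
    cov = sxy / n - (sx / n) * (sy / n)
    varx = sxx / n - (sx / n) ** 2
    vary = syy / n - (sy / n) ** 2
    aggregated_moment = cov ** 2 / (varx * vary + 1e-12)
    return simple_average, aggregated_moment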
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We consider a measurement-constrained supervised learning problem, that is, (i) the full sample of predictors is given; (ii) the response observations are unavailable and expensive to measure. Thus, it is ideal to select a subsample of predictor observations, measure the corresponding responses, and then fit the supervised learning model on the subsample of the predictors and responses. However, model fitting is a trial-and-error process, and a postulated model for the data could be misspecified. Our empirical studies demonstrate that most of the existing subsampling methods have unsatisfactory performance when the models are misspecified. In this paper, we develop a novel subsampling method, called “LowCon,” which outperforms the competing methods when the working linear model is misspecified. Our method uses orthogonal Latin hypercube designs to achieve a robust estimation. We show that the proposed design-based estimator approximately minimizes the so-called worst-case bias with respect to many possible misspecification terms. Both the simulated and real-data analyses demonstrate that the proposed estimator is more robust than several subsample least-squares estimators obtained by state-of-the-art subsampling methods. Supplementary materials for this article are available online.
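A hedged sketch of the design-based selection idea: generate a Latin hypercube design on the scaled predictor space and keep, for each design point, the nearest data point. scipy's ordinary Latin hypercube is used here in place of the orthogonal Latin hypercube design used in the article, and the function name is illustrative.

import numpy as np
from scipy.spatial import cKDTree
from scipy.stats import qmc

def lowcon_like_subsample(X, n_sub, seed=0):
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    U = (X - lo) / (hi - lo + 1e-12)                    # scale predictors to [0, 1]^p
    design = qmc.LatinHypercube(d=X.shape[1], seed=seed).random(n_sub)
    _, idx = cKDTree(U).query(design)                   # nearest data point per design point
    return np.unique(idx)                               # selected row indices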
Inferences of population structure and more precisely the identification of genetically homogeneous groups of individuals are essential to the fields of ecology, evolutionary biology, and conservation biology. Such population structure inferences are routinely investigated via the program STRUCTURE, implementing a Bayesian algorithm to identify groups of individuals at Hardy-Weinberg and linkage equilibrium. While the method performs relatively well under various population models with even sampling between subpopulations, the robustness of the method to uneven sample size between subpopulations and/or hierarchical levels of population structure has not yet been tested despite being commonly encountered in empirical datasets. In this study, I used simulated and empirical microsatellite datasets to investigate the impact of uneven sample size between subpopulations and/or hierarchical levels of population structure on the detected population structure. The results demonstrated that u...
Each fragment has a separate line; grey fragments were not considered circular. (XLSX)
Phylogenomic subsampling is a procedure by which small sets of loci are selected from large genome-scale datasets and used for phylogenetic inference. This step is often motivated by either computational limitations associated with the use of complex inference methods, or as a means of testing the robustness of phylogenetic results by discarding loci that are deemed potentially misleading. Although many alternative methods of phylogenomic subsampling have been proposed, little effort has gone into comparing their behavior across different datasets. Here, I calculate multiple gene properties for a range of phylogenomic datasets spanning animal, fungal and plant clades, uncovering a remarkable predictability in their patterns of covariance. I also show how these patterns provide a means for ordering loci by both their rate of evolution and their relative phylogenetic usefulness. This method of retrieving phylogenetically useful loci is found to be among the top performing when compared to...
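As a small, hedged illustration of ordering loci by gene properties, the snippet below assumes a per-locus table with hypothetical columns (tree_length as a rate-of-evolution proxy, mean_bootstrap as a usefulness proxy) and ranks loci on both, returning a subsample of the most useful ones; the column names and ranking rule are not taken from the study.

import pandas as pd

def rank_loci(gene_table, n_keep=50):
    ranked = gene_table.assign(
        rate_rank=gene_table["tree_length"].rank(),                           # rate proxy
        usefulness_rank=gene_table["mean_bootstrap"].rank(ascending=False),   # usefulness proxy
    )
    return ranked.sort_values("usefulness_rank").head(n_keep)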
Environmental DNA (eDNA) metabarcoding is emerging as a novel tool for monitoring soil biodiversity. Soil biodiversity, critical for soil health and ecosystem services, is currently under-monitored due to the lack of standardized and efficient methods. We assessed whether refinements to sampling and molecular protocols could improve soil biodiversity detection and monitoring. Comparing the 2018 LUCAS soil biodiversity protocols with newly developed national methods, we tested sampling topsoil (0–10 cm) versus deeper layers, larger soil sample sizes for DNA extraction, taking more subsamples for composite soil samples, and alternative primer sets across 9 Belgian Biopoints included in the LUCAS 2022 survey. The results suggest that significantly more species can be detected in upper soil layers, including the forest floor, while the diversity of taxa and eDNA in the 10–30 cm soil layer is insufficient for annelids and arthropods to serve as indicators of ecological change. Additionally, comparison of the universal eukaryotic primers (18S) with primer sets tailored to soil mesofauna and macrofauna showed that universal 18S primers provide limited resolution for Collembola and Annelida. Overall, the analyses suggest that vertical soil stratification (with two sampling depths) has a greater influence on the captured diversity of soil mesofauna and macrofauna than the number of subsamples, and that the highest diversity is recovered when surface sampling (0–10 cm topsoil and forest floor) is combined with a greater number of subsamples and a larger sampled area. With refinement and standardization, eDNA metabarcoding, combined with optimized sampling protocols, could become a powerful and efficient tool for monitoring soil biodiversity in European soils.
Description of the files: This dataset includes interactive Krona taxonomy charts to visually summarize the diversity and relative read abundance of detected taxa across sampling locations and protocols. Each ring in the chart represents a taxonomic level, with the relative width of segments reflecting the proportion of reads assigned to specific taxa at that level. These charts enable exploration of taxonomic composition and allow for comparisons between the different sampled locations, sampling protocols tested, and primer sets tested. All Krona charts were made in R using psadd::plot_krona. To correct for uneven sequencing depth per sample, datasets were rarefied using a random subsampling method to 27,913, 31,655, 1,856, 19,728, and 19,632 reads for Annelida (Olig01), Collembola (Coll01), Fungi (ITS9mun/ITS4ngsUni), protists (18S), and Archaea (SSU1ArF/SSU1000ArR), respectively. Fauna datasets that are subsets of the total data recovered by a primer set designed to target many different phyla (e.g. 18S) were not rarefied prior to generating the Krona plots.
ejp_soil_annelida_olig01_27913.html contains the interactive taxonomy charts for Annelida. The data was generated using the group-specific Olig01 primer set and rarefied to 27,913 reads per sample.
ejp_soil_collembola_coll01_31655.html contains the interactive taxonomy charts for Collembola. The data was generated using the group-specific Coll01 primer set and rarefied to 31,655 reads per sample.
ejp_soil_arthropoda_inse01.html contains the interactive taxonomy charts for Arthropoda (Insecta, Arachnida, Chilopoda, Diplura, and Malacostraca). The data was generated using the Inse01 primer set.
ejp_soil_fungi_its9mun_its4ngsuni_1856.html contains the interactive taxonomy charts for Fungi. The data was generated using the ITS9mun and ITS4ngsUni primer set and rarefied to 1,856 reads per sample.
ejp_soil_protists_18s_19728.html contains the interactive taxonomy charts for protists. The data was generated using the eukaryotic 18S primer set and rarefied to 19,728 reads per sample.
ejp_soil_archaea_ssu1arf_ssu1000arr_19632.html contains the interactive taxonomy charts for Archaea. The data was generated using the SSU1ArF and SSU1000ArR primer set and rarefied to 19,632 reads per sample.
ejp_soil_annelida_18s.html contains the interactive taxonomy charts for Annelida. The data was generated using the eukaryotic 18S primer set.
ejp_soil_collembola_18s.html contains the interactive taxonomy charts for Collembola. The data was generated using the eukaryotic 18S primer set.
ejp_soil_arthropoda_18s.html contains the interactive taxonomy charts for Arthropoda. The data was generated using the eukaryotic 18S primer set.
ejp_soil_metadata.csv contains metadata for the samples in this study. It includes information about the sampling locations, the sampling protocols used, the sampling depth (cm), land use type, EUNIS habitat classification, and the LUCAS-ID for each sample.
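A minimal sketch of rarefaction by random subsampling without replacement, mirroring the per-primer read depths quoted above: each sample's taxon counts are down-sampled to a fixed depth using a multivariate hypergeometric draw. The function name and input layout are assumptions; the study's rarefaction was done in R.

import numpy as np

def rarefy(counts, depth, seed=0):
    """counts: 2D integer array (samples x taxa) of read counts."""
    counts = np.asarray(counts, dtype=np.int64)
    rng = np.random.default_rng(seed)
    out = np.zeros_like(counts)
    for i, row in enumerate(counts):
        if row.sum() >= depth:
            out[i] = rng.multivariate_hypergeometric(row, depth)  # draw without replacement
        # samples below the target depth are left at zero (i.e. dropped)
    return out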
http://dcat-ap.de/def/licenses/cc-by
Simulated samples for microplastic analysis by Raman microspectroscopy, used in the associated publication to evaluate subsample selections.
https://www.shibatadb.com/license/data/proprietary/v1.0/license.txt
Yearly citation counts for the publication titled "Interframe Picturephone® Coding Using Unconditional Vertical and Temporal Subsampling Techniques".
https://creativecommons.org/publicdomain/zero/1.0/
The sample_id is created sequentially from the 1st year (hence, it is different from the sample_id in the Kaggle Dataset). Note that while the original data follows a naming convention with 'train_...', this dataset simply uses integer IDs.
The data is divided into 12 chunks, each containing data from February of the 8th year to January of the 9th year.
Source code:
from pathlib import Path

import click
import pandas as pd
import polars as pl


@click.command()
@click.argument("subsample-rate", type=int)
@click.argument("offset", type=int)
def main(subsample_rate, offset):
    """Build the subsampled index table. Usage: python <script>.py SUBSAMPLE_RATE OFFSET"""
    NUM_YEARS = 8
    MONTH_DAY = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
    # total number of timesteps (72 per day) divided by the subsampling rate
    NUM_SAMPLES = (NUM_YEARS * sum(MONTH_DAY) * 72) // subsample_rate
    assert (
        0 <= offset < subsample_rate
    ), f"assertion failed: 0 <= offset < subsample_rate, got {offset} and {subsample_rate}."
    idx = 0
    file_id = 0
    data = []
    try:
        for year in range(NUM_YEARS):
            for month in range(1, 13):
                # the simulated year starts in February; month index 12 maps to
                # January of the following calendar year
                for day in range(MONTH_DAY[month % 12]):
                    for term in range(72):  # 72 timesteps per day
                        if file_id == NUM_SAMPLES:
                            raise StopIteration  # quota reached: leave all nested loops
                        if idx % subsample_rate == offset:
                            data.append(
                                dict(
                                    sample_id=file_id,
                                    year=year + 1,
                                    real_year=year + (month // 12) + 1,
                                    month=month % 12 + 1,
                                    day=day + 1,
                                    min_of_day=term * 1200,
                                )
                            )
                            file_id += 1
                        idx += 1
    except StopIteration:
        pass  # normal exit once NUM_SAMPLES rows have been collected

    output_path = Path(
        "/ml-docker/working/kaggle-leap-private/data/hugging_face_download"
    )
    if not output_path.exists():
        output_path.mkdir()
    df = pl.from_pandas(pd.DataFrame(data))
    print(df.filter(pl.col("year").eq(8)))  # quick sanity check on the final year
    df.write_parquet(output_path / f"subsample_s{subsample_rate}_o{offset}.pqt")


if __name__ == "__main__":
    main()
Low-Resolution Real Geography
11.5° x 11.5° horizontal resolution (384 grid columns)
100 million total samples (744 GB)
1.9 MB per input file, 1.1 MB per output file
Therefore, it is appropriate to evaluate models trained on the Kaggle Dataset with this dataset.