100+ datasets found
  1. Data from: Efficient Model-Free Subsampling Method for Massive Data

    • tandf.figshare.com
    txt
    Updated Feb 14, 2024
    Cite
    Zheng Zhou; Zebin Yang; Aijun Zhang; Yongdao Zhou (2024). Efficient Model-Free Subsampling Method for Massive Data [Dataset]. http://doi.org/10.6084/m9.figshare.24347102.v2
    Explore at:
    Available download formats: txt
    Dataset updated
    Feb 14, 2024
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Zheng Zhou; Zebin Yang; Aijun Zhang; Yongdao Zhou
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Subsampling plays a crucial role in tackling problems associated with the storage and statistical learning of massive datasets. However, most existing subsampling methods are model-based, which means their performances can drop significantly when the underlying model is misspecified. Such an issue calls for model-free subsampling methods that are robust under diverse model specifications. Recently, several model-free subsampling methods have been developed. However, the computing time of these methods grows explosively with the sample size, making them impractical for handling massive data. In this article, an efficient model-free subsampling method is proposed, which segments the original data into some regular data blocks and obtains subsamples from each data block by the data-driven subsampling method. Compared with existing model-free subsampling methods, the proposed method has a significant speed advantage and performs more robustly for datasets with complex underlying distributions. As demonstrated in simulation experiments, the proposed method is an order of magnitude faster than other commonly used model-free subsampling methods when the sample size of the original dataset reaches the order of 10⁷. Moreover, simulation experiments and case studies show that the proposed method is more robust than other model-free subsampling methods under diverse model specifications and subsample sizes.
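
    The block-then-subsample idea can be illustrated with a short sketch (a minimal illustration only: uniform draws within blocks stand in for the paper's data-driven within-block criterion, and all names are ours):

        import numpy as np

        def block_subsample(X, n_blocks=100, k_per_block=10, seed=0):
            # Segment the rows into contiguous blocks, then draw a small
            # subsample from each block. Uniform sampling stands in for
            # the paper's data-driven within-block subsampling step.
            rng = np.random.default_rng(seed)
            blocks = np.array_split(np.arange(len(X)), n_blocks)
            chosen = [rng.choice(b, size=min(k_per_block, len(b)), replace=False)
                      for b in blocks]
            return X[np.concatenate(chosen)]

        # Example: draw 1,000 rows from a 10^6-row dataset.
        X = np.random.standard_normal((1_000_000, 5))
        print(block_subsample(X).shape)  # (1000, 5)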

  2. Data from: A Subsampling Strategy for AIC-based Model Averaging with...

    • tandf.figshare.com
    • datasetcatalog.nlm.nih.gov
    zip
    Updated Nov 13, 2024
    Cite
    Jun Yu; HaiYing Wang; Mingyao Ai (2024). A Subsampling Strategy for AIC-based Model Averaging with Generalized Linear Models [Dataset]. http://doi.org/10.6084/m9.figshare.27089534.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 13, 2024
    Dataset provided by
    Taylor & Francis
    Authors
    Jun Yu; HaiYing Wang; Mingyao Ai
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Subsampling is an effective approach to address computational challenges associated with massive datasets. However, existing subsampling methods do not consider model uncertainty. In this article, we investigate the subsampling technique for the Akaike information criterion (AIC) and extend the subsampling method to the smoothed AIC model-averaging framework in the context of generalized linear models. By correcting the asymptotic bias of the maximized subsample objective function used to approximate the Kullback–Leibler divergence, we derive the form of the AIC based on the subsample. We then provide a subsampling strategy for the smoothed AIC model-averaging estimator and study the corresponding asymptotic properties of the loss and the resulting estimator. A practically implementable algorithm is developed, and its performance is evaluated through numerical experiments on both real and simulated datasets.
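
    The smoothed-AIC weights at the core of this framework are simple to compute once each candidate model's AIC is known; a reference sketch (standard weights only, assuming the log-likelihoods are given, and without the paper's subsample bias correction):

        import numpy as np

        def smoothed_aic_weights(loglik, n_params):
            # Smoothed-AIC model-averaging weights:
            #   w_m = exp(-AIC_m / 2) / sum_k exp(-AIC_k / 2).
            # Subtracting the minimum AIC first avoids underflow in exp.
            aic = -2.0 * np.asarray(loglik) + 2.0 * np.asarray(n_params)
            w = np.exp(-(aic - aic.min()) / 2.0)
            return w / w.sum()

        # Three candidate GLMs fitted on a subsample (toy log-likelihoods):
        print(smoothed_aic_weights([-520.3, -518.9, -519.5], [3, 5, 4]))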

  3. Data from: Optimal Distributed Subsampling for Maximum Quasi-Likelihood...

    • tandf.figshare.com
    • datasetcatalog.nlm.nih.gov
    txt
    Updated Feb 19, 2024
    Cite
    Jun Yu; HaiYing Wang; Mingyao Ai; Huiming Zhang (2024). Optimal Distributed Subsampling for Maximum Quasi-Likelihood Estimators With Massive Data [Dataset]. http://doi.org/10.6084/m9.figshare.12383537.v2
    Explore at:
    Available download formats: txt
    Dataset updated
    Feb 19, 2024
    Dataset provided by
    Taylor & Francis
    Authors
    Jun Yu; HaiYing Wang; Mingyao Ai; Huiming Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Nonuniform subsampling methods are effective in reducing the computational burden and maintaining estimation efficiency for massive data. Existing methods mostly focus on subsampling with replacement due to its high computational efficiency. If the data volume is so large that nonuniform subsampling probabilities cannot be calculated all at once, then subsampling with replacement is infeasible to implement. This article solves this problem using Poisson subsampling. We first derive optimal Poisson subsampling probabilities in the context of quasi-likelihood estimation under the A- and L-optimality criteria. For a practically implementable algorithm with approximated optimal subsampling probabilities, we establish the consistency and asymptotic normality of the resultant estimators. To deal with the situation in which the full data are stored in different blocks or at multiple locations, we develop a distributed subsampling framework, in which statistics are computed simultaneously on smaller partitions of the full data. Asymptotic properties of the resultant aggregated estimator are investigated. We illustrate and evaluate the proposed strategies through numerical experiments on simulated and real datasets. Supplementary materials for this article are available online.
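
    Poisson subsampling itself is a one-pass procedure, which is why it sidesteps the need to hold all probabilities in memory at once; a minimal sketch (with toy probabilities in place of the paper's A-/L-optimal ones):

        import numpy as np

        def poisson_subsample(X, probs, seed=0):
            # Each row i is kept independently with probability probs[i];
            # inverse-probability weights are returned for unbiased
            # downstream estimation. Rows could equally well be streamed.
            rng = np.random.default_rng(seed)
            keep = rng.random(len(X)) < probs
            return X[keep], 1.0 / probs[keep]

        X = np.random.standard_normal((100_000, 3))
        # Toy probabilities proportional to row norms, scaled so the
        # expected subsample size is about 1,000 (illustrative only).
        p = np.linalg.norm(X, axis=1)
        p = np.minimum(1.0, 1_000 * p / p.sum())
        sub, w = poisson_subsample(X, p)
        print(len(sub))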

  4. Data from: Time for a rethink: time sub-sampling methods in...

    • search.dataone.org
    • data.niaid.nih.gov
    • +1 more
    Updated Apr 12, 2025
    Cite
    Thomas Guillerme; Natalie Cooper (2025). Time for a rethink: time sub-sampling methods in disparity-through-time analyses [Dataset]. http://doi.org/10.5061/dryad.vp4q518
    Explore at:
    Dataset updated
    Apr 12, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Thomas Guillerme; Natalie Cooper
    Time period covered
    Apr 5, 2019
    Description

    Disparity-through-time analyses can be used to determine how morphological diversity changes in response to mass extinctions, and to investigate the drivers of morphological change. These analyses are routinely applied to palaeobiological datasets, yet although there is much discussion about how to best calculate disparity, there has been little consideration of how taxa should be sub-sampled through time. Standard practice is to group taxa into discrete time bins, often based on stratigraphic periods. However, this can introduce biases when bins are of unequal size, and implicitly assumes a punctuated model of evolution. In addition, many time bins may have few or no taxa, meaning that disparity cannot be calculated for the bin and making it harder to complete downstream analyses. Here we describe a different method to complement the disparity-through-time tool-kit: time-slicing. This method uses a time-calibrated phylogenetic tree to sample disparity-through-time at any fixed point in...

  5. Data from: Group-Orthogonal Subsampling for Hierarchical Data Based on...

    • tandf.figshare.com
    text/x-c++
    Updated Feb 15, 2024
    Cite
    Jiaqing Zhu; Lin Wang; Fasheng Sun (2024). Group-Orthogonal Subsampling for Hierarchical Data Based on Linear Mixed Models [Dataset]. http://doi.org/10.6084/m9.figshare.24937247.v1
    Explore at:
    Available download formats: text/x-c++
    Dataset updated
    Feb 15, 2024
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Jiaqing Zhu; Lin Wang; Fasheng Sun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Hierarchical data analysis is crucial in various fields for making discoveries. The linear mixed model is often used for training hierarchical data, but its parameter estimation is computationally expensive, especially with big data. Subsampling techniques have been developed to address this challenge. However, most existing subsampling methods assume homogeneous data and do not consider the possible heterogeneity in hierarchical data. To address this limitation, we develop a new approach called group-orthogonal subsampling (GOSS) for selecting informative subsets of hierarchical data that may exhibit heterogeneity. GOSS selects subdata with balanced data size among groups and combinatorial orthogonality within each group, resulting in subdata that are D- and A-optimal for building linear mixed models. Estimators of parameters trained on GOSS subdata are consistent and asymptotically normal. GOSS is shown to be numerically appealing via simulations and a real data application. Theoretical proofs, R codes, and supplementary numerical results are accessible online as Supplementary Materials.
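
    Only the group-balancing half of GOSS is easy to sketch; the within-group combinatorial-orthogonality criterion is the substantive part of the method and is omitted here (all names are illustrative):

        import numpy as np
        import pandas as pd

        def balanced_group_subsample(df, group_col, k_per_group, seed=0):
            # Draw the same number of rows from every group, mimicking the
            # 'balanced data size among groups' property of GOSS subdata.
            return (df.groupby(group_col, group_keys=False)
                      .apply(lambda g: g.sample(n=min(k_per_group, len(g)),
                                                random_state=seed)))

        df = pd.DataFrame({"group": np.repeat(np.arange(10), 500),
                           "x": np.random.standard_normal(5_000)})
        print(balanced_group_subsample(df, "group", 20).shape)  # (200, 2)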

  6. Subsampling hypothesis tests for nonstationary panels with applications to...

    • resodate.org
    Updated Oct 6, 2025
    Cite
    In Choi (2025). Subsampling hypothesis tests for nonstationary panels with applications to exchange rates and stock prices (replication data) [Dataset]. https://resodate.org/resources/aHR0cHM6Ly9qb3VybmFsZGF0YS56YncuZXUvZGF0YXNldC9zdWJzYW1wbGluZy1oeXBvdGhlc2lzLXRlc3RzLWZvci1ub25zdGF0aW9uYXJ5LXBhbmVscy13aXRoLWFwcGxpY2F0aW9ucy10by1leGNoYW5nZS1yYXRlcy1hbmQtc3RvY2s=
    Explore at:
    Dataset updated
    Oct 6, 2025
    Dataset provided by
    Journal of Applied Econometrics
    ZBW Journal Data Archive
    ZBW
    Authors
    In Choi
    Description

    This paper studies subsampling hypothesis tests for panel data that may be nonstationary, cross-sectionally correlated, and cross-sectionally cointegrated. The subsampling approach provides approximations to the finite sample distributions of the tests without estimating nuisance parameters. The tests include panel unit root and cointegration tests as special cases. The number of cross-sectional units is assumed to be finite and that of time-series observations infinite. It is shown that subsampling provides asymptotic distributions that are equivalent to the asymptotic distributions of the panel tests. In addition, the tests using critical values from subsampling are shown to be consistent. The subsampling methods are applied to panel unit root tests. The panel unit root tests considered are Levin, Lin, and Chu's (2002) t-test; Im, Pesaran, and Shin's (2003) averaged t-test; and Choi's (2001) inverse normal test. Simulation results regarding the subsampling panel unit root tests and some existing unit root tests for cross-sectionally correlated panels are reported. In using the subsampling approach to examine the real exchange rates of the G7 countries and a group of 26 OECD countries, we find only mixed support for the purchasing power parity (PPP) hypothesis. We then examine a panel of 17 developed stock market indexes and likewise find only mixed empirical support for relative mean reversion with respect to the US stock market index.
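
    The core mechanics of subsampling inference can be sketched in a few lines: evaluate the statistic on many contiguous blocks and use an empirical quantile as the critical value (a toy Politis–Romano-style version with a self-normalized statistic, so rate-scaling cancels; the paper's nonstationary-panel refinements are not reproduced):

        import numpy as np

        def subsample_critical_value(x, stat, b, alpha=0.05):
            # Evaluate the test statistic on every contiguous block of
            # length b, then read off the (1 - alpha) empirical quantile
            # as an approximate critical value.
            stats = [stat(x[i:i + b]) for i in range(len(x) - b + 1)]
            return np.quantile(stats, 1 - alpha)

        rng = np.random.default_rng(0)
        x = rng.standard_normal(2_000)                     # toy stationary series
        t_stat = lambda s: np.sqrt(len(s)) * s.mean() / s.std()
        print(subsample_critical_value(x, t_stat, b=100))  # ~ N(0,1) 95% quantile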

  7. Data from: Scalable Estimation and Two-Sample Testing for Large Networks via...

    • tandf.figshare.com
    zip
    Updated Jan 27, 2025
    Cite
    Kaustav Chakraborty; Srijan Sengupta; Yuguo Chen (2025). Scalable Estimation and Two-Sample Testing for Large Networks via Subsampling [Dataset]. http://doi.org/10.6084/m9.figshare.27905096.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 27, 2025
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Kaustav Chakraborty; Srijan Sengupta; Yuguo Chen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In recent years, large networks have been routinely used to represent data from many scientific fields. Statistical analysis of these networks, such as estimation and hypothesis testing, has received considerable attention. However, most of the methods proposed in the literature are computationally expensive for large networks. In this article, we propose a subsampling-based method to reduce the computational cost of estimation and two-sample hypothesis testing. The idea is to divide the network into smaller subgraphs with an overlap region, then draw inference based on each subgraph, and finally combine the results together. We first develop the subsampling method for random dot product graph models and establish theoretical consistency of the proposed method. Then we extend the subsampling method to a more general setup and establish similar theoretical properties. We demonstrate the performance of our methods through simulation experiments and real data analysis. The code is available in the following GitHub repository: https://github.com/kchak19/SubsampleTestingNetwork. Supplementary materials for this article are available online.
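
    The divide / infer / combine scheme can be caricatured with edge densities in place of the paper's random dot product graph estimators (a toy sketch; the shared overlap region is what allows per-subgraph estimates to be aligned in the real method):

        import numpy as np

        def combined_density(A, n_parts=4, overlap=50, seed=0):
            # Split the nodes into parts that all share a common overlap
            # region, estimate on each induced subgraph, and combine by
            # averaging. Edge density stands in for the paper's RDPG fits.
            rng = np.random.default_rng(seed)
            parts = np.array_split(rng.permutation(A.shape[0]), n_parts)
            core = rng.choice(A.shape[0], size=overlap, replace=False)
            ests = []
            for p in parts:
                idx = np.union1d(p, core)
                S = A[np.ix_(idx, idx)]
                ests.append(S.sum() / (len(idx) * (len(idx) - 1)))
            return float(np.mean(ests))

        A = np.random.default_rng(1).random((400, 400)) < 0.05
        A = np.triu(A, 1); A = (A | A.T).astype(int)   # symmetric, no self-loops
        print(combined_density(A))                      # close to 0.05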

  8. Optimal Subsampling for Data Streams with Measurement Constrained...

    • datasetcatalog.nlm.nih.gov
    • tandf.figshare.com
    Updated Oct 31, 2024
    Cite
    Ma, Ping; Ye, Zhiqiang; Ai, Mingyao; Yu, Jun (2024). Optimal Subsampling for Data Streams with Measurement Constrained Categorical Responses [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001398503
    Explore at:
    Dataset updated
    Oct 31, 2024
    Authors
    Ma, Ping; Ye, Zhiqiang; Ai, Mingyao; Yu, Jun
    Description

    High-velocity, large-scale data streams have become pervasive. Frequently, the associated labels for such data prove costly to measure and are not always available upfront. Consequently, the analysis of such data poses a significant challenge. In this article, we develop a method that addresses this challenge by employing an online subsampling procedure and a multinomial logistic model for efficient analysis of high-velocity, large-scale data streams. Our algorithm is designed to sequentially update parameter estimation based on the A-optimality criterion. Moreover, it significantly increases computational efficiency while imposing minimal storage requirements. Theoretical properties are rigorously established to quantify the asymptotic behavior of the estimator. The method’s efficacy is further demonstrated through comprehensive numerical studies on both simulated and real-world datasets. Supplementary materials for this article are available online.
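
    A bare-bones version of measurement-constrained streaming subsampling looks as follows (the importance score and cap-at-one rule are generic stand-ins; the paper's A-optimality-based probabilities and sequential multinomial logistic updates are not reproduced):

        import numpy as np

        def stream_subsample(stream, score, budget_rate, seed=0):
            # One pass over the stream: keep each point with probability
            # min(1, budget_rate * score(x)); only kept points would have
            # their (expensive) labels measured. Inverse-probability
            # weights allow unbiased downstream estimation.
            rng = np.random.default_rng(seed)
            kept = []
            for x in stream:
                p = min(1.0, budget_rate * score(x))
                if rng.random() < p:
                    kept.append((x, 1.0 / p))
            return kept

        stream = (np.random.standard_normal(5) for _ in range(10_000))
        kept = stream_subsample(stream, lambda x: float(np.linalg.norm(x)), 0.05)
        print(len(kept))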

  9. Subsampling reveals that unbalanced sampling affects STRUCTURE results in a...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Jul 10, 2018
    Cite
    Patrick G. Meirmans (2018). Subsampling reveals that unbalanced sampling affects STRUCTURE results in a multi-species dataset [Dataset]. http://doi.org/10.5061/dryad.nh4366s
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 10, 2018
    Dataset provided by
    University of Amsterdam
    Authors
    Patrick G. Meirmans
    License

    https://spdx.org/licenses/CC0-1.0.html

    Area covered
    Europe, Carpathians, Alps
    Description

    Studying the genetic population structure of species can reveal important insights into several key evolutionary, historical, demographic, and anthropogenic processes. One of the most important statistical tools for inferring genetic clusters is the program STRUCTURE. Recently, several papers have pointed out that STRUCTURE may show a bias when the sampling design is unbalanced, resulting in spurious joining of underrepresented populations and spurious separation of overrepresented populations. Suggestions to overcome this bias include subsampling and changing the ancestry model, but the performance of these two methods has not yet been tested on actual data. Here, I use a dataset of twelve high-alpine plant species to test whether unbalanced sampling affects the STRUCTURE inference of population differentiation between the European Alps and the Carpathians. For four of the twelve species, subsampling of the Alpine populations –to match the sample size between the Alps and the Carpathians– resulted in a drastically different clustering than the full dataset. On the other hand, STRUCTURE results with the alternative ancestry model were indistinguishable from the results with the default model. Based on these results, the subsampling strategy seems a more viable approach to overcome the bias than the alternative ancestry model. However, subsampling is only possible when there is an a priori expectation of what constitute the main clusters. Though these results do not mean that the use of STRUCTURE should be discarded, it does indicate that users of the software should be cautious about the interpretation of the results when sampling is unbalanced.
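
    The subsampling fix itself is simple to express (a sketch assuming one individual per row with a population label column; the STRUCTURE run itself happens downstream on the balanced table):

        import numpy as np
        import pandas as pd

        def equalize_population_sizes(geno, pop_col="population", seed=0):
            # Randomly subsample every population down to the size of the
            # smallest one, so STRUCTURE sees balanced sampling.
            n_min = geno[pop_col].value_counts().min()
            return (geno.groupby(pop_col, group_keys=False)
                        .apply(lambda g: g.sample(n=n_min, random_state=seed)))

        df = pd.DataFrame({"population": ["Alps"] * 180 + ["Carpathians"] * 24,
                           "locus1": np.random.randint(0, 4, size=204)})
        print(equalize_population_sizes(df)["population"].value_counts())  # 24 each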

  10. A comparison of minimally-invasive sampling techniques for ZooMS analysis of...

    • search.dataone.org
    • borealisdata.ca
    Updated Dec 28, 2023
    + more versions
    Cite
    Evans, Zara; Paskulin, Lindsey; Rahemtulla, Farid; Speller, Camilla (2023). A comparison of minimally-invasive sampling techniques for ZooMS analysis of bone artifacts: MALDI-TOF mass spectra [Dataset]. http://doi.org/10.5683/SP3/FLPQF4
    Explore at:
    Dataset updated
    Dec 28, 2023
    Dataset provided by
    Borealis
    Authors
    Evans, Zara; Paskulin, Lindsey; Rahemtulla, Farid; Speller, Camilla
    Description

    Abstract: Bone and antler are important raw materials for tool manufacture in many cultures, past and present. The modification of osseous features which takes place during artifact manufacture frequently makes it difficult to identify either the bone element or the host animal, which can limit our understanding of the cultural, economic, and/or symbolic factors which influence raw material acquisition and use. While biomolecular approaches can provide taxonomic identifications of bone or antler artifacts, these methods are frequently destructive, raising concerns about invasive sampling of culturally-important artifacts or belongings. Collagen peptide mass fingerprinting (Zooarchaeology by Mass Spectrometry, or ZooMS) can provide robust taxonomic identifications of bone and antler artifacts. While the ZooMS method commonly involves destructive subsampling, minimally-invasive sampling techniques based on the triboelectric effect have also been proposed. In this paper, we compare three previously proposed minimally-invasive sampling methods (forced bag, eraser, and polishing film) on an assemblage of 15 bone artifacts from the pre-contact site EjTa-4, a large midden complex located on Calvert Island, British Columbia, Canada. We compare the results of the minimally-invasive methods to 10 fragmentary remains sampled using the conventional destructive ZooMS method. We assess the reliability and effectiveness of these methods by comparing MALDI-TOF spectral quality and the number of diagnostic and high molecular weight peaks, as well as the taxonomic resolution reached after identification. We find that coarse fiber-optic polishing films are the most effective of the minimally-invasive techniques compared in this study, and that the spectral quality produced by this minimally-invasive method was not significantly different from the conventional destructive method. Our results suggest that this minimally-invasive sampling technique for ZooMS can be successfully applied to culturally significant artifacts, providing taxonomic identifications comparable to the conventional, destructive ZooMS method.

    Methods: Fifteen bone artifacts were sampled for ZooMS using three different minimally-invasive techniques: forced bag, eraser, and coarse polishing film. Sampling was conducted following modified versions of the procedures outlined in Fiddyment et al. (2015) for the eraser method (E), in McGrath et al. (2019) for the forced bag method (B), and in Kirby et al. (2020) for the polishing disc method (P). Sampling and extraction blanks were included for all three methods. Additionally, 10 bone fragments from Calvert Island were analyzed using the destructive acid (0.6 M HCl) demineralisation method (Buckley et al., 2009; as modified in McGrath et al., 2019). Six bone objects were tested using two different types of ultra-fine polishing film: coarse (fiber optic polishing film disc with aluminum oxide grit size of 30µm) (P1) and fine (aluminum oxide grit size 6µm) (P2). Following the individual sampling procedures, all samples were gelatinized in AmBic at 65℃ for one hour; 50µL of the resulting supernatant was removed, and the remaining pellet was stored in the freezer. Samples were incubated overnight (12–18 hours) at 37℃ with 0.4 µg of trypsin. The trypsin was deactivated using 1 µL of 5% TFA solution. Collagen in the samples was purified and desalted using Pierce C18S Tips. Each purified sample was spotted in triplicate, along with calibration standards, onto a 384-spot Bruker MALDI ground steel target plate using 1 µL of sample and 1 µL of α-cyano-hydroxycinnamic acid matrix. The samples were run on a Bruker ultraflex III MALDI TOF/TOF mass spectrometer with a Nd:YAG smart beam laser at the University of York in York, UK, and a SNAP averaging algorithm was used to obtain monoisotopic masses (C 4.9384, N 1.3577, O 1.4773, S 0.0417, H 7.7583). Raw spectral data has been uploaded here.

  11. Data from: When less is more and more is less: the impact of sampling effort...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Apr 19, 2022
    Cite
    Pauline Guenser; Samuel Ginot; Gilles Escarguel; Nicolas Goudemand (2022). When less is more and more is less: the impact of sampling effort on species delineation [Dataset]. http://doi.org/10.5061/dryad.905qfttkq
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 19, 2022
    Dataset provided by
    Université Claude Bernard Lyon 1
    École Normale Supérieure de Lyon
    Authors
    Pauline Guenser; Samuel Ginot; Gilles Escarguel; Nicolas Goudemand
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Taxonomy is the very first step of most biodiversity studies, but how confident can we be in the taxonomic-systematic exercise? One may hypothesise that the more material, the better the taxonomic delineation, because the more accurate the description of morphological variability. As rarefaction curves assess the degree of knowledge on taxonomic diversity through sampling effort, we aim to test the impact of sampling effort on species delineation by subsampling a given assemblage. To do so, we use an abundant and morphologically diverse conodont fossil record. Within the assemblage, we first recognize four well-established morphospecies, but about 80% of the specimens share diagnostic characters of these morphospecies. We quantify these diagnostic characters on the sample using geometric morphometrics, and assess the number of morphometric groups, i.e. morphospecies, using ordination and cluster analyses. Then we gradually subsample the assemblage in two ways (randomly and by mimicking taxonomist work) and redo the ‘ordination + clustering’ protocol to appraise how the number of clusters evolves with sampling effort. We observe the number of delineated morphospecies decreasing as the number of specimens increases, whatever the subsampling method, resulting mostly in fewer morphospecies than expected. Such a rather counter-intuitive influence of sampling effort on species delineation highlights the complexity of taxonomical work. This indicates that new morphotaxa should not be erected based on small samples, and encourages researchers to extensively illustrate, measure, and quantitatively compare their material to better constrain the morphological variability of a clade, and so to better characterize and delineate morphospecies.

    Methods: Please refer to the Material and Methods part of the publication Guenser et al. "When less is more and more is less: the impact of sampling effort on species delineation"
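
    The 'subsample, ordinate, cluster, count' loop looks roughly like the scikit-learn-based sketch below (silhouette-selected k on PCA scores; the paper's geometric-morphometric variables and specific clustering choices are not reproduced):

        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.decomposition import PCA
        from sklearn.metrics import silhouette_score

        def clusters_at_sample_size(X, n, k_max=6, seed=0):
            # Subsample n specimens, ordinate with PCA, then pick the
            # number of clusters ('morphospecies') by silhouette score.
            rng = np.random.default_rng(seed)
            sub = X[rng.choice(len(X), size=n, replace=False)]
            scores = PCA(n_components=2).fit_transform(sub)
            sil = {k: silhouette_score(
                       scores,
                       KMeans(n_clusters=k, n_init=10,
                              random_state=seed).fit_predict(scores))
                   for k in range(2, k_max + 1)}
            return max(sil, key=sil.get)

        X = np.vstack([np.random.standard_normal((150, 6)) + m for m in (0, 4, 8)])
        for n in (30, 120, 400):
            print(n, clusters_at_sample_size(X, n))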

  12. Data from: Feature Screening for Massive Data Analysis by Subsampling

    • tandf.figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Xuening Zhu; Rui Pan; Shuyuan Wu; Hansheng Wang (2023). Feature Screening for Massive Data Analysis by Subsampling [Dataset]. http://doi.org/10.6084/m9.figshare.17091712.v1
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Xuening Zhu; Rui Pan; Shuyuan Wu; Hansheng Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Modern statistical analysis often encounters massive datasets with ultrahigh-dimensional features. In this work, we develop a subsampling approach for feature screening with massive datasets. The approach is implemented by repeated subsampling of massive data and can be used for analysis tasks under memory constraints. To conduct the procedure, we first calculate an R-squared screening measure (and related sample moments) based on subsamples. Second, we consider three methods to combine the local statistics. In addition to the simple average method, we design a jackknife debiased screening measure and an aggregated moment screening measure. Both approaches reduce the bias of the subsampling screening measure and therefore increase the accuracy of the feature screening. Last, we consider a novel sequential sampling method that is more computationally efficient than the traditional random sampling method. The theoretical properties of the three screening measures under both sampling schemes are rigorously discussed. Finally, we illustrate the usefulness of the proposed method with an airline dataset containing 32.7 million records.
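
    The simple-average combiner is the easiest of the three to sketch (the jackknife-debiased and aggregated-moment variants are the paper's contribution and are omitted; the marginal R-squared here is just the squared Pearson correlation):

        import numpy as np

        def subsampled_screening(X, y, n_sub=1_000, n_rep=20, seed=0):
            # Average a marginal R-squared screening measure over
            # repeated random subsamples (the 'simple average' combiner).
            rng = np.random.default_rng(seed)
            stats = np.zeros(X.shape[1])
            for _ in range(n_rep):
                idx = rng.choice(len(X), size=n_sub, replace=False)
                stats += np.array([np.corrcoef(X[idx, j], y[idx])[0, 1] ** 2
                                   for j in range(X.shape[1])])
            return stats / n_rep

        X = np.random.standard_normal((100_000, 50))
        y = 2 * X[:, 3] + np.random.standard_normal(100_000)
        print(np.argsort(subsampled_screening(X, y))[-3:])  # feature 3 ranks top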

  13. Data from: LowCon: A Design-based Subsampling Approach in a Misspecified...

    • tandf.figshare.com
    zip
    Updated Feb 8, 2024
    Cite
    Cheng Meng; Rui Xie; Abhyuday Mandal; Xinlian Zhang; Wenxuan Zhong; Ping Ma (2024). LowCon: A Design-based Subsampling Approach in a Misspecified Linear Model [Dataset]. http://doi.org/10.6084/m9.figshare.13180178.v3
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 8, 2024
    Dataset provided by
    Taylor & Francishttps://taylorandfrancis.com/
    Authors
    Cheng Meng; Rui Xie; Abhyuday Mandal; Xinlian Zhang; Wenxuan Zhong; Ping Ma
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We consider a measurement-constrained supervised learning problem, in which (i) the full sample of predictors is given, and (ii) the response observations are unavailable and expensive to measure. Thus, it is ideal to select a subsample of predictor observations, measure the corresponding responses, and then fit the supervised learning model on the subsample of the predictors and responses. However, model fitting is a trial-and-error process, and a postulated model for the data could be misspecified. Our empirical studies demonstrate that most of the existing subsampling methods have unsatisfactory performance when the models are misspecified. In this paper, we develop a novel subsampling method, called “LowCon,” which outperforms the competing methods when the working linear model is misspecified. Our method uses orthogonal Latin hypercube designs to achieve a robust estimation. We show that the proposed design-based estimator approximately minimizes the so-called worst-case bias with respect to many possible misspecification terms. Both the simulated and real-data analyses demonstrate that the proposed estimator is more robust than several subsample least-squares estimators obtained by state-of-the-art subsampling methods. Supplementary materials for this article are available online.
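
    The design-based selection step can be approximated with SciPy's quasi-Monte Carlo tools (a sketch only: a plain Latin hypercube design is used where LowCon specifies an orthogonal one, and the nearest data row to each design point is selected):

        import numpy as np
        from scipy.spatial import cKDTree
        from scipy.stats import qmc

        def lowcon_style_subsample(X, n, seed=0):
            # Scale predictors to [0, 1]^d, generate a space-filling Latin
            # hypercube design, and pick the observed row nearest to each
            # design point (duplicates removed).
            lo, hi = X.min(axis=0), X.max(axis=0)
            X01 = (X - lo) / (hi - lo)
            design = qmc.LatinHypercube(d=X.shape[1], seed=seed).random(n)
            _, idx = cKDTree(X01).query(design)
            return np.unique(idx)

        X = np.random.standard_normal((50_000, 4))
        print(len(lowcon_style_subsample(X, n=100)))  # up to 100 selected rows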

  14. Data from: The program STRUCTURE does not reliably recover the correct...

    • datadryad.org
    • data.niaid.nih.gov
    • +1 more
    zip
    Updated Jan 19, 2016
    Cite
    Sébastien J. Puechmaille (2016). The program STRUCTURE does not reliably recover the correct population structure when sampling is uneven: sub-sampling and new estimators alleviate the problem [Dataset]. http://doi.org/10.5061/dryad.2d4m9
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 19, 2016
    Dataset provided by
    Dryad
    Authors
    Sébastien J. Puechmaille
    Time period covered
    Dec 13, 2015
    Area covered
    Africa, Asia, America, Europe, Oceania
    Description

    Inferences of population structure and more precisely the identification of genetically homogeneous groups of individuals are essential to the fields of ecology, evolutionary biology, and conservation biology. Such population structure inferences are routinely investigated via the program STRUCTURE implementing a Bayesian algorithm to identify groups of individuals at Hardy-Weinberg and linkage equilibrium. While the method performs relatively well under various population models with even sampling between subpopulations, the robustness of the method to uneven sample size between subpopulations and/or hierarchical levels of population structure has not yet been tested despite being commonly encountered in empirical datasets. In this study, I used simulated and empirical microsatellite datasets to investigate the impact of uneven sample size between subpopulations and/or hierarchical levels of population structure on the detected population structure. The results demonstrated that u...

  15. Subsampling used for assembly of amblyceran mitogenomes and methods used for...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated May 3, 2024
    Cite
    Buček, Aleš; Doña, Jorge; Johnson, Kevin P.; Najer, Tomáš; Sychra, Oldřich; Sweet, Andrew D. (2024). Subsampling used for assembly of amblyceran mitogenomes and methods used for chromosome circularization. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001440747
    Explore at:
    Dataset updated
    May 3, 2024
    Authors
    Buček, Aleš; Doña, Jorge; Johnson, Kevin P.; Najer, Tomáš; Sychra, Oldřich; Sweet, Andrew D.
    Description

    Each fragment has a separate line; grey fragments were not considered circular. (XLSX)

  16. Data from: Phylogenomic subsampling and the search for phylogenetically...

    • datadryad.org
    • data.niaid.nih.gov
    • +1 more
    zip
    Updated Jun 9, 2021
    Cite
    Nicolás Mongiardino Koch (2021). Phylogenomic subsampling and the search for phylogenetically reliable loci [Dataset]. http://doi.org/10.5061/dryad.sj3tx9646
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 9, 2021
    Dataset provided by
    Dryad
    Authors
    Nicolás Mongiardino Koch
    Time period covered
    May 28, 2021
    Description

    Phylogenomic subsampling is a procedure by which small sets of loci are selected from large genome-scale datasets and used for phylogenetic inference. This step is often motivated by either computational limitations associated with the use of complex inference methods, or as a means of testing the robustness of phylogenetic results by discarding loci that are deemed potentially misleading. Although many alternative methods of phylogenomic subsampling have been proposed, little effort has gone into comparing their behavior across different datasets. Here, I calculate multiple gene properties for a range of phylogenomic datasets spanning animal, fungal and plant clades, uncovering a remarkable predictability in their patterns of covariance. I also show how these patterns provide a means for ordering loci by both their rate of evolution and their relative phylogenetic usefulness. This method of retrieving phylogenetically useful loci is found to be among the top performing when compared to...

  17. Data from: Comparison and evaluation of sampling and eDNA metabarcoding...

    • repository.soilwise-he.eu
    • data-staging.niaid.nih.gov
    Updated Aug 29, 2025
    Cite
    (2025). Data from: Comparison and evaluation of sampling and eDNA metabarcoding protocols to assess soil biodiversity in Belgian LUCAS Biopoints [Dataset]. http://doi.org/10.5281/zenodo.14845589
    Explore at:
    Dataset updated
    Aug 29, 2025
    Description

    Environmental DNA (eDNA) metabarcoding is emerging as a novel tool for monitoring soil biodiversity. Soil biodiversity, critical for soil health and ecosystem services, is currently under-monitored due to the lack of standardized and efficient methods. We assessed whether refinements to sampling and molecular protocols could improve soil biodiversity detection and monitoring. Comparing the 2018 LUCAS soil biodiversity protocols with newly developed national methods, we tested sampling topsoil (0-10 cm) versus deeper layers, larger soil sample sizes for DNA-extraction, taking more subsamples for composite soil samples, and alternative primer sets across 9 Belgian Biopoints included in the LUCAS 2022 survey. The results suggest that significantly more species can be detected in upper soil layers, including the forest floor, while the diversity of taxa and eDNA in the 10–30 cm soil layer is insufficient for annelids and arthropods to serve as indicators of ecological change. Additionally, comparison of the universal eukaryotic primers (18S) with primer sets tailored to soil mesofauna and macrofauna showed that universal 18S primers provide limited resolution for Collembola and Annelida. Overall, the analyses suggest that vertical soil stratification (with two sampling depths) has a greater influence on the captured diversity of soil mesofauna and macrofauna than the number of subsamples, and that the highest diversity is recovered when surface sampling (0–10 cm topsoil and forest floor) is combined with a greater number of subsamples and a larger sampled area. With refinement and standardization, eDNA metabarcoding, combined with optimized sampling protocols, could become a powerful and efficient tool for monitoring soil biodiversity in European soils.

    Description of the files

    This dataset includes interactive Krona taxonomy charts to visually summarize the diversity and relative read abundance of detected taxa across sampling locations and protocols. Each ring in the chart represents a taxonomic level, with the relative width of segments reflecting the proportion of reads assigned to specific taxa at that level. These charts enable exploration of taxonomic composition and allow for comparisons between the different sampled locations, sampling protocols tested, and primer sets tested. All Krona charts were made in R using psadd::plot_krona. To correct for uneven sequencing depth per sample, datasets were rarefied using a random subsampling method to 27,913, 31,655, 1,856, 19,728, and 19,632 reads for Annelida (Olig01), Collembola (Coll01), Fungi (ITS9mun/ITS4ngsUni), protists (18S), and Archaea (SSU1ArF/SSU1000ArR), respectively. Fauna datasets that are subsets of the total data recovered by a primer set designed to target many different phyla (e.g. 18S) were not rarefied prior to generating the Krona plots.

    • ejp_soil_annelida_olig01_27913.html: interactive taxonomy charts for Annelida, generated using the group-specific Olig01 primer set and rarefied to 27,913 reads per sample.
    • ejp_soil_collembola_coll01_31655.html: interactive taxonomy charts for Collembola, generated using the group-specific Coll01 primer set and rarefied to 31,655 reads per sample.
    • ejp_soil_arthropoda_inse01.html: interactive taxonomy charts for Arthropoda (Insecta, Arachnida, Chilopoda, Diplura, and Malacostraca), generated using the Inse01 primer set.
    • ejp_soil_fungi_its9mun_its4ngsuni_1856.html: interactive taxonomy charts for Fungi, generated using the ITS9mun and ITS4ngsUni primer set and rarefied to 1,856 reads per sample.
    • ejp_soil_protists_18s_19728.html: interactive taxonomy charts for protists, generated using the eukaryotic 18S primer set and rarefied to 19,728 reads per sample.
    • ejp_soil_archaea_ssu1arf_ssu1000arr_19632.html: interactive taxonomy charts for Archaea, generated using the SSU1ArF and SSU1000ArR primer set and rarefied to 19,632 reads per sample.
    • ejp_soil_annelida_18s.html: interactive taxonomy charts for Annelida, generated using the eukaryotic 18S primer set.
    • ejp_soil_collembola_18s.html: interactive taxonomy charts for Collembola, generated using the eukaryotic 18S primer set.
    • ejp_soil_arthropoda_18s.html: interactive taxonomy charts for Arthropoda, generated using the eukaryotic 18S primer set.
    • ejp_soil_metadata.csv: metadata for the samples in this study, including sampling locations, sampling protocols used, sampling depth (cm), land use type, EUNIS habitat classification, and the LUCAS-ID for each sample.
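
    Rarefaction by random subsampling, as applied before generating the Krona charts, amounts to a multivariate hypergeometric draw per sample; a sketch with toy counts (NumPy implements the draw directly):

        import numpy as np

        def rarefy(counts, depth, seed=0):
            # Draw `depth` reads without replacement from a sample's
            # per-taxon read counts (a multivariate hypergeometric draw)
            # to equalize sequencing depth across samples.
            counts = np.asarray(counts)
            if counts.sum() < depth:
                raise ValueError("fewer reads than the target depth")
            rng = np.random.default_rng(seed)
            return rng.multivariate_hypergeometric(counts, depth)

        reads = [12_000, 9_500, 5_000, 1_413]      # toy per-taxon read counts
        rarefied = rarefy(reads, depth=19_632)
        print(rarefied, rarefied.sum())            # counts summing to 19,632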

  18. Data on “Which particles to select, and if yes, how many? Subsampling...

    • data.europa.eu
    zip
    Cite
    Universitätsbibliothek der Technischen Universität München, Data on “Which particles to select, and if yes, how many? Subsampling methods for Raman microspectroscopic analysis of very small microplastic.” [Dataset]. https://data.europa.eu/data/datasets/https-open-bydata-de-api-hub-repo-datasets-https-mediatum-ub-tum-de-1596628-dataset?locale=bg
    Explore at:
    Available download formats: zip
    Dataset authored and provided by
    Universitätsbibliothek der Technischen Universität München
    License

    http://dcat-ap.de/def/licenses/cc-by

    Description

    Simulated samples for microplastic analysis by Raman microspectroscopy, used in the associated publication to evaluate subsample selections.

  19. Citation Trends for "Interframe Picturephone® Coding Using Unconditional...

    • shibatadb.com
    Updated Sep 29, 2025
    Cite
    Yubetsu (2025). Citation Trends for "Interframe Picturephone® Coding Using Unconditional Vertical and Temporal Subsampling Techniques" [Dataset]. https://www.shibatadb.com/article/VFA5ahzK
    Explore at:
    Dataset updated
    Sep 29, 2025
    Dataset authored and provided by
    Yubetsu
    License

    https://www.shibatadb.com/license/data/proprietary/v1.0/license.txt

    Time period covered
    1986
    Variables measured
    New Citations per Year
    Description

    Yearly citation counts for the publication titled "Interframe Picturephone® Coding Using Unconditional Vertical and Temporal Subsampling Techniques".

  20. leap-val-data-f64

    • kaggle.com
    zip
    Updated Jun 13, 2024
    Cite
    Bilzard (2024). leap-val-data-f64 [Dataset]. https://www.kaggle.com/tatamikenn/leap-val-data-f64
    Explore at:
    Available download formats: zip (8800477931 bytes)
    Dataset updated
    Jun 13, 2024
    Authors
    Bilzard
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Validation Data for Stacking - 8th Year Validation Set (1/6 Subsample)

    The sample_id is created sequentially from the 1st year (hence, it is different from the sample_id in the Kaggle Dataset). Note that while the original data follows a naming convention with 'train_...', this dataset simply uses integer IDs.

    The data is divided into 12 chunks, covering the period from the 8th year's February to the 9th year's January.

    Period

    • 0008-02 to 0009-01

    Sub-sampling Method

    1. Sub-sample to 1/6 of all samples, ignoring leap day (2/29) in leap years (offset=0).

    Source code:

    # Usage (script name illustrative): python subsample.py SUBSAMPLE_RATE OFFSET
    # This dataset was produced with subsample_rate=6, offset=0 (see above).
    from pathlib import Path

    import click
    import pandas as pd
    import polars as pl


    @click.command()
    @click.argument("subsample-rate", type=int)
    @click.argument("offset", type=int)
    def main(subsample_rate, offset):
      NUM_YEARS = 8
      MONTH_DAY = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
      NUM_SAMPLES = (NUM_YEARS * sum(MONTH_DAY) * 72) // subsample_rate
      assert (
        0 <= offset < subsample_rate
      ), f"assertion failed: 0 <= offset < subsample_rate, got {offset} and {subsample_rate}."

      idx = 0
      file_id = 0
      data = []
      try:
        for year in range(NUM_YEARS):
          # The month loop is shifted by one: month=1 indexes MONTH_DAY[1]
          # (February), so each year of data runs February..January.
          for month in range(1, 13):
            for day in range(MONTH_DAY[month % 12]):
              for term in range(72):  # 72 timesteps per day
                if file_id == NUM_SAMPLES:
                  raise Exception  # quota reached; break out of all loops
                if idx % subsample_rate == offset:
                  data.append(
                    dict(
                      sample_id=file_id,
                      year=year + 1,
                      real_year=year + (month // 12) + 1,  # January belongs to the next year
                      month=month % 12 + 1,
                      day=day + 1,
                      min_of_day=term * 1200,
                    )
                  )
                  file_id += 1
                idx += 1
      except Exception:
        pass  # normal early exit once NUM_SAMPLES rows are collected

      output_path = Path(
        "/ml-docker/working/kaggle-leap-private/data/hugging_face_download"
      )
      if not output_path.exists():
        output_path.mkdir()

      df = pl.from_pandas(pd.DataFrame(data))
      print(df.filter(pl.col("year").eq(8)))
      df.write_parquet(output_path / f"subsample_s{subsample_rate}_o{offset}.pqt")


    if __name__ == "__main__":
      main()
    

    Positioning of This Data

    Low-Resolution Real Geography
    
      11.5° x 11.5° horizontal resolution (384 grid columns)
      100 million total samples (744 GB)
      1.9 MB per input file, 1.1 MB per output file
    
    1. Low resolution (774 GB) -> I refer to this as the 1/1 full set.
    2. Kaggle Dataset -> A 1/7 subsample of 1, containing data from 1st to 7th year (excluding the 8th year).
    3. leap-val-data-f64 (this dataset) -> A 1/6 subsample of 1, containing only the 8th year data.

    Therefore, it is appropriate to evaluate a model trained on the Kaggle Dataset with this dataset.
