100+ datasets found
  1. Data and scripts from: High-dimensional percolation criticality and hints of mean-field-like caging of the random Lorentz gas

    • research.repository.duke.edu
    Updated Jul 2, 2021
    Cite
    Yang, Zhen; Charbonneau, Patrick; Charbonneau, Benoit; Hu, Yi (2021). Data and scripts from: High-dimensional percolation criticality and hints of mean-field-like caging of the random Lorentz gas [Dataset]. http://doi.org/10.7924/r4s46r07b
    Dataset updated
    Jul 2, 2021
    Dataset provided by
    Duke Research Data Repository
    Authors
    Yang, Zhen; Charbonneau, Patrick; Charbonneau, Benoit; Hu, Yi
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Dataset funded by
    Simons Foundation
    Description

    The random Lorentz gas (RLG) is a minimal model for transport in disordered media. Despite the broad relevance of the model, theoretical grasp over its properties remains weak. For instance, the scaling with dimension $d$ of its localization transition at the void percolation threshold is not well controlled analytically nor computationally. A recent study [Biroli et al. Phys. Rev. E 103, L030104 (2021)] of the caging behavior of the RLG motivated by the mean-field theory of glasses has uncovered physical inconsistencies in that scaling that heighten the need for guidance. Here, we first extend analytical expectations for asymptotic high-d bounds on the void percolation threshold, and then computationally evaluate both the threshold and its criticality in various d. In high-d systems, we observe that the standard percolation physics is complemented by a dynamical slowdown of the tracer dynamics reminiscent of mean-field caging. A simple modification of the RLG is found to bring the interplay between percolation and mean-field-like caging down to d=3. ...

  2. Data from: Skeleton Clustering: Dimension-Free Density-Aided Clustering

    • tandf.figshare.com
    pdf
    Updated Jun 4, 2023
    Cite
    Zeyu Wei; Yen-Chi Chen (2023). Skeleton Clustering: Dimension-Free Density-Aided Clustering [Dataset]. http://doi.org/10.6084/m9.figshare.21976961.v1
    Available download formats: pdf
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Zeyu Wei; Yen-Chi Chen
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We introduce a density-aided clustering method called Skeleton Clustering that can detect clusters in multivariate and even high-dimensional data with irregular shapes. To bypass the curse of dimensionality, we propose surrogate density measures that are less dependent on the dimension but have intuitive geometric interpretations. The clustering framework constructs a concise representation of the given data as an intermediate step and can be thought of as a combination of prototype methods, density-based clustering, and hierarchical clustering. We show by theoretical analysis and empirical studies that the skeleton clustering leads to reliable clusters in multivariate and high-dimensional scenarios. Supplementary materials for this article are available online.
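
    For orientation only, the sketch below mimics the general workflow the description outlines: over-segment the data into prototype "knots", merge the knots hierarchically, and propagate the knot labels back to the observations. The surrogate density measures the paper uses to weight knot pairs are replaced here by plain Euclidean distances between knots, so this illustrates the shape of the framework, not the authors' method.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs

# Toy data; the intended use case is multivariate / high-dimensional data.
X, _ = make_blobs(n_samples=1000, centers=5, n_features=10, random_state=0)

# Step 1 (prototype step): over-segment the data into many "knots".
n_knots = 50
km = KMeans(n_clusters=n_knots, n_init=10, random_state=0).fit(X)
knots = km.cluster_centers_

# Step 2 (hierarchical step): merge the knots into the final clusters.
# The paper weights knot pairs with surrogate density measures; plain
# Euclidean distance between knots is used here only for illustration.
agg = AgglomerativeClustering(n_clusters=5, linkage="single").fit(knots)

# Step 3: propagate knot labels back to the original observations.
labels = agg.labels_[km.labels_]
print(np.bincount(labels))
```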

  3. Dataset for: Some Remarks on the R2 for Clustering

    • wiley.figshare.com
    txt
    Updated Jun 1, 2023
    Cite
    Nicola Loperfido; Thaddeus Tarpey (2023). Dataset for: Some Remarks on the R2 for Clustering [Dataset]. http://doi.org/10.6084/m9.figshare.6124508.v1
    Available download formats: txt
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Wiley
    Authors
    Nicola Loperfido; Thaddeus Tarpey
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    A common descriptive statistic in cluster analysis is the $R^2$ that measures the overall proportion of variance explained by the cluster means. This note highlights properties of the $R^2$ for clustering. In particular, we show that generally the $R^2$ can be artificially inflated by linearly transforming the data by "stretching" and by projecting. Also, the $R^2$ for clustering will often be a poor measure of clustering quality in high-dimensional settings. We also investigate the $R^2$ for clustering for misspecified models. Several simulation illustrations are provided highlighting weaknesses in the clustering $R^2$, especially in high-dimensional settings. A functional data example is given showing how the $R^2$ for clustering can vary dramatically depending on how the curves are estimated.
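
    As a concrete illustration of the statistic discussed above, the sketch below computes a clustering $R^2$ as the between-cluster sum of squares divided by the total sum of squares, and shows that rescaling one coordinate changes the value even though the cluster assignment is unchanged. It is a toy example on synthetic data, not the paper's analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def clustering_r2(X, labels):
    """Proportion of total variance explained by the cluster means."""
    grand_mean = X.mean(axis=0)
    ss_total = ((X - grand_mean) ** 2).sum()
    ss_between = sum(
        (labels == k).sum() * ((X[labels == k].mean(axis=0) - grand_mean) ** 2).sum()
        for k in np.unique(labels)
    )
    return ss_between / ss_total

X, _ = make_blobs(n_samples=500, centers=3, n_features=2, random_state=1)
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)
print(round(clustering_r2(X, labels), 3))

# "Stretching" one coordinate changes the R^2 even though the cluster
# assignment is identical, illustrating the inflation issue noted above.
X_stretched = X * np.array([10.0, 1.0])
print(round(clustering_r2(X_stretched, labels), 3))
```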

  4. Data from: Testing Alphas in Conditional Time-Varying Factor Models with High Dimensional Assets

    • figshare.com
    txt
    Updated Jun 4, 2023
    Cite
    Shujie Ma; Wei Lan; Liangjun Su; Chih-Ling Tsai (2023). Testing Alphas in Conditional Time-Varying Factor Models with High Dimensional Assets [Dataset]. http://doi.org/10.6084/m9.figshare.6453074.v1
    Available download formats: txt
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Shujie Ma; Wei Lan; Liangjun Su; Chih-Ling Tsai
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    For conditional time-varying factor models with high dimensional assets, this article proposes a high dimensional alpha (HDA) test to assess whether there exist abnormal returns on securities (or portfolios) over the theoretical expected returns. To employ this test effectively, a constant coefficient test is also introduced. It examines the validity of constant alphas and factor loadings. Simulation studies and an empirical example are presented to illustrate the finite sample performance and the usefulness of the proposed tests. Using the HDA test, the empirical example demonstrates that the FF three-factor model (Fama and French, 1993) is better than CAPM (Sharpe, 1964) in explaining the mean-variance efficiency of both the Chinese and US stock markets. Furthermore, our results suggest that the US stock market is more efficient in terms of mean-variance efficiency than the Chinese stock market.

  5. Data from: Machine Learning Methods for High-Dimensional and Multimodal Single-Cell Data

    • curate.nd.edu
    pdf
    Updated Jun 9, 2025
    Cite
    Ouyang Zhu (2025). Machine Learning Methods for High-Dimensional and Multimodal Single-Cell Data [Dataset]. http://doi.org/10.7274/29191802.v1
    Available download formats: pdf
    Dataset updated
    Jun 9, 2025
    Dataset provided by
    University of Notre Dame
    Authors
    Ouyang Zhu
    License

    https://www.law.cornell.edu/uscode/text/17/106

    Description

    Recent advances in single-cell and multi-omics technologies have enabled high-resolution profiling of cellular states, but also introduced new computational challenges. This dissertation presents machine learning methods to improve data quality and extract insights from high-dimensional, multimodal single-cell datasets.

    First, we propose Decaf K-means, a clustering algorithm that accounts for cluster-specific confounding effects, such as batch variation, directly during clustering. This approach improves clustering accuracy in both synthetic and real data.

    Second, we develop scPDA, a denoising method for droplet-based single-cell protein data that eliminates the need for empty droplets or null controls. scPDA models protein-protein relationships to enhance denoising accuracy and significantly improves cell-type identification.

    Third, we introduce Scouter, a model that predicts transcriptional outcomes of unseen gene perturbations. Scouter combines neural networks with large language models to generalize across perturbations, reducing prediction error by over 50% compared to existing methods.

    Finally, we extend this to TranScouter, which predicts transcriptional responses under new biological conditions without direct perturbation data. Using a tailored encoder-decoder architecture, TranScouter achieves accurate cross-condition predictions, paving the way for more generalizable models in perturbation biology.

  6. Data from: A change-point–based control chart for detecting sparse mean changes in high-dimensional heteroscedastic data

    • tandf.figshare.com
    txt
    Updated Jan 17, 2024
    Cite
    Zezhong Wang; Inez Maria Zwetsloot (2024). A change-point–based control chart for detecting sparse mean changes in high-dimensional heteroscedastic data [Dataset]. http://doi.org/10.6084/m9.figshare.24441804.v1
    Available download formats: txt
    Dataset updated
    Jan 17, 2024
    Dataset provided by
    Taylor & Francis
    Authors
    Zezhong Wang; Inez Maria Zwetsloot
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Because of the “curse of dimensionality,” high-dimensional processes present challenges to traditional multivariate statistical process monitoring (SPM) techniques. In addition, the unknown underlying distribution of and complicated dependency among variables such as heteroscedasticity increase the uncertainty of estimated parameters and decrease the effectiveness of control charts. Moreover, the requirement of sufficient reference samples limits the application of traditional charts in high-dimension, low-sample-size scenarios (small n, large p). More difficulties appear when detecting and diagnosing abnormal behaviors caused by a small set of variables (i.e., sparse changes). In this article, we propose two change-point–based control charts to detect sparse shifts in the mean vector of high-dimensional heteroscedastic processes. Our proposed methods can start monitoring when the number of observations is a lot smaller than the dimensionality. The simulation results show that the proposed methods are robust to nonnormality and heteroscedasticity. Two real data examples are used to illustrate the effectiveness of the proposed control charts in high-dimensional applications. The R codes are provided online.

  7. Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets

    • data.mendeley.com
    Updated Nov 14, 2018
    Cite
    Scott Herford (2018). Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets. [Dataset]. http://doi.org/10.17632/6cm9wyd5g5.1
    Dataset updated
    Nov 14, 2018
    Authors
    Scott Herford
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    North Carolina
    Description

    The purpose of data mining analysis is always to find patterns in the data using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset: before doing any work on the data, it has to be pre-processed, and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and to create new features. Based on our project, after using clustering prior to classification, the performance did not improve much. One reason may be that the features we selected for clustering are not well suited for it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics.

    From the dimensionality reduction perspective: clustering is different from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters to reduce the data dimension loses a lot of information, since clustering techniques rely on a metric of 'distance', and at high dimensions Euclidean distance loses pretty much all meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always a good idea, since you may lose almost all of the information.

    From the perspective of creating new features: clustering analysis creates labels based on patterns in the data, which brings uncertainty into the data. When clustering is used prior to classification, the decision on the number of clusters strongly affects the performance of the clustering, and in turn the performance of the classification. If the subset of features we apply clustering to is well suited for it, it might increase the overall classification performance. For example, if the features we run k-means on are numerical and the dimension is small, the overall classification performance may be better. We did not lock in the clustering outputs with a random_state, in an effort to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, the data may simply not cluster well with the selected methods at all. Basically, the ramification we saw was that our results are not much better than random when applying clustering in the data preprocessing.

    Finally, it is important to ensure a feedback loop is in place to continuously collect the same data, in the same format, from which the models were created. This feedback loop can be used to measure the models' real-world effectiveness and to continue revising the models from time to time as things change.
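
    To make the pipeline discussed above concrete, here is a minimal sketch of clustering-before-classification: a k-means cluster label is appended as an engineered feature and the cross-validated accuracy is compared with a baseline. The synthetic data and the random-forest classifier are placeholders, not the project's actual North Carolina school data or models.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for the pre-processed school dataset.
X, y = make_classification(n_samples=2000, n_features=30, n_informative=10,
                           random_state=0)

# Baseline: classify on the original features only.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
baseline = cross_val_score(clf, X, y, cv=5).mean()

# Variant: add a k-means cluster label as an extra engineered feature.
# As noted in the description, with no random_state the cluster labels
# (and therefore the downstream scores) can vary from run to run.
cluster_ids = KMeans(n_clusters=8, n_init=10).fit_predict(X)
X_aug = np.column_stack([X, cluster_ids])
augmented = cross_val_score(clf, X_aug, y, cv=5).mean()

print(f"baseline accuracy:  {baseline:.3f}")
print(f"with cluster label: {augmented:.3f}")
```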

  8. Supplement to "Investigation of methods to enhance the power of high-dimensional tests"

    • scidb.cn
    Updated Oct 17, 2024
    Cite
    Jiujing Wu; Wenwen Guo (2024). Supplement to "Investigation of methods to enhance the power of high-dimensional tests" [Dataset]. http://doi.org/10.57760/sciencedb.j00206.00036
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Oct 17, 2024
    Dataset provided by
    Science Data Bank
    Authors
    Jiujing Wu; Wenwen Guo
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    These are simulation results for methods that improve the power of high-dimensional tests (such as mean tests, linear model tests, and independence tests), including images and tables.

  9. Data from: Multivariate phylogenetic comparative methods: evaluations, comparisons, and recommendations

    • zenodo.org
    • data.niaid.nih.gov
    • +3 more
    bin, zip
    Updated May 31, 2022
    Cite
    Dean C. Adams; Michael L. Collyer; Dean C. Adams; Michael L. Collyer (2022). Data from: Multivariate phylogenetic comparative methods: evaluations, comparisons, and recommendations [Dataset]. http://doi.org/10.5061/dryad.29722
    Available download formats: bin, zip
    Dataset updated
    May 31, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Dean C. Adams; Michael L. Collyer
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Recent years have seen increased interest in phylogenetic comparative analyses of multivariate datasets, but to date the varied proposed approaches have not been extensively examined. Here we review the mathematical properties required of any multivariate method, and specifically evaluate existing multivariate phylogenetic comparative methods in this context. Phylogenetic comparative methods based on the full multivariate likelihood are robust to levels of covariation among trait dimensions and are insensitive to the orientation of the dataset, but display increasing model misspecification as the number of trait dimensions increases. This is because the expected evolutionary covariance matrix (V) used in the likelihood calculations becomes more ill-conditioned as trait dimensionality increases, and as evolutionary models become more complex. Thus, these approaches are only appropriate for datasets with few traits and many species. Methods that summarize patterns across trait dimensions treated separately (e.g., SURFACE) incorrectly assume independence among trait dimensions, resulting in nearly a 100% model misspecification rate. Methods using pairwise composite likelihood are highly sensitive to levels of trait covariation, the orientation of the dataset, and the number of trait dimensions. The consequence of these debilitating deficiencies is that a user can arrive at differing statistical conclusions, and therefore biological inferences, simply from a dataspace rotation, like principal component analysis. By contrast, algebraic generalizations of the standard phylogenetic comparative toolkit that use the trace of covariance matrices are insensitive to levels of trait covariation, the number of trait dimensions, and the orientation of the dataset. Further, when appropriate permutation tests are used, these approaches display acceptable Type I error and statistical power. We conclude that methods summarizing information across trait dimensions, as well as pairwise composite likelihood methods should be avoided, while algebraic generalizations of the phylogenetic comparative toolkit provide a useful means of assessing macroevolutionary patterns in multivariate data. Finally, we discuss areas in which multivariate phylogenetic comparative methods are still in need of future development; namely highly multivariate Ornstein-Uhlenbeck models and approaches for multivariate evolutionary model comparisons.

  10. Data from: Detecting Anomalies in Multivariate Data Sets with Switching Sequences and Continuous Streams

    • datasets.ai
    • s.cnmilf.com
    • +4 more
    Updated Aug 9, 2024
    + more versions
    Cite
    National Aeronautics and Space Administration (2024). Detecting Anomalies in Multivariate Data Sets with Switching Sequences and Continuous Streams [Dataset]. https://datasets.ai/datasets/detecting-anomalies-in-multivariate-data-sets-with-switching-sequences-and-continuous-stre
    Dataset updated
    Aug 9, 2024
    Dataset authored and provided by
    National Aeronautics and Space Administration
    Description

    The world-wide aviation system is one of the most complex dynamical systems ever developed and is generating data at an extremely rapid rate. Most modern commercial aircraft record several hundred flight parameters including information from the guidance, navigation, and control systems, the avionics and propulsion systems, and the pilot inputs into the aircraft. These parameters may be continuous measurements or binary or categorical measurements recorded in one second intervals for the duration of the flight. Currently, most approaches to aviation safety are reactive, meaning that they are designed to react to an aviation safety incident or accident. Here, we discuss a novel approach based on the theory of multiple kernel learning to detect potential safety anomalies in very large databases of discrete and continuous data from world-wide operations of commercial fleets. We pose a general anomaly detection problem which includes both discrete and continuous data streams, where we assume that the discrete streams have a causal influence on the continuous streams. We also assume that atypical sequences of events in the discrete streams can lead to off-nominal system performance. We discuss the application domain, novel algorithms, and also briefly discuss results on synthetic and real-world data sets. Our algorithm uncovers operationally significant events in high dimensional data streams in the aviation industry which are not detectable using state of the art methods.

  11. Data from: Quantifying and comparing phylogenetic evolutionary rates for shape and other high-dimensional phenotypic data

    • zenodo.org
    • search.dataone.org
    • +1 more
    bin, csv
    Updated May 28, 2022
    Cite
    Dean C. Adams; Dean C. Adams (2022). Data from: Quantifying and comparing phylogenetic evolutionary rates for shape and other high-dimensional phenotypic data [Dataset]. http://doi.org/10.5061/dryad.41hc4
    Available download formats: csv, bin
    Dataset updated
    May 28, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Dean C. Adams
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Many questions in evolutionary biology require the quantification and comparison of rates of phenotypic evolution. Recently, phylogenetic comparative methods have been developed for comparing evolutionary rates on a phylogeny for single, univariate traits (σ²), and evolutionary rate matrices (R) for sets of traits treated simultaneously. However, high-dimensional traits like shape remain under-examined with this framework, because methods suited for such data have not been fully developed. In this article, I describe a method to quantify phylogenetic evolutionary rates for high-dimensional multivariate data (σ²_mult), found from the equivalency between statistical methods based on covariance matrices and those based on distance matrices (R-mode and Q-mode methods). I then use simulations to evaluate the statistical performance of hypothesis testing procedures that compare σ²_mult for two or more groups of species on a phylogeny. Under both isotropic and non-isotropic conditions, and for differing numbers of trait dimensions, the proposed method displays appropriate Type I error and high statistical power for detecting known differences in σ²_mult among groups. By contrast, the Type I error rate of likelihood tests based on the evolutionary rate matrix (R) increases as the number of trait dimensions (p) increases, and becomes unacceptably large when only a few trait dimensions are considered. Further, likelihood tests based on R cannot be computed when the number of trait dimensions equals or exceeds the number of taxa in the phylogeny (i.e., when p ≥ N). These results demonstrate that tests based on σ²_mult provide a useful means of comparing evolutionary rates for high-dimensional data that are otherwise not analytically accessible to methods based on the evolutionary rate matrix. This advance thus expands the phylogenetic comparative toolkit for high-dimensional phenotypic traits like shape. Finally, I illustrate the utility of the new approach by evaluating rates of head shape evolution in a lineage of Plethodon salamanders.

  12. Data from: First-passage probability estimation of high-dimensional nonlinear stochastic dynamic systems by a fractional moments-based mixture distribution approach

    • data.niaid.nih.gov
    Updated Jul 12, 2024
    Cite
    Valdebenito, Marcos (2024). First-passage probability estimation of high-dimensional nonlinear stochastic dynamic systems by a fractional moments-based mixture distribution approach [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7661087
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Broggi, Matteo
    Dang, Chao
    Faes, Matthias
    Beer, Michael
    Valdebenito, Marcos
    Ding, Chen
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    First-passage probability estimation of high-dimensional nonlinear stochastic dynamic systems is a significant task to be solved in many science and engineering fields, but still remains an open challenge. The present paper develops a novel approach, termed ‘fractional moments-based mixture distribution’, to address this challenge. This approach is implemented by capturing the extreme value distribution (EVD) of the system response with the concepts of fractional moment and mixture distribution. In our context, the fractional moment itself is by definition a high-dimensional integral with a complicated integrand. To efficiently compute the fractional moments, a parallel adaptive sampling scheme that allows for sample size extension is developed using the refined Latinized stratified sampling (RLSS). In this manner, both variance reduction and parallel computing are possible for evaluating the fractional moments. From the knowledge of low-order fractional moments, the EVD of interest is then expected to be reconstructed. Based on introducing an extended inverse Gaussian distribution and a log extended skew-normal distribution, one flexible mixture distribution model is proposed, where its fractional moments are derived in analytic form. By fitting a set of fractional moments, the EVD can be recovered via the proposed mixture model. Accordingly, the first-passage probabilities under different thresholds can be obtained from the recovered EVD straightforwardly. The performance of the proposed method is verified by three examples consisting of two test examples and one engineering problem.
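
    For readers skimming the abstract, the two standard definitions it leans on can be written out as follows. The notation (Z for the extreme response, Θ for the random inputs, b for the threshold) is ours, chosen for illustration, and is not taken from the paper.

```latex
% Extreme value of the response over the analysis window [0, T]
Z \;=\; \max_{t \in [0,T]} \bigl|\, X(t,\boldsymbol{\Theta}) \,\bigr| .
% Fractional moment of order \alpha: by definition a high-dimensional
% integral over the random inputs \boldsymbol{\Theta}
M_{\alpha} \;=\; \mathbb{E}\!\left[ Z^{\alpha} \right]
          \;=\; \int_{\Omega} \bigl[ z(\boldsymbol{\theta}) \bigr]^{\alpha}
                 f_{\boldsymbol{\Theta}}(\boldsymbol{\theta})\,\mathrm{d}\boldsymbol{\theta},
          \qquad \alpha \in \mathbb{R} .
% First-passage probability at threshold b, recovered from the fitted
% extreme value distribution F_Z
P_f(b) \;=\; \Pr\{\, Z > b \,\} \;=\; 1 - F_{Z}(b) .
```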

  13. Data from: High-dimensional variance partitioning reveals the modular genetic basis of adaptive divergence in gene expression during reproductive character displacement

    • datadryad.org
    zip
    Updated May 26, 2011
    Cite
    Elizabeth Ann McGraw; Yixin Henry Ye; Brad Foley; Stephen F Chenoweth; Megan Higgie; Emma Hine; Mark W Blows (2011). High-dimensional variance partitioning reveals the modular genetic basis of adaptive divergence in gene expression during reproductive character displacement [Dataset]. http://doi.org/10.5061/dryad.rn6gg
    Available download formats: zip
    Dataset updated
    May 26, 2011
    Dataset provided by
    Dryad
    Authors
    Elizabeth Ann McGraw; Yixin Henry Ye; Brad Foley; Stephen F Chenoweth; Megan Higgie; Emma Hine; Mark W Blows
    Time period covered
    2011
    Description

    • RIL RTPCR and CHC data: qRT-PCR data for three genes (AmyPres, CL481res, NinaDres) and cuticular hydrocarbons (lc1, lc2, lc3, lc5, lc6, lc7, lc8, lc9) for 430 individuals from 41 RILs and 2 parental lines (EUNGC and FORS4). Canonical variates for genes (V1, V2, V3) and CHCs (W1, W2, W3) are also given.
    • 15 RIL CHC and gene factor means: RIL line means for 12 genetic factors and 9 cuticular hydrocarbons for the line-mean analysis presented in Table 2.

  14. Data from: Permutation tests for phylogenetic comparative analyses of high-dimensional shape data: what you shuffle matters

    • datadryad.org
    • data.niaid.nih.gov
    zip
    Updated Dec 26, 2014
    Cite
    Dean C. Adams; Michael L. Collyer (2014). Permutation tests for phylogenetic comparative analyses of high-dimensional shape data: what you shuffle matters [Dataset]. http://doi.org/10.5061/dryad.2jv17
    Available download formats: zip
    Dataset updated
    Dec 26, 2014
    Dataset provided by
    Dryad
    Authors
    Dean C. Adams; Michael L. Collyer
    Time period covered
    2014
    Description

    • PlethHead-SVL-Data: Head shape and body size data (species means) for 42 species of Plethodon salamanders.
    • Plethodon Phylogeny (plethred.tre): Plethodon phylogeny (from Wiens et al. 2006).
    • Type I Error Simulation Script (PICRand-TypeIError.r): R script for simulating data to evaluate Type I error rates of two multivariate phylogenetic comparative method approaches.
    • Simulation support scripts (procD.pic.r): Support scripts for the Type I error simulations.

  15. Data from: Gaussian approximation and spatially dependent wild bootstrap for high-dimensional spatial data

    • tandf.figshare.com
    txt
    Updated Jul 3, 2023
    Cite
    Daisuke Kurisu; Kengo Kato; Xiaofeng Shao (2023). Gaussian approximation and spatially dependent wild bootstrap for high-dimensional spatial data [Dataset]. http://doi.org/10.6084/m9.figshare.23227432.v1
    Available download formats: txt
    Dataset updated
    Jul 3, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Daisuke Kurisu; Kengo Kato; Xiaofeng Shao
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this paper, we establish a high-dimensional CLT for the sample mean of p-dimensional spatial data observed over irregularly spaced sampling sites in ℝ^d, allowing the dimension p to be much larger than the sample size n. We adopt a stochastic sampling scheme that can generate irregularly spaced sampling sites in a flexible manner and include both pure increasing domain and mixed increasing domain frameworks. To facilitate statistical inference, we develop the spatially dependent wild bootstrap (SDWB) and justify its asymptotic validity in high dimensions by deriving error bounds that hold almost surely conditionally on the stochastic sampling sites. Our dependence conditions on the underlying random field cover a wide class of random fields such as Gaussian random fields and continuous autoregressive moving average random fields. Through numerical simulations and a real data analysis, we demonstrate the usefulness of our bootstrap-based inference in several applications, including joint confidence interval construction for high-dimensional spatial data and change-point detection for spatio-temporal data.
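
    As a rough illustration of the bootstrap-based inference described above, the sketch below runs a plain multiplier (wild) bootstrap for the max statistic of a high-dimensional sample mean and turns the resulting critical value into simultaneous confidence intervals. It uses i.i.d. Gaussian multipliers on synthetic data; the paper's spatially dependent wild bootstrap instead draws multipliers that are correlated across nearby sampling sites.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, B = 200, 1000, 500            # sample size, dimension (p >> n), bootstrap draws

X = rng.standard_normal((n, p))      # placeholder for the spatial observations
X_centered = X - X.mean(axis=0)

# Multiplier bootstrap of the statistic max_j |sqrt(n) * (bootstrap mean)_j|.
stats = np.empty(B)
for b in range(B):
    e = rng.standard_normal(n)       # i.i.d. multipliers; SDWB would use spatially dependent ones
    stats[b] = np.abs(X_centered.T @ e).max() / np.sqrt(n)

crit = np.quantile(stats, 0.95)

# Simultaneous 95% confidence intervals for all p component means.
half_width = crit / np.sqrt(n)
lower = X.mean(axis=0) - half_width
upper = X.mean(axis=0) + half_width
print(round(crit, 3), lower[:3], upper[:3])
```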

  16. Data from: Materials Science Optimization Benchmark Dataset for High-dimensional, Multi-objective, Multi-fidelity Optimization of CrabNet Hyperparameters

    • zenodo.org
    bin, csv, json
    Updated Mar 3, 2023
    Cite
    Sterling G. Baird; Sterling G. Baird; Jeet N. Parikh; Jeet N. Parikh (2023). Materials Science Optimization Benchmark Dataset for High-dimensional, Multi-objective, Multi-fidelity Optimization of CrabNet Hyperparameters [Dataset]. http://doi.org/10.5281/zenodo.7693716
    Available download formats: bin, json, csv
    Dataset updated
    Mar 3, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sterling G. Baird; Jeet N. Parikh
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Benchmarks are an essential driver of progress in scientific disciplines. Ideal benchmarks mimic real-world tasks as closely as possible, where insufficient difficulty or applicability can stunt growth in the field. Benchmarks should also have sufficiently low computational overhead to promote accessibility and repeatability. The goal is then to win a "Turing test" of sorts by creating a surrogate model that is indistinguishable from the ground truth observation (at least within the dataset bounds that were explored), necessitating a large amount of data. In materials science and chemistry, industry-relevant optimization tasks are often hierarchical, noisy, multi-fidelity, multi-objective, high-dimensional, and non-linearly correlated while exhibiting mixed numerical and categorical variables subject to linear and non-linear constraints. To complicate matters, unexpected, failed simulation or experimental regions may be present in the search space. In this study, 173219 quasi-random hyperparameter combinations were generated across 23 hyperparameters and used to train CrabNet on the Matbench experimental band gap dataset. The results were logged to a free-tier shared MongoDB Atlas dataset. This study resulted in a regression dataset mapping hyperparameter combinations (including repeats) to MAE, RMSE, computational runtime, and model size for the CrabNet model trained on the Matbench experimental band gap benchmark task. This dataset is used to create a surrogate model as close as possible to running the actual simulations by incorporating heteroskedastic noise. Failure cases for bad hyperparameter combinations were excluded via careful construction of the hyperparameter search space, and so were not considered as was done in prior work. For the regression dataset, percentile ranks were computed within each of the groups of identical parameter sets to enable capturing heteroskedastic noise. This contrasts with a more traditional approach that imposes a-priori assumptions such as Gaussian noise, e.g., by providing a mean and standard deviation. A similar approach can be applied to other benchmark datasets to bridge the gap between optimization benchmarks with low computational overhead and realistically complex, real-world optimization scenarios.
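
    The per-group percentile-rank step mentioned above can be written in a few lines. The sketch below uses pandas on a toy table whose column names ("group_id", "mae") are placeholders rather than the dataset's actual field names.

```python
import pandas as pd

# Toy stand-in for the regression dataset: repeated hyperparameter sets
# ("group_id") each mapped to a noisy objective value ("mae").
df = pd.DataFrame({
    "group_id": [0, 0, 0, 1, 1, 2, 2, 2, 2],
    "mae":      [0.40, 0.42, 0.39, 0.55, 0.53, 0.61, 0.60, 0.64, 0.62],
})

# Percentile rank of each observation within its group of identical
# parameter sets; this lets a surrogate capture heteroskedastic noise
# without imposing, e.g., a Gaussian error assumption.
df["mae_pct_rank"] = df.groupby("group_id")["mae"].rank(pct=True)
print(df)
```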

  17. Data from: Empirical Dynamic Quantiles for Visualization of High-Dimensional Time Series

    • tandf.figshare.com
    png
    Updated Jun 2, 2023
    Cite
    Daniel Peña; Ruey S. Tsay; Ruben Zamar (2023). Empirical Dynamic Quantiles for Visualization of High-Dimensional Time Series [Dataset]. http://doi.org/10.6084/m9.figshare.7701638.v1
    Available download formats: png
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Daniel Peña; Ruey S. Tsay; Ruben Zamar
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The empirical quantiles of independent data provide a good summary of the underlying distribution of the observations. For high-dimensional time series defined in two dimensions, such as in space and time, one can define empirical quantiles of all observations at a given time point, but such time-wise quantiles can only reflect properties of the data at that time point. They often fail to capture the dynamic dependence of the data. In this article, we propose a new definition of empirical dynamic quantiles (EDQ) for high-dimensional time series that mitigates this limitation by imposing that the quantile must be one of the observed time series. The word dynamic emphasizes the fact that these newly defined quantiles capture the time evolution of the data. We prove that the EDQ converge to the time-wise quantiles under some weak conditions as the dimension increases. A fast algorithm to compute the dynamic quantiles is presented and the resulting quantiles are used to produce summary plots for a collection of many time series. We illustrate with two real datasets that the time-wise and dynamic quantiles convey different and complementary information. We also briefly compare the visualization provided by EDQ with that obtained by functional depth. The R code and a vignette for computing and plotting EDQ are available at https://github.com/dpena157/HDts/.
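
    To illustrate the constraint that an empirical dynamic quantile must be one of the observed series, the sketch below picks, for a given probability level p, the observed series that minimizes the total quantile check loss against all series over all time points. This is an illustrative brute-force selection rule on toy data, not the paper's exact objective or its fast algorithm, which are given in the article and the linked R code.

```python
import numpy as np

def check_loss(u, p):
    """Quantile check (pinball) loss rho_p(u)."""
    return np.where(u >= 0, p * u, (p - 1.0) * u)

def empirical_dynamic_quantile(X, p=0.5):
    """Return the index of the observed series (row of X, shape m x T) that
    minimizes the total check loss against all m series across all T times."""
    m = X.shape[0]
    scores = np.array([check_loss(X - X[j], p).sum() for j in range(m)])
    return scores.argmin()

rng = np.random.default_rng(0)
X = np.cumsum(rng.standard_normal((100, 250)), axis=1)   # 100 toy series of length 250
print("index of the empirical dynamic median:", empirical_dynamic_quantile(X, p=0.5))
```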

  18. In Situ Photoluminescence Dataset for Exploring Material and Processing Variabilities in Blade-Coated Perovskite Photovoltaics

    • zenodo.org
    bin, zip
    Updated Jan 9, 2025
    + more versions
    Cite
    Felix Laufer; Felix Laufer; Markus Götz; Markus Götz; Ulrich Wilhelm Paetzold; Ulrich Wilhelm Paetzold (2025). In Situ Photoluminescence Dataset for Exploring Material and Processing Variabilities in Blade-Coated Perovskite Photovoltaics [Dataset]. http://doi.org/10.5281/zenodo.14609789
    Available download formats: bin, zip
    Dataset updated
    Jan 9, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Felix Laufer; Markus Götz; Ulrich Wilhelm Paetzold
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Content:

    This dataset contains time-resolved in situ data acquired during the formation of blade-coated perovskite thin films, which were subsequently processed into functional perovskite solar cells. The time series data capture the vacuum quenching process - a critical step in perovskite layer formation - using photoluminescence (PL) and diffuse reflection imaging. The dataset is intended to support deep learning applications for predicting both material-level properties (e.g., precursor composition) and device-level performance metrics (e.g., power conversion efficiency, PCE).

    Unlike the previous dataset, this dataset includes perovskite solar cells fabricated under varied process conditions. Specifically, the quenching duration, precursor solution molarity, and molar ratio were systematically changed to enhance the diversity of the data.

    To monitor the vacuum quenching process, a PL imaging setup captured four channels of time series image data (2D+t), including one diffuse reflection channel and three PL spectrum channels filtered for different wavelengths. All images were cropped into 65x56 pixel patches, isolating the active area of individual solar cells. However, currently, the dataset provides only the time transients of these four channels, where the spatial mean intensity was calculated for each time step. This dimensionality reduction transforms the high-dimensional video data into compact temporal transients, highlighting the critical dynamics of thin-film formation.

    The dataset consists of two parts:

    1. Samples finalized into functional solar cells:

      • Includes photovoltaic (PV) performance metrics as target variables:
        (1) power conversion efficiency (PCE),
        (2) open-circuit voltage (VOC),
        (3) short-circuit current density (JSC),
        (4) fill factor (FF), measured in forward and backward sweeps.
    2. Samples not finalized into functional solar cells:

      • Does not include PV metrics. These samples are suitable for classification tasks, such as predicting precursor solution molarity and molar ratio.

    Further information on the experimental procedure and data processing is detailed in the corresponding paper: Deep learning for augmented process monitoring of scalable perovskite thin-film fabrication. Please cite this paper when using the dataset.

    Columns in the data.h5 file:

    • Identifiers: date, expID, patchID (sample identifiers).
    • Input features: ND, LP725, LP780, SP775 (signal transients from vacuum quenching).
    • Material properties: ratio, molarity (precursor solution properties).
    • Process parameters: evac_duration (vacuum quenching duration).
    • Photovoltaic performance metrics:
      • PCE_forward, PCE_backward, VOC_forward, VOC_backward, JSC_forward, JSC_backward, FF_forward, FF_backward.
    • Photoluminescence measurements: plqyWL, lumFluxDens (PL spectra after vacuum quenching).
    • Electrical characteristics:
      • RSHUNT_forward, RSHUNT_backward, RS_forward, RS_backward (shunt and series resistances from jV curves).
    • Derived parameters: PLQY, iVOC, jscPLQY, egPLQY (calculated from PLQY measurements).

    Usage:

    The dataset is structured for machine learning applications to improve understanding of the complex perovskite thin-film formation from solution. The corresponding paper tackles these challenges:

    1. Material classification: Using ND, LP725, LP780, and SP775 as inputs to predict ratio and molarity.
    2. Device performance regression: Using ND, LP725, LP780, and SP775 with a variable process parameter (evac_duration) as inputs to predict PCE_backward.
    3. Process control recommendations: Forecasting monitoring signals (ND, LP725, LP780, SP775) as a function of a variable process parameter (evac_duration) and predicting the corresponding device performance metric PCE_backward.
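
    As a minimal sketch of task 2 above, the snippet below reads data.h5 with pandas, reduces each monitoring transient to a couple of summary statistics, and fits a simple regressor for PCE_backward. The HDF5 key name ("df"), the assumption that each transient column stores one array per sample, and the summary-statistic features are all assumptions made for illustration; the paper's deep-learning models operate on the full time series, and the repository notebooks define the actual train-test splits.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Assumptions: data.h5 exposes a flat pandas table under the key "df", and
# each transient column (ND, LP725, LP780, SP775) holds one array per sample.
df = pd.read_hdf("data.h5", key="df")
df = df.dropna(subset=["PCE_backward"])          # keep finalized solar cells only

# For illustration, reduce each transient to simple summary statistics;
# the paper's models instead learn from the full time series.
feats = {}
for ch in ["ND", "LP725", "LP780", "SP775"]:
    series = df[ch].apply(np.asarray)
    feats[f"{ch}_mean"] = series.apply(np.mean)
    feats[f"{ch}_max"] = series.apply(np.max)
X = pd.DataFrame(feats, index=df.index)
X["evac_duration"] = df["evac_duration"]         # variable process parameter
y = df["PCE_backward"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_train, y_train)
print("held-out R^2:", round(model.score(X_test, y_test), 3))
```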

    Scripts for generating the same train-test splits and cross-validation folds as in the corresponding paper are provided in the GitHub repository:

    • 00a_generate_Material_train_test_folds.ipynb
    • 00b_generate_PCE_train_test_folds.ipynb

    Additionally, random forest models used for forecasting are included in forecasting_models.zip.

  19. Data from: Multiple Kernel Learning for Heterogeneous Anomaly Detection: Algorithm and Aviation Safety Case Study

    • catalog.data.gov
    • datasets.ai
    • +3 more
    Updated Apr 10, 2025
    + more versions
    Cite
    Dashlink (2025). Multiple Kernel Learning for Heterogeneous Anomaly Detection: Algorithm and Aviation Safety Case Study [Dataset]. https://catalog.data.gov/dataset/multiple-kernel-learning-for-heterogeneous-anomaly-detection-algorithm-and-aviation-safety
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Dashlink
    Description

    The world-wide aviation system is one of the most complex dynamical systems ever developed and is generating data at an extremely rapid rate. Most modern commercial aircraft record several hundred flight parameters including information from the guidance, navigation, and control systems, the avionics and propulsion systems, and the pilot inputs into the aircraft. These parameters may be continuous measurements or binary or categorical measurements recorded in one second intervals for the duration of the flight. Currently, most approaches to aviation safety are reactive, meaning that they are designed to react to an aviation safety incident or accident. In this paper, we discuss a novel approach based on the theory of multiple kernel learning to detect potential safety anomalies in very large data bases of discrete and continuous data from world-wide operations of commercial fleets. We pose a general anomaly detection problem which includes both discrete and continuous data streams, where we assume that the discrete streams have a causal influence on the continuous streams. We also assume that atypical sequences of events in the discrete streams can lead to off-nominal system performance. We discuss the application domain, novel algorithms, and also discuss results on real-world data sets. Our algorithm uncovers operationally significant events in high dimensional data streams in the aviation industry which are not detectable using state of the art methods.

  20. Data from: ApHIN - Autoencoder-based port-Hamiltonian Identification Networks (Software Package)

    • darus.uni-stuttgart.de
    Updated Aug 27, 2024
    Cite
    Jonas Kneifl; Johannes Rettberg; Julius Herb (2024). ApHIN - Autoencoder-based port-Hamiltonian Identification Networks (Software Package) [Dataset]. http://doi.org/10.18419/DARUS-4446
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 27, 2024
    Dataset provided by
    DaRUS
    Authors
    Jonas Kneifl; Johannes Rettberg; Julius Herb
    License

    MIT License (https://spdx.org/licenses/MIT.html)

    Dataset funded by
    DFG
    Ministry of Science, Research and the Arts Baden-Württemberg
    Description

    Software package for data-driven identification of latent port-Hamiltonian systems.

    Abstract: Conventional physics-based modeling techniques involve high effort, e.g. time and expert knowledge, while data-driven methods often lack interpretability, structure, and sometimes reliability. To mitigate this, we present a data-driven system identification framework that derives models in the port-Hamiltonian (pH) formulation. This formulation is suitable for multi-physical systems while guaranteeing the useful system-theoretic properties of passivity and stability. Our framework combines linear and nonlinear reduction with structured, physics-motivated system identification. In this process, high-dimensional state data obtained from possibly nonlinear systems serves as the input for an autoencoder, which then performs two tasks: (i) nonlinearly transforming and (ii) reducing this data onto a low-dimensional manifold. In the resulting latent space, a pH system is identified by considering the unknown matrix entries as weights of a neural network. The matrices strongly satisfy the pH matrix properties through Cholesky factorizations. In a joint optimization process over the loss term, the pH matrices are adjusted to match the dynamics observed by the data, while defining a linear pH system in the latent space per construction. The learned, low-dimensional pH system can describe even nonlinear systems and is rapidly computable due to its small size. The method is exemplified by a parametric mass-spring-damper and a nonlinear pendulum example, as well as the high-dimensional model of a disc brake with linear thermoelastic behavior.

    Features: This package implements neural networks that identify linear port-Hamiltonian systems from (potentially high-dimensional) data [1].
    • Autoencoders (AEs) for dimensionality reduction
    • pH layer to identify system matrices that fulfill the definition of a linear pH system
    • pHIN: identify a (parametric) low-dimensional port-Hamiltonian system directly
    • ApHIN: identify a (parametric) low-dimensional latent port-Hamiltonian system based on coordinate representations found using an autoencoder
    • Examples for the identification of linear pH systems from data: one-dimensional mass-spring-damper chain, pendulum, disc brake model

    See the documentation for more details.
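
    The Cholesky construction mentioned above can be illustrated in a few lines of numpy: unconstrained parameters are mapped to a skew-symmetric J and to positive semidefinite R and Q via lower-triangular factors, so the latent system dz/dt = (J - R) Q z is a linear pH system by construction. This is a standalone sketch of the idea, not code from the ApHIN package, where such parameters are trained as neural-network weights.

```python
import numpy as np

rng = np.random.default_rng(0)
r = 4                                   # latent state dimension

# Unconstrained parameters (in ApHIN these would be trainable weights).
theta_J = rng.standard_normal((r, r))
theta_R = rng.standard_normal((r, r))
theta_Q = rng.standard_normal((r, r))

# Structure-preserving construction of the pH matrices.
J = theta_J - theta_J.T                 # skew-symmetric by construction
L_R = np.tril(theta_R)
R = L_R @ L_R.T                         # symmetric positive semidefinite (Cholesky-style factor)
L_Q = np.tril(theta_Q)
Q = L_Q @ L_Q.T                         # symmetric positive semidefinite Hamiltonian matrix

# Latent linear pH dynamics (autonomous case): dz/dt = (J - R) Q z.
A = (J - R) @ Q

# Sanity checks for the claimed structural properties.
assert np.allclose(J, -J.T)
assert np.all(np.linalg.eigvalsh(R) >= -1e-10)
assert np.all(np.linalg.eigvalsh(Q) >= -1e-10)
print(A)
```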
