100+ datasets found
  1. Data and scripts from: High-dimensional percolation criticality and hints of mean-field-like caging of the random Lorentz gas

    • research.repository.duke.edu
    Updated Jul 2, 2021
    Cite
    Yang, Zhen; Charbonneau, Patrick; Charbonneau, Benoit; Hu, Yi (2021). Data and scripts from: High-dimensional percolation criticality and hints of mean-field-like caging of the random Lorentz gas [Dataset]. http://doi.org/10.7924/r4s46r07b
    Dataset updated
    Jul 2, 2021
    Dataset provided by
    Duke Research Data Repository
    Authors
    Yang, Zhen; Charbonneau, Patrick; Charbonneau, Benoit; Hu, Yi
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Dataset funded by
    Simons Foundation
    Description

    The random Lorentz gas (RLG) is a minimal model for transport in disordered media. Despite the broad relevance of the model, theoretical grasp over its properties remains weak. For instance, the scaling with dimension $d$ of its localization transition at the void percolation threshold is not well controlled analytically nor computationally. A recent study [Biroli et al. Phys. Rev. E 103, L030104 (2021)] of the caging behavior of the RLG motivated by the mean-field theory of glasses has uncovered physical inconsistencies in that scaling that heighten the need for guidance. Here, we first extend analytical expectations for asymptotic high-d bounds on the void percolation threshold, and then computationally evaluate both the threshold and its criticality in various d. In high-d systems, we observe that the standard percolation physics is complemented by a dynamical slowdown of the tracer dynamics reminiscent of mean-field caging. A simple modification of the RLG is found to bring the interplay between percolation and mean-field-like caging down to d=3. ...

  2. Data from: Skeleton Clustering: Dimension-Free Density-Aided Clustering

    • tandf.figshare.com
    pdf
    Updated Jun 4, 2023
    Cite
    Zeyu Wei; Yen-Chi Chen (2023). Skeleton Clustering: Dimension-Free Density-Aided Clustering [Dataset]. http://doi.org/10.6084/m9.figshare.21976961.v1
    Available download formats: pdf
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Zeyu Wei; Yen-Chi Chen
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We introduce a density-aided clustering method called Skeleton Clustering that can detect clusters in multivariate and even high-dimensional data with irregular shapes. To bypass the curse of dimensionality, we propose surrogate density measures that are less dependent on the dimension but have intuitive geometric interpretations. The clustering framework constructs a concise representation of the given data as an intermediate step and can be thought of as a combination of prototype methods, density-based clustering, and hierarchical clustering. We show by theoretical analysis and empirical studies that the skeleton clustering leads to reliable clusters in multivariate and high-dimensional scenarios. Supplementary materials for this article are available online.
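
    For orientation only, the sketch below mimics the general workflow the description outlines: over-segment the data into prototype "knots", merge the knots hierarchically, and propagate the knot labels back to the observations. The surrogate density measures the paper uses to weight knot pairs are replaced here by plain Euclidean distances between knots, so this illustrates the shape of the framework, not the authors' method.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs

# Toy data; the intended use case is multivariate / high-dimensional data.
X, _ = make_blobs(n_samples=1000, centers=5, n_features=10, random_state=0)

# Step 1 (prototype step): over-segment the data into many "knots".
n_knots = 50
km = KMeans(n_clusters=n_knots, n_init=10, random_state=0).fit(X)
knots = km.cluster_centers_

# Step 2 (hierarchical step): merge the knots into the final clusters.
# The paper weights knot pairs with surrogate density measures; plain
# Euclidean distance between knots is used here only for illustration.
agg = AgglomerativeClustering(n_clusters=5, linkage="single").fit(knots)

# Step 3: propagate knot labels back to the original observations.
labels = agg.labels_[km.labels_]
print(np.bincount(labels))
```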

  3. Dataset for: Some Remarks on the R2 for Clustering

    • wiley.figshare.com
    txt
    Updated Jun 1, 2023
    Cite
    Nicola Loperfido; Thaddeus Tarpey (2023). Dataset for: Some Remarks on the R2 for Clustering [Dataset]. http://doi.org/10.6084/m9.figshare.6124508.v1
    Available download formats: txt
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Wiley
    Authors
    Nicola Loperfido; Thaddeus Tarpey
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    A common descriptive statistic in cluster analysis is the $R^2$ that measures the overall proportion of variance explained by the cluster means. This note highlights properties of the $R^2$ for clustering. In particular, we show that generally the $R^2$ can be artificially inflated by linearly transforming the data by "stretching" and by projecting. Also, the $R^2$ for clustering will often be a poor measure of clustering quality in high-dimensional settings. We also investigate the $R^2$ for clustering for misspecified models. Several simulation illustrations are provided highlighting weaknesses in the clustering $R^2$, especially in high-dimensional settings. A functional data example is given showing how the $R^2$ for clustering can vary dramatically depending on how the curves are estimated.
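
    As a concrete illustration of the statistic discussed above, the sketch below computes a clustering $R^2$ as the between-cluster sum of squares divided by the total sum of squares, and shows that rescaling one coordinate changes the value even though the cluster assignment is unchanged. It is a toy example on synthetic data, not the paper's analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def clustering_r2(X, labels):
    """Proportion of total variance explained by the cluster means."""
    grand_mean = X.mean(axis=0)
    ss_total = ((X - grand_mean) ** 2).sum()
    ss_between = sum(
        (labels == k).sum() * ((X[labels == k].mean(axis=0) - grand_mean) ** 2).sum()
        for k in np.unique(labels)
    )
    return ss_between / ss_total

X, _ = make_blobs(n_samples=500, centers=3, n_features=2, random_state=1)
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)
print(round(clustering_r2(X, labels), 3))

# "Stretching" one coordinate changes the R^2 even though the cluster
# assignment is identical, illustrating the inflation issue noted above.
X_stretched = X * np.array([10.0, 1.0])
print(round(clustering_r2(X_stretched, labels), 3))
```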

  4. Data from: Testing Alphas in Conditional Time-Varying Factor Models with High Dimensional Assets

    • figshare.com
    txt
    Updated Jun 4, 2023
    Cite
    Shujie Ma; Wei Lan; Liangjun Su; Chih-Ling Tsai (2023). Testing Alphas in Conditional Time-Varying Factor Models with High Dimensional Assets [Dataset]. http://doi.org/10.6084/m9.figshare.6453074.v1
    Available download formats: txt
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Shujie Ma; Wei Lan; Liangjun Su; Chih-Ling Tsai
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    For conditional time-varying factor models with high dimensional assets, this article proposes a high dimensional alpha (HDA) test to assess whether there exist abnormal returns on securities (or portfolios) over the theoretical expected returns. To employ this test effectively, a constant coefficient test is also introduced. It examines the validity of constant alphas and factor loadings. Simulation studies and an empirical example are presented to illustrate the finite sample performance and the usefulness of the proposed tests. Using the HDA test, the empirical example demonstrates that the FF three-factor model (Fama and French, 1993) is better than CAPM (Sharpe, 1964) in explaining the mean-variance efficiency of both the Chinese and US stock markets. Furthermore, our results suggest that the US stock market is more efficient in terms of mean-variance efficiency than the Chinese stock market.

  5. Data from: Machine Learning Methods for High-Dimensional and Multimodal Single-Cell Data

    • curate.nd.edu
    pdf
    Updated Jun 9, 2025
    Cite
    Ouyang Zhu (2025). Machine Learning Methods for High-Dimensional and Multimodal Single-Cell Data [Dataset]. http://doi.org/10.7274/29191802.v1
    Available download formats: pdf
    Dataset updated
    Jun 9, 2025
    Dataset provided by
    University of Notre Dame
    Authors
    Ouyang Zhu
    License

    https://www.law.cornell.edu/uscode/text/17/106

    Description

    Recent advances in single-cell and multi-omics technologies have enabled high-resolution profiling of cellular states, but also introduced new computational challenges. This dissertation presents machine learning methods to improve data quality and extract insights from high-dimensional, multimodal single-cell datasets.

    First, we propose Decaf K-means, a clustering algorithm that accounts for cluster-specific confounding effects, such as batch variation, directly during clustering. This approach improves clustering accuracy in both synthetic and real data.

    Second, we develop scPDA, a denoising method for droplet-based single-cell protein data that eliminates the need for empty droplets or null controls. scPDA models protein-protein relationships to enhance denoising accuracy and significantly improves cell-type identification.

    Third, we introduce Scouter, a model that predicts transcriptional outcomes of unseen gene perturbations. Scouter combines neural networks with large language models to generalize across perturbations, reducing prediction error by over 50% compared to existing methods.

    Finally, we extend this to TranScouter, which predicts transcriptional responses under new biological conditions without direct perturbation data. Using a tailored encoder-decoder architecture, TranScouter achieves accurate cross-condition predictions, paving the way for more generalizable models in perturbation biology.

  6. Data from: A change-point–based control chart for detecting sparse mean changes in high-dimensional heteroscedastic data

    • tandf.figshare.com
    txt
    Updated Jan 17, 2024
    Cite
    Zezhong Wang; Inez Maria Zwetsloot (2024). A change-point–based control chart for detecting sparse mean changes in high-dimensional heteroscedastic data [Dataset]. http://doi.org/10.6084/m9.figshare.24441804.v1
    Available download formats: txt
    Dataset updated
    Jan 17, 2024
    Dataset provided by
    Taylor & Francis
    Authors
    Zezhong Wang; Inez Maria Zwetsloot
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Because of the “curse of dimensionality,” high-dimensional processes present challenges to traditional multivariate statistical process monitoring (SPM) techniques. In addition, the unknown underlying distribution of and complicated dependency among variables such as heteroscedasticity increase the uncertainty of estimated parameters and decrease the effectiveness of control charts. Moreover, the requirement of sufficient reference samples limits the application of traditional charts in high-dimension, low-sample-size scenarios (small n, large p). More difficulties appear when detecting and diagnosing abnormal behaviors caused by a small set of variables (i.e., sparse changes). In this article, we propose two change-point–based control charts to detect sparse shifts in the mean vector of high-dimensional heteroscedastic processes. Our proposed methods can start monitoring when the number of observations is a lot smaller than the dimensionality. The simulation results show that the proposed methods are robust to nonnormality and heteroscedasticity. Two real data examples are used to illustrate the effectiveness of the proposed control charts in high-dimensional applications. The R codes are provided online.

  7. Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets

    • data.mendeley.com
    Updated Nov 14, 2018
    Cite
    Scott Herford (2018). Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets. [Dataset]. http://doi.org/10.17632/6cm9wyd5g5.1
    Dataset updated
    Nov 14, 2018
    Authors
    Scott Herford
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    North Carolina
    Description

    The purpose of data mining analysis is always to find patterns in the data using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset: before doing any work on the data, it has to be pre-processed, and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and to create new features. Based on our project, after using clustering prior to classification, the performance did not improve much. One reason may be that the features we selected for clustering are not well suited for it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics.

    From the dimensionality reduction perspective: clustering is different from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters to reduce the data dimension loses a lot of information, since clustering techniques rely on a metric of 'distance', and at high dimensions Euclidean distance loses pretty much all meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always a good idea, since you may lose almost all of the information.

    From the perspective of creating new features: clustering analysis creates labels based on patterns in the data, which brings uncertainty into the data. When clustering is used prior to classification, the decision on the number of clusters strongly affects the performance of the clustering, and in turn the performance of the classification. If the subset of features we apply clustering to is well suited for it, it might increase the overall classification performance. For example, if the features we run k-means on are numerical and the dimension is small, the overall classification performance may be better. We did not lock in the clustering outputs with a random_state, in an effort to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, the data may simply not cluster well with the selected methods at all. Basically, the ramification we saw was that our results are not much better than random when applying clustering in the data preprocessing.

    Finally, it is important to ensure a feedback loop is in place to continuously collect the same data, in the same format, from which the models were created. This feedback loop can be used to measure the models' real-world effectiveness and to continue revising the models from time to time as things change.
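
    To make the pipeline discussed above concrete, here is a minimal sketch of clustering-before-classification: a k-means cluster label is appended as an engineered feature and the cross-validated accuracy is compared with a baseline. The synthetic data and the random-forest classifier are placeholders, not the project's actual North Carolina school data or models.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for the pre-processed school dataset.
X, y = make_classification(n_samples=2000, n_features=30, n_informative=10,
                           random_state=0)

# Baseline: classify on the original features only.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
baseline = cross_val_score(clf, X, y, cv=5).mean()

# Variant: add a k-means cluster label as an extra engineered feature.
# As noted in the description, with no random_state the cluster labels
# (and therefore the downstream scores) can vary from run to run.
cluster_ids = KMeans(n_clusters=8, n_init=10).fit_predict(X)
X_aug = np.column_stack([X, cluster_ids])
augmented = cross_val_score(clf, X_aug, y, cv=5).mean()

print(f"baseline accuracy:  {baseline:.3f}")
print(f"with cluster label: {augmented:.3f}")
```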

  8. Supplement to "Investigation of methods to enhance the power of high-dimensional tests"

    • scidb.cn
    Updated Oct 17, 2024
    Cite
    Jiujing Wu; Wenwen Guo (2024). Supplement to "Investigation of methods to enhance the power of high-dimensional tests" [Dataset]. http://doi.org/10.57760/sciencedb.j00206.00036
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Oct 17, 2024
    Dataset provided by
    Science Data Bank
    Authors
    Jiujing Wu; Wenwen Guo
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    These are simulation results for methods that improve the power of high-dimensional tests (such as mean tests, linear model tests, and independence tests), including images and tables.

  9. Data from: Multivariate phylogenetic comparative methods: evaluations, comparisons, and recommendations

    • zenodo.org
    • data.niaid.nih.gov
    • +3 more
    bin, zip
    Updated May 31, 2022
    Cite
    Dean C. Adams; Michael L. Collyer; Dean C. Adams; Michael L. Collyer (2022). Data from: Multivariate phylogenetic comparative methods: evaluations, comparisons, and recommendations [Dataset]. http://doi.org/10.5061/dryad.29722
    Available download formats: bin, zip
    Dataset updated
    May 31, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Dean C. Adams; Michael L. Collyer
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Recent years have seen increased interest in phylogenetic comparative analyses of multivariate datasets, but to date the varied proposed approaches have not been extensively examined. Here we review the mathematical properties required of any multivariate method, and specifically evaluate existing multivariate phylogenetic comparative methods in this context. Phylogenetic comparative methods based on the full multivariate likelihood are robust to levels of covariation among trait dimensions and are insensitive to the orientation of the dataset, but display increasing model misspecification as the number of trait dimensions increases. This is because the expected evolutionary covariance matrix (V) used in the likelihood calculations becomes more ill-conditioned as trait dimensionality increases, and as evolutionary models become more complex. Thus, these approaches are only appropriate for datasets with few traits and many species. Methods that summarize patterns across trait dimensions treated separately (e.g., SURFACE) incorrectly assume independence among trait dimensions, resulting in nearly a 100% model misspecification rate. Methods using pairwise composite likelihood are highly sensitive to levels of trait covariation, the orientation of the dataset, and the number of trait dimensions. The consequence of these debilitating deficiencies is that a user can arrive at differing statistical conclusions, and therefore biological inferences, simply from a dataspace rotation, like principal component analysis. By contrast, algebraic generalizations of the standard phylogenetic comparative toolkit that use the trace of covariance matrices are insensitive to levels of trait covariation, the number of trait dimensions, and the orientation of the dataset. Further, when appropriate permutation tests are used, these approaches display acceptable Type I error and statistical power. We conclude that methods summarizing information across trait dimensions, as well as pairwise composite likelihood methods should be avoided, while algebraic generalizations of the phylogenetic comparative toolkit provide a useful means of assessing macroevolutionary patterns in multivariate data. Finally, we discuss areas in which multivariate phylogenetic comparative methods are still in need of future development; namely highly multivariate Ornstein-Uhlenbeck models and approaches for multivariate evolutionary model comparisons.

  10. Data from: Detecting Anomalies in Multivariate Data Sets with Switching Sequences and Continuous Streams

    • datasets.ai
    • s.cnmilf.com
    • +4 more
    Updated Aug 9, 2024
    + more versions
    Cite
    National Aeronautics and Space Administration (2024). Detecting Anomalies in Multivariate Data Sets with Switching Sequences and Continuous Streams [Dataset]. https://datasets.ai/datasets/detecting-anomalies-in-multivariate-data-sets-with-switching-sequences-and-continuous-stre
    Dataset updated
    Aug 9, 2024
    Dataset authored and provided by
    National Aeronautics and Space Administration
    Description

    The world-wide aviation system is one of the most complex dynamical systems ever developed and is generating data at an extremely rapid rate. Most modern commercial aircraft record several hundred flight parameters including information from the guidance, navigation, and control systems, the avionics and propulsion systems, and the pilot inputs into the aircraft. These parameters may be continuous measurements or binary or categorical measurements recorded in one second intervals for the duration of the flight. Currently, most approaches to aviation safety are reactive, meaning that they are designed to react to an aviation safety incident or accident. Here, we discuss a novel approach based on the theory of multiple kernel learning to detect potential safety anomalies in very large databases of discrete and continuous data from world-wide operations of commercial fleets. We pose a general anomaly detection problem which includes both discrete and continuous data streams, where we assume that the discrete streams have a causal influence on the continuous streams. We also assume that atypical sequences of events in the discrete streams can lead to off-nominal system performance. We discuss the application domain, novel algorithms, and also briefly discuss results on synthetic and real-world data sets. Our algorithm uncovers operationally significant events in high dimensional data streams in the aviation industry which are not detectable using state of the art methods.

  11. Data from: Quantifying and comparing phylogenetic evolutionary rates for shape and other high-dimensional phenotypic data

    • zenodo.org
    • search.dataone.org
    • +1 more
    bin, csv
    Updated May 28, 2022
    Cite
    Dean C. Adams; Dean C. Adams (2022). Data from: Quantifying and comparing phylogenetic evolutionary rates for shape and other high-dimensional phenotypic data [Dataset]. http://doi.org/10.5061/dryad.41hc4
    Available download formats: csv, bin
    Dataset updated
    May 28, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Dean C. Adams
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Many questions in evolutionary biology require the quantification and comparison of rates of phenotypic evolution. Recently, phylogenetic comparative methods have been developed for comparing evolutionary rates on a phylogeny for single, univariate traits (σ²), and evolutionary rate matrices (R) for sets of traits treated simultaneously. However, high-dimensional traits like shape remain under-examined with this framework, because methods suited for such data have not been fully developed. In this article, I describe a method to quantify phylogenetic evolutionary rates for high-dimensional multivariate data (σ²_mult), found from the equivalency between statistical methods based on covariance matrices and those based on distance matrices (R-mode and Q-mode methods). I then use simulations to evaluate the statistical performance of hypothesis testing procedures that compare σ²_mult for two or more groups of species on a phylogeny. Under both isotropic and non-isotropic conditions, and for differing numbers of trait dimensions, the proposed method displays appropriate Type I error and high statistical power for detecting known differences in σ²_mult among groups. By contrast, the Type I error rate of likelihood tests based on the evolutionary rate matrix (R) increases as the number of trait dimensions (p) increases, and becomes unacceptably large when only a few trait dimensions are considered. Further, likelihood tests based on R cannot be computed when the number of trait dimensions equals or exceeds the number of taxa in the phylogeny (i.e., when p ≥ N). These results demonstrate that tests based on σ²_mult provide a useful means of comparing evolutionary rates for high-dimensional data that are otherwise not analytically accessible to methods based on the evolutionary rate matrix. This advance thus expands the phylogenetic comparative toolkit for high-dimensional phenotypic traits like shape. Finally, I illustrate the utility of the new approach by evaluating rates of head shape evolution in a lineage of Plethodon salamanders.

  12. Data from: First-passage probability estimation of high-dimensional nonlinear stochastic dynamic systems by a fractional moments-based mixture distribution approach

    • data.niaid.nih.gov
    Updated Jul 12, 2024
    Cite
    Valdebenito, Marcos (2024). First-passage probability estimation of high-dimensional nonlinear stochastic dynamic systems by a fractional moments-based mixture distribution approach [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7661087
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Broggi, Matteo
    Dang, Chao
    Faes, Matthias
    Beer, Michael
    Valdebenito, Marcos
    Ding, Chen
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    First-passage probability estimation of high-dimensional nonlinear stochastic dynamic systems is a significant task to be solved in many science and engineering fields, but still remains an open challenge. The present paper develops a novel approach, termed ‘fractional moments-based mixture distribution’, to address this challenge. This approach is implemented by capturing the extreme value distribution (EVD) of the system response with the concepts of fractional moment and mixture distribution. In our context, the fractional moment itself is by definition a high-dimensional integral with a complicated integrand. To efficiently compute the fractional moments, a parallel adaptive sampling scheme that allows for sample size extension is developed using the refined Latinized stratified sampling (RLSS). In this manner, both variance reduction and parallel computing are possible for evaluating the fractional moments. From the knowledge of low-order fractional moments, the EVD of interest is then expected to be reconstructed. Based on introducing an extended inverse Gaussian distribution and a log extended skew-normal distribution, one flexible mixture distribution model is proposed, where its fractional moments are derived in analytic form. By fitting a set of fractional moments, the EVD can be recovered via the proposed mixture model. Accordingly, the first-passage probabilities under different thresholds can be obtained from the recovered EVD straightforwardly. The performance of the proposed method is verified by three examples consisting of two test examples and one engineering problem.
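
    For readers skimming the abstract, the two standard definitions it leans on can be written out as follows. The notation (Z for the extreme response, Θ for the random inputs, b for the threshold) is ours, chosen for illustration, and is not taken from the paper.

```latex
% Extreme value of the response over the analysis window [0, T]
Z \;=\; \max_{t \in [0,T]} \bigl|\, X(t,\boldsymbol{\Theta}) \,\bigr| .
% Fractional moment of order \alpha: by definition a high-dimensional
% integral over the random inputs \boldsymbol{\Theta}
M_{\alpha} \;=\; \mathbb{E}\!\left[ Z^{\alpha} \right]
          \;=\; \int_{\Omega} \bigl[ z(\boldsymbol{\theta}) \bigr]^{\alpha}
                 f_{\boldsymbol{\Theta}}(\boldsymbol{\theta})\,\mathrm{d}\boldsymbol{\theta},
          \qquad \alpha \in \mathbb{R} .
% First-passage probability at threshold b, recovered from the fitted
% extreme value distribution F_Z
P_f(b) \;=\; \Pr\{\, Z > b \,\} \;=\; 1 - F_{Z}(b) .
```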

  13. Data from: High-dimensional variance partitioning reveals the modular genetic basis of adaptive divergence in gene expression during reproductive character displacement

    • datadryad.org
    zip
    Updated May 26, 2011
    Cite
    Elizabeth Ann McGraw; Yixin Henry Ye; Brad Foley; Stephen F Chenoweth; Megan Higgie; Emma Hine; Mark W Blows (2011). High-dimensional variance partitioning reveals the modular genetic basis of adaptive divergence in gene expression during reproductive character displacement [Dataset]. http://doi.org/10.5061/dryad.rn6gg
    Available download formats: zip
    Dataset updated
    May 26, 2011
    Dataset provided by
    Dryad
    Authors
    Elizabeth Ann McGraw; Yixin Henry Ye; Brad Foley; Stephen F Chenoweth; Megan Higgie; Emma Hine; Mark W Blows
    Time period covered
    2011
    Description

    • RIL RTPCR and CHC data: qRT-PCR data for three genes (AmyPres, CL481res, NinaDres) and cuticular hydrocarbons (lc1, lc2, lc3, lc5, lc6, lc7, lc8, lc9) for 430 individuals from 41 RILs and 2 parental lines (EUNGC and FORS4). Canonical variates for genes (V1, V2, V3) and CHCs (W1, W2, W3) are also given.
    • 15 RIL CHC and gene factor means: RIL line means for 12 genetic factors and 9 cuticular hydrocarbons for the line-mean analysis presented in Table 2.

  14. Data from: Permutation tests for phylogenetic comparative analyses of high-dimensional shape data: what you shuffle matters

    • datadryad.org
    • data.niaid.nih.gov
    zip
    Updated Dec 26, 2014
    Cite
    Dean C. Adams; Michael L. Collyer (2014). Permutation tests for phylogenetic comparative analyses of high-dimensional shape data: what you shuffle matters [Dataset]. http://doi.org/10.5061/dryad.2jv17
    Available download formats: zip
    Dataset updated
    Dec 26, 2014
    Dataset provided by
    Dryad
    Authors
    Dean C. Adams; Michael L. Collyer
    Time period covered
    2014
    Description

    • PlethHead-SVL-Data: Head shape and body size data (species means) for 42 species of Plethodon salamanders.
    • Plethodon Phylogeny (plethred.tre): Plethodon phylogeny (from Wiens et al. 2006).
    • Type I Error Simulation Script (PICRand-TypeIError.r): R script for simulating data to evaluate Type I error rates of two multivariate phylogenetic comparative method approaches.
    • Simulation support scripts (procD.pic.r): Support scripts for the Type I error simulations.

  15. Data from: Gaussian approximation and spatially dependent wild bootstrap for high-dimensional spatial data

    • tandf.figshare.com
    txt
    Updated Jul 3, 2023
    Cite
    Daisuke Kurisu; Kengo Kato; Xiaofeng Shao (2023). Gaussian approximation and spatially dependent wild bootstrap for high-dimensional spatial data [Dataset]. http://doi.org/10.6084/m9.figshare.23227432.v1
    Available download formats: txt
    Dataset updated
    Jul 3, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Daisuke Kurisu; Kengo Kato; Xiaofeng Shao
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this paper, we establish a high-dimensional CLT for the sample mean of p-dimensional spatial data observed over irregularly spaced sampling sites in ℝ^d, allowing the dimension p to be much larger than the sample size n. We adopt a stochastic sampling scheme that can generate irregularly spaced sampling sites in a flexible manner and include both pure increasing domain and mixed increasing domain frameworks. To facilitate statistical inference, we develop the spatially dependent wild bootstrap (SDWB) and justify its asymptotic validity in high dimensions by deriving error bounds that hold almost surely conditionally on the stochastic sampling sites. Our dependence conditions on the underlying random field cover a wide class of random fields such as Gaussian random fields and continuous autoregressive moving average random fields. Through numerical simulations and a real data analysis, we demonstrate the usefulness of our bootstrap-based inference in several applications, including joint confidence interval construction for high-dimensional spatial data and change-point detection for spatio-temporal data.
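
    As a rough illustration of the bootstrap-based inference described above, the sketch below runs a plain multiplier (wild) bootstrap for the max statistic of a high-dimensional sample mean and turns the resulting critical value into simultaneous confidence intervals. It uses i.i.d. Gaussian multipliers on synthetic data; the paper's spatially dependent wild bootstrap instead draws multipliers that are correlated across nearby sampling sites.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, B = 200, 1000, 500            # sample size, dimension (p >> n), bootstrap draws

X = rng.standard_normal((n, p))      # placeholder for the spatial observations
X_centered = X - X.mean(axis=0)

# Multiplier bootstrap of the statistic max_j |sqrt(n) * (bootstrap mean)_j|.
stats = np.empty(B)
for b in range(B):
    e = rng.standard_normal(n)       # i.i.d. multipliers; SDWB would use spatially dependent ones
    stats[b] = np.abs(X_centered.T @ e).max() / np.sqrt(n)

crit = np.quantile(stats, 0.95)

# Simultaneous 95% confidence intervals for all p component means.
half_width = crit / np.sqrt(n)
lower = X.mean(axis=0) - half_width
upper = X.mean(axis=0) + half_width
print(round(crit, 3), lower[:3], upper[:3])
```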

  16. Data from: Materials Science Optimization Benchmark Dataset for High-dimensional, Multi-objective, Multi-fidelity Optimization of CrabNet Hyperparameters

    • zenodo.org
    bin, csv, json
    Updated Mar 3, 2023
    Cite
    Sterling G. Baird; Sterling G. Baird; Jeet N. Parikh; Jeet N. Parikh (2023). Materials Science Optimization Benchmark Dataset for High-dimensional, Multi-objective, Multi-fidelity Optimization of CrabNet Hyperparameters [Dataset]. http://doi.org/10.5281/zenodo.7693716
    Available download formats: bin, json, csv
    Dataset updated
    Mar 3, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sterling G. Baird; Jeet N. Parikh
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Benchmarks are an essential driver of progress in scientific disciplines. Ideal benchmarks mimic real-world tasks as closely as possible, where insufficient difficulty or applicability can stunt growth in the field. Benchmarks should also have sufficiently low computational overhead to promote accessibility and repeatability. The goal is then to win a "Turing test" of sorts by creating a surrogate model that is indistinguishable from the ground truth observation (at least within the dataset bounds that were explored), necessitating a large amount of data. In materials science and chemistry, industry-relevant optimization tasks are often hierarchical, noisy, multi-fidelity, multi-objective, high-dimensional, and non-linearly correlated while exhibiting mixed numerical and categorical variables subject to linear and non-linear constraints. To complicate matters, unexpected, failed simulation or experimental regions may be present in the search space. In this study, 173219 quasi-random hyperparameter combinations were generated across 23 hyperparameters and used to train CrabNet on the Matbench experimental band gap dataset. The results were logged to a free-tier shared MongoDB Atlas dataset. This study resulted in a regression dataset mapping hyperparameter combinations (including repeats) to MAE, RMSE, computational runtime, and model size for the CrabNet model trained on the Matbench experimental band gap benchmark task. This dataset is used to create a surrogate model as close as possible to running the actual simulations by incorporating heteroskedastic noise. Failure cases for bad hyperparameter combinations were excluded via careful construction of the hyperparameter search space, and so were not considered as was done in prior work. For the regression dataset, percentile ranks were computed within each of the groups of identical parameter sets to enable capturing heteroskedastic noise. This contrasts with a more traditional approach that imposes a-priori assumptions such as Gaussian noise, e.g., by providing a mean and standard deviation. A similar approach can be applied to other benchmark datasets to bridge the gap between optimization benchmarks with low computational overhead and realistically complex, real-world optimization scenarios.
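
    The per-group percentile-rank step mentioned above can be written in a few lines. The sketch below uses pandas on a toy table whose column names ("group_id", "mae") are placeholders rather than the dataset's actual field names.

```python
import pandas as pd

# Toy stand-in for the regression dataset: repeated hyperparameter sets
# ("group_id") each mapped to a noisy objective value ("mae").
df = pd.DataFrame({
    "group_id": [0, 0, 0, 1, 1, 2, 2, 2, 2],
    "mae":      [0.40, 0.42, 0.39, 0.55, 0.53, 0.61, 0.60, 0.64, 0.62],
})

# Percentile rank of each observation within its group of identical
# parameter sets; this lets a surrogate capture heteroskedastic noise
# without imposing, e.g., a Gaussian error assumption.
df["mae_pct_rank"] = df.groupby("group_id")["mae"].rank(pct=True)
print(df)
```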

  17. Data from: Empirical Dynamic Quantiles for Visualization of High-Dimensional Time Series

    • tandf.figshare.com
    png
    Updated Jun 2, 2023
    Cite
    Daniel Peña; Ruey S. Tsay; Ruben Zamar (2023). Empirical Dynamic Quantiles for Visualization of High-Dimensional Time Series [Dataset]. http://doi.org/10.6084/m9.figshare.7701638.v1
    Available download formats: png
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Daniel Peña; Ruey S. Tsay; Ruben Zamar
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The empirical quantiles of independent data provide a good summary of the underlying distribution of the observations. For high-dimensional time series defined in two dimensions, such as in space and time, one can define empirical quantiles of all observations at a given time point, but such time-wise quantiles can only reflect properties of the data at that time point. They often fail to capture the dynamic dependence of the data. In this article, we propose a new definition of empirical dynamic quantiles (EDQ) for high-dimensional time series that mitigates this limitation by imposing that the quantile must be one of the observed time series. The word dynamic emphasizes the fact that these newly defined quantiles capture the time evolution of the data. We prove that the EDQ converge to the time-wise quantiles under some weak conditions as the dimension increases. A fast algorithm to compute the dynamic quantiles is presented and the resulting quantiles are used to produce summary plots for a collection of many time series. We illustrate with two real datasets that the time-wise and dynamic quantiles convey different and complementary information. We also briefly compare the visualization provided by EDQ with that obtained by functional depth. The R code and a vignette for computing and plotting EDQ are available at https://github.com/dpena157/HDts/.
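
    To illustrate the constraint that an empirical dynamic quantile must be one of the observed series, the sketch below picks, for a given probability level p, the observed series that minimizes the total quantile check loss against all series over all time points. This is an illustrative brute-force selection rule on toy data, not the paper's exact objective or its fast algorithm, which are given in the article and the linked R code.

```python
import numpy as np

def check_loss(u, p):
    """Quantile check (pinball) loss rho_p(u)."""
    return np.where(u >= 0, p * u, (p - 1.0) * u)

def empirical_dynamic_quantile(X, p=0.5):
    """Return the index of the observed series (row of X, shape m x T) that
    minimizes the total check loss against all m series across all T times."""
    m = X.shape[0]
    scores = np.array([check_loss(X - X[j], p).sum() for j in range(m)])
    return scores.argmin()

rng = np.random.default_rng(0)
X = np.cumsum(rng.standard_normal((100, 250)), axis=1)   # 100 toy series of length 250
print("index of the empirical dynamic median:", empirical_dynamic_quantile(X, p=0.5))
```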

  18. In Situ Photoluminescence Dataset for Exploring Material and Processing Variabilities in Blade-Coated Perovskite Photovoltaics

    • zenodo.org
    bin, zip
    Updated Jan 9, 2025
    + more versions
    Cite
    Felix Laufer; Felix Laufer; Markus Götz; Markus Götz; Ulrich Wilhelm Paetzold; Ulrich Wilhelm Paetzold (2025). In Situ Photoluminescence Dataset for Exploring Material and Processing Variabilities in Blade-Coated Perovskite Photovoltaics [Dataset]. http://doi.org/10.5281/zenodo.14609789
    Available download formats: bin, zip
    Dataset updated
    Jan 9, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Felix Laufer; Markus Götz; Ulrich Wilhelm Paetzold
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Content:

    This dataset contains time-resolved in situ data acquired during the formation of blade-coated perovskite thin films, which were subsequently processed into functional perovskite solar cells. The time series data capture the vacuum quenching process - a critical step in perovskite layer formation - using photoluminescence (PL) and diffuse reflection imaging. The dataset is intended to support deep learning applications for predicting both material-level properties (e.g., precursor composition) and device-level performance metrics (e.g., power conversion efficiency, PCE).

    Unlike the previous dataset, this dataset includes perovskite solar cells fabricated under varied process conditions. Specifically, the quenching duration, precursor solution molarity, and molar ratio were systematically changed to enhance the diversity of the data.

    To monitor the vacuum quenching process, a PL imaging setup captured four channels of time series image data (2D+t), including one diffuse reflection channel and three PL spectrum channels filtered for different wavelengths. All images were cropped into 65x56 pixel patches, isolating the active area of individual solar cells. However, currently, the dataset provides only the time transients of these four channels, where the spatial mean intensity was calculated for each time step. This dimensionality reduction transforms the high-dimensional video data into compact temporal transients, highlighting the critical dynamics of thin-film formation.

    The dataset consists of two parts:

    1. Samples finalized into functional solar cells:

      • Includes photovoltaic (PV) performance metrics as target variables:
        (1) power conversion efficiency (PCE),
        (2) open-circuit voltage (VOC),
        (3) short-circuit current density (JSC),
        (4) fill factor (FF), measured in forward and backward sweeps.
    2. Samples not finalized into functional solar cells:

      • Does not include PV metrics. These samples are suitable for classification tasks, such as predicting precursor solution molarity and molar ratio.

    Further information on the experimental procedure and data processing is detailed in the corresponding paper: Deep learning for augmented process monitoring of scalable perovskite thin-film fabrication. Please cite this paper when using the dataset.

    Columns in the data.h5 file:

    • Identifiers: date, expID, patchID (sample identifiers).
    • Input features: ND, LP725, LP780, SP775 (signal transients from vacuum quenching).
    • Material properties: ratio, molarity (precursor solution properties).
    • Process parameters: evac_duration (vacuum quenching duration).
    • Photovoltaic performance metrics:
      • PCE_forward, PCE_backward, VOC_forward, VOC_backward, JSC_forward, JSC_backward, FF_forward, FF_backward.
    • Photoluminescence measurements: plqyWL, lumFluxDens (PL spectra after vacuum quenching).
    • Electrical characteristics:
      • RSHUNT_forward, RSHUNT_backward, RS_forward, RS_backward (shunt and series resistances from jV curves).
    • Derived parameters: PLQY, iVOC, jscPLQY, egPLQY (calculated from PLQY measurements).

    Usage:

    The dataset is structured for machine learning applications to improve understanding of the complex perovskite thin-film formation from solution. The corresponding paper tackles these challenges:

    1. Material classification: Using ND, LP725, LP780, and SP775 as inputs to predict ratio and molarity.
    2. Device performance regression: Using ND, LP725, LP780, and SP775 with a variable process parameter (evac_duration) as inputs to predict PCE_backward.
    3. Process control recommendations: Forecasting monitoring signals (ND, LP725, LP780, SP775) as a function of a variable process parameter (evac_duration) and predicting the corresponding device performance metric PCE_backward.
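
    As a minimal sketch of task 2 above, the snippet below reads data.h5 with pandas, reduces each monitoring transient to a couple of summary statistics, and fits a simple regressor for PCE_backward. The HDF5 key name ("df"), the assumption that each transient column stores one array per sample, and the summary-statistic features are all assumptions made for illustration; the paper's deep-learning models operate on the full time series, and the repository notebooks define the actual train-test splits.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Assumptions: data.h5 exposes a flat pandas table under the key "df", and
# each transient column (ND, LP725, LP780, SP775) holds one array per sample.
df = pd.read_hdf("data.h5", key="df")
df = df.dropna(subset=["PCE_backward"])          # keep finalized solar cells only

# For illustration, reduce each transient to simple summary statistics;
# the paper's models instead learn from the full time series.
feats = {}
for ch in ["ND", "LP725", "LP780", "SP775"]:
    series = df[ch].apply(np.asarray)
    feats[f"{ch}_mean"] = series.apply(np.mean)
    feats[f"{ch}_max"] = series.apply(np.max)
X = pd.DataFrame(feats, index=df.index)
X["evac_duration"] = df["evac_duration"]         # variable process parameter
y = df["PCE_backward"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_train, y_train)
print("held-out R^2:", round(model.score(X_test, y_test), 3))
```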

    Scripts for generating the same train-test splits and cross-validation folds as in the corresponding paper are provided in the GitHub repository:

    • 00a_generate_Material_train_test_folds.ipynb
    • 00b_generate_PCE_train_test_folds.ipynb

    Additionally, random forest models used for forecasting are included in forecasting_models.zip.

  19. Data from: Multiple Kernel Learning for Heterogeneous Anomaly Detection: Algorithm and Aviation Safety Case Study

    • catalog.data.gov
    • datasets.ai
    • +3 more
    Updated Apr 10, 2025
    + more versions
    Cite
    Dashlink (2025). Multiple Kernel Learning for Heterogeneous Anomaly Detection: Algorithm and Aviation Safety Case Study [Dataset]. https://catalog.data.gov/dataset/multiple-kernel-learning-for-heterogeneous-anomaly-detection-algorithm-and-aviation-safety
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Dashlink
    Description

    The world-wide aviation system is one of the most complex dynamical systems ever developed and is generating data at an extremely rapid rate. Most modern commercial aircraft record several hundred flight parameters including information from the guidance, navigation, and control systems, the avionics and propulsion systems, and the pilot inputs into the aircraft. These parameters may be continuous measurements or binary or categorical measurements recorded in one second intervals for the duration of the flight. Currently, most approaches to aviation safety are reactive, meaning that they are designed to react to an aviation safety incident or accident. In this paper, we discuss a novel approach based on the theory of multiple kernel learning to detect potential safety anomalies in very large data bases of discrete and continuous data from world-wide operations of commercial fleets. We pose a general anomaly detection problem which includes both discrete and continuous data streams, where we assume that the discrete streams have a causal influence on the continuous streams. We also assume that atypical sequences of events in the discrete streams can lead to off-nominal system performance. We discuss the application domain, novel algorithms, and also discuss results on real-world data sets. Our algorithm uncovers operationally significant events in high dimensional data streams in the aviation industry which are not detectable using state of the art methods.

  20. Data from: ApHIN - Autoencoder-based port-Hamiltonian Identification Networks (Software Package)

    • darus.uni-stuttgart.de
    Updated Aug 27, 2024
    Cite
    Jonas Kneifl; Johannes Rettberg; Julius Herb (2024). ApHIN - Autoencoder-based port-Hamiltonian Identification Networks (Software Package) [Dataset]. http://doi.org/10.18419/DARUS-4446
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 27, 2024
    Dataset provided by
    DaRUS
    Authors
    Jonas Kneifl; Johannes Rettberg; Julius Herb
    License

    MIT License (https://spdx.org/licenses/MIT.html)

    Dataset funded by
    DFG
    Ministry of Science, Research and the Arts Baden-Württemberg
    Description

    Software package for data-driven identification of latent port-Hamiltonian systems.

    Abstract: Conventional physics-based modeling techniques involve high effort, e.g. time and expert knowledge, while data-driven methods often lack interpretability, structure, and sometimes reliability. To mitigate this, we present a data-driven system identification framework that derives models in the port-Hamiltonian (pH) formulation. This formulation is suitable for multi-physical systems while guaranteeing the useful system-theoretic properties of passivity and stability. Our framework combines linear and nonlinear reduction with structured, physics-motivated system identification. In this process, high-dimensional state data obtained from possibly nonlinear systems serves as the input for an autoencoder, which then performs two tasks: (i) nonlinearly transforming and (ii) reducing this data onto a low-dimensional manifold. In the resulting latent space, a pH system is identified by considering the unknown matrix entries as weights of a neural network. The matrices strongly satisfy the pH matrix properties through Cholesky factorizations. In a joint optimization process over the loss term, the pH matrices are adjusted to match the dynamics observed by the data, while defining a linear pH system in the latent space per construction. The learned, low-dimensional pH system can describe even nonlinear systems and is rapidly computable due to its small size. The method is exemplified by a parametric mass-spring-damper and a nonlinear pendulum example, as well as the high-dimensional model of a disc brake with linear thermoelastic behavior.

    Features: This package implements neural networks that identify linear port-Hamiltonian systems from (potentially high-dimensional) data [1].
    • Autoencoders (AEs) for dimensionality reduction
    • pH layer to identify system matrices that fulfill the definition of a linear pH system
    • pHIN: identify a (parametric) low-dimensional port-Hamiltonian system directly
    • ApHIN: identify a (parametric) low-dimensional latent port-Hamiltonian system based on coordinate representations found using an autoencoder
    • Examples for the identification of linear pH systems from data: one-dimensional mass-spring-damper chain, pendulum, disc brake model

    See the documentation for more details.
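
    The Cholesky construction mentioned above can be illustrated in a few lines of numpy: unconstrained parameters are mapped to a skew-symmetric J and to positive semidefinite R and Q via lower-triangular factors, so the latent system dz/dt = (J - R) Q z is a linear pH system by construction. This is a standalone sketch of the idea, not code from the ApHIN package, where such parameters are trained as neural-network weights.

```python
import numpy as np

rng = np.random.default_rng(0)
r = 4                                   # latent state dimension

# Unconstrained parameters (in ApHIN these would be trainable weights).
theta_J = rng.standard_normal((r, r))
theta_R = rng.standard_normal((r, r))
theta_Q = rng.standard_normal((r, r))

# Structure-preserving construction of the pH matrices.
J = theta_J - theta_J.T                 # skew-symmetric by construction
L_R = np.tril(theta_R)
R = L_R @ L_R.T                         # symmetric positive semidefinite (Cholesky-style factor)
L_Q = np.tril(theta_Q)
Q = L_Q @ L_Q.T                         # symmetric positive semidefinite Hamiltonian matrix

# Latent linear pH dynamics (autonomous case): dz/dt = (J - R) Q z.
A = (J - R) @ Q

# Sanity checks for the claimed structural properties.
assert np.allclose(J, -J.T)
assert np.all(np.linalg.eigvalsh(R) >= -1e-10)
assert np.all(np.linalg.eigvalsh(Q) >= -1e-10)
print(A)
```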
