Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Because of the “curse of dimensionality,” high-dimensional processes present challenges to traditional multivariate statistical process monitoring (SPM) techniques. Moreover, the unknown underlying distribution of the variables and the complicated dependencies among them, such as heteroscedasticity, increase the uncertainty of estimated parameters and decrease the effectiveness of control charts. The requirement of sufficient reference samples further limits the application of traditional charts in high-dimension, low-sample-size scenarios (small n, large p). More difficulties appear when detecting and diagnosing abnormal behaviors caused by a small set of variables (i.e., sparse changes). In this article, we propose two change-point–based control charts to detect sparse shifts in the mean vector of high-dimensional heteroscedastic processes. Our proposed methods can start monitoring when the number of observations is much smaller than the dimensionality. The simulation results show that the proposed methods are robust to nonnormality and heteroscedasticity. Two real data examples illustrate the effectiveness of the proposed control charts in high-dimensional applications. The R codes are provided online.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The analysis of phenotypic change is important for several evolutionary biology disciplines, including phenotypic plasticity, evolutionary developmental biology, morphological evolution, physiological evolution, evolutionary ecology and behavioral evolution. It is common for researchers in these disciplines to work with multivariate phenotypic data. When phenotypic variables exceed the number of research subjects—data called 'high-dimensional data'—researchers are confronted with analytical challenges. Parametric tests that require high observation-to-variable ratios present a paradox for researchers, as eliminating variables potentially reduces effect sizes for comparative analyses, yet test statistics require more observations than variables. This problem is exacerbated with data that describe 'multidimensional' phenotypes, whereby a description of phenotype requires high-dimensional data. For example, landmark-based geometric morphometric data use the Cartesian coordinates of (potentially) many anatomical landmarks to describe organismal shape. Collectively, such shape variables describe organismal shape, although analyzing each variable independently offers little benefit for addressing biological questions. Here we present a nonparametric method of evaluating effect size that is not constrained by the number of phenotypic variables, and motivate its use with example analyses of phenotypic change using geometric morphometric data. Our examples contrast different characterizations of body shape for a desert fish species, associated with measuring and comparing sexual dimorphism between two populations. We demonstrate that using more phenotypic variables can increase effect sizes and allow for stronger inferences.
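The standardized-effect-size idea lends itself to a short illustration. The sketch below is the generic permutation logic, not the authors' implementation; it assumes two groups stored as specimen-by-variable arrays and uses the distance between group means as the test statistic.

```python
import numpy as np

def permutation_effect_size(shape_a, shape_b, n_perm=999, seed=0):
    """Illustrative sketch: standardize the observed between-group distance in a
    multivariate (e.g., landmark-based shape) space against its permutation
    distribution, giving a Z-score-style effect size and a permutation p-value."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([shape_a, shape_b])   # rows = specimens, columns = shape variables
    n_a = len(shape_a)

    def between_group_distance(data):
        return np.linalg.norm(data[:n_a].mean(axis=0) - data[n_a:].mean(axis=0))

    observed = between_group_distance(pooled)
    null = np.array([between_group_distance(pooled[rng.permutation(len(pooled))])
                     for _ in range(n_perm)])
    z = (observed - null.mean()) / null.std(ddof=1)
    p_value = (np.sum(null >= observed) + 1) / (n_perm + 1)
    return z, p_value
```

Because the statistic is a distance between mean vectors, the procedure is unaffected by the number of shape variables, which is the property the abstract emphasizes.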
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the set of data shown in the paper "Relevant, hidden, and frustrated information in high-dimensional analyses of complex dynamical systems with internal noise", published on arXiv (DOI: 10.48550/arXiv.2412.09412).
The scripts contained herein are used as follows:
To reproduce the data of this work, start from SOAP-Component-Analysis.py to calculate the SOAP descriptor and select the components of interest, then calculate the PCA with PCA-Analysis.py and apply the clustering that suits your needs (OnionClustering-1d.py, OnionClustering-2d.py, Hierarchical-Clustering.py). Further modifications of the Onion plot can be made with the script OnionClustering-plot.py. UMAP embeddings can be calculated with UMAP.py.
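A minimal driver for that workflow might look like the sketch below. The script names come from this repository, but any command-line arguments (and whether the scripts take any) are assumptions; check each script before running it.

```python
import subprocess

# Hypothetical reproduction sketch: run the analysis scripts in the order
# described above. Adjust the list (e.g., OnionClustering-2d.py or
# Hierarchical-Clustering.py) and add arguments as the scripts require.
steps = [
    "SOAP-Component-Analysis.py",   # compute SOAP descriptors, select components
    "PCA-Analysis.py",              # PCA on the selected components
    "OnionClustering-1d.py",        # clustering (choose the variant you need)
    "OnionClustering-plot.py",      # optional: refine the Onion plots
    "UMAP.py",                      # optional: UMAP embedding
]

for script in steps:
    subprocess.run(["python", script], check=True)
```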
Additional data contained herein are:
The data related to the Quincke rollers can be found here: https://zenodo.org/records/10638736
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
R codes for implementing the described analyses (sample processing, data pre-processing, class comparison and class prediction). Caliper matching was implemented using the nonrandom package; the t- and AD tests were implemented using the stats package and the adk package, respectively. Note that the updated package for implementing the AD test is kSamples. For the bootstrap selection and the egg-shaped plot, we modified the doBS and the importance igraph functions, respectively, both included in the bootfs package. For the SVM model we used the e1071 package. (R 12 kb)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
First-passage probability estimation of high-dimensional nonlinear stochastic dynamic systems is a significant task in many science and engineering fields, but it still remains an open challenge. The present paper develops a novel approach, termed ‘fractional moments-based mixture distribution’, to address this challenge. This approach is implemented by capturing the extreme value distribution (EVD) of the system response with the concepts of fractional moment and mixture distribution. In our context, the fractional moment itself is by definition a high-dimensional integral with a complicated integrand. To efficiently compute the fractional moments, a parallel adaptive sampling scheme that allows for sample-size extension is developed using refined Latinized stratified sampling (RLSS). In this manner, both variance reduction and parallel computing are possible when evaluating the fractional moments. From the knowledge of low-order fractional moments, the EVD of interest can then be reconstructed. By introducing an extended inverse Gaussian distribution and a log extended skew-normal distribution, a flexible mixture distribution model is proposed, and its fractional moments are derived in analytic form. By fitting a set of fractional moments, the EVD can be recovered via the proposed mixture model. Accordingly, the first-passage probabilities under different thresholds can be obtained straightforwardly from the recovered EVD. The performance of the proposed method is verified by three examples consisting of two test examples and one engineering problem.
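For context, a generic form of the fractional moment of the extreme value $Z$ of the response (our notation, not necessarily the paper's) makes the high-dimensional integral explicit:

$$ M_\alpha \;=\; \mathbb{E}\!\left[Z^{\alpha}\right] \;=\; \int_{0}^{\infty} z^{\alpha} f_{Z}(z)\,\mathrm{d}z \;=\; \int_{\mathbb{R}^{n}} \big[z(\boldsymbol{\theta})\big]^{\alpha}\, p_{\boldsymbol{\Theta}}(\boldsymbol{\theta})\,\mathrm{d}\boldsymbol{\theta}, $$

where $z(\boldsymbol{\theta}) = \max_{t\in[0,T]} \lvert h(\mathbf{X}(t;\boldsymbol{\theta}))\rvert$ is the extreme value of the response quantity of interest over the $n$-dimensional random input $\boldsymbol{\Theta}$ and $\alpha$ may be non-integer. This is the integral the RLSS-based parallel adaptive sampling scheme is designed to estimate efficiently.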
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
A common descriptive statistic in cluster analysis is the $R^2$ that measures the overall proportion of variance explained by the cluster means. This note highlights properties of the $R^2$ for clustering. In particular, we show that the $R^2$ can generally be artificially inflated by linearly transforming the data by "stretching" and by projecting. Also, the $R^2$ for clustering will often be a poor measure of clustering quality in high-dimensional settings. We also investigate the $R^2$ for clustering for misspecified models. Several simulation illustrations are provided highlighting weaknesses in the clustering $R^2$, especially in high-dimensional settings. A functional data example is given showing how the $R^2$ for clustering can vary dramatically depending on how the curves are estimated.
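For reference, the standard between-/total-sum-of-squares form of this statistic (the note's exact formulation may differ) is

$$ R^2 \;=\; \frac{\sum_{k=1}^{K} n_k\,\lVert \bar{\mathbf{x}}_k - \bar{\mathbf{x}} \rVert^2}{\sum_{i=1}^{n} \lVert \mathbf{x}_i - \bar{\mathbf{x}} \rVert^2} \;=\; 1 \;-\; \frac{\sum_{k=1}^{K}\sum_{i \in C_k} \lVert \mathbf{x}_i - \bar{\mathbf{x}}_k \rVert^2}{\sum_{i=1}^{n} \lVert \mathbf{x}_i - \bar{\mathbf{x}} \rVert^2}, $$

where $\bar{\mathbf{x}}_k$ is the mean of cluster $C_k$ (of size $n_k$) and $\bar{\mathbf{x}}$ is the overall mean. Stretching or projecting the data changes both sums, which is why the ratio can be inflated without any improvement in the clustering itself.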
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains all the information necessary for reproducing the tables in the paper "A machine learning approach to portfolio pricing and risk management for high dimensional problems".
The raw benchmark data can be found in the Zenodo dataset "Benchmark and training data for replicating financial and insurance examples" (https://zenodo.org/record/3837381). To the extent necessary, only summary data from that dataset is used in this dataset.
The dataset includes a Jupyter notebook file that explains what the different files contain and provides sample code to analyze the information and reproduce the tables.
https://www.immport.org/agreement
The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic has emphasized the importance and challenges of correctly interpreting antibody test results. Identification of positive and negative samples requires a classification strategy with low error rates, which is hard to achieve when the corresponding measurement values overlap. Additional uncertainty arises when classification schemes fail to account for complicated structure in the data. We address these problems through a mathematical framework that combines high-dimensional data modeling and optimal decision theory. Specifically, we show that appropriately increasing the dimension of data better separates positive and negative populations and reveals nuanced structure that can be described in terms of mathematical models. We combine these models with optimal decision theory to yield a classification scheme that better separates positive and negative samples relative to traditional methods such as confidence intervals (CIs) and receiver operating characteristics. We validate the usefulness of this approach in the context of a multiplex salivary SARS-CoV-2 immunoglobulin G assay dataset. This example illustrates how our analysis: (i) improves the assay accuracy (e.g., lowers classification errors by up to 42% compared to CI methods); (ii) reduces the number of indeterminate samples when an inconclusive class is permissible (e.g., by 40% compared to the original analysis of the example multiplex dataset); and (iii) decreases the number of antigens needed to classify samples. Our work showcases the power of mathematical modeling in diagnostic classification and highlights a method that can be adopted broadly in public health and clinical settings.
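To make the optimal-decision-theory component concrete, here is a minimal, hedged sketch of a cost-weighted Bayes decision rule. The densities, prevalence, and costs are illustrative assumptions; the paper's multidimensional antibody models and its indeterminate class are not reproduced here.

```python
from scipy.stats import norm

def classify(x, pdf_pos, pdf_neg, prevalence=0.5, cost_fp=1.0, cost_fn=1.0):
    """Generic illustration: given class-conditional densities for positive and
    negative samples and an assumed prevalence, call a measurement positive when
    the expected cost of a false negative exceeds that of a false positive."""
    weight_pos = cost_fn * prevalence * pdf_pos(x)
    weight_neg = cost_fp * (1.0 - prevalence) * pdf_neg(x)
    return "positive" if weight_pos > weight_neg else "negative"

# toy usage with one-dimensional Gaussian models (purely illustrative)
pdf_pos = lambda x: norm.pdf(x, loc=3.0, scale=1.0)
pdf_neg = lambda x: norm.pdf(x, loc=0.0, scale=1.0)
print(classify(2.1, pdf_pos, pdf_neg))
```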
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please also see the latest version of the repository:
The explosion in the volume of biological imaging data challenges the available technologies for data interrogation and its intersection with related published bioinformatics data sets. Moreover, intersection of highly rich and complex datasets from different sources provided as flat csv files requires advanced informatics skills, which is time-consuming and not accessible to all. Here, we provide a “user manual” to our new paradigm for systematically filtering and analysing a dataset with more than 1300 microscopy data figures using Multi-Dimensional Viewer (MDV) (link), a solution for interactive multimodal data visualisation and exploration. The primary data we use are derived from our published systematic analysis of 200 YFP traps, which revealed common discordance between mRNA and protein across the nervous system (eprint link). This manual provides the raw image data together with the expert annotations of the mRNA and protein distribution as well as associated bioinformatics data. We provide an explanation, with specific examples, of how to use MDV to make the multiple data types interoperable and explore them together. We also provide the open-source python code (github link) used to annotate the figures, which could be adapted to any other kind of data annotation task.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Mean squared error (×10⁻²) for data as in Table 2.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
It has become increasingly common to collect high-dimensional binary response data; for example, with the emergence of new sampling techniques in ecology. In smaller dimensions, multivariate probit (MVP) models are routinely used for inferences. However, algorithms for fitting such models face issues in scaling up to high dimensions due to the intractability of the likelihood, which involves an integral over a multivariate normal distribution having no analytic form. Although a variety of algorithms have been proposed to approximate this intractable integral, these approaches are difficult to implement and/or inaccurate in high dimensions. Our main focus is on accommodating high-dimensional binary response data with a small-to-moderate number of covariates. We propose a two-stage approach for inference on model parameters while taking care of uncertainty propagation between the stages. We use the special structure of latent Gaussian models to reduce the highly expensive computation involved in joint parameter estimation by focusing inference on marginal distributions of model parameters. This essentially makes the method embarrassingly parallel for both stages. We illustrate performance in simulations and applications to joint species distribution modeling in ecology. Supplementary materials for this article are available online.
Understanding what shapes species phenotypes over macroevolutionary timescales from comparative data often requires studying the relationship between phenotypes and putative explanatory factors or testing for differences in phenotypes across species groups. In phyllostomid bats, for example, is mandible morphology associated with diet preferences? Performing such analyses depends upon reliable phylogenetic regression techniques and associated tests (e.g., phylogenetic generalized least squares, pGLS, and phylogenetic analyses of variance and covariance, pANOVA, pANCOVA). While these tools are well established for univariate data, their multivariate counterparts are lagging behind. This is particularly true for high-dimensional phenotypic data, such as morphometric data. Here we implement much-needed likelihood-based multivariate pGLS, pMANOVA and pMANCOVA, and use a recently developed penalized likelihood framework to extend their application to the difficult case when the number of traits p...
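For reference, the multivariate pGLS point estimate takes the standard generalized-least-squares form (our notation; the penalized-likelihood machinery described above adds more than this):

$$ \hat{\mathbf{B}} \;=\; \left(\mathbf{X}^{\top}\mathbf{C}^{-1}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\mathbf{C}^{-1}\mathbf{Y}, $$

where $\mathbf{Y}$ is the $n \times p$ matrix of traits for $n$ species, $\mathbf{X}$ is the design matrix of explanatory factors, and $\mathbf{C}$ is the $n \times n$ phylogenetic covariance matrix implied by the tree and the assumed evolutionary model.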
https://choosealicense.com/licenses/cdla-permissive-2.0/
🕊️ DOVE: A Large-Scale Multi-Dimensional Predictions Dataset Towards Meaningful LLM Evaluation
🌐 Project Website | 📄 Read our paper
Updates 📅
2025-06-11: Added Llama 70B evaluations with ~5,700 MMLU examples across 100 different prompt variations (= 570K new predictions!), based on data from ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments
2025-04-12: Added MMLU predictions from dozens of models including OpenAI, Qwen, Mistral, Gemini…
See the full description on the dataset page: https://huggingface.co/datasets/nlphuji/DOVE.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We propose a multiscale weighted principal component regression (MWPCR) framework for the use of high-dimensional features with strong spatial structure (e.g., smoothness and correlation) to predict an outcome variable, such as disease status. This development is motivated by identifying imaging biomarkers that could potentially aid detection, diagnosis, assessment of prognosis, prediction of response to treatment, and monitoring of disease status, among many others. The MWPCR can be regarded as a novel integration of principal components analysis (PCA), kernel methods, and regression models. In MWPCR, we introduce various weight matrices to prewhiten high-dimensional feature vectors, perform matrix decomposition for both dimension reduction and feature extraction, and build a prediction model by using the extracted features. Examples of such weight matrices include an importance score weight matrix for the selection of individual features at each location and a spatial weight matrix for the incorporation of the spatial pattern of feature vectors. We integrate the importance score weights with the spatial weights to recover the low-dimensional structure of high-dimensional features. We demonstrate the utility of our methods through extensive simulations and real data analyses of the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset. Supplementary materials for this article are available online.
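As a rough, generic illustration of the weight-then-decompose-then-regress pipeline described above (not the authors' MWPCR implementation; the weight matrices are taken as given inputs and the multiscale aspects are omitted):

```python
import numpy as np

def weighted_pcr_sketch(X, y, w_importance, W_spatial, n_components=10):
    """Generic sketch: apply importance and spatial weights to the features,
    extract principal components, and regress the outcome on the scores.
    X: n x p features, y: length-n outcome, w_importance: length-p weights,
    W_spatial: p x p spatial weight matrix (all hypothetical inputs)."""
    Xw = (X * w_importance) @ W_spatial           # weighted, spatially smoothed features
    Xw = Xw - Xw.mean(axis=0)                     # center before decomposition
    _, _, Vt = np.linalg.svd(Xw, full_matrices=False)
    scores = Xw @ Vt[:n_components].T             # low-dimensional extracted features
    Z = np.column_stack([np.ones(len(y)), scores])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)  # simple linear prediction model
    return beta, Vt[:n_components]
```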
Full title: Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule
Abstract: Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.
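The abstract describes the algorithm precisely enough to sketch. Below is a hedged, simplified Python version; the score definition (distance to the k-th nearest neighbour), parameter names, and defaults are assumptions, not the authors' exact code.

```python
import numpy as np

def top_n_outliers(X, k=5, n=10, seed=0):
    """Randomized nested-loop outlier search with the simple pruning rule:
    keep the n highest scores seen so far, and prune a candidate as soon as
    its k-th nearest-neighbour distance drops below the current cutoff."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))      # random order drives the near-linear behaviour
    X = X[order]
    cutoff = 0.0                         # score of the weakest current top-n outlier
    top = []                             # list of (score, original_index)

    for i, x in enumerate(X):
        knn = np.full(k, np.inf)         # k nearest-neighbour distances found so far
        pruned = False
        for j, y in enumerate(X):
            if i == j:
                continue
            d = np.linalg.norm(x - y)
            if d < knn[-1]:
                knn[-1] = d
                knn.sort()
                # pruning rule: this point can no longer be a top-n outlier
                if knn[-1] < cutoff:
                    pruned = True
                    break
        if not pruned:
            top.append((knn[-1], order[i]))
            top.sort(reverse=True)
            top = top[:n]
            if len(top) == n:
                cutoff = top[-1][0]
    return top
```

The worst case is still quadratic, but once the cutoff is nonzero most non-outliers are pruned after a handful of distance computations, which is the source of the near-linear average-case behaviour the abstract reports.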
https://spdx.org/licenses/MIT.html
Software package for data-driven identification of latent port-Hamiltonian systems.
Abstract
Conventional physics-based modeling techniques involve high effort, e.g. time and expert knowledge, while data-driven methods often lack interpretability, structure, and sometimes reliability. To mitigate this, we present a data-driven system identification framework that derives models in the port-Hamiltonian (pH) formulation. This formulation is suitable for multi-physical systems while guaranteeing the useful system-theoretic properties of passivity and stability. Our framework combines linear and nonlinear reduction with structured, physics-motivated system identification. In this process, high-dimensional state data obtained from possibly nonlinear systems serves as the input for an autoencoder, which then performs two tasks: (i) nonlinearly transforming and (ii) reducing this data onto a low-dimensional manifold. In the resulting latent space, a pH system is identified by considering the unknown matrix entries as weights of a neural network. The matrices strictly satisfy the pH matrix properties through Cholesky factorizations. In a joint optimization process over the loss term, the pH matrices are adjusted to match the dynamics observed in the data, while defining a linear pH system in the latent space by construction. The learned, low-dimensional pH system can describe even nonlinear systems and is rapidly computable due to its small size. The method is exemplified by a parametric mass-spring-damper and a nonlinear pendulum example, as well as the high-dimensional model of a disc brake with linear thermoelastic behavior.
Features
This package implements neural networks that identify linear port-Hamiltonian systems from (potentially high-dimensional) data [1].
Autoencoders (AEs) for dimensionality reduction
pH layer to identify system matrices that fulfill the definition of a linear pH system
pHIN: identify a (parametric) low-dimensional port-Hamiltonian system directly
ApHIN: identify a (parametric) low-dimensional latent port-Hamiltonian system based on coordinate representations found using an autoencoder
Examples for the identification of linear pH systems from data
One-dimensional mass-spring-damper chain
Pendulum
Disc brake model
See documentation for more details.
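One concrete piece of the description above is how Cholesky-type factorizations can enforce the pH matrix properties by construction. The sketch below is a hypothetical NumPy illustration, not the package's actual layer API; the weight names, shapes, and regularization constant are assumptions.

```python
import numpy as np

def ph_matrices_from_weights(w_J, w_L, w_Q, n):
    """Map unconstrained weight vectors (each of length n*n) to matrices that
    satisfy linear port-Hamiltonian structure by construction:
      J = A - A^T          -> skew-symmetric interconnection matrix
      R = L L^T            -> symmetric positive semidefinite dissipation matrix
      Q = M M^T + eps * I  -> symmetric positive definite energy matrix"""
    A = w_J.reshape(n, n)
    J = A - A.T
    L = np.tril(w_L.reshape(n, n))
    R = L @ L.T
    M = np.tril(w_Q.reshape(n, n))
    Q = M @ M.T + 1e-6 * np.eye(n)
    return J, R, Q

# latent linear pH dynamics: dz/dt = (J - R) Q z  (+ input terms, omitted here)
```

Because passivity and stability follow from these structural properties, any weight values produce a valid pH system, and the optimization only has to fit the observed dynamics.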
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Bayes’ linear analysis and approximate Bayesian computation (ABC) are techniques commonly used in the Bayesian analysis of complex models. In this article, we connect these ideas by demonstrating that regression-adjustment ABC algorithms produce samples for which first- and second-order moment summaries approximate adjusted expectation and variance for a Bayes’ linear analysis. This gives regression-adjustment methods a useful interpretation and role in exploratory analysis in high-dimensional problems. As a result, we propose a new method for combining high-dimensional, regression-adjustment ABC with lower-dimensional approaches (such as using Markov chain Monte Carlo for ABC). This method first obtains a rough estimate of the joint posterior via regression-adjustment ABC, and then estimates each univariate marginal posterior distribution separately in a lower-dimensional analysis. The marginal distributions of the initial estimate are then modified to equal the separately estimated marginals, thereby providing an improved estimate of the joint posterior. We illustrate this method with several examples. Supplementary materials for this article are available online.
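As background for the regression-adjustment step discussed above, here is a minimal sketch of Beaumont-style linear regression adjustment applied to accepted ABC simulations. Kernel weighting, local regression refinements, and the paper's Bayes' linear connection and marginal-modification step are omitted; the names are illustrative.

```python
import numpy as np

def regression_adjust(theta, summaries, s_obs):
    """Linear regression-adjustment ABC sketch: fit theta ~ summaries by least
    squares on the accepted simulations, then shift each accepted draw towards
    the value predicted at the observed summary s_obs:
        theta_adj_i = theta_i - beta^T (s_i - s_obs)."""
    S = np.column_stack([np.ones(len(summaries)), summaries])   # design matrix with intercept
    beta, *_ = np.linalg.lstsq(S, theta, rcond=None)            # (d_s + 1) x d_theta coefficients
    s0 = np.concatenate([[1.0], np.atleast_1d(s_obs)])
    return theta - (S - s0) @ beta                              # adjusted posterior draws
```

The first- and second-order moments of these adjusted draws are what the article relates to the adjusted expectation and variance of a Bayes' linear analysis.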
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The purpose of data mining analysis is always to find patterns in the data using certain kinds of techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset. Before doing any work on the data, the data has to be pre-processed, and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. Based on our project, after using clustering prior to classification, the performance has not improved much. The reason why it has not improved could be that the features we selected to perform clustering on are not well suited for it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics.

From the dimensionality reduction perspective: it is different from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters as a technique for reducing the data dimension will lose a lot of information, since clustering techniques are based on a metric of 'distance'. At high dimensions, Euclidean distance loses pretty much all meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always good, since you may lose almost all the information.

From the creating new features perspective: clustering analysis creates labels based on the patterns of the data, which brings uncertainty into the data. When using clustering prior to classification, the decision on the number of clusters will highly affect the performance of the clustering, and in turn the performance of classification. If the subset of features we apply clustering techniques to is well suited for it, it might increase the overall performance on classification. For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better.

We did not lock in the clustering outputs using a random_state, in an effort to see whether they were stable; a stability check along these lines is sketched below. Our assumption was that if the results vary highly from run to run, which they definitely did, maybe the data just does not cluster well with the selected methods at all. Basically, the ramification we saw was that our results are not much better than random when applying clustering in the data preprocessing.

Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the model's real-world effectiveness and also to continue to revise the models from time to time as things change.
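The stability check referenced above can be done generically as follows; this is an illustrative sketch, not the project's original code, and the cluster count and number of runs are arbitrary choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def clustering_stability(X, n_clusters=5, n_runs=10):
    """Rerun k-means with different seeds and compare the labelings pairwise;
    a consistently low adjusted Rand index suggests the data does not cluster
    well with this method, matching the run-to-run variability described above."""
    labelings = [
        KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(X)
        for seed in range(n_runs)
    ]
    scores = [
        adjusted_rand_score(labelings[i], labelings[j])
        for i in range(n_runs) for j in range(i + 1, n_runs)
    ]
    return float(np.mean(scores))
```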
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Increases in the scale and complexity of behavioral data pose an increasing challenge for data analysis. A common strategy involves replacing entire behaviors with small numbers of handpicked, domain-specific features, but this approach suffers from several crucial limitations. For example, handpicked features may miss important dimensions of variability, and correlations among them complicate statistical testing. Here, by contrast, we apply the variational autoencoder (VAE), an unsupervised learning method, to learn features directly from data and quantify the vocal behavior of two model species: the laboratory mouse and the zebra finch. The VAE converges on a parsimonious representation that outperforms handpicked features on a variety of common analysis tasks, enables the measurement of moment-by-moment vocal variability on the timescale of tens of milliseconds in the zebra finch, provides strong evidence that mouse ultrasonic vocalizations do not cluster as is commonly believed, and captures the similarity of tutor and pupil birdsong with qualitatively higher fidelity than previous approaches. In all, we demonstrate the utility of modern unsupervised learning approaches to the quantification of complex and high-dimensional vocal behavior.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Machine learning models are trained to find patterns in data. NLP models can inadvertently learn socially undesirable patterns when training on gender-biased text. In this work, we propose a general framework that decomposes gender bias in text along several pragmatic and semantic dimensions: bias from the gender of the person being spoken about, bias from the gender of the person being spoken to, and bias from the gender of the speaker. Using this fine-grained framework, we automatically annotate eight large-scale datasets with gender information. In addition, we collect a novel, crowdsourced evaluation benchmark of utterance-level gender rewrites. Distinguishing between gender bias along multiple dimensions is important, as it enables us to train finer-grained gender bias classifiers. We show our classifiers prove valuable for a variety of important applications, such as controlling for gender bias in generative models, detecting gender bias in arbitrary text, and shedding light on offensive language in terms of genderedness.