100+ datasets found

Additional file 1: of Proposal of supervised data analysis strategy of...
springernature.figshare.com
txt
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Elena Landoni; Rosalba Miceli; Maurizio Callari; Paola Tiberio; Valentina Appierto; Valentina Angeloni; Luigi Mariani; Maria Daidone (2023). Additional file 1: of Proposal of supervised data analysis strategy of plasma miRNAs from hybridisation array data with an application to assess hemolysis-related deregulation [Dataset]. http://doi.org/10.6084/m9.figshare.c.3595874_D5.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.c.3595874_D5.v1
Dataset updated
May 30, 2023
Dataset provided by
Figsharehttp://figshare.com/
figshare
Authors
Elena Landoni; Rosalba Miceli; Maurizio Callari; Paola Tiberio; Valentina Appierto; Valentina Angeloni; Luigi Mariani; Maria Daidone
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
R codes for implementing the described analyses (sample processing, data pre-processing, class comparison and class prediction). Caliper matching was implemented using the nonrandom package; the t- and the AD tests were implemented using the stats package and the adk package, respectively. Notice that the updated package for implementing the AD test is kSamples. As regards the bootstrap selection and the egg-shaped plot, we respectively modified the doBS and the importance igraph functions, both included in the bootfs package. For the SVM model we used the e1071 package. (R 12Â kb)
Data from: A method for analysis of phenotypic change for phenotypes...
zenodo.org
data.niaid.nih.gov
+2more
csv
Updated May 29, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Michael L. Collyer; David J. Sekora; Dean C. Adams; Michael L. Collyer; David J. Sekora; Dean C. Adams (2022). Data from: A method for analysis of phenotypic change for phenotypes described by high-dimensional data [Dataset]. http://doi.org/10.5061/dryad.1p80f
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.1p80f
Dataset updated
May 29, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Michael L. Collyer; David J. Sekora; Dean C. Adams; Michael L. Collyer; David J. Sekora; Dean C. Adams
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
The analysis of phenotypic change is important for several evolutionary biology disciplines, including phenotypic plasticity, evolutionary developmental biology, morphological evolution, physiological evolution, evolutionary ecology and behavioral evolution. It is common for researchers in these disciplines to work with multivariate phenotypic data. When phenotypic variables exceed the number of research subjects—data called 'high-dimensional data'—researchers are confronted with analytical challenges. Parametric tests that require high observation to variable ratios present a paradox for researchers, as eliminating variables potentially reduces effect sizes for comparative analyses, yet test statistics require more observations than variables. This problem is exacerbated with data that describe 'multidimensional' phenotypes, whereby a description of phenotype requires high-dimensional data. For example, landmark-based geometric morphometric data use the Cartesian coordinates of (potentially) many anatomical landmarks to describe organismal shape. Collectively such shape variables describe organism shape, although the analysis of each variable, independently, offers little benefit for addressing biological questions. Here we present a nonparametric method of evaluating effect size that is not constrained by the number of phenotypic variables, and motivate its use with example analyses of phenotypic change using geometric morphometric data. Our examples contrast different characterizations of body shape for a desert fish species, associated with measuring and comparing sexual dimorphism between two populations. We demonstrate that using more phenotypic variables can increase effect sizes, and allow for stronger inferences.
f
Dataset for: Some Remarks on the R2 for Clustering
wiley.figshare.com
txt
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nicola Loperfido; Thaddeus Tarpey (2023). Dataset for: Some Remarks on the R2 for Clustering [Dataset]. http://doi.org/10.6084/m9.figshare.6124508.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.6124508.v1
Dataset updated
Jun 1, 2023
Dataset provided by
Wiley
Authors
Nicola Loperfido; Thaddeus Tarpey
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
A common descriptive statistic in cluster analysis is the $R^2$ that measures the overall proportion of variance explained by the cluster means. This note highlights properties of the $R^2$ for clustering. In particular, we show that generally the $R^2$ can be artificially inflated by linearly transforming the data by ``stretching'' and by projecting. Also, the $R^2$ for clustering will often be a poor measure of clustering quality in high-dimensional settings. We also investigate the $R^2$ for clustering for misspecified models. Several simulation illustrations are provided highlighting weaknesses in the clustering $R^2$, especially in high-dimensional settings. A functional data example is given showing how that $R^2$ for clustering can vary dramatically depending on how the curves are estimated.
Research data supporting: "Relevant, hidden, and frustrated information in...
zenodo.org
zip
Updated May 20, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Chiara Lionello; Chiara Lionello; Matteo Becchi; Matteo Becchi; Simone Martino; Simone Martino; Giovanni M. Pavan; Giovanni M. Pavan (2025). Research data supporting: "Relevant, hidden, and frustrated information in high-dimensional analyses of complex dynamical systems with internal noise" [Dataset]. http://doi.org/10.5281/zenodo.14529457
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.14529457
Dataset updated
May 20, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Chiara Lionello; Chiara Lionello; Matteo Becchi; Matteo Becchi; Simone Martino; Simone Martino; Giovanni M. Pavan; Giovanni M. Pavan
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains the set of data shown in the paper "Relevant, hidden, and frustrated information in high-dimensional analyses of complex dynamical systems with internal noise", published on arXiv (DOI: 10.48550/arXiv.2412.09412).

The scripts contained herein are:

PCA-Analysis.py: python script to calculate the SOAP descriptor, denoising it, and compute the Principal Component Analysis

SOAP-Component-Analysis.py: python script to calculate the variance of the single SOAP components

Hierarchical-Clustering.py: python script to compute the hierarchical clustering and plot the dataset

OnionClustering-1d.py: script to compute the Onion clustering on a single SOAP component or principal component

OnionClustering-2d.py: script to compute bi-dimensional Onion clustering

OnionClustering-plot.py: script to plot the Onion plot, removing clusters with population <1%

UMAP.py: script to compute the UMAP dimensionality reduction technique

To reproduce the data of this work you should start form SOAP-Component-Analysis.py to calculate the SOAP descriptor and select the components that are interesting for you, then you can calculate the PCA with PCA-Analysis.py, and applying the clustering based on your necessities (OnionClustering-1d.py, OnionClustering-2d.py, Hierarchical-Clustering.py). Further modifications of the Onion plot can be done with the script: OnionClustering-plot.py. Umap can be calculated with UMAP.py.

Additional data contained herein are:

starting-configuration.gro: gromacs file with the initial configuration of the ice-water system

traj-ice-water-50ns-sampl4ps.xtc: trajectory of the ice-water system sampled every 4 ps

traj-ice-water-50ns-sampl40ps.xtc: trajectory of the ice-water system sampled every 40 ps

some files containing the SOAP descriptor of the ice-water system: ice-water-50ns-sampl40ps.hdf5, ice-water-50ns-sampl40ps_soap.hdf5, ice-water-50ns-sampl40ps_soap.npy, ice-water-50ns-sampl40ps_soap-spavg.npy

PCA-results: folder that contains some example results of the PCA

UMAP-results: folder that contains some example results of UMAP

The data related to the Quincke rollers can be found here: https://zenodo.org/records/10638736
d
An Advanced Learning Framework for High Dimensional Multi-Sensor Remote...
catalog.data.gov
data.staging.idas-ds1.appdat.jsc.nasa.gov
+1more
Updated Apr 11, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Science Mission Directorate (2025). An Advanced Learning Framework for High Dimensional Multi-Sensor Remote Sensing Data Project [Dataset]. https://catalog.data.gov/dataset/an-advanced-learning-framework-for-high-dimensional-multi-sensor-remote-sensing-data-proje
Explore at:
Dataset updated
Apr 11, 2025
Dataset provided by
Science Mission Directorate
Description
Improve the use of land cover data by developing an advanced framework for robust classification using multi-source datasets:
Develop, validate and optimize a generalized multi-kernel, active learning (MKL-AL) pattern recognition framework for multi-source data fusion.
Develop both single- and ensemble-classifier versions (MKL-AL and Ensemble-MKL-AL) of the system.
Utilize multi-source remotely sensed and in situ data to create land-cover classification and perform accuracy assessment with available labeled data; utilize first results to query new samples that, if inducted into the training of the system, will significantly improve classification performance and accuracy.
f
Data from: Some Multivariate Tests of Independence Based on Ranks of Nearest...
tandf.figshare.com
figshare.com
txt
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Soham Sarkar; Anil K. Ghosh (2023). Some Multivariate Tests of Independence Based on Ranks of Nearest Neighbors [Dataset]. http://doi.org/10.6084/m9.figshare.4531400.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.4531400.v1
Dataset updated
May 31, 2023
Dataset provided by
Taylor & Francis
Authors
Soham Sarkar; Anil K. Ghosh
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Several parametric and nonparametric tests of independence between two random vectors are available in the literature. But, many of them perform poorly for high-dimensional data and are not applicable when the dimension exceeds the sample size. In this article, we propose some tests based on ranks of nearest neighbors, which can be conveniently used in high dimension, low sample size situations. Several simulated and real datasets are analyzed to show the utility of the proposed tests. Codes for implementation of the proposed tests are available as supplementary materials.
Multi-Dimensional Data Viewer (MDV) user manual for data exploration:...
zenodo.org
pdf, zip
Updated Jul 12, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Maria Kiourlappou; Maria Kiourlappou; Martin Sergeant; Martin Sergeant; Joshua S. Titlow; Joshua S. Titlow; Jeffrey Y. Lee; Jeffrey Y. Lee; Darragh Ennis; Stephen Taylor; Stephen Taylor; Ilan Davis; Ilan Davis; Darragh Ennis (2024). Multi-Dimensional Data Viewer (MDV) user manual for data exploration: "Systematic analysis of YFP traps reveals common discordance between mRNA and protein across the nervous system" [Dataset]. http://doi.org/10.5281/zenodo.7875495
Explore at:
zip, pdfAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.7875495
Dataset updated
Jul 12, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Maria Kiourlappou; Maria Kiourlappou; Martin Sergeant; Martin Sergeant; Joshua S. Titlow; Joshua S. Titlow; Jeffrey Y. Lee; Jeffrey Y. Lee; Darragh Ennis; Stephen Taylor; Stephen Taylor; Ilan Davis; Ilan Davis; Darragh Ennis
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description

Please also see the latest version of the repository:
https://doi.org/10.5281/zenodo.6374011 and
our website: https://ilandavis.com/jcb2023-yfp

The explosion in the volume of biological imaging data challenges the available technologies for data interrogation and its intersection with related published bioinformatics data sets. Moreover, intersection of highly rich and complex datasets from different sources provided as flat csv files requires advanced informatics skills, which is time consuming and not accessible to all. Here, we provide a “user manual” to our new paradigm for systematically filtering and analysing a dataset with more than 1300 microscopy data figures using Multi-Dimensional Viewer (MDV) -link, a solution for interactive multimodal data visualisation and exploration. The primary data we use are derived from our published systematic analysis of 200 YFP traps reveals common discordance between mRNA and protein across the nervous system (eprint link). This manual provides the raw image data together with the expert annotations of the mRNA and protein distribution as well as associated bioinformatics data. We provide an explanation, with specific examples, of how to use MDV to make the multiple data types interoperable and explore them together. We also provide the open-source python code (github link) used to annotate the figures, which could be adapted to any other kind of data annotation task.
o
Data from: DEIMoS: An Open-Source Tool for Processing High-Dimensional Mass...
osti.gov
Updated Nov 16, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
USDOE (2021). DEIMoS: An Open-Source Tool for Processing High-Dimensional Mass Spectrometry Data [Dataset]. http://doi.org/10.25584/2483273
Explore at:
Unique identifier
https://doi.org/10.25584/2483273
Dataset updated
Nov 16, 2021
Dataset provided by
USDOE
Pacific Northwest National Laboratory (PNNL), Richland, WA (United States)
Description
We present DEIMoS: Data Extraction for Integrated Multidimensional Spectrometry, a Python application programming interface and command-line tool for high-dimensional mass spectrometry data analysis workflows, offering ease of development and access to efficient algorithmic implementations. Functionality includes feature detection, feature alignment, collision cross section calibration, isotope detection, and MS/MS spectral deconvolution, with the output comprising detected features aligned across study samples and characterized by mass, CCS, tandem mass spectra, and isotopic signature. Notably, DEIMoS operates on N-dimensional data, largely agnostic to acquisition instrumentation: algorithm implementations utilize all dimensions simultaneously to (i) offer greater separation between features, improving detection sensitivity, (ii) increase alignment/feature matching confidence among datasets, and (iii) mitigate convolution artifacts in tandem mass spectra. We demonstrate DEIMoS with LC-IMS-MS/MS data, demonstrating the advantages of a multidimensional approach in each data processing step.
f
Data from: On rank distribution classifiers for high-dimensional data
tandf.figshare.com
txt
Updated Nov 28, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Olusola Samuel Makinde (2023). On rank distribution classifiers for high-dimensional data [Dataset]. http://doi.org/10.6084/m9.figshare.12337025.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.12337025.v1
Dataset updated
Nov 28, 2023
Dataset provided by
Taylor & Francis
Authors
Olusola Samuel Makinde
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Spatial sign and rank-based methods have been studied in the recent literature, especially when the dimension is smaller than the sample size. In this paper, a classification method based on the distribution of rank functions for high-dimensional data is considered with extension to functional data. The method is fully nonparametric in nature. The performance of the classification method is illustrated in comparison with some other classifiers using simulated and real data sets. Supporting code in R are provided for computational implementation of the classification method that will be of use to others.
d
Data from: Mining Distance-Based Outliers in Near Linear Time
catalog.data.gov
datasets.ai
Updated Apr 11, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dashlink (2025). Mining Distance-Based Outliers in Near Linear Time [Dataset]. https://catalog.data.gov/dataset/mining-distance-based-outliers-in-near-linear-time
Explore at:
Dataset updated
Apr 11, 2025
Dataset provided by
Dashlink
Description
Full title: Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule Abstract: Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.
D
Data from: Dimensionality and Validation of the Highly Sensitive Person...
dataverse.nl
docx, zip
Updated Aug 20, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Veronique de Gucht; Veronique de Gucht; Tom Wilderjans; Tom Wilderjans; Franshelis Garcia; Franshelis Garcia; Stan Maes; Stan Maes (2024). Dimensionality and Validation of the Highly Sensitive Person Scale (HSPS) in a Dutch General Population Sample and Two Clinical Samples [Dataset]. http://doi.org/10.34894/PXHGBI
Explore at:
zip(762266), zip(8042), zip(2563406), zip(789595), docx(18664), zip(114559)Available download formats
Unique identifier
https://doi.org/10.34894/PXHGBI
Dataset updated
Aug 20, 2024
Dataset provided by
DataverseNL
Authors
Veronique de Gucht; Veronique de Gucht; Tom Wilderjans; Tom Wilderjans; Franshelis Garcia; Franshelis Garcia; Stan Maes; Stan Maes
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The data included in the publication package pertain to the validation of an existing English questionnaire into Dutch. The validation was conducted both in a sample of the general population and in two clinical samples (fatigue; pain). The publication package contains both the data and the syntax that form the basis of the validation article.
Reproducibility data for tables in "A machine learning approach to portfolio...
zenodo.org
csv
Updated Jun 19, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lucio Fernandez-Arjona; Lucio Fernandez-Arjona; Damir Filipovic; Damir Filipovic (2022). Reproducibility data for tables in "A machine learning approach to portfolio pricing and risk management for high dimensional problems" [Dataset]. http://doi.org/10.5281/zenodo.6659959
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.6659959
Dataset updated
Jun 19, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Lucio Fernandez-Arjona; Lucio Fernandez-Arjona; Damir Filipovic; Damir Filipovic
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains all the information necessary for reproducing the tables in the paper "A machine learning approach to portfolio pricing and risk management for high dimensional problems".

The raw benchmark data can be found in the Zenodo dataset "Benchmark and training data for replicating financial and insurance examples" (https://zenodo.org/record/3837381). To the extend necessary, only summary data from that dataset is used in this dataset.

The dataset includes a jupyter notebook file that explains what the different files contain, and provides sample code to analyze the information and reproduce the tables.
Z
Data from: First-passage probability estimation of high-dimensional...
data.niaid.nih.gov
Updated Jul 12, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Valdebenito, Marcos (2024). First-passage probability estimation of high-dimensional nonlinear stochastic dynamic systems by a fractional moments-based mixture distribution approach [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7661087
Explore at:
Dataset updated
Jul 12, 2024
Dataset provided by
Ding, Chen
Faes, Matthias
Dang, Chao
Broggi, Matteo
Valdebenito, Marcos
Beer, Michael
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
First-passage probability estimation of high-dimensional nonlinear stochastic dynamic systems is a significant task to be solved in many science and engineering fields, but remains still an open challenge. The present paper develops a novel approach, termed ‘fractional moments-based mixture distribution’, to address such challenge. This approach is implemented by capturing the extreme value distribution (EVD) of the system response with the concepts of fractional moment and mixture distribution. In our context, the fractional moment itself is by definition a high-dimensional integral with a complicated integrand. To efficiently compute the fractional moments, a parallel adaptive sampling scheme that allows for sample size extension is developed using the refined Latinized stratified sampling (RLSS). In this manner, both variance reduction and parallel computing are possible for evaluating the fractional moments. From the knowledge of low-order fractional moments, the EVD of interest is then expected to be reconstructed. Based on introducing an extended inverse Gaussian distribution and a log extended skew-normal distribution, one flexible mixture distribution model is proposed, where its fractional moments are derived in analytic form. By fitting a set of fractional moments, the EVD can be recovered via the proposed mixture model. Accordingly, the first-passage probabilities under different thresholds can be obtained from the recovered EVD straightforwardly. The performance of the proposed method is verified by three examples consisting of two test examples and one engineering problem.
H
Data from: Multi Class Datasets
dataverse.harvard.edu
Updated Mar 9, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sankha Mullick (2017). Multi Class Datasets [Dataset]. http://doi.org/10.7910/DVN/O4RIRM
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/O4RIRM
Dataset updated
Mar 9, 2017
Dataset provided by
Harvard Dataverse
Authors
Sankha Mullick
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
We present 20 new multi-labeled artificial datasets, which can also be used for evaluating ambiguity resolving classifiers. The ambiguous or multi-labeled points are defined by those lying in the overlapping regions of two or more classes. Among the 20 datasets, 10 are 2-dimensional, while the rests are either 5 or 10-dimensional extended versions of the 2-dimensonal ones. The extensions are done following one of the two techniques. In the first strategy, datasets ate designed by appending 3 new dimensions each sampled uniformly at random and scaled between a specified range. The new 5-dimensional dataset is rotated by a random rotation matrix. This is a general technique by which any dataset can be transformed to higher dimensional feature space while conserving the properties of the ambiguous points. The second method extends the datasets by sampling them from a 10-dimensional real-valued feature space using the analogs class distributions of the corresponding 2-dimensional dataset. Such a strategy can extend a dataset to arbitrarily higher dimension feature space. However, the datasets will become sparse with increasing dimensionality. To tackle this issue the number of data points is increased in this case.
I
Data from: Modeling in higher dimensions to improve diagnostic testing...
data.niaid.nih.gov
url
Updated Sep 26, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). Modeling in higher dimensions to improve diagnostic testing accuracy: Theory and examples for multiplex saliva-based SARS-CoV-2 antibody assays [Dataset]. http://doi.org/10.21430/M33WC6JLYP
Explore at:
urlAvailable download formats
Unique identifier
https://doi.org/10.21430/M33WC6JLYP
Dataset updated
Sep 26, 2024
License
https://www.immport.org/agreementhttps://www.immport.org/agreement
Description
The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic has emphasized the importance and challenges of correctly interpreting antibody test results. Identification of positive and negative samples requires a classification strategy with low error rates, which is hard to achieve when the corresponding measurement values overlap. Additional uncertainty arises when classification schemes fail to account for complicated structure in data. We address these problems through a mathematical framework that combines high dimensional data modeling and optimal decision theory. Specifically, we show that appropriately increasing the dimension of data better separates positive and negative populations and reveals nuanced structure that can be described in terms of mathematical models. We combine these models with optimal decision theory to yield a classification scheme that better separates positive and negative samples relative to traditional methods such as confidence intervals (CIs) and receiver operating characteristics. We validate the usefulness of this approach in the context of a multiplex salivary SARS-CoV-2 immunoglobulin G assay dataset. This example illustrates how our analysis: (i) improves the assay accuracy, (e.g. lowers classification errors by up to 42% compared to CI methods); (ii) reduces the number of indeterminate samples when an inconclusive class is permissible, (e.g. by 40% compared to the original analysis of the example multiplex dataset) and (iii) decreases the number of antigens needed to classify samples. Our work showcases the power of mathematical modeling in diagnostic classification and highlights a method that can be adopted broadly in public health and clinical settings.
m
Educational Attainment in North Carolina Public Schools: Use of statistical...
data.mendeley.com
Updated Nov 14, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Scott Herford (2018). Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets. [Dataset]. http://doi.org/10.17632/6cm9wyd5g5.1
Explore at:
Unique identifier
https://doi.org/10.17632/6cm9wyd5g5.1
Dataset updated
Nov 14, 2018
Authors
Scott Herford
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
North Carolina
Description
The purpose of data mining analysis is always to find patterns of the data using certain kind of techiques such as classification or regression. It is not always feasible to apply classification algorithms directly to dataset. Before doing any work on the data, the data has to be pre-processed and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. Based on our project, after using clustering prior to classification, the performance has not improved much. The reason why it has not improved could be the features we selected to perform clustering are not well suited for it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics. From the dimensionality reduction perspective: It is different from Principle Component Analysis which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters as a technique of reducing the data dimension will lose a lot of information since clustering techniques are based a metric of 'distance'. At high dimensions euclidean distance loses pretty much all meaning. Therefore using clustering as a "Reducing" dimensionality by mapping data points to cluster numbers is not always good since you may lose almost all the information. From the creating new features perspective: Clustering analysis creates labels based on the patterns of the data, it brings uncertainties into the data. By using clustering prior to classification, the decision on the number of clusters will highly affect the performance of the clustering, then affect the performance of classification. If the part of features we use clustering techniques on is very suited for it, it might increase the overall performance on classification. For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better. We did not lock in the clustering outputs using a random_state in the effort to see if they were stable. Our assumption was that if the results vary highly from run to run which they definitely did, maybe the data just does not cluster well with the methods selected at all. Basically, the ramification we saw was that our results are not much better than random when applying clustering to the data preprocessing. Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the model real world effectiveness and also to continue to revise the models from time to time as things change.
f
Mean squared error (×10−2) for data as in Table 2.
plos.figshare.com
xls
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Fan Chen; Guy Nason (2023). Mean squared error (×10−2) for data as in Table 2. [Dataset]. http://doi.org/10.1371/journal.pone.0229845.t003
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0229845.t003
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS ONE
Authors
Fan Chen; Guy Nason
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Mean squared error (×10−2) for data as in Table 2.
Multi-Dimensional Data Viewer (MDV) user manual for data exploration:...
zenodo.org
data.niaid.nih.gov
pdf, zip
Updated Jul 12, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Maria Kiourlappou; Maria Kiourlappou; Martin Sergeant; Martin Sergeant; Joshua S. Titlow; Joshua S. Titlow; Jeffrey Y. Lee; Jeffrey Y. Lee; Darragh Ennis; Stephen Taylor; Stephen Taylor; Ilan Davis; Ilan Davis; Darragh Ennis (2024). Multi-Dimensional Data Viewer (MDV) user manual for data exploration: "Systematic analysis of YFP gene traps reveals common discordance between mRNA and protein across the nervous system" [Dataset]. http://doi.org/10.5281/zenodo.7738944
Explore at:
zip, pdfAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.7738944
Dataset updated
Jul 12, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Maria Kiourlappou; Maria Kiourlappou; Martin Sergeant; Martin Sergeant; Joshua S. Titlow; Joshua S. Titlow; Jeffrey Y. Lee; Jeffrey Y. Lee; Darragh Ennis; Stephen Taylor; Stephen Taylor; Ilan Davis; Ilan Davis; Darragh Ennis
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The explosion in the volume of biological imaging data challenges the available technologies for data interrogation and its intersection with related published bioinformatics data sets. Moreover, intersection of highly rich and complex datasets from different sources provided as flat csv files requires advanced informatics skills, which is time consuming and not accessible to all. Here, we provide a “user manual” to our new paradigm for systematically filtering and analysing a dataset with more than 1300 microscopy data figures using Multi-Dimensional Viewer (MDV: https://mdv.molbiol.ox.ac.uk), a solution for interactive multimodal data visualisation and exploration. The primary data we use are derived from our published systematic analysis of 200 YFP gene traps reveals common discordance between mRNA and protein across the nervous system (https://doi.org/10.1083/jcb.202205129). This manual provides the raw image data together with the expert annotations of the mRNA and protein distribution as well as associated bioinformatics data. We provide an explanation, with specific examples, of how to use MDV to make the multiple data types interoperable and explore them together. We also provide the open-source python code (github link) used to annotate the figures, which could be adapted to any other kind of data annotation task.
f
Data from: Statistical Significance of Clustering with Multidimensional...
tandf.figshare.com
bin
Updated Feb 13, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Hui Shen; Shankar Bhamidi; Yufeng Liu (2024). Statistical Significance of Clustering with Multidimensional Scaling [Dataset]. http://doi.org/10.6084/m9.figshare.23271877.v1
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.23271877.v1
Dataset updated
Feb 13, 2024
Dataset provided by
Taylor & Francis
Authors
Hui Shen; Shankar Bhamidi; Yufeng Liu
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Clustering is a fundamental tool for exploratory data analysis. One central problem in clustering is deciding if the clusters discovered by clustering methods are reliable as opposed to being artifacts of natural sampling variation. Statistical significance of clustering (SigClust) is a recently developed cluster evaluation tool for high-dimension, low-sample size data. Despite its successful application to many scientific problems, there are cases where the original SigClust may not work well. Furthermore, for specific applications, researchers may not have access to the original data and only have the dissimilarity matrix. In this case, clustering is still a valuable exploratory tool, but the original SigClust is not applicable. To address these issues, we propose a new SigClust method using multidimensional scaling (MDS). The underlying idea behind MDS-based SigClust is that one can achieve low-dimensional representations of the original data via MDS using only the dissimilarity matrix and then apply SigClust on the low-dimensional MDS space. The proposed MDS-based SigClust can circumvent the challenge of parameter estimation of the original method in high-dimensional spaces while keeping the essential clustering structure in the MDS space. Both simulations and real data applications demonstrate that the proposed method works remarkably well for assessing the statistical significance of clustering. Supplemental materials for the article are available online.
t
Towards a unified multi-dimensional evaluator for text generation - Dataset...
service.tib.eu
Updated Dec 16, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). Towards a unified multi-dimensional evaluator for text generation - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/towards-a-unified-multi-dimensional-evaluator-for-text-generation
Explore at:
Dataset updated
Dec 16, 2024
Description
The NewsRoom dataset consists of 60 input source texts and 7 output summaries for each sample.

Facebook

Twitter

Click to copy link

Link copied

Cite

Elena Landoni; Rosalba Miceli; Maurizio Callari; Paola Tiberio; Valentina Appierto; Valentina Angeloni; Luigi Mariani; Maria Daidone (2023). Additional file 1: of Proposal of supervised data analysis strategy of plasma miRNAs from hybridisation array data with an application to assess hemolysis-related deregulation [Dataset]. http://doi.org/10.6084/m9.figshare.c.3595874_D5.v1

Additional file 1: of Proposal of supervised data analysis strategy of plasma miRNAs from hybridisation array data with an application to assess hemolysis-related deregulation

Explore at:

txtAvailable download formats

Unique identifier

https://doi.org/10.6084/m9.figshare.c.3595874_D5.v1

Dataset updated

May 30, 2023

Dataset provided by

Figsharehttp://figshare.com/
figshare

Authors

Elena Landoni; Rosalba Miceli; Maurizio Callari; Paola Tiberio; Valentina Appierto; Valentina Angeloni; Luigi Mariani; Maria Daidone

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

R codes for implementing the described analyses (sample processing, data pre-processing, class comparison and class prediction). Caliper matching was implemented using the nonrandom package; the t- and the AD tests were implemented using the stats package and the adk package, respectively. Notice that the updated package for implementing the AD test is kSamples. As regards the bootstrap selection and the egg-shaped plot, we respectively modified the doBS and the importance igraph functions, both included in the bootfs package. For the SVM model we used the e1071 package. (R 12Â kb)

Clear search

Close search

Google apps

Main menu

Additional file 1: of Proposal of supervised data analysis strategy of...

Data from: A method for analysis of phenotypic change for phenotypes...

Dataset for: Some Remarks on the R2 for Clustering

Research data supporting: "Relevant, hidden, and frustrated information in...

An Advanced Learning Framework for High Dimensional Multi-Sensor Remote...

Data from: Some Multivariate Tests of Independence Based on Ranks of Nearest...

Multi-Dimensional Data Viewer (MDV) user manual for data exploration:...

Data from: DEIMoS: An Open-Source Tool for Processing High-Dimensional Mass...

Data from: On rank distribution classifiers for high-dimensional data

Data from: Mining Distance-Based Outliers in Near Linear Time

Data from: Dimensionality and Validation of the Highly Sensitive Person...

Reproducibility data for tables in "A machine learning approach to portfolio...

Data from: First-passage probability estimation of high-dimensional...

Data from: Multi Class Datasets

Data from: Modeling in higher dimensions to improve diagnostic testing...

Educational Attainment in North Carolina Public Schools: Use of statistical...

Mean squared error (×10−2) for data as in Table 2.

Multi-Dimensional Data Viewer (MDV) user manual for data exploration:...

Data from: Statistical Significance of Clustering with Multidimensional...

Towards a unified multi-dimensional evaluator for text generation - Dataset...

Additional file 1: of Proposal of supervised data analysis strategy of plasma miRNAs from hybridisation array data with an application to assess hemolysis-related deregulation