100+ datasets found
  1. Additional file 1: of Proposal of supervised data analysis strategy of...

    • springernature.figshare.com
    txt
    Updated May 30, 2023
    Cite
    Elena Landoni; Rosalba Miceli; Maurizio Callari; Paola Tiberio; Valentina Appierto; Valentina Angeloni; Luigi Mariani; Maria Daidone (2023). Additional file 1: of Proposal of supervised data analysis strategy of plasma miRNAs from hybridisation array data with an application to assess hemolysis-related deregulation [Dataset]. http://doi.org/10.6084/m9.figshare.c.3595874_D5.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Elena Landoni; Rosalba Miceli; Maurizio Callari; Paola Tiberio; Valentina Appierto; Valentina Angeloni; Luigi Mariani; Maria Daidone
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    R code for implementing the described analyses (sample processing, data pre-processing, class comparison and class prediction). Caliper matching was implemented using the nonrandom package; the t-test and the AD test were implemented using the stats package and the adk package, respectively (note that the updated package for implementing the AD test is kSamples). For the bootstrap selection and the egg-shaped plot, we modified the doBS and importance igraph functions, respectively, both included in the bootfs package. For the SVM model we used the e1071 package. (R, 12 kb)
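    The analysis code in this record is written in R (nonrandom, stats, kSamples, bootfs, e1071). For orientation only, the following is a rough Python sketch of two analogous steps, the per-probe t-test / Anderson-Darling comparison and an SVM classifier, using scipy and scikit-learn; the expression matrix and hemolysis labels are synthetic stand-ins, not data from this record.

```python
import numpy as np
from scipy import stats
from sklearn.svm import SVC

rng = np.random.default_rng(0)
expr = rng.normal(size=(40, 300))                # 40 plasma samples x 300 miRNA probes (synthetic)
hemolysed = np.repeat([0, 1], 20)                # two classes to compare

g0, g1 = expr[hemolysed == 0], expr[hemolysed == 1]
t_p = stats.ttest_ind(g0, g1, axis=0).pvalue     # per-probe t-test (stats::t.test in R)
ad_p = np.array([stats.anderson_ksamp([g0[:, j], g1[:, j]]).significance_level
                 for j in range(expr.shape[1])]) # per-probe AD test (kSamples::ad.test in R)

clf = SVC(kernel="linear").fit(expr, hemolysed)  # class prediction step (e1071::svm in R)
print(t_p[:3], ad_p[:3], clf.score(expr, hemolysed))
```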

  2. Data from: A method for analysis of phenotypic change for phenotypes...

    • zenodo.org
    • data.niaid.nih.gov
    • +2more
    csv
    Updated May 29, 2022
    Cite
    Michael L. Collyer; David J. Sekora; Dean C. Adams (2022). Data from: A method for analysis of phenotypic change for phenotypes described by high-dimensional data [Dataset]. http://doi.org/10.5061/dryad.1p80f
    Explore at:
    csvAvailable download formats
    Dataset updated
    May 29, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Michael L. Collyer; David J. Sekora; Dean C. Adams
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The analysis of phenotypic change is important for several evolutionary biology disciplines, including phenotypic plasticity, evolutionary developmental biology, morphological evolution, physiological evolution, evolutionary ecology and behavioral evolution. It is common for researchers in these disciplines to work with multivariate phenotypic data. When phenotypic variables exceed the number of research subjects—data called 'high-dimensional data'—researchers are confronted with analytical challenges. Parametric tests that require high observation to variable ratios present a paradox for researchers, as eliminating variables potentially reduces effect sizes for comparative analyses, yet test statistics require more observations than variables. This problem is exacerbated with data that describe 'multidimensional' phenotypes, whereby a description of phenotype requires high-dimensional data. For example, landmark-based geometric morphometric data use the Cartesian coordinates of (potentially) many anatomical landmarks to describe organismal shape. Collectively such shape variables describe organism shape, although the analysis of each variable, independently, offers little benefit for addressing biological questions. Here we present a nonparametric method of evaluating effect size that is not constrained by the number of phenotypic variables, and motivate its use with example analyses of phenotypic change using geometric morphometric data. Our examples contrast different characterizations of body shape for a desert fish species, associated with measuring and comparing sexual dimorphism between two populations. We demonstrate that using more phenotypic variables can increase effect sizes, and allow for stronger inferences.
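    The method described above is permutation based; as a minimal illustrative sketch of the general idea (not the authors' implementation), a permutation test can use the Euclidean distance between multivariate group means as its statistic, which stays well defined when shape variables outnumber specimens. The data below are simulated.

```python
import numpy as np

def perm_test(A, B, n_perm=999, seed=0):
    """Permutation test on the distance between multivariate group means."""
    rng = np.random.default_rng(seed)
    obs = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
    pooled, n_a, count = np.vstack([A, B]), len(A), 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        d = np.linalg.norm(pooled[idx[:n_a]].mean(axis=0) - pooled[idx[n_a:]].mean(axis=0))
        count += d >= obs
    return obs, (count + 1) / (n_perm + 1)

rng = np.random.default_rng(1)
females = rng.normal(0.0, 1.0, size=(15, 200))   # 15 specimens, 200 shape variables (p >> n)
males = rng.normal(0.2, 1.0, size=(15, 200))
print(perm_test(females, males))                 # (effect size, permutation p-value)
```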

  3. Dataset for: Some Remarks on the R2 for Clustering

    • wiley.figshare.com
    txt
    Updated Jun 1, 2023
    Cite
    Nicola Loperfido; Thaddeus Tarpey (2023). Dataset for: Some Remarks on the R2 for Clustering [Dataset]. http://doi.org/10.6084/m9.figshare.6124508.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Wiley
    Authors
    Nicola Loperfido; Thaddeus Tarpey
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    A common descriptive statistic in cluster analysis is the $R^2$ that measures the overall proportion of variance explained by the cluster means. This note highlights properties of the $R^2$ for clustering. In particular, we show that the $R^2$ can generally be artificially inflated by linearly transforming the data, by "stretching" and by projecting. Also, the $R^2$ for clustering will often be a poor measure of clustering quality in high-dimensional settings. We also investigate the $R^2$ for clustering for misspecified models. Several simulation illustrations are provided highlighting weaknesses in the clustering $R^2$, especially in high-dimensional settings. A functional data example is given showing how the $R^2$ for clustering can vary dramatically depending on how the curves are estimated.
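    For orientation, the clustering $R^2$ discussed above is the between-cluster sum of squares divided by the total sum of squares. A minimal Python sketch (not the authors' code) shows how it can look deceptively large even for pure high-dimensional noise:

```python
import numpy as np
from sklearn.cluster import KMeans

def clustering_r2(X, labels):
    """R^2 = between-cluster sum of squares / total sum of squares."""
    grand_mean = X.mean(axis=0)
    total_ss = ((X - grand_mean) ** 2).sum()
    between_ss = sum(
        (labels == k).sum() * ((X[labels == k].mean(axis=0) - grand_mean) ** 2).sum()
        for k in np.unique(labels)
    )
    return between_ss / total_ss

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))                         # high-dimensional noise, no real clusters
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(clustering_r2(X, labels))                        # nontrivial R^2 despite the absence of structure
```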

  4. Research data supporting: "Relevant, hidden, and frustrated information in...

    • zenodo.org
    zip
    Updated May 20, 2025
    Cite
    Chiara Lionello; Matteo Becchi; Simone Martino; Giovanni M. Pavan (2025). Research data supporting: "Relevant, hidden, and frustrated information in high-dimensional analyses of complex dynamical systems with internal noise" [Dataset]. http://doi.org/10.5281/zenodo.14529457
    Explore at:
    zipAvailable download formats
    Dataset updated
    May 20, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Chiara Lionello; Matteo Becchi; Simone Martino; Giovanni M. Pavan
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the set of data shown in the paper "Relevant, hidden, and frustrated information in high-dimensional analyses of complex dynamical systems with internal noise", published on arXiv (DOI: 10.48550/arXiv.2412.09412).

    The scripts contained herein are:

    1. PCA-Analysis.py: Python script to calculate the SOAP descriptor, denoise it, and compute the Principal Component Analysis
    2. SOAP-Component-Analysis.py: Python script to calculate the variance of the single SOAP components
    3. Hierarchical-Clustering.py: Python script to compute the hierarchical clustering and plot the dataset
    4. OnionClustering-1d.py: script to compute the Onion clustering on a single SOAP component or principal component
    5. OnionClustering-2d.py: script to compute two-dimensional Onion clustering
    6. OnionClustering-plot.py: script to plot the Onion plot, removing clusters with population <1%
    7. UMAP.py: script to compute the UMAP dimensionality reduction

    To reproduce the data of this work, you should start from SOAP-Component-Analysis.py to calculate the SOAP descriptor and select the components of interest; then calculate the PCA with PCA-Analysis.py and apply the clustering that suits your needs (OnionClustering-1d.py, OnionClustering-2d.py, Hierarchical-Clustering.py). Further modifications of the Onion plot can be made with the script OnionClustering-plot.py. UMAP can be calculated with UMAP.py.
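    As a minimal outline of the PCA-plus-clustering steps described above (this is not the repository's PCA-Analysis.py or Hierarchical-Clustering.py; it assumes the precomputed SOAP array ice-water-50ns-sampl40ps_soap.npy with the descriptor components on its last axis):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering

soap = np.load("ice-water-50ns-sampl40ps_soap.npy")    # SOAP descriptors provided in this record
soap = soap.reshape(-1, soap.shape[-1])                # flatten any (frame, atom) leading axes

pcs = PCA(n_components=2).fit_transform(soap)          # cf. PCA-Analysis.py
labels = AgglomerativeClustering(n_clusters=2).fit_predict(pcs)  # cf. Hierarchical-Clustering.py
print(np.bincount(labels))                             # population of each cluster
```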

    Additional data contained herein are:

    1. starting-configuration.gro: gromacs file with the initial configuration of the ice-water system
    2. traj-ice-water-50ns-sampl4ps.xtc: trajectory of the ice-water system sampled every 4 ps
    3. traj-ice-water-50ns-sampl40ps.xtc: trajectory of the ice-water system sampled every 40 ps
    4. some files containing the SOAP descriptor of the ice-water system: ice-water-50ns-sampl40ps.hdf5, ice-water-50ns-sampl40ps_soap.hdf5, ice-water-50ns-sampl40ps_soap.npy, ice-water-50ns-sampl40ps_soap-spavg.npy
    5. PCA-results: folder that contains some example results of the PCA
    6. UMAP-results: folder that contains some example results of UMAP

    The data related to the Quincke rollers can be found here: https://zenodo.org/records/10638736

  5. An Advanced Learning Framework for High Dimensional Multi-Sensor Remote...

    • catalog.data.gov
    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    • +1more
    Updated Apr 11, 2025
    + more versions
    Cite
    Science Mission Directorate (2025). An Advanced Learning Framework for High Dimensional Multi-Sensor Remote Sensing Data Project [Dataset]. https://catalog.data.gov/dataset/an-advanced-learning-framework-for-high-dimensional-multi-sensor-remote-sensing-data-proje
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Science Mission Directorate
    Description

    Improve the use of land cover data by developing an advanced framework for robust classification using multi-source datasets:
    Develop, validate and optimize a generalized multi-kernel, active learning (MKL-AL) pattern recognition framework for multi-source data fusion.
    Develop both single- and ensemble-classifier versions (MKL-AL and Ensemble-MKL-AL) of the system.
    Utilize multi-source remotely sensed and in situ data to create land-cover classification and perform accuracy assessment with available labeled data; utilize first results to query new samples that, if inducted into the training of the system, will significantly improve classification performance and accuracy.
     

  6. Data from: Some Multivariate Tests of Independence Based on Ranks of Nearest...

    • tandf.figshare.com
    • figshare.com
    txt
    Updated May 31, 2023
    Cite
    Soham Sarkar; Anil K. Ghosh (2023). Some Multivariate Tests of Independence Based on Ranks of Nearest Neighbors [Dataset]. http://doi.org/10.6084/m9.figshare.4531400.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Soham Sarkar; Anil K. Ghosh
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Several parametric and nonparametric tests of independence between two random vectors are available in the literature. But, many of them perform poorly for high-dimensional data and are not applicable when the dimension exceeds the sample size. In this article, we propose some tests based on ranks of nearest neighbors, which can be conveniently used in high dimension, low sample size situations. Several simulated and real datasets are analyzed to show the utility of the proposed tests. Codes for implementation of the proposed tests are available as supplementary materials.

  7. Multi-Dimensional Data Viewer (MDV) user manual for data exploration:...

    • zenodo.org
    pdf, zip
    Updated Jul 12, 2024
    + more versions
    Cite
    Maria Kiourlappou; Martin Sergeant; Joshua S. Titlow; Jeffrey Y. Lee; Darragh Ennis; Stephen Taylor; Ilan Davis (2024). Multi-Dimensional Data Viewer (MDV) user manual for data exploration: "Systematic analysis of YFP traps reveals common discordance between mRNA and protein across the nervous system" [Dataset]. http://doi.org/10.5281/zenodo.7875495
    Explore at:
    zip, pdfAvailable download formats
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Maria Kiourlappou; Martin Sergeant; Joshua S. Titlow; Jeffrey Y. Lee; Darragh Ennis; Stephen Taylor; Ilan Davis
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Please also see the latest version of the repository:
    https://doi.org/10.5281/zenodo.6374011 and
    our website: https://ilandavis.com/jcb2023-yfp

    The explosion in the volume of biological imaging data challenges the available technologies for data interrogation and its intersection with related published bioinformatics data sets. Moreover, intersection of highly rich and complex datasets from different sources provided as flat csv files requires advanced informatics skills, which is time consuming and not accessible to all. Here, we provide a “user manual” to our new paradigm for systematically filtering and analysing a dataset with more than 1300 microscopy data figures using Multi-Dimensional Viewer (MDV) -link, a solution for interactive multimodal data visualisation and exploration. The primary data we use are derived from our published systematic analysis of 200 YFP traps reveals common discordance between mRNA and protein across the nervous system (eprint link). This manual provides the raw image data together with the expert annotations of the mRNA and protein distribution as well as associated bioinformatics data. We provide an explanation, with specific examples, of how to use MDV to make the multiple data types interoperable and explore them together. We also provide the open-source python code (github link) used to annotate the figures, which could be adapted to any other kind of data annotation task.

  8. Data from: DEIMoS: An Open-Source Tool for Processing High-Dimensional Mass...

    • osti.gov
    Updated Nov 16, 2021
    Cite
    USDOE (2021). DEIMoS: An Open-Source Tool for Processing High-Dimensional Mass Spectrometry Data [Dataset]. http://doi.org/10.25584/2483273
    Explore at:
    Dataset updated
    Nov 16, 2021
    Dataset provided by
    USDOE
    Pacific Northwest National Laboratory (PNNL), Richland, WA (United States)
    Description

    We present DEIMoS: Data Extraction for Integrated Multidimensional Spectrometry, a Python application programming interface and command-line tool for high-dimensional mass spectrometry data analysis workflows, offering ease of development and access to efficient algorithmic implementations. Functionality includes feature detection, feature alignment, collision cross section calibration, isotope detection, and MS/MS spectral deconvolution, with the output comprising detected features aligned across study samples and characterized by mass, CCS, tandem mass spectra, and isotopic signature. Notably, DEIMoS operates on N-dimensional data, largely agnostic to acquisition instrumentation: algorithm implementations utilize all dimensions simultaneously to (i) offer greater separation between features, improving detection sensitivity, (ii) increase alignment/feature matching confidence among datasets, and (iii) mitigate convolution artifacts in tandem mass spectra. We demonstrate DEIMoS with LC-IMS-MS/MS data, demonstrating the advantages of a multidimensional approach in each data processing step.

  9. Data from: On rank distribution classifiers for high-dimensional data

    • tandf.figshare.com
    txt
    Updated Nov 28, 2023
    Cite
    Olusola Samuel Makinde (2023). On rank distribution classifiers for high-dimensional data [Dataset]. http://doi.org/10.6084/m9.figshare.12337025.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    Nov 28, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Olusola Samuel Makinde
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Spatial sign and rank-based methods have been studied in the recent literature, especially when the dimension is smaller than the sample size. In this paper, a classification method based on the distribution of rank functions for high-dimensional data is considered, with an extension to functional data. The method is fully nonparametric in nature. The performance of the classification method is illustrated in comparison with some other classifiers using simulated and real data sets. Supporting code in R is provided for computational implementation of the classification method and will be of use to others.

  10. Data from: Mining Distance-Based Outliers in Near Linear Time

    • catalog.data.gov
    • datasets.ai
    Updated Apr 11, 2025
    + more versions
    Cite
    Dashlink (2025). Mining Distance-Based Outliers in Near Linear Time [Dataset]. https://catalog.data.gov/dataset/mining-distance-based-outliers-in-near-linear-time
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    Full title: Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule

    Abstract: Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.
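    For orientation, here is a compact Python sketch of the nested-loop scheme summarized above (an illustration of the randomization-plus-pruning idea, not the authors' implementation):

```python
import numpy as np

def top_outliers(X, k=5, n_outliers=10, seed=0):
    """Score each point by the distance to its k-th nearest neighbour, pruning a point
    as soon as that distance provably drops below the current cutoff."""
    rng = np.random.default_rng(seed)
    X = X[rng.permutation(len(X))]            # random order is key to near-linear behaviour
    cutoff, top = 0.0, []                     # top holds (score, index) of current outliers
    for i, x in enumerate(X):
        neigh = np.full(k, np.inf)            # distances to the k nearest neighbours seen so far
        for j, y in enumerate(X):
            if i == j:
                continue
            d = np.linalg.norm(x - y)
            if d < neigh.max():
                neigh[neigh.argmax()] = d
                if neigh.max() < cutoff:      # pruning rule: cannot enter the top list
                    break
        else:                                 # inner loop finished without pruning
            top.append((neigh.max(), i))
            top = sorted(top, reverse=True)[:n_outliers]
            if len(top) == n_outliers:
                cutoff = top[-1][0]           # weakest score among the current top outliers
    return top

rng = np.random.default_rng(1)
data = np.vstack([rng.normal(size=(500, 20)), rng.normal(6.0, 1.0, size=(5, 20))])
print(top_outliers(data))                     # planted points should dominate the top scores
```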

  11. Data from: Dimensionality and Validation of the Highly Sensitive Person...

    • dataverse.nl
    docx, zip
    Updated Aug 20, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Veronique de Gucht; Tom Wilderjans; Franshelis Garcia; Stan Maes (2024). Dimensionality and Validation of the Highly Sensitive Person Scale (HSPS) in a Dutch General Population Sample and Two Clinical Samples [Dataset]. http://doi.org/10.34894/PXHGBI
    Explore at:
    zip(762266), zip(8042), zip(2563406), zip(789595), docx(18664), zip(114559)Available download formats
    Dataset updated
    Aug 20, 2024
    Dataset provided by
    DataverseNL
    Authors
    Veronique de Gucht; Tom Wilderjans; Franshelis Garcia; Stan Maes
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data included in the publication package pertain to the validation of an existing English questionnaire into Dutch. The validation was conducted both in a sample of the general population and in two clinical samples (fatigue; pain). The publication package contains both the data and the syntax that form the basis of the validation article.

  12. Reproducibility data for tables in "A machine learning approach to portfolio...

    • zenodo.org
    csv
    Updated Jun 19, 2022
    Cite
    Lucio Fernandez-Arjona; Damir Filipovic (2022). Reproducibility data for tables in "A machine learning approach to portfolio pricing and risk management for high dimensional problems" [Dataset]. http://doi.org/10.5281/zenodo.6659959
    Explore at:
    csvAvailable download formats
    Dataset updated
    Jun 19, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Lucio Fernandez-Arjona; Damir Filipovic
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains all the information necessary for reproducing the tables in the paper "A machine learning approach to portfolio pricing and risk management for high dimensional problems".

    The raw benchmark data can be found in the Zenodo dataset "Benchmark and training data for replicating financial and insurance examples" (https://zenodo.org/record/3837381). To the extent necessary, only summary data from that dataset is used in this dataset.

    The dataset includes a jupyter notebook file that explains what the different files contain, and provides sample code to analyze the information and reproduce the tables.

  13. Data from: First-passage probability estimation of high-dimensional...

    • data.niaid.nih.gov
    Updated Jul 12, 2024
    Cite
    Valdebenito, Marcos (2024). First-passage probability estimation of high-dimensional nonlinear stochastic dynamic systems by a fractional moments-based mixture distribution approach [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7661087
    Explore at:
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Ding, Chen
    Faes, Matthias
    Dang, Chao
    Broggi, Matteo
    Valdebenito, Marcos
    Beer, Michael
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    First-passage probability estimation of high-dimensional nonlinear stochastic dynamic systems is a significant task in many science and engineering fields, but it remains an open challenge. The present paper develops a novel approach, termed 'fractional moments-based mixture distribution', to address this challenge. This approach is implemented by capturing the extreme value distribution (EVD) of the system response with the concepts of fractional moment and mixture distribution. In our context, the fractional moment itself is by definition a high-dimensional integral with a complicated integrand. To efficiently compute the fractional moments, a parallel adaptive sampling scheme that allows for sample size extension is developed using the refined Latinized stratified sampling (RLSS). In this manner, both variance reduction and parallel computing are possible for evaluating the fractional moments. From the knowledge of low-order fractional moments, the EVD of interest is then reconstructed. By introducing an extended inverse Gaussian distribution and a log extended skew-normal distribution, a flexible mixture distribution model is proposed, whose fractional moments are derived in analytic form. By fitting a set of fractional moments, the EVD can be recovered via the proposed mixture model. Accordingly, the first-passage probabilities under different thresholds can be obtained from the recovered EVD straightforwardly. The performance of the proposed method is verified by three examples consisting of two test examples and one engineering problem.
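    As a toy illustration of the fractional-moment ingredient described above (this is not the paper's RLSS-based scheme), low-order fractional moments E[Y^q] of an extreme-value response can be estimated from samples; the response function below is a hypothetical stand-in.

```python
import numpy as np

rng = np.random.default_rng(1)

def extreme_response(n_samples):
    """Hypothetical stand-in: maximum absolute response over 200 time steps."""
    return np.abs(rng.standard_normal((n_samples, 200))).max(axis=1)

Y = extreme_response(5000)
qs = [0.25, 0.5, 0.75, 1.0]                      # low-order fractional moments
moments = {q: np.mean(Y ** q) for q in qs}       # Monte Carlo estimate of E[Y^q]
print(moments)
```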

  14. Data from: Multi Class Datasets

    • dataverse.harvard.edu
    Updated Mar 9, 2017
    Cite
    Sankha Mullick (2017). Multi Class Datasets [Dataset]. http://doi.org/10.7910/DVN/O4RIRM
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 9, 2017
    Dataset provided by
    Harvard Dataverse
    Authors
    Sankha Mullick
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    We present 20 new multi-labeled artificial datasets, which can also be used for evaluating ambiguity-resolving classifiers. The ambiguous or multi-labeled points are defined as those lying in the overlapping regions of two or more classes. Among the 20 datasets, 10 are 2-dimensional, while the rest are either 5- or 10-dimensional extended versions of the 2-dimensional ones. The extensions follow one of two techniques. In the first strategy, datasets are designed by appending 3 new dimensions, each sampled uniformly at random and scaled to a specified range. The new 5-dimensional dataset is then rotated by a random rotation matrix. This is a general technique by which any dataset can be transformed to a higher-dimensional feature space while conserving the properties of the ambiguous points. The second method extends the datasets by sampling them from a 10-dimensional real-valued feature space using class distributions analogous to those of the corresponding 2-dimensional dataset. Such a strategy can extend a dataset to an arbitrarily high-dimensional feature space. However, the datasets become sparse with increasing dimensionality; to tackle this issue, the number of data points is increased in this case.
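    A minimal sketch of the first extension strategy (illustrative only, not the generator used to build these datasets): append three uniformly sampled dimensions to a 2-dimensional dataset and apply a random orthogonal rotation.

```python
import numpy as np

rng = np.random.default_rng(0)
X2 = rng.normal(size=(500, 2))                    # stand-in for one of the 2-dimensional datasets
noise = rng.uniform(-1.0, 1.0, size=(500, 3))     # 3 new dimensions, scaled to a chosen range
X5 = np.hstack([X2, noise])                       # 5-dimensional extension

Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))      # random orthogonal (rotation/reflection) matrix
X5_rotated = X5 @ Q                               # overlap structure of the classes is preserved
print(X5_rotated.shape)
```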

  15. Data from: Modeling in higher dimensions to improve diagnostic testing...

    • data.niaid.nih.gov
    url
    Updated Sep 26, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2024). Modeling in higher dimensions to improve diagnostic testing accuracy: Theory and examples for multiplex saliva-based SARS-CoV-2 antibody assays [Dataset]. http://doi.org/10.21430/M33WC6JLYP
    Explore at:
    urlAvailable download formats
    Dataset updated
    Sep 26, 2024
    License

    https://www.immport.org/agreement

    Description

    The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic has emphasized the importance and challenges of correctly interpreting antibody test results. Identification of positive and negative samples requires a classification strategy with low error rates, which is hard to achieve when the corresponding measurement values overlap. Additional uncertainty arises when classification schemes fail to account for complicated structure in data. We address these problems through a mathematical framework that combines high dimensional data modeling and optimal decision theory. Specifically, we show that appropriately increasing the dimension of data better separates positive and negative populations and reveals nuanced structure that can be described in terms of mathematical models. We combine these models with optimal decision theory to yield a classification scheme that better separates positive and negative samples relative to traditional methods such as confidence intervals (CIs) and receiver operating characteristics. We validate the usefulness of this approach in the context of a multiplex salivary SARS-CoV-2 immunoglobulin G assay dataset. This example illustrates how our analysis: (i) improves the assay accuracy, (e.g. lowers classification errors by up to 42% compared to CI methods); (ii) reduces the number of indeterminate samples when an inconclusive class is permissible, (e.g. by 40% compared to the original analysis of the example multiplex dataset) and (iii) decreases the number of antigens needed to classify samples. Our work showcases the power of mathematical modeling in diagnostic classification and highlights a method that can be adopted broadly in public health and clinical settings.
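    As a toy sketch of the central idea, that modelling two measurements jointly can separate populations better than thresholding a single one, the synthetic two-antigen example below compares a naive one-dimensional cutoff with a simple two-dimensional Gaussian (QDA) rule. Both the data and the QDA rule are stand-ins, not the paper's model.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(0)
neg = rng.multivariate_normal([1.0, 1.0], [[0.3, 0.25], [0.25, 0.3]], size=500)   # negatives
pos = rng.multivariate_normal([1.8, 1.8], [[0.3, -0.2], [-0.2, 0.3]], size=500)   # positives
X, y = np.vstack([neg, pos]), np.repeat([0, 1], 500)

cutoff = X[y == 0, 0].mean() + 3 * X[y == 0, 0].std()   # naive CI-style threshold on one antigen
acc_1d = ((X[:, 0] > cutoff) == y).mean()
acc_2d = QuadraticDiscriminantAnalysis().fit(X, y).score(X, y)  # model both antigens jointly
print(acc_1d, acc_2d)
```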

  16. Educational Attainment in North Carolina Public Schools: Use of statistical...

    • data.mendeley.com
    Updated Nov 14, 2018
    Cite
    Scott Herford (2018). Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets. [Dataset]. http://doi.org/10.17632/6cm9wyd5g5.1
    Explore at:
    Dataset updated
    Nov 14, 2018
    Authors
    Scott Herford
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    North Carolina
    Description

    The purpose of data mining analysis is always to find patterns in the data using certain kinds of techniques, such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset. Before doing any work on the data, the data have to be pre-processed, and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. In our project, after using clustering prior to classification, the performance did not improve much. The reason it did not improve could be that the features we selected to perform clustering on are not well suited for it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics.

    From the dimensionality reduction perspective: this approach is different from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters as a technique for reducing the data dimension loses a lot of information, since clustering techniques are based on a metric of 'distance'. At high dimensions, Euclidean distance loses pretty much all meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always good, since you may lose almost all the information.

    From the creating-new-features perspective: clustering analysis creates labels based on the patterns of the data, and it brings uncertainties into the data. When using clustering prior to classification, the decision on the number of clusters strongly affects the performance of the clustering, which in turn affects the performance of classification. If the subset of features we apply clustering techniques to is well suited for it, this might increase the overall classification performance. For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better.

    We did not lock in the clustering outputs using a random_state, in order to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, maybe the data just does not cluster well with the methods selected at all. Basically, the ramification we saw was that our results are not much better than random when applying clustering to the data preprocessing. Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the model's real-world effectiveness and also to continue to revise the models from time to time as things change.
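    A minimal sketch of the preprocessing comparison discussed above, on synthetic data rather than the North Carolina school datasets: k-means cluster labels used as an engineered feature versus PCA, each feeding the same classifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=30, n_informative=10, random_state=0)

cluster_id = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X).reshape(-1, 1)
X_cluster = np.hstack([X, cluster_id])             # clustering used to create a new feature
X_pca = PCA(n_components=10).fit_transform(X)      # PCA used for dimensionality reduction

clf = LogisticRegression(max_iter=1000)
for name, data in [("raw", X), ("raw + cluster label", X_cluster), ("PCA", X_pca)]:
    print(name, cross_val_score(clf, data, y, cv=5).mean())
```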

  17. Mean squared error (×10−2) for data as in Table 2.

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Fan Chen; Guy Nason (2023). Mean squared error (×10−2) for data as in Table 2. [Dataset]. http://doi.org/10.1371/journal.pone.0229845.t003
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Fan Chen; Guy Nason
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Mean squared error (×10−2) for data as in Table 2.

  18. Multi-Dimensional Data Viewer (MDV) user manual for data exploration:...

    • zenodo.org
    • data.niaid.nih.gov
    pdf, zip
    Updated Jul 12, 2024
    Cite
    Maria Kiourlappou; Martin Sergeant; Joshua S. Titlow; Jeffrey Y. Lee; Darragh Ennis; Stephen Taylor; Ilan Davis (2024). Multi-Dimensional Data Viewer (MDV) user manual for data exploration: "Systematic analysis of YFP gene traps reveals common discordance between mRNA and protein across the nervous system" [Dataset]. http://doi.org/10.5281/zenodo.7738944
    Explore at:
    zip, pdfAvailable download formats
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Maria Kiourlappou; Martin Sergeant; Joshua S. Titlow; Jeffrey Y. Lee; Darragh Ennis; Stephen Taylor; Ilan Davis
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The explosion in the volume of biological imaging data challenges the available technologies for data interrogation and its intersection with related published bioinformatics data sets. Moreover, intersection of highly rich and complex datasets from different sources provided as flat csv files requires advanced informatics skills, which is time consuming and not accessible to all. Here, we provide a “user manual” to our new paradigm for systematically filtering and analysing a dataset with more than 1300 microscopy data figures using Multi-Dimensional Viewer (MDV: https://mdv.molbiol.ox.ac.uk), a solution for interactive multimodal data visualisation and exploration. The primary data we use are derived from our published systematic analysis of 200 YFP gene traps reveals common discordance between mRNA and protein across the nervous system (https://doi.org/10.1083/jcb.202205129). This manual provides the raw image data together with the expert annotations of the mRNA and protein distribution as well as associated bioinformatics data. We provide an explanation, with specific examples, of how to use MDV to make the multiple data types interoperable and explore them together. We also provide the open-source python code (github link) used to annotate the figures, which could be adapted to any other kind of data annotation task.

  19. Data from: Statistical Significance of Clustering with Multidimensional...

    • tandf.figshare.com
    bin
    Updated Feb 13, 2024
    Cite
    Hui Shen; Shankar Bhamidi; Yufeng Liu (2024). Statistical Significance of Clustering with Multidimensional Scaling [Dataset]. http://doi.org/10.6084/m9.figshare.23271877.v1
    Explore at:
    binAvailable download formats
    Dataset updated
    Feb 13, 2024
    Dataset provided by
    Taylor & Francis
    Authors
    Hui Shen; Shankar Bhamidi; Yufeng Liu
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Clustering is a fundamental tool for exploratory data analysis. One central problem in clustering is deciding if the clusters discovered by clustering methods are reliable as opposed to being artifacts of natural sampling variation. Statistical significance of clustering (SigClust) is a recently developed cluster evaluation tool for high-dimension, low-sample size data. Despite its successful application to many scientific problems, there are cases where the original SigClust may not work well. Furthermore, for specific applications, researchers may not have access to the original data and only have the dissimilarity matrix. In this case, clustering is still a valuable exploratory tool, but the original SigClust is not applicable. To address these issues, we propose a new SigClust method using multidimensional scaling (MDS). The underlying idea behind MDS-based SigClust is that one can achieve low-dimensional representations of the original data via MDS using only the dissimilarity matrix and then apply SigClust on the low-dimensional MDS space. The proposed MDS-based SigClust can circumvent the challenge of parameter estimation of the original method in high-dimensional spaces while keeping the essential clustering structure in the MDS space. Both simulations and real data applications demonstrate that the proposed method works remarkably well for assessing the statistical significance of clustering. Supplemental materials for the article are available online.
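    For orientation, a short Python sketch of the MDS step described above (SigClust itself is not re-implemented here): embed a precomputed dissimilarity matrix into a low-dimensional space, on which a cluster-significance test such as SigClust would then be run.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 500)),        # high-dimension, low-sample-size toy data
               rng.normal(1, 1, (30, 500))])
D = squareform(pdist(X))                           # only this dissimilarity matrix is assumed known

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
embedding = mds.fit_transform(D)                   # low-dimensional MDS coordinates
print(embedding.shape)                             # (60, 2); SigClust would be applied to these
```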

  20. Towards a unified multi-dimensional evaluator for text generation - Dataset...

    • service.tib.eu
    Updated Dec 16, 2024
    Cite
    (2024). Towards a unified multi-dimensional evaluator for text generation - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/towards-a-unified-multi-dimensional-evaluator-for-text-generation
    Explore at:
    Dataset updated
    Dec 16, 2024
    Description

    The NewsRoom dataset consists of 60 input source texts and 7 output summaries for each sample.
