100+ datasets found
  1. Prediction of early breast cancer patient survival using ensembles of...

    • plos.figshare.com
    docx
    Updated May 30, 2023
    Cite
    Inna Y. Gong; Natalie S. Fox; Vincent Huang; Paul C. Boutros (2023). Prediction of early breast cancer patient survival using ensembles of hypoxia signatures [Dataset]. http://doi.org/10.1371/journal.pone.0204123
    Explore at:
    docxAvailable download formats
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Inna Y. Gong; Natalie S. Fox; Vincent Huang; Paul C. Boutros
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Biomarkers are a key component of precision medicine. However, full clinical integration of biomarkers has been met with challenges, partly attributed to analytical difficulties. It has been shown that biomarker reproducibility is susceptible to data preprocessing approaches. Here, we systematically evaluated machine-learning ensembles of preprocessing methods as a general strategy to improve biomarker performance for prediction of survival from early breast cancer.

    Results: We risk-stratified breast cancer patients into low-risk or high-risk groups based on four published hypoxia signatures (Buffa, Winter, Hu, and Sorensen), using 24 different preprocessing approaches for microarray normalization. The 24 binary risk profiles determined for each hypoxia signature were combined using a random forest to evaluate the efficacy of a preprocessing ensemble classifier. We demonstrate that the best way of merging preprocessing methods varies from signature to signature, and that there is likely no 'best' preprocessing pipeline that is universal across datasets, highlighting the need to evaluate ensembles of preprocessing algorithms. Further, we developed novel signatures for each preprocessing method, and the risk classifications from each were incorporated into a meta random forest model. Interestingly, the classifications of these biomarkers and their ensemble show striking consistency, demonstrating that similar intrinsic biological information is being faithfully represented. These classification patterns further confirm that there is a subset of patients whose prognosis is consistently challenging to predict.

    Conclusions: Performance of different prognostic signatures varies with preprocessing method. A simple classifier based on unanimous voting of classifications is a reliable way of improving on single preprocessing methods. Future signatures will likely require integration of intrinsic and extrinsic clinico-pathological variables to better predict disease-related outcomes.
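    As an illustration of the ensemble strategy described above, the sketch below stacks binary risk calls from several preprocessing pipelines as features for a random forest and contrasts this with a simple unanimous-vote rule. The simulated data, variable names, and parameters are purely illustrative and are not the authors' pipeline.

    # Sketch: combine binary risk calls from many preprocessing pipelines with a
    # random forest, and compare with a unanimous-vote rule (illustrative only).
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    n_patients, n_pipelines = 200, 24

    # 0/1 risk call from each of the 24 preprocessing pipelines (simulated here)
    risk_calls = rng.integers(0, 2, size=(n_patients, n_pipelines))
    # simulated outcome loosely tied to the consensus of the risk calls
    outcome = (risk_calls.mean(axis=1) + rng.normal(0, 0.2, n_patients) > 0.5).astype(int)

    # Ensemble classifier over the 24 binary risk profiles
    rf = RandomForestClassifier(n_estimators=500, random_state=0)
    print("random forest CV accuracy:", cross_val_score(rf, risk_calls, outcome, cv=5).mean())

    # Unanimous-vote rule: call a patient high-risk only if every pipeline agrees
    unanimous_high = risk_calls.all(axis=1).astype(int)
    print("patients called high-risk by all pipelines:", unanimous_high.sum())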

  2. Data from: Assessing predictive performance of supervised machine learning...

    • data.niaid.nih.gov
    • datadryad.org
    • +1more
    zip
    Updated May 23, 2023
    Cite
    Evans Omondi (2023). Assessing predictive performance of supervised machine learning algorithms for a diamond pricing model [Dataset]. http://doi.org/10.5061/dryad.wh70rxwrh
    Explore at:
    zipAvailable download formats
    Dataset updated
    May 23, 2023
    Dataset provided by
    Strathmore University
    Authors
    Evans Omondi
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    The diamond is 58 times harder than any other mineral in the world, and its elegance as a jewel has long been appreciated. Forecasting diamond prices is challenging due to nonlinearity in important features such as carat, cut, clarity, table, and depth. Against this backdrop, the study conducted a comparative analysis of the performance of multiple supervised machine learning models (regressors and classifiers) in predicting diamond prices. Eight supervised machine learning algorithms were evaluated in this work: Multiple Linear Regression, Linear Discriminant Analysis, eXtreme Gradient Boosting, Random Forest, k-Nearest Neighbors, Support Vector Machines, Boosted Regression and Classification Trees, and Multi-Layer Perceptron. The analysis is based on data preprocessing, exploratory data analysis (EDA), training the aforementioned models, assessing their accuracy, and interpreting their results. Based on the performance metrics, eXtreme Gradient Boosting was the most effective algorithm in both classification and regression, with an R² score of 97.45% and an accuracy of 74.28%. As a result, eXtreme Gradient Boosting was recommended as the optimal regressor and classifier for forecasting the price of a diamond specimen.

    Methods: Kaggle, a data repository with thousands of datasets, was used in the investigation. It is an online community for machine learning practitioners and data scientists, as well as a robust, well-researched, and sufficient resource for analyzing various data sources. On Kaggle, users can search for and publish various datasets, explore them in a web-based data-science environment, and construct models.
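    A minimal sketch of the regression side of such a comparison, assuming a local copy of the usual Kaggle diamonds table with the feature columns named in the abstract; the file name and hyperparameters are illustrative.

    # Sketch: XGBoost regressor on diamond features to predict price (illustrative).
    import pandas as pd
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split
    from xgboost import XGBRegressor

    df = pd.read_csv("diamonds.csv")  # hypothetical local copy of the dataset
    X = pd.get_dummies(df[["carat", "cut", "clarity", "depth", "table"]])  # one-hot encode cut/clarity
    y = df["price"]

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    model = XGBRegressor(n_estimators=400, learning_rate=0.05, max_depth=6)
    model.fit(X_tr, y_tr)
    print("held-out R^2:", r2_score(y_te, model.predict(X_te)))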

  3. Data from: Enriching time series datasets using Nonparametric kernel...

    • figshare.com
    pdf
    Updated May 31, 2023
    Cite
    Mohamad Ivan Fanany (2023). Enriching time series datasets using Nonparametric kernel regression to improve forecasting accuracy [Dataset]. http://doi.org/10.6084/m9.figshare.1609661.v1
    Explore at:
    pdfAvailable download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Mohamad Ivan Fanany
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Improving the accuracy of predictions of future values from past and current observations has been pursued by enhancing prediction methods, combining those methods, or performing data pre-processing. In this paper, another approach is taken: increasing the number of inputs in the dataset. This approach is useful especially for shorter time series. By filling in the in-between values of the time series, the size of the training set can be increased, thus increasing the generalization capability of the predictor. The algorithm used to make predictions is a neural network, as it is widely used in the literature for time series tasks. For comparison, Support Vector Regression is also employed. The dataset used in the experiment is the frequency of USPTO patents and PubMed scientific publications in the field of health, namely on apnea, arrhythmia, and sleep stages. Another time series dataset, designated for the NN3 Competition in the field of transportation, is also used for benchmarking. The experimental results show that prediction performance can be significantly increased by filling in-between data in the time series. Furthermore, detrending and deseasonalization, which separate the data into trend, seasonal, and stationary components, also improve prediction performance on both the original and the filled datasets. The optimal increase in dataset size in this experiment is about five times the length of the original dataset.
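    The core idea, filling in-between values of a short series before training a predictor, can be sketched with Nadaraya-Watson kernel regression; the Gaussian bandwidth and five-fold densification below are illustrative, not the paper's exact procedure.

    # Sketch: densify a short time series by kernel regression before training.
    import numpy as np

    def nw_kernel_regression(t_known, y_known, t_query, bandwidth=1.0):
        # Nadaraya-Watson estimate of y at t_query from (t_known, y_known)
        diffs = (t_query[:, None] - t_known[None, :]) / bandwidth
        weights = np.exp(-0.5 * diffs**2)            # Gaussian kernel
        weights /= weights.sum(axis=1, keepdims=True)
        return weights @ y_known

    t = np.arange(0, 20.0)                           # original (short) series
    y = np.sin(t / 3.0) + 0.1 * np.random.randn(t.size)

    t_dense = np.linspace(t.min(), t.max(), 5 * t.size)   # ~5x more points
    y_dense = nw_kernel_regression(t, y, t_dense, bandwidth=0.8)
    print(y_dense.shape)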

  4. Additional file 1 of Impact of data preprocessing on cell-type clustering...

    • figshare.com
    • springernature.figshare.com
    xlsx
    Updated Feb 29, 2024
    + more versions
    Cite
    Chunxiang Wang; Xin Gao; Juntao Liu (2024). Additional file 1 of Impact of data preprocessing on cell-type clustering based on single-cell RNA-seq data [Dataset]. http://doi.org/10.6084/m9.figshare.13065586.v1
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Feb 29, 2024
    Dataset provided by
    figshare
    Figsharehttp://figshare.com/
    Authors
    Chunxiang Wang; Xin Gao; Juntao Liu
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This file contains 50 pairs of ARI and C-score values generated by running SC3 50 times on each data set.

  5. Youtube cookery channels viewers comments in Hinglish

    • zenodo.org
    bin, csv
    Updated Jan 24, 2020
    + more versions
    Cite
    Abhishek Kaushik; Abhishek Kaushik; Gagandeep Kaur; Gagandeep Kaur (2020). Youtube cookery channels viewers comments in Hinglish [Dataset]. http://doi.org/10.5281/zenodo.2827025
    Explore at:
    csv, binAvailable download formats
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Abhishek Kaushik; Abhishek Kaushik; Gagandeep Kaur; Gagandeep Kaur
    License

    Open Data Commons Attribution License (ODC-By) v1.0https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Area covered
    YouTube
    Description

    The data was collected from famous cookery YouTube channels in India. The major focus was to collect viewers' comments in the Hinglish language. The datasets are taken from the top two Indian cooking channels, the Nisha Madhulika channel and Kabita's Kitchen channel.

    Comments in both datasets are divided into seven categories:

    Label 1- Gratitude

    Label 2- About the recipe

    Label 3- About the video

    Label 4- Praising

    Label 5- Hybrid

    Label 6- Undefined

    Label 7- Suggestions and queries

    All the labelling has been done manually.

    Nisha Madhulika dataset:

    Dataset characteristics: Multivariate

    Number of instances: 4900

    Area: Cooking

    Attribute characteristics: Real

    Number of attributes: 4

    Date donated: March, 2019

    Associate tasks: Classification

    Missing values: Null

    Kabita Kitchen dataset:

    Dataset characteristics: Multivariate

    Number of instances: 4900

    Area: Cooking

    Attribute characteristics: Real

    Number of attributes: 4

    Date donated: March, 2019

    Associate tasks: Classification

    Missing values: Null

    There are two separate dataset files for each channel, a preprocessing file and a main file.

    The preprocessing files were generated after preprocessing and exploratory data analysis on both datasets (a sketch of deriving these columns follows the lists below). Each preprocessing file includes:

    • Id
    • User
    • Comment text
    • Labels
    • Count of stop-words
    • Uppercase words
    • Hashtags
    • Word count
    • Char count
    • Average words
    • Numeric

    The main file includes:

    • Id
    • user
    • comment text
    • Labels

  6. Community Detection to Split Large-scale Assemblies in Subassemblies

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 19, 2023
    Cite
    Münker, Sören (2023). Community Detection to Split Large-scale Assemblies in Subassemblies [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8260584
    Explore at:
    Dataset updated
    Aug 19, 2023
    Dataset authored and provided by
    Münker, Sören
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is motivated by the need to preprocess large-scale CAD models for assembly-by-disassembly approaches. Assembly-by-disassembly is only suitable for assemblies with a small number of parts (n_parts < 22). When dealing with large-scale products of high complexity, however, the CAD models may not contain feasible subassemblies (e.g. with connected and interference-free parts) and have too many parts to be processed with assembly-by-disassembly. Product designers' preferences during the design phase might not be ideal for assembly-by-disassembly processing because they do not explicitly consider subassembly feasibility and the number of parts per subassembly. An automated preprocessing approach is proposed to address this issue by splitting the model into manageable partitions using community detection. This allows for parallelised, efficient and accurate assembly-by-disassembly of large-scale CAD models. However, applying community detection methods to automatically split CAD models into smaller subassemblies is a new concept, and research on its suitability for ASP needs to be conducted. Therefore, the following underlying research question is answered in these experiments:

    Underlying research question 2: Can automated preprocessing increase the suitability of CAD-based assembly-by-disassembly for large-scale products?

    A hypothesis is formulated to answer this research question, which will be utilised to design experiments for hypothesis testing.

    Hypothesis 2: Community detection algorithms can be applied to automatically split large-scale assemblies into suitable candidates for CAD-based AND/OR graph generation.
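    A minimal sketch of the community-detection step, assuming the assembly is represented as a part/contact graph; the toy graph and the choice of a modularity-based algorithm from networkx are illustrative, not necessarily the dataset's own pipeline.

    # Sketch: split an assembly graph into subassemblies via community detection.
    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    # Nodes are parts; edges are contacts/connections between parts (toy example).
    G = nx.Graph()
    G.add_edges_from([
        ("bolt_1", "bracket"), ("bracket", "frame"), ("frame", "panel"),
        ("panel", "hinge"), ("hinge", "door"), ("door", "handle"),
        ("frame", "bolt_2"),
    ])

    communities = greedy_modularity_communities(G)
    for i, parts in enumerate(communities):
        print(f"subassembly {i}: {sorted(parts)}")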

  7. Mediapipe based Preprocessed VGGFace2 Dataset

    • zenodo.org
    jpeg, zip
    Updated Mar 24, 2025
    Cite
    Syed Taimoor Hussain Shah; Syed Taimoor Hussain Shah; Syed Adil Hussain Shah; Syed Adil Hussain Shah; Ammara Zamir; Kainat Qayyum; Syed Baqir Hussain Shah; Syeda Maryam Fatima; Marco Agostino Deriu; Ammara Zamir; Kainat Qayyum; Syed Baqir Hussain Shah; Syeda Maryam Fatima; Marco Agostino Deriu (2025). Mediapipe based Preprocessed VGGFace2 Dataset [Dataset]. http://doi.org/10.5281/zenodo.15078557
    Explore at:
    jpeg, zipAvailable download formats
    Dataset updated
    Mar 24, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Syed Taimoor Hussain Shah; Syed Taimoor Hussain Shah; Syed Adil Hussain Shah; Syed Adil Hussain Shah; Ammara Zamir; Kainat Qayyum; Syed Baqir Hussain Shah; Syeda Maryam Fatima; Marco Agostino Deriu; Ammara Zamir; Kainat Qayyum; Syed Baqir Hussain Shah; Syeda Maryam Fatima; Marco Agostino Deriu
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    VGGFace2 Dataset and Face Mesh Preprocessing
    Introduction
    The VGGFace2 dataset is a large-scale face recognition dataset containing over 3.31 million images of 9,131 identities, with an average of 362 images per identity. The dataset is designed to include extensive variations in pose, age, illumination, ethnicity, and profession, making it one of the most diverse and challenging face recognition datasets available. For more details, please refer to the original publication:
    VGGFace2: A dataset for recognizing faces across pose and age - DOI: 10.48550/arXiv.1710.08092

    Preprocessing Using MediaPipe 3D Face Mesh
    On this dataset, we applied the MediaPipe-based 3D face mesh algorithm to accurately detect faces while removing all background elements, including hair. Our preprocessing strictly retained facial landmarks, ensuring that only the essential facial features were preserved. This approach significantly enhanced the accuracy and generalization of our model, as the model was trained exclusively on landmark-based facial data.
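    A rough sketch of this kind of landmark-based masking with the MediaPipe face-mesh solution is shown below; the convex-hull mask and file paths are assumptions and may differ from the preprocessing actually used for this dataset.

    # Sketch: detect face-mesh landmarks and keep only the facial region.
    import cv2
    import numpy as np
    import mediapipe as mp

    img = cv2.imread("face.jpg")                      # illustrative input path
    rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

    with mp.solutions.face_mesh.FaceMesh(static_image_mode=True, max_num_faces=1) as fm:
        res = fm.process(rgb)

    if res.multi_face_landmarks:
        h, w = img.shape[:2]
        pts = np.array([(int(p.x * w), int(p.y * h))
                        for p in res.multi_face_landmarks[0].landmark], dtype=np.int32)
        mask = np.zeros((h, w), dtype=np.uint8)
        cv2.fillConvexPoly(mask, cv2.convexHull(pts), 255)   # facial region only
        face_only = cv2.bitwise_and(img, img, mask=mask)     # background removed
        cv2.imwrite("face_masked.jpg", face_only)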

    Training and Performance
    The preprocessed data was used to train an Xception model, which produced remarkably accurate outcomes due to the strictly landmark-based facial representation. The model demonstrated robust performance, including in explainable-AI analyses, indicating that eliminating unnecessary background elements contributed positively to its efficiency and reliability.

    Citation
    If you use this dataset or the preprocessed version in your work, please cite both of the following:

    VGGFace2 Dataset:

    @article{Cao2018VGGFace2,
    title={VGGFace2: A dataset for recognizing faces across pose and age},
    author={Cao, Qiong and Shen, Li and Xie, Weidi and Parkhi, Omkar M and Zisserman, Andrew},
    journal={arXiv preprint arXiv:1710.08092},
    year={2018}
    }


    DOI: [10.48550/arXiv.1710.08092](https://doi.org/10.48550/arXiv.1710.08092)
    Preprocessed Dataset using MediaPipe:

    @dataset{Shah2025_MediaPipe_FaceMesh,
    title={MediaPipe-based 3D Face Mesh Preprocessed VGGFace2 Dataset},
    author={Shah, Syed Taimoor Hussain and Shah, Syed Adil Hussain and Zamir, Ammara and Qayyum, Kainat and Shah, Syed Baqir Hussain and Fatima, Syeda Maryam and Deriu, Marco Agostino},
    year={2025},
    doi={10.5281/zenodo.15078557}
    }
    DOI: [10.5281/zenodo.15078557](https://doi.org/10.5281/zenodo.15078557)


    Contact
    For any questions or further details, please feel free to contact us.
    Syed Taimoor Hussain Shah
    PolitoBIOMed Lab, Department of Mechanical and Aerospace Engineering, Politecnico di Torino, Turin, Italy
    Email: taimoor.shah@polito.it
    ORCID: 0000-0002-6010-6777

  8. Data_Sheet_1_Assessing the Impact of Data Preprocessing on Analyzing Next...

    • frontiersin.figshare.com
    pdf
    Updated Jun 2, 2023
    Cite
    Binsheng He; Rongrong Zhu; Huandong Yang; Qingqing Lu; Weiwei Wang; Lei Song; Xue Sun; Guandong Zhang; Shijun Li; Jialiang Yang; Geng Tian; Pingping Bing; Jidong Lang (2023). Data_Sheet_1_Assessing the Impact of Data Preprocessing on Analyzing Next Generation Sequencing Data.pdf [Dataset]. http://doi.org/10.3389/fbioe.2020.00817.s001
    Explore at:
    pdfAvailable download formats
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Frontiers
    Authors
    Binsheng He; Rongrong Zhu; Huandong Yang; Qingqing Lu; Weiwei Wang; Lei Song; Xue Sun; Guandong Zhang; Shijun Li; Jialiang Yang; Geng Tian; Pingping Bing; Jidong Lang
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data quality control and preprocessing are often the first steps in processing next-generation sequencing (NGS) data of tumors. They not only help us evaluate the quality of sequencing data, but also help us obtain high-quality data for downstream analysis. However, by comparing the analysis results of data preprocessed with Cutadapt, FastP, and Trimmomatic against raw sequencing data, we found that the frequency of mutation detection showed fluctuations and differences, and that human leukocyte antigen (HLA) typing directly yielded erroneous results. We believe our research has demonstrated the impact of data preprocessing steps on downstream analysis results. We hope it will promote the development or optimization of better data preprocessing methods, so that downstream analysis can be more accurate.

  9. Data from: COVID-19 and media dataset: Mining textual data according periods...

    • dataverse.cirad.fr
    application/x-gzip +1
    Updated Dec 21, 2020
    Cite
    Mathieu Roche; Mathieu Roche (2020). COVID-19 and media dataset: Mining textual data according periods and countries (UK, Spain, France) [Dataset]. http://doi.org/10.18167/DVN1/ZUA8MF
    Explore at:
    application/x-gzip(511157), application/x-gzip(97349), text/x-perl-script(4982), application/x-gzip(93110), application/x-gzip(23765310), application/x-gzip(107669)Available download formats
    Dataset updated
    Dec 21, 2020
    Authors
    Mathieu Roche; Mathieu Roche
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    France, United Kingdom, Spain
    Dataset funded by
    ANR (#DigitAg)
    Horizon 2020 - European Commission - (MOOD project)
    Description

    These datasets contain a set of news articles in English, French and Spanish extracted from Medisys (i.e. advanced search) according to the following criteria: (1) keywords (at least one of): COVID-19, ncov2019, cov2019, coronavirus; (2) keywords (all of): masque (French), mask (English), máscara (Spanish); (3) periods: March 2020, May 2020, July 2020; (4) countries: UK (English), Spain (Spanish), France (French). A corpus per country was manually collected (copy/paste) from Medisys. For each country, 100 snippets per period (the 1st, 10th, 15th and 20th of each month) were built. The datasets are composed of: (1) a corpus preprocessed for the BioTex tool - https://gitlab.irstea.fr/jacques.fize/biotex_python (.txt) [~900 texts]; (2) the same corpus preprocessed for the Weka tool - https://www.cs.waikato.ac.nz/ml/weka/ (.arff); (3) terms extracted with BioTex according to spatio-temporal criteria (*.csv) [~9,000 terms]. Other corpora can be collected with this same method. The Perl code used to preprocess the textual data for terminology extraction (with BioTex) and classification (with Weka) is available. A new version of this dataset (December 2020) includes additional data: Python preprocessing and BioTex code [Execution_BioTex.tgz]; and terms extracted with different ranking measures (i.e. C-Value, F-TFIDF-C_M) and methods (i.e. extraction of words and multi-word terms) with the online version of BioTex [Terminology_with_BioTex_online_dec2020.tgz].
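    The original preprocessing scripts are in Perl; as a rough illustration of the Weka side of the pipeline, the Python sketch below writes a small labelled snippet corpus to ARFF. Field names and example snippets are invented for illustration and do not reproduce the dataset's actual schema.

    # Sketch: write labelled text snippets to Weka's ARFF format (illustrative).
    snippets = [
        ("Masks are now mandatory on public transport.", "UK", "March2020"),
        ("Le masque devient obligatoire dans les commerces.", "France", "July2020"),
    ]

    with open("covid_snippets.arff", "w", encoding="utf-8") as f:
        f.write("@relation covid_media\n")
        f.write("@attribute text string\n")
        f.write("@attribute country {UK,Spain,France}\n")
        f.write("@attribute period {March2020,May2020,July2020}\n")
        f.write("@data\n")
        for text, country, period in snippets:
            f.write("'%s',%s,%s\n" % (text.replace("'", "\\'"), country, period))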

  10. Data life cycle from survey files, pre-processing, analysis to visualisation...

    • figshare.unimelb.edu.au
    png
    Updated Jul 30, 2025
    Cite
    Amanda Belton; Mark Selkrig; Sharon McDonough; R.K. Keamy; Robyn Brandenburg (2025). Data life cycle from survey files, pre-processing, analysis to visualisation [Dataset]. http://doi.org/10.26188/29670347.v1
    Explore at:
    pngAvailable download formats
    Dataset updated
    Jul 30, 2025
    Dataset provided by
    The University of Melbourne
    Authors
    Amanda Belton; Mark Selkrig; Sharon McDonough; R.K. Keamy; Robyn Brandenburg
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Diagram of the data life cycle: the survey data files were collated and pre-processed for analysis (illustrated with an image of a white baby alongside respondent text and demographic details); a snapshot shows how these were analysed on a Miro board with lines and sticky notes; and the analysis was then visualised as data portraits, data quilts and quilted bar charts.

  11. Educational Attainment in North Carolina Public Schools: Use of statistical...

    • data.mendeley.com
    Updated Nov 14, 2018
    Cite
    Scott Herford (2018). Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets. [Dataset]. http://doi.org/10.17632/6cm9wyd5g5.1
    Explore at:
    Dataset updated
    Nov 14, 2018
    Authors
    Scott Herford
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    North Carolina
    Description

    The purpose of data mining analysis is always to find patterns in the data using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset: before doing any work on the data, it has to be pre-processed, and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimensionality of the data and create new features. In our project, after using clustering prior to classification, performance did not improve much. The reason may be that the features we selected for clustering are not well suited to it. Because of the nature of the data, classification tasks provide more information to work with in terms of improving knowledge and overall performance metrics.

    From the dimensionality-reduction perspective: this differs from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters to reduce the data dimension can lose a lot of information, since clustering techniques are based on a metric of 'distance', and at high dimensions Euclidean distance loses nearly all meaning. Therefore, 'reducing' dimensionality by mapping data points to cluster numbers is not always good, since you may lose almost all the information.

    From the new-features perspective: clustering analysis creates labels based on patterns in the data, which introduces uncertainty. When clustering is used prior to classification, the choice of the number of clusters strongly affects clustering performance and, in turn, classification performance. If the subset of features used for clustering is well suited to it, the overall classification performance might increase; for example, if the features used for k-means are numerical and low-dimensional, overall classification performance may be better. We did not lock in the clustering outputs with a random_state, in order to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, the data may simply not cluster well with the selected methods. The ramification we saw was that our results were not much better than random when applying clustering in the data preprocessing.

    Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the models' real-world effectiveness and to continue to revise the models from time to time as things change.
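    A minimal sketch of the pipeline discussed above, adding a k-means cluster label as an extra feature before classification and comparing against classification alone; the synthetic data and parameters are illustrative.

    # Sketch: cluster labels as an added feature prior to classification.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    baseline = RandomForestClassifier(random_state=0)
    print("baseline accuracy:", cross_val_score(baseline, X, y, cv=5).mean())

    # Add the cluster assignment as a new feature (random_state fixed for stability)
    clusters = KMeans(n_clusters=8, random_state=0, n_init=10).fit_predict(X)
    X_aug = np.column_stack([X, clusters])
    augmented = RandomForestClassifier(random_state=0)
    print("with cluster feature:", cross_val_score(augmented, X_aug, y, cv=5).mean())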

  12. spatial_frequency_preferences

    • openneuro.org
    Updated Aug 25, 2021
    Cite
    William F. Broderick; Jonathan Winawer; Eero P. Simoncelli (2021). spatial_frequency_preferences [Dataset]. http://doi.org/10.18112/openneuro.ds003812.v1.0.0
    Explore at:
    Dataset updated
    Aug 25, 2021
    Dataset provided by
    OpenNeurohttps://openneuro.org/
    Authors
    William F. Broderick; Jonathan Winawer; Eero P. Simoncelli
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    README

    This spatial_frequency_preferences dataset contains the data from the paper "Mapping Spatial Frequency Preferences Across Human Primary Visual Cortex", by William F. Broderick, Eero P. Simoncelli, and Jonathan Winawer. ADD LINK

    In this experiment, we measured the BOLD responses of 12 human observers to a set of novel grating stimuli in order to measure the spatial frequency tuning in primary visual cortex across eccentricities, retinotopic angles, and stimulus orientations. We then fit a parametric model which fits all voxels for a given subject simultaneously, predicting each voxel's response as a function of the voxel's retinotopic location and the stimulus local spatial frequency and orientation.

    This dataset contains the minimally pre-processed, BIDS-compliant data required to reproduce the analyses presented in the paper. In addition to the task imaging data and stimuli files, it contains three derivatives directories:

    • freesurfer: freesurfer subject directories for each subject, with one change: the contents of the mri/ directories have been defaced.
    • prf_solutions: solutions to the population receptive field models from a separate retinotopy experiment for each subject, fit using VistaSoft. Also contains the Benson retinotopic atlases for each subject (Benson et al., 2014) and the solutions for Bayesian retinotopic analyses (Benson and Winawer, 2018) -- the solutions to the Bayesian retinotopy are what we actually use in the paper.
    • preprocessed: the preprocessed data (a custom script was used for preprocessing, found on the Winawer Lab Github at https://github.com/WinawerLab/MRI_tools/; see the Winawer Lab wiki at https://wikis.nyu.edu/pages/viewpage.action?pageId=86054639 for more details). See paper for description of steps taken. Results should not change substantially if fMRIPrep were to be used for preprocessing instead, as long as data is kept in individual subject space.

    This dataset is presented with the intention of enabling re-running our analyses to reproduce our results with our accompanying Github repo (https://github.com/billbrod/spatial-frequency-preferences). This dataset should contain sufficient information for re-analysis with a novel method, but there are no guarantees.

    Details related to access to the data

    If you use this dataset in a publication, please cite the corresponding paper.

    • Contact person: William F. Broderick, ORCID 0000-0002-8999-9003, wfb229@nyu.edu

    This dataset is hosted on OpenNeuro and can be downloaded from there. We also present two additional variants of this data, both hosted on this project's OSF page:

    • Fully-processed data: contains the final output of our analyses, the data required to reproduce the figures as they appear in the paper.
    • Partially-processed data: contains the outputs of GLMdenoise and all data required to start fitting the spatial frequency response functions.

    Both data sets build on top of this one and so require the data contained here as well.

    All three of these variants may be downloaded using code found in the Github repo (https://github.com/billbrod/spatial-frequency-preferences); see the README there for more details.

    Overview

    • Spatial frequency preferences

    • Year(s) that the project ran: started gathering pilot data in 2017, this dataset was gathered in the springs of 2019 and 2020. Paper written in 2020 and 2021, submitted fall 2021.

    • Brief overview of the tasks in the experiment: subjects viewed the stimuli, fixating on the center of the images. A sequence of digits, alternating black and white, was presented at fixation; subjects pressed a button whenever a digit repeated. The behavioral data was not presented in the paper and so is not present here. See paper for more details.

    • Description of the contents of the dataset:

      • Summary:
      • 767 Files
      • 12 subjects
      • 1 session each
      • Available tasks:
      • sfprescaled
      • Available modalities:
      • MRI
    • Quality assessment of the data: the MRIQC reports for each included scan can be found on this project's OSF page

    Methods

    Subjects

    Subjects were recruited from graduate students and postdocs at NYU, all experienced MRI participants.

    Apparatus

    Data was gathered on NYU's Center for Brain Imaging's Siemens Prisma 3T MRI scanner in a shielded room. Data was gathered with subjects lying down, with the stimuli projected onto a screen above their head.

    Initial setup

    When subjects arrived, subjects were briefed on the task, given the experimental consent form to read and sign, and talked through the screener form.

    Task organization

    This experiment has only a single task.

    Task details

    Subjects passively viewed the stimuli while performing the distractor task described above: viewing a stream of alternating black and white digits and pressing a button whenever a digit repeated. Their button presses were recorded.

    Additional data acquired

    No additional data gathered.

    Experimental location

    All data gathered at NYU's Center for Brain Imaging in New York, NY.

    Missing data

    One subject (sub-wlsubj045) only has 7 of the 12 runs, due to technical issues that came up during the run. The quality of their GLMdenoise fits and their final model fits do not appear to vary much from that of the other subjects.

  13. Data from: A Deep Learning and XGBoost-based Method for Predicting...

    • data.mendeley.com
    • narcis.nl
    Updated Aug 3, 2021
    + more versions
    Cite
    pan wang (2021). A Deep Learning and XGBoost-based Method for Predicting Protein-protein Interaction Sites [Dataset]. http://doi.org/10.17632/9tft3vz5tm.2
    Explore at:
    Dataset updated
    Aug 3, 2021
    Authors
    pan wang
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    local_feature_training_set.csv: preprocessed feature-extractor output with 65,869 rows and 344 columns; rows are samples, the first 343 columns are features, and the last column is the label.

    local_feature_testing_set.csv: preprocessed feature-extractor output with 11,791 rows and 344 columns; rows are samples, the first 343 columns are features, and the last column is the label.

    global&local_feature_training_set.csv: preprocessed feature-extractor output with 65,869 rows and 1,028 columns; rows are samples, the first 1,027 columns are features, and the last column is the label.

    global&local_feature_testing_set.csv: preprocessed feature-extractor output with 11,791 rows and 1,028 columns; rows are samples, the first 1,027 columns are features, and the last column is the label.
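    A short sketch of loading these files and separating features from the label column, assuming the last column is the label as stated; adjust paths and header handling as needed.

    # Sketch: split the described CSVs into feature matrix X and label vector y.
    import pandas as pd

    train = pd.read_csv("local_feature_training_set.csv")   # assumes a header row
    test = pd.read_csv("local_feature_testing_set.csv")

    X_train, y_train = train.iloc[:, :-1], train.iloc[:, -1]   # first 343 columns are features
    X_test, y_test = test.iloc[:, :-1], test.iloc[:, -1]
    print(X_train.shape, y_train.shape)   # expected roughly (65869, 343) (65869,)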

  14. AI Data Management Market Analysis, Size, and Forecast 2025-2029: North...

    • technavio.com
    Updated Jul 23, 2025
    Cite
    Technavio (2025). AI Data Management Market Analysis, Size, and Forecast 2025-2029: North America (US and Canada), Europe (France, Germany, Italy, and UK), APAC (China, India, Japan, and South Korea), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/ai-data-management-market-industry-analysis
    Explore at:
    Dataset updated
    Jul 23, 2025
    Dataset provided by
    TechNavio
    Authors
    Technavio
    Time period covered
    2021 - 2025
    Area covered
    Global, United States, Canada
    Description


    AI Data Management Market Size 2025-2029

    The AI data management market size is forecast to increase by USD 51.04 billion at a CAGR of 19.7% between 2024 and 2029.

    The market is experiencing significant growth, driven by the proliferation of generative AI and large language models. These advanced technologies are increasingly being adopted across industries, leading to an exponential increase in data generation and the need for efficient data management solutions. Furthermore, the ascendancy of data-centric AI and the industrialization of data curation are key trends shaping the market. However, the market also faces challenges. Extreme data complexity and quality assurance at scale pose significant obstacles.
    Companies seeking to capitalize on the opportunities presented by the market must invest in solutions that address these challenges effectively. By doing so, they can gain a competitive edge, improve operational efficiency, and unlock new revenue streams. Ensuring data accuracy, completeness, and consistency across vast datasets is a daunting task, requiring sophisticated data management tools and techniques. Cloud computing is a key trend in the market, as cloud-based solutions offer quick deployment, flexibility, and scalability.
    

    What will be the Size of the AI Data Management Market during the forecast period?

    Explore in-depth regional segment analysis with market size data - historical 2019-2023 and forecasts 2025-2029 - in the full report.

    The market for AI data management continues to evolve, with applications spanning various sectors, from finance to healthcare and retail. The model training process involves intricate data preprocessing steps, feature selection techniques, and data pipeline design to ensure optimal model performance. Real-time data processing and anomaly detection techniques are crucial for effective model monitoring systems, while data access management and data security measures ensure data privacy compliance. Data lifecycle management, including data validation techniques, metadata management strategy, and data lineage management, is essential for maintaining data quality.

    Data governance framework and data versioning system enable effective data governance strategy and data privacy compliance. For instance, a leading retailer reported a 20% increase in sales due to implementing data quality monitoring and AI model deployment. The industry anticipates a 25% growth in the market size by 2025, driven by the continuous unfolding of market activities and evolving patterns. Data integration tools, data pipeline design, data bias detection, data visualization tools, and data encryption techniques are key components of this dynamic landscape. Statistical modeling methods and predictive analytics models rely on cloud data solutions and big data infrastructure for efficient data processing.

    How is this AI Data Management Industry segmented?

    The AI data management industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.

    Component
      • Platform
      • Software tools
      • Services

    Technology
      • Machine learning
      • Natural language processing
      • Computer vision
      • Context awareness

    End-user
      • BFSI
      • Retail and e-commerce
      • Healthcare and life sciences
      • Manufacturing
      • Others

    Geography
      • North America (US, Canada)
      • Europe (France, Germany, Italy, UK)
      • APAC (China, India, Japan, South Korea)
      • Rest of World (ROW)

    By Component Insights

    The Platform segment is estimated to witness significant growth during the forecast period. In the dynamic and evolving world of data management, integrated platforms have emerged as a foundational and increasingly dominant category. These platforms offer a unified environment for managing both data and AI workflows, addressing the strategic imperative for enterprises to break down silos between data engineering, data science, and machine learning operations. The market trajectory is heavily influenced by the rise of the data lakehouse architecture, which combines the scalability and cost efficiency of data lakes with the performance and management features of data warehouses. Data preprocessing techniques and validation rules ensure data accuracy and consistency, while data access control maintains security and privacy.

    Machine learning models, model performance evaluation, and anomaly detection algorithms drive insights and predictions, with feature engineering methods and real-time data streaming enabling continuous learning. Data lifecycle management, data quality metrics, and data governance policies ensure data integrity and compliance. Cloud data warehousing and data lake architecture facilitate efficient data storage and

  15. Employment Of India CLeaned and Messy Data

    • kaggle.com
    Updated Apr 7, 2025
    Cite
    SONIA SHINDE (2025). Employment Of India CLeaned and Messy Data [Dataset]. https://www.kaggle.com/datasets/soniaaaaaaaa/employment-of-india-cleaned-and-messy-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more about this at mlcommons.org/croissant)
    Dataset updated
    Apr 7, 2025
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    SONIA SHINDE
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Area covered
    India
    Description

    This dataset presents a dual-version representation of employment-related data from India, crafted to highlight the importance of data cleaning and transformation in any real-world data science or analytics project.

    🔹 Dataset Composition:

    It includes two parallel datasets:

    1. Messy Dataset (Raw) – represents a typical unprocessed dataset often encountered in data collection from surveys, databases, or manual entries.
    2. Cleaned Dataset – demonstrates how proper data preprocessing can significantly enhance the quality and usability of data for analytical and visualization purposes.

    Each record captures multiple attributes related to individuals in the Indian job market, including:
    - Age Group
    - Employment Status (Employed/Unemployed)
    - Monthly Salary (INR)
    - Education Level
    - Industry Sector
    - Years of Experience
    - Location
    - Perceived AI Risk
    - Date of Data Recording

    Transformations & Cleaning Applied:

    The raw dataset underwent comprehensive transformations to convert it into its clean, analysis-ready form (a pandas sketch of these steps appears below):
    - Missing Values: identified and handled using either row elimination (where critical data was missing) or imputation techniques.
    - Duplicate Records: identified using row comparison and removed to prevent analytical skew.
    - Inconsistent Formatting: unified inconsistent column naming (like 'monthly_salary_(inr)' → 'Monthly Salary (INR)'), capitalization, and string spacing.
    - Incorrect Data Types: converted columns like salary from string/object to float for numerical analysis.
    - Outliers: detected and handled based on domain logic and distribution analysis.
    - Categorization: converted numeric ages into grouped age categories for comparative analysis.
    - Standardization: applied uniform labels for employment status, industry names, education, and AI risk levels for visualization clarity.
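    A hedged pandas sketch of a few of these steps is shown below; the file name and column names are assumptions for illustration, not the exact dataset schema.

    # Sketch: typical cleaning steps on the messy file (names are hypothetical).
    import pandas as pd

    df = pd.read_csv("employment_india_messy.csv")            # hypothetical path

    # Duplicate records
    df = df.drop_duplicates()

    # Inconsistent column naming, e.g. 'monthly_salary_(inr)' -> 'Monthly Salary (INR)'
    df = df.rename(columns={"monthly_salary_(inr)": "Monthly Salary (INR)"})

    # Incorrect data types: salary stored as text -> numeric
    df["Monthly Salary (INR)"] = pd.to_numeric(df["Monthly Salary (INR)"], errors="coerce")

    # Missing values: drop rows missing critical fields, impute the rest
    df = df.dropna(subset=["Employment Status"])               # hypothetical column
    df["Monthly Salary (INR)"] = df["Monthly Salary (INR)"].fillna(
        df["Monthly Salary (INR)"].median())

    # Categorization: numeric age into grouped age categories (hypothetical 'Age' column)
    df["Age Group"] = pd.cut(df["Age"], bins=[0, 25, 40, 60, 100],
                             labels=["<25", "25-39", "40-59", "60+"])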

    Purpose & Utility:

    This dataset is ideal for learners and professionals who want to understand: - The impact of messy data on visualization and insights - How transformation steps can dramatically improve data interpretation - Practical examples of preprocessing techniques before feeding into ML models or BI tools

    It's also useful for: - Training ML models with clean inputs
    - Data storytelling with visual clarity
    - Demonstrating reproducibility in data cleaning pipelines

    By examining both the messy and clean datasets, users gain a deeper appreciation for why “garbage in, garbage out” rings true in the world of data science.

  16. Synthetic Stroke Prediction Dataset

    • data.mendeley.com
    Updated May 2, 2025
    Cite
    Mohammed Borhan Uddin (2025). Synthetic Stroke Prediction Dataset [Dataset]. http://doi.org/10.17632/s2nh6fm925.1
    Explore at:
    Dataset updated
    May 2, 2025
    Authors
    Mohammed Borhan Uddin
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is a synthetic version inspired by the original "Stroke Prediction Dataset" on Kaggle. It contains anonymized, artificially generated data intended for research and model training on healthcare-related stroke prediction. The dataset generated using GPT-4o contains 50,000 records and 12 features. The target variable is stroke, a binary classification where 1 represents stroke occurrence and 0 represents no stroke. The dataset includes both numerical and categorical features, requiring preprocessing steps before analysis. A small portion of the entries includes intentionally introduced missing values to allow users to practice various data preprocessing techniques such as imputation, missing data analysis, and cleaning. The dataset is suitable for educational and research purposes, particularly in machine learning tasks related to classification, healthcare analytics, and data cleaning. No real-world patient information was used in creating this dataset.
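    A minimal preprocessing sketch for practising the imputation and encoding steps mentioned above; the file path is hypothetical and column types are inferred at run time, so the code does not assume the exact schema.

    # Sketch: impute missing values and one-hot encode categoricals before modelling.
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder

    df = pd.read_csv("synthetic_stroke.csv")               # hypothetical path
    y = df["stroke"]                                        # binary target as described
    X = df.drop(columns=["stroke"])

    numeric = X.select_dtypes(include="number").columns
    categorical = X.select_dtypes(exclude="number").columns

    preprocess = ColumnTransformer([
        ("num", SimpleImputer(strategy="median"), numeric),
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), categorical),
    ])
    X_ready = preprocess.fit_transform(X)
    print(X_ready.shape)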

  17. Weather Type Classification

    • kaggle.com
    Updated Jun 23, 2024
    Cite
    Nikhil Narayan (2024). Weather Type Classification [Dataset]. https://www.kaggle.com/datasets/nikhil7280/weather-type-classification/suggestions?status=pending
    Explore at:
    Croissant (a format for machine-learning datasets; learn more about this at mlcommons.org/croissant)
    Dataset updated
    Jun 23, 2024
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Nikhil Narayan
    Description

    Description

    This dataset is synthetically generated to mimic weather data for classification tasks. It includes various weather-related features and categorizes the weather into four types: Rainy, Sunny, Cloudy, and Snowy. This dataset is designed for practicing classification algorithms, data preprocessing, and outlier detection methods.

    Variables

    • Temperature (numeric): The temperature in degrees Celsius, ranging from extreme cold to extreme heat.
    • Humidity (numeric): The humidity percentage, including values above 100% to introduce outliers.
    • Wind Speed (numeric): The wind speed in kilometers per hour, with a range including unrealistically high values.
    • Precipitation (%) (numeric): The precipitation percentage, including outlier values.
    • Cloud Cover (categorical): The cloud cover description.
    • Atmospheric Pressure (numeric): The atmospheric pressure in hPa, covering a wide range.
    • UV Index (numeric): The UV index, indicating the strength of ultraviolet radiation.
    • Season (categorical): The season during which the data was recorded.
    • Visibility (km) (numeric): The visibility in kilometers, including very low or very high values.
    • Location (categorical): The type of location where the data was recorded.
    • Weather Type (categorical): The target variable for classification, indicating the weather type.

    Purpose and Utility

    This dataset is useful for data scientists, students (especially beginners), and practitioners who want to investigate the performance of classification algorithms, practice data preprocessing, feature engineering, and model evaluation, and test outlier detection methods. It provides opportunities for learning and experimenting with weather data analysis and machine learning techniques.
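    As a starting point for that kind of practice, the sketch below caps two of the intentionally out-of-range variables and fits a simple classifier; the file name and clipping thresholds are illustrative assumptions.

    # Sketch: handle injected outliers, then fit a baseline weather-type classifier.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("weather_classification_data.csv")     # hypothetical path
    df["Humidity"] = df["Humidity"].clip(upper=100)          # cap humidity outliers above 100%
    df["Wind Speed"] = df["Wind Speed"].clip(upper=150)      # cap unrealistically high wind speeds

    X = pd.get_dummies(df.drop(columns=["Weather Type"]))    # one-hot encode categoricals
    y = df["Weather Type"]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    print("held-out accuracy:", clf.score(X_te, y_te))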

    Important Note

    This dataset is synthetically produced and does not convey real-world weather data. It includes intentional outliers to provide opportunities for practicing outlier detection and handling. The values, ranges, and distributions may not accurately represent real-world conditions, and the data should primarily be used for educational and experimental purposes.

    License

    Anyone is free to share and use the data

  18. Replication data for: Matching as Nonparametric Preprocessing for Reducing...

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Nov 17, 2016
    Cite
    Daniel E. Ho; Kosuke Imai; Gary King; Elizabeth A. Stuart (2016). Replication data for: Matching as Nonparametric Preprocessing for Reducing Model Dependence in Parametric Causal Inference [Dataset]. http://doi.org/10.7910/DVN/RWUY8G
    Explore at:
    Croissant (a format for machine-learning datasets; learn more about this at mlcommons.org/croissant)
    Dataset updated
    Nov 17, 2016
    Dataset provided by
    Harvard Dataverse
    Authors
    Daniel E. Ho; Kosuke Imai; Gary King; Elizabeth A. Stuart
    License

    https://dataverse.harvard.edu/api/datasets/:persistentId/versions/5.3/customlicense?persistentId=doi:10.7910/DVN/RWUY8G

    Description

    Although published works rarely include causal estimates from more than a few model specifications, authors usually choose the presented estimates from numerous trial runs readers never see. Given the often large variation in estimates across choices of control variables, functional forms, and other modeling assumptions, how can researchers ensure that the few estimates presented are accurate or representative? How do readers know that publications are not merely demonstrations that it is possible to find a specification that fits the author’s favorite hypothesis? And how do we evaluate or even define statistical properties like unbiasedness or mean squared error when no unique model or estimator even exists? Matching methods, which offer the promise of causal inference with fewer assumptions, constitute one possible way forward, but crucial results in this fast-growing methodological literature are often grossly misinterpreted. We explain how to avoid these misinterpretations and propose a unified approach that makes it possible for researchers to preprocess data with matching (such as with the easy-to-use software we offer) and then to apply the best parametric techniques they would have used anyway. This procedure makes parametric models produce more accurate and considerably less model-dependent causal inferences. See also: Causal Inference
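    A generic sketch of matching as preprocessing: estimate propensity scores, pair each treated unit with its nearest control, then fit the parametric model you would have used anyway on the matched subset. This is not the authors' software, and the simulated data and one-nearest-neighbour rule are illustrative only.

    # Sketch: propensity-score matching as a preprocessing step before regression.
    import numpy as np
    from sklearn.linear_model import LinearRegression, LogisticRegression

    rng = np.random.default_rng(0)
    n = 500
    X = rng.normal(size=(n, 3))                              # covariates
    treat = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))      # treatment depends on X
    y = X @ np.array([1.0, 0.5, -0.5]) + 2.0 * treat + rng.normal(size=n)

    # 1. Propensity scores
    ps = LogisticRegression().fit(X, treat).predict_proba(X)[:, 1]

    # 2. Nearest-neighbour matching on the propensity score (with replacement)
    treated = np.where(treat == 1)[0]
    controls = np.where(treat == 0)[0]
    matched = [controls[np.argmin(np.abs(ps[controls] - ps[i]))] for i in treated]
    keep = np.concatenate([treated, np.array(matched)])

    # 3. The usual parametric model, now fit only on the matched (preprocessed) data
    design = np.column_stack([treat[keep], X[keep]])
    effect = LinearRegression().fit(design, y[keep]).coef_[0]
    print("estimated treatment effect on matched data:", round(effect, 2))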

  19. Cdd Dataset

    • universe.roboflow.com
    zip
    Updated Sep 5, 2023
    Cite
    hakuna matata (2023). Cdd Dataset [Dataset]. https://universe.roboflow.com/hakuna-matata/cdd-g8a6g/3
    Explore at:
    zipAvailable download formats
    Dataset updated
    Sep 5, 2023
    Dataset authored and provided by
    hakuna matata
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Cucumber Disease Detection Bounding Boxes
    Description

    Project Documentation: Cucumber Disease Detection

    1. Title and Introduction

    Title: Cucumber Disease Detection

    Introduction: A machine learning model for the automatic detection of diseases in cucumber plants is to be developed as part of the "Cucumber Disease Detection" project. This research is crucial because it tackles the issue of early disease identification in agriculture, which can increase crop yield and cut down on financial losses. To train and test the model, we use a dataset of pictures of cucumber plants.

    2. Problem Statement

    Problem Definition: The research uses image analysis methods to address the issue of automating the identification of diseases, including downy mildew, in cucumber plants. Effective disease management in agriculture depends on early disease identification.

    Importance: Early disease diagnosis helps minimize crop losses, stop the spread of diseases, and better allocate resources in farming. Agriculture is a real-world application of this concept.

    Goals and Objectives: Develop a machine learning model to classify cucumber plant images into healthy and diseased categories. Achieve a high level of accuracy in disease detection. Provide a tool for farmers to detect diseases early and take appropriate action.

    3. Data Collection and Preprocessing

    Data Sources: The dataset comprises pictures of cucumber plants from various sources, including both healthy and damaged specimens.

    Data Collection: Images were gathered from agricultural areas using cameras and smartphones.

    Data Preprocessing: Data cleaning to remove irrelevant or corrupted images. Handling missing values, if any, in the dataset. Removing outliers that may negatively impact model training. Data augmentation techniques applied to increase dataset diversity.

    4. Exploratory Data Analysis (EDA)

    The dataset was examined using visuals such as scatter plots and histograms, and checked for patterns, trends, and correlations. EDA made it easier to understand the distribution of photos of healthy and diseased plants.

    5. Methodology

    Machine Learning Algorithms: Convolutional Neural Networks (CNNs) were chosen for image classification due to their effectiveness in handling image data. Transfer learning using pre-trained models such as ResNet or MobileNet may be considered.

    Train-Test Split: The dataset was split into training and testing sets with a suitable ratio. Cross-validation may be used to assess model performance robustly.

    6. Model Development

    The CNN model's architecture consists of layers, units, and activation functions. Hyperparameters including learning rate, batch size, and optimizer were chosen on the basis of experimentation. To avoid overfitting, regularization methods such as dropout and L2 regularization were used.

    7. Model Training

    During training, the model was fed the prepared dataset over a number of epochs, and the loss function was minimized with an optimization method. To ensure convergence, early stopping and model checkpoints were used. A hedged sketch of such a training setup follows.
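    The sketch below shows one way to set up such a transfer-learning classifier (MobileNetV2 backbone, dropout, early stopping, checkpoints); directory names and hyperparameters are assumptions, not the project's actual code.

    # Sketch: transfer-learning binary classifier (healthy vs diseased), illustrative only.
    import tensorflow as tf

    base = tf.keras.applications.MobileNetV2(include_top=False, weights="imagenet",
                                              input_shape=(224, 224, 3))
    base.trainable = False                                   # freeze the pre-trained backbone

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(224, 224, 3)),
        tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),   # MobileNetV2 expects inputs in [-1, 1]
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dropout(0.3),                        # regularization against overfitting
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss="binary_crossentropy", metrics=["accuracy"])

    train_ds = tf.keras.utils.image_dataset_from_directory(
        "cucumber/train", image_size=(224, 224), label_mode="binary")   # hypothetical paths
    val_ds = tf.keras.utils.image_dataset_from_directory(
        "cucumber/val", image_size=(224, 224), label_mode="binary")

    model.fit(train_ds, validation_data=val_ds, epochs=30, callbacks=[
        tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True),
        tf.keras.callbacks.ModelCheckpoint("best_cucumber_model.keras"),
    ])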

    8. Model Evaluation

    Evaluation Metrics: Accuracy, precision, recall, F1-score, and the confusion matrix were used to assess model performance. Results were computed for both training and test datasets.

    Performance Discussion: The model's performance was analyzed in the context of disease detection in cucumber plants. Strengths and weaknesses of the model were identified.

    9. Results and Discussion

    Key project findings include model performance and disease detection precision; a comparison of the models employed, showing the benefits and drawbacks of each; and the challenges faced throughout the project and the methods used to solve them.

    10. Conclusion

    Recap of the project's key learnings. The project's importance to early disease detection in agriculture is highlighted, and future enhancements and potential research directions are suggested.

    11. References

    Libraries: Pillow, Roboflow, YOLO, scikit-learn, matplotlib. Dataset: https://data.mendeley.com/datasets/y6d3z6f8z9/1

    12. Code Repository

    https://universe.roboflow.com/hakuna-matata/cdd-g8a6g

    Rafiur Rahman Rafit EWU 2018-3-60-111

  20. Preprocessed 2019 Blindness Detection

    • kaggle.com
    zip
    Updated Sep 5, 2019
    Cite
    Matheus Eduardo (2019). Preprocessed 2019 Blindness Detection [Dataset]. https://www.kaggle.com/datasets/matheuseduardo/preprocessed-2019-blindness-detection
    Explore at:
    zip(598247654 bytes)Available download formats
    Dataset updated
    Sep 5, 2019
    Authors
    Matheus Eduardo
    Description

    Data from the APTOS 2019 Blindness Detection competition, cropped and preprocessed using @ratthachat's implementation of circle crop and Ben Graham's preprocessing method.
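    A rough sketch of the two preprocessing steps mentioned above, a centred circle crop and Ben Graham's local-average-colour subtraction; parameters and paths are illustrative, and the referenced implementations may differ in detail.

    # Sketch: circle crop plus Ben Graham-style colour normalisation for fundus images.
    import cv2
    import numpy as np

    def circle_crop(img):
        # Keep only a centred circular region (the fundus), mask the corners.
        h, w = img.shape[:2]
        mask = np.zeros((h, w), dtype=np.uint8)
        cv2.circle(mask, (w // 2, h // 2), min(h, w) // 2, 255, -1)
        return cv2.bitwise_and(img, img, mask=mask)

    def ben_graham(img, sigma=10):
        # Subtract the local average colour to normalise lighting.
        blurred = cv2.GaussianBlur(img, (0, 0), sigma)
        return cv2.addWeighted(img, 4, blurred, -4, 128)

    img = cv2.imread("fundus.png")                 # illustrative input path
    out = ben_graham(circle_crop(img))
    cv2.imwrite("fundus_preprocessed.png", out)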
