Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Biomarkers are a key component of precision medicine. However, full clinical integration of biomarkers has been met with challenges, partly attributed to analytical difficulties. It has been shown that biomarker reproducibility is susceptible to data preprocessing approaches. Here, we systematically evaluated machine-learning ensembles of preprocessing methods as a general strategy to improve biomarker performance for prediction of survival from early breast cancer.
Results: We risk-stratified breast cancer patients into low-risk or high-risk groups based on four published hypoxia signatures (Buffa, Winter, Hu, and Sorensen), using 24 different preprocessing approaches for microarray normalization. The 24 binary risk profiles determined for each hypoxia signature were combined using a random forest to evaluate the efficacy of a preprocessing ensemble classifier. We demonstrate that the best way of merging preprocessing methods varies from signature to signature, and that there is likely no 'best' preprocessing pipeline that is universal across datasets, highlighting the need to evaluate ensembles of preprocessing algorithms. Further, we developed novel signatures for each preprocessing method, and the risk classifications from each were incorporated into a meta-random-forest model. Interestingly, the classifications of these biomarkers and their ensemble show striking consistency, demonstrating that similar intrinsic biological information is being faithfully represented. As such, these classification patterns further confirm that there is a subset of patients whose prognosis is consistently challenging to predict.
Conclusions: Performance of different prognostic signatures varies with preprocessing method. A simple classifier based on unanimous voting of classifications is a reliable way to improve on single preprocessing methods. Future signatures will likely require integration of intrinsic and extrinsic clinico-pathological variables to better predict disease-related outcomes.
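As a rough illustration of the ensemble idea described in the abstract, the sketch below combines per-pipeline binary risk calls with a random forest. All data here is randomly generated placeholder data; the variable names and shapes are assumptions, not the study's actual code.

```python
# Sketch: combine binary risk calls from many preprocessing pipelines
# with a random forest, as the abstract describes. Data is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_patients, n_pipelines = 200, 24

# Placeholder for the 24 binary risk profiles (one column per
# preprocessing method) and the survival-derived outcome labels.
risk_calls = rng.integers(0, 2, size=(n_patients, n_pipelines))
outcome = rng.integers(0, 2, size=n_patients)

ensemble = RandomForestClassifier(n_estimators=500, random_state=0)
scores = cross_val_score(ensemble, risk_calls, outcome, cv=5, scoring="roc_auc")
print(f"cross-validated AUC: {scores.mean():.2f}")
```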
https://spdx.org/licenses/CC0-1.0.html
Diamond is by far the hardest mineral in the world, reportedly about 58 times harder than anything else in nature, and its elegance as a jewel has long been appreciated. Forecasting diamond prices is challenging due to nonlinearity in important features such as carat, cut, clarity, table, and depth. Against this backdrop, the study conducted a comparative analysis of the performance of multiple supervised machine learning models (regressors and classifiers) in predicting diamond prices. Eight supervised machine learning algorithms were evaluated in this work, including Multiple Linear Regression, Linear Discriminant Analysis, eXtreme Gradient Boosting, Random Forest, k-Nearest Neighbors, Support Vector Machines, Boosted Regression and Classification Trees, and Multi-Layer Perceptron. The analysis is based on data preprocessing, exploratory data analysis (EDA), training the aforementioned models, assessing their accuracy, and interpreting their results. Based on the performance metric values and analysis, eXtreme Gradient Boosting was found to be the best-performing algorithm in both classification and regression, with an R² score of 97.45% and an accuracy of 74.28%. As a result, eXtreme Gradient Boosting was recommended as the optimal regressor and classifier for forecasting the price of a diamond specimen.
Methods: Kaggle, a data repository with thousands of datasets, was used in the investigation. It is an online community for machine learning practitioners and data scientists, as well as a robust, well-researched, and sufficient resource for analyzing various data sources. On Kaggle, users can search for and publish various datasets, and in a web-based data-science environment they can study datasets and construct models.
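For illustration, here is a minimal sketch of the recommended regression setup, assuming the xgboost package and seaborn's copy of the classic diamonds dataset (not necessarily the exact Kaggle file used in the study); the categorical encoding is deliberately simplified.

```python
# Sketch: fit an XGBoost regressor to the classic diamonds data,
# roughly mirroring the study's regression setup.
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from xgboost import XGBRegressor

diamonds = sns.load_dataset("diamonds")  # downloaded on first call

# Encode the ordered categorical features as integer codes.
for col in ["cut", "color", "clarity"]:
    diamonds[col] = diamonds[col].astype("category").cat.codes

X = diamonds.drop(columns="price")
y = diamonds["price"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

model = XGBRegressor(n_estimators=400, learning_rate=0.1, random_state=42)
model.fit(X_tr, y_tr)
print(f"R2 on held-out data: {r2_score(y_te, model.predict(X_te)):.4f}")
```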
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Improving the accuracy of predictions of future values from past and current observations has typically been pursued by enhancing the prediction methods, combining those methods, or preprocessing the data. In this paper, another approach is taken: increasing the number of inputs in the dataset. This approach is especially useful for shorter time series. By filling in the in-between values of the time series, the size of the training set can be increased, improving the generalization capability of the predictor. The algorithm used for prediction is a neural network, as it is widely used in the literature for time series tasks; Support Vector Regression is also employed for comparison. The dataset used in the experiment is the frequency of USPTO patents and PubMed scientific publications in the field of health, namely on apnea, arrhythmia, and sleep stages. Another time series dataset, prepared for the NN3 Competition in the field of transportation, is also used for benchmarking. The experimental results show that prediction performance can be significantly increased by filling in in-between data points. Furthermore, detrending and deseasonalization, which separate the data into trend, seasonal, and stationary components, also improve prediction performance on both the original and the filled datasets. The optimal enlargement of the dataset in this experiment is about five times the length of the original.
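A minimal sketch of the in-filling idea, assuming linear interpolation as the filling method (the description does not specify how the in-between values are computed):

```python
# Sketch: densify a short time series by filling in-between values,
# here with linear interpolation on an upsampled index.
import pandas as pd

# A short monthly series, e.g. publication counts.
series = pd.Series(
    [12, 15, 11, 18, 21, 19],
    index=pd.date_range("2020-01-01", periods=6, freq="MS"),
)

# Upsample to roughly 5x as many points (about the optimal factor
# reported) and interpolate the in-between values.
dense = series.resample("6D").interpolate("linear")
print(len(series), "->", len(dense), "training points")
```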
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This file contains 50 pairs of ARI and C-score values generated by running SC3 50 times on each data set.
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
The data was collected from famous cookery YouTube channels in India, with a focus on viewers' comments written in Hinglish. The datasets are taken from the top two Indian cooking channels, the Nisha Madhulika channel and Kabita's Kitchen.
The comments in both datasets are divided into seven categories:
Label 1- Gratitude
Label 2- About the recipe
Label 3- About the video
Label 4- Praising
Label 5- Hybrid
Label 6- Undefined
Label 7- Suggestions and queries
All the labelling has been done manually.
Nisha Madhulika dataset:
Dataset characteristics: Multivariate
Number of instances: 4900
Area: Cooking
Attribute characteristics: Real
Number of attributes: 4
Date donated: March, 2019
Associate tasks: Classification
Missing values: None
Kabita Kitchen dataset:
Dataset characteristics: Multivariate
Number of instances: 4900
Area: Cooking
Attribute characteristics: Real
Number of attributes: 4
Date donated: March, 2019
Associate tasks: Classification
Missing values: None
There are two separate dataset files for each channel: a preprocessing file and a main file.
The preprocessing files were generated after performing preprocessing and exploratory data analysis on both datasets. Each preprocessing file includes:
The main file includes:
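A hypothetical loading sketch for one channel's main file is shown below; the file and column names are assumptions, since the description does not specify them.

```python
# Sketch: load one channel's file and map the seven numeric labels to
# their names. File name and "label" column name are assumed.
import pandas as pd

labels = {
    1: "Gratitude", 2: "About the recipe", 3: "About the video",
    4: "Praising", 5: "Hybrid", 6: "Undefined",
    7: "Suggestions and queries",
}

df = pd.read_csv("nisha_madhulika_main.csv")  # assumed file name
df["label_name"] = df["label"].map(labels)    # assumed column name
print(df["label_name"].value_counts())
```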
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This experiment is motivated by the need to preprocess large-scale CAD models for assembly-by-disassembly approaches. Assembly-by-disassembly is only suitable for assemblies with a small number of parts (n_parts < 22). However, when dealing with large-scale products with high complexity, the CAD models may not contain feasible subassemblies (e.g. with connected and interference-free parts) and have too many parts to be processed with assembly-by-disassembly. Product designers' preferences during the design phase may not be ideal for assembly-by-disassembly processing, because they do not explicitly consider subassembly feasibility or the number of parts per subassembly. An automated preprocessing approach is proposed to address this issue by splitting the model into manageable partitions using community detection. This allows for parallelised, efficient and accurate assembly-by-disassembly of large-scale CAD models. However, applying community detection methods to automatically split CAD models into smaller subassemblies is a new concept, and its suitability for assembly sequence planning (ASP) needs to be investigated. Therefore, the following underlying research question is answered in these experiments:
Underlying research question 2: Can automated preprocessing increase the suitability of CAD-based assembly-by-disassembly for large-scale products?
A hypothesis is formulated to answer this research question, which will be utilised to design experiments for hypothesis testing.
Hypothesis 2: Community detection algorithms can be applied to automatically split large-scale assemblies into suitable candidates for CAD-based AND/OR graph generation.
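As a rough sketch of the proposed preprocessing, the example below runs a community detection algorithm (networkx's greedy modularity method, one of several possible choices) on a toy part-contact graph standing in for a real CAD model.

```python
# Sketch: treat parts as graph nodes and contacts between parts as
# edges, then split the assembly with community detection.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

contact_graph = nx.Graph()
contact_graph.add_edges_from([
    ("bolt1", "plateA"), ("plateA", "plateB"), ("plateB", "bolt2"),
    ("housing", "shaft"), ("shaft", "bearing"), ("bearing", "housing"),
    ("plateB", "housing"),  # single link between the two modules
])

subassemblies = greedy_modularity_communities(contact_graph)
for i, parts in enumerate(subassemblies):
    print(f"candidate subassembly {i}: {sorted(parts)}")
```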
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
VGGFace2 Dataset and Face Mesh Preprocessing
Introduction
The VGGFace2 dataset is a large-scale face recognition dataset containing over 3.31 million images of 9,131 identities, with an average of 362 images per identity. The dataset is designed to include extensive variations in pose, age, illumination, ethnicity, and profession, making it one of the most diverse and challenging face recognition datasets available. For more details, please refer to the original publication:
VGGFace2: A dataset for recognizing faces across pose and age - DOI: 10.48550/arXiv.1710.08092
Preprocessing Using MediaPipe 3D Face Mesh
On this dataset, we applied the MediaPipe-based 3D face mesh algorithm to accurately detect faces while removing all background elements, including hair. Our preprocessing strictly retained facial landmarks, ensuring that only the essential facial features were preserved. This approach significantly enhanced the accuracy and generalization of our model, as the model was trained exclusively on landmark-based facial data.
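A minimal sketch of this kind of landmark-based preprocessing, using MediaPipe's face mesh and a convex-hull mask; the exact masking used for the released dataset may differ.

```python
# Sketch: detect a 3D face mesh with MediaPipe and black out everything
# outside the convex hull of the landmarks (hair and background).
import cv2
import numpy as np
import mediapipe as mp

image = cv2.imread("face.jpg")  # assumed input path
h, w = image.shape[:2]

with mp.solutions.face_mesh.FaceMesh(static_image_mode=True) as mesh:
    result = mesh.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

if result.multi_face_landmarks:
    # Landmarks are normalized; scale them to pixel coordinates.
    pts = np.array(
        [(lm.x * w, lm.y * h) for lm in result.multi_face_landmarks[0].landmark],
        dtype=np.int32,
    )
    mask = np.zeros((h, w), dtype=np.uint8)
    cv2.fillConvexPoly(mask, cv2.convexHull(pts), 255)
    cv2.imwrite("face_masked.jpg", cv2.bitwise_and(image, image, mask=mask))
```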
Training and Performance
The preprocessed data was used to train an Xception model, which produced remarkably accurate results thanks to the strictly landmark-based facial representation. The model demonstrated robust performance, including under explainable-AI analysis, showing that eliminating unnecessary background elements contributed positively to its efficiency and reliability.
Citation
If you use this dataset or the preprocessed version in your work, please cite both of the following:
VGGFace2 Dataset:
@article{Cao2018VGGFace2,
title={VGGFace2: A dataset for recognizing faces across pose and age},
author={Cao, Qiong and Shen, Li and Xie, Weidi and Parkhi, Omkar M and Zisserman, Andrew},
journal={arXiv preprint arXiv:1710.08092},
year={2018}
}
DOI: [10.48550/arXiv.1710.08092](https://doi.org/10.48550/arXiv.1710.08092)
Preprocessed Dataset using MediaPipe:
@dataset{Shah2025_MediaPipe_FaceMesh,
title={MediaPipe-based 3D Face Mesh Preprocessed VGGFace2 Dataset},
author={Shah, Syed Taimoor Hussain and Shah, Syed Adil Hussain and Zamir, Ammara and Qayyum, Kainat and Shah, Syed Baqir Hussain and Fatima, Syeda Maryam and Deriu, Marco Agostino},
year={2025},
doi={10.5281/zenodo.15078557}
}
DOI: [10.5281/zenodo.15078557](https://doi.org/10.5281/zenodo.15078557)
Contact
For any questions or further details, please feel free to contact us.
Syed Taimoor Hussain Shah
PolitoBIOMed Lab, Department of Mechanical and Aerospace Engineering, Politecnico di Torino, Turin, Italy
Email: taimoor.shah@polito.it
ORCID: 0000-0002-6010-6777
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data quality control and preprocessing are often the first steps in processing next-generation sequencing (NGS) data from tumors. They not only help us evaluate the quality of sequencing data, but also yield high-quality data for downstream analysis. However, by comparing analysis results from data preprocessed with Cutadapt, fastp, and Trimmomatic against raw sequencing data, we found fluctuations and differences in the frequency of detected mutations, and human leukocyte antigen (HLA) typing directly produced erroneous results. We believe our research has demonstrated the impact of data preprocessing steps on downstream analysis results. We hope it will promote the development and optimization of better data preprocessing methods, so that downstream analyses can be more accurate.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These datasets contain a set of news articles in English, French and Spanish extracted from Medisys (i.e. advanced search) according to the following criteria: (1) Keywords (at least one): COVID-19, ncov2019, cov2019, coronavirus; (2) Keywords (all words): masque (French), mask (English), máscara (Spanish); (3) Periods: March 2020, May 2020, July 2020; (4) Countries: UK (English), Spain (Spanish), France (French). A corpus per country has been manually collected (copy/paste) from Medisys. For each country, 100 snippets per period (the 1st, 10th, 15th and 20th of each month) are built. The datasets are composed of: (1) a corpus preprocessed for the BioTex tool - https://gitlab.irstea.fr/jacques.fize/biotex_python (.txt) [~900 texts]; (2) the same corpus preprocessed for the Weka tool - https://www.cs.waikato.ac.nz/ml/weka/ (.arff); (3) terms extracted with BioTex according to spatio-temporal criteria (*.csv) [~9,000 terms]. Other corpora can be collected with this same method. The Perl code to preprocess textual data for terminology extraction (with BioTex) and classification (with Weka) tasks is available. A new version of this dataset (December 2020) includes additional data: Python preprocessing and BioTex code [Execution_BioTex.tgz], and terms extracted with different ranking measures (i.e. C-Value, F-TFIDF-C_M) and methods (i.e. extraction of words and multi-word terms) with the online version of BioTex [Terminology_with_BioTex_online_dec2020.tgz].
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Diagram of the process: starting with collating the survey data files, these were pre-processed for analysis, as shown with an image of a white baby alongside respondent text and demographic details; a snapshot shows how these were analysed using a Miro board with lines and sticky notes; the analysis was then visualised as data portraits, data quilts and quilted bar charts.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The purpose of data mining analysis is always to find patterns in the data using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset: before doing any work on the data, it has to be pre-processed, and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimensionality of the data and create new features. In our project, after using clustering prior to classification, performance did not improve much. The reason may be that the features we selected for clustering are not well suited to it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics.

From the dimensionality reduction perspective: clustering is different from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters to reduce the data dimension can lose a lot of information, since clustering techniques are based on a metric of 'distance', and at high dimensions Euclidean distance loses pretty much all meaning. Therefore 'reducing' dimensionality by mapping data points to cluster numbers is not always good, since you may lose almost all the information.

From the creating-new-features perspective: clustering analysis creates labels based on the patterns in the data, which brings uncertainty into the data. When clustering is used prior to classification, the choice of the number of clusters strongly affects clustering performance, and in turn classification performance. If the subset of features we cluster on is well suited for it, it might increase overall classification performance; for example, if the features we run k-means on are numerical and low-dimensional, the overall classification performance may be better. We did not lock in the clustering outputs with a random_state, in order to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, the data may simply not cluster well with the selected methods. In effect, the ramification we saw was that our results are not much better than random when applying clustering in data preprocessing.

Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the models' real-world effectiveness and to revise the models from time to time as things change.
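A small sketch of the workflow discussed above, using scikit-learn on synthetic data: k-means labels are appended as a feature before classification, with k-means left unseeded as in the project, so repeated runs can give different labels and scores.

```python
# Sketch: use k-means cluster labels as an extra feature before
# classification; the unseeded KMeans mirrors the project's stability
# check (re-running can change labels and downstream scores).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

labels = KMeans(n_clusters=5, n_init=10).fit_predict(X)  # no random_state
X_aug = np.column_stack([X, labels])

clf = RandomForestClassifier(random_state=0)
print("baseline:", cross_val_score(clf, X, y, cv=5).mean())
print("with cluster feature:", cross_val_score(clf, X_aug, y, cv=5).mean())
```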
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This spatial_frequency_preferences dataset contains the data from the paper "Mapping Spatial Frequency Preferences Across Human Primary Visual Cortex", by William F. Broderick, Eero P. Simoncelli, and Jonathan Winawer. ADD LINK
In this experiment, we measured the BOLD responses of 12 human observers to a set of novel grating stimuli in order to measure the spatial frequency tuning in primary visual cortex across eccentricities, retinotopic angles, and stimulus orientations. We then fit a parametric model which fits all voxels for a given subject simultaneously, predicting each voxel's response as a function of the voxel's retinotopic location and the stimulus local spatial frequency and orientation.
This dataset contains the minimally pre-processed, BIDS-compliant data required to reproduce the analyses presented in the paper. In addition to the task imaging data and stimuli files, it contains three derivatives directories:
- freesurfer: freesurfer subject directories for each subject, with one change: the contents of the mri/ directories have been defaced.
- prf_solutions: solutions to the population receptive field models from a separate retinotopy experiment for each subject, fit using VistaSoft. Also contains the Benson retinotopic atlases for each subject (Benson et al., 2014) and the solutions for Bayesian retinotopic analyses (Benson and Winawer, 2018); the solutions to the Bayesian retinotopy are what we actually use in the paper.
- preprocessed: the preprocessed data. A custom script was used for preprocessing, found on the Winawer Lab Github (https://github.com/WinawerLab/MRI_tools/); see the Winawer Lab wiki (https://wikis.nyu.edu/pages/viewpage.action?pageId=86054639) for more details, and see the paper for a description of the steps taken. Results should not change substantially if fMRIPrep were used for preprocessing instead, as long as the data is kept in individual subject space.
This dataset is presented with the intention of enabling re-running our analyses to reproduce our results with our accompanying Github repo (https://github.com/billbrod/spatial-frequency-preferences). This dataset should contain sufficient information for re-analysis with a novel method, but there are no guarantees.
If you use this dataset in a publication, please cite the corresponding paper.
This dataset is hosted on OpenNeuro, and can be downloaded from there. Additionally, we present two variants of this data, both hosted on this project's OSF page:
- Fully-processed data: contains the final output of our analyses, the data required to reproduce the figures as they appear in the paper.
- Partially-processed data: contains the outputs of GLMdenoise and all data required to start fitting the spatial frequency response functions.
Both data sets build on top of this one and so require the data contained here as well.
All three of these variants may be downloaded using code found in the Github repo (https://github.com/billbrod/spatial-frequency-preferences); see the README there for more details.
Spatial frequency preferences
Year(s) that the project ran: pilot data gathering started in 2017; this dataset was gathered in the springs of 2019 and 2020. The paper was written in 2020 and 2021 and submitted in fall 2021.
Brief overview of the tasks in the experiment: subjects viewed the stimuli, fixating on the center of the images. A sequence of digits, alternating black and white, was presented at fixation; subjects pressed a button whenever a digit repeated. The behavioral data was not presented in the paper and so is not present here. See paper for more details.
Description of the contents of the dataset:
Quality assessment of the data: the MRIQC reports for each included scan can be found on this project's OSF page
Subjects were recruited from graduate students and postdocs at NYU, all experienced MRI participants.
Data was gathered on NYU's Center for Brain Imaging's Siemens Prisma 3T MRI scanner in a shielded room. Data was gathered with subjects lying down, with the stimuli projected onto a screen above their head.
When subjects arrived, subjects were briefed on the task, given the experimental consent form to read and sign, and talked through the screener form.
This experiment has only a single task.
Subjects passively viewed the stimuli while performing the distractor task described above: viewing a stream of alternating black and white digits and pressing a button whenever a digit repeated. Their button presses were recorded.
No additional data gathered.
All data gathered at NYU's Center for Brain Imaging in New York, NY.
One subject (sub-wlsubj045) only has 7 of the 12 runs, due to technical issues that came up during the run. The quality of their GLMdenoise fits and their final model fits do not appear to vary much from that of the other subjects.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
local_feature_training_set.csv: preprocessed output of the feature extractor with 65,869 rows and 344 columns; each row is a sample, the first 343 columns are features, and the last column is the label.
local_feature_testing_set.csv: preprocessed output of the feature extractor with 11,791 rows and 344 columns; each row is a sample, the first 343 columns are features, and the last column is the label.
global&local_feature_training_set.csv: preprocessed output of the feature extractor with 65,869 rows and 1,028 columns; each row is a sample, the first 1,027 columns are features, and the last column is the label.
global&local_feature_testing_set.csv: preprocessed output of the feature extractor with 11,791 rows and 1,028 columns; each row is a sample, the first 1,027 columns are features, and the last column is the label.
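A loading sketch for one of these files, assuming no header row (the layout above suggests plain feature/label columns, but this is not confirmed by the description):

```python
# Sketch: split one file into a feature matrix X and label vector y.
import pandas as pd

data = pd.read_csv("local_feature_training_set.csv", header=None)
X, y = data.iloc[:, :-1], data.iloc[:, -1]
print(X.shape, y.shape)  # expected: (65869, 343) (65869,)
```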
AI Data Management Market Size 2025-2029
The AI data management market size is forecast to increase by USD 51.04 billion at a CAGR of 19.7% between 2024 and 2029.
The market is experiencing significant growth, driven by the proliferation of generative AI and large language models. These advanced technologies are increasingly being adopted across industries, leading to an exponential increase in data generation and the need for efficient data management solutions. Furthermore, the ascendancy of data-centric AI and the industrialization of data curation are key trends shaping the market. However, the market also faces challenges. Extreme data complexity and quality assurance at scale pose significant obstacles.
Ensuring data accuracy, completeness, and consistency across vast datasets is a daunting task, requiring sophisticated data management tools and techniques. Companies seeking to capitalize on the opportunities presented by the market must invest in solutions that address these challenges effectively; by doing so, they can gain a competitive edge, improve operational efficiency, and unlock new revenue streams. Cloud computing is a key trend in the market, as cloud-based solutions offer quick deployment, flexibility, and scalability.
What will be the Size of the AI Data Management Market during the forecast period?
Explore in-depth regional segment analysis with market size data - historical 2019-2023 and forecasts 2025-2029 - in the full report.
Request Free Sample
The market for AI data management continues to evolve, with applications spanning various sectors, from finance to healthcare and retail. The model training process involves intricate data preprocessing steps, feature selection techniques, and data pipeline design to ensure optimal model performance. Real-time data processing and anomaly detection techniques are crucial for effective model monitoring systems, while data access management and data security measures ensure data privacy compliance. Data lifecycle management, including data validation techniques, metadata management strategy, and data lineage management, is essential for maintaining data quality.
Data governance framework and data versioning system enable effective data governance strategy and data privacy compliance. For instance, a leading retailer reported a 20% increase in sales due to implementing data quality monitoring and AI model deployment. The industry anticipates a 25% growth in the market size by 2025, driven by the continuous unfolding of market activities and evolving patterns. Data integration tools, data pipeline design, data bias detection, data visualization tools, and data encryption techniques are key components of this dynamic landscape. Statistical modeling methods and predictive analytics models rely on cloud data solutions and big data infrastructure for efficient data processing.
How is this AI Data Management Industry segmented?
The AI data management industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.
Component
Platform
Software tools
Services
Technology
Machine learning
Natural language processing
Computer vision
Context awareness
End-user
BFSI
Retail and e-commerce
Healthcare and life sciences
Manufacturing
Others
Geography
North America
US
Canada
Europe
France
Germany
Italy
UK
APAC
China
India
Japan
South Korea
Rest of World (ROW)
By Component Insights
The Platform segment is estimated to witness significant growth during the forecast period. In the dynamic and evolving world of data management, integrated platforms have emerged as a foundational and increasingly dominant category. These platforms offer a unified environment for managing both data and AI workflows, addressing the strategic imperative for enterprises to break down silos between data engineering, data science, and machine learning operations. The market trajectory is heavily influenced by the rise of the data lakehouse architecture, which combines the scalability and cost efficiency of data lakes with the performance and management features of data warehouses. Data preprocessing techniques and validation rules ensure data accuracy and consistency, while data access control maintains security and privacy.
Machine learning models, model performance evaluation, and anomaly detection algorithms drive insights and predictions, with feature engineering methods and real-time data streaming enabling continuous learning. Data lifecycle management, data quality metrics, and data governance policies ensure data integrity and compliance. Cloud data warehousing and data lake architecture facilitate efficient data storage and
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset presents a dual-version representation of employment-related data from India, crafted to highlight the importance of data cleaning and transformation in any real-world data science or analytics project.
It includes two parallel datasets: 1. Messy Dataset (Raw) – Represents a typical unprocessed dataset often encountered in data collection from surveys, databases, or manual entries. 2. Cleaned Dataset – This version demonstrates how proper data preprocessing can significantly enhance the quality and usability of data for analytical and visualization purposes.
Each record captures multiple attributes related to individuals in the Indian job market, including:
- Age Group
- Employment Status (Employed/Unemployed)
- Monthly Salary (INR)
- Education Level
- Industry Sector
- Years of Experience
- Location
- Perceived AI Risk
- Date of Data Recording
The raw dataset underwent comprehensive transformations to convert it into its clean, analysis-ready form (a few of these steps are sketched below):
- Missing Values: identified and handled using either row elimination (where critical data was missing) or imputation techniques.
- Duplicate Records: identified using row comparison and removed to prevent analytical skew.
- Inconsistent Formatting: unified inconsistent naming in columns (like 'monthly_salary_(inr)' → 'Monthly Salary (INR)'), capitalization, and string spacing.
- Incorrect Data Types: converted columns like salary from string/object to float for numerical analysis.
- Outliers: detected and handled based on domain logic and distribution analysis.
- Categorization: converted numeric ages into grouped age categories for comparative analysis.
- Standardization: applied uniform labels for employment status, industry names, education, and AI risk levels for visualization clarity.
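The sketch below illustrates a few of these steps with pandas; the file and column names are assumptions based on the description.

```python
# Sketch: deduplicate, rename, retype, impute, and bin ages.
import pandas as pd

df = pd.read_csv("employment_messy.csv")  # assumed file name

df = df.drop_duplicates()
df = df.rename(columns={"monthly_salary_(inr)": "Monthly Salary (INR)"})

# Coerce salary to numeric, then impute the remaining gaps with the median.
df["Monthly Salary (INR)"] = pd.to_numeric(df["Monthly Salary (INR)"], errors="coerce")
df["Monthly Salary (INR)"] = df["Monthly Salary (INR)"].fillna(
    df["Monthly Salary (INR)"].median()
)

# Convert numeric ages into grouped categories (assumed "Age" column).
df["Age Group"] = pd.cut(
    df["Age"], bins=[18, 25, 35, 50, 65],
    labels=["18-25", "26-35", "36-50", "51-65"],
)
```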
This dataset is ideal for learners and professionals who want to understand:
- The impact of messy data on visualization and insights
- How transformation steps can dramatically improve data interpretation
- Practical examples of preprocessing techniques before feeding into ML models or BI tools
It's also useful for:
- Training ML models with clean inputs
- Data storytelling with visual clarity
- Demonstrating reproducibility in data cleaning pipelines
By examining both the messy and clean datasets, users gain a deeper appreciation for why “garbage in, garbage out” rings true in the world of data science.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a synthetic version inspired by the original "Stroke Prediction Dataset" on Kaggle. It contains anonymized, artificially generated data intended for research and model training on healthcare-related stroke prediction. The dataset generated using GPT-4o contains 50,000 records and 12 features. The target variable is stroke, a binary classification where 1 represents stroke occurrence and 0 represents no stroke. The dataset includes both numerical and categorical features, requiring preprocessing steps before analysis. A small portion of the entries includes intentionally introduced missing values to allow users to practice various data preprocessing techniques such as imputation, missing data analysis, and cleaning. The dataset is suitable for educational and research purposes, particularly in machine learning tasks related to classification, healthcare analytics, and data cleaning. No real-world patient information was used in creating this dataset.
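Since the missing values are intentional, a common first exercise is simple imputation; a sketch with scikit-learn follows (the file name is assumed, and column names are taken from the original Kaggle stroke dataset, so they may differ here).

```python
# Sketch: median-impute the numeric columns before modelling.
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv("synthetic_stroke.csv")  # assumed file name
num_cols = df.select_dtypes("number").columns.drop("stroke")

df[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])
print(df[num_cols].isna().sum().sum(), "missing values remain in numeric columns")
```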
This dataset is synthetically generated to mimic weather data for classification tasks. It includes various weather-related features and categorizes the weather into four types: Rainy, Sunny, Cloudy, and Snowy. This dataset is designed for practicing classification algorithms, data preprocessing, and outlier detection methods.
This dataset is useful for data scientists, students (especially beginners), and practitioners who want to investigate classification algorithms' performance; practice data preprocessing, feature engineering, and model evaluation; and test outlier detection methods. It provides opportunities for learning and experimenting with weather data analysis and machine learning techniques.
This dataset is synthetically produced and does not convey real-world weather data. It includes intentional outliers to provide opportunities for practicing outlier detection and handling. The values, ranges, and distributions may not accurately represent real-world conditions, and the data should primarily be used for educational and experimental purposes.
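A short sketch of one way to flag the intentional outliers, using an isolation forest from scikit-learn (the file name and contamination rate are assumptions):

```python
# Sketch: flag likely outliers before training a classifier.
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.read_csv("weather_classification.csv")  # assumed file name
features = df.select_dtypes("number")

# fit_predict returns -1 for flagged outliers and 1 for inliers.
df["is_outlier"] = IsolationForest(
    contamination=0.05, random_state=0
).fit_predict(features)
print((df["is_outlier"] == -1).sum(), "rows flagged as outliers")
```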
Anyone is free to share and use the data
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/5.3/customlicense?persistentId=doi:10.7910/DVN/RWUY8G
Although published works rarely include causal estimates from more than a few model specifications, authors usually choose the presented estimates from numerous trial runs readers never see. Given the often large variation in estimates across choices of control variables, functional forms, and other modeling assumptions, how can researchers ensure that the few estimates presented are accurate or representative? How do readers know that publications are not merely demonstrations that it is possible to find a specification that fits the author’s favorite hypothesis? And how do we evaluate or even define statistical properties like unbiasedness or mean squared error when no unique model or estimator even exists? Matching methods, which offer the promise of causal inference with fewer assumptions, constitute one possible way forward, but crucial results in this fast-growing methodological literature are often grossly misinterpreted. We explain how to avoid these misinterpretations and propose a unified approach that makes it possible for researchers to preprocess data with matching (such as with the easy-to-use software we offer) and then to apply the best parametric techniques they would have used anyway. This procedure makes parametric models produce more accurate and considerably less model-dependent causal inferences. See also: Causal Inference
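A self-contained illustration of the proposed workflow, written here in Python rather than with the authors' own software: match on an estimated propensity score, then run the parametric model you would have used anyway on the matched sample only. The data is synthetic and the matching scheme is deliberately simple.

```python
# Sketch: nearest-neighbor propensity-score matching as preprocessing,
# followed by an ordinary parametric (OLS) analysis on the matched set.
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))                                # confounders
t = rng.binomial(1, 1 / (1 + np.exp(-X @ [0.8, -0.5, 0.3])))
y = 2.0 * t + X @ [1.0, 1.0, -1.0] + rng.normal(size=n)    # true effect: 2.0

# 1) Match each treated unit to its nearest control on the propensity score.
ps = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]
treated, control = np.where(t == 1)[0], np.where(t == 0)[0]
nn = NearestNeighbors(n_neighbors=1).fit(ps[control, None])
matched = control[nn.kneighbors(ps[treated, None])[1].ravel()]

# 2) Apply the usual parametric model to the matched sample only.
keep = np.concatenate([treated, matched])
ols = sm.OLS(y[keep], sm.add_constant(np.column_stack([t[keep], X[keep]]))).fit()
print("estimated treatment effect:", round(ols.params[1], 2))
```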
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Project Documentation: Cucumber Disease Detection
Introduction: A machine learning model for the automatic detection of diseases in cucumber plants is to be developed as part of the "Cucumber Disease Detection" project. This research matters because it tackles early disease identification in agriculture, which can increase crop yield and cut down on financial losses. To train and test the model, we use a dataset of pictures of cucumber plants.
Importance: Early disease diagnosis helps minimize crop losses, stop the spread of diseases, and better allocate resources in farming. Agriculture is a real-world application of this concept.
Goals and Objectives: Develop a machine learning model to classify cucumber plant images into healthy and diseased categories. Achieve a high level of accuracy in disease detection. Provide a tool for farmers to detect diseases early and take appropriate action.
Data Collection: Using cameras and smartphones, images from agricultural areas were gathered.
Data Preprocessing: Data cleaning to remove irrelevant or corrupted images. Handling missing values, if any, in the dataset. Removing outliers that may negatively impact model training. Data augmentation techniques applied to increase dataset diversity.
Exploratory Data Analysis (EDA)
The dataset was examined using visuals like scatter plots and histograms, and checked for patterns, trends, and correlations. EDA made it easier to understand the distribution of photos of healthy and diseased plants.
Methodology
Machine Learning Algorithms:
Convolutional Neural Networks (CNNs) were chosen for image classification due to their effectiveness in handling image data. Transfer learning using pre-trained models such as ResNet or MobileNet may be considered.
Train-Test Split:
The dataset was split into training and testing sets with a suitable ratio. Cross-validation may be used to assess model performance robustly.
Model Development
The CNN model's architecture consists of layers, units, and activation operations. Hyperparameters including learning rate, batch size, and optimizer were chosen on the basis of experimentation. To avoid overfitting, regularization methods like dropout and L2 regularization were used.
Model Training
During training, the model was fed the prepared dataset across a number of epochs. The loss function was minimized using an optimization method. To ensure convergence, early stopping and model checkpoints were used.
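A hedged Keras sketch of the model development and training steps described above; the project's actual architecture and hyperparameters are not specified, so everything here is illustrative.

```python
# Sketch: small CNN with dropout and L2 regularization, trained with
# early stopping and model checkpoints, as the write-up describes.
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

model = models.Sequential([
    layers.Input((224, 224, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu", kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.5),                    # regularization, as described
    layers.Dense(2, activation="softmax"),  # healthy vs diseased
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

callbacks = [
    tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True),
    tf.keras.callbacks.ModelCheckpoint("best_model.keras", save_best_only=True),
]
# model.fit(train_ds, validation_data=val_ds, epochs=50, callbacks=callbacks)
```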
Model Evaluation
Evaluation Metrics:
Accuracy, precision, recall, F1-score, and the confusion matrix were used to assess model performance. Results were computed for both training and test datasets.
Performance Discussion:
The model's performance was analyzed in the context of disease detection in cucumber plants. Strengths and weaknesses of the model were identified.
Results and Discussion
Key project findings include the model's performance and its disease-detection precision; a comparison of the models employed, showing the benefits and drawbacks of each; and the challenges faced throughout the project and the methods used to solve them.
Conclusion
A recap of the project's key learnings. The project's importance for early disease detection in agriculture is highlighted, and future enhancements and potential research directions are suggested.
References
Libraries: Pillow, Roboflow, YOLO, scikit-learn, matplotlib
Datasets: https://data.mendeley.com/datasets/y6d3z6f8z9/1
Code Repository https://universe.roboflow.com/hakuna-matata/cdd-g8a6g
Rafiur Rahman Rafit EWU 2018-3-60-111
Data from the APTOS 2019 Blindness Detection competition, cropped and preprocessed using @ratthachat's implementation of circle crop and Ben Graham's preprocessing method.
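For reference, a common implementation of those two steps looks roughly like this (the referenced Kaggle kernels may differ in detail):

```python
# Sketch: circular crop around the retina plus Ben Graham's
# Gaussian-blur subtraction, as often applied to APTOS fundus images.
import cv2
import numpy as np

def circle_crop(img):
    h, w = img.shape[:2]
    mask = np.zeros((h, w), dtype=np.uint8)
    cv2.circle(mask, (w // 2, h // 2), min(h, w) // 2, 255, -1)
    return cv2.bitwise_and(img, img, mask=mask)

def ben_graham(img, sigma=10):
    blurred = cv2.GaussianBlur(img, (0, 0), sigma)
    return cv2.addWeighted(img, 4, blurred, -4, 128)

img = cv2.imread("fundus.png")  # assumed input path
cv2.imwrite("fundus_prep.png", ben_graham(circle_crop(img)))
```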