In this study, we employed various machine learning models to predict metabolic phenotypes, focusing on thyroid function, using a dataset from the National Health and Nutrition Examination Survey (NHANES) from 2007 to 2012. Our analysis used laboratory parameters relevant to thyroid function or metabolic dysregulation, in addition to demographic features, to uncover potential associations between thyroid function and metabolic phenotypes. Multinomial Logistic Regression performed best at identifying the relationship between thyroid function and metabolic phenotypes, achieving an area under the receiver operating characteristic curve (AUROC) of 0.818, followed closely by Neural Network (AUROC: 0.814). Random Forest, Boosted Trees, and K-Nearest Neighbors performed worse than the first two methods (AUROC 0.811, 0.811, and 0.786, respectively). In Random Forest, homeostatic model assessment for insulin resistance, serum uric acid, serum albumin, gamma-glutamyl transferase, and the triiodothyronine/thyroxine ratio ranked highest in variable importance. These results highlight the potential of machine learning for understanding complex relationships in health data, although model performance may vary with data characteristics and specific requirements. Furthermore, we emphasize the importance of accounting for sampling weights in complex survey data analysis and the potential benefits of incorporating additional variables to improve model accuracy and insight. Future research can explore methodologies combining machine learning, sampling weights, and expanded variable sets to further advance survey data analysis.
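Purely as an illustration of the best-performing approach described above (not the authors' actual pipeline), the sketch below fits a sample-weighted multinomial logistic regression and reports a macro-averaged AUROC with scikit-learn; the file name and feature columns are placeholders.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical NHANES extract: laboratory + demographic features,
# a metabolic-phenotype label, and the survey sampling weight.
df = pd.read_csv("nhanes_2007_2012.csv")  # placeholder file name
features = ["HOMA_IR", "uric_acid", "albumin", "GGT", "T3_T4_ratio", "age", "sex"]
X, y, w = df[features], df["metabolic_phenotype"], df["sample_weight"]

X_tr, X_te, y_tr, y_te, w_tr, w_te = train_test_split(
    X, y, w, test_size=0.3, random_state=0, stratify=y
)

scaler = StandardScaler().fit(X_tr)
clf = LogisticRegression(max_iter=1000)  # handles multinomial targets
clf.fit(scaler.transform(X_tr), y_tr, sample_weight=w_tr)  # sampling weights enter here

proba = clf.predict_proba(scaler.transform(X_te))
print("AUROC:", roc_auc_score(y_te, proba, multi_class="ovr", average="macro"))
```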
The data set was created by preprocessing (imputing missing values, extracting new features) the Titanic - Machine Learning from Disaster data set.
Using this processed data set, machine learning models can be applied directly.
You can see preprocessing step in notebook: https://www.kaggle.com/fethiye/titanic-predict-survival-prediction
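The linked notebook contains the actual steps; the sketch below only illustrates the general kind of preprocessing described (imputing missing values, extracting a new feature, encoding categoricals) on the standard Titanic columns.

```python
import pandas as pd

# Illustration of Titanic-style preprocessing; the exact steps used for this
# dataset are in the linked Kaggle notebook.
df = pd.read_csv("train.csv")

# Fill missing values
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Extract a new feature: passenger title from the Name column
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False)

# Encode categorical columns so models can consume them directly
df = pd.get_dummies(df, columns=["Sex", "Embarked", "Title"], drop_first=True)
```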
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data quality control and preprocessing are often the first steps in processing next-generation sequencing (NGS) data of tumors. They not only help us evaluate the quality of sequencing data but also yield high-quality data for downstream analysis. However, by comparing the analysis results of data preprocessed with Cutadapt, FastP, and Trimmomatic against raw sequencing data, we found that the frequency of mutation detection fluctuated and differed between tools, and that human leukocyte antigen (HLA) typing could even produce erroneous results. We believe our research demonstrates the impact of data preprocessing steps on downstream analysis results. We hope it will promote the development or optimization of better data preprocessing methods, so that downstream analysis can be more accurate.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The files "adult_train.csv" and "adult_test.csv" contain preprocessed versions of the Adult dataset from the USI repository.
The file "adult_preprocessing.ipynb" contains a python notebook file with all the preprocessing steps used to generate "adult_train.csv" and "adult_test.csv" from the original Adult dataset.
The preprocessing steps include:
One-hot-encoding of categorical values
Imputation of missing values using knn-imputer with k=1
Standard scaling of ordinal attributes
Note: we assume a scenario in which the test set (every attribute besides the target, "income") is available before training; therefore, we combine the train and test sets before preprocessing.
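A minimal sketch of these steps with pandas and scikit-learn (file and column names are placeholders, not the notebook's exact code):

```python
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Combine train and test before preprocessing, as described above
train = pd.read_csv("adult_train_raw.csv")  # placeholder names for the original files
test = pd.read_csv("adult_test_raw.csv")
full = pd.concat([train, test], keys=["train", "test"])

target = full.pop("income")

# One-hot-encode categorical attributes
categorical = full.select_dtypes(include="object").columns
full = pd.get_dummies(full, columns=categorical)

# Impute missing values with a k-nearest-neighbour imputer, k=1
full = pd.DataFrame(KNNImputer(n_neighbors=1).fit_transform(full),
                    index=full.index, columns=full.columns)

# Standard-scale the numeric/ordinal attributes (assumed column names)
numeric = ["age", "education-num", "hours-per-week"]
full[numeric] = StandardScaler().fit_transform(full[numeric])

adult_train = full.loc["train"].assign(income=target.loc["train"].values)
adult_test = full.loc["test"].assign(income=target.loc["test"].values)
```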
https://data.gov.tw/license
This project aims to use artificial intelligence to identify potential risk factors for damaged asphalt pavements beneath the road surface, to explore the preprocessing procedures and steps for ground-penetrating radar data, and to propose initial solutions or recommendations for difficulties and problems encountered during preprocessing.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset provides a step-by-step pipeline for preprocessing metabolomics data.
The pipeline implements Probabilistic Quotient Normalization (PQN) to correct dilution effects in metabolomics measurements.
Includes guidance on handling raw metabolomics datasets obtained from LC-MS or NMR experiments.
Demonstrates Principal Component Analysis (PCA) for dimensionality reduction and exploratory data analysis.
Includes data visualization techniques to interpret PCA results effectively.
Suitable for metabolomics researchers and data scientists working on omics data.
Enables better reproducibility of preprocessing workflows for metabolomics studies.
Can be used to normalize data, detect outliers, and identify major patterns in metabolomics datasets.
Provides a Python-based notebook that is easy to adapt to new datasets.
Includes example datasets and code snippets for immediate application.
Helps users understand the impact of normalization on downstream statistical analyses.
Supports integration with other metabolomics pipelines or machine learning workflows.
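A minimal sketch of the PQN and PCA steps described above, assuming a samples-by-metabolites intensity table (the file name and transformation details are illustrative, not necessarily the pipeline's exact code):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical samples x metabolites intensity matrix
X = pd.read_csv("metabolite_intensities.csv", index_col=0)  # placeholder file name

# Probabilistic Quotient Normalization (PQN):
# use the median spectrum as the reference, then divide each sample by the
# median of its feature-wise quotients against that reference
reference = X.median(axis=0)
quotients = X.div(reference, axis=1)
dilution_factor = quotients.median(axis=1)
X_pqn = X.div(dilution_factor, axis=0)

# PCA on log-transformed, PQN-normalized data for exploratory analysis
scores = PCA(n_components=2).fit_transform(np.log1p(X_pqn))
pcs = pd.DataFrame(scores, index=X.index, columns=["PC1", "PC2"])
print(pcs.head())  # plot PC1 vs PC2 to spot outliers and major patterns
```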
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract: The amount of unstructured data grows with the popularization of the Internet. Texts in natural language represent a relevant and significant source for the analysis and production of knowledge. This work proposes a quantitative analysis of the preprocessing and training stages of a text classifier that uses the sentiments expressed by users as attributes. An Artificial Neural Network was used as the classifier, and texts from the Amazon, IMDB, and Yelp sites were used for the experiments. The database allows analysis of users' positive and negative sentiment in unstructured evaluations of products and services. Two distinct preprocessing processes and different trainings of the Artificial Neural Networks were carried out to classify the textual set. The results quantitatively confirm the importance of the preprocessing and training stages of the classifier, highlighting the importance of the vocabulary selected for text representation and classification. The available classification techniques achieve satisfactory results. However, even using two distinct preprocessing processes and identifying the best training process, it was not possible to completely eliminate the model's difficulty in learning and understanding sentiment classifications involving subjective characteristics of human expression.
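The paper's exact preprocessing and network configuration are not reproduced here; as a generic illustration of this kind of sentiment-classification setup (toy sentences, scikit-learn rather than the authors' implementation):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Toy stand-in for the Amazon/IMDB/Yelp sentences (1 = positive, 0 = negative)
texts = ["great product, works perfectly", "terrible, broke after one day",
         "loved the movie", "the food was awful"]
labels = [1, 0, 1, 0]

# Preprocessing: vocabulary selection via TF-IDF with English stop words removed
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(texts)

X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.5, random_state=0)

# A small feed-forward neural network as the classifier
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
clf.fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```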
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains RNA-Seq data preprocessing and differential gene expression (DGE) analysis.
It is designed for researchers, bioinformaticians, and students interested in transcriptomics.
The dataset includes raw count data and step-by-step preprocessing instructions.
It demonstrates quality control, normalization, and filtering of RNA-Seq data.
Differential expression analysis using popular tools and methods is included.
Results include differentially expressed genes with statistical significance.
It provides visualizations like PCA plots, heatmaps, and volcano plots.
The dataset is suitable for learning and reproducing RNA-Seq workflows.
Both human-readable explanations and code snippets are included for guidance.
It can serve as a reference for new RNA-Seq projects and research pipelines.
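As a rough Python illustration of the count-filtering and normalization steps such a workflow typically includes (not the dataset's bundled code, which uses its own tools and file names):

```python
import numpy as np
import pandas as pd

# Hypothetical genes x samples raw count matrix
counts = pd.read_csv("raw_counts.csv", index_col=0)  # placeholder file name

# Filter lowly expressed genes: keep genes with >= 10 counts in >= 2 samples
keep = (counts >= 10).sum(axis=1) >= 2
counts = counts[keep]

# Library-size normalization to counts per million (CPM), log2-transformed
cpm = counts / counts.sum(axis=0) * 1e6
log_cpm = np.log2(cpm + 1)

# Genes ranked by variance are often inspected before PCA and DGE analysis
print(log_cpm.var(axis=1).sort_values(ascending=False).head())
```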
https://spdx.org/licenses/CC0-1.0.html
Multiplexed imaging technologies provide insights into complex tissue architectures. However, challenges arise due to software fragmentation with cumbersome data handoffs, inefficiencies in processing large images (8 to 40 gigabytes per image), and limited spatial analysis capabilities. To efficiently analyze multiplexed imaging data, we developed SPACEc, a scalable end-to-end Python solution that handles image extraction, cell segmentation, and data preprocessing and incorporates machine-learning-enabled, multi-scale spatial analysis, operated through a user-friendly and interactive interface. The demonstration dataset was derived from a previous analysis and contains TMA cores from a human tonsil and tonsillitis sample that were acquired with the Akoya PhenoCycler-Fusion platform. The dataset can be used to test the workflow and establish it on a user's system, or to familiarize oneself with the pipeline.

Methods

Tissue samples: Tonsil cores were extracted from a larger multi-tumor tissue microarray (TMA), which included a total of 66 unique tissues (51 malignant and semi-malignant tissues, as well as 15 non-malignant tissues). Representative tissue regions were annotated on corresponding hematoxylin and eosin (H&E)-stained sections by a board-certified surgical pathologist (S.Z.). Annotations were used to generate the 66 cores, each 1 mm in diameter. FFPE tissue blocks were retrieved from the tissue archives of the Institute of Pathology, University Medical Center Mainz, Germany, and the Department of Dermatology, University Medical Center Mainz, Germany. The multi-tumor TMA block was sectioned at 3 µm thickness onto SuperFrost Plus microscopy slides before being processed for CODEX multiplex imaging as previously described.

CODEX multiplexed imaging and processing: To run the CODEX machine, the slide was taken from the storage buffer and placed in PBS for 10 minutes to equilibrate. After drying the PBS with a tissue, a flow cell was sealed onto the tissue slide. The assembled slide and flow cell were then placed in a PhenoCycler Buffer made from 10X PhenoCycler Buffer & Additive for at least 10 minutes before starting the experiment. A 96-well reporter plate was prepared with each reporter corresponding to the correct barcoded antibody for each cycle, with up to 3 reporters per cycle per well. The fluorescence reporters were mixed with 1X PhenoCycler Buffer, Additive, nuclear-staining reagent, and assay reagent according to the manufacturer's instructions. With the reporter plate and assembled slide and flow cell placed into the CODEX machine, the automated multiplexed imaging experiment was initiated. Each imaging cycle included steps for reporter binding, imaging of three fluorescent channels, and reporter stripping to prepare for the next cycle and set of markers. This was repeated until all markers were imaged. After the experiment, a .qptiff image file containing individual antibody channels and the DAPI channel was obtained. Image stitching, drift compensation, deconvolution, and cycle concatenation are performed within the Akoya PhenoCycler software. The raw imaging data output (tiff, 377.442 nm per pixel for 20x CODEX) is first examined with QuPath software (https://qupath.github.io/) for inspection of staining quality. Any markers that produce unexpected patterns or low signal-to-noise ratios should be excluded from the ensuing analysis. The qptiff files must be converted into tiff files for input into SPACEc.
Data preprocessing includes image stitching, drift compensation, deconvolution, and cycle concatenation performed using the Akoya Phenocycler software. The raw imaging data (qptiff, 377.442 nm/pixel for 20x CODEX) files from the Akoya PhenoCycler technology were first examined with QuPath software (https://qupath.github.io/) to inspect staining qualities. Markers with untenable patterns or low signal-to-noise ratios were excluded from further analysis. A custom CODEX analysis pipeline was used to process all acquired CODEX data (scripts available upon request). The qptiff files were converted into tiff files for tissue detection (watershed algorithm) and cell segmentation.
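The authors' pipeline scripts are available on request; purely as a generic illustration of tissue detection via thresholding and watershed segmentation (scikit-image, placeholder file name), not the SPACEc code itself:

```python
from scipy import ndimage as ndi
from skimage import filters, io, segmentation

# Hypothetical single-channel image (e.g. the DAPI channel exported from the qptiff)
img = io.imread("dapi_channel.tif")  # placeholder file name

# Foreground mask via Otsu thresholding
mask = img > filters.threshold_otsu(img)

# Watershed on the distance transform to separate touching objects
distance = ndi.distance_transform_edt(mask)
markers, _ = ndi.label(distance > 0.5 * distance.max())
labels = segmentation.watershed(-distance, markers, mask=mask)
print("objects detected:", labels.max())
```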
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This contains both analysed and unanalysed data from the paper "The efficacy of different preprocessing steps in reducing motion-related confounds in diffusion MRI connectomics". See https://github.com/BMHLab/MotionStructuralConnectivity for the code used to analyse the data. Simply unzip the folders into MotionStructuralConnectivity.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong, 2014). EDA comprises a set of statistical and data mining procedures to describe data. We ran EDA to provide statistical facts and inform conclusions. The mined facts provide arguments that inform the Systematic Literature Review (SLR) of DL4SE.
The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers for the proposed research questions and formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships among Deep Learning reported literature in Software Engineering. Such hidden relationships are collected and analyzed to illustrate the state-of-the-art of DL techniques employed in the software engineering context.
Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD (Fayyad et al., 1996). The KDD process extracts knowledge from a DL4SE structured database. This structured database was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD process involves five stages:
Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organize the data into 35 features or attributes that you find in the repository. In fact, we manually engineered features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, SE data, and so on.
Preprocessing. The preprocessing applied was transforming the features into the correct type (nominal), removing outliers (papers that do not belong to the DL4SE), and re-inspecting the papers to extract missing information produced by the normalization process. For instance, we normalize the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”. “Other Metrics” refers to unconventional metrics found during the extraction. Similarly, the same normalization was applied to other features like “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of the paper by the data mining tasks or methods.
Transformation. In this stage, we omitted to use any data transformation method except for the clustering analysis. We performed a Principal Component Analysis to reduce 35 features into 2 components for visualization purposes. Furthermore, PCA also allowed us to identify the number of clusters that exhibit the maximum reduction in variance. In other words, it helped us to identify the number of clusters to be used when tuning the explainable models.
Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be oriented to uncovering hidden relationships among the extracted features (Correlations and Association Rules) and to categorizing the DL4SE papers for a better segmentation of the state-of-the-art (Clustering). A clear explanation is provided in the subsection "Data Mining Tasks for the SLR of DL4SE".
Interpretation/Evaluation. We used Knowledge Discovery to automatically find patterns in our papers that resemble "actionable knowledge". This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes. This reasoning process produces an argument support analysis (see this link).
We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.
Overview of the most meaningful Association Rules. Rectangles are both Premises and Conclusions. An arrow connecting a Premise with a Conclusion implies that given some premise, the conclusion is associated. E.g., Given that an author used Supervised Learning, we can conclude that their approach is irreproducible with a certain Support and Confidence.
Support = the number of occurrences in which the statement is true, divided by the total number of statements.
Confidence = the support of the statement divided by the support of the premise.
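A tiny worked example of these two measures on hypothetical paper-level flags (counts invented purely for illustration):

```python
# Hypothetical extraction: one entry per paper, boolean feature flags
papers = [
    {"supervised_learning": True,  "irreproducible": True},
    {"supervised_learning": True,  "irreproducible": True},
    {"supervised_learning": True,  "irreproducible": False},
    {"supervised_learning": False, "irreproducible": False},
]

# Rule: supervised_learning -> irreproducible
both = sum(p["supervised_learning"] and p["irreproducible"] for p in papers)
premise = sum(p["supervised_learning"] for p in papers)

support = both / len(papers)                        # 2/4 = 0.50
confidence = support / (premise / len(papers))      # equivalently both/premise = 2/3
print(f"support={support:.2f}, confidence={confidence:.2f}")
```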
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Research Domain/Project:
This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used for training, validating, and testing machine learning models.
Purpose of the Dataset:
The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.
Dataset Creation:
Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. Missing values and outliers were handled with appropriate techniques (e.g., imputation or removal).
Structure of the Dataset:
The dataset consists of several files organized into folders by data type:
Training Data: Contains the training dataset used to train the machine learning model.
Validation Data: Used for hyperparameter tuning and model selection.
Test Data: Reserved for final model evaluation.
Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.
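A minimal loading sketch, assuming the folder and file names described above:

```python
import pandas as pd

# Folder and file names follow the convention described above
train = pd.read_csv("Training Data/train_data.csv")
valid = pd.read_csv("Validation Data/validation_data.csv")
test = pd.read_csv("Test Data/test_data.csv")

# Rows are individual data points; columns are features (plus the label)
print(train.shape, valid.shape, test.shape)
```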
Software Requirements:
To open and work with this dataset, you need an environment such as VS Code or Jupyter Notebook, along with tools like:
Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)
Reusability:
Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.
Limitations:
The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Improving the accuracy of predictions of future values based on past and current observations has been pursued by enhancing prediction methods, combining those methods, or performing data preprocessing. In this paper, another approach is taken, namely increasing the number of inputs in the dataset. This approach is especially useful for shorter time series. By filling in the in-between values of the time series, the number of training samples can be increased, thus increasing the generalization capability of the predictor. The algorithm used to make predictions is a Neural Network, as it is widely used in the literature for time series tasks. For comparison, Support Vector Regression is also employed. The dataset used in the experiment is the frequency of USPTO patents and PubMed scientific publications in the field of health, namely on Apnea, Arrhythmia, and Sleep Stages. Another time series dataset, designated for the NN3 Competition in the field of transportation, is also used for benchmarking. The experimental results show that prediction performance can be significantly increased by filling in in-between data in the time series. Furthermore, detrending and deseasonalization, which separate the data into trend, seasonal, and stationary components, also improve prediction performance on both the original and the filled datasets. The optimal expansion in this experiment is about five times the length of the original dataset.
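A rough illustration of the in-between-filling idea (linear interpolation and a small MLP are used here only as an example; the paper does not prescribe these exact choices):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Hypothetical short yearly series (e.g. patent or publication counts)
values = np.array([12.0, 15.0, 21.0, 19.0, 30.0, 41.0])
index = np.arange(len(values))

# Insert in-between points by up-sampling the index and interpolating
dense_index = np.arange(0, len(values) - 1 + 1e-9, 0.2)
filled = np.interp(dense_index, index, values)

# Sliding-window regression: predict the next value from the previous three
window = 3
X = np.array([filled[i:i + window] for i in range(len(filled) - window)])
y = filled[window:]

model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0).fit(X, y)
print("one-step-ahead prediction:", model.predict(X[-1:])[0])
```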
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We have revised the dataset, ensuring that it is thoroughly processed and ready for analysis. The attached second dataset has undergone comprehensive preprocessing, including data cleaning, normalization, and feature extraction, to enhance the quality and usability of the data. These steps are crucial to ensure that the dataset is free from inconsistencies, missing values, and irrelevant information, thereby improving the accuracy and reliability of subsequent machine learning models.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
A collection of tweets about the international computer technology company Dell can be found in the "Preprocessed Dell Tweets" dataset. These tweets have undergone meticulous preprocessing to make sentiment analysis and natural language processing tasks easier. The dataset originally consisted of tweets scraped from Twitter; preprocessing was then performed to clean and organize the data.
Important characteristics:
Text: The preprocessed text of the tweets is contained in this column, making it appropriate for natural language processing and text analysis.
Sentiment: To facilitate machine learning tasks, the sentiment column has been converted into numeric values. Sentiment labels are encoded as follows: 0 = Neutral, 1 = Positive, 2 = Negative.
Possible Applications: The dataset is well suited for machine learning, sentiment analysis, and sentiment classification tasks. Researchers and data scientists can use it to test and refine sentiment analysis methods, and the numeric sentiment labels make it straightforward to train predictive sentiment models.
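A small usage sketch (the file name and exact column capitalization are assumptions based on the description above):

```python
import pandas as pd

df = pd.read_csv("preprocessed_dell_tweets.csv")  # placeholder file name

# Decode the numeric sentiment labels described above
label_names = {0: "Neutral", 1: "Positive", 2: "Negative"}
df["sentiment_label"] = df["Sentiment"].map(label_names)
print(df[["Text", "Sentiment", "sentiment_label"]].head())
```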
Data Preprocessing: To prepare the dataset for analysis, the following preprocessing steps were applied:
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Generating_Video_Data
Hanzhe Liang modified: 23 May
😊 This library contains some common data for training video generation models. It is a library of summary datasets, with thanks to the original authors for their contributions. Regardless of which data you need, start with the following steps: mkdir data, then cd data.
Next, go to my cloud database link and download the data you want into the folder data/. Once you've downloaded the data you need, we've found preprocessing steps from its… See the full description on the dataset page: https://huggingface.co/datasets/HanzheL/Generating_Video_Data.
https://choosealicense.com/licenses/other/
Chicago Crime Description Dataset
This dataset contains reported crime incidents in Chicago from January 1, 2022, to December 31, 2023. It includes 4,033 sequences with 202,333 events across 20 crime types. The data is sourced from the Chicago Data Portal under the Terms of Use. The detailed data preprocessing steps used to create this dataset can be found in the TPP-LLM paper and TPP-Embedding paper. If you find this dataset useful, we kindly invite you to cite the following… See the full description on the dataset page: https://huggingface.co/datasets/tppllm/chicago-crime-description.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a pre-processed dataset to streamline testing of an expression analysis R package.
The preprocessing steps are self-contained in a Nextflow command line with parameters (see run.sh).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These datasets contain data from the AHRQ Social Determinants of Health (SDOH) Database (https://www.ahrq.gov/sdoh/data-analytics/sdoh-data.html), processed to facilitate machine learning/multivariate analyses focusing on the healthcare context of counties. The datasets derive from the AHRQ 2019 and 2018 county-level SDOH files. Three sets of files are provided. The first "Raw" set has the source SDOH data with a few core pre-processing steps applied. The second, “Full” set has variables characterizing the health and healthcare context of counties (rather than outcomes), with further processing steps applied to facilitate multivariate and machine learning analytics (e.g. handling of missing data, normalizing, standardizing). The third set, labeled “Reduced”, incorporates those same data processing steps but in addition has had a further data reduction step applied in which groups of highly intercorrelated variables were removed and replaced with corresponding principal component scores, one for each group. These files would be useful for investigators interested in characterizing and comparing the broad SDOH context of US counties.
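A rough sketch of the data-reduction step used for the "Reduced" files, in which a group of highly intercorrelated columns is replaced by its first principal-component score (the file and column names below are hypothetical):

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("sdoh_full_2019.csv")  # placeholder for a "Full" county-level file

# Hypothetical group of highly intercorrelated healthcare-access variables
group = ["uninsured_rate", "pcp_per_100k", "hospital_beds_per_100k"]

# Replace the group with its score on the first principal component
scores = PCA(n_components=1).fit_transform(StandardScaler().fit_transform(df[group]))
df["healthcare_access_pc1"] = scores[:, 0]
df = df.drop(columns=group)
```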
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The MLCommons Dollar Street Dataset is a collection of images of everyday household items from homes around the world that visually captures socioeconomic diversity of traditionally underrepresented populations. It consists of public domain data, licensed for academic, commercial and non-commercial usage, under CC-BY and CC-BY-SA 4.0. The dataset was developed because similar datasets lack socioeconomic metadata and are not representative of global diversity.
This is a subset of the original dataset that can be used for multiclass classification with 10 categories. It is designed to be used in teaching, similar to the widely used, but unlicensed CIFAR-10 dataset.
These are the preprocessing steps that were performed:
This is the label mapping:
| Category | label |
| --- | --- |
| day bed | 0 |
| dishrag | 1 |
| plate | 2 |
| running shoe | 3 |
| soap dispenser | 4 |
| street sign | 5 |
| table lamp | 6 |
| tile roof | 7 |
| toilet seat | 8 |
| washing machine | 9 |
Check out this notebook to see how the subset was created: https://github.com/carpentries-lab/deep-learning-intro/blob/main/instructors/prepare-dollar-street-data.ipynb
The original dataset was downloaded from https://www.kaggle.com/datasets/mlcommons/the-dollar-street-dataset. See https://mlcommons.org/datasets/dollar-street/ for more information.