100+ datasets found
  1. Preprocessing steps

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Jun 28, 2024
    Cite
    Kim, Min-Hee; Ahn, Hyeong Jun; Ishikawa, Kyle (2024). Preprocessing steps. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001483628
    Dataset updated
    Jun 28, 2024
    Authors
    Kim, Min-Hee; Ahn, Hyeong Jun; Ishikawa, Kyle
    Description

    In this study, we employed various machine learning models to predict metabolic phenotypes, focusing on thyroid function, using data from the National Health and Nutrition Examination Survey (NHANES) from 2007 to 2012. Our analysis used laboratory parameters relevant to thyroid function or metabolic dysregulation, in addition to demographic features, aiming to uncover potential associations between thyroid function and metabolic phenotypes. Multinomial Logistic Regression performed best at identifying the relationship between thyroid function and metabolic phenotypes, achieving an area under the receiver operating characteristic curve (AUROC) of 0.818, followed closely by a Neural Network (AUROC: 0.814). The performance of Random Forest, Boosted Trees, and K-Nearest Neighbors was inferior to the first two methods (AUROC 0.811, 0.811, and 0.786, respectively). In the Random Forest model, the homeostatic model assessment for insulin resistance, serum uric acid, serum albumin, gamma-glutamyl transferase, and the triiodothyronine/thyroxine ratio ranked highest in variable importance. These results highlight the potential of machine learning for understanding complex relationships in health data. However, model performance may vary depending on data characteristics and specific requirements. Furthermore, we emphasize the significance of accounting for sampling weights in complex survey data analysis and the potential benefits of incorporating additional variables to enhance model accuracy and insights. Future research can explore advanced methodologies combining machine learning, sampling weights, and expanded variable sets to further advance survey data analysis.
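
    A minimal scikit-learn sketch of the sample-weighted multinomial model and AUROC evaluation described above (the data, features, and weights below are synthetic stand-ins, not NHANES values):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))        # stand-ins for labs such as HOMA-IR, uric acid, GGT
y = rng.integers(0, 3, size=600)     # three metabolic phenotype classes
w = rng.uniform(0.5, 2.0, size=600)  # survey-style sampling weights

# Sampling weights enter the fit via sample_weight
model = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=w)
proba = model.predict_proba(X)
auroc = roc_auc_score(y, proba, multi_class="ovr", sample_weight=w)
```

    On real survey data the weights would come from the NHANES design variables rather than a uniform draw.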

  2. Preprocessed Titanic Survived Prediction Data

    • kaggle.com
    zip
    Updated Jan 23, 2021
    Cite
    Fethiye (2021). Preprocessed Titanic Survived Prediction Data [Dataset]. https://www.kaggle.com/datasets/fethiye/titanic-preprocessed-train-data
    Available download formats: zip (21508 bytes)
    Dataset updated
    Jan 23, 2021
    Authors
    Fethiye
    Description

    Context

    This data set was created by preprocessing (filling in missing data, extracting new features) the Titanic - Machine Learning from Disaster data set.

    Machine learning models can be applied to this processed data set directly.

    You can see the preprocessing steps in this notebook: https://www.kaggle.com/fethiye/titanic-predict-survival-prediction
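
    Since the data are already preprocessed, fitting a model takes only a few lines; a sketch with a tiny stand-in frame (the actual column names in the CSV may differ):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Stand-in for loading the dataset's CSV with pd.read_csv; columns are illustrative.
df = pd.DataFrame({
    "Pclass":   [1, 3, 2, 3, 1, 2],
    "Sex":      [0, 1, 0, 1, 0, 1],   # already label-encoded by preprocessing
    "Age":      [38.0, 22.0, 30.0, 26.0, 54.0, 27.0],
    "Survived": [1, 0, 1, 0, 1, 0],
})
X, y = df.drop(columns=["Survived"]), df["Survived"]
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
```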

  3. Data_Sheet_1_Assessing the Impact of Data Preprocessing on Analyzing Next...

    • frontiersin.figshare.com
    pdf
    Updated Jun 2, 2023
    Cite
    Binsheng He; Rongrong Zhu; Huandong Yang; Qingqing Lu; Weiwei Wang; Lei Song; Xue Sun; Guandong Zhang; Shijun Li; Jialiang Yang; Geng Tian; Pingping Bing; Jidong Lang (2023). Data_Sheet_1_Assessing the Impact of Data Preprocessing on Analyzing Next Generation Sequencing Data.pdf [Dataset]. http://doi.org/10.3389/fbioe.2020.00817.s001
    Available download formats: pdf
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Frontiers
    Authors
    Binsheng He; Rongrong Zhu; Huandong Yang; Qingqing Lu; Weiwei Wang; Lei Song; Xue Sun; Guandong Zhang; Shijun Li; Jialiang Yang; Geng Tian; Pingping Bing; Jidong Lang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data quality control and preprocessing are often the first steps in processing next-generation sequencing (NGS) data of tumors. Not only do they help us evaluate the quality of the sequencing data, they also yield high-quality data for downstream analysis. However, by comparing analysis results after preprocessing with Cutadapt, FastP, and Trimmomatic against raw sequencing data, we found that mutation-detection frequencies fluctuated and differed across tools, and that human leukocyte antigen (HLA) typing produced outright erroneous results. We believe our research demonstrates the impact of data preprocessing steps on downstream analysis results. We hope it can promote the development or optimization of better data preprocessing methods, so that downstream analysis can be more accurate.

  4. Adult dataset preprocessed

    • data.niaid.nih.gov
    Updated Jul 1, 2024
    Cite
    Pustozerova, Anastasia; Schuster, Verena (2024). Adult dataset preprocessed [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_12533513
    Dataset updated
    Jul 1, 2024
    Dataset provided by
    SBA Research
    Authors
    Pustozerova, Anastasia; Schuster, Verena
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The files "adult_train.csv" and "adult_test.csv" contain preprocessed versions of the Adult dataset from the UCI repository.

    The file "adult_preprocessing.ipynb" is a Python notebook with all the preprocessing steps used to generate "adult_train.csv" and "adult_test.csv" from the original Adult dataset.

    The preprocessing steps include:

    • One-hot encoding of categorical values
    • Imputation of missing values using a KNN imputer with k=1
    • Standard scaling of ordinal attributes

    Note: we assume a scenario in which the test set (every attribute besides the target, "income") is available before training; therefore, we combine the train and test sets before preprocessing.
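
    The three listed steps can be sketched with pandas and scikit-learn (a toy frame with assumed column names, not the actual Adult schema):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "age":       [39.0, 50.0, np.nan, 28.0],
    "workclass": ["Private", "Self-emp", "Private", "Gov"],
})

# 1. One-hot-encode categorical values
df = pd.get_dummies(df, columns=["workclass"])

# 2. Impute missing values with a KNN imputer, k=1
df = pd.DataFrame(KNNImputer(n_neighbors=1).fit_transform(df), columns=df.columns)

# 3. Standard-scale the numeric/ordinal attribute
df["age"] = StandardScaler().fit_transform(df[["age"]]).ravel()
```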

  5. Artificial intelligence preprocessing of ground penetrating radar signals...

    • data.gov.tw
    pdf
    Updated Sep 15, 2025
    Cite
    Institute of Transportation, MOTC (2025). Artificial intelligence preprocessing of ground penetrating radar signals for image recognition: an initial exploration [Dataset]. https://data.gov.tw/en/datasets/174565
    Available download formats: pdf
    Dataset updated
    Sep 15, 2025
    Dataset authored and provided by
    Institute of Transportation, MOTC
    License

    https://data.gov.tw/license

    Description

    This project aims to use artificial intelligence to identify potential risk factors for damaged asphalt pavement beneath the road surface, to explore the preprocessing procedures and steps for ground-penetrating radar data, and to propose initial solutions or recommendations for difficulties and problems encountered during preprocessing.

  6. Metabolomics Data Preprocessing PQN PCA

    • kaggle.com
    zip
    Updated Nov 29, 2025
    Cite
    Dr. Nagendra (2025). Metabolomics Data Preprocessing PQN PCA [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/metabolomics-data-preprocessing-pqn-pca
    Available download formats: zip (22763 bytes)
    Dataset updated
    Nov 29, 2025
    Authors
    Dr. Nagendra
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset provides a step-by-step pipeline for preprocessing metabolomics data.

    • The pipeline implements Probabilistic Quotient Normalization (PQN) to correct dilution effects in metabolomics measurements.
    • Includes guidance on handling raw metabolomics datasets obtained from LC-MS or NMR experiments.
    • Demonstrates Principal Component Analysis (PCA) for dimensionality reduction and exploratory data analysis.
    • Includes data visualization techniques to interpret PCA results effectively.
    • Suitable for metabolomics researchers and data scientists working on omics data.
    • Enables better reproducibility of preprocessing workflows for metabolomics studies.
    • Can be used to normalize data, detect outliers, and identify major patterns in metabolomics datasets.
    • Provides a Python-based notebook that is easy to adapt to new datasets.
    • Includes example datasets and code snippets for immediate application.
    • Helps users understand the impact of normalization on downstream statistical analyses.
    • Supports integration with other metabolomics pipelines or machine learning workflows.
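
    The PQN step can be sketched in a few lines of NumPy (synthetic intensities; a real pipeline would follow with PCA, e.g. via scikit-learn):

```python
import numpy as np

def pqn(X):
    """Probabilistic Quotient Normalization for a samples x features matrix."""
    ref = np.median(X, axis=0)               # reference spectrum
    quotients = X / ref                      # feature-wise quotients per sample
    dilution = np.median(quotients, axis=1)  # most probable dilution factor
    return X / dilution[:, None]

rng = np.random.default_rng(1)
base = rng.uniform(1.0, 10.0, size=(5, 20))                     # true signals
diluted = base * np.array([[1.0], [2.0], [0.5], [1.5], [3.0]])  # dilution effect
corrected = pqn(diluted)
```

    After correction, each sample's median quotient against the reference is exactly 1, i.e. the estimated dilution factor has been divided out.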

  7. Data from: Evaluation of the preprocessing and training stages in text...

    • scielo.figshare.com
    jpeg
    Updated May 30, 2023
    Cite
    Lucas Marques Sathler Guimarães; Magali Rezende Gouvêa Meireles; Paulo Eduardo Maciel de Almeida (2023). Evaluation of the preprocessing and training stages in text classification algorithms in the context of information retrieval [Dataset]. http://doi.org/10.6084/m9.figshare.8162216.v1
    Available download formats: jpeg
    Dataset updated
    May 30, 2023
    Dataset provided by
    SciELO (http://www.scielo.org/)
    Authors
    Lucas Marques Sathler Guimarães; Magali Rezende Gouvêa Meireles; Paulo Eduardo Maciel de Almeida
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract: The amount of unstructured data grows with the popularization of the Internet. Texts in natural language represent a relevant and significant set for the analysis and production of knowledge. This work proposes a quantitative analysis of the preprocessing and training stages of a text classifier that uses, as an attribute, the sentiments expressed by users. An Artificial Neural Network was used as the classifier algorithm, with texts from the Amazon, IMDB, and Yelp sites as experimental data. The database allows analysis of the positive and negative sentiments users express in unstructured evaluations of products and services. Two distinct preprocessing procedures and different trainings of the Artificial Neural Networks were carried out to classify the textual set. The results quantitatively confirm the importance of the classifier's preprocessing and training stages, highlighting the importance of the vocabulary selected for text representation and classification. The available classification techniques achieve satisfactory results. However, even using two distinct preprocessing procedures and identifying the best training process, it was not possible to fully eliminate the model's difficulty in learning and classifying sentiments that involve subjective characteristics of human feeling.

  8. RNA_Seq_Data_Preprocessing_DGE analysis

    • kaggle.com
    zip
    Updated Nov 29, 2025
    Cite
    Dr. Nagendra (2025). RNA_Seq_Data_Preprocessing_DGE analysis [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/rna-seq-data-preprocessing-dge-analysis
    Available download formats: zip (75256 bytes)
    Dataset updated
    Nov 29, 2025
    Authors
    Dr. Nagendra
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset contains RNA-Seq data preprocessing and differential gene expression (DGE) analysis. It is designed for researchers, bioinformaticians, and students interested in transcriptomics.

    • The dataset includes raw count data and step-by-step preprocessing instructions.
    • It demonstrates quality control, normalization, and filtering of RNA-Seq data.
    • Differential expression analysis using popular tools and methods is included.
    • Results include differentially expressed genes with statistical significance.
    • It provides visualizations like PCA plots, heatmaps, and volcano plots.
    • The dataset is suitable for learning and reproducing RNA-Seq workflows.
    • Both human-readable explanations and code snippets are included for guidance.
    • It can serve as a reference for new RNA-Seq projects and research pipelines.
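
    As an illustration of the normalization and filtering steps mentioned (not the dataset's own code), counts-per-million scaling and low-count filtering of a genes x samples matrix look like:

```python
import numpy as np

# Toy raw count matrix: rows are genes, columns are samples
counts = np.array([[100, 200, 150],
                   [  0,   1,   0],
                   [500, 400, 450],
                   [ 10,  20,  15]], dtype=float)

lib_sizes = counts.sum(axis=0)   # per-sample library sizes
cpm = counts / lib_sizes * 1e6   # counts-per-million normalization

# Keep genes expressed (CPM > 1) in at least two samples
keep = (cpm > 1).sum(axis=1) >= 2
filtered = counts[keep]
```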

  9. Demo dataset for: SPACEc, a streamlined, interactive Python workflow for...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Jul 8, 2024
    Cite
    Yuqi Tan; Tim Kempchen (2024). Demo dataset for: SPACEc, a streamlined, interactive Python workflow for multiplexed image processing and analysis [Dataset]. http://doi.org/10.5061/dryad.brv15dvj1
    Available download formats: zip
    Dataset updated
    Jul 8, 2024
    Dataset provided by
    Stanford University School of Medicine
    Authors
    Yuqi Tan; Tim Kempchen
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Multiplexed imaging technologies provide insights into complex tissue architectures. However, challenges arise due to software fragmentation with cumbersome data handoffs, inefficiencies in processing large images (8 to 40 gigabytes per image), and limited spatial analysis capabilities. To efficiently analyze multiplexed imaging data, we developed SPACEc, a scalable end-to-end Python solution that handles image extraction, cell segmentation, and data preprocessing and incorporates machine-learning-enabled, multi-scale spatial analysis, operated through a user-friendly, interactive interface. The demonstration dataset was derived from a previous analysis and contains TMA cores from a human tonsil and tonsillitis sample that were acquired with the Akoya PhenoCycler Fusion platform. The dataset can be used to test the workflow and establish it on a user's system, or to familiarize oneself with the pipeline.

    Methods

    Tissue samples: Tonsil cores were extracted from a larger multi-tumor tissue microarray (TMA), which included a total of 66 unique tissues (51 malignant and semi-malignant tissues, as well as 15 non-malignant tissues). Representative tissue regions were annotated on corresponding hematoxylin and eosin (H&E)-stained sections by a board-certified surgical pathologist (S.Z.). Annotations were used to generate the 66 cores, each 1 mm in diameter. FFPE tissue blocks were retrieved from the tissue archives of the Institute of Pathology, University Medical Center Mainz, Germany, and the Department of Dermatology, University Medical Center Mainz, Germany. The multi-tumor TMA block was sectioned at 3 µm thickness onto SuperFrost Plus microscopy slides before being processed for CODEX multiplex imaging as previously described.

    CODEX multiplexed imaging and processing: To run the CODEX machine, the slide was taken from the storage buffer and placed in PBS for 10 minutes to equilibrate. After drying the PBS with a tissue, a flow cell was sealed onto the tissue slide. The assembled slide and flow cell were then placed in a PhenoCycler Buffer made from 10X PhenoCycler Buffer & Additive for at least 10 minutes before starting the experiment. A 96-well reporter plate was prepared with each reporter corresponding to the correct barcoded antibody for each cycle, with up to 3 reporters per cycle per well. The fluorescence reporters were mixed with 1X PhenoCycler Buffer, Additive, nuclear-staining reagent, and assay reagent according to the manufacturer's instructions. With the reporter plate and assembled slide and flow cell placed into the CODEX machine, the automated multiplexed imaging experiment was initiated. Each imaging cycle included steps for reporter binding, imaging of three fluorescent channels, and reporter stripping to prepare for the next cycle and set of markers. This was repeated until all markers were imaged. After the experiment, a .qptiff image file containing individual antibody channels and the DAPI channel was obtained.

    Data preprocessing includes image stitching, drift compensation, deconvolution, and cycle concatenation, performed within the Akoya PhenoCycler software. The raw imaging data (qptiff, 377.442 nm/pixel for 20x CODEX) were first examined with QuPath software (https://qupath.github.io/) to inspect staining quality; markers with unexpected patterns or low signal-to-noise ratios were excluded from further analysis. A custom CODEX analysis pipeline was used to process all acquired CODEX data (scripts available upon request). The qptiff files were converted into tiff files for input into SPACEc, for tissue detection (watershed algorithm) and cell segmentation.

  10. Data from: The efficacy of different preprocessing steps in reducing...

    • researchdata.edu.au
    • bridges.monash.edu
    Updated May 5, 2022
    Cite
    Stuart Oldham (2022). The efficacy of different preprocessing steps in reducing motion-related confounds in diffusion MRI connectomics [Dataset]. http://doi.org/10.26180/5e7313d012cee
    Dataset updated
    May 5, 2022
    Dataset provided by
    Monash University
    Authors
    Stuart Oldham
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This record contains both analysed and unanalysed data from the paper "The efficacy of different preprocessing steps in reducing motion-related confounds in diffusion MRI connectomics". See https://github.com/BMHLab/MotionStructuralConnectivity for the code used to analyse the data. Simply unzip the folders into the MotionStructuralConnectivity directory.

  11. Data Analysis for the Systematic Literature Review of DL4SE

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jul 19, 2024
    Cite
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk (2024). Data Analysis for the Systematic Literature Review of DL4SE [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4768586
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    College of William and Mary
    Washington and Lee University
    Authors
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong; 2014). An Exploratory Data Analysis (EDA) comprises a set of statistical and data mining procedures to describe data. We ran EDA to provide statistical facts and inform conclusions. The mined facts allow attaining arguments that would influence the Systematic Literature Review of DL4SE.

    The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers for the proposed research questions and formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships among Deep Learning reported literature in Software Engineering. Such hidden relationships are collected and analyzed to illustrate the state-of-the-art of DL techniques employed in the software engineering context.

    Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD (Fayyad, et al; 1996). The KDD process extracts knowledge from a DL4SE structured database. This structured database was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD involves five stages:

    Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organize the data into 35 features or attributes that you find in the repository. In fact, we manually engineered features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, SE data, and so on.

    Preprocessing. The preprocessing applied was transforming the features into the correct type (nominal), removing outliers (papers that do not belong to the DL4SE), and re-inspecting the papers to extract missing information produced by the normalization process. For instance, we normalize the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”. “Other Metrics” refers to unconventional metrics found during the extraction. Similarly, the same normalization was applied to other features like “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of the paper by the data mining tasks or methods.

    Transformation. In this stage, we omitted to use any data transformation method except for the clustering analysis. We performed a Principal Component Analysis to reduce 35 features into 2 components for visualization purposes. Furthermore, PCA also allowed us to identify the number of clusters that exhibit the maximum reduction in variance. In other words, it helped us to identify the number of clusters to be used when tuning the explainable models.

    Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be oriented to uncover hidden relationships in the extracted features (Correlations and Association Rules) and to categorize the DL4SE papers for a better segmentation of the state-of-the-art (Clustering). A clear explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.

    Interpretation/Evaluation. We used Knowledge Discovery to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes. This reasoning process produces an argument support analysis.

    We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.

    Overview of the most meaningful Association Rules. Rectangles are both Premises and Conclusions. An arrow connecting a Premise with a Conclusion implies that given some premise, the conclusion is associated. E.g., Given that an author used Supervised Learning, we can conclude that their approach is irreproducible with a certain Support and Confidence.

    Support = the number of occurrences in which the statement is true, divided by the total number of statements.
    Confidence = the support of the statement divided by the number of occurrences of the premise.
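
    As a toy sketch of the two definitions (the transaction list is invented; confidence is computed in the equivalent support-ratio form):

```python
# Each "statement" records the properties observed for one paper
transactions = [
    {"Supervised Learning", "Irreproducible"},
    {"Supervised Learning", "Irreproducible"},
    {"Supervised Learning", "Reproducible"},
    {"Unsupervised Learning", "Irreproducible"},
]

def support(itemset):
    """Fraction of statements in which the itemset holds."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(premise, conclusion):
    """Support of the full rule relative to the support of its premise."""
    return support(premise | conclusion) / support(premise)

rule_support = support({"Supervised Learning", "Irreproducible"})
rule_confidence = confidence({"Supervised Learning"}, {"Irreproducible"})
```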

  12. FAIR Dataset for Disease Prediction in Healthcare Applications

    • test.researchdata.tuwien.ac.at
    bin, csv, json, png
    Updated Apr 14, 2025
    Cite
    Sufyan Yousaf (2025). FAIR Dataset for Disease Prediction in Healthcare Applications [Dataset]. http://doi.org/10.70124/5n77a-dnf02
    Available download formats: csv, json, bin, png
    Dataset updated
    Apr 14, 2025
    Dataset provided by
    TU Wien
    Authors
    Sufyan Yousaf
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    Context and Methodology

    • Research Domain/Project:
      This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validating, and testing.

    • Purpose of the Dataset:
      The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.

    • Dataset Creation:
      Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. Missing values and outliers were handled with appropriate techniques (e.g., imputation or removal).
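
    The splitting step can be sketched with scikit-learn (synthetic data; the 60/20/20 proportions are an illustrative assumption, not stated by the dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))     # cleaned, normalized features
y = rng.integers(0, 2, size=100)  # binary disease label

# Carve off 20% for test, then 25% of the remainder for validation -> 60/20/20
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)
```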

    Technical Details

    • Structure of the Dataset:
      The dataset consists of several files organized into folders by data type:

      • Training Data: Contains the training dataset used to train the machine learning model.

      • Validation Data: Used for hyperparameter tuning and model selection.

      • Test Data: Reserved for final model evaluation.

      Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.

    • Software Requirements:
      To open and work with this dataset, you need an environment such as VS Code or Jupyter Notebook, along with:

      • Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)

    Further Details

    • Reusability:
      Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.

    • Limitations:
      The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.

  13. Data from: Enriching time series datasets using Nonparametric kernel...

    • figshare.com
    pdf
    Updated May 31, 2023
    Cite
    Mohamad Ivan Fanany (2023). Enriching time series datasets using Nonparametric kernel regression to improve forecasting accuracy [Dataset]. http://doi.org/10.6084/m9.figshare.1609661.v1
    Available download formats: pdf
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Mohamad Ivan Fanany
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Improving the accuracy of predicting future values from past and current observations has been pursued by enhancing prediction methods, combining those methods, or performing data pre-processing. In this paper, another approach is taken: increasing the number of inputs in the dataset. This approach is useful especially for shorter time series. By filling in the in-between values of the time series, the number of training samples can be increased, thus increasing the generalization capability of the predictor. A Neural Network is used to make predictions, as it is widely used in the literature for time series tasks. For comparison, Support Vector Regression is also employed. The dataset used in the experiment is the frequency of USPTO patents and PubMed scientific publications in the field of health, namely on Apnea, Arrhythmia, and Sleep Stages. Another time series dataset, designated for the NN3 Competition in the field of transportation, is also used for benchmarking. The experimental results show that prediction performance can be significantly increased by filling in-between data in the time series. Furthermore, the use of detrending and deseasonalization, which separates the data into trend, seasonal, and stationary components, also improves prediction performance on both the original and filled datasets. In this experiment, the optimal enlargement of the dataset is about five times the length of the original dataset.
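
    The in-between filling idea can be sketched with a Nadaraya-Watson estimator, a standard form of nonparametric kernel regression (the bandwidth and data below are illustrative choices, not the paper's):

```python
import numpy as np

def nw_regress(t_query, t_obs, y_obs, bandwidth=0.5):
    """Nadaraya-Watson kernel estimate of y at t_query from (t_obs, y_obs)."""
    w = np.exp(-0.5 * ((t_query[:, None] - t_obs[None, :]) / bandwidth) ** 2)
    return (w * y_obs).sum(axis=1) / w.sum(axis=1)

t = np.arange(0.0, 10.0)             # short original series: 10 points
y = np.sin(t)
t_dense = np.arange(0.0, 9.01, 0.5)  # with in-between points: 19 points
y_dense = nw_regress(t_dense, t, y)  # enlarged training series
```

    Because the estimate is a weighted average of the observed values, the filled series stays within the range of the original data.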

  14. Digital game addiction data version 2

    • data.mendeley.com
    • kaggle.com
    Updated Aug 6, 2024
    Cite
    Esra Kahya-Ozyirmidokuz (2024). Digital game addiction data version 2 [Dataset]. http://doi.org/10.17632/7z75yjs8zg.1
    Dataset updated
    Aug 6, 2024
    Authors
    Esra Kahya-Ozyirmidokuz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We have revised the dataset, ensuring that it is thoroughly processed and ready for analysis. The attached second dataset has undergone comprehensive preprocessing, including data cleaning, normalization, and feature extraction to enhance the quality and usability of the data. These steps are crucial to ensure that the dataset is free from inconsistencies, missing values, and irrelevant information, thereby improving the accuracy and reliability of subsequent machine learning models.

  15. Preprocessed Dell Tweets

    • kaggle.com
    Updated Oct 31, 2023
    Cite
    Md. Foriduzzaman Zihad (2023). Preprocessed Dell Tweets [Dataset]. https://www.kaggle.com/datasets/mdforiduzzamanzihad/preprocessed-dell-tweets
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 31, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Md. Foriduzzaman Zihad
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    The "Preprocessed Dell Tweets" dataset is a collection of tweets about Dell, the international computer technology company. These tweets have undergone meticulous preprocessing to facilitate sentiment analysis and natural language processing tasks. The dataset originally consisted of tweets scraped from Twitter; preprocessing was then applied to clean and organize the data.

    Important characteristics:

    Text: The preprocessed text of the tweets is contained in this column, making it appropriate for natural language processing and text analysis.

    Sentiment: To facilitate machine learning tasks, the sentiment column has been converted into numeric values. Sentiment labels are encoded as follows: 0 = Neutral, 1 = Positive, 2 = Negative.

    Possible Applications: The dataset is well suited to machine learning, sentiment analysis, and sentiment classification tasks. Researchers and data scientists can use it to test and refine sentiment analysis methods, and the numeric sentiment labels allow predictive sentiment models to be trained efficiently.

    Data Preprocessing: To prepare the dataset for analysis, the following preprocessing steps were applied:

    • Punctuation and special characters were removed from the text.
    • URLs and hyperlinks were stripped from the text.
    • Text was converted to lowercase for uniformity.
    • Stopwords (common words with limited analytical value) were removed.
    • Tokenization, stemming, and lemmatization were performed to normalize the text data.
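    The listed steps can be sketched as follows (a minimal illustration, not the dataset's exact pipeline; the stopword list is a small illustrative subset, and stemming/lemmatization, which would normally use a library such as NLTK, is omitted):

    ```python
    import re

    # Small illustrative stopword subset (real pipelines use a fuller list).
    STOPWORDS = {"the", "a", "an", "is", "to", "and", "of"}

    def clean_tweet(text):
        text = re.sub(r"https?://\S+", "", text)  # strip URLs and hyperlinks
        text = re.sub(r"[^\w\s]", "", text)       # remove punctuation/special characters
        text = text.lower()                       # lowercase for uniformity
        return [t for t in text.split() if t not in STOPWORDS]

    # Encoding used by the dataset's sentiment column.
    SENTIMENT = {"Neutral": 0, "Positive": 1, "Negative": 2}

    tokens = clean_tweet("Check the new Dell XPS!!! https://dell.example/xps")
    # → ['check', 'new', 'dell', 'xps']
    ```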
  16. Generating_Video_Data

    • huggingface.co
    Updated May 23, 2025
    Cite
    Hanzhe (2025). Generating_Video_Data [Dataset]. https://huggingface.co/datasets/HanzheL/Generating_Video_Data
    Explore at:
    Dataset updated
    May 23, 2025
    Authors
    Hanzhe
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Generating_Video_Data

    Hanzhe Liang modified: 23 May

    😊 This library contains some common data for training video generation models. It is a collection of summary datasets; thanks to the original authors for their contributions. Whichever data you need, start with the following steps:

    mkdir data
    cd data

    Next, go to my cloud database link and download the data you want into the folder data/. Once you've downloaded the data you need, we've found preprocessing steps from its… See the full description on the dataset page: https://huggingface.co/datasets/HanzheL/Generating_Video_Data.

  17. chicago-crime-description

    • huggingface.co
    Updated Oct 21, 2024
    Cite
    TPP-LLM (2024). chicago-crime-description [Dataset]. https://huggingface.co/datasets/tppllm/chicago-crime-description
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Oct 21, 2024
    Dataset authored and provided by
    TPP-LLM
    License

    https://choosealicense.com/licenses/other/

    Area covered
    Chicago
    Description

    Chicago Crime Description Dataset

    This dataset contains reported crime incidents in Chicago from January 1, 2022, to December 31, 2023. It includes 4,033 sequences with 202,333 events across 20 crime types. The data is sourced from the Chicago Data Portal under the Terms of Use. The detailed data preprocessing steps used to create this dataset can be found in the TPP-LLM paper and TPP-Embedding paper. If you find this dataset useful, we kindly invite you to cite the following… See the full description on the dataset page: https://huggingface.co/datasets/tppllm/chicago-crime-description.

  18. Processed test data from ERX1492351 using nexomis/rna-preprocessing@1.1.2

    • zenodo.org
    application/gzip, bin +2
    Updated Sep 5, 2025
    Cite
    Julien Fouret; Julien Fouret (2025). Processed test data from ERX1492351 using nexomis/rna-preprocessing@1.1.2 [Dataset]. http://doi.org/10.5281/zenodo.17061674
    Explore at:
    Available download formats: application/gzip, csv, sh, bin
    Dataset updated
    Sep 5, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Julien Fouret; Julien Fouret
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a pre-processed dataset intended to streamline testing of an expression analysis R package.

    The preprocessing steps are self-contained in a Nextflow command line with parameters (see run.sh).

  19. County Social Determinants of Health Data Pre-Processed to Facilitate...

    • openicpsr.org
    Updated Apr 23, 2025
    Cite
    William Crown; Rachel Adams; Mary Jo Larson (2025). County Social Determinants of Health Data Pre-Processed to Facilitate Machine Learning/Multivariate Analysis [Dataset]. http://doi.org/10.3886/E227481V2
    Explore at:
    Dataset updated
    Apr 23, 2025
    Dataset provided by
    Brandeis University
    Boston University
    Authors
    William Crown; Rachel Adams; Mary Jo Larson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2018 - Dec 31, 2019
    Area covered
    US Counties
    Description

    These datasets contain data from the AHRQ Social Determinants of Health (SDOH) Database (https://www.ahrq.gov/sdoh/data-analytics/sdoh-data.html), processed to facilitate machine learning/multivariate analyses focusing on the healthcare context of counties. The datasets derive from the AHRQ 2019 and 2018 county-level SDOH files. Three sets of files are provided. The first "Raw" set has the source SDOH data with a few core pre-processing steps applied. The second, “Full” set has variables characterizing the health and healthcare context of counties (rather than outcomes), with further processing steps applied to facilitate multivariate and machine learning analytics (e.g. handling of missing data, normalizing, standardizing). The third set, labeled “Reduced”, incorporates those same data processing steps but in addition has had a further data reduction step applied in which groups of highly intercorrelated variables were removed and replaced with corresponding principal component scores, one for each group. These files would be useful for investigators interested in characterizing and comparing the broad SDOH context of US counties.
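    The final reduction step, replacing a group of highly intercorrelated variables with a principal component score, can be sketched as follows (a toy illustration with simulated data, not the authors' code):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    # Simulate one group of three highly intercorrelated county-level variables.
    base = rng.normal(size=(200, 1))
    group = base + 0.05 * rng.normal(size=(200, 3))

    # First principal component via SVD of the centered group; its score
    # column would replace the three original variables in a "Reduced" file.
    centered = group - group.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    pc1_score = centered @ vt[0]

    explained = s[0] ** 2 / (s ** 2).sum()
    print(f"variance explained by PC1: {explained:.3f}")  # near 1.0 for a tight group
    ```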

  20. Dollar street 10 - 64x64x3

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    bin
    Updated May 6, 2025
    Cite
    Sven van der burg; Sven van der burg (2025). Dollar street 10 - 64x64x3 [Dataset]. http://doi.org/10.5281/zenodo.10970014
    Explore at:
    Available download formats: bin
    Dataset updated
    May 6, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Sven van der burg; Sven van der burg
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The MLCommons Dollar Street Dataset is a collection of images of everyday household items from homes around the world that visually captures socioeconomic diversity of traditionally underrepresented populations. It consists of public domain data, licensed for academic, commercial and non-commercial usage, under CC-BY and CC-BY-SA 4.0. The dataset was developed because similar datasets lack socioeconomic metadata and are not representative of global diversity.

    This is a subset of the original dataset that can be used for multiclass classification with 10 categories. It is designed to be used in teaching, similar to the widely used, but unlicensed CIFAR-10 dataset.

    These are the preprocessing steps that were performed:

    1. Only take examples with one imagenet_synonym label
    2. Use only examples with the 10 most frequently occurring labels
    3. Downscale images to 64 x 64 pixels
    4. Split data in train and test
    5. Store as numpy array
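    Those steps can be sketched roughly as follows (toy stand-ins for the real images and labels; the actual notebook linked further down may use different source shapes and a proper resampling filter):

    ```python
    import numpy as np
    from collections import Counter

    rng = np.random.default_rng(42)
    # Toy stand-ins for the real images (assumed here to start at 128x128x3).
    images = rng.integers(0, 256, size=(100, 128, 128, 3), dtype=np.uint8)
    labels = rng.integers(0, 15, size=100)

    # Step 2: keep only examples with the 10 most frequent labels.
    top10 = {lab for lab, _ in Counter(labels.tolist()).most_common(10)}
    mask = np.array([lab in top10 for lab in labels])
    images, labels = images[mask], labels[mask]

    # Step 3: downscale to 64x64 by 2x2 block averaging.
    small = images.reshape(-1, 64, 2, 64, 2, 3).mean(axis=(2, 4)).astype(np.uint8)

    # Step 4: split into train and test sets (80/20 here).
    n_train = int(0.8 * len(small))
    x_train, x_test = small[:n_train], small[n_train:]
    y_train, y_test = labels[:n_train], labels[n_train:]

    # Step 5: store as numpy arrays.
    np.savez("dollar_street_10_toy.npz", x_train=x_train, y_train=y_train,
             x_test=x_test, y_test=y_test)
    ```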

    This is the label mapping:

    Category          Label
    day bed           0
    dishrag           1
    plate             2
    running shoe      3
    soap dispenser    4
    street sign       5
    table lamp        6
    tile roof         7
    toilet seat       8
    washing machine   9

    Check out this notebook to see how the subset was created: https://github.com/carpentries-lab/deep-learning-intro/blob/main/instructors/prepare-dollar-street-data.ipynb

    The original dataset was downloaded from https://www.kaggle.com/datasets/mlcommons/the-dollar-street-dataset. See https://mlcommons.org/datasets/dollar-street/ for more information.
