100+ datasets found
  1. cleaned-quora-dataset-train-test-split

    • huggingface.co
    Updated Feb 7, 2024
    Cite
    fivesixseven (2024). cleaned-quora-dataset-train-test-split [Dataset]. https://huggingface.co/datasets/567-labs/cleaned-quora-dataset-train-test-split
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 7, 2024
    Dataset authored and provided by
    fivesixseven
    Description

    This is a cleaned version of the Quora dataset that's been configured with a train-test-val split.

    • Train: for training models
    • Test: for running experiments and comparing different OSS and closed-source models
    • Val: only to be used at the end!

    Colab Notebook to reproduce : https://colab.research.google.com/drive/1dGjGiqwPV1M7JOLfcPEsSh3SC37urItS?usp=sharing
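
    A minimal loading sketch in Python, assuming the Hugging Face datasets library; the split names below follow the train/test/val description above and may differ from the actual split keys:

    ```python
    from datasets import load_dataset

    # Download all splits of the dataset from the Hugging Face Hub.
    ds = load_dataset("567-labs/cleaned-quora-dataset-train-test-split")

    # Inspect the available split names, row counts, and one example.
    print(ds)
    print(ds["train"][0])  # "train" is assumed from the description above
    ```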

  2. Caltech-256: Pre-Processed 80/20 Train-Test Split

    • kaggle.com
    zip
    Updated Nov 12, 2025
    Cite
    KUSHAGRA MATHUR (2025). Caltech-256: Pre-Processed 80/20 Train-Test Split [Dataset]. https://www.kaggle.com/datasets/kushubhai/caltech-256-train-test
    Explore at:
    Available download formats: zip (1138799273 bytes)
    Dataset updated
    Nov 12, 2025
    Authors
    KUSHAGRA MATHUR
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Context: The Caltech-256 dataset is a foundational benchmark for object recognition, containing 30,607 images across 257 categories (256 object categories + 1 clutter category).

    The original dataset is typically provided as a collection of directories, one for each category. This version streamlines the machine learning workflow by providing:

    A clean, pre-defined 80/20 train-test split.

    Manifest files (train.csv, test.csv) that map image paths directly to their labels, allowing for easy use with data generators in frameworks like PyTorch and TensorFlow.

    A flat directory structure (train/, test/) for simplified file access.

    File Content: The dataset is organized into a single top-level folder and two CSV files:

    train.csv: A CSV file containing two columns: image_path and label. This file lists all images designated for the training set.

    test.csv: A CSV file with the same structure as train.csv, listing all images designated for the testing set.

    Caltech-256_Train_Test/: The primary data folder.

    train/: This directory contains 80% of the images from all 257 categories, intended for model training.

    test/: This directory contains the remaining 20% of the images from all categories, reserved for model evaluation.
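
    The manifest files can be consumed directly by framework data loaders. A hedged PyTorch sketch, assuming the image_path and label columns named above (a label-to-index mapping is omitted):

    ```python
    import pandas as pd
    from PIL import Image
    from torch.utils.data import Dataset

    class Caltech256Manifest(Dataset):
        """Dataset backed by a manifest CSV (train.csv or test.csv)."""

        def __init__(self, csv_path, transform=None):
            # Two columns per the description above: image_path and label.
            self.df = pd.read_csv(csv_path)
            self.transform = transform

        def __len__(self):
            return len(self.df)

        def __getitem__(self, idx):
            row = self.df.iloc[idx]
            image = Image.open(row["image_path"]).convert("RGB")
            if self.transform is not None:
                image = self.transform(image)
            return image, row["label"]

    # Usage sketch; a label-to-integer mapping would be added for training.
    # train_ds = Caltech256Manifest("train.csv")
    ```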

    Data Split: The dataset has been partitioned into a standard 80% training / 20% testing split. The split is (or should be assumed to be) stratified, meaning that each of the 257 object categories is represented in roughly an 80/20 proportion in the respective sets.

    Acknowledgements & Original Source: This dataset is a derivative work created for convenience. The original data and images belong to the authors of the Caltech-256 dataset.

    Original Dataset Link: https://www.kaggle.com/datasets/jessicali9530/caltech256/data

    Citation: Griffin, G., Holub, A.D., & Perona, P. (2007). Caltech-256 Object Category Dataset. California Institute of Technology.

  3. Machine learning algorithm validation with a limited sample size

    • plos.figshare.com
    text/x-python
    Updated May 30, 2023
    Cite
    Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson (2023). Machine learning algorithm validation with a limited sample size [Dataset]. http://doi.org/10.1371/journal.pone.0224365
    Explore at:
    Available download formats: text/x-python
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Advances in neuroimaging, genomic, motion tracking, eye-tracking and many other technology-based data collection methods have led to a torrent of high-dimensional datasets, which commonly have a small number of samples because of the intrinsically high cost of data collection involving human participants. High-dimensional data with a small number of samples is of critical importance for identifying biomarkers and conducting feasibility and pilot work; however, it can lead to biased machine learning (ML) performance estimates. Our review of studies which have applied ML to predict autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. Thus, we investigated whether this bias could be caused by the use of validation methods which do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with a sample size of 1000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection, if performed on pooled training and testing data, contributes considerably more to bias than parameter tuning. In addition, the contributions to bias of data dimensionality, hyper-parameter space and number of CV folds were explored, and validation methods were compared with discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies based on what validation method was used.
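
    To illustrate the reported bias mechanism (a hedged sketch, not the authors' simulations): selecting features on the pooled data inflates K-fold CV accuracy even when labels are pure noise, while refitting the selector inside each training fold does not:

    ```python
    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 1000))    # high-dimensional, small sample
    y = rng.integers(0, 2, size=40)    # labels are pure noise

    # Biased protocol: select features on the pooled data, then run CV.
    X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
    biased = cross_val_score(LogisticRegression(max_iter=1000), X_sel, y, cv=5)

    # Sound protocol: refit the selector inside every training fold.
    pipe = make_pipeline(SelectKBest(f_classif, k=20),
                         LogisticRegression(max_iter=1000))
    sound = cross_val_score(pipe, X, y, cv=5)

    # The pooled-selection accuracy is typically well above chance (0.5);
    # the in-fold accuracy stays near chance, as it should for noise labels.
    print(biased.mean(), sound.mean())
    ```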

  4. Web Data Commons Training and Test Sets for Large-Scale Product Matching -...

    • demo-b2find.dkrz.de
    Updated Nov 27, 2020
    + more versions
    Cite
    (2020). Web Data Commons Training and Test Sets for Large-Scale Product Matching - Version 2.0 Product Matching Task derived from the WDC Product Data Corpus - Version 2.0 - Dataset - B2FIND [Dataset]. http://demo-b2find.dkrz.de/dataset/8f288eb3-f541-5fca-a337-d519f903668f
    Explore at:
    Dataset updated
    Nov 27, 2020
    Description

    Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label "match" or "no match") for four product categories: computers, cameras, watches, and shoes. In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000-70,000 pairs). Furthermore, sets of IDs for a possible validation split (stratified random draw) are available for each training set. The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived using shared product identifiers from the Web as weak supervision. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites. For more information and download links for the corpus itself, please follow the links below.
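
    A hedged sketch of the stratified validation draw described above; the file and column names are placeholders rather than the actual WDC schema:

    ```python
    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Placeholder file/column names; the WDC release ships its own training
    # files plus ID sets defining the validation draw.
    pairs = pd.read_csv("train_pairs.csv")  # assumed columns: pair_id, label

    # Stratified random draw of a validation split, mirroring the approach
    # described above (proportional sampling of match / no-match labels).
    train_pairs, val_pairs = train_test_split(
        pairs, test_size=0.2, stratify=pairs["label"], random_state=42)
    print(len(train_pairs), len(val_pairs))
    ```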

  5. dataset-muenzen-training-test-split-02

    • kaggle.com
    zip
    Updated Dec 13, 2024
    + more versions
    Cite
    pascalammeter (2024). dataset-muenzen-training-test-split-02 [Dataset]. https://www.kaggle.com/datasets/pascalammeter/dataset-muenzen-training-test-split-02/discussion
    Explore at:
    Available download formats: zip (12574 bytes)
    Dataset updated
    Dec 13, 2024
    Authors
    pascalammeter
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Dataset

    This dataset was created by pascalammeter

    Released under CC BY-NC-SA 4.0


  6. Data split for each class of each dataset for training and test.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Nov 6, 2024
    Cite
    Niranjan, Mahesan; Fan, Keqiang; Cai, Xiaohao; Liu, Jiahui (2024). Data split for each class of each dataset for training and test. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001424294
    Explore at:
    Dataset updated
    Nov 6, 2024
    Authors
    Niranjan, Mahesan; Fan, Keqiang; Cai, Xiaohao; Liu, Jiahui
    Description

    Data split for each class of each dataset for training and test.

  7. Table of averaged results over 100 train-test splits with a ratio of 0.33.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Jul 11, 2019
    Cite
    Leibnitz, Kenji; Rattay, Frank; Golaszewski, Stefan Martin; Wutzl, Betty; Murata, Masayuki; Kronbichler, Martin (2019). Table of averaged results over 100 train-test splits with a ratio of 0.33. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000130899
    Explore at:
    Dataset updated
    Jul 11, 2019
    Authors
    Leibnitz, Kenji; Rattay, Frank; Golaszewski, Stefan Martin; Wutzl, Betty; Murata, Masayuki; Kronbichler, Martin
    Description

    The AUC of the precision-recall curve is shown for training and testing with the most important ROIs (AUC with feature selection) and with all ROIs (AUC without feature selection).
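
    A sketch of the protocol the table implies, on stand-in data: average the precision-recall AUC over 100 random train-test splits with a 0.33 test ratio:

    ```python
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import average_precision_score
    from sklearn.model_selection import ShuffleSplit

    X, y = make_classification(n_samples=200, random_state=0)  # stand-in data
    model = LogisticRegression(max_iter=1000)

    aucs = []
    splitter = ShuffleSplit(n_splits=100, test_size=0.33, random_state=0)
    for train_idx, test_idx in splitter.split(X):
        model.fit(X[train_idx], y[train_idx])
        scores = model.predict_proba(X[test_idx])[:, 1]
        # Average precision approximates the AUC of the precision-recall curve.
        aucs.append(average_precision_score(y[test_idx], scores))

    print(f"mean PR-AUC over 100 splits: {np.mean(aucs):.3f}")
    ```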

  8. Data from: Time-Split Cross-Validation as a Method for Estimating the...

    • acs.figshare.com
    txt
    Updated Jun 2, 2023
    Cite
    Robert P. Sheridan (2023). Time-Split Cross-Validation as a Method for Estimating the Goodness of Prospective Prediction. [Dataset]. http://doi.org/10.1021/ci400084k.s001
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    ACS Publications
    Authors
    Robert P. Sheridan
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Cross-validation is a common method to validate a QSAR model. In cross-validation, some compounds are held out as a test set, while the remaining compounds form a training set. A model is built from the training set, and the test set compounds are predicted on that model. The agreement of the predicted and observed activity values of the test set (measured by, say, R2) is an estimate of the self-consistency of the model and is sometimes taken as an indication of the predictivity of the model. This estimate of predictivity can be optimistic or pessimistic compared to true prospective prediction, depending on how compounds in the test set are selected. Here, we show that time-split selection gives an R2 that is more like that of true prospective prediction than the R2 from random selection (too optimistic) or from our analog of leave-class-out selection (too pessimistic). Time-split selection should be used in addition to random selection as a standard for cross-validation in QSAR model building.
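
    A minimal sketch contrasting the two selection schemes; the file and column names are assumptions for illustration:

    ```python
    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Hypothetical schema: one row per compound with a date and an activity.
    df = pd.read_csv("compounds.csv", parse_dates=["registration_date"])

    # Random split: test compounds are interleaved in time (tends optimistic).
    train_rand, test_rand = train_test_split(df, test_size=0.25, random_state=0)

    # Time split: train on the oldest 75%, predict the newest 25%,
    # mimicking true prospective prediction.
    df_sorted = df.sort_values("registration_date")
    cutoff = int(len(df_sorted) * 0.75)
    train_time = df_sorted.iloc[:cutoff]
    test_time = df_sorted.iloc[cutoff:]
    ```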

  9. Data from: Training dataset for NABat Machine Learning V1.0

    • catalog.data.gov
    • data.usgs.gov
    • +1 more
    Updated Nov 26, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Training dataset for NABat Machine Learning V1.0 [Dataset]. https://catalog.data.gov/dataset/training-dataset-for-nabat-machine-learning-v1-0
    Explore at:
    Dataset updated
    Nov 26, 2025
    Dataset provided by
    U.S. Geological Survey
    Description

    Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess the status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered in developing V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release.

    These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (i.e., those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files were submitted to the NABat database in Waveform Audio File format.

    From the available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.

  10. MATH-openai-split

    • huggingface.co
    Updated Jun 2, 2025
    Cite
    Hong Yi (2025). MATH-openai-split [Dataset]. https://huggingface.co/datasets/LuyiCui/MATH-openai-split
    Explore at:
    Dataset updated
    Jun 2, 2025
    Authors
    Hong Yi
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    MATH-openai-split

    In order to avoid the risk of over-fitting on the 7,500 MATH training problems, we expanded the training set to include 4,500 MATH test split problems. We therefore evaluate our models only on the remaining 500 held-out problems. We selected these 500 test problems uniformly at random, and we believe they are representative of the test set as a whole.

    Train split: 12k. Test split: 500. GitHub repository: openai/prm800k: 800,000 step-level correctness labels on LLM solutions to… See the full description on the dataset page: https://huggingface.co/datasets/LuyiCui/MATH-openai-split.

  11. Titanic Dataset - Machine Learning from Disaster

    • kaggle.com
    zip
    Updated Sep 20, 2022
    Cite
    Aman Chauhan (2022). Titanic Dataset - Machine Learning from Disaster [Dataset]. https://www.kaggle.com/datasets/whenamancodes/titanic-dataset-machine-learning-from-disaster
    Explore at:
    Available download formats: zip (34877 bytes)
    Dataset updated
    Sep 20, 2022
    Authors
    Aman Chauhan
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    Overview

    The data has been split into two groups:

    • training set (train.csv)
    • test set (test.csv)

    The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

    The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

    We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
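
    A minimal end-to-end sketch of the intended workflow: train on train.csv, predict test.csv, and write a submission shaped like gender_submission.csv. The feature choice is illustrative; column capitalization follows the Kaggle files:

    ```python
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    train = pd.read_csv("train.csv")
    test = pd.read_csv("test.csv")

    # Deliberately small feature set; get_dummies encodes the Sex column.
    features = ["Pclass", "Sex", "SibSp", "Parch"]
    X = pd.get_dummies(train[features])
    X_test = pd.get_dummies(test[features])

    model = RandomForestClassifier(n_estimators=100, random_state=1)
    model.fit(X, train["Survived"])

    # Write predictions in the same format as gender_submission.csv.
    pd.DataFrame({"PassengerId": test["PassengerId"],
                  "Survived": model.predict(X_test)}
                 ).to_csv("submission.csv", index=False)
    ```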

    Data Dictionary:

    | Variable | Definition | Key |
    | --- | --- | --- |
    | survival | Survival | 0 = No, 1 = Yes |
    | pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
    | sex | Sex | |
    | age | Age in years | |
    | sibsp | # of siblings / spouses aboard the Titanic | |
    | parch | # of parents / children aboard the Titanic | |
    | ticket | Ticket number | |
    | fare | Passenger fare | |
    | cabin | Cabin number | |
    | embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

    Variable Notes

    pclass: A proxy for socio-economic status (SES) 1st = Upper 2nd = Middle 3rd = Lower

    age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

    sibsp: The dataset defines family relations in this way... Sibling = brother, sister, stepbrother, stepsister Spouse = husband, wife (mistresses and fiancés were ignored)

    parch: The dataset defines family relations in this way... Parent = mother, father Child = daughter, son, stepdaughter, stepson Some children travelled only with a nanny, therefore parch=0 for them.


  12. nemotron-post-training-samples-splits

    • huggingface.co
    Updated Sep 18, 2025
    Cite
    Brandon Tong (2025). nemotron-post-training-samples-splits [Dataset]. https://huggingface.co/datasets/brandolorian/nemotron-post-training-samples-splits
    Explore at:
    Dataset updated
    Sep 18, 2025
    Authors
    Brandon Tong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Nemotron Post-Training Samples with Train/Val/Test Splits

    This dataset contains structured train/validation/test splits from the nvidia/Llama-Nemotron-Post-Training-Dataset, with both tagged and untagged versions for different training scenarios.

    Attribution

    This work is derived from the Llama-Nemotron-Post-Training-Dataset-v1.1 by NVIDIA Corporation, licensed under CC BY 4.0. Original Dataset: nvidia/Llama-Nemotron-Post-Training-Dataset. Original Authors: NVIDIA… See the full description on the dataset page: https://huggingface.co/datasets/brandolorian/nemotron-post-training-samples-splits.

  13. Downsized camera trap images for automated classification

    • data.niaid.nih.gov
    Updated Dec 1, 2022
    Cite
    Norman, Danielle L; Wearne, Oliver R; Chapman, Philip M; Heon, Sui P; Ewers, Robert M (2022). Downsized camera trap images for automated classification [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6627706
    Explore at:
    Dataset updated
    Dec 1, 2022
    Dataset provided by
    Imperial College London
    Authors
    Norman, Danielle L; Wearne, Oliver R; Chapman, Philip M; Heon, Sui P; Ewers, Robert M
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description: Downsized (256x256) camera trap images used for the analyses in "Can CNN-based species classification generalise across variation in habitat within a camera trap survey?", and the dataset composition for each analysis. Note that images tagged as 'human' have been removed from this dataset. Full-size images for the BorneoCam dataset will be made available at LILA.science. The full SAFE camera trap dataset metadata is available at DOI: 10.5281/zenodo.6627707.

    Project: This dataset was collected as part of the following SAFE research project: Machine learning and image recognition to monitor spatio-temporal changes in the behaviour and dynamics of species interactions.

    Funding: These data were collected as part of research funded by:

    NERC (NERC QMEE CDT Studentship, NE/P012345/1, http://gotw.nerc.ac.uk/list_full.asp?pcode=NE%2FP012345%2F1&cookieConsent=A) This dataset is released under the CC-BY 4.0 licence, requiring that you cite the dataset in any outputs, but has the additional condition that you acknowledge the contribution of these funders in any outputs.

    XML metadata: GEMINI-compliant metadata for this dataset is available here.

    Files: This dataset consists of 3 files: CT_image_data_info2.xlsx, DN_256x256_image_files.zip, DN_generalisability_code.zip

    CT_image_data_info2.xlsx: This file contains dataset metadata and one data table:

    Dataset Images (described in worksheet Dataset_images). Description: This worksheet details the composition of each dataset used in the analyses. Number of fields: 69. Number of data rows: 270287. Fields:

    • filename: Root ID (Field type: id)
    • camera_trap_site: Site ID for the camera trap location (Field type: location)
    • taxon: Taxon recorded by camera trap (Field type: taxa)
    • dist_level: Level of disturbance at site (Field type: ordered categorical)
    • baseline: Whether the image is included in the baseline training, validation (val) or test set, or not included (NA) (Field type: categorical)
    • increased_cap: Whether the image is included in the 'increased cap' training, validation (val) or test set, or not included (NA) (Field type: categorical)
    • dist_individ_event_level: Whether the image is included in the 'individual disturbance level datasets split at event level' training, validation (val) or test set, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_1, _2, _3, _4, _5: Whether the image is included in the 'disturbance level combination analysis split at event level' training or test set for the given single disturbance level (1-5), or not included (NA) (Field type: categorical)
    • dist_combined_event_level_pair_1_2, _1_3, _1_4, _1_5, _2_3, _2_4, _2_5, _3_4, _3_5, _4_5: Whether the image is included in the training set for the corresponding pair of disturbance levels, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_triple_1_2_3, _1_2_4, _1_2_5, _1_3_4, _1_3_5, _1_4_5, _2_3_4, _2_3_5, _2_4_5, _3_4_5: Whether the image is included in the training set for the corresponding triple of disturbance levels, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_quad_1_2_3_4, _1_2_3_5, _1_2_4_5, _1_3_4_5, _2_3_4_5: Whether the image is included in the training set for the corresponding quad of disturbance levels, or not included (NA) (Field type: categorical)
    • dist_combined_event_level_all_1_2_3_4_5: Whether the image is included in the 'disturbance levels 1, 2, 3, 4 and 5 (all)' training set, or not included (NA) (Field type: categorical)
    • dist_camera_level_individ_1: Whether the image is included in the 'disturbance level combination analysis split at camera level: disturbance

  14. DRIVE Train/Validation Split Dataset

    • kaggle.com
    Updated Feb 19, 2023
    Cite
    Sovit Ranjan Rath (2023). DRIVE Train/Validation Split Dataset [Dataset]. https://www.kaggle.com/datasets/sovitrath/drive-trainvalidation-split-dataset/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 19, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sovit Ranjan Rath
    Description

    This dataset contains images and masks for Retinal Vessel Extraction (Segmentation). It contains a training and validation split to easily train semantic segmentation models.

    The original dataset can be found here => https://www.kaggle.com/datasets/andrewmvd/drive-digital-retinal-images-for-vessel-extraction

    This dataset also has an accompanying blog post => Retinal Vessel Segmentation using PyTorch Semantic Segmentation

    Split sample numbers:
    • Training images and masks: 16
    • Validation images and masks: 4
    • Test images: 20

  15. Runtime of implementations on Pfam seed and full.

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Samantha Petti; Sean R. Eddy (2023). Runtime of implementations on Pfam seed and full. [Dataset]. http://doi.org/10.1371/journal.pcbi.1009492.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Samantha Petti; Sean R. Eddy
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The runtime benchmarks were obtained by running each algorithm on the seed and full multi-MSAs Pfam-A.seed and Pfam-A.full on 2 cores with 8 GB RAM for the seed alignments and on 3 cores with 12 GB RAM for the full alignments. We did not compute the maximum runtime of the Blue algorithm; the algorithm failed to terminate within 6 days for 34 families.

  16. Summary of the training and testing data.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Jul 21, 2022
    Cite
    Meysman, Pieter; Laukens, Kris; Bui-Thi, Danh; Rivière, Emmanuel (2022). Summary of the training and testing data. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000250189
    Explore at:
    Dataset updated
    Jul 21, 2022
    Authors
    Meysman, Pieter; Laukens, Kris; Bui-Thi, Danh; Rivière, Emmanuel
    Description

    For the small datasets (human and C. elegans), we evaluate the models' performance using k-fold cross-validation with k = 5. For the other datasets, we split them into three sets: training, validation, and testing.

  17. FAIR Dataset for Disease Prediction in Healthcare Applications

    • test.researchdata.tuwien.ac.at
    bin, csv, json, png
    Updated Apr 14, 2025
    Cite
    Sufyan Yousaf (2025). FAIR Dataset for Disease Prediction in Healthcare Applications [Dataset]. http://doi.org/10.70124/5n77a-dnf02
    Explore at:
    Available download formats: csv, json, bin, png
    Dataset updated
    Apr 14, 2025
    Dataset provided by
    TU Wien
    Authors
    Sufyan Yousaf
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    Context and Methodology

    • Research Domain/Project:
This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validating, and testing such models.

    • Purpose of the Dataset:
      The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.

    • Dataset Creation:
      Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).

    Technical Details

    • Structure of the Dataset:
      The dataset consists of several files organized into folders by data type:

      • Training Data: Contains the training dataset used to train the machine learning model.

      • Validation Data: Used for hyperparameter tuning and model selection.

      • Test Data: Reserved for final model evaluation.

      Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.

    • Software Requirements:
      To open and work with this dataset, you need an environment such as VS Code or Jupyter Notebook, together with tools like:

      • Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)
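
      A minimal loading sketch, assuming the folder and file naming convention described above and that the label sits in the last column:

      ```python
      import pandas as pd

      # Folder and file names assumed from the naming convention above.
      train_df = pd.read_csv("Training Data/train_data.csv")
      val_df = pd.read_csv("Validation Data/validation_data.csv")
      test_df = pd.read_csv("Test Data/test_data.csv")

      # Assumption: the label is the last column; the rest are features.
      X_train, y_train = train_df.iloc[:, :-1], train_df.iloc[:, -1]
      print(train_df.shape, val_df.shape, test_df.shape)
      ```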

    Further Details

    • Reusability:
      Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.

    • Limitations:
      The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.

  18. Fake News Detection

    • kaggle.com
    zip
    Updated Nov 4, 2025
    Cite
    KranNaik777 (2025). Fake News Detection [Dataset]. https://www.kaggle.com/datasets/krannaik777/train-news
    Explore at:
    Available download formats: zip (38846301 bytes)
    Dataset updated
    Nov 4, 2025
    Authors
    KranNaik777
    License

    Database Contents License (DbCL) v1.0 (http://opendatacommons.org/licenses/dbcl/1.0/)

    Description

    The fake news detection dataset used in this project contains labeled news articles categorized as either "fake" or "real." These articles have been collected from credible real-world sources and fact-checking websites, ensuring diverse and high-quality data. The dataset includes textual features such as the news content, along with metadata like publication date, author, and source details. On average, articles vary in length, providing a rich linguistic variety for model training. The dataset is balanced to minimize bias between fake and real news categories, supporting robust classification. It often contains thousands to hundreds of thousands of articles, enabling effective machine learning model development and evaluation. Additionally, some versions of the dataset may also include image URLs for multimodal analysis, expanding the detection capability beyond text alone. This comprehensive dataset plays a critical role in training and validating the fake news detection model used in this project.

    Here is a description for each column header of the fake news dataset:

    id: A unique identifier assigned to each news article in the dataset for easy reference and indexing.

    headline: The title or headline of the news article, summarizing the key news story in brief.

    written by: The author or journalist who wrote the news article; this may sometimes be missing or anonymized.

    news: The full text content of the news article, which is the main body used for analysis and classification.

    label: The classification label indicating the authenticity of the news article, typically a binary value such as "fake" or "real" (or 0 for real and 1 for fake), indicating whether the news is deceptive or truthful.

    This detailed column description provides clarity on the structure and contents of the dataset used for fake news detection modeling.
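
    A hedged baseline sketch using the columns described above; the file name and the TF-IDF + logistic regression model are assumptions, not part of the dataset:

    ```python
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    # Column names follow the description above; the file name is assumed.
    df = pd.read_csv("train.csv").dropna(subset=["news", "label"])

    X_train, X_test, y_train, y_test = train_test_split(
        df["news"], df["label"], test_size=0.2,
        stratify=df["label"], random_state=0)

    # TF-IDF features with a linear classifier as a simple text baseline.
    clf = make_pipeline(TfidfVectorizer(max_features=50_000),
                        LogisticRegression(max_iter=1000))
    clf.fit(X_train, y_train)
    print(f"held-out accuracy: {clf.score(X_test, y_test):.3f}")
    ```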

  19. Prediction of Personality Traits using the Big 5 Framework

    • zenodo.org
    csv, text/x-python
    Updated Feb 2, 2023
    Cite
    Neelima Brahmbhatt (2023). Prediction of Personality Traits using the Big 5 Framework [Dataset]. http://doi.org/10.5281/zenodo.7596072
    Explore at:
    Available download formats: text/x-python, csv
    Dataset updated
    Feb 2, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Neelima Brahmbhatt
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The methodology is the core component of any research work: it describes the methods used to obtain the results. Here, the entire implementation is done in Python. The work involves the following steps:

    1. Acquire Personality Dataset

    Kaggle hosts a collection of datasets and data generators used by the machine learning community for analysis. The personality prediction dataset was acquired from the Kaggle website. It was collected (2016-2018) through an interactive online personality test constructed from the IPIP. The dataset can be downloaded as a zip file by clicking the link provided. It consists of two CSV files (test.csv & train.csv). The test.csv file has 0 missing values, 7 attributes, and a final label output; the dataset also has multivariate characteristics. Data preprocessing is then performed to check for inconsistent behaviors or trends.

    2. Data preprocessing

    After data acquisition, the next step is to clean and preprocess the data. The dataset's features are numerical. The target value is a five-level personality label consisting of serious, lively, responsible, dependable, and extraverted. The preprocessed dataset is then split into training and testing sets by passing the feature values, target values, and test size to the train_test_split method of the scikit-learn package. After splitting, the training data is used to fit the Logistic Regression and SVM models, and the test data is used to estimate the accuracy of the trained models.

    3. Feature Extraction

    The following items were presented on one page, and each was rated on a five-point scale using radio buttons. The order on the page was EXT1, AGR1, CSN1, EST1, OPN1, EXT2, etc. The scale was labeled 1=Disagree, 3=Neutral, 5=Agree.

            EXT1 I am the life of the party.
            EXT2  I don't talk a lot.
            EXT3  I feel comfortable around people.
            EXT4  I am quiet around strangers.
            EST1  I get stressed out easily.
            EST2  I get irritated easily.
            EST3  I worry about things.
            EST4  I change my mood a lot.
            AGR1  I have a soft heart.
            AGR2  I am interested in people.
            AGR3  I insult people.
            AGR4  I am not really interested in others.
            CSN1  I am always prepared.
            CSN2  I leave my belongings around.
            CSN3  I follow a schedule.
            CSN4  I make a mess of things.
            OPN1  I have a rich vocabulary.
            OPN2  I have difficulty understanding abstract ideas.
            OPN3  I do not have a good imagination.
            OPN4  I use difficult words.

    4. Training the Model

    Train/Test is a method to measure the accuracy of your model. It is called Train/Test because you split the data set into two sets, a training set and a testing set: 80% for training and 20% for testing. You train the model using the training set. In this model we trained on our dataset using linear_model.LogisticRegression() and svm.SVC() from the sklearn package.

    5. Personality Prediction Output

    After training, the Logistic Regression and SVM models are tested and evaluated using Cohen's kappa score and accuracy score.
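
    A sketch combining steps 2, 4, and 5; the label column name "Personality" is a placeholder, not confirmed by the dataset:

    ```python
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, cohen_kappa_score
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    df = pd.read_csv("train.csv")
    X = df.drop(columns=["Personality"])  # label column name is assumed
    y = df["Personality"]

    # 80/20 train-test split, as described in step 4.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    for model in (LogisticRegression(max_iter=1000), SVC()):
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        print(type(model).__name__,
              "accuracy:", round(accuracy_score(y_test, pred), 3),
              "kappa:", round(cohen_kappa_score(y_test, pred), 3))
    ```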

  20. EternaBrain CNN accuracies on eternamoves-select with different splits of...

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    • +1 more
    Updated Jun 27, 2019
    Cite
    Portela, Fernando; Koodli, Rohan V.; Das, Rhiju; Keep, Benjamin; Coppess, Katherine R. (2019). EternaBrain CNN accuracies on eternamoves-select with different splits of training and test sets. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000163915
    Explore at:
    Dataset updated
    Jun 27, 2019
    Authors
    Portela, Fernando; Koodli, Rohan V.; Das, Rhiju; Keep, Benjamin; Coppess, Katherine R.
    Description

    EternaBrain CNN accuracies on eternamoves-select with different splits of training and test sets.
