100+ datasets found
  1. alpaca-train-validation-test-split

    • huggingface.co
    Updated Aug 12, 2023
    Cite
    Doula Isham Rashik Hasan (2023). alpaca-train-validation-test-split [Dataset]. https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 12, 2023
    Authors
    Doula Isham Rashik Hasan
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Dataset Card for Alpaca

    I have only performed a train, validation, and test split on the original dataset. A repository to reproduce this will be shared here soon. The original dataset card is included below.

      Dataset Summary
    

    Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to instruction-tune language models so that they follow instructions better.… See the full description on the dataset page: https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split.
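
The listing does not say how the 52,000 records were divided, so the following is only a minimal sketch of one way to make a train/validation/test split. The 80/10/10 ratio, the seed, and the function name `three_way_split` are assumptions for illustration and are not taken from the dataset card.

```python
import random


def three_way_split(items, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle a copy of `items` and cut it into train/validation/test lists.

    The fractions and seed are illustrative assumptions, not the dataset's.
    """
    if val_frac < 0 or test_frac < 0 or val_frac + test_frac >= 1:
        raise ValueError("val_frac and test_frac must be >= 0 and sum to < 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_val = round(len(shuffled) * val_frac)
    n_test = round(len(shuffled) * test_frac)
    n_train = len(shuffled) - n_val - n_test
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


# Stand-in for the 52,000 instruction records (integers instead of real rows).
records = list(range(52_000))
train, val, test = three_way_split(records)
print(len(train), len(val), len(test))
```

Fixing the seed keeps the three subsets stable across runs, which matters when results on the test subset are compared over time.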

  2. Train Test Split For Freiburg In Yolov7 Format Dataset

    • universe.roboflow.com
    zip
    Updated Aug 4, 2023
    Cite
    Isaac H (2023). Train Test Split For Freiburg In Yolov7 Format Dataset [Dataset]. https://universe.roboflow.com/isaac-h/train-test-split-for-freiburg-dataset-in-yolov7-format
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 4, 2023
    Dataset authored and provided by
    Isaac H
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Freiburg im Breisgau
    Variables measured
    Groceries Bounding Boxes
    Description

    Train Test Split For Freiburg Dataset In YOLOv7 Format

    ## Overview
    
    Train Test Split For Freiburg Dataset In YOLOv7 Format is a dataset for object detection tasks - it contains Groceries annotations for 8,879 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  3. arc-agi-prompts-train-test-split

    • huggingface.co
    Updated Jun 1, 2025
    Share
    Cite
    Bryce Sandlund (2025). arc-agi-prompts-train-test-split [Dataset]. https://huggingface.co/datasets/bcsandlund/arc-agi-prompts-train-test-split
    Dataset updated
    Jun 1, 2025
    Authors
    Bryce Sandlund
    Description

    bcsandlund/arc-agi-prompts-train-test-split dataset hosted on Hugging Face and contributed by the HF Datasets community

  4. arc-agi-prompts-train-test-split

    • huggingface.co
    Updated Jul 8, 2025
    Cite
    Pritish Saha (2025). arc-agi-prompts-train-test-split [Dataset]. https://huggingface.co/datasets/Pritish92/arc-agi-prompts-train-test-split
    Dataset updated
    Jul 8, 2025
    Authors
    Pritish Saha
    Description

    Pritish92/arc-agi-prompts-train-test-split dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. PubMed (60%/20%/20% random splits) Dataset

    • paperswithcode.com
    Cite
    PubMed (60%/20%/20% random splits) Dataset [Dataset]. https://paperswithcode.com/dataset/pubmed-60-20-20-random-splits
    Description

    Node classification on PubMed with 60%/20%/20% random splits for training/validation/test.
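
As a rough illustration of what a 60%/20%/20% random split means for node classification, the sketch below assigns every node index to exactly one of three boolean masks. The node count of 1,000, the seed, and the helper name `random_split_masks` are assumptions chosen for the example; they are not the benchmark's actual split files.

```python
import random


def random_split_masks(num_nodes, train_frac=0.6, val_frac=0.2, seed=0):
    """Return boolean train/val/test masks covering every node exactly once."""
    order = list(range(num_nodes))
    random.Random(seed).shuffle(order)  # random permutation of node indices
    n_train = round(num_nodes * train_frac)
    n_val = round(num_nodes * val_frac)

    train_mask = [False] * num_nodes
    val_mask = [False] * num_nodes
    test_mask = [False] * num_nodes
    for i in order[:n_train]:
        train_mask[i] = True
    for i in order[n_train:n_train + n_val]:
        val_mask[i] = True
    for i in order[n_train + n_val:]:  # whatever remains goes to test
        test_mask[i] = True
    return train_mask, val_mask, test_mask


# Hypothetical graph size; substitute the real number of nodes.
train_mask, val_mask, test_mask = random_split_masks(1000)
print(sum(train_mask), sum(val_mask), sum(test_mask))
```

Changing `seed` produces a different random split of the same proportions, which is how several independent splits could be drawn for repeated evaluation.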

  6. Dataset, splits, models, and scripts for the QM descriptors prediction

    • zenodo.org
    • explore.openaire.eu
    application/gzip
    Updated Apr 4, 2024
    Cite
    Shih-Cheng Li; Haoyang Wu; Angiras Menon; Kevin A. Spiekermann; Yi-Pei Li; William H. Green (2024). Dataset, splits, models, and scripts for the QM descriptors prediction [Dataset]. http://doi.org/10.5281/zenodo.10668491
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Apr 4, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Shih-Cheng Li; Haoyang Wu; Angiras Menon; Kevin A. Spiekermann; Yi-Pei Li; William H. Green
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset, splits, models, and scripts from the manuscript "When Do Quantum Mechanical Descriptors Help Graph Neural Networks Predict Chemical Properties?" are provided. The curated dataset includes 37 QM descriptors for 64,921 unique molecules across six levels of theory: wB97XD, B3LYP, M06-2X, PBE0, TPSS, and BP86. This dataset is stored in the data.tar.gz file, which also contains a file for multitask constraints applied to various atomic and bond properties. The data splits (training, validation, and test splits) for both random and scaffold-based divisions are saved as separate index files in splits.tar.gz. The trained D-MPNN models for predicting QM descriptors are saved in the models.tar.gz file. The scripts.tar.gz file contains ready-to-use scripts for training machine learning models to predict QM descriptors, as well as scripts for predicting QM descriptors using our trained models on unseen molecules and for applying radial basis function (RBF) expansion to QM atom and bond features.

    Below are descriptions of the available scripts:

    1. atom_bond_descriptors.sh: Trains atom/bond targets.
    2. atom_bond_descriptors_predict.sh: Predicts atom/bond targets from pre-trained model.
    3. dipole_quadrupole_moments.sh: Trains dipole and quadrupole moments.
    4. dipole_quadrupole_moments_predict.sh: Predicts dipole and quadrupole moments from pre-trained model.
    5. energy_gaps_IP_EA.sh: Trains energy gaps, ionization potential (IP), and electron affinity (EA).
    6. energy_gaps_IP_EA_predict.sh: Predicts energy gaps, IP, and EA from pre-trained model.
    7. get_constraints.py: Generates constraints file for testing dataset. This generated file needs to be provided before using our trained models to predict the atom/bond QM descriptors of your testing data.
    8. csv2pkl.py: Converts QM atom and bond features to .pkl files using RBF expansion for use with Chemprop software.

    Below is the procedure for running the ml-QM-GNN on your own dataset:

    1. Use get_constraints.py to generate a constraint file required for predicting atom/bond QM descriptors with the trained ML models.
    2. Execute atom_bond_descriptors_predict.sh to predict atom and bond properties. Run dipole_quadrupole_moments_predict.sh and energy_gaps_IP_EA_predict.sh to calculate molecular QM descriptors.
    3. Utilize csv2pkl.py to convert the data from predicted atom/bond descriptors .csv file into separate atom and bond feature files (which are saved as .pkl files here).
    4. Run Chemprop to train your models using the additional predicted features supported here.
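
The description mentions applying a radial basis function (RBF) expansion to QM atom and bond features. The sketch below shows the general idea with a Gaussian RBF over evenly spaced centers. It is not the authors' csv2pkl.py; the function name `rbf_expand`, the value range, the number of centers, and `gamma` are all assumptions picked for illustration.

```python
import math


def rbf_expand(value, low, high, num_centers, gamma):
    """Expand one scalar into `num_centers` Gaussian RBF features.

    Centers are spaced evenly on [low, high]; feature k is
    exp(-gamma * (value - center_k) ** 2).
    """
    if num_centers < 2:
        raise ValueError("num_centers must be at least 2")
    step = (high - low) / (num_centers - 1)
    centers = [low + k * step for k in range(num_centers)]
    return [math.exp(-gamma * (value - c) ** 2) for c in centers]


# Hypothetical scalar descriptor (for example, a partial charge of 0.5)
# expanded over five centers on [0, 1].
features = rbf_expand(0.5, low=0.0, high=1.0, num_centers=5, gamma=10.0)
print(features)
```

The expansion turns a single number into a smooth vector that peaks at the nearest center, which can be easier for a neural network to use than the raw scalar.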
  7. Complete Final Rainy With Traintest Split & Augm Dataset

    • universe.roboflow.com
    zip
    Updated Aug 8, 2023
    Cite
    NIT Jalandhar (2023). Complete Final Rainy With Traintest Split & Augm Dataset [Dataset]. https://universe.roboflow.com/nit-jalandhar-euvaa/complete-final-rainy-dataset-with-traintest-split-augm
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 8, 2023
    Dataset authored and provided by
    NIT Jalandhar
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Variables measured
    Car Auto Motorbike Bus Truck Bounding Boxes
    Description

    Complete Final Rainy Dataset With Traintest Split & Augm

    ## Overview
    
    Complete Final Rainy Dataset With Traintest Split & Augm is a dataset for object detection tasks - it contains Car Auto Motorbike Bus Truck annotations for 2,106 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [MIT license](https://opensource.org/licenses/MIT).
    
  8. Train Test and Validation Split

    • kaggle.com
    Updated Apr 18, 2025
    Cite
    IMT2022053 (2025). Train Test and Validation Split [Dataset]. https://www.kaggle.com/datasets/pranavakulkarni/train-test-and-validation-split/discussion
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 18, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    IMT2022053
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by IMT2022053

    Released under Apache 2.0

    Contents

  9. WikiLingua Train/Test Split

    • kaggle.com
    Updated Sep 19, 2023
    Cite
    Stiff_Subset (2023). WikiLingua Train/Test Split [Dataset]. https://www.kaggle.com/datasets/stiffsubset/wikilingua-traintest-split/discussion?sort=undefined
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 19, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Stiff_Subset
    Description

    Dataset

    This dataset was created by Stiff_Subset

    Contents

  10. Training/Validation/Test set split

    • figshare.com
    zip
    Updated Mar 30, 2024
    Cite
    Tianfan Jin (2024). Training/Validation/Test set split [Dataset]. http://doi.org/10.6084/m9.figshare.25511056.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 30, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Tianfan Jin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Includes the split of real and null reactions into training, validation, and test sets.

  11. Film (60%/20%/20% random splits) Dataset

    • paperswithcode.com
    • library.toponeai.link
    Cite
    Film (60%/20%/20% random splits) Dataset [Dataset]. https://paperswithcode.com/dataset/film-60-20-20-random-splits
    Description

    Node classification on Film with 60%/20%/20% random splits for training/validation/test.

  12. dataset-muenzen-training-test-split-01

    • kaggle.com
    Updated Dec 13, 2024
    Cite
    pascalammeter (2024). dataset-muenzen-training-test-split-01 [Dataset]. https://www.kaggle.com/datasets/pascalammeter/dataset-muenzen-training-test-split-01/discussion
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 13, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    pascalammeter
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset

    This dataset was created by pascalammeter

    Released under MIT

    Contents

  13. deepstock-sp500-companies-info-stonkv2-test-train-split

    • huggingface.co
    Cite
    Lukas Abrie Nel, deepstock-sp500-companies-info-stonkv2-test-train-split [Dataset]. https://huggingface.co/datasets/2084Collective/deepstock-sp500-companies-info-stonkv2-test-train-split
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Authors
    Lukas Abrie Nel
    Description

    2084Collective/deepstock-sp500-companies-info-stonkv2-test-train-split dataset hosted on Hugging Face and contributed by the HF Datasets community

  14. Final Raw Rainy Dataset With Augum Without Traintest Split Dataset

    • universe.roboflow.com
    zip
    Updated Aug 9, 2023
    Cite
    NIT Jalandhar (2023). Final Raw Rainy Dataset With Augum Without Traintest Split Dataset [Dataset]. https://universe.roboflow.com/nit-jalandhar-euvaa/final-raw-rainy-dataset-with-augum-without-traintest-split/dataset/2
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 9, 2023
    Dataset authored and provided by
    NIT Jalandhar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Car Auto Motorbike Bus Truck Bounding Boxes
    Description

    Final Raw Rainy Dataset With Augum Without Traintest Split

    ## Overview
    
    Final Raw Rainy Dataset With Augum Without Traintest Split is a dataset for object detection tasks - it contains Car Auto Motorbike Bus Truck annotations for 731 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  15. DUDE competition train - validation - test splits ground truth

    • zenodo.org
    json
    Updated Mar 23, 2023
    Cite
    Jordy Van Landeghem (2023). DUDE competition train - validation - test splits ground truth [Dataset]. http://doi.org/10.5281/zenodo.7763635
    Explore at:
    Available download formats: json
    Dataset updated
    Mar 23, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jordy Van Landeghem
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This JSON file contains the ground truth annotations for the train and validation set of the DUDE competition (https://rrc.cvc.uab.es/?ch=23&com=tasks) of ICDAR 2023 (https://icdar2023.org/).

    V1.0.7 release: 41454 annotations for 4974 documents (train-validation-test)

    DatasetDict({
      train: Dataset({
        features: ['docId', 'questionId', 'question', 'answers', 'answers_page_bounding_boxes', 'answers_variants', 'answer_type', 'data_split', 'document', 'OCR'],
        num_rows: 23728
      })
      val: Dataset({
        features: ['docId', 'questionId', 'question', 'answers', 'answers_page_bounding_boxes', 'answers_variants', 'answer_type', 'data_split', 'document', 'OCR'],
        num_rows: 6315
      })
      test: Dataset({
        features: ['docId', 'questionId', 'question', 'answers', 'answers_page_bounding_boxes', 'answers_variants', 'answer_type', 'data_split', 'document', 'OCR'],
        num_rows: 11402
      })
    })
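
A quick arithmetic check on the `num_rows` values above gives the size of each split as a share of the whole. The three counts sum to 41,445, which is 9 fewer than the 41,454 annotations quoted for the V1.0.7 release. Nine duplicate IDs are listed as removed from the test set in the changelog that follows, which may account for the gap, but the source does not say so explicitly.

```python
# Row counts copied from the DatasetDict listing above.
num_rows = {"train": 23728, "val": 6315, "test": 11402}

total = sum(num_rows.values())
shares = {name: count / total for name, count in num_rows.items()}

# Annotation count quoted in the "V1.0.7 release" line.
reported_annotations = 41454
difference = reported_annotations - total

for name, share in shares.items():
    print(f"{name}: {num_rows[name]} rows ({share:.1%})")
print("total rows:", total)
print("reported minus total:", difference)
```

The split works out to roughly 57% train, 15% validation, and 28% test.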
    
    ++update on answer_type
    +++formatting change to answers_variants
    ++++stricter check on answer_variants & rename annotations file
    
    + blind test set (no ground truth answers provided)
    ++ removed duplicates from test set: 
    

    "92bd5c758bda9bdceb5f67c17009207b_ac6964cbdf483e765b6668e27b3d0bc4",

    "6ee71a16d4e4d1dbd7c1f569a92d4e08_549f2a163f8ff3e9f0293cf59fdd98bc",

    "e6f3855472231a7ca6aada2f8e85fe5a_827c03a72f2552c722f2c872fd7f74c3",

    "e3eecd7cca5de11f1d17cd94ae6a8d77_6300df64e4cf6ba0600ac81278f68de2",

    "107b4037df8127a92ee4b6ae9b5df8fb_d7a60e7a9fc0b27487ea39cd7f56f98e",

    "300cc3900080064d308983f958141232_6a7cf1aad908d58a75ab8e02ddc856f4",

    "fdd3308efacddb88d4aa6e2073f481d4_138cb868ecc804a63cc7a4502c0009b2",

    "1f7de256ff1743d329a8402ba0d132e7_95b6e8758533a9817b9f20a958e7b776",

    "4f399b8c526ffb6a2fd585a18d4ed5ec_51097231bc327c26c59a4fd8d3ff3069",

  16. Split Garbage Dataset

    • kaggle.com
    Updated May 18, 2019
    Cite
    Andrea Santoro (2019). Split Garbage Dataset [Dataset]. https://www.kaggle.com/andreasantoro/split-garbage-dataset/kernels
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 18, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Andrea Santoro
    Description

    Split version of the garbage classification dataset (link below). The train, test, and valid folders were generated as specified by the one-indexed files of the original dataset.

    Acknowledgements

    Original dataset here: https://www.kaggle.com/asdasdasasdas/garbage-classification

  17. flowers-299_Train&Test

    • kaggle.com
    Updated Jul 7, 2023
    Cite
    Xocion (2023). flowers-299_Train&Test [Dataset]. https://www.kaggle.com/datasets/xocion/flower299-train-and-test
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 7, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Xocion
    Description

    Original dataset: https://www.kaggle.com/datasets/bogdancretu/flower299

    I chose an Acacia flower as the display picture of this dataset to highlight a problem in flowers-299. If you go to the second folder of Acacia flowers, you will see a bunch of pictures of different-looking flowers. Despite having different shapes, structures, and colors, they are all technically Acacia flowers, but we can't use this data for training because we don't have enough samples of Acacia flowers. Despite all efforts and the best model, the probability of a model giving accurate predictions for Acacia flowers is low.

    This set of data needs data augmentation to be used efficiently with ResNet50.

  18. Titanic Dataset

    • kaggle.com
    Updated Dec 24, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bhavesh Padharia (2022). Titanic Dataset [Dataset]. https://www.kaggle.com/datasets/bhavesh1335/titanic-dataset/code
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 24, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Bhavesh Padharia
    Description

    Dataset

    This dataset was created by Bhavesh Padharia

    Contents

  19. hak-chat-dataset-train-test-split

    • huggingface.co
    Cite
    Ho Kang, hak-chat-dataset-train-test-split [Dataset]. https://huggingface.co/datasets/kanghokh/hak-chat-dataset-train-test-split
    Authors
    Ho Kang
    Description

    kanghokh/hak-chat-dataset-train-test-split dataset hosted on Hugging Face and contributed by the HF Datasets community

  20. Juliet-train-split-test-on-BinRealVul

    • huggingface.co
    Cite
    Compote, Juliet-train-split-test-on-BinRealVul [Dataset]. https://huggingface.co/datasets/CCompote/Juliet-train-split-test-on-BinRealVul
    Authors
    Compote
    License

    https://choosealicense.com/licenses/gpl-3.0/

    Description

    Juliet-train-split-test-on-BinRealVul

      Dataset Summary
    

    Juliet-train-split-test-on-BinRealVul is a curated subset of the Juliet Test Suite (as organized in the GitHub repository), compiled and lifted to LLVM Intermediate Representation (IR) after a pre-processing phase. This dataset is designed specifically for training binary vulnerability detection models in a setting that ensures a fair comparison with models trained on CompRealVul_LLVM. The split was constructed to match… See the full description on the dataset page: https://huggingface.co/datasets/CCompote/Juliet-train-split-test-on-BinRealVul.
