100+ datasets found

h
alpaca-train-validation-test-split
huggingface.co
Updated Aug 12, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Doula Isham Rashik Hasan (2023). alpaca-train-validation-test-split [Dataset]. https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 12, 2023
Authors
Doula Isham Rashik Hasan
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Dataset Card for Alpaca

I have just performed train, test and validation split on the original dataset. Repository to reproduce this will be shared here soon. I am including the orignal Dataset card as follows.

Dataset Summary

Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction-tuning for language models and make the language model follow instruction better.… See the full description on the dataset page: https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split.
Dataset, splits, models, and scripts for the QM descriptors prediction
zenodo.org
explore.openaire.eu
application/gzip
Updated Apr 4, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shih-Cheng Li; Shih-Cheng Li; Haoyang Wu; Haoyang Wu; Angiras Menon; Angiras Menon; Kevin A. Spiekermann; Kevin A. Spiekermann; Yi-Pei Li; Yi-Pei Li; William H. Green; William H. Green (2024). Dataset, splits, models, and scripts for the QM descriptors prediction [Dataset]. http://doi.org/10.5281/zenodo.10668491
Explore at:
application/gzipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.10668491
Dataset updated
Apr 4, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Shih-Cheng Li; Shih-Cheng Li; Haoyang Wu; Haoyang Wu; Angiras Menon; Angiras Menon; Kevin A. Spiekermann; Kevin A. Spiekermann; Yi-Pei Li; Yi-Pei Li; William H. Green; William H. Green
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset, splits, models, and scripts from the manuscript "When Do Quantum Mechanical Descriptors Help Graph Neural Networks Predict Chemical Properties?" are provided. The curated dataset includes 37 QM descriptors for 64,921 unique molecules across six levels of theory: wB97XD, B3LYP, M06-2X, PBE0, TPSS, and BP86. This dataset is stored in the data.tar.gz file, which also contains a file for multitask constraints applied to various atomic and bond properties. The data splits (training, validation, and test splits) for both random and scaffold-based divisions are saved as separate index files in splits.tar.gz. The trained D-MPNN models for predicting QM descriptors are saved in the models.tar.gz file. The scripts.tar.gz file contains ready-to-use scripts for training machine learning models to predict QM descriptors, as well as scripts for predicting QM descriptors using our trained models on unseen molecules and for applying radial basis function (RBF) expansion to QM atom and bond features.

Below are descriptions of the available scripts:

atom_bond_descriptors.sh: Trains atom/bond targets.

atom_bond_descriptors_predict.sh: Predicts atom/bond targets from pre-trained model.

dipole_quadrupole_moments.sh: Trains dipole and quadrupole moments.

dipole_quadrupole_moments_predict.sh: Predicts dipole and quadrupole moments from pre-trained model.

energy_gaps_IP_EA.sh: Trains energy gaps, ionization potential (IP), and electron affinity (EA).

energy_gaps_IP_EA_predict.sh: Predicts energy gaps, IP, and EA from pre-trained model.

get_constraints.py: Generates constraints file for testing dataset. This generated file needs to be provided before using our trained models to predict the atom/bond QM descriptors of your testing data.

csv2pkl.py: Converts QM atom and bond features to .pkl files using RBF expansion for use with Chemprop software.

Below is the procedure for running the ml-QM-GNN on your own dataset:

Use get_constraints.py to generate a constraint file required for predicting atom/bond QM descriptors with the trained ML models.

Execute atom_bond_descriptors_predict.sh to predict atom and bond properties. Run dipole_quadrupole_moments_predict.sh and energy_gaps_IP_EA_predict.sh to calculate molecular QM descriptors.

Utilize csv2pkl.py to convert the data from predicted atom/bond descriptors .csv file into separate atom and bond feature files (which are saved as .pkl files here).

Run Chemprop to train your models using the additional predicted features supported here.
R
Train Test Split For Freiburg In Yolov7 Format Dataset
universe.roboflow.com
zip
Updated Aug 4, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Isaac H (2023). Train Test Split For Freiburg In Yolov7 Format Dataset [Dataset]. https://universe.roboflow.com/isaac-h/train-test-split-for-freiburg-dataset-in-yolov7-format/dataset/7
Explore at:
zipAvailable download formats
Dataset updated
Aug 4, 2023
Dataset authored and provided by
Isaac H
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Groceries Bounding Boxes
Description
Train Test Split For Freiburg Dataset In YOLOv7 Format

## Overview Train Test Split For Freiburg Dataset In YOLOv7 Format is a dataset for object detection tasks - it contains Groceries annotations for 8,879 images. ## Getting Started You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model. ## License This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
h
arc-agi-prompts-train-test-split
huggingface.co
Updated Jun 1, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bryce Sandlund (2025). arc-agi-prompts-train-test-split [Dataset]. https://huggingface.co/datasets/bcsandlund/arc-agi-prompts-train-test-split
Explore at:
Dataset updated
Jun 1, 2025
Authors
Bryce Sandlund
Description
bcsandlund/arc-agi-prompts-train-test-split dataset hosted on Hugging Face and contributed by the HF Datasets community
P
Film (60%/20%/20% random splits) Dataset
paperswithcode.com
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Film (60%/20%/20% random splits) Dataset [Dataset]. https://paperswithcode.com/dataset/film-60-20-20-random-splits
Explore at:
Description
Node classification on Film with 60%/20%/20% random splits for training/validation/test.
Train Test and Validation Split
kaggle.com
Updated Apr 18, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
IMT2022053 (2025). Train Test and Validation Split [Dataset]. https://www.kaggle.com/datasets/pranavakulkarni/train-test-and-validation-split/suggestions?status=pending
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 18, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
IMT2022053
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset

This dataset was created by IMT2022053

Released under Apache 2.0

Contents
dataset-muenzen-training-test-split-01
kaggle.com
Updated Dec 13, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
pascalammeter (2024). dataset-muenzen-training-test-split-01 [Dataset]. https://www.kaggle.com/datasets/pascalammeter/dataset-muenzen-training-test-split-01/discussion
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 13, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
pascalammeter
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset

This dataset was created by pascalammeter

Released under MIT

Contents
P
Dayton Dataset
paperswithcode.com
opendatalab.com
Updated Feb 2, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nam Vo; James Hays (2020). Dayton Dataset [Dataset]. https://paperswithcode.com/dataset/dayton
Explore at:
Dataset updated
Feb 2, 2020
Authors
Nam Vo; James Hays
Description
The Dayton dataset is a dataset for ground-to-aerial (or aerial-to-ground) image translation, or cross-view image synthesis. It contains images of road views and aerial views of roads. There are 76,048 images in total and the train/test split is 55,000/21,048. The images in the original dataset have 354×354 resolution.
R
Hard Hat Workers Object Detection Dataset - resize-416x416-reflectEdges
public.roboflow.com
zip
Updated Sep 30, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Northeastern University - China (2022). Hard Hat Workers Object Detection Dataset - resize-416x416-reflectEdges [Dataset]. https://public.roboflow.com/object-detection/hard-hat-workers/1
Explore at:
zipAvailable download formats
Dataset updated
Sep 30, 2022
Dataset authored and provided by
Northeastern University - China
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Variables measured
Bounding Boxes of Workers
Description
Overview

The Hard Hat dataset is an object detection dataset of workers in workplace settings that require a hard hat. Annotations also include examples of just "person" and "head," for when an individual may be present without a hard hart.

The original dataset has a 75/25 train-test split.

Example Image: https://i.imgur.com/7spoIJT.png" alt="Example Image">

Use Cases

One could use this dataset to, for example, build a classifier of workers that are abiding safety code within a workplace versus those that may not be. It is also a good general dataset for practice.

Using this Dataset

Use the fork or Download this Dataset button to copy this dataset to your own Roboflow account and export it with new preprocessing settings (perhaps resized for your model's desired format or converted to grayscale), or additional augmentations to make your model generalize better. This particular dataset would be very well suited for Roboflow's new advanced Bounding Box Only Augmentations.

Dataset Versions:

Image Preprocessing | Image Augmentation | Modify Classes * v1 (resize-416x416-reflect): generated with the original 75/25 train-test split | No augmentations * v2 (raw_75-25_trainTestSplit): generated with the original 75/25 train-test split | These are the raw, original images * v3 (v3): generated with the original 75/25 train-test split | Modify Classes used to drop person class | Preprocessing and Augmentation applied * v5 (raw_HeadHelmetClasses): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop person class * v8 (raw_HelmetClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop head and person classes * v9 (raw_PersonClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop head and helmet classes * v10 (raw_AllClasses): generated with a 70/20/10 train/valid/test split | These are the raw, original images * v11 (augmented3x-AllClasses-FastModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied | 3x image generation | Trained with Roboflow's Fast Model * v12 (augmented3x-HeadHelmetClasses-FastModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied, Modify Classes used to drop person class | 3x image generation | Trained with Roboflow's Fast Model * v13 (augmented3x-HeadHelmetClasses-AccurateModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied, Modify Classes used to drop person class | 3x image generation | Trained with Roboflow's Accurate Model * v14 (raw_HeadClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop person class, and remap/relabel helmet class to head

Choosing Between Computer Vision Model Sizes | Roboflow Train

About Roboflow

Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless.

Developers reduce 50% of their code when using Roboflow's workflow, automate annotation quality assurance, save training time, and increase model reproducibility.
Dataset for Cost-effective Simulation-based Test Selection in Self-driving...
zenodo.org
data.niaid.nih.gov
pdf, zip
Updated Jul 17, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Christian Birchler; Nicolas Ganz; Sajad Khatiri; Alessio Gambi; Sebastiano Panichella; Christian Birchler; Nicolas Ganz; Sajad Khatiri; Alessio Gambi; Sebastiano Panichella (2024). Dataset for Cost-effective Simulation-based Test Selection in Self-driving Cars Software with SDC-Scissor [Dataset]. http://doi.org/10.5281/zenodo.5914130
Explore at:
zip, pdfAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.5914130
Dataset updated
Jul 17, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Christian Birchler; Nicolas Ganz; Sajad Khatiri; Alessio Gambi; Sebastiano Panichella; Christian Birchler; Nicolas Ganz; Sajad Khatiri; Alessio Gambi; Sebastiano Panichella
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
SDC-Scissor tool for Cost-effective Simulation-based Test Selection in Self-driving Cars Software

This dataset provides test cases for self-driving cars with the BeamNG simulator. Check out the repository and demo video to get started.

GitHub: github.com/ChristianBirchler/sdc-scissor

This project extends the tool competition platform from the Cyber-Phisical Systems Testing Competition which was part of the SBST Workshop in 2021.

Usage

Demo

YouTube Link

Installation

The tool can either be run with Docker or locally using Poetry.

When running the simulations a working installation of BeamNG.research is required. Additionally, this simulation cannot be run in a Docker container but must run locally.

To install the application use one of the following approaches:

Docker: docker build --tag sdc-scissor .

Poetry: poetry install

Using the Tool

The tool can be used with the following two commands:

Docker: docker run --volume "$(pwd)/results:/out" --rm sdc-scissor [COMMAND] [OPTIONS] (this will write all files written to /out to the local folder results)

Poetry: poetry run python sdc-scissor.py [COMMAND] [OPTIONS]

There are multiple commands to use. For simplifying the documentation only the command and their options are described.

Generation of tests:

generate-tests --out-path /path/to/store/tests

Automated labeling of Tests:

label-tests --road-scenarios /path/to/tests --result-folder /path/to/store/labeled/tests

Note: This only works locally with BeamNG.research installed

Model evaluation:

evaluate-models --dataset /path/to/train/set --save

Split train and test data:

split-train-test-data --scenarios /path/to/scenarios --train-dir /path/for/train/data --test-dir /path/for/test/data --train-ratio 0.8

Test outcome prediction:

predict-tests --scenarios /path/to/scenarios --classifier /path/to/model.joblib

Evaluation based on random strategy:

evaluate --scenarios /path/to/test/scenarios --classifier /path/to/model.joblib

The possible parameters are always documented with --help.

Linting

The tool is verified the linters flake8 and pylint. These are automatically enabled in Visual Studio Code and can be run manually with the following commands:

poetry run flake8 . poetry run pylint **/*.py

License

The software we developed is distributed under GNU GPL license. See the LICENSE.md file.

Contacts

Christian Birchler - Zurich University of Applied Science (ZHAW), Switzerland - birc@zhaw.ch

Nicolas Ganz - Zurich University of Applied Science (ZHAW), Switzerland - gann@zhaw.ch

Sajad Khatiri - Zurich University of Applied Science (ZHAW), Switzerland - mazr@zhaw.ch

Dr. Alessio Gambi - Passau University, Germany - alessio.gambi@uni-passau.de

Dr. Sebastiano Panichella - Zurich University of Applied Science (ZHAW), Switzerland - panc@zhaw.ch

References

Christian Birchler, Nicolas Ganz, Sajad Khatiri, Alessio Gambi, and Sebastiano Panichella. 2022. Cost-effective Simulation-based Test Selection in Self-driving Cars Software with SDC-Scissor. In 2022 IEEE 29th International Conference on Software Analysis, Evolution and Reengineering (SANER), IEEE.

If you use this tool in your research, please cite the following papers:

@INPROCEEDINGS{Birchler2022, author={Birchler, Christian and Ganz, Nicolas and Khatiri, Sajad and Gambi, Alessio, and Panichella, Sebastiano}, booktitle={2022 IEEE 29th International Conference on Software Analysis, Evolution and Reengineering (SANER), title={Cost-effective Simulationbased Test Selection in Self-driving Cars Software with SDC-Scissor}, year={2022}, }
h
stackoverflow_linux
huggingface.co
Updated Oct 6, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Konrad Szafer (2023). stackoverflow_linux [Dataset]. https://huggingface.co/datasets/KonradSzafer/stackoverflow_linux
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 6, 2023
Authors
Konrad Szafer
Description
Dataset Card for "stackoverflow_linux"

Dataset information:

Source: Stack Overflow Category: Linux Number of samples: 300 Train/Test split: 270/30 Quality: Data come from the top 1k most upvoted questions

Additional Information License

All Stack Overflow user contributions are licensed under CC-BY-SA 3.0 with attribution required. More Information needed
R
Hard Hat Workers Dataset
universe.roboflow.com
zip
Updated Sep 30, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Joseph Nelson (2022). Hard Hat Workers Dataset [Dataset]. https://universe.roboflow.com/joseph-nelson/hard-hat-workers/model/13
Explore at:
zipAvailable download formats
Dataset updated
Sep 30, 2022
Dataset authored and provided by
Joseph Nelson
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Variables measured
Workers Bounding Boxes
Description
Overview

The Hard Hat dataset is an object detection dataset of workers in workplace settings that require a hard hat. Annotations also include examples of just "person" and "head," for when an individual may be present without a hard hart.

The original dataset has a 75/25 train-test split.

Example Image: https://i.imgur.com/7spoIJT.png" alt="Example Image">

Use Cases

One could use this dataset to, for example, build a classifier of workers that are abiding safety code within a workplace versus those that may not be. It is also a good general dataset for practice.

Using this Dataset

Use the fork or Download this Dataset button to copy this dataset to your own Roboflow account and export it with new preprocessing settings (perhaps resized for your model's desired format or converted to grayscale), or additional augmentations to make your model generalize better. This particular dataset would be very well suited for Roboflow's new advanced Bounding Box Only Augmentations.

Dataset Versions:

Image Preprocessing | Image Augmentation | Modify Classes * v1 (resize-416x416-reflect): generated with the original 75/25 train-test split | No augmentations * v2 (raw_75-25_trainTestSplit): generated with the original 75/25 train-test split | These are the raw, original images * v3 (v3): generated with the original 75/25 train-test split | Modify Classes used to drop person class | Preprocessing and Augmentation applied * v5 (raw_HeadHelmetClasses): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop person class * v8 (raw_HelmetClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop head and person classes * v9 (raw_PersonClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop head and helmet classes * v10 (raw_AllClasses): generated with a 70/20/10 train/valid/test split | These are the raw, original images * v11 (augmented3x-AllClasses-FastModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied | 3x image generation | Trained with Roboflow's Fast Model * v12 (augmented3x-HeadHelmetClasses-FastModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied, Modify Classes used to drop person class | 3x image generation | Trained with Roboflow's Fast Model * v13 (augmented3x-HeadHelmetClasses-AccurateModel): generated with a 70/20/10 train/valid/test split | Preprocessing and Augmentation applied, Modify Classes used to drop person class | 3x image generation | Trained with Roboflow's Accurate Model * v14 (raw_HeadClassOnly): generated with a 70/20/10 train/valid/test split | Modify Classes used to drop person class, and remap/relabel helmet class to head

Choosing Between Computer Vision Model Sizes | Roboflow Train

About Roboflow

Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless.

Developers reduce 50% of their code when using Roboflow's workflow, automate annotation quality assurance, save training time, and increase model reproducibility.
h
deepstock-sp500-companies-info-stonkv2-test-train-split
huggingface.co
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lukas Abrie Nel, deepstock-sp500-companies-info-stonkv2-test-train-split [Dataset]. https://huggingface.co/datasets/2084Collective/deepstock-sp500-companies-info-stonkv2-test-train-split
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Authors
Lukas Abrie Nel
Description
2084Collective/deepstock-sp500-companies-info-stonkv2-test-train-split dataset hosted on Hugging Face and contributed by the HF Datasets community
test train split ntu_60
kaggle.com
Updated Jun 13, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
bharghav_kv_02 (2024). test train split ntu_60 [Dataset]. https://www.kaggle.com/datasets/bharghavkv02/test-train-split-ntu-60/code
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 13, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
bharghav_kv_02
Description
Dataset

This dataset was created by bharghav_kv_02

Contents
Z
Data from: Benchmark Datasets Incorporating Diverse Tasks, Sample Sizes,...
data.niaid.nih.gov
Updated Jun 6, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kauwe, K. Steven (2021). Benchmark Datasets Incorporating Diverse Tasks, Sample Sizes, Material Systems, and Data Heterogeneity for Materials Informatics [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4903957
Explore at:
Dataset updated
Jun 6, 2021
Dataset provided by
Sparks, D. Taylor
Henderson, N. Ashley
Kauwe, K. Steven
Description
This benchmark data is comprised of 50 different datasets for materials properties obtained from 16 previous publications. The data contains both experimental and computational data, data suited for regression as well as classification, sizes ranging from 12 to 6354 samples, and materials systems spanning the diversity of materials research. In addition to cleaning the data where necessary, each dataset was split into train, validation, and test splits.

For datasets with more than 100 values, train-val-test splits were created, either with a 5-fold or 10-fold cross-validation method, depending on what each respective paper did in their studies. Datasets with less than 100 values had train-test splits created using the Leave-One-Out cross-validation method.

For further information, as well as directions on how to access the data, please go to the corresponding GitHub repository: https://github.com/anhender/mse_ML_datasets/tree/v1.0
DUDE competition train - validation - test splits ground truth
zenodo.org
json
Updated Mar 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jordy Van Landeghem; Jordy Van Landeghem (2023). DUDE competition train - validation - test splits ground truth [Dataset]. http://doi.org/10.5281/zenodo.7763635
Explore at:
jsonAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.7763635
Dataset updated
Mar 23, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Jordy Van Landeghem; Jordy Van Landeghem
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This JSON file contains the ground truth annotations for the train and validation set of the DUDE competition (https://rrc.cvc.uab.es/?ch=23&com=tasks) of ICDAR 2023 (https://icdar2023.org/).

V1.0.7 release: 41454 annotations for 4974 documents (train-validation-test)

DatasetDict({ train: Dataset({ features: ['docId', 'questionId', 'question', 'answers', 'answers_page_bounding_boxes', 'answers_variants', 'answer_type', 'data_split', 'document', 'OCR'], num_rows: 23728 }) val: Dataset({ features: ['docId', 'questionId', 'question', 'answers', 'answers_page_bounding_boxes', 'answers_variants', 'answer_type', 'data_split', 'document', 'OCR'], num_rows: 6315 }) test: Dataset({ features: ['docId', 'questionId', 'question', 'answers', 'answers_page_bounding_boxes', 'answers_variants', 'answer_type', 'data_split', 'document', 'OCR'], num_rows: 11402 }) }) ++update on answer_type +++formatting change to answers_variants ++++stricter check on answer_variants & rename annotations file + blind test set (no ground truth answers provided) ++ removed duplicates from test set:

"92bd5c758bda9bdceb5f67c17009207b_ac6964cbdf483e765b6668e27b3d0bc4",

"6ee71a16d4e4d1dbd7c1f569a92d4e08_549f2a163f8ff3e9f0293cf59fdd98bc",

"e6f3855472231a7ca6aada2f8e85fe5a_827c03a72f2552c722f2c872fd7f74c3",

"e3eecd7cca5de11f1d17cd94ae6a8d77_6300df64e4cf6ba0600ac81278f68de2",

"107b4037df8127a92ee4b6ae9b5df8fb_d7a60e7a9fc0b27487ea39cd7f56f98e",

"300cc3900080064d308983f958141232_6a7cf1aad908d58a75ab8e02ddc856f4",

"fdd3308efacddb88d4aa6e2073f481d4_138cb868ecc804a63cc7a4502c0009b2",

"1f7de256ff1743d329a8402ba0d132e7_95b6e8758533a9817b9f20a958e7b776",

"4f399b8c526ffb6a2fd585a18d4ed5ec_51097231bc327c26c59a4fd8d3ff3069",
f
Dataset
figshare.com
application/x-gzip
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Moynuddin Ahmed Shibly (2023). Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.13577873.v1
Explore at:
application/x-gzipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.13577873.v1
Dataset updated
May 31, 2023
Dataset provided by
figshare
Authors
Moynuddin Ahmed Shibly
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is an open source - publicly available dataset which can be found at https://shahariarrabby.github.io/ekush/ . We split the dataset into three sets - train, validation, and test. For our experiments, we created two other versions of the dataset. We have applied 10-fold cross validation on the train set and created ten folds. We also created ten bags of datasets using bootstrap aggregating method on the train and validation sets. Lastly, we created another dataset using pre-trained ResNet50 model as feature extractor. On the features extracted by ResNet50 we have applied PCA and created a tabilar dataset containing 80 features. pca_features.csv is the train set and pca_test_features.csv is the test set. Fold.tar.gz contains the ten folds of images described above. Those folds are also been compressed. Similarly, Bagging.tar.gz contains the ten compressed bags of images. The original train, validation, and test sets are in Train.tar.gz, Validation.tar.gz, and Test.tar.gz, respectively. The compression has been performed for speeding up the upload and download purpose and mostly for the sake of convenience. If anyone has any question about how the datasets are organized please feel free to ask me at shiblygnr@gmail.com .I will get back to you in earliest time possible.
Dollar street 10 - 64x64x3
zenodo.org
data.niaid.nih.gov
bin
Updated May 6, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sven van der burg; Sven van der burg (2025). Dollar street 10 - 64x64x3 [Dataset]. http://doi.org/10.5281/zenodo.10970014
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.10970014
Dataset updated
May 6, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Sven van der burg; Sven van der burg
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The MLCommons Dollar Street Dataset is a collection of images of everyday household items from homes around the world that visually captures socioeconomic diversity of traditionally underrepresented populations. It consists of public domain data, licensed for academic, commercial and non-commercial usage, under CC-BY and CC-BY-SA 4.0. The dataset was developed because similar datasets lack socioeconomic metadata and are not representative of global diversity.

This is a subset of the original dataset that can be used for multiclass classification with 10 categories. It is designed to be used in teaching, similar to the widely used, but unlicensed CIFAR-10 dataset.

These are the preprocessing steps that were performed:

Only take examples with one imagenet_synonym label

Use only examples with the 10 most frequently occuring labels

Downscale images to 64 x 64 pixels

Split data in train and test

Store as numpy array

This is the label mapping:

Category label
day bed 0
dishrag 1
plate 2
running shoe 3
soap dispenser 4
street sign 5
table lamp 6
tile roof 7
toilet seat 8
washing machine 9

Checkout https://github.com/carpentries-lab/deep-learning-intro/blob/main/instructors/prepare-dollar-street-data.ipynb" target="_blank" rel="noopener">this notebook to see how the subset was created.

The original dataset was downloaded from https://www.kaggle.com/datasets/mlcommons/the-dollar-street-dataset. See https://mlcommons.org/datasets/dollar-street/ for more information.
f
Description of the train test split dataset.
plos.figshare.com
xls
Updated Jan 9, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mahade Hasan; Farhana Yasmin; Md. Mehedi Hassan; Xue Yu; Soniya Yeasmin; Herat Joshi; Sheikh Mohammed Shariful Islam (2025). Description of the train test split dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0312914.t003
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0312914.t003
Dataset updated
Jan 9, 2025
Dataset provided by
PLOS ONE
Authors
Mahade Hasan; Farhana Yasmin; Md. Mehedi Hassan; Xue Yu; Soniya Yeasmin; Herat Joshi; Sheikh Mohammed Shariful Islam
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Heart disease remains a leading cause of mortality and morbidity worldwide, necessitating the development of accurate and reliable predictive models to facilitate early detection and intervention. While state of the art work has focused on various machine learning approaches for predicting heart disease, but they could not able to achieve remarkable accuracy. In response to this need, we applied nine machine learning algorithms XGBoost, logistic regression, decision tree, random forest, k-nearest neighbors (KNN), support vector machine (SVM), gaussian naïve bayes (NB gaussian), adaptive boosting, and linear regression to predict heart disease based on a range of physiological indicators. Our approach involved feature selection techniques to identify the most relevant predictors, aimed at refining the models to enhance both performance and interpretability. The models were trained, incorporating processes such as grid search hyperparameter tuning, and cross-validation to minimize overfitting. Additionally, we have developed a novel voting system with feature selection techniques to advance heart disease classification. Furthermore, we have evaluated the models using key performance metrics including accuracy, precision, recall, F1-score, and the area under the receiver operating characteristic curve (ROC AUC). Among the models, XGBoost demonstrated exceptional performance, achieving 99% accuracy, precision, F1-Score, 98% recall, and 100% ROC AUC. This study offers a promising approach to early heart disease diagnosis and preventive healthcare.
R
Data from: Fashion Mnist Dataset
universe.roboflow.com
opendatalab.com
+3more
zip
Updated Aug 10, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Popular Benchmarks (2022). Fashion Mnist Dataset [Dataset]. https://universe.roboflow.com/popular-benchmarks/fashion-mnist-ztryt/model/3
Explore at:
zipAvailable download formats
Dataset updated
Aug 10, 2022
Dataset authored and provided by
Popular Benchmarks
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Clothing
Description
Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms

Authors:

Han Xiao, Kashif Rasul and Roland Vollgraf

https://arxiv.org/abs/1708.07747

Dataset Obtained From: https://github.com/zalandoresearch/fashion-mnist

All images were sized 28x28 in the original dataset

Fashion-MNIST is a dataset of Zalando's article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits. * Source

Here's an example of how the data looks (each class takes three-rows): https://github.com/zalandoresearch/fashion-mnist/raw/master/doc/img/fashion-mnist-sprite.png" alt="Visualized Fashion MNIST dataset">

Version 1 (original-images_Original-FashionMNIST-Splits):

Original images, with the original splits for MNIST: train (86% of images - 60,000 images) set and test (14% of images - 10,000 images) set only.

This version was not trained

Version 3 (original-images_trainSetSplitBy80_20):

Original, raw images, with the train set split to provide 80% of its images to the training set and 20% of its images to the validation set

https://blog.roboflow.com/train-test-split/ https://i.imgur.com/angfheJ.png" alt="Train/Valid/Test Split Rebalancing">

Citation:

@online{xiao2017/online, author = {Han Xiao and Kashif Rasul and Roland Vollgraf}, title = {Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms}, date = {2017-08-28}, year = {2017}, eprintclass = {cs.LG}, eprinttype = {arXiv}, eprint = {cs.LG/1708.07747}, }

Facebook

Twitter

Click to copy link

Link copied

Cite

Doula Isham Rashik Hasan (2023). alpaca-train-validation-test-split [Dataset]. https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split

alpaca-train-validation-test-split

Alpaca

disham993/alpaca-train-validation-test-split

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Aug 12, 2023

Authors

Doula Isham Rashik Hasan

License

Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically

Description

Dataset Card for Alpaca

I have just performed train, test and validation split on the original dataset. Repository to reproduce this will be shared here soon. I am including the orignal Dataset card as follows.

  Dataset Summary

Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction-tuning for language models and make the language model follow instruction better.… See the full description on the dataset page: https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split.

Clear search

Close search

Google apps

Main menu

Category	label
day bed	0
dishrag	1
plate	2
running shoe	3
soap dispenser	4
street sign	5
table lamp	6
tile roof	7
toilet seat	8
washing machine	9

alpaca-train-validation-test-split

Dataset, splits, models, and scripts for the QM descriptors prediction

Train Test Split For Freiburg In Yolov7 Format Dataset

Train Test Split For Freiburg Dataset In YOLOv7 Format

arc-agi-prompts-train-test-split

Film (60%/20%/20% random splits) Dataset

Train Test and Validation Split

Dataset

Contents

dataset-muenzen-training-test-split-01

Dataset

Contents

Dayton Dataset

Hard Hat Workers Object Detection Dataset - resize-416x416-reflectEdges

Overview

Use Cases

Using this Dataset

Dataset Versions:

About Roboflow

Dataset for Cost-effective Simulation-based Test Selection in Self-driving...

stackoverflow_linux

Hard Hat Workers Dataset

Overview

Use Cases

Using this Dataset

Dataset Versions:

About Roboflow

deepstock-sp500-companies-info-stonkv2-test-train-split

test train split ntu_60

Dataset

Contents

Data from: Benchmark Datasets Incorporating Diverse Tasks, Sample Sizes,...

DUDE competition train - validation - test splits ground truth

Dataset

Dollar street 10 - 64x64x3

Description of the train test split dataset.

Data from: Fashion Mnist Dataset

Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms

Authors:

Dataset Obtained From: https://github.com/zalandoresearch/fashion-mnist

All images were sized 28x28 in the original dataset

Version 1 (original-images_Original-FashionMNIST-Splits):

Version 3 (original-images_trainSetSplitBy80_20):

Citation:

alpaca-train-validation-test-split

Alpaca

disham993/alpaca-train-validation-test-split