100+ datasets found

alpaca-train-validation-test-split
huggingface.co
Updated Aug 12, 2023
Cite
Doula Isham Rashik Hasan (2023). alpaca-train-validation-test-split [Dataset]. https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split
Explore at:
Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 12, 2023
Authors
Doula Isham Rashik Hasan
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Dataset Card for Alpaca

I have only performed a train, test and validation split on the original dataset. A repository to reproduce this will be shared here soon. The original dataset card is included below.

Dataset Summary

Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction-tuning for language models and make the language model follow instructions better.… See the full description on the dataset page: https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split.
Train Test Split For Freiburg In Yolov7 Format Dataset
universe.roboflow.com
zip
Updated Aug 4, 2023
Cite
Isaac H (2023). Train Test Split For Freiburg In Yolov7 Format Dataset [Dataset]. https://universe.roboflow.com/isaac-h/train-test-split-for-freiburg-dataset-in-yolov7-format
Explore at:
Available download formats: zip
Dataset updated
Aug 4, 2023
Dataset authored and provided by
Isaac H
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Freiburg im Breisgau
Variables measured
Groceries Bounding Boxes
Description
Train Test Split For Freiburg Dataset In YOLOv7 Format

## Overview
Train Test Split For Freiburg Dataset In YOLOv7 Format is a dataset for object detection tasks - it contains Groceries annotations for 8,879 images.

## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
arc-agi-prompts-train-test-split
huggingface.co
Updated Jul 8, 2025
+ more versions
Cite
Pritish Saha (2025). arc-agi-prompts-train-test-split [Dataset]. https://huggingface.co/datasets/Pritish92/arc-agi-prompts-train-test-split
Explore at:
Dataset updated
Jul 8, 2025
Authors
Pritish Saha
Description
Pritish92/arc-agi-prompts-train-test-split dataset hosted on Hugging Face and contributed by the HF Datasets community
Data Cleaning, Translation & Split of the Dataset for the Automatic...
zenodo.org
data.niaid.nih.gov
bin, csv +1
Updated Apr 24, 2025
Cite
Juliane Köhler (2025). Data Cleaning, Translation & Split of the Dataset for the Automatic Classification of Documents for the Classification System for the Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft [Dataset]. http://doi.org/10.5281/zenodo.6957842
Explore at:
Available download formats: text/x-python, csv, bin
Unique identifier
https://doi.org/10.5281/zenodo.6957842
Dataset updated
Apr 24, 2025
Dataset provided by
Zenodo http://zenodo.org/
Authors
Juliane Köhler
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Cleaned_Dataset.csv – The combined CSV files of all scraped documents from DABI, e-LiS, o-bib and Springer.

Data_Cleaning.ipynb – The Jupyter Notebook with Python code for the analysis and cleaning of the original dataset.

ger_train.csv – The German training set as CSV file.

ger_validation.csv – The German validation set as CSV file.

en_test.csv – The English test set as CSV file.

en_train.csv – The English training set as CSV file.

en_validation.csv – The English validation set as CSV file.

splitting.py – The Python code for splitting a dataset into train, test and validation sets.

DataSetTrans_de.csv – The final German dataset as a CSV file.

DataSetTrans_en.csv – The final English dataset as a CSV file.

translation.py – The Python code for translating the cleaned dataset.
Plant Village Dataset
kaggle.com
figshare.com
Updated Sep 8, 2023
Cite
Adil Mubashir Chaudhry (2023). Plant Village Dataset [Dataset]. https://www.kaggle.com/datasets/adilmubashirchaudhry/plant-village-dataset/data
Explore at:
Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 8, 2023
Dataset provided by
Kaggle http://kaggle.com/
Authors
Adil Mubashir Chaudhry
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This dataset contains images used to create a classification model for plant diseases. In particular, the dataset contains the following plants:
1. Bell Peppers
2. Potatoes
3. Tomatoes

This dataset is a modified version of the Plant Village Dataset. It is a subset of the total data with a reduced number of plants. The modified dataset contains separate directories for:
1. Train: 70% of the data
2. Test: 20% of the data
3. Validation: 10% of the data
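A 70/20/10 partition like the one described above can be sketched with the standard library alone. This is only an illustration under assumptions: the file names, the shuffling, and the fixed seed are hypothetical, and the dataset page does not say how the author actually produced the directories.

```python
import random


def split_70_20_10(items, seed=0):
    """Shuffle a copy of `items` and cut it into train/test/validation parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = len(shuffled) * 70 // 100
    n_test = len(shuffled) * 20 // 100
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # the remainder, roughly 10%
    return train, test, validation


# Hypothetical file names standing in for the image files.
files = [f"img_{i:03d}.jpg" for i in range(100)]
train, test, validation = split_70_20_10(files)
print(len(train), len(test), len(validation))
```

Taking the validation part as the remainder guarantees that every file lands in exactly one directory, even when the total is not divisible by ten.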
Data from: Time-Split Cross-Validation as a Method for Estimating the...
acs.figshare.com
figshare.com
txt
Updated Jun 2, 2023
Cite
Robert P. Sheridan (2023). Time-Split Cross-Validation as a Method for Estimating the Goodness of Prospective Prediction. [Dataset]. http://doi.org/10.1021/ci400084k.s001
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.1021/ci400084k.s001
Dataset updated
Jun 2, 2023
Dataset provided by
ACS Publications
Authors
Robert P. Sheridan
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Cross-validation is a common method to validate a QSAR model. In cross-validation, some compounds are held out as a test set, while the remaining compounds form a training set. A model is built from the training set, and the test set compounds are predicted on that model. The agreement of the predicted and observed activity values of the test set (measured by, say, R²) is an estimate of the self-consistency of the model and is sometimes taken as an indication of the predictivity of the model. This estimate of predictivity can be optimistic or pessimistic compared to true prospective prediction, depending on how compounds in the test set are selected. Here, we show that time-split selection gives an R² that is more like that of true prospective prediction than the R² from random selection (too optimistic) or from our analog of leave-class-out selection (too pessimistic). Time-split selection should be used in addition to random selection as a standard for cross-validation in QSAR model building.
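The contrast between time-split and random selection can be illustrated with a small sketch. It assumes each compound record carries a date (the `date` field, the record layout, and the 25% test fraction are hypothetical) and does not reproduce the paper's exact procedure.

```python
import random


def time_split(records, test_fraction=0.25):
    """Hold out the most recent records as the test set."""
    ordered = sorted(records, key=lambda r: r["date"])
    n_test = max(1, round(len(ordered) * test_fraction))
    return ordered[:-n_test], ordered[-n_test:]


def random_split(records, test_fraction=0.25, seed=0):
    """Hold out a randomly chosen subset as the test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[:-n_test], shuffled[-n_test:]


# Hypothetical compounds, one registered per year.
compounds = [{"id": i, "date": 2000 + i} for i in range(8)]
ts_train, ts_test = time_split(compounds)
rs_train, rs_test = random_split(compounds)
print([c["date"] for c in ts_test])
```

In the time split every test compound is newer than every training compound, which mimics predicting compounds that did not exist when the model was built; the random split gives no such guarantee.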

NBA Player Dataset & Prediction Model Artifacts

test.researchdata.tuwien.ac.at

bin, csv, json, png +2

Updated Apr 28, 2025

Cite

Burak Baltali (2025). NBA Player Dataset & Prediction Model Artifacts [Dataset]. http://doi.org/10.70124/ymgzs-z3s43

Explore at:

Available download formats: json, png, csv, bin, txt, text/markdown

Unique identifier

https://doi.org/10.70124/ymgzs-z3s43

Dataset updated

Apr 28, 2025

Dataset provided by

TU Wien

Authors

Burak Baltali

License

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This dataset contains end-of-season box-score aggregates for NBA players over the 2012–13 through 2023–24 seasons, split into training and test sets for both regular season and playoffs. Each CSV has one row per player per season with columns for points, rebounds, steals, turnovers, 3-pt attempts, FG attempts, plus identifiers.

Brief overview of Files

end-of-season box-score aggregates (2012–13 to 2023–24), split into train/test;
the Jupyter notebook (Analysis.ipynb), in which all of the code can be executed;
the trained model binary (nba_model.pkl), a serialized Random Forest model artifact;
evaluation plots (LAL vs. whole league) for regular-season and playoff predictions, uploaded as PNG outputs;
FAIR4ML metadata (fair4ml_metadata.jsonld);
see README.md and abbreviations.txt for file details.
For further information, see the GitHub site (link below).

File Details

Notebook

Analysis.ipynb: Contains the graphical output for the training and test data.

Training/Test CSV Data

Name	Description	PID
regular_train.csv	Regular-season training set: the 2012-2013 through 2021-2022 seasons	4421e56c-4cd3-4ec1-a566-a89d7ec0bced
regular_test.csv	Regular-season test set: the 2022-2023 season	f9d84d5e-db01-4475-b7d1-80cfe9fe0e61
playoff_train.csv	Playoff training set: the 2012-2013 through 2022-2023 seasons	bcb3cf2b-27df-48cc-8b76-9e49254783d0
playoff_test.csv	Playoff test set: the 2023-2024 season	de37d568-e97f-4cb9-bc05-2e600cc97102
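The season-based assignment in the table can be sketched as a filter on a season label. The `season` column name, its "2012-2013" label format, and the sample rows are assumptions for illustration; the actual column names are documented in the dataset's own files.

```python
def season_labels(first_start_year, last_start_year):
    """Build labels such as '2012-2013' for a range of season start years."""
    return {f"{y}-{y + 1}" for y in range(first_start_year, last_start_year + 1)}


def split_by_season(rows, train_seasons, test_seasons):
    """Assign rows to train or test according to their season label."""
    train = [r for r in rows if r["season"] in train_seasons]
    test = [r for r in rows if r["season"] in test_seasons]
    return train, test


# Regular-season ranges as listed in the table above.
regular_train_seasons = season_labels(2012, 2021)
regular_test_seasons = season_labels(2022, 2022)

# Hypothetical rows standing in for the per-player, per-season CSV records.
rows = [
    {"player": "A", "season": "2012-2013", "points": 900},
    {"player": "A", "season": "2021-2022", "points": 1100},
    {"player": "A", "season": "2022-2023", "points": 1250},
]
regular_train, regular_test = split_by_season(
    rows, regular_train_seasons, regular_test_seasons
)
print(len(regular_train), len(regular_test))
```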

Others

abbrevations.txt: Contains the basic abbreviations of the columns in the CSV data

Additional Notes

The raw CSV files are taken from Kaggle (Source: https://www.kaggle.com/datasets/shivamkumar121215/nba-stats-dataset-for-last-10-years/data)

Some preprocessing has to be done before uploading to dbrepo

Plots have also been uploaded as an output for visual purposes.

A more detailed version can be found on GitHub (Link: https://github.com/bubaltali/nba-prediction-analysis/)

Industrial Machine Tool Element Surface Defect Dataset - Dataset - B2FIND
b2find.eudat.eu
Updated Jun 2, 2024
Cite
(2024). Industrial Machine Tool Element Surface Defect Dataset - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/63d20a5e-3584-5096-a34d-d3f93fcc8857
Explore at:
Dataset updated
Jun 2, 2024
Description
Machine Learning techniques in general, and Deep Learning techniques in particular, need a certain amount of data, which is often not available in large quantities in some technical domains. The manual inspection of machine tool components and the manual end-of-line check of products are labour-intensive tasks in industrial applications that companies often want to automate. To automate the classification processes and to develop reliable and robust Machine Learning based classification and wear prognostics models, there is a need for real-world datasets to train and test models on.

The dataset contains 1104 three-channel images with 394 image annotations for the surface damage type “pitting”. The annotations, made with the annotation tool labelme, are available in JSON format and hence convertible to VOC and COCO format. All images come from two BSD types. The dataset available for download is divided into the folders data (all images as JPEG), label (all annotations), and saved_model (a baseline model). The authors also provide a Python script to divide the data and labels into three different split types: train_test_split, which splits the images into the same train and test split the authors used for the baseline model; wear_dev_split, which creates all 27 wear developments; and type_split, which splits the data into the occurring BSD types.

One of the two BSD types is represented by 69 images in 55 different image sizes. All images of this BSD type come in either a clean or a soiled condition. The other BSD type is shown in 325 images with two image sizes. Since all images of this type were taken continuously over time, the degree of soiling evolves. As mentioned above, the dataset also contains 27 pitting development sequences, each with 69 images.

Instruction dataset split
The authors of this dataset provide 3 different types of dataset split. To get a data split, run the Python script split_dataset.py.

Script inputs:
- split-type (mandatory)
- output directory (mandatory)

Different split-types:
- train_test_split: splits dataset into train and test data (80%/20%)
- wear_dev_split: splits dataset into 27 wear-developments
- type_split: splits dataset into different BSD types

Example:
C:\Users\Desktop>python split_dataset.py --split_type=train_test_split --output_dir=BSD_split_folder
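For readers who want to see what such a command-line interface looks like, the sketch below mirrors the two mandatory inputs and the three split types using Python's argparse. It is not the authors' split_dataset.py: the parser layout is inferred only from the example command above, and the splitting logic itself is left out.

```python
import argparse

# The three split types named in the dataset description.
SPLIT_TYPES = ("train_test_split", "wear_dev_split", "type_split")


def build_parser():
    """Build a parser with the two mandatory inputs from the description."""
    parser = argparse.ArgumentParser(
        description="Illustrative stand-in for a dataset splitting script."
    )
    parser.add_argument("--split_type", required=True, choices=SPLIT_TYPES)
    parser.add_argument("--output_dir", required=True)
    return parser


# Parse the arguments from the example command instead of reading sys.argv.
args = build_parser().parse_args(
    ["--split_type=train_test_split", "--output_dir=BSD_split_folder"]
)
print(args.split_type, args.output_dir)
```

Restricting `--split_type` with `choices` makes the parser reject an unknown split type before any files are touched.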
0808-no_sexism-honly-regress-train-test-split
huggingface.co
Updated Aug 11, 2025
Cite
Nate Rahn (2025). 0808-no_sexism-honly-regress-train-test-split [Dataset]. https://huggingface.co/datasets/nate-rahn/0808-no_sexism-honly-regress-train-test-split
Explore at:
Dataset updated
Aug 11, 2025
Authors
Nate Rahn
Description
nate-rahn/0808-no_sexism-honly-regress-train-test-split dataset hosted on Hugging Face and contributed by the HF Datasets community
Juliet-train-split-test-on-BinRealVul
huggingface.co
Cite
Compote, Juliet-train-split-test-on-BinRealVul [Dataset]. https://huggingface.co/datasets/CCompote/Juliet-train-split-test-on-BinRealVul
Explore at:
Authors
Compote
License
https://choosealicense.com/licenses/gpl-3.0/
Description
Juliet-train-split-test-on-BinRealVul

Dataset Summary

Juliet-train-split-test-on-BinRealVul is a curated subset of the Juliet Test Suite (as organized in the GitHub repository), compiled and lifted to LLVM Intermediate Representation (IR) after a pre-processing phase. This dataset is designed specifically for training binary vulnerability detection models in a setting that ensures a fair comparison with models trained on CompRealVul_LLVM. The split was constructed to match… See the full description on the dataset page: https://huggingface.co/datasets/CCompote/Juliet-train-split-test-on-BinRealVul.
flowers-299_Train&Test
kaggle.com
Updated Jul 7, 2023
Cite
Xocion (2023). flowers-299_Train&Test [Dataset]. https://www.kaggle.com/datasets/xocion/flower299-train-and-test
Explore at:
Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 7, 2023
Dataset provided by
Kaggle http://kaggle.com/
Authors
Xocion
Description
Original dataset: https://www.kaggle.com/datasets/bogdancretu/flower299. I chose an Acacia flower as the display picture of this dataset to highlight a problem in the flowers-299 dataset. If you go to the second folder of Acacia flowers, you will see a number of pictures of different-looking flowers. Despite having different shapes, structures and colors, they are all technically Acacia flowers, but we cannot use this data for training because there are not enough samples of Acacia flowers. Despite all efforts and even with the best model, the probability of a model giving an accurate prediction for Acacia flowers is low.

This dataset needs data augmentation to be used efficiently with ResNet50.
Multimodal Vision-Audio-Language Dataset
zenodo.org
Updated Jul 11, 2024
Cite
Timothy Schaumlöffel; Gemma Roig; Bhavin Choksi (2024). Multimodal Vision-Audio-Language Dataset [Dataset]. http://doi.org/10.5281/zenodo.10060785
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.10060785
Dataset updated
Jul 11, 2024
Dataset provided by
Zenodo http://zenodo.org/
Authors
Timothy Schaumlöffel; Gemma Roig; Bhavin Choksi
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities.
Details can be found in the attached report.
Annotation
The annotation files are provided as Parquet files. They can be read using Python with the pandas and pyarrow libraries.
The split into train, validation and test set follows the split of the original datasets.
Installation
pip install pandas pyarrow
Example
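import pandas as pd
# Load the training annotations from the Parquet file using the pyarrow engine.
df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
# Print the first annotation record.
print(df.iloc[0])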
dataset AudioSet
filename train/---2_BBVHAA.mp3
captions_visual [a man in a black hat and glasses.]
captions_auditory [a man speaks and dishes clank.]
tags [Speech]
Description
The annotation file consists of the following fields:

filename: Name of the corresponding file (video or audio file)
dataset: Source dataset associated with the data point
captions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual content
captions_auditory: A list of captions related to the auditory content of the video
tags: A list of tags, classifying the sound of a file. It can be NaN if no tags are provided
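Because captions_visual and tags can be NaN, code that iterates over these fields benefits from normalising them first. The helper below is a hypothetical sketch, not part of the dataset's tooling, and the sample record is invented for illustration; it treats NaN and None as an empty list.

```python
import math


def as_list(value):
    """Return a plain list for a field that may be a list or a missing marker."""
    if value is None:
        return []
    if isinstance(value, float) and math.isnan(value):
        return []  # NaN marks a missing field in the annotation files
    return list(value)


# Hypothetical record with missing visual captions (e.g. an audio-only file).
record = {
    "filename": "train/example.mp3",
    "dataset": "AudioSet",
    "captions_visual": float("nan"),
    "captions_auditory": ["a man speaks and dishes clank."],
    "tags": ["Speech"],
}
visual = as_list(record["captions_visual"])
auditory = as_list(record["captions_auditory"])
print(len(visual), len(auditory))
```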
Data files
The raw data files for most datasets are not released due to licensing issues. They must be downloaded from the source. However, due to missing files, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de
Web Data Commons Training and Test Sets for Large-Scale Product Matching -...
b2find.eudat.eu
Updated Nov 27, 2020
+ more versions
Cite
(2020). Web Data Commons Training and Test Sets for Large-Scale Product Matching - Version 2.0 Product Matching Task derived from the WDC Product Data Corpus - Version 2.0 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/720b440c-eda0-5182-af9f-f868ed999bd7
Explore at:
Dataset updated
Nov 27, 2020
Description
Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with the corresponding label “match” or “no match”) for four product categories: computers, cameras, watches and shoes. In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000-70,000 pairs). Furthermore, sets of ids are available for each training set for a possible validation split (stratified random draw). The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived using shared product identifiers from the Web as weak supervision. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites. For more information and download links for the corpus itself, please follow the links below.
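A stratified random draw of validation ids, as mentioned above, can be sketched as follows. The 20% validation fraction, the `id` and `label` field names, and the toy pairs are assumptions; the description does not state the size of the provided validation sets.

```python
import random
from collections import defaultdict


def stratified_validation_ids(pairs, fraction=0.2, seed=0):
    """Draw the same fraction of ids from each label group."""
    by_label = defaultdict(list)
    for pair in pairs:
        by_label[pair["label"]].append(pair["id"])
    rng = random.Random(seed)
    chosen = set()
    for label in sorted(by_label):  # sorted for a repeatable draw order
        ids = by_label[label]
        chosen.update(rng.sample(ids, round(len(ids) * fraction)))
    return chosen


# Hypothetical pairs: 25 labelled "match" and 75 labelled "no match".
pairs = [
    {"id": i, "label": "match" if i % 4 == 0 else "no match"} for i in range(100)
]
val_ids = stratified_validation_ids(pairs)
val_matches = sum(1 for p in pairs if p["id"] in val_ids and p["label"] == "match")
print(len(val_ids), val_matches)
```

Drawing per label keeps the match/no-match ratio of the validation set equal to that of the training set, which a plain random draw only achieves on average.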
Training dataset for NABat Machine Learning V1.0
catalog.data.gov
Updated Jul 6, 2024
+ more versions
Cite
U.S. Geological Survey (2024). Training dataset for NABat Machine Learning V1.0 [Dataset]. https://catalog.data.gov/dataset/training-dataset-for-nabat-machine-learning-v1-0
Explore at:
Dataset updated
Jul 6, 2024
Dataset provided by
U.S. Geological Survey
Description
Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm, however the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (or those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). 
Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until the files for each species/grid cell combination were exhausted or the number of recordings reached 1250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
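One reading of the selection rule above is a random draw that is capped per species/grid cell combination. The sketch below follows that reading only as an assumption; the field names, species codes, and the small cap used in the demonstration are hypothetical, and the release does not publish the selection code.

```python
import random
from collections import Counter, defaultdict


def cap_per_group(files, cap=1250, seed=0):
    """Keep at most `cap` randomly chosen files per (species, grid cell) group."""
    groups = defaultdict(list)
    for f in files:
        groups[(f["species"], f["grid_cell"])].append(f)
    rng = random.Random(seed)
    selected = []
    for key in sorted(groups):
        members = groups[key]
        if len(members) > cap:
            members = rng.sample(members, cap)  # group is larger than the cap
        selected.extend(members)  # smaller groups are used up entirely
    return selected


# Hypothetical recordings: (species code, grid cell, number of files).
files = [
    {"name": f"{sp}_{cell}_{i}.wav", "species": sp, "grid_cell": cell}
    for sp, cell, n in [("MYLU", 1, 5), ("MYLU", 2, 2), ("EPFU", 1, 4)]
    for i in range(n)
]
selected = cap_per_group(files, cap=3)
counts = Counter((f["species"], f["grid_cell"]) for f in selected)
print(dict(counts))
```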
Data from: Benchmark Datasets Incorporating Diverse Tasks, Sample Sizes,...
data.niaid.nih.gov
explore.openaire.eu
Updated Jun 6, 2021
Cite
Sparks, D. Taylor (2021). Benchmark Datasets Incorporating Diverse Tasks, Sample Sizes, Material Systems, and Data Heterogeneity for Materials Informatics [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4903957
Explore at:
Dataset updated
Jun 6, 2021
Dataset provided by
Henderson, N. Ashley
Sparks, D. Taylor
Kauwe, K. Steven
Description
This benchmark data is comprised of 50 different datasets for materials properties obtained from 16 previous publications. The data contains both experimental and computational data, data suited for regression as well as classification, sizes ranging from 12 to 6354 samples, and materials systems spanning the diversity of materials research. In addition to cleaning the data where necessary, each dataset was split into train, validation, and test splits.

For datasets with more than 100 values, train-val-test splits were created with either a 5-fold or a 10-fold cross-validation method, depending on what each respective paper did in its study. Datasets with fewer than 100 values had train-test splits created using the Leave-One-Out cross-validation method.
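The two schemes named above differ only in the number of folds: Leave-One-Out is k-fold cross-validation with k equal to the number of samples. The sketch below generates train/test index pairs without shuffling and without a separate validation part, so it illustrates the idea rather than the repository's actual split code.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop


def leave_one_out_indices(n_samples):
    """Leave-One-Out is k-fold with one sample per fold."""
    return k_fold_indices(n_samples, n_samples)


folds = list(k_fold_indices(10, 5))
loo = list(leave_one_out_indices(4))
print(len(folds), len(loo))
```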

For further information, as well as directions on how to access the data, please go to the corresponding GitHub repository: https://github.com/anhender/mse_ML_datasets/tree/v1.0
Split Garbage Dataset
kaggle.com
Updated May 18, 2019
Cite
Andrea Santoro (2019). Split Garbage Dataset [Dataset]. https://www.kaggle.com/andreasantoro/split-garbage-dataset/kernels
Explore at:
Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 18, 2019
Dataset provided by
Kaggle http://kaggle.com/
Authors
Andrea Santoro
Description
Split version of the garbage classification dataset (link below). The train, test and valid folders were generated as specified by the one-indexed files of the original dataset.

Acknowledgements

Original dataset here: https://www.kaggle.com/asdasdasasdas/garbage-classification
Yeast genotype/phenotype data partitioned into train/validation/test splits...
zenodo.org
zip
Updated Apr 30, 2025
Cite
Zenodo (2025). Yeast genotype/phenotype data partitioned into train/validation/test splits from Ba Nguyen et al, 2022 [Dataset]. http://doi.org/10.5281/zenodo.15313069
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.15313069
Dataset updated
Apr 30, 2025
Dataset provided by
Zenodo http://zenodo.org/
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The raw data comes from Ba Nguyen et al, 2022, who hosted their data here. This dataset was used in an independent study in Rijal et al, 2025, who preprocessed the data using these notebook scripts. They did not release their processed data, so we reproduced their processing pipeline and have uploaded the data ourselves as part of this data resource.

This release accompanies this publication: https://doi.org/10.57844/arcadia-bmb9-fzxd
ORBIT: A real-world few-shot dataset for teachable object recognition...
city.figshare.com
bin
Updated May 31, 2023
Cite
Daniela Massiceti; Lida Theodorou; Luisa Zintgraf; Matthew Tobias Harris; Simone Stumpf; Cecily Morrison; Edward Cutrell; Katja Hofmann (2023). ORBIT: A real-world few-shot dataset for teachable object recognition collected from people who are blind or low vision [Dataset]. http://doi.org/10.25383/city.14294597.v3
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.25383/city.14294597.v3
Dataset updated
May 31, 2023
Dataset provided by
City, University of London
Authors
Daniela Massiceti; Lida Theodorou; Luisa Zintgraf; Matthew Tobias Harris; Simone Stumpf; Cecily Morrison; Edward Cutrell; Katja Hofmann
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Object recognition predominately still relies on many high-quality training examples per object category. In contrast, learning new objects from only a few examples could enable many impactful applications from robotics to user personalization. Most few-shot learning research, however, has been driven by benchmark datasets that lack the high variation that these applications will face when deployed in the real-world. To close this gap, we present the ORBIT dataset, grounded in a real-world application of teachable object recognizers for people who are blind/low vision. We provide a full, unfiltered dataset of 4,733 videos of 588 objects recorded by 97 people who are blind/low-vision on their mobile phones, and a benchmark dataset of 3,822 videos of 486 objects collected by 77 collectors. The code for loading the dataset, computing all benchmark metrics, and running the baseline models is available at https://github.com/microsoft/ORBIT-Dataset

This version comprises several zip files:
- train, validation, test: benchmark dataset, organised by collector, with raw videos split into static individual frames in jpg format at 30FPS
- other: data not in the benchmark set, organised by collector, with raw videos split into static individual frames in jpg format at 30FPS (please note that the train, validation, test, and other files make up the unfiltered dataset)
- *_224: as for the benchmark, but static individual frames are scaled down to 224 pixels.
- *_unfiltered_videos: full unfiltered dataset, organised by collector, in mp4 format.
ATOM3D: Ligand Binding Affinity (LBA) Dataset
zenodo.org
data.niaid.nih.gov
application/gzip, bin
Updated Jun 16, 2021
Cite
Raphael J.L. Townshend; Martin Vögele; Patricia Suriana; Alexander Derry; Alexander Powers; Yianni Laloudakis; Sidhika Balachandar; Brandon Anderson; Stephan Eismann; Risi Kondor; Russ B. Altman; Ron O. Dror (2021). ATOM3D: Ligand Binding Affinity (LBA) Dataset [Dataset]. http://doi.org/10.5281/zenodo.4914718
Explore at:
Available download formats: application/gzip, bin
Unique identifier
https://doi.org/10.5281/zenodo.4914718
Dataset updated
Jun 16, 2021
Dataset provided by
Zenodo http://zenodo.org/
Authors
Raphael J.L. Townshend; Martin Vögele; Patricia Suriana; Alexander Derry; Alexander Powers; Yianni Laloudakis; Sidhika Balachandar; Brandon Anderson; Stephan Eismann; Risi Kondor; Russ B. Altman; Ron O. Dror
Description
Ligand Binding Affinity (LBA) dataset from the ATOM3D project. This upload includes five zipped data directories:

Full, unsplit dataset in LMDB format

Split datasets at 60% sequence identity, with each in LMDB format

Split datasets at 30% sequence identity, with each in LMDB format

Text files containing train, validation, and test indices used to split raw dataset (for both 30% and 60%)

README containing dataset details
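Index files like the ones listed above are typically applied by reading the indices and selecting the matching entries from the unsplit data. The sketch below assumes one integer index per line and uses in-memory strings in place of real files; the actual file layout is described in the upload's README and may differ.

```python
def read_indices(text):
    """Parse one integer index per non-empty line (assumed file layout)."""
    return [int(line) for line in text.splitlines() if line.strip()]


def select(dataset, indices):
    """Pick the entries of `dataset` at the given positions."""
    return [dataset[i] for i in indices]


# Hypothetical stand-ins for the raw dataset and the three index files.
dataset = [f"complex_{i}" for i in range(6)]
train_txt = "0\n2\n5\n"
val_txt = "1\n"
test_txt = "3\n4\n"

train = select(dataset, read_indices(train_txt))
val = select(dataset, read_indices(val_txt))
test = select(dataset, read_indices(test_txt))
print(train, val, test)
```

Shipping indices instead of copies of the data lets every user reproduce the same split from a single unsplit dataset.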
IAM_dataset_modified
kaggle.com
Updated Apr 10, 2020
Cite
Ashish Goswami (2020). IAM_dataset_modified [Dataset]. https://www.kaggle.com/ashish2001/iam-dataset-modified/code
Explore at:
Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 10, 2020
Dataset provided by
Kaggle http://kaggle.com/
Authors
Ashish Goswami
Description
Dataset

This dataset was created by Ashish Goswami

Contents
