100+ datasets found
  1. alpaca-train-validation-test-split

    • huggingface.co
    Updated Aug 12, 2023
    Cite
    Doula Isham Rashik Hasan (2023). alpaca-train-validation-test-split [Dataset]. https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 12, 2023
    Authors
    Doula Isham Rashik Hasan
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Dataset Card for Alpaca

    I have performed a train, test and validation split on the original dataset. A repository to reproduce this will be shared here soon. I am including the original dataset card as follows.

      Dataset Summary
    

    Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction-tuning for language models and make the language model follow instructions better.… See the full description on the dataset page: https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split.
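
    For quick use, a minimal sketch of loading this split with the Hugging Face datasets library (the split names are assumed to be train/validation/test):

    from datasets import load_dataset

    # Load all three splits as a DatasetDict; the exact split names
    # ("train", "validation", "test") are assumptions.
    ds = load_dataset("disham993/alpaca-train-validation-test-split")
    print(ds)              # one Dataset per split
    print(ds["train"][0])  # one instruction/input/output record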

  2. Train Test Split For Freiburg In Yolov7 Format Dataset

    • universe.roboflow.com
    zip
    Updated Aug 4, 2023
    Cite
    Isaac H (2023). Train Test Split For Freiburg In Yolov7 Format Dataset [Dataset]. https://universe.roboflow.com/isaac-h/train-test-split-for-freiburg-dataset-in-yolov7-format
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 4, 2023
    Dataset authored and provided by
    Isaac H
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Freiburg im Breisgau
    Variables measured
    Groceries Bounding Boxes
    Description

    Train Test Split For Freiburg Dataset In YOLOv7 Format

    ## Overview
    
    Train Test Split For Freiburg Dataset In YOLOv7 Format is a dataset for object detection tasks - it contains Groceries annotations for 8,879 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
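
    As a hedged sketch, the dataset could be pulled with the roboflow Python package; the workspace/project slugs and version number below are guesses based on the dataset URL:

    from roboflow import Roboflow

    # Download the dataset in YOLOv7 format; replace the API key with your own.
    rf = Roboflow(api_key="YOUR_API_KEY")
    project = rf.workspace("isaac-h").project(
        "train-test-split-for-freiburg-dataset-in-yolov7-format"
    )
    dataset = project.version(1).download("yolov7")  # version number is a guess
    print(dataset.location)  # local folder with images and YOLOv7 label files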
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  3. arc-agi-prompts-train-test-split

    • huggingface.co
    Updated Jul 8, 2025
    + more versions
    Cite
    Pritish Saha (2025). arc-agi-prompts-train-test-split [Dataset]. https://huggingface.co/datasets/Pritish92/arc-agi-prompts-train-test-split
    Explore at:
    Dataset updated
    Jul 8, 2025
    Authors
    Pritish Saha
    Description

    Pritish92/arc-agi-prompts-train-test-split dataset hosted on Hugging Face and contributed by the HF Datasets community

  4. Data Cleaning, Translation & Split of the Dataset for the Automatic...

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv +1
    Updated Apr 24, 2025
    Cite
    Juliane Köhler; Juliane Köhler (2025). Data Cleaning, Translation & Split of the Dataset for the Automatic Classification of Documents for the Classification System for the Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft [Dataset]. http://doi.org/10.5281/zenodo.6957842
    Explore at:
    Available download formats: text/x-python, csv, bin
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Juliane Köhler; Juliane Köhler
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    • Cleaned_Dataset.csv – The combined CSV files of all scraped documents from DABI, e-LiS, o-bib and Springer.
    • Data_Cleaning.ipynb – The Jupyter notebook with Python code for the analysis and cleaning of the original dataset.
    • ger_train.csv – The German training set as a CSV file.
    • ger_validation.csv – The German validation set as a CSV file.
    • en_test.csv – The English test set as a CSV file.
    • en_train.csv – The English training set as a CSV file.
    • en_validation.csv – The English validation set as a CSV file.
    • splitting.py – The Python code for splitting a dataset into train, test and validation sets (see the sketch below).
    • DataSetTrans_de.csv – The final German dataset as a CSV file.
    • DataSetTrans_en.csv – The final English dataset as a CSV file.
    • translation.py – The Python code for translating the cleaned dataset.
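
    A minimal sketch of what a script like splitting.py plausibly does with scikit-learn; the 80/10/10 ratios are an assumption, not stated in the listing:

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Two-stage split: hold out 20%, then halve it into validation and test.
    df = pd.read_csv("Cleaned_Dataset.csv")
    train, rest = train_test_split(df, test_size=0.2, random_state=42)
    validation, test = train_test_split(rest, test_size=0.5, random_state=42)

    train.to_csv("train.csv", index=False)
    validation.to_csv("validation.csv", index=False)
    test.to_csv("test.csv", index=False)
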
  5. Plant Village Dataset

    • kaggle.com
    • figshare.com
    Updated Sep 8, 2023
    Cite
    Adil Mubashir Chaudhry (2023). Plant Village Dataset [Dataset]. https://www.kaggle.com/datasets/adilmubashirchaudhry/plant-village-dataset/data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 8, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Adil Mubashir Chaudhry
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This dataset contains images used to create a classification model for plant diseases. In particular, the dataset contains the following plants:
    1. Bell Peppers
    2. Potatoes
    3. Tomatoes

    This dataset is a **modified version** of the Plant Village Dataset and is a subset of the total data with a reduced number of plants. The dataset is modified to contain separate directories for:
    1. Train: 70% of the data
    2. Test: 20% of the data
    3. Validation: 10% of the data
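
    A minimal sketch of how such a 70/20/10 directory split can be produced; the source layout ("PlantVillage/<class>/<image>.jpg") is an assumption:

    import random
    import shutil
    from pathlib import Path

    random.seed(0)
    src = Path("PlantVillage")  # hypothetical source folder, one subfolder per class
    for class_dir in [d for d in src.iterdir() if d.is_dir()]:
        images = sorted(class_dir.glob("*.jpg"))
        random.shuffle(images)
        n = len(images)
        splits = {
            "train": images[: int(0.7 * n)],
            "test": images[int(0.7 * n) : int(0.9 * n)],
            "validation": images[int(0.9 * n) :],
        }
        for split_name, split_files in splits.items():
            dest = Path(split_name) / class_dir.name
            dest.mkdir(parents=True, exist_ok=True)
            for img in split_files:
                shutil.copy(img, dest / img.name)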

  6. Data from: Time-Split Cross-Validation as a Method for Estimating the...

    • acs.figshare.com
    • figshare.com
    txt
    Updated Jun 2, 2023
    Cite
    Robert P. Sheridan (2023). Time-Split Cross-Validation as a Method for Estimating the Goodness of Prospective Prediction. [Dataset]. http://doi.org/10.1021/ci400084k.s001
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    ACS Publications
    Authors
    Robert P. Sheridan
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Cross-validation is a common method to validate a QSAR model. In cross-validation, some compounds are held out as a test set, while the remaining compounds form a training set. A model is built from the training set, and the test set compounds are predicted on that model. The agreement of the predicted and observed activity values of the test set (measured by, say, R2) is an estimate of the self-consistency of the model and is sometimes taken as an indication of the predictivity of the model. This estimate of predictivity can be optimistic or pessimistic compared to true prospective prediction, depending on how compounds in the test set are selected. Here, we show that time-split selection gives an R2 that is more like that of true prospective prediction than the R2 from random selection (too optimistic) or from our analog of leave-class-out selection (too pessimistic). Time-split selection should be used in addition to random selection as a standard for cross-validation in QSAR model building.
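
    A minimal sketch of the contrast the abstract draws, assuming a pandas DataFrame of assay records with a registration-date column (the file and column names are assumptions):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    compounds = pd.read_csv("qsar_compounds.csv", parse_dates=["date"])

    # Random split: test compounds are interleaved in time with the training
    # compounds, which the paper finds gives an optimistic R^2.
    rand_train, rand_test = train_test_split(compounds, test_size=0.25, random_state=0)

    # Time split: train on everything registered before a cutoff and test on
    # the rest, mimicking true prospective prediction.
    compounds = compounds.sort_values("date")
    cutoff = int(0.75 * len(compounds))
    time_train, time_test = compounds.iloc[:cutoff], compounds.iloc[cutoff:]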

  7. NBA Player Dataset & Prediction Model Artifacts

    • test.researchdata.tuwien.ac.at
    bin, csv, json, png +2
    Updated Apr 28, 2025
    Cite
    Burak Baltali; Burak Baltali (2025). NBA Player Dataset & Prediction Model Artifacts [Dataset]. http://doi.org/10.70124/ymgzs-z3s43
    Explore at:
    Available download formats: json, png, csv, bin, txt, text/markdown
    Dataset updated
    Apr 28, 2025
    Dataset provided by
    TU Wien
    Authors
    Burak Baltali; Burak Baltali
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains end-of-season box-score aggregates for NBA players over the 2012–13 through 2023–24 seasons, split into training and test sets for both regular season and playoffs. Each CSV has one row per player per season with columns for points, rebounds, steals, turnovers, 3-pt attempts, FG attempts, plus identifiers.

    Brief overview of files

    1. End-of-season box-score aggregates (2012–13 through 2023–24), split into train/test.

    2. The Jupyter notebook (Analysis.ipynb); all of the code can be executed there.

    3. The trained model binary (nba_model.pkl), a serialized Random Forest model artifact.

    4. Evaluation plots (LAL vs. the whole league) for regular-season and playoff predictions, provided as PNG outputs.

    5. FAIR4ML metadata (fair4ml_metadata.jsonld); see README.md and abbreviations.txt for file details.

    6. For further information, see the GitHub repository (link below).

    File Details

    Notebook

    Analysis.ipynb: Contains the graphical output of the training and testing runs.

    Train/Test CSV data

    Name | Description | PID
    regular_train.csv | Regular-season training data: seasons 2012–13 through 2021–22 | 4421e56c-4cd3-4ec1-a566-a89d7ec0bced
    regular_test.csv | Regular-season test data: the 2022–23 season | f9d84d5e-db01-4475-b7d1-80cfe9fe0e61
    playoff_train.csv | Playoff training data: seasons 2012–13 through 2022–23 | bcb3cf2b-27df-48cc-8b76-9e49254783d0
    playoff_test.csv | Playoff test data: the 2023–24 season | de37d568-e97f-4cb9-bc05-2e600cc97102

    Others

    abbreviations.txt: Contains the fundamental abbreviations for the columns in the CSV data.

    Additional Notes

    Raw CSV files are taken from Kaggle (source: https://www.kaggle.com/datasets/shivamkumar121215/nba-stats-dataset-for-last-10-years/data).

    Some preprocessing had to be done before uploading into DBRepo.

    Plots have also been uploaded as outputs for visual inspection.

    A more detailed version can be found on GitHub (link: https://github.com/bubaltali/nba-prediction-analysis/).
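
    As a rough illustration, the released artifacts could be reused like this; the feature column names are assumptions based on the description above:

    import pickle
    import pandas as pd

    # Load the released Random Forest artifact.
    with open("nba_model.pkl", "rb") as f:
        model = pickle.load(f)

    # Predict on the regular-season test split; the column names are assumed.
    test = pd.read_csv("regular_test.csv")
    features = ["points", "rebounds", "steals", "turnovers", "3pa", "fga"]
    predictions = model.predict(test[features])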

  8. Industrial Machine Tool Element Surface Defect Dataset - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Jun 2, 2024
    Cite
    (2024). Industrial Machine Tool Element Surface Defect Dataset - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/63d20a5e-3584-5096-a34d-d3f93fcc8857
    Explore at:
    Dataset updated
    Jun 2, 2024
    Description

    Using Machine Learning techniques in general, and Deep Learning techniques in particular, requires a certain amount of data that is often not available in large quantities in some technical domains. The manual inspection of machine tool components, as well as the manual end-of-line check of products, are labour-intensive tasks in industrial applications that companies often want to automate. To automate the classification processes and to develop reliable and robust Machine Learning based classification and wear prognostics models, there is a need for real-world datasets to train and test the models on.

    The dataset contains 1104 channel-3 images with 394 image annotations for the surface damage type "pitting". The annotations, made with the annotation tool labelme, are available in JSON format and hence convertible to VOC and COCO format. All images come from two BSD types. The dataset available for download is divided into three folders: data, with all images as JPEG; label, with all annotations; and saved_model, with a baseline model. The authors also provide a Python script to divide the data and labels into three different split types – train_test_split, which splits the images into the same train and test data split the authors used for the baseline model; wear_dev_split, which creates all 27 wear developments; and type_split, which splits the data into the occurring BSD types.

    One of the two BSD types is represented with 69 images and 55 different image sizes; all images of this BSD type come in either a clean or a soiled condition. The other BSD type is shown on 325 images with two image sizes; since all images of this type were taken continuously over time, the degree of soiling evolves. The dataset also contains, as mentioned above, 27 pitting development sequences of 69 images each.

    Instructions for the dataset split: the authors provide three different dataset split types. To get a data split, run the Python script split_dataset.py.

    Script inputs:
    • split-type (mandatory)
    • output directory (mandatory)

    Split types:
    • train_test_split: splits the dataset into train and test data (80%/20%)
    • wear_dev_split: splits the dataset into 27 wear developments
    • type_split: splits the dataset into the different BSD types

    Example: C:\Users\Desktop>python split_dataset.py --split_type=train_test_split --output_dir=BSD_split_folder

  9. 0808-no_sexism-honly-regress-train-test-split

    • huggingface.co
    Updated Aug 11, 2025
    Cite
    Nate Rahn (2025). 0808-no_sexism-honly-regress-train-test-split [Dataset]. https://huggingface.co/datasets/nate-rahn/0808-no_sexism-honly-regress-train-test-split
    Explore at:
    Dataset updated
    Aug 11, 2025
    Authors
    Nate Rahn
    Description

    nate-rahn/0808-no_sexism-honly-regress-train-test-split dataset hosted on Hugging Face and contributed by the HF Datasets community

  10. Juliet-train-split-test-on-BinRealVul

    • huggingface.co
    Cite
    Compote, Juliet-train-split-test-on-BinRealVul [Dataset]. https://huggingface.co/datasets/CCompote/Juliet-train-split-test-on-BinRealVul
    Explore at:
    Authors
    Compote
    License

    GPL-3.0: https://choosealicense.com/licenses/gpl-3.0/

    Description

    Juliet-train-split-test-on-BinRealVul

      Dataset Summary
    

    Juliet-train-split-test-on-BinRealVul is a curated subset of the Juliet Test Suite (as organized in the GitHub repository), compiled and lifted to LLVM Intermediate Representation (IR) after a pre-processing phase. This dataset is designed specifically for training binary vulnerability detection models in a setting that ensures a fair comparison with models trained on CompRealVul_LLVM. The split was constructed to match… See the full description on the dataset page: https://huggingface.co/datasets/CCompote/Juliet-train-split-test-on-BinRealVul.

  11. flowers-299_Train&Test

    • kaggle.com
    Updated Jul 7, 2023
    Cite
    Xocion (2023). flowers-299_Train&Test [Dataset]. https://www.kaggle.com/datasets/xocion/flower299-train-and-test
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 7, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Xocion
    Description

    Original dataset: https://www.kaggle.com/datasets/bogdancretu/flower299. I chose an Acacia flower as the display picture of this dataset to highlight a problem in the Flowers-299 dataset: if you go to the second folder, the Acacia flowers, you will see a bunch of pictures of different-looking flowers. Despite having different shapes, structures and colors, they are all technically Acacia flowers, but we can't use this data for training because we don't have enough samples of Acacia flowers; despite all efforts and the best model, the probability of a model giving accurate predictions for Acacia flowers is low.

    This set of data needs data augmentation to be used efficiently with ResNet50.
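
    As a rough illustration, a torchvision augmentation pipeline for fine-tuning ResNet50 might look like this; the directory layout ("flowers-299/train/<class>/") is an assumption:

    import torch
    from torchvision import datasets, models, transforms

    train_transforms = transforms.Compose([
        transforms.RandomResizedCrop(224),   # ResNet50 expects 224x224 inputs
        transforms.RandomHorizontalFlip(),
        transforms.RandomRotation(15),
        transforms.ColorJitter(brightness=0.2, contrast=0.2),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                             std=[0.229, 0.224, 0.225]),
    ])

    train_set = datasets.ImageFolder("flowers-299/train", transform=train_transforms)
    train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)
    model = models.resnet50(weights="IMAGENET1K_V2")   # start from ImageNet weights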

  12. Multimodal Vision-Audio-Language Dataset

    • zenodo.org
    Updated Jul 11, 2024
    Cite
    Timothy Schaumlöffel; Timothy Schaumlöffel; Gemma Roig; Gemma Roig; Bhavin Choksi; Bhavin Choksi (2024). Multimodal Vision-Audio-Language Dataset [Dataset]. http://doi.org/10.5281/zenodo.10060785
    Explore at:
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Timothy Schaumlöffel; Timothy Schaumlöffel; Gemma Roig; Gemma Roig; Bhavin Choksi; Bhavin Choksi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities.

    Details can be found in the attached report.

    Annotation

    The annotation files are provided as Parquet files. They can be read using Python with the pandas and pyarrow libraries.

    The split into train, validation and test set follows the split of the original datasets.

    Installation

    pip install pandas pyarrow

    Example

    import pandas as pd
    df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
    print(df.iloc[0])

    dataset              AudioSet
    filename             train/---2_BBVHAA.mp3
    captions_visual      [a man in a black hat and glasses.]
    captions_auditory    [a man speaks and dishes clank.]
    tags                 [Speech]

    Description

    The annotation file consists of the following fields:

    filename: Name of the corresponding file (video or audio file)
    dataset: Source dataset associated with the data point
    captions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual content
    captions_auditory: A list of captions related to the auditory content of the video
    tags: A list of tags, classifying the sound of a file. It can be NaN if no tags are provided

    Data files

    The raw data files for most datasets are not released due to licensing issues. They must be downloaded from the source. However, in case of missing files, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de

  13. Web Data Commons Training and Test Sets for Large-Scale Product Matching -...

    • b2find.eudat.eu
    Updated Nov 27, 2020
    + more versions
    Cite
    (2020). Web Data Commons Training and Test Sets for Large-Scale Product Matching - Version 2.0 Product Matching Task derived from the WDC Product Data Corpus - Version 2.0 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/720b440c-eda0-5182-af9f-f868ed999bd7
    Explore at:
    Dataset updated
    Nov 27, 2020
    Description

    Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label "match" or "no match") for four product categories: computers, cameras, watches and shoes. In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000-70,000 pairs). Furthermore, there are sets of IDs available for each training set for a possible validation split (stratified random draw). The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived using shared product identifiers from the Web as weak supervision. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites. For more information and download links for the corpus itself, please follow the links below.
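
    As a hedged sketch, the stratified validation draw mentioned above could be reproduced with scikit-learn; the file name and the "label" column name are assumptions:

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Hypothetical training-set file; WDC pairs are distributed per category and size.
    pairs = pd.read_json("computers_train_medium.json.gz", lines=True)

    # Stratified random draw preserving the match/no-match ratio in both parts.
    train_pairs, val_pairs = train_test_split(
        pairs, test_size=0.2, stratify=pairs["label"], random_state=42
    )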

  14. Training dataset for NABat Machine Learning V1.0

    • catalog.data.gov
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Training dataset for NABat Machine Learning V1.0 [Dataset]. https://catalog.data.gov/dataset/training-dataset-for-nabat-machine-learning-v1-0
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    U.S. Geological Survey
    Description

    Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm.

    This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of "A Plan for the North American Bat Monitoring Program" (https://doi.org/10.2737/SRS-GTR-208).

    These data were then post-processed by bat monitoring partners to remove noise files (or those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in "A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program" (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format.

    From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N=3; Eumops floridanus, N=3; Lasiurus xanthinus, N=4; Nyctinomops femorosaccus, N=11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset).

    This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
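
    As a hedged illustration, the selection rule described above might look like this in pandas; the file name, column names, and the 80/10/10 split proportions are assumptions:

    import pandas as pd

    # Hypothetical index of candidate recordings, one row per audio file.
    files = pd.read_csv("nabat_files.csv")

    # Shuffle, then keep at most 1250 files per species/grid-cell combination.
    capped = (
        files.sample(frac=1.0, random_state=0)
             .groupby(["species", "grid_cell"], group_keys=False)
             .head(1250)
    )

    # Random split into train, validation, and test (holdout) sets.
    shuffled = capped.sample(frac=1.0, random_state=1).reset_index(drop=True)
    n = len(shuffled)
    train = shuffled.iloc[: int(0.8 * n)]
    validation = shuffled.iloc[int(0.8 * n) : int(0.9 * n)]
    test = shuffled.iloc[int(0.9 * n) :]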

  15. Data from: Benchmark Datasets Incorporating Diverse Tasks, Sample Sizes,...

    • data.niaid.nih.gov
    • explore.openaire.eu
    Updated Jun 6, 2021
    Cite
    Sparks, D. Taylor (2021). Benchmark Datasets Incorporating Diverse Tasks, Sample Sizes, Material Systems, and Data Heterogeneity for Materials Informatics [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4903957
    Explore at:
    Dataset updated
    Jun 6, 2021
    Dataset provided by
    Henderson, N. Ashley
    Sparks, D. Taylor
    Kauwe, K. Steven
    Description

    This benchmark data comprises 50 different datasets for materials properties obtained from 16 previous publications. The data contains both experimental and computational data, data suited for regression as well as classification, sizes ranging from 12 to 6354 samples, and materials systems spanning the diversity of materials research. In addition to cleaning the data where necessary, each dataset was split into train, validation, and test splits.

    For datasets with more than 100 values, train-val-test splits were created with either a 5-fold or a 10-fold cross-validation method, depending on what each respective paper did in its study. Datasets with fewer than 100 values had train-test splits created using the Leave-One-Out cross-validation method.
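
    A minimal sketch of these two split strategies with scikit-learn (the arrays below are placeholders):

    import numpy as np
    from sklearn.model_selection import KFold, LeaveOneOut

    X, y = np.random.rand(120, 5), np.random.rand(120)  # placeholder data

    # 5- or 10-fold CV for datasets with more than 100 values,
    # leave-one-out for smaller ones.
    if len(X) > 100:
        splitter = KFold(n_splits=5, shuffle=True, random_state=0)  # or n_splits=10
    else:
        splitter = LeaveOneOut()

    for train_idx, test_idx in splitter.split(X):
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]
        # fit and evaluate a model on each fold here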

    For further information, as well as directions on how to access the data, please go to the corresponding GitHub repository: https://github.com/anhender/mse_ML_datasets/tree/v1.0

  16. Split Garbage Dataset

    • kaggle.com
    Updated May 18, 2019
    Cite
    Andrea Santoro (2019). Split Garbage Dataset [Dataset]. https://www.kaggle.com/andreasantoro/split-garbage-dataset/kernels
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 18, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Andrea Santoro
    Description

    Split version of the garbage classification dataset (link below). The train, test and valid folders have been generated as specified by the one-indexed split files of the original dataset; a rough reconstruction is sketched below.
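
    A hedged sketch of rebuilding such a split from one-indexed list files; the list file names, line format, and source layout are assumptions:

    import shutil
    from pathlib import Path

    src = Path("garbage-classification")  # hypothetical source folder, one subfolder per class
    for split in ["train", "test", "valid"]:
        for line in Path(f"one-indexed-files-{split}.txt").read_text().splitlines():
            if not line.strip():
                continue
            name = line.split()[0]                      # e.g. "cardboard101.jpg"
            cls = Path(name).stem.rstrip("0123456789")  # class name from the filename
            dest = Path(split) / cls
            dest.mkdir(parents=True, exist_ok=True)
            shutil.copy(src / cls / name, dest / name)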

    Acknowledgements

    Original dataset here: https://www.kaggle.com/asdasdasasdas/garbage-classification

  17. Yeast genotype/phenotype data partitioned into train/validation/test splits...

    • zenodo.org
    zip
    Updated Apr 30, 2025
    Cite
    Zenodo (2025). Yeast genotype/phenotype data partitioned into train/validation/test splits from Ba Nguyen et al, 2022 [Dataset]. http://doi.org/10.5281/zenodo.15313069
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 30, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The raw data comes from Ba Nguyen et al, 2022, who hosted their data here. This dataset was used in an independent study in Rijal et al, 2025, who preprocessed the data using these notebook scripts. They did not release their processed data, so we reproduced their processing pipeline and have uploaded the data ourselves as part of this data resource.

    This release accompanies this publication: https://doi.org/10.57844/arcadia-bmb9-fzxd

  18. ORBIT: A real-world few-shot dataset for teachable object recognition...

    • city.figshare.com
    bin
    Updated May 31, 2023
    Cite
    Daniela Massiceti; Lida Theodorou; Luisa Zintgraf; Matthew Tobias Harris; Simone Stumpf; Cecily Morrison; Edward Cutrell; Katja Hofmann (2023). ORBIT: A real-world few-shot dataset for teachable object recognition collected from people who are blind or low vision [Dataset]. http://doi.org/10.25383/city.14294597.v3
    Explore at:
    Available download formats: bin
    Dataset updated
    May 31, 2023
    Dataset provided by
    City, University of London
    Authors
    Daniela Massiceti; Lida Theodorou; Luisa Zintgraf; Matthew Tobias Harris; Simone Stumpf; Cecily Morrison; Edward Cutrell; Katja Hofmann
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Object recognition predominantly still relies on many high-quality training examples per object category. In contrast, learning new objects from only a few examples could enable many impactful applications from robotics to user personalization. Most few-shot learning research, however, has been driven by benchmark datasets that lack the high variation that these applications will face when deployed in the real world. To close this gap, we present the ORBIT dataset, grounded in a real-world application of teachable object recognizers for people who are blind/low vision. We provide a full, unfiltered dataset of 4,733 videos of 588 objects recorded by 97 people who are blind/low-vision on their mobile phones, and a benchmark dataset of 3,822 videos of 486 objects collected by 77 collectors. The code for loading the dataset, computing all benchmark metrics, and running the baseline models is available at https://github.com/microsoft/ORBIT-Dataset

    This version comprises several zip files:
    • train, validation, test: benchmark dataset, organised by collector, with raw videos split into static individual frames in jpg format at 30FPS
    • other: data not in the benchmark set, organised by collector, with raw videos split into static individual frames in jpg format at 30FPS (please note that the train, validation, test, and other files make up the unfiltered dataset)
    • *_224: as for the benchmark, but with static individual frames scaled down to 224 pixels
    • *_unfiltered_videos: full unfiltered dataset, organised by collector, in mp4 format

  19. ATOM3D: Ligand Binding Affinity (LBA) Dataset

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, bin
    Updated Jun 16, 2021
    Cite
    Raphael J.L. Townshend; Raphael J.L. Townshend; Martin Vögele; Martin Vögele; Patricia Suriana; Patricia Suriana; Alexander Derry; Alexander Derry; Alexander Powers; Yianni Laloudakis; Sidhika Balachandar; Brandon Anderson; Stephan Eismann; Risi Kondor; Russ B. Altman; Ron O. Dror; Alexander Powers; Yianni Laloudakis; Sidhika Balachandar; Brandon Anderson; Stephan Eismann; Risi Kondor; Russ B. Altman; Ron O. Dror (2021). ATOM3D: Ligand Binding Affinity (LBA) Dataset [Dataset]. http://doi.org/10.5281/zenodo.4914718
    Explore at:
    Available download formats: application/gzip, bin
    Dataset updated
    Jun 16, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Raphael J.L. Townshend; Raphael J.L. Townshend; Martin Vögele; Martin Vögele; Patricia Suriana; Patricia Suriana; Alexander Derry; Alexander Derry; Alexander Powers; Yianni Laloudakis; Sidhika Balachandar; Brandon Anderson; Stephan Eismann; Risi Kondor; Russ B. Altman; Ron O. Dror; Alexander Powers; Yianni Laloudakis; Sidhika Balachandar; Brandon Anderson; Stephan Eismann; Risi Kondor; Russ B. Altman; Ron O. Dror
    Description

    Ligand Binding Affinity (LBA) dataset from the ATOM3D project. This upload includes five zipped data directories:

    1. Full, unsplit dataset in LMDB format
    2. Split datasets at 60% sequence identity, with each in LMDB format
    3. Split datasets at 30% sequence identity, with each in LMDB format
    4. Text files containing train, validation, and test indices used to split raw dataset (for both 30% and 60%)
    5. README containing dataset details
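
    As a rough illustration, the pre-split LMDB datasets can be loaded with the atom3d Python package; the extracted directory names and the record key below are assumptions:

    from atom3d.datasets import LMDBDataset

    # Assumed unpacked layout of the 30% sequence-identity split archive.
    train = LMDBDataset("split-by-sequence-identity-30/data/train")
    val = LMDBDataset("split-by-sequence-identity-30/data/val")
    test = LMDBDataset("split-by-sequence-identity-30/data/test")

    print(len(train))
    sample = train[0]                       # one protein-ligand complex
    print(sample["scores"]["neglog_aff"])   # binding-affinity label (assumed key)
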
  20. IAM_dataset_modified

    • kaggle.com
    Updated Apr 10, 2020
    Cite
    Ashish Goswami (2020). IAM_dataset_modified [Dataset]. https://www.kaggle.com/ashish2001/iam-dataset-modified/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 10, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ashish Goswami
    Description

    Dataset

    This dataset was created by Ashish Goswami
