Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for Alpaca
I have only performed a train, test and validation split on the original dataset. A repository to reproduce this will be shared here soon. The original dataset card is included below.
Dataset Summary
Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction-tuning for language models and make the language model follow instructions better.… See the full description on the dataset page: https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Train Test Split For Freiburg Dataset In YOLOv7 Format is a dataset for object detection tasks - it contains Groceries annotations for 8,879 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Pritish92/arc-agi-prompts-train-test-split dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset contains images used to create a classification model for plant diseases. In particular, the dataset contains the following plants:
1. Bell Peppers
2. Potatoes
3. Tomatoes
This dataset is a **modified version** of the Plant Village Dataset and is a subset of the total data with a reduced number of plants. The modified dataset contains separate directories for:
1. Train: 70% of the data
2. Test: 20% of the data
3. Validation: 10% of the data
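A 70/20/10 partition like the one listed above could be produced roughly as follows. This is only a minimal sketch: the file names are hypothetical, the seed is arbitrary, and it is not the procedure the dataset author actually used.

```python
import random


def split_70_20_10(items, seed=0):
    """Shuffle items and cut them into train/test/validation (70/20/10)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = n * 70 // 100
    n_test = n * 20 // 100
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # whatever remains (about 10%)
    return train, test, validation


# Hypothetical file names, used only for illustration.
files = [f"img_{i:03d}.jpg" for i in range(100)]
train, test, validation = split_70_20_10(files)
print(len(train), len(test), len(validation))
```

For a classification dataset it is usually preferable to apply such a split per class folder, so that each plant/disease class keeps roughly the same proportions in every directory.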
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Cross-validation is a common method to validate a QSAR model. In cross-validation, some compounds are held out as a test set, while the remaining compounds form a training set. A model is built from the training set, and the test set compounds are predicted on that model. The agreement of the predicted and observed activity values of the test set (measured by, say, R²) is an estimate of the self-consistency of the model and is sometimes taken as an indication of the predictivity of the model. This estimate of predictivity can be optimistic or pessimistic compared to true prospective prediction, depending on how compounds in the test set are selected. Here, we show that time-split selection gives an R² that is more like that of true prospective prediction than the R² from random selection (too optimistic) or from our analog of leave-class-out selection (too pessimistic). Time-split selection should be used in addition to random selection as a standard for cross-validation in QSAR model building.
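The difference between random selection and time-split selection can be sketched as below. The record layout (a `compound` id and an ISO-formatted `date` string) and the 25% test fraction are assumptions made for illustration; they are not taken from the paper.

```python
import random


def random_split(records, test_fraction=0.25, seed=0):
    """Hold out a random subset of compounds as the test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) - round(len(shuffled) * test_fraction)
    return shuffled[:n_train], shuffled[n_train:]


def time_split(records, test_fraction=0.25):
    """Hold out the most recently registered compounds as the test set."""
    ordered = sorted(records, key=lambda r: r["date"])  # ISO dates sort as text
    n_train = len(ordered) - round(len(ordered) * test_fraction)
    return ordered[:n_train], ordered[n_train:]


# Toy records with made-up ids and dates.
compounds = [
    {"compound": f"C{i}", "date": f"20{10 + i:02d}-01-01"} for i in range(8)
]

rand_train, rand_test = random_split(compounds)
time_train, time_test = time_split(compounds)

# With a time split, every training compound predates every test compound.
print(max(r["date"] for r in time_train) < min(r["date"] for r in time_test))
```

In the time-split case the model never sees compounds newer than those it is tested on, which is what makes it resemble prospective prediction; a random split gives no such guarantee.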
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains end-of-season box-score aggregates for NBA players over the 2012–13 through 2023–24 seasons, split into training and test sets for both regular season and playoffs. Each CSV has one row per player per season with columns for points, rebounds, steals, turnovers, 3-pt attempts, FG attempts, plus identifiers.
end-of-season box-score aggregates (2012–13 to 2023–24), split into train/test;
the Jupyter notebook (Analysis.ipynb), in which all the code can be executed;
the trained model binary (nba_model.pkl), a serialized Random Forest model artifact;
evaluation plots (LAL vs. whole league) for regular-season and playoff predictions, uploaded as PNG outputs;
FAIR4ML metadata (fair4ml_metadata.jsonld);
see README.md and abbreviations.txt for file details.
Notebook
Analysis.ipynb: contains the graphical output for the training and test data.
Train/Test CSV Data
Name | Description | PID |
regular_train.csv | Regular-season training data: seasons 2012-2013 through 2021-2022 | 4421e56c-4cd3-4ec1-a566-a89d7ec0bced |
regular_test.csv | Regular-season test data: the 2022-2023 season | f9d84d5e-db01-4475-b7d1-80cfe9fe0e61 |
playoff_train.csv | Playoff training data: seasons 2012-2013 through 2022-2023 | bcb3cf2b-27df-48cc-8b76-9e49254783d0 |
playoff_test.csv | Playoff test data: the 2023-2024 season | de37d568-e97f-4cb9-bc05-2e600cc97102 |
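A season-based partition like the regular-season one in the table could be reproduced along these lines. The column names (`player`, `season`, `points`), the `YYYY-YYYY` season format, and the rows themselves are made up for illustration; the real CSV headers are described in the abbreviations file.

```python
import csv
import io

# Toy data standing in for a raw box-score CSV (not real statistics).
RAW = """player,season,points
A,2020-2021,1500
A,2021-2022,1600
A,2022-2023,1700
B,2022-2023,900
"""


def season_start(season):
    """Return the first year of a season label such as '2021-2022'."""
    return int(season.split("-")[0])


rows = list(csv.DictReader(io.StringIO(RAW)))

# Regular season: train on 2012-2013 .. 2021-2022, test on 2022-2023.
regular_train = [r for r in rows if 2012 <= season_start(r["season"]) <= 2021]
regular_test = [r for r in rows if season_start(r["season"]) == 2022]

print(len(regular_train), len(regular_test))
```

Splitting by season rather than at random keeps all rows from the held-out season out of training, so the test set behaves like a genuinely unseen year.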
Others
abbrevations.txt: lists the abbreviations used for the column names in the CSV data
Additional Notes
The raw CSV files were taken from Kaggle (Source: https://www.kaggle.com/datasets/shivamkumar121215/nba-stats-dataset-for-last-10-years/data).
Some preprocessing had to be done before uploading the data into dbrepo.
Plots have also been uploaded as outputs for visual inspection.
A more detailed version can be found on GitHub (Link: https://github.com/bubaltali/nba-prediction-analysis/).
Using Machine Learning techniques in general, and Deep Learning techniques in particular, requires a certain amount of data that is often not available in large quantities in some technical domains. The manual inspection of machine tool components, as well as the manual end-of-line check of products, are labour-intensive tasks in industrial applications that companies often want to automate. To automate the classification processes and to develop reliable and robust Machine Learning based classification and wear prognostics models, there is a need for real-world datasets to train and test models on.
The dataset contains 1104 three-channel images with 394 image annotations for the surface damage type “pitting”. The annotations, made with the annotation tool labelme, are available in JSON format and hence convertible to VOC and COCO format. All images come from two BSD types. The dataset available for download is divided into the folders data with all images as JPEG, label with all annotations, and saved_model with a baseline model. The authors also provide a Python script to divide the data and labels into three different split types: train_test_split, which splits the images into the same train and test split the authors used for the baseline model; wear_dev_split, which creates all 27 wear developments; and type_split, which splits the data by the occurring BSD types.
One of the two BSD types is represented by 69 images in 55 different image sizes. All images of this BSD type come in either a clean or a soiled condition. The other BSD type is shown in 325 images with two image sizes. Since all images of this type were taken continuously over time, the degree of soiling evolves. As mentioned above, the dataset also contains 27 pitting development sequences of 69 images each.
Instruction dataset split
The authors of this dataset provide three types of dataset splits. To get a data split, run the Python script split_dataset.py.
Script inputs:
split-type (mandatory)
output directory (mandatory)
Different split-types:
train_test_split: splits dataset into train and test data (80%/20%)
wear_dev_split: splits dataset into 27 wear-developments
type_split: splits dataset into different BSD types
Example:
C:\Users\Desktop>python split_dataset.py --split_type=train_test_split --output_dir=BSD_split_folder
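The command-line interface described above could be parsed roughly as follows. This is a hypothetical sketch that only mirrors the two documented arguments and the three split-type names; it is not the authors' split_dataset.py and performs no splitting.

```python
import argparse

# The three split types named in the dataset description.
SPLIT_TYPES = ("train_test_split", "wear_dev_split", "type_split")


def build_parser():
    """Build a parser with the two mandatory inputs described in the text."""
    parser = argparse.ArgumentParser(
        description="Sketch of a dataset-splitting command-line interface."
    )
    parser.add_argument("--split_type", required=True, choices=SPLIT_TYPES,
                        help="which of the three split types to create")
    parser.add_argument("--output_dir", required=True,
                        help="directory that receives the split")
    return parser


# Parse the same arguments as in the example command above.
args = build_parser().parse_args(
    ["--split_type=train_test_split", "--output_dir=BSD_split_folder"]
)
print(args.split_type, args.output_dir)
```

Restricting `--split_type` with `choices` makes a misspelled split name fail immediately with a usage message instead of silently producing no output.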
nate-rahn/0808-no_sexism-honly-regress-train-test-split dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/gpl-3.0/
Juliet-train-split-test-on-BinRealVul
Dataset Summary
Juliet-train-split-test-on-BinRealVul is a curated subset of the Juliet Test Suite (as organized in the GitHub repository), compiled and lifted to LLVM Intermediate Representation (IR) after a preprocessing phase. This dataset is designed specifically for training binary vulnerability detection models in a setting that ensures a fair comparison with models trained on CompRealVul_LLVM. The split was constructed to match… See the full description on the dataset page: https://huggingface.co/datasets/CCompote/Juliet-train-split-test-on-BinRealVul.
Original dataset: https://www.kaggle.com/datasets/bogdancretu/flower299
I chose an Acacia flower as the display picture of this dataset to highlight a problem in the flowers-299 dataset. If you go to the second folder of Acacia flowers, you will see a bunch of pictures of different-looking flowers. Despite having different shapes, structures and colors, they are all technically Acacia flowers, but we can't use this data for training because we don't have enough samples of Acacia flowers. Despite all efforts, even with the best model the probability of accurately predicting Acacia flowers is low.
This dataset needs data augmentation to be used efficiently with ResNet50.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities.
Details can be found in the attached report.
The annotation files are provided as Parquet files. They can be read in Python using the pandas and pyarrow libraries.
The split into train, validation and test set follows the split of the original datasets.
pip install pandas pyarrow
import pandas as pd
df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
print(df.iloc[0])
dataset AudioSet
filename train/---2_BBVHAA.mp3
captions_visual [a man in a black hat and glasses.]
captions_auditory [a man speaks and dishes clank.]
tags [Speech]
The annotation file consists of the following fields:
filename: Name of the corresponding file (video or audio file)
dataset: Source dataset associated with the data point
captions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual content
captions_auditory: A list of captions related to the auditory content of the video
tags: A list of tags, classifying the sound of a file. It can be NaN if no tags are provided
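Because `captions_visual` and `tags` can be NaN, it may help to handle missing values explicitly before iterating over the lists. The sketch below builds a tiny in-memory frame with the same field names; the rows are invented for illustration and are not taken from the annotation files.

```python
import pandas as pd

# Two made-up rows using the documented field names.
df = pd.DataFrame({
    "filename": ["train/a.mp3", "train/b.mp4"],
    "dataset": ["AudioSet", "ExampleSet"],
    "captions_visual": [float("nan"), ["a dog runs."]],
    "captions_auditory": [["a man speaks."], ["a dog barks."]],
    "tags": [["Speech"], float("nan")],
})

# Keep only the rows that actually have visual captions.
visual_rows = df[df["captions_visual"].notna()]

# Replace missing tag lists with empty lists so later loops need no NaN checks.
df["tags"] = df["tags"].apply(lambda t: t if isinstance(t, list) else [])

print(len(visual_rows))
print(df["tags"].tolist())
```

The `isinstance` check is used instead of `fillna` because the replacement value is itself a list, which `fillna` does not accept as a fill value.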
The raw data files for most datasets are not released due to licensing issues and must be downloaded from the source. However, because some files may be missing at the source, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de
Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label “match” or “no match”) for four product categories: computers, cameras, watches and shoes. In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000 to 70,000 pairs). Furthermore, sets of ids are available for each training set for a possible validation split (stratified random draw). The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived using shared product identifiers from the Web as weak supervision. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites. For more information and download links for the corpus itself, please follow the links below.
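A stratified random draw of validation ids, as mentioned above, can be sketched like this. The field names (`pair_id`, `label`), the 20% fraction, and the toy pairs are assumptions for illustration; the published id sets should be used to reproduce the official split.

```python
import random
from collections import defaultdict


def stratified_validation_ids(pairs, validation_fraction=0.2, seed=0):
    """Draw the same fraction of ids from each label group."""
    ids_by_label = defaultdict(list)
    for pair in pairs:
        ids_by_label[pair["label"]].append(pair["pair_id"])

    rng = random.Random(seed)
    validation_ids = set()
    for label in sorted(ids_by_label):
        ids = ids_by_label[label]
        k = round(len(ids) * validation_fraction)
        validation_ids.update(rng.sample(ids, k))
    return validation_ids


# Toy training set: 10 matching pairs and 40 non-matching pairs.
pairs = [{"pair_id": i, "label": 1 if i < 10 else 0} for i in range(50)]
validation_ids = stratified_validation_ids(pairs)

validation = [p for p in pairs if p["pair_id"] in validation_ids]
training = [p for p in pairs if p["pair_id"] not in validation_ids]
print(len(training), len(validation))
```

Drawing per label keeps the match/no-match ratio of the validation set equal to that of the training set, which matters when matches are the minority class.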
Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (or those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class).
Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
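One possible reading of the selection rule above is a random draw capped per species/grid cell combination. The sketch below implements that reading only; the field names, the placeholder species codes, and the small demo cap are hypothetical, and the actual NABat selection code may differ.

```python
import random
from collections import defaultdict


def cap_per_group(files, cap=1250, seed=0):
    """Randomly keep at most `cap` files per (species, grid cell) pair."""
    groups = defaultdict(list)
    for f in files:
        groups[(f["species"], f["grid_cell"])].append(f)

    rng = random.Random(seed)
    selected = []
    for key in sorted(groups):
        members = groups[key]
        if len(members) > cap:
            members = rng.sample(members, cap)  # group is over the threshold
        selected.extend(members)                # otherwise the group is exhausted
    return selected


# Toy input: placeholder codes "AAAA"/"BBBB" are not real species codes.
files = (
    [{"species": "AAAA", "grid_cell": 1, "name": f"a{i}.wav"} for i in range(5)]
    + [{"species": "BBBB", "grid_cell": 1, "name": f"b{i}.wav"} for i in range(2)]
)
selected = cap_per_group(files, cap=3)
print(len(selected))
```

With the demo cap of 3, the first group is sampled down to three files while the second group, which has only two, is kept whole.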
This benchmark data comprises 50 different datasets of materials properties obtained from 16 previous publications. The data contains both experimental and computational data, data suited for regression as well as classification, sizes ranging from 12 to 6354 samples, and materials systems spanning the diversity of materials research. In addition to being cleaned where necessary, each dataset was split into train, validation, and test splits.
For datasets with more than 100 values, train-val-test splits were created with either a 5-fold or a 10-fold cross-validation method, depending on what the respective paper used. For datasets with fewer than 100 values, train-test splits were created using the Leave-One-Out cross-validation method.
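The two schemes mentioned above differ only in the number of folds: Leave-One-Out is k-fold with k equal to the number of samples. The following sketch generates index splits without shuffling and without a separate validation set; it illustrates the idea and is not the code used to build the benchmark splits.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    # Fold i takes every k-th sample starting at index i.
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def leave_one_out_indices(n_samples):
    """Leave-One-Out is the special case with one sample per fold."""
    return k_fold_indices(n_samples, n_samples)


five_fold = list(k_fold_indices(10, 5))
loo = list(leave_one_out_indices(12))
print(len(five_fold), len(loo))
```

For a 12-sample dataset, Leave-One-Out yields 12 splits with 11 training samples each, which is why it is typically reserved for the smallest datasets.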
For further information, as well as directions on how to access the data, please go to the corresponding GitHub repository: https://github.com/anhender/mse_ML_datasets/tree/v1.0
Split version of the garbage classification dataset (link below). train, test and valid folders have been generated as specified by the one-indexed files of the original dataset
Original dataset here: https://www.kaggle.com/asdasdasasdas/garbage-classification
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The raw data comes from Ba Nguyen et al, 2022, who hosted their data here. This dataset was used in an independent study in Rijal et al, 2025, who preprocessed the data using these notebook scripts. They did not release their processed data, so we reproduced their processing pipeline and have uploaded the data ourselves as part of this data resource.
This release accompanies this publication: https://doi.org/10.57844/arcadia-bmb9-fzxd
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Object recognition predominately still relies on many high-quality training examples per object category. In contrast, learning new objects from only a few examples could enable many impactful applications from robotics to user personalization. Most few-shot learning research, however, has been driven by benchmark datasets that lack the high variation that these applications will face when deployed in the real-world. To close this gap, we present the ORBIT dataset, grounded in a real-world application of teachable object recognizers for people who are blind/low vision. We provide a full, unfiltered dataset of 4,733 videos of 588 objects recorded by 97 people who are blind/low-vision on their mobile phones, and a benchmark dataset of 3,822 videos of 486 objects collected by 77 collectors. The code for loading the dataset, computing all benchmark metrics, and running the baseline models is available at https://github.com/microsoft/ORBIT-Dataset
This version comprises several zip files:
- train, validation, test: benchmark dataset, organised by collector, with raw videos split into static individual frames in jpg format at 30FPS
- other: data not in the benchmark set, organised by collector, with raw videos split into static individual frames in jpg format at 30FPS (please note that the train, validation, test, and other files make up the unfiltered dataset)
- *_224: as for the benchmark, but static individual frames are scaled down to 224 pixels.
- *_unfiltered_videos: full unfiltered dataset, organised by collector, in mp4 format.
Ligand Binding Affinity (LBA) dataset from the ATOM3D project. This upload includes five zipped data directories:
This dataset was created by Ashish Goswami