100+ datasets found

h
alpaca-train-validation-test-split
huggingface.co
Updated Aug 12, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Doula Isham Rashik Hasan (2023). alpaca-train-validation-test-split [Dataset]. https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 12, 2023
Authors
Doula Isham Rashik Hasan
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Dataset Card for Alpaca

I have just performed train, test and validation split on the original dataset. Repository to reproduce this will be shared here soon. I am including the orignal Dataset card as follows.

Dataset Summary

Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction-tuning for language models and make the language model follow instruction better.… See the full description on the dataset page: https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split.
P
PubMed (60%/20%/20% random splits) Dataset
paperswithcode.com
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
PubMed (60%/20%/20% random splits) Dataset [Dataset]. https://paperswithcode.com/dataset/pubmed-60-20-20-random-splits
Explore at:
Description
Node classification on PubMed with 60%/20%/20% random splits for training/validation/test.
P
Film (60%/20%/20% random splits) Dataset
paperswithcode.com
library.toponeai.link
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Film (60%/20%/20% random splits) Dataset [Dataset]. https://paperswithcode.com/dataset/film-60-20-20-random-splits
Explore at:
Description
Node classification on Film with 60%/20%/20% random splits for training/validation/test.
Data Cleaning, Translation & Split of the Dataset for the Automatic...
zenodo.org
data.niaid.nih.gov
bin, csv +1
Updated Apr 24, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Juliane Köhler; Juliane Köhler (2025). Data Cleaning, Translation & Split of the Dataset for the Automatic Classification of Documents for the Classification System for the Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft [Dataset]. http://doi.org/10.5281/zenodo.6957842
Explore at:
text/x-python, csv, binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.6957842
Dataset updated
Apr 24, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Juliane Köhler; Juliane Köhler
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Cleaned_Dataset.csv – The combined CSV files of all scraped documents from DABI, e-LiS, o-bib and Springer.

Data_Cleaning.ipynb – The Jupyter Notebook with python code for the analysis and cleaning of the original dataset.

ger_train.csv – The German training set as CSV file.

ger_validation.csv – The German validation set as CSV file.

en_test.csv – The English test set as CSV file.

en_train.csv – The English training set as CSV file.

en_validation.csv – The English validation set as CSV file.

splitting.py – The python code for splitting a dataset into train, test and validation set.

DataSetTrans_de.csv – The final German dataset as a CSV file.

DataSetTrans_en.csv – The final English dataset as a CSV file.

translation.py – The python code for translating the cleaned dataset.
FSOCO-split
kaggle.com
Updated Oct 14, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Tommaso Fava (2024). FSOCO-split [Dataset]. https://www.kaggle.com/datasets/tommasofava/fsoco-split/code
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 14, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Tommaso Fava
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
FSOCO dataset split into train (80%), validation (10%), and test (10%) set. Ready for Ultralytics YOLO training.
h
Putnam-AXIOM-for-zip-fit-splits
huggingface.co
Updated Jan 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
ZIP-FIT - Compression-Based Data Selection For Code (2023). Putnam-AXIOM-for-zip-fit-splits [Dataset]. https://huggingface.co/datasets/zipfit/Putnam-AXIOM-for-zip-fit-splits
Explore at:
Dataset updated
Jan 1, 2023
Dataset authored and provided by
ZIP-FIT - Compression-Based Data Selection For Code
Description
Putnam-AXIOM Splits for ZIP-FIT

This repository contains the train, validation, and test splits of the Putnam-AXIOM dataset specifically for use with the ZIP-FIT methodology research. The dataset is split as follows:

train: 150 examples validation: 150 examples test: 222 examples

These splits are derived from the original 522 Putnam problems found in the main Putnam-AXIOM repository.

Main Repository

The full dataset with original problems and variations is available… See the full description on the dataset page: https://huggingface.co/datasets/zipfit/Putnam-AXIOM-for-zip-fit-splits.
P
Texas(60%/20%/20% random splits) Dataset
paperswithcode.com
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Texas(60%/20%/20% random splits) Dataset [Dataset]. https://paperswithcode.com/dataset/texas-60-20-20-random-splits-1
Explore at:
Area covered
Texas
Description
Node classification on Texas with 60%/20%/20% random splits for training/validation/test.
t
FAIR Dataset for Disease Prediction in Healthcare Applications
test.researchdata.tuwien.ac.at
bin, csv, json, png
Updated Apr 14, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf (2025). FAIR Dataset for Disease Prediction in Healthcare Applications [Dataset]. http://doi.org/10.70124/5n77a-dnf02
Explore at:
csv, json, bin, pngAvailable download formats
Unique identifier
https://doi.org/10.70124/5n77a-dnf02
Dataset updated
Apr 14, 2025
Dataset provided by
TU Wien
Authors
Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset Description

Context and Methodology

Research Domain/Project:
This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validating, and testing.

Purpose of the Dataset:
The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.

Dataset Creation:
Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).

Technical Details

Structure of the Dataset:
The dataset consists of several files organized into folders by data type:

Training Data: Contains the training dataset used to train the machine learning model.

Validation Data: Used for hyperparameter tuning and model selection.

Test Data: Reserved for final model evaluation.

Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.

Software Requirements:
To open and work with this dataset, you need VS Code or Jupyter, which could include tools like:

Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)

Further Details

Reusability:
Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.

Limitations:
The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.
h
tripadvisor-split-dataset-v2
huggingface.co
Updated Jan 26, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
N. Hüll (2025). tripadvisor-split-dataset-v2 [Dataset]. https://huggingface.co/datasets/nhull/tripadvisor-split-dataset-v2
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 26, 2025
Authors
N. Hüll
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
TripAdvisor Review Rating Split Dataset

This dataset contains 80,000 TripAdvisor reviews with corresponding ratings. It is derived from the original TripAdvisor dataset available here and was created to train different models for a university project in the class of NLP.

Dataset Structure

Training Set: 30,400 examples Validation Set: 1,600 examples Test Set: 8,000 examples

Each set is balanced, ensuring equal representation of all sentiment labels.

Label

The… See the full description on the dataset page: https://huggingface.co/datasets/nhull/tripadvisor-split-dataset-v2.
d
Training dataset for NABat Machine Learning V1.0
catalog.data.gov
data.usgs.gov
Updated Jul 6, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2024). Training dataset for NABat Machine Learning V1.0 [Dataset]. https://catalog.data.gov/dataset/training-dataset-for-nabat-machine-learning-v1-0
Explore at:
Dataset updated
Jul 6, 2024
Dataset provided by
U.S. Geological Survey
Description
Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm, however the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (or those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N=3; Eumops floridanus, N =3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N =11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reach 1250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
Data for the manuscript "Spatially resolved uncertainties for machine...
zenodo.org
application/gzip, bin +1
Updated May 3, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Esther Heid; Esther Heid; Johannes Schörghuber; Johannes Schörghuber; Ralf Wanzenböck; Ralf Wanzenböck; Georg K. H. Madsen; Georg K. H. Madsen (2024). Data for the manuscript "Spatially resolved uncertainties for machine learning potentials" [Dataset]. http://doi.org/10.5281/zenodo.11093925
Explore at:
application/gzip, bin, text/x-pythonAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.11093925
Dataset updated
May 3, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Esther Heid; Esther Heid; Johannes Schörghuber; Johannes Schörghuber; Ralf Wanzenböck; Ralf Wanzenböck; Georg K. H. Madsen; Georg K. H. Madsen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository accompanies the manuscript "Spatially resolved uncertainties for machine learning potentials" by E. Heid, J. Schörghuber, R. Wanzenböck, and G. K. H. Madsen. The following files are available:

mc_experiment.ipynb is a Jupyter notebook for the Monte Carlo experiment described in the study (artificial model with only variance as error source).

aggregate_cut_relax.py contains code to cut and relax boxes for the water active learning cycle.

data_t1x.tar.gz contains reaction pathways for 10,073 reactions from a subset of the Transition1x dataset, split into training, validation and test sets. The training and validation sets contain the indices 1, 2, 9, and 10 from a 10-image nudged-elastic band search (40k datapoints), while the test set contains indices 3-8 (60k datapoints). The test set is ordered according to the reaction and index, i.e. rxn1_index3, rxn1_index4, [...] rxn1_index8, rxn2_index3, [...].

data_sto.tar.gz contains surface reconstructions of SrTiO3, randomly split into a training and validation set, as well as a test set.

data_h2o.tar.gz contains:

full_db.extxyz: The full dataset of 1.5k structures.

iter00_train.extxyz and iter00_validation.extxyz: The initial training and validation set for the active learning cycle.

the subfolders in the folders random and uncertain contain the training and validation sets for the random and uncertainty-based active learning loops.
Data for the manuscript "Spatially resolved uncertainties for machine...
zenodo.org
data.niaid.nih.gov
application/gzip, bin +1
Updated Aug 1, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Esther Heid; Esther Heid; Johannes Schörghuber; Johannes Schörghuber; Ralf Wanzenböck; Ralf Wanzenböck; Georg K. H. Madsen; Georg K. H. Madsen (2024). Data for the manuscript "Spatially resolved uncertainties for machine learning potentials" [Dataset]. http://doi.org/10.5281/zenodo.13151227
Explore at:
application/gzip, text/x-python, binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.13151227
Dataset updated
Aug 1, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Esther Heid; Esther Heid; Johannes Schörghuber; Johannes Schörghuber; Ralf Wanzenböck; Ralf Wanzenböck; Georg K. H. Madsen; Georg K. H. Madsen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository accompanies the manuscript "Spatially resolved uncertainties for machine learning potentials" by E. Heid, J. Schörghuber, R. Wanzenböck, and G. K. H. Madsen. The following files are available:

mc_experiment.ipynb is a Jupyter notebook for the Monte Carlo experiment described in the study (artificial model with only variance as error source).

aggregate_cut_relax.py contains code to cut and relax boxes for the water active learning cycle.

data_t1x.tar.gz contains reaction pathways for 10,073 reactions from a subset of the Transition1x dataset, split into training, validation and test sets. The training and validation sets contain the indices 1, 2, 9, and 10 from a 10-image nudged-elastic band search (40k datapoints), while the test set contains indices 3-8 (60k datapoints). The test set is ordered according to the reaction and index, i.e. rxn1_index3, rxn1_index4, [...] rxn1_index8, rxn2_index3, [...].

data_sto.tar.gz contains surface reconstructions of SrTiO3, randomly split into a training and validation set, as well as a test set.

data_h2o.tar.gz contains:

full_db.extxyz: The full dataset of 1.5k structures.

iter00_train.extxyz and iter00_validation.extxyz: The initial training and validation set for the active learning cycle.

the subfolders in the folders random, and uncertain, and atomic contain the training and validation sets for the random and uncertainty-based (local or atomic) active learning loops.
R
Data from: Fashion Mnist Dataset
universe.roboflow.com
opendatalab.com
+3more
zip
Updated Aug 10, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Popular Benchmarks (2022). Fashion Mnist Dataset [Dataset]. https://universe.roboflow.com/popular-benchmarks/fashion-mnist-ztryt/model/3
Explore at:
zipAvailable download formats
Dataset updated
Aug 10, 2022
Dataset authored and provided by
Popular Benchmarks
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Clothing
Description
Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms

Authors:

Han Xiao, Kashif Rasul and Roland Vollgraf

https://arxiv.org/abs/1708.07747

Dataset Obtained From: https://github.com/zalandoresearch/fashion-mnist

All images were sized 28x28 in the original dataset

Fashion-MNIST is a dataset of Zalando's article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits. * Source

Here's an example of how the data looks (each class takes three-rows): https://github.com/zalandoresearch/fashion-mnist/raw/master/doc/img/fashion-mnist-sprite.png" alt="Visualized Fashion MNIST dataset">

Version 1 (original-images_Original-FashionMNIST-Splits):

Original images, with the original splits for MNIST: train (86% of images - 60,000 images) set and test (14% of images - 10,000 images) set only.

This version was not trained

Version 3 (original-images_trainSetSplitBy80_20):

Original, raw images, with the train set split to provide 80% of its images to the training set and 20% of its images to the validation set

https://blog.roboflow.com/train-test-split/ https://i.imgur.com/angfheJ.png" alt="Train/Valid/Test Split Rebalancing">

Citation:

@online{xiao2017/online, author = {Han Xiao and Kashif Rasul and Roland Vollgraf}, title = {Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms}, date = {2017-08-28}, year = {2017}, eprintclass = {cs.LG}, eprinttype = {arXiv}, eprint = {cs.LG/1708.07747}, }
o
Natural Language Inference Evaluation Dataset
opendatabay.com
.undefined
Updated Jul 6, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Datasimple (2025). Natural Language Inference Evaluation Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/abcd24c8-a1a1-4724-83b2-ea07314b8d13
Explore at:
.undefinedAvailable download formats
Dataset updated
Jul 6, 2025
Dataset authored and provided by
Datasimple
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Data Science and Analytics
Description
The HellaSwag dataset is a highly valuable resource for assessing a machine's sentence completion abilities based on commonsense natural language inference (NLI). It was initially introduced in a paper published at ACL2019. This dataset enables researchers and machine learning practitioners to train, validate, and evaluate models designed to understand and predict plausible sentence completions using common sense knowledge. It is useful for understanding the limitations of current NLI systems and for developing algorithms that reason with common sense.

Columns

The dataset includes several key columns: * ind: The index of the data point. (Integer) * activity_label: The label indicating the activity or event described in the sentence. (String) * ctx_a: The first context sentence, providing background information. (String) * ctx_b: The second context sentence, providing further background information. (String) * endings: A list of possible sentence completions for the given context. (List of Strings) * split: The dataset split, such as 'train', 'dev', or 'test'. (String) * split_type: The type of split used for dividing the dataset, like 'random' or 'balanced'. (String) * source_id: An identifier for the source. * label: A label associated with the data point.

Distribution

The dataset is typically provided in CSV format and consists of three primary files: train.csv, validation.csv, and test.csv. The train.csv file facilitates the learning process for machine learning models, validation.csv is used to validate model performance, and test.csv enables thorough evaluation of models in completing sentences with common sense. While exact total row counts for the entire dataset are not specified in the provided information, insights into unique values for fields such as activity_label (9965 unique values), source_id (8173 unique values), and split_type (e.g., 'indomain' and 'zeroshot' each accounting for 50%) are available.

Usage

This dataset is ideal for a variety of applications and use cases: * Language Modelling: Training language models to better understand common sense knowledge and improve sentence completion tasks. * Common Sense Reasoning: Developing and studying algorithms that can reason and make inferences based on common sense. * Machine Performance Evaluation: Assessing the effectiveness of machine learning models in generating appropriate sentence endings given specific contexts and activity labels. * Natural Language Inference (NLI): Benchmarking and improving NLI systems by evaluating their ability to predict plausible sentence completions.

Coverage

The dataset has a global region scope. It was listed on 17/06/2025. Specific time ranges for the data collection itself or detailed demographic scopes are not provided. The dataset includes various splits (train, dev, test) and split types (random, balanced) to ensure diversity for generalisation testing and fairness evaluation during model development.

License

CC0

Who Can Use It

The HellaSwag dataset is intended for researchers and machine learning practitioners. They can utilise it to: * Train, validate, and evaluate machine learning models for tasks requiring common sense knowledge. * Develop and refine algorithms for common sense reasoning. * Benchmark and assess the performance and limitations of current natural language inference systems.

Dataset Name Suggestions

HellaSwag: Commonsense NLI

Commonsense Sentence Completion Data

Natural Language Inference Evaluation Dataset

AI Common Sense Benchmark

Attributes

Original Data Source: HellaSwag: Commonsense NLI
h
layoutlmv3-invoice-dataset
huggingface.co
Updated Jul 15, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The citation is currently not available for this dataset.
Explore at:
Dataset updated
Jul 15, 2025
Authors
Kwashie
Description
LayoutLMv3 Invoice Dataset

This dataset is processed and ready for training LayoutLMv3 models for invoice information extraction.

Dataset Description

This dataset contains invoice documents with OCR-extracted text, bounding boxes, and entity labels for training document understanding models.

Dataset Structure

train: Training split validation: Validation split (if available) test: Test split (if available)

Features

input_ids: Tokenized text input… See the full description on the dataset page: https://huggingface.co/datasets/Kwash67/layoutlmv3-invoice-dataset.
Chroma training, test, and validation sets.
zenodo.org
Updated Sep 22, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gevorg Grigoryan; Gevorg Grigoryan (2023). Chroma training, test, and validation sets. [Dataset]. http://doi.org/10.5281/zenodo.8285077
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.8285077
Dataset updated
Sep 22, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Gevorg Grigoryan; Gevorg Grigoryan
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Column pdb contains the PDB codes for entries used in training Chroma, while column split designates whether the corresponding entry was part of the training set (value train), the test set (value test), or the validation set (value validation).
CAPYBARA: Decompiled Binary Functions and Related Summaries
zenodo.org
data.niaid.nih.gov
zip
Updated Jan 4, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ali Al-Kaswan; Ali Al-Kaswan; Toufique Ahmed; Toufique Ahmed; Maliheh Izadi; Maliheh Izadi; Anand Ashok Sawant; Anand Ashok Sawant; Prem Devanbu; Prem Devanbu; Arie van Deursen; Arie van Deursen (2023). CAPYBARA: Decompiled Binary Functions and Related Summaries [Dataset]. http://doi.org/10.5281/zenodo.7229809
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.7229809
Dataset updated
Jan 4, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Ali Al-Kaswan; Ali Al-Kaswan; Toufique Ahmed; Toufique Ahmed; Maliheh Izadi; Maliheh Izadi; Anand Ashok Sawant; Anand Ashok Sawant; Prem Devanbu; Prem Devanbu; Arie van Deursen; Arie van Deursen
License
http://www.apache.org/licenses/LICENSE-2.0http://www.apache.org/licenses/LICENSE-2.0
Description
CAPYBARA

This dataset is published as part of the paper: "Extending Source Code Pre-Trained Language Models to Summarise Decompiled Binaries". It includes both the training/evaluation data as well as the raw data.

The data_split folder contains .pickle files with the test and validation repos, the train repos are all the remaining repos. It also contains a .pickle file with a dictionary that specifies the optimization level for each repository.

In the processed_data folder, the processed datasets can be found in .csv format. The columns of the CSV are the `summaries`, the `original documentation`, the `repo`, the `source` and `decompiled` code, the `function name` and a unique `identifier`. We also include the deduplicated samples in separate CSVs.

The processed training files can be found in the training_data folder. `Source C`, `decompiled`, `demiStripped`, and `stripped` can each be found in their corresponding folders and are split into deduplicated and regular datasets. The data is further split into .jsonl files, for the train, test, and validation sets. These .jsonl files can be loaded into CodeT5 and CodeXGlue as is.

The raw_data folder contains all the stripped and decompiled functions without any pre-processing applied. The columns of the CSV are `repo`, the `location`, the `original` code, the corresponding `decompiled` code, the `function name`, a unique `identifier` key, and the corresponding `documentation` for both the decompiled and stripped functions.

License

Copyright 2022 ##########

Licensed under the Apache License, Version 2.0 (the "License");

you may not use this file except in compliance with the License.

You may obtain a copy of the License at:

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software

distributed under the License is distributed on an "AS IS" BASIS,

WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

See the License for the specific language governing permissions and

limitations under the License.
f
ORBIT: A real-world few-shot dataset for teachable object recognition...
city.figshare.com
bin
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Daniela Massiceti; Lida Theodorou; Luisa Zintgraf; Matthew Tobias Harris; Simone Stumpf; Cecily Morrison; Edward Cutrell; Katja Hofmann (2023). ORBIT: A real-world few-shot dataset for teachable object recognition collected from people who are blind or low vision [Dataset]. http://doi.org/10.25383/city.14294597.v3
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.25383/city.14294597.v3
Dataset updated
May 31, 2023
Dataset provided by
City, University of London
Authors
Daniela Massiceti; Lida Theodorou; Luisa Zintgraf; Matthew Tobias Harris; Simone Stumpf; Cecily Morrison; Edward Cutrell; Katja Hofmann
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Object recognition predominately still relies on many high-quality training examples per object category. In contrast, learning new objects from only a few examples could enable many impactful applications from robotics to user personalization. Most few-shot learning research, however, has been driven by benchmark datasets that lack the high variation that these applications will face when deployed in the real-world. To close this gap, we present the ORBIT dataset, grounded in a real-world application of teachable object recognizers for people who are blind/low vision. We provide a full, unfiltered dataset of 4,733 videos of 588 objects recorded by 97 people who are blind/low-vision on their mobile phones, and a benchmark dataset of 3,822 videos of 486 objects collected by 77 collectors. The code for loading the dataset, computing all benchmark metrics, and running the baseline models is available at https://github.com/microsoft/ORBIT-DatasetThis version comprises several zip files:- train, validation, test: benchmark dataset, organised by collector, with raw videos split into static individual frames in jpg format at 30FPS- other: data not in the benchmark set, organised by collector, with raw videos split into static individual frames in jpg format at 30FPS (please note that the train, validation, test, and other files make up the unfiltered dataset)- *_224: as for the benchmark, but static individual frames are scaled down to 224 pixels.- *_unfiltered_videos: full unfiltered dataset, organised by collector, in mp4 format.
Data from: Transfer learning reveals sequence determinants of the...
zenodo.org
application/gzip, zip
Updated May 29, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sahin Naqvi; Sahin Naqvi (2024). Transfer learning reveals sequence determinants of the quantitative response to transcription factor dosage [Dataset]. http://doi.org/10.5281/zenodo.11224809
Explore at:
application/gzip, zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.11224809
Dataset updated
May 29, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Sahin Naqvi; Sahin Naqvi
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Processed data and code for "Transfer learning reveals sequence determinants of the quantitative response to transcription factor dosage," Naqvi et al 2024.

Directory is organized into 4 subfolders, each tar'ed and gzipped:

data_analysis.tar.gz - Processed data for modulation of TWIST1 levels and calculation of RE responsiveness to TWIST1 dosage

atac_design.txt - design matrix for ATAC-seq TWIST1 titration samples

all.sub.150bpclust.greater2.500bp.merge.TWIST1.titr.ATAC.counts.txt - ATAC-seq counts from all samples over all reproducible ATAC-seq peak regions, as defined in Naqvi et al 2023

atac_deseq_fitmodels_moded50.R - R code for calculating new version of ED50 and response to full depletion from TWIST1 titration data (note, uses drm.R function from 10.5281/zenodo.7689948, install drc() with this version to avoid errors)

baseline_models.tar.gz - Code and data for training baseline models to predict RE responsiveness to SOX9/TWIST1 dosage

{sox9|twist1}.{0v100|ed50}.{train|valid|test}.txt - Training/testing/validation data (ED50 or full TF depletion effect for SOX9 or TWIST1), split into train/test/validation folds

HOCOMOCOv11_core_HUMAN_mono_jaspar_format.all.sub.150bpclust.greater2.500bp.merge.minus300bp.p01.maxscore.mat.cpg.gc.basemean.txt.gz - matrix of predictors for all REs. Quantitative encoding of PWM match for all HOCOMOCO motifs + CpG + GC content, plus unperturbed ATAC-seq signal

train_baseline.R - R code to train baseline (LASSO regression or random forest) models using predictor matrix and the provided training data.

Note: training the random forest to predict full TF depletion is computationally intensive because it is across all REs, if doing this run on CPU for ~6 hrs.

chrombpnet_models.tar.gz - Remainder of code, data, and models for fine-tuning and interpreting ChromBPNet mdoels to predict RE responsiveness to SOX9/TWIST1 dosage

Fine-tuning code, data, models

{all|sox9.direct|twist1.bound.down}.{train|valid|test}.{ed50|0v100.log2fc}.txt - Training/testing/validation data (ED50 or full TF depletion effect for SOX9 or TWIST1), split into train/test/validation folds

pretrained.unperturbed.chrombpnet.h5 - Pretrained model of unperturbed ATAC-seq signal in CNCCs, obtained by running ChromBPNet (https://github.com/kundajelab/chrombpnet) on DMSO-treated SOX9/TWIST1-tagged ATAC-seq data

finetune_chrombpnet.py - code for fine-tuning the pretrained model for any of the relevant prediction tasks (ED50/ effect of full TF depletion for SOX9/TWIST1)

best.model.chrombpnet.{0v100|ed50}.{sox9|twist1}.h5 - output of finetune_chrombpnet.py, best model after 10 training epochs for the indicated task

chrombpnet.{0v100|ed50}.{sox9|twist1}.contrib.{h5|bw} - contribution scores for the indicated predictive model, obtained by running chrombpnet contribs_bw on the corresponding model h5 file.

chrombpnet.{0v100|ed50}.{sox9|twist1}.contrib.modisco.{h5|bw} - TF-MoDIsCo output from the corresponding contribution score file

Interpretation code, data, models

contrib_h5_to_projshap_npy.py - code to convert contrib .h5 files into .npy files containing projected SHAP scores (required because the CWM matching code takes this format of contribution scores)

sox9.direct.10col.bed, twist1.bound.down.10col.uniq.bed - regions over which CWMs will be matched (likely direct targets of each TF)

match_cwms.py - Python code to match individual CWM instances. Takes as input: modisco .h5 file, SHAP .npy file, bed file of regions to be matched. Output is a bed file of all CWM matches (not pruned, contains many redundant matches).

chrombpnet.ed50.{sox9|twist1}.contrib.perc05.matchperc10.allmatch.bed - output of match_cwms.py

take_max_overlap.py - code to merge output of match_cwms.py into clusters, and then take the maximum (length-normalized) match score in each cluster as the representative CWM match of that cluster. Requires upstream bedtools commands to be piped in, see example usage in file.

chrombpnet.ed50.{sox9|twist1}.contrib.perc05.matchperc10.allmatch.maxoverlap.bed - output of take_max_overlap.py. These CWM instances are the ones used throughout the paper.

modisco_reports.zip - TF-MoDIsCo reports from running on the fine-tuned ChromBPNet models

modisco_report_{sox9|twist1}_{0v100|ed50}: folders containing images of discovered CWMs and HTMLs/PDFs of summarized reports from running TF-MoDisCo on the indicated fine-tuned ChromBPNet model

mirny_model.tar.gz - Code and data for analyzing and fitting Mirny model of TF-nucleosome competition to observed RE dosage response curves

twist1.strong.multi.only.ed50.cutoff.true.hill.txt - ED50 and signed hill coefficients for all TWIST1-dependent REs with only buffering Coordinators (mostly one or two) and no other TFs' binding sites. "ed50_new" is the ED50 calculation used in this paper.

twist1.strong.weak{1|2|3}.ed50.cutoff.true.hill.txt - ED50 and signed hill coefficients for all TWIST1-dependent REs with only buffering Coordinators (mostly one or two) and the indicated number of sensitizing (weak) Coordinators and no other TFs' binding sites. "ed50_new" is the ED50 calculation used in this paper.

MirnyModelAnalysis.py - Python code for analysis of Mirny model of TF-nucleosome competition. Contains implementations of analytic solutions, as well as code to fit model to observed ED50 and hill coefficients in the provided data files.
t
Tour Recommendation Model
test.researchdata.tuwien.at
test.researchdata.tuwien.ac.at
bin, png +1
Updated May 14, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Muhammad Mobeel Akbar; Muhammad Mobeel Akbar; Muhammad Mobeel Akbar; Muhammad Mobeel Akbar (2025). Tour Recommendation Model [Dataset]. http://doi.org/10.70124/akpf6-8p175
Explore at:
text/markdown, png, binAvailable download formats
Unique identifier
https://doi.org/10.70124/akpf6-8p175
Dataset updated
May 14, 2025
Dataset provided by
TU Wien
Authors
Muhammad Mobeel Akbar; Muhammad Mobeel Akbar; Muhammad Mobeel Akbar; Muhammad Mobeel Akbar
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Apr 28, 2025
Description
Dataset Description for Tour Recommendation Model

Context and Methodology:

Research Domain/Project:
This dataset is part of the Tour Recommendation System project, which focuses on predicting user preferences and ratings for various tourist places and events. It belongs to the field of Machine Learning, specifically applied to Recommender Systems and Predictive Analytics.

Purpose:
The dataset serves as the training and evaluation data for a Decision Tree Regressor model, which predicts ratings (from 1-5) for different tourist destinations based on user preferences. The model can be used to recommend places or events to users based on their predicted ratings.

Creation Methodology:
The dataset was originally collected from a tourism platform where users rated various tourist places and events. The data was preprocessed to remove missing or invalid entries (such as #NAME? in rating columns). It was then split into subsets for training, validation, and testing the model.

Technical Details:

Structure of the Dataset:
The dataset is stored as a CSV file (user_ratings_dataset.csv) and contains the following columns:

place_or_event_id: Unique identifier for each tourist place or event.

rating: Rating given by the user, ranging from 1 to 5.

The data is split into three subsets:

Training Set: 80% of the dataset used to train the model.

Validation Set: A small portion used for hyperparameter tuning.

Test Set: 20% used to evaluate model performance.

Folder and File Naming Conventions:
The dataset files are stored in the following structure:

user_ratings_dataset.csv: The original dataset file containing user ratings.

tour_recommendation_model.pkl: The saved model after training.

actual_vs_predicted_chart.png: A chart comparing actual and predicted ratings.

Software Requirements:
To open and work with this dataset, the following software and libraries are required:

Python 3.x

Pandas for data manipulation

Scikit-learn for training and evaluating machine learning models

Matplotlib for chart generation

Joblib for saving and loading the trained model

The dataset can be opened and processed using any Python environment that supports these libraries.

Additional Resources:

The model training code, README file, and performance chart are available in the project repository.

For detailed explanation and code, please refer to the GitHub repository (or any other relevant link for the code).

Further Details:

Dataset Reusability:
The dataset is structured for easy use in training machine learning models for recommendation systems. Researchers and practitioners can utilize it to:

Train other types of models (e.g., regression, classification).

Experiment with different features or add more metadata to enrich the dataset.

Data Integrity:
The dataset has been cleaned and preprocessed to remove invalid values (such as #NAME? or missing ratings). However, users should ensure they understand the structure and the preprocessing steps taken before reusing it.

Licensing:
The dataset is provided under the CC BY 4.0 license, which allows free usage, distribution, and modification, provided that proper attribution is given.

Facebook

Twitter

Click to copy link

Link copied

Cite

Doula Isham Rashik Hasan (2023). alpaca-train-validation-test-split [Dataset]. https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split

alpaca-train-validation-test-split

Alpaca

disham993/alpaca-train-validation-test-split

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Aug 12, 2023

Authors

Doula Isham Rashik Hasan

License

Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically

Description

Dataset Card for Alpaca

I have just performed train, test and validation split on the original dataset. Repository to reproduce this will be shared here soon. I am including the orignal Dataset card as follows.

  Dataset Summary

Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction-tuning for language models and make the language model follow instruction better.… See the full description on the dataset page: https://huggingface.co/datasets/disham993/alpaca-train-validation-test-split.

Clear search

Close search

Google apps

Main menu

alpaca-train-validation-test-split

PubMed (60%/20%/20% random splits) Dataset

Film (60%/20%/20% random splits) Dataset

Data Cleaning, Translation & Split of the Dataset for the Automatic...

FSOCO-split

Putnam-AXIOM-for-zip-fit-splits

Texas(60%/20%/20% random splits) Dataset

FAIR Dataset for Disease Prediction in Healthcare Applications

Dataset Description

Context and Methodology

Technical Details

Further Details

tripadvisor-split-dataset-v2

Training dataset for NABat Machine Learning V1.0

Data for the manuscript "Spatially resolved uncertainties for machine...

Data for the manuscript "Spatially resolved uncertainties for machine...

Data from: Fashion Mnist Dataset

Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms

Authors:

Dataset Obtained From: https://github.com/zalandoresearch/fashion-mnist

All images were sized 28x28 in the original dataset

Version 1 (original-images_Original-FashionMNIST-Splits):

Version 3 (original-images_trainSetSplitBy80_20):

Citation:

Natural Language Inference Evaluation Dataset

Columns

Distribution

Usage

Coverage

License

Who Can Use It

Dataset Name Suggestions

Attributes

layoutlmv3-invoice-dataset

Chroma training, test, and validation sets.

CAPYBARA: Decompiled Binary Functions and Related Summaries

ORBIT: A real-world few-shot dataset for teachable object recognition...

Data from: Transfer learning reveals sequence determinants of the...

Tour Recommendation Model

Dataset Description for Tour Recommendation Model

Context and Methodology:

Technical Details:

Further Details:

alpaca-train-validation-test-split

Alpaca

disham993/alpaca-train-validation-test-split