FStarDataSet-V2
huggingface.co
Updated Oct 13, 2025
Cite
Microsoft (2024). FStarDataSet-V2 [Dataset]. https://huggingface.co/datasets/microsoft/FStarDataSet-V2
Dataset updated
Oct 13, 2025
Dataset authored and provided by
Microsoft (http://microsoft.com/)
License
CDLA-Permissive-2.0: https://choosealicense.com/licenses/cdla-permissive-2.0/
Description
This dataset is the Version 2.0 of microsoft/FStarDataSet.

Primary-Objective

This dataset's primary objective is to train and evaluate Proof-oriented Programming with AI (PoPAI, in short). Given a specification of a program and proof in F*, the objective of an AI model is to synthesize the implementation (see below for details about the usage of this dataset, including the input and output).

Data Format

Each example in this dataset is organized as a dictionary… See the full description on the dataset page: https://huggingface.co/datasets/microsoft/FStarDataSet-V2.
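
Because the field list is truncated here, a reasonable first step is to load one example and look at which keys its dictionary actually has. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and that the repository loads with its default configuration in streaming mode. `summarize_example` and `first_example` are hypothetical helper names, and no field names are assumed.

```python
from typing import Any, Dict


def summarize_example(example: Dict[str, Any]) -> Dict[str, str]:
    """Map each field name of one example to the type name of its value."""
    return {key: type(value).__name__ for key, value in example.items()}


def first_example(repo_id: str = "microsoft/FStarDataSet-V2") -> Dict[str, Any]:
    """Stream the first record of the first available split (needs network access)."""
    # Imported lazily so summarize_example works without `datasets` installed.
    from datasets import load_dataset

    # Assumption: the default configuration loads without extra arguments.
    splits = load_dataset(repo_id, streaming=True)
    first_split = next(iter(splits.values()))
    return next(iter(first_split))
```

Calling `summarize_example(first_example())` would then list the keys of one record with their Python types, without downloading the whole dataset.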
Dataset of the paper: "How do Hugging Face Models Document Datasets, Bias,...
zenodo.org
data.niaid.nih.gov
zip
Updated Jan 16, 2024
Cite
Federica Pepe; Vittoria Nardone; Antonio Mastropaolo; Gerardo Canfora; Gabriele BAVOTA; Massimiliano Di Penta (2024). Dataset of the paper: "How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study" [Dataset]. http://doi.org/10.5281/zenodo.10058142
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.10058142
Dataset updated
Jan 16, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Federica Pepe; Vittoria Nardone; Antonio Mastropaolo; Gerardo Canfora; Gabriele BAVOTA; Massimiliano Di Penta
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This replication package contains datasets and scripts related to the paper: "*How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study*"

## Root directory
- `statistics.r`: R script used to compute the correlation between usage and downloads, and the RQ1/RQ2 inter-rater agreements
- `modelsInfo.zip`: zip file containing all the downloaded model cards (in JSON format)
- `script`: directory containing all the scripts used to collect and process data. For further details, see README file inside the script directory.

## Dataset
- `Dataset/Dataset_HF-models-list.csv`: list of HF models analyzed
- `Dataset/Dataset_github-prj-list.txt`: list of GitHub projects using the *transformers* library
- `Dataset/Dataset_github-Prj_model-Used.csv`: contains usage pairs: project, model
- `Dataset/Dataset_prj-num-models-reused.csv`: number of models used by each GitHub project
- `Dataset/Dataset_model-download_num-prj_correlation.csv`: for each model used by GitHub projects, the name, the task, the number of reusing projects, and the number of downloads
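
The listing does not say which correlation coefficient `statistics.r` computes. As an illustration only, the sketch below computes a Spearman rank correlation between reuse counts and download counts in plain Python. The column names `num_projects` and `downloads` are hypothetical, so check the CSV header before relying on them.

```python
import csv
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation is the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def spearman_from_csv(path: str, x_col: str = "num_projects", y_col: str = "downloads") -> float:
    """Read two numeric columns (hypothetical names) and correlate them."""
    with open(path, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
    return spearman([float(r[x_col]) for r in rows], [float(r[y_col]) for r in rows])
```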

## RQ1
- `RQ1/RQ1_dataset-list.txt`: list of HF datasets
- `RQ1/RQ1_datasetSample.csv`: sample set of models used for the manual analysis of datasets
- `RQ1/RQ1_analyzeDatasetTags.py`: Python script to analyze model tags for the presence of datasets. It requires unzipping `modelsInfo.zip` into a directory with the same name (`modelsInfo`) at the root of the replication package folder. It writes its output to stdout; redirect it to a file to be analyzed by the `RQ1/RQ1_countDataset.py` script
- `RQ1/RQ1_countDataset.py`: given the output of `RQ1/RQ1_analyzeDatasetTags.py` (passed as an argument), produces, for each model, a list of Booleans indicating whether (i) the model only declares HF datasets, (ii) the model only declares external datasets, (iii) the model declares both, and (iv) the model is part of the sample for the manual analysis
- `RQ1/RQ1_datasetTags.csv`: output of `RQ1/RQ1_analyzeDatasetTags.py`
- `RQ1/RQ1_dataset_usage_count.csv`: output of `RQ1/RQ1_countDataset.py`
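
To make the four Booleans concrete, here is a minimal sketch of how such flags could be derived for a single model. It is not the logic of the replication scripts, which is not shown here. The function name, its arguments, and the rule that a dataset counts as an HF dataset when it appears in a known list are all assumptions.

```python
from typing import Iterable, NamedTuple, Set


class DatasetFlags(NamedTuple):
    only_hf: bool
    only_external: bool
    both: bool
    in_manual_sample: bool


def classify_model(declared: Iterable[str], hf_datasets: Set[str],
                   model_id: str, manual_sample: Set[str]) -> DatasetFlags:
    """Derive the four flags from the datasets a model card declares."""
    declared_set = set(declared)
    # Assumption: a declared name is an HF dataset iff it is in the known HF list.
    has_hf = any(name in hf_datasets for name in declared_set)
    has_external = any(name not in hf_datasets for name in declared_set)
    return DatasetFlags(
        only_hf=has_hf and not has_external,
        only_external=has_external and not has_hf,
        both=has_hf and has_external,
        in_manual_sample=model_id in manual_sample,
    )
```

A model that declares nothing gets `False` for the first three flags, which keeps "no declaration" distinct from the three declared cases.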

## RQ2
- `RQ2/tableBias.pdf`: table detailing the number of occurrences of different types of bias by model Task
- `RQ2/RQ2_bias_classification_sheet.csv`: results of the manual labeling
- `RQ2/RQ2_isBiased.csv`: file to compute the inter-rater agreement of whether or not a model documents Bias
- `RQ2/RQ2_biasAgrLabels.csv`: file to compute the inter-rater agreement related to bias categories
- `RQ2/RQ2_final_bias_categories_with_levels.csv`: for each model in the sample, this file lists (i) the bias leaf category, (ii) the first-level category, and (iii) the intermediate category
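
The listing does not name the agreement statistic used for these files. Cohen's kappa is one common choice for two raters, so the sketch below shows it in plain Python as an assumption. The inputs are expected to be two equally long label sequences; nothing here reads the actual CSV files.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```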

## RQ3
- `RQ3/RQ3_LicenseValidation.csv`: manual validation of a sample of licenses
- `RQ3/RQ3_{NETWORK-RESTRICTIVE|RESTRICTIVE|WEAK-RESTRICTIVE|PERMISSIVE}-license-list.txt`: lists of licenses with different permissiveness
- `RQ3/RQ3_prjs_license.csv`: for each project linked to models, among other fields it indicates the license tag and name
- `RQ3/RQ3_models_license.csv`: for each model, indicates among other pieces of info, whether the model has a license, and if yes what kind of license
- `RQ3/RQ3_model-prj-license_contingency_table.csv`: usage contingency table between projects' licenses (columns) and models' licenses (rows)
- `RQ3/RQ3_models_prjs_licenses_with_type.csv`: pairs project-model, with their respective licenses and permissiveness level
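
As an illustration of how a usage contingency table with projects' licenses as columns and models' licenses as rows can be built from project-model pairs, here is a minimal sketch. The four level names come from the license-list file names above; the function name and the `(project_level, model_level)` input shape are assumptions, not the format of the CSV files.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

LEVELS = ["PERMISSIVE", "WEAK-RESTRICTIVE", "RESTRICTIVE", "NETWORK-RESTRICTIVE"]


def contingency_table(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, int]]:
    """Count usages; pairs are (project_level, model_level).

    The result is indexed as table[model_level][project_level], i.e. rows are
    model license levels and columns are project license levels.
    """
    counts = Counter(pairs)
    return {
        model: {project: counts[(project, model)] for project in LEVELS}
        for model in LEVELS
    }
```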

## scripts
Contains the scripts used to mine Hugging Face and GitHub. Details are in the enclosed README
wikihow
huggingface.co
tensorflow.org
Cite
William Yang Wang, wikihow [Dataset]. https://huggingface.co/datasets/wangwilliamyang/wikihow
Authors
William Yang Wang
Description
WikiHow is a new large-scale dataset using the online WikiHow (http://www.wikihow.com/) knowledge base.

There are two features:
- text: wikihow answers texts.
- headline: bold lines as summary.

There are two separate versions:
- all: consisting of the concatenation of all paragraphs as the articles and the bold lines as the reference summaries.
- sep: consisting of each paragraph and its summary.

Download "wikihowAll.csv" and "wikihowSep.csv" from https://github.com/mahnazkoupaee/WikiHow-Dataset and place them in the manual folder (see https://www.tensorflow.org/datasets/api_docs/python/tfds/download/DownloadConfig). Train/validation/test splits are provided by the authors. Preprocessing is applied to remove short articles (abstract length < 0.75 article length) and clean up extra commas.
dialogsum
huggingface.co
Updated Jun 29, 2022
Cite
Karthick Kaliannan Neelamohan (2022). dialogsum [Dataset]. https://huggingface.co/datasets/knkarthick/dialogsum
Dataset updated
Jun 29, 2022
Authors
Karthick Kaliannan Neelamohan
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Dataset Card for DIALOGSum Corpus

Dataset Description Links

- Homepage: https://aclanthology.org/2021.findings-acl.449
- Repository: https://github.com/cylnlp/dialogsum
- Paper: https://aclanthology.org/2021.findings-acl.449
- Point of Contact: https://huggingface.co/knkarthick

Dataset Summary

DialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 (Plus 100 holdout data for topic generation) dialogues with corresponding… See the full description on the dataset page: https://huggingface.co/datasets/knkarthick/dialogsum.
CoAID dataset with multiple extracted features (both sparse and dense)
data.niaid.nih.gov
zenodo.org
Updated Jun 10, 2022
Cite
Guillaume Bernard (2022). CoAID dataset with multiple extracted features (both sparse and dense) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6630404
Dataset updated
Jun 10, 2022
Dataset authored and provided by
Guillaume Bernard
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is a publication of the CoAID dataset, originally dedicated to fake news detection. Here we repurpose the dataset for event tracking in press documents.

Cui, Limeng, and Dongwon Lee. 2020. "CoAID: COVID-19 Healthcare Misinformation Dataset." arXiv:2006.00885 [cs], November. http://arxiv.org/abs/2006.00885.

In this dataset, we provide multiple features extracted from the text itself. Please note the text is missing from the dataset published in the CSV format for copyright reasons. You can download the original datasets and manually add the missing texts from the original publications.

Features are extracted using:

A corpus of reference articles in multiple languages for TF-IDF weighting. (features_news) [1]

A corpus of tweets reporting news for TF-IDF weighting. (features_tweets) [1]

An S-BERT model [2] that uses distiluse-base-multilingual-cased-v1 (called features_use) 3

An S-BERT model [2] that uses paraphrase-multilingual-mpnet-base-v2 (called features_mpnet) 4
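
As a rough illustration of how dense features like `features_use` and `features_mpnet` can be produced, the sketch below encodes texts with the `sentence-transformers` package. It is an assumption that the published features were computed this way and with default settings; the `FEATURE_MODELS` mapping and the `encode_texts` helper are hypothetical names.

```python
from typing import Dict, List

# Feature-set names from the description, mapped to the model names it cites.
FEATURE_MODELS: Dict[str, str] = {
    "features_use": "distiluse-base-multilingual-cased-v1",
    "features_mpnet": "paraphrase-multilingual-mpnet-base-v2",
}


def encode_texts(texts: List[str], feature_name: str):
    """Return one dense vector per text for the given feature set."""
    # Lazy import: needs `pip install sentence-transformers`; the first call
    # also downloads the model weights.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(FEATURE_MODELS[feature_name])
    return model.encode(texts)
```

Since the texts themselves are not distributed in the CSV files, this only becomes useful after the missing texts have been restored from the original publications.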

References:

[1]: Guillaume Bernard. (2022). Resources to compute TF-IDF weightings on press articles and tweets (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6610406

[2]: Reimers, Nils, and Iryna Gurevych. 2019. "Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks." In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3982-92. Hong Kong, China: Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1410.
MeDAL Dataset
kaggle.com
opendatalab.com
zip
Updated Nov 16, 2020
Cite
xhlulu (2020). MeDAL Dataset [Dataset]. https://www.kaggle.com/xhlulu/medal-emnlp
Available download formats: zip (7324382521 bytes)
Dataset updated
Nov 16, 2020
Authors
xhlulu
Description
Medical Dataset for Abbreviation Disambiguation for Natural Language Understanding (MeDAL) is a large medical text dataset curated for abbreviation disambiguation, designed for natural language understanding pre-training in the medical domain. It was published at the ClinicalNLP workshop at EMNLP.

💻 Code 🤗 Dataset (Hugging Face) 💾 Dataset (Kaggle) 💽 Dataset (Zenodo) 📜 Paper (ACL) 📝 Paper (Arxiv) ⚡ Pre-trained ELECTRA (Hugging Face)

Downloading the data

We recommend downloading from Kaggle if you can authenticate through their API. The advantage to Kaggle is that the data is compressed, so it will be faster to download. Links to the data can be found at the top of the readme.

First, you will need to create an account on kaggle.com. Afterwards, you will need to install the kaggle API:

    pip install kaggle

Then, you will need to follow the instructions here to add your username and key. Once that's done, you can run:

    kaggle datasets download xhlulu/medal-emnlp

Now, unzip everything and place them inside the data directory:

    unzip -nq crawl-300d-2M-subword.zip -d data
    mv data/pretrain_sample/* data/

Loading FastText Embeddings

For the LSTM models, we will need to use the fastText embeddings. To do so, first download and extract the weights:

    wget -nc -P data/ https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip
    unzip -nq data/crawl-300d-2M-subword.zip -d data/

Model Quickstart

Using Torch Hub

You can directly load LSTM and LSTM-SA with torch.hub:

```python
import torch

lstm = torch.hub.load("BruceWen120/medal", "lstm")
lstm_sa = torch.hub.load("BruceWen120/medal", "lstm_sa")
```

If you want to use the Electra model, you need to first install transformers: `pip install transformers`. Then, you can load it with torch.hub:

```python
import torch

electra = torch.hub.load("BruceWen120/medal", "electra")
```

Using Huggingface transformers

If you are only interested in the pre-trained ELECTRA weights (without the disambiguation head), you can load it directly from the Hugging Face Repository:

    from transformers import AutoModel, AutoTokenizer

    model = AutoModel.from_pretrained("xhlu/electra-medal")
    tokenizer = AutoTokenizer.from_pretrained("xhlu/electra-medal")

Citation

Download the bibtex here, or copy the text below:

    @inproceedings{wen-etal-2020-medal,
        title = "{M}e{DAL}: Medical Abbreviation Disambiguation Dataset for Natural Language Understanding Pretraining",
        author = "Wen, Zhi and Lu, Xing Han and Reddy, Siva",
        booktitle = "Proceedings of the 3rd Clinical Natural Language Processing Workshop",
        month = nov,
        year = "2020",
        address = "Online",
        publisher = "Association for Computational Linguistics",
        url = "https://www.aclweb.org/anthology/2020.clinicalnlp-1.15",
        pages = "130--135",
    }

License, Terms and Conditions

The ELECTRA model is licensed under Apache 2.0. The licenses for the libraries used in this project (transformers, pytorch, etc.) can be found in their respective GitHub repositories. Our model is released under an MIT license.

The original dataset was retrieved and modified from the NLM website. By using this dataset, you are bound by the terms and conditions specified by NLM:

INTRODUCTION

Downloading data from the National Library of Medicine FTP servers indicates your acceptance of the following Terms and Conditions: No charges, usage fees or royalties are paid to NLM for this data.

MEDLINE/PUBMED SPECIFIC TERMS

NLM freely provides PubMed/MEDLINE data. Please note some PubMed/MEDLINE abstracts may be protected by copyright.

GENERAL TERMS AND CONDITIONS

Users of the data agree to:

acknowledge NLM as the source of the data by including the phrase "Courtesy of the U.S. National Library of Medicine" in a clear and conspicuous manner,

properly use registration and/or trademark symbols when referring to NLM products, and

not indicate or imply that NLM has endorsed its products/services/applications.

Users who republish or redistribute the data (services, products or raw data) agree to:

maintain the most current version of all distributed data, or

make known in a clear and conspicuous manner that the products/services/applications do not reflect the most current/accurate data available from NLM.

These data are produced with a reasonable standard of care, but NLM makes no warranties express or implied, including no warranty of merchantability or fitness for particular purpose, regarding the accuracy or completeness of the data. Users agree to hold NLM and the U.S. Government harmless from any liability resulting from errors in the data. NLM disclaims any liability for any consequences due to use, misuse, or interpretation of information contained or not contained in the data.

NLM does not provide legal advice regarding copyright, fair use, or other aspects of intellectual property rights. See the NLM Copyright page.

NLM reserves the right to change the type and format of its machine-readable data. NLM will take reasonable steps to inform users of any changes to the format of the data before the data are distributed via the announcement section or subscription to email and RSS updates.
alpaca
huggingface.co
opendatalab.com
Updated Mar 14, 2023
Cite
Tatsu Lab (2023). alpaca [Dataset]. https://huggingface.co/datasets/tatsu-lab/alpaca
Dataset updated
Mar 14, 2023
Dataset authored and provided by
Tatsu Lab
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Dataset Card for Alpaca

Dataset Summary

Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction-tuning for language models and make the language model follow instructions better. The authors built on the data generation pipeline from the Self-Instruct framework and made the following modifications:

The text-davinci-003 engine to generate the instruction data instead… See the full description on the dataset page: https://huggingface.co/datasets/tatsu-lab/alpaca.
lsun_church_train
huggingface.co
Updated Oct 2, 2025
Cite
The Generative Landscape (2025). lsun_church_train [Dataset]. https://huggingface.co/datasets/tglcourse/lsun_church_train
Dataset updated
Oct 2, 2025
Dataset authored and provided by
The Generative Landscape
Description
Dataset Card for "lsun_church_train"

Uploading lsun church train dataset for convenience. I've split this into 119915 train and 6312 test, but if you want the original test set see https://github.com/fyu/lsun

Notebook that I used to download then upload this dataset: https://colab.research.google.com/drive/1_f-D2ENgmELNSB51L1igcnLx63PkveY2?usp=sharing

More Information needed
SlimPajama-627B
huggingface.co
opendatalab.com
Updated Oct 2, 2012
Cite
Cerebras (2012). SlimPajama-627B [Dataset]. https://huggingface.co/datasets/cerebras/SlimPajama-627B
Dataset updated
Oct 2, 2012
Dataset authored and provided by
Cerebras
Description
The dataset consists of 59166 jsonl files and is ~895GB compressed. It is a cleaned and deduplicated version of Together's RedPajama. Check out our blog post explaining our methods, our code on GitHub, and join the discussion on the Cerebras Discord.

Getting Started

You can download the dataset using Hugging Face datasets:

    from datasets import load_dataset

    ds = load_dataset("cerebras/SlimPajama-627B")
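
Given the ~895GB compressed size, it may be preferable to stream records instead of downloading everything first. The following is a minimal sketch, assuming the streaming mode of the `datasets` library works for this repository and that a `train` split exists. `peek` and `stream_slimpajama` are hypothetical helper names.

```python
from itertools import islice
from typing import Any, Iterable, List


def peek(records: Iterable[Any], n: int = 3) -> List[Any]:
    """Return at most the first n items of an iterable."""
    return list(islice(records, n))


def stream_slimpajama(split: str = "train"):
    """Open the dataset lazily; records are fetched only while iterating."""
    # Lazy import: needs `pip install datasets` and network access when called.
    from datasets import load_dataset

    # Assumption: a split with this name exists in the repository.
    return load_dataset("cerebras/SlimPajama-627B", split=split, streaming=True)
```

With this, `peek(stream_slimpajama())` would read only the first few records, which is usually enough to check the record layout before committing disk space.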

Background

Today we are releasing SlimPajama – the largest… See the full description on the dataset page: https://huggingface.co/datasets/cerebras/SlimPajama-627B.
DiffusionDB-2M –– Part 0201 to 0300 of 2000
kaggle.com
Updated Feb 15, 2023
Cite
Darien Schettler (2023). DiffusionDB-2M –– Part 0201 to 0300 of 2000 [Dataset]. https://www.kaggle.com/datasets/dschettler8845/diffusiondb-2m-part-0201-to-0300-of-2000
Dataset updated
Feb 15, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Darien Schettler
Description
IMPORTANT NOTE

This Kaggle dataset represents 1 of 20 parts that make up the DiffusionDB 2Million Image-Prompt Subset Dataset. The description and information below come directly from the DiffusionDB website and are not specific to this Kaggle dataset. I am not the author of this work; I am just the person who downloaded, unzipped, rezipped, and uploaded it to Kaggle. Please pass all your kudos and whatnot on to the original authors at the following website: https://poloclub.github.io/diffusiondb/

Links to the other parts of this dataset available on Kaggle can be found below:

DiffusionDB-2M –– Part 0001 to 0100 of 2000

DiffusionDB-2M –– Part 0101 to 0200 of 2000

DiffusionDB-2M –– Part 0201 to 0300 of 2000

TBD

DiffusionDB

Dataset Description

Homepage: DiffusionDB homepage

Repository: DiffusionDB repository

Distribution: DiffusionDB Hugging Face Dataset

Paper: [DiffusionDB: A Large-scale Prompt Gallery Dataset for Text-to...
instruction-dataset
huggingface.co
opendatalab.com
Updated Feb 10, 2023
Cite
Hugging Face H4 (2023). instruction-dataset [Dataset]. https://huggingface.co/datasets/HuggingFaceH4/instruction-dataset
Dataset updated
Feb 10, 2023
Dataset provided by
Hugging Face (https://huggingface.co/)
Authors
Hugging Face H4
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
This is the blind eval dataset of high-quality, diverse, human-written instructions with demonstrations. We will be using this for step 3 evaluations in our RLHF pipeline.
ktda-datasets
huggingface.co
Updated Dec 8, 2024
Cite
XavierJiezou (2024). ktda-datasets [Dataset]. https://huggingface.co/datasets/XavierJiezou/ktda-datasets
Dataset updated
Dec 8, 2024
Authors
XavierJiezou
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
KTDA-Datasets

This dataset card aims to describe the datasets used in the KTDA.

Install

pip install huggingface-hub

Usage

Step 1: Download datasets

    huggingface-cli download --repo-type dataset XavierJiezou/ktda-datasets --local-dir data --include grass.zip
    huggingface-cli download --repo-type dataset XavierJiezou/ktda-datasets --local-dir data --include cloud.zip

Step 2: Extract datasets

    unzip grass.zip -d grass
    unzip cloud.zip -d l8_biome…

See the full description on the dataset page: https://huggingface.co/datasets/XavierJiezou/ktda-datasets.
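
The same two steps can also be scripted from Python. The following is a minimal sketch, assuming the `huggingface_hub` package is installed; it uses `hf_hub_download` for the download and the standard `zipfile` module for extraction. The helper names `download_archive` and `extract_archive` are hypothetical, and the target directory is left to the caller because the extraction path above is truncated.

```python
import zipfile
from pathlib import Path
from typing import List


def download_archive(filename: str, local_dir: str = "data") -> str:
    """Download one archive from the dataset repository and return its local path."""
    # Lazy import: needs `pip install huggingface-hub` and network access when called.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id="XavierJiezou/ktda-datasets",
        filename=filename,
        repo_type="dataset",
        local_dir=local_dir,
    )


def extract_archive(archive_path: str, target_dir: str) -> List[str]:
    """Extract a zip archive into target_dir and return the member names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
        return archive.namelist()
```

For example, `extract_archive(download_archive("grass.zip"), "grass")` would mirror the first download and extract pair shown above.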
VLM4Bio
huggingface.co
Updated Oct 6, 2025
Cite
HDR Imageomics Institute (2025). VLM4Bio [Dataset]. http://doi.org/10.57967/hf/3393
Unique identifier
https://doi.org/10.57967/hf/3393
Dataset updated
Oct 6, 2025
Dataset authored and provided by
HDR Imageomics Institute
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset Card for VLM4Bio

Instructions for downloading the dataset

Install Git LFS. Git clone the VLM4Bio repository to download all metadata and associated files. Run the following commands in a terminal:

    git clone https://huggingface.co/datasets/imageomics/VLM4Bio
    cd VLM4Bio

Downloading and processing bird images

To download the bird images, run the following command:

bash download_bird_images.sh

This should download the bird images inside datasets/Bird/images… See the full description on the dataset page: https://huggingface.co/datasets/imageomics/VLM4Bio.
howto100m
huggingface.co
Updated Jun 30, 2022
Cite
HuggingFaceM4 (2022). howto100m [Dataset]. https://huggingface.co/datasets/HuggingFaceM4/howto100m
Dataset updated
Jun 30, 2022
Dataset authored and provided by
HuggingFaceM4
Description
HowTo100M is a large-scale dataset of narrated videos with an emphasis on instructional videos where content creators teach complex tasks with an explicit intention of explaining the visual content on screen. HowTo100M features a total of:
- 136M video clips with captions sourced from 1.2M YouTube videos (15 years of video)
- 23k activities from domains such as cooking, hand crafting, personal care, gardening or fitness

Each video is associated with a narration available as subtitles automatically downloaded from YouTube.
mmcows
huggingface.co
Updated Mar 4, 2025
Cite
NEIS Lab @ Purdue (2025). mmcows [Dataset]. http://doi.org/10.57967/hf/5965
Unique identifier
https://doi.org/10.57967/hf/5965
Dataset updated
Mar 4, 2025
Dataset authored and provided by
NEIS Lab @ Purdue
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
MmCows: A Multimodal Dataset for Dairy Cattle Monitoring

Details of the dataset and benchmarks are available here. For a quick overview of the dataset, please check this video.

Instructions for downloading

1. Install requirements

pip install huggingface_hub

See the file structure here for the next step.

2. Download a file individually

To download visual_data.zip to your local-dir, use command line: huggingface-cli download
neis-lab/mmcows \… See the full description on the dataset page: https://huggingface.co/datasets/neis-lab/mmcows.
openai_humaneval
huggingface.co
Updated Jan 1, 2022
Cite
OpenAI (2022). openai_humaneval [Dataset]. https://huggingface.co/datasets/openai/openai_humaneval
Dataset updated
Jan 1, 2022
Dataset authored and provided by
OpenAI (http://openai.com/)
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset Card for OpenAI HumanEval

Dataset Summary

The HumanEval dataset released by OpenAI includes 164 programming problems with a function signature, docstring, body, and several unit tests. They were handwritten to ensure they are not included in the training sets of code generation models.

Supported Tasks and Leaderboards

Languages

The programming problems are written in Python and contain English natural text in comments and docstrings.… See the full description on the dataset page: https://huggingface.co/datasets/openai/openai_humaneval.
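
To illustrate how a problem made of a signature, docstring, body, and unit tests can be checked, here is a toy sketch. The `problem` dictionary and its keys (`header`, `body`, `tests`) are invented for this example and are not claimed to match the dataset's actual field names. Real evaluations should run generated code in a sandbox rather than with a bare `exec`.

```python
from typing import Dict

# A made-up problem in the same spirit: signature + docstring, body, unit tests.
problem: Dict[str, str] = {
    "header": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    "body": "    return a + b\n",
    "tests": "assert add(1, 2) == 3\nassert add(-1, 1) == 0\n",
}


def passes_tests(header: str, completion: str, tests: str) -> bool:
    """Return True if header + completion defines code that passes the tests."""
    namespace: dict = {}
    try:
        exec(header + completion, namespace)  # define the candidate function
        exec(tests, namespace)                # run the unit tests against it
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as a fail.
        return False
    return True
```

Passing the reference `body` returns `True`, while a wrong completion such as `"    return a - b\n"` fails the first assertion and returns `False`.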
the-reddit-dataset-dataset
huggingface.co
Updated Jun 25, 2022
Cite
SocialGrep (2022). the-reddit-dataset-dataset [Dataset]. https://huggingface.co/datasets/SocialGrep/the-reddit-dataset-dataset
Dataset updated
Jun 25, 2022
Authors
SocialGrep
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A meta dataset of Reddit's own /r/datasets community.
databricks-dolly-15k
huggingface.co
Cite
Databricks, databricks-dolly-15k [Dataset]. https://huggingface.co/datasets/databricks/databricks-dolly-15k
Dataset authored and provided by
Databricks (http://databricks.com/)
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
Summary

databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. This dataset can be used for any purpose, whether academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported… See the full description on the dataset page: https://huggingface.co/datasets/databricks/databricks-dolly-15k.
DL3DV-ALL-4K
huggingface.co
Updated Feb 4, 2024
Cite
DL3DV (2024). DL3DV-ALL-4K [Dataset]. https://huggingface.co/datasets/DL3DV/DL3DV-ALL-4K
Dataset updated
Feb 4, 2024
Dataset authored and provided by
DL3DV
Description
DL3DV-Dataset

This repo has all the 4K frames with camera poses of the DL3DV-10K Dataset. We are working hard to review the entire dataset to avoid sensitive information. Thank you for your patience.

Download

If you have enough space, you can use git to download a dataset from huggingface. See this link. The 480P/960P versions should satisfy most needs. If you do not have enough space, we further provide a download script here to download a subset. The usage: usage: download.py… See the full description on the dataset page: https://huggingface.co/datasets/DL3DV/DL3DV-ALL-4K.
TAO-Amodal
huggingface.co
Updated Jan 17, 2024
Cite
Cheng-Yen Hsieh (2024). TAO-Amodal [Dataset]. https://huggingface.co/datasets/chengyenhsieh/TAO-Amodal
Dataset updated
Jan 17, 2024
Authors
Cheng-Yen Hsieh
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
TAO-Amodal Dataset

Official Source for Downloading the TAO-Amodal and TAO Dataset. 📙 Project Page | 💻 Code | 📎 Paper Link | ✏️ Citations

Contact: 🙋🏻‍♂️Cheng-Yen (Wesley) Hsieh

Dataset Description

Our dataset augments the TAO dataset with amodal bounding box annotations for fully invisible, out-of-frame, and occluded objects. Note that this implies TAO-Amodal also includes modal segmentation masks (as visualized in the color overlays above). Our… See the full description on the dataset page: https://huggingface.co/datasets/chengyenhsieh/TAO-Amodal.
