This repository contains the mapping from integer IDs to actual label names (in HuggingFace Transformers typically called id2label) for several datasets. Current datasets include:
ImageNet-1k, ImageNet-22k (also called ImageNet-21k, as there are 21,843 classes), COCO detection 2017, COCO panoptic 2017, ADE20k (actually the MIT Scene Parsing benchmark, which is a subset of ADE20k), Cityscapes, VQAv2, Kinetics-700, RVL-CDIP, PASCAL VOC, Kinetics-400, ...
You can read in a label file as follows (using… See the full description on the dataset page: https://huggingface.co/datasets/huggingface/label-files.
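The loading snippet above is truncated; as a minimal sketch of the idea (using an inline JSON fragment instead of downloading from the Hub, and assuming the label files are JSON objects mapping string ids to label names):

```python
import json

# A label file in this repo is plain JSON mapping string ids to label names;
# the fragment below stands in for a downloaded file (e.g. from ImageNet-1k).
raw = '{"0": "tench", "1": "goldfish", "2": "great white shark"}'

# JSON keys are always strings, so convert them to ints to get the
# id2label dict that HuggingFace Transformers configs expect.
id2label = {int(k): v for k, v in json.loads(raw).items()}
label2id = {v: k for k, v in id2label.items()}

print(id2label[0])           # tench
print(label2id["goldfish"])  # 1
```

In practice the file would be fetched from the Hub first; the key-conversion step is the part worth noting.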
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Public Policy at Hugging Face
AI Policy at Hugging Face is a multidisciplinary, cross-organizational workstream. Instead of sitting in a vertical communications or global affairs organization, our policy work is rooted in the expertise of our many researchers and developers, from the Ethics and Society Regulars and the legal team to machine learning engineers working on healthcare, art, and evaluations. What we work on is informed by our Hugging Face community's needs and experiences… See the full description on the dataset page: https://huggingface.co/datasets/huggingface/policy-docs.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Instruct Me is a dataset of instruction-like dialogues between a human user and an AI assistant. The prompts are derived from (prompt, completion) pairs in the Helpful Instructions dataset. The goal is to train a language model that is "chatty" and can answer the kinds of questions or tasks a human user might instruct an AI assistant to perform.
https://choosealicense.com/licenses/cc/
FineVideo
Terms of use for FineVideo… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFV/finevideo.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Cosmopedia v0.1
Image generated by DALL-E; the prompt was generated by Mixtral-8x7B-Instruct-v0.1.
Note: Cosmopedia v0.2 is available at smollm-corpus.
User: What do you think "Cosmopedia" could mean? Hint: in our case it's not related to cosmology.
Mixtral-8x7B-Instruct-v0.1: A possible meaning for "Cosmopedia" could be an encyclopedia or collection of information about different cultures, societies, and topics from around the world, emphasizing diversity and global… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/cosmopedia.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for H4 Stack Exchange Preferences Dataset
Dataset Summary
This dataset contains questions and answers from the Stack Overflow Data Dump for the purpose of preference model training. Importantly, the questions have been filtered to fit the following criterion for preference models (following closely Askell et al., 2021): each question must have >= 2 answers. This data could also be used for instruction fine-tuning and language model training. The questions are… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences.
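The >=2-answers filter exists so that every retained question can yield at least one preference pair. A toy sketch of that filtering and pairing (the field names and score-based ranking here are illustrative assumptions, not the dataset's actual schema):

```python
# Toy sketch: keep only questions with >= 2 answers, then build
# (chosen, rejected) preference pairs by ranking answers on score.
# Field names are illustrative, not the dataset's actual schema.
questions = [
    {"title": "q1", "answers": [{"text": "a", "score": 10},
                                {"text": "b", "score": 2}]},
    {"title": "q2", "answers": [{"text": "c", "score": 5}]},  # dropped: one answer
]

pairs = []
for q in questions:
    if len(q["answers"]) < 2:
        continue  # a single answer cannot form a preference pair
    ranked = sorted(q["answers"], key=lambda a: a["score"], reverse=True)
    # Pair the top-scored answer against each lower-scored one.
    for worse in ranked[1:]:
        pairs.append({"question": q["title"],
                      "chosen": ranked[0]["text"],
                      "rejected": worse["text"]})

print(pairs)  # one pair, from q1
```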
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for NewsWire
Dataset Summary
NewsWire contains 2.7 million unique public domain U.S. news wire articles, written between 1878 and 1977. Locations in these articles are georeferenced, topics are tagged using customized neural topic classification, named entities are recognized, and individuals are disambiguated to Wikipedia using a novel entity disambiguation model.
Languages
English (en)
Dataset Structure
Each year in… See the full description on the dataset page: https://huggingface.co/datasets/dell-research-harvard/newswire.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for UltraChat 200k
Dataset Description
This is a heavily filtered version of the UltraChat dataset and was used to train Zephyr-7B-β, a state-of-the-art 7B chat model. The original dataset consists of 1.4M dialogues generated by ChatGPT, spanning a wide range of topics. To create UltraChat 200k, we applied the following logic:
Selection of a subset of data for faster supervised fine-tuning. Truecasing of the dataset, as we observed around 5% of… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k.
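Truecasing here means restoring capitalization to text that was lowercased. A naive illustration of the idea (not the pipeline the UltraChat 200k authors actually used, which would also handle proper nouns):

```python
import re

def naive_truecase(text: str) -> str:
    """Capitalize the first letter of each sentence.

    A crude illustration of truecasing; real truecasers use statistical
    models and also restore case on proper nouns, which this does not.
    """
    # Uppercase the first letter of the string and any lowercase letter
    # that follows sentence-ending punctuation plus whitespace.
    return re.sub(r"(?:^|(?<=[.!?]\s))[a-z]",
                  lambda m: m.group(0).upper(),
                  text)

print(naive_truecase("hello there. how are you?"))  # Hello there. How are you?
```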
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Summary
databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. This dataset can be used for any purpose, whether academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported… See the full description on the dataset page: https://huggingface.co/datasets/databricks/databricks-dolly-15k.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is the blind eval dataset of high-quality, diverse, human-written instructions with demonstrations. We will be using this for step 3 evaluations in our RLHF pipeline.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for Helpful Instructions
Dataset Summary
Helpful Instructions is a dataset of (instruction, demonstration) pairs derived from public datasets. As the name suggests, it focuses on instructions that are "helpful", i.e., the kinds of questions or tasks a human user might instruct an AI assistant to perform. You can load the dataset as follows: from datasets import load_dataset
helpful_instructions =… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/helpful-instructions.
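The load_dataset call above is truncated; separately, the derivation the summary describes, turning (prompt, completion) pairs into (instruction, demonstration) pairs, can be sketched with toy records (the field names are assumptions based on the card's wording, and the real data comes from datasets.load_dataset, not inline literals):

```python
# Sketch of the (prompt, completion) -> (instruction, demonstration)
# renaming the card describes, applied to toy records.
source = [
    {"prompt": "Summarize this article.", "completion": "The article argues ..."},
    {"prompt": "Translate 'bonjour' to English.", "completion": "Hello."},
]

helpful = [{"instruction": r["prompt"], "demonstration": r["completion"]}
           for r in source]

print(helpful[0]["instruction"])  # Summarize this article.
```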
huggingface/paper-central-data-2 dataset hosted on Hugging Face and contributed by the HF Datasets community
ftopal/huggingface-datasets dataset hosted on Hugging Face and contributed by the HF Datasets community
Container dataset for demonstrating Hugging Face models on Redivis. It currently contains only a single BERT model, but may expand in the future.
MBanks50/huggingface dataset hosted on Hugging Face and contributed by the HF Datasets community
EKKADMAUR/Locations dataset hosted on Hugging Face and contributed by the HF Datasets community
WendyHoang/real-names-real-companies-real-locations-v0.1 dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Dataset Card for DIALOGSum Corpus
Dataset Description
Links
Homepage: https://aclanthology.org/2021.findings-acl.449 Repository: https://github.com/cylnlp/dialogsum Paper: https://aclanthology.org/2021.findings-acl.449 Point of Contact: https://huggingface.co/knkarthick
Dataset Summary
DialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 dialogues (plus 100 held-out dialogues for topic generation) with… See the full description on the dataset page: https://huggingface.co/datasets/knkarthick/dialogsum.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is part of Anthropic's HH data used to train their RLHF assistant: https://github.com/anthropics/hh-rlhf. The data contains the first utterance from the human to the dialogue agent and the number of words in that utterance. The sampled version is a random sample of size 200.
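A minimal sketch of the extraction this entry describes, assuming the "Human:"/"Assistant:" turn markers used in the hh-rlhf transcripts:

```python
def first_human_utterance(transcript: str) -> tuple[str, int]:
    """Return the first human turn and its word count from an
    hh-rlhf-style transcript with 'Human:'/'Assistant:' markers."""
    # Take the text between the first "Human:" and the next "Assistant:".
    after_human = transcript.split("Human:", 1)[1]
    utterance = after_human.split("Assistant:", 1)[0].strip()
    # Word count via whitespace splitting, a simplifying assumption.
    return utterance, len(utterance.split())

dialog = "Human: How do I bake bread at home? Assistant: Start with flour..."
text, n_words = first_human_utterance(dialog)
print(text)     # How do I bake bread at home?
print(n_words)  # 7
```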
huggingface/transformers-stats-space-data dataset hosted on Hugging Face and contributed by the HF Datasets community