This repository contains the mapping from integer IDs to actual label names (in HuggingFace Transformers typically called id2label) for several datasets. Current datasets include:
ImageNet-1k, ImageNet-22k (also called ImageNet-21k, as there are 21,843 classes), COCO detection 2017, COCO panoptic 2017, ADE20k (actually the MIT Scene Parsing benchmark, which is a subset of ADE20k), Cityscapes, VQAv2, Kinetics-700, RVL-CDIP, PASCAL VOC, Kinetics-400, ...
You can read in a label file as follows (using… See the full description on the dataset page: https://huggingface.co/datasets/huggingface/label-files.
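The loading snippet above is truncated; as a minimal sketch of the idea (using an inline JSON fragment instead of downloading from the Hub, and assuming the label files are JSON objects mapping string ids to label names):

```python
import json

# A label file in this repo is plain JSON mapping string ids to label names;
# the fragment below stands in for a downloaded file (e.g. from ImageNet-1k).
raw = '{"0": "tench", "1": "goldfish", "2": "great white shark"}'

# JSON keys are always strings, so convert them to ints to get the
# id2label dict that HuggingFace Transformers configs expect.
id2label = {int(k): v for k, v in json.loads(raw).items()}
label2id = {v: k for k, v in id2label.items()}

print(id2label[0])           # tench
print(label2id["goldfish"])  # 1
```

In practice the file would be fetched from the Hub first; the key-conversion step is the part worth noting.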
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Public Policy at Hugging Face
AI Policy at Hugging Face is a multidisciplinary, cross-organizational workstream. Instead of sitting in a vertical communications or global affairs organization, our policy work is rooted in the expertise of our many researchers and developers, from the Ethics and Society Regulars and the legal team to machine learning engineers working on healthcare, art, and evaluations. What we work on is informed by our Hugging Face community's needs and experiences… See the full description on the dataset page: https://huggingface.co/datasets/huggingface/policy-docs.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Instruct Me is a dataset of instruction-like dialogues between a human user and an AI assistant. The prompts are derived from (prompt, completion) pairs in the Helpful Instructions dataset. The goal is to train a language model that is "chatty" and can answer the kinds of questions or tasks a human user might instruct an AI assistant to perform.
https://choosealicense.com/licenses/cc/
FineVideo
Terms of use for FineVideo… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFV/finevideo.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Cosmopedia v0.1
Image generated by DALL-E; the prompt was generated by Mixtral-8x7B-Instruct-v0.1.
Note: Cosmopedia v0.2 is available at smollm-corpus.
User: What do you think "Cosmopedia" could mean? Hint: in our case it's not related to cosmology.
Mixtral-8x7B-Instruct-v0.1: A possible meaning for "Cosmopedia" could be an encyclopedia or collection of information about different cultures, societies, and topics from around the world, emphasizing diversity and global… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/cosmopedia.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for H4 Stack Exchange Preferences Dataset
Dataset Summary
This dataset contains questions and answers from the Stack Overflow Data Dump for the purpose of preference model training. Importantly, the questions have been filtered to fit the following criterion for preference models (following closely Askell et al., 2021): each question must have >= 2 answers. This data could also be used for instruction fine-tuning and language model training. The questions are… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences.
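The >=2-answers filter exists so that every retained question can yield at least one preference pair. A toy sketch of that filtering and pairing (the field names and score-based ranking here are illustrative assumptions, not the dataset's actual schema):

```python
# Toy sketch: keep only questions with >= 2 answers, then build
# (chosen, rejected) preference pairs by ranking answers on score.
# Field names are illustrative, not the dataset's actual schema.
questions = [
    {"title": "q1", "answers": [{"text": "a", "score": 10},
                                {"text": "b", "score": 2}]},
    {"title": "q2", "answers": [{"text": "c", "score": 5}]},  # dropped: one answer
]

pairs = []
for q in questions:
    if len(q["answers"]) < 2:
        continue  # a single answer cannot form a preference pair
    ranked = sorted(q["answers"], key=lambda a: a["score"], reverse=True)
    # Pair the top-scored answer against each lower-scored one.
    for worse in ranked[1:]:
        pairs.append({"question": q["title"],
                      "chosen": ranked[0]["text"],
                      "rejected": worse["text"]})

print(pairs)  # one pair, from q1
```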
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for NewsWire
Dataset Summary
NewsWire contains 2.7 million unique public domain U.S. news wire articles, written between 1878 and 1977. Locations in these articles are georeferenced, topics are tagged using customized neural topic classification, named entities are recognized, and individuals are disambiguated to Wikipedia using a novel entity disambiguation model.
Languages
English (en)
Dataset Structure
Each year in… See the full description on the dataset page: https://huggingface.co/datasets/dell-research-harvard/newswire.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for UltraChat 200k
Dataset Description
This is a heavily filtered version of the UltraChat dataset and was used to train Zephyr-7B-β, a state-of-the-art 7B chat model. The original dataset consists of 1.4M dialogues generated by ChatGPT, spanning a wide range of topics. To create UltraChat 200k, we applied the following logic:
Selection of a subset of data for faster supervised fine-tuning. Truecasing of the dataset, as we observed around 5% of… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k.
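Truecasing here means restoring capitalization to text that was lowercased. A naive illustration of the idea (not the pipeline the UltraChat 200k authors actually used, which would also handle proper nouns):

```python
import re

def naive_truecase(text: str) -> str:
    """Capitalize the first letter of each sentence.

    A crude illustration of truecasing; real truecasers use statistical
    models and also restore case on proper nouns, which this does not.
    """
    # Uppercase the first letter of the string and any lowercase letter
    # that follows sentence-ending punctuation plus whitespace.
    return re.sub(r"(?:^|(?<=[.!?]\s))[a-z]",
                  lambda m: m.group(0).upper(),
                  text)

print(naive_truecase("hello there. how are you?"))  # Hello there. How are you?
```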
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Summary
databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. This dataset can be used for any purpose, whether academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported… See the full description on the dataset page: https://huggingface.co/datasets/databricks/databricks-dolly-15k.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is the blind eval dataset of high-quality, diverse, human-written instructions with demonstrations. We will be using this for step 3 evaluations in our RLHF pipeline.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for Helpful Instructions
Dataset Summary
Helpful Instructions is a dataset of (instruction, demonstration) pairs derived from public datasets. As the name suggests, it focuses on instructions that are "helpful", i.e., the kinds of questions or tasks a human user might instruct an AI assistant to perform. You can load the dataset as follows: from datasets import load_dataset
helpful_instructions =… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/helpful-instructions.
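The load_dataset call above is truncated; separately, the derivation the summary describes, turning (prompt, completion) pairs into (instruction, demonstration) pairs, can be sketched with toy records (the field names are assumptions based on the card's wording, and the real data comes from datasets.load_dataset, not inline literals):

```python
# Sketch of the (prompt, completion) -> (instruction, demonstration)
# renaming the card describes, applied to toy records.
source = [
    {"prompt": "Summarize this article.", "completion": "The article argues ..."},
    {"prompt": "Translate 'bonjour' to English.", "completion": "Hello."},
]

helpful = [{"instruction": r["prompt"], "demonstration": r["completion"]}
           for r in source]

print(helpful[0]["instruction"])  # Summarize this article.
```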
huggingface/paper-central-data-2 dataset hosted on Hugging Face and contributed by the HF Datasets community
ftopal/huggingface-datasets dataset hosted on Hugging Face and contributed by the HF Datasets community
Container dataset for demonstrating Hugging Face models on Redivis. It currently contains only a single BERT model, but may expand in the future.
MBanks50/huggingface dataset hosted on Hugging Face and contributed by the HF Datasets community
EKKADMAUR/Locations dataset hosted on Hugging Face and contributed by the HF Datasets community
WendyHoang/real-names-real-companies-real-locations-v0.1 dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Dataset Card for DIALOGSum Corpus
Dataset Description
Links
Homepage: https://aclanthology.org/2021.findings-acl.449 Repository: https://github.com/cylnlp/dialogsum Paper: https://aclanthology.org/2021.findings-acl.449 Point of Contact: https://huggingface.co/knkarthick
Dataset Summary
DialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 dialogues (plus 100 held-out dialogues for topic generation) with… See the full description on the dataset page: https://huggingface.co/datasets/knkarthick/dialogsum.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is part of Anthropic's HH data used to train their RLHF assistant: https://github.com/anthropics/hh-rlhf. The data contains the first utterance from the human to the dialogue agent and the number of words in that utterance. The sampled version is a random sample of size 200.
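A minimal sketch of the extraction this entry describes, assuming the "Human:"/"Assistant:" turn markers used in the hh-rlhf transcripts:

```python
def first_human_utterance(transcript: str) -> tuple[str, int]:
    """Return the first human turn and its word count from an
    hh-rlhf-style transcript with 'Human:'/'Assistant:' markers."""
    # Take the text between the first "Human:" and the next "Assistant:".
    after_human = transcript.split("Human:", 1)[1]
    utterance = after_human.split("Assistant:", 1)[0].strip()
    # Word count via whitespace splitting, a simplifying assumption.
    return utterance, len(utterance.split())

dialog = "Human: How do I bake bread at home? Assistant: Start with flour..."
text, n_words = first_human_utterance(dialog)
print(text)     # How do I bake bread at home?
print(n_words)  # 7
```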
huggingface/transformers-stats-space-data dataset hosted on Hugging Face and contributed by the HF Datasets community