Dataset Card for "huggingface-datasets"
This dataset is a snapshot of all public datasets on Hugging Face as of 04/24/2023. It is based on the dataset metadata available from the following endpoint: https://huggingface.co/api/datasets/{dataset_id}, which contains information such as the dataset name, its tags, description, and more. Please note that the description is different from the dataset card, which is what you are reading now :-). I would love to replace this dataset with one which… See the full description on the dataset page: https://huggingface.co/datasets/nkasmanoff/huggingface-datasets.
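As a rough illustration of where this metadata comes from, here is a minimal sketch of querying that endpoint with the requests library; the exact fields in the response may vary by dataset, and the example dataset id is just the one named above:

```python
import requests

def fetch_dataset_metadata(dataset_id: str) -> dict:
    """Fetch public metadata for a dataset from the Hugging Face Hub API."""
    url = f"https://huggingface.co/api/datasets/{dataset_id}"
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.json()

# Example: inspect the name, tags, and description fields mentioned above.
meta = fetch_dataset_metadata("nkasmanoff/huggingface-datasets")
print(meta.get("id"), meta.get("tags"), meta.get("description"))
```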
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Further cleaning has been done. Please look through the dataset and make sure I didn't miss anything. Update: confirmed working method for training the model: https://huggingface.co/AlekseyKorshuk/vicuna-7b/discussions/4#64346c08ef6d5abefe42c12c. Two choices:
- Removes instances of "I'm sorry, but": https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V3_unfiltered_cleaned_split_no_imsorry.json
- Has instances of "I'm sorry, but":… See the full description on the dataset page: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered.
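For illustration, here is a minimal sketch of the kind of filtering that separates the two variants, assuming the common ShareGPT JSON layout (a list of records whose "conversations" hold "from"/"value" turns); the file name and field names are assumptions, and the actual cleaning script may instead drop individual turns rather than whole conversations:

```python
import json

# Assumed ShareGPT-style layout: [{"id": ..., "conversations": [{"from": ..., "value": ...}]}]
with open("ShareGPT_V3_unfiltered_cleaned_split.json") as f:
    records = json.load(f)

# Keep only conversations with no "I'm sorry, but" refusal in any turn.
filtered = [
    rec for rec in records
    if not any("I'm sorry, but" in turn["value"] for turn in rec["conversations"])
]

with open("no_imsorry.json", "w") as f:
    json.dump(filtered, f)
```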
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Dataset Card for DIALOGSum Corpus
Dataset Description
Links
Homepage: https://aclanthology.org/2021.findings-acl.449
Repository: https://github.com/cylnlp/dialogsum
Paper: https://aclanthology.org/2021.findings-acl.449
Point of Contact: https://huggingface.co/knkarthick
Dataset Summary
DialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 dialogues (plus 100 holdout dialogues for topic generation) with corresponding… See the full description on the dataset page: https://huggingface.co/datasets/knkarthick/dialogsum.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Paper | GitHub | Dataset | Model
📣 Do check out our new multilingual dataset CatQA, used in Safety Vectors. 📣
As part of our research efforts toward making LLMs safer for public use, we created HarmfulQA, a ChatGPT-distilled dataset constructed using the Chain of Utterances (CoU) prompt. More details are in our paper, Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment. HarmfulQA serves as both a new LLM safety benchmark and an alignment dataset… See the full description on the dataset page: https://huggingface.co/datasets/declare-lab/HarmfulQA.
ftopal/huggingface-models-raw dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Summary
databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. This dataset can be used for any purpose, whether academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported… See the full description on the dataset page: https://huggingface.co/datasets/databricks/databricks-dolly-15k.
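As a sketch of how one might pull out a single behavioral category, here is a minimal example; the field names (instruction, context, response, category) are the commonly documented ones for this dataset rather than something confirmed by the snippet above:

```python
from datasets import load_dataset

# databricks-dolly-15k ships a single train split.
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

# Keep only the brainstorming examples, one of the categories listed above.
brainstorming = dolly.filter(lambda ex: ex["category"] == "brainstorming")
print(len(brainstorming), brainstorming[0]["instruction"])
```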
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is the blind eval dataset of high-quality, diverse, human-written instructions with demonstrations. We will be using this for step 3 evaluations in our RLHF pipeline.
RogerRose/huggingface-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for H4 Stack Exchange Preferences Dataset
Dataset Summary
This dataset contains questions and answers from the Stack Overflow Data Dump for the purpose of preference model training. Importantly, the questions have been filtered to fit the following criterion for preference models (following closely from Askell et al. 2021): each question has >= 2 answers. This data could also be used for instruction fine-tuning and language model training. The questions are grouped with… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences.
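To make the >= 2-answers criterion concrete, here is a hedged sketch of turning one grouped question into (chosen, rejected) preference pairs; the field names ("answers", "text", "pm_score") are assumptions about the layout, not guaranteed by the card:

```python
from itertools import combinations

def preference_pairs(question: dict) -> list[tuple[str, str]]:
    """Build (chosen, rejected) answer pairs from one grouped question.

    Assumes each answer dict carries "text" and a score-like "pm_score";
    the >= 2-answers filter guarantees at least one candidate pair exists.
    """
    answers = question["answers"]
    assert len(answers) >= 2  # the filtering criterion from the card
    pairs = []
    for a, b in combinations(answers, 2):
        if a["pm_score"] != b["pm_score"]:
            chosen, rejected = (a, b) if a["pm_score"] > b["pm_score"] else (b, a)
            pairs.append((chosen["text"], rejected["text"]))
    return pairs
```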
Dataset Card for WHOOPS!
Table of Contents: Dataset Description · Contribute Images to Extend WHOOPS! · Languages · Dataset · Data Fields · Data Splits · Data Loading · Licensing Information · Annotations · Considerations for Using the Data · Citation Information
Dataset Description
WHOOPS! is a dataset and benchmark for visual commonsense. The dataset comprises purposefully commonsense-defying images created by designers using publicly available image generation tools like Midjourney. It contains… See the full description on the dataset page: https://huggingface.co/datasets/nlphuji/whoops.
Open Data Commons Attribution License (ODC-By): https://choosealicense.com/licenses/odc-by/
SmolLM-Corpus
This dataset is a curated collection of high-quality educational and synthetic data designed for training small language models. You can find more details about the models trained on this dataset in our SmolLM blog post.
Dataset subsets
Cosmopedia v2
Cosmopedia v2 is an enhanced version of Cosmopedia, the largest synthetic dataset for pre-training, consisting of over 39 million textbooks, blog posts, and stories generated by… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
We curate a large corpus of legal and administrative data. The utility of this data is twofold: (1) to aggregate legal and administrative data sources that demonstrate different norms and legal standards for data filtering; (2) to collect a dataset that can be used in the future for pretraining legal-domain language models, a key direction in access-to-justice initiatives.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs
Overview
Do-Not-Answer is an open-source dataset for evaluating LLMs' safety mechanisms at low cost. The dataset is curated and filtered to consist only of prompts to which responsible language models do not answer. Besides human annotations, Do-Not-Answer also implements model-based evaluation, where a fine-tuned 600M BERT-like evaluator achieves results comparable to human and GPT-4 evaluation.
Instruction… See the full description on the dataset page: https://huggingface.co/datasets/LibrAI/do-not-answer.
🚢 Stanford Human Preferences Dataset (SHP)
If you mention this dataset in a paper, please cite the paper: Understanding Dataset Difficulty with V-Usable Information (ICML 2022).
Summary
SHP is a dataset of 385K collective human preferences over responses to questions/instructions in 18 different subject areas, from cooking to legal advice. The preferences are meant to reflect the helpfulness of one response over another, and are intended to be used for training RLHF… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/SHP.
Creative Commons Zero v1.0 Universal (CC0 1.0): https://choosealicense.com/licenses/cc0-1.0/
DiffusionDB is the first large-scale text-to-image prompt dataset. It contains 2 million images generated by Stable Diffusion using prompts and hyperparameters specified by real users. The unprecedented scale and diversity of this human-actuated dataset provide exciting research opportunities in understanding the interplay between prompts and generative models, detecting deepfakes, and designing human-AI interaction tools to help users more easily use these models.
Dataset Card for allenai/wmt22_african
Dataset Summary
This dataset was created based on metadata for mined bitext released by Meta AI. It contains bitext for 248 language pairs covering the African languages that are part of the 2022 WMT Shared Task on Large Scale Machine Translation Evaluation for African Languages.
How to use the data
There are two ways to access the data:
Via the Hugging Face Python datasets library
from datasets import load_dataset dataset =… See the full description on the dataset page: https://huggingface.co/datasets/allenai/wmt22_african.
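Completing the truncated snippet above, here is a minimal sketch of the datasets-library route; the language-pair config name ("afr-eng") is an assumed example, so check the dataset page for the exact pair names:

```python
from datasets import load_dataset

# Load one of the 248 language pairs; "afr-eng" is an assumed example config.
dataset = load_dataset("allenai/wmt22_african", "afr-eng", split="train")
print(dataset[0])  # a single bitext record for the chosen pair
```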
detection-datasets/coco dataset hosted on Hugging Face and contributed by the HF Datasets community
Other: https://choosealicense.com/licenses/other/
Dataset Card for "imdb"
Dataset Summary
Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
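For reference, a minimal sketch of loading the splits described above; the additional unlabeled data is commonly exposed as an "unsupervised" split:

```python
from datasets import load_dataset

imdb = load_dataset("stanfordnlp/imdb")
print(imdb["train"].num_rows, imdb["test"].num_rows)  # 25,000 reviews each
print(imdb["train"][0]["text"][:80], imdb["train"][0]["label"])
```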
Supported Tasks and Leaderboards
More Information Needed
Languages
More Information Needed
Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/imdb.
Common Corpus
Full data paper
Common Corpus is the largest open and permissively licensed text dataset, comprising 2 trillion tokens (1,998,647,168,282 tokens). It is a diverse dataset, consisting of books, newspapers, scientific articles, government and legal documents, code, and more. Common Corpus has been created by Pleias in association with several partners and contributed in-kind to the Current AI initiative. Common Corpus differs from existing open datasets in that it is:… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/common_corpus.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
UrbanSyn Dataset
UrbanSyn is an open synthetic dataset featuring photorealistic driving scenes. It contains ground-truth annotations for semantic segmentation, scene depth, panoptic instance segmentation, and 2-D bounding boxes. Website: https://urbansyn.org
Overview
UrbanSyn is a diverse, compact, and photorealistic dataset that provides more than 7.5k synthetic annotated images. It was born to address the synth-to-real domain gap, contributing to unprecedented… See the full description on the dataset page: https://huggingface.co/datasets/UrbanSyn/UrbanSyn.