100+ datasets found
  1. h

    huggingface-datasets

    • huggingface.co
    Updated Jun 2, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Noah Kasmanoff (2023). huggingface-datasets [Dataset]. https://huggingface.co/datasets/nkasmanoff/huggingface-datasets
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 2, 2023
    Authors
    Noah Kasmanoff
    Description

    Dataset Card for "huggingface-datasets"

    This dataset is a snapshot of all public datasets in HuggingFace as of 04/24/2023. It is based on the dataset metadata that can be found at the following endpoint: https://huggingface.co/api/datasets/{dataset_id} Which contains information like the dataset name, its tags, description, and more. Please note that description is different from dataset card, which is what you are reading now :-). I would love to replace this dataset with one which… See the full description on the dataset page: https://huggingface.co/datasets/nkasmanoff/huggingface-datasets.

  2. h

    ShareGPT_Vicuna_unfiltered

    • huggingface.co
    Updated Apr 12, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    z. (2023). ShareGPT_Vicuna_unfiltered [Dataset]. https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered
    Explore at:
    Dataset updated
    Apr 12, 2023
    Authors
    z.
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Further cleaning done. Please look through the dataset and ensure that I didn't miss anything. Update: Confirmed working method for training the model: https://huggingface.co/AlekseyKorshuk/vicuna-7b/discussions/4#64346c08ef6d5abefe42c12c Two choices:

    Removes instances of "I'm sorry, but": https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V3_unfiltered_cleaned_split_no_imsorry.json Has instances of "I'm sorry, but":… See the full description on the dataset page: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered.

  3. h

    dialogsum

    • huggingface.co
    Updated Jun 29, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Karthick Kaliannan Neelamohan (2022). dialogsum [Dataset]. https://huggingface.co/datasets/knkarthick/dialogsum
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 29, 2022
    Authors
    Karthick Kaliannan Neelamohan
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for DIALOGSum Corpus

      Dataset Description
    
    
    
    
    
      Links
    

    Homepage: https://aclanthology.org/2021.findings-acl.449 Repository: https://github.com/cylnlp/dialogsum Paper: https://aclanthology.org/2021.findings-acl.449 Point of Contact: https://huggingface.co/knkarthick

      Dataset Summary
    

    DialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 (Plus 100 holdout data for topic generation) dialogues with corresponding… See the full description on the dataset page: https://huggingface.co/datasets/knkarthick/dialogsum.

  4. h

    HarmfulQA

    • huggingface.co
    Updated Aug 20, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Deep Cognition and Language Research (DeCLaRe) Lab (2023). HarmfulQA [Dataset]. https://huggingface.co/datasets/declare-lab/HarmfulQA
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 20, 2023
    Dataset authored and provided by
    Deep Cognition and Language Research (DeCLaRe) Lab
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Paper | Github | Dataset| Model 📣📣📣: Do check our new multilingual dataset CatQA here used in Safety Vectors:📣📣📣

    As a part of our research efforts toward making LLMs more safe for public use, we create HarmfulQA i.e. a ChatGPT-distilled dataset constructed using the Chain of Utterances (CoU) prompt. More details are in our paper Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment HarmfulQA serves as both-a new LLM safety benchmark and an alignment dataset… See the full description on the dataset page: https://huggingface.co/datasets/declare-lab/HarmfulQA.

  5. h

    huggingface-models-raw

    • huggingface.co
    Updated Mar 2, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Fazil (2022). huggingface-models-raw [Dataset]. https://huggingface.co/datasets/ftopal/huggingface-models-raw
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 2, 2022
    Authors
    Fazil
    Description

    ftopal/huggingface-models-raw dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. databricks-dolly-15k

    • huggingface.co
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Databricks, databricks-dolly-15k [Dataset]. https://huggingface.co/datasets/databricks/databricks-dolly-15k
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset authored and provided by
    Databrickshttp://databricks.com/
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0)https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Summary

    databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. This dataset can be used for any purpose, whether academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported… See the full description on the dataset page: https://huggingface.co/datasets/databricks/databricks-dolly-15k.

  7. instruction-dataset

    • huggingface.co
    • opendatalab.com
    Updated Feb 10, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Hugging Face H4 (2023). instruction-dataset [Dataset]. https://huggingface.co/datasets/HuggingFaceH4/instruction-dataset
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 10, 2023
    Dataset provided by
    Hugging Facehttps://huggingface.co/
    Authors
    Hugging Face H4
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This is the blind eval dataset of high-quality, diverse, human-written instructions with demonstrations. We will be using this for step 3 evaluations in our RLHF pipeline.

  8. h

    huggingface-dataset

    • huggingface.co
    Updated Aug 18, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Roger Rose (2023). huggingface-dataset [Dataset]. https://huggingface.co/datasets/RogerRose/huggingface-dataset
    Explore at:
    Dataset updated
    Aug 18, 2023
    Authors
    Roger Rose
    Description

    RogerRose/huggingface-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  9. stack-exchange-preferences

    • huggingface.co
    • opendatalab.com
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Hugging Face H4, stack-exchange-preferences [Dataset]. https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset provided by
    Hugging Facehttps://huggingface.co/
    Authors
    Hugging Face H4
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for H4 Stack Exchange Preferences Dataset

      Dataset Summary
    

    This dataset contains questions and answers from the Stack Overflow Data Dump for the purpose of preference model training. Importantly, the questions have been filtered to fit the following criteria for preference models (following closely from Askell et al. 2021): have >=2 answers. This data could also be used for instruction fine-tuning and language model training. The questions are grouped with… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences.

  10. h

    whoops

    • huggingface.co
    Updated Nov 21, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    nlphuji (2024). whoops [Dataset]. https://huggingface.co/datasets/nlphuji/whoops
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 21, 2024
    Dataset authored and provided by
    nlphuji
    Description

    Dataset Card for WHOOPS!

    Dataset Description Contribute Images to Extend WHOOPS! Languages

    Dataset Data Fields Data Splits Data Loading

    Licensing Information Annotations Considerations for Using the Data Citation Information

      Dataset Description
    

    WHOOPS! is a dataset and benchmark for visual commonsense. The dataset is comprised of purposefully commonsense-defying images created by designers using publicly-available image generation tools like Midjourney. It contains… See the full description on the dataset page: https://huggingface.co/datasets/nlphuji/whoops.

  11. smollm-corpus

    • huggingface.co
    Updated Jul 16, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Hugging Face Smol Models Research (2024). smollm-corpus [Dataset]. https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 16, 2024
    Dataset provided by
    Hugging Facehttps://huggingface.co/
    Authors
    Hugging Face Smol Models Research
    License

    https://choosealicense.com/licenses/odc-by/https://choosealicense.com/licenses/odc-by/

    Description

    SmolLM-Corpus

    This dataset is a curated collection of high-quality educational and synthetic data designed for training small language models. You can find more details about the models trained on this dataset in our SmolLM blog post.

      Dataset subsets
    
    
    
    
    
      Cosmopedia v2
    

    Cosmopedia v2 is an enhanced version of Cosmopedia, the largest synthetic dataset for pre-training, consisting of over 39 million textbooks, blog posts, and stories generated by… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus.

  12. h

    pile-of-law

    • huggingface.co
    • opendatalab.com
    Updated Jul 10, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Pile of Law (2022). pile-of-law [Dataset]. https://huggingface.co/datasets/pile-of-law/pile-of-law
    Explore at:
    Dataset updated
    Jul 10, 2022
    Dataset authored and provided by
    Pile of Law
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    We curate a large corpus of legal and administrative data. The utility of this data is twofold: (1) to aggregate legal and administrative data sources that demonstrate different norms and legal standards for data filtering; (2) to collect a dataset that can be used in the future for pretraining legal-domain language models, a key direction in access-to-justice initiatives.

  13. h

    do-not-answer

    • huggingface.co
    Updated Sep 13, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    LibrAI (2023). do-not-answer [Dataset]. https://huggingface.co/datasets/LibrAI/do-not-answer
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 13, 2023
    Dataset authored and provided by
    LibrAI
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs

      Overview
    

    Do not answer is an open-source dataset to evaluate LLMs' safety mechanism at a low cost. The dataset is curated and filtered to consist only of prompts to which responsible language models do not answer. Besides human annotations, Do not answer also implements model-based evaluation, where a 600M fine-tuned BERT-like evaluator achieves comparable results with human and GPT-4.

      Instruction… See the full description on the dataset page: https://huggingface.co/datasets/LibrAI/do-not-answer.
    
  14. h

    SHP

    • huggingface.co
    • opendatalab.com
    Updated Mar 1, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Stanford NLP (2023). SHP [Dataset]. https://huggingface.co/datasets/stanfordnlp/SHP
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 1, 2023
    Dataset authored and provided by
    Stanford NLP
    Description

    🚢 Stanford Human Preferences Dataset (SHP)

    If you mention this dataset in a paper, please cite the paper: Understanding Dataset Difficulty with V-Usable Information (ICML 2022).

      Summary
    

    SHP is a dataset of 385K collective human preferences over responses to questions/instructions in 18 different subject areas, from cooking to legal advice. The preferences are meant to reflect the helpfulness of one response over another, and are intended to be used for training RLHF… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/SHP.

  15. h

    diffusiondb

    • huggingface.co
    Updated Mar 16, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Polo Club of Data Science (2023). diffusiondb [Dataset]. https://huggingface.co/datasets/poloclub/diffusiondb
    Explore at:
    Dataset updated
    Mar 16, 2023
    Dataset authored and provided by
    Polo Club of Data Science
    License

    https://choosealicense.com/licenses/cc0-1.0/https://choosealicense.com/licenses/cc0-1.0/

    Description

    DiffusionDB is the first large-scale text-to-image prompt dataset. It contains 2 million images generated by Stable Diffusion using prompts and hyperparameters specified by real users. The unprecedented scale and diversity of this human-actuated dataset provide exciting research opportunities in understanding the interplay between prompts and generative models, detecting deepfakes, and designing human-AI interaction tools to help users more easily use these models.

  16. wmt22_african

    • huggingface.co
    Updated May 15, 2007
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ai2 (2007). wmt22_african [Dataset]. https://huggingface.co/datasets/allenai/wmt22_african
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 15, 2007
    Dataset provided by
    Allen Institute for AIhttp://allenai.org/
    Authors
    Ai2
    Description

    Dataset Card for allenai/wmt22_african

      Dataset Summary
    

    This dataset was created based on metadata for mined bitext released by Meta AI. It contains bitext for 248 pairs for the African languages that are part of the 2022 WMT Shared Task on Large Scale Machine Translation Evaluation for African Languages.

      How to use the data
    

    There are two ways to access the data:

    Via the Hugging Face Python datasets library

    from datasets import load_dataset dataset =… See the full description on the dataset page: https://huggingface.co/datasets/allenai/wmt22_african.

  17. h

    coco

    • huggingface.co
    Updated Mar 3, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Detection datasets (2023). coco [Dataset]. https://huggingface.co/datasets/detection-datasets/coco
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 3, 2023
    Dataset authored and provided by
    Detection datasets
    Description

    detection-datasets/coco dataset hosted on Hugging Face and contributed by the HF Datasets community

  18. h

    Data from: imdb

    • huggingface.co
    Updated Aug 3, 2003
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Stanford NLP (2003). imdb [Dataset]. https://huggingface.co/datasets/stanfordnlp/imdb
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 3, 2003
    Dataset authored and provided by
    Stanford NLP
    License

    https://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/

    Description

    Dataset Card for "imdb"

      Dataset Summary
    

    Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.

      Supported Tasks and Leaderboards
    

    More Information Needed

      Languages
    

    More Information Needed

      Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/imdb.
    
  19. h

    common_corpus

    • huggingface.co
    Updated Nov 13, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    PleIAs (2024). common_corpus [Dataset]. https://huggingface.co/datasets/PleIAs/common_corpus
    Explore at:
    Dataset updated
    Nov 13, 2024
    Dataset authored and provided by
    PleIAs
    Description

    Common Corpus

    Full data paper

    Common Corpus is the largest open and permissible licensed text dataset, comprising 2 trillion tokens (1,998,647,168,282 tokens). It is a diverse dataset, consisting of books, newspapers, scientific articles, government and legal documents, code, and more. Common Corpus has been created by Pleias in association with several partners and contributed in-kind to Current AI initiative. Common Corpus differs from existing open datasets in that it is:… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/common_corpus.

  20. h

    UrbanSyn

    • huggingface.co
    Updated Dec 19, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Administrator (2024). UrbanSyn [Dataset]. https://huggingface.co/datasets/UrbanSyn/UrbanSyn
    Explore at:
    Dataset updated
    Dec 19, 2024
    Authors
    Administrator
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    UrbanSyn Dataset

    UrbanSyn is an open synthetic dataset featuring photorealistic driving scenes. It contains ground-truth annotations for semantic segmentation, scene depth, panoptic instance segmentation, and 2-D bounding boxes. Website https://urbansyn.org

      Overview
    

    UrbanSyn is a diverse, compact, and photorealistic dataset that provides more than 7.5k synthetic annotated images. It was born to address the synth-to-real domain gap, contributing to unprecedented… See the full description on the dataset page: https://huggingface.co/datasets/UrbanSyn/UrbanSyn.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Noah Kasmanoff (2023). huggingface-datasets [Dataset]. https://huggingface.co/datasets/nkasmanoff/huggingface-datasets

huggingface-datasets

nkasmanoff/huggingface-datasets

Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 2, 2023
Authors
Noah Kasmanoff
Description

Dataset Card for "huggingface-datasets"

This dataset is a snapshot of all public datasets in HuggingFace as of 04/24/2023. It is based on the dataset metadata that can be found at the following endpoint: https://huggingface.co/api/datasets/{dataset_id} Which contains information like the dataset name, its tags, description, and more. Please note that description is different from dataset card, which is what you are reading now :-). I would love to replace this dataset with one which… See the full description on the dataset page: https://huggingface.co/datasets/nkasmanoff/huggingface-datasets.

Search
Clear search
Close search
Google apps
Main menu