100+ datasets found
  1. huggingface-datasets

    • huggingface.co
    Updated Jun 2, 2023
    Cite
    Noah Kasmanoff (2023). huggingface-datasets [Dataset]. https://huggingface.co/datasets/nkasmanoff/huggingface-datasets
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Jun 2, 2023
    Authors
    Noah Kasmanoff
    Description

    Dataset Card for "huggingface-datasets"

    This dataset is a snapshot of all public datasets on Hugging Face as of 04/24/2023. It is based on the dataset metadata available at the endpoint https://huggingface.co/api/datasets/{dataset_id}, which contains information such as the dataset name, its tags, description, and more. Please note that the description is different from the dataset card, which is what you are reading now :-). I would love to replace this dataset with one which… See the full description on the dataset page: https://huggingface.co/datasets/nkasmanoff/huggingface-datasets.
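
As a rough illustration of the metadata endpoint mentioned in this description, the sketch below builds the URL for a given dataset id and, optionally, fetches the JSON it returns. The helper names `dataset_api_url` and `fetch_dataset_metadata` are hypothetical, and no assumption is made here about which fields the response contains.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Endpoint root taken from the dataset description above.
API_ROOT = "https://huggingface.co/api/datasets"


def dataset_api_url(dataset_id: str) -> str:
    """Build the metadata URL for a dataset id such as 'owner/name'."""
    # Keep the '/' between owner and name; percent-encode anything unusual.
    return f"{API_ROOT}/{quote(dataset_id, safe='/')}"


def fetch_dataset_metadata(dataset_id: str, timeout: float = 10.0) -> dict:
    """Fetch and decode the JSON metadata (hypothetical helper; needs network access)."""
    with urlopen(dataset_api_url(dataset_id), timeout=timeout) as response:
        return json.load(response)
```

Calling `dataset_api_url("nkasmanoff/huggingface-datasets")` gives the URL for this entry; `fetch_dataset_metadata` is kept separate so the URL logic can be used offline.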

  2. huggingface-datasets-processed

    • huggingface.co
    + more versions
    Cite
    Fazil, huggingface-datasets-processed [Dataset]. https://huggingface.co/datasets/ftopal/huggingface-datasets-processed
    Explore at:
    Croissant (see mlcommons.org/croissant)
    Authors
    Fazil
    Description

    ftopal/huggingface-datasets-processed dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. instruction-dataset

    • huggingface.co
    • opendatalab.com
    Updated Feb 10, 2023
    Cite
    Hugging Face H4 (2023). instruction-dataset [Dataset]. https://huggingface.co/datasets/HuggingFaceH4/instruction-dataset
    Explore at:
    Croissant (see mlcommons.org/croissant)
    Dataset updated
    Feb 10, 2023
    Dataset provided by
    Hugging Face: https://huggingface.co/
    Authors
    Hugging Face H4
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This is the blind eval dataset of high-quality, diverse, human-written instructions with demonstrations. We will be using this for step 3 evaluations in our RLHF pipeline.

  4. dialogsum

    • huggingface.co
    Updated Jun 29, 2022
    + more versions
    Cite
    Karthick Kaliannan Neelamohan (2022). dialogsum [Dataset]. https://huggingface.co/datasets/knkarthick/dialogsum
    Explore at:
    Croissant (see mlcommons.org/croissant)
    Dataset updated
    Jun 29, 2022
    Authors
    Karthick Kaliannan Neelamohan
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for DIALOGSum Corpus

      Dataset Description

      Links

    Homepage: https://aclanthology.org/2021.findings-acl.449
    Repository: https://github.com/cylnlp/dialogsum
    Paper: https://aclanthology.org/2021.findings-acl.449
    Point of Contact: https://huggingface.co/knkarthick

      Dataset Summary
    

    DialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 (Plus 100 holdout data for topic generation) dialogues with corresponding… See the full description on the dataset page: https://huggingface.co/datasets/knkarthick/dialogsum.

  5. databricks-dolly-15k

    • huggingface.co
    + more versions
    Cite
    Databricks, databricks-dolly-15k [Dataset]. https://huggingface.co/datasets/databricks/databricks-dolly-15k
    Explore at:
    Croissant (see mlcommons.org/croissant)
    Dataset authored and provided by
    Databricks: http://databricks.com/
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Summary

    databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. This dataset can be used for any purpose, whether academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported… See the full description on the dataset page: https://huggingface.co/datasets/databricks/databricks-dolly-15k.

  6. pile-of-law

    • huggingface.co
    • opendatalab.com
    Updated Jul 10, 2022
    Cite
    Pile of Law (2022). pile-of-law [Dataset]. https://huggingface.co/datasets/pile-of-law/pile-of-law
    Dataset updated
    Jul 10, 2022
    Dataset authored and provided by
    Pile of Law
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    We curate a large corpus of legal and administrative data. The utility of this data is twofold: (1) to aggregate legal and administrative data sources that demonstrate different norms and legal standards for data filtering; (2) to collect a dataset that can be used in the future for pretraining legal-domain language models, a key direction in access-to-justice initiatives.

  7. Dataset of the paper: "How do Hugging Face Models Document Datasets, Bias,...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jan 16, 2024
    Cite
    Federica Pepe; Vittoria Nardone; Antonio Mastropaolo; Gerardo Canfora; Gabriele BAVOTA; Massimiliano Di Penta (2024). Dataset of the paper: "How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study" [Dataset]. http://doi.org/10.5281/zenodo.10058142
    Explore at:
    zip (available download format)
    Dataset updated
    Jan 16, 2024
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Federica Pepe; Vittoria Nardone; Antonio Mastropaolo; Gerardo Canfora; Gabriele BAVOTA; Massimiliano Di Penta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This replication package contains datasets and scripts related to the paper: "*How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study*"

    ## Root directory

    - `statistics.r`: R script used to compute the correlation between usage and downloads, and the RQ1/RQ2 inter-rater agreements

    - `modelsInfo.zip`: zip file containing all the downloaded model cards (in JSON format)

    - `script`: directory containing all the scripts used to collect and process data. For further details, see README file inside the script directory.

    ## Dataset

    - `Dataset/Dataset_HF-models-list.csv`: list of HF models analyzed

    - `Dataset/Dataset_github-prj-list.txt`: list of GitHub projects using the *transformers* library

    - `Dataset/Dataset_github-Prj_model-Used.csv`: contains usage pairs: project, model

    - `Dataset/Dataset_prj-num-models-reused.csv`: number of models used by each GitHub project

    - `Dataset/Dataset_model-download_num-prj_correlation.csv` contains, for each model used by GitHub projects: the name, the task, the number of reusing projects, and the number of downloads

    ## RQ1

    - `RQ1/RQ1_dataset-list.txt`: list of HF datasets

    - `RQ1/RQ1_datasetSample.csv`: sample set of models used for the manual analysis of datasets

    - `RQ1/RQ1_analyzeDatasetTags.py`: Python script that analyzes model tags for the presence of datasets. It requires unzipping `modelsInfo.zip` into a directory with the same name (`modelsInfo`) at the root of the replication package folder. It writes its output to stdout; redirect it to a file to be analyzed by the `RQ1/RQ1_countDataset.py` script

    - `RQ1/RQ1_countDataset.py`: given the output of `RQ1/RQ1_analyzeDatasetTags.py` (passed as argument), produces, for each model, a list of Booleans indicating whether (i) the model only declares HF datasets, (ii) the model only declares external datasets, (iii) the model declares both, and (iv) the model is part of the sample for the manual analysis

    - `RQ1/RQ1_datasetTags.csv`: output of `RQ1/RQ1_analyzeDatasetTags.py`

    - `RQ1/RQ1_dataset_usage_count.csv`: output of `RQ1/RQ1_countDataset.py`
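
The four per-model Booleans described for the counting script could be derived roughly as in the sketch below. This is a hypothetical illustration, not the code shipped in the replication package: the function name `dataset_flags` and its inputs (`declared`, `hf_datasets`, `in_sample`) are assumptions, and the package's actual file parsing is not reproduced.

```python
def dataset_flags(declared, hf_datasets, in_sample):
    """Return (only_hf, only_external, both, in_sample) for one model.

    declared:    dataset names the model card declares (hypothetical input)
    hf_datasets: set of dataset names known to be hosted on Hugging Face
    in_sample:   whether the model belongs to the manual-analysis sample
    """
    on_hf = [name in hf_datasets for name in declared]
    only_hf = bool(declared) and all(on_hf)            # every declared dataset is on HF
    only_external = bool(declared) and not any(on_hf)  # none of them is on HF
    both = any(on_hf) and not all(on_hf)               # a mix of HF and external datasets
    return only_hf, only_external, both, in_sample
```

A model that declares no datasets yields `False` for the first three flags under this sketch; whether the original script treats that case the same way is not stated in the description.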

    ## RQ2

    - `RQ2/tableBias.pdf`: table detailing the number of occurrences of different types of bias by model Task

    - `RQ2/RQ2_bias_classification_sheet.csv`: results of the manual labeling

    - `RQ2/RQ2_isBiased.csv`: file to compute the inter-rater agreement of whether or not a model documents Bias

    - `RQ2/RQ2_biasAgrLabels.csv`: file to compute the inter-rater agreement related to bias categories

    - `RQ2/RQ2_final_bias_categories_with_levels.csv`: for each model in the sample, this file lists (i) the bias leaf category, (ii) the first-level category, and (iii) the intermediate category

    ## RQ3

    - `RQ3/RQ3_LicenseValidation.csv`: manual validation of a sample of licenses

    - `RQ3/RQ3_{NETWORK-RESTRICTIVE|RESTRICTIVE|WEAK-RESTRICTIVE|PERMISSIVE}-license-list.txt`: lists of licenses with different permissiveness

    - `RQ3/RQ3_prjs_license.csv`: for each project linked to models, among other fields it indicates the license tag and name

    - `RQ3/RQ3_models_license.csv`: for each model, indicates among other pieces of info, whether the model has a license, and if yes what kind of license

    - `RQ3/RQ3_model-prj-license_contingency_table.csv`: usage contingency table between projects' licenses (columns) and models' licenses (rows)

    - `RQ3/RQ3_models_prjs_licenses_with_type.csv`: pairs project-model, with their respective licenses and permissiveness level

    ## scripts

    Contains the scripts used to mine Hugging Face and GitHub. Details are in the enclosed README

  8. fineweb-edu

    • huggingface.co
    Updated Jan 3, 2025
    + more versions
    Cite
    FineData (2025). fineweb-edu [Dataset]. http://doi.org/10.57967/hf/2497
    Explore at:
    Croissant (see mlcommons.org/croissant)
    Dataset updated
    Jan 3, 2025
    Dataset authored and provided by
    FineData
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    📚 FineWeb-Edu

    1.3 trillion tokens of the finest educational data the 🌐 web has to offer

    Paper: https://arxiv.org/abs/2406.17557

      What is it?
    

    📚 The FineWeb-Edu dataset consists of 1.3T tokens and 5.4T tokens (FineWeb-Edu-score-2) of educational web pages filtered from the 🍷 FineWeb dataset. This is the 1.3 trillion version. To enhance FineWeb's quality, we developed an educational quality classifier using annotations generated by Llama3-70B-Instruct. We then… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu.

  9. smollm-corpus

    • huggingface.co
    Updated Jul 16, 2024
    Cite
    Hugging Face Smol Models Research (2024). smollm-corpus [Dataset]. https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus
    Explore at:
    Croissant (see mlcommons.org/croissant)
    Dataset updated
    Jul 16, 2024
    Dataset provided by
    Hugging Face: https://huggingface.co/
    Authors
    Hugging Face Smol Models Research
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    SmolLM-Corpus

    This dataset is a curated collection of high-quality educational and synthetic data designed for training small language models. You can find more details about the models trained on this dataset in our SmolLM blog post.

      Dataset subsets

      Cosmopedia v2

    Cosmopedia v2 is an enhanced version of Cosmopedia, the largest synthetic dataset for pre-training, consisting of over 39 million textbooks, blog posts, and stories generated by… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus.

  10. rag-mini-wikipedia

    • huggingface.co
    Updated Jun 29, 2025
    Cite
    RAG Datasets (2025). rag-mini-wikipedia [Dataset]. https://huggingface.co/datasets/rag-datasets/rag-mini-wikipedia
    Explore at:
    Croissant (see mlcommons.org/croissant)
    Dataset updated
    Jun 29, 2025
    Dataset authored and provided by
    RAG Datasets
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    In this huggingface discussion you can share what you used the dataset for. The dataset derives from https://www.kaggle.com/datasets/rtatman/questionanswer-dataset?resource=download; we generated our own subset using generate.py.

  11. wmt22_african

    • huggingface.co
    Updated May 15, 2007
    Cite
    Ai2 (2007). wmt22_african [Dataset]. https://huggingface.co/datasets/allenai/wmt22_african
    Explore at:
    Croissant (see mlcommons.org/croissant)
    Dataset updated
    May 15, 2007
    Dataset provided by
    Allen Institute for AI: http://allenai.org/
    Authors
    Ai2
    Description

    Dataset Card for allenai/wmt22_african

      Dataset Summary
    

    This dataset was created based on metadata for mined bitext released by Meta AI. It contains bitext for 248 pairs for the African languages that are part of the 2022 WMT Shared Task on Large Scale Machine Translation Evaluation for African Languages.

      How to use the data
    

    There are two ways to access the data:

    Via the Hugging Face Python datasets library

    from datasets import load_dataset dataset =… See the full description on the dataset page: https://huggingface.co/datasets/allenai/wmt22_african.

  12. stack-exchange-preferences

    • huggingface.co
    • opendatalab.com
    Cite
    Hugging Face H4, stack-exchange-preferences [Dataset]. https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences
    Explore at:
    Croissant (see mlcommons.org/croissant)
    Dataset provided by
    Hugging Face: https://huggingface.co/
    Authors
    Hugging Face H4
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for H4 Stack Exchange Preferences Dataset

      Dataset Summary
    

    This dataset contains questions and answers from the Stack Overflow Data Dump for the purpose of preference model training. Importantly, the questions have been filtered to fit the following criteria for preference models (following closely from Askell et al. 2021): have >=2 answers. This data could also be used for instruction fine-tuning and language model training. The questions are grouped with… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences.

  13. Data from: imdb

    • huggingface.co
    Updated Aug 3, 2003
    + more versions
    Cite
    Stanford NLP (2003). imdb [Dataset]. https://huggingface.co/datasets/stanfordnlp/imdb
    Explore at:
    Croissant (see mlcommons.org/croissant)
    Dataset updated
    Aug 3, 2003
    Dataset authored and provided by
    Stanford NLP
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for "imdb"

      Dataset Summary
    

    Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.

      Supported Tasks and Leaderboards
    

    More Information Needed

      Languages
    

    More Information Needed

      Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/imdb.
    
  14. do-not-answer

    • huggingface.co
    Updated Sep 13, 2023
    + more versions
    Cite
    LibrAI (2023). do-not-answer [Dataset]. https://huggingface.co/datasets/LibrAI/do-not-answer
    Explore at:
    Croissant (see mlcommons.org/croissant)
    Dataset updated
    Sep 13, 2023
    Dataset authored and provided by
    LibrAI
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs

      Overview
    

    Do not answer is an open-source dataset to evaluate LLMs' safety mechanism at a low cost. The dataset is curated and filtered to consist only of prompts to which responsible language models do not answer. Besides human annotations, Do not answer also implements model-based evaluation, where a 600M fine-tuned BERT-like evaluator achieves comparable results with human and GPT-4.

      Instruction… See the full description on the dataset page: https://huggingface.co/datasets/LibrAI/do-not-answer.
    
  15. clapnq

    • huggingface.co
    Updated Apr 8, 2024
    Cite
    PrimeQA (2024). clapnq [Dataset]. https://huggingface.co/datasets/PrimeQA/clapnq
    Explore at:
    Croissant (see mlcommons.org/croissant)
    Dataset updated
    Apr 8, 2024
    Dataset authored and provided by
    PrimeQA
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    We present CLAP NQ, a benchmark Long-form Question Answering dataset for the full RAG pipeline. CLAP NQ includes long answers with grounded gold passages from Natural Questions (NQ) and a corpus to perform either retrieval, generation, or the full RAG pipeline. The CLAP NQ answers are concise, 3x smaller than the full passage, and cohesive, with multiple pieces of the passage that are not contiguous. This is the annotated data for the generation portion of the RAG pipeline. For more… See the full description on the dataset page: https://huggingface.co/datasets/PrimeQA/clapnq.

  16. coco

    • huggingface.co
    Updated Mar 3, 2023
    + more versions
    Cite
    Detection datasets (2023). coco [Dataset]. https://huggingface.co/datasets/detection-datasets/coco
    Explore at:
    Croissant (see mlcommons.org/croissant)
    Dataset updated
    Mar 3, 2023
    Dataset authored and provided by
    Detection datasets
    Description

    detection-datasets/coco dataset hosted on Hugging Face and contributed by the HF Datasets community

  17. HarmfulQA

    • huggingface.co
    Updated Aug 20, 2023
    + more versions
    Cite
    Deep Cognition and Language Research (DeCLaRe) Lab (2023). HarmfulQA [Dataset]. https://huggingface.co/datasets/declare-lab/HarmfulQA
    Explore at:
    Croissant (see mlcommons.org/croissant)
    Dataset updated
    Aug 20, 2023
    Dataset authored and provided by
    Deep Cognition and Language Research (DeCLaRe) Lab
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Paper | Github | Dataset| Model 📣📣📣: Do check our new multilingual dataset CatQA here used in Safety Vectors:📣📣📣

    As a part of our research efforts toward making LLMs safer for public use, we create HarmfulQA, i.e., a ChatGPT-distilled dataset constructed using the Chain of Utterances (CoU) prompt. More details are in our paper "Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment". HarmfulQA serves as both a new LLM safety benchmark and an alignment dataset… See the full description on the dataset page: https://huggingface.co/datasets/declare-lab/HarmfulQA.

  18. VisIT-Bench

    • huggingface.co
    Cite
    ML Foundations, VisIT-Bench [Dataset]. https://huggingface.co/datasets/mlfoundations/VisIT-Bench
    Explore at:
    Croissant (see mlcommons.org/croissant)
    Dataset authored and provided by
    ML Foundations
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for VisIT-Bench

    Contents: Dataset Description; Links; Dataset Structure; Data Fields; Data Splits; Data Loading; Licensing Information; Annotations; Considerations for Using the Data; Citation Information

      Dataset Description
    

    VisIT-Bench is a dataset and benchmark for vision-and-language instruction following. The dataset is comprised of image-instruction pairs and corresponding example outputs, spanning a wide range of tasks, from simple object recognition to complex… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/VisIT-Bench.

  19. clinical-ie

    • huggingface.co
    Updated Dec 7, 2022
    Cite
    MIT Clinical Machine Learning Group (2022). clinical-ie [Dataset]. https://huggingface.co/datasets/mitclinicalml/clinical-ie
    Dataset updated
    Dec 7, 2022
    Dataset authored and provided by
    MIT Clinical Machine Learning Group
    Description

    Below, we provide access to the datasets used in and created for the EMNLP 2022 paper Large Language Models are Few-Shot Clinical Information Extractors.

      Task #1: Clinical Sense Disambiguation
    

    For Task #1, we use the original annotations from the Clinical Acronym Sense Inventory (CASI) dataset, described in their paper. As is common, due to noisiness in the label set, we do not evaluate on the entire dataset, but only on a cleaner subset. For consistency, we use the subset defined… See the full description on the dataset page: https://huggingface.co/datasets/mitclinicalml/clinical-ie.

  20. whoops

    • huggingface.co
    Updated Nov 21, 2024
    + more versions
    Cite
    nlphuji (2024). whoops [Dataset]. https://huggingface.co/datasets/nlphuji/whoops
    Explore at:
    Croissant (see mlcommons.org/croissant)
    Dataset updated
    Nov 21, 2024
    Dataset authored and provided by
    nlphuji
    Description

    Dataset Card for WHOOPS!

    Contents: Dataset Description; Contribute Images to Extend WHOOPS!; Languages; Dataset; Data Fields; Data Splits; Data Loading; Licensing Information; Annotations; Considerations for Using the Data; Citation Information

      Dataset Description
    

    WHOOPS! is a dataset and benchmark for visual commonsense. The dataset is comprised of purposefully commonsense-defying images created by designers using publicly-available image generation tools like Midjourney. It contains… See the full description on the dataset page: https://huggingface.co/datasets/nlphuji/whoops.
