100+ datasets found
  1. Hugging Face Models Dataset

    • kaggle.com
    zip
    Updated Feb 19, 2023
    Cite
    Yasir Raza (2023). Hugging Face Models Dataset [Dataset]. https://www.kaggle.com/datasets/yasirabdaali/hugging-face-models-dataset
    Explore at:
    zip (980916 bytes)
    Dataset updated
    Feb 19, 2023
    Authors
    Yasir Raza
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Hugging Face

    Hugging Face, Inc. is an American company that develops tools for building applications using machine learning. It is most notable for its Transformers library built for natural language processing applications and its platform that allows users to share machine learning models and datasets.

    This dataset contains data on 16k models available on huggingface.co. For each model it records: 1. model URL 2. model title 3. downloads and likes 4. when the model was last updated

  2. instruction-dataset

    • huggingface.co
    • opendatalab.com
    Updated Feb 10, 2023
    Cite
    Hugging Face H4 (2023). instruction-dataset [Dataset]. https://huggingface.co/datasets/HuggingFaceH4/instruction-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 10, 2023
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    Authors
    Hugging Face H4
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This is the blind eval dataset of high-quality, diverse, human-written instructions with demonstrations. We will be using this for step 3 evaluations in our RLHF pipeline.

  3. stack-exchange-preferences

    • huggingface.co
    • opendatalab.com
    Cite
    Hugging Face H4, stack-exchange-preferences [Dataset]. https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    Authors
    Hugging Face H4
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

    Dataset Card for H4 Stack Exchange Preferences Dataset

      Dataset Summary
    

    This dataset contains questions and answers from the Stack Overflow Data Dump for the purpose of preference model training. Importantly, the questions have been filtered to fit the following criteria for preference models (following closely from Askell et al. 2021): have >=2 answers. This data could also be used for instruction fine-tuning and language model training. The questions are grouped with… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences.
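
The ">=2 answers" criterion mentioned in the summary above can be illustrated with a minimal sketch. The record layout here (a `qid` and a list under `answers`) is a hypothetical stand-in, not necessarily the dataset's actual schema, and the sample rows are made up for illustration.

```python
# Hypothetical records: field names and values are illustrative only,
# not taken from the real dataset.
questions = [
    {"qid": 1, "answers": ["use a list", "use a dict"]},
    {"qid": 2, "answers": ["only one answer"]},
    {"qid": 3, "answers": ["a", "b", "c"]},
]


def has_min_answers(record, minimum=2):
    """Return True when a question has at least `minimum` answers."""
    return len(record["answers"]) >= minimum


# Keep only questions that offer at least two answers to compare.
kept = [q for q in questions if has_min_answers(q)]
print([q["qid"] for q in kept])
```

A preference model needs at least two candidate answers per question to form a comparison, which is presumably why single-answer questions are dropped.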

  4. BERT Hugging face dataset

    • kaggle.com
    zip
    Updated Jun 19, 2022
    Cite
    Xen Xiou (2022). BERT Hugging face dataset [Dataset]. https://www.kaggle.com/datasets/xenxiou/bert-hugging-face-dataset
    Explore at:
    zip (12009924 bytes)
    Dataset updated
    Jun 19, 2022
    Authors
    Xen Xiou
    Description

    Dataset

    This dataset was created by Xen Xiou

    Contents

  5. Data from: hugging face datasets

    • kaggle.com
    zip
    Updated Nov 3, 2025
    Cite
    Nicholas Broad (2025). hugging face datasets [Dataset]. https://www.kaggle.com/nbroad/hf-ds
    Explore at:
    zip (70163997 bytes)
    Dataset updated
    Nov 3, 2025
    Authors
    Nicholas Broad
    Description

    This is the latest version of Hugging Face datasets to be used in offline notebooks on Kaggle. It is automatically updated every week.

    Docs are here

    Installation Instructions

    !pip install datasets --no-index --find-links=file:///kaggle/input/hf-ds -U -q

  6. dataset-card-example

    • huggingface.co
    Updated Sep 28, 2023
    + more versions
    Cite
    Templates (2023). dataset-card-example [Dataset]. https://huggingface.co/datasets/templates/dataset-card-example
    Explore at:
    Dataset updated
    Sep 28, 2023
    Dataset authored and provided by
    Templates
    Description

    Dataset Card for Dataset Name

    This dataset card aims to be a base template for new datasets. It has been generated using this raw template.

      Dataset Details
    
    
    
    
    
      Dataset Description
    

    • Curated by: [More Information Needed]
    • Funded by [optional]: [More Information Needed]
    • Shared by [optional]: [More Information Needed]
    • Language(s) (NLP): [More Information Needed]
    • License: [More Information Needed]

      Dataset Sources [optional]
    

    Repository: [More… See the full description on the dataset page: https://huggingface.co/datasets/templates/dataset-card-example.

  7. Hugging Face Dataset Preparation

    • kaggle.com
    zip
    Updated Jun 15, 2024
    Cite
    Mohannad Ayman Salah (2024). Hugging Face Dataset Preparation [Dataset]. https://www.kaggle.com/datasets/mohannadaymansalah/hugging-face-dataset-preparation
    Explore at:
    zip (911351 bytes)
    Dataset updated
    Jun 15, 2024
    Authors
    Mohannad Ayman Salah
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Mohannad Ayman Salah

    Released under MIT

    Contents

  8. Huggingface Modelhub

    • kaggle.com
    zip
    Updated Jun 19, 2021
    Cite
    Kartik Godawat (2021). Huggingface Modelhub [Dataset]. https://www.kaggle.com/crazydiv/huggingface-modelhub
    Explore at:
    zip (2274876 bytes)
    Dataset updated
    Jun 19, 2021
    Authors
    Kartik Godawat
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset containing metadata for all publicly uploaded models (10,000+) available on the HuggingFace model hub. Data was collected between 15 and 20 June 2021.

    The dataset was generated using the huggingface_hub APIs provided by the Hugging Face team.

    Update v3:

    • Added Downloads last month metric
    • Added library name

    Contents:

    • huggingface_models.csv: Primary file containing metadata such as model name, tags, last-modified time, and filenames.
    • huggingface_modelcard_readme.csv: Detailed file containing the README.md contents, if available, for a particular model. Content is in Markdown format. The modelId column joins the two files.

    huggingface_models.csv:

    • modelId: ID of the model as present on the HF website
    • lastModified: Time when the model was last modified
    • tags: Tags associated with the model (provided by the maintainer)
    • pipeline_tag: If present, denotes which pipeline the model could be used with
    • files: List of available files in the model repo
    • publishedBy: Custom column derived from modelId, specifying who published the model
    • downloads_last_month: Number of times the model was downloaded in the last month
    • library: Name of the library the model belongs to, e.g. transformers, spacy, timm

    huggingface_modelcard_readme.csv:

    • modelId: ID of the model as available on the HF website
    • modelCard: README contents of a model (referred to as modelCard in the HuggingFace ecosystem). It contains useful information on how the model was trained, benchmarks, and author notes.

    Inspiration:

    The idea of analyzing publicly available models on HuggingFace struck me while I was attending a live session of the amazing transformers course by @LysandreJik. Soon after, I tweeted the team and asked for permission to create such a dataset. Special shoutout to @osanseviero for encouraging and pointing me in the right direction.

    This is my first dataset upload on Kaggle. I hope you like it. :)
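
As a rough sketch of how the two CSV files described in this entry could be combined on their shared modelId column, the example below uses only Python's standard csv module. The column names come from the description above; the sample rows, the `example-org/...` model IDs, and the `join_on_model_id` helper are hypothetical, and in-memory strings stand in for the real files.

```python
import csv
import io

# Hypothetical sample rows; only the column names come from the
# dataset description, the values are made up for illustration.
MODELS_CSV = """modelId,pipeline_tag,library,downloads_last_month
example-org/model-a,text-classification,transformers,1200
example-org/model-b,,spacy,35
"""

README_CSV = """modelId,modelCard
example-org/model-a,# Model A readme
"""


def join_on_model_id(models_file, readme_file):
    """Left-join model metadata with README contents on modelId."""
    cards = {
        row["modelId"]: row["modelCard"]
        for row in csv.DictReader(readme_file)
    }
    joined = []
    for row in csv.DictReader(models_file):
        # Models without a README get an empty modelCard.
        row["modelCard"] = cards.get(row["modelId"], "")
        joined.append(row)
    return joined


rows = join_on_model_id(io.StringIO(MODELS_CSV), io.StringIO(README_CSV))
for row in rows:
    print(row["modelId"], row["library"], bool(row["modelCard"]))
```

A left join is used here because the description says README contents are present only "if available", so some models in the primary file may have no matching row in the README file.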

  9. Huggingface Hub Permissible models and datasets

    • kaggle.com
    zip
    Updated Dec 26, 2023
    Cite
    Dheeraj M Pai (2023). Huggingface Hub Permissible models and datasets [Dataset]. https://www.kaggle.com/datasets/dheerajmpai/huggingface-hub-permissible-models-and-datasets
    Explore at:
    zip (34761279 bytes)
    Dataset updated
    Dec 26, 2023
    Authors
    Dheeraj M Pai
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Huggingface Hub: Models, Datasets, and Spaces

    Dataset Overview

    This comprehensive dataset contains detailed information about all the models, datasets, and spaces available on the Huggingface Hub. It is an essential resource for anyone looking to explore the extensive range of tools and datasets available for machine learning and AI research.

    Key Features

    • Comprehensive Data: Includes exhaustive details on all models, datasets, and spaces from the Huggingface Hub.
    • Permissible Models: A specialized subset is provided in a separate CSV file, focusing exclusively on models that are permissible for use.
    • Regularly Updated: The dataset is refreshed weekly to ensure the latest information is always available.

    Last Update

    • Date: December 26, 2023

    Update Frequency

    • Frequency: Weekly

    Dataset Contents

    1. Models: Detailed listings of all models available on Huggingface Hub.
    2. Datasets: Comprehensive information on datasets hosted on the Hub.
    3. Spaces: An overview of the different spaces and their functionalities.
    4. Permissible Models CSV: A smaller, curated list of models that are cleared for use.

    Usage

    This dataset is ideal for researchers, developers, and AI enthusiasts who are looking for a one-stop repository of models, datasets, and spaces from the Huggingface Hub. It provides a holistic view and simplifies the task of finding the right tools for various machine learning and AI projects.

    Note: This dataset is not officially affiliated with or endorsed by the Huggingface organization.

  10. Data from: huggingface

    • kaggle.com
    zip
    Updated Mar 22, 2022
    Cite
    amulil (2022). huggingface [Dataset]. https://www.kaggle.com/datasets/amulil/amulil-huggingface
    Explore at:
    zip (5498282999 bytes)
    Dataset updated
    Mar 22, 2022
    Authors
    amulil
    License

    http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    Dataset

    This dataset was created by amulil

    Released under GPL 2

    Contents

  11. Labelled Corpus - Political Bias (Hugging Face)

    • kaggle.com
    zip
    Updated May 8, 2024
    Cite
    Suraj Karakulath (2024). Labelled Corpus - Political Bias (Hugging Face) [Dataset]. https://www.kaggle.com/datasets/surajkarakulath/labelled-corpus-political-bias-hugging-face
    Explore at:
    zip (50133530 bytes)
    Dataset updated
    May 8, 2024
    Authors
    Suraj Karakulath
    Description

    This is a labeled corpus dataset of article text with corresponding political bias obtained from Huggingface. It contains 17,362 articles labeled left, right, or center by the editors of allsides.com. Articles were manually annotated by news editors who were attempting to select representative articles from the left, right and center of each article topic.

  12. drlc-leaderboard-data

    • huggingface.co
    Updated Apr 25, 2023
    + more versions
    Cite
    Huggingface Projects (2023). drlc-leaderboard-data [Dataset]. https://huggingface.co/datasets/huggingface-projects/drlc-leaderboard-data
    Explore at:
    Dataset updated
    Apr 25, 2023
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    Authors
    Huggingface Projects
    Description

    huggingface-projects/drlc-leaderboard-data dataset hosted on Hugging Face and contributed by the HF Datasets community

  13. paper-central-data-2

    • huggingface.co
    Updated Oct 2, 2024
    + more versions
    Cite
    Hugging Face (2024). paper-central-data-2 [Dataset]. https://huggingface.co/datasets/huggingface/paper-central-data-2
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 2, 2024
    Dataset authored and provided by
    Hugging Face (https://huggingface.co/)
    Description

    huggingface/paper-central-data-2 dataset hosted on Hugging Face and contributed by the HF Datasets community

  14. smollm-corpus

    • huggingface.co
    Updated Jul 16, 2024
    + more versions
    Cite
    Hugging Face Smol Models Research (2024). smollm-corpus [Dataset]. https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 16, 2024
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    Authors
    Hugging Face Smol Models Research
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    SmolLM-Corpus

    This dataset is a curated collection of high-quality educational and synthetic data designed for training small language models. You can find more details about the models trained on this dataset in our SmolLM blog post.

      Dataset subsets
    
    
    
    
    
      Cosmopedia v2
    

    Cosmopedia v2 is an enhanced version of Cosmopedia, the largest synthetic dataset for pre-training, consisting of over 39 million textbooks, blog posts, and stories generated by… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus.

  15. contribute-a-dataset

    • huggingface.co
    Updated Jul 15, 2023
    + more versions
    Cite
    Huggingface Projects (2023). contribute-a-dataset [Dataset]. https://huggingface.co/datasets/huggingface-projects/contribute-a-dataset
    Explore at:
    Dataset updated
    Jul 15, 2023
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    Authors
    Huggingface Projects
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    huggingface-projects/contribute-a-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  16. Data-Science-Instruct-Dataset

    • huggingface.co
    Updated May 3, 2025
    + more versions
    Cite
    Mohammed Habib Ahmed (2025). Data-Science-Instruct-Dataset [Dataset]. https://huggingface.co/datasets/HabibAhmed/Data-Science-Instruct-Dataset
    Explore at:
    Dataset updated
    May 3, 2025
    Authors
    Mohammed Habib Ahmed
    Description

    HabibAhmed/Data-Science-Instruct-Dataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  17. fineweb-edu

    • huggingface.co
    Updated Jan 3, 2025
    + more versions
    Cite
    FineData (2025). fineweb-edu [Dataset]. http://doi.org/10.57967/hf/2497
    Explore at:
    Dataset updated
    Jan 3, 2025
    Dataset authored and provided by
    FineData
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    📚 FineWeb-Edu

    1.3 trillion tokens of the finest educational data the 🌐 web has to offer

    Paper: https://arxiv.org/abs/2406.17557

      What is it?
    

    📚 FineWeb-Edu dataset consists of 1.3T tokens and 5.4T tokens (FineWeb-Edu-score-2) of educational web pages filtered from 🍷 FineWeb dataset. This is the 1.3 trillion version. To enhance FineWeb's quality, we developed an educational quality classifier using annotations generated by LLama3-70B-Instruct. We then… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu.

  18. Drive_Stats

    • huggingface.co
    Cite
    Backblaze, Drive_Stats [Dataset]. https://huggingface.co/datasets/backblaze/Drive_Stats
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset provided by
    Backblaze (http://www.backblaze.com/)
    Authors
    Backblaze
    License

    https://choosealicense.com/licenses/other/

    Description

    Drive Stats

    Drive Stats is a public data set of daily metrics on the hard drives in Backblaze’s cloud storage infrastructure that Backblaze has open-sourced since April 2013. Currently, Drive Stats comprises over 388 million records, rising by over 240,000 records per day. Drive Stats is an append-only dataset effectively logging daily statistics that once written are never updated or deleted. This is our first Hugging Face dataset; feel free to suggest improvements by creating a… See the full description on the dataset page: https://huggingface.co/datasets/backblaze/Drive_Stats.

  19. enron_aeslc_emails

    • huggingface.co
    Updated May 14, 2001
    Cite
    Ahn Young Jin (2001). enron_aeslc_emails [Dataset]. https://huggingface.co/datasets/snoop2head/enron_aeslc_emails
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 14, 2001
    Authors
    Ahn Young Jin
    Description

    snoop2head/enron_aeslc_emails dataset hosted on Hugging Face and contributed by the HF Datasets community

  20. healthcare_data

    • huggingface.co
    Updated Jun 25, 2023
    Cite
    Nicoly Barbosa Gomes da Silva (2023). healthcare_data [Dataset]. https://huggingface.co/datasets/Nicolybgs/healthcare_data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 25, 2023
    Authors
    Nicoly Barbosa Gomes da Silva
    Description

    Nicolybgs/healthcare_data dataset hosted on Hugging Face and contributed by the HF Datasets community
