100+ datasets found

Hugging Face Models Dataset
kaggle.com
zip
Updated Feb 19, 2023
Cite
Yasir Raza (2023). Hugging Face Models Dataset [Dataset]. https://www.kaggle.com/datasets/yasirabdaali/hugging-face-models-dataset
Explore at:
zip (980916 bytes)
Dataset updated
Feb 19, 2023
Authors
Yasir Raza
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Hugging Face

Hugging Face, Inc. is an American company that develops tools for building applications using machine learning. It is most notable for its Transformers library built for natural language processing applications and its platform that allows users to share machine learning models and datasets.

This dataset contains data on 16k models available on huggingface.co. It records the following features for each model:
1. model URL
2. model title
3. downloads and likes
4. updated
instruction-dataset
huggingface.co
opendatalab.com
Updated Feb 10, 2023
Cite
Hugging Face H4 (2023). instruction-dataset [Dataset]. https://huggingface.co/datasets/HuggingFaceH4/instruction-dataset
Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Dataset updated
Feb 10, 2023
Dataset provided by
Hugging Face (https://huggingface.co/)
Authors
Hugging Face H4
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
This is the blind eval dataset of high-quality, diverse, human-written instructions with demonstrations. We will be using this for step 3 evaluations in our RLHF pipeline.
stack-exchange-preferences
huggingface.co
opendatalab.com
Cite
Hugging Face H4, stack-exchange-preferences [Dataset]. https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences
Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Dataset provided by
Hugging Face (https://huggingface.co/)
Authors
Hugging Face H4
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
Description
Dataset Card for H4 Stack Exchange Preferences Dataset

Dataset Summary

This dataset contains questions and answers from the Stack Overflow Data Dump for the purpose of preference model training. Importantly, the questions have been filtered to fit the following criteria for preference models (following closely from Askell et al. 2021): have >=2 answers. This data could also be used for instruction fine-tuning and language model training. The questions are grouped with… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences.
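The filtering criterion described above (keep only questions with at least two answers) could be sketched as follows. This is a minimal illustration, not the dataset authors' pipeline; the record layout and the field names `question` and `answers` are assumptions made for the example.

```python
from typing import Dict, List


def keep_questions_with_min_answers(
    records: List[Dict], min_answers: int = 2
) -> List[Dict]:
    """Return only records whose 'answers' list has at least min_answers items.

    The 'answers' key is a hypothetical field name used for illustration.
    """
    return [r for r in records if len(r.get("answers", [])) >= min_answers]


# Toy records in an assumed layout (not the real dataset schema).
sample = [
    {"question": "q1", "answers": ["a", "b"]},
    {"question": "q2", "answers": ["a"]},
    {"question": "q3", "answers": ["a", "b", "c"]},
]
filtered = keep_questions_with_min_answers(sample)
```

Records with no `answers` key are treated as having zero answers and are dropped.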
BERT Hugging face dataset
kaggle.com
zip
Updated Jun 19, 2022
Cite
Xen Xiou (2022). BERT Hugging face dataset [Dataset]. https://www.kaggle.com/datasets/xenxiou/bert-hugging-face-dataset
Explore at:
zip (12009924 bytes)
Dataset updated
Jun 19, 2022
Authors
Xen Xiou
Description
Dataset

This dataset was created by Xen Xiou

Contents
Data from: hugging face datasets
kaggle.com
zip
Updated Nov 3, 2025
Cite
Nicholas Broad (2025). hugging face datasets [Dataset]. https://www.kaggle.com/nbroad/hf-ds
Explore at:
zip (70163997 bytes)
Dataset updated
Nov 3, 2025
Authors
Nicholas Broad
Description
This is the latest version of Hugging Face datasets to be used in offline notebooks on Kaggle. It is automatically updated every week.

Docs are here

Installation Instructions

!pip install datasets --no-index --find-links=file:///kaggle/input/hf-ds -U -q
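Outside a notebook cell, the same offline install can be launched from Python. The sketch below only rebuilds the command shown above; the helper name `build_offline_install_cmd` is hypothetical, and the actual `subprocess` call is left as a comment because it needs the Kaggle input directory to exist.

```python
import sys
from typing import List


def build_offline_install_cmd(
    package: str = "datasets",
    wheel_dir: str = "/kaggle/input/hf-ds",
) -> List[str]:
    """Build the pip command for installing from a local wheel directory."""
    return [
        sys.executable, "-m", "pip", "install", package,
        "--no-index",                        # never contact PyPI
        f"--find-links=file://{wheel_dir}",  # resolve packages from local files
        "-U", "-q",                          # upgrade if present, quiet output
    ]


cmd = build_offline_install_cmd()
# To run it: import subprocess; subprocess.run(cmd, check=True)
```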
dataset-card-example
huggingface.co
Updated Sep 28, 2023
+ more versions
Cite
Templates (2023). dataset-card-example [Dataset]. https://huggingface.co/datasets/templates/dataset-card-example
Explore at:
Dataset updated
Sep 28, 2023
Dataset authored and provided by
Templates
Description
Dataset Card for Dataset Name

This dataset card aims to be a base template for new datasets. It has been generated using this raw template.

Dataset Details Dataset Description

Curated by: [More Information Needed] Funded by [optional]: [More Information Needed] Shared by [optional]: [More Information Needed] Language(s) (NLP): [More Information Needed] License: [More Information Needed]

Dataset Sources [optional]

Repository: [More… See the full description on the dataset page: https://huggingface.co/datasets/templates/dataset-card-example.
Hugging Face Dataset Preparation
kaggle.com
zip
Updated Jun 15, 2024
Cite
Mohannad Ayman Salah (2024). Hugging Face Dataset Preparation [Dataset]. https://www.kaggle.com/datasets/mohannadaymansalah/hugging-face-dataset-preparation
Explore at:
zip (911351 bytes)
Dataset updated
Jun 15, 2024
Authors
Mohannad Ayman Salah
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Dataset

This dataset was created by Mohannad Ayman Salah

Released under MIT

Contents
Huggingface Modelhub
kaggle.com
zip
Updated Jun 19, 2021
Cite
Kartik Godawat (2021). Huggingface Modelhub [Dataset]. https://www.kaggle.com/crazydiv/huggingface-modelhub
Explore at:
zip (2274876 bytes)
Dataset updated
Jun 19, 2021
Authors
Kartik Godawat
License
https://creativecommons.org/publicdomain/zero/1.0/
Description

Dataset containing metadata for all publicly uploaded models (10,000+) available on the HuggingFace model hub. Data was collected between 15 and 20 June 2021.

The dataset was generated using the huggingface_hub APIs provided by the Hugging Face team.

Update v3:

Added Downloads last month metric

Added library name

Contents:

huggingface_models.csv: Primary file containing metadata such as model name, tags, last-modified time, and filenames

huggingface_modelcard_readme.csv: Detailed file containing the README.md contents, if available, for a particular model. Content is in Markdown format. The modelId column joins the two files.

huggingface_models.csv:

modelId: ID of the model as present on HF website

lastModified: Time when this model was last modified

tags: Tags associated with the model (provided by maintainer)

pipeline_tag: If it exists, denotes which pipeline this model could be used with

files: List of available files in the model repo

publishedBy: Custom column derived from modelId, specifying who published this model

downloads_last_month: Number of times the model has been downloaded in the last month

library: Name of the library the model belongs to, e.g. transformers, spacy, timm

huggingface_modelcard_readme.csv:

modelId: ID of the model as available on HF website

modelCard: README contents of a model (referred to as modelCard in the HuggingFace ecosystem). It contains useful information on how the model was trained, benchmarks, and author notes.

Inspiration:

The idea of analyzing publicly available models on HuggingFace struck me while I was attending a live session of the amazing transformers course by @LysandreJik. Soon after, I tweeted the team and asked for permission to create such a dataset. Special shoutout to @osanseviero for encouraging and pointing me in the right direction.

This is my first dataset upload on Kaggle. I hope you like it. :)
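Since the description states that the modelId column links the two Modelhub CSV files, a join could be sketched as below. This is a minimal illustration using only the standard library; the helper name `join_on_model_id` and the sample rows are made up, and only the column names come from the description.

```python
import csv
import io
from typing import Dict, List


def join_on_model_id(models_csv: str, readme_csv: str) -> List[Dict[str, str]]:
    """Left-join model metadata rows with README rows on the modelId column."""
    readmes = {
        row["modelId"]: row["modelCard"]
        for row in csv.DictReader(io.StringIO(readme_csv))
    }
    joined = []
    for row in csv.DictReader(io.StringIO(models_csv)):
        # Models without a README get an empty modelCard.
        joined.append({**row, "modelCard": readmes.get(row["modelId"], "")})
    return joined


# Made-up sample rows; only the column names come from the description.
models_csv = (
    "modelId,library,downloads_last_month\n"
    "org/model-a,transformers,120\n"
    "org/model-b,timm,7\n"
)
readme_csv = "modelId,modelCard\norg/model-a,# Model A\n"

rows = join_on_model_id(models_csv, readme_csv)
```

With the real files, the two string arguments would be the text contents of huggingface_models.csv and huggingface_modelcard_readme.csv.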
Huggingface Hub Permissible models and datasets
kaggle.com
zip
Updated Dec 26, 2023
Cite
Dheeraj M Pai (2023). Huggingface Hub Permissible models and datasets [Dataset]. https://www.kaggle.com/datasets/dheerajmpai/huggingface-hub-permissible-models-and-datasets
Explore at:
zip (34761279 bytes)
Dataset updated
Dec 26, 2023
Authors
Dheeraj M Pai
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Huggingface Hub: Models, Datasets, and Spaces

Dataset Overview

This comprehensive dataset contains detailed information about all the models, datasets, and spaces available on the Huggingface Hub. It is an essential resource for anyone looking to explore the extensive range of tools and datasets available for machine learning and AI research.

Key Features

Comprehensive Data: Includes exhaustive details on all models, datasets, and spaces from the Huggingface Hub.

Permissible Models: A specialized subset is provided in a separate CSV file, focusing exclusively on models that are permissible for use.

Regularly Updated: The dataset is refreshed weekly to ensure the latest information is always available.

Last Update

Date: December 26, 2023

Update Frequency

Frequency: Weekly

Dataset Contents

Models: Detailed listings of all models available on Huggingface Hub.

Datasets: Comprehensive information on datasets hosted on the Hub.

Spaces: An overview of the different spaces and their functionalities.

Permissible Models CSV: A smaller, curated list of models that are cleared for use.

Usage

This dataset is ideal for researchers, developers, and AI enthusiasts who are looking for a one-stop repository of models, datasets, and spaces from the Huggingface Hub. It provides a holistic view and simplifies the task of finding the right tools for various machine learning and AI projects.

Note: This dataset is not officially affiliated with or endorsed by the Huggingface organization.
Data from: huggingface
kaggle.com
zip
Updated Mar 22, 2022
Cite
amulil (2022). huggingface [Dataset]. https://www.kaggle.com/datasets/amulil/amulil-huggingface
Explore at:
zip (5498282999 bytes)
Dataset updated
Mar 22, 2022
Authors
amulil
License
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
Description
Dataset

This dataset was created by amulil

Released under GPL 2

Contents
Labelled Corpus - Political Bias (Hugging Face)
kaggle.com
zip
Updated May 8, 2024
Cite
Suraj Karakulath (2024). Labelled Corpus - Political Bias (Hugging Face) [Dataset]. https://www.kaggle.com/datasets/surajkarakulath/labelled-corpus-political-bias-hugging-face
Explore at:
zip (50133530 bytes)
Dataset updated
May 8, 2024
Authors
Suraj Karakulath
Description
This is a labeled corpus dataset of article text with corresponding political bias obtained from Huggingface. It contains 17,362 articles labeled left, right, or center by the editors of allsides.com. Articles were manually annotated by news editors who were attempting to select representative articles from the left, right and center of each article topic.
drlc-leaderboard-data
huggingface.co
Updated Apr 25, 2023
+ more versions
Cite
Huggingface Projects (2023). drlc-leaderboard-data [Dataset]. https://huggingface.co/datasets/huggingface-projects/drlc-leaderboard-data
Explore at:
Dataset updated
Apr 25, 2023
Dataset provided by
Hugging Face (https://huggingface.co/)
Authors
Huggingface Projects
Description
huggingface-projects/drlc-leaderboard-data dataset hosted on Hugging Face and contributed by the HF Datasets community
paper-central-data-2
huggingface.co
Updated Oct 2, 2024
+ more versions
Cite
Hugging Face (2024). paper-central-data-2 [Dataset]. https://huggingface.co/datasets/huggingface/paper-central-data-2
Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Dataset updated
Oct 2, 2024
Dataset authored and provided by
Hugging Face (https://huggingface.co/)
Description
huggingface/paper-central-data-2 dataset hosted on Hugging Face and contributed by the HF Datasets community
smollm-corpus
huggingface.co
Updated Jul 16, 2024
+ more versions
Cite
Hugging Face Smol Models Research (2024). smollm-corpus [Dataset]. https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus
Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Dataset updated
Jul 16, 2024
Dataset provided by
Hugging Face (https://huggingface.co/)
Authors
Hugging Face Smol Models Research
License
https://choosealicense.com/licenses/odc-by/
Description
SmolLM-Corpus

This dataset is a curated collection of high-quality educational and synthetic data designed for training small language models. You can find more details about the models trained on this dataset in our SmolLM blog post.

Dataset subsets Cosmopedia v2

Cosmopedia v2 is an enhanced version of Cosmopedia, the largest synthetic dataset for pre-training, consisting of over 39 million textbooks, blog posts, and stories generated by… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus.
contribute-a-dataset
huggingface.co
Updated Jul 15, 2023
+ more versions
Cite
Huggingface Projects (2023). contribute-a-dataset [Dataset]. https://huggingface.co/datasets/huggingface-projects/contribute-a-dataset
Explore at:
Dataset updated
Jul 15, 2023
Dataset provided by
Hugging Face (https://huggingface.co/)
Authors
Huggingface Projects
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
huggingface-projects/contribute-a-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
Data-Science-Instruct-Dataset
huggingface.co
Updated May 3, 2025
+ more versions
Cite
Mohammed Habib Ahmed (2025). Data-Science-Instruct-Dataset [Dataset]. https://huggingface.co/datasets/HabibAhmed/Data-Science-Instruct-Dataset
Explore at:
Dataset updated
May 3, 2025
Authors
Mohammed Habib Ahmed
Description
HabibAhmed/Data-Science-Instruct-Dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
fineweb-edu
huggingface.co
Updated Jan 3, 2025
+ more versions
Cite
FineData (2025). fineweb-edu [Dataset]. http://doi.org/10.57967/hf/2497
Explore at:
Unique identifier
https://doi.org/10.57967/hf/2497
Dataset updated
Jan 3, 2025
Dataset authored and provided by
FineData
License
https://choosealicense.com/licenses/odc-by/
Description
📚 FineWeb-Edu

1.3 trillion tokens of the finest educational data the 🌐 web has to offer

Paper: https://arxiv.org/abs/2406.17557

What is it?

The 📚 FineWeb-Edu dataset consists of 1.3T tokens and 5.4T tokens (FineWeb-Edu-score-2) of educational web pages filtered from the 🍷 FineWeb dataset. This is the 1.3 trillion version. To enhance FineWeb's quality, we developed an educational quality classifier using annotations generated by Llama3-70B-Instruct. We then… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu.
Drive_Stats
huggingface.co
Cite
Backblaze, Drive_Stats [Dataset]. https://huggingface.co/datasets/backblaze/Drive_Stats
Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Dataset provided by
Backblaze (http://www.backblaze.com/)
Authors
Backblaze
License
https://choosealicense.com/licenses/other/
Description
Drive Stats

Drive Stats is a public data set of daily metrics on the hard drives in Backblaze’s cloud storage infrastructure that Backblaze has open-sourced since April 2013. Currently, Drive Stats comprises over 388 million records, rising by over 240,000 records per day. Drive Stats is an append-only dataset effectively logging daily statistics that once written are never updated or deleted. This is our first Hugging Face dataset; feel free to suggest improvements by creating a… See the full description on the dataset page: https://huggingface.co/datasets/backblaze/Drive_Stats.
enron_aeslc_emails
huggingface.co
Updated May 14, 2001
Cite
Ahn Young Jin (2001). enron_aeslc_emails [Dataset]. https://huggingface.co/datasets/snoop2head/enron_aeslc_emails
Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Dataset updated
May 14, 2001
Authors
Ahn Young Jin
Description
snoop2head/enron_aeslc_emails dataset hosted on Hugging Face and contributed by the HF Datasets community
healthcare_data
huggingface.co
Updated Jun 25, 2023
Cite
Nicoly Barbosa Gomes da Silva (2023). healthcare_data [Dataset]. https://huggingface.co/datasets/Nicolybgs/healthcare_data
Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Dataset updated
Jun 25, 2023
Authors
Nicoly Barbosa Gomes da Silva
Description
Nicolybgs/healthcare_data dataset hosted on Hugging Face and contributed by the HF Datasets community
