MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This comprehensive dataset contains detailed information about all the models, datasets, and spaces available on the Huggingface Hub. It is an essential resource for anyone looking to explore the extensive range of tools and datasets available for machine learning and AI research.
This dataset is ideal for researchers, developers, and AI enthusiasts who are looking for a one-stop repository of models, datasets, and spaces from the Huggingface Hub. It provides a holistic view and simplifies the task of finding the right tools for various machine learning and AI projects.
Note: This dataset is not officially affiliated with or endorsed by the Huggingface organization.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
data-hub-xyz987/TabPalooza dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
License information was derived automatically
Hindi Language Pre-Trained LLM Datasets Overview
Welcome to the Hindi Language Pre-Training Datasets repository! This README provides a comprehensive overview of various pre-training datasets available for Hindi, including essential details such as licenses, sources, and statistical information. These datasets are invaluable resources for training and fine-tuning large language models (LLMs) for a wide range of natural language processing (NLP) tasks.
Data Overview and Statistics
This README… See the full description on the dataset page: https://huggingface.co/datasets/Hindi-data-hub/odaigen_hindi_pre_trained_sp.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
The Hugging Face Hub hosts many models for a variety of machine learning tasks. Models are stored in repositories, so they benefit from all the features available to every repository on the Hugging Face Hub.
| Variable | Description |
|---|---|
| model_id | Identifier of the model on the Hugging Face Hub |
| pipeline | Pipeline tag of the model; there are 40 pipelines in total. To learn more, read: Hugging Face Pipeline |
| downloads | Number of downloads |
| likes | Number of likes |
| author_id | Identifier of the model's author |
| author_name | Name of the model's author |
| author_type | user or organization |
| author_isPro | Whether the author is a paid (Pro) user or organization |
| lastModified | Date of last modification, from 2014-08-10 to 2023-11-27 |
CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
Dataset containing metadata for all publicly uploaded models (10,000+) available on the HuggingFace model hub. Data was collected between 15 and 20 June 2021.
The dataset was generated using the huggingface_hub API provided by the Hugging Face team.
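As a rough, hedged sketch of how such metadata can be collected with the huggingface_hub client (the exact fields and calls the author used are not documented here):

```python
# Hedged sketch: enumerate public models on the Hub and collect basic metadata.
# Attribute availability can vary by huggingface_hub version; verify before use.
from huggingface_hub import HfApi

api = HfApi()
for model in api.list_models(limit=10):  # drop limit to iterate over all models
    print(model.id, model.downloads, model.likes, model.pipeline_tag)
```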
This is my first dataset upload on Kaggle. I hope you like it. :)
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This repository contains the data used in the paper titled "On the Suitability of Hugging Face Hub for Empirical Studies". For RQ1 we share the survey responses and the interview transcription, while for RQ2 we share the link to the repository where the data is hosted.
nielsr/huggingface-hub-classes dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This dataset was created by Aishik Rakshit
Released under Apache 2.0
nielsr/huggingface-hub-docs-chunks-test dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
This repository contains a gold standard dataset for named entity recognition and span categorization annotations from Diderot & d’Alembert’s Encyclopédie entries.
The dataset is available in the following formats:
- JSONL format provided by Prodigy
- binary spaCy format (ready to use with the spaCy train pipeline, as sketched below)
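As a hedged sketch (the config and file names are placeholders; use the repository's actual files), the binary .spacy files plug into spaCy's standard training entry point:

```python
# Hedged sketch: train a spaCy pipeline from the binary .spacy files.
# config.cfg and the file paths are assumptions; substitute the repository's files.
from spacy.cli.train import train

train("config.cfg", overrides={"paths.train": "./train.spacy", "paths.dev": "./dev.spacy"})
```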
The Gold Standard dataset is composed of 2,200 paragraphs drawn from 2,001 randomly selected Encyclopédie entries. All paragraphs were written in 18th-century French.
The spans/entities were labelled by the project team, aided by pre-labelling with early machine-learning models to speed up the labelling process. A train/val/test split was used. The validation and test sets are composed of 200 paragraphs each: 100 classified under 'Géographie' and 100 from another knowledge domain. The datasets have the following breakdown of tokens and spans/entities.
Tagset
NC-Spatial: a common noun that identifies a spatial entity (nominal spatial entity) including natural features, e.g. ville, la rivière, royaume.
NP-Spatial: a proper noun identifying the name of a place (spatial named entities), e.g. France, Paris, la Chine.
ENE-Spatial: nested spatial entity, e.g. ville de France, royaume de Naples, la mer Baltique.
Relation: spatial relation, e.g. dans, sur, à 10 lieues de.
Latlong: geographic coordinates, e.g. Long. 19. 49. lat. 43. 55. 44.
NC-Person: a common noun that identifies a person (nominal person entity), e.g. roi, l'empereur, les auteurs.
NP-Person: a proper noun identifying the name of a person (person named entities), e.g. Louis XIV, Pline.
ENE-Person: nested people entity, e.g. le czar Pierre, roi de Macédoine.
NP-Misc: a proper noun identifying entities not classified as spatial or person, e.g. l'Eglise, 1702, Pélasgique.
ENE-Misc: nested named entity not classified as spatial or person, e.g. l'ordre de S. Jacques, la déclaration du 21 Mars 1671.
Head: entry name
Domain-Mark: words indicating the knowledge domain (usually after the head and between parenthesis), e.g. Géographie, Geog., en Anatomie.
HuggingFace
The GeoEDdA dataset is available on the HuggingFace Hub: https://huggingface.co/datasets/GEODE/GeoEDdA
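As an illustrative, hedged sketch, the dataset can be pulled from the Hub with the datasets library (the available splits and configs are an assumption; verify on the dataset page):

```python
# Hedged sketch: load the GeoEdDA gold standard from the Hugging Face Hub.
from datasets import load_dataset

geo_edda = load_dataset("GEODE/GeoEddA")
print(geo_edda)  # inspect the available splits and features
```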
spaCy Custom Spancat trained on Diderot & d’Alembert’s Encyclopédie entries
This dataset was used to train and evaluate a custom spancat model for French using spaCy. The model is available on HuggingFace's model hub: https://huggingface.co/GEODE/fr_spacy_custom_spancat_edda.
Acknowledgement
The authors are grateful to the ASLAN project (ANR-10-LABX-0081) of the Université de Lyon for its financial support within the French program "Investments for the Future" operated by the National Research Agency (ANR). Data courtesy of the ARTFL Encyclopédie Project, University of Chicago.
This is a labeled corpus dataset of article text with corresponding political bias obtained from Hugging Face. It contains 17,362 articles labeled left, right, or center by the editors of allsides.com. Articles were manually annotated by news editors attempting to select representative articles from the left, right, and center of each article topic.
This dataset contains images of vegetables (Bell Pepper, Brinjal, Chile Pepper, Cucumber, New Mexico Green Chile, Pumpkin, Tomato) in various conditions, including Damaged, Dried, Old, Diseased, Flower, Ripe, Rotten, and Unripe states. It is split into training and testing subsets.
CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
By Huggingface Hub [source]
The Open-Orca Augmented FLAN Collection is a dataset created to support research on natural language processing, machine learning models, and language understanding by leveraging reasoning-trace-enhancement techniques. By enabling models to understand complex relationships between words, phrases, and even entire sentences more robustly than before, this dataset gives researchers expanded opportunities to further linguistics research. With its combination of system prompts, questions from users, and responses from systems, it opens up exciting possibilities for deeper exploration of the concepts underlying advanced linguistics applications. Experience a new level of accuracy and performance: explore the Open-Orca Augmented FLAN Collection today!
This guide provides an introduction to the Open-Orca Augmented FLAN Collection dataset and outlines how researchers can utilize it for their language understanding and natural language processing (NLP) work. The Open-Orca dataset includes system prompts, questions posed by users, and responses from the system.
Getting Started
The first step is to download the dataset from Kaggle at https://www.kaggle.com/openai/open-orca-augmented-flan and save it in a project directory of your choice on your computer or cloud storage. Once you have downloaded the dataset, launch Jupyter Notebook or Google Colab, whichever you prefer to work with.
Exploring & Preprocessing Data
To get a better understanding of the features in this dataset, import them into a pandas DataFrame as shown below. You can use other libraries as needed:

```python
import pandas as pd  # library used for loading datasets into Python

df = pd.read_csv('train.csv')  # load the train CSV file into a DataFrame
df[['system_prompt', 'question', 'response']].head()  # view the top 5 rows of these columns
```

After importing, inspect each feature using basic descriptive statistics, such as pandas groupby or value_counts statements, which give greater clarity over the values present in each feature. The command below shows the count of each element in the system_prompt column of the train CSV file:

```python
df['system_prompt'].value_counts().head()  # count of each element in 'system_prompt'
```

Example output:
User says hello guys: 587
System asks How are you?: 555
User says I am doing good: 487
...and so on

Data Transformation
After inspecting and exploring the features, you may want to make certain changes that best suit your needs before training modeling algorithms on this dataset. Common transformation steps include removing punctuation marks: since punctuation may not add value to computation, it can be stripped with a regex, e.g. df['question'].str.replace('[^A-Za-z ]+', '', regex=True).
- Automated Question Answering: Leverage the dataset to train and develop question answering models that can provide tailored answers to specific user queries while retaining language understanding abilities.
- Natural Language Understanding: Use the dataset as an exploratory tool for fine-tuning natural language processing applications, such as sentiment analysis, document categorization, parts-of-speech tagging and more.
- Machine Learning Optimizations: The dataset can be used to build highly customized machine learning pipelines that allow users to harness the power of conditioning data with pre-existing rules or models for improved accuracy and performance in automated tasks.
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. [See Other Information](ht...
https://www.marketreportanalytics.com/privacy-policy
Discover the booming Community-Driven Model Service Platform market! This comprehensive analysis reveals a CAGR of 10.1%, driven by AI adoption and open-source innovation. Explore market size, trends, segmentation (cloud, on-premises, adult, children), key players (Kaggle, GitHub, Hugging Face), and regional insights. Learn more about this rapidly expanding sector.
Dataset Card for Hugging Face Hub Models with Base Model Metadata
Dataset Details
This dataset contains a subset of possible metadata for models hosted on the Hugging Face Hub. All of these models contain base_model metadata, i.e. information about the model used as the basis for fine-tuning. This data can be used for creating network graphs showing links between models on the Hub, as sketched below.
Dataset Description
See the full description on the dataset page: https://huggingface.co/datasets/librarian-bots/hub_models_with_base_model_info.
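As an illustrative, hedged sketch (field names such as 'modelId' and 'base_model' are assumptions; check the dataset card for the actual schema), the metadata can be turned into a lineage graph:

```python
# Hedged sketch: build a directed graph linking base models to fine-tuned models.
# Field names are assumptions; verify against the dataset's actual columns.
from datasets import load_dataset
import networkx as nx

rows = load_dataset("librarian-bots/hub_models_with_base_model_info", split="train")
graph = nx.DiGraph()
for row in rows:
    base = row.get("base_model")
    if base:  # skip rows with missing base model info
        graph.add_edge(base, row["modelId"])  # edge: base model -> fine-tuned model
print(graph.number_of_nodes(), "models,", graph.number_of_edges(), "fine-tuning links")
```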
https://choosealicense.com/licenses/other/
This data accompanies the WebUI project (https://dl.acm.org/doi/abs/10.1145/3544548.3581158). For more information, check out the project website: https://uimodeling.github.io/. To download this dataset, you need to install the huggingface-hub package:

```
pip install huggingface-hub
```
Then use snapshot_download:

```python
from huggingface_hub import snapshot_download

snapshot_download(repo_id="biglab/webui-all", repo_type="dataset")
```
IMPORTANT
Before downloading and using, please review the copyright info here:… See the full description on the dataset page: https://huggingface.co/datasets/biglab/webui-all.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Datasets formats on the Hugging Face Hub
Every day, we check the proportion of data formats among the datasets published on Hugging Face. The data is published at https://huggingface.co/datasets/severo/dataset-formats. The count includes all the datasets supported by the dataset viewer, and covers only the supported formats. By dataset format, we refer to the native format of the data. All the supported datasets are also available as Parquet. See… See the full description on the dataset page: https://huggingface.co/datasets/severo/dataset-formats.
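As a hedged illustration (the split name and schema are assumptions; check the dataset page), the daily counts can be loaded directly:

```python
# Hedged sketch: load the daily format-proportion data.
from datasets import load_dataset

formats = load_dataset("severo/dataset-formats", split="train")
print(formats[0])  # inspect one day's record
```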
ashim/dev-push-to-hub dataset hosted on Hugging Face and contributed by the HF Datasets community
nateshmbhat/isha-call-center-qa-data dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/odc-by/
🍷 FineWeb
15 trillion tokens of the finest data the 🌐 web has to offer
What is it?
The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.
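As a hedged sketch, a sample of the corpus can be streamed with the datasets library (the "sample-10BT" config name follows the dataset card's conventions; verify the available configs on the dataset page):

```python
# Hedged sketch: stream a few documents from a FineWeb sample without downloading it all.
from datasets import load_dataset

fw = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train", streaming=True)
for i, doc in enumerate(fw):
    print(doc["text"][:100])  # preview the first 100 characters of each document
    if i >= 2:
        break
```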