CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/
Dataset containing metadata for all publicly uploaded models (10,000+) available on the Hugging Face model hub. Data was collected between 15 and 20 June 2021.
The dataset was generated using the huggingface_hub API provided by the Hugging Face team.
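As a rough sketch, metadata like this can be pulled with the huggingface_hub client; the attribute names below reflect the 2021-era API (newer releases expose model.id instead of model.modelId), so treat them as assumptions and check your library version.
from huggingface_hub import HfApi

api = HfApi()
# Iterate over every publicly listed model on the Hub and print basic metadata;
# modelId is the 2021-era attribute name (newer versions use `id`)
for model in api.list_models():
    print(model.modelId, model.tags)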
This is my first dataset upload on Kaggle. I hope you like it. :)
not-lain/test-parquet-upload-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
jasong03/data-upload dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset contains different variants of the RoBERTa and XLM-RoBERTa models by Meta AI available on Hugging Face's model repository.
Packaging the weights as a Kaggle dataset makes them significantly faster to load, since you can attach the dataset directly to your notebook rather than downloading the data every time. See the speed comparison notebook. Another benefit of loading models as a dataset is that they can be used in competitions that require internet access to be "off".
For more information on usage, visit the RoBERTa Hugging Face docs and the XLM-RoBERTa Hugging Face docs.
Usage
To use this dataset, attach it to your notebook and specify the path to the dataset. For example:
from transformers import AutoTokenizer, AutoModelForPreTraining

# Path to the attached Kaggle dataset
MODEL_DIR = "/kaggle/input/huggingface-roberta/"

# Load the tokenizer and weights from the local copy instead of the internet
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR + "roberta-base")
model = AutoModelForPreTraining.from_pretrained(MODEL_DIR + "roberta-base")
Acknowledgements
All the copyrights and IP relating to RoBERTa and XLM-RoBERTa belong to the original authors (Liu et al. and Conneau et al.) and Meta AI. All copyrights relating to the transformers library belong to Hugging Face. Please reach out directly to the authors if you have questions regarding licenses and usage.
This dataset contains different variants of the SqueezeBERT model available on Hugging Face's model repository.
Packaging the weights as a Kaggle dataset makes them significantly faster to load, since you can attach the dataset directly to your notebook rather than downloading the data every time. See the speed comparison notebook. Another benefit of loading models as a dataset is that they can be used in competitions that require internet access to be "off".
For more information on usage, visit the SqueezeBERT Hugging Face docs.
Usage
To use this dataset, attach it to your notebook and specify the path to the dataset. For example:
from transformers import AutoTokenizer, AutoModelForPreTraining

MODEL_DIR = "/kaggle/input/huggingface-squeezebert/"

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR + "squeezebert-mnli-headless")
model = AutoModelForPreTraining.from_pretrained(MODEL_DIR + "squeezebert-mnli-headless")
Acknowledgements
All the copyrights and IP relating to SqueezeBERT belong to the original authors (Iandola et al.). All copyrights relating to the transformers library belong to Hugging Face. Please reach out directly to the authors if you have questions regarding licenses and usage.
This dataset was created by Vissarion Moutafis
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Natural Reasoning is a large-scale dataset designed for general reasoning tasks. It consists of high-quality, challenging reasoning questions backtranslated from the pretraining corpora DCLM and FineMath. The dataset has been carefully deduplicated and decontaminated against popular reasoning benchmarks including MATH, GPQA, MMLU-Pro, and MMLU-STEM.
A subset of 1.1 million questions from the Natural Reasoning dataset is released to the research community to foster the development of strong large language model (LLM) reasoners.
File Format: natural_reasoning.parquet
License: CC-BY-NC-4.0 · Tasks: Text Generation, Reasoning · Language: English (en) · Size: 1M < n < 10M · Source: Hugging Face
You can load the dataset directly from Hugging Face as follows:
from datasets import load_dataset
ds = load_dataset("facebook/natural_reasoning")
The dataset was constructed from the pretraining corpora DCLM and FineMath. The questions have been filtered to remove contamination and duplication from widely-used reasoning benchmarks like MATH, GPQA, MMLU-Pro, and MMLU-STEM. For each question, the dataset provides a reference final answer extracted from the original document when available, and also includes a model-generated response from Llama3.3-70B-Instruct.
In the 1.1 million subset:
- 18.29% of the questions do not have a reference answer.
- 9.71% of the questions have a single-word answer.
- 21.58% of the questions have a short answer.
- 50.42% of the questions have a long-form reference answer.
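As a rough illustration, a breakdown like this could be recomputed from the released split; the column name reference_answer below is an assumption about the dataset's schema, so verify it against the dataset card first.
from datasets import load_dataset

ds = load_dataset("facebook/natural_reasoning", split="train")

# Fraction of questions whose (assumed) reference_answer field is empty
missing = sum(1 for example in ds if not example["reference_answer"])
print(f"Questions without a reference answer: {missing / len(ds):.2%}")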
Training on the Natural Reasoning dataset shows superior scaling effects compared to other datasets. When used to train the Llama3.1-8B-Instruct model, it achieved better average performance across three key benchmarks: MATH, GPQA, and MMLU-Pro.
Figure: Scaling curve (https://cdn-uploads.huggingface.co/production/uploads/659a395421a7431643caedda/S6aO-agjRRhc0JLkohZ5z.jpeg)
If you use the Natural Reasoning dataset, please cite it with the following BibTeX entry:
@misc{yuan2025naturalreasoningreasoningwild28m,
title={NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions},
author={Weizhe Yuan and Jane Yu and Song Jiang and Karthik Padthe and Yang Li and Dong Wang and Ilia Kulikov and Kyunghyun Cho and Yuandong Tian and Jason E Weston and Xian Li},
year={2025},
eprint={2502.13124},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.13124}
}
Source: Hugging Face
This dataset contains different variants of the MobileBERT model by Google available on Hugging Face's model repository.
Packaging the weights as a Kaggle dataset makes them significantly faster to load, since you can attach the dataset directly to your notebook rather than downloading the data every time. See the speed comparison notebook. Another benefit of loading models as a dataset is that they can be used in competitions that require internet access to be "off".
For more information on usage, visit the MobileBERT Hugging Face docs.
Usage
To use this dataset, attach it to your notebook and specify the path to the dataset. For example:
from transformers import AutoTokenizer, AutoModelForPreTraining

MODEL_DIR = "/kaggle/input/huggingface-google-mobilebert/"

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForPreTraining.from_pretrained(MODEL_DIR)
Acknowledgements
All the copyrights and IP relating to MobileBERT belong to the original authors (Sun et al.) and Google. All copyrights relating to the transformers library belong to Hugging Face. Please reach out directly to the authors if you have questions regarding licenses and usage.
nshehadeh/test-upload-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset contains different variants of the ALBERTv2 model by Google available on Hugging Face's model repository.
Packaging the weights as a Kaggle dataset makes them significantly faster to load, since you can attach the dataset directly to your notebook rather than downloading the data every time. See the speed comparison notebook. Another benefit of loading models as a dataset is that they can be used in competitions that require internet access to be "off".
For more information on usage, visit the ALBERT Hugging Face docs.
Usage
To use this dataset, attach it to your notebook and specify the path to the dataset. For example:
from transformers import AutoTokenizer, AutoModelForPreTraining

MODEL_DIR = "/kaggle/input/huggingface-albert-v2/"

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR + "albert-base-v2")
model = AutoModelForPreTraining.from_pretrained(MODEL_DIR + "albert-base-v2")
Acknowledgements
All the copyrights and IP relating to ALBERT belong to the original authors (Lan et al.) and Google. All copyrights relating to the transformers library belong to Hugging Face. Please reach out directly to the authors if you have questions regarding licenses and usage.
This dataset contains many popular BERT weights retrieved directly from Hugging Face's model repository and hosted on Kaggle. It is automatically updated every month to ensure that the latest version is available to the user. Packaging the weights as a dataset makes them significantly faster to load, since you can attach the Kaggle dataset directly to your notebook rather than downloading the data every time. See the speed comparison notebook.
The banner was adapted from figures by Jimmy Lin (tweet; slide) released under CC BY 4.0. BERT has an Apache 2.0 license according to the model repository.
To use this dataset, simply attach it to your notebook and specify the path to the dataset. For example:
from transformers import AutoTokenizer, AutoModelForMaskedLM
MODEL_DIR = "/kaggle/input/huggingface-bert/"
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR + "bert-large-uncased")
model = AutoModelForMaskedLM.from_pretrained(MODEL_DIR + "bert-large-uncased")
All the copyrights and IP relating to BERT belong to the original authors (Devlin et al. 2019) and Google. All copyrights relating to the transformers library belong to Hugging Face. The banner image was created thanks to Jimmy Lin, so any modification of this figure should mention the original author and respect the conditions of the license; all copyrights related to the images belong to him.
Some of the models are community created or trained. Please reach out directly to the authors if you have questions regarding licenses and usage.
This dataset contains different variants of the Longformer model by AllenAI available on Hugging Face's model repository.
Packaging the weights as a Kaggle dataset makes them significantly faster to load, since you can attach the dataset directly to your notebook rather than downloading the data every time. See the speed comparison notebook. Another benefit of loading models as a dataset is that they can be used in competitions that require internet access to be "off".
For more information on usage, visit the Longformer Hugging Face docs.
Usage
To use this dataset, attach it to your notebook and specify the path to the dataset. For example:
from transformers import AutoTokenizer, AutoModelForMultipleChoice

MODEL_DIR = "/kaggle/input/huggingface-allenai-longformer/"

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR + "longformer-base-4096")
model = AutoModelForMultipleChoice.from_pretrained(MODEL_DIR + "longformer-base-4096")
Acknowledgements
All the copyrights and IP relating to Longformer belong to the original authors of the respective models (Beltagy et al. and Cattan et al.) and the Allen Institute for AI. All copyrights relating to the transformers library belong to Hugging Face. Please reach out directly to the authors if you have questions regarding licenses and usage.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Vietnamese Curated Text Dataset. This dataset is collected from multiple open Vietnamese datasets and curated with NeMo Curator.
Please see our tech blog post on NVIDIA's blog for details.
We utilize a combination of datasets that contain samples in the Vietnamese language, ensuring a robust and representative text corpus. These datasets include:
- The Vietnamese subset of the C4 dataset.
- The Vietnamese subset of the OSCAR dataset, version 23.01.
- Wikipedia's Vietnamese articles.
- Binhvq's Vietnamese news corpus.
We use NeMo Curator to curate the collected data. The data curation pipeline includes these key steps (a sketch of the first two steps follows this list):
1. Unicode Reformatting: texts are standardized into a consistent Unicode format to avoid encoding issues.
2. Exact Deduplication: removes exact duplicates to reduce redundancy.
3. Quality Filtering:
- Heuristic Filtering: applies rule-based filters to remove low-quality content.
- Classifier-Based Filtering: uses machine learning to classify and filter documents based on quality.
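Below is a minimal sketch of what the first two stages could look like with NeMo Curator; the class names follow the NeMo Curator API, but the file path and field names are illustrative, and exact signatures may differ between releases.
from nemo_curator import ExactDuplicates, Modify
from nemo_curator.datasets import DocumentDataset
from nemo_curator.modifiers import UnicodeReformatter

# Load raw documents (the path and field names here are illustrative)
dataset = DocumentDataset.read_json("raw_vi/*.jsonl")

# Step 1: Unicode reformatting to normalize encoding issues
dataset = Modify(UnicodeReformatter())(dataset)

# Step 2: Exact deduplication; the result identifies duplicate documents,
# which would then be dropped from the dataset before quality filtering
duplicates = ExactDuplicates(id_field="id", text_field="text")(dataset)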
Content diversity
Figure: Domain proportion in curated dataset (https://cdn-uploads.huggingface.co/production/uploads/661766c00c68b375f3f0ccc3/mW6Pct3uyP_XDdGmE8EP3.png)
Character-based metrics
Figure: Box plots of the percentage of symbols, numbers, and whitespace characters relative to total characters, plus word counts and average word lengths (https://cdn-uploads.huggingface.co/production/uploads/661766c00c68b375f3f0ccc3/W9TQjM2vcC7uXozyERHSQ.png)
Token count distribution
Figure: Distribution of document sizes in terms of token count (https://cdn-uploads.huggingface.co/production/uploads/661766c00c68b375f3f0ccc3/PDelYpBI0DefSmQgFONgE.png)
Embedding visualization
Figure: UMAP visualization of 5% of the dataset (https://cdn-uploads.huggingface.co/production/uploads/661766c00c68b375f3f0ccc3/sfeoZWuQ7DcSpbmUOJ12r.png)
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
xyyyang/test-upload dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
tieba/upload-dataset-test dataset hosted on Hugging Face and contributed by the HF Datasets community
peopleofverso/test-upload-corrected-training-data dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
oserdiuk/testing-file-upload dataset hosted on Hugging Face and contributed by the HF Datasets community
ibrahimndaw/test-upload dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Creation Scripts
Ready-to-run scripts for creating Hugging Face datasets from local files.
Available Scripts
pdf-to-dataset.py
Convert directories of PDF files into Hugging Face datasets. Features:
- Uploads PDFs as dataset objects for flexible processing
- Automatic labeling from folder structure
- Zero configuration: just point at your PDFs
- Direct upload to Hugging Face Hub
Usage:
uv run pdf-to-dataset.py /path/to/pdfs…
See the full description on the dataset page: https://huggingface.co/datasets/uv-scripts/dataset-creation.