CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/
Dataset containing metadata for all publicly uploaded models (10,000+) available on the Hugging Face model hub. Data was collected between 15 and 20 June 2021.
The dataset was generated using the huggingface_hub API provided by the Hugging Face team.
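As a rough sketch, metadata like this can be pulled with the huggingface_hub client; the attribute names below reflect the 2021-era API (newer releases expose model.id instead of model.modelId), so treat them as assumptions and check your library version.
from huggingface_hub import HfApi

api = HfApi()
# Iterate over every publicly listed model on the Hub and print basic metadata;
# modelId is the 2021-era attribute name (newer versions use `id`)
for model in api.list_models():
    print(model.modelId, model.tags)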
This is my first dataset upload on Kaggle. I hope you like it. :)
not-lain/test-parquet-upload-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
jasong03/data-upload dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset contains different variants of the RoBERTa and XLM-RoBERTa models by Meta AI available on Hugging Face's model repository.
Packaging the weights as a Kaggle dataset makes them significantly faster to load, since you can attach the dataset directly to your notebook rather than downloading the data every time. See the speed comparison notebook. Another benefit of loading models as a dataset is that they can be used in competitions that require internet access to be "off".
For more information on usage, visit the RoBERTa Hugging Face docs and the XLM-RoBERTa Hugging Face docs.
Usage
To use this dataset, attach it to your notebook and specify the path to the dataset. For example:
from transformers import AutoTokenizer, AutoModelForPreTraining

# Path to the attached Kaggle dataset
MODEL_DIR = "/kaggle/input/huggingface-roberta/"

# Load the tokenizer and weights from the local copy instead of the internet
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR + "roberta-base")
model = AutoModelForPreTraining.from_pretrained(MODEL_DIR + "roberta-base")
Acknowledgements
All the copyrights and IP relating to RoBERTa and XLM-RoBERTa belong to the original authors (Liu et al. and Conneau et al.) and Meta AI. All copyrights relating to the transformers library belong to Hugging Face. Please reach out directly to the authors if you have questions regarding licenses and usage.
This dataset contains different variants of the SqueezeBERT model available on Hugging Face's model repository.
Packaging the weights as a Kaggle dataset makes them significantly faster to load, since you can attach the dataset directly to your notebook rather than downloading the data every time. See the speed comparison notebook. Another benefit of loading models as a dataset is that they can be used in competitions that require internet access to be "off".
For more information on usage, visit the SqueezeBERT Hugging Face docs.
Usage
To use this dataset, attach it to your notebook and specify the path to the dataset. For example:
from transformers import AutoTokenizer, AutoModelForPreTraining

MODEL_DIR = "/kaggle/input/huggingface-squeezebert/"

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR + "squeezebert-mnli-headless")
model = AutoModelForPreTraining.from_pretrained(MODEL_DIR + "squeezebert-mnli-headless")
Acknowledgements
All the copyrights and IP relating to SqueezeBERT belong to the original authors (Iandola et al.). All copyrights relating to the transformers library belong to Hugging Face. Please reach out directly to the authors if you have questions regarding licenses and usage.
This dataset was created by Vissarion Moutafis
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Natural Reasoning is a large-scale dataset designed for general reasoning tasks. It consists of high-quality, challenging reasoning questions backtranslated from the pretraining corpora DCLM and FineMath. The dataset has been carefully deduplicated and decontaminated against popular reasoning benchmarks including MATH, GPQA, MMLU-Pro, and MMLU-STEM.
A subset of 1.1 million questions from the Natural Reasoning dataset is released to the research community to foster the development of strong large language model (LLM) reasoners.
File Format: natural_reasoning.parquet
License: CC-BY-NC-4.0 · Tasks: Text Generation, Reasoning · Language: English (en) · Size: 1M < n < 10M · Source: Hugging Face
You can load the dataset directly from Hugging Face as follows:
from datasets import load_dataset
ds = load_dataset("facebook/natural_reasoning")
The dataset was constructed from the pretraining corpora DCLM and FineMath. The questions have been filtered to remove contamination and duplication from widely-used reasoning benchmarks like MATH, GPQA, MMLU-Pro, and MMLU-STEM. For each question, the dataset provides a reference final answer extracted from the original document when available, and also includes a model-generated response from Llama3.3-70B-Instruct.
In the 1.1 million subset:
- 18.29% of the questions do not have a reference answer.
- 9.71% of the questions have a single-word answer.
- 21.58% of the questions have a short answer.
- 50.42% of the questions have a long-form reference answer.
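As a rough illustration, a breakdown like this could be recomputed from the released split; the column name reference_answer below is an assumption about the dataset's schema, so verify it against the dataset card first.
from datasets import load_dataset

ds = load_dataset("facebook/natural_reasoning", split="train")

# Fraction of questions whose (assumed) reference_answer field is empty
missing = sum(1 for example in ds if not example["reference_answer"])
print(f"Questions without a reference answer: {missing / len(ds):.2%}")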
Training on the Natural Reasoning dataset shows superior scaling effects compared to other datasets. When used to train the Llama3.1-8B-Instruct model, it achieved better average performance across three key benchmarks: MATH, GPQA, and MMLU-Pro.
Figure: Scaling curve (https://cdn-uploads.huggingface.co/production/uploads/659a395421a7431643caedda/S6aO-agjRRhc0JLkohZ5z.jpeg)
If you use the Natural Reasoning dataset, please cite it with the following BibTeX entry:
@misc{yuan2025naturalreasoningreasoningwild28m,
title={NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions},
author={Weizhe Yuan and Jane Yu and Song Jiang and Karthik Padthe and Yang Li and Dong Wang and Ilia Kulikov and Kyunghyun Cho and Yuandong Tian and Jason E Weston and Xian Li},
year={2025},
eprint={2502.13124},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.13124}
}
Source: Hugging Face
This dataset contains different variants of the MobileBERT model by Google available on Hugging Face's model repository.
Packaging the weights as a Kaggle dataset makes them significantly faster to load, since you can attach the dataset directly to your notebook rather than downloading the data every time. See the speed comparison notebook. Another benefit of loading models as a dataset is that they can be used in competitions that require internet access to be "off".
For more information on usage, visit the MobileBERT Hugging Face docs.
Usage
To use this dataset, attach it to your notebook and specify the path to the dataset. For example:
from transformers import AutoTokenizer, AutoModelForPreTraining

MODEL_DIR = "/kaggle/input/huggingface-google-mobilebert/"

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForPreTraining.from_pretrained(MODEL_DIR)
Acknowledgements
All the copyrights and IP relating to MobileBERT belong to the original authors (Sun et al.) and Google. All copyrights relating to the transformers library belong to Hugging Face. Please reach out directly to the authors if you have questions regarding licenses and usage.
nshehadeh/test-upload-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset contains different variants of the ALBERTv2 model by Google available on Hugging Face's model repository.
Packaging the weights as a Kaggle dataset makes them significantly faster to load, since you can attach the dataset directly to your notebook rather than downloading the data every time. See the speed comparison notebook. Another benefit of loading models as a dataset is that they can be used in competitions that require internet access to be "off".
For more information on usage, visit the ALBERT Hugging Face docs.
Usage
To use this dataset, attach it to your notebook and specify the path to the dataset. For example:
from transformers import AutoTokenizer, AutoModelForPreTraining

MODEL_DIR = "/kaggle/input/huggingface-albert-v2/"

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR + "albert-base-v2")
model = AutoModelForPreTraining.from_pretrained(MODEL_DIR + "albert-base-v2")
Acknowledgements
All the copyrights and IP relating to ALBERT belong to the original authors (Lan et al.) and Google. All copyrights relating to the transformers library belong to Hugging Face. Please reach out directly to the authors if you have questions regarding licenses and usage.
This dataset contains many popular BERT weights retrieved directly from Hugging Face's model repository and hosted on Kaggle. It is automatically updated every month to ensure that the latest version is available to the user. Packaging the weights as a dataset makes them significantly faster to load, since you can attach the Kaggle dataset directly to your notebook rather than downloading the data every time. See the speed comparison notebook.
The banner was adapted from figures by Jimmy Lin (tweet; slide) released under CC BY 4.0. BERT has an Apache 2.0 license according to the model repository.
To use this dataset, simply attach it to your notebook and specify the path to the dataset. For example:
from transformers import AutoTokenizer, AutoModelForMaskedLM
MODEL_DIR = "/kaggle/input/huggingface-bert/"
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR + "bert-large-uncased")
model = AutoModelForMaskedLM.from_pretrained(MODEL_DIR + "bert-large-uncased")
All the copyrights and IP relating to BERT belong to the original authors (Devlin et al. 2019) and Google. All copyrights relating to the transformers library belong to Hugging Face. The banner image was created thanks to Jimmy Lin, so any modification of this figure should mention the original author and respect the conditions of the license; all copyrights related to the images belong to him.
Some of the models are community created or trained. Please reach out directly to the authors if you have questions regarding licenses and usage.
This dataset contains different variants of the Longformer model by AllenAI available on Hugging Face's model repository.
Packaging the weights as a Kaggle dataset makes them significantly faster to load, since you can attach the dataset directly to your notebook rather than downloading the data every time. See the speed comparison notebook. Another benefit of loading models as a dataset is that they can be used in competitions that require internet access to be "off".
For more information on usage, visit the Longformer Hugging Face docs.
Usage
To use this dataset, attach it to your notebook and specify the path to the dataset. For example:
from transformers import AutoTokenizer, AutoModelForMultipleChoice

MODEL_DIR = "/kaggle/input/huggingface-allenai-longformer/"

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR + "longformer-base-4096")
model = AutoModelForMultipleChoice.from_pretrained(MODEL_DIR + "longformer-base-4096")
Acknowledgements
All the copyrights and IP relating to Longformer belong to the original authors of the respective models (Beltagy et al. and Cattan et al.) and the Allen Institute for AI. All copyrights relating to the transformers library belong to Hugging Face. Please reach out directly to the authors if you have questions regarding licenses and usage.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Vietnamese Curated Text Dataset. This dataset is collected from multiple open Vietnamese datasets and curated with NeMo Curator.
Please see our tech blog post on NVIDIA's blog for details.
We utilize a combination of datasets that contain samples in the Vietnamese language, ensuring a robust and representative text corpus. These datasets include:
- The Vietnamese subset of the C4 dataset.
- The Vietnamese subset of the OSCAR dataset, version 23.01.
- Wikipedia's Vietnamese articles.
- Binhvq's Vietnamese news corpus.
We use NeMo Curator to curate the collected data. The data curation pipeline includes these key steps (a sketch of the first two steps follows this list):
1. Unicode Reformatting: texts are standardized into a consistent Unicode format to avoid encoding issues.
2. Exact Deduplication: removes exact duplicates to reduce redundancy.
3. Quality Filtering:
- Heuristic Filtering: applies rule-based filters to remove low-quality content.
- Classifier-Based Filtering: uses machine learning to classify and filter documents based on quality.
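Below is a minimal sketch of what the first two stages could look like with NeMo Curator; the class names follow the NeMo Curator API, but the file path and field names are illustrative, and exact signatures may differ between releases.
from nemo_curator import ExactDuplicates, Modify
from nemo_curator.datasets import DocumentDataset
from nemo_curator.modifiers import UnicodeReformatter

# Load raw documents (the path and field names here are illustrative)
dataset = DocumentDataset.read_json("raw_vi/*.jsonl")

# Step 1: Unicode reformatting to normalize encoding issues
dataset = Modify(UnicodeReformatter())(dataset)

# Step 2: Exact deduplication; the result identifies duplicate documents,
# which would then be dropped from the dataset before quality filtering
duplicates = ExactDuplicates(id_field="id", text_field="text")(dataset)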
Content diversity
Figure: Domain proportion in curated dataset (https://cdn-uploads.huggingface.co/production/uploads/661766c00c68b375f3f0ccc3/mW6Pct3uyP_XDdGmE8EP3.png)
Character-based metrics
Figure: Box plots of the percentage of symbols, numbers, and whitespace characters relative to total characters, plus word counts and average word lengths (https://cdn-uploads.huggingface.co/production/uploads/661766c00c68b375f3f0ccc3/W9TQjM2vcC7uXozyERHSQ.png)
Token count distribution
Figure: Distribution of document sizes in terms of token count (https://cdn-uploads.huggingface.co/production/uploads/661766c00c68b375f3f0ccc3/PDelYpBI0DefSmQgFONgE.png)
Embedding visualization
Figure: UMAP visualization of 5% of the dataset (https://cdn-uploads.huggingface.co/production/uploads/661766c00c68b375f3f0ccc3/sfeoZWuQ7DcSpbmUOJ12r.png)
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
xyyyang/test-upload dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
tieba/upload-dataset-test dataset hosted on Hugging Face and contributed by the HF Datasets community
peopleofverso/test-upload-corrected-training-data dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
oserdiuk/testing-file-upload dataset hosted on Hugging Face and contributed by the HF Datasets community
ibrahimndaw/test-upload dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Creation Scripts
Ready-to-run scripts for creating Hugging Face datasets from local files.
Available Scripts
pdf-to-dataset.py
Convert directories of PDF files into Hugging Face datasets. Features:
- Uploads PDFs as dataset objects for flexible processing
- Automatic labeling from folder structure
- Zero configuration: just point at your PDFs
- Direct upload to Hugging Face Hub
Usage:
uv run pdf-to-dataset.py /path/to/pdfs…
See the full description on the dataset page: https://huggingface.co/datasets/uv-scripts/dataset-creation.