100+ datasets found
  1. Huggingface Modelhub

    • kaggle.com
    zip
    Updated Jun 19, 2021
    Cite
    Kartik Godawat (2021). Huggingface Modelhub [Dataset]. https://www.kaggle.com/crazydiv/huggingface-modelhub
    Explore at:
    Available download formats: zip (2,274,876 bytes)
    Dataset updated
    Jun 19, 2021
    Authors
    Kartik Godawat
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description


    Dataset containing metadata for all publicly uploaded models (10,000+) available on the HuggingFace model hub. Data was collected between 15 and 20 June 2021.

    The dataset was generated using the huggingface_hub APIs provided by the HuggingFace team.

    Update v3:

    • Added Downloads last month metric
    • Added library name

    Contents:

    • huggingface_models.csv: Primary file containing metadata such as model name, tags, last-modified time, and filenames.
    • huggingface_modelcard_readme.csv: Detailed file containing the README.md contents, where available, for each model. Content is in markdown format. The modelId column joins the two files together (a join sketch appears below).

    huggingface_models.csv

    • modelId: ID of the model as present on the HF website
    • lastModified: Time when the model was last modified
    • tags: Tags associated with the model (provided by the maintainer)
    • pipeline_tag: If present, denotes which pipeline the model can be used with
    • files: List of available files in the model repo
    • publishedBy: Custom column derived from modelId, specifying who published the model
    • downloads_last_month: Number of times the model was downloaded in the last month
    • library: Name of the library the model belongs to, e.g. transformers, spacy, timm

    huggingface_modelcard_readme.csv

    • modelId: ID of the model as available on the HF website
    • modelCard: README contents of the model (referred to as the model card in the HuggingFace ecosystem). It contains useful information on how the model was trained, benchmarks, and author notes.

    Inspiration

    The idea of analyzing publicly available models on HuggingFace struck me while I was attending a live session of the amazing transformers course by @LysandreJik. Soon after, I tweeted the team and asked for permission to create such a dataset. Special shoutout to @osanseviero for encouraging and pointing me in the right direction.
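
    A minimal sketch of loading the two files and joining them on modelId, as noted in the Contents list above (assuming pandas; the Kaggle input path prefix is an assumption based on the dataset slug):

    import pandas as pd

    DATA_DIR = "/kaggle/input/huggingface-modelhub/"

    models = pd.read_csv(DATA_DIR + "huggingface_models.csv")
    readmes = pd.read_csv(DATA_DIR + "huggingface_modelcard_readme.csv")

    # Left join keeps models that have no model card; modelId links the two files.
    merged = models.merge(readmes, on="modelId", how="left")
    print(merged[["modelId", "downloads_last_month", "library"]].head())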

    This is my first dataset upload on Kaggle. I hope you like it. :)

  2. test-parquet-upload-dataset

    • huggingface.co
    Updated Feb 9, 2025
    Cite
    Not Lain (2025). test-parquet-upload-dataset [Dataset]. https://huggingface.co/datasets/not-lain/test-parquet-upload-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 9, 2025
    Authors
    Not Lain
    Description

    not-lain/test-parquet-upload-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. data-upload

    • huggingface.co
    Updated May 28, 2025
    + more versions
    Cite
    Quan Nguyen (2025). data-upload [Dataset]. https://huggingface.co/datasets/jasong03/data-upload
    Explore at:
    Dataset updated
    May 28, 2025
    Authors
    Quan Nguyen
    Description

    jasong03/data-upload dataset hosted on Hugging Face and contributed by the HF Datasets community

  4. Huggingface RoBERTa

    • kaggle.com
    zip
    Updated Aug 4, 2023
    Cite
    Darius Singh (2023). Huggingface RoBERTa [Dataset]. https://www.kaggle.com/datasets/dariussingh/huggingface-roberta
    Explore at:
    Available download formats: zip (34,531,447,596 bytes)
    Dataset updated
    Aug 4, 2023
    Authors
    Darius Singh
    Description

    This dataset contains different variants of the RoBERTa and XLM-RoBERTa models by Meta AI, available on Hugging Face's model repository.

    Because the weights are packaged as a Kaggle dataset, loading them is significantly faster: you can attach the dataset directly to your notebook rather than downloading the data every time. See the speed comparison notebook. Another benefit of loading models from a dataset is that they can be used in competitions that require internet access to be "off".

    For more information on usage visit the roberta hugging face docs and the xlm-roberta hugging face docs.

    Usage

    To use this dataset, attach it to your notebook and specify the path to the dataset. For example:

    from transformers import AutoTokenizer, AutoModelForPreTraining

    MODEL_DIR = "/kaggle/input/huggingface-roberta/"

    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR + "roberta-base")
    model = AutoModelForPreTraining.from_pretrained(MODEL_DIR + "roberta-base")
    

    Acknowledgements

    All the copyrights and IP relating to RoBERTa and XLM-RoBERTa belong to the original authors (Liu et al. and Conneau et al.) and Meta AI. All copyrights relating to the transformers library belong to Hugging Face. Please reach out directly to the authors if you have questions regarding licenses and usage.

  5. Huggingface SqueezeBERT

    • kaggle.com
    zip
    Updated Aug 4, 2023
    Cite
    Darius Singh (2023). Huggingface SqueezeBERT [Dataset]. https://www.kaggle.com/datasets/dariussingh/huggingface-squeezebert
    Explore at:
    Available download formats: zip (930,441,465 bytes)
    Dataset updated
    Aug 4, 2023
    Authors
    Darius Singh
    Description

    This dataset contains different variants of the SqueezeBERT model available on Hugging Face's model repository.

    Because the weights are packaged as a Kaggle dataset, loading them is significantly faster: you can attach the dataset directly to your notebook rather than downloading the data every time. See the speed comparison notebook. Another benefit of loading models from a dataset is that they can be used in competitions that require internet access to be "off".

    For more information on usage visit the squeezebert hugging face docs.

    Usage

    To use this dataset, attach it to your notebook and specify the path to the dataset. For example:

    from transformers import AutoTokenizer, AutoModelForPreTraining

    MODEL_DIR = "/kaggle/input/huggingface-squeezebert/"

    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR + "squeezebert-mnli-headless")
    model = AutoModelForPreTraining.from_pretrained(MODEL_DIR + "squeezebert-mnli-headless")
    

    Acknowledgements

    All the copyrights and IP relating to SqueezeBERT belong to the original authors (Krishna et al.). All copyrights relating to the transformers library belong to Hugging Face. Please reach out directly to the authors if you have questions regarding licenses and usage.

  6. squad-like-loader

    • kaggle.com
    zip
    Updated Feb 28, 2022
    Cite
    Vissarion Moutafis (2022). squad-like-loader [Dataset]. https://www.kaggle.com/datasets/vissarionmoutafis/squadlikeloader
    Explore at:
    Available download formats: zip (1,619 bytes)
    Dataset updated
    Feb 28, 2022
    Authors
    Vissarion Moutafis
    Description

    Dataset

    This dataset was created by Vissarion Moutafis

    Contents

  7. facebook/natural_reasoning

    • kaggle.com
    zip
    Updated Feb 27, 2025
    Cite
    Zehra Korkusuz (2025). facebook/natural_reasoning [Dataset]. https://www.kaggle.com/datasets/zehrakorkusuz/natural-reasoning
    Explore at:
    Available download formats: zip (1,694,591,016 bytes)
    Dataset updated
    Feb 27, 2025
    Authors
    Zehra Korkusuz
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Natural Reasoning Dataset

    Source: Huggingface

    Dataset Overview

    Natural Reasoning is a large-scale dataset designed for general reasoning tasks. It consists of high-quality, challenging reasoning questions backtranslated from pretraining corpora DCLM and FineMath. The dataset has been carefully deduplicated and decontaminated from popular reasoning benchmarks including MATH, GPQA, MMLU-Pro, and MMLU-STEM.

    A 1.1 million subset of the Natural Reasoning dataset is released to the research community to foster the development of strong large language model (LLM) reasoners.

    Dataset Information

    File Format: natural_reasoning.parquet


    How to Use

    You can load the dataset directly from Hugging Face as follows:

    from datasets import load_dataset
    
    ds = load_dataset("facebook/natural_reasoning")
    

    Data Collection and Quality

    The dataset was constructed from the pretraining corpora DCLM and FineMath. The questions have been filtered to remove contamination and duplication from widely-used reasoning benchmarks like MATH, GPQA, MMLU-Pro, and MMLU-STEM. For each question, the dataset provides a reference final answer extracted from the original document when available, and also includes a model-generated response from Llama3.3-70B-Instruct.

    Reference Answer Statistics

    In the 1.1 million subset:

    • 18.29% of the questions do not have a reference answer.
    • 9.71% of the questions have a single-word answer.
    • 21.58% of the questions have a short answer.
    • 50.42% of the questions have a long-form reference answer.
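
    As a rough illustration, the breakdown above could be recomputed along these lines; the reference_answer column name and the word-count cutoffs are assumptions made for this sketch, not taken from the dataset card:

    from collections import Counter

    from datasets import load_dataset

    ds = load_dataset("facebook/natural_reasoning", split="train")

    def bucket(example):
        # "reference_answer" is an assumed column name for this illustration.
        answer = (example.get("reference_answer") or "").strip()
        if not answer:
            return "no reference answer"
        n_words = len(answer.split())
        if n_words == 1:
            return "single-word"
        return "short" if n_words <= 10 else "long-form"  # 10-word cutoff is arbitrary

    counts = Counter(bucket(example) for example in ds)
    total = sum(counts.values())
    for name, count in counts.most_common():
        print(f"{name}: {100 * count / total:.2f}%")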

    Scaling Curve Performance

    Training on the Natural Reasoning dataset shows superior scaling effects compared to other datasets. When training the Llama3.1-8B-Instruct model, the dataset achieved better performance on average across three key benchmarks: MATH, GPQA, and MMLU-Pro.

    Scaling curve figure: https://cdn-uploads.huggingface.co/production/uploads/659a395421a7431643caedda/S6aO-agjRRhc0JLkohZ5z.jpeg

    Citation

    If you use the Natural Reasoning dataset, please cite it with the following BibTeX entry:

    @misc{yuan2025naturalreasoningreasoningwild28m,
       title={NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions},
       author={Weizhe Yuan and Jane Yu and Song Jiang and Karthik Padthe and Yang Li and Dong Wang and Ilia Kulikov and Kyunghyun Cho and Yuandong Tian and Jason E Weston and Xian Li},
       year={2025},
       eprint={2502.13124},
       archivePrefix={arXiv},
       primaryClass={cs.CL},
       url={https://arxiv.org/abs/2502.13124}
    }
    

    Source: Hugging Face

  8. Huggingface Google MobileBERT

    • kaggle.com
    zip
    Updated Jul 26, 2023
    Cite
    Darius Singh (2023). Huggingface Google MobileBERT [Dataset]. https://www.kaggle.com/datasets/dariussingh/huggingface-google-mobilebert
    Explore at:
    Available download formats: zip (875,319,161 bytes)
    Dataset updated
    Jul 26, 2023
    Authors
    Darius Singh
    Description

    This dataset contains different variants of the MobileBERT model by Google available on Hugging Face's model repository.

    Because the weights are packaged as a Kaggle dataset, loading them is significantly faster: you can attach the dataset directly to your notebook rather than downloading the data every time. See the speed comparison notebook. Another benefit of loading models from a dataset is that they can be used in competitions that require internet access to be "off".

    For more information on usage visit the mobilebert hugging face docs.

    Usage

    To use this dataset, attach it to your notebook and specify the path to the dataset. For example:

    from transformers import AutoTokenizer, AutoModelForPreTraining

    MODEL_DIR = "/kaggle/input/huggingface-google-mobilebert/"

    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
    model = AutoModelForPreTraining.from_pretrained(MODEL_DIR)
    

    Acknowledgements

    All the copyrights and IP relating to MobileBERT belong to the original authors (Sun et al.) and Google. All copyrights relating to the transformers library belong to Hugging Face. Please reach out directly to the authors if you have questions regarding licenses and usage.

  9. test-audio-upload

    • huggingface.co
    + more versions
    Cite
    gupta, test-audio-upload [Dataset]. https://huggingface.co/datasets/ananyahume/test-audio-upload
    Explore at:
    Authors
    gupta
    Description

    ananyahume/test-audio-upload dataset hosted on Hugging Face and contributed by the HF Datasets community

  10. test-upload-dataset

    • huggingface.co
    Updated Nov 5, 2025
    + more versions
    Cite
    Nishan Shehadeh (2025). test-upload-dataset [Dataset]. https://huggingface.co/datasets/nshehadeh/test-upload-dataset
    Explore at:
    Dataset updated
    Nov 5, 2025
    Authors
    Nishan Shehadeh
    Description

    nshehadeh/test-upload-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  11. Huggingface ALBERT v2

    • kaggle.com
    zip
    Updated Aug 4, 2023
    Cite
    Darius Singh (2023). Huggingface ALBERT v2 [Dataset]. https://www.kaggle.com/datasets/dariussingh/huggingface-albert-v2
    Explore at:
    Available download formats: zip (8,046,027,655 bytes)
    Dataset updated
    Aug 4, 2023
    Authors
    Darius Singh
    Description

    This dataset contains different variants of the ALBERTv2 model by Google available on Hugging Face's model repository.

    Because the weights are packaged as a Kaggle dataset, loading them is significantly faster: you can attach the dataset directly to your notebook rather than downloading the data every time. See the speed comparison notebook. Another benefit of loading models from a dataset is that they can be used in competitions that require internet access to be "off".

    For more information on usage visit the albert hugging face docs.

    Usage

    To use this dataset, attach it to your notebook and specify the path to the dataset. For example:

    from transformers import AutoTokenizer, AutoModelForPreTraining

    MODEL_DIR = "/kaggle/input/huggingface-albert-v2/"

    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR + "albert-base-v2")
    model = AutoModelForPreTraining.from_pretrained(MODEL_DIR + "albert-base-v2")
    

    Acknowledgements

    All the copyrights and IP relating to ALBERT belong to the original authors (Lan et al.) and Google. All copyrights relating to the transformers library belong to Hugging Face. Please reach out directly to the authors if you have questions regarding licenses and usage.

  12. Huggingface BERT

    • kaggle.com
    zip
    Updated Jun 21, 2025
    Cite
    xhlulu (2025). Huggingface BERT [Dataset]. https://www.kaggle.com/xhlulu/huggingface-bert
    Explore at:
    Available download formats: zip (25,978,385,354 bytes)
    Dataset updated
    Jun 21, 2025
    Authors
    xhlulu
    Description

    This dataset contains many popular BERT weights retrieved directly from Hugging Face's model repository and hosted on Kaggle. It will be automatically updated every month to ensure that the latest version is available to the user. Because the weights are packaged as a Kaggle dataset, loading them is significantly faster: you can attach the dataset directly to your notebook rather than downloading the data every time. See the speed comparison notebook.

    The banner was adapted from figures by Jimmy Lin (tweet; slide) released under CC BY 4.0. BERT has an Apache 2.0 license according to the model repository.

    Quick Start

    To use this dataset, simply attach it to your notebook and specify the path to the dataset. For example:

    from transformers import AutoTokenizer, AutoModelForMaskedLM
    
    MODEL_DIR = "/kaggle/input/huggingface-bert/"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR + "bert-large-uncased")
    model = AutoModelForMaskedLM.from_pretrained(MODEL_DIR + "bert-large-uncased")
    

    Acknowledgements

    All the copyrights and IP relating to BERT belong to the original authors (Devlin et al., 2019) and Google. All copyrights relating to the transformers library belong to Hugging Face. The banner image was created thanks to Jimmy Lin, so any modification of this figure should mention the original author and respect the conditions of the license; all copyrights related to the images belong to him.

    Some of the models are community created or trained. Please reach out directly to the authors if you have questions regarding licenses and usage.

  13. Huggingface AllenAI longformer

    • kaggle.com
    zip
    Updated Jul 26, 2023
    Cite
    Darius Singh (2023). Huggingface AllenAI longformer [Dataset]. https://www.kaggle.com/datasets/dariussingh/huggingface-allenai-longformer
    Explore at:
    Available download formats: zip (20,672,838,048 bytes)
    Dataset updated
    Jul 26, 2023
    Authors
    Darius Singh
    Description

    This dataset contains different variants of the Longformer model by AllenAI available on Hugging Face's model repository.

    Because the weights are packaged as a Kaggle dataset, loading them is significantly faster: you can attach the dataset directly to your notebook rather than downloading the data every time. See the speed comparison notebook. Another benefit of loading models from a dataset is that they can be used in competitions that require internet access to be "off".

    For more information on usage visit the longformer hugging face docs.

    Usage

    To use this dataset, attach it to your notebook and specify the path to the dataset. For example:

    from transformers import AutoTokenizer
    from transformers import AutoModelForMultipleChoice

    MODEL_DIR = "/kaggle/input/huggingface-allenai-longformer/"

    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR + "longformer-base-4096")
    model = AutoModelForMultipleChoice.from_pretrained(MODEL_DIR + "longformer-base-4096")
    

    Acknowledgements

    All the copyrights and IP relating to Longformer belong to the original authors of the respective models (Beltagy et al. and Cattan et al.) and the Allen Institute for AI. All copyrights relating to the transformers library belong to Hugging Face. Please reach out directly to the authors if you have questions regarding licenses and usage.

  14. Vietnamese Curated Dataset

    • kaggle.com
    zip
    Updated Jan 26, 2025
    + more versions
    Cite
    Daniel Henry (2025). Vietnamese Curated Dataset [Dataset]. https://www.kaggle.com/datasets/ndy001/vietnamese-curated-dataset-2
    Explore at:
    Available download formats: zip (31,037,919,590 bytes)
    Dataset updated
    Jan 26, 2025
    Authors
    Daniel Henry
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Description

    Vietnamese Curated Text Dataset. This dataset was collected from multiple open Vietnamese datasets and curated with NeMo Curator.

    • Developed by: Viettel Solutions
    • Language: Vietnamese

    Details

    Please visit our Tech Blog post on NVIDIA's blog page for details. Link

    Data Collection

    We utilize a combination of datasets that contain samples in the Vietnamese language, ensuring a robust and representative text corpus. These datasets include:

    • The Vietnamese subset of the C4 dataset.
    • The Vietnamese subset of the OSCAR dataset, version 23.01.
    • Wikipedia's Vietnamese articles.
    • Binhvq's Vietnamese news corpus.

    Preprocessing

    We use NeMo Curator to curate the collected data. The data curation pipeline includes these key steps (a generic sketch follows below):

    1. Unicode Reformatting: Texts are standardized into a consistent Unicode format to avoid encoding issues.
    2. Exact Deduplication: Removes exact duplicates to reduce redundancy.
    3. Quality Filtering:
       • Heuristic Filtering: Applies rule-based filters to remove low-quality content.
       • Classifier-Based Filtering: Uses machine learning to classify and filter documents based on quality.
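
    For illustration only, here is a generic Python sketch of those steps (Unicode reformatting, exact deduplication, and a simple heuristic filter). It is not the NeMo Curator pipeline itself, and the thresholds are arbitrary:

    import hashlib
    import unicodedata

    def curate(docs):
        """Apply the curation steps described above to an iterable of strings."""
        seen_hashes = set()
        curated = []
        for text in docs:
            # 1. Unicode reformatting: normalize to a consistent form (NFC here).
            text = unicodedata.normalize("NFC", text)
            # 2. Exact deduplication via a content hash.
            digest = hashlib.md5(text.encode("utf-8")).hexdigest()
            if digest in seen_hashes:
                continue
            seen_hashes.add(digest)
            # 3. Heuristic filtering: drop very short or symbol-heavy documents.
            if len(text.split()) < 50:
                continue
            symbol_ratio = sum(not (c.isalnum() or c.isspace()) for c in text) / max(len(text), 1)
            if symbol_ratio > 0.2:
                continue
            curated.append(text)
        return curated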

    Notebook

    Dataset Statistics

    Content diversity: domain proportion in the curated dataset (https://cdn-uploads.huggingface.co/production/uploads/661766c00c68b375f3f0ccc3/mW6Pct3uyP_XDdGmE8EP3.png)

    Character-based metrics: box plots of the percentage of symbols, numbers, and whitespace characters relative to total characters, plus word counts and average word lengths (https://cdn-uploads.huggingface.co/production/uploads/661766c00c68b375f3f0ccc3/W9TQjM2vcC7uXozyERHSQ.png)

    Token count distribution: distribution of document sizes in terms of token count (https://cdn-uploads.huggingface.co/production/uploads/661766c00c68b375f3f0ccc3/PDelYpBI0DefSmQgFONgE.png)

    Embedding visualization: UMAP visualization of 5% of the dataset (https://cdn-uploads.huggingface.co/production/uploads/661766c00c68b375f3f0ccc3/sfeoZWuQ7DcSpbmUOJ12r.png)

  15. test-upload

    • huggingface.co
    Updated Jan 2, 2021
    Cite
    Qu Yang (2021). test-upload [Dataset]. https://huggingface.co/datasets/xyyyang/test-upload
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 2, 2021
    Authors
    Qu Yang
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    xyyyang/test-upload dataset hosted on Hugging Face and contributed by the HF Datasets community

  16. upload-dataset-test

    • huggingface.co
    Updated Jul 3, 2024
    Cite
    song (2024). upload-dataset-test [Dataset]. https://huggingface.co/datasets/tieba/upload-dataset-test
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 3, 2024
    Authors
    song
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    tieba/upload-dataset-test dataset hosted on Hugging Face and contributed by the HF Datasets community

  17. test-upload-corrected-training-data

    • huggingface.co
    Updated Jun 8, 2025
    + more versions
    Cite
    Arnaud-Meyer (2025). test-upload-corrected-training-data [Dataset]. https://huggingface.co/datasets/peopleofverso/test-upload-corrected-training-data
    Explore at:
    Dataset updated
    Jun 8, 2025
    Authors
    Arnaud-Meyer
    Description

    peopleofverso/test-upload-corrected-training-data dataset hosted on Hugging Face and contributed by the HF Datasets community

  18. testing-file-upload

    • huggingface.co
    Updated Dec 18, 2024
    Cite
    Oleksandr Serdiuk (2024). testing-file-upload [Dataset]. https://huggingface.co/datasets/oserdiuk/testing-file-upload
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 18, 2024
    Authors
    Oleksandr Serdiuk
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    oserdiuk/testing-file-upload dataset hosted on Hugging Face and contributed by the HF Datasets community

  19. test-upload

    • huggingface.co
    Updated Sep 29, 2025
    + more versions
    Cite
    Ibrahim NDAW (2025). test-upload [Dataset]. https://huggingface.co/datasets/ibrahimndaw/test-upload
    Explore at:
    Dataset updated
    Sep 29, 2025
    Authors
    Ibrahim NDAW
    Description

    ibrahimndaw/test-upload dataset hosted on Hugging Face and contributed by the HF Datasets community

  20. Data from: dataset-creation

    • huggingface.co
    Updated Jul 23, 2025
    Cite
    uv scripts for HF Jobs (2025). dataset-creation [Dataset]. https://huggingface.co/datasets/uv-scripts/dataset-creation
    Explore at:
    Dataset updated
    Jul 23, 2025
    Dataset authored and provided by
    uv scripts for HF Jobs
    Description

    Dataset Creation Scripts

    Ready-to-run scripts for creating Hugging Face datasets from local files.

    Available Scripts

    📄 pdf-to-dataset.py

    Convert directories of PDF files into Hugging Face datasets. Features:

    πŸ“ Uploads PDFs as dataset objects for flexible processing 🏷️ Automatic labeling from folder structure πŸš€ Zero configuration - just point at your PDFs πŸ“€ Direct upload to Hugging Face Hub

    Usage:

    Basic usage:

    uv run pdf-to-dataset.py /path/to/pdfs…

    See the full description on the dataset page: https://huggingface.co/datasets/uv-scripts/dataset-creation.
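
    For orientation, here is a generic sketch of the same idea using the datasets library (this is not the uv script itself; the repo id and paths are placeholders):

    from pathlib import Path

    from datasets import Dataset

    def pdfs_to_dataset(root: str) -> Dataset:
        """Collect PDFs under `root` into a dataset, labeling each file by its parent folder."""
        records = {"pdf_bytes": [], "filename": [], "label": []}
        for pdf in sorted(Path(root).rglob("*.pdf")):
            records["pdf_bytes"].append(pdf.read_bytes())
            records["filename"].append(pdf.name)
            records["label"].append(pdf.parent.name)  # label derived from the folder structure
        return Dataset.from_dict(records)

    ds = pdfs_to_dataset("/path/to/pdfs")
    # ds.push_to_hub("your-username/your-pdf-dataset")  # placeholder repo id; requires authentication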
