100+ datasets found
  1. Huggingface Modelhub

    • kaggle.com
    zip
    Updated Jun 19, 2021
    Cite
    Kartik Godawat (2021). Huggingface Modelhub [Dataset]. https://www.kaggle.com/crazydiv/huggingface-modelhub
    Explore at:
zip (2274876 bytes)
    Dataset updated
    Jun 19, 2021
    Authors
    Kartik Godawat
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description


Dataset containing metadata for all publicly uploaded models (10,000+) available on the HuggingFace model hub. Data was collected between 15 and 20 June 2021.

The dataset was generated using the huggingface_hub APIs provided by the Hugging Face team.

    Update v3:

    • Added Downloads last month metric
    • Added library name

    Contents:

    • huggingface_models.csv : Primary file which contains metadata information like model name, tags, last modified and filenames
• huggingface_modelcard_readme.csv : Detailed file containing the README.md contents, where available, for each model. Content is in markdown format. The modelId column joins the two files.

    huggingface_models.csv
    • modelId: ID of the model as present on HF website
    • lastModified: Time when this model was last modified
• tags: Tags associated with the model (provided by the maintainer)
    • pipeline_tag: If exists, denotes which pipeline this model could be used with
    • files: List of available files in the model repo
    • publishedBy: Custom column derived from modelID, specifying who published this model
• downloads_last_month: Number of times the model has been downloaded in the last month.
• library: Name of the library the model belongs to, e.g. transformers, spacy, timm.

    huggingface_modelcard_readme.csv
    • modelId: ID of the model as available on HF website
• modelCard: README contents of a model (referred to as the model card in the HuggingFace ecosystem). It contains useful information on how the model was trained, benchmarks, and author notes.

    Inspiration

    The idea of analyzing publicly available models on HuggingFace struck me while I was attending a live session of the amazing transformers course by @LysandreJik. Soon after, I tweeted the team and asked for permission to create such a dataset. Special shoutout to @osanseviero for encouraging and pointing me in the right direction.
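Since modelId joins the two files, the join can be sketched with pandas. Column names follow the description above; the tiny inline frames stand in for the real CSVs, and the Kaggle input path in the comment is illustrative, not guaranteed.

```python
import pandas as pd

# Stand-ins for the two CSVs; in a Kaggle notebook you would instead read
# e.g. pd.read_csv("/kaggle/input/huggingface-modelhub/huggingface_models.csv")
# (path is illustrative).
models = pd.DataFrame({
    "modelId": ["bert-base-uncased", "gpt2"],
    "pipeline_tag": ["fill-mask", "text-generation"],
    "downloads_last_month": [1000, 2000],
})
readmes = pd.DataFrame({
    "modelId": ["bert-base-uncased"],
    "modelCard": ["# BERT base model (uncased) ..."],
})

# modelId joins the two files; a left join keeps models without a README
merged = models.merge(readmes, on="modelId", how="left")
print(merged.shape)  # (2, 4)
```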

    This is my first dataset upload on Kaggle. I hope you like it. :)

  2. new-food-nextvit-update-dataset-train-file

    • huggingface.co
    Updated Oct 14, 2025
    + more versions
    Cite
    BitwiseMind (2025). new-food-nextvit-update-dataset-train-file [Dataset]. https://huggingface.co/datasets/bitwisemind/new-food-nextvit-update-dataset-train-file
    Explore at:
    Dataset updated
    Oct 14, 2025
    Authors
    BitwiseMind
    License

Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    bitwisemind/new-food-nextvit-update-dataset-train-file dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. Hugging Face Models Dataset

    • kaggle.com
    zip
    Updated Feb 19, 2023
    Cite
    Yasir Raza (2023). Hugging Face Models Dataset [Dataset]. https://www.kaggle.com/datasets/yasirabdaali/hugging-face-models-dataset
    Explore at:
zip (980916 bytes)
    Dataset updated
    Feb 19, 2023
    Authors
    Yasir Raza
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Hugging Face

    Hugging Face, Inc. is an American company that develops tools for building applications using machine learning. It is most notable for its Transformers library built for natural language processing applications and its platform that allows users to share machine learning models and datasets.

This dataset contains the data of 16k models available on huggingface.co, with the following features for each model:

    1. model url
    2. model title
    3. downloads and likes
    4. updated

  4. open_data_boamps

    • huggingface.co
    Updated Jul 15, 2025
    Cite
    BoAmps (2025). open_data_boamps [Dataset]. https://huggingface.co/datasets/boavizta/open_data_boamps
    Explore at:
    Dataset updated
    Jul 15, 2025
    Dataset authored and provided by
    BoAmps
    License

Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Guide: How to share your data on the BoAmps repository

    This guide explains step by step how to share BoAmps format reports on this public Hugging Face repository.

      Prerequisites
    

    Before starting, make sure you have:

• A Hugging Face account
    • The files you want to upload

      Method 1: Hugging Face Web Interface
    

    Log in to Hugging Face

    Go to the boamps dataset

    Navigate to the files: Click on "Files and versions" then on the "data" folder

    Click on "Contribute" then… See the full description on the dataset page: https://huggingface.co/datasets/boavizta/open_data_boamps.

  5. Data from: label-files

    • huggingface.co
    Updated Dec 23, 2021
    Cite
    Hugging Face (2021). label-files [Dataset]. https://huggingface.co/datasets/huggingface/label-files
    Explore at:
    Dataset updated
    Dec 23, 2021
    Dataset authored and provided by
Hugging Face (https://huggingface.co/)
    Description

    This repository contains the mapping from integer id's to actual label names (in HuggingFace Transformers typically called id2label) for several datasets. Current datasets include:

• ImageNet-1k
    • ImageNet-22k (also called ImageNet-21k, as there are 21,843 classes)
    • COCO detection 2017
    • COCO panoptic 2017
    • ADE20k (actually the MIT Scene Parsing benchmark, a subset of ADE20k)
    • Cityscapes
    • VQAv2
    • Kinetics-700
    • RVL-CDIP
    • PASCAL VOC
    • Kinetics-400
    • ...

    You can read in a label file as follows (using… See the full description on the dataset page: https://huggingface.co/datasets/huggingface/label-files.
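The dataset card is truncated above, but based on the description each file is a JSON mapping from integer ids to label names. One detail worth showing: JSON object keys always arrive as strings, so they must be cast back to int to match the id2label convention. The hf_hub_download call and filename in the comment are assumptions about the repo layout; the sketch parses an inline sample instead.

```python
import json

# In practice the file would come from the dataset repo, e.g. via
#   huggingface_hub.hf_hub_download(repo_id="huggingface/label-files",
#       filename="imagenet-1k-id2label.json", repo_type="dataset")
# (filename is illustrative). Here we parse an inline sample instead.
raw = '{"0": "tench", "1": "goldfish", "2": "great white shark"}'

# JSON keys are strings, so cast them back to int, matching the
# id2label convention used by transformers model configs.
id2label = {int(k): v for k, v in json.loads(raw).items()}
label2id = {v: k for k, v in id2label.items()}

print(id2label[0])  # tench
```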

  6. Huggingface Google MobileBERT

    • kaggle.com
    zip
    Updated Jul 26, 2023
    Cite
    Darius Singh (2023). Huggingface Google MobileBERT [Dataset]. https://www.kaggle.com/datasets/dariussingh/huggingface-google-mobilebert
    Explore at:
zip (875319161 bytes)
    Dataset updated
    Jul 26, 2023
    Authors
    Darius Singh
    Description

    This dataset contains different variants of the MobileBERT model by Google available on Hugging Face's model repository.

    By making it a dataset, it is significantly faster to load the weights since you can directly attach a Kaggle dataset to the notebook rather than downloading the data every time. See the speed comparison notebook. Another benefit of loading models as a dataset is that it can be used in competitions that require internet access to be "off".

    For more information on usage visit the mobilebert hugging face docs.

    Usage

    To use this dataset, attach it to your notebook and specify the path to the dataset. For example:

from transformers import AutoTokenizer, AutoModelForPreTraining

    # Path to the attached Kaggle dataset; if the dataset stores each
    # MobileBERT variant in its own subdirectory, point MODEL_DIR at
    # that variant's folder instead.
    MODEL_DIR = "/kaggle/input/huggingface-google-mobilebert/"

    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
    model = AutoModelForPreTraining.from_pretrained(MODEL_DIR)

Acknowledgements

    All the copyrights and IP relating to MobileBERT belong to the original authors (Sun et al.) and Google. All copyrights relating to the transformers library belong to Hugging Face. Please reach out directly to the authors if you have questions regarding licenses and usage.

  7. Dataset of the paper: "How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study"

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +1more
    Updated Jan 16, 2024
    Cite
    Pepe, Federica; Nardone, Vittoria; Mastropaolo, Antonio; Canfora, Gerardo; BAVOTA, Gabriele; Di Penta, Massimiliano (2024). Dataset of the paper: "How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study" [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_8200098
    Explore at:
    Dataset updated
    Jan 16, 2024
    Dataset provided by
    Università della Svizzera italiana
    University of Sannio
    University of Molise
    Università degli Studi del Sannio
    Authors
    Pepe, Federica; Nardone, Vittoria; Mastropaolo, Antonio; Canfora, Gerardo; BAVOTA, Gabriele; Di Penta, Massimiliano
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This replication package contains datasets and scripts related to the paper: "*How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study*"

    Root directory

    • statistics.r: R script used to compute the correlation between usage and downloads, and the RQ1/RQ2 inter-rater agreements
    • modelsInfo.zip: zip file containing all the downloaded model cards (in JSON format)
    • script: directory containing all the scripts used to collect and process data. For further details, see README file inside the script directory.

    Dataset

    • Dataset/Dataset_HF-models-list.csv: list of HF models analyzed
    • Dataset/Dataset_github-prj-list.txt: list of GitHub projects using the transformers library
    • Dataset/Dataset_github-Prj_model-Used.csv: contains usage pairs: project, model
    • Dataset/Dataset_prj-num-models-reused.csv: number of models used by each GitHub project
    • Dataset/Dataset_model-download_num-prj_correlation.csv contains, for each model used by GitHub projects: the name, the task, the number of reusing projects, and the number of downloads

    RQ1

    • RQ1/RQ1_dataset-list.txt: list of HF datasets
    • RQ1/RQ1_datasetSample.csv: sample set of models used for the manual analysis of datasets
• RQ1/RQ1_analyzeDatasetTags.py: Python script to analyze model tags for the presence of datasets. It requires unzipping modelsInfo.zip into a directory with the same name (modelsInfo) at the root of the replication package folder. It writes its output to stdout; redirect it to a file to be analyzed by the RQ1/RQ1_countDataset.py script
• RQ1/RQ1_countDataset.py: given the output of RQ1/RQ1_analyzeDatasetTags.py (passed as argument), produces, for each model, a list of Booleans indicating whether (i) the model only declares HF datasets, (ii) the model only declares external datasets, (iii) the model declares both, and (iv) the model is part of the sample for the manual analysis
    • RQ1/RQ1_datasetTags.csv: output of RQ1/RQ1_analyzeDatasetTags.py
    • RQ1/RQ1_dataset_usage_count.csv: output of RQ1/RQ1_countDataset.py

    RQ2

    • RQ2/tableBias.pdf: table detailing the number of occurrences of different types of bias by model Task
    • RQ2/RQ2_bias_classification_sheet.csv: results of the manual labeling
    • RQ2/RQ2_isBiased.csv: file to compute the inter-rater agreement of whether or not a model documents Bias
    • RQ2/RQ2_biasAgrLabels.csv: file to compute the inter-rater agreement related to bias categories
    • RQ2/RQ2_final_bias_categories_with_levels.csv: for each model in the sample, this file lists (i) the bias leaf category, (ii) the first-level category, and (iii) the intermediate category

    RQ3

    • RQ3/RQ3_LicenseValidation.csv: manual validation of a sample of licenses
    • RQ3/RQ3_{NETWORK-RESTRICTIVE|RESTRICTIVE|WEAK-RESTRICTIVE|PERMISSIVE}-license-list.txt: lists of licenses with different permissiveness
    • RQ3/RQ3_prjs_license.csv: for each project linked to models, among other fields it indicates the license tag and name
    • RQ3/RQ3_models_license.csv: for each model, indicates among other pieces of info, whether the model has a license, and if yes what kind of license
    • RQ3/RQ3_model-prj-license_contingency_table.csv: usage contingency table between projects' licenses (columns) and models' licenses (rows)
    • RQ3/RQ3_models_prjs_licenses_with_type.csv: pairs project-model, with their respective licenses and permissiveness level

    scripts

    Contains the scripts used to mine Hugging Face and GitHub. Details are in the enclosed README

  8. Huggingface Hub Permissible models and datasets

    • kaggle.com
    zip
    Updated Dec 26, 2023
    Cite
    Dheeraj M Pai (2023). Huggingface Hub Permissible models and datasets [Dataset]. https://www.kaggle.com/datasets/dheerajmpai/huggingface-hub-permissible-models-and-datasets
    Explore at:
zip (34761279 bytes)
    Dataset updated
    Dec 26, 2023
    Authors
    Dheeraj M Pai
    License

MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Huggingface Hub: Models, Datasets, and Spaces

    Dataset Overview

    This comprehensive dataset contains detailed information about all the models, datasets, and spaces available on the Huggingface Hub. It is an essential resource for anyone looking to explore the extensive range of tools and datasets available for machine learning and AI research.

    Key Features

    • Comprehensive Data: Includes exhaustive details on all models, datasets, and spaces from the Huggingface Hub.
    • Permissible Models: A specialized subset is provided in a separate CSV file, focusing exclusively on models that are permissible for use.
    • Regularly Updated: The dataset is refreshed weekly to ensure the latest information is always available.

    Last Update

    • Date: December 26, 2023

    Update Frequency

    • Frequency: Weekly

    Dataset Contents

    1. Models: Detailed listings of all models available on Huggingface Hub.
    2. Datasets: Comprehensive information on datasets hosted on the Hub.
    3. Spaces: An overview of the different spaces and their functionalities.
    4. Permissible Models CSV: A smaller, curated list of models that are cleared for use.

    Usage

    This dataset is ideal for researchers, developers, and AI enthusiasts who are looking for a one-stop repository of models, datasets, and spaces from the Huggingface Hub. It provides a holistic view and simplifies the task of finding the right tools for various machine learning and AI projects.

    Note: This dataset is not officially affiliated with or endorsed by the Huggingface organization.

  9. pubmed

    • huggingface.co
    Updated Dec 15, 2023
    + more versions
    Cite
    NLM/DIR BioNLP Group (2023). pubmed [Dataset]. https://huggingface.co/datasets/ncbi/pubmed
    Explore at:
    Dataset updated
    Dec 15, 2023
    Dataset authored and provided by
    NLM/DIR BioNLP Group
    License

https://choosealicense.com/licenses/other/

    Description

    NLM produces a baseline set of MEDLINE/PubMed citation records in XML format for download on an annual basis. The annual baseline is released in December of each year. Each day, NLM produces update files that include new, revised and deleted citations. See our documentation page for more information.

  10. Hugging Face Dataset Metadata

    • figshare.com
    json
    Updated May 15, 2025
    Cite
    Anonymous USER (2025). Hugging Face Dataset Metadata [Dataset]. http://doi.org/10.6084/m9.figshare.29082806.v1
    Explore at:
json
    Dataset updated
    May 15, 2025
    Dataset provided by
Figshare (http://figshare.com/)
    Authors
    Anonymous USER
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

This file describes the dataset currently hosted on Hugging Face: https://huggingface.co/datasets/LISTTT/NeurIPS_2025_BMDB

  11. HuggingFace models

    • redivis.com
    application/jsonl +7
    Updated Feb 24, 2025
    + more versions
    Cite
    Redivis Demo Organization (2025). HuggingFace models [Dataset]. https://redivis.com/datasets/d2aq-2jp4d5xpd
    Explore at:
sas, parquet, avro, application/jsonl, arrow, spss, stata, csv
    Dataset updated
    Feb 24, 2025
    Dataset provided by
    Redivis Inc.
    Authors
    Redivis Demo Organization
    Time period covered
    Feb 24, 2025
    Description

    Abstract

    Container dataset for demonstration of Hugging Face models on Redivis. Currently just contains a single BERT model, but may expand in the future.

  12. statcast-era-pitches

    • huggingface.co
    Updated Oct 30, 2024
    Cite
    Jensen Holm (2024). statcast-era-pitches [Dataset]. https://huggingface.co/datasets/Jensen-holm/statcast-era-pitches
    Explore at:
    Dataset updated
    Oct 30, 2024
    Authors
    Jensen Holm
    License

MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    statcast-pitches

pybaseball is a great tool for downloading baseball data. Even though the library is optimized and scrapes this data in parallel, it can be time consuming. The point of this repository is to utilize GitHub Actions to scrape new baseball data weekly during the MLB season, and update a parquet file hosted as a Hugging Face dataset. Reading this data as a Hugging Face dataset is much faster than scraping the new data each time you re-run your code, or just want updated… See the full description on the dataset page: https://huggingface.co/datasets/Jensen-holm/statcast-era-pitches.

  13. Data from: hugging face datasets

    • kaggle.com
    zip
    Updated Nov 3, 2025
    Cite
    Nicholas Broad (2025). hugging face datasets [Dataset]. https://www.kaggle.com/nbroad/hf-ds
    Explore at:
zip (70163997 bytes)
    Dataset updated
    Nov 3, 2025
    Authors
    Nicholas Broad
    Description

    This is the latest version of Hugging Face datasets to be used in offline notebooks on Kaggle. It is automatically updated every week.

    Docs are here

    Installation Instructions

    !pip install datasets --no-index --find-links=file:///kaggle/input/hf-ds -U -q

  14. Data from: label-files

    • huggingface.co
    Updated Jun 24, 2023
    Cite
    Fatih C. Akyon (2023). label-files [Dataset]. https://huggingface.co/datasets/fcakyon/label-files
    Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 24, 2023
    Authors
    Fatih C. Akyon
    Description

    fcakyon/label-files dataset hosted on Hugging Face and contributed by the HF Datasets community

  15. notebooks_on_the_hub

    • huggingface.co
    Updated Apr 3, 2023
    Cite
    Sylvain Lesage (2023). notebooks_on_the_hub [Dataset]. https://huggingface.co/datasets/severo/notebooks_on_the_hub
    Explore at:
    Dataset updated
    Apr 3, 2023
    Authors
    Sylvain Lesage
    Description

    Notebooks on the Hub

This dataset uses files from the repository https://huggingface.co/datasets/davanstrien/notebooks_on_the_hub_raw, which records all the repositories hosted on the Hugging Face Hub that contain notebooks. Daniel's repository was updated daily from April 2023 to June 2024. I manually copied only one version per month; they are stored in the original folder with the name YYYY_MM.parquet, from 2023_05.parquet to 2024_05.parquet (13 files). Then, I recreated… See the full description on the dataset page: https://huggingface.co/datasets/severo/notebooks_on_the_hub.

  16. facebook/natural_reasoning

    • kaggle.com
    zip
    Updated Feb 27, 2025
    Cite
    Zehra Korkusuz (2025). facebook/natural_reasoning [Dataset]. https://www.kaggle.com/datasets/zehrakorkusuz/natural-reasoning
    Explore at:
zip (1694591016 bytes)
    Dataset updated
    Feb 27, 2025
    Authors
    Zehra Korkusuz
    License

Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Natural Reasoning Dataset

    Source: Huggingface

    Dataset Overview

    Natural Reasoning is a large-scale dataset designed for general reasoning tasks. It consists of high-quality, challenging reasoning questions backtranslated from pretraining corpora DCLM and FineMath. The dataset has been carefully deduplicated and decontaminated from popular reasoning benchmarks including MATH, GPQA, MMLU-Pro, and MMLU-STEM.

    A 1.1 million subset of the Natural Reasoning dataset is released to the research community to foster the development of strong large language model (LLM) reasoners.

    Dataset Information

    File Format: natural_reasoning.parquet

    Click here to view the dataset

    How to Use

    You can load the dataset directly from Hugging Face as follows:

    from datasets import load_dataset
    
    ds = load_dataset("facebook/natural_reasoning")
    

    Data Collection and Quality

    The dataset was constructed from the pretraining corpora DCLM and FineMath. The questions have been filtered to remove contamination and duplication from widely-used reasoning benchmarks like MATH, GPQA, MMLU-Pro, and MMLU-STEM. For each question, the dataset provides a reference final answer extracted from the original document when available, and also includes a model-generated response from Llama3.3-70B-Instruct.

    Reference Answer Statistics

In the 1.1 million subset:

    • 18.29% of the questions do not have a reference answer.
    • 9.71% of the questions have a single-word answer.
    • 21.58% of the questions have a short answer.
    • 50.42% of the questions have a long-form reference answer.

    Scaling Curve Performance

    Training on the Natural Reasoning dataset shows superior scaling effects compared to other datasets. When training the Llama3.1-8B-Instruct model, the dataset achieved better performance on average across three key benchmarks: MATH, GPQA, and MMLU-Pro.

(Figure: scaling curve; image at https://cdn-uploads.huggingface.co/production/uploads/659a395421a7431643caedda/S6aO-agjRRhc0JLkohZ5z.jpeg)

    Citation

    If you use the Natural Reasoning dataset, please cite it with the following BibTeX entry:

    @misc{yuan2025naturalreasoningreasoningwild28m,
       title={NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions},
       author={Weizhe Yuan and Jane Yu and Song Jiang and Karthik Padthe and Yang Li and Dong Wang and Ilia Kulikov and Kyunghyun Cho and Yuandong Tian and Jason E Weston and Xian Li},
       year={2025},
       eprint={2502.13124},
       archivePrefix={arXiv},
       primaryClass={cs.CL},
       url={https://arxiv.org/abs/2502.13124}
    }
    

    Source: Hugging Face

  17. Data from: huggingface

    • kaggle.com
    zip
    Updated Mar 22, 2022
    Cite
    amulil (2022). huggingface [Dataset]. https://www.kaggle.com/datasets/amulil/amulil-huggingface
    Explore at:
zip (5498282999 bytes)
    Dataset updated
    Mar 22, 2022
    Authors
    amulil
    License

http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    Dataset

    This dataset was created by amulil

    Released under GPL 2

    Contents

  18. edit_amazon_reviews_multi

    • kaggle.com
    zip
    Updated Aug 21, 2025
    Cite
    Radim Közl (2025). edit_amazon_reviews_multi [Dataset]. https://www.kaggle.com/datasets/radimkzl/edit-amazon-reviews-multi
    Explore at:
zip (167232183 bytes)
    Dataset updated
    Aug 21, 2025
    Authors
    Radim Közl
    Description

    Dataset Summary

The dataset is based on a copy of the Hugging Face dataset buruzaemon/amazon_reviews_multi, which is itself a copy of the original dataset defunct-datasets/amazon_reviews_multi. The dataset was published by the Open Data on AWS community.

In our modification, we removed unnecessary columns (thus anonymizing the dataset) and added columns describing the string lengths of the individual columns; see Multilingual_Amazon_Reviews_Corpus_analysis. Next, the dataset was re-partitioned:

    • train: 95% (199500)
    • validation: 2.5% (5250)
    • test: 2.5% (5250)

The original *.jsonl data format has been changed to the more modern *.parquet format (see Apache Arrow).
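The 95% / 2.5% / 2.5% re-partitioning above could be done with the datasets library's train_test_split; to stay dependency-free, here is a sketch of the same split with plain Python. The function name and seed are illustrative; only the split fractions come from the card.

```python
import random

def repartition(records, train_frac=0.95, val_frac=0.025, seed=42):
    """Shuffle and split into train/validation/test, mirroring the
    95% / 2.5% / 2.5% split described above."""
    records = records[:]  # avoid mutating the caller's list
    random.Random(seed).shuffle(records)
    n = len(records)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return {
        "train": records[:n_train],
        "validation": records[n_train:n_train + n_val],
        "test": records[n_train + n_val:],
    }

# 210000 records reproduce the sizes quoted above: 199500 / 5250 / 5250
splits = repartition(list(range(210000)))
print({k: len(v) for k, v in splits.items()})
# {'train': 199500, 'validation': 5250, 'test': 5250}
```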

    The data file was created for the purpose of testing the Hugging Face tutorial Summarization, because the older version of the dataset is not compatible with the new version of the datasets library.

This dataset is the comprehensive one; derived datasets for the tutorial can be found here:

    Description of the original dataset - Hugging Face Datasets

    "We provide an Amazon product reviews dataset for multilingual text classification. The dataset contains reviews in English, Japanese, German, French, Chinese and Spanish, collected between November 1, 2015 and November 1, 2019. Each record in the dataset contains the review text, the review title, the star rating, an anonymized reviewer ID, an anonymized product ID and the coarse-grained product category (e.g. ‘books’, ‘appliances’, etc.) The corpus is balanced across stars, so each star rating constitutes 20% of the reviews in each language.

    For each language, there are 200,000, 5,000 and 5,000 reviews in the training, development and test sets respectively. The maximum number of reviews per reviewer is 20 and the maximum number of reviews per product is 20. All reviews are truncated after 2,000 characters, and all reviews are at least 20 characters long.

    Note that the language of a review does not necessarily match the language of its marketplace (e.g. reviews from amazon.de are primarily written in German, but could also be written in English, etc.). For this reason, we applied a language detection algorithm based on the work in Bojanowski et al. (2017) to determine the language of the review text and we removed reviews that were not written in the expected language." source

    Documentation of the authors of the original dataset: The Multilingual Amazon Reviews Corpus

    Languages

    The dataset contains reviews in English, Japanese, German, French, Chinese and Spanish.

    Dataset Structure

    • id: record id
    • stars: An int between 1-5 indicating the number of stars.
    • review_body: The text body of the review.
    • review_title: The text title of the review.
    • language: The string identifier of the review language.
    • product_category: String representation of the product's category.
    • lenght_review_body: text length of review_body
• lenght_review_title: text length of review_title
    • lenght_product_category: text length of product_category

    Social Impact of Dataset

    This dataset is part of an effort to encourage text classification research in languages other than English. Such work increases the accessibility of natural language technology to more regions and cultures. Unfortunately, each of the languages included here is relatively high resource and well studied. The dataset is used for training in NLP, summarization tasks, text generation, and masked text filling. source

    Discussion of Biases of origin dataset

    The dataset contains only reviews from verified purchases (as described in the paper, section 2.1), and the reviews should conform the Amazon Community Guidelines. source

    Licensing Information

    Licensing of origin dataset

    Amazon has licensed this dataset under its own agreement for non-commercial research usage only. This licenc...

  19. rag-human-rights-from-files

    • huggingface.co
    Updated Jan 17, 2025
    + more versions
    Cite
    Sara Han Díaz (2025). rag-human-rights-from-files [Dataset]. https://huggingface.co/datasets/sdiazlor/rag-human-rights-from-files
    Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 17, 2025
    Authors
    Sara Han Díaz
    Description

    Dataset Card for my-distiset-rag-files

    This dataset has been created with distilabel.

      Dataset Summary
    

    This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/sdiazlor/my-distiset-rag-files/raw/main/pipeline.yaml"

    or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/sdiazlor/rag-human-rights-from-files.

  20. Data from: AstroChat

    • kaggle.com
    • huggingface.co
    zip
    Updated Jun 9, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    astro_pat (2024). AstroChat [Dataset]. https://www.kaggle.com/datasets/patrickfleith/astrochat
    Explore at:
    zip(1214166 bytes)Available download formats
    Dataset updated
    Jun 9, 2024
    Authors
    astro_pat
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Purpose and Scope

    The AstroChat dataset is a collection of 901 dialogues, synthetically generated, tailored to the specific domain of Astronautics / Space Mission Engineering. This dataset will be frequently updated following feedback from the community. If you would like to contribute, please reach out in the community discussion.

    Intended Use

The dataset is intended to be used for supervised fine-tuning of chat LLMs (Large Language Models). Due to its currently limited size, you should use a pre-trained instruct model and ideally augment the AstroChat dataset with other datasets in the area of Science, Technology, Engineering and Math.

    Quickstart

    To be completed

    DATASET DESCRIPTION

    Access

    Structure

    901 generated conversations between a simulated user and an AI assistant (more on the generation method below). Each instance is made of the following fields (columns):

    • id: a unique identifier for this specific conversation. Useful for traceability purposes, especially for further processing tasks or merges with other datasets.
    • topic: a topic within the domain of Astronautics / Space Mission Engineering. This field is useful to filter the dataset by topic, or to create a topic-based split.
    • subtopic: a subtopic of the topic. For instance, within the topic of Propulsion there are subtopics like Injector Design, Combustion Instability, Electric Propulsion, Chemical Propulsion, etc.
    • persona: description of the persona used to simulate a user.
    • opening_question: the first question asked by the user to start a conversation with the AI assistant.
    • messages: the whole conversation between the user and the AI assistant, already formatted for rapid use with the transformers library. A list of messages where each message is a dictionary with the following fields:
      - role: the role of the speaker, either user or assistant
      - content: the message content. For the assistant, it is the answer to the user's question. For the user, it is the question asked to the assistant.
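A minimal sketch of what one instance looks like and how its messages field can be flattened into a plain-text prompt. The field names follow the schema described above; the sample values, and the to_prompt helper, are invented here for illustration and are not part of the dataset.

```python
# Hypothetical sample instance mirroring the AstroChat schema described above.
sample = {
    "id": "astrochat-0001",
    "topic": "Propulsion",
    "subtopic": "Electric Propulsion",
    "persona": "A graduate student in aerospace engineering",
    "opening_question": "How does a Hall-effect thruster generate thrust?",
    "messages": [
        {"role": "user", "content": "How does a Hall-effect thruster generate thrust?"},
        {"role": "assistant", "content": "It accelerates ionized propellant with crossed electric and magnetic fields."},
    ],
}

def to_prompt(messages):
    """Flatten a messages list into a simple role-prefixed text prompt."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)

print(to_prompt(sample["messages"]))
```

Because messages already uses the role/content convention, the same list can be passed directly to a tokenizer chat template in the transformers library.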

    Important See the full list of topics and subtopics covered below.

    Metadata

    Dataset is version controlled and commits history is available here: https://huggingface.co/datasets/patrickfleith/AstroChat/commits/main

    Generation Method

    We used a method inspired by the UltraChat dataset. Specifically, we implemented our own version of the Human-Model interaction from "Sector I: Questions about the World" of their paper:

    Ding, N., Chen, Y., Xu, B., Qin, Y., Zheng, Z., Hu, S., ... & Zhou, B. (2023). Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233.

    Step-by-step description

    • Defined a set of user personas
    • Defined a set of topics/disciplines within the domain of Astronautics / Space Mission Engineering
    • For each topic, we defined a set of subtopics to narrow the conversations down to more specific and niche exchanges (see below the full list)
    • For each subtopic, we generated a set of opening questions that the user could ask to start a conversation (see below the full list)
    • We then distilled the knowledge of a strong chat model (in our case ChatGPT, through the API with the gpt-4-turbo model) to generate the answers to the opening questions
    • We simulated follow-up questions from the user to the assistant, and the assistant's answers to these questions, which builds up the messages.
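The steps above can be sketched as a simple generation loop. This is not the authors' actual pipeline: ask_model is a stub standing in for the real gpt-4-turbo API call, and the turn counts are assumptions for illustration.

```python
def ask_model(prompt: str) -> str:
    # Stub: a real implementation would call a chat-completions API here.
    return f"[assistant answer to: {prompt}]"

def build_conversation(opening_question: str, n_followups: int = 2) -> list:
    """Build a user/assistant message list: an opening exchange plus simulated follow-ups."""
    messages = [{"role": "user", "content": opening_question}]
    messages.append({"role": "assistant", "content": ask_model(opening_question)})
    for turn in range(n_followups):
        # Simulate the user's next question, then distil the assistant's answer.
        followup = ask_model(f"simulate a follow-up user question, turn {turn}")
        messages.append({"role": "user", "content": followup})
        messages.append({"role": "assistant", "content": ask_model(followup)})
    return messages

conv = build_conversation("What limits specific impulse in chemical rockets?")
print(len(conv))  # 2 + 2 * n_followups = 6 messages
```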

    Future work and contributions appreciated

    • Distil knowledge from more models (Anthropic, Mixtral, GPT-4o, etc.)
    • Implement more creativity in the opening questions and follow-up questions
    • Filter out questions and conversations which are too similar
    • Ask topic and subtopic experts to validate the generated conversations, to get a sense of how reliable the overall dataset is

    Languages

    All instances in the dataset are in English.

    Size

    901 synthetically generated dialogues

    USAGE AND GUIDELINES

    License

    AstroChat © 2024 by Patrick Fleith is licensed under Creative Commons Attribution 4.0 International

    Restrictions

    No restriction. Please provide the correct attribution following the license terms.

    Citation

    Patrick Fleith. (2024). AstroChat - A Dataset of synthetically generated conversations for LLM supervised fine-tuning in the domain of Space Mission Engineering and Astronautics (Version 1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.11531579

    Update Frequency

    Will be updated based on feedback. I am also looking for contributors. Help me create more datasets for Space Engineering LLMs :)

    Have a feedback or spot an error?

    Use the ...

Cite
Kartik Godawat (2021). Huggingface Modelhub [Dataset]. https://www.kaggle.com/crazydiv/huggingface-modelhub

Huggingface Modelhub

Dataset containing information on all the models on HuggingFace modelhub

Explore at:
8 scholarly articles cite this dataset (View in Google Scholar)
zip(2274876 bytes)Available download formats
Dataset updated
Jun 19, 2021
Authors
Kartik Godawat
License

https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

Description


Dataset containing metadata information on all the publicly uploaded models (10,000+) available on the HuggingFace model hub. Data was collected between 15 and 20 June 2021.

The dataset was generated using the huggingface_hub APIs provided by the Hugging Face team.

Update v3:

  • Added Downloads last month metric
  • Added library name

Contents:

  • huggingface_models.csv : Primary file which contains metadata information like model name, tags, last modified and filenames
  • huggingface_modelcard_readme.csv : Detailed file containing README.md contents if available for a particular model. Content is in markdown format. The modelId column joins both files together.

huggingface_models.csv

  • modelId: ID of the model as present on the HF website
  • lastModified: Time when this model was last modified
  • tags: Tags associated with the model (provided by the maintainer)
  • pipeline_tag: If present, denotes which pipeline this model could be used with
  • files: List of available files in the model repo
  • publishedBy: Custom column derived from modelId, specifying who published this model
  • downloads_last_month: Number of times the model has been downloaded in the last month
  • library: Name of the library the model belongs to, e.g. transformers, spacy, timm, etc.

huggingface_modelcard_readme.csv

  • modelId: ID of the model as available on the HF website
  • modelCard: README contents of a model (referred to as the model card in the HuggingFace ecosystem). It contains useful information on how the model was trained, benchmarks and author notes.

Inspiration

The idea of analyzing publicly available models on HuggingFace struck me while I was attending a live session of the amazing transformers course by @LysandreJik. Soon after, I tweeted the team and asked for permission to create such a dataset. Special shoutout to @osanseviero for encouraging and pointing me in the right direction.
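Since the two CSVs share the modelId column, they can be joined with the standard library alone. A small sketch, with inline CSV text standing in for the real files and invented rows for illustration:

```python
import csv
import io

# Hypothetical stand-ins for huggingface_models.csv and huggingface_modelcard_readme.csv.
models_csv = "modelId,pipeline_tag\nbert-base-uncased,fill-mask\ngpt2,text-generation\n"
readmes_csv = "modelId,modelCard\nbert-base-uncased,# BERT\ngpt2,# GPT-2\n"

# Index the primary file by modelId, then attach each README to its model.
models = {row["modelId"]: row for row in csv.DictReader(io.StringIO(models_csv))}
for row in csv.DictReader(io.StringIO(readmes_csv)):
    if row["modelId"] in models:
        models[row["modelId"]]["modelCard"] = row["modelCard"]

print(models["gpt2"]["modelCard"])  # → # GPT-2
```

The same join works with pandas (merge on modelId) when working with the full files.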

This is my first dataset upload on Kaggle. I hope you like it. :)
