License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
Dataset containing metadata for all publicly uploaded models (10,000+) available on the HuggingFace model hub. Data was collected between 15 and 20 June 2021.
The dataset was generated using the huggingface_hub APIs provided by the Hugging Face team.
This is my first dataset upload on Kaggle. I hope you like it. :)
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
bitwisemind/new-food-nextvit-update-dataset-train-file dataset hosted on Hugging Face and contributed by the HF Datasets community
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
Hugging Face, Inc. is an American company that develops tools for building applications using machine learning. It is most notable for its Transformers library built for natural language processing applications and its platform that allows users to share machine learning models and datasets.
This dataset contains the data of 16k models available on huggingface.co. It contains the following features of each model:
1. model url
2. model title
3. downloads and likes
4. updated
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
Guide: How to share your data on the BoAmps repository
This guide explains step by step how to share BoAmps format reports on this public Hugging Face repository.
Prerequisites
Before starting, make sure you have:
- A Hugging Face account
- The files you want to upload
Method 1: Hugging Face Web Interface
Log in to Hugging Face
Go to the boamps dataset
Navigate to the files: Click on "Files and versions" then on the "data" folder
Click on "Contribute" then… See the full description on the dataset page: https://huggingface.co/datasets/boavizta/open_data_boamps.
This repository contains the mapping from integer ids to actual label names (in HuggingFace Transformers typically called id2label) for several datasets. Current datasets include:
- ImageNet-1k
- ImageNet-22k (also called ImageNet-21k, as there are 21,843 classes)
- COCO detection 2017
- COCO panoptic 2017
- ADE20k (actually the MIT Scene Parsing benchmark, which is a subset of ADE20k)
- Cityscapes
- VQAv2
- Kinetics-700
- RVL-CDIP
- PASCAL VOC
- Kinetics-400
- ...
You can read in a label file as follows (using… See the full description on the dataset page: https://huggingface.co/datasets/huggingface/label-files.
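The reading snippet above is truncated, but the id2label convention itself is simple; here is a minimal self-contained sketch, with inline JSON standing in for an actual label file (the file names and contents used here are hypothetical, not taken from the repository):

```python
import json

# Inline JSON standing in for a downloaded label file (hypothetical contents);
# label files map integer ids, serialized as JSON string keys, to class names.
raw = '{"0": "tench", "1": "goldfish", "2": "great white shark"}'

# Convert the string keys back to ints, the usual id2label shape in Transformers,
# and build the reverse label2id mapping.
id2label = {int(k): v for k, v in json.loads(raw).items()}
label2id = {label: idx for idx, label in id2label.items()}
```

In practice the JSON would be fetched from the repository (for example with huggingface_hub's hf_hub_download) before being parsed this way.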
This dataset contains different variants of the MobileBERT model by Google available on Hugging Face's model repository.
By making it a dataset, it is significantly faster to load the weights since you can directly attach a Kaggle dataset to the notebook rather than downloading the data every time. See the speed comparison notebook. Another benefit of loading models as a dataset is that it can be used in competitions that require internet access to be "off".
For more information on usage visit the mobilebert hugging face docs.
Usage
To use this dataset, attach it to your notebook and specify the path to the dataset. For example:
from transformers import AutoTokenizer, AutoModelForPreTraining

# Path to the attached Kaggle dataset containing the model files
MODEL_DIR = "/kaggle/input/huggingface-google-mobilebert/"

# Load the tokenizer and model weights directly from the local dataset files
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForPreTraining.from_pretrained(MODEL_DIR)
Acknowledgements
All the copyrights and IP relating to MobileBERT belong to the original authors (Sun et al.) and Google. All copyrights relating to the transformers library belong to Hugging Face. Please reach out directly to the authors if you have questions regarding licenses and usage.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
This replication package contains datasets and scripts related to the paper: "*How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study*"
- statistics.r: R script used to compute the correlation between usage and downloads, and the RQ1/RQ2 inter-rater agreements
- modelsInfo.zip: zip file containing all the downloaded model cards (in JSON format)
- script: directory containing all the scripts used to collect and process data. For further details, see the README file inside the script directory.
- Dataset/Dataset_HF-models-list.csv: list of HF models analyzed
- Dataset/Dataset_github-prj-list.txt: list of GitHub projects using the transformers library
- Dataset/Dataset_github-Prj_model-Used.csv: contains usage pairs: project, model
- Dataset/Dataset_prj-num-models-reused.csv: number of models used by each GitHub project
- Dataset/Dataset_model-download_num-prj_correlation.csv: contains, for each model used by GitHub projects: the name, the task, the number of reusing projects, and the number of downloads
- RQ1/RQ1_dataset-list.txt: list of HF datasets
- RQ1/RQ1_datasetSample.csv: sample set of models used for the manual analysis of datasets
- RQ1/RQ1_analyzeDatasetTags.py: Python script to analyze model tags for the presence of datasets. It requires unzipping modelsInfo.zip into a directory with the same name (modelsInfo) at the root of the replication package folder. Produces its output to stdout; redirect it to a file to be analyzed by the RQ2/countDataset.py script.
- RQ1/RQ1_countDataset.py: given the output of RQ2/analyzeDatasetTags.py (passed as argument), produces, for each model, a list of Booleans indicating whether (i) the model only declares HF datasets, (ii) the model only declares external datasets, (iii) the model declares both, and (iv) the model is part of the sample for the manual analysis
- RQ1/RQ1_datasetTags.csv: output of RQ2/analyzeDatasetTags.py
- RQ1/RQ1_dataset_usage_count.csv: output of RQ2/countDataset.py
- RQ2/tableBias.pdf: table detailing the number of occurrences of different types of bias by model task
- RQ2/RQ2_bias_classification_sheet.csv: results of the manual labeling
- RQ2/RQ2_isBiased.csv: file to compute the inter-rater agreement on whether or not a model documents bias
- RQ2/RQ2_biasAgrLabels.csv: file to compute the inter-rater agreement related to bias categories
- RQ2/RQ2_final_bias_categories_with_levels.csv: for each model in the sample, this file lists (i) the bias leaf category, (ii) the first-level category, and (iii) the intermediate category
- RQ3/RQ3_LicenseValidation.csv: manual validation of a sample of licenses
- RQ3/RQ3_{NETWORK-RESTRICTIVE|RESTRICTIVE|WEAK-RESTRICTIVE|PERMISSIVE}-license-list.txt: lists of licenses with different levels of permissiveness
- RQ3/RQ3_prjs_license.csv: for each project linked to models, indicates (among other fields) the license tag and name
- RQ3/RQ3_models_license.csv: for each model, indicates (among other pieces of info) whether the model has a license and, if so, what kind of license
- RQ3/RQ3_model-prj-license_contingency_table.csv: usage contingency table between projects' licenses (columns) and models' licenses (rows)
- RQ3/RQ3_models_prjs_licenses_with_type.csv: project-model pairs, with their respective licenses and permissiveness levels

The package also contains the scripts used to mine Hugging Face and GitHub. Details are in the enclosed README.
License: MIT License (https://opensource.org/licenses/MIT)
This comprehensive dataset contains detailed information about all the models, datasets, and spaces available on the Huggingface Hub. It is an essential resource for anyone looking to explore the extensive range of tools and datasets available for machine learning and AI research.
This dataset is ideal for researchers, developers, and AI enthusiasts who are looking for a one-stop repository of models, datasets, and spaces from the Huggingface Hub. It provides a holistic view and simplifies the task of finding the right tools for various machine learning and AI projects.
Note: This dataset is not officially affiliated with or endorsed by the Huggingface organization.
License: other (https://choosealicense.com/licenses/other/)
NLM produces a baseline set of MEDLINE/PubMed citation records in XML format for download on an annual basis. The annual baseline is released in December of each year. Each day, NLM produces update files that include new, revised and deleted citations. See our documentation page for more information.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
This file describes the dataset currently hosted on Hugging Face: https://huggingface.co/datasets/LISTTT/NeurIPS_2025_BMDB
Container dataset for demonstration of Hugging Face models on Redivis. Currently just contains a single BERT model, but may expand in the future.
License: MIT License (https://opensource.org/licenses/MIT)
statcast-pitches
pybaseball is a great tool for downloading baseball data. Even though the library is optimized and scrapes this data in parallel, it can be time-consuming. The point of this repository is to use GitHub Actions to scrape new baseball data weekly during the MLB season and update a parquet file hosted as a Hugging Face dataset. Reading this data as a Hugging Face dataset is much faster than scraping the new data each time you re-run your code, or just want updated… See the full description on the dataset page: https://huggingface.co/datasets/Jensen-holm/statcast-era-pitches.
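The download-once-then-reuse idea behind this dataset can be sketched with a small stdlib-only caching helper. The exact parquet file name and URL layout below are assumptions for illustration, not confirmed by the dataset page:

```python
import os
import urllib.request

# Hypothetical file name and resolve-URL layout; the real dataset lives at
# https://huggingface.co/datasets/Jensen-holm/statcast-era-pitches
URL = ("https://huggingface.co/datasets/Jensen-holm/statcast-era-pitches/"
       "resolve/main/statcast_era_pitches.parquet")
CACHE = "statcast_era_pitches.parquet"

def get_pitches_path(url: str = URL, cache: str = CACHE) -> str:
    """Download the parquet file once; later calls reuse the local copy."""
    if not os.path.exists(cache):
        urllib.request.urlretrieve(url, cache)  # network fetch on first call only
    return cache
```

Once the local path is available, the file can be read with any parquet reader; `datasets.load_dataset("Jensen-holm/statcast-era-pitches")` handles this caching automatically.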
This is the latest version of Hugging Face datasets to be used in offline notebooks on Kaggle. It is automatically updated every week.
!pip install datasets --no-index --find-links=file:///kaggle/input/hf-ds -U -q
fcakyon/label-files dataset hosted on Hugging Face and contributed by the HF Datasets community
Notebooks on the Hub
This dataset uses files from the repository https://huggingface.co/datasets/davanstrien/notebooks_on_the_hub_raw which records all the repositories hosted on the Hugging Face Hub that contain notebooks. Daniel's repository was updated daily from April of 2023 to June of 2024. I manually copied only one version per month: they are stored in the original folder with the name YYYY_MM.parquet, from 2023_05.parquet to 2024_05.parquet (13 files). Then, I recreated… See the full description on the dataset page: https://huggingface.co/datasets/severo/notebooks_on_the_hub.
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
Natural Reasoning is a large-scale dataset designed for general reasoning tasks. It consists of high-quality, challenging reasoning questions backtranslated from pretraining corpora DCLM and FineMath. The dataset has been carefully deduplicated and decontaminated from popular reasoning benchmarks including MATH, GPQA, MMLU-Pro, and MMLU-STEM.
A 1.1 million subset of the Natural Reasoning dataset is released to the research community to foster the development of strong large language model (LLM) reasoners.
File Format: natural_reasoning.parquet
Tags: CC-BY-NC-4.0 · Text Generation · Reasoning · English (en) · 1M < n < 10M · Hugging Face
You can load the dataset directly from Hugging Face as follows:
from datasets import load_dataset
ds = load_dataset("facebook/natural_reasoning")
The dataset was constructed from the pretraining corpora DCLM and FineMath. The questions have been filtered to remove contamination and duplication from widely-used reasoning benchmarks like MATH, GPQA, MMLU-Pro, and MMLU-STEM. For each question, the dataset provides a reference final answer extracted from the original document when available, and also includes a model-generated response from Llama3.3-70B-Instruct.
In the 1.1 million subset: - 18.29% of the questions do not have a reference answer. - 9.71% of the questions have a single-word answer. - 21.58% of the questions have a short answer. - 50.42% of the questions have a long-form reference answer.
Training on the Natural Reasoning dataset shows superior scaling effects compared to other datasets. When training the Llama3.1-8B-Instruct model, the dataset achieved better performance on average across three key benchmarks: MATH, GPQA, and MMLU-Pro.
[Figure: scaling curve]
If you use the Natural Reasoning dataset, please cite it with the following BibTeX entry:
@misc{yuan2025naturalreasoningreasoningwild28m,
title={NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions},
author={Weizhe Yuan and Jane Yu and Song Jiang and Karthik Padthe and Yang Li and Dong Wang and Ilia Kulikov and Kyunghyun Cho and Yuandong Tian and Jason E Weston and Xian Li},
year={2025},
eprint={2502.13124},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.13124}
}
Source: Hugging Face
License: GNU GPL v2 (http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html)
This dataset was created by amulil
Released under GPL 2
The data file is based on a copy of the Hugging Face data file buruzaemon/amazon_reviews_multi, which is itself a copy of the original defunct-datasets/amazon_reviews_multi data file. The dataset was published by the Open Data on AWS community.
In our modification, we removed unnecessary columns and thus anonymized the data file, and at the same time we added columns describing the lengths of the strings of the single columns, see Multilingual_Amazon_Reviews_Corpus_analysis. Next, the dataset was re-partitioned:
The original *.jsonl data format has been changed to the more modern *.parquet format (see Apache Arrow).
The data file was created for the purpose of testing the Hugging Face tutorial Summarization, because the older version of the dataset is not compatible with the new version of the datasets library.
This dataset is comprehensive; derived datasets for the tutorial can be found here:
"We provide an Amazon product reviews dataset for multilingual text classification. The dataset contains reviews in English, Japanese, German, French, Chinese and Spanish, collected between November 1, 2015 and November 1, 2019. Each record in the dataset contains the review text, the review title, the star rating, an anonymized reviewer ID, an anonymized product ID and the coarse-grained product category (e.g. ‘books’, ‘appliances’, etc.) The corpus is balanced across stars, so each star rating constitutes 20% of the reviews in each language.
For each language, there are 200,000, 5,000 and 5,000 reviews in the training, development and test sets respectively. The maximum number of reviews per reviewer is 20 and the maximum number of reviews per product is 20. All reviews are truncated after 2,000 characters, and all reviews are at least 20 characters long.
Note that the language of a review does not necessarily match the language of its marketplace (e.g. reviews from amazon.de are primarily written in German, but could also be written in English, etc.). For this reason, we applied a language detection algorithm based on the work in Bojanowski et al. (2017) to determine the language of the review text and we removed reviews that were not written in the expected language." source
Documentation of the authors of the original dataset: The Multilingual Amazon Reviews Corpus
The dataset contains reviews in English, Japanese, German, French, Chinese and Spanish.
- id: record id
- stars: an int between 1-5 indicating the number of stars
- review_body: the text body of the review
- review_title: the text title of the review
- language: the string identifier of the review language
- product_category: string representation of the product's category
- lenght_review_body: text length of review_body
- lenght_review_title: text length of review_title
- lenght_product_category: text length of product_category

This dataset is part of an effort to encourage text classification research in languages other than English. Such work increases the accessibility of natural language technology to more regions and cultures. Unfortunately, each of the languages included here is relatively high resource and well studied. The dataset is used for training in NLP, summarization tasks, text generation, and masked text filling. source
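The added length columns can be derived directly from the text fields; a minimal sketch with a hypothetical record (the lenght_* spellings match the dataset's actual column names):

```python
# Hypothetical review record with the original text fields
record = {
    "review_body": "Great product, works as described.",
    "review_title": "Great product",
    "product_category": "electronics",
}

# Derive the added columns (the dataset spells these "lenght_*")
for field in ("review_body", "review_title", "product_category"):
    record[f"lenght_{field}"] = len(record[field])
```

With datasets or pandas, the same derivation would typically be applied column-wise via map/apply over the whole table.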
The dataset contains only reviews from verified purchases (as described in the paper, section 2.1), and the reviews should conform the Amazon Community Guidelines. source
Amazon has licensed this dataset under its own agreement for non-commercial research usage only. This licenc...
Dataset Card for my-distiset-rag-files
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/sdiazlor/my-distiset-rag-files/raw/main/pipeline.yaml"
or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/sdiazlor/rag-human-rights-from-files.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
The AstroChat dataset is a collection of 901 dialogues, synthetically generated, tailored to the specific domain of Astronautics / Space Mission Engineering. This dataset will be frequently updated following feedback from the community. If you would like to contribute, please reach out in the community discussion.
The dataset is intended to be used for supervised fine-tuning of chat LLMs (Large Language Models). Due to its currently limited size, you should use a pre-trained instruct model and ideally augment the AstroChat dataset with other datasets in the area of (Science Technology, Engineering and Math).
To be completed
from datasets import load_dataset
dataset = load_dataset("patrickfleith/AstroChat")

901 generated conversations between a simulated user and an AI assistant (more on the generation method below). Each instance is made of the following fields (columns):
- id: a unique identifier to refer to this specific conversation. Useful for traceability purposes, especially for further processing tasks or merges with other datasets.
- topic: a topic within the domain of Astronautics / Space Mission Engineering. This field is useful to filter the dataset by topic, or to create a topic-based split.
- subtopic: a subtopic of the topic. For instance in the topic of Propulsion, there are subtopics like Injector Design, Combustion Instability, Electric Propulsion, Chemical Propulsion, etc.
- persona: description of the persona used to simulate a user
- opening_question: the first question asked by the user to start a conversation with the AI-assistant
- messages: the whole conversation between the user and the AI assistant, already nicely formatted for rapid use with the transformers library. A list of messages where each message is a dictionary with the following fields:
- role: the role of the speaker, either user or assistant
- content: the message content. For the assistant, it is the answer to the user's question. For the user, it is the question asked to the assistant.
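The messages field described above follows the standard role/content chat format; a minimal sketch with invented content (not an actual record from the dataset):

```python
# Hypothetical conversation in the messages format described above
messages = [
    {"role": "user",
     "content": "How does a Hall-effect thruster produce thrust?"},
    {"role": "assistant",
     "content": "It ionizes a propellant such as xenon and accelerates "
                "the ions electrostatically."},
]

# Roles alternate user/assistant, the shape expected by transformers chat templates
roles = [m["role"] for m in messages]
```

A list in this shape can be passed directly to a tokenizer's apply_chat_template method when fine-tuning chat models.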
Important See the full list of topics and subtopics covered below.
Dataset is version controlled and commits history is available here: https://huggingface.co/datasets/patrickfleith/AstroChat/commits/main
We used a method inspired by the UltraChat dataset. Specifically, we implemented our own version of the Human-Model interaction from "Sector I: Questions about the World" of their paper:
Ding, N., Chen, Y., Xu, B., Qin, Y., Zheng, Z., Hu, S., ... & Zhou, B. (2023). Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233.
The gpt-4-turbo model was used to generate the answers to the opening questions. All instances in the dataset are in English.
901 synthetically generated dialogues
AstroChat © 2024 by Patrick Fleith is licensed under Creative Commons Attribution 4.0 International
No restriction. Please provide the correct attribution following the license terms.
Patrick Fleith. (2024). AstroChat - A Dataset of synthetically generated conversations for LLM supervised fine-tuning in the domain of Space Mission Engineering and Astronautics (Version 1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.11531579
Will be updated based on feedback. I am also looking for contributors. Help me create more datasets for Space Engineering LLMs :)
Use the ...