License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
Dataset containing metadata for all publicly uploaded models (10,000+) available on the HuggingFace model hub. Data was collected between 15 and 20 June 2021.
The dataset was generated using the huggingface_hub APIs provided by the Hugging Face team.
This is my first dataset upload on Kaggle. I hope you like it. :)
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
bitwisemind/new-food-nextvit-update-dataset-train-file dataset hosted on Hugging Face and contributed by the HF Datasets community
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
Hugging Face, Inc. is an American company that develops tools for building applications using machine learning. It is most notable for its Transformers library built for natural language processing applications and its platform that allows users to share machine learning models and datasets.
This dataset contains the data of 16k models available on huggingface.co. It contains the following features of each model:
1. model url
2. model title
3. downloads and likes
4. updated
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
Guide: How to share your data on the BoAmps repository
This guide explains step by step how to share BoAmps format reports on this public Hugging Face repository.
Prerequisites
Before starting, make sure you have:
- A Hugging Face account
- The files you want to upload
Method 1: Hugging Face Web Interface
Log in to Hugging Face
Go to the boamps dataset
Navigate to the files: Click on "Files and versions" then on the "data" folder
Click on "Contribute" then… See the full description on the dataset page: https://huggingface.co/datasets/boavizta/open_data_boamps.
This repository contains the mapping from integer ids to actual label names (in HuggingFace Transformers typically called id2label) for several datasets. Current datasets include:
- ImageNet-1k
- ImageNet-22k (also called ImageNet-21k, as there are 21,843 classes)
- COCO detection 2017
- COCO panoptic 2017
- ADE20k (actually the MIT Scene Parsing benchmark, which is a subset of ADE20k)
- Cityscapes
- VQAv2
- Kinetics-700
- RVL-CDIP
- PASCAL VOC
- Kinetics-400
- ...
You can read in a label file as follows (using… See the full description on the dataset page: https://huggingface.co/datasets/huggingface/label-files.
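The reading snippet above is truncated, but the id2label convention itself is simple; here is a minimal self-contained sketch, with inline JSON standing in for an actual label file (the file names and contents used here are hypothetical, not taken from the repository):

```python
import json

# Inline JSON standing in for a downloaded label file (hypothetical contents);
# label files map integer ids, serialized as JSON string keys, to class names.
raw = '{"0": "tench", "1": "goldfish", "2": "great white shark"}'

# Convert the string keys back to ints, the usual id2label shape in Transformers,
# and build the reverse label2id mapping.
id2label = {int(k): v for k, v in json.loads(raw).items()}
label2id = {label: idx for idx, label in id2label.items()}
```

In practice the JSON would be fetched from the repository (for example with huggingface_hub's hf_hub_download) before being parsed this way.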
This dataset contains different variants of the MobileBERT model by Google available on Hugging Face's model repository.
By making it a dataset, it is significantly faster to load the weights since you can directly attach a Kaggle dataset to the notebook rather than downloading the data every time. See the speed comparison notebook. Another benefit of loading models as a dataset is that it can be used in competitions that require internet access to be "off".
For more information on usage visit the mobilebert hugging face docs.
Usage
To use this dataset, attach it to your notebook and specify the path to the dataset. For example:
from transformers import AutoTokenizer, AutoModelForPreTraining

# Path to the attached Kaggle dataset containing the model files
MODEL_DIR = "/kaggle/input/huggingface-google-mobilebert/"

# Load the tokenizer and model weights directly from the local dataset files
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForPreTraining.from_pretrained(MODEL_DIR)
Acknowledgements
All the copyrights and IP relating to MobileBERT belong to the original authors (Sun et al.) and Google. All copyrights relating to the transformers library belong to Hugging Face. Please reach out directly to the authors if you have questions regarding licenses and usage.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
This replication package contains datasets and scripts related to the paper: "*How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study*"
- statistics.r: R script used to compute the correlation between usage and downloads, and the RQ1/RQ2 inter-rater agreements
- modelsInfo.zip: zip file containing all the downloaded model cards (in JSON format)
- script: directory containing all the scripts used to collect and process data. For further details, see the README file inside the script directory.
- Dataset/Dataset_HF-models-list.csv: list of HF models analyzed
- Dataset/Dataset_github-prj-list.txt: list of GitHub projects using the transformers library
- Dataset/Dataset_github-Prj_model-Used.csv: contains usage pairs: project, model
- Dataset/Dataset_prj-num-models-reused.csv: number of models used by each GitHub project
- Dataset/Dataset_model-download_num-prj_correlation.csv: contains, for each model used by GitHub projects: the name, the task, the number of reusing projects, and the number of downloads
- RQ1/RQ1_dataset-list.txt: list of HF datasets
- RQ1/RQ1_datasetSample.csv: sample set of models used for the manual analysis of datasets
- RQ1/RQ1_analyzeDatasetTags.py: Python script to analyze model tags for the presence of datasets. It requires unzipping modelsInfo.zip into a directory with the same name (modelsInfo) at the root of the replication package folder. Produces its output to stdout; redirect it to a file to be analyzed by the RQ2/countDataset.py script.
- RQ1/RQ1_countDataset.py: given the output of RQ2/analyzeDatasetTags.py (passed as argument), produces, for each model, a list of Booleans indicating whether (i) the model only declares HF datasets, (ii) the model only declares external datasets, (iii) the model declares both, and (iv) the model is part of the sample for the manual analysis
- RQ1/RQ1_datasetTags.csv: output of RQ2/analyzeDatasetTags.py
- RQ1/RQ1_dataset_usage_count.csv: output of RQ2/countDataset.py
- RQ2/tableBias.pdf: table detailing the number of occurrences of different types of bias by model task
- RQ2/RQ2_bias_classification_sheet.csv: results of the manual labeling
- RQ2/RQ2_isBiased.csv: file to compute the inter-rater agreement on whether or not a model documents bias
- RQ2/RQ2_biasAgrLabels.csv: file to compute the inter-rater agreement related to bias categories
- RQ2/RQ2_final_bias_categories_with_levels.csv: for each model in the sample, this file lists (i) the bias leaf category, (ii) the first-level category, and (iii) the intermediate category
- RQ3/RQ3_LicenseValidation.csv: manual validation of a sample of licenses
- RQ3/RQ3_{NETWORK-RESTRICTIVE|RESTRICTIVE|WEAK-RESTRICTIVE|PERMISSIVE}-license-list.txt: lists of licenses with different levels of permissiveness
- RQ3/RQ3_prjs_license.csv: for each project linked to models, indicates (among other fields) the license tag and name
- RQ3/RQ3_models_license.csv: for each model, indicates (among other pieces of info) whether the model has a license and, if so, what kind of license
- RQ3/RQ3_model-prj-license_contingency_table.csv: usage contingency table between projects' licenses (columns) and models' licenses (rows)
- RQ3/RQ3_models_prjs_licenses_with_type.csv: project-model pairs, with their respective licenses and permissiveness levels

The package also contains the scripts used to mine Hugging Face and GitHub. Details are in the enclosed README.
License: MIT License (https://opensource.org/licenses/MIT)
This comprehensive dataset contains detailed information about all the models, datasets, and spaces available on the Huggingface Hub. It is an essential resource for anyone looking to explore the extensive range of tools and datasets available for machine learning and AI research.
This dataset is ideal for researchers, developers, and AI enthusiasts who are looking for a one-stop repository of models, datasets, and spaces from the Huggingface Hub. It provides a holistic view and simplifies the task of finding the right tools for various machine learning and AI projects.
Note: This dataset is not officially affiliated with or endorsed by the Huggingface organization.
License: other (https://choosealicense.com/licenses/other/)
NLM produces a baseline set of MEDLINE/PubMed citation records in XML format for download on an annual basis. The annual baseline is released in December of each year. Each day, NLM produces update files that include new, revised and deleted citations. See our documentation page for more information.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
This file describes the dataset currently hosted on Hugging Face: https://huggingface.co/datasets/LISTTT/NeurIPS_2025_BMDB
Container dataset for demonstration of Hugging Face models on Redivis. Currently just contains a single BERT model, but may expand in the future.
License: MIT License (https://opensource.org/licenses/MIT)
statcast-pitches
pybaseball is a great tool for downloading baseball data. Even though the library is optimized and scrapes this data in parallel, it can be time-consuming. The point of this repository is to use GitHub Actions to scrape new baseball data weekly during the MLB season and update a parquet file hosted as a Hugging Face dataset. Reading this data as a Hugging Face dataset is much faster than scraping the new data each time you re-run your code, or just want updated… See the full description on the dataset page: https://huggingface.co/datasets/Jensen-holm/statcast-era-pitches.
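The download-once-then-reuse idea behind this dataset can be sketched with a small stdlib-only caching helper. The exact parquet file name and URL layout below are assumptions for illustration, not confirmed by the dataset page:

```python
import os
import urllib.request

# Hypothetical file name and resolve-URL layout; the real dataset lives at
# https://huggingface.co/datasets/Jensen-holm/statcast-era-pitches
URL = ("https://huggingface.co/datasets/Jensen-holm/statcast-era-pitches/"
       "resolve/main/statcast_era_pitches.parquet")
CACHE = "statcast_era_pitches.parquet"

def get_pitches_path(url: str = URL, cache: str = CACHE) -> str:
    """Download the parquet file once; later calls reuse the local copy."""
    if not os.path.exists(cache):
        urllib.request.urlretrieve(url, cache)  # network fetch on first call only
    return cache
```

Once the local path is available, the file can be read with any parquet reader; `datasets.load_dataset("Jensen-holm/statcast-era-pitches")` handles this caching automatically.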
This is the latest version of Hugging Face datasets to be used in offline notebooks on Kaggle. It is automatically updated every week.
!pip install datasets --no-index --find-links=file:///kaggle/input/hf-ds -U -q
fcakyon/label-files dataset hosted on Hugging Face and contributed by the HF Datasets community
Notebooks on the Hub
This dataset uses files from the repository https://huggingface.co/datasets/davanstrien/notebooks_on_the_hub_raw which records all the repositories hosted on the Hugging Face Hub that contain notebooks. Daniel's repository was updated daily from April of 2023 to June of 2024. I manually copied only one version per month: they are stored in the original folder with the name YYYY_MM.parquet, from 2023_05.parquet to 2024_05.parquet (13 files). Then, I recreated… See the full description on the dataset page: https://huggingface.co/datasets/severo/notebooks_on_the_hub.
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
Natural Reasoning is a large-scale dataset designed for general reasoning tasks. It consists of high-quality, challenging reasoning questions backtranslated from pretraining corpora DCLM and FineMath. The dataset has been carefully deduplicated and decontaminated from popular reasoning benchmarks including MATH, GPQA, MMLU-Pro, and MMLU-STEM.
A 1.1 million subset of the Natural Reasoning dataset is released to the research community to foster the development of strong large language model (LLM) reasoners.
File Format: natural_reasoning.parquet
Tags: CC-BY-NC-4.0 · Text Generation · Reasoning · English (en) · 1M < n < 10M · Hugging Face
You can load the dataset directly from Hugging Face as follows:
from datasets import load_dataset
ds = load_dataset("facebook/natural_reasoning")
The dataset was constructed from the pretraining corpora DCLM and FineMath. The questions have been filtered to remove contamination and duplication from widely-used reasoning benchmarks like MATH, GPQA, MMLU-Pro, and MMLU-STEM. For each question, the dataset provides a reference final answer extracted from the original document when available, and also includes a model-generated response from Llama3.3-70B-Instruct.
In the 1.1 million subset: - 18.29% of the questions do not have a reference answer. - 9.71% of the questions have a single-word answer. - 21.58% of the questions have a short answer. - 50.42% of the questions have a long-form reference answer.
Training on the Natural Reasoning dataset shows superior scaling effects compared to other datasets. When training the Llama3.1-8B-Instruct model, the dataset achieved better performance on average across three key benchmarks: MATH, GPQA, and MMLU-Pro.
[Figure: scaling curve]
If you use the Natural Reasoning dataset, please cite it with the following BibTeX entry:
@misc{yuan2025naturalreasoningreasoningwild28m,
title={NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions},
author={Weizhe Yuan and Jane Yu and Song Jiang and Karthik Padthe and Yang Li and Dong Wang and Ilia Kulikov and Kyunghyun Cho and Yuandong Tian and Jason E Weston and Xian Li},
year={2025},
eprint={2502.13124},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.13124}
}
Source: Hugging Face
License: GNU GPL v2 (http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html)
This dataset was created by amulil
Released under GPL 2
The data file is based on a copy of the Hugging Face data file buruzaemon/amazon_reviews_multi, which is itself a copy of the original defunct-datasets/amazon_reviews_multi data file. The dataset was published by the Open Data on AWS community.
In our modification, we removed unnecessary columns and thus anonymized the data file, and at the same time we added columns describing the lengths of the strings of the single columns, see Multilingual_Amazon_Reviews_Corpus_analysis. Next, the dataset was re-partitioned:
The original *.jsonl data format has been changed to the more modern *.parquet format (see Apache Arrow).
The data file was created for the purpose of testing the Hugging Face tutorial Summarization, because the older version of the dataset is not compatible with the new version of the datasets library.
This dataset is comprehensive; derived datasets for the tutorial can be found here:
"We provide an Amazon product reviews dataset for multilingual text classification. The dataset contains reviews in English, Japanese, German, French, Chinese and Spanish, collected between November 1, 2015 and November 1, 2019. Each record in the dataset contains the review text, the review title, the star rating, an anonymized reviewer ID, an anonymized product ID and the coarse-grained product category (e.g. ‘books’, ‘appliances’, etc.) The corpus is balanced across stars, so each star rating constitutes 20% of the reviews in each language.
For each language, there are 200,000, 5,000 and 5,000 reviews in the training, development and test sets respectively. The maximum number of reviews per reviewer is 20 and the maximum number of reviews per product is 20. All reviews are truncated after 2,000 characters, and all reviews are at least 20 characters long.
Note that the language of a review does not necessarily match the language of its marketplace (e.g. reviews from amazon.de are primarily written in German, but could also be written in English, etc.). For this reason, we applied a language detection algorithm based on the work in Bojanowski et al. (2017) to determine the language of the review text and we removed reviews that were not written in the expected language." source
Documentation of the authors of the original dataset: The Multilingual Amazon Reviews Corpus
The dataset contains reviews in English, Japanese, German, French, Chinese and Spanish.
- id: record id
- stars: an int between 1-5 indicating the number of stars
- review_body: the text body of the review
- review_title: the text title of the review
- language: the string identifier of the review language
- product_category: string representation of the product's category
- lenght_review_body: text length of review_body
- lenght_review_title: text length of review_title
- lenght_product_category: text length of product_category

This dataset is part of an effort to encourage text classification research in languages other than English. Such work increases the accessibility of natural language technology to more regions and cultures. Unfortunately, each of the languages included here is relatively high resource and well studied. The dataset is used for training in NLP, summarization tasks, text generation, and masked text filling. source
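The added length columns can be derived directly from the text fields; a minimal sketch with a hypothetical record (the lenght_* spellings match the dataset's actual column names):

```python
# Hypothetical review record with the original text fields
record = {
    "review_body": "Great product, works as described.",
    "review_title": "Great product",
    "product_category": "electronics",
}

# Derive the added columns (the dataset spells these "lenght_*")
for field in ("review_body", "review_title", "product_category"):
    record[f"lenght_{field}"] = len(record[field])
```

With datasets or pandas, the same derivation would typically be applied column-wise via map/apply over the whole table.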
The dataset contains only reviews from verified purchases (as described in the paper, section 2.1), and the reviews should conform the Amazon Community Guidelines. source
Amazon has licensed this dataset under its own agreement for non-commercial research usage only. This licenc...
Dataset Card for my-distiset-rag-files
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/sdiazlor/my-distiset-rag-files/raw/main/pipeline.yaml"
or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/sdiazlor/rag-human-rights-from-files.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
The AstroChat dataset is a collection of 901 dialogues, synthetically generated, tailored to the specific domain of Astronautics / Space Mission Engineering. This dataset will be frequently updated following feedback from the community. If you would like to contribute, please reach out in the community discussion.
The dataset is intended to be used for supervised fine-tuning of chat LLMs (Large Language Models). Due to its currently limited size, you should use a pre-trained instruct model and ideally augment the AstroChat dataset with other datasets in the area of (Science Technology, Engineering and Math).
To be completed
from datasets import load_dataset
dataset = load_dataset("patrickfleith/AstroChat")

901 generated conversations between a simulated user and an AI assistant (more on the generation method below). Each instance is made of the following fields (columns):
- id: a unique identifier to refer to this specific conversation. Useful for traceability purposes, especially for further processing tasks or merges with other datasets.
- topic: a topic within the domain of Astronautics / Space Mission Engineering. This field is useful to filter the dataset by topic, or to create a topic-based split.
- subtopic: a subtopic of the topic. For instance in the topic of Propulsion, there are subtopics like Injector Design, Combustion Instability, Electric Propulsion, Chemical Propulsion, etc.
- persona: description of the persona used to simulate a user
- opening_question: the first question asked by the user to start a conversation with the AI-assistant
- messages: the whole conversation between the user and the AI assistant, already nicely formatted for rapid use with the transformers library. A list of messages where each message is a dictionary with the following fields:
- role: the role of the speaker, either user or assistant
- content: the message content. For the assistant, it is the answer to the user's question. For the user, it is the question asked to the assistant.
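The messages field described above follows the standard role/content chat format; a minimal sketch with invented content (not an actual record from the dataset):

```python
# Hypothetical conversation in the messages format described above
messages = [
    {"role": "user",
     "content": "How does a Hall-effect thruster produce thrust?"},
    {"role": "assistant",
     "content": "It ionizes a propellant such as xenon and accelerates "
                "the ions electrostatically."},
]

# Roles alternate user/assistant, the shape expected by transformers chat templates
roles = [m["role"] for m in messages]
```

A list in this shape can be passed directly to a tokenizer's apply_chat_template method when fine-tuning chat models.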
Important See the full list of topics and subtopics covered below.
Dataset is version controlled and commits history is available here: https://huggingface.co/datasets/patrickfleith/AstroChat/commits/main
We used a method inspired by the UltraChat dataset. Specifically, we implemented our own version of the Human-Model interaction from "Sector I: Questions about the World" of their paper:
Ding, N., Chen, Y., Xu, B., Qin, Y., Zheng, Z., Hu, S., ... & Zhou, B. (2023). Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233.
The gpt-4-turbo model was used to generate the answers to the opening questions. All instances in the dataset are in English.
901 synthetically generated dialogues
AstroChat © 2024 by Patrick Fleith is licensed under Creative Commons Attribution 4.0 International
No restriction. Please provide the correct attribution following the license terms.
Patrick Fleith. (2024). AstroChat - A Dataset of synthetically generated conversations for LLM supervised fine-tuning in the domain of Space Mission Engineering and Astronautics (Version 1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.11531579
Will be updated based on feedback. I am also looking for contributors. Help me create more datasets for Space Engineering LLMs :)
Use the ...