100+ datasets found
  1. h

    fineweb-edu

    • huggingface.co
    Updated Jan 3, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FineData (2025). fineweb-edu [Dataset]. http://doi.org/10.57967/hf/2497
    Explore at:
    Dataset updated
    Jan 3, 2025
    Dataset authored and provided by
    FineData
    License

    https://choosealicense.com/licenses/odc-by/https://choosealicense.com/licenses/odc-by/

    Description

    📚 FineWeb-Edu

    1.3 trillion tokens of the finest educational data the 🌐 web has to offer

    Paper: https://arxiv.org/abs/2406.17557

      What is it?
    

    📚 FineWeb-Edu dataset consists of 1.3T tokens and 5.4T tokens (FineWeb-Edu-score-2) of educational web pages filtered from 🍷 FineWeb dataset. This is the 1.3 trillion version. To enhance FineWeb's quality, we developed an educational quality classifier using annotations generated by LLama3-70B-Instruct. We then… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu.

  2. finevideo

    • huggingface.co
    Updated Sep 12, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Hugging Face FineVideo (2024). finevideo [Dataset]. https://huggingface.co/datasets/HuggingFaceFV/finevideo
    Explore at:
    Dataset updated
    Sep 12, 2024
    Dataset provided by
    Hugging Facehttps://huggingface.co/
    Authors
    Hugging Face FineVideo
    License

    https://choosealicense.com/licenses/cc/https://choosealicense.com/licenses/cc/

    Description

    FineVideo

    FineVideo Description Dataset Explorer Revisions Dataset Distribution

    How to download and use FineVideo Using datasets Using huggingface_hub Load a subset of the dataset

    Dataset StructureData Instances Data Fields

    Dataset Creation License CC-By Considerations for Using the Data Social Impact of Dataset Discussion of Biases

    Additional Information Credits Future Work Opting out of FineVideo Citation Information

    Terms of use for FineVideo… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFV/finevideo.

  3. Data from: huggingface

    • kaggle.com
    zip
    Updated Mar 22, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    amulil (2022). huggingface [Dataset]. https://www.kaggle.com/datasets/amulil/amulil-huggingface
    Explore at:
    zip(5498282999 bytes)Available download formats
    Dataset updated
    Mar 22, 2022
    Authors
    amulil
    License

    http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.htmlhttp://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    Dataset

    This dataset was created by amulil

    Released under GPL 2

    Contents

  4. h

    D4RL

    • huggingface.co
    Updated Aug 28, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    One (2023). D4RL [Dataset]. https://huggingface.co/datasets/imone/D4RL
    Explore at:
    Dataset updated
    Aug 28, 2023
    Authors
    One
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    D4RL Dataset on HuggingFace

    This repository hosts the pre-downloaded D4RL dataset on HuggingFace. It is designed to provide accelerated data downloading for users, eliminating the need to download the dataset from scratch.

      Installation
    

    To use this dataset, you need to clone it into your local .d4rl directory. Here are the steps to do so:

    Navigate to your .d4rl directory:

    cd ~/.d4rl

    Clone the dataset repository from HuggingFace:

    git clone… See the full description on the dataset page: https://huggingface.co/datasets/imone/D4RL.

  5. Huggingface RoBERTa

    • kaggle.com
    zip
    Updated Aug 4, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Darius Singh (2023). Huggingface RoBERTa [Dataset]. https://www.kaggle.com/datasets/dariussingh/huggingface-roberta
    Explore at:
    zip(34531447596 bytes)Available download formats
    Dataset updated
    Aug 4, 2023
    Authors
    Darius Singh
    Description

    This dataset contains different variants of the RoBERTa and XLM-RoBERTa model by Meta AI available on Hugging Face's model repository.

    By making it a dataset, it is significantly faster to load the weights since you can directly attach a Kaggle dataset to the notebook rather than downloading the data every time. See the speed comparison notebook. Another benefit of loading models as a dataset is that it can be used in competitions that require internet access to be "off".

    For more information on usage visit the roberta hugging face docs and the xlm-roberta hugging face docs.

    Usage

    To use this dataset, attach it to your notebook and specify the path to the dataset. For example:

    from transformers import AutoTokenizer, AutoModelForPreTraining
    ​
    MODEL_DIR = "/kaggle/input/huggingface-roberta/"
    ​
    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR + "roberta-base")
    model = AutoModelForPreTraining.from_pretrained(MODEL_DIR + "roberta-base")
    

    Acknowledgements All the copyrights and IP relating to RoBERTa and XLM-RoBERTa belong to the original authors (Liu et al. and Conneau et al.) and Meta AI. All copyrights relating to the transformers library belong to Hugging Face. Please reach out directly to the authors if you have questions regarding licenses and usage.

  6. h

    fineweb

    • huggingface.co
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FineData, fineweb [Dataset]. http://doi.org/10.57967/hf/2493
    Explore at:
    Dataset authored and provided by
    FineData
    License

    https://choosealicense.com/licenses/odc-by/https://choosealicense.com/licenses/odc-by/

    Description

    🍷 FineWeb

    15 trillion tokens of the finest data the 🌐 web has to offer

      What is it?
    

    The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated english web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.

  7. Hugging Face Models Metadata

    • kaggle.com
    zip
    Updated Nov 30, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Kumar Saksham (2023). Hugging Face Models Metadata [Dataset]. https://www.kaggle.com/datasets/everydaycodings/hugging-face-models-metadata/code
    Explore at:
    zip(8182909 bytes)Available download formats
    Dataset updated
    Nov 30, 2023
    Authors
    Kumar Saksham
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Description:

    This dataset provides metadata for various models available on Hugging Face, a popular platform for sharing and discovering natural language processing (NLP) and machine learning models. The dataset includes information such as model name, author, repository link, image URL, category, star ratings, download statistics, and the last update timestamp.

    Columns: 1. Name: Model name on Hugging Face. 2. Author: Author or organization associated with the model. 3. Repo Link: Link to the model's repository on Hugging Face. 4. Image URL: URL for the model's image/icon. 5. Category: The category or type of model (e.g., Text Generation, Automatic Speech Recognition). 6. Stars: Number of stars the model has received. 7. Downloads: Number of downloads for the model. 8. Last Updated: Timestamp indicating the last update of the model.

    This dataset is valuable for researchers, data scientists, and enthusiasts interested in exploring and analyzing the landscape of pre-trained models on Hugging Face.

  8. h

    coco

    • huggingface.co
    Updated Mar 3, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Detection datasets (2023). coco [Dataset]. https://huggingface.co/datasets/detection-datasets/coco
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 3, 2023
    Dataset authored and provided by
    Detection datasets
    Description

    detection-datasets/coco dataset hosted on Hugging Face and contributed by the HF Datasets community

  9. openai_humaneval

    • huggingface.co
    Updated Jul 7, 2021
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    OpenAI (2021). openai_humaneval [Dataset]. https://huggingface.co/datasets/openai/openai_humaneval
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 7, 2021
    Dataset authored and provided by
    OpenAIhttp://openai.com/
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for OpenAI HumanEval

      Dataset Summary
    

    The HumanEval dataset released by OpenAI includes 164 programming problems with a function sig- nature, docstring, body, and several unit tests. They were handwritten to ensure not to be included in the training set of code generation models.

      Supported Tasks and Leaderboards
    
    
    
    
    
      Languages
    

    The programming problems are written in Python and contain English natural text in comments and docstrings.… See the full description on the dataset page: https://huggingface.co/datasets/openai/openai_humaneval.

  10. gsm8k

    • huggingface.co
    Updated Oct 27, 2021
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    OpenAI (2021). gsm8k [Dataset]. https://huggingface.co/datasets/openai/gsm8k
    Explore at:
    Dataset updated
    Oct 27, 2021
    Dataset authored and provided by
    OpenAIhttp://openai.com/
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for GSM8K

      Dataset Summary
    

    GSM8K (Grade School Math 8K) is a dataset of 8.5K high quality linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.

    These problems take between 2 and 8 steps to solve. Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − ×÷) to reach the… See the full description on the dataset page: https://huggingface.co/datasets/openai/gsm8k.

  11. databricks-dolly-15k

    • huggingface.co
    • kaggle.com
    Updated Mar 4, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Databricks (2022). databricks-dolly-15k [Dataset]. https://huggingface.co/datasets/databricks/databricks-dolly-15k
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 4, 2022
    Dataset authored and provided by
    Databrickshttp://databricks.com/
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0)https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Summary

    databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. This dataset can be used for any purpose, whether academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported… See the full description on the dataset page: https://huggingface.co/datasets/databricks/databricks-dolly-15k.

  12. h

    glue

    • huggingface.co
    • tensorflow.google.cn
    • +1more
    Updated Apr 20, 2018
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    NYU Machine Learning for Language (2018). glue [Dataset]. https://huggingface.co/datasets/nyu-mll/glue
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 20, 2018
    Dataset authored and provided by
    NYU Machine Learning for Language
    License

    https://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/

    Description

    Dataset Card for GLUE

      Dataset Summary
    

    GLUE, the General Language Understanding Evaluation benchmark (https://gluebenchmark.com/) is a collection of resources for training, evaluating, and analyzing natural language understanding systems.

      Supported Tasks and Leaderboards
    

    The leaderboard for the GLUE benchmark can be found at this address. It comprises the following tasks:

      ax
    

    A manually-curated evaluation dataset for fine-grained analysis of system… See the full description on the dataset page: https://huggingface.co/datasets/nyu-mll/glue.

  13. h

    minds14

    • huggingface.co
    Updated Apr 17, 2021
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    PolyAI (2021). minds14 [Dataset]. https://huggingface.co/datasets/PolyAI/minds14
    Explore at:
    Dataset updated
    Apr 17, 2021
    Dataset authored and provided by
    PolyAI
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    MInDS-14

    MINDS-14 is training and evaluation resource for intent detection task with spoken data. It covers 14 intents extracted from a commercial system in the e-banking domain, associated with spoken examples in 14 diverse language varieties.

      Example
    

    MInDS-14 can be downloaded and used as follows: from datasets import load_dataset

    minds_14 = load_dataset("PolyAI/minds14", "fr-FR") # for French

    to download all data for multi-lingual fine-tuning uncomment following… See the full description on the dataset page: https://huggingface.co/datasets/PolyAI/minds14.

  14. h

    audioset-full

    • huggingface.co
    Updated Jul 6, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    AudioConFit (2024). audioset-full [Dataset]. https://huggingface.co/datasets/confit/audioset-full
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 6, 2024
    Dataset authored and provided by
    AudioConFit
    Description

    confit/audioset-full dataset hosted on Hugging Face and contributed by the HF Datasets community

  15. h

    commonsense_qa

    • huggingface.co
    • opendatalab.com
    Updated Nov 2, 2018
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Tel Aviv University (2018). commonsense_qa [Dataset]. https://huggingface.co/datasets/tau/commonsense_qa
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 2, 2018
    Dataset authored and provided by
    Tel Aviv University
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for "commonsense_qa"

      Dataset Summary
    

    CommonsenseQA is a new multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers . It contains 12,102 questions with one correct answer and four distractor answers. The dataset is provided in two major training/validation/testing set splits: "Random split" which is the main evaluation split, and "Question token split", see paper for details.… See the full description on the dataset page: https://huggingface.co/datasets/tau/commonsense_qa.

  16. h

    oasst1

    • huggingface.co
    Updated Apr 14, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    OpenAssistant (2023). oasst1 [Dataset]. https://huggingface.co/datasets/OpenAssistant/oasst1
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 14, 2023
    Dataset authored and provided by
    OpenAssistant
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    OpenAssistant Conversations Dataset (OASST1)

      Dataset Summary
    

    In an effort to democratize research on large-scale alignment, we release OpenAssistant Conversations (OASST1), a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees. The corpus is a product of a worldwide crowd-sourcing effort… See the full description on the dataset page: https://huggingface.co/datasets/OpenAssistant/oasst1.

  17. h

    webui-all

    • huggingface.co
    Updated Nov 1, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Big Lab (2024). webui-all [Dataset]. https://huggingface.co/datasets/biglab/webui-all
    Explore at:
    Dataset updated
    Nov 1, 2024
    Dataset authored and provided by
    Big Lab
    License

    https://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/

    Description

    This data accompanies the WebUI project (https://dl.acm.org/doi/abs/10.1145/3544548.3581158) For more information, check out the project website: https://uimodeling.github.io/ To download this dataset, you need to install the huggingface-hub package pip install huggingface-hub

    Use snapshot_download from huggingface_hub import snapshot_download snapshot_download(repo_id="biglab/webui-all", repo_type="dataset")

    IMPORTANT

    Before downloading and using, please review the copyright info here:… See the full description on the dataset page: https://huggingface.co/datasets/biglab/webui-all.

  18. h

    the-reddit-dataset-dataset

    • huggingface.co
    Updated Jun 25, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    SocialGrep (2022). the-reddit-dataset-dataset [Dataset]. https://huggingface.co/datasets/SocialGrep/the-reddit-dataset-dataset
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 25, 2022
    Authors
    SocialGrep
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A meta dataset of Reddit's own /r/datasets community.

  19. h

    text-to-image-2M

    • huggingface.co
    Updated Sep 14, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    kzou (2024). text-to-image-2M [Dataset]. http://doi.org/10.57967/hf/3066
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 14, 2024
    Authors
    kzou
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    text-to-image-2M: A High-Quality, Diverse Text-to-Image Training Dataset

      Overview
    

    text-to-image-2M is a curated text-image pair dataset designed for fine-tuning text-to-image models. The dataset consists of approximately 2 million samples, carefully selected and enhanced to meet the high demands of text-to-image model training. The motivation behind creating this dataset stems from the observation that datasets with over 1 million samples tend to produce better… See the full description on the dataset page: https://huggingface.co/datasets/jackyhate/text-to-image-2M.

  20. GroundCUA

    • huggingface.co
    Updated Nov 10, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    ServiceNow (2025). GroundCUA [Dataset]. https://huggingface.co/datasets/ServiceNow/GroundCUA
    Explore at:
    Dataset updated
    Nov 10, 2025
    Dataset authored and provided by
    ServiceNowhttp://servicenow.com/
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    GroundCUA: Grounding Computer Use Agents on Human Demonstrations

    🌐 Website | 📑 Paper | 🤗 Dataset | 🤖 Models

      GroundCUA Dataset
    

    GroundCUA is a large and diverse dataset of real UI screenshots paired with structured annotations for building multimodal computer use agents. It covers 87 software platforms across productivity tools, browsers, creative tools, communication apps, development environments, and system utilities. GroundCUA is designed for research on GUI… See the full description on the dataset page: https://huggingface.co/datasets/ServiceNow/GroundCUA.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
FineData (2025). fineweb-edu [Dataset]. http://doi.org/10.57967/hf/2497

fineweb-edu

FineWeb-Edu

HuggingFaceFW/fineweb-edu

Explore at:
96 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Jan 3, 2025
Dataset authored and provided by
FineData
License

https://choosealicense.com/licenses/odc-by/https://choosealicense.com/licenses/odc-by/

Description

📚 FineWeb-Edu

1.3 trillion tokens of the finest educational data the 🌐 web has to offer

Paper: https://arxiv.org/abs/2406.17557

  What is it?

📚 FineWeb-Edu dataset consists of 1.3T tokens and 5.4T tokens (FineWeb-Edu-score-2) of educational web pages filtered from 🍷 FineWeb dataset. This is the 1.3 trillion version. To enhance FineWeb's quality, we developed an educational quality classifier using annotations generated by LLama3-70B-Instruct. We then… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu.

Search
Clear search
Close search
Google apps
Main menu