23 datasets found
  1. falcon-refinedweb

    • huggingface.co
    • opendatalab.com
    Technology Innovation Institute, falcon-refinedweb [Dataset]. http://doi.org/10.57967/hf/0737
    Dataset authored and provided by
    Technology Innovation Institute
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    📀 Falcon RefinedWeb

    Falcon RefinedWeb is a massive English web dataset built by TII and released under an ODC-By 1.0 license. See the 📓 paper on arXiv for more details. RefinedWeb is built through stringent filtering and large-scale deduplication of CommonCrawl; we found models trained on RefinedWeb to achieve performance in line with or better than models trained on curated datasets, while only relying on web data. RefinedWeb is also "multimodal-friendly": it contains links and alt… See the full description on the dataset page: https://huggingface.co/datasets/tiiuae/falcon-refinedweb.

  2. refinedweb-generated-questions

    • huggingface.co
    Pinecone, refinedweb-generated-questions [Dataset]. https://huggingface.co/datasets/pinecone/refinedweb-generated-questions
    Dataset authored and provided by
    Pinecone
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Generated Questions and Answers from the Falcon RefinedWeb Dataset

    This dataset contains 1k open-domain questions and answers generated with GPT-4 from documents in Falcon's refinedweb dataset. You can find more details about this work in the following blogpost. Each row consists of:

    document_id - an id of a text chunk from the refined web dataset, from which the question was generated. Each id contains the original document index from the refinedweb dataset, and the chunk index… See the full description on the dataset page: https://huggingface.co/datasets/pinecone/refinedweb-generated-questions.

  3. falcon-refinedweb-100k_en-xlong

    • huggingface.co
    Updated Apr 24, 2024
    + more versions
    BEEspoke Data (2024). falcon-refinedweb-100k_en-xlong [Dataset]. https://huggingface.co/datasets/BEE-spoke-data/falcon-refinedweb-100k_en-xlong
    Dataset updated
    Apr 24, 2024
    Dataset authored and provided by
    BEEspoke Data
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    BEE-spoke-data/falcon-refinedweb-100k_en-xlong

    A sample from falcon-refinedweb:

    • more than 4096 & less than 34,000 gpt4 tiktoken tokens
    • en only (via fasttext-langdetect)
    • 100k samples
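
The selection criteria above amount to a length-and-language filter. The following is only a rough sketch of that idea, not the dataset author's actual pipeline: `count_tokens` and `is_english` are hypothetical callables standing in for a GPT-4 tiktoken tokenizer and fasttext-langdetect, and the whitespace counter used in the usage note is a placeholder, not a real tokenizer.

```python
from typing import Callable, Iterable, Iterator

MIN_TOKENS = 4096    # exclusive lower bound quoted in the description
MAX_TOKENS = 34_000  # exclusive upper bound quoted in the description


def filter_long_english(
    texts: Iterable[str],
    count_tokens: Callable[[str], int],  # hypothetical: wraps a real tokenizer
    is_english: Callable[[str], bool],   # hypothetical: wraps a language detector
) -> Iterator[str]:
    """Yield texts that are English and fall strictly inside the token window."""
    for text in texts:
        if not is_english(text):
            continue
        if MIN_TOKENS < count_tokens(text) < MAX_TOKENS:
            yield text


def whitespace_token_count(text: str) -> int:
    """Placeholder counter; a real run would count tokenizer tokens instead."""
    return len(text.split())
```

For example, `list(filter_long_english(docs, whitespace_token_count, lambda t: True))` keeps only the documents whose placeholder token count lies between the two bounds.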

  4. refinedweb-22mil-128clusters

    • huggingface.co
    + more versions
    maxine, refinedweb-22mil-128clusters [Dataset]. https://huggingface.co/datasets/crumb/refinedweb-22mil-128clusters
    Authors
    maxine
    Description

    crumb/refinedweb-22mil-128clusters dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. falcon-refined-web-10M1

    • huggingface.co
    + more versions
    Tonyshah, falcon-refined-web-10M1 [Dataset]. https://huggingface.co/datasets/Tony068/falcon-refined-web-10M1
    Authors
    Tonyshah
    Description

    Tony068/falcon-refined-web-10M1 dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. refinedWeb-subset

    • huggingface.co
    Asif Rezai, refinedWeb-subset [Dataset]. https://huggingface.co/datasets/AlexMRTY/refinedWeb-subset
    Authors
    Asif Rezai
    Description

    AlexMRTY/refinedWeb-subset dataset hosted on Hugging Face and contributed by the HF Datasets community

  7. refinedweb-embed-english-v3.0

    • huggingface.co
    Updated Jan 10, 2011
    InsightHub (2011). refinedweb-embed-english-v3.0 [Dataset]. https://huggingface.co/datasets/InsightHub/refinedweb-embed-english-v3.0
    Dataset updated
    Jan 10, 2011
    Dataset authored and provided by
    InsightHub
    Description

    InsightHub/refinedweb-embed-english-v3.0 dataset hosted on Hugging Face and contributed by the HF Datasets community

  8. flan-t5-large-embed-refinedweb

    • huggingface.co
    Updated Jun 6, 2023
    + more versions
    maxine (2023). flan-t5-large-embed-refinedweb [Dataset]. https://huggingface.co/datasets/crumb/flan-t5-large-embed-refinedweb
    Dataset updated
    Jun 6, 2023
    Authors
    maxine
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    All of the data together is around 81.3GB. It's the last hidden states of 131,072 samples from refinedweb padded/truncated to 512 tokens on the left, fed through google/flan-t5-base.

    Structure:

    {
      "encoding": List, shaped (512, 1024) aka (tokens, d_model),
      "text": String, the original text that was encoded,
      "attention_mask": List, binary mask to pass to your model with encoding to not attend to pad tokens
    }
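
To make the "padded/truncated to 512 tokens on the left" step and the accompanying binary mask concrete, here is a minimal sketch. It assumes that both padding and truncation happen on the left side and that the pad id is 0; neither detail is confirmed by the description, and `pad_or_truncate_left` is a hypothetical helper rather than part of any library.

```python
from typing import List, Tuple


def pad_or_truncate_left(
    token_ids: List[int],
    length: int = 512,  # sequence length quoted in the description
    pad_id: int = 0,    # assumed pad token id
) -> Tuple[List[int], List[int]]:
    """Return (ids, attention_mask), both exactly `length` items long.

    Mask entries are 1 for real tokens and 0 for padding, so a model can
    skip the padded positions.
    """
    if len(token_ids) >= length:
        # Too long: keep the rightmost tokens, dropping from the left.
        return token_ids[-length:], [1] * length
    n_pad = length - len(token_ids)
    # Too short: put the padding in front of the real tokens.
    return [pad_id] * n_pad + token_ids, [0] * n_pad + [1] * len(token_ids)
```

With `length=5`, the input `[5, 6, 7]` becomes `[0, 0, 5, 6, 7]` with mask `[0, 0, 1, 1, 1]`.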

  9. falcon-refined-web-5M-part2

    • huggingface.co
    + more versions
    Tonyshah, falcon-refined-web-5M-part2 [Dataset]. https://huggingface.co/datasets/Tony068/falcon-refined-web-5M-part2
    Authors
    Tonyshah
    Description

    Tony068/falcon-refined-web-5M-part2 dataset hosted on Hugging Face and contributed by the HF Datasets community

  10. fineweb

    • huggingface.co
    FineData, fineweb [Dataset]. http://doi.org/10.57967/hf/2493
    Dataset authored and provided by
    FineData
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    ๐Ÿท FineWeb

    15 trillion tokens of the finest data the ๐ŸŒ web has to offer

      What is it?
    

    The ๐Ÿท FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated english web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the ๐Ÿญ datatrove library, our large scale data processing library. ๐Ÿท FineWeb was originally meant to be a fully open replication of ๐Ÿฆ… RefinedWeb, with a releaseโ€ฆ See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.

  11. refined-web-50k-random

    • huggingface.co
    Asif Rezai, refined-web-50k-random [Dataset]. https://huggingface.co/datasets/AlexMRTY/refined-web-50k-random
    Authors
    Asif Rezai
    Description

    AlexMRTY/refined-web-50k-random dataset hosted on Hugging Face and contributed by the HF Datasets community

  12. bert-base-uncased-refined-web-segment0

    • huggingface.co
    Updated Aug 22, 2023
    Jack Min Ong (2023). bert-base-uncased-refined-web-segment0 [Dataset]. https://huggingface.co/datasets/Jackmin108/bert-base-uncased-refined-web-segment0
    Dataset updated
    Aug 22, 2023
    Authors
    Jack Min Ong
    Description

    Dataset Card for "bert-base-uncased-refined-web-segment0"

    More Information needed

  13. NEWS-FACTOR

    • huggingface.co
    + more versions
    Matin Ansaripour, NEWS-FACTOR [Dataset]. https://huggingface.co/datasets/mansaripo/NEWS-FACTOR
    Authors
    Matin Ansaripour
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    This repo contains data from AI21 Labs' paper Generating Benchmarks for Factuality Evaluation of Language Models. NEWS-FACTOR: Based on Reuters articles extracted from The RefinedWeb Dataset. The dataset consists of 1036 examples. The benchmark is derived from The RefinedWeb Dataset. The public extract is made available under an ODC-By 1.0 license; users should also abide by the CommonCrawl ToU: https://commoncrawl.org/terms-of-use/. Cite: @article{muhlgay2023generating, title={Generating… See the full description on the dataset page: https://huggingface.co/datasets/mansaripo/NEWS-FACTOR.

  14. falcon-refinedweb_urls

    • huggingface.co
    Updated May 15, 2025
    Nick Hagar (2025). falcon-refinedweb_urls [Dataset]. http://doi.org/10.57967/hf/5457
    Dataset updated
    May 15, 2025
    Authors
    Nick Hagar
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    Dataset Card for falcon-refinedweb_urls

    This dataset provides the URLs and top-level domains associated with training records in tiiuae/falcon-refinedweb. It is part of a collection of datasets curated to make exploring LLM training datasets more straightforward and accessible.

      Dataset Details

      Dataset Description

    This dataset was created by downloading the source data, extracting URLs and top-level domains, and retaining only those record identifiers.… See the full description on the dataset page: https://huggingface.co/datasets/nhagar/falcon-refinedweb_urls.
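
As an illustration of what extracting a top-level domain from a URL can look like, the sketch below uses only Python's standard `urllib.parse`. It is not the dataset author's code: `top_level_domain` is a hypothetical helper, and taking the last dot-separated label of the host is a simplification (multi-label suffixes such as `co.uk` would need a public-suffix list).

```python
from urllib.parse import urlsplit


def top_level_domain(url: str) -> str:
    """Return the last dot-separated label of the URL's host, or "" if none.

    Simplifying assumption: the TLD is the final label of the hostname.
    """
    host = urlsplit(url).hostname or ""
    if "." not in host:
        return ""
    return host.rsplit(".", 1)[-1]


def url_record(url: str) -> dict:
    """Build a small record holding the URL and its (naively derived) TLD."""
    return {"url": url, "tld": top_level_domain(url)}
```

Calling `url_record("https://www.example.org/a?b=1")` yields a dictionary with the original URL and the TLD `org`.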

  15. Web_DomURLs

    • huggingface.co
    Updated Sep 17, 2024
    Abdelkader El Mahdaouy (2024). Web_DomURLs [Dataset]. https://huggingface.co/datasets/amahdaouy/Web_DomURLs
    Dataset updated
    Sep 17, 2024
    Authors
    Abdelkader El Mahdaouy
    Description

    Datasets Overview

    The dataset URLs and Domain Names are collected from the following sources:

      mC4

    Description: The Multilingual Colossal Common Crawl Corpus (mC4) is a cleaned version of the Common Crawl's web corpus, curated by the Allen Institute for Artificial Intelligence. It contains approximately 170 million URLs.
    Source: mC4 Dataset on Hugging Face

      falcon-refinedweb

    Description: An English large-scale dataset curated for large language model… See the full description on the dataset page: https://huggingface.co/datasets/amahdaouy/Web_DomURLs.

  16. mini-en

    • huggingface.co
    Updated Mar 28, 2025
    Nam Pham (2025). mini-en [Dataset]. https://huggingface.co/datasets/nampdn-ai/mini-en
    Dataset updated
    Mar 28, 2025
    Authors
    Nam Pham
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Tiny English

    A collection of short texts that have been curated for long-term human value. The texts in this dataset have been filtered from the falcon-refinedweb and minipile datasets to ensure better quality and a tiny size. The tiny-en dataset is concise and small in size, yet highly diverse, making it an excellent resource for training natural language processing models. Despite its compact size, the dataset offers a wide range of content that has been carefully selected for… See the full description on the dataset page: https://huggingface.co/datasets/nampdn-ai/mini-en.

  17. FormatAnnotations-Llama-3.1-8B

    • huggingface.co
    + more versions
    WebOrganizer, FormatAnnotations-Llama-3.1-8B [Dataset]. https://huggingface.co/datasets/WebOrganizer/FormatAnnotations-Llama-3.1-8B
    Dataset authored and provided by
    WebOrganizer
    Description

    WebOrganizer/FormatAnnotations-Llama-3.1-8B

    [Paper] [Website] [GitHub] This dataset contains 1M web pages annotated with format/type labels by the Llama-3.1-8B model. The web pages are a sample of the DCLM RefinedWeb reproduction. It is used as first-stage training data for the WebOrganizer/FormatClassifier.

      Dataset Structure

    Each example contains the following fields:

    • text: The text content of the web page
    • url: The URL of the web page
    • top_choice_index: Index of the…

    See the full description on the dataset page: https://huggingface.co/datasets/WebOrganizer/FormatAnnotations-Llama-3.1-8B.

  18. homeo-dataset

    • huggingface.co
    Updated Oct 15, 2021
    Akhil Singh (2021). homeo-dataset [Dataset]. https://huggingface.co/datasets/akhilhsingh/homeo-dataset
    Dataset updated
    Oct 15, 2021
    Authors
    Akhil Singh
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    ๐Ÿท FineWeb

    15 trillion tokens of the finest data the ๐ŸŒ web has to offer

      What is it?
    

    The ๐Ÿท FineWeb dataset consists of more than 15T tokens of cleaned and deduplicated english web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the ๐Ÿญ datatrove library, our large scale data processing library. ๐Ÿท FineWeb was originally meant to be a fully open replication of ๐Ÿฆ… RefinedWeb, with a release of the fullโ€ฆ See the full description on the dataset page: https://huggingface.co/datasets/akhilhsingh/homeo-dataset.

  19. TopicAnnotations-Llama-3.1-405B-FP8

    • huggingface.co
    + more versions
    WebOrganizer, TopicAnnotations-Llama-3.1-405B-FP8 [Dataset]. https://huggingface.co/datasets/WebOrganizer/TopicAnnotations-Llama-3.1-405B-FP8
    Dataset authored and provided by
    WebOrganizer
    Description

    WebOrganizer/TopicAnnotations-Llama-3.1-405B-FP8

    [Paper] [Website] [GitHub] This dataset contains 100K web pages annotated with topic labels by the Llama-3.1-405B-FP8 model. The web pages are a sample of the DCLM RefinedWeb reproduction. It is used as second-stage training data for the WebOrganizer/TopicClassifier.

      Dataset Structure

    Each example contains the following fields:

    • text: The text content of the web page
    • url: The URL of the web page
    • top_choice_index: Index…

    See the full description on the dataset page: https://huggingface.co/datasets/WebOrganizer/TopicAnnotations-Llama-3.1-405B-FP8.

  20. TM-DATA

    • huggingface.co
    Sebastian Gabarain, TM-DATA [Dataset]. https://huggingface.co/datasets/Locutusque/TM-DATA
    Authors
    Sebastian Gabarain
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset used to train TinyMistral-248m-v2. It contains around 8 million examples from the following sources:

    • 4 million Wikipedia pages
    • 1 million arXiv papers
    • 1.5 million web pages sourced from RefinedWeb and SlimPajama
    • 200,000 college textbooks
    • 1 million Stack Exchange forum posts

    This dataset may contain NSFW examples; use at your own risk.

Technology Innovation Institute, falcon-refinedweb [Dataset]. http://doi.org/10.57967/hf/0737

falcon-refinedweb

Falcon RefinedWeb

tiiuae/falcon-refinedweb

53 scholarly articles cite this dataset (View in Google Scholar)
