License: ODC-By 1.0 (https://choosealicense.com/licenses/odc-by/)
📀 Falcon RefinedWeb
Falcon RefinedWeb is a massive English web dataset built by TII and released under an ODC-By 1.0 license. See the 📓 paper on arXiv for more details. RefinedWeb is built through stringent filtering and large-scale deduplication of CommonCrawl; we found models trained on RefinedWeb to achieve performance in line with or better than models trained on curated datasets, while relying only on web data. RefinedWeb is also "multimodal-friendly": it contains links and alt… See the full description on the dataset page: https://huggingface.co/datasets/tiiuae/falcon-refinedweb.
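As a quick orientation, here is a minimal sketch of streaming a few RefinedWeb records with the Hugging Face datasets library; the content and url field names match the dataset card, but anything beyond that should be verified there.

    from datasets import load_dataset

    # Stream the dataset rather than downloading all of it.
    ds = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)

    for i, record in enumerate(ds):
        print(record["url"])            # source URL of the crawled page
        print(record["content"][:200])  # first 200 characters of the text
        if i == 2:
            break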
License: MIT (https://opensource.org/licenses/MIT)
License information was derived automatically
Generated Questions and Answers from the Falcon RefinedWeb Dataset
This dataset contains 1k open-domain questions and answers generated with GPT-4 from documents in Falcon's RefinedWeb dataset. You can find more details about this work in the following blogpost. Each row consists of:
document_id - an id of a text chunk from the RefinedWeb dataset, from which the question was generated. Each id contains the original document index from the RefinedWeb dataset, and the chunk index… See the full description on the dataset page: https://huggingface.co/datasets/pinecone/refinedweb-generated-questions.
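A minimal sketch of inspecting these QA pairs follows; document_id is the only column named above, so the remaining column names are printed rather than assumed.

    from datasets import load_dataset

    # Load the generated QA pairs; only `document_id` is documented above,
    # so we discover the other columns instead of hard-coding them.
    qa = load_dataset("pinecone/refinedweb-generated-questions", split="train")
    print(qa.column_names)       # list the question/answer columns
    print(qa[0]["document_id"])  # encodes document index and chunk index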
License: ODC-By 1.0 (https://choosealicense.com/licenses/odc-by/)
BEE-spoke-data/falcon-refinedweb-100k_en-xlong
A sample from falcon-refinedweb:
Documents with more than 4,096 and fewer than 34,000 GPT-4 tiktoken tokens; English only (via fasttext-langdetect); 100k samples.
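A hedged sketch of that filter, assuming GPT-4 token counts come from tiktoken and language detection from the fasttext-langdetect package; the original filtering script is not shown here, so treat this as illustrative.

    import tiktoken
    from ftlangdetect import detect  # from the fasttext-langdetect package

    enc = tiktoken.encoding_for_model("gpt-4")

    def keep(text: str) -> bool:
        # Keep documents with more than 4096 and fewer than 34000 GPT-4 tokens.
        n_tokens = len(enc.encode(text))
        if not (4096 < n_tokens < 34000):
            return False
        # fasttext expects single-line input; detect() returns e.g.
        # {"lang": "en", "score": 0.98}.
        return detect(text.replace("\n", " "))["lang"] == "en"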
crumb/refinedweb-22mil-128clusters dataset hosted on Hugging Face and contributed by the HF Datasets community
Tony068/falcon-refined-web-10M1 dataset hosted on Hugging Face and contributed by the HF Datasets community
AlexMRTY/refinedWeb-subset dataset hosted on Hugging Face and contributed by the HF Datasets community
InsightHub/refinedweb-embed-english-v3.0 dataset hosted on Hugging Face and contributed by the HF Datasets community
License: Apache 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
All of the data together is around 81.3 GB. It consists of the last hidden states of 131,072 samples from RefinedWeb, padded/truncated to 512 tokens on the left and fed through google/flan-t5-base. Structure:
{
  "encoding": List, shaped (512, 1024), i.e. (tokens, d_model),
  "text": String, the original text that was encoded,
  "attention_mask": List, binary mask to pass to your model alongside the encoding so that pad tokens are not attended to
}
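A minimal sketch of reproducing one such encoding with transformers; the left padding and 512-token limit follow the description above, while everything else (no batching, default dtype, CPU) is an assumption.

    import torch
    from transformers import AutoTokenizer, T5EncoderModel

    tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
    tokenizer.padding_side = "left"  # pad on the left, per the card
    model = T5EncoderModel.from_pretrained("google/flan-t5-base")

    batch = tokenizer(
        "Some RefinedWeb text...",
        padding="max_length",
        truncation=True,
        max_length=512,
        return_tensors="pt",
    )
    with torch.no_grad():
        out = model(**batch)

    encoding = out.last_hidden_state[0]          # (512, d_model)
    attention_mask = batch["attention_mask"][0]  # 0 marks pad positions
    print(encoding.shape)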
Tony068/falcon-refined-web-5M-part2 dataset hosted on Hugging Face and contributed by the HF Datasets community
License: ODC-By 1.0 (https://choosealicense.com/licenses/odc-by/)
🍷 FineWeb
15 trillion tokens of the finest data the 🌐 web has to offer
What is it?
The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.
AlexMRTY/refined-web-50k-random dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for "bert-base-uncased-refined-web-segment0"
More Information needed
License: ODC-By 1.0 (https://choosealicense.com/licenses/odc-by/)
This repo contains data from AI21 Labs' paper Generating Benchmarks for Factuality Evaluation of Language Models. NEWS-FACTOR: Based on Reuters articles extracted from The RefinedWeb Dataset. The dataset consists of 1036 examples. The benchmark is derived from The RefinedWeb Dataset. The public extract is made available under an ODC-By 1.0 license; users should also abide by the CommonCrawl ToU: https://commoncrawl.org/terms-of-use/. Cite: @article{muhlgay2023generating, title={Generating… See the full description on the dataset page: https://huggingface.co/datasets/mansaripo/NEWS-FACTOR.
License: ODC-By 1.0 (https://choosealicense.com/licenses/odc-by/)
Dataset Card for falcon-refinedweb_urls
This dataset provides the URLs and top-level domains associated with training records in tiiuae/falcon-refinedweb. It is part of a collection of datasets curated to make exploring LLM training datasets more straightforward and accessible.
Dataset Details
Dataset Description
This dataset was created by downloading the source data, extracting URLs and top-level domains, and retaining only those record identifiers… See the full description on the dataset page: https://huggingface.co/datasets/nhagar/falcon-refinedweb_urls.
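A hedged sketch of the URL and top-level-domain extraction step described above, using the tldextract library; the actual curation code behind falcon-refinedweb_urls is not shown on this page.

    import tldextract

    def url_features(url: str) -> dict:
        # Split the URL into subdomain / registrable domain / public suffix.
        ext = tldextract.extract(url)
        return {
            "url": url,
            "domain": f"{ext.domain}.{ext.suffix}",  # e.g. "example.co.uk"
            "tld": ext.suffix,                       # e.g. "co.uk"
        }

    print(url_features("https://news.example.co.uk/story?id=1"))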
Datasets Overview
The dataset URLs and Domain Names are collected from the following sources:
mC4
Description: The Multilingual Colossal Common Crawl Corpus (mC4) is a cleaned version of the Common Crawl's web corpus, curated by the Allen Institute for Artificial Intelligence. It contains approximately 170 million URLs. Source: mC4 Dataset on Hugging Face
falcon-refinedweb
Description: An English large-scale dataset curated for large language model… See the full description on the dataset page: https://huggingface.co/datasets/amahdaouy/Web_DomURLs.
License: Apache 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Tiny English
A collection of short texts that have been curated for long-term human value. The texts in this dataset have been filtered from the falcon-refinedweb and minipile datasets to ensure better quality while keeping the collection tiny. The tiny-en dataset is concise and small, yet highly diverse, making it an excellent resource for training natural language processing models. Despite its compact size, the dataset offers a wide range of content that has been carefully selected for… See the full description on the dataset page: https://huggingface.co/datasets/nampdn-ai/mini-en.
WebOrganizer/FormatAnnotations-Llama-3.1-8B
[Paper] [Website] [GitHub] This dataset contains 1M web pages annotated with format/type labels by the Llama-3.1-8B model. The web pages are a sample of the DCLM RefinedWeb reproduction. It is used as first-stage training data for the WebOrganizer/FormatClassifier.
Dataset Structure
Each example contains the following fields:
text: The text content of the web page
url: The URL of the web page
top_choice_index: Index of the… See the full description on the dataset page: https://huggingface.co/datasets/WebOrganizer/FormatAnnotations-Llama-3.1-8B.
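A minimal sketch of reading these annotations (the same shape applies to the TopicAnnotations dataset below); the description of top_choice_index is truncated above, so its reading as an index into the label set is inferred from the field name.

    from datasets import load_dataset

    ann = load_dataset("WebOrganizer/FormatAnnotations-Llama-3.1-8B",
                       split="train")

    ex = ann[0]
    print(ex["url"])
    print(ex["text"][:120])
    print(ex["top_choice_index"])  # index of the label the model chose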
License: ODC-By 1.0 (https://choosealicense.com/licenses/odc-by/)
🍷 FineWeb
15 trillion tokens of the finest data the 🌐 web has to offer
What is it?
The 🍷 FineWeb dataset consists of more than 15T tokens of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release of the full… See the full description on the dataset page: https://huggingface.co/datasets/akhilhsingh/homeo-dataset.
WebOrganizer/TopicAnnotations-Llama-3.1-405B-FP8
[Paper] [Website] [GitHub] This dataset contains 100K web pages annotated with topic labels by the Llama-3.1-405B-FP8 model. The web pages are a sample of the DCLM RefinedWeb reproduction. It is used as second-stage training data for the WebOrganizer/TopicClassifier.
Dataset Structure
Each example contains the following fields:
text: The text content of the web page
url: The URL of the web page
top_choice_index: Index… See the full description on the dataset page: https://huggingface.co/datasets/WebOrganizer/TopicAnnotations-Llama-3.1-405B-FP8.
License: Apache 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Dataset used to train TinyMistral-248m-v2. It consists of around 8 million examples drawn from the following sources:
4 million Wikipedia pages; 1 million arXiv papers; 1.5 million web pages sourced from RefinedWeb and SlimPajama; 200,000 college textbooks; 1 million Stack Exchange forum posts.
This dataset can contain NSFW examples, use at your own risk.