License: ODC-By 1.0 (https://choosealicense.com/licenses/odc-by/)
📀 Falcon RefinedWeb
Falcon RefinedWeb is a massive English web dataset built by TII and released under an ODC-By 1.0 license. See the 📓 paper on arXiv for more details. RefinedWeb is built through stringent filtering and large-scale deduplication of CommonCrawl; we found models trained on RefinedWeb to achieve performance in line with or better than models trained on curated datasets, while relying only on web data. RefinedWeb is also "multimodal-friendly": it contains links and alt… See the full description on the dataset page: https://huggingface.co/datasets/tiiuae/falcon-refinedweb.
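As a quick orientation, here is a minimal sketch of streaming a few RefinedWeb records with the Hugging Face datasets library; the content and url field names match the dataset card, but anything beyond that should be verified there.

    from datasets import load_dataset

    # Stream the dataset rather than downloading all of it.
    ds = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)

    for i, record in enumerate(ds):
        print(record["url"])            # source URL of the crawled page
        print(record["content"][:200])  # first 200 characters of the text
        if i == 2:
            break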
License: MIT (https://opensource.org/licenses/MIT)
License information was derived automatically
Generated Questions and Answers from the Falcon RefinedWeb Dataset
This dataset contains 1k open-domain questions and answers generated with GPT-4 from documents in Falcon's RefinedWeb dataset. You can find more details about this work in the following blogpost. Each row consists of:
document_id - an id of a text chunk from the RefinedWeb dataset, from which the question was generated. Each id contains the original document index from the RefinedWeb dataset, and the chunk index… See the full description on the dataset page: https://huggingface.co/datasets/pinecone/refinedweb-generated-questions.
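A minimal sketch of inspecting these QA pairs follows; document_id is the only column named above, so the remaining column names are printed rather than assumed.

    from datasets import load_dataset

    # Load the generated QA pairs; only `document_id` is documented above,
    # so we discover the other columns instead of hard-coding them.
    qa = load_dataset("pinecone/refinedweb-generated-questions", split="train")
    print(qa.column_names)       # list the question/answer columns
    print(qa[0]["document_id"])  # encodes document index and chunk index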
License: ODC-By 1.0 (https://choosealicense.com/licenses/odc-by/)
BEE-spoke-data/falcon-refinedweb-100k_en-xlong
A sample from falcon-refinedweb:
Documents with more than 4,096 and fewer than 34,000 GPT-4 tiktoken tokens; English only (via fasttext-langdetect); 100k samples.
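A hedged sketch of that filter, assuming GPT-4 token counts come from tiktoken and language detection from the fasttext-langdetect package; the original filtering script is not shown here, so treat this as illustrative.

    import tiktoken
    from ftlangdetect import detect  # from the fasttext-langdetect package

    enc = tiktoken.encoding_for_model("gpt-4")

    def keep(text: str) -> bool:
        # Keep documents with more than 4096 and fewer than 34000 GPT-4 tokens.
        n_tokens = len(enc.encode(text))
        if not (4096 < n_tokens < 34000):
            return False
        # fasttext expects single-line input; detect() returns e.g.
        # {"lang": "en", "score": 0.98}.
        return detect(text.replace("\n", " "))["lang"] == "en"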
crumb/refinedweb-22mil-128clusters dataset hosted on Hugging Face and contributed by the HF Datasets community
Tony068/falcon-refined-web-10M1 dataset hosted on Hugging Face and contributed by the HF Datasets community
AlexMRTY/refinedWeb-subset dataset hosted on Hugging Face and contributed by the HF Datasets community
InsightHub/refinedweb-embed-english-v3.0 dataset hosted on Hugging Face and contributed by the HF Datasets community
License: Apache 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
All of the data together is around 81.3 GB. It consists of the last hidden states of 131,072 samples from RefinedWeb, padded/truncated to 512 tokens on the left and fed through google/flan-t5-base. Structure:
{
  "encoding": List, shaped (512, 1024), i.e. (tokens, d_model),
  "text": String, the original text that was encoded,
  "attention_mask": List, binary mask to pass to your model alongside the encoding so that pad tokens are not attended to
}
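A minimal sketch of reproducing one such encoding with transformers; the left padding and 512-token limit follow the description above, while everything else (no batching, default dtype, CPU) is an assumption.

    import torch
    from transformers import AutoTokenizer, T5EncoderModel

    tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
    tokenizer.padding_side = "left"  # pad on the left, per the card
    model = T5EncoderModel.from_pretrained("google/flan-t5-base")

    batch = tokenizer(
        "Some RefinedWeb text...",
        padding="max_length",
        truncation=True,
        max_length=512,
        return_tensors="pt",
    )
    with torch.no_grad():
        out = model(**batch)

    encoding = out.last_hidden_state[0]          # (512, d_model)
    attention_mask = batch["attention_mask"][0]  # 0 marks pad positions
    print(encoding.shape)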
Tony068/falcon-refined-web-5M-part2 dataset hosted on Hugging Face and contributed by the HF Datasets community
License: ODC-By 1.0 (https://choosealicense.com/licenses/odc-by/)
🍷 FineWeb
15 trillion tokens of the finest data the 🌐 web has to offer
What is it?
The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.
AlexMRTY/refined-web-50k-random dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for "bert-base-uncased-refined-web-segment0"
More Information needed
License: ODC-By 1.0 (https://choosealicense.com/licenses/odc-by/)
This repo contains data from AI21 Labs' paper Generating Benchmarks for Factuality Evaluation of Language Models. NEWS-FACTOR: Based on Reuters articles extracted from The RefinedWeb Dataset. The dataset consists of 1036 examples. The benchmark is derived from The RefinedWeb Dataset. The public extract is made available under an ODC-By 1.0 license; users should also abide by the CommonCrawl ToU: https://commoncrawl.org/terms-of-use/. Cite: @article{muhlgay2023generating, title={Generating… See the full description on the dataset page: https://huggingface.co/datasets/mansaripo/NEWS-FACTOR.
License: ODC-By 1.0 (https://choosealicense.com/licenses/odc-by/)
Dataset Card for falcon-refinedweb_urls
This dataset provides the URLs and top-level domains associated with training records in tiiuae/falcon-refinedweb. It is part of a collection of datasets curated to make exploring LLM training datasets more straightforward and accessible.
Dataset Details
Dataset Description
This dataset was created by downloading the source data, extracting URLs and top-level domains, and retaining only those record identifiers… See the full description on the dataset page: https://huggingface.co/datasets/nhagar/falcon-refinedweb_urls.
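A hedged sketch of the URL and top-level-domain extraction step described above, using the tldextract library; the actual curation code behind falcon-refinedweb_urls is not shown on this page.

    import tldextract

    def url_features(url: str) -> dict:
        # Split the URL into subdomain / registrable domain / public suffix.
        ext = tldextract.extract(url)
        return {
            "url": url,
            "domain": f"{ext.domain}.{ext.suffix}",  # e.g. "example.co.uk"
            "tld": ext.suffix,                       # e.g. "co.uk"
        }

    print(url_features("https://news.example.co.uk/story?id=1"))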
Datasets Overview
The dataset URLs and Domain Names are collected from the following sources:
mC4
Description: The Multilingual Colossal Common Crawl Corpus (mC4) is a cleaned version of the Common Crawl's web corpus, curated by the Allen Institute for Artificial Intelligence. It contains approximately 170 million URLs. Source: mC4 Dataset on Hugging Face
falcon-refinedweb
Description: An English large-scale dataset curated for large language model… See the full description on the dataset page: https://huggingface.co/datasets/amahdaouy/Web_DomURLs.
License: Apache 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Tiny English
A collection of short texts that have been curated for long-term human value. The texts in this dataset have been filtered from the falcon-refinedweb and minipile datasets to ensure better quality while keeping the collection tiny. The tiny-en dataset is concise and small, yet highly diverse, making it an excellent resource for training natural language processing models. Despite its compact size, the dataset offers a wide range of content that has been carefully selected for… See the full description on the dataset page: https://huggingface.co/datasets/nampdn-ai/mini-en.
WebOrganizer/FormatAnnotations-Llama-3.1-8B
[Paper] [Website] [GitHub] This dataset contains 1M web pages annotated with format/type labels by the Llama-3.1-8B model. The web pages are a sample of the DCLM RefinedWeb reproduction. It is used as first-stage training data for the WebOrganizer/FormatClassifier.
Dataset Structure
Each example contains the following fields:
text: The text content of the web page
url: The URL of the web page
top_choice_index: Index of the… See the full description on the dataset page: https://huggingface.co/datasets/WebOrganizer/FormatAnnotations-Llama-3.1-8B.
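A minimal sketch of reading these annotations (the same shape applies to the TopicAnnotations dataset below); the description of top_choice_index is truncated above, so its reading as an index into the label set is inferred from the field name.

    from datasets import load_dataset

    ann = load_dataset("WebOrganizer/FormatAnnotations-Llama-3.1-8B",
                       split="train")

    ex = ann[0]
    print(ex["url"])
    print(ex["text"][:120])
    print(ex["top_choice_index"])  # index of the label the model chose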
License: ODC-By 1.0 (https://choosealicense.com/licenses/odc-by/)
🍷 FineWeb
15 trillion tokens of the finest data the 🌐 web has to offer
What is it?
The 🍷 FineWeb dataset consists of more than 15T tokens of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release of the full… See the full description on the dataset page: https://huggingface.co/datasets/akhilhsingh/homeo-dataset.
WebOrganizer/TopicAnnotations-Llama-3.1-405B-FP8
[Paper] [Website] [GitHub] This dataset contains 100K web pages annotated with topic labels by the Llama-3.1-405B-FP8 model. The web pages are a sample of the DCLM RefinedWeb reproduction. It is used as second-stage training data for the WebOrganizer/TopicClassifier.
Dataset Structure
Each example contains the following fields:
text: The text content of the web page
url: The URL of the web page
top_choice_index: Index… See the full description on the dataset page: https://huggingface.co/datasets/WebOrganizer/TopicAnnotations-Llama-3.1-405B-FP8.
License: Apache 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Dataset used to train TinyMistral-248m-v2. It consists of around 8 million examples drawn from the following sources:
4 million Wikipedia pages; 1 million arXiv papers; 1.5 million web pages sourced from RefinedWeb and SlimPajama; 200,000 college textbooks; 1 million Stack Exchange forum posts.
This dataset can contain NSFW examples, use at your own risk.