https://choosealicense.com/licenses/odc-by/
🍷 FineWeb
15 trillion tokens of the finest data the 🌐 web has to offer
What is it?
The 🍷 FineWeb dataset consists of more than 15T tokens of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and runs on the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release of the full dataset under… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.
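The "cleaned and deduplicated" step can be illustrated with a toy exact-hash dedup pass. This is only a minimal sketch in plain Python, not datatrove's actual pipeline (which performs fuzzy deduplication at web scale):

```python
import hashlib

def dedupe(docs):
    """Keep the first occurrence of each exact-duplicate document,
    comparing normalized (stripped, lowercased) text hashes."""
    seen, kept = set(), []
    for text in docs:
        h = hashlib.sha256(text.strip().lower().encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(text)
    return kept

docs = ["The web is big.", "the web is big.  ", "Something else."]
print(dedupe(docs))  # the near-identical first two collapse to one entry
```

Real web-scale dedup trades exactness for memory and fuzziness, but the keep-first-occurrence logic is the same.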
https://choosealicense.com/licenses/odc-by/
📚 FineWeb-Edu
1.3 trillion tokens of the finest educational data the 🌐 web has to offer
Paper: https://arxiv.org/abs/2406.17557
What is it?
The 📚 FineWeb-Edu dataset consists of 1.3T tokens and 5.4T tokens (FineWeb-Edu-score-2) of educational web pages filtered from the 🍷 FineWeb dataset. This is the 1.3 trillion-token version. To enhance FineWeb's quality, we developed an educational quality classifier using annotations generated by Llama-3-70B-Instruct. We then… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu.
https://choosealicense.com/licenses/odc-by/
🥂 FineWeb2
A sparkling update with 1000s of languages
What is it?
This is the second iteration of the popular 🍷 FineWeb dataset, bringing high quality pretraining data to over 1000 🗣️ languages. The 🥂 FineWeb2 dataset is fully reproducible, available under the permissive ODC-By 1.0 license and extensively validated through hundreds of ablation experiments. In particular, on the set of 9 diverse languages we used to guide our processing decisions, 🥂 FineWeb2… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-2.
https://choosealicense.com/licenses/odc-by/
Dataset Card for Fineweb Ultra Mini
Fineweb Ultra Mini is a dataset derived from the original Fineweb dataset by Hugging Face (see here: https://huggingface.co/datasets/HuggingFaceFW/fineweb). The dataset focuses on extracting high-quality data from the Fineweb dataset, from the top 2-3% quality range. If you would like even more high-quality data, keep an eye out for our next release, Fineweb Ultra Mini Pro, which focuses on the top 0-1% of high-quality data originally found in Fineweb.… See the full description on the dataset page: https://huggingface.co/datasets/reflex-ai/fineweb-ultra-mini.
Occiglot Fineweb v1.0
We present a more mature version of the multilingual Occiglot Fineweb corpus. In this early form, the dataset contains roughly 430M heavily cleaned documents from 10 languages. Occiglot Fineweb builds on our existing collection of curated datasets and pre-filtered web data. Subsequently, all documents were filtered with language-specific derivatives of the FineWeb processing pipeline and deduplicated to different degrees. We provide the data at 3 levels of… See the full description on the dataset page: https://huggingface.co/datasets/occiglot/occiglot-fineweb-v1.0.
FineWeb-C: Educational content in many languages, labelled by the community
Multilingual data is better together!
Note: We are not actively working on this project anymore. You can continue to contribute annotations and we'll occasionally refresh the exported data.
What is this?
FineWeb-C is a collaborative, community-driven project that expands upon the FineWeb2 dataset. The goal is to create high-quality educational content annotations across hundreds of… See the full description on the dataset page: https://huggingface.co/datasets/data-is-better-together/fineweb-c.
Ornaments/fineweb dataset hosted on Hugging Face and contributed by the HF Datasets community
data-is-better-together/fineweb-c-progress dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This model is a demo classifier for identifying problematic content (wrong language, garbled text) in Danish and Swedish web texts. The model was developed as part of a blog post exploring how to filter web data using community-based annotations. It was fine-tuned from FacebookAI/xlm-roberta-base and trained on the data-is-better-together/fineweb-c dataset.
It achieves the following results on the evaluation set:
Precision: 0.9524 (95.2%)
Recall: 0.7018 (70.2%)
F1: 0.8081
AUC-ROC: 0.9648
Intended use and limitations: The model is intended to serve as an initial filter for web texts, with the aim of improving the efficiency of the annotation process. It has only been tested on Danish and Swedish content. The high precision (95.2%) means false positives are rare, while the recall (70.2%) indicates that the model catches the majority of the problematic content.
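The reported F1 is consistent with the precision and recall above; as a quick check, the standard harmonic-mean formula reproduces it:

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.9524, 0.7018), 4))  # 0.8081, matching the card
```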
eliplutchok/fineweb-small-sample dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
Ultra-FineWeb
📜 Technical Report
📚 Introduction
Ultra-FineWeb is a large-scale, high-quality, and efficiently filtered dataset. We applied the proposed efficient verification-based high-quality filtering pipeline to the FineWeb and Chinese FineWeb datasets (source data from Chinese FineWeb-edu-v2, which includes IndustryCorpus2, MiChao, WuDao, SkyPile, WanJuan, ChineseWebText, TeleChat, and CCI3), resulting in the creation of the higher-quality Ultra-FineWeb-en… See the full description on the dataset page: https://huggingface.co/datasets/openbmb/Ultra-FineWeb.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
We recommend using the improved version, Fineweb-edu-chinese-v2.1!
Chinese Fineweb Edu Dataset V2 [中文] [English]
[OpenCSG Community] [👾github] [wechat] [Twitter]
📖 Technical Report
Chinese Fineweb Edu Dataset V2 is a comprehensive upgrade of the original Chinese Fineweb Edu, designed and optimized for natural language processing (NLP) tasks in the education sector. This high-quality Chinese pretraining dataset has undergone significant… See the full description on the dataset page: https://huggingface.co/datasets/opencsg/chinese-fineweb-edu-v2.
https://choosealicense.com/licenses/odc-by/
⭐ Please download the dataset from here.
PRIMUS: A Pioneering Collection of Open-Source Datasets for Cybersecurity LLM Training
🤗 Primus-FineWeb
The Primus-FineWeb dataset is constructed by filtering cybersecurity-related text from FineWeb, a refined version of Common Crawl. We began by leveraging Primus-Seed, a high-quality dataset of manually curated cybersecurity text, as positive samples. We then sampled ten times the amount of data from FineWeb as negative samples… See the full description on the dataset page: https://huggingface.co/datasets/trend-cybertron/Primus-FineWeb.
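The 10:1 negative sampling described above can be sketched as follows. The function name and pool structure are illustrative, not from the Primus release:

```python
import random

def sample_negatives(web_pool, n_pos, ratio=10, seed=0):
    """Draw ratio x n_pos documents from a generic web pool to serve as
    negatives against n_pos curated cybersecurity positives."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    k = min(ratio * n_pos, len(web_pool))
    return rng.sample(web_pool, k)

pool = [f"doc-{i}" for i in range(1000)]
negatives = sample_negatives(pool, n_pos=30)
print(len(negatives))  # 300 negatives for 30 positives
```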
https://choosealicense.com/licenses/odc-by/
Annotations for 📚 FineWeb-Edu classifier
This dataset contains the annotations used for training the 📚 FineWeb-Edu educational quality classifier. We prompted Llama-3-70B-Instruct to score web pages from 🍷 FineWeb based on their educational value. Note: the dataset contains the FineWeb text sample, the prompt (using the first 1000 characters of the text sample), and the scores, but it doesn't contain the full Llama 3 generation.
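The card states that each prompt uses only the first 1000 characters of the text sample. A hypothetical reconstruction of that truncation step (the instruction wording below is an illustrative placeholder, not the original prompt):

```python
def build_scoring_prompt(sample_text: str, max_chars: int = 1000) -> str:
    """Truncate a web-page sample to max_chars and wrap it in a scoring
    instruction. The wording is illustrative, not the original
    Llama-3-70B-Instruct prompt."""
    excerpt = sample_text[:max_chars]
    return (
        "Below is an extract from a web page. Rate its educational value "
        "on a scale of 0 to 5.\n\n"
        f"Extract:\n{excerpt}\n\nScore:"
    )

prompt = build_scoring_prompt("Photosynthesis converts light energy. " * 100)
```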
https://choosealicense.com/licenses/odc-by/
ivnle/fineweb dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
FineWeb-Edu-Ar
FineWeb-Edu-Ar is a machine-translated Arabic version of the FineWeb-Edu dataset designed to support the development of Arabic small language models (SLMs). Dataset Details:
Languages: Arabic, English (paired)
Size: 202 billion tokens
License: CC-BY-NC-4.0
Source: Machine-translated from the deduplicated version of Hugging Face's FineWeb-Edu dataset
Translation model: facebook/nllb-200-distilled-600M
Application: FineWeb-Edu-Ar is suitable for pre-training Arabic… See the full description on the dataset page: https://huggingface.co/datasets/kaust-generative-ai/fineweb-edu-ar.
mjkmain/fineweb-edu-1M dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/odc-by/
Dataset Card for fineweb-conversational
1. Dataset Overview
fineweb-conversational is a dataset crafted for training conversational AI models in an instruction-following format. It transforms cleaned and deduplicated English web data from the FineWeb dataset into a prompt-completion structure. The dataset is curated by me (a.k.a. EpGuy), is released under the ODC-By license, and is still in active development with periodic updates.
2. Structure & Creation Process
The… See the full description on the dataset page: https://huggingface.co/datasets/EpGuy/fineweb-conversational.
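The prompt-completion transformation described above might look like the following. The template and field names are a hypothetical sketch, since the card does not specify the exact mapping:

```python
def to_prompt_completion(doc_text: str) -> dict:
    """Turn a cleaned web document into an instruction-style record.
    The prompt template here is illustrative only."""
    topic = doc_text.split(". ")[0].strip()
    return {
        "prompt": f"Write a short passage about: {topic}",
        "completion": doc_text.strip(),
    }

rec = to_prompt_completion("Volcanoes form at plate boundaries. Most erupt lava.")
print(rec["prompt"])  # Write a short passage about: Volcanoes form at plate boundaries
```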
https://choosealicense.com/licenses/odc-by/
Pre-shuffled fineweb dataset