TEL-LLM/fineweb-edu-1M dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/odc-by/
Dataset Card for Fineweb Ultra Mini
Fineweb Ultra Mini is a dataset derived from the original FineWeb dataset made by Hugging Face (see here: https://huggingface.co/datasets/HuggingFaceFW/fineweb). The dataset focuses on extracting high-quality data from the FineWeb dataset, from the 2-3% range. If you would like even more high-quality data, look out for our next release, Fineweb Ultra Mini Pro, which focuses on the 0-1% of high-quality data originally found in FineWeb.… See the full description on the dataset page: https://huggingface.co/datasets/reflex-ai/fineweb-ultra-mini.
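The card excerpt does not say how the "2-3% range" is measured. One possible reading is a rank band: sort documents by a quality score and keep those ranked between the top 2% and the top 3%. The sketch below illustrates only that reading; the `score` field, the `select_rank_band` helper, and the toy records are hypothetical and are not taken from the dataset or its pipeline.

```python
def select_rank_band(records, lower_pct, upper_pct):
    """Return records ranked in [lower_pct, upper_pct) percent from the top.

    Assumption: each record carries a numeric "score" field (hypothetical),
    where a higher score means higher quality.
    """
    ranked = sorted(records, key=lambda r: r["score"], reverse=True)
    n = len(ranked)
    start = int(n * lower_pct / 100)  # first index inside the band
    end = int(n * upper_pct / 100)    # first index past the band
    return ranked[start:end]


# Toy data: 1000 documents with distinct, increasing scores.
docs = [{"id": i, "score": i / 1000} for i in range(1000)]

# Keep the documents ranked between the top 2% and the top 3%.
band = select_rank_band(docs, 2, 3)
```

Under this reading, the "0-1%" band mentioned for the follow-up release would correspond to `select_rank_band(docs, 0, 1)`.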
ohsuz/fineweb-edu-2024-10-from-5M-to-6M-ko dataset hosted on Hugging Face and contributed by the HF Datasets community
ohsuz/fineweb-edu-2024-10-from-1M-to-2M-ko dataset hosted on Hugging Face and contributed by the HF Datasets community
evinsi/fineweb-edu-Llama-3.2-Instruct-Shuffled dataset hosted on Hugging Face and contributed by the HF Datasets community
ohsuz/fineweb-edu-2024-10-from-2M-to-3M-ko dataset hosted on Hugging Face and contributed by the HF Datasets community
tobiashomie/test-fineweb-h1_2025_02_14_15 dataset hosted on Hugging Face and contributed by the HF Datasets community
ohsuz/fineweb-edu-2024-10-from-7M-to-8M dataset hosted on Hugging Face and contributed by the HF Datasets community
skymizer/fineweb-edu-dedup-45B-1-of-4 dataset hosted on Hugging Face and contributed by the HF Datasets community
dododo1234/fineweb dataset hosted on Hugging Face and contributed by the HF Datasets community
mekaneeky/fineweb-sample-2 dataset hosted on Hugging Face and contributed by the HF Datasets community
ccde/fineweb-edu-sample-500K dataset hosted on Hugging Face and contributed by the HF Datasets community
dvilasuero/fineweb-c-prelim dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/odc-by/
🥂 FineWeb2
A sparkling update with 1000s of languages
What is it?
This is the second iteration of the popular 🍷 FineWeb dataset, bringing high quality pretraining data to over 1000 🗣️ languages. The 🥂 FineWeb2 dataset is fully reproducible, available under the permissive ODC-By 1.0 license and extensively validated through hundreds of ablation experiments. In particular, on the set of 9 diverse languages we used to guide our processing decisions, 🥂 FineWeb2… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-2.
mikasenghaas/fineweb-edu-1bt dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/odc-by/
OpenCoder Dataset
The OpenCoder dataset is composed of the following datasets:
opc-sft-stage1: the sft data used for opencoder sft-stage1
opc-sft-stage2: the sft data used for opencoder sft-stage2
opc-annealing-corpus: the synthetic data & algorithmic corpus used for opencoder annealing
opc-fineweb-code-corpus: the code-related page recalled from fineweb
opc-fineweb-math-corpus: the math-related page recalled from fineweb <-- you are here
refineCode-code-corpus-meta: the… See the full description on the dataset page: https://huggingface.co/datasets/OpenCoder-LLM/opc-fineweb-math-corpus.
https://choosealicense.com/licenses/odc-by/
🍷 FineWeb
15 trillion tokens of the finest data the 🌐 web has to offer
What is it?
The 🍷 FineWeb dataset consists of more than 15T tokens of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release of the full… See the full description on the dataset page: https://huggingface.co/datasets/akhilhsingh/homeo-dataset.
Dataset Card for instructions-from-fineweb-edu
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml that can be used with the distilabel CLI to reproduce the pipeline that generated it:
distilabel pipeline run --config "https://huggingface.co/datasets/gabrielmbmb/instructions-from-fineweb-edu/raw/main/pipeline.yaml"
or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/gabrielmbmb/instructions-from-fineweb-edu.
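As a minimal sketch, the two CLI invocations quoted above can be assembled from a dataset repository ID. The helper names `pipeline_config_url` and `distilabel_command` are hypothetical. The URL layout is assumed from the single example URL in this card, and only the `run` and `info` subcommands with `--config` are used because those are the ones the card shows.

```python
import shlex


def pipeline_config_url(repo_id, revision="main"):
    """Build the raw pipeline.yaml URL for a dataset repo.

    Assumption: the URL follows the pattern of the example quoted in the card.
    """
    return f"https://huggingface.co/datasets/{repo_id}/raw/{revision}/pipeline.yaml"


def distilabel_command(action, repo_id):
    """Return the argv list for `distilabel pipeline <action> --config <url>`."""
    if action not in ("run", "info"):
        raise ValueError(f"unsupported action: {action!r}")
    return ["distilabel", "pipeline", action, "--config", pipeline_config_url(repo_id)]


repo = "gabrielmbmb/instructions-from-fineweb-edu"
run_cmd = distilabel_command("run", repo)
info_cmd = distilabel_command("info", repo)

# shlex.join renders the argv list as a shell-safe command line.
print(shlex.join(run_cmd))
print(shlex.join(info_cmd))
```

The argv lists can be passed to `subprocess.run` on a machine where distilabel is installed; the sketch itself only constructs and prints the commands.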
juniorrios/fineweb-2-10M-tokenized-pplfixed2048SmolLM2-1.7B dataset hosted on Hugging Face and contributed by the HF Datasets community
divij30/fineweb-filtered-100k dataset hosted on Hugging Face and contributed by the HF Datasets community