Dataset Card for Small C4 Dataset (10k Train, 10k Validation, 10k Test)
Dataset Summary
The Small C4 Dataset is a reduced version of the original C4 dataset (Colossal Clean Crawled Corpus), designed to facilitate lightweight experimentation and model training without the need to process the full C4 dataset. This dataset includes:
10,000 examples for training, 10,000 examples for validation, and 10,000 examples for testing.
Each example consists of a single text passage… See the full description on the dataset page: https://huggingface.co/datasets/brando/small-c4-dataset.
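The card does not say how the 10,000-example splits were drawn. As a minimal sketch, assuming a seeded shuffle followed by three disjoint slices, a comparable subset could be cut from any list of records like this (`make_splits` and the toy corpus are hypothetical, not part of the dataset):

```python
import random


def make_splits(records, n_train, n_val, n_test, seed=0):
    """Shuffle `records` and cut disjoint train/validation/test splits."""
    needed = n_train + n_val + n_test
    if len(records) < needed:
        raise ValueError(f"need {needed} records, got {len(records)}")
    shuffled = list(records)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    return {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:needed],
    }


# Toy stand-in corpus; real C4 records would come from the full dataset.
corpus = [{"text": f"passage {i}"} for i in range(30_000)]
splits = make_splits(corpus, 10_000, 10_000, 10_000)
```

Fixing the seed keeps the three splits stable across runs, which matters when results from different experiments are compared.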
A medical abbreviation expansion dataset which applies web-scale reverse substitution (wsrs) to the C4 dataset, which is a colossal, cleaned version of Common Crawl's web crawl corpus.
The original source is the Common Crawl dataset: https://commoncrawl.org
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('c4_wsrs', split='train')
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
https://choosealicense.com/licenses/odc-by/
Dataset Card for BBC News from C4
This dataset provides a filtered subset of BBC News articles from the realnewslike subset of the C4 dataset, containing approximately 77k articles from BBC News domains.
Dataset Details
Dataset Sources
Repository: https://huggingface.co/datasets/permutans/c4-bbc-news
Source Dataset: allenai/c4 (realnewslike subset)
Paper: https://arxiv.org/abs/1910.10683 (C4 paper)
Uses
Direct Use
Suitable for text… See the full description on the dataset page: https://huggingface.co/datasets/permutans/c4-bbc-news.
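The card describes the subset as a domain filter over realnewslike. A minimal sketch of such a filter follows, assuming each record carries a `url` field as in allenai/c4. The domain list and the helper names are assumptions, since the card does not list the exact BBC domains used:

```python
from urllib.parse import urlparse

# Assumed BBC domains; the dataset card does not enumerate them.
BBC_DOMAINS = ("bbc.co.uk", "bbc.com")


def is_bbc_url(url):
    """True if the URL's host is a BBC domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in BBC_DOMAINS)


def filter_bbc(records):
    """Keep only records whose `url` points at a BBC domain."""
    return [r for r in records if is_bbc_url(r.get("url", ""))]


# Hypothetical records for illustration only.
sample = [
    {"url": "https://www.bbc.co.uk/news/uk-123", "text": "..."},
    {"url": "https://www.bbc.com/news/world-456", "text": "..."},
    {"url": "https://example.com/bbc.co.uk/story", "text": "..."},
    {"url": "https://notbbc.com/news", "text": "..."},
]
bbc_only = filter_bbc(sample)
```

Matching on the parsed hostname rather than on a substring of the URL avoids false positives such as a BBC domain appearing only in the path.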
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The C4 Database
This is the official repository for the hdf5 datasets of the cerebellar cell-type classification collaboration (C4), published as a companion to the paper "A deep-learning strategy to identify cell types across species from high-density extracellular recordings" published in Cell (https://doi.org/10.1016/j.cell.2025.01.041).
Instructions to use the cell-type classifier, links to download these datasets, and a data explorer can be found at https://www.c4-database.com.
The specifications of the fields, data types and data formats stored in the hdf5 binary files can be found at https://www.tinyurl.com/c4database. Hdf5 files can be easily opened with Python, MATLAB and many other programming languages.
Using and Citing the C4 Database
The data and visualizations on this website are intended to be freely available for use by the scientific community. The C4 dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, while our classifier is licensed under the GNU General Public License v3.0 as part of NeuroPyxels. If you download and use our data for a publication, and/or if you would like to refer to the database, please cite Beau et al., 2025, Cell together with the NeuroPyxels repository (Beau et al., 2021, Zenodo), and include the link to the C4 online portal https://www.c4-database.com in your methods section. Thank you!
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
🧠 ALLENAI C4 - English Train Split (Prepared Version)
This repository contains the preprocessed and ready-to-use version of the ALLENAI C4 (Colossal Clean Crawled Corpus) English train split. It has been downloaded and optionally transformed for downstream NLP tasks such as pretraining large language models or text-based retrieval systems.
📦 Dataset Details
Original Source: allenai/c4
Language: English (en)
Split: train
License: Google C4 License
⚠️ Note: This version only includes the train… See the full description on the dataset page: https://huggingface.co/datasets/amanpreet7/allenai-c4.
https://choosealicense.com/licenses/cc0-1.0/
Dataset Summary
Various subsets of C4 with different numbers of tokens measured with the GPT2Tokenizer. This data is used in the paper Scaling Data-Constrained Language Models. Please refer to our GitHub repository for more details. @article{muennighoff2023scaling, title={Scaling Data-Constrained Language Models}, author={Muennighoff, Niklas and Rush, Alexander M and Barak, Boaz and Scao, Teven Le and Piktus, Aleksandra and Tazi, Nouamane and Pyysalo, Sampo and Wolf, Thomas and… See the full description on the dataset page: https://huggingface.co/datasets/datablations/c4-subsets.
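The summary says the subsets are sized by token count. As a rough sketch of that idea, assuming documents are taken in order until a token budget is reached, the selection could look like the following. The helper names are hypothetical, and the whitespace counter is only a stand-in for the GPT2Tokenizer named above, so real counts would differ:

```python
def take_token_budget(texts, budget, count_tokens):
    """Collect documents in order until the next one would exceed `budget` tokens."""
    subset, used = [], 0
    for text in texts:
        n = count_tokens(text)
        if used + n > budget:
            break
        subset.append(text)
        used += n
    return subset, used


def whitespace_count(text):
    """Stand-in token counter (whitespace split); NOT the GPT-2 tokenizer."""
    return len(text.split())


# Toy documents for illustration only.
docs = ["a b c", "d e", "f g h i", "j"]
subset, used = take_token_budget(docs, budget=6, count_tokens=whitespace_count)
```

Passing the counter in as a function makes it straightforward to swap in a real tokenizer without changing the selection logic.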
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Grammar Error Correction dataset synthesized based on: https://github.com/google-research-datasets/C4_200M-synthetic-dataset-for-grammatical-error-correction
This dataset contains roughly 185 million sentence pairs generated from the C4/en/3.0.1 dataset.
The data is stored in the format:
{
"input": "This is an grammatically wrong sentences.",
"output": "This is a grammatically correct sentence."
}
The C4 dataset was downloaded from allenai: https://github.com/allenai/allennlp/discussions/5056 The modified scripts used to generate the sentence pairs were referenced from: https://github.com/google-research-datasets/C4_200M-synthetic-dataset-for-grammatical-error-correction.
We hope that this dataset will help others by saving the trouble and time of generating this dataset.
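The record format shown above can be read with the standard `json` module. The sketch below assumes one JSON object per line, which the description does not confirm, and `read_pairs` is a hypothetical helper:

```python
import json


def read_pairs(lines):
    """Parse one JSON object per line into (input, output) sentence pairs."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        pairs.append((record["input"], record["output"]))
    return pairs


# The example record from the description, serialized on a single line.
raw = [
    '{"input": "This is an grammatically wrong sentences.", '
    '"output": "This is a grammatically correct sentence."}',
    "",
]
pairs = read_pairs(raw)
```

Each pair holds the corrupted sentence first and its corrected form second, matching the `input` and `output` keys.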
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
SL Abhi C4 is a dataset for object detection tasks - it contains Face annotations for 371 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is a small subset representing the first 10K records of the original C4 dataset, "en" subset - created for testing. The records were extracted after having been shuffled.
DatologyAI/c4-subsets dataset hosted on Hugging Face and contributed by the HF Datasets community
This find is registered at Portable Antiquities of the Netherlands with number PAN-00019233
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
RedPajama -- C4 (refined by Data-Juicer)
A refined version of the C4 dataset in RedPajama, produced by Data-Juicer. Some "bad" samples were removed from the original dataset to raise its quality. This dataset is typically used to pretrain a large language model. Notice: Here is a small subset for previewing. The whole dataset is available here (about 832GB).
Dataset Information
Number of samples: 344,491,171 (keeps ~94.42% of the original dataset)
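The two figures above imply a size for the unrefined dataset. A quick back-of-the-envelope check, treating the stated percentage as exact even though it is rounded, so the results are estimates only:

```python
kept = 344_491_171   # samples remaining after refinement (from the card)
keep_ratio = 0.9442  # stated keep rate; rounded, so results are approximate

# Implied size of the original dataset and number of removed samples.
estimated_original = round(kept / keep_ratio)
estimated_removed = estimated_original - kept

print(f"original ~ {estimated_original:,}")
print(f"removed  ~ {estimated_removed:,}")
```

This puts the original at roughly 365 million samples, with about 20 million removed by the refining recipe.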
Refining Recipe
mC4 is a multilingual variant of the C4 dataset. It comprises natural text in 101 languages drawn from the public Common Crawl web scrape.
This find is registered at Portable Antiquities of the Netherlands with number PAN-00000938
https://choosealicense.com/licenses/odc-by/
C4 tiny
This dataset is a very small subset of https://huggingface.co/datasets/allenai/c4 that can be used for testing without having to download the full C4 dataset.
To use:
from datasets import load_dataset
# Note: newer releases of the datasets library deprecate ignore_verifications in favor of verification_mode.
dataset = load_dataset("PrimeIntellect/c4-tiny", "en", ignore_verifications=True)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
C4 0125 is a dataset for object detection tasks - it contains CY 1865 annotations for 444 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
The photosynthetic composition (C3 or C4) of vegetation on the land surface is essential for accurate simulations of biosphere-atmosphere exchanges of carbon, water, and energy. C3 and C4 plants have different responses to light, temperature, CO2, and nitrogen; they also differ in physiological functions like stomatal conductance and isotope fractionation. A fine-scale distribution of these plant types is essential for earth science modeling.
The C4 percentage is determined from datasets that describe the continuous distribution of plant growth forms (i.e., the percent of a grid cell covered by herbaceous or woody vegetation), climate classifications, the fraction of a grid cell covered in croplands, and national crop type harvest area statistics. The staff from the International Satellite Land Surface Climatology Project (ISLSCP) Initiative II have made the original data set consistent with the ISLSCP-2 land/water mask. This data set contains a single file in ArcInfo ASCIIGRID format.
This data set is one of the products of the International Satellite Land-Surface Climatology Project, Initiative II (ISLSCP II) data collection which contains 50 global time series data sets for the ten-year period 1986 to 1995. Selected data sets span even longer periods. ISLSCP II is a consistent collection of data sets that were compiled from existing data sources and algorithms, and were designed to satisfy the needs of modelers and investigators of the global carbon, water and energy cycle. The data were acquired from a number of U.S. and international agencies, universities, and institutions. The global data sets were mapped at consistent spatial (1, 0.5 and 0.25 degrees) and temporal (monthly, with meteorological data at finer (e.g., 3-hour)) resolutions and reformatted into a common ASCII format. The data and documentation have undergone two peer reviews.
ISLSCP is one of several projects of Global Energy and Water Cycle Experiment (GEWEX) [http://www.gewex.org/] and has the lead role in addressing land-atmosphere interactions -- process modeling, data retrieval algorithms, field experiment design and execution, and the development of global data sets.
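The ArcInfo ASCIIGRID format mentioned above is a plain-text header of key/value pairs followed by rows of cell values. A minimal sketch of a reader, assuming the usual header keys (`ncols`, `nrows`, `xllcorner`, `yllcorner`, `cellsize`, `NODATA_value`); the sample grid and its values are invented for illustration and are not taken from the data set:

```python
def parse_ascii_grid(text):
    """Parse an ASCII grid into (header dict, list of rows); NODATA cells become None."""
    tokens = text.split()
    header, pos = {}, 0
    # Header entries are "key value" pairs; data starts at the first non-alphabetic token.
    while pos < len(tokens) and tokens[pos][0].isalpha():
        header[tokens[pos].lower()] = float(tokens[pos + 1])
        pos += 2
    ncols, nrows = int(header["ncols"]), int(header["nrows"])
    values = [float(t) for t in tokens[pos:]]
    if len(values) != ncols * nrows:
        raise ValueError("cell count does not match ncols * nrows")
    nodata = header.get("nodata_value")
    rows = [
        [None if v == nodata else v for v in values[r * ncols:(r + 1) * ncols]]
        for r in range(nrows)
    ]
    return header, rows


# Hypothetical 3x2 grid of C4 percentages.
sample = """ncols 3
nrows 2
xllcorner -180.0
yllcorner -90.0
cellsize 1.0
NODATA_value -9999
0 25 -9999
100 50 0
"""
header, rows = parse_ascii_grid(sample)
```

Rows are returned top to bottom as they appear in the file, so the first row corresponds to the northernmost line of cells.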
This find is registered at Portable Antiquities of the Netherlands with number PAN-00016617
This find is registered at Portable Antiquities of the Netherlands with number PAN-00000844
This find is registered at Portable Antiquities of the Netherlands with number PAN-00143187