100+ datasets found

small-c4-dataset
huggingface.co
Updated May 31, 2025
Cite
Brando Miranda (2025). small-c4-dataset [Dataset]. https://huggingface.co/datasets/brando/small-c4-dataset
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
May 31, 2025
Authors
Brando Miranda
Description
Dataset Card for Small C4 Dataset (10k Train, 10k Validation, 10k Test)

Dataset Summary

The Small C4 Dataset is a reduced version of the original C4 dataset (Colossal Clean Crawled Corpus), designed to facilitate lightweight experimentation and model training without the need to process the full C4 dataset. This dataset includes:

10,000 examples for training, 10,000 examples for validation, and 10,000 examples for testing.

Each example consists of a single text passage… See the full description on the dataset page: https://huggingface.co/datasets/brando/small-c4-dataset.
c4_wsrs
tensorflow.org
Updated Dec 22, 2022
Cite
(2022). c4_wsrs [Dataset]. https://www.tensorflow.org/datasets/catalog/c4_wsrs
Explore at:
Dataset updated
Dec 22, 2022
Description
A medical abbreviation expansion dataset which applies web-scale reverse substitution (wsrs) to the C4 dataset, which is a colossal, cleaned version of Common Crawl's web crawl corpus.
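As a rough illustration of the reverse-substitution idea described above, the sketch below replaces long forms in ordinary text with abbreviations, producing (abbreviated, original) pairs. The dictionary entries and the function name are hypothetical and are not taken from the c4_wsrs pipeline.

```python
import re

# Hypothetical long-form -> abbreviation table (illustrative only).
ABBREVIATIONS = {
    "blood pressure": "bp",
    "shortness of breath": "sob",
}


def reverse_substitute(text, table=ABBREVIATIONS):
    """Return (abbreviated_text, original_text) for one passage."""
    abbreviated = text
    # Replace longer phrases first so overlapping entries do not clash.
    for long_form in sorted(table, key=len, reverse=True):
        pattern = r"\b" + re.escape(long_form) + r"\b"
        abbreviated = re.sub(pattern, table[long_form], abbreviated, flags=re.IGNORECASE)
    return abbreviated, text


original = "The patient reported shortness of breath and high blood pressure."
pair = reverse_substitute(original)
print(pair[0])
```

A model trained on such pairs would learn to map the abbreviated text back to the original wording.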

The original source is the Common Crawl dataset: https://commoncrawl.org

To use this dataset:

    import tensorflow_datasets as tfds

    ds = tfds.load('c4_wsrs', split='train')
    for ex in ds.take(4):
        print(ex)

See the guide for more information on tensorflow_datasets.
c4-bbc-news
huggingface.co
Updated Jan 6, 2025
Cite
Louis Maddox (2025). c4-bbc-news [Dataset]. https://huggingface.co/datasets/permutans/c4-bbc-news
Explore at:
Croissant
Dataset updated
Jan 6, 2025
Authors
Louis Maddox
License
https://choosealicense.com/licenses/odc-by/
Description
Dataset Card for BBC News from C4

This dataset provides a filtered subset of BBC News articles from the realnewslike subset of the C4 dataset, containing approximately 77k articles from BBC News domains.
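A minimal sketch of how such a domain filter might look, assuming each C4 row carries a "url" field and that "BBC News domains" means hosts under bbc.co.uk or bbc.com. Both the suffix list and the helper name are assumptions; the dataset card does not spell out the exact rule.

```python
from urllib.parse import urlparse

# Assumed host suffixes; the dataset card only says "BBC News domains".
BBC_SUFFIXES = ("bbc.co.uk", "bbc.com")


def is_bbc_url(url):
    """True when the URL's host is a BBC domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == s or host.endswith("." + s) for s in BBC_SUFFIXES)


# Toy records shaped like C4 rows.
records = [
    {"url": "https://www.bbc.co.uk/news/uk-123", "text": "..."},
    {"url": "https://example.com/bbc.com", "text": "..."},
    {"url": "https://notbbc.com/story", "text": "..."},
]
bbc_only = [r for r in records if is_bbc_url(r["url"])]
```

Matching on the parsed hostname rather than on a substring avoids false positives such as a path that merely contains "bbc.com".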

Dataset Details

Dataset Sources

Repository: https://huggingface.co/datasets/permutans/c4-bbc-news
Source Dataset: allenai/c4 (realnewslike subset)
Paper: https://arxiv.org/abs/1910.10683 (C4 paper)

Uses

Direct Use

Suitable for text… See the full description on the dataset page: https://huggingface.co/datasets/permutans/c4-bbc-news.
Cerebellum cell type collaboration database
rdr.ucl.ac.uk
produccioncientifica.ugr.es
bin
Updated Mar 4, 2025
Cite
Maxime Beau; David Herzfeld; Francisco Naveros; Marie Hemelt; Federico D'Agostino; Marlies Oostland; Alvaro Sánchez-López; Young Yoon Chung; Michael Maibach; Stephen Kyranakis; Hannah N. Stabb; Gabriela Martínez Lopera; Agoston Lajko; Marie Zedler; Shogo Ohmae; Nathan Hall; Beverley Clark; Dana Cohen; Stephen Lisberger; Dimitar Kostadinov; Court Hull; Michael Hausser; Javier Medina (2025). Cerebellum cell type collaboration database [Dataset]. http://doi.org/10.5522/04/23702850.v1
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.5522/04/23702850.v1
Dataset updated
Mar 4, 2025
Dataset provided by
University College London
Authors
Maxime Beau; David Herzfeld; Francisco Naveros; Marie Hemelt; Federico D'Agostino; Marlies Oostland; Alvaro Sánchez-López; Young Yoon Chung; Michael Maibach; Stephen Kyranakis; Hannah N. Stabb; Gabriela Martínez Lopera; Agoston Lajko; Marie Zedler; Shogo Ohmae; Nathan Hall; Beverley Clark; Dana Cohen; Stephen Lisberger; Dimitar Kostadinov; Court Hull; Michael Hausser; Javier Medina
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
The C4 Database

This is the official repository for the hdf5 datasets of the cerebellar cell-type classification collaboration (C4), published as a companion to the paper "A deep-learning strategy to identify cell types across species from high-density extracellular recordings" published in Cell (https://doi.org/10.1016/j.cell.2025.01.041).

Instructions to use the cell-type classifier, links to download these datasets, and a data explorer can be found at https://www.c4-database.com.

The specifications of the fields, data types and data formats stored in the hdf5 binary files can be found at https://www.tinyurl.com/c4database. Hdf5 files can be easily opened with Python, MATLAB and many other programming languages.

Using and Citing the C4 Database

The data and visualizations on this website are intended to be freely available for use by the scientific community. The C4 dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, while our classifier is licensed under the GNU General Public License v3.0 as part of NeuroPyxels. If you download and use our data for a publication, and/or if you would like to refer to the database, please cite Beau et al., 2025, Cell together with the NeuroPyxels repository (Beau et al., 2021, Zenodo), and include the link to the C4 online portal https://www.c4-database.com in your methods section. Thank you!
allenai-c4
huggingface.co
Updated Apr 26, 2019
Cite
Amanpreet Singh (2019). allenai-c4 [Dataset]. https://huggingface.co/datasets/amanpreet7/allenai-c4
Explore at:
Dataset updated
Apr 26, 2019
Authors
Amanpreet Singh
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
🧠 ALLENAI C4 - English Train Split (Prepared Version)

This repository contains the preprocessed and ready-to-use version of the ALLENAI C4 (Colossal Clean Crawled Corpus) English train split. It has been downloaded and optionally transformed for downstream NLP tasks such as pretraining large language models or text-based retrieval systems.

📦 Dataset Details

Original Source: allenai/c4
Language: English (en)
Split: train
License: Google C4 License

⚠️ Note: This version only includes the train… See the full description on the dataset page: https://huggingface.co/datasets/amanpreet7/allenai-c4.
c4-subsets
huggingface.co
Cite
datablations, c4-subsets [Dataset]. https://huggingface.co/datasets/datablations/c4-subsets
Explore at:
Croissant
Dataset authored and provided by
datablations
License
https://choosealicense.com/licenses/cc0-1.0/
Description
Dataset Summary

Various subsets of C4 with different numbers of tokens measured with the GPT2Tokenizer. This data is used in the paper Scaling Data-Constrained Language Models. Please refer to our GitHub repository for more details.

@article{muennighoff2023scaling, title={Scaling Data-Constrained Language Models}, author={Muennighoff, Niklas and Rush, Alexander M and Barak, Boaz and Scao, Teven Le and Piktus, Aleksandra and Tazi, Nouamane and Pyysalo, Sampo and Wolf, Thomas and… See the full description on the dataset page: https://huggingface.co/datasets/datablations/c4-subsets.
C4_200M
kaggle.com
Updated Nov 13, 2021
Cite
A0155991R_Li Liwei (2021). C4_200M [Dataset]. https://www.kaggle.com/datasets/a0155991rliwei/c4-200m
Explore at:
Croissant
Dataset updated
Nov 13, 2021
Dataset provided by
Kaggle: http://kaggle.com/
Authors
A0155991R_Li Liwei
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Context

Grammar Error Correction dataset synthesized based on: https://github.com/google-research-datasets/C4_200M-synthetic-dataset-for-grammatical-error-correction

Content

This dataset contains roughly 185 Million sentence pairs generated using C4/en/3.0.1 dataset

The data is stored in the format: { "input": "This is an grammatically wrong sentences.", "output": "This is a grammatically correct sentence." }
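A record in this shape can be read with the standard json module. The sketch below assumes one JSON object per line (JSON Lines), which the description does not state explicitly; the helper name is hypothetical.

```python
import json

# One record in the format shown above, stored as a single line of JSON.
line = (
    '{"input": "This is an grammatically wrong sentences.", '
    '"output": "This is a grammatically correct sentence."}'
)


def parse_pair(raw):
    """Parse one JSON record into a (source, target) tuple."""
    record = json.loads(raw)
    return record["input"], record["output"]


source, target = parse_pair(line)
print(source, "->", target)
```

Here "input" holds the corrupted sentence and "output" the corrected one, so the tuple can feed a sequence-to-sequence trainer directly.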

Acknowledgements

The C4 dataset was downloaded from allenai: https://github.com/allenai/allennlp/discussions/5056 The modified scripts used to generate the sentence pairs were referenced from: https://github.com/google-research-datasets/C4_200M-synthetic-dataset-for-grammatical-error-correction.

Inspiration

We hope that this dataset will help others by saving the trouble and time of generating this dataset.
Sl Abhi C4 Dataset
universe.roboflow.com
zip
Updated Nov 1, 2023
Cite
abhi sl (2023). Sl Abhi C4 Dataset [Dataset]. https://universe.roboflow.com/abhi-sl/sl-abhi-c4/dataset/1
Explore at:
Available download formats: zip
Dataset updated
Nov 1, 2023
Dataset authored and provided by
abhi sl
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Face Bounding Boxes
Description
SL Abhi C4

Overview

SL Abhi C4 is a dataset for object detection tasks - it contains Face annotations for 371 images.

Getting Started

You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

License

This dataset is available under the CC BY 4.0 license (https://creativecommons.org/licenses/by/4.0/).
c4-en-10k
opendatalab.com
huggingface.co
zip
Updated Dec 31, 2023
Cite
Google (2023). c4-en-10k [Dataset]. https://opendatalab.com/OpenDataLab/c4-en-10k
Explore at:
Available download formats: zip
Dataset updated
Dec 31, 2023
Dataset provided by
Google: http://google.com/
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
This is a small subset representing the first 10K records of the original C4 dataset, "en" subset - created for testing. The records were extracted after having been shuffled.
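The "shuffle, then take the first 10K" procedure can be sketched as follows. The seed, the function name, and the stand-in corpus are assumptions for illustration; the description does not say how the shuffle was actually performed.

```python
import random


def first_n_after_shuffle(records, n, seed=0):
    """Shuffle a copy of the records, then keep the first n."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n]


# Stand-in corpus; the real source is the C4 "en" subset.
corpus = [f"document {i}" for i in range(100_000)]
subset = first_n_after_shuffle(corpus, 10_000)
print(len(subset))
```

Shuffling before slicing means the subset is a random sample and not simply the first records in file order, and a fixed seed keeps it reproducible.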
c4-subsets
huggingface.co
Cite
DatologyAI, c4-subsets [Dataset]. https://huggingface.co/datasets/DatologyAI/c4-subsets
Explore at:
Croissant
Dataset authored and provided by
DatologyAI
Description
DatologyAI/c4-subsets dataset hosted on Hugging Face and contributed by the HF Datasets community
PAN-00019233 - Late Medieval/modern spoon C4 - Dataset - B2FIND
b2find.eudat.eu
Updated Sep 10, 2019
+ more versions
Cite
(2019). PAN-00019233 - Late Medieval/modern spoon C4 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/9155569e-9ffd-581f-ad18-ad3297c779ef
Explore at:
Dataset updated
Sep 10, 2019
Description
This find is registered at Portable Antiquities of the Netherlands with number PAN-00019233
redpajama-c4-refined-by-data-juicer
huggingface.co
Updated Apr 12, 2017
Cite
Data-Juicer (2017). redpajama-c4-refined-by-data-juicer [Dataset]. https://huggingface.co/datasets/datajuicer/redpajama-c4-refined-by-data-juicer
Explore at:
Croissant
Dataset updated
Apr 12, 2017
Dataset authored and provided by
Data-Juicer
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
RedPajama -- C4 (refined by Data-Juicer)

A refined version of C4 dataset in RedPajama by Data-Juicer. Removing some "bad" samples from the original dataset to make it higher-quality. This dataset is usually used to pretrain a Large Language Model. Notice: Here is a small subset for previewing. The whole dataset is available here (About 832GB).

Dataset Information

Number of samples: 344,491,171 (Keep ~94.42% from the original dataset)
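The two figures above imply an approximate size for the unrefined dataset. The short calculation below is only an estimate, since the kept percentage is rounded on the card; it suggests on the order of 365 million original samples, with roughly 20 million removed.

```python
kept_samples = 344_491_171   # from the dataset card
kept_fraction = 0.9442       # "~94.42%", rounded on the card

# kept = original * fraction, so original = kept / fraction.
estimated_original = kept_samples / kept_fraction
estimated_removed = estimated_original - kept_samples

print(f"original ~ {estimated_original / 1e6:.1f}M samples")
print(f"removed  ~ {estimated_removed / 1e6:.1f}M samples")
```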

Refining Recipe

… See the full description on the dataset page: https://huggingface.co/datasets/datajuicer/redpajama-c4-refined-by-data-juicer.
mC4 Dataset
library.toponeai.link
opendatalab.com
Updated Jun 8, 2022
Cite
Linting Xue; Noah Constant; Adam Roberts; Mihir Kale; Rami Al-Rfou; Aditya Siddhant; Aditya Barua; Colin Raffel (2022). mC4 Dataset [Dataset]. https://library.toponeai.link/dataset/mc4
Explore at:
Dataset updated
Jun 8, 2022
Authors
Linting Xue; Noah Constant; Adam Roberts; Mihir Kale; Rami Al-Rfou; Aditya Siddhant; Aditya Barua; Colin Raffel
Description
mC4 is a multilingual variant of the C4 dataset. It comprises natural text in 101 languages drawn from the public Common Crawl web scrape.
PAN-00000938 - Late Medieval/modern spoon C4 - Dataset - B2FIND
b2find.eudat.eu
Updated Sep 10, 2019
Cite
(2019). PAN-00000938 - Late Medieval/modern spoon C4 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/bd7422a0-5c14-538a-b23a-a8a975ddb983
Explore at:
Dataset updated
Sep 10, 2019
Description
This find is registered at Portable Antiquities of the Netherlands with number PAN-00000938
c4-tiny
huggingface.co
Updated Apr 26, 2019
Cite
Prime Intellect (2019). c4-tiny [Dataset]. https://huggingface.co/datasets/PrimeIntellect/c4-tiny
Explore at:
Croissant
Dataset updated
Apr 26, 2019
Dataset authored and provided by
Prime Intellect
License
https://choosealicense.com/licenses/odc-by/
Description
C4 tiny

This dataset is a very small subset of https://huggingface.co/datasets/allenai/c4 that can be used for testing without having to download the full C4 dataset.

To use:

    from datasets import load_dataset
    dataset = load_dataset("PrimeIntellect/c4-tiny", "en", ignore_verifications=True)
C4 0125 Dataset
universe.roboflow.com
zip
Updated Apr 10, 2024
+ more versions
Cite
DOE3C41 (2024). C4 0125 Dataset [Dataset]. https://universe.roboflow.com/doe3c41/c4-0125-7ia9o
Explore at:
Available download formats: zip
Dataset updated
Apr 10, 2024
Dataset authored and provided by
DOE3C41
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
CY 1865 Bounding Boxes
Description
C4 0125

Overview

C4 0125 is a dataset for object detection tasks - it contains CY 1865 annotations for 444 images.

Getting Started

You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

License

This dataset is available under the CC BY 4.0 license (https://creativecommons.org/licenses/by/4.0/).
Data from: ISLSCP II C4 Vegetation Percentage
data.nasa.gov
s.cnmilf.com
+6 more
Updated Apr 1, 2025
+ more versions
Cite
nasa.gov (2025). ISLSCP II C4 Vegetation Percentage [Dataset]. https://data.nasa.gov/dataset/islscp-ii-c4-vegetation-percentage-061c0
Explore at:
Dataset updated
Apr 1, 2025
Dataset provided by
NASA: http://nasa.gov/
Description
The photosynthetic composition (C3 or C4) of vegetation on the land surface is essential for accurate simulations of biosphere-atmosphere exchanges of carbon, water, and energy. C3 and C4 plants have different responses to light, temperature, CO2, and nitrogen; they also differ in physiological functions like stomatal conductance and isotope fractionation. A fine-scale distribution of these plant types is essential for earth science modeling.

The C4 percentage is determined from datasets that describe the continuous distribution of plant growth forms (i.e., the percent of a grid cell covered by herbaceous or woody vegetation), climate classifications, the fraction of a grid cell covered in croplands, and national crop type harvest area statistics. The staff from the International Satellite Land Surface Climatology Project (ISLSCP) Initiative II have made the original data set consistent with the ISLSCP-2 land/water mask. This data set contains a single file in ArcInfo ASCIIGRID format.

This data set is one of the products of the International Satellite Land-Surface Climatology Project, Initiative II (ISLSCP II) data collection which contains 50 global time series data sets for the ten-year period 1986 to 1995. Selected data sets span even longer periods. ISLSCP II is a consistent collection of data sets that were compiled from existing data sources and algorithms, and were designed to satisfy the needs of modelers and investigators of the global carbon, water and energy cycle. The data were acquired from a number of U.S. and international agencies, universities, and institutions. The global data sets were mapped at consistent spatial (1, 0.5 and 0.25 degrees) and temporal (monthly, with meteorological data at finer (e.g., 3-hour)) resolutions and reformatted into a common ASCII format. The data and documentation have undergone two peer reviews.

ISLSCP is one of several projects of Global Energy and Water Cycle Experiment (GEWEX) [http://www.gewex.org/] and has the lead role in addressing land-atmosphere interactions -- process modeling, data retrieval algorithms, field experiment design and execution, and the development of global data sets.
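Since the data ship as a single ArcInfo ASCII grid, a plain-Python reader is enough for a first look. The sketch below assumes the usual layout of that format (keyword/value header lines such as ncols, nrows, cellsize and NODATA_value, followed by rows of numbers); the sample grid and its values are made up and do not come from the ISLSCP II file.

```python
def parse_ascii_grid(text):
    """Parse an ArcInfo ASCII grid into (header dict, list of rows)."""
    header, rows = {}, []
    for line in text.strip().splitlines():
        parts = line.split()
        if not parts:
            continue
        # Header lines are "keyword value"; data lines start with a number.
        if parts[0][0].isalpha():
            header[parts[0].lower()] = float(parts[1])
        else:
            rows.append([float(v) for v in parts])
    return header, rows


# Tiny made-up grid; the values stand in for C4 percentages.
sample = """ncols 3
nrows 2
xllcorner -180.0
yllcorner -90.0
cellsize 1.0
NODATA_value -9999
0 25 -9999
50 75 100
"""
header, rows = parse_ascii_grid(sample)
nodata = header["nodata_value"]
# Skip NODATA cells (for example water) before averaging.
valid = [v for row in rows for v in row if v != nodata]
mean_c4 = sum(valid) / len(valid)
print(mean_c4)
```

Masking the NODATA value before computing statistics matters here, because the grid follows a land/water mask and unmasked cells would skew any average.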
PAN-00016617 - early medieval figurative disc brooch (eye-hook) variant C4 -...
b2find.eudat.eu
Updated Jul 15, 2019
Cite
(2019). PAN-00016617 - early medieval figurative disc brooch (eye-hook) variant C4 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/c916e873-789e-57aa-a2b7-18efc813b392
Explore at:
Dataset updated
Jul 15, 2019
Description
This find is registered at Portable Antiquities of the Netherlands with number PAN-00016617
PAN-00000844 - early medieval figurative disc brooch (eye-hook) variant C4 -...
b2find.eudat.eu
Updated Apr 18, 2019
Cite
(2019). PAN-00000844 - early medieval figurative disc brooch (eye-hook) variant C4 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/6bc51cda-79f7-565c-b033-bf5693ccd86c
Explore at:
Dataset updated
Apr 18, 2019
Description
This find is registered at Portable Antiquities of the Netherlands with number PAN-00000844
PAN-00143187 - early medieval figurative disc brooch (eye-hook) variant C4 -...
b2find.eudat.eu
Updated Oct 25, 2024
Cite
(2024). PAN-00143187 - early medieval figurative disc brooch (eye-hook) variant C4 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/fa6c2b74-0e73-5dbe-91bd-429a3fafbafd
Explore at:
Dataset updated
Oct 25, 2024
Description
This find is registered at Portable Antiquities of the Netherlands with number PAN-00143187
