12 datasets found

P
C4 Dataset
paperswithcode.com
Updated Dec 13, 2023
Cite
Colin Raffel; Noam Shazeer; Adam Roberts; Katherine Lee; Sharan Narang; Michael Matena; Yanqi Zhou; Wei Li; Peter J. Liu, C4 Dataset [Dataset]. https://paperswithcode.com/dataset/c4
Dataset updated
Dec 13, 2023
Authors
Colin Raffel; Noam Shazeer; Adam Roberts; Katherine Lee; Sharan Narang; Michael Matena; Yanqi Zhou; Wei Li; Peter J. Liu
Description
C4 is a colossal, cleaned version of Common Crawl's web crawl corpus. It is based on the Common Crawl dataset (https://commoncrawl.org) and was used to train the T5 text-to-text Transformer models.

The dataset can be downloaded in a pre-processed form from allennlp.
c4-tiny
huggingface.co
Updated Apr 26, 2019
Cite
Prime Intellect (2019). c4-tiny [Dataset]. https://huggingface.co/datasets/PrimeIntellect/c4-tiny
Explore at:
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 26, 2019
Dataset provided by
Authors
Prime Intellect
License
https://choosealicense.com/licenses/odc-by/
Description
C4 tiny

This dataset is a very small subset of https://huggingface.co/datasets/allenai/c4 that can be used for testing without having to download the full C4 dataset. To use it:

    from datasets import load_dataset
    dataset = load_dataset("PrimeIntellect/c4-tiny", "en", ignore_verifications=True)
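The `ignore_verifications` argument in the call above appears to have been deprecated in later `datasets` releases in favour of `verification_mode`, so the snippet may fail on a recent install. The following is a minimal sketch, assuming the switch-over can be treated as happening at version 2.10, of choosing the keyword from the installed version. The helper names `verification_kwargs` and `load_c4_tiny` are hypothetical and are not part of the dataset card.

```python
from typing import Any, Dict


def verification_kwargs(datasets_version: str) -> Dict[str, Any]:
    """Pick the keyword that turns off checksum/size verification.

    Assumption: `verification_mode` is available from datasets 2.10 onward;
    older releases only understand `ignore_verifications`.
    """
    major, minor = (int(part) for part in datasets_version.split(".")[:2])
    if (major, minor) >= (2, 10):
        return {"verification_mode": "no_checks"}
    return {"ignore_verifications": True}


def load_c4_tiny(config: str = "en") -> Any:
    """Download the small C4 subset (needs network access and `datasets`)."""
    import datasets  # imported lazily so the helper above works without it

    kwargs = verification_kwargs(datasets.__version__)
    return datasets.load_dataset("PrimeIntellect/c4-tiny", config, **kwargs)
```

Calling `load_c4_tiny()` downloads the subset; `verification_kwargs` can be checked on its own without any download.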
u
Cerebellum cell type collaboration database
rdr.ucl.ac.uk
bin
Updated Mar 4, 2025
Cite
Maxime Beau; David Herzfeld; Francisco Naveros; Marie Hemelt; Federico D'Agostino; Marlies Oostland; Alvaro Sánchez-López; Young Yoon Chung; Michael Maibach; Stephen Kyranakis; Hannah N. Stabb; Gabriela Martínez Lopera; Agoston Lajko; Marie Zedler; Shogo Ohmae; Nathan Hall; Beverley Clark; Dana Cohen; Stephen Lisberger; Dimitar Kostadinov; Court Hull; Michael Hausser; Javier Medina (2025). Cerebellum cell type collaboration database [Dataset]. http://doi.org/10.5522/04/23702850.v1
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.5522/04/23702850.v1
Dataset updated
Mar 4, 2025
Dataset provided by
University College London
Authors
Maxime Beau; David Herzfeld; Francisco Naveros; Marie Hemelt; Federico D'Agostino; Marlies Oostland; Alvaro Sánchez-López; Young Yoon Chung; Michael Maibach; Stephen Kyranakis; Hannah N. Stabb; Gabriela Martínez Lopera; Agoston Lajko; Marie Zedler; Shogo Ohmae; Nathan Hall; Beverley Clark; Dana Cohen; Stephen Lisberger; Dimitar Kostadinov; Court Hull; Michael Hausser; Javier Medina
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
The C4 Database

This is the official repository for the hdf5 datasets of the cerebellar cell-type classification collaboration (C4), published as a companion to the paper "A deep-learning strategy to identify cell types across species from high-density extracellular recordings" published in Cell (https://doi.org/10.1016/j.cell.2025.01.041).

Instructions to use the cell-type classifier, links to download these datasets, and a data explorer can be found at https://www.c4-database.com.

The specifications of the fields, data types and data formats stored in the hdf5 binary files can be found at https://www.tinyurl.com/c4database. Hdf5 files can be easily opened with Python, MATLAB and many other programming languages.

Using and Citing the C4 Database

The data and visualizations on this website are intended to be freely available for use by the scientific community. The C4 dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, while our classifier is licensed under the GNU General Public License v3.0 as part of NeuroPyxels. If you download and use our data for a publication, and/or if you would like to refer to the database, please cite Beau et al., 2025, Cell together with the NeuroPyxels repository (Beau et al., 2021, Zenodo), and include the link to the C4 online portal https://www.c4-database.com in your methods section. Thank you!
h
c4-en-10k
huggingface.co
opendatalab.com
Updated Jun 11, 2025
Cite
Stas Bekman (2025). c4-en-10k [Dataset]. https://huggingface.co/datasets/stas/c4-en-10k
Explore at:
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 11, 2025
Authors
Stas Bekman
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
This is a small subset containing the first 10K records of the "en" subset of the original C4 dataset, created for testing. The records were extracted after the data had been shuffled.

The full 1TB+ dataset is at https://huggingface.co/datasets/c4.
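The description above says the records were taken after shuffling. A minimal sketch of that "shuffle, then keep the first N" idea follows. It is not the author's actual extraction script, and the function name `first_n_after_shuffle` and the seed are hypothetical. It holds every record in memory, which would not be practical for the full 1TB+ corpus, where a streaming shuffle would be the more likely choice.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def first_n_after_shuffle(records: Iterable[T], n: int, seed: int = 0) -> List[T]:
    """Shuffle all records with a fixed seed, then keep the first n."""
    pool = list(records)               # materialise the records in memory
    random.Random(seed).shuffle(pool)  # seeded, so the subset is reproducible
    return pool[:n]


if __name__ == "__main__":
    # Toy stand-in for C4 records; the real records are web documents.
    docs = [{"text": f"document {i}"} for i in range(100)]
    subset = first_n_after_shuffle(docs, n=10, seed=42)
    print(len(subset))
```

Fixing the seed means the same subset is produced on every run, which is useful for a dataset meant for testing.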
R
C4 0140 Dataset
universe.roboflow.com
zip
Updated Jan 23, 2025
Cite
DOE4C41 (2025). C4 0140 Dataset [Dataset]. https://universe.roboflow.com/doe4c41/c4-0140
Explore at:
Available download formats: zip
Dataset updated
Jan 23, 2025
Dataset authored and provided by
DOE4C41
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
JF2010 01D Bounding Boxes
Description
C4 0140

## Overview
C4 0140 is a dataset for object detection tasks - it contains JF2010 01D annotations for 395 images.

## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
T
c4_wsrs
tensorflow.org
Updated Dec 22, 2022
Cite
(2022). c4_wsrs [Dataset]. https://www.tensorflow.org/datasets/catalog/c4_wsrs
Dataset updated
Dec 22, 2022
Description
A medical abbreviation expansion dataset which applies web-scale reverse substitution (wsrs) to the C4 dataset, which is a colossal, cleaned version of Common Crawl's web crawl corpus.
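As the name suggests, reverse substitution works backwards from ordinary text: long-form terms are replaced with their abbreviations, so each sentence yields an (abbreviated, original) pair for training an expansion model. A minimal sketch of that idea follows, assuming a small hypothetical lookup table. The function name `reverse_substitute` and the two dictionary entries are illustrative only, and the actual pipeline is likely more selective about which occurrences it replaces.

```python
import re
from typing import Dict, Tuple


def reverse_substitute(text: str, expansion_to_abbrev: Dict[str, str]) -> Tuple[str, str]:
    """Return (abbreviated_text, original_text) as one training pair."""
    if not expansion_to_abbrev:
        return text, text
    # Try longer phrases first so overlapping expansions resolve predictably.
    phrases = sorted(expansion_to_abbrev, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in phrases) + r")\b",
        flags=re.IGNORECASE,
    )
    lookup = {p.lower(): a for p, a in expansion_to_abbrev.items()}
    abbreviated = pattern.sub(lambda m: lookup[m.group(0).lower()], text)
    return abbreviated, text


if __name__ == "__main__":
    # Hypothetical table; not taken from the c4_wsrs dataset.
    table = {"atrial fibrillation": "AF", "blood pressure": "BP"}
    source = "The patient has atrial fibrillation and high blood pressure."
    print(reverse_substitute(source, table)[0])
    # prints: The patient has AF and high BP.
```

The model is then trained in the opposite direction, mapping the abbreviated text back to the original.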

The original source is the Common Crawl dataset: https://commoncrawl.org

To use this dataset:

    import tensorflow_datasets as tfds
    ds = tfds.load('c4_wsrs', split='train')
    for ex in ds.take(4):
        print(ex)

See the guide for more information on tensorflow_datasets.
R
Sl Abhi C4 Dataset
universe.roboflow.com
zip
Updated Nov 1, 2023
Cite
abhi sl (2023). Sl Abhi C4 Dataset [Dataset]. https://universe.roboflow.com/abhi-sl/sl-abhi-c4/dataset/1
Explore at:
Available download formats: zip
Dataset updated
Nov 1, 2023
Dataset authored and provided by
abhi sl
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Face Bounding Boxes
Description
SL Abhi C4

## Overview
SL Abhi C4 is a dataset for object detection tasks - it contains Face annotations for 371 images.

## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
R
'c1''c2''c3' 'c4' 'c5' 'c6' Dataset
universe.roboflow.com
zip
Updated Dec 6, 2023
Cite
MMU (2023). 'c1''c2''c3' 'c4' 'c5' 'c6' Dataset [Dataset]. https://universe.roboflow.com/mmu-ncfyw/-c1-c2-c3-c4-c5-c6
Explore at:
Available download formats: zip
Dataset updated
Dec 6, 2023
Dataset authored and provided by
MMU
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Variables measured
Car Masks
Description
'C1''C2''C3' 'C4' 'C5' 'C6'

## Overview
'C1''C2''C3' 'C4' 'C5' 'C6' is a dataset for semantic segmentation tasks - it contains Car annotations for 1,000 images.

## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

## License
This dataset is available under the [Public Domain license](https://creativecommons.org/publicdomain/zero/1.0/).
P
Data from: mC4 Dataset
paperswithcode.com
opendatalab.com
+1more
Updated Jun 8, 2022
Cite
Linting Xue; Noah Constant; Adam Roberts; Mihir Kale; Rami Al-Rfou; Aditya Siddhant; Aditya Barua; Colin Raffel (2022). mC4 Dataset [Dataset]. https://paperswithcode.com/dataset/mc4
Dataset updated
Jun 8, 2022
Authors
Linting Xue; Noah Constant; Adam Roberts; Mihir Kale; Rami Al-Rfou; Aditya Siddhant; Aditya Barua; Colin Raffel
Description
mC4 is a multilingual variant of the C4 dataset. It comprises natural text in 101 languages drawn from the public Common Crawl web scrape.
R
C4 0125(cy1793 02d) Dataset
universe.roboflow.com
zip
Updated Feb 17, 2024
Cite
DOE2C41 (2024). C4 0125(cy1793 02d) Dataset [Dataset]. https://universe.roboflow.com/doe2c41/c4-0125-cy1793-02d/dataset/4
Explore at:
Available download formats: zip
Dataset updated
Feb 17, 2024
Dataset authored and provided by
DOE2C41
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
C4 0125 Bounding Boxes
Description
C4 0125(CY1793 02D)

## Overview
C4 0125(CY1793 02D) is a dataset for object detection tasks - it contains C4 0125 annotations for 989 images.

## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
R
Bus_nocard Dataset
universe.roboflow.com
zip
Updated Dec 18, 2023
Cite
c4 (2023). Bus_nocard Dataset [Dataset]. https://universe.roboflow.com/c4-uuwss/bus_nocard/model/1
Explore at:
Available download formats: zip
Dataset updated
Dec 18, 2023
Dataset authored and provided by
c4
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Bus GzEj Bounding Boxes
Description
BUS_nocard

## Overview
BUS_nocard is a dataset for object detection tasks - it contains Bus GzEj annotations for 3,603 images.

## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
R
M3 Fruit Classification Dataset
universe.roboflow.com
zip
Updated Sep 5, 2024
Cite
C4 4078 Fruit and Veg (2024). M3 Fruit Classification Dataset [Dataset]. https://universe.roboflow.com/c4-4078-fruit-and-veg-ldxid/m3-fruit-classification/dataset/2
Explore at:
Available download formats: zip
Dataset updated
Sep 5, 2024
Dataset authored and provided by
C4 4078 Fruit and Veg
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Fruits Bounding Boxes
Description
M3 Fruit Classification

## Overview
M3 Fruit Classification is a dataset for object detection tasks - it contains Fruits annotations for 2,306 images.

## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
C4 Dataset (Colossal Clean Crawled Corpus)

6 scholarly articles cite this dataset (View in Google Scholar)
