12 datasets found
  1. C4 Dataset

    • paperswithcode.com
    Updated Dec 13, 2023
    Cite
    Colin Raffel; Noam Shazeer; Adam Roberts; Katherine Lee; Sharan Narang; Michael Matena; Yanqi Zhou; Wei Li; Peter J. Liu, C4 Dataset [Dataset]. https://paperswithcode.com/dataset/c4
    Dataset updated
    Dec 13, 2023
    Authors
    Colin Raffel; Noam Shazeer; Adam Roberts; Katherine Lee; Sharan Narang; Michael Matena; Yanqi Zhou; Wei Li; Peter J. Liu
    Description

    C4 is a colossal, cleaned version of Common Crawl's web crawl corpus. It is based on the Common Crawl dataset (https://commoncrawl.org) and was used to train the T5 text-to-text Transformer models.

    The dataset can be downloaded in a pre-processed form from allennlp.
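A minimal sketch of reading the pre-processed corpus lazily rather than downloading it in full. It assumes the Hugging Face `datasets` package is installed and that the corpus is hosted as `allenai/c4` (the repository named elsewhere in this listing); `stream_c4` and `preview_records` are hypothetical helper names introduced here for illustration.

```python
from itertools import islice


def preview_records(records, n=3, width=80):
    """Return the first n records with the 'text' field truncated to width characters."""
    out = []
    for rec in islice(records, n):
        text = rec.get("text", "")
        short = text if len(text) <= width else text[:width] + "..."
        out.append({**rec, "text": short})  # copy, so the input record is not modified
    return out


def stream_c4(config="en", split="train"):
    """Iterate over C4 lazily (assumption: hosted as `allenai/c4`; needs network access)."""
    from datasets import load_dataset  # imported lazily: optional dependency

    # streaming=True yields records on demand instead of downloading every shard first
    return load_dataset("allenai/c4", config, split=split, streaming=True)
```

Calling `preview_records(stream_c4(), n=3)` would fetch only the first few records over the network, which is usually enough to inspect the fields before committing to a full download.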

  2. c4-tiny

    • huggingface.co
    Updated Apr 26, 2019
    Cite
    Prime Intellect (2019). c4-tiny [Dataset]. https://huggingface.co/datasets/PrimeIntellect/c4-tiny
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 26, 2019
    Authors
    Prime Intellect
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    C4 tiny

    This dataset is a very small subset of https://huggingface.co/datasets/allenai/c4 that can be used for testing without having to download the full C4 dataset.

    To use:

    from datasets import load_dataset

    dataset = load_dataset("PrimeIntellect/c4-tiny", "en", ignore_verifications=True)

  3. Cerebellum cell type collaboration database

    • rdr.ucl.ac.uk
    bin
    Updated Mar 4, 2025
    Cite
    Maxime Beau; David Herzfeld; Francisco Naveros; Marie Hemelt; Federico D'Agostino; Marlies Oostland; Alvaro Sánchez-López; Young Yoon Chung; Michael Maibach; Stephen Kyranakis; Hannah N. Stabb; Gabriela Martínez Lopera; Agoston Lajko; Marie Zedler; Shogo Ohmae; Nathan Hall; Beverley Clark; Dana Cohen; Stephen Lisberger; Dimitar Kostadinov; Court Hull; Michael Hausser; Javier Medina (2025). Cerebellum cell type collaboration database [Dataset]. http://doi.org/10.5522/04/23702850.v1
    Available download formats: bin
    Dataset updated
    Mar 4, 2025
    Dataset provided by
    University College London
    Authors
    Maxime Beau; David Herzfeld; Francisco Naveros; Marie Hemelt; Federico D'Agostino; Marlies Oostland; Alvaro Sánchez-López; Young Yoon Chung; Michael Maibach; Stephen Kyranakis; Hannah N. Stabb; Gabriela Martínez Lopera; Agoston Lajko; Marie Zedler; Shogo Ohmae; Nathan Hall; Beverley Clark; Dana Cohen; Stephen Lisberger; Dimitar Kostadinov; Court Hull; Michael Hausser; Javier Medina
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The C4 Database

    This is the official repository for the hdf5 datasets of the cerebellar cell-type classification collaboration (C4), published as a companion to the paper "A deep-learning strategy to identify cell types across species from high-density extracellular recordings" published in Cell (https://doi.org/10.1016/j.cell.2025.01.041).

    Instructions to use the cell-type classifier, links to download these datasets, and a data explorer can be found at https://www.c4-database.com.

    The specifications of the fields, data types and data formats stored in the hdf5 binary files can be found at https://www.tinyurl.com/c4database. Hdf5 files can be easily opened with Python, MATLAB and many other programming languages.

    Using and Citing the C4 Database

    The data and visualizations on this website are intended to be freely available for use by the scientific community. The C4 dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, while our classifier is licensed under the GNU General Public License v3.0 as part of NeuroPyxels. If you download and use our data for a publication, and/or if you would like to refer to the database, please cite Beau et al., 2025, Cell together with the NeuroPyxels repository (Beau et al., 2021, Zenodo), and include the link to the C4 online portal https://www.c4-database.com in your methods section. Thank you!
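Since the description notes that the hdf5 files can be opened with Python, here is a minimal sketch of listing what such a file contains. It assumes the third-party `h5py` package is installed; `list_datasets` and `summarize_hdf5` are hypothetical helper names, and nothing here reflects the actual field layout of the C4 database, which is documented at the specification link above.

```python
def list_datasets(node, prefix=""):
    """Recursively collect (path, shape, dtype) for every leaf under node.

    Works on anything mapping-like: an h5py File or Group exposes .items(),
    while a leaf (an h5py Dataset) exposes .shape and .dtype instead.
    """
    found = []
    for name, child in node.items():
        path = f"{prefix}/{name}"
        if hasattr(child, "items"):  # a group: descend into it
            found.extend(list_datasets(child, path))
        else:  # a dataset: record its metadata only, without reading the data
            found.append((path, getattr(child, "shape", None), getattr(child, "dtype", None)))
    return found


def summarize_hdf5(path):
    """Open an HDF5 file read-only and list its datasets (assumes h5py is installed)."""
    import h5py  # imported lazily: optional dependency

    with h5py.File(path, "r") as f:
        return list_datasets(f)
```

With a downloaded file, `summarize_hdf5("downloaded_file.h5")` (a placeholder file name) would return one tuple per stored array, which is a quick way to check the contents against the published field specification.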

  4. c4-en-10k

    • huggingface.co
    • opendatalab.com
    Updated Jun 11, 2025
    Cite
    Stas Bekman (2025). c4-en-10k [Dataset]. https://huggingface.co/datasets/stas/c4-en-10k
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 11, 2025
    Authors
    Stas Bekman
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This is a small subset representing the first 10K records of the original C4 dataset, "en" subset - created for testing. The records were extracted after having been shuffled.

    The full 1TB+ dataset is at https://huggingface.co/datasets/c4.
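The "shuffle, then keep the first N records" idea behind a small test subset can be sketched in plain Python. This is an illustration only, not the procedure the dataset author used; `shuffled_head` and its `seed` parameter are hypothetical.

```python
import random


def shuffled_head(records, n, seed=0):
    """Shuffle a copy of the records reproducibly, then keep the first n."""
    rng = random.Random(seed)  # a private generator, so global random state is untouched
    shuffled = list(records)   # copy, so the caller's sequence keeps its order
    rng.shuffle(shuffled)
    return shuffled[:n]
```

Fixing the seed makes the subset reproducible, which matters when the same small sample is reused across test runs.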

  5. C4 0140 Dataset

    • universe.roboflow.com
    zip
    Updated Jan 23, 2025
    Cite
    DOE4C41 (2025). C4 0140 Dataset [Dataset]. https://universe.roboflow.com/doe4c41/c4-0140
    Available download formats: zip
    Dataset updated
    Jan 23, 2025
    Dataset authored and provided by
    DOE4C41
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    JF2010 01D Bounding Boxes
    Description

    C4 0140

    ## Overview
    
    C4 0140 is a dataset for object detection tasks - it contains JF2010 01D annotations for 395 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  6. c4_wsrs

    • tensorflow.org
    Updated Dec 22, 2022
    Cite
    (2022). c4_wsrs [Dataset]. https://www.tensorflow.org/datasets/catalog/c4_wsrs
    Dataset updated
    Dec 22, 2022
    Description

    A medical abbreviation expansion dataset which applies web-scale reverse substitution (wsrs) to the C4 dataset, which is a colossal, cleaned version of Common Crawl's web crawl corpus.

    The original source is the Common Crawl dataset: https://commoncrawl.org

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('c4_wsrs', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.

  7. Sl Abhi C4 Dataset

    • universe.roboflow.com
    zip
    Updated Nov 1, 2023
    Cite
    abhi sl (2023). Sl Abhi C4 Dataset [Dataset]. https://universe.roboflow.com/abhi-sl/sl-abhi-c4/dataset/1
    Available download formats: zip
    Dataset updated
    Nov 1, 2023
    Dataset authored and provided by
    abhi sl
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Face Bounding Boxes
    Description

    SL Abhi C4

    ## Overview
    
    SL Abhi C4 is a dataset for object detection tasks - it contains Face annotations for 371 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  8. 'c1''c2''c3' 'c4' 'c5' 'c6' Dataset

    • universe.roboflow.com
    zip
    Updated Dec 6, 2023
    Cite
    MMU (2023). 'c1''c2''c3' 'c4' 'c5' 'c6' Dataset [Dataset]. https://universe.roboflow.com/mmu-ncfyw/-c1-c2-c3-c4-c5-c6
    Available download formats: zip
    Dataset updated
    Dec 6, 2023
    Dataset authored and provided by
    MMU
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Variables measured
    Car Masks
    Description

    'C1''C2''C3' 'C4' 'C5' 'C6'

    ## Overview
    
    'C1''C2''C3' 'C4' 'C5' 'C6' is a dataset for semantic segmentation tasks - it contains Car annotations for 1,000 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [Public Domain license](https://creativecommons.org/publicdomain/zero/1.0/).
    
  9. Data from: mC4 Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Jun 8, 2022
    Cite
    Linting Xue; Noah Constant; Adam Roberts; Mihir Kale; Rami Al-Rfou; Aditya Siddhant; Aditya Barua; Colin Raffel (2022). mC4 Dataset [Dataset]. https://paperswithcode.com/dataset/mc4
    Dataset updated
    Jun 8, 2022
    Authors
    Linting Xue; Noah Constant; Adam Roberts; Mihir Kale; Rami Al-Rfou; Aditya Siddhant; Aditya Barua; Colin Raffel
    Description

    mC4 is a multilingual variant of the C4 dataset. It comprises natural text in 101 languages drawn from the public Common Crawl web scrape.

  10. C4 0125(cy1793 02d) Dataset

    • universe.roboflow.com
    zip
    Updated Feb 17, 2024
    Cite
    DOE2C41 (2024). C4 0125(cy1793 02d) Dataset [Dataset]. https://universe.roboflow.com/doe2c41/c4-0125-cy1793-02d/dataset/4
    Available download formats: zip
    Dataset updated
    Feb 17, 2024
    Dataset authored and provided by
    DOE2C41
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    C4 0125 Bounding Boxes
    Description

    C4 0125(CY1793 02D)

    ## Overview
    
    C4 0125(CY1793 02D) is a dataset for object detection tasks - it contains C4 0125 annotations for 989 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  11. Bus_nocard Dataset

    • universe.roboflow.com
    zip
    Updated Dec 18, 2023
    Cite
    c4 (2023). Bus_nocard Dataset [Dataset]. https://universe.roboflow.com/c4-uuwss/bus_nocard/model/1
    Available download formats: zip
    Dataset updated
    Dec 18, 2023
    Dataset authored and provided by
    c4
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Bus GzEj Bounding Boxes
    Description

    BUS_nocard

    ## Overview
    
    BUS_nocard is a dataset for object detection tasks - it contains Bus GzEj annotations for 3,603 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  12. M3 Fruit Classification Dataset

    • universe.roboflow.com
    zip
    Updated Sep 5, 2024
    Cite
    C4 4078 Fruit and Veg (2024). M3 Fruit Classification Dataset [Dataset]. https://universe.roboflow.com/c4-4078-fruit-and-veg-ldxid/m3-fruit-classification/dataset/2
    Available download formats: zip
    Dataset updated
    Sep 5, 2024
    Dataset authored and provided by
    C4 4078 Fruit and Veg
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Fruits Bounding Boxes
    Description

    M3 Fruit Classification

    ## Overview
    
    M3 Fruit Classification is a dataset for object detection tasks - it contains Fruits annotations for 2,306 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    


C4 Dataset

Colossal Clean Crawled Corpus

Explore at:
6 scholarly articles cite this dataset (View in Google Scholar)
