100+ datasets found
  1. small-c4-dataset

    • huggingface.co
    Updated May 31, 2025
    Cite
    Brando Miranda (2025). small-c4-dataset [Dataset]. https://huggingface.co/datasets/brando/small-c4-dataset
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 31, 2025
    Authors
    Brando Miranda
    Description

    Dataset Card for Small C4 Dataset (10k Train, 10k Validation, 10k Test)

      Dataset Summary
    

    The Small C4 Dataset is a reduced version of the original C4 dataset (Colossal Clean Crawled Corpus), designed to facilitate lightweight experimentation and model training without the need to process the full C4 dataset. This dataset includes:

    10,000 examples for training, 10,000 examples for validation, and 10,000 examples for testing.

    Each example consists of a single text passage… See the full description on the dataset page: https://huggingface.co/datasets/brando/small-c4-dataset.

  2. c4_wsrs

    • tensorflow.org
    Updated Dec 22, 2022
    Cite
    (2022). c4_wsrs [Dataset]. https://www.tensorflow.org/datasets/catalog/c4_wsrs
    Explore at:
    Dataset updated
    Dec 22, 2022
    Description

    A medical abbreviation expansion dataset which applies web-scale reverse substitution (wsrs) to the C4 dataset, which is a colossal, cleaned version of Common Crawl's web crawl corpus.
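As a rough illustration of the idea behind reverse substitution, the sketch below replaces known expansions in ordinary text with their abbreviations, producing an (abbreviated, original) pair. The dictionary entries, the function name `reverse_substitute`, and the pair field names are hypothetical and are not taken from the dataset; the actual pipeline may differ.

```python
import re

# Hypothetical expansion -> abbreviation dictionary (not taken from c4_wsrs).
EXPANSION_TO_ABBREVIATION = {
    "blood pressure": "bp",
    "patient": "pt",
}


def reverse_substitute(text, mapping):
    """Replace each known expansion in `text` with its abbreviation."""
    # Try longer expansions first so multi-word phrases win over substrings.
    phrases = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: mapping[m.group(0).lower()], text)


original = "The patient has high blood pressure."
abbreviated = reverse_substitute(original, EXPANSION_TO_ABBREVIATION)

# Field names here are illustrative only.
pair = {"abbreviated": abbreviated, "original": original}
print(pair)
```

A model trained on such pairs sees the abbreviated text as input and the original text as the target expansion.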

    The original source is the Common Crawl dataset: https://commoncrawl.org

    To use this dataset:

    import tensorflow_datasets as tfds

    # Load the training split and print the first four examples.
    ds = tfds.load('c4_wsrs', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.

  3. c4-bbc-news

    • huggingface.co
    Updated Jan 6, 2025
    Cite
    Louis Maddox (2025). c4-bbc-news [Dataset]. https://huggingface.co/datasets/permutans/c4-bbc-news
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 6, 2025
    Authors
    Louis Maddox
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    Dataset Card for BBC News from C4

    This dataset provides a filtered subset of BBC News articles from the realnewslike subset of the C4 dataset, containing approximately 77k articles from BBC News domains.
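A filter of this kind can be sketched as a hostname check on each record's URL. This is only an illustration: the domain list `BBC_DOMAINS`, the helper `is_bbc_url`, and the sample records are assumptions, and the dataset card excerpt does not state which domains or code were actually used. The sketch assumes each record carries `url` and `text` fields.

```python
from urllib.parse import urlparse

# Assumed domain list; the excerpt above does not enumerate the exact domains.
BBC_DOMAINS = ("bbc.co.uk", "bbc.com")


def is_bbc_url(url):
    """Return True if the URL's host is a BBC domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in BBC_DOMAINS)


# Made-up records shaped like C4 rows (assumed fields: "url", "text").
records = [
    {"url": "https://www.bbc.co.uk/news/example-1", "text": "..."},
    {"url": "https://example.com/story", "text": "..."},
    {"url": "http://bbc.com/news/example-2", "text": "..."},
]

bbc_records = [r for r in records if is_bbc_url(r["url"])]
print(len(bbc_records))
```

Matching on the parsed hostname rather than a substring avoids false positives such as a URL that merely contains "bbc.com" in its path.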

      Dataset Details

      Dataset Sources
    

    Repository: https://huggingface.co/datasets/permutans/c4-bbc-news
    Source Dataset: allenai/c4 (realnewslike subset)
    Paper: https://arxiv.org/abs/1910.10683 (C4 paper)

      Uses

      Direct Use
    

    Suitable for text… See the full description on the dataset page: https://huggingface.co/datasets/permutans/c4-bbc-news.

  4. Cerebellum cell type collaboration database

    • rdr.ucl.ac.uk
    • produccioncientifica.ugr.es
    bin
    Updated Mar 4, 2025
    Cite
    Maxime Beau; David Herzfeld; Francisco Naveros; Marie Hemelt; Federico D'Agostino; Marlies Oostland; Alvaro Sánchez-López; Young Yoon Chung; Michael Maibach; Stephen Kyranakis; Hannah N. Stabb; Gabriela Martínez Lopera; Agoston Lajko; Marie Zedler; Shogo Ohmae; Nathan Hall; Beverley Clark; Dana Cohen; Stephen Lisberger; Dimitar Kostadinov; Court Hull; Michael Hausser; Javier Medina (2025). Cerebellum cell type collaboration database [Dataset]. http://doi.org/10.5522/04/23702850.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    Mar 4, 2025
    Dataset provided by
    University College London
    Authors
    Maxime Beau; David Herzfeld; Francisco Naveros; Marie Hemelt; Federico D'Agostino; Marlies Oostland; Alvaro Sánchez-López; Young Yoon Chung; Michael Maibach; Stephen Kyranakis; Hannah N. Stabb; Gabriela Martínez Lopera; Agoston Lajko; Marie Zedler; Shogo Ohmae; Nathan Hall; Beverley Clark; Dana Cohen; Stephen Lisberger; Dimitar Kostadinov; Court Hull; Michael Hausser; Javier Medina
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The C4 Database

    This is the official repository for the hdf5 datasets of the cerebellar cell-type classification collaboration (C4), published as a companion to the paper "A deep-learning strategy to identify cell types across species from high-density extracellular recordings" published in Cell (https://doi.org/10.1016/j.cell.2025.01.041).

    Instructions to use the cell-type classifier, links to download these datasets, and a data explorer can be found at https://www.c4-database.com.

    The specifications of the fields, data types and data formats stored in the hdf5 binary files can be found at https://www.tinyurl.com/c4database. Hdf5 files can be easily opened with Python, MATLAB and many other programming languages.

    Using and Citing the C4 Database

    The data and visualizations on this website are intended to be freely available for use by the scientific community. The C4 dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, while our classifier is licensed under the GNU General Public License v3.0 as part of NeuroPyxels. If you download and use our data for a publication, and/or if you would like to refer to the database, please cite Beau et al., 2025, Cell together with the NeuroPyxels repository (Beau et al., 2021, Zenodo), and include the link to the C4 online portal https://www.c4-database.com in your methods section. Thank you!

  5. allenai-c4

    • huggingface.co
    Updated Apr 26, 2019
    Cite
    Amanpreet Singh (2019). allenai-c4 [Dataset]. https://huggingface.co/datasets/amanpreet7/allenai-c4
    Explore at:
    Dataset updated
    Apr 26, 2019
    Authors
    Amanpreet Singh
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    🧠 ALLENAI C4 - English Train Split (Prepared Version)

    This repository contains the preprocessed and ready-to-use version of the ALLENAI C4 (Colossal Clean Crawled Corpus) English train split. It has been downloaded and optionally transformed for downstream NLP tasks such as pretraining large language models or text-based retrieval systems.

    📦 Dataset Details

    • Original Source: allenai/c4
    • Language: English (en)
    • Split: train
    • License: Google C4 License

    ⚠️ Note: This version only includes the train… See the full description on the dataset page: https://huggingface.co/datasets/amanpreet7/allenai-c4.

  6. c4-subsets

    • huggingface.co
    Cite
    datablations, c4-subsets [Dataset]. https://huggingface.co/datasets/datablations/c4-subsets
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset authored and provided by
    datablations
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    Dataset Summary

    Various subsets of C4 with different numbers of tokens measured with the GPT2Tokenizer. This data is used in the paper Scaling Data-Constrained Language Models. Please refer to our GitHub repository for more details.

    @article{muennighoff2023scaling, title={Scaling Data-Constrained Language Models}, author={Muennighoff, Niklas and Rush, Alexander M and Barak, Boaz and Scao, Teven Le and Piktus, Aleksandra and Tazi, Nouamane and Pyysalo, Sampo and Wolf, Thomas and… See the full description on the dataset page: https://huggingface.co/datasets/datablations/c4-subsets.

  7. C4_200M

    • kaggle.com
    Updated Nov 13, 2021
    Cite
    A0155991R_Li Liwei (2021). C4_200M [Dataset]. https://www.kaggle.com/datasets/a0155991rliwei/c4-200m
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 13, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    A0155991R_Li Liwei
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context

    Grammar Error Correction dataset synthesized based on: https://github.com/google-research-datasets/C4_200M-synthetic-dataset-for-grammatical-error-correction

    Content

    This dataset contains roughly 185 million sentence pairs generated from the C4/en/3.0.1 dataset.

    The data is stored in the format: { "input": "This is an grammatically wrong sentences.", "output": "This is a grammatically correct sentence." }
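Records in this shape can be read with the standard `json` module. The sketch below assumes one JSON object per line (a JSON Lines layout), which the description does not confirm; the helper name `read_pairs` is hypothetical, and the sample line simply reuses the example record above.

```python
import json

# One record per line is an assumption about the on-disk layout.
raw_lines = [
    '{"input": "This is an grammatically wrong sentences.", '
    '"output": "This is a grammatically correct sentence."}',
]


def read_pairs(lines):
    """Parse JSON records into (ungrammatical, corrected) sentence pairs."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        pairs.append((record["input"], record["output"]))
    return pairs


pairs = read_pairs(raw_lines)
for source, target in pairs:
    print(f"{source!r} -> {target!r}")
```

For a real file, the same function can be given an open file handle, since iterating over a text file yields its lines.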

    Acknowledgements

    The C4 dataset was downloaded from allenai: https://github.com/allenai/allennlp/discussions/5056. The modified scripts used to generate the sentence pairs were based on: https://github.com/google-research-datasets/C4_200M-synthetic-dataset-for-grammatical-error-correction.

    Inspiration

    We hope that this dataset will save others the trouble and time of generating it themselves.

  8. Sl Abhi C4 Dataset

    • universe.roboflow.com
    zip
    Updated Nov 1, 2023
    Cite
    abhi sl (2023). Sl Abhi C4 Dataset [Dataset]. https://universe.roboflow.com/abhi-sl/sl-abhi-c4/dataset/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 1, 2023
    Dataset authored and provided by
    abhi sl
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Face Bounding Boxes
    Description

    SL Abhi C4

    ## Overview
    
    SL Abhi C4 is a dataset for object detection tasks - it contains Face annotations for 371 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License

    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  9. c4-en-10k

    • opendatalab.com
    • huggingface.co
    zip
    Updated Dec 31, 2023
    Cite
    Google (2023). c4-en-10k [Dataset]. https://opendatalab.com/OpenDataLab/c4-en-10k
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 31, 2023
    Dataset provided by
    Google (http://google.com/)
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This is a small subset representing the first 10K records of the original C4 dataset, "en" subset - created for testing. The records were extracted after having been shuffled.
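One way to read "the first 10K records, extracted after shuffling" is sketched below: shuffle the records, then keep the first N. The function `shuffled_head`, the fixed seed, and the toy corpus are assumptions for illustration, not the procedure actually used to build this subset.

```python
import random


def shuffled_head(records, n, seed=0):
    """Shuffle a copy of `records` and return its first `n` items."""
    rng = random.Random(seed)  # a fixed seed keeps the subset reproducible
    shuffled = list(records)   # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    return shuffled[:n]


# Toy stand-in for the corpus; the real subset keeps 10,000 records.
corpus = [f"document {i}" for i in range(100)]
subset = shuffled_head(corpus, 10)
print(len(subset))
```

Shuffling before truncating matters when the source is ordered (for example by crawl time or domain), because taking the head of an ordered corpus would give an unrepresentative sample.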

  10. c4-subsets

    • huggingface.co
    Cite
    DatologyAI, c4-subsets [Dataset]. https://huggingface.co/datasets/DatologyAI/c4-subsets
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset authored and provided by
    DatologyAI
    Description

    DatologyAI/c4-subsets dataset hosted on Hugging Face and contributed by the HF Datasets community

  11. PAN-00019233 - Late Medieval/modern spoon C4 - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Sep 10, 2019
    Cite
    (2019). PAN-00019233 - Late Medieval/modern spoon C4 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/9155569e-9ffd-581f-ad18-ad3297c779ef
    Explore at:
    Dataset updated
    Sep 10, 2019
    Description

    This find is registered at Portable Antiquities of the Netherlands with number PAN-00019233

  12. redpajama-c4-refined-by-data-juicer

    • huggingface.co
    Updated Apr 12, 2017
    Cite
    Data-Juicer (2017). redpajama-c4-refined-by-data-juicer [Dataset]. https://huggingface.co/datasets/datajuicer/redpajama-c4-refined-by-data-juicer
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 12, 2017
    Dataset authored and provided by
    Data-Juicer
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    RedPajama -- C4 (refined by Data-Juicer)

    A refined version of the C4 dataset in RedPajama, produced by Data-Juicer. Some "bad" samples were removed from the original dataset to raise its quality. This dataset is typically used to pretrain large language models. Notice: Here is a small subset for previewing. The whole dataset is available here (about 832GB).

      Dataset Information
    

    Number of samples: 344,491,171 (keeps ~94.42% of the original dataset)
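From these two figures the size of the original dataset can be estimated by dividing the kept count by the kept fraction. The result is approximate, because the percentage is rounded to two decimals; the variable names below are illustrative.

```python
kept_samples = 344_491_171
kept_fraction = 0.9442  # "~94.42%" as stated; rounded, so results are estimates

estimated_original = round(kept_samples / kept_fraction)
estimated_removed = estimated_original - kept_samples

print(f"estimated original samples: {estimated_original:,}")
print(f"estimated removed samples:  {estimated_removed:,}")
```

Under this estimate the original dataset held roughly 365 million samples, of which about 20 million were removed.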

      Refining Recipe
    

    … See the full description on the dataset page: https://huggingface.co/datasets/datajuicer/redpajama-c4-refined-by-data-juicer.

  13. mC4 Dataset

    • library.toponeai.link
    • opendatalab.com
    Updated Jun 8, 2022
    Cite
    Linting Xue; Noah Constant; Adam Roberts; Mihir Kale; Rami Al-Rfou; Aditya Siddhant; Aditya Barua; Colin Raffel (2022). mC4 Dataset [Dataset]. https://library.toponeai.link/dataset/mc4
    Explore at:
    Dataset updated
    Jun 8, 2022
    Authors
    Linting Xue; Noah Constant; Adam Roberts; Mihir Kale; Rami Al-Rfou; Aditya Siddhant; Aditya Barua; Colin Raffel
    Description

    mC4 is a multilingual variant of the C4 dataset. It comprises natural text in 101 languages drawn from the public Common Crawl web scrape.

  14. PAN-00000938 - Late Medieval/modern spoon C4 - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Sep 10, 2019
    Cite
    (2019). PAN-00000938 - Late Medieval/modern spoon C4 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/bd7422a0-5c14-538a-b23a-a8a975ddb983
    Explore at:
    Dataset updated
    Sep 10, 2019
    Description

    This find is registered at Portable Antiquities of the Netherlands with number PAN-00000938

  15. c4-tiny

    • huggingface.co
    Updated Apr 26, 2019
    Cite
    Prime Intellect (2019). c4-tiny [Dataset]. https://huggingface.co/datasets/PrimeIntellect/c4-tiny
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 26, 2019
    Dataset authored and provided by
    Prime Intellect
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    C4 tiny

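    This dataset is a very small subset of https://huggingface.co/datasets/allenai/c4 that can be used for testing without having to download the full C4 dataset.

    To use:

    from datasets import load_dataset

    # ignore_verifications is the flag given on the dataset card; recent releases of
    # the datasets library replace it with verification_mode="no_checks".
    dataset = load_dataset("PrimeIntellect/c4-tiny", "en", ignore_verifications=True)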

  16. C4 0125 Dataset

    • universe.roboflow.com
    zip
    Updated Apr 10, 2024
    Cite
    DOE3C41 (2024). C4 0125 Dataset [Dataset]. https://universe.roboflow.com/doe3c41/c4-0125-7ia9o
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 10, 2024
    Dataset authored and provided by
    DOE3C41
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    CY 1865 Bounding Boxes
    Description

    C4 0125

    ## Overview
    
    C4 0125 is a dataset for object detection tasks - it contains CY 1865 annotations for 444 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License

    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  17. Data from: ISLSCP II C4 Vegetation Percentage

    • data.nasa.gov
    • s.cnmilf.com
    • +6 more
    Updated Apr 1, 2025
    Cite
    nasa.gov (2025). ISLSCP II C4 Vegetation Percentage [Dataset]. https://data.nasa.gov/dataset/islscp-ii-c4-vegetation-percentage-061c0
    Explore at:
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    The photosynthetic composition (C3 or C4) of vegetation on the land surface is essential for accurate simulations of biosphere-atmosphere exchanges of carbon, water, and energy. C3 and C4 plants have different responses to light, temperature, CO2, and nitrogen; they also differ in physiological functions like stomatal conductance and isotope fractionation. A fine-scale distribution of these plant types is essential for earth science modeling.

    The C4 percentage is determined from datasets that describe the continuous distribution of plant growth forms (i.e., the percent of a grid cell covered by herbaceous or woody vegetation), climate classifications, the fraction of a grid cell covered in croplands, and national crop type harvest area statistics. The staff from the International Satellite Land Surface Climatology Project (ISLSCP) Initiative II have made the original data set consistent with the ISLSCP-2 land/water mask. This data set contains a single file in ArcInfo ASCIIGRID format.

    This data set is one of the products of the International Satellite Land-Surface Climatology Project, Initiative II (ISLSCP II) data collection which contains 50 global time series data sets for the ten-year period 1986 to 1995. Selected data sets span even longer periods. ISLSCP II is a consistent collection of data sets that were compiled from existing data sources and algorithms, and were designed to satisfy the needs of modelers and investigators of the global carbon, water and energy cycle. The data were acquired from a number of U.S. and international agencies, universities, and institutions. The global data sets were mapped at consistent spatial (1, 0.5 and 0.25 degrees) and temporal (monthly, with meteorological data at finer (e.g., 3-hour)) resolutions and reformatted into a common ASCII format. The data and documentation have undergone two peer reviews.

    ISLSCP is one of several projects of Global Energy and Water Cycle Experiment (GEWEX) [http://www.gewex.org/] and has the lead role in addressing land-atmosphere interactions -- process modeling, data retrieval algorithms, field experiment design and execution, and the development of global data sets.
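An ArcInfo ASCII grid is a plain-text file with a short key/value header (typically `ncols`, `nrows`, `xllcorner`, `yllcorner`, `cellsize`, `NODATA_value`) followed by rows of cell values. The sketch below parses that layout with the standard library only. The function name `parse_ascii_grid` and the tiny sample grid are made up for illustration and do not reproduce the actual ISLSCP II file.

```python
def parse_ascii_grid(text):
    """Parse ArcInfo ASCII grid text into (header dict, rows of values)."""
    lines = text.strip().splitlines()
    header = {}
    row_start = 0
    for i, line in enumerate(lines):
        parts = line.split()
        # Header lines look like "<name> <number>"; data rows start with a number.
        if len(parts) == 2 and parts[0][0].isalpha():
            header[parts[0].lower()] = float(parts[1])
            row_start = i + 1
        else:
            break
    nodata = header.get("nodata_value")
    grid = []
    for line in lines[row_start:]:
        row = [float(v) for v in line.split()]
        # Replace the NODATA marker with None so it cannot be mistaken for data.
        grid.append([None if v == nodata else v for v in row])
    return header, grid


# Made-up 3x2 grid of C4 percentages; -99 marks cells with no data.
sample = """ncols 3
nrows 2
xllcorner -180.0
yllcorner -90.0
cellsize 1.0
NODATA_value -99
-99 0 25
50 75 100
"""

header, grid = parse_ascii_grid(sample)
print(header["ncols"], header["nrows"], grid)
```

For real analysis a raster library is usually more convenient, but the format is simple enough that a direct parse like this is often sufficient for a quick look.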

  18. PAN-00016617 - early medieval figurative disc brooch (eye-hook) variant C4 -...

    • b2find.eudat.eu
    Updated Jul 15, 2019
    Cite
    (2019). PAN-00016617 - early medieval figurative disc brooch (eye-hook) variant C4 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/c916e873-789e-57aa-a2b7-18efc813b392
    Explore at:
    Dataset updated
    Jul 15, 2019
    Description

    This find is registered at Portable Antiquities of the Netherlands with number PAN-00016617

  19. PAN-00000844 - early medieval figurative disc brooch (eye-hook) variant C4 -...

    • b2find.eudat.eu
    Updated Apr 18, 2019
    Cite
    (2019). PAN-00000844 - early medieval figurative disc brooch (eye-hook) variant C4 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/6bc51cda-79f7-565c-b033-bf5693ccd86c
    Explore at:
    Dataset updated
    Apr 18, 2019
    Description

    This find is registered at Portable Antiquities of the Netherlands with number PAN-00000844

  20. PAN-00143187 - early medieval figurative disc brooch (eye-hook) variant C4 -...

    • b2find.eudat.eu
    Updated Oct 25, 2024
    Cite
    (2024). PAN-00143187 - early medieval figurative disc brooch (eye-hook) variant C4 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/fa6c2b74-0e73-5dbe-91bd-429a3fafbafd
    Explore at:
    Dataset updated
    Oct 25, 2024
    Description

    This find is registered at Portable Antiquities of the Netherlands with number PAN-00143187
