41 datasets found
  1. allenai-c4

    • huggingface.co
    Updated Apr 26, 2019
    Cite
    Amanpreet Singh (2019). allenai-c4 [Dataset]. https://huggingface.co/datasets/amanpreet7/allenai-c4
    Explore at:
    Dataset updated
    Apr 26, 2019
    Authors
    Amanpreet Singh
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    🧠 ALLENAI C4 - English Train Split (Prepared Version) This repository contains the preprocessed and ready-to-use version of the ALLENAI C4 (Colossal Clean Crawled Corpus) English train split. It has been downloaded and optionally transformed for downstream NLP tasks such as pretraining large language models or text-based retrieval systems. 📦 Dataset Details Original Source: allenai/c4 Language: English (en) Split: train License: Google C4 License ⚠️ Note: This version only includes the train… See the full description on the dataset page: https://huggingface.co/datasets/amanpreet7/allenai-c4.

  2. c4-parquert-train-30-shards

    • huggingface.co
    Updated Apr 26, 2019
    Cite
    Artem Zabolotnyi (2019). c4-parquert-train-30-shards [Dataset]. https://huggingface.co/datasets/zaaabik/c4-parquert-train-30-shards
    Explore at:
    Dataset updated
    Apr 26, 2019
    Authors
    Artem Zabolotnyi
    Description

    zaaabik/c4-parquert-train-30-shards dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. C4 200M Grammar Error Correction dataset

    • kaggle.com
    zip
    Updated Apr 18, 2023
    Cite
    Dario Cioni (2023). C4 200M Grammar Error Correction dataset [Dataset]. https://www.kaggle.com/datasets/dariocioni/c4200m/discussion
    Explore at:
    Available download formats: zip (15601869562 bytes)
    Dataset updated
    Apr 18, 2023
    Authors
    Dario Cioni
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Grammar Error Correction synthetic dataset consisting of 185 million sentence pairs, created by running a Tagged Corruption model on Google's C4 dataset.

    This version of the dataset was extracted from Li Liwei's Hugging Face dataset (https://huggingface.co/datasets/liweili/c4_200m) and converted to TSV format.

    The corruption edits by Felix Stahlberg and Shankar Kumar are licensed under CC BY 4.0. The C4 dataset was released by AllenAI under the terms of ODC-BY. By using this dataset, you are also bound by the Common Crawl terms of use with respect to the content contained in it.

    Format

    This dataset is converted to Parquet format, but a TSV version is available in previous versions. The reason for the conversion was poor performance when accessing the individual files. I'm open to requests and suggestions on how to better handle such a big dataset.

    The TSV version is split into 10 files of approximately 18M samples each. Each sample is a pair consisting of the incorrect and the corrected sentence:

    | Incorrect | Corrected |
    | ------------- |:-------------:|
    | Much many brands and sellers still in the market. | Many brands and sellers still in the market. |
    | She likes playing in park and come here every week | She likes playing in the park and comes here every week |
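As a minimal sketch of consuming this pair format (assuming headerless, two-column TSV shards; the two sample rows below come from the table above, while the real files hold roughly 18M rows each, so in practice you would stream rather than load everything into memory):

```python
# Sketch: reading a GEC TSV shard into (incorrect, corrected) sentence pairs.
# Assumes a headerless two-column tab-separated layout.
import csv
import io

# Inline stand-in for one of the real TSV files.
tsv_data = (
    "Much many brands and sellers still in the market.\t"
    "Many brands and sellers still in the market.\n"
    "She likes playing in park and come here every week\t"
    "She likes playing in the park and comes here every week\n"
)

def read_pairs(fileobj):
    """Yield (incorrect, corrected) pairs from a headerless TSV stream."""
    for row in csv.reader(fileobj, delimiter="\t"):
        if len(row) == 2:
            yield row[0], row[1]

pairs = list(read_pairs(io.StringIO(tsv_data)))
print(len(pairs))  # 2
```

Replacing the `io.StringIO` stand-in with `open("shard_0.tsv")` would stream a real shard one row at a time.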

    Usage

    I'm planning to release a notebook showing Grammar Error Correction with a seq2seq architecture based on BERT and LSTM. Until then, you can try to build your own model!

    This dataset can be used to train sequence-to-sequence models based on an encoder-decoder approach.
    The task is quite similar to NMT; here are some tutorials:
    - NLP from scratch: translation with a seq2seq network and attention
    - Language Translation with nn.Transformer and TorchText

    [Image: Grammar Error Correction example (https://production-media.paperswithcode.com/tasks/gec_foTfIZW.png)]

    Acknowledgments

    Thanks to the dataset creators Felix Stahlberg and Shankar Kumar and to Li Liwei for first giving access to the processed dataset.

  4. bodo-c4-train-0000

    • huggingface.co
    Updated Sep 20, 2025
    Cite
    akshit kumar (2025). bodo-c4-train-0000 [Dataset]. https://huggingface.co/datasets/komikat/bodo-c4-train-0000
    Explore at:
    Dataset updated
    Sep 20, 2025
    Authors
    akshit kumar
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    komikat/bodo-c4-train-0000 dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. dhbk_hb_model_cvit siamese gei_210_225 v1.3 c4

    • kaggle.com
    Updated Aug 1, 2025
    + more versions
    Cite
    Le Hoang Long (2025). dhbk_hb_model_cvit siamese gei_210_225 v1.3 c4 [Dataset]. https://www.kaggle.com/datasets/lehoanglonglong/dhbk-hb-model-cvit-siamese-gei-210-225-v1-3-c4/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 1, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Le Hoang Long
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Le Hoang Long

    Released under Apache 2.0

    Contents

  6. FERNET-C5

    • live.european-language-grid.eu
    Updated Sep 19, 2021
    Cite
    (2021). FERNET-C5 [Dataset]. https://live.european-language-grid.eu/catalogue/ld/18258
    Explore at:
    Dataset updated
    Sep 19, 2021
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    FERNET-C5 is a monolingual BERT language representation model trained from scratch on the Czech Colossal Clean Crawled Corpus (C5) data, a Czech mutation of the English C4 dataset. The training data contained almost 13 billion words (93 GB of text data). The model has the same architecture as the original BERT model, i.e. 12 transformer blocks, 12 attention heads, and a hidden size of 768. In contrast to Google's BERT models, we used SentencePiece tokenization instead of Google's internal WordPiece tokenization. More details can be found in README.txt; a yet more detailed description is available at https://arxiv.org/abs/2107.10042

    The same models are also released at https://huggingface.co/fav-kky/FERNET-C5

  7. C4 kōan CBOW embeddings

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 1, 2021
    Cite
    Irsoy, Ozan; Benton, Adrian; Stratos, Karl (2021). C4 kōan CBOW embeddings [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_5542318
    Explore at:
    Dataset updated
    Oct 1, 2021
    Dataset provided by
    Rutgers University
    Bloomberg
    Authors
    Irsoy, Ozan; Benton, Adrian; Stratos, Karl
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These are 2 million 768-dimensional and 300-dimensional CBOW embeddings trained on the English colossal, cleaned common crawl (C4) corpus. They were trained with the corrected CBOW code from kōan:

    https://github.com/bloomberg/koan

    with intrinsic evaluation reported in:

    Ozan İrsoy, Adrian Benton, Karl Stratos. “Corrected CBOW Performs as well as Skip-gram”. The 2nd Workshop on Insights from Negative Results in NLP. 2021.
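A minimal sketch of loading such embeddings, assuming the files use the common word2vec-style text layout (one token followed by its vector components per line); the on-disk format of the kōan release is an assumption, so adjust if the files are binary or gzipped:

```python
# Sketch: parsing word2vec-style text embeddings (token + floats per line).
# The two-line sample stands in for the real 2M-word, 300/768-dim files.
import io

sample = "the 0.1 0.2 0.3\nof 0.4 0.5 0.6\n"

def load_text_vectors(fileobj):
    """Return (vocab list, list of float vectors) from a text embedding file."""
    vocab, rows = [], []
    for line in fileobj:
        parts = line.split()
        if not parts:
            continue
        vocab.append(parts[0])
        rows.append([float(x) for x in parts[1:]])
    return vocab, rows

vocab, vecs = load_text_vectors(io.StringIO(sample))
print(len(vocab), len(vecs[0]))  # 2 3
```

For the real files, pass `open("embeddings.txt", encoding="utf-8")` instead of the `StringIO` stand-in.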
    
  8. Data from: Estimating global GPP from the plant functional type perspective...

    • search.dataone.org
    • data.niaid.nih.gov
    • +1more
    Updated Jul 16, 2025
    Cite
    Renjie Guo; Tiexi Chen; Xin Chen; Wenping Yuan; Shuci Liu; Bin He; Lin Li; Shengzhen Wang; Ting Hu; Qingyun Yan; Xueqiong Wei; Jie Dai (2025). Estimating global GPP from the plant functional type perspective using a machine learning approach [Dataset]. http://doi.org/10.5061/dryad.dncjsxm2v
    Explore at:
    Dataset updated
    Jul 16, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Renjie Guo; Tiexi Chen; Xin Chen; Wenping Yuan; Shuci Liu; Bin He; Lin Li; Shengzhen Wang; Ting Hu; Qingyun Yan; Xueqiong Wei; Jie Dai
    Time period covered
    Mar 28, 2023
    Description

    The long-term monitoring of gross primary production (GPP) is crucial to the assessment of the carbon cycle of terrestrial ecosystems. In this study, a well-known machine learning model (Random Forest, RF) is established to reconstruct the global GPP dataset named ECGC_GPP. The model distinguished nine plant functional types, including C3 and C4 crops, using eddy fluxes, meteorological variables, and leaf area index as training data for the RF model. Based on ERA5_Land and the corrected GEOV2 data, the global monthly GPP dataset at a 0.05-degree resolution from 1999 to 2019 was estimated. The results showed that the RF model could explain 74.81% of the monthly variation of GPP in the testing dataset, of which the average contribution of leaf area index (LAI) reached 41.73%. The average annual GPP and its standard deviation during 1999–2019 were 117.14 ± 1.51 Pg C yr-1, with an upward trend of 0.21 Pg C yr-2 (p < 0.01). By using the plant functional type classification, the underestimat...

    We unified the ERA5_Land and the corrected GEOV2 datasets to 0.05-degree and monthly scales. The meteorological and remote sensing datasets were classified by the eight PFTs to estimate the GPP of each PFT. In particular, we established site-level PFT training models for CRO_C3 and CRO_C4 separately, due to their significant differences. The CRO cells were a mixture of CRO_C3 and CRO_C4; therefore, the trained CRO_C3 and CRO_C4 models were both applied to the CRO cells and weighted by their respective proportions to generate the final GPP estimate for CRO. This was designed to improve the current underestimation of GPP over CRO_C4-dominated regions. In this way, we generated a 0.05-degree, monthly global GPP dataset (ECGC_GPP) from 1999 to 2019.

    The ECGC_GPP dataset is stored in .nc file format and can be opened using Matlab or Python.

  9. C4 (Colossal Clean Crawled Corpus)

    • opendatalab.com
    zip
    Updated Mar 9, 2023
    Cite
    Google Research (2023). C4 (Colossal Clean Crawled Corpus) [Dataset]. https://opendatalab.com/OpenDataLab/C4
    Explore at:
    Available download formats: zip (2379 bytes)
    Dataset updated
    Mar 9, 2023
    Dataset provided by
    Google Research
    Google (http://google.com/)
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    C4 is a colossal, cleaned version of Common Crawl's web crawl corpus. It is based on the Common Crawl dataset: https://commoncrawl.org. It was used to train the T5 text-to-text Transformer models. The dataset can be downloaded in a pre-processed form from allennlp.

  10. c4-pro

    • huggingface.co
    Updated Oct 10, 2024
    Cite
    GAIR-ProX (2024). c4-pro [Dataset]. https://huggingface.co/datasets/gair-prox/c4-pro
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 10, 2024
    Dataset authored and provided by
    GAIR-ProX
    License

    ODC-By: https://choosealicense.com/licenses/odc-by/

    Description

    📚 c4-pro

    ArXiv | Models | Code

    c4-pro is refined from C4 using the ProX refining framework. It contains about 40B high-quality tokens, ready for general language model pre-training.

      License
    

    c4-pro is based on C4, which is made available under an ODC-By 1.0 license; users should also abide by the CommonCrawl ToU: https://commoncrawl.org/terms-of-use/. We do not alter the license of any of the underlying data.

      Citation
    

    @article{zhou2024programming… See the full description on the dataset page: https://huggingface.co/datasets/gair-prox/c4-pro.

  11. Jesse Dodge, Maarten Sap, Ana Marasović, William Agnew, Gabriel Ilharco,...

    • service.tib.eu
    Updated Dec 3, 2024
    Cite
    Jesse Dodge, Maarten Sap, Ana Marasović, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, Matt Gardner (2024). Dataset: C4 [Dataset]. https://doi.org/10.57702/0wpldwvq. https://service.tib.eu/ldmservice/dataset/c4
    Explore at:
    Dataset updated
    Dec 3, 2024
    Description

    A dataset used for pre-training language models, containing a large collection of text documents.

  12. Training centres Business Data for Kırklareli, Turkey

    • poidata.io
    csv, json
    Updated Dec 2, 2025
    Cite
    Business Data Provider (2025). Training centres Business Data for Kırklareli, Turkey [Dataset]. https://poidata.io/report/training-centre/turkey/k%C4%B1rklareli
    Explore at:
    Available download formats: csv, json
    Dataset updated
    Dec 2, 2025
    Dataset authored and provided by
    Business Data Provider
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2025
    Area covered
    Kırklareli
    Variables measured
    Website URL, Phone Number, Review Count, Business Name, Email Address, Business Hours, Customer Rating, Business Address, Business Categories, Geographic Coordinates
    Description

    Comprehensive dataset containing 16 verified Training centre businesses in Kırklareli, Turkey with complete contact information, ratings, reviews, and location data.

  13. AIC, ΔAIC, and model weights obtained by considering the joint model set...

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Cam M. K. Rechenmacher; Michael Keating; James D. Nichols; Jonathan M. Nichols (2023). AIC, ΔAIC, and model weights obtained by considering the joint model set consisting of 6 models associated with using commitment (C2-C4) and training time (T2-T4) as the independent variables, as well as a constant (null) model, CT1. [Dataset]. http://doi.org/10.1371/journal.pone.0276762.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Cam M. K. Rechenmacher; Michael Keating; James D. Nichols; Jonathan M. Nichols
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The “constant” model is the same for the 2 independent variables.

  14. Impact of Anti-Inflammatory Medication on Task-Specific Training Efficacy...

    • odc-sci.org
    Updated Nov 29, 2023
    Cite
    Jaison Cucarian; Pamela Raposo; Antoinette Nguyen; Romana Vavrek; Abel Torres-Espin; Karim Fouad; Jaison Cucarian; Pamela Raposo; Antoinette Nguyen; Romana Vavrek; Abel Torres-Espin; Karim Fouad (2023). Impact of Anti-Inflammatory Medication on Task-Specific Training Efficacy and Functional Recovery After Unilateral Dorsal Quadrant C4 Cervical Spinal Cord Injury in Female Lewis Rats [Dataset]. http://doi.org/10.34945/F57W2G
    Explore at:
    Dataset updated
    Nov 29, 2023
    Dataset provided by
    University of Alberta Faculty of Rehabilitation Medicine (http://rehabilitation.ualberta.ca/)
    Neuroscience and Mental Health Institute, Faculty of Medicine and Dentistry, University of Alberta, Edmonton, Canada.
    Neuroscience and Mental Health Institute, Faculty of Medicine and Dentistry, University of Alberta, Edmonton, Canada. Department of Physical Therapy, Faculty of Rehabilitation Medicine, University of Alberta, Edmonton, Canada.
    Authors
    Jaison Cucarian; Pamela Raposo; Antoinette Nguyen; Romana Vavrek; Abel Torres-Espin; Karim Fouad; Jaison Cucarian; Pamela Raposo; Antoinette Nguyen; Romana Vavrek; Abel Torres-Espin; Karim Fouad
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    STUDY PURPOSE: After spinal cord injury, inflammation is involved in secondary tissue damage. However, it may also promote neuroplasticity. We have shown earlier that promoting inflammation in a chronic setting in rats can promote the efficacy of rehabilitative training in a reaching task. Here we wanted to test whether the opposite is also true: would common anti-inflammatory medications that could be given for any reason in later stages of a spinal lesion affect the efficacy of rehabilitative training in rats with unilateral incomplete cervical spinal cord injuries? DATA COLLECTED: This experiment involved two experimental cohorts, with a total of fifty-three age-matched adult female Lewis rats (cohort 1: n=29, cohort 2: n=24). The rats underwent training in a single pellet grasping (SPG) task for 5 weeks before receiving a C4 dorsolateral quadrant transection. Afterwards, the rats were randomized into groups: in the first cohort, three groups were included, SCI only (n=10), SCI + Diphenhydramine (SCI+DPH; n=10), and SCI + Methylprednisolone (SCI+MP; n=9). In the second cohort, only the SCI and SCI+DPH groups were included, each with n=12. One week after the spinal cord lesion, the rats received Diphenhydramine and Methylprednisolone at 20 mg/kg and 30 mg/kg, respectively, in their drinking water for 4 weeks, in combination with eight weeks of SPG training (10 min/day). Sensorimotor and behavioral assessments were carried out and video recorded before the dorsolateral quadrant transection (baseline), as well as on a weekly basis following the lesion. These tests included the Horizontal Ladder, Open Field, Elevated Plus Maze, Light-dark box, Von Frey, and the Irvine, Beattie, and Bresnahan test. After the final day of testing, the rats were euthanized, perfused, and their spinal cord tissue was harvested. The cervical spinal cord tissue, including the lesion site, was cryosectioned at 25 microns and processed with Neurotrace staining. To quantify the extent of spinal cord injury, we measured the damaged and spared areas within the spinal cord using ImageJ-Fiji. DATA USAGE NOTES:

  15. Forest Fire Image Classification Dataset

    • kaggle.com
    zip
    Updated Oct 7, 2024
    Cite
    Obuli Sai Naren (2024). Forest Fire Image Classification Dataset [Dataset]. https://www.kaggle.com/datasets/obulisainaren/forest-fire-c4
    Explore at:
    Available download formats: zip (129161395 bytes)
    Dataset updated
    Oct 7, 2024
    Authors
    Obuli Sai Naren
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    🔥 Forest Fire Image Classification

    A Dataset of 4 Classes: Fire, No Fire, Smoke, and SmokeFire

    Overview

    This dataset contains images of various forest conditions across 4 classes: fire, no fire, smoke, and smokefire. It is designed for use in environmental monitoring, fire detection, and image classification tasks. Each class has balanced samples in train, val, and test subsets, with all images standardized to 250x250 pixels for consistency.

    Check out the live working sample: Forest Fire Live Sample 🔗

    📝 Citation

    If you use this dataset in your research or project, please make sure to cite it appropriately.

    APA
    Obuli Sai Naren. (2022). Forest Fire Image Classification Dataset [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/3135325

    📊 Dataset Details

    | Subset | fire | nofire | smoke | smokefire | Total Images |
    | --- | --- | --- | --- | --- | --- |
    | train | 800 | 800 | 800 | 800 | 3,200 |
    | val | 200 | 200 | 200 | 200 | 800 |
    | test | 200 | 200 | 200 | 200 | 800 |
    | Forest Fire Tester | - | - | - | - | 23 |

    Total Images: 4,823
    Format: JPEG
    Dimensions: 250x250 pixels

    📂 Folder Structure & Classes

    The dataset is organized into train, val, and test subsets, each containing the 4 classes. A separate Forest Fire Tester folder provides additional images for manual testing.
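A minimal sketch of indexing this layout before training, assuming the directory names implied by the description (train/val/test splits, each holding fire/, nofire/, smoke/, smokefire/ subfolders of JPEGs); a tiny synthetic tree stands in for the real download:

```python
# Sketch: counting images per class under root/<split>/<class>/.
import tempfile
from collections import Counter
from pathlib import Path

CLASSES = ["fire", "nofire", "smoke", "smokefire"]

def index_split(root: Path, split: str) -> Counter:
    """Count .jpg images per class directory for one split."""
    counts = Counter()
    for cls in CLASSES:
        counts[cls] = sum(1 for _ in (root / split / cls).glob("*.jpg"))
    return counts

# Build a tiny stand-in tree: 2 images per class in "train".
root = Path(tempfile.mkdtemp())
for cls in CLASSES:
    d = root / "train" / cls
    d.mkdir(parents=True)
    for i in range(2):
        (d / f"img_{i}.jpg").touch()

print(index_split(root, "train"))  # 2 images counted per class
```

The same walk works on the real dataset by pointing `root` at the unzipped download; class-balance checks like this catch a bad extraction before any model training starts.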

    🔄 Preprocessing & Augmentation

    • Resizing: All images resized to 250x250 pixels.
    • Data Augmentation: Applied transformations like rotations, shifts, and brightness changes to enhance diversity.

    For more detailed information, please refer to the README.md file included in the dataset.

    Feel free to download, analyze, and contribute! 📊💻

  16. c4-filter-small

    • huggingface.co
    Updated Apr 26, 2019
    Cite
    datablations (2019). c4-filter-small [Dataset]. https://huggingface.co/datasets/datablations/c4-filter-small
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 26, 2019
    Dataset authored and provided by
    datablations
    Description

    Dataset Card for "small-c4"

    More Information needed

  17. llm_dataset

    • huggingface.co
    + more versions
    Cite
    Wangcheng Tao, llm_dataset [Dataset]. https://huggingface.co/datasets/taowangcheng/llm_dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    Wangcheng Tao
    Description

    This is a preprocessed version of the realnewslike subdirectory of C4 (https://huggingface.co/datasets/allenai/c4). Files were generated using Megatron-LM (https://github.com/NVIDIA/Megatron-LM/):

    python tools/preprocess_data.py \
        --input 'c4/realnewslike/c4-train.0000[0-9]-of-00512.json' \
        --partitions 8 \
        --output-prefix preprocessed/c4 \
        --tokenizer-type GPTSentencePieceTokenizer \
        --tokenizer-model tokenizers/tokenizer.model \
        --workers 8

      license: odc-by
    
  18. Additional file 2 of Proteomic and biochemical responses to different...

    • figshare.com
    • springernature.figshare.com
    xlsx
    Updated Feb 7, 2024
    Cite
    Songcui Wu; Wenhui Gu; Shuao Jia; Lepu Wang; Lijun Wang; Xuehua Liu; Lu Zhou; Aiyou Huang; Guangce Wang (2024). Additional file 2 of Proteomic and biochemical responses to different concentrations of CO2 suggest the existence of multiple carbon metabolism strategies in Phaeodactylum tricornutum [Dataset]. http://doi.org/10.6084/m9.figshare.17205229.v1
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Feb 7, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Songcui Wu; Wenhui Gu; Shuao Jia; Lepu Wang; Lijun Wang; Xuehua Liu; Lu Zhou; Aiyou Huang; Guangce Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 2: Table S2. Predicted subcellular localization of partial proteins from pathways of interest in P. tricornutum. Data are shown for enzymes putatively involved in biochemical C4 pathways, central carbon metabolism, photorespiration, the ornithine–urea cycle, and fatty acid synthesis. Protein expression at LC and HC conditions here are noted as Up or Down, and those not quantified in either replicate proteome are indicated by ND. Predictions of signal peptides, chloroplast transit peptides, mitochondrial targeting, and targeting based on a heterokont-trained HMM utilized the following programs: http://www.cbs.dtu.dk/services/SignalP/ , http://www.cbs.dtu.dk/services/ChloroP/ , http://www.cbs.dtu.dk/services/TargetP/ , http://ihg.gsf.de/ihg/mitoprot.html , https://webtools.sb-roscoff.fr/root?tool_id=abims_hectar . Hypothesized locations are given based on data derived from the five programs and those with majority consensus were chosen as the predicted localization for a particular protein.

  19. SafeC4Sample

    • huggingface.co
    Updated Apr 26, 2019
    + more versions
    Cite
    Sai Krishna Mendu (2019). SafeC4Sample [Dataset]. https://huggingface.co/datasets/themendu/SafeC4Sample
    Explore at:
    Dataset updated
    Apr 26, 2019
    Authors
    Sai Krishna Mendu
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    SafeC4Sample: C4 Dataset with Harmfulness Predictions

      Overview
    

    SafeC4Sample is a processed subset of the C4 dataset (Colossal, Cleaned version of Common Crawl's web crawl corpus) that includes harmfulness predictions from a HarmFormer, as used in our paper. This dataset can be used for content moderation, safer language model training, or research into harmfulness detection in web text. The original C4 dataset, created by Google, provides a cleaned version of Common… See the full description on the dataset page: https://huggingface.co/datasets/themendu/SafeC4Sample.
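A minimal sketch of the content-moderation use case described above: filtering annotated rows by a harmfulness threshold. The field names ("text", "harm_score") and the threshold are hypothetical; check the dataset card for the actual prediction columns.

```python
# Sketch: keeping only rows below a harmfulness threshold for safer training.
# "text" and "harm_score" are hypothetical field names, not confirmed by the card.
rows = [
    {"text": "A recipe for banana bread.", "harm_score": 0.02},
    {"text": "Some flagged web content.", "harm_score": 0.91},
]

def keep_safe(rows, threshold=0.5):
    """Drop rows whose predicted harmfulness meets or exceeds the threshold."""
    return [r for r in rows if r["harm_score"] < threshold]

safe = keep_safe(rows)
print(len(safe))  # 1
```

The same predicate could be passed to a streaming loader's filter step so the full corpus never needs to be materialized.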

  20. Donatello_Science_Training_Data

    • kaggle.com
    zip
    Updated Nov 1, 2025
    Cite
    Randy Chabot (2025). Donatello_Science_Training_Data [Dataset]. https://www.kaggle.com/datasets/randychabot/donatello-science-training-data
    Explore at:
    Available download formats: zip (3489122354 bytes)
    Dataset updated
    Nov 1, 2025
    Authors
    Randy Chabot
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Further Research

    https://imobench.github.io/

    Here are the top 20 open source datasets often used for AI training in coding, physics, and math. These datasets are well-suited for developing large language models and machine learning systems focused on scientific reasoning, problem solving, and code generation.[1][2][3][4]

    Coding Datasets

    • StarCoder: About 250B tokens sourced from GitHub, StackOverflow, Jupyter notebooks, and more—designed for code generation and developer AI.[2]
    • RedPajama: 1.2T tokens including GitHub code, technical docs, ArXiv papers, StackExchange Q&A, and Wikipedia—used for LLM training.[2]
    • CodeSearchNet: Source code in multiple languages with natural language queries, enabling code search and completion tasks.[1]
    • The Stack: A massive dataset of source code in over 50 languages collected for LLM research.[1]
    • BigCode: Datasets focused on open-source software code, technical documentation, and developer discussions.[2]
    • OpenAI's HumanEval: High-quality Python code prompts with solutions meant for evaluating code LLMs.[1]

    Physics Datasets

    • PhysicsNeMo: NVIDIA’s open-source toolkit and example datasets for building surrogate models and digital twins for engineering and physics simulations.[5]
    • ArXiv Physics Papers: Large sets of physics preprints used for scientific language model training.[2]
    • C4 (Colossal Clean Crawled Corpus): Includes technical and scientific data scraped from the web, widely used for scientific reasoning and physics text tasks.[2]
    • OpenAssistant Conversations: Includes physics Q&A and multi-turn dialogs among other scientific domains, geared toward alignment and reasoning.[2]
    • WikiPhysics (Wikipedia Physics): Structured extraction of all physics-related Wikipedia articles for factual and conceptual reasoning.[2]

    Math Datasets

    • MATH: 12,500 competition-grade math problems with step-by-step solutions for advanced reasoning.[3][4]
    • GSM8K: Contains 8.5K grade school math word problems with language explanations—used for language reasoning and math QA.[4][3]
    • NuminaMath: Nearly 860K math problems and solutions supporting chain-of-thought reasoning—excellent for large model training.[3][4]
    • Orca-Math-200K: Synthetic set of 200K math word problems for enhancing LM mathematical capabilities.[3]
    • DART-Math: 590K math problems generated via Difficulty-Aware Rejection Tuning, designed to cover high-difficulty math tasks.[4][3]
    • LeanDojo: 98K formal mathematical theorems and proofs for theorem-proving training and math LLMs.[3]
    • NaturalProofs: 32K formal theorem-proofs sets from various branches, for training formal math reasoning models.[3]
    • OpenML Math Tasks: Benchmarks in pattern recognition and mathematical reasoning for ML benchmarking.[1]
    • Common Crawl Math Subsets: Extracts of math problems and solutions sourced from web crawl data (see DeepSeekMath methodology).[3]
    • StackExchange Mathematics: Mining QA pairs from math StackExchange forums for conversational and solution datasets.[2]

    Additional Multi-domain Datasets

    • OIG (Open Instruction Generalist): 44M samples including knowledge questions, code instructions, and math tasks for instruction-following models.[2]
    • Dolly Dataset: 15K human-generated instruction pairs, covering coding and math among other fields for alignment/fine-tuning.[2]

    These datasets have broad support across AI research, including open licensing, availability on platforms such as GitHub and Hugging Face, and coverage of coding (Python, JavaScript, Java), physics (simulation and factual reasoning), and math (competition and proof-level problems).[4][1][3][2]
