Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
🧠 ALLENAI C4 - English Train Split (Prepared Version)
This repository contains the preprocessed and ready-to-use version of the ALLENAI C4 (Colossal Clean Crawled Corpus) English train split. It has been downloaded and optionally transformed for downstream NLP tasks such as pretraining large language models or text-based retrieval systems.
📦 Dataset Details: Original Source: allenai/c4; Language: English (en); Split: train; License: Google C4 License.
⚠️ Note: This version only includes the train… See the full description on the dataset page: https://huggingface.co/datasets/amanpreet7/allenai-c4.
zaaabik/c4-parquert-train-30-shards dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Grammar Error Correction synthetic dataset consisting of 185 million sentence pairs, created by applying a Tagged Corruption model to Google's C4 dataset.
This version of the dataset was extracted from Li Liwei's Hugging Face dataset (https://huggingface.co/datasets/liweili/c4_200m) and converted to TSV format.
The corruption edits by Felix Stahlberg and Shankar Kumar are licensed under CC BY 4.0. The C4 dataset was released by AllenAI under the terms of ODC-BY. By using this dataset, you are also bound by the Common Crawl terms of use in respect of the content contained in the dataset.
This dataset has been converted to Parquet format; the TSV format is still available in previous versions. The conversion was motivated by the poor performance of accessing each individual file. I'm open to requests and suggestions on how to better handle such a big dataset.
In TSV format, the dataset is split into 10 files of approximately 18M samples each. Each sample is a pair consisting of the incorrect and the corrected sentence.

| Incorrect | Corrected |
| ------------- |:-------------:|
| Much many brands and sellers still in the market. | Many brands and sellers still in the market. |
| She likes playing in park and come here every week | She likes playing in the park and comes here every week |
I'm planning to release a notebook where I'll show Grammar Error Correction using a seq2seq architecture based on BERT and LSTM. Until then, you can try to build your own model!
This dataset can be used to train sequence-to-sequence models based on an encoder-decoder approach.
The task is quite similar to neural machine translation (NMT); here are some tutorials, followed by a minimal fine-tuning sketch:
- NLP from scratch: translation with a seq2seq network and attention
- Language Translation with nn.Transformer and torchtext
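As a concrete starting point, below is a minimal, hedged sketch of fine-tuning a small off-the-shelf encoder-decoder checkpoint (T5 here, rather than the BERT+LSTM setup mentioned above) on one shard of the pairs with Hugging Face Transformers. The shard filename and the column names ("incorrect", "corrected") are placeholders; adapt them to the actual TSV/Parquet files you download.

```python
# Sketch: fine-tune a small seq2seq model on one shard of corruption pairs.
# Filename, column names, and base model are assumptions, not official tooling.
import pandas as pd
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer)

# Load one TSV shard: corrupted sentence first, corrected sentence second.
df = pd.read_csv("c4_200m_shard_00.tsv", sep="\t",
                 names=["incorrect", "corrected"]).dropna().reset_index(drop=True)
ds = Dataset.from_pandas(df).train_test_split(test_size=0.01)

model_name = "t5-small"  # any encoder-decoder checkpoint works in principle
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def preprocess(batch):
    # Source: the corrupted sentence; target: the corrected sentence.
    return tokenizer(batch["incorrect"], max_length=128, truncation=True,
                     text_target=batch["corrected"])

tokenized = ds.map(preprocess, batched=True,
                   remove_columns=["incorrect", "corrected"])

args = Seq2SeqTrainingArguments(output_dir="gec-seq2seq",
                                per_device_train_batch_size=32,
                                num_train_epochs=1,
                                predict_with_generate=True)
trainer = Seq2SeqTrainer(model=model, args=args,
                         train_dataset=tokenized["train"],
                         eval_dataset=tokenized["test"],
                         data_collator=DataCollatorForSeq2Seq(tokenizer, model=model))
trainer.train()
```

The key point is simply that the corrupted sentence is the source sequence and the corrected sentence is the target; any encoder-decoder architecture can be substituted.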
Grammar Error Correction example: https://production-media.paperswithcode.com/tasks/gec_foTfIZW.png
Thanks to the dataset creators Felix Stahlberg and Shankar Kumar and to Li Liwei for first giving access to the processed dataset.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
komikat/bodo-c4-train-0000 dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Le Hoang Long.
Released under Apache 2.0
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
FERNET-C5 is a monolingual BERT language representation model trained from scratch on the Czech Colossal Clean Crawled Corpus (C5), a Czech counterpart of the English C4 dataset. The training data contained almost 13 billion words (93 GB of text). The model has the same architecture as the original BERT model, i.e. 12 transformer blocks, 12 attention heads, and a hidden size of 768. In contrast to Google's BERT models, we used SentencePiece tokenization instead of Google's internal WordPiece tokenization. More details can be found in README.txt; an even more detailed description is available at https://arxiv.org/abs/2107.10042
The same models are also released at https://huggingface.co/fav-kky/FERNET-C5
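As a quick sanity check (not part of the original release notes), the published checkpoint can be loaded directly with the Hugging Face Transformers library; the example sentence below is arbitrary.

```python
# Load the released FERNET-C5 checkpoint and embed a Czech sentence.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("fav-kky/FERNET-C5")
model = AutoModel.from_pretrained("fav-kky/FERNET-C5")

inputs = tokenizer("Praha je hlavní město České republiky.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, 768) for the base-sized model
```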
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These are 2 million 768-dimensional and 300-dimensional CBOW embeddings trained on the English colossal, cleaned common crawl (C4) corpus. They were trained with the corrected CBOW code from kōan:
https://github.com/bloomberg/koan
with intrinsic evaluation reported in:
Ozan İrsoy, Adrian Benton, Karl Stratos. “Corrected CBOW Performs as well as Skip-gram”. The 2nd Workshop on Insights from Negative Results in NLP. 2021.
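A minimal usage sketch, assuming the vectors are distributed in the standard word2vec text format; the filename below is a placeholder, not the actual distribution name.

```python
# Query the released CBOW vectors with gensim (word2vec text format assumed).
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("c4_cbow_768d.txt", binary=False)
print(vectors.most_similar("language", topn=5))  # nearest neighbours by cosine
print(vectors["language"].shape)                 # (768,) for the 768-d embeddings
```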
The long-term monitoring of gross primary production (GPP) is crucial to the assessment of the carbon cycle of terrestrial ecosystems. In this study, a well-known machine learning model (Random Forest, RF) is established to reconstruct the global GPP dataset named ECGC_GPP. The model distinguished nine plant functional types (PFTs), including C3 and C4 crops, using eddy fluxes, meteorological variables, and leaf area index as training data for the RF model. Based on ERA5_Land and the corrected GEOV2 data, the global monthly GPP dataset at a 0.05-degree resolution from 1999 to 2019 was estimated. The results showed that the RF model could explain 74.81% of the monthly variation of GPP in the testing dataset, of which the average contribution of Leaf Area Index (LAI) reached 41.73%. The average annual GPP and its standard deviation during 1999–2019 were 117.14 ± 1.51 Pg C yr-1, with an upward trend of 0.21 Pg C yr-2 (p < 0.01). By using the plant functional type classification, the underestimat…

We unified the ERA5_Land and the corrected GEOV2 datasets to 0.05-degree and monthly scales. The meteorological and remote sensing datasets were classified by the eight PFTs to estimate the GPP of each PFT. In particular, we established site-level PFT training models for CRO_C3 and CRO_C4 separately, due to their significant differences. The CRO cells were a mixture of CRO_C3 and CRO_C4; therefore, the trained CRO_C3 and CRO_C4 models were both applied to the CRO cells and multiplied by their respective proportions to generate the final GPP estimate for CRO. This was designed to mitigate the current underestimation of GPP over CRO_C4-dominated regions. In this way, we generated a 0.05-degree, monthly global GPP dataset (ECGC_GPP) covering 1999 to 2019.

The ECGC_GPP dataset is stored in .nc file format and can be opened using Matlab or Python.
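For example, a minimal Python sketch with xarray could look like the following; the filename, variable name, and dimension names are assumptions to check against the actual file.

```python
# Inspect the monthly 0.05-degree ECGC_GPP NetCDF file with xarray.
# Filename and variable/dimension names are assumptions; check ds.data_vars.
import xarray as xr

ds = xr.open_dataset("ECGC_GPP_1999_2019.nc")
print(ds)                              # dimensions, coordinates, variables
gpp = ds["GPP"]                        # assumed variable name
print(gpp.mean(dim=["lat", "lon"]))    # unweighted global mean time series
```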
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
C4 is a colossal, cleaned version of Common Crawl's web crawl corpus. It is based on the Common Crawl dataset (https://commoncrawl.org) and was used to train the T5 text-to-text Transformer models. The dataset can be downloaded in a pre-processed form from allennlp.
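For instance, the English training split can also be streamed from the Hugging Face Hub without downloading the full corpus; a minimal sketch (config and field names follow the allenai/c4 dataset card).

```python
# Stream a few C4 documents instead of downloading the whole corpus.
from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
for example in c4.take(3):
    print(example["url"])
    print(example["text"][:200], "...")
```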
ODC-By License: https://choosealicense.com/licenses/odc-by/
📚 c4-pro
ArXiv | Models | Code
c4-pro is refined from c4 using the ProX refining framework. It contains about 40B high-quality tokens, ready for general language model pre-training.
License
c4-pro is based on c4, which is made available under an ODC-By 1.0 license; users should also abide by the CommonCrawl ToU: https://commoncrawl.org/terms-of-use/. We do not alter the license of any of the underlying data.
Citation
@article{zhou2024programming… See the full description on the dataset page: https://huggingface.co/datasets/gair-prox/c4-pro.
The dataset used for pre-training language models, containing a large collection of text documents.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comprehensive dataset containing 16 verified Training centre businesses in Kırklareli, Turkey with complete contact information, ratings, reviews, and location data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The “constant” model is the same for the 2 independent variables.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
STUDY PURPOSE: After spinal cord injury, inflammation is involved in secondary tissue damage. However, it may also promote neuroplasticity. We have shown earlier that promoting inflammation in a chronic setting in rats can promote the efficacy of rehabilitative training in a reaching task. Here we wanted to test whether the opposite is also true: would common anti-inflammatory medications, which could be given for any reason in the later stages of a spinal lesion, affect the efficacy of rehabilitative training in rats with unilateral incomplete cervical spinal cord injuries?

DATA COLLECTED: This experiment involved two experimental cohorts, with a total of fifty-three age-matched adult female Lewis rats (cohort 1: n=29, cohort 2: n=24). The rats underwent training in a single pellet grasping (SPG) task for 5 weeks before receiving a C4 dorsolateral quadrant transection. Afterwards, the rats were randomized into groups: the first cohort included three groups, SCI only (n=10), SCI + Diphenhydramine (SCI+DPH; n=10), and SCI + Methylprednisolone (SCI+MP; n=9); the second cohort included only the SCI and SCI+DPH groups, each with n=12. One week after the spinal cord lesion, the rats received Diphenhydramine and Methylprednisolone at 20 mg/kg and 30 mg/kg, respectively, in their drinking water for 4 weeks, in combination with eight weeks of SPG training (10 min/day). Sensorimotor and behavioral assessments were carried out and video recorded before the dorsolateral quadrant transection (baseline), as well as on a weekly basis following the lesion. These tests included the Horizontal Ladder, Open Field, Elevated Plus Maze, Light-Dark Box, Von Frey, and the Irvine, Beattie, and Bresnahan test. After the final day of testing, the rats were euthanized, perfused, and their spinal cord tissue was harvested. The cervical spinal cord tissue, including the lesion site, was cryosectioned at 25 microns and processed with Neurotrace staining. To quantify the extent of spinal cord injury, we measured the damaged and spared areas within the spinal cord using ImageJ-Fiji.

DATA USAGE NOTES:
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset contains images of various forest conditions across 4 classes: fire, no fire, smoke, and smokefire. It is designed for use in environmental monitoring, fire detection, and image classification tasks. Each class has balanced samples in train, val, and test subsets, with all images standardized to 250x250 pixels for consistency.
Check out the live working sample: Forest Fire Live Sample 🔗
If you use this dataset in your research or project, please make sure to cite it appropriately.
APA
Obuli Sai Naren. (2022). Forest Fire Image Classification Dataset [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/3135325
| Subset | fire | nofire | smoke | smokefire | Total Images |
|---|---|---|---|---|---|
| train | 800 | 800 | 800 | 800 | 3,200 |
| val | 200 | 200 | 200 | 200 | 800 |
| test | 200 | 200 | 200 | 200 | 800 |
| Forest Fire Tester | - | - | - | - | 23 |
Total Images: 4,823
Format: JPEG
Dimensions: 250x250 pixels
The dataset is organized into train, val, and test subsets, each containing the 4 classes. A separate Forest Fire Tester folder provides additional images for manual testing.
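Because this layout follows the standard class-per-subfolder convention, it can be loaded, for example, with torchvision's ImageFolder; a minimal sketch (the root path is a placeholder).

```python
# Load the train split with ImageFolder; class names come from the subfolders.
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

tfm = transforms.Compose([transforms.Resize((250, 250)), transforms.ToTensor()])
train_ds = datasets.ImageFolder("forest-fire-dataset/train", transform=tfm)
print(train_ds.classes)  # expected: ['fire', 'nofire', 'smoke', 'smokefire']

loader = DataLoader(train_ds, batch_size=32, shuffle=True)
images, labels = next(iter(loader))
print(images.shape, labels.shape)
```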
For more detailed information, please refer to the README.md file included in the dataset.
Feel free to download, analyze, and contribute! 📊💻
Dataset Card for "small-c4"
More Information needed
This is a preprocessed version of the realnewslike subdirectory of C4.
C4 from: https://huggingface.co/datasets/allenai/c4
The files were generated using Megatron-LM (https://github.com/NVIDIA/Megatron-LM/):
python tools/preprocess_data.py \
    --input 'c4/realnewslike/c4-train.0000[0-9]-of-00512.json' \
    --partitions 8 \
    --output-prefix preprocessed/c4 \
    --tokenizer-type GPTSentencePieceTokenizer \
    --tokenizer-model tokenizers/tokenizer.model \
    --workers 8
license: odc-by
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 2: Table S2. Predicted subcellular localization of partial proteins from pathways of interest in P. tricornutum. Data are shown for enzymes putatively involved in biochemical C4 pathways, central carbon metabolism, photorespiration, the ornithine–urea cycle, and fatty acid synthesis. Protein expression under LC and HC conditions is noted as Up or Down, and proteins not quantified in either replicate proteome are indicated by ND. Predictions of signal peptides, chloroplast transit peptides, mitochondrial targeting, and targeting based on a heterokont-trained HMM used the following programs: http://www.cbs.dtu.dk/services/SignalP/ , http://www.cbs.dtu.dk/services/ChloroP/ , http://www.cbs.dtu.dk/services/TargetP/ , http://ihg.gsf.de/ihg/mitoprot.html , https://webtools.sb-roscoff.fr/root?tool_id=abims_hectar . Hypothesized locations are based on data derived from the five programs, and the majority consensus was chosen as the predicted localization for a particular protein.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
SafeC4Sample: C4 Dataset with Harmfulness Predictions
Overview
SafeC4Sample is a processed subset of the C4 dataset (Colossal, Cleaned version of Common Crawl's web crawl corpus) that includes harmfulness predictions from a HarmFormer model, as used in our paper. This dataset can be used for content moderation, safer language model training, or research into harmfulness detection in web text. The original C4 dataset, created by Google, provides a cleaned version of Common… See the full description on the dataset page: https://huggingface.co/datasets/themendu/SafeC4Sample.
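A minimal, heavily hedged sketch of loading the dataset and filtering on the harmfulness prediction; the split name, prediction field, and label values below are assumptions, so inspect the actual schema first.

```python
# Load SafeC4Sample and keep only documents predicted as non-harmful.
# The split name and the "harmful_prediction" field are assumptions.
from datasets import load_dataset

ds = load_dataset("themendu/SafeC4Sample", split="train")
print(ds[0])  # inspect the real fields before filtering

safe = ds.filter(lambda ex: ex.get("harmful_prediction", 0) == 0)
print(f"kept {len(safe)} of {len(ds)} documents")
```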
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Here are the top 20 open source datasets often used for AI training in coding, physics, and math. These datasets are well-suited for developing large language models and machine learning systems focused on scientific reasoning, problem solving, and code generation.[1][2][3][4]
These datasets have broad support across AI research, including open licensing, availability on platforms such as GitHub and Hugging Face, and coverage of coding (Python, JavaScript, Java), physics (simulation and factual reasoning), and math (competition and proof-level problems).[4][1][3][2]