Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
🧠 ALLENAI C4 - English Train Split (Prepared Version)
This repository contains the preprocessed and ready-to-use version of the ALLENAI C4 (Colossal Clean Crawled Corpus) English train split. It has been downloaded and optionally transformed for downstream NLP tasks such as pretraining large language models or text-based retrieval systems.
📦 Dataset Details: Original Source: allenai/c4; Language: English (en); Split: train; License: Google C4 License.
⚠️ Note: This version only includes the train… See the full description on the dataset page: https://huggingface.co/datasets/amanpreet7/allenai-c4.
zaaabik/c4-parquert-train-30-shards dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Grammar Error Correction synthetic dataset consisting of 185 million sentence pairs, created by applying a Tagged Corruption model to Google's C4 dataset.
This version of the dataset was extracted from Li Liwei's Hugging Face dataset (https://huggingface.co/datasets/liweili/c4_200m) and converted to TSV format.
The corruption edits by Felix Stahlberg and Shankar Kumar are licensed under CC BY 4.0. The C4 dataset was released by AllenAI under the terms of ODC-BY. By using this dataset, you are also bound by the Common Crawl terms of use in respect of the content contained in the dataset.
This dataset has been converted to Parquet format; the TSV format is still available in previous versions. The conversion was motivated by the poor performance of accessing each individual file. I'm open to requests and suggestions on how to better handle such a big dataset.
In TSV format, the dataset is split into 10 files of approximately 18M samples each. Each sample is a pair consisting of the incorrect and the corrected sentence.

| Incorrect | Corrected |
| ------------- |:-------------:|
| Much many brands and sellers still in the market. | Many brands and sellers still in the market. |
| She likes playing in park and come here every week | She likes playing in the park and comes here every week |
I'm planning to release a notebook where I'll show Grammar Error Correction using a seq2seq architecture based on BERT and LSTM. Until then, you can try to build your own model!
This dataset can be used to train sequence-to-sequence models based on an encoder-decoder approach.
The task is quite similar to neural machine translation (NMT); here are some tutorials, followed by a minimal fine-tuning sketch:
- NLP from scratch: translation with a seq2seq network and attention
- Language Translation with nn.Transformer and torchtext
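As a concrete starting point, below is a minimal, hedged sketch of fine-tuning a small off-the-shelf encoder-decoder checkpoint (T5 here, rather than the BERT+LSTM setup mentioned above) on one shard of the pairs with Hugging Face Transformers. The shard filename and the column names ("incorrect", "corrected") are placeholders; adapt them to the actual TSV/Parquet files you download.

```python
# Sketch: fine-tune a small seq2seq model on one shard of corruption pairs.
# Filename, column names, and base model are assumptions, not official tooling.
import pandas as pd
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer)

# Load one TSV shard: corrupted sentence first, corrected sentence second.
df = pd.read_csv("c4_200m_shard_00.tsv", sep="\t",
                 names=["incorrect", "corrected"]).dropna().reset_index(drop=True)
ds = Dataset.from_pandas(df).train_test_split(test_size=0.01)

model_name = "t5-small"  # any encoder-decoder checkpoint works in principle
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def preprocess(batch):
    # Source: the corrupted sentence; target: the corrected sentence.
    return tokenizer(batch["incorrect"], max_length=128, truncation=True,
                     text_target=batch["corrected"])

tokenized = ds.map(preprocess, batched=True,
                   remove_columns=["incorrect", "corrected"])

args = Seq2SeqTrainingArguments(output_dir="gec-seq2seq",
                                per_device_train_batch_size=32,
                                num_train_epochs=1,
                                predict_with_generate=True)
trainer = Seq2SeqTrainer(model=model, args=args,
                         train_dataset=tokenized["train"],
                         eval_dataset=tokenized["test"],
                         data_collator=DataCollatorForSeq2Seq(tokenizer, model=model))
trainer.train()
```

The key point is simply that the corrupted sentence is the source sequence and the corrected sentence is the target; any encoder-decoder architecture can be substituted.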
Grammar Error Correction example: https://production-media.paperswithcode.com/tasks/gec_foTfIZW.png
Thanks to the dataset creators Felix Stahlberg and Shankar Kumar and to Li Liwei for first giving access to the processed dataset.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
komikat/bodo-c4-train-0000 dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Le Hoang Long.
Released under Apache 2.0
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
FERNET-C5 is a monolingual BERT language representation model trained from scratch on the Czech Colossal Clean Crawled Corpus (C5), a Czech counterpart of the English C4 dataset. The training data contained almost 13 billion words (93 GB of text). The model has the same architecture as the original BERT model, i.e. 12 transformer blocks, 12 attention heads, and a hidden size of 768. In contrast to Google's BERT models, we used SentencePiece tokenization instead of Google's internal WordPiece tokenization. More details can be found in README.txt; an even more detailed description is available at https://arxiv.org/abs/2107.10042
The same models are also released at https://huggingface.co/fav-kky/FERNET-C5
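As a quick sanity check (not part of the original release notes), the published checkpoint can be loaded directly with the Hugging Face Transformers library; the example sentence below is arbitrary.

```python
# Load the released FERNET-C5 checkpoint and embed a Czech sentence.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("fav-kky/FERNET-C5")
model = AutoModel.from_pretrained("fav-kky/FERNET-C5")

inputs = tokenizer("Praha je hlavní město České republiky.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, 768) for the base-sized model
```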
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These are 2 million 768-dimensional and 300-dimensional CBOW embeddings trained on the English colossal, cleaned common crawl (C4) corpus. They were trained with the corrected CBOW code from kōan:
https://github.com/bloomberg/koan
with intrinsic evaluation reported in:
Ozan İrsoy, Adrian Benton, Karl Stratos. “Corrected CBOW Performs as well as Skip-gram”. The 2nd Workshop on Insights from Negative Results in NLP. 2021.
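A minimal usage sketch, assuming the vectors are distributed in the standard word2vec text format; the filename below is a placeholder, not the actual distribution name.

```python
# Query the released CBOW vectors with gensim (word2vec text format assumed).
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("c4_cbow_768d.txt", binary=False)
print(vectors.most_similar("language", topn=5))  # nearest neighbours by cosine
print(vectors["language"].shape)                 # (768,) for the 768-d embeddings
```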
The long-term monitoring of gross primary production (GPP) is crucial to the assessment of the carbon cycle of terrestrial ecosystems. In this study, a well-known machine learning model (Random Forest, RF) is established to reconstruct the global GPP dataset named ECGC_GPP. The model distinguished nine plant functional types (PFTs), including C3 and C4 crops, using eddy fluxes, meteorological variables, and leaf area index as training data for the RF model. Based on ERA5_Land and the corrected GEOV2 data, the global monthly GPP dataset at a 0.05-degree resolution from 1999 to 2019 was estimated. The results showed that the RF model could explain 74.81% of the monthly variation of GPP in the testing dataset, of which the average contribution of Leaf Area Index (LAI) reached 41.73%. The average annual GPP and its standard deviation during 1999–2019 were 117.14 ± 1.51 Pg C yr-1, with an upward trend of 0.21 Pg C yr-2 (p < 0.01). By using the plant functional type classification, the underestimat…

We unified the ERA5_Land and the corrected GEOV2 datasets to 0.05-degree and monthly scales. The meteorological and remote sensing datasets were classified by the eight PFTs to estimate the GPP of each PFT. In particular, we established site-level PFT training models for CRO_C3 and CRO_C4 separately, due to their significant differences. The CRO cells were a mixture of CRO_C3 and CRO_C4; therefore, the trained CRO_C3 and CRO_C4 models were both applied to the CRO cells and multiplied by their respective proportions to generate the final GPP estimate for CRO. This was designed to mitigate the current underestimation of GPP over CRO_C4-dominated regions. In this way, we generated a 0.05-degree, monthly global GPP dataset (ECGC_GPP) covering 1999 to 2019.

The ECGC_GPP dataset is stored in .nc file format and can be opened using Matlab or Python.
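For example, a minimal Python sketch with xarray could look like the following; the filename, variable name, and dimension names are assumptions to check against the actual file.

```python
# Inspect the monthly 0.05-degree ECGC_GPP NetCDF file with xarray.
# Filename and variable/dimension names are assumptions; check ds.data_vars.
import xarray as xr

ds = xr.open_dataset("ECGC_GPP_1999_2019.nc")
print(ds)                              # dimensions, coordinates, variables
gpp = ds["GPP"]                        # assumed variable name
print(gpp.mean(dim=["lat", "lon"]))    # unweighted global mean time series
```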
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
C4 is a colossal, cleaned version of Common Crawl's web crawl corpus. It is based on the Common Crawl dataset (https://commoncrawl.org) and was used to train the T5 text-to-text Transformer models. The dataset can be downloaded in a pre-processed form from allennlp.
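For instance, the English training split can also be streamed from the Hugging Face Hub without downloading the full corpus; a minimal sketch (config and field names follow the allenai/c4 dataset card).

```python
# Stream a few C4 documents instead of downloading the whole corpus.
from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
for example in c4.take(3):
    print(example["url"])
    print(example["text"][:200], "...")
```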
ODC-By License: https://choosealicense.com/licenses/odc-by/
📚 c4-pro
ArXiv | Models | Code
c4-pro is refined from c4 using the ProX refining framework. It contains about 40B high-quality tokens, ready for general language model pre-training.
License
c4-pro is based on c4, which is made available under an ODC-By 1.0 license; users should also abide by the CommonCrawl ToU: https://commoncrawl.org/terms-of-use/. We do not alter the license of any of the underlying data.
Citation
@article{zhou2024programming… See the full description on the dataset page: https://huggingface.co/datasets/gair-prox/c4-pro.
The dataset used for pre-training language models, containing a large collection of text documents.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comprehensive dataset containing 16 verified Training centre businesses in Kırklareli, Turkey with complete contact information, ratings, reviews, and location data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The “constant” model is the same for the 2 independent variables.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
STUDY PURPOSE: After spinal cord injury, inflammation is involved in secondary tissue damage. However, it may also promote neuroplasticity. We have shown earlier that promoting inflammation in a chronic setting in rats can promote the efficacy of rehabilitative training in a reaching task. Here we wanted to test whether the opposite is also true: would common anti-inflammatory medications, which could be given for any reason in the later stages of a spinal lesion, affect the efficacy of rehabilitative training in rats with unilateral incomplete cervical spinal cord injuries?

DATA COLLECTED: This experiment involved two experimental cohorts, with a total of fifty-three age-matched adult female Lewis rats (cohort 1: n=29, cohort 2: n=24). The rats underwent training in a single pellet grasping (SPG) task for 5 weeks before receiving a C4 dorsolateral quadrant transection. Afterwards, the rats were randomized into groups: the first cohort included three groups, SCI only (n=10), SCI + Diphenhydramine (SCI+DPH; n=10), and SCI + Methylprednisolone (SCI+MP; n=9); the second cohort included only the SCI and SCI+DPH groups, each with n=12. One week after the spinal cord lesion, the rats received Diphenhydramine and Methylprednisolone at 20 mg/kg and 30 mg/kg, respectively, in their drinking water for 4 weeks, in combination with eight weeks of SPG training (10 min/day). Sensorimotor and behavioral assessments were carried out and video recorded before the dorsolateral quadrant transection (baseline), as well as on a weekly basis following the lesion. These tests included the Horizontal Ladder, Open Field, Elevated Plus Maze, Light-Dark Box, Von Frey, and the Irvine, Beattie, and Bresnahan test. After the final day of testing, the rats were euthanized, perfused, and their spinal cord tissue was harvested. The cervical spinal cord tissue, including the lesion site, was cryosectioned at 25 microns and processed with Neurotrace staining. To quantify the extent of spinal cord injury, we measured the damaged and spared areas within the spinal cord using ImageJ-Fiji.

DATA USAGE NOTES:
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset contains images of various forest conditions across 4 classes: fire, no fire, smoke, and smokefire. It is designed for use in environmental monitoring, fire detection, and image classification tasks. Each class has balanced samples in train, val, and test subsets, with all images standardized to 250x250 pixels for consistency.
Check out the live working sample: Forest Fire Live Sample 🔗
If you use this dataset in your research or project, please make sure to cite it appropriately.
APA
Obuli Sai Naren. (2022). Forest Fire Image Classification Dataset [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/3135325
| Subset | fire | nofire | smoke | smokefire | Total Images |
|---|---|---|---|---|---|
| train | 800 | 800 | 800 | 800 | 3,200 |
| val | 200 | 200 | 200 | 200 | 800 |
| test | 200 | 200 | 200 | 200 | 800 |
| Forest Fire Tester | - | - | - | - | 23 |
Total Images: 4,823
Format: JPEG
Dimensions: 250x250 pixels
The dataset is organized into train, val, and test subsets, each containing the 4 classes. A separate Forest Fire Tester folder provides additional images for manual testing.
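Because this layout follows the standard class-per-subfolder convention, it can be loaded, for example, with torchvision's ImageFolder; a minimal sketch (the root path is a placeholder).

```python
# Load the train split with ImageFolder; class names come from the subfolders.
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

tfm = transforms.Compose([transforms.Resize((250, 250)), transforms.ToTensor()])
train_ds = datasets.ImageFolder("forest-fire-dataset/train", transform=tfm)
print(train_ds.classes)  # expected: ['fire', 'nofire', 'smoke', 'smokefire']

loader = DataLoader(train_ds, batch_size=32, shuffle=True)
images, labels = next(iter(loader))
print(images.shape, labels.shape)
```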
For more detailed information, please refer to the README.md file included in the dataset.
Feel free to download, analyze, and contribute! 📊💻
Dataset Card for "small-c4"
More Information needed
This is a preprocessed version of the realnewslike subdirectory of C4.
C4 from: https://huggingface.co/datasets/allenai/c4
The files were generated using Megatron-LM (https://github.com/NVIDIA/Megatron-LM/):
python tools/preprocess_data.py \
    --input 'c4/realnewslike/c4-train.0000[0-9]-of-00512.json' \
    --partitions 8 \
    --output-prefix preprocessed/c4 \
    --tokenizer-type GPTSentencePieceTokenizer \
    --tokenizer-model tokenizers/tokenizer.model \
    --workers 8
license: odc-by
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 2: Table S2. Predicted subcellular localization of partial proteins from pathways of interest in P. tricornutum. Data are shown for enzymes putatively involved in biochemical C4 pathways, central carbon metabolism, photorespiration, the ornithine–urea cycle, and fatty acid synthesis. Protein expression under LC and HC conditions is noted as Up or Down, and proteins not quantified in either replicate proteome are indicated by ND. Predictions of signal peptides, chloroplast transit peptides, mitochondrial targeting, and targeting based on a heterokont-trained HMM used the following programs: http://www.cbs.dtu.dk/services/SignalP/ , http://www.cbs.dtu.dk/services/ChloroP/ , http://www.cbs.dtu.dk/services/TargetP/ , http://ihg.gsf.de/ihg/mitoprot.html , https://webtools.sb-roscoff.fr/root?tool_id=abims_hectar . Hypothesized locations are based on data derived from the five programs, and the majority consensus was chosen as the predicted localization for a particular protein.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
SafeC4Sample: C4 Dataset with Harmfulness Predictions
Overview
SafeC4Sample is a processed subset of the C4 dataset (Colossal, Cleaned version of Common Crawl's web crawl corpus) that includes harmfulness predictions from a HarmFormer model, as used in our paper. This dataset can be used for content moderation, safer language model training, or research into harmfulness detection in web text. The original C4 dataset, created by Google, provides a cleaned version of Common… See the full description on the dataset page: https://huggingface.co/datasets/themendu/SafeC4Sample.
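A minimal, heavily hedged sketch of loading the dataset and filtering on the harmfulness prediction; the split name, prediction field, and label values below are assumptions, so inspect the actual schema first.

```python
# Load SafeC4Sample and keep only documents predicted as non-harmful.
# The split name and the "harmful_prediction" field are assumptions.
from datasets import load_dataset

ds = load_dataset("themendu/SafeC4Sample", split="train")
print(ds[0])  # inspect the real fields before filtering

safe = ds.filter(lambda ex: ex.get("harmful_prediction", 0) == 0)
print(f"kept {len(safe)} of {len(ds)} documents")
```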
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Here are the top 20 open source datasets often used for AI training in coding, physics, and math. These datasets are well-suited for developing large language models and machine learning systems focused on scientific reasoning, problem solving, and code generation.[1][2][3][4]
These datasets have broad support across AI research, including open licensing, availability on platforms such as GitHub and Hugging Face, and coverage of coding (Python, JavaScript, Java), physics (simulation and factual reasoning), and math (competition and proof-level problems).[4][1][3][2]