MassiveText is a collection of large English-language text datasets from multiple sources: web pages, books, news articles, and code. The data pipeline includes text quality filtering, removal of repetitious text, deduplication of similar documents, and removal of documents with significant test-set overlap. MassiveText contains 2.35 billion documents or about 10.5 TB of text.
Usage: Gopher is trained on 300B tokens (12.8% of the tokens in the dataset), so the authors sub-sample from MassiveText with sampling proportions specified per subset (books, news, etc.). These sampling proportions are tuned to maximize downstream performance. The largest sampling subset is the curated web-text corpus MassiveWeb, which is found to improve downstream performance relative to existing web-text datasets such as C4 (Raffel et al., 2020).
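A minimal sketch of how such proportion-weighted sub-sampling could be implemented (this is not the authors' code; the subset names and proportions below are illustrative placeholders, not the tuned values reported in the Gopher paper):

```python
import random

# Illustrative subset names and proportions only; the tuned per-subset values
# used for Gopher are reported in the paper and are not reproduced here.
sampling_proportions = {
    "massiveweb": 0.50,
    "books": 0.25,
    "news": 0.10,
    "code": 0.10,
    "wikipedia": 0.05,
}

TARGET_TOKENS = 300_000_000_000  # Gopher's 300B-token training budget


def tokens_per_subset(proportions, target_tokens):
    """Split the total token budget across subsets according to the proportions."""
    return {name: int(p * target_tokens) for name, p in proportions.items()}


def sample_documents(documents, token_budget, count_tokens):
    """Randomly sample documents from one subset until its token budget is spent."""
    shuffled = list(documents)
    random.shuffle(shuffled)
    picked, used = [], 0
    for doc in shuffled:
        n = count_tokens(doc)
        if used + n > token_budget:
            break
        picked.append(doc)
        used += n
    return picked


budgets = tokens_per_subset(sampling_proportions, TARGET_TOKENS)
```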
Datasheets for MassiveText can be found in the Gopher paper.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview: Nepali-Text-Corpus is a comprehensive collection of approximately 6.4 million articles in the Nepali language and the largest Nepali text dataset available. It encompasses a diverse range of text types, including news articles, blogs, and more, making it an invaluable resource for researchers, developers, and enthusiasts in Natural Language Processing (NLP) and computational linguistics.
Dataset Details:
Total Articles: ~6.4 million
Language: Nepali
Size: 27.5 GB (CSV)
Source: Collected from various Nepali news websites, blogs, and other online platforms.
https://www.futurebeeai.com/data-license-agreement
Welcome to the English-Bahasa Bilingual Parallel Corpora dataset for the Banking, Financial Services, and Insurance (BFSI) domain! This meticulously curated dataset offers a rich collection of bilingual text data, translated between English and Bahasa, providing a valuable resource for developing BFSI domain-specific language models and machine translation engines.
This Parallel Corpus is meticulously curated to capture the linguistic intricacies and domain-specific nuances inherent to the BFSI industry.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
🐳 OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
This repository contains 210 million image-text interleaved documents filtered from the OmniCorpus-CC dataset, which was sourced from Common Crawl.
Repository: https://github.com/OpenGVLab/OmniCorpus
Paper (ICLR 2025 Spotlight): https://arxiv.org/abs/2406.08418
The OmniCorpus dataset is a large-scale image-text interleaved dataset that pushes the boundaries of scale and diversity by encompassing… See the full description on the dataset page: https://huggingface.co/datasets/OpenGVLab/OmniCorpus-CC-210M.
Sparse machine learning has recently emerged as a powerful tool for obtaining models of high-dimensional data with a high degree of interpretability, at low computational cost. This paper posits that these methods can be extremely useful for understanding large collections of text documents, without requiring user expertise in machine learning. Our approach relies on three main ingredients: (a) multi-document text summarization and (b) comparative summarization of two corpora, both using sparse regression or classification; (c) sparse principal components and sparse graphical models for unsupervised analysis and visualization of large text corpora. We validate our approach using a corpus of Aviation Safety Reporting System (ASRS) reports and demonstrate that the methods can reveal causal and contributing factors in runway incursions. Furthermore, we show that the methods automatically discover four main tasks that pilots perform during flight, which can aid in further understanding the causal and contributing factors to runway incursions and other drivers of aviation safety incidents. Citation: L. El Ghaoui, G. C. Li, V. Duong, V. Pham, A. N. Srivastava, and K. Bhaduri, “Sparse Machine Learning Methods for Understanding Large Text Corpora,” Proceedings of the Conference on Intelligent Data Understanding, 2011.
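A minimal sketch of the sparse-classification idea behind comparative summarization, assuming scikit-learn and toy placeholder documents (the paper's own implementation and data are not reproduced here): an L1-penalized classifier separates two corpora, and the few features with non-zero weights serve as the comparative summary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy placeholder corpora; in practice these would be two document collections
# to compare, e.g. two subsets of ASRS reports.
corpus_a = ["runway incursion during taxi", "tower cleared the wrong runway"]
corpus_b = ["smooth cruise at altitude", "routine descent and landing"]

docs = corpus_a + corpus_b
labels = [1] * len(corpus_a) + [0] * len(corpus_b)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# The L1 penalty drives most coefficients to exactly zero, so the surviving
# terms form a short, interpretable comparative summary of the two corpora.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
clf.fit(X, labels)

terms = vectorizer.get_feature_names_out()
summary = [t for t, w in zip(terms, clf.coef_[0]) if abs(w) > 1e-6]
print(summary)
```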
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
NOTE FOR WMT PARTICIPANTS: There is an easier version for MT available in Moses format (one sentence per line); the files start with moses_like. If you use this dataset, please cite the following work:
@InProceedings{L18-1546, author = "Soares, Felipe and Moreira, Viviane and Becker, Karin", title = "A Large Parallel Corpus of Full-Text Scientific Articles", booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018)", year = "2018", publisher = "European Language Resource Association", location = "Miyazaki, Japan", url = "http://aclweb.org/anthology/L18-1546" }
We developed a parallel corpus of full-text scientific articles collected from the Scielo database in the following languages: English, Portuguese and Spanish. The corpus is sentence-aligned for all language pairs, as well as trilingually aligned for a small subset of sentences.
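A minimal sketch (not part of the release) of reading the Moses-style files, which are aligned line by line with one sentence per line; the file names below are hypothetical:

```python
# Read two line-aligned Moses-format files and yield sentence pairs.
def read_parallel(src_path, tgt_path):
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        for src_line, tgt_line in zip(src, tgt):
            yield src_line.strip(), tgt_line.strip()


# Hypothetical file names; substitute the actual moses_like files from the release.
for en, pt in read_parallel("moses_like.en-pt.en", "moses_like.en-pt.pt"):
    pass  # feed each sentence pair to an MT training pipeline
```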
Common Corpus
Full data paper
Common Corpus is the largest open and permissibly licensed text dataset, comprising 2 trillion tokens (1,998,647,168,282 tokens). It is a diverse dataset, consisting of books, newspapers, scientific articles, government and legal documents, code, and more. Common Corpus has been created by Pleias in association with several partners and contributed in kind to the Current AI initiative. Common Corpus differs from existing open datasets in that it is:… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/common_corpus.
OpenWebText is an open-source recreation of the WebText corpus. The text is web content extracted from URLs shared on Reddit with at least three upvotes (38 GB).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The list of hyponym-hypernym pairs was obtained by applying lexical-syntactic patterns described in Hearst (1992) on the corpus prepared by Panchenko et al. (2016). This corpus is a concatenation of the English Wikipedia (2016 dump), Gigaword, ukWaC and English news corpora from the Leipzig Corpora Collection. The lexical-syntactic patterns proposed by Marti Hearst (1992) and further extended and implemented in the form of FSTs by Panchenko et al. (2012) for extracting (noisy) hyponym-hypernym pairs are as follows -- (i) such NP as NP, NP[,] and/or NP; (ii) NP such as NP, NP[,] and/or NP; (iii) NP, NP [,] or other NP; (iv) NP, NP [,] and other NP; (v) NP, including NP, NP [,] and/or NP; (vi) NP, especially NP, NP [,] and/or NP. Pattern extraction on the corpus yields a list of 27.6 million hyponym-hypernym pairs along with the frequency of their occurrence in the corpus.
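A minimal, heavily simplified sketch of one such Hearst pattern ("NP such as NP, NP [,] and/or NP") as a regular expression; the original pipeline ran finite-state transducers over NP-chunked text, whereas this illustration approximates noun phrases by single words:

```python
import re

# Simplified Hearst pattern: a single-word hypernym followed by "such as" and a
# comma/and/or-separated list of single-word hyponyms.
PATTERN = re.compile(r"\b(\w+) such as (\w+(?:(?:, and |, or |, | and | or )\w+)*)")


def extract_pairs(sentence):
    pairs = []
    for match in PATTERN.finditer(sentence):
        hypernym = match.group(1)
        hyponyms = re.split(r", and | and |, or | or |, ", match.group(2))
        pairs.extend((h, hypernym) for h in hyponyms if h)
    return pairs


print(extract_pairs("He studies fruits such as apples, oranges and pears."))
# [('apples', 'fruits'), ('oranges', 'fruits'), ('pears', 'fruits')]
```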
https://lindat.mff.cuni.cz/repository/xmlui/page/license-cnc
A corpus of contemporary written (printed) Czech, 4.7 GW (gigawords) in size, i.e. 5.7 billion tokens. It covers mostly the 1990-2019 period and features rich metadata, including detailed bibliographical information, text-type classification, etc. SYN v9 contains a wide variety of text types (fiction, non-fiction, newspapers), but newspapers prevail noticeably. The corpus is lemmatized and morphologically tagged with the new CNC tagset, first used for the annotation of the SYN2020 corpus.
SYN v9 is provided in a CoNLL-U-like vertical format used as input to the Manatee query engine. The data thus correspond to the corpus available via the KonText query interface to registered users of the CNC at http://www.korpus.cz, with one important exception: the corpus is shuffled, i.e. divided into blocks of at most 100 words (respecting sentence boundaries), with the ordering randomized within the given document.
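A minimal sketch (not an official loader) of reading a CoNLL-U-like vertical file of this kind, assuming one token per line with tab-separated word/lemma/tag columns and structural markup such as <doc> and <s> on separate lines; the exact column layout should be checked against the corpus documentation:

```python
# Yield sentences as lists of (word, lemma, tag) tuples from a vertical file.
def read_vertical(path):
    sentence = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith("<"):  # structural tag, e.g. <doc ...>, <s>, </s>
                if line.startswith("</s>") and sentence:
                    yield sentence
                    sentence = []
                continue
            columns = line.split("\t")
            if len(columns) >= 3:
                word, lemma, tag = columns[0], columns[1], columns[2]
                sentence.append((word, lemma, tag))
    if sentence:
        yield sentence
```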
https://www.futurebeeai.com/data-license-agreement
Welcome to the English-Tamil Bilingual Parallel Corpora dataset for the Legal domain! This meticulously curated dataset offers a rich collection of bilingual text data, translated between English and Tamil, providing a valuable resource for developing Legal domain-specific language models and machine translation engines.
This Parallel Corpus is meticulously curated to capture the linguistic intricacies and domain-specific nuances inherent to the Legal industry.
https://www.futurebeeai.com/data-license-agreement
Welcome to the English-French Bilingual Parallel Corpora dataset for the Education domain! This comprehensive dataset contains a vast collection of bilingual text data, carefully translated between English and French, to support the development of Education-specific language models and machine translation engines.
This Parallel Corpus is meticulously curated to capture the linguistic intricacies and domain-specific nuances inherent to the Education industry.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Pro(tein)/Gene corpus was developed at the JULIE Lab Jena under supervision of Prof. Udo Hahn.
The goals of the annotation project were
to construct a consistent and (as far as possible) subdomain-independent and subdomain-comprehensive protein-annotated corpus
to differentiate between protein families and groups, protein complexes, protein molecules, protein variants (e.g. alleles) and elliptic enumerations of proteins.
The corpus has the following annotation levels / entity types:
protein
protein_familiy_or_group
protein_complex
protein_variant
protein_enum
For definitions of the annotation levels, please refer to the Proteins-guidelines-final.doc file that is found in the download package.
To achieve large coverage of biological subdomains, documents from multiple other protein/gene corpora were re-annotated. For further coverage, new document sets were created. All documents are abstracts from PubMed/MEDLINE. The corpus is made up of the union of all the documents in the different subcorpora. All documents are delivered as MMAX2 (http://mmax2.net/) annotation projects.
https://www.futurebeeai.com/data-license-agreement
Welcome to the English-Finnish Bilingual Parallel Corpora dataset for the Banking, Financial Services, and Insurance (BFSI) domain! This meticulously curated dataset offers a rich collection of bilingual text data, translated between English and Finnish, providing a valuable resource for developing BFSI domain-specific language models and machine translation engines.
This Parallel Corpus is meticulously curated to capture the linguistic intricacies and domain-specific nuances inherent to the BFSI industry.
The progress of Large Language Models (LLMs) has largely been driven by the availability of large-scale unlabeled text data for unsupervised learning. This work focuses on modeling both content and the corresponding receiver behavior in the same space. Although existing datasets have trillions of content tokens (text, images, audio, and videos), they lack information on receiver effects. To address this, the paper utilizes YouTube, a large publicly available source of content-behavior data, which includes:
Communicator Data: channel name and number of subscribers.
Message: YouTube video IDs, extracted speech, scene-wise captions, on-screen text, video description, video length, and upload date.
Receiver Effect: video likes, views, and replay graphs.
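A minimal sketch (not the released schema) of what one content-behavior record could look like with these three field groups; all names are illustrative:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ContentBehaviorRecord:
    # Communicator data
    channel_name: str
    subscriber_count: int
    # Message
    video_id: str
    extracted_speech: str
    scene_captions: List[str]
    on_screen_text: str
    description: str
    video_length_seconds: float
    upload_date: str
    # Receiver effect
    likes: int
    views: int
    replay_graph: List[float] = field(default_factory=list)
```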
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
License information was derived automatically
HindEnCorp parallel texts (sentence-aligned) come from the following sources:
Tides, which contains 50K sentence pairs taken mainly from news articles. This dataset was originally collected for the DARPA-TIDES surprise-language contest in 2002, later refined at IIIT Hyderabad and provided for the NLP Tools Contest at ICON 2008 (Venkatapathy, 2008).
Commentaries by Daniel Pipes contain 322 articles in English written by the journalist Daniel Pipes and translated into Hindi.
EMILLE. This corpus (Baker et al., 2002) consists of three components: monolingual, parallel and annotated corpora. There are fourteen monolingual subcorpora, including both written and (for some languages) spoken data for fourteen South Asian languages. The EMILLE monolingual corpora contain in total 92,799,000 words (including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu). The parallel corpus consists of 200,000 words of text in English and its accompanying translations into Hindi and other languages.
Smaller datasets as collected by Bojar et al. (2010) include the corpus used at ACL 2005 (a subcorpus of EMILLE), a corpus of named entities from Wikipedia (crawled in 2009), and an agriculture-domain parallel corpus. For the current release, we are extending the parallel corpus using these sources: Intercorp (Čermák and Rosen, 2012) is a large multilingual parallel corpus of 32 languages including Hindi. The central language used for alignment is Czech. Intercorp’s core texts amount to 202 million words. These core texts are most suitable for us because their sentence alignment is manually checked and therefore very reliable. They cover predominantly short stories and novels. There are seven Hindi texts in Intercorp. Unfortunately, an English translation is available for only three of them; the other four are aligned only with Czech texts. The Hindi subcorpus of Intercorp contains 118,000 words in Hindi.
TED talks, held in various languages, primarily English, are equipped with transcripts, and these are translated into 102 languages. There are 179 talks for which a Hindi translation is available.
The Indic multi-parallel corpus (Birch et al., 2011; Post et al., 2012) is a corpus of texts from Wikipedia translated from the respective Indian language into English by non-expert translators hired over Mechanical Turk. The quality is thus somewhat mixed in many respects, from typesetting, punctuation and capitalization to spelling, word choice and sentence structure. Some quality control is in principle possible because every input sentence was translated four times. We used the 2012 release of the corpus.
Launchpad.net is a software collaboration platform that hosts many open-source projects and also facilitates collaborative localization of the tools. We downloaded all revisions of all the hosted projects and extracted the localization (.po) files.
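A minimal sketch (not the authors' pipeline) of extracting sentence pairs from a gettext .po localization file, using the third-party polib library as one possible choice; the file path is hypothetical:

```python
import polib


def po_to_pairs(path):
    """Return (source, translation) pairs from the translated entries of a .po file."""
    pairs = []
    for entry in polib.pofile(path).translated_entries():
        if entry.msgid and entry.msgstr:
            pairs.append((entry.msgid, entry.msgstr))
    return pairs


# Hypothetical path to an English-Hindi localization file exported from Launchpad.
pairs = po_to_pairs("some_project/hi.po")
```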
Other smaller datasets. This time, we added Wikipedia entities as crawled in 2013 (including any morphological variants of the named entity that appears on the Hindi variant of the Wikipedia page) and words, word examples and quotes from the Shabdkosh online dictionary.
https://academictorrents.com/nolicensespecified
Cross-document coreference resolution is the task of grouping the entity mentions in a collection of documents into sets that each represent a distinct entity. It is central to knowledge base construction and also useful for joint inference with other NLP components. Obtaining large, organic labeled datasets for training and testing cross-document coreference has previously been difficult. We use a method for automatically gathering massive amounts of naturally-occurring cross-document reference data to create the Wikilinks dataset, comprising 40 million mentions over 3 million entities. Our method is based on finding hyperlinks to Wikipedia from a web crawl and using anchor text as mentions. In addition to providing large-scale labeled data without human effort, we are able to include many styles of text beyond newswire and many entity types beyond people.
Introduction: The Wikipedia links (WikiLinks) data consists of web pages that satisfy the following two constraints: a. conta…
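A minimal sketch (not the original pipeline) of the core idea: scan crawled HTML for hyperlinks into Wikipedia and keep the anchor text as a mention of the linked entity. The HTML string below is a toy example:

```python
import re
from urllib.parse import unquote, urlparse

# Crude anchor-tag matcher; a real crawl would use a proper HTML parser.
LINK = re.compile(r'<a[^>]+href="([^"]+)"[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL)


def extract_mentions(html):
    mentions = []
    for href, anchor in LINK.findall(html):
        parsed = urlparse(href)
        if parsed.netloc.endswith("wikipedia.org") and parsed.path.startswith("/wiki/"):
            entity = unquote(parsed.path[len("/wiki/"):])
            mentions.append((anchor.strip(), entity))
    return mentions


html = '<p>See <a href="https://en.wikipedia.org/wiki/Barack_Obama">Obama</a>.</p>'
print(extract_mentions(html))  # [('Obama', 'Barack_Obama')]
```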
https://choosealicense.com/licenses/cc0-1.0/
Dataset Description
The National Environmental Policy Act Text Corpus (NEPATEC 1.0) is an AI-ready dataset of NEPA documents collected through a joint effort between Pacific Northwest National Laboratory (PNNL) and the Office of Policy (OP). NEPATEC 1.0 contains data extracted from the Environmental Impact Statement (EIS) Database provided by the United States Environmental Protection Agency. An EIS is a particular type of NEPA document (in PDF form) that analyzes the potential… See the full description on the dataset page: https://huggingface.co/datasets/PolicyAI/NEPATEC1.0.
The (unheralded) first step in many applications of automated text analysis involves selecting keywords to choose documents from a large text corpus for further study. Although all substantive results depend on this choice, researchers usually pick keywords in ad hoc ways that are far from optimal and usually biased. Most seem to think that keyword selection is easy, since they do Google searches every day, but we demonstrate that humans perform exceedingly poorly at this basic task. We offer a better approach, one that also can help with following conversations where participants rapidly innovate language to evade authorities, seek political advantage, or express creativity; generic web searching; eDiscovery; look-alike modeling; industry and intelligence analysis; and sentiment and topic analysis. We develop a computer-assisted (as opposed to fully automated or human-only) statistical approach that suggests keywords from available text without needing structured data as inputs. This framing poses the statistical problem in a new way, which leads to a widely applicable algorithm. Our specific approach is based on training classifiers, extracting information from (rather than correcting) their mistakes, and summarizing results with easy-to-understand Boolean search strings. We illustrate how the technique works with analyses of English texts about the Boston Marathon Bombings, Chinese social media posts designed to evade censorship, and others.
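A minimal, simplified sketch in the spirit of this approach, assuming scikit-learn and toy placeholder documents: a classifier trained on a small reference set is applied to an unlabeled search set, and candidate keywords are ranked by how strongly they separate the documents predicted to be on-topic from the rest. The paper's use of classifier mistakes and Boolean query construction is not reproduced here.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy placeholder documents standing in for a reference set, an off-topic set,
# and the unlabeled search set.
reference = ["marathon bombing suspect arrested", "explosion near the finish line"]
off_topic = ["city council budget vote", "local weather forecast for the weekend"]
search_set = ["police report on the bombing", "new restaurant opens downtown"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(reference + off_topic)
y_train = [1] * len(reference) + [0] * len(off_topic)

clf = MultinomialNB().fit(X_train, y_train)
X_search = vectorizer.transform(search_set)
predicted = clf.predict(X_search)

# Rank keywords by how much more often they appear in documents predicted
# on-topic than in the rest of the search set.
counts = X_search.toarray()
on_topic = counts[predicted == 1].sum(axis=0)
rest = counts[predicted == 0].sum(axis=0)
scores = on_topic - rest
terms = vectorizer.get_feature_names_out()
print([terms[i] for i in np.argsort(scores)[::-1][:5]])
```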