MassiveText is a collection of large English-language text datasets from multiple sources: web pages, books, news articles, and code. The data pipeline includes text quality filtering, removal of repetitious text, deduplication of similar documents, and removal of documents with significant test-set overlap. MassiveText contains 2.35 billion documents or about 10.5 TB of text.
Usage: Gopher is trained on 300B tokens (12.8% of the tokens in the dataset), so the authors sub-sample from MassiveText with sampling proportions specified per subset (books, news, etc.). These sampling proportions are tuned to maximize downstream performance. The largest sampling subset is the curated web-text corpus MassiveWeb, which is found to improve downstream performance relative to existing web-text datasets such as C4 (Raffel et al., 2020).
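A minimal sketch of how such proportion-weighted sub-sampling could be implemented (this is not the authors' code; the subset names and proportions below are illustrative placeholders, not the tuned values reported in the Gopher paper):

```python
import random

# Illustrative subset names and proportions only; the tuned per-subset values
# used for Gopher are reported in the paper and are not reproduced here.
sampling_proportions = {
    "massiveweb": 0.50,
    "books": 0.25,
    "news": 0.10,
    "code": 0.10,
    "wikipedia": 0.05,
}

TARGET_TOKENS = 300_000_000_000  # Gopher's 300B-token training budget


def tokens_per_subset(proportions, target_tokens):
    """Split the total token budget across subsets according to the proportions."""
    return {name: int(p * target_tokens) for name, p in proportions.items()}


def sample_documents(documents, token_budget, count_tokens):
    """Randomly sample documents from one subset until its token budget is spent."""
    shuffled = list(documents)
    random.shuffle(shuffled)
    picked, used = [], 0
    for doc in shuffled:
        n = count_tokens(doc)
        if used + n > token_budget:
            break
        picked.append(doc)
        used += n
    return picked


budgets = tokens_per_subset(sampling_proportions, TARGET_TOKENS)
```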
Datasheets for MassiveText can be found in the Gopher paper.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview: Nepali-Text-Corpus is a comprehensive collection of approximately 6.4 million articles in the Nepali language and the largest Nepali text dataset available. It encompasses a diverse range of text types, including news articles, blogs, and more, making it an invaluable resource for researchers, developers, and enthusiasts in Natural Language Processing (NLP) and computational linguistics.
Dataset Details:
Total Articles: ~6.4 million
Language: Nepali
Size: 27.5 GB (CSV)
Source: Collected from various Nepali news websites, blogs, and other online platforms.
https://www.futurebeeai.com/data-license-agreement
Welcome to the English-Bahasa Bilingual Parallel Corpora dataset for the Banking, Financial Services, and Insurance (BFSI) domain! This meticulously curated dataset offers a rich collection of bilingual text data, translated between English and Bahasa, providing a valuable resource for developing BFSI domain-specific language models and machine translation engines.
This Parallel Corpus is meticulously curated to capture the linguistic intricacies and domain-specific nuances inherent to the BFSI industry.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
🐳 OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
This repository contains 210 million image-text interleaved documents filtered from the OmniCorpus-CC dataset, which was sourced from Common Crawl.
Repository: https://github.com/OpenGVLab/OmniCorpus
Paper (ICLR 2025 Spotlight): https://arxiv.org/abs/2406.08418
The OmniCorpus dataset is a large-scale image-text interleaved dataset that pushes the boundaries of scale and diversity by encompassing… See the full description on the dataset page: https://huggingface.co/datasets/OpenGVLab/OmniCorpus-CC-210M.
Sparse machine learning has recently emerged as a powerful tool for obtaining models of high-dimensional data with a high degree of interpretability, at low computational cost. This paper posits that these methods can be extremely useful for understanding large collections of text documents, without requiring user expertise in machine learning. Our approach relies on three main ingredients: (a) multi-document text summarization and (b) comparative summarization of two corpora, both using sparse regression or classification; (c) sparse principal components and sparse graphical models for unsupervised analysis and visualization of large text corpora. We validate our approach using a corpus of Aviation Safety Reporting System (ASRS) reports and demonstrate that the methods can reveal causal and contributing factors in runway incursions. Furthermore, we show that the methods automatically discover four main tasks that pilots perform during flight, which can aid in further understanding the causal and contributing factors to runway incursions and other drivers of aviation safety incidents. Citation: L. El Ghaoui, G. C. Li, V. Duong, V. Pham, A. N. Srivastava, and K. Bhaduri, “Sparse Machine Learning Methods for Understanding Large Text Corpora,” Proceedings of the Conference on Intelligent Data Understanding, 2011.
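A minimal sketch of the sparse-classification idea behind comparative summarization, assuming scikit-learn and toy placeholder documents (the paper's own implementation and data are not reproduced here): an L1-penalized classifier separates two corpora, and the few features with non-zero weights serve as the comparative summary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy placeholder corpora; in practice these would be two document collections
# to compare, e.g. two subsets of ASRS reports.
corpus_a = ["runway incursion during taxi", "tower cleared the wrong runway"]
corpus_b = ["smooth cruise at altitude", "routine descent and landing"]

docs = corpus_a + corpus_b
labels = [1] * len(corpus_a) + [0] * len(corpus_b)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# The L1 penalty drives most coefficients to exactly zero, so the surviving
# terms form a short, interpretable comparative summary of the two corpora.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
clf.fit(X, labels)

terms = vectorizer.get_feature_names_out()
summary = [t for t, w in zip(terms, clf.coef_[0]) if abs(w) > 1e-6]
print(summary)
```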
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
NOTE FOR WMT PARTICIPANTS: There is an easier version for MT available in Moses format (one sentence per line); the files start with moses_like. If you use this dataset, please cite the following work:
@InProceedings{L18-1546, author = "Soares, Felipe and Moreira, Viviane and Becker, Karin", title = "A Large Parallel Corpus of Full-Text Scientific Articles", booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018)", year = "2018", publisher = "European Language Resource Association", location = "Miyazaki, Japan", url = "http://aclweb.org/anthology/L18-1546" }
We developed a parallel corpus of full-text scientific articles collected from the Scielo database in the following languages: English, Portuguese and Spanish. The corpus is sentence-aligned for all language pairs, as well as trilingually aligned for a small subset of sentences.
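A minimal sketch (not part of the release) of reading the Moses-style files, which are aligned line by line with one sentence per line; the file names below are hypothetical:

```python
# Read two line-aligned Moses-format files and yield sentence pairs.
def read_parallel(src_path, tgt_path):
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        for src_line, tgt_line in zip(src, tgt):
            yield src_line.strip(), tgt_line.strip()


# Hypothetical file names; substitute the actual moses_like files from the release.
for en, pt in read_parallel("moses_like.en-pt.en", "moses_like.en-pt.pt"):
    pass  # feed each sentence pair to an MT training pipeline
```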
Common Corpus
Full data paper
Common Corpus is the largest open and permissibly licensed text dataset, comprising 2 trillion tokens (1,998,647,168,282 tokens). It is a diverse dataset, consisting of books, newspapers, scientific articles, government and legal documents, code, and more. Common Corpus has been created by Pleias in association with several partners and contributed in kind to the Current AI initiative. Common Corpus differs from existing open datasets in that it is:… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/common_corpus.
OpenWebText is an open-source recreation of the WebText corpus. The text is web content extracted from URLs shared on Reddit with at least three upvotes (38 GB).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The list of hyponym-hypernym pairs was obtained by applying lexical-syntactic patterns described in Hearst (1992) on the corpus prepared by Panchenko et al. (2016). This corpus is a concatenation of the English Wikipedia (2016 dump), Gigaword, ukWaC and English news corpora from the Leipzig Corpora Collection. The lexical-syntactic patterns proposed by Marti Hearst (1992) and further extended and implemented in the form of FSTs by Panchenko et al. (2012) for extracting (noisy) hyponym-hypernym pairs are as follows -- (i) such NP as NP, NP[,] and/or NP; (ii) NP such as NP, NP[,] and/or NP; (iii) NP, NP [,] or other NP; (iv) NP, NP [,] and other NP; (v) NP, including NP, NP [,] and/or NP; (vi) NP, especially NP, NP [,] and/or NP. Pattern extraction on the corpus yields a list of 27.6 million hyponym-hypernym pairs along with the frequency of their occurrence in the corpus.
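A minimal, heavily simplified sketch of one such Hearst pattern ("NP such as NP, NP [,] and/or NP") as a regular expression; the original pipeline ran finite-state transducers over NP-chunked text, whereas this illustration approximates noun phrases by single words:

```python
import re

# Simplified Hearst pattern: a single-word hypernym followed by "such as" and a
# comma/and/or-separated list of single-word hyponyms.
PATTERN = re.compile(r"\b(\w+) such as (\w+(?:(?:, and |, or |, | and | or )\w+)*)")


def extract_pairs(sentence):
    pairs = []
    for match in PATTERN.finditer(sentence):
        hypernym = match.group(1)
        hyponyms = re.split(r", and | and |, or | or |, ", match.group(2))
        pairs.extend((h, hypernym) for h in hyponyms if h)
    return pairs


print(extract_pairs("He studies fruits such as apples, oranges and pears."))
# [('apples', 'fruits'), ('oranges', 'fruits'), ('pears', 'fruits')]
```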
https://lindat.mff.cuni.cz/repository/xmlui/page/license-cnc
A corpus of contemporary written (printed) Czech, 4.7 GW (gigawords) in size, i.e. 5.7 billion tokens. It covers mostly the 1990-2019 period and features rich metadata, including detailed bibliographical information, text-type classification, etc. SYN v9 contains a wide variety of text types (fiction, non-fiction, newspapers), but newspapers prevail noticeably. The corpus is lemmatized and morphologically tagged with the new CNC tagset, first used for the annotation of the SYN2020 corpus.
SYN v9 is provided in a CoNLL-U-like vertical format used as input to the Manatee query engine. The data thus correspond to the corpus available via the KonText query interface to registered users of the CNC at http://www.korpus.cz, with one important exception: the corpus is shuffled, i.e. divided into blocks of at most 100 words (respecting sentence boundaries), with the ordering randomized within the given document.
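A minimal sketch (not an official loader) of reading a CoNLL-U-like vertical file of this kind, assuming one token per line with tab-separated word/lemma/tag columns and structural markup such as <doc> and <s> on separate lines; the exact column layout should be checked against the corpus documentation:

```python
# Yield sentences as lists of (word, lemma, tag) tuples from a vertical file.
def read_vertical(path):
    sentence = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith("<"):  # structural tag, e.g. <doc ...>, <s>, </s>
                if line.startswith("</s>") and sentence:
                    yield sentence
                    sentence = []
                continue
            columns = line.split("\t")
            if len(columns) >= 3:
                word, lemma, tag = columns[0], columns[1], columns[2]
                sentence.append((word, lemma, tag))
    if sentence:
        yield sentence
```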
https://www.futurebeeai.com/data-license-agreement
Welcome to the English-Tamil Bilingual Parallel Corpora dataset for the Legal domain! This meticulously curated dataset offers a rich collection of bilingual text data, translated between English and Tamil, providing a valuable resource for developing Legal domain-specific language models and machine translation engines.
This Parallel Corpus is meticulously curated to capture the linguistic intricacies and domain-specific nuances inherent to the Legal industry.
https://www.futurebeeai.com/data-license-agreement
Welcome to the English-French Bilingual Parallel Corpora dataset for the Education domain! This comprehensive dataset contains a vast collection of bilingual text data, carefully translated between English and French, to support the development of Education-specific language models and machine translation engines.
This Parallel Corpus is meticulously curated to capture the linguistic intricacies and domain-specific nuances inherent to the Education industry.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Pro(tein)/Gene corpus was developed at the JULIE Lab Jena under supervision of Prof. Udo Hahn.
The goals of the annotation project were
to construct a consistent and (as far as possible) subdomain-independent and subdomain-comprehensive protein-annotated corpus
to differentiate between protein families and groups, protein complexes, protein molecules, protein variants (e.g. alleles) and elliptic enumerations of proteins.
The corpus has the following annotation levels / entity types:
protein
protein_familiy_or_group
protein_complex
protein_variant
protein_enum
For definitions of the annotation levels, please refer to the Proteins-guidelines-final.doc file that is found in the download package.
To achieve large coverage of biological subdomains, documents from multiple other protein/gene corpora were re-annotated. For further coverage, new document sets were created. All documents are abstracts from PubMed/MEDLINE. The corpus is made up of the union of all the documents in the different subcorpora. All documents are delivered as MMAX2 (http://mmax2.net/) annotation projects.
https://www.futurebeeai.com/data-license-agreement
Welcome to the English-Finnish Bilingual Parallel Corpora dataset for the Banking, Financial Services, and Insurance (BFSI) domain! This meticulously curated dataset offers a rich collection of bilingual text data, translated between English and Finnish, providing a valuable resource for developing BFSI domain-specific language models and machine translation engines.
This Parallel Corpus is meticulously curated to capture the linguistic intricacies and domain-specific nuances inherent to the BFSI industry.
The progress of Large Language Models (LLMs) has largely been driven by the availability of large-scale unlabeled text data for unsupervised learning. This work focuses on modeling both content and the corresponding receiver behavior in the same space. Although existing datasets have trillions of content tokens (text, images, audio, and videos), they lack information on receiver effects. To address this, the paper utilizes YouTube, a large publicly available source of content-behavior data, which includes:
Communicator Data: channel name and number of subscribers.
Message: YouTube video IDs, extracted speech, scene-wise captions, on-screen text, video description, video length, and upload date.
Receiver Effect: video likes, views, and replay graphs.
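A minimal sketch (not the released schema) of what one content-behavior record could look like with these three field groups; all names are illustrative:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ContentBehaviorRecord:
    # Communicator data
    channel_name: str
    subscriber_count: int
    # Message
    video_id: str
    extracted_speech: str
    scene_captions: List[str]
    on_screen_text: str
    description: str
    video_length_seconds: float
    upload_date: str
    # Receiver effect
    likes: int
    views: int
    replay_graph: List[float] = field(default_factory=list)
```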
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
License information was derived automatically
HindEnCorp parallel texts (sentence-aligned) come from the following sources:
Tides, which contains 50K sentence pairs taken mainly from news articles. This dataset was originally collected for the DARPA-TIDES surprise-language contest in 2002, later refined at IIIT Hyderabad and provided for the NLP Tools Contest at ICON 2008 (Venkatapathy, 2008).
Commentaries by Daniel Pipes contain 322 articles in English written by the journalist Daniel Pipes and translated into Hindi.
EMILLE. This corpus (Baker et al., 2002) consists of three components: monolingual, parallel and annotated corpora. There are fourteen monolingual subcorpora, including both written and (for some languages) spoken data for fourteen South Asian languages. The EMILLE monolingual corpora contain in total 92,799,000 words (including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu). The parallel corpus consists of 200,000 words of text in English and its accompanying translations into Hindi and other languages.
Smaller datasets as collected by Bojar et al. (2010) include the corpus used at ACL 2005 (a subcorpus of EMILLE), a corpus of named entities from Wikipedia (crawled in 2009), and an agriculture-domain parallel corpus. For the current release, we are extending the parallel corpus using these sources: Intercorp (Čermák and Rosen, 2012) is a large multilingual parallel corpus of 32 languages including Hindi. The central language used for alignment is Czech. Intercorp’s core texts amount to 202 million words. These core texts are most suitable for us because their sentence alignment is manually checked and therefore very reliable. They cover predominantly short stories and novels. There are seven Hindi texts in Intercorp. Unfortunately, an English translation is available for only three of them; the other four are aligned only with Czech texts. The Hindi subcorpus of Intercorp contains 118,000 words in Hindi.
TED talks, held in various languages, primarily English, are equipped with transcripts, and these are translated into 102 languages. There are 179 talks for which a Hindi translation is available.
The Indic multi-parallel corpus (Birch et al., 2011; Post et al., 2012) is a corpus of texts from Wikipedia translated from the respective Indian language into English by non-expert translators hired over Mechanical Turk. The quality is thus somewhat mixed in many respects, from typesetting, punctuation and capitalization to spelling, word choice and sentence structure. Some quality control is in principle possible because every input sentence was translated four times. We used the 2012 release of the corpus.
Launchpad.net is a software collaboration platform that hosts many open-source projects and also facilitates collaborative localization of the tools. We downloaded all revisions of all the hosted projects and extracted the localization (.po) files.
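A minimal sketch (not the authors' pipeline) of extracting sentence pairs from a gettext .po localization file, using the third-party polib library as one possible choice; the file path is hypothetical:

```python
import polib


def po_to_pairs(path):
    """Return (source, translation) pairs from the translated entries of a .po file."""
    pairs = []
    for entry in polib.pofile(path).translated_entries():
        if entry.msgid and entry.msgstr:
            pairs.append((entry.msgid, entry.msgstr))
    return pairs


# Hypothetical path to an English-Hindi localization file exported from Launchpad.
pairs = po_to_pairs("some_project/hi.po")
```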
Other smaller datasets. This time, we added Wikipedia entities as crawled in 2013 (including any morphological variants of the named entity that appears on the Hindi variant of the Wikipedia page) and words, word examples and quotes from the Shabdkosh online dictionary.
https://academictorrents.com/nolicensespecified
Cross-document coreference resolution is the task of grouping the entity mentions in a collection of documents into sets that each represent a distinct entity. It is central to knowledge base construction and also useful for joint inference with other NLP components. Obtaining large, organic labeled datasets for training and testing cross-document coreference has previously been difficult. We use a method for automatically gathering massive amounts of naturally-occurring cross-document reference data to create the Wikilinks dataset, comprising 40 million mentions over 3 million entities. Our method is based on finding hyperlinks to Wikipedia from a web crawl and using anchor text as mentions. In addition to providing large-scale labeled data without human effort, we are able to include many styles of text beyond newswire and many entity types beyond people.
Introduction: The Wikipedia links (WikiLinks) data consists of web pages that satisfy the following two constraints: a. conta…
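A minimal sketch (not the original pipeline) of the core idea: scan crawled HTML for hyperlinks into Wikipedia and keep the anchor text as a mention of the linked entity. The HTML string below is a toy example:

```python
import re
from urllib.parse import unquote, urlparse

# Crude anchor-tag matcher; a real crawl would use a proper HTML parser.
LINK = re.compile(r'<a[^>]+href="([^"]+)"[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL)


def extract_mentions(html):
    mentions = []
    for href, anchor in LINK.findall(html):
        parsed = urlparse(href)
        if parsed.netloc.endswith("wikipedia.org") and parsed.path.startswith("/wiki/"):
            entity = unquote(parsed.path[len("/wiki/"):])
            mentions.append((anchor.strip(), entity))
    return mentions


html = '<p>See <a href="https://en.wikipedia.org/wiki/Barack_Obama">Obama</a>.</p>'
print(extract_mentions(html))  # [('Obama', 'Barack_Obama')]
```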
https://choosealicense.com/licenses/cc0-1.0/
Dataset Description
The National Environmental Policy Act Text Corpus (NEPATEC 1.0) is an AI-ready dataset of NEPA documents collected through a joint effort between Pacific Northwest National Laboratory (PNNL) and the Office of Policy (OP). NEPATEC 1.0 contains data extracted from the Environmental Impact Statement (EIS) Database provided by the United States Environmental Protection Agency. An EIS is a particular type of NEPA document (in PDF form) that analyzes the potential… See the full description on the dataset page: https://huggingface.co/datasets/PolicyAI/NEPATEC1.0.
The (unheralded) first step in many applications of automated text analysis involves selecting keywords to choose documents from a large text corpus for further study. Although all substantive results depend on this choice, researchers usually pick keywords in ad hoc ways that are far from optimal and usually biased. Most seem to think that keyword selection is easy, since they do Google searches every day, but we demonstrate that humans perform exceedingly poorly at this basic task. We offer a better approach, one that also can help with following conversations where participants rapidly innovate language to evade authorities, seek political advantage, or express creativity; generic web searching; eDiscovery; look-alike modeling; industry and intelligence analysis; and sentiment and topic analysis. We develop a computer-assisted (as opposed to fully automated or human-only) statistical approach that suggests keywords from available text without needing structured data as inputs. This framing poses the statistical problem in a new way, which leads to a widely applicable algorithm. Our specific approach is based on training classifiers, extracting information from (rather than correcting) their mistakes, and summarizing results with easy-to-understand Boolean search strings. We illustrate how the technique works with analyses of English texts about the Boston Marathon Bombings, Chinese social media posts designed to evade censorship, and others.
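A minimal, simplified sketch in the spirit of this approach, assuming scikit-learn and toy placeholder documents: a classifier trained on a small reference set is applied to an unlabeled search set, and candidate keywords are ranked by how strongly they separate the documents predicted to be on-topic from the rest. The paper's use of classifier mistakes and Boolean query construction is not reproduced here.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy placeholder documents standing in for a reference set, an off-topic set,
# and the unlabeled search set.
reference = ["marathon bombing suspect arrested", "explosion near the finish line"]
off_topic = ["city council budget vote", "local weather forecast for the weekend"]
search_set = ["police report on the bombing", "new restaurant opens downtown"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(reference + off_topic)
y_train = [1] * len(reference) + [0] * len(off_topic)

clf = MultinomialNB().fit(X_train, y_train)
X_search = vectorizer.transform(search_set)
predicted = clf.predict(X_search)

# Rank keywords by how much more often they appear in documents predicted
# on-topic than in the rest of the search set.
counts = X_search.toarray()
on_topic = counts[predicted == 1].sum(axis=0)
rest = counts[predicted == 0].sum(axis=0)
scores = on_topic - rest
terms = vectorizer.get_feature_names_out()
print([terms[i] for i in np.argsort(scores)[::-1][:5]])
```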