100+ datasets found
  1. MassiveText Dataset

    • paperswithcode.com
    • library.toponeai.link
    Updated May 23, 2025
    Cite
    Jack W. Rae; Sebastian Borgeaud; Trevor Cai; Katie Millican; Jordan Hoffmann; Francis Song; John Aslanides; Sarah Henderson; Roman Ring; Susannah Young; Eliza Rutherford; Tom Hennigan; Jacob Menick; Albin Cassirer; Richard Powell; George van den Driessche; Lisa Anne Hendricks; Maribeth Rauh; Po-Sen Huang; Amelia Glaese; Johannes Welbl; Sumanth Dathathri; Saffron Huang; Jonathan Uesato; John Mellor; Irina Higgins; Antonia Creswell; Nat McAleese; Amy Wu; Erich Elsen; Siddhant Jayakumar; Elena Buchatskaya; David Budden; Esme Sutherland; Karen Simonyan; Michela Paganini; Laurent SIfre; Lena Martens; Xiang Lorraine Li; Adhiguna Kuncoro; Aida Nematzadeh; Elena Gribovskaya; Domenic Donato; Angeliki Lazaridou; Arthur Mensch; Jean-Baptiste Lespiau; Maria Tsimpoukelli; Nikolai Grigorev; Doug Fritz; Thibault Sottiaux; Mantas Pajarskas; Toby Pohlen; Zhitao Gong; Daniel Toyama; Cyprien de Masson d'Autume; Yujia Li; Tayfun Terzi; Vladimir Mikulik; Igor Babuschkin; Aidan Clark; Diego de Las Casas; Aurelia Guy; Chris Jones; James Bradbury; Matthew Johnson; Blake Hechtman; Laura Weidinger; Iason Gabriel; William Isaac; Ed Lockhart; Simon Osindero; Laura Rimell; Chris Dyer; Oriol Vinyals; Kareem Ayoub; Jeff Stanway; Lorrayne Bennett; Demis Hassabis; Koray Kavukcuoglu; Geoffrey Irving (2025). MassiveText Dataset [Dataset]. https://paperswithcode.com/dataset/massivetext
    Explore at:
    97 scholarly articles cite this dataset.
    Dataset updated
    May 23, 2025
    Authors
    Jack W. Rae; Sebastian Borgeaud; Trevor Cai; Katie Millican; Jordan Hoffmann; Francis Song; John Aslanides; Sarah Henderson; Roman Ring; Susannah Young; Eliza Rutherford; Tom Hennigan; Jacob Menick; Albin Cassirer; Richard Powell; George van den Driessche; Lisa Anne Hendricks; Maribeth Rauh; Po-Sen Huang; Amelia Glaese; Johannes Welbl; Sumanth Dathathri; Saffron Huang; Jonathan Uesato; John Mellor; Irina Higgins; Antonia Creswell; Nat McAleese; Amy Wu; Erich Elsen; Siddhant Jayakumar; Elena Buchatskaya; David Budden; Esme Sutherland; Karen Simonyan; Michela Paganini; Laurent SIfre; Lena Martens; Xiang Lorraine Li; Adhiguna Kuncoro; Aida Nematzadeh; Elena Gribovskaya; Domenic Donato; Angeliki Lazaridou; Arthur Mensch; Jean-Baptiste Lespiau; Maria Tsimpoukelli; Nikolai Grigorev; Doug Fritz; Thibault Sottiaux; Mantas Pajarskas; Toby Pohlen; Zhitao Gong; Daniel Toyama; Cyprien de Masson d'Autume; Yujia Li; Tayfun Terzi; Vladimir Mikulik; Igor Babuschkin; Aidan Clark; Diego de Las Casas; Aurelia Guy; Chris Jones; James Bradbury; Matthew Johnson; Blake Hechtman; Laura Weidinger; Iason Gabriel; William Isaac; Ed Lockhart; Simon Osindero; Laura Rimell; Chris Dyer; Oriol Vinyals; Kareem Ayoub; Jeff Stanway; Lorrayne Bennett; Demis Hassabis; Koray Kavukcuoglu; Geoffrey Irving
    Description

    MassiveText is a collection of large English-language text datasets from multiple sources: web pages, books, news articles, and code. The data pipeline includes text quality filtering, removal of repetitious text, deduplication of similar documents, and removal of documents with significant test-set overlap. MassiveText contains 2.35 billion documents or about 10.5 TB of text.

    Usage: Gopher is trained on 300B tokens (12.8% of the tokens in the dataset), so the authors sub-sample from MassiveText with sampling proportions specified per subset (books, news, etc.). These sampling proportions are tuned to maximize downstream performance. The largest sampling subset is the curated web-text corpus MassiveWeb, which is found to improve downstream performance relative to existing web-text datasets such as C4 (Raffel et al., 2020).

    Datasheets can be found in the Gopher paper.
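
    The per-subset sampling scheme can be illustrated with a short sketch of weighted subset sampling. This is a minimal illustration, not the authors' pipeline, and the subset names and proportions below are placeholders rather than the tuned values reported in the Gopher paper.

    import random

    # Placeholder sampling proportions per MassiveText subset; the tuned
    # values used for Gopher are reported in the paper itself.
    SAMPLING_PROPORTIONS = {
        "massiveweb": 0.48,
        "books": 0.27,
        "news": 0.10,
        "c4": 0.10,
        "code": 0.03,
        "wikipedia": 0.02,
    }

    def sample_subset(rng: random.Random) -> str:
        """Draw a subset name with probability equal to its sampling proportion."""
        names = list(SAMPLING_PROPORTIONS)
        weights = [SAMPLING_PROPORTIONS[n] for n in names]
        return rng.choices(names, weights=weights, k=1)[0]

    # Each training example would be drawn from the subset chosen per step;
    # empirical frequencies over many draws approximate the proportions.
    rng = random.Random(0)
    draws = [sample_subset(rng) for _ in range(10_000)]
    print({name: draws.count(name) / len(draws) for name in SAMPLING_PROPORTIONS})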

  2. A Large Scale Nepali Text Corpus

    • ieee-dataport.org
    Updated Mar 13, 2021
    Cite
    Rabindra Lamsal (2021). A Large Scale Nepali Text Corpus [Dataset]. https://ieee-dataport.org/open-access/large-scale-nepali-text-corpus
    Explore at:
    Dataset updated
    Mar 13, 2021
    Authors
    Rabindra Lamsal
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Health

  3. Nepali Text Corpus Dataset

    • paperswithcode.com
    Updated Nov 23, 2024
    + more versions
    Cite
    Prajwal Thapa; Jinu Nyachhyon; Mridul Sharma; Bal Krishna Bal (2024). Nepali Text Corpus Dataset [Dataset]. https://paperswithcode.com/dataset/nepali-text-corpus
    Explore at:
    Dataset updated
    Nov 23, 2024
    Authors
    Prajwal Thapa; Jinu Nyachhyon; Mridul Sharma; Bal Krishna Bal
    Description

    Overview: Nepali-Text-Corpus is a comprehensive collection of approximately 6.4 million articles in the Nepali language and is the largest Nepali-language text dataset to date. It encompasses a diverse range of text types, including news articles, blogs, and more, making it an invaluable resource for researchers, developers, and enthusiasts in the fields of Natural Language Processing (NLP) and computational linguistics.

    Dataset Details
    Total Articles: ~6.4 million
    Language: Nepali
    Size: 27.5 GB (CSV)
    Source: Collected from various Nepali news websites, blogs, and other online platforms.

  4. English-Bahasa Translated Parallel Corpora for BFSI Domain

    • futurebeeai.com
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). English-Bahasa Translated Parallel Corpora for BFSI Domain [Dataset]. https://www.futurebeeai.com/dataset/parallel-corpora/bahasa-english-translated-parallel-corpus-for-bfsi-domain
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the English-Bahasa Bilingual Parallel Corpora dataset for the Banking, Financial Services, and Insurance (BFSI) domain! This meticulously curated dataset offers a rich collection of bilingual text data, translated between English and Bahasa, providing a valuable resource for developing BFSI domain-specific language models and machine translation engines.

    Dataset Content

    Volume and Diversity:
    Extensive Dataset: Over 50,000 sentences offering a robust dataset for various applications.
    Translator Diversity: Contributions from more than 200 native translators ensure a wide range of linguistic styles and interpretations.
    Sentence Diversity:
    Word Count: Sentences range from 7 to 25 words, suitable for various computational linguistic applications.
    Syntactic Variety: The corpus encompasses sentences with varying syntactic structures, including simple, compound, and complex sentences.
    Interrogative and Imperative Forms: The corpus includes sentences in interrogative (question) and imperative (command) forms, reflecting the conversational nature of the BFSI industry.
    Affirmative and Negative Statements: Both affirmative and negative statements are represented in the corpus, ensuring different polarities.
    Passive and Active Voice: The corpus features sentences written in both active and passive voice, ensuring different perspectives and representations of information.
    Idiomatic Expressions and Figurative Language: The corpus incorporates idiomatic expressions, metaphors, and figurative language commonly used in the BFSI domain.
    Discourse Markers and Connectives: The corpus includes a wide range of discourse markers and connectives, such as conjunctions, transitional phrases, and logical connectors, which are crucial for capturing the logical flow and coherence of the text.
    Cross Translation: The dataset includes cross-translation: a part of the dataset is translated from English to Bahasa and another portion from Bahasa to English, improving bi-directional translation capabilities.

    Domain Specific Content

    This Parallel Corpus is meticulously curated to capture the linguistic intricacies and domain-specific nuances inherent to the BFSI industry.

    Industry-Tailored Terminology: The corpus encompasses a comprehensive lexicon of BFSI-specific terminology, ranging from technical banking and financial terms to insurance-related vocabulary and regulatory jargon.
    Authentic Industry Expressions: Beyond technical terminology, the corpus captures the authentic expressions, idioms, and colloquialisms used within the BFSI industry.
    Contexts Specific to BFSI: The corpus encompasses a wide range of contexts specific to the BFSI domain, including financial transactions, regulatory compliance, risk management, customer service interactions, and more.
    Cross-Domain Applicability: While the primary focus is on the BFSI sector, the corpus also includes relevant cross-domain content, such as general business terminology, legal terms, and language related to technology and digital services.

    Format and Structure

    Multiple Formats: Available in Excel format, with the ability to convert to JSON, TMX, XML, XLIFF, XLS, and other industry-standard formats, facilitating ease of use and integration.
    Structure: It contains information like Serial Number, Unique ID, Source Sentence, Source Sentence Word Count, Target Sentence, and Target Sentence Word Count.
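
    As a sketch of how such a file might be consumed, the snippet below loads the columns listed above with pandas. The file names are hypothetical, the column headers are assumed to match the "Structure" description verbatim, and the same pattern applies to the other FutureBeeAI parallel corpora in this list.

    import pandas as pd

    # Hypothetical file name; column names are taken from the "Structure" item above.
    df = pd.read_excel("english_bahasa_bfsi_parallel_corpus.xlsx")

    # Sanity-check the stated 7-25 word sentence-length range.
    in_range = df["Source Sentence Word Count"].between(7, 25)
    print(f"{in_range.mean():.1%} of source sentences fall in the 7-25 word range")

    # Export tab-separated sentence pairs for a machine-translation toolkit.
    pairs = df[["Source Sentence", "Target Sentence"]].dropna()
    pairs.to_csv("en_id_pairs.tsv", sep="\t", index=False, header=False)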

    Usage and Application

    Machine Translation and Language Localization: It serves as a valuable training resource for developing robust machine translation engines tailored to the BFSI domain.
    NLP Applications: Enabling the creation and improvement of predictive keyboards, spell checkers, grammar checkers, and text/speech understanding systems.

  5. OmniCorpus-CC-210M

    • huggingface.co
    Updated Aug 30, 2024
    Cite
    OpenGVLab (2024). OmniCorpus-CC-210M [Dataset]. https://huggingface.co/datasets/OpenGVLab/OmniCorpus-CC-210M
    Explore at:
    Dataset updated
    Aug 30, 2024
    Dataset authored and provided by
    OpenGVLab
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    🐳 OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

    This repository contains 210 million image-text interleaved documents filtered from the OmniCorpus-CC dataset, which was sourced from Common Crawl.

    Repository: https://github.com/OpenGVLab/OmniCorpus
    Paper (ICLR 2025 Spotlight): https://arxiv.org/abs/2406.08418

    The OmniCorpus dataset is a large-scale image-text interleaved dataset, which pushes the boundaries of scale and diversity by encompassing… See the full description on the dataset page: https://huggingface.co/datasets/OpenGVLab/OmniCorpus-CC-210M.

  6. Data from: Sparse Machine Learning Methods for Understanding Large Text...

    • catalog.data.gov
    • data.nasa.gov
    • +1more
    Updated Apr 10, 2025
    + more versions
    Cite
    Dashlink (2025). Sparse Machine Learning Methods for Understanding Large Text Corpora [Dataset]. https://catalog.data.gov/dataset/sparse-machine-learning-methods-for-understanding-large-text-corpora
    Explore at:
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Dashlink
    Description

    Sparse machine learning has recently emerged as a powerful tool to obtain models of high-dimensional data with a high degree of interpretability, at low computational cost. This paper posits that these methods can be extremely useful for understanding large collections of text documents, without requiring user expertise in machine learning. Our approach relies on three main ingredients: (a) multi-document text summarization and (b) comparative summarization of two corpora, both using sparse regression or classification; (c) sparse principal components and sparse graphical models for unsupervised analysis and visualization of large text corpora. We validate our approach using a corpus of Aviation Safety Reporting System (ASRS) reports and demonstrate that the methods can reveal causal and contributing factors in runway incursions. Furthermore, we show that the methods automatically discover four main tasks that pilots perform during flight, which can aid in further understanding the causal and contributing factors to runway incursions and other drivers for aviation safety incidents. Citation: L. El Ghaoui, G. C. Li, V. Duong, V. Pham, A. N. Srivastava, and K. Bhaduri, "Sparse Machine Learning Methods for Understanding Large Text Corpora," Proceedings of the Conference on Intelligent Data Understanding, 2011.
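
    The comparative-summarization ingredient can be sketched with scikit-learn: an L1-penalized classifier over TF-IDF features drives most term weights to exactly zero, so the surviving terms act as a short, interpretable list of keywords separating two corpora. This is an illustrative sketch of the general technique with invented toy documents, not the authors' implementation.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    # Toy stand-ins for two report collections to be comparatively summarized.
    docs = [
        "aircraft crossed the runway without clearance from the tower",
        "pilot missed the hold-short line during taxi to the runway",
        "smooth flight with normal cruise and routine landing",
        "routine descent and landing with no anomalies reported",
    ]
    labels = [1, 1, 0, 0]  # 1 = runway-incursion reports, 0 = other reports

    vec = TfidfVectorizer()
    X = vec.fit_transform(docs)

    # The L1 penalty zeroes out most coefficients; the nonzero terms act as
    # an interpretable summary of what separates the two classes.
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
    clf.fit(X, labels)

    terms = np.array(vec.get_feature_names_out())
    weights = clf.coef_.ravel()
    keywords = sorted(zip(weights[weights != 0], terms[weights != 0]), reverse=True)
    print(keywords)  # (weight, term) pairs; positive weights point to class 1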

  7. Data from: A Large Parallel Corpus of Full-Text Scientific Articles

    • figshare.com
    Updated May 30, 2023
    Cite
    Felipe Soares; Viviane Pereira Moreira; Karin Becker (2023). A Large Parallel Corpus of Full-Text Scientific Articles [Dataset]. http://doi.org/10.6084/m9.figshare.5382757.v2
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Felipe Soares; Viviane Pereira Moreira; Karin Becker
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    NOTE FOR WMT PARTICIPANTS: There is an easier version for MT available in Moses format (one sentence per line); the files start with moses_like. If you use this dataset, please cite the following work:

    @InProceedings{L18-1546,
      author = "Soares, Felipe and Moreira, Viviane and Becker, Karin",
      title = "A Large Parallel Corpus of Full-Text Scientific Articles",
      booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018)",
      year = "2018",
      publisher = "European Language Resource Association",
      location = "Miyazaki, Japan",
      url = "http://aclweb.org/anthology/L18-1546"
    }

    We developed a parallel corpus of full-text scientific articles collected from the Scielo database in the following languages: English, Portuguese and Spanish. The corpus is sentence-aligned for all language pairs, as well as trilingually aligned for a small subset of sentences.
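
    Because the Moses-style files are plain text with one sentence per line, line i of the source file aligned with line i of the target file, a language pair can be read in a few lines of Python. The file names below are hypothetical; per the note above, the actual files in the package start with moses_like.

    from itertools import islice

    # Hypothetical file names for an English-Portuguese pair.
    EN_FILE = "moses_like.en-pt.en"
    PT_FILE = "moses_like.en-pt.pt"

    def read_parallel(src_path: str, tgt_path: str):
        """Yield (source, target) sentence pairs; line i aligns with line i."""
        with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
            for s, t in zip(src, tgt):
                yield s.strip(), t.strip()

    # Print the first five aligned pairs.
    for en, pt in islice(read_parallel(EN_FILE, PT_FILE), 5):
        print(f"{en}\t{pt}")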

  8. common_corpus

    • huggingface.co
    Updated Nov 13, 2024
    Cite
    PleIAs (2024). common_corpus [Dataset]. https://huggingface.co/datasets/PleIAs/common_corpus
    Explore at:
    Dataset updated
    Nov 13, 2024
    Dataset authored and provided by
    PleIAs
    Description

    Common Corpus

    Full data paper

    Common Corpus is the largest open and permissively licensed text dataset, comprising 2 trillion tokens (1,998,647,168,282 tokens). It is a diverse dataset, consisting of books, newspapers, scientific articles, government and legal documents, code, and more. Common Corpus was created by Pleias in association with several partners and contributed in kind to the Current AI initiative. Common Corpus differs from existing open datasets in that it is:… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/common_corpus.

  9. OpenWebText Dataset

    • paperswithcode.com
    • opendatalab.com
    • +3more
    Updated Jun 16, 2024
    Cite
    Aaron Gokaslan; Vanya Cohen (2024). OpenWebText Dataset [Dataset]. https://paperswithcode.com/dataset/openwebtext
    Explore at:
    Dataset updated
    Jun 16, 2024
    Authors
    Aaron Gokaslan; Vanya Cohen
    Description

    OpenWebText is an open-source recreation of the WebText corpus. The text is web content extracted from URLs shared on Reddit with at least three upvotes (38 GB in total).

  10. Hypernyms extracted from a large text corpus using Hearst lexical-syntactic...

    • live.european-language-grid.eu
    Updated Sep 11, 2021
    Cite
    (2021). Hypernyms extracted from a large text corpus using Hearst lexical-syntactic patterns [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/7392
    Explore at:
    Available download formats: csv
    Dataset updated
    Sep 11, 2021
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The list of hyponym-hypernym pairs was obtained by applying lexical-syntactic patterns described in Hearst (1992) on the corpus prepared by Panchenko et al. (2016). This corpus is a concatenation of the English Wikipedia (2016 dump), Gigaword, ukWaC and English news corpora from the Leipzig Corpora Collection. The lexical-syntactic patterns proposed by Marti Hearst (1992) and further extended and implemented in the form of FSTs by Panchenko et al. (2012) for extracting (noisy) hyponym-hypernym pairs are as follows -- (i) such NP as NP, NP[,] and/or NP; (ii) NP such as NP, NP[,] and/or NP; (iii) NP, NP [,] or other NP; (iv) NP, NP [,] and other NP; (v) NP, including NP, NP [,] and/or NP; (vi) NP, especially NP, NP [,] and/or NP. Pattern extraction on the corpus yields a list of 27.6 million hyponym-hypernym pairs along with the frequency of their occurrence in the corpus.
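
    As an illustration, the sketch below implements a simplified version of pattern (ii), "NP such as NP, NP[,] and/or NP", approximating a noun phrase as one or two words. The actual extraction used FST implementations over large corpora (Panchenko et al., 2012), so this regex is only a toy that will both over- and under-generate.

    import re

    # Crude noun-phrase approximation: one word, optionally followed by another.
    WORD = r"[A-Za-z][\w-]*"
    PATTERN = re.compile(
        rf"({WORD}(?: {WORD})?) such as ({WORD}(?:, {WORD})*(?:,? (?:and|or) {WORD})?)"
    )

    def extract_pairs(sentence: str):
        """Return (hyponym, hypernym) pairs found by the simplified pattern."""
        pairs = []
        for m in PATTERN.finditer(sentence):
            hypernym = m.group(1)
            for hyponym in re.split(r", |,? and |,? or ", m.group(2)):
                if hyponym:
                    pairs.append((hyponym, hypernym))
        return pairs

    print(extract_pairs("Stringed instruments such as violins, cellos and guitars remain popular."))
    # [('violins', 'Stringed instruments'), ('cellos', 'Stringed instruments'),
    #  ('guitars', 'Stringed instruments')]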

  11. SYN v9: large corpus of written Czech

    • lindat.mff.cuni.cz
    • live.european-language-grid.eu
    Updated Dec 5, 2021
    + more versions
    Cite
    Michal Křen; Václav Cvrček; Jan Henyš; Milena Hnátková; Tomáš Jelínek; Jan Kocek; Dominika Kováříková; Jan Křivan; Jiří Milička; Vladimír Petkevič; Pavel Procházka; Hana Skoumalová; Jana Šindlerová; Michal Škrabal (2021). SYN v9: large corpus of written Czech [Dataset]. https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-4635
    Explore at:
    Dataset updated
    Dec 5, 2021
    Authors
    Michal Křen; Václav Cvrček; Jan Henyš; Milena Hnátková; Tomáš Jelínek; Jan Kocek; Dominika Kováříková; Jan Křivan; Jiří Milička; Vladimír Petkevič; Pavel Procházka; Hana Skoumalová; Jana Šindlerová; Michal Škrabal
    License

    https://lindat.mff.cuni.cz/repository/xmlui/page/license-cnc

    Description

    Corpus of contemporary written (printed) Czech, sized 4.7 gigawords (i.e. 5.7 billion tokens). It covers mostly the 1990-2019 period and features rich metadata, including detailed bibliographical information, text-type classification, etc. SYN v9 contains a wide variety of text types (fiction, non-fiction, newspapers), but newspapers prevail noticeably. The corpus is lemmatized and morphologically tagged using the new CNC tagset first utilized for the annotation of the SYN2020 corpus.

    SYN v9 is provided in a CoNLL-U-like vertical format used as input to the Manatee query engine. The data thus correspond to the corpus available via the KonText query interface to registered users of CNC at http://www.korpus.cz, with one important exception: the corpus is shuffled, i.e. divided into blocks of at most 100 words (respecting sentence boundaries), with the ordering of blocks randomized within each document.
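
    A file in such a vertical format (one token per line with tab-separated positional attributes, and structural tags such as <doc>, <s> and </s> on their own lines) can be scanned with a short Python sketch. The exact attribute columns of SYN v9 are not reproduced here; the word/lemma/tag layout below is an assumption for illustration.

    from typing import Iterator, List, Tuple

    Token = Tuple[str, str, str]  # (word, lemma, tag) -- assumed, simplified layout

    def read_vertical(path: str) -> Iterator[List[Token]]:
        """Yield one sentence at a time from a Manatee-style vertical file."""
        sentence: List[Token] = []
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                line = line.rstrip("\n")
                if line.startswith("<"):
                    # Structural markup; </s> closes a sentence.
                    if line.startswith("</s") and sentence:
                        yield sentence
                        sentence = []
                    continue
                cols = line.split("\t")
                if len(cols) >= 3:
                    sentence.append((cols[0], cols[1], cols[2]))
        if sentence:  # tolerate a missing final </s>
            yield sentence

    # Example: lemma frequencies over the first 1,000 sentences.
    # from collections import Counter
    # from itertools import islice
    # freq = Counter(lemma for sent in islice(read_vertical("syn_v9.vert"), 1000)
    #                for _, lemma, _ in sent)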

  12. English-Tamil translated Parallel Corpora for Legal Domain

    • futurebeeai.com
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). English-Tamil translated Parallel Corpora for Legal Domain [Dataset]. https://www.futurebeeai.com/dataset/parallel-corpora/tamil-english-translated-parallel-corpus-for-legal-domain
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the English-Tamil Bilingual Parallel Corpora dataset for the Legal domain! This meticulously curated dataset offers a rich collection of bilingual text data, translated between English and Tamil, providing a valuable resource for developing Legal domain-specific language models and machine translation engines.

    Dataset Content

    Volume and Diversity:
    Extensive Dataset: Over 50,000 sentences offering a robust dataset for various applications.
    Translator Diversity: Contributions from more than 200 native translators ensure a wide range of linguistic styles and interpretations.
    Sentence Diversity:
    Word Count: Sentences range from 7 to 25 words, suitable for various computational linguistic applications.
    Syntactic Variety: The corpus encompasses sentences with varying syntactic structures, including simple, compound, and complex sentences.
    Interrogative and Imperative Forms: The corpus includes sentences in interrogative (question) and imperative (command) forms, reflecting the conversational nature of the Legal industry.
    Affirmative and Negative Statements: Both affirmative and negative statements are represented in the corpus, ensuring different polarities.
    Passive and Active Voice: The corpus features sentences written in both active and passive voice, ensuring different perspectives and representations of information.
    Idiomatic Expressions and Figurative Language: The corpus incorporates idiomatic expressions, metaphors, and figurative language commonly used in the Legal domain.
    Discourse Markers and Connectives: The corpus includes a wide range of discourse markers and connectives, such as conjunctions, transitional phrases, and logical connectors, which are crucial for capturing the logical flow and coherence of the text.
    Cross Translation: The dataset includes a cross-translation, where a part of the dataset is translated from English to Tamil and another portion is translated from Tamil to English, to improve bi-directional translation capabilities.

    Domain Specific Content

    This Parallel Corpus is meticulously curated to capture the linguistic intricacies and domain-specific nuances inherent to the Legal industry.

    Industry-Tailored Terminology: The corpus encompasses a comprehensive lexicon of Legal-specific terminology, ranging from technical terms related to contracts, torts, and criminal law to legal procedures and court documentation.
    Authentic Industry Expressions: Beyond technical terminology, the corpus captures the authentic expressions, idioms, and colloquialisms used within the Legal domain.
    Contexts Specific to Legal Domain: The corpus encompasses a diverse range of contexts specific to the Legal domain, including legal briefs, memoranda, contracts, agreements, legal articles, scholarly papers, etc.
    Cross-Domain Applicability: While the primary focus is on the Legal domain, the corpus also includes relevant cross-domain content, such as business and financial terminology, government and public policy terminology, technology and cybersecurity terms, etc.

    Format and Structure

    Multiple Formats: Available in Excel format, with the ability to convert to JSON, TMX, XML, XLIFF, XLS, and other industry-standard formats, facilitating ease of use and integration.
    Structure: It contains information like Serial Number, Unique ID, Source Sentence, Source Sentence Word Count, Target Sentence, and Target Sentence Word Count.

    Usage and Application

    Machine Translation: Develop accurate machine translation engines for legal content localization, enabling seamless communication across languages in legal proceedings.
    NLP Applications: Enabling the creation and improvement of predictive keyboards, spell checkers, grammar checkers, and text/speech understanding systems.

  13. English-French translated Parallel Corpora for Education Domain

    • futurebeeai.com
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). English-French translated Parallel Corpora for Education Domain [Dataset]. https://www.futurebeeai.com/dataset/parallel-corpora/french-english-translated-parallel-corpus-for-education-domain
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Area covered
    French
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the English-French Bilingual Parallel Corpora dataset for the Education domain! This comprehensive dataset contains a vast collection of bilingual text data, carefully translated between English and French, to support the development of Education-specific language models and machine translation engines.

    Dataset Content

    Volume and Diversity:
    Extensive Dataset: Over 50,000 sentences offering a robust dataset for various applications.
    Translator Diversity: Contributions from more than 200 native translators ensure a wide range of linguistic styles and interpretations.
    Sentence Diversity:
    Word Count: Sentences range from 7 to 25 words, suitable for various computational linguistic applications.
    Syntactic Variety: The corpus encompasses sentences with varying syntactic structures, including simple, compound, and complex sentences.
    Interrogative and Imperative Forms: The corpus includes sentences in interrogative (question) and imperative (command) forms, reflecting the conversational nature of the education industry.
    Affirmative and Negative Statements: Both affirmative and negative statements are represented in the corpus, ensuring different polarities.
    Passive and Active Voice: The corpus features sentences written in both active and passive voice, ensuring different perspectives and representations of information.
    Idiomatic Expressions and Figurative Language: The corpus incorporates idiomatic expressions, metaphors, and figurative language commonly used in the Education domain.
    Discourse Markers and Connectives: The corpus includes a wide range of discourse markers and connectives, such as conjunctions, transitional phrases, and logical connectors, which are crucial for capturing the logical flow and coherence of the text.
    Cross Translation: The dataset includes a cross-translation, where a part of the dataset is translated from English to French and another portion is translated from French to English, to improve bi-directional translation capabilities.

    Domain Specific Content

    This Parallel Corpus is meticulously curated to capture the linguistic intricacies and domain-specific nuances inherent to the Education industry.

    Industry-Tailored Terminology: The corpus encompasses a comprehensive lexicon of Education-specific terminology, ranging from technical terms related to pedagogy, curriculum design, and educational technology to teaching methodologies and learning theories.
    Authentic Industry Expressions: Beyond technical terminology, the corpus captures the authentic expressions, idioms, and colloquialisms used within the Education domain, including classroom instructions, academic discussions, and educational feedback.
    Contexts Specific to Education Domain: The corpus encompasses a diverse range of contexts specific to the Education domain, including lesson plans, academic papers, educational resources, and online courses.
    Cross-Domain Applicability: While the primary focus is on the Education domain, the corpus also includes relevant cross-domain content from related areas, such as child psychology, educational psychology, cognitive science, and learning technologies.

    Format and Structure

    Multiple Formats: Available in Excel format, with the ability to convert to JSON, TMX, XML, XLIFF, XLS, and other industry-standard formats, facilitating ease of use and integration.
    Structure: It contains information like Serial Number, Unique ID, Source Sentence, Source Sentence Word Count, Target Sentence, and Target Sentence Word Count.

    Usage and Application

    Machine Translation: Develop accurate machine translation engines for educational content
    NLP Applications: Improve predictive keyboards, spell checkers, grammar checkers, and text/speech understanding systems tailored for educational contexts.

  14. Data from: ProGene - A Large-scale, High-Quality Protein-Gene Annotated...

    • data.niaid.nih.gov
    • live.european-language-grid.eu
    • +1more
    Updated Jun 12, 2020
    Cite
    Lohr, Christina (2020). ProGene - A Large-scale, High-Quality Protein-Gene Annotated Benchmark Corpus [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3698567
    Explore at:
    Dataset updated
    Jun 12, 2020
    Dataset provided by
    Modersohn, Luise
    Faessler, Erik
    Hahn, Udo
    Lohr, Christina
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Pro(tein)/Gene corpus was developed at the JULIE Lab Jena under the supervision of Prof. Udo Hahn.

    The goals of the annotation project were

    to construct a consistent and (as far as possible) subdomain-independent/-comprehensive protein-annotated corpus

    to differentiate between protein families and groups, protein complexes, protein molecules, protein variants (e.g. alleles) and elliptic enumerations of proteins.

    The corpus has the following annotation levels / entity types:

    protein

    protein_familiy_or_group

    protein_complex

    protein_variant

    protein_enum

    For definitions of the annotation levels, please refer to the Proteins-guidelines-final.doc file that is found in the download package.

    To achieve a large coverage of biological subdomains, documents from multiple other protein/gene corpora were reannotated. For further coverage, new document sets were created. All documents are abstracts from PubMed/MEDLINE. The corpus is made up of the union of all the documents in the different subcorpora. All documents are delivered as MMAX2 (http://mmax2.net/) annotation projects.

  15. English-Finnish Translated Parallel Corpora for BFSI Domain

    • futurebeeai.com
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). English-Finnish Translated Parallel Corpora for BFSI Domain [Dataset]. https://www.futurebeeai.com/dataset/parallel-corpora/finnish-english-translated-parallel-corpus-for-bfsi-domain
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the English-Finnish Bilingual Parallel Corpora dataset for the Banking, Financial Services, and Insurance (BFSI) domain! This meticulously curated dataset offers a rich collection of bilingual text data, translated between English and Finnish, providing a valuable resource for developing BFSI domain-specific language models and machine translation engines.

    Dataset Content

    Volume and Diversity:
    Extensive Dataset: Over 50,000 sentences offering a robust dataset for various applications.
    Translator Diversity: Contributions from more than 200 native translators ensure a wide range of linguistic styles and interpretations.
    Sentence Diversity:
    Word Count: Sentences range from 7 to 25 words, suitable for various computational linguistic applications.
    Syntactic Variety: The corpus encompasses sentences with varying syntactic structures, including simple, compound, and complex sentences.
    Interrogative and Imperative Forms: The corpus includes sentences in interrogative (question) and imperative (command) forms, reflecting the conversational nature of the BFSI industry.
    Affirmative and Negative Statements: Both affirmative and negative statements are represented in the corpus, ensuring different polarities.
    Passive and Active Voice: The corpus features sentences written in both active and passive voice, ensuring different perspectives and representations of information.
    Idiomatic Expressions and Figurative Language: The corpus incorporates idiomatic expressions, metaphors, and figurative language commonly used in the BFSI domain.
    Discourse Markers and Connectives: The corpus includes a wide range of discourse markers and connectives, such as conjunctions, transitional phrases, and logical connectors, which are crucial for capturing the logical flow and coherence of the text.
    Cross Translation: The dataset includes cross-translation: a part of the dataset is translated from English to Finnish and another portion from Finnish to English, improving bi-directional translation capabilities.

    Domain Specific Content

    This Parallel Corpus is meticulously curated to capture the linguistic intricacies and domain-specific nuances inherent to the BFSI industry.

    Industry-Tailored Terminology: The corpus encompasses a comprehensive lexicon of BFSI-specific terminology, ranging from technical banking and financial terms to insurance-related vocabulary and regulatory jargon.
    Authentic Industry Expressions: Beyond technical terminology, the corpus captures the authentic expressions, idioms, and colloquialisms used within the BFSI industry.
    Contexts Specific to BFSI: The corpus encompasses a wide range of contexts specific to the BFSI domain, including financial transactions, regulatory compliance, risk management, customer service interactions, and more.
    Cross-Domain Applicability: While the primary focus is on the BFSI sector, the corpus also includes relevant cross-domain content, such as general business terminology, legal terms, and language related to technology and digital services.

    Format and Structure

    Multiple Formats: Available in Excel format, with the ability to convert to JSON, TMX, XML, XLIFF, XLS, and other industry-standard formats, facilitating ease of use and integration.
    Structure: It contains information like Serial Number, Unique ID, Source Sentence, Source Sentence Word Count, Target Sentence, and Target Sentence Word Count.

    Usage and Application

    Machine Translation and Language Localization: It serves as a valuable training resource for developing robust machine translation engines tailored to the BFSI domain.
    NLP Applications: Enabling the creation and improvement of predictive keyboards, spell checkers, grammar checkers, and text/speech understanding systems.

  16. Content Behavior Corpus Dataset

    • paperswithcode.com
    Updated Aug 31, 2023
    + more versions
    Cite
    Ashmit Khandelwal; Aditya Agrawal; Aanisha Bhattacharyya; Yaman K Singla; Somesh Singh; Uttaran Bhattacharya; Ishita Dasgupta; Stefano Petrangeli; Rajiv Ratn Shah; Changyou Chen; Balaji Krishnamurthy (2023). Content Behavior Corpus Dataset [Dataset]. https://paperswithcode.com/dataset/content-behavior-corpus
    Explore at:
    Dataset updated
    Aug 31, 2023
    Authors
    Ashmit Khandelwal; Aditya Agrawal; Aanisha Bhattacharyya; Yaman K Singla; Somesh Singh; Uttaran Bhattacharya; Ishita Dasgupta; Stefano Petrangeli; Rajiv Ratn Shah; Changyou Chen; Balaji Krishnamurthy
    Description

    The progress of Large Language Models (LLMs) has largely been driven by the availability of large-scale unlabeled text data for unsupervised learning. This work focuses on modeling both content and the corresponding receiver behavior in the same space. Although existing datasets have trillions of content tokens (text, images, audio, and videos), they lack information on receiver effects. To address this, the paper utilizes YouTube, a large publicly available source of content-behavior data, which includes:

    Communicator Data: channel name and number of subscribers.
    Message: YouTube video IDs, extracted speech, scene-wise captions, on-screen text, video description, video length, upload date.
    Receiver Effect: video likes, views, and replay graphs.
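
    One possible in-code representation of these three field groups is sketched below as a Python dataclass; every field name is an illustrative guess based on the description above, not the dataset's actual schema.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class ContentBehaviorRecord:
        """Illustrative record layout; field names are guesses, not the schema."""
        # Communicator data
        channel_name: str
        subscriber_count: int
        # Message
        video_id: str
        extracted_speech: str = ""
        scene_captions: List[str] = field(default_factory=list)
        on_screen_text: str = ""
        description: str = ""
        video_length_s: float = 0.0
        upload_date: str = ""
        # Receiver effect
        likes: int = 0
        views: int = 0
        replay_graph: List[float] = field(default_factory=list)  # per-segment replay intensity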

  17. hind_encorp

    • huggingface.co
    • paperswithcode.com
    • +3more
    Updated Mar 22, 2014
    Cite
    Pavel Rychlý (2014). hind_encorp [Dataset]. https://huggingface.co/datasets/pary/hind_encorp
    Explore at:
    Dataset updated
    Mar 22, 2014
    Authors
    Pavel Rychlý
    License

    Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
    License information was derived automatically

    Description

    HindEnCorp parallel texts (sentence-aligned) come from the following sources: Tides, which contains 50K sentence pairs taken mainly from news articles. This dataset was originally collected for the DARPA-TIDES surprise-language contest in 2002, later refined at IIIT Hyderabad and provided for the NLP Tools Contest at ICON 2008 (Venkatapathy, 2008).

    Commentaries by Daniel Pipes contain 322 articles in English written by a journalist Daniel Pipes and translated into Hindi.

    EMILLE. This corpus (Baker et al., 2002) consists of three components: monolingual, parallel and annotated corpora. There are fourteen monolingual subcorpora, including both written and (for some languages) spoken data for fourteen South Asian languages. The EMILLE monolingual corpora contain in total 92,799,000 words (including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu). The parallel corpus consists of 200,000 words of text in English and its accompanying translations into Hindi and other languages.

    Smaller datasets as collected by Bojar et al. (2010) include the corpus used at ACL 2005 (a subcorpus of EMILLE), a corpus of named entities from Wikipedia (crawled in 2009), and an agriculture-domain parallel corpus. For the current release, we are extending the parallel corpus using these sources: Intercorp (Čermák and Rosen, 2012) is a large multilingual parallel corpus of 32 languages including Hindi. The central language used for alignment is Czech. Intercorp's core texts amount to 202 million words. These core texts are most suitable for us because their sentence alignment is manually checked and therefore very reliable. They cover predominantly short stories and novels. There are seven Hindi texts in Intercorp. Unfortunately, an English translation is available for only three of them; the other four are aligned only with Czech texts. The Hindi subcorpus of Intercorp contains 118,000 words in Hindi.

    TED talks, held in various languages, primarily English, are equipped with transcripts, and these are translated into 102 languages. There are 179 talks for which a Hindi translation is available.

    The Indic multi-parallel corpus (Birch et al., 2011; Post et al., 2012) is a corpus of texts from Wikipedia translated from the respective Indian language into English by non-expert translators hired over Mechanical Turk. The quality is thus somewhat mixed in many respects, ranging from typesetting, punctuation, capitalization and spelling to word choice and sentence structure. In principle, a little quality control is afforded by the fact that every input sentence was translated four times. We used the 2012 release of the corpus.

    Launchpad.net is a software collaboration platform that hosts many open-source projects and also facilitates collaborative localization of the tools. We downloaded all revisions of all the hosted projects and extracted the localization (.po) files.

    Other smaller datasets. This time, we added Wikipedia entities as crawled in 2013 (including any morphological variants of the named entity that appears on the Hindi variant of the Wikipedia page) and words, word examples and quotes from the Shabdkosh online dictionary.

  18. Wikilinks: A Large-scale Cross-Document Coreference Corpus Labeled via Links...

    • academictorrents.com
    Updated Mar 4, 2017
    + more versions
    Cite
    Sameer Singh and Amarnag Subramanya and Fernando Pereira and Andrew McCallum (2017). Wikilinks: A Large-scale Cross-Document Coreference Corpus Labeled via Links to Wikipedia (Extended Dataset) [Dataset]. https://academictorrents.com/details/689af6f153e097538ad7b8fd4ea3e87ce8f6bc42
    Explore at:
    Available download formats: bittorrent
    Dataset updated
    Mar 4, 2017
    Dataset authored and provided by
    Sameer Singh and Amarnag Subramanya and Fernando Pereira and Andrew McCallum
    License

    https://academictorrents.com/nolicensespecified

    Description

    Cross-document coreference resolution is the task of grouping the entity mentions in a collection of documents into sets that each represent a distinct entity. It is central to knowledge base construction and also useful for joint inference with other NLP components. Obtaining large, organic labeled datasets for training and testing cross-document coreference has previously been difficult. We use a method for automatically gathering massive amounts of naturally-occurring cross-document reference data to create the Wikilinks dataset, comprising 40 million mentions over 3 million entities. Our method is based on finding hyperlinks to Wikipedia from a web crawl and using anchor text as mentions. In addition to providing large-scale labeled data without human effort, we are able to include many styles of text beyond newswire and many entity types beyond people.

    Introduction: The Wikipedia links (WikiLinks) data consists of web pages that satisfy the following two constraints: a. conta
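
    The core extraction idea, treating the anchor text of hyperlinks into Wikipedia as entity mentions labeled by their link targets, can be sketched with the Python standard library. This is a simplified reconstruction for illustration, not the authors' pipeline.

    from html.parser import HTMLParser

    class WikiAnchorParser(HTMLParser):
        """Collect (anchor_text, wikipedia_url) mention pairs from one HTML page."""

        def __init__(self):
            super().__init__()
            self._href = None
            self._text = []
            self.mentions = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href") or ""
                if "wikipedia.org/wiki/" in href:
                    self._href, self._text = href, []

        def handle_data(self, data):
            if self._href is not None:
                self._text.append(data)

        def handle_endtag(self, tag):
            if tag == "a" and self._href is not None:
                mention = "".join(self._text).strip()
                if mention:
                    self.mentions.append((mention, self._href))
                self._href = None

    page = '<p>Born in <a href="https://en.wikipedia.org/wiki/Honolulu">Honolulu</a>.</p>'
    parser = WikiAnchorParser()
    parser.feed(page)
    print(parser.mentions)  # [('Honolulu', 'https://en.wikipedia.org/wiki/Honolulu')]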

  19. NEPATEC1.0

    • huggingface.co
    Updated Jun 21, 2024
    Cite
    PermitAI (2024). NEPATEC1.0 [Dataset]. https://huggingface.co/datasets/PolicyAI/NEPATEC1.0
    Explore at:
    Dataset updated
    Jun 21, 2024
    Dataset authored and provided by
    PermitAI
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    Dataset Description

    The National Environmental Policy Act Text Corpus (NEPATEC 1.0) is an AI-ready dataset of NEPA documents collected through a joint effort between Pacific Northwest National Laboratory (PNNL) and the Office of Policy (OP). NEPATEC 1.0 contains data extracted from the Environmental Impact Statement (EIS) Database provided by the United States Environmental Protection Agency. An EIS is a particular type of NEPA document (in PDF form) that analyzes the potential… See the full description on the dataset page: https://huggingface.co/datasets/PolicyAI/NEPATEC1.0.

  20. Replication Data for: Computer-Assisted Keyword and Document Set Discovery...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    King, Gary; Patrick Lam; Margaret E. Roberts (2023). Replication Data for: Computer-Assisted Keyword and Document Set Discovery from Unstructured Text [Dataset]. http://doi.org/10.7910/DVN/FMJDCD
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    King, Gary; Patrick Lam; Margaret E. Roberts
    Description

    The (unheralded) first step in many applications of automated text analysis involves selecting keywords to choose documents from a large text corpus for further study. Although all substantive results depend on this choice, researchers usually pick keywords in ad hoc ways that are far from optimal and usually biased. Most seem to think that keyword selection is easy, since they do Google searches every day, but we demonstrate that humans perform exceedingly poorly at this basic task. We offer a better approach, one that also can help with following conversations where participants rapidly innovate language to evade authorities, seek political advantage, or express creativity; generic web searching; eDiscovery; look-alike modeling; industry and intelligence analysis; and sentiment and topic analysis. We develop a computer-assisted (as opposed to fully automated or human-only) statistical approach that suggests keywords from available text without needing structured data as inputs. This framing poses the statistical problem in a new way, which leads to a widely applicable algorithm. Our specific approach is based on training classifiers, extracting information from (rather than correcting) their mistakes, and summarizing results with easy-to-understand Boolean search strings. We illustrate how the technique works with analyses of English texts about the Boston Marathon Bombings, Chinese social media posts designed to evade censorship, and others.
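
    The recipe sketched here (train classifiers to separate a keyword-defined reference set from a search set, treat the documents they pull toward the reference class as a discovered target set, then rank terms that separate the target set from the rest) can be caricatured in a few lines. This is a loose illustration with invented toy documents, not King, Lam, and Roberts' actual algorithm.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Reference set: documents retrieved by an initial keyword ("marathon").
    reference = [
        "marathon bombing suspect arrested",
        "explosions near marathon finish line",
    ]
    # Search set: a broader pool in which related documents should be found.
    search = [
        "runners injured by blasts at the finish line",
        "city council debates new parking rules",
        "manhunt follows attack on race spectators",
        "local bakery wins small business award",
    ]

    vec = CountVectorizer()
    X = vec.fit_transform(reference + search)
    y = np.array([1] * len(reference) + [0] * len(search))

    # Step 1: a classifier trained to separate reference from search documents.
    clf = MultinomialNB().fit(X, y)

    # Step 2: its "mistakes" on the search set -- documents scored as
    # reference-like -- form a candidate target set of related documents.
    probs = clf.predict_proba(X[len(reference):])[:, 1]
    target = np.zeros(len(search), dtype=bool)
    target[np.argsort(probs)[-2:]] = True  # the two most reference-like documents

    # Step 3: rank terms by how sharply they separate the target set from the
    # rest of the search set; top terms suggest new search keywords.
    Xs = X[len(reference):].toarray()
    score = Xs[target].mean(axis=0) - Xs[~target].mean(axis=0)
    terms = np.array(vec.get_feature_names_out())
    print(terms[np.argsort(score)[::-1][:5]])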
