100+ datasets found
  1. News classification dataset for NLP

    • kaggle.com
    zip
    Updated Feb 27, 2024
    Cite
    Alessandro Lo Bello (2024). News classification dataset for NLP [Dataset]. https://www.kaggle.com/datasets/alessandrolobello/guardian
    Explore at:
    Available download formats: zip (117148704 bytes)
    Dataset updated
    Feb 27, 2024
    Authors
    Alessandro Lo Bello
    Description

    This dataset is compiled through scraping articles from Guardian news starting from 2021 up to the present. It's meticulously crafted to facilitate multimodal classification tasks, especially in the realm of Natural Language Processing (NLP).

    The dataset comprises 13 labels representing various topics covered in the news, ranging from sports to economics, politics, art, and entertainment. Each label corresponds to a distinct category, offering a comprehensive overview of the diverse subjects discussed in journalistic discourse. Researchers and practitioners can utilize these labels for multi-label classification tasks, aiming to categorize news articles based on their thematic content.

    The topics encapsulated by these labels reflect the multifaceted nature of contemporary news media, capturing the breadth and depth of global events and developments. Whether it's the latest in sports, updates on economic trends, insights into political affairs, or coverage of the arts and entertainment industry, this dataset encompasses a wide spectrum of subjects.

    The file format is Parquet. If you are new to this, just call pd.read_parquet(filepath) to open it and you're done!
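
    For anyone new to Parquet, a minimal sketch of loading and inspecting the file with pandas (the file name and the label column name below are assumptions; check the dataset page for the actual schema):

        import pandas as pd

        # Load the Guardian news dump (file name is an assumption; point this at your download).
        df = pd.read_parquet("guardian_news.parquet")

        print(df.shape)
        print(df.columns.tolist())
        # "label" is an assumed column name; replace it with the real one from df.columns.
        if "label" in df.columns:
            print(df["label"].value_counts())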

  2. NLP Dataset

    • kaggle.com
    zip
    Updated Jan 21, 2024
    Cite
    Scientific Stephen (2024). NLP Dataset [Dataset]. https://www.kaggle.com/datasets/scientificstephen/language
    Explore at:
    Available download formats: zip (607351 bytes)
    Dataset updated
    Jan 21, 2024
    Authors
    Scientific Stephen
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This project is about text classification, leveraging a transformer-based NLP model (BERT) to classify whether a given text relates to a disaster or not. It involves data cleaning, tokenization, and fine-tuning a pre-trained transformer model on this specific task.
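
    A minimal sketch of the kind of pipeline described above, using Hugging Face Transformers; the CSV name, column names, and hyperparameters are assumptions for illustration, not the project's actual configuration:

        import pandas as pd
        from datasets import Dataset
        from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                                  Trainer, TrainingArguments)

        # Assumed input: a CSV with a "text" column and a binary "target" column.
        df = pd.read_csv("train.csv")[["text", "target"]].rename(columns={"target": "label"})
        ds = Dataset.from_pandas(df).train_test_split(test_size=0.1)

        tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

        def tokenize(batch):
            # Tokenize and pad to a fixed length so no custom data collator is needed.
            return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

        ds = ds.map(tokenize, batched=True)

        args = TrainingArguments(output_dir="disaster-bert", num_train_epochs=2,
                                 per_device_train_batch_size=16)
        Trainer(model=model, args=args,
                train_dataset=ds["train"], eval_dataset=ds["test"]).train()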

  3. English Human-Human Chat Dataset for Conversational AI & NLP

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). English Human-Human Chat Dataset for Conversational AI & NLP [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/english-general-domain-conversation-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The English General Domain Chat Dataset is a high-quality, text-based dataset designed to train and evaluate conversational AI, NLP models, and smart assistants in real-world English usage. Collected through FutureBeeAI’s trusted crowd community, this dataset reflects natural, native-level English conversations covering a broad spectrum of everyday topics.

    Conversational Text Data

    This dataset includes over 15000 chat transcripts, each featuring free-flowing dialogue between two native English speakers. The conversations are spontaneous, context-rich, and mimic informal, real-life texting behavior.

    Words per Chat: 300–700
    Turns per Chat: Up to 50 dialogue turns
    Contributors: 200 native English speakers from the FutureBeeAI Crowd Community
    Format: TXT, DOCS, JSON or CSV (customizable)
    Structure: Each record contains the full chat, topic tag, and metadata block

    Diversity and Domain Coverage

    Conversations span a wide variety of general-domain topics to ensure comprehensive model exposure:

    Music, books, and movies
    Health and wellness
    Children and parenting
    Family life and relationships
    Food and cooking
    Education and studying
    Festivals and traditions
    Environment and daily life
    Internet and tech usage
    Childhood memories and casual chatting

    This diversity ensures the dataset is useful across multiple NLP and language understanding applications.

    Linguistic Authenticity

    Chats reflect informal, native-level English usage with:

    Colloquial expressions and local dialect influence
    Domain-relevant terminology
    Language-specific grammar, phrasing, and sentence flow
    Inclusion of realistic details such as names, phone numbers, email addresses, locations, dates, times, local currencies, and culturally grounded references
    Representation of different writing styles and input quirks to ensure training data realism

    Metadata

    Every chat instance is accompanied by structured metadata, which includes:

    Participant Age
    Gender
    Country/Region
    Chat Domain
    Chat Topic
    Dialect

    This metadata supports model filtering, demographic-specific evaluation, and more controlled fine-tuning workflows.
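
    As an illustration only, a sketch of what a single record with this structure and metadata might look like in a JSON export; the key names here are assumptions, not FutureBeeAI's documented schema:

        import json

        # Hypothetical chat record mirroring the structure and metadata fields listed above.
        record = {
            "chat_id": "en_gd_000123",          # assumed identifier field
            "topic": "Food and cooking",
            "domain": "General",
            "metadata": {
                "participant_age": [24, 31],
                "gender": ["female", "male"],
                "country_region": "United Kingdom",
                "dialect": "British English",
            },
            "turns": [
                {"speaker": "A", "text": "Did you try that new ramen place near the station?"},
                {"speaker": "B", "text": "Not yet! Is it any good?"},
            ],
        }

        # Round-trip through JSON and pull out fields useful for filtering or evaluation splits.
        loaded = json.loads(json.dumps(record))
        print(loaded["topic"], loaded["metadata"]["dialect"], len(loaded["turns"]))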

    Data Quality Assurance

    All chat records pass through a rigorous QA process to maintain consistency and accuracy:

    Manual review for content completeness
    Format checks for chat turns and metadata
    Linguistic verification by native speakers
    Removal of inappropriate or unusable samples

    This ensures a clean, reliable dataset ready for high-performance AI model training.

    Applications

    This dataset is ideal for training and evaluating a wide range of text-based AI systems:

    Conversational AI / Chatbots
    Smart assistants and voicebots

  4. NLP for German News Articles

    • kaggle.com
    zip
    Updated Oct 1, 2022
    Cite
    Aman Chauhan (2022). NLP for German News Articles [Dataset]. https://www.kaggle.com/datasets/whenamancodes/nlp-for-10k-german-news-articles
    Explore at:
    Available download formats: zip (128989980 bytes)
    Dataset updated
    Oct 1, 2022
    Authors
    Aman Chauhan
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    :::: Ten Thousand German News Articles Dataset ::::

    A dataset for topic extraction from 10k German news articles and NLP for the German language. English text classification datasets are common. Examples are the big AG News, the class-rich 20 Newsgroups, and the large-scale DBpedia ontology datasets for topic classification, as well as the commonly used IMDb and Yelp datasets for sentiment analysis. Non-English datasets, especially German ones, are less common. There is a collection of sentiment analysis datasets assembled by the Interest Group on German Sentiment Analysis. To my knowledge, MLDoc contains German documents for classification. Due to grammatical differences between English and German, a classifier might be effective on an English dataset but not as effective on a German dataset. German has higher inflection, and long compound words are quite common compared to English. One would need to evaluate a classifier on multiple German datasets to get a sense of its effectiveness.

    :::: What It Contains ::::

    The 10kGNAD dataset is intended to solve part of this problem as the first German topic classification dataset. It consists of 10,273 German-language news articles from an Austrian online newspaper, categorized into nine topics. These articles are a hitherto unused part of the One Million Posts Corpus. In the One Million Posts Corpus each article has a topic path, for example Newsroom/Wirtschaft/Wirtschaftpolitik/Finanzmaerkte/Griechenlandkrise. The 10kGNAD uses the second part of the topic path, here Wirtschaft, as the class label. The article titles and texts are concatenated into one text, and the authors are removed to avoid a keyword-like classification on authors frequent in a class. I created and used this dataset in my thesis to train and evaluate four text classifiers on the German language. By publishing the dataset I hope to support the advancement of tools and models for the German language. Additionally, this dataset can be used as a benchmark dataset for German topic classification.
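
    A small sketch of the preprocessing described above (taking the second element of the topic path as the class label and concatenating title and text); the function and field names are illustrative, not part of the released files:

        def make_example(topic_path: str, title: str, text: str) -> dict:
            # "Newsroom/Wirtschaft/Wirtschaftpolitik/Finanzmaerkte/Griechenlandkrise"
            # -> class label "Wirtschaft" (second part of the topic path).
            label = topic_path.split("/")[1]
            # Title and body are concatenated into a single text; author names are dropped.
            return {"label": label, "text": f"{title} {text}"}

        example = make_example(
            "Newsroom/Wirtschaft/Wirtschaftpolitik/Finanzmaerkte/Griechenlandkrise",
            "Beispieltitel",
            "Beispieltext des Artikels ...",
        )
        print(example["label"])  # Wirtschaft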

    Citations:

    @InProceedings{Schabus2017,
      author    = {Dietmar Schabus and Marcin Skowron and Martin Trapp},
      title     = {One Million Posts: A Data Set of German Online Discussions},
      booktitle = {Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR)},
      pages     = {1241--1244},
      year      = {2017},
      address   = {Tokyo, Japan},
      doi       = {10.1145/3077136.3080711},
      month     = aug
    }

    @InProceedings{Schabus2018,
      author    = {Dietmar Schabus and Marcin Skowron},
      title     = {Academic-Industrial Perspective on the Development and Deployment of a Moderation System for a Newspaper Website},
      booktitle = {Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC)},
      year      = {2018},
      address   = {Miyazaki, Japan},
      month     = may,
      pages     = {1602-1605},
      abstract  = {This paper describes an approach and our experiences from the development, deployment and usability testing of a Natural Language Processing (NLP) and Information Retrieval system that supports the moderation of user comments on a large newspaper website. We highlight some of the differences between industry-oriented and academic research settings and their influence on the decisions made in the data collection and annotation processes, selection of document representation and machine learning methods. We report on classification results, where the problems to solve and the data to work with come from a commercial enterprise. In this context typical for NLP research, we discuss relevant industrial aspects. We believe that the challenges faced as well as the solutions proposed for addressing them can provide insights to others working in a similar setting.},
      url       = {http://www.lrec-conf.org/proceedings/lrec2018/summaries/8885.html}
    }


  5. B5text dataset - Textual data for 5 class sentiment classification of manufacturing parts

    • figshare.com
    txt
    Updated Jun 30, 2021
    Cite
    Mahmud Hasan (2021). B5text dataset - Textual data for 5 class sentiment classification of manufacturing parts. [Dataset]. http://doi.org/10.6084/m9.figshare.14887932.v4
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 30, 2021
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Mahmud Hasan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Processed and lemmatised manufacturing text data relevant to 5 classes of parts (bearings, collet, sprocket, bolt, spring), web-scraped from different web-based platforms such as McMaster-Carr, TraceParts, etc.

  6. Probing Datasets for Noisy Texts

    • federation.figshare.com
    • researchdata.edu.au
    Updated Mar 14, 2021
    Cite
    Buddhika Kasthuriarachchy; Madhu Chetty; Adrian Shatte (2021). Probing Datasets for Noisy Texts [Dataset]. http://doi.org/10.25955/604c5307db043
    Explore at:
    Dataset updated
    Mar 14, 2021
    Dataset provided by
    Federation University Australia
    Authors
    Buddhika Kasthuriarachchy; Madhu Chetty; Adrian Shatte
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Context: Probing tasks are popular among NLP researchers to assess the richness of the encoded representations of linguistic information. Each probing task is a classification problem, and the model’s performance will vary depending on the richness of the linguistic properties crammed into the representation.

    This dataset contains five new probing datasets consisting of noisy texts (Tweets), which can serve as benchmark datasets for researchers to study the linguistic characteristics of unstructured and noisy texts.

    File Structure

    Format: A tab-separated text file

    Column 1: train/test/validation split (tr-train, te-test, va-validation)
    Column 2: class label (refer to the Contents section for the class labels of each task file)
    Column 3: Tweet message (text)
    Column 4: a unique ID

    Contents

    sent_len.tsv
    In this classification task, the goal is to predict the sentence length in 8 possible bins (0-7) based on their lengths; 0: (5-8), 1: (9-12), 2: (13-16), 3: (17-20), 4: (21-25), 5: (26-29), 6: (30-33), 7: (34-70). This task is called “SentLen” in the paper.

    word_content.tsv
    We consider a 10-way classification task with 10 words as targets, considering the available manually annotated instances. The task is predicting which of the target words appears in the given sentence. We have considered only the words that appear in the BERT vocabulary as target words. We constructed the data by picking the first 10 lower-cased words occurring in the corpus vocabulary ordered by frequency and having a length of at least 4 characters (to remove noise). Each sentence contains a single target word, and the word occurs precisely once in the sentence. The task is referred to as “WC” in the paper.

    bigram_shift.tsv
    The purpose of the Bigram Shift task is to test whether an encoder is sensitive to legal word orders. Two adjacent words in a Tweet are inverted, and the classification model performs a binary classification to identify inverted (I) and non-inverted/original (O) Tweets. The task is referred to as “BShift” in the paper.

    tree_depth.tsv
    The Tree Depth task evaluates the encoded sentence's ability to capture hierarchical structure by requiring the classification model to predict the depth of the longest path from the root to any leaf in the Tweet's parse tree. The task is referred to as “TreeDepth” in the paper.

    odd_man_out.tsv
    The Tweets are modified by replacing a random noun or verb o with another noun or verb r. The task of the classifier is to identify whether the sentence was modified due to this change. Class label O refers to the unmodified sentences while C refers to modified sentences. The task is called “SOMO” in the paper.
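
    Given the tab-separated layout described above, a minimal sketch for loading one of the task files with pandas; the assumption that the files carry no header row follows from the column description and should be checked against the actual files:

        import pandas as pd

        # Columns per the file structure above: split, class label, tweet text, unique ID.
        cols = ["split", "label", "text", "id"]
        df = pd.read_csv("bigram_shift.tsv", sep="\t", names=cols, header=None,
                         quoting=3)  # QUOTE_NONE: avoid breaking on stray quotes inside tweets

        train = df[df["split"] == "tr"]
        test = df[df["split"] == "te"]
        val = df[df["split"] == "va"]
        print(len(train), len(val), len(test))
        print(df["label"].value_counts().to_dict())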

  7. Norwegian Human-Human Chat Dataset for Conversational AI & NLP

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Norwegian Human-Human Chat Dataset for Conversational AI & NLP [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/norwegian-general-domain-conversation-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The Norwegian General Domain Chat Dataset is a high-quality, text-based dataset designed to train and evaluate conversational AI, NLP models, and smart assistants in real-world Norwegian usage. Collected through FutureBeeAI’s trusted crowd community, this dataset reflects natural, native-level Norwegian conversations covering a broad spectrum of everyday topics.

    Conversational Text Data

    This dataset includes over 15000 chat transcripts, each featuring free-flowing dialogue between two native Norwegian speakers. The conversations are spontaneous, context-rich, and mimic informal, real-life texting behavior.

    Words per Chat: 300–700
    Turns per Chat: Up to 50 dialogue turns
    Contributors: 200 native Norwegian speakers from the FutureBeeAI Crowd Community
    Format: TXT, DOCS, JSON or CSV (customizable)
    Structure: Each record contains the full chat, topic tag, and metadata block

    Diversity and Domain Coverage

    Conversations span a wide variety of general-domain topics to ensure comprehensive model exposure:

    Music, books, and movies
    Health and wellness
    Children and parenting
    Family life and relationships
    Food and cooking
    Education and studying
    Festivals and traditions
    Environment and daily life
    Internet and tech usage
    Childhood memories and casual chatting

    This diversity ensures the dataset is useful across multiple NLP and language understanding applications.

    Linguistic Authenticity

    Chats reflect informal, native-level Norwegian usage with:

    Colloquial expressions and local dialect influence
    Domain-relevant terminology
    Language-specific grammar, phrasing, and sentence flow
    Inclusion of realistic details such as names, phone numbers, email addresses, locations, dates, times, local currencies, and culturally grounded references
    Representation of different writing styles and input quirks to ensure training data realism

    Metadata

    Every chat instance is accompanied by structured metadata, which includes:

    Participant Age
    Gender
    Country/Region
    Chat Domain
    Chat Topic
    Dialect

    This metadata supports model filtering, demographic-specific evaluation, and more controlled fine-tuning workflows.

    Data Quality Assurance

    All chat records pass through a rigorous QA process to maintain consistency and accuracy:

    Manual review for content completeness
    Format checks for chat turns and metadata
    Linguistic verification by native speakers
    Removal of inappropriate or unusable samples

    This ensures a clean, reliable dataset ready for high-performance AI model training.

    Applications

    This dataset is ideal for training and evaluating a wide range of text-based AI systems:

    Conversational AI / Chatbots
    Smart assistants and voicebots

  8. Arabic Text Dataset

    • shaip.com
    • tl.shaip.com
    • +2more
    json
    Updated Nov 26, 2024
    Cite
    Shaip (2024). Arabic Text Dataset [Dataset]. https://www.shaip.com/offerings/language-text-datasets/
    Explore at:
    Available download formats: json
    Dataset updated
    Nov 26, 2024
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The Arabic Text Dataset contains a collection of text samples written in Arabic. It includes various forms of content, such as news articles, social media posts, literature, and dialogue, spanning different topics and writing styles. This dataset is used for tasks such as natural language processing (NLP), text classification, sentiment analysis, and machine translation in Arabic language applications.

  9. Punjabi Human-Human Chat Dataset for Conversational AI & NLP

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Punjabi Human-Human Chat Dataset for Conversational AI & NLP [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/punjabi-general-domain-conversation-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The Punjabi General Domain Chat Dataset is a high-quality, text-based dataset designed to train and evaluate conversational AI, NLP models, and smart assistants in real-world Punjabi usage. Collected through FutureBeeAI’s trusted crowd community, this dataset reflects natural, native-level Punjabi conversations covering a broad spectrum of everyday topics.

    Conversational Text Data

    This dataset includes over 10000 chat transcripts, each featuring free-flowing dialogue between two native Punjabi speakers. The conversations are spontaneous, context-rich, and mimic informal, real-life texting behavior.

    Words per Chat: 300–700
    Turns per Chat: Up to 50 dialogue turns
    Contributors: 150 native Punjabi speakers from the FutureBeeAI Crowd Community
    Format: TXT, DOCS, JSON or CSV (customizable)
    Structure: Each record contains the full chat, topic tag, and metadata block

    Diversity and Domain Coverage

    Conversations span a wide variety of general-domain topics to ensure comprehensive model exposure:

    Music, books, and movies
    Health and wellness
    Children and parenting
    Family life and relationships
    Food and cooking
    Education and studying
    Festivals and traditions
    Environment and daily life
    Internet and tech usage
    Childhood memories and casual chatting

    This diversity ensures the dataset is useful across multiple NLP and language understanding applications.

    Linguistic Authenticity

    Chats reflect informal, native-level Punjabi usage with:

    Colloquial expressions and local dialect influence
    Domain-relevant terminology
    Language-specific grammar, phrasing, and sentence flow
    Inclusion of realistic details such as names, phone numbers, email addresses, locations, dates, times, local currencies, and culturally grounded references
    Representation of different writing styles and input quirks to ensure training data realism

    Metadata

    Every chat instance is accompanied by structured metadata, which includes:

    Participant Age
    Gender
    Country/Region
    Chat Domain
    Chat Topic
    Dialect

    This metadata supports model filtering, demographic-specific evaluation, and more controlled fine-tuning workflows.

    Data Quality Assurance

    All chat records pass through a rigorous QA process to maintain consistency and accuracy:

    Manual review for content completeness
    Format checks for chat turns and metadata
    Linguistic verification by native speakers
    Removal of inappropriate or unusable samples

    This ensures a clean, reliable dataset ready for high-performance AI model training.

    Applications

    This dataset is ideal for training and evaluating a wide range of text-based AI systems:

    Conversational AI / Chatbots
    Smart assistants and voicebots

  10. Long document similarity dataset, Wikipedia excerptions for movies collections

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Jan 20, 2023
    Cite
    anonymous; anonymous (2023). Long document similarity dataset, Wikipedia excerptions for movies collections [Dataset]. http://doi.org/10.5281/zenodo.7434832
    Explore at:
    Available download formats: bin
    Dataset updated
    Jan 20, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    anonymous; anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Movies-related articles extracted from Wikipedia.

    For all articles, the figures and tables have been filtered out, as well as the categories and "see also" sections.

    The article structure, and particularly the sub-titles and paragraphs, are kept in these datasets.

    Movies

    The Wikipedia Movies dataset consists of 100,371 articles describing various movies. Each article may consist of text passages describing the plot, cast, production, reception, soundtrack, and more.

  11. AI-Enhanced English Teaching Resource Dataset

    • kaggle.com
    zip
    Updated Mar 7, 2025
    Cite
    Ziya (2025). AI-Enhanced English Teaching Resource Dataset [Dataset]. https://www.kaggle.com/datasets/ziya07/ai-enhanced-english-teaching-resource-dataset
    Explore at:
    Available download formats: zip (2634 bytes)
    Dataset updated
    Mar 7, 2025
    Authors
    Ziya
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The AI-Enhanced English Teaching Resource Dataset is designed for research on Natural Language Processing (NLP) applications in automated English lesson generation. It contains 300 structured entries, combining human-written and AI-generated educational content across various categories such as Grammar, Vocabulary, Reading, Writing, Speaking, and Literature.

    Key Features:
    - Lesson Text: Descriptive summaries of English lessons.
    - Keywords: Important terms extracted for each lesson.
    - Lesson Type: Categorization into different teaching domains.
    - Difficulty Level: Labels for Beginner, Intermediate, and Advanced levels.
    - Target: Binary classification (0 = Human-written, 1 = AI-generated).

    Use Cases:
    - Training and evaluating NLP models for educational content generation.
    - Assessing AI’s effectiveness in producing structured and relevant lesson materials.
    - Developing adaptive e-learning platforms for personalized teaching.

    This dataset serves as a valuable resource for machine learning, NLP, and educational technology research, enabling scalable and automated curriculum design. 🚀

  12. Text + Audio-Visual (Multilingual/OCR/NLP) – Books, Journals, Audio+Text

    • shaip.com
    • fr.shaip.com
    • +68more
    json
    Updated Nov 26, 2024
    Cite
    Shaip (2024). Text + Audio-Visual (Multilingual/OCR/NLP) – Books, Journals, Audio+Text [Dataset]. https://www.shaip.com/offerings/language-text-datasets/
    Explore at:
    Available download formats: json
    Dataset updated
    Nov 26, 2024
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Chinese Books, English Books, Journals, Public Policy, Novels, Children, Cantonese Audio+Text, Lecture Video+PPT, Long-format Video. Half a billion books, question-answer pairs, and articles.

  13. Tamil (Tamizh) Wikipedia Text Dataset for NLP

    • kaggle.com
    zip
    Updated Nov 12, 2024
    Cite
    Younus_Mohamed (2024). Tamil (Tamizh) Wikipedia Text Dataset for NLP [Dataset]. https://www.kaggle.com/datasets/younusmohamed/tamil-tamizh-wikipedia-articles
    Explore at:
    Available download formats: zip (339341289 bytes)
    Dataset updated
    Nov 12, 2024
    Authors
    Younus_Mohamed
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    This dataset is part of a larger mission to transform Tamil into a high-resource language in the field of Natural Language Processing (NLP). As one of the oldest and most culturally rich languages, Tamil has a unique linguistic structure, yet it remains underrepresented in the NLP landscape. This dataset, extracted from Tamil Wikipedia, serves as a foundational resource to support Tamil language processing, text mining, and machine learning applications.

    What’s Included

    - Text Data: This dataset contains over 569,000 articles in raw text form, extracted from Tamil Wikipedia. The collection is ideal for language model training, word frequency analysis, and text mining.

    - Scripts and Processing Tools: Code snippets are provided for processing .bz2 compressed files, generating word counts, and handling data for NLP applications.
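
    As a sketch of the kind of processing mentioned above, the snippet below streams a .bz2 text dump and builds word counts; the file name is an assumption, and plain whitespace tokenization is a simplification for Tamil text:

        import bz2
        from collections import Counter

        counts = Counter()
        # Stream the compressed dump line by line instead of loading the whole file into memory.
        with bz2.open("tawiki_articles.txt.bz2", mode="rt", encoding="utf-8") as f:
            for line in f:
                # Naive whitespace tokenization; real Tamil processing would need a proper tokenizer.
                counts.update(line.split())

        print(counts.most_common(20))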

    Why This Dataset?

    Despite having a documented lexicon of over 100,000 words, only a fraction of these are actively used in everyday communication. The largest available Tamil treebank currently holds only 10,000 words, limiting the scope for training accurate language models. This dataset aims to bridge that gap by providing a robust, open-source corpus for researchers, developers, and linguists who want to work on Tamil language technologies.

    How You Can Use This Dataset

    - Language Modeling: Train or fine-tune models like BERT, GPT, or LSTM-based language models for Tamil.
    - Linguistic Research: Analyze Tamil morphology, syntax, and vocabulary usage.
    - Data Augmentation: Use the raw text to generate augmented data for multilingual NLP applications.
    - Word Embeddings and Semantic Analysis: Create embeddings for Tamil words, useful in multilingual setups or standalone applications.

    Let’s Collaborate!

    I believe that advancing Tamil in NLP cannot be a solo effort. Contributions in the form of additional data, annotations, or even new tools for Tamil language processing are welcome! By working together, we can make Tamil a truly high-resource language in NLP.

    License

    This dataset is based on content from Tamil Wikipedia and is shared under the Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA 3.0). Proper attribution to Wikipedia is required when using this data.

  14. Six small datasets for text segmentation

    • data.mendeley.com
    Updated Sep 11, 2025
    Cite
    Alexander Krassovitskiy (2025). Six small datasets for text segmentation [Dataset]. http://doi.org/10.17632/cj22rpfdbb.1
    Explore at:
    Dataset updated
    Sep 11, 2025
    Authors
    Alexander Krassovitskiy
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This collection brings together six datasets designed for experiments in text segmentation:

    Choi — 922 artificial documents [1]. Each document is composed of sentence blocks drawn from different sources. Since the segments are unrelated, segmentation is relatively easy for many algorithms and typically yields high accuracy.

    Manifesto — 6 long political speeches [2]. Each text includes a human-generated segmentation based on strict guidelines. The dataset is used to evaluate segmentation of semantic topic shifts and thematic changes.

    Wiki-1024 — 1,024 Wikipedia articles [3]. Segmentation is defined by the natural division of documents into sections and subsections.

    Abstracts — artificial documents created by merging real research abstracts into continuous texts. About 20,000 abstracts were collected from Scopus in the field of Information Retrieval. Segments correspond directly to individual abstracts.

    SMan — artificial documents constructed by randomly sampling segments from Manifesto texts. The resulting statements vary in content due to mixing, but generally follow a slogan-like style.

    PhilPapersAI — 336 philosophy articles (focused on AI) selected from philpapers.org. The source PDFs were reprocessed using the OpenAI GPT-4o-mini LLM to restore structure and add subsection divisions. The resulting texts are coherent and well-structured, while preserving the authors’ original style as closely as possible.

    [1] Choi, F.Y.Y. Advances in Domain Independent Linear Text Segmentation. In Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics, 2000.
    [2] Hearst, M.A. TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages. Computational Linguistics 1997, 23, 33–64.
    [3] Koshorek, O.; Cohen, A.; Mor, N.; Rotman, M.; Berant, J. Text Segmentation as a Supervised Learning Task. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers); Walker, M.; Ji, H.; Stent, A., Eds., New Orleans, Louisiana, 2018; pp. 469–473. https://doi.org/10.18653/v1/N18-2075.

  15. Japanese Human-Human Chat Dataset for Conversational AI & NLP

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Japanese Human-Human Chat Dataset for Conversational AI & NLP [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/japanese-general-domain-conversation-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The Japanese General Domain Chat Dataset is a high-quality, text-based dataset designed to train and evaluate conversational AI, NLP models, and smart assistants in real-world Japanese usage. Collected through FutureBeeAI’s trusted crowd community, this dataset reflects natural, native-level Japanese conversations covering a broad spectrum of everyday topics.

    Conversational Text Data

    This dataset includes over 15000 chat transcripts, each featuring free-flowing dialogue between two native Japanese speakers. The conversations are spontaneous, context-rich, and mimic informal, real-life texting behavior.

    Words per Chat: 300–700
    Turns per Chat: Up to 50 dialogue turns
    Contributors: 200 native Japanese speakers from the FutureBeeAI Crowd Community
    Format: TXT, DOCS, JSON or CSV (customizable)
    Structure: Each record contains the full chat, topic tag, and metadata block

    Diversity and Domain Coverage

    Conversations span a wide variety of general-domain topics to ensure comprehensive model exposure:

    Music, books, and movies
    Health and wellness
    Children and parenting
    Family life and relationships
    Food and cooking
    Education and studying
    Festivals and traditions
    Environment and daily life
    Internet and tech usage
    Childhood memories and casual chatting

    This diversity ensures the dataset is useful across multiple NLP and language understanding applications.

    Linguistic Authenticity

    Chats reflect informal, native-level Japanese usage with:

    Colloquial expressions and local dialect influence
    Domain-relevant terminology
    Language-specific grammar, phrasing, and sentence flow
    Inclusion of realistic details such as names, phone numbers, email addresses, locations, dates, times, local currencies, and culturally grounded references
    Representation of different writing styles and input quirks to ensure training data realism

    Metadata

    Every chat instance is accompanied by structured metadata, which includes:

    Participant Age
    Gender
    Country/Region
    Chat Domain
    Chat Topic
    Dialect

    This metadata supports model filtering, demographic-specific evaluation, and more controlled fine-tuning workflows.

    Data Quality Assurance

    All chat records pass through a rigorous QA process to maintain consistency and accuracy:

    Manual review for content completeness
    Format checks for chat turns and metadata
    Linguistic verification by native speakers
    Removal of inappropriate or unusable samples

    This ensures a clean, reliable dataset ready for high-performance AI model training.

    Applications

    This dataset is ideal for training and evaluating a wide range of text-based AI systems:

    Conversational AI / Chatbots
    Smart assistants and voicebots

  16. Long document similarity datasets, Wikipedia excerptions for movies, video games and wine collections

    • zenodo.org
    • live.european-language-grid.eu
    • +1more
    csv
    Updated Jan 27, 2021
    + more versions
    Cite
    anonymous; anonymous (2021). Long document similarity datasets, Wikipedia excerptions for movies, video games and wine collections [Dataset]. http://doi.org/10.5281/zenodo.4468783
    Explore at:
    Available download formats: csv
    Dataset updated
    Jan 27, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    anonymous; anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Three corpora in different domains extracted from Wikipedia.

    For all datasets, the figures and tables have been filtered out, as well as the categories and "see also" sections.

    The article structure, and particularly the sub-titles and paragraphs, are kept in these datasets.

    Wines

    The Wikipedia wines dataset consists of 1635 articles from the wine domain. The extracted dataset consists of a non-trivial mixture of articles, including different wine categories, brands, wineries, grape types, and more. The ground-truth recommendations were crafted by a human sommelier, who annotated 92 source articles with ~10 ground-truth recommendations each. Examples of ground-truth expert-based recommendations are

    • Dom Pérignon - Moët & Chandon
    • Pinot Meunier - Chardonnay

    Movies

    The Wikipedia movies dataset consists of 100385 articles describing different movies. Each movie's article may consist of text passages describing the plot, cast, production, reception, soundtrack, and more.
    For this dataset, we have extracted a test set of ground-truth annotations for 50 source articles using the "BestSimilar" database. Each source article is associated with a list of ~12 most similar movies.
    Examples of ground-truth expert-based recommendations are

    • Schindler's List - The Pianist
    • Lion King - The Jungle Book

    Video games

    The Wikipedia video games dataset consists of 21,935 articles reviewing video games from all genres and consoles. Each article may consist of a different combination of sections, including summary, gameplay, plot, production, etc. Examples for ground-truth expert-based recommendations are:

    • Grand Theft Auto - Mafia
    • Burnout Paradise - Forza Horizon 3
  17. text-quality

    • huggingface.co
    Updated Sep 30, 2024
    Cite
    Alan Tseng (2024). text-quality [Dataset]. https://huggingface.co/datasets/agentlans/text-quality
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 30, 2024
    Authors
    Alan Tseng
    License

    https://choosealicense.com/licenses/cc/

    Description

    Text Quality Assessment Dataset

    Overview

    This dataset is designed to assess text quality robustly across various domains for NLP and AI applications. It provides a composite quality score based on multiple classifiers, offering a more comprehensive evaluation of text quality beyond educational domains.

    Dataset Details

    Size: 100,000 sentences
    Source: 20,000 sentences from each of 5 different datasets, including allenai/c4 and HuggingFaceFW/fineweb-edu… See the full description on the dataset page: https://huggingface.co/datasets/agentlans/text-quality.
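
    A minimal sketch for loading the dataset with the Hugging Face datasets library; the split name and the exact text/score column names are assumptions, so inspect ds.column_names before filtering or training:

        from datasets import load_dataset

        ds = load_dataset("agentlans/text-quality", split="train")  # split name is an assumption
        print(ds.column_names)  # check the actual text and quality-score fields
        print(ds[0])            # inspect one record before using it downstream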

  18. Japanese & Korean Language Dataset

    • shaip.com
    json
    Updated Nov 26, 2024
    + more versions
    Cite
    Shaip (2024). Japanese & Korean Language Dataset [Dataset]. https://www.shaip.com/offerings/language-text-datasets/
    Explore at:
    Available download formats: json
    Dataset updated
    Nov 26, 2024
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The Japanese & Korean Language Dataset includes text samples in both Japanese and Korean. It features a range of content such as sentences, phrases, and words, encompassing various contexts and styles. This dataset is used for tasks like natural language processing (NLP), machine translation, and text analysis in multilingual applications.

  19. Sample Posts from the ADHD dataset.

    • figshare.com
    • plos.figshare.com
    xls
    Updated Feb 6, 2025
    + more versions
    Cite
    Ahmed Akib Jawad Karim; Kazi Hafiz Md. Asad; Md. Golam Rabiul Alam (2025). Sample Posts from the ADHD dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0315829.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Feb 6, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Ahmed Akib Jawad Karim; Kazi Hafiz Md. Asad; Md. Golam Rabiul Alam
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This work focuses on the efficiency of the knowledge distillation approach in generating a lightweight yet powerful BERT-based model for natural language processing (NLP) applications. After the model creation, we applied the resulting model, LastBERT, to a real-world task: classifying severity levels of Attention Deficit Hyperactivity Disorder (ADHD)-related concerns from social media text data. With LastBERT, a customized student BERT model, we significantly lowered the parameter count from 110 million (BERT base) to 29 million, resulting in a model approximately 73.64% smaller. On the General Language Understanding Evaluation (GLUE) benchmark, comprising paraphrase identification, sentiment analysis, and text classification, the student model maintained strong performance across many tasks despite this reduction. The model was also used on a real-world ADHD dataset with an accuracy of 85%, an F1 score of 85%, a precision of 85%, and a recall of 85%. When compared to DistilBERT (66 million parameters) and ClinicalBERT (110 million parameters), LastBERT demonstrated comparable performance, with DistilBERT slightly outperforming it at 87% and ClinicalBERT achieving 86% across the same metrics. These findings highlight the LastBERT model's capacity to classify degrees of ADHD severity properly, offering a useful tool for mental health professionals to assess and comprehend material produced by users on social networking platforms. The study emphasizes the potential of knowledge distillation to produce effective models fit for use in resource-limited conditions, hence advancing NLP and mental health diagnosis. The considerable decrease in model size without appreciable performance loss also underlines the lower computational resources needed for training and deployment, facilitating greater applicability, especially with readily available computational tools like Google Colab and Kaggle Notebooks. This study shows the accessibility and usefulness of advanced NLP methods in real-world applications.
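
    The paper's exact training setup is not reproduced here, but a generic sketch of the knowledge-distillation objective such a student model typically relies on (soft teacher targets plus the usual hard-label loss) might look like this in PyTorch; the temperature and alpha values are illustrative:

        import torch
        import torch.nn.functional as F

        def distillation_loss(student_logits, teacher_logits, labels,
                              temperature: float = 2.0, alpha: float = 0.5):
            """Blend soft-target KL loss (teacher -> student) with hard-label cross-entropy."""
            soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
            log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
            # The KL term is scaled by T^2, the standard correction from Hinton et al. (2015).
            kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
            ce = F.cross_entropy(student_logits, labels)
            return alpha * kd + (1 - alpha) * ce

        # Dummy logits for a batch of 4 examples and 2 classes, just to show the call.
        student = torch.randn(4, 2)
        teacher = torch.randn(4, 2)
        labels = torch.tensor([0, 1, 1, 0])
        print(distillation_loss(student, teacher, labels))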

  20. Trojan Detection Software Challenge - nlp-sentiment-classification-apr2021-test

    • catalog.data.gov
    • nist.gov
    • +2more
    Updated Sep 30, 2023
    + more versions
    Cite
    National Institute of Standards and Technology (2023). Trojan Detection Software Challenge - nlp-sentiment-classification-apr2021-test [Dataset]. https://catalog.data.gov/dataset/trojan-detection-software-challenge-round-6-test-dataset
    Explore at:
    Dataset updated
    Sep 30, 2023
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    Round 6 Test Dataset

    This is the test data used to construct and evaluate trojan detection software solutions. This data, generated at NIST, consists of natural language processing (NLP) AIs trained to perform text sentiment classification on English text. A known percentage of these trained AI models have been poisoned with a known trigger which induces incorrect behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via embedded triggers. This dataset consists of 480 sentiment classification AI models using a small set of model architectures. The models were trained on text data drawn from product reviews. Half (50%) of the models have been poisoned with an embedded trigger which causes misclassification of the input when the trigger is present.
