100+ datasets found

News classification dataset for NLP
kaggle.com
zip
Updated Feb 27, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alessandro Lo Bello (2024). News classification dataset for NLP [Dataset]. https://www.kaggle.com/datasets/alessandrolobello/guardian
Explore at:
zip(117148704 bytes)Available download formats
Dataset updated
Feb 27, 2024
Authors
Alessandro Lo Bello
Description
This dataset is compiled through scraping articles from Guardian news starting from 2021 up to the present. It's meticulously crafted to facilitate multimodal classification tasks, especially in the realm of Natural Language Processing (NLP).

The dataset comprises 13 labels representing various topics covered in the news, ranging from sports to economics, politics, art, and entertainment. Each label corresponds to a distinct category, offering a comprehensive overview of the diverse subjects discussed in journalistic discourse. Researchers and practitioners can utilize these labels for multi-label classification tasks, aiming to categorize news articles based on their thematic content.

The topics encapsulated by these labels reflect the multifaceted nature of contemporary news media, capturing the breadth and depth of global events and developments. Whether it's the latest in sports, updates on economic trends, insights into political affairs, or coverage of the arts and entertainment industry, this dataset encompasses a wide spectrum of subjects.

The file forma is parquet. If you are new to this, for open it just do pd.read_parquet(filepath) and it's done!
NLP Dataset
kaggle.com
zip
Updated Jan 21, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Scientific Stephen (2024). NLP Dataset [Dataset]. https://www.kaggle.com/datasets/scientificstephen/language
Explore at:
zip(607351 bytes)Available download formats
Dataset updated
Jan 21, 2024
Authors
Scientific Stephen
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
This project is about text classification, leveraging transformer-based NLP models (BERT) for classifying whether given text relates to a disaster or not. It involves data cleaning, tokenization, and the use of a pre-trained transformer model for fine-tuning on the specific task.
F
English Human-Human Chat Dataset for Conversational AI & NLP
futurebeeai.com
wav
Updated Aug 1, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). English Human-Human Chat Dataset for Conversational AI & NLP [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/english-general-domain-conversation-text-dataset
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The English General Domain Chat Dataset is a high-quality, text-based dataset designed to train and evaluate conversational AI, NLP models, and smart assistants in real-world English usage. Collected through FutureBeeAI’s trusted crowd community, this dataset reflects natural, native-level English conversations covering a broad spectrum of everyday topics.
Conversational Text Data
This dataset includes over 15000 chat transcripts, each featuring free-flowing dialogue between two native English speakers. The conversations are spontaneous, context-rich, and mimic informal, real-life texting behavior.
•
Words per Chat: 300–700

•
Turns per Chat: Up to 50 dialogue turns

•
Contributors: 200 native English speakers from the FutureBeeAI Crowd Community

•
Format: TXT, DOCS, JSON or CSV (customizable)

•
Structure: Each record contains the full chat, topic tag, and metadata block

Diversity and Domain Coverage
Conversations span a wide variety of general-domain topics to ensure comprehensive model exposure:
•Music, books, and movies
•Health and wellness
•Children and parenting
•Family life and relationships
•Food and cooking
•Education and studying
•Festivals and traditions
•Environment and daily life
•Internet and tech usage
•Childhood memories and casual chatting
This diversity ensures the dataset is useful across multiple NLP and language understanding applications.
Linguistic Authenticity
Chats reflect informal, native-level English usage with:
•Colloquial expressions and local dialect influence
•Domain-relevant terminology
•Language-specific grammar, phrasing, and sentence flow
•Inclusion of realistic details such as names, phone numbers, email addresses, locations, dates, times, local currencies, and culturally grounded references
•Representation of different writing styles and input quirks to ensure training data realism
Metadata
Every chat instance is accompanied by structured metadata, which includes:
•Participant Age
•Gender
•Country/Region
•Chat Domain
•Chat Topic
•Dialect
This metadata supports model filtering, demographic-specific evaluation, and more controlled fine-tuning workflows.
Data Quality Assurance
All chat records pass through a rigorous QA process to maintain consistency and accuracy:
•Manual review for content completeness
•Format checks for chat turns and metadata
•Linguistic verification by native speakers
•Removal of inappropriate or unusable samples
This ensures a clean, reliable dataset ready for high-performance AI model training.
Applications
This dataset is ideal for training and evaluating a wide range of text-based AI systems:
•Conversational AI / Chatbots
•Smart assistants and voicebots
<div
NLP for German News Articles
kaggle.com
zip
Updated Oct 1, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Aman Chauhan (2022). NLP for German News Articles [Dataset]. https://www.kaggle.com/datasets/whenamancodes/nlp-for-10k-german-news-articles
Explore at:
zip(128989980 bytes)Available download formats
Dataset updated
Oct 1, 2022
Authors
Aman Chauhan
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
:::: Ten Thousand German News Articles Dataset ::::

A dataset for topic extraction from 10k German News Articles and NLP for German language. English text classification datasets are common. Examples are the big AG News, the class-rich 20 Newsgroups and the large-scale DBpedia ontology datasets for topic classification and for example the commonly used IMDb and Yelp datasets for sentiment analysis. Non-english datasets, especially German datasets, are less common. There is a collection of sentiment analysis datasets assembled by the Interest Group on German Sentiment Analysis. To my knowledge the MLDoc contains German documents for classification. Due to grammatical differences between the English and the German language, a classifier might be effective on a English dataset, but not as effective on a German dataset. The German language has a higher inflection and long compound words are quite common compared to the English language. One would need to evaluate a classifier on multiple German datasets to get a sense of it’s effectiveness.

:::: What It Cointains ::::

The 10kGNAD dataset is intended to solve part of this problem as the first german topic classification dataset. It consists of 10273 german language news articles from an austrian online newspaper categorized into nine topics. These articles are a till now unused part of the One Million Posts Corpus. In the One Million Posts Corpus each article has a topic path. For example Newsroom/Wirtschaft/Wirtschaftpolitik/Finanzmaerkte/Griechenlandkrise. The 10kGNAD uses the second part of the topic path, here Wirtschaft, as class label. The article titles and texts are concatenated into one text and the authors are removed to avoid a keyword like classification on autors frequent in a class. I created and used this dataset in my thesis to train and evaluate four text classifiers on the German language. By publishing the dataset I hope to support the advancement of tools and models for the German language. Additionally this dataset can be used as a benchmark dataset for german topic classification.

Citations:

@InProceedings{Schabus2017, Author = {Dietmar Schabus and Marcin Skowron and Martin Trapp}, Title = {One Million Posts: A Data Set of German Online Discussions}, Booktitle = {Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR)}, Pages = {1241--1244}, Year = {2017}, Address = {Tokyo, Japan}, Doi = {10.1145/3077136.3080711}, Month = aug } @InProceedings{Schabus2018, author = {Dietmar Schabus and Marcin Skowron}, title = {Academic-Industrial Perspective on the Development and Deployment of a Moderation System for a Newspaper Website}, booktitle = {Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC)}, year = {2018}, address = {Miyazaki, Japan}, month = may, pages = {1602-1605}, abstract = {This paper describes an approach and our experiences from the development, deployment and usability testing of a Natural Language Processing (NLP) and Information Retrieval system that supports the moderation of user comments on a large newspaper website. We highlight some of the differences between industry-oriented and academic research settings and their influence on the decisions made in the data collection and annotation processes, selection of document representation and machine learning methods. We report on classification results, where the problems to solve and the data to work with come from a commercial enterprise. In this context typical for NLP research, we discuss relevant industrial aspects. We believe that the challenges faced as well as the solutions proposed for addressing them can provide insights to others working in a similar setting.}, url = {http://www.lrec-conf.org/proceedings/lrec2018/summaries/8885.html}, }

More - Find More Exciting🙀 Datasets Here - An Upvote👍 A Dayᕙ(`▿´)ᕗ , Keeps Aman Hurray Hurray..... ٩(˘◡˘)۶Hehe
B5text dataset - Textual data for 5 class sentiment classification of...
figshare.com
txt
Updated Jun 30, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mahmud Hasan (2021). B5text dataset - Textual data for 5 class sentiment classification of manufacturing parts. [Dataset]. http://doi.org/10.6084/m9.figshare.14887932.v4
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.14887932.v4
Dataset updated
Jun 30, 2021
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Mahmud Hasan
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
processed and lemmatised manufacturing text data relevant to 5 classes of parts: bearings, collet, sprocket, bolt, spring webscraped from different web based platforms like mcmaster carr, traceparts etc.
f
Probing Datasets for Noisy Texts
federation.figshare.com
researchdata.edu.au
Updated Mar 14, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Buddhika Kasthuriarachchy; Madhu Chetty; Adrian Shatte (2021). Probing Datasets for Noisy Texts [Dataset]. http://doi.org/10.25955/604c5307db043
Explore at:
Unique identifier
https://doi.org/10.25955/604c5307db043
Dataset updated
Mar 14, 2021
Dataset provided by
Federation University Australia
Authors
Buddhika Kasthuriarachchy; Madhu Chetty; Adrian Shatte
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
ContextProbing tasks are popular among NLP researchers to assess the richness of the encoded representations of linguistic information. Each probing task is a classification problem, and the model’s performance shall vary depending on the richness of the linguistic properties crammed into the representation.

This dataset contains five new probing datasets consist of noisy texts (Tweets) which can serve as a benchmark dataset for researchers to study the linguistic characteristics of unstructured and noisy texts.File StructureFormat: A tab-separated text file

Column 1: train/test/validation split (tr-train, te-test, va-validation)

Column 2: class label (refer to the content

section for the class labels of each task file)

Column 3: Tweet message (text) Column

4: a unique ID Contentsent_len.tsvIn this classification task, the goal is to predict the sentence length in 8 possible bins (0-7) based on their lengths; 0: (5-8), 1: (9-12), 2: (13-16), 3: (17-20), 4: (21-25), 5: (26-29), 6: (30-33), 7: (34-70). This task is called “SentLen” in the paper.word_content.tsvWe consider a 10-way classifications task with 10 words as targets considering the available manually annotated instances. The task is predicting which of the target words appears on the given sentence. We have considered only the words that appear in the BERT vocabulary as target words. We constructed the data by picking the first 10 lower-cased words occurring in the corpus vocabulary ordered by frequency and having a length of at least 4 characters (to remove noise). Each sentence contains a single target word, and the word occurs precisely once in the sentence. The task is referred to as “WC” in the paper. bigram_shift.tsvThe purpose of the Bigram Shift task is to test whether an encoder is sensitive to legal word orders. Two adjacent words in a Tweet are inverted, and the classification model performs a binary classification to identify inverted (I) and non-inverted/original (O) Tweets. The task is referred to as “BShift” in the paper. tree_depth.tsvThe Tree Depth task evaluates the encoded sentence's ability to understand the hierarchical structure by allowing the classification model to predict the depth of the longest path from the root to any leaf in the Tweet's parser tree. The task is referred to as “TreeDepth” in the paper. odd_man_out.tsv

The Tweets are modified by replacing a random noun or a verb o with another noun or verb r. The task of the classifier is to identify whether the sentence gets modified due to this change. Class label O refers to the unmodified sentences while C refers to modified sentences. The task is called “SOMO” in the paper.
F
Norwegian Human-Human Chat Dataset for Conversational AI & NLP
futurebeeai.com
wav
Updated Aug 1, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). Norwegian Human-Human Chat Dataset for Conversational AI & NLP [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/norwegian-general-domain-conversation-text-dataset
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The Norwegian General Domain Chat Dataset is a high-quality, text-based dataset designed to train and evaluate conversational AI, NLP models, and smart assistants in real-world Norwegian usage. Collected through FutureBeeAI’s trusted crowd community, this dataset reflects natural, native-level Norwegian conversations covering a broad spectrum of everyday topics.
Conversational Text Data
This dataset includes over 15000 chat transcripts, each featuring free-flowing dialogue between two native Norwegian speakers. The conversations are spontaneous, context-rich, and mimic informal, real-life texting behavior.
•
Words per Chat: 300–700

•
Turns per Chat: Up to 50 dialogue turns

•
Contributors: 200 native Norwegian speakers from the FutureBeeAI Crowd Community

•
Format: TXT, DOCS, JSON or CSV (customizable)

•
Structure: Each record contains the full chat, topic tag, and metadata block

Diversity and Domain Coverage
Conversations span a wide variety of general-domain topics to ensure comprehensive model exposure:
•Music, books, and movies
•Health and wellness
•Children and parenting
•Family life and relationships
•Food and cooking
•Education and studying
•Festivals and traditions
•Environment and daily life
•Internet and tech usage
•Childhood memories and casual chatting
This diversity ensures the dataset is useful across multiple NLP and language understanding applications.
Linguistic Authenticity
Chats reflect informal, native-level Norwegian usage with:
•Colloquial expressions and local dialect influence
•Domain-relevant terminology
•Language-specific grammar, phrasing, and sentence flow
•Inclusion of realistic details such as names, phone numbers, email addresses, locations, dates, times, local currencies, and culturally grounded references
•Representation of different writing styles and input quirks to ensure training data realism
Metadata
Every chat instance is accompanied by structured metadata, which includes:
•Participant Age
•Gender
•Country/Region
•Chat Domain
•Chat Topic
•Dialect
This metadata supports model filtering, demographic-specific evaluation, and more controlled fine-tuning workflows.
Data Quality Assurance
All chat records pass through a rigorous QA process to maintain consistency and accuracy:
•Manual review for content completeness
•Format checks for chat turns and metadata
•Linguistic verification by native speakers
•Removal of inappropriate or unusable samples
This ensures a clean, reliable dataset ready for high-performance AI model training.
Applications
This dataset is ideal for training and evaluating a wide range of text-based AI systems:
•Conversational AI / Chatbots
•Smart assistants and
s
Arabic Text Dataset
shaip.com
tl.shaip.com
+2more
json
Updated Nov 26, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shaip (2024). Arabic Text Dataset [Dataset]. https://www.shaip.com/offerings/language-text-datasets/
Explore at:
jsonAvailable download formats
Dataset updated
Nov 26, 2024
Dataset authored and provided by
Shaip
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
The Arabic Text Dataset contains a collection of text samples written in Arabic. It includes various forms of content, such as news articles, social media posts, literature, and dialogue, spanning different topics and writing styles. This dataset is used for tasks such as natural language processing (NLP), text classification, sentiment analysis, and machine translation in Arabic language applications.
F
Punjabi Human-Human Chat Dataset for Conversational AI & NLP
futurebeeai.com
wav
Updated Aug 1, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). Punjabi Human-Human Chat Dataset for Conversational AI & NLP [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/punjabi-general-domain-conversation-text-dataset
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The Punjabi General Domain Chat Dataset is a high-quality, text-based dataset designed to train and evaluate conversational AI, NLP models, and smart assistants in real-world Punjabi usage. Collected through FutureBeeAI’s trusted crowd community, this dataset reflects natural, native-level Punjabi conversations covering a broad spectrum of everyday topics.
Conversational Text Data
This dataset includes over 10000 chat transcripts, each featuring free-flowing dialogue between two native Punjabi speakers. The conversations are spontaneous, context-rich, and mimic informal, real-life texting behavior.
•
Words per Chat: 300–700

•
Turns per Chat: Up to 50 dialogue turns

•
Contributors: 150 native Punjabi speakers from the FutureBeeAI Crowd Community

•
Format: TXT, DOCS, JSON or CSV (customizable)

•
Structure: Each record contains the full chat, topic tag, and metadata block

Diversity and Domain Coverage
Conversations span a wide variety of general-domain topics to ensure comprehensive model exposure:
•Music, books, and movies
•Health and wellness
•Children and parenting
•Family life and relationships
•Food and cooking
•Education and studying
•Festivals and traditions
•Environment and daily life
•Internet and tech usage
•Childhood memories and casual chatting
This diversity ensures the dataset is useful across multiple NLP and language understanding applications.
Linguistic Authenticity
Chats reflect informal, native-level Punjabi usage with:
•Colloquial expressions and local dialect influence
•Domain-relevant terminology
•Language-specific grammar, phrasing, and sentence flow
•Inclusion of realistic details such as names, phone numbers, email addresses, locations, dates, times, local currencies, and culturally grounded references
•Representation of different writing styles and input quirks to ensure training data realism
Metadata
Every chat instance is accompanied by structured metadata, which includes:
•Participant Age
•Gender
•Country/Region
•Chat Domain
•Chat Topic
•Dialect
This metadata supports model filtering, demographic-specific evaluation, and more controlled fine-tuning workflows.
Data Quality Assurance
All chat records pass through a rigorous QA process to maintain consistency and accuracy:
•Manual review for content completeness
•Format checks for chat turns and metadata
•Linguistic verification by native speakers
•Removal of inappropriate or unusable samples
This ensures a clean, reliable dataset ready for high-performance AI model training.
Applications
This dataset is ideal for training and evaluating a wide range of text-based AI systems:
•Conversational AI / Chatbots
•Smart assistants and voicebots
<div
Long document similarity dataset, Wikipedia excerptions for movies...
zenodo.org
data.niaid.nih.gov
bin
Updated Jan 20, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
anonymous; anonymous (2023). Long document similarity dataset, Wikipedia excerptions for movies collections [Dataset]. http://doi.org/10.5281/zenodo.7434832
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.7434832
Dataset updated
Jan 20, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
anonymous; anonymous
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Movies-related articles extracted from Wikipedia.

For all articles, the figures and tables have been filtered out, as well as the categories and "see also" sections.

The article structure, and particularly the sub-titles and paragraphs are kept in these datasets

Movies

The Wikipedia Movies dataset consists of 100,371 articles describing various movies. Each article may consist of text passages describing the plot, cast, production, reception, soundtrack, and more.
AI-Enhanced English Teaching Resource Dataset
kaggle.com
zip
Updated Mar 7, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ziya (2025). AI-Enhanced English Teaching Resource Dataset [Dataset]. https://www.kaggle.com/datasets/ziya07/ai-enhanced-english-teaching-resource-dataset
Explore at:
zip(2634 bytes)Available download formats
Dataset updated
Mar 7, 2025
Authors
Ziya
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
The AI-Enhanced English Teaching Resource Dataset is designed for research on Natural Language Processing (NLP) applications in automated English lesson generation. It contains 300 structured entries, combining human-written and AI-generated educational content across various categories such as Grammar, Vocabulary, Reading, Writing, Speaking, and Literature.

Key Features: Lesson Text: Descriptive summaries of English lessons. Keywords: Important terms extracted for each lesson. Lesson Type: Categorization into different teaching domains. Difficulty Level: Labels for Beginner, Intermediate, and Advanced levels. Target: Binary classification (0 = Human-written, 1 = AI-generated). Use Cases: Training and evaluating NLP models for educational content generation. Assessing AI’s effectiveness in producing structured and relevant lesson materials. Developing adaptive e-learning platforms for personalized teaching. This dataset serves as a valuable resource for machine learning, NLP, and educational technology research, enabling scalable and automated curriculum design. 🚀
s
Text + Audio-Visual (Multilingual/OCR/NLP) – Books, Journals, Audio+Text
shaip.com
fr.shaip.com
+68more
json
Updated Nov 26, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shaip (2024). Text + Audio-Visual (Multilingual/OCR/NLP) – Books, Journals, Audio+Text [Dataset]. https://www.shaip.com/offerings/language-text-datasets/
Explore at:
jsonAvailable download formats
Dataset updated
Nov 26, 2024
Dataset authored and provided by
Shaip
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Chinese Books, English Books, Journals, Public Policy, Novels, Children, Cantonese Audio+Text, Lecture Video+PPT, Long-format Video Half billion books, question answers pairs, articles
Tamil (Tamizh) Wikipedia Text Dataset for NLP
kaggle.com
zip
Updated Nov 12, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Younus_Mohamed (2024). Tamil (Tamizh) Wikipedia Text Dataset for NLP [Dataset]. https://www.kaggle.com/datasets/younusmohamed/tamil-tamizh-wikipedia-articles
Explore at:
zip(339341289 bytes)Available download formats
Dataset updated
Nov 12, 2024
Authors
Younus_Mohamed
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0)https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
This dataset is part of a larger mission to transform Tamil into a high-resource language in the field of Natural Language Processing (NLP). As one of the oldest and most culturally rich languages, Tamil has a unique linguistic structure, yet it remains underrepresented in the NLP landscape. This dataset, extracted from Tamil Wikipedia, serves as a foundational resource to support Tamil language processing, text mining, and machine learning applications.

What’s Included

- Text Data: This dataset contains over 569,000 articles in raw text form, extracted from Tamil Wikipedia. The collection is ideal for language model training, word frequency analysis, and text mining.

- Scripts and Processing Tools: Code snippets are provided for processing .bz2 compressed files, generating word counts, and handling data for NLP applications.

Why This Dataset?

Despite having a documented lexicon of over 100,000 words, only a fraction of these are actively used in everyday communication. The largest available Tamil treebank currently holds only 10,000 words, limiting the scope for training accurate language models. This dataset aims to bridge that gap by providing a robust, open-source corpus for researchers, developers, and linguists who want to work on Tamil language technologies.

** How You Can Use This Dataset**

- Language Modeling: Train or fine-tune models like BERT, GPT, or LSTM-based language models for Tamil. - Linguistic Research: Analyze Tamil morphology, syntax, and vocabulary usage. - Data Augmentation: Use the raw text to generate augmented data for multilingual NLP applications. - Word Embeddings and Semantic Analysis: Create embeddings for Tamil words, useful in multilingual setups or standalone applications.

Let’s Collaborate!

I believe that advancing Tamil in NLP cannot be a solo effort. Contributions in the form of additional data, annotations, or even new tools for Tamil language processing are welcome! By working together, we can make Tamil a truly high-resource language in NLP.

License

This dataset is based on content from Tamil Wikipedia and is shared under the Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA 3.0). Proper attribution to Wikipedia is required when using this data.
m
Six small datasets for text segmentation
data.mendeley.com
Updated Sep 11, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alexander Krassovitskiy (2025). Six small datasets for text segmentation [Dataset]. http://doi.org/10.17632/cj22rpfdbb.1
Explore at:
Unique identifier
https://doi.org/10.17632/cj22rpfdbb.1
Dataset updated
Sep 11, 2025
Authors
Alexander Krassovitskiy
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This collection brings together six datasets designed for experiments in text segmentation:

Choi — 922 artificial documents [1]. Each document is composed of sentence blocks drawn from different sources. Since the segments are unrelated, segmentation is relatively easy for many algorithms and typically yields high accuracy.

Manifesto — 6 long political speeches [2]. Each text includes a human-generated segmentation based on strict guidelines. The dataset is used to evaluate segmentation of semantic topic shifts and thematic changes.

Wiki-1024 — 1,024 Wikipedia articles [3]. Segmentation is defined by the natural division of documents into sections and subsections.

Abstracts — artificial documents created by merging real research abstracts into continuous texts. About 20,000 abstracts were collected from Scopus in the field of Information Retrieval. Segments correspond directly to individual abstracts.

SMan — artificial documents constructed by randomly sampling segments from Manifesto texts. The resulting statements vary in content due to mixing, but generally follow a slogan-like style.

PhilPapersAI — 336 philosophy articles (focused on AI) selected from philpapers.org. The source PDFs were reprocessed using the OpenAI GPT-4o-mini LLM to restore structure and add subsection divisions. The resulting texts are coherent and well-structured, while preserving the authors’ original style as closely as possible.

[1] Choi, F.Y.Y. Advances in domain independent linear text segmentation. In Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics, 2000. [2] Hearst, M.A. Text Tiling: Segmenting Text into Multi-paragraph Subtopic Passages. Computational Linguistics 1997, 23, 33–64. [3] Koshorek, O.; Cohen, A.; Mor, N.; Rotman, M.; Berant, J. Text Segmentation as a Supervised Learning Task. In Proceedings of the Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers); Walker, M.; Ji, H.; Stent, A., Eds., New Orleans, Louisiana, 2018; pp. 469–473. https://doi.org/10.18653/v1/N18-2075.
F
Japanese Human-Human Chat Dataset for Conversational AI & NLP
futurebeeai.com
wav
Updated Aug 1, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). Japanese Human-Human Chat Dataset for Conversational AI & NLP [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/japanese-general-domain-conversation-text-dataset
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The Japanese General Domain Chat Dataset is a high-quality, text-based dataset designed to train and evaluate conversational AI, NLP models, and smart assistants in real-world Japanese usage. Collected through FutureBeeAI’s trusted crowd community, this dataset reflects natural, native-level Japanese conversations covering a broad spectrum of everyday topics.
Conversational Text Data
This dataset includes over 15000 chat transcripts, each featuring free-flowing dialogue between two native Japanese speakers. The conversations are spontaneous, context-rich, and mimic informal, real-life texting behavior.
•
Words per Chat: 300–700

•
Turns per Chat: Up to 50 dialogue turns

•
Contributors: 200 native Japanese speakers from the FutureBeeAI Crowd Community

•
Format: TXT, DOCS, JSON or CSV (customizable)

•
Structure: Each record contains the full chat, topic tag, and metadata block

Diversity and Domain Coverage
Conversations span a wide variety of general-domain topics to ensure comprehensive model exposure:
•Music, books, and movies
•Health and wellness
•Children and parenting
•Family life and relationships
•Food and cooking
•Education and studying
•Festivals and traditions
•Environment and daily life
•Internet and tech usage
•Childhood memories and casual chatting
This diversity ensures the dataset is useful across multiple NLP and language understanding applications.
Linguistic Authenticity
Chats reflect informal, native-level Japanese usage with:
•Colloquial expressions and local dialect influence
•Domain-relevant terminology
•Language-specific grammar, phrasing, and sentence flow
•Inclusion of realistic details such as names, phone numbers, email addresses, locations, dates, times, local currencies, and culturally grounded references
•Representation of different writing styles and input quirks to ensure training data realism
Metadata
Every chat instance is accompanied by structured metadata, which includes:
•Participant Age
•Gender
•Country/Region
•Chat Domain
•Chat Topic
•Dialect
This metadata supports model filtering, demographic-specific evaluation, and more controlled fine-tuning workflows.
Data Quality Assurance
All chat records pass through a rigorous QA process to maintain consistency and accuracy:
•Manual review for content completeness
•Format checks for chat turns and metadata
•Linguistic verification by native speakers
•Removal of inappropriate or unusable samples
This ensures a clean, reliable dataset ready for high-performance AI model training.
Applications
This dataset is ideal for training and evaluating a wide range of text-based AI systems:
•Conversational AI / Chatbots
•Smart assistants and voicebots
<div
Long document similarity datasets, Wikipedia excerptions for movies, video...
zenodo.org
live.european-language-grid.eu
+1more
csv
Updated Jan 27, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
anonymous; anonymous (2021). Long document similarity datasets, Wikipedia excerptions for movies, video games and wine collections [Dataset]. http://doi.org/10.5281/zenodo.4468783
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.4468783
Dataset updated
Jan 27, 2021
Dataset provided by
Zenodohttp://zenodo.org/
Authors
anonymous; anonymous
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Three corpora in different domains extracted from Wikipedia.

For all datasets, the figures and tables have been filtered out, as well as the categories and "see also" sections.

The article structure, and particularly the sub-titles and paragraphs are kept in these datasets

Wines

Wikipedia wines dataset consists of 1635 articles from the wine domain. The extracted dataset consists of a non-trivial mixture of articles, including different wine categories, brands, wineries, grape types, and more. The ground-truth recommendations were crafted by a human sommelier, which annotated 92 source articles with ~10 ground-truth recommendations for each sample. Examples for ground-truth expert-based recommendations are

Dom Pérignon - Moët & Chandon

Pinot Meunier - Chardonnay

Movies

The Wikipedia movies dataset consists of 100385 articles describing different movies. The movies' articles may consist of text passages describing the plot, cast, production, reception, soundtrack, and more.
For this dataset, we have extracted a test set of ground truth annotations for 50 source articles using the "BestSimilar" database. Each source articles is associated with a list of ${\scriptsize \sim}12$ most similar movies.
Examples for ground-truth expert-based recommendations are

Schindler's List - The Pianist

Lion King - The Jungle Book

Video games

The Wikipedia video games dataset consists of 21,935 articles reviewing video games from all genres and consoles. Each article may consist of a different combination of sections, including summary, gameplay, plot, production, etc. Examples for ground-truth expert-based recommendations are:

Grand Theft Auto - Mafia

Burnout Paradise - Forza Horizon 3
h
text-quality
huggingface.co
Updated Sep 30, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alan Tseng (2024). text-quality [Dataset]. https://huggingface.co/datasets/agentlans/text-quality
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 30, 2024
Authors
Alan Tseng
License
https://choosealicense.com/licenses/cc/https://choosealicense.com/licenses/cc/
Description
Text Quality Assessment Dataset

Overview

This dataset is designed to assess text quality robustly across various domains for NLP and AI applications. It provides a composite quality score based on multiple classifiers, offering a more comprehensive evaluation of text quality beyond educational domains.

Dataset Details

Size: 100,000 sentences Source: 20,000 sentences from each of 5 different datasets allenai/c4 HuggingFaceFW/fineweb-edu… See the full description on the dataset page: https://huggingface.co/datasets/agentlans/text-quality.
s
Japanese & Korean Language Dataset
shaip.com
json
Updated Nov 26, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shaip (2024). Japanese & Korean Language Dataset [Dataset]. https://www.shaip.com/offerings/language-text-datasets/
Explore at:
jsonAvailable download formats
Dataset updated
Nov 26, 2024
Dataset authored and provided by
Shaip
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
The Japanese & Korean Language Dataset includes text samples in both Japanese and Korean. It features a range of content such as sentences, phrases, and words, encompassing various contexts and styles. This dataset is used for tasks like natural language processing (NLP), machine translation, and text analysis in multilingual applications.
Sample Posts from the ADHD dataset.
figshare.com
plos.figshare.com
xls
Updated Feb 6, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ahmed Akib Jawad Karim; Kazi Hafiz Md. Asad; Md. Golam Rabiul Alam (2025). Sample Posts from the ADHD dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0315829.t002
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0315829.t002
Dataset updated
Feb 6, 2025
Dataset provided by
PLOShttp://plos.org/
Authors
Ahmed Akib Jawad Karim; Kazi Hafiz Md. Asad; Md. Golam Rabiul Alam
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This work focuses on the efficiency of the knowledge distillation approach in generating a lightweight yet powerful BERT-based model for natural language processing (NLP) applications. After the model creation, we applied the resulting model, LastBERT, to a real-world task—classifying severity levels of Attention Deficit Hyperactivity Disorder (ADHD)-related concerns from social media text data. Referring to LastBERT, a customized student BERT model, we significantly lowered model parameters from 110 million BERT base to 29 million-resulting in a model approximately 73.64% smaller. On the General Language Understanding Evaluation (GLUE) benchmark, comprising paraphrase identification, sentiment analysis, and text classification, the student model maintained strong performance across many tasks despite this reduction. The model was also used on a real-world ADHD dataset with an accuracy of 85%, F1 score of 85%, precision of 85%, and recall of 85%. When compared to DistilBERT (66 million parameters) and ClinicalBERT (110 million parameters), LastBERT demonstrated comparable performance, with DistilBERT slightly outperforming it at 87%, and ClinicalBERT achieving 86% across the same metrics. These findings highlight the LastBERT model’s capacity to classify degrees of ADHD severity properly, so it offers a useful tool for mental health professionals to assess and comprehend material produced by users on social networking platforms. The study emphasizes the possibilities of knowledge distillation to produce effective models fit for use in resource-limited conditions, hence advancing NLP and mental health diagnosis. Furthermore underlined by the considerable decrease in model size without appreciable performance loss is the lower computational resources needed for training and deployment, hence facilitating greater applicability. Especially using readily available computational tools like Google Colab and Kaggle Notebooks. This study shows the accessibility and usefulness of advanced NLP methods in pragmatic world applications.
Trojan Detection Software Challenge -...
catalog.data.gov
nist.gov
+2more
Updated Sep 30, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Institute of Standards and Technology (2023). Trojan Detection Software Challenge - nlp-sentiment-classification-apr2021-test [Dataset]. https://catalog.data.gov/dataset/trojan-detection-software-challenge-round-6-test-dataset
Explore at:
Dataset updated
Sep 30, 2023
Dataset provided by
National Institute of Standards and Technologyhttp://www.nist.gov/
Description
Round 6 Test DatasetThis is the test data used to construct and evaluate trojan detection software solutions. This data, generated at NIST, consists of natural language processing (NLP) AIs trained to perform text sentiment classification on English text. A known percentage of these trained AI models have been poisoned with a known trigger which induces incorrect behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via embedded triggers. This dataset consists of 480 sentiment classification AI models using a small set of model architectures. The models were trained on text data drawn from product reviews. Half (50%) of the models have been poisoned with an embedded trigger which causes misclassification of the input when the trigger is present.

Facebook

Twitter

Click to copy link

Link copied

Cite

Alessandro Lo Bello (2024). News classification dataset for NLP [Dataset]. https://www.kaggle.com/datasets/alessandrolobello/guardian

News classification dataset for NLP

Multimodal NLP Classification Dataset: A Comprehensive Repository for NLP Tasks

Explore at:

zip(117148704 bytes)Available download formats

Dataset updated

Feb 27, 2024

Authors

Alessandro Lo Bello

Description

This dataset is compiled through scraping articles from Guardian news starting from 2021 up to the present. It's meticulously crafted to facilitate multimodal classification tasks, especially in the realm of Natural Language Processing (NLP).

The dataset comprises 13 labels representing various topics covered in the news, ranging from sports to economics, politics, art, and entertainment. Each label corresponds to a distinct category, offering a comprehensive overview of the diverse subjects discussed in journalistic discourse. Researchers and practitioners can utilize these labels for multi-label classification tasks, aiming to categorize news articles based on their thematic content.

The topics encapsulated by these labels reflect the multifaceted nature of contemporary news media, capturing the breadth and depth of global events and developments. Whether it's the latest in sports, updates on economic trends, insights into political affairs, or coverage of the arts and entertainment industry, this dataset encompasses a wide spectrum of subjects.

The file forma is parquet. If you are new to this, for open it just do pd.read_parquet(filepath) and it's done!

Clear search

Close search

Google apps

Main menu

News classification dataset for NLP

NLP Dataset

English Human-Human Chat Dataset for Conversational AI & NLP

Introduction

Conversational Text Data

Diversity and Domain Coverage

Linguistic Authenticity

Metadata

Data Quality Assurance

Applications

NLP for German News Articles

:::: Ten Thousand German News Articles Dataset ::::

:::: What It Cointains ::::

Citations:

B5text dataset - Textual data for 5 class sentiment classification of...

Probing Datasets for Noisy Texts

Norwegian Human-Human Chat Dataset for Conversational AI & NLP

Introduction

Conversational Text Data

Diversity and Domain Coverage

Linguistic Authenticity

Metadata

Data Quality Assurance

Applications

Arabic Text Dataset

Punjabi Human-Human Chat Dataset for Conversational AI & NLP

Introduction

Conversational Text Data

Diversity and Domain Coverage

Linguistic Authenticity

Metadata

Data Quality Assurance

Applications

Long document similarity dataset, Wikipedia excerptions for movies...

AI-Enhanced English Teaching Resource Dataset

Text + Audio-Visual (Multilingual/OCR/NLP) – Books, Journals, Audio+Text

Tamil (Tamizh) Wikipedia Text Dataset for NLP

What’s Included

Why This Dataset?

** How You Can Use This Dataset**

Let’s Collaborate!

License

Six small datasets for text segmentation

Japanese Human-Human Chat Dataset for Conversational AI & NLP

Introduction

Conversational Text Data

Diversity and Domain Coverage

Linguistic Authenticity

Metadata

Data Quality Assurance

Applications

Long document similarity datasets, Wikipedia excerptions for movies, video...

text-quality

Japanese & Korean Language Dataset

Sample Posts from the ADHD dataset.

Trojan Detection Software Challenge -...

News classification dataset for NLP

Multimodal NLP Classification Dataset: A Comprehensive Repository for NLP Tasks

How You Can Use This Dataset