100+ datasets found
  1. NLP Mental Health Conversations

    • kaggle.com
    Cite
    The Devastator (2023). NLP Mental Health Conversations [Dataset]. https://www.kaggle.com/datasets/thedevastator/nlp-mental-health-conversations
    Explore at:
    Available download formats: zip (1552188 bytes)
    Dataset updated
    Nov 24, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    NLP Mental Health Conversations

    Stimulating AI-Driven Mental Health Guidance

    By Huggingface Hub [source]

    About this dataset

    This dataset contains conversations between users and experienced psychologists related to mental health topics. Carefully collected and anonymized, the data can be used to further the development of Natural Language Processing (NLP) models which focus on providing mental health advice and guidance. It consists of a variety of questions which will help train NLP models to provide users with appropriate advice in response to their queries. Whether you're an AI developer interested in building the next wave of mental health applications or a therapist looking for insights into how technology is helping people connect; this dataset provides invaluable support for advancing our understanding of human relationships through Artificial Intelligence


    How to use the dataset

    This guide will provide you with the necessary knowledge to effectively use this dataset for Natural Language Processing (NLP)-based applications.

    • Download and extract the dataset: To begin, download the dataset from Kaggle onto your system. Once downloaded, unzip the archive and extract the .csv file into a directory of your choice.

    • Familiarize yourself with the columns: Before working with the data, it’s important to understand its components. This dataset contains two columns, Context and Response, which pair user questions about mental health topics with responses from experienced psychologists, making the data suitable for NLP models dedicated to providing mental health advice and guidance.

    • Analyze the data entries: If possible, take time to examine what each entry contains; this is not strictly required before proceeding, but it can help you untangle challenges that come up later. Review the questions asked by users and the answers provided by experts to get an overall picture of the kinds of conversations in the data, which will help guide further work on NLP models for AI-driven mental health guidance.

    • Clean out information that is not relevant to your application goals: Keep only entries that are meaningful for the AI-driven results you are targeting. Consider removing duplicate or near-verbatim entries and any other content that does not serve your end goal before proceeding to model development; a minimal loading-and-cleaning sketch is shown below.
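
    The sketch below assumes pandas and a file named train.csv; the actual file name inside the Kaggle archive and the exact column names should be verified after extraction.

    ```python
    # Hedged sketch: load the extracted CSV and apply a light cleaning pass.
    # "train.csv" and the column names are assumptions; check the extracted files.
    import pandas as pd

    df = pd.read_csv("train.csv")  # adjust the path to your extracted file
    print(df.columns.tolist())     # expected: ["Context", "Response"]

    # Drop exact duplicates and rows with missing text before modeling.
    df = df.drop_duplicates().dropna(subset=["Context", "Response"])
    df["Context"] = df["Context"].str.strip()
    df["Response"] = df["Response"].str.strip()
    print(f"{len(df)} cleaned conversation pairs")
    ```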

    Research Ideas

    • Creating sentence-matching algorithms for natural language processing to accurately match given questions with appropriate advice and guidance.
    • Analyzing the psychological conversations to gain insights into topics such as stress, anxiety, and depression.
    • Developing personalized natural language processing models tailored to provide users with appropriate advice based on their queries and their individual state of mental health.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

  2. NLP project

    • kaggle.com
    Cite
    Rawan7544 (2024). NLP project [Dataset]. https://www.kaggle.com/datasets/rawan7544/nlp-project
    Explore at:
    Available download formats: zip (1584256901 bytes)
    Dataset updated
    Dec 21, 2024
    Authors
    Rawan7544
    Description

    Dataset

    This dataset was created by Rawan1652002

    Contents

  3. kaggle-nlp-getting-start

    • huggingface.co
    Cite
    hui, kaggle-nlp-getting-start [Dataset]. https://huggingface.co/datasets/gdwangh/kaggle-nlp-getting-start
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more about this at mlcommons.org/croissant.
    Authors
    hui
    Description

    Dataset Summary

    Natural Language Processing with Disaster Tweets: https://www.kaggle.com/competitions/nlp-getting-started/data This particular challenge is perfect for data scientists looking to get started with Natural Language Processing. The competition dataset is not too big, and even if you don’t have much personal computing power, you can do all of the work in our free, no-setup, Jupyter Notebooks environment called Kaggle Notebooks.

    Columns

    id - a unique identifier for each tweet… See the full description on the dataset page: https://huggingface.co/datasets/gdwangh/kaggle-nlp-getting-start.

  4. NLP Research Papers Dataset

    • kaggle.com
    Cite
    Subham Surana (2024). NLP Research Papers Dataset [Dataset]. https://www.kaggle.com/datasets/subhamjain/natural-language-processing-research-papers
    Explore at:
    Available download formats: zip (1074694 bytes)
    Dataset updated
    May 1, 2024
    Authors
    Subham Surana
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Context

    The dataset appears to be a collection of NLP research papers, with the full text available in the "article" column, abstract summaries in the "abstract" column, and information about different sections in the "section_names" column. Researchers and practitioners in the field of natural language processing can use this dataset for various tasks, including text summarization, document classification, and analysis of research paper structures.

    Data Fields

    Here's a short description of the Natural Language Processing Research Papers dataset:
    1. Article: This column likely contains the full text or content of the research papers related to Natural Language Processing (NLP). Each entry represents the entire body of a specific research article.
    2. Abstract: This column likely contains the abstracts of the NLP research papers. The abstract provides a concise summary of the paper, highlighting its key objectives, methods, and findings.
    3. Section Names: This column probably contains the section headings within each research paper, such as Introduction, Methodology, Results, and Conclusion. This information can be useful for structuring and organizing the content of the research papers.
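
    As a quick illustration of the summarization use case, the sketch below runs a pretrained summarizer over the "article" column and compares the output with the provided "abstract". The CSV file name is a placeholder; the column names follow the description above.

    ```python
    # Hedged sketch: summarize an article and compare with its reference abstract.
    import pandas as pd
    from transformers import pipeline

    papers = pd.read_csv("nlp_research_papers.csv")  # placeholder file name
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

    sample = papers["article"].iloc[0][:3000]  # truncate long input for the model
    generated = summarizer(sample, max_length=150, min_length=40, do_sample=False)
    print("Generated:", generated[0]["summary_text"])
    print("Reference:", papers["abstract"].iloc[0][:300])
    ```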

    File Description

    Content Overview: The dataset is valuable for researchers, students, and practitioners in the field of Natural Language Processing. File format: CSV.

  5. High-Quality Financial News Dataset for NLP Tasks

    • kaggle.com
    Cite
    Sayel Abualigah (2025). High-Quality Financial News Dataset for NLP Tasks [Dataset]. https://www.kaggle.com/datasets/sayelabualigah/high-quality-financial-news-dataset-for-nlp-tasks
    Explore at:
    Available download formats: zip (1566953 bytes)
    Dataset updated
    Nov 21, 2025
    Authors
    Sayel Abualigah
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    High-Quality Financial News Dataset

    Description

    This repository contains a meticulously scraped dataset from various financial websites. The data extraction process ensures high-quality and accurate text, including content from both the websites and their embedded PDFs.

    Dataset Features

    • Date: The date of the announcement.
    • Subject: The subject of the financial news.
    • Content: The full content of the announcement, including text from the website and PDFs.

    Additional Processed Fields

    We applied the advanced Mixtral 8x7B model to generate the following additional fields:

    • ParaphrasedSubject: A paraphrased version of the original subject.
    • CompactedSummary: A concise summary limited to 1.5 lines.
    • DetailedSummary: A detailed summary of the content.
    • Impact: The impact of the announcement, summarized in 2 lines.

    Methodology

    The prompt used to generate the additional fields was highly effective, thanks to extensive discussions and collaboration with the Mistral AI team. This ensures that the dataset provides valuable insights and is ready for further analysis and model training.

    Usage

    This dataset can be used for various applications, including but not limited to:

    • Financial news analysis
    • Abstractive/extractive summarization tasks
    • Machine learning model training
    • Natural language processing tasks
  6. Data Extraction and NLP

    • kaggle.com
    Cite
    Dinansh (2023). Data Extraction and NLP [Dataset]. https://www.kaggle.com/datasets/vision6/data-extraction-and-nlp
    Explore at:
    Available download formats: zip (98724 bytes)
    Dataset updated
    Jul 27, 2023
    Authors
    Dinansh
    Description

    Description

    This dataset includes URLs of blogs covering topics like Healthcare, Artificial Intelligence, Big Data, Lifestyle, IT Services, Data Science, and Banking. Analyzing this data can be valuable for learning web scraping and data extraction concepts. The objective of this dataset is to extract textual data articles from the given URL in the input.xlsx and perform text analysis to compute variables (Positive Score, Negative Score, Polarity Score, Subjectivity Score, Avg Sentence Length, Percentage of Complex Words, Fog Index, Avg Number of Words Per Sentence, Complex Word Count, Word Count, Syllable Per Word, Personal Pronouns, Avg Word Length).

    Text Analysis

    Sentiment Analysis: the process of determining whether a piece of writing is positive, negative, or neutral.
    • Positive Score: assign +1 for each word found in the Positive Dictionary, then sum the values.
    • Negative Score: assign -1 for each word found in the Negative Dictionary, then sum the values; the total is multiplied by -1 so that the score is a positive number.
    • Polarity Score: determines whether a given text is positive or negative. Polarity Score = (Positive Score - Negative Score) / ((Positive Score + Negative Score) + 0.000001). The range is -1 to +1.
    • Subjectivity Score: determines whether a given text is objective or subjective. Subjectivity Score = (Positive Score + Negative Score) / (Total Words after cleaning + 0.000001). The range is 0 to +1.

    Analysis of Readability: calculated using the Gunning Fog Index formula described below.
    • Average Sentence Length = number of words / number of sentences
    • Percentage of Complex Words = number of complex words / number of words
    • Fog Index = 0.4 * (Average Sentence Length + Percentage of Complex Words)

    Other Calculations:
    • Average Number of Words Per Sentence = total number of words / total number of sentences
    • Word Count: the total number of cleaned words in the text, obtained by (1) removing stop words (using the stopwords class of the NLTK package) and (2) removing punctuation such as ? ! , . from each word before counting.
    • Syllable Count Per Word: the number of syllables in each word, counted from the vowels present in the word; exceptions such as words ending in "es" or "ed" are handled by not counting them as an extra syllable.
    • Complex Word Count: complex words are words that contain more than two syllables.
    • Personal Pronouns: counted with a regex over the words "I", "we", "my", "ours", and "us"; special care is taken so that the country name "US" is not included in the count.
    • Average Word Length = sum of the number of characters in each word / total number of words
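
    A small Python sketch of the readability and count metrics described above; the dictionary-based sentiment scores additionally require the positive/negative word lists referenced by the dataset, which are not reproduced here.

    ```python
    # Hedged sketch of the readability-related metrics, using NLTK for tokenization.
    import re
    import nltk
    from nltk.tokenize import sent_tokenize, word_tokenize

    nltk.download("punkt", quiet=True)

    def syllable_count(word: str) -> int:
        # Rough vowel-group count, ignoring common silent endings ("es", "ed").
        word = word.lower()
        if word.endswith(("es", "ed")):
            word = word[:-2]
        return max(1, len(re.findall(r"[aeiouy]+", word)))

    def readability_metrics(text: str) -> dict:
        sentences = sent_tokenize(text)
        words = [w for w in word_tokenize(text) if w.isalpha()]
        complex_words = [w for w in words if syllable_count(w) > 2]
        avg_sentence_len = len(words) / max(1, len(sentences))
        pct_complex = len(complex_words) / max(1, len(words))
        return {
            "avg_sentence_length": avg_sentence_len,
            "pct_complex_words": pct_complex,
            "fog_index": 0.4 * (avg_sentence_len + pct_complex),
            "personal_pronouns": len(re.findall(r"\b(?:I|we|my|ours|us)\b", text)),
            "avg_word_length": sum(len(w) for w in words) / max(1, len(words)),
        }

    print(readability_metrics("We analyzed the article. It was surprisingly readable."))
    ```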

  7. Sentiment Analysis Dataset for NLP Projects

    • kaggle.com
    Cite
    AlyAhmedTS13 (2025). Sentiment Analysis Dataset for NLP Projects [Dataset]. https://www.kaggle.com/datasets/alyahmedts13/reddit-sentiment-analysis-dataset-for-nlp-projects
    Explore at:
    Available download formats: zip (1204347 bytes)
    Dataset updated
    Nov 16, 2025
    Authors
    AlyAhmedTS13
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    🕹️ About Dataset

    This dataset contains short Reddit posts (≤280 characters) about pop music and pop stars, labeled for sentiment analysis.

    We collected ~124k posts using keywords like Taylor Swift, Olivia Rodrigo, Grammy, Billboard, and subreddits like popheads, Music, and Billboard. After cleaning and filtering, we kept only short-form, English posts and combined each post’s title and body into a single text column.

    The final dataset contains over 32,000 rows.

    Sentiment labels (positive, neutral, negative) were generated using a BERT-based model fine-tuned for social media (CardiffNLP’s Twitter RoBERTa).
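
    A hedged sketch of how such labels can be produced with the CardiffNLP Twitter RoBERTa model via the transformers pipeline; the exact checkpoint and label mapping used by the dataset author may differ.

    ```python
    # Hedged sketch: label short posts with a Twitter-tuned RoBERTa sentiment model.
    from transformers import pipeline

    sentiment = pipeline(
        "sentiment-analysis",
        model="cardiffnlp/twitter-roberta-base-sentiment-latest",
    )

    posts = [
        "Olivia Rodrigo's new single is incredible",
        "That Grammy decision made no sense to me",
    ]
    for post, result in zip(posts, sentiment(posts)):
        print(result["label"], round(result["score"], 3), "-", post)
    ```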

    This version is ready for NLP sentiment projects — train your own model, explore pop fandom discourse, or benchmark transformer performance on real-world Reddit data.

  8. NLP for German News Articles

    • kaggle.com
    Cite
    Aman Chauhan (2022). NLP for German News Articles [Dataset]. https://www.kaggle.com/datasets/whenamancodes/nlp-for-10k-german-news-articles
    Explore at:
    Available download formats: zip (128989980 bytes)
    Dataset updated
    Oct 1, 2022
    Authors
    Aman Chauhan
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    :::: Ten Thousand German News Articles Dataset ::::

    A dataset for topic extraction from 10k German news articles and for NLP on the German language. English text classification datasets are common: examples are the large AG News, the class-rich 20 Newsgroups, and the large-scale DBpedia ontology datasets for topic classification, as well as the commonly used IMDb and Yelp datasets for sentiment analysis. Non-English datasets, especially German datasets, are less common. There is a collection of sentiment analysis datasets assembled by the Interest Group on German Sentiment Analysis, and to my knowledge MLDoc contains German documents for classification. Due to grammatical differences between English and German, a classifier might be effective on an English dataset but not as effective on a German one: German is more highly inflected, and long compound words are far more common than in English. One would need to evaluate a classifier on multiple German datasets to get a sense of its effectiveness.

    :::: What It Contains ::::

    The 10kGNAD dataset is intended to address part of this problem as the first German topic classification dataset. It consists of 10,273 German-language news articles from an Austrian online newspaper, categorized into nine topics. These articles are a so far unused part of the One Million Posts Corpus. In the One Million Posts Corpus each article has a topic path, for example Newsroom/Wirtschaft/Wirtschaftpolitik/Finanzmaerkte/Griechenlandkrise. The 10kGNAD dataset uses the second part of the topic path, here Wirtschaft, as the class label. The article titles and texts are concatenated into one text, and the authors are removed to avoid keyword-like classification based on authors who appear frequently in one class. I created and used this dataset in my thesis to train and evaluate four text classifiers on the German language. By publishing the dataset I hope to support the advancement of tools and models for the German language. Additionally, this dataset can be used as a benchmark dataset for German topic classification.
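
    A minimal sketch of the labeling scheme just described, deriving the class label from the second segment of the topic path (the path string is the example from the text above):

    ```python
    # The class label is the second segment of an article's topic path.
    topic_path = "Newsroom/Wirtschaft/Wirtschaftpolitik/Finanzmaerkte/Griechenlandkrise"
    label = topic_path.split("/")[1]
    print(label)  # -> Wirtschaft
    ```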

    Citations:

    @InProceedings{Schabus2017,
      author    = {Dietmar Schabus and Marcin Skowron and Martin Trapp},
      title     = {One Million Posts: A Data Set of German Online Discussions},
      booktitle = {Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR)},
      pages     = {1241--1244},
      year      = {2017},
      address   = {Tokyo, Japan},
      doi       = {10.1145/3077136.3080711},
      month     = aug
    }

    @InProceedings{Schabus2018,
      author    = {Dietmar Schabus and Marcin Skowron},
      title     = {Academic-Industrial Perspective on the Development and Deployment of a Moderation System for a Newspaper Website},
      booktitle = {Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC)},
      year      = {2018},
      address   = {Miyazaki, Japan},
      month     = may,
      pages     = {1602--1605},
      abstract  = {This paper describes an approach and our experiences from the development, deployment and usability testing of a Natural Language Processing (NLP) and Information Retrieval system that supports the moderation of user comments on a large newspaper website. We highlight some of the differences between industry-oriented and academic research settings and their influence on the decisions made in the data collection and annotation processes, selection of document representation and machine learning methods. We report on classification results, where the problems to solve and the data to work with come from a commercial enterprise. In this context typical for NLP research, we discuss relevant industrial aspects. We believe that the challenges faced as well as the solutions proposed for addressing them can provide insights to others working in a similar setting.},
      url       = {http://www.lrec-conf.org/proceedings/lrec2018/summaries/8885.html}
    }


  9. NLP-Chat-DataSet

    • kaggle.com
    Cite
    Alihassan (2024). NLP-Chat-DataSet [Dataset]. https://www.kaggle.com/datasets/alihassan779339/nlp-chat-dataset
    Explore at:
    Available download formats: zip (5701 bytes)
    Dataset updated
    Mar 13, 2024
    Authors
    Alihassan
    Description

    Dataset

    This dataset was created by Alihassan

    Contents

  10. LexGLUE: Legal NLP Benchmark

    • kaggle.com
    Cite
    The Devastator (2023). LexGLUE: Legal NLP Benchmark [Dataset]. https://www.kaggle.com/datasets/thedevastator/lexglue-legal-nlp-benchmark-dataset
    Explore at:
    Available download formats: zip (343671820 bytes)
    Dataset updated
    Dec 2, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    LexGLUE: Legal NLP Benchmark

    Legal NLP Benchmark Dataset: LexGLUE

    By lex_glue (From Huggingface) [source]

    About this dataset

    The LexGLUE dataset is a comprehensive benchmark dataset specially created to evaluate the performance of natural language processing (NLP) models in various legal tasks. This dataset draws inspiration from the success of other multi-task NLP benchmarks like GLUE and SuperGLUE, as well as similar initiatives in different domains.

    The primary objective of LexGLUE is to advance the development of versatile models that can effectively handle multiple legal NLP tasks without requiring extensive task-specific fine-tuning. By providing a standardized evaluation platform, this dataset aims to foster innovation and advancements in the field of legal language understanding.

    The dataset consists of several columns that provide crucial information for each entry. The context column contains the specific text or document from which each legal language understanding task is derived, offering essential background information for proper comprehension. The endings column presents multiple potential options or choices that could complete the legal task at hand, enabling comprehensive evaluation.

    Furthermore, there are various columns related to labels and target categories associated with each entry. The label column represents the correct or expected answer for a given task, ensuring accuracy in model predictions during evaluation. The labels column provides categorical information regarding target labels or categories relevant to the respective legal NLP task.

    Another important element within this dataset is the text column, which contains the actual input text representing a particular legal scenario or context for analysis. Analyzing this text forms an integral part of conducting accurate and effective NLP tasks within a legal context.

    To facilitate efficient model performance assessment on diverse aspects of legal language understanding, additional files are included in this benchmark dataset:
    • case_hold_test.csv: case contexts with multiple potential endings labeled as valid holdings or not.
    • ledgar_validation.csv: a validation set specifically designed for evaluating NLP models' performance on legal tasks.
    • ecthr_b_test.csv: samples related to the European Court of Human Rights (ECtHR) along with their corresponding labels, for testing the capabilities of legal language understanding models in this domain.
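
    A hedged sketch for a first look at the included CSV files; the column layout described above (context, endings, label, labels, text) is inferred from this description and should be verified against the downloaded files.

    ```python
    # Hedged sketch: inspect the CSV splits named in the description.
    import pandas as pd

    for name in ["case_hold_test.csv", "ledgar_validation.csv", "ecthr_b_test.csv"]:
        df = pd.read_csv(name)
        print(name, df.shape)
        print(df.columns.tolist())
    ```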

    In short, the LexGLUE dataset serves as a crucial resource for researchers and practitioners to benchmark and advance the state of the art in legal NLP tasks.

    Research Ideas

    • Training and evaluating NLP models: The LexGLUE dataset can be used to train and evaluate natural language processing models specifically designed for legal language understanding tasks. By using this dataset, researchers and developers can test the performance of their models on various legal NLP tasks, such as legal case analysis or European Court of Human Rights (ECtHR) related tasks.
    • Developing generic NLP models: The benchmark dataset is designed to push towards the development of generic models that can handle multiple legal NLP tasks with limited task-specific fine-tuning. Researchers can use this dataset to develop robust and versatile NLP models that can effectively understand and analyze legal texts.
    • Comparing different algorithms and approaches: LexGLUE provides a standardized benchmark for comparing different algorithms and approaches in the field of legal language understanding. Researchers can use this dataset to compare the performance of different techniques, such as rule-based methods, deep learning models, or transformer architectures, on various legal NLP tasks. This allows for a fair comparison between different approaches and facilitates progress in the field by identifying effective methods for solving specific legal language understanding challenges

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: case_hold_test.csv | Column name | Description ...

  11. NLP Project - Paraphrase Detection

    • kaggle.com
    Cite
    Big D Dang (2023). NLP Project - Paraphrase Detection [Dataset]. https://www.kaggle.com/datasets/bigddang/nlp-project-paraphrase-detection
    Explore at:
    Available download formats: zip (522141 bytes)
    Dataset updated
    Oct 21, 2023
    Authors
    Big D Dang
    Description

    Dataset

    This dataset was created by Big D Dang

    Contents

  12. Fake News Detection Dataset

    • kaggle.com
    Cite
    Mahdi Mashayekhi (2025). Fake News Detection Dataset [Dataset]. https://www.kaggle.com/datasets/mahdimashayekhi/fake-news-detection-dataset
    Explore at:
    Available download formats: zip (11735585 bytes)
    Dataset updated
    Apr 27, 2025
    Authors
    Mahdi Mashayekhi
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    📚 Fake News Detection Dataset

    Overview

    This dataset is designed for practicing fake news detection using machine learning and natural language processing (NLP) techniques. It includes a rich collection of 20,000 news articles, carefully generated to simulate real-world data scenarios. Each record contains metadata about the article and a label indicating whether the news is real or fake.

    The dataset also intentionally includes around 5% missing values in some fields to simulate the challenges of handling incomplete data in real-life projects.

    Columns Description

    • title: A short headline summarizing the article (around 6 words).
    • text: The body of the news article (200–300 words on average).
    • date: The publication date of the article, randomly selected over the past 3 years.
    • source: The media source that published the article (e.g., BBC, CNN, Al Jazeera). May contain missing values (~5%).
    • author: The author's full name. Some entries are missing (~5%) to simulate real-world incomplete data.
    • category: The general category of the article (e.g., Politics, Health, Sports, Technology).
    • label: The target label: real or fake news.

    Why Use This Dataset?

    Fake News Detection Practice: Perfect for binary classification tasks.

    NLP Preprocessing: Allows users to practice text cleaning, tokenization, vectorization, etc.

    Handling Missing Data: Some fields are incomplete to simulate real-world data challenges.

    Feature Engineering: Encourages creating new features from text and metadata.

    Balanced Labels: Realistic distribution of real and fake news for fair model training.

    Potential Use Cases

    Building and evaluating text classification models (e.g., Logistic Regression, Random Forests, XGBoost).

    Practicing NLP techniques like TF-IDF, Word2Vec, BERT embeddings.

    Performing exploratory data analysis (EDA) on news data.

    Developing pipelines for dealing with missing values and feature extraction.
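
    A hedged baseline sketch for the classification use case above, using TF-IDF features and logistic regression on the text and label columns; the column names follow the description and should be verified against the file.

    ```python
    # Hedged sketch: TF-IDF + logistic regression baseline for fake-news detection.
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    df = pd.read_csv("fake_news_dataset.csv").dropna(subset=["text", "label"])
    X_train, X_test, y_train, y_test = train_test_split(
        df["text"], df["label"], test_size=0.2, random_state=42, stratify=df["label"]
    )

    model = make_pipeline(
        TfidfVectorizer(max_features=50_000, ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    model.fit(X_train, y_train)
    print(classification_report(y_test, model.predict(X_test)))
    ```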

    A Note on Data Quality

    This dataset has been synthetically generated to closely resemble real news articles. The diversity in titles, text, sources, and categories ensures that models trained on this dataset can generalize well to unseen, real-world data. However, since it is synthetic, it should not be used for production models or decision-making without careful validation.

    File Info

    Filename: fake_news_dataset.csv

    Size: 20,000 rows × 7 columns

    Missing Data: ~5% missing values in the source and author columns.

  13. Visual Question Answering- Computer Vision & NLP

    • kaggle.com
    Cite
    Bhavik Ardeshna (2022). Visual Question Answering- Computer Vision & NLP [Dataset]. https://www.kaggle.com/datasets/bhavikardeshna/visual-question-answering-computer-vision-nlp
    Explore at:
    Available download formats: zip (430780593 bytes)
    Dataset updated
    Jun 14, 2022
    Authors
    Bhavik Ardeshna
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    VQA is a multimodal task wherein, given an image and a natural language question related to the image, the objective is to produce a natural language answer correctly as output.

    It involves understanding the content of the image and correlating it with the context of the question asked. Because we need to compare the semantics of information present in both of the modalities — the image and natural language question related to it — VQA entails a wide range of sub-problems in both CV and NLP (such as object detection and recognition, scene classification, counting, and so on). Thus, it is considered an AI-complete task.
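
    A hedged sketch of the VQA task using a general-purpose pretrained model from the transformers library (not a model trained on this particular dataset); the image path is a placeholder.

    ```python
    # Hedged sketch: answer a question about an image with a pretrained VQA model.
    from transformers import pipeline

    vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")
    answers = vqa(image="example_image.jpg", question="How many people are in the picture?")
    print(answers[0]["answer"], answers[0]["score"])
    ```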

  14. AI-Enhanced English Teaching Resource Dataset

    • kaggle.com
    Cite
    Ziya (2025). AI-Enhanced English Teaching Resource Dataset [Dataset]. https://www.kaggle.com/datasets/ziya07/ai-enhanced-english-teaching-resource-dataset
    Explore at:
    Available download formats: zip (2634 bytes)
    Dataset updated
    Mar 7, 2025
    Authors
    Ziya
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The AI-Enhanced English Teaching Resource Dataset is designed for research on Natural Language Processing (NLP) applications in automated English lesson generation. It contains 300 structured entries, combining human-written and AI-generated educational content across various categories such as Grammar, Vocabulary, Reading, Writing, Speaking, and Literature.

    Key Features:
    • Lesson Text: Descriptive summaries of English lessons.
    • Keywords: Important terms extracted for each lesson.
    • Lesson Type: Categorization into different teaching domains.
    • Difficulty Level: Labels for Beginner, Intermediate, and Advanced levels.
    • Target: Binary classification (0 = Human-written, 1 = AI-generated).

    Use Cases:
    • Training and evaluating NLP models for educational content generation.
    • Assessing AI’s effectiveness in producing structured and relevant lesson materials.
    • Developing adaptive e-learning platforms for personalized teaching.

    This dataset serves as a valuable resource for machine learning, NLP, and educational technology research, enabling scalable and automated curriculum design. 🚀

  15. Causal Reasoning NLP

    • kaggle.com
    Cite
    Locked_in_hell (2023). Causal Reasoning NLP [Dataset]. https://www.kaggle.com/datasets/stealthknight/causal-reasoningonly-positive-samples
    Explore at:
    Available download formats: zip (12361 bytes)
    Dataset updated
    Nov 8, 2023
    Authors
    Locked_in_hell
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Given a text and a reason, the task is to predict whether the text correctly satisfies the reason. Various approaches can be used to determine the consistency of the text with the reason. Note: this dataset contains only positive samples, so data augmentation techniques should be applied to train a good model (one simple approach is sketched below). This is an example of an imbalanced causal reasoning dataset in the field of NLP.
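
    A hedged augmentation sketch for a positives-only dataset: negative (text, reason) pairs are created by pairing each text with a reason drawn from a different row. The file name and the column names (text, reason) are assumptions; adapt them to the actual file.

    ```python
    # Hedged sketch: build negative pairs by shuffling reasons across rows.
    import pandas as pd

    df = pd.read_csv("causal_reasoning.csv")  # placeholder file name
    df["label"] = 1                            # all provided pairs are positive

    negatives = df.copy()
    negatives["reason"] = df["reason"].sample(frac=1.0, random_state=0).to_numpy()
    negatives["label"] = 0
    negatives = negatives[negatives["reason"] != df["reason"]]  # drop accidental matches

    augmented = pd.concat([df, negatives], ignore_index=True)
    print(augmented["label"].value_counts())
    ```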

  16. Natural Language Processing - IntenCareer Project

    • kaggle.com
    Cite
    Nature (2024). Natural Language Processing - IntenCareer Project [Dataset]. https://www.kaggle.com/datasets/marknature/natural-language-processing-intencareer-project
    Explore at:
    Available download formats: zip (141866520 bytes)
    Dataset updated
    May 22, 2024
    Authors
    Nature
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Task 1: Natural Language Processing (NLP) - IntenCareer Project

    Overview

    This project aims to develop an NLP model for tasks like sentiment analysis, text classification, or named entity recognition.

    Steps

    1. Project Selection: Choose a specific NLP task.
    2. Data Collection: Gather and prepare a dataset relevant to the task. Dataset was too big to push
    3. Preprocessing: Clean and preprocess the text data.
    4. Model Development: Develop an NLP model using ML or DL techniques.
    5. Training and Evaluation: Train the model and evaluate its performance.
    6. Results Presentation: Present the results, including model accuracy and insights.

    For more details, refer to the project guidelines. LinkedIn: https://www.linkedin.com/in/marknature-c/ GitHub: https://github.com/marknature/

  17. Multilingual Sentences Collection Dataset- NLP

    • kaggle.com
    Cite
    Aayushi Mishra (2025). Multilingual Sentences Collection Dataset- NLP [Dataset]. https://www.kaggle.com/datasets/aayushiweb/multilingual-sentences-collection-dataset-nlp
    Explore at:
    Available download formats: zip (78952 bytes)
    Dataset updated
    Apr 8, 2025
    Authors
    Aayushi Mishra
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This title is:

    Descriptive – tells users exactly what to expect.

    Professional – suitable for academic, NLP, and Kaggle usage.

    Searchable – includes keywords like "multilingual", "sentences", "languages".

  18. NLP in Practice competition dataset

    • kaggle.com
    Cite
    Mark Baushenko (2023). NLP in Practice competition dataset [Dataset]. https://www.kaggle.com/datasets/e0xextazy/nlp-in-practice-competition-dataset
    Explore at:
    Available download formats: zip (36685070 bytes)
    Dataset updated
    Jun 12, 2023
    Authors
    Mark Baushenko
    Description

    Dataset

    This dataset was created by Mark Baushenko

    Contents

  19. 100-poems dataset for language model (NLP)

    • kaggle.com
    Cite
    Bikram Saha (2022). 100-poems dataset for language model (NLP) [Dataset]. https://www.kaggle.com/datasets/imbikramsaha/poems
    Explore at:
    Available download formats: zip (59015 bytes)
    Dataset updated
    Jul 2, 2022
    Authors
    Bikram Saha
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    It's a dataset for doing NLP text generation.

    Train your model, upload your Notebook, and happy learning :)

  20. Books Dataset for NLP & Recommendation Systems

    • kaggle.com
    Cite
    sina tavakoli (2025). Books Dataset for NLP & Recommendation Systems [Dataset]. https://www.kaggle.com/datasets/sinatavakoli/books-dataset-for-nlp-and-recommendation-systems
    Explore at:
    Available download formats: zip (2089930 bytes)
    Dataset updated
    Jul 2, 2025
    Authors
    sina tavakoli
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset contains metadata for 4,700+ popular books across various genres, time periods, and authors. Each entry includes information such as the book’s title, author(s), average rating, publication year, language, description, and a link to its cover image.

    The dataset is ideal for Natural Language Processing (NLP) projects, recommendation systems, sentiment analysis, text summarization, author-based trend analysis, and other data science or machine learning tasks related to books and literature.

    Whether you're building a book recommender, training a language model on literary data, or analyzing rating trends over time, this dataset provides a rich, real-world foundation. A minimal recommender sketch follows below.
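
    A hedged content-based recommender sketch over book descriptions; the file name and the title/description column names are assumptions based on the metadata fields listed above.

    ```python
    # Hedged sketch: recommend books with similar descriptions via TF-IDF + cosine similarity.
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    books = (
        pd.read_csv("books.csv")                      # placeholder file name
        .dropna(subset=["title", "description"])
        .reset_index(drop=True)
    )
    tfidf = TfidfVectorizer(stop_words="english", max_features=20_000)
    vectors = tfidf.fit_transform(books["description"])

    def similar_books(title: str, top_n: int = 5) -> pd.Series:
        idx = books.index[books["title"] == title][0]      # position of the query book
        scores = cosine_similarity(vectors[idx], vectors).ravel()
        best = scores.argsort()[::-1][1 : top_n + 1]       # skip the book itself
        return books["title"].iloc[best]

    print(similar_books("Pride and Prejudice"))
    ```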
