License: CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This dataset contains conversations between users and experienced psychologists on mental health topics. Carefully collected and anonymized, the data can be used to further the development of Natural Language Processing (NLP) models that provide mental health advice and guidance. It consists of a variety of questions that can help train NLP models to respond to user queries with appropriate advice. Whether you're an AI developer building the next wave of mental health applications or a therapist looking for insight into how technology is helping people connect, this dataset provides invaluable support for advancing our understanding of human relationships through Artificial Intelligence.
This guide will provide you with the necessary knowledge to effectively use this dataset for Natural Language Processing (NLP)-based applications.
Download the dataset: To begin, download the dataset from Kaggle onto your system. Once downloaded, unzip it and extract the .csv file into a directory of your choice.
Familiarize yourself with the columns: Before working with the data, review its components. The dataset contains two columns, Context and Response: Context holds a user's question or statement about a mental health topic, and Response holds the psychologist's reply. Together they form the conversations used to train NLP models for mental health advice and guidance.
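As a minimal sketch of that two-column structure, the file can be read with Python's csv module. The row contents below are invented for illustration, not taken from the actual data:

```python
import csv
import io

# In practice, replace this in-memory sample with open("<your path>.csv").
sample = io.StringIO(
    "Context,Response\n"
    '"I feel anxious before exams.","Try breaking revision into short, timed blocks."\n'
)

# Each row pairs a user's message (Context) with the psychologist's reply (Response).
pairs = [(row["Context"], row["Response"]) for row in csv.DictReader(sample)]
print(pairs[0][0])
```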
Analyze the data entries: Take time to examine what each entry contains; this can help you untangle challenges that come up in later steps. Look at both the questions asked by users and the answers provided by experts to get an overall picture of the kinds of conversations in the data, which can guide further work on NLP models for AI-driven mental health guidance.
Cleanse the data: Keep only content that is meaningful for your application's goals. Consider removing duplicate or verbatim entries and anything else that is not needed, and make sure the remaining content aligns with the decisions your end goal requires before proceeding.
- Creating sentence-matching algorithms for natural language processing to accurately match given questions with appropriate advice and guidance.
- Analyzing the psychological conversations to gain insights into topics such as stress, anxiety, and depression.
- Developing personalized natural language processing models tailored to provide users with appropriate advice based on their queries and their individual state of mental health.
If you use this dataset in your research, please credit the original authors. Data Source
License: [CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication](https://creativecommons.org/publicdomain/zero/1.0/)
This dataset was created by Rawan1652002
Dataset Summary
Natural Language Processing with Disaster Tweets: https://www.kaggle.com/competitions/nlp-getting-started/data This particular challenge is perfect for data scientists looking to get started with Natural Language Processing. The competition dataset is not too big, and even if you don’t have much personal computing power, you can do all of the work in our free, no-setup, Jupyter Notebooks environment called Kaggle Notebooks.
Columns
id - a unique identifier for each tweet… See the full description on the dataset page: https://huggingface.co/datasets/gdwangh/kaggle-nlp-getting-start.
License: Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/
The dataset appears to be a collection of NLP research papers, with the full text available in the "article" column, abstract summaries in the "abstract" column, and information about different sections in the "section_names" column. Researchers and practitioners in the field of natural language processing can use this dataset for various tasks, including text summarization, document classification, and analysis of research paper structures.
Here's a short description of the Natural Language Processing Research Papers dataset:
1. Article: This column likely contains the full text or content of the research papers related to Natural Language Processing (NLP). Each entry represents the entire body of a specific research article.
2. Abstract: This column likely contains the abstracts of the NLP research papers. The abstract provides a concise summary of the paper, highlighting its key objectives, methods, and findings.
3. Section Names: This column probably contains the section headings within each research paper, such as Introduction, Methodology, Results, and Conclusion. This information can be useful for structuring and organizing the content of the research papers.
Content Overview: The dataset is valuable for researchers, students, and practitioners in the field of Natural Language Processing. File format: CSV.
License: Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This repository contains a meticulously scraped dataset from various financial websites. The data extraction process ensures high-quality and accurate text, including content from both the websites and their embedded PDFs.
We applied the advanced Mixtral 8x7B model to generate the following additional fields:
The prompt used to generate the additional fields was highly effective, thanks to extensive discussions and collaboration with the Mistral AI team. This ensures that the dataset provides valuable insights and is ready for further analysis and model training.
This dataset can be used for various applications, including but not limited to:
This dataset includes URLs of blogs covering topics like Healthcare, Artificial Intelligence, Big Data, Lifestyle, IT Services, Data Science, and Banking. Analyzing this data can be valuable for learning web scraping and data extraction concepts. The objective of this dataset is to extract textual data articles from the given URL in the input.xlsx and perform text analysis to compute variables (Positive Score, Negative Score, Polarity Score, Subjectivity Score, Avg Sentence Length, Percentage of Complex Words, Fog Index, Avg Number of Words Per Sentence, Complex Word Count, Word Count, Syllable Per Word, Personal Pronouns, Avg Word Length).
Sentiment Analysis: the process of determining whether a piece of writing is positive, negative, or neutral.
- Positive Score: This score is calculated by assigning the value of +1 for each word if found in the Positive Dictionary and then adding up all the values.
- Negative Score: This score is calculated by assigning the value of -1 for each word if found in the Negative Dictionary and then adding up all the values. We multiply the score by -1 so that the score is a positive number.
- Polarity Score: This is the score that determines if a given text is positive or negative. It is calculated by using the formula:
Polarity Score = (Positive Score - Negative Score) / ((Positive Score + Negative Score) + 0.000001)
The range is from -1 to +1
- Subjectivity Score: This is the score that determines if a given text is objective or subjective. It is calculated by using the formula:
Subjectivity Score = (Positive Score + Negative Score)/ ((Total Words after cleaning) + 0.000001)
The range is from 0 to +1
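The two scores above translate directly into Python. This is a minimal sketch: the token list and the positive/negative dictionaries are assumed to come from your own cleaning step:

```python
def sentiment_scores(tokens, positive_words, negative_words, epsilon=0.000001):
    # +1 for each token found in the positive dictionary.
    positive_score = sum(1 for t in tokens if t in positive_words)
    # -1 for each negative hit, then multiplied by -1 so the total is positive.
    negative_score = -1 * sum(-1 for t in tokens if t in negative_words)
    polarity = (positive_score - negative_score) / (
        (positive_score + negative_score) + epsilon
    )
    subjectivity = (positive_score + negative_score) / (len(tokens) + epsilon)
    return polarity, subjectivity
```

The epsilon term matches the formulas above and guards against division by zero when a text contains no dictionary words.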
Analysis of Readability: It is calculated using the Gunning Fog index formula described below.
- Average Sentence Length = the number of words / the number of sentences
- Percentage of Complex words = the number of complex words / the number of words
- Fog Index = 0.4 * (Average Sentence Length + Percentage of Complex words)
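The three readability formulas above can be sketched as a direct translation; the word, sentence, and complex-word counts are assumed to be computed elsewhere:

```python
def readability(num_words, num_sentences, num_complex_words):
    # Average Sentence Length = words / sentences
    avg_sentence_length = num_words / num_sentences
    # Percentage of Complex Words = complex words / words (kept as a ratio,
    # matching the definition above)
    pct_complex_words = num_complex_words / num_words
    # Fog Index = 0.4 * (Average Sentence Length + Percentage of Complex Words)
    fog_index = 0.4 * (avg_sentence_length + pct_complex_words)
    return avg_sentence_length, pct_complex_words, fog_index
```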
Other Calculations
- Average Number of Words Per Sentence = the total number of words / the total number of sentences
- Word Count: We count the total cleaned words present in the text by:
1. removing the stop words (using the stopwords class of NLTK package), and
2. removing any punctuations like ? ! , . from the word before counting.
- Syllable Count Per Word: We count the number of Syllables in each word of the text by counting the vowels present in each word. We also handle some exceptions like words ending with "es", and "ed" by not counting them as a syllable.
- Complex Word Count: Complex words are words in the text that contain more than two syllables.
- Personal Pronouns: To count personal pronouns in the text, we use regex to find occurrences of the words "I," "we," "my," "ours," and "us". Special care is taken so that the country name "US" is not counted.
- Average Word Length: It is calculated by the formula
Sum of the total number of characters in each word/Total number of words
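The syllable and pronoun rules above can be sketched as follows. Note the suffix handling is a simplification: trailing "es"/"ed" is treated as silent by stripping it before counting vowels, per the exception described above:

```python
import re

VOWELS = set("aeiou")

def count_syllables(word: str) -> int:
    # Count vowels, treating a trailing "es" or "ed" as silent.
    word = word.lower()
    if word.endswith(("es", "ed")):
        word = word[:-2]
    return sum(1 for ch in word if ch in VOWELS)

# Match the five personal pronouns as whole words, case-insensitively.
PRONOUN_RE = re.compile(r"\b(I|we|my|ours|us)\b", re.IGNORECASE)

def count_personal_pronouns(text: str) -> int:
    # Skip the country name "US", which would otherwise match "us".
    return sum(1 for m in PRONOUN_RE.finditer(text) if m.group(0) != "US")
```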
License: Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains short Reddit posts (≤280 characters) about pop music and pop stars, labeled for sentiment analysis.
We collected ~124k posts using keywords like Taylor Swift, Olivia Rodrigo, Grammy, Billboard, and subreddits like popheads, Music, and Billboard. After cleaning and filtering, we kept only short-form, English posts and combined each post’s title and body into a single text column.
The final dataset contains roughly 32,000 rows.
Sentiment labels (positive, neutral, negative) were generated using a BERT-based model fine-tuned for social media (CardiffNLP’s Twitter RoBERTa).
This version is ready for NLP sentiment projects — train your own model, explore pop fandom discourse, or benchmark transformer performance on real-world Reddit data.
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
A dataset for topic extraction from 10k German News Articles and NLP for the German language. English text classification datasets are common. Examples are the big AG News, the class-rich 20 Newsgroups and the large-scale DBpedia ontology datasets for topic classification, and the commonly used IMDb and Yelp datasets for sentiment analysis. Non-English datasets, especially German datasets, are less common. There is a collection of sentiment analysis datasets assembled by the Interest Group on German Sentiment Analysis. To my knowledge, the MLDoc corpus contains German documents for classification. Due to grammatical differences between the English and the German language, a classifier might be effective on an English dataset, but not as effective on a German dataset. The German language has higher inflection, and long compound words are quite common compared to English. One would need to evaluate a classifier on multiple German datasets to get a sense of its effectiveness.
The 10kGNAD dataset is intended to solve part of this problem as the first German topic classification dataset. It consists of 10,273 German-language news articles from an Austrian online newspaper, categorized into nine topics. These articles are a previously unused part of the One Million Posts Corpus. In the One Million Posts Corpus each article has a topic path, for example Newsroom/Wirtschaft/Wirtschaftpolitik/Finanzmaerkte/Griechenlandkrise. The 10kGNAD uses the second part of the topic path, here Wirtschaft, as the class label. The article titles and texts are concatenated into one text, and the authors are removed to avoid keyword-like classification on authors who appear frequently in a class. I created and used this dataset in my thesis to train and evaluate four text classifiers on the German language. By publishing the dataset I hope to support the advancement of tools and models for the German language. Additionally, this dataset can be used as a benchmark dataset for German topic classification.
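The label-extraction rule described above (the second segment of the topic path becomes the class label) can be sketched as:

```python
def topic_label(topic_path: str) -> str:
    # "Newsroom/Wirtschaft/Wirtschaftpolitik/Finanzmaerkte/Griechenlandkrise"
    # yields "Wirtschaft", the second segment of the topic path.
    return topic_path.split("/")[1]
```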
@InProceedings{Schabus2017,
  author    = {Dietmar Schabus and Marcin Skowron and Martin Trapp},
  title     = {One Million Posts: A Data Set of German Online Discussions},
  booktitle = {Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR)},
  pages     = {1241--1244},
  year      = {2017},
  address   = {Tokyo, Japan},
  month     = aug,
  doi       = {10.1145/3077136.3080711}
}

@InProceedings{Schabus2018,
  author    = {Dietmar Schabus and Marcin Skowron},
  title     = {Academic-Industrial Perspective on the Development and Deployment of a Moderation System for a Newspaper Website},
  booktitle = {Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC)},
  year      = {2018},
  address   = {Miyazaki, Japan},
  month     = may,
  pages     = {1602--1605},
  url       = {http://www.lrec-conf.org/proceedings/lrec2018/summaries/8885.html},
  abstract  = {This paper describes an approach and our experiences from the development, deployment and usability testing of a Natural Language Processing (NLP) and Information Retrieval system that supports the moderation of user comments on a large newspaper website. We highlight some of the differences between industry-oriented and academic research settings and their influence on the decisions made in the data collection and annotation processes, selection of document representation and machine learning methods. We report on classification results, where the problems to solve and the data to work with come from a commercial enterprise. In this context typical for NLP research, we discuss relevant industrial aspects. We believe that the challenges faced as well as the solutions proposed for addressing them can provide insights to others working in a similar setting.}
}
This dataset was created by Alihassan
License: CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/
By lex_glue (From Huggingface) [source]
The LexGLUE dataset is a comprehensive benchmark dataset specially created to evaluate the performance of natural language processing (NLP) models in various legal tasks. This dataset draws inspiration from the success of other multi-task NLP benchmarks like GLUE and SuperGLUE, as well as similar initiatives in different domains.
The primary objective of LexGLUE is to advance the development of versatile models that can effectively handle multiple legal NLP tasks without requiring extensive task-specific fine-tuning. By providing a standardized evaluation platform, this dataset aims to foster innovation and advancements in the field of legal language understanding.
The dataset consists of several columns that provide crucial information for each entry. The context column contains the specific text or document from which each legal language understanding task is derived, offering essential background information for proper comprehension. The endings column presents multiple potential options or choices that could complete the legal task at hand, enabling comprehensive evaluation.
Furthermore, there are various columns related to labels and target categories associated with each entry. The label column represents the correct or expected answer for a given task, ensuring accuracy in model predictions during evaluation. The labels column provides categorical information regarding target labels or categories relevant to the respective legal NLP task.
Another important element within this dataset is the text column, which contains the actual input text representing a particular legal scenario or context for analysis. Analyzing this text forms an integral part of conducting accurate and effective NLP tasks within a legal context.
To facilitate efficient model performance assessment on diverse aspects of legal language understanding, additional files are included in this benchmark dataset: case_hold_test.csv comprises case contexts with multiple potential endings labeled as valid holdings or not; ledgar_validation.csv serves as a validation set specifically designed for evaluating NLP models' performance on legal tasks; ecthr_b_test.csv contains samples related to European Court of Human Rights (ECtHR) along with their corresponding labels for testing the capabilities of legal language understanding models in this domain.
Overall, the LexGLUE dataset serves as a crucial resource for researchers and practitioners looking to benchmark and advance the state of the art in legal NLP tasks.
- Training and evaluating NLP models: The LexGLUE dataset can be used to train and evaluate natural language processing models specifically designed for legal language understanding tasks. By using this dataset, researchers and developers can test the performance of their models on various legal NLP tasks, such as legal case analysis or European Court of Human Rights (ECtHR) related tasks.
- Developing generic NLP models: The benchmark dataset is designed to push towards the development of generic models that can handle multiple legal NLP tasks with limited task-specific fine-tuning. Researchers can use this dataset to develop robust and versatile NLP models that can effectively understand and analyze legal texts.
- Comparing different algorithms and approaches: LexGLUE provides a standardized benchmark for comparing different algorithms and approaches in the field of legal language understanding. Researchers can use this dataset to compare the performance of different techniques, such as rule-based methods, deep learning models, or transformer architectures, on various legal NLP tasks. This allows for a fair comparison between different approaches and facilitates progress in the field by identifying effective methods for solving specific legal language understanding challenges
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: case_hold_test.csv | Column name | Description ...
This dataset was created by Big D Dang
License: MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is designed for practicing fake news detection using machine learning and natural language processing (NLP) techniques. It includes a rich collection of 20,000 news articles, carefully generated to simulate real-world data scenarios. Each record contains metadata about the article and a label indicating whether the news is real or fake.
The dataset also intentionally includes around 5% missing values in some fields to simulate the challenges of handling incomplete data in real-life projects.
- title: A short headline summarizing the article (around 6 words).
- text: The body of the news article (200–300 words on average).
- date: The publication date of the article, randomly selected over the past 3 years.
- source: The media source that published the article (e.g., BBC, CNN, Al Jazeera). May contain missing values (~5%).
- author: The author's full name. Some entries are missing (~5%) to simulate real-world incomplete data.
- category: The general category of the article (e.g., Politics, Health, Sports, Technology).
- label: The target label: real or fake news.
Fake News Detection Practice: Perfect for binary classification tasks.
NLP Preprocessing: Allows users to practice text cleaning, tokenization, vectorization, etc.
Handling Missing Data: Some fields are incomplete to simulate real-world data challenges.
Feature Engineering: Encourages creating new features from text and metadata.
Balanced Labels: Realistic distribution of real and fake news for fair model training.
Building and evaluating text classification models (e.g., Logistic Regression, Random Forests, XGBoost).
Practicing NLP techniques like TF-IDF, Word2Vec, BERT embeddings.
Performing exploratory data analysis (EDA) on news data.
Developing pipelines for dealing with missing values and feature extraction.
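As a hedged sketch of the missing-value step in such a pipeline (the field names match the columns described above; the sentinel value is an arbitrary choice, not something the dataset prescribes):

```python
def fill_missing(records, fields=("source", "author"), default="unknown"):
    # Replace empty or None values in the given fields with a sentinel,
    # a common first step before feature extraction on this kind of data.
    filled = []
    for rec in records:
        rec = dict(rec)  # copy so the input records are left untouched
        for field in fields:
            if rec.get(field) in (None, ""):
                rec[field] = default
        filled.append(rec)
    return filled
```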
This dataset has been synthetically generated to closely resemble real news articles. The diversity in titles, text, sources, and categories ensures that models trained on this dataset can generalize well to unseen, real-world data. However, since it is synthetic, it should not be used for production models or decision-making without careful validation.
Filename: fake_news_dataset.csv
Size: 20,000 rows × 7 columns
Missing Data: ~5% missing values in the source and author columns.
License: CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/
VQA is a multimodal task wherein, given an image and a natural language question about the image, the objective is to produce a correct natural language answer as output.
It involves understanding the content of the image and correlating it with the context of the question asked. Because we need to compare the semantics of information present in both of the modalities — the image and natural language question related to it — VQA entails a wide range of sub-problems in both CV and NLP (such as object detection and recognition, scene classification, counting, and so on). Thus, it is considered an AI-complete task.
License: CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/
The AI-Enhanced English Teaching Resource Dataset is designed for research on Natural Language Processing (NLP) applications in automated English lesson generation. It contains 300 structured entries, combining human-written and AI-generated educational content across various categories such as Grammar, Vocabulary, Reading, Writing, Speaking, and Literature.
Key Features:
- Lesson Text: Descriptive summaries of English lessons.
- Keywords: Important terms extracted for each lesson.
- Lesson Type: Categorization into different teaching domains.
- Difficulty Level: Labels for Beginner, Intermediate, and Advanced levels.
- Target: Binary classification (0 = Human-written, 1 = AI-generated).

Use Cases:
- Training and evaluating NLP models for educational content generation.
- Assessing AI's effectiveness in producing structured and relevant lesson materials.
- Developing adaptive e-learning platforms for personalized teaching.

This dataset serves as a valuable resource for machine learning, NLP, and educational technology research, enabling scalable and automated curriculum design. 🚀
License: MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Given a text and a reason, predict whether the text correctly satisfies the reason. Various approaches can be used to determine the correctness of the text with respect to the reason. Note: this dataset contains only positive samples, so data augmentation techniques should be applied to train a good model. This is an example of an imbalanced causal reasoning dataset in the field of NLP.
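One simple augmentation, sketched here under the assumption that mismatched (text, reason) pairs make valid negatives, is to pair each text with a reason drawn from a different example:

```python
import random

def add_negative_samples(positive_pairs, seed=0):
    # Label each original (text, reason) pair 1; create a negative (label 0)
    # per text by borrowing a reason from a different example.
    rng = random.Random(seed)
    n = len(positive_pairs)
    samples = [(text, reason, 1) for text, reason in positive_pairs]
    for i, (text, _) in enumerate(positive_pairs):
        j = rng.randrange(n - 1)
        if j >= i:
            j += 1  # ensure the borrowed reason comes from another example
        samples.append((text, positive_pairs[j][1], 0))
    return samples
```

Whether a mismatched reason is truly a negative depends on the data; for overlapping reasons a filtering step would be needed.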
License: Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This project aims to develop an NLP model for tasks like sentiment analysis, text classification, or named entity recognition.
For more details, refer to the project guidelines. LinkedIn: https://www.linkedin.com/in/marknature-c/ GitHub: https://github.com/marknature/
License: Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This title is:
Descriptive – tells users exactly what to expect.
Professional – suitable for academic, NLP, and Kaggle usage.
Searchable – includes keywords like "multilingual", "sentences", "languages".
This dataset was created by Mark Baushenko
License: CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/
Train your model, upload your Notebook, and happy learning :)
License: MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains metadata for 4,700+ popular books across various genres, time periods, and authors. Each entry includes information such as the book’s title, author(s), average rating, publication year, language, description, and a link to its cover image.
The dataset is ideal for Natural Language Processing (NLP) projects, recommendation systems, sentiment analysis, text summarization, author-based trend analysis, and other data science or machine learning tasks related to books and literature.
Whether you're building a book recommender, training a language model on literary data, or analyzing rating trends over time, this dataset provides a rich, real-world foundation.