50 datasets found

Events blog
kaggle.com
zip
Updated Jun 11, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sahil Saxena (2020). Events blog [Dataset]. https://www.kaggle.com/sahilsaxenass/events-blog
Explore at:
zip(1448144 bytes)Available download formats
Dataset updated
Jun 11, 2020
Authors
Sahil Saxena
Description
Dataset

This dataset was created by Sahil Saxena

Released under Data files © Original Authors

Contents
NLP EDA
kaggle.com
zip
Updated Feb 19, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Adithya Madhavan (2022). NLP EDA [Dataset]. https://www.kaggle.com/datasets/adithyamadhavan/nlp-eda
Explore at:
zip(1163131 bytes)Available download formats
Dataset updated
Feb 19, 2022
Authors
Adithya Madhavan
Description
Dataset

This dataset was created by Adithya Madhavan

Contents
Results for the baseline.
plos.figshare.com
xls
Updated Sep 26, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda (2024). Results for the baseline. [Dataset]. http://doi.org/10.1371/journal.pone.0310707.t008
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0310707.t008
Dataset updated
Sep 26, 2024
Dataset provided by
PLOShttp://plos.org/
Authors
Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Over the last ten years, social media has become a crucial data source for businesses and researchers, providing a space where people can express their opinions and emotions. To analyze this data and classify emotions and their polarity in texts, natural language processing (NLP) techniques such as emotion analysis (EA) and sentiment analysis (SA) are employed. However, the effectiveness of these tasks using machine learning (ML) and deep learning (DL) methods depends on large labeled datasets, which are scarce in languages like Spanish. To address this challenge, researchers use data augmentation (DA) techniques to artificially expand small datasets. This study aims to investigate whether DA techniques can improve classification results using ML and DL algorithms for sentiment and emotion analysis of Spanish texts. Various text manipulation techniques were applied, including transformations, paraphrasing (back-translation), and text generation using generative adversarial networks, to small datasets such as song lyrics, social media comments, headlines from national newspapers in Chile, and survey responses from higher education students. The findings show that the Convolutional Neural Network (CNN) classifier achieved the most significant improvement, with an 18% increase using the Generative Adversarial Networks for Sentiment Text (SentiGan) on the Aggressiveness (Seriousness) dataset. Additionally, the same classifier model showed an 11% improvement using the Easy Data Augmentation (EDA) on the Gender-Based Violence dataset. The performance of the Bidirectional Encoder Representations from Transformers (BETO) also improved by 10% on the back-translation augmented version of the October 18 dataset, and by 4% on the EDA augmented version of the Teaching survey dataset. These results suggest that data augmentation techniques enhance performance by transforming text and adapting it to the specific characteristics of the dataset. Through experimentation with various augmentation techniques, this research provides valuable insights into the analysis of subjectivity in Spanish texts and offers guidance for selecting algorithms and techniques based on dataset features.
Presidency in the middle of the pandemic
kaggle.com
zip
Updated Dec 17, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kurt F (2020). Presidency in the middle of the pandemic [Dataset]. https://www.kaggle.com/kurtfcelsius/presidency-in-the-middle-of-the-pandemic
Explore at:
zip(261605 bytes)Available download formats
Dataset updated
Dec 17, 2020
Authors
Kurt F
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

Twitter is a good way to measure current reactions. And during the epidemic, Lockdown is frequently the subject of the platform. While almost every country in the world suffers heavy losses in this war, politicians are also exposed to harsh criticism. In this dataset, we would like to examine the comments on Twitter about German chancellor Angela Merkel, who ranks first in the list of the world's most powerful women by Forbes. So we are curious about the results of the Lockdown arguments.

Content

The data was created in December-2020 as 1500 train and 650 test files about German chancellor Angela Merkel. Each tweet in the train data set has been labeled as positive or negative. Those behind the negative tweets were categorized under three headings. These are: - Conspiracy theory - Insult - Political criticism.

Inspiration

Maybe you might below be wondering:

-In which language were the most positive or negative tweets? -What is the structure of the words used according to languages? -What are the reflections of the headings highlighted in negative comments according to languages?

And while answering questions like this, you can find graphical options suitable for your exploratory data analysis.

And a happy ending: You can develop a machine learning model for tweets that are not labeled in test data.
Job Descriptions Dataset
kaggle.com
Updated May 12, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jayakishan Minnekanti (2025). Job Descriptions Dataset [Dataset]. https://www.kaggle.com/datasets/jayakishan225/job-descriptions-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 12, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Jayakishan Minnekanti
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This dataset includes 521 real-world job descriptions for various data analyst roles, compiled solely for educational and research purposes. It was created to support natural language processing (NLP) and skill extraction tasks.

Each row represents a unique job posting with: - Job Title: The role being advertised - Description: The full-text job description

🔍 Use Case:
This dataset was used in the "Job Skill Analyzer" project, which applies NLP and multi-label classification to extract in-demand skills such as Python, SQL, Tableau, Power BI, Excel, and Communication.

🎯 Ideal For: - NLP-based skill extraction - Resume/job description matching - EDA on job market skill trends - Multi-label text classification projects

⚠️ Disclaimer:
- The job descriptions were collected from publicly available postings across multiple job boards.
- No logos, branding, or personally identifiable information is included.
- This dataset is not intended for commercial use.

License: CC BY-NC-SA 4.0
Suitable For: NLP, EDA, Job Market Analysis, Skill Mining, Text Classification
NLP with Disaster Tweets - cleaning data
kaggle.com
zip
Updated Sep 11, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Vitalii Mokin (2021). NLP with Disaster Tweets - cleaning data [Dataset]. https://www.kaggle.com/vbmokin/nlp-with-disaster-tweets-cleaning-data
Explore at:
zip(1053715 bytes)Available download formats
Dataset updated
Sep 11, 2021
Authors
Vitalii Mokin
License
Attribution-NoDerivs 4.0 (CC BY-ND 4.0)https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
Description
Context

The data obtained by clearing the Getting Started Prediction Competition "Real or Not? NLP with Disaster Tweets" data is the result of a public notebook "NLP with Disaster Tweets - EDA and Cleaning data". In the future, I plan to improve cleaning and update the dataset

Content

id - a unique identifier for each tweet text - the text of the tweet location - the location the tweet was sent from (may be blank) keyword - a particular keyword from the tweet (may be blank) target - in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)

Acknowledgements

Thanks to Kaggle team for this Competition "Real or Not? NLP with Disaster Tweets" and its datasets (this dataset was created by the company figure-eight and originally shared on their ‘Data For Everyone’ website here. Tweet source: https://twitter.com/AnyOtherAnnaK/status/629195955506708480).

Thanks to web-site Ambulance services drive, strive to keep you alive for your image, which is very similar to the image of the contest "Real or Not? NLP with Disaster Tweets" and which I used as the image of my dataset

Inspiration

You are predicting whether a given tweet is about a real disaster or not. If so, predict a 1. If not, predict a 0.
GLARE: Google Apps Arabic Reviews Dataset
zenodo.org
data.niaid.nih.gov
pdf, zip
Updated Jul 16, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Fatima AlGhamdi; Reem Mohammed; Hend Al-Khalifa; Areeb Alowisheq; Fatima AlGhamdi; Reem Mohammed; Hend Al-Khalifa; Areeb Alowisheq (2024). GLARE: Google Apps Arabic Reviews Dataset [Dataset]. http://doi.org/10.5281/zenodo.6457824
Explore at:
zip, pdfAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.6457824
Dataset updated
Jul 16, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Fatima AlGhamdi; Reem Mohammed; Hend Al-Khalifa; Areeb Alowisheq; Fatima AlGhamdi; Reem Mohammed; Hend Al-Khalifa; Areeb Alowisheq
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This paper introduces GLARE an Arabic Apps Reviews dataset collected from Saudi Google PlayStore. It consists of 76M reviews, 69M of which are Arabic reviews of 9,980 Android Applications. We present the data collection methodology, along with a detailed Exploratory Data Analysis (EDA) and Feature Engineering on the gathered reviews. We also highlight possible use cases and benefits of the dataset.
Jigsaw Bias Toxicity EDA NLP aug16 alpha0.05
kaggle.com
zip
Updated Apr 14, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Matt Yates (2019). Jigsaw Bias Toxicity EDA NLP aug16 alpha0.05 [Dataset]. https://www.kaggle.com/datasets/yeayates21/jigsaw-bias-toxicity-eda-nlp-aug16-alpha005
Explore at:
zip(1593446024 bytes)Available download formats
Dataset updated
Apr 14, 2019
Authors
Matt Yates
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Text Augmentation on Jigsaw Unintended Bias in Toxicity Classification competition training data using EDA_NLP.

Context

Code from https://github.com/jasonwei20/eda_nlp was run on the training dataset for the Jigsaw Unintended Bias in Toxicity Classification competition to create augmented training dataset. Number of augmentations was set to 16 and alpha value was set to 0.05.

Content

train_augmented1605.zip - augmented training dataset for Jigsaw Unintended Bias in Toxicity Classification competition.

Acknowledgements

Code provided by: https://github.com/jasonwei20/eda_nlp

Code for the paper: Easy data augmentation techniques for boosting performance on text classification tasks. https://arxiv.org/abs/1901.11196

Special thanks to ErvTong \ @papasmurfff for sharing the eda_nlp repo with me. https://www.kaggle.com/papasmurfff

Inspiration

https://mlwhiz.com/blog/2019/02/19/siver_medal_kaggle_learnings/

The above article talks about how the 1st place competitors for the Quora Insincere Question competition stated they:

"We do not pad sequences to the same length based on the whole data, but just on a batch level. That means we conduct padding and truncation on the data generator level for each batch separately, so that length of the sentences in a batch can vary in size. Additionally, we further improved this by not truncating based on the length of the longest sequence in the batch but based on the 95% percentile of lengths within the sequence. This improved runtime heavily and kept accuracy quite robust on single model level, and improved it by being able to average more models."

This got @papasmurfff and I thinking about text augmentation and from there @papasmurfff found the eda_nlp repo.
Research Paper
kaggle.com
zip
Updated Nov 10, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Vijender Singh (2020). Research Paper [Dataset]. https://www.kaggle.com/vijendersingh412/research-paper
Explore at:
zip(56828595 bytes)Available download formats
Dataset updated
Nov 10, 2020
Authors
Vijender Singh
Description
Context

The dataset consists of top research papers in NLP domain with its metadata.xls file containing detailed information.

Content

The dataset contains description of research paper, its domain, its sub domain and link of origin to correct paper. Each research paper starts with unique number followed by underscore and name of research paper. The unique number is is assigned to Sno of metadata sheet.

Acknowledgements

This is just a start of making a dataset for research purpose and using this dataset for recommendation system or solving other problems. You are welcome to contribute in this. And can also share the problem you are solving and I can help without any cost.

Problem Use Case

Collaborating Filtering EDA on NLP research paper Document Classification Creating own Embedding for NLP domain applications

Inspiration

The data is open to the world's largest data science community. Please share your doubts, problems and how we can make this better. ✌️

Open to direct chat @ https://in.linkedin.com/in/vijendersingh412 🤝
f
F1-score results [40].
plos.figshare.com
xls
Updated Sep 26, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda (2024). F1-score results [40]. [Dataset]. http://doi.org/10.1371/journal.pone.0310707.t002
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0310707.t002
Dataset updated
Sep 26, 2024
Dataset provided by
PLOS ONE
Authors
Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Over the last ten years, social media has become a crucial data source for businesses and researchers, providing a space where people can express their opinions and emotions. To analyze this data and classify emotions and their polarity in texts, natural language processing (NLP) techniques such as emotion analysis (EA) and sentiment analysis (SA) are employed. However, the effectiveness of these tasks using machine learning (ML) and deep learning (DL) methods depends on large labeled datasets, which are scarce in languages like Spanish. To address this challenge, researchers use data augmentation (DA) techniques to artificially expand small datasets. This study aims to investigate whether DA techniques can improve classification results using ML and DL algorithms for sentiment and emotion analysis of Spanish texts. Various text manipulation techniques were applied, including transformations, paraphrasing (back-translation), and text generation using generative adversarial networks, to small datasets such as song lyrics, social media comments, headlines from national newspapers in Chile, and survey responses from higher education students. The findings show that the Convolutional Neural Network (CNN) classifier achieved the most significant improvement, with an 18% increase using the Generative Adversarial Networks for Sentiment Text (SentiGan) on the Aggressiveness (Seriousness) dataset. Additionally, the same classifier model showed an 11% improvement using the Easy Data Augmentation (EDA) on the Gender-Based Violence dataset. The performance of the Bidirectional Encoder Representations from Transformers (BETO) also improved by 10% on the back-translation augmented version of the October 18 dataset, and by 4% on the EDA augmented version of the Teaching survey dataset. These results suggest that data augmentation techniques enhance performance by transforming text and adapting it to the specific characteristics of the dataset. Through experimentation with various augmentation techniques, this research provides valuable insights into the analysis of subjectivity in Spanish texts and offers guidance for selecting algorithms and techniques based on dataset features.
Fake-News-dataset
kaggle.com
zip
Updated Feb 27, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gor Abaghyan (2025). Fake-News-dataset [Dataset]. https://www.kaggle.com/datasets/abaghyangor/fake-news-dataset/versions/1
Explore at:
zip(42976116 bytes)Available download formats
Dataset updated
Feb 27, 2025
Authors
Gor Abaghyan
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
This dataset is designed for Natural Language Processing (NLP) and Machine Learning tasks focused on Fake News Detection. It contains a collection of labeled news articles with textual features, allowing data scientists and researchers to build models that can distinguish between real and fake news.

Dataset Features:

Title – The headline of the article Text – The full content of the article Label – 1 for real news, 0 for fake news

Potential Use Cases:

Train Machine Learning models for text classification Experiment with TF-IDF, Word Embeddings, and Deep Learning Conduct Exploratory Data Analysis (EDA) on fake vs. real news patterns Develop real-time misinformation detection tools

Suggested Approaches:

Text preprocessing (stopword removal, tokenization, lemmatization) Feature extraction using TF-IDF or Word2Vec Model training with Logistic Regression, Decision Trees, Random Forest, or Gradient Boosting Deep learning methods such as LSTMs, Transformers (BERT, RoBERTa)
Fake News Detection Dataset
kaggle.com
zip
Updated Apr 27, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mahdi Mashayekhi (2025). Fake News Detection Dataset [Dataset]. https://www.kaggle.com/datasets/mahdimashayekhi/fake-news-detection-dataset
Explore at:
zip(11735585 bytes)Available download formats
Dataset updated
Apr 27, 2025
Authors
Mahdi Mashayekhi
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
📚 Fake News Detection Dataset

Overview

This dataset is designed for practicing fake news detection using machine learning and natural language processing (NLP) techniques. It includes a rich collection of 20,000 news articles, carefully generated to simulate real-world data scenarios. Each record contains metadata about the article and a label indicating whether the news is real or fake.

The dataset also intentionally includes around 5% missing values in some fields to simulate the challenges of handling incomplete data in real-life projects.

Columns Description

title A short headline summarizing the article (around 6 words). text The body of the news article (200–300 words on average). date The publication date of the article, randomly selected over the past 3 years. source The media source that published the article (e.g., BBC, CNN, Al Jazeera). May contain missing values (~5%). author The author's full name. Some entries are missing (~5%) to simulate real-world incomplete data. category The general category of the article (e.g., Politics, Health, Sports, Technology). label The target label: real or fake news.

Why Use This Dataset?

Fake News Detection Practice: Perfect for binary classification tasks.

NLP Preprocessing: Allows users to practice text cleaning, tokenization, vectorization, etc.

Handling Missing Data: Some fields are incomplete to simulate real-world data challenges.

Feature Engineering: Encourages creating new features from text and metadata.

Balanced Labels: Realistic distribution of real and fake news for fair model training.

Potential Use Cases

Building and evaluating text classification models (e.g., Logistic Regression, Random Forests, XGBoost).

Practicing NLP techniques like TF-IDF, Word2Vec, BERT embeddings.

Performing exploratory data analysis (EDA) on news data.

Developing pipelines for dealing with missing values and feature extraction.

A Note on Data Quality

This dataset has been synthetically generated to closely resemble real news articles. The diversity in titles, text, sources, and categories ensures that models trained on this dataset can generalize well to unseen, real-world data. However, since it is synthetic, it should not be used for production models or decision-making without careful validation.

File Info

Filename: fake_news_dataset.csv

Size: 20,000 rows × 7 columns

Missing Data: ~5% missing values in the source and author columns.
Criteria for detailed characterization of the dataset.
plos.figshare.com
xls
Updated Sep 26, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda (2024). Criteria for detailed characterization of the dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0310707.t005
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0310707.t005
Dataset updated
Sep 26, 2024
Dataset provided by
PLOShttp://plos.org/
Authors
Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Criteria for detailed characterization of the dataset.
Harry Potter Reviews
kaggle.com
zip
Updated Jan 12, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Marc Paulo (2024). Harry Potter Reviews [Dataset]. https://www.kaggle.com/datasets/marcpaulo/harry-potter-reviews
Explore at:
zip(13097 bytes)Available download formats
Dataset updated
Jan 12, 2024
Authors
Marc Paulo
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
This dataset contains 491 synthetic reviews of the famous "Harry Potter and the Philosopher's Stone" movie. The reviews were generated using a LLM (Large Language Model).

Harry Potter is a series of seven fantasy novels written by British author J. K. Rowling. The novels chronicle the lives of a young wizard, Harry Potter, and his friends Hermione Granger and Ron Weasley, all of whom are students at Hogwarts School of Witchcraft and Wizardry. The main story arc concerns Harry's conflict with Lord Voldemort, a dark wizard who intends to become immortal, overthrow the wizard governing body known as the Ministry of Magic, and subjugate all wizards and Muggles (non-magical people).

😉**Play with this data!**😉 - Exploratory-Data-Analysis [*EDA*] - NLP Sentiment Analysis [*NLP, Classification*] - Rating prediction using NLP and other features [*NLP, Regression | Classification*] - Favourite Character Prediction [*Multiclass Classification*] - And much more❗ ...

Harry Potter. (2024, January 10). In Wikipedia. https://en.wikipedia.org/wiki/Harry_Potter
"AskReddit" Subreddit Top Comments
kaggle.com
zip
Updated Apr 3, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Amarnath Kuppannan (2022). "AskReddit" Subreddit Top Comments [Dataset]. https://www.kaggle.com/datasets/akuppps/askreddit-subreddit-top-comments
Explore at:
zip(131383 bytes)Available download formats
Dataset updated
Apr 3, 2022
Authors
Amarnath Kuppannan
License
https://www.reddit.com/wiki/apihttps://www.reddit.com/wiki/api
Description
This dataset is about the top rated comments from "AskReddit" in the past month. 1900+ rows. Credit for the help goes to @gpreda. Thank you sir. This dataset can be used for EDA. Great for beginners. NLP techniques can be used to see the data in a different way as well!
Insurance Customer Reviews
kaggle.com
zip
Updated Apr 29, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Emre Kaan Yılmaz (2025). Insurance Customer Reviews [Dataset]. https://www.kaggle.com/datasets/emrekaany/insurance-customer-reviews-crm-action-plans
Explore at:
zip(131788 bytes)Available download formats
Dataset updated
Apr 29, 2025
Authors
Emre Kaan Yılmaz
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This dataset is produced for an insurance company called 'Blue Insurance' and contains simulated customer reviews for various insurance products. It includes feedback from customers with positive, neutral, and negative experiences. Also, suggested CRM actions for those reviews.

Columns

customer_id: Unique identifier for each customer.

review_id: Unique identifier for each review.

review_text: Full insurance review text generated by AI.

rating: Customer rating (1 to 5 stars).

review_date: Date the review was supposedly written.

insurance_type: Insurance category (health, auto, life, property, etc.).

sentiment: Inferred sentiment of the review (positive, neutral, or negative).

Why You’ll Love This Dataset ❤️

NLP-ready: Perfect for sentiment analysis, rating prediction, text classification, and topic modeling.

CRM AI Agent Training: Ideal for training customer service chatbots and prompt-tuning CRM AI agents to handle real-life insurance customer conversations.

Multi-purpose: Great for machine learning pipelines, demo apps, and EDA practice.

No cleaning required: Fully structured and ready to use.

Use Cases

Fine-tune models for sentiment analysis.

Train CRM conversational AI agents on realistic insurance feedback.

Build rating prediction or review summarization models.

Showcase AI demos for insurance tech startups or academic projects.

Practice natural language processing (NLP) and EDA techniques.

Acknowledgments

Big thanks to Google Gemini AI and Faker library for making this synthetic dataset generation possible.

If you find this dataset valuable, please consider upvoting! 🚀
Monkeypox Tweets
kaggle.com
zip
Updated Sep 3, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Raijin (2022). Monkeypox Tweets [Dataset]. https://www.kaggle.com/vencerlanz09/monkeypox-tweets
Explore at:
zip(8093093 bytes)Available download formats
Dataset updated
Sep 3, 2022
Authors
Raijin
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Overview

The CSV data file contains tweets scraped from twitter about Monkeypox. The file contains eight significant columns namely: * date - Date of the tweet * time - Time of tweet * id - Twitter username ID of the person who tweeted about monkeypox * tweet - Text about monkeypox * language - Language used in the tweet * replies_count - Number of replies for the tweet * retweets_count - Number of retweets * likes_count - Number of likes

Similar Dataset

You may also want to check out the Monkeypox Reddit Dataset: https://www.kaggle.com/datasets/vencerlanz09/monkeypox-reddit-topics

Monkeypox Reddit Topics EDA + Sentiment Analysis Notebook: https://www.kaggle.com/code/vencerlanz09/monkeypox-reddit-topics-eda-sentiment-analysis

Inspiration

I'm currently starting to learn about NLP and I'm planning to create an algorithm that could predict whether a certain tweet is about monkey pox or not. Hopefully, I could grasp the concepts quickly and gather an appropriate dataset as my personal project.
BA Freelancer Germany
kaggle.com
zip
Updated Aug 28, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Tobias Hein (2020). BA Freelancer Germany [Dataset]. https://www.kaggle.com/tobiasac/ba-freelancer-germany
Explore at:
zip(1460765 bytes)Available download formats
Dataset updated
Aug 28, 2020
Authors
Tobias Hein
Area covered
Germany
Description
Context

The dataset contains information about freelancer in the Business Analytics (BA) field from germany, see www.freelance.de.

Content

The dataset contains information on specifc entries of BA freelancer, i.e. titel of entry tags, hourly rate for the offered activities, location of the offer, number of references, tag id, offered activities, qualification and personnel description. The entries are provided in German language. There has been practically no data cleaning beforehand.

Acknowledgements

Many thanks to www.freelance.de for providing the data.

Inspiration

In data analytics projects, we always face the challenge to get insights from data which is often rather messy. Here it would be very interesting to develop solutions in terms of e.g. - data cleaning - explorative data analysis (eda) - natural language processing (nlp) - clustering and segmentation - machine learning
Amazon Prime Movies & TV Shows - 2025
kaggle.com
zip
Updated Jun 18, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Vivek Patel (2025). Amazon Prime Movies & TV Shows - 2025 [Dataset]. https://www.kaggle.com/datasets/vivekcoder07/amazon-prime-movies-and-tv-shows-2025
Explore at:
zip(50789 bytes)Available download formats
Dataset updated
Jun 18, 2025
Authors
Vivek Patel
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This dataset contains detailed information on over 1,000 Amazon Prime Movies and TV Shows. It includes genres, IMDb scores, cast, director, release year, and more.

✅ Suitable for:

Recommendation systems Sentiment analysis (NLP on descriptions) Exploratory data analysis IMDb rating prediction

🎯 Use this dataset to build end-to-end ML pipelines, dashboards, or genre prediction models.
MCQ Interview Prep :Data Science( EDA,ML ,DL&NLP)
kaggle.com
zip
Updated Feb 15, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
DILLIP MEHER (2024). MCQ Interview Prep :Data Science( EDA,ML ,DL&NLP) [Dataset]. https://www.kaggle.com/datasets/dillipmeher/mcq-interview-prep-data-science-edaml-dl-and-nlp/data
Explore at:
zip(190915 bytes)Available download formats
Dataset updated
Feb 15, 2024
Authors
DILLIP MEHER
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset

This dataset was created by DILLIP MEHER

Released under Apache 2.0

Contents

Facebook

Twitter

Click to copy link

Link copied

Cite

Sahil Saxena (2020). Events blog [Dataset]. https://www.kaggle.com/sahilsaxenass/events-blog

Events blog

Big job related events to apply NLP, EDA (usage-recommendation system)

Explore at:

zip(1448144 bytes)Available download formats

Dataset updated

Jun 11, 2020

Authors

Sahil Saxena

Description

Dataset

This dataset was created by Sahil Saxena

Clear search

Close search

Google apps

Main menu

Events blog

Dataset

Contents

NLP EDA

Dataset

Contents

Results for the baseline.

Presidency in the middle of the pandemic

Context

Content

Inspiration

Job Descriptions Dataset

NLP with Disaster Tweets - cleaning data

Context

Content

Acknowledgements

Inspiration

GLARE: Google Apps Arabic Reviews Dataset

Jigsaw Bias Toxicity EDA NLP aug16 alpha0.05

Text Augmentation on Jigsaw Unintended Bias in Toxicity Classification competition training data using EDA_NLP.

Context

Content

Acknowledgements

Inspiration

Research Paper

Context

Content

Acknowledgements

Problem Use Case

Inspiration

F1-score results [40].

Fake-News-dataset

Fake News Detection Dataset

📚 Fake News Detection Dataset

Overview

Columns Description

Why Use This Dataset?

Potential Use Cases

A Note on Data Quality

File Info

Criteria for detailed characterization of the dataset.

Harry Potter Reviews

"AskReddit" Subreddit Top Comments

Insurance Customer Reviews

Columns

Why You’ll Love This Dataset ❤️

Use Cases

Acknowledgments

If you find this dataset valuable, please consider upvoting! 🚀

Monkeypox Tweets

Overview

Similar Dataset

Inspiration

BA Freelancer Germany

Context

Content

Acknowledgements

Inspiration

Amazon Prime Movies & TV Shows - 2025

MCQ Interview Prep :Data Science( EDA,ML ,DL&NLP)

Dataset

Contents

Events blog

Big job related events to apply NLP, EDA (usage-recommendation system)

Dataset

Contents