86 datasets found

Data from: Social Media Engagement Dataset
kaggle.com
Updated May 6, 2025
Cite
Subash Shanmugam (2025). Social Media Engagement Dataset [Dataset]. https://www.kaggle.com/datasets/subashmaster0411/social-media-engagement-dataset
Dataset updated
May 6, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Subash Shanmugam
License
CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
Description
This machine-generated dataset simulates social media engagement data across various metrics, including likes, shares, comments, impressions, sentiment scores, toxicity, and engagement growth. It is designed for analysis and visualization of trends, buzz frequency, public sentiment, and user behavior on digital platforms.

The dataset can be used to:

Identify spikes or drops in engagement

Analyze changes in sentiment over time

Build dashboards for digital trend tracking

Test algorithms for sentiment analysis or trend prediction
Twitter Tweets Sentiment Dataset
kaggle.com
Updated Apr 8, 2022
Cite
M Yasser H (2022). Twitter Tweets Sentiment Dataset [Dataset]. https://www.kaggle.com/datasets/yasserh/twitter-tweets-sentiment-dataset
Dataset updated
Apr 8, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
M Yasser H
License
CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
Description
Description:

Twitter is an online social media platform where people share their thoughts as tweets. It is observed that some people misuse it to tweet hateful content. Twitter is trying to tackle this problem, and we shall help by creating a strong NLP-based classifier model to distinguish negative tweets and block them. Can you build a strong classifier model to predict the same?

Each row contains the text of a tweet and a sentiment label. In the training set you are provided with a word or phrase drawn from the tweet (selected_text) that encapsulates the provided sentiment.

Make sure, when parsing the CSV, to remove the beginning / ending quotes from the text field, to ensure that you don't include them in your training.
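
A minimal sketch of the quote handling described above: Python's standard `csv` module removes the enclosing quotes from a quoted field while parsing, so they do not end up in the training text. The two sample rows are invented for illustration; only the column names `textID`, `text`, and `sentiment` come from the description.

```python
import csv
import io

# Hypothetical two-row sample standing in for the real CSV file.
SAMPLE = (
    'textID,text,sentiment\n'
    'a1,"I love this, really",positive\n'
    'b2,"so ""bored"" today",negative\n'
)

def read_tweets(handle):
    """Return rows as dicts; csv.DictReader strips the enclosing quotes."""
    return list(csv.DictReader(handle))

rows = read_tweets(io.StringIO(SAMPLE))
for row in rows:
    print(row["textID"], repr(row["text"]), row["sentiment"])
```

For a file on disk, the same function can be called with `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV was saved.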

You're attempting to predict the word or phrase from the tweet that exemplifies the provided sentiment. The word or phrase should include all characters within that span (i.e. including commas, spaces, etc.)

Columns:

textID - unique ID for each piece of text

text - the text of the tweet

sentiment - the general sentiment of the tweet

Acknowledgement:

The dataset was downloaded from Kaggle Competitions:
https://www.kaggle.com/c/tweet-sentiment-extraction/data?select=train.csv

Objective:

Understand the Dataset & cleanup (if required).

Build classification models to predict the Twitter sentiments.

Compare the evaluation metrics of various classification algorithms.

Social Media Posts in Arabic Dialect

kaggle.com

Updated Jul 11, 2024

Cite

UM6P Open Data (2024). Social Media Posts in Arabic Dialect [Dataset]. https://www.kaggle.com/datasets/um6popendata/sentiment-analysis-for-sm-posts-in-arabic-dialect

Dataset updated

Jul 11, 2024

Dataset provided by

Kaggle

Authors

UM6P Open Data

License

MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically

Description

Dataset: Sentiment Analysis for Social Media Posts in Arabic Dialect

Overview

This dataset contains a labeled collection of approximately 50,000 social media posts in various Arabic dialects. Each post has been manually annotated with sentiment labels, providing a rich resource for natural language processing and sentiment analysis research.

Dataset Owner

UM6P College of Computing

Content

Posts: The dataset includes raw text data of social media posts written in different Arabic dialects.
Sentiment Labels: Each post is labeled with one of the following sentiment categories:
- Positive
- Negative
- Neutral

Features

Post ID: A unique identifier for each social media post.
Text: The content of the social media post in Arabic.
Sentiment: The sentiment label assigned to the post (Positive, Negative, Neutral).

Format

The dataset is provided in a CSV format with the following columns:
- Post_ID: Integer
- Text: String
- Sentiment: String (Positive, Negative, Neutral)

Usage

This dataset is ideal for tasks such as:
- Training sentiment analysis models
- Studying sentiment trends in Arabic social media
- Exploring the linguistic characteristics of Arabic dialects
- Benchmarking sentiment analysis tools

Example Data

Post_ID	Text	Sentiment
1	"هذا المنتج رائع جدًا وأحببته كثيرًا"	Positive
2	"لم يعجبني هذا الفيلم، كان مملًا جدًا"	Negative
3	"الطقس اليوم عادي، لا يوجد شيء مميز"	Neutral

Licensing

Please refer to the dataset license included in the dataset files for information on usage rights and restrictions.

Citation

An open access NLP dataset for Arabic dialects: data collection, labeling, and model construction, Elmehdi Boujou, Hamza Chataoui, Abdellah El Mekki, Saad Benjelloun, Ikram Chairi and Ismail Berrada, MENACIS 2020 conference, in press.

Social Media Sentiment Analysis
kaggle.com
Updated Dec 15, 2024
Cite
Bisma Sajjad (2024). Social Media Sentiment Analysis [Dataset]. https://www.kaggle.com/datasets/bismasajjad/social-media-sentiment-analysis/discussion
Dataset updated
Dec 15, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Bisma Sajjad
License
CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
Description
A dataset of social media posts (tweets, Facebook posts, etc.) along with sentiment scores (positive, negative, neutral). The data covers a variety of topics such as politics, entertainment, and health. Columns: Post ID, Date, Platform, Topic (e.g., Politics, Entertainment), Sentiment Score (1 = Positive, -1 = Negative, 0 = Neutral), Text Content.
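
The score encoding above can be decoded with a small lookup. This is a minimal sketch, assuming the scores arrive as integers or integer-like strings; the function name `label_for` is hypothetical and not part of the dataset.

```python
# Mapping taken from the description: 1 = Positive, -1 = Negative, 0 = Neutral.
SCORE_TO_LABEL = {1: "Positive", -1: "Negative", 0: "Neutral"}

def label_for(score):
    """Translate a sentiment score (int or integer-like string) into its label."""
    try:
        # str() first so that a float such as 0.5 is rejected, not truncated to 0.
        key = int(str(score).strip())
        return SCORE_TO_LABEL[key]
    except (KeyError, ValueError):
        raise ValueError(f"unexpected sentiment score: {score!r}") from None

print([label_for(s) for s in ("1", "-1", "0")])
```

Rejecting unknown values early makes it easier to spot rows where the column was parsed with the wrong type.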
myanmar-social-media-sentiment-analysis-dataset
huggingface.co
Cite
Chuu Htet Naing, myanmar-social-media-sentiment-analysis-dataset [Dataset]. https://huggingface.co/datasets/chuuhtetnaing/myanmar-social-media-sentiment-analysis-dataset
Authors
Chuu Htet Naing
Area covered
မြန်မာ
Description
Myanmar Social Media Sentiment Analysis Dataset

A Myanmar language dataset for sentiment analysis of social media content, translated from an English source dataset.

Dataset Description

This dataset contains social media text with sentiment annotations translated into Myanmar language. It is derived from the original Social Media Sentiments Analysis Dataset on Kaggle, with texts professionally translated to Myanmar language while preserving the sentiment labels.… See the full description on the dataset page: https://huggingface.co/datasets/chuuhtetnaing/myanmar-social-media-sentiment-analysis-dataset.
SMILE Twitter Emotion Dataset
kaggle.com
figshare.com
Updated Jul 13, 2020
Cite
Ashwani Rathee (2020). SMILE Twitter Emotion Dataset [Dataset]. https://www.kaggle.com/datasets/ashkhagan/smile-twitter-emotion-dataset/versions/1
Dataset updated
Jul 13, 2020
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Ashwani Rathee
License
Attribution 3.0 (CC BY 3.0) (https://creativecommons.org/licenses/by/3.0/)
License information was derived automatically
Description
SMILE Twitter Emotion dataset

This dataset is collected and annotated for the SMILE project http://www.culturesmile.org. This collection of tweets mentioning 13 Twitter handles associated with British museums was gathered between May 2013 and June 2015. It was created for the purpose of classifying emotions, expressed on Twitter towards arts and cultural experiences in museums.

It contains 3,085 tweets labeled with 5 emotions, namely anger, disgust, happiness, surprise, and sadness. Please see our paper "SMILE: Twitter Emotion Classification using Domain Adaptation" for more details of the dataset.

License: The annotations are provided under a CC-BY license, while Twitter retains the ownership and rights of the content of the tweets.
SENTIMENT ANALYSIS OF SOCIAL MEDIA PLATFORMS
kaggle.com
Updated Sep 14, 2023
Cite
Jigyashu Singh Lodhi (2023). SENTIMENT ANALYSIS OF SOCIAL MEDIA PLATFORMS [Dataset]. http://doi.org/10.34740/kaggle/dsv/6473513
Unique identifier
https://doi.org/10.34740/kaggle/dsv/6473513
Dataset updated
Sep 14, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Jigyashu Singh Lodhi
Description
Dataset

This dataset was created by Jigyashu Singh Lodhi

Released under Other (specified in description)

Contents
Twitter dataset
figshare.com
csv
Updated Feb 11, 2025
Cite
Shreyas Poojary; Mohammed Riza; Rashmi Laxmikant Malghan (2025). Twitter dataset [Dataset]. http://doi.org/10.6084/m9.figshare.28390334.v2
Available download formats: csv
Unique identifier
https://doi.org/10.6084/m9.figshare.28390334.v2
Dataset updated
Feb 11, 2025
Dataset provided by
figshare
Authors
Shreyas Poojary; Mohammed Riza; Rashmi Laxmikant Malghan
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This dataset contains tweets labeled for sentiment analysis, categorized into Positive, Negative, and Neutral sentiments. The dataset includes tweet IDs, user metadata, sentiment labels, and tweet text, making it suitable for Natural Language Processing (NLP), machine learning, and AI-based sentiment classification research. Originally sourced from Kaggle, this dataset is curated for improved usability in social media sentiment analysis.
A Sentiment Analysis Dataset for Code-Mixed Malayalam-English
live.european-language-grid.eu
zenodo.org
tsv
Updated Dec 13, 2021
Cite
(2021). A Sentiment Analysis Dataset for Code-Mixed Malayalam-English [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7634
Available download formats: tsv
Dataset updated
Dec 13, 2021
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
There is an increasing demand for sentiment analysis of text from social media, which is mostly code-mixed. Systems trained on monolingual data fail for code-mixed data due to the complexity of mixing at different levels of the text. However, very few resources are available for code-mixed data to create models specific to this data. Although much research in multilingual and cross-lingual sentiment analysis has used semi-supervised or unsupervised methods, supervised methods still perform better. Only a few datasets for popular languages such as English-Spanish, English-Hindi, and English-Chinese are available. There are no resources available for Malayalam-English code-mixed data. This paper presents a new gold standard corpus for sentiment analysis of code-mixed text in Malayalam-English annotated by voluntary annotators. This gold standard corpus obtained a Krippendorff’s alpha above 0.8 for the dataset. We use this new corpus to provide the benchmark for sentiment analysis in Malayalam-English code-mixed texts.
Tweet Sentiment's Impact on Stock Returns
kaggle.com
Updated Jan 16, 2023
Cite
The Devastator (2023). Tweet Sentiment's Impact on Stock Returns [Dataset]. https://www.kaggle.com/datasets/thedevastator/tweet-sentiment-s-impact-on-stock-returns
Dataset updated
Jan 16, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
The Devastator
License
CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
Description
Tweet Sentiment's Impact on Stock Returns

862,231 Labeled Instances

By [source]

About this dataset

This dataset contains 862,231 labeled tweets and associated stock returns, providing a comprehensive look into the impact of social media on company-level stock market performance. For each tweet, researchers have extracted data such as the date of the tweet and its associated stock symbol, along with metrics such as last price and various returns (1-day, 2-day, 3-day, and 7-day). Also recorded are volatility scores for both 10-day and 30-day intervals. Finally, sentiment scores from both Long Short-Term Memory (LSTM) and TextBlob models have been included to quantify the overall tone of each message. With this dataset you can explore how tweets affect a company's share price in both the short term and the long term.

How to use the dataset

To use this dataset, apply descriptive statistics (for example, histograms) or regression techniques to relate tweet content and sentiment to the corresponding stock returns, such as the 1-day and 7-day returns.

The primary fields used for analysis include the tweet text (TWEET), stock symbol (STOCK), date (DATE), and closing price at the time of the tweet (LAST_PRICE), along with two volatility measures for each stock, 10-day volatility (VOLATILITY_10D) and 30-day volatility (VOLATILITY_30D), which capture market fluctuation over different periods around the time of the tweet. Sentiment polarity scores from two models, LSTM polarity (LSTM_POLARITY) and TextBlob polarity, indicate whether people are expressing positive or negative sentiment about each company at a given time, which could in turn influence stock prices over shorter horizons such as 1-day returns (1_DAY_RETURN) and 2-day returns (2_DAY_RETURN), or over a longer horizon such as 7-day returns (7DAY RETURNS). Finally, the MENTION field indicates whether names or acronyms associated with a company were specifically mentioned in each tweet, which gives extra insight into whether company-specific context was present ("company relevancy").
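
One simple way to relate a polarity column to a return column is a Pearson correlation. The following is only a sketch: the four rows are invented toy values, not data from the file, and the helper `pearson` is a hypothetical name. The column names `LSTM_POLARITY` and `1_DAY_RETURN` are the ones listed in the description.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Toy rows (invented for illustration) keyed by column names from the description.
rows = [
    {"LSTM_POLARITY": -1.0, "1_DAY_RETURN": -0.020},
    {"LSTM_POLARITY": 0.0, "1_DAY_RETURN": 0.001},
    {"LSTM_POLARITY": 1.0, "1_DAY_RETURN": 0.015},
    {"LSTM_POLARITY": 1.0, "1_DAY_RETURN": -0.004},
]
r = pearson([row["LSTM_POLARITY"] for row in rows],
            [row["1_DAY_RETURN"] for row in rows])
print(f"Pearson r = {r:.3f}")
```

A correlation computed this way says nothing about causation, so it is best treated as a first exploratory check before any regression.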

Research Ideas

Analyzing the degree to which tweets can influence stock prices. By analyzing relationships between variables such as tweet sentiment and stock returns, correlations can be identified that could be used to inform investment decisions.

Exploring natural language processing (NLP) models for predicting future market trends based on textual data such as tweets. Through testing and evaluating different text-based models using this dataset, better predictive models may emerge that can give investors advance warning of upcoming market shifts due to news or other events.

Investigating the impact of different types of tweets (positive/negative, factual/opinionated) on stock prices over specific time frames. By studying correlations between the sentiment or nature of a tweet and its effect on stocks, insights may be gained into what sort of news or events have a greater impact on markets in general.

Acknowledgements

If you use this dataset in your research, please credit the original authors. Data Source

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: reduced_dataset-release.csv

| Column name | Description |
|:------------|:------------|
| TWEET | Text of the tweet. (String) |
| STOCK | Company's stock mentioned in the tweet. (String) |
| DATE | Date the tweet was posted. (Date) |
| LAST_PRICE | Company's last price at the time of tweeting. (Float) |
...
The Climate Change Twitter Dataset
data.mendeley.com
kaggle.com
Updated May 19, 2022
Cite
Dimitrios Effrosynidis (2022). The Climate Change Twitter Dataset [Dataset]. http://doi.org/10.17632/mw8yd7z9wc.2
Unique identifier
https://doi.org/10.17632/mw8yd7z9wc.2
Dataset updated
May 19, 2022
Authors
Dimitrios Effrosynidis
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
If you use the dataset, cite the paper: https://doi.org/10.1016/j.eswa.2022.117541

The most comprehensive dataset to date regarding climate change and human opinions via Twitter. It has the heftiest temporal coverage, spanning over 13 years, includes over 15 million tweets spatially distributed across the world, and provides the geolocation of most tweets. Seven dimensions of information are tied to each tweet, namely geolocation, user gender, climate change stance and sentiment, aggressiveness, deviations from historic temperature, and topic modeling, while accompanied by environmental disaster events information. These dimensions were produced by testing and evaluating a plethora of state-of-the-art machine learning algorithms and methods, both supervised and unsupervised, including BERT, RNN, LSTM, CNN, SVM, Naive Bayes, VADER, Textblob, Flair, and LDA.

The following columns are in the dataset:

➡ created_at: The timestamp of the tweet.
➡ id: The unique id of the tweet.
➡ lng: The longitude the tweet was written.
➡ lat: The latitude the tweet was written.
➡ topic: Categorization of the tweet in one of ten topics namely, seriousness of gas emissions, importance of human intervention, global stance, significance of pollution awareness events, weather extremes, impact of resource overconsumption, Donald Trump versus science, ideological positions on global warming, politics, and undefined.
➡ sentiment: A score on a continuous scale. This scale ranges from -1 to 1 with values closer to 1 being translated to positive sentiment, values closer to -1 representing a negative sentiment while values close to 0 depicting no sentiment or being neutral.
➡ stance: That is if the tweet supports the belief of man-made climate change (believer), if the tweet does not believe in man-made climate change (denier), and if the tweet neither supports nor refuses the belief of man-made climate change (neutral).
➡ gender: Whether the user that made the tweet is male, female, or undefined.
➡ temperature_avg: The temperature deviation in Celsius and relative to the January 1951-December 1980 average at the time and place the tweet was written.
➡ aggressiveness: That is if the tweet contains aggressive language or not.
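
For coarse analysis, the continuous `sentiment` score can be bucketed into three classes. This is a sketch only: the description says values close to 0 are neutral but gives no cutoff, so the `neutral_band` of 0.05 and the function name `bucket_sentiment` are assumptions made here for illustration.

```python
def bucket_sentiment(score, neutral_band=0.05):
    """Map a score in [-1, 1] to a coarse label.

    neutral_band is an assumed cutoff, not a value given by the dataset authors.
    """
    if not -1.0 <= score <= 1.0:
        raise ValueError(f"score outside [-1, 1]: {score}")
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    return "neutral"

for s in (-0.8, 0.02, 0.6):
    print(s, bucket_sentiment(s))
```

Changing `neutral_band` shifts how many tweets fall into the neutral class, so any reported class balance should state the cutoff used.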

Since Twitter forbids making public the text of the tweets, in order to retrieve it you need to do a process called hydrating. Tools such as Twarc or Hydrator can be used to hydrate tweets.
S1 File -
plos.figshare.com
csv
Updated Feb 6, 2025
Cite
Ahmed Akib Jawad Karim; Kazi Hafiz Md. Asad; Md. Golam Rabiul Alam (2025). S1 File - [Dataset]. http://doi.org/10.1371/journal.pone.0315829.s001
Available download formats: csv
Unique identifier
https://doi.org/10.1371/journal.pone.0315829.s001
Dataset updated
Feb 6, 2025
Dataset provided by
PLOS ONE
Authors
Ahmed Akib Jawad Karim; Kazi Hafiz Md. Asad; Md. Golam Rabiul Alam
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This work focuses on the efficiency of the knowledge distillation approach in generating a lightweight yet powerful BERT-based model for natural language processing (NLP) applications. After creating the model, we applied the resulting model, LastBERT, to a real-world task: classifying severity levels of Attention Deficit Hyperactivity Disorder (ADHD)-related concerns from social media text data. With LastBERT, a customized student BERT model, we significantly lowered the number of model parameters from 110 million (BERT base) to 29 million, resulting in a model approximately 73.64% smaller. On the General Language Understanding Evaluation (GLUE) benchmark, comprising paraphrase identification, sentiment analysis, and text classification, the student model maintained strong performance across many tasks despite this reduction. The model was also used on a real-world ADHD dataset with an accuracy of 85%, F1 score of 85%, precision of 85%, and recall of 85%. When compared to DistilBERT (66 million parameters) and ClinicalBERT (110 million parameters), LastBERT demonstrated comparable performance, with DistilBERT slightly outperforming it at 87%, and ClinicalBERT achieving 86% across the same metrics. These findings highlight the LastBERT model’s capacity to classify degrees of ADHD severity properly, so it offers a useful tool for mental health professionals to assess and comprehend material produced by users on social networking platforms. The study emphasizes the potential of knowledge distillation to produce effective models fit for use in resource-limited conditions, hence advancing NLP and mental health diagnosis. The considerable decrease in model size without appreciable performance loss also lowers the computational resources needed for training and deployment, facilitating greater applicability, especially with readily available computational tools like Google Colab and Kaggle Notebooks. This study shows the accessibility and usefulness of advanced NLP methods in real-world applications.
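
The stated size reduction can be checked directly from the two parameter counts in the description. A minimal sketch, using only those figures (the variable names are illustrative):

```python
# Parameter counts as stated in the description, in millions.
TEACHER_PARAMS_M = 110  # BERT base
STUDENT_PARAMS_M = 29   # LastBERT

reduction_pct = (TEACHER_PARAMS_M - STUDENT_PARAMS_M) / TEACHER_PARAMS_M * 100
print(f"LastBERT is about {reduction_pct:.2f}% smaller than BERT base")
```

The result, 81/110 of the original size removed, agrees with the approximately 73.64% quoted above.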
Vietnamese Social Media Emotion Corpus
kaggle.com
Updated Dec 29, 2022
Cite
Minh Thanh (2022). Vietnamese Social Media Emotion Corpus [Dataset]. https://www.kaggle.com/datasets/hmthanh/vsmec
Dataset updated
Dec 29, 2022
Dataset provided by
Kaggle
Authors
Minh Thanh
Area covered
Vietnam
Description
Emotion recognition is a higher-level approach to, or special case of, sentiment analysis. In this task, the result is not expressed as polarity (positive or negative) or as a rating (from 1 to 5), but at a more detailed level of sentiment analysis in which the results are expressed as emotions such as sadness, enjoyment, anger, disgust, fear, and surprise. Emotion recognition plays a critical role in measuring the brand value of a product by recognizing specific emotions in customers’ comments. In this study, we have achieved two targets. First and foremost, we built a standard Vietnamese Social Media Emotion Corpus (UIT-VSMEC) with about 6,927 human-annotated sentences with six emotion labels, contributing to emotion recognition research in Vietnamese, which is a low-resource language in Natural Language Processing (NLP). Secondly, we assessed and measured machine learning and deep neural network models on our UIT-VSMEC. As a result, the Convolutional Neural Network (CNN) model achieved the highest performance, with an F1-score of 57.61%.

Paper: Vong Ho, Duong Nguyen, Danh Nguyen, Linh Pham, Kiet Nguyen and Ngan Nguyen, Emotion Recognition for Vietnamese Social Media Text, 2019 16th International Conference of the Pacific Association for Computational Linguistics (PACLING 2019), October 11-13, 2019, Ha Noi, Vietnam. Link.

https://sites.google.com/uit.edu.vn/uit-nlp/datasets-projects
‘Flat Earth on Twitter’ analyzed by Analyst-2
analyst-2.ai
Updated Jan 28, 2022
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Flat Earth on Twitter’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-flat-earth-on-twitter-6f4f/f73edd7f/?iid=004-389&v=presentation
Dataset updated
Jan 28, 2022
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Area covered
Earth
Description
Analysis of ‘Flat Earth on Twitter’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/gpreda/flat-earth-on-twitter on 28 January 2022.

--- Dataset description provided by original source is as follows ---

Context

One of the most successful conspiracy theories is the Flat Earth theory. It has millions of followers, from Inspired Scientists who created the fake science behind the theory, to Flat Brain influencers who evangelize this new religion, to True Believers who are just there to press the like button. The Flat Earth theory is everywhere on social media. Here we collect tweets about this conspiracy theory.

Collection

Tweets using #FlatEarth hashtag are collected.

Collected using tweepy.

The data is not filtered.

Inspiration

Use the texts in this dataset to:

train to do text data analysis;

perform sentiment analysis;

perform topic modelling on the text corpus.

--- Original source retains full ownership of the source dataset ---
Navigating News Narratives: A Media Bias Analysis Dataset
figshare.com
txt
Updated Dec 8, 2023
Cite
Shaina Raza (2023). Navigating News Narratives: A Media Bias Analysis Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.24422122.v4
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.24422122.v4
Dataset updated
Dec 8, 2023
Dataset provided by
figshare
Authors
Shaina Raza
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
The prevalence of bias in the news media has become a critical issue, affecting public perception on a range of important topics such as political views, health, insurance, resource distributions, religion, race, age, gender, occupation, and climate change. The media has a moral responsibility to ensure accurate information dissemination and to increase awareness about important issues and the potential risks associated with them. This highlights the need for a solution that can help mitigate the spread of false or misleading information and restore public trust in the media.

Data description: This is a dataset for news media bias covering different dimensions of bias: political, hate speech, toxicity, sexism, ageism, gender identity, gender discrimination, race/ethnicity, climate change, occupation, and spirituality, which makes it a unique contribution. The dataset used for this project does not contain any personally identifiable information (PII).

The data structure is tabulated as follows:

Text: The main content.
Dimension: Descriptive category of the text.
Biased_Words: A compilation of words regarded as biased.
Aspect: Specific sub-topic within the main content.
Label: Indicates the presence (True) or absence (False) of bias. The label is ternary: highly biased, slightly biased, and neutral.
Toxicity: Indicates the presence (True) or absence (False) of bias.
Identity_mention: Mention of any identity based on words match.

Annotation Scheme

The labels and annotations in the dataset are generated through a system of Active Learning, cycling through:

Manual Labeling
Semi-Supervised Learning
Human Verification

The scheme comprises:

Bias Label: Specifies the degree of bias (e.g., no bias, mild, or strong).
Words/Phrases Level Biases: Pinpoints specific biased terms or phrases.
Subjective Bias (Aspect): Highlights biases pertinent to content dimensions.

Due to the nuances of semantic match algorithms, certain labels such as 'identity' and 'aspect' may appear distinctively different.

List of datasets used: We curated different news categories, such as climate crisis news summaries, occupational, and spiritual/faith/general, using RSS to capture different dimensions of news media bias. The annotation is performed using active learning to label each sentence (neutral, slightly biased, or highly biased) and to pick biased words from the news.

We also utilize publicly available data from the following links. Our attribution to others:

MBIC (media bias): Spinde, Timo, Lada Rudnitckaia, Kanishka Sinha, Felix Hamborg, Bela Gipp, and Karsten Donnay. "MBIC--A Media Bias Annotation Dataset Including Annotator Characteristics." arXiv preprint arXiv:2105.11910 (2021). https://zenodo.org/records/4474336
Hyperpartisan news: Kiesel, Johannes, Maria Mestre, Rishabh Shukla, Emmanuel Vincent, Payam Adineh, David Corney, Benno Stein, and Martin Potthast. "Semeval-2019 task 4: Hyperpartisan news detection." In Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 829-839. 2019. https://huggingface.co/datasets/hyperpartisan_news_detection
Toxic comment classification: Adams, C.J., Jeffrey Sorensen, Julia Elliott, Lucas Dixon, Mark McDonald, Nithum, and Will Cukierski. 2017. "Toxic Comment Classification Challenge." Kaggle. https://kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge.
Jigsaw Unintended Bias: Adams, C.J., Daniel Borkan, Inversion, Jeffrey Sorensen, Lucas Dixon, Lucy Vasserman, and Nithum. 2019. "Jigsaw Unintended Bias in Toxicity Classification." Kaggle. https://kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification.
Age Bias: Díaz, Mark, Isaac Johnson, Amanda Lazar, Anne Marie Piper, and Darren Gergle. "Addressing age-related bias in sentiment analysis." In Proceedings of the 2018 chi conference on human factors in computing systems, pp. 1-14. 2018. Age Bias Training and Testing Data - Age Bias and Sentiment Analysis Dataverse (harvard.edu)
Multi-dimensional news Ukraine: Färber, Michael, Victoria Burkard, Adam Jatowt, and Sora Lim. "A multidimensional dataset based on crowdsourcing for analyzing and detecting news bias." In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 3007-3014. 2020. https://zenodo.org/records/3885351#.ZF0KoxHMLtV
Social biases: Sap, Maarten, Saadia Gabriel, Lianhui Qin, Dan Jurafsky, Noah A. Smith, and Yejin Choi. "Social bias frames: Reasoning about social and power implications of language." arXiv preprint arXiv:1911.03891 (2019). https://maartensap.com/social-bias-frames/

Goal of this dataset: We want to offer open and free access to the dataset, ensuring a wide reach to researchers and AI practitioners across the world. The dataset should be user-friendly, and uploading and accessing data should be straightforward, to facilitate usage.

If you use this dataset, please cite us.

Navigating News Narratives: A Media Bias Analysis Dataset © 2023 by Shaina Raza, Vector Institute is licensed under CC BY-NC 4.0
Analytics
ieee-dataport.org
Updated Jun 17, 2025
Yuriy Syerov (2025). Analytics [Dataset]. https://ieee-dataport.org/documents/social-media-big-dataset-research-analytics-prediction-and-understanding-global-climate
Dataset updated
Jun 17, 2025
Authors
Yuriy Syerov
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
trends
Twitter Sentiment Analysis - 1M data
kaggle.com
Updated Mar 30, 2023
Amirhossein Ahmadnejad (2023). Twitter Sentiment Analysis - 1M data [Dataset]. https://www.kaggle.com/datasets/amirhoseinahmadnejad/twitter-sentiment-analysis-1m-data
Dataset updated
Mar 30, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Amirhossein Ahmadnejad
Description
This dataset is a combination of more than six different datasets found on Kaggle. The labels are 0 and 1, meaning negative and positive tweets, respectively. In the cleaned dataset, I deleted mentions. You can do any preprocessing you want on the dataset. I will appreciate any notebooks submitted on this dataset to help others with sentiment analysis tasks. I will submit mine as well.
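The mention removal and 0/1 labeling described above can be illustrated with a minimal sketch. The rows, the field names `text` and `label`, and the helper `strip_mentions` are assumptions made for illustration only; the description does not state the actual file layout or how the author removed mentions.

```python
import re

# Hypothetical sample rows: the real column names are not given in the description.
rows = [
    {"text": "@alice I love this phone!", "label": 1},
    {"text": "worst service ever @support_team", "label": 0},
]

# A mention is assumed to be "@" followed by word characters.
MENTION = re.compile(r"@\w+")

def strip_mentions(text: str) -> str:
    """Remove @mentions and collapse any leftover whitespace."""
    return " ".join(MENTION.sub("", text).split())

# Labels follow the description: 0 = negative, 1 = positive.
LABEL_NAMES = {0: "negative", 1: "positive"}

cleaned = [{"text": strip_mentions(r["text"]), "label": r["label"]} for r in rows]

for r in cleaned:
    print(LABEL_NAMES[r["label"]], "|", r["text"])
```

Further preprocessing (lowercasing, URL removal, tokenization) could be chained in the same way, depending on the model being trained.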
sentiment analysis
kaggle.com
Updated Sep 10, 2020
Sambita (2020). sentiment analysis [Dataset]. https://www.kaggle.com/datasets/samch08/sentiment-analysis
Dataset updated
Sep 10, 2020
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Sambita
Description
Dataset

This dataset was created by Sambita

Contents
Weighted average comparison of LastBERT, DistilBERT, and ClinicalBERT on...
plos.figshare.com
xls
Updated Feb 6, 2025
Ahmed Akib Jawad Karim; Kazi Hafiz Md. Asad; Md. Golam Rabiul Alam (2025). Weighted average comparison of LastBERT, DistilBERT, and ClinicalBERT on ADHD dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0315829.t005
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0315829.t005
Dataset updated
Feb 6, 2025
Dataset provided by
PLOS ONE
Authors
Ahmed Akib Jawad Karim; Kazi Hafiz Md. Asad; Md. Golam Rabiul Alam
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Weighted average comparison of LastBERT, DistilBERT, and ClinicalBERT on ADHD dataset.
Raw twitter data for sentiment analysis
kaggle.com
Updated Jul 23, 2020
Swagata Datta (2020). Raw twitter data for sentiment analysis [Dataset]. https://www.kaggle.com/swagatadatta/twitter-raw-tweets-data/notebooks
Dataset updated
Jul 23, 2020
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Swagata Datta
Description
Dataset

This dataset was created by Swagata Datta

Contents

Data from: Social Media Engagement Dataset

Synthetic social media activity data for trend and sentiment analysis
