100+ datasets found

Reddit Dataset With Sentiment Analysis
kaggle.com
zip
Updated Jun 5, 2025
Cite
Vijay J0shi (2025). Reddit Dataset With Sentiment Analysis [Dataset]. https://www.kaggle.com/datasets/vijayj0shi/reddit-dataset-with-sentiment-analysis
Explore at:
Available download formats: zip (4119981 bytes)
Dataset updated
Jun 5, 2025
Authors
Vijay J0shi
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This dataset contains raw data from the Reddit subreddit r/unpopularopinion, collected on June 5, 2025. It includes 100 recent posts, all comments (including sub-comments) on those posts, user details for authors involved in the discussion, and additional posts by those users. Sentiment analysis has been performed on the comments and additional user posts, providing sentiment labels, confidence scores, and derived sentiment scores.

Dataset Contents

users.csv: Contains details of Reddit users involved in the discussion (post authors and commenters).

Username: Reddit username.

Karma: Total karma (Link_Karma + Comment_Karma).

Link_Karma: Karma from posts.

Comment_Karma: Karma from comments.

Account_Created: Timestamp of account creation.

user_posts.csv: Contains additional posts by all unique users involved in the discussion, with sentiment analysis.

Username: Post author’s username.

Post_ID: Unique post identifier.

Title: Post title.

Subreddit: Subreddit where the post was made.

Score: Upvote/downvote score.

URL: Post URL.

Sentiment: Sentiment label (e.g., positive, negative, neutral).

Confidence: Confidence score of the sentiment prediction.

Sentiment_Score: Numerical sentiment score derived from sentiment analysis.

posts_df.csv: Contains the initial 100 posts fetched from r/unpopularopinion.

Title: Post title.

Score: Upvote/downvote score.

Post_ID: Unique post identifier.

URL: Post URL.

Num_Comments: Number of comments on the post.

Created: Timestamp of post creation.

Text: Post body text.

Author: Post author’s username.

comments.csv: Contains all comments and sub-comments on the 100 posts, with sentiment analysis.

Post_ID: ID of the post the comment belongs to.

Post_Title: Title of the post.

Comment_ID: Unique comment identifier.

Parent_ID: ID of the parent (post or comment), or None for top-level comments.

Body: Comment text.

Author: Comment author’s username.

Score: Upvote/downvote score.

Level: 0 for top-level comments, 1 for sub-comments.

Sentiment: Sentiment label.

Confidence: Confidence score of the sentiment prediction.

Sentiment_Score: Numerical sentiment score (inferred column).
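
As a rough illustration of how the Parent_ID and Level columns relate, the sketch below rebuilds reply threads from rows shaped like comments.csv. The sample rows are hypothetical, and the handling of top-level comments is an assumption: the description says their Parent_ID is None, which after a CSV round trip may appear as the string "None", as an empty cell, or possibly as the post's own ID.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows using the comments.csv column names listed above.
SAMPLE = """Post_ID,Comment_ID,Parent_ID,Body,Author,Score,Level
p1,c1,None,Top-level remark,alice,5,0
p1,c2,c1,Reply to c1,bob,2,1
p1,c3,None,Another top-level remark,carol,1,0
"""

def group_replies(rows):
    """Split rows into top-level comments and a parent -> replies mapping."""
    replies = defaultdict(list)
    top_level = []
    for row in rows:
        parent = row["Parent_ID"]
        # Assumption: a missing parent is stored as "None", an empty cell,
        # or the ID of the post itself.
        if parent in ("", "None") or parent == row["Post_ID"]:
            top_level.append(row)
        else:
            replies[parent].append(row)
    return top_level, replies

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
top_level, replies = group_replies(rows)
for comment in top_level:
    print(comment["Comment_ID"], comment["Body"])
    for child in replies[comment["Comment_ID"]]:
        print("  ", child["Comment_ID"], child["Body"])
```

With the real file, the in-memory sample would be replaced by an open file handle for comments.csv.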

Collection Method

The data was collected using the PRAW library to interact with the Reddit API. The pipeline:

1. Fetched the 100 most recent posts from r/unpopularopinion.
2. Retrieved all comments and sub-comments on those posts.
3. Fetched user details (e.g., karma) for all unique authors (post authors and commenters).
4. Fetched additional posts by those users.
5. Performed sentiment analysis on comments and additional user posts.
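
The first two pipeline steps might look roughly like the sketch below. It is not the author's script: the function name and the choice of output fields are assumptions, and the function takes an already authenticated PRAW client so that credentials stay out of the example.

```python
def collect_posts_and_comments(reddit, subreddit_name="unpopularopinion", limit=100):
    """Return (posts, comments) as lists of dicts for the newest submissions.

    `reddit` is expected to be a praw.Reddit instance; this helper is a
    hypothetical illustration, not part of the published dataset code.
    """
    posts, comments = [], []
    for submission in reddit.subreddit(subreddit_name).new(limit=limit):
        posts.append({
            "Post_ID": submission.id,
            "Title": submission.title,
            "Score": submission.score,
            "Num_Comments": submission.num_comments,
            # author is None for deleted accounts, so str() yields "None".
            "Author": str(submission.author),
        })
        # Expand "load more comments" placeholders to reach the full tree.
        submission.comments.replace_more(limit=None)
        for comment in submission.comments.list():
            comments.append({
                "Post_ID": submission.id,
                "Comment_ID": comment.id,
                # PRAW reports parent_id with a type prefix (t3_ post, t1_ comment).
                "Parent_ID": comment.parent_id,
                "Body": comment.body,
                "Author": str(comment.author),
                "Score": comment.score,
            })
    return posts, comments
```

A client would typically be created with `praw.Reddit(client_id=..., client_secret=..., user_agent=...)` using your own API credentials and then passed in as `reddit`. Expanding every comment tree with `limit=None` can be slow on busy threads.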

Potential Uses

Sentiment Analysis Research: Analyze the sentiment of Reddit discussions, comparing posts and comments.

Content Moderation: Develop algorithms to flag inappropriate content using sentiment and user data.

Social Media Analysis: Explore user activity patterns, such as how karma correlates with sentiment or comment scores.

NLP Projects: Use the raw text (post titles, bodies, comments) for natural language processing tasks like topic modeling or text classification.

Notes

This dataset is a raw snapshot before preprocessing steps like encoding or scaling. It retains usernames and text data, which are later anonymized in the pipeline.

Sentiment analysis was applied to comments and additional user posts, but not to the initial 100 posts in posts_df.csv.

The dataset may contain sensitive information (usernames, text). Users should handle it responsibly and consider anonymizing further if needed.
Datasets for Sentiment Analysis
zenodo.org
csv
Updated Dec 10, 2023
Cite
Julie R. Campos Arias (repository creator) (2023). Datasets for Sentiment Analysis [Dataset]. http://doi.org/10.5281/zenodo.10157504
Explore at:
Available download formats: csv
Unique identifier
https://doi.org/10.5281/zenodo.10157504
Dataset updated
Dec 10, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Julie R. Campos Arias (repository creator)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.
Below are the datasets specified, along with the details of their references, authors, and download sources.

----------- STS-Gold Dataset ----------------
The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.
Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.
File name: sts_gold_tweet.csv
----------- Amazon Sales Dataset ----------------
This dataset contains the ratings and reviews of more than 1,000 Amazon products, as listed on the official Amazon website. The data was scraped from that website in January 2023.
Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)
Features:
product_id - Product ID
product_name - Name of the Product
category - Category of the Product
discounted_price - Discounted Price of the Product
actual_price - Actual Price of the Product
discount_percentage - Percentage of Discount for the Product
rating - Rating of the Product
rating_count - Number of people who voted for the Amazon rating
about_product - Description about the Product
user_id - ID of the user who wrote review for the Product
user_name - Name of the user who wrote review for the Product
review_id - ID of the user review
review_title - Short review
review_content - Long review
img_link - Image Link of the Product
product_link - Official Website Link of the Product
License: CC BY-NC-SA 4.0
File name: amazon.csv
----------- Rotten Tomatoes Reviews Dataset ----------------
This rating inference dataset is a sentiment classification dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5331 rows contain only negative samples and the last 5331 rows contain only positive samples, so the data should be shuffled before use.
This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).
Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics
File name: data_rt.csv
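
Because the rows are ordered by class, a shuffle along the following lines is one way to mix them before splitting into training and test sets. This is a minimal sketch on hypothetical stand-in rows; the fixed seed is an assumption added for reproducibility.

```python
import random

def shuffled(rows, seed=0):
    """Return a shuffled copy of rows; the fixed seed keeps the order reproducible."""
    rng = random.Random(seed)
    out = list(rows)
    rng.shuffle(out)
    return out

# Hypothetical stand-in for data_rt.csv: negatives (0) first, then positives (1).
rows = [("bad film %d" % i, 0) for i in range(5)] + \
       [("good film %d" % i, 1) for i in range(5)]
mixed = shuffled(rows)
print([label for _, label in mixed])
```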
----------- Preprocessed Dataset Sentiment Analysis ----------------
Preprocessed Amazon product review data for the Gen3EcoDot (Alexa), scraped entirely from amazon.in.
Stemmed and lemmatized using nltk.
Sentiment labels are generated using TextBlob polarity scores.
The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).
DOI: 10.34740/kaggle/dsv/3877817
Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }
This dataset was used in the experimental phase of my research.
File name: EcoPreprocessed.csv
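
The description does not state how the division label is derived from the polarity score. The sketch below shows one plausible rule, assuming a symmetric neutral band around zero; the function name, the label strings, and the threshold are all hypothetical. TextBlob polarity itself ranges from -1 to 1.

```python
def division_from_polarity(polarity, neutral_band=0.0):
    """Map a polarity score in [-1, 1] to a categorical label.

    neutral_band is an assumed threshold: scores within +/- neutral_band
    of zero are labelled neutral.
    """
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"

for score in (0.8, 0.0, -0.35):
    print(score, division_from_polarity(score))
```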
----------- Amazon Earphones Reviews ----------------
This dataset consists of 9930 Amazon reviews and star ratings for the 10 latest (as of mid-2019) Bluetooth earphone devices, intended for learning how to train a machine for sentiment analysis.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)
License: U.S. Government Works
Source: www.amazon.in
File name (original): AllProductReviews.csv (contains 14337 reviews)
File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)
----------- Amazon Musical Instruments Reviews ----------------
This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (time of the review - raw) and division (manually added - categorical label generated using overall score).
Source: http://jmcauley.ucsd.edu/data/amazon/
File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)
File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)
Food Reviews - Text Mining & Sentiment Analysis
kaggle.com
zip
Updated Aug 4, 2023
Cite
vikram amin (2023). Food Reviews - Text Mining & Sentiment Analysis [Dataset]. https://www.kaggle.com/datasets/vikramamin/food-reviews-text-mining-and-sentiment-analysis
Explore at:
Available download formats: zip (1075643 bytes)
Dataset updated
Aug 4, 2023
Authors
vikram amin
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Brief description: The Chief Marketing Officer (CMO) of Healthy Foods Inc. wants to understand customer sentiment about the specialty foods that the company offers. This information has been collected through customer reviews on the company's website. The dataset consists of about 5000 reviews. The CMO wants answers to the following questions:

1. What are the most frequently used words in the customer reviews?
2. How can the data be prepared for text analysis?
3. What are the overall sentiments towards the products?

We will use text mining and sentiment analysis (R programming) to offer insights to the CMO with regard to the food reviews.

Steps:

- Set the working directory and read the data.
- Data cleaning: check for missing values and the data types of the variables.
- Load the required libraries ("tm", "SnowballC", "dplyr", "sentimentr", "wordcloud2", "RColorBrewer").
- TEXT ACQUISITION and AGGREGATION: create the corpus.
- TEXT PRE-PROCESSING: clean the text.
  - Replace special characters with " " (using the tm_map function).
  - Make all letters lower case.
  - Remove punctuation.
  - Remove whitespace.
  - Remove stopwords.
  - Remove numbers.
  - Stem the document.
  - Create the term-document matrix.
- Convert it into a matrix and find the frequency of words.
- Convert it into a data frame.
- TEXT EXPLORATION: find the words that appear most frequently and least frequently.
- Create a word cloud.
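
The original workflow is written in R with the tm package. As a rough Python analogue of the cleaning and word-frequency steps above, the sketch below lowercases the text, strips punctuation and numbers, removes stopwords, and counts the remaining words. The stopword list and sample reviews are hypothetical, and stemming is left out because it would need a third-party library.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "is", "it", "this", "of", "to", "i"}

def clean(text):
    """Lowercase, drop punctuation/digits/special characters, remove stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    # split() also collapses the extra whitespace left by the substitution.
    return [word for word in text.split() if word not in STOPWORDS]

# Hypothetical reviews standing in for the food review data.
reviews = [
    "I like the taste of this chip!",
    "The flavor is great, 10/10.",
    "Taste and flavor: I like it.",
]
freq = Counter(word for review in reviews for word in clean(review))
print(freq.most_common(3))
```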

TEXT MODELLING

Word association measures how strongly two words tend to occur together. Here we look for associations with the top three occurring words, "like", "tast" and "flavor", using a correlation limit of 0.2.

"like" is associated with "realli" (correlation of about 0.25), "dont" (0.24) and "one" (0.21).

"tast" has no association with any word at the set correlation limit.

"flavor" is associated with the word "chip" (correlation of about 0.27).

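To make the correlation limit concrete, the sketch below computes a Pearson correlation between per-document term counts and keeps the terms at or above the limit. It assumes the association measure is a correlation over the document-term matrix, which is how tm's findAssocs is documented; the counts are made-up toy numbers, not values from the food reviews.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)

# Hypothetical counts of each stemmed term in five documents.
counts = {
    "flavor": [1, 0, 2, 0, 1],
    "chip": [1, 0, 1, 0, 0],
    "tast": [0, 1, 0, 1, 0],
}

def associations(term, counts, corlimit=0.2):
    """Return the other terms whose correlation with `term` is >= corlimit."""
    found = {}
    for other, vector in counts.items():
        if other == term:
            continue
        r = pearson(counts[term], vector)
        if r >= corlimit:
            found[other] = round(r, 2)
    return found

print(associations("flavor", counts))
print(associations("tast", counts))
```
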
Sentiment analysis

element_id refers to the review number, sentence_id refers to the sentence number within the review, and word_count refers to the number of words in that sentence. Sentiment would be either positive or negative.

Let us find out the overall sentiment score of all the reviews.

This indicates that the entire food review document has a marginally positive score.

Let us find out the sentiment score for each of the 5000 reviews.

(-1) indicates the most extreme negative sentiment and (+1) indicates the most extreme positive sentiment

Let us create a separate data frame for all the negative sentiments. In total there are 726 negative sentiments out of the total 5000 reviews (approx 15%).
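
The per-review scoring and the negative subset can be pictured with the sketch below. It uses a plain mean of sentence scores per element_id as a stand-in, which is an assumption (sentimentr applies its own averaging), and the sentence rows are hypothetical.

```python
from collections import defaultdict

def review_scores(sentence_rows):
    """Average sentence-level sentiment per review (element_id)."""
    by_review = defaultdict(list)
    for element_id, sentiment in sentence_rows:
        by_review[element_id].append(sentiment)
    return {rid: sum(vals) / len(vals) for rid, vals in by_review.items()}

# Hypothetical (element_id, sentiment) pairs, one per sentence.
rows = [(1, 0.5), (1, -0.1), (2, -0.4), (2, -0.2), (3, 0.0)]
scores = review_scores(rows)

# Keep only reviews whose average score is below zero.
negative = {rid: score for rid, score in scores.items() if score < 0}
print(negative)

# Share of negative reviews reported in the description: 726 out of 5000.
print(round(726 / 5000 * 100, 1), "percent")
```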
Friends - R Package Dataset
kaggle.com
zip
Updated Nov 11, 2024
Cite
Lucas Yukio Imafuko (2024). Friends - R Package Dataset [Dataset]. https://www.kaggle.com/datasets/lucasyukioimafuko/friends-r-package-dataset
Explore at:
Available download formats: zip (2018791 bytes)
Dataset updated
Nov 11, 2024
Authors
Lucas Yukio Imafuko
Description
The whole data and source can be found at https://emilhvitfeldt.github.io/friends/

"The goal of friends to provide the complete script transcription of the Friends sitcom. The data originates from the Character Mining repository which includes references to scientific explorations using this data. This package simply provides the data in tibble format instead of json files."

Content

friends.csv - Contains the scenes and lines for each character, including season and episodes.

friends_emotions.csv - Contains sentiments for each scene - for the first four seasons only.

friends_info.csv - Contains information regarding each episode, such as imdb_rating, views, episode title and directors.

Uses

Text mining, sentiment analysis and word statistics.

Data visualizations.
R and Python Stack Overflow Answers + Sentiment
kaggle.com
zip
Updated May 28, 2019
Cite
OJ Watson (2019). R and Python Stack Overflow Answers + Sentiment [Dataset]. https://www.kaggle.com/datasets/ojwatson/stack-overflow-output
Explore at:
Available download formats: zip (76142440 bytes)
Dataset updated
May 28, 2019
Authors
OJ Watson
Description
Context

This is the output of the Stack Rudeness kernel (https://www.kaggle.com/ojwatson/stack-rudeness), as saved in Cell 17.

Content

Stack Overflow answers by the top 10 R and Python users, extracted using BigQuery. Also includes data on whether the answer was accepted and some additional data based on sentiment analysis of the answer text.

Acknowledgements

BigQuery and StackOverflow
SEN - Sentiment analysis of Entities in News headlines
data.niaid.nih.gov
Updated Oct 15, 2023
Cite
Katarzyna Baraniak; Marcin Sydow (2023). SEN - Sentiment analysis of Entities in News headlines [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5211931
Explore at:
Dataset updated
Oct 15, 2023
Dataset provided by
Polish-Japanese Academy of Information Technology
Polish-Japanese Academy of Information Technology / Institute of Computer Science Polish Academy of Sciences
Authors
Katarzyna Baraniak; Marcin Sydow
Description
If you wish to use this data please cite:

Katarzyna Baraniak, Marcin Sydow, A dataset for Sentiment analysis of Entities in News headlines (SEN), Procedia Computer Science, Volume 192, 2021, Pages 3627-3636, ISSN 1877-0509, https://doi.org/10.1016/j.procs.2021.09.136. (https://www.sciencedirect.com/science/article/pii/S1877050921018755)

bibtex: users.pja.edu.pl/~msyd/bibtex/sydow-baraniak-SENdataset-kes21.bib

SEN is a novel publicly available human-labelled dataset for training and testing machine learning algorithms for the problem of entity level sentiment analysis of political news headlines.

On-line news portals play a very important role in the information society. Fair media should present reliable and objective information. In practice there is an observable positive or negative bias concerning named entities (e.g. politicians) mentioned in the on-line news headlines. Our dataset consists of 3819 human-labelled political news headlines coming from several major on-line media outlets in English and Polish.

Each record contains a news headline, a named entity mentioned in the headline, and a human-annotated label (one of “positive”, “neutral”, “negative”). Our SEN dataset package consists of 2 parts: SEN-en (English headlines, split into SEN-en-R and SEN-en-AMT) and SEN-pl (Polish headlines). Each headline-entity pair was annotated either by a team of volunteer researchers (the whole SEN-pl dataset and a subset of 1271 English records: the SEN-en-R subset, “R” for “researchers”) or via the Amazon Mechanical Turk service (a subset of 1360 English records: the SEN-en-AMT subset).

During analysis of the annotations, outlying annotations were identified and removed. A separate version of the dataset without outliers is marked by "noutliers" in the data file name.

Details of the process of preparing the dataset and its analysis are presented in the paper.

In case of any questions, please contact one of the authors. Email addresses are in the paper.
Logistic regression model, LDA.
plos.figshare.com
xls
Updated May 30, 2023
Cite
Artur Sokolovsky; Thomas Gross; Jaume Bacardit (2023). Logistic regression model, LDA. [Dataset]. http://doi.org/10.1371/journal.pone.0246464.t003
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0246464.t003
Dataset updated
May 30, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Artur Sokolovsky; Thomas Gross; Jaume Bacardit
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Logistic regression model, LDA.
Weibo Emotional Dynamic Analysis Code Dataset
scidb.cn
Updated Sep 28, 2025
Cite
liu xing yu (2025). Weibo Emotional Dynamic Analysis Code Dataset [Dataset]. http://doi.org/10.57760/sciencedb.psych.00767
Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Unique identifier
https://doi.org/10.57760/sciencedb.psych.00767
Dataset updated
Sep 28, 2025
Dataset provided by
Science Data Bank
Authors
liu xing yu
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This study analyzes the dynamic evolution patterns of emotional states based on 207 Weibo posts using computational linguistics methods. The research encompasses a complete pipeline including data collection, text cleaning, sentiment analysis, co-occurrence network construction, and Markov chain modeling. The dataset contains comprehensive R code implementations, processed sentiment-annotated data, co-occurrence network matrices, transition probability matrices, and visualization results, providing a reproducible computational framework for social media emotion dynamics research.
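
The dataset's own implementation is in R. As an illustration of the Markov chain step only, the Python sketch below estimates a first-order transition probability matrix from a sequence of emotion labels. The label names and the sequence are hypothetical, and treating the posts as a single ordered sequence is an assumption.

```python
from collections import Counter, defaultdict

def transition_matrix(states):
    """Estimate first-order transition probabilities from a state sequence."""
    counts = defaultdict(Counter)
    for current, nxt in zip(states, states[1:]):
        counts[current][nxt] += 1
    matrix = {}
    for current, row in counts.items():
        total = sum(row.values())
        # Normalise each row so its probabilities sum to 1.
        matrix[current] = {nxt: n / total for nxt, n in row.items()}
    return matrix

# Hypothetical emotion labels for consecutive posts.
sequence = ["joy", "joy", "anger", "sadness", "joy", "anger"]
P = transition_matrix(sequence)
for state, row in P.items():
    print(state, row)
```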
sentiwordnet_it 1.0
zenodo.org
zip
Updated Oct 8, 2025
Cite
Agnese Vardanega (2025). sentiwordnet_it 1.0 [Dataset]. http://doi.org/10.5281/zenodo.17248245
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.17248245
Dataset updated
Oct 8, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Agnese Vardanega
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
This repository contains a sentiment lexicon for Italian, based on SentiWordNet 3.0 (Baccianella, Esuli, and Sebastiani 2010; Esuli [2019] 2025) and MultiWordNet (Pianta, Bentivogli, and Girardi 2002).

Unlike previous resources—SentiWordNet, which provides sentiment scores without Italian lexical coverage, and MultiWordNet, which offers Italian synsets without sentiment annotation—this dataset bridges the two by mapping Italian lexical entries to sentiment scores in a ready-to-use CSV format.

This integration enables direct use in sentiment analysis and other NLP applications for Italian, filling a gap in existing resources.

The included files, in the data/ folder are:

swn_it.csv: A dataset of 35,001 Italian synsets with polarity scores, POS, synset, offset, English synset lemmas, and gloss (in English).

swn_it_tidy.csv: A tidy (one token per row) dataset of 41,725 lemmas, with polarity scores. It is designed for use in R.

It also contains a folder with examples in R, and scripts to use and manipulate the datasets:

examples-R/:

custom_dataset.R: Create a custom tidy dataset from the original one, for treating duplicate entries differently.

example.R: Examples of how to use the dataset for sentiment analysis on a sample text.

uso.md: Instructions for using the dataset in R (in Italian), referred to in example.R.
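
The repository's examples are in R. For readers working in Python, the sketch below shows the general idea of lexicon-based scoring with a tidy lemma-to-polarity table. The column names `lemma` and `polarity`, the sample entries, and the scores are assumptions made for illustration; check the header of swn_it_tidy.csv before adapting it.

```python
import csv
import io

# Hypothetical excerpt; the real column names and scores may differ.
SAMPLE = "lemma,polarity\nbuono,0.6\ncattivo,-0.6\nfilm,0.0\n"

def load_lexicon(handle):
    """Read a lemma -> polarity mapping from a CSV file handle."""
    return {row["lemma"]: float(row["polarity"]) for row in csv.DictReader(handle)}

def score_text(text, lexicon):
    """Mean polarity of the tokens found in the lexicon (0.0 if none match)."""
    tokens = text.lower().split()
    hits = [lexicon[token] for token in tokens if token in lexicon]
    return sum(hits) / len(hits) if hits else 0.0

lexicon = load_lexicon(io.StringIO(SAMPLE))
print(score_text("Un film buono", lexicon))
```

Whitespace tokenisation without lemmatisation is a simplification; inflected Italian forms would need to be reduced to lemmas before lookup.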
ParlVote: Corpora for Sentiment Analysis of Political Debatess
data.mendeley.com
Updated Jul 11, 2020
Cite
Gavin Abercrombie (2020). ParlVote: Corpora for Sentiment Analysis of Political Debatess [Dataset]. http://doi.org/10.17632/czjfwgs9tm.2
Explore at:
Unique identifier
https://doi.org/10.17632/czjfwgs9tm.2
Dataset updated
Jul 11, 2020
Authors
Gavin Abercrombie
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Datasets for policy preference identification, binary sentiment classification, and stance detection of debates from the House of Commons of the United Kingdom Parliament.

For details, see:

ParlVote: G. Abercrombie and R. Batista-Navarro. ParlVote: A Corpus for Sentiment Analysis of Political Debates. Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC-2020). European Languages Resources Association (ELRA), 2020.

ParlVote+: Paper under review. This version includes policy preference labels for each example. It has also been cleaned up a little, and some incorrect examples from the original dataset have been removed.

Data published under the Open Parliament Licence v3.0 : https://www.parliament.uk/site-information/copyright-parliament/open-parliament-licence/
Numbers of posts per package.
plos.figshare.com
xls
Updated Jun 12, 2023
Cite
Artur Sokolovsky; Thomas Gross; Jaume Bacardit (2023). Numbers of posts per package. [Dataset]. http://doi.org/10.1371/journal.pone.0246464.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0246464.t001
Dataset updated
Jun 12, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Artur Sokolovsky; Thomas Gross; Jaume Bacardit
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Numbers of posts per package.
Replication Data for: A Review of Best Practice Recommendations for...
dataverse-staging.rdmc.unc.edu
datasearch.gesis.org
Updated Nov 7, 2017
Cite
Ryan Wesslen (2017). Replication Data for: A Review of Best Practice Recommendations for Text-Analysis in R (and a User Friendly App) [Dataset]. http://doi.org/10.15139/S3/R4W7ZS
Explore at:
Available download formats: csv (1070619), application/x-rlang-transport (1014184), pdf (76215), text/x-r-markdown (14242), text/x-r-markdown (12162), html (2930583), application/x-rlang-transport (2108553), docx (24677), html (2442743), html (1689406), text/markdown (1958), application/x-rlang-transport (1623238), text/x-r-markdown (12252)
Unique identifier
https://doi.org/10.15139/S3/R4W7ZS
Dataset updated
Nov 7, 2017
Dataset provided by
UNC Dataverse
Authors
Ryan Wesslen
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Replication materials for "A Review of Best Practice Recommendations for Text-Analysis in R (and a User Friendly App)". These materials are also available in a GitHub repo (https://github.com/wesslen/text-analysis-org-science), and the Shiny app is in a separate GitHub repo (https://github.com/wesslen/topicApp).
R - Data and Script Files
figshare.com
txt
Updated Sep 5, 2025
Cite
Carter Emerson (2025). R - Data and Script Files [Dataset]. http://doi.org/10.6084/m9.figshare.30066598.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.30066598.v1
Dataset updated
Sep 5, 2025
Dataset provided by
Figshare (http://figshare.com/)
Authors
Carter Emerson
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
R data and script files
Movie-review_SentAnlsys
kaggle.com
zip
Updated Dec 31, 2023
Cite
Naveen Karthik R (2023). Movie-review_SentAnlsys [Dataset]. https://www.kaggle.com/datasets/naveenkarthikr/movie-review-sentanlsys
Explore at:
Available download formats: zip (2092198 bytes)
Dataset updated
Dec 31, 2023
Authors
Naveen Karthik R
Description
Dataset

This dataset was created by Naveen Karthik R

Contents
Optimized parameters of Random Forest and CatBoost models.
figshare.com
xls
Updated Jun 12, 2023
Cite
Artur Sokolovsky; Thomas Gross; Jaume Bacardit (2023). Optimized parameters of Random Forest and CatBoost models. [Dataset]. http://doi.org/10.1371/journal.pone.0246464.t005
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0246464.t005
Dataset updated
Jun 12, 2023
Dataset provided by
PLOS ONE
Authors
Artur Sokolovsky; Thomas Gross; Jaume Bacardit
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Optimized parameters of Random Forest and CatBoost models.
Toward multimodal information and AI interaction: a quasi-experiment with...
data.niaid.nih.gov
Updated Aug 5, 2024
Cite
Crudele, Francesca; Raffaghelli, Juliana Elisa (2024). Toward multimodal information and AI interaction: a quasi-experiment with ChatGPT [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13220545
Explore at:
Dataset updated
Aug 5, 2024
Dataset provided by
University of Padua
Authors
Crudele, Francesca; Raffaghelli, Juliana Elisa
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The development of argumentative text and information comprehension (CoI) skills related to the critical reconstruction of meaning (CT) is crucial in undergraduate education, especially now in the era of social media and AI-mediated information. Generative AI aids in information creation, but its unconscious use can complicate the navigation of complex information. Argument maps (AM), commonly used for analyzing analog and static texts, can help visualize, understand, and rework multimodal and dynamic arguments and information.

Stemming from the Vygotskian idea, our study used a design-based research approach on the use of AMs and ChatGPT as socio-technical artifacts to stimulate and support the understanding of information (CoI) and thus the development of critical thinking (CT). The workshop introduced the multimodal element through a 3-group quasi-experiment. The first group dealt with fully analog texts, the second group used maps with multimodal textual modes, and the third group only interacted with ChatGPT. The research compared the three groups, with a particular focus on the two experimental groups (experimental macro-focus).

The research had three main objectives: 1) to test whether AMs improved students' CoI enhancement and critical processing (CT); 2) to determine whether interaction with ChatGPT supported information reprocessing and critical construction of opinions and assessment tools; and 3) to determine whether interaction with ChatGPT alone, without AMs, still fostered greater integration of information and viewpoints.

Our preliminary analysis showed that AMs improved students' CoI and CT, especially when exposed to multimodal information. ChatGPT interaction increased critical reflection and awareness of AI's role in education. Students using only ChatGPT performed well in argumentative reworking, suggesting that interaction with the chatbot can be effective. However, integrating AMs and ChatGPT could provide optimal support for comprehension and critical thinking skills.

This Zenodo record follows the full analysis process with R (https://cran.r-project.org/bin/windows/base/ ) and Nvivo (https://lumivero.com/products/nvivo/) and is composed of the following datasets, scripts and results:

Comprehension of Text and AMs Results - Arg_Map.xlsx

Critical Thinking level - CriThink.xlsx

Descriptive and Inferential Statistics Comprehension and Critical Thinking - Preliminary Analysis.R

Elaboration and Integration Opinion - Opi_G1.xlsx; Opi_G2.xlsx & Opi_G3.xlsx

Descriptive and Inferential Statistics Opinion level - Preliminary Analysis_opi.R

Sentiment Analysis - Sentiment Analysis.R

Vocabulary Frequent words - Vocabulary.csv

Codebook qualitative Analysis with Nvivo (Codebook.xlsx)

Results Nvivo Analysis G1 & G2 - Codebook-ChatGPT_G1&G2.docx

Any comments or improvements are welcome!
Twitter Sentiment Analysis
kaggle.com
zip
Updated Apr 16, 2023
Cite
raj713335 (2023). Twitter Sentiment Analysis [Dataset]. https://www.kaggle.com/datasets/raj713335/twittesentimentanalysis/discussion
Explore at:
zip(84855617 bytes)Available download formats
Dataset updated
Apr 16, 2023
Authors
raj713335
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
About Dataset

Context

This is the Twitter Sentiment Analysis dataset. It contains 1 million tweets extracted using the Twitter Opensource API. The tweets have been annotated (0 = negative, 4 = positive) and can be used primarily to detect sentiment.

Content

It contains the following 6 fields:

target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)

ids: The id of the tweet (2087)

date: the date of the tweet (Sat April 15 23:58:44 UTC 2023)

flag: The query (lyx). If there is no query, then this value is NO_QUERY.

user: The user that tweeted (raj713335)

text: the text of the tweet (Lyx is cool)
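The field list above can be read with a few lines of Python. This is a minimal sketch, not the dataset author's loader: it assumes the file is comma-separated with quoted values and has no header row, and the sample row is made up from the example values in the field list.

```python
import csv
import io

# Column order as listed in the description; a header-less file is an assumption.
FIELDS = ["target", "ids", "date", "flag", "user", "text"]

# Polarity codes from the "target" field description.
POLARITY = {"0": "negative", "2": "neutral", "4": "positive"}


def read_tweets(handle):
    """Yield one dict per tweet, with a readable 'label' key added."""
    for row in csv.DictReader(handle, fieldnames=FIELDS):
        row["label"] = POLARITY.get(row["target"], "unknown")
        yield row


# Made-up sample row built from the example values in the field list.
sample = io.StringIO(
    '"4","2087","Sat April 15 23:58:44 UTC 2023","lyx","raj713335","Lyx is cool"\n'
)
tweets = list(read_tweets(sample))
print(tweets[0]["label"], "-", tweets[0]["text"])
```

For the real file, pass an open file handle instead of the in-memory sample; the encoding and exact file name would need to be checked against the download.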

Acknowledgments

The official link regarding the dataset, with resources about how it was generated, is here. The official paper detailing the approach is here.

Citation: Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009), p.12.

Inspiration

To detect severity from tweets. You may have a look at this.
"AI as an Ally?" : AI mediation tools to support undergraduates'...
data.niaid.nih.gov
zenodo.org
Updated Aug 5, 2024
Cite
Crudele, Francesca; Raffaghelli, Juliana Elisa (2024). "AI as an Ally?" : AI mediation tools to support undergraduates' argumentative skills [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13170804
Explore at:
Dataset updated
Aug 5, 2024
Dataset provided by
University of Padua
Authors
Crudele, Francesca; Raffaghelli, Juliana Elisa
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Argumentative skills are indispensable both personally and professionally to process complex information (CoI) relating to the critical reconstruction of meaning through critical thinking (CT). This remains a particularly relevant priority, especially in the age of social media and artificial intelligence-mediated information. Recently, the public dissemination of what has been called generative artificial intelligence (GenAI), with the particular example of ChatGPT (OpenAI, 2022), has made it even easier today to access and disseminate information, written or not, true or not. New tools are needed to critically address post-digital information abundance.

In this context, argumentative maps (AMs), which are already used to develop argumentative skills and critical thinking, are studied for multimodal and dynamic information visualization, comprehension, and reprocessing. In this regard, the entry of generative AI into university classrooms proposes a novel scenario of multimodality and technological dynamism.

Building on the Vygotskian idea of mediation and the theory of "dual stimulation" as applied to the use of learning technologies, the idea was to complement AMs with the introduction of a second set of stimuli that would support and enhance individual activity: AI-mediated tools. With AMs, an attempt has been made to create a space for understanding, fixing, and reconstructing information, which is important for the development of argumentative skills. On the other hand, by arranging forms of critical and functional interaction with ChatGPT as an ally in understanding, reformulating, and rethinking one's argumentative perspectives, a new and comprehensive argumentative learning process has been arranged, while also cultivating a deeper understanding of the artificial agents themselves.

Our study was based on a two-group quasi-experiment with 27 students of the “Research Methods in Education” course, to explore the role of AMs in fixing and supporting multimodal information reprocessing. In addition, by incorporating the intelligent chatbot ChatGPT, one of the most widely used GenAI technologies, we investigated the evolution of students' perceptions of its potential role as a “study companion” in information comprehension and reprocessing activities, following a path toward building a good prompt.

Preliminary analyses showed that in both groups, AMs supported the increase in mean CoI and CT levels for analog and digital information. However, the group with analog texts showed more complete reprocessing. The interaction with the chatbot was analyzed quantitatively and qualitatively; an initial positive reflection on the potential of ChatGPT emerged, along with increased confidence in interacting with intelligent agents after learning the rules for constructing good prompts.

This Zenodo record follows the full analysis process with R (https://cran.r-project.org/bin/windows/base/) and Nvivo (https://lumivero.com/products/nvivo/) and is composed of the following datasets, scripts, and results:

Comprehension of Text and AMs Results - Arg_G1.xlsx & Arg_G2.xlsx

Opinion and Critical Thinking level - Opi_G1.xlsx & Opi_G2.xlsx

Data for Correlation and Regression - CorRegr_G1.xlsx & CorRegr_G2.xlsx

Interaction with ChatGPT - GPT_G1.xlsx & GPT_G2.xlsx

Descriptive and Inferential Statistics Comprehension and AMs Building - Analysis_RES_Comprehension.R

Descriptive and Inferential Statistics Opinion and Critical Thinking level - Analysis_RES_Opinion.R

Correlation and Regression - Analysis_RES_CorRegr.R

Descriptive and Inferential Statistics Interaction with ChatGPT - Analysis_RES_ChatGPT.R

Sentiment Analysis - Sentiment Analysis_G1.R & Sentiment Analysis_G2.R

Vocabulary Frequent words - Vocabulary.csv

Codebook qualitative Analysis with Nvivo (Codebook.xlsx)

Results Nvivo Analysis G1 - Codebook - ChatGPT2 G1.docx

Results Nvivo Analysis G2 - Codebook - ChatGPT2 G2.docx

Any comments or improvements are welcome!
CSMV_visual
huggingface.co
Updated Apr 3, 2025
Cite
jackynix (2025). CSMV_visual [Dataset]. https://huggingface.co/datasets/jackynix/CSMV_visual
Explore at:
Dataset updated
Apr 3, 2025
Authors
jackynix
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains the visual features of the CSMV dataset released in the paper "Infer Induced Sentiment of Comment Response to Video: A New Task, Dataset and Baseline". The repository contains feature representations of the micro-videos. Each subfolder is named after a different feature extraction method, and the features for each video are saved as .npy files. The filenames correspond to the video_file_id. Currently, features extracted using I3D (recommended) and R(2+1)D have been released.… See the full description on the dataset page: https://huggingface.co/datasets/jackynix/CSMV_visual.
R Code for Systematic Review and Meta Analysis
data.mendeley.com
Updated May 22, 2020
Cite
Carmen Isensee (2020). R Code for Systematic Review and Meta Analysis [Dataset]. http://doi.org/10.17632/hympskpm3x.1
Explore at:
Unique identifier
https://doi.org/10.17632/hympskpm3x.1
Dataset updated
May 22, 2020
Authors
Carmen Isensee
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This project presents all code related to the review paper "The relationship between organizational culture, sustainability, and digitalization in SMEs: A systematic review."

Cite

Vijay J0shi (2025). Reddit Dataset With Sentiment Analysis [Dataset]. https://www.kaggle.com/datasets/vijayj0shi/reddit-dataset-with-sentiment-analysis

Reddit Dataset With Sentiment Analysis

Sentiment Analysis of Posts and Comments

Explore at:

zip(4119981 bytes)Available download formats

Dataset updated

Jun 5, 2025

Authors

Vijay J0shi

License

Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically

Description

This dataset contains raw data from the Reddit subreddit r/unpopularopinion, collected on June 5, 2025. It includes 100 recent posts, all comments (including sub-comments) on those posts, user details for authors involved in the discussion, and additional posts by those users. Sentiment analysis has been performed on the comments and additional user posts, providing sentiment labels, confidence scores, and derived sentiment scores.

Dataset Contents

users.csv: Contains details of Reddit users involved in the discussion (post authors and commenters).
- Username: Reddit username.
- Karma: Total karma (Link_Karma + Comment_Karma).
- Link_Karma: Karma from posts.
- Comment_Karma: Karma from comments.
- Account_Created: Timestamp of account creation.
user_posts.csv: Contains additional posts by all unique users involved in the discussion, with sentiment analysis.
- Username: Post author’s username.
- Post_ID: Unique post identifier.
- Title: Post title.
- Subreddit: Subreddit where the post was made.
- Score: Upvote/downvote score.
- URL: Post URL.
- Sentiment: Sentiment label (e.g., positive, negative, neutral).
- Confidence: Confidence score of the sentiment prediction.
- Sentiment_Score: Numerical sentiment score derived from sentiment analysis.
posts_df.csv: Contains the initial 100 posts fetched from r/unpopularopinion.
- Title: Post title.
- Score: Upvote/downvote score.
- Post_ID: Unique post identifier.
- URL: Post URL.
- Num_Comments: Number of comments on the post.
- Created: Timestamp of post creation.
- Text: Post body text.
- Author: Post author’s username.
comments.csv: Contains all comments and sub-comments on the 100 posts, with sentiment analysis.
- Post_ID: ID of the post the comment belongs to.
- Post_Title: Title of the post.
- Comment_ID: Unique comment identifier.
- Parent_ID: ID of the parent (post or comment), or None for top-level comments.
- Body: Comment text.
- Author: Comment author’s username.
- Score: Upvote/downvote score.
- Level: 0 for top-level comments, 1 for sub-comments.
- Sentiment: Sentiment label.
- Confidence: Confidence score of the sentiment prediction.
- Sentiment_Score: Numerical sentiment score (inferred column).
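The columns listed above are enough to summarize sentiment per post. The following is a minimal sketch, assuming comments.csv has a header row with exactly these column names and that Sentiment_Score parses as a number; the sample rows are invented and use only a subset of the documented columns.

```python
import csv
import io
from collections import defaultdict


def mean_sentiment_by_post(handle):
    """Return {Post_ID: mean Sentiment_Score} over all comments on each post."""
    totals = defaultdict(lambda: [0.0, 0])  # Post_ID -> [running sum, count]
    for row in csv.DictReader(handle):
        bucket = totals[row["Post_ID"]]
        bucket[0] += float(row["Sentiment_Score"])
        bucket[1] += 1
    return {post_id: total / count for post_id, (total, count) in totals.items()}


# Invented rows; real files carry more columns (Body, Author, Score, ...).
sample = io.StringIO(
    "Post_ID,Comment_ID,Parent_ID,Level,Sentiment,Sentiment_Score\n"
    "p1,c1,None,0,positive,0.5\n"
    "p1,c2,c1,1,negative,-1.0\n"
    "p2,c3,None,0,neutral,0.0\n"
)
means = mean_sentiment_by_post(sample)
print(means)
```

Because posts_df.csv shares the Post_ID key, the resulting dictionary could be looked up against that file to attach a comment-level sentiment summary to each post.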

Collection Method

The data was collected using the PRAW library to interact with the Reddit API. The pipeline:
1. Fetched the 100 most recent posts from r/unpopularopinion.
2. Retrieved all comments and sub-comments on those posts.
3. Fetched user details (e.g., karma) for all unique authors (post authors and commenters).
4. Fetched additional posts by those users.
5. Performed sentiment analysis on comments and additional user posts.
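The first two steps of this pipeline (fetching recent posts and retrieving their comments) might look roughly like the sketch below. It is an illustration, not the dataset author's script: the function name `collect_comments`, the row layout, and the stripping of the parent prefix are assumptions chosen to mirror the comments.csv columns, and `reddit` is expected to be an authenticated PRAW client supplied by the caller.

```python
def collect_comments(reddit, subreddit_name="unpopularopinion", limit=100):
    """Return one row per comment on the newest `limit` posts of a subreddit.

    `reddit` is expected to behave like an authenticated praw.Reddit instance.
    """
    rows = []
    for submission in reddit.subreddit(subreddit_name).new(limit=limit):
        # Expand every "load more comments" stub so the whole tree is available.
        submission.comments.replace_more(limit=None)
        for comment in submission.comments.list():
            rows.append({
                "Post_ID": submission.id,
                "Post_Title": submission.title,
                "Comment_ID": comment.id,
                # PRAW reports parents as fullnames such as "t1_abc"; drop the prefix.
                "Parent_ID": None if comment.is_root
                else comment.parent_id.split("_", 1)[1],
                "Body": comment.body,
                # Deleted accounts come back with author set to None.
                "Author": comment.author.name if comment.author else None,
                "Score": comment.score,
                # Mirrors the documented Level column: 0 top-level, 1 otherwise.
                "Level": 0 if comment.is_root else 1,
            })
    return rows
```

With PRAW installed, `reddit` would typically be created with `praw.Reddit(client_id=..., client_secret=..., user_agent=...)` using your own API credentials. Steps 3 to 5 (user details, additional posts, sentiment analysis) are not covered here because the description does not say which calls or model were used.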

Potential Uses

Sentiment Analysis Research: Analyze the sentiment of Reddit discussions, comparing posts and comments.
Content Moderation: Develop algorithms to flag inappropriate content using sentiment and user data.
Social Media Analysis: Explore user activity patterns, such as how karma correlates with sentiment or comment scores.
NLP Projects: Use the raw text (post titles, bodies, comments) for natural language processing tasks like topic modeling or text classification.

Notes

This dataset is a raw snapshot before preprocessing steps like encoding or scaling. It retains usernames and text data, which are later anonymized in the pipeline.
Sentiment analysis was applied to comments and additional user posts, but not to the initial 100 posts in posts_df.csv.
The dataset may contain sensitive information (usernames, text). Users should handle it responsibly and consider anonymizing further if needed.
