License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
License information was derived automatically
This dataset contains raw data from the Reddit subreddit r/unpopularopinion, collected on June 5, 2025. It includes 100 recent posts, all comments (including sub-comments) on those posts, user details for authors involved in the discussion, and additional posts by those users. Sentiment analysis has been performed on the comments and additional user posts, providing sentiment labels, confidence scores, and derived sentiment scores.
users.csv: Contains details of Reddit users involved in the discussion (post authors and commenters).
- Username: Reddit username.
- Karma: Total karma (Link_Karma + Comment_Karma).
- Link_Karma: Karma from posts.
- Comment_Karma: Karma from comments.
- Account_Created: Timestamp of account creation.

user_posts.csv: Contains additional posts by all unique users involved in the discussion, with sentiment analysis.
- Username: Post author’s username.
- Post_ID: Unique post identifier.
- Title: Post title.
- Subreddit: Subreddit where the post was made.
- Score: Upvote/downvote score.
- URL: Post URL.
- Sentiment: Sentiment label (e.g., positive, negative, neutral).
- Confidence: Confidence score of the sentiment prediction.
- Sentiment_Score: Numerical sentiment score derived from sentiment analysis.

posts_df.csv: Contains the initial 100 posts fetched from r/unpopularopinion.
- Title: Post title.
- Score: Upvote/downvote score.
- Post_ID: Unique post identifier.
- URL: Post URL.
- Num_Comments: Number of comments on the post.
- Created: Timestamp of post creation.
- Text: Post body text.
- Author: Post author’s username.

comments.csv: Contains all comments and sub-comments on the 100 posts, with sentiment analysis.
- Post_ID: ID of the post the comment belongs to.
- Post_Title: Title of the post.
- Comment_ID: Unique comment identifier.
- Parent_ID: ID of the parent (post or comment), or None for top-level comments.
- Body: Comment text.
- Author: Comment author’s username.
- Score: Upvote/downvote score.
- Level: 0 for top-level comments, 1 for sub-comments.
- Sentiment: Sentiment label.
- Confidence: Confidence score of the sentiment prediction.
- Sentiment_Score: Numerical sentiment score (inferred column).

The data was collected using the PRAW library to interact with the Reddit API. The pipeline:
1. Fetched the 100 most recent posts from r/unpopularopinion.
2. Retrieved all comments and sub-comments on those posts.
3. Fetched user details (e.g., karma) for all unique authors (post authors and commenters).
4. Fetched additional posts by those users.
5. Performed sentiment analysis on comments and additional user posts.
posts_df.csv.
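The dataset documents a Sentiment label, a Confidence score, and a derived Sentiment_Score, but not how the score is derived. One common convention is signed confidence (positive labels map to +confidence, negative to -confidence); the sketch below uses that convention purely as an illustration, not as the dataset's documented method:

```python
def sentiment_score(label, confidence):
    """Map a (label, confidence) pair to a signed numerical score.

    Hypothetical derivation (signed confidence); the dataset does not
    document its actual formula, so treat this as illustrative only.
    """
    sign = {"positive": 1.0, "neutral": 0.0, "negative": -1.0}[label.lower()]
    return sign * confidence

print(sentiment_score("positive", 0.92))   # 0.92
print(sentiment_score("negative", 0.75))   # -0.75
```

If the scores in comments.csv do not match this convention, inspecting a few (Sentiment, Confidence, Sentiment_Score) triples should reveal the actual mapping.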
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. It stores the datasets used in some of the studies that served as research material for this Master's thesis, as well as the datasets used in the experimental part of the work.
Below are the datasets specified, along with the details of their references, authors, and download sources.
----------- STS-Gold Dataset ----------------
The dataset consists of 2,026 tweets. The file consists of 3 columns: id, polarity, and tweet, denoting the unique id, the polarity of the text, and the tweet text, respectively.
Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.
File name: sts_gold_tweet.csv
----------- Amazon Sales Dataset ----------------
This dataset contains ratings and reviews for 1K+ Amazon products, as listed on the official Amazon website. The data was scraped from the official Amazon website in January 2023.
Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)
Features:
License: CC BY-NC-SA 4.0
File name: amazon.csv
----------- Rotten Tomatoes Reviews Dataset ----------------
This rating inference dataset is a sentiment classification dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5,331 rows contain only negative samples and the last 5,331 rows contain only positive samples, so the data should be shuffled before use.
This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).
Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics
File name: data_rt.csv
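Because the rows are ordered by class, a shuffle before any train/test split is needed to avoid degenerate splits. A minimal sketch (using synthetic rows standing in for data_rt.csv, with a fixed seed for reproducibility):

```python
import random

# Synthetic stand-in for the ordered rows of data_rt.csv:
# first 5,331 rows labelled 0 (rotten), last 5,331 labelled 1 (fresh).
rows = [("negative review %d" % i, 0) for i in range(5331)] + \
       [("positive review %d" % i, 1) for i in range(5331)]

rng = random.Random(42)   # fixed seed so the shuffle is reproducible
rng.shuffle(rows)         # in-place shuffle before splitting

split = int(0.8 * len(rows))
train, test = rows[:split], rows[split:]

# After shuffling, both splits contain a mix of labels.
print(sum(label for _, label in train) / len(train))
```

Without the shuffle, an 80/20 split would put every positive sample in the training set and leave the test set heavily skewed.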
----------- Preprocessed Dataset Sentiment Analysis ----------------
Preprocessed Amazon product review data of the Gen3EcoDot (Alexa), scraped entirely from amazon.in.
Stemmed and lemmatized using nltk.
Sentiment labels are generated using TextBlob polarity scores.
The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).
DOI: 10.34740/kaggle/dsv/3877817
Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }
This dataset was used in the experimental phase of my research.
File name: EcoPreprocessed.csv
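The division column is a categorical label generated from the TextBlob polarity score. The exact cut-offs are not documented; the binning below uses the common sign-based convention (an assumption, not the dataset's stated thresholds):

```python
def division(polarity):
    """Bin a TextBlob polarity score in [-1, 1] into a categorical label.

    The cut-offs below are a common convention, not the dataset's
    documented thresholds.
    """
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"

print(division(0.35), division(-0.1), division(0.0))
```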
----------- Amazon Earphones Reviews ----------------
This dataset consists of 9,930 Amazon reviews and star ratings for the 10 latest (as of mid-2019) Bluetooth earphone devices, intended for training machine learning models for sentiment analysis.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)
License: U.S. Government Works
Source: www.amazon.in
File name (original): AllProductReviews.csv (contains 14337 reviews)
File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)
----------- Amazon Musical Instruments Reviews ----------------
This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (time of the review, raw), and division (manually added - categorical label generated using the overall score).
Source: http://jmcauley.ucsd.edu/data/amazon/
File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)
File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
Brief Description: The Chief Marketing Officer (CMO) of Healthy Foods Inc. wants to understand customer sentiment about the specialty foods the company offers. This information has been collected through customer reviews on their website. The dataset consists of about 5,000 reviews. They want answers to the following questions:
1. What are the most frequently used words in the customer reviews?
2. How can the data be prepared for text analysis?
3. What are the overall sentiments towards the products?
Steps:
- Set the working directory and read the data.
- Data cleaning: check for missing values and the data types of variables
- Load the required libraries ("tm", "SnowballC", "dplyr", "sentimentr", "wordcloud2", "RColorBrewer")
- TEXT ACQUISITION and AGGREGATION: create the corpus
- TEXT PRE-PROCESSING: clean the text
- Replace special characters with " " (using the tm_map function)
- Convert all text to lower case
- Remove punctuation
- Remove whitespace
- Remove stopwords
- Remove numbers
- Stem the document
- Create the term-document matrix
- Convert into a matrix and find the frequency of words
- Convert into a data frame
- TEXT EXPLORATION: find the words which appear most and least frequently
(Figure: top 5 frequent words)
- Create Wordcloud
(Figure: word cloud)
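The steps above are written for R's tm package. The same lowercase/strip-punctuation/drop-stopwords/count pipeline can be sketched in Python (a minimal illustration on synthetic reviews standing in for the ~5,000 real ones; the stopword list here is a toy subset of what tm ships):

```python
import re
from collections import Counter

# Tiny set of sample reviews standing in for the real data.
reviews = [
    "Great taste, will buy again!",
    "The taste was bland and the price too high.",
    "Fast shipping; great price and great taste.",
]

# Minimal stopword list for the sketch (tm's english list is much larger).
stopwords = {"the", "and", "was", "too", "will"}

def preprocess(text):
    # Lower-case, then strip punctuation and numbers.
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # Tokenise on whitespace and drop stopwords.
    return [t for t in text.split() if t not in stopwords]

# Term frequencies across the whole "corpus" (the term-document-matrix
# row sums from the R pipeline collapse to this).
freq = Counter(t for r in reviews for t in preprocess(r))
print(freq.most_common(3))  # the most frequent words
```

Stemming (SnowballC in R, nltk's SnowballStemmer in Python) would slot in between tokenisation and counting.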
The whole data and source can be found at https://emilhvitfeldt.github.io/friends/
"The goal of friends is to provide the complete script transcription of the Friends sitcom. The data originates from the Character Mining repository which includes references to scientific explorations using this data. This package simply provides the data in tibble format instead of json files."
friends.csv - Contains the scenes and lines for each character, including season and episodes.
friends_emotions.csv - Contains sentiments for each scene - for the first four seasons only.
friends_info.csv - Contains information regarding each episode, such as imdb_rating, views, episode title and directors.
This is the output of the Stack Rudeness kernel (https://www.kaggle.com/ojwatson/stack-rudeness), as saved in Cell 17.
Stack Overflow answers by the Top 10 r and python users extracted using BigQuery. Also includes data on whether the answer was accepted and some additional data based on sentiment analysis of the answer text.
BigQuery and StackOverflow
If you wish to use this data please cite:
Katarzyna Baraniak, Marcin Sydow, A dataset for Sentiment analysis of Entities in News headlines (SEN), Procedia Computer Science, Volume 192, 2021, Pages 3627-3636, ISSN 1877-0509, https://doi.org/10.1016/j.procs.2021.09.136. (https://www.sciencedirect.com/science/article/pii/S1877050921018755)
bibtex: users.pja.edu.pl/~msyd/bibtex/sydow-baraniak-SENdataset-kes21.bib
SEN is a novel publicly available human-labelled dataset for training and testing machine learning algorithms for the problem of entity level sentiment analysis of political news headlines.
On-line news portals play a very important role in the information society. Fair media should present reliable and objective information. In practice there is an observable positive or negative bias concerning named entities (e.g. politicians) mentioned in the on-line news headlines. Our dataset consists of 3819 human-labelled political news headlines coming from several major on-line media outlets in English and Polish.
Each record contains a news headline, a named entity mentioned in the headline, and a human-annotated label (one of “positive”, “neutral”, “negative”). Our SEN dataset package consists of 2 parts: SEN-en (English headlines, split into SEN-en-R and SEN-en-AMT) and SEN-pl (Polish headlines). Each headline-entity pair was annotated either by a team of volunteer researchers (the whole SEN-pl dataset and a subset of 1271 English records: the SEN-en-R subset, “R” for “researchers”) or via the Amazon Mechanical Turk service (a subset of 1360 English records: the SEN-en-AMT subset).
During analysis of the annotations, outlying annotations were identified and removed. A separate version of the dataset without outliers is marked by "noutliers" in the data file name.
Details of the process of preparing the dataset, and an analysis of it, are presented in the paper.
In case of any questions, please contact one of the authors. Email addresses are in the paper.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Logistic regression model, LDA.
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
This study analyzes the dynamic evolution patterns of emotional states based on 207 Weibo posts using computational linguistics methods. The research encompasses a complete pipeline including data collection, text cleaning, sentiment analysis, co-occurrence network construction, and Markov chain modeling. The dataset contains comprehensive R code implementations, processed sentiment-annotated data, co-occurrence network matrices, transition probability matrices, and visualization results, providing a reproducible computational framework for social media emotion dynamics research.
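The Markov chain modelling step estimates transition probabilities between successive emotional states. The study's implementation is in R, but the estimation itself can be sketched as follows (the state sequence below is illustrative; the real labels come from the sentiment-annotated Weibo posts):

```python
from collections import Counter, defaultdict

# Illustrative sequence of per-post emotional states.
states = ["positive", "negative", "negative", "neutral", "positive",
          "positive", "neutral", "negative", "positive"]

# Count transitions between consecutive states.
counts = defaultdict(Counter)
for a, b in zip(states, states[1:]):
    counts[a][b] += 1

# Row-normalise counts into transition probabilities P(b | a).
P = {a: {b: n / sum(c.values()) for b, n in c.items()}
     for a, c in counts.items()}
print(P["positive"])
```

Each row of the resulting matrix sums to 1, giving the probability of moving from one emotional state to the next.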
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
This repository contains a sentiment lexicon for Italian, based on SentiWordNet 3.0 (Baccianella, Esuli, and Sebastiani 2010; Esuli [2019] 2025) and MultiWordNet (Pianta, Bentivogli, and Girardi 2002).
Unlike previous resources—SentiWordNet, which provides sentiment scores without Italian lexical coverage, and MultiWordNet, which offers Italian synsets without sentiment annotation—this dataset bridges the two by mapping Italian lexical entries to sentiment scores in a ready-to-use CSV format.
This integration enables direct use in sentiment analysis and other NLP applications for Italian, filling a gap in existing resources.
The included files, in the data/ folder, are:
swn_it.csv: A dataset of 35,001 Italian synsets with polarity scores, POS, synset, offset, English synset lemmas, and gloss (in English).
swn_it_tidy.csv: A tidy (one token per row) dataset of 41,725 lemmas, with polarity scores. It is designed for use in R.

The repository also contains a folder with examples in R, and scripts to use and manipulate the datasets:
examples-R/:
custom_dataset.R: Create a custom tidy dataset from the original one, for treating duplicate entries differently.
example.R: Examples of how to use the dataset for sentiment analysis on a sample text.
uso.md: Instructions for using the dataset in R (in Italian), referred to in example.R.
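With the tidy (one lemma per row) layout, scoring a text reduces to looking up each token's polarity and aggregating. The repository's worked examples are in R (example.R); the sketch below shows the same idea in Python with a tiny made-up lexicon (the polarity values are invented, not taken from swn_it_tidy.csv):

```python
# Tiny stand-in for the lemma -> polarity mapping in swn_it_tidy.csv.
# Values are invented for illustration.
lexicon = {"buono": 0.625, "cattivo": -0.75, "film": 0.0}

def score(tokens):
    """Sum polarity over tokens found in the lexicon; unknown tokens score 0."""
    return sum(lexicon.get(t, 0.0) for t in tokens)

print(score(["un", "buono", "film"]))   # 0.625
```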
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Datasets for policy preference identification, binary sentiment classification, and stance detection of debates from the House of Commons of the United Kingdom Parliament.
For details, see:
ParlVote: G. Abercrombie and R. Batista-Navarro. ParlVote: A Corpus for Sentiment Analysis of Political Debates. Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC-2020). European Languages Resources Association (ELRA), 2020.
ParlVote+: Paper under review. This version includes policy preference labels for each example. It has also been cleaned up a little, and some incorrect examples from the original dataset have been removed.
Data published under the Open Parliament Licence v3.0 : https://www.parliament.uk/site-information/copyright-parliament/open-parliament-licence/
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Numbers of posts per package.
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
Replication materials for "A Review of Best Practice Recommendations for Text-Analysis in R (and a User Friendly App)". You can also find these materials on GitHub repo (https://github.com/wesslen/text-analysis-org-science) as well as the Shiny app in the GitHub repo (https://github.com/wesslen/topicApp).
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
R data and script files
This dataset was created by Naveen Karthik R
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Optimized parameters of Random Forest and CatBoost models.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
The development of argumentative text and information comprehension (CoI) skills related to the critical reconstruction of meaning (CT) is crucial in undergraduate education, especially now in the era of social media and AI-mediated information. Generative AI aids in information creation, but its unreflective use can complicate the navigation of complex information. Argument maps (AM), commonly used for analyzing analog and static texts, can help visualize, understand, and rework multimodal and dynamic arguments and information.
Stemming from the Vygotskian idea, our study used a design-based research approach to the use of AMs and ChatGPT as socio-technical artifacts to stimulate and support the understanding of information (CoI) and thus the development of critical thinking (CT). The workshop introduced the multimodal element through a 3-group quasi-experiment. The first group dealt with fully analog texts, the second group used maps with multimodal textual modes, and the third group only interacted with ChatGPT. The research compared the three groups, with a particular focus on the two experimental groups (experimental macro-focus).
The research had three main objectives: 1) to test whether AMs improved students' CoI enhancement and critical processing (CT); 2) to determine whether interaction with ChatGPT supported information reprocessing and critical construction of opinions and assessment tools; and 3) to determine whether interaction with ChatGPT alone, without AMs, still fostered greater integration of information and viewpoints.
Our preliminary analysis showed that AMs improved students' CoI and CT, especially when exposed to multimodal information. ChatGPT interaction increased critical reflection and awareness of AI's role in education. Students using only ChatGPT performed well in argumentative reworking, suggesting that interaction with the chatbot can be effective. However, integrating AMs and ChatGPT could provide optimal support for comprehension and critical thinking skills.
This Zenodo record documents the full analysis process with R (https://cran.r-project.org/bin/windows/base/) and Nvivo (https://lumivero.com/products/nvivo/) and is composed of the following datasets, scripts, and results:
Comprehension of Text and AMs Results - Arg_Map.xlsx
Critical Thinking level - CriThink.xlsx
Descriptive and Inferential Statistics Comprehension and Critical Thinking - Preliminary Analysis.R
Elaboration and Integration Opinion - Opi_G1.xlsx; Opi_G2.xlsx & Opi_G3.xlsx
Descriptive and Inferential Statistics Opinion level - Preliminary Analysis_opi.R
Sentiment Analysis - Sentiment Analysis.R
Vocabulary Frequent words - Vocabulary.csv
Codebook qualitative Analysis with Nvivo (Codebook.xlsx)
Results Nvivo Analysis G1 & G2 - Codebook-ChatGPT_G1&G2.docx
Any comments or improvements are welcome!
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
About Dataset
Context
This is the Twitter Sentiment Analysis dataset. It contains 1 million tweets extracted using the Twitter API. The tweets have been annotated (0 = negative, 4 = positive) and can be used primarily to detect sentiment.
Content: It contains the following 6 fields:
target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
ids: the id of the tweet (e.g., 2087)
date: the date of the tweet (e.g., Sat April 15 23:58:44 UTC 2023)
flag: the query (e.g., lyx). If there is no query, then this value is NO_QUERY.
user: the user that tweeted (e.g., raj713335)
text: the text of the tweet (e.g., Lyx is cool)
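Reading one record in this 6-field format is a plain CSV parse plus a mapping of the numeric polarity codes to labels. A minimal sketch (the row below is assembled from the example values in the field list above, not taken from the actual file):

```python
import csv
import io

# One line in the 6-field format, built from the example values above.
line = '"0","2087","Sat April 15 23:58:44 UTC 2023","NO_QUERY","raj713335","Lyx is cool"'

target, tweet_id, date, flag, user, text = next(csv.reader(io.StringIO(line)))

# Map the numeric polarity codes to sentiment labels.
labels = {"0": "negative", "2": "neutral", "4": "positive"}
print(labels[target], user, text)
```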
Acknowledgments: The official link regarding the dataset, with resources about how it was generated, is here. The official paper detailing the approach is here.
Citation: Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009), p.12.
Inspiration: To detect severity from tweets. You may have a look at this.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Argumentative skills are indispensable both personally and professionally to process complex information (CoI) relating to the critical reconstruction of meaning through critical thinking (CT). This remains a particularly relevant priority, especially in the age of social media and artificial intelligence-mediated information. Recently, the public dissemination of what has been called generative artificial intelligence (GenAI), with the particular example of ChatGPT (OpenAI, 2022), has made it even easier today to access and disseminate information, written or not, true or not. New tools are needed to critically address post-digital information abundance.
In this context, argumentative maps (AMs), which are already used to develop argumentative skills and critical thinking, are studied for multimodal and dynamic information visualization, comprehension, and reprocessing. In this regard, the entry of generative AI into university classrooms proposes a novel scenario of multimodality and technological dynamism.
Building on the Vygotskian idea of mediation and the theory of "dual stimulation" as applied to the use of learning technologies, the idea was to complement AMs with the introduction of a second set of stimuli that would support and enhance individual activity: AI-mediated tools. With AMs, an attempt has been made to create a space for understanding, fixing, and reconstructing information, which is important for the development of argumentative skills. On the other hand, by arranging forms of critical and functional interaction with ChatGPT as an ally in understanding, reformulating, and rethinking one's argumentative perspectives, a new and comprehensive argumentative learning process has been arranged, while also cultivating a deeper understanding of the artificial agents themselves.
Our study was based on a two-group quasi-experiment with 27 students of the “Research Methods in Education” course, to explore the role of AMs in fixing and supporting multimodal information reprocessing. In addition, by providing for the use of the intelligent chatbot ChatGPT, one of the most widely used GenAI technologies, we investigated the evolution of students' perceptions of its potential role as a “study companion” in information comprehension and reprocessing activities, with a path to building a good prompt.
Preliminary analyses showed that in both groups, AMs supported the increase in mean CoI and CT levels for analog and digital information. However, the group with analog texts showed more complete reprocessing. The interaction with the chatbot was analyzed quantitatively and qualitatively, and an initial positive reflection emerged on the potential of ChatGPT, together with increased confidence in interacting with intelligent agents after learning the rules for constructing good prompts.
This Zenodo record documents the full analysis process with R (https://cran.r-project.org/bin/windows/base/) and Nvivo (https://lumivero.com/products/nvivo/) and is composed of the following datasets, scripts, and results:
Comprehension of Text and AMs Results - Arg_G1.xlsx & Arg_G2.xlsx
Opinion and Critical Thinking level - Opi_G1.xlsx & Opi_G2.xlsx
Data for Correlation and Regression - CorRegr_G1.xlsx & CorRegr_G2.xlsx
Interaction with ChatGPT - GPT_G1.xlsx & GPT_G2.xlsx
Descriptive and Inferential Statistics Comprehension and AMs Building - Analysis_RES_Comprehension.R
Descriptive and Inferential Statistics Opinion and Critical Thinking level - Analysis_RES_Opinion.R
Correlation and Regression - Analysis_RES_CorRegr.R
Descriptive and Inferential Statistics Interaction with ChatGPT - Analysis_RES_ChatGPT.R
Sentiment Analysis - Sentiment Analysis_G1.R & Sentiment Analysis_G2.R
Vocabulary Frequent words - Vocabulary.csv
Codebook qualitative Analysis with Nvivo (Codebook.xlsx)
Results Nvivo Analysis G1 - Codebook - ChatGPT2 G1.docx
Results Nvivo Analysis G2 - Codebook - ChatGPT2 G2.docx
Any comments or improvements are welcome!
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
This repository contains the visual features of the CSMV dataset released with the paper "Infer Induced Sentiment of Comment Response to Video: A New Task, Dataset and Baseline". The repository contains feature representations of the micro-videos. Each subfolder is named after a different feature extraction method, and the features for each video are saved as .npy files. The filenames correspond to the video_file_id. Currently, features extracted using I3D (recommended) and R(2+1)D have been released.… See the full description on the dataset page: https://huggingface.co/datasets/jackynix/CSMV_visual.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
This project presents all codes related to the review paper "The relationship between organizational culture, sustainability, and digitalization in SMEs: A systematic review."