100+ datasets found
  1. Reddit Dataset With Sentiment Analysis

    • kaggle.com
    zip
    Updated Jun 5, 2025
    Cite
    Vijay J0shi (2025). Reddit Dataset With Sentiment Analysis [Dataset]. https://www.kaggle.com/datasets/vijayj0shi/reddit-dataset-with-sentiment-analysis
    Available download formats: zip (4119981 bytes)
    Dataset updated
    Jun 5, 2025
    Authors
    Vijay J0shi
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This dataset contains raw data from the Reddit subreddit r/unpopularopinion, collected on June 5, 2025. It includes 100 recent posts, all comments (including sub-comments) on those posts, user details for authors involved in the discussion, and additional posts by those users. Sentiment analysis has been performed on the comments and additional user posts, providing sentiment labels, confidence scores, and derived sentiment scores.

    Dataset Contents

    • users.csv: Contains details of Reddit users involved in the discussion (post authors and commenters).

      • Username: Reddit username.
      • Karma: Total karma (Link_Karma + Comment_Karma).
      • Link_Karma: Karma from posts.
      • Comment_Karma: Karma from comments.
      • Account_Created: Timestamp of account creation.
    • user_posts.csv: Contains additional posts by all unique users involved in the discussion, with sentiment analysis.

      • Username: Post author’s username.
      • Post_ID: Unique post identifier.
      • Title: Post title.
      • Subreddit: Subreddit where the post was made.
      • Score: Upvote/downvote score.
      • URL: Post URL.
      • Sentiment: Sentiment label (e.g., positive, negative, neutral).
      • Confidence: Confidence score of the sentiment prediction.
      • Sentiment_Score: Numerical sentiment score derived from sentiment analysis.
    • posts_df.csv: Contains the initial 100 posts fetched from r/unpopularopinion.

      • Title: Post title.
      • Score: Upvote/downvote score.
      • Post_ID: Unique post identifier.
      • URL: Post URL.
      • Num_Comments: Number of comments on the post.
      • Created: Timestamp of post creation.
      • Text: Post body text.
      • Author: Post author’s username.
    • comments.csv: Contains all comments and sub-comments on the 100 posts, with sentiment analysis.

      • Post_ID: ID of the post the comment belongs to.
      • Post_Title: Title of the post.
      • Comment_ID: Unique comment identifier.
      • Parent_ID: ID of the parent (post or comment), or None for top-level comments.
      • Body: Comment text.
      • Author: Comment author’s username.
      • Score: Upvote/downvote score.
      • Level: 0 for top-level comments, 1 for sub-comments.
      • Sentiment: Sentiment label.
      • Confidence: Confidence score of the sentiment prediction.
      • Sentiment_Score: Numerical sentiment score (inferred column).
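As a minimal sketch of how the comments.csv columns listed above could be used, the following averages Sentiment_Score per Post_ID with the Python standard library. The sample rows and their values are hypothetical and only mimic the documented column names; the helper name mean_sentiment_by_post is also an assumption, not part of the dataset.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows that only mimic the comments.csv column layout described above.
SAMPLE_COMMENTS = """Post_ID,Comment_ID,Level,Sentiment,Sentiment_Score
p1,c1,0,positive,0.8
p1,c2,1,negative,-0.4
p2,c3,0,neutral,0.0
"""

def mean_sentiment_by_post(csv_file):
    """Return {Post_ID: mean Sentiment_Score} for rows read from a CSV file object."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(csv_file):
        totals[row["Post_ID"]] += float(row["Sentiment_Score"])
        counts[row["Post_ID"]] += 1
    return {post_id: totals[post_id] / counts[post_id] for post_id in totals}

# With the real file, io.StringIO(...) would be replaced by open("comments.csv", newline="").
averages = mean_sentiment_by_post(io.StringIO(SAMPLE_COMMENTS))
```

The same grouping could be keyed on Author instead of Post_ID to compare users, provided the file actually contains the columns as described.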

    Collection Method

    The data was collected using the PRAW library to interact with the Reddit API. The pipeline:

    1. Fetched the 100 most recent posts from r/unpopularopinion.
    2. Retrieved all comments and sub-comments on those posts.
    3. Fetched user details (e.g., karma) for all unique authors (post authors and commenters).
    4. Fetched additional posts by those users.
    5. Performed sentiment analysis on comments and additional user posts.

    Potential Uses

    • Sentiment Analysis Research: Analyze the sentiment of Reddit discussions, comparing posts and comments.
    • Content Moderation: Develop algorithms to flag inappropriate content using sentiment and user data.
    • Social Media Analysis: Explore user activity patterns, such as how karma correlates with sentiment or comment scores.
    • NLP Projects: Use the raw text (post titles, bodies, comments) for natural language processing tasks like topic modeling or text classification.

    Notes

    • This dataset is a raw snapshot before preprocessing steps like encoding or scaling. It retains usernames and text data, which are later anonymized in the pipeline.
    • Sentiment analysis was applied to comments and additional user posts, but not to the initial 100 posts in posts_df.csv.
    • The dataset may contain sensitive information (usernames, text). Users should handle it responsibly and consider anonymizing further if needed.
  2. Datasets for Sentiment Analysis

    • zenodo.org
    csv
    Updated Dec 10, 2023
    Cite
    Julie R. Repository creator - Campos Arias; Julie R. Repository creator - Campos Arias (2023). Datasets for Sentiment Analysis [Dataset]. http://doi.org/10.5281/zenodo.10157504
    Available download formats: csv
    Dataset updated
    Dec 10, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Julie R. Repository creator - Campos Arias; Julie R. Repository creator - Campos Arias
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.

    Below are the datasets specified, along with the details of their references, authors, and download sources.

    ----------- STS-Gold Dataset ----------------

    The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.

    Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.

    File name: sts_gold_tweet.csv

    ----------- Amazon Sales Dataset ----------------

    This dataset contains the ratings and reviews of 1K+ Amazon products, as listed on the official Amazon website. The data was scraped from that website in January 2023.

    Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)

    Features:

    • product_id - Product ID
    • product_name - Name of the Product
    • category - Category of the Product
    • discounted_price - Discounted Price of the Product
    • actual_price - Actual Price of the Product
    • discount_percentage - Percentage of Discount for the Product
    • rating - Rating of the Product
    • rating_count - Number of people who voted for the Amazon rating
    • about_product - Description about the Product
    • user_id - ID of the user who wrote review for the Product
    • user_name - Name of the user who wrote review for the Product
    • review_id - ID of the user review
    • review_title - Short review
    • review_content - Long review
    • img_link - Image Link of the Product
    • product_link - Official Website Link of the Product

    License: CC BY-NC-SA 4.0

    File name: amazon.csv

    ----------- Rotten Tomatoes Reviews Dataset ----------------

    This rating inference dataset is a sentiment classification dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5,331 rows contain only negative samples and the last 5,331 rows contain only positive samples, so the data should be shuffled before use.

    This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).
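Because the rows are ordered by label, a shuffle is needed before any train/test split. A minimal sketch, assuming the rows have already been loaded as (review, label) pairs; the four sample rows and the function name shuffled_rows are hypothetical:

```python
import random

def shuffled_rows(rows, seed=0):
    """Return a shuffled copy of rows; the fixed seed keeps the order reproducible."""
    rng = random.Random(seed)
    copy = list(rows)
    rng.shuffle(copy)
    return copy

# Hypothetical miniature of data_rt.csv: negatives (0) first, positives (1) last.
rows = [("bad film", 0), ("dull plot", 0), ("great cast", 1), ("lovely score", 1)]
mixed = shuffled_rows(rows, seed=42)
```

Using a dedicated random.Random instance avoids touching the global random state, so other code in the same session is unaffected.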

    Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics

    File name: data_rt.csv

    ----------- Preprocessed Dataset Sentiment Analysis ----------------

    Preprocessed Amazon product review data for the Gen3EcoDot (Alexa), scraped entirely from amazon.in.
    Stemmed and lemmatized using nltk.
    Sentiment labels are generated using TextBlob polarity scores.

    The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).

    DOI: 10.34740/kaggle/dsv/3877817

    Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }

    This dataset was used in the experimental phase of my research.

    File name: EcoPreprocessed.csv

    ----------- Amazon Earphones Reviews ----------------

    This dataset consists of 9930 Amazon reviews and star ratings for the 10 latest (as of mid-2019) Bluetooth earphone devices, intended for learning how to train a machine for sentiment analysis.

    This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

    The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)

    License: U.S. Government Works

    Source: www.amazon.in

    File name (original): AllProductReviews.csv (contains 14337 reviews)

    File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)

    ----------- Amazon Musical Instruments Reviews ----------------

    This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.

    This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

    The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (time of the review - raw) and division (manually added - categorical label generated using overall score).

    Source: http://jmcauley.ucsd.edu/data/amazon/

    File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)

    File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)

  3. Food Reviews - Text Mining & Sentiment Analysis

    • kaggle.com
    zip
    Updated Aug 4, 2023
    Cite
    vikram amin (2023). Food Reviews - Text Mining & Sentiment Analysis [Dataset]. https://www.kaggle.com/datasets/vikramamin/food-reviews-text-mining-and-sentiment-analysis
    Available download formats: zip (1075643 bytes)
    Dataset updated
    Aug 4, 2023
    Authors
    vikram amin
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Brief description: The Chief Marketing Officer (CMO) of Healthy Foods Inc. wants to understand customer sentiment about the specialty foods that the company offers. This information was collected through customer reviews on the company's website. The dataset consists of about 5000 reviews. The CMO wants answers to the following questions:

    1. What are the most frequently used words in the customer reviews?
    2. How can the data be prepared for text analysis?
    3. What are the overall sentiments towards the products?

    • We will be using text mining and sentiment analysis (R programming) to offer insights to the CMO with regards to the food reviews

    Steps:

    • Set the working directory and read the data.
    • Data cleaning: check for missing values and the data types of the variables.
    • Load the required libraries ("tm", "SnowballC", "dplyr", "sentimentr", "wordcloud2", "RColorBrewer").
    • TEXT ACQUISITION and AGGREGATION: create the corpus.
    • TEXT PRE-PROCESSING: clean the text.
      • Replace special characters with " " (using the tm_map function).
      • Convert all letters to lower case.
      • Remove punctuation.
      • Remove whitespace.
      • Remove stopwords.
      • Remove numbers.
      • Stem the document.
    • Create the term document matrix.
    • Convert it into a matrix and find the frequency of words.
    • Convert it into a data frame.
    • TEXT EXPLORATION: find the words that appear most and least frequently.
    • Create a word cloud.

    • TEXT MODELLING
    • Word association: find the words that tend to appear together with a given word. Here we look for associations of the three most frequent words, "like", "tast" and "flavor", with a correlation limit of 0.2.
    • "like" has an association with "realli" (they appear together about 25% of the time), "dont" (24%) and "one" (21%).
    • "tast" has no association with any word at the set correlation limit.
    • "flavor" has an association with "chip" (they appear together about 27% of the time).
    • Sentiment analysis
    • element_id refers to the review number, sentence_id to the sentence number within the review, and word_count to the number of words in that sentence. Sentiment is either positive or negative.
    • Find the overall sentiment score of all the reviews.
    • The result indicates that the food reviews as a whole have a marginally positive score.
    • Find the sentiment score for each of the 5000 reviews.
    • (-1) indicates the most extreme negative sentiment and (+1) the most extreme positive sentiment.
    • Create a separate data frame for all the negative sentiments. In total there are 726 negative sentiments out of the 5000 reviews (approx. 15%).
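The last step, isolating the negative reviews and reporting their share, can be sketched in a few lines. The original analysis was done in R; this Python version is only an illustration, the sample scores are hypothetical, and treating any score below 0 as negative is an assumption about how the cut-off was chosen.

```python
def negative_share(scores):
    """Return (count, fraction) of scores below zero on the -1 to +1 scale."""
    negatives = [s for s in scores if s < 0]
    return len(negatives), len(negatives) / len(scores)

# Hypothetical per-review sentiment scores.
count, share = negative_share([0.35, -0.2, 0.0, 0.6, -0.75])

# The figure reported in the description: 726 negative out of 5000 reviews.
reported_share = 726 / 5000  # 0.1452, i.e. roughly 15%
```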
  4. Friends - R Package Dataset

    • kaggle.com
    zip
    Updated Nov 11, 2024
    Cite
    Lucas Yukio Imafuko (2024). Friends - R Package Dataset [Dataset]. https://www.kaggle.com/datasets/lucasyukioimafuko/friends-r-package-dataset
    Available download formats: zip (2018791 bytes)
    Dataset updated
    Nov 11, 2024
    Authors
    Lucas Yukio Imafuko
    Description

    The whole data and source can be found at https://emilhvitfeldt.github.io/friends/

    "The goal of friends to provide the complete script transcription of the Friends sitcom. The data originates from the Character Mining repository which includes references to scientific explorations using this data. This package simply provides the data in tibble format instead of json files."

    Content

    • friends.csv - Contains the scenes and lines for each character, including season and episodes.
    • friends_emotions.csv - Contains sentiments for each scene - for the first four seasons only.
    • friends_info.csv - Contains information regarding each episode, such as imdb_rating, views, episode title and directors.

    Uses

    • Text mining, sentiment analysis and word statistics.
    • Data visualizations.
  5. R and Python Stack Overflow Answers + Sentiment

    • kaggle.com
    zip
    Updated May 28, 2019
    Cite
    OJ Watson (2019). R and Python Stack Overflow Answers + Sentiment [Dataset]. https://www.kaggle.com/datasets/ojwatson/stack-overflow-output
    Available download formats: zip (76142440 bytes)
    Dataset updated
    May 28, 2019
    Authors
    OJ Watson
    Description

    Context

    This is the output of the Stack Rudeness kernel (https://www.kaggle.com/ojwatson/stack-rudeness), as saved in Cell 17.

    Content

    Stack Overflow answers by the top 10 R and Python users, extracted using BigQuery. Also includes data on whether each answer was accepted and some additional data based on sentiment analysis of the answer text.

    Acknowledgements

    BigQuery and StackOverflow

  6. SEN - Sentiment analysis of Entities in News headlines

    • data.niaid.nih.gov
    Updated Oct 15, 2023
    Cite
    Katarzyna Baraniak; Marcin Sydow (2023). SEN - Sentiment analysis of Entities in News headlines [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5211931
    Dataset updated
    Oct 15, 2023
    Dataset provided by
    Polish-Japanese Academy of Information Technology
    Polish-Japanese Academy of Information Technology / Institute of Computer Science Polish Academy of Sciences
    Authors
    Katarzyna Baraniak; Marcin Sydow
    Description

    If you wish to use this data please cite:

    Katarzyna Baraniak, Marcin Sydow, A dataset for Sentiment analysis of Entities in News headlines (SEN), Procedia Computer Science, Volume 192, 2021, Pages 3627-3636, ISSN 1877-0509, https://doi.org/10.1016/j.procs.2021.09.136. (https://www.sciencedirect.com/science/article/pii/S1877050921018755)

    bibtex: users.pja.edu.pl/~msyd/bibtex/sydow-baraniak-SENdataset-kes21.bib

    SEN is a novel publicly available human-labelled dataset for training and testing machine learning algorithms for the problem of entity level sentiment analysis of political news headlines.

    On-line news portals play a very important role in the information society. Fair media should present reliable and objective information. In practice there is an observable positive or negative bias concerning named entities (e.g. politicians) mentioned in the on-line news headlines. Our dataset consists of 3819 human-labelled political news headlines coming from several major on-line media outlets in English and Polish.

    Each record contains a news headline, a named entity mentioned in the headline and a human-annotated label (one of "positive", "neutral", "negative"). Our SEN dataset package consists of 2 parts: SEN-en (English headlines, split into SEN-en-R and SEN-en-AMT) and SEN-pl (Polish headlines). Each headline-entity pair was annotated either by a team of volunteer researchers (the whole SEN-pl dataset and a subset of 1271 English records: the SEN-en-R subset, "R" for "researchers") or via the Amazon Mechanical Turk service (a subset of 1360 English records: the SEN-en-AMT subset).

    During analysis of the annotations, outlying annotations were identified and removed. A separate version of the dataset without outliers is marked by "noutliers" in the data file name.

    The paper describes how the dataset was prepared and presents its analysis.

    In case of any questions, please contact one of the authors. Email addresses are in the paper.

  7. Logistic regression model, LDA.

    • plos.figshare.com
    xls
    Updated May 30, 2023
    Cite
    Artur Sokolovsky; Thomas Gross; Jaume Bacardit (2023). Logistic regression model, LDA. [Dataset]. http://doi.org/10.1371/journal.pone.0246464.t003
    Available download formats: xls
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Artur Sokolovsky; Thomas Gross; Jaume Bacardit
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Logistic regression model, LDA.

  8. Weibo Emotional Dynamic Analysis Code Dataset

    • scidb.cn
    Updated Sep 28, 2025
    Cite
    liu xing yu (2025). Weibo Emotional Dynamic Analysis Code Dataset [Dataset]. http://doi.org/10.57760/sciencedb.psych.00767
    Dataset updated
    Sep 28, 2025
    Dataset provided by
    Science Data Bank
    Authors
    liu xing yu
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This study analyzes the dynamic evolution patterns of emotional states based on 207 Weibo posts using computational linguistics methods. The research encompasses a complete pipeline including data collection, text cleaning, sentiment analysis, co-occurrence network construction, and Markov chain modeling. The dataset contains comprehensive R code implementations, processed sentiment-annotated data, co-occurrence network matrices, transition probability matrices, and visualization results, providing a reproducible computational framework for social media emotion dynamics research.
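To illustrate what a transition probability matrix for a Markov chain over emotional states looks like, here is a minimal sketch that estimates first-order transition probabilities from a sequence of labels. The dataset's own implementation is in R and is not reproduced here; the label sequence and the function name transition_matrix are hypothetical.

```python
from collections import Counter, defaultdict

def transition_matrix(states):
    """Estimate first-order Markov transition probabilities from a state sequence."""
    counts = defaultdict(Counter)
    # Count each observed (current state -> next state) pair.
    for current, nxt in zip(states, states[1:]):
        counts[current][nxt] += 1
    # Normalise each row so the outgoing probabilities sum to 1.
    return {
        current: {nxt: n / sum(following.values()) for nxt, n in following.items()}
        for current, following in counts.items()
    }

# Hypothetical sequence of sentiment labels for consecutive posts.
sequence = ["positive", "negative", "negative", "positive", "positive", "negative"]
probs = transition_matrix(sequence)
```

Each row of the resulting nested dictionary is the distribution over the next state given the current one, which is the quantity a transition matrix encodes.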

  9. sentiwordnet_it 1.0

    • zenodo.org
    zip
    Updated Oct 8, 2025
    Cite
    Agnese Vardanega; Agnese Vardanega (2025). sentiwordnet_it 1.0 [Dataset]. http://doi.org/10.5281/zenodo.17248245
    Available download formats: zip
    Dataset updated
    Oct 8, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Agnese Vardanega; Agnese Vardanega
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This repository contains a sentiment lexicon for Italian, based on SentiWordNet 3.0 (Baccianella, Esuli, and Sebastiani 2010; Esuli [2019] 2025) and MultiWordNet (Pianta, Bentivogli, and Girardi 2002).

    Unlike previous resources—SentiWordNet, which provides sentiment scores without Italian lexical coverage, and MultiWordNet, which offers Italian synsets without sentiment annotation—this dataset bridges the two by mapping Italian lexical entries to sentiment scores in a ready-to-use CSV format.

    This integration enables direct use in sentiment analysis and other NLP applications for Italian, filling a gap in existing resources.

    The included files, in the data/ folder are:

    • swn_it.csv: A dataset of 35,001 Italian synsets with polarity scores, POS, synset, offset, English synset lemmas, and gloss (in English).
    • swn_it_tidy.csv: A tidy (one token per row) dataset of 41,725 lemmas, with polarity scores. It is designed for use in R.

    It also contains a folder with examples in R, and scripts to use and manipulate the datasets:

    • examples-R/:
      • custom_dataset.R: Create a custom tidy dataset from the original one, for treating duplicate entries differently.
      • example.R: Examples of how to use the dataset for sentiment analysis on a sample text.
      • uso.md: Instructions for using the dataset in R (in Italian), referred to in example.R.
  10. ParlVote: Corpora for Sentiment Analysis of Political Debatess

    • data.mendeley.com
    Updated Jul 11, 2020
    Cite
    Gavin Abercrombie (2020). ParlVote: Corpora for Sentiment Analysis of Political Debatess [Dataset]. http://doi.org/10.17632/czjfwgs9tm.2
    Dataset updated
    Jul 11, 2020
    Authors
    Gavin Abercrombie
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Datasets for policy preference identification, binary sentiment classification, and stance detection of debates from the House of Commons of the United Kingdom Parliament.

    For details, see:

    ParlVote: G. Abercrombie and R. Batista-Navarro. ParlVote: A Corpus for Sentiment Analysis of Political Debates. Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC-2020). European Languages Resources Association (ELRA), 2020.

    ParlVote+: Paper under review. This version includes policy preference labels for each example. It has also been cleaned up a little, and some incorrect examples from the original dataset have been removed.

    Data published under the Open Parliament Licence v3.0 : https://www.parliament.uk/site-information/copyright-parliament/open-parliament-licence/

  11. Numbers of posts per package.

    • plos.figshare.com
    xls
    Updated Jun 12, 2023
    Cite
    Artur Sokolovsky; Thomas Gross; Jaume Bacardit (2023). Numbers of posts per package. [Dataset]. http://doi.org/10.1371/journal.pone.0246464.t001
    Available download formats: xls
    Dataset updated
    Jun 12, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Artur Sokolovsky; Thomas Gross; Jaume Bacardit
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Numbers of posts per package.

  12. Replication Data for: A Review of Best Practice Recommendations for...

    • dataverse-staging.rdmc.unc.edu
    • datasearch.gesis.org
    Updated Nov 7, 2017
    Cite
    Ryan Wesslen; Ryan Wesslen (2017). Replication Data for: A Review of Best Practice Recommendations for Text-Analysis in R (and a User Friendly App) [Dataset]. http://doi.org/10.15139/S3/R4W7ZS
    Available download formats: csv (1070619), application/x-rlang-transport (1014184), pdf (76215), text/x-r-markdown (14242), text/x-r-markdown (12162), html (2930583), application/x-rlang-transport (2108553), docx (24677), html (2442743), html (1689406), text/markdown (1958), application/x-rlang-transport (1623238), text/x-r-markdown (12252)
    Dataset updated
    Nov 7, 2017
    Dataset provided by
    UNC Dataverse
    Authors
    Ryan Wesslen; Ryan Wesslen
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Replication materials for "A Review of Best Practice Recommendations for Text-Analysis in R (and a User Friendly App)". These materials are also available in the GitHub repo (https://github.com/wesslen/text-analysis-org-science), and the Shiny app is in a separate GitHub repo (https://github.com/wesslen/topicApp).

  13. R - Data and Script Files

    • figshare.com
    txt
    Updated Sep 5, 2025
    Cite
    Carter Emerson (2025). R - Data and Script Files [Dataset]. http://doi.org/10.6084/m9.figshare.30066598.v1
    Available download formats: txt
    Dataset updated
    Sep 5, 2025
    Dataset provided by
    figshare
    Figshare (http://figshare.com/)
    Authors
    Carter Emerson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    R data and script files

  14. Movie-review_SentAnlsys

    • kaggle.com
    zip
    Updated Dec 31, 2023
    Cite
    Naveen Karthik R (2023). Movie-review_SentAnlsys [Dataset]. https://www.kaggle.com/datasets/naveenkarthikr/movie-review-sentanlsys
    Available download formats: zip (2092198 bytes)
    Dataset updated
    Dec 31, 2023
    Authors
    Naveen Karthik R
    Description

    Dataset

    This dataset was created by Naveen Karthik R

    Contents

  15. Optimized parameters of Random Forest and CatBoost models.

    • figshare.com
    xls
    Updated Jun 12, 2023
    Cite
    Artur Sokolovsky; Thomas Gross; Jaume Bacardit (2023). Optimized parameters of Random Forest and CatBoost models. [Dataset]. http://doi.org/10.1371/journal.pone.0246464.t005
    Available download formats: xls
    Dataset updated
    Jun 12, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Artur Sokolovsky; Thomas Gross; Jaume Bacardit
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Optimized parameters of Random Forest and CatBoost models.

  16. Toward multimodal information and AI interaction: a quasi-experiment with...

    • data.niaid.nih.gov
    Updated Aug 5, 2024
    Cite
    Crudele, Francesca; Raffaghelli, Juliana Elisa (2024). Toward multimodal information and AI interaction: a quasi-experiment with ChatGPT [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13220545
    Dataset updated
    Aug 5, 2024
    Dataset provided by
    University of Padua
    Authors
    Crudele, Francesca; Raffaghelli, Juliana Elisa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The development of argumentative text and information comprehension (CoI) skills related to the critical reconstruction of meaning (CT) is crucial in undergraduate education, especially now in the era of social media and AI-mediated information. Generative AI aids in information creation, but its unreflective use can complicate the navigation of complex information. Argument maps (AM), commonly used for analyzing analog and static texts, can help visualize, understand, and rework multimodal and dynamic arguments and information.

    Stemming from the Vygotskian idea, our study used a design-based research approach on the use of AMs and ChatGPT as socio-technical artifacts to stimulate and support the understanding of information (CoI) and thus the development of critical thinking (CT). The workshop introduced the multimodal element through a 3-group quasi-experiment. The first group dealt with fully analog texts, the second group used maps with multimodal textual modes, and the third group only interacted with ChatGPT. The research compared the three groups, with a particular focus on the two experimental groups (experimental macro-focus).

    The research had three main objectives: 1) to test whether AMs improved students' CoI enhancement and critical processing (CT); 2) to determine whether interaction with ChatGPT supported information reprocessing and critical construction of opinions and assessment tools; and 3) to determine whether interaction with ChatGPT alone, without AMs, still fostered greater integration of information and viewpoints.

    Our preliminary analysis showed that AMs improved students' CoI and CT, especially when exposed to multimodal information. ChatGPT interaction increased critical reflection and awareness of AI's role in education. Students using only ChatGPT performed well in argumentative reworking, suggesting that interaction with the chatbot can be effective. However, integrating AMs and ChatGPT could provide optimal support for comprehension and critical thinking skills.

    This Zenodo record follows the full analysis process with R (https://cran.r-project.org/bin/windows/base/) and Nvivo (https://lumivero.com/products/nvivo/) and is composed of the following datasets, scripts, and results:

    1. Comprehension of Text and AMs Results - Arg_Map.xlsx

    2. Critical Thinking level - CriThink.xlsx

    3. Descriptive and Inferential Statistics Comprehension and Critical Thinking - Preliminary Analysis.R

    4. Elaboration and Integration Opinion - Opi_G1.xlsx; Opi_G2.xlsx & Opi_G3.xlsx

    5. Descriptive and Inferential Statistics Opinion level - Preliminary Analysis_opi.R

    6. Sentiment Analysis - Sentiment Analysis.R

    7. Vocabulary Frequent words - Vocabulary.csv

    8. Codebook qualitative Analysis with Nvivo (Codebook.xlsx)

    9. Results Nvivo Analysis G1 & G2 - Codebook-ChatGPT_G1&G2.docx

    Any comments or improvements are welcome!

  17. Twitter Sentiment Analysis

    • kaggle.com
    zip
    Updated Apr 16, 2023
    raj713335 (2023). Twitter Sentiment Analysis [Dataset]. https://www.kaggle.com/datasets/raj713335/twittesentimentanalysis/discussion
    Explore at:
    zip (84855617 bytes)
    Dataset updated
    Apr 16, 2023
    Authors
    raj713335
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    About Dataset

    Context

    This is the Twitter Sentiment Analysis dataset. It contains 1 million tweets extracted using the Twitter Opensource API. The tweets have been annotated (0 = negative, 4 = positive) and can be used primarily to detect sentiment.

    Content

    It contains the following 6 fields:

    target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)

    ids: The id of the tweet (2087)

    date: the date of the tweet (Sat April 15 23:58:44 UTC 2023)

    flag: The query (lyx). If there is no query, then this value is NO_QUERY.

    user: The user that tweeted (raj713335)

    text: the text of the tweet (Lyx is cool)
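As an illustration of the six-field layout above, the following minimal sketch parses rows in that order and maps the numeric target to a label. The column order, the absence of a header row, and the sample rows are assumptions made for illustration; the listing does not state the actual file name or encoding.

```python
import csv
import io

# Field order as listed in the description; assumed to match the CSV column order.
FIELDS = ["target", "ids", "date", "flag", "user", "text"]

# Polarity codes from the description.
POLARITY = {0: "negative", 2: "neutral", 4: "positive"}


def read_tweets(handle):
    """Yield one dict per row, adding a human-readable 'label' key."""
    reader = csv.DictReader(handle, fieldnames=FIELDS)
    for row in reader:
        row["target"] = int(row["target"])
        row["label"] = POLARITY.get(row["target"], "unknown")
        yield row


# Hypothetical sample rows standing in for the real file.
SAMPLE = (
    '"4","2087","Sat April 15 23:58:44 UTC 2023","lyx","raj713335","Lyx is cool"\n'
    '"0","2088","Sat April 15 23:59:01 UTC 2023","NO_QUERY","someone","Not a fan"\n'
)

tweets = list(read_tweets(io.StringIO(SAMPLE)))
for tweet in tweets:
    print(tweet["label"], "-", tweet["text"])
```

For the real file, the same function could be given an open file handle instead of the in-memory sample.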

    Acknowledgments

    The official link regarding the dataset, with resources about how it was generated, is here. The official paper detailing the approach is here.

    Citation: Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009), p.12.

    Inspiration

    To detect severity from tweets. You may have a look at this.

  18. "AI as an Ally?" : AI mediation tools to support undergraduates'...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 5, 2024
    Crudele, Francesca; Raffaghelli, Juliana Elisa (2024). "AI as an Ally?" : AI mediation tools to support undergraduates' argumentative skills [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13170804
    Explore at:
    Dataset updated
    Aug 5, 2024
    Dataset provided by
    University of Padua
    Authors
    Crudele, Francesca; Raffaghelli, Juliana Elisa
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Argumentative skills are indispensable both personally and professionally to process complex information (CoI) relating to the critical reconstruction of meaning through critical thinking (CT). This remains a particularly relevant priority, especially in the age of social media and artificial intelligence-mediated information. Recently, the public dissemination of what has been called generative artificial intelligence (GenAI), with the particular example of ChatGPT (OpenAI, 2022), has made it even easier today to access and disseminate information, written or not, true or not. New tools are needed to critically address post-digital information abundance.

    In this context, argumentative maps (AMs), which are already used to develop argumentative skills and critical thinking, are studied for multimodal and dynamic information visualization, comprehension, and reprocessing. In this regard, the entry of generative AI into university classrooms proposes a novel scenario of multimodality and technological dynamism.

    Building on the Vygotskian idea of mediation and the theory of "dual stimulation" as applied to the use of learning technologies, the idea was to complement AMs with the introduction of a second set of stimuli that would support and enhance individual activity: AI-mediated tools. With AMs, an attempt has been made to create a space for understanding, fixing, and reconstructing information, which is important for the development of argumentative skills. On the other hand, by arranging forms of critical and functional interaction with ChatGPT as an ally in understanding, reformulating, and rethinking one's argumentative perspectives, a new and comprehensive argumentative learning process has been arranged, while also cultivating a deeper understanding of the artificial agents themselves.

    Our study was based on a two-group quasi-experiment with 27 students of the “Research Methods in Education” course, to explore the role of AMs in fixing and supporting multimodal information reprocessing. In addition, by predicting the use of the intelligent chatbot ChatGPT, one of the most widely used GenAI technologies, we investigated the evolution of students' perceptions of its potential role as a “study companion” in information comprehension and reprocessing activities with a path to build a good prompt.

    Preliminary analyses showed that in both groups, AMs supported the increase in mean CoI and CT levels for analog and digital information. However, the group with analog texts showed more complete reprocessing. The interaction with the chatbot was analyzed quantitatively and qualitatively; an initial positive reflection on the potential of ChatGPT emerged, along with increased confidence in interacting with intelligent agents after learning the rules for constructing good prompts.

    This Zenodo record follows the full analysis process with R (https://cran.r-project.org/bin/windows/base/) and Nvivo (https://lumivero.com/products/nvivo/) and is composed of the following datasets, scripts, and results:

    1. Comprehension of Text and AMs Results - Arg_G1.xlsx & Arg_G2.xlsx

    2. Opinion and Critical Thinking level - Opi_G1.xlsx & Opi_G2.xlsx

    3. Data for Correlation and Regression - CorRegr_G1.xlsx & CorRegr_G2.xlsx

    4. Interaction with ChatGPT - GPT_G1.xlsx & GPT_G2.xlsx

    5. Descriptive and Inferential Statistics Comprehension and AMs Building - Analysis_RES_Comprehension.R

    6. Descriptive and Inferential Statistics Opinion and Critical Thinking level - Analysis_RES_Opinion.R

    7. Correlation and Regression - Analysis_RES_CorRegr.R

    8. Descriptive and Inferential Statistics Interaction with ChatGPT - Analysis_RES_ChatGPT.R

    9. Sentiment Analysis - Sentiment Analysis_G1.R & Sentiment Analysis_G2.R

    10. Vocabulary Frequent words - Vocabulary.csv

    11. Codebook qualitative Analysis with Nvivo (Codebook.xlsx)

    12. Results Nvivo Analysis G1 - Codebook - ChatGPT2 G1.docx

    13. Results Nvivo Analysis G2 - Codebook - ChatGPT2 G2.docx

    Any comments or improvements are welcome!

  19. CSMV_visual

    • huggingface.co
    Updated Apr 3, 2025
    jackynix (2025). CSMV_visual [Dataset]. https://huggingface.co/datasets/jackynix/CSMV_visual
    Explore at:
    Dataset updated
    Apr 3, 2025
    Authors
    jackynix
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the visual features of the CSMV dataset released in the paper "Infer Induced Sentiment of Comment Response to Video: A New Task, Dataset and Baseline". The repository contains feature representations of the micro-videos. Each subfolder is named after a different feature extraction method, and the features for each video are saved as .npy files. The filenames correspond to the video_file_id. Currently, features extracted using I3D (recommended) and R(2+1)D have been released. … See the full description on the dataset page: https://huggingface.co/datasets/jackynix/CSMV_visual.
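The folder layout described above (one subfolder per feature extraction method, one .npy file per video_file_id) can be indexed with a short sketch like the one below. The folder and file names are hypothetical stand-ins created in a temporary directory; the real subfolder names in the repository may differ.

```python
from pathlib import Path
import tempfile


def index_features(root):
    """Map feature-method name -> {video_file_id: path to its .npy file}."""
    index = {}
    for method_dir in sorted(Path(root).iterdir()):
        if not method_dir.is_dir():
            continue
        index[method_dir.name] = {
            npy.stem: npy for npy in sorted(method_dir.glob("*.npy"))
        }
    return index


# Build a tiny stand-in tree; folder and file names are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    for method in ("I3D", "R2plus1D"):
        folder = Path(tmp) / method
        folder.mkdir()
        (folder / "video_0001.npy").touch()
    features = index_features(tmp)
    methods = sorted(features)
    ids = sorted(features["I3D"])

print(methods, ids)
```

With NumPy installed, each indexed path could then be passed to `numpy.load` to read the feature array for that video.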

  20. R Code for Systematic Review and Meta Analysis

    • data.mendeley.com
    Updated May 22, 2020
    Carmen Isensee (2020). R Code for Systematic Review and Meta Analysis [Dataset]. http://doi.org/10.17632/hympskpm3x.1
    Explore at:
    Dataset updated
    May 22, 2020
    Authors
    Carmen Isensee
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This project presents all code related to the review paper "The relationship between organizational culture, sustainability, and digitalization in SMEs: A systematic review."

Vijay J0shi (2025). Reddit Dataset With Sentiment Analysis [Dataset]. https://www.kaggle.com/datasets/vijayj0shi/reddit-dataset-with-sentiment-analysis

Reddit Dataset With Sentiment Analysis

Sentiment Analysis of Posts and Comments

Explore at:
zip (4119981 bytes)
Dataset updated
Jun 5, 2025
Authors
Vijay J0shi
License

Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically

Description

This dataset contains raw data from the Reddit subreddit r/unpopularopinion, collected on June 5, 2025. It includes 100 recent posts, all comments (including sub-comments) on those posts, user details for authors involved in the discussion, and additional posts by those users. Sentiment analysis has been performed on the comments and additional user posts, providing sentiment labels, confidence scores, and derived sentiment scores.

Dataset Contents

  • users.csv: Contains details of Reddit users involved in the discussion (post authors and commenters).

    • Username: Reddit username.
    • Karma: Total karma (Link_Karma + Comment_Karma).
    • Link_Karma: Karma from posts.
    • Comment_Karma: Karma from comments.
    • Account_Created: Timestamp of account creation.
  • user_posts.csv: Contains additional posts by all unique users involved in the discussion, with sentiment analysis.

    • Username: Post author’s username.
    • Post_ID: Unique post identifier.
    • Title: Post title.
    • Subreddit: Subreddit where the post was made.
    • Score: Upvote/downvote score.
    • URL: Post URL.
    • Sentiment: Sentiment label (e.g., positive, negative, neutral).
    • Confidence: Confidence score of the sentiment prediction.
    • Sentiment_Score: Numerical sentiment score derived from sentiment analysis.
  • posts_df.csv: Contains the initial 100 posts fetched from r/unpopularopinion.

    • Title: Post title.
    • Score: Upvote/downvote score.
    • Post_ID: Unique post identifier.
    • URL: Post URL.
    • Num_Comments: Number of comments on the post.
    • Created: Timestamp of post creation.
    • Text: Post body text.
    • Author: Post author’s username.
  • comments.csv: Contains all comments and sub-comments on the 100 posts, with sentiment analysis.

    • Post_ID: ID of the post the comment belongs to.
    • Post_Title: Title of the post.
    • Comment_ID: Unique comment identifier.
    • Parent_ID: ID of the parent (post or comment), or None for top-level comments.
    • Body: Comment text.
    • Author: Comment author’s username.
    • Score: Upvote/downvote score.
    • Level: 0 for top-level comments, 1 for sub-comments.
    • Sentiment: Sentiment label.
    • Confidence: Confidence score of the sentiment prediction.
    • Sentiment_Score: Numerical sentiment score (inferred column).
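
Two properties of the schema above lend themselves to a quick check: Karma should equal Link_Karma + Comment_Karma, and Parent_ID links comments into threads. The sketch below uses hypothetical in-memory rows with a subset of the listed columns, and it assumes top-level comments carry the literal string "None" (or an empty value) in Parent_ID, which the description suggests but does not confirm.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows that follow the column names listed above.
USERS_CSV = """Username,Karma,Link_Karma,Comment_Karma,Account_Created
alice,150,50,100,2020-01-01 00:00:00
bob,30,10,25,2021-06-15 12:00:00
"""

COMMENTS_CSV = """Post_ID,Comment_ID,Parent_ID,Author,Score,Level
p1,c1,None,alice,5,0
p1,c2,c1,bob,2,1
p1,c3,c1,alice,1,1
"""


def karma_mismatches(handle):
    """Return usernames whose Karma differs from Link_Karma + Comment_Karma."""
    return [
        row["Username"]
        for row in csv.DictReader(handle)
        if int(row["Karma"]) != int(row["Link_Karma"]) + int(row["Comment_Karma"])
    ]


def replies_by_parent(handle):
    """Group Comment_IDs under their Parent_ID; top-level comments go under None."""
    children = defaultdict(list)
    for row in csv.DictReader(handle):
        parent = None if row["Parent_ID"] in ("", "None") else row["Parent_ID"]
        children[parent].append(row["Comment_ID"])
    return dict(children)


mismatched = karma_mismatches(io.StringIO(USERS_CSV))
threads = replies_by_parent(io.StringIO(COMMENTS_CSV))
print(mismatched)
print(threads)
```

On the real files, the same functions could be given handles opened on users.csv and comments.csv.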

Collection Method

The data was collected using the PRAW library to interact with the Reddit API. The pipeline:

  1. Fetched the 100 most recent posts from r/unpopularopinion.
  2. Retrieved all comments and sub-comments on those posts.
  3. Fetched user details (e.g., karma) for all unique authors (post authors and commenters).
  4. Fetched additional posts by those users.
  5. Performed sentiment analysis on comments and additional user posts.
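
The first three steps of the pipeline might look roughly like the sketch below. This is not the author's code: the function names and the row layout are hypothetical, the mapping of Parent_ID and Level to PRAW's `parent_id` prefix is an assumption, and the sentiment step is left out because the description does not name the model used. Running it against Reddit requires the third-party `praw` package and API credentials.

```python
def make_client(client_id, client_secret, user_agent):
    """Create a PRAW client; credentials come from a Reddit app registration."""
    import praw  # third-party: pip install praw

    return praw.Reddit(
        client_id=client_id, client_secret=client_secret, user_agent=user_agent
    )


def fetch_posts(reddit, subreddit_name="unpopularopinion", limit=100):
    """Step 1: newest submissions in the subreddit."""
    return list(reddit.subreddit(subreddit_name).new(limit=limit))


def fetch_comments(submission):
    """Step 2: all comments and sub-comments as flat rows."""
    submission.comments.replace_more(limit=None)  # expand "load more" stubs
    rows = []
    for comment in submission.comments.list():
        # PRAW prefixes parent ids: "t3_" is a submission, "t1_" a comment.
        top_level = comment.parent_id.startswith("t3_")
        rows.append({
            "Post_ID": submission.id,
            "Comment_ID": comment.id,
            "Parent_ID": None if top_level else comment.parent_id,
            "Author": str(comment.author) if comment.author else None,
            "Body": comment.body,
            "Score": comment.score,
            "Level": 0 if top_level else 1,  # assumed reading of the Level column
        })
    return rows


def fetch_user(reddit, username):
    """Step 3: karma details for one author."""
    redditor = reddit.redditor(username)
    return {
        "Username": username,
        "Link_Karma": redditor.link_karma,
        "Comment_Karma": redditor.comment_karma,
        "Karma": redditor.link_karma + redditor.comment_karma,
        "Account_Created": redditor.created_utc,
    }
```

Keeping the client creation in its own function lets the fetch helpers be exercised with stand-in objects before any network access is involved.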

Potential Uses

  • Sentiment Analysis Research: Analyze the sentiment of Reddit discussions, comparing posts and comments.
  • Content Moderation: Develop algorithms to flag inappropriate content using sentiment and user data.
  • Social Media Analysis: Explore user activity patterns, such as how karma correlates with sentiment or comment scores.
  • NLP Projects: Use the raw text (post titles, bodies, comments) for natural language processing tasks like topic modeling or text classification.
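
For the karma-versus-sentiment question mentioned above, one simple starting point is a Pearson correlation between each user's total Karma and their mean Sentiment_Score. The sketch below uses made-up values, not figures from the dataset, and assumes the per-user means have already been computed from comments.csv.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-user values: total Karma and mean Sentiment_Score of comments.
karma = [100, 200, 300, 400]
mean_sentiment = [0.1, 0.2, 0.3, 0.4]

r = pearson(karma, mean_sentiment)
print(round(r, 3))
```

Karma values on Reddit tend to be heavily skewed, so a rank-based measure or a log transform may be worth considering before reading much into the coefficient.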

Notes

  • This dataset is a raw snapshot before preprocessing steps like encoding or scaling. It retains usernames and text data, which are later anonymized in the pipeline.
  • Sentiment analysis was applied to comments and additional user posts, but not to the initial 100 posts in posts_df.csv.
  • The dataset may contain sensitive information (usernames, text). Users should handle it responsibly and consider anonymizing further if needed.
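
One possible way to anonymize usernames further is to replace them with keyed hashes, so the same user maps to the same alias across users.csv, comments.csv, and user_posts.csv. The sketch below is an illustration only, not the anonymization used later in the author's pipeline; the key and the `user_` prefix are hypothetical.

```python
import hashlib
import hmac


def pseudonymize(username, secret_key):
    """Replace a username with a keyed SHA-256 digest (stable for a given key)."""
    digest = hmac.new(secret_key, username.encode("utf-8"), hashlib.sha256)
    return "user_" + digest.hexdigest()[:16]


# Hypothetical key; in practice keep it secret and out of the published data.
KEY = b"replace-with-a-random-secret"

alias_a = pseudonymize("example_user", KEY)
alias_b = pseudonymize("example_user", KEY)
alias_c = pseudonymize("another_user", KEY)
print(alias_a)
```

Hashing the Username and Author columns does not cover usernames that are mentioned inside comment or post text, which would need separate handling.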