Sentiment Analysis Dataset
Overview
This dataset is designed for sentiment analysis tasks, offering a balanced and pre-processed collection of labeled text data. The dataset includes three sentiment labels:
0: Negative
1: Neutral
2: Positive
The training dataset has been oversampled to ensure balanced label distribution, making it suitable for training robust sentiment analysis models. The validation and test datasets remain unaltered to preserve the original… See the full description on the dataset page: https://huggingface.co/datasets/syedkhalid076/Sentiment-Analysis-Over-sampled.
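As a quick start, here is a minimal sketch of loading this dataset from the Hugging Face Hub; the exact split names are an assumption based on the description above.

```python
# Hedged sketch: load the oversampled sentiment dataset from the Hub.
# Split names (train/validation/test) are assumed from the description.
from datasets import load_dataset

ds = load_dataset("syedkhalid076/Sentiment-Analysis-Over-sampled")
print(ds)  # expected: oversampled train split plus untouched validation/test
```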
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is designed to facilitate sentiment analysis for transliterated Marathi text, which is widely used on social media platforms but lacks structured sentiment resources. The dataset includes user-generated comments labeled with sentiment scores, along with a manually curated sentiment wordlist to aid classification.
The comments were collected from platforms like Instagram, Twitter, and YouTube, where informal, code-mixed text is prevalent. Each sentence has been carefully annotated for sentiment by human reviewers to ensure label accuracy and consistency.
marathi_comments.csv – Contains user-generated transliterated Marathi comments with their sentiment classification.
marathi_wordlist.csv – A manually created wordlist that maps common transliterated Marathi words to sentiment scores.

marathi_comments.csv contains sentences along with sentiment labels assigned during manual annotation:

| Column | Description |
|---|---|
| Sentence | Transliterated Marathi sentence |
| Classified Score | Sentiment label (-3 to +3) based on manual annotation |
Sentiment labeling scale:

| Score | Sentiment Meaning |
|---|---|
| +3 | Most Positive |
| +2 | More Positive |
| +1 | Positive |
| 0 | Neutral |
| -1 | Negative |
| -2 | More Negative |
| -3 | Most Negative |
marathi_wordlist.csv contains a sentiment wordlist with predefined scores for commonly used transliterated Marathi words (a minimal scoring sketch follows the table):

| Column | Description |
|---|---|
| word | Transliterated Marathi word |
| score | Sentiment score assigned to the word (-3 to +3) |
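As a rough illustration of how the wordlist can aid classification, here is a minimal lexicon-based scoring sketch; the column names follow the table above, while the tokenization, file location, and clipping to the -3..+3 scale are simplifying assumptions, not the project's methodology.

```python
# Minimal lexicon-based scoring sketch using marathi_wordlist.csv.
# Whitespace tokenization and score clipping are simplifying assumptions.
import pandas as pd

wordlist = pd.read_csv("marathi_wordlist.csv")
scores = dict(zip(wordlist["word"].str.lower(), wordlist["score"]))

def score_sentence(sentence: str) -> int:
    """Sum word-level scores and clip to the -3..+3 annotation scale."""
    total = sum(scores.get(token, 0) for token in sentence.lower().split())
    return max(-3, min(3, total))

print(score_sentence("khup chhan video"))  # hypothetical transliterated input
```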
This dataset was curated as part of a research project in the Department of Electronics & Telecommunication Engineering at SCTR's Pune Institute of Computer Technology, Pune, India. We sincerely appreciate the efforts and contributions of our project group in dataset collection, annotation, and structuring.
Contributors:
- Siddhi Pardeshi
- Gurunath Salve
- Sayali Thakur
- Mr. Rishikesh J. Sutar (Mentor)
We would like to extend our gratitude to our institution for providing guidance and support throughout this research. By making this dataset publicly available, we aim to encourage further advancements in low-resource language processing and Marathi NLP research.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset
This dataset contains positive, negative and neutral (notr) sentences from several data sources given in the references. Most sentiment models use only two labels, positive and negative; however, user input can be an entirely neutral sentence, and I could find no data for such cases. Therefore I created this dataset with three classes. Positive and negative sentences are listed below. Neutral examples are extracted from the Turkish wiki dump. In addition, added some random text… See the full description on the dataset page: https://huggingface.co/datasets/winvoker/turkish-sentiment-analysis-dataset.
CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
Do you want to uncover the power of language through analysis? The Lince Dataset is the answer! An expansive collection of language technologies and data, this dataset can be utilized for a multitude of purposes. With six language varieties to explore - Spanish, Hindi, Nepali, Spanish-English, Hindi-English, and Modern Standard Arabic-Egyptian Arabic (MSAEA) - you are granted access to an enormous selection of tasks: language identification (LID), part-of-speech (POS) tagging, named-entity recognition (NER), sentiment analysis (SA) and much more. Train your models efficiently with the help of ML in order to automatically detect and classify tasks such as POS or NER for each variety. Or even build cross-linguistic models between multiple languages if preferred! Push the boundaries with the Lince Dataset's unparalleled diversity. Dive into exploratory research within this feast for NLP connoisseurs and unlock hidden opportunities today!
Are you looking to unlock the potential of multilingual natural language processing (NLP) with the Lince Dataset? If so, you’re in the right place! With six languages and training data for language identification (LID), part-of-speech (POS) tagging, Named-Entity Recognition (NER), sentiment analysis (SA) and more, this is one of the most comprehensive datasets for NLP today.
Understand what is included in this dataset. This dataset includes language technology data from six language varieties: Spanish, Hindi, Nepali, Spanish-English, Hindi-English, and Modern Standard Arabic-Egyptian Arabic (MSAEA). Each file is labelled according to its content - e.g. lid_msaea_test.csv contains test data for language identification (LID), with 5 columns containing words, part-of-speech tags, and sentiment analysis labels. A brief summary of each file's contents can be found when you pull this dataset up on Kaggle, or by running a script such as head() or describe(), depending on your software preferences.
Decide what kind of analysis you want to do. Once you are familiar with the data provided, decide which kind of model or analysis you want to build before coding any algorithms for that task. For example, to build a cross-lingual model for POS tagging, it is ideal to have training and validation sets from three different languages, so that the model can exchange multi-domain knowledge between them during training; files such as pos_spaeng_train and pos_hineng_validation come into play here. While designing your model architecture, make sure task-specific hyperparameters complement each other, and choose an appropriate feature-vector representation strategy to improve performance.
Run appropriate algorithms on the data provided in the dataset. Once all the elements are understood, run the appropriate algorithms, regardless of the tools used, and tune your models using metrics such as accuracy and F1 score. Once tuned, verify that the system works reliably by testing on the unseen test set and confirming the desired results. During optimization, hyperparameter tuning plays a significant role, depending on the algorithm chosen. A minimal inspection sketch follows.
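As suggested above, a quick way to get a feel for one of the files is a pandas head()/describe() pass; the file name comes from the description, but the column layout is not guaranteed.

```python
# Hedged sketch: inspect one LinCE file with pandas.
import pandas as pd

df = pd.read_csv("lid_msaea_test.csv")
print(df.head())      # first rows: words, tags, sentiment labels, ...
print(df.describe())  # summary statistics for numeric columns
```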
- Developing a multilingual sentiment analysis system that can analyze sentiment in any of the six languages.
- Training a model to identify and classify named entities across multiple languages, such as identifying certain words for proper nouns or locations regardless of language or coding scheme.
- Developing an AI-powered cross-lingual translator that is able to effectively translate text from one language to another with minimal errors and maximum accuracy
If you use this dataset in your research, please credit the original authors.
Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: lid_msaea_test.csv...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Low-resource languages are gaining much-needed attention with the advent of deep learning models and pre-trained word embeddings. Though spoken by more than 230 million people worldwide, Urdu is one such low-resource language that has recently gained popularity online and is attracting a lot of attention and support from the research community. One challenge faced by such resource-constrained languages is the scarcity of publicly available large-scale datasets for conducting any meaningful study. In this paper, we address this challenge by collecting the first-ever large-scale Urdu Tweet Dataset for sentiment analysis and emotion recognition. The dataset consists of a staggering 1,140,821 tweets in the Urdu language. Obviously, manual labeling of such a large number of tweets would have been tedious, error-prone, and humanly impossible; therefore, the paper also proposes a weakly supervised approach to label tweets automatically. Emoticons used within the tweets, in addition to SentiWordNet, are utilized in the proposed weakly supervised labeling approach to categorize extracted tweets into positive, negative, and neutral categories. Baseline deep learning models are implemented to compute the accuracy of three labeling approaches, i.e., VADER, TextBlob, and our proposed weakly supervised approach. Unlike the weakly supervised labeling approach, VADER and TextBlob label most tweets as neutral and show a high correlation with each other. This is largely attributed to the fact that these models do not consider emoticons when assigning polarity.
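The emoticon-driven part of the weak-labeling idea can be sketched as follows; the emoticon sets are illustrative rather than the paper's actual resources, and the SentiWordNet lookup is omitted for brevity.

```python
# Illustrative weak-labeling sketch; emoticon sets are assumptions,
# not the paper's exact lists.
POSITIVE = {"😊", "😂", "❤️", "👍", ":)", ":D"}
NEGATIVE = {"😢", "😡", "👎", ":("}

def weak_label(tweet: str) -> str:
    pos = sum(tweet.count(e) for e in POSITIVE)
    neg = sum(tweet.count(e) for e in NEGATIVE)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(weak_label("bohat acha 😊👍"))  # hypothetical Roman-Urdu tweet
```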
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
More and more customers demand online reviews of products and comments on the Web to make decisions about buying a product over another. In this context, sentiment analysis techniques constitute the traditional way to summarize a user’s opinions that criticizes or highlights the positive aspects of a product. Sentiment analysis of reviews usually relies on extracting positive and negative aspects of products, neglecting comparative opinions. Such opinions do not directly express a positive or negative view but contrast aspects of products from different competitors.
Here, we present the first effort to study comparative opinions in Portuguese, creating two new Portuguese datasets with comparative sentences marked by three humans. This repository consists of three important files: (1) lexicon that contains words frequently used to make a comparison in Portuguese; (2) Twitter dataset with labeled comparative sentences; and (3) Buscapé dataset with labeled comparative sentences.
The lexicon is a set of 176 words frequently used to express a comparative opinion in the Portuguese language. The lexicon is used as a filter to build two sets of data with comparative sentences from two important contexts: (1) online social networks; and (2) product reviews.
For Twitter, we collected all Portuguese tweets published in Brazil on 2018/01/10 and filtered all tweets that contained at least one keyword present in the lexicon, obtaining 130,459 tweets. Our work is based on the sentence level. Thus, all sentences were extracted and a sample with 2,053 sentences was created, which was labeled by three human annotators, reaching 83.2% agreement as measured by Fleiss' Kappa coefficient. For Buscapé, a Brazilian website (https://www.buscape.com.br/) used to compare product prices on the web, the same methodology was followed, creating a set of 2,754 labeled sentences obtained from comments made in 2013. This dataset was also labeled by three humans, reaching an agreement of 83.46% by the Fleiss Kappa coefficient.
The Twitter dataset has 2,053 labeled sentences, of which 918 are comparative. The Buscapé dataset has 2,754 labeled sentences, of which 1,282 are comparative.
The datasets contain these labeled properties:
text: the sentence extracted from the review comment.
entity_s1: the first entity compared in the sentence.
entity_s2: the second entity compared in the sentence.
keyword: the comparative keyword used in the sentence to express comparison.
preferred_entity: the preferred entity.
id_start: the keyword's initial position in the sentence.
id_end: the keyword's final position in the sentence.
type: the sentence label, which specifies whether the phrase is a comparison.
Additional Information:
1 - The sentences were separated using a sentence tokenizer.
2 - If the compared entity is not specified, the field will receive a value: "_".
3 - The property "type" can contain five values:
0: Non-comparative (Não Comparativa).
1: Non-Equal-Gradable (Gradativa com Predileção).
2: Equative (Equitativa).
3: Superlative (Superlativa).
4: Non-Gradable (Não Gradativa).
If you use this data, please cite our paper as follows:
"Daniel Kansaon, Michele A. Brandão, Julio C. S. Reis, Matheus Barbosa,Breno Matos, and Fabrício Benevenuto. 2020. Mining Portuguese Comparative Sentences in Online Reviews. In Brazilian Symposium on Multimedia and the Web (WebMedia ’20), November 30-December 4, 2020, São Luís, Brazil. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3428658.3431081"
Further information:
We make the raw sentences available in the dataset to allow future work to test different pre-processing steps. Then, if you want to obtain the exact sentences used in the paper above, you must reproduce the pre-processing step described in the paper (Figure 2).
For each sentence with more than one keyword in the dataset:
You need to extract three words before and three words after the comparative keyword, creating a new sentence that receives the existing value in the "type" field as its label;
The original sentence will be divided into n new sentences, where n is the number of keywords in the sentence;
Stopwords should not be counted as part of this range (3 words);
Note that the final processed sentence can have more than six words, because stopwords are not counted as part of the range. A sketch of this windowing step follows.
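A minimal sketch of this windowing step, under simplifying assumptions (whitespace tokenization and a tiny illustrative stopword subset; see Figure 2 of the paper for the exact procedure):

```python
# Keep three non-stopword tokens on each side of the comparative keyword;
# stopwords are carried along but not counted toward the window of 3.
STOPWORDS = {"o", "a", "de", "que", "e", "do", "da"}  # illustrative subset

def window_around(tokens, kw_index, size=3):
    def take(indices):
        kept, counted = [], 0
        for i in indices:
            kept.append(tokens[i])
            if tokens[i].lower() not in STOPWORDS:
                counted += 1
                if counted == size:
                    break
        return kept
    left = take(range(kw_index - 1, -1, -1))[::-1]
    right = take(range(kw_index + 1, len(tokens)))
    return left + [tokens[kw_index]] + right

tokens = "o celular X é muito melhor que o celular Y".split()
print(window_around(tokens, tokens.index("melhor")))
# ['X', 'é', 'muito', 'melhor', 'que', 'o', 'celular', 'Y'] - more than
# six words because the stopwords "que" and "o" are not counted.
```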
About a year ago we had an idea about eliminating the preprocessing step in text analysis using fastText and CNNs. So, to test our idea, I started to scrape the Digikala website and extracted 3 million comments on products. If you are interested in our project, you can check this link: peerj.com/articles/cs-422/. And now, I have decided to share my dataset with those who are doing research in the text analysis area and want to test their ideas. The most prominent feature of this data, or in other words, of the data available on the Digikala website, is that a significant part of it is labeled, which facilitates sentiment analysis: in this dataset there are 1,749,055 rows with the positive label (Satisfied=1), 308,112 rows with the negative label (Unsatisfied=1), and 875,580 rows without any label (Satisfied=0 and Unsatisfied=0).
Columns:
1. Date: Date in solar calendar format
2. Person: Name of the person who posted the comment
3. SubCatName: The main subcategory of the desired product code
4. SubName: Subcategory of the desired product code
5. ItemURL: Product URL (most of them are unavailable because the dataset is from 2020)
6. Comment: Customer opinion about the purchased product
7. Satisfied: After sending the comment, the customer can select this option to express whether he/she is satisfied with the product
8. Unsatisfied: After sending the comment, the customer can select this option to express whether he/she is unsatisfied with the product
9. Agree: Number of users who agree with this comment
10. Disagree: Number of users who disagree with this comment
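Given the Satisfied/Unsatisfied flags described above, a three-way label can be derived roughly as follows; the file name is a placeholder.

```python
# Hedged sketch: derive positive/negative/unlabeled from the flag columns.
import pandas as pd

df = pd.read_csv("digikala_comments.csv")  # hypothetical file name

def label(row):
    if row["Satisfied"] == 1:
        return "positive"
    if row["Unsatisfied"] == 1:
        return "negative"
    return "unlabeled"  # Satisfied == 0 and Unsatisfied == 0

df["label"] = df.apply(label, axis=1)
print(df["label"].value_counts())
```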
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Three datasets.
PRBatch Dataset (file name: prfeatures_train.csv and prfeatures_test.csv)
PRFeatures uses an extensive dataset from Xunhui Zhang et al. (2021), originating from their 2020 work and the GHTorrent data dump dated June 1, 2019 (https://github.com/ghtorrent/ghtorrent.org/). This dataset was selected for its diversity in project activity, language, and size, offering a more generalizable and holistic view of Pull-Request (PR) dynamics across various software development scenarios.
We performed necessary pre-processing steps, mainly handling missing values by replacing negative and missing values with Not a Number (NaN) and omitting factors with over 30% missing values. In terms of feature engineering, redundant factors were removed and related factors like files-added and files-deleted were consolidated into files-changed. Rather than narrowing down key variables, our study aims to showcase the adaptability of RL algorithms in handling extensive feature sets; therefore, we retain a large number of features in the dataset. Correct data types were set for each factor, and categorical values in the language factor were label-encoded.
We used an 80/20 data split to create the training and testing datasets uploaded here. The dataset contains a little over 1.3 million PRs and 72 PR-related features.
PRChat Dataset (file name: pr_comments_dataset_publish)
The second dataset, the PRChat Dataset, was curated specifically for a specialized reinforcement-learning formalization of Pull-Request (PR) outcome prediction on GitHub using only the developer discussions. It contains 588,097 in-line code comments from 66,281 PRs and a total of 15 features. The raw comments and the respective commit_ids were extracted from the work published by Akshay Sinha (refer to the references). The data spans January 2015 to December 2020. All the other features were augmented using the GitHub REST API.
The dataset contains a little under 0.6 million comments associated with around 66,000 PRs. To view the PRs (and consequently the related comments), group by owner_name, repo_name, and pull_no, as in the sketch below.
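A minimal sketch of that grouping; the .csv extension is an assumption since the published file name has no extension.

```python
# Reconstruct per-PR comment threads by grouping on the key columns.
import pandas as pd

comments = pd.read_csv("pr_comments_dataset_publish.csv")  # extension assumed
threads = comments.groupby(["owner_name", "repo_name", "pull_no"])
print(threads.size().describe())  # distribution of comments per PR
```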
Feature extraction resulted in the addition of the following features:
Sentiment analysis conducted using VADER (a minimal example follows below) resulted in the addition of:
Other PR- and project-related features include:
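For reference, a minimal VADER scoring call looks like this; which of the returned fields were stored as dataset features is not specified above.

```python
# Minimal VADER example using the vaderSentiment package.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
print(analyzer.polarity_scores("LGTM, nice refactor!"))
# -> {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
```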
Survey Data (file name: survey_responses_raw.csv)
The third dataset is the collection of responses to an online exploratory survey targeting software developers and engineers. The underlying objective was to delve deep into developers' perspectives regarding PR review processes and the quality of these reviews. We received a total of 22 responses.
We designed a survey protocol following Carleton University's guidelines for online research, adhering to the Tri-Council Policy Statement: Ethical Conduct for Research Involving Humans (TCPS 2) in Canada (https://tcps2core.ca/welcome). After careful evaluation by Carleton University's Research Ethics Boards, in alignment with TCPS 2, we received approval on May 2, 2023 (Ethics Clearance ID # 119296), effective until May 31, 2023.
The survey was carefully structured into three distinct sections. The initial section delved into the participant's demographic and professional background, featuring six primary questions, along with an optional seventh question. Prioritizing participant confidentiality, the survey was designed to safeguard anonymity. The subsequent section transitioned to a set of questions focused on PR factors and review practices. This section presented participants with two multiple-choice queries and a pair of questions grounded in the Likert-scale, enabling a structured feedback mechanism.
Concluding the survey, the third section was crafted to prompt more detailed insights from the participants. It comprised two open-ended questions, providing an avenue for respondents to further describe their PR review experiences and techniques.
CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/
The competition ended over 2 years ago. I just wanna play around with the dataset.
The labeled data set consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of reviews is binary: an IMDB rating < 5 results in a sentiment score of 0, and a rating >= 7 results in a sentiment score of 1. No individual movie has more than 30 reviews. The 25,000-review labeled training set does not include any of the same movies as the 25,000-review test set. In addition, there are another 50,000 IMDB reviews provided without any rating labels.
The original source is here. An awesome tutorial is here; we can play with it.
Just for study and learning
World Bank Terms of Use for Datasets: https://www.worldbank.org/en/about/legal/terms-of-use-for-datasets
This dataset consists of a few thousand Twitter user reviews (input text) and emotions (output labels) for learning how to train text for emotion analysis. This dataset was created using the Twitter API by searching with keywords. The idea here is a dataset that is more than a toy - real business data on a reasonable scale - but one that can be trained in minutes on a modest laptop.
This file has Sl no, Tweets, Search key, Feeling.
Description of columns in the file:
Tweets - text of the review
Search key - keyword used
Feeling - emotion classified using the keyword; this column contains 6 emotions, i.e., Happy, Sad, Surprise, Fear, Disgust, Angry
This would be helpful for organizations to understand customer reviews/feedback.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
User reviews play a crucial role in shaping consumer perceptions and guiding decision-making processes in the digital marketplace. With the rise of mobile applications, platforms like the Google Play Store serve as hubs for users to express their opinions and experiences with various apps and services. Understanding the polarities and emotions conveyed in these reviews provides valuable insights for developers, marketers, and researchers alike.
The dataset consists of user reviews collected from the "Trending" section of the Google Play Store in May 2023. A total of 300 reviews were gathered for each of the top 10 most downloaded applications during this period. Each review in the dataset has been meticulously labeled for polarity, categorizing sentiments as positive, negative, or neutral, and emotion, encompassing a range of emotional responses such as happiness, sadness, surprise, fear, disgust and anger.
Additionally, it's worth noting that this dataset underwent a rigorous annotation process. Three annotators independently classified the reviews for polarity and emotion. Afterward, they reconciled any discrepancies through discussion and arrived at a consensus for the final annotations. This ensures a high level of accuracy and reliability in the labeling process, providing researchers and practitioners with trustworthy data for analysis and decision-making.
It's important to highlight that all reviews in this dataset are in Brazilian Portuguese, reflecting the specific linguistic and cultural nuances of the Brazilian market. By leveraging this dataset, stakeholders gain access to a robust resource for exploring user sentiment and emotion within the context of popular mobile applications in Brazil.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In recent years, the research and development of genome editing technology have been progressing rapidly, and the commercial use of genome-edited soybean started in the United States in 2019. A preceding study found that there is public concern with regard to the safety of high-tech foods, such as genetically modified foods and genome-edited foods. Twitter, one of the most popular social networks, allows users to post their opinions instantaneously, making it an extremely useful tool to collect what people are actually saying online in a timely manner. Therefore, it was used for collecting data on users' concerns with and expectations of high-tech foods. This study collected and analyzed Twitter data on genome-edited foods and their labeling from May 25 to October 15, 2019. Of 14,066 unique user IDs, 94.9% posted 5 or fewer tweets, and 64.8% tweeted only once, indicating that the majority of users who tweeted on this issue were not so engaged as to post tweets consistently. After a process of refining, there were 28,722 tweets, of which 2,536 tweets (8.8%) were original, 326 (1.1%) were replies, and 25,860 (90%) were retweets. The numbers of tweets increased in response to government announcements and news content in the media. A total of six prominent peaks were detected during the investigation period, demonstrating that Twitter could serve as a tool for monitoring the degree of users' interest in real time. The co-occurrence network of original and reply tweets provided different words from various tweets that appeared with a certain frequency. However, the network derived from all tweets seemed to concentrate on words from specific tweets with negative overtones. As a result of sentiment analysis, 54.5% and 62.8% of tweets were negative about genome-edited food and the labeling policy of the Consumer Affairs Agency, respectively, indicating a strong demand for mandatory labeling. These findings are expected to contribute to the communication strategy for genome-edited foods toward social implementation by government officers and science communicators.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
(see https://tblock.github.io/10kGNAD/ for the original dataset page)
This page introduces the 10k German News Articles Dataset (10kGNAD), a German topic classification dataset. The 10kGNAD is based on the One Million Posts Corpus and available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. You can download the dataset here.

English text classification datasets are common. Examples are the big AG News, the class-rich 20 Newsgroups and the large-scale DBpedia ontology datasets for topic classification, and, for example, the commonly used IMDb and Yelp datasets for sentiment analysis. Non-English datasets, especially German datasets, are less common. There is a collection of sentiment analysis datasets assembled by the Interest Group on German Sentiment Analysis. However, to my knowledge, no German topic classification dataset is available to the public.

Due to grammatical differences between the English and the German language, a classifier might be effective on an English dataset, but not as effective on a German dataset. The German language has higher inflection, and long compound words are quite common compared to the English language. One would need to evaluate a classifier on multiple German datasets to get a sense of its effectiveness.

The 10kGNAD dataset is intended to solve part of this problem as the first German topic classification dataset. It consists of 10,273 German-language news articles from an Austrian online newspaper, categorized into nine topics. These articles are a previously unused part of the One Million Posts Corpus.
In the One Million Posts Corpus each article has a topic path, for example Newsroom/Wirtschaft/Wirtschaftpolitik/Finanzmaerkte/Griechenlandkrise. The 10kGNAD uses the second part of the topic path, here Wirtschaft, as the class label. As a result, the dataset can be used for multi-class classification.
I created and used this dataset in my thesis to train and evaluate four text classifiers on the German language. By publishing the dataset I hope to support the advancement of tools and models for the German language. Additionally, this dataset can be used as a benchmark dataset for German topic classification.

As in most real-world datasets, the class distribution of the 10kGNAD is not balanced. The biggest class, Web, consists of 1,678 articles, while the smallest class, Kultur, contains only 539. However, articles from the Web class have on average the fewest words, while articles from the Kultur class have the second-most words.

I propose a stratified split of 10% for testing and the remaining articles for training, as in the sketch below.
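A sketch of that split with scikit-learn; the file and column names are assumptions about the extracted-articles layout.

```python
# Hedged sketch: stratified 90/10 train/test split.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("articles.csv")  # hypothetical extracted-articles file
train, test = train_test_split(
    df, test_size=0.10, stratify=df["label"], random_state=42
)
print(len(train), len(test))
```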
To use the dataset as a benchmark, please use the train.csv and test.csv files located in the project root. Python scripts to extract the articles and split them into a train set and a test set are available in the code directory of this project. Make sure to install the requirements. The original corpus.sqlite3 is required to extract the articles (download here (compressed) or here (uncompressed)).
This dataset is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Please consider citing the authors of the One Million Posts Corpus if you use the dataset.
The Ontario government generates and maintains thousands of datasets. Since 2012, we have shared data with Ontarians via a data catalogue. Open data is data that is shared with the public. Click here to learn more about open data and why Ontario releases it. Ontario's Open Data Directive states that all data must be open, unless there is good reason for it to remain confidential. Ontario's Chief Digital and Data Officer also has the authority to make certain datasets available publicly. Datasets listed in the catalogue that are not open will have one of the following labels:

If you want to use data you find in the catalogue, that data must have a licence – a set of rules that describes how you can use it. A licence: Most of the data available in the catalogue is released under Ontario's Open Government Licence. However, each dataset may be shared with the public under other kinds of licences or no licence at all. If a dataset doesn't have a licence, you don't have the right to use the data. If you have questions about how you can use a specific dataset, please contact us.

The Ontario Data Catalogue endeavors to publish open data in a machine-readable format. For machine-readable datasets, you can simply retrieve the file you need using the file URL. The Ontario Data Catalogue is built on CKAN, which means the catalogue has the following features you can use when building applications. APIs (application programming interfaces) let software applications communicate directly with each other. If you are using the catalogue in a software application, you might want to extract data from the catalogue through the catalogue API. Note: all Datastore API requests to the Ontario Data Catalogue must be made server-side. The catalogue's collection of dataset metadata (and dataset files) is searchable through the CKAN API. The Ontario Data Catalogue has more than just CKAN's documented search fields; you can also search these custom fields. You can also use the CKAN API to retrieve metadata about a particular dataset and check for updated files. Read the complete documentation for CKAN's API. Some of the open data in the Ontario Data Catalogue is available through the Datastore API. You can also search and access the machine-readable open data that is available in the catalogue. How to use the API feature: read the complete documentation for CKAN's Datastore API. A minimal search example follows below.

The Ontario Data Catalogue contains a record for each dataset that the Government of Ontario possesses. Some of these datasets will be available to you as open data. Others will not be available to you. This is because the Government of Ontario is unable to share data that would break the law or put someone's safety at risk. You can search for a dataset with a word that might describe a dataset or topic. Use words like "taxes" or "hospital locations" to discover what datasets the catalogue contains. You can search for a dataset from 3 spots on the catalogue: the homepage, the dataset search page, or the menu bar available across the catalogue. On the dataset search page, you can also filter your search results. You can select filters on the left hand side of the page to limit your search for datasets with your favourite file format, datasets that are updated weekly, datasets released by a particular organization, or datasets that are released under a specific licence. Go to the dataset search page to see the filters that are available to make your search easier.
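A minimal search example against the catalogue's CKAN API; package_search is CKAN's standard action endpoint, and the base URL is assumed to be data.ontario.ca, so check the catalogue's own API documentation for the supported parameters.

```python
# Hedged sketch: full-text dataset search via the CKAN action API.
import requests

resp = requests.get(
    "https://data.ontario.ca/api/3/action/package_search",  # assumed base URL
    params={"q": "hospital locations", "rows": 5},
    timeout=30,
)
for result in resp.json()["result"]["results"]:
    print(result["title"])
```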
You can also do a quick search by selecting one of the catalogue's categories on the homepage. These categories can help you see the types of data we have on key topic areas. When you find the dataset you are looking for, click on it to go to the dataset record. Each dataset record will tell you whether the data is available, and, if so, tell you about the data available. An open dataset might contain several data files. These files might represent different periods of time, different sub-sets of the dataset, different regions, language translations, or other breakdowns. You can select a file and either download it or preview it. Make sure to read the licence agreement to make sure you have permission to use it the way you want. Read more about previewing data.

A non-open dataset may be not available for many reasons. Read more about non-open data. Read more about restricted data. Data that is non-open may still be subject to freedom of information requests. The catalogue has tools that enable all users to visualize the data in the catalogue without leaving the catalogue – no additional software needed. Have a look at our walk-through of how to make a chart in the catalogue.

Get automatic notifications when datasets are updated. You can choose to get notifications for individual datasets, an organization's datasets or the full catalogue. You don't have to provide any personal information – just subscribe to our feeds using any feed reader you like, using the corresponding notification web addresses. Copy those addresses and paste them into your reader. Your feed reader will let you know when the catalogue has been updated.

The catalogue provides open data in several file formats (e.g., spreadsheets, geospatial data, etc.). Learn about each format and how you can access and use the data each file contains.

A file that has a list of items and values separated by commas without formatting (e.g. colours, italics, etc.) or extra visual features. This format provides just the data that you would display in a table. XLSX (Excel) files may be converted to CSV so they can be opened in a text editor. How to access the data: open with any spreadsheet software application (e.g., Open Office Calc, Microsoft Excel) or text editor. Note: this format is considered machine-readable; it can be easily processed and used by a computer. Files that have visual formatting (e.g. bolded headers and colour-coded rows) can be hard for machines to understand; these elements make a file more human-readable and less machine-readable.

A file that provides information without formatted text or extra visual features that may not follow a pattern of separated values like a CSV. How to access the data: open with any word processor or text editor available on your device (e.g., Microsoft Word, Notepad).

A spreadsheet file that may also include charts, graphs, and formatting. How to access the data: open with a spreadsheet software application that supports this format (e.g., Open Office Calc, Microsoft Excel). Data can be converted to a CSV for a non-proprietary format of the same data without formatted text or extra visual features.

A shapefile provides geographic information that can be used to create a map or perform geospatial analysis based on location, points/lines and other data about the shape and features of the area. It includes required files (.shp, .shx, .dbt) and might include corresponding files (e.g., .prj). How to access the data: open with a geographic information system (GIS) software program (e.g., QGIS).
A package of files and folders. The package can contain any number of different file types. How to access the data: open with an unzipping software application (e.g., WinZIP, 7Zip). Note: if a ZIP file contains .shp, .shx, and .dbt file types, it is an ArcGIS ZIP: a package of shapefiles which provide information to create maps or perform geospatial analysis that can be opened with ArcGIS (a geographic information system software program).

A file that provides information related to a geographic area (e.g., phone number, address, average rainfall, number of owl sightings in 2011, etc.) and its geospatial location (i.e., points/lines). How to access the data: open using a GIS software application to create a map or do geospatial analysis. It can also be opened with a text editor to view raw information. Note: this format is machine-readable, and it can be easily processed and used by a computer. Human-readable data (including visual formatting) is easy for users to read and understand.

A text-based format for sharing data in a machine-readable way that can store data with more unconventional structures such as complex lists. How to access the data: open with any text editor (e.g., Notepad) or access through a browser. Note: this format is machine-readable, and it can be easily processed and used by a computer. Human-readable data (including visual formatting) is easy for users to read and understand.

A text-based format to store and organize data in a machine-readable way that can store data with more unconventional structures (not just data organized in tables). How to access the data: open with any text editor (e.g., Notepad). Note: this format is machine-readable, and it can be easily processed and used by a computer. Human-readable data (including visual formatting) is easy for users to read and understand.

A file that provides information related to an area (e.g., phone number, address, average rainfall, number of owl sightings in 2011, etc.) and its geospatial location (i.e., points/lines). How to access the data: open with a geospatial software application that supports the KML format (e.g., Google Earth). Note: this format is machine-readable, and it can be easily processed and used by a computer. Human-readable data (including visual formatting) is easy for users to read and understand.

This format contains files with data from tables used for statistical analysis and data visualization of Statistics Canada census data. How to access the data: open with the Beyond 20/20 application.

A database which links and combines data from different files or applications (including HTML, XML, Excel, etc.). The database file can be converted to a CSV/TXT to make the data machine-readable, but human-readable formatting will be lost. How to access the data: open with Microsoft Office Access (a database management system used to develop application software).

A file that keeps the original layout and
CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/
I have always been a binge-watcher, and with so many movies and series to watch, sentiment analysis of movie reviews is a good way to learn more about them.
This dataset contains the text of the reviews, together with a label that indicates whether a review is "positive" or "negative." The IMDb website itself contains ratings from 1 to 10. To simplify the modeling, this annotation is summarized as a two-class classification dataset where reviews with a score of 6 or higher are labeled as positive, and the rest as negative.
author = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
title = {Learning Word Vectors for Sentiment Analysis},
booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies}
One of the simplest but most effective and commonly used ways to represent text for machine learning is the bag-of-words representation. The task: classify the dataset for the highest cross-validation accuracy, with or without bag-of-words; a baseline sketch follows.
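A baseline sketch under assumed file and column names ("imdb_reviews.csv", "review", "sentiment"); swap in the actual file and columns.

```python
# Hedged sketch: bag-of-words + logistic regression with 5-fold CV.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

df = pd.read_csv("imdb_reviews.csv")  # hypothetical file name
model = make_pipeline(CountVectorizer(min_df=5),
                      LogisticRegression(max_iter=1000))
scores = cross_val_score(model, df["review"], df["sentiment"], cv=5)
print(f"mean CV accuracy: {scores.mean():.3f}")
```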
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
*****Documentation Process*****

1. Data Preparation:
- Upload the data into Power Query to assess quality and identify duplicate values, if any.
- Verify data quality and types for each column, addressing any miswriting or inconsistencies.

2. Data Management:
- Duplicate the original data sheet for future reference and label the new sheet as the "Working File" to preserve the integrity of the original dataset.

3. Understanding Metrics:
- Clarify the meaning of column headers, particularly distinguishing between Impressions and Reach, and comprehend how Engagement Rate is calculated.
- Engagement Rate formula: total likes, comments, and shares divided by Reach.

4. Data Integrity Assurance:
- Recognize that Impressions should outnumber Reach, reflecting total views versus unique audience size.
- Investigate discrepancies between Reach and Impressions to ensure data integrity, identifying and resolving root causes for accurate reporting and analysis.

5. Data Correction:
- Collaborate with the relevant team to rectify data inaccuracies, specifically addressing the discrepancy between Impressions and Reach.
- Engage with the concerned team to understand the root cause of discrepancies between Impressions and Reach.
- Identify instances where Impressions surpass Reach, potentially attributable to data transformation errors.
- Following the rectification process, meticulously adjust the dataset to reflect the corrected Impressions and Reach values accurately.
- Ensure diligent implementation of the corrections to maintain the integrity and reliability of the data.
- Conduct a thorough recalculation of the Engagement Rate post-correction, adhering to rigorous data integrity standards to uphold the credibility of the analysis.

6. Data Enhancement:
- Categorize Audience Age into three groups: "Senior Adults" (45+ years), "Mature Adults" (31-45 years), and "Adolescent Adults" (<30 years) within a new column named "Age Group."
- Split date and time into separate columns using the text-to-columns option for improved analysis.

7. Temporal Analysis:
- Introduce a new column for "Weekend and Weekday," renamed as "Weekday Type," to discern patterns and trends in engagement.
- Define time periods by categorizing into "Morning," "Afternoon," "Evening," and "Night" based on time intervals.

8. Sentiment Analysis:
- Populate blank cells in the Sentiment column with "Mixed Sentiment," denoting content containing both positive and negative sentiments or ambiguity.

9. Geographical Analysis:
- Group countries and obtain additional continent data from an online source (e.g., https://statisticstimes.com/geography/countries-by-continents.php).
- Add a new column for "Audience Continent" and utilize the XLOOKUP function to retrieve corresponding continent data.

(A pandas sketch of steps 3 and 6 follows below.)
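The workflow above is Excel/Power Query-based; as a rough cross-check, steps 3 and 6 could be replicated in pandas like this, with every column name assumed from the step descriptions and the age bin edges only approximating the stated ranges.

```python
# Hedged pandas sketch of the Engagement Rate and Age Group steps.
import pandas as pd

df = pd.read_csv("social_media_metrics.csv")  # hypothetical export
df["Engagement Rate"] = (df["Likes"] + df["Comments"] + df["Shares"]) / df["Reach"]
df["Age Group"] = pd.cut(
    df["Audience Age"],
    bins=[0, 30, 45, 200],  # approximate boundaries for the three groups
    labels=["Adolescent Adults", "Mature Adults", "Senior Adults"],
)
print(df[["Engagement Rate", "Age Group"]].head())
```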
*****Drawing Conclusions and Providing a Summary*****
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Zoopla Properties Listing dataset to explore detailed property information, including pricing, location, and features. Popular use cases include real estate market analysis, property valuation, and investment research.
Use our Zoopla Properties Listing Information dataset to explore detailed property listings, including property details, pricing, location, and market trends across various regions. This dataset provides valuable insights into property valuations, consumer preferences, and real estate dynamics, enabling businesses and researchers to make data-driven decisions.
Tailored for real estate professionals, investors, and market analysts, this dataset supports market trend analysis, property valuation assessments, and investment strategy development. Whether you're evaluating property investments, tracking market conditions, or conducting competitive analysis, the Zoopla Properties Listing Information dataset is a key resource for navigating the real estate landscape.
Dataset Features
Distribution
Usage
This dataset is ideal for a variety of high-impact applications:
Coverage
License
CUSTOM
Please review the respective licenses below:
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
CMU-MOSEI is a comprehensive multimodal dataset designed to analyze emotions and sentiment in online videos. It's a valuable resource for researchers and developers working on automatic emotion recognition and sentiment analysis.
Key Features: Over 23,500 video clips from 1000+ speakers, covering diverse topics and monologues.
Multimodal data:
- Acoustic: features extracted from audio (CMU_MOSEI_COVAREP.csd)
- Labels: annotations for sentiment intensity and emotion categories (CMU_MOSEI_Labels.csd)
- Language: phonetic, word-level, and word-vector representations (CMU_MOSEI_*.csd files under the languages folder)
- Visual: features extracted from facial expressions (CMU_MOSEI_Visual*.csd files under the visuals folder)
Balanced for gender: The dataset ensures equal representation from male and female speakers.
Unlocking Insights: By exploring the various modalities within CMU-MOSEI, researchers can investigate the relationship between speech, facial expressions, and emotions expressed in online videos.
Download: The dataset is freely available for download at: http://immortal.multicomp.cs.cmu.edu/CMU-MOSEI/
Start exploring the world of emotions in videos with CMU-MOSEI!
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
F. Ramoliya, R. Kakkar, R. Gupta, S. Tanwar and S. Agrawal, "SEAM: Deep Learning-based Secure Message Exchange Framework For Autonomous EVs," 2023 IEEE Globecom Workshops (GC Wkshps), Kuala Lumpur, Malaysia, 2023, pp. 80-85, doi: 10.1109/GCWkshps58843.2023.10465168.
This dataset provides a comprehensive collection of data for detecting, diagnosing, and mitigating cyber threats using network traffic data, textual content, and entity relationships. It can be used for training machine learning models to identify various types of cyber threats, understand their underlying patterns, and recommend appropriate solutions.
1. id: A unique identifier for each instance in the dataset.
2. text: Textual content transferred over the network, such as emails, messages, or network traffic payloads. This column may contain descriptions of potential cyber threats or attack vectors.
3. Entries: A list of JSON objects containing the following fields:
- sender_id: The ID of the entity that sent or initiated the communication.
- label: The type of cyber threat or attack pattern identified, such as malware, attack pattern, identity, benign, software attack, or threat actor.
- start_offset: The starting character position of the identified entity or threat within the text field.
- end_offset: The ending character position of the identified entity or threat within the text field.
- receiver_ids: A list of IDs representing the entities that received or were targeted by the communication.
4. relations: A list of tuples representing the relationships between entities, where each tuple contains a pair of entity IDs indicating the source and target of the relationship.
5. diagnosis: A description or diagnosis of the identified cyber threat, providing insights into the nature and potential impact of the threat.
6. solution: Recommended solutions or mitigation strategies for addressing the identified cyber threat, such as implementing specific security controls, software updates, or network configurations.
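To make the schema concrete, here is a hypothetical record with every value invented for illustration; the field names follow the description above.

```python
# Hypothetical example record (all values invented).
record = {
    "id": 42,
    "text": "Phishing email delivering a malicious attachment ...",
    "Entries": [
        {
            "sender_id": "entity_7",
            "label": "threat actor",
            "start_offset": 0,
            "end_offset": 8,
            "receiver_ids": ["entity_3", "entity_9"],
        }
    ],
    "relations": [("entity_7", "entity_3")],
    "diagnosis": "Credential-harvesting phishing campaign.",
    "solution": "Enable MFA and quarantine the attachment.",
}
```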
- Cyber Threat Detection: Train machine learning models to identify and classify various types of cyber threats based on network traffic data and textual content.
- Threat Intelligence and Analysis: Analyze the relationships between entities, threat actors, and attack patterns to gain insights into emerging cyber threats and their propagation mechanisms.
- Incident Response and Mitigation: Develop systems that can recommend appropriate solutions and mitigation strategies based on the diagnosed cyber threats, enabling timely and effective incident response.
- Network Security Monitoring: Implement real-time monitoring and analysis of network traffic to detect and prevent cyber attacks as they occur.
- Cybersecurity Education and Research: Utilize the dataset for training cybersecurity professionals, conducting research on cyber threat detection and mitigation techniques, and developing new algorithms and approaches.
- Multi-Modal Threat Detection: Develop multi-modal machine learning models that can leverage both the network traffic data and textual content to enhance cyber threat detection capabilities.
- Natural Language Processing (NLP) for Threat Analysis: Apply NLP techniques to analyze the textual content and identify potential threats, threat actors, and their relationships.
- Graph Neural Networks: Leverage entity relationships and network traffic patterns to build graph neural network models for detecting and classifying complex cyber threats.
- Anomaly Detection: Implement unsupervised or semi-supervised learning algorithms to detect anomalous network traffic patterns and textual content indicating cyber threats.
- Transfer Learning and Domain Adaptation: Explore transfer learning techniques to adapt pre-trained models or knowledge from related domains to the cyber threat detection task.
- Federated Learning: Develop federated learning frameworks for collaborative threat intelligence, distributed threat monitoring, and personalized threat detection.
Collaborative Threat Intelligence: Develop federated learning frameworks that enable organizations to collaboratively train machine learning models for cyber threat detection while preserving data privacy and confidentiality.
Distributed Threat Monitoring: Implement federated learning systems that can monitor and detect cyber threats across multiple distributed networks or devices, without the need for centralized data collection.
Personalized Threat Detection: Leverage federated learning to build personalized threat detection models tailored to specific organizatio...