Sentiment Analysis Dataset
Overview
This dataset is designed for sentiment analysis tasks, offering a balanced and pre-processed collection of labeled text data. The dataset includes three sentiment labels:
0: Negative
1: Neutral
2: Positive
The training dataset has been oversampled to ensure balanced label distribution, making it suitable for training robust sentiment analysis models. The validation and test datasets remain unaltered to preserve the original… See the full description on the dataset page: https://huggingface.co/datasets/syedkhalid076/Sentiment-Analysis-Over-sampled.
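As a quick start, here is a minimal sketch of loading this dataset from the Hugging Face Hub; the exact split names are an assumption based on the description above.

```python
# Hedged sketch: load the oversampled sentiment dataset from the Hub.
# Split names (train/validation/test) are assumed from the description.
from datasets import load_dataset

ds = load_dataset("syedkhalid076/Sentiment-Analysis-Over-sampled")
print(ds)  # expected: oversampled train split plus untouched validation/test
```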
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is designed to facilitate sentiment analysis for transliterated Marathi text, which is widely used on social media platforms but lacks structured sentiment resources. The dataset includes user-generated comments labeled with sentiment scores, along with a manually curated sentiment wordlist to aid classification.
The comments were collected from platforms like Instagram, Twitter, and YouTube, where informal, code-mixed text is prevalent. Each sentence has been carefully annotated for sentiment by human reviewers to ensure label accuracy and consistency.
marathi_comments.csv – Contains user-generated transliterated Marathi comments with their sentiment classification.
marathi_wordlist.csv – A manually created wordlist that maps common transliterated Marathi words to sentiment scores.

marathi_comments.csv contains sentences along with sentiment labels assigned during manual annotation:

| Column | Description |
|---|---|
| Sentence | Transliterated Marathi sentence |
| Classified Score | Sentiment label (-3 to +3) based on manual annotation |
Sentiment labeling scale:

| Score | Sentiment Meaning |
|---|---|
| +3 | Most Positive |
| +2 | More Positive |
| +1 | Positive |
| 0 | Neutral |
| -1 | Negative |
| -2 | More Negative |
| -3 | Most Negative |
marathi_wordlist.csv contains a sentiment wordlist with predefined scores for commonly used transliterated Marathi words (a minimal scoring sketch follows the table):

| Column | Description |
|---|---|
| word | Transliterated Marathi word |
| score | Sentiment score assigned to the word (-3 to +3) |
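As a rough illustration of how the wordlist can aid classification, here is a minimal lexicon-based scoring sketch; the column names follow the table above, while the tokenization, file location, and clipping to the -3..+3 scale are simplifying assumptions, not the project's methodology.

```python
# Minimal lexicon-based scoring sketch using marathi_wordlist.csv.
# Whitespace tokenization and score clipping are simplifying assumptions.
import pandas as pd

wordlist = pd.read_csv("marathi_wordlist.csv")
scores = dict(zip(wordlist["word"].str.lower(), wordlist["score"]))

def score_sentence(sentence: str) -> int:
    """Sum word-level scores and clip to the -3..+3 annotation scale."""
    total = sum(scores.get(token, 0) for token in sentence.lower().split())
    return max(-3, min(3, total))

print(score_sentence("khup chhan video"))  # hypothetical transliterated input
```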
This dataset was curated as part of a research project in the Department of Electronics & Telecommunication Engineering at SCTR's Pune Institute of Computer Technology, Pune, India. We sincerely appreciate the efforts and contributions of our project group in dataset collection, annotation, and structuring.
Contributors:
- Siddhi Pardeshi
- Gurunath Salve
- Sayali Thakur
- Mr. Rishikesh J. Sutar (Mentor)
We would like to extend our gratitude to our institution for providing guidance and support throughout this research. By making this dataset publicly available, we aim to encourage further advancements in low-resource language processing and Marathi NLP research.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset
This dataset contains positive, negative and neutral (notr) sentences from several data sources given in the references. Most sentiment models use only two labels, positive and negative; however, user input can be an entirely neutral sentence, and I could find no data for such cases. Therefore I created this dataset with three classes. Positive and negative sentences are listed below. Neutral examples are extracted from the Turkish wiki dump. In addition, added some random text… See the full description on the dataset page: https://huggingface.co/datasets/winvoker/turkish-sentiment-analysis-dataset.
CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
Do you want to uncover the power of language through analysis? The Lince Dataset is the answer! An expansive collection of language technologies and data, this dataset can be utilized for a multitude of purposes. With six language varieties to explore - Spanish, Hindi, Nepali, Spanish-English, Hindi-English, and Modern Standard Arabic-Egyptian Arabic (MSAEA) - you are granted access to an enormous selection of tasks: language identification (LID), part-of-speech (POS) tagging, named-entity recognition (NER), sentiment analysis (SA) and much more. Train your models efficiently with the help of ML in order to automatically detect and classify tasks such as POS or NER for each variety. Or even build cross-linguistic models between multiple languages if preferred! Push the boundaries with the Lince Dataset's unparalleled diversity. Dive into exploratory research within this feast for NLP connoisseurs and unlock hidden opportunities today!
Are you looking to unlock the potential of multilingual natural language processing (NLP) with the Lince Dataset? If so, you’re in the right place! With six languages and training data for language identification (LID), part-of-speech (POS) tagging, Named-Entity Recognition (NER), sentiment analysis (SA) and more, this is one of the most comprehensive datasets for NLP today.
Understand what is included in this dataset. This dataset includes language technology data from six language varieties: Spanish, Hindi, Nepali, Spanish-English, Hindi-English, and Modern Standard Arabic-Egyptian Arabic (MSAEA). Each file is labelled according to its content - e.g. lid_msaea_test.csv contains test data for language identification (LID), with 5 columns containing words, part-of-speech tags, and sentiment analysis labels. A brief summary of each file's contents can be found when you pull this dataset up on Kaggle, or by running a script such as head() or describe(), depending on your software preferences.
Decide what kind of analysis you want to do. Once you are familiar with the data provided, decide which kind of model or analysis you want to build before coding any algorithms for that task. For example, to build a cross-lingual model for POS tagging, it is ideal to have training and validation sets from three different languages, so that the model can exchange multi-domain knowledge between them during training; files such as pos_spaeng_train and pos_hineng_validation come into play here. While designing your model architecture, make sure task-specific hyperparameters complement each other, and choose an appropriate feature-vector representation strategy to improve performance.
Run appropriate algorithms on the data provided in the dataset. Once all the elements are understood, run the appropriate algorithms, regardless of the tools used, and tune your models using metrics such as accuracy and F1 score. Once tuned, verify that the system works reliably by testing on the unseen test set and confirming the desired results. During optimization, hyperparameter tuning plays a significant role, depending on the algorithm chosen. A minimal inspection sketch follows.
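As suggested above, a quick way to get a feel for one of the files is a pandas head()/describe() pass; the file name comes from the description, but the column layout is not guaranteed.

```python
# Hedged sketch: inspect one LinCE file with pandas.
import pandas as pd

df = pd.read_csv("lid_msaea_test.csv")
print(df.head())      # first rows: words, tags, sentiment labels, ...
print(df.describe())  # summary statistics for numeric columns
```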
- Developing a multilingual sentiment analysis system that can analyze sentiment in any of the six languages.
- Training a model to identify and classify named entities across multiple languages, such as identifying certain words for proper nouns or locations regardless of language or coding scheme.
- Developing an AI-powered cross-lingual translator that is able to effectively translate text from one language to another with minimal errors and maximum accuracy
If you use this dataset in your research, please credit the original authors.
Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: lid_msaea_test.csv...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Low-resource languages are gaining much-needed attention with the advent of deep learning models and pre-trained word embeddings. Though spoken by more than 230 million people worldwide, Urdu is one such low-resource language that has recently gained popularity online and is attracting a lot of attention and support from the research community. One challenge faced by such resource-constrained languages is the scarcity of publicly available large-scale datasets for conducting any meaningful study. In this paper, we address this challenge by collecting the first-ever large-scale Urdu Tweet Dataset for sentiment analysis and emotion recognition. The dataset consists of a staggering 1,140,821 tweets in the Urdu language. Obviously, manual labeling of such a large number of tweets would have been tedious, error-prone, and humanly impossible; therefore, the paper also proposes a weakly supervised approach to label tweets automatically. Emoticons used within the tweets, in addition to SentiWordNet, are utilized in the proposed weakly supervised labeling approach to categorize extracted tweets into positive, negative, and neutral categories. Baseline deep learning models are implemented to compute the accuracy of three labeling approaches, i.e., VADER, TextBlob, and our proposed weakly supervised approach. Unlike the weakly supervised labeling approach, VADER and TextBlob label most tweets as neutral and show a high correlation with each other. This is largely attributed to the fact that these models do not consider emoticons when assigning polarity.
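The emoticon-driven part of the weak-labeling idea can be sketched as follows; the emoticon sets are illustrative rather than the paper's actual resources, and the SentiWordNet lookup is omitted for brevity.

```python
# Illustrative weak-labeling sketch; emoticon sets are assumptions,
# not the paper's exact lists.
POSITIVE = {"😊", "😂", "❤️", "👍", ":)", ":D"}
NEGATIVE = {"😢", "😡", "👎", ":("}

def weak_label(tweet: str) -> str:
    pos = sum(tweet.count(e) for e in POSITIVE)
    neg = sum(tweet.count(e) for e in NEGATIVE)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(weak_label("bohat acha 😊👍"))  # hypothetical Roman-Urdu tweet
```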
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
More and more customers demand online reviews of products and comments on the Web to make decisions about buying a product over another. In this context, sentiment analysis techniques constitute the traditional way to summarize a user’s opinions that criticizes or highlights the positive aspects of a product. Sentiment analysis of reviews usually relies on extracting positive and negative aspects of products, neglecting comparative opinions. Such opinions do not directly express a positive or negative view but contrast aspects of products from different competitors.
Here, we present the first effort to study comparative opinions in Portuguese, creating two new Portuguese datasets with comparative sentences marked by three humans. This repository consists of three important files: (1) lexicon that contains words frequently used to make a comparison in Portuguese; (2) Twitter dataset with labeled comparative sentences; and (3) Buscapé dataset with labeled comparative sentences.
The lexicon is a set of 176 words frequently used to express a comparative opinion in the Portuguese language. The lexicon is used as a filter to build two sets of data with comparative sentences from two important contexts: (1) online social networks; and (2) product reviews.
For Twitter, we collected all Portuguese tweets published in Brazil on 2018/01/10 and filtered all tweets that contained at least one keyword present in the lexicon, obtaining 130,459 tweets. Our work is based on the sentence level. Thus, all sentences were extracted and a sample with 2,053 sentences was created, which was labeled by three human annotators, reaching 83.2% agreement as measured by Fleiss' Kappa coefficient. For Buscapé, a Brazilian website (https://www.buscape.com.br/) used to compare product prices on the web, the same methodology was followed, creating a set of 2,754 labeled sentences obtained from comments made in 2013. This dataset was also labeled by three humans, reaching an agreement of 83.46% by the Fleiss Kappa coefficient.
The Twitter dataset has 2,053 labeled sentences, of which 918 are comparative. The Buscapé dataset has 2,754 labeled sentences, of which 1,282 are comparative.
The datasets contain these labeled properties:
text: the sentence extracted from the review comment.
entity_s1: the first entity compared in the sentence.
entity_s2: the second entity compared in the sentence.
keyword: the comparative keyword used in the sentence to express comparison.
preferred_entity: the preferred entity.
id_start: the keyword's initial position in the sentence.
id_end: the keyword's final position in the sentence.
type: the sentence label, which specifies whether the phrase is a comparison.
Additional Information:
1 - The sentences were separated using a sentence tokenizer.
2 - If the compared entity is not specified, the field will receive a value: "_".
3 - The property "type" can contain five values:
0: Non-comparative (Não Comparativa).
1: Non-Equal-Gradable (Gradativa com Predileção).
2: Equative (Equitativa).
3: Superlative (Superlativa).
4: Non-Gradable (Não Gradativa).
If you use this data, please cite our paper as follows:
"Daniel Kansaon, Michele A. Brandão, Julio C. S. Reis, Matheus Barbosa,Breno Matos, and Fabrício Benevenuto. 2020. Mining Portuguese Comparative Sentences in Online Reviews. In Brazilian Symposium on Multimedia and the Web (WebMedia ’20), November 30-December 4, 2020, São Luís, Brazil. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3428658.3431081"
Further information:
We make the raw sentences available in the dataset to allow future work to test different pre-processing steps. Then, if you want to obtain the exact sentences used in the paper above, you must reproduce the pre-processing step described in the paper (Figure 2).
For each sentence with more than one keyword in the dataset:
You need to extract three words before and three words after the comparative keyword, creating a new sentence that receives the existing value in the "type" field as its label;
The original sentence will be divided into n new sentences, where n is the number of keywords in the sentence;
Stopwords should not be counted as part of this range (3 words);
Note that the final processed sentence can have more than six words, because stopwords are not counted as part of the range. A sketch of this windowing step follows.
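A minimal sketch of this windowing step, under simplifying assumptions (whitespace tokenization and a tiny illustrative stopword subset; see Figure 2 of the paper for the exact procedure):

```python
# Keep three non-stopword tokens on each side of the comparative keyword;
# stopwords are carried along but not counted toward the window of 3.
STOPWORDS = {"o", "a", "de", "que", "e", "do", "da"}  # illustrative subset

def window_around(tokens, kw_index, size=3):
    def take(indices):
        kept, counted = [], 0
        for i in indices:
            kept.append(tokens[i])
            if tokens[i].lower() not in STOPWORDS:
                counted += 1
                if counted == size:
                    break
        return kept
    left = take(range(kw_index - 1, -1, -1))[::-1]
    right = take(range(kw_index + 1, len(tokens)))
    return left + [tokens[kw_index]] + right

tokens = "o celular X é muito melhor que o celular Y".split()
print(window_around(tokens, tokens.index("melhor")))
# ['X', 'é', 'muito', 'melhor', 'que', 'o', 'celular', 'Y'] - more than
# six words because the stopwords "que" and "o" are not counted.
```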
About a year ago we had an idea about eliminating the preprocessing step in text analysis using fastText and CNNs. So, to test our idea, I started to scrape the Digikala website and extracted 3 million comments on products. If you are interested in our project, you can check this link: peerj.com/articles/cs-422/. And now, I have decided to share my dataset with those who are doing research in the text analysis area and want to test their ideas. The most prominent feature of this data, or in other words, of the data available on the Digikala website, is that a significant part of it is labeled, which facilitates sentiment analysis: in this dataset there are 1,749,055 rows with the positive label (Satisfied=1), 308,112 rows with the negative label (Unsatisfied=1), and 875,580 rows without any label (Satisfied=0 and Unsatisfied=0).
Columns:
1. Date: Date in solar calendar format
2. Person: Name of the person who posted the comment
3. SubCatName: The main subcategory of the desired product code
4. SubName: Subcategory of the desired product code
5. ItemURL: Product URL (most of them are unavailable because the dataset is from 2020)
6. Comment: Customer opinion about the purchased product
7. Satisfied: After sending the comment, the customer can select this option to express whether he/she is satisfied with the product
8. Unsatisfied: After sending the comment, the customer can select this option to express whether he/she is unsatisfied with the product
9. Agree: Number of users who agree with this comment
10. Disagree: Number of users who disagree with this comment
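Given the Satisfied/Unsatisfied flags described above, a three-way label can be derived roughly as follows; the file name is a placeholder.

```python
# Hedged sketch: derive positive/negative/unlabeled from the flag columns.
import pandas as pd

df = pd.read_csv("digikala_comments.csv")  # hypothetical file name

def label(row):
    if row["Satisfied"] == 1:
        return "positive"
    if row["Unsatisfied"] == 1:
        return "negative"
    return "unlabeled"  # Satisfied == 0 and Unsatisfied == 0

df["label"] = df.apply(label, axis=1)
print(df["label"].value_counts())
```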
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Three datasets.
PRBatch Dataset (file name: prfeatures_train.csv and prfeatures_test.csv)
PRFeatures uses an extensive dataset from Xunhui Zhang et al. (2021), originating from their 2020 work and the GHTorrent data dump dated June 1, 2019 (https://github.com/ghtorrent/ghtorrent.org/). This dataset was selected for its diversity in project activity, language, and size, offering a more generalizable and holistic view of Pull-Request (PR) dynamics across various software development scenarios.
We performed necessary pre-processing steps, mainly handling missing values by replacing negative and missing values with Not a Number (NaN) and omitting factors with over 30% missing values. In terms of feature engineering, redundant factors were removed and related factors like files-added and files-deleted were consolidated into files-changed. Rather than narrowing down key variables, our study aims to showcase the adaptability of RL algorithms in handling extensive feature sets; therefore, we retain a large number of features in the dataset. Correct data types were set for each factor, and categorical values in the language factor were label-encoded.
We used an 80/20 data split to create the training and testing datasets uploaded here. The dataset contains a little over 1.3 million PRs and 72 PR-related features.
PRChat Dataset (file name: pr_comments_dataset_publish)
The second dataset, the PRChat Dataset, was curated specifically for a specialized reinforcement-learning formalization of Pull-Request (PR) outcome prediction on GitHub using only the developer discussions. It contains 588,097 in-line code comments from 66,281 PRs and a total of 15 features. The raw comments and the respective commit_ids were extracted from the work published by Akshay Sinha (refer to the references). The data spans January 2015 to December 2020. All the other features were augmented using the GitHub REST API.
The dataset contains a little under 0.6 million comments associated with around 66,000 PRs. To view the PRs (and consequently the related comments), group by owner_name, repo_name, and pull_no, as in the sketch below.
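A minimal sketch of that grouping; the .csv extension is an assumption since the published file name has no extension.

```python
# Reconstruct per-PR comment threads by grouping on the key columns.
import pandas as pd

comments = pd.read_csv("pr_comments_dataset_publish.csv")  # extension assumed
threads = comments.groupby(["owner_name", "repo_name", "pull_no"])
print(threads.size().describe())  # distribution of comments per PR
```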
Feature extraction resulted in the addition of the following features:
Sentiment analysis conducted using VADER (a minimal example follows below) resulted in the addition of:
Other PR- and project-related features include:
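For reference, a minimal VADER scoring call looks like this; which of the returned fields were stored as dataset features is not specified above.

```python
# Minimal VADER example using the vaderSentiment package.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
print(analyzer.polarity_scores("LGTM, nice refactor!"))
# -> {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
```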
Survey Data (file name: survey_responses_raw.csv)
The third dataset is the collection of responses to an online exploratory survey targeting software developers and engineers. The underlying objective was to delve deep into developers' perspectives regarding PR review processes and the quality of these reviews. We received a total of 22 responses.
We designed a survey protocol following Carleton University's guidelines for online research, adhering to the Tri-Council Policy Statement: Ethical Conduct for Research Involving Humans (TCPS 2) in Canada (https://tcps2core.ca/welcome). After careful evaluation by Carleton University's Research Ethics Boards, in alignment with TCPS 2, we received approval on May 2, 2023 (Ethics Clearance ID # 119296), effective until May 31, 2023.
The survey was carefully structured into three distinct sections. The initial section delved into the participant's demographic and professional background, featuring six primary questions, along with an optional seventh question. Prioritizing participant confidentiality, the survey was designed to safeguard anonymity. The subsequent section transitioned to a set of questions focused on PR factors and review practices. This section presented participants with two multiple-choice queries and a pair of questions grounded in the Likert-scale, enabling a structured feedback mechanism.
Concluding the survey, the third section was crafted to prompt more detailed insights from the participants. It comprised two open-ended questions, providing an avenue for respondents to further describe their PR review experiences and techniques.
CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/
The competition ended over 2 years ago. I just wanna play around with the dataset.
The labeled data set consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of reviews is binary: an IMDB rating < 5 results in a sentiment score of 0, and a rating >= 7 results in a sentiment score of 1. No individual movie has more than 30 reviews. The 25,000-review labeled training set does not include any of the same movies as the 25,000-review test set. In addition, there are another 50,000 IMDB reviews provided without any rating labels.
The original source is here. An awesome tutorial is here; we can play with it.
Just for study and learning
World Bank Terms of Use for Datasets: https://www.worldbank.org/en/about/legal/terms-of-use-for-datasets
This dataset consists of a few thousand Twitter user reviews (input text) and emotions (output labels) for learning how to train text for emotion analysis. This dataset was created using the Twitter API by searching with keywords. The idea here is a dataset that is more than a toy - real business data on a reasonable scale - but one that can be trained in minutes on a modest laptop.
This file has Sl no, Tweets, Search key, Feeling.
Description of columns in the file:
Tweets - text of the review
Search key - keyword used
Feeling - emotion classified using the keyword; this column contains 6 emotions, i.e., Happy, Sad, Surprise, Fear, Disgust, Angry
This would be helpful for organizations to understand customer reviews/feedback.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
User reviews play a crucial role in shaping consumer perceptions and guiding decision-making processes in the digital marketplace. With the rise of mobile applications, platforms like the Google Play Store serve as hubs for users to express their opinions and experiences with various apps and services. Understanding the polarities and emotions conveyed in these reviews provides valuable insights for developers, marketers, and researchers alike.
The dataset consists of user reviews collected from the "Trending" section of the Google Play Store in May 2023. A total of 300 reviews were gathered for each of the top 10 most downloaded applications during this period. Each review in the dataset has been meticulously labeled for polarity, categorizing sentiments as positive, negative, or neutral, and emotion, encompassing a range of emotional responses such as happiness, sadness, surprise, fear, disgust and anger.
Additionally, it's worth noting that this dataset underwent a rigorous annotation process. Three annotators independently classified the reviews for polarity and emotion. Afterward, they reconciled any discrepancies through discussion and arrived at a consensus for the final annotations. This ensures a high level of accuracy and reliability in the labeling process, providing researchers and practitioners with trustworthy data for analysis and decision-making.
It's important to highlight that all reviews in this dataset are in Brazilian Portuguese, reflecting the specific linguistic and cultural nuances of the Brazilian market. By leveraging this dataset, stakeholders gain access to a robust resource for exploring user sentiment and emotion within the context of popular mobile applications in Brazil.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In recent years, the research and development of genome editing technology have been progressing rapidly, and the commercial use of genome-edited soybean started in the United States in 2019. A preceding study found that there is public concern with regard to the safety of high-tech foods, such as genetically modified foods and genome-edited foods. Twitter, one of the most popular social networks, allows users to post their opinions instantaneously, making it an extremely useful tool to collect what people are actually saying online in a timely manner. Therefore, it was used for collecting data on users' concerns with and expectations of high-tech foods. This study collected and analyzed Twitter data on genome-edited foods and their labeling from May 25 to October 15, 2019. Of 14,066 unique user IDs, 94.9% posted 5 or fewer tweets, and 64.8% tweeted only once, indicating that the majority of users who tweeted on this issue were not so engaged as to post tweets consistently. After a process of refining, there were 28,722 tweets, of which 2,536 tweets (8.8%) were original, 326 (1.1%) were replies, and 25,860 (90%) were retweets. The numbers of tweets increased in response to government announcements and news content in the media. A total of six prominent peaks were detected during the investigation period, demonstrating that Twitter could serve as a tool for monitoring the degree of users' interest in real time. The co-occurrence network of original and reply tweets provided different words from various tweets that appeared with a certain frequency. However, the network derived from all tweets seemed to concentrate on words from specific tweets with negative overtones. As a result of sentiment analysis, 54.5% and 62.8% of tweets were negative about genome-edited food and the labeling policy of the Consumer Affairs Agency, respectively, indicating a strong demand for mandatory labeling. These findings are expected to contribute to the communication strategy for genome-edited foods toward social implementation by government officers and science communicators.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
(see https://tblock.github.io/10kGNAD/ for the original dataset page)
This page introduces the 10k German News Articles Dataset (10kGNAD), a German topic classification dataset. The 10kGNAD is based on the One Million Posts Corpus and available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. You can download the dataset here.

English text classification datasets are common. Examples are the big AG News, the class-rich 20 Newsgroups and the large-scale DBpedia ontology datasets for topic classification, and, for example, the commonly used IMDb and Yelp datasets for sentiment analysis. Non-English datasets, especially German datasets, are less common. There is a collection of sentiment analysis datasets assembled by the Interest Group on German Sentiment Analysis. However, to my knowledge, no German topic classification dataset is available to the public.

Due to grammatical differences between the English and the German language, a classifier might be effective on an English dataset, but not as effective on a German dataset. The German language has higher inflection, and long compound words are quite common compared to the English language. One would need to evaluate a classifier on multiple German datasets to get a sense of its effectiveness.

The 10kGNAD dataset is intended to solve part of this problem as the first German topic classification dataset. It consists of 10,273 German-language news articles from an Austrian online newspaper, categorized into nine topics. These articles are a previously unused part of the One Million Posts Corpus.
In the One Million Posts Corpus each article has a topic path, for example Newsroom/Wirtschaft/Wirtschaftpolitik/Finanzmaerkte/Griechenlandkrise. The 10kGNAD uses the second part of the topic path, here Wirtschaft, as the class label. As a result, the dataset can be used for multi-class classification.
I created and used this dataset in my thesis to train and evaluate four text classifiers on the German language. By publishing the dataset I hope to support the advancement of tools and models for the German language. Additionally, this dataset can be used as a benchmark dataset for German topic classification.

As in most real-world datasets, the class distribution of the 10kGNAD is not balanced. The biggest class, Web, consists of 1,678 articles, while the smallest class, Kultur, contains only 539. However, articles from the Web class have on average the fewest words, while articles from the Kultur class have the second-most words.

I propose a stratified split of 10% for testing and the remaining articles for training, as in the sketch below.
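A sketch of that split with scikit-learn; the file and column names are assumptions about the extracted-articles layout.

```python
# Hedged sketch: stratified 90/10 train/test split.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("articles.csv")  # hypothetical extracted-articles file
train, test = train_test_split(
    df, test_size=0.10, stratify=df["label"], random_state=42
)
print(len(train), len(test))
```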
To use the dataset as a benchmark, please use the train.csv and test.csv files located in the project root. Python scripts to extract the articles and split them into a train set and a test set are available in the code directory of this project. Make sure to install the requirements. The original corpus.sqlite3 is required to extract the articles (download here (compressed) or here (uncompressed)).
This dataset is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Please consider citing the authors of the One Million Posts Corpus if you use the dataset.
The Ontario government generates and maintains thousands of datasets. Since 2012, we have shared data with Ontarians via a data catalogue. Open data is data that is shared with the public. Click here to learn more about open data and why Ontario releases it. Ontario's Open Data Directive states that all data must be open, unless there is good reason for it to remain confidential. Ontario's Chief Digital and Data Officer also has the authority to make certain datasets available publicly. Datasets listed in the catalogue that are not open will have one of the following labels:

If you want to use data you find in the catalogue, that data must have a licence – a set of rules that describes how you can use it. A licence: Most of the data available in the catalogue is released under Ontario's Open Government Licence. However, each dataset may be shared with the public under other kinds of licences or no licence at all. If a dataset doesn't have a licence, you don't have the right to use the data. If you have questions about how you can use a specific dataset, please contact us.

The Ontario Data Catalogue endeavors to publish open data in a machine-readable format. For machine-readable datasets, you can simply retrieve the file you need using the file URL. The Ontario Data Catalogue is built on CKAN, which means the catalogue has the following features you can use when building applications. APIs (application programming interfaces) let software applications communicate directly with each other. If you are using the catalogue in a software application, you might want to extract data from the catalogue through the catalogue API. Note: all Datastore API requests to the Ontario Data Catalogue must be made server-side. The catalogue's collection of dataset metadata (and dataset files) is searchable through the CKAN API. The Ontario Data Catalogue has more than just CKAN's documented search fields; you can also search these custom fields. You can also use the CKAN API to retrieve metadata about a particular dataset and check for updated files. Read the complete documentation for CKAN's API. Some of the open data in the Ontario Data Catalogue is available through the Datastore API. You can also search and access the machine-readable open data that is available in the catalogue. How to use the API feature: read the complete documentation for CKAN's Datastore API. A minimal search example follows below.

The Ontario Data Catalogue contains a record for each dataset that the Government of Ontario possesses. Some of these datasets will be available to you as open data. Others will not be available to you. This is because the Government of Ontario is unable to share data that would break the law or put someone's safety at risk. You can search for a dataset with a word that might describe a dataset or topic. Use words like "taxes" or "hospital locations" to discover what datasets the catalogue contains. You can search for a dataset from 3 spots on the catalogue: the homepage, the dataset search page, or the menu bar available across the catalogue. On the dataset search page, you can also filter your search results. You can select filters on the left hand side of the page to limit your search for datasets with your favourite file format, datasets that are updated weekly, datasets released by a particular organization, or datasets that are released under a specific licence. Go to the dataset search page to see the filters that are available to make your search easier.
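A minimal search example against the catalogue's CKAN API; package_search is CKAN's standard action endpoint, and the base URL is assumed to be data.ontario.ca, so check the catalogue's own API documentation for the supported parameters.

```python
# Hedged sketch: full-text dataset search via the CKAN action API.
import requests

resp = requests.get(
    "https://data.ontario.ca/api/3/action/package_search",  # assumed base URL
    params={"q": "hospital locations", "rows": 5},
    timeout=30,
)
for result in resp.json()["result"]["results"]:
    print(result["title"])
```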
You can also do a quick search by selecting one of the catalogue's categories on the homepage. These categories can help you see the types of data we have on key topic areas. When you find the dataset you are looking for, click on it to go to the dataset record. Each dataset record will tell you whether the data is available, and, if so, tell you about the data available. An open dataset might contain several data files. These files might represent different periods of time, different sub-sets of the dataset, different regions, language translations, or other breakdowns. You can select a file and either download it or preview it. Make sure to read the licence agreement to make sure you have permission to use it the way you want. Read more about previewing data.

A non-open dataset may be not available for many reasons. Read more about non-open data. Read more about restricted data. Data that is non-open may still be subject to freedom of information requests. The catalogue has tools that enable all users to visualize the data in the catalogue without leaving the catalogue – no additional software needed. Have a look at our walk-through of how to make a chart in the catalogue.

Get automatic notifications when datasets are updated. You can choose to get notifications for individual datasets, an organization's datasets or the full catalogue. You don't have to provide any personal information – just subscribe to our feeds using any feed reader you like, using the corresponding notification web addresses. Copy those addresses and paste them into your reader. Your feed reader will let you know when the catalogue has been updated.

The catalogue provides open data in several file formats (e.g., spreadsheets, geospatial data, etc.). Learn about each format and how you can access and use the data each file contains.

A file that has a list of items and values separated by commas without formatting (e.g. colours, italics, etc.) or extra visual features. This format provides just the data that you would display in a table. XLSX (Excel) files may be converted to CSV so they can be opened in a text editor. How to access the data: open with any spreadsheet software application (e.g., Open Office Calc, Microsoft Excel) or text editor. Note: this format is considered machine-readable; it can be easily processed and used by a computer. Files that have visual formatting (e.g. bolded headers and colour-coded rows) can be hard for machines to understand; these elements make a file more human-readable and less machine-readable.

A file that provides information without formatted text or extra visual features that may not follow a pattern of separated values like a CSV. How to access the data: open with any word processor or text editor available on your device (e.g., Microsoft Word, Notepad).

A spreadsheet file that may also include charts, graphs, and formatting. How to access the data: open with a spreadsheet software application that supports this format (e.g., Open Office Calc, Microsoft Excel). Data can be converted to a CSV for a non-proprietary format of the same data without formatted text or extra visual features.

A shapefile provides geographic information that can be used to create a map or perform geospatial analysis based on location, points/lines and other data about the shape and features of the area. It includes required files (.shp, .shx, .dbt) and might include corresponding files (e.g., .prj). How to access the data: open with a geographic information system (GIS) software program (e.g., QGIS).
A package of files and folders. The package can contain any number of different file types. How to access the data: open with an unzipping software application (e.g., WinZIP, 7Zip). Note: if a ZIP file contains .shp, .shx, and .dbt file types, it is an ArcGIS ZIP: a package of shapefiles which provide information to create maps or perform geospatial analysis that can be opened with ArcGIS (a geographic information system software program).

A file that provides information related to a geographic area (e.g., phone number, address, average rainfall, number of owl sightings in 2011, etc.) and its geospatial location (i.e., points/lines). How to access the data: open using a GIS software application to create a map or do geospatial analysis. It can also be opened with a text editor to view raw information. Note: this format is machine-readable, and it can be easily processed and used by a computer. Human-readable data (including visual formatting) is easy for users to read and understand.

A text-based format for sharing data in a machine-readable way that can store data with more unconventional structures such as complex lists. How to access the data: open with any text editor (e.g., Notepad) or access through a browser. Note: this format is machine-readable, and it can be easily processed and used by a computer. Human-readable data (including visual formatting) is easy for users to read and understand.

A text-based format to store and organize data in a machine-readable way that can store data with more unconventional structures (not just data organized in tables). How to access the data: open with any text editor (e.g., Notepad). Note: this format is machine-readable, and it can be easily processed and used by a computer. Human-readable data (including visual formatting) is easy for users to read and understand.

A file that provides information related to an area (e.g., phone number, address, average rainfall, number of owl sightings in 2011, etc.) and its geospatial location (i.e., points/lines). How to access the data: open with a geospatial software application that supports the KML format (e.g., Google Earth). Note: this format is machine-readable, and it can be easily processed and used by a computer. Human-readable data (including visual formatting) is easy for users to read and understand.

This format contains files with data from tables used for statistical analysis and data visualization of Statistics Canada census data. How to access the data: open with the Beyond 20/20 application.

A database which links and combines data from different files or applications (including HTML, XML, Excel, etc.). The database file can be converted to a CSV/TXT to make the data machine-readable, but human-readable formatting will be lost. How to access the data: open with Microsoft Office Access (a database management system used to develop application software).

A file that keeps the original layout and
CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/
I have always been a binge-watcher, and with so many movies and series to watch, sentiment analysis of movie reviews is a good way to learn more about them.
This dataset contains the text of the reviews, together with a label that indicates whether a review is "positive" or "negative." The IMDb website itself contains ratings from 1 to 10. To simplify the modeling, this annotation is summarized as a two-class classification dataset where reviews with a score of 6 or higher are labeled as positive, and the rest as negative.
author = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
title = {Learning Word Vectors for Sentiment Analysis},
booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies}
One of the simplest but most effective and commonly used ways to represent text for machine learning is the bag-of-words representation. The task: classify the dataset for the highest cross-validation accuracy, with or without bag-of-words; a baseline sketch follows.
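A baseline sketch under assumed file and column names ("imdb_reviews.csv", "review", "sentiment"); swap in the actual file and columns.

```python
# Hedged sketch: bag-of-words + logistic regression with 5-fold CV.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

df = pd.read_csv("imdb_reviews.csv")  # hypothetical file name
model = make_pipeline(CountVectorizer(min_df=5),
                      LogisticRegression(max_iter=1000))
scores = cross_val_score(model, df["review"], df["sentiment"], cv=5)
print(f"mean CV accuracy: {scores.mean():.3f}")
```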
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
*****Documentation Process*****

1. Data Preparation:
- Upload the data into Power Query to assess quality and identify duplicate values, if any.
- Verify data quality and types for each column, addressing any miswriting or inconsistencies.

2. Data Management:
- Duplicate the original data sheet for future reference and label the new sheet as the "Working File" to preserve the integrity of the original dataset.

3. Understanding Metrics:
- Clarify the meaning of column headers, particularly distinguishing between Impressions and Reach, and comprehend how Engagement Rate is calculated.
- Engagement Rate formula: total likes, comments, and shares divided by Reach.

4. Data Integrity Assurance:
- Recognize that Impressions should outnumber Reach, reflecting total views versus unique audience size.
- Investigate discrepancies between Reach and Impressions to ensure data integrity, identifying and resolving root causes for accurate reporting and analysis.

5. Data Correction:
- Collaborate with the relevant team to rectify data inaccuracies, specifically addressing the discrepancy between Impressions and Reach.
- Engage with the concerned team to understand the root cause of discrepancies between Impressions and Reach.
- Identify instances where Impressions surpass Reach, potentially attributable to data transformation errors.
- Following the rectification process, meticulously adjust the dataset to reflect the corrected Impressions and Reach values accurately.
- Ensure diligent implementation of the corrections to maintain the integrity and reliability of the data.
- Conduct a thorough recalculation of the Engagement Rate post-correction, adhering to rigorous data integrity standards to uphold the credibility of the analysis.

6. Data Enhancement:
- Categorize Audience Age into three groups: "Senior Adults" (45+ years), "Mature Adults" (31-45 years), and "Adolescent Adults" (<30 years) within a new column named "Age Group."
- Split date and time into separate columns using the text-to-columns option for improved analysis.

7. Temporal Analysis:
- Introduce a new column for "Weekend and Weekday," renamed as "Weekday Type," to discern patterns and trends in engagement.
- Define time periods by categorizing into "Morning," "Afternoon," "Evening," and "Night" based on time intervals.

8. Sentiment Analysis:
- Populate blank cells in the Sentiment column with "Mixed Sentiment," denoting content containing both positive and negative sentiments or ambiguity.

9. Geographical Analysis:
- Group countries and obtain additional continent data from an online source (e.g., https://statisticstimes.com/geography/countries-by-continents.php).
- Add a new column for "Audience Continent" and utilize the XLOOKUP function to retrieve corresponding continent data.

(A pandas sketch of steps 3 and 6 follows below.)
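The workflow above is Excel/Power Query-based; as a rough cross-check, steps 3 and 6 could be replicated in pandas like this, with every column name assumed from the step descriptions and the age bin edges only approximating the stated ranges.

```python
# Hedged pandas sketch of the Engagement Rate and Age Group steps.
import pandas as pd

df = pd.read_csv("social_media_metrics.csv")  # hypothetical export
df["Engagement Rate"] = (df["Likes"] + df["Comments"] + df["Shares"]) / df["Reach"]
df["Age Group"] = pd.cut(
    df["Audience Age"],
    bins=[0, 30, 45, 200],  # approximate boundaries for the three groups
    labels=["Adolescent Adults", "Mature Adults", "Senior Adults"],
)
print(df[["Engagement Rate", "Age Group"]].head())
```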
*****Drawing Conclusions and Providing a Summary*****
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Zoopla Properties Listing dataset to explore detailed property information, including pricing, location, and features. Popular use cases include real estate market analysis, property valuation, and investment research.
Use our Zoopla Properties Listing Information dataset to explore detailed property listings, including property details, pricing, location, and market trends across various regions. This dataset provides valuable insights into property valuations, consumer preferences, and real estate dynamics, enabling businesses and researchers to make data-driven decisions.
Tailored for real estate professionals, investors, and market analysts, this dataset supports market trend analysis, property valuation assessments, and investment strategy development. Whether you're evaluating property investments, tracking market conditions, or conducting competitive analysis, the Zoopla Properties Listing Information dataset is a key resource for navigating the real estate landscape.
Dataset Features
Distribution
Usage
This dataset is ideal for a variety of high-impact applications:
Coverage
License
CUSTOM
Please review the respective licenses below:
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
CMU-MOSEI is a comprehensive multimodal dataset designed to analyze emotions and sentiment in online videos. It's a valuable resource for researchers and developers working on automatic emotion recognition and sentiment analysis.
Key Features: Over 23,500 video clips from 1000+ speakers, covering diverse topics and monologues.
Multimodal data:
- Acoustic: features extracted from audio (CMU_MOSEI_COVAREP.csd)
- Labels: annotations for sentiment intensity and emotion categories (CMU_MOSEI_Labels.csd)
- Language: phonetic, word-level, and word-vector representations (CMU_MOSEI_*.csd files under the languages folder)
- Visual: features extracted from facial expressions (CMU_MOSEI_Visual*.csd files under the visuals folder)
Balanced for gender: The dataset ensures equal representation from male and female speakers.
Unlocking Insights: By exploring the various modalities within CMU-MOSEI, researchers can investigate the relationship between speech, facial expressions, and emotions expressed in online videos.
Download: The dataset is freely available for download at: http://immortal.multicomp.cs.cmu.edu/CMU-MOSEI/
Start exploring the world of emotions in videos with CMU-MOSEI!
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
F. Ramoliya, R. Kakkar, R. Gupta, S. Tanwar and S. Agrawal, "SEAM: Deep Learning-based Secure Message Exchange Framework For Autonomous EVs," 2023 IEEE Globecom Workshops (GC Wkshps), Kuala Lumpur, Malaysia, 2023, pp. 80-85, doi: 10.1109/GCWkshps58843.2023.10465168.
This dataset provides a comprehensive collection of data for detecting, diagnosing, and mitigating cyber threats using network traffic data, textual content, and entity relationships. It can be used for training machine learning models to identify various types of cyber threats, understand their underlying patterns, and recommend appropriate solutions.
1. id: A unique identifier for each instance in the dataset.
2. text: Textual content transferred over the network, such as emails, messages, or network traffic payloads. This column may contain descriptions of potential cyber threats or attack vectors.
3. Entries: A list of JSON objects containing the following fields:
- sender_id: The ID of the entity that sent or initiated the communication.
- label: The type of cyber threat or attack pattern identified, such as malware, attack pattern, identity, benign, software attack, or threat actor.
- start_offset: The starting character position of the identified entity or threat within the text field.
- end_offset: The ending character position of the identified entity or threat within the text field.
- receiver_ids: A list of IDs representing the entities that received or were targeted by the communication.
4. relations: A list of tuples representing the relationships between entities, where each tuple contains a pair of entity IDs indicating the source and target of the relationship.
5. diagnosis: A description or diagnosis of the identified cyber threat, providing insights into the nature and potential impact of the threat.
6. solution: Recommended solutions or mitigation strategies for addressing the identified cyber threat, such as implementing specific security controls, software updates, or network configurations.
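To make the schema concrete, here is a hypothetical record with every value invented for illustration; the field names follow the description above.

```python
# Hypothetical example record (all values invented).
record = {
    "id": 42,
    "text": "Phishing email delivering a malicious attachment ...",
    "Entries": [
        {
            "sender_id": "entity_7",
            "label": "threat actor",
            "start_offset": 0,
            "end_offset": 8,
            "receiver_ids": ["entity_3", "entity_9"],
        }
    ],
    "relations": [("entity_7", "entity_3")],
    "diagnosis": "Credential-harvesting phishing campaign.",
    "solution": "Enable MFA and quarantine the attachment.",
}
```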
- Cyber Threat Detection: Train machine learning models to identify and classify various types of cyber threats based on network traffic data and textual content.
- Threat Intelligence and Analysis: Analyze the relationships between entities, threat actors, and attack patterns to gain insights into emerging cyber threats and their propagation mechanisms.
- Incident Response and Mitigation: Develop systems that can recommend appropriate solutions and mitigation strategies based on the diagnosed cyber threats, enabling timely and effective incident response.
- Network Security Monitoring: Implement real-time monitoring and analysis of network traffic to detect and prevent cyber attacks as they occur.
- Cybersecurity Education and Research: Utilize the dataset for training cybersecurity professionals, conducting research on cyber threat detection and mitigation techniques, and developing new algorithms and approaches.
- Multi-Modal Threat Detection: Develop multi-modal machine learning models that can leverage both the network traffic data and textual content to enhance cyber threat detection capabilities.
- Natural Language Processing (NLP) for Threat Analysis: Apply NLP techniques to analyze the textual content and identify potential threats, threat actors, and their relationships.
- Graph Neural Networks: Leverage entity relationships and network traffic patterns to build graph neural network models for detecting and classifying complex cyber threats.
- Anomaly Detection: Implement unsupervised or semi-supervised learning algorithms to detect anomalous network traffic patterns and textual content indicating cyber threats.
- Transfer Learning and Domain Adaptation: Explore transfer learning techniques to adapt pre-trained models or knowledge from related domains to the cyber threat detection task.
- Federated Learning: Develop federated learning frameworks for collaborative threat intelligence, distributed threat monitoring, and personalized threat detection.
Collaborative Threat Intelligence: Develop federated learning frameworks that enable organizations to collaboratively train machine learning models for cyber threat detection while preserving data privacy and confidentiality.
Distributed Threat Monitoring: Implement federated learning systems that can monitor and detect cyber threats across multiple distributed networks or devices, without the need for centralized data collection.
Personalized Threat Detection: Leverage federated learning to build personalized threat detection models tailored to specific organizatio...