Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Name: BBC Articles Sentiment Analysis Dataset
Source: BBC News
Description: This dataset consists of articles from the BBC News website, containing a diverse range of topics such as business, politics, entertainment, technology, sports, and more. The dataset includes articles from various time periods and categories, along with labels representing the sentiment of the article. The sentiment labels indicate whether the tone of the article is positive, negative, or neutral, making it suitable for sentiment analysis tasks.
Number of Instances: [Specify the number of articles in the dataset, for example, 2,225 articles]
Number of Features:
1. Article Text: The content of the article (string).
2. Sentiment Label: The sentiment classification of the article. The possible labels are:
   - Positive
   - Negative
   - Neutral
Data Fields:
- id: Unique identifier for each article.
- category: The category or topic of the article (e.g., business, politics, sports).
- title: The title of the article.
- content: The full text of the article.
- sentiment: The sentiment label (positive, negative, or neutral).
Example:

| id | category | title | content | sentiment |
|----|----------|-------|---------|-----------|
| 1 | Business | "Stock Market Surge" | "The stock market has surged to new highs, driven by strong earnings..." | Positive |
| 2 | Politics | "Election Results" | "The election results were a mixed bag, with some surprises along the way." | Neutral |
| 3 | Sports | "Team Wins Championship" | "The team won the championship after a thrilling final match." | Positive |
| 4 | Technology | "New Smartphone Release" | "The new smartphone release has received mixed reactions from users." | Negative |
Preprocessing Notes:
- The text has been preprocessed to remove special characters and any HTML tags that might have been included in the original articles.
- Tokenization or further text cleaning (e.g., lowercasing, stopword removal) may be necessary depending on the model and method used for sentiment classification.
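A minimal cleaning sketch along the lines of the notes above, assuming the articles are available as plain strings; the stopword list and function name are illustrative, not part of the dataset:

```python
import re

# Illustrative stopword list; a fuller list (e.g., NLTK's) could be substituted.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on", "has", "have"}

def clean_article(text: str) -> str:
    """Lowercase, strip residual HTML tags and special characters, drop stopwords."""
    text = re.sub(r"<[^>]+>", " ", text)               # remove any leftover HTML tags
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())   # keep only alphanumerics
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

print(clean_article("The stock market has surged to new highs, driven by strong earnings..."))
# -> "stock market surged new highs driven by strong earnings"
```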
Use Case: This dataset is ideal for training and evaluating machine learning models for sentiment classification, where the goal is to predict the sentiment (positive, negative, or neutral) based on the article's text.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Corpus consisting of 10,000 Facebook posts manually annotated for sentiment (2,587 positive, 5,174 neutral, 1,991 negative and 248 bipolar posts). The archive contains data and statistics in an Excel file (FBData.xlsx) and gold data in two text files with posts (gold-posts.txt) and labels (gold-labels.txt) on corresponding lines.
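A minimal sketch for pairing each post with its label, assuming one item per line; the UTF-8 encoding is an assumption, and the file names follow the description above:

```python
# Pair each post with the label on the corresponding line of the gold files.
with open("gold-posts.txt", encoding="utf-8") as f_posts, \
     open("gold-labels.txt", encoding="utf-8") as f_labels:
    pairs = [(post.strip(), label.strip()) for post, label in zip(f_posts, f_labels)]

print(len(pairs), pairs[0])
```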
Dataset Card for Custom Text Dataset
Dataset Name
Custom Text Dataset
Overview
This dataset contains text data for training sentiment analysis models. The data is collected from various sources, including books, articles, and web pages.
Composition
Number of records: 50,000
Fields: text, label
Size: 134 MB
Collection Process
The data was collected using web scraping and manual extraction from public domain sources.… See the full description on the dataset page: https://huggingface.co/datasets/t7439/custom_sentiment_analysis_dataset.
With the advent and expansion of social networking, the amount of generated text data has seen a sharp increase. In order to handle such a huge volume of text data, new and improved text mining techniques are a necessity. One of the characteristics of text data that makes text mining difficult is multi-labelity. In order to build a robust and effective text classification method, which is an integral part of text mining research, we must consider this property more closely. This property is not unique to text data, as it can be found in non-text (e.g., numeric) data as well; however, it is most prevalent in text data. It also places the text classification problem in the domain of multi-label classification (MLC), where each instance is associated with a subset of class labels instead of a single class, as in conventional classification. In this paper, we explore how the generation of pseudo labels (i.e., combinations of existing class labels) can help us perform better text classification and under what kinds of circumstances. During classification, the high and sparse dimensionality of text data has also been considered. Although we propose and evaluate a text classification technique here, our main focus is on handling the multi-labelity of text data while utilizing the correlation among the multiple labels existing in the data set. Our text classification technique is called pseudo-LSC (pseudo-Label Based Subspace Clustering). It is a subspace clustering algorithm that considers the high and sparse dimensionality as well as the correlation among different class labels during the classification process to provide better performance than existing approaches. Results on three real-world multi-label data sets provide insight into how multi-labelity is handled in our classification process and show the effectiveness of our approach.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Kurdish language is regarded as one of the less-resourced languages. The language is spoken globally by 30-40 million people. It has 33 letters, largely similar to those of the Arabic alphabet. The Kurdish language has two major dialects, Sorani and Badini. The dataset includes a collection of texts written in the Sorani dialect. It contains tweets collected via the Twitter API. For security reasons, and following Twitter's policies, we removed the users' identities. We collected tweets published during the coronavirus pandemic. The tweets are raw texts, and the content covers a varied range of topics, including politics, sports, entertainment, social life, etc. Data Labeling: We used the Twitter developer API to mine the tweets. The dataset was annotated manually by three Kurdish native speakers. The annotators were required to identify the class and category of each text. The classes included positive, negative and neutral, and the categories consisted of news, technology, art, social and health. Texts for which at least two annotators agreed on a specific label and category were regarded as conflict-free and accepted for further processing. Texts that caused conflict among all three raters were removed from the dataset. The doccano program was used to help the annotators label each text one by one.
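A minimal sketch of the conflict-resolution rule described above (keep a text only when at least two of the three annotators agree); the data structures and example labels are illustrative:

```python
from collections import Counter

def resolve(labels):
    """Return the majority label if at least two of three annotators agree, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 2 else None

# Illustrative annotations: (text, [labels from three annotators])
annotations = [
    ("tweet about health measures", ["positive", "positive", "neutral"]),
    ("tweet about politics", ["positive", "negative", "neutral"]),  # full conflict
]

kept = [(text, resolve(labels)) for text, labels in annotations if resolve(labels) is not None]
print(kept)  # conflicting texts are dropped, mirroring the dataset's preparation
```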
Product Review Datasets: Uncover user sentiment
Harness the power of Product Review Datasets to understand user sentiment and insights deeply. These datasets are designed to elevate your brand and product feature analysis, help you evaluate your competitive stance, and assess investment risks.
Data sources:
Leave the data collection challenges to us and dive straight into market insights with clean, structured, and actionable data, including:
Choose from multiple data delivery options to suit your needs:
Why choose Oxylabs?
Fresh and accurate data: Access organized, structured, and comprehensive data collected by our leading web scraping professionals.
Time and resource savings: Concentrate on your core business goals while we efficiently handle the data extraction process at an affordable cost.
Adaptable solutions: Share your specific data requirements, and we'll craft a customized data collection approach to meet your objectives.
Legal compliance: Partner with a trusted leader in ethical data collection. Oxylabs is a founding member of the Ethical Web Data Collection Initiative, aligning with GDPR and CCPA standards.
Pricing Options:
Standard Datasets: choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.
Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.
Experience a seamless journey with Oxylabs:
Join the ranks of satisfied customers who appreciate our meticulous attention to detail and personalized support. Experience the power of Product Review Datasets today to uncover valuable insights and enhance decision-making.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Sentiment analysis outputs based on the combination of three classifiers for news headlines and body text covering the Olympic legacy of Rio 2016 and London 2012. Data was located via the Google search engine. The dataset is composed of sentiment labels assigned to 1,271 news articles in total.
News outlets:
Events covered by the articles:
All classifiers were applied to texts in English. Texts originally published in Portuguese by the Brazilian media were automatically translated.
Sentiment classifiers used:
Each document (spreadsheet - xlsx) refers to one outlet and one event (London 2012 or Rio 2016).
How were labels assigned to the texts?
These labels are a combination of the outputs of the three sentiment classifiers listed above. If at least two of them agreed on the same label, that label was accepted. Otherwise, the label ‘other’ was assigned.
For news article body text: the proportion of sentences of each sentiment type was used to assign labels to the whole article instead of averaging the sentence scores. For example, if the proportion of sentences with negative labels is greater than 50%, then the article is assigned a negative label.
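A minimal sketch of the two rules just described, with the 50% body-text threshold taken from the example above; how ties or lower proportions are resolved is an assumption of this sketch:

```python
from collections import Counter

def combine_headline_labels(labels):
    """Majority vote across the three classifier outputs; 'other' if no two agree."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 2 else "other"

def label_body_text(sentence_labels, threshold=0.5):
    """Assign the article label from the proportion of per-sentence labels."""
    label, count = Counter(sentence_labels).most_common(1)[0]
    return label if count / len(sentence_labels) > threshold else "other"

print(combine_headline_labels(["negative", "negative", "neutral"]))        # -> negative
print(label_body_text(["negative", "negative", "negative", "positive"]))   # -> negative
```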
The documents are composed of the following columns:
PS: Documents do not include articles' body text.
Sentiment is presented in labels as follows:
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for "Large twitter tweets sentiment analysis"
Dataset Description
Dataset Summary
This dataset is a collection of tweets formatted in a tabular data structure, annotated for sentiment analysis. Each tweet is associated with a sentiment label, with 1 indicating a Positive sentiment and 0 for a Negative sentiment.
Languages
The tweets are in English.
Dataset Structure
Data Instances
An instance of… See the full description on the dataset page: https://huggingface.co/datasets/gxb912/large-twitter-tweets-sentiment.
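A minimal loading sketch, assuming the Hugging Face `datasets` library; the dataset ID comes from the page URL above, while the split and column names would need to be checked against the dataset card:

```python
from datasets import load_dataset

ds = load_dataset("gxb912/large-twitter-tweets-sentiment")  # ID from the page URL above
print(ds)                             # inspect available splits and columns
first_split = next(iter(ds.values()))
print(first_split[0])                 # e.g., a tweet text with its 0/1 sentiment label
```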
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the data produced by running the Naive Bayes classifier algorithm. It is a list of every word in the vocabulary of the classifier, along with the number of occurrences of each word and the likelihood ratio of each word. Please note that the likelihood ratio is calculated by dividing the likelihood of a word given a positive label by the likelihood of the word given a negative label. This data is licensed under the CC BY 4.0 international license and may be used freely with credit given. This data was produced from two different datasets using a Naive Bayes classifier: the Polarity Review v2.0 dataset from Cornell and the Large Movie Review Dataset from Stanford.
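A minimal sketch of how a per-word likelihood ratio of this kind can be computed from labelled documents; the Laplace smoothing and variable names are assumptions of the sketch, not a description of how the original counts were produced:

```python
from collections import Counter

def likelihood_ratios(pos_docs, neg_docs, alpha=1.0):
    """Per-word P(word | positive) / P(word | negative), with Laplace smoothing."""
    pos_counts = Counter(w for doc in pos_docs for w in doc.split())
    neg_counts = Counter(w for doc in neg_docs for w in doc.split())
    vocab = set(pos_counts) | set(neg_counts)
    pos_total, neg_total = sum(pos_counts.values()), sum(neg_counts.values())
    ratios = {}
    for w in vocab:
        p_pos = (pos_counts[w] + alpha) / (pos_total + alpha * len(vocab))
        p_neg = (neg_counts[w] + alpha) / (neg_total + alpha * len(vocab))
        ratios[w] = p_pos / p_neg
    return ratios

# Toy example: words frequent in positive reviews get ratios above 1.
print(likelihood_ratios(["great film great cast"], ["dull plot"]))
```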
If you wish to use this data please cite:
Katarzyna Baraniak, Marcin Sydow,
A dataset for Sentiment analysis of Entities in News headlines (SEN),
Procedia Computer Science,
Volume 192,
2021,
Pages 3627-3636,
ISSN 1877-0509,
https://doi.org/10.1016/j.procs.2021.09.136.
(https://www.sciencedirect.com/science/article/pii/S1877050921018755)
bibtex: users.pja.edu.pl/~msyd/bibtex/sydow-baraniak-SENdataset-kes21.bib
SEN is a novel publicly available human-labelled dataset for training and testing machine learning algorithms for the problem of entity level sentiment analysis of political news headlines.
On-line news portals play a very important role in the information society. Fair media should present reliable and objective information. In practice there is an observable positive or negative bias concerning named entities (e.g. politicians) mentioned in the on-line news headlines.
Our dataset consists of 3819 human-labelled political news headlines coming from several major on-line media outlets in English and Polish.
Each record contains a news headline, a named entity mentioned in the headline and a human-annotated label (one of “positive”, “neutral”, “negative”). Our SEN dataset package consists of 2 parts: SEN-en (English headlines, split into SEN-en-R and SEN-en-AMT) and SEN-pl (Polish headlines). Each headline-entity pair was annotated either by a team of volunteer researchers (the whole SEN-pl dataset and a subset of 1271 English records: the SEN-en-R subset, “R” for “researchers”) or via the Amazon Mechanical Turk service (a subset of 1360 English records: the SEN-en-AMT subset).
During analysis of the annotations, outlying annotations were identified and removed. A separate version of the dataset without outliers is marked by "noutliers" in the data file name.
Details of the process of preparing the dataset and of its analysis are presented in the paper.
In case of any questions, please contact one of the authors. Email addresses are in the paper.
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
MULTI-LABEL ASRS DATASET CLASSIFICATION USING SEMI-SUPERVISED SUBSPACE CLUSTERING
MOHAMMAD SALIM AHMED, LATIFUR KHAN, NIKUNJ OZA, AND MANDAVA RAJESWARI
Abstract. There has been a lot of research targeting text classification. Much of it focuses on a particular characteristic of text data: multi-labelity. This arises from the fact that a document may be associated with multiple classes at the same time. The consequence of such a characteristic is the low performance of traditional binary or multi-class classification techniques on multi-label text data. In this paper, we propose a text classification technique that considers this characteristic and provides very good performance. Our multi-label text classification approach is an extension of our previously formulated [3] multi-class text classification approach called SISC (Semi-supervised Impurity based Subspace Clustering). We call this new classification model SISC-ML (SISC Multi-Label). Empirical evaluation on the real-world multi-label NASA ASRS (Aviation Safety Reporting System) data set reveals that our approach outperforms state-of-the-art text classification as well as subspace clustering algorithms.
This is a data set of 482,251 public tweets and retweets (Twitter IDs) posted by the #edchat online community of educators who discuss current trends in teaching with technology. The data set was collected via Twitter's Streaming API between Feb 1, 2018 and Apr 4, 2018, and was used as part of the research on developing a learning analytics dashboard for teaching and learning with Twitter. Following Twitter's terms of service, the data set only includes unique identifiers of relevant tweets. To collect the actual tweets that are part of this data set, you will need to use one of the available third party tools such as Hydrator or Twarc ("hydrate" function). As part of this release, we are also attaching an enriched version of this data set that contains sentiment and opinion analysis labels that were produced by analyzing each tweet with the help of the NLTK SentimentAnalyzer Python package. *This work was supported in part by eCampusOntario and The Social Sciences and Humanities Research Council of Canada.
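A minimal sketch of the kind of per-tweet scoring described above, assuming tweets have already been hydrated from the released IDs (e.g., with twarc); using NLTK's VADER analyzer here is an assumption of the sketch, not a reproduction of the authors' exact pipeline:

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")        # one-time download of the VADER lexicon
sia = SentimentIntensityAnalyzer()

# Placeholder text standing in for a hydrated tweet from the #edchat collection.
hydrated_tweets = ["Loving today's #edchat discussion on formative assessment!"]
for text in hydrated_tweets:
    print(sia.polarity_scores(text))  # compound score can be thresholded into labels
```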
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The peer-reviewed paper of AWARE dataset is published in ASEW 2021, and can be accessed through: http://doi.org/10.1109/ASEW52652.2021.00049. Kindly cite this paper when using AWARE dataset.
Aspect-Based Sentiment Analysis (ABSA) aims to identify the opinion (sentiment) with respect to a specific aspect. Since there is a lack of smartphone app review datasets annotated to support the ABSA task, we present AWARE: ABSA Warehouse of Apps REviews.
AWARE contains app reviews from three different domains (Productivity, Social Networking, and Games), as each domain has its own distinct functionalities and audience. Each sentence is annotated with three labels, as follows:
Aspect Term: a term that exists in the sentence and describes an aspect of the app that is expressed by the sentiment. A term value of “N/A” means that the term is not explicitly mentioned in the sentence.
Aspect Category: one of the pre-defined set of domain-specific categories that represent an aspect of the app (e.g., security, usability, etc.).
Sentiment: positive or negative.
Note: the Games domain does not contain aspect terms.
We provide a comprehensive dataset of 11,323 sentences from the three domains, where each sentence is additionally annotated with a Boolean value indicating whether the sentence expresses a positive/negative opinion. In addition, we provide three separate datasets, one for each domain, containing only sentences that express opinions. The file named “AWARE_metadata.csv” contains a description of the dataset’s columns.
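A minimal loading sketch; the metadata file name comes from the description above, while the combined-data file name and column names used here are hypothetical and should be checked against AWARE_metadata.csv:

```python
import pandas as pd

meta = pd.read_csv("AWARE_metadata.csv")    # column descriptions for the dataset
print(meta)

data = pd.read_csv("AWARE_combined.csv")    # hypothetical file name
opinions = data[data["is_opinion"]]         # hypothetical Boolean opinion column
print(opinions[["sentence", "aspect_category", "sentiment"]].head())  # hypothetical columns
```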
How can AWARE be used?
We designed AWARE so that it can serve various tasks. The tasks include, but are not limited to:
Sentiment Analysis.
Aspect Term Extraction.
Aspect Category Classification.
Aspect Sentiment Analysis.
Explicit/Implicit Aspect Term Classification.
Opinion/Not-Opinion Classification.
Furthermore, researchers can experiment with and investigate the effects of different domains on users' feedback.
https://www.archivemarketresearch.com/privacy-policy
The global Data Labeling Solution and Services market is experiencing robust growth, driven by the increasing adoption of artificial intelligence (AI) and machine learning (ML) across diverse sectors. The market, estimated at $15 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033, reaching an estimated market value of $70 billion by 2033. This significant expansion is fueled by the burgeoning need for high-quality training data to enhance the accuracy and performance of AI models. Key growth drivers include the expanding application of AI in various industries like automotive (autonomous vehicles), healthcare (medical image analysis), and financial services (fraud detection). The increasing availability of diverse data types (text, image/video, audio) further contributes to market growth. However, challenges such as the high cost of data labeling, data privacy concerns, and the need for skilled professionals to manage and execute labeling projects pose certain restraints on market expansion. Segmentation by application (automotive, government, healthcare, financial services, others) and data type (text, image/video, audio) reveals distinct growth trajectories within the market. The automotive and healthcare sectors currently dominate, but the government and financial services segments are showing promising growth potential. The competitive landscape is marked by a mix of established players and emerging startups. Companies like Amazon Mechanical Turk, Appen, and Labelbox are leading the market, leveraging their expertise in crowdsourcing, automation, and specialized data labeling solutions. However, the market shows strong potential for innovation, particularly in the development of automated data labeling tools and the expansion of services into niche areas. Regional analysis indicates strong market penetration in North America and Europe, driven by early adoption of AI technologies and robust research and development efforts. However, Asia-Pacific is expected to witness significant growth in the coming years fueled by rapid technological advancements and a rising demand for AI solutions. Further investment in R&D focused on automation, improved data security, and the development of more effective data labeling methodologies will be crucial for unlocking the full potential of this rapidly expanding market.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data includes deep learning weights of CNN and LSTM models trained with EV charging station user reviews to predict positive and negative consumer sentiment labels. To accompany the paper titled: "Real-time data from mobile platforms to evaluate sustainable transportation infrastructure."
CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/
PRDECT-ID Dataset is a collection of Indonesian product review data annotated with emotion and sentiment labels. The data were collected from Tokopedia, one of the largest e-commerce platforms in Indonesia. The dataset contains product reviews from 29 product categories on Tokopedia written in the Indonesian language. Each product review is annotated with a single emotion, i.e., love, happiness, anger, fear, or sadness. A group of annotators carried out the annotation process, providing emotion labels by following annotation criteria created by an expert in clinical psychology. Other attributes related to the product review are also extracted, such as Location, Price, Overall Rating, Number Sold, Total Review, and Customer Rating, to support further research.
Natural Language Processing, Text Processing, Consumer Emotion, Text Mining, Sentiment Analysis
Rhio Sutoyo
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains all the data used in the JCDL 2020 research paper: Aspect-based Sentiment Analysis of Scientific Reviews
The dataset is split into multiple files containing all the sentence annotations and the ICLR open review dataset (with reviews, scores, confidence scores, final recommendations, etc.) for the last three years.
The file "iclr_conf.p" is a pickle file which contains a NumPy array object. The array contains 2681 rows corresponding to each accepted or rejected paper of 2017,2018,2019 Each row contains 4 columns. The first column is the link of the paper in openreview.net, from where the data related to the paper is collected. The second column is either 0 or 1, corresponding to the final decision: rejection or acceptance respectively. The third column is the year of the conference for the particular submission. The fourth column is another NumPy array containing 3 reviews in 3 rows. Each row of this array contains 3 columns containing the list of sentences in the same sequence as it appears in the text of the review, the confidence(ranging from 1-5), and the rating(ranging(1-10)) respectively.
Each line of the file "sentences.csv" contains one sentence whose annotation is provided in the corresponding line of the file "annotations.csv". The file "annotations.csv" contains 8 comma-separated integers on each line. Each column corresponds to one of the following aspects: Appropriateness, Clarity, Originality, Empirical/Theoretical Soundness, Meaningful Comparison, Substance, Impact of Dataset/Software/Ideas, and Recommendation. The integers 0, 1, 2, and 3 correspond to the following sentiment labels of the sentence on that aspect: Absent, Positive, Negative, and Neutral, respectively.
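A minimal reading sketch following the file layout described above; the unpacking assumes the row/column structure as stated, and NumPy must be installed for the pickle to load:

```python
import pickle

# Each row of the pickled array: [openreview URL, decision (0/1), year, reviews array].
with open("iclr_conf.p", "rb") as f:
    papers = pickle.load(f)

url, decision, year, reviews = papers[0]
sentences, confidence, rating = reviews[0]          # first of the three reviews
print(url, decision, year, confidence, rating, len(sentences))

# Pair each sentence with its 8 aspect annotations
# (0 = Absent, 1 = Positive, 2 = Negative, 3 = Neutral).
with open("sentences.csv") as fs, open("annotations.csv") as fa:
    for sentence, labels in zip(fs, fa):
        aspects = [int(x) for x in labels.strip().split(",")]
        break                                       # inspect only the first pair
print(sentence.strip(), aspects)
```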
Please cite our paper published in JCDL-2020 if you use our data: https://dl.acm.org/doi/10.1145/3383583.3398541
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
We provide annotated datasets on a three-point sentiment scale (positive, neutral and negative) for Serbian, Bosnian, Macedonian, Albanian, and Estonian. For all languages except Estonian, we include pairs of source URL (where corresponding text can be found) and sentiment label.
For Estonian, we randomly sampled 100 articles from "Ekspress news article archive (in Estonian and Russian) 1.0" (http://hdl.handle.net/11356/1408).
The data is organized in Tab-Separated Values (TSV) format. For Serbian, Bosnian, Macedonian, and Albanian, the dataset contains two columns: sourceURL and sentiment. For Estonian, the dataset consists of three columns: text ID (from the CLARIN.SI reference above), body text, and sentiment label.
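A minimal reading sketch for the TSV layouts described above; the file names are illustrative, and the sketch assumes the files have no header row (drop the `names=` argument if they do):

```python
import pandas as pd

sr = pd.read_csv("serbian_sentiment.tsv", sep="\t", names=["sourceURL", "sentiment"])
et = pd.read_csv("estonian_sentiment.tsv", sep="\t", names=["text_id", "body_text", "sentiment"])

print(sr["sentiment"].value_counts())   # positive / neutral / negative distribution
print(et.head())
```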
Sentiment140 consists of Twitter messages with emoticons, which are used as noisy labels for sentiment classification. For more detailed information please refer to the paper.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Financial Sentiment Analysis’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/sbhatti/financial-sentiment-analysis on 13 February 2022.
--- Dataset description provided by original source is as follows ---
The following data is intended for advancing financial sentiment analysis research. It combines two datasets (FiQA and Financial PhraseBank) into one easy-to-use CSV file. It provides financial sentences with sentiment labels.
Malo, Pekka, et al. "Good debt or bad debt: Detecting semantic orientations in economic texts." Journal of the Association for Information Science and Technology 65.4 (2014): 782-796.
--- Original source retains full ownership of the source dataset ---