Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Name: BBC Articles Sentiment Analysis Dataset
Source: BBC News
Description: This dataset consists of articles from the BBC News website, containing a diverse range of topics such as business, politics, entertainment, technology, sports, and more. The dataset includes articles from various time periods and categories, along with labels representing the sentiment of the article. The sentiment labels indicate whether the tone of the article is positive, negative, or neutral, making it suitable for sentiment analysis tasks.
Number of Instances: [Specify the number of articles in the dataset, for example, 2,225 articles]
Number of Features:
1. Article Text: The content of the article (string).
2. Sentiment Label: The sentiment classification of the article. The possible labels are:
   - Positive
   - Negative
   - Neutral
Data Fields:
- id: Unique identifier for each article.
- category: The category or topic of the article (e.g., business, politics, sports).
- title: The title of the article.
- content: The full text of the article.
- sentiment: The sentiment label (positive, negative, or neutral).
Example:

| id | category | title | content | sentiment |
|----|----------|-------|---------|-----------|
| 1 | Business | "Stock Market Surge" | "The stock market has surged to new highs, driven by strong earnings..." | Positive |
| 2 | Politics | "Election Results" | "The election results were a mixed bag, with some surprises along the way." | Neutral |
| 3 | Sports | "Team Wins Championship" | "The team won the championship after a thrilling final match." | Positive |
| 4 | Technology | "New Smartphone Release" | "The new smartphone release has received mixed reactions from users." | Negative |
Preprocessing Notes:
- The text has been preprocessed to remove special characters and any HTML tags that might have been included in the original articles.
- Tokenization or further text cleaning (e.g., lowercasing, stopword removal) may be necessary depending on the model and method used for sentiment classification.
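A minimal cleaning sketch along the lines of the notes above, assuming the articles are available as plain strings; the stopword list and function name are illustrative, not part of the dataset:

```python
import re

# Illustrative stopword list; a fuller list (e.g., NLTK's) could be substituted.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on", "has", "have"}

def clean_article(text: str) -> str:
    """Lowercase, strip residual HTML tags and special characters, drop stopwords."""
    text = re.sub(r"<[^>]+>", " ", text)               # remove any leftover HTML tags
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())   # keep only alphanumerics
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

print(clean_article("The stock market has surged to new highs, driven by strong earnings..."))
# -> "stock market surged new highs driven by strong earnings"
```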
Use Case: This dataset is ideal for training and evaluating machine learning models for sentiment classification, where the goal is to predict the sentiment (positive, negative, or neutral) based on the article's text.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Corpus consisting of 10,000 Facebook posts manually annotated for sentiment (2,587 positive, 5,174 neutral, 1,991 negative and 248 bipolar posts). The archive contains data and statistics in an Excel file (FBData.xlsx) and gold data in two text files with posts (gold-posts.txt) and labels (gold-labels.txt) on corresponding lines.
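A minimal sketch for pairing each post with its label, assuming one item per line; the UTF-8 encoding is an assumption, and the file names follow the description above:

```python
# Pair each post with the label on the corresponding line of the gold files.
with open("gold-posts.txt", encoding="utf-8") as f_posts, \
     open("gold-labels.txt", encoding="utf-8") as f_labels:
    pairs = [(post.strip(), label.strip()) for post, label in zip(f_posts, f_labels)]

print(len(pairs), pairs[0])
```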
Dataset Card for Custom Text Dataset
Dataset Name
Custom Text Dataset
Overview
This dataset contains text data for training sentiment analysis models. The data is collected from various sources, including books, articles, and web pages.
Composition
Number of records: 50,000
Fields: text, label
Size: 134 MB
Collection Process
The data was collected using web scraping and manual extraction from public domain sources.… See the full description on the dataset page: https://huggingface.co/datasets/t7439/custom_sentiment_analysis_dataset.
With the advent and expansion of social networking, the amount of generated text data has seen a sharp increase. In order to handle such a huge volume of text data, new and improved text mining techniques are a necessity. One of the characteristics of text data that makes text mining difficult is multi-labelity. In order to build a robust and effective text classification method, which is an integral part of text mining research, we must consider this property more closely. This property is not unique to text data, as it can be found in non-text (e.g., numeric) data as well; however, it is most prevalent in text data. It also places the text classification problem in the domain of multi-label classification (MLC), where each instance is associated with a subset of class labels instead of a single class, as in conventional classification. In this paper, we explore how the generation of pseudo labels (i.e., combinations of existing class labels) can help us perform better text classification and under what kinds of circumstances. During classification, the high and sparse dimensionality of text data has also been considered. Although we propose and evaluate a text classification technique here, our main focus is on handling the multi-labelity of text data while utilizing the correlation among the multiple labels existing in the data set. Our text classification technique is called pseudo-LSC (pseudo-Label Based Subspace Clustering). It is a subspace clustering algorithm that considers the high and sparse dimensionality as well as the correlation among different class labels during the classification process to provide better performance than existing approaches. Results on three real-world multi-label data sets provide insight into how multi-labelity is handled in our classification process and show the effectiveness of our approach.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Kurdish language is regarded as one of the less-resourced languages. The language is spoken globally by 30-40 million people. It has 33 letters, largely similar to those of the Arabic alphabet. The Kurdish language has two major dialects, Sorani and Badini. The dataset includes a collection of texts written in the Sorani dialect. It contains tweets collected via the Twitter API. For security reasons, and following Twitter's policies, we removed the users' identities. We collected tweets published during the coronavirus pandemic. The tweets are raw texts, and the content covers a varied range of topics, including politics, sports, entertainment, social life, etc. Data Labeling: We used the Twitter developer API to mine the tweets. The dataset was annotated manually by three Kurdish native speakers. The annotators were required to identify the class and category of each text. The classes included positive, negative and neutral, and the categories consisted of news, technology, art, social and health. Texts for which at least two annotators agreed on a specific label and category were regarded as conflict-free and accepted for further processing. Texts that caused conflict among all three raters were removed from the dataset. The doccano program was used to help the annotators label each text one by one.
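A minimal sketch of the conflict-resolution rule described above (keep a text only when at least two of the three annotators agree); the data structures and example labels are illustrative:

```python
from collections import Counter

def resolve(labels):
    """Return the majority label if at least two of three annotators agree, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 2 else None

# Illustrative annotations: (text, [labels from three annotators])
annotations = [
    ("tweet about health measures", ["positive", "positive", "neutral"]),
    ("tweet about politics", ["positive", "negative", "neutral"]),  # full conflict
]

kept = [(text, resolve(labels)) for text, labels in annotations if resolve(labels) is not None]
print(kept)  # conflicting texts are dropped, mirroring the dataset's preparation
```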
Product Review Datasets: Uncover user sentiment
Harness the power of Product Review Datasets to understand user sentiment and insights deeply. These datasets are designed to elevate your brand and product feature analysis, help you evaluate your competitive stance, and assess investment risks.
Data sources:
Leave the data collection challenges to us and dive straight into market insights with clean, structured, and actionable data, including:
Choose from multiple data delivery options to suit your needs:
Why choose Oxylabs?
Fresh and accurate data: Access organized, structured, and comprehensive data collected by our leading web scraping professionals.
Time and resource savings: Concentrate on your core business goals while we efficiently handle the data extraction process at an affordable cost.
Adaptable solutions: Share your specific data requirements, and we'll craft a customized data collection approach to meet your objectives.
Legal compliance: Partner with a trusted leader in ethical data collection. Oxylabs is a founding member of the Ethical Web Data Collection Initiative, aligning with GDPR and CCPA standards.
Pricing Options:
Standard Datasets: choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.
Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.
Experience a seamless journey with Oxylabs:
Join the ranks of satisfied customers who appreciate our meticulous attention to detail and personalized support. Experience the power of Product Review Datasets today to uncover valuable insights and enhance decision-making.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Sentiment analysis outputs based on the combination of three classifiers for news headlines and body text covering the Olympic legacy of Rio 2016 and London 2012. Data was located via the Google search engine. The dataset is composed of sentiment labels assigned to 1,271 news articles in total.
News outlets:
Events covered by the articles:
All classifiers were applied to texts in English. Texts originally published in Portuguese by the Brazilian media were automatically translated.
Sentiment classifiers used:
Each document (spreadsheet - xlsx) refers to one outlet and one event (London 2012 or Rio 2016).
How were labels assigned to the texts?
These labels are a combination of the outputs of the three sentiment classifiers listed above. If at least two of them agreed on the same label, that label was accepted. Otherwise, the label ‘other’ was assigned.
For news article body text: the proportion of sentences of each sentiment type was used to assign labels to the whole article instead of averaging the sentence scores. For example, if the proportion of sentences with negative labels is greater than 50%, then the article is assigned a negative label.
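A minimal sketch of the two rules just described, with the 50% body-text threshold taken from the example above; how ties or lower proportions are resolved is an assumption of this sketch:

```python
from collections import Counter

def combine_headline_labels(labels):
    """Majority vote across the three classifier outputs; 'other' if no two agree."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 2 else "other"

def label_body_text(sentence_labels, threshold=0.5):
    """Assign the article label from the proportion of per-sentence labels."""
    label, count = Counter(sentence_labels).most_common(1)[0]
    return label if count / len(sentence_labels) > threshold else "other"

print(combine_headline_labels(["negative", "negative", "neutral"]))        # -> negative
print(label_body_text(["negative", "negative", "negative", "positive"]))   # -> negative
```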
The documents are composed of the following columns:
PS: Documents do not include articles' body text.
Sentiment is presented in labels as follows:
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for "Large twitter tweets sentiment analysis"
Dataset Description
Dataset Summary
This dataset is a collection of tweets formatted in a tabular data structure, annotated for sentiment analysis. Each tweet is associated with a sentiment label, with 1 indicating a Positive sentiment and 0 for a Negative sentiment.
Languages
The tweets are in English.
Dataset Structure
Data Instances
An instance of… See the full description on the dataset page: https://huggingface.co/datasets/gxb912/large-twitter-tweets-sentiment.
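A minimal loading sketch, assuming the Hugging Face `datasets` library; the dataset ID comes from the page URL above, while the split and column names would need to be checked against the dataset card:

```python
from datasets import load_dataset

ds = load_dataset("gxb912/large-twitter-tweets-sentiment")  # ID from the page URL above
print(ds)                             # inspect available splits and columns
first_split = next(iter(ds.values()))
print(first_split[0])                 # e.g., a tweet text with its 0/1 sentiment label
```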
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the data produced by running the Naive Bayes classifier algorithm. It is a list of every word in the vocabulary of the classifier, along with the number of occurrences of each word and the likelihood ratio of each word. Please note that the likelihood ratio is calculated by dividing the likelihood of a word given a positive label by the likelihood of the word given a negative label. This data is licensed under the CC BY 4.0 international license and may be used freely with credit given. This data was produced from two different datasets using a Naive Bayes classifier: the Polarity Review v2.0 dataset from Cornell and the Large Movie Review Dataset from Stanford.
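A minimal sketch of how a per-word likelihood ratio of this kind can be computed from labelled documents; the Laplace smoothing and variable names are assumptions of the sketch, not a description of how the original counts were produced:

```python
from collections import Counter

def likelihood_ratios(pos_docs, neg_docs, alpha=1.0):
    """Per-word P(word | positive) / P(word | negative), with Laplace smoothing."""
    pos_counts = Counter(w for doc in pos_docs for w in doc.split())
    neg_counts = Counter(w for doc in neg_docs for w in doc.split())
    vocab = set(pos_counts) | set(neg_counts)
    pos_total, neg_total = sum(pos_counts.values()), sum(neg_counts.values())
    ratios = {}
    for w in vocab:
        p_pos = (pos_counts[w] + alpha) / (pos_total + alpha * len(vocab))
        p_neg = (neg_counts[w] + alpha) / (neg_total + alpha * len(vocab))
        ratios[w] = p_pos / p_neg
    return ratios

# Toy example: words frequent in positive reviews get ratios above 1.
print(likelihood_ratios(["great film great cast"], ["dull plot"]))
```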
If you wish to use this data please cite:
Katarzyna Baraniak, Marcin Sydow,
A dataset for Sentiment analysis of Entities in News headlines (SEN),
Procedia Computer Science,
Volume 192,
2021,
Pages 3627-3636,
ISSN 1877-0509,
https://doi.org/10.1016/j.procs.2021.09.136.
(https://www.sciencedirect.com/science/article/pii/S1877050921018755)
bibtex: users.pja.edu.pl/~msyd/bibtex/sydow-baraniak-SENdataset-kes21.bib
SEN is a novel publicly available human-labelled dataset for training and testing machine learning algorithms for the problem of entity level sentiment analysis of political news headlines.
On-line news portals play a very important role in the information society. Fair media should present reliable and objective information. In practice there is an observable positive or negative bias concerning named entities (e.g. politicians) mentioned in the on-line news headlines.
Our dataset consists of 3819 human-labelled political news headlines coming from several major on-line media outlets in English and Polish.
Each record contains a news headline, a named entity mentioned in the headline and a human-annotated label (one of “positive”, “neutral”, “negative”). Our SEN dataset package consists of 2 parts: SEN-en (English headlines, split into SEN-en-R and SEN-en-AMT) and SEN-pl (Polish headlines). Each headline-entity pair was annotated either by a team of volunteer researchers (the whole SEN-pl dataset and a subset of 1271 English records: the SEN-en-R subset, “R” for “researchers”) or via the Amazon Mechanical Turk service (a subset of 1360 English records: the SEN-en-AMT subset).
During analysis of the annotations, outlying annotations were identified and removed. A separate version of the dataset without outliers is marked by "noutliers" in the data file name.
Details of the process of preparing the dataset and of its analysis are presented in the paper.
In case of any questions, please contact one of the authors. Email addresses are in the paper.
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
MULTI-LABEL ASRS DATASET CLASSIFICATION USING SEMI-SUPERVISED SUBSPACE CLUSTERING
MOHAMMAD SALIM AHMED, LATIFUR KHAN, NIKUNJ OZA, AND MANDAVA RAJESWARI
Abstract. There has been a lot of research targeting text classification. Much of it focuses on a particular characteristic of text data: multi-labelity. This arises from the fact that a document may be associated with multiple classes at the same time. The consequence of such a characteristic is the low performance of traditional binary or multi-class classification techniques on multi-label text data. In this paper, we propose a text classification technique that considers this characteristic and provides very good performance. Our multi-label text classification approach is an extension of our previously formulated [3] multi-class text classification approach called SISC (Semi-supervised Impurity based Subspace Clustering). We call this new classification model SISC-ML (SISC Multi-Label). Empirical evaluation on the real-world multi-label NASA ASRS (Aviation Safety Reporting System) data set reveals that our approach outperforms state-of-the-art text classification as well as subspace clustering algorithms.
This is a data set of 482,251 public tweets and retweets (Twitter IDs) posted by the #edchat online community of educators who discuss current trends in teaching with technology. The data set was collected via Twitter's Streaming API between Feb 1, 2018 and Apr 4, 2018, and was used as part of the research on developing a learning analytics dashboard for teaching and learning with Twitter. Following Twitter's terms of service, the data set only includes unique identifiers of relevant tweets. To collect the actual tweets that are part of this data set, you will need to use one of the available third party tools such as Hydrator or Twarc ("hydrate" function). As part of this release, we are also attaching an enriched version of this data set that contains sentiment and opinion analysis labels that were produced by analyzing each tweet with the help of the NLTK SentimentAnalyzer Python package. *This work was supported in part by eCampusOntario and The Social Sciences and Humanities Research Council of Canada.
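A minimal sketch of the kind of per-tweet scoring described above, assuming tweets have already been hydrated from the released IDs (e.g., with twarc); using NLTK's VADER analyzer here is an assumption of the sketch, not a reproduction of the authors' exact pipeline:

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")        # one-time download of the VADER lexicon
sia = SentimentIntensityAnalyzer()

# Placeholder text standing in for a hydrated tweet from the #edchat collection.
hydrated_tweets = ["Loving today's #edchat discussion on formative assessment!"]
for text in hydrated_tweets:
    print(sia.polarity_scores(text))  # compound score can be thresholded into labels
```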
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The peer-reviewed paper of AWARE dataset is published in ASEW 2021, and can be accessed through: http://doi.org/10.1109/ASEW52652.2021.00049. Kindly cite this paper when using AWARE dataset.
Aspect-Based Sentiment Analysis (ABSA) aims to identify the opinion (sentiment) with respect to a specific aspect. Since there is a lack of smartphone app review datasets annotated to support the ABSA task, we present AWARE: ABSA Warehouse of Apps REviews.
AWARE contains app reviews from three different domains (Productivity, Social Networking, and Games), as each domain has its own distinct functionalities and audience. Each sentence is annotated with three labels, as follows:
Aspect Term: a term that exists in the sentence and describes an aspect of the app that is expressed by the sentiment. A term value of “N/A” means that the term is not explicitly mentioned in the sentence.
Aspect Category: one of the pre-defined set of domain-specific categories that represent an aspect of the app (e.g., security, usability, etc.).
Sentiment: positive or negative.
Note: the Games domain does not contain aspect terms.
We provide a comprehensive dataset of 11,323 sentences from the three domains, where each sentence is additionally annotated with a Boolean value indicating whether the sentence expresses a positive/negative opinion. In addition, we provide three separate datasets, one for each domain, containing only sentences that express opinions. The file named “AWARE_metadata.csv” contains a description of the dataset’s columns.
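A minimal loading sketch; the metadata file name comes from the description above, while the combined-data file name and column names used here are hypothetical and should be checked against AWARE_metadata.csv:

```python
import pandas as pd

meta = pd.read_csv("AWARE_metadata.csv")    # column descriptions for the dataset
print(meta)

data = pd.read_csv("AWARE_combined.csv")    # hypothetical file name
opinions = data[data["is_opinion"]]         # hypothetical Boolean opinion column
print(opinions[["sentence", "aspect_category", "sentiment"]].head())  # hypothetical columns
```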
How can AWARE be used?
We designed AWARE so that it can serve various tasks. The tasks include, but are not limited to:
Sentiment Analysis.
Aspect Term Extraction.
Aspect Category Classification.
Aspect Sentiment Analysis.
Explicit/Implicit Aspect Term Classification.
Opinion/Not-Opinion Classification.
Furthermore, researchers can experiment with and investigate the effects of different domains on users' feedback.
https://www.archivemarketresearch.com/privacy-policy
The global Data Labeling Solution and Services market is experiencing robust growth, driven by the increasing adoption of artificial intelligence (AI) and machine learning (ML) across diverse sectors. The market, estimated at $15 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033, reaching an estimated market value of $70 billion by 2033. This significant expansion is fueled by the burgeoning need for high-quality training data to enhance the accuracy and performance of AI models. Key growth drivers include the expanding application of AI in various industries like automotive (autonomous vehicles), healthcare (medical image analysis), and financial services (fraud detection). The increasing availability of diverse data types (text, image/video, audio) further contributes to market growth. However, challenges such as the high cost of data labeling, data privacy concerns, and the need for skilled professionals to manage and execute labeling projects pose certain restraints on market expansion. Segmentation by application (automotive, government, healthcare, financial services, others) and data type (text, image/video, audio) reveals distinct growth trajectories within the market. The automotive and healthcare sectors currently dominate, but the government and financial services segments are showing promising growth potential. The competitive landscape is marked by a mix of established players and emerging startups. Companies like Amazon Mechanical Turk, Appen, and Labelbox are leading the market, leveraging their expertise in crowdsourcing, automation, and specialized data labeling solutions. However, the market shows strong potential for innovation, particularly in the development of automated data labeling tools and the expansion of services into niche areas. Regional analysis indicates strong market penetration in North America and Europe, driven by early adoption of AI technologies and robust research and development efforts. However, Asia-Pacific is expected to witness significant growth in the coming years fueled by rapid technological advancements and a rising demand for AI solutions. Further investment in R&D focused on automation, improved data security, and the development of more effective data labeling methodologies will be crucial for unlocking the full potential of this rapidly expanding market.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data includes deep learning weights of CNN and LSTM models trained with EV charging station user reviews to predict positive and negative consumer sentiment labels. To accompany the paper titled: "Real-time data from mobile platforms to evaluate sustainable transportation infrastructure."
CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/
PRDECT-ID Dataset is a collection of Indonesian product review data annotated with emotion and sentiment labels. The data were collected from Tokopedia, one of the largest e-commerce platforms in Indonesia. The dataset contains product reviews from 29 product categories on Tokopedia written in the Indonesian language. Each product review is annotated with a single emotion, i.e., love, happiness, anger, fear, or sadness. A group of annotators carried out the annotation process, providing emotion labels by following annotation criteria created by an expert in clinical psychology. Other attributes related to the product review are also extracted, such as Location, Price, Overall Rating, Number Sold, Total Review, and Customer Rating, to support further research.
Natural Language Processing, Text Processing, Consumer Emotion, Text Mining, Sentiment Analysis
Rhio Sutoyo
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains all the data used in the JCDL 2020 research paper: Aspect-based Sentiment Analysis of Scientific Reviews
The dataset is split into multiple files containing all the sentence annotations and the ICLR open review dataset (with reviews, scores, confidence scores, final recommendations, etc.) for the last three years.
The file "iclr_conf.p" is a pickle file which contains a NumPy array object. The array contains 2681 rows corresponding to each accepted or rejected paper of 2017,2018,2019 Each row contains 4 columns. The first column is the link of the paper in openreview.net, from where the data related to the paper is collected. The second column is either 0 or 1, corresponding to the final decision: rejection or acceptance respectively. The third column is the year of the conference for the particular submission. The fourth column is another NumPy array containing 3 reviews in 3 rows. Each row of this array contains 3 columns containing the list of sentences in the same sequence as it appears in the text of the review, the confidence(ranging from 1-5), and the rating(ranging(1-10)) respectively.
Each line of the file "sentences.csv" contains one sentence whose annotation is provided in the corresponding line of the file "annotations.csv". The file "annotations.csv" contains 8 comma-separated integers on each line. Each column corresponds to one of the following aspects: Appropriateness, Clarity, Originality, Empirical/Theoretical Soundness, Meaningful Comparison, Substance, Impact of Dataset/Software/Ideas, and Recommendation. The integers 0, 1, 2, and 3 correspond to the following sentiment labels of the sentence on that aspect: Absent, Positive, Negative, and Neutral, respectively.
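A minimal reading sketch following the file layout described above; the unpacking assumes the row/column structure as stated, and NumPy must be installed for the pickle to load:

```python
import pickle

# Each row of the pickled array: [openreview URL, decision (0/1), year, reviews array].
with open("iclr_conf.p", "rb") as f:
    papers = pickle.load(f)

url, decision, year, reviews = papers[0]
sentences, confidence, rating = reviews[0]          # first of the three reviews
print(url, decision, year, confidence, rating, len(sentences))

# Pair each sentence with its 8 aspect annotations
# (0 = Absent, 1 = Positive, 2 = Negative, 3 = Neutral).
with open("sentences.csv") as fs, open("annotations.csv") as fa:
    for sentence, labels in zip(fs, fa):
        aspects = [int(x) for x in labels.strip().split(",")]
        break                                       # inspect only the first pair
print(sentence.strip(), aspects)
```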
Please cite our paper published in JCDL-2020 if you use our data: https://dl.acm.org/doi/10.1145/3383583.3398541
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
We provide annotated datasets on a three-point sentiment scale (positive, neutral and negative) for Serbian, Bosnian, Macedonian, Albanian, and Estonian. For all languages except Estonian, we include pairs of source URL (where corresponding text can be found) and sentiment label.
For Estonian, we randomly sampled 100 articles from "Ekspress news article archive (in Estonian and Russian) 1.0" (http://hdl.handle.net/11356/1408).
The data is organized in Tab-Separated Values (TSV) format. For Serbian, Bosnian, Macedonian, and Albanian, the dataset contains two columns: sourceURL and sentiment. For Estonian, the dataset consists of three columns: text ID (from the CLARIN.SI reference above), body text, and sentiment label.
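A minimal reading sketch for the TSV layouts described above; the file names are illustrative, and the sketch assumes the files have no header row (drop the `names=` argument if they do):

```python
import pandas as pd

sr = pd.read_csv("serbian_sentiment.tsv", sep="\t", names=["sourceURL", "sentiment"])
et = pd.read_csv("estonian_sentiment.tsv", sep="\t", names=["text_id", "body_text", "sentiment"])

print(sr["sentiment"].value_counts())   # positive / neutral / negative distribution
print(et.head())
```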
Sentiment140 consists of Twitter messages with emoticons, which are used as noisy labels for sentiment classification. For more detailed information please refer to the paper.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Financial Sentiment Analysis’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/sbhatti/financial-sentiment-analysis on 13 February 2022.
--- Dataset description provided by original source is as follows ---
The following data is intended for advancing financial sentiment analysis research. It combines two datasets (FiQA and Financial PhraseBank) into one easy-to-use CSV file. It provides financial sentences with sentiment labels.
Malo, Pekka, et al. "Good debt or bad debt: Detecting semantic orientations in economic texts." Journal of the Association for Information Science and Technology 65.4 (2014): 782-796.
--- Original source retains full ownership of the source dataset ---