Arabic Sentiment Tweets Dataset (ASTD) is an Arabic social sentiment analysis dataset gathered from Twitter. It consists of about 10,000 tweets which are classified as objective, subjective positive, subjective negative, and subjective mixed.
The dataset contains more than 67,000 Arabic reviews for sentiment analysis. The data was collected by web scraping reviews of many companies, such as talabat, kabiter, nasla, swifil, alsiwidiu, kilubatra, dumati, etc.
Columns
- Reviews: review description
- rating: 1 positive, 0 neutral, -1 negative
- Company: company name for each review
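The rating codes described above can be turned into readable labels and summarized per company. The following is a minimal sketch, not the dataset author's code: the helper names and the sample rows are hypothetical, and only the three rating codes (1, 0, -1) come from the description.

```python
from collections import Counter

# Rating codes as given in the column description.
RATING_LABELS = {1: "positive", 0: "neutral", -1: "negative"}


def rating_to_label(rating):
    """Translate a numeric rating (1, 0, -1) into its sentiment name."""
    return RATING_LABELS[int(rating)]


def label_counts_by_company(rows):
    """Count sentiment labels per company from (review, rating, company) tuples."""
    counts = {}
    for _review, rating, company in rows:
        counts.setdefault(company, Counter())[rating_to_label(rating)] += 1
    return counts


# Hypothetical rows, invented for illustration only.
sample_rows = [
    ("review text A", 1, "talabat"),
    ("review text B", -1, "talabat"),
    ("review text C", 0, "nasla"),
]
summary = label_counts_by_company(sample_rows)
```

Accepting the rating as either an integer or a string is a convenience, since scraped CSV values are usually read as text.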
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Arabic news credibility on Twitter using sentiment analysis and ensemble learning.
WHAT IS IT?
-----------
An Arabic news credibility model for Twitter using sentiment analysis and ensemble learning.
Here we include the collected dataset and the source code of the proposed model, written in Python using the Keras library with a TensorFlow backend.
Required Packages
------------------
To Run the model
---------------
One data file is required to run the model:
The dataset
---------------
CONTACTS
--------
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset consists of approximately 1.64 million Arabic tweets (shared by their IDs) posted from 2009 to 2020, with their corresponding sentiment on a three-point scale: Positive, Negative, and Neutral/Mixed. No specific locations or keywords were used during data collection, so as to obtain variation in the dialects and topics represented in the dataset. Any biases in the dataset in relation to the dialects or topics discussed were unintentional.
Please use the following citation if you use this data in a paper:
Abdaljalil, S., Hassanein, S., Mubarak, H., & Abdelali, A. (2023). Towards Generalization of Machine Learning Models: A Case Study of Arabic Sentiment Analysis. Proceedings of the International AAAI Conference on Web and Social Media, 17(1), 971-980.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains a labeled collection of approximately 50,000 social media posts in various Arabic dialects. Each post has been manually annotated with sentiment labels, providing a rich resource for natural language processing and sentiment analysis research.
UM6P College of Computing
The dataset is provided in a CSV format with the following columns:
- Post_ID: Integer
- Text: String
- Sentiment: String (Positive, Negative, Neutral)
This dataset is ideal for tasks such as:
- Training sentiment analysis models
- Studying sentiment trends in Arabic social media
- Exploring the linguistic characteristics of Arabic dialects
- Benchmarking sentiment analysis tools
Post_ID | Text | Sentiment |
---|---|---|
1 | "هذا المنتج رائع جدًا وأحببته كثيرًا" | Positive |
2 | "لم يعجبني هذا الفيلم، كان مملًا جدًا" | Negative |
3 | "الطقس اليوم عادي، لا يوجد شيء مميز" | Neutral |
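A file with the three columns described above could be read and validated roughly as follows. This is a sketch under assumptions: the real file name is not given here, so an in-memory string built from the three example rows stands in for it, and `load_posts` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter

# In-memory stand-in for the CSV file; the real file name is not stated.
SAMPLE_CSV = '''Post_ID,Text,Sentiment
1,"هذا المنتج رائع جدًا وأحببته كثيرًا",Positive
2,"لم يعجبني هذا الفيلم، كان مملًا جدًا",Negative
3,"الطقس اليوم عادي، لا يوجد شيء مميز",Neutral
'''

VALID_SENTIMENTS = {"Positive", "Negative", "Neutral"}


def load_posts(handle):
    """Parse rows into dicts, converting Post_ID to int and checking labels."""
    posts = []
    for row in csv.DictReader(handle):
        if row["Sentiment"] not in VALID_SENTIMENTS:
            raise ValueError(f"unexpected label: {row['Sentiment']!r}")
        posts.append(
            {
                "Post_ID": int(row["Post_ID"]),
                "Text": row["Text"],
                "Sentiment": row["Sentiment"],
            }
        )
    return posts


posts = load_posts(io.StringIO(SAMPLE_CSV))
distribution = Counter(post["Sentiment"] for post in posts)
```

To read an actual file, the same function can be given a handle opened with `open(path, newline="", encoding="utf-8")`; the UTF-8 encoding is an assumption about how the file was saved.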
Please refer to the dataset license included in the dataset files for information on usage rights and restrictions.
Elmehdi Boujou, Hamza Chataoui, Abdellah El Mekki, Saad Benjelloun, Ikram Chairi and Ismail Berrada. An open access NLP dataset for Arabic dialects: data collection, labeling, and model construction. MENACIS 2020 conference, in press.
Please cite: Alyami, S. N., & Olatunji, S. O. (2020). Application of Support Vector Machine for Arabic Sentiment Classification Using Twitter-Based Dataset, 19(1), 1–13. https://doi.org/10.1142/S0219649220400183
https://gitlab.com/european-language-grid/sail/sail-documents/blob/master/HENSOLDT-ANALYTICS_ELG_LICENSE.md
HENSOLDT ANALYTICS MediaMiningIndexer SED is a sentiment detection/analysis engine that reports the attitude of paragraphs of text as positive, negative, or neutral.
Arabic Datasets for research purposes
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains 15,000 Arabic tweets annotated for depression detection and includes linguistic feature augmentations to support research in natural language processing (NLP), sentiment analysis, and mental health detection. The dataset was curated to enable studies on automatic depression detection in Arabic social media and to support machine learning and deep learning approaches in the domain of computational mental health.
Contents
The dataset consists of the following columns:
- tweet: The original Arabic tweet text.
- label: Binary label indicating whether the tweet expresses signs of depression (1 = Depression, 0 = Non-depression).
- negation_flag: Indicates presence (1) or absence (0) of negation in the tweet.
- intensifier_flag: Indicates presence (1) or absence (0) of intensifiers (words that strengthen the degree of emotion).
- Class (redundant but included for convenience): Textual label corresponding to the binary label (Depression or Non-depression).
- Binary Classification: Contains the count of instances in each class (appears as an artifact in the provided file).
Key Features
- Language: Arabic (varied dialects and Modern Standard Arabic).
- Source: Publicly available tweets collected from Twitter (X).
- Annotation: Manual labeling by native Arabic speakers trained in psychology and linguistics.
- Linguistic augmentation: Flags for negation and intensifier usage are included to support linguistically informed NLP models.
Potential Use Cases
- Depression detection models for Arabic texts.
- Linguistic analysis of depression expression in Arabic social media.
- Cross-lingual studies comparing depression signals across languages.
- Development of clinical decision support systems leveraging social media data.
Licensing & Ethical Considerations
The dataset consists of public social media posts. Researchers are advised to use it strictly for research purposes, respecting privacy and ethical guidelines. No personally identifiable information (PII) is included.
Citation
If you use this dataset, please cite it appropriately in your research publications and acknowledge the creators.
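One way the negation and intensifier flags might be used is to compare how often they occur in each class. The sketch below assumes rows have already been read into dictionaries keyed by the column names listed above (`label`, `negation_flag`, `intensifier_flag`); the function name and the sample rows are hypothetical and carry no real data.

```python
from collections import Counter


def flag_rates_by_label(rows):
    """Return, for each label, the share of tweets with each flag set."""
    totals = Counter()
    negations = Counter()
    intensifiers = Counter()
    for row in rows:
        label = int(row["label"])
        totals[label] += 1
        negations[label] += int(row["negation_flag"])
        intensifiers[label] += int(row["intensifier_flag"])
    return {
        label: {
            "negation": negations[label] / totals[label],
            "intensifier": intensifiers[label] / totals[label],
        }
        for label in totals
    }


# Hypothetical rows, invented for illustration only.
sample_rows = [
    {"tweet": "...", "label": "1", "negation_flag": "1", "intensifier_flag": "1"},
    {"tweet": "...", "label": "1", "negation_flag": "0", "intensifier_flag": "1"},
    {"tweet": "...", "label": "0", "negation_flag": "0", "intensifier_flag": "0"},
    {"tweet": "...", "label": "0", "negation_flag": "0", "intensifier_flag": "1"},
]
rates = flag_rates_by_label(sample_rows)
```

The values are converted with `int` because CSV readers typically return every field as a string.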
Dataset Card for cardiffnlp/tweet_sentiment_multilingual
Dataset Summary
Tweet Sentiment Multilingual is a Twitter sentiment analysis dataset covering 8 different languages:
Arabic, English, French, German, Hindi, Italian, Portuguese, Spanish
Supported Tasks and Leaderboards
text_classification: The dataset can be trained using a SentenceClassification model from HuggingFace transformers.
Dataset Structure
Data Instances
An instance from… See the full description on the dataset page: https://huggingface.co/datasets/cardiffnlp/tweet_sentiment_multilingual.
The product opinions in the Arabsentiment dataset were collected manually from different social product resources for opinion mining, feature extraction, and sentiment analysis tasks. The collected opinions are direct opinions that include at least one product feature, whether stated explicitly or implicitly. The dataset covers twenty product categories, such as home, baby, different types of software products, and other product types. Additionally, the product features were identified manually from the customer opinions and the product descriptions. The products are classified by product type, and each type has a specific initial search query. For each product, the product name and a brief description of the product's capabilities are recorded in the products information file. Opinions were selected based on text size and the number of features that appear in the opinionated text. For each opinion, we keep track of the opinionated text and the sentiment rating score entered by the customer. The rating score represents the reviewer's overall polarity towards the product as one of two categories: positive or negative. Among the direct opinions in the dataset, each of which should include at least one explicit product feature, 1,459 have a positive sentiment score and 516 have a negative one.
Few Arabic datasets are available for classification comparison and other NLP tasks. This dataset is mainly a compilation of several available datasets and a sampling of 100k rows (99999 to be exact).
The dataset combines reviews from hotels, books, movies, products and a few airlines. It has three classes (Mixed, Negative and Positive). Most were mapped from reviewers' ratings with 3 being mixed, above 3 positive and below 3 negative. Each row has a label and text separated by a tab (tsv). Text (reviews) were cleaned by removing Arabic diacritics and non-Arabic characters. The dataset has no duplicate reviews.
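The rating-to-class mapping and the text cleaning described above can be sketched as follows. This is an illustration under assumptions, not the compiler's actual preprocessing: the exact set of diacritics removed and the definition of "non-Arabic characters" are not stated, so the Unicode ranges below are guesses, and all function names are hypothetical.

```python
import re

# Assumption: "diacritics" means the tashkeel marks U+064B to U+0652.
DIACRITICS = re.compile(r"[\u064B-\u0652]")
# Assumption: "non-Arabic" means anything outside the basic Arabic block
# (U+0600 to U+06FF) other than whitespace.
NON_ARABIC = re.compile(r"[^\u0600-\u06FF\s]")


def rating_to_class(rating):
    """Map a reviewer rating to a class: 3 is Mixed, above is Positive, below is Negative."""
    if rating > 3:
        return "Positive"
    if rating < 3:
        return "Negative"
    return "Mixed"


def clean_review(text):
    """Strip diacritics and non-Arabic characters, then collapse whitespace."""
    text = DIACRITICS.sub("", text)
    text = NON_ARABIC.sub(" ", text)
    return " ".join(text.split())


def to_tsv_line(rating, text):
    """Build one 'label<TAB>text' row in the layout described above."""
    return f"{rating_to_class(rating)}\t{clean_review(text)}"
```

Note that the basic Arabic block also contains Arabic punctuation and Arabic-Indic digits, so this sketch keeps them; a stricter filter would need a narrower character range.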
The hotels and book reviews are a subset of HARD and BRAD. The rest were selected from hadyelsahar with a little over 100 airlines reviews collected manually.
Let's jump in and use your best tools to beat the SOTA! Don't forget to show and share your work.
Arabic Sentiment Analysis Dataset
Dataset Description This dataset contains Arabic text snippets, each labeled with a sentiment polarity (positive or negative). The data appears to be intended for tasks like sentiment analysis or text classification. It is divided into separate training and testing files (train.tsv and test.tsv).
Source Files
train.tsv
test.tsv
Language
Arabic
Data Format
Tab-Separated Values (.tsv)
Each line consists of two fields separated… See the full description on the dataset page: https://huggingface.co/datasets/ImranzamanML/Arabic-Sentiments.
https://creativecommons.org/publicdomain/zero/1.0/
Arabizi is a modern variant of the Arabic language that is increasingly used by millennials. Arabizi is Arabic written in transliterated Latin characters, with numbers used to represent characters and sounds that do not exist in Latin-character languages. The proposed datasets are labelled for sentiment analysis of Lebanese Arabizi Twitter data.
The tweets were collected randomly between 2017 and 2020. All of them had the geotagging option turned on and were located in Lebanon.
Columns: Text, sentiment, highlight
They have been annotated with a minimum of 2-agreement:
- unbalanced-sentiment-arabizi-ds.csv contains all the labelled tweets with a minimum of 2-agreement.
- 2-class-sentiment-arabizi-ds.csv are labelled as positive or negative.
- 3-class-sentiment-arabizi-ds.csv are labelled as positive, negative or neutral.
Both datasets have a third column called highlight: an informative column filled when the highlight is obvious. Options are:
- Sectarianism: Prejudice, discrimination, or hatred arising from attaching relations of inferiority and superiority to differences between subdivisions within a group.
- Sexism: Prejudice, stereotyping, or discrimination, typically against women, based on sex.
- Racism: Prejudice, discrimination, or antagonism directed against someone of a different race based on the belief that one's own race is superior.
- Foul language: Coarse or offensive language: swearing, bad words, obscene words, dirty words, …
- Bullying: Seek to harm, intimidate, or coerce
- Sarcasm: The use of irony to mock or convey contempt.
- Joke: A thing that someone says to cause amusement or laughter, especially a story with a funny punchline.
- Courtesy words: A polite remark or respectful act: ‘thank you’, ‘please’, ‘excuse me’, …
- Saying: Any concisely written or spoken expression that is especially memorable because of its meaning or style. A quotation from a text or speech.
- Known fact: Something that is generally recognized as a fact or truth: that grass is green
I would like to see Sentiment analysis models tested or validated on the datasets.
2-class-sentiment-arabizi-ds.csv - model: Decision trees - Accuracy 81% - Precision 81% - Recall 81% - F1 81%
3-class-sentiment-arabizi-ds.csv - model: Logistic regression - Accuracy 65% - Precision 65% - Recall 65% - F1 65%
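The source does not say how precision, recall, and F1 were averaged across classes. Identical values for all four metrics are consistent with micro-averaging, where the three scores coincide with accuracy for single-label classification, but that is only one possible explanation. The sketch below computes micro-averaged metrics on hypothetical labels to show the effect; it does not reproduce the reported results.

```python
def micro_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall, f1) with micro-averaging.

    Expects two non-empty sequences of equal length.
    """
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    # Pool true positives, false positives and false negatives over all classes.
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1
            elif truth == label:
                fn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return accuracy, precision, recall, f1


# Hypothetical labels, invented for illustration only.
y_true = ["positive", "negative", "neutral", "positive", "negative"]
y_pred = ["positive", "negative", "positive", "negative", "negative"]
accuracy, precision, recall, f1 = micro_metrics(y_true, y_pred)
```

Every misclassified item counts once as a false positive (for the predicted class) and once as a false negative (for the true class), which is why the pooled precision and recall both reduce to the fraction of correct predictions.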
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Arabic Text Dataset contains a collection of text samples written in Arabic. It includes various forms of content, such as news articles, social media posts, literature, and dialogue, spanning different topics and writing styles. This dataset is used for tasks such as natural language processing (NLP), text classification, sentiment analysis, and machine translation in Arabic language applications.
Sentiment analysis is pivotal in Natural Language Processing for understanding opinions and emotions in text. While advancements in Sentiment analysis for English are notable, Arabic Sentiment Analysis (ASA) lags, despite the growing Arabic online user base. Existing ASA benchmarks are often outdated and lack comprehensive evaluation capabilities for state-of-the-art models. To bridge this gap, we introduce ArSen, a meticulously annotated COVID-19-themed Arabic dataset, and the IFDHN, a novel model incorporating fuzzy logic for enhanced sentiment classification. ArSen provides a contemporary, robust benchmark, and IFDHN achieves state-of-the-art performance on ASA tasks. Comprehensive evaluations demonstrate the efficacy of IFDHN using the ArSen dataset, highlighting future research directions in ASA.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Moroccan Darija offensive language detection dataset is a human-labeled dataset consisting of a set of Moroccan Darija sentences for offensive language detection. The dataset contains 20,402 sentences and their corresponding binary labels: 0 for a non-offensive sentence and 1 for an offensive sentence. The sentences were gathered from Twitter and YouTube comments and are written in both Latin and Arabic scripts. Inoffensive sentences account for 62.2% (12,685 sentences), while offensive sentences account for 37.8% (7,717 sentences). This contribution addresses the scarcity of labeled datasets for Moroccan Darija and provides a resource for natural language processing researchers interested in Moroccan Darija, particularly offensive language and sentiment analysis tasks.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Kurdish language is regarded as one of the less-resourced languages. It is spoken globally by 30-40 million people. Its alphabet has 33 letters that are largely similar to those of Arabic. Kurdish has two major dialects, Sorani and Badini. The dataset is a collection of texts written in the Sorani dialect. It contains tweets collected through the Twitter API. For security reasons and in line with Twitter's policies, we removed user identities. We collected tweets that were published during the coronavirus pandemic. The tweets are raw texts, and the content covers a varied range of topics, including politics, sports, entertainment, and social life.
Data Labeling
We used the Twitter developer platform (Twitter API) to mine the tweets. The dataset was annotated manually by three native Kurdish speakers. The annotators were required to identify the class and category of each text. The classes were positive, negative, and neutral, and the categories were news, technology, art, social, and health. Texts for which at least two annotators agreed on a specific label and category were regarded as conflict-free and accepted for further processing. Texts on which all three raters disagreed were removed from the dataset. The doccano program was used to help the annotators label each text one by one.
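The agreement rule described above can be sketched as a simple majority vote over the three annotators. This is an assumption-laden illustration: it treats the class and the category as two separate votes that must each reach two agreeing annotators, which the description does not state explicitly, and the function names are hypothetical.

```python
from collections import Counter


def majority(values, min_agreement=2):
    """Return the most common value if enough annotators chose it, else None."""
    value, count = Counter(values).most_common(1)[0]
    return value if count >= min_agreement else None


def resolve(annotations):
    """Resolve one text from a list of (sentiment_class, category) pairs.

    Returns the agreed (class, category) pair, or None when the annotators
    conflict on either field and the text should be dropped.
    """
    sentiment = majority([pair[0] for pair in annotations])
    category = majority([pair[1] for pair in annotations])
    if sentiment is None or category is None:
        return None
    return sentiment, category


# Hypothetical annotations, invented for illustration only.
agreed = resolve([("positive", "news"), ("positive", "news"), ("negative", "art")])
conflict = resolve([("positive", "news"), ("negative", "art"), ("neutral", "health")])
```

Here `agreed` holds the pair chosen by two annotators, while `conflict` is `None` because no class or category reached two votes.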
Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
This dataset was generated using two cascading stages of translation: machine translation followed by manual translation. Machine translation with Google Translate was used to translate English Amazon product reviews into Standard Arabic. The resulting Arabic reviews were then translated manually into Bahraini dialects by qualified native speakers using customized forms. The resulting parallel dataset of English, Standard Arabic, and Bahraini dialects is called English_Modern Standard Arabic_Bahraini Dialects product reviews for sentiment analysis, “E_MSA_BDs-PR-SA”. The dataset is balanced, composed of 2,500 positive and 2,500 negative reviews.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Arabic Sentiment Tweets Dataset (ASTD) is an Arabic social sentiment analysis dataset gathered from Twitter. It consists of about 10,000 tweets which are classified as objective, subjective positive, subjective negative, and subjective mixed.