50 datasets found
  1. ASTD Dataset

    • paperswithcode.com
    Updated Feb 20, 2021
    Cite
    Mahmoud Nabil; Mohamed Aly; Amir Atiya (2021). ASTD Dataset [Dataset]. https://paperswithcode.com/dataset/astd
    Explore at:
    Dataset updated
    Feb 20, 2021
    Authors
    Mahmoud Nabil; Mohamed Aly; Amir Atiya
    Description

    Arabic Sentiment Tweets Dataset (ASTD) is an Arabic social sentiment analysis dataset gathered from Twitter. It consists of about 10,000 tweets which are classified as objective, subjective positive, subjective negative, and subjective mixed.

  2. Arabic Companies Reviews For Sentiment Analysis

    • kaggle.com
    Updated May 23, 2023
    Cite
    mohamed ali salama (2023). Arabic Companies Reviews For Sentiment Analysis [Dataset]. https://www.kaggle.com/datasets/mohamedalisalama/arabic-companies-reviews-for-sentiment-analysis
    Explore at:
    Dataset updated
    May 23, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    mohamed ali salama
    Description

    Context

    The data contains 67K+ reviews in Arabic for sentiment analysis. It was collected by web scraping reviews of many companies, such as talabat, kabiter, nasla, swifil, alsiwidiu, kilubatra, dumati, etc.

    Content

    Columns

    • Reviews: review description
    • rating: 1 positive, 0 neutral, -1 negative
    • Company: the company name for each review
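
    The rating encoding described above can be mapped to readable labels as in the following minimal sketch. It assumes a comma-separated file with headers spelled Reviews, rating and Company; the actual file name and header spelling in the download are not stated here, and the sample rows are invented for illustration.

```python
import csv
import io

# Mapping taken from the column description: 1 positive, 0 neutral, -1 negative.
RATING_LABELS = {1: "positive", 0: "neutral", -1: "negative"}


def label_reviews(csv_text):
    """Return (company, label, review) tuples parsed from CSV text.

    Assumes headers named Reviews, rating, Company (hypothetical spelling).
    """
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rating = int(row["rating"])
        rows.append((row["Company"], RATING_LABELS[rating], row["Reviews"]))
    return rows


# Invented sample rows, not taken from the dataset.
SAMPLE = "Reviews,rating,Company\nخدمة ممتازة,1,talabat\nتطبيق بطيء,-1,talabat\n"
labelled = label_reviews(SAMPLE)
```

    A rating outside -1, 0, 1 raises a KeyError in this sketch, which makes unexpected values visible instead of silently mislabelling them.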

  3. Data from: Arabic news credibility on Twitter using sentiment analysis and...

    • zenodo.org
    • data.niaid.nih.gov
    csv, txt
    Updated Jun 3, 2023
    Cite
    Duha Samdani; Mounira Taileb; Nada Almani (2023). Arabic news credibility on Twitter using sentiment analysis and ensemble learning [Dataset]. http://doi.org/10.5281/zenodo.8000717
    Explore at:
    Available download formats: csv, txt
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Duha Samdani; Mounira Taileb; Nada Almani
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Arabic news credibility on Twitter using sentiment analysis and ensemble learning.

    WHAT IS IT?

    -----------

    An Arabic news credibility model for Twitter that uses sentiment analysis and ensemble learning.

    Here we include the collected dataset and the source code of the proposed model, written in Python using the Keras library with a TensorFlow backend.

    Required Packages

    ------------------

    1. Keras (https://keras.io/)
    2. Scikit-learn (http://scikit-learn.org/)
    3. Imblearn (imbalanced-learn, version 0.10.1)
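
    Before running the model, it can help to confirm that the packages listed above are importable. The sketch below is only an illustration and is not part of the published source code; the import names keras, sklearn and imblearn are the usual ones for these packages, and installed versions are not checked.

```python
import importlib.util

# Display name -> import name. "imblearn" is the import name of
# imbalanced-learn; "sklearn" is the import name of scikit-learn.
REQUIRED = {
    "Keras": "keras",
    "Scikit-learn": "sklearn",
    "imbalanced-learn": "imblearn",
}


def missing_packages(required=REQUIRED):
    """Return the display names of packages that cannot be found for import."""
    return [
        name
        for name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    print("Missing packages:", ", ".join(missing) if missing else "none")
```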

    To Run the model

    ---------------

    One data file is required to run the model:

    1. The data used is the collected dataset file; set the path to this file in the code.

    The dataset

    ---------------

    1. The dataset file contains all features; you can choose the features you need and apply them to the model.
    2. A description file describes each feature in the news credibility dataset.
    3. The file Tweet_ID contains the list of tweet IDs in the dataset.
    4. The replies, annotated for credibility, are provided.

    CONTACTS

    --------

    • If you want to report bugs or have general queries email to

  4. Towards Generalization of Machine Learning Models: An Arabic Sentiment...

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Jun 5, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Samir Abdaljalil; Shaimaa Hassanein; Hamdy Mubarak; Ahmed Abdelali (2023). Towards Generalization of Machine Learning Models: An Arabic Sentiment Analysis Dataset [Dataset]. http://doi.org/10.5281/zenodo.7801450
    Explore at:
    Available download formats: csv
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Samir Abdaljalil; Shaimaa Hassanein; Hamdy Mubarak; Ahmed Abdelali
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data set consists of approximately 1.64 Million Arabic tweets (shared by their IDs) posted from 2009 to 2020, and their corresponding sentiment using a three-point classification system of Positive, Negative and Neutral/Mixed. No specific locations and/or keywords were specified throughout the data collection to obtain variation in the dialects and topics represented within the dataset. It is important to note that any biases in the proposed dataset in relation to the dialects and/or topics discussed were unintentional.

    Please use the following citation if you use this data in a paper:

    Abdaljalil, S., Hassanein, S., Mubarak, H., & Abdelali, A. (2023). Towards Generalization of Machine Learning Models: A Case Study of Arabic Sentiment Analysis. Proceedings of the International AAAI Conference on Web and Social Media, 17(1), 971-980.

  5. Social Media Posts in Arabic Dialect

    • kaggle.com
    Updated Jul 11, 2024
    Cite
    UM6P Open Data (2024). Social Media Posts in Arabic Dialect [Dataset]. https://www.kaggle.com/datasets/um6popendata/sentiment-analysis-for-sm-posts-in-arabic-dialect
    Explore at:
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Kaggle
    Authors
    UM6P Open Data
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset: Sentiment Analysis for Social Media Posts in Arabic Dialect

    Overview

    This dataset contains a labeled collection of approximately 50,000 social media posts in various Arabic dialects. Each post has been manually annotated with sentiment labels, providing a rich resource for natural language processing and sentiment analysis research.

    Dataset Owner

    UM6P College of Computing

    Content

    • Posts: The dataset includes raw text data of social media posts written in different Arabic dialects.
    • Sentiment Labels: Each post is labeled with one of the following sentiment categories:
      • Positive
      • Negative
      • Neutral

    Features

    • Post ID: A unique identifier for each social media post.
    • Text: The content of the social media post in Arabic.
    • Sentiment: The sentiment label assigned to the post (Positive, Negative, Neutral).

    Format

    The dataset is provided in CSV format with the following columns:
    • Post_ID: Integer
    • Text: String
    • Sentiment: String (Positive, Negative, Neutral)

    Usage

    This dataset is ideal for tasks such as:
    • Training sentiment analysis models
    • Studying sentiment trends in Arabic social media
    • Exploring the linguistic characteristics of Arabic dialects
    • Benchmarking sentiment analysis tools

    Example Data

    Post_ID | Text | Sentiment
    1 | "هذا المنتج رائع جدًا وأحببته كثيرًا" | Positive
    2 | "لم يعجبني هذا الفيلم، كان مملًا جدًا" | Negative
    3 | "الطقس اليوم عادي، لا يوجد شيء مميز" | Neutral
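
    The three-column layout described above can be read and sanity-checked with a short sketch like the one below. It assumes a comma-delimited file with a header row of Post_ID, Text, Sentiment; the delimiter and quoting used in the real download are not stated here. The sample rows are the example rows from this description.

```python
import csv
import io
from collections import Counter

# Sentiment categories listed in the dataset description.
ALLOWED = {"Positive", "Negative", "Neutral"}


def read_posts(csv_text):
    """Parse CSV text into a list of dicts, validating the sentiment label."""
    posts = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["Sentiment"] not in ALLOWED:
            raise ValueError(f"unexpected label: {row['Sentiment']!r}")
        posts.append(
            {
                "Post_ID": int(row["Post_ID"]),
                "Text": row["Text"],
                "Sentiment": row["Sentiment"],
            }
        )
    return posts


SAMPLE = (
    "Post_ID,Text,Sentiment\n"
    '1,"هذا المنتج رائع جدًا وأحببته كثيرًا",Positive\n'
    '2,"لم يعجبني هذا الفيلم، كان مملًا جدًا",Negative\n'
    '3,"الطقس اليوم عادي، لا يوجد شيء مميز",Neutral\n'
)
posts = read_posts(SAMPLE)
label_counts = Counter(post["Sentiment"] for post in posts)
```

    Counting labels right after loading is a quick way to spot class imbalance before training.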

    Licensing

    Please refer to the dataset license included in the dataset files for information on usage rights and restrictions.

    Citation

    Elmehdi Boujou, Hamza Chataoui, Abdellah El Mekki, Saad Benjelloun, Ikram Chairi and Ismail Berrada. An open access NLP dataset for Arabic dialects: data collection, labeling, and model construction. MENACIS 2020 conference, in press.

  6. Arabic Sentiment Analysis Dataset SS2030 Dataset

    • kaggle.com
    Updated May 26, 2019
    Cite
    Sarah Alyami (2019). Arabic Sentiment Analysis Dataset SS2030 Dataset [Dataset]. https://www.kaggle.com/snalyami3/arabic-sentiment-analysis-dataset-ss2030-dataset/discussion
    Explore at:
    Dataset updated
    May 26, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sarah Alyami
    Description

    Please cite: Alyami, S. N., & Olatunji, S. O. (2020). Application of Support Vector Machine for Arabic Sentiment Classification Using Twitter-Based Dataset, 19(1), 1–13. https://doi.org/10.1142/S0219649220400183

  7. HENSOLDT ANALYTICS Sentiment Analysis for Arabic

    • live.european-language-grid.eu
    Updated Dec 20, 2021
    Cite
    Hensoldt Analytics (2021). HENSOLDT ANALYTICS Sentiment Analysis for Arabic [Dataset]. https://live.european-language-grid.eu/catalogue/tool-service/9464
    Explore at:
    Dataset updated
    Dec 20, 2021
    Dataset authored and provided by
    Hensoldt Analytics
    License

    https://gitlab.com/european-language-grid/sail/sail-documents/blob/master/HENSOLDT-ANALYTICS_ELG_LICENSE.md

    Description

    HENSOLDT ANALYTICS MediaMiningIndexer SED is a sentiment detection/analysis engine that reports the attitude of paragraphs of text as positive, negative, or neutral.

  8. Arabic Datasets for research purposes

    • zenodo.org
    zip
    Updated Jan 24, 2020
    Cite
    Abu Bakr Soliman (2020). Arabic Datasets for research purposes [Dataset]. http://doi.org/10.5281/zenodo.1034601
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Abu Bakr Soliman
    Description

    Arabic Datasets for research purposes

  9. Arabic Depression Tweets Dataset (15,000 Tweets) with Linguistic...

    • dataverse.harvard.edu
    Updated Jun 12, 2025
    Cite
    Abdelmoniem Helmy (2025). Arabic Depression Tweets Dataset (15,000 Tweets) with Linguistic Augmentation [Dataset]. http://doi.org/10.7910/DVN/UWLHRI
    Explore at:
    Dataset updated
    Jun 12, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Abdelmoniem Helmy
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains 15,000 Arabic tweets annotated for depression detection and includes linguistic feature augmentations to support research in natural language processing (NLP), sentiment analysis, and mental health detection. The dataset was curated to enable studies on automatic depression detection in Arabic social media and to support machine learning and deep learning approaches in the domain of computational mental health.

    Contents

    The dataset consists of the following columns:

    • tweet: The original Arabic tweet text.
    • label: Binary label indicating whether the tweet expresses signs of depression (1 = Depression, 0 = Non-depression).
    • negation_flag: Indicates presence (1) or absence (0) of negation in the tweet.
    • intensifier_flag: Indicates presence (1) or absence (0) of intensifiers (words that strengthen the degree of emotion).
    • Class (redundant but included for convenience): Textual label corresponding to the binary label (Depression or Non-depression).
    • Binary Classification: Contains the count of instances in each class (appears as an artifact in the provided file).

    Key Features

    • Language: Arabic (varied dialects and Modern Standard Arabic).
    • Source: Publicly available tweets collected from Twitter (X).
    • Annotation: Manual labeling by native Arabic speakers trained in psychology and linguistics.
    • Linguistic augmentation: Flags for negation and intensifier usage are included to support linguistically informed NLP models.

    Potential Use Cases

    • Depression detection models for Arabic texts.
    • Linguistic analysis of depression expression in Arabic social media.
    • Cross-lingual studies comparing depression signals across languages.
    • Development of clinical decision support systems leveraging social media data.

    Licensing & Ethical Considerations

    The dataset consists of public social media posts. Researchers are advised to use it strictly for research purposes, respecting privacy and ethical guidelines. No personally identifiable information (PII) is included.

    Citation

    If you use this dataset, please cite it appropriately in your research publications and acknowledge the creators.

  10. tweet_sentiment_multilingual

    • huggingface.co
    • opendatalab.com
    Updated Dec 25, 2022
    Cite
    Cardiff NLP (2022). tweet_sentiment_multilingual [Dataset]. https://huggingface.co/datasets/cardiffnlp/tweet_sentiment_multilingual
    Explore at:
    Dataset updated
    Dec 25, 2022
    Dataset authored and provided by
    Cardiff NLP
    Description

    Dataset Card for cardiffnlp/tweet_sentiment_multilingual

    Dataset Summary

    Tweet Sentiment Multilingual consists of sentiment analysis datasets on Twitter in 8 different languages:

    arabic english french german hindi italian portuguese spanish

    Supported Tasks and Leaderboards

    text_classification: The dataset can be used to train a SentenceClassification model from HuggingFace transformers.

    Dataset Structure

    Data Instances

    An instance from… See the full description on the dataset page: https://huggingface.co/datasets/cardiffnlp/tweet_sentiment_multilingual.

  11. Direct Arabic products' opinions data set for opinion mining and sentiment...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 22, 2023
    Cite
    saad, sarah (2023). Direct Arabic products' opinions data set for opinion mining and sentiment analysis [Dataset]. http://doi.org/10.7910/DVN/YTSWJ4
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    saad, sarah
    Description

    The product opinions in the Arabsentiment dataset were collected manually from different social product resources for opinion mining, feature extraction, and sentiment analysis tasks. The collected opinions include different types of direct opinions, each containing at least one product feature, whether stated explicitly or implicitly. The dataset contains twenty different product categories, such as home, baby, different types of software products, and other product types. Additionally, the product features were identified manually from the customer opinions and the product descriptions. The products are classified by product type, and there is a specific search query related to each type. For each product, the product name and a brief description of the product's capabilities are recorded in a product information file and assigned to a specific product type with a specific initial query for each type. The opinions were selected based on the text size and the number of features that appear in the opinionated text. For each opinion, we keep track of the opinionated text and the sentiment rating score entered by the customer. The rating score represents the reviewer's overall polarity towards the product as one of two categories: positive or negative sentiment. The main dataset attributes include the total number of direct opinions used in the dataset, each of which should include at least one explicit product feature; the number of opinions with a positive sentiment score is 1459, and the number with a negative sentiment score is 516.

  12. Arabic 100k Reviews

    • kaggle.com
    Updated Mar 7, 2020
    Cite
    Abed Khooli (2020). Arabic 100k Reviews [Dataset]. https://www.kaggle.com/abedkhooli/arabic-100k-reviews/kernels
    Explore at:
    Dataset updated
    Mar 7, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Abed Khooli
    Description

    Context

    Few Arabic datasets are available for classification comparison and other NLP tasks. This dataset is mainly a compilation of several available datasets and a sampling of 100k rows (99999 to be exact).

    Content

    The dataset combines reviews from hotels, books, movies, products and a few airlines. It has three classes (Mixed, Negative and Positive). Most were mapped from reviewers' ratings with 3 being mixed, above 3 positive and below 3 negative. Each row has a label and text separated by a tab (tsv). Text (reviews) were cleaned by removing Arabic diacritics and non-Arabic characters. The dataset has no duplicate reviews.
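
    The rating-to-class mapping and the cleaning step described above can be sketched as follows. This is an illustration under stated assumptions, not the author's preprocessing code: the exact character ranges treated as diacritics and as Arabic letters are guesses, and the function names are hypothetical.

```python
import re

# Assumed range for Arabic diacritics (tashkeel): U+064B..U+0652, plus U+0670.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
# Assumed definition of "non-Arabic": anything outside U+0621..U+064A
# that is not whitespace.
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")


def rating_to_label(rating):
    """Map a reviewer rating to a class: 3 is Mixed, above 3 Positive, below 3 Negative."""
    if rating > 3:
        return "Positive"
    if rating < 3:
        return "Negative"
    return "Mixed"


def clean_text(text):
    """Strip diacritics and non-Arabic characters, then collapse whitespace."""
    text = DIACRITICS.sub("", text)
    text = NON_ARABIC.sub(" ", text)
    return " ".join(text.split())


def to_tsv_row(rating, text):
    """Build one label<TAB>text row in the layout the description mentions."""
    return f"{rating_to_label(rating)}\t{clean_text(text)}"
```

    Replacing non-Arabic characters with a space, instead of deleting them, keeps neighbouring Arabic words from being glued together.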

    Acknowledgements

    The hotels and book reviews are a subset of HARD and BRAD. The rest were selected from hadyelsahar with a little over 100 airlines reviews collected manually.

    Inspiration

    Let's jump in and use your best tools to beat the SOTA! Don't forget to show and share your work.

  13. h

    Arabic-Sentiments

    • huggingface.co
    Updated Apr 27, 2025
    Cite
    Muhammad Imran Zaman (2025). Arabic-Sentiments [Dataset]. https://huggingface.co/datasets/ImranzamanML/Arabic-Sentiments
    Explore at:
    Dataset updated
    Apr 27, 2025
    Authors
    Muhammad Imran Zaman
    Description

    Arabic Sentiment Analysis Dataset

    Dataset Description

    This dataset contains Arabic text snippets, each labeled with a sentiment polarity (positive or negative). The data appears to be intended for tasks like sentiment analysis or text classification. It is divided into separate training and testing files (train.tsv and test.tsv).

    Source Files

    train.tsv
    test.tsv

    Language

    Arabic

    Data Format

    Tab-Separated Values (.tsv)
    Each line consists of two fields separated… See the full description on the dataset page: https://huggingface.co/datasets/ImranzamanML/Arabic-Sentiments.

  14. Datasets for sentiment analysis of arabizi tweets

    • kaggle.com
    Updated Jun 23, 2020
    Cite
    Maria JM Raidy (2020). Datasets for sentiment analysis of arabizi tweets [Dataset]. https://www.kaggle.com/mariajmraidy/datasets-for-sentiment-analysis-of-arabizi/discussion
    Explore at:
    Dataset updated
    Jun 23, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Maria JM Raidy
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Arabizi is a modern variant of the Arabic language that is being increasingly used by millennials. In fact, Arabizi is Arabic expressed using text that is transliterated to Latin characters while numbers are used to represent characters and sounds that do not exist in Latin-character languages. The proposed datasets are labelled for sentiment analysis of Lebanese Arabizi Twitter data.

    Content

    Tweets were collected randomly between 2017 and 2020. All of them have the geotagging option turned on and are located in Lebanon.

    Columns: Text, sentiment, highlight

    They have been annotated with a minimum of 2-agreement:
    • unbalanced-sentiment-arabizi-ds.csv contains all the labelled tweets with a minimum of 2-agreement.
    • 2-class-sentiment-arabizi-ds.csv contains tweets labelled as positive or negative.
    • 3-class-sentiment-arabizi-ds.csv contains tweets labelled as positive, negative or neutral.

    Both datasets have a third column called highlight: an informative column filled when the highlight is obvious. Options are:
    • Sectarianism: Prejudice, discrimination, or hatred arising from attaching relations of inferiority and superiority to differences between subdivisions within a group.
    • Sexism: Prejudice, stereotyping, or discrimination, typically against women, based on sex.
    • Racism: Prejudice, discrimination, or antagonism directed against someone of a different race based on the belief that one's own race is superior.
    • Foul language: Coarse or offensive language: swearing, bad words, obscene words, dirty words, …
    • Bullying: Seek to harm, intimidate, or coerce.
    • Sarcasm: The use of irony to mock or convey contempt.
    • Joke: A thing that someone says to cause amusement or laughter, especially a story with a funny punchline.
    • Courtesy words: A polite remark or respectful act: ‘thank you’, ‘please’, ‘excuse me’, …
    • Saying: Any concisely written or spoken expression that is especially memorable because of its meaning or style. A quotation from a text or speech.
    • Known fact: Something that is generally recognized as a fact or truth, for example that grass is green.

    Inspiration

    I would like to see Sentiment analysis models tested or validated on the datasets.

    Best results to date 24-05-2020

    2-class-sentiment-arabizi-ds.csv
    • Model: Decision trees
    • Accuracy 81%
    • Precision 81%
    • Recall 81%
    • F1 81%

    3-class-sentiment-arabizi-ds.csv
    • Model: Logistic regression
    • Accuracy 65%
    • Precision 65%
    • Recall 65%
    • F1 65%

  15. Arabic Text Dataset

    • shaip.com
    • tl.shaip.com
    • +1more
    json
    Updated Nov 26, 2024
    Cite
    Shaip (2024). Arabic Text Dataset [Dataset]. https://www.shaip.com/offerings/language-text-datasets/
    Explore at:
    Available download formats: json
    Dataset updated
    Nov 26, 2024
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The Arabic Text Dataset contains a collection of text samples written in Arabic. It includes various forms of content, such as news articles, social media posts, literature, and dialogue, spanning different topics and writing styles. This dataset is used for tasks such as natural language processing (NLP), text classification, sentiment analysis, and machine translation in Arabic language applications.

  16. ArSen Dataset

    • paperswithcode.com
    Updated Nov 14, 2024
    Cite
    Yang Fang; Cheng Xu; Shuhao Guan; Nan Yan; Yuke Mei (2024). ArSen Dataset [Dataset]. https://paperswithcode.com/dataset/arsen
    Explore at:
    Dataset updated
    Nov 14, 2024
    Authors
    Yang Fang; Cheng Xu; Shuhao Guan; Nan Yan; Yuke Mei
    Description

    Sentiment analysis is pivotal in Natural Language Processing for understanding opinions and emotions in text. While advancements in Sentiment analysis for English are notable, Arabic Sentiment Analysis (ASA) lags, despite the growing Arabic online user base. Existing ASA benchmarks are often outdated and lack comprehensive evaluation capabilities for state-of-the-art models. To bridge this gap, we introduce ArSen, a meticulously annotated COVID-19-themed Arabic dataset, and the IFDHN, a novel model incorporating fuzzy logic for enhanced sentiment classification. ArSen provides a contemporary, robust benchmark, and IFDHN achieves state-of-the-art performance on ASA tasks. Comprehensive evaluations demonstrate the efficacy of IFDHN using the ArSen dataset, highlighting future research directions in ASA.

  17. Moroccan Darija Offensive Language Detection Dataset

    • data.mendeley.com
    Updated Sep 20, 2023
    Cite
    Anass Ibrahimi (2023). Moroccan Darija Offensive Language Detection Dataset [Dataset]. http://doi.org/10.17632/2y4m97b7dc.1
    Explore at:
    Dataset updated
    Sep 20, 2023
    Authors
    Anass Ibrahimi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Morocco
    Description

    The Moroccan Darija offensive language detection dataset is a human-labeled dataset consisting of a set of Moroccan Darija sentences for offensive language detection. The dataset contains 20,402 sentences and their corresponding binary labels: 0 for a non-offensive sentence and 1 for an offensive sentence. The sentences were gathered from Twitter and YouTube comments and are written in both Latin and Arabic scripts. Inoffensive sentences account for 62.2% (12,685 sentences), while offensive sentences account for 37.8% (7,717 sentences). This contribution addresses the scarcity of labeled datasets for Moroccan Darija and provides a resource for natural language processing researchers interested in Moroccan Darija, particularly offensive language and sentiment analysis tasks.
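
    The class shares quoted above follow directly from the sentence counts. A quick arithmetic check, using only the numbers given in the description (the variable and function names are illustrative):

```python
# Counts as stated in the dataset description.
TOTAL_SENTENCES = 20402
NON_OFFENSIVE = 12685
OFFENSIVE = 7717


def share(count, total):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


non_offensive_share = share(NON_OFFENSIVE, TOTAL_SENTENCES)
offensive_share = share(OFFENSIVE, TOTAL_SENTENCES)
```

    The two counts sum to the stated total, and the rounded shares match the 62.2% and 37.8% reported in the description.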

  18. Data from: KurdiSent: A Corpus For Kurdish Sentiment Analysis

    • data.mendeley.com
    Updated Feb 6, 2023
    Cite
    Soran Badawi (2023). KurdiSent: A Corpus For Kurdish Sentiment Analysis [Dataset]. http://doi.org/10.17632/3yrkswy6ph.2
    Explore at:
    Dataset updated
    Feb 6, 2023
    Authors
    Soran Badawi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Kurdish language is regarded as one of the less-resourced languages. The language is spoken globally by 30-40 million people. It has 33 letters that are largely similar to those of Arabic. The Kurdish language has two major dialects, Sorani and Badini. The dataset includes a collection of texts written in the Sorani dialect. It contains tweets collected through the Twitter API. For security reasons and in line with Twitter's policies, we removed users' identities. We collected tweets that were published during the coronavirus pandemic. The tweets are raw texts, and the content covers a varied range of topics, including politics, sports, entertainment, and social life.

    Data Labeling

    We used the Twitter developer API to mine the tweets. The dataset was annotated manually by three native Kurdish speakers. The annotators were required to identify the class and category of each text. The classes were positive, negative and neutral, and the categories were news, technology, art, social and health. Texts for which at least two annotators agreed on a specific label and category were regarded as conflict-free and accepted for further processing. Texts on which all three raters disagreed were removed from the dataset. The doccano program was used to help the annotators label each text one by one.
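
    The agreement rule described above (keep a text only when at least two of the three annotators agree) can be sketched as below. This is an illustration, not the authors' tooling; the function names are hypothetical, and for brevity the sketch resolves a single field, whereas the description applies the rule to both the class and the category.

```python
from collections import Counter


def resolve_label(annotations, min_agreement=2):
    """Return the most common label if enough annotators chose it, else None."""
    label, votes = Counter(annotations).most_common(1)[0]
    return label if votes >= min_agreement else None


def filter_conflict_free(items, min_agreement=2):
    """Keep (text, label) pairs whose annotations reach the agreement threshold.

    items is an iterable of (text, [label_from_each_annotator]) pairs.
    """
    kept = []
    for text, annotations in items:
        label = resolve_label(annotations, min_agreement)
        if label is not None:
            kept.append((text, label))
    return kept
```

    With three annotators and a threshold of two, at most one label can reach the threshold, so the result is unambiguous.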

  19. Data from: Sentiment Analysis of Multilingual Dataset of Bahraini Dialects,...

    • data.mendeley.com
    Updated Feb 15, 2023
    Cite
    Thuraya Omran (2023). Sentiment Analysis of Multilingual Dataset of Bahraini Dialects, Arabic, and English [Dataset]. http://doi.org/10.17632/5rhw2srzjj.1
    Explore at:
    Dataset updated
    Feb 15, 2023
    Authors
    Thuraya Omran
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Area covered
    Bahrain
    Description

    This dataset was generated using two cascading stages of translation—a machine translation followed by a manual one. Machine translation was applied using Google translate to translate English Amazon product reviews to Standard Arabic. In contrast, the manual approach was applied to translate the resulting Arabic reviews to Bahraini ones by qualified native speakers utilizing constructed customized forms. The resulting parallel dataset of English, Standard Arabic, and Bahraini dialects is called English_Modern Standard Arabic_Bahraini Dialects product reviews for sentiment analysis “E_MSA_BDs-PR-SA”. The dataset is balanced, composed of 2,500 positive and 2,500 negative reviews.

  20. Sentiment dataset of Algerian dialect

    • data.niaid.nih.gov
    Updated Apr 7, 2024
    Cite
    Mazari, Ahmed cherif (2024). Sentiment dataset of Algerian dialect [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10937411
    Explore at:
    Dataset updated
    Apr 7, 2024
    Dataset authored and provided by
    Mazari, Ahmed cherif
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Algeria
    Description
    • This sentiment dataset of Algerian dialect consists of 11760 comments (6111 positive / 5649 negative) collected from Facebook, YouTube and Twitter during Hirak 2019.
    • Comments concern the Algerian spoken language, written in Arabic and/or Latin characters and/or Arabizi, which could be either Modern Standard Arabic, French or local dialect.
    • Value ‘1’ is attributed to a positive review and value ‘0’ to a negative review.
    • Due to the nature of this dataset, some comments contain offensive language. This does not reflect the author's values; the aim is to provide a resource to help in analysing positive and negative sentiments (which may contain harmful content).
    • For more information please contact Ahmed Cherif Mazari: mazari.ac@gmail.com