9 datasets found

Statistics of the Languages spoken in South Africa. For each language, we...
plos.figshare.com
xls
Updated Jun 5, 2025
Cite
Koena Ronny Mabokela; Mpho Primus; Turgay Celik (2025). Statistics of the Languages spoken in South Africa. For each language, we report the ISO, the African subfamily, and the prevalent countries where the language is also spoken. [Dataset]. http://doi.org/10.1371/journal.pone.0325102.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0325102.t001
Dataset updated
Jun 5, 2025
Dataset provided by
PLOS ONE
Authors
Koena Ronny Mabokela; Mpho Primus; Turgay Celik
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
South Africa, Africa
Description
Statistics of the Languages spoken in South Africa. For each language, we report the ISO, the African subfamily, and the prevalent countries where the language is also spoken.
Performance (F1 score (%)) of the fine-tuned PLMs and ensemble models on...
plos.figshare.com
xls
Updated Jun 5, 2025
Cite
Koena Ronny Mabokela; Mpho Primus; Turgay Celik (2025). Performance (F1 score (%)) of the fine-tuned PLMs and ensemble models on closely related language combinations. The average weighted F1 with 95% confidence intervals (CIs). [Dataset]. http://doi.org/10.1371/journal.pone.0325102.t010
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0325102.t010
Dataset updated
Jun 5, 2025
Dataset provided by
PLOS ONE
Authors
Koena Ronny Mabokela; Mpho Primus; Turgay Celik
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Performance (F1 score (%)) of the fine-tuned PLMs and ensemble models on closely related language combinations. The average weighted F1 with 95% confidence intervals (CIs).
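The weighted F1 metric named in this table can be illustrated with a minimal sketch. It assumes the standard definition, in which per-class F1 scores are averaged with weights proportional to each class's share of the true labels. The function name `weighted_f1` and the example labels are hypothetical and do not come from the dataset; the confidence-interval procedure is not described in this listing and is not covered here.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores (value in [0, 1])."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is > 0 because n >= 1.
        f1 = 2 * tp / (2 * tp + fp + fn)
        score += (n / total) * f1
    return score


# Hypothetical labels, for illustration only.
truth = ["positive", "positive", "negative", "neutral"]
preds = ["positive", "negative", "negative", "neutral"]
print(round(100 * weighted_f1(truth, preds), 2))  # reported as a percentage
```

Weighting by support means that frequent classes dominate the average, which matters when the sentiment classes are imbalanced.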
Data from: Mpox Narrative on Instagram: A Labeled Multilingual Dataset of...
figshare.com
xlsx
Updated Oct 12, 2024
Cite
Nirmalya Thakur (2024). Mpox Narrative on Instagram: A Labeled Multilingual Dataset of Instagram Posts on Mpox for Sentiment, Hate Speech, and Anxiety Analysis [Dataset]. http://doi.org/10.6084/m9.figshare.27072247.v1
Available download formats: xlsx
Unique identifier
https://doi.org/10.6084/m9.figshare.27072247.v1
Dataset updated
Oct 12, 2024
Dataset provided by
figshare
Authors
Nirmalya Thakur
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Please cite this paper when using this dataset: N. Thakur, “Mpox narrative on Instagram: A labeled multilingual dataset of Instagram posts on mpox for sentiment, hate speech, and anxiety analysis,” arXiv [cs.LG], 2024, URL: https://arxiv.org/abs/2409.05292
Abstract: The world is currently experiencing an outbreak of mpox, which has been declared a Public Health Emergency of International Concern by WHO. During recent virus outbreaks, social media platforms have played a crucial role in keeping the global population informed and updated regarding various aspects of the outbreaks. As a result, in the last few years, researchers from different disciplines have focused on the development of social media datasets focusing on different virus outbreaks. No prior work in this field has focused on the development of a dataset of Instagram posts about the mpox outbreak. The work presented in this paper (stated above) aims to address this research gap. It presents this multilingual dataset of 60,127 Instagram posts about mpox, published between July 23, 2022, and September 5, 2024. This dataset contains Instagram posts about mpox in 52 languages. For each of these posts, the Post ID, Post Description, Date of publication, language, and translated version of the post (translation to English was performed using the Google Translate API) are presented as separate attributes in the dataset. After developing this dataset, sentiment analysis, hate speech detection, and anxiety or stress detection were also performed. This process included classifying each post into (1) one of the fine-grain sentiment classes, i.e., fear, surprise, joy, sadness, anger, disgust, or neutral; (2) hate or not hate; and (3) anxiety/stress detected or no anxiety/stress detected. These results are presented as separate attributes in the dataset for the training and testing of machine learning algorithms for sentiment, hate speech, and anxiety or stress detection, as well as for other applications.
The 52 distinct languages in which Instagram posts are present in the dataset are English, Portuguese, Indonesian, Spanish, Korean, French, Hindi, Finnish, Turkish, Italian, German, Tamil, Urdu, Thai, Arabic, Persian, Tagalog, Dutch, Catalan, Bengali, Marathi, Malayalam, Swahili, Afrikaans, Panjabi, Gujarati, Somali, Lithuanian, Norwegian, Estonian, Swedish, Telugu, Russian, Danish, Slovak, Japanese, Kannada, Polish, Vietnamese, Hebrew, Romanian, Nepali, Czech, Modern Greek, Albanian, Croatian, Slovenian, Bulgarian, Ukrainian, Welsh, Hungarian, and Latvian.
The following is a description of the attributes present in this dataset:
Post ID: Unique ID of each Instagram post
Post Description: Complete description of each post in the language in which it was originally published
Date: Date of publication in MM/DD/YYYY format
Language: Language of the post as detected using the Google Translate API
Translated Post Description: Translated version of the post description. All posts which were not in English were translated into English using the Google Translate API. No language translation was performed for English posts.
Sentiment: Results of sentiment analysis (using the preprocessed version of the translated Post Description) where each post was classified into one of the sentiment classes: fear, surprise, joy, sadness, anger, disgust, and neutral
Hate: Results of hate speech detection (using the preprocessed version of the translated Post Description) where each post was classified as hate or not hate
Anxiety or Stress: Results of anxiety or stress detection (using the preprocessed version of the translated Post Description) where each post was classified as stress/anxiety detected or no stress/anxiety detected.
All the Instagram posts that were collected during this data mining process to develop this dataset were publicly available on Instagram and did not require a user to log in to Instagram to view the same (at the time of writing this paper).
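The attribute list in this description can be turned into a typed record, sketched below under stated assumptions. The column keys reuse the attribute names from the description, but their exact spelling in the xlsx file is an assumption, as is the lowercase spelling of the sentiment labels. The class `MpoxPost`, the function `parse_row`, and the example row are hypothetical and are not taken from the dataset.

```python
from dataclasses import dataclass
from datetime import date, datetime

# The seven sentiment classes named in the description (spelling in the file is assumed).
SENTIMENTS = {"fear", "surprise", "joy", "sadness", "anger", "disgust", "neutral"}


@dataclass
class MpoxPost:
    post_id: str
    description: str
    published: date
    language: str
    translated_description: str
    sentiment: str
    hate: str
    anxiety_or_stress: str


def parse_row(row: dict) -> MpoxPost:
    """Convert one row (column name -> cell text) into a typed record."""
    sentiment = row["Sentiment"].strip().lower()
    if sentiment not in SENTIMENTS:
        raise ValueError(f"unexpected sentiment class: {row['Sentiment']!r}")
    return MpoxPost(
        post_id=str(row["Post ID"]),
        description=row["Post Description"],
        # The description states that dates use the MM/DD/YYYY format.
        published=datetime.strptime(row["Date"], "%m/%d/%Y").date(),
        language=row["Language"],
        translated_description=row["Translated Post Description"],
        sentiment=sentiment,
        hate=row["Hate"],
        anxiety_or_stress=row["Anxiety or Stress"],
    )


# Hypothetical row, for illustration only.
example = {
    "Post ID": "example-id",
    "Post Description": "texto de ejemplo",
    "Date": "07/23/2022",
    "Language": "Spanish",
    "Translated Post Description": "example text",
    "Sentiment": "neutral",
    "Hate": "not hate",
    "Anxiety or Stress": "no stress/anxiety detected",
}
print(parse_row(example).published.isoformat())
```

Parsing the date explicitly avoids the day/month ambiguity that automatic date inference can introduce for values such as 09/05/2024.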
Tweet sentiments in different languages together with sentiment labels.
plos.figshare.com
xls
Updated Jun 5, 2025
Cite
Koena Ronny Mabokela; Mpho Primus; Turgay Celik (2025). Tweet sentiments in different languages together with sentiment labels. [Dataset]. http://doi.org/10.1371/journal.pone.0325102.t003
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0325102.t003
Dataset updated
Jun 5, 2025
Dataset provided by
PLOS ONE
Authors
Koena Ronny Mabokela; Mpho Primus; Turgay Celik
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Tweet sentiments in different languages together with sentiment labels.
Hyperparameters used for our models.
plos.figshare.com
xls
Updated Jun 5, 2025
Cite
Koena Ronny Mabokela; Mpho Primus; Turgay Celik (2025). Hyperparameters used for our models. [Dataset]. http://doi.org/10.1371/journal.pone.0325102.t008
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0325102.t008
Dataset updated
Jun 5, 2025
Dataset provided by
PLOS ONE
Authors
Koena Ronny Mabokela; Mpho Primus; Turgay Celik
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
While sentiment analysis systems excel in high-resource languages, most African languages have limited resources and remain under-represented. This gap leaves a significant portion of the world’s population without access to technologies in their native languages. However, multilingual pre-trained language models (PLMs) offer a promising approach for sentiment analysis in low-resource languages. Although the absence of large datasets in African languages poses a challenge for developing PLMs, fine-tuning and task adaptation of existing multilingual PLMs is an alternative solution. This paper explores the use of multilingual PLMs for sentiment analysis in five Southern African languages: Sepedi, Sesotho, Setswana, isiXhosa, and isiZulu. We leverage existing PLMs and fine-tune them for this specific task, avoiding training the models from scratch. Our work expands on the SAfriSenti corpus, a Twitter sentiment dataset for these languages. We employ various annotation techniques to create a labelled dataset and perform benchmark experiments utilising various multilingual PLMs. Our findings demonstrate the effectiveness of multilingual PLMs, particularly for closely related languages: for the Sotho-Tswana group, the ensemble PLM method achieved an average weighted F1 score above 63%. The closely related Nguni languages achieved an even higher average weighted F1 score, exceeding 77%, highlighting the potential of PLMs for sentiment analysis in South African languages.
Distribution of training set and test set with their sentiment classes.
plos.figshare.com
xls
Updated Jun 5, 2025
Cite
Koena Ronny Mabokela; Mpho Primus; Turgay Celik (2025). Distribution of training set and test set with their sentiment classes. [Dataset]. http://doi.org/10.1371/journal.pone.0325102.t007
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0325102.t007
Dataset updated
Jun 5, 2025
Dataset provided by
PLOS ONE
Authors
Koena Ronny Mabokela; Mpho Primus; Turgay Celik
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Distribution of training set and test set with their sentiment classes.
Performance (F1 score (%)) of individual fine-tuned and ensemble PLMs for...
plos.figshare.com
xls
Updated Jun 5, 2025
Cite
Koena Ronny Mabokela; Mpho Primus; Turgay Celik (2025). Performance (F1 score (%)) of individual fine-tuned and ensemble PLMs for sentiment analysis with confidence intervals (CIs). The average weighted F1 with 95% confidence intervals (CIs). [Dataset]. http://doi.org/10.1371/journal.pone.0325102.t009
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0325102.t009
Dataset updated
Jun 5, 2025
Dataset provided by
PLOS ONE
Authors
Koena Ronny Mabokela; Mpho Primus; Turgay Celik
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Performance (F1 score (%)) of individual fine-tuned and ensemble PLMs for sentiment analysis with confidence intervals (CIs). The average weighted F1 with 95% confidence intervals (CIs).
PLMs-available. Number of African languages and Southern African languages...
plos.figshare.com
xls
Updated Jun 5, 2025
Cite
Koena Ronny Mabokela; Mpho Primus; Turgay Celik (2025). PLMs-available. Number of African languages and Southern African languages covered in the MPLMs. [Dataset]. http://doi.org/10.1371/journal.pone.0325102.t002
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0325102.t002
Dataset updated
Jun 5, 2025
Dataset provided by
PLOS ONE
Authors
Koena Ronny Mabokela; Mpho Primus; Turgay Celik
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Africa
Description
PLMs-available. Number of African languages and Southern African languages covered in the MPLMs.
Human sentiment annotation for isiXhosa and isiZulu.
plos.figshare.com
xls
Updated Jun 5, 2025
Cite
Koena Ronny Mabokela; Mpho Primus; Turgay Celik (2025). Human sentiment annotation for isiXhosa and isiZulu. [Dataset]. http://doi.org/10.1371/journal.pone.0325102.t005
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0325102.t005
Dataset updated
Jun 5, 2025
Dataset provided by
PLOS ONE
Authors
Koena Ronny Mabokela; Mpho Primus; Turgay Celik
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Human sentiment annotation for isiXhosa and isiZulu.