2 datasets found
  1. Data from: Five Years of COVID-19 Discourse on Instagram: A Labeled...

    • zenodo.org
    • data.niaid.nih.gov
    Updated Oct 21, 2024
    Cite
    Nirmalya Thakur, Ph.D. (2024). Five Years of COVID-19 Discourse on Instagram: A Labeled Instagram Dataset of Over Half a Million Posts for Multilingual Sentiment Analysis [Dataset]. http://doi.org/10.5281/zenodo.13896353
    Available download formats
    bin
    Dataset updated
    Oct 21, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nirmalya Thakur, Ph.D.
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Oct 6, 2024
    Description

    Please cite the following paper when using this dataset:

    N. Thakur, “Five Years of COVID-19 Discourse on Instagram: A Labeled Instagram Dataset of Over Half a Million Posts for Multilingual Sentiment Analysis”, Proceedings of the 7th International Conference on Machine Learning and Natural Language Processing (MLNLP 2024), Chengdu, China, October 18-20, 2024 (Paper accepted for publication, Preprint available at: https://arxiv.org/abs/2410.03293)

    Abstract

    The outbreak of COVID-19 served as a catalyst for content creation and dissemination on social media platforms, as such platforms serve as virtual communities where people can connect and communicate with one another seamlessly. While there have been several works related to the mining and analysis of COVID-19-related posts on social media platforms such as Twitter (or X), YouTube, Facebook, and TikTok, there is still limited research that focuses on the public discourse on Instagram in this context. Furthermore, the prior works in this field have only focused on the development and analysis of datasets of Instagram posts published during the first few months of the outbreak. The work presented in this paper aims to address this research gap and presents a novel multilingual dataset of 500,153 Instagram posts about COVID-19 published between January 2020 and September 2024. This dataset contains Instagram posts in 161 different languages. After the development of this dataset, multilingual sentiment analysis was performed using VADER and twitter-xlm-roberta-base-sentiment. This process involved classifying each post as positive, negative, or neutral. The results of sentiment analysis are presented as a separate attribute in this dataset.
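The description above does not state how model scores were turned into the three labels. As a minimal sketch only, the function below applies the cutoff of ±0.05 on the compound score that VADER's own documentation suggests; whether the dataset's author used this threshold is an assumption, and `label_from_compound` is a hypothetical helper name.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER-style compound score in [-1, 1] to a three-way label.

    The 0.05 default follows the convention in VADER's documentation;
    it is an assumption here, not a value confirmed by the dataset description.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


for score in (0.62, -0.41, 0.01):
    print(score, label_from_compound(score))
```

Keeping the threshold as a parameter makes it easy to check how sensitive the label distribution is to the cutoff.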

    For each of these posts, the Post ID, Post Description, Date of publication, language code, full version of the language, and sentiment label are presented as separate attributes in the dataset.

    The Instagram posts in this dataset span 161 languages. The 10 most frequent are English (343,041 posts), Spanish (30,220 posts), Hindi (15,832 posts), Portuguese (15,779 posts), Indonesian (11,491 posts), Tamil (9,592 posts), Arabic (9,416 posts), German (7,822 posts), Italian (5,162 posts), and Turkish (4,632 posts).
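To put these counts in proportion, the short sketch below converts them to shares of the 500,153 posts reported in the abstract. It assumes each post carries exactly one language label, which the attribute list suggests but does not state outright.

```python
TOTAL_POSTS = 500_153  # total reported in the abstract

# Post counts for the 10 most frequent languages, as listed above.
top10_language_counts = {
    "English": 343_041,
    "Spanish": 30_220,
    "Hindi": 15_832,
    "Portuguese": 15_779,
    "Indonesian": 11_491,
    "Tamil": 9_592,
    "Arabic": 9_416,
    "German": 7_822,
    "Italian": 5_162,
    "Turkish": 4_632,
}

# Share of the whole dataset held by each listed language.
shares = {lang: n / TOTAL_POSTS for lang, n in top10_language_counts.items()}
top10_total = sum(top10_language_counts.values())
remaining_posts = TOTAL_POSTS - top10_total  # posts in the other 151 languages

for lang, share in shares.items():
    print(f"{lang:<12}{share:7.2%}")
print(f"Top 10 combined: {top10_total / TOTAL_POSTS:.2%}")
print(f"Posts in all other languages: {remaining_posts}")
```

English alone accounts for roughly two thirds of the posts, so any cross-language comparison should account for very uneven sample sizes.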

    The dataset contains 535,021 distinct hashtags. The 10 most frequent are #covid19 (169,865 posts), #covid (132,485 posts), #coronavirus (117,518 posts), #covid_19 (104,069 posts), #covidtesting (95,095 posts), #coronavirusupdates (75,439 posts), #corona (39,416 posts), #healthcare (38,975 posts), #staysafe (36,740 posts), and #coronavirusoutbreak (34,567 posts).
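One way to derive per-hashtag post counts like these from the post descriptions is sketched below. The hashtag pattern (`#` followed by word characters) and the case folding are assumptions; the description does not say how hashtags were extracted or normalized, and the sample posts are invented for illustration.

```python
import re
from collections import Counter

# Assumed hashtag definition: '#' followed by letters, digits, or underscores.
HASHTAG_PATTERN = re.compile(r"#\w+")


def count_posts_per_hashtag(descriptions):
    """Count how many posts mention each hashtag (case-insensitive)."""
    counts = Counter()
    for text in descriptions:
        # A set ensures a hashtag repeated within one post is counted once.
        tags = {tag.lower() for tag in HASHTAG_PATTERN.findall(text)}
        counts.update(tags)
    return counts


# Invented sample descriptions, not taken from the dataset.
sample_descriptions = [
    "Stay home and stay safe #COVID19 #staysafe",
    "New testing site open today #covid19 #covidtesting #covid19",
    "Vaccines are here #covid_19",
]

hashtag_counts = count_posts_per_hashtag(sample_descriptions)
print(hashtag_counts.most_common(3))
```

Counting posts rather than raw occurrences matches the "(N posts)" phrasing used in the list above.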

    The following is a description of the attributes present in this dataset:

    • Post ID: Unique ID of each Instagram post
    • Post Description: Complete description of each post in the language in which it was originally published
    • Date: Date of publication in MM/DD/YYYY format
    • Language code: Language code (for example: “en”) that represents the language of the post as detected using the Google Translate API
    • Full Language: Full form of the language (for example: “English”) that represents the language of the post as detected using the Google Translate API
    • Sentiment: Results of sentiment analysis (using the preprocessed version of each post) where each post was classified as positive, negative, or neutral
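The attribute list above can be turned into a small typed reader. The sketch below assumes the data is exported as CSV with column headers spelled exactly as in the list; the listing only gives the download format as "bin", so both the file format and the header names are assumptions, as are the sample rows.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date, datetime


@dataclass
class Post:
    post_id: str
    description: str
    published: date
    language_code: str
    full_language: str
    sentiment: str


VALID_SENTIMENTS = {"positive", "negative", "neutral"}


def parse_posts(csv_file):
    """Read rows into Post objects, validating the date and sentiment fields."""
    posts = []
    for row in csv.DictReader(csv_file):
        sentiment = row["Sentiment"].strip().lower()
        if sentiment not in VALID_SENTIMENTS:
            raise ValueError(f"unexpected sentiment label: {row['Sentiment']!r}")
        posts.append(
            Post(
                post_id=row["Post ID"],
                description=row["Post Description"],
                # Dates are documented as MM/DD/YYYY.
                published=datetime.strptime(row["Date"], "%m/%d/%Y").date(),
                language_code=row["Language code"],
                full_language=row["Full Language"],
                sentiment=sentiment,
            )
        )
    return posts


# Invented sample rows using the assumed header names.
SAMPLE_CSV = """Post ID,Post Description,Date,Language code,Full Language,Sentiment
abc123,Stay safe everyone #covid19,03/15/2020,en,English,positive
def456,Quédate en casa #coronavirus,04/02/2020,es,Spanish,neutral
"""

posts = parse_posts(io.StringIO(SAMPLE_CSV))
print(posts[0].published.isoformat(), posts[0].sentiment)
```

Parsing the date explicitly with `%m/%d/%Y` avoids the day/month ambiguity that automatic date inference can introduce.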

    Open Research Questions

    This dataset is expected to be helpful for the investigation of the following research questions and even beyond:

    1. How does sentiment toward COVID-19 vary across different languages?
    2. How has public sentiment toward COVID-19 evolved from 2020 to the present?
    3. How do cultural differences affect social media discourse about COVID-19 across various languages?
    4. How has COVID-19 impacted mental health, as reflected in social media posts across different languages?
    5. How effective were public health campaigns in shifting public sentiment in different languages?
    6. What patterns of vaccine hesitancy or support are present in different languages?
    7. How did geopolitical events influence public sentiment about COVID-19 in multilingual social media discourse?
    8. What role does social media discourse play in shaping public behavior toward COVID-19 in different linguistic communities?
    9. How does the sentiment of minority or underrepresented languages compare to that of major world languages regarding COVID-19?
    10. What insights can be gained by comparing the sentiment of COVID-19 posts in widely spoken languages (e.g., English, Spanish) to those in less common languages?
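Questions 1 and 2 above both reduce to grouping posts and comparing label proportions. The sketch below shows that grouping on a handful of invented records; the record keys (`language`, `year`, `sentiment`) are illustrative names rather than the dataset's column headers.

```python
from collections import Counter, defaultdict


def sentiment_shares(records, key):
    """Return, for each value of `key`, the fraction of posts per sentiment label."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record[key]][record["sentiment"]] += 1
    shares = {}
    for group, counter in counts.items():
        total = sum(counter.values())
        shares[group] = {label: n / total for label, n in counter.items()}
    return shares


# Invented records for illustration only.
sample_records = [
    {"language": "English", "year": 2020, "sentiment": "negative"},
    {"language": "English", "year": 2020, "sentiment": "positive"},
    {"language": "English", "year": 2023, "sentiment": "neutral"},
    {"language": "Spanish", "year": 2020, "sentiment": "positive"},
]

by_language = sentiment_shares(sample_records, "language")  # question 1
by_year = sentiment_shares(sample_records, "year")          # question 2
print(by_language)
print(by_year)
```

Comparing proportions instead of raw counts keeps low-volume languages and years comparable with high-volume ones.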

    All Instagram posts collected for this dataset were publicly available on Instagram and could be viewed without logging in (at the time of writing this paper).

  2. Facebook: quarterly number of MAU (monthly active users) worldwide 2008-2023...

    • statista.com
    Updated May 21, 2024
    Cite
    Statista (2024). Facebook: quarterly number of MAU (monthly active users) worldwide 2008-2023 [Dataset]. https://www.statista.com/statistics/264810/number-of-monthly-active-facebook-users-worldwide/
    Dataset updated
    May 21, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    Worldwide
    Description

    With roughly three billion monthly active users as of the second quarter of 2023, Facebook is the most used online social network worldwide. The platform surpassed two billion active users in the second quarter of 2017, taking just over 13 years to reach this milestone. In comparison, Meta-owned Instagram took 11.2 years, and Google’s YouTube took just over 14 years to achieve this landmark. As of January 2022, Facebook’s leading audience base was in India, with almost 330 million users whilst the United States ranked second with an approximate total of 179 million users. The platform also finds remarkable popularity in Indonesia and Brazil.

    Social Media usage in the United States

    In January 2021, Facebook was the platform on which users in the United States spent the most time per day. The average time spent on Facebook was 33 minutes, followed by TikTok with 32 minutes and Twitter with 31 daily minutes. Due to the COVID-19 outbreak in 2020, all major social media platforms saw an increase in daily usage, which then either plateaued or decreased in 2021. At the end of 2021, over a quarter of all Facebook users in the United States belonged to the 25 to 34 year age group and 18.2 percent of users were in the 35 to 44 year age group. In general, Facebook users were more likely to be female.

    Meta Platforms

    Meta is Facebook’s recently renamed parent company and had a grand total of 3.59 billion core product users by the final quarter of 2021. Other Meta products include Instagram, Facebook Messenger, WhatsApp and Oculus – Meta’s virtual reality subsidiary which produces VR headsets. In 2021, Meta's revenue amounted to 117 billion US dollars, up from around 86 billion U.S. dollars in the previous financial year.

