Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The 1998 Lancet paper by Wakefield et al., despite subsequent retraction and evidence indicating no causal link between vaccinations and autism, triggered significant parental concern. The aim of this study was to analyze the online information available on this topic. Using localized versions of Google, we searched “autism vaccine” in English, French, Italian, Portuguese, Mandarin, and Arabic and analyzed 200 websites for each search engine result page (SERP). A common feature was the newsworthiness of the topic, with news outlets representing 25–50% of the SERP, followed by unaffiliated websites (blogs, social media) that represented 27–41% and included most of the vaccine-negative websites. Between 12 and 24% of websites had a negative stance on vaccines, while most websites were pro-vaccine (43–70%). However, their ranking by Google varied. While on Google.com the first vaccine-negative website was 43rd in the SERP, there was one vaccine-negative webpage in the top 10 websites in the British, Australian, and French localized versions, and two in the Italian, Portuguese, and Mandarin versions, suggesting that the information quality algorithm used by Google may work better in English. Many webpages mentioned celebrities in the context of the link between vaccines and autism, with Donald Trump mentioned most frequently. Few websites (1–5%) promoted complementary and alternative medicine (CAM), but 50–100% of these were also vaccine-negative, suggesting that CAM users are more exposed to vaccine-negative information. This analysis highlights the need for monitoring the web for information affecting vaccine uptake.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please cite the following paper when using this dataset:
N. Thakur, “Five Years of COVID-19 Discourse on Instagram: A Labeled Instagram Dataset of Over Half a Million Posts for Multilingual Sentiment Analysis”, Proceedings of the 7th International Conference on Machine Learning and Natural Language Processing (MLNLP 2024), Chengdu, China, October 18-20, 2024 (Paper accepted for publication, Preprint available at: https://arxiv.org/abs/2410.03293)
Abstract
The outbreak of COVID-19 served as a catalyst for content creation and dissemination on social media platforms, as such platforms serve as virtual communities where people can connect and communicate with one another seamlessly. While there have been several works related to the mining and analysis of COVID-19-related posts on social media platforms such as Twitter (or X), YouTube, Facebook, and TikTok, there is still limited research that focuses on the public discourse on Instagram in this context. Furthermore, the prior works in this field have only focused on the development and analysis of datasets of Instagram posts published during the first few months of the outbreak. The work presented in this paper aims to address this research gap and presents a novel multilingual dataset of 500,153 Instagram posts about COVID-19 published between January 2020 and September 2024. This dataset contains Instagram posts in 161 different languages. After the development of this dataset, multilingual sentiment analysis was performed using VADER and twitter-xlm-roberta-base-sentiment. This process involved classifying each post as positive, negative, or neutral. The results of sentiment analysis are presented as a separate attribute in this dataset.
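The three-way classification described above can be illustrated with a minimal sketch. It assumes the conventional VADER rule of thresholding the compound score at ±0.05; the abstract does not state which thresholds were used for this dataset, so the `threshold` value and the function name `label_from_compound` are illustrative assumptions only.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER-style compound score in [-1, 1] to a sentiment label.

    The 0.05 cut-off is the commonly cited VADER convention; it is an
    assumption here, not a value confirmed by the dataset description.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    # Hypothetical compound scores, used only to show the mapping.
    for score in (0.62, -0.31, 0.0):
        print(score, label_from_compound(score))
```

In practice the compound score would come from a sentiment analyzer applied to each post description; the sketch covers only the final labeling step.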
For each of these posts, the Post ID, Post Description, Date of publication, language code, full version of the language, and sentiment label are presented as separate attributes in the dataset.
The Instagram posts in this dataset are written in 161 different languages. The 10 most frequent languages are English (343,041 posts), Spanish (30,220 posts), Hindi (15,832 posts), Portuguese (15,779 posts), Indonesian (11,491 posts), Tamil (9,592 posts), Arabic (9,416 posts), German (7,822 posts), Italian (5,162 posts), and Turkish (4,632 posts).
There are 535,021 distinct hashtags in this dataset. The 10 most frequent hashtags are #covid19 (169,865 posts), #covid (132,485 posts), #coronavirus (117,518 posts), #covid_19 (104,069 posts), #covidtesting (95,095 posts), #coronavirusupdates (75,439 posts), #corona (39,416 posts), #healthcare (38,975 posts), #staysafe (36,740 posts), and #coronavirusoutbreak (34,567 posts).
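Per-hashtag post counts of this kind could be derived from the post descriptions roughly as follows. This is a sketch under stated assumptions, not the author's procedure: the regular expression, the case-insensitive matching, and the function name `hashtag_post_counts` are all illustrative choices, and the sample descriptions are invented.

```python
import re
from collections import Counter

# Assumption: a hashtag is '#' followed by word characters. In Python 3,
# \w is Unicode-aware, so hashtags in non-Latin scripts also match.
HASHTAG_RE = re.compile(r"#(\w+)")


def hashtag_post_counts(descriptions):
    """Count, for each hashtag, the number of posts that contain it.

    A hashtag repeated within one post is counted once for that post,
    and matching is case-insensitive (both are assumptions).
    """
    counts = Counter()
    for text in descriptions:
        tags = {tag.lower() for tag in HASHTAG_RE.findall(text)}
        counts.update(tags)
    return counts


if __name__ == "__main__":
    # Invented sample descriptions, for illustration only.
    sample = [
        "Stay home #COVID19 #staysafe",
        "#covid19 #covid19 daily update",
        "No hashtags in this post",
    ]
    for tag, n in hashtag_post_counts(sample).most_common(10):
        print(f"#{tag}: {n} posts")
```

Counting each hashtag once per post matches the "posts" unit used in the figures above; counting raw occurrences instead would inflate tags that are repeated within a single description.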
The following is a description of the attributes present in this dataset
Open Research Questions
This dataset is expected to be helpful for the investigation of the following research questions and even beyond:
All the Instagram posts collected to develop this dataset were publicly available on Instagram and did not require a user to log in to Instagram to view them (at the time of writing this paper).
mC4 is a multilingual variant of the C4 dataset. It comprises natural text in 101 languages drawn from the public Common Crawl web scrape.