ODC Public Domain Dedication and Licence (PDDL) v1.0 http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
This dataset contains posts from 28 subreddits (15 mental health support groups) from 2018-2020. We used this dataset to understand the impact of COVID-19 on mental health support groups from January to April, 2020 and included older timeframes to obtain baseline posts before COVID-19.
Please cite if you use this dataset:
Low, D. M., Rumker, L., Torous, J., Cecchi, G., Ghosh, S. S., & Talkar, T. (2020). Natural Language Processing Reveals Vulnerable Mental Health Support Groups and Heightened Health Anxiety on Reddit During COVID-19: Observational Study. Journal of medical Internet research, 22(10), e22635.
@article{low2020natural, title={Natural Language Processing Reveals Vulnerable Mental Health Support Groups and Heightened Health Anxiety on Reddit During COVID-19: Observational Study}, author={Low, Daniel M and Rumker, Laurie and Torous, John and Cecchi, Guillermo and Ghosh, Satrajit S and Talkar, Tanya}, journal={Journal of medical Internet research}, volume={22}, number={10}, pages={e22635}, year={2020}, publisher={JMIR Publications Inc., Toronto, Canada} }
License
This dataset is made available under the Public Domain Dedication and License v1.0 whose full text can be found at: http://www.opendatacommons.org/licenses/pddl/1.0/
It was downloaded using the Pushshift API. Re-use of this data is subject to Reddit API terms.
Reddit Mental Health Dataset
Contains posts and text features from 28 mental health and non-mental health subreddits. Filenames indicate the following timeframes:
- post: Jan 1 to April 20, 2020 (called "mid-pandemic" in manuscript; r/COVID19_support appears). Unique users: 320,364.
- pre: Dec 2018 to Dec 2019. A full year which provides more data for a baseline of Reddit posts. Unique users: 327,289.
- 2019: Jan 1 to April 20, 2019 (r/EDAnonymous appears). A control for seasonal fluctuations to match post data. Unique users: 282,560.
- 2018: Jan 1 to April 20, 2018. A control for seasonal fluctuations to match post data. Unique users: 177,089.
Unique users across all time windows (pre and 2019 overlap): 826,961.
See manuscript Supplementary Materials (https://doi.org/10.31234/osf.io/xvwcy) for more information.
Note: if subsampling (e.g., to balance subreddits), we recommend bootstrapping analyses for unbiased results.
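The bootstrapping recommendation above could be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' procedure: the function name, the (subreddit, value) tuple layout, and the toy numbers are all hypothetical.

```python
import random
from collections import defaultdict
from statistics import mean


def balanced_bootstrap_means(posts, n_per_group, n_boot=200, seed=0):
    """Draw n_boot balanced resamples (with replacement) and return,
    for each resample, the mean value per subreddit."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for subreddit, value in posts:
        by_group[subreddit].append(value)

    replicates = []
    for _ in range(n_boot):
        # Every subreddit contributes exactly n_per_group draws,
        # so large subreddits do not dominate any single replicate.
        replicates.append({
            group: mean(rng.choices(values, k=n_per_group))
            for group, values in by_group.items()
        })
    return replicates


# Toy data: (subreddit, some numeric text feature). Values are made up.
posts = [
    ("anxiety", 0.2), ("anxiety", 0.4), ("anxiety", 0.9),
    ("depression", 0.1), ("depression", 0.3),
]
reps = balanced_bootstrap_means(posts, n_per_group=2, n_boot=100)
```

Repeating the analysis over many resamples, rather than over one fixed subsample, lets the spread of the replicate estimates indicate how sensitive a result is to which posts happened to be drawn.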
https://creativecommons.org/publicdomain/zero/1.0/
Citation
Rani, S.; Ahmed, K.; Subramani, S. From Posts to Knowledge: Annotating a Pandemic-Era Reddit Dataset to Navigate Mental Health Narratives. Appl. Sci. 2024, 14, 1547. https://doi.org/10.3390/app14041547
RMHD Our dataset, meticulously curated from Reddit, encompasses a comprehensive collection of posts from five key subreddits focused on mental health: r/anxiety, r/depression, r/mentalhealth, r/suicidewatch, and r/lonely. These subreddits were chosen for their rich, focused discussions on mental health issues, making them invaluable for research in this area.
The dataset spans from January 2019 through August 2022 and is systematically structured into folders by year. Within each yearly folder, the data is further segmented into monthly batches. Each month's data is compiled into five separate CSV files, corresponding to the selected subreddits.
Structure of Part A: Raw Data
Each CSV file in our dataset includes the following columns, providing a detailed view of the Reddit posts along with essential metadata:
- Author: The username of the Reddit post's author.
- Created_utc: The UTC timestamp of when the post was created.
- Score: The net score (upvotes minus downvotes) of the post.
- Selftext: The main text content of the post.
- Subreddit: The subreddit from which the post was sourced.
- Title: The title of the Reddit post.
- Timestamp: The local date and time when the post was created, converted from the UTC timestamp.
This structured approach allows researchers to conduct detailed, time-based analyses and to easily access data from specific subreddits.
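As a rough illustration of how the Created_utc and Timestamp columns relate, the sketch below derives a readable timestamp from an epoch value. It assumes Created_utc is stored as Unix epoch seconds and uses UTC rather than a local zone, since the description does not say which zone was used; the sample row is invented.

```python
import csv
import io
from datetime import datetime, timezone

# Invented sample row using the column names listed above.
SAMPLE = """Author,Created_utc,Score,Selftext,Subreddit,Title
user_a,1546300800,3,example text,anxiety,example title
"""


def add_timestamp(rows):
    """Yield each row with a Timestamp derived from Created_utc.

    Assumption: Created_utc holds Unix epoch seconds. UTC is used here
    because the dataset's local time zone is not documented.
    """
    for row in rows:
        created = datetime.fromtimestamp(int(row["Created_utc"]), tz=timezone.utc)
        row["Timestamp"] = created.isoformat()
        yield row


rows = list(add_timestamp(csv.DictReader(io.StringIO(SAMPLE))))
```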
Structure of Part B: Labelled Data
Part B of our dataset, which includes a subset of 800 manually annotated posts, is structured differently to provide focused insights into the mental health discussions. The columns in Part B are as follows:
- Score: The net score (upvotes minus downvotes) of the post.
- Selftext: The main text content of the post.
- Subreddit: The subreddit from which the post was sourced.
- Title: The title of the Reddit post.
- Label: The assigned label indicating the identified root cause of mental health issues, based on our annotation process. The labels are: Drug and Alcohol, Early Life, Personality, Trauma and Stress.
This annotation process brings additional depth to the dataset, allowing researchers to explore the underlying factors contributing to mental health issues.
The dataset, with a zipped size of approximately 1.68GB, is publicly available and serves as a rich resource for researchers interested in exploring the root causes of mental health issues as represented in social media discussions, particularly within the diverse conversations found on Reddit.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a customized and re-labeled version of the original Jigsaw Toxic Comment Classification Challenge dataset.
Instead of toxic behavior categories, the comments are now annotated with depression severity levels, aiming to support mental health research and AI-based early detection of psychological distress.
Label Categories: Each comment has been carefully annotated into one of the following classes:
- psychotic_depression
- severe_depression
- moderate_depression
- mild_depression
- toxic_depression
- major_depression
These labels help transform the original problem into a multi-class depression severity classification task.
Project Contributors:
- Muhammad Mugees Asif, Lead Annotator & AI Researcher
- Dr. Arfan Ali Nagra, Computational Intelligence Expert
- Sana Asif, Mental Health Research Support & Dataset Coordination
This dataset was created with the intention to help data scientists, researchers, and students work on AI solutions for mental health support.
Acknowledgement: The original dataset was sourced from the Jigsaw Toxic Comment Classification Challenge hosted on Kaggle. Full credit to the creators of the original dataset. This re-labeled version is shared for educational and research purposes only.
https://qdr.syr.edu/policies/qdr-restricted-access-conditions
Project Summary
This dataset contains all qualitative and quantitative data collected in the first phase of the Pandemic Journaling Project (PJP). PJP is a combined journaling platform and interdisciplinary, mixed-methods research study developed by two anthropologists, with support from a team of colleagues and students across the social sciences, humanities, and health fields. PJP launched in Spring 2020 as the COVID-19 pandemic was emerging in the United States. PJP was created in order to "pre-design an archive" of COVID-19 narratives and experiences open to anyone around the world. The project is rooted in a commitment to democratizing knowledge production, in the spirit of "archival activism" and using methods of "grassroots collaborative ethnography" (Willen et al. 2022; Wurtz et al. 2022; Zhang et al. 2020; see also Carney 2021). The motto on the PJP website encapsulates these commitments: "Usually, history is written only by the powerful. When the history of COVID-19 is written, let's make sure that doesn't happen." (A version of this Project Summary with links to the PJP website and other relevant sites is included in the public documentation of the project at QDR.) In PJP's first phase (PJP-1), the project provided a digital space where participants could create weekly journals of their COVID-19 experiences using a smartphone or computer. The platform was designed to be accessible to as wide a range of potential participants as possible. Anyone aged 15 or older, living anywhere in the world, could create journal entries using their choice of text, images, and/or audio recordings. The interface was accessible in English and Spanish, but participants could submit text and audio in any language. PJP-1 ran on a weekly basis from May 2020 to May 2022.
Data Overview
This Qualitative Data Repository (QDR) project contains all journal entries and closed-ended survey responses submitted during PJP-1, along with accompanying descriptive and explanatory materials.
The dataset includes individual journal entries and accompanying quantitative survey responses from more than 1,800 participants in 55 countries. Of nearly 27,000 journal entries in total, over 2,700 included images and over 300 are audio files. All data were collected via the Qualtrics survey platform. PJP-1 was approved as a research study by the Institutional Review Board (IRB) at the University of Connecticut. Participants were introduced to the project in a variety of ways, including through the PJP website as well as professional networks, PJP's social media accounts (on Facebook, Instagram, and Twitter), and media coverage of the project. Participants provided a single piece of contact information (an email address or mobile phone number), which was used to distribute weekly invitations to participate. This contact information has been stripped from the dataset and will not be accessible to researchers. PJP uses a mixed-methods research approach and a dynamic cohort design. After enrolling in PJP-1 via the project's website, participants received weekly invitations to contribute to their journals via their choice of email or SMS (text message). Each weekly invitation included a link to that week's journaling prompts and accompanying survey questions. Participants could join at any point, and they could stop participating at any point as well. They also could stop participating and later restart. Retention was encouraged with a monthly raffle of three $100 gift cards. All individuals who had contributed that month were eligible. Regardless of when they joined, all participants received the project's narrative prompts and accompanying survey questions in the same order. In Week 1, before contributing their first journal entries, participants were presented with a baseline survey that collected demographic information, including political leanings, as well as self-reported data about COVID-19 exposure and physical and mental health status.
Some of these survey questions were repeated at periodic intervals in subsequent weeks, providing quantitative measures of change over time that can be analyzed in conjunction with participants' qualitative entries. Surveys employed validated questions where possible. The core of PJP-1 involved two weekly opportunities to create journal entries in the format of their choice (text, image, and/or audio). Each week, journalers received a link with an invitation to create one entry in response to a recurring narrative prompt ("How has the COVID-19 pandemic affected your life in the past week?") and a second journal entry in response to their choice of two more tightly focused prompts. Typically the pair of prompts included one focusing on subjective experience (e.g., the impact of the pandemic on relationships, sense of social connectedness, or mental health) and another with an external focus (e.g., key sources of scientific information, trust in government, or COVID-19's economic impact). Each week,...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is structured as a graph, where nodes represent users and edges capture their interactions, including tweets, retweets, replies, and mentions. Each node provides detailed user attributes, such as unique ID, follower and following counts, and verification status, offering insights into each user's identity, role, and influence in the mental health discourse. The edges illustrate user interactions, highlighting engagement patterns and types of content that drive responses, such as tweet impressions. This interconnected structure enables sentiment analysis and public reaction studies, allowing researchers to explore engagement trends and identify the mental health topics that resonate most with users.
The dataset consists of three files:
1. Edges Data: Contains graph data essential for social network analysis, including fields for UserID (Source), UserID (Destination), Post/Tweet ID, and Date of Relationship. This file enables analysis of user connections without including tweet content, maintaining compliance with Twitter/X's data-sharing policies.
2. Nodes Data: Offers user-specific details relevant to network analysis, including UserID, Account Creation Date, Follower and Following counts, Verified Status, and Date Joined Twitter. This file allows researchers to examine user behavior (e.g., identifying influential users or spam-like accounts) without direct reference to tweet content.
3. Twitter/X Content Data: This file contains only the raw tweet text as a single-column dataset, without associated user identifiers or metadata. By isolating the text, we ensure alignment with anonymization standards observed in similar published datasets, safeguarding user privacy in compliance with Twitter/X's data guidelines. This content is crucial for addressing the research focus on mental health discourse in social media. (References to prior Data in Brief publications involving Twitter/X data informed the dataset's structure.)
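To make the edge-list structure concrete, the following sketch counts how often each user is the destination of an interaction, one simple proxy for influence. The tuple layout mirrors the Edges Data fields named above, but the user IDs, tweet IDs, dates, and function name are invented for illustration.

```python
from collections import Counter

# Hypothetical edge rows: (source user, destination user, tweet ID, date).
edges = [
    ("u1", "u2", "t1", "2023-01-01"),
    ("u3", "u2", "t2", "2023-01-02"),
    ("u2", "u1", "t3", "2023-01-02"),
]


def in_degree(edge_rows):
    """Count incoming interactions per destination user."""
    return Counter(dst for _src, dst, _post, _date in edge_rows)


degree = in_degree(edges)
# The user who receives the most interactions in this toy graph.
most_mentioned = degree.most_common(1)[0]
```

Joining these counts with the follower counts in the Nodes Data file would be one way to compare received engagement against audience size.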
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is derived from a mental health and depression survey, containing 1,998 cleaned responses with demographic, lifestyle, behavioral, and psychological features. The primary objective of the dataset is to support mental health classification tasks, particularly in predicting different types of depression such as job-related, family-related, or love-related depression.
The dataset includes information on age, gender, education, employment status, symptoms, lifestyle habits (sleep, eating, social media usage), coping strategies, and mental health support availability. Missing values, duplicates, and irrelevant text fields have been carefully preprocessed to ensure high quality and usability.
Researchers, students, and practitioners can use this dataset for:
Multi-class classification tasks (predicting depression types)
Exploratory data analysis on mental health patterns
Feature importance and explainability studies (e.g., LIME/SHAP)
Developing early detection models for mental health support
By contributing this dataset, the goal is to encourage data-driven approaches to mental health awareness, prevention, and support systems.
We collected this dataset from mental health-related subreddits on https://www.reddit.com/ to further the study of mental disorders and suicidal ideation. We name this dataset the Reddit SuicideWatch and Mental Health Collection, or SWMH for short, where discussions comprise suicide-related intention and mental disorders like depression, anxiety, and bipolar. We used the Reddit official API and developed a web spider to collect the targeted forums. This collection contains a total of 54,412 posts. Specific subreddits are listed in Table 4 of the paper below, along with the number and percentage of posts collected in the train-val-test split.
This dataset is only for research. Please request with your institutional email.
If you use this dataset, please cite the paper as:
Ji, S., Li, X., Huang, Z. et al. Suicidal ideation and mental disorder detection with attentive relation networks. Neural Comput & Applic (2021). https://doi.org/10.1007/s00521-021-06208-y
@article{ji2021suicidal, title={Suicidal ideation and mental disorder detection with attentive relation networks}, author={Ji, Shaoxiong and Li, Xue and Huang, Zi and Cambria, Erik}, journal={Neural Computing and Applications}, year={2021}, publisher={Springer} }
The National Database for Clinical Trials Related to Mental Illness (NDCT) is an extensible informatics platform for relevant data at all levels of biological and behavioral organization (molecules, genes, neural tissue, behavioral, social and environmental interactions) and for all data types (text, numeric, image, time series, etc.) related to clinical trials funded by the National Institute of Mental Health. Sharing data, associated tools, methodologies and results, rather than just summaries or interpretations, accelerates research progress. Community-wide sharing requires common data definitions and standards, as well as comprehensive and coherent informatics approaches for the sharing of de-identified human subject research data. Built on the National Database for Autism Research (NDAR) informatics platform, NDCT provides a comprehensive data sharing platform for NIMH grantees supporting clinical trials.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes three distinct subsets of text:
- Open Access Academic Articles: A collection of 100 open-access articles from various academic journals focused on mental health and psychiatry published between 2016-2018. The articles are selected from reputable journals including JAMA, The Lancet Psychiatry, WPJ, and AM J Psy.
- ChatGPT-Generated Texts: Discussion section samples generated by ChatGPT (GPT-4 model, version as of August 3, 2023, OpenAI) that are designed to imitate the style and content of academic articles in the field of mental health and psychiatry.
- Claude-Generated Texts: Discussion section samples generated by Claude (Version 2, Anthropic) with the aim of imitating academic articles in the same field.
Additionally, the dataset contains the results of tests performed using ZeroGPT and Originality.AI to evaluate the AI texts vs the academic articles for the percentage of texts identified as being AI-generated.
Please cite this dataset if you make use of it in your research.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This data is from Open Source Mental Illness (OSMI), using survey data from the years 2014, 2016, 2017, 2018 and 2019. Each survey measures attitudes towards mental health and the frequency of mental health disorders in the tech workplace.
The raw data was processed using Python, SQL and Excel for cleaning and manipulation.
Steps involved in cleaning were:
- Similar questions were grouped together
- Values for answers were made consistent (i.e. 1 == 1.0)
- Spelling errors were fixed
The SQLite database contains 3 tables: Survey, Question, and Answer.
Survey (PRIMARY KEY INT SurveyID, TEXT Description)
Question (PRIMARY KEY QuestionID, TEXT QuestionText)
Answer (PRIMARY/FOREIGN KEY SurveyID, PRIMARY KEY UserID, PRIMARY/FOREIGN KEY QuestionID, TEXT AnswerText)
SurveyID values are simply the survey year, i.e. 2014, 2016, 2017, 2018, 2019. The same question can be used for multiple surveys. The Answer table is a composite table with multiple primary keys. SurveyID and QuestionID are FOREIGN KEYS. Some questions can contain multiple answers, thus the same user can appear more than once for that QuestionID.
SELECT * FROM Question WHERE QuestionID = 13;
SELECT AnswerText FROM Answer WHERE QuestionID = 13;
SELECT AnswerText, COUNT(AnswerText) FROM Answer WHERE QuestionID = 13 GROUP BY AnswerText;
SELECT AnswerText, COUNT(AnswerText) FROM Answer WHERE QuestionID = 1 AND SurveyID = 2016 GROUP BY AnswerText;
SELECT SurveyID, COUNT(DISTINCT UserID) FROM Answer GROUP BY SurveyID;
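The queries above can be tried without the real database by building a small in-memory stand-in. The sketch below uses Python's standard sqlite3 module; the table and column names follow the schema described above, while every inserted row is a made-up placeholder. One assumption to note: because the description says a user can appear more than once for the same question, this sketch leaves the Answer table without a strict primary key constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Survey (SurveyID INTEGER PRIMARY KEY, Description TEXT);
CREATE TABLE Question (QuestionID INTEGER PRIMARY KEY, QuestionText TEXT);
-- Assumption: no PRIMARY KEY on Answer, so multi-answer rows are allowed.
CREATE TABLE Answer (
    AnswerText TEXT,
    SurveyID INTEGER REFERENCES Survey(SurveyID),
    UserID INTEGER,
    QuestionID INTEGER REFERENCES Question(QuestionID)
);
""")

# Placeholder rows, not taken from the real OSMI data.
conn.executemany("INSERT INTO Survey VALUES (?, ?)",
                 [(2014, "placeholder 2014"), (2016, "placeholder 2016")])
conn.execute("INSERT INTO Question VALUES (1, 'placeholder question text')")
conn.executemany("INSERT INTO Answer VALUES (?, ?, ?, ?)", [
    ("yes", 2014, 1, 1),
    ("no", 2014, 2, 1),
    ("maybe", 2014, 2, 1),   # user 2 gives a second answer to question 1
    ("yes", 2016, 3, 1),
])

# Distinct respondents per survey year.
users_per_survey = conn.execute(
    "SELECT SurveyID, COUNT(DISTINCT UserID) FROM Answer "
    "GROUP BY SurveyID ORDER BY SurveyID"
).fetchall()

# Answer distribution for one question in one survey year.
answer_counts = conn.execute(
    "SELECT AnswerText, COUNT(AnswerText) FROM Answer "
    "WHERE QuestionID = ? AND SurveyID = ? "
    "GROUP BY AnswerText ORDER BY AnswerText",
    (1, 2014),
).fetchall()
```

Counting DISTINCT UserID matters here because multi-answer questions would otherwise inflate the respondent count.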
The original data set can be found at Open Source Mental Illness (OSMI), where it can be downloaded and viewed. This project was inspired here.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The relationship between physical activity and mental health, especially depression, is one of the most studied topics in the field of exercise science and kinesiology. Although there is strong consensus that regular physical activity improves mental health and reduces depressive symptoms, some debate the mechanisms involved in this relationship as well as the limitations and definitions used in such studies. Meta-analyses and systematic reviews continue to examine the strength of the association between physical activity and depressive symptoms for the purpose of improving exercise prescription as treatment or combined treatment for depression. This dataset covers 27 review articles (either systematic review, meta-analysis, or both) and 365 primary study articles addressing the relationship between physical activity and depressive symptoms. Primary study articles are manually extracted from the review articles. We used a custom-made workflow (Fu, Yuanxi. (2022). Scopus author info tool (1.0.1) [Python]. https://github.com/infoqualitylab/Scopus_author_info_collection) that uses the Scopus API and manual work to extract and disambiguate authorship information for the 392 reports. The author information file (author_list.csv) is the product of this workflow and can be used to compute the co-author network of the 392 articles. This dataset can be used to construct the inclusion network and the co-author network of the 27 review articles and 365 primary study articles. A primary study article is "included" in a review article if it is considered in the review article's evidence synthesis. Each included primary study article is cited in the review article, but not all references cited in a review article are included in the evidence synthesis or primary study articles. The inclusion network is a bipartite network with two types of nodes: one type represents review articles, and the other represents primary study articles.
In an inclusion network, if a review article includes a primary study article, there is a directed edge from the review article node to the primary study article node. The attribute file (article_list.csv) includes attributes of the 392 articles, and the edge list file (inclusion_net_edges.csv) contains the edge list of the inclusion network. Collectively, this dataset reflects the evidence production and use patterns within the exercise science and kinesiology scientific community, investigating the relationship between physical activity and depressive symptoms.
FILE FORMATS
1. article_list.csv - Unicode CSV
2. author_list.csv - Unicode CSV
3. Chinese_author_name_reference.csv - Unicode CSV
4. inclusion_net_edges.csv - Unicode CSV
5. review_article_details.csv - Unicode CSV
6. supplementary_reference_list.pdf - PDF
7. README.txt - text file
8. systematic_review_inclusion_criteria.csv - Unicode CSV
UPDATES IN THIS VERSION COMPARED TO V3 (Clarke, Caitlin; Lischwe Mueller, Natalie; Joshi, Manasi Ballal; Fu, Yuanxi; Schneider, Jodi (2023): The Inclusion Network of 27 Review Articles Published between 2013-2018 Investigating the Relationship Between Physical Activity and Depressive Symptoms. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4614455_V3)
- We added a new file systematic_review_inclusion_criteria.csv.
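The bipartite inclusion network described above can be represented as a mapping from each review article to the set of primary studies it includes. The sketch below is a minimal illustration: the column names review_id and primary_study_id and the sample rows are assumptions, since the actual header of inclusion_net_edges.csv is not given here.

```python
import csv
import io
from collections import defaultdict

# Hypothetical edge list; the real file's column names may differ.
SAMPLE_EDGES = """review_id,primary_study_id
R1,P1
R1,P2
R2,P2
"""


def load_inclusion_network(handle):
    """Map each review article to the set of primary studies it includes."""
    included = defaultdict(set)
    for row in csv.DictReader(handle):
        # Directed edge: review article -> included primary study.
        included[row["review_id"]].add(row["primary_study_id"])
    return included


def shared_studies(included, review_a, review_b):
    """Primary studies included by both reviews."""
    return included[review_a] & included[review_b]


net = load_inclusion_network(io.StringIO(SAMPLE_EDGES))
overlap = shared_studies(net, "R1", "R2")
```

Overlap between the study sets of two reviews is one simple way to see how much of their evidence base is shared.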
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset, "Mental Health MCQs - Diagnostic & Preventive", presents a structured collection of multiple-choice questions sourced from AI Kosha, an AI-driven platform focused on mental wellness. It includes over 150 high-quality MCQs covering key mental health topics such as anxiety, depression, and general psychological well-being.
The dataset is divided into two clear categories, Preventive and Diagnostic, allowing users to explore the educational and clinical sides of mental health simultaneously. Each entry has been organized for direct usability in quiz systems, LLM training, chatbot testing, or research applications.
This dataset can be leveraged in several data-driven mental health initiatives. Potential applications include:
Columns for Preventive questions:
- id: Unique identifier for each question.
- topic: Area of focus (e.g., Stress, Anxiety).
- type: Always labeled as "Preventive".
- question: The full MCQ question.
- option1 to option4: Four answer options.
- correct_option: Text of the correct answer.
- correct_option_number: Index (1-4) indicating the correct option.
Columns for Diagnostic questions:
- id: Unique identifier.
- valid_question: Boolean flag indicating question validity.
- topic: Mental health condition or category.
- type: Always labeled as "Diagnostic".
- question: The diagnostic MCQ.
- option1 to option4: Available choices.
- correct_answer: Correct answer (text).
- correct_option_number: Index (1-4) for the correct choice.
This dataset was compiled using content provided by AI Kosha and is shared for educational and research purposes only. No personal or sensitive data is included. The content has been formatted to support ethical use in mental health tech development, NLP research, and public wellness education.
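Given the column layout described above, a basic consistency check is to confirm that correct_option_number points at the option whose text matches the stated correct answer. The sketch below is an illustration only: the function name and the sample row are invented, and it uses the Preventive column name correct_option (the Diagnostic questions use correct_answer instead, selectable via the answer_field parameter).

```python
def check_mcq(row, answer_field="correct_option"):
    """Return True when correct_option_number indexes the option whose
    text equals the stated correct answer."""
    index = int(row["correct_option_number"])
    if not 1 <= index <= 4:
        return False
    return row[f"option{index}"] == row[answer_field]


# Invented sample row; not taken from the dataset.
good = {
    "id": "1", "topic": "Stress", "type": "Preventive",
    "question": "placeholder question",
    "option1": "A", "option2": "B", "option3": "C", "option4": "D",
    "correct_option": "C", "correct_option_number": "3",
}
# Same row with an index that no longer matches the answer text.
bad = dict(good, correct_option_number="2")

good_ok = check_mcq(good)
bad_ok = check_mcq(bad)
```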
Special thanks to AI Kosha for making the base resource available. This dataset represents a step toward making mental health data more accessible for responsible and impactful innovation.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This work focuses on the efficiency of the knowledge distillation approach in generating a lightweight yet powerful BERT-based model for natural language processing (NLP) applications. After the model creation, we applied the resulting model, LastBERT, to a real-world task: classifying severity levels of Attention Deficit Hyperactivity Disorder (ADHD)-related concerns from social media text data. With LastBERT, a customized student BERT model, we lowered the parameter count from 110 million (BERT base) to 29 million, resulting in a model approximately 73.64% smaller. On the General Language Understanding Evaluation (GLUE) benchmark, comprising paraphrase identification, sentiment analysis, and text classification, the student model maintained strong performance across many tasks despite this reduction. The model was also used on a real-world ADHD dataset with an accuracy of 85%, F1 score of 85%, precision of 85%, and recall of 85%. When compared to DistilBERT (66 million parameters) and ClinicalBERT (110 million parameters), LastBERT demonstrated comparable performance, with DistilBERT slightly outperforming it at 87%, and ClinicalBERT achieving 86% across the same metrics. These findings highlight the LastBERT model's capacity to classify degrees of ADHD severity, so it offers a useful tool for mental health professionals to assess and comprehend material produced by users on social networking platforms. The study emphasizes the potential of knowledge distillation to produce effective models fit for use in resource-limited conditions, hence advancing NLP and mental health diagnosis. The considerable decrease in model size without appreciable performance loss also lowers the computational resources needed for training and deployment, facilitating greater applicability, especially with readily available computational tools like Google Colab and Kaggle Notebooks.
This study shows the accessibility and usefulness of advanced NLP methods in pragmatic world applications.
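The reported size reduction can be checked with a line of arithmetic from the two parameter counts given above. The variable names below are illustrative only.

```python
# Parameter counts as stated in the description above.
teacher_params = 110_000_000   # BERT base
student_params = 29_000_000    # LastBERT

# Fraction of parameters removed by distillation.
reduction = 1 - student_params / teacher_params
reduction_pct = round(reduction * 100, 2)
```

With these inputs, reduction_pct comes out at 73.64, matching the figure quoted in the description.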
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a **Telugu-translated version** of the first 50,000 rows from the original [English Suicide Detection Dataset on Kaggle](https://www.kaggle.com/datasets/nikhileswarkomati/suicide-watch).
The purpose of this dataset is to support research in **mental health detection**, especially focusing on **Telugu-language content classification**.
The dataset is translated using the IndicTrans2 translation model.
Each entry contains a text post and a classification label indicating whether the content is **suicidal** or **non-suicidal**.
Interactions on social media have the potential to help us to understand human behaviour, including the development of both good and poor mental health. However, to do the best science we need to know as much as possible about the people who are participating in our research. The CLOSER group of UK longitudinal cohorts include people who have contributed their data to research since birth. By inviting participants in these cohorts to also allow us to derive information from their social media feeds, we will be able to relate this information to gold-standard measures of the behaviours we are trying to understand and to world-class data on other aspects of life. To work out the best way to do this, our project will engage with participants in the Children of the '90s cohort to find out what is acceptable to them in terms of collecting and using their interactions on social media. We will use what we have learnt to develop software that collects and codes social media data in a way that protects the anonymity of participants by scoring Tweets without making the text available to researchers. We will share this software with other CLOSER cohorts to make it easy for them to invite participants to contribute their Twitter data in a safe and secure way. The high-resolution data collected in this way will help us to understand human behaviour and how mental health changes over time. Collecting these data in well-known groups of people will also give scientists the information they need to improve the quality of all research using social media.
We are demonstrating collection, anonymisation and analysis of social media data from consenting participants in the Avon Longitudinal Study of Parents and Children. Initially we are studying Twitter use, and gathering data through the platform's API. Our software gathers social media posts and interactions from participants every few days, with datasets being stored under security ISO 27001 certification. Derived, depersonalised datasets can be made available to approved researchers, and we aim to provide a means to evaluate sentiment analysis methods against ground truth data.
https://qdr.syr.edu/policies/qdr-standard-access-conditions
Project Overview
This project focused on mapping how the general narrative about mental health is evolving, as we are witnessing a change in attitudes, perspectives and discourses on this topic. The study therefore examined which master narratives on mental health exist and captured the changing emphasis in these narratives over time.
Data and Data Collection Overview
A triangulation approach was used to gather data on evolving narratives via a scoping review, semi-structured interviews, and a review of a podcast series consisting of interviews with changemakers in Dutch mental healthcare. For the scoping review, search terms relevant to the research question and areas of interest for this study were generated and used to search SCOPUS, WorldCat and PubMed on August 4, 2023. A total of 759 articles (excluding duplicates) were found initially. The titles and abstracts were further screened based on the inclusion criteria of full-text availability, describing narratives on mental health (i.e., discourses, attitudes, perspectives or other expressions of narratives on mental health), being written in Dutch or English, and being published after 2018 to gather contemporary views. The final number of articles analyzed was 33. Eleven semi-structured interviews were conducted at the University Medical Centre Utrecht (UMCU). The interviewees had varying roles in the mental healthcare system. Nine of them were recruited at the UMCU and two were recruited from a broader professional network. Interviews took 60-90 minutes and started with verbal informed consent. To broaden understanding and to ensure that professional perspectives were not only UMCU-specific, all 11 episodes of the podcast "Hoe de GGZ verandert" (How mental healthcare is changing; https://podcastluisteren.nl/pod/Hoe-de-GGZ-verandert?refresh=1#/), created in 2020, were also included in the analysis. Data were extracted from the episodes by relistening, noting down timestamps of relevant fragments, and transcribing these portions verbatim. The podcast and interview texts are in the original Dutch and were coded in English by the bilingual researchers without translation. Most of the literature analyzed was originally published in English. Thus, the data excerpts included are in both languages, depending on the source. The texts from all three data sources were analyzed in the same manner, using inductive thematic analysis and selective coding. Open coding of the full-text articles, podcast interviews and semi-structured interview transcripts was performed to identify text units that refer to master narratives on mental health. Axial coding was conducted by iteratively clustering codes into themes. All text units resulting from the thematic analysis were also deductively coded with respect to the timeframe they represented (past, present, future).
Selection and Organization of Shared Data
The data file shared here consists of a combined spreadsheet with coded excerpts from all three data sources (selected scholarly literature, individual interview transcripts and podcast transcript). The excerpts are organized by source type in separate tabs of the spreadsheet, as well as combined by theme in a separate tab (labeled "Coding"). There is an additional tab (labeled "Codes"), which lists all four master narratives, their constituent themes and the topics which were identified as part of each theme, alongside frequencies for each of them across all analyzed texts. The documentation files shared consist of the questionnaire used for the individual interviews, this Data Narrative and an administrative README file.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please cite the following paper when using this dataset:
N. Thakur, "Five Years of COVID-19 Discourse on Instagram: A Labeled Instagram Dataset of Over Half a Million Posts for Multilingual Sentiment Analysis", Proceedings of the 7th International Conference on Machine Learning and Natural Language Processing (MLNLP 2024), Chengdu, China, October 18-20, 2024 (Paper accepted for publication, Preprint available at: https://arxiv.org/abs/2410.03293)
Abstract
The outbreak of COVID-19 served as a catalyst for content creation and dissemination on social media platforms, as such platforms serve as virtual communities where people can connect and communicate with one another seamlessly. While there have been several works related to the mining and analysis of COVID-19-related posts on social media platforms such as Twitter (or X), YouTube, Facebook, and TikTok, there is still limited research that focuses on the public discourse on Instagram in this context. Furthermore, the prior works in this field have only focused on the development and analysis of datasets of Instagram posts published during the first few months of the outbreak. The work presented in this paper aims to address this research gap and presents a novel multilingual dataset of 500,153 Instagram posts about COVID-19 published between January 2020 and September 2024. This dataset contains Instagram posts in 161 different languages. After the development of this dataset, multilingual sentiment analysis was performed using VADER and twitter-xlm-roberta-base-sentiment. This process involved classifying each post as positive, negative, or neutral. The results of sentiment analysis are presented as a separate attribute in this dataset.
For each of these posts, the Post ID, Post Description, Date of publication, language code, full version of the language, and sentiment label are presented as separate attributes in the dataset.
The Instagram posts in this dataset are in 161 different languages, of which the top 10 by frequency are English (343041 posts), Spanish (30220 posts), Hindi (15832 posts), Portuguese (15779 posts), Indonesian (11491 posts), Tamil (9592 posts), Arabic (9416 posts), German (7822 posts), Italian (5162 posts), and Turkish (4632 posts).
There are 535,021 distinct hashtags in this dataset, with the top 10 by frequency being #covid19 (169865 posts), #covid (132485 posts), #coronavirus (117518 posts), #covid_19 (104069 posts), #covidtesting (95095 posts), #coronavirusupdates (75439 posts), #corona (39416 posts), #healthcare (38975 posts), #staysafe (36740 posts), and #coronavirusoutbreak (34567 posts).
The following is a description of the attributes present in this dataset
Open Research Questions
This dataset is expected to be helpful for the investigation of the following research questions and even beyond:
All the Instagram posts collected during this data mining process were publicly available on Instagram and did not require a user to log in to Instagram to view them (at the time of writing this paper).
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset consists of 74 patients aged 4-51 years in whom Cortico-Cortical Evoked Potentials (CCEPs) were measured with Electro-CorticoGraphy (ECoG) during single pulse electrical stimulation. For a detailed description see:
This dataset is part of the RESPect (Registry for Epilepsy Surgery Patients) database, a dataset recorded at the University Medical Center of Utrecht, the Netherlands. The study was approved by the Medical Ethical Committee from the UMC Utrecht.
These data are organized according to the Brain Imaging Data Structure (BIDS) specification, a community-driven specification for organizing neurophysiology data along with its metadata. For more information on this data specification, see https://bids-specification.readthedocs.io/en/stable/
Each patient has their own folder (e.g., sub-ccepAgeUMCU01 to sub-ccepAgeUMCU74) which contains the iEEG recordings data for that patient, as well as the metadata needed to understand the raw data and event timing.
Data are logically grouped in the same BIDS session and stored across runs that indicate the day and time point of recording during the monitoring period. If extra electrodes were added or removed during this period, the session was divided into different sessions (e.g. ses-1a and ses-1b). We use the optional run key-value pair to specify the day and the start time of the recording (e.g. run-021315, day 2 after implantation, which is day 1 of the monitoring period, at 13:15). The task key-value pair in long-term iEEG recordings describes the patient's state during the recording of this file. The task label is "SPESclin" since these files contain data collected during clinical single pulse electrical stimulation (SPES).
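As an illustration of the naming scheme described above, the following minimal Python sketch splits a BIDS-style filename into its key-value pairs and decodes the run label. It assumes the run label is always six digits in the order day, hour, minute (inferred from the single example run-021315); the filename, its extension and the function names are hypothetical and not taken from the dataset.

```python
import re
from typing import NamedTuple


class RunInfo(NamedTuple):
    day: int     # day after implantation
    hour: int    # start hour of the recording (24-hour clock)
    minute: int  # start minute of the recording


def parse_bids_entities(filename: str) -> dict:
    """Split a BIDS-style filename into its key-value entities."""
    stem = filename.split(".")[0]
    entities = {}
    for part in stem.split("_"):
        # Parts without a hyphen (e.g. the trailing suffix) are skipped.
        if "-" in part:
            key, value = part.split("-", 1)
            entities[key] = value
    return entities


def decode_run(run_label: str) -> RunInfo:
    """Decode a run label, assumed here to follow the pattern DDHHMM."""
    match = re.fullmatch(r"(\d{2})(\d{2})(\d{2})", run_label)
    if match is None:
        raise ValueError(f"unexpected run label: {run_label!r}")
    day, hour, minute = (int(group) for group in match.groups())
    if hour > 23 or minute > 59:
        raise ValueError(f"invalid time in run label: {run_label!r}")
    return RunInfo(day, hour, minute)


# Hypothetical filename assembled from the labels mentioned in the text.
name = "sub-ccepAgeUMCU01_ses-1a_task-SPESclin_run-021315_ieeg.eeg"
entities = parse_bids_entities(name)
run = decode_run(entities["run"])
print(entities["sub"], entities["ses"], entities["task"], run)
```

If some recordings use a different run format, decode_run raises a ValueError rather than guessing, so such files can be inspected by hand.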
Electrode positions include Destrieux atlas labels that were estimated by running Freesurfer on the individual subject MRI scan and taking the most common surface label within a sphere around the electrode. All shared electrode positions were then converted to MNI152 space using the Freesurfer surface based non-linear transformation. We note that this surface based transformation distorts the dimensions of the grids, but maintains the gyral anatomy.
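The labelling rule described above (the most common surface label within a sphere around the electrode) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' Freesurfer-based pipeline: the function name, the toy coordinates and the 3 mm radius are hypothetical, since the radius actually used is not given here.

```python
import math
from collections import Counter


def label_electrode(electrode_xyz, vertices_xyz, vertex_labels, radius_mm):
    """Return the most common label among surface vertices that lie within
    radius_mm of the electrode, or None if no vertex is in range."""
    nearby = [
        label
        for xyz, label in zip(vertices_xyz, vertex_labels)
        if math.dist(electrode_xyz, xyz) <= radius_mm
    ]
    if not nearby:
        return None
    return Counter(nearby).most_common(1)[0][0]


# Toy surface: four vertices with atlas-style labels (illustrative values).
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (10.0, 0.0, 0.0)]
labels = ["G_precentral", "G_precentral", "S_central", "G_postcentral"]

# Hypothetical electrode position and search radius.
label = label_electrode((0.2, 0.2, 0.0), vertices, labels, radius_mm=3.0)
print(label)
```

Returning None when no vertex falls inside the sphere keeps unlabelled electrodes explicit instead of assigning them the nearest label silently.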
This dataset is made available under the CC0 1.0 Universal Public Domain Dedication, whose full text can be found at https://creativecommons.org/publicdomain/zero/1.0/. We hope that all users will follow the ODC Attribution/Share-Alike Community Norms (http://www.opendatacommons.org/norms/odc-by-sa/); in particular, while not legally required, we hope that all users of the data will acknowledge it by citing the following in any publication: Developmental trajectory of transmission speed in the human brain, D. van Blooijs, M.A. van den Boom, J.F. van der Aar, G.J.M. Huiskamp, G. Castegnaro, M. Demuru, W.J.E.M. Zweiphenning, P. van Eijsden, K. J. Miller, F.S.S. Leijten, D. Hermes, Nature Neuroscience, 2023, https://doi.org/10.1038/s41593-023-01272-0
Code to analyse these data is available at: https://github.com/MultimodalNeuroimagingLab/mnl_ccepAge
We thank the SEIN-UMCU RESPect database group (C.J.J. van Asch, L. van de Berg, S. Blok, M.D. Bourez, K.P.J. Braun, J.W. Dankbaar, C.H. Ferrier, T.A. Gebbink, P.H. Gosselaar, R. van Griethuysen, M.G.G. Hobbelink, F.W.A. Hoefnagels, N.E.C. van Klink, M.A. van 't Klooster, G.A.P. deKort, M.H.M. Mantione, A. Muhlebner, J.M. Ophorst, P.C. van Rijen, S.M.A. van der Salm, E.V. Schaft, M.M.J. van Schooneveld, H. Smeding, D. Sun, A. Velders, M.J.E. van Zandvoort, G.J.M. Zijlmans, E. Zuidhoek and J. Zwemmer) for their contributions and help in collecting the data, and G. Ojeda Valencia for proofreading the manuscript.
Research reported in this publication was supported by the National Institute of Mental Health of the National Institutes of Health under Award Number R01MH122258 (DH, FSSL, the content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health), the EpilepsieNL under Award Number NEF17-07 (DvB) and the UMC Utrecht Alexandre Suerman MD/PhD Stipendium 2015 (WZ).
Transcripts of qualitative interviews with young adults and professional online content producers exploring young people's engagement with health information online and on social media. The aim of this study was to explore the multiple ways young adults engage with health-related content online and develop an understanding of how social media are used for health information and communication. A further aim was to explore the areas of convergence and divergence between professional producers' perspectives on online resources about diabetes and CMHDs and prospective users' perspectives. Production and consumption of text, image and video content about both diabetes and common mental health disorders (CMHDs), by individuals and organisations, has become commonplace since the widespread adoption of social media. Despite the increasing importance of these online spaces for health-related discussion, few studies have fully explored people's experiences of drawing on social media content around either diabetes or CMHDs. The key findings of the study reflect the increasing prominence of health-related user-generated content online. While continued reliance on search engines for locating relevant content was evident, some participants discussed accessing health-related content as part of their everyday social media activity. Further, participants' perceptions and experiences of support from family, friends and formal health services appeared to relate to their online practices: those who described the least supportive resources offline discussed engaging most actively in production and consumption of health-related user-generated content. Participants also discussed what limited their production of health-related content, suggesting that production of content related to diabetes or CMHDs could compromise their presentation of self online. Disjunctures were evident between the perspectives of producers and potential users, with producers prioritising dissemination of generic information and young adults emphasising the consumption of tailored content. The findings of the study suggest key opportunities for exploiting the potential of social media to engage with users but highlight potential barriers to some individuals' engagement.
Interviews. Forty young adults, aged between 18 and 30 years, and six professional producers took part in semi-structured interviews.
The National Child Development Study (NCDS) is a continuing longitudinal study that seeks to follow the lives of all those living in Great Britain who were born in one particular week in 1958. The aim of the study is to improve understanding of the factors affecting human development over the whole lifespan.
The NCDS has its origins in the Perinatal Mortality Survey (PMS) (the original PMS study is held at the UK Data Archive under SN 2137). This study was sponsored by the National Birthday Trust Fund and designed to examine the social and obstetric factors associated with stillbirth and death in early infancy among the 17,000 children born in England, Scotland and Wales in that one week. Selected data from the PMS form NCDS sweep 0, held alongside NCDS sweeps 1-3, under SN 5565.
Survey and Biomeasures Data (GN 33004):
To date there have been ten attempts to trace all members of the birth cohort in order to monitor their physical, educational and social development. The first three sweeps were carried out by the National Children's Bureau, in 1965, when respondents were aged 7, in 1969, aged 11, and in 1974, aged 16 (these sweeps form NCDS1-3, held together with NCDS0 under SN 5565). The fourth sweep, also carried out by the National Children's Bureau, was conducted in 1981, when respondents were aged 23 (held under SN 5566). In 1985 the NCDS moved to the Social Statistics Research Unit (SSRU) - now known as the Centre for Longitudinal Studies (CLS). The fifth sweep was carried out in 1991, when respondents were aged 33 (held under SN 5567). For the sixth sweep, conducted in 1999-2000, when respondents were aged 42 (NCDS6, held under SN 5578), fieldwork was combined with the 1999-2000 wave of the 1970 Birth Cohort Study (BCS70), which was also conducted by CLS (and held under GN 33229). The seventh sweep was conducted in 2004-2005 when the respondents were aged 46 (held under SN 5579), the eighth sweep was conducted in 2008-2009 when respondents were aged 50 (held under SN 6137), the ninth sweep was conducted in 2013 when respondents were aged 55 (held under SN 7669), and the tenth sweep was conducted in 2020-24 when the respondents were aged 60-64 (held under SN 9412).
A Secure Access version of the NCDS is available under SN 9413, containing detailed sensitive variables not available under Safeguarded access (currently only sweep 10 data). Variables include uncommon health conditions (including age at diagnosis), full employment codes and income/finance details, and specific life circumstances (e.g. pregnancy details, year/age of emigration from GB).
Four separate datasets covering responses to NCDS over all sweeps are available. National Child Development Deaths Dataset: Special Licence Access (SN 7717) covers deaths; National Child Development Study Response and Outcomes Dataset (SN 5560) covers all other responses and outcomes; National Child Development Study: Partnership Histories (SN 6940) includes data on live-in relationships; and National Child Development Study: Activity Histories (SN 6942) covers work and non-work activities. Users are advised to order these studies alongside the other waves of NCDS.
From 2002-2004, a Biomedical Survey was completed and is available under End User Licence (EUL) (SN 8731) and Special Licence (SL) (SN 5594). Proteomics analyses of blood samples are available under SL SN 9254.
Linked Geographical Data (GN 33497):
A number of geographical variables are available, under more restrictive access conditions, which can be linked to the NCDS EUL and SL access studies.
Linked Administrative Data (GN 33396):
A number of linked administrative datasets are available, under more restrictive access conditions, which can be linked to the NCDS EUL and SL access studies. These include a Deaths dataset (SN 7717) available under SL and the Linked Health Administrative Datasets (SN 8697) available under Secure Access.
Multi-omics Data and Risk Scores Data (GN 33592):
Proteomics analyses were run on the blood samples collected from NCDS participants in 2002-2004 and are available under SL SN 9254. Metabolomics analyses were conducted on respondents of sweep 10 and are available under SL SN 9411.
Additional Sub-Studies (GN 33562):
In addition to the main NCDS sweeps, further studies have also been conducted on a range of subjects such as parent migration, unemployment, behavioural studies and respondent essays. The full list of NCDS studies available from the UK Data Service can be found on the NCDS series access data webpage.
How to access genetic and/or bio-medical sample data from a range of longitudinal surveys:
For information on how to access biomedical data from NCDS that are not held at the UKDS, see the CLS Genetic data and biological samples webpage.
Further information about the full NCDS series can be found on the Centre for Longitudinal Studies website.
The National Child Development Study: Biomedical Survey 2002-2004 was funded under the Medical Research Council 'Health of the Public' initiative, and was carried out in 2002-2004 in collaboration with the Institute of Child Health, St George's Hospital Medical School, and NatCen. The survey was designed to obtain objective measures of ill-health and biomedical risk factors in order to address a wide range of specific hypotheses relating to anthropometry: cardiovascular, respiratory and allergic diseases; visual and hearing impairment; and mental ill-health.
The majority of the biomedical data (1,064 variables) are now available under EUL (SN 8731), with some data considered sensitive still available under Special Licence (SN 5594). This decision was the result of the CLS's disclosure assessment of each variable and the broad aim to make as much data available with the lowest possible barriers. Information about the medication taken by the cohort members of the study is also available under EUL for the first time. These data were collected in 2002-2004, but they were never released via the UKDS.
The Special Licence dataset contains 122 variables, including new data on child adversity not previously released, as well as a number of original variables that were previously available under Special Licence due to their sensitive nature, such as Clinical Interview Schedule-Revised (CIS-R) specific questions on mental health and questions which contain categories with small frequencies related to personal details such as skin colour, pregnancy, a surgical operation, specific height and an unusually high number of children.
For the second edition (December 2020), the data and documentation have been revised. Previously unreleased variables on child adversity have been added and some variables removed as they are now available under EUL. Users are advised to download the EUL version (SN 8731) before deciding to apply for the Special Licence version.
ODC Public Domain Dedication and Licence (PDDL) v1.0 http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
This dataset contains posts from 28 subreddits (15 mental health support groups) from 2018-2020. We used this dataset to understand the impact of COVID-19 on mental health support groups from January to April, 2020 and included older timeframes to obtain baseline posts before COVID-19.
Please cite if you use this dataset:
Low, D. M., Rumker, L., Torous, J., Cecchi, G., Ghosh, S. S., & Talkar, T. (2020). Natural Language Processing Reveals Vulnerable Mental Health Support Groups and Heightened Health Anxiety on Reddit During COVID-19: Observational Study. Journal of medical Internet research, 22(10), e22635.
@article{low2020natural, title={Natural Language Processing Reveals Vulnerable Mental Health Support Groups and Heightened Health Anxiety on Reddit During COVID-19: Observational Study}, author={Low, Daniel M and Rumker, Laurie and Torous, John and Cecchi, Guillermo and Ghosh, Satrajit S and Talkar, Tanya}, journal={Journal of medical Internet research}, volume={22}, number={10}, pages={e22635}, year={2020}, publisher={JMIR Publications Inc., Toronto, Canada} }
License
This dataset is made available under the Public Domain Dedication and License v1.0 whose full text can be found at: http://www.opendatacommons.org/licenses/pddl/1.0/
It was downloaded using the Pushshift API. Re-use of this data is subject to Reddit API terms.
Reddit Mental Health Dataset
Contains posts and text features for the following timeframes from 28 mental health and non-mental health subreddits:
Filenames and corresponding timeframes:
post: Jan 1 to April 20, 2020 (called "mid-pandemic" in manuscript; r/COVID19_support appears). Unique users: 320,364.
pre: Dec 2018 to Dec 2019. A full year which provides more data for a baseline of Reddit posts. Unique users: 327,289.
2019: Jan 1 to April 20, 2019 (r/EDAnonymous appears). A control for seasonal fluctuations to match post data. Unique users: 282,560.
2018: Jan 1 to April 20, 2018. A control for seasonal fluctuations to match post data. Unique users: 177,089.
Unique users across all time windows (pre and 2019 overlap): 826,961.
See manuscript Supplementary Materials (https://doi.org/10.31234/osf.io/xvwcy) for more information.
Note: if subsampling (e.g., to balance subreddits), we recommend bootstrapping analyses for unbiased results.
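One way to follow the bootstrapping recommendation is sketched below in Python. It is an illustration under stated assumptions, not the procedure used in the manuscript: each iteration draws the same number of posts per subreddit with replacement, computes a statistic on the balanced sample, and the spread of the estimates gives a percentile interval. The function name, the subreddit names and the toy values are hypothetical.

```python
import random
import statistics


def bootstrap_balanced(posts_by_group, n_per_group, statistic, n_iter=1000, seed=0):
    """Repeatedly draw a balanced subsample (n_per_group items per group,
    with replacement) and return the mean estimate and a 95% percentile
    interval of the statistic across iterations."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    estimates = []
    for _ in range(n_iter):
        sample = []
        for values in posts_by_group.values():
            sample.extend(rng.choices(values, k=n_per_group))
        estimates.append(statistic(sample))
    estimates.sort()
    # Integer arithmetic avoids floating-point surprises in the indices.
    low_idx = (n_iter * 25) // 1000
    high_idx = (n_iter * 975) // 1000 - 1
    return statistics.mean(estimates), (estimates[low_idx], estimates[high_idx])


# Toy data: a per-post feature (e.g. word count) for two made-up subreddits
# of unequal size.
toy_posts = {
    "r/example_small": [10, 12, 14, 16],
    "r/example_large": [100, 110, 120, 130, 140, 150],
}

mean_est, (ci_low, ci_high) = bootstrap_balanced(
    toy_posts, n_per_group=4, statistic=statistics.mean
)
print(mean_est, ci_low, ci_high)
```

Reporting the interval alongside the point estimate shows how much a result depends on which posts happened to be drawn in a single balanced subsample.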