Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
YDQ ≥ 5 indicates Internet addiction (IA). YDQ scores of 3 or 4 indicate potential IA. CIUS ≥ 21 indicates compulsive Internet use.
Internet Use Characteristics of 27 Participants Who Self-Reported Problem Internet Use.
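The cut-offs in this note can be written as a small lookup. The sketch below is only an illustration of the stated thresholds: the function names and the "below threshold" label are assumptions and do not come from the dataset.

```python
def classify_ydq(score: int) -> str:
    """Map a YDQ score to the category given in the table note."""
    if score >= 5:
        return "Internet addiction"
    if score >= 3:  # scores of 3 or 4
        return "potential Internet addiction"
    return "below threshold"  # label assumed for illustration


def classify_cius(score: int) -> str:
    """Map a CIUS score to the category given in the table note."""
    if score >= 21:
        return "compulsive Internet use"
    return "below threshold"  # label assumed for illustration


print(classify_ydq(4), "|", classify_cius(21))
```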
As of October 2025, 6.04 billion individuals worldwide were internet users, which amounted to 73.2 percent of the global population. Of this total, 5.66 billion, or 68.7 percent of the world's population, were social media users.
Global internet usage
Connecting billions of people worldwide, the internet is a core pillar of the modern information society. Northern Europe ranked first among worldwide regions by the share of the population using the internet in 2025. In the Netherlands, Norway, and Saudi Arabia, 99 percent of the population used the internet as of February 2025. North Korea was at the opposite end of the spectrum, with virtually no internet usage penetration among the general population, ranking last worldwide. Eastern Asia was home to the largest number of online users worldwide, with over 1.34 billion at the latest count. Southern Asia ranked second, with around 1.2 billion internet users. China, India, and the United States rank ahead of other countries worldwide by the number of internet users.
Worldwide internet user demographics
As of 2024, the share of female internet users worldwide was 65 percent, five percentage points lower than that of men. The gender disparity in internet usage was bigger in African countries, with around a 10-percent difference. Worldwide regions such as the Commonwealth of Independent States and Europe showed a smaller usage gap between the two genders. As of 2024, global internet usage was higher among individuals between 15 and 24 years old across all regions, with young people in Europe showing the highest usage penetration, 98 percent. In comparison, the worldwide average for the age group of 15 to 24 years was 79 percent. Country income level was also an essential factor for internet access: 93 percent of the population of high-income countries reportedly used the internet, as opposed to only 27 percent in low-income markets.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
In the current digital era, adolescents’ Internet use has increased exponentially, with the Internet playing an increasingly important role in their education and entertainment. However, because their cognitive, emotional, and social development is still ongoing, youth and adolescents are more vulnerable to Internet addiction. Attention has been paid to the increased use of the Internet during the COVID-19 pandemic and to the influence of Internet literacy on the prevention of and intervention in Internet addiction.
Methods
The present study proposes a conceptual model to investigate the links between Internet literacy, Internet use of different purposes and durations, and Internet addiction among Chinese youth and adolescents. In this study, N = 2,276 adolescents studying in primary and secondary schools in East China were recruited, and they completed self-reports on sociodemographic characteristics, an Internet literacy scale, Internet use, and an Internet addiction scale.
Results
The results showed a significant relationship between Internet use and Internet addiction. Specifically, the duration of Internet use significantly and positively affected Internet addiction. With different dimensions of Internet literacy required, entertainment-oriented Internet use had a positive impact on Internet addiction, while education-oriented Internet use had a negative effect on Internet addiction. As for Internet literacy, knowledge and skills for the Internet (positively) and Internet self-management (negatively) significantly influenced the likelihood of Internet addiction.
Discussion
The findings suggest that Internet overuse increases the risk of Internet addiction in youth and adolescents, and that entertainment-oriented rather than education-oriented Internet use is addictive. The role of Internet literacy is complicated: critical Internet literacy prevents the development of Internet addiction among youth and adolescents, while functional Internet literacy increases the risk.
As of 2024, the estimated number of internet users worldwide was 5.5 billion, up from 5.3 billion in the previous year. This figure represents 68 percent of the global population.
Internet access around the world
Easier access to computers, the modernization of countries worldwide, and increased utilization of smartphones have allowed people to use the internet more frequently and conveniently. However, internet penetration often depends on the state of development of a country's communications networks. As of January 2023, there were approximately 1.05 billion total internet users in China and 692 million total internet users in India.
Online activities
Social networking is one of the most popular online activities worldwide, and Facebook is the most popular online network based on active usage. As of the fourth quarter of 2023, there were over 3.07 billion monthly active Facebook users, accounting for well over half of the internet users worldwide. Connecting with family and friends, expressing opinions, entertainment, and online shopping are among the most popular reasons for internet usage.
Studies have identified high rates and severe consequences of Internet Addiction/Pathological Internet Use (IA/PIU) in university students. However, most research concerning IA/PIU in U.S. university students has been conducted within a quantitative research paradigm, and frequently fails to contextualize the problem of IA/PIU. To address this gap, we conducted an exploratory qualitative study using the focus group approach and examined 27 U.S. university students who self-identified as intensive Internet users, spent more than 25 hours/week on the Internet for non-school or non-work-related activities and who reported Internet-associated health and/or psychosocial problems. Students completed two IA/PIU measures (Young’s Diagnostic Questionnaire and the Compulsive Internet Use Scale) and participated in focus groups exploring the natural history of their Internet use; preferred online activities; emotional, interpersonal, and situational triggers for intensive Internet use; and health and/or psychosocial consequences of their Internet overuse. Students’ self-reports of Internet overuse problems were consistent with results of standardized measures. Students first accessed the Internet at an average age of 9 (SD = 2.7), and first had a problem with Internet overuse at an average age of 16 (SD = 4.3). Sadness and depression, boredom, and stress were common triggers of intensive Internet use. Social media use was nearly universal and pervasive in participants’ lives. Sleep deprivation, academic under-achievement, failure to exercise and to engage in face-to-face social activities, negative affective states, and decreased ability to concentrate were frequently reported consequences of intensive Internet use/Internet overuse. IA/PIU may be an underappreciated problem among U.S. university students and warrants additional research.
When asked about "Attitudes towards the internet", most Australian respondents pick "It is important to me to have mobile internet access in any place" as an answer. 55 percent did so in our online survey in 2025. Looking to gain valuable insights about users of internet providers worldwide? Check out our reports on consumers who use internet providers. These reports give readers a thorough picture of these customers, including their identities, preferences, opinions, and methods of communication.
As of February 2025, the average daily social media usage of internet users worldwide amounted to 141 minutes per day, down from 143 minutes in the previous year. Currently, the country with the most time spent on social media per day is Brazil, with online users spending an average of 3 hours and 49 minutes on social media each day. In comparison, the daily time spent with social media in the U.S. was just 2 hours and 16 minutes.
Global social media usage
Currently, the global social network penetration rate is 62.3 percent. Northern Europe had an 81.7 percent social media penetration rate, topping the ranking of global social media usage by region. Eastern and Middle Africa closed the ranking with 10.1 and 9.6 percent usage reach, respectively. People access social media for a variety of reasons. Users like to find funny or entertaining content and enjoy sharing photos and videos with friends, but mainly use social media to stay in touch with current events and friends.
Global impact of social media
Social media has a wide-reaching and significant impact on not only online activities but also offline behavior and life in general. During a global online user survey in February 2019, a significant share of respondents stated that social media had increased their access to information, ease of communication, and freedom of expression. On the flip side, respondents also felt that social media had worsened their personal privacy, increased polarization in politics, and heightened everyday distractions.
Do you ever feel like you're being inundated with news from all sides, and you can't keep up? Well, you're not alone. In today's age of social media and 24-hour news cycles, it can be difficult to know what's going on in the world. And with so many different news sources to choose from, it can be hard to know who to trust.
That's where this dataset comes in. It captures individuals' sentiment toward different news sources. The data was collected by administering a survey to individuals who use different news sources. The survey responses were then analyzed to obtain a sentiment score for each news source.
So if you're feeling overwhelmed by the news, don't worry: this dataset has you covered. With its insights on which news sources are trustworthy and which ones aren't, you'll be able to make informed decisions about what to read and what to skip.
The Twitter Sentiment Analysis dataset can be used to analyze the impact of social media on news consumption. It can be used to study how individuals' sentiments toward the news vary with the source they use, and how other factors, such as the time of day or the topic of the news, affect those sentiments.
File: news.csv

| Column name | Description |
|:-----------------------|:------------------------------------------------------|
| Title | The title of the news article. (String) |
| Date | The date the news article was published. (Date) |
| Time | The time the news article was published. (Time) |
| Score | The sentiment score of the news article. (Float) |
| Number of Comments | The number of comments on the news article. (Integer) |

File: news_api.csv

| Column name | Description |
|:--------------|:------------------------------------------------|
| Title | The title of the news article. (String) |
| Date | The date the news article was published. (Date) |
| Source | The news source the article is from. (String) |

File: politics.csv

| Column name | Description |
|:-----------------------|:------------------------------------------------------|
| Title | The title of the news article. (String) |
| Date | The date the news article was published. (Date) |
| Time | The time the news article was published. (Time) |
| Score | The sentiment score of the news article. (Float) |
| Number of Comments | The number of comments on the news article. (Integer) |

File: sports.csv

| Column name | Description |
|:-----------------------|:------------------------------------------------------|
| Title | The title of the news article. (String) |
| Date | The date the news article was published. (Date) |
| Time | The time the news article was published. (Time) |
| Score | The sentiment score of the news article. (Float) |
| Number of Comments | The number of comments on the news article. (Integer) |

File: television.csv

| Column name | Description |
|:-----------------------|:------------------------------------------------------|
| Title | The title of the news article. (String) |
| Date | The date the news article was published. (Date) |
| Time | The time the news article was published. (Time) |
| Score | The sentiment score of the news article. (Float) |
| Number of Comments | The number of comments on the news article. (Integer) |

File: trending.csv | Column name | Description ...
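
As a rough illustration of how a file with the news.csv columns listed above might be read, the sketch below parses a two-row in-memory sample with Python's standard csv module. The sample rows, the date and time formats, and the helper name `load_articles` are assumptions made for the example; the real files may use different formats or carry an extra unnamed index column.

```python
import csv
import io
from statistics import mean

# Hypothetical rows that follow the documented news.csv columns.
SAMPLE = """Title,Date,Time,Score,Number of Comments
Example headline A,2022-01-01,09:30:00,0.25,12
Example headline B,2022-01-02,18:05:00,-0.50,3
"""


def load_articles(handle):
    """Read rows and convert Score to float and Number of Comments to int."""
    rows = []
    for record in csv.DictReader(handle):
        rows.append({
            "title": record["Title"],
            "date": record["Date"],   # kept as text; the real format is not documented
            "time": record["Time"],
            "score": float(record["Score"]),
            "comments": int(record["Number of Comments"]),
        })
    return rows


articles = load_articles(io.StringIO(SAMPLE))
average_score = mean(article["score"] for article in articles)
print(len(articles), average_score)
```

To read an actual file, the same function could be given `open("news.csv", newline="", encoding="utf-8")` instead of the in-memory sample.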
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Data for an overview of peer-reviewed articles published up to November 2024 on the reasons for social internet usage by people with intellectual disabilities (ID). Research question (RQ): Why do people with ID engage in social internet use?
Open Government Licence 3.0 http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Annual data on internet usage in Great Britain, including frequency of internet use, internet activities and internet purchasing.
As of February 2024, over a third of online users worldwide were aged between 25 and 34 years. Website visitors in this age bracket constituted the biggest group of online users worldwide. Also, 19 percent of global online users were aged 18 to 24 years. The global digital population aged 65 or older represented approximately 4.2 percent of all internet users worldwide.
Social media usage and Meta
Social media is a major driver of internet use, with a global penetration rate of 62.2 percent. On average, internet users spend 143 minutes per day on social media, highlighting its significant impact on daily online activities. Social media usage is dominated by Meta, which owns four of the largest social media platforms. Facebook leads the ranking with over three billion active users, followed by Instagram and WhatsApp.
Instagram's global popularity
Meta’s social video platform, Instagram, had long been one of the most engaging social media platforms worldwide, and it was projected to reach 1.44 billion monthly active users. Instagram was particularly favored by users aged 18 to 34, thanks to its ability to offer a variety of interactive content, such as images and carousels. This diverse range of content types was a key factor in its popularity among its young user base.
https://datacatalog.worldbank.org/public-licenses?fragment=cc
This dataset contains metadata (title, abstract, date of publication, field, etc) for around 1 million academic articles. Each record contains additional information on the country of study and whether the article makes use of data. Machine learning tools were used to classify the country of study and data use.
Our data source of academic articles is the Semantic Scholar Open Research Corpus (S2ORC) (Lo et al. 2020). The corpus contains more than 130 million English language academic papers across multiple disciplines. The papers included in the Semantic Scholar corpus are gathered directly from publishers, from open archives such as arXiv or PubMed, and crawled from the internet.
We placed some restrictions on the articles to make them usable and relevant for our purposes. First, only articles with an abstract and a parsed PDF or LaTeX file are included in the analysis. The full text of the abstract is necessary to classify the country of study and whether the article uses data. The parsed PDF or LaTeX file is needed to extract information such as the date of publication and field of study. This restriction eliminated a large number of articles in the original corpus. Around 30 million articles remain after keeping only articles with a parsable (i.e., suitable for digital processing) PDF, and around 26% of those 30 million are eliminated when removing articles without an abstract. Second, only articles from the year 2000 to 2020 were considered. This restriction eliminated an additional 9% of the remaining articles. Finally, articles from the following fields of study were excluded, as we aim to focus on fields that are likely to use data produced by countries’ national statistical systems: Biology, Chemistry, Engineering, Physics, Materials Science, Environmental Science, Geology, History, Philosophy, Math, Computer Science, and Art. Fields that are included are: Economics, Political Science, Business, Sociology, Medicine, and Psychology. This third restriction eliminated around 34% of the remaining articles. From an initial corpus of 136 million articles, this resulted in a final corpus of around 10 million articles.
Because of the intensive computing resources required, a set of 1,037,748 articles was randomly selected from the 10 million articles in our restricted corpus as a convenience sample.
The empirical approach employed in this project utilizes text mining with Natural Language Processing (NLP). The goal of NLP is to extract structured information from raw, unstructured text. In this project, NLP is used to extract the country of study and whether the paper makes use of data. We will discuss each of these in turn.
To determine the country or countries of study in each academic article, two approaches are employed based on information found in the title, abstract, or topic fields. The first approach uses regular expression searches based on the presence of ISO 3166 country names. A defined set of country names is compiled, and the presence of these names is checked in the relevant fields. This approach is transparent, widely used in social science research, and easily extended to other languages. However, there is a potential for exclusion errors if a country’s name is spelled in a non-standard way.
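A minimal sketch of the regular-expression approach follows. It is not the project's code: the four-name list is an illustrative subset rather than the full ISO 3166 set, and `find_countries` is a hypothetical helper. The last call shows the kind of exclusion error mentioned above, since "Vietnam" does not match the listed spelling "Viet Nam".

```python
import re

# Illustrative subset only; the project compiles a full set of ISO 3166 names.
COUNTRY_NAMES = ["Kenya", "Poland", "Viet Nam", "Brazil"]


def find_countries(names, *fields):
    """Return the listed country names that appear in any of the text fields."""
    # Try longer names first so they are not shadowed by shorter alternatives.
    alternatives = sorted((re.escape(name) for name in names), key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)
    canonical = {name.lower(): name for name in names}
    found = set()
    for text in fields:
        for match in pattern.finditer(text or ""):
            found.add(canonical[match.group(0).lower()])
    return found


title = "Maize yields in Kenya"
abstract = "We combine survey data from kenya and Poland."
print(sorted(find_countries(COUNTRY_NAMES, title, abstract)))
print(find_countries(COUNTRY_NAMES, "Households in Vietnam"))  # non-standard spelling is missed
```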
The second approach is based on Named Entity Recognition (NER), which uses machine learning to identify objects from text, utilizing the spaCy Python library. The Named Entity Recognition algorithm splits text into named entities, and NER is used in this project to identify countries of study in the academic articles. SpaCy supports multiple languages and has been trained on multiple spellings of countries, overcoming some of the limitations of the regular expression approach. If a country is identified by either the regular expression search or NER, it is linked to the article. Note that one article can be linked to more than one country.
The second task is to classify whether the paper uses data. A supervised machine learning approach is employed, where 3,500 publications were first randomly selected and manually labeled by human raters using the Mechanical Turk service (Paszke et al. 2019).[1] To make sure the human raters had a similar and appropriate definition of data in mind, they were given the following instructions before seeing their first paper:
Each of these documents is an academic article. The goal of this study is to measure whether a specific academic article is using data and from which country the data came.
There are two classification tasks in this exercise:
1. Identifying whether an academic article is using data from any country.
2. Identifying from which country that data came.
For task 1, we are looking specifically at the use of data. Data is any information that has been collected, observed, generated or created to produce research findings. As an example, a study that reports findings or analysis using survey data uses data. Some clues that a study does use data include whether a survey or census is described, a statistical model is estimated, or a table of means or summary statistics is reported.
After an article is classified as using data, please note the type of data used. The options are population or business census, survey data, administrative data, geospatial data, private sector data, and other data. If no data is used, then mark "Not applicable". In cases where multiple data types are used, please click multiple options.[2]
For task 2, we are looking at the country or countries that are studied in the article. In some cases, no country may be applicable. For instance, if the research is theoretical and has no specific country application. In some cases, the research article may involve multiple countries. In these cases, select all countries that are discussed in the paper.
We expect between 10 and 35 percent of all articles to use data.
The median amount of time that a worker spent on an article, measured as the time between when the article was accepted to be classified by the worker and when the classification was submitted, was 25.4 minutes. If human raters were exclusively used rather than machine learning tools, then the corpus of 1,037,748 articles examined in this study would take around 50 years of human work time to review at a cost of $3,113,244, which assumes a cost of $3 per article as was paid to MTurk workers.
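The figures in this paragraph can be checked with a few lines of arithmetic. The sketch below assumes that the median of 25.4 minutes is applied to every article and that the "around 50 years" counts time around the clock; the 40-hour working week used for comparison is an added assumption, not a figure from the text.

```python
ARTICLES = 1_037_748
COST_PER_ARTICLE_USD = 3
MEDIAN_MINUTES_PER_ARTICLE = 25.4

total_cost_usd = ARTICLES * COST_PER_ARTICLE_USD
total_hours = ARTICLES * MEDIAN_MINUTES_PER_ARTICLE / 60

# Continuous, round-the-clock time (assumed reading of "years of human work time").
years_continuous = total_hours / (24 * 365)

# For comparison only: one rater working 40 hours a week, 52 weeks a year (assumption).
years_full_time = total_hours / (40 * 52)

print(f"cost: ${total_cost_usd:,}")
print(f"continuous years: {years_continuous:.1f}")
print(f"full-time working years: {years_full_time:.1f}")
```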
A model is next trained on the 3,500 labelled articles. We use a distilled version of the BERT (Bidirectional Encoder Representations from Transformers) model to encode raw text into a numeric format suitable for predictions (Devlin et al. 2018). BERT is pre-trained on a large corpus comprising the Toronto Book Corpus and Wikipedia. The distilled version (DistilBERT) is a compressed model that is 60% the size of BERT, retains 97% of its language understanding capabilities, and is 60% faster (Sanh, Debut, Chaumond, Wolf 2019). We use PyTorch to produce a model to classify articles based on the labeled data. Of the 3,500 articles that were hand coded by the MTurk workers, 900 are fed to the machine learning model. 900 articles were selected because of computational limitations in training the NLP model. A classification of “uses data” was assigned if the model predicted an article used data with at least 90% confidence.
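The 90% decision rule can be illustrated without the model itself. The sketch below assumes the classifier emits two raw scores (logits) per article, ordered as "no data" then "uses data", and that "confidence" means the softmax probability of the second class. Both points are assumptions for illustration; the source does not describe the model's output format.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def uses_data(logits, threshold=0.90):
    """Label an article as using data only if that class reaches the threshold."""
    probability_uses_data = softmax(logits)[1]  # index 1 = "uses data" (assumed ordering)
    return probability_uses_data >= threshold


print(uses_data([0.0, 3.0]))  # probability about 0.95
print(uses_data([0.0, 2.0]))  # probability about 0.88
```

A high threshold of this kind trades recall for precision: articles the model is unsure about are left unlabelled as data users.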
The performance of the models classifying articles to countries and as using data or not can be compared to the classification by the human raters. We consider the human raters as giving us the ground truth. This may underestimate the model performance if the workers at times got the allocation wrong in a way that would not apply to the model. For instance, a human rater could mistake the Republic of Korea for the Democratic People’s Republic of Korea. If both humans and the model make the same kinds of errors, then the performance reported here will be overestimated.
The model was able to predict whether an article made use of data with 87% accuracy evaluated on the set of articles held out of the model training. The correlation between the number of articles written about each country using data estimated under the two approaches is given in the figure below. The number of articles represents an aggregate total of
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Raw Data of manuscript: "Social isolation intensified the interests in toothache-related digital information during the COVID-19 pandemic"
In 2019, ** percent of respondents used the internet almost daily. This survey depicts the frequency of online activities in Germany in 2019. Other popular daily activities included reading articles and posts online, as well as using social media.
Estimation results for different internet usage modes.
How many people use social media?
Social media usage is one of the most popular online activities. In 2024, over five billion people were using social media worldwide, a number projected to increase to over six billion in 2028.
Who uses social media?
Social networking is one of the most popular digital activities worldwide, and it is no surprise that social networking penetration across all regions is constantly increasing. As of January 2023, the global social media usage rate stood at 59 percent. This figure is anticipated to grow as less developed digital markets catch up with other regions when it comes to infrastructure development and the availability of cheap mobile devices. In fact, most of social media’s global growth is driven by the increasing usage of mobile devices. Mobile-first market Eastern Asia topped the global ranking of mobile social networking penetration, followed by established digital powerhouses such as the Americas and Northern Europe.
How much time do people spend on social media?
Social media is an integral part of daily internet usage. On average, internet users spend 151 minutes per day on social media and messaging apps, an increase of 40 minutes since 2015. Internet users in Latin America spent the most time per day on social media.
What are the most popular social media platforms?
Market leader Facebook was the first social network to surpass one billion registered accounts and currently boasts approximately 2.9 billion monthly active users, making it the most popular social network worldwide. In June 2023, the top social media apps in the Apple App Store included mobile messaging apps WhatsApp and Telegram Messenger, as well as the ever-popular app version of Facebook.
Empirical studies have identified increasing rates of problematic Internet use worldwide and a host of related negative consequences. However, researchers disagree as to whether problematic Internet use is a subtype of behavioral addiction. Thus, there are not yet widely accepted and validated diagnostic criteria for problematic Internet use. To address this gap, we used mixed methods to examine the extent to which signs and symptoms of problematic Internet use mirror DSM-5 diagnostic criteria for substance use disorder, gambling disorder, and Internet gaming disorder. A total of 27 university students who self-identified as intensive Internet users and who reported Internet-use-associated health and/or psychosocial problems were recruited. Students completed two measures that assess problematic Internet use (Young’s Diagnostic Questionnaire and the Compulsive Internet Use Scale) and participated in focus groups exploring their experiences with problematic Internet use. Results of standardized measures and focus group discussions indicated substantial overlap between students’ experiences of problematic Internet use and the signs and symptoms reflected in the DSM-5 criteria for substance use disorder, gambling disorder, and Internet gaming disorder. These signs and symptoms included: a) using the Internet longer than intended, b) preoccupation with the Internet, c) withdrawal symptoms when unable to access the Internet, d) unsuccessful attempts to stop or reduce Internet use, e) craving, f) loss of interest in hobbies or activities other than the Internet, g) excessive Internet use despite the knowledge of related problems, h) use of the Internet to escape or relieve a negative mood, and i) lying about Internet use. Tolerance, withdrawal symptoms, and recurrent Internet use in hazardous situations were uniquely manifested in the context of problematic Internet use. Implications for research and practice are discussed.
Average treatment effect of internet usage on farmers’ adoption behavior.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
The Internet is both an opportunity and a challenge for people with disabilities. However, this segment of the population is usually counted among the social groups experiencing a digital divide. The study analyzes the factors that determine Internet usage and specific online activities among people with disabilities, based on a nationwide study performed in 2013 in Poland.
Methods
Secondary analysis was performed on the data of persons who declared disability status in the 2013 “Social Diagnosis” study. Multivariate logistic regression models were developed for the use of the Internet and for performing three types of activities online.
Results
Among 3,556 respondents with disability, 51.02% were females, 25.19% were 65 years of age or older, and 33.05% were Internet users. The predictors of Internet usage included the degree of disability, place of residence, level of education, marital status, occupational status, net income, use of health care services, and the use of a mobile phone. The odds ratio for Internet use among persons with disability in the oldest age category was only 0.04 (95% CI 0.02–0.09) compared with the youngest category. The odds that a person with disability from the highest category of education will use the Internet were 18 times higher than in the case of persons with only basic education (OR 18.17, 95% CI 11.70–28.21). Common predictors of online activities (accessing websites of public institutions, checking and sending emails, publishing own content on the Internet) included age category and net income.
Conclusions
People with disabilities in Poland are facing a significant digital divide. The factors determining the use of the Internet in this group are similar to those of the general population. On the other hand, people with disabilities who are active online access diversified types of services, including presentation of their own content online.
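To show how a reported odds ratio relates to the underlying logistic regression coefficient, the sketch below recovers the log-odds coefficient and an approximate standard error from the education result quoted above (OR 18.17, 95% CI 11.70–28.21). It assumes the interval is a Wald interval, symmetric on the log scale with z = 1.96, which the abstract does not state; `log_odds_and_se` is a hypothetical helper.

```python
import math


def log_odds_and_se(odds_ratio, ci_low, ci_high, z=1.96):
    """Recover the coefficient and its standard error from an OR and its 95% CI.

    Assumes a Wald interval: exp(beta - z*se) to exp(beta + z*se).
    """
    beta = math.log(odds_ratio)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z)
    return beta, se


beta, se = log_odds_and_se(18.17, 11.70, 28.21)

# Rebuild the interval from beta and se as a consistency check.
rebuilt_low = math.exp(beta - 1.96 * se)
rebuilt_high = math.exp(beta + 1.96 * se)

print(f"beta = {beta:.3f}, se = {se:.3f}")
print(f"rebuilt 95% CI: {rebuilt_low:.2f} to {rebuilt_high:.2f}")
```

The rebuilt interval falls close to the reported one, which is consistent with the Wald assumption but does not confirm how the authors computed it.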
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The World Wide Web is a complex interconnected digital ecosystem, where information and attention flow between platforms and communities throughout the globe. These interactions co-construct how we understand the world, reflecting and shaping public discourse. Unfortunately, researchers often struggle to understand how information circulates and evolves across the web because platform-specific data is often siloed and restricted by linguistic barriers. To address this gap, we present a comprehensive, multilingual dataset capturing all Wikipedia links shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions use Wikipedia for deliberation and fact-checking which subsequently influences Wikipedia content, by driving traffic to articles or inspiring edits. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.
The motivations for this dataset stem from the challenges researchers face in studying the flow of information across the web. While the World Wide Web enables global communication and collaboration, data silos, linguistic barriers, and platform-specific restrictions hinder our ability to understand how information circulates, evolves, and impacts public discourse. Wikipedia and Reddit, as major hubs of knowledge sharing and discussion, offer an invaluable lens into these processes. However, without comprehensive data capturing their interactions, researchers are unable to fully examine how platforms co-construct knowledge. This dataset bridges this gap, providing the tools needed to study the interconnectedness of social media and collaborative knowledge systems.
WikiReddit is a comprehensive dataset capturing all Wikipedia mentions (including links) shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW (not safe for work) subreddits. The SQL database comprises 336K total posts, 10.2M comments, 1.95M unique links, and 1.26M unique articles spanning 59 languages on Reddit and 276 Wikipedia language subdomains. Each linked Wikipedia article is enriched with its revision history and page view data within a ±10-day window of its posting, as well as article ID, redirects, and Wikidata identifiers. Supplementary anonymous metadata from Reddit posts and comments further contextualizes the links, offering a robust resource for analysing cross-platform information flows, collective attention dynamics, and the role of Wikipedia in online discourse.
Data was collected from the Reddit4Researchers and Wikipedia APIs. No personally identifiable information is published in the dataset. Reddit data is linked to Wikipedia via the hyperlinks and article titles appearing in Reddit posts.
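To make the ±10-day window concrete, the following is a minimal sketch of how the bounds around a posting time could be computed. The function name `activity_window` and the sample timestamp are hypothetical and are not taken from the dataset's own tooling.

```python
from datetime import datetime, timedelta, timezone


def activity_window(posted_at: datetime, days: int = 10) -> tuple[datetime, datetime]:
    """Return the (start, end) of a symmetric window around a posting time."""
    delta = timedelta(days=days)
    return posted_at - delta, posted_at + delta


# Illustrative timestamp only; not a real record from the dataset.
posted = datetime(2021, 3, 15, 12, 0, tzinfo=timezone.utc)
start, end = activity_window(posted)
print(start.date(), end.date())
```

Revisions and page views whose timestamps fall between `start` and `end` would then belong to the window for that link.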
Extensive processing with tools such as regex was applied to the Reddit post/comment text to extract the Wikipedia URLs. Redirects for Wikipedia URLs and article titles were found through the API and mapped to the collected data. Reddit IDs are hashed with SHA-256 for post/comment/user/subreddit anonymity.
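As a rough illustration of these two steps, the sketch below extracts Wikipedia links from free text with a regular expression and hashes an identifier with SHA-256. The pattern, the helper names (`extract_wikipedia_links`, `anonymize_id`), and the unsalted hashing are assumptions made for this example; the pipeline used to build the dataset may differ in all three respects.

```python
import hashlib
import re

# Assumed pattern: http(s) links to any Wikipedia language subdomain,
# optionally on the mobile ("m.") host. Not the authors' actual regex.
WIKI_URL = re.compile(
    r"https?://([a-z\-]+)\.(?:m\.)?wikipedia\.org/wiki/([^\s<>\[\]]+)",
    re.IGNORECASE,
)


def _clean(title: str) -> str:
    """Strip punctuation that belongs to the surrounding prose, not the title."""
    title = title.rstrip(".,;:!?")
    # Drop a trailing ")" only when it has no matching "(" inside the title.
    while title.endswith(")") and title.count("(") < title.count(")"):
        title = title[:-1]
    return title


def extract_wikipedia_links(text: str) -> list[tuple[str, str]]:
    """Return (language subdomain, article title) pairs found in text."""
    return [(m.group(1).lower(), _clean(m.group(2))) for m in WIKI_URL.finditer(text)]


def anonymize_id(raw_id: str) -> str:
    """Hash an identifier with SHA-256 (unsalted here, which is an assumption)."""
    return hashlib.sha256(raw_id.encode("utf-8")).hexdigest()


comment = (
    "See https://en.wikipedia.org/wiki/Reddit and "
    "(https://de.m.wikipedia.org/wiki/Wikipedia)."
)
links = extract_wikipedia_links(comment)
print(links)
print(anonymize_id("abc"))
```

Because the hash is deterministic, the same original ID always maps to the same anonymized value, which is what keeps rows joinable across tables after anonymization.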
We foresee several applications of this dataset and preview four here. First, Reddit linking data can be used to understand how attention is driven from one platform to another. Second, Reddit linking data can shed light on how Wikipedia's archive of knowledge is used in the larger social web. Third, our dataset could provide insights into how external attention is topically distributed across Wikipedia, and could extend such analysis to disparities in the types of external communities that use Wikipedia and in how they use it. Fourth, and relatedly, a topic analysis of our dataset could reveal how Wikipedia usage on Reddit contributes to societal benefits and harms. Our dataset could help examine whether homogeneity within the Reddit and Wikipedia audiences shapes topic patterns, and assess whether these relationships mitigate or amplify problematic engagement online.
The dataset is publicly shared with a Creative Commons Attribution 4.0 International license. The article describing this dataset should be cited: https://doi.org/10.48550/arXiv.2502.04942
Patrick Gildersleve will maintain this dataset, and add further years of content as and when available.
posts

| Column Name | Type | Description |
|---|---|---|
| subreddit_id | TEXT | The unique identifier for the subreddit. |
| crosspost_parent_id | TEXT | The ID of the original Reddit post if this post is a crosspost. |
| post_id | TEXT | Unique identifier for the Reddit post. |
| created_at | TIMESTAMP | The timestamp when the post was created. |
| updated_at | TIMESTAMP | The timestamp when the post was last updated. |
| language_code | TEXT | The language code of the post. |
| score | INTEGER | The score (upvotes minus downvotes) of the post. |
| upvote_ratio | REAL | The ratio of upvotes to total votes. |
| gildings | INTEGER | Number of awards (gildings) received by the post. |
| num_comments | INTEGER | Number of comments on the post. |

comments

| Column Name | Type | Description |
|---|---|---|
| subreddit_id | TEXT | The unique identifier for the subreddit. |
| post_id | TEXT | The ID of the Reddit post the comment belongs to. |
| parent_id | TEXT | The ID of the parent comment (if a reply). |
| comment_id | TEXT | Unique identifier for the comment. |
| created_at | TIMESTAMP | The timestamp when the comment was created. |
| last_modified_at | TIMESTAMP | The timestamp when the comment was last modified. |
| score | INTEGER | The score (upvotes minus downvotes) of the comment. |
| upvote_ratio | REAL | The ratio of upvotes to total votes for the comment. |
| gilded | INTEGER | Number of awards (gildings) received by the comment. |

postlinks

| Column Name | Type | Description |
|---|---|---|
| post_id | TEXT | Unique identifier for the Reddit post. |
| end_processed_valid | INTEGER | Whether the extracted URL from the post resolves to a valid URL. |
| end_processed_url | TEXT | The extracted URL from the Reddit post. |
| final_valid | INTEGER | Whether the final URL from the post resolves to a valid URL after redirections. |
| final_status | INTEGER | HTTP status code of the final URL. |
| final_url | TEXT | The final URL after redirections. |
| redirected | INTEGER | Indicator of whether the posted URL was redirected (1) or not (0). |
| in_title | INTEGER | Indicator of whether the link appears in the post title (1) or post body (0). |

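
The posts and postlinks tables share `post_id`, so links can be joined to post metadata. The sketch below shows one way such a query might look. It assumes an SQLite-compatible engine (the source only says "SQL database"), builds a tiny in-memory copy of a subset of the columns listed above, and uses made-up rows; real IDs in the dataset are SHA-256 hashes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Subset of the documented columns; the real tables have more.
conn.executescript(
    """
    CREATE TABLE posts (
        post_id TEXT PRIMARY KEY,
        subreddit_id TEXT,
        score INTEGER,
        num_comments INTEGER
    );
    CREATE TABLE postlinks (
        post_id TEXT,
        final_url TEXT,
        final_valid INTEGER,
        redirected INTEGER,
        in_title INTEGER
    );
    """
)

# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?, ?)",
    [("p1", "s1", 120, 14), ("p2", "s1", 3, 0)],
)
conn.executemany(
    "INSERT INTO postlinks VALUES (?, ?, ?, ?, ?)",
    [
        ("p1", "https://en.wikipedia.org/wiki/Reddit", 1, 0, 0),
        ("p2", "https://en.wikipedia.org/wiki/Wikipedia", 1, 1, 1),
        ("p2", "https://en.wikipedia.org/wiki/Missing", 0, 0, 0),
    ],
)

# Valid final URLs with the score of the post that shared them.
rows = conn.execute(
    """
    SELECT l.final_url, p.score, l.in_title
    FROM postlinks AS l
    JOIN posts AS p ON p.post_id = l.post_id
    WHERE l.final_valid = 1
    ORDER BY p.score DESC
    """
).fetchall()

for row in rows:
    print(row)
```

The same pattern would apply to comments and commentlinks, joined on `comment_id`.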
commentlinks

| Column Name | Type | Description |
|---|---|---|
| comment_id | TEXT | Unique identifier for the Reddit comment. |
| end_processed_valid | INTEGER | Whether the extracted URL from the comment resolves to a valid URL. |
| end_processed_url | TEXT | The extracted URL from the comment. |
| final_valid | INTEGER | Whether the final URL from the comment resolves to a valid URL after redirections. |
| final_status | INTEGER | HTTP status code of the final URL. |