As of October 2025, English was the dominant language for online content, used by nearly half of all websites worldwide. Spanish ranked second, accounting for around 6 percent of web content, followed by German with 5.9 percent.
English as the leading online language
The United States and India, the countries with the most internet users after China, are also the world's biggest English-speaking markets. As of January 2023, the combined internet user base of the two countries exceeded one billion individuals. This has led to most online information being created in English. Consequently, even those who are not native speakers may use it for convenience.
Global internet usage by regions
As of October 2024, the number of internet users worldwide was 5.52 billion. In the same period, Northern Europe and North America led in internet penetration worldwide, with around 97 percent of their populations accessing the internet.
According to a 2023 survey, ** percent of internet users in urban India preferred using the internet in English. Meanwhile, ** percent of users accessed the internet in Indian languages, with Hindi being the most preferred language among them. Over *** million internet users reside in the urban areas of India.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This Flash Eurobarometer studied how Europeans use different languages online. While 90% of European internet users prefer to surf the internet in their own language, 55% at least occasionally use a language other than their own when online according to a pan-EU Eurobarometer survey released today. However, 44% feel they are missing interesting information because web pages are not in a language that they understand.
In the third quarter of 2023, over 55 percent of the Peruvian population over six years old who speak native languages such as Quechua or Aymara reported having used the internet. Internet penetration in Peru has been growing steadily, reaching 74 percent of the country's population in 2022.
Canadian Internet use survey, Internet use, by language used to search for information, for Canada in 2005. (Terminated)
This statistic represents the number of non-English digital payment internet users across India in 2016, based on language. Hindi internet users had the highest number of digital payment users amounting to about ** million, followed by Tamil internet users at about **** million during the measured time period.
This statistic displays the number of Indian and English language internet users across India from 2011 to 2021. In 2016, the number of English internet users amounted to about *** million and was projected to increase to *** million in 2021. For Indian language users, this number was about *** million users in 2016, and was projected to reach *** million in 2021.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The statistical operation Survey on the Information Society -ESI- Families provides regular information on the adoption of new information and communication technologies (ICT) among the population of the Basque Country. Specifically, it records and describes the ICT equipment available to the population at home, at the place of study, and in the workplace, and measures the level of use made of it, especially in relation to the Internet. It allows the level of ICT adoption in Basque society to be compared with that of surrounding communities.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset, comprising 29 files in xlsx format, contains the data for all metrics and accumulated information described in the methodology, results, and discussion sections of the research article "Exploring the Dominance of the English Language on the Websites of EU Countries".
This statistic represents the forecast share of non-English internet users across India in 2020, by language. Hindi was projected to have the highest share of internet users in the country at about ** percent, while the share for Malayalam was about ***** percent during the measured time period.
Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
Canadian Internet use survey, Internet use, by language used to search for information, for Canada in 2005. (Terminated)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides a structured view of the world’s cities, countries, and languages, derived from the well-known World Database (SQL → CSV). It is designed to be beginner-friendly yet powerful for researchers, analysts, and data scientists who want to explore global demographics, population distribution, and linguistic diversity.
The dataset is split into three clean, relational tables:
Key columns (cities):
ID → Unique city identifier
Name → City name
CountryCode → Links each city to its country
District → Administrative division
Population → Population of the city
Key columns (countries, country.csv):
Code → Unique country code
Name → Country name
Continent, Region → Geographic classification
SurfaceArea → Area in square kilometers
Population → Country’s population
GovernmentForm, HeadOfState → Political details
Key columns (languages):
CountryCode → Links to country.csv
Language → Language name
IsOfficial → Whether the language is official
Percentage → Percentage of speakers in the population
This dataset offers a balanced mix of geography, demography, and linguistics, well suited to analysts, students, and Kaggle competitors alike.
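The relational layout described above can be explored with a couple of joins. The following is a minimal sketch using Python's built-in `sqlite3` module. The table names `city`, `country`, and `countrylanguage` are assumptions (only `country.csv` is named in the description), `IsOfficial` is assumed to be stored as 1/0, and every row is a made-up placeholder rather than data from the dataset.

```python
import sqlite3

# Placeholder rows for illustration only; none of these values come from the dataset.
COUNTRIES = [  # (Code, Name, Continent, Population)
    ("AAA", "Country A", "Europe", 1_000_000),
    ("BBB", "Country B", "Asia", 4_000_000),
]
CITIES = [  # (ID, Name, CountryCode, District, Population)
    (1, "City A1", "AAA", "North", 500_000),
    (2, "City A2", "AAA", "South", 120_000),
    (3, "City B1", "BBB", "Coast", 900_000),
]
LANGUAGES = [  # (CountryCode, Language, IsOfficial, Percentage); 1/0 encoding is assumed
    ("AAA", "Language 1", 1, 80.0),
    ("AAA", "Language 2", 0, 20.0),
    ("BBB", "Language 3", 1, 95.5),
]


def build_demo_db():
    """Create an in-memory database with the three tables and placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE country (Code TEXT PRIMARY KEY, Name TEXT,
                              Continent TEXT, Population INTEGER);
        CREATE TABLE city (ID INTEGER PRIMARY KEY, Name TEXT, CountryCode TEXT,
                           District TEXT, Population INTEGER);
        CREATE TABLE countrylanguage (CountryCode TEXT, Language TEXT,
                                      IsOfficial INTEGER, Percentage REAL);
        """
    )
    conn.executemany("INSERT INTO country VALUES (?, ?, ?, ?)", COUNTRIES)
    conn.executemany("INSERT INTO city VALUES (?, ?, ?, ?, ?)", CITIES)
    conn.executemany("INSERT INTO countrylanguage VALUES (?, ?, ?, ?)", LANGUAGES)
    return conn


def official_language_speakers(conn):
    """Estimate speakers of each official language as Population * Percentage / 100."""
    return conn.execute(
        """
        SELECT c.Name, l.Language, ROUND(c.Population * l.Percentage / 100.0)
        FROM country AS c
        JOIN countrylanguage AS l ON l.CountryCode = c.Code
        WHERE l.IsOfficial = 1
        ORDER BY c.Code, l.Language
        """
    ).fetchall()


def city_totals(conn):
    """Count listed cities and sum their populations for each country."""
    return conn.execute(
        """
        SELECT co.Name, COUNT(*), SUM(ci.Population)
        FROM city AS ci
        JOIN country AS co ON co.Code = ci.CountryCode
        GROUP BY co.Code, co.Name
        ORDER BY co.Code
        """
    ).fetchall()


if __name__ == "__main__":
    demo = build_demo_db()
    print(official_language_speakers(demo))
    print(city_totals(demo))
```

With the real CSV files, the placeholder lists would be replaced by rows read from disk; the join keys (`CountryCode` to `Code`) are the part taken from the column descriptions.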
The Canadian Internet Use Survey (CIUS) measures the extent and scope to which individual Canadians use the Internet. Survey content includes the location of use (e.g., at home, at work), the frequency and intensity of use, the specific uses of the Internet from the home, the purchase of products and services (electronic commerce), and other issues related to Internet use (such as language of use and concerns over privacy). This content is supplemented by information on individual and household characteristics (e.g., age, income, education, family type) and some geographic detail (e.g., province, urban/rural, and CMA).
The Canadian Internet Use Survey results are widely disseminated to a variety of users. All levels of government can use CIUS to shape policies and programmes related to the Internet (i.e., uptake and barriers, high speed access, Government on-line and other communication initiatives) and electronic commerce. Also, the Organization for Economic Cooperation and Development (OECD) uses the results for international benchmarking and comparison studies. The CIUS data support a wide range of research initiatives. In academia, micro data are made available to students and researchers within universities and colleges under the Data Liberation Initiative. The survey results are also used in the private sector for market research, as well as for consultation on regulatory issues related to the Internet. Finally, the results of the CIUS are widely quoted in the media, reflecting a high level of interest in the Internet and its users.
The CIUS replaces the Household Internet Use Survey (HIUS), conducted from 1997 to 2003, which focused on household Internet penetration. The new survey was redesigned to focus more on Internet use by individuals and to conform to international standards regarding Internet statistics.
Because CIUS collects information from the individual and HIUS was based on the household, it is not appropriate to directly compare results from 2005 with previous surveys.
This statistic gives information on the distribution of U.S. Hispanic internet users in 2015, by primary language. During the 2015 National Survey of Latinos conducted in November 2015, it was found that English was the dominant language for 31 percent of U.S. Hispanic internet users.
Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
The Index of Internet Connectivity was developed as part of a package of measures to help monitor the UK's use of the Internet and the growth of e-commerce.
Source agency: Office for National Statistics
Designation: National Statistics
Language: English
Alternative title: Internet connectivity
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The World Wide Web is a complex interconnected digital ecosystem, where information and attention flow between platforms and communities throughout the globe. These interactions co-construct how we understand the world, reflecting and shaping public discourse. Unfortunately, researchers often struggle to understand how information circulates and evolves across the web because platform-specific data is often siloed and restricted by linguistic barriers. To address this gap, we present a comprehensive, multilingual dataset capturing all Wikipedia links shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions use Wikipedia for deliberation and fact-checking which subsequently influences Wikipedia content, by driving traffic to articles or inspiring edits. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.
The motivations for this dataset stem from the challenges researchers face in studying the flow of information across the web. While the World Wide Web enables global communication and collaboration, data silos, linguistic barriers, and platform-specific restrictions hinder our ability to understand how information circulates, evolves, and impacts public discourse. Wikipedia and Reddit, as major hubs of knowledge sharing and discussion, offer an invaluable lens into these processes. However, without comprehensive data capturing their interactions, researchers are unable to fully examine how platforms co-construct knowledge. This dataset bridges this gap, providing the tools needed to study the interconnectedness of social media and collaborative knowledge systems.
WikiReddit is a comprehensive dataset capturing all Wikipedia mentions (including links) shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW (not safe for work) subreddits. The SQL database comprises 336K total posts, 10.2M comments, 1.95M unique links, and 1.26M unique articles spanning 59 languages on Reddit and 276 Wikipedia language subdomains. Each linked Wikipedia article is enriched with its revision history and page view data within a ±10-day window of its posting, as well as article ID, redirects, and Wikidata identifiers. Supplementary anonymous metadata from Reddit posts and comments further contextualizes the links, offering a robust resource for analysing cross-platform information flows, collective attention dynamics, and the role of Wikipedia in online discourse.
Data was collected from the Reddit4Researchers and Wikipedia APIs. No personally identifiable information is published in the dataset. Reddit data is linked to Wikipedia via the hyperlinks and article titles appearing in Reddit posts.
Extensive processing with tools such as regex was applied to the Reddit post/comment text to extract the Wikipedia URLs. Redirects for Wikipedia URLs and article titles were found through the API and mapped to the collected data. Reddit IDs are hashed with SHA-256 for post/comment/user/subreddit anonymity.
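The two preprocessing steps described above (URL extraction with regex and SHA-256 hashing of IDs) can be illustrated as follows. This is only a sketch under stated assumptions: the pattern `WIKI_URL_RE` and the function names are hypothetical, not the authors' actual pipeline, and the hash is applied to the raw ID string without a salt, which the description does not specify.

```python
import hashlib
import re

# Hypothetical pattern (an assumption, not the authors' regex): matches links such as
# https://en.wikipedia.org/wiki/Title, including mobile "m." subdomains.
WIKI_URL_RE = re.compile(
    r"https?://([a-z\-]+)\.(?:m\.)?wikipedia\.org/wiki/([^\s)\]>]+)",
    re.IGNORECASE,
)


def extract_wikipedia_links(text):
    """Return (language_subdomain, article_title) pairs found in the text."""
    return [(m.group(1).lower(), m.group(2)) for m in WIKI_URL_RE.finditer(text)]


def anonymize_id(raw_id):
    """Hash an identifier with SHA-256 and return the hex digest (no salt assumed)."""
    return hashlib.sha256(raw_id.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    sample = (
        "See https://en.wikipedia.org/wiki/Reddit and "
        "(https://de.m.wikipedia.org/wiki/Wikipedia) for background"
    )
    print(extract_wikipedia_links(sample))
    print(anonymize_id("example_post_id"))
```

A pattern this simple will miss shortened links and can pick up trailing punctuation, which is presumably part of why the authors describe the processing as extensive and resolve redirects through the API afterwards.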
We foresee several applications of this dataset and preview four here. First, Reddit linking data can be used to understand how attention is driven from one platform to another. Second, Reddit linking data can shed light on how Wikipedia's archive of knowledge is used in the larger social web. Third, our dataset could provide insights into how external attention is topically distributed across Wikipedia. Our dataset can help extend that analysis into the disparities in what types of external communities Wikipedia is used in, and how it is used. Fourth, relatedly, a topic analysis of our dataset could reveal how Wikipedia usage on Reddit contributes to societal benefits and harms. Our dataset could help examine if homogeneity within the Reddit and Wikipedia audiences shapes topic patterns and assess whether these relationships mitigate or amplify problematic engagement online.
The dataset is publicly shared with a Creative Commons Attribution 4.0 International license. The article describing this dataset should be cited: https://doi.org/10.48550/arXiv.2502.04942
Patrick Gildersleve will maintain this dataset, and add further years of content as and when available.
posts

| Column Name | Type | Description |
|---|---|---|
| subreddit_id | TEXT | The unique identifier for the subreddit. |
| crosspost_parent_id | TEXT | The ID of the original Reddit post if this post is a crosspost. |
| post_id | TEXT | Unique identifier for the Reddit post. |
| created_at | TIMESTAMP | The timestamp when the post was created. |
| updated_at | TIMESTAMP | The timestamp when the post was last updated. |
| language_code | TEXT | The language code of the post. |
| score | INTEGER | The score (upvotes minus downvotes) of the post. |
| upvote_ratio | REAL | The ratio of upvotes to total votes. |
| gildings | INTEGER | Number of awards (gildings) received by the post. |
| num_comments | INTEGER | Number of comments on the post. |

comments

| Column Name | Type | Description |
|---|---|---|
| subreddit_id | TEXT | The unique identifier for the subreddit. |
| post_id | TEXT | The ID of the Reddit post the comment belongs to. |
| parent_id | TEXT | The ID of the parent comment (if a reply). |
| comment_id | TEXT | Unique identifier for the comment. |
| created_at | TIMESTAMP | The timestamp when the comment was created. |
| last_modified_at | TIMESTAMP | The timestamp when the comment was last modified. |
| score | INTEGER | The score (upvotes minus downvotes) of the comment. |
| upvote_ratio | REAL | The ratio of upvotes to total votes for the comment. |
| gilded | INTEGER | Number of awards (gildings) received by the comment. |

postlinks

| Column Name | Type | Description |
|---|---|---|
| post_id | TEXT | Unique identifier for the Reddit post. |
| end_processed_valid | INTEGER | Whether the extracted URL from the post resolves to a valid URL. |
| end_processed_url | TEXT | The extracted URL from the Reddit post. |
| final_valid | INTEGER | Whether the final URL from the post resolves to a valid URL after redirections. |
| final_status | INTEGER | HTTP status code of the final URL. |
| final_url | TEXT | The final URL after redirections. |
| redirected | INTEGER | Indicator of whether the posted URL was redirected (1) or not (0). |
| in_title | INTEGER | Indicator of whether the link appears in the post title (1) or post body (0). |

commentlinks

| Column Name | Type | Description |
|---|---|---|
| comment_id | TEXT | Unique identifier for the Reddit comment. |
| end_processed_valid | INTEGER | Whether the extracted URL from the comment resolves to a valid URL. |
| end_processed_url | TEXT | The extracted URL from the comment. |
| final_valid | INTEGER | Whether the final URL from the comment resolves to a valid URL after redirections. |
| final_status | INTEGER | HTTP status code of the final
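
As a rough illustration of how the tables above relate, the sketch below joins `posts` to `postlinks` on `post_id` and counts valid links per post language. It uses an in-memory SQLite database with only a subset of the documented columns. The rows are placeholders, the short IDs stand in for the SHA-256 hashes, and treating `final_valid` as 1/0 is an assumption based on its INTEGER type.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory database with a subset of the posts/postlinks columns."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE posts (post_id TEXT PRIMARY KEY, subreddit_id TEXT,
                            language_code TEXT, score INTEGER, num_comments INTEGER);
        CREATE TABLE postlinks (post_id TEXT, final_url TEXT, final_valid INTEGER,
                                final_status INTEGER, redirected INTEGER,
                                in_title INTEGER);
        """
    )
    # Placeholder rows for illustration; real IDs in the dataset are hashed.
    conn.executemany(
        "INSERT INTO posts VALUES (?, ?, ?, ?, ?)",
        [("p1", "s1", "en", 10, 3), ("p2", "s1", "en", 5, 0), ("p3", "s2", "de", 7, 1)],
    )
    conn.executemany(
        "INSERT INTO postlinks VALUES (?, ?, ?, ?, ?, ?)",
        [
            ("p1", "https://en.wikipedia.org/wiki/Example", 1, 200, 0, 1),
            ("p1", "https://en.wikipedia.org/wiki/Sample", 1, 200, 1, 0),
            ("p2", "https://en.wikipedia.org/wiki/Missing", 0, 404, 0, 0),
            ("p3", "https://de.wikipedia.org/wiki/Beispiel", 1, 200, 0, 0),
        ],
    )
    return conn


def valid_links_by_language(conn):
    """Return (language_code, valid link count, links found in titles) per language."""
    return conn.execute(
        """
        SELECT p.language_code, COUNT(*), SUM(l.in_title)
        FROM posts AS p
        JOIN postlinks AS l ON l.post_id = p.post_id
        WHERE l.final_valid = 1
        GROUP BY p.language_code
        ORDER BY p.language_code
        """
    ).fetchall()


if __name__ == "__main__":
    print(valid_links_by_language(make_demo_db()))
```

The same join shape should apply to `comments` and `commentlinks` via `comment_id`, going by the column descriptions.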
In 2020, Francophone internet users in Canada spent an average of 3.06 hours per day online via PC, laptop or tablet devices. Anglophone internet users spent 3.92 hours per day. Overall, daily non-mobile internet usage declined by 0.3 percent for Francophones and 6.5 percent for Anglophones. In contrast, mobile usage in Canada increased significantly during the same period.
Internet users aged 15 years and above in each of the EU27 Member States
https://www.wiseguyreports.com/pages/privacy-policy
| BASE YEAR | 2024 |
| HISTORICAL DATA | 2019 - 2023 |
| REGIONS COVERED | North America, Europe, APAC, South America, MEA |
| REPORT COVERAGE | Revenue Forecast, Competitive Landscape, Growth Factors, and Trends |
| MARKET SIZE 2024 | 8.92 (USD Billion) |
| MARKET SIZE 2025 | 9.63 (USD Billion) |
| MARKET SIZE 2035 | 20.5 (USD Billion) |
| SEGMENTS COVERED | Delivery Mode, End User, Language Offered, Subscription Model, Regional |
| COUNTRIES COVERED | US, Canada, Germany, UK, France, Russia, Italy, Spain, Rest of Europe, China, India, Japan, South Korea, Malaysia, Thailand, Indonesia, Rest of APAC, Brazil, Mexico, Argentina, Rest of South America, GCC, South Africa, Rest of MEA |
| KEY MARKET DYNAMICS | growing demand for multilingualism, increased internet penetration, rising adoption of mobile learning, personalized learning experiences, competition from free resources |
| MARKET FORECAST UNITS | USD Billion |
| KEY COMPANIES PROFILED | BBC Languages, Preply, Duolingo, HelloTalk, Busuu, Memrise, Lingoda, Verbling, Cudoo, Open English, Rosetta Stone, Voxy, Babbel, Pimsleur, italki, Tandem |
| MARKET FORECAST PERIOD | 2025 - 2035 |
| KEY MARKET OPPORTUNITIES | Rise in mobile learning adoption, Increasing demand for personalized education, Expansion into underserved markets, Integration of AI technologies, Growth of corporate language training programs |
| COMPOUND ANNUAL GROWTH RATE (CAGR) | 7.9% (2025 - 2035) |
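
The growth rate in the table can be sanity-checked against the listed market sizes with the standard compound annual growth rate formula, (end / start)^(1 / years) - 1. The sketch below assumes the forecast runs from the 2025 value to the 2035 value over 10 years. The helper name `cagr` is illustrative. Because the table values are rounded, the result should land close to, but not necessarily exactly on, the reported 7.9%.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
    return (end_value / start_value) ** (1 / years) - 1


# Market sizes in USD billion, as listed in the table (2025 -> 2035, 10 years).
rate = cagr(9.63, 20.5, 10)

if __name__ == "__main__":
    print(f"Implied CAGR 2025-2035: {rate:.2%}")
```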
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The statistical operation Information Society Survey -ESI- Companies provides periodic information on the adoption of new information and communication technologies (ICT) in companies in the Basque Country. Specifically, it measures and describes the level of Internet use in the different establishments: Internet access systems, Internet activities, and the availability of a website and its main features. In addition, it measures the adoption of e-commerce purchases and sales in economic activity and the means through which they are made.