As of February 2025, English was the most popular language for web content, used by over 49.4 percent of websites. Spanish ranked second with six percent of web content, followed by German with 5.6 percent.
English as the leading online language
The United States and India, the countries with the most internet users after China, are also the world's biggest English-speaking markets. The combined internet user base of the two countries was over a billion individuals as of January 2023. This has led to most online information being created in English. Consequently, even those who are not native speakers may use it for convenience.
Global internet usage by regions
As of October 2024, the number of internet users worldwide was 5.52 billion. In the same period, Northern Europe and North America led in internet penetration worldwide, with around 97 percent of their populations accessing the internet.
According to a 2023 survey, 43 percent of internet users in urban India preferred using the Internet in English. Meanwhile, 57 percent of users accessed the internet in Indian languages, with Hindi being the most preferred language among them. Over 300 million internet users reside in the urban areas of India.
This statistic displays the number of Indian and English language internet users across India from 2011 to 2021. In 2016, the number of English internet users amounted to about 175 million and was projected to increase to 199 million in 2021. For Indian language users, this number was about 234 million users in 2016, and was projected to reach 536 million in 2021.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This Dataset, in 29 files of xlsx format, contains the data of all metrics and accumulated information as they are described in the methodology, results and discussion section of the research article "Exploring the Dominance of the English Language on the Websites of EU Countries".
Open Government Licence - Canada 2.0 https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
Canadian Internet use survey, Internet use, by language used to search for information, for Canada in 2005. (Terminated)
According to the source, 9,154 language errors were published each day on the internet in Poland in 2023. Over 38 percent of the mistakes were found on Facebook and 20.21 percent on Twitter.
https://www.fnfresearch.com/privacy-policy
The global online language learning market was valued at USD 14.2 billion in 2021 and is expected to reach USD 28.5 billion by 2028, growing at a CAGR of 18.8% during 2022-2028.
This statistic represents the share of internet adoption levels among non-English speakers across India in 2016, based on language. Tamil had the highest internet adoption levels during the measured period with about 42 percent, followed by Hindi and Kannada. Malayalam had the lowest in this list with about 27 percent.
This statistic shows the annual growth rate of the online language education market in China from 2012 to 2015, with estimates up until 2019. In 2015, the size of the online language education market in China increased by almost 30 percent compared to the previous year.
Open Government Licence 3.0 http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
The Index of Internet Connectivity was developed as part of a package of measures to help monitor the UK's use of the Internet and the growth of e-commerce.
Source agency: Office for National Statistics
Designation: National Statistics
Language: English
Alternative title: Internet connectivity
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The World Wide Web is a complex interconnected digital ecosystem, where information and attention flow between platforms and communities throughout the globe. These interactions co-construct how we understand the world, reflecting and shaping public discourse. Unfortunately, researchers often struggle to understand how information circulates and evolves across the web because platform-specific data is often siloed and restricted by linguistic barriers. To address this gap, we present a comprehensive, multilingual dataset capturing all Wikipedia links shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions use Wikipedia for deliberation and fact-checking which subsequently influences Wikipedia content, by driving traffic to articles or inspiring edits. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.
The motivations for this dataset stem from the challenges researchers face in studying the flow of information across the web. While the World Wide Web enables global communication and collaboration, data silos, linguistic barriers, and platform-specific restrictions hinder our ability to understand how information circulates, evolves, and impacts public discourse. Wikipedia and Reddit, as major hubs of knowledge sharing and discussion, offer an invaluable lens into these processes. However, without comprehensive data capturing their interactions, researchers are unable to fully examine how platforms co-construct knowledge. This dataset bridges this gap, providing the tools needed to study the interconnectedness of social media and collaborative knowledge systems.
WikiReddit is a comprehensive dataset capturing all Wikipedia mentions (including links) shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW (not safe for work) subreddits. The SQL database comprises 336K total posts, 10.2M comments, 1.95M unique links, and 1.26M unique articles spanning 59 languages on Reddit and 276 Wikipedia language subdomains. Each linked Wikipedia article is enriched with its revision history and page view data within a ±10-day window of its posting, as well as article ID, redirects, and Wikidata identifiers. Supplementary anonymous metadata from Reddit posts and comments further contextualizes the links, offering a robust resource for analysing cross-platform information flows, collective attention dynamics, and the role of Wikipedia in online discourse.
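The ±10-day enrichment window mentioned above can be illustrated with a small sketch. It assumes daily page view counts keyed by date and treats both ends of the window as inclusive; the function names and the inclusive bounds are assumptions for illustration, not details confirmed by the dataset description.

```python
from datetime import date, timedelta


def window(posted_on: date, days: int = 10) -> tuple[date, date]:
    """Return the (start, end) dates of a +/- `days` window around a posting date."""
    delta = timedelta(days=days)
    return posted_on - delta, posted_on + delta


def views_in_window(daily_views: dict[date, int], posted_on: date, days: int = 10) -> dict[date, int]:
    """Keep only the daily view counts that fall inside the window (bounds inclusive, by assumption)."""
    start, end = window(posted_on, days)
    return {d: v for d, v in daily_views.items() if start <= d <= end}


# Hypothetical page view counts for one article.
views = {date(2021, 3, 1): 40, date(2021, 3, 15): 900, date(2021, 3, 25): 75, date(2021, 4, 2): 38}
selected = views_in_window(views, posted_on=date(2021, 3, 15))
```

With a posting date of 15 March 2021, the sketch keeps the entries dated 15 and 25 March and drops the two that fall outside the window.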
Data was collected from the Reddit4Researchers and Wikipedia APIs. No personally identifiable information is published in the dataset. Reddit data is linked to Wikipedia data via the hyperlinks and article titles appearing in Reddit posts.
Extensive processing with tools such as regex was applied to the Reddit post/comment text to extract the Wikipedia URLs. Redirects for Wikipedia URLs and article titles were found through the API and mapped to the collected data. Reddit IDs are hashed with SHA-256 for post/comment/user/subreddit anonymity.
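A minimal sketch of the two steps described above, URL extraction with a regular expression and SHA-256 hashing of identifiers, might look as follows. The pattern, the function names, and the sample ID are hypothetical; the description does not give the actual expressions used, nor whether a salt is added before hashing.

```python
import hashlib
import re

# Hypothetical pattern: matches links to any Wikipedia language subdomain,
# including mobile hosts such as "en.m.wikipedia.org".
WIKI_URL = re.compile(
    r"https?://([a-z\-]+)(?:\.m)?\.wikipedia\.org/wiki/([^\s)\]>\"']+)",
    re.IGNORECASE,
)


def extract_wikipedia_links(text: str) -> list[tuple[str, str]]:
    """Return (language_subdomain, article_title) pairs found in the text."""
    return [(m.group(1).lower(), m.group(2)) for m in WIKI_URL.finditer(text)]


def anonymize_id(raw_id: str) -> str:
    """Hash an identifier with SHA-256 and return the hex digest (no salt, by assumption)."""
    return hashlib.sha256(raw_id.encode("utf-8")).hexdigest()


comment = "See https://en.wikipedia.org/wiki/Reddit and (https://de.m.wikipedia.org/wiki/Wikipedia) for background"
links = extract_wikipedia_links(comment)
hashed = anonymize_id("t1_abc123")  # made-up comment ID
```

A real pipeline would also need to handle trailing punctuation, URL-encoded titles, and shortened links, which this sketch leaves out.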
We foresee several applications of this dataset and preview four here. First, Reddit linking data can be used to understand how attention is driven from one platform to another. Second, Reddit linking data can shed light on how Wikipedia's archive of knowledge is used in the larger social web. Third, our dataset could provide insights into how external attention is topically distributed across Wikipedia. It can also help examine disparities in which types of external communities use Wikipedia, and how they use it. Fourth, relatedly, a topic analysis of our dataset could reveal how Wikipedia usage on Reddit contributes to societal benefits and harms. Our dataset could help examine whether homogeneity within the Reddit and Wikipedia audiences shapes topic patterns and assess whether these relationships mitigate or amplify problematic engagement online.
The dataset is publicly shared with a Creative Commons Attribution 4.0 International license. The article describing this dataset should be cited: https://doi.org/10.48550/arXiv.2502.04942
Patrick Gildersleve will maintain this dataset, and add further years of content as and when available.
posts
Column Name | Type | Description |
---|---|---|
subreddit_id | TEXT | The unique identifier for the subreddit. |
crosspost_parent_id | TEXT | The ID of the original Reddit post if this post is a crosspost. |
post_id | TEXT | Unique identifier for the Reddit post. |
created_at | TIMESTAMP | The timestamp when the post was created. |
updated_at | TIMESTAMP | The timestamp when the post was last updated. |
language_code | TEXT | The language code of the post. |
score | INTEGER | The score (upvotes minus downvotes) of the post. |
upvote_ratio | REAL | The ratio of upvotes to total votes. |
gildings | INTEGER | Number of awards (gildings) received by the post. |
num_comments | INTEGER | Number of comments on the post. |
comments
Column Name | Type | Description |
---|---|---|
subreddit_id | TEXT | The unique identifier for the subreddit. |
post_id | TEXT | The ID of the Reddit post the comment belongs to. |
parent_id | TEXT | The ID of the parent comment (if a reply). |
comment_id | TEXT | Unique identifier for the comment. |
created_at | TIMESTAMP | The timestamp when the comment was created. |
last_modified_at | TIMESTAMP | The timestamp when the comment was last modified. |
score | INTEGER | The score (upvotes minus downvotes) of the comment. |
upvote_ratio | REAL | The ratio of upvotes to total votes for the comment. |
gilded | INTEGER | Number of awards (gildings) received by the comment. |
postlinks
Column Name | Type | Description |
---|---|---|
post_id | TEXT | Unique identifier for the Reddit post. |
end_processed_valid | INTEGER | Whether the extracted URL from the post resolves to a valid URL. |
end_processed_url | TEXT | The extracted URL from the Reddit post. |
final_valid | INTEGER | Whether the final URL from the post resolves to a valid URL after redirections. |
final_status | INTEGER | HTTP status code of the final URL. |
final_url | TEXT | The final URL after redirections. |
redirected | INTEGER | Indicator of whether the posted URL was redirected (1) or not (0). |
in_title | INTEGER | Indicator of whether the link appears in the post title (1) or post body (0). |
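As a rough illustration of how the posts and postlinks tables described above might be queried together, the sketch below builds a tiny in-memory SQLite stand-in with a subset of the listed columns and counts valid links per post language. The database engine, the sample rows, and the query itself are assumptions for illustration; only the table and column names come from the schema above.

```python
import sqlite3

# In-memory stand-in for the real database; only a subset of columns is created.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (post_id TEXT, subreddit_id TEXT, language_code TEXT, score INTEGER);
CREATE TABLE postlinks (post_id TEXT, final_url TEXT, final_valid INTEGER, in_title INTEGER);
""")

# Made-up rows, for illustration only.
conn.executemany("INSERT INTO posts VALUES (?, ?, ?, ?)", [
    ("p1", "s1", "en", 120),
    ("p2", "s1", "de", 15),
])
conn.executemany("INSERT INTO postlinks VALUES (?, ?, ?, ?)", [
    ("p1", "https://en.wikipedia.org/wiki/Reddit", 1, 0),
    ("p2", "https://de.wikipedia.org/wiki/Wikipedia", 1, 1),
    ("p2", "https://de.wikipedia.org/wiki/Missing", 0, 0),
])

# Count valid links and average the post score per post language.
rows = conn.execute("""
    SELECT p.language_code, COUNT(*) AS n_links, AVG(p.score) AS mean_score
    FROM posts AS p
    JOIN postlinks AS l ON l.post_id = p.post_id
    WHERE l.final_valid = 1
    GROUP BY p.language_code
    ORDER BY p.language_code
""").fetchall()
```

Filtering on final_valid keeps only links that resolved after redirection, which is usually the safer basis for counting articles.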
commentlinks
Column Name | Type | Description |
---|---|---|
comment_id | TEXT | Unique identifier for the Reddit comment. |
end_processed_valid | INTEGER | Whether the extracted URL from the comment resolves to a valid URL. |
end_processed_url | TEXT | The extracted URL from the comment. |
final_valid | INTEGER | Whether the final URL from the comment resolves to a valid URL after redirections. |
final_status | INTEGER | HTTP status code of the final URL. |
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Number of households by type of internet connection by Limistéir Pleanála Teanga (Census 2022, Theme 15, Table 2). Census 2022 Table 15.2 gives the number of households by type of internet connection. Attributes include a breakdown of households by access to the internet. Census 2022 Theme 15 is Motor Car Availability and Internet Access.
Gaeltacht Language Planning Area Boundaries (Teorainneacha na Limistéar Pleanála Teanga Gaeltachta). In line with the provisions of the Gaeltacht Act 2012, the Minister for Arts, Heritage and the Gaeltacht has identified 26 Gaeltacht Language Planning Areas. Under the Act, the existing Gaeltacht will be redesignated as Gaeltacht Language Planning Areas provided that language plans are agreed by the communities in the various areas in accordance with the language planning criteria prescribed under the Act. Údarás na Gaeltachta is responsible under the Act for supporting organisations with regard to the preparation and implementation of the language plans in the Gaeltacht Language Planning Areas.
Coordinate reference system: Irish Transverse Mercator (EPSG 2157). These boundaries are based on 20m generalised boundaries sourced from the Tailte Éireann Open Data Portal. This dataset is provided by Tailte Éireann, Limistéir Pleanála Teanga 2015.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The statistical operation Information Society Survey-ESI-Familias, provides periodic information on the implementation of the new Information and Communication Technologies -ICT- in the population of the Basque Country. In particular, it computes and describes the ICT equipment of the population both in the home and in the study center or in the workplace, and measures the level of use that is made of them, especially those related to the Internet. It allows us to compare the level of implementation of these ICT technologies in Basque society in relation to other countries in its environment.
In 2025, there were around 1.53 billion people worldwide who spoke English either natively or as a second language, slightly more than the 1.18 billion Mandarin Chinese speakers at the time of survey. Hindi and Spanish were the third and fourth most widespread languages that year.
Languages in the United States
The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken there vary as a result of the multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which more than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese), 1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year.
Different languages at home
The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage is California, where about 45 percent of the population spoke a language other than English at home in 2023.
The statistic shows distribution of programming languages used by Internet of Things developers, according to a survey conducted in 2016. At that time, 31.5 percent of respondents indicated that they were using Node.js when developing Internet of Things solutions.
Summary: This dataset shows statistics on the use of Irish in Irish-Language Networks from the 2011 and 2016 censuses. The Irish Language Networks are defined by Settlement or Electoral Divisional boundaries. This dataset is published online through the Language Planning Viewer run by the Department of Culture, Heritage and the Gaeltacht: http://arcg.is/2nkqdMb Abstract: The dataset presents statistics from the 2011 and 2016 censuses relating to the use of Irish language for the Irish Language Networks. The Irish Language Networks are defined as settlement or Electoral Division boundaries. This dataset is published online through the Language Planning Viewer application run by the Department of Culture, Heritage and the Gaeltacht: http://arcg.is/2nkqdMb
The online language learning market is expected to grow at a CAGR of 20% during the forecast period. This market growth can be attributed to various factors including increasing enrollment of foreign students.
The online language learning market report offers several other valuable insights such as:
CAGR of the market during the forecast period 2020-2024
Detailed information on factors that will drive online language learning market growth during the next five years
Precise estimation of the online language learning market size and its contribution to the parent market
Accurate predictions on upcoming trends and changes in consumer behavior
The growth of the online language learning market industry across APAC, Europe, North America, South America, and MEA
A thorough analysis of the market’s competitive landscape and detailed information on vendors
Comprehensive details of factors that will challenge the growth of online language learning market vendors
WSDL is an XML-based language defined by the W3C for describing a service that is offered over a network and based on web technologies, that is, a Web Service. (31.08.2011)
As of October 2024, an estimated 33.48 percent of Steam gaming platform users worldwide used Simplified Chinese as their main language. English was the second-most common language, selected by 32.68 percent of users.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Bosnian web corpus CLASSLA-web.bs 1.0 is based on the MaCoCu-bs 1.0 web corpus crawl (http://hdl.handle.net/11356/1808), which was additionally cleaned and enriched with linguistic and genre information. The CLASSLA-web.bs corpus is a part of the South Slavic CLASSLA-web corpus collection, which is the first collection of comparable corpora that encompasses the entire South Slavic language group.
The MaCoCu-bs 1.0 crawl was built by crawling the ".ba" internet top-level domain in 2021 and 2022, as well as extending the crawl dynamically to other domains. During the development of CLASSLA-web corpora, the MaCoCu web crawls were cleaned by removing paragraphs that are not in the target language, and by removing very short texts (fewer than 75 words or consisting only of paragraphs shorter than 70 characters). The corpus was also linguistically annotated with the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla). The linguistic processing involved tokenization, morphosyntactic annotation, and lemmatization. Additionally, the corpus was automatically annotated with genres using the Transformer-based X-GENRE classifier (https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier). The following genre categories are used: News, Information/Explanation, Promotion, Opinion/Argumentation, Instruction, Legal, Prose/Lyrical, Forum, Other and Mix.
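The short-text rule described above (drop texts with fewer than 75 words, or texts made up only of paragraphs shorter than 70 characters) can be sketched as a simple predicate. The function name is hypothetical, and counting words by whitespace splitting is an assumption; the corpus description does not say how words were counted.

```python
def keep_text(paragraphs: list[str], min_words: int = 75, min_par_chars: int = 70) -> bool:
    """Return True if a text passes the length filter, False if it is 'very short'."""
    # Rule 1: the whole text must have at least `min_words` words
    # (whitespace-separated tokens, by assumption).
    total_words = sum(len(p.split()) for p in paragraphs)
    if total_words < min_words:
        return False
    # Rule 2: at least one paragraph must reach `min_par_chars` characters.
    if all(len(p) < min_par_chars for p in paragraphs):
        return False
    return True


long_text = [" ".join(["word"] * 80)]          # 80 words in one long paragraph
fragmented = ["a b c d e f g h"] * 10          # 80 words, but every paragraph is short
tiny = ["only a handful of words here"]        # far fewer than 75 words

results = [keep_text(long_text), keep_text(fragmented), keep_text(tiny)]
```

The second sample shows why both conditions are needed: it passes the word count but consists entirely of short paragraphs, so it is rejected.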
The corpus is available in vertical format, as used by Sketch Engine and CWB concordancers. Information is provided on the text, paragraph, sentence and token level. Each text is accompanied by the following metadata: text id, title, url, domain, top-level domain (tld, e.g., "com"), and predicted genre category. Each text is divided into paragraphs that are accompanied by the following metadata: paragraph id, the automatically identified language of the text in the paragraph, and paragraph quality. For quality, labels such as "short" or "good" are assigned based on paragraph length, URL and stopword density via the jusText tool (https://corpus.tools/wiki/Justext). Paragraphs are further divided into sentences, which carry their sentence id as metadata. Inside sentences, tokens are provided in tabular format with their linguistic annotation. Details about the structural and positional attributes are also given in the accompanying registry file which was used to install the corpus on the CLARIN.SI concordancers.
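To give a feel for the vertical format described above, here is a minimal reading sketch. It assumes one structural tag or one token per line, tab-separated token columns, and that literal "<" characters in tokens are escaped; the attribute names and the three token columns in the sample are illustrative guesses, since the actual attributes are defined in the registry file.

```python
import re

# Matches name="value" pairs inside a structural tag.
ATTR = re.compile(r'(\w+)="([^"]*)"')


def parse_vertical(lines):
    """Yield (text_metadata, sentences) per <text> element.

    Each sentence is a list of token tuples, one tuple per token line.
    """
    meta, sentences, current = None, [], None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("<text"):
            meta, sentences = dict(ATTR.findall(line)), []
        elif line.startswith("</text"):
            yield meta, sentences
            meta = None
        elif line.startswith("<s"):
            current = []            # a new sentence begins
        elif line.startswith("</s"):
            sentences.append(current)
            current = None
        elif line.startswith("<"):
            continue                # other structural tags, e.g. <p> or </p>
        elif current is not None and line:
            current.append(tuple(line.split("\t")))


# Made-up sample; attribute names and token columns are hypothetical.
sample = [
    '<text id="doc1" domain="example.ba" genre="News">',
    '<p id="doc1.1" quality="good">',
    '<s id="doc1.1.1">',
    'Ovo\tovaj\tDET',
    'je\tbiti\tAUX',
    'test\ttest\tNOUN',
    '</s>',
    '</p>',
    '</text>',
]
docs = list(parse_vertical(sample))
```

Paragraph metadata is skipped here for brevity; a fuller reader would keep the attributes of each paragraph tag alongside its sentences.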
Notice and take down: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please: (1) Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted. (2) Clearly identify the copyrighted work claimed to be infringed. (3) Clearly identify the material that is claimed to be infringing and information reasonably sufficient in order to allow us to locate the material. (4) Please write to the contact person for this resource whose email is available in the full item record. We will comply with legitimate requests by removing the affected sources from the next release of the corpus.