89 datasets found
  1. Common languages used for web content 2025, by share of websites

    • statista.com
    • ai-chatbox.pro
    Updated Feb 11, 2025
    Cite
    Statista (2025). Common languages used for web content 2025, by share of websites [Dataset]. https://www.statista.com/statistics/262946/most-common-languages-on-the-internet/
    Dataset updated
    Feb 11, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Feb 2025
    Area covered
    Worldwide
    Description

    As of February 2025, English was the most popular language for web content, with over 49.4 percent of websites using it. Spanish ranked second, with six percent of web content, followed by German, with 5.6 percent.

    English as the leading online language

    The United States and India, the countries with the most internet users after China, are also the world's biggest English-speaking markets. The internet user base in both countries combined, as of January 2023, was over a billion individuals. This has led to most online information being created in English. Consequently, even those who are not native speakers may use it for convenience.

    Global internet usage by regions

    As of October 2024, the number of internet users worldwide was 5.52 billion. In the same period, Northern Europe and North America led in internet penetration rates worldwide, with around 97 percent of their populations accessing the internet.

  2. Top programming languages used for Internet of Things projects 2016

    • statista.com
    Updated Apr 14, 2016
    Cite
    Statista (2016). Top programming languages used for Internet of Things projects 2016 [Dataset]. https://www.statista.com/statistics/658792/worldwide-internet-of-things-survey-programming-languages-used/
    Dataset updated
    Apr 14, 2016
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Feb 11, 2016 - Mar 25, 2016
    Area covered
    Worldwide
    Description

    The statistic shows distribution of programming languages used by Internet of Things developers, according to a survey conducted in 2016. At that time, 31.5 percent of respondents indicated that they were using Node.js when developing Internet of Things solutions.

  3. Internet use, by language used to search for information

    • open.canada.ca
    • www150.statcan.gc.ca
    • + 1 more
    csv, html, xml
    Updated Jan 17, 2023
    + more versions
    Cite
    Statistics Canada (2023). Internet use, by language used to search for information [Dataset]. https://open.canada.ca/data/en/dataset/e2617831-7e2d-4da5-919f-47311eea3349
    Available download formats: html, xml, csv
    Dataset updated
    Jan 17, 2023
    Dataset provided by
    Statistics Canada
    License

    Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
    License information was derived automatically

    Description

    Canadian Internet use survey, Internet use, by language used to search for information, for Canada in 2005. (Terminated)

  4. Preferred language to access the internet India 2023

    • statista.com
    Cite
    Statista, Preferred language to access the internet India 2023 [Dataset]. https://www.statista.com/statistics/1459294/india-internet-access-by-language/
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    India
    Description

    According to a 2023 survey, ** percent of internet users in urban India preferred using the internet in English. Meanwhile, ** percent of users accessed the internet in Indian languages, with Hindi being the most preferred language among them. Over *** million internet users reside in the urban areas of India.

  5. Most used programming languages among developers worldwide 2024

    • statista.com
    Updated Feb 6, 2025
    Cite
    Statista (2025). Most used programming languages among developers worldwide 2024 [Dataset]. https://www.statista.com/statistics/793628/worldwide-developer-survey-most-used-languages/
    Dataset updated
    Feb 6, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    May 19, 2024 - Jun 20, 2024
    Area covered
    Worldwide
    Description

    As of 2024, JavaScript and HTML/CSS were the most commonly used programming languages among software developers around the world, with more than 62 percent of respondents stating that they used JavaScript and around 53 percent using HTML/CSS. Python, SQL, and TypeScript rounded out the top five most widely used programming languages around the world.

    Programming languages

    At a very basic level, programming languages serve as sets of instructions that direct computers on how to behave and carry out tasks. Thanks to the increased prevalence of, and reliance on, computers and electronic devices in today’s society, these languages play a crucial role in the everyday lives of people around the world. An increasing number of people are interested in furthering their understanding of these tools through courses and bootcamps, while current developers are constantly seeking new languages and resources to add to their skills. Furthermore, programming knowledge is becoming an important skill within various industries throughout the business world. Job seekers with skills in Python, R, and SQL will find these to be among the most highly desirable data science skills, which is likely to assist in their search for employment.

  6. Data_Sheet_1_The method behind the unprecedented production of indicators of...

    • frontiersin.figshare.com
    docx
    Updated Jun 2, 2023
    Cite
    Daniel Pimienta; Álvaro Blanco; Gilvan Müller de Oliveira (2023). Data_Sheet_1_The method behind the unprecedented production of indicators of the presence of languages in the Internet.docx [Dataset]. http://doi.org/10.3389/frma.2023.1149347.s001
    Available download formats: docx
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Frontiers
    Authors
    Daniel Pimienta; Álvaro Blanco; Gilvan Müller de Oliveira
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Reliable and up-to-date indicators of the presence of languages on the Internet are required to drive language policies efficiently, to forecast e-commerce markets, or to support further research in the field of digital support of languages. This article presents a complete description of the methodological elements involved in the production of an unprecedented set of indicators of the presence on the Internet of the 329 languages with more than 1 million L1 speakers. Special emphasis is given to the treatment of the comprehensive set of biases involved in the process, arising either from the method or from the various sources used in the modeling process. The biases related to other sources providing similar data are also discussed; in particular, it is shown how failing to consider the high level of multilingualism of the Web leads to a huge overestimation of the presence of English. The detailed list of sources is presented in the various annexes. For the first time in the history of the Internet, the production of indicators about the virtual presence of a large set of languages could allow progress in the fields of economy of languages, cyber-geography of languages, and language policies for multilingualism.

  7. Use of English language to access media in MENA 2015, by country

    • statista.com
    Updated Apr 15, 2015
    Cite
    Statista (2015). Use of English language to access media in MENA 2015, by country [Dataset]. https://www.statista.com/statistics/603588/mena-english-use-media-access/
    Dataset updated
    Apr 15, 2015
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Feb 3, 2015 - Mar 9, 2015
    Area covered
    Africa, MENA
    Description

    This statistic shows the results of a survey conducted in the Middle East and North Africa region in 2015 on the percentage of people who use the English language to access different types of media. According to the survey, ** percent of respondents in Egypt used English to access the internet, as opposed to ***** percent who used English to access TV.

  8. Population aged 15 and over of the Basque Country who are Internet users by...

    • euskadi.eus
    csv, xlsx
    Updated Oct 30, 2023
    + more versions
    Cite
    (2023). Population aged 15 and over of the Basque Country who are Internet users by place of access and languages used, according to Province (%). [Dataset]. https://www.euskadi.eus/population-aged-15-and-over-of-the-basque-country-who-are-internet-users-by-place-of-access-and-languages-used-according-to-province/web01-ejeduki/en/
    Available download formats: csv (0.64), xlsx (16.96)
    Dataset updated
    Oct 30, 2023
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Basque Country
    Description

    The statistical operation Survey on the Information Society-ESI-Families provides regular information on the implementation of New Information and Communication Technology -ICT- among the population of the Basque Country. Specifically, it records and describes the ICT equipment of the population in the home, the place of study, and the workplace, and measures the level of use made of it, especially as related to the Internet. It allows the level of implementation of these technologies in Basque society to be compared with that of other surrounding communities.

  9. The most spoken languages worldwide 2025

    • statista.com
    Updated Apr 14, 2025
    Cite
    Statista (2025). The most spoken languages worldwide 2025 [Dataset]. https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/
    Dataset updated
    Apr 14, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2025
    Area covered
    World
    Description

    In 2025, there were around 1.53 billion people worldwide who spoke English either natively or as a second language, slightly more than the 1.18 billion Mandarin Chinese speakers at the time of survey. Hindi and Spanish accounted for the third and fourth most widespread languages that year.

    Languages in the United States

    The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken there vary as a result of the multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which more than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese), 1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year.

    Different languages at home

    The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage of its population speaking a language other than English is California, where about 45 percent of the population spoke a language other than English at home in 2023.

  10. English Word Frequency

    • kaggle.com
    Updated Sep 6, 2017
    Cite
    Rachael Tatman (2017). English Word Frequency [Dataset]. https://www.kaggle.com/datasets/rtatman/english-word-frequency/discussion?sortBy=hot
    Available format: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Sep 6, 2017
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rachael Tatman
    Description

    Context:

    How frequently a word occurs in a language is an important piece of information for natural language processing and for linguists. In natural language processing, very frequent words tend to be less informative than less frequent ones and are often removed during preprocessing. Human language users are also sensitive to word frequency: how often a word is used affects language processing in humans. For example, very frequent words are read and understood more quickly and can be understood more easily in background noise.

    Content:

    This dataset contains the counts of the 333,333 most commonly-used single words on the English language web, as derived from the Google Web Trillion Word Corpus.
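
    The preprocessing step mentioned under Context (dropping very frequent words) can be sketched as follows. This is only an illustration, not the dataset author's code: the function name `drop_most_frequent` and the toy counts are invented here, and a real run would load the word counts from the dataset file instead.

```python
from collections import Counter


def drop_most_frequent(tokens, counts, top_k):
    """Remove tokens that are among the top_k most frequent words in counts.

    tokens: list of words to filter
    counts: mapping of lowercase word -> corpus frequency
    top_k:  how many of the most frequent words to treat as uninformative
    """
    most_frequent = {word for word, _ in Counter(counts).most_common(top_k)}
    return [t for t in tokens if t.lower() not in most_frequent]


# Toy counts invented for illustration; they are NOT taken from the dataset.
toy_counts = {"the": 1000, "of": 600, "and": 500, "corpus": 3, "frequency": 2}

filtered = drop_most_frequent("The frequency of the corpus".split(), toy_counts, top_k=3)
print(filtered)
```

    How many words to drop (`top_k`) is a judgment call that depends on the task; a frequency threshold would be an equally reasonable alternative.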

    Acknowledgements:

    Data files were derived from the Google Web Trillion Word Corpus (as described by Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium) by Peter Norvig. You can find more information on these files and the code used to generate them here.

    The code used to generate this dataset is distributed under the MIT License.

    Inspiration:

    • Can you tag the part of speech of these words? Which parts of speech are most frequent? Is this similar to other languages, like Japanese?
    • What differences are there between the very frequent words in this dataset and the frequent words in other corpora, such as the Brown Corpus or the TIMIT corpus? What might these differences tell us about how language is used?

  11. Internet use, by language used to search for information - Catalogue -...

    • data.urbandatacentre.ca
    • beta.data.urbandatacentre.ca
    Updated Oct 1, 2024
    + more versions
    Cite
    (2024). Internet use, by language used to search for information - Catalogue - Canadian Urban Data Catalogue (CUDC) [Dataset]. https://data.urbandatacentre.ca/dataset/gov-canada-e2617831-7e2d-4da5-919f-47311eea3349
    Dataset updated
    Oct 1, 2024
    License

    Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
    License information was derived automatically

    Area covered
    Canada
    Description

    Canadian Internet use survey, Internet use, by language used to search for information, for Canada in 2005. (Terminated)

  12. Flash Eurobarometer 313: User language preferences online

    • data.europa.eu
    zip
    Updated Jan 19, 2015
    Cite
    Directorate-General for Communication (2015). Flash Eurobarometer 313: User language preferences online [Dataset]. https://data.europa.eu/data/datasets/s880_313?locale=lv
    Available download formats: zip
    Dataset updated
    Jan 19, 2015
    Dataset authored and provided by
    Directorate-General for Communication
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This Flash Eurobarometer studied how Europeans use different languages online. While 90% of European internet users prefer to surf the internet in their own language, 55% at least occasionally use a language other than their own when online according to a pan-EU Eurobarometer survey released today. However, 44% feel they are missing interesting information because web pages are not in a language that they understand.

    The results by volumes are distributed as follows:
    • Volume A: Countries
    • Volume AA: Groups of countries
    • Volume A' (AP): Trends
    • Volume AA' (AAP): Trends of groups of countries
    • Volume B: EU/socio-demographics
    • Volume B' (BP) : Trends of EU/ socio-demographics
    • Volume C: Country/socio-demographics

    Researchers may also contact GESIS - Leibniz Institute for the Social Sciences: https://www.gesis.org/eurobarometer
  13. Data from: Exploring the Dominance of the English Language on the Websites...

    • zenodo.org
    • data.niaid.nih.gov
    bin, xls
    Updated Mar 5, 2020
    Cite
    Giannakoulopoulos Andreas; Pergantis Minas; Konstantinou Nikos; Lamprogeorgos Aristeidis; Limniati Laida; Varlamis Iraklis (2020). Exploring the Dominance of the English Language on the Websites of EU Countries [Dataset]. http://doi.org/10.5281/zenodo.3698008
    Available download formats: xls, bin
    Dataset updated
    Mar 5, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Giannakoulopoulos Andreas; Pergantis Minas; Konstantinou Nikos; Lamprogeorgos Aristeidis; Limniati Laida; Varlamis Iraklis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    European Union
    Description

    This dataset, comprising 29 files in xlsx format, contains the data for all metrics and the accumulated information described in the methodology, results, and discussion sections of the research article "Exploring the Dominance of the English Language on the Websites of EU Countries".

  14. Health communication and the internet: An analysis of adolescent language...

    • datacatalogue.cessda.eu
    Updated May 28, 2025
    Cite
    Adolphs, S (2025). Health communication and the internet: An analysis of adolescent language use on the teenage health freak website [Dataset]. http://doi.org/10.5255/UKDA-SN-850565
    Dataset updated
    May 28, 2025
    Dataset provided by
    University of Nottingham
    Authors
    Adolphs, S
    Time period covered
    Jan 1, 2010 - Dec 31, 2010
    Area covered
    United Kingdom
    Variables measured
    Text unit
    Measurement technique
    Not Applicable
    Description

    This study explores the integration of corpus linguistic and sociolinguistic approaches for the analysis of a unique 4-million word longitudinal corpus of messages posted to the 'Teenage Health Freak' website. The website, run by UK-based GPs, is designed to be interactive, confidential and evidence-based providing adolescents with accessible advice and information pertaining to a broad range of health issues. The descriptive advantages afforded by the tools of corpus linguistics will be used to inform sociolinguistic observations of adolescent language innovation and change on the specific topic of health care. Key words and key phrases used by adolescent advice-seekers, with associated meanings and patterns of use over a period of ten years, will be extracted from the corpus and then analysed to highlight emergent trends in adolescent sociolinguistic style and register. As well as the academic value of this combined methodological approach, the findings of the analysis will be made available to health care providers and users of health care services in the form of a practical, encyclopaedic resource, thus contributing to the continuous professional development of user groups in the NHS, as well as being a resource for parents, teachers and adolescents themselves.

  15. Languages available on the web in establishments with a website in the...

    • euskadi.eus
    csv, xlsx
    Updated Jul 12, 2024
    + more versions
    Cite
    (2024). Languages available on the web in establishments with a website in the Basque Country according to province, activity branch (A38) and employment strata (%). [Dataset]. https://www.euskadi.eus/languages-available-on-the-web-in-establishments-with-a-website-in-the-basque-country-according-to-province-activity-branch-a38-and-employment-strata/web01-ejeduki/en/
    Available download formats: xlsx (21.23), csv (3.51)
    Dataset updated
    Jul 12, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Basque Country
    Description

    The statistical operation Survey on the Information Society-ESI- Companies, provides regular information on the implementation of New Information and Communication Technology -ICT- in the companies of the Basque Country. Specifically, it records and describes the level of use of the Internet in the different establishments: the systems of Internet access, activities carried out via the Internet, as well as the availability of the website and its main characteristics. It also measures the implementation of E-commerce purchases and sales in economic activity and the means used to carry it out.

  16. Opinions on the State of the Finnish Language 2013

    • services.fsd.tuni.fi
    • datacatalogue.cessda.eu
    zip
    Updated Jan 9, 2025
    Cite
    Korhonen, Riitta; Lappalainen, Hanna (2025). Opinions on the State of the Finnish Language 2013 [Dataset]. http://doi.org/10.60686/t-fsd2979
    Available download formats: zip
    Dataset updated
    Jan 9, 2025
    Dataset provided by
    Finnish Social Science Data Archive
    Authors
    Korhonen, Riitta; Lappalainen, Hanna
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Finland
    Description

    The survey charted citizen views on written and spoken Finnish. Themes studied included, for instance, language change, the relationship between written and spoken language, loanwords from English, and language practices in the public sphere. Opinions on the use of Finnish and its change were charted. The respondents were asked how important they regarded the well-being of the Finnish language, whether they thought everyday language written by people was careless, whether the language used in factual texts (e.g. reports, news) had become too colloquial, and whether they thought that one can use the kind of language one wants to on the Internet (for example, on Facebook). Reasons for being or not being worried of language change were examined (e.g. "English grammar affects Finnish in a negative manner", "Language change does not mean a deterioration of language"). A number of statements relating to spoken language and language used when greeting and addressing people were presented to respondents who were asked to indicate the extent to which they agreed with them (e.g. "Guests of current affairs programmes on TV and radio may use colloquial language", "I find it irritating that telemarketers often address their customers by first name"). Opinions on the use and acceptability of English loanwords in Finnish were surveyed (e.g. "They bother me because I don't always understand what they mean", "They facilitate communication between people who speak different languages"). The respondents were also asked whether some words or phrases annoyed them. Regarding written language, opinions were charted on the fact that there are multiple acceptable/'correct' forms in written Finnish as well as who or which institution should be responsible for ensuring that factual texts are written according to language guidelines. Finally, the respondents were asked which issues relating to grammar, structure, and style they occasionally had to consider when writing factual texts. 

    Background variables included the respondent's age group, gender, education, region (NUTS 3) of residence, mother tongue, languages spoken, and writing and reading habits.

  17. Data from: WikiReddit: Tracing Information and Attention Flows Between...

    • zenodo.org
    bin
    Updated May 4, 2025
    Cite
    Patrick Gildersleve; Anna Beers; Viviane Ito; Agustin Orozco; Francesca Tripodi (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms [Dataset]. http://doi.org/10.5281/zenodo.14653265
    Available download formats: bin
    Dataset updated
    May 4, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Patrick Gildersleve; Anna Beers; Viviane Ito; Agustin Orozco; Francesca Tripodi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 15, 2025
    Description

    Preprint

    Gildersleve, P., Beers, A., Ito, V., Orozco, A., & Tripodi, F. (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms. arXiv [Cs.CY]. https://doi.org/10.48550/arXiv.2502.04942
    Accepted at the International AAAI Conference on Web and Social Media (ICWSM) 2025

    Abstract

    The World Wide Web is a complex interconnected digital ecosystem, where information and attention flow between platforms and communities throughout the globe. These interactions co-construct how we understand the world, reflecting and shaping public discourse. Unfortunately, researchers often struggle to understand how information circulates and evolves across the web because platform-specific data is often siloed and restricted by linguistic barriers. To address this gap, we present a comprehensive, multilingual dataset capturing all Wikipedia links shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions use Wikipedia for deliberation and fact-checking which subsequently influences Wikipedia content, by driving traffic to articles or inspiring edits. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.

    Datasheet

    Motivation

    The motivations for this dataset stem from the challenges researchers face in studying the flow of information across the web. While the World Wide Web enables global communication and collaboration, data silos, linguistic barriers, and platform-specific restrictions hinder our ability to understand how information circulates, evolves, and impacts public discourse. Wikipedia and Reddit, as major hubs of knowledge sharing and discussion, offer an invaluable lens into these processes. However, without comprehensive data capturing their interactions, researchers are unable to fully examine how platforms co-construct knowledge. This dataset bridges this gap, providing the tools needed to study the interconnectedness of social media and collaborative knowledge systems.

    Composition

    WikiReddit is a comprehensive dataset capturing all Wikipedia mentions (including links) shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW (not safe for work) subreddits. The SQL database comprises 336K total posts, 10.2M comments, 1.95M unique links, and 1.26M unique articles spanning 59 languages on Reddit and 276 Wikipedia language subdomains. Each linked Wikipedia article is enriched with its revision history and page view data within a ±10-day window of its posting, as well as article ID, redirects, and Wikidata identifiers. Supplementary anonymous metadata from Reddit posts and comments further contextualizes the links, offering a robust resource for analysing cross-platform information flows, collective attention dynamics, and the role of Wikipedia in online discourse.

    Collection Process

    Data was collected from the Reddit4Researchers and Wikipedia APIs. No personally identifiable information is published in the dataset. Reddit data is linked to Wikipedia via the hyperlinks and article titles appearing in Reddit posts.

    Preprocessing/cleaning/labeling

    Extensive processing with tools such as regex was applied to the Reddit post/comment text to extract the Wikipedia URLs. Redirects for Wikipedia URLs and article titles were found through the API and mapped to the collected data. Reddit IDs are hashed with SHA-256 for post/comment/user/subreddit anonymity.
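
    The two steps named above (regex extraction of Wikipedia URLs and SHA-256 hashing of IDs) might look roughly like the sketch below. It is not the authors' pipeline: the pattern `WIKI_URL`, the helper names, and the sample text are assumptions made for illustration, and the pattern deliberately stops at a closing parenthesis, so titles such as `Mercury_(planet)` would be cut short.

```python
import hashlib
import re

# Hypothetical pattern: captures the language subdomain and the article title
# of a Wikipedia link, including mobile ("m.") links. Not the authors' regex.
WIKI_URL = re.compile(
    r"https?://([a-z\-]+)\.(?:m\.)?wikipedia\.org/wiki/([^\s)\]>\"']+)",
    re.IGNORECASE,
)


def extract_wikipedia_links(text):
    """Return (language_subdomain, article_title) pairs found in text."""
    return [(m.group(1).lower(), m.group(2)) for m in WIKI_URL.finditer(text)]


def hash_id(raw_id):
    """Return the hex SHA-256 digest of an identifier string."""
    return hashlib.sha256(raw_id.encode("utf-8")).hexdigest()


# Invented sample text, for illustration only.
sample = "See https://en.wikipedia.org/wiki/Reddit and (https://de.m.wikipedia.org/wiki/Berlin)."
links = extract_wikipedia_links(sample)
print(links)
print(hash_id("abc"))
```

    Whether the published hashes use a salt or a different input encoding is not stated in the description, so digests from this sketch should not be expected to match the dataset's IDs.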

    Uses

    We foresee several applications of this dataset and preview four here. First, Reddit linking data can be used to understand how attention is driven from one platform to another. Second, Reddit linking data can shed light on how Wikipedia's archive of knowledge is used in the larger social web. Third, our dataset could provide insights into how external attention is topically distributed across Wikipedia. Our dataset can help extend that analysis into the disparities in what types of external communities Wikipedia is used in, and how it is used. Fourth, relatedly, a topic analysis of our dataset could reveal how Wikipedia usage on Reddit contributes to societal benefits and harms. Our dataset could help examine if homogeneity within the Reddit and Wikipedia audiences shapes topic patterns and assess whether these relationships mitigate or amplify problematic engagement online.

    Distribution

    The dataset is publicly shared with a Creative Commons Attribution 4.0 International license. The article describing this dataset should be cited: https://doi.org/10.48550/arXiv.2502.04942

    Maintenance

    Patrick Gildersleve will maintain this dataset, and add further years of content as and when available.


    SQL Database Schema

    Table: posts

    Column Name          Type       Description
    subreddit_id         TEXT       The unique identifier for the subreddit.
    crosspost_parent_id  TEXT       The ID of the original Reddit post if this post is a crosspost.
    post_id              TEXT       Unique identifier for the Reddit post.
    created_at           TIMESTAMP  The timestamp when the post was created.
    updated_at           TIMESTAMP  The timestamp when the post was last updated.
    language_code        TEXT       The language code of the post.
    score                INTEGER    The score (upvotes minus downvotes) of the post.
    upvote_ratio         REAL       The ratio of upvotes to total votes.
    gildings             INTEGER    Number of awards (gildings) received by the post.
    num_comments         INTEGER    Number of comments on the post.

    Table: comments

    Column Name            Type       Description
    subreddit_id           TEXT       The unique identifier for the subreddit.
    post_id                TEXT       The ID of the Reddit post the comment belongs to.
    parent_id              TEXT       The ID of the parent comment (if a reply).
    comment_id             TEXT       Unique identifier for the comment.
    created_at             TIMESTAMP  The timestamp when the comment was created.
    last_modified_at       TIMESTAMP  The timestamp when the comment was last modified.
    score                  INTEGER    The score (upvotes minus downvotes) of the comment.
    upvote_ratio           REAL       The ratio of upvotes to total votes for the comment.
    gilded                 INTEGER    Number of awards (gildings) received by the comment.

    Table: postlinks

    Column Name            Type       Description
    post_id                TEXT       Unique identifier for the Reddit post.
    end_processed_valid    INTEGER    Whether the extracted URL from the post resolves to a valid URL.
    end_processed_url      TEXT       The extracted URL from the Reddit post.
    final_valid            INTEGER    Whether the final URL from the post resolves to a valid URL after redirections.
    final_status           INTEGER    HTTP status code of the final URL.
    final_url              TEXT       The final URL after redirections.
    redirected             INTEGER    Indicator of whether the posted URL was redirected (1) or not (0).
    in_title               INTEGER    Indicator of whether the link appears in the post title (1) or post body (0).
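
    As a rough illustration of how the posts and postlinks tables relate, the sketch below rebuilds both in an in-memory SQLite database and joins them on post_id. The column names and types come from the schema above. Everything else is an assumption: the use of SQLite, the absence of keys and constraints, the inserted rows, and the encoding of final_valid as 1 for valid.

```python
import sqlite3

# In-memory database; the real dataset's SQL engine and constraints are not
# specified here, so this schema only mirrors the documented columns.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE posts (
        subreddit_id TEXT, crosspost_parent_id TEXT, post_id TEXT,
        created_at TIMESTAMP, updated_at TIMESTAMP, language_code TEXT,
        score INTEGER, upvote_ratio REAL, gildings INTEGER, num_comments INTEGER
    );
    CREATE TABLE postlinks (
        post_id TEXT, end_processed_valid INTEGER, end_processed_url TEXT,
        final_valid INTEGER, final_status INTEGER, final_url TEXT,
        redirected INTEGER, in_title INTEGER
    );
    """
)

# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("s1", None, "p1", "2021-05-01 12:00:00", "2021-05-01 12:00:00", "en", 42, 0.97, 0, 3),
        ("s1", None, "p2", "2021-06-01 08:30:00", "2021-06-01 08:30:00", "en", 7, 0.80, 0, 1),
    ],
)
conn.executemany(
    "INSERT INTO postlinks VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("p1", 1, "https://en.wikipedia.org/wiki/Reddit", 1, 200,
         "https://en.wikipedia.org/wiki/Reddit", 0, 1),
        ("p2", 1, "https://en.wikipedia.org/wiki/Wikipedia", 1, 200,
         "https://en.wikipedia.org/wiki/Wikipedia", 0, 0),
    ],
)

# Valid links that appear in a post title, highest-scoring posts first.
rows = conn.execute(
    """
    SELECT p.post_id, p.score, l.final_url
    FROM posts AS p
    JOIN postlinks AS l ON l.post_id = p.post_id
    WHERE l.final_valid = 1 AND l.in_title = 1
    ORDER BY p.score DESC
    """
).fetchall()
print(rows)
```

    The same join pattern should apply to comments and commentlinks via comment_id, assuming those tables follow the layout listed here.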

    Table: commentlinks

    Column Name            Type       Description
    comment_id             TEXT       Unique identifier for the Reddit comment.
    end_processed_valid    INTEGER    Whether the extracted URL from the comment resolves to a valid URL.
    end_processed_url      TEXT       The extracted URL from the comment.
    final_valid            INTEGER    Whether the final URL from the comment resolves to a valid URL after redirections.
    final_status           INTEGER    HTTP status code of the final

  18. Jigsaw Train Translated (Yandex API)

    • kaggle.com
    Updated May 11, 2020
    Cite
    ma7555 (2020). Jigsaw Train Translated (Yandex API) [Dataset]. https://www.kaggle.com/ma7555/jigsaw-train-translated-yandex-api/code
    Explore at:
    Dataset updated
    May 11, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    ma7555
    License

    http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    Dataset

    This dataset was created by ma7555

    Released under GPL 2

    Contents

  19. Trends in Reading and Language Arts Proficiency (2010-2022): Internet...

    • publicschoolreview.com
    Updated Jun 3, 2025
    Cite
    Public School Review (2025). Trends in Reading and Language Arts Proficiency (2010-2022): Internet Academy vs. Washington vs. Federal Way School District [Dataset]. https://www.publicschoolreview.com/internet-academy-profile
    Explore at:
    Dataset updated
    Jun 3, 2025
    Dataset authored and provided by
    Public School Review
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Federal Way School District
    Description

    This dataset tracks annual reading and language arts proficiency from 2010 to 2022 for Internet Academy vs. Washington and Federal Way School District

  20. Digital Language Learning Market Report | Global Forecast From 2025 To 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Jan 7, 2025
    Cite
    Dataintelo (2025). Digital Language Learning Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/global-digital-language-learning-market
    Explore at:
    Available download formats: pdf, csv, pptx
    Dataset updated
    Jan 7, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Digital Language Learning Market Outlook



    The global digital language learning market size was valued at approximately USD 12 billion in 2023 and is expected to reach around USD 25 billion by 2032, growing at a CAGR of 8.5% during the forecast period. The growth of this market is driven by factors such as increasing globalization, the rise of online education, and technological advancements that make language learning more accessible and engaging.
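
    The figures in the paragraph above can be cross-checked with the standard compound annual growth rate formula, end = start × (1 + rate)^years. The short Python sketch below does so under one assumption: the growth period is taken as the nine years from the 2023 base value to 2032. The function names are illustrative only.

```python
def project_value(start_value, annual_rate, years):
    """Compound a starting value at a fixed annual growth rate."""
    return start_value * (1 + annual_rate) ** years


def implied_cagr(start_value, end_value, years):
    """Annual growth rate implied by a start value, an end value, and a period."""
    return (end_value / start_value) ** (1 / years) - 1


# Assumption: nine compounding years, from the 2023 base value to 2032.
years = 2032 - 2023

projected = project_value(12.0, 0.085, years)  # USD billion
rate = implied_cagr(12.0, 25.0, years)

print(round(projected, 1))   # 25.0
print(round(rate * 100, 1))  # 8.5
```

    Under that assumption the quoted values are mutually consistent: USD 12 billion compounding at 8.5% for nine years gives roughly USD 25 billion.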



    One of the primary growth factors of the digital language learning market is the increasing prevalence of globalization and the demand for multilingual communication skills. In an interconnected world, the ability to communicate in multiple languages has become a critical skill for both personal and professional development. Businesses are expanding their operations across borders, which necessitates employees to be proficient in multiple languages. Consequently, both individuals and organizations are investing heavily in digital language learning solutions to bridge language barriers and enhance communication efficiency.



    Technological advancements have also played a significant role in propelling the growth of the digital language learning market. The advent of artificial intelligence, machine learning, and natural language processing has revolutionized the way languages are taught and learned. These technologies enable personalized learning experiences, adaptive learning paths, and real-time feedback, which significantly enhance the effectiveness of language acquisition. Moreover, the proliferation of smartphones and high-speed internet has made digital language learning solutions more accessible to a broader audience, further fueling market growth.



    The rise of online education and e-learning platforms has provided a significant boost to the digital language learning market. With the growing acceptance of online education as a viable alternative to traditional classroom-based learning, more individuals are turning to digital platforms for their language learning needs. These platforms offer flexibility, convenience, and a wide range of resources that cater to different learning styles and preferences. Additionally, the COVID-19 pandemic has accelerated the adoption of online education, as lockdowns and social distancing measures have forced educational institutions and learners to transition to digital modes of learning.



    The emergence of Online Language Training has further revolutionized the digital language learning landscape. With the flexibility and accessibility that online platforms provide, learners can access a plethora of resources tailored to their individual needs and learning styles. These platforms often incorporate multimedia elements, such as videos, interactive quizzes, and virtual classrooms, to create an engaging and immersive learning environment. The ability to learn at one's own pace and schedule has made online language training particularly appealing to busy professionals and students alike, who can now integrate language learning seamlessly into their daily routines. Additionally, the global reach of online platforms allows learners to connect with native speakers and cultural experts, enhancing their language proficiency and cultural understanding.



    Regionally, the Asia Pacific region is expected to witness substantial growth in the digital language learning market. This can be attributed to the increasing focus on English language learning in countries like China, Japan, and India, where English proficiency is seen as a key driver of academic and professional success. Additionally, government initiatives to promote digital education and the presence of a large population of young learners are further contributing to the market growth in this region. North America and Europe are also significant markets, driven by the high adoption of technology in education and the presence of a large number of immigrants seeking language learning solutions.



    Product Type Analysis



    The digital language learning market is segmented by product type into on-premises and cloud-based solutions. On-premises solutions involve the installation of software on local servers or personal computers, offering greater control over data and customization options. These solutions are often preferred by large organizations and academic institutions that require extensive language learning programs and have the necessary IT infrastructure to support them. However, the high initial costs and maintenance req

Cite
Statista (2025). Common languages used for web content 2025, by share of websites [Dataset]. https://www.statista.com/statistics/262946/most-common-languages-on-the-internet/

Common languages used for web content 2025, by share of websites

Explore at:
69 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Feb 11, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Feb 2025
Area covered
Worldwide
Description

As of February 2025, English was the most popular language for web content, with over 49.4 percent of websites using it. Spanish ranked second, with six percent of web content, while content in German followed, with 5.6 percent.

English as the leading online language

The United States and India, the countries with the most internet users after China, are also the world's biggest English-speaking markets. The internet user base in both countries combined, as of January 2023, was over a billion individuals. This has led to most online information being created in English. Consequently, even those who are not native speakers may use it for convenience.

Global internet usage by regions

As of October 2024, the number of internet users worldwide was 5.52 billion. In the same period, Northern Europe and North America led in internet penetration rates worldwide, with around 97 percent of their populations accessing the internet.
