As of February 2025, English was the most popular language for web content, with over 49.4 percent of websites using it. Spanish ranked second, with six percent of web content, and German followed, with 5.6 percent.

English as the leading online language

The United States and India, the countries with the most internet users after China, are also the world's biggest English-speaking markets. The internet user base in both countries combined, as of January 2023, was over a billion individuals. This has led to most online information being created in English. Consequently, even those who are not native speakers may use it for convenience.

Global internet usage by regions

As of October 2024, the number of internet users worldwide was 5.52 billion. In the same period, Northern Europe and North America led in internet penetration rates worldwide, with around 97 percent of their populations accessing the internet.
In 2025, there were around 1.53 billion people worldwide who spoke English either natively or as a second language, more than the 1.18 billion Mandarin Chinese speakers at the time of the survey. Hindi and Spanish were the third and fourth most widespread languages that year.

Languages in the United States

The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken there vary as a result of the multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which more than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese), 1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year.

Different languages at home

The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage of its population speaking a language other than English is California. About 45 percent of its population spoke a language other than English at home in 2023.
As of 2024, JavaScript and HTML/CSS were the most commonly used programming languages among software developers around the world, with more than 62 percent of respondents stating that they used JavaScript and around 53 percent using HTML/CSS. Python, SQL, and TypeScript rounded out the top five most widely used programming languages around the world.

Programming languages

At a very basic level, programming languages serve as sets of instructions that direct computers on how to behave and carry out tasks. Thanks to the increased prevalence of, and reliance on, computers and electronic devices in today’s society, these languages play a crucial role in the everyday lives of people around the world. An increasing number of people are interested in furthering their understanding of these tools through courses and bootcamps, while current developers are constantly seeking new languages and resources to add to their skills. Furthermore, programming knowledge is becoming an important skill within various industries throughout the business world. Job seekers with skills in Python, R, and SQL will find these among the most sought-after data science skills, which is likely to help in their search for employment.
According to a 2023 survey, ** percent of internet users in urban India preferred using the internet in English. Meanwhile, ** percent of users accessed the internet in Indian languages, with Hindi being the most preferred language among them. Over *** million internet users reside in the urban areas of India.
How frequently a word occurs in a language is an important piece of information for natural language processing and for linguists. In natural language processing, very frequent words tend to be less informative than less frequent ones and are often removed during preprocessing. Human language users are also sensitive to word frequency: how often a word is used affects language processing in humans. For example, very frequent words are read and understood more quickly and can be understood more easily in background noise.
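The preprocessing step mentioned above, removing very frequent words, can be sketched as follows. This is a minimal illustration only: the function name `remove_frequent_words`, the `top_n` parameter, and the sample sentence are hypothetical, and a real pipeline would more likely use a curated stop-word list or counts from a large corpus.

```python
from collections import Counter


def remove_frequent_words(tokens, top_n=1):
    """Drop the top_n most frequent tokens (a crude, frequency-based stop-word filter)."""
    counts = Counter(tokens)
    # most_common returns (word, count) pairs sorted by descending count
    frequent = {word for word, _ in counts.most_common(top_n)}
    return [token for token in tokens if token not in frequent]


# Hypothetical sample text; "the" is the single most frequent token here.
tokens = "the cat sat on the mat and the dog sat on the rug".split()
filtered = remove_frequent_words(tokens, top_n=1)
```

With `top_n=1`, only the most frequent token is dropped; a larger `top_n` removes more of the high-frequency, low-information words.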
This dataset contains the counts of the 333,333 most commonly-used single words on the English language web, as derived from the Google Web Trillion Word Corpus.
Data files were derived from the Google Web Trillion Word Corpus (as described by Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium) by Peter Norvig. You can find more information on these files and the code used to generate them here.
The code used to generate this dataset is distributed under the MIT License.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Reliable and up-to-date indicators of the presence of languages on the Internet are required to drive language policies efficiently, to forecast e-commerce markets, and to support further research on digital support for languages. This article presents a complete description of the methodological elements involved in producing an unprecedented set of indicators of the Internet presence of the 329 languages with more than 1 million L1 speakers. Special emphasis is given to the treatment of the comprehensive set of biases involved in the process, arising either from the method or from the various sources used in the modeling process. The biases related to other sources providing similar data are also discussed; in particular, it is shown how failing to account for the high level of multilingualism of the Web leads to a huge overestimation of the presence of English. The detailed list of sources is presented in the various annexes. For the first time in the history of the Internet, the production of indicators of the virtual presence of a large set of languages could allow progress in the fields of the economics of languages, the cyber-geography of languages, and language policies for multilingualism.
description: The TIGER/Line Files are shapefiles and related database files (.dbf) that are an extract of selected geographic and cartographic information from the U.S. Census Bureau's Master Address File / Topologically Integrated Geographic Encoding and Referencing (MAF/TIGER) Database (MTDB). The MTDB represents a seamless national file with no overlaps or gaps between parts, however, each TIGER/Line File is designed to stand alone as an independent data set, or they can be combined to cover the entire nation. The primary legal divisions of most States are termed counties. In Louisiana, these divisions are known as parishes. In Alaska, which has no counties, the equivalent entities are the organized boroughs, city and boroughs, and municipalities, and for the unorganized area, census areas. The latter are delineated cooperatively for statistical purposes by the State of Alaska and the Census Bureau. In four States (Maryland, Missouri, Nevada, and Virginia), there are one or more incorporated places that are independent of any county organization and thus constitute primary divisions of their States. These incorporated places are known as independent cities and are treated as equivalent entities for purposes of data presentation. The District of Columbia and Guam have no primary divisions, and each area is considered an equivalent entity for purposes of data presentation. The Census Bureau treats the following entities as equivalents of counties for purposes of data presentation: Municipios in Puerto Rico, Districts and Islands in American Samoa, Municipalities in the Commonwealth of the Northern Mariana Islands, and Islands in the U.S. Virgin Islands. The entire area of the United States, Puerto Rico, and the Island Areas is covered by counties or equivalent entities. The 2010 Census boundaries for counties and equivalent entities are as of January 1, 2010, primarily as reported through the Census Bureau's Boundary and Annexation Survey (BAS).
This table contains data on individual languages spoken from the American Community Survey 2006-2010 database for counties. The American Community Survey (ACS) is a household survey conducted by the U.S. Census Bureau that currently has an annual sample size of about 3.5 million addresses. ACS estimates provide communities with the current information they need to plan investments and services. Information from the survey generates estimates that help determine how more than $400 billion in federal and state funds are distributed annually. Each year the survey produces data that cover the periods of 1-year, 3-year, and 5-year estimates for geographic areas in the United States and Puerto Rico, ranging from neighborhoods to Congressional districts to the entire nation. This table also has a companion table (same table name with an MOE suffix) with the margin of error (MOE) values for each estimated element. MOE is expressed as a measure value for each estimated element, so a value of 25 and an MOE of 5 means 25 +/- 5 (or statistical certainty between 20 and 30). There are also special cases of MOE. An MOE of -1 means the associated estimates do not have a measured error. An MOE of 0 means that error calculation is not appropriate for the associated value. An MOE of 109 is set whenever an estimate value is 0. The MOEs of aggregated elements and percentages must be calculated. This process means using standard error calculations as described in "American Community Survey Multiyear Accuracy of the Data (3-year 2008-2010 and 5-year 2006-2010)". Also, following Census guidelines, aggregated MOEs do not use more than one zero-estimate MOE (109), to prevent overestimation of the error. Due to the complexity of the calculations, some percentage MOEs cannot be calculated (these are set to null in the summary-level MOE tables).
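The MOE rules above can be illustrated with a short sketch. It assumes the root-sum-of-squares approximation that Census guidance commonly gives for the MOE of a sum of estimates. The cited accuracy document is not reproduced here, so treat the formula, the function names, and the decision to skip the special codes -1 and 0 as assumptions rather than as the method used to build the summary tables.

```python
import math


def moe_interval(estimate, moe):
    """Return the (low, high) range implied by an estimate and its MOE."""
    return (estimate - moe, estimate + moe)


def aggregate_moe(pairs):
    """Approximate the MOE of a sum of estimates as sqrt(sum of squared MOEs).

    pairs: iterable of (estimate, moe) tuples.
    Assumption: special MOE codes -1 and 0 are skipped, and the MOE of a
    zero estimate (109 in these tables) is counted at most once.
    """
    zero_estimate_used = False
    sum_of_squares = 0.0
    for estimate, moe in pairs:
        if moe <= 0:
            continue  # -1: no measured error; 0: calculation not appropriate
        if estimate == 0:
            if zero_estimate_used:
                continue  # only one zero-estimate MOE enters the aggregate
            zero_estimate_used = True
        sum_of_squares += moe ** 2
    return math.sqrt(sum_of_squares)


low, high = moe_interval(25, 5)
combined = aggregate_moe([(25, 3), (40, 4)])
with_zeros = aggregate_moe([(25, 3), (40, 4), (0, 109), (0, 109)])
```

In this sketch, the two zero-estimate rows contribute a single 109 to the aggregate, which keeps the combined MOE from being inflated by every empty category.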
The name for table 'ACS10LSPCNTYMOE' was added as a prefix to all field names imported from that table. Be sure to turn off 'Show Field Aliases' to see complete field names in the Attribute Table of this feature layer. This can be done in the 'Table Options' drop-down menu in the Attribute Table or with key sequence '[CTRL]+[SHIFT]+N'. Due to database restrictions, the prefix may have been abbreviated if the field name exceeded the maximum allowed characters.
Although Bangla is one of the most widely used languages in the world, finding a robust dataset of Bangla spam SMS or email is almost impossible. This is a dataset of Bangla SMS messages in which spam messages are labeled as spam and legitimate messages are labeled as ham. Commercial messages are counted as spam, along with phishing and spamming ones. The data were collected through a survey and then filtered. Because almost all respondents agreed that they receive more spam by SMS than by email, this dataset was built from those unwanted messages.
The statistical operation Survey on the Information Society (ESI) Companies provides regular information on the implementation of new Information and Communication Technology (ICT) in the companies of the Basque Country. Specifically, it records and describes the level of Internet use in the different establishments: the systems of Internet access, the activities carried out via the Internet, and the availability of a website and its main characteristics. It also measures the implementation of e-commerce purchases and sales in economic activity and the means used to carry them out.
https://dataintelo.com/privacy-and-policy
The Pinyin Input Method Market has been experiencing a significant trajectory in market size, with global figures estimated at $1.5 billion in 2023 and projected to reach approximately $2.8 billion by 2032, growing at a compound annual growth rate (CAGR) of 7%. This robust growth can be attributed to several key factors, including the increasing digitalization across various sectors, the proliferation of smartphones, and the growing demand for efficient input methods that cater to Mandarin-speaking populations worldwide. The escalation of internet usage and the need for seamless communication in one of the most spoken languages globally is further propelling the market's upward trend.
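As a quick arithmetic check on the figures above, the compound annual growth rate can be computed from the start value, end value, and number of years. The sketch below uses the standard CAGR formula; the function names `cagr` and `project` are illustrative, and the nine-year horizon assumes the period runs from 2023 to 2032 as the text states.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
    return (end_value / start_value) ** (1 / years) - 1


def project(start_value, rate, years):
    """Value after compounding start_value at the given annual rate."""
    return start_value * (1 + rate) ** years


years = 2032 - 2023                   # nine compounding periods (assumed)
implied_rate = cagr(1.5, 2.8, years)  # rate implied by the two market sizes
projected = project(1.5, 0.07, years) # 2032 value implied by a 7% CAGR
```

Under these assumptions, the implied rate comes out a little above 7 percent and the 7 percent projection lands slightly below $2.8 billion, which is consistent with the rounded figures quoted in the text.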
One of the primary growth factors driving the Pinyin Input Method Market is the rapid digital transformation across industries. As businesses and educational institutions increasingly adopt digital platforms, there is a heightened need for effective input methods that can cater to Chinese-speaking users. The Pinyin input method, being one of the most efficient and widely used systems for Chinese character input, aligns perfectly with the needs of this growing user base. Additionally, the rise of e-learning platforms and remote work has necessitated reliable input methods, further contributing to market growth. The integration of Pinyin input across multiple devices and platforms, such as smartphones, tablets, and computers, has broadened its accessibility and usability, making it indispensable in the digital age.
Another significant growth factor is the increasing penetration of smartphones and mobile internet services. With Asia, particularly China, witnessing a surge in smartphone adoption, the demand for user-friendly and efficient input methods like Pinyin has soared. Mobile users require quick and intuitive typing solutions that can seamlessly integrate with their devices and applications. The Pinyin input method, with its ease of use and compatibility, perfectly meets these demands, thereby driving market expansion. Moreover, ongoing technological advancements in natural language processing and machine learning have enhanced the accuracy and predictive capabilities of Pinyin input systems, further boosting their adoption across diverse user segments.
The expansion of the Pinyin Input Method Market is also fueled by globalization and the growing significance of the Chinese language in international business, education, and cultural exchanges. As more non-native speakers seek to learn Mandarin for professional and personal reasons, the demand for effective learning tools, including Pinyin input methods, has surged. Educational institutions and language learning platforms are increasingly incorporating Pinyin input systems to facilitate the learning process and improve user engagement. This trend is expected to continue as the Chinese language gains prominence on the global stage, contributing to sustained market growth.
Regionally, Asia Pacific dominates the Pinyin Input Method Market due to the high concentration of Mandarin speakers and the widespread adoption of digital technologies. North America and Europe are also witnessing growth, driven by the increasing interest in Mandarin language learning and cross-cultural communications. In Latin America and the Middle East & Africa, the market is gradually expanding as more educational and business entities recognize the value of integrating Chinese language capabilities. The regional outlook highlights the global significance of the Pinyin input method in facilitating communication and bridging linguistic gaps in an increasingly interconnected world.
The Pinyin Input Method Market can be segmented by product type into software and hardware. Software solutions dominate this market segment, primarily due to their versatility and wide applicability across various devices and platforms. These solutions can be easily installed and integrated into existing systems, making them a preferred choice for both individual users and organizations. Software-based Pinyin input methods offer extensive customization options, allowing users to tailor their typing experience to their preferences, which enhances user satisfaction and drives market growth. The continuous development of advanced features, such as predictive text and voice recognition, further elevates the value proposition of software solutions in this market.
On the other hand, hardware solutions, although a smaller segment, play a crucial role in specific applications. Dedicated Pinyin input hardware, such as keyboards
https://dataintelo.com/privacy-and-policy
The global market size for free online translator services was valued at approximately USD 1.2 billion in 2023 and is projected to reach USD 2.8 billion by 2032, growing at a compound annual growth rate (CAGR) of 9.7% during the forecast period. One of the major growth factors driving this market is the increasing globalization and the need for effective communication across different languages and regions.
The demand for free online translators is significantly driven by the globalization of businesses, which necessitates the translation of documents, websites, and marketing materials into multiple languages to reach a broader audience. The rise in international trade and cross-border e-commerce activities has also amplified the need for seamless communication tools. Furthermore, the adoption of free online translators has grown exponentially due to the increasing number of internet users worldwide, many of whom require translation services to access content in different languages.
Another critical growth factor is the advancement in artificial intelligence (AI) and machine learning (ML) technologies, which have substantially improved the accuracy and reliability of online translation services. These technological advancements enable the development of sophisticated algorithms that can handle complex translations in real-time, thus enhancing user experience. Additionally, the integration of natural language processing (NLP) capabilities into translation software has made it possible to understand and translate idiomatic expressions and cultural nuances more accurately.
The increasing demand for multilingual communication in the educational sector is also a significant contributor to the market's growth. Educational institutions are leveraging free online translators to facilitate learning in diverse linguistic environments, thus making education more accessible to students who speak different languages. The proliferation of online learning platforms and international collaborations in academia further drives the need for reliable translation services.
In the realm of multilingual communication, the role of a Simultaneous Interpreter has become increasingly vital. These professionals are adept at providing real-time translations during conferences, meetings, and events, ensuring that language barriers do not impede the flow of information. As globalization continues to expand, the demand for simultaneous interpretation services is on the rise, particularly in international business settings and diplomatic engagements. The integration of technology with human expertise in this field is enhancing the accuracy and efficiency of translations, making it an indispensable service in today's interconnected world.
Regionally, the Asia Pacific is expected to witness significant growth in the free online translator market due to the region's diverse linguistic landscape and the increasing penetration of the internet. Countries like China, India, and Japan are leading the charge in adopting online translation services to bridge language barriers in business and personal communication. North America and Europe are also substantial markets, driven by technological advancements and high internet usage rates. Latin America and the Middle East & Africa regions are gradually catching up, with increasing internet penetration and growing awareness about the benefits of online translation tools.
The free online translator market is segmented by type into text translation, speech translation, image translation, and others. Text translation remains the most widely used type, primarily because it forms the basis of most online communication. Innovations in text translation have made it possible to translate large volumes of text quickly and accurately, which is essential for businesses, educational institutions, and individual users. Text translation tools are increasingly being integrated into various applications, such as web browsers, office suites, and mobile apps, making them highly accessible and user-friendly.
Speech translation has seen significant growth, fueled by advancements in voice recognition technologies and the increasing use of voice-activated assistants. This segment is partic
Table from the American Community Survey (ACS) 5-year series on languages spoken and English ability related topics for City of Seattle Council Districts, Comprehensive Plan Growth Areas and Community Reporting Areas. Table includes B16004 Age by Language Spoken at Home by Ability to Speak English, C16002 Household Language by Household Limited English-Speaking Status. Data is pulled from block group tables for the most recent ACS vintage and summarized to the neighborhoods based on block group assignment.
U.S. Government Works: https://www.usa.gov/government-works
The TIGER/Line Files are shapefiles and related database files (.dbf) that are an extract of selected geographic and cartographic information from the U.S. Census Bureau's Master Address File / Topologically Integrated Geographic Encoding and Referencing (MAF/TIGER) Database (MTDB). The MTDB represents a seamless national file with no overlaps or gaps between parts, however, each TIGER/Line File is designed to stand alone as an independent data set, or they can be combined to cover the entire nation. Census tracts are small, relatively permanent statistical subdivisions of a county or equivalent entity, and were defined by local participants as part of the 2010 Census Participant Statistical Areas Program. The Census Bureau delineated the census tracts in situations where no local participant existed or where all the potential participants declined to participate. The primary purpose of census tracts is to provide a stable set of geographic units for the presentation of census data and comparison back to previous decennial censuses. Census tracts generally have a population size between 1,200 and 8,000 people, with an optimum size of 4,000 people. When first delineated, census tracts were designed to be homogeneous with respect to population characteristics, economic status, and living conditions. The spatial size of census tracts varies widely depending on the density of settlement. Physical changes in street patterns caused by highway construction, new development, and so forth, may require boundary revisions. In addition, census tracts occasionally are split due to population growth, or combined as a result of substantial population decline. Census tract boundaries generally follow visible and identifiable features. 
They may follow legal boundaries such as minor civil division (MCD) or incorporated place boundaries in some States and situations to allow for census tract-to-governmental unit relationships where the governmental boundaries tend to remain unchanged between censuses. State and county boundaries always are census tract boundaries in the standard census geographic hierarchy. In a few rare instances, a census tract may consist of noncontiguous areas. These noncontiguous areas may occur where the census tracts are coextensive with all or parts of legal entities that are themselves noncontiguous. For the 2010 Census, the census tract code range of 9400 through 9499 was enforced for census tracts that include a majority American Indian population according to Census 2000 data and/or their area was primarily covered by federally recognized American Indian reservations and/or off-reservation trust lands; the code range 9800 through 9899 was enforced for those census tracts that contained little or no population and represented a relatively large special land use area such as a National Park, military installation, or a business/industrial park; and the code range 9900 through 9998 was enforced for those census tracts that contained only water area, no land area.
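The reserved code ranges described above can be expressed as a small lookup. This is a sketch only: the function name `classify_tract_code` and the category labels are hypothetical, and it assumes the caller passes the four-digit basic tract code as an integer (without any decimal suffix).

```python
def classify_tract_code(basic_code: int) -> str:
    """Map a four-digit basic census tract code to the 2010 reserved-range category."""
    if 9400 <= basic_code <= 9499:
        # Majority American Indian population and/or reservation or trust land
        return "american_indian_area"
    if 9800 <= basic_code <= 9899:
        # Little or no population, large special land use area
        return "special_land_use"
    if 9900 <= basic_code <= 9998:
        # Water area only, no land area
        return "water_only"
    return "standard"


examples = {code: classify_tract_code(code) for code in (101, 9400, 9850, 9998, 9999)}
```

Note that 9999 falls outside the water-only range given in the text (9900 through 9998), so this sketch treats it as an ordinary code.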
This table contains data on individual languages spoken from the American Community Survey 2006-2010 database for tracts. The American Community Survey (ACS) is a household survey conducted by the U.S. Census Bureau that currently has an annual sample size of about 3.5 million addresses. ACS estimates provide communities with the current information they need to plan investments and services. Information from the survey generates estimates that help determine how more than $400 billion in federal and state funds are distributed annually. Each year the survey produces data that cover the periods of 1-year, 3-year, and 5-year estimates for geographic areas in the United States and Puerto Rico, ranging from neighborhoods to Congressional districts to the entire nation. This table also has a companion table (same table name with an MOE suffix) with the margin of error (MOE) values for each estimated element. MOE is expressed as a measure value for each estimated element, so a value of 25 and an MOE of 5 means 25 +/- 5 (or statistical certainty between 20 and 30). There are also special cases of MOE. An MOE of -1 means the associated estimates do not have a measured error. An MOE of 0 means that error calculation is not appropriate for the associated value. An MOE of 109 is set whenever an estimate value is 0. The MOEs of aggregated elements and percentages must be calculated. This process means using standard error calculations as described in "American Community Survey Multiyear Accuracy of the Data (3-year 2008-2010 and 5-year 2006-2010)". Also, following Census guidelines, aggregated MOEs do not use more than one zero-estimate MOE (109), to prevent overestimation of the error. Due to the complexity of the calculations, some percentage MOEs cannot be calculated (these are set to null in the summary-level MOE tables).
The name for table 'ACS10LSPTRMOE' was added as a prefix to all field names imported from that table. Be sure to turn off 'Show Field Aliases' to see complete field names in the Attribute Table of this feature layer. This can be done in the 'Table Options' drop-down menu in the Attribute Table or with key sequence '[CTRL]+[SHIFT]+N'. Due to database restrictions, the prefix may have been abbreviated if the field name exceeded the maximum allowed characters.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The 1998 Lancet paper by Wakefield et al., despite subsequent retraction and evidence indicating no causal link between vaccinations and autism, triggered significant parental concern. The aim of this study was to analyze the online information available on this topic. Using localized versions of Google, we searched “autism vaccine” in English, French, Italian, Portuguese, Mandarin, and Arabic and analyzed 200 websites for each search engine result page (SERP). A common feature was the newsworthiness of the topic, with news outlets representing 25–50% of the SERP, followed by unaffiliated websites (blogs, social media) that represented 27–41% and included most of the vaccine-negative websites. Between 12 and 24% of websites had a negative stance on vaccines, while most websites were pro-vaccine (43–70%). However, their ranking by Google varied. While in Google.com, the first vaccine-negative website was the 43rd in the SERP, there was one vaccine-negative webpage in the top 10 websites in both the British and Australian localized versions and in French and two in Italian, Portuguese, and Mandarin, suggesting that the information quality algorithm used by Google may work better in English. Many webpages mentioned celebrities in the context of the link between vaccines and autism, with Donald Trump most frequently. Few websites (1–5%) promoted complementary and alternative medicine (CAM) but 50–100% of these were also vaccine-negative suggesting that CAM users are more exposed to vaccine-negative information. This analysis highlights the need for monitoring the web for information impacting on vaccine uptake.
The survey charted the consumption of news and media and the trust in different media among Swedish-speaking Finns. Views on corruption were also examined. The data was collected as part of the Citizen Panel of Swedish-speaking Finns (Barometern), which is part of The Finnish Research Infrastructure for Public Opinion (FIRIPO). Respondents were first asked about the amount of media they use, followed by more detailed questions about their use of news media and social media. Next, respondents were asked to rate their level of trust in the different news media. In this context, they were also asked about their perception of the objectivity of journalists. Respondents were also asked about their willingness to pay for online news aimed at Swedish-speaking Finns. They were asked why they would pay for online news and how much they would be prepared to pay for it. Similarly, they were asked about their reasons for not being willing to pay for online news and for cancelling a subscription. Next, respondents were asked about their use of other digital services and changes in their use since the COVID-19 pandemic. Finally, respondents were asked about their views on the occurrence of corruption and their trust in different institutions. Background variables included the respondent's NUTS3 region of residence, age, gender, mother tongue, level of education, occupational status and political party choice.
https://ora.ox.ac.uk/terms_of_use
This collaborative project involved the University of Oxford and two universities in Papua, Universitas Cenderawasih and Universitas Negeri Papua, in the creation of an online database of 52 digital audio and video texts, together with linguistically annotated transcriptions and translations of 23 of the texts, for the Austronesian language Biak, a language with about 50,000-70,000 speakers in Papua. These resources provide a snapshot of audio and textual data on the language, and are useful for language preservation efforts, for ongoing efforts to produce teaching materials in the indigenous languages of Papua, and as a basis for the creation of dictionaries and glossaries in the language. Since they are linguistically annotated, they are also useful for linguists conducting research on Biak and related Austronesian languages. The annotated transcriptions are produced using Toolbox, a freely available data management and analysis tool for language documentation, which supports the creation of resources in various forms. These include transcribed texts with free translations into Indonesian and English (of most use to the Biak-speaking community and for pedagogical use in Papua) and linguistically annotated transcriptions in two forms: a standard human-readable form like the paper-based corpora familiar to linguists, and a translation of this form to XML via the utility tools for Toolbox, suitable for computer analysis and database search.
This dataset for intent classification from human speech covers 14 coarse-grained intents from the banking domain. This work is inspired by a similar release in the Minds-14 dataset; here, we restrict ourselves to Indian English but with a much larger training set. The data was generated by 11 (Indian English) speakers and recorded over a telephony line. We also provide access to anonymized speaker information, such as gender, languages spoken, and native language, to allow more structured discussions around robustness and bias in the models you train.
According to the Center for Strategic and International Studies, between June 2014 and July 2018, over 99 thousand German-language tweets were linked to the Russian Internet Research Agency. The second-largest group of Twitter campaigns targeted Ukrainian users, with over 82 thousand tweets.
According to a 2023 survey, the leading online activity among internet users in Indian languages was watching videos, as reported by ** percent of the respondents. Listening to music was the second most popular activity within this demographic. Over *** million internet users reside in the urban areas of India.