55 datasets found
  1. The most spoken languages worldwide 2025

    • statista.com
    Updated Apr 14, 2025
    Cite
    Statista (2025). The most spoken languages worldwide 2025 [Dataset]. https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/
    Explore at:
    Dataset updated
    Apr 14, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2025
    Area covered
    World
    Description

    In 2025, there were around 1.53 billion people worldwide who spoke English either natively or as a second language, more than the 1.18 billion Mandarin Chinese speakers at the time of survey. Hindi and Spanish were the third and fourth most widespread languages that year.

    Languages in the United States

    The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken there vary as a result of the multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which more than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese), 1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year.

    Different languages at home

    The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage of population speaking a language other than English is California. About 45 percent of its population spoke a language other than English at home in 2023.

  2. Common languages used for web content 2025, by share of websites

    • statista.com
    • ai-chatbox.pro
    Updated Feb 11, 2025
    Cite
    Statista (2025). Common languages used for web content 2025, by share of websites [Dataset]. https://www.statista.com/statistics/262946/most-common-languages-on-the-internet/
    Explore at:
    Dataset updated
    Feb 11, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Feb 2025
    Area covered
    Worldwide
    Description

    As of February 2025, English was the most popular language for web content, with over 49.4 percent of websites using it. Spanish ranked second, with six percent of web content, followed by German, with 5.6 percent.

    English as the leading online language

    The United States and India, the countries with the most internet users after China, are also the world's biggest English-speaking markets. The combined internet user base of both countries, as of January 2023, was over a billion individuals. This has led to most online information being created in English. Consequently, even those who are not native speakers may use it for convenience.

    Global internet usage by regions

    As of October 2024, the number of internet users worldwide was 5.52 billion. In the same period, Northern Europe and North America led in internet penetration rates worldwide, with around 97 percent of their populations accessing the internet.

  3. Ranking of languages spoken at home in the U.S. 2023

    • statista.com
    Updated Apr 14, 2025
    Cite
    Statista (2025). Ranking of languages spoken at home in the U.S. 2023 [Dataset]. https://www.statista.com/statistics/183483/ranking-of-languages-spoken-at-home-in-the-us-in-2008/
    Explore at:
    Dataset updated
    Apr 14, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2023
    Area covered
    United States
    Description

    In 2023, around 43.37 million people in the United States spoke Spanish at home. In comparison, approximately 998,179 people were speaking Russian at home during the same year. The distribution of the U.S. population by ethnicity can be accessed here. A ranking of the most spoken languages across the world can be accessed here.

  4. Most common languages spoken in India 2011

    • statista.com
    Updated Jun 24, 2025
    Cite
    Statista (2025). Most common languages spoken in India 2011 [Dataset]. https://www.statista.com/statistics/616508/most-common-languages-india/
    Explore at:
    Dataset updated
    Jun 24, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2011
    Area covered
    India
    Description

    Hindi, with over *** million native speakers, was the most spoken language across Indian homes, followed by Bengali with ** million speakers, as of 2011 census data. Native English speakers accounted for about *** thousand during the measured time period.

    The colonial rule in India

    One of the most remarkable and widespread legacies that British colonial rule left behind was the English language. Before independence, English was used solely for higher education and in government and administrative processes. Post-independence, however, and to this day, Hindi has been claimed as the language with official government patronage. This led to resistance from the southern states of India, where Hindi did not have prominence. Consequently, parliament enacted the Official Languages Act of 1963, which ensured the continued use of English for official purposes in conjunction with Hindi.

    Multi-linguistic cultures

    India has approximately ** major languages that are written in about ** different scripts. While the country’s official languages are both English and Hindi, Hindi remains the most preferred language used online, especially in the northern rural areas. The use of English is becoming increasingly popular in urban areas. In addition, almost every state in India has its own official language that is studied in primary and secondary school as an obligatory second language. Among the most prominent are Bengali, Marathi, and Telugu.

  5. Number of native Spanish speakers worldwide 2024, by country

    • statista.com
    Updated Jan 15, 2025
    Cite
    Statista (2025). Number of native Spanish speakers worldwide 2024, by country [Dataset]. https://www.statista.com/statistics/991020/number-native-spanish-speakers-country-worldwide/
    Explore at:
    Dataset updated
    Jan 15, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    World
    Description

    Mexico is the country with the largest number of native Spanish speakers in the world. As of 2024, 132.5 million people in Mexico spoke Spanish with a native command of the language. Colombia was the nation with the second-highest number of native Spanish speakers, at around 52.7 million. Spain came in third, with 48 million, and Argentina fourth, with 46 million.

    Spanish, a world language

    As of 2023, Spanish ranked as the fourth most spoken language in the world, behind only English, Chinese, and Hindi, with over half a billion speakers. Spanish is the official language of over 20 countries, the majority on the American continent; it is also one of the official languages of Equatorial Guinea in Africa. The language also has a strong presence in other countries, such as the United States, Morocco, and Brazil, which are included in the list of non-Hispanic countries with the highest number of Spanish speakers.

    The second most spoken language in the U.S.

    In the most recent data, Spanish ranked as the language, other than English, with the highest number of speakers in the U.S., with 12 times more speakers than the second-place language. This comes as no surprise given the long history of migration from Latin American countries to the United States. Moreover, during fiscal year 2022 alone, 5 of the top 10 countries of origin of naturalized people in the U.S. were Spanish-speaking countries.

  6. GlobalPhone Vietnamese

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Jun 26, 2017
    + more versions
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2017). GlobalPhone Vietnamese [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0322/
    Explore at:
    Dataset updated
    Jun 26, 2017
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Description

    The GlobalPhone corpus, developed in collaboration with the Karlsruhe Institute of Technology (KIT), was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322). In each language about 100 sentences were read from each of the 100 speakers. The read texts were selected from national newspapers available via Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16bit, 16kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database.

    The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers. Data is shortened by means of the shorten program written by Tony Robinson. Alternatively, the data can be delivered unshortened.

    The Vietnamese part of GlobalPhone was collected in summer 2009. In total 160 speakers were recorded, 140 of them in the cities of Hanoi and Ho Chi Minh City in Vietnam, and an additional set of 20 speakers in Karlsruhe, Germany. All speakers are Vietnamese native speakers, covering the main dialectal variants from South and North Vietnam. Of these 160 speakers, 70 were female and 90 were male. The majority of speakers are well educated, being graduated students and engineers. The age distribution of the speakers ranges from 18 to 65 years. Each speaker read between 50 and 200 utterances from newspaper articles, corresponding to roughly 9.5 minutes of speech or 138 utterances per person; in total, 22,112 utterances were recorded. The speech was recorded using a close-talking microphone Sennheiser HM420 in a push-to-talk scenario using an in-house developed modern laptop-based data collection toolkit. All data were recorde...
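    The figures quoted above for the Vietnamese part (16-bit, 16 kHz mono audio; 160 speakers; roughly 9.5 minutes per speaker) can be turned into a rough size estimate. The following is a minimal sketch of that arithmetic, assuming raw uncompressed PCM and taking the 9.5-minute average at face value; the distributed files are shortened, so actual sizes on disk will differ.

```python
# Values taken from the description above.
SAMPLE_RATE_HZ = 16_000      # 16 kHz
BYTES_PER_SAMPLE = 2         # 16-bit samples
CHANNELS = 1                 # mono
SPEAKERS = 160               # Vietnamese part
MINUTES_PER_SPEAKER = 9.5    # "roughly 9.5 minutes" per speaker

# Assumption: raw PCM with no header or compression overhead.
bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS
per_speaker_bytes = int(bytes_per_second * MINUTES_PER_SPEAKER * 60)
total_bytes = per_speaker_bytes * SPEAKERS
total_hours = SPEAKERS * MINUTES_PER_SPEAKER / 60

print(f"Data rate: {bytes_per_second} bytes/s")
print(f"Per speaker: {per_speaker_bytes / 1e6:.2f} MB")
print(f"Whole Vietnamese part: {total_bytes / 1e9:.2f} GB, {total_hours:.1f} hours")
```

    Under these assumptions the Vietnamese part comes to about 25 hours of speech and a little under 3 GB of raw audio.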

  7. 120 Million Word Spanish Corpus

    • marketplace.sshopencloud.eu
    Updated Apr 24, 2020
    Cite
    (2020). 120 Million Word Spanish Corpus [Dataset]. https://marketplace.sshopencloud.eu/dataset/XTUFXt
    Explore at:
    Dataset updated
    Apr 24, 2020
    Description

    Spanish is the second most widely-spoken language on Earth; over one in 20 humans alive today is a native speaker of Spanish. This medium-sized corpus contains 120 million words of modern Spanish taken from the Spanish-Language Wikipedia in 2010. This dataset is made up of 57 text files. Each contains multiple Wikipedia articles in an XML format. The text of each article is surrounded by tags. The initial tag also contains metadata about the article, including the article’s id and the title of the article. The text “ENDOFARTICLE.” appears at the end of each article, before the closing tag.
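    The file layout described above can be read with a short script. The following is a minimal Python sketch that splits one file's text into articles using the "ENDOFARTICLE." marker. The tag name `doc` and the attribute names `id` and `title` are assumptions made for illustration, since the listing does not show the actual tag syntax; adjust them to match the real files.

```python
import re

# Assumed tag shape: <doc id="..." title="..."> ... ENDOFARTICLE. </doc>
# The tag name and attribute names are hypothetical.
OPEN_TAG = re.compile(r'<doc\s+([^>]*)>')
ATTR = re.compile(r'(\w+)="([^"]*)"')
END_MARKER = "ENDOFARTICLE."


def parse_articles(text):
    """Return a list of dicts with 'id', 'title', and 'text' for each article."""
    articles = []
    for match in OPEN_TAG.finditer(text):
        attrs = dict(ATTR.findall(match.group(1)))
        body_start = match.end()
        body_end = text.find(END_MARKER, body_start)
        if body_end == -1:
            continue  # skip an article that has no end marker
        articles.append({
            "id": attrs.get("id"),
            "title": attrs.get("title"),
            "text": text[body_start:body_end].strip(),
        })
    return articles


# Tiny made-up sample in the assumed layout (not taken from the corpus).
sample = (
    '<doc id="1" title="Madrid">\nMadrid es la capital de España.\nENDOFARTICLE.\n</doc>\n'
    '<doc id="2" title="Lima">\nLima es la capital del Perú.\nENDOFARTICLE.\n</doc>\n'
)
articles = parse_articles(sample)
for article in articles:
    print(article["id"], article["title"], len(article["text"]))
```

    A regular expression is used here instead of an XML parser because the listing does not say whether the files are well-formed XML; if they are, a proper parser would be the safer choice.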

  8. Japan Centre of Excellence (JACEEX)

    • jaceex.com
    html
    Updated Jul 16, 2019
    Cite
    Japan Centre of Excellence (JACEEX) (2019). Japan Centre of Excellence (JACEEX) [Dataset]. https://www.jaceex.com/ssw
    Explore at:
    Available download formats: html
    Dataset updated
    Jul 16, 2019
    Dataset provided by
    https://www.jaceex.com/
    Authors
    Japan Centre of Excellence (JACEEX)
    Area covered
    Description

    Japan Centre of Excellence (JACEEX) is a brand under Jaceex Ventures LLP. Jaceex was formed with a vision to create a world-class workforce with the skill sets, work and business ethics, sincerity, and devotion, as well as other positive traits, found in the Japanese workforce, which has been responsible for building world-class enterprises. For Indian students and young people stepping into this world, our objective is to provide a life-changing opportunity in the form of skills and work in Japan. Japan Centre of Excellence (JACEEX) provides an integrated course schedule of learning through exploration, scrutiny, and self-reflection. We offer Japanese language and culture training at Basic, Intermediate, and High levels. Our training is designed to make the trainee eligible to certify themselves with the globally recognised Japanese Language Proficiency Test (JLPT) examination. This will help in building careers with Japanese companies in Japan and in India, as well as in self-employment. We also have a virtual live class platform.

  9. Language Named Authority List

    • data.europa.eu
    rdf xml, xml, zip
    Updated Sep 26, 2024
    Cite
    Publications Office of the European Union (2024). Language Named Authority List [Dataset]. https://data.europa.eu/data/datasets/language?locale=en
    Explore at:
    Available download formats: xml, rdf xml, zip
    Dataset updated
    Sep 26, 2024
    Dataset provided by
    Publications Office of the European Union (http://op.europa.eu/)
    European Union
    Authors
    Publications Office of the European Union
    License

    http://data.europa.eu/eli/dec/2011/833/oj

    Description

    Language is a controlled vocabulary that lists world languages and language varieties, including sign languages. Its main purpose is to support activities associated with the publication process. The full set of languages contains more than 8000 language varieties, each identified by a code equivalent to the ISO 639-3 code. Concepts are aligned with the ISO 639 international standard, which is issued in several parts: ISO 639-1 contains strictly two alphabetic letters (alpha-2), ISO 639-2/B (B = bibliographic) is used for bibliographic purpose (alpha-3), ISO 639-2/T (T = terminology) is used for technical purpose (alpha-3), ISO 639-3 covers all the languages and macro-languages of the world (alpha-3); the values are compliant with ISO 639-2/T. If an authority code is needed for a language without an assigned ISO code, an alphanumeric code is created to avoid confusion with the strictly alphabetic ISO codes. Labels are provided in all 24 official EU languages for the most frequently used languages. Language is under governance of the Interinstitutional Metadata and Formats Committee (IMFC). It is maintained by the Publications Office of the European Union and disseminated on the EU Vocabularies website. It is a corporate reference data asset covered by the Corporate Reference Data Management policy of the European Commission.
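    The code shapes described above (two letters for ISO 639-1, three letters for ISO 639-2 and ISO 639-3, and alphanumeric for authority codes without an ISO assignment) can be told apart by a simple shape check. The following is a minimal Python sketch; it looks only at the shape of a code, does not check that the code is actually assigned, and the sample alphanumeric code is hypothetical because the description does not specify the exact format of those codes.

```python
import re


def classify_language_code(code):
    """Classify a language code by its shape only (case-insensitive)."""
    normalized = code.strip().lower()
    if re.fullmatch(r"[a-z]{2}", normalized):
        return "alpha-2 (ISO 639-1 shape)"
    if re.fullmatch(r"[a-z]{3}", normalized):
        return "alpha-3 (ISO 639-2 or ISO 639-3 shape)"
    # Assumption: authority-created codes mix letters and digits.
    if re.fullmatch(r"[a-z0-9]+", normalized) and re.search(r"\d", normalized):
        return "alphanumeric (authority-created code shape)"
    return "unrecognized"


# "en" and "eng" are the ISO 639-1 and ISO 639-3 codes for English;
# "0a1" is a made-up example of an alphanumeric code.
for sample_code in ["en", "eng", "0a1", "e-n"]:
    print(sample_code, "->", classify_language_code(sample_code))
```

    Because the alphanumeric codes contain at least one digit, they cannot collide with the strictly alphabetic ISO codes, which is the property the description relies on.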

  10. 785 Million Language Translation Database for AI

    • kaggle.com
    Updated Aug 28, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ramakrishnan Lakshmanan (2023). 785 Million Language Translation Database for AI [Dataset]. https://www.kaggle.com/datasets/ramakrishnan1984/785-million-language-translation-database-ai-ml
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 28, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ramakrishnan Lakshmanan
    License

    http://www.gnu.org/licenses/lgpl-3.0.html

    Description

    Our groundbreaking translation dataset represents a monumental advancement in the field of natural language processing and machine translation. Comprising a staggering 785 million records, this corpus bridges language barriers by offering translations from English to an astonishing 548 languages. The dataset promises to be a cornerstone resource for researchers, engineers, and developers seeking to enhance their machine translation models, cross-lingual analysis, and linguistic investigations.

    Size of the dataset: 41 GB uncompressed, 20 GB compressed

    Key Features:

    Scope and Scale: With a comprehensive collection of 785 million records, this dataset provides an unparalleled wealth of translated text. Each record consists of an English sentence paired with its translation in one of the 548 target languages, enabling multi-directional translation applications.

    Language Diversity: Encompassing translations into 548 languages, this dataset represents a diverse array of linguistic families, dialects, and scripts. From widely spoken languages to those with limited digital representation, the dataset bridges communication gaps on a global scale.

    Quality and Authenticity: The translations have been meticulously curated, verified, and cross-referenced to ensure high quality and authenticity. This attention to detail guarantees that the dataset is not only extensive but also reliable, serving as a solid foundation for machine learning applications. The data was collected from various open datasets for my personal ML projects, and I am looking to share it with the team.

    Use Case Versatility: Researchers and practitioners across a spectrum of domains can harness this dataset for a myriad of applications. It facilitates the training and evaluation of machine translation models, empowers cross-lingual sentiment analysis, aids in linguistic typology studies, and supports cultural and sociolinguistic investigations.

    Machine Learning Advancement: Machine translation models, especially neural machine translation (NMT) systems, can leverage this dataset to enhance their training. The large-scale nature of the dataset allows for more robust and contextually accurate translation outputs.

    Fine-tuning and Customization: Developers can fine-tune translation models using specific language pairs, offering a powerful tool for specialized translation tasks. This customization capability ensures that the dataset is adaptable to various industries and use cases.

    Data Format: The dataset is provided in a structured JSON format, facilitating easy integration into existing machine learning pipelines. This structured approach expedites research and experimentation. Each JSON record contains an English word and its equivalent in the target language. The data was exported from a MongoDB database to ensure the uniqueness of records. Each record is unique, and the records are sorted.

    Access: The dataset is available for academic and research purposes, enabling the global AI community to contribute to and benefit from its usage. A well-documented API and sample code are provided to expedite exploration and integration.

    The English-to-548-languages translation dataset represents an incredible leap forward in advancing multilingual communication, breaking down barriers to understanding, and fostering collaboration on a global scale. It holds the potential to reshape how we approach cross-lingual communication, linguistic studies, and the development of cutting-edge translation technologies.

    Dataset Composition: The dataset is a culmination of translations from English, a widely spoken and understood language, into 548 distinct languages. Each language represents a unique linguistic and cultural background, providing a rich array of translation contexts. This diverse range of languages spans across various language families, regions, and linguistic complexities, making the dataset a comprehensive repository for linguistic research.

    Data Volume and Scale: With a staggering 785 million records, the dataset boasts an immense scale that captures a vast array of translations and linguistic nuances. Each translation entry consists of an English source text paired with its corresponding translation in one of the 548 target languages. This vast corpus allows researchers and practitioners to explore patterns, trends, and variations across languages, enabling the development of robust and adaptable translation models.

    Linguistic Coverage: The dataset covers an extensive set of languages, including but not limited to Indo-European, Afroasiatic, Sino-Tibetan, Austronesian, Niger-Congo, and many more. This broad linguistic coverage ensures that languages with varying levels of grammatical complexity, vocabulary richness, and syntactic structures are included, enhancing the applicability of translation models across diverse linguistic landscapes.

    Dataset Preparation: The translation ...

  11. Nigeria Language Areas

    • ebola-nga.opendata.arcgis.com
    Updated Dec 5, 2014
    Cite
    National Geospatial-Intelligence Agency (2014). Nigeria Language Areas [Dataset]. https://ebola-nga.opendata.arcgis.com/content/a8562de38b814219b331c7d49cc87ff4
    Explore at:
    Dataset updated
    Dec 5, 2014
    Dataset authored and provided by
    National Geospatial-Intelligence Agency
    Area covered
    Description

    There are over 500 known languages in Nigeria. While the official language is English, its use is largely confined to urban elites. The most commonly used languages are Hausa, Yoruba, Igbo (Ibo) and Fulfulde. Edo, Efik, Adamawa Fulfulde, Idoma, and Central Kanuri are also widely spoken. The area of greatest diversity is the ‘Middle Belt’, the band of territory stretching across the country between the large language blocs of the north and the south. The reason for this diversity remains unclear, but three of Africa's four language families meet in the Middle Belt of Nigeria. This has had sociolinguistic consequences where frequent conflicts have erupted between the culture and language of particular groups.

    ISO3 - International Organization for Standardization 3-digit country code

    LANG_FAM - Language family

    LANG_SUBGP - Language sub-family

    SOURCE_DT - Primary source creation date

    SOURCE - Primary source

    Collection

    This shapefile created by using Anthromapper consists of language layers that have been based on The World Language Mapping System (WLMS). Geographical terrain features, combined with a watershed model, were also used to predict the likely extent of ethnic and linguistic influence. The HGIS and metadata were supplemented with anthropological information from peer-reviewed journals and published books. The interpretation of names often produces multiple spellings of the same language; therefore similarly spelled or phonetic titles may be referencing the same language group.

    The data included herein have not been derived from a registered survey and should be considered approximate unless otherwise defined. While rigorous steps have been taken to ensure the quality of each dataset, DigitalGlobe Analytics is not responsible for the accuracy and completeness of data compiled from outside sources.

    Sources (HGIS)

    Anthromapper. DigitalGlobe Analytics, April 2013.

    World Language Mapping System (WLMS) Version 16. World GeoDatasets, April 2013.

    Sources (Metadata)

    Roger, Blench. "Position Paper: The Dimensions of Ethnicity, Language, and Culture in Nigeria." Last modified 2013. Accessed March 26, 2013. http://www.rogerblench.info.

    Roger, Blench. “The Status of the Languages of Central Nigeria.” Last modified 2013. Accessed March 26, 2013. http://www.rogerblench.info.

  12. Language Services market size was estimated at USD 58.9 billion in 2022!

    • cognitivemarketresearch.com
    pdf,excel,csv,ppt
    Updated Feb 19, 2024
    Cite
    Cognitive Market Research (2024). Language Services market size was estimated at USD 58.9 billion in 2022! [Dataset]. https://www.cognitivemarketresearch.com/language-services-market-report
    Explore at:
    Available download formats: pdf, excel, csv, ppt
    Dataset updated
    Feb 19, 2024
    Dataset authored and provided by
    Cognitive Market Research
    License

    https://www.cognitivemarketresearch.com/privacy-policy

    Time period covered
    2021 - 2033
    Area covered
    Global
    Description

    According to Cognitive Market Research, the global language services market size was estimated at USD 58.9 billion in 2022 and is expected to grow at a compound annual growth rate (CAGR) of 6.2% from 2023 to 2030.

    Which Factors Drive the Language Services Market Growth?

    Cross-border contact has become more intense due to globalization, increasing the need for translation, localization, and interpretation services. Language solutions are required by growing multinational businesses, e-commerce, and multilingual customer service. Growth is also fueled by government programs that support accessibility and multilingualism. Technology advancements, including AI-driven translation tools, increase productivity and widen the market.

    These developments empower businesses to offer better-tailored solutions and services, which, in turn, contribute to the growth of the Language Services industry.

    For instance, BIG Language Solutions, a well-known international provider of language services, revealed in April 2022 that it had acquired the Milan-based company Lawlinguists, which offers legal translation services. With the addition of Italy, Germany, and Spain to BIG's European footprint through the purchase, its clients now have access to a wider range of excellent legal translation services, resources, and technology.

    (Source: biglanguage.com/blog/big-acquires-lawlinguists-expands-legal-offering-and-european-presence/)
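    The headline figures at the top of this description (USD 58.9 billion in 2022, growing at a 6.2% CAGR from 2023 to 2030) imply a projected market size for 2030. The following is a minimal Python sketch of that compound-growth arithmetic. It assumes growth compounds once per year starting from the 2022 base, which gives eight annual steps through 2030; the report itself does not state the projected value or the compounding convention here.

```python
def project_market_size(base_value, cagr, years):
    """Compound base_value forward by `years` annual steps at rate `cagr`."""
    return base_value * (1.0 + cagr) ** years


base_2022 = 58.9       # USD billion, from the description
cagr = 0.062           # 6.2% per year, from the description
years = 2030 - 2022    # assumption: annual compounding from the 2022 base

projected_2030 = project_market_size(base_2022, cagr, years)
print(f"Projected 2030 market size: USD {projected_2030:.1f} billion")
```

    Under these assumptions the projection comes out at roughly USD 95 billion for 2030; a different base year or compounding convention would shift the result.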

    Globalization and Internationalization to Provide Viable Market Output
    

    A significant market driver for language services has been globalization. Communication in various languages is becoming increasingly important as firms grow internationally. The expansion of international trade, e-commerce, and cross-border investments all contribute to this trend. Companies must translate, localize, and adapt their products and services to local languages and cultures to remain competitive in the global market.

    There are approximately 7,139 languages spoken in the world today. However, many of these languages are endangered, with experts estimating that around 40% of languages are at risk of extinction.

    (Source: www.ohchr.org/en/stories/2019/10/many-indigenous-languages-are-danger-extinction)

    Multinational corporations with diverse workforces and clients from various language backgrounds have become popular due to globalization. These enterprises rely on translation services to eliminate language barriers to guarantee efficient internal communication and seamless relations with external parties. Language solutions, including document, website, and marketing material translation and conference and meeting interpretation services, greatly aid international collaboration and understanding.

    Technological Advancements to Propel Market Growth

    Localization of Digital Content

    Factors Restraining Growth of the Language Services Market

    Machine Translation Limitations to Hinder Market Growth
    

    The limitations of machine translation restrain the language services market. Although advances in AI have improved machine translation quality, it still falls short of human translation in accuracy and nuance, especially for complicated or specialized information. Machine translation systems frequently struggle with context and idiomatic expressions, which can make translations sound awkward or inaccurate to native speakers. This limitation is especially important for fields like law, medicine, and marketing, where accuracy and cultural appropriateness are key.

    How COVID-19 Impacted the Language Services Market?

    The pandemic drove digital transformation and remote work, which increased demand for translation and localization services to reach a worldwide audience. Translations in the medical and scientific fields increased as information sharing became essential. At the same time, travel restrictions hampered on-site interpreting services, increasing the demand for remote interpreting services. Due to the pandemic's emphasis on efficient intercultural communication, businesses, the medical community, and governments have all prioritized language services to enable proper information flow and support during the crisis.

    What is Language Services?

    Language services are professional services used for communication and understanding between different cultural groups. It facilitates effective comm...

  13. GlobalPhone Portuguese (Brazilian)

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Jun 26, 2017
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2017). GlobalPhone Portuguese (Brazilian) [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0201/
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Area covered
    Brazil
    Description

    The GlobalPhone corpus, developed in collaboration with the Karlsruhe Institute of Technology (KIT), was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).

    In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering and false starts, and for non-verbal effects like laughing and hesitations. Speaker information such as age, gender, and occupation, as well as information about the recording setup, complements the database.

    The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers. Data is shortened by means of the shorten program written by Tony Robinson. Alternatively, the data can be delivered unshortened.

    The Portuguese (Brazilian) corpus was produced using the Folha de Sao Paulo newspaper. It contains recordings of 102 speakers (54 males, 48 females) recorded in Porto Velho and Sao Paulo, Brazil. The following age distribution has been obtained: 6 speakers are below 19, 58 speakers are between 20 and 29, 27 speakers are between 30 and 39, 5 speakers are between 40 and 49, and 5 speakers are over 50 (1 speaker age is unknown).

  14. SLE Language Areas

    • ebola-nga.opendata.arcgis.com
    Updated Feb 2, 2015
    National Geospatial-Intelligence Agency (2015). SLE Language Areas [Dataset]. https://ebola-nga.opendata.arcgis.com/content/ffe30c1c30ed48fcafb14e8a026128d5
    Dataset authored and provided by
    National Geospatial-Intelligence Agency
    Area covered
    Description

    While English is the official language, it is typically used for governmental, business, and media purposes. In day-to-day life most people in the country speak Krio, which is a style of Pidgin English or English-based creole language. Krio is the lingua franca for the country and the formal language for those who do not speak English. With the number of different ethnic groups, Krio unites these groups with a common language. The citizens who are fluent in English are among the elite minority and often experience privileges such as economic opportunities that non-English speakers are excluded from. Other common indigenous languages used in the country are Mende, Temne, and Limba. As the official language, English is the only language used in education. It is reported that school children who speak indigenous languages on school premises are punished. Students who fail English classes are not granted admission into college.

    Attribute Table Field Descriptions

    ISO3 - International Organization for Standardization 3-digit country code
    ADM0_NAME - Administration level zero identification / name
    LANG_FAM - Language family
    LANG_SUBGR - Language subgroup
    ALT_NAMES - Alternate names
    COMMENTS - Comments or notes regarding language
    SOURCE_DT - Source one creation date
    SOURCE - Source one
    SOURCE2_DT - Source two creation date
    SOURCE2 - Source two

    Collection

    This feature class was created using Anthromapper and consists of linguistic layers that are primarily based on the World Language Mapping System (WLMS). Geographical terrain features, combined with a watershed model, were also used to predict the likely extent of linguistic influence. The metadata was supplemented with anthropological and linguistic information from peer-reviewed journals and published books.

    It should be noted that this feature class only depicts the majority first level languages spoken in a given area; there might be significant populations of other minority language speakers not shown in this dataset.

    The data included herein have not been derived from a registered survey and should be considered approximate unless otherwise defined. While rigorous steps have been taken to ensure the quality of each dataset, DigitalGlobe is not responsible for the accuracy and completeness of data compiled from outside sources.

    Sources (HGIS)

    Anthromapper. DigitalGlobe, November 2014.
    Ethnologue, “Languages of the World.” 2012. Accessed November 2014. http://www.ethnologue.com.
    World Language Mapping System (WLMS) Version 16. World GeoDatasets, November 2014.

    Sources (Metadata)

    Antimoon, “English, French, and Arabic languages in Sierra Leone”. December 2009. Accessed December 2014. http://www.antimoon.com.
    Central Intelligence Agency. The World FactBook, “Sierra Leone”. June 2014. Accessed November 2014. https://www.cia.gov/library/publications/the-world-factbook.
    DePauw University. Sierra Leone, “Language”. January 2014. Accessed December 2014. http://www.depauw.edu.
    National African Language Resource Center (NALRC), “Krio”. January 2014. Accessed December 2014. http://www.nalrc.indiana.edu.

  15. Most used programming languages among developers worldwide 2024

    • statista.com
    Updated Feb 6, 2025
    Statista (2025). Most used programming languages among developers worldwide 2024 [Dataset]. https://www.statista.com/statistics/793628/worldwide-developer-survey-most-used-languages/
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    May 19, 2024 - Jun 20, 2024
    Area covered
    Worldwide
    Description

    As of 2024, JavaScript and HTML/CSS were the most commonly used programming languages among software developers around the world, with more than 62 percent of respondents stating that they used JavaScript and around 53 percent using HTML/CSS. Python, SQL, and TypeScript rounded out the top five most widely used programming languages around the world.

    Programming languages

    At a very basic level, programming languages serve as sets of instructions that direct computers on how to behave and carry out tasks. Thanks to the increased prevalence of, and reliance on, computers and electronic devices in today’s society, these languages play a crucial role in the everyday lives of people around the world. An increasing number of people are interested in furthering their understanding of these tools through courses and bootcamps, while current developers are constantly seeking new languages and resources to add to their skills. Furthermore, programming knowledge is becoming an important skill to possess within various industries throughout the business world. Job seekers with skills in Python, R, and SQL will find their knowledge to be among the most highly desirable data science skills, which will likely assist in their search for employment.

  16. Data from: An Interactive Foreign Language Trainer Using Assessment and...

    • narcis.nl
    • data.mendeley.com
    Updated Nov 23, 2020
    Reyes, R (via Mendeley Data) (2020). An Interactive Foreign Language Trainer Using Assessment and Feedback Modalities [Dataset]. http://doi.org/10.17632/wnw9bdkxjb.1
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    Reyes, R (via Mendeley Data)
    Description

    Abstract. English has long been regarded as the “universal language.” People in most, if not all, countries know how to speak English or at least try to use it in their everyday communications for the purpose of globalizing. This study is designed to help students learn one or all of the four foreign languages most commonly used in the field of Information Technology, namely Korean, Mandarin Chinese, Japanese, and Spanish. Composed of a set of words, phrases, and sentences, the program is intended to teach students quickly at basic, intermediate, and advanced levels. This study used the Agile model in system development. Functionality, reliability, usability, efficiency, and portability were also considered in determining the level of the system’s acceptability in terms of ISO 25010:2011. This interactive foreign language trainer is built to associate fun with learning, to remedy the lack of perseverance by some in learning a new language, and to make learning the user’s favorite playtime activity. The study allows the user to interact with the program, which provides support for their learning. Moreover, this study reveals that integrating feedback modalities in the training and assessment modules of the software strengthens and enhances memory in learning the language.

    Keywords: Feedback, Assessment, Learning Modalities, Language Trainer, Interactive Technology

  17. GlobalPhone Swedish

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Jun 26, 2017
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2017). GlobalPhone Swedish [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0204/
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    The GlobalPhone corpus, developed in collaboration with the Karlsruhe Institute of Technology (KIT), was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).

    In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering and false starts, and for non-verbal effects like laughing and hesitations. Speaker information such as age, gender, and occupation, as well as information about the recording setup, complements the database.

    The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers. Data is shortened by means of the shorten program written by Tony Robinson. Alternatively, the data can be delivered unshortened.

    The Swedish corpus was produced using the Goeteborgs-Posten newspaper. It contains recordings of 98 speakers (50 males, 48 females) recorded in Stockholm and Vaernamo, Sweden. The following age distribution has been obtained: 9 speakers are below 19, 50 speakers are between 20 and 29, 12 speakers are between 30 and 39, 11 speakers are between 40 and 49, and 16 speakers are over 50.

  18. JamPatoisNLI Dataset

    • paperswithcode.com
    Updated Dec 6, 2022
    (2022). JamPatoisNLI Dataset [Dataset]. https://paperswithcode.com/dataset/jampatoisnli
    Description

    JamPatoisNLI provides the first dataset for natural language inference in a creole language, Jamaican Patois. Many of the most-spoken low-resource languages are creoles. These languages commonly have a lexicon derived from a major world language and a distinctive grammar reflecting the languages of the original speakers and the process of language birth by creolization. This gives them a distinctive place in exploring the effectiveness of transfer from large monolingual or multilingual pretrained models. While our work, along with previous work, shows that transfer from these models to low-resource languages that are unrelated to languages in their training set is not very effective, we would expect stronger results from transfer to creoles. Indeed, our experiments show considerably better results from few-shot learning of JamPatoisNLI than for such unrelated languages, and help us begin to understand how the unique relationship between creoles and their high-resource base languages affects cross-lingual transfer. JamPatoisNLI, which consists of naturally occurring premises and expert-written hypotheses, is a step towards steering research into a traditionally underserved language and a useful benchmark for understanding cross-lingual NLP.

  19. Screening phase tests of the PRISMA flow diagram Systematic review

    • zenodo.org
    Updated Feb 15, 2025
    Shi-Qi MEI; Felipe Gértrudix-Barrio (2025). Screening phase tests of the PRISMA flow diagram Systematic review [Dataset]. http://doi.org/10.5281/zenodo.14875814
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Shi-Qi MEI; Felipe Gértrudix-Barrio
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Oct 13, 2024
    Description

    Screening record, following the PRISMA 2020 method, for a systematic review carried out in the four most used bibliographic databases in the most spoken languages in the world: Web of Science, Scopus, Dialnet and China National Knowledge Infrastructure. The review covers both the learning of Spanish and the use of technologies, and how these have undergone very significant development in Chinese higher education.

  20. Global Mandarin Learning Market Overview and Outlook 2025-2032

    • statsndata.org
    excel, pdf
    Updated May 2025
    Stats N Data (2025). Global Mandarin Learning Market Overview and Outlook 2025-2032 [Dataset]. https://www.statsndata.org/report/mandarin-learning-market-178140
    Dataset authored and provided by
    Stats N Data
    License

    https://www.statsndata.org/how-to-order

    Area covered
    Global
    Description

    The Mandarin Learning market has seen significant growth over the past decade, reflecting the increasing global interest in China as a major economic powerhouse and cultural influencer. With Mandarin being the most spoken language in the world, the demand for Mandarin language education has spiked, creating a divers

Statista (2025). The most spoken languages worldwide 2025 [Dataset]. https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/

The most spoken languages worldwide 2025

442 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Apr 14, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
2025
Area covered
World
Description

In 2025, there were around 1.53 billion people worldwide who spoke English either natively or as a second language, slightly more than the 1.18 billion Mandarin Chinese speakers at the time of survey. Hindi and Spanish accounted for the third and fourth most widespread languages that year.

Languages in the United States

The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken in the United States vary as a result of the multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which more than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese), 1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year.

Different languages at home

The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage of population speaking a language other than English is California. About 45 percent of its population was speaking a language other than English at home in 2023.
