73 datasets found
  1. The most spoken languages worldwide 2025

    • statista.com
    Cite
    Statista, The most spoken languages worldwide 2025 [Dataset]. https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/
    Explore at:
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2025
    Area covered
    World
    Description

    In 2025, around 1.53 billion people worldwide spoke English either natively or as a second language, more than the 1.18 billion Mandarin Chinese speakers at the time of the survey. Hindi and Spanish were the third and fourth most widely spoken languages that year.

    Languages in the United States

    The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken there vary as a result of the multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which more than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese), 1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year.

    Different languages at home

    The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage is California, where about 45 percent of the population spoke a language other than English at home in 2023.

  2. 🌍📚 World Languages Dataset 🌍📚

    • kaggle.com
    zip
    Updated Jul 30, 2024
    Cite
    Waqar Ali (2024). 🌍📚 World Languages Dataset 🌍📚 [Dataset]. https://www.kaggle.com/datasets/waqi786/world-languages-dataset
    Explore at:
    Available download formats: zip (5706 bytes)
    Dataset updated
    Jul 30, 2024
    Authors
    Waqar Ali
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Area covered
    World
    Description

    This dataset provides a comprehensive overview of 500 languages spoken around the world. It captures essential linguistic features, including language families, geographical regions, writing systems, and the estimated number of native speakers. This dataset aims to highlight the rich diversity of languages and their cultural significance, offering valuable insights for linguists, researchers, and enthusiasts interested in global language distribution.

    The dataset contains real and accurate records for 500 languages across different regions and linguistic families. It covers a diverse range of languages, from widely spoken ones like English and Mandarin to less commonly known languages. The data was meticulously compiled to reflect the authentic linguistic landscape and provide a valuable resource for language studies and cultural analysis.

  3. Most spoken languages worldwide in Millions

    • kaggle.com
    zip
    Updated Oct 14, 2023
    Cite
    Batros Jamali (2023). Most spoken languages worldwide in Millions [Dataset]. https://www.kaggle.com/datasets/batrosjamali/most-spoken-languages-worldwide-in-millions
    Explore at:
    Available download formats: zip (585 bytes)
    Dataset updated
    Oct 14, 2023
    Authors
    Batros Jamali
    Area covered
    World
    Description

    Dataset

    This dataset was created by Batros Jamali

    Contents

  4. Common languages used for web content 2025, by share of websites

    • statista.com
    Cite
    Statista, Common languages used for web content 2025, by share of websites [Dataset]. https://www.statista.com/statistics/262946/most-common-languages-on-the-internet/
    Explore at:
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Oct 2025
    Area covered
    Worldwide
    Description

    As of October 2025, English was the dominant language for online content, used by nearly half of all websites worldwide. Spanish ranked second, accounting for around 6 percent of web content, followed by German with 5.9 percent.

    English as the leading online language

    The United States and India, the countries with the most internet users after China, are also the world's biggest English-speaking markets. As of January 2023, the combined internet user base of the two countries was over a billion individuals. This has led to most online information being created in English. Consequently, even those who are not native speakers may use it for convenience.

    Global internet usage by regions

    As of October 2024, the number of internet users worldwide was 5.52 billion. In the same period, Northern Europe and North America led in internet penetration rates worldwide, with around 97 percent of their populations accessing the internet.

  5. The Most Spoken Languages Around the World

    • kaggle.com
    zip
    Updated Nov 4, 2020
    Cite
    Benno Narmelan (2020). The Most Spoken Languages Around the World [Dataset]. https://www.kaggle.com/narmelan/100-most-spoken-languages-around-the-world
    Explore at:
    Available download formats: zip (1894 bytes)
    Dataset updated
    Nov 4, 2020
    Authors
    Benno Narmelan
    License

    Open Database License (ODbL) v1.0 (https://www.opendatacommons.org/licenses/odbl/1.0/)
    License information was derived automatically

    Area covered
    World
    Description

    Context

    After going through quite the verbal loop when ordering foreign currency through the bank, which involved a discussion with an assigned financial advisor at the branch the following day to confirm details, I noticed despite our names hinting at the assumed typical background similarities, communication by phone was much more difficult due to the thickness in accents and different speech patterns when voicing from a non-native speaker.

    It hit me then coming from an extremely multicultural and welcoming city, the challenges others from completely different labels given to them in life must go through in their daily affairs when having to face communication barriers that I myself encountered, particularly when interacting with those outside their usual bubble. Now imagine this situation occurring every hour across the world in various sectors of business. How may this impede, help or create frustrations in minor or major ways as a result of increasing workplace diversity quota demands, customer satisfaction needs and process efficiencies?

    The data I was looking for to explore this phenomenon existed in the form of native and non-native speakers of the 100 most commonly spoken languages across the globe.

    Content

    The data in this database contains the following attributes:

    • Language - name of the language
    • Total Speakers - this assumes both native and non-native speakers
    • Native Speakers - native speakers of the language
    • Origin - family origin group of said language

    Acknowledgements

    The data was collected with the aid of WordTips visualization of the 22nd edition of Ethnologue - "a research center for language intelligence"

    https://www.ethnologue.com/world https://www.ethnologue.com/guides/ethnologue200 https://word.tips/pictures/b684e98f-f512-4ac0-96a4-0efcf6decbc0_most-spoken-languages-world-5.png?auto=compress,format&rect=0,0,2001,7115&w=800&h=2845

    Inspiration

    As globalization no longer constrains us, what implications will this have in terms of organizational communications conducted moving forward? I believe this is something to be examined in careful context in order to make customer relationship processes meaningful rather than it being confined to a strictly detached transactional basis.

  6. Most spoken Indian languages worldwide 2025

    • statista.com
    Updated Jun 10, 2025
    Cite
    Statista (2025). Most spoken Indian languages worldwide 2025 [Dataset]. https://www.statista.com/statistics/1614099/worldwide-indian-languages-spoken/
    Explore at:
    Dataset updated
    Jun 10, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2025
    Area covered
    India
    Description

    As of 2025, ***** was the most spoken Indian language worldwide and ranked third globally, with approximately *** million speakers. ******* was the second most spoken Indian language, with approximately *** million speakers globally.

  7. Ranking of languages spoken at home in the U.S. 2024, by number of speakers

    • statista.com
    Updated Nov 28, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2025). Ranking of languages spoken at home in the U.S. 2024, by number of speakers [Dataset]. https://www.statista.com/statistics/183483/ranking-of-languages-spoken-at-home-in-the-us-in-2008/
    Explore at:
    Dataset updated
    Nov 28, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2024
    Area covered
    United States
    Description

    In 2024, some 45 million people in the United States spoke Spanish at home. In comparison, the second most spoken non-English language in U.S. households was Chinese, at just 3.7 million speakers.

  8. List of languages by total number of speakers

    • kaggle.com
    zip
    Updated Feb 28, 2023
    Cite
    Raj Kumar Pandey (2023). List of languages by total number of speakers [Dataset]. https://www.kaggle.com/datasets/rajkumarpandey02/list-of-languages-by-total-number-of-speakers/suggestions
    Explore at:
    Available download formats: zip (1622 bytes)
    Dataset updated
    Feb 28, 2023
    Authors
    Raj Kumar Pandey
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    About Dataset

    Context

    It is difficult to define what constitutes a language as opposed to a dialect. For example, Chinese and Arabic are sometimes considered to be single languages, but each includes several mutually unintelligible varieties and so they are sometimes considered language families instead. Conversely, colloquial registers of Hindi and Urdu are almost completely mutually intelligible, and are sometimes classified as one language, Hindustani, instead of two separate languages. Such rankings should be used with caution, because it is not possible to devise a coherent set of linguistic criteria for distinguishing languages in a dialect continuum.

    Content

    This dataset contains the complete list of the world's most spoken languages, with speaker counts in millions, as provided by Wikipedia.

    Data Columns:

    • Index of Serial Number
    • Name of the Languages
    • Name of the Family
    • Name of the Branch
    • First Languages or (Native Languages)
    • Second Languages or (Neighboring Language)
    • Total Speakers (L1+L2)

  9. The most linguistically diverse countries worldwide 2025, by number of...

    • statista.com
    Cite
    Statista, The most linguistically diverse countries worldwide 2025, by number of languages [Dataset]. https://www.statista.com/statistics/1224629/the-most-linguistically-diverse-countries-worldwide-by-number-of-languages/
    Explore at:
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2025
    Area covered
    World
    Description

    Papua New Guinea is the most linguistically diverse country in the world. As of 2025, it was home to 840 different languages. Indonesia ranked second with 709 languages spoken. In the United States, 335 languages were spoken in that same year.

  10. Number of native Spanish speakers worldwide 2024, by country

    • hazel.com.ua
    • monwebsite.ch
    • +5 more
    Updated Jan 15, 2025
    Cite
    Statista (2025). Number of native Spanish speakers worldwide 2024, by country [Dataset]. https://hazel.com.ua/?p=2385236
    Explore at:
    Dataset updated
    Jan 15, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    World
    Description

    Mexico is the country with the largest number of native Spanish speakers in the world. As of 2024, 132.5 million people in Mexico spoke Spanish with a native command of the language. Colombia was the nation with the second-highest number of native Spanish speakers, at around 52.7 million. Spain came in third, with 48 million, and Argentina fourth, with 46 million.

    Spanish, a world language

    As of 2023, Spanish ranked as the fourth most spoken language in the world, behind only English, Chinese, and Hindi, with over half a billion speakers. Spanish is the official language of over 20 countries, most of them on the American continent; it is also one of the official languages of Equatorial Guinea in Africa. Spanish also has a strong presence in other countries, such as the United States, Morocco, and Brazil, which are included in the list of non-Hispanic countries with the highest number of Spanish speakers.

    The second most spoken language in the U.S.

    In the most recent data, Spanish ranked as the language other than English with the highest number of speakers in the U.S., with 12 times as many speakers as the second-place language. This comes as no surprise given the long history of migration from Latin American countries to the United States. Moreover, in fiscal year 2022 alone, 5 of the top 10 countries of origin of naturalized people in the U.S. were Spanish-speaking countries.

  11. Evidence of Universal Language Structure From Speakers Whose Language...

    • datacatalogue.ukdataservice.ac.uk
    Updated Oct 11, 2023
    Cite
    Culbertson, J, University of Edinburgh; Alexander, M, University of Groningen; Patrick, K, University of Edinburgh; Klaus, A, UCL; David, A, QMUL (2023). Evidence of Universal Language Structure From Speakers Whose Language Violates It, 2017-2022 [Dataset]. http://doi.org/10.5255/UKDA-SN-856694
    Explore at:
    Dataset updated
    Oct 11, 2023
    Authors
    Culbertson, J, University of Edinburgh; Alexander, M, University of Groningen; Patrick, K, University of Edinburgh; Klaus, A, UCL; David, A, QMUL
    Area covered
    Kenya
    Description

    There is a longstanding debate in cognitive science surrounding the source of commonalities among languages of the world. Indeed, there are many potential explanations for such commonalities: accidents of history, common processes of language change, memory limitations, constraints on linguistic representations, etc. Recent research has used psycholinguistic experiments to provide empirical evidence linking common linguistic patterns to specific features of human cognition, but these experiments tend to use English speakers, who in many cases have direct experience with precisely the common patterns of interest. Here, we highlight the importance of testing populations whose languages go against cross-linguistic trends. We investigate whether the word-order preferences of monolingual speakers of Kîîtharaka, which has an unusual way of ordering words, mirror those of English speakers. We find that they do, supporting the hypothesis that universal cognitive representations play a role in shaping word order.

    Languages can be very different from each other. For example, just focussing on the order of words, languages like English put adjectives before nouns ('red house') while languages like Thai put them afterwards ('house red'). Similarly, languages like Vietnamese put numerals before nouns ('three houses'), while others, like Kitharaka (spoken in Kenya), put numerals after ('houses three'). If word ordering were simply due to happenstance, we would expect to see all different orders appearing in equal proportion across languages, but we don't find that. In fact, some orders are very common, some are very rare, and some don't seem to appear at all. For example, many languages are ordered like English ('three red houses'), and many are also ordered like Thai, which is exactly the reverse ('houses red three'). But the Kitharaka order ('houses three red') is much rarer, and its mirror image ('red three houses') never seems to occur. Why is this?

    One of the major controversies in the language sciences is whether we need to appeal to the basic set-up of the human mind to explain the ways languages can vary, or whether these properties are instead a result of cultural differences in communication and social interaction. A great deal of recent work coming from the perspective of psychology assumes the latter: that the properties of language can be boiled down to communication, interaction and the vagaries of history, while most work in linguistics assumes the former: there must be biases in the human mind that allow us to learn languages of particular types more easily than others. This project seeks to resolve that issue.

    In order to do this, we test how well people learn languages of various types, to see whether their behaviour follows the general tendencies we see across real languages. Importantly, we use artificially constructed languages, rather than natural languages, in order to make sure that they only differ in the crucial respects. For example, we present English speakers with artificial languages that use word orders from Thai and Kitharaka. If Thai orders are more common across languages than Kitharaka ones because the former are easier to learn, then we should see this reflected in the behaviour of learners in our experiments. We can also see whether such patterns are always harder to learn, or if speaking a language which uses them, like Kitharaka, makes them easier to pick up in a new language. To do this, our experiments compare English, Thai, Vietnamese and Kitharaka speakers. If our learners all show the same kinds of patterns in how they learn our artificial languages that we find across real languages, that will suggest that the way languages vary is not random, nor is it entirely a product of historical facts. Rather it would suggest that there are universal cognitive biases at play.

    We plan to look at not just the basic question of what orders appear, but also two other well-known cases where languages don't seem to vary randomly. The first relates to how words like adjectives and numbers are placed relative to the nouns they modify: most languages place them both before or both after the noun (like English and Thai), rather than putting them on opposite sides (e.g., 'two houses red', like Vietnamese). We will test whether this type of pattern is always easier to learn in a new language. Second, we will look at whether people prefer to learn languages with suffixes (e.g., 'cat-s') rather than prefixes (e.g., 'un-happy'). Both types are present in English, but most languages have (more) suffixes. Our project will shed light on whether there are universal cognitive biases in language learning, if such biases are at play for the particular phenomena we look at, and how people's native languages affect these biases.

  12. Data from: Tracking the Global Pulse: The first public Twitter dataset from...

    • data.mendeley.com
    Updated May 27, 2025
    + more versions
    Cite
    kheir eddine daouadi (2025). Tracking the Global Pulse: The first public Twitter dataset from FIFA World Cup [Dataset]. http://doi.org/10.17632/gw3mcnbkwr.2
    Explore at:
    Dataset updated
    May 27, 2025
    Authors
    kheir eddine daouadi
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Area covered
    World
    Description

    The first public large-scale multilingual Twitter dataset related to the FIFA World Cup 2022, comprising over 28 million posts in 69 unique spoken languages, including Arabic, English, Spanish, French, and many others. This dataset aims to facilitate research in future sentiment analysis, cross-linguistic studies, event-based analytics, meme and hate speech detection, fake news detection, and social manipulation detection.

    The file 🚨Qatar22WC.csv🚨 contains tweet-level and user-level metadata for our collected tweets.

    🚀Codebook for FIFA World Cup 2022 Twitter Dataset🚀

    | Column Name | Description |
    |---|---|
    | day, month, year | Date the tweet was posted |
    | hou, min, sec | Hour, minute, and second of the tweet timestamp |
    | age_of_the_user_account | User account age in days |
    | tweet_count | Total number of tweets posted by the user |
    | location | User-defined location field |
    | follower_count | Number of followers the user has |
    | following_count | Number of accounts the user is following |
    | follower_to_Following | Follower-following ratio |
    | favouite_count | Number of tweets the user has liked |
    | verified | Boolean indicating if the user is verified (1 = Verified, 0 = Not Verified) |
    | Avg_tweet_count | Average number of tweets per day for the user |
    | list_count | Number of lists the user is a member of |
    | Tweet_Id | Tweet ID |
    | is_reply_tweet | ID of the tweet being replied to (if applicable) |
    | is_quote | Boolean indicating if the tweet is a quote |
    | retid | Retweet ID if it's a retweet; NaN otherwise |
    | lang | Language of the tweet |
    | hashtags | The keyword or hashtag used to collect the tweet |
    | is_image, | Boolean indicating if the tweet is associated with an image |
    | is_video | Boolean indicating if the tweet is associated with a video |
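    As a rough illustration of how the codebook columns might be used, the sketch below counts tweets per language and rebuilds a timestamp from the six date and time columns. It assumes the CSV headers are spelled exactly as in the codebook (including `hou`) and that these fields hold integers; neither is confirmed by the description. The sample rows are invented placeholders, not records from Qatar22WC.csv.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Placeholder rows using a few codebook columns; values are invented for
# illustration and do not come from Qatar22WC.csv.
SAMPLE_CSV = """day,month,year,hou,min,sec,lang,verified
20,11,2022,16,0,5,en,1
20,11,2022,16,1,30,ar,0
21,11,2022,9,15,0,en,0
"""


def tweet_timestamp(row):
    """Combine the six date/time columns into one datetime."""
    return datetime(
        int(row["year"]), int(row["month"]), int(row["day"]),
        int(row["hou"]), int(row["min"]), int(row["sec"]),
    )


def tweets_per_language(rows):
    """Count tweets by the value of the `lang` column."""
    return Counter(row["lang"] for row in rows)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
counts = tweets_per_language(rows)
first_ts = tweet_timestamp(rows[0])
print(counts.most_common(), first_ts.isoformat())
```

    With more than 28 million rows in the real file, streaming the reader row by row rather than materialising a list would likely be the safer choice.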

    Examples of use case queries are described in the file 🚨fifa_wc_qatar22_examples_of_use_case_queries.ipynb🚨 and accessible via: https://github.com/khairied/Qata_FIFA_World_Cup_22

    🚀 Please Cite This as: Daouadi, K. E., Boualleg, Y., Guehairia, O. & Taleb-Ahmed, A. (2025). Tracking the Global Pulse: The first public Twitter dataset from FIFA World Cup, Journal of Computational Social Science.

  13. FLORES-101

    • kaggle.com
    zip
    Updated Jun 7, 2021
    Cite
    Mathurin Aché (2021). FLORES-101 [Dataset]. https://www.kaggle.com/mathurinache/flores101
    Explore at:
    Available download formats: zip (13628027 bytes)
    Dataset updated
    Jun 7, 2021
    Authors
    Mathurin Aché
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Description

    Machine translation helps bridge the language barriers between people and information — but historically, research has focused on creating and evaluating translation systems for only a handful of languages, usually the few most spoken languages in the world. This excludes the billions of people worldwide who don’t happen to be fluent in languages such as English, Spanish, Russian, and Mandarin.

    We’ve recently made progress with machine translation systems like M2M-100, our open source model that can translate a hundred different languages. Further advances necessitate tools with which to test and compare these translation systems with one another, though.

    Today, we are open-sourcing FLORES-101, a first-of-its-kind, many-to-many evaluation data set covering 101 languages from all over the world. FLORES-101 is the missing piece, the tool that enables researchers to rapidly test and improve upon multilingual translation models like M2M-100.

    We’re making FLORES-101 publicly available because we believe in breaking down language barriers, and that means helping empower researchers to create more diverse (and locally relevant) translation tools, ones that may make it as easy to translate from, say, Bengali to Marathi as it is to translate from English to Spanish today. We’re making the full FLORES-101 data set, an accompanying tech report, and several models publicly available for the entire research community to use, to accelerate progress on many-to-many translation systems worldwide.

    Why evaluation matters

    Imagine trying to bake a cake but not being able to taste it. It’s near-impossible to know whether it’s any good, and even harder to know how to improve the recipe for future attempts.

    Evaluating how well translation systems perform has been a major challenge for AI researchers — and that knowledge gap has impeded progress. If researchers cannot measure or compare their results, they can’t develop better translation systems. The AI research community needed an open and easily accessible way to perform high-quality, reliable measurement of many-to-many translation model performance and then compare results with others.

    Previous work on this problem relied heavily on translating in and out of English, often using proprietary data sets. But while this benefited English speakers, it was and is insufficient for many parts of the world where people need fast and accurate translation between regional languages — for instance, in India, where the constitution recognizes over 20 official languages.

    FLORES-101 focuses on what are known as low-resource languages, such as Amharic, Mongolian, and Urdu, which do not currently have extensive data sets for natural language processing research. For the first time, researchers will be able to reliably measure the quality of translations through 10,100 different translation directions — for example, directly from Hindi to Thai or Swahili. For context, evaluating in and out of English would provide merely 200 translation directions.
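    The direction counts quoted above follow from simple counting: with n languages there are n × (n − 1) ordered source-target pairs, while routing everything through one pivot language gives 2 × (n − 1). A minimal sketch of that arithmetic, with function names chosen here purely for illustration:

```python
def translation_directions(num_languages):
    """Ordered (source, target) pairs with source != target."""
    return num_languages * (num_languages - 1)


def pivot_directions(num_languages):
    """Directions that go into or out of a single pivot language."""
    return 2 * (num_languages - 1)


all_pairs = translation_directions(101)
english_only = pivot_directions(101)
print(all_pairs, english_only)  # 10100 200
```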

    The flexibility exhibited by FLORES is possible because we designed around many-to-many translation from the start. The data set contains the same set of sentences across all languages, enabling researchers to evaluate the performance of any and all translation directions.

    “Efforts like FLORES are of immense value, because they not only draw attention to under-served languages, but they immediately invite and actively facilitate research on all these languages,” said Antonios Anastasopoulos, assistant professor at George Mason University’s Department of Computer Science.

    Building a benchmark

    Good benchmarks are difficult to construct. They need to be able to accurately reflect meaningful differences between models so they can be used by researchers to make decisions. Translation benchmarks can be particularly difficult because the same quality standard must be met across all languages, not just a select few for which translators are more readily available.

    To that end, we created the FLORES-101 data set in a multistep workflow. Each document was first translated by a professional translator, and then verified by a human editor. Next, it proceeded to the quality-control phase, including checks for spelling, grammar, punctuation, and formatting, and comparison with translations from commercial engines. After that, a different set of translators performed human evaluation, identifying errors across numerous categories including unnatural translation, register, and grammar. Based on the number and severity of the identified errors, the translations were either sent back for retranslation or, if they met quality standards, considered complete.

    Translation quality is not enough on its own, though. Th...

  14. GlobalPhone Chinese-Mandarin

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Jun 26, 2017
    + more versions
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2017). GlobalPhone Chinese-Mandarin [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0193/
    Explore at:
    Dataset updated
    Jun 26, 2017
    Dataset provided by
    ELRA (European Language Resources Association)
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Description

    The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks.

    The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).

    In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16bit, 16kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database.

    The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers.

    Data is shortened by means of the shorten program written by Tony Robinson. Alternatively, the data can be delivered unshortened.

    The Chinese-Mandarin corpus was produced using the Peoples Daily newspaper. It contains recordings of 132 speakers (64 males, 68 females) recorded in Beijing, Wuhan and Hekou, China. The age distribution is as follows: 16 speakers are below 19, 96 speakers are between 20 and 29, 16 speakers are between 30 and 39, and 3 speakers are between 40 and 49 (the age of 1 speaker is unknown).

  15. GlobalPhone Portuguese (Brazilian)

    • live.european-language-grid.eu
    • catalogue.elra.info
    audio format
    + more versions
    Cite
    GlobalPhone Portuguese (Brazilian) [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1912
    Explore at:
    Available download formats: audio format
    License

    http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Area covered
    Brazil
    Description

    The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks.

    The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).

    In each language, about 100 sentences were read by each of roughly 100 speakers. The read texts were selected from national newspapers available via Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers.

    The data is compressed with the shorten program written by Tony Robinson. Alternatively, the data can be delivered uncompressed.

    The Portuguese (Brazilian) corpus was produced using the Folha de Sao Paulo newspaper. It contains recordings of 102 speakers (54 males, 48 females) recorded in Porto Velho and Sao Paulo, Brazil. The following age distribution has been obtained: 6 speakers are below 19, 58 speakers are between 20 and 29, 27 speakers are between 30 and 39, 5 speakers are between 40 and 49, and 5 speakers are over 50 (1 speaker age is unknown).

  16. GlobalPhone Vietnamese

    • live.european-language-grid.eu
    • catalogue.elra.info
    audio format
    + more versions
    Cite
    GlobalPhone Vietnamese [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/2100
    Explore at:
    Available download formats: audio format
    License

    http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Description

    The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks.

    The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).

    In each language, about 100 sentences were read by each of roughly 100 speakers. The read texts were selected from national newspapers available via Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers.

    The data is compressed with the shorten program written by Tony Robinson. Alternatively, the data can be delivered uncompressed.

    The Vietnamese part of GlobalPhone was collected in summer 2009. In total, 160 speakers were recorded: 140 in the cities of Hanoi and Ho Chi Minh City in Vietnam, and an additional 20 in Karlsruhe, Germany. All speakers are native speakers of Vietnamese, covering the main dialectal variants from South and North Vietnam. Of these 160 speakers, 70 were female and 90 were male. The majority of speakers are well educated, being graduate students and engineers. The speakers range in age from 18 to 65 years. Each speaker read between 50 and 200 utterances from newspaper articles, corresponding to roughly 9.5 minutes of speech or 138 utterances per person; in total we recorded 22,112 utterances.

    The speech was recorded using a Sennheiser HM420 close-talking microphone in a push-to-talk scenario, using a laptop-based data collection toolkit developed in-house. All data were recorded at 16 kHz and 16-bit resolution in PCM format. The data collection took place in small rooms with very low background noise. Information on recording place and environmental noise conditions is provided in a separate speaker session file for each speaker.

    The speech data was recorded in two phases. In the first phase, data was collected from 140 speakers in the cities of Hanoi and Ho Chi Minh City. In the second phase, we selected utterances from the text corpus in order to cover rare Vietnamese phonemes; this phase was carried out with 20 Vietnamese graduate students who live in Karlsruhe. In sum, 22,112 utterances were spoken, corresponding to 25.25 hours of speech.

    The text data used for recording mainly came from news posted in the online editions of 15 Vietnamese newspaper websites; the first 12 were used for the training set, and the last three for the development and evaluation sets. The text data collected from the first 12 websites covers almost 4 million word tokens with a vocabulary of 30,000 words, resulting in an out-of-vocabulary rate of 0% on the development set and 0.067% on the evaluation set. For the text selection we followed the standard GlobalPhone protocols and focused on national and international politics and economics news (see [Schultz 2002]). The transcriptions are provided in Vietnamese-style Roman script, i.e. using several diacritics encoded in UTF-8.

    The Vietnamese data are organized in a training set of 140 speakers with 22.15 hours of speech, a development set of 10 speakers (6 from North and 4 from South Vietnam) with 1:40 hours of speech, and an evaluation set of 10 speakers, with the same gender and dialect distribution as the development set, with 1:30 hours of speech. More details on corpus statistics, collection scenario, and system building based on the Vietnamese part of GlobalPhone can be found in [Vu and Schultz, 2009, 2010].
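
The per-speaker averages quoted in the Vietnamese description can be cross-checked from the totals it gives. The snippet below is only a sanity-check sketch: the variable names are illustrative, and the numbers are copied from the description (reading "22.112" as twenty-two thousand one hundred and twelve).

```python
# Figures quoted in the Vietnamese corpus description.
speakers = 160          # 140 recorded in Vietnam + 20 recorded in Karlsruhe
utterances = 22_112     # total utterances recorded
total_hours = 25.25     # total speech duration in hours

# The train/dev/eval split should account for every speaker.
split = {"train": 140, "dev": 10, "eval": 10}
assert sum(split.values()) == speakers

# Averages per speaker, derived from the totals.
utterances_per_speaker = utterances / speakers
minutes_per_speaker = total_hours * 60 / speakers

print(f"{utterances_per_speaker:.1f} utterances per speaker")  # 138.2
print(f"{minutes_per_speaker:.2f} minutes per speaker")        # 9.47
```

Both derived values agree with the "roughly 9.5 minutes of speech or 138 utterances per person" stated in the description.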

    [Schultz 2002] Tanja Schultz (2002): GlobalPhone: A Multilingual Speech and Text Database developed at Karlsruhe University, Proceedings of the International Conference of Spoken Language Processing, ICSLP 2002, Denver, CO, September 2002.

    [Vu and Schultz, 2009] Ngoc Thang Vu, Tanja Schultz (2009): Vietnamese Large Vocabulary Continuous Speech Recognition, Automatic Speech Recognition and Understanding, ASRU 2009, Merano.

    [Vu and Schultz, 2010] Ngoc Thang Vu, Tanja Schultz (2010): Optimization On Vietnamese Large Vocabulary Speech Recognition, 2nd Workshop on Spoken Languages Technologies for Under-resourced Languages, SLTU 2010, Penang, Malaysia, May 2010.

  17. Atlas of Pidgin and Creole Language Structures

    • kaggle.com
    zip
    Updated Jul 28, 2017
    Cite
    Rachael Tatman (2017). Atlas of Pidgin and Creole Language Structures [Dataset]. https://www.kaggle.com/rtatman/atlas-of-pidgin-and-creole-language-structures
    Explore at:
    Available download formats: zip (340471 bytes)
    Dataset updated
    Jul 28, 2017
    Authors
    Rachael Tatman
    Description

    Context:

    When groups of people who don’t share a spoken language come together, they will often create a new language which combines elements of their first languages. These languages are known as “pidgins”. If they are then learned by children as their first language they become fully-fledged languages known as “creoles”. This dataset contains information on both creoles and pidgins spoken around the world.

    Content:

    This dataset includes information on the grammatical and lexical structures of 76 pidgin and creole languages. The language set contains not only the most widely studied Atlantic and Indian Ocean creoles, but also less well known pidgins and creoles from Africa, South Asia, Southeast Asia, Melanesia and Australia, including some extinct varieties, and several mixed languages.

    This dataset is made up of several tables, each of which contains different pieces of information:

    • language: A table of language names & the unique IDs associated with them.
    • language_data: A table of data on the different languages, including the name speakers call their language (Autoglossonym), other names the language is called, how many speakers it has, the language which contributed the most words to the language (Major lexifier), other languages which contribute to that language, where it is spoken, and where it is an official language. The column language_id has the id linked to the language table.
    • language_source: The sources referenced on each language (referencing the language and source tables).
    • langauge_table: Information on the geographic location of each language.
    • source: Information on the scholarly sources referenced for information on language.
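
Since language_data points back to the language table through language_id, the two tables can be combined with a simple key lookup. The sketch below uses only the Python standard library and small in-memory stand-ins for the CSV files. Apart from language_id, the column names (id, name, major_lexifier) are assumptions, and the rows are made-up placeholders rather than values from the dataset.

```python
import csv
import io

# Placeholder stand-ins for the language and language_data tables.
# Only language_id is named in the description; every other column
# name here is an assumption, and the rows are invented placeholders.
LANGUAGE_CSV = """id,name
1,Placeholder language A
2,Placeholder language B
"""

LANGUAGE_DATA_CSV = """language_id,major_lexifier
1,Lexifier X
2,Lexifier Y
"""


def join_language_tables(language_file, language_data_file):
    """Attach the language name to each language_data row via language_id."""
    names = {row["id"]: row["name"] for row in csv.DictReader(language_file)}
    joined = []
    for row in csv.DictReader(language_data_file):
        # names.get() leaves the name as None if an id has no match.
        joined.append({"name": names.get(row["language_id"]), **row})
    return joined


rows = join_language_tables(io.StringIO(LANGUAGE_CSV), io.StringIO(LANGUAGE_DATA_CSV))
for row in rows:
    print(row["name"], "-", row["major_lexifier"])
```

With the real files, the `io.StringIO` objects would be replaced by opened CSV files, and the column names adjusted to whatever headers the files actually use.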

    Acknowledgements:

    This dataset contains information from the online portion of the Atlas of Pidgin and Creole Language Structures (APiCS). It is distributed under a Creative Commons Attribution 3.0 Unported License. If you use this dataset in your work, please use this citation:

    Salikoko S. Mufwene. 2013. Kikongo-Kituba structure dataset. In: Michaelis, Susanne Maria & Maurer, Philippe & Haspelmath, Martin & Huber, Magnus (eds.) Atlas of Pidgin and Creole Language Structures Online. Leipzig: Max Planck Institute for Evolutionary Anthropology. (Available online at http://apics-online.info/contributions/58, Accessed on 2017-07-28.)

    Inspiration:

    • Which areas of the world have the most creoles/pidgins?
    • Which language has contributed to the most creoles/pidgins? Why might this be?
    • Can you map the areas of influence of the various lexicalized Major Lexifier languages?


  18. Language Learning Apps Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Sep 30, 2025
    + more versions
    Cite
    Dataintelo (2025). Language Learning Apps Market Research Report 2033 [Dataset]. https://dataintelo.com/report/language-learning-apps-market
    Explore at:
    Available download formats: pptx, pdf, csv
    Dataset updated
    Sep 30, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Language Learning Apps Market Outlook



    According to our latest research, the global Language Learning Apps market size reached USD 7.35 billion in 2024, reflecting a strong demand for digital language education solutions worldwide. The market is projected to grow at a CAGR of 18.2% during the forecast period from 2025 to 2033, reaching an estimated USD 34.85 billion by 2033. This robust growth is primarily driven by the increasing penetration of smartphones, the growing necessity for multilingual communication in an interconnected world, and the widespread adoption of e-learning methodologies across educational institutions and enterprises.
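
The relationship between a base value, a compound annual growth rate (CAGR), and a projected value can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and it assumes nine annual compounding periods (2024 to 2033), which the report does not spell out.

```python
def project(value, rate, years):
    """Compound `value` at annual `rate` for `years` years."""
    return value * (1 + rate) ** years


def implied_cagr(start, end, years):
    """Annual growth rate that takes `start` to `end` in `years` years."""
    return (end / start) ** (1 / years) - 1


# Figures quoted in the market outlook (USD billion).
base_2024 = 7.35
quoted_cagr = 0.182
quoted_2033 = 34.85

# Assumption: compounding runs over the nine years from 2024 to 2033.
years = 2033 - 2024

projected_2033 = project(base_2024, quoted_cagr, years)
back_solved = implied_cagr(base_2024, quoted_2033, years)

print(f"Projected 2033 value at the quoted CAGR: {projected_2033:.2f}")
print(f"CAGR implied by the quoted endpoints: {back_solved:.1%}")
```

Under this nine-period assumption the projection need not reproduce the quoted 2033 figure exactly; the report may use a different base year or rounding.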




    One of the most significant growth factors propelling the Language Learning Apps market is the rapid advancement of mobile technology and the proliferation of affordable smartphones and high-speed internet connectivity. With mobile devices becoming ubiquitous, users are increasingly seeking flexible, on-the-go educational solutions. Language learning apps are leveraging this trend by offering interactive, adaptive, and personalized learning experiences that cater to diverse learning styles and schedules. Furthermore, the integration of artificial intelligence, speech recognition, and gamification features has made these apps more engaging and effective, encouraging higher user retention rates and expanding the addressable market.




    Another critical driver is the globalization of business and education, which necessitates proficiency in multiple languages. Enterprises are investing in upskilling their workforce to enhance cross-border communication and collaboration, while educational institutions are incorporating digital language learning tools into their curricula. The COVID-19 pandemic further accelerated the shift towards digital learning, as remote and hybrid education models became the norm. Consequently, both individual learners and organizations are increasingly turning to language learning apps for their convenience, scalability, and cost-effectiveness, fueling sustained market growth.




    Additionally, the market is benefiting from the rising demand for English and other widely spoken languages such as Mandarin, Spanish, and French, especially in emerging economies. Governments and educational authorities are actively promoting language education to improve employability and global competitiveness. The increasing availability of regionally tailored content and the localization of apps to support less commonly taught languages are further broadening the user base. Strategic partnerships between app developers, educational institutions, and technology providers are fostering innovation and expanding the reach of language learning solutions to underserved populations.




    From a regional perspective, Asia Pacific is emerging as the fastest-growing market, driven by a large population of young learners, rapid digitalization, and strong government initiatives supporting education technology. North America and Europe continue to dominate in terms of market share, owing to high digital literacy, established educational infrastructure, and a strong presence of leading app developers. Meanwhile, Latin America and the Middle East & Africa are witnessing increasing adoption rates, supported by rising smartphone penetration and a growing emphasis on bilingual education. This regional diversification is expected to further enhance the global growth trajectory of the Language Learning Apps market throughout the forecast period.



    Product Type Analysis



    The Product Type segment of the Language Learning Apps market is primarily categorized into web-based and mobile-based solutions. Mobile-based language learning apps have witnessed a remarkable surge in popularity, owing to the widespread adoption of smartphones and tablets. These apps offer unparalleled convenience, allowing users to practice languages anytime and anywhere, which aligns perfectly with the modern learner’s lifestyle. The integration of push notifications, offline access, and interactive features such as voice recognition and gamification has further enhanced user engagement and learning outcomes. As a result, mobile-based solutions accounted for the largest share of the market in 2024, and this dominance is expected to continue throughout the forecast period.




    Web-based language learning platforms, while slightly lagging behind mobile apps in terms of u

  19. jampatoisnli

    • huggingface.co
    Updated Jul 21, 2023
    Cite
    Ruth-Ann Armstrong (2023). jampatoisnli [Dataset]. https://huggingface.co/datasets/Ruth-Ann/jampatoisnli
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 21, 2023
    Authors
    Ruth-Ann Armstrong
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for [Dataset Name]

      Dataset Summary
    

    JamPatoisNLI provides the first dataset for natural language inference in a creole language, Jamaican Patois. Many of the most-spoken low-resource languages are creoles. These languages commonly have a lexicon derived from a major world language and a distinctive grammar reflecting the languages of the original speakers and the process of language birth by creolization. This gives them a distinctive place in exploring the… See the full description on the dataset page: https://huggingface.co/datasets/Ruth-Ann/jampatoisnli.

  20. Most used programming languages among developers worldwide 2025

    • statista.com
    + more versions
    Cite
    Statista, Most used programming languages among developers worldwide 2025 [Dataset]. https://www.statista.com/statistics/793628/worldwide-developer-survey-most-used-languages/
    Explore at:
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    May 29, 2025 - Jun 23, 2025
    Area covered
    Worldwide
    Description

    As of 2025, JavaScript and HTML/CSS are the most commonly used programming languages among software developers around the world, with more than 66 percent of respondents stating that they used JavaScript and around 61.9 percent using HTML/CSS. Python, SQL, and Bash/Shell rounded out the top five most widely used programming languages around the world.

    Programming languages

    At a very basic level, programming languages serve as sets of instructions that direct computers on how to behave and carry out tasks. Thanks to the increased prevalence of, and reliance on, computers and electronic devices in today’s society, these languages play a crucial role in the everyday lives of people around the world. An increasing number of people are interested in furthering their understanding of these tools through courses and bootcamps, while current developers are constantly seeking new languages and resources to add to their skills. Furthermore, programming knowledge is becoming an important skill to possess within various industries throughout the business world. Job seekers with skills in Python, R, and SQL will find their knowledge to be among the most highly desirable data science skills, which is likely to assist them in their search for employment.

Statista, The most spoken languages worldwide 2025 [Dataset]. https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/

The most spoken languages worldwide 2025

Explore at:
464 scholarly articles cite this dataset (View in Google Scholar)
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
2025
Area covered
World
Description

In 2025, there were around 1.53 billion people worldwide who spoke English either natively or as a second language, slightly more than the 1.18 billion Mandarin Chinese speakers at the time of survey. Hindi and Spanish accounted for the third and fourth most widespread languages that year.

Languages in the United States

The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken in the United States vary as a result of the multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which more than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese), 1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year.

Different languages at home

The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage of population speaking a language other than English is California. About 45 percent of its population was speaking a language other than English at home in 2023.
