100+ datasets found
  1. The most spoken languages worldwide 2025

    • statista.com
    Updated Nov 19, 2025
    Statista (2025). The most spoken languages worldwide 2025 [Dataset]. https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/
    Dataset updated
    Nov 19, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2025
    Area covered
    World
    Description

    In 2025, around 1.53 billion people worldwide spoke English either natively or as a second language, ahead of the 1.18 billion Mandarin Chinese speakers at the time of the survey. Hindi and Spanish were the third and fourth most widely spoken languages that year.

    Languages in the United States

    The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken there vary as a result of the multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which more than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese), 1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year.

    Different languages at home

    The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage is California, where about 45 percent of the population spoke a language other than English at home in 2023.

  2. World Languages Dataset

    • kaggle.com
    zip
    Updated Jul 30, 2024
    WAQAR ALI (2024). World Languages Dataset [Dataset]. https://www.kaggle.com/datasets/waqi786/world-languages-dataset
    Available download formats: zip (5706 bytes)
    Dataset updated
    Jul 30, 2024
    Authors
    WAQAR ALI
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Area covered
    World
    Description

    This dataset provides a comprehensive overview of 500 languages spoken around the world. It captures essential linguistic features, including language families, geographical regions, writing systems, and the estimated number of native speakers. This dataset aims to highlight the rich diversity of languages and their cultural significance, offering valuable insights for linguists, researchers, and enthusiasts interested in global language distribution.

    The dataset contains real and accurate records for 500 languages across different regions and linguistic families. It covers a diverse range of languages, from widely spoken ones like English and Mandarin to less commonly known languages. The data was meticulously compiled to reflect the authentic linguistic landscape and provide a valuable resource for language studies and cultural analysis.

  3. Ranking of languages spoken at home in the U.S. 2024, by number of speakers

    • statista.com
    Updated Nov 28, 2025
    Statista (2025). Ranking of languages spoken at home in the U.S. 2024, by number of speakers [Dataset]. https://www.statista.com/statistics/183483/ranking-of-languages-spoken-at-home-in-the-us-in-2008/
    Dataset updated
    Nov 28, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2024
    Area covered
    United States
    Description

    In 2024, some 45 million people in the United States spoke Spanish at home. By comparison, Chinese, the second most common non-English language spoken in households, had just 3.7 million speakers.

  4. Most spoken languages worldwide in Millions

    • kaggle.com
    zip
    Updated Oct 14, 2023
    Batros Jamali (2023). Most spoken languages worldwide in Millions [Dataset]. https://www.kaggle.com/datasets/batrosjamali/most-spoken-languages-worldwide-in-millions
    Available download formats: zip (585 bytes)
    Dataset updated
    Oct 14, 2023
    Authors
    Batros Jamali
    Area covered
    World
    Description

    Dataset

    This dataset was created by Batros Jamali

    Contents

  5. Top Languages Spoken in the United States

    • kaggle.com
    zip
    Updated Oct 22, 2022
    The Devastator (2022). Top Languages Spoken in the United States [Dataset]. https://www.kaggle.com/datasets/thedevastator/top-languages-spoken-in-the-united-states
    Available download formats: zip (356420 bytes)
    Dataset updated
    Oct 22, 2022
    Authors
    The Devastator
    Area covered
    United States
    Description

    Top Languages Spoken in the United States

    The Impact of linguistics on Community and Business in America

    About this dataset

    Languages are an important part of daily life in the USA. Here is a table that shows the most common languages spoken in the USA, as well as a big spreadsheet which shows each CBSA (Core-Based Statistical Area, or urban area).

    Language usage varies widely throughout the United States. According to the latest census data, over 350 different languages are represented in homes across the country. The following table and spreadsheet provide more detailed information on language usage throughout the various states and cities in the US:

    Columns:

    • index: Index column for dataframe
    • Table with column headers in row 5 and row headers in column A: Contains language data for each CBSA (Core Based Statistical Area)
    • Unnamed: 1: Rank of CBSA by total number of speakers of all languages
    • Unnamed: 2: Name of CBSA
    • Unnamed: 3: Population of CBSA
    • Unnamed: 4: Percent of population that speaks English very well
    • Unnamed: 5 through Unnamed: 58: Languages spoken by at least 0.1% of the population, with corresponding percentages
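The `Unnamed: N` column names resemble what pandas assigns when header cells are blank, which would be consistent with the file having been read from row 1 even though the real column headers sit in row 5. A minimal sketch of reading such a file with the standard library follows. The function name, the `header_row` parameter, and the sample rows are all hypothetical placeholders, not content from the actual CSV.

```python
import csv
import io


def read_table_with_late_header(text: str, header_row: int = 5) -> list[dict[str, str]]:
    """Parse CSV text whose column names sit on `header_row` (1-based)."""
    rows = list(csv.reader(io.StringIO(text)))
    header = rows[header_row - 1]
    records = []
    for row in rows[header_row:]:
        if not any(cell.strip() for cell in row):
            continue  # skip blank spacer rows
        records.append(dict(zip(header, row)))
    return records


# Placeholder content only: four preamble rows, then the header on row 5.
SAMPLE = """Title line,,
Subtitle line,,
,,
Notes line,,
Rank,CBSA,Population
1,Example Metro A,1000
2,Example Metro B,500
"""

records = read_table_with_late_header(SAMPLE)
for record in records:
    print(record["Rank"], record["CBSA"], record["Population"])
```

Whether row 5 is the right offset for every file in the dataset is an assumption taken from the column description; it is worth checking the first few lines of each file before relying on it.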

    How to use the dataset

    1. This dataset can be used to understand the linguistic diversity of the United States, and to compare languages spoken across different states and cities.
    2. This data can also be used to explore trends in language usage over time.
    3. Businesses can use this dataset to identify which languages are most commonly spoken in the areas in which they operate and tailor their marketing or customer service accordingly.
    4. Schools could use this dataset to plan language-learning programs based on the needs of their community.
    5. Policymakers could use this data to better understand linguistic diversity in the United States and design programs to support bilingualism or multilingualism.

    Research Ideas

    1. Businesses can use this dataset to identify which languages are most commonly spoken in the areas in which they operate and cater their marketing or customer service accordingly.
    2. Schools could use this data to plan language-learning programs based on the needs of their community.
    3. Policymakers could use this dataset to better understand linguistic diversity in the United States and design programs to support bilingualism or multilingualism.

    Acknowledgements

    This dataset was created by Gary Hoover. The data was sourced from https://www.kaggle.com/garyhoov/us-languages

    License

    Unknown License - Please check the dataset description for more information.

    Columns

    File: Languages Spoken at Home by Urban Area = CBSA.csv

    File: US Languages Spoken at Home 2014.csv

    | Column name | Description |
    |:-------------------------------------------------------------------|:--------------|
    | Table with column headers in row 5 and row headers in column A | |

  6. Most spoken Indian languages worldwide 2025

    • statista.com
    Updated Jan 26, 2026
    Statista (2026). Most spoken Indian languages worldwide 2025 [Dataset]. https://www.statista.com/statistics/1614099/worldwide-indian-languages-spoken/
    Dataset updated
    Jan 26, 2026
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2025
    Area covered
    India
    Description

    As of 2025, ***** was the most spoken Indian language worldwide and ranked third globally, with approximately *** million speakers. ******* was the second most spoken Indian language, with approximately *** million speakers globally.

  7. Most Spoken Languages

    • dtbse.com
    csv, json
    Updated Feb 14, 2026
    Ethnologue / Wikipedia (2026). Most Spoken Languages [Dataset]. https://dtbse.com/dataset/most-spoken-languages
    Available download formats: json, csv
    Dataset updated
    Feb 14, 2026
    Dataset provided by
    Ethnologue (http://ethnologue.com/)
    Authors
    Ethnologue / Wikipedia
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    25 most spoken languages in the world ranked by total number of speakers, with native speaker counts, language family, writing system, and number of countries.

  8. Common languages used for web content 2025, by share of websites

    • statista.com
    Updated Feb 19, 2026
    Statista (2026). Common languages used for web content 2025, by share of websites [Dataset]. https://www.statista.com/statistics/262946/most-common-languages-on-the-internet/
    Dataset updated
    Feb 19, 2026
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Oct 2025
    Area covered
    Worldwide
    Description

    As of October 2025, English was the dominant language of online content, used by nearly half of all websites worldwide. Spanish ranked second, accounting for around 6 percent of global web content, followed closely by German at 5.9 percent.

    English as the leading online language

    The United States and India, the countries with the most internet users after China, are also the world's biggest English-speaking markets. The combined internet user base of both countries, as of January 2023, was over a billion individuals. As a result, most online information is created in English, and even those who are not native speakers may use it for convenience.

    Global internet usage by regions

    As of October 2024, the number of internet users worldwide was 5.52 billion. In the same period, Northern Europe and North America led in internet penetration rates worldwide, with around 97 percent of their populations accessing the internet.

  9. The Most Spoken Languages Around the World

    • kaggle.com
    zip
    Updated Nov 4, 2020
    Benno Narmelan (2020). The Most Spoken Languages Around the World [Dataset]. https://www.kaggle.com/narmelan/100-most-spoken-languages-around-the-world
    Available download formats: zip (1894 bytes)
    Dataset updated
    Nov 4, 2020
    Authors
    Benno Narmelan
    License

    Open Database License (ODbL) v1.0 (https://www.opendatacommons.org/licenses/odbl/1.0/)
    License information was derived automatically

    Area covered
    World
    Description

    Context

    After going through quite the verbal loop when ordering foreign currency through the bank, which involved a discussion with an assigned financial advisor at the branch the following day to confirm details, I noticed despite our names hinting at the assumed typical background similarities, communication by phone was much more difficult due to the thickness in accents and different speech patterns when voicing from a non-native speaker.

    It hit me then coming from an extremely multicultural and welcoming city, the challenges others from completely different labels given to them in life must go through in their daily affairs when having to face communication barriers that I myself encountered, particularly when interacting with those outside their usual bubble. Now imagine this situation occurring every hour across the world in various sectors of business. How may this impede, help or create frustrations in minor or major ways as a result of increasing workplace diversity quota demands, customer satisfaction needs and process efficiencies?

    The data I was looking for to explore this phenomenon existed in the form of native and non-native speaker counts for the 100 most commonly spoken languages across the globe.

    Content

    The data in this database contains the following attributes:

    • Language - name of the language
    • Total Speakers - this assumes both native and non-native speakers
    • Native Speakers - native speakers of the language
    • Origin - family origin group of said language

    Acknowledgements

    The data was collected with the aid of WordTips visualization of the 22nd edition of Ethnologue - "a research center for language intelligence"

    https://www.ethnologue.com/world https://www.ethnologue.com/guides/ethnologue200 https://word.tips/pictures/b684e98f-f512-4ac0-96a4-0efcf6decbc0_most-spoken-languages-world-5.png?auto=compress,format&rect=0,0,2001,7115&w=800&h=2845

    Inspiration

    As globalization no longer constrains us, what implications will this have in terms of organizational communications conducted moving forward? I believe this is something to be examined in careful context in order to make customer relationship processes meaningful rather than it being confined to a strictly detached transactional basis.

  10. List of languages by total number of speakers

    • kaggle.com
    zip
    Updated Feb 28, 2023
    Raj Kumar Pandey (2023). List of languages by total number of speakers [Dataset]. https://www.kaggle.com/datasets/rajkumarpandey02/list-of-languages-by-total-number-of-speakers/code
    Available download formats: zip (1622 bytes)
    Dataset updated
    Feb 28, 2023
    Authors
    Raj Kumar Pandey
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    About Dataset

    Context

    It is difficult to define what constitutes a language as opposed to a dialect. For example, Chinese and Arabic are sometimes considered to be single languages, but each includes several mutually unintelligible varieties and so they are sometimes considered language families instead. Conversely, colloquial registers of Hindi and Urdu are almost completely mutually intelligible, and are sometimes classified as one language, Hindustani, instead of two separate languages. Such rankings should be used with caution, because it is not possible to devise a coherent set of linguistic criteria for distinguishing languages in a dialect continuum.

    Content

    This dataset contains the complete list of the world's most spoken languages, with speaker counts in millions, as provided by Wikipedia.

    Data Columns:

    • Index of Serial Number
    • Name of the Languages
    • Name of the Family
    • Name of the Branch
    • First Languages or (Native Languages)
    • Second Languages or (Neighboring Language)
    • Total Speakers (L1+L2)
  11. Spoken Language Statistics

    • zenodo.org
    bin, pdf, txt
    Updated Aug 4, 2024
    Alex Bampoulidis (2024). Spoken Language Statistics [Dataset]. http://doi.org/10.5281/zenodo.55708
    Available download formats: bin, pdf, txt
    Dataset updated
    Aug 4, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alex Bampoulidis
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Find out which are the top 10 most spoken languages in the world according to GeoNames, and preserve the data containing the information needed, since some countries get split or merged, some languages become extinct, and so on.

  12. Most common languages spoken at home in Australia by number of speakers 2016...

    • statista.com
    Updated Jul 15, 2017
    Statista (2017). Most common languages spoken at home in Australia by number of speakers 2016 [Dataset]. https://www.statista.com/statistics/987434/languages-spoken-home-australia-by-number-of-speakers/
    Dataset updated
    Jul 15, 2017
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2016
    Area covered
    Australia
    Description

    This statistic shows the languages spoken at home in Australia in 2016, by number of speakers. During the period examined, approximately 17 million people in Australia spoke English at home, while close to 600 thousand spoke Mandarin at home.

  13. Wikipedias in the world's most widely spoken languages - Chart

    • restofworld.org
    Updated Jan 30, 2026
    Rest of World (2026). Wikipedias in the world's most widely spoken languages - Chart [Dataset]. https://restofworld.org/charts/2026/GsmYK-wikipedias-worlds-widely-spoken-languages
    Dataset updated
    Jan 30, 2026
    Dataset authored and provided by
    Rest of World
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Area covered
    World
    Description

    Anyone who has made at least one edit in the last 30 days is counted as a contributor.

  14. Most common languages spoken in India 2011

    • statista.com
    Updated Jan 26, 2026
    Statista (2026). Most common languages spoken in India 2011 [Dataset]. https://www.statista.com/statistics/616508/most-common-languages-india/
    Dataset updated
    Jan 26, 2026
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2011
    Area covered
    India
    Description

    Hindi, with over *** million native speakers, was the most spoken language across Indian homes, followed by Bengali with ** million speakers, according to 2011 census data. English native speakers numbered about *** thousand during the measured period.

    The colonial rule in India

    One of the most remarkable and widespread legacies of British colonial rule was the English language. Before independence, English was the sole language used for higher education and in government and administrative processes. Post-independence, however, and to this day, Hindi has been promoted as the language with official government patronage. This led to resistance from the southern states of India, where Hindi did not have prominence. Consequently, parliament enacted the Official Languages Act of 1963, which ensured the continued use of English for official purposes alongside Hindi.

    Multi-linguistic cultures

    India has approximately ** major languages that are written in about ** different scripts. While the country's official languages are both English and Hindi, Hindi remains the most preferred language used online, especially in the northern rural areas. The use of English is becoming increasingly popular in urban areas. In addition, almost every state in India has its own official language that is studied in primary and secondary school as an obligatory second language. Among the most prominent are Bengali, Marathi, and Telugu.

  15. First official language spoken by language spoken most often at home, other...

    • www150.statcan.gc.ca
    • open.canada.ca
    csv, html
    Updated Aug 17, 2022
    Government of Canada, Statistics Canada (2022). First official language spoken by language spoken most often at home, other language(s) spoken regularly at home, and knowledge of official languages: Canada, provinces and territories, census divisions and census subdivisions [Dataset]. http://doi.org/10.25318/9810019101-eng
    Available download formats: csv, html
    Dataset updated
    Aug 17, 2022
    Dataset provided by
    Statistics Canada (https://statcan.gc.ca/en)
    Authors
    Government of Canada, Statistics Canada
    License

    https://www.statcan.gc.ca/en/terms-conditions/open-licence

    Area covered
    Canada
    Description

    Data on first official language spoken, language spoken most often at home, other language(s) spoken regularly at home, knowledge of official languages and age for the population excluding institutional residents.

  16. World Languages

    • dtbse.com
    csv, json
    Updated Mar 6, 2026
    Ethnologue (2026). World Languages [Dataset]. https://dtbse.com/dataset/world-languages
    Available download formats: json, csv
    Dataset updated
    Mar 6, 2026
    Dataset authored and provided by
    Ethnologue (http://ethnologue.com/)
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Area covered
    World
    Description

    The most spoken and influential languages in the world, from Mandarin's billion speakers to Icelandic's preserved roots.

  17. Most Popular Programming Languages 2004-2024

    • kaggle.com
    zip
    Updated Sep 15, 2024
    Muhammad Roshan Riaz (2024). Most Popular Programming Languages 2004-2024 [Dataset]. https://www.kaggle.com/datasets/muhammadroshaanriaz/most-popular-programming-languages-2004-2024
    Available download formats: zip (3491 bytes)
    Dataset updated
    Sep 15, 2024
    Authors
    Muhammad Roshan Riaz
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This dataset contains the following columns:

    • Month: The date (in year-month format) when the data was recorded.
    • Python Worldwide(%): The percentage of global popularity for Python during that month.
    • JavaScript Worldwide(%): The percentage of global popularity for JavaScript.
    • Java Worldwide(%): The percentage of global popularity for Java.
    • C# Worldwide(%): The percentage of global popularity for C#.
    • PhP Worldwide(%): The percentage of global popularity for PhP.
    • Flutter Worldwide(%): The percentage of global popularity for Flutter.
    • React Worldwide(%): The percentage of global popularity for React.
    • Swift Worldwide(%): The percentage of global popularity for Swift.
    • TypeScript Worldwide(%): The percentage of global popularity for TypeScript.
    • Matlab Worldwide(%): The percentage of global popularity for Matlab.

    Each row represents data for a particular month, starting from January 2004, tracking the popularity trends of these programming languages worldwide.

  18. First official language spoken by language spoken most often at home: Census...

    • www150.statcan.gc.ca
    • open.canada.ca
    csv, html
    Updated Aug 17, 2022
    Government of Canada, Statistics Canada (2022). First official language spoken by language spoken most often at home: Census metropolitan areas, tracted census agglomerations and census tracts [Dataset]. http://doi.org/10.25318/9810019401-eng
    Available download formats: html, csv
    Dataset updated
    Aug 17, 2022
    Dataset provided by
    Statistics Canada (https://statcan.gc.ca/en)
    Authors
    Government of Canada, Statistics Canada
    License

    https://www.statcan.gc.ca/en/terms-conditions/open-licence

    Area covered
    Canada
    Description

    Data on first official language spoken, language spoken most often at home, age and gender for the population excluding institutional residents for census metropolitan areas, tracted census agglomerations and census tracts.

  19. Detailed Language Spoken Most Often at Home (232), Detailed Other Languages...

    • gimi9.com
    Updated May 2, 2012
    (2012). Detailed Language Spoken Most Often at Home (232), Detailed Other Languages Spoken Regularly at Home (233), Age Groups (17A) and Sex (3) for the Population Excluding Institutional Residents of Canada, Provinces, Territories, Census Metropolitan Areas and [Dataset]. https://gimi9.com/dataset/ca_0460604b-cfa5-4684-85c2-a69e30fc1965/
    Dataset updated
    May 2, 2012
    Area covered
    Canada
    Description

    This table is part of a series of tables that present a portrait of Canada based on the various census topics. The tables range in complexity and levels of geography. Content varies from a simple overview of the country to complex cross-tabulations; the tables may also cover several censuses.

  20. Spoken Language Identification

    • kaggle.com
    zip
    Updated Jul 5, 2018
    Tomasz (2018). Spoken Language Identification [Dataset]. https://www.kaggle.com/toponowicz/spoken-language-identification
    Available download formats: zip (16022179692 bytes)
    Dataset updated
    Jul 5, 2018
    Authors
    Tomasz
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The dataset contains speech samples of English, German and Spanish languages. Samples are equally balanced between languages, genders and speakers.

    More information at the spoken-language-dataset repository.

    Background

    The project was inspired by the TopCoder contest, Spoken Languages 2. The given dataset contains 10-second speech samples, each recorded in 1 of 176 languages, and is based entirely on Bible readings. Unfortunately, in many cases there is a single speaker per language (male in most cases). Even worse, the same single speaker appears in the test set. This cannot lead to a good generic solution.

    There are two approaches we can take:

    • The first approach is to use a big dataset where all voice or language properties (e.g. gender, age, accent) become equally likely. A good example is Common Voice from Mozilla. Most likely this leads to the best performance. However, processing such a huge dataset is expensive and adding new languages is challenging.
    • The second approach is to use a small handcrafted dataset and boost it with data augmentation. The advantage is that we can add new languages quickly. Last but not least, the dataset is small, so it can be processed quickly.

    The second approach has been taken.

    LibriVox recordings were used to prepare the dataset. Particular attention was paid to a big variety of unique speakers. Big variance forces the model to concentrate more on language properties than a specific voice. Samples are equally balanced between languages, genders and speakers in order not to favour any subgroup. Finally the dataset is divided into train and test set. Speakers present in the test set, are not present in the train set. This helps estimate a generalization error.

    The core of the train set is based on 420 minutes (2520 samples) of original recordings. After applying several audio transformations (pitch, speed and noise) the train set was extended to 12180 minutes (73080 samples). The test set contains 90 minutes (540 samples) of original recordings. No data augmentation has been applied.

    Original recordings contain 90 unique speakers. The number of unique speakers was increased by adjusting pitch (8 different levels) and speed (8 different levels). After applying audio transformations there are 1530 unique speakers.
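The sample and speaker counts quoted above can be cross-checked with a little arithmetic. The sketch below assumes that every original train sample yields exactly one augmented copy per transformation level (8 speed, 8 pitch, and the 12 noise levels listed in the filename syntax), and that only pitch and speed changes count as new speakers. Both assumptions are inferred from the figures, not stated by the author; the variable names are illustrative.

```python
SAMPLE_SECONDS = 10            # each sample is 10 seconds long
ORIGINAL_TRAIN_SAMPLES = 2520  # 420 minutes of original recordings
ORIGINAL_TEST_SAMPLES = 540    # 90 minutes, no augmentation
ORIGINAL_SPEAKERS = 90
SPEED_LEVELS, PITCH_LEVELS, NOISE_LEVELS = 8, 8, 12

# One original plus one copy per transformation level (assumption).
copies_per_sample = 1 + SPEED_LEVELS + PITCH_LEVELS + NOISE_LEVELS

train_samples = ORIGINAL_TRAIN_SAMPLES * copies_per_sample
train_minutes = train_samples * SAMPLE_SECONDS // 60
test_minutes = ORIGINAL_TEST_SAMPLES * SAMPLE_SECONDS // 60

# Noise is assumed not to create a new "speaker"; pitch and speed are.
unique_speakers = ORIGINAL_SPEAKERS * (1 + SPEED_LEVELS + PITCH_LEVELS)

print(copies_per_sample, train_samples, train_minutes, test_minutes, unique_speakers)
```

Under these assumptions the results match the stated totals of 73080 train samples, 12180 train minutes, 90 test minutes, and 1530 unique speakers.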

    Data structure

    The dataset is divided into 2 directories:

    • train (73080 samples)
    • test (540 samples)

    Each sample is a FLAC audio file with:

    • sample rate: 22050
    • bit depth: 16
    • channels: 1
    • duration: 10 seconds (sharp)

    The original recordings are MP3 files, but they are converted to FLAC files early on to avoid re-encoding (and losing quality) during transformations.

    The filename of the sample has following syntax:

    (language)_(gender)_(recording ID).fragment(index)[.(transformation)(index)].flac
    

    ...and variables:

    • language: en, de, or es
    • gender: m or f
    • recording ID: a hash of the URL
    • fragment index: 1-30
    • transformation: speed, pitch or noise
    • transformation index:
      • if speed: 1-8
      • if pitch: 1-8
      • if noise: 1-12

    For example:

    es_m_f7d959494477e5e7e33d4666f15311c9.fragment9.speed8.flac
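The filename syntax above can be parsed with a regular expression. The following is a minimal sketch, not code from the dataset's repository: the names `SAMPLE_NAME`, `SampleInfo`, and `parse_sample_name` are hypothetical, and the pattern assumes the recording ID is a lowercase hexadecimal hash, as in the example filename. It does not validate the documented index ranges (1-30, 1-8, 1-12).

```python
import re
from typing import NamedTuple, Optional

# Mirrors: (language)_(gender)_(recording ID).fragment(index)[.(transformation)(index)].flac
SAMPLE_NAME = re.compile(
    r"^(?P<language>en|de|es)_(?P<gender>[mf])_(?P<recording_id>[0-9a-f]+)"
    r"\.fragment(?P<fragment>\d+)"
    r"(?:\.(?P<transformation>speed|pitch|noise)(?P<level>\d+))?"
    r"\.flac$"
)


class SampleInfo(NamedTuple):
    language: str
    gender: str
    recording_id: str
    fragment: int
    transformation: Optional[str]  # None for an untransformed sample
    level: Optional[int]


def parse_sample_name(name: str) -> SampleInfo:
    """Split a sample filename into its documented components."""
    match = SAMPLE_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized sample filename: {name!r}")
    level = match.group("level")
    return SampleInfo(
        language=match.group("language"),
        gender=match.group("gender"),
        recording_id=match.group("recording_id"),
        fragment=int(match.group("fragment")),
        transformation=match.group("transformation"),
        level=int(level) if level is not None else None,
    )


info = parse_sample_name("es_m_f7d959494477e5e7e33d4666f15311c9.fragment9.speed8.flac")
print(info.language, info.gender, info.fragment, info.transformation, info.level)
```

Parsing labels from filenames this way would allow grouping samples by language, gender, or recording when building train and validation splits.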
    

    Sample Model

    The dataset was used to train a spoken language identification model. The trained model achieves a 97% score (F1 metric) on the test set. It also generalizes well, which was confirmed against real-life content. The fact that the samples are perfectly stratified was one of the reasons for such high performance.

    Feel free to create your own model and share results!

The most spoken languages worldwide 2025

455 scholarly articles cite this dataset (View in Google Scholar)