31 datasets found
  1. The most spoken languages worldwide 2025

    • statista.com
    Cite
    Statista, The most spoken languages worldwide 2025 [Dataset]. https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/
    Explore at:
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2025
    Area covered
    World
    Description

    In 2025, there were around 1.53 billion people worldwide who spoke English either natively or as a second language, slightly more than the 1.18 billion Mandarin Chinese speakers at the time of survey. Hindi and Spanish ranked as the third and fourth most widely spoken languages that year.

    Languages in the United States

    The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken there vary as a result of the multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which more than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese), 1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year.

    Different languages at home

    The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage of population speaking a language other than English is California. About 45 percent of its population spoke a language other than English at home in 2023.

  2. The Most Spoken Languages Around the World

    • kaggle.com
    zip
    Updated Nov 4, 2020
    Cite
    Benno Narmelan (2020). The Most Spoken Languages Around the World [Dataset]. https://www.kaggle.com/narmelan/100-most-spoken-languages-around-the-world
    Explore at:
    Available download formats: zip (1894 bytes)
    Dataset updated
    Nov 4, 2020
    Authors
    Benno Narmelan
    License

    Open Database License (ODbL) v1.0 (https://www.opendatacommons.org/licenses/odbl/1.0/)
    License information was derived automatically

    Area covered
    World
    Description

    Context

    After going through quite the verbal loop when ordering foreign currency through the bank, which involved a discussion with an assigned financial advisor at the branch the following day to confirm details, I noticed that, although our names hinted at similar backgrounds, communication by phone was much more difficult because of thick accents and the different speech patterns of a non-native speaker.

    It hit me then, as someone from an extremely multicultural and welcoming city, what challenges others with completely different labels given to them in life must go through in their daily affairs when they face the communication barriers that I myself encountered, particularly when interacting with those outside their usual bubble. Now imagine this situation occurring every hour across the world in various sectors of business. How might this impede, help, or create frustrations in minor or major ways as a result of increasing workplace diversity quota demands, customer satisfaction needs, and process efficiencies?

    The data I was looking for to explore this phenomenon existed in the form of native and non-native speakers of the 100 most commonly spoken languages across the globe.

    Content

    The data in this database contains the following attributes:

    • Language - name of the language
    • Total Speakers - this assumes both native and non-native speakers
    • Native Speakers - native speakers of the language
    • Origin - family origin group of said language
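
The attributes listed above support a simple derived measure: because Total Speakers covers both native and non-native speakers, the non-native count is the difference between the two columns. The following is a minimal sketch, assuming the file is a CSV whose headers match the attribute names exactly and whose counts are plain numbers. The sample rows and the function name `non_native_share` are invented for illustration and are not taken from the dataset.

```python
import csv
import io

# Hypothetical sample in the shape described above; the figures are
# placeholders for illustration, not values from the dataset.
SAMPLE_CSV = """Language,Total Speakers,Native Speakers,Origin
Language A,1000,400,Family X
Language B,500,450,Family Y
"""


def non_native_share(rows):
    """Return {language: fraction of its speakers who are non-native}."""
    shares = {}
    for row in rows:
        total = float(row["Total Speakers"])
        native = float(row["Native Speakers"])
        # Total Speakers includes native speakers, so the non-native
        # count is the difference between the two columns.
        shares[row["Language"]] = (total - native) / total
    return shares


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
shares = non_native_share(rows)
for language, share in shares.items():
    print(f"{language}: {share:.0%} non-native")
```

If the real file stores counts with thousands separators or unit suffixes, the `float(...)` conversions would need a cleaning step first.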

    Acknowledgements

    The data was collected with the aid of WordTips visualization of the 22nd edition of Ethnologue - "a research center for language intelligence"

    https://www.ethnologue.com/world
    https://www.ethnologue.com/guides/ethnologue200
    https://word.tips/pictures/b684e98f-f512-4ac0-96a4-0efcf6decbc0_most-spoken-languages-world-5.png?auto=compress,format&rect=0,0,2001,7115&w=800&h=2845

    Inspiration

    As globalization no longer constrains us, what implications will this have in terms of organizational communications conducted moving forward? I believe this is something to be examined in careful context in order to make customer relationship processes meaningful rather than it being confined to a strictly detached transactional basis.

  3. Common languages used for web content 2025, by share of websites

    • statista.com
    Cite
    Statista, Common languages used for web content 2025, by share of websites [Dataset]. https://www.statista.com/statistics/262946/most-common-languages-on-the-internet/
    Explore at:
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Oct 2025
    Area covered
    Worldwide
    Description

    As of October 2025, English was the dominant language for online content, used by nearly half of all websites worldwide. Spanish ranked second, accounting for around 6 percent of web content, followed by German with 5.9 percent.

    English as the leading online language

    The United States and India, the countries with the most internet users after China, are also the world's biggest English-speaking markets. As of January 2023, the combined internet user base of the two countries was over a billion individuals. This has led to most online information being created in English. Consequently, even those who are not native speakers may use it for convenience.

    Global internet usage by regions

    As of October 2024, the number of internet users worldwide was 5.52 billion. In the same period, Northern Europe and North America led in internet penetration rates worldwide, with around 97 percent of their populations accessing the internet.

  4. Ranking of languages spoken at home in the U.S. 2024, by number of speakers

    • statista.com
    Updated Nov 28, 2025
    Cite
    Statista (2025). Ranking of languages spoken at home in the U.S. 2024, by number of speakers [Dataset]. https://www.statista.com/statistics/183483/ranking-of-languages-spoken-at-home-in-the-us-in-2008/
    Explore at:
    Dataset updated
    Nov 28, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2024
    Area covered
    United States
    Description

    In 2024, some 45 million people in the United States spoke Spanish at home. In comparison, the second most spoken non-English language in U.S. households was Chinese, at just 3.7 million speakers. The distribution of the U.S. population by ethnicity can be accessed here. A ranking of the most spoken languages across the world can be accessed here.

  5. Which Language is predominantly spoken all over the world? - Original

    • univredlands.hub.arcgis.com
    Updated Mar 15, 2022
    Cite
    URSpatial (2022). Which Language is predominantly spoken all over the world? - Original [Dataset]. https://univredlands.hub.arcgis.com/maps/20074e3502ca4d5f9e3642302872b6f3
    Explore at:
    Dataset updated
    Mar 15, 2022
    Dataset authored and provided by
    URSpatial
    Area covered
    Description

    This map displays predominant language spoken all over the world. The pop-up contains information about the percentage of population speaking the predominant language. The main source of data is from the Central Intelligence Agency “World Factbook” - https://www.cia.gov/library/publications/the-world-factbook/.

  6. Number of native Spanish speakers worldwide 2024, by country

    • hazel.com.ua
    • monwebsite.ch
    • +5 more
    Updated Jan 15, 2025
    Cite
    Statista (2025). Number of native Spanish speakers worldwide 2024, by country [Dataset]. https://hazel.com.ua/?p=2385236
    Explore at:
    Dataset updated
    Jan 15, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    World
    Description

    Mexico is the country with the largest number of native Spanish speakers in the world. As of 2024, 132.5 million people in Mexico spoke Spanish with a native command of the language. Colombia was the nation with the second-highest number of native Spanish speakers, at around 52.7 million. Spain came in third, with 48 million, and Argentina fourth, with 46 million.

    Spanish, a world language

    As of 2023, Spanish ranked as the fourth most spoken language in the world, behind only English, Chinese, and Hindi, with over half a billion speakers. Spanish is the official language of over 20 countries, the majority on the American continent. It is also one of the official languages of Equatorial Guinea in Africa. The language has a strong influence in other countries as well, such as the United States, Morocco, and Brazil, which are among the non-Hispanic countries with the highest number of Spanish speakers.

    The second most spoken language in the U.S.

    In the most recent data, Spanish ranked as the language, other than English, with the highest number of speakers, with 12 times as many speakers as the language in second place. This comes as no surprise given the long history of migration from Latin American countries to the United States. Moreover, during fiscal year 2022 alone, 5 of the top 10 countries of origin of naturalized people in the U.S. were Spanish-speaking countries.

  7. Which Language is predominantly spoken all over the world?

    • univredlands.hub.arcgis.com
    Updated Mar 15, 2022
    Cite
    URSpatial (2022). Which Language is predominantly spoken all over the world? [Dataset]. https://univredlands.hub.arcgis.com/maps/9aeb06442df6444499693f96ecd1c6d5
    Explore at:
    Dataset updated
    Mar 15, 2022
    Dataset authored and provided by
    URSpatial
    Area covered
    Description

    This map displays predominant language spoken all over the world. The pop-up contains information about the percentage of population speaking the predominant language. The main source of data is from the Central Intelligence Agency “World Factbook” - https://www.cia.gov/library/publications/the-world-factbook/.

  8. The most linguistically diverse countries worldwide 2025, by number of...

    • statista.com
    Cite
    Statista, The most linguistically diverse countries worldwide 2025, by number of languages [Dataset]. https://www.statista.com/statistics/1224629/the-most-linguistically-diverse-countries-worldwide-by-number-of-languages/
    Explore at:
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2025
    Area covered
    World
    Description

    Papua New Guinea is the most linguistically diverse country in the world. As of 2025, it was home to 840 different languages. Indonesia ranked second with 709 languages spoken. In the United States, 335 languages were spoken in that same year.

  9. Trends in Reading and Language Arts Proficiency (2010-2022): Top Of The...

    • publicschoolreview.com
    Updated Feb 9, 2025
    Cite
    Public School Review (2025). Trends in Reading and Language Arts Proficiency (2010-2022): Top Of The World Elementary School vs. California vs. Laguna Beach Unified School District [Dataset]. https://www.publicschoolreview.com/top-of-the-world-elementary-school-profile
    Explore at:
    Dataset updated
    Feb 9, 2025
    Dataset authored and provided by
    Public School Review
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Area covered
    Laguna Beach Unified School District
    Description

    This dataset tracks annual reading and language arts proficiency from 2010 to 2022 for Top Of The World Elementary School vs. California and Laguna Beach Unified School District

  10. Most used programming languages among developers worldwide 2025

    • statista.com
    + more versions
    Cite
    Statista, Most used programming languages among developers worldwide 2025 [Dataset]. https://www.statista.com/statistics/793628/worldwide-developer-survey-most-used-languages/
    Explore at:
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    May 29, 2025 - Jun 23, 2025
    Area covered
    Worldwide
    Description

    As of 2025, JavaScript and HTML/CSS are the most commonly used programming languages among software developers around the world, with more than 66 percent of respondents stating that they used JavaScript and around 61.9 percent using HTML/CSS. Python, SQL, and Bash/Shell rounded out the top five most widely used programming languages around the world.

    Programming languages

    At a very basic level, programming languages serve as sets of instructions that direct computers on how to behave and carry out tasks. Thanks to the increased prevalence of, and reliance on, computers and electronic devices in today’s society, these languages play a crucial role in the everyday lives of people around the world. An increasing number of people are interested in furthering their understanding of these tools through courses and bootcamps, while current developers are constantly seeking new languages and resources to add to their skills. Furthermore, programming knowledge is becoming an important skill within various industries throughout the business world. Job seekers with skills in Python, R, and SQL will find their knowledge to be among the most highly desirable data science skills, which is likely to assist them in their search for employment.

  11. GLOBAL Multilingual Lexical Data - Bilingual - Level 1

    • catalog.elra.info
    Updated Oct 3, 2023
    + more versions
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2023). GLOBAL Multilingual Lexical Data - Bilingual - Level 1 [Dataset]. https://catalog.elra.info/en-us/repository/browse/ELRA-M0111_04/
    Explore at:
    Dataset updated
    Oct 3, 2023
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Description

    The GLOBAL Multilingual Lexical Data (references ELRA-M0111-01 to ELRA-M0111-06 in the ELRA Catalogue) consists of a network of lexicographic cores for major world languages, comprising diverse monolingual, bilingual and multilingual combinations, in different sizes, originally built for language learning and translation. They are available in XML, JSON or JSON-LD (RDF) formats. The main features include:

    • Single-word lemmas, inflected word forms, and multiword expressions
    • Phonetic transcription and alternative scripts
    • Grammatical information and subcategorization
    • Word sense disambiguation including definitions and sense indicators
    • Examples of usage
    • Attributes such as synonyms, antonyms, register, subject domain, etc.
    • Translation equivalents

    Monolingual cores are distributed over 3 levels of language as follows:

    • Level 1 consists of 25 languages: Arabic, Chinese Simplified, Chinese Traditional, Czech, Danish, Dutch, English, French, German, Greek, Hebrew, Hindi, Italian, Japanese, Korean, Latin, Norwegian, Polish, Portuguese Brazil, Portuguese Portugal, Russian, Spanish, Swedish, Thai, Turkish
    • Level 2 consists of 10 languages: Danish, Dutch, French, German, Hebrew, Italian, Norwegian, Portuguese Brazil, Spanish, Swedish
    • Level 3 consists of Spanish

    Note: Prices are indicated per language unit (for monolingual data) or per language pair unit (for bilingual data). If you would like to obtain several languages or pairs of languages, please indicate in your cart the number of copies (=languages/pairs of languages) and specify in the comments which languages you would like to get.

  12. Instructor-led Language Training Market Report | Global Forecast From 2025...

    • dataintelo.com
    csv, pdf, pptx
    Updated Jan 7, 2025
    Cite
    Dataintelo (2025). Instructor-led Language Training Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/global-instructor-led-language-training-market
    Explore at:
    Available download formats: csv, pdf, pptx
    Dataset updated
    Jan 7, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Instructor-led Language Training Market Outlook



    The global instructor-led language training market size was valued at approximately USD 8 billion in 2023 and is projected to grow significantly, reaching nearly USD 12.5 billion by 2032, with a compound annual growth rate (CAGR) of around 5%. The growth of this market is being driven by several factors, including the increasing globalization of businesses, the rising demand for multilingual employees, and the growing emphasis on effective communication skills in both personal and professional settings. As the world becomes more interconnected, the ability to communicate in multiple languages is increasingly seen as a valuable asset, leading to a surge in demand for language training programs.
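
The quoted growth rate can be checked against the two market-size figures with the standard compound annual growth rate formula. The sketch below assumes nine compounding periods (2023 to 2032), which the description does not state explicitly; the function name `cagr` is illustrative.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values over `years` periods."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Figures quoted in the description, in USD billions.
# Assumption: growth compounds over 2032 - 2023 = 9 annual periods.
implied_rate = cagr(8.0, 12.5, 2032 - 2023)
print(f"Implied CAGR: {implied_rate:.2%}")
```

Under that assumption the implied rate comes out just above 5 percent, in line with the "around 5%" stated in the report summary.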



    One of the primary growth factors for the instructor-led language training market is the globalization of businesses and the need for companies to operate effectively across different linguistic and cultural contexts. As companies expand their operations into new regions, the ability to communicate with local clients, partners, and employees becomes crucial. This has led to a growing demand for language training programs that can equip employees with the necessary language skills. Moreover, the rise of remote work and virtual teams has further emphasized the need for effective communication across diverse geographies, fueling the demand for language training.



    Another significant factor contributing to the growth of this market is the increasing emphasis on personal development and lifelong learning. In a rapidly changing world, individuals are increasingly seeking to enhance their skills and knowledge to remain competitive in the job market. Language learning is seen as a key component of personal development, providing individuals with the ability to connect with different cultures and communities. As a result, there is a growing demand for language training programs that are tailored to individual learning needs and preferences, offering flexibility and convenience.



    The rise of digital technology and the increasing availability of online learning platforms have also played a crucial role in the growth of the instructor-led language training market. While traditional in-person language classes remain popular, virtual language training programs have gained significant traction due to their convenience and accessibility. These programs allow learners to access high-quality language instruction from anywhere in the world, making language learning more accessible to a wider audience. The integration of technology in language training programs has also enabled the development of innovative teaching methodologies and interactive learning experiences, further driving the growth of this market.



    In the context of globalization and the increasing need for multilingual communication, Study Abroad Training has emerged as a crucial component in language education. This type of training provides learners with immersive experiences in foreign countries, allowing them to practice language skills in real-world settings while gaining cultural insights. Study Abroad Training not only enhances language proficiency but also broadens learners' perspectives, making them more adaptable and culturally aware. As more students and professionals seek international exposure, the demand for Study Abroad Training is expected to rise, contributing to the growth of the language training market. This trend highlights the importance of experiential learning in achieving language fluency and intercultural competence.



    Regionally, the instructor-led language training market is experiencing significant growth across various parts of the world. North America and Europe are currently the largest markets for language training, driven by the presence of a large number of multinational companies and a strong emphasis on language education. However, the Asia Pacific region is expected to witness the highest growth during the forecast period, driven by the rapid economic development in countries like China and India and the increasing demand for English language proficiency. The growing importance of language skills in Latin America and the Middle East & Africa is also expected to contribute to the growth of the instructor-led language training market in these regions.



    Training Type Analysis



    The instructor-led language training market is segmented by training type into in-person and virtual training. In-person training remains a traditio

  13. GlobalPhone Vietnamese

    • live.european-language-grid.eu
    • catalogue.elra.info
    audio format
    + more versions
    Cite
    GlobalPhone Vietnamese [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/2100
    Explore at:
    Available download formats: audio format
    License

    http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Description

    The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks.

    The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).

    In each language about 100 sentences were read from each of the 100 speakers. The read texts were selected from national newspapers available via Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16bit, 16kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers.

    Data is shortened by means of the shorten program written by Tony Robinson. Alternatively, the data can be delivered unshortened.

    The Vietnamese part of GlobalPhone was collected in summer 2009. In total 160 speakers were recorded, 140 of them in the cities of Hanoi and Ho Chi Minh City in Vietnam, and an additional set of 20 speakers in Karlsruhe, Germany. All speakers are Vietnamese native speakers, covering the main dialectal variants from South and North Vietnam. Of these 160 speakers, 70 were female and 90 were male. The majority of speakers are well educated, being graduate students and engineers. The age distribution of the speakers ranges from 18 to 65 years. Each speaker read between 50 and 200 utterances from newspaper articles, corresponding to roughly 9.5 minutes of speech or 138 utterances per person; in total, 22,112 utterances were recorded. The speech was recorded using a close-talking Sennheiser HM420 microphone in a push-to-talk scenario using a laptop-based data collection toolkit developed in-house. All data were recorded at 16kHz and 16bit resolution in PCM format. The data collection took place in small-sized rooms with very low background noise. Information on recording place and environmental noise conditions is provided in a separate speaker session file for each speaker. The speech data was recorded in two phases. In the first phase, data was collected from 140 speakers in the cities of Hanoi and Ho Chi Minh City. In the second phase, utterances were selected from the text corpus in order to cover rare Vietnamese phonemes. This second recording phase was carried out with 20 Vietnamese graduate students who live in Karlsruhe. In sum, 22,112 utterances were spoken, corresponding to 25.25 hours of speech.

    The text data used for recording mainly came from the news posted in online editions of 15 Vietnamese newspaper websites, where the first 12 were used for the training set, while the last three were used for the development and evaluation set. The text data collected from the first 12 websites cover almost 4 million word tokens with a vocabulary of 30,000 words, resulting in an Out-of-Vocabulary rate of 0% on the development set and 0.067% on the evaluation set. For the text selection we followed the standard GlobalPhone protocols and focused on national and international politics and economics news (see [Schultz 2002]). The transcriptions are provided in Vietnamese-style Roman script, i.e. using several diacritics encoded in UTF-8. The Vietnamese data are organized in a training set of 140 speakers with 22.15 hours of speech, a development set of 10 speakers, 6 from North and 4 from South Vietnam, with 1:40 hours of speech, and an evaluation set of 10 speakers with the same gender and dialect distribution as the development set, with 1:30 hours of speech. More details on corpus statistics, collection scenario, and system building based on the Vietnamese part of GlobalPhone can be found under [Vu and Schultz, 2009, 2010].
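
For readers unfamiliar with the Out-of-Vocabulary (OOV) rate mentioned above, the following minimal sketch shows one common way to compute it: the fraction of running word tokens in a held-out text that do not occur in the training vocabulary. The description does not say whether the corpus authors counted tokens or distinct word types, so the token-based definition here is an assumption, and the toy word lists are invented.

```python
def oov_rate(vocabulary, tokens):
    """Fraction of running tokens that are not in the vocabulary."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    vocab = set(vocabulary)
    unseen = sum(1 for token in tokens if token not in vocab)
    return unseen / len(tokens)


# Toy data, invented for illustration only.
train_vocab = ["news", "economy", "politics", "national"]
eval_tokens = ["national", "news", "sports", "economy",
               "news", "weather", "politics", "news"]

rate = oov_rate(train_vocab, eval_tokens)
print(f"OOV rate: {rate:.1%}")
```

In this toy example two of the eight evaluation tokens are unseen, so the rate is 25%. A rate of 0.067%, as reported for the evaluation set, would correspond to fewer than one unseen token per thousand under this definition.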

    [Schultz 2002] Tanja Schultz (2002): GlobalPhone: A Multilingual Speech and Text Database developed at Karlsruhe University, Proceedings of the International Conference of Spoken Language Processing, ICSLP 2002, Denver, CO, September 2002.

    [Vu and Schultz, 2010] Ngoc Thang Vu, Tanja Schultz (2010): Optimization On Vietnamese Large Vocabulary Speech Recognition, 2nd Workshop on Spoken Languages Technologies for Under-resourced Languages, SLTU 2010, Penang, Malaysia, May 2010.

    [Vu and Schultz, 2009] Ngoc Thang Vu, Tanja Schultz (2009): Vietnamese Large Vocabulary Continuous Speech Recognition, Automatic Speech Recognition and Understanding, ASRU 2009, Merano.

  14. World_Countries_Dataset

    • kaggle.com
    zip
    Updated Jul 22, 2025
    Cite
    Hamza edit (2025). World_Countries_Dataset [Dataset]. https://www.kaggle.com/datasets/hamzaedit/world-countries-dataset
    Explore at:
    Available download formats: zip (10246 bytes)
    Dataset updated
    Jul 22, 2025
    Authors
    Hamza edit
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    World
    Description

    🌍 World Countries Dataset

    This World Countries Dataset contains detailed information about countries across the globe, offering insights into their geographic, demographic, and economic characteristics.

    It includes various features such as population, area, GDP, languages, and regional classifications. This dataset is ideal for projects related to data visualization, statistical analysis, geographical studies, or machine learning applications such as clustering or classification of countries.

    This dataset was manually compiled/collected from reliable open data sources (e.g., Wikipedia, World Bank, or other governmental datasets).

    🔍 Sample Questions Explored Using Python:

    • Q. 1) Which countries have the highest and lowest population?
    • Q. 2) What is the average area (in sq. km) of countries in each region?
    • Q. 3) Which countries have more than 100 million population and GDP above $1 trillion?
    • Q. 4) Which languages are most commonly spoken across countries?
    • Q. 5) Show a bar graph comparing GDPs of G7 nations.
    • Q. 6) How many countries are there in each continent or region?
    • Q. 7) Which countries have both a high population density and low GDP per capita?
    • Q. 8) Create a world map visualization of population or GDP distribution.
    • Q. 9) What are the top 10 most densely populated countries?
    • Q. 10) How many landlocked countries are there in the world?

    🧾 Features / Columns in the Dataset:

    • Country: The name of the country (e.g., "Pakistan", "France").

    • Capital: The capital city of the country.

    • Region: Broad geographical region (e.g., "Asia", "Europe").

    • Subregion: More specific geographical grouping (e.g., "Southern Asia").

    • Population: Total population of the country.

    • Area (sq. km): Total land area in square kilometers.

    • Population Density: Number of people per square kilometer.

    • GDP (USD): Gross Domestic Product (in U.S. dollars).

    • GDP per Capita: GDP divided by the population.

    • Official Languages: Officially recognized language(s) spoken.

    • Currency: Name of the currency used.

    • Timezones: Timezones in which the country falls.

    • Borders: List of bordering countries (if any).

    • Landlocked: Whether the country is landlocked (Yes/No).

    • Latitude / Longitude: Coordinates for geographical plotting.
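
    The derived columns above (Population Density, GDP per Capita) and a question such as Q. 9 can be sketched in a few lines of Python. This is only an illustration: the three rows are made-up placeholders, not values from the dataset, and the helper names `add_derived_columns` and `top_by_density` are hypothetical.

```python
# Placeholder rows: names and numbers are invented for illustration,
# they are NOT taken from the dataset.
countries = [
    {"Country": "A", "Population": 50_000_000, "Area (sq. km)": 250_000,
     "GDP (USD)": 1_000_000_000_000},
    {"Country": "B", "Population": 2_000_000, "Area (sq. km)": 1_000,
     "GDP (USD)": 40_000_000_000},
    {"Country": "C", "Population": 120_000_000, "Area (sq. km)": 2_400_000,
     "GDP (USD)": 600_000_000_000},
]


def add_derived_columns(rows):
    """Return copies of the rows with density and GDP per capita added."""
    enriched_rows = []
    for row in rows:
        new_row = dict(row)
        # People per square kilometer.
        new_row["Population Density"] = row["Population"] / row["Area (sq. km)"]
        # GDP divided by the population.
        new_row["GDP per Capita"] = row["GDP (USD)"] / row["Population"]
        enriched_rows.append(new_row)
    return enriched_rows


def top_by_density(rows, n=10):
    """Return up to n rows, most densely populated first."""
    return sorted(rows, key=lambda r: r["Population Density"], reverse=True)[:n]


enriched = add_derived_columns(countries)
ranking = [row["Country"] for row in top_by_density(enriched)]
print(ranking)
```

    With the real file, the same two functions would be applied to rows read from the CSV instead of the placeholder list.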

  15. Dataset for: "Big data suggest strong constraints of linguistic similarity...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jan 24, 2020
    Cite
    Job Schepens; Roeland van Hout; T. Florian Jaeger (2020). Dataset for: "Big data suggest strong constraints of linguistic similarity on adult language learning" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2863532
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Radboud Universiteit
    Freie Universitaet Berlin
    University of Rochester
    Authors
    Job Schepens; Roeland van Hout; T. Florian Jaeger
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is adapted from raw data with fully anonymized results on the State Examination of Dutch as a Second Language. This exam is officially administered by the Board of Tests and Examinations (College voor Toetsen en Examens, or CvTE). See cvte.nl/about-cvte. The Board of Tests and Examinations is mandated by the Dutch government.

    The article accompanying the dataset:

    Schepens, Job, Roeland van Hout, and T. Florian Jaeger. “Big Data Suggest Strong Constraints of Linguistic Similarity on Adult Language Learning.” Cognition 194 (January 1, 2020): 104056. https://doi.org/10.1016/j.cognition.2019.104056.

    Every row in the dataset represents the first official testing score of a unique learner. The columns contain the following information as based on questionnaires filled in at the time of the exam:

    "L1" - The first language of the learner "C" - The country of birth "L1L2" - The combination of first and best additional language besides Dutch "L2" - The best additional language besides Dutch "AaA" - Age at Arrival in the Netherlands in years (starting date of residence) "LoR" - Length of residence in the Netherlands in years "Edu.day" - Duration of daily education (1 low, 2 middle, 3 high, 4 very high). From 1992 until 2006, learners' education has been measured by means of a side-by-side matrix question in a learner's questionnaire. Learners were asked to mark which type of education they have had (elementary, secondary, or tertiary schooling) by means of filling in for how many years they have been enrolled, in which country, and whether or not they have graduated. Based on this information we were able to estimate how many years learners have had education on a daily basis from six years of age onwards. Since 2006, the question about learners' education has been altered and it is asked directly how many years learners have had formal education on a daily basis from six years of age onwards. Possible answering categories are: 1) 0 thru 5 years; 2) 6 thru 10 years; 3) 11 thru 15 years; 4) 16 years or more. The answers have been merged into the categorical answer. "Sex" - Gender "Family" - Language Family "ISO639.3" - Language ID code according to Ethnologue "Enroll" - Proportion of school-aged youth enrolled in secondary education according to the World Bank. The World Bank reports on education data in a wide number of countries around the world on a regular basis. We took the gross enrollment rate in secondary schooling per country in the year the learner has arrived in the Netherlands as an indicator for a country's educational accessibility at the time learners have left their country of origin. "STEX_speaking_score" - The STEX test score for speaking proficiency. 
"Dissimilarity_morphological" - Morphological similarity "Dissimilarity_lexical" - Lexical similarity "Dissimilarity_phonological_new_features" - Phonological similarity (in terms of new features) "Dissimilarity_phonological_new_categories" - Phonological similarity (in terms of new sounds)

    A few rows of the data:

    "L1","C","L1L2","L2","AaA","LoR","Edu.day","Sex","Family","ISO639.3","Enroll","STEX_speaking_score","Dissimilarity_morphological","Dissimilarity_lexical","Dissimilarity_phonological_new_features","Dissimilarity_phonological_new_categories" "English","UnitedStates","EnglishMonolingual","Monolingual",34,0,4,"Female","Indo-European","eng ",94,541,0.0094,0.083191,11,19 "English","UnitedStates","EnglishGerman","German",25,16,3,"Female","Indo-European","eng ",94,603,0.0094,0.083191,11,19 "English","UnitedStates","EnglishFrench","French",32,3,4,"Male","Indo-European","eng ",94,562,0.0094,0.083191,11,19 "English","UnitedStates","EnglishSpanish","Spanish",27,8,4,"Male","Indo-European","eng ",94,537,0.0094,0.083191,11,19 "English","UnitedStates","EnglishMonolingual","Monolingual",47,5,3,"Male","Indo-European","eng ",94,505,0.0094,0.083191,11,19

  16. Language Exam Results - Clustering

    • kaggle.com
    zip
    Updated Mar 22, 2022
    Cite
    Emre AVARA (2022). Language Exam Results - Clustering [Dataset]. https://www.kaggle.com/datasets/poseidon95/language-exam-student-scores-custom-created
    Explore at:
    zip (2948 bytes) Available download formats
    Dataset updated
    Mar 22, 2022
    Authors
    Emre AVARA
    Description

    Context

    This dataset was generated randomly using NumPy's random methods. The whole point is, however, to provide a dataset for clustering (Logistic Regression, Neural Networks, etc.).

    Content

    The training dataset is a CSV file that contains 300 sets of scores from a language test ("Reading", "Listening", "Speaking", "Writing"). The values are floating point numbers between 0 and 1. Simply, the results are categorized according to the average of the scores from the 4 main parts.

    The test dataset is a CSV file with 44 scores.
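
    A dataset of this shape could be produced along the following lines. This is a minimal sketch, not the author's script: it uses the standard library's `random` module in place of NumPy, and the category names and cut-offs in `label_for` are hypothetical, since the description does not state the actual rule.

```python
import random

PARTS = ("Reading", "Listening", "Speaking", "Writing")


def label_for(average):
    """Map an average score to a category (hypothetical cut-offs)."""
    if average < 0.4:
        return "low"
    if average < 0.6:
        return "medium"
    return "high"


def make_scores(n, seed=0):
    """Generate n rows of four random scores in [0, 1) plus a category."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    rows = []
    for _ in range(n):
        row = {part: rng.random() for part in PARTS}
        average = sum(row[part] for part in PARTS) / len(PARTS)
        row["Category"] = label_for(average)
        rows.append(row)
    return rows


train = make_scores(300, seed=0)  # same size as the training file
test = make_scores(44, seed=1)    # same size as the test file
print(len(train), len(test))
```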

    Acknowledgements

    The name of the first user will be written.

    Inspiration

    I hope this dataset will encourage all newbies to enter the world of machine learning.

    Data license

    Obviously, data is free.

  17. Main language of Steam users worldwide 2024

    • statista.com
    Cite
    Statista, Main language of Steam users worldwide 2024 [Dataset]. https://www.statista.com/statistics/957319/steam-user-language/
    Explore at:
    Dataset authored and provided by
    Statista http://statista.com/
    Time period covered
    Oct 2024
    Area covered
    Worldwide
    Description

    As of October 2024, an estimated 33.48 percent of Steam gaming platform users worldwide used Simplified Chinese as their main language. English was the second-most common language, selected by 32.68 percent of users.

  18. Supporting Information file S1 presents Figures S1–S5 with additional...

    • plos.figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Young-Ho Eom; Pablo Aragón; David Laniado; Andreas Kaltenbrunner; Sebastiano Vigna; Dima L. Shepelyansky (2023). Supporting Information file S1 presents Figures S1–S5 with additional information discussed above in the main part of the paper, lists of top 100 global PageRank and 2DRank names; Tables S1–S25 of top 10 names of given language and remained world from the global PageRank and 2DRank ranking lists of persons ordered by the score ΘP, A of Eq.(3). [Dataset]. http://doi.org/10.1371/journal.pone.0114825.s001
    Explore at:
    pdf Available download formats
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS http://plos.org/
    Authors
    Young-Ho Eom; Pablo Aragón; David Laniado; Andreas Kaltenbrunner; Sebastiano Vigna; Dima L. Shepelyansky
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    World
    Description

    For the reader's convenience, the lists of all 100 ranked names for all 24 Wikipedia editions and the corresponding network link data for each edition are also given at [39], in addition to the Supporting Information file. All computational data used are publicly available at http://dumps.wikimedia.org/. All the raw data necessary to replicate the findings and conclusions of this study are within the paper, the supporting information files, and this Wikimedia web site. (PDF)

  19. Languages in Mexico 2020

    • statista.com
    Updated Nov 28, 2025
    Cite
    Statista (2025). Languages in Mexico 2020 [Dataset]. https://www.statista.com/statistics/275440/languages-in-mexico/
    Explore at:
    Dataset updated
    Nov 28, 2025
    Dataset authored and provided by
    Statista http://statista.com/
    Time period covered
    2020
    Area covered
    Mexico
    Description

    In 2020, about 93.8 percent of the Mexican population was monolingual in Spanish. Around five percent spoke a combination of Spanish and indigenous languages. Spanish is the third-most spoken native language worldwide, after Mandarin Chinese and Hindi.

    Mexican Spanish

    Spanish was first used in Mexico in the 16th century, at the time of Spanish colonization during the Conquest campaigns of what is now Mexico and the Caribbean. As of 2018, Mexico is the country with the largest number of native Spanish speakers worldwide. Mexican Spanish is influenced by English and Nahuatl, and has about 120 million users. The Mexican government uses Spanish in the majority of its proceedings; however, it recognizes 68 national languages, 63 of which are indigenous.

    Indigenous languages spoken

    Of the indigenous languages spoken, two of the most widely used are Nahuatl and Maya. Due to a history of marginalization of indigenous groups, most indigenous languages are endangered, and many linguists warn they might cease to be used after a span of just a few decades. In recent years, legislative attempts such as the San Andrés Accords have been made to protect indigenous groups, who make up about 25 million of Mexico’s 125 million total inhabitants, though the efficacy of such measures is yet to be seen.

  20. Finnish General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Finnish General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-finnish-finland
    Explore at:
    wav Available download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    Finland
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Finnish General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Finnish speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Finnish communication.

    Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Finnish speech models that understand and respond to authentic Finnish accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Finnish. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native Finnish speakers from FutureBeeAI’s contributor community.
    Regions: Representing various provinces of Finland to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.
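
    The stated audio format (stereo, 16-bit, 16 kHz WAV) can be checked with Python's standard `wave` module. The sketch below is an assumption about how one might validate downloaded files; `matches_recording_spec` and `make_silent_wav` are hypothetical helper names, and the in-memory silent clips stand in for real recordings.

```python
import io
import wave


def matches_recording_spec(source):
    """Return True if a WAV file is stereo, 16-bit, 16 kHz.

    `source` may be a file path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav:
        return (
            wav.getnchannels() == 2        # stereo
            and wav.getsampwidth() == 2    # 2 bytes per sample = 16-bit
            and wav.getframerate() == 16000
        )


def make_silent_wav(channels, sample_width, rate, frames=160):
    """Build a short silent WAV in memory, used here only for the demo."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * frames * channels * sample_width)
    buffer.seek(0)  # rewind so the buffer can be read back
    return buffer


print(matches_recording_spec(make_silent_wav(2, 2, 16000)))
print(matches_recording_spec(make_silent_wav(1, 2, 16000)))
```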

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy, achieved through double QA pass, average WER < 5%

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
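
    The word error rate (WER) quoted above is conventionally computed as the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The following is a minimal sketch of that standard definition; it is not FutureBeeAI's QA tooling, and whitespace tokenization is a simplifying assumption.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] = edits needed to turn the reference words processed
    # so far into the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref_words)


# One substituted word out of four reference words.
print(word_error_rate("a b c d", "a x c d"))
```

    Under this definition, a WER below 5% means fewer than one word-level edit per twenty reference words on average.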

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple Finnish speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for Finnish.
    Voice Assistants: Build smart assistants capable of understanding natural Finnish conversations.
Cite
Statista, The most spoken languages worldwide 2025 [Dataset]. https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/

The most spoken languages worldwide 2025

Explore at:
464 scholarly articles cite this dataset (View in Google Scholar)
Dataset authored and provided by
Statista http://statista.com/
Time period covered
2025
Area covered
World
Description

In 2025, there were around 1.53 billion people worldwide who spoke English either natively or as a second language, slightly more than the 1.18 billion Mandarin Chinese speakers at the time of survey. Hindi and Spanish accounted for the third and fourth most widespread languages that year.

Languages in the United States

The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken in the United States vary as a result of the multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which more than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese), 1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year.

Different languages at home

The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage of population speaking a language other than English is California. About 45 percent of its population was speaking a language other than English at home in 2023.
