63 datasets found
  1. The most spoken languages worldwide 2025

    • statista.com
    Updated Apr 14, 2025
    Cite
    Statista (2025). The most spoken languages worldwide 2025 [Dataset]. https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/
    Explore at:
    Dataset updated
    Apr 14, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2025
    Area covered
    World
    Description

    In 2025, there were around 1.53 billion people worldwide who spoke English either natively or as a second language, slightly more than the 1.18 billion Mandarin Chinese speakers at the time of survey. Hindi and Spanish accounted for the third and fourth most widespread languages that year.

    Languages in the United States

    The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken there vary as a result of the multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which more than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese), 1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year.

    Different languages at home

    The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage is California, where about 45 percent of the population spoke a language other than English at home in 2023.

  2. Ranking of languages spoken at home in the U.S. 2023

    • statista.com
    Updated Apr 14, 2025
    Cite
    Statista (2025). Ranking of languages spoken at home in the U.S. 2023 [Dataset]. https://www.statista.com/statistics/183483/ranking-of-languages-spoken-at-home-in-the-us-in-2008/
    Explore at:
    Dataset updated
    Apr 14, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2023
    Area covered
    United States
    Description

    In 2023, around 43.37 million people in the United States spoke Spanish at home. In comparison, approximately 998,179 people were speaking Russian at home during the same year. The distribution of the U.S. population by ethnicity can be accessed here. A ranking of the most spoken languages across the world can be accessed here.

  3. Number of Words in different Languages

    • kaggle.com
    Updated Apr 28, 2023
    Cite
    Tayyar Hussain (2023). Number of Words in different Languages [Dataset]. https://www.kaggle.com/datasets/tayyarhussain/number-of-words-in-different-languages
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 28, 2023
    Dataset provided by
    Kaggle
    Authors
    Tayyar Hussain
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Introduction:

    This dataset provides detailed information on the number of words in various languages. It includes a comprehensive list of word counts for multiple languages, making it a valuable resource for linguists, language learners, and anyone interested in language diversity. The dataset is presented in the form of a list of dictionaries, with each dictionary containing information on the language, the number of words, and other details.

    The dataset covers a wide range of languages from around the world, including commonly spoken languages like English, Spanish, Mandarin, and Arabic, as well as lesser-known languages. The word counts are approximate and are based on the number of words in the respective dictionaries.

    In addition to the word counts, the dataset also includes information on the approximate number of headwords and definitions available for each language. This information provides further insight into the depth and complexity of the vocabulary of each language.

    The dataset is useful for a variety of purposes, such as language research, linguistic diversity studies, language teaching and learning, and natural language processing. The data is provided in a machine-readable format, making it easy to use and analyze.

    This dataset is a valuable resource for anyone interested in the linguistic diversity of the world's languages and provides a starting point for exploring the vast vocabulary of different languages.

    Column Descriptors:

    • Language: The name of the language the dictionary pertains to.
    • Number of Words: The approximate number of words included in the dictionary.
    • Approx Headwords: The approximate number of headwords included in the dictionary.
    • Approx Definitions: The approximate number of definitions included in the dictionary.
    • Dictionary: The name or type of dictionary included in the dataset.
    • Notes: Any additional notes or information regarding the dictionary or language.
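
The description says the data is a list of dictionaries with the columns listed above. A minimal Python sketch of working with that shape follows. The two records are hypothetical placeholders rather than values from the dataset, and the helper names `missing_columns` and `rank_by_word_count` are illustrative assumptions.

```python
# Column names as given in the "Column Descriptors" section.
EXPECTED_COLUMNS = (
    "Language",
    "Number of Words",
    "Approx Headwords",
    "Approx Definitions",
    "Dictionary",
    "Notes",
)

# Hypothetical placeholder records; the real dataset values are not shown here.
records = [
    {"Language": "Language A", "Number of Words": 1000, "Approx Headwords": 800,
     "Approx Definitions": 1500, "Dictionary": "Hypothetical Dictionary A", "Notes": ""},
    {"Language": "Language B", "Number of Words": 2500, "Approx Headwords": 2000,
     "Approx Definitions": 4000, "Dictionary": "Hypothetical Dictionary B", "Notes": ""},
]


def missing_columns(record):
    """Return the expected column names that are absent from one record."""
    return [name for name in EXPECTED_COLUMNS if name not in record]


def rank_by_word_count(rows):
    """Return the records sorted from largest to smallest word count."""
    return sorted(rows, key=lambda row: row["Number of Words"], reverse=True)


ranked = rank_by_word_count(records)
```

Because the word counts are described as approximate and dictionary-based, a ranking like this compares dictionaries rather than the languages themselves.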

  4. The most linguistically diverse countries worldwide 2025, by number of...

    • statista.com
    Updated Apr 15, 2025
    Cite
    Statista (2025). The most linguistically diverse countries worldwide 2025, by number of languages [Dataset]. https://www.statista.com/statistics/1224629/the-most-linguistically-diverse-countries-worldwide-by-number-of-languages/
    Explore at:
    Dataset updated
    Apr 15, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2025
    Area covered
    World
    Description

    Papua New Guinea is the most linguistically diverse country in the world. As of 2025, it was home to 840 different languages. Indonesia ranked second with 709 languages spoken. In the United States, 335 languages were spoken in that same year.

  5. MCB_languages_county

    • kaggle.com
    Updated Oct 1, 2019
    Cite
    Marisol Brewster (2019). MCB_languages_county [Dataset]. https://www.kaggle.com/mcbrewster/mcb-languages-county/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 1, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Marisol Brewster
    Description

    Context

    This is a dataset I found online through the Google Dataset Search portal.

    Content

    The American Community Survey (ACS) 2009-2013 multi-year data are used to list all languages spoken in the United States that were reported during the sample period. These tables provide detailed counts of many more languages than the 39 languages and language groups that are published annually as a part of the routine ACS data release. This is the second tabulation beyond 39 languages since ACS began.

    The tables include all languages that were reported in each geography during the 2009 to 2013 sampling period. For the purpose of tabulation, reported languages are classified in one of 380 possible languages or language groups. Because the data are a sample of the total population, there may be languages spoken that are not reported, either because the ACS did not sample the households where those languages are spoken, or because the person filling out the survey did not report the language or reported another language instead.

    The tables also provide information about self-reported English-speaking ability. Respondents who reported speaking a language other than English were asked to indicate their ability to speak English in one of the following categories: "Very well," "Well," "Not well," or "Not at all." The data on ability to speak English represent the person’s own perception about his or her own ability or, because ACS questionnaires are usually completed by one household member, the responses may represent the perception of another household member.

    These tables are also available through the Census Bureau's application programming interface (API). Please see the developers page for additional details on how to use the API to access these data.

    Acknowledgements

    Sources:

    Google Dataset Search: https://toolbox.google.com/datasetsearch

    2009-2013 American Community Survey

    Original dataset: https://www.census.gov/data/tables/2013/demo/2009-2013-lang-tables.html

    Downloaded From: https://data.world/kvaughn/languages-county

    Banner and thumbnail photo by Farzad Mohsenvand on Unsplash

  6. Language spoken at Home (Census 2016)

    • digital-earth-pacificcore.hub.arcgis.com
    • pacificgeoportal.com
    • +2more
    Updated May 26, 2019
    Cite
    Esri Australia (2019). Language spoken at Home (Census 2016) [Dataset]. https://digital-earth-pacificcore.hub.arcgis.com/datasets/esriau::language-spoken-at-home-census-2016/about
    Explore at:
    Dataset updated
    May 26, 2019
    Dataset provided by
    Esri (http://esri.com/)
    Esri Australia
    Authors
    Esri Australia
    Description

    Does the person speak a language other than English at home? This map takes a look at answers to this question from Census Night.

    Colour: For each SA1 geography, the colour indicates which language 'wins'. SA1 geographies not coloured are either tied between two languages or do not have enough data.

    Colour Intensity: The colour intensity compares the values of the winner to all other values and returns its dominance over other languages in the same geography.

    Notes: Only considers top 6 languages for VIC.

    • Census 2016 DataPacks
    • Predominance Visualisations
    • Source Code

    Notice that while one language level appears to dominate certain geographies, it doesn't necessarily mean it represents the majority of the population. In fact, as you explore most areas, you will find the predominant language makes up just a fraction of the population due to the number of languages considered.
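
The colouring rule described above can be sketched in Python. This is only an illustration under stated assumptions: the map's actual formula is not given in the description, so "dominance" is assumed here to be the winner's share of all counted speakers, and the function name `predominance` and the sample counts are hypothetical.

```python
def predominance(counts):
    """Return (winning_language, share_of_total) for one geography.

    Returns (None, None) when there is no data or when the top count is
    tied, mirroring the geographies that the map leaves uncoloured.
    """
    total = sum(counts.values())
    if total == 0:
        return None, None
    top = max(counts.values())
    winners = [language for language, n in counts.items() if n == top]
    if len(winners) != 1:
        return None, None
    # Assumption: dominance is the winner's fraction of all counted speakers.
    return winners[0], top / total


# Hypothetical counts for one SA1 geography.
winner, share = predominance({"Language A": 30, "Language B": 10})
```

A winner with a low share illustrates the caveat in the description: the predominant language need not represent a majority of the population.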

  7. Common languages used for web content 2025, by share of websites

    • statista.com
    Updated Feb 11, 2025
    Cite
    Statista (2025). Common languages used for web content 2025, by share of websites [Dataset]. https://www.statista.com/statistics/262946/most-common-languages-on-the-internet/
    Explore at:
    Dataset updated
    Feb 11, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Feb 2025
    Area covered
    Worldwide
    Description

    As of February 2025, English was the most popular language for web content, with over 49.4 percent of websites using it. Spanish ranked second, with six percent of web content, and German followed with 5.6 percent.

    English as the leading online language

    The United States and India, the countries with the most internet users after China, are also the world's biggest English-speaking markets. As of January 2023, the combined internet user base of the two countries was over a billion individuals. This has led to most online information being created in English. Consequently, even those who are not native speakers may use it for convenience.

    Global internet usage by regions

    As of October 2024, the number of internet users worldwide was 5.52 billion. In the same period, Northern Europe and North America led in internet penetration rates worldwide, with around 97 percent of their populations accessing the internet.

  8. Number of native Spanish speakers worldwide 2024, by country

    • statista.com
    Updated Jan 15, 2025
    Cite
    Statista (2025). Number of native Spanish speakers worldwide 2024, by country [Dataset]. https://www.statista.com/statistics/991020/number-native-spanish-speakers-country-worldwide/
    Explore at:
    Dataset updated
    Jan 15, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    World
    Description

    Mexico is the country with the largest number of native Spanish speakers in the world. As of 2024, 132.5 million people in Mexico spoke Spanish with a native command of the language. Colombia was the nation with the second-highest number of native Spanish speakers, at around 52.7 million. Spain came in third, with 48 million, and Argentina fourth, with 46 million.

    Spanish, a world language

    As of 2023, Spanish ranked as the fourth most spoken language in the world, behind only English, Chinese, and Hindi, with over half a billion speakers. Spanish is the official language of over 20 countries, most of them in the Americas; it is also one of the official languages of Equatorial Guinea in Africa. It also has a strong presence in other countries, such as the United States, Morocco, and Brazil, which are included in the list of non-Hispanic countries with the highest number of Spanish speakers.

    The second most spoken language in the U.S.

    In the most recent data, Spanish ranked as the language other than English with the highest number of speakers, with 12 times as many speakers as the second-ranked language. This comes as no surprise given the long history of migration from Latin American countries to the United States. Moreover, in fiscal year 2022 alone, 5 of the top 10 countries of origin of naturalized people in the U.S. were Spanish-speaking countries.

  9. Data from: Tracking the Global Pulse: The first public Twitter dataset from...

    • data.mendeley.com
    Updated May 27, 2025
    Cite
    kheir eddine daouadi (2025). Tracking the Global Pulse: The first public Twitter dataset from FIFA World Cup [Dataset]. http://doi.org/10.17632/gw3mcnbkwr.2
    Explore at:
    Dataset updated
    May 27, 2025
    Authors
    kheir eddine daouadi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    World
    Description

    The first public large-scale multilingual Twitter dataset related to the FIFA World Cup 2022, comprising over 28 million posts in 69 unique spoken languages, including Arabic, English, Spanish, French, and many others. This dataset aims to facilitate research in future sentiment analysis, cross-linguistic studies, event-based analytics, meme and hate speech detection, fake news detection, and social manipulation detection.

    The file 🚨Qatar22WC.csv🚨 contains tweet-level and user-level metadata for our collected tweets.

    🚀Codebook for FIFA World Cup 2022 Twitter Dataset🚀

    | Column Name | Description |
    |---|---|
    | day, month, year | Date the tweet was posted |
    | hou, min, sec | Hour, minute, and second of the tweet timestamp |
    | age_of_the_user_account | User account age in days |
    | tweet_count | Total number of tweets posted by the user |
    | location | User-defined location field |
    | follower_count | Number of followers the user has |
    | following_count | Number of accounts the user is following |
    | follower_to_Following | Follower-to-following ratio |
    | favouite_count | Number of tweets the user has liked |
    | verified | Boolean indicating if the user is verified (1 = Verified, 0 = Not Verified) |
    | Avg_tweet_count | Average number of tweets per day posted by the user |
    | list_count | Number of lists the user is a member of |
    | Tweet_Id | Tweet ID |
    | is_reply_tweet | ID of the tweet being replied to (if applicable) |
    | is_quote | Boolean indicating if the tweet is a quote |
    | retid | Retweet ID if it's a retweet; NaN otherwise |
    | lang | Language of the tweet |
    | hashtags | The keyword or hashtag used to collect the tweet |
    | is_image | Boolean indicating if the tweet is associated with an image |
    | is_video | Boolean indicating if the tweet is associated with a video |
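
As a rough illustration of how the codebook columns might be used, the Python sketch below combines the six date and time columns into a single timestamp and counts tweets per language. It assumes that `day`, `month`, `year`, `hou`, `min`, and `sec` are separate integer-valued columns and that `lang` holds short language codes; the two sample rows are hypothetical and do not come from Qatar22WC.csv.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Hypothetical two-row excerpt using column names from the codebook.
SAMPLE_CSV = """day,month,year,hou,min,sec,lang
20,11,2022,16,0,5,en
18,12,2022,15,30,0,ar
"""


def tweet_timestamp(row):
    """Combine the date and time columns of one row into a datetime."""
    return datetime(
        int(row["year"]), int(row["month"]), int(row["day"]),
        int(row["hou"]), int(row["min"]), int(row["sec"]),
    )


def load_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


rows = load_rows(SAMPLE_CSV)
timestamps = [tweet_timestamp(row) for row in rows]
tweets_per_language = Counter(row["lang"] for row in rows)
```

For the full file, the same functions could be fed from `open(...)` instead of an in-memory string, provided the real column layout matches these assumptions.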

    Examples of use case queries are described in the file 🚨fifa_wc_qatar22_examples_of_use_case_queries.ipynb🚨 and accessible via: https://github.com/khairied/Qata_FIFA_World_Cup_22

    🚀 Please Cite This as: Daouadi, K. E., Boualleg, Y., Guehairia, O. & Taleb-Ahmed, A. (2025). Tracking the Global Pulse: The first public Twitter dataset from FIFA World Cup, Journal of Computational Social Science.

  10. ‘Extinct Languages’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Nov 12, 2021
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘Extinct Languages’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-extinct-languages-69fe/f057a7f7/?iid=005-024&v=presentation
    Explore at:
    Dataset updated
    Nov 12, 2021
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Extinct Languages’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/the-guardian/extinct-languages on 12 November 2021.

    --- Dataset description provided by original source is as follows ---

    Context

    A recent Guardian blog post asks: "How many endangered languages are there in the World and what are the chances they will die out completely?" The United Nations Educational, Scientific and Cultural Organization (UNESCO) regularly publishes a list of endangered languages, using a classification system that describes each language's degree of endangerment, up to and including extinction.

    Content

    The full detailed dataset includes names of languages, number of speakers, the names of countries where the language is still spoken, and the degree of endangerment. The UNESCO endangerment classification is as follows:

    • Vulnerable: most children speak the language, but it may be restricted to certain domains (e.g., home)
    • Definitely endangered: children no longer learn the language as a 'mother tongue' in the home
    • Severely endangered: language is spoken by grandparents and older generations; while the parent generation may understand it, they do not speak it to children or among themselves
    • Critically endangered: the youngest speakers are grandparents and older, and they speak the language partially and infrequently
    • Extinct: there are no speakers left
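
The five levels above form an ordered scale, which can be sketched as a Python `IntEnum`. The numeric values and the helper `parse_endangerment` are assumptions made for illustration; the source only names the levels and does not assign numbers to them.

```python
from enum import IntEnum


class Endangerment(IntEnum):
    """UNESCO levels as listed above, ordered from least to most severe."""
    VULNERABLE = 1
    DEFINITELY_ENDANGERED = 2
    SEVERELY_ENDANGERED = 3
    CRITICALLY_ENDANGERED = 4
    EXTINCT = 5


def parse_endangerment(label):
    """Map a label such as 'Severely endangered' to its enum member."""
    return Endangerment[label.strip().upper().replace(" ", "_")]


level = parse_endangerment("Definitely endangered")
```

An ordered type like this makes comparisons such as "at least severely endangered" straightforward when filtering the dataset.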

    Acknowledgements

    Data was originally organized and published by The Guardian, and can be accessed via this Datablog post.

    Inspiration

    • How can you best visualize this data?
    • Which rare languages are more isolated (Sicilian, for example) versus more spread out? Can you come up with a hypothesis for why that is the case?
    • Can you compare the number of rare speakers with more relatable figures? For example, are there more Romani speakers in the world than there are residents in a small city in the United States?

    --- Original source retains full ownership of the source dataset ---

  11. Statistics of the Languages spoken in South Africa. For each language, we...

    • plos.figshare.com
    xls
    Updated Jun 5, 2025
    Cite
    Koena Ronny Mabokela; Mpho Primus; Turgay Celik (2025). Statistics of the Languages spoken in South Africa. For each language, we report the ISO, the African subfamily, and the prevalent countries where the language is also spoken. [Dataset]. http://doi.org/10.1371/journal.pone.0325102.t001
    Explore at:
    xls (available download formats)
    Dataset updated
    Jun 5, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Koena Ronny Mabokela; Mpho Primus; Turgay Celik
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Africa, South Africa
    Description

    Statistics of the Languages spoken in South Africa. For each language, we report the ISO, the African subfamily, and the prevalent countries where the language is also spoken.

  12. SLE Language Areas

    • ebola-nga.opendata.arcgis.com
    Updated Feb 2, 2015
    Cite
    National Geospatial-Intelligence Agency (2015). SLE Language Areas [Dataset]. https://ebola-nga.opendata.arcgis.com/content/ffe30c1c30ed48fcafb14e8a026128d5
    Explore at:
    Dataset updated
    Feb 2, 2015
    Dataset authored and provided by
    National Geospatial-Intelligence Agency (http://www.nga.mil/)
    Area covered
    Description

    While English is the official language, it is typically used for governmental, business, and media purposes. In day-to-day life most people in the country speak Krio, which is a style of Pidgin English or English-based creole language. Krio is the lingua franca for the country and the formal language for those who do not speak English. With the number of different ethnic groups, Krio unites these groups with a common language. The citizens who are fluent in English are among the elite minority and often experience privileges such as economic opportunities that non-English speakers are excluded from. Other common indigenous languages used in the country are Mende, Temne, and Limba. As the official language, English is the only language used in education. It is reported that school children who speak indigenous languages on school premises are punished. Students who fail English classes are not granted admission into college.

    Attribute Table Field Descriptions

    • ISO3: International Organization for Standardization 3-digit country code
    • ADM0_NAME: Administration level zero identification / name
    • LANG_FAM: Language family
    • LANG_SUBGR: Language subgroup
    • ALT_NAMES: Alternate names
    • COMMENTS: Comments or notes regarding language
    • SOURCE_DT: Source one creation date
    • SOURCE: Source one
    • SOURCE2_DT: Source two creation date
    • SOURCE2: Source two

    Collection

    This feature class was created using Anthromapper, consisting of linguistic layers that have been primarily based on the World Language Mapping System (WLMS). Geographical terrain features, combined with a watershed model, were also used to predict the likely extent of linguistic influence. The metadata was supplemented with anthropological and linguistic information from peer-reviewed journals and published books.

    It should be noted that this feature class only depicts the majority first level languages spoken in a given area; there might be significant populations of other minority language speakers not shown in this dataset.

    The data included herein have not been derived from a registered survey and should be considered approximate unless otherwise defined. While rigorous steps have been taken to ensure the quality of each dataset, DigitalGlobe is not responsible for the accuracy and completeness of data compiled from outside sources.

    Sources (HGIS)

    • Anthromapper. DigitalGlobe, November 2014.
    • Ethnologue, "Languages of the World." 2012. Accessed November 2014. http://www.ethnologue.com.
    • World Language Mapping System (WLMS) Version 16. World GeoDatasets, November 2014.

    Sources (Metadata)

    • Antimoon, "English, French, and Arabic languages in Sierra Leone". December 2009. Accessed December 2014. http://www.antimoon.com.
    • Central Intelligence Agency. The World FactBook, "Sierra Leone". June 2014. Accessed November 2014. https://www.cia.gov/library/publications/the-world-factbook.
    • DePauw University. Sierra Leone, "Language". January 2014. Accessed December 2014. http://www.depauw.edu.
    • National African Language Resource Center (NALRC), "Krio". January 2014. Accessed December 2014. http://www.nalrc.indiana.edu.

  13. GlobalPhone Arabic

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Jun 26, 2017
    + more versions
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2017). GlobalPhone Arabic [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0192/
    Explore at:
    Dataset updated
    Jun 26, 2017
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks.

    The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).

    In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database.

    The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers. Data is compressed by means of the shorten program written by Tony Robinson. Alternatively, the data can be delivered uncompressed.

    The Arabic corpus was produced using the Assabah newspaper. It contains recordings of 78 speakers (35 males, 43 females) recorded in Tunisia, Palestine and Jordan. The following age distribution has been obtained: 20 speakers are below 19, 35 speakers are between 20 and 29, 13 speakers are between 30 and 39, 6 speakers are between 40 and 49, and 4 speakers are over 50.
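
The audio format and speaker figures quoted above lend themselves to a quick arithmetic check. The Python sketch below estimates the uncompressed size implied by 16-bit, 16 kHz mono audio and confirms that the Arabic gender and age breakdowns both add up to the stated 78 speakers. The function name is illustrative, and the size is an estimate from the stated format only, not a figure published by ELRA.

```python
SAMPLE_RATE_HZ = 16_000   # 16 kHz, as stated in the description
BITS_PER_SAMPLE = 16      # 16-bit samples
CHANNELS = 1              # mono


def uncompressed_bytes(hours):
    """Estimate raw PCM size in bytes for the given duration in hours."""
    bytes_per_second = SAMPLE_RATE_HZ * (BITS_PER_SAMPLE // 8) * CHANNELS
    return int(hours * 3600 * bytes_per_second)


# Speaker counts for the Arabic corpus, copied from the description.
arabic_by_gender = {"male": 35, "female": 43}
arabic_by_age = {
    "below 19": 20,
    "20 to 29": 35,
    "30 to 39": 13,
    "40 to 49": 6,
    "over 50": 4,
}

# 450 hours is the stated lower bound for the whole corpus.
whole_corpus_bytes = uncompressed_bytes(450)
```

Since the corpus contains "over 450 hours", the computed value is a lower bound on the uncompressed size, before any reduction from the shorten program.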

  14. Language Services market size was estimated at USD 58.9 billion in 2022!

    • cognitivemarketresearch.com
    pdf,excel,csv,ppt
    Updated Feb 19, 2024
    Cite
    Cognitive Market Research (2024). Language Services market size was estimated at USD 58.9 billion in 2022! [Dataset]. https://www.cognitivemarketresearch.com/language-services-market-report
    Explore at:
    pdf, excel, csv, ppt (available download formats)
    Dataset updated
    Feb 19, 2024
    Dataset authored and provided by
    Cognitive Market Research
    License

    https://www.cognitivemarketresearch.com/privacy-policy

    Time period covered
    2021 - 2033
    Area covered
    Global
    Description

    According to Cognitive Market Research, the global language services market size was estimated at USD 58.9 billion in 2022 and is expected to grow at a compound annual growth rate (CAGR) of 6.2% from 2023 to 2030.

    Which Factors Drive the Language Services Market Growth?

    Cross-border contact has become more intense due to globalization, increasing the need for translation, localization, and interpretation services. Language solutions are required by growing multinational businesses, e-commerce, and multilingual customer service. Growth is also fueled by government programs that support accessibility and multilingualism. Technology advancements, including AI-driven translation tools, increase productivity and widen the market.

    These developments empower businesses to offer better-tailored solutions and services, which, in turn, contribute to the growth of the Language Services industry.

    For instance, BIG Language Solutions, a well-known international provider of language services, announced in April 2022 that it had acquired Lawlinguists, a Milan-based company that offers legal translation services. The purchase adds Italy, Germany, and Spain to BIG's European footprint and gives its clients access to a wider range of legal translation services, resources, and technology.

    (Source: biglanguage.com/blog/big-acquires-lawlinguists-expands-legal-offering-and-european-presence/)

    Globalization and Internationalization to Provide Viable Market Output

    A significant market driver for language services has been globalization. Communication in various languages is becoming increasingly important as firms grow internationally. The expansion of international trade, e-commerce, and cross-border investments all contribute to this trend. Companies must translate, localize, and adapt their products and services to local languages and cultures to remain competitive in the global market.

    There are approximately 7,139 languages spoken in the world today. However, many of these languages are endangered, with experts estimating that around 40% of languages are at risk of extinction.

    (Source: www.ohchr.org/en/stories/2019/10/many-indigenous-languages-are-danger-extinction)

    Multinational corporations with diverse workforces and clients from various language backgrounds have become popular due to globalization. These enterprises rely on translation services to eliminate language barriers to guarantee efficient internal communication and seamless relations with external parties. Language solutions, including document, website, and marketing material translation and conference and meeting interpretation services, greatly aid international collaboration and understanding.

    Technological Advancements to Propel Market Growth
    
    
    
    
    
    Localization of Digital Content
    

    Factors Restraining Growth of the Language Services Market

    Machine Translation Limitations to Hinder Market Growth
    

    The limitations of machine translation restrain the language services market. Although advances in AI have improved machine translation quality, it still falls short of human translation in accuracy and nuance, especially for complicated or specialized content. Machine translation systems frequently struggle with context and idiomatic expressions, which can make translations sound awkward or inaccurate to native speakers. This limitation is especially important in fields like law, medicine, and marketing, where accuracy and cultural appropriateness are key.

    How COVID-19 Impacted the Language Services Market?

    The pandemic drove digital transformation and remote work, which increased demand for translation and localization services as organizations sought to reach a worldwide audience. Translation in the medical and scientific fields increased as information sharing became essential. At the same time, travel restrictions hampered on-site interpreting and increased the demand for remote interpreting services. Because the pandemic highlighted the need for efficient intercultural communication, businesses, the medical community, and governments all prioritized language services to enable proper information flow and support during the crisis.

    What Are Language Services?

    Language services are professional services that enable communication and understanding between different cultural groups. They facilitate effective comm...

  15. Enrollment numbers in language training Spain 2005 to 2023

    • statista.com
    Updated Jan 22, 2025
    Cite
    Statista (2025). Enrollment numbers in language training Spain 2005 to 2023 [Dataset]. https://www.statista.com/statistics/459491/enrollment-numbers-in-language-training-spain/
    Explore at:
    Dataset updated
    Jan 22, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    Spain
    Description

    Enrollment numbers at language schools in Spain suggest that Spaniards are well aware of the importance of foreign languages. During the 2022/23 academic year, almost 331,000 people were registered at language schools in Spain to add a new language to their curricula. In a globalized world, languages play an increasingly important role in the job market. The most studied and spoken languages in the world include English, Mandarin, Hindi, and Spanish.

    The importance of language knowledge in the job market

    Enrollment numbers at language schools come as no surprise, considering that foreign languages have become a vital asset for job seekers in recent years. English, the language most used in international affairs, unsurprisingly ranked first among the most valued languages on the Spanish job market, with approximately 65.2 percent of job openings that require foreign language skills demanding it. French stood far behind, with 17.38 percent of such job openings.

    Languages in the Spanish multimedia scene

    Most of the best-selling albums in Spain during 2022 were recorded in the country’s main language, Spanish, with 38 albums in the top 50. As for video games, 96 percent of the games produced in the country had English as a language option. Spanish was the second most used language, present in 91 percent of productions.

  16. XLingHealth

    • huggingface.co
    Updated Feb 7, 2024
    Cite
    Georgia Tech CLAWS Lab (2024). XLingHealth [Dataset]. https://huggingface.co/datasets/claws-lab/XLingHealth
    Explore at:
    Dataset updated
    Feb 7, 2024
    Dataset authored and provided by
    Georgia Tech CLAWS Lab
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Card for "XLingHealth"

    XLingHealth is a Cross-Lingual Healthcare benchmark for clinical health inquiry that features the top four most spoken languages in the world: English, Spanish, Chinese, and Hindi.

      Statistics
    

    Dataset       Examples  Words (Q)      Words (A)
    HealthQA      1,134     7.72 ± 2.41    242.85 ± 221.88
    LiveQA        246       41.76 ± 37.38  115.25 ± 112.75
    MedicationQA  690       6.86 ± 2.83    61.50 ± 69.44

    Words (Q) and #Words (A) represent the average number of words… See the full description on the dataset page: https://huggingface.co/datasets/claws-lab/XLingHealth.

  17. GlobalPhone Chinese-Mandarin

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Jun 26, 2017
    + more versions
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2017). GlobalPhone Chinese-Mandarin [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0193/
    Explore at:
    Dataset updated
    Jun 26, 2017
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Description

    The GlobalPhone corpus, developed in collaboration with the Karlsruhe Institute of Technology (KIT), was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks.

    The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).

    In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering and false starts, and for non-verbal effects like laughing and hesitations. Speaker information such as age, gender, and occupation, as well as information about the recording setup, complements the database.

    The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers. Data is shortened by means of the shorten program written by Tony Robinson. Alternatively, the data can be delivered unshortened.

    The Chinese-Mandarin corpus was produced using the People's Daily newspaper. It contains recordings of 132 speakers (64 males, 68 females) recorded in Beijing, Wuhan, and Hekou, China. The age distribution is as follows: 16 speakers are below 19, 96 speakers are between 20 and 29, 16 speakers are between 30 and 39, and 3 speakers are between 40 and 49 (the age of 1 speaker is unknown).

  18. Wikipedias Statistics.

    • plos.figshare.com
    • figshare.com
    xls
    Updated Jun 1, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Taha Yasseri; Robert Sumi; János Kertész (2023). Wikipedias Statistics. [Dataset]. http://doi.org/10.1371/journal.pone.0030091.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Taha Yasseri; Robert Sumi; János Kertész
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Statistics about WPs under investigation. Name of the WP, language, the most populated country in which the language is spoken, and total number of speakers in the world (millions) are reported in columns 1 to 4, followed by number of articles (thousands) in the WP, number of edits (millions), number of users (thousands), number of active users (users who have edited in the last month), and the percentage of edits by unregistered users (known by their IP addresses) relative to all edits. The last two columns give the UTC offset assigned to each WP and the Sleep Depth, respectively. The demographic data is taken from Wikipedia and is meant to give the reader an impression; the paper contains no analysis based on this data.

    * For the languages which are widely spoken in the world, the origin country is not well-defined.
    † Esperanto has never been an official language of any country.
    ‡ Egypt (the most populated Arab country) time zone.
    + USA Central standard time zone.
    ▴ Central European time zone.
    ▵ Spain time zone.
    § Portugal time zone.

  19. Share of U.S. population speaking a language besides English at home 2023,...

    • statista.com
    Updated Jun 23, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2025). Share of U.S. population speaking a language besides English at home 2023, by state [Dataset]. https://www.statista.com/statistics/312940/share-of-us-population-speaking-a-language-other-than-english-at-home-by-state/
    Explore at:
    Dataset updated
    Jun 23, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2023
    Area covered
    United States
    Description

    As of 2023, more than ** percent of people in the United States spoke a language other than English at home. California had the highest share among all U.S. states, with ** percent of its population speaking a language other than English at home.

  20. Cyberbullying dataset for Kurdish Language

    • data.mendeley.com
    Updated Aug 5, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Soran Badawi (2025). Cyberbullying dataset for Kurdish Language [Dataset]. http://doi.org/10.17632/ck49jyxcbt.4
    Explore at:
    Dataset updated
    Aug 5, 2025
    Authors
    Soran Badawi
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Cyberbullying has become an increasingly prevalent issue in the digital age, with the rise of social media and online communication. It can take many forms, including verbal attacks, harassment, and discrimination, and it can have serious consequences for victims, including depression, anxiety, and even suicide. While much research has been done on cyberbullying in languages such as English, Spanish, and Chinese, there has been little focus on languages spoken by smaller populations, such as Kurdish.

    Kurdish is a language spoken by millions of people in the Middle East, including Turkey, Iran, Iraq, and Syria. It is an Indo-European language with several dialects, and it is considered an official language in Iraq and an official regional language in Iran. Despite its widespread use, there has been very little research on cyberbullying in Kurdish, and there are currently no datasets available that specifically focus on this issue.

    To address this gap, we have created the first ever cyberbullying dataset for the Kurdish language. This dataset contains three classes: neutral, racism, and sexism. The neutral class includes messages that do not contain any form of cyberbullying, while the racism and sexism classes include messages that contain discriminatory language based on race or gender, respectively.

    The dataset was created using a combination of manual and automated techniques. We collected a large number of messages written in Kurdish through the Twitter API. We then manually labeled these messages based on whether they contained cyberbullying or not, and further categorized them into the three classes. The resulting dataset contains over 30,000 messages, with roughly equal distribution among the three classes. It is a valuable resource for researchers and practitioners who are interested in studying cyberbullying in the Kurdish language and developing strategies to combat it.

    The dataset can be used for a variety of purposes, including training machine learning models to detect cyberbullying in Kurdish, analyzing the language used in cyberbullying messages to identify patterns and trends, and developing interventions to prevent and address cyberbullying in Kurdish-speaking communities.

The most spoken languages worldwide 2025

416 scholarly articles cite this dataset (View in Google Scholar)