66 datasets found

The most spoken languages worldwide 2025
statista.com
Updated Apr 14, 2025
Cite
Statista (2025). The most spoken languages worldwide 2025 [Dataset]. https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/
Explore at:
Dataset updated
Apr 14, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
2025
Area covered
World
Description
In 2025, around 1.53 billion people worldwide spoke English either natively or as a second language, more than the 1.18 billion Mandarin Chinese speakers at the time of the survey. Hindi and Spanish were the third and fourth most widely spoken languages that year.

Languages in the United States

The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken there vary as a result of its multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which more than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese), 1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year.

Different languages at home

The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage is California, where about 45 percent of the population spoke a language other than English at home in 2023.
Ranking of languages spoken at home in the U.S. 2023
statista.com
Updated Apr 14, 2025
Cite
Statista (2025). Ranking of languages spoken at home in the U.S. 2023 [Dataset]. https://www.statista.com/statistics/183483/ranking-of-languages-spoken-at-home-in-the-us-in-2008/
Explore at:
Dataset updated
Apr 14, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
2023
Area covered
United States
Description
In 2023, around 43.37 million people in the United States spoke Spanish at home. In comparison, approximately 998,179 people were speaking Russian at home during the same year. The distribution of the U.S. population by ethnicity can be accessed here. A ranking of the most spoken languages across the world can be accessed here.
The Most Spoken Languages Around the World
kaggle.com
Updated Nov 4, 2020
Cite
Narmelan Tharmalingam (2020). The Most Spoken Languages Around the World [Dataset]. https://www.kaggle.com/narmelan/100-most-spoken-languages-around-the-world/code
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Nov 4, 2020
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Narmelan Tharmalingam
License
Open Database License (ODbL) v1.0 (https://www.opendatacommons.org/licenses/odbl/1.0/)
License information was derived automatically
Area covered
World
Description
Context

After going through quite the verbal loop when ordering foreign currency through the bank, which involved a discussion with an assigned financial advisor at the branch the following day to confirm details, I noticed that, despite our names hinting at similar backgrounds, communication by phone was much more difficult because of thick accents and the different speech patterns of a non-native speaker.

It hit me then coming from an extremely multicultural and welcoming city, the challenges others from completely different labels given to them in life must go through in their daily affairs when having to face communication barriers that I myself encountered, particularly when interacting with those outside their usual bubble. Now imagine this situation occurring every hour across the world in various sectors of business. How may this impede, help or create frustrations in minor or major ways as a result of increasing workplace diversity quota demands, customer satisfaction needs and process efficiencies?

The data I was looking for to explore this phenomenon existed in the form of native and non-native speaker counts for the 100 most commonly spoken languages across the globe.

Content

The data in this database contains the following attributes:

Language - name of the language

Total Speakers - this assumes both native and non-native speakers

Native Speakers - native speakers of the language

Origin - family origin group of said language
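
The four attributes above map naturally onto a small record type. The sketch below is a minimal illustration, assuming the data is available as CSV with exactly these column headers and plain integer speaker counts; neither assumption is confirmed by the description. The two sample rows are made-up placeholders, not real figures, and the names `LanguageRow` and `load_rows` are hypothetical. It derives the non-native speaker count as Total Speakers minus Native Speakers.

```python
import csv
import io
from dataclasses import dataclass

# Placeholder rows for illustration only; these are NOT real speaker counts.
SAMPLE_CSV = """Language,Total Speakers,Native Speakers,Origin
Language A,1000,400,Family X
Language B,500,450,Family Y
"""


@dataclass
class LanguageRow:
    language: str
    total_speakers: int   # native plus non-native speakers
    native_speakers: int
    origin: str           # language family

    @property
    def non_native_speakers(self) -> int:
        # Total Speakers is described as covering both groups,
        # so the difference is the non-native count.
        return self.total_speakers - self.native_speakers


def load_rows(handle):
    """Parse CSV text with the assumed column headers into LanguageRow objects."""
    return [
        LanguageRow(
            language=r["Language"],
            total_speakers=int(r["Total Speakers"]),
            native_speakers=int(r["Native Speakers"]),
            origin=r["Origin"],
        )
        for r in csv.DictReader(handle)
    ]


rows = load_rows(io.StringIO(SAMPLE_CSV))
for row in rows:
    print(row.language, row.origin, row.non_native_speakers)
```

If the real file stores counts with separators or unit suffixes, the `int(...)` conversions would need a cleaning step first.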

Acknowledgements

The data was collected with the aid of WordTips visualization of the 22nd edition of Ethnologue - "a research center for language intelligence"

https://www.ethnologue.com/world https://www.ethnologue.com/guides/ethnologue200 https://word.tips/pictures/b684e98f-f512-4ac0-96a4-0efcf6decbc0_most-spoken-languages-world-5.png?auto=compress,format&rect=0,0,2001,7115&w=800&h=2845

Inspiration

As globalization no longer constrains us, what implications will this have in terms of organizational communications conducted moving forward? I believe this is something to be examined in careful context in order to make customer relationship processes meaningful rather than it being confined to a strictly detached transactional basis.
Spoken Language Statistics
zenodo.org
bin, pdf, txt
Updated Aug 4, 2024
Cite
Alex Bampoulidis (2024). Spoken Language Statistics [Dataset]. http://doi.org/10.5281/zenodo.55708
Explore at:
Available download formats: bin, pdf, txt
Unique identifier
https://doi.org/10.5281/zenodo.55708
Dataset updated
Aug 4, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Alex Bampoulidis
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
Find out which are the top 10 most spoken languages in the world according to GeoNames, and preserve the data containing the information needed, as some countries get split or merged, some languages go extinct, etc.
Common languages used for web content 2025, by share of websites
statista.com
Updated Feb 11, 2025
Cite
Statista (2025). Common languages used for web content 2025, by share of websites [Dataset]. https://www.statista.com/statistics/262946/most-common-languages-on-the-internet/
Explore at:
Dataset updated
Feb 11, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Feb 2025
Area covered
Worldwide
Description
As of February 2025, English was the most popular language for web content, with over 49.4 percent of websites using it. Spanish ranked second, with six percent of web content, followed by German, with 5.6 percent.

English as the leading online language

The United States and India, the countries with the most internet users after China, are also the world's biggest English-speaking markets. As of January 2023, the combined internet user base of the two countries was over a billion individuals. This has led to most online information being created in English. Consequently, even those who are not native speakers may use it for convenience.

Global internet usage by regions

As of October 2024, there were 5.52 billion internet users worldwide. In the same period, Northern Europe and North America led in internet penetration rates, with around 97 percent of their populations accessing the internet.
Number of native Spanish speakers worldwide 2024, by country
statista.com
Updated Jan 15, 2025
Cite
Statista (2025). Number of native Spanish speakers worldwide 2024, by country [Dataset]. https://www.statista.com/statistics/991020/number-native-spanish-speakers-country-worldwide/
Explore at:
Dataset updated
Jan 15, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Area covered
World
Description
Mexico is the country with the largest number of native Spanish speakers in the world. As of 2024, 132.5 million people in Mexico spoke Spanish with a native command of the language. Colombia had the second-highest number of native Spanish speakers, at around 52.7 million. Spain came in third, with 48 million, and Argentina fourth, with 46 million.

Spanish, a world language

As of 2023, Spanish ranked as the fourth most spoken language in the world, behind only English, Chinese, and Hindi, with over half a billion speakers. Spanish is the official language of over 20 countries, the majority of them on the American continent; it is also one of the official languages of Equatorial Guinea in Africa. Spanish also has a strong presence in other countries, such as the United States, Morocco, and Brazil, which are among the non-Hispanic countries with the highest number of Spanish speakers.

The second most spoken language in the U.S.

In the most recent data, Spanish ranked as the language other than English with the highest number of speakers in the U.S., with 12 times as many speakers as the second-place language. This comes as no surprise given the long history of migration from Latin American countries to the United States. Moreover, in fiscal year 2022 alone, 5 of the top 10 countries of origin of naturalized people in the U.S. were Spanish-speaking countries.
GlobalPhone Swahili
catalogue.elra.info
live.european-language-grid.eu
Updated Jun 26, 2017
Cite
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2017). GlobalPhone Swahili [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0375/
Explore at:
Dataset updated
Jun 26, 2017
Dataset provided by
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
ELRA (European Language Resources Association)
License
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
Description
The GlobalPhone corpus, developed in collaboration with the Karlsruhe Institute of Technology (KIT), was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).

In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers.

Data is shortened by means of the shorten program written by Tony Robinson. Alternatively, the data can be delivered unshortened.

The GlobalPhone Swahili corpus contains 7,728 utterances spoken by 70 speakers. Native speakers of Swahili were asked to read prompted sentences of newspaper articles. The entire collection took place in Nairobi, Kenya.

Swahili newspaper source: http://www.voaswahili.com
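
The audio format stated in this description (16-bit, 16 kHz, mono) fixes the uncompressed data rate, so a rough size estimate follows from simple arithmetic. The sketch below assumes raw PCM without container headers and treats the "over 450 hours" figure as a lower bound; the function name `pcm_bytes` is illustrative, and sizes after compression with shorten would be smaller and are not estimated here.

```python
SAMPLE_RATE_HZ = 16_000   # 16 kHz, as stated in the description
BYTES_PER_SAMPLE = 2      # 16-bit samples
CHANNELS = 1              # mono


def pcm_bytes(hours: float) -> int:
    """Uncompressed PCM size in bytes for the given duration (no file headers)."""
    seconds = hours * 3600
    return int(seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)


bytes_per_hour = pcm_bytes(1)
corpus_lower_bound = pcm_bytes(450)  # "over 450 hours", so this is a floor

print(f"per hour: {bytes_per_hour / 1e6:.1f} MB")
print(f"450 hours: {corpus_lower_bound / 1e9:.2f} GB")
```

This works out to 115.2 MB per hour of speech and about 51.84 GB for 450 hours before any compression.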
GlobalPhone Portuguese (Brazilian)
catalogue.elra.info
live.european-language-grid.eu
Updated Jun 26, 2017
Cite
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2017). GlobalPhone Portuguese (Brazilian) [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0201/
Explore at:
Dataset updated
Jun 26, 2017
Dataset provided by
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
ELRA (European Language Resources Association)
License
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
Area covered
Brazil
Description
The GlobalPhone corpus, developed in collaboration with the Karlsruhe Institute of Technology (KIT), was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).

In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers.

Data is shortened by means of the shorten program written by Tony Robinson. Alternatively, the data can be delivered unshortened.

The Portuguese (Brazilian) corpus was produced using the Folha de Sao Paulo newspaper. It contains recordings of 102 speakers (54 males, 48 females) recorded in Porto Velho and Sao Paulo, Brazil. The following age distribution has been obtained: 6 speakers are below 19, 58 speakers are between 20 and 29, 27 speakers are between 30 and 39, 5 speakers are between 40 and 49, and 5 speakers are over 50 (the age of 1 speaker is unknown).
The most linguistically diverse countries worldwide 2025, by number of...
statista.com
Updated Apr 15, 2025
Cite
Statista (2025). The most linguistically diverse countries worldwide 2025, by number of languages [Dataset]. https://www.statista.com/statistics/1224629/the-most-linguistically-diverse-countries-worldwide-by-number-of-languages/
Explore at:
Dataset updated
Apr 15, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
2025
Area covered
World
Description
Papua New Guinea is the most linguistically diverse country in the world. As of 2025, it was home to 840 different languages. Indonesia ranked second with 709 languages spoken. In the United States, 335 languages were spoken in that same year.
Language spoken at Home (Census 2016)
digital-earth-pacificcore.hub.arcgis.com
cacgeoportal.com
Updated May 26, 2019
Cite
Esri Australia (2019). Language spoken at Home (Census 2016) [Dataset]. https://digital-earth-pacificcore.hub.arcgis.com/datasets/esriau::language-spoken-at-home-census-2016/about
Explore at:
Dataset updated
May 26, 2019
Dataset provided by
Esri (http://esri.com/)
Esri Australia
Authors
Esri Australia
Description
Does the person speak a language other than English at home? This map takes a look at answers to this question from Census Night.

Colour: For each SA1 geography, the colour indicates which language 'wins'. SA1 geographies that are not coloured are either tied between two languages or do not have enough data.

Colour Intensity: The colour intensity compares the value of the winner to all other values and returns its dominance over other languages in the same geography.

Notes: Only considers top 6 languages for VIC.

Census 2016 DataPacks
Predominance Visualisations
Source Code

Notice that while one language appears to dominate certain geographies, it doesn't necessarily mean it represents the majority of the population. In fact, as you explore most areas, you will find the predominant language makes up just a fraction of the population due to the number of languages considered.
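
The predominance idea described here can be sketched in a few lines. This is only an illustration under stated assumptions: the map's actual intensity formula is not given, so the sketch uses the winner's share of all counted speakers as a stand-in for "dominance", leaves tied areas without a winner, and uses made-up counts. The function name `predominant` is hypothetical.

```python
from typing import Dict, Optional, Tuple


def predominant(counts: Dict[str, int]) -> Tuple[Optional[str], float]:
    """Return (winning language, winner's share of the counted speakers).

    Areas with no data or a tie for first place get (None, 0.0), mirroring
    the uncoloured geographies in the description. Using the winner's share
    as the "dominance" value is an assumption, not the map's documented formula.
    """
    total = sum(counts.values())
    if total == 0:
        return None, 0.0
    top = max(counts.values())
    winners = [lang for lang, n in counts.items() if n == top]
    if len(winners) != 1:
        return None, 0.0
    return winners[0], top / total


# Hypothetical counts for one area; not census figures.
area = {"Mandarin": 30, "Greek": 25, "Italian": 25, "Vietnamese": 20}
winner, share = predominant(area)
print(winner, share)
```

In this made-up area the winning language holds only 30 percent of the counted speakers, which illustrates the closing caveat: a predominant language need not be a majority.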
Data from: Five Years of COVID-19 Discourse on Instagram: A Labeled...
zenodo.org
data.niaid.nih.gov
bin
Updated Oct 21, 2024
Cite
Nirmalya Thakur, Ph.D. (2024). Five Years of COVID-19 Discourse on Instagram: A Labeled Instagram Dataset of Over Half a Million Posts for Multilingual Sentiment Analysis [Dataset]. http://doi.org/10.5281/zenodo.13896353
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.13896353
Dataset updated
Oct 21, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Nirmalya Thakur, Ph.D.
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Time period covered
Oct 6, 2024
Description
Please cite the following paper when using this dataset:

N. Thakur, “Five Years of COVID-19 Discourse on Instagram: A Labeled Instagram Dataset of Over Half a Million Posts for Multilingual Sentiment Analysis”, Proceedings of the 7th International Conference on Machine Learning and Natural Language Processing (MLNLP 2024), Chengdu, China, October 18-20, 2024 (Paper accepted for publication, Preprint available at: https://arxiv.org/abs/2410.03293)

Abstract

The outbreak of COVID-19 served as a catalyst for content creation and dissemination on social media platforms, as such platforms serve as virtual communities where people can connect and communicate with one another seamlessly. While there have been several works related to the mining and analysis of COVID-19-related posts on social media platforms such as Twitter (or X), YouTube, Facebook, and TikTok, there is still limited research that focuses on the public discourse on Instagram in this context. Furthermore, the prior works in this field have only focused on the development and analysis of datasets of Instagram posts published during the first few months of the outbreak. The work presented in this paper aims to address this research gap and presents a novel multilingual dataset of 500,153 Instagram posts about COVID-19 published between January 2020 and September 2024. This dataset contains Instagram posts in 161 different languages. After the development of this dataset, multilingual sentiment analysis was performed using VADER and twitter-xlm-roberta-base-sentiment. This process involved classifying each post as positive, negative, or neutral. The results of sentiment analysis are presented as a separate attribute in this dataset.

For each of these posts, the Post ID, Post Description, Date of publication, language code, full version of the language, and sentiment label are presented as separate attributes in the dataset.

The Instagram posts in this dataset are present in 161 different languages out of which the top 10 languages in terms of frequency are English (343041 posts), Spanish (30220 posts), Hindi (15832 posts), Portuguese (15779 posts), Indonesian (11491 posts), Tamil (9592 posts), Arabic (9416 posts), German (7822 posts), Italian (5162 posts), Turkish (4632 posts)
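
The post counts above can be turned into shares of the full dataset. The sketch below uses only the figures quoted in this description (500,153 posts in total and the ten per-language counts); the variable names are illustrative.

```python
TOTAL_POSTS = 500_153  # total posts stated in the abstract

# Per-language counts as listed in the description.
TOP_10 = {
    "English": 343_041,
    "Spanish": 30_220,
    "Hindi": 15_832,
    "Portuguese": 15_779,
    "Indonesian": 11_491,
    "Tamil": 9_592,
    "Arabic": 9_416,
    "German": 7_822,
    "Italian": 5_162,
    "Turkish": 4_632,
}

top_10_total = sum(TOP_10.values())
remaining = TOTAL_POSTS - top_10_total  # posts in the other 151 languages

for language, count in TOP_10.items():
    print(f"{language}: {100 * count / TOTAL_POSTS:.1f}%")
print(f"Top 10 combined: {100 * top_10_total / TOTAL_POSTS:.1f}%")
print(f"All other languages: {remaining} posts")
```

English alone accounts for roughly 68.6 percent of the posts, and the ten listed languages together cover 452,987 of the 500,153 posts, so any cross-language comparison is working with heavily imbalanced groups.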

There are 535,021 distinct hashtags in this dataset with the top 10 hashtags in terms of frequency being #covid19 (169865 posts), #covid (132485 posts), #coronavirus (117518 posts), #covid_19 (104069 posts), #covidtesting (95095 posts), #coronavirusupdates (75439 posts), #corona (39416 posts), #healthcare (38975 posts), #staysafe (36740 posts), #coronavirusoutbreak (34567 posts)

The following is a description of the attributes present in this dataset

Post ID: Unique ID of each Instagram post

Post Description: Complete description of each post in the language in which it was originally published

Date: Date of publication in MM/DD/YYYY format

Language code: Language code (for example: “en”) that represents the language of the post as detected using the Google Translate API

Full Language: Full form of the language (for example: “English”) that represents the language of the post as detected using the Google Translate API

Sentiment: Results of sentiment analysis (using the preprocessed version of each post) where each post was classified as positive, negative, or neutral
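
As a minimal sketch of working with these attributes, the snippet below reads rows, validates the MM/DD/YYYY date format, and tallies sentiment labels per language code. It assumes the data can be read as CSV whose headers match the attribute names exactly, which the description does not confirm; the three sample rows are placeholders rather than real posts, and `sentiment_by_language` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Placeholder rows for illustration only; not taken from the dataset.
SAMPLE = """Post ID,Post Description,Date,Language code,Full Language,Sentiment
1,placeholder text,01/15/2020,en,English,neutral
2,placeholder text,03/02/2021,es,Spanish,positive
3,placeholder text,09/30/2024,en,English,negative
"""


def sentiment_by_language(handle) -> Counter:
    """Count posts per (language code, sentiment label) pair."""
    counts = Counter()
    for row in csv.DictReader(handle):
        # Raises ValueError if the date is not in MM/DD/YYYY format.
        datetime.strptime(row["Date"], "%m/%d/%Y")
        counts[(row["Language code"], row["Sentiment"])] += 1
    return counts


counts = sentiment_by_language(io.StringIO(SAMPLE))
for (code, label), n in sorted(counts.items()):
    print(code, label, n)
```

Grouping by language code and label is a natural first step toward the cross-language sentiment questions listed for this dataset.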

Open Research Questions

This dataset is expected to be helpful for the investigation of the following research questions and even beyond:

How does sentiment toward COVID-19 vary across different languages?

How has public sentiment toward COVID-19 evolved from 2020 to the present?

How do cultural differences affect social media discourse about COVID-19 across various languages?

How has COVID-19 impacted mental health, as reflected in social media posts across different languages?

How effective were public health campaigns in shifting public sentiment in different languages?

What patterns of vaccine hesitancy or support are present in different languages?

How did geopolitical events influence public sentiment about COVID-19 in multilingual social media discourse?

What role does social media discourse play in shaping public behavior toward COVID-19 in different linguistic communities?

How does the sentiment of minority or underrepresented languages compare to that of major world languages regarding COVID-19?

What insights can be gained by comparing the sentiment of COVID-19 posts in widely spoken languages (e.g., English, Spanish) to those in less common languages?

All the Instagram posts that were collected during this data mining process to develop this dataset were publicly available on Instagram and did not require a user to log in to Instagram to view the same (at the time of writing this paper).
GlobalPhone Japanese
catalogue.elra.info
live.european-language-grid.eu
Updated Jun 26, 2017
Cite
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2017). GlobalPhone Japanese [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0199/
Explore at:
Dataset updated
Jun 26, 2017
Dataset provided by
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
ELRA (European Language Resources Association)
License
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
Description
The GlobalPhone corpus, developed in collaboration with the Karlsruhe Institute of Technology (KIT), was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).

In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers.

Data is shortened by means of the shorten program written by Tony Robinson. Alternatively, the data can be delivered unshortened.

The Japanese corpus was produced using the Nikkei Shinbun newspaper. It contains recordings of 149 speakers (104 males, 44 females, 1 unspecified) recorded in Tokyo, Japan. The following age distribution has been obtained: 22 speakers are below 19, 90 speakers are between 20 and 29, 5 speakers are between 30 and 39, 2 speakers are between 40 and 49, and 1 speaker is over 50 (the age of 28 speakers is unknown).
Language Named Authority List
data.europa.eu
rdf xml, xml, zip
Updated Sep 26, 2024
Cite
Publications Office of the European Union (2024). Language Named Authority List [Dataset]. https://data.europa.eu/data/datasets/language?locale=en
Explore at:
Available download formats: xml, rdf xml, zip
Dataset updated
Sep 26, 2024
Dataset provided by
Publications Office of the European Union (http://op.europa.eu/)
European Union
Authors
Publications Office of the European Union
License
http://data.europa.eu/eli/dec/2011/833/oj
Description
Language is a controlled vocabulary that lists world languages and language varieties, including sign languages. Its main purpose is to support activities associated with the publication process. The full set of languages contains more than 8000 language varieties, each identified by a code equivalent to the ISO 639-3 code. Concepts are aligned with the ISO 639 international standard, which is issued in several parts: ISO 639-1 contains strictly two alphabetic letters (alpha-2), ISO 639-2/B (B = bibliographic) is used for bibliographic purpose (alpha-3), ISO 639-2/T (T = terminology) is used for technical purpose (alpha-3), ISO 639-3 covers all the languages and macro-languages of the world (alpha-3); the values are compliant with ISO 639-2/T. If an authority code is needed for a language without an assigned ISO code, an alphanumeric code is created to avoid confusion with the strictly alphabetic ISO codes. Labels are provided in all 24 official EU languages for the most frequently used languages. Language is under governance of the Interinstitutional Metadata and Formats Committee (IMFC). It is maintained by the Publications Office of the European Union and disseminated on the EU Vocabularies website. It is a corporate reference data asset covered by the Corporate Reference Data Management policy of the European Commission.
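
To make the code families concrete, the sketch below lists the well-known ISO 639 codes for German and sorts arbitrary codes by their shape. It is a shape check only and does not consult any registry; the treatment of authority-created codes as "contains at least one digit" is an assumption, since the description says only that such codes are alphanumeric. The names `GERMAN` and `code_shape` are illustrative.

```python
import re

# ISO 639 codes for German, showing how the parts relate:
# 639-2/B and 639-2/T differ, while 639-3 matches 639-2/T.
GERMAN = {
    "639-1": "de",     # alpha-2
    "639-2/B": "ger",  # bibliographic, alpha-3
    "639-2/T": "deu",  # terminology, alpha-3
    "639-3": "deu",    # alpha-3, compliant with 639-2/T
}

ALPHA2 = re.compile(r"[a-z]{2}")
ALPHA3 = re.compile(r"[a-z]{3}")
# Assumed shape for authority-created codes: letters and digits with at least one digit.
ALNUM_WITH_DIGIT = re.compile(r"(?=[a-z0-9]*[0-9])[a-z0-9]+")


def code_shape(code: str) -> str:
    """Classify a language code by shape only (no registry lookup)."""
    c = code.lower()
    if ALPHA2.fullmatch(c):
        return "alpha-2"
    if ALPHA3.fullmatch(c):
        return "alpha-3"
    if ALNUM_WITH_DIGIT.fullmatch(c):
        return "alphanumeric"
    return "unrecognized"


for part, code in GERMAN.items():
    print(part, code, code_shape(code))
```

Because strictly alphabetic two- and three-letter strings are reserved for the ISO shapes, a code containing a digit cannot be mistaken for an ISO code, which is the point of the alphanumeric convention described above.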
Data from: The Equivalence and Difference between the English and Chinese...
tandf.figshare.com
pdf
Updated Jun 1, 2023
Cite
Rachel Phillips; Yin Bun Cheung; Simon Lowes Collinson; May-Li Lim; Audrey Ling; Lei Feng; Tze-Pin Ng (2023). The Equivalence and Difference between the English and Chinese Language Versions of the Repeatable Battery for the Assessment of Neuropsychological Status [Dataset]. http://doi.org/10.6084/m9.figshare.4616188.v1
Explore at:
Available download formats: pdf
Unique identifier
https://doi.org/10.6084/m9.figshare.4616188.v1
Dataset updated
Jun 1, 2023
Dataset provided by
Taylor & Francis
Authors
Rachel Phillips; Yin Bun Cheung; Simon Lowes Collinson; May-Li Lim; Audrey Ling; Lei Feng; Tze-Pin Ng
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Objective: Chinese is the most commonly spoken language in the world. The availability of Chinese translations of assessment scales is useful for research in multi-ethnic and multinational studies. This study aimed to establish whether each of the Chinese translations (Mandarin, Hokkien, Teochew, and Cantonese) of the Repeatable Battery for the Assessment of Neuropsychological Status (RBANS) achieved measurement equivalence to the English version.

Method: Participants included 1856 ethnic Chinese, older adults. The RBANS was administered in the language/dialect according to the participants’ preference by interviewers who were fluent in that language/dialect. Multiple regression analysis was used to adjust for demographic and clinical differences between participants who spoke different languages/dialects. Equivalence (practical equivalence) was declared if the 90% confidence interval for the adjusted mean difference fell entirely within the pre-specified equivalence margin, ±0.2 (±0.4) standard deviations.

Results: The delayed memory index was at least practically equivalent across languages. The Mandarin, Hokkien, and Teochew versions of the immediate memory, language, and total scale score were practically equivalent to the English version; the Cantonese version showed small differences from the English version. Equivalence was not established for the Hokkien and Teochew versions of the visuospatial/constructional index. The attention index was different across languages.

Conclusions: Data from the English and Chinese versions for the total scale score, language, delayed, and immediate memory indexes may be pooled for analysis. However, analysis of the attention and visuospatial/constructional indexes from the English and Chinese versions should include a covariate that represents the version in the statistical adjustment.
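
The decision rule in the Method section can be written as a small function. This is a sketch of the stated rule only, not the authors' analysis code: it takes an already-computed 90% confidence interval in standard-deviation units, treats the margins as inclusive (an assumption, since the abstract does not say), and uses hypothetical intervals in the example. The name `classify_equivalence` is illustrative.

```python
def classify_equivalence(
    ci_low: float,
    ci_high: float,
    strict_margin: float = 0.2,     # equivalence margin, in SD units
    practical_margin: float = 0.4,  # practical-equivalence margin, in SD units
) -> str:
    """Apply the interval-inclusion rule to a 90% CI for the adjusted mean difference."""
    if ci_low > ci_high:
        raise ValueError("ci_low must not exceed ci_high")
    # The whole interval must lie inside the margin, not just the point estimate.
    if -strict_margin <= ci_low and ci_high <= strict_margin:
        return "equivalent"
    if -practical_margin <= ci_low and ci_high <= practical_margin:
        return "practically equivalent"
    return "not established"


# Hypothetical intervals, not results from the study.
print(classify_equivalence(-0.10, 0.15))
print(classify_equivalence(-0.30, 0.10))
print(classify_equivalence(-0.50, 0.10))
```

Requiring the entire interval to sit inside the margin is what separates an equivalence claim from merely failing to detect a difference.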
Finnish General Conversation Speech Dataset for ASR
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Finnish General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-finnish-finland
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
Welcome to the Finnish General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Finnish speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Finnish communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Finnish speech models that understand and respond to authentic Finnish accents and dialects.
Speech Data
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Finnish. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
•Participant Diversity:
  •Speakers: 60 verified native Finnish speakers from FutureBeeAI’s contributor community.
  •Regions: Representing various provinces of Finland to ensure dialectal diversity and demographic balance.
  •Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
•Recording Details:
  •Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
  •Duration: Each conversation ranges from 15 to 60 minutes.
  •Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
  •Environment: Quiet, echo-free settings with no background noise.
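A downloaded file can be checked against the stated audio format (stereo, 16-bit, 16 kHz) before it enters a training pipeline. The sketch below uses only Python's standard `wave` module. The function names and the `EXPECTED` dictionary are hypothetical and are not part of any FutureBeeAI tooling.

```python
import wave

# Values taken from the listing: stereo, 16-bit depth, 16 kHz sample rate.
EXPECTED = {"channels": 2, "sample_width_bytes": 2, "sample_rate_hz": 16000}


def read_wav_format(path):
    """Read channel count, sample width, and sample rate from a WAV header."""
    with wave.open(path, "rb") as wf:
        return {
            "channels": wf.getnchannels(),
            "sample_width_bytes": wf.getsampwidth(),  # 2 bytes == 16 bits
            "sample_rate_hz": wf.getframerate(),
        }


def matches_listing(path, expected=EXPECTED):
    """Return True if the file's header matches the expected format."""
    return read_wav_format(path) == expected
```

Calling `matches_listing("conversation_001.wav")` on a local file (the file name here is illustrative) returns `False` for any recording that has been resampled or downmixed to mono.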

Topic Diversity
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
•Sample Topics Include:
•Family & Relationships
•Food & Recipes
•Education & Career
•Healthcare Discussions
•Social Issues
•Technology & Gadgets
•Travel & Local Culture
•Shopping & Marketplace Experiences, and many more.
Transcription
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
•Transcription Highlights:
•Speaker-segmented dialogues
•Time-coded utterances
•Non-speech elements (pauses, laughter, etc.)
•High transcription accuracy, achieved through double QA pass, average WER < 5%
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
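The listing quotes an average word error rate (WER) below 5% but does not say how it was computed. As a rough illustration, WER is commonly defined as the word-level edit distance between a reference and a hypothesis transcript, divided by the number of reference words. The sketch below follows that common definition; the function name and the whitespace tokenization are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (substitutions, insertions, deletions)
    divided by the number of words in the reference.

    Assumption: words are separated by whitespace and compared exactly.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

Averaging this value over all utterance pairs gives a corpus-level figure that can be compared with the quoted threshold.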
Metadata
The dataset comes with granular metadata for both speakers and recordings:
•Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
•Recording Metadata: Topic, duration, audio format, device type, and sample rate.
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
Usage and Applications
This dataset is a versatile resource for multiple Finnish speech and language AI applications:
•ASR Development: Train accurate speech-to-text systems for Finnish.
•Voice Assistants: Build smart assistants capable of understanding natural Finnish conversations.
Most used programming languages among developers worldwide 2024
statista.com
Updated Feb 6, 2025
Cite
Statista (2025). Most used programming languages among developers worldwide 2024 [Dataset]. https://www.statista.com/statistics/793628/worldwide-developer-survey-most-used-languages/
Explore at:
Dataset updated
Feb 6, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
May 19, 2024 - Jun 20, 2024
Area covered
Worldwide
Description
As of 2024, JavaScript and HTML/CSS were the most commonly used programming languages among software developers around the world, with more than 62 percent of respondents stating that they used JavaScript and around 53 percent using HTML/CSS. Python, SQL, and TypeScript rounded out the top five most widely used programming languages around the world.
Programming languages
At a very basic level, programming languages serve as sets of instructions that direct computers on how to behave and carry out tasks. Thanks to the increased prevalence of, and reliance on, computers and electronic devices in today’s society, these languages play a crucial role in the everyday lives of people around the world. An increasing number of people are interested in furthering their understanding of these tools through courses and bootcamps, while current developers are constantly seeking new languages and resources to add to their skills. Furthermore, programming knowledge is becoming an important skill to possess within various industries throughout the business world. Job seekers with skills in Python, R, and SQL will find their knowledge to be among the most highly desirable data science skills, which will likely assist in their search for employment.
Statistics Canada, Population by Language Spoken at Home by Census Division,...
geocommons.com
Updated Jul 3, 2008
Cite
Brendan (2008). Statistics Canada, Population by Language Spoken at Home by Census Division, Alberta-Canada, 2006 [Dataset]. http://geocommons.com/search.html
Explore at:
Dataset updated
Jul 3, 2008
Dataset provided by
Statistics Canada
Brendan
Description
This dataset displays information regarding the language spoken most often at home. This data is available on the Census Division level, and is available from the 2006 Canadian Census. This data was obtained through: Statistics Canada. This data refers to the language spoken most often at home by the individual at the time of the census. Other languages spoken at home on a regular basis were also collected. Included are population figures for the following attributes: Total Population, English, French, Non-Official, English and French, English and Non-Official Language, French and Non-Official Language, and English French and Non-Official Speaking. This data is also broken down by Age Group.
jampatoisnli
huggingface.co
Updated Jul 21, 2023
Cite
Ruth-Ann Armstrong (2023). jampatoisnli [Dataset]. https://huggingface.co/datasets/Ruth-Ann/jampatoisnli
Explore at:
Croissant: Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 21, 2023
Authors
Ruth-Ann Armstrong
License
https://choosealicense.com/licenses/other/
Description
Dataset Card for [Dataset Name]

Dataset Summary

JamPatoisNLI provides the first dataset for natural language inference in a creole language, Jamaican Patois. Many of the most-spoken low-resource languages are creoles. These languages commonly have a lexicon derived from a major world language and a distinctive grammar reflecting the languages of the original speakers and the process of language birth by creolization. This gives them a distinctive place in exploring the… See the full description on the dataset page: https://huggingface.co/datasets/Ruth-Ann/jampatoisnli.
SLE Language Areas
ebola-nga.opendata.arcgis.com
Updated Feb 2, 2015
Cite
National Geospatial-Intelligence Agency (2015). SLE Language Areas [Dataset]. https://ebola-nga.opendata.arcgis.com/content/ffe30c1c30ed48fcafb14e8a026128d5
Explore at:
Dataset updated
Feb 2, 2015
Dataset authored and provided by
National Geospatial-Intelligence Agency
Area covered

Description
While English is the official language, it is typically used for governmental, business, and media purposes. In day-to-day life most people in the country speak Krio, which is a style of Pidgin English or English-based creole language. Krio is the lingua franca for the country and the formal language for those who do not speak English. With the number of different ethnic groups, Krio unites these groups with a common language. The citizens who are fluent in English are among the elite minority and often experience privileges such as economic opportunities that non-English speakers are excluded from. Other common indigenous languages used in the country are Mende, Temne, and Limba. As the official language, English is the only language used in education. It is reported that school children who speak indigenous languages on school premises are punished. Students who fail English classes are not granted admission into college.
Attribute Table Field Descriptions
ISO3 - International Organization for Standardization 3-digit country code
ADM0_NAME - Administration level zero identification / name
LANG_FAM - Language family
LANG_SUBGR - Language subgroup
ALT_NAMES - Alternate names
COMMENTS - Comments or notes regarding language
SOURCE_DT - Source one creation date
SOURCE - Source one
SOURCE2_DT - Source two creation date
SOURCE2 - Source two
Collection
This feature class was created using Anthromapper consisting of linguistic layers that have been primarily based on The World Language Mapping System (WLMS). Geographical terrain features, combined with a watershed model, were also used to predict the likely extent of linguistic influence. The metadata was supplemented with anthropological and linguistic information from peer-reviewed journals and published books.
It should be noted that this feature class only depicts the majority first level languages spoken in a given area; there might be significant populations of other minority language speakers not shown in this dataset.
The data included herein have not been derived from a registered survey and should be considered approximate unless otherwise defined. While rigorous steps have been taken to ensure the quality of each dataset, DigitalGlobe is not responsible for the accuracy and completeness of data compiled from outside sources.
Sources (HGIS)
Anthromapper. DigitalGlobe, November 2014.
Ethnologue, “Languages of the World”. 2012. Accessed November 2014. http://www.ethnologue.com.
World Language Mapping System (WLMS) Version 16. World GeoDatasets, November 2014.
Sources (Metadata)
Antimoon, “English, French, and Arabic languages in Sierra Leone”. December 2009. Accessed December 2014. http://www.antimoon.com.
Central Intelligence Agency. The World FactBook, “Sierra Leone”. June 2014. Accessed November 2014. https://www.cia.gov/library/publications/the-world-factbook.
DePauw University. Sierra Leone, “Language”. January 2014. Accessed December 2014. http://www.depauw.edu.
National African Language Resource Center (NALRC), “Krio”. January 2014. Accessed December 2014. http://www.nalrc.indiana.edu.
Global Mandarin Learning Market Overview and Outlook 2025-2032
statsndata.org
excel, pdf
Updated Aug 2025
Cite
Stats N Data (2025). Global Mandarin Learning Market Overview and Outlook 2025-2032 [Dataset]. https://www.statsndata.org/report/mandarin-learning-market-178140
Explore at:
Available download formats: pdf, excel
Dataset updated
Aug 2025
Dataset authored and provided by
Stats N Data
License
https://www.statsndata.org/how-to-order
Area covered
Global
Description
The Mandarin Learning market has seen significant growth over the past decade, reflecting the increasing global interest in China as a major economic powerhouse and cultural influencer. With Mandarin being the most spoken language in the world, the demand for Mandarin language education has spiked, creating a divers

