48 datasets found

The most spoken languages worldwide 2025
statista.com
Updated Apr 14, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Statista (2025). The most spoken languages worldwide 2025 [Dataset]. https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/
Explore at:
Dataset updated
Apr 14, 2025
Dataset authored and provided by
Statistahttp://statista.com/
Time period covered
2025
Area covered
World
Description
In 2025, there were around 1.53 billion people worldwide who spoke English either natively or as a second language, slightly more than the 1.18 billion Mandarin Chinese speakers at the time of survey. Hindi and Spanish accounted for the third and fourth most widespread languages that year. Languages in the United States The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken in the United States vary as a result of the multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which over than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese),1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year. Different languages at home The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage of population speaking a language other than English is California. About 45 percent of its population was speaking a language other than English at home in 2023.
Ranking of languages spoken at home in the U.S. 2023
statista.com
Updated Apr 14, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ranking of languages spoken at home in the U.S. 2023 [Dataset]. https://www.statista.com/statistics/183483/ranking-of-languages-spoken-at-home-in-the-us-in-2008/
Explore at:
Dataset updated
Apr 14, 2025
Dataset authored and provided by
Statistahttp://statista.com/
Time period covered
2023
Area covered
United States
Description
In 2023, around 43.37 million people in the United States spoke Spanish at home. In comparison, approximately 998,179 people were speaking Russian at home during the same year. The distribution of the U.S. population by ethnicity can be accessed here. A ranking of the most spoken languages across the world can be accessed here.
Spoken Language Statistics
zenodo.org
bin, pdf, txt
Updated Aug 4, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alex Bampoulidis; Alex Bampoulidis (2024). Spoken Language Statistics [Dataset]. http://doi.org/10.5281/zenodo.55708
Explore at:
bin, pdf, txtAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.55708
Dataset updated
Aug 4, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Alex Bampoulidis; Alex Bampoulidis
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Find out which are the top 10 most spoken languages in the world according to GeoNames and preserve the data containing the information needed, as some countries get split or merged, some languages get extinct, etc.
Common languages used for web content 2025, by share of websites
statista.com
ai-chatbox.pro
Updated Feb 11, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Statista (2025). Common languages used for web content 2025, by share of websites [Dataset]. https://www.statista.com/statistics/262946/most-common-languages-on-the-internet/
Explore at:
Dataset updated
Feb 11, 2025
Dataset authored and provided by
Statistahttp://statista.com/
Time period covered
Feb 2025
Area covered
Worldwide
Description
As of February 2025, English was the most popular language for web content, with over 49.4 percent of websites using it. Spanish ranked second, with six percent of web content, while the content in the German language followed, with 5.6 percent. English as the leading online language United States and India, the countries with the most internet users after China, are also the world's biggest English-speaking markets. The internet user base in both countries combined, as of January 2023, was over a billion individuals. This has led to most of the online information being created in English. Consequently, even those who are not native speakers may use it for convenience. Global internet usage by regions As of October 2024, the number of internet users worldwide was 5.52 billion. In the same period, Northern Europe and North America were leading in terms of internet penetration rates worldwide, with around 97 percent of its populations accessing the internet.
Number of native Spanish speakers worldwide 2024, by country
statista.com
Updated Jan 15, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Statista (2025). Number of native Spanish speakers worldwide 2024, by country [Dataset]. https://www.statista.com/statistics/991020/number-native-spanish-speakers-country-worldwide/
Explore at:
Dataset updated
Jan 15, 2025
Dataset authored and provided by
Statistahttp://statista.com/
Area covered
World
Description
Mexico is the country with the largest number of native Spanish speakers in the world. As of 2024, 132.5 million people in Mexico spoke Spanish with a native command of the language. Colombia was the nation with the second-highest number of native Spanish speakers, at around 52.7 million. Spain came in third, with 48 million, and Argentina fourth, with 46 million. Spanish, a world language As of 2023, Spanish ranked as the fourth most spoken language in the world, only behind English, Chinese, and Hindi, with over half a billion speakers. Spanish is the official language of over 20 countries, the majority on the American continent, nonetheless, it's also one of the official languages of Equatorial Guinea in Africa. Other countries have a strong influence, like the United States, Morocco, or Brazil, countries included in the list of non-Hispanic countries with the highest number of Spanish speakers. The second most spoken language in the U.S. In the most recent data, Spanish ranked as the language, other than English, with the highest number of speakers, with 12 times more speakers as the second place. Which comes to no surprise following the long history of migrations from Latin American countries to the Northern country. Moreover, only during the fiscal year 2022. 5 out of the top 10 countries of origin of naturalized people in the U.S. came from Spanish-speaking countries.
E
GlobalPhone Polish
catalogue.elra.info
live.european-language-grid.eu
Updated Jun 26, 2017
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2017). GlobalPhone Polish [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0320/
Explore at:
Dataset updated
Jun 26, 2017
Dataset provided by
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
ELRA (European Language Resources Association)
License
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdfhttps://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdfhttps://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
Description
The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).In each language about 100 sentences were read from each of the 100 speakers. The read texts were selected from national newspapers available via Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16bit, 16kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers.Data is shortened by means of the shorten program written by Tony Robinson. Alternatively, the data could be delivered unshorten.The Polish part of GlobalPhone was collected from altogether 102 native speakers in Poland, of which 48 speakers were female and 54 speakers were male. The majority of speakers are between 20 and 39 years old, the age distribution ranges from 18 to 65 years. Most of the speakers are non-smokers in good health conditions. Each speaker read on average about 100 utterances from newspaper articles, in total we recorded 10130 utterances. The speech was recorded using a close-talking microphone Sennheiser HM420 in a push-to-talk scenario. All data were recorded at 16kHz and 16bit resolution in PCM format. The data collection took place in small and large rooms, about half of the recordings took place under very quiet noise conditions, the other half with moderate background noise. Information on recording place and environmental noise conditions are provided in a separate speaker session file for each speaker. The text data used for reco...
Language Named Authority List
data.europa.eu
rdf xml, xml, zip
Updated Sep 26, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Publications Office of the European Union (2024). Language Named Authority List [Dataset]. https://data.europa.eu/data/datasets/language?locale=en
Explore at:
xml, rdf xml, zipAvailable download formats
Dataset updated
Sep 26, 2024
Dataset provided by
Publications Office of the European Unionhttp://op.europa.eu/
European Union-
Authors
Publications Office of the European Union
License
http://data.europa.eu/eli/dec/2011/833/ojhttp://data.europa.eu/eli/dec/2011/833/oj
Description
Language is a controlled vocabulary that lists world languages and language varieties, including sign languages. Its main purpose is to support activities associated with the publication process. The full set of languages contains more than 8000 language varieties, each identified by a code equivalent to the ISO 639-3 code. Concepts are aligned with the ISO 639 international standard, which is issued in several parts: ISO 639-1 contains strictly two alphabetic letters (alpha-2), ISO 639-2/B (B = bibliographic) is used for bibliographic purpose (alpha-3), ISO 639-2/T (T = terminology) is used for technical purpose (alpha-3), ISO 639-3 covers all the languages and macro-languages of the world (alpha-3); the values are compliant with ISO 639-2/T. If an authority code is needed for a language without an assigned ISO code, an alphanumeric code is created to avoid confusion with the strictly alphabetic ISO codes. Labels are provided in all 24 official EU languages for the most frequently used languages. Language is under governance of the Interinstitutional Metadata and Formats Committee (IMFC). It is maintained by the Publications Office of the European Union and disseminated on the EU Vocabularies website. It is a corporate reference data asset covered by the Corporate Reference Data Management policy of the European Commission.
Most used programming languages among developers worldwide 2024
statista.com
Updated Feb 6, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Statista (2025). Most used programming languages among developers worldwide 2024 [Dataset]. https://www.statista.com/statistics/793628/worldwide-developer-survey-most-used-languages/
Explore at:
Dataset updated
Feb 6, 2025
Dataset authored and provided by
Statistahttp://statista.com/
Time period covered
May 19, 2024 - Jun 20, 2024
Area covered
Worldwide
Description
As of 2024, JavaScript and HTML/CSS were the most commonly used programming languages among software developers around the world, with more than 62 percent of respondents stating that they used JavaScript and just around 53 percent using HTML/CSS. Python, SQL, and TypeScript rounded out the top five most widely used programming languages around the world. Programming languages At a very basic level, programming languages serve as sets of instructions that direct computers on how to behave and carry out tasks. Thanks to the increased prevalence of, and reliance on, computers and electronic devices in today’s society, these languages play a crucial role in the everyday lives of people around the world. An increasing number of people are interested in furthering their understanding of these tools through courses and bootcamps, while current developers are constantly seeking new languages and resources to learn to add to their skills. Furthermore, programming knowledge is becoming an important skill to possess within various industries throughout the business world. Job seekers with skills in Python, R, and SQL will find their knowledge to be among the most highly desirable data science skills and likely assist in their search for employment.
The most linguistically diverse countries worldwide 2025, by number of...
statista.com
Updated Apr 15, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Statista (2025). The most linguistically diverse countries worldwide 2025, by number of languages [Dataset]. https://www.statista.com/statistics/1224629/the-most-linguistically-diverse-countries-worldwide-by-number-of-languages/
Explore at:
Dataset updated
Apr 15, 2025
Dataset authored and provided by
Statistahttp://statista.com/
Time period covered
2025
Area covered
World
Description
Papua New Guinea is the most linguistically diverse country in the world. As of 2025, it was home to 840 different languages. Indonesia ranked second with 709 languages spoken. In the United States, 335 languages were spoken in that same year.
World Atlas of Language Structures
kaggle.com
Updated Sep 7, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rachael Tatman (2017). World Atlas of Language Structures [Dataset]. https://www.kaggle.com/rtatman/world-atlas-of-language-structures/discussion
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 7, 2017
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Rachael Tatman
Description
Context:

There are over 7,000 human languages in the world. The World Atlas of Language Structures (WALS) contains information on the structure of 2,679 of them. It also includes information about where languages are used. WALS is widely-cited and used in the linguistics research community.

Content:

The World Atlas of Language Structures (WALS) is a large database of structural (phonological, grammatical, lexical) properties of languages gathered from descriptive materials (such as reference grammars) by a team of 55 authors. The atlas provides information on the location, linguistic affiliation and basic typological features of a great number of the world's languages

WALS Online is a publication of the (Max Planck Institute for Evolutionary Anthropology)[http://www.eva.mpg.de/]. It is a separate publication, edited by Dryer, Matthew S. & Haspelmath, Martin (Leipzig: Max Planck Institute for Evolutionary Anthropology, 2013) The main programmer is Robert Forkel.

This dataset includes three files:

source.bib: A BibTex file with all of the sources cited in the dataset in it

language.csv: A file with a list of all the languages included in WALS

wals-data.csv: A file containing information on the features associated with each individual language

Acknowledgements:

This dataset is licensed under a Creative Commons Attribution 4.0 International License .

The World Atlas of Language Structures was edited by Matthew Dryer and Martin Haspelmath. If you use this data in your work, please include the following citation:

Dryer, Matthew S. & Haspelmath, Martin (eds.) 2013. The World Atlas of Language Structures Online. Leipzig: Max Planck Institute for Evolutionary Anthropology. (Available online at http://wals.info, Accessed on September 7, 2017.)

Inspiration:

This dataset was designed to make interactive maps of language features. Can you make an interactive map that shows different linguistic features? You might find it helpful to use Leaflet (for R) or Plotly (for Python). This blog post is a great resource to help you get started.

There’s a lot of discussion of “linguistic universals” in linguistics. These are specific features that every language (should) have. Can you identify any features that you think may be universals from this dataset?

You may also like:

Atlas of Pidgin and Creole Language Structures: Information on 76 Creole and Pidgin Languages

World Language Family Map

The Sign Language Analyses (SLAY) Database
f
Non-English language corpus.
plos.figshare.com
xls
Updated Jun 2, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rimsha Muzaffar; Syed Yasser Arafat; Junaid Rashid; Jungeun Kim; Usman Naseem (2025). Non-English language corpus. [Dataset]. http://doi.org/10.1371/journal.pone.0320701.t003
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0320701.t003
Dataset updated
Jun 2, 2025
Dataset provided by
PLOS ONE
Authors
Rimsha Muzaffar; Syed Yasser Arafat; Junaid Rashid; Jungeun Kim; Usman Naseem
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Advancements in deep learning have revolutionized numerous real-world applications, including image recognition, visual question answering, and image captioning. Among these, image captioning has emerged as a critical area of research, with substantial progress achieved in Arabic, Chinese, Uyghur, Hindi, and predominantly English. However, despite Urdu being a morphologically rich and widely spoken language, research in Urdu image captioning remains underexplored due to a lack of resources. This study creates a new Urdu Image Captioning Dataset (UCID) called UC-23-RY to fill in the gaps in Urdu image captioning. The Flickr30k dataset inspired the 159,816 Urdu captions in the dataset. Additionally, it suggests deep learning architectures designed especially for Urdu image captioning, including NASNetLarge-LSTM and ResNet-50-LSTM. The NASNetLarge-LSTM and ResNet-50-LSTM models achieved notable BLEU-1 scores of 0.86 and 0.84 respectively, as demonstrated through evaluation in this study accessing the model’s impact on caption quality. Additionally, it provides useful datasets and shows how well-suited sophisticated deep learning models are for improving automatic Urdu image captioning.
E
GlobalPhone Japanese
catalogue.elra.info
live.european-language-grid.eu
Updated Jun 26, 2017
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
GlobalPhone Japanese [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0199/
Explore at:
Dataset updated
Jun 26, 2017
Dataset provided by
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
ELRA (European Language Resources Association)
License
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdfhttps://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdfhttps://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
Description
The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).In each language about 100 sentences were read from each of the 100 speakers. The read texts were selected from national newspapers available via Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16bit, 16kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers.Data is shortened by means of the shorten program written by Tony Robinson. Alternatively, the data could be delivered unshorten.The Japanese corpus was produced using the Nikkei Shinbun newspaper. It contains recordings of 149 speakers (104 males, 44 females, 1 unspecified) recorded in Tokyo, Japan. The following age distribution has been obtained: 22 speakers are below 19, 90 speakers are between 20 and 29, 5 speakers are between 30 and 39, 2 speakers are between 40 and 49, and 1 speaker is over 50 (28 speakers age is unknown).
E
GlobalPhone Portuguese (Brazilian)
live.european-language-grid.eu
catalogue.elra.info
audio format
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
GlobalPhone Portuguese (Brazilian) [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1912
Explore at:
audio formatAvailable download formats
License
http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdfhttp://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdfhttp://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
Area covered
Brazil
Description
The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks.

The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).

In each language about 100 sentences were read from each of the 100 speakers. The read texts were selected from national newspapers available via Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16bit, 16kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers.

Data is shortened by means of the shorten program written by Tony Robinson. Alternatively, the data could be delivered unshorten.

The Portuguese (Brazilian) corpus was produced using the Folha de Sao Paulo newspaper. It contains recordings of 102 speakers (54 males, 48 females) recorded in Porto Velho and Sao Paulo, Brazil. The following age distribution has been obtained: 6 speakers are below 19, 58 speakers are between 20 and 29, 27 speakers are between 30 and 39, 5 speakers are between 40 and 49, and 5 speakers are over 50 (1 speaker age is unknown).
h
XLingHealth
huggingface.co
Updated Feb 7, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Georgia Tech CLAWS Lab (2024). XLingHealth [Dataset]. https://huggingface.co/datasets/claws-lab/XLingHealth
Explore at:
Dataset updated
Feb 7, 2024
Dataset authored and provided by
Georgia Tech CLAWS Lab
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset Card for "XLingHealth"

XLingHealth is a Cross-Lingual Healthcare benchmark for clinical health inquiry that features the top four most spoken languages in the world: English, Spanish, Chinese, and Hindi.

Statistics

Dataset

Examples

Words (Q)

Words (A)

HealthQA 1,134 7.72 ± 2.41 242.85 ± 221.88

LiveQA 246 41.76 ± 37.38 115.25 ± 112.75

MedicationQA 690 6.86 ± 2.83 61.50 ± 69.44

Words (Q) and #Words (A) represent the average number of words… See the full description on the dataset page: https://huggingface.co/datasets/claws-lab/XLingHealth.
P
JamPatoisNLI Dataset
paperswithcode.com
Updated Dec 6, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2022). JamPatoisNLI Dataset [Dataset]. https://paperswithcode.com/dataset/jampatoisnli
Explore at:
Dataset updated
Dec 6, 2022
Description
JamPatoisNLI provides the first dataset for natural language inference in a creole language, Jamaican Patois. Many of the most-spoken low-resource languages are creoles. These languages commonly have a lexicon derived from a major world language and a distinctive grammar reflecting the languages of the original speakers and the process of language birth by creolization. This gives them a distinctive place in exploring the effectiveness of transfer from large monolingual or multilingual pretrained models. While our work, along with previous work, shows that transfer from these models to low-resource languages that are unrelated to languages in their training set is not very effective, we would expect stronger results from transfer to creoles. Indeed, our experiments show considerably better results from few-shot learning of JamPatoisNLI than for such unrelated languages, and help us begin to understand how the unique relationship between creoles and their high-resource base languages affect cross-lingual transfer. JamPatoisNLI, which consists of naturally-occurring premises and expert-written hypotheses, is a step towards steering research into a traditionally underserved language and a useful benchmark for understanding cross-lingual NLP.
f
Manual Inspection and Correction.
plos.figshare.com
xls
Updated Jun 2, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rimsha Muzaffar; Syed Yasser Arafat; Junaid Rashid; Jungeun Kim; Usman Naseem (2025). Manual Inspection and Correction. [Dataset]. http://doi.org/10.1371/journal.pone.0320701.t004
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0320701.t004
Dataset updated
Jun 2, 2025
Dataset provided by
PLOS ONE
Authors
Rimsha Muzaffar; Syed Yasser Arafat; Junaid Rashid; Jungeun Kim; Usman Naseem
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Advancements in deep learning have revolutionized numerous real-world applications, including image recognition, visual question answering, and image captioning. Among these, image captioning has emerged as a critical area of research, with substantial progress achieved in Arabic, Chinese, Uyghur, Hindi, and predominantly English. However, despite Urdu being a morphologically rich and widely spoken language, research in Urdu image captioning remains underexplored due to a lack of resources. This study creates a new Urdu Image Captioning Dataset (UCID) called UC-23-RY to fill in the gaps in Urdu image captioning. The Flickr30k dataset inspired the 159,816 Urdu captions in the dataset. Additionally, it suggests deep learning architectures designed especially for Urdu image captioning, including NASNetLarge-LSTM and ResNet-50-LSTM. The NASNetLarge-LSTM and ResNet-50-LSTM models achieved notable BLEU-1 scores of 0.86 and 0.84 respectively, as demonstrated through evaluation in this study accessing the model’s impact on caption quality. Additionally, it provides useful datasets and shows how well-suited sophisticated deep learning models are for improving automatic Urdu image captioning.
S
Global Mandarin Learning Market Overview and Outlook 2025-2032
statsndata.org
excel, pdf
Updated May 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Stats N Data (2025). Global Mandarin Learning Market Overview and Outlook 2025-2032 [Dataset]. https://www.statsndata.org/report/mandarin-learning-market-178140
Explore at:
pdf, excelAvailable download formats
Dataset updated
May 2025
Dataset authored and provided by
Stats N Data
License
https://www.statsndata.org/how-to-orderhttps://www.statsndata.org/how-to-order
Area covered
Global
Description
The Mandarin Learning market has seen significant growth over the past decade, reflecting the increasing global interest in China as a major economic powerhouse and cultural influencer. With Mandarin being the most spoken language in the world, the demand for Mandarin language education has spiked, creating a divers
j
Japan Centre of Excellence (JACEEX)
jaceex.com
html
Updated Jul 16, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Japan Centre of Excellence (JACEEX) (2019). Japan Centre of Excellence (JACEEX) [Dataset]. https://www.jaceex.com/ssw
Explore at:
htmlAvailable download formats
Dataset updated
Jul 16, 2019
Dataset provided by
https://www.jaceex.com/
Authors
Japan Centre of Excellence (JACEEX)
Area covered
Description
Japan Centre of Excellence (JACEEX), is a brand under Jaceex Ventures LLP. Jaceex has been formed with a vision to create a world class workforce with skill sets, work and business ethics, sincerity and devotion as well as other great positive traits found in the Japanese workforce which has been responsible for having built world class Enterprises. For the Indian Students and youths stepping into this world, our objective is to provide life changing opportunity in the form of skill and work in Japan Japan Centre of Excellence (JACEEX) provides an integrated course schedule of learning through exploration, scrutiny and self reflection. We are offering Japanese Language and Culture training-Basic, Intermediate and High Levels. Our training is designed to make the trainee eligible to certify themselves with the globally recognised Japanese Language Proficiency Test (JLPT) Examination . This will help in building careers with Japanese companies in Japan , in India and also self employment.We also have the facility of Virtual Live class platform
a
Nigeria Language Areas
ebola-nga.opendata.arcgis.com
Updated Dec 5, 2014
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Geospatial-Intelligence Agency (2014). Nigeria Language Areas [Dataset]. https://ebola-nga.opendata.arcgis.com/content/a8562de38b814219b331c7d49cc87ff4
Explore at:
Dataset updated
Dec 5, 2014
Dataset authored and provided by
National Geospatial-Intelligence Agency
Area covered

Description
There are over 500 known languages in Nigeria. While the official language is English, its use is largely confined to urban elites. The most commonly used languages are Hausa, Yoruba, Igbo (Ibo) and Fulfulde. Edo, Efik, Adamawa Fulfulde, Idoma, and Central Kanuri are also widely spoken. The area of greatest diversity is the ‘Middle Belt’, the band of territory stretching across the country between the large language blocs of the north and the south. The reason for this diversity remains unclear, but three of Africa's four language families meet in the Middle Belt of Nigeria. This has had sociolinguistic consequences where frequent conflicts have erupted between the culture and language of particular groups.

ISO3 - International Organization for Standardization 3-digit country code

LANG_FAM - Language family

LANG_SUBGP - Language sub-family

SOURCE_DT - Primary source creation date

SOURCE - Primary source

Collection

This shapefile created by using Anthromapper consists of language layers that have been based on The World Language Mapping System (WLMS). Geographical terrain features, combined with a watershed model, were also used to predict the likely extent of ethnic and linguistic influence. The HGIS and metadata were supplemented with anthropological information from peer-reviewed journals and published books. The interpretation of names often produces multiple spellings of the same language; therefore similarly spelled or phonetic titles may be referencing the same language group.

The data included herein have not been derived from a registered survey and should be considered approximate unless otherwise defined. While rigorous steps have been taken to ensure the quality of each dataset, DigitalGlobe Analytics is not responsible for the accuracy and completeness of data compiled from outside sources.

Sources (HGIS)

Anthromapper. DigitalGlobe Analytics, April 2013.

World Language Mapping System (WLMS) Version 16. World GeoDatasets, April 2013.

Sources (Metadata)

Roger, Blench. "Position Paper: The Dimensions of Ethnicity, Language, and Culture in Nigeria." Last modified 2013. Accessed March 26, 2013. http://www.rogerblench.info.

Roger, Blench. “The Status of the Languages of Central Nigeria.” Last modified 2013. Accessed March 26, 2013. http://www.rogerblench.info.
f
Parameter of Resnet-50 and NASNetLarge.
plos.figshare.com
xls
Updated Jun 2, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rimsha Muzaffar; Syed Yasser Arafat; Junaid Rashid; Jungeun Kim; Usman Naseem (2025). Parameter of Resnet-50 and NASNetLarge. [Dataset]. http://doi.org/10.1371/journal.pone.0320701.t011
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0320701.t011
Dataset updated
Jun 2, 2025
Dataset provided by
PLOS ONE
Authors
Rimsha Muzaffar; Syed Yasser Arafat; Junaid Rashid; Jungeun Kim; Usman Naseem
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Advancements in deep learning have revolutionized numerous real-world applications, including image recognition, visual question answering, and image captioning. Among these, image captioning has emerged as a critical area of research, with substantial progress achieved in Arabic, Chinese, Uyghur, Hindi, and predominantly English. However, despite Urdu being a morphologically rich and widely spoken language, research in Urdu image captioning remains underexplored due to a lack of resources. This study creates a new Urdu Image Captioning Dataset (UCID) called UC-23-RY to fill in the gaps in Urdu image captioning. The Flickr30k dataset inspired the 159,816 Urdu captions in the dataset. Additionally, it suggests deep learning architectures designed especially for Urdu image captioning, including NASNetLarge-LSTM and ResNet-50-LSTM. The NASNetLarge-LSTM and ResNet-50-LSTM models achieved notable BLEU-1 scores of 0.86 and 0.84 respectively, as demonstrated through evaluation in this study accessing the model’s impact on caption quality. Additionally, it provides useful datasets and shows how well-suited sophisticated deep learning models are for improving automatic Urdu image captioning.

Facebook

Twitter

Click to copy link

Link copied

Cite

Statista (2025). The most spoken languages worldwide 2025 [Dataset]. https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/

The most spoken languages worldwide 2025

Explore at:

429 scholarly articles cite this dataset (View in Google Scholar)

Dataset updated

Apr 14, 2025

Dataset authored and provided by

Statistahttp://statista.com/

Time period covered

2025

Area covered

World

Description

In 2025, there were around 1.53 billion people worldwide who spoke English either natively or as a second language, slightly more than the 1.18 billion Mandarin Chinese speakers at the time of survey. Hindi and Spanish accounted for the third and fourth most widespread languages that year. Languages in the United States The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken in the United States vary as a result of the multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which over than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese),1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year. Different languages at home The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage of population speaking a language other than English is California. About 45 percent of its population was speaking a language other than English at home in 2023.

Clear search

Close search

Google apps

Main menu

The most spoken languages worldwide 2025

Ranking of languages spoken at home in the U.S. 2023

Spoken Language Statistics

Common languages used for web content 2025, by share of websites

Number of native Spanish speakers worldwide 2024, by country

GlobalPhone Polish

Language Named Authority List

Most used programming languages among developers worldwide 2024

The most linguistically diverse countries worldwide 2025, by number of...

World Atlas of Language Structures

Context:

Content:

Acknowledgements:

Inspiration:

You may also like:

Non-English language corpus.

GlobalPhone Japanese

GlobalPhone Portuguese (Brazilian)

XLingHealth

Examples

Words (Q)

Words (A)

Words (Q) and #Words (A) represent the average number of words… See the full description on the dataset page: https://huggingface.co/datasets/claws-lab/XLingHealth.

JamPatoisNLI Dataset

Manual Inspection and Correction.

Global Mandarin Learning Market Overview and Outlook 2025-2032

Japan Centre of Excellence (JACEEX)

Nigeria Language Areas

Parameter of Resnet-50 and NASNetLarge.

The most spoken languages worldwide 2025