12 datasets found

The most spoken languages worldwide 2023
statista.com
Updated Jan 23, 2025
Cite
Statista (2025). The most spoken languages worldwide 2023 [Dataset]. https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/
Dataset updated
Jan 23, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
2022
Area covered
World
Description
In 2023, there were around 1.5 billion people worldwide who spoke English either natively or as a second language, slightly more than the 1.1 billion Mandarin Chinese speakers at the time of survey. Hindi and Spanish accounted for the third and fourth most widespread languages that year.

Languages in the United States
The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation and other official pronouncements. The United States is a country of immigrants, and the languages spoken there vary as a result of its multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which over 41 million people spoke at home in 2021. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese), 1.7 million Tagalog speakers and 1.5 million Vietnamese speakers counted in the United States that year.

Different languages at home
The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage is California, where about 44 percent of the population spoke a language other than English at home in 2021.
GlobalPhone German
catalogue.elra.info
Updated Jun 26, 2017
Cite
ELRA (European Language Resources Association) (2017). GlobalPhone German [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0198/
Dataset updated
Jun 26, 2017
Dataset provided by
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
ELRA (European Language Resources Association)
License
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
Description
The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks.

The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).

In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available on the Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers.

The data is compressed with the shorten program written by Tony Robinson; alternatively, it can be delivered uncompressed.

The German corpus was produced using the Frankfurter Allgemeine and Sueddeutsche Zeitung newspapers. It contains recordings of 77 speakers (70 males, 7 females) recorded in Karlsruhe, Germany. No age distribution is available.
jampatoisnli
huggingface.co
Updated Jul 21, 2023
Cite
Ruth-Ann Armstrong (2023). jampatoisnli [Dataset]. https://huggingface.co/datasets/Ruth-Ann/jampatoisnli
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Jul 21, 2023
Authors
Ruth-Ann Armstrong
License
https://choosealicense.com/licenses/other/
Description
Dataset Card for [Dataset Name]

Dataset Summary

JamPatoisNLI provides the first dataset for natural language inference in a creole language, Jamaican Patois. Many of the most-spoken low-resource languages are creoles. These languages commonly have a lexicon derived from a major world language and a distinctive grammar reflecting the languages of the original speakers and the process of language birth by creolization. This gives them a distinctive place in… See the full description on the dataset page: https://huggingface.co/datasets/Ruth-Ann/jampatoisnli.
GlobalPhone Portuguese (Brazilian)
live.european-language-grid.eu
catalog.elra.info
audio format
Cite
GlobalPhone Portuguese (Brazilian) [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1912
Explore at:
Available download formats: audio format
License
http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
Area covered
Brazil
Description
The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks.

The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).

In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available on the Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers.

The data is compressed with the shorten program written by Tony Robinson; alternatively, it can be delivered uncompressed.
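
For a rough sense of scale, the uncompressed size of the full corpus can be estimated from the stated audio format (16-bit, 16 kHz, mono) and the stated total of over 450 hours. The sketch below is a back-of-the-envelope calculation only: the function name is illustrative, file headers are ignored, and the result is a lower bound derived from the description, not a figure published by ELRA.

```python
def raw_pcm_bytes(hours: float, sample_rate_hz: int = 16_000,
                  bits_per_sample: int = 16, channels: int = 1) -> int:
    """Size in bytes of uncompressed PCM audio (container headers ignored)."""
    bytes_per_second = sample_rate_hz * (bits_per_sample // 8) * channels
    return int(hours * 3600 * bytes_per_second)


# 450 hours is the lower bound quoted in the description ("over 450 hours").
total = raw_pcm_bytes(450)
print(f"{total / 1e9:.2f} GB")  # 51.84 GB
```

One hour of audio in this format is about 115 MB before compression, which helps explain why the data is distributed in compressed form by default.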

The Portuguese (Brazilian) corpus was produced using the Folha de Sao Paulo newspaper. It contains recordings of 102 speakers (54 males, 48 females) recorded in Porto Velho and Sao Paulo, Brazil. The following age distribution has been obtained: 6 speakers are below 19, 58 speakers are between 20 and 29, 27 speakers are between 30 and 39, 5 speakers are between 40 and 49, and 5 speakers are over 50 (1 speaker age is unknown).
Dataset for: "Big data suggest strong constraints of linguistic similarity...
zenodo.org
data.niaid.nih.gov
csv
Updated Jan 24, 2020
Cite
Job Schepens; Roeland van Hout; T. Florian Jaeger (2020). Dataset for: "Big data suggest strong constraints of linguistic similarity on adult language learning" [Dataset]. http://doi.org/10.5281/zenodo.2863533
Explore at:
Available download formats: csv
Unique identifier
https://doi.org/10.5281/zenodo.2863533
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Job Schepens; Roeland van Hout; T. Florian Jaeger
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is adapted from raw data with fully anonymized results on the State Examination of Dutch as a Second Language. This exam is officially administered by the Board of Tests and Examinations (College voor Toetsen en Examens, or CvTE). See cvte.nl/about-cvte. The Board of Tests and Examinations is mandated by the Dutch government.

The article accompanying the dataset:

Schepens, Job, Roeland van Hout, and T. Florian Jaeger. “Big Data Suggest Strong Constraints of Linguistic Similarity on Adult Language Learning.” Cognition 194 (January 1, 2020): 104056. https://doi.org/10.1016/j.cognition.2019.104056.

Every row in the dataset represents the first official testing score of a unique learner.
The columns contain the following information as based on questionnaires filled in at the time of the exam:

"L1" - The first language of the learner
"C" - The country of birth
"L1L2" - The combination of first and best additional language besides Dutch
"L2" - The best additional language besides Dutch
"AaA" - Age at Arrival in the Netherlands in years (starting date of residence)
"LoR" - Length of residence in the Netherlands in years
"Edu.day" - Duration of daily education (1 low, 2 middle, 3 high, 4 very high). From 1992 until 2006, learners' education has been measured by means of a side-by-side matrix question in a learner's questionnaire. Learners were asked to mark which type of education they have had (elementary, secondary, or tertiary schooling) by means of filling in for how many years they have been enrolled, in which country, and whether or not they have graduated. Based on this information we were able to estimate how many years learners have had education on a daily basis from six years of age onwards. Since 2006, the question about learners' education has been altered and it is asked directly how many years learners have had formal education on a daily basis from six years of age onwards. Possible answering categories are: 1) 0 thru 5 years; 2) 6 thru 10 years; 3) 11 thru 15 years; 4) 16 years or more. The answers have been merged into the categorical answer.
"Sex" - Gender
"Family" - Language Family
"ISO639.3" - Language ID code according to Ethnologue
"Enroll" - Proportion of school-aged youth enrolled in secondary education according to the World Bank. The World Bank reports on education data in a wide number of countries around the world on a regular basis. We took the gross enrollment rate in secondary schooling per country in the year the learner has arrived in the Netherlands as an indicator for a country's educational accessibility at the time learners have left their country of origin.
"STEX_speaking_score" - The STEX test score for speaking proficiency.
"Dissimilarity_morphological" - Morphological similarity
"Dissimilarity_lexical" - Lexical similarity
"Dissimilarity_phonological_new_features" - Phonological similarity (in terms of new features)
"Dissimilarity_phonological_new_categories" - Phonological similarity (in terms of new sounds)
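
The post-2006 answer categories for "Edu.day" described above amount to a simple binning of years of daily education. The following is a minimal sketch of that binning only; the function name is hypothetical, and it does not reproduce how the dataset authors estimated years of education from the pre-2006 matrix question.

```python
def edu_day_category(years: int) -> int:
    """Map years of daily education (from age six) to the 1-4 "Edu.day" code."""
    if years < 0:
        raise ValueError("years of education cannot be negative")
    if years <= 5:
        return 1  # 0 thru 5 years
    if years <= 10:
        return 2  # 6 thru 10 years
    if years <= 15:
        return 3  # 11 thru 15 years
    return 4      # 16 years or more


print([edu_day_category(y) for y in (0, 5, 6, 10, 11, 15, 16, 20)])
# [1, 1, 2, 2, 3, 3, 4, 4]
```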

A few rows of the data:

"L1","C","L1L2","L2","AaA","LoR","Edu.day","Sex","Family","ISO639.3","Enroll","STEX_speaking_score","Dissimilarity_morphological","Dissimilarity_lexical","Dissimilarity_phonological_new_features","Dissimilarity_phonological_new_categories"
"English","UnitedStates","EnglishMonolingual","Monolingual",34,0,4,"Female","Indo-European","eng ",94,541,0.0094,0.083191,11,19
"English","UnitedStates","EnglishGerman","German",25,16,3,"Female","Indo-European","eng ",94,603,0.0094,0.083191,11,19
"English","UnitedStates","EnglishFrench","French",32,3,4,"Male","Indo-European","eng ",94,562,0.0094,0.083191,11,19
"English","UnitedStates","EnglishSpanish","Spanish",27,8,4,"Male","Indo-European","eng ",94,537,0.0094,0.083191,11,19
"English","UnitedStates","EnglishMonolingual","Monolingual",47,5,3,"Male","Indo-European","eng ",94,505,0.0094,0.083191,11,19
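
The sample rows above can be read with Python's standard `csv` module. The sketch below is illustrative: it embeds the header and the first two sample rows rather than the real file, the helper name `load_rows` is hypothetical, and the column types are inferred from the sample rows only (the full dataset may contain missing or non-numeric values). Note that the "ISO639.3" values in the sample carry a trailing space ("eng "), which the sketch strips.

```python
import csv
import io

# Header and first two sample rows, copied from the listing above.
SAMPLE = '''\
"L1","C","L1L2","L2","AaA","LoR","Edu.day","Sex","Family","ISO639.3","Enroll","STEX_speaking_score","Dissimilarity_morphological","Dissimilarity_lexical","Dissimilarity_phonological_new_features","Dissimilarity_phonological_new_categories"
"English","UnitedStates","EnglishMonolingual","Monolingual",34,0,4,"Female","Indo-European","eng ",94,541,0.0094,0.083191,11,19
"English","UnitedStates","EnglishGerman","German",25,16,3,"Female","Indo-European","eng ",94,603,0.0094,0.083191,11,19
'''

# Assumed types, based only on the sample rows.
INT_COLUMNS = ("AaA", "LoR", "Edu.day", "STEX_speaking_score",
               "Dissimilarity_phonological_new_features",
               "Dissimilarity_phonological_new_categories")
FLOAT_COLUMNS = ("Enroll", "Dissimilarity_morphological", "Dissimilarity_lexical")


def load_rows(text):
    """Parse CSV text into a list of dicts with numeric columns converted."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        record["ISO639.3"] = record["ISO639.3"].strip()  # "eng " -> "eng"
        for name in INT_COLUMNS:
            record[name] = int(record[name])
        for name in FLOAT_COLUMNS:
            record[name] = float(record[name])
        rows.append(record)
    return rows


rows = load_rows(SAMPLE)
print(rows[0]["ISO639.3"], rows[1]["STEX_speaking_score"])  # eng 603
```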

IndQNER: Indonesian Benchmark Dataset from the Indonesian Translation of the...

zenodo.org
data.niaid.nih.gov

txt

Updated Jan 27, 2024

Cite

Ria Hari Gusmita; Asep Fajar Firmansyah (2024). IndQNER: Indonesian Benchmark Dataset from the Indonesian Translation of the Quran [Dataset]. http://doi.org/10.5281/zenodo.7454892

Explore at:

Available download formats: txt

Unique identifier

https://doi.org/10.5281/zenodo.7454892

Dataset updated

Jan 27, 2024

Dataset provided by

Zenodo (http://zenodo.org/)

Authors

Ria Hari Gusmita; Asep Fajar Firmansyah

License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

IndQNER

IndQNER is a Named Entity Recognition (NER) benchmark dataset that was created by manually annotating 8 chapters in the Indonesian translation of the Quran. The annotation was performed using a web-based text annotation tool, Tagtog, and the BIO (Beginning-Inside-Outside) tagging format. The dataset contains:

3117 sentences
62027 tokens
2475 named entities
18 named entity categories
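
To illustrate the BIO (Beginning-Inside-Outside) tagging format mentioned above, the sketch below groups token-level tags into entity spans. It is a generic decoder, not the authors' tooling: the function name, the example sentence, and the exact label strings (such as `B-Prophet`) are assumptions made for illustration and are not taken from the dataset files.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (label, text) entity spans.

    An I- tag that does not continue an open span with the same label is
    treated as outside; this is one simple convention among several.
    """
    spans = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_label is not None:
                spans.append((current_label, " ".join(current_tokens)))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:
            if current_label is not None:
                spans.append((current_label, " ".join(current_tokens)))
            current_tokens, current_label = [], None
    if current_label is not None:
        spans.append((current_label, " ".join(current_tokens)))
    return spans


# Hypothetical example sentence and labels, for illustration only.
tokens = ["Musa", "pergi", "ke", "Bukit", "Sinai"]
tags = ["B-Prophet", "O", "O", "B-GeographicalLocation", "I-GeographicalLocation"]
print(bio_to_spans(tokens, tags))
# [('Prophet', 'Musa'), ('GeographicalLocation', 'Bukit Sinai')]
```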

Named Entity Classes

The named entity classes were initially defined by analyzing the existing Quran concepts ontology. The initial classes were updated based on the information acquired during the annotation process. Finally, there are 20 classes, as follows:

Allah
Allah's Throne
Artifact
Astronomical body
Event
False deity
Holy book
Language
Angel
Person
Messenger
Prophet
Sentient
Afterlife location
Geographical location
Color
Religion
Food
Fruit
The book of Allah

Annotation Stage

There were eight annotators who contributed to the annotation process. They were informatics engineering students at the State Islamic University Syarif Hidayatullah Jakarta.

Anggita Maharani Gumay Putri
Muhammad Destamal Junas
Naufaldi Hafidhigbal
Nur Kholis Azzam Ubaidillah
Puspitasari
Septiany Nur Anggita
Wilda Nurjannah
William Santoso

Verification Stage

We found many named entity and class candidates during the annotation stage. To verify the candidates, we consulted Quran and Tafseer (content) experts who are lecturers at Quran and Tafseer Department at the State Islamic University Syarif Hidayatullah Jakarta.

Dr. Eva Nugraha, M.Ag.
Dr. Jauhar Azizy, MA
Dr. Lilik Ummi Kultsum, MA

Evaluation

We evaluated the annotation quality of IndQNER by performing experiments in two settings: supervised learning (BiLSTM+CRF) and transfer learning (IndoBERT fine-tuning).

Supervised Learning Setting

The implementation of BiLSTM and CRF utilized IndoBERT to provide word embeddings. All experiments used a batch size of 16. These are the results:

Maximum sequence length	Number of epochs	Precision	Recall	F1 score
256	10	0.94	0.92	0.93
256	20	0.99	0.97	0.98
256	40	0.96	0.96	0.96
256	100	0.97	0.96	0.96
512	10	0.92	0.92	0.92
512	20	0.96	0.95	0.96
512	40	0.97	0.95	0.96
512	100	0.97	0.95	0.96

Transfer Learning Setting

We performed several experiments with different parameters in IndoBERT fine-tuning. All experiments used a learning rate of 2e-5 and a batch size of 16. These are the results:

Maximum sequence length	Number of epochs	Precision	Recall	F1 score
256	10	0.67	0.65	0.65
256	20	0.60	0.59	0.59
256	40	0.75	0.72	0.71
256	100	0.73	0.68	0.68
512	10	0.72	0.62	0.64
512	20	0.62	0.57	0.58
512	40	0.72	0.66	0.67
512	100	0.68	0.68	0.67

This dataset is also part of the NusaCrowd project which aims to collect Natural Language Processing (NLP) datasets for Indonesian and its local languages.

How to Cite

@InProceedings{10.1007/978-3-031-35320-8_12,
author="Gusmita, Ria Hari
and Firmansyah, Asep Fajar
and Moussallem, Diego
and Ngonga Ngomo, Axel-Cyrille",
editor="M{\'e}tais, Elisabeth
and Meziane, Farid
and Sugumaran, Vijayan
and Manning, Warren
and Reiff-Marganiec, Stephan",
title="IndQNER: Named Entity Recognition Benchmark Dataset from the Indonesian Translation of the Quran",
booktitle="Natural Language Processing and Information Systems",
year="2023",
publisher="Springer Nature Switzerland",
address="Cham",
pages="170--185",
abstract="Indonesian is classified as underrepresented in the Natural Language Processing (NLP) field, despite being the tenth most spoken language in the world with 198 million speakers. The paucity of datasets is recognized as the main reason for the slow advancements in NLP research for underrepresented languages. Significant attempts were made in 2020 to address this drawback for Indonesian. The Indonesian Natural Language Understanding (IndoNLU) benchmark was introduced alongside IndoBERT pre-trained language model. The second benchmark, Indonesian Language Evaluation Montage (IndoLEM), was presented in the same year. These benchmarks support several tasks, including Named Entity Recognition (NER). However, all NER datasets are in the public domain and do not contain domain-specific datasets. To alleviate this drawback, we introduce IndQNER, a manually annotated NER benchmark dataset in the religious domain that adheres to a meticulously designed annotation guideline. Since Indonesia has the world's largest Muslim population, we build the dataset from the Indonesian translation of the Quran. The dataset includes 2475 named entities representing 18 different classes. To assess the annotation quality of IndQNER, we perform experiments with BiLSTM and CRF-based NER, as well as IndoBERT fine-tuning. The results reveal that the first model outperforms the second model achieving 0.98 F1 points. This outcome indicates that IndQNER may be an acceptable evaluation metric for Indonesian NER tasks in the aforementioned domain, widening the research's domain range.",
isbn="978-3-031-35320-8"
}

Contact

If you have any questions or feedback, feel free to contact us at ria.hari.gusmita@uni-paderborn.de or ria.gusmita@uinjkt.ac.id

GlobalPhone Chinese-Shanghai
catalog.elra.info
live.european-language-grid.eu
Updated Jun 26, 2017
Cite
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2017). GlobalPhone Chinese-Shanghai [Dataset]. https://catalog.elra.info/en-us/repository/browse/ELRA-S0194/
Dataset updated
Jun 26, 2017
Dataset provided by
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
ELRA (European Language Resources Association)
License
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
Area covered
Shanghai
Description
The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks.

The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).

In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available on the Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers.

The data is compressed with the shorten program written by Tony Robinson; alternatively, it can be delivered uncompressed.

The Chinese-Shanghai corpus was produced using the People's Daily newspaper. It contains recordings of 41 speakers (16 males, 25 females) recorded in Shanghai, China. The following age distribution has been obtained: 1 speaker is below 19, 2 speakers are between 20 and 29, 13 speakers are between 30 and 39, 14 speakers are between 40 and 49, and 11 speakers are over 50.
HindiMathQuest - Math Problems & Reasoning
kaggle.com
Updated Oct 14, 2024
Cite
Dnyanesh Walwadkar (2024). HindiMathQuest - Math Problems & Reasoning [Dataset]. http://doi.org/10.34740/kaggle/ds/5832290
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Unique identifier
https://doi.org/10.34740/kaggle/ds/5832290
Dataset updated
Oct 14, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Dnyanesh Walwadkar
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Overview:

The Hindi Mathematics Reasoning and Problem-Solving Dataset is designed to advance the capabilities of language models in understanding and solving mathematical problems presented in the Hindi language. The dataset covers a comprehensive range of question types, including logical reasoning, numeric calculations, translation-based problems, and complex mathematical tasks typically seen in competitive exams. This dataset is intended to fill a critical gap by focusing on numeric reasoning and mathematical logic in Hindi, offering high-quality prompts that challenge models to handle both linguistic and mathematical complexity in one of the world’s most widely spoken languages.

Key Features:

- **Diverse Range of Mathematical Problems**: The dataset includes questions from areas such as arithmetic, algebra, geometry, physics, and number theory, all expressed in Hindi.

- **Logical and Reasoning Tasks**: Includes logic-based problems requiring pattern recognition, deduction, and reasoning, often seen in competitive exams like IIT JEE, GATE, and GRE.

- **Complex Numerical Calculations in Hindi**: Numeric expressions and their handling in Hindi text, a common challenge for language models, are a major focus of this dataset. Questions require models to accurately interpret and solve mathematical problems where numbers are written in Hindi words (e.g., "पचासी हजार सात सौ नवासी" for 85789).

- **Real-World Application Scenarios**: Paragraph-based problems, puzzles, and word problems that mirror real-world scenarios and test both language comprehension and problem-solving capabilities.

- **Culturally Relevant Questions**: Carefully curated questions that avoid regional or social biases, ensuring that the dataset accurately reflects the linguistic and cultural nuances of Hindi-speaking regions.
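
To make the number-word example above concrete, the sketch below converts "पचासी हजार सात सौ नवासी" to 85789. It is a toy illustration, not part of the dataset: the function name is hypothetical, and the lexicon deliberately covers only the five words in that example, so any other word raises an error.

```python
# Toy lexicon: only the words needed for the example in the description.
UNITS = {"सात": 7, "पचासी": 85, "नवासी": 89}
HUNDRED = "सौ"
MULTIPLIERS = {"हजार": 1_000}


def hindi_words_to_int(text: str) -> int:
    """Convert a space-separated Hindi number phrase to an integer."""
    total, current = 0, 0
    for word in text.split():
        if word in UNITS:
            current += UNITS[word]
        elif word == HUNDRED:
            # A bare "सौ" with no preceding count means one hundred.
            current = max(current, 1) * 100
        elif word in MULTIPLIERS:
            total += max(current, 1) * MULTIPLIERS[word]
            current = 0
        else:
            raise ValueError(f"word not in toy lexicon: {word!r}")
    return total + current


print(hindi_words_to_int("पचासी हजार सात सौ नवासी"))  # 85789
```

A realistic converter would need the full set of Hindi numerals from 1 to 99 (which are largely irregular) plus larger multipliers such as lakh and crore; the accumulator structure stays the same.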

Dataset Breakdown:

- **Logical and Reasoning-based Questions**: Questions testing pattern recognition, deduction, and logical reasoning, often seen in IQ tests and competitive exams.

- **Calculation-based Problems**: Includes numeric operations such as addition, subtraction, multiplication, and division, presented in Hindi text.

- **Translation-based Mathematical Problems**: Questions that involve translating between numeric expressions and Hindi word forms, enhancing model understanding of Hindi numerals.

- **Competitive Exam-style Questions**: Sourced and inspired by advanced reasoning and problem-solving questions from exams like GATE, IIT JEE, and GRE, providing high-level challenge.

- **Series and Sequence Questions**: Number series, progressions, and pattern recognition problems, essential for logical reasoning tasks.

- **Paragraph-based Word Problems**: Real-world math problems described in multiple sentences of Hindi text, requiring deeper language comprehension and reasoning.

- **Geometry and Trigonometry**: Includes geometry-based problems using Hindi terminology for angles, shapes, and measurements.

- **Physics-based Problems**: Mathematical problems based on physics concepts like mechanics, thermodynamics, and electricity, all expressed in Hindi.

- **Graph and Data Interpretation**: Interpretation of graphs and data in Hindi, testing both visual and mathematical understanding.

- **Olympiad-style Questions**: Advanced math problems, similar to those found in math Olympiads, designed to test high-level reasoning and problem-solving skills.

Preprocessing and Quality Control:

- **Human Verification**: Over 30% of the dataset has been manually reviewed and verified by native Hindi speakers. Additionally, a random sample of English-to-Hindi translated prompts showed a 100% success rate in translation quality, further boosting confidence in the overall quality of the dataset.

- **Dataset Curation**: The dataset was generated using a combination of human-curated questions, AI-assisted translations from existing English datasets, and publicly available educational resources. Special attention was given to ensure cultural sensitivity and accurate representation of the language.

- **Handling Numeric Challenges in Hindi**: Special focus was given to numeric reasoning tasks, where numbers are presented in Hindi words, a well-known challenge for existing language models. The dataset aims to push the boundaries of current models by providing complex scenarios that require a deep understanding of both language and numeric relationships.

Usage:

This dataset is ideal for researchers, educators, and developers working on natural language processing, machine learning, and AI models tailored for Hindi-speaking populations. The dataset can be used for:

Fine-tuning language models for improved understanding of mathematical reasoning in Hindi.

Training question-answering systems for educational tools that cater to Hindi-speaking students.

Developing AI systems for competitive exam preparati...
VietMed-NER Dataset
paperswithcode.com
Updated Jun 18, 2024
Cite
Khai Le-Duc; David Thulke; Hung-Phong Tran; Long Vo-Dang; Khai-Nguyen Nguyen; Truong-Son Hy; Ralf Schlüter (2024). VietMed-NER Dataset [Dataset]. https://paperswithcode.com/dataset/vietmed-ner
Dataset updated
Jun 18, 2024
Authors
Khai Le-Duc; David Thulke; Hung-Phong Tran; Long Vo-Dang; Khai-Nguyen Nguyen; Truong-Son Hy; Ralf Schlüter
Description
Spoken Named Entity Recognition (NER) aims to extract named entities from speech and categorize them into types like person, location, organization, etc. In this work, we present VietMed-NER, the first spoken NER dataset in the medical domain. To the best of our knowledge, our real-world dataset is the largest spoken NER dataset in the world in terms of the number of entity types, featuring 18 distinct types. Secondly, we present baseline results using various state-of-the-art pre-trained models: encoder-only and sequence-to-sequence. We found that the pre-trained multilingual model XLM-R outperformed all monolingual models on both reference text and ASR output. Also, in general, encoders perform better than sequence-to-sequence models for the NER task. By simply translating, the transcript is applicable not just to Vietnamese but to other languages as well. All code, data and models are made publicly available here: https://github.com/leduckhai/MultiMed
GlobalPhone Korean
catalogue.elra.info
live.european-language-grid.eu
Updated Jun 26, 2017
Cite
ELRA (European Language Resources Association) (2017). GlobalPhone Korean [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0200/
Dataset updated
Jun 26, 2017
Dataset provided by
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
ELRA (European Language Resources Association)
License
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
Description
The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks.

The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).

In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available on the Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers.

The data is compressed with the shorten program written by Tony Robinson; alternatively, it can be delivered uncompressed.

The Korean corpus was produced using the Hankyoreh Daily News. It contains recordings of 100 speakers (50 males, 50 females) recorded in Seoul, Korea. The following age distribution has been obtained: 7 speakers are below 19, 70 speakers are between 20 and 29, 19 speakers are between 30 and 39, and 3 speakers are between 40 and 49 (1 speaker age is unknown).
Table2_Swedish Youths as Listeners of Global Englishes Speakers With Diverse...
figshare.com
frontiersin.figshare.com
xlsx
Updated Jun 10, 2023
+ more versions
Cite
Hyeseung Jeong; Anna Elgemark; Bosse Thorén (2023). Table2_Swedish Youths as Listeners of Global Englishes Speakers With Diverse Accents: Listener Intelligibility, Listener Comprehensibility, Accentedness Perception, and Accentedness Acceptance.XLSX [Dataset]. http://doi.org/10.3389/feduc.2021.651908.s003
Available download formats: xlsx
Unique identifier
https://doi.org/10.3389/feduc.2021.651908.s003
Dataset updated
Jun 10, 2023
Dataset provided by
Frontiers
Authors
Hyeseung Jeong; Anna Elgemark; Bosse Thorén
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
As reflected in the concept of Global Englishes, English mediates global communication, where English speakers represent not merely those from English-speaking countries like the United Kingdom or the United States but also people from a wide range of linguistic backgrounds, who speak the language with diverse accents. Thus, to communicate internationally, cultivating strong listening proficiency for and positive attitudes toward Global Englishes speakers with diverse accents is ever more important. However, given their preference for American English and its popular culture, it is uncertain whether Swedish youth learners are developing these key linguistic qualities to be prepared for the globalized use of English. To address this, we randomly assigned 160 upper secondary students (mean age = 17.25) to six groups, where each group listened to one of six English speakers. The six speakers' first languages were Mandarin, Russian/Ukrainian, Tamil, Lusoga/Luganda, American English, and British English. By comparing the six student groups, we examined their listener intelligibility (actual understanding), listener comprehensibility (feeling of ease or difficulty), accentedness perception (perceiving an accent as native or foreign), and accentedness acceptance (showing a positive or negative attitude toward an accent) of diverse English accents. The results showed that the intelligibility scores and perception/attitude ratings of participants favored the two speakers with privileged accents: the American and British speakers. However, across all six groups, no correlation was detected between their actual understanding of the speakers and their perception/attitude ratings, which often had a strong correlation with their feelings of ease/difficulty regarding the speakers' accents. Taken together, our results suggest that current English education needs innovation to be more aligned with the national syllabus that promotes a global perspective. That is, students need to be guided to improve their actual understanding of and sense of familiarity with Global Englishes speakers besides the native accents that they prefer. Moreover, innovative pedagogical work should be undertaken to change Swedish youths’ perceptions and attitudes and prepare them to become open-minded toward diverse English speakers.
GlobalPhone Japanese
catalogue.elra.info
live.european-language-grid.eu
Updated Jun 26, 2017
+ more versions
Cite
ELRA (European Language Resources Association) (2017). GlobalPhone Japanese [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0199/
Dataset updated
Jun 26, 2017
Dataset provided by
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
ELRA (European Language Resources Association)
License
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
Description
The GlobalPhone corpus, developed in collaboration with the Karlsruhe Institute of Technology (KIT), was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322). In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering and false starts, and for non-verbal effects like laughing and hesitations. Speaker information such as age, gender, and occupation, as well as information about the recording setup, complements the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers. Data is compressed by means of the shorten program written by Tony Robinson. Alternatively, the data can be delivered unshortened.
The Japanese corpus was produced using the Nikkei Shinbun newspaper. It contains recordings of 149 speakers (104 males, 44 females, 1 unspecified) recorded in Tokyo, Japan. The following age distribution has been obtained: 22 speakers are below 19, 90 speakers are between 20 and 29, 5 speakers are between 30 and 39, 2 speakers are between 40 and 49, and 1 speaker is over 50 (the age of 28 speakers is unknown).