12 datasets found
  1. The most spoken languages worldwide 2023

    • statista.com
    Updated Jan 23, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2025). The most spoken languages worldwide 2023 [Dataset]. https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/
    Explore at:
    Dataset updated
    Jan 23, 2025
    Dataset authored and provided by
    Statistahttp://statista.com/
    Time period covered
    2022
    Area covered
    World
    Description

    In 2023, there were around 1.5 billion people worldwide who spoke English either natively or as a second language, slightly more than the 1.1 billion Mandarin Chinese speakers at the time of survey. Hindi and Spanish accounted for the third and fourth most widespread languages that year.

    Languages in the United States The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation and other official pronouncements. The United States is a land of immigrations and the languages spoken in the United States vary as a result of the multi-cultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which over 41 million people spoke at home in 2021. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese),1.7 million Tagalog speakers and 1.5 million Vietnamese speakers counted in the United States that year.

    Different languages at home The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage of population speaking a language other than English is California. About 44 percent of California’s population was speaking a language other than English at home in 2021.

  2. E

    GlobalPhone German

    • catalogue.elra.info
    Updated Jun 26, 2017
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    ELRA (European Language Resources Association) (2017). GlobalPhone German [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0198/
    Explore at:
    Dataset updated
    Jun 26, 2017
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdfhttps://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdfhttps://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Description

    The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).In each language about 100 sentences were read from each of the 100 speakers. The read texts were selected from national newspapers available via Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16bit, 16kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers.Data is shortened by means of the shorten program written by Tony Robinson. Alternatively, the data could be delivered unshorten.The German corpus was produced using the Frankfurter Allgemeine und Sueddeutsche Zeitung newspaper. It contains recordings of 77 speakers (70 males, 7 females) recorded in Karlsruhe, Germany. No age distribution is available.

  3. h

    jampatoisnli

    • huggingface.co
    Updated Jul 21, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ruth-Ann Armstrong (2023). jampatoisnli [Dataset]. https://huggingface.co/datasets/Ruth-Ann/jampatoisnli
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 21, 2023
    Authors
    Ruth-Ann Armstrong
    License

    https://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/

    Description

    Dataset Card for [Dataset Name]

      Dataset Summary
    

    JamPatoisNLI provides the first dataset for natural language inference in a creole language, Jamaican Patois. Many of the most-spoken low-resource languages are creoles. These languages commonly have a lexicon derived from a major world language and a distinctive grammar reflecting the languages of the original speakers and the process of language birth by creolization. This gives them a distinctive place in… See the full description on the dataset page: https://huggingface.co/datasets/Ruth-Ann/jampatoisnli.

  4. E

    GlobalPhone Portuguese (Brazilian)

    • live.european-language-grid.eu
    • catalog.elra.info
    audio format
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    GlobalPhone Portuguese (Brazilian) [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1912
    Explore at:
    audio formatAvailable download formats
    License

    http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdfhttp://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdfhttp://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Area covered
    Brazil
    Description

    The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks.

    The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).

    In each language about 100 sentences were read from each of the 100 speakers. The read texts were selected from national newspapers available via Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16bit, 16kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers.

    Data is shortened by means of the shorten program written by Tony Robinson. Alternatively, the data could be delivered unshorten.

    The Portuguese (Brazilian) corpus was produced using the Folha de Sao Paulo newspaper. It contains recordings of 102 speakers (54 males, 48 females) recorded in Porto Velho and Sao Paulo, Brazil. The following age distribution has been obtained: 6 speakers are below 19, 58 speakers are between 20 and 29, 27 speakers are between 30 and 39, 5 speakers are between 40 and 49, and 5 speakers are over 50 (1 speaker age is unknown).

  5. Dataset for: "Big data suggest strong constraints of linguistic similarity...

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Jan 24, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Job Schepens; Job Schepens; Roeland van Hout; Roeland van Hout; T. Florian Jaeger; T. Florian Jaeger (2020). Dataset for: "Big data suggest strong constraints of linguistic similarity on adult language learning" [Dataset]. http://doi.org/10.5281/zenodo.2863533
    Explore at:
    csvAvailable download formats
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Job Schepens; Job Schepens; Roeland van Hout; Roeland van Hout; T. Florian Jaeger; T. Florian Jaeger
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is adapted from raw data with fully anonymized results on the State Examination of Dutch as a Second Language. This exam is officially administred by the Board of Tests and Examinations (College voor Toetsen en Examens, or CvTE). See cvte.nl/about-cvte. The Board of Tests and Examinations is mandated by the Dutch government.

    The article accompanying the dataset:

    Schepens, Job, Roeland van Hout, and T. Florian Jaeger. “Big Data Suggest Strong Constraints of Linguistic Similarity on Adult Language Learning.” Cognition 194 (January 1, 2020): 104056. https://doi.org/10.1016/j.cognition.2019.104056.

    Every row in the dataset represents the first official testing score of a unique learner.
    The columns contain the following information as based on questionnaires filled in at the time of the exam:

    "L1" - The first language of the learner
    "C" - The country of birth
    "L1L2" - The combination of first and best additional language besides Dutch
    "L2" - The best additional language besides Dutch
    "AaA" - Age at Arrival in the Netherlands in years (starting date of residence)
    "LoR" - Length of residence in the Netherlands in years
    "Edu.day" - Duration of daily education (1 low, 2 middle, 3 high, 4 very high). From 1992 until 2006, learners' education has been measured by means of a side-by-side matrix question in a learner's questionnaire. Learners were asked to mark which type of education they have had (elementary, secondary, or tertiary schooling) by means of filling in for how many years they have been enrolled, in which country, and whether or not they have graduated. Based on this information we were able to estimate how many years learners have had education on a daily basis from six years of age onwards. Since 2006, the question about learners' education has been altered and it is asked directly how many years learners have had formal education on a daily basis from six years of age onwards. Possible answering categories are: 1) 0 thru 5 years; 2) 6 thru 10 years; 3) 11 thru 15 years; 4) 16 years or more. The answers have been merged into the categorical answer.
    "Sex" - Gender
    "Family" - Language Family
    "ISO639.3" - Language ID code according to Ethnologue
    "Enroll" - Proportion of school-aged youth enrolled in secondary education according to the World Bank. The World Bank reports on education data in a wide number of countries around the world on a regular basis. We took the gross enrollment rate in secondary schooling per country in the year the learner has arrived in the Netherlands as an indicator for a country's educational accessibility at the time learners have left their country of origin.
    "STEX_speaking_score" - The STEX test score for speaking proficiency.
    "Dissimilarity_morphological" - Morphological similarity
    "Dissimilarity_lexical" - Lexical similarity
    "Dissimilarity_phonological_new_features" - Phonological similarity (in terms of new features)
    "Dissimilarity_phonological_new_categories" - Phonological similarity (in terms of new sounds)


    A few rows of the data:

    "L1","C","L1L2","L2","AaA","LoR","Edu.day","Sex","Family","ISO639.3","Enroll","STEX_speaking_score","Dissimilarity_morphological","Dissimilarity_lexical","Dissimilarity_phonological_new_features","Dissimilarity_phonological_new_categories"
    "English","UnitedStates","EnglishMonolingual","Monolingual",34,0,4,"Female","Indo-European","eng ",94,541,0.0094,0.083191,11,19
    "English","UnitedStates","EnglishGerman","German",25,16,3,"Female","Indo-European","eng ",94,603,0.0094,0.083191,11,19
    "English","UnitedStates","EnglishFrench","French",32,3,4,"Male","Indo-European","eng ",94,562,0.0094,0.083191,11,19
    "English","UnitedStates","EnglishSpanish","Spanish",27,8,4,"Male","Indo-European","eng ",94,537,0.0094,0.083191,11,19
    "English","UnitedStates","EnglishMonolingual","Monolingual",47,5,3,"Male","Indo-European","eng ",94,505,0.0094,0.083191,11,19

  6. IndQNER: Indonesian Benchmark Dataset from the Indonesian Translation of the...

    • zenodo.org
    • data.niaid.nih.gov
    txt
    Updated Jan 27, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ria Hari Gusmita; Ria Hari Gusmita; Asep Fajar Firmansyah; Asep Fajar Firmansyah (2024). IndQNER: Indonesian Benchmark Dataset from the Indonesian Translation of the Quran [Dataset]. http://doi.org/10.5281/zenodo.7454892
    Explore at:
    txtAvailable download formats
    Dataset updated
    Jan 27, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Ria Hari Gusmita; Ria Hari Gusmita; Asep Fajar Firmansyah; Asep Fajar Firmansyah
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    IndQNER

    IndQNER is a Named Entity Recognition (NER) benchmark dataset that was created by manually annotating 8 chapters in the Indonesian translation of the Quran. The annotation was performed using a web-based text annotation tool, Tagtog, and the BIO (Beginning-Inside-Outside) tagging format. The dataset contains:

    • 3117 sentences
    • 62027 tokens
    • 2475 named entities
    • 18 named entity categories

    Named Entity Classes

    The named entity classes were initially defined by analyzing the existing Quran concepts ontology. The initial classes were updated based on the information acquired during the annotation process. Finally, there are 20 classes, as follows:

    1. Allah
    2. Allah's Throne
    3. Artifact
    4. Astronomical body
    5. Event
    6. False deity
    7. Holy book
    8. Language
    9. Angel
    10. Person
    11. Messenger
    12. Prophet
    13. Sentient
    14. Afterlife location
    15. Geographical location
    16. Color
    17. Religion
    18. Food
    19. Fruit
    20. The book of Allah

    Annotation Stage

    There were eight annotators who contributed to the annotation process. They were informatics engineering students at the State Islamic University Syarif Hidayatullah Jakarta.

    1. Anggita Maharani Gumay Putri
    2. Muhammad Destamal Junas
    3. Naufaldi Hafidhigbal
    4. Nur Kholis Azzam Ubaidillah
    5. Puspitasari
    6. Septiany Nur Anggita
    7. Wilda Nurjannah
    8. William Santoso

    Verification Stage

    We found many named entity and class candidates during the annotation stage. To verify the candidates, we consulted Quran and Tafseer (content) experts who are lecturers at Quran and Tafseer Department at the State Islamic University Syarif Hidayatullah Jakarta.

    1. Dr. Eva Nugraha, M.Ag.
    2. Dr. Jauhar Azizy, MA
    3. Dr. Lilik Ummi Kultsum, MA

    Evaluation

    We evaluated the annotation quality of IndQNER by performing experiments in two settings: supervised learning (BiLSTM+CRF) and transfer learning (IndoBERT fine-tuning).

    Supervised Learning Setting

    The implementation of BiLSTM and CRF utilized IndoBERT to provide word embeddings. All experiments used a batch size of 16. These are the results:

    Maximum sequence lengthNumber of e-pochPrecisionRecallF1 score
    256100.940.920.93
    25620 0.990.970.98
    256400.960.960.96
    2561000.970.960.96
    512100.920.920.92
    512200.960.950.96
    512400.970.950.96
    5121000.970.950.96

    Transfer Learning Setting

    We performed several experiments with different parameters in IndoBERT fine-tuning. All experiments used a learning rate of 2e-5 and a batch size of 16. These are the results:

    Maximum sequence lengthNumber of e-pochPrecisionRecallF1 score
    256100.670.650.65
    25620 0.600.590.59
    256400.750.720.71
    2561000.730.680.68
    512100.720.620.64
    512200.620.570.58
    512400.720.660.67
    5121000.680.680.67

    This dataset is also part of the NusaCrowd project which aims to collect Natural Language Processing (NLP) datasets for Indonesian and its local languages.

    How to Cite

    @InProceedings{10.1007/978-3-031-35320-8_12,
    author="Gusmita, Ria Hari
    and Firmansyah, Asep Fajar
    and Moussallem, Diego
    and Ngonga Ngomo, Axel-Cyrille",
    editor="M{\'e}tais, Elisabeth
    and Meziane, Farid
    and Sugumaran, Vijayan
    and Manning, Warren
    and Reiff-Marganiec, Stephan",
    title="IndQNER: Named Entity Recognition Benchmark Dataset from the Indonesian Translation of the Quran",
    booktitle="Natural Language Processing and Information Systems",
    year="2023",
    publisher="Springer Nature Switzerland",
    address="Cham",
    pages="170--185",
    abstract="Indonesian is classified as underrepresented in the Natural Language Processing (NLP) field, despite being the tenth most spoken language in the world with 198 million speakers. The paucity of datasets is recognized as the main reason for the slow advancements in NLP research for underrepresented languages. Significant attempts were made in 2020 to address this drawback for Indonesian. The Indonesian Natural Language Understanding (IndoNLU) benchmark was introduced alongside IndoBERT pre-trained language model. The second benchmark, Indonesian Language Evaluation Montage (IndoLEM), was presented in the same year. These benchmarks support several tasks, including Named Entity Recognition (NER). However, all NER datasets are in the public domain and do not contain domain-specific datasets. To alleviate this drawback, we introduce IndQNER, a manually annotated NER benchmark dataset in the religious domain that adheres to a meticulously designed annotation guideline. Since Indonesia has the world's largest Muslim population, we build the dataset from the Indonesian translation of the Quran. The dataset includes 2475 named entities representing 18 different classes. To assess the annotation quality of IndQNER, we perform experiments with BiLSTM and CRF-based NER, as well as IndoBERT fine-tuning. The results reveal that the first model outperforms the second model achieving 0.98 F1 points. This outcome indicates that IndQNER may be an acceptable evaluation metric for Indonesian NER tasks in the aforementioned domain, widening the research's domain range.",
    isbn="978-3-031-35320-8"
    }

    Contact

    If you have any questions or feedback, feel free to contact us at ria.hari.gusmita@uni-paderborn.de or ria.gusmita@uinjkt.ac.id

  7. E

    GlobalPhone Chinese-Shanghai

    • catalog.elra.info
    • live.european-language-grid.eu
    Updated Jun 26, 2017
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2017). GlobalPhone Chinese-Shanghai [Dataset]. https://catalog.elra.info/en-us/repository/browse/ELRA-S0194/
    Explore at:
    Dataset updated
    Jun 26, 2017
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdfhttps://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdfhttps://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Area covered
    Shanghai
    Description

    The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).In each language about 100 sentences were read from each of the 100 speakers. The read texts were selected from national newspapers available via Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16bit, 16kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers.Data is shortened by means of the shorten program written by Tony Robinson. Alternatively, the data could be delivered unshorten.The Chinese-Shanghai corpus was produced using the Peoples Daily newspaper. It contains recordings of 41 speakers (16 males, 25 females) recorded in Shanghai, China. The following age distribution has been obtained: 1 speaker is below 19, 2 speakers are between 20 and 29, 13 speakers are between 30 and 39, 14 speakers are between 40 and 49, and 11 speakers are over 50.

  8. HindiMathQuest - Math Problems & Reasoning

    • kaggle.com
    Updated Oct 14, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dnyanesh Walwadkar (2024). HindiMathQuest - Math Problems & Reasoning [Dataset]. http://doi.org/10.34740/kaggle/ds/5832290
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 14, 2024
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Dnyanesh Walwadkar
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Overview:

    The Hindi Mathematics Reasoning and Problem-Solving Dataset is designed to advance the capabilities of language models in understanding and solving mathematical problems presented in the Hindi language. The dataset covers a comprehensive range of question types, including logical reasoning, numeric calculations, translation-based problems, and complex mathematical tasks typically seen in competitive exams. This dataset is intended to fill a critical gap by focusing on numeric reasoning and mathematical logic in Hindi, offering high-quality prompts that challenge models to handle both linguistic and mathematical complexity in one of the world’s most widely spoken languages.

    Key Features:

    -**Diverse Range of Mathematical Problems**: The dataset includes questions from areas such as arithmetic, algebra, geometry, physics, and number theory, all expressed in Hindi.

    -**Logical and Reasoning Tasks**: Includes logic-based problems requiring pattern recognition, deduction, and reasoning, often seen in competitive exams like IIT JEE, GATE, and GRE.

    -**Complex Numerical Calculations in Hindi**: Numeric expressions and their handling in Hindi text, a common challenge for language models, are a major focus of this dataset. Questions require models to accurately interpret and solve mathematical problems where numbers are written in Hindi words (e.g., "पचासी हजार सात सौ नवासी" for 85789).

    -**Real-World Application Scenarios**: Paragraph-based problems, puzzles, and word problems that mirror real-world scenarios and test both language comprehension and problem-solving capabilities.

    -**Culturally Relevant Questions**: Carefully curated questions that avoid regional or social biases, ensuring that the dataset accurately reflects the linguistic and cultural nuances of Hindi-speaking regions.

    Dataset Breakdown:

    -**Logical and Reasoning-based Questions**: Questions testing pattern recognition, deduction, and logical reasoning, often seen in IQ tests and competitive exams.

    • Calculation-based Problems: Includes numeric operations such as addition, subtraction, multiplication, and division, presented in Hindi text.

    -**Translation-based Mathematical Problems**: Questions that involve translating between numeric expressions and Hindi word forms, enhancing model understanding of Hindi numerals.

    -**Competitive Exam-style Questions**: Sourced and inspired by advanced reasoning and problem-solving questions from exams like GATE, IIT JEE, and GRE, providing high-level challenge.

    -**Series and Sequence Questions**: Number series, progressions, and pattern recognition problems, essential for logical reasoning tasks.

    -**Paragraph-based Word Problems**: Real-world math problems described in multiple sentences of Hindi text, requiring deeper language comprehension and reasoning.

    -**Geometry and Trigonometry**: Includes geometry-based problems using Hindi terminology for angles, shapes, and measurements.

    -**Physics-based Problems**: Mathematical problems based on physics concepts like mechanics, thermodynamics, and electricity, all expressed in Hindi.

    -**Graph and Data Interpretation**: Interpretation of graphs and data in Hindi, testing both visual and mathematical understanding.

    -**Olympiad-style Questions**: Advanced math problems, similar to those found in math Olympiads, designed to test high-level reasoning and problem-solving skills.

    Preprocessing and Quality Control:

    -**Human Verification**: Over 30% of the dataset has been manually reviewed and verified by native Hindi speakers. Additionally, a random sample of English-to-Hindi translated prompts showed a 100% success rate in translation quality, further boosting confidence in the overall quality of the dataset.

    -**Dataset Curation**: The dataset was generated using a combination of human-curated questions, AI-assisted translations from existing English datasets, and publicly available educational resources. Special attention was given to ensure cultural sensitivity and accurate representation of the language.

    -**Handling Numeric Challenges in Hindi**: Special focus was given to numeric reasoning tasks, where numbers are presented in Hindi words—a well-known challenge for existing language models. The dataset aims to push the boundaries of current models by providing complex scenarios that require a deep understanding of both language and numeric relationships.

    Usage:

    This dataset is ideal for researchers, educators, and developers working on natural language processing, machine learning, and AI models tailored for Hindi-speaking populations. The dataset can be used for:

    • Fine-tuning language models for improved understanding of mathematical reasoning in Hindi.
    • Training question-answering systems for educational tools that cater to Hindi-speaking students.
    • Developing AI systems for competitive exam preparati...
  9. P

    VietMed-NER Dataset

    • paperswithcode.com
    Updated Jun 18, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Khai Le-Duc; David Thulke; Hung-Phong Tran; Long Vo-Dang; Khai-Nguyen Nguyen; Truong-Son Hy; Ralf Schlüter (2024). VietMed-NER Dataset [Dataset]. https://paperswithcode.com/dataset/vietmed-ner
    Explore at:
    Dataset updated
    Jun 18, 2024
    Authors
    Khai Le-Duc; David Thulke; Hung-Phong Tran; Long Vo-Dang; Khai-Nguyen Nguyen; Truong-Son Hy; Ralf Schlüter
    Description

    Spoken Named Entity Recognition (NER) aims to extracting named entities from speech and categorizing them into types like person, location, organization, etc. In this work, we present VietMed-NER - the first spoken NER dataset in the medical domain. To our best knowledge, our real-world dataset is the largest spoken NER dataset in the world in terms of the number of entity types, featuring 18 distinct types. Secondly, we present baseline results using various state-of-the-art pre-trained models: encoder-only and sequence-to-sequence. We found that pre-trained multilingual models XLM-R outperformed all monolingual models on both reference text and ASR output. Also in general, encoders perform better than sequence-to-sequence models for the NER task. By simply translating, the transcript is applicable not just to Vietnamese but to other languages as well. All code, data and models are made publicly available here: https://github.com/leduckhai/MultiMed

  10. E

    GlobalPhone Korean

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Jun 26, 2017
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    ELRA (European Language Resources Association) (2017). GlobalPhone Korean [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0200/
    Explore at:
    Dataset updated
    Jun 26, 2017
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdfhttps://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdfhttps://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Description

    The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).In each language about 100 sentences were read from each of the 100 speakers. The read texts were selected from national newspapers available via Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16bit, 16kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers.Data is shortened by means of the shorten program written by Tony Robinson. Alternatively, the data could be delivered unshorten.The Korean corpus was produced using the Hankyoreh Daily News. It contains recordings of 100 speakers (50 males, 50 females) recorded in Seoul, Korea. The following age distribution has been obtained: 7 speakers are below 19, 70 speakers are between 20 and 29, 19 speakers are between 30 and 39, and 3 speakers are between 40 and 49 (1 speaker age is unknown).

  11. f

    Table2_Swedish Youths as Listeners of Global Englishes Speakers With Diverse...

    • figshare.com
    • frontiersin.figshare.com
    xlsx
    Updated Jun 10, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Hyeseung Jeong; Anna Elgemark; Bosse Thorén (2023). Table2_Swedish Youths as Listeners of Global Englishes Speakers With Diverse Accents: Listener Intelligibility, Listener Comprehensibility, Accentedness Perception, and Accentedness Acceptance.XLSX [Dataset]. http://doi.org/10.3389/feduc.2021.651908.s003
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    Frontiers
    Authors
    Hyeseung Jeong; Anna Elgemark; Bosse Thorén
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    As reflected in the concept of Global Englishes, English mediates global communication, where English speakers represent not merely those from English-speaking countries like United Kingdom or United States but also global people from a wide range of linguistic backgrounds, who speak the language with diverse accents. Thus, to communicate internationally, cultivating a maximized listening proficiency for and positive attitudes toward global Englishes speakers with diverse accents is ever more important. However, with their preference for American English and its popular culture, it is uncertain whether Swedish youth learners are developing these key linguistic qualities to be prepared for the globalized use of English. To address this, we randomly assigned 160 upper secondary students (mean age = 17.25) into six groups, where each group listened to one of six English speakers. The six speakers first languages were Mandarin, Russian/Ukrainian, Tamil, Lusoga/Luganda, American English, and British English. Through comparing the six student groups, we examined their listener intelligibility (actual understanding), listener comprehensibility (feeling of ease or difficulty), accentedness perception (perceiving an accent as native or foreign), and accentedness acceptance (showing a positive or negative attitude toward an accent) of diverse English accents. The results showed that the intelligibility scores and perception/attitude ratings of participants favored the two speakers with privileged accents–the American and British speakers. However, across all six groups, no correlation was detected between their actual understanding of the speakers and their perception/attitude ratings, which often had a strong correlation with their feelings of ease/difficulty regarding the speakers accents. Taken together, our results suggest that the current English education needs innovation to be more aligned with the national syllabus that promotes a global perspective. That is, students need to be guided to improve their actual understanding and sense of familiarity with Global English speakers besides the native accents that they prefer. Moreover, innovative pedagogical work should be undertaken to change Swedish youths’ perceptions and attitudes and prepare them to become open-minded toward diverse English speakers.

  12. E

    GlobalPhone Japanese

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Jun 26, 2017
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    ELRA (European Language Resources Association) (2017). GlobalPhone Japanese [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0199/
    Explore at:
    Dataset updated
    Jun 26, 2017
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdfhttps://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdfhttps://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).In each language about 100 sentences were read from each of the 100 speakers. The read texts were selected from national newspapers available via Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16bit, 16kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers.Data is shortened by means of the shorten program written by Tony Robinson. Alternatively, the data could be delivered unshorten.The Japanese corpus was produced using the Nikkei Shinbun newspaper. It contains recordings of 149 speakers (104 males, 44 females, 1 unspecified) recorded in Tokyo, Japan. The following age distribution has been obtained: 22 speakers are below 19, 90 speakers are between 20 and 29, 5 speakers are between 30 and 39, 2 speakers are between 40 and 49, and 1 speaker is over 50 (28 speakers age is unknown).

  13. Not seeing a result you expected?
    Learn how you can add new datasets to our index.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Statista (2025). The most spoken languages worldwide 2023 [Dataset]. https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/
Organization logo

The most spoken languages worldwide 2023

Explore at:
411 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Jan 23, 2025
Dataset authored and provided by
Statistahttp://statista.com/
Time period covered
2022
Area covered
World
Description

In 2023, there were around 1.5 billion people worldwide who spoke English either natively or as a second language, slightly more than the 1.1 billion Mandarin Chinese speakers at the time of survey. Hindi and Spanish accounted for the third and fourth most widespread languages that year.

Languages in the United States The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation and other official pronouncements. The United States is a land of immigrations and the languages spoken in the United States vary as a result of the multi-cultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which over 41 million people spoke at home in 2021. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese),1.7 million Tagalog speakers and 1.5 million Vietnamese speakers counted in the United States that year.

Different languages at home The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage of population speaking a language other than English is California. About 44 percent of California’s population was speaking a language other than English at home in 2021.

Search
Clear search
Close search
Google apps
Main menu