18 datasets found

The most spoken languages worldwide 2023
statista.com
Updated Jan 23, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Statista (2025). The most spoken languages worldwide 2023 [Dataset]. https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/
Explore at:
Dataset updated
Jan 23, 2025
Dataset authored and provided by
Statistahttp://statista.com/
Time period covered
2022
Area covered
World
Description
In 2023, there were around 1.5 billion people worldwide who spoke English either natively or as a second language, slightly more than the 1.1 billion Mandarin Chinese speakers at the time of survey. Hindi and Spanish accounted for the third and fourth most widespread languages that year.

Languages in the United States The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation and other official pronouncements. The United States is a land of immigrations and the languages spoken in the United States vary as a result of the multi-cultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which over 41 million people spoke at home in 2021. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese),1.7 million Tagalog speakers and 1.5 million Vietnamese speakers counted in the United States that year.

Different languages at home The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage of population speaking a language other than English is California. About 44 percent of California’s population was speaking a language other than English at home in 2021.
E
GlobalPhone German
catalogue.elra.info
Updated Jun 26, 2017
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
ELRA (European Language Resources Association) (2017). GlobalPhone German [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0198/
Explore at:
Dataset updated
Jun 26, 2017
Dataset provided by
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
ELRA (European Language Resources Association)
License
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdfhttps://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdfhttps://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
Description
The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).In each language about 100 sentences were read from each of the 100 speakers. The read texts were selected from national newspapers available via Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16bit, 16kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers.Data is shortened by means of the shorten program written by Tony Robinson. Alternatively, the data could be delivered unshorten.The German corpus was produced using the Frankfurter Allgemeine und Sueddeutsche Zeitung newspaper. It contains recordings of 77 speakers (70 males, 7 females) recorded in Karlsruhe, Germany. No age distribution is available.
h
jampatoisnli
huggingface.co
Updated Jul 21, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ruth-Ann Armstrong (2023). jampatoisnli [Dataset]. https://huggingface.co/datasets/Ruth-Ann/jampatoisnli
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 21, 2023
Authors
Ruth-Ann Armstrong
License
https://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/
Description
Dataset Card for [Dataset Name]

Dataset Summary

JamPatoisNLI provides the first dataset for natural language inference in a creole language, Jamaican Patois. Many of the most-spoken low-resource languages are creoles. These languages commonly have a lexicon derived from a major world language and a distinctive grammar reflecting the languages of the original speakers and the process of language birth by creolization. This gives them a distinctive place in… See the full description on the dataset page: https://huggingface.co/datasets/Ruth-Ann/jampatoisnli.
120 Million Word Spanish Corpus
kaggle.com
Updated Aug 8, 2017
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rachael Tatman (2017). 120 Million Word Spanish Corpus [Dataset]. https://www.kaggle.com/rtatman/120-million-word-spanish-corpus/code
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 8, 2017
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Rachael Tatman
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0)https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
Context:

Spanish is the second most widely-spoken language on Earth; over one in 20 humans alive today is a native speaker of Spanish. This medium-sized corpus contains 120 million words of modern Spanish taken from the Spanish-Language Wikipedia in 2010.

Content:

This dataset is made up of 57 text files. Each contains multiple Wikipedia articles in an XML format. The text of each article is surrounded by

Acknowledgements:

This dataset was collected by Samuel Reese, Gemma Boleda, Montse Cuadros, Lluís Padró and German Rigau. If you use it in your work, please cite the following paper:

Samuel Reese, Gemma Boleda, Montse Cuadros, Lluís Padró, German Rigau. Wikicorpus: A Word-Sense Disambiguated Multilingual Wikipedia Corpus. In Proceedings of 7th Language Resources and Evaluation Conference (LREC'10), La Valleta, Malta. May, 2010.

Inspiration:

Can you create a stop-word list for Spanish based on this corpus? How does it compare to the one in this dataset?

Can you build a topic model to cluster together articles on similar topics?

You may also like:

Brazilian Portuguese Literature Corpus: 3.7 million word corpus of Brazilian literature published between 1840-1908

Colonia Corpus of Historical Portuguese: A 5.1 million word corpus of historical Portuguese

The National University of Singapore SMS Corpus: A corpus of more than 67,000 SMS messages in Singapore English & Mandarin
E
GlobalPhone Portuguese (Brazilian)
catalog.elra.info
live.european-language-grid.eu
Updated Jun 26, 2017
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2017). GlobalPhone Portuguese (Brazilian) [Dataset]. https://catalog.elra.info/en-us/repository/browse/ELRA-S0201/
Explore at:
Dataset updated
Jun 26, 2017
Dataset provided by
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
ELRA (European Language Resources Association)
License
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdfhttps://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdfhttps://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
Area covered
Brazil
Description
The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).In each language about 100 sentences were read from each of the 100 speakers. The read texts were selected from national newspapers available via Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16bit, 16kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers.Data is shortened by means of the shorten program written by Tony Robinson. Alternatively, the data could be delivered unshorten.The Portuguese (Brazilian) corpus was produced using the Folha de Sao Paulo newspaper. It contains recordings of 102 speakers (54 males, 48 females) recorded in Porto Velho and Sao Paulo, Brazil. The following age distribution has been obtained: 6 speakers are below 19, 58 speakers are between 20 and 29, 27 speakers are between 30 and 39, 5 speakers are between 40 and 49, and 5 speakers are over 50 (1 speaker age is unknown).
Data from: Foreign Language Proficiency Test Data from Three American...
icpsr.umich.edu
Updated Mar 10, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Winke, Paula Marie; Gass, Susan M.; Soneson, Dan; Rubio, Fernando; Hacking, Jane F. (2020). Foreign Language Proficiency Test Data from Three American Universities, [United States], 2014-2017 [Dataset]. http://doi.org/10.3886/ICPSR37499.v1
Explore at:
Unique identifier
https://doi.org/10.3886/ICPSR37499.v1
Dataset updated
Mar 10, 2020
Dataset provided by
Inter-university Consortium for Political and Social Researchhttps://www.icpsr.umich.edu/web/pages/
Authors
Winke, Paula Marie; Gass, Susan M.; Soneson, Dan; Rubio, Fernando; Hacking, Jane F.
License
https://www.icpsr.umich.edu/web/ICPSR/studies/37499/termshttps://www.icpsr.umich.edu/web/ICPSR/studies/37499/terms
Time period covered
Aug 15, 2014 - Jun 15, 2017
Area covered
United States, Utah, Michigan, Minnesota
Description
In the years 2014 through 2019, three U.S. universities, Michigan State University, the University of Minnesota, Twin Cities, and The University of Utah, received Language Proficiency Flagship Initiative grants as part of the larger Language Flagship, which is a National Security Education Program (NSEP) and Defense Language and National Security Education Office (DLNSEO) initiative to improve language learning in the United States. The goal of the three universities' Language Proficiency Flagship Initiative grants was to document language proficiency in regular tertiary foreign language programs so that the programs, and ones like them at other universities, could use the proficiency-achievement data to set programmatic learning benchmarks and recommendations, as called for by the Modern Language Association in 2007. This call was reiterated by the National Standards Collaborative Board in 2015.During the first three years of the three, university-specific five-year grants (Fall 2014 through Spring 2017), each university collected language proficiency data during academic years 2014-2015, 2015-2016, and 2016-2017, from language learners in selected, regular language programs to document the students' proficiency achievements.University A tested Chinese, French, Russian, and Spanish with the NSEP grant funding, and German, Italian, Japanese, Korean, and Portuguese with additional (in-kind) financial support from within University A.University B tested Arabic, French, Portuguese, Russian, and Spanish with the NSEP grant funding, and German and Korean with additional (in-kind) financial support from University B.University C tested Arabic, Chinese, Portuguese, and Russian with the NSEP grant funding, and Korean with additional (in-kind) financial support from University C.Each university additionally provided the students background questionnaires at the time of testing. As stipulated by the grant terms, at the universities, students were offered to take up to three proficiency tests each semester: speaking, listening, and reading. Writing was not assessed because the grants did not financially cover the costs of writing assessments. The universities were required by grant terms to use official, nationally recognized, and standardized language tests that reported scores out on one of two standardized proficiency test scales: either the American Councils of Teaching Foreign Languages (ACTFL, 2012) proficiency scale, or the Interagency Language Roundtable (ILR: Herzog, n.d.) proficiency scale. The three universities thus contracted mostly with Language Testing International, ACTFL's official testing subsidiary, to purchase and administer to students the Oral Proficiency Interview - computer (OPIc) for speaking, the Listening Proficiency Test (LPT) for listening, and the Reading Proficiency Test (RPT) for reading. However, earlier in the grant cycling, because ACTFL did not yet have tests in all of the languages to be tested, some of the earlier testing was contracted with American Councils and Avant STAMP, even though those tests are not specifically geared for the specific populations of learners present in the given project.Students were able to opt out of testing in certain cases; those cases varied from university to university. The speaking tests occurred normally within intact classes that came into computer labs to take the tests. Students were often times requested to take the listening and reading tests outside of class time in proctored language labs on the campuses on walk-in bases, or they took the listening and reading tests in a language lab during a regular class setting. These decisions were often made by the language instructors and/or the language programs. The data are cross-sectional, but certain individuals took the tests repeatedly, thus, longitudinal data sets are nested within the cross-sectional data.The three universities worked mostly independently during the initial year of data collection because the identities of the three universities receiving the grants was not announced until weeks before testing was to begin at all three campuses. Thus, each university independently designed its background questionnaire. However, because all three were guided by the same set of grant-rules to use nationally-recognized standardized tests for the assessments, combining all three universities' test data was
E
GlobalPhone Chinese-Shanghai
catalog.elra.info
live.european-language-grid.eu
Updated Jun 26, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2017). GlobalPhone Chinese-Shanghai [Dataset]. https://catalog.elra.info/en-us/repository/browse/ELRA-S0194/
Explore at:
Dataset updated
Jun 26, 2017
Dataset provided by
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
ELRA (European Language Resources Association)
License
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdfhttps://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdfhttps://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
Area covered
Shanghai
Description
The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).In each language about 100 sentences were read from each of the 100 speakers. The read texts were selected from national newspapers available via Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16bit, 16kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers.Data is shortened by means of the shorten program written by Tony Robinson. Alternatively, the data could be delivered unshorten.The Chinese-Shanghai corpus was produced using the Peoples Daily newspaper. It contains recordings of 41 speakers (16 males, 25 females) recorded in Shanghai, China. The following age distribution has been obtained: 1 speaker is below 19, 2 speakers are between 20 and 29, 13 speakers are between 30 and 39, 14 speakers are between 40 and 49, and 11 speakers are over 50.
c
Speech Across Dialects of English: Acoustic Measures from SPADE Project...
datacatalogue.cessda.eu
beta.ukdataservice.ac.uk
Updated Mar 19, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Stuart-Smith, J; Sonderegger, M; Mielke, J (2025). Speech Across Dialects of English: Acoustic Measures from SPADE Project Corpora, 1949-2019 [Dataset]. http://doi.org/10.5255/UKDA-SN-854959
Explore at:
Unique identifier
https://doi.org/10.5255/UKDA-SN-854959
Dataset updated
Mar 19, 2025
Dataset provided by
University of Glasgow
McGill University
North Carolina State University
Authors
Stuart-Smith, J; Sonderegger, M; Mielke, J
Time period covered
Aug 31, 2017 - Aug 30, 2020
Area covered
Ireland, Canada, United States, United Kingdom
Variables measured
Individual, Text unit
Measurement technique
The acoustic measures provided were obtained from speech corpora collected as part of the SPADE project. Many of these were shared by a Data Guardian, an individual or institution with particular responsibility for one or more speech dataset(s), which they have either collected personally for a specific purpose, overseen the collection of, or now curate. The corpora are either public or private. Public corpora are either freely accessible or are available for sharing via a fee. Private corpora have been collected for a specific purpose, often sociolinguistic or phonetic. Together, the corpora feature speech from the UK, Ireland, Canada and the USA and were sourced in order to obtain good dialect coverage across a variety of social dimensions (e.g. age, gender, class, ethnicity). The speech is in a variety of formats including read speech, public speeches, oral histories and sociolinguistic interviews.The corpora were either already force-aligned or alignment was carried out as part of the SPADE project. Software developed as part of the SPADE project was then used to obtain vowel durations, static vowel formant measures and sibilant measures from the speech.
Description
The SPADE project aims to develop and apply user-friendly software for large-scale speech analysis of existing public and private English speech datasets, in order to understand more about English speech over space and time. To date, we have worked with 42 shared corpora comprising dialects from across the British Isles (England, Wales, Scotland, Ireland) and North America (US, Canada), with an effective time span of over 100 years. We make available here a link to our OSF repository (see below) which has acoustic measures datasets for sibilants and durations and static formants for vowels, for 39 corpora (~2200 hours of speech analysed from ~8600 speakers), with information about dataset generation. In addition, at the OSF site, we provide Praat TextGrids created by SPADE for some corpora. Reading passage text is provided when the measures are based on reading only. Datasets are in their raw form and will require cleaning (e.g. outlier removal) before analysis. In addition, we used whitelisting to anonymise measures datasets generated from non-public, restricted corpora.
Obtaining a data visualization of a text search within seconds via generic, large-scale search algorithms, such as Google n-gram viewer, is available to anyone. By contrast, speech research is only now entering its own 'big data' revolution. Historically, linguistic research has tended to carry out fine-grained analysis of a few aspects of speech from one or a few languages or dialects. The current scale of speech research studies has shaped our understanding of spoken language and the kinds of questions that we ask. Today, massive digital collections of transcribed speech are available from many different languages, gathered for many different purposes: from oral histories, to large datasets for training speech recognition systems, to legal and political interactions. Sophisticated speech processing tools exist to analyze these data, but require substantial technical skill. Given this confluence of data and tools, linguists have a new opportunity to answer fundamental questions about the nature and development of spoken language.

Our project seeks to establish the key tools to enable large-scale speech research to become as powerful and pervasive as large-scale text mining. It is based on a partnership of three teams based in Scotland, Canada and the US. Together we exploit methods from computing science and put them to work with tools and methods from speech science, linguistics and digital humanities, to discover how much the sounds of English across the Atlantic vary over space and time.

We have developed innovative and user-friendly software which exploits the availability of existing speech data and speech processing tools to facilitate large-scale integrated speech corpus analysis across many datasets together. The gains of such an approach are substantial: linguists will be able to scale up answers to existing research questions from one to many varieties of a language, and ask new and different questions about spoken language within and across social, regional, and cultural, contexts. Computational linguistics, speech technology, forensic and clinical linguistics researchers, who engage with variability in spoken language, will also benefit directly from our software. This project also opens up vast potential for those who already use digital scholarship for spoken language collections in the humanities and social sciences more broadly, e.g. literary scholars, sociologists, anthropologists, historians, political scientists. The possibility of ethically non-invasive inspection of speech and texts will allow analysts to uncover far more than is possible through textual analysis alone.

Our project has developed and applied our new software to a global language, English, using existing public and private spoken datasets of Old World (British Isles) and New World (North American) English, across an effective time span of more than 100 years, spanning the entire 20th century. Much of what we know about spoken English comes from influential studies on a few specific aspects of speech from one or two dialects. This vast literature has established important research questions which has been investigated for the first time on a much larger scale, through standardized data across many different varieties of English.

Our large-scale study complements current-scale studies, by enabling us to consider stability and change in English across the 20th century on an unparalleled scale. The global nature of English means that our findings will be interesting and relevant to a large international non-academic audience; they have been made accessible through an innovative and dynamic visualization of linguistic variation via an interactive sound mapping website. In addition to new insights into spoken English, this project also lays the crucial groundwork for large-scale speech studies across many datasets from different languages, of...
E
GlobalPhone Korean
catalogue.elra.info
live.european-language-grid.eu
Updated Jun 26, 2017
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
ELRA (European Language Resources Association) (2017). GlobalPhone Korean [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0200/
Explore at:
Dataset updated
Jun 26, 2017
Dataset provided by
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
ELRA (European Language Resources Association)
License
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdfhttps://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdfhttps://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
Description
The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).In each language about 100 sentences were read from each of the 100 speakers. The read texts were selected from national newspapers available via Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16bit, 16kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers.Data is shortened by means of the shorten program written by Tony Robinson. Alternatively, the data could be delivered unshorten.The Korean corpus was produced using the Hankyoreh Daily News. It contains recordings of 100 speakers (50 males, 50 females) recorded in Seoul, Korea. The following age distribution has been obtained: 7 speakers are below 19, 70 speakers are between 20 and 29, 19 speakers are between 30 and 39, and 3 speakers are between 40 and 49 (1 speaker age is unknown).
d
GLips - German Lipreading Dataset - Dataset - B2FIND
b2find.dkrz.de
Updated Oct 29, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2023). GLips - German Lipreading Dataset - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/aca4d0c4-9e81-560c-b3af-de011686ecc6
Explore at:
Dataset updated
Oct 29, 2023
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
The German Lipreading dataset consists of 250,000 publicly available videos of the faces of speakers of the Hessian Parliament, which was processed for word-level lip reading using an automatic pipeline. The format is similar to that of the English language Lip Reading in the Wild (LRW) dataset, with each H264-compressed MPEG-4 video encoding one word of interest in a context of 1.16 seconds duration, which yields compatibility for studying transfer learning between both datasets. Choosing video material based on naturally spoken language in a natural environment ensures more robust results for real-world applications than artificially generated datasets with as little noise as possible. The 500 different spoken words ranging between 4-18 characters in length each have 500 instances and separate MPEG-4 audio- and text metadata-files, originating from 1018 parliamentary sessions. Additionally, the complete TextGrid files containing the segmentation information of those sessions are also included. The size of the uncompressed dataset is 16GB. Copyright of original data: Hessian Parliament (https://hessischer-landtag.de). If you use this dataset, you agree to use it for research purpose only and to cite the following reference in any works that make any use of the dataset.
h
BanglaNLP
huggingface.co
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
BanglaNLP [Dataset]. https://huggingface.co/datasets/likhonsheikh/BanglaNLP
Explore at:
Authors
Likhon Sheikh
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
BanglaNLP: Bengali-English Parallel Dataset Tools

BanglaNLP is a comprehensive toolkit for creating high-quality Bengali-English parallel datasets from news sources, designed to improve machine translation and other cross-lingual NLP tasks for the Bengali language. Our work addresses the critical shortage of high-quality parallel data for Bengali, the 7th most spoken language in the world with over 230 million speakers.

🏆 Impact & Recognition

120K+ Sentence Pairs:… See the full description on the dataset page: https://huggingface.co/datasets/likhonsheikh/BanglaNLP.

IndQNER: Indonesian Benchmark Dataset from the Indonesian Translation of the...

zenodo.org
data.niaid.nih.gov

txt

Updated Jan 27, 2024

Facebook

Twitter

Click to copy link

Link copied

Cite

Ria Hari Gusmita; Ria Hari Gusmita; Asep Fajar Firmansyah; Asep Fajar Firmansyah (2024). IndQNER: Indonesian Benchmark Dataset from the Indonesian Translation of the Quran [Dataset]. http://doi.org/10.5281/zenodo.7454892

Explore at:

txtAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.7454892

Dataset updated

Jan 27, 2024

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Ria Hari Gusmita; Ria Hari Gusmita; Asep Fajar Firmansyah; Asep Fajar Firmansyah

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

IndQNER

IndQNER is a Named Entity Recognition (NER) benchmark dataset that was created by manually annotating 8 chapters in the Indonesian translation of the Quran. The annotation was performed using a web-based text annotation tool, Tagtog, and the BIO (Beginning-Inside-Outside) tagging format. The dataset contains:

3117 sentences
62027 tokens
2475 named entities
18 named entity categories

Named Entity Classes

The named entity classes were initially defined by analyzing the existing Quran concepts ontology. The initial classes were updated based on the information acquired during the annotation process. Finally, there are 20 classes, as follows:

Allah
Allah's Throne
Artifact
Astronomical body
Event
False deity
Holy book
Language
Angel
Person
Messenger
Prophet
Sentient
Afterlife location
Geographical location
Color
Religion
Food
Fruit
The book of Allah

Annotation Stage

There were eight annotators who contributed to the annotation process. They were informatics engineering students at the State Islamic University Syarif Hidayatullah Jakarta.

Anggita Maharani Gumay Putri
Muhammad Destamal Junas
Naufaldi Hafidhigbal
Nur Kholis Azzam Ubaidillah
Puspitasari
Septiany Nur Anggita
Wilda Nurjannah
William Santoso

Verification Stage

We found many named entity and class candidates during the annotation stage. To verify the candidates, we consulted Quran and Tafseer (content) experts who are lecturers at Quran and Tafseer Department at the State Islamic University Syarif Hidayatullah Jakarta.

Dr. Eva Nugraha, M.Ag.
Dr. Jauhar Azizy, MA
Dr. Lilik Ummi Kultsum, MA

Evaluation

We evaluated the annotation quality of IndQNER by performing experiments in two settings: supervised learning (BiLSTM+CRF) and transfer learning (IndoBERT fine-tuning).

Supervised Learning Setting

The implementation of BiLSTM and CRF utilized IndoBERT to provide word embeddings. All experiments used a batch size of 16. These are the results:

Maximum sequence length	Number of e-poch	Precision	Recall	F1 score
256	10	0.94	0.92	0.93
256	20	0.99	0.97	0.98
256	40	0.96	0.96	0.96
256	100	0.97	0.96	0.96
512	10	0.92	0.92	0.92
512	20	0.96	0.95	0.96
512	40	0.97	0.95	0.96
512	100	0.97	0.95	0.96

Transfer Learning Setting

We performed several experiments with different parameters in IndoBERT fine-tuning. All experiments used a learning rate of 2e-5 and a batch size of 16. These are the results:

Maximum sequence length	Number of e-poch	Precision	Recall	F1 score
256	10	0.67	0.65	0.65
256	20	0.60	0.59	0.59
256	40	0.75	0.72	0.71
256	100	0.73	0.68	0.68
512	10	0.72	0.62	0.64
512	20	0.62	0.57	0.58
512	40	0.72	0.66	0.67
512	100	0.68	0.68	0.67

This dataset is also part of the NusaCrowd project which aims to collect Natural Language Processing (NLP) datasets for Indonesian and its local languages.

How to Cite

@InProceedings{10.1007/978-3-031-35320-8_12,
author="Gusmita, Ria Hari
and Firmansyah, Asep Fajar
and Moussallem, Diego
and Ngonga Ngomo, Axel-Cyrille",
editor="M{\'e}tais, Elisabeth
and Meziane, Farid
and Sugumaran, Vijayan
and Manning, Warren
and Reiff-Marganiec, Stephan",
title="IndQNER: Named Entity Recognition Benchmark Dataset from the Indonesian Translation of the Quran",
booktitle="Natural Language Processing and Information Systems",
year="2023",
publisher="Springer Nature Switzerland",
address="Cham",
pages="170--185",
abstract="Indonesian is classified as underrepresented in the Natural Language Processing (NLP) field, despite being the tenth most spoken language in the world with 198 million speakers. The paucity of datasets is recognized as the main reason for the slow advancements in NLP research for underrepresented languages. Significant attempts were made in 2020 to address this drawback for Indonesian. The Indonesian Natural Language Understanding (IndoNLU) benchmark was introduced alongside IndoBERT pre-trained language model. The second benchmark, Indonesian Language Evaluation Montage (IndoLEM), was presented in the same year. These benchmarks support several tasks, including Named Entity Recognition (NER). However, all NER datasets are in the public domain and do not contain domain-specific datasets. To alleviate this drawback, we introduce IndQNER, a manually annotated NER benchmark dataset in the religious domain that adheres to a meticulously designed annotation guideline. Since Indonesia has the world's largest Muslim population, we build the dataset from the Indonesian translation of the Quran. The dataset includes 2475 named entities representing 18 different classes. To assess the annotation quality of IndQNER, we perform experiments with BiLSTM and CRF-based NER, as well as IndoBERT fine-tuning. The results reveal that the first model outperforms the second model achieving 0.98 F1 points. This outcome indicates that IndQNER may be an acceptable evaluation metric for Indonesian NER tasks in the aforementioned domain, widening the research's domain range.",
isbn="978-3-031-35320-8"
}

Contact

If you have any questions or feedback, feel free to contact us at ria.hari.gusmita@uni-paderborn.de or ria.gusmita@uinjkt.ac.id

H
Replication Data for: Assessing child development scores among minority and...
dataverse.harvard.edu
Updated Nov 2, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ann Miller (2023). Replication Data for: Assessing child development scores among minority and Indigenous language vs. dominant language speakers: A cross-sectional analysis using national Multiple Indicator Cluster Surveys [Dataset]. http://doi.org/10.7910/DVN/X1UAU8
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/X1UAU8
Dataset updated
Nov 2, 2023
Dataset provided by
Harvard Dataverse
Authors
Ann Miller
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
This Dataverse contains Stata 15.1 .do files and Excel spreadhseets with code to replicate the analyses. This paper used Multiple Indicator Cluster Survey (MICS) program data. MICS study data are the property of the UNICEF MICS program and the data use agreement is that the data “may not be redistributed or passed on to others in any form” so the databases themselves are not included in this Dataverse. However, the MICS study databases are available upon request to the MICS program for legitimate research purposes at the following website: https://mics.unicef.org/surveys. Abstract Background: Multiple studies have highlighted the inequities minority and Indigenous children face when accessing health care. Health and wellbeing is positively impacted when Indigenous children are educated and receive care in their maternal language. However, less is known about the association between minority or Indigenous language use and child development risks and outcomes. In this study, we provide global estimates of development risks and assess the associations between minority or Indigenous language status and early child development using the 10-item Early Child Development Index (ECDI), a tool widely used for global population assessments in children aged 3–4 years. Methods: We did a secondary analysis of cross-sectional data from 65 UNICEF Multiple Indicator Cluster Surveys (MICS) containing the ECDI from 2009–19 (waves 4–6). We included individual-level data for children aged 2–4 years (23–60 months) from datasets with ECDI modules, for surveys that captured the language of the respondent, interview, or head of household. The Expanded Graded Intergenerational Disruption Scale (EGIDS) was used to classify household languages as dominant versus minority or Indigenous at the country level. Our primary outcome was on-track overall development, defined per UNICEF’s guidelines as development being on track for at least three of the four ECDI domains (literacy–numeracy, learning, physical, and socioemotional). We performed logistic regression of pooled, weighted ECDI scores, aggregated by language status and adjusting for the covariables of child sex, child nutritional status (stunting), household wealth, maternal education, developmental support by an adult caregiver, and country-level early child education proportion. Regression analyses were done for all children aged 3–4 years with ECDI results, and separately for children with functional disabilities and ECDI results. Findings: 65 MICS datasets were included. 186 393 children aged 3–4 years had ECDI and language data, corresponding to an estimated represented population of 34 714 992 individuals. Estimated prevalence of on-track overall development as measured by ECDI scores was 65·7% (95% CI 64·2–67·2) for children from a minority or Indigenous language-speaking household, and 76·6% (75·7–77·4) for those from a dominant language-speaking household. After adjustment, dominant language status was associated with increased odds of on-track overall development (adjusted OR 1·54, 95% CI 1·40–1·71), which appeared to be largely driven by significantly increased odds of on-track development in the literacy–numeracy and socioemotional domains. For the represented population aged 2–4 years (n=11 465 601), the estimated prevalence of family-reported functional disability was 3·6% (95% CI 3·0–4·4). For the represented population aged 3–4 years (n=292,691), language status was not associated with on-track overall development among children with functional disability (adjusted OR 1·02, 95% CI 0·43–2·45). Interpretation: In a global dataset, children speaking a minority or Indigenous language were less likely to have on-track ECDI scores than those speaking a dominant language. Given the strong positive benefits of speaking an Indigenous language on the health and development of Indigenous children, this disparity is likely to reflect the sociolinguistic marginalisation faced by speakers of minority or Indigenous languages as well as differences in the performance of ECDI in these languages. Global efforts should consider performance of measures and monitor developmental data disaggregated by language status to stimulate efforts to address this disparity.
P
VietMed-NER Dataset
paperswithcode.com
Updated Jun 18, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Khai Le-Duc; David Thulke; Hung-Phong Tran; Long Vo-Dang; Khai-Nguyen Nguyen; Truong-Son Hy; Ralf Schlüter (2024). VietMed-NER Dataset [Dataset]. https://paperswithcode.com/dataset/vietmed-ner
Explore at:
Dataset updated
Jun 18, 2024
Authors
Khai Le-Duc; David Thulke; Hung-Phong Tran; Long Vo-Dang; Khai-Nguyen Nguyen; Truong-Son Hy; Ralf Schlüter
Description
Spoken Named Entity Recognition (NER) aims to extracting named entities from speech and categorizing them into types like person, location, organization, etc. In this work, we present VietMed-NER - the first spoken NER dataset in the medical domain. To our best knowledge, our real-world dataset is the largest spoken NER dataset in the world in terms of the number of entity types, featuring 18 distinct types. Secondly, we present baseline results using various state-of-the-art pre-trained models: encoder-only and sequence-to-sequence. We found that pre-trained multilingual models XLM-R outperformed all monolingual models on both reference text and ASR output. Also in general, encoders perform better than sequence-to-sequence models for the NER task. By simply translating, the transcript is applicable not just to Vietnamese but to other languages as well. All code, data and models are made publicly available here: https://github.com/leduckhai/MultiMed
HindiMathQuest - Math Problems & Reasoning
kaggle.com
Updated Oct 14, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dnyanesh Walwadkar (2024). HindiMathQuest - Math Problems & Reasoning [Dataset]. http://doi.org/10.34740/kaggle/ds/5832290
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.34740/kaggle/ds/5832290
Dataset updated
Oct 14, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Dnyanesh Walwadkar
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Overview:

The Hindi Mathematics Reasoning and Problem-Solving Dataset is designed to advance the capabilities of language models in understanding and solving mathematical problems presented in the Hindi language. The dataset covers a comprehensive range of question types, including logical reasoning, numeric calculations, translation-based problems, and complex mathematical tasks typically seen in competitive exams. This dataset is intended to fill a critical gap by focusing on numeric reasoning and mathematical logic in Hindi, offering high-quality prompts that challenge models to handle both linguistic and mathematical complexity in one of the world’s most widely spoken languages.

Key Features:

-**Diverse Range of Mathematical Problems**: The dataset includes questions from areas such as arithmetic, algebra, geometry, physics, and number theory, all expressed in Hindi.

-**Logical and Reasoning Tasks**: Includes logic-based problems requiring pattern recognition, deduction, and reasoning, often seen in competitive exams like IIT JEE, GATE, and GRE.

-**Complex Numerical Calculations in Hindi**: Numeric expressions and their handling in Hindi text, a common challenge for language models, are a major focus of this dataset. Questions require models to accurately interpret and solve mathematical problems where numbers are written in Hindi words (e.g., "पचासी हजार सात सौ नवासी" for 85789).

-**Real-World Application Scenarios**: Paragraph-based problems, puzzles, and word problems that mirror real-world scenarios and test both language comprehension and problem-solving capabilities.

-**Culturally Relevant Questions**: Carefully curated questions that avoid regional or social biases, ensuring that the dataset accurately reflects the linguistic and cultural nuances of Hindi-speaking regions.

Dataset Breakdown:

-**Logical and Reasoning-based Questions**: Questions testing pattern recognition, deduction, and logical reasoning, often seen in IQ tests and competitive exams.

Calculation-based Problems: Includes numeric operations such as addition, subtraction, multiplication, and division, presented in Hindi text.

-**Translation-based Mathematical Problems**: Questions that involve translating between numeric expressions and Hindi word forms, enhancing model understanding of Hindi numerals.

-**Competitive Exam-style Questions**: Sourced and inspired by advanced reasoning and problem-solving questions from exams like GATE, IIT JEE, and GRE, providing high-level challenge.

-**Series and Sequence Questions**: Number series, progressions, and pattern recognition problems, essential for logical reasoning tasks.

-**Paragraph-based Word Problems**: Real-world math problems described in multiple sentences of Hindi text, requiring deeper language comprehension and reasoning.

-**Geometry and Trigonometry**: Includes geometry-based problems using Hindi terminology for angles, shapes, and measurements.

-**Physics-based Problems**: Mathematical problems based on physics concepts like mechanics, thermodynamics, and electricity, all expressed in Hindi.

-**Graph and Data Interpretation**: Interpretation of graphs and data in Hindi, testing both visual and mathematical understanding.

-**Olympiad-style Questions**: Advanced math problems, similar to those found in math Olympiads, designed to test high-level reasoning and problem-solving skills.

Preprocessing and Quality Control:

-**Human Verification**: Over 30% of the dataset has been manually reviewed and verified by native Hindi speakers. Additionally, a random sample of English-to-Hindi translated prompts showed a 100% success rate in translation quality, further boosting confidence in the overall quality of the dataset.

-**Dataset Curation**: The dataset was generated using a combination of human-curated questions, AI-assisted translations from existing English datasets, and publicly available educational resources. Special attention was given to ensure cultural sensitivity and accurate representation of the language.

-**Handling Numeric Challenges in Hindi**: Special focus was given to numeric reasoning tasks, where numbers are presented in Hindi words—a well-known challenge for existing language models. The dataset aims to push the boundaries of current models by providing complex scenarios that require a deep understanding of both language and numeric relationships.

Usage:

This dataset is ideal for researchers, educators, and developers working on natural language processing, machine learning, and AI models tailored for Hindi-speaking populations. The dataset can be used for:

Fine-tuning language models for improved understanding of mathematical reasoning in Hindi.

Training question-answering systems for educational tools that cater to Hindi-speaking students.

Developing AI systems for competitive exam preparati...
E
GlobalPhone Japanese
catalogue.elra.info
live.european-language-grid.eu
Updated Jun 26, 2017
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
ELRA (European Language Resources Association) (2017). GlobalPhone Japanese [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0199/
Explore at:
Dataset updated
Jun 26, 2017
Dataset provided by
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
ELRA (European Language Resources Association)
License
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdfhttps://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdfhttps://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
Description
The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).In each language about 100 sentences were read from each of the 100 speakers. The read texts were selected from national newspapers available via Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16bit, 16kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers.Data is shortened by means of the shorten program written by Tony Robinson. Alternatively, the data could be delivered unshorten.The Japanese corpus was produced using the Nikkei Shinbun newspaper. It contains recordings of 149 speakers (104 males, 44 females, 1 unspecified) recorded in Tokyo, Japan. The following age distribution has been obtained: 22 speakers are below 19, 90 speakers are between 20 and 29, 5 speakers are between 30 and 39, 2 speakers are between 40 and 49, and 1 speaker is over 50 (28 speakers age is unknown).
Toronto Neighborhood Data
kaggle.com
Updated Jul 5, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sidharth Kumar Mohanty (2021). Toronto Neighborhood Data [Dataset]. https://www.kaggle.com/sidharth178/toronto-neighborhood-data/tasks
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 5, 2021
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Sidharth Kumar Mohanty
Area covered
Toronto
Description
Context

With a population just short of 3 million people, the city of Toronto is the largest in Canada, and one of the largest in North America (behind only Mexico City, New York and Los Angeles). Toronto is also one of the most multicultural cities in the world, making life in Toronto a wonderful multicultural experience for all. More than 140 languages and dialects are spoken in the city, and almost half the population Toronto were born outside Canada.It is a place where people can try the best of each culture, either while they work or just passing through. Toronto is well known for its great food.

Content

This dataset was created by doing webscraping of Toronto wikipedia page . The dataset contains the latitude and longitude of all the neighborhoods and boroughs with postal code of Toronto City,Canada.
f
Urdu sentiment analysis related work summary.
plos.figshare.com
figshare.com
xls
Updated Aug 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Abdul Ghafoor; Ali Shariq Imran; Sher Muhammad Daudpota; Zenun Kastrati; Sarang Shaikh; Rakhi Batra (2023). Urdu sentiment analysis related work summary. [Dataset]. http://doi.org/10.1371/journal.pone.0290779.t001
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0290779.t001
Dataset updated
Aug 30, 2023
Dataset provided by
PLOS ONE
Authors
Abdul Ghafoor; Ali Shariq Imran; Sher Muhammad Daudpota; Zenun Kastrati; Sarang Shaikh; Rakhi Batra
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Low-resource languages are gaining much-needed attention with the advent of deep learning models and pre-trained word embedding. Though spoken by more than 230 million people worldwide, Urdu is one such low-resource language that has recently gained popularity online and is attracting a lot of attention and support from the research community. One challenge faced by such resource-constrained languages is the scarcity of publicly available large-scale datasets for conducting any meaningful study. In this paper, we address this challenge by collecting the first-ever large-scale Urdu Tweet Dataset for sentiment analysis and emotion recognition. The dataset consists of a staggering number of 1, 140, 821 tweets in the Urdu language. Obviously, manual labeling of such a large number of tweets would have been tedious, error-prone, and humanly impossible; therefore, the paper also proposes a weakly supervised approach to label tweets automatically. Emoticons used within the tweets, in addition to SentiWordNet, are utilized to propose a weakly supervised labeling approach to categorize extracted tweets into positive, negative, and neutral categories. Baseline deep learning models are implemented to compute the accuracy of three labeling approaches, i.e., VADER, TextBlob, and our proposed weakly supervised approach. Unlike the weakly supervised labeling approach, the VADER and TextBlob put most tweets as neutral and show a high correlation between the two. This is largely attributed to the fact that these models do not consider emoticons for assigning polarity.
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Statista (2025). The most spoken languages worldwide 2023 [Dataset]. https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/

The most spoken languages worldwide 2023

Explore at:

411 scholarly articles cite this dataset (View in Google Scholar)

Dataset updated

Jan 23, 2025

Dataset authored and provided by

Statistahttp://statista.com/

Time period covered

2022

Area covered

World

Description

In 2023, there were around 1.5 billion people worldwide who spoke English either natively or as a second language, slightly more than the 1.1 billion Mandarin Chinese speakers at the time of survey. Hindi and Spanish accounted for the third and fourth most widespread languages that year.

Languages in the United States The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation and other official pronouncements. The United States is a land of immigrations and the languages spoken in the United States vary as a result of the multi-cultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which over 41 million people spoke at home in 2021. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese),1.7 million Tagalog speakers and 1.5 million Vietnamese speakers counted in the United States that year.

Different languages at home The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage of population speaking a language other than English is California. About 44 percent of California’s population was speaking a language other than English at home in 2021.

Clear search

Close search

Google apps

Main menu

The most spoken languages worldwide 2023

GlobalPhone German

jampatoisnli

120 Million Word Spanish Corpus

Context:

Content:

Acknowledgements:

Inspiration:

You may also like:

GlobalPhone Portuguese (Brazilian)

Data from: Foreign Language Proficiency Test Data from Three American...

GlobalPhone Chinese-Shanghai

Speech Across Dialects of English: Acoustic Measures from SPADE Project...

GlobalPhone Korean

GLips - German Lipreading Dataset - Dataset - B2FIND

BanglaNLP

IndQNER: Indonesian Benchmark Dataset from the Indonesian Translation of the...

IndQNER

Named Entity Classes

Annotation Stage

Verification Stage

Evaluation

Supervised Learning Setting

Transfer Learning Setting

How to Cite

Contact

Replication Data for: Assessing child development scores among minority and...

VietMed-NER Dataset

HindiMathQuest - Math Problems & Reasoning

Overview:

Key Features:

Dataset Breakdown:

Preprocessing and Quality Control:

Usage:

GlobalPhone Japanese

Toronto Neighborhood Data

Context

Content

Urdu sentiment analysis related work summary.

The most spoken languages worldwide 2023