21 datasets found

SwitchLingua_text
huggingface.co
Updated May 28, 2025
Cite
Peng Xie (2025). SwitchLingua_text [Dataset]. https://huggingface.co/datasets/Shelton1013/SwitchLingua_text
Dataset updated
May 28, 2025
Authors
Peng Xie
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Dataset Card for SwitchLingua_text

Dataset Summary

SwitchLingua is a comprehensive multilingual and multicultural code-switching dataset designed to advance research in automatic speech recognition, natural language processing, and conversational AI. The textual data for SwitchLingua was first generated using the proposed LinguaMaster framework, and the audio data was recorded by 174 bilingual speakers from diverse linguistic and cultural backgrounds to ensure high… See the full description on the dataset page: https://huggingface.co/datasets/Shelton1013/SwitchLingua_text.
TC-STAR Bilingual Expressive Speech Database
catalogue.elra.info
live.european-language-grid.eu
Updated Dec 21, 2010
Cite
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2010). TC-STAR Bilingual Expressive Speech Database [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0313/
Dataset updated
Dec 21, 2010
Dataset provided by
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
ELRA (European Language Resources Association)
License
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
Description
8 hours of speech as spoken by 2 female speakers and 2 male speakers for each language (English and Spanish).
ml_spoken_words
huggingface.co
Updated Jun 25, 2024
Cite
MLCommons (2024). ml_spoken_words [Dataset]. https://huggingface.co/datasets/MLCommons/ml_spoken_words
Dataset updated
Jun 25, 2024
Dataset authored and provided by
MLCommons
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Multilingual Spoken Words Corpus is a large and growing audio dataset of spoken words in 50 languages collectively spoken by over 5 billion people, for academic research and commercial applications in keyword spotting and spoken term search, licensed under CC-BY 4.0. The dataset contains more than 340,000 keywords, totaling 23.4 million 1-second spoken examples (over 6,000 hours). The dataset has many use cases, ranging from voice-enabled consumer devices to call center automation. This dataset is generated by applying forced alignment on crowd-sourced sentence-level audio to produce per-word timing estimates for extraction. All alignments are included in the dataset.
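
The size figures quoted above can be cross-checked with simple arithmetic. The sketch below assumes every example lasts exactly one second, as the description states; the variable names are illustrative only.

```python
# Figures quoted in the dataset description.
num_clips = 23_400_000   # 1-second spoken examples
clip_seconds = 1         # nominal duration of each example

total_hours = num_clips * clip_seconds / 3600
print(f"{total_hours:,.0f} hours")  # consistent with "over 6,000 hours"
```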
Table_1_Expressing diminutive meaning in heritage Spanish: linking the...
frontiersin.figshare.com
figshare.com
xlsx
Updated Jul 8, 2024
Cite
Abel Cruz (2024). Table_1_Expressing diminutive meaning in heritage Spanish: linking the heritage experience to diminutive use in everyday speech.XLSX [Dataset]. http://doi.org/10.3389/flang.2024.1377977.s001
Available download formats: xlsx
Unique identifier
https://doi.org/10.3389/flang.2024.1377977.s001
Dataset updated
Jul 8, 2024
Dataset provided by
Frontiers
Authors
Abel Cruz
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Introduction

This paper studies the pragmatic force that heritage speakers may convey through the use of the diminutive in everyday speech. In particular, I analyze the use of the Spanish diminutive in 49 sociolinguistic interviews from a Spanish–English bilingual community in Southern Arizona, U.S. where Spanish is the heritage language. I compare the use of the diminutive in heritage Spanish to the distribution of the diminutive in the speech of a Spanish monolingual community (18 sociolinguistic interviews) from the same dialectal region. Although Spanish and English employ different morphosyntactic strategies to express diminutive meaning, the analysis reveals that the diminutive morpheme -ito/a is a productive morphological device in the Spanish-discourse of heritage speakers from Southern Arizona (i.e., similar diminutive distributions to their monolingual counterparts). While heritage speakers employed the diminutive -ito/a to express the notion of “smallness” in their Spanish-discourse, the analysis indicates that these language users are more likely to invoke a subjective evaluation through the diminutive -ito/a when talking about their family members and/or childhood experiences. This particular finding suggests that the concept “child” is the semantic/pragmatic driving force of the diminutive in heritage Spanish as a marker of speech by, about, to, or with some relation to children. The analysis further suggests that examining the pragmatic dimensions of the diminutive in everyday speech can provide important insights into how heritage speakers encode and create cultural meaning in their heritage languages.

Methods

In this study, I analyze the use of Spanish diminutives in two U.S.-Mexico border regions. The first data set is representative of a Spanish–English bilingual community in Southern Arizona, U.S., provided in the Corpus del Español en el Sur de Arizona (The CESA Corpus). The CESA Corpus comprises 49 sociolinguistic interviews of ~1 h each for a total of ~305,542 words. The second data set comprises 18 sociolinguistic interviews of predominantly monolingual Spanish speakers from the city of Mexicali, Baja California in Mexico, provided in the Proyecto Para el Estudio Sociolingüístico del Español de España y de América (PRESEEA). The Mexicali data set consists of ~119,162 words.

Results

The analysis revealed that the Spanish diminutive morpheme -ito/a is a productive morphological device in the Spanish-discourse of heritage speakers from Southern Arizona. In addition to its prototypical meaning (i.e., the notion of “smallness”), the diminutive morpheme -ito/a conveyed an array of pragmatic functions in the everyday speech of Spanish heritage speakers and their monolingual counterparts from the same dialectal region. Importantly, these pragmatic functions are mediated by speakers' subjective perceptions of the entity in question. Unlike their monolingual counterparts, heritage speakers are more likely to invoke a subjective evaluation through the diminutive -ito/a when talking about their family members and/or childhood experiences. Altogether, the study suggests that the concept “child” is the semantic/pragmatic driving force of the diminutive in heritage Spanish as a marker of speech by, about, to, or with some relation to children.

Discussion

In this study, I followed Reynoso's framework to study the pragmatic dimensions of the diminutive in everyday speech, that is, speakers' publicly conveyed meaning. The analysis revealed that heritage speakers applied most of the pragmatic functions and their respective values observed in Reynoso's cross-dialectal study of Spanish diminutives, and hence providing further support for her framework. Similarly, the study provides further evidence to Jurafsky's proposal that morphological diminutives arise from semantic or pragmatic links with children. Finally, the analysis indicated that examining the semantic/pragmatic dimensions of the diminutive in everyday speech can provide important insights into how heritage speakers encode and create cultural meaning in their heritage languages, which can in turn have further ramifications for heritage language learning and teaching.
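
Because the two corpora differ in size, raw counts from them are not directly comparable. The sketch below uses only the interview and word totals quoted above to compute the mean interview length, and adds a hypothetical helper, `per_10k`, for normalizing a count by corpus size. The source does not state which normalization the author used, so this is an illustration rather than the study's method.

```python
# Interview and word totals as quoted in the Methods paragraph.
corpora = {
    "CESA (Southern Arizona)": {"interviews": 49, "words": 305_542},
    "PRESEEA (Mexicali)": {"interviews": 18, "words": 119_162},
}

def per_10k(count: int, corpus_words: int) -> float:
    """Hypothetical helper: occurrences per 10,000 words of corpus text."""
    return count * 10_000 / corpus_words

for name, c in corpora.items():
    mean_words = c["words"] / c["interviews"]
    print(f"{name}: {mean_words:.0f} words per interview on average")
```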
Collins Multilingual database (MLD) – WordBank with audio files
catalogue.elra.info
live.european-language-grid.eu
Updated Nov 18, 2016
Cite
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2016). Collins Multilingual database (MLD) – WordBank with audio files [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0382/
Dataset updated
Nov 18, 2016
Dataset provided by
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
ELRA (European Language Resources Association)
License
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
Description
The Collins Multilingual database covers Real Life Daily vocabulary. It is composed of a multilingual lexicon in 32 languages (the WordBank, see ELRA-T0376) and a multilingual set of sentences in 28 languages (the PhraseBank, see ELRA-T0377). This version includes the corresponding audio files covering 26 languages of the 32 languages available in the Collins MLD Wordbank: Arabic, Chinese, Croatian, Czech, Danish, Dutch, American English, British English, Finnish, French, German, Greek, Italian, Japanese, Korean, Norwegian, Polish, Portuguese (Iberian), Portuguese (Brazilian), Russian, Spanish (Iberian), Spanish (Latin American), Swedish, Thai, Turkish, Vietnamese. The WordBank contains 10,000 words for each language, XML-annotated for part-of-speech, gender, irregular forms and disambiguating information for homographs. An additional dataset of 10,000 headwords is included for 12 languages (Chinese, American and British English, French, German, Italian, Japanese, Korean, Iberian and Brazilian Portuguese, Iberian and Latin American Spanish). The full database contains 10,000 audio files for each language (26 languages), and 10,000 additional audio files corresponding to the 10,000 additional headwords in 12 languages. Audio was recorded by native speakers.
A Gold Standard Corpus for Activity Information (GoSCAI)
zenodo.org
Updated May 30, 2025
Cite
Zenodo (2025). A Gold Standard Corpus for Activity Information (GoSCAI) [Dataset]. http://doi.org/10.5281/zenodo.15528545
Unique identifier
https://doi.org/10.5281/zenodo.15528545
Dataset updated
May 30, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Description
A Gold Standard Corpus for Activity Information

Dataset Title: A Gold Standard Corpus for Activity Information (GoSCAI)

Dataset Curators: The Epidemiology & Biostatistics Section of the NIH Clinical Center Rehabilitation Medicine Department

Dataset Version: 1.0 (May 16, 2025)

Dataset Citation and DOI: NIH CC RMD Epidemiology & Biostatistics Section. (2025). A Gold Standard Corpus for Activity Information (GoSCAI) [Data set]. Zenodo. doi: 10.5281/zenodo.15528545

EXECUTIVE SUMMARY

This data statement is for a gold standard corpus of de-identified clinical notes that have been annotated for human functioning information based on the framework of the WHO's International Classification of Functioning, Disability and Health (ICF). The corpus includes 484 notes from a single institution within the United States written in English in a clinical setting. This dataset was curated for the purpose of training natural language processing models to automatically identify, extract, and classify information on human functioning at the whole-person, or activity, level.

CURATION RATIONALE

This dataset is curated to be a publicly available resource for the development and evaluation of methods for the automatic extraction and classification of activity-level functioning information as defined in the ICF. The goals of data curation are to 1) create a corpus of a size that can be manually deidentified and annotated, 2) maximize the density and diversity of functioning information of interest, and 3) allow public dissemination of the data.

LANGUAGE VARIETIES

Language Region: en-US

Prose Description: English as written by native and bilingual English speakers in a clinical setting

LANGUAGE USER DEMOGRAPHIC

The language users represented in this dataset are medical and clinical professionals who work in a research hospital setting. These individuals hold professional degrees corresponding to their respective specialties. Specific demographic characteristics of the language users such as age, gender, or race/ethnicity were not collected.

ANNOTATOR DEMOGRAPHIC

The annotator group consisted of five people, 33 to 76 years old, including four females and one male. Socioeconomically, they came from the middle and upper-middle income classes. Regarding first language, three annotators had English as their first language, one had Chinese, and one had Spanish. Proficiency in English, the language of the data being annotated, was native for three of the annotators and bilingual for the other two. The annotation team included clinical rehabilitation domain experts with backgrounds in occupational therapy, physical therapy, and individuals with public health and data science expertise. Prior to annotation, all annotators were trained on the specific annotation process using established guidelines for the given domain, and annotators were required to achieve a specified proficiency level prior to annotating notes in this corpus.

LINGUISTIC SITUATION AND TEXT CHARACTERISTICS

The notes in the dataset were written as part of clinical care within a U.S. research hospital between May 2008 and November 2019. These notes were written by health professionals asynchronously following the patient encounter to document the interaction and support continuity of care. The intended audience of these notes was clinicians involved in the patients' care. The included notes come from nine disciplines: neuropsychology, occupational therapy, physical medicine (physiatry), physical therapy, psychiatry, recreational therapy, social work, speech language pathology, and vocational rehabilitation. The notes were curated to support research on natural language processing for functioning information between 2018 and 2024.

PREPROCESSING AND DATA FORMATTING

The final corpus was derived from a set of clinical notes extracted from the hospital electronic medical record (EMR) for the purpose of clinical research. The original data consist of character-based digital content. We work in ASCII 8 or Unicode encoding, and therefore part of our pre-processing includes running encoding detection and transformation from encodings such as Windows-1252 or ISO-8859 to our preferred format.
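
The curators do not describe the tooling used for this step. As a minimal sketch of the idea, the function below (a hypothetical `to_unicode`) tries a fixed list of candidate encodings in order instead of running statistical detection: UTF-8 first, then Windows-1252, then ISO-8859-1 as a last resort.

```python
def to_unicode(raw: bytes) -> str:
    """Decode raw bytes to a Python (Unicode) string.

    Assumption for illustration: a try-in-order fallback, not the
    curators' actual encoding-detection pipeline.
    """
    for encoding in ("utf-8", "cp1252"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # ISO-8859-1 maps every byte to a code point, so this cannot fail.
    return raw.decode("iso-8859-1")


# A Windows-1252 curly apostrophe (byte 0x92) is invalid UTF-8,
# so the first attempt fails and the cp1252 attempt succeeds.
sample = "patient’s gait".encode("cp1252")
print(to_unicode(sample))
```

Trying UTF-8 first is a common ordering because text in a legacy single-byte encoding rarely forms valid UTF-8 by accident, whereas Windows-1252 accepts almost any byte sequence.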

On the larger corpus, we applied sampling to match our curation rationale. Given the resource constraints of manual annotation, we set out to create a dataset of 500 clinical notes, which would exclude notes over 10,000 characters in length.

To promote density and diversity, we used five note characteristics as sampling criteria. The first is text length, expressed as the number of characters. The second is the discipline group, which is derived from note type metadata and describes which discipline a note originated from: occupational and vocational therapy (OT/VOC), physical therapy (PT), recreation therapy (RT), speech and language pathology (SLP), social work (SW), or miscellaneous (MISC, including psychiatry, neurology and physiatry). These disciplines were selected for collecting the larger corpus because their notes are likely to include functioning information. The remaining three come from existing information extraction tools, which were used to obtain annotation counts in four areas of functioning and provided a note’s annotation count, annotation density (annotation count divided by text length), and domain count (number of domains with at least 1 annotation).
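
The three tool-derived characteristics follow directly from per-domain annotation counts. The sketch below uses a hypothetical `NoteStats` class and invented example values; only the definitions of the three quantities come from the text above.

```python
from dataclasses import dataclass


@dataclass
class NoteStats:
    text_length: int                        # number of characters in the note
    annotations_per_domain: dict[str, int]  # domain name -> annotation count

    @property
    def annotation_count(self) -> int:
        return sum(self.annotations_per_domain.values())

    @property
    def annotation_density(self) -> float:
        # Annotation count divided by text length, as defined above.
        return self.annotation_count / self.text_length

    @property
    def domain_count(self) -> int:
        # Number of domains with at least one annotation.
        return sum(1 for n in self.annotations_per_domain.values() if n >= 1)


# Invented example note, for illustration only.
note = NoteStats(
    text_length=4000,
    annotations_per_domain={
        "Communication & Cognition": 2,
        "Mobility": 12,
        "SCDL": 6,
        "IPIR": 0,
    },
)
print(note.annotation_count, note.annotation_density, note.domain_count)
```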

We used stratified sampling across the 6 discipline groups to ensure discipline diversity in the corpus. Because of low availability, 50 notes were sampled from SLP with relaxed criteria, and 90 notes each from the 5 other discipline groups with stricter criteria. Sampled SLP notes were those with the highest annotation density that had an annotation count of at least 5 and a domain count of at least 2. Other notes were sampled by highest annotation count and lowest text length, with a minimum annotation count of 15 and minimum domain count of 3.
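
The sampling rules above can be sketched as follows. The note representation (a dict with `group`, `text_length`, `annotation_count`, and `domain_count` keys) is hypothetical, and the ordering for the non-SLP groups is an assumption: the text says "highest annotation count and lowest text length" without saying how the two are combined, so this sketch sorts by count first and uses length as a tie-breaker. It also assumes every note has a positive text length.

```python
MAX_CHARS = 10_000  # notes over 10,000 characters are excluded


def sample_notes(notes: list[dict]) -> list[dict]:
    """Stratified sampling across discipline groups (illustrative sketch)."""
    selected = []
    for group in sorted({n["group"] for n in notes}):
        pool = [
            n for n in notes
            if n["group"] == group and n["text_length"] <= MAX_CHARS
        ]
        if group == "SLP":
            # Relaxed criteria, ranked by annotation density.
            pool = [n for n in pool
                    if n["annotation_count"] >= 5 and n["domain_count"] >= 2]
            pool.sort(key=lambda n: n["annotation_count"] / n["text_length"],
                      reverse=True)
            quota = 50
        else:
            # Stricter criteria, ranked by count (desc), then length (asc).
            pool = [n for n in pool
                    if n["annotation_count"] >= 15 and n["domain_count"] >= 3]
            pool.sort(key=lambda n: (-n["annotation_count"], n["text_length"]))
            quota = 90
        selected.extend(pool[:quota])
    return selected
```

With enough qualifying notes in every group, the quotas add up to 50 + 5 × 90 = 500, which matches the stated target size.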

The notes in the resulting sample included certain types of PHI and PII. To prepare for public dissemination, all sensitive or potentially identifying information was manually annotated in the notes and replaced with substituted content to ensure readability and enough context needed for machine learning without exposing any sensitive information. This de-identification effort was manually reviewed to ensure no PII or PHI exposure and correct any resulting readability issues. Notes about pediatric patients were excluded. No attempt was made to sample multiple notes from the same patient. No metadata is provided to group notes other than by note type, discipline, or discipline group. The dataset is not organized beyond the provided metadata, but publications about models trained on this dataset should include information on the train/test splits used.

All notes were sentence-segmented and tokenized using the spaCy en_core_web_lg model with additional rules for sentence segmentation customized to the dataset. Notes are stored in an XML format readable by the GATE annotation software (https://gate.ac.uk/family/developer.html), which stores annotations separately in annotation sets.

CAPTURE QUALITY

As the clinical notes were extracted directly from the EMR in text format, the capture quality was determined to be high. The clinical notes did not have to be converted from other data formats, which means this dataset is free from noise introduced by conversion processes such as optical character recognition.

LIMITATIONS

Because of the effort required to manually deidentify and annotate notes, this corpus is limited in terms of size and representation. The curation decisions skewed note selection towards specific disciplines and note types to increase the likelihood of encountering information on functioning. Some subtypes of functioning occur infrequently in the data, or not at all. The deidentification of notes was done in a manner to preserve natural language as it would occur in the notes, but some information is lost, e.g. on rare diseases.

METADATA

Information on the manual annotation process is provided in the annotation guidelines for each of the four domains:

- Communication & Cognition (https://zenodo.org/records/13910167)

- Mobility (https://zenodo.org/records/11074838)

- Self-Care & Domestic Life (SCDL) (https://zenodo.org/records/11210183)

- Interpersonal Interactions & Relationships (IPIR) (https://zenodo.org/records/13774684)

Inter-annotator agreement was established on development datasets described in the annotation guidelines prior to the annotation of this gold standard corpus.

The gold standard corpus consists of 484 documents, which include 35,147 sentences in total. The distribution of annotated information is provided in the table below.

Domain | Number of Annotated Sentences | % of All Sentences | Mean Number of Annotated Sentences per Document
Communication & Cognition | 6033 | 17.2% | …
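
The percentage in the Communication & Cognition row can be reproduced from the totals stated above (35,147 sentences in the corpus). A minimal check, with illustrative variable names:

```python
total_sentences = 35_147   # sentences in the gold standard corpus
annotated = 6_033          # Communication & Cognition annotated sentences

share = annotated / total_sentences
print(f"{share:.1%}")      # agrees with the 17.2% reported in the table
```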
TC-STAR Bilingual Voice-Conversion Spanish Speech Database
catalogue.elra.info
live.european-language-grid.eu
Updated Dec 21, 2010
Cite
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2010). TC-STAR Bilingual Voice-Conversion Spanish Speech Database [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0311/
Dataset updated
Dec 21, 2010
Dataset provided by
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
ELRA (European Language Resources Association)
License
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
Description
4 hours and 80 minutes of speech as spoken by 2 female speakers and 2 male speakers, covering both mimics and parallel voice conversion data.
multilingual-NLI-26lang-2mil7
huggingface.co
Cite
Moritz Laurer, multilingual-NLI-26lang-2mil7 [Dataset]. https://huggingface.co/datasets/MoritzLaurer/multilingual-NLI-26lang-2mil7
Authors
Moritz Laurer
Description
Datasheet for the dataset: multilingual-NLI-26lang-2mil7

Dataset Summary

This dataset contains 2 730 000 NLI text pairs in 26 languages spoken by more than 4 billion people. The dataset can be used to train models for multilingual NLI (Natural Language Inference) or zero-shot classification. The dataset is based on the English datasets MultiNLI, Fever-NLI, ANLI, LingNLI and WANLI and was created using the latest open-source machine translation models. The dataset is… See the full description on the dataset page: https://huggingface.co/datasets/MoritzLaurer/multilingual-NLI-26lang-2mil7.
Data_Sheet_3_Heritage Speakers as Part of the Native Language Continuum.PDF
frontiersin.figshare.com
pdf
Updated Jun 14, 2023
Cite
Heike Wiese; Artemis Alexiadou; Shanley Allen; Oliver Bunk; Natalia Gagarina; Kateryna Iefremenko; Maria Martynova; Tatiana Pashkova; Vicky Rizou; Christoph Schroeder; Anna Shadrova; Luka Szucsich; Rosemarie Tracy; Wintai Tsehaye; Sabine Zerbian; Yulia Zuban (2023). Data_Sheet_3_Heritage Speakers as Part of the Native Language Continuum.PDF [Dataset]. http://doi.org/10.3389/fpsyg.2021.717973.s003
Available download formats: pdf
Unique identifier
https://doi.org/10.3389/fpsyg.2021.717973.s003
Dataset updated
Jun 14, 2023
Dataset provided by
Frontiers
Authors
Heike Wiese; Artemis Alexiadou; Shanley Allen; Oliver Bunk; Natalia Gagarina; Kateryna Iefremenko; Maria Martynova; Tatiana Pashkova; Vicky Rizou; Christoph Schroeder; Anna Shadrova; Luka Szucsich; Rosemarie Tracy; Wintai Tsehaye; Sabine Zerbian; Yulia Zuban
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We argue for a perspective on bilingual heritage speakers as native speakers of both their languages and present results from a large-scale, cross-linguistic study that took such a perspective and approached bilinguals and monolinguals on equal grounds. We targeted comparable language use in bilingual and monolingual speakers, crucially covering broader repertoires than just formal language. A main database was the open-access RUEG corpus, which covers comparable informal vs. formal and spoken vs. written productions by adolescent and adult bilinguals with heritage-Greek, -Russian, and -Turkish in Germany and the United States and with heritage-German in the United States, and matching data from monolinguals in Germany, the United States, Greece, Russia, and Turkey. Our main results lie in three areas. (1) We found non-canonical patterns not only in bilingual, but also in monolingual speakers, including patterns that have so far been considered absent from native grammars, in domains of morphology, syntax, intonation, and pragmatics. (2) We found a degree of lexical and morphosyntactic inter-speaker variability in monolinguals that was sometimes higher than that of bilinguals, further challenging the model of the streamlined native speaker. (3) In majority language use, non-canonical patterns were dominant in spoken and/or informal registers, and this was true for monolinguals and bilinguals. In some cases, bilingual speakers were leading quantitatively. In heritage settings where the language was not part of formal schooling, we found tendencies of register leveling, presumably due to the fact that speakers had limited access to formal registers of the heritage language. Our findings thus indicate possible quantitative differences and different register distributions rather than distinct grammatical patterns in bilingual and monolingual speakers. This supports the integration of heritage speakers into the native-speaker continuum. 
Approaching heritage speakers from this perspective helps us to better understand the empirical data and can shed light on language variation and change in native grammars. Furthermore, our findings for monolinguals lead us to reconsider the state-of-the art on majority languages, given recurring evidence for non-canonical patterns that deviate from what has been assumed in the literature so far, and might have been attributed to bilingualism had we not included informal and spoken registers in monolinguals and bilinguals alike.
Community Interpreting Database Pilot Corpus (ComInDat)
fdr.uni-hamburg.de
Updated Mar 9, 2010
Cite
Angermeyer, Philipp; Bührig, Kristin; Meyer, Bernd (2010). Community Interpreting Database Pilot Corpus (ComInDat) [Dataset]. http://doi.org/10.25592/uhhfdm.1478
Unique identifier
https://doi.org/10.25592/uhhfdm.1478
Dataset updated
Mar 9, 2010
Dataset provided by
Mainz University
Authors
Angermeyer, Philipp; Bührig, Kristin; Meyer, Bernd
Description
Audio and video recordings of various types of community interpreted discourse (doctor-patient communication, simulated doctor-patient communication, courtroom communication) in German (simulated and authentic doctor-patient communication) and US (courtroom communication) institutions with varying community languages. Video recordings only exist for the simulated communication. For the authentic interpreted doctor-patient communication, no audio files will be made available.

The ComInDat pilot corpus contains sample data from three different projects: the DiK corpus of Portuguese/German and Turkish/German interpreted doctor-patient communication in hospitals (Bührig & Meyer 2004); the IiSCC-corpus, a corpus of interpreted court proceedings in different language constellations (Spanish/English, Russian/English, Haitian Creole/English and Polish/English) (Angermeyer 2006); and a corpus of simulated interpreted doctor-patient interactions in different language constellations (Russian/German, Polish/German and Romanian/German) from a training seminar for bilingual nursing staff ("SimDiK", Bührig, Kliche, Meyer & Pawlack 2012). More information about the background of the corpus and the details of its design can be found in Angermeyer, Meyer & Schmidt (2012). For more information about the project, please contact Philipp Angermeyer.

Angermeyer, P., Meyer, B. and Schmidt, T. (2012). Sharing Community Interpreting Corpora: A pilot study. In: Schmidt, T. and Wörner, K. (eds.) Multilingual Corpora and Multilingual Corpus Analysis. Amsterdam: Benjamins, 275-294.

CLARIN Metadata summary for Community Interpreting Database Pilot Corpus (ComInDat) (CMDI-based)

Title: Community Interpreting Database Pilot Corpus (ComInDat)
Description: Audio and video recordings of various types of community interpreted discourse (doctor-patient communication, simulated doctor-patient communication, courtroom communication) in German (simulated and authentic doctor-patient communication) and US (courtroom communication) institutions with varying community languages. Video recordings only exist for the simulated communication. For the authentic interpreted doctor-patient communication, no audio files will be made available.
Publication date: 2013-06-10
Data owner: Philipp Angermeyer, Department of Languages, Literatures and Linguistics / York University / 4700 Keele Street / Canada M3J 1P3, pangerme@yorku.ca, Kristin Bührig, Institut für Germanistik I / Von-Melle-Park 6 / D-20146 Hamburg, kristin.buehrig@uni-hamburg.de, Bernd Meyer, Arbeitsbereich Interkulturelle Kommunikation / Fachbereich 06: Translations-, Sprach- und Kulturwissenschaft / Johannes Gutenberg-Universität Mainz / An der Hochschule 2 / D-76726 Germersheim, meyerb@uni-mainz.de
Contributors: Philipp Angermeyer, Department of Languages, Literatures and Linguistics / York University / 4700 Keele Street / Canada M3J 1P3, pangerme@yorku.ca (compiler), Kristin Bührig, Institut für Germanistik I / Von-Melle-Park 6 / D-20146 Hamburg, kristin.buehrig@uni-hamburg.de (compiler), Bernd Meyer, Arbeitsbereich Interkulturelle Kommunikation / Fachbereich 06: Translations-, Sprach- und Kulturwissenschaft / Johannes Gutenberg-Universität Mainz / An der Hochschule 2 / D-76726 Germersheim, meyerb@uni-mainz.de (compiler)
Project: The Integration of Text, Sound, and Image into the Corpus-Based Analysis of Interpreter-Mediated Interaction
Keywords: community interpreting, doctor-patient communication, courtroom communication, EXMARaLDA
Languages: German (deu), English (eng), Spanish (spa), Turkish (tur), Polish (pol), Portuguese (por), Romanian (ron), Russian (rus), Haitian (hat)
Size: 54 speakers (35 female, 16 male, 3 unknown), 14 communications, 12 recordings, 83 minutes, 17 transcriptions, 35051 words
Annotation types: transcription (manual): HIAT/CHAT, deu: German translation, eng: English translation, k: free comment, lang: utterance language, sup: suprasegmental information, trans: utterance translation status, akz: accentuation/stress, pol: Polish translation
Temporal Coverage: 1999-07-01/2010-03-09
Spatial Coverage: Hamburg, DE; New York, US; Neumünster, DE
Genre: discourse
Modality: spoken
References: Angermeyer, P., Meyer, B. and Schmidt, T. (2012). Sharing Community Interpreting Corpora: A pilot study. In: Schmidt, T. and Wörner, K. (eds.) Multilingual Corpora and Multilingual Corpus Analysis. Amsterdam: Benjamins, 275-294.
Data from: miracl
huggingface.co
Updated Jun 25, 2024
Cite
SEACrowd (2024). miracl [Dataset]. https://huggingface.co/datasets/SEACrowd/miracl
Dataset updated
Jun 25, 2024
Dataset authored and provided by
SEACrowd
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
MIRACL is a multilingual dataset for ad hoc retrieval across 18 languages that collectively encompass over three billion native speakers around the world. This resource is designed to support monolingual retrieval tasks, where the queries and the corpora are in the same language. In total, we have gathered over 726k high-quality relevance judgments for 78k queries over Wikipedia in these languages, where all annotations have been performed by native speakers. MIRACL covers Indonesian and Thai languages. Before using this dataloader, please accept the acknowledgement at https://huggingface.co/datasets/miracl/miracl and use huggingface-cli login for authentication.
Multilingual_Intelligent_Speech_Dataset
huggingface.co
Updated Sep 12, 2024
Cite
Dataocean AI (2024). Multilingual_Intelligent_Speech_Dataset [Dataset]. https://huggingface.co/datasets/DataoceanAI/Multilingual_Intelligent_Speech_Dataset
Explore at:
Dataset updated
Sep 12, 2024
Authors
Dataocean AI
Description
Specification

This dataset covers over 30 scenarios including sports, entertainment, health, shopping, pet, education, food, travel, and so on. For more details: https://dataoceanai.com/datasets/asr/multilingual-intelligent-speech-dataset/

ID: King-ASR-959
SIZE: 219672 hours
LANGUAGE: Over 100 languages covered
SAMPLE RATE: 16kHz/44.1kHz/48kHz
SPEAKERS: 215,891 People
TTS-Multilingual-Test-Set
huggingface.co
Updated May 27, 2025
Cite
MiniMax (2025). TTS-Multilingual-Test-Set [Dataset]. https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set
Explore at:
Dataset updated
May 27, 2025
Dataset authored and provided by
MiniMax
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Overview

To assess the multilingual zero-shot voice cloning capabilities of TTS models, we have constructed a test set encompassing 24 languages. This dataset provides both audio samples for voice cloning and corresponding test texts. Specifically, the test set for each language includes: 100 distinct test sentences. Audio samples from two speakers (one male and one female) carefully selected from the Mozilla Common Voice (MCV) dataset, intended for voice cloning. Researchers can… See the full description on the dataset page: https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set.

Data from: NeMig - A Bilingual News Collection and Knowledge Graph about...

data.niaid.nih.gov
zenodo.org

Updated May 9, 2023

+ more versions

Cite

Iana, Andreea (2023). NeMig - A Bilingual News Collection and Knowledge Graph about Migration [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7442424

Explore at:

Dataset updated

May 9, 2023

Dataset provided by

Nikolajevic, Nevena
Paulheim, Heiko
Iana, Andreea
Grote, Alexander
Weinhardt, Christof
Müller, Philipp
Ludwig, Katharina
Alam, Mehwish

Description

NeMig is a bilingual news collection with accompanying knowledge graphs on the topic of migration. The news corpora in German and English were collected from online media outlets in Germany and the US, respectively. NeMig contains rich textual and metadata information, sentiment and political orientation annotations, as well as named entities extracted from the articles' content and metadata and linked to Wikidata. The corresponding knowledge graphs (NeMigKG) built from each corpus are expanded with up to two-hop Wikidata neighbors of the initial set of linked entities.

NeMigKG comes in four flavors for both the German and the English corpora:

Base NeMigKG: contains literals and entities from the corresponding annotated news corpus;

Entities NeMigKG: derived from the Base NeMigKG by removing all literal nodes, so it contains only resource nodes;

Enriched Entities NeMigKG: derived from the Entities NeMigKG by enriching it with up to two-hop neighbors from Wikidata; it contains only resource nodes and Wikidata triples;

Complete NeMigKG: the combination of the Base and Enriched Entities NeMigKG; it contains both literals and resources.

Information about uploaded files:

(All files are bzip2-compressed and in the N-Triples format.)

A description of the NeMigKG files is provided in the table below:

NeMigKG Files Description

nemig_${language}_ ${graph_type}-metadata.nt.bz2: Metadata about the dataset, described using the VoID vocabulary.

nemig_${language}_ ${graph_type}-instances_types.nt.bz2: Class definitions of news and event instances.

nemig_${language}_ ${graph_type}-instances_labels.nt.bz2: Labels of instances.

nemig_${language}_ ${graph_type}-instances_related.nt.bz2: Relations between news instances based on one another.

nemig_${language}_ ${graph_type}-instances_metadata_literals.nt.bz2: Relations between news instances and metadata literals (e.g. URL, publishing date, modification date, sentiment label, political orientation of news outlets).

nemig_${language}_ ${graph_type}-instances_content_mapping.nt.bz2: Mapping of news instances to content instances (e.g. title, abstract, body).

nemig_${language}_ ${graph_type}-instances_topic_mapping.nt.bz2: Mapping of news instances to sub-topic instances.

nemig_${language}_ ${graph_type}-instances_sentiment_mapping.nt.bz2: Mapping of news instances to sentiment classes.

nemig_${language}_ ${graph_type}-instances_political_orientation_mapping.nt.bz2: Mapping of news outlets instances to political orientation classes.

nemig_${language}_ ${graph_type}-instances_content_literals.nt.bz2: Relations between content instances and corresponding literals (e.g. text of title, abstract, body).

nemig_${language}_ ${graph_type}-instances_sentiment_polorient_literals.nt.bz2: Relations between instances and corresponding sentiment or political orientation literals.

nemig_${language}_ ${graph_type}-instances_metadata_resources.nt.bz2: Relations between news or sub-topic instances and entities extracted from metadata (i.e. publishers, authors, keywords).

nemig_${language}_ ${graph_type}-instances_event_mapping.nt.bz2: Mapping of news instances to event instances.

nemig_${language}_ ${graph_type}-event_resources.nt.bz2: Relations between event instances and entities extracted from the text of the news (i.e. actors, places, mentions).

nemig_${language}_ ${graph_type}-resources_provenance.nt.bz2: Provenance information about the entities extracted from the text of the news (e.g. title, abstract, body).

nemig_${language}_ ${graph_type}-wiki_resources.nt.bz2: Relations between Wikidata entities from news and their k-hop entity neighbors from Wikidata.
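
As a rough illustration of how one of the bzip2-compressed N-Triples files listed above might be read, the sketch below streams triples using only the Python standard library. The helper name iter_triples and the naive whitespace-based splitting are assumptions made for illustration and are not part of the NeMig release; a full N-Triples parser from an RDF library would handle edge cases more robustly.

```python
import bz2
from typing import Iterator, Tuple


def iter_triples(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (subject, predicate, object) strings from a .nt.bz2 file.

    Hypothetical helper: assumes one triple per line, terminated by " .".
    """
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines and comment lines.
            if not line or line.startswith("#"):
                continue
            # Drop the statement terminator before splitting.
            if line.endswith("."):
                line = line[:-1].rstrip()
            # Subject and predicate contain no whitespace; the object
            # (possibly a literal with spaces) is everything that remains.
            subject, predicate, obj = line.split(None, 2)
            yield subject, predicate, obj
```

Because the function is a generator, large graph files can be scanned without loading them fully into memory.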

The corresponding user data has been collected through online studies in Germany and the US. We used the participants' implicit feedback regarding their interest in an article to build their click history, and the explicit feedback in terms of news click behaviors to construct the impression logs. To protect user privacy, we assign each user an anonymized ID.

The German and English user datasets are zip-compressed folders, which contain two files each.

NeMig User Dataset File Description

behaviors.tsv: The click history and impression logs of users.

demographics_politics.tsv: Demographic and political information of users.

The behaviors.tsv file contains the users' news click histories and the impression logs. It has four tab-separated columns:

Impression ID: The ID of an impression.

User ID: The anonymized ID of a user.

Click History: The news click history (list of news IDs) of a user before an impression.

Impression Log: List of news displayed to the user in a session and the user's click behavior on them (1 for click, 0 for non-click).
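
The four-column layout described above could be parsed along the following lines. This is a minimal sketch under stated assumptions: the description does not specify how the lists are encoded, so the code assumes space-separated news IDs in Click History, space-separated `newsID-label` pairs in Impression Log, and no header row. The Impression class, the parse_behaviors function, and the sample IDs are hypothetical.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple


@dataclass
class Impression:
    impression_id: str
    user_id: str
    click_history: List[str]              # news IDs clicked before the impression
    impression_log: List[Tuple[str, int]]  # (news ID, 1 = click / 0 = non-click)


def parse_behaviors(lines: Iterable[str]) -> Iterator[Impression]:
    """Parse tab-separated behaviors rows (list encodings are ASSUMED)."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 4:
            continue  # skip rows that do not have exactly four columns
        impression_id, user_id, history, log = row
        pairs = []
        for item in log.split():
            # Assumed encoding: "<newsID>-<label>", split on the last hyphen.
            news_id, _, label = item.rpartition("-")
            pairs.append((news_id, int(label)))
        yield Impression(impression_id, user_id, history.split(), pairs)


# Hypothetical sample row, not taken from the dataset.
sample = "1\tU42\tN10 N11\tN20-1 N21-0\n"
records = list(parse_behaviors(io.StringIO(sample)))
```

If the actual files encode the lists differently, only the two split steps inside parse_behaviors would need to change.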

The demographics_politics.tsv file contains detailed information about the users' demographics and political interests. It has columns divided by the tab symbol. An explanation of all the columns and the questions used in the online studies to collect this information is shown in the table below.

Demographic and political user data description

Demographics

Gender
    Question in German study: Bitte geben Sie Ihr Geschlecht an
    Scale in German: 0 = männlich; 1 = weiblich; 2 = divers; 3 = Keine Angabe
    Question in English study: Please indicate your gender.
    Scale in English: 0 = male; 1 = female; 2 = other; 3 = no answer

Age
    Question in German study: Bitte geben Sie Ihr Alter an
    Scale in German: 1-120
    Question in English study: Please indicate your age.
    Scale in English: 1-120

Qualification
    Question in German study: Welches ist Ihr höchster Bildungsabschluss?
    Scale in German: 0 = Kein Schulabschluss; 1 = Haupt-/Gesamtschulabschluss; 2 = Realschulabschluss, Mittlere Reife, Fachschulreife; 3 = Fachhochschulreife, Abitur; 4 = Studium mit Abschluss; 5 = Promotion; 6 = Keine Angabe
    Question in English study: Please indicate your highest educational qualification.
    Scale in English: 0 = less than high school; 1 = high school/GED; 2 = Vo-tech/business school; 3 = some college; 4 = college degree; 5 = university degree; 6 = doctoral degree; 7 = no answer

Nationality
    Question in German study: Welche Staatsangehörigkeit besitzen Sie?
    Scale in German: 0 = Nur die deutsche Staatsangehörigkeit; 1 = Die deutsche und eine andere Staatsangehörigkeit; 2 = Nur eine andere Staatsangehörigkeit; 3 = Keine Angabe
    Question in English study: What is your citizenship?
    Scale in English: 0 = U.S. citizenship; 1 = U.S. and another non-U.S. citizenship; 2 = Only non-U.S. citizenship; 3 = No Answer

BornIn
    Question in German study: Sind Sie in Deutschland geboren?
    Scale in German: 0 = Ja; 1 = Nein; 2 = Keine Angabe
    Question in English study: Were you born in the U.S.?
    Scale in English: 0 = Yes; 1 = No; 2 = No answer

ParentsBornIn
    Question in German study: Sind Ihre Eltern in Deutschland geboren?
    Scale in German: 0 = Mein Vater und meine Mutter sind beide in Deutschland geboren; 1 = Mein Vater ist in Deutschland geboren, meine Mutter nicht; 2 = Meine Mutter ist in Deutschland geboren, mein Vater nicht; 3 = Weder meine Mutter noch mein Vater sind in Deutschland geboren; 4 = Keine Angabe
    Question in English study: Were your parents born in the U.S.?
    Scale in English: 0 = My father and my mother were both born in the U.S.; 1 = My father was born in the U.S., my mother was not; 2 = My mother was born in the U.S., my father was not; 3 = Neither my mother nor my father were born in the U.S.; 4 = No answer

Income
    Question in German study: Was ist Ihr persönliches monatliches Nettoeinkommen (nach Abzug der Steuern)? Bitte geben Sie eine ungefähre Schätzung an, falls Sie die genaue Zahl nicht kennen.
    Scale in German: 0 = Weniger als 1000 €; 1 = 1001 € bis 2000 €; 2 = 2001 € bis 3000 €; 3 = 3001 € bis 4000 €; 4 = 4001 € bis 5000 €; 5 = Mehr als 5000 €; 6 = Keine Angabe
    Question in English study: What is your personal monthly net income (after taxes)? Please give an approximate estimation in case you are unsure.
    Scale in English: 0 = Less than 1000 $; 1 = 1001 $ to 2000 $; 2 = 2001 $ to 3000 $; 3 = 3001 $ to 4000 $; 4 = 4001 $ to 5000 $; 5 = More than 5000 $; 6 = No Answer

Empathy
    Question in German study: Wie sehr stimmen Sie den folgenden Aussagen zu?
    Scale in German: 7-point Likert scale (1 = Trifft überhaupt nicht zu; 7 = Trifft voll und ganz zu)
    Question in English study: How strongly do you agree with the following statements?
    Scale in English: 7-point Likert scale (1 = Strongly disagree; 7 = Strongly agree)

EMP1
    Question in German study: Wenn jemand anderes erfreut ist, tendiere ich dazu auch erfreut zu sein.
    Question in English study: When someone else is feeling excited, I tend to get excited too.

EMP2
    Question in German study: Es regt mich auf, wenn jemand respektlos behandelt wird.
    Question in English study: It upsets me to see someone being treated disrespectfully.

EMP3
    Question in German study: Es macht mir Freude, andere aufzumuntern.
    Question in English study: I enjoy making other people feel better.

EMP4
    Question in German study: Ich bin besorgt um Personen, die weniger Glück haben als ich.
    Question in English study: I have tender, concerned feelings for people less fortunate than me.

EMP5
    Question in German study: Ich fühle, wenn andere traurig sind, selbst wenn sie nichts sagen.
    Question in English study: I can tell when others are sad even when they do not say anything.

EMP6
    Question in German study: Meistens bin ich mit den Stimmungen anderer Leute im Einklang.
    Question in English study: I find that I am “in tune” with other people’s moods.

EMP7
    Question in German study: Ich empfinde einen starken Drang zu helfen, wenn ich jemanden sehe, der aufgebracht ist.
    Question in English study: I get a strong urge to help when I see someone who is upset.

EMP8
    Question in German study: Wenn ich jemanden sehe, der ausgenutzt wird, möchte ich die Person beschützen.
    Question in English study: When I see someone being taken advantage of, I feel kind of protective towards him/her.

Big5
    Question in German study: Ich

miracl-corpus
huggingface.co
Updated Sep 25, 2022
Cite
MIRACL (2022). miracl-corpus [Dataset]. https://huggingface.co/datasets/miracl/miracl-corpus
Explore at:
Dataset updated
Sep 25, 2022
Dataset authored and provided by
MIRACL
License
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset Card for MIRACL Corpus

MIRACL 🌍🙌🌏 (Multilingual Information Retrieval Across a Continuum of Languages) is a multilingual retrieval dataset that focuses on search across 18 different languages, which collectively encompass over three billion native speakers around the world. This dataset contains the collection data of the 16 "known languages". The remaining 2 "surprise languages" will not be released until later. The corpus for each language is prepared from a Wikipedia… See the full description on the dataset page: https://huggingface.co/datasets/miracl/miracl-corpus.
test_4
huggingface.co
Updated Oct 27, 2024
+ more versions
Cite
nguyen tuan (2024). test_4 [Dataset]. https://huggingface.co/datasets/tunngo/test_4
Explore at:
Dataset updated
Oct 27, 2024
Authors
nguyen tuan
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
FLEURS

Fleurs is the speech version of the FLoRes machine translation benchmark. We use 2009 n-way parallel sentences from the FLoRes dev and devtest publicly available sets, in 102 languages. Training sets have around 10 hours of supervision. Speakers of the train sets are different than speakers from the dev/test sets. Multilingual fine-tuning is used and "unit error rate" (characters, signs) of all languages is averaged. Languages and results are also grouped into seven… See the full description on the dataset page: https://huggingface.co/datasets/tunngo/test_4.
IndicVoices
huggingface.co
Updated Mar 5, 2025
Cite
AI4Bharat (2025). IndicVoices [Dataset]. https://huggingface.co/datasets/ai4bharat/IndicVoices
Explore at:
Dataset updated
Mar 5, 2025
Dataset authored and provided by
AI4Bharat
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
IndicVoices: Towards building an Inclusive Multilingual Speech Dataset for Indian Languages

Overview

INDICVOICES is a dataset of natural and spontaneous speech containing a total of 19550 hours of read (8%), extempore (76%) and conversational (15%) audio from 29K speakers covering 400+ Indian districts and 22 languages. Of these 19550 hours, 9200 hours have already been transcribed. Through this paper, we share our journey of capturing the cultural… See the full description on the dataset page: https://huggingface.co/datasets/ai4bharat/IndicVoices.
mgsm_50
huggingface.co
Updated Mar 31, 2025
Cite
Ira Salsabila (2025). mgsm_50 [Dataset]. https://huggingface.co/datasets/irasalsabila/mgsm_50
Explore at:
Dataset updated
Mar 31, 2025
Authors
Ira Salsabila
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Multilingual MGSM Dataset (Indonesian, Javanese, Sundanese)

This dataset is a multilingual extension of the MGSM dataset, translated into Indonesian (id), Javanese (jv), and Sundanese (su). The original English math word problems were translated with the following modifications:

Western names were replaced with culturally appropriate local names. All numerical values and logic were preserved. Native speakers reviewed translations for correctness and naturalness.

Each language is… See the full description on the dataset page: https://huggingface.co/datasets/irasalsabila/mgsm_50.
ifeval_mt
huggingface.co
Updated Oct 2, 2024
Cite
LumiOpen (2024). ifeval_mt [Dataset]. https://huggingface.co/datasets/LumiOpen/ifeval_mt
Explore at:
Dataset updated
Oct 2, 2024
Dataset authored and provided by
LumiOpen
License
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
IFEval Multilingual

These are machine-translated versions of Instruction Following Evaluation (IFEval). We will do our best to correct the translations. Translations were done using DeepL and the translations were reviewed and corrected by native speakers. We use this dataset in our fork of LM Eval Harness that supports multilingual ifeval.

Supported languages

Finnish: machine-translated, manually corrected
Swedish: machine-translated but not corrected
xor_tydi_qa
huggingface.co
opendatalab.com
Updated Jan 23, 2024
Cite
Akari Asai (2024). xor_tydi_qa [Dataset]. https://huggingface.co/datasets/akariasai/xor_tydi_qa
Explore at:
Dataset updated
Jan 23, 2024
Authors
Akari Asai
License
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Description
XOR-TyDi QA brings together for the first time information-seeking questions, open-retrieval QA, and multilingual QA to create a multilingual open-retrieval QA dataset that enables cross-lingual answer retrieval. It consists of questions written by information-seeking native speakers in 7 typologically diverse languages and answer annotations that are retrieved from multilingual document collections. There are three sub-tasks: XOR-Retrieve, XOR-EnglishSpan, and XOR-Full.

Domain	Number of Annotated Sentences	% of All Sentences	Mean Number of Annotated Sentences per Document
Communication & Cognition	6033	17.2%
