Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for SwitchLingua_text
Dataset Summary
SwitchLingua is a comprehensive multilingual and multicultural code-switching dataset designed to advance research in automatic speech recognition, natural language processing, and conversational AI. The textual data for SwitchLingua was first generated using the proposed LinguaMaster framework, and the audio data was recorded by 174 bilingual speakers from diverse linguistic and cultural backgrounds to ensure high… See the full description on the dataset page: https://huggingface.co/datasets/Shelton1013/SwitchLingua_text.
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
8 hours of speech as spoken by 2 female speakers and 2 male speakers for each language (English and Spanish).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Multilingual Spoken Words Corpus is a large and growing audio dataset of spoken words in 50 languages collectively spoken by over 5 billion people, for academic research and commercial applications in keyword spotting and spoken term search, licensed under CC-BY 4.0. The dataset contains more than 340,000 keywords, totaling 23.4 million 1-second spoken examples (over 6,000 hours). The dataset has many use cases, ranging from voice-enabled consumer devices to call center automation. This dataset is generated by applying forced alignment on crowd-sourced sentence-level audio to produce per-word timing estimates for extraction. All alignments are included in the dataset.
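The extraction step described above can be sketched roughly as follows: given a word's aligned start and end times, cut a fixed 1-second window from the sentence-level clip. The centring policy, the clamping behaviour, and the name `one_second_window` are assumptions made for illustration; the description does not say how the window is placed around the word.

```python
def one_second_window(word_start, word_end, clip_duration, window=1.0):
    """Return (start, end) of a `window`-second span centred on an aligned word.

    All times are in seconds. When the window would run past either end of
    the source clip it is shifted, not shrunk (an assumed policy).
    """
    if clip_duration < window:
        raise ValueError("source clip is shorter than the window")
    centre = (word_start + word_end) / 2.0
    start = centre - window / 2.0
    # Clamp so the whole window stays inside [0, clip_duration].
    start = max(0.0, min(start, clip_duration - window))
    return start, start + window


# Consistency check on the stated totals: 23.4 million 1-second examples,
# converted to hours, should exceed the quoted 6,000 hours.
total_hours = 23_400_000 * 1.0 / 3600
```

With these figures `total_hours` comes to 6,500, which agrees with the "over 6,000 hours" quoted in the summary.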
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
This paper studies the pragmatic force that heritage speakers may convey through the use of the diminutive in everyday speech. In particular, I analyze the use of the Spanish diminutive in 49 sociolinguistic interviews from a Spanish–English bilingual community in Southern Arizona, U.S. where Spanish is the heritage language. I compare the use of the diminutive in heritage Spanish to the distribution of the diminutive in the speech of a Spanish monolingual community (18 sociolinguistic interviews) from the same dialectal region. Although Spanish and English employ different morphosyntactic strategies to express diminutive meaning, the analysis reveals that the diminutive morpheme -ito/a is a productive morphological device in the Spanish-discourse of heritage speakers from Southern Arizona (i.e., similar diminutive distributions to their monolingual counterparts). While heritage speakers employed the diminutive -ito/a to express the notion of “smallness” in their Spanish-discourse, the analysis indicates that these language users are more likely to invoke a subjective evaluation through the diminutive -ito/a when talking about their family members and/or childhood experiences. This particular finding suggests that the concept “child” is the semantic/pragmatic driving force of the diminutive in heritage Spanish as a marker of speech by, about, to, or with some relation to children. The analysis further suggests that examining the pragmatic dimensions of the diminutive in everyday speech can provide important insights into how heritage speakers encode and create cultural meaning in their heritage languages.
Methods
In this study, I analyze the use of Spanish diminutives in two U.S.-Mexico border regions. The first data set is representative of a Spanish–English bilingual community in Southern Arizona, U.S., provided in the Corpus del Español en el Sur de Arizona (The CESA Corpus). The CESA Corpus comprises 49 sociolinguistic interviews of ~1 h each for a total of ~305,542 words. The second data set comprises 18 sociolinguistic interviews of predominantly monolingual Spanish speakers from the city of Mexicali, Baja California in Mexico, provided in the Proyecto Para el Estudio Sociolingüístico del Español de España y de América (PRESEEA). The Mexicali data set consists of ~119,162 words.
Results
The analysis revealed that the Spanish diminutive morpheme -ito/a is a productive morphological device in the Spanish-discourse of heritage speakers from Southern Arizona. In addition to its prototypical meaning (i.e., the notion of “smallness”), the diminutive morpheme -ito/a conveyed an array of pragmatic functions in the everyday speech of Spanish heritage speakers and their monolingual counterparts from the same dialectal region. Importantly, these pragmatic functions are mediated by speakers' subjective perceptions of the entity in question. Unlike their monolingual counterparts, heritage speakers are more likely to invoke a subjective evaluation through the diminutive -ito/a when talking about their family members and/or childhood experiences. Altogether, the study suggests that the concept “child” is the semantic/pragmatic driving force of the diminutive in heritage Spanish as a marker of speech by, about, to, or with some relation to children.
Discussion
In this study, I followed Reynoso's framework to study the pragmatic dimensions of the diminutive in everyday speech, that is, speakers' publicly conveyed meaning. The analysis revealed that heritage speakers applied most of the pragmatic functions and their respective values observed in Reynoso's cross-dialectal study of Spanish diminutives, hence providing further support for her framework. Similarly, the study provides further evidence for Jurafsky's proposal that morphological diminutives arise from semantic or pragmatic links with children. Finally, the analysis indicated that examining the semantic/pragmatic dimensions of the diminutive in everyday speech can provide important insights into how heritage speakers encode and create cultural meaning in their heritage languages, which can in turn have further ramifications for heritage language learning and teaching.
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
The Collins Multilingual database covers Real Life Daily vocabulary. It is composed of a multilingual lexicon in 32 languages (the WordBank, see ELRA-T0376) and a multilingual set of sentences in 28 languages (the PhraseBank, see ELRA-T0377). This version includes the corresponding audio files covering 26 languages of the 32 languages available in the Collins MLD Wordbank: Arabic, Chinese, Croatian, Czech, Danish, Dutch, American English, British English, Finnish, French, German, Greek, Italian, Japanese, Korean, Norwegian, Polish, Portuguese (Iberian), Portuguese (Brazilian), Russian, Spanish (Iberian), Spanish (Latin American), Swedish, Thai, Turkish, Vietnamese. The WordBank contains 10,000 words for each language, XML-annotated for part-of-speech, gender, irregular forms and disambiguating information for homographs. An additional dataset of 10,000 headwords is included for 12 languages (Chinese, American and British English, French, German, Italian, Japanese, Korean, Iberian and Brazilian Portuguese, Iberian and Latin American Spanish). The full database contains 10,000 audio files for each language (26 languages), and 10,000 additional audio files corresponding to the 10,000 additional headwords in 12 languages. Audio was recorded by native speakers.
Dataset Title: A Gold Standard Corpus for Activity Information (GoSCAI)
Dataset Curators: The Epidemiology & Biostatistics Section of the NIH Clinical Center Rehabilitation Medicine Department
Dataset Version: 1.0 (May 16, 2025)
Dataset Citation and DOI: NIH CC RMD Epidemiology & Biostatistics Section. (2025). A Gold Standard Corpus for Activity Information (GoSCAI) [Data set]. Zenodo. doi: 10.5281/zenodo.15528545
This data statement is for a gold standard corpus of de-identified clinical notes that have been annotated for human functioning information based on the framework of the WHO's International Classification of Functioning, Disability and Health (ICF). The corpus includes 484 notes from a single institution within the United States written in English in a clinical setting. This dataset was curated for the purpose of training natural language processing models to automatically identify, extract, and classify information on human functioning at the whole-person, or activity, level.
This dataset is curated to be a publicly available resource for the development and evaluation of methods for the automatic extraction and classification of activity-level functioning information as defined in the ICF. The goals of data curation are to 1) create a corpus of a size that can be manually deidentified and annotated, 2) maximize the density and diversity of functioning information of interest, and 3) allow public dissemination of the data.
Language Region: en-US
Prose Description: English as written by native and bilingual English speakers in a clinical setting
The language users represented in this dataset are medical and clinical professionals who work in a research hospital setting. These individuals hold professional degrees corresponding to their respective specialties. Specific demographic characteristics of the language users such as age, gender, or race/ethnicity were not collected.
The annotator group consisted of five people, 33 to 76 years old, including four females and one male. Socioeconomically, they came from the middle and upper-middle income classes. Regarding first language, three annotators had English as their first language, one had Chinese, and one had Spanish. Proficiency in English, the language of the data being annotated, was native for three of the annotators and bilingual for the other two. The annotation team included clinical rehabilitation domain experts with backgrounds in occupational therapy, physical therapy, and individuals with public health and data science expertise. Prior to annotation, all annotators were trained on the specific annotation process using established guidelines for the given domain, and annotators were required to achieve a specified proficiency level prior to annotating notes in this corpus.
The notes in the dataset were written as part of clinical care within a U.S. research hospital between May 2008 and November 2019. These notes were written by health professionals asynchronously following the patient encounter to document the interaction and support continuity of care. The intended audience of these notes was clinicians involved in the patients' care. The included notes come from nine disciplines: neuropsychology, occupational therapy, physical medicine (physiatry), physical therapy, psychiatry, recreational therapy, social work, speech language pathology, and vocational rehabilitation. The notes were curated to support research on natural language processing for functioning information between 2018 and 2024.
The final corpus was derived from a set of clinical notes extracted from the hospital electronic medical record (EMR) for the purpose of clinical research. The original data consist of character-based digital content. We work in ASCII 8 or UNICODE encoding, and therefore part of our pre-processing includes running encoding detection and transforming text from encodings such as Windows-1252 or ISO-8859 into our preferred format.
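A minimal sketch of this kind of encoding normalization is shown below. It is not the curators' actual tooling, which is not named here: it assumes that trying UTF-8 first and falling back to Windows-1252 is an acceptable stand-in for true encoding detection, and the function name `to_unicode` is hypothetical.

```python
def to_unicode(raw: bytes) -> str:
    """Decode raw note bytes to a Unicode string.

    Assumed strategy: try UTF-8 first (plain ASCII is a subset), then fall
    back to Windows-1252, a common legacy encoding in exported records.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Windows-1252 leaves a handful of byte values undefined;
        # errors="replace" turns those into U+FFFD instead of raising.
        return raw.decode("cp1252", errors="replace")


# Usage: bytes written in either encoding come out as the same string.
utf8_bytes = "café".encode("utf-8")
legacy_bytes = "café".encode("cp1252")
normalized = [to_unicode(utf8_bytes), to_unicode(legacy_bytes)]
```

A fallback chain like this cannot tell Windows-1252 apart from ISO-8859-1, so a real pipeline would likely need a dedicated detection step for ambiguous inputs.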
On the larger corpus, we applied sampling to match our curation rationale. Given the resource constraints of manual annotation, we set out to create a dataset of 500 clinical notes, excluding notes over 10,000 characters in length.
To promote density and diversity, we used five note characteristics as sampling criteria. The first was text length, expressed in number of characters. The second was the discipline group, derived from note type metadata, which describes the discipline a note originated from: occupational and vocational therapy (OT/VOC), physical therapy (PT), recreation therapy (RT), speech and language pathology (SLP), social work (SW), or miscellaneous (MISC, including psychiatry, neurology and physiatry). These disciplines were selected for collecting the larger corpus because their notes are likely to include functioning information. Finally, existing information extraction tools were used to obtain annotation counts in four areas of functioning, which provided a note’s annotation count, annotation density (annotation count divided by text length), and domain count (number of domains with at least 1 annotation).
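The three annotation-based characteristics can be computed as in the sketch below, which follows the definitions given in the paragraph above. The input shape (a mapping from domain name to annotation count) and the names `NoteStats` and `note_stats` are assumptions for illustration; the output format of the extraction tools is not described.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class NoteStats:
    text_length: int           # number of characters in the note
    annotation_count: int      # total annotations across all domains
    annotation_density: float  # annotation count divided by text length
    domain_count: int          # domains with at least 1 annotation


def note_stats(text: str, annotations_per_domain: Dict[str, int]) -> NoteStats:
    """Derive the sampling characteristics for one note."""
    length = len(text)
    total = sum(annotations_per_domain.values())
    density = total / length if length else 0.0
    domains = sum(1 for count in annotations_per_domain.values() if count >= 1)
    return NoteStats(length, total, density, domains)


# Usage with made-up counts for a 200-character note.
example = note_stats("x" * 200, {"Mobility": 3, "SCDL": 1, "IPIR": 0})
```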
We used stratified sampling across the 6 discipline groups to ensure discipline diversity in the corpus. Because of low availability, 50 notes were sampled from SLP with relaxed criteria, and 90 notes each from the 5 other discipline groups with stricter criteria. Sampled SLP notes were those with the highest annotation density that had an annotation count of at least 5 and a domain count of at least 2. Other notes were sampled by highest annotation count and lowest text length, with a minimum annotation count of 15 and minimum domain count of 3.
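The selection rules above can be sketched as follows. This is an illustration under assumptions, not the curators' code: the dictionary keys are hypothetical, and because the text does not say how "highest annotation count" and "lowest text length" are combined, the sketch orders by count first and uses length as a tie-breaker.

```python
from typing import Dict, List

# Per-group quotas as stated above: 50 for SLP, 90 for each other group.
QUOTAS = {"OT/VOC": 90, "PT": 90, "RT": 90, "SLP": 50, "SW": 90, "MISC": 90}
MAX_CHARS = 10_000  # notes longer than this are excluded


def eligible(note: Dict, group: str) -> bool:
    """Apply the length cap and the group-specific minimum thresholds."""
    if note["text_length"] > MAX_CHARS:
        return False
    if group == "SLP":  # relaxed criteria
        return note["annotation_count"] >= 5 and note["domain_count"] >= 2
    return note["annotation_count"] >= 15 and note["domain_count"] >= 3


def sample_group(notes: List[Dict], group: str, quota: int) -> List[Dict]:
    """Pick up to `quota` notes from one discipline group."""
    pool = [note for note in notes if eligible(note, group)]
    if group == "SLP":
        pool.sort(key=lambda n: n["annotation_density"], reverse=True)
    else:
        # Assumed ordering: count descending, then length ascending.
        pool.sort(key=lambda n: (-n["annotation_count"], n["text_length"]))
    return pool[:quota]
```

The quotas sum to the 500-note target: 50 + 5 × 90 = 500.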
The notes in the resulting sample included certain types of PHI and PII. To prepare for public dissemination, all sensitive or potentially identifying information was manually annotated in the notes and replaced with substituted content to ensure readability and enough context needed for machine learning without exposing any sensitive information. This de-identification effort was manually reviewed to ensure no PII or PHI exposure and correct any resulting readability issues. Notes about pediatric patients were excluded. No attempt was made to sample multiple notes from the same patient. No metadata is provided to group notes other than by note type, discipline, or discipline group. The dataset is not organized beyond the provided metadata, but publications about models trained on this dataset should include information on the train/test splits used.
All notes were sentence-segmented and tokenized using the spaCy en_core_web_lg model with additional rules for sentence segmentation customized to the dataset. Notes are stored in an XML format readable by the GATE annotation software (https://gate.ac.uk/family/developer.html), which stores annotations separately in annotation sets.
As the clinical notes were extracted directly from the EMR in text format, the capture quality was determined to be high. The clinical notes did not have to be converted from other data formats, which means this dataset is free from noise introduced by conversion processes such as optical character recognition.
Because of the effort required to manually deidentify and annotate notes, this corpus is limited in terms of size and representation. The curation decisions skewed note selection towards specific disciplines and note types to increase the likelihood of encountering information on functioning. Some subtypes of functioning occur infrequently in the data, or not at all. The deidentification of notes was done in a manner to preserve natural language as it would occur in the notes, but some information is lost, e.g. on rare diseases.
Information on the manual annotation process is provided in the annotation guidelines for each of the four domains:
- Communication & Cognition (https://zenodo.org/records/13910167)
- Mobility (https://zenodo.org/records/11074838)
- Self-Care & Domestic Life (SCDL) (https://zenodo.org/records/11210183)
- Interpersonal Interactions & Relationships (IPIR) (https://zenodo.org/records/13774684)
Inter-annotator agreement was established on development datasets described in the annotation guidelines prior to the annotation of this gold standard corpus.
The gold standard corpus consists of 484 documents, which include 35,147 sentences in total. The distribution of annotated information is provided in the table below.
Domain | Number of Annotated Sentences | % of All Sentences | Mean Number of Annotated Sentences per Document
Communication & Cognition | 6033 | 17.2% |
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
4 hours and 80 minutes of speech as spoken by 2 female speakers and 2 male speakers, covering both mimics and parallel voice conversion data.
Datasheet for the dataset: multilingual-NLI-26lang-2mil7
Dataset Summary
This dataset contains 2 730 000 NLI text pairs in 26 languages spoken by more than 4 billion people. The dataset can be used to train models for multilingual NLI (Natural Language Inference) or zero-shot classification. The dataset is based on the English datasets MultiNLI, Fever-NLI, ANLI, LingNLI and WANLI and was created using the latest open-source machine translation models. The dataset is… See the full description on the dataset page: https://huggingface.co/datasets/MoritzLaurer/multilingual-NLI-26lang-2mil7.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We argue for a perspective on bilingual heritage speakers as native speakers of both their languages and present results from a large-scale, cross-linguistic study that took such a perspective and approached bilinguals and monolinguals on equal grounds. We targeted comparable language use in bilingual and monolingual speakers, crucially covering broader repertoires than just formal language. A main database was the open-access RUEG corpus, which covers comparable informal vs. formal and spoken vs. written productions by adolescent and adult bilinguals with heritage-Greek, -Russian, and -Turkish in Germany and the United States and with heritage-German in the United States, and matching data from monolinguals in Germany, the United States, Greece, Russia, and Turkey. Our main results lie in three areas. (1) We found non-canonical patterns not only in bilingual, but also in monolingual speakers, including patterns that have so far been considered absent from native grammars, in domains of morphology, syntax, intonation, and pragmatics. (2) We found a degree of lexical and morphosyntactic inter-speaker variability in monolinguals that was sometimes higher than that of bilinguals, further challenging the model of the streamlined native speaker. (3) In majority language use, non-canonical patterns were dominant in spoken and/or informal registers, and this was true for monolinguals and bilinguals. In some cases, bilingual speakers were leading quantitatively. In heritage settings where the language was not part of formal schooling, we found tendencies of register leveling, presumably due to the fact that speakers had limited access to formal registers of the heritage language. Our findings thus indicate possible quantitative differences and different register distributions rather than distinct grammatical patterns in bilingual and monolingual speakers. This supports the integration of heritage speakers into the native-speaker continuum. 
Approaching heritage speakers from this perspective helps us to better understand the empirical data and can shed light on language variation and change in native grammars. Furthermore, our findings for monolinguals lead us to reconsider the state-of-the art on majority languages, given recurring evidence for non-canonical patterns that deviate from what has been assumed in the literature so far, and might have been attributed to bilingualism had we not included informal and spoken registers in monolinguals and bilinguals alike.
Audio and video recordings of various types of community interpreted discourse (doctor-patient communication, simulated doctor-patient communication, courtroom communication) in German (simulated and authentic doctor-patient communication) and US (courtroom communication) institutions with varying community languages. Video recordings only exist for the simulated communication. For the authentic interpreted doctor-patient communication, no audio files will be made available.
The ComInDat pilot corpus contains sample data from three different projects: the DiK corpus of Portuguese/German and Turkish/German interpreted doctor-patient communication in hospitals (Bührig & Meyer 2004); the IiSCC-corpus, a corpus of interpreted court proceedings in different language constellations (Spanish/English, Russian/English, Haitian Creole/English and Polish/English) (Angermeyer 2006); and a corpus of simulated interpreted doctor-patient interactions in different language constellations (Russian/German, Polish/German and Romanian/German) from a training seminar for bilingual nursing staff ("SimDiK", Bührig, Kliche, Meyer & Pawlack 2012). More information about the background of the corpus and the details of its design can be found in Angermeyer, Meyer & Schmidt (2012). For more information about the project, please contact Philipp Angermeyer.
Angermeyer, P., Meyer, B. and Schmidt, T. (2012). Sharing Community Interpreting Corpora: A pilot study. In: Schmidt, T. and Wörner, K. (eds.) Multilingual Corpora and Multilingual Corpus Analysis. Amsterdam: Benjamins, 275-294.
CLARIN Metadata summary for Community Interpreting Database Pilot Corpus (ComInDat) (CMDI-based)
Title: Community Interpreting Database Pilot Corpus (ComInDat)
Description: Audio and video recordings of various types of community interpreted discourse (doctor-patient communication, simulated doctor-patient communication, courtroom communication) in German (simulated and authentic doctor-patient communication) and US (courtroom communication) institutions with varying community languages. Video recordings only exist for the simulated communication. For the authentic interpreted doctor-patient communication, no audio files will be made available.
Publication date: 2013-06-10
Data owner: Philipp Angermeyer, Department of Languages, Literatures and Linguistics / York University / 4700 Keele Street / Canada M3J 1P3, pangerme@yorku.ca, Kristin Bührig, Institut für Germanistik I / Von-Melle-Park 6 / D-20146 Hamburg, kristin.buehrig@uni-hamburg.de, Bernd Meyer, Arbeitsbereich Interkulturelle Kommunikation / Fachbereich 06: Translations-, Sprach- und Kulturwissenschaft / Johannes Gutenberg-Universität Mainz / An der Hochschule 2 / D-76726 Germersheim, meyerb@uni-mainz.de
Contributors: Philipp Angermeyer, Department of Languages, Literatures and Linguistics / York University / 4700 Keele Street / Canada M3J 1P3, pangerme@yorku.ca (compiler), Kristin Bührig, Institut für Germanistik I / Von-Melle-Park 6 / D-20146 Hamburg, kristin.buehrig@uni-hamburg.de (compiler), Bernd Meyer, Arbeitsbereich Interkulturelle Kommunikation / Fachbereich 06: Translations-, Sprach- und Kulturwissenschaft / Johannes Gutenberg-Universität Mainz / An der Hochschule 2 / D-76726 Germersheim, meyerb@uni-mainz.de (compiler)
Project: The Integration of Text, Sound, and Image into the Corpus-Based Analysis of Interpreter-Mediated Interaction
Keywords: community interpreting, doctor-patient communication, courtroom communication, EXMARaLDA
Languages: German (deu), English (eng), Spanish (spa), Turkish (tur), Polish (pol), Portuguese (por), Romanian (ron), Russian (rus), Haitian (hat)
Size: 54 speakers (35 female, 16 male, 3 unknown), 14 communications, 12 recordings, 83 minutes, 17 transcriptions, 35051 words
Annotation types: transcription (manual): HIAT/CHAT, deu: German translation, eng: English translation, k: free comment, lang: utterance language, sup: suprasegmental information, trans: utterance translation status, akz: accentuation/stress, pol: Polish translation
Temporal Coverage: 1999-07-01/2010-03-09
Spatial Coverage: Hamburg, DE; New York, US; Neumünster, DE
Genre: discourse
Modality: spoken
References: Angermeyer, P., Meyer, B. and Schmidt, T. (2012). Sharing Community Interpreting Corpora: A pilot study. In: Schmidt, T. and Wörner, K. (eds.) Multilingual Corpora and Multilingual Corpus Analysis. Amsterdam: Benjamins, 275-294.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
MIRACL is a multilingual dataset for ad hoc retrieval across 18 languages that collectively encompass over three billion native speakers around the world. This resource is designed to support monolingual retrieval tasks, where the queries and the corpora are in the same language. In total, we have gathered over 726k high-quality relevance judgments for 78k queries over Wikipedia in these languages, where all annotations have been performed by native speakers. MIRACL covers Indonesian and Thai languages. Before using this dataloader, please accept the acknowledgement at https://huggingface.co/datasets/miracl/miracl and use huggingface-cli login for authentication.
Specification
This dataset covers over 30 scenarios including sports, entertainment, health, shopping, pet, education, food, travel, and so on. For more details: https://dataoceanai.com/datasets/asr/multilingual-intelligent-speech-dataset/
ID:
King-ASR-959
SIZE:
219672 hours
LANGUAGE:
Over 100 languages covered
SAMPLE RATE:
16kHz/44.1kHz/48kHz
SPEAKERS:
215,891 People
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Overview
To assess the multilingual zero-shot voice cloning capabilities of TTS models, we have constructed a test set encompassing 24 languages. This dataset provides both audio samples for voice cloning and corresponding test texts. Specifically, the test set for each language includes: 100 distinct test sentences. Audio samples from two speakers (one male and one female) carefully selected from the Mozilla Common Voice (MCV) dataset, intended for voice cloning. Researchers can… See the full description on the dataset page: https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set.
NeMig is a bilingual news collection with accompanying knowledge graphs on the topic of migration. The news corpora in German and English were collected from online media outlets from Germany and the US, respectively. NeMig contains rich textual and metadata information, sentiment and political orientation annotations, as well as named entities extracted from the articles' content and metadata and linked to Wikidata. The corresponding knowledge graphs (NeMigKG) built from each corpus are expanded with up to two-hop neighbors from Wikidata of the initial set of linked entities.
NeMigKG comes in four flavors, for both the German and the English corpora:
Base NeMigKG: contains literals and entities from the corresponding annotated news corpus;
Entities NeMigKG: derived from the Base NeMigKG by removing all literal nodes, it contains only resource nodes;
Enriched Entities NeMigKG: derived from the Entities NeMigKG by enriching it with up to two-hop neighbors from Wikidata, it contains only resource nodes and Wikidata triples;
Complete NeMigKG: the combination of the Base and Enriched Entities NeMigKG, it contains both literals and resources.
Information about uploaded files:
(All files are bzip2-compressed and in the N-Triples format.)
A description of the NeMigKG files is provided in the table below:
NeMigKG Files Description
File | Description
nemig_${language}_ ${graph_type}-metadata.nt.bz2 | Metadata about the dataset, described using the VoID vocabulary.
nemig_${language}_ ${graph_type}-instances_types.nt.bz2 | Class definitions of news and event instances.
nemig_${language}_ ${graph_type}-instances_labels.nt.bz2 | Labels of instances.
nemig_${language}_ ${graph_type}-instances_related.nt.bz2 | Relations between news instances based on one another.
nemig_${language}_ ${graph_type}-instances_metadata_literals.nt.bz2 | Relations between news instances and metadata literals (e.g. URL, publishing date, modification date, sentiment label, political orientation of news outlets).
nemig_${language}_ ${graph_type}-instances_content_mapping.nt.bz2 | Mapping of news instances to content instances (e.g. title, abstract, body).
nemig_${language}_ ${graph_type}-instances_topic_mapping.nt.bz2 | Mapping of news instances to sub-topic instances.
nemig_${language}_ ${graph_type}-instances_sentiment_mapping.nt.bz2 | Mapping of news instances to sentiment classes.
nemig_${language}_ ${graph_type}-instances_political_orientation_mapping.nt.bz2 | Mapping of news outlet instances to political orientation classes.
nemig_${language}_ ${graph_type}-instances_content_literals.nt.bz2 | Relations between content instances and corresponding literals (e.g. text of title, abstract, body).
nemig_${language}_ ${graph_type}-instances_sentiment_polorient_literals.nt.bz2 | Relations between instances and corresponding sentiment or political orientation literals.
nemig_${language}_ ${graph_type}-instances_metadata_resources.nt.bz2 | Relations between news or sub-topic instances and entities extracted from metadata (i.e. publishers, authors, keywords).
nemig_${language}_ ${graph_type}-instances_event_mapping.nt.bz2 | Mapping of news instances to event instances.
nemig_${language}_ ${graph_type}-event_resources.nt.bz2 | Relations between event instances and entities extracted from the text of the news (i.e. actors, places, mentions).
nemig_${language}_ ${graph_type}-resources_provenance.nt.bz2 | Provenance information about the entities extracted from the text of the news (e.g. title, abstract, body).
nemig_${language}_ ${graph_type}-wiki_resources.nt.bz2 | Relations between Wikidata entities from news and their k-hop entity neighbors from Wikidata.
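Since the knowledge graph files are bzip2-compressed N-Triples, they can be streamed line by line without unpacking them first. The sketch below uses only the Python standard library and a deliberately simple regular expression in place of a full N-Triples parser; the function names are hypothetical, and the object term is returned as raw text (IRI, blank node, or quoted literal) without unescaping.

```python
import bz2
import re
from typing import Iterator, Optional, Tuple

# subject, predicate, then everything up to the terminating " ."
_TRIPLE = re.compile(r"^(\S+)\s+(\S+)\s+(.*?)\s*\.\s*$")


def parse_line(line: str) -> Optional[Tuple[str, str, str]]:
    """Split one N-Triples line into (subject, predicate, object) strings."""
    line = line.strip()
    if not line or line.startswith("#"):  # skip blank lines and comments
        return None
    match = _TRIPLE.match(line)
    return match.groups() if match else None


def iter_triples(path: str) -> Iterator[Tuple[str, str, str]]:
    """Stream triples from a .nt.bz2 file without decompressing it to disk."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            triple = parse_line(line)
            if triple is not None:
                yield triple
```

For anything beyond quick inspection, a dedicated RDF library would be a more robust choice than this regular expression.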
The corresponding user data has been collected through online studies in Germany and the US. We used the participants' implicit feedback regarding their interest in an article to build their click history, and the explicit feedback in terms of news click behaviors to construct the impression logs. To protect user privacy, we assign each user an anonymized ID.
The German and English user datasets are zip-compressed folders, which contain two files each.
NeMig User Dataset File Description
File | Description
behaviors.tsv | The click history and impression logs of users.
demographics_politics.tsv | Demographic and political information of users.
The behaviors.tsv file contains the users' news click histories and the impression logs. It has 4 tab-separated columns:
Impression ID: The ID of an impression.
User ID: The anonymized ID of a user.
Click History: The news click history (list of news IDs) of a user before an impression.
Impression Log: List of news displayed to the user in a session and the user's click behavior on them (1 for click, 0 for non-click).
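A reader for behaviors.tsv might look like the sketch below. Only the four-column, tab-separated layout comes from the description above; the rest is assumed for illustration: that there is no header row, that news IDs within a column are separated by spaces, and that each impression item is written as `newsID-label` (for example `N3-1`). The function name `parse_behaviors` and the sample row are hypothetical.

```python
import csv
import io
from typing import Dict, Iterator


def parse_behaviors(handle) -> Iterator[Dict]:
    """Yield one record per impression from an open behaviors.tsv handle."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 4:  # skip malformed or empty lines
            continue
        impression_id, user_id, history, impressions = row
        log = []
        for item in impressions.split():
            # Assumed item format: "<news id>-<0 or 1>".
            news_id, _, label = item.rpartition("-")
            log.append((news_id, int(label)))
        yield {
            "impression_id": impression_id,
            "user_id": user_id,
            "click_history": history.split(),
            "impression_log": log,
        }


# Usage with a made-up row; real files may differ from these assumptions.
sample = io.StringIO("1\tU42\tN1 N2\tN3-1 N4-0\n")
records = list(parse_behaviors(sample))
```

If the real files use a different separator or item encoding, only the two `split` calls and the `rpartition` line need to change.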
The demographics_politics.tsv file contains detailed information about the users' demographics and political interests. It has columns divided by the tab symbol. An explanation of all the columns and the questions used in the online studies to collect this information is shown in the table below.
Demographic and political user data description
Column Name
Question in German study
Scale in German
Question in English study
Scale in English
Demographics
Gender
Bitte geben Sie Ihr Geschlecht an
0 = männlich
1 = weiblich
2 = divers
3 = Keine Angabe
Please indicate your gender.
0 = male
1 = female
2 = other
3 = no answer
Age
Bitte geben Sie Ihr Alter an
1-120
Please indicate your age.
1-120
Qualification
Welches ist Ihr höchster Bildungsabschluss?
0 = Kein Schulabschluss
1 = Haupt-/Gesamtschulabschluss
2 = Realschulabschluss, Mittlere Reife, Fachschulreife
3 = Fachhochschulreife, Abitur
4 = Studium mit Abschluss
5 = Promotion
6 = Keine Angabe
Please indicate your highest educational qualification.
0 = less than high school
1 = high school/GED
2 = Vo-tech/business school
3 = some college
4 = college degree
5 = university degree
6 = doctoral degree
7 = no answer
Nationality
Welche Staatsangehörigkeit besitzen Sie?
0 = Nur die deutsche Staatsangehörigkeit
1 = Die deutsche und eine andere Staatsangehörigkeit
2 = Nur eine andere Staatsangehörigkeit
3 = Keine Angabe
What is your citizenship?
0 = U.S. citizenship
1 = U.S. and another non-U.S. citizenship
2 = Only non-U.S. citizenship
3 = No Answer
BornIn
Sind Sie in Deutschland geboren?
0 = Ja
1 = Nein
2 = Keine Angabe
Were you born in the U.S.?
0 = Yes
1 = No
2 = No answer
ParentsBornIn
Sind Ihre Eltern in Deutschland geboren?
0 = Mein Vater und meine Mutter sind beide in Deutschland geboren
1 = Mein Vater ist in Deutschland geboren, meine Mutter nicht
2 = Meine Mutter ist in Deutschland geboren, mein Vater nicht
3 = Weder meine Mutter noch mein Vater sind in Deutschland geboren
4 = Keine Angabe
Were your parents born in the U.S.?
0 = My father and my mother were both born in the U.S.
1 = My father was born in the U.S., my mother was not
2 = My mother was born in the U.S., my father was not
3 = Neither my mother nor my father were born in the U.S.
4 = No answer
Income
Was ist Ihr persönliches monatliches Nettoeinkommen (nach Abzug der Steuern)? Bitte geben Sie eine ungefähre Schätzung an, falls Sie die genaue Zahl nicht kennen.
0 = Weniger als 1000 €
1 = 1001 € bis 2000 €
2 = 2001 € bis 3000 €
3 = 3001 € bis 4000 €
4 = 4001 € bis 5000 €
5 = Mehr als 5000 €
6 = Keine Angabe
What is your personal monthly net income (after taxes)? Please give an approximate estimation in case you are unsure.
0 = Less than 1000 $
1 = 1001 $ to 2000 $
2 = 2001 $ to 3000 $
3 = 3001 $ to 4000 $
4 = 4001 $ to 5000 $
5 = More than 5000 $
6 = No Answer
Empathy
Wie sehr stimmen Sie den folgenden Aussagen zu?
7-point Likert scale
1=Trifft überhaupt nicht zu
7=Trifft voll und ganz zu
How strongly do you agree with the following statements?
7-point Likert scale
1=Strongly disagree
7=Strongly agree
EMP1
Wenn jemand anderes erfreut ist, tendiere ich dazu auch erfreut zu sein.
When someone else is feeling excited, I tend to get excited too.
EMP2
Es regt mich auf, wenn jemand respektlos behandelt wird.
It upsets me to see someone being treated disrespectfully.
EMP3
Es macht mir Freude, andere aufzumuntern.
I enjoy making other people feel better.
EMP4
Ich bin besorgt um Personen, die weniger Glück haben als ich.
I have tender, concerned feelings for people less fortunate than me.
EMP5
Ich fühle, wenn andere traurig sind, selbst wenn sie nichts sagen.
I can tell when others are sad even when they do not say anything.
EMP6
Meistens bin ich mit den Stimmungen anderer Leute im Einklang.
I find that I am “in tune” with other people’s moods.
EMP7
Ich empfinde einen starken Drang zu helfen, wenn ich jemanden sehe, der aufgebracht ist.
I get a strong urge to help when I see someone who is upset.
EMP8
Wenn ich jemanden sehe, der ausgenutzt wird, möchte ich die Person beschützen.
When I see someone being taken advantage of, I feel kind of protective towards him/her.
Big5
Ich
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
Dataset Card for MIRACL Corpus
MIRACL 🌍🙌🌏 (Multilingual Information Retrieval Across a Continuum of Languages) is a multilingual retrieval dataset that focuses on search across 18 different languages, which collectively encompass over three billion native speakers around the world. This dataset contains the collection data of the 16 "known languages". The remaining 2 "surprise languages" will not be released until later. The corpus for each language is prepared from a Wikipedia… See the full description on the dataset page: https://huggingface.co/datasets/miracl/miracl-corpus.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
FLEURS
Fleurs is the speech version of the FLoRes machine translation benchmark. We use 2009 n-way parallel sentences from the FLoRes dev and devtest publicly available sets, in 102 languages. Training sets have around 10 hours of supervision. Speakers of the train sets are different than speakers from the dev/test sets. Multilingual fine-tuning is used and "unit error rate" (characters, signs) of all languages is averaged. Languages and results are also grouped into seven… See the full description on the dataset page: https://huggingface.co/datasets/tunngo/test_4.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
IndicVoices: Towards building an Inclusive Multilingual Speech Dataset for Indian Languages
Overview
INDICVOICES is a dataset of natural and spontaneous speech containing a total of 19550 hours of read (8%), extempore (76%) and conversational (15%) audio from 29K speakers covering 400+ Indian districts and 22 languages. Of these 19550 hours, 9200 hours have already been transcribed. Through this paper, we share our journey of capturing the cultural… See the full description on the dataset page: https://huggingface.co/datasets/ai4bharat/IndicVoices.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Multilingual MGSM Dataset (Indonesian, Javanese, Sundanese)
This dataset is a multilingual extension of the MGSM dataset, translated into Indonesian (id), Javanese (jv), and Sundanese (su). The original English math word problems were translated with the following modifications:
Western names were replaced with culturally appropriate local names.
All numerical values and logic were preserved.
Native speakers reviewed translations for correctness and naturalness.
Each language is… See the full description on the dataset page: https://huggingface.co/datasets/irasalsabila/mgsm_50.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
IFEval Multilingual
These are machine-translated versions of Instruction Following Evaluation (IFEval). We will do our best to correct the translations. Translations were done using DeepL and the translations were reviewed and corrected by native speakers. We use this dataset in our fork of LM Eval Harness that supports multilingual ifeval.
Supported languages
Finnish: machine-translated, manually corrected
Swedish: machine-translated but not corrected
MIT License https://opensource.org/licenses/MIT
XOR-TyDi QA brings together for the first time information-seeking questions, open-retrieval QA, and multilingual QA to create a multilingual open-retrieval QA dataset that enables cross-lingual answer retrieval. It consists of questions written by information-seeking native speakers in 7 typologically diverse languages and answer annotations that are retrieved from multilingual document collections. There are three sub-tasks: XOR-Retrieve, XOR-EnglishSpan, and XOR-Full.