21 datasets found
  1. SwitchLingua_text

    • huggingface.co
    Updated May 28, 2025
    Cite
    Peng Xie (2025). SwitchLingua_text [Dataset]. https://huggingface.co/datasets/Shelton1013/SwitchLingua_text
    Dataset updated
    May 28, 2025
    Authors
    Peng Xie
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Dataset Card for SwitchLingua_text

      Dataset Summary
    

    SwitchLingua is a comprehensive multilingual and multicultural code-switching dataset designed to advance research in automatic speech recognition, natural language processing, and conversational AI. The textual data for SwitchLingua was first generated using the proposed LinguaMaster framework, and the audio data was recorded by 174 bilingual speakers from diverse linguistic and cultural backgrounds to ensure high… See the full description on the dataset page: https://huggingface.co/datasets/Shelton1013/SwitchLingua_text.

  2. TC-STAR Bilingual Expressive Speech Database

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Dec 21, 2010
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2010). TC-STAR Bilingual Expressive Speech Database [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0313/
    Dataset updated
    Dec 21, 2010
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    8 hours of speech as spoken by 2 female speakers and 2 male speakers for each language (English and Spanish).

  3. ml_spoken_words

    • huggingface.co
    Updated Jun 25, 2024
    Cite
    MLCommons (2024). ml_spoken_words [Dataset]. https://huggingface.co/datasets/MLCommons/ml_spoken_words
    Dataset updated
    Jun 25, 2024
    Dataset authored and provided by
    MLCommons
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Multilingual Spoken Words Corpus is a large and growing audio dataset of spoken words in 50 languages collectively spoken by over 5 billion people, for academic research and commercial applications in keyword spotting and spoken term search, licensed under CC-BY 4.0. The dataset contains more than 340,000 keywords, totaling 23.4 million 1-second spoken examples (over 6,000 hours). The dataset has many use cases, ranging from voice-enabled consumer devices to call center automation. This dataset is generated by applying forced alignment on crowd-sourced sentence-level audio to produce per-word timing estimates for extraction. All alignments are included in the dataset.
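
    The size figures quoted in this description can be cross-checked with a little arithmetic. The sketch below uses only the numbers stated above; the function names are illustrative, and the per-keyword average is an upper bound because the description says "more than" 340,000 keywords.

```python
# Figures quoted in the dataset description.
EXAMPLES = 23_400_000       # 1-second spoken examples
SECONDS_PER_EXAMPLE = 1     # each example is described as 1 second long
KEYWORDS = 340_000          # "more than 340,000 keywords" (lower bound)


def total_hours(n_examples: int, seconds_each: float) -> float:
    """Total audio duration in hours."""
    return n_examples * seconds_each / 3600


def mean_examples_per_keyword(n_examples: int, n_keywords: int) -> float:
    """Average number of examples per keyword."""
    return n_examples / n_keywords


hours = total_hours(EXAMPLES, SECONDS_PER_EXAMPLE)
print(f"{hours:.0f} hours")  # 6500 hours, consistent with "over 6,000 hours"
print(f"{mean_examples_per_keyword(EXAMPLES, KEYWORDS):.1f} examples per keyword at most")
```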

  4. Table_1_Expressing diminutive meaning in heritage Spanish: linking the...

    • frontiersin.figshare.com
    • figshare.com
    xlsx
    Updated Jul 8, 2024
    Cite
    Abel Cruz (2024). Table_1_Expressing diminutive meaning in heritage Spanish: linking the heritage experience to diminutive use in everyday speech.XLSX [Dataset]. http://doi.org/10.3389/flang.2024.1377977.s001
    Available download formats: xlsx
    Dataset updated
    Jul 8, 2024
    Dataset provided by
    Frontiers
    Authors
    Abel Cruz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction: This paper studies the pragmatic force that heritage speakers may convey through the use of the diminutive in everyday speech. In particular, I analyze the use of the Spanish diminutive in 49 sociolinguistic interviews from a Spanish–English bilingual community in Southern Arizona, U.S. where Spanish is the heritage language. I compare the use of the diminutive in heritage Spanish to the distribution of the diminutive in the speech of a Spanish monolingual community (18 sociolinguistic interviews) from the same dialectal region. Although Spanish and English employ different morphosyntactic strategies to express diminutive meaning, the analysis reveals that the diminutive morpheme -ito/a is a productive morphological device in the Spanish-discourse of heritage speakers from Southern Arizona (i.e., similar diminutive distributions to their monolingual counterparts). While heritage speakers employed the diminutive -ito/a to express the notion of “smallness” in their Spanish-discourse, the analysis indicates that these language users are more likely to invoke a subjective evaluation through the diminutive -ito/a when talking about their family members and/or childhood experiences. This particular finding suggests that the concept “child” is the semantic/pragmatic driving force of the diminutive in heritage Spanish as a marker of speech by, about, to, or with some relation to children. The analysis further suggests that examining the pragmatic dimensions of the diminutive in everyday speech can provide important insights into how heritage speakers encode and create cultural meaning in their heritage languages.

    Methods: In this study, I analyze the use of Spanish diminutives in two U.S.-Mexico border regions. The first data set is representative of a Spanish–English bilingual community in Southern Arizona, U.S., provided in the Corpus del Español en el Sur de Arizona (The CESA Corpus). The CESA Corpus comprises 49 sociolinguistic interviews of ~1 h each for a total of ~305,542 words. The second data set comprises 18 sociolinguistic interviews of predominantly monolingual Spanish speakers from the city of Mexicali, Baja California in Mexico, provided in the Proyecto Para el Estudio Sociolingüístico del Español de España y de América (PRESEEA). The Mexicali data set consists of ~119,162 words.

    Results: The analysis revealed that the Spanish diminutive morpheme -ito/a is a productive morphological device in the Spanish-discourse of heritage speakers from Southern Arizona. In addition to its prototypical meaning (i.e., the notion of “smallness”), the diminutive morpheme -ito/a conveyed an array of pragmatic functions in the everyday speech of Spanish heritage speakers and their monolingual counterparts from the same dialectal region. Importantly, these pragmatic functions are mediated by speakers' subjective perceptions of the entity in question. Unlike their monolingual counterparts, heritage speakers are more likely to invoke a subjective evaluation through the diminutive -ito/a when talking about their family members and/or childhood experiences. Altogether, the study suggests that the concept “child” is the semantic/pragmatic driving force of the diminutive in heritage Spanish as a marker of speech by, about, to, or with some relation to children.

    Discussion: In this study, I followed Reynoso's framework to study the pragmatic dimensions of the diminutive in everyday speech, that is, speakers' publicly conveyed meaning. The analysis revealed that heritage speakers applied most of the pragmatic functions and their respective values observed in Reynoso's cross-dialectal study of Spanish diminutives, and hence providing further support for her framework. Similarly, the study provides further evidence to Jurafsky's proposal that morphological diminutives arise from semantic or pragmatic links with children. Finally, the analysis indicated that examining the semantic/pragmatic dimensions of the diminutive in everyday speech can provide important insights into how heritage speakers encode and create cultural meaning in their heritage languages, which can in turn have further ramifications for heritage language learning and teaching.
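
    Because the two corpora differ in size (~305,542 vs. ~119,162 words), raw counts are not directly comparable across them. The sketch below shows one common way to put such counts on an equal footing, assuming a per-1,000-word normalization; the description does not say which normalization the study used, and the diminutive count in the usage line is a made-up placeholder, not a result from the paper.

```python
# Corpus sizes quoted in the description (approximate word counts).
CESA_WORDS, CESA_INTERVIEWS = 305_542, 49
MEXICALI_WORDS, MEXICALI_INTERVIEWS = 119_162, 18


def words_per_interview(total_words: int, interviews: int) -> float:
    """Average interview length in words."""
    return total_words / interviews


def per_thousand_words(count: int, total_words: int) -> float:
    """Normalize a raw token count to a rate per 1,000 words."""
    return 1000 * count / total_words


print(round(words_per_interview(CESA_WORDS, CESA_INTERVIEWS)))          # 6236
print(round(words_per_interview(MEXICALI_WORDS, MEXICALI_INTERVIEWS)))  # 6620

# Hypothetical raw count, for illustration only.
hypothetical_diminutives = 500
print(per_thousand_words(hypothetical_diminutives, CESA_WORDS))
```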

  5. Collins Multilingual database (MLD) – WordBank with audio files

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Nov 18, 2016
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2016). Collins Multilingual database (MLD) – WordBank with audio files [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0382/
    Dataset updated
    Nov 18, 2016
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    The Collins Multilingual database covers Real Life Daily vocabulary. It is composed of a multilingual lexicon in 32 languages (the WordBank, see ELRA-T0376) and a multilingual set of sentences in 28 languages (the PhraseBank, see ELRA-T0377).

    This version includes the corresponding audio files covering 26 languages of the 32 languages available in the Collins MLD Wordbank: Arabic, Chinese, Croatian, Czech, Danish, Dutch, American English, British English, Finnish, French, German, Greek, Italian, Japanese, Korean, Norwegian, Polish, Portuguese (Iberian), Portuguese (Brazilian), Russian, Spanish (Iberian), Spanish (Latin American), Swedish, Thai, Turkish, Vietnamese. The WordBank contains 10,000 words for each language, XML-annotated for part-of-speech, gender, irregular forms and disambiguating information for homographs. An additional dataset of 10,000 headwords is included for 12 languages (Chinese, American and British English, French, German, Italian, Japanese, Korean, Iberian and Brazilian Portuguese, Iberian and Latin American Spanish).

    The full database contains 10,000 audio files for each language (26 languages), and 10,000 additional audio files corresponding to the 10,000 additional headwords in 12 languages. Audio was recorded by native speakers.

  6. A Gold Standard Corpus for Activity Information (GoSCAI)

    • zenodo.org
    Updated May 30, 2025
    Cite
    Zenodo (2025). A Gold Standard Corpus for Activity Information (GoSCAI) [Dataset]. http://doi.org/10.5281/zenodo.15528545
    Dataset updated
    May 30, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Description

    A Gold Standard Corpus for Activity Information

    Dataset Title: A Gold Standard Corpus for Activity Information (GoSCAI)

    Dataset Curators: The Epidemiology & Biostatistics Section of the NIH Clinical Center Rehabilitation Medicine Department

    Dataset Version: 1.0 (May 16, 2025)

    Dataset Citation and DOI: NIH CC RMD Epidemiology & Biostatistics Section. (2025). A Gold Standard Corpus for Activity Information (GoSCAI) [Data set]. Zenodo. doi: 10.5281/zenodo.15528545

    EXECUTIVE SUMMARY

    This data statement is for a gold standard corpus of de-identified clinical notes that have been annotated for human functioning information based on the framework of the WHO's International Classification of Functioning, Disability and Health (ICF). The corpus includes 484 notes from a single institution within the United States written in English in a clinical setting. This dataset was curated for the purpose of training natural language processing models to automatically identify, extract, and classify information on human functioning at the whole-person, or activity, level.

    CURATION RATIONALE

    This dataset is curated to be a publicly available resource for the development and evaluation of methods for the automatic extraction and classification of activity-level functioning information as defined in the ICF. The goals of data curation are to 1) create a corpus of a size that can be manually deidentified and annotated, 2) maximize the density and diversity of functioning information of interest, and 3) allow public dissemination of the data.

    LANGUAGE VARIETIES

    Language Region: en-US

    Prose Description: English as written by native and bilingual English speakers in a clinical setting

    LANGUAGE USER DEMOGRAPHIC

    The language users represented in this dataset are medical and clinical professionals who work in a research hospital setting. These individuals hold professional degrees corresponding to their respective specialties. Specific demographic characteristics of the language users such as age, gender, or race/ethnicity were not collected.

    ANNOTATOR DEMOGRAPHIC

    The annotator group consisted of five people, 33 to 76 years old, including four females and one male. Socioeconomically, they came from the middle and upper-middle income classes. Regarding first language, three annotators had English as their first language, one had Chinese, and one had Spanish. Proficiency in English, the language of the data being annotated, was native for three of the annotators and bilingual for the other two. The annotation team included clinical rehabilitation domain experts with backgrounds in occupational therapy, physical therapy, and individuals with public health and data science expertise. Prior to annotation, all annotators were trained on the specific annotation process using established guidelines for the given domain, and annotators were required to achieve a specified proficiency level prior to annotating notes in this corpus.

    LINGUISTIC SITUATION AND TEXT CHARACTERISTICS

    The notes in the dataset were written as part of clinical care within a U.S. research hospital between May 2008 and November 2019. These notes were written by health professionals asynchronously following the patient encounter to document the interaction and support continuity of care. The intended audience of these notes were clinicians involved in the patients' care. The included notes come from nine disciplines - neuropsychology, occupational therapy, physical medicine (physiatry), physical therapy, psychiatry, recreational therapy, social work, speech language pathology, and vocational rehabilitation. The notes were curated to support research on natural language processing for functioning information between 2018 and 2024.

    PREPROCESSING AND DATA FORMATTING

    The final corpus was derived from a set of clinical notes extracted from the hospital electronic medical record (EMR) for the purpose of clinical research. The original data consist of character-based digital content. We work in ASCII 8 or Unicode encoding, and therefore part of our pre-processing includes running encoding detection and converting from encodings such as Windows-1252 or ISO-8859 to our preferred format.
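
    The encoding step described above can be illustrated with a minimal sketch. The data statement does not name the detection tool that was used, so the fallback chain below (UTF-8, then Windows-1252, then ISO-8859-1) is an assumption chosen for illustration, and `to_unicode` is a hypothetical helper name.

```python
def to_unicode(raw: bytes) -> str:
    """Decode bytes to str, trying UTF-8 first, then Windows-1252.

    This is a simple fallback chain, not statistical encoding detection.
    """
    for encoding in ("utf-8", "cp1252"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # ISO-8859-1 maps every byte to a code point, so this last step cannot fail.
    return raw.decode("iso-8859-1")


# Bytes 0x93 and 0x94 are curly quotes in Windows-1252 and invalid in UTF-8.
print(to_unicode(b"\x93ambulates independently\x94"))
```

    Trying UTF-8 first is the safer order, since text that is valid UTF-8 is rarely anything else, whereas almost any byte sequence decodes without error as Windows-1252 or ISO-8859-1.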

    On the larger corpus, we applied sampling to match our curation rationale. Given the resource constraints of manual annotation, we set out to create a dataset of 500 clinical notes, which would exclude notes over 10,000 characters in length.

    To promote density and diversity, we used five note characteristics as sampling criteria. We used the text length as expressed in number of characters. Next, we considered the discipline group, which is derived from note type metadata and describes which discipline a note originated from: occupational and vocational therapy (OT/VOC), physical therapy (PT), recreation therapy (RT), speech and language pathology (SLP), social work (SW), or miscellaneous (MISC, including psychiatry, neurology and physiatry). These disciplines were selected for collecting the larger corpus because their notes are likely to include functioning information. Existing information extraction tools were used to obtain annotation counts in four areas of functioning, which provided a note’s annotation count, annotation density (annotation count divided by text length), and domain count (number of domains with at least 1 annotation).
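
    The three annotation-based criteria defined above follow directly from per-domain annotation counts. The sketch below assumes a hypothetical `NoteStats` record; the class and field names are illustrative and are not part of the released corpus.

```python
from dataclasses import dataclass


@dataclass
class NoteStats:
    """Hypothetical per-note record; names are illustrative only."""
    text_length: int              # note length in characters
    annotations_by_domain: dict   # domain name -> number of annotations

    @property
    def annotation_count(self) -> int:
        """Total annotations across all domains."""
        return sum(self.annotations_by_domain.values())

    @property
    def annotation_density(self) -> float:
        """Annotation count divided by text length."""
        return self.annotation_count / self.text_length

    @property
    def domain_count(self) -> int:
        """Number of domains with at least 1 annotation."""
        return sum(1 for n in self.annotations_by_domain.values() if n >= 1)


note = NoteStats(
    text_length=2000,
    annotations_by_domain={
        "Communication & Cognition": 0,
        "Mobility": 12,
        "SCDL": 8,
        "IPIR": 0,
    },
)
print(note.annotation_count, note.annotation_density, note.domain_count)  # 20 0.01 2
```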

    We used stratified sampling across the 6 discipline groups to ensure discipline diversity in the corpus. Because of low availability, 50 notes were sampled from SLP with relaxed criteria, and 90 notes each from the 5 other discipline groups with stricter criteria. Sampled SLP notes were those with the highest annotation density that had an annotation count of at least 5 and a domain count of at least 2. Other notes were sampled by highest annotation count and lowest text length, with a minimum annotation count of 15 and minimum domain count of 3.
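
    The stratified sampling rules can be sketched as follows. The thresholds and group sizes come from the text above; the record layout, the function names, and the way "highest annotation count and lowest text length" are combined (count first, length as tie-breaker) are assumptions, since the data statement does not spell out the exact ordering.

```python
MAX_LENGTH = 10_000  # notes longer than this were excluded
GROUP_SIZES = {"SLP": 50, "OT/VOC": 90, "PT": 90, "RT": 90, "SW": 90, "MISC": 90}


def eligible(note: dict, min_count: int, min_domains: int) -> bool:
    """Apply the length cap and the minimum annotation and domain counts."""
    return (
        note["text_length"] <= MAX_LENGTH
        and note["annotation_count"] >= min_count
        and note["domain_count"] >= min_domains
    )


def sample_group(notes: list, group: str) -> list:
    """Select notes for one discipline group."""
    pool = [n for n in notes if n["group"] == group]
    if group == "SLP":
        # Relaxed criteria: rank by annotation density.
        pool = [n for n in pool if eligible(n, min_count=5, min_domains=2)]
        pool.sort(key=lambda n: n["annotation_count"] / n["text_length"], reverse=True)
    else:
        # Stricter criteria: rank by annotation count, then by shorter text (assumed order).
        pool = [n for n in pool if eligible(n, min_count=15, min_domains=3)]
        pool.sort(key=lambda n: (-n["annotation_count"], n["text_length"]))
    return pool[: GROUP_SIZES[group]]


print(sum(GROUP_SIZES.values()))  # 500, the target corpus size
```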

    The notes in the resulting sample included certain types of PHI and PII. To prepare for public dissemination, all sensitive or potentially identifying information was manually annotated in the notes and replaced with substituted content to ensure readability and enough context needed for machine learning without exposing any sensitive information. This de-identification effort was manually reviewed to ensure no PII or PHI exposure and correct any resulting readability issues. Notes about pediatric patients were excluded. No attempt was made to sample multiple notes from the same patient. No metadata is provided to group notes other than by note type, discipline, or discipline group. The dataset is not organized beyond the provided metadata, but publications about models trained on this dataset should include information on the train/test splits used.

    All notes were sentence-segmented and tokenized using the spaCy en_core_web_lg model with additional rules for sentence segmentation customized to the dataset. Notes are stored in an XML format readable by the GATE annotation software (https://gate.ac.uk/family/developer.html), which stores annotations separately in annotation sets.

    CAPTURE QUALITY

    As the clinical notes were extracted directly from the EMR in text format, the capture quality was determined to be high. The clinical notes did not have to be converted from other data formats, which means this dataset is free from noise introduced by conversion processes such as optical character recognition.

    LIMITATIONS

    Because of the effort required to manually deidentify and annotate notes, this corpus is limited in terms of size and representation. The curation decisions skewed note selection towards specific disciplines and note types to increase the likelihood of encountering information on functioning. Some subtypes of functioning occur infrequently in the data, or not at all. The deidentification of notes was done in a manner to preserve natural language as it would occur in the notes, but some information is lost, e.g. on rare diseases.

    METADATA

    Information on the manual annotation process is provided in the annotation guidelines for each of the four domains:

    - Communication & Cognition (https://zenodo.org/records/13910167)

    - Mobility (https://zenodo.org/records/11074838)

    - Self-Care & Domestic Life (SCDL) (https://zenodo.org/records/11210183)

    - Interpersonal Interactions & Relationships (IPIR) (https://zenodo.org/records/13774684)

    Inter-annotator agreement was established on development datasets described in the annotation guidelines prior to the annotation of this gold standard corpus.

    The gold standard corpus consists of 484 documents, which include 35,147 sentences in total. The distribution of annotated information is provided in the table below.

    Domain | Number of Annotated Sentences | % of All Sentences | Mean Number of Annotated Sentences per Document
    Communication & Cognition | 6033 | 17.2% | …
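
    The percentage reported for Communication & Cognition can be checked against the corpus totals stated earlier (484 documents, 35,147 sentences). In the sketch below, `mean_per_document` assumes that the per-document mean is simply the annotated-sentence count divided by the number of documents; the listing is truncated before that value, so this is an assumed definition and not a figure from the source.

```python
DOCUMENTS = 484     # documents in the gold standard corpus
SENTENCES = 35_147  # sentences in the gold standard corpus


def share_of_sentences(annotated: int, total: int = SENTENCES) -> float:
    """Percentage of all sentences that carry an annotation for a domain."""
    return 100 * annotated / total


def mean_per_document(annotated: int, documents: int = DOCUMENTS) -> float:
    """Assumed definition: annotated sentences divided by number of documents."""
    return annotated / documents


print(f"{share_of_sentences(6033):.1f}%")  # 17.2%
```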

  7. TC-STAR Bilingual Voice-Conversion Spanish Speech Database

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Dec 21, 2010
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2010). TC-STAR Bilingual Voice-Conversion Spanish Speech Database [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0311/
    Dataset updated
    Dec 21, 2010
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Description

    4 hours and 80 minutes of speech as spoken by 2 female speakers and 2 male speakers, covering both mimics and parallel voice conversion data.

  8. multilingual-NLI-26lang-2mil7

    • huggingface.co
    Cite
    Moritz Laurer, multilingual-NLI-26lang-2mil7 [Dataset]. https://huggingface.co/datasets/MoritzLaurer/multilingual-NLI-26lang-2mil7
    Authors
    Moritz Laurer
    Description

    Datasheet for the dataset: multilingual-NLI-26lang-2mil7

      Dataset Summary
    

    This dataset contains 2 730 000 NLI text pairs in 26 languages spoken by more than 4 billion people. The dataset can be used to train models for multilingual NLI (Natural Language Inference) or zero-shot classification. The dataset is based on the English datasets MultiNLI, Fever-NLI, ANLI, LingNLI and WANLI and was created using the latest open-source machine translation models. The dataset is… See the full description on the dataset page: https://huggingface.co/datasets/MoritzLaurer/multilingual-NLI-26lang-2mil7.

  9. Data_Sheet_3_Heritage Speakers as Part of the Native Language Continuum.PDF

    • frontiersin.figshare.com
    pdf
    Updated Jun 14, 2023
    Cite
    Heike Wiese; Artemis Alexiadou; Shanley Allen; Oliver Bunk; Natalia Gagarina; Kateryna Iefremenko; Maria Martynova; Tatiana Pashkova; Vicky Rizou; Christoph Schroeder; Anna Shadrova; Luka Szucsich; Rosemarie Tracy; Wintai Tsehaye; Sabine Zerbian; Yulia Zuban (2023). Data_Sheet_3_Heritage Speakers as Part of the Native Language Continuum.PDF [Dataset]. http://doi.org/10.3389/fpsyg.2021.717973.s003
    Available download formats: pdf
    Dataset updated
    Jun 14, 2023
    Dataset provided by
    Frontiers
    Authors
    Heike Wiese; Artemis Alexiadou; Shanley Allen; Oliver Bunk; Natalia Gagarina; Kateryna Iefremenko; Maria Martynova; Tatiana Pashkova; Vicky Rizou; Christoph Schroeder; Anna Shadrova; Luka Szucsich; Rosemarie Tracy; Wintai Tsehaye; Sabine Zerbian; Yulia Zuban
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We argue for a perspective on bilingual heritage speakers as native speakers of both their languages and present results from a large-scale, cross-linguistic study that took such a perspective and approached bilinguals and monolinguals on equal grounds. We targeted comparable language use in bilingual and monolingual speakers, crucially covering broader repertoires than just formal language. A main database was the open-access RUEG corpus, which covers comparable informal vs. formal and spoken vs. written productions by adolescent and adult bilinguals with heritage-Greek, -Russian, and -Turkish in Germany and the United States and with heritage-German in the United States, and matching data from monolinguals in Germany, the United States, Greece, Russia, and Turkey. Our main results lie in three areas. (1) We found non-canonical patterns not only in bilingual, but also in monolingual speakers, including patterns that have so far been considered absent from native grammars, in domains of morphology, syntax, intonation, and pragmatics. (2) We found a degree of lexical and morphosyntactic inter-speaker variability in monolinguals that was sometimes higher than that of bilinguals, further challenging the model of the streamlined native speaker. (3) In majority language use, non-canonical patterns were dominant in spoken and/or informal registers, and this was true for monolinguals and bilinguals. In some cases, bilingual speakers were leading quantitatively. In heritage settings where the language was not part of formal schooling, we found tendencies of register leveling, presumably due to the fact that speakers had limited access to formal registers of the heritage language. Our findings thus indicate possible quantitative differences and different register distributions rather than distinct grammatical patterns in bilingual and monolingual speakers. This supports the integration of heritage speakers into the native-speaker continuum. 
Approaching heritage speakers from this perspective helps us to better understand the empirical data and can shed light on language variation and change in native grammars. Furthermore, our findings for monolinguals lead us to reconsider the state-of-the art on majority languages, given recurring evidence for non-canonical patterns that deviate from what has been assumed in the literature so far, and might have been attributed to bilingualism had we not included informal and spoken registers in monolinguals and bilinguals alike.

  10. Community Interpreting Database Pilot Corpus (ComInDat)

    • fdr.uni-hamburg.de
    Updated Mar 9, 2010
    Cite
    Angermeyer, Philipp; Bührig, Kristin; Meyer, Bernd (2010). Community Interpreting Database Pilot Corpus (ComInDat) [Dataset]. http://doi.org/10.25592/uhhfdm.1478
    Dataset updated
    Mar 9, 2010
    Dataset provided by
    Mainz University
    Authors
    Angermeyer, Philipp; Bührig, Kristin; Meyer, Bernd
    Description

    Audio and video recordings of various types of community interpreted discourse (doctor-patient communication, simulated doctor-patient communication, courtroom communication) in German (simulated and authentic doctor-patient communication) and US (courtroom communication) institutions with varying community languages. Video recordings only exist for the simulated communication. For the authentic interpreted doctor-patient communication, no audio files will be made available.

    The ComInDat pilot corpus contains sample data from three different projects: the DiK corpus of Portuguese/German and Turkish/German interpreted doctor-patient communication in hospitals (Bührig & Meyer 2004), the IiSCC-corpus, a corpus of interpreted court proceedings in different language constellations (Spanish/English, Russian/English, Haitian Creole/English and Polish/English) (Angermeyer 2006), and a corpus of simulated interpreted doctor-patient interactions in different language constellations (Russian/German, Polish/German and Romanian/German) from a training seminar for bilingual nursing staff ("SimDiK", Bührig, Kliche, Meyer & Pawlack 2012). More information about the background of the corpus and the details of its design can be found in Angermeyer, Meyer & Schmidt (2012). For more information about the project, please contact Philipp Angermeyer.

    Angermeyer, P., Meyer, B. and Schmidt, T. (2012). Sharing Community Interpreting Corpora: A pilot study. In: Schmidt, T. and Wörner, K. (eds.) Multilingual Corpora and Multilingual Corpus Analysis. Amsterdam: Benjamins, 275-294.

    CLARIN Metadata summary for Community Interpreting Database Pilot Corpus (ComInDat) (CMDI-based)

    Title: Community Interpreting Database Pilot Corpus (ComInDat)
    Description: Audio and video recordings of various types of community interpreted discourse (doctor-patient communication, simulated doctor-patient communication, courtroom communication) in German (simulated and authentic doctor-patient communication) and US (courtroom communication) institutions with varying community languages. Video recordings only exist for the simulated communication. For the authentic interpreted doctor-patient communication, no audio files will be made available.
    Publication date: 2013-06-10
    Data owner: Philipp Angermeyer, Department of Languages, Literatures and Linguistics / York University / 4700 Keele Street / Canada M3J 1P3, pangerme@yorku.ca, Kristin Bührig, Institut für Germanistik I / Von-Melle-Park 6 / D-20146 Hamburg, kristin.buehrig@uni-hamburg.de, Bernd Meyer, Arbeitsbereich Interkulturelle Kommunikation / Fachbereich 06: Translations-, Sprach- und Kulturwissenschaft / Johannes Gutenberg-Universität Mainz / An der Hochschule 2 / D-76726 Germersheim, meyerb@uni-mainz.de
    Contributors: Philipp Angermeyer, Department of Languages, Literatures and Linguistics / York University / 4700 Keele Street / Canada M3J 1P3, pangerme@yorku.ca (compiler), Kristin Bührig, Institut für Germanistik I / Von-Melle-Park 6 / D-20146 Hamburg, kristin.buehrig@uni-hamburg.de (compiler), Bernd Meyer, Arbeitsbereich Interkulturelle Kommunikation / Fachbereich 06: Translations-, Sprach- und Kulturwissenschaft / Johannes Gutenberg-Universität Mainz / An der Hochschule 2 / D-76726 Germersheim, meyerb@uni-mainz.de (compiler)
    Project: The Integration of Text, Sound, and Image into the Corpus-Based Analysis of Interpreter-Mediated Interaction
    Keywords: community interpreting, doctor-patient communication, courtroom communication, EXMARaLDA
    Languages: German (deu), English (eng), Spanish (spa), Turkish (tur), Polish (pol), Portuguese (por), Romanian (ron), Russian (rus), Haitian (hat)
    Size: 54 speakers (35 female, 16 male, 3 unknown), 14 communications, 12 recordings, 83 minutes, 17 transcriptions, 35051 words
    Annotation types: transcription (manual): HIAT/CHAT, deu: German translation, eng: English translation, k: free comment, lang: utterance language, sup: suprasegmental information, trans: utterance translation status, akz: accentuation/stress, pol: Polish translation
    Temporal Coverage: 1999-07-01/2010-03-09
    Spatial Coverage: Hamburg, DE; New York, US; Neumünster, DE
    Genre: discourse
    Modality: spoken
    References: Angermeyer, P., Meyer, B. and Schmidt, T. (2012). Sharing Community Interpreting Corpora: A pilot study. In: Schmidt, T. and Wörner, K. (eds.) Multilingual Corpora and Multilingual Corpus Analysis. Amsterdam: Benjamins, 275-294.
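
    The metadata summary above is a flat list of "Key: value" lines, which makes it easy to read programmatically. The sketch below parses such lines into a dictionary and checks that the speaker breakdown in the Size field adds up; the function names are illustrative, and the parser assumes one field per line with the key before the first colon.

```python
import re


def parse_summary(lines: list) -> dict:
    """Split each 'Key: value' line at the first colon."""
    record = {}
    for line in lines:
        key, sep, value = line.partition(":")
        if sep:
            record[key.strip()] = value.strip()
    return record


def speaker_breakdown(size_field: str) -> tuple:
    """Return the stated speaker total and the per-category counts."""
    total = int(re.search(r"(\d+) speakers", size_field).group(1))
    parts = {
        label: int(n)
        for n, label in re.findall(r"(\d+) (female|male|unknown)", size_field)
    }
    return total, parts


summary = parse_summary([
    "Genre: discourse",
    "Modality: spoken",
    "Size: 54 speakers (35 female, 16 male, 3 unknown), 14 communications, "
    "12 recordings, 83 minutes, 17 transcriptions, 35051 words",
])
total, parts = speaker_breakdown(summary["Size"])
print(total == sum(parts.values()))  # True
```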

  11. h

    Data from: miracl

    • huggingface.co
    Updated Jun 25, 2024
    Cite
    SEACrowd (2024). miracl [Dataset]. https://huggingface.co/datasets/SEACrowd/miracl
    Explore at:
    Dataset updated
    Jun 25, 2024
    Dataset authored and provided by
    SEACrowd
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    MIRACL is a multilingual dataset for ad hoc retrieval across 18 languages that collectively encompass over three billion native speakers around the world. This resource is designed to support monolingual retrieval tasks, where the queries and the corpora are in the same language. In total, we have gathered over 726k high-quality relevance judgments for 78k queries over Wikipedia in these languages, where all annotations have been performed by native speakers. MIRACL covers Indonesian and Thai languages. Before using this dataloader, please accept the acknowledgement at https://huggingface.co/datasets/miracl/miracl and use huggingface-cli login for authentication.

  12. h

    Multilingual_Intelligent_Speech_Dataset

    • huggingface.co
    Updated Sep 12, 2024
    Cite
    Dataocean AI (2024). Multilingual_Intelligent_Speech_Dataset [Dataset]. https://huggingface.co/datasets/DataoceanAI/Multilingual_Intelligent_Speech_Dataset
    Explore at:
    Dataset updated
    Sep 12, 2024
    Authors
    Dataocean AI
    Description

    Specification

    This dataset covers over 30 scenarios including sports, entertainment, health, shopping, pet, education, food, travel, and so on. For more details: https://dataoceanai.com/datasets/asr/multilingual-intelligent-speech-dataset/

    ID: King-ASR-959
    SIZE: 219672 hours
    LANGUAGE: Over 100 languages covered
    SAMPLE RATE: 16kHz/44.1kHz/48kHz
    SPEAKERS: 215,891 People

  13. h

    TTS-Multilingual-Test-Set

    • huggingface.co
    Updated May 27, 2025
    Cite
    MiniMax (2025). TTS-Multilingual-Test-Set [Dataset]. https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set
    Explore at:
    Dataset updated
    May 27, 2025
    Dataset authored and provided by
    MiniMax
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

    Overview

    To assess the multilingual zero-shot voice cloning capabilities of TTS models, we have constructed a test set encompassing 24 languages. This dataset provides both audio samples for voice cloning and corresponding test texts. Specifically, the test set for each language includes:

    100 distinct test sentences.

    Audio samples from two speakers (one male and one female) carefully selected from the Mozilla Common Voice (MCV) dataset, intended for voice cloning.

    Researchers can… See the full description on the dataset page: https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set.

  14. Z

    Data from: NeMig - A Bilingual News Collection and Knowledge Graph about...

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 9, 2023
    Cite
    Iana, Andreea (2023). NeMig - A Bilingual News Collection and Knowledge Graph about Migration [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7442424
    Explore at:
    Dataset updated
    May 9, 2023
    Dataset provided by
    Nikolajevic, Nevena
    Paulheim, Heiko
    Iana, Andreea
    Grote, Alexander
    Weinhardt, Christof
    Müller, Philipp
    Ludwig, Katharina
    Alam, Mehwish
    Description

    NeMig is a bilingual news collection with accompanying knowledge graphs on the topic of migration. The news corpora in German and English were collected from online media outlets in Germany and the US, respectively. NeMig contains rich textual and metadata information, sentiment and political orientation annotations, as well as named entities extracted from the articles' content and metadata and linked to Wikidata. The corresponding knowledge graphs (NeMigKG) built from each corpus are expanded with up to two-hop Wikidata neighbors of the initial set of linked entities.

    NeMigKG comes in four flavors, for both the German and the English corpora:

    Base NeMigKG: contains literals and entities from the corresponding annotated news corpus;

    Entities NeMigKG: derived from the Base NeMigKG by removing all literal nodes, it contains only resource nodes;

    Enriched Entities NeMigKG: derived from the Entities NeMigKG by enriching it with up to two-hop neighbors from Wikidata, it contains only resource nodes and Wikidata triples;

    Complete NeMigKG: the combination of the Base and Enriched Entities NeMigKG, it contains both literals and resources.

    Information about uploaded files:

    (All files are bzip2-compressed and in the N-Triples format.)

    A description of the NeMigKG files is provided in the table below:

    NeMigKG Files Description

    File
        Description

    nemig_${language}_ ${graph_type}-metadata.nt.bz2
        Metadata about the dataset, described using the VoID vocabulary.

    nemig_${language}_ ${graph_type}-instances_types.nt.bz2
        Class definitions of news and event instances.

    nemig_${language}_ ${graph_type}-instances_labels.nt.bz2
        Labels of instances.

    nemig_${language}_ ${graph_type}-instances_related.nt.bz2
        Relations between news instances based on one another.

    nemig_${language}_ ${graph_type}-instances_metadata_literals.nt.bz2
        Relations between news instances and metadata literals (e.g. URL, publishing date, modification date, sentiment label, political orientation of news outlets).

    nemig_${language}_ ${graph_type}-instances_content_mapping.nt.bz2
        Mapping of news instances to content instances (e.g. title, abstract, body).

    nemig_${language}_ ${graph_type}-instances_topic_mapping.nt.bz2
        Mapping of news instances to sub-topic instances.

    nemig_${language}_ ${graph_type}-instances_sentiment_mapping.nt.bz2
        Mapping of news instances to sentiment classes.

    nemig_${language}_ ${graph_type}-instances_political_orientation_mapping.nt.bz2
        Mapping of news outlet instances to political orientation classes.

    nemig_${language}_ ${graph_type}-instances_content_literals.nt.bz2
        Relations between content instances and corresponding literals (e.g. text of title, abstract, body).

    nemig_${language}_ ${graph_type}-instances_sentiment_polorient_literals.nt.bz2
        Relations between instances and corresponding sentiment or political orientation literals.

    nemig_${language}_ ${graph_type}-instances_metadata_resources.nt.bz2
        Relations between news or sub-topic instances and entities extracted from metadata (i.e. publishers, authors, keywords).

    nemig_${language}_ ${graph_type}-instances_event_mapping.nt.bz2
        Mapping of news instances to event instances.

    nemig_${language}_ ${graph_type}-event_resources.nt.bz2
        Relations between event instances and entities extracted from the text of the news (i.e. actors, places, mentions).

    nemig_${language}_ ${graph_type}-resources_provenance.nt.bz2
        Provenance information about the entities extracted from the text of the news (e.g. title, abstract, body).

    nemig_${language}_ ${graph_type}-wiki_resources.nt.bz2
        Relations between Wikidata entities from news and their k-hop entity neighbors from Wikidata.
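
Since the NeMigKG files are described as bzip2-compressed N-Triples, they can in principle be streamed with the Python standard library alone. The following is a minimal sketch, not the dataset authors' tooling: the function name `iter_triples`, the simplified line pattern, and the example file name `nemig_de_base-instances_labels.nt.bz2` are assumptions made for illustration. For anything beyond a quick look, a full RDF parser would be the safer choice.

```python
import bz2
import re
import sys

# Simplified N-Triples pattern (an assumption, not a full grammar):
# subject, predicate, then everything up to the terminating " ." as the object.
TRIPLE_RE = re.compile(r"^(\S+)\s+(\S+)\s+(.+?)\s*\.\s*$")


def iter_triples(path):
    """Yield (subject, predicate, object) strings from a .nt.bz2 file."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines and comment lines.
            if not line or line.startswith("#"):
                continue
            match = TRIPLE_RE.match(line)
            if match:
                yield match.groups()


if __name__ == "__main__":
    # Hypothetical usage: python read_nt.py nemig_de_base-instances_labels.nt.bz2
    if len(sys.argv) > 1:
        for count, triple in enumerate(iter_triples(sys.argv[1])):
            print(triple)
            if count >= 4:  # only preview the first five triples
                break
```

Streaming line by line keeps memory use flat, which matters for the larger files that contain k-hop Wikidata neighbors.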
    

    The corresponding user data has been collected through online studies in Germany and the US. We used the participants' implicit feedback regarding their interest in an article to build their click history, and the explicit feedback in terms of news click behaviors to construct the impression logs. To protect user privacy, we assign each user an anonymized ID.

    The German and English user datasets are zip-compressed folders, which contain two files each.

    NeMig User Dataset File Description

    File
        Description

    behaviors.tsv
        The click history and impression logs of users.

    demographics_politics.tsv
        Demographic and political information of users.

    The behaviors.tsv file contains the users' news click histories and the impression logs. It has four tab-separated columns:

    Impression ID: The ID of an impression.

    User ID: The anonymized ID of a user.

    Click History: The news click history (list of news IDs) of a user before an impression.

    Impression Log: List of news displayed to the user in a session and the user's click behavior on them (1 for click, 0 for non-click).
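
The four-column layout described above can be read with a short parser. This is a sketch under stated assumptions, not a documented loader: the listing does not say how items inside the Click History and Impression Log columns are delimited, so the space separator and the `newsID-label` form (for example `N3-1`) are guesses, exposed here as the hypothetical parameters `item_sep` and `label_sep`. The file is also assumed to have no header row.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Impression:
    impression_id: str
    user_id: str
    click_history: list   # news IDs clicked before this impression
    impression_log: list  # (news_id, clicked) pairs


def parse_behaviors(handle, item_sep=" ", label_sep="-"):
    """Parse tab-separated behavior rows; both separators are assumptions."""
    rows = []
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for record in reader:
        if len(record) != 4:
            continue  # skip malformed lines
        impression_id, user_id, history, log = record
        shown = []
        for item in log.split(item_sep):
            if not item:
                continue
            # Split on the last separator so news IDs may contain it.
            news_id, _, label = item.rpartition(label_sep)
            shown.append((news_id, label == "1"))  # 1 = click, 0 = non-click
        clicked_before = [n for n in history.split(item_sep) if n]
        rows.append(Impression(impression_id, user_id, clicked_before, shown))
    return rows


if __name__ == "__main__":
    # Made-up sample row for illustration only.
    sample = io.StringIO("1\tU1\tN1 N2\tN3-1 N4-0\n")
    for row in parse_behaviors(sample):
        print(row)
```

If the real files use a different delimiter, only the two separator arguments need to change.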

    The demographics_politics.tsv file contains detailed information about the users' demographics and political interests. Its columns are tab-separated. An explanation of all the columns and the questions used in the online studies to collect this information is shown in the table below.

    Demographic and political user data description

    Demographics

    Gender
        Question in German study: Bitte geben Sie Ihr Geschlecht an
        Scale in German: 0 = männlich; 1 = weiblich; 2 = divers; 3 = Keine Angabe
        Question in English study: Please indicate your gender.
        Scale in English: 0 = male; 1 = female; 2 = other; 3 = no answer

    Age
        Question in German study: Bitte geben Sie Ihr Alter an
        Scale in German: 1-120
        Question in English study: Please indicate your age.
        Scale in English: 1-120

    Qualification
        Question in German study: Welches ist Ihr höchster Bildungsabschluss?
        Scale in German: 0 = Kein Schulabschluss; 1 = Haupt-/Gesamtschulabschluss; 2 = Realschulabschluss, Mittlere Reife, Fachschulreife; 3 = Fachhochschulreife, Abitur; 4 = Studium mit Abschluss; 5 = Promotion; 6 = Keine Angabe
        Question in English study: Please indicate your highest educational qualification.
        Scale in English: 0 = less than high school; 1 = high school/GED; 2 = Vo-tech/business school; 3 = some college; 4 = college degree; 5 = university degree; 6 = doctoral degree; 7 = no answer

    Nationality
        Question in German study: Welche Staatsangehörigkeit besitzen Sie?
        Scale in German: 0 = Nur die deutsche Staatsangehörigkeit; 1 = Die deutsche und eine andere Staatsangehörigkeit; 2 = Nur eine andere Staatsangehörigkeit; 3 = Keine Angabe
        Question in English study: What is your citizenship?
        Scale in English: 0 = U.S. citizenship; 1 = U.S. and another non-U.S. citizenship; 2 = Only non-U.S. citizenship; 3 = No Answer

    BornIn
        Question in German study: Sind Sie in Deutschland geboren?
        Scale in German: 0 = Ja; 1 = Nein; 2 = Keine Angabe
        Question in English study: Were you born in the U.S.?
        Scale in English: 0 = Yes; 1 = No; 2 = No answer

    ParentsBornIn
        Question in German study: Sind Ihre Eltern in Deutschland geboren?
        Scale in German: 0 = Mein Vater und meine Mutter sind beide in Deutschland geboren; 1 = Mein Vater ist in Deutschland geboren, meine Mutter nicht; 2 = Meine Mutter ist in Deutschland geboren, mein Vater nicht; 3 = Weder meine Mutter noch mein Vater sind in Deutschland geboren; 4 = Keine Angabe
        Question in English study: Were your parents born in the U.S.?
        Scale in English: 0 = My father and my mother were both born in the U.S.; 1 = My father was born in the U.S., my mother was not; 2 = My mother was born in the U.S., my father was not; 3 = Neither my mother nor my father were born in the U.S.; 4 = No answer

    Income
        Question in German study: Was ist Ihr persönliches monatliches Nettoeinkommen (nach Abzug der Steuern)? Bitte geben Sie eine ungefähre Schätzung an, falls Sie die genaue Zahl nicht kennen.
        Scale in German: 0 = Weniger als 1000 €; 1 = 1001 € bis 2000 €; 2 = 2001 € bis 3000 €; 3 = 3001 € bis 4000 €; 4 = 4001 € bis 5000 €; 5 = Mehr als 5000 €; 6 = Keine Angabe
        Question in English study: What is your personal monthly net income (after taxes)? Please give an approximate estimation in case you are unsure.
        Scale in English: 0 = Less than 1000 $; 1 = 1001 $ to 2000 $; 2 = 2001 $ to 3000 $; 3 = 3001 $ to 4000 $; 4 = 4001 $ to 5000 $; 5 = More than 5000 $; 6 = No Answer

    Empathy
        Question in German study: Wie sehr stimmen Sie den folgenden Aussagen zu?
        Scale in German: 7-point Likert scale; 1 = Trifft überhaupt nicht zu; 7 = Trifft voll und ganz zu
        Question in English study: How strongly do you agree with the following statements?
        Scale in English: 7-point Likert scale; 1 = Strongly disagree; 7 = Strongly agree

    EMP1
        Question in German study: Wenn jemand anderes erfreut ist, tendiere ich dazu auch erfreut zu sein.
        Question in English study: When someone else is feeling excited, I tend to get excited too.

    EMP2
        Question in German study: Es regt mich auf, wenn jemand respektlos behandelt wird.
        Question in English study: It upsets me to see someone being treated disrespectfully.

    EMP3
        Question in German study: Es macht mir Freude, andere aufzumuntern.
        Question in English study: I enjoy making other people feel better.

    EMP4
        Question in German study: Ich bin besorgt um Personen, die weniger Glück haben als ich.
        Question in English study: I have tender, concerned feelings for people less fortunate than me.

    EMP5
        Question in German study: Ich fühle, wenn andere traurig sind, selbst wenn sie nichts sagen.
        Question in English study: I can tell when others are sad even when they do not say anything.

    EMP6
        Question in German study: Meistens bin ich mit den Stimmungen anderer Leute im Einklang.
        Question in English study: I find that I am “in tune” with other people’s moods.

    EMP7
        Question in German study: Ich empfinde einen starken Drang zu helfen, wenn ich jemanden sehe, der aufgebracht ist.
        Question in English study: I get a strong urge to help when I see someone who is upset.

    EMP8
        Question in German study: Wenn ich jemanden sehe, der ausgenutzt wird, möchte ich die Person beschützen.
        Question in English study: When I see someone being taken advantage of, I feel kind of protective towards him/her.

    Big5
        Question in German study: Ich
    
  15. h

    miracl-corpus

    • huggingface.co
    Updated Sep 25, 2022
    Cite
    MIRACL (2022). miracl-corpus [Dataset]. https://huggingface.co/datasets/miracl/miracl-corpus
    Explore at:
    Dataset updated
    Sep 25, 2022
    Dataset authored and provided by
    MIRACL
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Card for MIRACL Corpus

    MIRACL 🌍🙌🌏 (Multilingual Information Retrieval Across a Continuum of Languages) is a multilingual retrieval dataset that focuses on search across 18 different languages, which collectively encompass over three billion native speakers around the world. This dataset contains the collection data of the 16 "known languages". The remaining 2 "surprise languages" will not be released until later. The corpus for each language is prepared from a Wikipedia… See the full description on the dataset page: https://huggingface.co/datasets/miracl/miracl-corpus.

  16. h

    test_4

    • huggingface.co
    Updated Oct 27, 2024
    Cite
    nguyen tuan (2024). test_4 [Dataset]. https://huggingface.co/datasets/tunngo/test_4
    Explore at:
    Dataset updated
    Oct 27, 2024
    Authors
    nguyen tuan
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    FLEURS

    Fleurs is the speech version of the FLoRes machine translation benchmark. We use 2009 n-way parallel sentences from the FLoRes dev and devtest publicly available sets, in 102 languages. Training sets have around 10 hours of supervision. Speakers of the train sets are different than speakers from the dev/test sets. Multilingual fine-tuning is used and "unit error rate" (characters, signs) of all languages is averaged. Languages and results are also grouped into seven… See the full description on the dataset page: https://huggingface.co/datasets/tunngo/test_4.
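
The averaged "unit error rate" mentioned above can be illustrated with a small sketch. This is one common reading of the metric and an assumption rather than the benchmark's official scoring code: units are taken to be characters, the per-language rate is total Levenshtein edit distance divided by total reference length, and languages are averaged with equal weight. The function names and sample strings are invented for illustration.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences of units (e.g. characters)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_unit in enumerate(reference, start=1):
        current = [i]
        for j, hyp_unit in enumerate(hypothesis, start=1):
            cost = 0 if ref_unit == hyp_unit else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def unit_error_rate(pairs):
    """Total edits divided by total reference units over (reference, hypothesis) pairs."""
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    units = sum(len(ref) for ref, _ in pairs)
    return errors / units


def macro_average(rates_by_language):
    """Unweighted mean of per-language rates."""
    return sum(rates_by_language.values()) / len(rates_by_language)


if __name__ == "__main__":
    # Made-up transcripts, not taken from the dataset.
    rates = {
        "en": unit_error_rate([("hello world", "hello word")]),
        "de": unit_error_rate([("guten tag", "guten tag")]),
    }
    print(rates, macro_average(rates))
```

An unweighted mean gives each language the same influence regardless of how much test audio it has.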

  17. h

    IndicVoices

    • huggingface.co
    Updated Mar 5, 2025
    Cite
    AI4Bharat (2025). IndicVoices [Dataset]. https://huggingface.co/datasets/ai4bharat/IndicVoices
    Explore at:
    Dataset updated
    Mar 5, 2025
    Dataset authored and provided by
    AI4Bharat
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    IndicVoices: Towards building an Inclusive Multilingual Speech Dataset for Indian Languages

      Overview
    

    INDICVOICES is a dataset of natural and spontaneous speech containing a total of 19550 hours of read (8%), extempore (76%) and conversational (15%) audio from 29K speakers covering 400+ Indian districts and 22 languages. Of these 19550 hours, 9200 hours have already been transcribed. Through this paper, we share our journey of capturing the cultural… See the full description on the dataset page: https://huggingface.co/datasets/ai4bharat/IndicVoices.

  18. h

    mgsm_50

    • huggingface.co
    Updated Mar 31, 2025
    Cite
    Ira Salsabila (2025). mgsm_50 [Dataset]. https://huggingface.co/datasets/irasalsabila/mgsm_50
    Explore at:
    Dataset updated
    Mar 31, 2025
    Authors
    Ira Salsabila
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Multilingual MGSM Dataset (Indonesian, Javanese, Sundanese)

    This dataset is a multilingual extension of the MGSM dataset, translated into Indonesian (id), Javanese (jv), and Sundanese (su). The original English math word problems were translated with the following modifications:

    Western names were replaced with culturally appropriate local names. All numerical values and logic were preserved. Native speakers reviewed translations for correctness and naturalness.

    Each language is… See the full description on the dataset page: https://huggingface.co/datasets/irasalsabila/mgsm_50.

  19. h

    ifeval_mt

    • huggingface.co
    Updated Oct 2, 2024
    Cite
    LumiOpen (2024). ifeval_mt [Dataset]. https://huggingface.co/datasets/LumiOpen/ifeval_mt
    Explore at:
    Dataset updated
    Oct 2, 2024
    Dataset authored and provided by
    LumiOpen
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    IFEval Multilingual

    These are machine-translated versions of Instruction Following Evaluation (IFEval). We will do our best to correct the translations. Translations were done using DeepL and the translations were reviewed and corrected by native speakers. We use this dataset in our fork of LM Eval Harness that supports multilingual ifeval.

    Supported languages

    Finnish: machine-translated, manually corrected

    Swedish: machine-translated but not corrected

  20. h

    xor_tydi_qa

    • huggingface.co
    • opendatalab.com
    Updated Jan 23, 2024
    Cite
    Akari Asai (2024). xor_tydi_qa [Dataset]. https://huggingface.co/datasets/akariasai/xor_tydi_qa
    Explore at:
    Dataset updated
    Jan 23, 2024
    Authors
    Akari Asai
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    XOR-TyDi QA brings together for the first time information-seeking questions, open-retrieval QA, and multilingual QA to create a multilingual open-retrieval QA dataset that enables cross-lingual answer retrieval. It consists of questions written by information-seeking native speakers in 7 typologically diverse languages and answer annotations that are retrieved from multilingual document collections. There are three sub-tasks: XOR-Retrieve, XOR-EnglishSpan, and XOR-Full.
