9 datasets found
  1. multilingual-NLI-26lang-2mil7

    • huggingface.co
    Cite
    Moritz Laurer, multilingual-NLI-26lang-2mil7 [Dataset]. https://huggingface.co/datasets/MoritzLaurer/multilingual-NLI-26lang-2mil7
    Authors
    Moritz Laurer
    Description

    Datasheet for the dataset: multilingual-NLI-26lang-2mil7

      Dataset Summary
    

    This dataset contains 2 730 000 NLI text pairs in 26 languages spoken by more than 4 billion people. The dataset can be used to train models for multilingual NLI (Natural Language Inference) or zero-shot classification. The dataset is based on the English datasets MultiNLI, Fever-NLI, ANLI, LingNLI and WANLI and was created using the latest open-source machine translation models. The dataset is… See the full description on the dataset page: https://huggingface.co/datasets/MoritzLaurer/multilingual-NLI-26lang-2mil7.

  2. ml_spoken_words

    • huggingface.co
    Updated Jun 25, 2024
    Cite
    MLCommons (2024). ml_spoken_words [Dataset]. https://huggingface.co/datasets/MLCommons/ml_spoken_words
    Dataset updated
    Jun 25, 2024
    Dataset authored and provided by
    MLCommons
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Multilingual Spoken Words Corpus is a large and growing audio dataset of spoken words in 50 languages collectively spoken by over 5 billion people, for academic research and commercial applications in keyword spotting and spoken term search, licensed under CC-BY 4.0. The dataset contains more than 340,000 keywords, totaling 23.4 million 1-second spoken examples (over 6,000 hours). The dataset has many use cases, ranging from voice-enabled consumer devices to call center automation. This dataset is generated by applying forced alignment on crowd-sourced sentence-level audio to produce per-word timing estimates for extraction. All alignments are included in the dataset.
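    The duration figure in the description above can be sanity-checked with a few lines of arithmetic. This is a minimal sketch, assuming every example lasts exactly one second as the description states; the function name `clips_to_hours` is hypothetical and not part of any MLCommons tooling.

```python
SECONDS_PER_HOUR = 3600


def clips_to_hours(num_clips: int, clip_seconds: float = 1.0) -> float:
    """Convert a count of fixed-length audio clips into total hours."""
    return num_clips * clip_seconds / SECONDS_PER_HOUR


# 23.4 million one-second examples, as quoted in the description.
total_hours = clips_to_hours(23_400_000)
print(f"{total_hours:,.0f} hours")  # prints "6,500 hours"
```

    The result is consistent with the "over 6,000 hours" stated in the description.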

  3. Table_2_Expressing diminutive meaning in heritage Spanish: linking the...

    • figshare.com
    • frontiersin.figshare.com
    xlsx
    Updated Jul 8, 2024
    Cite
    Abel Cruz (2024). Table_2_Expressing diminutive meaning in heritage Spanish: linking the heritage experience to diminutive use in everyday speech.XLSX [Dataset]. http://doi.org/10.3389/flang.2024.1377977.s002
    Available download formats: xlsx
    Dataset updated
    Jul 8, 2024
    Dataset provided by
    Frontiers
    Authors
    Abel Cruz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction: This paper studies the pragmatic force that heritage speakers may convey through the use of the diminutive in everyday speech. In particular, I analyze the use of the Spanish diminutive in 49 sociolinguistic interviews from a Spanish–English bilingual community in Southern Arizona, U.S., where Spanish is the heritage language. I compare the use of the diminutive in heritage Spanish to the distribution of the diminutive in the speech of a Spanish monolingual community (18 sociolinguistic interviews) from the same dialectal region. Although Spanish and English employ different morphosyntactic strategies to express diminutive meaning, the analysis reveals that the diminutive morpheme -ito/a is a productive morphological device in the Spanish-discourse of heritage speakers from Southern Arizona (i.e., similar diminutive distributions to their monolingual counterparts). While heritage speakers employed the diminutive -ito/a to express the notion of “smallness” in their Spanish-discourse, the analysis indicates that these language users are more likely to invoke a subjective evaluation through the diminutive -ito/a when talking about their family members and/or childhood experiences. This particular finding suggests that the concept “child” is the semantic/pragmatic driving force of the diminutive in heritage Spanish as a marker of speech by, about, to, or with some relation to children. The analysis further suggests that examining the pragmatic dimensions of the diminutive in everyday speech can provide important insights into how heritage speakers encode and create cultural meaning in their heritage languages.

    Methods: In this study, I analyze the use of Spanish diminutives in two U.S.-Mexico border regions. The first data set is representative of a Spanish–English bilingual community in Southern Arizona, U.S., provided in the Corpus del Español en el Sur de Arizona (The CESA Corpus). The CESA Corpus comprises 49 sociolinguistic interviews of ~1 h each for a total of ~305,542 words. The second data set comprises 18 sociolinguistic interviews of predominantly monolingual Spanish speakers from the city of Mexicali, Baja California in Mexico, provided in the Proyecto Para el Estudio Sociolingüístico del Español de España y de América (PRESEEA). The Mexicali data set consists of ~119,162 words.

    Results: The analysis revealed that the Spanish diminutive morpheme -ito/a is a productive morphological device in the Spanish-discourse of heritage speakers from Southern Arizona. In addition to its prototypical meaning (i.e., the notion of “smallness”), the diminutive morpheme -ito/a conveyed an array of pragmatic functions in the everyday speech of Spanish heritage speakers and their monolingual counterparts from the same dialectal region. Importantly, these pragmatic functions are mediated by speakers' subjective perceptions of the entity in question. Unlike their monolingual counterparts, heritage speakers are more likely to invoke a subjective evaluation through the diminutive -ito/a when talking about their family members and/or childhood experiences. Altogether, the study suggests that the concept “child” is the semantic/pragmatic driving force of the diminutive in heritage Spanish as a marker of speech by, about, to, or with some relation to children.

    Discussion: In this study, I followed Reynoso's framework to study the pragmatic dimensions of the diminutive in everyday speech, that is, speakers' publicly conveyed meaning. The analysis revealed that heritage speakers applied most of the pragmatic functions and their respective values observed in Reynoso's cross-dialectal study of Spanish diminutives, thus providing further support for her framework. Similarly, the study provides further evidence for Jurafsky's proposal that morphological diminutives arise from semantic or pragmatic links with children. Finally, the analysis indicated that examining the semantic/pragmatic dimensions of the diminutive in everyday speech can provide important insights into how heritage speakers encode and create cultural meaning in their heritage languages, which can in turn have further ramifications for heritage language learning and teaching.

  4. Department of Rehabilitation Office Contact Information and Addresses with...

    • data.ca.gov
    csv, docx, zip
    Updated Aug 28, 2024
    Cite
    California Department of Rehabilitation (2024). Department of Rehabilitation Office Contact Information and Addresses with Languages Spoken [Dataset]. https://data.ca.gov/dataset/department-of-rehabilitation-office-contact-information-and-addresses-with-languages-spoken
    Available download formats: csv, zip, docx
    Dataset updated
    Aug 28, 2024
    Dataset authored and provided by
    California Department of Rehabilitation (http://www.dor.ca.gov/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is a list of Department of Rehabilitation (DOR) offices and includes contact information, addresses, and languages spoken in each office. Note: In addition to the languages listed, the DOR has various bilingual language resources available in each office that allow us to serve members of the public who may speak a language other than English.

  5. spc

    • huggingface.co
    • opendatalab.com
    Updated May 16, 2024
    Cite
    Language Technology Research Group at the University of Helsinki (2024). spc [Dataset]. https://huggingface.co/datasets/Helsinki-NLP/spc
    Dataset updated
    May 16, 2024
    Dataset authored and provided by
    Language Technology Research Group at the University of Helsinki
    License

    https://choosealicense.com/licenses/unknown/

    Description

    This is a collection of parallel corpora collected by Hercules Dalianis and his research group for bilingual dictionary construction. More information in: Hercules Dalianis, Hao-chun Xing, Xin Zhang: Creating a Reusable English-Chinese Parallel Corpus for Bilingual Dictionary Construction, In Proceedings of LREC2010 (source: http://people.dsv.su.se/~hercules/SEC/) and Konstantinos Charitakis (2007): Using Parallel Corpora to Create a Greek-English Dictionary with UPLUG, In Proceedings of NODALIDA 2007. Afrikaans-English: Aldin Draghoender and Mattias Kanhov: Creating a reusable English – Afrikaans parallel corpora for bilingual dictionary construction

    4 languages, 3 bitexts; total number of files: 6; total number of tokens: 1.32M; total number of sentence fragments: 0.15M

  6. Cross-cultural differences in biased cognition - Pilot task data - Dataset -...

    • b2find.eudat.eu
    Updated Mar 30, 2014
    Cite
    (2014). Cross-cultural differences in biased cognition - Pilot task data - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/545c1da9-93df-58a4-9cb2-41c48d1170cb
    Dataset updated
    Mar 30, 2014
    Description

    This data collection consists of pilot data measuring task equivalence for measures of attention and interpretation bias. Congruent Mandarin and English emotional Stroop, attention probe (both measuring attention bias) and similarity ratings task and scrambled sentence task (both measuring interpretation bias) were developed using back-translation and decentering procedures. Tasks were then completed by 47 bilingual Mandarin-English speakers. Presented are data detailing personal characteristics, task scores and bias scores.

    The way in which we process information in the world around us has a significant effect on our health and well-being. For example, some people are more prone than others to notice potential dangers, to remember bad things from the past, and to assume the worst when the meaning of an event or comment is uncertain. These tendencies are called negative cognitive biases and can lead to low mood and poor quality of life. They also make people vulnerable to mental illnesses. In contrast, those with positive cognitive biases tend to function well and remain healthy. To date, most of this work has been conducted on white, Western populations and we do not know whether similar cognitive biases exist in Eastern cultures. This project will examine cognitive biases in Eastern (Hong Kong nationals) and Western (UK nationals) people to see whether there are any differences between the two. It will also examine what happens to cognitive biases when someone migrates to a different culture. This will tell us whether influences from the society and culture around us have any effect on our cognitive biases. Finally, the project will consider how much our own cognitive biases are inherited from our parents. Together these results will tell us whether the known good and bad effects of cognitive biases apply to non-Western cultural groups as well, and how much cognitive biases are decided by our genes or our environment.

    Participants: Fluent bilingual Mandarin and English speakers, aged 16-65 with no current major physical illness or psychological disorder, who were not receiving psychological therapy or medication for psychological conditions.

    Sampling procedure: Participants were recruited using circular emails which are sent to all university staff and students as well as through flyers around campuses. Relevant societies and language schools in central London were also contacted.

    Data collection: Participants completed four cognitive bias tasks (emotional Stroop, attention probe, similarity ratings task and scrambled sentence task) in both English and Mandarin. Order of language presentation and task presentation were counterbalanced.

  7. Collins Multilingual database (MLD) – PhraseBank with audio files

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Nov 18, 2016
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2016). Collins Multilingual database (MLD) – PhraseBank with audio files [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0383/
    Dataset updated
    Nov 18, 2016
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    The Collins Multilingual database covers Real Life Daily vocabulary. It is composed of a multilingual lexicon in 32 languages (the WordBank, see ELRA-T0376) and a multilingual set of sentences in 28 languages (the PhraseBank, see ELRA-T0377).

    This version includes the audio files corresponding to each phrase in the Collins MLD PhraseBank for 28 languages: Arabic, Chinese, Croatian, Czech, Danish, Dutch, American English, British English, Farsi, Finnish, French, German, Greek, Hindi, Italian, Japanese, Korean, Norwegian, Polish, Portuguese (Iberian), Portuguese (Brazilian), Russian, Spanish (Iberian), Spanish (Latin American), Swedish, Thai, Turkish, Vietnamese. Audio was recorded by a native speaker. It contains 2,000 audio files for each language.

    The PhraseBank consists of 2,000 phrases in 28 languages. Phrases are organised under 12 topics and 67 subtopics: talking to people, getting around, accommodation, shopping, leisure, communications, practicalities, health and beauty, eating and drinking. Romanization is provided for Arabic, Farsi and Hindi.

  8. The English–Spanish Vocabulary Inventory (De Anda et al., 2022)

    • asha.figshare.com
    pdf
    Updated May 31, 2023
    Cite
    Stephanie De Anda; Lauren M. Cycyk; Heather Moore; Lidia Huerta; Anne L. Larson; Marika R. King (2023). The English–Spanish Vocabulary Inventory (De Anda et al., 2022) [Dataset]. http://doi.org/10.23641/asha.17704391.v2
    Available download formats: pdf
    Dataset updated
    May 31, 2023
    Dataset provided by
    ASHA journals
    Authors
    Stephanie De Anda; Lauren M. Cycyk; Heather Moore; Lidia Huerta; Anne L. Larson; Marika R. King
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Purpose: Despite the increasing population of dual language learners (DLLs) in the United States, vocabulary measures for young DLLs have largely relied on instruments developed for monolinguals. The multistudy project reports on the psychometric properties of the English–Spanish Vocabulary Inventory (ESVI), which was designed to capture unique cross-language measures of lexical knowledge that are critical for assessing DLLs’ vocabulary, including translation equivalents (whether the child knows the words for the same concept in each language), total vocabulary (the number of words known across both languages), and conceptual vocabulary (the number of words known that represent unique concepts in either language).

    Method: Three studies included 87 Spanish–English DLLs (mean age = 26.58 months, SD = 2.86 months) with and without language delay from two geographic regions. Multiple measures (e.g., caregiver report, observation, behavioral tasks, and standardized assessments) determined content validity, construct validity, social validity, and criterion validity of the ESVI.

    Results: Monolingual instruments used in bilingual contexts significantly undercounted lexical knowledge as measured on the ESVI. Scores on the ESVI were related to performance on other measures of communication, indicating acceptable content, construct, and criterion validity. Social validity ratings were similarly positive. ESVI scores were also associated with suspected language delay.

    Conclusions: These studies provide initial evidence of the adequacy of the ESVI for use in research and clinical contexts with young children learning English and Spanish (with or without a language delay). Developing tools such as the ESVI promotes culturally and linguistically responsive practices that support accurate assessment of DLLs’ lexical development.
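
    The three cross-language measures defined in the Purpose statement can be illustrated with a small sketch. This is not the ESVI's actual scoring procedure, which the description does not give; the function `vocabulary_scores`, the word lists, and the concept labels are all hypothetical, and the sketch assumes each known word maps to exactly one concept.

```python
def vocabulary_scores(english_known: dict, spanish_known: dict) -> dict:
    """Compute three illustrative measures from word -> concept mappings.

    Each argument maps a word the child knows to the concept it names.
    """
    english_concepts = set(english_known.values())
    spanish_concepts = set(spanish_known.values())
    return {
        # Every known word counts, regardless of language.
        "total_vocabulary": len(english_known) + len(spanish_known),
        # A concept counts once even if it is known in both languages.
        "conceptual_vocabulary": len(english_concepts | spanish_concepts),
        # Concepts for which the child knows a word in each language.
        "translation_equivalents": len(english_concepts & spanish_concepts),
    }


# Hypothetical example data, not taken from the study.
english = {"dog": "DOG", "milk": "MILK", "ball": "BALL"}
spanish = {"perro": "DOG", "leche": "MILK", "zapato": "SHOE"}
print(vocabulary_scores(english, spanish))
```

    In this toy case the child knows six words covering four concepts, two of which are known in both languages. An English-only count would report three words, which is the kind of undercounting the Results statement describes.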

    Supplemental Material S1. English–Spanish Vocabulary Inventory (ESVI) correlations across studies.

    Supplemental Material S2. English–Spanish Vocabulary Inventory (ESVI) regression results across studies.

    De Anda, S., Cycyk, L. M., Moore, H., Huerta, L., Larson, A. L., & King, M. (2022). Psychometric properties of the English–Spanish Vocabulary Inventory in toddlers with and without early language delay. Journal of Speech, Language, and Hearing Research. Advance online publication. https://doi.org/10.1044/2021_JSLHR-21-00240

  9. InternVL-SA-1B-Caption

    • huggingface.co
    Updated Sep 28, 2025
    Cite
    OpenGVLab (2025). InternVL-SA-1B-Caption [Dataset]. https://huggingface.co/datasets/OpenGVLab/InternVL-SA-1B-Caption
    Dataset updated
    Sep 28, 2025
    Dataset authored and provided by
    OpenGVLab
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for InternVL-SA-1B-Caption

      Overview
    

    The InternVL-SA-1B-Caption Dataset is a bilingual dataset created using the InternVL2-Llama3-76B model. The dataset contains 12 million image-caption pairs in both English and Chinese. All images are sourced from Meta’s SA-1B dataset, and captions were generated using specific prompts designed to minimize hallucinations and ensure accurate descriptions based on visible image content. The dataset is intended for use in tasks… See the full description on the dataset page: https://huggingface.co/datasets/OpenGVLab/InternVL-SA-1B-Caption.

multilingual-NLI-26lang-2mil7 (result 1): 9 scholarly articles cite this dataset.