18 datasets found
  1. Ranking of languages spoken at home in the U.S. 2023

    • statista.com
    Updated Apr 14, 2025
    Cite
    Statista (2025). Ranking of languages spoken at home in the U.S. 2023 [Dataset]. https://www.statista.com/statistics/183483/ranking-of-languages-spoken-at-home-in-the-us-in-2008/
    Explore at:
    Dataset updated
    Apr 14, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2023
    Area covered
    United States
    Description

    In 2023, around 43.37 million people in the United States spoke Spanish at home. In comparison, approximately 998,179 people spoke Russian at home during the same year.

  2. Share of U.S. population speaking a language besides English at home 2023,...

    • statista.com
    Updated Jun 23, 2025
    Cite
    Statista (2025). Share of U.S. population speaking a language besides English at home 2023, by state [Dataset]. https://www.statista.com/statistics/312940/share-of-us-population-speaking-a-language-other-than-english-at-home-by-state/
    Explore at:
    Dataset updated
    Jun 23, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2023
    Area covered
    United States
    Description

    As of 2023, more than ** percent of people in the United States spoke a language other than English at home. California had the highest share among all U.S. states, with ** percent of its population speaking a language other than English at home.

  3. The most spoken languages worldwide 2025

    • statista.com
    Updated Apr 14, 2025
    Cite
    Statista (2025). The most spoken languages worldwide 2025 [Dataset]. https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/
    Explore at:
    Dataset updated
    Apr 14, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2025
    Area covered
    World
    Description

    In 2025, there were around 1.53 billion people worldwide who spoke English either natively or as a second language, slightly more than the 1.18 billion Mandarin Chinese speakers at the time of survey. Hindi and Spanish accounted for the third and fourth most widespread languages that year.

    Languages in the United States

    The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken in the United States vary as a result of the multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which more than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese), 1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year.

    Different languages at home

    The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage of population speaking a language other than English is California. About 45 percent of its population was speaking a language other than English at home in 2023.

  4. multilingual-NLI-26lang-2mil7

    • huggingface.co
    + more versions
    Cite
    Moritz Laurer, multilingual-NLI-26lang-2mil7 [Dataset]. https://huggingface.co/datasets/MoritzLaurer/multilingual-NLI-26lang-2mil7
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Authors
    Moritz Laurer
    Description

    Datasheet for the dataset: multilingual-NLI-26lang-2mil7

      Dataset Summary
    

    This dataset contains 2 730 000 NLI text pairs in 26 languages spoken by more than 4 billion people. The dataset can be used to train models for multilingual NLI (Natural Language Inference) or zero-shot classification. The dataset is based on the English datasets MultiNLI, Fever-NLI, ANLI, LingNLI and WANLI and was created using the latest open-source machine translation models. The dataset is
 See the full description on the dataset page: https://huggingface.co/datasets/MoritzLaurer/multilingual-NLI-26lang-2mil7.

  5. English-ASL Language Interoperability Dataset

    • opendatabay.com
    Updated Jul 6, 2025
    Cite
    Datasimple (2025). English-ASL Language Interoperability Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/2e2e9584-b0d7-417f-8460-ab0184e20a58
    Explore at:
    Dataset updated
    Jul 6, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Area covered
    Health Information Systems & Technology
    Description

    This dataset offers a powerful synthetic English-ASL gloss parallel corpus that was generated in 2012, providing an exciting opportunity to bridge the cultural divide between English and American Sign Language. By exploring this cross-cultural language interoperability, it aims to connect linguistic communities and bring together aspects of communication often seen as separated. The data supports innovative approaches to machine translation models and helps to uncover further insights into bridging linguistic divides.

    Columns

    The dataset consists of two primary columns:

    • gloss: This column contains the ASL gloss representation in a given context for any keyword or phrase spoken in ASL. It provides English representations of an ASL sign, helping users to better understand the correlation between written English and ASL signs.
    • text: This column provides a written translation or interpretation in English for each corresponding ASL sign within the gloss column.

    Distribution

    The dataset is typically provided in a CSV file format, specifically referenced as train.csv. It comprises two columns: gloss and text. The gloss column contains 81,123 unique values, while the text column contains 81,016 unique values. This indicates the dataset consists of at least 81,123 records.
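
    As a rough check on the figures above, the sketch below counts rows and distinct values per column using only Python's standard library. The file name train.csv and the column names gloss and text come from the description; the helper name column_stats and the small in-memory sample are hypothetical and are not taken from the dataset.

```python
import csv
import io


def column_stats(handle):
    """Return (row_count, distinct-value count per column) for a CSV file object."""
    reader = csv.DictReader(handle)
    rows = 0
    distinct = {name: set() for name in reader.fieldnames}
    for row in reader:
        rows += 1
        for name in reader.fieldnames:
            distinct[name].add(row[name])
    return rows, {name: len(values) for name, values in distinct.items()}


# Tiny stand-in for train.csv (hypothetical rows, not taken from the dataset).
sample = io.StringIO(
    "gloss,text\n"
    "HELLO,hello\n"
    "THANK YOU,thank you\n"
    "THANK-YOU,thank you\n"
)
rows, uniques = column_stats(sample)
print(rows, uniques)
```

    For the real file, pass a handle opened with open("train.csv", newline="", encoding="utf-8") instead of the sample. Because repeated values reduce the distinct count, the number of distinct values in a column is only a lower bound on the number of records.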

    Usage

    This dataset can be used for a variety of applications and use cases, including:

    • Creating a variety of scenarios which emulate common conversation topics found in everyday life, such as greetings, family activities, or home chores, by pairing individual words with their translations into ASL signs.
    • Helping users to gain proficiency over time in having coherent conversations using both spoken languages and signed languages such as American Sign Language (ASL).
    • Developing generative ASL-English bilingual chat bots.
    • Benchmarking different translation models to measure their accuracy.
    • Assessing various translation techniques and determining which is the most successful in translating from English to ASL.
    • Further exploration using predictive models to unravel the complex linguistic problems that often accompany cross-cultural communication barriers.

    Coverage

    The dataset focuses on the linguistic relationship between English and American Sign Language. While specific demographic details are not provided, its general availability is noted as global. The data was generated in 2012, offering a snapshot from that time.

    License

    CC0

    Who Can Use It

    This dataset is ideal for:

    • Researchers interested in linguistics, natural language processing (NLP), and machine translation.
    • Individuals seeking to learn and practise American Sign Language, aiming to improve their proficiency in coherent conversations using both spoken and signed communication.
    • Developers and data scientists working on AI models, chat bots, or translation systems that involve ASL and English.
    • Anyone interested in cross-cultural communication and bridging linguistic divides through language interoperability.

    Dataset Name Suggestions

    • ASL-English Parallel Gloss Corpus 2012
    • American Sign Language Translation Data
    • English-ASL Language Interoperability Dataset
    • ASL Gloss Representation Corpus
    • Bilingual ASL-English Communication Data

    Attributes

    Original Data Source: AslgPc12 (English-ASL Gloss Parallel Corpus 2012)

  6. Department of Rehabilitation Office Contact Information and Addresses with...

    • data.ca.gov
    • data.chhs.ca.gov
    • +3more
    csv, docx, zip
    Updated Aug 28, 2024
    Cite
    California Department of Rehabilitation (2024). Department of Rehabilitation Office Contact Information and Addresses with Languages Spoken [Dataset]. https://data.ca.gov/dataset/department-of-rehabilitation-office-contact-information-and-addresses-with-languages-spoken
    Explore at:
    Available download formats: csv, zip, docx
    Dataset updated
    Aug 28, 2024
    Dataset authored and provided by
    California Department of Rehabilitation (http://www.dor.ca.gov/)
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset is a list of Department of Rehabilitation (DOR) offices and includes contact information, addresses, and languages spoken in each office. Note: In addition to the languages listed, the DOR has various Bilingual language resources available in each office that allow us to serve members of the public who may speak a language other than English.

  7. ml_spoken_words

    • huggingface.co
    Updated Jun 25, 2024
    + more versions
    Cite
    MLCommons (2024). ml_spoken_words [Dataset]. https://huggingface.co/datasets/MLCommons/ml_spoken_words
    Explore at:
    Dataset updated
    Jun 25, 2024
    Dataset authored and provided by
    MLCommons
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Multilingual Spoken Words Corpus is a large and growing audio dataset of spoken words in 50 languages collectively spoken by over 5 billion people, for academic research and commercial applications in keyword spotting and spoken term search, licensed under CC-BY 4.0. The dataset contains more than 340,000 keywords, totaling 23.4 million 1-second spoken examples (over 6,000 hours). The dataset has many use cases, ranging from voice-enabled consumer devices to call center automation. This dataset is generated by applying forced alignment on crowd-sourced sentence-level audio to produce per-word timing estimates for extraction. All alignments are included in the dataset.

  8. Table_1_Expressing diminutive meaning in heritage Spanish: linking the...

    • frontiersin.figshare.com
    • figshare.com
    xlsx
    Updated Jul 8, 2024
    Cite
    Abel Cruz (2024). Table_1_Expressing diminutive meaning in heritage Spanish: linking the heritage experience to diminutive use in everyday speech.XLSX [Dataset]. http://doi.org/10.3389/flang.2024.1377977.s001
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jul 8, 2024
    Dataset provided by
    Frontiers
    Authors
    Abel Cruz
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Introduction

    This paper studies the pragmatic force that heritage speakers may convey through the use of the diminutive in everyday speech. In particular, I analyze the use of the Spanish diminutive in 49 sociolinguistic interviews from a Spanish–English bilingual community in Southern Arizona, U.S. where Spanish is the heritage language. I compare the use of the diminutive in heritage Spanish to the distribution of the diminutive in the speech of a Spanish monolingual community (18 sociolinguistic interviews) from the same dialectal region. Although Spanish and English employ different morphosyntactic strategies to express diminutive meaning, the analysis reveals that the diminutive morpheme -ito/a is a productive morphological device in the Spanish-discourse of heritage speakers from Southern Arizona (i.e., similar diminutive distributions to their monolingual counterparts). While heritage speakers employed the diminutive -ito/a to express the notion of “smallness” in their Spanish-discourse, the analysis indicates that these language users are more likely to invoke a subjective evaluation through the diminutive -ito/a when talking about their family members and/or childhood experiences. This particular finding suggests that the concept “child” is the semantic/pragmatic driving force of the diminutive in heritage Spanish as a marker of speech by, about, to, or with some relation to children. The analysis further suggests that examining the pragmatic dimensions of the diminutive in everyday speech can provide important insights into how heritage speakers encode and create cultural meaning in their heritage languages.

    Methods

    In this study, I analyze the use of Spanish diminutives in two U.S.-Mexico border regions. The first data set is representative of a Spanish–English bilingual community in Southern Arizona, U.S., provided in the Corpus del Español en el Sur de Arizona (The CESA Corpus). The CESA Corpus comprises 49 sociolinguistic interviews of ~1 h each for a total of ~305,542 words. The second data set comprises 18 sociolinguistic interviews of predominantly monolingual Spanish speakers from the city of Mexicali, Baja California in Mexico, provided in the Proyecto Para el Estudio Sociolingüístico del Español de España y de América (PRESEEA). The Mexicali data set consists of ~119,162 words.

    Results

    The analysis revealed that the Spanish diminutive morpheme -ito/a is a productive morphological device in the Spanish-discourse of heritage speakers from Southern Arizona. In addition to its prototypical meaning (i.e., the notion of “smallness”), the diminutive morpheme -ito/a conveyed an array of pragmatic functions in the everyday speech of Spanish heritage speakers and their monolingual counterparts from the same dialectal region. Importantly, these pragmatic functions are mediated by speakers' subjective perceptions of the entity in question. Unlike their monolingual counterparts, heritage speakers are more likely to invoke a subjective evaluation through the diminutive -ito/a when talking about their family members and/or childhood experiences. Altogether, the study suggests that the concept “child” is the semantic/pragmatic driving force of the diminutive in heritage Spanish as a marker of speech by, about, to, or with some relation to children.

    Discussion

    In this study, I followed Reynoso's framework to study the pragmatic dimensions of the diminutive in everyday speech, that is, speakers' publicly conveyed meaning. The analysis revealed that heritage speakers applied most of the pragmatic functions and their respective values observed in Reynoso's cross-dialectal study of Spanish diminutives, and hence providing further support for her framework. Similarly, the study provides further evidence to Jurafsky's proposal that morphological diminutives arise from semantic or pragmatic links with children. Finally, the analysis indicated that examining the semantic/pragmatic dimensions of the diminutive in everyday speech can provide important insights into how heritage speakers encode and create cultural meaning in their heritage languages, which can in turn have further ramifications for heritage language learning and teaching.

  9. U.S. - children who speak another language than English at home 1979-2019

    • statista.com
    Updated Jul 5, 2024
    Cite
    Statista (2024). U.S. - children who speak another language than English at home 1979-2019 [Dataset]. https://www.statista.com/statistics/476745/number-of-children-who-speak-another-language-than-english-at-home-in-the-us/
    Explore at:
    Dataset updated
    Jul 5, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    United States
    Description

    In 2019, about 12.08 million children in the United States spoke a language other than English at home. This number is fairly consistent with the previous year, when 12.13 million children spoke another language at home.

  10. Collins Multilingual database (MLD) - PhraseBank

    • live.european-language-grid.eu
    • catalogue.elra.info
    Updated Dec 7, 2016
    + more versions
    Cite
    (2016). Collins Multilingual database (MLD) - PhraseBank [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/2572
    Explore at:
    Dataset updated
    Dec 7, 2016
    License

    http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    The Collins Multilingual database covers Real Life Daily vocabulary. It is composed of a multilingual lexicon in 32 languages (the WordBank, distributed separately under reference ELRA-T0376) and a multilingual set of sentences in 28 languages (the PhraseBank).

    The PhraseBank consists of 2,000 phrases in 28 languages (Arabic, Chinese, Croatian, Czech, Danish, Dutch, American English, British English, Farsi, Finnish, French, German, Greek, Hindi, Italian, Japanese, Korean, Norwegian, Polish, Portuguese (Iberian), Portuguese (Brazilian), Russian, Spanish (Iberian), Spanish (Latin American), Swedish, Thai, Turkish, Vietnamese). Phrases are organised under 12 main topics and 67 subtopics. Covered topics are: talking to people, getting around, accommodation, shopping, leisure, communications, practicalities, health and beauty, eating and drinking, time.

    Romanization is provided for Arabic, Farsi and Hindi.

    Audio files corresponding to each phrase are available and are distributed in a package referenced ELRA-S0383.

  11. ‘Department of Rehabilitation Office Contact Information and Addresses with...

    • analyst-2.ai
    Updated Jan 26, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Department of Rehabilitation Office Contact Information and Addresses with Languages Spoken’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/data-gov-department-of-rehabilitation-office-contact-information-and-addresses-with-languages-spoken-a297/55d8eeb2/?iid=001-893&v=presentation
    Explore at:
    Dataset updated
    Jan 26, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Analysis of ‘Department of Rehabilitation Office Contact Information and Addresses with Languages Spoken’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://catalog.data.gov/dataset/b3462d31-650b-43fa-9f80-c6efd6d5ce88 on 26 January 2022.

    --- Dataset description provided by original source is as follows ---

    This dataset is a list of Department of Rehabilitation (DOR) offices and includes contact information, addresses, and languages spoken in each office. Note: In addition to the languages listed, the DOR has various Bilingual language resources available in each office that allow us to serve members of the public who may speak a language other than English.

    --- Original source retains full ownership of the source dataset ---

  12. A Gold Standard Corpus for Activity Information (GoSCAI)

    • zenodo.org
    Updated May 30, 2025
    Cite
    Zenodo (2025). A Gold Standard Corpus for Activity Information (GoSCAI) [Dataset]. http://doi.org/10.5281/zenodo.15528545
    Explore at:
    Dataset updated
    May 30, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Description

    A Gold Standard Corpus for Activity Information

    Dataset Title: A Gold Standard Corpus for Activity Information (GoSCAI)

    Dataset Curators: The Epidemiology & Biostatistics Section of the NIH Clinical Center Rehabilitation Medicine Department

    Dataset Version: 1.0 (May 16, 2025)

    Dataset Citation and DOI: NIH CC RMD Epidemiology & Biostatistics Section. (2025). A Gold Standard Corpus for Activity Information (GoSCAI) [Data set]. Zenodo. doi: 10.5281/zenodo.15528545

    EXECUTIVE SUMMARY

    This data statement is for a gold standard corpus of de-identified clinical notes that have been annotated for human functioning information based on the framework of the WHO's International Classification of Functioning, Disability and Health (ICF). The corpus includes 484 notes from a single institution within the United States written in English in a clinical setting. This dataset was curated for the purpose of training natural language processing models to automatically identify, extract, and classify information on human functioning at the whole-person, or activity, level.

    CURATION RATIONALE

    This dataset is curated to be a publicly available resource for the development and evaluation of methods for the automatic extraction and classification of activity-level functioning information as defined in the ICF. The goals of data curation are to 1) create a corpus of a size that can be manually deidentified and annotated, 2) maximize the density and diversity of functioning information of interest, and 3) allow public dissemination of the data.

    LANGUAGE VARIETIES

    Language Region: en-US

    Prose Description: English as written by native and bilingual English speakers in a clinical setting

    LANGUAGE USER DEMOGRAPHIC

    The language users represented in this dataset are medical and clinical professionals who work in a research hospital setting. These individuals hold professional degrees corresponding to their respective specialties. Specific demographic characteristics of the language users such as age, gender, or race/ethnicity were not collected.

    ANNOTATOR DEMOGRAPHIC

    The annotator group consisted of five people, 33 to 76 years old, including four females and one male. Socioeconomically, they came from the middle and upper-middle income classes. Regarding first language, three annotators had English as their first language, one had Chinese, and one had Spanish. Proficiency in English, the language of the data being annotated, was native for three of the annotators and bilingual for the other two. The annotation team included clinical rehabilitation domain experts with backgrounds in occupational therapy, physical therapy, and individuals with public health and data science expertise. Prior to annotation, all annotators were trained on the specific annotation process using established guidelines for the given domain, and annotators were required to achieve a specified proficiency level prior to annotating notes in this corpus.

    LINGUISTIC SITUATION AND TEXT CHARACTERISTICS

    The notes in the dataset were written as part of clinical care within a U.S. research hospital between May 2008 and November 2019. These notes were written by health professionals asynchronously following the patient encounter to document the interaction and support continuity of care. The intended audience of these notes were clinicians involved in the patients' care. The included notes come from nine disciplines - neuropsychology, occupational therapy, physical medicine (physiatry), physical therapy, psychiatry, recreational therapy, social work, speech language pathology, and vocational rehabilitation. The notes were curated to support research on natural language processing for functioning information between 2018 and 2024.

    PREPROCESSING AND DATA FORMATTING

    The final corpus was derived from a set of clinical notes extracted from the hospital electronic medical record (EMR) for the purpose of clinical research. The original data consist of character-based digital content. We work in ASCII 8 or Unicode encoding, and therefore part of our pre-processing includes running encoding detection and transformation from encodings such as Windows-1252 or ISO-8859 to our preferred format.
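
    The encoding step described above might look roughly like the following sketch. It is not the curators' pipeline: the source does not name a detection tool, so this version simply tries candidate encodings in order, and it assumes UTF-8 as the target Unicode encoding. The function name to_utf8 and the fallback order are hypothetical.

```python
def to_utf8(raw: bytes, fallbacks=("cp1252", "latin-1")) -> bytes:
    """Decode raw bytes with a best-effort encoding guess and re-encode as UTF-8."""
    try:
        text = raw.decode("utf-8")  # also covers plain ASCII
    except UnicodeDecodeError:
        for name in fallbacks:
            try:
                text = raw.decode(name)
                break
            except UnicodeDecodeError:
                continue
        else:
            raise ValueError("no candidate encoding matched")
    return text.encode("utf-8")


# b"caf\xe9" is "café" in Windows-1252 and is not valid UTF-8.
print(to_utf8(b"caf\xe9"))
```

    Trying encodings in a fixed order is a heuristic; since latin-1 accepts any byte sequence, it acts as a last resort and can silently mislabel text, which is why a statistical detector is often preferred on real data.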

    On the larger corpus, we applied sampling to match our curation rationale. Given the resource constraints of manual annotation, we set out to create a dataset of 500 clinical notes, which would exclude notes over 10,000 characters in length.

    To promote density and diversity, we used five note characteristics as sampling criteria. We used the text length as expressed in number of characters. Next, we considered the discipline group, which is derived from note type metadata and describes which discipline a note originated from: occupational and vocational therapy (OT/VOC), physical therapy (PT), recreation therapy (RT), speech and language pathology (SLP), social work (SW), or miscellaneous (MISC, including psychiatry, neurology and physiatry). These disciplines were selected for collecting the larger corpus because their notes are likely to include functioning information. Existing information extraction tools were used to obtain annotation counts in four areas of functioning and provided a note’s annotation count, annotation density (annotation count divided by text length), and domain count (number of domains with at least 1 annotation).
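
    The three annotation-based criteria defined above can be illustrated with a short sketch. The function name note_metrics and the input format (one domain label per annotation) are hypothetical; the source does not describe how the extraction tools expose their output.

```python
def note_metrics(text, annotation_domains):
    """Compute annotation count, annotation density, and domain count for one note.

    annotation_domains holds one domain label per annotation (hypothetical input format).
    """
    annotation_count = len(annotation_domains)
    # Density is the annotation count divided by text length in characters.
    density = annotation_count / len(text) if text else 0.0
    # Domain count is the number of domains with at least one annotation.
    domain_count = len(set(annotation_domains))
    return {
        "annotation_count": annotation_count,
        "annotation_density": density,
        "domain_count": domain_count,
    }


metrics = note_metrics("x" * 200, ["Mobility", "Mobility", "Self-Care", "Communication"])
print(metrics)
```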

    We used stratified sampling across the 6 discipline groups to ensure discipline diversity in the corpus. Because of low availability, 50 notes were sampled from SLP with relaxed criteria, and 90 notes each from the 5 other discipline groups with stricter criteria. Sampled SLP notes were those with the highest annotation density that had an annotation count of at least 5 and a domain count of at least 2. Other notes were sampled by highest annotation count and lowest text length, with a minimum annotation count of 15 and minimum domain count of 3.
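
    The per-group selection rules above can be sketched as follows. The thresholds and group sizes (50 for SLP, 90 for each other group, 500 in total) come from the description; the record layout, the function name sample_group, and the way annotation count and text length are combined into one ordering are assumptions, since the source does not specify them.

```python
def sample_group(notes, group):
    """Select notes for one discipline group using the thresholds quoted above."""
    pool = [n for n in notes if n["group"] == group]
    if group == "SLP":
        # Relaxed criteria: at least 5 annotations in at least 2 domains, ranked by density.
        eligible = [n for n in pool if n["count"] >= 5 and n["domains"] >= 2]
        eligible.sort(key=lambda n: n["count"] / n["length"], reverse=True)
        return eligible[:50]
    # Stricter criteria: at least 15 annotations in at least 3 domains.
    eligible = [n for n in pool if n["count"] >= 15 and n["domains"] >= 3]
    # Assumed ordering: annotation count first (descending), then text length (ascending).
    eligible.sort(key=lambda n: (-n["count"], n["length"]))
    return eligible[:90]


# Hypothetical records for illustration only.
notes = [
    {"id": "a", "group": "PT", "count": 20, "domains": 3, "length": 900},
    {"id": "b", "group": "PT", "count": 20, "domains": 3, "length": 700},
    {"id": "c", "group": "PT", "count": 14, "domains": 4, "length": 500},
    {"id": "d", "group": "SLP", "count": 6, "domains": 2, "length": 300},
    {"id": "e", "group": "SLP", "count": 5, "domains": 2, "length": 500},
    {"id": "f", "group": "SLP", "count": 9, "domains": 1, "length": 100},
]
pt_ids = [n["id"] for n in sample_group(notes, "PT")]
slp_ids = [n["id"] for n in sample_group(notes, "SLP")]
print(pt_ids, slp_ids)
```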

    The notes in the resulting sample included certain types of PHI and PII. To prepare for public dissemination, all sensitive or potentially identifying information was manually annotated in the notes and replaced with substituted content to ensure readability and enough context needed for machine learning without exposing any sensitive information. This de-identification effort was manually reviewed to ensure no PII or PHI exposure and correct any resulting readability issues. Notes about pediatric patients were excluded. No intent was made to sample multiple notes from the same patient. No metadata is provided to group notes other than by note type, discipline, or discipline group. The dataset is not organized beyond the provided metadata, but publications about models trained on this dataset should include information on the train/test splits used.

    All notes were sentence-segmented and tokenized using the spaCy en_core_web_lg model with additional rules for sentence segmentation customized to the dataset. Notes are stored in an XML format readable by the GATE annotation software (https://gate.ac.uk/family/developer.html), which stores annotations separately in annotation sets.

    CAPTURE QUALITY

    As the clinical notes were extracted directly from the EMR in text format, the capture quality was determined to be high. The clinical notes did not have to be converted from other data formats, which means this dataset is free from noise introduced by conversion processes such as optical character recognition.

    LIMITATIONS

    Because of the effort required to manually deidentify and annotate notes, this corpus is limited in terms of size and representation. The curation decisions skewed note selection towards specific disciplines and note types to increase the likelihood of encountering information on functioning. Some subtypes of functioning occur infrequently in the data, or not at all. The deidentification of notes was done in a manner to preserve natural language as it would occur in the notes, but some information is lost, e.g. on rare diseases.

    METADATA

    Information on the manual annotation process is provided in the annotation guidelines for each of the four domains:

    - Communication & Cognition (https://zenodo.org/records/13910167)

    - Mobility (https://zenodo.org/records/11074838)

    - Self-Care & Domestic Life (SCDL) (https://zenodo.org/records/11210183)

    - Interpersonal Interactions & Relationships (IPIR) (https://zenodo.org/records/13774684)

    Inter-annotator agreement was established on development datasets described in the annotation guidelines prior to the annotation of this gold standard corpus.

    The gold standard corpus consists of 484 documents, which include 35,147 sentences in total. The distribution of annotated information is provided in the table below.

    Domain | Number of Annotated Sentences | % of All Sentences | Mean Number of Annotated Sentences per Document
    Communication & Cognition | 6033 | 17.2% |

  13. ACS Specific Language Spoken by English Ability Variables - Centroids

    • mapdirect-fdep.opendata.arcgis.com
    Updated Apr 3, 2023
    + more versions
    Cite
    Esri (2023). ACS Specific Language Spoken by English Ability Variables - Centroids [Dataset]. https://mapdirect-fdep.opendata.arcgis.com/maps/06e52c059c024eb3832cca444757408c
    Explore at:
    Dataset updated
    Apr 3, 2023
    Dataset authored and provided by
    Esri (http://esri.com/)
    Description

    This layer shows language or language groups spoken at home by English ability. This is shown by tract, county, and state centroids. This service is updated annually to contain the most currently released American Community Survey (ACS) 5-year data, and contains estimates and margins of error. There are also additional calculated attributes related to this topic, which can be mapped or used within analysis. This layer is symbolized to show the count and percent of individuals age 5+ who are bilingual in English and another language (speak English very well and speak another language at home). To see the full list of attributes available in this service, go to the "Data" tab, and choose "Fields" at the top right.

    Current Vintage: 2019-2023
    ACS Table(s): C16001
    Data downloaded from: Census Bureau's API for American Community Survey
    Date of API call: December 12, 2024
    National Figures: data.census.gov

    The United States Census Bureau's American Community Survey (ACS): About the Survey, Geography & ACS, Technical Documentation, News & Updates

    This ready-to-use layer can be used within ArcGIS Pro, ArcGIS Online, its configurable apps, dashboards, Story Maps, custom apps, and mobile apps. Data can also be exported for offline workflows. For more information about ACS layers, visit the FAQ. Please cite the Census and ACS when using this data.

    Data Note from the Census:

    Data are based on a sample and are subject to sampling variability. The degree of uncertainty for an estimate arising from sampling variability is represented through the use of a margin of error. The value shown here is the 90 percent margin of error. The margin of error can be interpreted as providing a 90 percent probability that the interval defined by the estimate minus the margin of error and the estimate plus the margin of error (the lower and upper confidence bounds) contains the true value. In addition to sampling variability, the ACS estimates are subject to nonsampling error (for a discussion of nonsampling variability, see Accuracy of the Data). The effect of nonsampling error is not represented in these tables.

    Data Processing Notes:

    This layer is updated automatically when the most current vintage of ACS data is released each year, usually in December. The layer always contains the latest available ACS 5-year estimates. It is updated annually within days of the Census Bureau's release schedule. Click here to learn more about ACS data releases.

    Boundaries come from the US Census TIGER geodatabases, specifically, the National Sub-State Geography Database (named tlgdb_(year)_a_us_substategeo.gdb). Boundaries are updated at the same time as the data updates (annually), and the boundary vintage appropriately matches the data vintage as specified by the Census. These are Census boundaries with water and/or coastlines erased for cartographic and mapping purposes. For census tracts, the water cutouts are derived from a subset of the 2020 Areal Hydrography boundaries offered by TIGER. Water bodies and rivers which are 50 million square meters or larger (mid to large sized water bodies) are erased from the tract level boundaries, as well as additional important features. For state and county boundaries, the water and coastlines are derived from the coastlines of the 2023 500k TIGER Cartographic Boundary Shapefiles. These are erased to more accurately portray the coastlines and Great Lakes. The original AWATER and ALAND fields are still available as attributes within the data table (units are square meters).

    The States layer contains 52 records - all US states, Washington D.C., and Puerto Rico.

    Census tracts with no population that occur in areas of water, such as oceans, are removed from this data service (Census Tracts beginning with 99).

    Percentages and derived counts, and associated margins of error, are calculated values (that can be identified by the "_calc_" stub in the field name), and abide by the specifications defined by the American Community Survey.

    Field alias names were created based on the Table Shells file available from the American Community Survey Summary File Documentation page.

    Negative values (e.g., -4444...) have been set to null, with the exception of -5555... which has been set to zero. These negative values exist in the raw API data to indicate the following situations:

    The margin of error column indicates that either no sample observations or too few sample observations were available to compute a standard error and thus the margin of error. A statistical test is not appropriate.

    Either no sample observations or too few sample observations were available to compute an estimate, or a ratio of medians cannot be calculated because one or both of the median estimates falls in the lowest interval or upper interval of an open-ended distribution.

    The median falls in the lowest interval of an open-ended distribution, or in the upper interval of an open-ended distribution. A statistical test is not appropriate.

    The estimate is controlled. A statistical test for sampling variability is not appropriate.

    The data for this geographic area cannot be displayed because the number of sample cases is too small.
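The Census data note on margins of error can be illustrated with a small sketch. It assumes the usual ACS convention that a 90 percent margin of error equals 1.645 standard errors; the class name, field names, and the sample figures are hypothetical and are not taken from this layer's attribute table.

```python
from dataclasses import dataclass

# z-score that corresponds to a 90 percent confidence level (ACS convention)
Z_90 = 1.645


@dataclass
class Estimate:
    """An ACS-style estimate with its published 90 percent margin of error."""
    value: float
    moe_90: float

    def bounds(self) -> tuple[float, float]:
        # Lower and upper confidence bounds: estimate minus/plus the margin of error.
        return (self.value - self.moe_90, self.value + self.moe_90)

    def standard_error(self) -> float:
        # Recover the standard error from the 90 percent margin of error.
        return self.moe_90 / Z_90

    def moe_at(self, z: float) -> float:
        # Rescale the margin of error to another confidence level (e.g. z=1.96 for 95 percent).
        return self.standard_error() * z


# Hypothetical figures, for illustration only.
bilingual = Estimate(value=1000.0, moe_90=164.5)
low, high = bilingual.bounds()
print(f"90% interval: {low} to {high}")
print(f"95% margin of error: {bilingual.moe_at(1.96):.1f}")
```

Keeping the estimate and its margin of error in one object makes it harder to report a count without its uncertainty, which is the point of the data note.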

  14. English-Lithuanian EASTIN-CL Multilingual Ontology of Assistive Technology...

    • catalogue.elra.info
    • live.european-language-grid.eu
    • +1more
    Updated Feb 27, 2020
    + more versions
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2020). English-Lithuanian EASTIN-CL Multilingual Ontology of Assistive Technology (Processed) [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-M0074/
    Explore at:
    Dataset updated
    Feb 27, 2020
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Europe Facility - Automated Translation (CEF.AT) action. For further information on the project: http://lr-coordination.eu.

    The EASTIN-CL Multilingual Ontology of Assistive Technology was created within the EASTIN-CL project, which aimed at applying language technologies to the portal of assistive technologies http://www.eastin.eu to enhance it and make it more accessible for people in different languages.

    Based on the Multilingual Ontology, a query tool was built that allows users of the portal to type lookup words, which are then mapped to assistive device product classes.

    The terminology resource was created by first selecting base terminology in English, then having domain experts translate it into 6 other languages.

  15. Chinese Communities: Family Ethnography Data, 2017-2020

    • beta.ukdataservice.ac.uk
    • datacatalogue.cessda.eu
    Updated 2022
    Cite
    Xiao Lan Curdt-Christiansen (2022). Chinese Communities: Family Ethnography Data, 2017-2020 [Dataset]. http://doi.org/10.5255/ukda-sn-855705
    Explore at:
    Dataset updated
    2022
    Dataset provided by
    DataCite (https://www.datacite.org/)
    UK Data Service (https://ukdataservice.ac.uk/)
    Authors
    Xiao Lan Curdt-Christiansen
    Description

    10 families of different types (SES) and structures (e.g. nuclear, extended, single-parent) were observed. The data provide insight into family members’ ideological positions that can be congruent or conflictual and which may cause conflicting views about how to raise bilingual children. Interactional data capture the actual language practices in families across the communities. The data also allow us to observe the silent cultural conversations among family members, and to identify the critical moments of policy enactment.

  16. Data from: NeMig - A Bilingual News Collection and Knowledge Graph about...

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 9, 2023
    + more versions
    Cite
    Iana, Andreea (2023). NeMig - A Bilingual News Collection and Knowledge Graph about Migration [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7442424
    Explore at:
    Dataset updated
    May 9, 2023
    Dataset provided by
    Nikolajevic, Nevena
    Paulheim, Heiko
    Ludwig, Katharina
    Iana, Andreea
    Weinhardt, Christof
    Grote, Alexander
    Müller, Philipp
    Alam, Mehwish
    Description

    NeMig is a bilingual news collection with accompanying knowledge graphs on the topic of migration. The news corpora in German and English were collected from online media outlets from Germany and the US, respectively. NeMig contains rich textual and metadata information, sentiment and political orientation annotations, as well as named entities extracted from the articles' content and metadata and linked to Wikidata. The corresponding knowledge graphs (NeMigKG) built from each corpus are expanded with up to two-hop neighbors from Wikidata of the initial set of linked entities.

    NeMigKG comes in four flavors, for both the German and the English corpora:

    Base NeMigKG: contains literals and entities from the corresponding annotated news corpus;

    Entities NeMigKG: derived from the Base NeMig by removing all literal nodes, it contains only resource nodes;

    Enriched Entities NeMigKG: derived from the Entities NeMig by enriching it with up to two-hop neighbors from Wikidata, it contains only resource nodes and Wikidata triples;

    Complete NeMigKG: the combination of the Base and Enriched Entities NeMig, it contains both literals and resources.

    Information about uploaded files:

    (all files are b-zipped and in the N-Triples format.)

    A description of the NeMigKG files is provided in the table below:

    NeMigKG Files Description

    nemig_${language}_ ${graph_type}-metadata.nt.bz2
        Metadata about the dataset, described using the VoID vocabulary.

    nemig_${language}_ ${graph_type}-instances_types.nt.bz2
        Class definitions of news and event instances.

    nemig_${language}_ ${graph_type}-instances_labels.nt.bz2
        Labels of instances.

    nemig_${language}_ ${graph_type}-instances_related.nt.bz2
        Relations between news instances based on one another.

    nemig_${language}_ ${graph_type}-instances_metadata_literals.nt.bz2
        Relations between news instances and metadata literals (e.g. URL, publishing date, modification date, sentiment label, political orientation of news outlets).

    nemig_${language}_ ${graph_type}-instances_content_mapping.nt.bz2
        Mapping of news instances to content instances (e.g. title, abstract, body).

    nemig_${language}_ ${graph_type}-instances_topic_mapping.nt.bz2
        Mapping of news instances to sub-topic instances.

    nemig_${language}_ ${graph_type}-instances_sentiment_mapping.nt.bz2
        Mapping of news instances to sentiment classes.

    nemig_${language}_ ${graph_type}-instances_political_orientation_mapping.nt.bz2
        Mapping of news outlets instances to political orientation classes.

    nemig_${language}_ ${graph_type}-instances_content_literals.nt.bz2
        Relations between content instances and corresponding literals (e.g. text of title, abstract, body).

    nemig_${language}_ ${graph_type}-instances_sentiment_polorient_literals.nt.bz2
        Relations between instances and corresponding sentiment or political orientation literals.

    nemig_${language}_ ${graph_type}-instances_metadata_resources.nt.bz2
        Relations between news or sub-topic instances and entities extracted from metadata (i.e. publishers, authors, keywords).

    nemig_${language}_ ${graph_type}-instances_event_mapping.nt.bz2
        Mapping of news instances to event instances.

    nemig_${language}_ ${graph_type}-event_resources.nt.bz2
        Relations between event instances and entities extracted from the text of the news (i.e. actors, places, mentions).

    nemig_${language}_ ${graph_type}-resources_provenance.nt.bz2
        Provenance information about the entities extracted from the text of the news (e.g. title, abstract, body).

    nemig_${language}_ ${graph_type}-wiki_resources.nt.bz2
        Relations between Wikidata entities from news and their k-hop entity neighbors from Wikidata.
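Since the files listed above are bzip2-compressed N-Triples, a first look at any of them needs only the Python standard library. The following is a minimal sketch, not part of the dataset's own tooling: the function name is hypothetical, the path is whatever file was downloaded, and the line splitting is a simplification that relies on subjects and predicates in N-Triples never containing whitespace. A full RDF parser would be the safer choice for anything beyond counting.

```python
import bz2
from collections import Counter


def count_predicates(path: str) -> Counter:
    """Count how often each predicate occurs in a bzip2-compressed N-Triples file."""
    counts: Counter = Counter()
    # bz2.open in text mode decompresses on the fly, one line (one triple) at a time.
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            # Split into subject, predicate, and the rest (object plus final " .").
            parts = line.split(None, 2)
            if len(parts) < 3:
                continue  # not a well-formed triple line
            counts[parts[1]] += 1
    return counts
```

Calling `count_predicates` on one of the downloaded files and printing `most_common(10)` gives a quick overview of which relations a given file actually contains.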
    

    The corresponding user data has been collected through online studies in Germany and the US. We used the participants' implicit feedback regarding their interest in an article to build their click history, and the explicit feedback in terms of news click behaviors to construct the impression logs. To protect user privacy, we assign each user an anonymized ID.

    The German and English user datasets are zip-compressed folders, which contain two files each.

    NeMig User Dataset File Description

    behaviors.tsv
        The click history and impression logs of users.

    demographics_politics.tsv
        Demographic and political information of users.

    The behaviors.tsv file contains the users' news click histories and the impression logs. It has 4 columns divided by the tab symbol:

    Impression ID: the ID of an impression.

    User ID: The anonymized ID of a user.

    Click History: The news click history (list of news IDs) of a user before an impression.

    Impression Log: List of news displayed to the user in a session and the user's click behavior on them (1 for click, 0 for non-click).
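The four-column layout described above could be read along the following lines. This is a sketch under stated assumptions: the file is assumed to have no header row, the click history is assumed to be a space-separated list of news IDs, and each impression entry is assumed to look like `newsID-1` or `newsID-0`. The description does not confirm these details, and the class and function names are hypothetical.

```python
import csv
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Impression:
    impression_id: str
    user_id: str
    history: list[str]              # news IDs clicked before this impression
    shown: list[tuple[str, bool]]   # (news ID, clicked?) for each displayed item


def parse_behaviors(lines: Iterable[str]) -> Iterator[Impression]:
    """Parse tab-separated behaviors rows into Impression records."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 4:
            continue  # skip rows that do not have the four expected columns
        impression_id, user_id, history, log = row
        shown = []
        for item in log.split():
            # Assumed entry format: "<newsID>-<0 or 1>"; split on the last hyphen.
            news_id, _, label = item.rpartition("-")
            shown.append((news_id, label == "1"))
        yield Impression(impression_id, user_id, history.split(), shown)
```

In practice the function would be fed an open file, for example `parse_behaviors(open("behaviors.tsv", encoding="utf-8"))`, and the assumptions should be checked against the first few rows before relying on the output.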

    The demographics_politics.tsv file contains detailed information about the users' demographics and political interests. It has columns divided by the tab symbol. An explanation of all the columns and the questions used in the online studies to collect this information is shown in the table below.

    Demographic and political user data description

    Each column name is followed by the question and scale used in the German study and in the English study.

    Demographics

    Gender
        Question in German study: Bitte geben Sie Ihr Geschlecht an
        Scale in German: 0 = männlich; 1 = weiblich; 2 = divers; 3 = Keine Angabe
        Question in English study: Please indicate your gender.
        Scale in English: 0 = male; 1 = female; 2 = other; 3 = no answer

    Age
        Question in German study: Bitte geben Sie Ihr Alter an
        Scale in German: 1-120
        Question in English study: Please indicate your age.
        Scale in English: 1-120

    Qualification
        Question in German study: Welches ist Ihr höchster Bildungsabschluss?
        Scale in German: 0 = Kein Schulabschluss; 1 = Haupt-/Gesamtschulabschluss; 2 = Realschulabschluss, Mittlere Reife, Fachschulreife; 3 = Fachhochschulreife, Abitur; 4 = Studium mit Abschluss; 5 = Promotion; 6 = Keine Angabe
        Question in English study: Please indicate your highest educational qualification.
        Scale in English: 0 = less than high school; 1 = high school/GED; 2 = Vo-tech/business school; 3 = some college; 4 = college degree; 5 = university degree; 6 = doctoral degree; 7 = no answer

    Nationality
        Question in German study: Welche Staatsangehörigkeit besitzen Sie?
        Scale in German: 0 = Nur die deutsche Staatsangehörigkeit; 1 = Die deutsche und eine andere Staatsangehörigkeit; 2 = Nur eine andere Staatsangehörigkeit; 3 = Keine Angabe
        Question in English study: What is your citizenship?
        Scale in English: 0 = U.S. citizenship; 1 = U.S. and another non-U.S. citizenship; 2 = Only non-U.S. citizenship; 3 = No Answer

    BornIn
        Question in German study: Sind Sie in Deutschland geboren?
        Scale in German: 0 = Ja; 1 = Nein; 2 = Keine Angabe
        Question in English study: Were you born in the U.S.?
        Scale in English: 0 = Yes; 1 = No; 2 = No answer

    ParentsBornIn
        Question in German study: Sind Ihre Eltern in Deutschland geboren?
        Scale in German: 0 = Mein Vater und meine Mutter sind beide in Deutschland geboren; 1 = Mein Vater ist in Deutschland geboren, meine Mutter nicht; 2 = Meine Mutter ist in Deutschland geboren, mein Vater nicht; 3 = Weder meine Mutter noch mein Vater sind in Deutschland geboren; 4 = Keine Angabe
        Question in English study: Were your parents born in the U.S.?
        Scale in English: 0 = My father and my mother were both born in the U.S.; 1 = My father was born in the U.S., my mother was not; 2 = My mother was born in the U.S., my father was not; 3 = Neither my mother nor my father were born in the U.S.; 4 = No answer

    Income
        Question in German study: Was ist Ihr persönliches monatliches Nettoeinkommen (nach Abzug der Steuern)? Bitte geben Sie eine ungefähre Schätzung an, falls Sie die genaue Zahl nicht kennen.
        Scale in German: 0 = Weniger als 1000 €; 1 = 1001 € bis 2000 €; 2 = 2001 € bis 3000 €; 3 = 3001 € bis 4000 €; 4 = 4001 € bis 5000 €; 5 = Mehr als 5000 €; 6 = Keine Angabe
        Question in English study: What is your personal monthly net income (after taxes)? Please give an approximate estimation in case you are unsure.
        Scale in English: 0 = Less than 1000 $; 1 = 1001 $ to 2000 $; 2 = 2001 $ to 3000 $; 3 = 3001 $ to 4000 $; 4 = 4001 $ to 5000 $; 5 = More than 5000 $; 6 = No Answer

    Empathy
        Question in German study: Wie sehr stimmen Sie den folgenden Aussagen zu?
        Scale in German: 7-point Likert scale; 1 = Trifft überhaupt nicht zu; 7 = Trifft voll und ganz zu
        Question in English study: How strongly do you agree with the following statements?
        Scale in English: 7-point Likert scale; 1 = Strongly disagree; 7 = Strongly agree

    EMP1
        Question in German study: Wenn jemand anderes erfreut ist, tendiere ich dazu auch erfreut zu sein.
        Question in English study: When someone else is feeling excited, I tend to get excited too.

    EMP2
        Question in German study: Es regt mich auf, wenn jemand respektlos behandelt wird.
        Question in English study: It upsets me to see someone being treated disrespectfully.

    EMP3
        Question in German study: Es macht mir Freude, andere aufzumuntern.
        Question in English study: I enjoy making other people feel better.

    EMP4
        Question in German study: Ich bin besorgt um Personen, die weniger Glück haben als ich.
        Question in English study: I have tender, concerned feelings for people less fortunate than me.

    EMP5
        Question in German study: Ich fühle, wenn andere traurig sind, selbst wenn sie nichts sagen.
        Question in English study: I can tell when others are sad even when they do not say anything.

    EMP6
        Question in German study: Meistens bin ich mit den Stimmungen anderer Leute im Einklang.
        Question in English study: I find that I am “in tune” with other people’s moods.

    EMP7
        Question in German study: Ich empfinde einen starken Drang zu helfen, wenn ich jemanden sehe, der aufgebracht ist.
        Question in English study: I get a strong urge to help when I see someone who is upset.

    EMP8
        Question in German study: Wenn ich jemanden sehe, der ausgenutzt wird, möchte ich die Person beschützen.
        Question in English study: When I see someone being taken advantage of, I feel kind of protective towards him\her.
    
    
    
        Big5
        Ich
    
  17. MLCC Multilingual and Parallel Corpora

    • catalog.elra.info
    • live.european-language-grid.eu
    Updated May 23, 2012
    + more versions
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2012). MLCC Multilingual and Parallel Corpora [Dataset]. https://catalog.elra.info/en-us/repository/browse/ELRA-W0023/
    Explore at:
    Dataset updated
    May 23, 2012
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    The MLCC text corpus has two main components - one set to allow comparable studies to be carried out in different languages and one set as the basis for translation studies.

    The first set is referred to as the Polylingual Document Collection, a collection of newspaper articles from financial newspapers in 6 languages (Dutch, English, French, German, Italian and Spanish). It consists of the following sub-corpora:

    Dutch - Het Financieele Dagblad - 1992-1993 (Samples): The corpus contains articles from the Dutch financial newspaper Het Financieele Dagblad editions of 2nd January 1992 through to 24th December 1993. It contains around 8.5 million words of text.

    English - The Financial Times - 1993 (Samples): The corpus contains articles from the British financial newspaper The Financial Times editions from the year 1993. The corpus contains around 30 million words.

    French - Le Monde - 1992-1993 (Samples): A corpus of articles from the French newspaper Le Monde, consisting of two years' worth (1992-1993) of articles on financial subjects, approximately 10 million words.

    German - Handelsblatt - 1986-1988 (Samples): This subcorpus consists of articles from the period 02.01.1986 to 15.06.1988. It contains some 33 million words. It may be possible to obtain more recent articles from Handelsblatt.

    Italian - Il Sole 24 Ore - 1992-1993 (Samples): The corpus described here contains articles from the Italian financial newspaper Il Sole 24 Ore from the year 1992. This corpus contains some 1.88 million words. The SGML-markup was done by the University of Edinburgh.

    Spanish - Expansion - 1994 (Samples): This subcorpus contains articles from the Spanish financial newspaper Expansion editions from 21.10.1991 to 24.10.1991 and 14.05.1994 to 27.12.1994. It contains some 10 million words.

    The second set is a Multilingual Parallel Corpus consisting of translated data in nine European languages: Danish, Dutch, English, French, German, Greek, Italian, Portuguese and Spanish. The parallel data, provided by the European Commission, comprises two sub-corpora from the Official Journal of the European Communities:

    Official Journal of the European Commission, C Series: Written Questions 1993: Records of questions and answers regarding European Community matters. The data is regularly published as one section of the C Series of the Official Journal of the European Community in all official languages (previously nine). This corpus contains written questions asked by members of the European Parliament and corresponding answers from the European Commission in 9 parallel versions. The total size of the corpus is approximately 10.2 million words (ca. 1.1 million words per language).

    Official Journal of the European Commission, Annex: Debates of the European Parliament 1992-1994: This parallel corpus is the records of Parliamentary sitting published as an annex to the Official Journal of the European Community Debates of the European Parliament. The Parliamentary Debates are a record of what was said by mem...

  18. spc

    • huggingface.co
    • opendatalab.com
    Updated Aug 29, 2023
    Cite
    Language Technology Research Group at the University of Helsinki (2023). spc [Dataset]. https://huggingface.co/datasets/Helsinki-NLP/spc
    Explore at:
    Dataset updated
    Aug 29, 2023
    Dataset authored and provided by
    Language Technology Research Group at the University of Helsinki
    License

    https://choosealicense.com/licenses/unknown/

    Description

    This is a collection of parallel corpora collected by Hercules Dalianis and his research group for bilingual dictionary construction. More information in: Hercules Dalianis, Hao-chun Xing, Xin Zhang: Creating a Reusable English-Chinese Parallel Corpus for Bilingual Dictionary Construction, In Proceedings of LREC2010 (source: http://people.dsv.su.se/~hercules/SEC/) and Konstantinos Charitakis (2007): Using Parallel Corpora to Create a Greek-English Dictionary with UPLUG, In Proceedings of NODALIDA 2007. Afrikaans-English: Aldin Draghoender and Mattias Kanhov: Creating a reusable English – Afrikaans parallel corpora for bilingual dictionary construction

    4 languages, 3 bitexts
    total number of files: 6
    total number of tokens: 1.32M
    total number of sentence fragments: 0.15M


Cite
Statista (2025). Ranking of languages spoken at home in the U.S. 2023 [Dataset]. https://www.statista.com/statistics/183483/ranking-of-languages-spoken-at-home-in-the-us-in-2008/

Ranking of languages spoken at home in the U.S. 2023

Explore at:
14 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Apr 14, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
2023
Area covered
United States
Description

In 2023, around 43.37 million people in the United States spoke Spanish at home. In comparison, approximately 998,179 people were speaking Russian at home during the same year. The distribution of the U.S. population by ethnicity can be accessed here. A ranking of the most spoken languages across the world can be accessed here.
