100+ datasets found
  1. d

    Data from: Domestic and International Common Language Database (DICL)

    • researchdiscovery.drexel.edu
    Updated Mar 11, 2025
    + more versions
    Cite
    Tamara Gurevich; Peter Herman; Farid Toubal; Yoto Yotov (2025). Domestic and International Common Language Database (DICL) [Dataset]. https://researchdiscovery.drexel.edu/esploro/outputs/dataset/Domestic-and-International-Common-Language-Database/991022032773104721
    Dataset updated
    Mar 11, 2025
    Dataset provided by
    United States International Trade Commission
    Authors
    Tamara Gurevich; Peter Herman; Farid Toubal; Yoto Yotov
    Time period covered
    2024
    Description

    The database contains index measures of linguistic similarity both domestically and internationally. The domestic measures capture linguistic similarities present among populations within a single country, while the international indexes capture language similarities between two different countries. The 8 indexes reflect three different aspects of language: common official languages, common native and acquired spoken languages, and linguistic proximity across different languages. This database has many uses, such as in models of bilateral flows (including FDI, migration, and international trade) as well as in regional or country-level analyses. Extensive and detailed coverage: bilateral indexes for 242 countries, based on 6,674 individual languages.

  2. d

    Languages Spoken in Iowa

    • catalog.data.gov
    • datasets.ai
    • +1more
    Updated Dec 6, 2024
    Cite
    data.iowa.gov (2024). Languages Spoken in Iowa [Dataset]. https://catalog.data.gov/dataset/languages-spoken-in-iowa
    Dataset updated
    Dec 6, 2024
    Dataset provided by
    data.iowa.gov
    Area covered
    Iowa
    Description

    Data portal and tool to better assist state agencies, policymakers and community-based organizations statewide in harnessing this data to improve operations, streamline outreach initiatives and meet federal-funding obligations.

  3. E

    Collins Multilingual database (MLD) - WordBank

    • live.european-language-grid.eu
    • catalogue.elra.info
    Updated Dec 7, 2016
    + more versions
    Cite
    (2016). Collins Multilingual database (MLD) - WordBank [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/1496
    Dataset updated
    Dec 7, 2016
    License

    http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    The Collins Multilingual database covers Real Life Daily vocabulary. It is composed of a multilingual lexicon in 32 languages (the WordBank) and a multilingual set of sentences in 28 languages (the PhraseBank, distributed separately under reference ELRA-T0377).

    The WordBank contains 10,000 words for each language (Arabic, Chinese, Croatian, Czech, Danish, Dutch, American English, British English, Finnish, French, German, Greek, Italian, Japanese, Korean, Norwegian, Polish, Portuguese (Iberian), Portuguese (Brazilian), Russian, Spanish (Iberian), Spanish (Latin American), Swedish, Thai, Turkish, Vietnamese, Hindi, Tamil, Bengali, Malayalam, Romanian, Ukrainian), XML-annotated for part-of-speech, gender, irregular forms and disambiguating information for homographs. An additional dataset of 10,000 headwords is included for 12 languages (Chinese, American and British English, French, German, Italian, Japanese, Korean, Iberian and Brazilian Portuguese, Iberian and Latin American Spanish).

    All English headwords contain Cobuild learner’s dictionary style definitions and one or more examples of the word in context.

    Lemmatized lists and verb tables are available for English, French, German, Spanish and Italian. Romanization is provided for Chinese, Japanese, Korean and Thai.

    The corresponding audio files are available for 26 of the 32 languages (thus excluding Hindi, Tamil, Bengali, Malayalam, Romanian and Ukrainian) and are distributed in a package referenced ELRA-S0382.

  4. a

    Languages spoken by tract, ACS

    • hub.arcgis.com
    • massachsuetts-environmental-justice-datasets-mass-eoeea.hub.arcgis.com
    Updated May 19, 2021
    Cite
    MA Executive Office of Energy and Environmental Affairs (2021). Languages spoken by tract, ACS [Dataset]. https://hub.arcgis.com/datasets/6c8c34fa83564796b564bdad99be912c
    Dataset updated
    May 19, 2021
    Dataset authored and provided by
    MA Executive Office of Energy and Environmental Affairs
    Area covered
    Description

    The American Community Survey, Table B16001 provided detailed individual-level language estimates at the tract level of 42 non-English language categories, tabulated by English-speaking ability. Two sets of language data are included here, with population counts and percentages for both: the tract population speaking languages other than English, regardless of English-speaking ability, identified by the language name; and the languages spoken other than English by the tract population who do not speak English 'very well', identified by the language name followed by "_Enw". The default pop-up for this service presents the second of these data: languages spoken other than English by the tract population who do not speak English 'very well'. In part because of privacy concerns with the very small counts in some categories in Table B16001, the Census changed the American Community Survey estimates of the languages spoken by individuals. In 2016, the number of categories previously presented in Table B16001 was reduced to reflect the most commonly spoken languages, and several languages spoken in Massachusetts were grouped into generalized (i.e., "Other...") categories. Table B16001 has been renamed Table C16001 with these generalized categories. Therefore, the information presented in this datalayer is not current, and these data cannot be updated.

  5. Z

    Cariban Lexical Database (CaLeD)

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 21, 2024
    Cite
    Ferraz Gerardi, Fabrício (2024). Cariban Lexical Database (CaLeD) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10019096
    Dataset updated
    Apr 21, 2024
    Dataset provided by
    Ferraz Gerardi, Fabrício
    Orphão de Carvalho, Fernando
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains a comprehensive collection of lexical items from various languages within the Carib linguistic family. It is structured to facilitate computational historical linguistics analysis, offering detailed information on language characteristics, word forms, and cognacy judgments. The data is curated to support research in linguistic typology, historical linguistics, and related fields.

    Data Structure

    The dataset is presented in a TSV (Tab-Separated Values) format, ensuring easy integration with common data analysis tools. Each lexical item in the dataset is detailed with multiple linguistic attributes, including phonological transcriptions, morphological analysis, and cognacy information. The following table summarizes the fields included in the dataset:

    Field Name         Data Type  Description
    -----------------  ---------  -----------
    ID                 string     Unique identifier for each dataset entry.
    ID_lang            string     Unique identifier for the language within the dataset.
    Glottocode         string     Code uniquely identifying the language in the Glottolog database.
    Glottolog_Name     string     Name of the language as recorded in the Glottolog database.
    ISO639P3code       string     ISO 639-3 code for the language.
    ID_param           string     Unique identifier for the linguistic parameter or concept within the dataset.
    Concepticon_ID     integer    Identifier for the concept in the Concepticon database.
    Concepticon_Gloss  string     Gloss or definition of the concept from the Concepticon database.
    Value              string     Value of the linguistic data point, typically a word or phrase in the language.
    Form               string     Phonetic or phonological transcription of the linguistic data point.
    Segments           string     Further phonetic or phonological breakdown of the form.
    Source             string     Reference to the source or citation where the data was obtained.
    Morphemes          string     Morphological breakdown of the form.
    SimpleCognate      integer    Cognacy judgment, indicating whether the form is cognate with forms of the same meaning in related languages.
    PartialCognates    string     Partial cognacy coding, detailing the cognacy of individual segments or morphemes.
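
    As a rough illustration of how the fields listed above might be used, the following Python sketch reads tab-separated rows and groups forms by concept and cognacy code. The column names come from the field list. The sample rows, the language names, and the reading of SimpleCognate as a cognate-class number shared by related forms are assumptions made for this example, not details confirmed by the dataset description.

```python
import csv
import io
from collections import defaultdict

# Invented placeholder rows (NOT real CaLeD data), using a subset of the
# documented columns. A real file would be opened with open(path, newline="").
SAMPLE_TSV = (
    "ID\tGlottolog_Name\tConcepticon_Gloss\tForm\tSimpleCognate\n"
    "1\tLangA\tWATER\ttuna\t1\n"
    "2\tLangB\tWATER\ttuna\t1\n"
    "3\tLangC\tWATER\tparu\t2\n"
)


def group_by_cognate(tsv_file):
    """Group (language, form) pairs by (concept gloss, SimpleCognate value)."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    groups = defaultdict(list)
    for row in reader:
        # Assumption: SimpleCognate is an integer class label per concept.
        key = (row["Concepticon_Gloss"], int(row["SimpleCognate"]))
        groups[key].append((row["Glottolog_Name"], row["Form"]))
    return dict(groups)


cognate_sets = group_by_cognate(io.StringIO(SAMPLE_TSV))
for (gloss, cognate_class), members in sorted(cognate_sets.items()):
    print(gloss, cognate_class, members)
```

    Grouping on the pair (concept, cognacy code) rather than on the code alone avoids merging unrelated forms if the same number is reused across concepts.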

    Intended Use

    This dataset is intended for researchers and linguists specializing in the Carib linguistic family. It provides valuable insights into the lexical similarities and differences across the languages within this family, supporting studies on language evolution, relationships, and structure.

    Additional Resources

    Metadata for Validation: This dataset comes with comprehensive metadata following the Frictionless Data standard, ensuring that the data structure and types are accurately described for validation purposes. This metadata aids in maintaining the integrity and usability of the data across various computational platforms and research projects.

    CLDF Version Available: For researchers utilizing the Cross-Linguistic Data Formats (CLDF), a version of this dataset is available in CLDF specifications. This version is provided as a zipped file, facilitating easier distribution and handling.

  6. E

    Collins Multilingual database (MLD) - PhraseBank

    • live.european-language-grid.eu
    • catalogue.elra.info
    Updated Dec 7, 2016
    Cite
    (2016). Collins Multilingual database (MLD) - PhraseBank [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/2572
    Dataset updated
    Dec 7, 2016
    License

    http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    The Collins Multilingual database covers Real Life Daily vocabulary. It is composed of a multilingual lexicon in 32 languages (the WordBank, distributed separately under reference ELRA-T0376) and a multilingual set of sentences in 28 languages (the PhraseBank).

    The PhraseBank consists of 2,000 phrases in 28 languages (Arabic, Chinese, Croatian, Czech, Danish, Dutch, American English, British English, Farsi, Finnish, French, German, Greek, Hindi, Italian, Japanese, Korean, Norwegian, Polish, Portuguese (Iberian), Portuguese (Brazilian), Russian, Spanish (Iberian), Spanish (Latin American), Swedish, Thai, Turkish, Vietnamese). Phrases are organised under 12 main topics and 67 subtopics. Covered topics are: talking to people, getting around, accommodation, shopping, leisure, communications, practicalities, health and beauty, eating and drinking, time.

    Romanization is provided for Arabic, Farsi and Hindi.

    Audio files corresponding to each phrase are available and are distributed in a package referenced ELRA-S0383.

  7. d

    Preferred Language Spoken in California Facilities

    • catalog.data.gov
    • data.ca.gov
    • +2more
    Updated Nov 27, 2024
    + more versions
    Cite
    Department of Health Care Access and Information (2024). Preferred Language Spoken in California Facilities [Dataset]. https://catalog.data.gov/dataset/preferred-language-spoken-in-california-facilities-114e4
    Dataset updated
    Nov 27, 2024
    Dataset provided by
    Department of Health Care Access and Information
    Area covered
    California
    Description

    The dataset contains combined counts for hospital discharges, emergency room encounters, and ambulatory surgeries by preferred language spoken at each facility. The nearly 100 languages collected in the patient-level data were combined into eight geographical or cultural groups: English Language, Spanish Language, Asian/Pacific Islander Languages, Middle Eastern Languages, European Languages, African Languages, Latin American Languages, Native American Languages, and Sign Language. See the Preferred Language Spoken Language List below to see the exact separation of languages.

  8. i

    Spoken Indian Language Identification Database

    • ieee-dataport.org
    Updated Dec 28, 2022
    Cite
    Sunil Kumar Kopparapu (2022). Spoken Indian Language Identification Database [Dataset]. https://ieee-dataport.org/open-access/spoken-indian-language-identification-database
    Dataset updated
    Dec 28, 2022
    Authors
    Sunil Kumar Kopparapu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    India
    Description

    Spoken Indian Language Identification Database(9 languages

  9. f

    Iconicity Patterns in Sign Languages

    • uvaauas.figshare.com
    txt
    Updated May 31, 2023
    Cite
    V. Kimmelman; George Moroz; Anna Klezovich (2023). Iconicity Patterns in Sign Languages [Dataset]. http://doi.org/10.21942/uva.6850298.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    University of Amsterdam / Amsterdam University of Applied Sciences
    Authors
    V. Kimmelman; George Moroz; Anna Klezovich
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a database containing 1542 signs from 19 sign languages in 7 semantic fields annotated according to several iconicity features. Please see the website link below for all the details, or read the following paper describing the database: Kimmelman, V., Klezovich, A., & Moroz, G. (2018). IPSL: A Database of Iconicity Patterns in Sign Languages. Creation and Use. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) (pp. 4230-4234). ELRA. URL: http://www.lrec-conf.org/proceedings/lrec2018/summaries/102.html

  10. Z

    Linkeast: A lexical database of Eastern Polynesian Languages

    • data.niaid.nih.gov
    Updated Mar 12, 2023
    Cite
    François, Alexandre (2023). Linkeast: A lexical database of Eastern Polynesian Languages [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7718955
    Dataset updated
    Mar 12, 2023
    Dataset provided by
    François, Alexandre
    Walworth, Mary
    Talfer, Hugues
    Vernaudon, Jacques
    Charpentier, Jean-Michel
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Area covered
    Polynesia
    Description

    This dataset is an extraction of the data contained in LinkEast, a lexical database of Eastern Polynesian languages. LinkEast is housed at the Université de la Polynésie française, within Anareo, a digital infrastructure dedicated to research on and documentation of the languages of French Polynesia. This data (v1.0) is based primarily on the Linguistic Atlas of French Polynesia (Charpentier & François, 2015).

    References: Charpentier, Jean-Michel, & François, Alexandre, 2015, Atlas linguistique de la Polynésie française, Berlin et Papeete, Mouton de Gruyter et Université de la Polynésie française.

  11. a

    Languages and English Ability - Seattle Neighborhoods

    • data-seattlecitygis.opendata.arcgis.com
    • data.seattle.gov
    • +4more
    Updated Feb 22, 2024
    Cite
    City of Seattle ArcGIS Online (2024). Languages and English Ability - Seattle Neighborhoods [Dataset]. https://data-seattlecitygis.opendata.arcgis.com/datasets/SeattleCityGIS::languages-and-english-ability-seattle-neighborhoods
    Dataset updated
    Feb 22, 2024
    Dataset authored and provided by
    City of Seattle ArcGIS Online
    Area covered
    Seattle
    Description

    Table from the American Community Survey (ACS) 5-year series on languages spoken and English ability related topics for City of Seattle Council Districts, Comprehensive Plan Growth Areas and Community Reporting Areas. Table includes B16004 Age by Language Spoken at Home by Ability to Speak English, C16002 Household Language by Household Limited English-Speaking Status. Data is pulled from block group tables for the most recent ACS vintage and summarized to the neighborhoods based on block group assignment.

    Table created for and used in the Neighborhood Profiles application.

    Vintages: 2023
    ACS Table(s): B16004, C16002
    Data downloaded from: Census Bureau's Explore Census Data

    The United States Census Bureau's American Community Survey (ACS): About the Survey; Geography & ACS; Technical Documentation; News & Updates

    This ready-to-use layer can be used within ArcGIS Pro, ArcGIS Online, its configurable apps, dashboards, Story Maps, custom apps, and mobile apps. Data can also be exported for offline workflows. Please cite the Census and ACS when using this data.

    Data Note from the Census:

    Data are based on a sample and are subject to sampling variability. The degree of uncertainty for an estimate arising from sampling variability is represented through the use of a margin of error. The value shown here is the 90 percent margin of error. The margin of error can be interpreted as providing a 90 percent probability that the interval defined by the estimate minus the margin of error and the estimate plus the margin of error (the lower and upper confidence bounds) contains the true value. In addition to sampling variability, the ACS estimates are subject to nonsampling error (for a discussion of nonsampling variability, see Accuracy of the Data). The effect of nonsampling error is not represented in these tables.

    Data Processing Notes:

    • Boundaries come from the US Census TIGER geodatabases, specifically, the National Sub-State Geography Database (named tlgdb_(year)_a_us_substategeo.gdb). Boundaries are updated at the same time as the data updates (annually), and the boundary vintage appropriately matches the data vintage as specified by the Census. These are Census boundaries with water and/or coastlines erased for cartographic and mapping purposes. For census tracts, the water cutouts are derived from a subset of the 2020 Areal Hydrography boundaries offered by TIGER. Water bodies and rivers which are 50 million square meters or larger (mid to large sized water bodies) are erased from the tract level boundaries, as well as additional important features. For state and county boundaries, the water and coastlines are derived from the coastlines of the 2020 500k TIGER Cartographic Boundary Shapefiles. These are erased to more accurately portray the coastlines and Great Lakes. The original AWATER and ALAND fields are still available as attributes within the data table (units are square meters).
    • The States layer contains 52 records: all US states, Washington D.C., and Puerto Rico.
    • Census tracts with no population that occur in areas of water, such as oceans, are removed from this data service (Census Tracts beginning with 99).
    • Percentages and derived counts, and associated margins of error, are calculated values (that can be identified by the "_calc_" stub in the field name), and abide by the specifications defined by the American Community Survey.
    • Field alias names were created based on the Table Shells file available from the American Community Survey Summary File Documentation page.
    • Negative values (e.g., -4444...) have been set to null, with the exception of -5555... which has been set to zero. These negative values exist in the raw API data to indicate the following situations:
        • The margin of error column indicates that either no sample observations or too few sample observations were available to compute a standard error and thus the margin of error. A statistical test is not appropriate.
        • Either no sample observations or too few sample observations were available to compute an estimate, or a ratio of medians cannot be calculated because one or both of the median estimates falls in the lowest interval or upper interval of an open-ended distribution.
        • The median falls in the lowest interval of an open-ended distribution, or in the upper interval of an open-ended distribution. A statistical test is not appropriate.
        • The estimate is controlled. A statistical test for sampling variability is not appropriate.
        • The data for this geographic area cannot be displayed because the number of sample cases is too small.
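
    The margin-of-error interpretation and the negative-value handling described above can be sketched in Python as follows. This is a minimal illustration, not the publisher's processing code: the function names are hypothetical, the example numbers are invented, and because the description elides the exact sentinel values ("-4444...", "-5555..."), the "-5555" case is matched by prefix only.

```python
from typing import Optional, Tuple


def confidence_bounds(estimate: float, moe: float) -> Tuple[float, float]:
    """Lower and upper 90 percent bounds: estimate minus/plus margin of error."""
    return estimate - moe, estimate + moe


def clean_sentinel(value: Optional[float]) -> Optional[float]:
    """Apply the stated rule for negative sentinel values in raw API data.

    Negative values become None (null), except values beginning with
    "-5555", which become 0. Prefix matching is an assumption, used
    because the full sentinel values are not given in the description.
    """
    if value is None or value >= 0:
        return value
    if str(int(value)).startswith("-5555"):
        return 0
    return None


# Invented example: an estimate of 1200 speakers with a margin of error of 150.
lower, upper = confidence_bounds(1200, 150)
print(lower, upper)  # 1050 1350
```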

  12. 785 Million Language Translation Database for AI

    • kaggle.com
    Updated Aug 28, 2023
    Cite
    Ramakrishnan Lakshmanan (2023). 785 Million Language Translation Database for AI [Dataset]. https://www.kaggle.com/datasets/ramakrishnan1984/785-million-language-translation-database-ai-ml
    Dataset updated
    Aug 28, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ramakrishnan Lakshmanan
    License

    http://www.gnu.org/licenses/lgpl-3.0.html

    Description

    Our groundbreaking translation dataset represents a monumental advancement in the field of natural language processing and machine translation. Comprising a staggering 785 million records, this corpus bridges language barriers by offering translations from English to an astonishing 548 languages. The dataset promises to be a cornerstone resource for researchers, engineers, and developers seeking to enhance their machine translation models, cross-lingual analysis, and linguistic investigations.

    Size of the dataset: 41 GB uncompressed, 20 GB compressed

    Key Features:

    Scope and Scale: With a comprehensive collection of 785 million records, this dataset provides an unparalleled wealth of translated text. Each record consists of an English sentence paired with its translation in one of the 548 target languages, enabling multi-directional translation applications.

    Language Diversity: Encompassing translations into 548 languages, this dataset represents a diverse array of linguistic families, dialects, and scripts. From widely spoken languages to those with limited digital representation, the dataset bridges communication gaps on a global scale.

    Quality and Authenticity: The translations have been meticulously curated, verified, and cross-referenced to ensure high quality and authenticity. This attention to detail guarantees that the dataset is not only extensive but also reliable, serving as a solid foundation for machine learning applications. The data was collected from various open datasets for my personal ML projects, and I am looking to share it with the team.

    Use Case Versatility: Researchers and practitioners across a spectrum of domains can harness this dataset for a myriad of applications. It facilitates the training and evaluation of machine translation models, empowers cross-lingual sentiment analysis, aids in linguistic typology studies, and supports cultural and sociolinguistic investigations.

    Machine Learning Advancement: Machine translation models, especially neural machine translation (NMT) systems, can leverage this dataset to enhance their training. The large-scale nature of the dataset allows for more robust and contextually accurate translation outputs.

    Fine-tuning and Customization: Developers can fine-tune translation models using specific language pairs, offering a powerful tool for specialized translation tasks. This customization capability ensures that the dataset is adaptable to various industries and use cases.

    Data Format: The dataset is provided in a structured JSON format, facilitating easy integration into existing machine learning pipelines. This structured approach expedites research and experimentation. Each JSON record contains an English word and its equivalent word in the target language. Data was exported from a MongoDB database to ensure the uniqueness of the records. Each record is unique, and the records are sorted.

    Access: The dataset is available for academic and research purposes, enabling the global AI community to contribute to and benefit from its usage. A well-documented API and sample code are provided to expedite exploration and integration.
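
    As a sketch of how records in the JSON format described under Data Format might be consumed, the following Python example streams records one at a time instead of loading the full 41 GB into memory. The one-record-per-line layout and the key names "english", "translation" and "language" are assumptions made for illustration; the description does not state the actual schema, and the sample records are invented.

```python
import io
import json
from typing import Iterator, TextIO, Tuple

# Invented sample records; the key names are hypothetical, not the real schema.
SAMPLE_JSONL = (
    '{"english": "water", "translation": "eau", "language": "fr"}\n'
    '{"english": "water", "translation": "agua", "language": "es"}\n'
)


def iter_pairs(lines: TextIO, language: str) -> Iterator[Tuple[str, str]]:
    """Yield (english, translation) pairs for one target language.

    Reads one JSON object per line, so memory use stays constant
    regardless of file size.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("language") == language:
            yield record["english"], record["translation"]


french_pairs = list(iter_pairs(io.StringIO(SAMPLE_JSONL), "fr"))
print(french_pairs)
```

    For a real file, the io.StringIO object would be replaced by an open file handle; if the export is a single JSON array rather than one object per line, an incremental JSON parser would be needed instead.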

    The English-to-548-languages translation dataset represents an incredible leap forward in advancing multilingual communication, breaking down barriers to understanding, and fostering collaboration on a global scale. It holds the potential to reshape how we approach cross-lingual communication, linguistic studies, and the development of cutting-edge translation technologies.

    Dataset Composition: The dataset is a culmination of translations from English, a widely spoken and understood language, into 548 distinct languages. Each language represents a unique linguistic and cultural background, providing a rich array of translation contexts. This diverse range of languages spans across various language families, regions, and linguistic complexities, making the dataset a comprehensive repository for linguistic research.

    Data Volume and Scale: With a staggering 785 million records, the dataset boasts an immense scale that captures a vast array of translations and linguistic nuances. Each translation entry consists of an English source text paired with its corresponding translation in one of the 548 target languages. This vast corpus allows researchers and practitioners to explore patterns, trends, and variations across languages, enabling the development of robust and adaptable translation models.

    Linguistic Coverage: The dataset covers an extensive set of languages, including but not limited to Indo-European, Afroasiatic, Sino-Tibetan, Austronesian, Niger-Congo, and many more. This broad linguistic coverage ensures that languages with varying levels of grammatical complexity, vocabulary richness, and syntactic structures are included, enhancing the applicability of translation models across diverse linguistic landscapes.

    Dataset Preparation: The translation ...

  13. Health Workforce Languages

    • data.chhs.ca.gov
    • data.ca.gov
    • +1more
    xlsx
    Updated Aug 28, 2024
    Cite
    Department of Health Care Access and Information (2024). Health Workforce Languages [Dataset]. https://data.chhs.ca.gov/dataset/health-workforce-languages
    Explore at:
    Available download formats: xlsx (21509), xlsx (23671), xlsx (772183), xlsx (16659)
    Dataset updated
    Aug 28, 2024
    Dataset authored and provided by
    Department of Health Care Access and Information
    Description

    This dataset contains statistically weighted estimates of the languages spoken by 47 key health workforce professions actively licensed in California as of July 1st, 2023. These metrics can be compared by US Census Bureau language group, workforce category, license type, time since license issue date (in years), and CHIS region.

  14. s

    Data from: Linguistic development in L2 Spanish: creation and analysis of a...

    • eprints.soton.ac.uk
    Updated May 6, 2023
    Cite
    Mitchell, Rosamond; Marsden, Emma; Myles, Florence (2023). Linguistic development in L2 Spanish: creation and analysis of a learner corpus [Dataset]. http://doi.org/10.5255/UKDA-SN-850024
    Dataset updated
    May 6, 2023
    Dataset provided by
    UK Data Archive
    Authors
    Mitchell, Rosamond; Marsden, Emma; Myles, Florence
    Description

    This project had two aims: to establish a small-scale, high-quality database of spoken learner Spanish, and to undertake a short programme of substantive research into L2 (second language) Spanish. The data was collected from classroom learners of Spanish (with English as their first language), from beginners to advanced level, using specially designed elicitation tasks. For comparison purposes, native speakers were also recorded undertaking the same tasks. The resulting database contains digital sound files of learner speech, accompanied by transcripts in CHILDES (Child Language Data Exchange System) format which are tagged for parts of speech. The material will be made freely available for use among the Spanish second language acquisition research community, through a specially created website. The substantive research programme investigates the acquisition of central morphosyntactic properties of Spanish, such as word order, clitic pronouns, verbal morphology and wh-questions, providing a description and analysis of developmental sequences of L2 Spanish from an interface perspective. Phenomena such as the role of rote-learned formulas in instructed L2 Spanish were also studied. Research such as this enables us to better understand the processes involved in learning a second language in a classroom setting, and thus supports curriculum design for instructed L2 programmes.

  15. r

    Data from: Loanwords and native words in the Nordic languages database...

    • researchdata.se
    Updated Jan 8, 2025
    Cite
    Matteo Tarsi (2025). Loanwords and native words in the Nordic languages database (c.1550–c. 1900) [Dataset]. http://doi.org/10.57804/gyz8-ns75
    Explore at:
    Available download formats: (793708)
    Dataset updated
    Jan 8, 2025
    Dataset provided by
    Uppsala University
    Authors
    Matteo Tarsi
    Description

    Excel file that contains data on loanwords and native words in the Nordic languages during the period 1550 to 1900.

    This data stems from a two-year research project on loanwords and native words in the Nordic languages (Swedish, Danish, Icelandic) in the period c. 1550-c. 1900. The data gathering focused on a set of meanings from different semantic fields in the quest for loanwords and native words throughout time. The semantic fields from which data was sampled are:

    Astrology Astronomy Buildings and Architectural Elements Chemical Elements Clothes and Accessories Food and Beverages Grammatical Terminology Law Learning Materials, Minerals, Textiles and Fabrics Mathematics and Geometry Medicine Musical Instruments Natural Products and Plants People and Honorific and Professional Titles Religious Terminology Statal Organization

    The dataset was originally published in DiVA and moved to SND in 2024.

  16. r

    Data from: DiACL - Diachronic Atlas of Comparative Linguistics

    • demo.researchdata.se
    • researchdata.se
    Updated Dec 16, 2019
    Cite
    Gerd Carling (2019). DiACL - Diachronic Atlas of Comparative Linguistics [Dataset]. https://demo.researchdata.se/en/catalogue/dataset/ext0269-1
    Explore at:
    Dataset updated
    Dec 16, 2019
    Dataset provided by
    Lund University
    Authors
    Gerd Carling
    Time period covered
    2013
    Description

    DiACL is an open access database with lexical and typological/morphosyntactic data for historical, comparative and phylogenetic linguistics. It contains data from 500 languages of 18 families, divided into three macro-areas: Eurasia, Pacific, and the Amazon. The database has the following content: 1) Lexical datasets with basic vocabularies (Swadesh lists), 2) Lexical datasets with culture vocabularies, focusing on subsistence system vocabulary, 3) Typological/morphosyntactic datasets including the main types Word Order, Alignment, and Nominal/Verbal Morphology. DiACL contains data from contemporary and historical languages, and, if possible, reconstructed languages. Data is derived from dictionaries and grammars, or from new fieldwork (in particular data from the Caucasus and the Amazon). All data is sourced in scientifically reliable literature.

    Purpose:

    The aim of the database is to make datasets for evolutionary and comparative linguistics available open access. The datasets, which are comparative and complete, span over large geographic areas, containing data for typology/morphosyntax and lexicon (basic vocabulary and culture vocabulary). Datasets can be used to investigate spatio-temporal and linguistic comparative correlations.

    Data has been compiled by analyzing grammars and dictionaries and by means of fieldwork. The population of data into the database is controlled by matrix documents, questionnaires and careful instructions, in order to create complete and comparable datasets. The process of population has been supervised by the database editor (Gerd Carling) and language experts for each language group.

    Cite as: Carling, Gerd (ed.) 2016/2017. Diachronic Atlas of Comparative Linguistics Online. (Available at: https://lundic.ht.lu.se/. Accessed on: z.).

    Data from individual languages should preferably be quoted by their source: NN, NN, NN, NN. Data set: x (basic vocabulary/culture vocabulary/typology), y (language). In: Carling, Gerd (ed.) 2016/2017. Diachronic Atlas of Comparative Linguistics Online. (Available at: https://lundic.ht.lu.se/. Accessed on: z.).

  17. n

    105,941 Images Natural Scenes OCR Data of 12 Languages

    • m.nexdata.ai
    • nexdata.ai
    Updated Sep 28, 2023
    + more versions
    Cite
    Nexdata (2023). 105,941 Images Natural Scenes OCR Data of 12 Languages [Dataset]. https://m.nexdata.ai/datasets/ocr/1064
    Explore at:
    Dataset updated
    Sep 28, 2023
    Dataset provided by
    nexdata technology inc
    Authors
    Nexdata
    Variables measured
    Device, Accuracy, Data size, Diversity, Image parameter, Annotation content, Collecting environment
    Description

    105,941 Images Natural Scenes OCR Data of 12 Languages. The data covers 12 languages (6 Asian languages, 6 European languages), multiple natural scenes, and multiple photographic angles. The text in each image is annotated with line-level quadrilateral bounding boxes and transcriptions. The data can be used for tasks such as multilingual OCR.

  18. E

    BABEL Polish database

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Apr 29, 2010
    + more versions
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2010). BABEL Polish database [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0307/
    Explore at:
    Dataset updated
    Apr 29, 2010
    Dataset provided by
    ELRA (European Language Resources Association)
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    The BABEL Polish Database is a speech database that was produced by a research consortium funded by the European Union under the COPERNICUS programme (COPERNICUS Project 1304). The project began in March 1995 and was completed in December 1998. The objective was to create a database of languages of Central and Eastern Europe in parallel to the EUROM1 databases produced by the SAM Project (funded by the ESPRIT programme). The BABEL consortium included six partners from Central and Eastern Europe (who had the major responsibility of planning and carrying out the recording and labelling) and six from Western Europe (whose role was mainly to advise and in some cases to act as host to BABEL researchers). The five databases collected within the project concern the Bulgarian, Estonian, Hungarian, Polish, and Romanian languages.

    The Polish database consists of the basic "common" set, which is:

    • The Many Talker Set: 30 males, 30 females; each to read 100 numbers, 3 connected passages and 5 “filler” sentences (or 4 passages if no fillers needed).
    • The Few Talker Set: 5 males, 5 females, normally selected from the above group: each to read 5 blocks of 100 numbers, 15 passages and 25 filler sentences (or 20 passages if fillers not needed), and 5 lists of syllables.
    • The Very Few Talker Set: 1 male, 1 female, selected from many-talker set: 5 blocks of syllables, with and without carrier sentences.

  19. p

    Census 2021 - CD CSD - Languages Spoken Most Often at Home

    • data.peelregion.ca
    • census.peelregion.ca
    • +1more
    Updated Aug 23, 2022
    Cite
    Regional Municipality of Peel (2022). Census 2021 - CD CSD - Languages Spoken Most Often at Home [Dataset]. https://data.peelregion.ca/datasets/census-2021-cd-csd-languages-spoken-most-often-at-home
    Explore at:
    Dataset updated
    Aug 23, 2022
    Dataset authored and provided by
    Regional Municipality of Peel
    License

    https://www.statcan.gc.ca/en/reference/licence

    Description

    Census Tract (CT) level data from the 2021 Census Program. Includes most of the information released as part of the Complete Profiles for the Languages release. Due to the complexity of the data, changes were made to the field names in order to accommodate the limitations of the database. This makes some uses harder as it requires careful use of the field names and totals to provide accurate values and analysis.

    Knowledge of official language - means that the person can have a simple conversation in either or both English and French.

    Language spoken most often at home - what a person uses most often in their house when conversing with someone else in their home. For a child that can't yet speak, it's the language that's most often spoken to the child.

    Mother tongue - is the language first learned in childhood and still understood by the person.

  20. ACS Language Spoken at Home Variables - Boundaries

    • hub.arcgis.com
    • hrtc-oc-cerf.hub.arcgis.com
    • +4more
    Updated Oct 20, 2018
    + more versions
    Cite
    Esri (2018). ACS Language Spoken at Home Variables - Boundaries [Dataset]. https://hub.arcgis.com/maps/527ea2b5ba814c8ca1c34a2945e1b751
    Explore at:
    Dataset updated
    Oct 20, 2018
    Dataset authored and provided by
    Esri (http://esri.com/)
    Description

    This layer shows language group of language spoken at home by age. This is shown by tract, county, and state boundaries. This service is updated annually to contain the most currently released American Community Survey (ACS) 5-year data, and contains estimates and margins of error. There are also additional calculated attributes related to this topic, which can be mapped or used within analysis. This layer is symbolized to show the percentage of the population age 5+ who speak Spanish at home. To see the full list of attributes available in this service, go to the "Data" tab, and choose "Fields" at the top right.

    Current Vintage: 2019-2023
    ACS Table(s): B16007
    Data downloaded from: Census Bureau's API for American Community Survey
    Date of API call: December 12, 2024
    National Figures: data.census.gov

    The United States Census Bureau's American Community Survey (ACS):

    • About the Survey
    • Geography & ACS
    • Technical Documentation
    • News & Updates

    This ready-to-use layer can be used within ArcGIS Pro, ArcGIS Online, its configurable apps, dashboards, Story Maps, custom apps, and mobile apps. Data can also be exported for offline workflows. For more information about ACS layers, visit the FAQ. Please cite the Census and ACS when using this data.

    Data Note from the Census:

    Data are based on a sample and are subject to sampling variability. The degree of uncertainty for an estimate arising from sampling variability is represented through the use of a margin of error. The value shown here is the 90 percent margin of error. The margin of error can be interpreted as providing a 90 percent probability that the interval defined by the estimate minus the margin of error and the estimate plus the margin of error (the lower and upper confidence bounds) contains the true value. In addition to sampling variability, the ACS estimates are subject to nonsampling error (for a discussion of nonsampling variability, see Accuracy of the Data). The effect of nonsampling error is not represented in these tables.

    Data Processing Notes:

    • This layer is updated automatically when the most current vintage of ACS data is released each year, usually in December. The layer always contains the latest available ACS 5-year estimates. It is updated annually within days of the Census Bureau's release schedule. Click here to learn more about ACS data releases.
    • Boundaries come from the US Census TIGER geodatabases, specifically, the National Sub-State Geography Database (named tlgdb_(year)_a_us_substategeo.gdb). Boundaries are updated at the same time as the data updates (annually), and the boundary vintage appropriately matches the data vintage as specified by the Census. These are Census boundaries with water and/or coastlines erased for cartographic and mapping purposes. For census tracts, the water cutouts are derived from a subset of the 2020 Areal Hydrography boundaries offered by TIGER. Water bodies and rivers which are 50 million square meters or larger (mid to large sized water bodies) are erased from the tract level boundaries, as well as additional important features. For state and county boundaries, the water and coastlines are derived from the coastlines of the 2023 500k TIGER Cartographic Boundary Shapefiles. These are erased to more accurately portray the coastlines and Great Lakes. The original AWATER and ALAND fields are still available as attributes within the data table (units are square meters).
    • The States layer contains 52 records - all US states, Washington D.C., and Puerto Rico.
    • Census tracts with no population that occur in areas of water, such as oceans, are removed from this data service (Census Tracts beginning with 99).
    • Percentages and derived counts, and associated margins of error, are calculated values (that can be identified by the "_calc_" stub in the field name), and abide by the specifications defined by the American Community Survey.
    • Field alias names were created based on the Table Shells file available from the American Community Survey Summary File Documentation page.
    • Negative values (e.g., -4444...) have been set to null, with the exception of -5555... which has been set to zero. These negative values exist in the raw API data to indicate the following situations:
        • The margin of error column indicates that either no sample observations or too few sample observations were available to compute a standard error and thus the margin of error. A statistical test is not appropriate.
        • Either no sample observations or too few sample observations were available to compute an estimate, or a ratio of medians cannot be calculated because one or both of the median estimates falls in the lowest interval or upper interval of an open-ended distribution.
        • The median falls in the lowest interval of an open-ended distribution, or in the upper interval of an open-ended distribution. A statistical test is not appropriate.
        • The estimate is controlled. A statistical test for sampling variability is not appropriate.
        • The data for this geographic area cannot be displayed because the number of sample cases is too small.
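
As a rough illustration of the two rules described above (the margin-of-error interval, and the mapping of negative placeholder values to null or zero), here is a minimal Python sketch. It is not Esri's or the Census Bureau's code. The function names are hypothetical, and because the listing elides the exact placeholder values ("-4444...", "-5555..."), the sketch assumes a placeholder can be recognised by its repeated digit rather than by a specific number.

```python
from typing import Optional, Tuple


def clean_value(raw: Optional[float]) -> Optional[float]:
    """Apply the cleaning rule from the processing notes to one raw API value.

    Assumption: the "-5555..." placeholder is any negative number whose
    digits are all 5. Every other negative value is treated as null (None).
    """
    if raw is None:
        return None
    if raw < 0:
        digits = set(str(abs(int(raw))))
        # "-5555..." becomes zero; any other negative placeholder becomes null.
        return 0.0 if digits == {"5"} else None
    return float(raw)


def confidence_bounds(
    estimate: Optional[float], moe: Optional[float]
) -> Optional[Tuple[float, float]]:
    """Return (estimate - MOE, estimate + MOE), the 90 percent interval
    described in the data note, or None when either input is null."""
    if estimate is None or moe is None:
        return None
    return (estimate - moe, estimate + moe)


# Illustrative numbers only, not taken from the dataset.
print(confidence_bounds(1000.0, 150.0))  # (850.0, 1150.0)
```

The layer itself already ships with these substitutions applied, so a helper like `clean_value` would only matter when working with raw API responses directly.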
