100+ datasets found

Data from: Domestic and International Common Language Database (DICL)
researchdiscovery.drexel.edu
Updated Mar 11, 2025
Cite
Tamara Gurevich; Peter Herman; Farid Toubal; Yoto Yotov (2025). Domestic and International Common Language Database (DICL) [Dataset]. https://researchdiscovery.drexel.edu/esploro/outputs/dataset/Domestic-and-International-Common-Language-Database/991022032773104721
Dataset updated
Mar 11, 2025
Dataset provided by
United States International Trade Commission
Authors
Tamara Gurevich; Peter Herman; Farid Toubal; Yoto Yotov
Time period covered
2024
Description
The database contains index measures of linguistic similarity, both domestic and international. The domestic measures capture linguistic similarities among populations within a single country, while the international indexes capture language similarities between two different countries. The 8 indexes reflect three different aspects of language: common official languages, common native and acquired spoken languages, and linguistic proximity across different languages. This database has many uses, such as in models of bilateral flows (including FDI, migration, and international trade) as well as in regional or country-level analyses. Extensive and detailed coverage: bilateral indexes for 242 countries, based on 6,674 individual languages.
Languages Spoken in Iowa
catalog.data.gov
datasets.ai
Updated Dec 6, 2024
Cite
data.iowa.gov (2024). Languages Spoken in Iowa [Dataset]. https://catalog.data.gov/dataset/languages-spoken-in-iowa
Dataset updated
Dec 6, 2024
Dataset provided by
data.iowa.gov
Area covered
Iowa
Description
Data portal and tool to better assist state agencies, policymakers and community-based organizations statewide in harnessing this data to improve operations, streamline outreach initiatives and meet federal-funding obligations.
Collins Multilingual database (MLD) - WordBank
live.european-language-grid.eu
catalogue.elra.info
Updated Dec 7, 2016
Cite
(2016). Collins Multilingual database (MLD) - WordBank [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/1496
Dataset updated
Dec 7, 2016
License
http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
Description
The Collins Multilingual database covers Real Life Daily vocabulary. It is composed of a multilingual lexicon in 32 languages (the WordBank) and a multilingual set of sentences in 28 languages (the PhraseBank, distributed separately under reference ELRA-T0377).

The WordBank contains 10,000 words for each language (Arabic, Chinese, Croatian, Czech, Danish, Dutch, American English, British English, Finnish, French, German, Greek, Italian, Japanese, Korean, Norwegian, Polish, Portuguese (Iberian), Portuguese (Brazilian), Russian, Spanish (Iberian), Spanish (Latin American), Swedish, Thai, Turkish, Vietnamese, Hindi, Tamil, Bengali, Malayalam, Romanian, Ukrainian), XML-annotated for part-of-speech, gender, irregular forms and disambiguating information for homographs. An additional dataset of 10,000 headwords is included for 12 languages (Chinese, American and British English, French, German, Italian, Japanese, Korean, Iberian and Brazilian Portuguese, Iberian and Latin American Spanish).

All English headwords contain Cobuild learner’s dictionary style definitions and one or more examples of the word in context.

Lemmatized lists and verb tables are available for English, French, German, Spanish and Italian. Romanization is provided for Chinese, Japanese, Korean and Thai.

The corresponding audio files are available for 26 of the 32 languages (excluding Hindi, Tamil, Bengali, Malayalam, Romanian and Ukrainian) and are distributed in a package referenced ELRA-S0382.
Languages spoken by tract, ACS
hub.arcgis.com
massachsuetts-environmental-justice-datasets-mass-eoeea.hub.arcgis.com
Updated May 19, 2021
Cite
MA Executive Office of Energy and Environmental Affairs (2021). Languages spoken by tract, ACS [Dataset]. https://hub.arcgis.com/datasets/6c8c34fa83564796b564bdad99be912c
Dataset updated
May 19, 2021
Dataset authored and provided by
MA Executive Office of Energy and Environmental Affairs
Area covered

Description
The American Community Survey, Table B16001, provided detailed individual-level language estimates at the tract level for 42 non-English language categories, tabulated by English-speaking ability. Two sets of language data are included here, with population counts and percentages for both: the tract population speaking languages other than English, regardless of English-speaking ability, identified by the language name; and the languages spoken other than English by the tract population who do not speak English 'very well', identified by the language name followed by "_Enw".

The default pop-up for this service presents the second of these: languages spoken other than English by the tract population who do not speak English 'very well'.

In part because of privacy concerns with the very small counts in some categories in Table B16001, the Census changed the American Community Survey estimates of the languages spoken by individuals. In 2016, the number of categories previously presented in Table B16001 was reduced to reflect the most commonly spoken languages, and several languages spoken in Massachusetts were grouped into generalized (i.e., "Other...") categories. Table B16001 has been renamed Table C16001 with these generalized categories. Therefore, the information presented in this datalayer is not current, and these data cannot be updated.
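The naming convention described above (a language name, plus the same name followed by "_Enw") can be used to pair the two sets of fields. The following is a minimal sketch under stated assumptions: the field names "Spanish" and "Portuguese" and the counts are hypothetical placeholders, not values taken from the data layer, and the real attribute table may name its fields differently.

```python
def pair_language_fields(field_names):
    """Map each language field to its '_Enw' counterpart, or None if absent."""
    names = set(field_names)
    pairs = {}
    for name in field_names:
        if name.endswith("_Enw"):
            continue  # the '_Enw' fields are looked up from their base name
        counterpart = name + "_Enw"
        pairs[name] = counterpart if counterpart in names else None
    return pairs


def limited_english_share(record, language):
    """Share of a language's speakers who do not speak English 'very well'."""
    total = record.get(language)
    limited = record.get(language + "_Enw")
    if not total or limited is None:
        return None  # no speakers recorded, or no '_Enw' field available
    return limited / total


# Hypothetical tract record (placeholder field names and counts).
tract = {"Spanish": 400, "Spanish_Enw": 100, "Portuguese": 50, "Portuguese_Enw": 20}

pairs = pair_language_fields(list(tract))
share = limited_english_share(tract, "Spanish")
```

Returning None rather than zero for missing or empty fields keeps "no data" distinct from "no limited-English speakers".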
Cariban Lexical Database (CaLeD)
data.niaid.nih.gov
zenodo.org
Updated Apr 21, 2024
Cite
Ferraz Gerardi, Fabrício (2024). Cariban Lexical Database (CaLeD) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10019096
Dataset updated
Apr 21, 2024
Dataset provided by
Ferraz Gerardi, Fabrício
Orphão de Carvalho, Fernando
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains a comprehensive collection of lexical items from various languages within the Carib linguistic family. It is structured to facilitate computational historical linguistics analysis, offering detailed information on language characteristics, word forms, and cognacy judgments. The data is curated to support research in linguistic typology, historical linguistics, and related fields.

Data Structure

The dataset is presented in a TSV (Tab-Separated Values) format, ensuring easy integration with common data analysis tools. Each lexical item in the dataset is detailed with multiple linguistic attributes, including phonological transcriptions, morphological analysis, and cognacy information. The following table summarizes the fields included in the dataset:

Field Name         Data Type  Description
ID                 string     Unique identifier for each dataset entry.
ID_lang            string     Unique identifier for the language within the dataset.
Glottocode         string     Code uniquely identifying the language in the Glottolog database.
Glottolog_Name     string     Name of the language as recorded in the Glottolog database.
ISO639P3code       string     ISO 639-3 code for the language.
ID_param           string     Unique identifier for the linguistic parameter or concept within the dataset.
Concepticon_ID     integer    Identifier for the concept in the Concepticon database.
Concepticon_Gloss  string     Gloss or definition of the concept from the Concepticon database.
Value              string     Value of the linguistic data point, typically a word or phrase in the language.
Form               string     Phonetic or phonological transcription of the linguistic data point.
Segments           string     Further phonetic or phonological breakdown of the form.
Source             string     Reference to the source or citation where the data was obtained.
Morphemes          string     Morphological breakdown of the form.
SimpleCognate      integer    Cognacy judgment, indicating whether the form is cognate with forms of the same meaning in related languages.
PartialCognates    string     Partial cognacy coding, detailing the cognacy of individual segments or morphemes.
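As a rough illustration of how a TSV file with the fields listed above might be read, the sketch below uses Python's standard csv module to group forms by concept and cognacy value. It rests on assumptions: the sample rows, language identifiers and forms are invented placeholders rather than records from CaLeD, and treating equal SimpleCognate integers within one concept as the same cognate set is an interpretation of the field description, not something the description states outright.

```python
import csv
import io
from collections import defaultdict

# Invented placeholder rows using a subset of the documented field names.
SAMPLE_TSV = (
    "ID\tID_lang\tConcepticon_Gloss\tForm\tSimpleCognate\n"
    "1\tlangA\tWATER\tform_a\t1\n"
    "2\tlangB\tWATER\tform_b\t1\n"
    "3\tlangC\tWATER\tform_c\t2\n"
)


def group_by_cognate(tsv_file):
    """Group (language, form) pairs by (concept gloss, SimpleCognate value)."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    groups = defaultdict(list)
    for row in reader:
        # SimpleCognate is documented as an integer field.
        key = (row["Concepticon_Gloss"], int(row["SimpleCognate"]))
        groups[key].append((row["ID_lang"], row["Form"]))
    return dict(groups)


groups = group_by_cognate(io.StringIO(SAMPLE_TSV))
```

For the real file, the same function could be given an open file handle, for example one opened with `newline=""` and UTF-8 encoding, in place of the in-memory sample.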

Intended Use

This dataset is intended for researchers and linguists specializing in the Carib linguistic family. It provides valuable insights into the lexical similarities and differences across the languages within this family, supporting studies on language evolution, relationships, and structure.

Additional Resources

Metadata for Validation: This dataset comes with comprehensive metadata following the Frictionless Data standard, ensuring that the data structure and types are accurately described for validation purposes. This metadata aids in maintaining the integrity and usability of the data across various computational platforms and research projects.

CLDF Version Available: For researchers utilizing the Cross-Linguistic Data Formats (CLDF), a version of this dataset is available in CLDF specifications. This version is provided as a zipped file, facilitating easier distribution and handling.
Collins Multilingual database (MLD) - PhraseBank
live.european-language-grid.eu
catalogue.elra.info
Updated Dec 7, 2016
Cite
(2016). Collins Multilingual database (MLD) - PhraseBank [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/2572
Dataset updated
Dec 7, 2016
License
http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
Description
The Collins Multilingual database covers Real Life Daily vocabulary. It is composed of a multilingual lexicon in 32 languages (the WordBank, distributed separately under reference ELRA-T0376) and a multilingual set of sentences in 28 languages (the PhraseBank).

The PhraseBank consists of 2,000 phrases in 28 languages (Arabic, Chinese, Croatian, Czech, Danish, Dutch, American English, British English, Farsi, Finnish, French, German, Greek, Hindi, Italian, Japanese, Korean, Norwegian, Polish, Portuguese (Iberian), Portuguese (Brazilian), Russian, Spanish (Iberian), Spanish (Latin American), Swedish, Thai, Turkish, Vietnamese). Phrases are organised under 12 main topics and 67 subtopics. Covered topics are: talking to people, getting around, accommodation, shopping, leisure, communications, practicalities, health and beauty, eating and drinking, time.

Romanization is provided for Arabic, Farsi and Hindi.

Audio files corresponding to each phrase are available and are distributed in a package referenced ELRA-S0383.
Preferred Language Spoken in California Facilities
catalog.data.gov
data.ca.gov
Updated Nov 27, 2024
Cite
Department of Health Care Access and Information (2024). Preferred Language Spoken in California Facilities [Dataset]. https://catalog.data.gov/dataset/preferred-language-spoken-in-california-facilities-114e4
Dataset updated
Nov 27, 2024
Dataset provided by
Department of Health Care Access and Information
Area covered
California
Description
The dataset contains combined counts for hospital discharges, emergency room encounters, and ambulatory surgeries by preferred language spoken at each facility. The nearly 100 languages collected in the patient-level data were combined into eight geographical or cultural groups: English Language, Spanish Language, Asian/Pacific Islander Languages, Middle Eastern Languages, European Languages, African Languages, Latin American Languages, Native American Languages, and Sign Language. See the Preferred Language Spoken Language List below to see the exact separation of languages.
Spoken Indian Language Identification Database
ieee-dataport.org
Updated Dec 28, 2022
Cite
Sunil Kumar Kopparapu (2022). Spoken Indian Language Identification Database [Dataset]. https://ieee-dataport.org/open-access/spoken-indian-language-identification-database
Dataset updated
Dec 28, 2022
Authors
Sunil Kumar Kopparapu
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
India
Description
Spoken Indian Language Identification Database(9 languages
Iconicity Patterns in Sign Languages
uvaauas.figshare.com
txt
Updated May 31, 2023
Cite
V. Kimmelman; George Moroz; Anna Klezovich (2023). Iconicity Patterns in Sign Languages [Dataset]. http://doi.org/10.21942/uva.6850298.v1
Available download formats: txt
Unique identifier
https://doi.org/10.21942/uva.6850298.v1
Dataset updated
May 31, 2023
Dataset provided by
University of Amsterdam / Amsterdam University of Applied Sciences
Authors
V. Kimmelman; George Moroz; Anna Klezovich
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is a database containing 1542 signs from 19 sign languages in 7 semantic fields, annotated according to several iconicity features. Please see the website link below for all the details, or read the following paper describing the database: Kimmelman, V., Klezovich, A., & Moroz, G. (2018). IPSL: A Database of Iconicity Patterns in Sign Languages. Creation and Use. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) (pp. 4230-4234). ELRA. URL: http://www.lrec-conf.org/proceedings/lrec2018/summaries/102.html
Linkeast: A lexical database of Eastern Polynesian Languages
data.niaid.nih.gov
Updated Mar 12, 2023
Cite
François, Alexandre (2023). Linkeast: A lexical database of Eastern Polynesian Languages [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7718955
Dataset updated
Mar 12, 2023
Dataset provided by
François, Alexandre
Walworth, Mary
Talfer, Hugues
Vernaudon, Jacques
Charpentier, Jean-Michel
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Area covered
Polynesia
Description
This dataset is an extraction of the data contained in LinkEast, a lexical database of Eastern Polynesian languages. LinkEast is housed at the Université de la Polynésie française, within Anareo, a digital infrastructure dedicated to research on and documentation of the languages of French Polynesia. This data (v1.0) is based primarily on the Linguistic Atlas of French Polynesia (Charpentier & François, 2015).

References: Charpentier, Jean-Michel, & François, Alexandre, 2015, Atlas linguistique de la Polynésie française, Berlin et Papeete, Mouton de Gruyter et Université de la Polynésie française.
Languages and English Ability - Seattle Neighborhoods
data-seattlecitygis.opendata.arcgis.com
data.seattle.gov
Updated Feb 22, 2024
Cite
City of Seattle ArcGIS Online (2024). Languages and English Ability - Seattle Neighborhoods [Dataset]. https://data-seattlecitygis.opendata.arcgis.com/datasets/SeattleCityGIS::languages-and-english-ability-seattle-neighborhoods
Dataset updated
Feb 22, 2024
Dataset authored and provided by
City of Seattle ArcGIS Online
Area covered
Seattle
Description
Table from the American Community Survey (ACS) 5-year series on languages spoken and English ability related topics for City of Seattle Council Districts, Comprehensive Plan Growth Areas and Community Reporting Areas. Table includes B16004 Age by Language Spoken at Home by Ability to Speak English, C16002 Household Language by Household Limited English-Speaking Status. Data is pulled from block group tables for the most recent ACS vintage and summarized to the neighborhoods based on block group assignment.

Table created for and used in the Neighborhood Profiles application.

Vintages: 2023
ACS Table(s): B16004, C16002
Data downloaded from: Census Bureau's Explore Census Data

The United States Census Bureau's American Community Survey (ACS): About the Survey; Geography & ACS; Technical Documentation; News & Updates

This ready-to-use layer can be used within ArcGIS Pro, ArcGIS Online, its configurable apps, dashboards, Story Maps, custom apps, and mobile apps. Data can also be exported for offline workflows. Please cite the Census and ACS when using this data.

Data Note from the Census:

Data are based on a sample and are subject to sampling variability. The degree of uncertainty for an estimate arising from sampling variability is represented through the use of a margin of error. The value shown here is the 90 percent margin of error. The margin of error can be interpreted as providing a 90 percent probability that the interval defined by the estimate minus the margin of error and the estimate plus the margin of error (the lower and upper confidence bounds) contains the true value. In addition to sampling variability, the ACS estimates are subject to nonsampling error (for a discussion of nonsampling variability, see Accuracy of the Data). The effect of nonsampling error is not represented in these tables.

Data Processing Notes:

Boundaries come from the US Census TIGER geodatabases, specifically, the National Sub-State Geography Database (named tlgdb_(year)_a_us_substategeo.gdb). Boundaries are updated at the same time as the data updates (annually), and the boundary vintage appropriately matches the data vintage as specified by the Census. These are Census boundaries with water and/or coastlines erased for cartographic and mapping purposes. For census tracts, the water cutouts are derived from a subset of the 2020 Areal Hydrography boundaries offered by TIGER. Water bodies and rivers which are 50 million square meters or larger (mid to large sized water bodies) are erased from the tract level boundaries, as well as additional important features. For state and county boundaries, the water and coastlines are derived from the coastlines of the 2020 500k TIGER Cartographic Boundary Shapefiles. These are erased to more accurately portray the coastlines and Great Lakes. The original AWATER and ALAND fields are still available as attributes within the data table (units are square meters).

The States layer contains 52 records - all US states, Washington D.C., and Puerto Rico.

Census tracts with no population that occur in areas of water, such as oceans, are removed from this data service (Census Tracts beginning with 99).

Percentages and derived counts, and associated margins of error, are calculated values (that can be identified by the "_calc_" stub in the field name), and abide by the specifications defined by the American Community Survey.

Field alias names were created based on the Table Shells file available from the American Community Survey Summary File Documentation page.

Negative values (e.g., -4444...) have been set to null, with the exception of -5555... which has been set to zero. These negative values exist in the raw API data to indicate the following situations:

The margin of error column indicates that either no sample observations or too few sample observations were available to compute a standard error and thus the margin of error. A statistical test is not appropriate.

Either no sample observations or too few sample observations were available to compute an estimate, or a ratio of medians cannot be calculated because one or both of the median estimates falls in the lowest interval or upper interval of an open-ended distribution.

The median falls in the lowest interval of an open-ended distribution, or in the upper interval of an open-ended distribution. A statistical test is not appropriate.

The estimate is controlled. A statistical test for sampling variability is not appropriate.

The data for this geographic area cannot be displayed because the number of sample cases is too small.
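Two of the notes above lend themselves to a short worked example: the confidence bounds formed from an estimate and its 90 percent margin of error, and the rule that negative raw values become null except those of the form -5555..., which become zero. The sketch below is illustrative only and rests on assumptions: the function names are hypothetical, the sample numbers are invented, and because the description elides the exact sentinel codes, the zero rule is matched on the "-5555" prefix quoted in the text rather than on a full value.

```python
def confidence_bounds(estimate, margin_of_error):
    """Lower and upper bounds: estimate minus and plus the margin of error."""
    return estimate - margin_of_error, estimate + margin_of_error


def clean_raw_value(raw):
    """Apply the stated rule for negative raw values.

    Negative values are set to None (null), except values whose digits
    start with '-5555', which are set to zero. The prefix test is an
    assumption, since the description does not give the full codes.
    """
    if raw is None:
        return None
    if raw < 0:
        return 0 if str(int(raw)).startswith("-5555") else None
    return raw


# Invented example: an estimate of 1200 with a margin of error of 150.
lower, upper = confidence_bounds(1200, 150)

# Invented raw values carrying the prefixes quoted in the description.
cleaned = [clean_raw_value(v) for v in (37, -444444444, -555555555)]
```

Keeping null and zero distinct matters downstream: a null means no usable value was available, while zero is a value that can enter sums and percentages.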
785 Million Language Translation Database for AI
kaggle.com
Updated Aug 28, 2023
Cite
Ramakrishnan Lakshmanan (2023). 785 Million Language Translation Database for AI [Dataset]. https://www.kaggle.com/datasets/ramakrishnan1984/785-million-language-translation-database-ai-ml
Available formats: Croissant, a format for machine-learning datasets (see mlcommons.org/croissant).
Dataset updated
Aug 28, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Ramakrishnan Lakshmanan
License
http://www.gnu.org/licenses/lgpl-3.0.html
Description
Our groundbreaking translation dataset represents a monumental advancement in the field of natural language processing and machine translation. Comprising a staggering 785 million records, this corpus bridges language barriers by offering translations from English to an astonishing 548 languages. The dataset promises to be a cornerstone resource for researchers, engineers, and developers seeking to enhance their machine translation models, cross-lingual analysis, and linguistic investigations.

Size of the dataset: 41 GB uncompressed, 20 GB compressed.

Key Features:

Scope and Scale: With a comprehensive collection of 785 million records, this dataset provides an unparalleled wealth of translated text. Each record consists of an English sentence paired with its translation in one of the 548 target languages, enabling multi-directional translation applications.

Language Diversity: Encompassing translations into 548 languages, this dataset represents a diverse array of linguistic families, dialects, and scripts. From widely spoken languages to those with limited digital representation, the dataset bridges communication gaps on a global scale.

Quality and Authenticity: The translations have been meticulously curated, verified, and cross-referenced to ensure high quality and authenticity. This attention to detail guarantees that the dataset is not only extensive but also reliable, serving as a solid foundation for machine learning applications. The data was collected from various open datasets for my personal ML projects, and I am looking to share it with the team.

Use Case Versatility: Researchers and practitioners across a spectrum of domains can harness this dataset for a myriad of applications. It facilitates the training and evaluation of machine translation models, empowers cross-lingual sentiment analysis, aids in linguistic typology studies, and supports cultural and sociolinguistic investigations.

Machine Learning Advancement: Machine translation models, especially neural machine translation (NMT) systems, can leverage this dataset to enhance their training. The large-scale nature of the dataset allows for more robust and contextually accurate translation outputs.

Fine-tuning and Customization: Developers can fine-tune translation models using specific language pairs, offering a powerful tool for specialized translation tasks. This customization capability ensures that the dataset is adaptable to various industries and use cases.

Data Format: The dataset is provided in a structured JSON format, facilitating easy integration into existing machine learning pipelines. This structured approach expedites research and experimentation. Each JSON record contains an English word and its equivalent word in the target language. The data was exported from a MongoDB database to ensure the uniqueness of the records. Each record is unique, and the records are sorted.

Access: The dataset is available for academic and research purposes, enabling the global AI community to contribute to and benefit from its usage. A well-documented API and sample code are provided to expedite exploration and integration.

The English-to-548-languages translation dataset represents an incredible leap forward in advancing multilingual communication, breaking down barriers to understanding, and fostering collaboration on a global scale. It holds the potential to reshape how we approach cross-lingual communication, linguistic studies, and the development of cutting-edge translation technologies.

Dataset Composition: The dataset is a culmination of translations from English, a widely spoken and understood language, into 548 distinct languages. Each language represents a unique linguistic and cultural background, providing a rich array of translation contexts. This diverse range of languages spans across various language families, regions, and linguistic complexities, making the dataset a comprehensive repository for linguistic research.

Data Volume and Scale: With a staggering 785 million records, the dataset boasts an immense scale that captures a vast array of translations and linguistic nuances. Each translation entry consists of an English source text paired with its corresponding translation in one of the 548 target languages. This vast corpus allows researchers and practitioners to explore patterns, trends, and variations across languages, enabling the development of robust and adaptable translation models.

Linguistic Coverage: The dataset covers an extensive set of languages, including but not limited to Indo-European, Afroasiatic, Sino-Tibetan, Austronesian, Niger-Congo, and many more. This broad linguistic coverage ensures that languages with varying levels of grammatical complexity, vocabulary richness, and syntactic structures are included, enhancing the applicability of translation models across diverse linguistic landscapes.

Dataset Preparation: The translation ...
Health Workforce Languages
data.chhs.ca.gov
data.ca.gov
xlsx
Updated Aug 28, 2024
Cite
Department of Health Care Access and Information (2024). Health Workforce Languages [Dataset]. https://data.chhs.ca.gov/dataset/health-workforce-languages
Available download formats: xlsx (21509), xlsx (23671), xlsx (772183), xlsx (16659)
Dataset updated
Aug 28, 2024
Dataset authored and provided by
Department of Health Care Access and Information
Description
This dataset contains statistically weighted estimates of the languages spoken by 47 key health workforce professions actively licensed in California as of July 1st, 2023. These metrics can be compared by US Census Bureau language group, workforce category, license type, time since license issue date (in years), and CHIS region.
Data from: Linguistic development in L2 Spanish: creation and analysis of a...
eprints.soton.ac.uk
Updated May 6, 2023
Cite
Mitchell, Rosamond; Marsden, Emma; Myles, Florence (2023). Linguistic development in L2 Spanish: creation and analysis of a learner corpus [Dataset]. http://doi.org/10.5255/UKDA-SN-850024
Unique identifier
https://doi.org/10.5255/UKDA-SN-850024
Dataset updated
May 6, 2023
Dataset provided by
UK Data Archive
Authors
Mitchell, Rosamond; Marsden, Emma; Myles, Florence
Description
This project had two aims: to establish a small-scale, high-quality database of spoken learner Spanish, and to undertake a short programme of substantive research into L2 (second language) Spanish. The data was collected from classroom learners of Spanish (with English as their first language), from beginners to advanced level, using specially designed elicitation tasks. For comparison purposes, native speakers were also recorded undertaking the same tasks. The resulting database contains digital soundfiles of learner speech, accompanied by transcripts in CHILDES (Child Language Data Exchange System) format which are tagged for parts of speech. The material will be made freely available for use among the Spanish second language acquisition research community, through a specially created website. The substantive research programme investigates the acquisition of central morphosyntactic properties of Spanish, such as word order, clitic pronouns, verbal morphology and wh-questions, providing a description and analysis of developmental sequences of L2 Spanish from an interface perspective. Phenomena such as the role of rote-learned formulas in instructed L2 Spanish were also studied. Research such as this enables us to better understand the processes involved in learning a second language in a classroom setting, and thus supports curriculum design for instructed L2 programmes.
Data from: Loanwords and native words in the Nordic languages database...
researchdata.se
Updated Jan 8, 2025
Cite
Matteo Tarsi (2025). Loanwords and native words in the Nordic languages database (c.1550–c. 1900) [Dataset]. http://doi.org/10.57804/gyz8-ns75
Available download formats: (793708)
Unique identifier
https://doi.org/10.57804/gyz8-ns75
Dataset updated
Jan 8, 2025
Dataset provided by
Uppsala University
Authors
Matteo Tarsi
Description
Excel file that contains data on loanwords and native words in the Nordic languages during the period 1550 to 1900.

This data stems from a two-year research project on loanwords and native words in the Nordic languages (Swedish, Danish, Icelandic) in the period c. 1550-c. 1900. The data gathering focused on a set of meanings from different semantic fields, tracking loanwords and native words over time. The semantic fields from which data was sampled are:

Astrology
Astronomy
Buildings and Architectural Elements
Chemical Elements
Clothes and Accessories
Food and Beverages
Grammatical Terminology
Law
Learning
Materials, Minerals, Textiles and Fabrics
Mathematics and Geometry
Medicine
Musical Instruments
Natural Products and Plants
People and Honorific and Professional Titles
Religious Terminology
Statal Organization

The dataset was originally published in DiVA and moved to SND in 2024.
Data from: DiACL - Diachronic Atlas of Comparative Linguistics
demo.researchdata.se
researchdata.se
Updated Dec 16, 2019
Cite
Gerd Carling (2019). DiACL - Diachronic Atlas of Comparative Linguistics [Dataset]. https://demo.researchdata.se/en/catalogue/dataset/ext0269-1
Dataset updated
Dec 16, 2019
Dataset provided by
Lund University
Authors
Gerd Carling
Time period covered
2013
Description
DiACL is an open access database with lexical and typological/morphosyntactic data for historical, comparative and phylogenetic linguistics. It contains data from 500 languages of 18 families, divided into three macro-areas: Eurasia, Pacific, and the Amazon. The database has the following content: 1) Lexical datasets with basic vocabularies (Swadesh lists), 2) Lexical datasets with culture vocabularies, focusing on subsistence system vocabulary, 3) Typological/morphosyntactic datasets including the main types Word Order, Alignment, and Nominal/ Verbal Morphology. DiACL contains data from contemporary and historical languages, and, if possible, reconstructed languages. Data is derived from dictionaries, grammars, or by new fieldwork (in particular data from Caucasus and the Amazon). All data is sourced in scientifically reliable literature.

Purpose:

The aim of the database is to make datasets for evolutionary and comparative linguistics available open access. The datasets, which are comparative and complete, span over large geographic areas, containing data for typology/morphosyntax and lexicon (basic vocabulary and culture vocabulary). Datasets can be used to investigate spatio-temporal and linguistic comparative correlations.

Data has been compiled by analyzing grammars and dictionaries and by means of fieldwork. The population of data into the database is controlled by matrix documents, questionnaires and careful instructions, in order to create complete and comparable datasets. The process of population has been supervised by the database editor (Gerd Carling) and language experts for each language group.

Cite as: Carling, Gerd (ed.) 2016/2017. Diachronic Atlas of Comparative Linguistics Online. (Available at: https://lundic.ht.lu.se/. Accessed on: z.).

Data from individual languages should preferably be quoted by their source: NN, NN, NN, NN. Data set: x (basic vocabulary/culture vocabulary/typology), y (language). In: Carling, Gerd (ed.) 2016/2017. Diachronic Atlas of Comparative Linguistics Online. (Available at: https://lundic.ht.lu.se/. Accessed on: z.).
105,941 Images Natural Scenes OCR Data of 12 Languages
m.nexdata.ai
nexdata.ai
Updated Sep 28, 2023
Cite
Nexdata (2023). 105,941 Images Natural Scenes OCR Data of 12 Languages [Dataset]. https://m.nexdata.ai/datasets/ocr/1064
Dataset updated
Sep 28, 2023
Dataset provided by
nexdata technology inc
Authors
Nexdata
Variables measured
Device, Accuracy, Data size, Diversity, Image parameter, Annotation content, Collecting environment
Description
105,941 natural-scene images for OCR in 12 languages. The data covers 12 languages (6 Asian languages, 6 European languages), multiple natural scenes, and multiple photographic angles. The text in each image is annotated with line-level quadrilateral bounding boxes and transcriptions. The data can be used for tasks such as multilingual OCR.
BABEL Polish database
catalogue.elra.info
live.european-language-grid.eu
Updated Apr 29, 2010
Cite
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2010). BABEL Polish database [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0307/
Dataset updated
Apr 29, 2010
Dataset provided by
ELRA (European Language Resources Association)
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
License
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
Description
The BABEL Polish Database is a speech database that was produced by a research consortium funded by the European Union under the COPERNICUS programme (COPERNICUS Project 1304). The project began in March 1995 and was completed in December 1998. The objective was to create a database of languages of Central and Eastern Europe in parallel to the EUROM1 databases produced by the SAM Project (funded by the ESPRIT programme). The BABEL consortium included six partners from Central and Eastern Europe (who had the major responsibility of planning and carrying out the recording and labelling) and six from Western Europe (whose role was mainly to advise and in some cases to act as host to BABEL researchers). The five databases collected within the project concern the Bulgarian, Estonian, Hungarian, Polish, and Romanian languages.

The Polish database consists of the basic "common" set which is:
• The Many Talker Set: 30 males, 30 females; each to read 100 numbers, 3 connected passages and 5 “filler” sentences (or 4 passages if no fillers needed).
• The Few Talker Set: 5 males, 5 females, normally selected from the above group: each to read 5 blocks of 100 numbers, 15 passages and 25 filler sentences (or 20 passages if fillers not needed), and 5 lists of syllables.
• The Very Few Talker Set: 1 male, 1 female, selected from many-talker set: 5 blocks of syllables, with and without carrier sentences.
Census 2021 - CD CSD - Languages Spoken Most Often at Home
data.peelregion.ca
census.peelregion.ca
Updated Aug 23, 2022
Cite
Regional Municipality of Peel (2022). Census 2021 - CD CSD - Languages Spoken Most Often at Home [Dataset]. https://data.peelregion.ca/datasets/census-2021-cd-csd-languages-spoken-most-often-at-home
Dataset updated
Aug 23, 2022
Dataset authored and provided by
Regional Municipality of Peel
License
https://www.statcan.gc.ca/en/reference/licence
Description
Census Tract (CT) level data from the 2021 Census Program. Includes most of the information released as part of the Complete Profiles for the Languages release. Due to the complexity of the data, changes were made to the field names in order to accommodate the limitations of the database. This makes some uses harder, because field names and totals must be used carefully to produce accurate values and analysis.

Knowledge of official language: the person can have a simple conversation in English, in French, or in both.

Language spoken most often at home: the language a person uses most often when conversing with someone else in their home. For a child who cannot yet speak, it is the language most often spoken to the child.

Mother tongue: the language first learned in childhood and still understood by the person.
ACS Language Spoken at Home Variables - Boundaries
hub.arcgis.com
hrtc-oc-cerf.hub.arcgis.com
Updated Oct 20, 2018
Cite
Esri (2018). ACS Language Spoken at Home Variables - Boundaries [Dataset]. https://hub.arcgis.com/maps/527ea2b5ba814c8ca1c34a2945e1b751
Dataset updated
Oct 20, 2018
Dataset authored and provided by
Esri (http://esri.com/)
Description
This layer shows language group of language spoken at home by age. This is shown by tract, county, and state boundaries. This service is updated annually to contain the most currently released American Community Survey (ACS) 5-year data, and contains estimates and margins of error. There are also additional calculated attributes related to this topic, which can be mapped or used within analysis. This layer is symbolized to show the percentage of the population age 5+ who speak Spanish at home. To see the full list of attributes available in this service, go to the "Data" tab, and choose "Fields" at the top right.

Current Vintage: 2019-2023
ACS Table(s): B16007
Data downloaded from: Census Bureau's API for American Community Survey
Date of API call: December 12, 2024
National Figures: data.census.gov

The United States Census Bureau's American Community Survey (ACS):
• About the Survey
• Geography & ACS
• Technical Documentation
• News & Updates

This ready-to-use layer can be used within ArcGIS Pro, ArcGIS Online, its configurable apps, dashboards, Story Maps, custom apps, and mobile apps. Data can also be exported for offline workflows. For more information about ACS layers, visit the FAQ. Please cite the Census and ACS when using this data.

Data Note from the Census:
Data are based on a sample and are subject to sampling variability. The degree of uncertainty for an estimate arising from sampling variability is represented through the use of a margin of error. The value shown here is the 90 percent margin of error. The margin of error can be interpreted as providing a 90 percent probability that the interval defined by the estimate minus the margin of error and the estimate plus the margin of error (the lower and upper confidence bounds) contains the true value. In addition to sampling variability, the ACS estimates are subject to nonsampling error (for a discussion of nonsampling variability, see Accuracy of the Data). The effect of nonsampling error is not represented in these tables.

Data Processing Notes:
• This layer is updated automatically when the most current vintage of ACS data is released each year, usually in December. The layer always contains the latest available ACS 5-year estimates. It is updated annually within days of the Census Bureau's release schedule. Click here to learn more about ACS data releases.
• Boundaries come from the US Census TIGER geodatabases, specifically, the National Sub-State Geography Database (named tlgdb_(year)_a_us_substategeo.gdb). Boundaries are updated at the same time as the data updates (annually), and the boundary vintage appropriately matches the data vintage as specified by the Census. These are Census boundaries with water and/or coastlines erased for cartographic and mapping purposes. For census tracts, the water cutouts are derived from a subset of the 2020 Areal Hydrography boundaries offered by TIGER. Water bodies and rivers which are 50 million square meters or larger (mid to large sized water bodies) are erased from the tract level boundaries, as well as additional important features. For state and county boundaries, the water and coastlines are derived from the coastlines of the 2023 500k TIGER Cartographic Boundary Shapefiles. These are erased to more accurately portray the coastlines and Great Lakes. The original AWATER and ALAND fields are still available as attributes within the data table (units are square meters).
• The States layer contains 52 records - all US states, Washington D.C., and Puerto Rico.
• Census tracts with no population that occur in areas of water, such as oceans, are removed from this data service (Census Tracts beginning with 99).
• Percentages and derived counts, and associated margins of error, are calculated values (that can be identified by the "_calc_" stub in the field name), and abide by the specifications defined by the American Community Survey.
• Field alias names were created based on the Table Shells file available from the American Community Survey Summary File Documentation page.
• Negative values (e.g., -4444...) have been set to null, with the exception of -5555... which has been set to zero. These negative values exist in the raw API data to indicate the following situations:
  • The margin of error column indicates that either no sample observations or too few sample observations were available to compute a standard error and thus the margin of error. A statistical test is not appropriate.
  • Either no sample observations or too few sample observations were available to compute an estimate, or a ratio of medians cannot be calculated because one or both of the median estimates falls in the lowest interval or upper interval of an open-ended distribution.
  • The median falls in the lowest interval of an open-ended distribution, or in the upper interval of an open-ended distribution. A statistical test is not appropriate.
  • The estimate is controlled. A statistical test for sampling variability is not appropriate.
  • The data for this geographic area cannot be displayed because the number of sample cases is too small.
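
The Data Note from the Census defines the lower and upper confidence bounds as the estimate minus and plus the published 90 percent margin of error. A minimal Python sketch of that arithmetic follows. The class name `AcsValue`, the function `confidence_bounds`, and the sample figures are illustrative assumptions, not field names or values from this layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AcsValue:
    """One published ACS figure with its 90 percent margin of error.

    Hypothetical container; the layer's real field names differ.
    """
    estimate: float
    margin_of_error: float


def confidence_bounds(value: AcsValue) -> tuple[float, float]:
    """Return (lower, upper) = (estimate - MOE, estimate + MOE)."""
    lower = value.estimate - value.margin_of_error
    upper = value.estimate + value.margin_of_error
    return lower, upper


# Made-up figures for illustration only, not taken from the dataset.
spanish_at_home = AcsValue(estimate=1200.0, margin_of_error=150.0)
lower, upper = confidence_bounds(spanish_at_home)
print(f"90% interval: {lower:.0f} to {upper:.0f}")
```

The sketch covers sampling variability only. As the note states, nonsampling error is not represented in the margin of error, so the interval should not be read as a bound on total error.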



