In 2023, around 43.37 million people in the United States spoke Spanish at home. In comparison, approximately 998,179 people spoke Russian at home during the same year.
As of 2023, more than ** percent of people in the United States spoke a language other than English at home. California had the highest share among all U.S. states, with ** percent of its population speaking a language other than English at home.
In 2025, there were around 1.53 billion people worldwide who spoke English either natively or as a second language, slightly more than the 1.18 billion Mandarin Chinese speakers at the time of survey. Hindi and Spanish were the third and fourth most widespread languages that year.
Languages in the United States
The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken in the United States vary as a result of the multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which more than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese), 1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year.
Different languages at home
The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage of its population speaking a language other than English is California. About 45 percent of its population spoke a language other than English at home in 2023.
Datasheet for the dataset: multilingual-NLI-26lang-2mil7
Dataset Summary
This dataset contains 2 730 000 NLI text pairs in 26 languages spoken by more than 4 billion people. The dataset can be used to train models for multilingual NLI (Natural Language Inference) or zero-shot classification. The dataset is based on the English datasets MultiNLI, Fever-NLI, ANLI, LingNLI and WANLI and was created using the latest open-source machine translation models. The dataset is… See the full description on the dataset page: https://huggingface.co/datasets/MoritzLaurer/multilingual-NLI-26lang-2mil7.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset is a synthetic English-ASL gloss parallel corpus generated in 2012. It pairs English text with American Sign Language glosses, connecting the two linguistic communities, and is intended to support machine translation models between English and ASL.
The dataset is typically provided as a CSV file, specifically referenced as train.csv. It comprises two columns: gloss and text. The gloss column contains 81,123 unique values, while the text column contains 81,016 unique values. This indicates the dataset consists of at least 81,123 records.
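The relationship between unique values and record count can be checked directly on the file. The following is a minimal sketch, assuming train.csv has a header row naming the gloss and text columns; the function name and the sample rows are hypothetical and only illustrate the counting.

```python
import csv
import io


def column_summary(csv_text: str) -> dict:
    """Count rows and unique values in the 'gloss' and 'text' columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = 0
    unique = {"gloss": set(), "text": set()}
    for row in reader:
        rows += 1
        for name in unique:
            unique[name].add(row[name])
    summary = {"rows": rows}
    for name, values in unique.items():
        summary[name] = len(values)
    return summary


# Hypothetical sample rows, not taken from the real corpus.
SAMPLE = (
    "gloss,text\n"
    "X-I BE HAPPY,i am happy\n"
    "X-I BE HAPPY,I am happy.\n"
    "X-YOU GO,you go\n"
)
print(column_summary(SAMPLE))
```

Because a column cannot have more unique values than the file has rows, the larger of the two unique counts is a lower bound on the number of records.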
This dataset can be used for a variety of applications and use cases, including:
The dataset focuses on the linguistic relationship between English and American Sign Language. While specific demographic details are not provided, its general availability is noted as global. The data was generated in 2012, offering a snapshot from that time.
CC0
This dataset is ideal for:
Original Data Source: AslgPc12 (English-ASL Gloss Parallel Corpus 2012)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a list of Department of Rehabilitation (DOR) offices and includes contact information, addresses, and languages spoken in each office. Note: In addition to the languages listed, the DOR has various Bilingual language resources available in each office that allow us to serve members of the public who may speak a language other than English.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Multilingual Spoken Words Corpus is a large and growing audio dataset of spoken words in 50 languages collectively spoken by over 5 billion people, for academic research and commercial applications in keyword spotting and spoken term search, licensed under CC-BY 4.0. The dataset contains more than 340,000 keywords, totaling 23.4 million 1-second spoken examples (over 6,000 hours). The dataset has many use cases, ranging from voice-enabled consumer devices to call center automation. This dataset is generated by applying forced alignment on crowd-sourced sentence-level audio to produce per-word timing estimates for extraction. All alignments are included in the dataset.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction: This paper studies the pragmatic force that heritage speakers may convey through the use of the diminutive in everyday speech. In particular, I analyze the use of the Spanish diminutive in 49 sociolinguistic interviews from a Spanish–English bilingual community in Southern Arizona, U.S. where Spanish is the heritage language. I compare the use of the diminutive in heritage Spanish to the distribution of the diminutive in the speech of a Spanish monolingual community (18 sociolinguistic interviews) from the same dialectal region. Although Spanish and English employ different morphosyntactic strategies to express diminutive meaning, the analysis reveals that the diminutive morpheme -ito/a is a productive morphological device in the Spanish-discourse of heritage speakers from Southern Arizona (i.e., similar diminutive distributions to their monolingual counterparts). While heritage speakers employed the diminutive -ito/a to express the notion of "smallness" in their Spanish-discourse, the analysis indicates that these language users are more likely to invoke a subjective evaluation through the diminutive -ito/a when talking about their family members and/or childhood experiences. This particular finding suggests that the concept "child" is the semantic/pragmatic driving force of the diminutive in heritage Spanish as a marker of speech by, about, to, or with some relation to children. The analysis further suggests that examining the pragmatic dimensions of the diminutive in everyday speech can provide important insights into how heritage speakers encode and create cultural meaning in their heritage languages.
Methods: In this study, I analyze the use of Spanish diminutives in two U.S.-Mexico border regions. The first data set is representative of a Spanish–English bilingual community in Southern Arizona, U.S., provided in the Corpus del Español en el Sur de Arizona (The CESA Corpus). The CESA Corpus comprises 49 sociolinguistic interviews of ~1 h each for a total of ~305,542 words. The second data set comprises 18 sociolinguistic interviews of predominantly monolingual Spanish speakers from the city of Mexicali, Baja California in Mexico, provided in the Proyecto Para el Estudio Sociolingüístico del Español de España y de América (PRESEEA). The Mexicali data set consists of ~119,162 words.
Results: The analysis revealed that the Spanish diminutive morpheme -ito/a is a productive morphological device in the Spanish-discourse of heritage speakers from Southern Arizona. In addition to its prototypical meaning (i.e., the notion of "smallness"), the diminutive morpheme -ito/a conveyed an array of pragmatic functions in the everyday speech of Spanish heritage speakers and their monolingual counterparts from the same dialectal region. Importantly, these pragmatic functions are mediated by speakers' subjective perceptions of the entity in question. Unlike their monolingual counterparts, heritage speakers are more likely to invoke a subjective evaluation through the diminutive -ito/a when talking about their family members and/or childhood experiences. Altogether, the study suggests that the concept "child" is the semantic/pragmatic driving force of the diminutive in heritage Spanish as a marker of speech by, about, to, or with some relation to children.
Discussion: In this study, I followed Reynoso's framework to study the pragmatic dimensions of the diminutive in everyday speech, that is, speakers' publicly conveyed meaning. The analysis revealed that heritage speakers applied most of the pragmatic functions and their respective values observed in Reynoso's cross-dialectal study of Spanish diminutives, thus providing further support for her framework. Similarly, the study provides further evidence for Jurafsky's proposal that morphological diminutives arise from semantic or pragmatic links with children. Finally, the analysis indicated that examining the semantic/pragmatic dimensions of the diminutive in everyday speech can provide important insights into how heritage speakers encode and create cultural meaning in their heritage languages, which can in turn have further ramifications for heritage language learning and teaching.
In 2019, about 12.08 million children in the United States spoke a language other than English at home. This number is fairly consistent with the previous year, when 12.13 million children spoke another language at home.
http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
The Collins Multilingual database covers Real Life Daily vocabulary. It is composed of a multilingual lexicon in 32 languages (the WordBank, distributed separately under reference ELRA-T0376) and a multilingual set of sentences in 28 languages (the PhraseBank).
The PhraseBank consists of 2,000 phrases in 28 languages (Arabic, Chinese, Croatian, Czech, Danish, Dutch, American English, British English, Farsi, Finnish, French, German, Greek, Hindi, Italian, Japanese, Korean, Norwegian, Polish, Portuguese (Iberian), Portuguese (Brazilian), Russian, Spanish (Iberian), Spanish (Latin American), Swedish, Thai, Turkish, Vietnamese). Phrases are organised under 12 main topics and 67 subtopics. Covered topics include: talking to people, getting around, accommodation, shopping, leisure, communications, practicalities, health and beauty, eating and drinking, time.
Romanization is provided for Arabic, Farsi and Hindi.
Audio files corresponding to each phrase are available and are distributed in a package referenced ELRA-S0383.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of "Department of Rehabilitation Office Contact Information and Addresses with Languages Spoken" provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://catalog.data.gov/dataset/b3462d31-650b-43fa-9f80-c6efd6d5ce88 on 26 January 2022.
--- Dataset description provided by original source is as follows ---
This dataset is a list of Department of Rehabilitation (DOR) offices and includes contact information, addresses, and languages spoken in each office. Note: In addition to the languages listed, the DOR has various Bilingual language resources available in each office that allow us to serve members of the public who may speak a language other than English.
--- Original source retains full ownership of the source dataset ---
Dataset Title: A Gold Standard Corpus for Activity Information (GoSCAI)
Dataset Curators: The Epidemiology & Biostatistics Section of the NIH Clinical Center Rehabilitation Medicine Department
Dataset Version: 1.0 (May 16, 2025)
Dataset Citation and DOI: NIH CC RMD Epidemiology & Biostatistics Section. (2025). A Gold Standard Corpus for Activity Information (GoSCAI) [Data set]. Zenodo. doi: 10.5281/zenodo.15528545
This data statement is for a gold standard corpus of de-identified clinical notes that have been annotated for human functioning information based on the framework of the WHO's International Classification of Functioning, Disability and Health (ICF). The corpus includes 484 notes from a single institution within the United States written in English in a clinical setting. This dataset was curated for the purpose of training natural language processing models to automatically identify, extract, and classify information on human functioning at the whole-person, or activity, level.
This dataset is curated to be a publicly available resource for the development and evaluation of methods for the automatic extraction and classification of activity-level functioning information as defined in the ICF. The goals of data curation are to 1) create a corpus of a size that can be manually deidentified and annotated, 2) maximize the density and diversity of functioning information of interest, and 3) allow public dissemination of the data.
Language Region: en-US
Prose Description: English as written by native and bilingual English speakers in a clinical setting
The language users represented in this dataset are medical and clinical professionals who work in a research hospital setting. These individuals hold professional degrees corresponding to their respective specialties. Specific demographic characteristics of the language users such as age, gender, or race/ethnicity were not collected.
The annotator group consisted of five people, 33 to 76 years old, including four females and one male. Socioeconomically, they came from the middle and upper-middle income classes. Regarding first language, three annotators had English as their first language, one had Chinese, and one had Spanish. Proficiency in English, the language of the data being annotated, was native for three of the annotators and bilingual for the other two. The annotation team included clinical rehabilitation domain experts with backgrounds in occupational therapy, physical therapy, and individuals with public health and data science expertise. Prior to annotation, all annotators were trained on the specific annotation process using established guidelines for the given domain, and annotators were required to achieve a specified proficiency level prior to annotating notes in this corpus.
The notes in the dataset were written as part of clinical care within a U.S. research hospital between May 2008 and November 2019. These notes were written by health professionals asynchronously following the patient encounter to document the interaction and support continuity of care. The intended audience of these notes was clinicians involved in the patients' care. The included notes come from nine disciplines - neuropsychology, occupational therapy, physical medicine (physiatry), physical therapy, psychiatry, recreational therapy, social work, speech language pathology, and vocational rehabilitation. The notes were curated to support research on natural language processing for functioning information between 2018 and 2024.
The final corpus was derived from a set of clinical notes extracted from the hospital electronic medical record (EMR) for the purpose of clinical research. The original data consist of character-based digital content. We work in ASCII 8 or UNICODE encoding, and therefore part of our pre-processing includes running encoding detection and transformation from encodings such as Windows-1252 or ISO-8859 format to our preferred format.
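The encoding step described above can be illustrated with a minimal sketch. It is not the curators' pipeline: it replaces real encoding detection with a simple fallback order (strict UTF-8 first, then Windows-1252), and the function name is hypothetical.

```python
def to_unicode_text(raw: bytes) -> str:
    """Decode note bytes, trying UTF-8 first and Windows-1252 second.

    Assumption: a fixed fallback order stands in for real encoding detection.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Windows-1252 leaves a few byte values undefined, so replace
        # anything that still cannot be decoded instead of raising.
        return raw.decode("cp1252", errors="replace")


legacy_bytes = "café".encode("cp1252")  # not valid UTF-8
print(to_unicode_text(legacy_bytes))
```

Trying strict UTF-8 first is a common heuristic because text in a legacy single-byte encoding rarely happens to be valid UTF-8, but a dedicated detection library would be more robust for mixed inputs.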
On the larger corpus, we applied sampling to match our curation rationale. Given the resource constraints of manual annotation, we set out to create a dataset of 500 clinical notes, which would exclude notes over 10,000 characters in length.
To promote density and diversity, we used five note characteristics as sampling criteria. We used the text length as expressed in number of characters. Next, we considered the discipline group, which is derived from note type metadata and describes which discipline a note originated from: occupational and vocational therapy (OT/VOC), physical therapy (PT), recreation therapy (RT), speech and language pathology (SLP), social work (SW), or miscellaneous (MISC, including psychiatry, neurology and physiatry). These disciplines were selected for collecting the larger corpus because their notes are likely to include functioning information. Existing information extraction tools were used to obtain annotation counts in four areas of functioning and provided a note's annotation count, annotation density (annotation count divided by text length), and domain count (number of domains with at least 1 annotation).
We used stratified sampling across the 6 discipline groups to ensure discipline diversity in the corpus. Because of low availability, 50 notes were sampled from SLP with relaxed criteria, and 90 notes each from the 5 other discipline groups with stricter criteria. Sampled SLP notes were those with the highest annotation density that had an annotation count of at least 5 and a domain count of at least 2. Other notes were sampled by highest annotation count and lowest text length, with a minimum annotation count of 15 and minimum domain count of 3.
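The sampling rules above can be sketched as follows. This is an illustration under stated assumptions, not the curators' code: the Note fields and function name are hypothetical, the 10,000-character cap is applied as a filter, and "highest annotation count and lowest text length" is interpreted as sorting by count first and breaking ties by length.

```python
from dataclasses import dataclass

MAX_LENGTH = 10_000  # notes over this many characters are excluded


@dataclass(frozen=True)
class Note:
    note_id: str
    group: str             # OT/VOC, PT, RT, SLP, SW, or MISC
    text_length: int       # number of characters
    annotation_count: int
    domain_count: int      # domains with at least 1 annotation

    @property
    def density(self) -> float:
        # annotation density = annotation count divided by text length
        return self.annotation_count / self.text_length


def sample_group(notes, group):
    """Select notes for one discipline group using the stated thresholds."""
    pool = [n for n in notes
            if n.group == group and 0 < n.text_length <= MAX_LENGTH]
    if group == "SLP":
        # Relaxed criteria: count >= 5, domains >= 2, ranked by density.
        pool = [n for n in pool
                if n.annotation_count >= 5 and n.domain_count >= 2]
        pool.sort(key=lambda n: -n.density)
        return pool[:50]
    # Stricter criteria: count >= 15, domains >= 3.
    pool = [n for n in pool
            if n.annotation_count >= 15 and n.domain_count >= 3]
    # Assumed ordering: most annotations first, shorter notes break ties.
    pool.sort(key=lambda n: (-n.annotation_count, n.text_length))
    return pool[:90]
```

With 50 notes from SLP and 90 from each of the other five groups, the quotas add up to the target of 500 notes.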
The notes in the resulting sample included certain types of PHI and PII. To prepare for public dissemination, all sensitive or potentially identifying information was manually annotated in the notes and replaced with substituted content to preserve readability and the context needed for machine learning without exposing any sensitive information. This de-identification effort was manually reviewed to ensure no PII or PHI exposure and to correct any resulting readability issues. Notes about pediatric patients were excluded. No attempt was made to sample multiple notes from the same patient. No metadata is provided to group notes other than by note type, discipline, or discipline group. The dataset is not organized beyond the provided metadata, but publications about models trained on this dataset should include information on the train/test splits used.
All notes were sentence-segmented and tokenized using the spaCy en_core_web_lg model with additional rules for sentence segmentation customized to the dataset. Notes are stored in an XML format readable by the GATE annotation software (https://gate.ac.uk/family/developer.html), which stores annotations separately in annotation sets.
As the clinical notes were extracted directly from the EMR in text format, the capture quality was determined to be high. The clinical notes did not have to be converted from other data formats, which means this dataset is free from noise introduced by conversion processes such as optical character recognition.
Because of the effort required to manually deidentify and annotate notes, this corpus is limited in terms of size and representation. The curation decisions skewed note selection towards specific disciplines and note types to increase the likelihood of encountering information on functioning. Some subtypes of functioning occur infrequently in the data, or not at all. The deidentification of notes was done in a manner to preserve natural language as it would occur in the notes, but some information is lost, e.g. on rare diseases.
Information on the manual annotation process is provided in the annotation guidelines for each of the four domains:
- Communication & Cognition (https://zenodo.org/records/13910167)
- Mobility (https://zenodo.org/records/11074838)
- Self-Care & Domestic Life (SCDL) (https://zenodo.org/records/11210183)
- Interpersonal Interactions & Relationships (IPIR) (https://zenodo.org/records/13774684)
Inter-annotator agreement was established on development datasets described in the annotation guidelines prior to the annotation of this gold standard corpus.
The gold standard corpus consists of 484 documents, which include 35,147 sentences in total. The distribution of annotated information is provided in the table below.
Domain | Number of Annotated Sentences | % of All Sentences | Mean Number of Annotated Sentences per Document
Communication & Cognition | 6033 | 17.2% |
This layer shows language or language groups spoken at home by English ability. This is shown by tract, county, and state centroids. This service is updated annually to contain the most currently released American Community Survey (ACS) 5-year data, and contains estimates and margins of error. There are also additional calculated attributes related to this topic, which can be mapped or used within analysis. This layer is symbolized to show the count and percent of individuals age 5+ who are bilingual in English and another language (speak English very well and speak another language at home). To see the full list of attributes available in this service, go to the "Data" tab, and choose "Fields" at the top right.
Current Vintage: 2019-2023
ACS Table(s): C16001
Data downloaded from: Census Bureau's API for American Community Survey
Date of API call: December 12, 2024
National Figures: data.census.gov
The United States Census Bureau's American Community Survey (ACS):
- About the Survey
- Geography & ACS
- Technical Documentation
- News & Updates
This ready-to-use layer can be used within ArcGIS Pro, ArcGIS Online, its configurable apps, dashboards, Story Maps, custom apps, and mobile apps. Data can also be exported for offline workflows. For more information about ACS layers, visit the FAQ. Please cite the Census and ACS when using this data.
Data Note from the Census: Data are based on a sample and are subject to sampling variability. The degree of uncertainty for an estimate arising from sampling variability is represented through the use of a margin of error. The value shown here is the 90 percent margin of error. The margin of error can be interpreted as providing a 90 percent probability that the interval defined by the estimate minus the margin of error and the estimate plus the margin of error (the lower and upper confidence bounds) contains the true value. In addition to sampling variability, the ACS estimates are subject to nonsampling error (for a discussion of nonsampling variability, see Accuracy of the Data). The effect of nonsampling error is not represented in these tables.
Data Processing Notes:
- This layer is updated automatically when the most current vintage of ACS data is released each year, usually in December. The layer always contains the latest available ACS 5-year estimates. It is updated annually within days of the Census Bureau's release schedule.
- Boundaries come from the US Census TIGER geodatabases, specifically, the National Sub-State Geography Database (named tlgdb_(year)_a_us_substategeo.gdb). Boundaries are updated at the same time as the data updates (annually), and the boundary vintage appropriately matches the data vintage as specified by the Census. These are Census boundaries with water and/or coastlines erased for cartographic and mapping purposes. For census tracts, the water cutouts are derived from a subset of the 2020 Areal Hydrography boundaries offered by TIGER. Water bodies and rivers which are 50 million square meters or larger (mid to large sized water bodies) are erased from the tract level boundaries, as well as additional important features. For state and county boundaries, the water and coastlines are derived from the coastlines of the 2023 500k TIGER Cartographic Boundary Shapefiles. These are erased to more accurately portray the coastlines and Great Lakes. The original AWATER and ALAND fields are still available as attributes within the data table (units are square meters).
- The States layer contains 52 records - all US states, Washington D.C., and Puerto Rico.
- Census tracts with no population that occur in areas of water, such as oceans, are removed from this data service (Census Tracts beginning with 99).
- Percentages and derived counts, and associated margins of error, are calculated values (that can be identified by the "_calc_" stub in the field name), and abide by the specifications defined by the American Community Survey.
- Field alias names were created based on the Table Shells file available from the American Community Survey Summary File Documentation page.
- Negative values (e.g., -4444...) have been set to null, with the exception of -5555... which has been set to zero. These negative values exist in the raw API data to indicate the following situations:
- The margin of error column indicates that either no sample observations or too few sample observations were available to compute a standard error and thus the margin of error. A statistical test is not appropriate.
- Either no sample observations or too few sample observations were available to compute an estimate, or a ratio of medians cannot be calculated because one or both of the median estimates falls in the lowest interval or upper interval of an open-ended distribution.
- The median falls in the lowest interval of an open-ended distribution, or in the upper interval of an open-ended distribution. A statistical test is not appropriate.
- The estimate is controlled. A statistical test for sampling variability is not appropriate.
- The data for this geographic area cannot be displayed because the number of sample cases is too small.
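The 90 percent margin of error described in the Census data note can be turned into confidence bounds with simple arithmetic. The sketch below is illustrative only: the class name and the example figures are hypothetical, and the standard error conversion uses 1.645, the factor the Census Bureau documents for 90 percent margins of error.

```python
from dataclasses import dataclass

Z_90 = 1.645  # z-score corresponding to a 90 percent confidence level


@dataclass(frozen=True)
class AcsEstimate:
    value: float  # the published estimate
    moe: float    # the published 90 percent margin of error

    def bounds(self) -> tuple:
        """Lower and upper confidence bounds: estimate -/+ margin of error."""
        return (self.value - self.moe, self.value + self.moe)

    def standard_error(self) -> float:
        """Standard error implied by a 90 percent margin of error."""
        return self.moe / Z_90


# Hypothetical figures, not taken from the layer.
bilingual = AcsEstimate(value=1200.0, moe=150.0)
print(bilingual.bounds())
print(round(bilingual.standard_error(), 2))
```

These bounds reflect sampling variability only; as the data note says, nonsampling error is not represented in the published margins.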
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Europe Facility - Automated Translation (CEF.AT) action. For further information on the project: http://lr-coordination.eu. The EASTIN-CL Multilingual Ontology of Assistive Technology was created within the EASTIN-CL project, which aimed to apply language technologies to the portal of assistive technologies http://www.eastin.eu to enhance it and make it more accessible for people in different languages. Based on the Multilingual Ontology, a query tool was built that allows users of the portal to type lookup words, which are then mapped to assistive device product classes. The terminology resource was created by first selecting base terminology in English, then having domain experts translate it into 6 other languages.
10 families of different types (SES) and structures (e.g. nuclear, extended, single-parent) were observed. The data provide insight into family members' ideological positions that can be congruent or conflictual and which may cause conflicting views about how to raise bilingual children. Interactional data capture the actual language practices in families across the communities. The data also allow us to observe the silent cultural conversations among family members, and to identify the critical moments of policy enactment.
NeMig represents a bilingual news collection and knowledge graphs on the topic of migration. The news corpora in German and English were collected from online media outlets from Germany and the US, respectively. NeMig contains rich textual and metadata information, sentiment and political orientation annotations, as well as named entities extracted from the articles' content and metadata and linked to Wikidata. The corresponding knowledge graphs (NeMigKG) built from each corpus are expanded with up to two-hop neighbors from Wikidata of the initial set of linked entities.
NeMigKG comes in four flavors, for both the German and the English corpora:
Base NeMigKG: contains literals and entities from the corresponding annotated news corpus;
Entities NeMigKG: derived from the Base NeMig by removing all literal nodes, it contains only resource nodes;
Enriched Entities NeMigKG: derived from the Entities NeMig by enriching it with up to two-hop neighbors from Wikidata, it contains only resource nodes and Wikidata triples;
Complete NeMigKG: the combination of the Base and Enriched Entities NeMig, it contains both literals and resources.
Information about uploaded files:
(All files are bzip2-compressed and in the N-Triples format.)
A description of the NeMigKG files is provided in the table below:
NeMigKG Files Description
File | Description
nemig_${language}_${graph_type}-metadata.nt.bz2 | Metadata about the dataset, described using the VoID vocabulary.
nemig_${language}_${graph_type}-instances_types.nt.bz2 | Class definitions of news and event instances.
nemig_${language}_${graph_type}-instances_labels.nt.bz2 | Labels of instances.
nemig_${language}_${graph_type}-instances_related.nt.bz2 | Relations between news instances based on one another.
nemig_${language}_${graph_type}-instances_metadata_literals.nt.bz2 | Relations between news instances and metadata literals (e.g. URL, publishing date, modification date, sentiment label, political orientation of news outlets).
nemig_${language}_${graph_type}-instances_content_mapping.nt.bz2 | Mapping of news instances to content instances (e.g. title, abstract, body).
nemig_${language}_${graph_type}-instances_topic_mapping.nt.bz2 | Mapping of news instances to sub-topic instances.
nemig_${language}_${graph_type}-instances_sentiment_mapping.nt.bz2 | Mapping of news instances to sentiment classes.
nemig_${language}_${graph_type}-instances_political_orientation_mapping.nt.bz2 | Mapping of news outlets instances to political orientation classes.
nemig_${language}_${graph_type}-instances_content_literals.nt.bz2 | Relations between content instances and corresponding literals (e.g. text of title, abstract, body).
nemig_${language}_${graph_type}-instances_sentiment_polorient_literals.nt.bz2 | Relations between instances and corresponding sentiment or political orientation literals.
nemig_${language}_${graph_type}-instances_metadata_resources.nt.bz2 | Relations between news or sub-topic instances and entities extracted from metadata (i.e. publishers, authors, keywords).
nemig_${language}_${graph_type}-instances_event_mapping.nt.bz2 | Mapping of news instances to event instances.
nemig_${language}_${graph_type}-event_resources.nt.bz2 | Relations between event instances and entities extracted from the text of the news (i.e. actors, places, mentions).
nemig_${language}_${graph_type}-resources_provenance.nt.bz2 | Provenance information about the entities extracted from the text of the news (e.g. title, abstract, body).
nemig_${language}_${graph_type}-wiki_resources.nt.bz2 | Relations between Wikidata entities from news and their k-hop entity neighbors from Wikidata.
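Since the knowledge graph files are compressed N-Triples, they can be streamed line by line without unpacking them first. The following is a rough sketch using only the Python standard library; the function name is hypothetical, and the line splitting is deliberately naive (a dedicated RDF parser would be more robust for escaped or unusual literals).

```python
import bz2
from typing import Iterator, Tuple


def iter_triples(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (subject, predicate, object) strings from a .nt.bz2 file."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            if line.endswith("."):
                line = line[:-1].rstrip()  # drop the statement terminator
            # Split on the first two whitespace runs only, so that literal
            # objects containing spaces stay in one piece.
            subject, predicate, obj = line.split(None, 2)
            yield subject, predicate, obj
```

For example, `sum(1 for _ in iter_triples(path))` counts the triples in one file while keeping memory use flat, which helps with the larger Wikidata-enriched graphs.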
The corresponding user data has been collected through online studies in Germany and the US. We used the participants' implicit feedback regarding their interest in an article to build their click history, and the explicit feedback in terms of news click behaviors to construct the impression logs. To protect user privacy, we assign each user an anonymized ID.
The German and English user datasets are zip-compressed folders, which contain two files each.
NeMig User Dataset File Description
File | Description
behaviors.tsv | The click history and impression logs of users.
demographics_politics.tsv | Demographic and political information of users.
The behaviors.tsv file contains the users' news click histories and the impression logs. It has 4 columns divided by the tab symbol:
Impression ID: The ID of an impression.
User ID: The anonymized ID of a user.
Click History: The news click history (list of news IDs) of a user before an impression.
Impression Log: List of news displayed to the user in a session and the user's click behavior on them (1 for click, 0 for non-click).
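The four-column layout described above can be read with a short parser. This is a sketch under several assumptions that the description does not confirm: the file has no header row, the click history is a space-separated list of news IDs, and each impression log entry is written as a news ID joined to its 0/1 label with a hyphen (for example N3-1). The type and function names are hypothetical.

```python
import csv
import io
from typing import List, NamedTuple, Tuple


class Impression(NamedTuple):
    impression_id: str
    user_id: str
    click_history: List[str]
    impression_log: List[Tuple[str, int]]  # (news ID, 1 = click, 0 = non-click)


def parse_behaviors(tsv_text: str) -> List[Impression]:
    """Parse tab-separated behaviors data into Impression records."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    impressions = []
    for row in reader:
        if len(row) != 4:
            continue  # skip malformed or empty lines
        impression_id, user_id, history, log = row
        pairs = []
        for item in log.split():
            # Assumed entry format: "<news ID>-<label>"
            news_id, _, label = item.rpartition("-")
            pairs.append((news_id, int(label)))
        impressions.append(
            Impression(impression_id, user_id, history.split(), pairs))
    return impressions


# Hypothetical sample line, not taken from the dataset.
SAMPLE_TSV = "1\tU1\tN1 N2\tN3-1 N4-0\n"
print(parse_behaviors(SAMPLE_TSV))
```

If the real files use a different entry format for the impression log, only the inner loop needs to change.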
The demographics_politics.tsv file contains detailed information about the users' demographics and political interests. It has columns divided by the tab symbol. An explanation of all the columns and the questions used in the online studies to collect this information is shown in the table below.
Demographic and political user data description
Column Name
Question in German study
Scale in German
Question in English study
Scale in English
Demographics
Gender
Bitte geben Sie Ihr Geschlecht an
0 = männlich
1 = weiblich
2 = divers
3 = Keine Angabe
Please indicate your gender.
0 = male
1 = female
2 = other
3 = no answer
Age
Bitte geben Sie Ihr Alter an
1-120
Please indicate your age.
1-120
Qualification
Welches ist Ihr höchster Bildungsabschluss?
0 = Kein Schulabschluss
1 = Haupt-/Gesamtschulabschluss
2 = Realschulabschluss, Mittlere Reife, Fachschulreife
3 = Fachhochschulreife, Abitur
4 = Studium mit Abschluss
5 = Promotion
6 = Keine Angabe
Please indicate your highest educational qualification.
0 = less than high school
1 = high school/GED
2 = Vo-tech/business school
3 = some college
4 = college degree
5 = university degree
6 = doctoral degree
7 = no answer
Nationality
Welche Staatsangehörigkeit besitzen Sie?
0 = Nur die deutsche Staatsangehörigkeit
1 = Die deutsche und eine andere Staatsangehörigkeit
2 = Nur eine andere Staatsangehörigkeit
3 = Keine Angabe
What is your citizenship?
0 = U.S. citizenship
1 = U.S. and another non-U.S. citizenship
2 = Only non-U.S. citizenship
3 = No Answer
BornIn
Sind Sie in Deutschland geboren?
0 = Ja
1 = Nein
2 = Keine Angabe
Were you born in the U.S.?
0 = Yes
1 = No
2 = No answer
ParentsBornIn
Sind Ihre Eltern in Deutschland geboren?
0 = Mein Vater und meine Mutter sind beide in Deutschland geboren
1 = Mein Vater ist in Deutschland geboren, meine Mutter nicht
2 = Meine Mutter ist in Deutschland geboren, mein Vater nicht
3 = Weder meine Mutter noch mein Vater sind in Deutschland geboren
4 = Keine Angabe
Were your parents born in the U.S.?
0 = My father and my mother were both born in the U.S.
1 = My father was born in the U.S., my mother was not
2 = My mother was born in the U.S., my father was not
3 = Neither my mother nor my father were born in the U.S.
4 = No answer
Income
Was ist Ihr persönliches monatliches Nettoeinkommen (nach Abzug der Steuern)? Bitte geben Sie eine ungefähre Schätzung an, falls Sie die genaue Zahl nicht kennen.
0 = Weniger als 1000 €
1 = 1001 € bis 2000 €
2 = 2001 € bis 3000 €
3 = 3001 € bis 4000 €
4 = 4001 € bis 5000 €
5 = Mehr als 5000 €
6 = Keine Angabe
What is your personal monthly net income (after taxes)? Please give an approximate estimation in case you are unsure.
0 = Less than 1000 $
1 = 1001 $ to 2000 $
2 = 2001 $ to 3000 $
3 = 3001 $ to 4000 $
4 = 4001 $ to 5000 $
5 = More than 5000 $
6 = No Answer
Empathy
Wie sehr stimmen Sie den folgenden Aussagen zu?
7-point Likert scale
1=Trifft überhaupt nicht zu
7=Trifft voll und ganz zu
How strongly do you agree with the following statements?
7-point Likert scale
1=Strongly disagree
7=Strongly agree
EMP1
Wenn jemand anderes erfreut ist, tendiere ich dazu auch erfreut zu sein.
When someone else is feeling excited, I tend to get excited too.
EMP2
Es regt mich auf, wenn jemand respektlos behandelt wird.
It upsets me to see someone being treated disrespectfully.
EMP3
Es macht mir Freude, andere aufzumuntern.
I enjoy making other people feel better.
EMP4
Ich bin besorgt um Personen, die weniger Glück haben als ich.
I have tender, concerned feelings for people less fortunate than me.
EMP5
Ich fühle, wenn andere traurig sind, selbst wenn sie nichts sagen.
I can tell when others are sad even when they do not say anything.
EMP6
Meistens bin ich mit den Stimmungen anderer Leute im Einklang.
I find that I am “in tune” with other people’s moods.
EMP7
Ich empfinde einen starken Drang zu helfen, wenn ich jemanden sehe, der aufgebracht ist.
I get a strong urge to help when I see someone who is upset.
EMP8
Wenn ich jemanden sehe, der ausgenutzt wird, möchte ich die Person beschützen.
When I see someone being taken advantage of, I feel kind of protective towards him/her.
Big5
Ich
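The coded answers listed above could be turned back into readable labels roughly as follows. This sketch covers only the Gender, Age and BornIn columns with the English scales, and it assumes that demographics_politics.tsv starts with a header row using the column names from the table; the file description does not confirm this. The function name decode_demographics and the output keys are hypothetical.

```python
import csv
import io

# English scales copied from the table above.
GENDER_EN = {"0": "male", "1": "female", "2": "other", "3": "no answer"}
BORN_IN_EN = {"0": "yes", "1": "no", "2": "no answer"}


def decode_demographics(handle):
    """Decode selected coded columns from tab-separated demographics data.

    Assumes a header row with the column names "Gender", "Age" and "BornIn".
    """
    reader = csv.DictReader(handle, delimiter="\t")
    decoded = []
    for row in reader:
        age_raw = (row.get("Age") or "").strip()
        decoded.append(
            {
                "gender": GENDER_EN.get(row.get("Gender"), "unknown code"),
                "age": int(age_raw) if age_raw.isdigit() else None,
                "born_in_country": BORN_IN_EN.get(row.get("BornIn"), "unknown code"),
            }
        )
    return decoded


# Usage with a small in-memory sample instead of the real file.
sample = io.StringIO("Gender\tAge\tBornIn\n1\t34\t0\n")
records = decode_demographics(sample)
```

The same dictionary-based approach would extend to the remaining coded columns, using the German scales for the German study where the codes differ (for example Qualification).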
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
The MLCC text corpus has two main components: one set that allows comparable studies to be carried out in different languages, and one set that serves as the basis for translation studies. The first set is referred to as the Polylingual Document Collection, a collection of newspaper articles from financial newspapers in 6 languages (Dutch, English, French, German, Italian and Spanish). It consists of the following sub-corpora:
Dutch - Het Financieele Dagblad - 1992-1993 (Samples)
The corpus contains articles from the Dutch financial newspaper Het Financieele Dagblad, editions of 2nd January 1992 through to 24th December 1993. It contains around 8.5 million words of text.
English - The Financial Times - 1993 (Samples)
The corpus contains articles from the British financial newspaper The Financial Times, editions from the year 1993. The corpus contains around 30 million words.
French - Le Monde - 1992-1993 (Samples)
A corpus of articles from the French newspaper Le Monde, consisting of two years' worth (1992-1993) of articles on financial subjects, approximately 10 million words.
German - Handelsblatt - 1986-1988 (Samples)
This subcorpus consists of articles from the period 02.01.1986 to 15.06.1988. It contains some 33 million words. It may be possible to obtain more recent articles from Handelsblatt.
Italian - Il Sole 24 Ore - 1992-1993 (Samples)
The corpus described here contains articles from the Italian financial newspaper Il Sole 24 Ore from the year 1992. This corpus contains some 1.88 million words. The SGML markup was done by the University of Edinburgh.
Spanish - Expansion - 1994 (Samples)
This subcorpus contains articles from the Spanish financial newspaper Expansion, editions from 21.10.1991 to 24.10.1991 and 14.05.1994 to 27.12.1994. It contains some 10 million words.
The second set is a Multilingual Parallel Corpus consisting of translated data in nine European languages: Danish, Dutch, English, French, German, Greek, Italian, Portuguese and Spanish.
The parallel data, provided by the European Commission, comprises two sub-corpora from the Official Journal of the European Communities:
Official Journal of the European Commission, C Series: Written Questions 1993
Records of questions and answers regarding European Community matters. The data is regularly published as one section of the C Series of the Official Journal of the European Community in all official languages (previously nine). This corpus contains written questions asked by members of the European Parliament and corresponding answers from the European Commission in 9 parallel versions. The total size of the corpus is approximately 10.2 million words (ca. 1.1 million words per language).
Official Journal of the European Commission, Annex: Debates of the European Parliament 1992-1994
This parallel corpus consists of the records of Parliamentary sittings published as an annex to the Official Journal of the European Community, Debates of the European Parliament. The Parliamentary Debates are a record of what was said by mem...
https://choosealicense.com/licenses/unknown/
This is a collection of parallel corpora collected by Hercules Dalianis and his research group for bilingual dictionary construction. More information in: Hercules Dalianis, Hao-chun Xing, Xin Zhang: Creating a Reusable English-Chinese Parallel Corpus for Bilingual Dictionary Construction, In Proceedings of LREC2010 (source: http://people.dsv.su.se/~hercules/SEC/) and Konstantinos Charitakis (2007): Using Parallel Corpora to Create a Greek-English Dictionary with UPLUG, In Proceedings of NODALIDA 2007. Afrikaans-English: Aldin Draghoender and Mattias Kanhov: Creating a reusable English – Afrikaans parallel corpora for bilingual dictionary construction
4 languages, 3 bitexts
total number of files: 6
total number of tokens: 1.32M
total number of sentence fragments: 0.15M