35 datasets found

2013 American Community Survey - Table Packages: Detailed Language Spoken in...
catalog.data.gov
s.cnmilf.com
Updated Jul 19, 2023
Cite
U.S. Census Bureau (2023). 2013 American Community Survey - Table Packages: Detailed Language Spoken in the U.S. [Dataset]. https://catalog.data.gov/dataset/2013-american-community-survey-table-packages-detailed-language-spoken-in-the-u-s
Explore at:
Dataset updated
Jul 19, 2023
Dataset provided by
United States Census Bureau (http://census.gov/)
Area covered
United States
Description
This data set uses the 2009-2013 American Community Survey to tabulate the number of speakers of languages spoken at home and the number of speakers of each language who speak English less than very well. These tabulations are available for the following geographies: nation; each of the 50 states, plus Washington, D.C. and Puerto Rico; counties with 100,000 or more total population and 25,000 or more speakers of languages other than English and Spanish; core-based statistical areas (metropolitan statistical areas and micropolitan statistical areas) with 100,000 or more total population and 25,000 or more speakers of languages other than English and Spanish.
Percent of Population with Limited Ability to Speak English
data.amerigeoss.org
coronavirus-resources.esri.com
+1 more
esri rest, html
Updated Jul 24, 2019
Cite
ESRI (2019). Percent of Population with Limited Ability to Speak English [Dataset]. https://data.amerigeoss.org/dataset/percent-of-population-with-limited-ability-to-speak-english
Explore at:
Available download formats: html, esri rest
Dataset updated
Jul 24, 2019
Dataset provided by
Esri (http://esri.com/)
Description
This map shows the percent of population with a limited ability to speak English by census tract. Search to your community and investigate the top language needs in nearby census tracts.

*DATA AS OF 2011-2015*
Data Source: U.S. Census Bureau's American Community Survey 5-year estimates, 2011-2015, Table B16001.

Complete list of all languages available in this data set (29):
Spanish or Spanish Creole; French (including Patois, Cajun); French Creole; Italian; Portuguese; German; Yiddish; Greek; Russian; Polish; Serbo-Croatian; Armenian; Persian; Gujarati; Hindi; Urdu; Chinese; Japanese; Korean; Mon-Khmer, Cambodian; Hmong; Thai; Laotian; Vietnamese; Tagalog; Navajo; Hungarian; Arabic; Hebrew. Those who have limited English ability and speak other languages are included in the percentage depicted in the map, but other languages will not appear in the ranked list or in the table.

Accompanying feature layer and viewing app are also available.
LANGUAGE SPOKEN AT HOME FOR THE POPULATION 5 YEARS AND OVER IN LIMITED...
catalog.data.gov
Updated Jan 31, 2025
+ more versions
Cite
City of Seattle ArcGIS Online (2025). LANGUAGE SPOKEN AT HOME FOR THE POPULATION 5 YEARS AND OVER IN LIMITED ENGLISH SPEAKING HOUSEHOLDS (B16003) [Dataset]. https://catalog.data.gov/dataset/language-spoken-at-home-for-the-population-5-years-and-over-in-limited-english-speaking-ho
Explore at:
Dataset updated
Jan 31, 2025
Dataset provided by
https://arcgis.com/
Description
Table from the American Community Survey (ACS) B16003 of age by language spoken at home for the population 5 years and over in limited English-speaking households. These are multiple, nonoverlapping vintages of the 5-year ACS estimates of population and housing attributes starting in 2010 shown by the corresponding census tract vintage. Also includes the most recent release annually.

King County, Washington census tracts with nonoverlapping vintages of the 5-year American Community Survey (ACS) estimates starting in 2010. Vintage identified in the "ACS Vintage" field.

The census tract boundaries match the vintage of the ACS data (currently 2010 and 2020) so please note the geographic changes between the decades. Tracts have been coded as being within the City of Seattle as well as assigned to neighborhood groups called "Community Reporting Areas". These areas were created after the 2000 census to provide geographically consistent neighborhoods through time for reporting U.S. Census Bureau data. This is not an attempt to identify neighborhood boundaries as defined by neighborhoods themselves.

Vintages: 2010, 2015, 2020, 2021, 2022, 2023
ACS Table(s): B16003
Data downloaded from: <a href='https://data.c
Population of the Limited English Proficient (LEP) Speakers by Community...
catalog.data.gov
data.cityofnewyork.us
+1 more
Updated Jan 19, 2024
+ more versions
Cite
data.cityofnewyork.us (2024). Population of the Limited English Proficient (LEP) Speakers by Community District [Dataset]. https://catalog.data.gov/dataset/population-of-the-limited-english-proficient-lep-speakers-by-community-district
Explore at:
Dataset updated
Jan 19, 2024
Dataset provided by
data.cityofnewyork.us
Description
Many residents of New York City speak more than one language; a number of them speak and understand non-English languages more fluently than English. This dataset, derived from the Census Bureau's American Community Survey (ACS), includes information on over 1.7 million limited English proficient (LEP) residents and a subset of that population called limited English proficient citizens of voting age (CVALEP) at the Community District level. There are 59 community districts throughout NYC, with each district being represented by a Community Board.
ACS English Ability and Linguistic Isolation Variables - Boundaries
hub.arcgis.com
covid-hub.gio.georgia.gov
+2 more
Updated Nov 14, 2019
+ more versions
Cite
Esri (2019). ACS English Ability and Linguistic Isolation Variables - Boundaries [Dataset]. https://hub.arcgis.com/maps/0c4d1027de6b4d6eb896d95f1240e1aa
Explore at:
Dataset updated
Nov 14, 2019
Dataset authored and provided by
Esri (http://esri.com/)
Description
This layer shows English ability and linguistic isolation by age group. This is shown by tract, county, and state boundaries. This service is updated annually to contain the most currently released American Community Survey (ACS) 5-year data, and contains estimates and margins of error. There are also additional calculated attributes related to this topic, which can be mapped or used within analysis. Linguistically isolated households are households in which no one 14 and over speaks English only or speaks a language other than English at home and speaks English very well. This layer is symbolized to show the percent of adult (18+) population who have limited English ability. To see the full list of attributes available in this service, go to the "Data" tab, and choose "Fields" at the top right.

Current Vintage: 2019-2023
ACS Table(s): B16003, B16004 (Not all lines of ACS table B16004 are available in this feature layer.)
Data downloaded from: Census Bureau's API for American Community Survey
Date of API call: December 12, 2024
National Figures: data.census.gov

The United States Census Bureau's American Community Survey (ACS): About the Survey; Geography & ACS; Technical Documentation; News & Updates

This ready-to-use layer can be used within ArcGIS Pro, ArcGIS Online, its configurable apps, dashboards, Story Maps, custom apps, and mobile apps. Data can also be exported for offline workflows. For more information about ACS layers, visit the FAQ. Please cite the Census and ACS when using this data.

Data Note from the Census: Data are based on a sample and are subject to sampling variability. The degree of uncertainty for an estimate arising from sampling variability is represented through the use of a margin of error. The value shown here is the 90 percent margin of error. The margin of error can be interpreted as providing a 90 percent probability that the interval defined by the estimate minus the margin of error and the estimate plus the margin of error (the lower and upper confidence bounds) contains the true value. In addition to sampling variability, the ACS estimates are subject to nonsampling error (for a discussion of nonsampling variability, see Accuracy of the Data). The effect of nonsampling error is not represented in these tables.

Data Processing Notes:
This layer is updated automatically when the most current vintage of ACS data is released each year, usually in December. The layer always contains the latest available ACS 5-year estimates. It is updated annually within days of the Census Bureau's release schedule. Click here to learn more about ACS data releases.
Boundaries come from the US Census TIGER geodatabases, specifically, the National Sub-State Geography Database (named tlgdb_(year)_a_us_substategeo.gdb). Boundaries are updated at the same time as the data updates (annually), and the boundary vintage appropriately matches the data vintage as specified by the Census. These are Census boundaries with water and/or coastlines erased for cartographic and mapping purposes. For census tracts, the water cutouts are derived from a subset of the 2020 Areal Hydrography boundaries offered by TIGER. Water bodies and rivers which are 50 million square meters or larger (mid to large sized water bodies) are erased from the tract level boundaries, as well as additional important features. For state and county boundaries, the water and coastlines are derived from the coastlines of the 2023 500k TIGER Cartographic Boundary Shapefiles. These are erased to more accurately portray the coastlines and Great Lakes. The original AWATER and ALAND fields are still available as attributes within the data table (units are square meters).
The States layer contains 52 records - all US states, Washington D.C., and Puerto Rico.
Census tracts with no population that occur in areas of water, such as oceans, are removed from this data service (Census Tracts beginning with 99).
Percentages and derived counts, and associated margins of error, are calculated values (that can be identified by the "_calc_" stub in the field name), and abide by the specifications defined by the American Community Survey.
Field alias names were created based on the Table Shells file available from the American Community Survey Summary File Documentation page.
Negative values (e.g., -4444...) have been set to null, with the exception of -5555... which has been set to zero. These negative values exist in the raw API data to indicate the following situations:
The margin of error column indicates that either no sample observations or too few sample observations were available to compute a standard error and thus the margin of error. A statistical test is not appropriate.
Either no sample observations or too few sample observations were available to compute an estimate, or a ratio of medians cannot be calculated because one or both of the median estimates falls in the lowest interval or upper interval of an open-ended distribution.
The median falls in the lowest interval of an open-ended distribution, or in the upper interval of an open-ended distribution. A statistical test is not appropriate.
The estimate is controlled. A statistical test for sampling variability is not appropriate.
The data for this geographic area cannot be displayed because the number of sample cases is too small.
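The Data Note from the Census in this description defines the lower and upper confidence bounds as the estimate minus and plus the 90 percent margin of error. A minimal Python sketch of that arithmetic follows; the `AcsEstimate` class and `confidence_bounds` function are hypothetical names chosen for illustration and are not part of any Esri or Census Bureau API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AcsEstimate:
    """One ACS value with its published 90 percent margin of error (hypothetical container)."""
    estimate: float
    moe: float


def confidence_bounds(value: AcsEstimate) -> tuple[float, float]:
    """Return (lower, upper) as estimate - MOE and estimate + MOE."""
    return (value.estimate - value.moe, value.estimate + value.moe)


# Illustrative numbers only, not taken from the dataset.
example = AcsEstimate(estimate=1200.0, moe=150.0)
lower, upper = confidence_bounds(example)
print(lower, upper)
```

The sketch covers sampling variability only; as the note says, nonsampling error is not represented in the published margins of error.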
1,136 Hours - English(the United States) Spontaneous Dialogue Smartphone...
m.nexdata.ai
nexdata.ai
Updated Nov 8, 2023
Cite
1,136 Hours - English(the United States) Spontaneous Dialogue Smartphone speech dataset [Dataset]. https://m.nexdata.ai/datasets/speechrecog/1004
Explore at:
Dataset updated
Nov 8, 2023
Dataset provided by
nexdata technology inc
Authors
Nexdata
Area covered
United States
Variables measured
Format, Country, Speaker, Language, Accuracy Rate, Content category, Recording device, Recording condition, Language(Region) Code, Features of annotation
Description
English (United States) spontaneous dialogue smartphone speech dataset, collected from dialogues on given topics and covering the generic domain. Transcribed with text content, speaker ID, gender, and other attributes. The dataset was collected from a large and geographically diverse pool of speakers (1,416 Americans), which is intended to improve model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring the maintenance of user privacy and legal rights throughout the data collection, storage, and usage processes; our datasets are all GDPR, CCPA, and PIPL compliant.
Data from: Language Spoken at Home
linc.osbm.nc.gov
csv, excel, geojson +1
Updated Oct 3, 2024
Cite
(2024). Language Spoken at Home [Dataset]. https://linc.osbm.nc.gov/explore/dataset/language-spoken-at-home/
Explore at:
Available download formats: geojson, csv, json, excel
Dataset updated
Oct 3, 2024
Description
Language spoken at home and the ability to speak English for the population age 5 and over, as reported in the US Census Bureau's American Community Survey (ACS) 5-year estimates, table C16001.
LIMITED ENGLISH SPEAKING HOUSEHOLDS (S1602)
catalog.data.gov
Updated Jan 31, 2025
+ more versions
Cite
City of Seattle ArcGIS Online (2025). LIMITED ENGLISH SPEAKING HOUSEHOLDS (S1602) [Dataset]. https://catalog.data.gov/dataset/limited-english-speaking-households-s1602
Explore at:
Dataset updated
Jan 31, 2025
Dataset provided by
https://arcgis.com/
Description
Table from the American Community Survey (ACS) S1602 limited English speaking households (households where no one age 14 and over speaks English "very well"). These are multiple, nonoverlapping vintages of the 5-year ACS estimates of population and housing attributes starting in 2010 shown by the corresponding census tract vintage. Also includes the most recent release annually.

King County, Washington census tracts with nonoverlapping vintages of the 5-year American Community Survey (ACS) estimates starting in 2010. Vintage identified in the "ACS Vintage" field.

The census tract boundaries match the vintage of the ACS data (currently 2010 and 2020) so please note the geographic changes between the decades. Tracts have been coded as being within the City of Seattle as well as assigned to neighborhood groups called "Community Reporting Areas". These areas were created after the 2000 census to provide geographically consistent neighborhoods through time for reporting U.S. Census Bureau data. This is not an attempt to identify neighborhood boundaries as defined by neighborhoods themselves.

Vintages: 2010, 2015, 2020, 2021, 2022, <a href='https://www.census.gov/programs-surveys/acs/news/data-releases/2023/release.html#5yr' style='font-family:inhe
Languages and English Ability - Seattle Neighborhoods
arc-gis-hub-home-arcgishub.hub.arcgis.com
catalog.data.gov
Updated Feb 22, 2024
+ more versions
Cite
Languages and English Ability - Seattle Neighborhoods [Dataset]. https://arc-gis-hub-home-arcgishub.hub.arcgis.com/datasets/5ebf54a443194f1080ffde06d1d381b5
Explore at:
Dataset updated
Feb 22, 2024
Dataset provided by
https://arcgis.com/
Authors
City of Seattle ArcGIS Online
Area covered
Seattle
Description
Table from the American Community Survey (ACS) 5-year series on languages spoken and English ability related topics for City of Seattle Council Districts, Comprehensive Plan Growth Areas and Community Reporting Areas. Table includes B16004 Age by Language Spoken at Home by Ability to Speak English, C16002 Household Language by Household Limited English-Speaking Status. Data is pulled from block group tables for the most recent ACS vintage and summarized to the neighborhoods based on block group assignment.

Table created for and used in the Neighborhood Profiles application.

Vintages: 2023
ACS Table(s): B16004, C16002
Data downloaded from: Census Bureau's Explore Census Data

The United States Census Bureau's American Community Survey (ACS): About the Survey; Geography & ACS; Technical Documentation; News & Updates

This ready-to-use layer can be used within ArcGIS Pro, ArcGIS Online, its configurable apps, dashboards, Story Maps, custom apps, and mobile apps. Data can also be exported for offline workflows. Please cite the Census and ACS when using this data.

Data Note from the Census: Data are based on a sample and are subject to sampling variability. The degree of uncertainty for an estimate arising from sampling variability is represented through the use of a margin of error. The value shown here is the 90 percent margin of error. The margin of error can be interpreted as providing a 90 percent probability that the interval defined by the estimate minus the margin of error and the estimate plus the margin of error (the lower and upper confidence bounds) contains the true value. In addition to sampling variability, the ACS estimates are subject to nonsampling error (for a discussion of nonsampling variability, see Accuracy of the Data). The effect of nonsampling error is not represented in these tables.

Data Processing Notes:
Boundaries come from the US Census TIGER geodatabases, specifically, the National Sub-State Geography Database (named tlgdb_(year)_a_us_substategeo.gdb). Boundaries are updated at the same time as the data updates (annually), and the boundary vintage appropriately matches the data vintage as specified by the Census. These are Census boundaries with water and/or coastlines erased for cartographic and mapping purposes. For census tracts, the water cutouts are derived from a subset of the 2020 Areal Hydrography boundaries offered by TIGER. Water bodies and rivers which are 50 million square meters or larger (mid to large sized water bodies) are erased from the tract level boundaries, as well as additional important features. For state and county boundaries, the water and coastlines are derived from the coastlines of the 2020 500k TIGER Cartographic Boundary Shapefiles. These are erased to more accurately portray the coastlines and Great Lakes. The original AWATER and ALAND fields are still available as attributes within the data table (units are square meters).
The States layer contains 52 records - all US states, Washington D.C., and Puerto Rico.
Census tracts with no population that occur in areas of water, such as oceans, are removed from this data service (Census Tracts beginning with 99).
Percentages and derived counts, and associated margins of error, are calculated values (that can be identified by the "_calc_" stub in the field name), and abide by the specifications defined by the American Community Survey.
Field alias names were created based on the Table Shells file available from the American Community Survey Summary File Documentation page.
Negative values (e.g., -4444...) have been set to null, with the exception of -5555... which has been set to zero. These negative values exist in the raw API data to indicate the following situations:
The margin of error column indicates that either no sample observations or too few sample observations were available to compute a standard error and thus the margin of error. A statistical test is not appropriate.
Either no sample observations or too few sample observations were available to compute an estimate, or a ratio of medians cannot be calculated because one or both of the median estimates falls in the lowest interval or upper interval of an open-ended distribution.
The median falls in the lowest interval of an open-ended distribution, or in the upper interval of an open-ended distribution. A statistical test is not appropriate.
The estimate is controlled. A statistical test for sampling variability is not appropriate.
The data for this geographic area cannot be displayed because the number of sample cases is too small.
Language Spoken at Home 2018-2022 - STATES
hub.arcgis.com
mce-data-uscensus.hub.arcgis.com
Updated Feb 4, 2024
+ more versions
Cite
US Census Bureau (2024). Language Spoken at Home 2018-2022 - STATES [Dataset]. https://hub.arcgis.com/maps/d89bebf3729d4540856fb3176c9d32f8
Explore at:
Dataset updated
Feb 4, 2024
Dataset provided by
United States Census Bureau (http://census.gov/)
Authors
US Census Bureau
Area covered
Pacific Ocean, North Pacific Ocean
Description
This layer shows Language Spoken at Home. This is shown by state and county boundaries. This service contains the 2018-2022 release of data from the American Community Survey (ACS) 5-year data, and contains estimates and margins of error. There are also additional calculated attributes related to this topic, which can be mapped or used within analysis. This layer is symbolized to show the percentage of households with Limited English Speaking Status. To see the full list of attributes available in this service, go to the "Data" tab, and choose "Fields" at the top right.

Current Vintage: 2018-2022
ACS Table(s): B16004, DP02, S1601, S1602
Data downloaded from: Census Bureau's API for American Community Survey
Date of API call: January 18, 2024
National Figures: data.census.gov

The United States Census Bureau's American Community Survey (ACS): About the Survey; Geography & ACS; Technical Documentation; News & Updates

This ready-to-use layer can be used within ArcGIS Pro, ArcGIS Online, its configurable apps, dashboards, Story Maps, custom apps, and mobile apps. Data can also be exported for offline workflows. Please cite the Census and ACS when using this data.

Data Note from the Census: Data are based on a sample and are subject to sampling variability. The degree of uncertainty for an estimate arising from sampling variability is represented through the use of a margin of error. The value shown here is the 90 percent margin of error. The margin of error can be interpreted as providing a 90 percent probability that the interval defined by the estimate minus the margin of error and the estimate plus the margin of error (the lower and upper confidence bounds) contains the true value. In addition to sampling variability, the ACS estimates are subject to nonsampling error (for a discussion of nonsampling variability, see Accuracy of the Data). The effect of nonsampling error is not represented in these tables.

Data Processing Notes:
Boundaries come from the Cartographic Boundaries via US Census TIGER geodatabases. Boundaries are updated at the same time as the data updates, and the boundary vintage appropriately matches the data vintage as specified by the Census. These are Census boundaries with water and/or coastlines clipped for cartographic purposes. For state and county boundaries, the water and coastlines are derived from the coastlines of the 500k TIGER Cartographic Boundary Shapefiles. The original AWATER and ALAND fields are still available as attributes within the data table (units are square meters).
The States layer contains 52 records - all US states, Washington D.C., and Puerto Rico. The Counties (and equivalent) layer contains 3221 records - all counties and equivalent, Washington D.C., and Puerto Rico municipios. See Areas Published.
Percentages and derived counts, and associated margins of error, are calculated values (that can be identified by the "_calc_" stub in the field name), and abide by the specifications defined by the American Community Survey.
Field alias names were created based on the Table Shells.
Margin of error (MOE) values of -555555555 in the API (or "*****" (five asterisks) on data.census.gov) are displayed as 0 in this dataset. The estimates associated with these MOEs have been controlled to independent counts in the ACS weighting and have zero sampling error. So, the MOEs are effectively zeroes, and are treated as zeroes in MOE calculations.
Other negative values on the API, such as -222222222, -666666666, -888888888, and -999999999, all represent estimates or MOEs that can't be calculated or can't be published, usually due to small sample sizes. All of these are rendered in this dataset as null (blank) values.
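The sentinel handling described in these processing notes (-555555555 becomes 0; the other listed negative codes become null) can be sketched in Python as below. This is an illustration under stated assumptions, not the publisher's actual pipeline: the function name `clean_acs_value` and the constant names are hypothetical, and only the codes listed in the description are treated as sentinels.

```python
from typing import Optional

# Sentinel codes as listed in the dataset description above.
CONTROLLED_MOE = -555555555  # MOE of a controlled estimate: treated as zero
UNAVAILABLE_CODES = frozenset(
    {-222222222, -666666666, -888888888, -999999999}
)  # value could not be calculated or published: treated as null


def clean_acs_value(raw: Optional[float]) -> Optional[float]:
    """Map raw API sentinel codes to 0.0 or None; pass other values through as floats."""
    if raw is None:
        return None
    if raw == CONTROLLED_MOE:
        return 0.0
    if raw in UNAVAILABLE_CODES:
        return None
    return float(raw)


# Illustrative inputs only, not taken from the dataset.
raw_values = [1520, -555555555, -666666666, 87]
print([clean_acs_value(v) for v in raw_values])
```

Because the description introduces the unavailable codes with "such as", the set here may be incomplete; any additional codes would need to be added to `UNAVAILABLE_CODES`.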
peoples_speech
huggingface.co
Updated Nov 12, 2022
+ more versions
Cite
peoples_speech [Dataset]. https://huggingface.co/datasets/MLCommons/peoples_speech
Explore at:
Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 12, 2022
Dataset authored and provided by
MLCommons
License
Attribution 2.0 (CC BY 2.0), https://creativecommons.org/licenses/by/2.0/
License information was derived automatically
Description
Dataset Card for People's Speech

Dataset Summary

The People's Speech Dataset is among the world's largest English speech recognition corpora today that are licensed for academic and commercial usage under CC-BY-SA and CC-BY 4.0. It includes 30,000+ hours of transcribed speech in English with a diverse set of speakers. This open dataset is large enough to train speech-to-text systems and, crucially, is available with a permissive license.… See the full description on the dataset page: https://huggingface.co/datasets/MLCommons/peoples_speech.
Language Spoken at Home by Zip Code Tabulation Area 2012-2016
johnsnowlabs.com
csv
Updated Jan 20, 2021
+ more versions
Cite
John Snow Labs (2021). Language Spoken at Home by Zip Code Tabulation Area 2012-2016 [Dataset]. https://www.johnsnowlabs.com/marketplace/language-spoken-at-home-by-zip-code-tabulation-area-2012-2016/
Explore at:
Available download formats: csv
Dataset updated
Jan 20, 2021
Dataset authored and provided by
John Snow Labs
Time period covered
2012 - 2016
Area covered
United States
Description
This American Community Survey (ACS) data set identifies the language spoken at home by zip code tabulation area within the United States, from 2012 through 2016. The dataset identifies languages spoken and how well English is spoken by Zip Code Tabulation Area.
Health Insurance Minorities and Low English Level by Tracts 2014-2018
johnsnowlabs.com
csv
Updated Jan 20, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
John Snow Labs (2021). Health Insurance Minorities and Low English Level by Tracts 2014-2018 [Dataset]. https://www.johnsnowlabs.com/marketplace/health-insurance-minorities-and-low-english-level-by-tracts-2014-2018/
Explore at:
Available download formats: csv
Dataset updated
Jan 20, 2021
Dataset authored and provided by
John Snow Labs
Time period covered
2014 - 2018
Area covered
US
Description
This dataset contains census-tract-level estimates of the number of uninsured non-institutionalized civilians, the number of persons belonging to a minority (by ethnicity, including the Hispanic/Latino population), and the number of persons aged 5 and older who speak English less than well. The dataset covers all US census tracts, and the estimates are based on data collected from 2014 to 2018 by the American Community Survey (ACS).
215 Hours - English(the United States) Scripted Monologue Smartphone speech...
m.nexdata.ai
Updated Sep 28, 2023
+ more versions
Cite
Nexdata (2023). 215 Hours - English(the United States) Scripted Monologue Smartphone speech dataset [Dataset]. https://m.nexdata.ai/datasets/speechrecog/78
Explore at:
Dataset updated
Sep 28, 2023
Dataset provided by
nexdata technology inc
Authors
Nexdata
Area covered
United States
Variables measured
Format, Country, Speaker, Language, Accuracy Rate, Content category, Recording device, Recording condition, Language(Region) Code, Features of annotation
Description
English (United States) scripted monologue smartphone speech dataset, collected from monologues based on given scripts and covering the economy, entertainment, news, informal language, numbers, and alphabet domains. Transcribed with text content and other attributes. The dataset was collected from a large and geographically diverse pool of speakers (349 speakers), which is intended to improve model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring the maintenance of user privacy and legal rights throughout the data collection, storage, and usage processes; our datasets are all GDPR, CCPA, and PIPL compliant.
Data from: Foreign Language Proficiency Test Data from Three American...
icpsr.umich.edu
Updated Mar 10, 2020
Cite
Winke, Paula Marie; Gass, Susan M.; Soneson, Dan; Rubio, Fernando; Hacking, Jane F. (2020). Foreign Language Proficiency Test Data from Three American Universities, [United States], 2014-2017 [Dataset]. http://doi.org/10.3886/ICPSR37499.v1
Explore at:
Unique identifier
https://doi.org/10.3886/ICPSR37499.v1
Dataset updated
Mar 10, 2020
Dataset provided by
Inter-university Consortium for Political and Social Research (https://www.icpsr.umich.edu/web/pages/)
Authors
Winke, Paula Marie; Gass, Susan M.; Soneson, Dan; Rubio, Fernando; Hacking, Jane F.
License
https://www.icpsr.umich.edu/web/ICPSR/studies/37499/terms
Time period covered
Aug 15, 2014 - Jun 15, 2017
Area covered
Michigan, Utah, Minnesota, United States
Description
In the years 2014 through 2019, three U.S. universities, Michigan State University, the University of Minnesota, Twin Cities, and The University of Utah, received Language Proficiency Flagship Initiative grants as part of the larger Language Flagship, which is a National Security Education Program (NSEP) and Defense Language and National Security Education Office (DLNSEO) initiative to improve language learning in the United States. The goal of the three universities' Language Proficiency Flagship Initiative grants was to document language proficiency in regular tertiary foreign language programs so that the programs, and ones like them at other universities, could use the proficiency-achievement data to set programmatic learning benchmarks and recommendations, as called for by the Modern Language Association in 2007. This call was reiterated by the National Standards Collaborative Board in 2015.During the first three years of the three, university-specific five-year grants (Fall 2014 through Spring 2017), each university collected language proficiency data during academic years 2014-2015, 2015-2016, and 2016-2017, from language learners in selected, regular language programs to document the students' proficiency achievements.University A tested Chinese, French, Russian, and Spanish with the NSEP grant funding, and German, Italian, Japanese, Korean, and Portuguese with additional (in-kind) financial support from within University A.University B tested Arabic, French, Portuguese, Russian, and Spanish with the NSEP grant funding, and German and Korean with additional (in-kind) financial support from University B.University C tested Arabic, Chinese, Portuguese, and Russian with the NSEP grant funding, and Korean with additional (in-kind) financial support from University C.Each university additionally provided the students background questionnaires at the time of testing. 
As stipulated by the grant terms, at the universities, students were offered to take up to three proficiency tests each semester: speaking, listening, and reading. Writing was not assessed because the grants did not financially cover the costs of writing assessments. The universities were required by grant terms to use official, nationally recognized, and standardized language tests that reported scores out on one of two standardized proficiency test scales: either the American Councils of Teaching Foreign Languages (ACTFL, 2012) proficiency scale, or the Interagency Language Roundtable (ILR: Herzog, n.d.) proficiency scale. The three universities thus contracted mostly with Language Testing International, ACTFL's official testing subsidiary, to purchase and administer to students the Oral Proficiency Interview - computer (OPIc) for speaking, the Listening Proficiency Test (LPT) for listening, and the Reading Proficiency Test (RPT) for reading. However, earlier in the grant cycling, because ACTFL did not yet have tests in all of the languages to be tested, some of the earlier testing was contracted with American Councils and Avant STAMP, even though those tests are not specifically geared for the specific populations of learners present in the given project.Students were able to opt out of testing in certain cases; those cases varied from university to university. The speaking tests occurred normally within intact classes that came into computer labs to take the tests. Students were often times requested to take the listening and reading tests outside of class time in proctored language labs on the campuses on walk-in bases, or they took the listening and reading tests in a language lab during a regular class setting. These decisions were often made by the language instructors and/or the language programs. 
The data are cross-sectional, but certain individuals took the tests repeatedly; thus, longitudinal data sets are nested within the cross-sectional data.

The three universities worked mostly independently during the initial year of data collection because the identities of the three universities receiving the grants were not announced until weeks before testing was to begin on all three campuses. Thus, each university independently designed its background questionnaire. However, because all three were guided by the same grant rules to use nationally recognized standardized tests for the assessments, combining all three universities' test data was
R2 & NE: County Level 2006-2010 ACS Language Summary
data.wu.ac.at
tgrshp (compressed)
Updated Jan 9, 2018
+ more versions
Cite
U.S. Environmental Protection Agency (2018). R2 & NE: County Level 2006-2010 ACS Language Summary [Dataset]. https://data.wu.ac.at/schema/data_gov/MDlmYWU1NTgtYTE0MC00MmQzLWFiNjctZTZmNmRjZWRjZjUw
Explore at:
Available download formats: tgrshp (compressed)
Dataset updated
Jan 9, 2018
Dataset provided by
U.S. Environmental Protection Agency
License
U.S. Government Works (https://www.usa.gov/government-works)
License information was derived automatically
Area covered
d35544e3e9f834628d1f33fe9e4dcf1c018814d6
Description
The TIGER/Line Files are shapefiles and related database files (.dbf) that are an extract of selected geographic and cartographic information from the U.S. Census Bureau's Master Address File / Topologically Integrated Geographic Encoding and Referencing (MAF/TIGER) Database (MTDB). The MTDB represents a seamless national file with no overlaps or gaps between parts; however, each TIGER/Line File is designed to stand alone as an independent data set, or they can be combined to cover the entire nation. The primary legal divisions of most States are termed counties. In Louisiana, these divisions are known as parishes. In Alaska, which has no counties, the equivalent entities are the organized boroughs, city and boroughs, and municipalities, and for the unorganized area, census areas. The latter are delineated cooperatively for statistical purposes by the State of Alaska and the Census Bureau. In four States (Maryland, Missouri, Nevada, and Virginia), there are one or more incorporated places that are independent of any county organization and thus constitute primary divisions of their States. These incorporated places are known as independent cities and are treated as equivalent entities for purposes of data presentation. The District of Columbia and Guam have no primary divisions, and each area is considered an equivalent entity for purposes of data presentation. The Census Bureau treats the following entities as equivalents of counties for purposes of data presentation: Municipios in Puerto Rico, Districts and Islands in American Samoa, Municipalities in the Commonwealth of the Northern Mariana Islands, and Islands in the U.S. Virgin Islands. The entire area of the United States, Puerto Rico, and the Island Areas is covered by counties or equivalent entities. The 2010 Census boundaries for counties and equivalent entities are as of January 1, 2010, primarily as reported through the Census Bureau's Boundary and Annexation Survey (BAS).

This table contains data on language ability and linguistic isolation from the American Community Survey 2006-2010 database for counties. Linguistic isolation is defined as a household in which no one aged 14 or over speaks only English or speaks English "very well". The American Community Survey (ACS) is a household survey conducted by the U.S. Census Bureau that currently has an annual sample size of about 3.5 million addresses. ACS estimates provide communities with the current information they need to plan investments and services. Information from the survey generates estimates that help determine how more than $400 billion in federal and state funds are distributed annually. Each year the survey produces 1-year, 3-year, and 5-year estimates for geographic areas in the United States and Puerto Rico, ranging from neighborhoods to Congressional districts to the entire nation. This table also has a companion table (same table name with an MOE suffix) containing the margin of error (MOE) value for each estimated element. The MOE is expressed in the same units as the estimate, so a value of 25 with an MOE of 5 means 25 +/- 5 (or statistical certainty between 20 and 30). There are also special cases of MOE. An MOE of -1 means the associated estimate does not have a measured error. An MOE of 0 means that error calculation is not appropriate for the associated value. An MOE of 109 is set whenever an estimate value is 0. The MOEs of aggregated elements and percentages must be calculated, using the standard error calculations described in "American Community Survey Multiyear Accuracy of the Data (3-year 2008-2010 and 5-year 2006-2010)". Also, following Census guidelines, aggregated MOEs use no more than one zero-element MOE (109), to prevent overestimation of the error. Due to the complexity of the calculations, some percentage MOEs cannot be calculated (these are set to null in the summary-level MOE tables).
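
The aggregation rule described above can be illustrated with a minimal Python sketch. It assumes the usual root-sum-of-squares approximation for the MOE of a sum of estimates and counts the zero-estimate MOE at most once, as the description states. The function name `aggregate_moe`, the input layout, and the choice to skip the special MOE codes -1 and 0 are illustrative assumptions, not part of the dataset or its documentation.

```python
import math

# Assumption for illustration: input is a list of (estimate, moe) pairs
# taken from the estimate table and its companion MOE table.

def aggregate_moe(elements):
    """Approximate the MOE of the sum of several ACS estimates.

    Uses the root-sum-of-squares approximation. MOEs attached to
    zero-valued estimates are counted at most once, following the
    guideline quoted in the description. MOE codes of -1 and 0 are
    skipped here (an assumption: they mark values with no measured
    or no applicable sampling error).
    """
    sum_of_squares = 0.0
    zero_estimate_moe_used = False
    for estimate, moe in elements:
        if moe <= 0:
            continue  # special codes -1 and 0 contribute no error term
        if estimate == 0:
            if zero_estimate_moe_used:
                continue  # only one zero-estimate MOE is allowed
            zero_estimate_moe_used = True
        sum_of_squares += moe ** 2
    return math.sqrt(sum_of_squares)


if __name__ == "__main__":
    # Two non-zero elements: sqrt(5^2 + 12^2)
    print(aggregate_moe([(25, 5), (30, 12)]))
    # Two zero estimates: the 109 MOE enters the sum only once
    print(aggregate_moe([(25, 5), (0, 109), (0, 109)]))
```

Counting every zero-estimate MOE would inflate the combined error quickly when many categories are empty, which is the overestimation the guideline is meant to avoid.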

The name for table 'ACS10LANCNTYMOE' was added as a prefix to all field names imported from that table. Be sure to turn off 'Show Field Aliases' to see complete field names in the Attribute Table of this feature layer. This can be done in the 'Table Options' drop-down menu in the Attribute Table or with key sequence '[CTRL]+[SHIFT]+N'. Due to database restrictions, the prefix may have been abbreviated if the field name exceeded the maximum allowed characters.
hind_encorp
huggingface.co
paperswithcode.com
+3more
Updated Mar 22, 2014
+ more versions
Cite
Pavel Rychlý (2014). hind_encorp [Dataset]. https://huggingface.co/datasets/pary/hind_encorp
Explore at:
Dataset updated
Mar 22, 2014
Authors
Pavel Rychlý
License
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
License information was derived automatically
Description
HindEnCorp parallel texts (sentence-aligned) come from the following sources: Tides, which contains 50K sentence pairs taken mainly from news articles. This dataset was originally collected for the DARPA-TIDES surprise-language contest in 2002, later refined at IIIT Hyderabad and provided for the NLP Tools Contest at ICON 2008 (Venkatapathy, 2008).

Commentaries by Daniel Pipes contain 322 articles in English written by the journalist Daniel Pipes and translated into Hindi.

EMILLE. This corpus (Baker et al., 2002) consists of three components: monolingual, parallel and annotated corpora. There are fourteen monolingual subcorpora, including both written and (for some languages) spoken data for fourteen South Asian languages. The EMILLE monolingual corpora contain in total 92,799,000 words (including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu). The parallel corpus consists of 200,000 words of text in English and its accompanying translations into Hindi and other languages.

Smaller datasets as collected by Bojar et al. (2010) include the corpus used at ACL 2005 (a subcorpus of EMILLE), a corpus of named entities from Wikipedia (crawled in 2009), and Agriculture domain parallel corpus. For the current release, we are extending the parallel corpus using these sources: Intercorp (Čermák and Rosen, 2012) is a large multilingual parallel corpus of 32 languages including Hindi. The central language used for alignment is Czech. Intercorp’s core texts amount to 202 million words. These core texts are most suitable for us because their sentence alignment is manually checked and therefore very reliable. They cover predominately short stories and novels. There are seven Hindi texts in Intercorp. Unfortunately, only for three of them the English translation is available; the other four are aligned only with Czech texts. The Hindi subcorpus of Intercorp contains 118,000 words in Hindi.

TED talks, held in various languages, primarily English, are equipped with transcripts, and these are translated into 102 languages. There are 179 talks for which Hindi translation is available.

The Indic multi-parallel corpus (Birch et al., 2011; Post et al., 2012) is a corpus of texts from Wikipedia translated from the respective Indian language into English by non-expert translators hired over Mechanical Turk. The quality is thus somewhat mixed in many respects, from typesetting and punctuation through capitalization, spelling, and word choice to sentence structure. A little bit of control could be in principle obtained from the fact that every input sentence was translated 4 times. We used the 2012 release of the corpus.

Launchpad.net is a software collaboration platform that hosts many open-source projects and also facilitates collaborative localization of the tools. We downloaded all revisions of all the hosted projects and extracted the localization (.po) files.

Other smaller datasets. This time, we added Wikipedia entities as crawled in 2013 (including any morphological variants of the named entity that appear on the Hindi variant of the Wikipedia page) and words, word examples and quotes from the Shabdkosh online dictionary.
Data from: Children’s third-party understanding of communicative...
openscholarship.wustl.edu
data.library.wustl.edu
docx, xlsx
Updated Dec 14, 2017
Cite
Afshordi, Narges; Sullivan, Kathleen R.; Markson, Lori (2017). Children’s third-party understanding of communicative interactions in a foreign language DataSet [Dataset]. http://doi.org/10.7936/K74B30QG
Explore at:
Available download formats: docx (107472), xlsx (25278)
Unique identifier
https://doi.org/10.7936/K74B30QG
Dataset updated
Dec 14, 2017
Dataset provided by
Harvard University
Washington University in St. Louis
US Department of Health & Human Services
Authors
Afshordi, Narges; Sullivan, Kathleen R.; Markson, Lori
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Two studies explored young children’s understanding of the role of shared language in communication by investigating how monolingual English-speaking children interact with an English speaker, a Spanish speaker, and a bilingual experimenter who spoke both English and Spanish. When the bilingual experimenter spoke in Spanish or English to request objects, four-year-old children, but not three-year-olds, used her language choice to determine whom she addressed (e.g. requests in Spanish were directed to the Spanish speaker). Importantly, children used this cue – language choice – only in a communicative context. The findings suggest that by four years, monolingual children recognize that speaking the same language enables successful communication, even when that language is unfamiliar to them. Three-year-old children’s failure to make this distinction suggests that this capacity likely undergoes significant development in early childhood, although other capacities might also be at play.
Department of Rehabilitation Office Contact Information and Addresses with...
data.ca.gov
data.chhs.ca.gov
+2more
csv, docx, zip
Updated Aug 28, 2024
Cite
Department of Rehabilitation Office Contact Information and Addresses with Languages Spoken [Dataset]. https://data.ca.gov/dataset/department-of-rehabilitation-office-contact-information-and-addresses-with-languages-spoken
Explore at:
Available download formats: csv, zip, docx
Dataset updated
Aug 28, 2024
Dataset authored and provided by
California Department of Rehabilitation (http://www.dor.ca.gov/)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is a list of Department of Rehabilitation (DOR) offices and includes contact information, addresses, and languages spoken in each office. Note: In addition to the languages listed, the DOR has various Bilingual language resources available in each office that allow us to serve members of the public who may speak a language other than English.
Wake Words & Voice Commands Speech Data: English (US)
futurebeeai.com
wav
Updated Aug 1, 2022
+ more versions
Cite
FutureBee AI (2022). Wake Words & Voice Commands Speech Data: English (US) [Dataset]. https://www.futurebeeai.com/dataset/wake-words-and-commands-dataset/wake-words-and-commands-english-us
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/data-license-agreement
Area covered
United States
Dataset funded by
FutureBeeAI
Description
Introduction
Welcome to the US English Wake Word & Command Dataset, meticulously designed to advance the development and accuracy of voice-activated systems. This dataset features an extensive collection of wake words and commands, essential for triggering and interacting with voice assistants and other voice-activated devices. Our dataset ensures these systems respond promptly and accurately to user inputs, enhancing their reliability and user experience.
Speech Data
This training dataset comprises over 20,000 audio recordings of wake words and command phrases designed to build robust and accurate voice assistant speech technology. Each participant recorded 400 recordings in diverse environments and at varying speeds. This dataset contains audio recordings of wake words, as well as wake words followed by commands.
• Participant Diversity:
  • Speakers: 50 native US English speakers from the FutureBeeAI Community.
  • Regions: Various states/provinces of the United States of America, ensuring a balanced representation of accents, dialects, and demographics.
  • Profile: Participants range from 18 to 70 years old, with a gender ratio of 60% male and 40% female.
• Recording Details:
  • Nature: Scripted audio recordings of wake words and command phrases.
  • Duration: Average of 1 to 15 seconds per recording.
  • Formats: WAV format with stereo channels, 16-bit depth, and sample rates from 16 to 48 kHz.
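
As a rough illustration of the stated format, the sketch below uses Python's standard `wave` module to check that a file is stereo, 16-bit, and sampled between 16 and 48 kHz. The helper names `check_wav_format` and `make_silent_wav` are hypothetical and not part of the dataset's tooling; the silent test clips are generated in memory because no sample files are given here.

```python
import io
import wave


def check_wav_format(source, min_rate=16_000, max_rate=48_000):
    """Return a list of mismatches against the stated format (empty if none).

    `source` may be a file path or a binary file-like object.
    """
    problems = []
    with wave.open(source, "rb") as wav:
        if wav.getnchannels() != 2:
            problems.append(f"expected 2 channels, found {wav.getnchannels()}")
        if wav.getsampwidth() != 2:  # sample width is in bytes: 2 bytes = 16 bits
            problems.append(f"expected 16-bit samples, found {8 * wav.getsampwidth()}-bit")
        rate = wav.getframerate()
        if not min_rate <= rate <= max_rate:
            problems.append(f"sample rate {rate} Hz is outside {min_rate}-{max_rate} Hz")
    return problems


def make_silent_wav(channels, rate, seconds=0.1):
    """Build a short silent 16-bit WAV in memory (for demonstration only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * (int(rate * seconds) * channels * 2))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(check_wav_format(make_silent_wav(channels=2, rate=16_000)))
    print(check_wav_format(make_silent_wav(channels=1, rate=8_000)))
```

A check of this kind can be run over downloaded files before training to catch recordings that deviate from the advertised specification.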

Dataset Diversity
This dataset includes recordings of various types of wake words and commands, in different environments and at different speeds, making it highly diverse.
• Different Types of Wake Words:
  • Automobile Wake Words: Hey Mercedes, Hey BMW, Hey Porsche, Hey Volvo, Hey Audi, Hi Genesis, Hey Mini, Hey Toyota, Ok Ford, Hey Hyundai, Ok Honda, Hello Kia, Hey Dodge, etc.
  • Voice Assistant Wake Words: Hey Siri, Ok Google, Alexa, Hey Cortana, Hi Bixby, Hey Celia, Hey Google, etc.
  • Home Appliances Wake Words: Hi LG, Ok LG, Hello Lloyd, etc.
• Different Types of Voice Commands: Depending on the application and use case, the dataset contains various types of commands, such as:
  • Automobile: Playing music, checking for directions, integrating with at-home devices, booking appointments, voice search, voice ordering, providing feedback, and more
  • Voice Assistant: Asking general questions, definitions, translations, explanations, asking for trivia or fun facts, playing music, making a call, controlling at-home devices, checking directions, nearby places and traffic conditions, shopping, calendar, reminders and to-do lists, and many more
  • Home Appliances: Controlling appliances, checking appliance status, setting up reminders or alarms, to-do lists and shopping lists, and many more
• Different Recording Environments:
  • Without any background noise or echo
  • Background traffic noise
  • Background people talking
• Different Recording Paces:
  • Normal speaking speed
  • Fast speaking speed
This extensive coverage ensures the dataset includes realistic scenarios, which is essential for developing effective voice assistant speech recognition models.
Metadata
The dataset provides comprehensive metadata for each audio recording and participant:
• Participant Metadata: Unique identifier, age, gender, country, state, district, accent and dialect.
• Other Metadata: Recording transcript, recording environment, recording pace, device details, sample rate, bit depth, file format, etc.

