Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Everyone who speaks a language, speaks it with an accent. A particular accent essentially reflects a person's linguistic background. When people listen to someone speak with a different accent from their own, they notice the difference, and they may even make certain biased social judgments about the speaker.
The speech accent archive is established to uniformly exhibit a large set of speech accents from a variety of language backgrounds. Native and non-native speakers of English all read the same English paragraph and are carefully recorded. The archive is constructed as a teaching tool and as a research tool. It is meant to be used by linguists as well as other people who simply wish to listen to and compare the accents of different English speakers.
This dataset allows you to compare the demographic and linguistic backgrounds of the speakers in order to determine which variables are key predictors of each accent. The speech accent archive demonstrates that accents are systematic rather than merely mistaken speech.
All of the linguistic analyses of the accents are available for public scrutiny. We welcome comments on the accuracy of our transcriptions and analyses.
This dataset contains 2140 speech samples, each from a different talker reading the same reading passage. Talkers come from 177 countries and have 214 different native languages. Each talker is speaking in English.
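As an illustration of the kind of predictor analysis the archive enables, here is a minimal pandas sketch. The column names (native_language, country, age) are illustrative assumptions modeled on the speaker attributes described above, not a documented schema.

```python
# A minimal sketch of exploring speaker background variables with pandas.
# Column names below are assumptions, not the archive's actual schema.
import pandas as pd

speakers = pd.DataFrame([
    {"native_language": "spanish",  "country": "mexico", "age": 34},
    {"native_language": "spanish",  "country": "spain",  "age": 41},
    {"native_language": "mandarin", "country": "china",  "age": 28},
])

# How many talkers per native language? Counting like this is the first
# step toward asking which background variables predict accent features.
counts = speakers["native_language"].value_counts()
print(counts.to_dict())
```

From here, one could cross-tabulate language background against transcribed accent features to look for systematic patterns.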
This dataset contains the following files:
This dataset was collected by many individuals (full list here) under the supervision of Steven H. Weinberger. The most up-to-date version of the archive is hosted by George Mason University. If you use this dataset in your work, please include the following citation:
Weinberger, S. (2013). Speech accent archive. George Mason University.
This dataset is distributed under a CC BY-NC-SA 2.0 license.
The following types of people may find this dataset interesting:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
English Proficiency by Age reports demographic details regarding how many people speak English natively, and the proficiency of non-native speakers.
License: https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the US English General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of English speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world US English communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade English speech models that understand and respond to authentic American accents and dialects.
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of US English. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
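As a sketch of how such a transcription file might feed an ASR pipeline, here is a minimal Python example. The field names ("segments", "speaker", "start", "end", "text") are illustrative assumptions; FutureBeeAI's actual JSON schema may differ.

```python
# Hedged sketch: parsing a hypothetical two-speaker JSON transcription.
# The schema shown here is an assumption for illustration only.
import json

sample = json.loads("""
{
  "segments": [
    {"speaker": "A", "start": 0.0, "end": 2.4, "text": "Hey, how have you been?"},
    {"speaker": "B", "start": 2.5, "end": 4.1, "text": "Pretty good, thanks!"}
  ]
}
""")

# Flatten segments into (speaker, text) pairs, e.g. for a training manifest.
pairs = [(s["speaker"], s["text"]) for s in sample["segments"]]
print(pairs)
```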
The dataset comes with granular metadata for both speakers and recordings:
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
This dataset is a versatile resource for multiple English speech and language AI applications:
Attribution 2.0 (CC BY 2.0): https://creativecommons.org/licenses/by/2.0/
License information was derived automatically
Dataset Card for People's Speech
Dataset Summary
The People's Speech Dataset is among the world's largest English speech recognition corpora licensed for academic and commercial use, under CC-BY-SA and CC-BY 4.0. It includes 30,000+ hours of transcribed English speech from a diverse set of speakers. This open dataset is large enough to train speech-to-text systems and, crucially, is available under a permissive license.
Supported Tasks… See the full description on the dataset page: https://huggingface.co/datasets/MLCommons/peoples_speech.
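A corpus of this size is most practically inspected in streaming mode. The sketch below uses the Hugging Face `datasets` library and the repository id from the URL above; the config name "clean" and per-example fields are assumptions that may differ in the actual release.

```python
# Hedged sketch: peeking at the People's Speech corpus without downloading
# all 30,000+ hours, plus a small duration-aggregation helper.

def peek_examples(n=2):
    """Yield the first n examples via streaming (requires `pip install datasets`).
    The config name "clean" is an assumption about the release."""
    from datasets import load_dataset
    ds = load_dataset("MLCommons/peoples_speech", "clean",
                      split="train", streaming=True)
    for i, example in enumerate(ds):
        if i >= n:
            break
        yield example

def total_hours(durations_ms):
    """Aggregate per-clip durations (milliseconds) into hours."""
    return sum(durations_ms) / 3_600_000

# e.g. three 20-minute clips amount to one hour of speech:
print(total_hours([1_200_000, 1_200_000, 1_200_000]))  # → 1.0
```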
535 Hours – German-Accented English Speech Dataset, collected from scripted monologues covering the generic domain, human-machine interaction, smart home commands, in-car commands, numbers, and other domains. Recordings are transcribed with text content and other attributes. The dataset was collected from a large, geographically diverse pool of speakers (1,162 people in total), enhancing model performance in real and complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout the data collection, storage, and usage processes; our datasets are GDPR, CCPA, and PIPL compliant.
Background
In the US, people who don’t speak English well often have a lower quality of life than those who do [1]. They may also have limited access to health care, including mental health services, and may not be able to take part in key national health surveys like the Behavioral Risk Factor Surveillance System (BRFSS). Communities where many people have limited English skills tend to live closer to toxic chemicals. Limited English skills can also make it harder for community members to get involved in local decision-making, which can affect environmental policies and lead to health inequalities.

Data Source
Washington Office of the Superintendent of Public Instruction (OSPI) | Public Records Center

Methodology
The data was collected through a public records request from the OSPI data portal. It shows what languages students speak at home, organized by school district. OSPI collects and reports data by academic year. For example, the 2023 data comes from the 2022-2023 school year (August 1, 2022 to May 31, 2023). OSPI updates this information regularly.

Caveats
These figures only include households with children enrolled in public schools from pre-K through 12th grade. The data may change over time as new information becomes available.

Source
1. Shariff-Marco, S., Gee, G. C., Breen, N., Willis, G., Reeve, B. B., Grant, D., Ponce, N. A., Krieger, N., Landrine, H., Williams, D. R., Alegria, M., Mays, V. M., Johnson, T. P., & Brown, E. R. (2009). A mixed-methods approach to developing a self-reported racial/ethnic discrimination measure for use in multiethnic health surveys. Ethnicity & disease, 19(4), 447–453.

Citation
Washington Tracking Network, Washington State Department of Health. Languages Spoken at Home. Data from the Washington Office of Superintendent of Public Instruction (OSPI). Published January 2026. Web.
License: https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Indian English General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of English speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Indian English communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade English speech models that understand and respond to authentic Indian accents and dialects.
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Indian English. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
The dataset comes with granular metadata for both speakers and recordings:
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
This dataset is a versatile resource for multiple English speech and language AI applications:
This dataset contains 117 hours of English speech from Latin American speakers, collected from scripted monologues covering the generic domain, human-machine interaction, smart home command and control, in-car command and control, numbers, and other domains. Recordings are transcribed with text content and other attributes. The dataset was collected from a large, geographically diverse pool of speakers (281 people in total), enhancing model performance in real and complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout the data collection, storage, and usage processes; our datasets are GDPR, CCPA, and PIPL compliant.
Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
The census is undertaken by the Office for National Statistics every 10 years and gives us a picture of all the people and households in England and Wales. The most recent census took place in March 2021.

The census asks every household questions about the people who live there and the type of home they live in. In doing so, it helps to build a detailed snapshot of society. Information from the census helps the government and local authorities to plan and fund local services, such as education, doctors' surgeries and roads.

Key census statistics for Leicester are published on the open data platform to make information accessible to local services, voluntary and community groups, and residents. There is also a dashboard showcasing various datasets from the census, allowing users to view data for all MSOAs and compare this with Leicester's overall statistics.

Further information about the census and full datasets can be found on the ONS website: https://www.ons.gov.uk/census/aboutcensus/censusproducts

Proficiency in English
This dataset provides Census 2021 estimates that classify usual residents in England and Wales by their proficiency in English. The estimates are as at Census Day, 21 March 2021.

Definition: How well people whose main language is not English (English or Welsh in Wales) speak English.

This dataset provides details for the MSOAs of Leicester city.
This dataset contains 520 hours of English speech from French speakers, collected from scripted monologues covering the generic domain, human-machine interaction, smart home commands, in-car commands, numbers, and other domains. Recordings are transcribed with text content and other attributes. The dataset was collected from a large, geographically diverse pool of speakers (1,089 people in total), enhancing model performance in real and complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout the data collection, storage, and usage processes; our datasets are GDPR, CCPA, and PIPL compliant.
This dataset contains 388 hours of English speech from Spanish speakers, collected from scripted monologues covering the generic domain, human-machine interaction, smart home commands, in-car commands, numbers, and other domains. Recordings are transcribed with text content and other attributes. The dataset was collected from a large, geographically diverse pool of speakers (891 people in total), enhancing model performance in real and complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout the data collection, storage, and usage processes; our datasets are GDPR, CCPA, and PIPL compliant.
Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
This provides estimates of the percentage of usual residents aged 3 and over in England and Wales by their proficiency in English. The proficiency in English classification corresponds to the tick-box response options on the census questionnaire. Estimates are used to help central government, local authorities and the NHS allocate resources and provide services for non-English speakers. They also help public service providers effectively target the delivery of their services, for example translation and interpretation services and material in alternative languages.

Statistical Disclosure Control: In order to protect against disclosure of personal information from the Census, records in the Census database have been swapped between different geographic areas, so some counts will be affected. In the main, the greatest effects will be at the lowest geographies, since the record swapping is targeted towards those households with unusual characteristics in small areas.

Data is powered by LG Inform Plus and automatically checked for new data on the 3rd of each month.
This dataset contains 230 hours of English speech from Russian speakers, collected from scripted monologues covering the generic domain, human-machine interaction, smart home command and control, in-car command and control, numbers, and other domains. Recordings are transcribed with text content and other attributes. The dataset was collected from a large, geographically diverse pool of speakers (498 people in total), enhancing model performance in real and complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout the data collection, storage, and usage processes; our datasets are GDPR, CCPA, and PIPL compliant.
This dataset contains 201 hours of English speech from Singaporean speakers, collected from scripted monologues covering the generic domain, human-machine interaction, smart home command and control, in-car command and control, numbers, and other domains. Recordings are transcribed with text content and other attributes. The dataset was collected from a large, geographically diverse pool of speakers (452 people in total), enhancing model performance in real and complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout the data collection, storage, and usage processes; our datasets are GDPR, CCPA, and PIPL compliant.
This dataset contains 207 hours of English speech from Canadian speakers, collected from scripted monologues covering the generic domain, human-machine interaction, smart home command and control, in-car command and control, numbers, and other domains. Recordings are transcribed with text content and other attributes. The dataset was collected from a large, geographically diverse pool of speakers (466 people in total), enhancing model performance in real and complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout the data collection, storage, and usage processes; our datasets are GDPR, CCPA, and PIPL compliant.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for "english_dialects"
Dataset Summary
This dataset consists of 31 hours of transcribed high-quality audio of English sentences recorded by 120 volunteers speaking with different accents of the British Isles. The dataset is intended for linguistic analysis as well as use for speech technologies. The speakers self-identified as native speakers of Southern England, Midlands, Northern England, Welsh, Scottish and Irish varieties of English. The recording scripts… See the full description on the dataset page: https://huggingface.co/datasets/ylacombe/english_dialects.
Many residents of New York City speak more than one language; a number of them speak and understand non-English languages more fluently than English. This dataset, derived from the Census Bureau's American Community Survey (ACS), includes information on over 1.7 million limited English proficient (LEP) residents and a subset of that population called limited English proficient citizens of voting age (CVALEP) at the Community District level. There are 59 community districts throughout NYC, with each district being represented by a Community Board.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Table contains count and percentage of county residents ages 5 years and older who speak English less than "very well". Data are presented at county, city, zip code and census tract level. Data are presented for zip codes (ZCTAs) fully within the county. Source: U.S. Census Bureau, 2016-2020 American Community Survey 5-year estimates, Table S1601; data accessed on August 23, 2022 from https://api.census.gov. The 2020 Decennial geographies are used for data summarization.

METADATA:
notes (String): Lists table title, notes, sources
geolevel (String): Level of geography
GEOID (Numeric): Geography ID
NAME (String): Name of geography
pop_5plus (Numeric): Population ages 5 years and older
speak_Eng_lt_very_well (Numeric): Number of people ages 5 and older who speak English less than "very well"
pct_speak_Eng_lt_very_well (Numeric): Percent of people ages 5 and older who speak English less than "very well"
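As a sketch of how the underlying table can be retrieved from the Census API mentioned above, the snippet below composes an ACS 5-year subject-table request. The endpoint pattern follows the standard api.census.gov layout; the specific variable code (S1601_C01_001E) and the example state FIPS code are assumptions chosen to illustrate the query shape.

```python
# Hedged sketch: building a request URL for ACS subject table S1601
# (ability to speak English). Variable code and state FIPS are assumptions.
from urllib.parse import urlencode

base = "https://api.census.gov/data/2020/acs/acs5/subject"
params = {
    "get": "NAME,S1601_C01_001E",  # geography name + population 5+ estimate
    "for": "county:*",             # all counties
    "in": "state:53",              # example state FIPS code (53 = Washington)
}
url = f"{base}?{urlencode(params)}"
print(url)
```

Fetching this URL (e.g. with `urllib.request` or `requests`) returns a JSON array of rows, with the first row serving as the header.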
The Delaware Valley Regional Planning Commission (DVRPC) is committed to upholding the principles and intentions of the 1964 Civil Rights Act and related nondiscrimination statutes in all of the Commission’s work, including publications, products, communications, public input, and decision-making processes. Language barriers may prevent people who are Limited English Proficient (also known as LEP persons) from obtaining services or information, or from participating in public planning processes. To better identify LEP populations and thoroughly evaluate the Commission’s efforts to provide meaningful access, DVRPC has produced this Limited English Proficiency Plan. This is the data that was used to make the maps for the upcoming plan. Public Use Microdata Areas (PUMAs) are geographies of at least 100,000 people that are nested within states or equivalent entities. States are able to delineate PUMAs within their borders, or use PUMA criteria provided by the Census Bureau. Data were gathered from the 2019-2023 American Community Survey 5-Year Estimates (ACS 2019-2023), Table B16001: Language Spoken at Home by Ability to Speak English for the Population 5 Years and Over. ACS data are derived from a survey and are subject to sampling variability.
Source of PUMA boundaries: US Census Bureau TIGER/Line files. Please refer to U:_OngoingProjects\LEP\ACS_5YR_B16001_PUMAs_metadata.xlsx for the full attribute lookup and fields used in making the DVRPC LEP Map Series. Please contact Chris Pollard (cpollard@dvrpc.org) should you have any questions about this dataset.
Speech recognition has improved dramatically over the past few years due to advances in machine learning and the availability of speech data. It now powers a multitude of applications, from home virtual assistants to call centers, and is expected to be integrated into many more systems, some of which might be critical for inclusivity.
Machine learning solutions are however constrained by the quality of the data they are trained on. If our data does not represent our target population well, we can only aspire for our solution to work well on the sub-population that our data represents. In other words, solutions from non-representative data are inevitably biased towards a sub-population. In the context of speech recognition, machine learning solutions trained on non-representative datasets will not perform well on any sub-population that is not represented well, which can have a detrimental impact on inclusivity.
The MLEnd Spoken Numerals dataset is a collection of more than 32k audio recordings produced by 154 speakers. Each audio recording corresponds to one English numeral (from "zero" to "billion") that is read using different intonations ("neutral", "bored", "excited" and "question"). Our participants have a diverse background: 31 nationalities and 42 unique languages are represented in the MLEnd Spoken Numerals dataset. This dataset comes with additional demographic information about our participants.
The MLEnd datasets have been created by students at the School of Electronic Engineering and Computer Science, Queen Mary University of London. Other datasets include the MLEnd Hums and Whistles dataset, also available on Kaggle. Do not hesitate to reach out if you want to know more about how we did it.
Enjoy!