66 datasets found
  1. Speech Accent Archive

    • kaggle.com
    • marketplace.sshopencloud.eu
    zip
    Updated Nov 6, 2017
    Cite
    Rachael Tatman (2017). Speech Accent Archive [Dataset]. https://www.kaggle.com/rtatman/speech-accent-archive
    Available download formats
    zip (907049873 bytes)
    Dataset updated
    Nov 6, 2017
    Authors
    Rachael Tatman
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Context:

    Everyone who speaks a language, speaks it with an accent. A particular accent essentially reflects a person's linguistic background. When people listen to someone speak with a different accent from their own, they notice the difference, and they may even make certain biased social judgments about the speaker.

    The speech accent archive is established to uniformly exhibit a large set of speech accents from a variety of language backgrounds. Native and non-native speakers of English all read the same English paragraph and are carefully recorded. The archive is constructed as a teaching tool and as a research tool. It is meant to be used by linguists as well as other people who simply wish to listen to and compare the accents of different English speakers.

    This dataset allows you to compare the demographic and linguistic backgrounds of the speakers in order to determine which variables are key predictors of each accent. The speech accent archive demonstrates that accents are systematic rather than merely mistaken speech.

    All of the linguistic analyses of the accents are available for public scrutiny. We welcome comments on the accuracy of our transcriptions and analyses.

    Content:

    This dataset contains 2140 speech samples, each from a different talker reading the same reading passage. Talkers come from 177 countries and have 214 different native languages. Each talker is speaking in English.

    This dataset contains the following files:

    • reading-passage.txt: the text all speakers read
    • speakers_all.csv: demographic information on every speaker
    • recording: a zipped folder containing .mp3 files with speech
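
    If you want to work with the archive programmatically, the sketch below pairs the demographic table with the recordings. It is a minimal example assuming the archive has been unzipped locally and pandas is installed; the exact columns of speakers_all.csv are not documented here, so they are only printed for inspection.

```python
# Minimal sketch: pair speaker demographics with their recordings.
# Assumes the zip has been extracted into the working directory.
from pathlib import Path
import pandas as pd

speakers = pd.read_csv("speakers_all.csv")                          # one row of demographics per talker
recordings = {p.stem: p for p in Path("recording").glob("*.mp3")}   # folder name as listed above; adjust if it differs

print(f"{len(speakers)} speakers, {len(recordings)} recordings")
print(speakers.columns.tolist())                                    # inspect the available demographic fields
```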

    Acknowledgements:

    This dataset was collected by many individuals (full list here) under the supervision of Steven H. Weinberger. The most up-to-date version of the archive is hosted by George Mason University. If you use this dataset in your work, please include the following citation:

    Weinberger, S. (2013). Speech accent archive. George Mason University.

    This dataset is distributed under a CC BY-NC-SA 2.0 license.

    Inspiration:

    The following types of people may find this dataset interesting:

    • ESL teachers who instruct non-native speakers of English
    • Actors who need to learn an accent
    • Engineers who train speech recognition machines
    • Linguists who do research on foreign accent
    • Phoneticians who teach phonetic transcription
    • Speech pathologists
    • Anyone who finds foreign accent to be interesting
  2. English Proficiency by Age - Datasets - CTData.org

    • data.ctdata.org
    Updated Mar 16, 2016
    Cite
    (2016). English Proficiency by Age - Datasets - CTData.org [Dataset]. http://data.ctdata.org/dataset/english-proficiency-by-age
    Dataset updated
    Mar 16, 2016
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    English Proficiency by Age reports demographic details regarding how many people speak English natively, and the proficiency of non-native speakers.

  3. American English General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). American English General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-english-usa
    Available download formats
    wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    United States
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the US English General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of English speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world US English communication.

    Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade English speech models that understand and respond to authentic American accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of US English. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native US English speakers from FutureBeeAI’s contributor community.
    Regions: Representing various states of the United States to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.
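
    As a quick sanity check on delivered audio, the stated format can be verified with Python's standard wave module. This is only a sketch; the file name below is a placeholder, not a file shipped with the dataset.

```python
# Verify a recording matches the stated format: stereo, 16-bit, 16 kHz WAV.
import wave

with wave.open("sample_conversation.wav", "rb") as wav:   # placeholder file name
    assert wav.getnchannels() == 2          # stereo
    assert wav.getsampwidth() == 2          # 16-bit samples (2 bytes each)
    assert wav.getframerate() == 16000      # 16 kHz sample rate
    minutes = wav.getnframes() / wav.getframerate() / 60
    print(f"duration: {minutes:.1f} min")   # conversations run roughly 15 to 60 minutes
```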

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy, achieved through double QA pass, average WER < 5%

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
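
    A minimal parsing sketch is shown below. The field names used here (segments, speaker, start, end, text) are an assumed schema for illustration only; the actual JSON layout is defined by the dataset delivery.

```python
# Read a speaker-segmented, time-coded transcription (assumed schema).
import json

with open("conversation_0001.json") as f:   # placeholder file name
    transcript = json.load(f)

for seg in transcript.get("segments", []):
    print(f"[{seg['start']:.2f}-{seg['end']:.2f}] {seg['speaker']}: {seg['text']}")
```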

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple English speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for US English.
    Voice Assistants: Build smart assistants capable of understanding natural American conversations.

  4. peoples_speech

    • huggingface.co
    + more versions
    Cite
    MLCommons, peoples_speech [Dataset]. https://huggingface.co/datasets/MLCommons/peoples_speech
    Dataset authored and provided by
    MLCommons
    License

    Attribution 2.0 (CC BY 2.0): https://creativecommons.org/licenses/by/2.0/
    License information was derived automatically

    Description

    Dataset Card for People's Speech

      Dataset Summary
    

    The People's Speech Dataset is among the world's largest English speech recognition corpora licensed for academic and commercial use, released under CC-BY-SA and CC-BY 4.0. It includes 30,000+ hours of transcribed English speech from a diverse set of speakers. This open dataset is large enough to train speech-to-text systems and, crucially, is available with a permissive license.

      Supported Tasks… See the full description on the dataset page: https://huggingface.co/datasets/MLCommons/peoples_speech.
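
    A minimal access sketch using the Hugging Face datasets library is shown below. The configuration name "clean" and the field names are assumptions drawn from the dataset card, so check the dataset page before relying on them; streaming is used so the 30,000+ hours are not downloaded up front.

```python
# Stream a few examples from People's Speech without downloading the full corpus.
from datasets import load_dataset

ds = load_dataset("MLCommons/peoples_speech", "clean",   # "clean" subset name is an assumption
                  split="train", streaming=True)

for example in ds.take(3):
    print(example["text"])   # transcription; raw audio is under example["audio"]
```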
    
  5. 535 Hours – German-Accented English Speech Dataset for ASR

    • nexdata.ai
    Updated Dec 19, 2023
    Cite
    Nexdata (2023). 535 Hours – German-Accented English Speech Dataset for ASR [Dataset]. https://www.nexdata.ai/datasets/speechrecog/987
    Dataset updated
    Dec 19, 2023
    Dataset authored and provided by
    Nexdata
    Variables measured
    Format, Country, Speaker, Language, Accuracy Rate, Content category, Recording device, Recording condition, Features of annotation
    Description

    535 Hours – German-Accented English Speech Dataset, collected from scripted monologues covering the generic domain, human-machine interaction, smart home commands, in-car commands, numbers and other domains. Transcribed with text content and other attributes. The dataset was collected from an extensive and geographically diverse pool of speakers (1,162 people in total), enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout the data collection, storage, and usage processes; our datasets are GDPR, CCPA, and PIPL compliant.

  6. Language Spoken at Home Full Dataset

    • geo.wa.gov
    Updated Jan 1, 2026
    Cite
    Shelby.Flanagan@doh.wa.gov_WADOH (2026). Language Spoken at Home Full Dataset [Dataset]. https://geo.wa.gov/items/4185536f95d649dcb8d586cb873e5d81
    Dataset updated
    Jan 1, 2026
    Dataset authored and provided by
    Shelby.Flanagan@doh.wa.gov_WADOH
    Area covered
    Description

    Background
    In the US, people who don’t speak English well often have a lower quality of life than those who do [1]. They may also have limited access to health care, including mental health services, and may not be able to take part in key national health surveys like the Behavioral Risk Factor Surveillance System (BRFSS). Communities where many people have limited English skills tend to live closer to toxic chemicals. Limited English skills can also make it harder for community members to get involved in local decision-making, which can affect environmental policies and lead to health inequalities.

    Data Source
    Washington Office of the Superintendent of Public Instruction (OSPI) | Public Records Center

    Methodology
    The data was collected through a public records request from the OSPI data portal. It shows what languages students speak at home, organized by school district. OSPI collects and reports data by academic year. For example, the 2023 data comes from the 2022-2023 school year (August 1, 2022 to May 31, 2023). OSPI updates this information regularly.

    Caveats
    These figures only include households with children enrolled in public schools from pre-K through 12th grade. The data may change over time as new information becomes available.

    Source
    1. Shariff-Marco, S., Gee, G. C., Breen, N., Willis, G., Reeve, B. B., Grant, D., Ponce, N. A., Krieger, N., Landrine, H., Williams, D. R., Alegria, M., Mays, V. M., Johnson, T. P., & Brown, E. R. (2009). A mixed-methods approach to developing a self-reported racial/ethnic discrimination measure for use in multiethnic health surveys. Ethnicity & disease, 19(4), 447–453.

    Citation
    Washington Tracking Network, Washington State Department of Health. Languages Spoken at Home. Data from the Washington Office of Superintendent of Public Instruction (OSPI). Published January 2026. Web.

  7. Indian English General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Indian English General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-english-india
    Available download formats
    wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Indian English General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of English speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Indian English communication.

    Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade English speech models that understand and respond to authentic Indian accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Indian English. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native Indian English speakers from FutureBeeAI’s contributor community.
    Regions: Representing various states of India to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy, achieved through double QA pass, average WER < 5%

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
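
    A minimal filtering sketch over the speaker metadata is shown below. The file name and column names (gender, age, state, dialect, participant_id) are placeholders, since the delivered metadata export may label these fields differently.

```python
# Filter speakers by demographic attributes (placeholder file and column names).
import pandas as pd

speakers = pd.read_csv("speaker_metadata.csv")

# Example: select female speakers aged 18-30 for a demographic-specific subset.
subset = speakers[(speakers["gender"] == "female") & (speakers["age"].between(18, 30))]
print(subset[["participant_id", "state", "dialect"]].head())
```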

    Usage and Applications

    This dataset is a versatile resource for multiple English speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for Indian English.
    Voice Assistants: Build smart assistants capable of understanding natural Indian conversations.

  8. English Speech Dataset (Latin American Speakers) – 117 Hours Scripted...

    • nexdata.ai
    Updated Oct 13, 2023
    + more versions
    Cite
    Nexdata (2023). English Speech Dataset (Latin American Speakers) – 117 Hours Scripted Monologue by Smartphone [Dataset]. https://www.nexdata.ai/datasets/speechrecog/1021
    Dataset updated
    Oct 13, 2023
    Dataset authored and provided by
    Nexdata
    Variables measured
    Format, Country, Speaker, Language, Accuracy Rate, Content category, Recording device, Recording condition, Features of annotation
    Description

    This dataset contains 117 hours of English speech from Latin American speakers, collected from scripted monologues covering the generic domain, human-machine interaction, smart home command and control, in-car command and control, numbers and other domains. Transcribed with text content and other attributes. The dataset was collected from an extensive and geographically diverse pool of speakers (281 people in total), enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout the data collection, storage, and usage processes; our datasets are GDPR, CCPA, and PIPL compliant.

  9. Census 21 - English proficiency MSOA

    • data.leicester.gov.uk
    csv, excel, geojson +1
    Updated Aug 22, 2023
    + more versions
    Cite
    (2023). Census 21 - English proficiency MSOA [Dataset]. https://data.leicester.gov.uk/explore/dataset/census-21-english-proficiency-msoa/
    Available download formats
    csv, geojson, excel, json
    Dataset updated
    Aug 22, 2023
    License

    Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Description

    The census is undertaken by the Office for National Statistics every 10 years and gives us a picture of all the people and households in England and Wales. The most recent census took place in March 2021. The census asks every household questions about the people who live there and the type of home they live in. In doing so, it helps to build a detailed snapshot of society. Information from the census helps the government and local authorities to plan and fund local services, such as education, doctors' surgeries and roads.

    Key census statistics for Leicester are published on the open data platform to make information accessible to local services, voluntary and community groups, and residents. There is also a dashboard showcasing various datasets from the census, allowing users to view data for all MSOAs and compare this with Leicester's overall statistics. Further information about the census and full datasets can be found on the ONS website: https://www.ons.gov.uk/census/aboutcensus/censusproducts

    Proficiency in English
    This dataset provides Census 2021 estimates that classify usual residents in England and Wales by their proficiency in English. The estimates are as at Census Day, 21 March 2021. Definition: how well people whose main language is not English (English or Welsh in Wales) speak English. This dataset provides details for the MSOAs of Leicester city.

  10. English Speech Dataset (French Speakers) – 520 Hours Scripted Monologue by...

    • nexdata.ai
    • m.nexdata.ai
    Updated Oct 7, 2023
    Cite
    Nexdata (2023). English Speech Dataset (French Speakers) – 520 Hours Scripted Monologue by Smartphone [Dataset]. https://www.nexdata.ai/datasets/speechrecog/989
    Dataset updated
    Oct 7, 2023
    Dataset authored and provided by
    Nexdata
    Area covered
    French
    Variables measured
    Format, Country, Speaker, Language, Accuracy Rate, Content category, Recording device, Recording condition, Features of annotation
    Description

    This dataset contains 520 hours of English speech from French speakers, collected from scripted monologues covering the generic domain, human-machine interaction, smart home commands, in-car commands, numbers and other domains. Transcribed with text content and other attributes. The dataset was collected from an extensive and geographically diverse pool of speakers (1,089 people in total), enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout the data collection, storage, and usage processes; our datasets are GDPR, CCPA, and PIPL compliant.

  11. English Speech Dataset (Spanish Speakers) – 388 Hours Scripted Monologue by...

    • nexdata.ai
    Updated Oct 31, 2023
    Cite
    Nexdata (2023). English Speech Dataset (Spanish Speakers) – 388 Hours Scripted Monologue by Smartphone [Dataset]. https://www.nexdata.ai/datasets/speechrecog/990
    Dataset updated
    Oct 31, 2023
    Dataset authored and provided by
    Nexdata
    Variables measured
    Format, Country, Speaker, Language, Accuracy Rate, Content category, Recording device, Recording condition, Features of annotation
    Description

    This dataset contains 388 hours of English speech from Spanish speakers, collected from scripted monologues covering the generic domain, human-machine interaction, smart home commands, in-car commands, numbers and other domains. Transcribed with text content and other attributes. The dataset was collected from an extensive and geographically diverse pool of speakers (891 people in total), enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout the data collection, storage, and usage processes; our datasets are GDPR, CCPA, and PIPL compliant.

  12. Percentage main language is not English: Cannot speak English - Birmingham...

    • cityobservatory.birmingham.gov.uk
    csv, excel, geojson +1
    Updated Sep 6, 2021
    + more versions
    Cite
    (2021). Percentage main language is not English: Cannot speak English - Birmingham Wards [Dataset]. https://cityobservatory.birmingham.gov.uk/explore/dataset/percentage-cannot-speak-english-birmingham-wards/
    Available download formats
    excel, csv, json, geojson
    Dataset updated
    Sep 6, 2021
    License

    Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Area covered
    Birmingham
    Description

    This provides estimates of the percentage of usual residents aged 3 and over in England and Wales by their proficiency in English. The proficiency in English classification corresponds to the tick box response options on the census questionnaire. Estimates are used to help central government, local authorities and the NHS allocate resources and provide services for non-English speakers. It also helps public service providers effectively target the delivery of their services, for example translation and interpretation services and material in alternative languages.

    Statistical Disclosure Control: In order to protect against disclosure of personal information from the Census, there has been swapping of records in the Census database between different geographic areas, and so some counts will be affected. In the main, the greatest effects will be at the lowest geographies, since the record swapping is targeted towards those households with unusual characteristics in small areas.

    Data is powered by LG Inform Plus and automatically checked for new data on the 3rd of each month.

  13. English Speech Dataset (Russian Speakers) – 230 Hours Scripted Monologue by...

    • nexdata.ai
    Updated Oct 31, 2023
    Cite
    Nexdata (2023). English Speech Dataset (Russian Speakers) – 230 Hours Scripted Monologue by Smartphone [Dataset]. https://www.nexdata.ai/datasets/speechrecog/1042
    Dataset updated
    Oct 31, 2023
    Dataset authored and provided by
    Nexdata
    Variables measured
    Format, Country, Speaker, Language, Accuracy Rate, Content category, Recording device, Recording condition
    Description

    This dataset contains 230 hours of English speech from Russian speakers, collected from scripted monologues covering the generic domain, human-machine interaction, smart home command and control, in-car command and control, numbers and other domains. Transcribed with text content and other attributes. The dataset was collected from an extensive and geographically diverse pool of speakers (498 people in total), enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout the data collection, storage, and usage processes; our datasets are GDPR, CCPA, and PIPL compliant.

  14. English Speech Dataset (Singaporean Speakers) - 201 Hours Scripted Monologue...

    • nexdata.ai
    Updated Nov 21, 2023
    Cite
    Nexdata (2023). English Speech Dataset (Singaporean Speakers) - 201 Hours Scripted Monologue by Smartphone [Dataset]. https://www.nexdata.ai/datasets/speechrecog/1045
    Dataset updated
    Nov 21, 2023
    Dataset authored and provided by
    Nexdata
    Variables measured
    Format, Country, Speaker, Language, Accuracy Rate, Content category, Recording device, Recording condition
    Description

    This dataset contains 201 hours of English speech from Singaporean speakers, collected from scripted monologues covering the generic domain, human-machine interaction, smart home command and control, in-car command and control, numbers and other domains. Transcribed with text content and other attributes. The dataset was collected from an extensive and geographically diverse pool of speakers (452 people in total), enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout the data collection, storage, and usage processes; our datasets are GDPR, CCPA, and PIPL compliant.

  15. English Speech Dataset (Canadian Speakers) – 207 Hours Scripted Monologue by...

    • nexdata.ai
    Updated Nov 21, 2023
    Cite
    Nexdata (2023). English Speech Dataset (Canadian Speakers) – 207 Hours Scripted Monologue by Smartphone [Dataset]. https://www.nexdata.ai/datasets/speechrecog/1047
    Dataset updated
    Nov 21, 2023
    Dataset authored and provided by
    Nexdata
    Area covered
    Canada
    Variables measured
    Format, Country, Speaker, Language, Accuracy Rate, Content category, Recording device, Recording condition
    Description

    This dataset contains 207 hours of English speech from Canadian speakers, collected from scripted monologues covering the generic domain, human-machine interaction, smart home command and control, in-car command and control, numbers and other domains. Transcribed with text content and other attributes. The dataset was collected from an extensive and geographically diverse pool of speakers (466 people in total), enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout the data collection, storage, and usage processes; our datasets are GDPR, CCPA, and PIPL compliant.

  16. english_dialects

    • huggingface.co
    Cite
    Yoach Lacombe, english_dialects [Dataset]. https://huggingface.co/datasets/ylacombe/english_dialects
    Available download formats
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    Yoach Lacombe
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for "english_dialects"

      Dataset Summary
    

    This dataset consists of 31 hours of transcribed high-quality audio of English sentences recorded by 120 volunteers speaking with different accents of the British Isles. The dataset is intended for linguistic analysis as well as use for speech technologies. The speakers self-identified as native speakers of Southern England, Midlands, Northern England, Welsh, Scottish and Irish varieties of English. The recording scripts… See the full description on the dataset page: https://huggingface.co/datasets/ylacombe/english_dialects.
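
    A minimal access sketch with the Hugging Face datasets library follows. The configuration name "irish_male" and the field names are assumptions about how the accent subsets are organised; see the dataset page for the exact identifiers.

```python
# Load one accent subset of english_dialects (configuration name is an assumption).
from datasets import load_dataset

ds = load_dataset("ylacombe/english_dialects", "irish_male", split="train")

sample = ds[0]
print(sample["text"])                      # the recorded sentence (field name assumed)
print(sample["audio"]["sampling_rate"])    # decoded audio array comes with its sampling rate
```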

  17. Population of the Limited English Proficient (LEP) Speakers by Community...

    • catalog.data.gov
    • data.cityofnewyork.us
    • +1more
    Updated Jan 19, 2024
    + more versions
    Cite
    data.cityofnewyork.us (2024). Population of the Limited English Proficient (LEP) Speakers by Community District [Dataset]. https://catalog.data.gov/dataset/population-of-the-limited-english-proficient-lep-speakers-by-community-district
    Dataset updated
    Jan 19, 2024
    Dataset provided by
    data.cityofnewyork.us
    Description

    Many residents of New York City speak more than one language; a number of them speak and understand non-English languages more fluently than English. This dataset, derived from the Census Bureau's American Community Survey (ACS), includes information on over 1.7 million limited English proficient (LEP) residents and a subset of that population called limited English proficient citizens of voting age (CVALEP) at the Community District level. There are 59 community districts throughout NYC, with each district being represented by a Community Board.

  18. People Speaking English Less Than "Very Well" GIS

    • hub.arcgis.com
    • data-sccphd.opendata.arcgis.com
    Updated Aug 24, 2022
    Cite
    Santa Clara County Public Health (2022). People Speaking English Less Than "Very Well" GIS [Dataset]. https://hub.arcgis.com/maps/sccphd::people-speaking-english-less-than-very-well-gis
    Dataset updated
    Aug 24, 2022
    Dataset authored and provided by
    Santa Clara County Public Health
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Table contains count and percentage of county residents ages 5 years and older who speak English less than "very well". Data are presented at county, city, zip code and census tract level. Data are presented for zip codes (ZCTAs) fully within the county. Source: U.S. Census Bureau, 2016-2020 American Community Survey 5-year estimates, Table S1601; data accessed on August 23, 2022 from https://api.census.gov. The 2020 Decennial geographies are used for data summarization.

    METADATA:
    notes (String): Lists table title, notes, sources
    geolevel (String): Level of geography
    GEOID (Numeric): Geography ID
    NAME (String): Name of geography
    pop_5plus (Numeric): Population ages 5 years and older
    speak_Eng_lt_very_well (Numeric): Number of people ages 5 and older who speak English less than "very well"
    pct_speak_Eng_lt_very_well (Numeric): Percent of people ages 5 and older who speak English less than "very well"
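
    A minimal analysis sketch follows, assuming the table has been exported to CSV with the column names listed above; the geolevel label used for ZIP codes ("ZCTA") is an assumption, so inspect the actual values before filtering.

```python
# Rank ZIP codes by the share of residents who speak English less than "very well".
import pandas as pd

df = pd.read_csv("speak_english_less_than_very_well.csv")   # placeholder export file name

zctas = df[df["geolevel"] == "ZCTA"]                         # geolevel value is an assumption
top = zctas.sort_values("pct_speak_Eng_lt_very_well", ascending=False)
print(top[["NAME", "pop_5plus", "pct_speak_Eng_lt_very_well"]].head(10))
```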

  19. 2023 Limited English Proficiency (LEP) for the DVRPC Region Public Use...

    • catalog.dvrpc.org
    • njogis-newjersey.opendata.arcgis.com
    • +1more
    api, geojson, html +1
    Updated Nov 4, 2025
    + more versions
    Cite
    DVRPC (2025). 2023 Limited English Proficiency (LEP) for the DVRPC Region Public Use Microdata Areas [Dataset]. https://catalog.dvrpc.org/dataset/2023-limited-english-proficiency-lep-for-the-dvrpc-region-public-use-microdata-areas
    Available download formats
    api, xml, html, geojson
    Dataset updated
    Nov 4, 2025
    Dataset authored and provided by
    DVRPC
    Description

    The Delaware Valley Regional Planning Commission (DVRPC) is committed to upholding the principles and intentions of the 1964 Civil Rights Act and related nondiscrimination statutes in all of the Commission’s work, including publications, products, communications, public input, and decision-making processes. Language barriers may prevent people who are Limited English Proficient (also known as LEP persons) from obtaining services or information, or from participating in public planning processes. To better identify LEP populations and thoroughly evaluate the Commission’s efforts to provide meaningful access, DVRPC has produced this Limited English Proficiency Plan. This is the data that was used to make the maps for the upcoming plan. Public Use Microdata Areas (PUMAs) are geographies of at least 100,000 people that are nested within states or equivalent entities. States are able to delineate PUMAs within their borders, or use the PUMA criteria provided by the Census Bureau. Data were gathered from the 2019-2023 American Community Survey 5-Year Estimates (ACS 2019-2023), Table B16001: Language Spoken at Home by Ability to Speak English for the Population 5 Years and Over. ACS data are derived from a survey and are subject to sampling variability.

    *Limited English Proficiency (LEP) refers to persons who speak English less than "very well". DVRPC has mapped the Language Groups below for our Plan.

    Spanish

    Russian

    Chinese

    Korean

    Vietnamese

    Source of PUMA boundaries: US Census Bureau TIGER/Line Files. Please refer to U:_OngoingProjects\LEP\ACS_5YR_B16001_PUMAs_metadata.xlsx for the full attribute lookup and the fields used in making the DVRPC LEP Map Series. Please contact Chris Pollard (cpollard@dvrpc.org) should you have any questions about this dataset.

  20. MLEnd Spoken Numerals

    • kaggle.com
    Updated May 17, 2022
    Cite
    Jesus Requena (2022). MLEnd Spoken Numerals [Dataset]. http://doi.org/10.34740/kaggle/dsv/3650468
    Available download formats
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 17, 2022
    Dataset provided by
    Kaggle
    Authors
    Jesus Requena
    Description

    Speech recognition has improved dramatically over the past years due to advances in machine learning and the availability of speech data. Speech recognition is nowadays powering a multitude of applications, from home virtual assistants to call centers, and it is expected to be integrated in many more systems, some of which might be critical for inclusivity.

    Machine learning solutions are however constrained by the quality of the data they are trained on. If our data does not represent our target population well, we can only aspire for our solution to work well on the sub-population that our data represents. In other words, solutions from non-representative data are inevitably biased towards a sub-population. In the context of speech recognition, machine learning solutions trained on non-representative datasets will not perform well on any sub-population that is not represented well, which can have a detrimental impact on inclusivity.

    The MLEnd Spoken Numerals dataset is a collection of more than 32k audio recordings produced by 154 speakers. Each audio recording corresponds to one English numeral (from "zero" to "billion") that is read using different intonations ("neutral", "bored", "excited" and "question"). Our participants have a diverse background: 31 nationalities and 42 unique languages are represented in the MLEnd Spoken Numerals dataset. This dataset comes with additional demographic information about our participants.

    The MLEnd datasets have been created by students at the School of Electronic Engineering and Computer Science, Queen Mary University of London. Other datasets include the MLEnd Hums and Whistles dataset, also available on Kaggle. Do not hesitate to reach out if you want to know more about how we did it.

    Enjoy!
