69 datasets found
  1. English Proficiency by Age - Datasets - CTData.org

    • data.ctdata.org
    Updated Mar 17, 2016
    Cite
    (2016). English Proficiency by Age - Datasets - CTData.org [Dataset]. http://data.ctdata.org/dataset/english-proficiency-by-age
    Explore at:
    Dataset updated
    Mar 17, 2016
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    English Proficiency by Age reports demographic details regarding how many people speak English natively, and the proficiency of non-native speakers.

  2. Population of the Limited English Proficient (LEP) Speakers by Community District

    • datasets.ai
    • data.cityofnewyork.us
    • +1more
    Updated Aug 6, 2024
    + more versions
    Cite
    City of New York (2024). Population of the Limited English Proficient (LEP) Speakers by Community District [Dataset]. https://datasets.ai/datasets/population-of-the-limited-english-proficient-lep-speakers-by-community-district
    Explore at:
    Available download formats
    Dataset updated
    Aug 6, 2024
    Dataset authored and provided by
    City of New York
    Description

    Many residents of New York City speak more than one language; a number of them speak and understand non-English languages more fluently than English. This dataset, derived from the Census Bureau's American Community Survey (ACS), includes information on over 1.7 million limited English proficient (LEP) residents and a subset of that population called limited English proficient citizens of voting age (CVALEP) at the Community District level. There are 59 community districts throughout NYC, with each district being represented by a Community Board.

  3. British English General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). British English General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-english-uk
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    United Kingdom
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the UK English General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of English speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world UK English communication.

    Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade English speech models that understand and respond to authentic British accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of UK English. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native UK English speakers from FutureBeeAI’s contributor community.
    Regions: Representing various regions of the United Kingdom to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.
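
    The recording specs above (stereo WAV, 16-bit, 16 kHz) can be verified programmatically before training. A minimal sketch using the soundfile library follows; the filename is a placeholder, since the archive layout is not described in this listing.

        # Sketch: check that a conversation file matches the stated audio specs.
        # "conversation_001.wav" is a placeholder filename.
        import soundfile as sf

        info = sf.info("conversation_001.wav")
        print(info.samplerate)  # expected 16000 Hz
        print(info.channels)    # expected 2 (stereo)
        print(info.subtype)     # expected PCM_16 (16-bit depth)

        audio, sr = sf.read("conversation_001.wav")  # NumPy array of shape (n_samples, 2)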

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy achieved through a double QA pass; average WER < 5%

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
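
    Because the transcripts are speaker-segmented and time-coded JSON, they can be iterated directly when building ASR training pairs. The exact FutureBeeAI schema is not shown in this listing, so the field names below ("segments", "speaker", "start", "end", "text") are assumptions; a minimal sketch:

        # Sketch: walk a speaker-segmented, time-coded transcript.
        # Field names are assumed, not taken from the official schema.
        import json

        with open("conversation_001.json", encoding="utf-8") as f:
            transcript = json.load(f)

        for seg in transcript.get("segments", []):
            print(f'{seg["speaker"]} [{seg["start"]:.2f}-{seg["end"]:.2f}s]: {seg["text"]}')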

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple English speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for UK English.
    Voice Assistants: Build smart assistants capable of understanding natural British conversations.

  4. American English General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). American English General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-english-usa
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    United States
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the US English General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of English speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world US English communication.

    Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade English speech models that understand and respond to authentic American accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of US English. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native US English speakers from FutureBeeAI’s contributor community.
    Regions: Representing various states of the United States to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy achieved through a double QA pass; average WER < 5%

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
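
    For example, the speaker metadata can drive demographic filtering when assembling a training subset. A minimal sketch, assuming the metadata is exported to a flat CSV with columns speaker_id, age, gender and accent (these names are assumptions, not the delivered format):

        # Sketch: select recordings from speakers matching a demographic slice.
        # "speaker_metadata.csv" and its column names are placeholders.
        import pandas as pd

        speakers = pd.read_csv("speaker_metadata.csv")
        subset = speakers[(speakers["gender"] == "female") & (speakers["age"] < 30)]
        print(subset["speaker_id"].tolist())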

    Usage and Applications

    This dataset is a versatile resource for multiple English speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for US English.
    Voice Assistants: Build smart assistants capable of understanding natural American conversations.

  5. english_dialects

    • huggingface.co
    Cite
    english_dialects [Dataset]. https://huggingface.co/datasets/ylacombe/english_dialects
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Authors
    Yoach Lacombe
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for "english_dialects"

      Dataset Summary
    

    This dataset consists of 31 hours of transcribed high-quality audio of English sentences recorded by 120 volunteers speaking with different accents of the British Isles. The dataset is intended for linguistic analysis as well as use for speech technologies. The speakers self-identified as native speakers of Southern England, Midlands, Northern England, Welsh, Scottish and Irish varieties of English. The recording scripts… See the full description on the dataset page: https://huggingface.co/datasets/ylacombe/english_dialects.
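
    The dataset can be pulled with the Hugging Face datasets library. A minimal sketch is below; the configuration name "scottish_male" and the column names are assumptions, so check the dataset card for the exact configurations and features.

        # Sketch: load one accent configuration of ylacombe/english_dialects.
        # Configuration and column names are assumed, not verified here.
        from datasets import load_dataset

        ds = load_dataset("ylacombe/english_dialects", "scottish_male", split="train")
        example = ds[0]
        print(example["text"])                    # sentence transcription (field name assumed)
        print(example["audio"]["sampling_rate"])  # decoded audio metadata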

  6. The most spoken languages worldwide 2025

    • statista.com
    Updated Apr 14, 2025
    Cite
    Statista (2025). The most spoken languages worldwide 2025 [Dataset]. https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/
    Explore at:
    Dataset updated
    Apr 14, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2025
    Area covered
    World
    Description

    In 2025, there were around 1.53 billion people worldwide who spoke English either natively or as a second language, slightly more than the 1.18 billion Mandarin Chinese speakers at the time of the survey. Hindi and Spanish accounted for the third and fourth most widespread languages that year.

    Languages in the United States

    The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken in the United States vary as a result of the multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which over 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese), 1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year.

    Different languages at home

    The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage of its population speaking a language other than English is California; about 45 percent of its population spoke a language other than English at home in 2023.

  7. Canadian English General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Canadian English General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-english-canada
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    Canada
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Canadian English General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of English speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Canadian English communication.

    Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade English speech models that understand and respond to authentic Canadian accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Canadian English. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native Canadian English speakers from FutureBeeAI’s contributor community.
    Regions: Representing various provinces of Canada to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy achieved through a double QA pass; average WER < 5%

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple English speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for Canadian English.
    Voice Assistants: Build smart assistants capable of understanding natural Canadian conversations.

  8. peoples_speech

    • huggingface.co
    + more versions
    Cite
    MLCommons, peoples_speech [Dataset]. https://huggingface.co/datasets/MLCommons/peoples_speech
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset authored and provided by
    MLCommons
    License

    Attribution 2.0 (CC BY 2.0): https://creativecommons.org/licenses/by/2.0/
    License information was derived automatically

    Description

    Dataset Card for People's Speech

      Dataset Summary
    

    The People's Speech Dataset is among the world's largest English speech recognition corpora licensed for academic and commercial usage, under CC-BY-SA and CC-BY 4.0. It includes 30,000+ hours of transcribed English speech from a diverse set of speakers. This open dataset is large enough to train speech-to-text systems and, crucially, is available with a permissive license.

      Supported Tasks… See the full description on the dataset page: https://huggingface.co/datasets/MLCommons/peoples_speech.
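
    Given the 30,000+ hour size, streaming mode avoids downloading the full corpus just to inspect it. A minimal sketch, assuming a "clean" configuration exists (configuration and column names should be confirmed on the dataset card):

        # Sketch: stream a handful of examples from MLCommons/peoples_speech.
        # The "clean" config and the "text" column are assumptions.
        from itertools import islice

        from datasets import load_dataset

        ds = load_dataset("MLCommons/peoples_speech", "clean", split="train", streaming=True)
        for example in islice(ds, 3):
            print(example["text"])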
    
  9. Indian English General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Indian English General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-english-india
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Indian English General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of English speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Indian English communication.

    Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade English speech models that understand and respond to authentic Indian accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Indian English. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native Indian English speakers from FutureBeeAI’s contributor community.
    Regions: Representing various states of India to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy achieved through a double QA pass; average WER < 5%

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple English speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for Indian English.
    Voice Assistants: Build smart assistants capable of understanding natural Indian conversations.

  10. Percentage main language is not English: Can speak English well - WMCA

    • cityobservatory.birmingham.gov.uk
    csv, excel, geojson +1
    Updated Sep 6, 2021
    + more versions
    Cite
    (2021). Percentage main language is not English: Can speak English well - WMCA [Dataset]. https://cityobservatory.birmingham.gov.uk/explore/dataset/percentage-can-speak-english-well-wmca/
    Explore at:
    Available download formats: json, csv, excel, geojson
    Dataset updated
    Sep 6, 2021
    License

    Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Description

    This dataset provides estimates of the percentage of usual residents aged 3 and over in England and Wales by their proficiency in English. The proficiency in English classification corresponds to the tick-box response options on the census questionnaire. Estimates are used to help central government, local authorities and the NHS allocate resources and provide services for non-English speakers. They also help public service providers target the delivery of their services effectively, for example translation and interpretation services and material in alternative languages.

    Statistical Disclosure Control: in order to protect against disclosure of personal information from the Census, records in the Census database have been swapped between different geographic areas, so some counts will be affected. In the main, the greatest effects will be at the lowest geographies, since the record swapping is targeted towards households with unusual characteristics in small areas.

    Data is powered by LG Inform Plus and automatically checked for new data on the 3rd of each month.

  11. Non-Native Children English Speech (NNCES) Corpus

    • kaggle.com
    Updated Oct 29, 2022
    Cite
    Kodali Radha (2022). Non-Native Children English Speech (NNCES) Corpus [Dataset]. http://doi.org/10.34740/kaggle/dsv/4416485
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Oct 29, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Kodali Radha
    License

    CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Non-native children’s English speech (NNCES) corpus: There were a total of 50 children, 25 females and 25 males, ranging in age from 8 to 12. All of the children are native speakers of Telugu, an Indian regional language, who are learning English as a second language. All of the audio clips were acquired as .wav files using the open-source SurveyLex platform, which supports dual-channel recording at 44.1 kHz with 16 bits per sample. Each questionnaire was administered 10 times per child to assess variation in words and sentences, and around 20 hours of data were recorded in total. The corpus incorporates both read speech (5,000 utterances) and spontaneous speech (5,000 utterances), with word-level transcription.
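
    Since the clips are dual-channel 44.1 kHz recordings, a typical ASR pipeline would downmix and resample them first. A minimal sketch with librosa; the filename is a placeholder, as the corpus layout is not described here:

        # Sketch: downmix to mono and resample 44.1 kHz audio to 16 kHz.
        # "child_01_read_001.wav" is a placeholder filename.
        import librosa

        audio, sr = librosa.load("child_01_read_001.wav", sr=16000, mono=True)
        print(sr, audio.shape)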

  12. Dataset of British English speech recordings for psychoacoustics and speech processing research

    • salford.figshare.com
    application/x-gzip
    Updated Feb 3, 2025
    Cite
    Trevor John Cox; Simone Graetzer; Michael A Akeroyd; Jonathan Barker; John Culling; Graham Naylor; Eszter Porter; Rhoddy Viveros Muñoz (2025). Dataset of British English speech recordings for psychoacoustics and speech processing research [Dataset]. http://doi.org/10.17866/rd.salford.16918180.v3
    Explore at:
    Available download formats: application/x-gzip
    Dataset updated
    Feb 3, 2025
    Dataset provided by
    University of Salford
    Authors
    Trevor John Cox; Simone Graetzer; Michael A Akeroyd; Jonathan Barker; John Culling; Graham Naylor; Eszter Porter; Rhoddy Viveros Muñoz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United Kingdom
    Description

    The Clarity Speech Corpus is a forty-speaker British English speech dataset. The corpus was created for the purpose of running listening tests to gauge speech intelligibility and quality in the Clarity Project, which has the goal of advancing speech signal processing by hearing aids through a series of challenges. The dataset is suitable for machine learning and other uses in speech and hearing technology, acoustics and psychoacoustics. The data comprises recordings of approximately 10,000 sentences drawn from the British National Corpus (BNC) with suitable length, words and grammatical construction for speech intelligibility testing. The collection process involved the selection of a subset of BNC sentences, the recording of these sentences by 40 British English speakers, and the processing of the recordings to create individual sentence recordings with associated prompts and metadata.

    clarity_utterances.v1_2.tar.gz contains all the recordings as .wav files, with accompanying metadata such as text prompts in clarity_master.json; further details are given in the readme. Sample_clarity_utterances.zip contains a sample of 10 utterances.

    Please reference the following data paper, which has details on how the corpus was generated: Graetzer, S., Akeroyd, M.A., Barker, J., Cox, T.J., Culling, J.F., Naylor, G., Porter, E. and Muñoz, R.V., 2022. Dataset of British English speech recordings for psychoacoustics and speech processing research: the Clarity Speech Corpus. Data in Brief, p.107951.
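
    A minimal sketch for unpacking the archive and pairing each recording with its prompt from clarity_master.json is below. The JSON structure (a list of records with "wavfile" and "prompt" fields) and the extracted file locations are assumptions based on the description, not the documented schema:

        # Sketch: extract the corpus and read the master metadata file.
        # File locations and record field names are assumed; consult the bundled readme.
        import json
        import tarfile

        with tarfile.open("clarity_utterances.v1_2.tar.gz", "r:gz") as tar:
            tar.extractall("clarity_corpus")

        with open("clarity_corpus/clarity_master.json", encoding="utf-8") as f:
            master = json.load(f)

        for record in master[:5]:
            print(record["wavfile"], "->", record["prompt"])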

  13. Speech Accent Archive

    • kaggle.com
    Updated Nov 6, 2017
    + more versions
    Cite
    Rachael Tatman (2017). Speech Accent Archive [Dataset]. https://www.kaggle.com/datasets/rtatman/speech-accent-archive/versions/2
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Nov 6, 2017
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rachael Tatman
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Context:

    Everyone who speaks a language, speaks it with an accent. A particular accent essentially reflects a person's linguistic background. When people listen to someone speak with a different accent from their own, they notice the difference, and they may even make certain biased social judgments about the speaker.

    The speech accent archive is established to uniformly exhibit a large set of speech accents from a variety of language backgrounds. Native and non-native speakers of English all read the same English paragraph and are carefully recorded. The archive is constructed as a teaching tool and as a research tool. It is meant to be used by linguists as well as other people who simply wish to listen to and compare the accents of different English speakers.

    This dataset allows you to compare the demographic and linguistic backgrounds of the speakers in order to determine which variables are key predictors of each accent. The speech accent archive demonstrates that accents are systematic rather than merely mistaken speech.

    All of the linguistic analyses of the accents are available for public scrutiny. We welcome comments on the accuracy of our transcriptions and analyses.

    Content:

    This dataset contains 2140 speech samples, each from a different talker reading the same reading passage. Talkers come from 177 countries and have 214 different native languages. Each talker is speaking in English.

    This dataset contains the following files:

    • reading-passage.txt: the text all speakers read
    • speakers_all.csv: demographic information on every speaker
    • recording: a zipped folder containing .mp3 files with speech
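
    The speakers_all.csv file listed above makes it straightforward to explore which backgrounds are represented. A minimal sketch with pandas; the column name "native_language" is an assumption, so inspect the CSV headers first:

        # Sketch: count the most common native languages in the archive.
        # The "native_language" column name is assumed.
        import pandas as pd

        speakers = pd.read_csv("speakers_all.csv")
        print(speakers["native_language"].value_counts().head(10))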

    Acknowledgements:

    This dataset was collected by many individuals under the supervision of Steven H. Weinberger. The most up-to-date version of the archive is hosted by George Mason University. If you use this dataset in your work, please include the following citation:

    Weinberger, S. (2013). Speech accent archive. George Mason University.

    This dataset is distributed under a CC BY-NC-SA 2.0 license.

    Inspiration:

    The following types of people may find this dataset interesting:

    • ESL teachers who instruct non-native speakers of English
    • Actors who need to learn an accent
    • Engineers who train speech recognition machines
    • Linguists who do research on foreign accent
    • Phoneticians who teach phonetic transcription
    • Speech pathologists
    • Anyone who finds foreign accent to be interesting
  14. English proficiency in European countries in 2019

    • statista.com
    Updated Jun 23, 2025
    Cite
    Statista (2025). English proficiency in European countries in 2019 [Dataset]. https://www.statista.com/statistics/990547/countries-in-europe-for-english/
    Explore at:
    Dataset updated
    Jun 23, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Mar 2019
    Area covered
    Europe
    Description

    This statistic presents the leading European countries by their level of English proficiency as of March 2019. According to data provided by Klazz, Sweden had the highest percentage of people who were proficient in English at ** percent of the population.

  15. CALLHOME American English Speech Dataset

    • paperswithcode.com
    Cite
    CALLHOME American English Speech Dataset [Dataset]. https://paperswithcode.com/dataset/callhome-american-english-speech
    Explore at:
    Area covered
    United States
    Description

    The CALLHOME English Corpus is a collection of unscripted telephone conversations between native speakers of English. Here are the key details:

    • Participants: 120 individuals.
    • Type of Study: Naturalistic.
    • Location: USA.
    • Media Type: Audio.
    • DOI: doi:10.21415/T5KP54.
    • Contents: The corpus contains 120 telephone conversations, each lasting up to 30 minutes.
    • Origins: All calls originated in North America, with 90 calls placed to various locations overseas and 30 calls within North America.
    • Transcripts: The transcripts cover contiguous 5 or 10-minute segments from recorded conversations.
    • Speaker Awareness: All speakers were aware that they were being recorded.
    • Topics: Participants had no guidelines on what to talk about; most called family members or close friends overseas.
    • Purpose: Collected primarily to support the project on Large Vocabulary Conversational Speech Recognition (LVCSR), sponsored by the U.S. Department of Defense.

  16. Percentage main language is not English: Cannot speak English - Birmingham Constituency

    • cityobservatory.birmingham.gov.uk
    csv, excel, geojson +1
    Updated Sep 6, 2021
    + more versions
    Cite
    (2021). Percentage main language is not English: Cannot speak English - Birmingham Constituency [Dataset]. https://cityobservatory.birmingham.gov.uk/explore/dataset/percentage-cannot-speak-english-birmingham-constituency/
    Explore at:
    Available download formats: excel, json, csv, geojson
    Dataset updated
    Sep 6, 2021
    License

    Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Area covered
    Birmingham
    Description

    This dataset provides estimates of the percentage of usual residents aged 3 and over in England and Wales by their proficiency in English. The proficiency in English classification corresponds to the tick-box response options on the census questionnaire. Estimates are used to help central government, local authorities and the NHS allocate resources and provide services for non-English speakers. They also help public service providers target the delivery of their services effectively, for example translation and interpretation services and material in alternative languages.

    Statistical Disclosure Control: in order to protect against disclosure of personal information from the Census, records in the Census database have been swapped between different geographic areas, so some counts will be affected. In the main, the greatest effects will be at the lowest geographies, since the record swapping is targeted towards households with unusual characteristics in small areas.

    Data is powered by LG Inform Plus and automatically checked for new data on the 3rd of each month.

  17. The People’s Speech Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Dec 4, 2023
    Cite
    Daniel Galvez; Greg Diamos; Juan Ciro; Juan Felipe Cerón; Keith Achorn; Anjali Gopi; David Kanter; Maximilian Lam; Mark Mazumder; Vijay Janapa Reddi (2023). The People’s Speech Dataset [Dataset]. https://paperswithcode.com/dataset/the-peoples-speech
    Explore at:
    Dataset updated
    Dec 4, 2023
    Authors
    Daniel Galvez; Greg Diamos; Juan Ciro; Juan Felipe Cerón; Keith Achorn; Anjali Gopi; David Kanter; Maximilian Lam; Mark Mazumder; Vijay Janapa Reddi
    Description

    The People's Speech is a free-to-download 30,000-hour and growing supervised conversational English speech recognition dataset licensed for academic and commercial usage under CC-BY-SA (with a CC-BY subset). The data is collected via searching the Internet for appropriately licensed audio data with existing transcriptions.

  18. Bronx Non-English vs. English Speaking (pointmap)

    • data.wu.ac.at
    csv, json, xml
    Updated Mar 20, 2013
    + more versions
    Cite
    American Fact Finder (2013). Bronx Non-English vs. English Speaking (pointmap) [Dataset]. https://data.wu.ac.at/schema/bronx_lehman_cuny_edu/ZzI3Yy13eWZ5
    Explore at:
    Available download formats: json, xml, csv
    Dataset updated
    Mar 20, 2013
    Dataset provided by
    American Fact Finder
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Area covered
    The Bronx
    Description

    5-year ACS survey on languages spoken in Bronx households, based on census tract and 2010 census data. Margin of Error columns were included because census tract data is estimated from a sample of the population, not an actual 100% count. The two layers for this map were created using the same data, illustrating the percentage of English speakers (blue) and the percentage of residents who speak another language as a first language. These two were layered to show which census tracts have a greater percentage of non-English-speaking residents.

  19. Low English Proficiency Populations Index

    • catalog.data.gov
    • movedc.dc.gov
    • +1more
    Updated Feb 4, 2025
    + more versions
    Cite
    City of Washington, DC (2025). Low English Proficiency Populations Index [Dataset]. https://catalog.data.gov/dataset/low-english-proficiency-populations-index
    Explore at:
    Dataset updated
    Feb 4, 2025
    Dataset provided by
    City of Washington, DC
    Description

    Using Census data, the Low English Proficiency Populations Index shows Census blocks classified into five categories based on the population of persons with low English proficiency as a percentage of the total population. Persons with low English proficiency are persons identified by the Census as speaking English less than “very well.” An index score of five indicates a higher density of persons with low English proficiency.

  20. Census 21 - English proficiency MSOA

    • data.leicester.gov.uk
    csv, excel, geojson +1
    Updated Aug 22, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2023). Census 21 - English proficiency MSOA [Dataset]. https://data.leicester.gov.uk/explore/dataset/census-21-english-proficiency-msoa/
    Explore at:
    Available download formats: csv, geojson, excel, json
    Dataset updated
    Aug 22, 2023
    License

    Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Description

    The census is undertaken by the Office for National Statistics every 10 years and gives us a picture of all the people and households in England and Wales. The most recent census took place in March 2021. The census asks every household questions about the people who live there and the type of home they live in. In doing so, it helps to build a detailed snapshot of society. Information from the census helps the government and local authorities to plan and fund local services, such as education, doctors' surgeries and roads.

    Key census statistics for Leicester are published on the open data platform to make information accessible to local services, voluntary and community groups, and residents. There is also a dashboard showcasing various datasets from the census, allowing users to view data for all MSOAs and compare this with Leicester's overall statistics. Further information about the census and full datasets can be found on the ONS website: https://www.ons.gov.uk/census/aboutcensus/censusproducts

    Proficiency in English

    This dataset provides Census 2021 estimates that classify usual residents in England and Wales by their proficiency in English. The estimates are as at Census Day, 21 March 2021. Definition: how well people whose main language is not English (English or Welsh in Wales) speak English. This dataset provides details for the MSOAs of Leicester city.
