48 datasets found
  1. American English General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). American English General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-english-usa
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    United States
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the US English General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of English speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world US English communication.

    Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade English speech models that understand and respond to authentic American accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of US English. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native US English speakers from FutureBeeAI’s contributor community.
    Regions: Representing various states of the United States to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.
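    Specs like these can be checked programmatically before training. Below is a minimal sketch using Python's standard `wave` module, assuming locally downloaded files; the path passed in is a placeholder, not an actual file from the dataset:

```python
import wave

def check_clip(path, expect_rate=16000, expect_channels=2, expect_sampwidth=2):
    """Verify a clip matches the stated specs: stereo WAV, 16-bit, 16 kHz.

    Returns (matches_specs, duration_in_seconds)."""
    with wave.open(path, "rb") as w:
        ok = (w.getframerate() == expect_rate
              and w.getnchannels() == expect_channels
              and w.getsampwidth() == expect_sampwidth)  # sample width in bytes: 2 == 16-bit
        duration_s = w.getnframes() / w.getframerate()
    return ok, duration_s
```

    Running this over a delivery before training catches clips whose sample rate or channel count deviates from the advertised format.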

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy (average WER < 5%), achieved through a double QA pass

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
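    As an illustration of how such a transcription might be consumed, the sketch below parses a hypothetical JSON payload. The field names (`segments`, `speaker`, `start`, `end`, `text`) are illustrative assumptions, not the dataset's documented schema; check the delivered files for the actual layout:

```python
import json

# Hypothetical speaker-segmented, time-coded transcription payload.
sample = '''
{"segments": [
  {"speaker": "SPK1", "start": 0.00, "end": 2.35, "text": "hey how are you"},
  {"speaker": "SPK2", "start": 2.40, "end": 4.10, "text": "[laughter] good thanks"}
]}
'''

def to_utterances(raw):
    """Flatten segments into (speaker, duration, text) tuples."""
    data = json.loads(raw)
    return [(s["speaker"], s["end"] - s["start"], s["text"])
            for s in data["segments"]]

for speaker, dur, text in to_utterances(sample):
    print(f"{speaker} ({dur:.2f}s): {text}")
```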

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
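    A use-case-specific filter over such speaker metadata might look like the following sketch; the record fields and values are illustrative assumptions, not actual dataset records:

```python
# Hypothetical speaker-metadata records mirroring the fields listed above.
speakers = [
    {"id": "US_001", "age": 34, "gender": "male", "state": "Texas"},
    {"id": "US_002", "age": 62, "gender": "female", "state": "Ohio"},
]

def filter_speakers(records, min_age=18, max_age=70, gender=None):
    """Return IDs of speakers matching an age window and optional gender."""
    out = []
    for r in records:
        if not (min_age <= r["age"] <= max_age):
            continue
        if gender and r["gender"] != gender:
            continue
        out.append(r["id"])
    return out
```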

    Usage and Applications

    This dataset is a versatile resource for multiple English speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for US English.
    Voice Assistants: Build smart assistants capable of understanding natural American conversations.

  2. Language Spoken at Home Full Dataset

    • geo.wa.gov
    Updated Jan 1, 2026
    Cite
    Shelby.Flanagan@doh.wa.gov_WADOH (2026). Language Spoken at Home Full Dataset [Dataset]. https://geo.wa.gov/items/4185536f95d649dcb8d586cb873e5d81
    Explore at:
    Dataset updated
    Jan 1, 2026
    Dataset authored and provided by
    Shelby.Flanagan@doh.wa.gov_WADOH
    Area covered
    Description

    Background

    In the US, people who don’t speak English well often have a lower quality of life than those who do [1]. They may also have limited access to health care, including mental health services, and may not be able to take part in key national health surveys like the Behavioral Risk Factor Surveillance System (BRFSS). Communities where many people have limited English skills also tend to be located closer to toxic chemicals. Limited English skills can also make it harder for community members to get involved in local decision-making, which can affect environmental policies and lead to health inequalities.

    Data Source

    Washington Office of the Superintendent of Public Instruction (OSPI) | Public Records Center

    Methodology

    The data was collected through a public records request from the OSPI data portal. It shows what languages students speak at home, organized by school district. OSPI collects and reports data by academic year. For example, the 2023 data comes from the 2022-2023 school year (August 1, 2022 to May 31, 2023). OSPI updates this information regularly.

    Caveats

    These figures only include households with children enrolled in public schools from pre-K through 12th grade. The data may change over time as new information becomes available.

    Source

    1. Shariff-Marco, S., Gee, G. C., Breen, N., Willis, G., Reeve, B. B., Grant, D., Ponce, N. A., Krieger, N., Landrine, H., Williams, D. R., Alegria, M., Mays, V. M., Johnson, T. P., & Brown, E. R. (2009). A mixed-methods approach to developing a self-reported racial/ethnic discrimination measure for use in multiethnic health surveys. Ethnicity & disease, 19(4), 447–453.

    Citation

    Washington Tracking Network, Washington State Department of Health. Languages Spoken at Home. Data from the Washington Office of Superintendent of Public Instruction (OSPI). Published January 2026. Web.

  3. American English Call Center Data for Telecom AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). American English Call Center Data for Telecom AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/telecom-call-center-conversation-english-usa
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    United States
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This US English Call Center Speech Dataset for the Telecom industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for English-speaking telecom customers. Featuring over 30 hours of real-world, unscripted audio, it delivers authentic customer-agent interactions across key telecom support scenarios to help train robust ASR models.

    Curated by FutureBeeAI, this dataset empowers voice AI engineers, telecom automation teams, and NLP researchers to build high-accuracy, production-ready models for telecom-specific use cases.

    Speech Data

    The dataset contains 30 hours of dual-channel call center recordings between native US English speakers. Captured in realistic customer support settings, these conversations span a wide range of telecom topics, from network complaints to billing issues, offering a strong foundation for training and evaluating telecom voice AI solutions.
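    Dual-channel recordings can be separated into per-speaker mono tracks for channel-wise training. Here is a minimal sketch with Python's standard `wave` module, assuming 16-bit stereo input; which channel carries the agent versus the customer is an assumption to verify against the dataset documentation:

```python
import wave

def split_channels(path, left_out, right_out):
    """Split a 16-bit stereo WAV into two mono WAV files."""
    with wave.open(path, "rb") as w:
        assert w.getnchannels() == 2 and w.getsampwidth() == 2
        rate = w.getframerate()
        raw = w.readframes(w.getnframes())
    left, right = bytearray(), bytearray()
    for i in range(0, len(raw), 4):     # 4 bytes per stereo frame
        left += raw[i:i + 2]            # first 16-bit sample = left channel
        right += raw[i + 2:i + 4]       # second 16-bit sample = right channel
    for out_path, data in ((left_out, left), (right_out, right)):
        with wave.open(out_path, "wb") as out:
            out.setnchannels(1)
            out.setsampwidth(2)
            out.setframerate(rate)
            out.writeframes(bytes(data))
```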

    Participant Diversity:
    Speakers: 60 native US English speakers from our verified contributor pool.
    Regions: Representing multiple states across the United States to ensure coverage of various accents and dialects.
    Participant Profile: Balanced gender mix (60% male, 40% female) with age distribution from 18 to 70 years.
    Recording Details:
    Conversation Nature: Naturally flowing, unscripted interactions between agents and customers.
    Call Duration: Ranges from 5 to 15 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, at 8kHz and 16kHz sample rates.
    Recording Environment: Captured in clean conditions with no echo or background noise.

    Topic Diversity

    This speech corpus includes both inbound and outbound calls with varied conversational outcomes (positive, negative, and neutral), ensuring broad scenario coverage for telecom AI development.

    Inbound Calls:
    Phone Number Porting
    Network Connectivity Issues
    Billing and Payments
    Technical Support
    Service Activation
    International Roaming Enquiry
    Refund Requests and Billing Adjustments
    Emergency Service Access, and others
    Outbound Calls:
    Welcome Calls & Onboarding
    Payment Reminders
    Customer Satisfaction Surveys
    Technical Updates
    Service Usage Reviews
    Network Complaint Status Calls, and more

    This variety helps train telecom-specific models to manage real-world customer interactions and understand context-specific voice patterns.

    Transcription

    All audio files are accompanied by manually curated, time-coded verbatim transcriptions in JSON format.

    Transcription Includes:
    Speaker-Segmented Dialogues
    Time-coded Segments
    Non-speech Tags (e.g., pauses, coughs)
    High transcription accuracy with word error rate < 5% thanks to dual-layered quality checks.

    These transcriptions are production-ready, allowing for faster development of ASR and conversational AI systems in the Telecom domain.
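    A word error rate like the one quoted above is conventionally computed as word-level Levenshtein distance normalized by reference length; a minimal sketch:

```python
def wer(ref, hyp):
    """Word error rate: edit distance over word tokens / reference length."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(r)][len(h)] / len(r)

print(wer("please port my phone number", "please port my number"))  # → 0.2
```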

    Metadata

    Rich metadata is available for each participant and conversation:

    Participant Metadata: ID, age, gender, accent, dialect, and location.

  4. Census Data - Languages spoken in Chicago, 2008 – 2012

    • data.cityofchicago.org
    • healthdata.gov
    • +3more
    csv, xlsx, xml
    Updated Sep 12, 2014
    Cite
    U.S. Census Bureau (2014). Census Data - Languages spoken in Chicago, 2008 – 2012 [Dataset]. https://data.cityofchicago.org/Health-Human-Services/Census-Data-Languages-spoken-in-Chicago-2008-2012/a2fk-ec6q
    Explore at:
    Available download formats: xlsx, xml, csv
    Dataset updated
    Sep 12, 2014
    Dataset provided by
    United States Census Bureau (http://census.gov/)
    Authors
    U.S. Census Bureau
    Area covered
    Chicago
    Description

    This dataset contains estimates of the number of residents aged 5 years or older in Chicago who “speak English less than very well,” by the non-English language spoken at home and community area of residence, for the years 2008 – 2012. See the full dataset description for more information at: https://data.cityofchicago.org/api/views/fpup-mc9v/files/dK6ZKRQZJ7XEugvUavf5MNrGNW11AjdWw0vkpj9EGjg?download=true&filename=P:\EPI\OEPHI\MATERIALS\REFERENCES\ECONOMIC_INDICATORS\Dataset_Description_Languages_2012_FOR_PORTAL_ONLY.pdf

  5. American English Call Center Data for BFSI AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). American English Call Center Data for BFSI AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/bfsi-call-center-conversation-english-usa
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    United States
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This US English Call Center Speech Dataset for the BFSI (Banking, Financial Services, and Insurance) sector is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for English-speaking customers. Featuring over 30 hours of real-world, unscripted audio, it offers authentic customer-agent interactions across a range of BFSI services to train robust and domain-aware ASR models.

    Curated by FutureBeeAI, this dataset empowers voice AI developers, financial technology teams, and NLP researchers to build high-accuracy, production-ready models across BFSI customer service scenarios.

    Speech Data

    The dataset contains 30 hours of dual-channel call center recordings between native US English speakers. Captured in realistic financial support settings, these conversations span diverse BFSI topics from loan enquiries and card disputes to insurance claims and investment options, providing deep contextual coverage for model training and evaluation.

    Participant Diversity:
    Speakers: 60 native US English speakers from our verified contributor pool.
    Regions: Representing multiple states across the United States to ensure coverage of various accents and dialects.
    Participant Profile: Balanced gender mix (60% male, 40% female) with age distribution from 18 to 70 years.
    Recording Details:
    Conversation Nature: Naturally flowing, unscripted interactions between agents and customers.
    Call Duration: Ranges from 5 to 15 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, at 8kHz and 16kHz sample rates.
    Recording Environment: Captured in clean conditions with no echo or background noise.

    Topic Diversity

    This speech corpus includes both inbound and outbound calls with varied conversational outcomes like positive, negative, and neutral, ensuring real-world BFSI voice coverage.

    Inbound Calls:
    Debit Card Block Request
    Transaction Disputes
    Loan Enquiries
    Credit Card Billing Issues
    Account Closure & Claims
    Policy Renewals & Cancellations
    Retirement & Tax Planning
    Investment Risk Queries, and more
    Outbound Calls:
    Loan & Credit Card Offers
    Customer Surveys
    EMI Reminders
    Policy Upgrades
    Insurance Follow-ups
    Investment Opportunity Calls
    Retirement Planning Reviews, and more

    This variety ensures models trained on the dataset are equipped to handle complex financial dialogues with contextual accuracy.

    Transcription

    All audio files are accompanied by manually curated, time-coded verbatim transcriptions in JSON format.

    Transcription Includes:
    Speaker-Segmented Dialogues
    Time-coded Segments
    Non-speech Tags (e.g., pauses, background noise)
    High transcription accuracy with word error rate < 5% due to double-layered quality checks.

    These transcriptions are production-ready, making financial domain model training faster and more accurate.

    Metadata

    Rich metadata is available for each participant and conversation:

    Participant Metadata: ID, age, gender,

  6. American English Call Center Data for Delivery & Logistics AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). American English Call Center Data for Delivery & Logistics AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/delivery-call-center-conversation-english-usa
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    United States
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This US English Call Center Speech Dataset for the Delivery and Logistics industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for English-speaking customers. With over 30 hours of real-world, unscripted call center audio, this dataset captures authentic delivery-related conversations essential for training high-performance ASR models.

    Curated by FutureBeeAI, this dataset empowers AI teams, logistics tech providers, and NLP researchers to build accurate, production-ready models for customer support automation in delivery and logistics.

    Speech Data

    The dataset contains 30 hours of dual-channel call center recordings between native US English speakers. Captured across various delivery and logistics service scenarios, these conversations cover everything from order tracking to missed delivery resolutions, offering a rich, real-world training base for AI models.

    Participant Diversity:
    Speakers: 60 native US English speakers from our verified contributor pool.
    Regions: Multiple states of the United States for accent and dialect diversity.
    Participant Profile: Balanced gender distribution (60% male, 40% female) with ages ranging from 18 to 70.
    Recording Details:
    Conversation Nature: Naturally flowing, unscripted customer-agent dialogues.
    Call Duration: 5 to 15 minutes on average.
    Audio Format: Stereo WAV, 16-bit depth, recorded at 8kHz and 16kHz.
    Recording Environment: Captured in clean, noise-free, echo-free conditions.

    Topic Diversity

    This speech corpus includes both inbound and outbound delivery-related conversations, covering varied outcomes (positive, negative, neutral) to train adaptable voice models.

    Inbound Calls:
    Order Tracking
    Delivery Complaints
    Undeliverable Addresses
    Return Process Enquiries
    Delivery Method Selection
    Order Modifications, and more
    Outbound Calls:
    Delivery Confirmations
    Subscription Offer Calls
    Incorrect Address Follow-ups
    Missed Delivery Notifications
    Delivery Feedback Surveys
    Out-of-Stock Alerts, and others

    This comprehensive coverage reflects real-world logistics workflows, helping voice AI systems interpret context and intent with precision.

    Transcription

    All recordings come with high-quality, human-generated verbatim transcriptions in JSON format.

    Transcription Includes:
    Speaker-Segmented Dialogues
    Time-coded Segments
    Non-speech Tags (e.g., pauses, noise)
    High transcription accuracy with word error rate under 5% via dual-layer quality checks.

    These transcriptions support fast, reliable model development for English voice AI applications in the delivery sector.

    Metadata

    Detailed metadata is included for each participant and conversation:

    Participant Metadata: ID, age, gender, region, accent, dialect.
    Conversation Metadata: Topic, call type, sentiment, sample rate, and technical attributes.

    This metadata aids in training specialized models, filtering demographics, and running advanced analytics.

    Usage and Applications


  7. US Spanish General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). US Spanish General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-us-spanish
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    United States
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the US Spanish General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Spanish speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world US Spanish communication.

    Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Spanish speech models that understand and respond to authentic US accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of US Spanish. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native US Spanish speakers from FutureBeeAI’s contributor community.
    Regions: Representing various states of the US to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy (average WER < 5%), achieved through a double QA pass

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple Spanish speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for US Spanish.
    Voice Assistants: Build smart assistants capable of understanding natural US conversations.

  8. 2023 Limited English Proficiency (LEP) for the DVRPC Region Public Use Microdata Areas

    • catalog.dvrpc.org
    • njogis-newjersey.opendata.arcgis.com
    • +1more
    api, geojson, html +1
    Updated Nov 4, 2025
    + more versions
    Cite
    DVRPC (2025). 2023 Limited English Proficiency (LEP) for the DVRPC Region Public Use Microdata Areas [Dataset]. https://catalog.dvrpc.org/dataset/2023-limited-english-proficiency-lep-for-the-dvrpc-region-public-use-microdata-areas
    Explore at:
    Available download formats: api, xml, html, geojson
    Dataset updated
    Nov 4, 2025
    Dataset authored and provided by
    DVRPC
    Description

    The Delaware Valley Regional Planning Commission (DVRPC) is committed to upholding the principles and intentions of the 1964 Civil Rights Act and related nondiscrimination statutes in all of the Commission’s work, including publications, products, communications, public input, and decision-making processes. Language barriers may prohibit people who are Limited in English Proficiency (also known as LEP persons) from obtaining services, information, or participating in public planning processes. To better identify LEP populations and thoroughly evaluate the Commission’s efforts to provide meaningful access, DVRPC has produced this Limited English Proficiency Plan. This is the data that was used to make the maps for the upcoming plan. Public Use Microdata Areas (PUMAs) are geographies of at least 100,000 people that are nested within states or equivalent entities. States are able to delineate PUMAs within their borders or use the PUMA Criteria provided by the Census Bureau. Data were gathered from the 2019-2023 American Community Survey (ACS) 5-Year Estimates, Table B16001: Language Spoken at Home by Ability to Speak English for the Population 5 Years and Over. ACS data are derived from a survey and are subject to sampling variability.

    *Limited English Proficiency (LEP) refers to those persons that speak English less than "very well". DVRPC has mapped the below Language Groups for our Plan.

    Spanish

    Russian

    Chinese

    Korean

    Vietnamese

    Source of PUMA boundaries: US Census Bureau TIGER/Line Files. Please refer to U:_OngoingProjects\LEP\ACS_5YR_B16001_PUMAs_metadata.xlsx for the full attribute lookup and fields used in making the DVRPC LEP Map Series. Please contact Chris Pollard (cpollard@dvrpc.org) should you have any questions about this dataset.

  9. U.S. Languages Census Data 2009-2013

    • kaggle.com
    zip
    Updated Feb 19, 2021
    Cite
    Andreas Martinson (2021). U.S. Languages Census Data 2009-2013 [Dataset]. https://www.kaggle.com/amartinson193/us-languages-census-data-20092013
    Explore at:
    Available download formats: zip (91261 bytes)
    Dataset updated
    Feb 19, 2021
    Authors
    Andreas Martinson
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    United States
    Description

    Context

    This dataset has been cleaned and prepared from the U.S. Census Bureau and can be found here: https://www.census.gov/data/tables/2013/demo/2009-2013-lang-tables.html. Further details can be found here: https://www.census.gov/data/developers/data-sets/language-stats.html. We only cleaned data for the state level. There is data for the nation, county, and 'core-based statistical area' levels if you are interested in looking at and cleaning that data.

    The original tables split the data into a separate tab for each state, which was not conducive to data analysis. We consolidated all of the information into one table and put it into a tidy format.

    The dataset has each of the 50 states, plus Washington, D.C. and Puerto Rico.

    Content

    The dataset has the following columns:
    Group: Character
    Subgroup: Character
    Language: Character
    State: Character
    Speakers: Number (Integer)
    Margin of Error - English Speakers: Number (Integer)
    nonEnglishSpeakers: Number (Integer)
    Margin of Error - NonEnglishSpeakers: Number (Integer)
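    To illustrate working with a tidy table like this, the sketch below counts distinct language rows per state using only the standard library; the rows are made-up examples, not values from the dataset:

```python
import csv, io
from collections import Counter

# Illustrative rows mirroring the tidy layout (values are invented).
tidy_csv = """Group,Subgroup,Language,State,Speakers
Indo-European,Germanic,German,Texas,82000
Indo-European,Romance,Spanish,Texas,7000000
Asian,Sino-Tibetan,Chinese,California,1200000
"""

def languages_per_state(text):
    """Count how many language rows each state has."""
    reader = csv.DictReader(io.StringIO(text))
    return dict(Counter(row["State"] for row in reader))

print(languages_per_state(tidy_csv))  # → {'Texas': 2, 'California': 1}
```

    The same group-by pattern answers questions like "which states have the most language diversity" once the real file is loaded.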

    Acknowledgements

    This dataset was cleaned for a Data Visualization class I took in Fall 2020. Here is the link to the final project: https://datavis-fall-2020-team.github.io/uslanguages.github.io/

    Here is a link to our repository: https://github.com/DataVis-Fall-2020-Team/uslanguages.github.io

    Inspiration

    The questions we originally sought to answer were:
    Which languages are spoken in the U.S.?
    Where are these languages spoken within the U.S.?
    Which states have the most language diversity?
    Which foreign language speakers are most fluent in English?
    How have the languages spoken changed over time?

  10. American English Call Center Data for Realestate AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). American English Call Center Data for Realestate AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/realestate-call-center-conversation-english-usa
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    United States
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This US English Call Center Speech Dataset for the Real Estate industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for English-speaking real estate customers. With over 30 hours of unscripted, real-world audio, this dataset captures authentic conversations between customers and real estate agents, ideal for building robust ASR models.

    Curated by FutureBeeAI, this dataset equips voice AI developers, real estate tech platforms, and NLP researchers with the data needed to create high-accuracy, production-ready models for property-focused use cases.

    Speech Data

    The dataset features 30 hours of dual-channel call center recordings between native US English speakers. Captured in realistic real estate consultation and support contexts, these conversations span a wide array of property-related topics, from inquiries to investment advice, offering deep domain coverage for AI model development.

    Participant Diversity:
    Speakers: 60 native US English speakers from our verified contributor community.
    Regions: Representing different states across the United States to ensure accent and dialect variation.
    Participant Profile: Balanced gender mix (60% male, 40% female) and age range from 18 to 70.
    Recording Details:
    Conversation Nature: Naturally flowing, unscripted agent-customer discussions.
    Call Duration: Average 5–15 minutes per call.
    Audio Format: Stereo WAV, 16-bit, recorded at 8kHz and 16kHz.
    Recording Environment: Captured in noise-free and echo-free conditions.

    Topic Diversity

    This speech corpus includes both inbound and outbound calls, featuring positive, neutral, and negative outcomes across a wide range of real estate scenarios.

    Inbound Calls:
    Property Inquiries
    Rental Availability
    Renovation Consultation
    Property Features & Amenities
    Investment Property Evaluation
    Ownership History & Legal Info, and more
    Outbound Calls:
    New Listing Notifications
    Post-Purchase Follow-ups
    Property Recommendations
    Value Updates
    Customer Satisfaction Surveys, and others

    Such domain-rich variety ensures model generalization across common real estate support conversations.

    Transcription

    All recordings are accompanied by precise, manually verified transcriptions in JSON format.

    Transcription Includes:
    Speaker-Segmented Dialogues
    Time-coded Segments
    Non-speech Tags (e.g., background noise, pauses)
    High transcription accuracy with word error rate below 5% via dual-layer human review.

    These transcriptions streamline ASR and NLP development for English real estate voice applications.
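To illustrate how such speaker-segmented, time-coded transcriptions might be consumed, here is a short sketch. The JSON field names ("speaker", "start", "end", "text", "tag") and the sample dialogue are our assumptions for illustration; the dataset's exact schema is not specified here.

```python
import json

# Illustrative transcript snippet -- the real schema may differ.
sample = json.loads("""
[
  {"speaker": "agent",    "start": 0.00, "end": 3.10,
   "text": "Thank you for calling, how can I help you today?"},
  {"speaker": "customer", "start": 3.40, "end": 6.80,
   "text": "Hi, I'm calling about the listing on Oak Street."},
  {"speaker": null, "start": 6.80, "end": 7.50, "tag": "background_noise"}
]
""")

def speech_segments(segments):
    """Drop non-speech tags and pair each utterance with its speaker
    and time codes."""
    return [(s["speaker"], s["start"], s["end"], s["text"])
            for s in segments if s.get("text")]

for spk, start, end, text in speech_segments(sample):
    print(f"[{start:6.2f}-{end:6.2f}] {spk}: {text}")
```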

    Metadata

    Detailed metadata accompanies each participant and conversation:

    Participant Metadata: ID, age, gender, location, accent, and dialect.
    Conversation Metadata: Topic, call type, sentiment, sample rate, and technical details.

    This enables smart filtering, dialect-focused model training, and structured dataset exploration.
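As a sketch of the metadata-driven filtering described above, the snippet below selects calls by metadata fields. The record layout and field names are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical metadata records -- keys mirror the fields listed above,
# but the exact names in the delivered dataset may differ.
calls = [
    {"id": "call_001", "topic": "Property Inquiries", "call_type": "inbound",
     "sentiment": "positive",
     "speaker": {"age": 34, "gender": "female", "dialect": "Southern US"}},
    {"id": "call_002", "topic": "Value Updates", "call_type": "outbound",
     "sentiment": "neutral",
     "speaker": {"age": 52, "gender": "male", "dialect": "Midwestern US"}},
]

def filter_calls(records, **criteria):
    """Select calls whose top-level metadata matches every criterion."""
    return [r for r in records
            if all(r.get(k) == v for k, v in criteria.items())]

inbound = filter_calls(calls, call_type="inbound")
```

The same pattern extends to dialect-focused training splits by matching on the nested speaker fields.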

    Usage and Applications

    This dataset is ideal for voice AI and NLP systems built for the real estate sector.


  11. The most spoken languages worldwide 2025

    • statista.com
    Cite
    Statista, The most spoken languages worldwide 2025 [Dataset]. https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/
    Explore at:
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2025
    Area covered
    World
    Description

    In 2025, there were around 1.53 billion people worldwide who spoke English either natively or as a second language, slightly more than the 1.18 billion Mandarin Chinese speakers at the time of the survey. Hindi and Spanish accounted for the third and fourth most widespread languages that year.

    Languages in the United States
    The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken in the United States vary as a result of its multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which more than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese), 1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year.

    Different languages at home
    The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest share is California, where about 45 percent of the population spoke a language other than English at home in 2023.

  12. F

    US English TTS Speech Dataset for Speech Synthesis

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). US English TTS Speech Dataset for Speech Synthesis [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/tts-monolgue-english-us
    Explore at:
    wav
    Available download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    The English TTS Monologue Speech Dataset is a professionally curated resource built to train realistic, expressive, and production-grade text-to-speech (TTS) systems. It contains studio-recorded long-form speech by trained native English voice artists, each contributing 1 to 2 hours of clean, uninterrupted monologue audio.

    Unlike typical prompt-based datasets with short, isolated phrases, this collection features long-form, topic-driven monologues that mirror natural human narration. It includes content types that are directly useful for real-world applications, like audiobook-style storytelling, educational lectures, health advisories, product explainers, digital how-tos, formal announcements, and more.

    All recordings are captured in professional studios using high-end equipment and under the guidance of experienced voice directors.

    Recording & Audio Quality

    Audio Format: WAV, 48 kHz, available in 16-bit, 24-bit, and 32-bit depth
    SNR: Minimum 30 dB
    Channel: Mono
    Recording Duration: 20-30 minutes
    Recording Environment: Studio-controlled, acoustically treated rooms
    Per Speaker Volume: 1–2 hours of speech per artist
    Quality Control: Each file is reviewed and cleaned for common acoustic issues, including: reverberation, lip smacks, mouth clicks, thumping, hissing, plosives, sibilance, background noise, static interference, clipping, and other artifacts.

    Only clean, production-grade audio makes it into the final dataset.
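As a rough illustration of how one might screen files against the minimum 30 dB SNR figure above, here is a simple energy-based estimate. This is a crude heuristic sketch, not the QC procedure actually used for the dataset.

```python
import math

def estimate_snr_db(samples, frame_len=480):
    """Rough SNR estimate: treat the quietest 10% of frames as noise
    and the loudest 10% as signal. A heuristic, not a QC-grade measure."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    # Mean energy per frame, sorted from quietest to loudest.
    energies = sorted(sum(x * x for x in f) / len(f) for f in frames)
    k = max(1, len(energies) // 10)
    noise = sum(energies[:k]) / k
    signal = sum(energies[-k:]) / k
    return 10 * math.log10(signal / max(noise, 1e-12))
```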

    Voice Artist Selection

    All voice artists are native English speakers with professional training or prior experience in narration. We ensure a diverse pool in terms of age, gender, and region to bring a balanced and rich vocal dataset.

    Artist Profile:
    Gender: Male and Female
    Age Range: 20–60 years
    Regions: Speakers drawn from various states across the United States of America
    Selection Process: All artists are screened, onboarded, and sample-approved using FutureBeeAI’s proprietary Yugo platform.

    Script Quality & Coverage

    Scripts are not generic or repetitive: they are professionally authored by domain experts to reflect real-world use cases, avoiding redundancy while incorporating modern vocabulary, emotional range, and phonetically rich sentence structures.

    Word Count per Script: 3,000–5,000 words per 30-minute session
    Content Types:
    Storytelling
    Script and book reading
    Informational explainers
    Government service instructions
    E-commerce tutorials
    Motivational content
    Health & wellness guides
    Education & career advice
    Linguistic Design: Balanced punctuation, emotional range, modern syntax, and vocabulary diversity

    Transcripts & Alignment

    While a script is used during recording, transcripts are updated afterward so that they reflect the final spoken audio; minor edits account for skipped or rephrased words.

    Segmentation: Time-stamped at the sentence level, aligned to actual spoken delivery
    Format: Available in plain text and JSON
    Post-processing:
    Corrected for

  13. Share of U.S. school-age children who don't speak English at home 1979-2022

    • statista.com
    Updated Apr 25, 2014
    Cite
    Statista (2014). Share of U.S. school-age children who don't speak English at home 1979-2022 [Dataset]. https://www.statista.com/statistics/476804/percentage-of-school-age-children-who-speak-another-language-than-english-at-home-in-the-us/
    Explore at:
    Dataset updated
    Apr 25, 2014
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    United States
    Description

    In 2022, about 21.4 percent of schoolchildren in the United States spoke a language other than English at home, a slight increase from 2021, when 21.3 percent of U.S. schoolchildren did not speak English at home.

  14. d

    ACS 5YR Demographic Estimate Data by State

    • datasets.ai
    • catalog.data.gov
    21, 57
    Updated Feb 29, 2024
    + more versions
    Cite
    Department of Housing and Urban Development (2024). ACS 5YR Demographic Estimate Data by State [Dataset]. https://datasets.ai/datasets/acs-5yr-demographic-estimate-data-by-state
    Explore at:
    57, 21
    Available download formats
    Dataset updated
    Feb 29, 2024
    Dataset authored and provided by
    Department of Housing and Urban Development
    Description

    2016-2020 ACS 5-Year estimates of demographic variables (see below) compiled at the State level. These variables include Sex By Age, Hispanic Or Latino Origin By Race, Household Type (Including Living Alone), Households By Presence Of People Under 18 Years By Household Type, Households By Presence Of People 60 Years And Over By Household Type, Nativity By Language Spoken At Home By Ability To Speak English For The Population 5 Years And Over, Average Household Size Of Occupied Housing Units By Tenure, and Sex by Educational Attainment for the Population 18 Years and Over.

  15. What languages are spoken by people who have limited English ability?

    • anrgeodata.vermont.gov
    • visionzero.geohub.lacity.org
    Updated Apr 21, 2023
    Cite
    Urban Observatory by Esri (2023). What languages are spoken by people who have limited English ability? [Dataset]. https://anrgeodata.vermont.gov/maps/2c15f2d8d81343d883e70a73317a5cb9
    Explore at:
    Dataset updated
    Apr 21, 2023
    Dataset provided by
    Esri (http://esri.com/)
    Authors
    Urban Observatory by Esri
    Area covered
    Description

    This map shows the predominant language(s) spoken by people who have limited English speaking ability, using American Community Survey data from the US Census Bureau by state, county, and tract.

    There are 12 different language/language groupings:
    Spanish
    French, Haitian, or Cajun
    Korean
    Chinese (including Mandarin and Cantonese)
    Vietnamese
    Tagalog (including Filipino)
    Arabic
    German or other West Germanic
    Russian, Polish, or other Slavic
    Other Indo-European (such as Italian or Portuguese)
    Other Asian and Pacific Island (such as Japanese or Hmong)
    Other and unspecified (such as Navajo or Hebrew)

    This map also uses a feature effect to identify the counties where either 10,000 people or 5% of the population have limited English ability. According to the Voting Rights Act, "localities where there are more than 10,000 or over 5 percent of the total voting age citizens in a single political subdivision (usually a county, but a township or municipality in some states) who are members of a single language minority group, have depressed literacy rates, and do not speak English very well" are required to "provide [voting materials] in the language of the applicable minority group as well as in the English language".

    This map uses hosted feature layers containing the most recent American Community Survey data. These layers are part of ArcGIS Living Atlas and are updated every year when the American Community Survey releases new estimates, so values in the map always reflect the newest data available.

  16. t

    LANGUAGE SPOKEN AT HOME - DP02_DES_ZIP

    • portal.tad3.org
    Updated Jul 23, 2023
    + more versions
    Cite
    (2023). LANGUAGE SPOKEN AT HOME - DP02_DES_ZIP [Dataset]. https://portal.tad3.org/dataset/language-spoken-at-home-dp02_des_zip
    Explore at:
    Dataset updated
    Jul 23, 2023
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    SELECTED SOCIAL CHARACTERISTICS IN THE UNITED STATES: LANGUAGE SPOKEN AT HOME - DP02
    Universe - Population 5 years and over
    Survey-Program - American Community Survey 5-year estimates
    Years - 2020, 2021, 2022

    In households where one or more people spoke a language other than English, the household language assigned to all household members was the non-English language spoken by the first person with a non-English language in the following order: householder, spouse, parent, sibling, child, grandchild, in-law, other relative, unmarried partner, housemate or roommate, roomer/boarder, foster child, or other nonrelatives. Therefore, a person who spoke only English may have had a non-English household language assigned during tabulations as a result of living in a household with a non-English household language. If no member of the household age 5 and over speaks a language other than English at home, then the household language is English only.
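The household-language assignment rule described above is effectively a precedence-ordered scan, which can be sketched in code. The pair-based input format below is our own simplification of the underlying census records.

```python
# Precedence order quoted in the rule above.
PRECEDENCE = ["householder", "spouse", "parent", "sibling", "child",
              "grandchild", "in-law", "other relative", "unmarried partner",
              "housemate or roommate", "roomer/boarder", "foster child",
              "other nonrelatives"]

def household_language(members):
    """Return the household language: the non-English language of the
    highest-precedence member who speaks one, else 'English only'.
    `members` is a list of (relationship, language) pairs."""
    rank = {rel: i for i, rel in enumerate(PRECEDENCE)}
    non_english = [(rank[rel], lang) for rel, lang in members
                   if lang != "English"]
    if not non_english:
        return "English only"
    return min(non_english)[1]  # language of the highest-precedence speaker
```

Note how, as the rule states, an English-only speaker can still be assigned a non-English household language through other household members.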

  17. Number of native Spanish speakers worldwide 2024, by country

    • statista.com
    • boostndoto.org
    • +5more
    Cite
    Statista, Number of native Spanish speakers worldwide 2024, by country [Dataset]. https://www.statista.com/statistics/991020/number-native-spanish-speakers-country-worldwide/
    Explore at:
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    World
    Description

    Mexico is the country with the largest number of native Spanish speakers in the world. As of 2024, 132.5 million people in Mexico spoke Spanish with a native command of the language. Colombia was the nation with the second-highest number of native Spanish speakers, at around 52.7 million. Spain came in third, with 48 million, and Argentina fourth, with 46 million.

    Spanish, a world language
    As of 2023, Spanish ranked as the fourth most spoken language in the world, behind only English, Chinese, and Hindi, with over half a billion speakers. Spanish is the official language of over 20 countries, the majority on the American continent; nonetheless, it is also one of the official languages of Equatorial Guinea in Africa. The language also has a strong presence in countries such as the United States, Morocco, and Brazil, which rank among the non-Hispanic countries with the highest numbers of Spanish speakers.

    The second most spoken language in the U.S.
    In the most recent data, Spanish ranked as the most spoken language in the U.S. other than English, with roughly 12 times as many speakers as the language in second place. This comes as no surprise given the long history of migration from Latin American countries to the United States. Moreover, in fiscal year 2022 alone, 5 of the top 10 countries of origin of people naturalized in the U.S. were Spanish-speaking countries.

  18. a

    Predominant Language Spoken at Home - ACS 2016-Copy-Copy

    • umn.hub.arcgis.com
    Updated Dec 24, 2019
    Cite
    University of Minnesota (2019). Predominant Language Spoken at Home - ACS 2016-Copy-Copy [Dataset]. https://umn.hub.arcgis.com/maps/75844925ac7d472ea49c62e518686577
    Explore at:
    Dataset updated
    Dec 24, 2019
    Dataset authored and provided by
    University of Minnesota
    Area covered
    Description

    This map shows the predominant language spoken at home by the US population aged 5+, shown by census tract and county centroids. The data values are from the 2012-2016 American Community Survey 5-year estimates in the S1601 Table for Language Spoken at Home. The popup in the map provides a breakdown of the population age 5+ by the language spoken at home; data values for other age groups are also available within the data's table. The color of the symbols represents the most common language spoken at home. This predominance map style compares the count of people age 5+ by the language spoken at home and returns the value with the highest count. The census breaks down the population 5+ by the following language options:
    English Only
    Non-English - Spanish
    Non-English - Asian and Pacific Islander Languages
    Non-English - Indo-European Languages
    Non-English - Other

    The size of the symbols represents how many people are 5 years or older, which helps highlight the quantity of people within an area who were sampled for this language categorization. The strength of the color represents HOW predominant a language is within an area: if the symbol is a strong color, that language makes up a larger portion of the population. This map is designed for a dark basemap such as the Human Geography Basemap or the Dark Gray Canvas Basemap. See the web map to see the pattern at both the county and tract level. This map helps to show the most common language spoken at home at both a regional and local level. The tract pattern shows how distinct neighborhoods cluster by which language they speak; the county pattern shows how language is used throughout the country. This pattern is shown by census tracts at large scales and counties at smaller scales.

    This data was downloaded from the United States Census Bureau American Fact Finder on January 16, 2018. It was then joined with 2016 vintage centroid points and hosted to ArcGIS Online and into the Living Atlas.

    Nationally, the breakdown of language spoken at home for the population 5+ is as follows:
    Population 5 years and over: 298,691,202 (+/-3,594)
    Speak only English: 235,519,143 (+/-154,409), 78.90% (+/-0.1)
    Spanish: 39,145,066 (+/-94,571), 13.10% (+/-0.1)
    Asian and Pacific Island languages: 10,172,370 (+/-22,561), 3.40% (+/-0.1)
    Other Indo-European languages: 10,827,536 (+/-46,335), 3.60% (+/-0.1)
    Other languages: 3,027,087 (+/-23,302), 1.00% (+/-0.1)

  19. 2012 American Community Survey: B16002 | HOUSEHOLD LANGUAGE BY HOUSEHOLDS IN...

    • data.census.gov
    Cite
    ACS, 2012 American Community Survey: B16002 | HOUSEHOLD LANGUAGE BY HOUSEHOLDS IN WHICH NO ONE 14 AND OVER SPEAKS ENGLISH ONLY OR SPEAKS A LANGUAGE OTHER THAN ENGLISH AT HOME AND SPEAKS ENGLISH "VERY WELL" (ACS 1-Year Estimates Detailed Tables) [Dataset]. https://data.census.gov/table/ACSDT1Y2012.B16002?q=El+Dorado+County,+California+No+Topic
    Explore at:
    Dataset provided by
    United States Census Bureau (http://census.gov/)
    Authors
    ACS
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Time period covered
    2012
    Description

    Supporting documentation on code lists, subject definitions, data accuracy, and statistical testing can be found on the American Community Survey website in the Data and Documentation section. Sample size and data quality measures (including coverage rates, allocation rates, and response rates) can be found on the American Community Survey website in the Methodology section.

    Although the American Community Survey (ACS) produces population, demographic and housing unit estimates, it is the Census Bureau's Population Estimates Program that produces and disseminates the official estimates of the population for the nation, states, counties, cities and towns, and estimates of housing units for states and counties.

    Explanation of symbols:
    An "**" entry in the margin of error column indicates that either no sample observations or too few sample observations were available to compute a standard error and thus the margin of error. A statistical test is not appropriate.
    An "-" entry in the estimate column indicates that either no sample observations or too few sample observations were available to compute an estimate, or a ratio of medians cannot be calculated because one or both of the median estimates falls in the lowest interval or upper interval of an open-ended distribution.
    An "-" following a median estimate means the median falls in the lowest interval of an open-ended distribution.
    A "+" following a median estimate means the median falls in the upper interval of an open-ended distribution.
    An "***" entry in the margin of error column indicates that the median falls in the lowest interval or upper interval of an open-ended distribution. A statistical test is not appropriate.
    An "*****" entry in the margin of error column indicates that the estimate is controlled. A statistical test for sampling variability is not appropriate.
    An "N" entry in the estimate and margin of error columns indicates that data for this geographic area cannot be displayed because the number of sample cases is too small.
    An "(X)" means that the estimate is not applicable or not available.

    Estimates of urban and rural population, housing units, and characteristics reflect boundaries of urban areas defined based on Census 2000 data. Boundaries for urban areas have not been updated since Census 2000. As a result, data for urban and rural areas from the ACS do not necessarily reflect the results of ongoing urbanization.

    While the 2012 American Community Survey (ACS) data generally reflect the December 2009 Office of Management and Budget (OMB) definitions of metropolitan and micropolitan statistical areas, in certain instances the names, codes, and boundaries of the principal cities shown in ACS tables may differ from the OMB definitions due to differences in the effective dates of the geographic entities.

    The household language assigned to the housing unit is the non-English language spoken by the first person with a non-English language. This assignment scheme ranks household members in the following order: householder, spouse, parent, sibling, child, grandchild, other relative, stepchild, unmarried partner, housemate or roommate, and other nonrelatives. If no member of the household age 5 and over speaks a language other than English at home, then the household language is English only.

    A household defined as "No one age 14 and over speaks English only or speaks English 'very well'" is one in which no person age 14 and over speaks English at least "very well." That is, no person age 14 and over (1) speaks only English at home or (2) speaks another language at home and speaks English "very well." By definition, English-only households cannot belong to this group. Previous Census Bureau data products have referred to these households as "linguistically isolated." This table is directly comparable to last year's.

    Data are based on a sample and are subject to sampling variability. The degree of uncertainty for an estimate arising from sampling variability is represented through the use of a margin of error. The value shown here is the 90 percent margin of error. The margin of error can be interpreted roughly as providing a 90 percent probability that the interval defined by the estimate minus the margin of error and the estimate plus the margin of error (the lower and upper confidence bounds) contains the true value. In addition to sampling variability, the ACS estimates are subject to nonsampling error (for a discussion of nonsampling variability, see Accuracy of the Data). The effect of nonsampling error is not represented in these tables.

    Source: U.S. Census Bureau, 2012 American Community Survey

  20. ACS 5YR Demographic Estimate Data by Place

    • hudgis-hud.opendata.arcgis.com
    • data.lojic.org
    • +1more
    Updated Aug 21, 2023
    + more versions
    Cite
    Department of Housing and Urban Development (2023). ACS 5YR Demographic Estimate Data by Place [Dataset]. https://hudgis-hud.opendata.arcgis.com/datasets/f83fd832f14c44168d4cc44cda660c1b
    Explore at:
    Dataset updated
    Aug 21, 2023
    Dataset provided by
    United States Department of Housing and Urban Development (http://www.hud.gov/)
    Authors
    Department of Housing and Urban Development
    Area covered
    Description

    2016-2020 ACS 5-Year estimates of demographic variables (see below) compiled at the place level. The American Community Survey (ACS) 5-Year 2016-2020 demographic information is a subset of information available for download from the U.S. Census. Tables used in the development of this dataset include:
    B01001 - Sex By Age
    B03002 - Hispanic Or Latino Origin By Race
    B11001 - Household Type (Including Living Alone)
    B11005 - Households By Presence Of People Under 18 Years By Household Type
    B11006 - Households By Presence Of People 60 Years And Over By Household Type
    B16005 - Nativity By Language Spoken At Home By Ability To Speak English For The Population 5 Years And Over
    B25010 - Average Household Size Of Occupied Housing Units By Tenure
    B15001 - Sex by Educational Attainment for the Population 18 Years and Over

    To learn more about the American Community Survey (ACS) and associated datasets, visit https://www.census.gov/programs-surveys/acs. For questions about the spatial attribution of this dataset, please reach out to us at GISHelpdesk@hud.gov.
    Data Dictionary: DD_ACS 5-Year Demographic Estimate Data by Place
    Date of Coverage: 2016-2020

Cite
FutureBee AI (2022). American English General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-english-usa

American English General Conversation Speech Dataset for ASR

American English General Conversation Speech Corpus

Explore at:
wav
Available download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License

https://www.futurebeeai.com/policies/ai-data-license-agreement

Area covered
United States
Dataset funded by
FutureBeeAI
Description

Introduction

Welcome to the US English General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of English speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world US English communication.

Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade English speech models that understand and respond to authentic American accents and dialects.

Speech Data

The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of US English. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

Participant Diversity:
Speakers: 60 verified native US English speakers from FutureBeeAI’s contributor community.
Regions: Representing various states of the United States of America to ensure dialectal diversity and demographic balance.
Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
Recording Details:
Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
Duration: Each conversation ranges from 15 to 60 minutes.
Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
Environment: Quiet, echo-free settings with no background noise.

Topic Diversity

The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

Sample Topics Include:
Family & Relationships
Food & Recipes
Education & Career
Healthcare Discussions
Social Issues
Technology & Gadgets
Travel & Local Culture
Shopping & Marketplace Experiences, and many more.

Transcription

Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

Transcription Highlights:
Speaker-segmented dialogues
Time-coded utterances
Non-speech elements (pauses, laughter, etc.)
High transcription accuracy (average WER below 5%), achieved through a double QA pass

These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
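The WER figure quoted above is the standard word error rate: word-level edit distance divided by the reference length. A minimal sketch of how it is computed:

```python
def word_error_rate(reference, hypothesis):
    """Levenshtein distance over words divided by reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / max(len(ref), 1)
```

A dataset-level "WER below 5%" means this ratio, averaged across files, stays under 0.05 when verified transcripts are compared against a gold reference.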

Metadata

The dataset comes with granular metadata for both speakers and recordings:

Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
Recording Metadata: Topic, duration, audio format, device type, and sample rate.

Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

Usage and Applications

This dataset is a versatile resource for multiple English speech and language AI applications:

ASR Development: Train accurate speech-to-text systems for US English.
Voice Assistants: Build smart assistants capable of understanding natural American conversations.
