90 datasets found
  1. a

    Percent Spanish Speakers

    • gis-kingcounty.opendata.arcgis.com
    • hub.arcgis.com
    Updated Aug 10, 2016
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    King County (2016). Percent Spanish Speakers [Dataset]. https://gis-kingcounty.opendata.arcgis.com/datasets/percent-spanish-speakers
    Explore at:
    Dataset updated
    Aug 10, 2016
    Dataset authored and provided by
    King County
    Area covered
    Description

    Languages:Percent Spanish Speakers: Basic demographics by census tracts in King County based on current American Community Survey 5 Year Average (ACS). Included demographics are: total population; foreign born; median household income; English language proficiency; languages spoken; race and ethnicity; sex; and age. Numbers and derived percentages are estimates based on the current year's ACS. GEO_ID_TRT is the key field and may be used to join to other demographic Census data tables.

  2. F

    Mexican Spanish General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). Mexican Spanish General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-spanish-mexico
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    Mexico
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Mexican Spanish General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Spanish speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Mexican Spanish communication.

    Curated by FutureBeeAI, this 30 hours dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Spanish speech models that understand and respond to authentic Mexican accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Mexican Spanish. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native Mexican Spanish speakers from FutureBeeAI’s contributor community.
    Regions: Representing various provinces of Mexico to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy, achieved through double QA pass, average WER < 5%

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple Spanish speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for Mexican Spanish.
    Voice Assistants: Build smart assistants capable of understanding natural Mexican conversations.
    <div style="margin-top:10px; margin-bottom: 10px; padding-left: 30px; display: flex; gap: 16px;

  3. a

    Serving Spanish Speakers in Denver

    • denver-data-library-mappingjustice.hub.arcgis.com
    Updated Aug 31, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    University of California San Diego (2020). Serving Spanish Speakers in Denver [Dataset]. https://denver-data-library-mappingjustice.hub.arcgis.com/datasets/UCSDOnline::serving-spanish-speakers-in-denver-1
    Explore at:
    Dataset updated
    Aug 31, 2020
    Dataset authored and provided by
    University of California San Diego
    Area covered
    Denver
    Description

    This map shows that the communities that are located in the southwestern and northern ends of Denver's downtown are predominately low-income communities. These same communities are also home to areas in Denver that have higher percentages of people who only speak Spanish, while the communities to the southeast and northwest have higher household incomes. While diversity indexes are higher in the Spanish-speaking areas, this diversity may be due to (presumably) Black residents, who also suffer from poverty rates across the country. The point stands that residents in these areas are generally subject to lower quality of life, due to the observations mentioned.

  4. 478 Hours - Spanish Conversational Speech Data by Mobile Phone

    • nexdata.ai
    • m.nexdata.ai
    Updated Dec 5, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Nexdata (2023). 478 Hours - Spanish Conversational Speech Data by Mobile Phone [Dataset]. https://www.nexdata.ai/datasets/speechrecog/1147
    Explore at:
    Dataset updated
    Dec 5, 2023
    Dataset authored and provided by
    Nexdata
    Variables measured
    Format, Country, Speaker, Language, Accuracy Rate, Content category, Recording device, Recording condition, Language(Region) Code, Features of annotation
    Description

    Spanish(Spain) Spontaneous Dialogue Smartphone speech dataset, collected from dialogues based on given topics, covering 20+ domains. Transcribed with text content, speaker's ID, gender, age and other attributes. Our dataset was collected from extensive and diversify speakers(596 native speakers), geographicly speaking, enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring the maintenance of user privacy and legal rights throughout the data collection, storage, and usage processes, our datasets are all GDPR, CCPA, PIPL complied.

  5. F

    Spanish Conversation Chat Dataset for Healthcare Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). Spanish Conversation Chat Dataset for Healthcare Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/spanish-healthcare-domain-conversation-text-dataset
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The dataset comprises over 10,000 chat conversations, each focusing on specific Healthcare related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.

    Participants Details: 150+ native Spanish participants from the FutureBeeAI community.
    Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.

    Topic Diversity

    The chat dataset covers a wide range of conversations on Healthcare topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Healthcare use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.

    Inbound Chats:
    Appointment Scheduling
    New Patient Registration
    Surgery Consultation
    Consultation regarding Diet, and many more
    Outbound Chats:
    Appointment Reminder
    Health & Wellness Subscription Programs
    Lab Test Results
    Health Risk Assessments
    Preventive Care Reminders, and many more

    Language Variety & Nuances

    The conversations in this dataset capture the diverse language styles and expressions prevalent in Spanish Healthcare interactions. This diversity ensures the dataset accurately represents the language used by Spanish speakers in Healthcare contexts.

    The dataset encompasses a wide array of language elements, including:

    Naming Conventions: Chats include a variety of Spanish personal and business names.
    Localized Details: Real-world addresses, emails, phone numbers, and other contact information as according to different Spanish-speaking regions.
    Temporal and Numeric Expressions: Dates, times, currencies, and numbers in Spanish forms, adhering to local conventions.
    Idiomatic Expressions and Slang: It includes local slang, idioms, and informal phrase present in Spanish Healthcare conversations.

    This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Spanish Healthcare interactions.

    Conversational Flow and Interaction Types

    The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Healthcare customer-agent interactions.

    Simple Inquiries
    Detailed Discussions
    Transactional Interactions
    Problem-Solving Dialogues
    Advisory Sessions
    Routine Checks and Follow-Ups

    Each of these conversations contains various aspects of conversation flow like:

    Greetings
    Authentication
    Information gathering
    Resolution identification
    Solution Delivery
    Closing and Follow-ups
    Feedback, etc

    This structured and varied conversational flow enables the creation of advanced NLP models that can effectively manage and respond to a wide range of customer service scenarios.

    Data Format and Structure

    The dataset is available in JSON, CSV, and TXT formats, with each conversation containing attributes like participant identifiers and chat

  6. 189 Hours - Spanish(Latin America) Children Real-world Casual Conversation...

    • nexdata.ai
    Updated Nov 1, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Nexdata (2023). 189 Hours - Spanish(Latin America) Children Real-world Casual Conversation and Monologue speech dataset [Dataset]. https://www.nexdata.ai/datasets/speechrecog/1250
    Explore at:
    Dataset updated
    Nov 1, 2023
    Dataset authored and provided by
    Nexdata
    Area covered
    Latin America
    Variables measured
    Age, Format, Country, Accuracy, Language, Content category, Language(Region) Code, Recording environment, Features of annotation
    Description

    Spanish(Latin America) Children Real-world Casual Conversation and Monologue speech dataset, covers self-media, conversation, live, lecture, variety show and other generic domains, mirrors real-world interactions. Transcribed with text content, speaker's ID, gender, age, accent and other attributes. Our dataset was collected from extensive and diversify speakers(12 years old and younger children), geographicly speaking, enhancing model performance in real and complex tasks.Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring the maintenance of user privacy and legal rights throughout the data collection, storage, and usage processes, our datasets are all GDPR, CCPA, PIPL complied.

  7. 2013 American Community Survey - Table Packages: Detailed Language Spoken in...

    • catalog.data.gov
    Updated Jul 19, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    U.S. Census Bureau (2023). 2013 American Community Survey - Table Packages: Detailed Language Spoken in the U.S. [Dataset]. https://catalog.data.gov/dataset/2013-american-community-survey-table-packages-detailed-language-spoken-in-the-u-s
    Explore at:
    Dataset updated
    Jul 19, 2023
    Dataset provided by
    United States Census Bureauhttp://census.gov/
    Area covered
    United States
    Description

    This data set uses the 2009-2013 American Community Survey to tabulate the number of speakers of languages spoken at home and the number of speakers of each language who speak English less than very well. These tabulations are available for the following geographies: nation; each of the 50 states, plus Washington, D.C. and Puerto Rico; counties with 100,000 or more total population and 25,000 or more speakers of languages other than English and Spanish; core-based statistical areas (metropolitan statistical areas and micropolitan statistical areas) with 100,000 or more total population and 25,000 or more speakers of languages other than English and Spanish.

  8. F

    Colombian Spanish General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). Colombian Spanish General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-spanish-colombia
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Colombian Spanish General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Spanish speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Colombian Spanish communication.

    Curated by FutureBeeAI, this 30 hours dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Spanish speech models that understand and respond to authentic Colombian accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Colombian Spanish. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native Colombian Spanish speakers from FutureBeeAI’s contributor community.
    Regions: Representing various provinces of Colombia to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy, achieved through double QA pass, average WER < 5%

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple Spanish speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for Colombian Spanish.
    Voice Assistants: Build smart assistants capable of understanding natural Colombian conversations.
    <div style="margin-top:10px; margin-bottom: 10px; padding-left: 30px; display: flex;

  9. h

    messirve

    • huggingface.co
    Updated Apr 26, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Spanish Info Retrieval (2024). messirve [Dataset]. https://huggingface.co/datasets/spanish-ir/messirve
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 26, 2024
    Dataset authored and provided by
    Spanish Info Retrieval
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Dataset Card for MessIRve

    MessIRve is a large-scale dataset for Spanish IR, designed to better capture the information needs of Spanish speakers across different countries. Queries are obtained from Google's autocomplete API (www.google.com/complete), and relevant documents are Spanish Wikipedia paragraphs containing answers from Google Search "featured snippets". This data collection strategy is inspired by GooAQ. The files presented here are the qrels. The style in which they… See the full description on the dataset page: https://huggingface.co/datasets/spanish-ir/messirve.

  10. 435 Hours - Spanish Speech Data by Mobile Phone

    • m.nexdata.ai
    • nexdata.ai
    Updated Jun 19, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Nexdata (2024). 435 Hours - Spanish Speech Data by Mobile Phone [Dataset]. https://m.nexdata.ai/datasets/speechrecog/951?source=Github
    Explore at:
    Dataset updated
    Jun 19, 2024
    Dataset authored and provided by
    Nexdata
    Variables measured
    Format, Country, Speaker, Language, Accuracy Rate, Content category, Recording device, Recording condition, Language(Region) Code, Features of annotation
    Description

    Spanish(Spain) Scripted Monologue Smartphone speech dataset, collected from monologue based on given scripts, covering generic domain, human-machine interaction, smart home command and in-car command, numbers, news and other domains. Transcribed with text content and other attributes. Our dataset was collected from extensive and diversify speakers(989 people in total), geographicly speaking, enhancing model performance in real and complex tasks.Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring the maintenance of user privacy and legal rights throughout the data collection, storage, and usage processes, our datasets are all GDPR, CCPA, PIPL complied.

  11. 338 Hours-Spanish Speech Data by Mobile Phone

    • m.nexdata.ai
    • nexdata.ai
    Updated Nov 20, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Nexdata (2024). 338 Hours-Spanish Speech Data by Mobile Phone [Dataset]. https://m.nexdata.ai/datasets/speechrecog/245?source=Github
    Explore at:
    Dataset updated
    Nov 20, 2024
    Dataset authored and provided by
    Nexdata
    Variables measured
    Format, Country, Speaker, Language, Accuracy Rate, Content category, Recording device, Recording condition, Features of annotation
    Description

    Spanish Scripted Monologue Smartphone speech dataset, collected from monologue based on given scripts, covering News, comments, encyclopedia, economy, science and law domains, with balanced gender distribution. Transcribed with text content and other attributes. Our dataset was collected from extensive and diversify speakers(800 people from Spain, Mexico, Argentina, etc.), geographicly speaking, enhancing model performance in real and complex tasks.Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring the maintenance of user privacy and legal rights throughout the data collection, storage, and usage processes, our datasets are all GDPR, CCPA, PIPL complied.

  12. m

    General conversation speech datasets in Spanish for Social Conversation

    • data.macgence.com
    mp3
    Updated Mar 22, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Macgence (2024). General conversation speech datasets in Spanish for Social Conversation [Dataset]. https://data.macgence.com/dataset/general-conversation-speech-datasets-in-spanish-for-social-conversation
    Explore at:
    mp3Available download formats
    Dataset updated
    Mar 22, 2024
    Dataset authored and provided by
    Macgence
    License

    https://data.macgence.com/terms-and-conditionshttps://data.macgence.com/terms-and-conditions

    Time period covered
    2025
    Area covered
    Worldwide
    Variables measured
    Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
    Description

    The audio dataset includes General Conversation, featuring Spanish speakers from Spain with detailed metadata.

  13. h

    EpaDB

    • huggingface.co
    Updated Jun 3, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Koel Labs (2025). EpaDB [Dataset]. https://huggingface.co/datasets/KoelLabs/EpaDB
    Explore at:
    Dataset updated
    Jun 3, 2025
    Dataset authored and provided by
    Koel Labs
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    EpaDB

    EpaDB is a speech database of 50 native Spanish speakers (25 male, 25 female) from Argentina speaking English. It contains phonemic annotations using mainly the sounds supported by ARPABet with a few extensions to model Spanish influenced dialects of English. It was developed by Jazmin Vidal, Luciana Ferrer, and Leonardo Brambilla at the Speech Lab. Read more on their official github and paper.

      This Processed Version
    

    We have processed the dataset into an easily… See the full description on the dataset page: https://huggingface.co/datasets/KoelLabs/EpaDB.

  14. F

    Mexican Spanish Call Center Data for Realestate AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). Mexican Spanish Call Center Data for Realestate AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/realestate-call-center-conversation-spanish-mexico
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    Mexico
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This Mexican Spanish Call Center Speech Dataset for the Real Estate industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for Spanish -speaking Real Estate customers. With over 30 hours of unscripted, real-world audio, this dataset captures authentic conversations between customers and real estate agents ideal for building robust ASR models.

    Curated by FutureBeeAI, this dataset equips voice AI developers, real estate tech platforms, and NLP researchers with the data needed to create high-accuracy, production-ready models for property-focused use cases.

    Speech Data

    The dataset features 30 hours of dual-channel call center recordings between native Mexican Spanish speakers. Captured in realistic real estate consultation and support contexts, these conversations span a wide array of property-related topics from inquiries to investment advice offering deep domain coverage for AI model development.

    Participant Diversity:
    Speakers: 60 native Mexican Spanish speakers from our verified contributor community.
    Regions: Representing different provinces across Mexico to ensure accent and dialect variation.
    Participant Profile: Balanced gender mix (60% male, 40% female) and age range from 18 to 70.
    Recording Details:
    Conversation Nature: Naturally flowing, unscripted agent-customer discussions.
    Call Duration: Average 5–15 minutes per call.
    Audio Format: Stereo WAV, 16-bit, recorded at 8kHz and 16kHz.
    Recording Environment: Captured in noise-free and echo-free conditions.

    Topic Diversity

    This speech corpus includes both inbound and outbound calls, featuring positive, neutral, and negative outcomes across a wide range of real estate scenarios.

    Inbound Calls:
    Property Inquiries
    Rental Availability
    Renovation Consultation
    Property Features & Amenities
    Investment Property Evaluation
    Ownership History & Legal Info, and more
    Outbound Calls:
    New Listing Notifications
    Post-Purchase Follow-ups
    Property Recommendations
    Value Updates
    Customer Satisfaction Surveys, and others

    Such domain-rich variety ensures model generalization across common real estate support conversations.

    Transcription

    All recordings are accompanied by precise, manually verified transcriptions in JSON format.

    Transcription Includes:
    Speaker-Segmented Dialogues
    Time-coded Segments
    Non-speech Tags (e.g., background noise, pauses)
    High transcription accuracy with word error rate below 5% via dual-layer human review.

    These transcriptions streamline ASR and NLP development for Spanish real estate voice applications.

    Metadata

    Detailed metadata accompanies each participant and conversation:

    Participant Metadata: ID, age, gender, location, accent, and dialect.
    Conversation Metadata: Topic, call type, sentiment, sample rate, and technical details.

    This enables smart filtering, dialect-focused model training, and structured dataset exploration.

    Usage and Applications

    This dataset is ideal for voice AI and NLP systems built for the real estate sector:

    <div style="margin-top:10px; margin-bottom: 10px; padding-left: 30px; display: flex; gap: 16px; align-items:

  15. F

    Mexican Spanish Call Center Data for Healthcare AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). Mexican Spanish Call Center Data for Healthcare AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/healthcare-call-center-conversation-spanish-mexico
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    Mexico
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This Mexican Spanish Call Center Speech Dataset for the Healthcare industry is purpose-built to accelerate the development of Spanish speech recognition, spoken language understanding, and conversational AI systems. With 30 Hours of unscripted, real-world conversations, it delivers the linguistic and contextual depth needed to build high-performance ASR models for medical and wellness-related customer service.

    Created by FutureBeeAI, this dataset empowers voice AI teams, NLP researchers, and data scientists to develop domain-specific models for hospitals, clinics, insurance providers, and telemedicine platforms.

    Speech Data

    The dataset features 30 Hours of dual-channel call center conversations between native Mexican Spanish speakers. These recordings cover a variety of healthcare support topics, enabling the development of speech technologies that are contextually aware and linguistically rich.

    Participant Diversity:
    Speakers: 60 verified native Mexican Spanish speakers from our contributor community.
    Regions: Diverse provinces across Mexico to ensure broad dialectal representation.
    Participant Profile: Age range of 18–70 with a gender mix of 60% male and 40% female.
    RecordingDetails:
    Conversation Nature: Naturally flowing, unscripted conversations.
    Call Duration: Each session ranges between 5 to 15 minutes.
    Audio Format: WAV format, stereo, 16-bit depth at 8kHz and 16kHz sample rates.
    Recording Environment: Captured in clear conditions without background noise or echo.

    Topic Diversity

    The dataset spans inbound and outbound calls, capturing a broad range of healthcare-specific interactions and sentiment types (positive, neutral, negative).

    Inbound Calls:
    Appointment Scheduling
    New Patient Registration
    Surgical Consultation
    Dietary Advice and Consultations
    Insurance Coverage Inquiries
    Follow-up Treatment Requests, and more
    OutboundCalls:
    Appointment Reminders
    Preventive Care Campaigns
    Test Results & Lab Reports
    Health Risk Assessment Calls
    Vaccination Updates
    Wellness Subscription Outreach, and more

    These real-world interactions help build speech models that understand healthcare domain nuances and user intent.

    Transcription

    Every audio file is accompanied by high-quality, manually created transcriptions in JSON format.

    Transcription Includes:
    Speaker-identified Dialogues
    Time-coded Segments
    Non-speech Annotations (e.g., silence, cough)
    High transcription accuracy with word error rate is below 5%, backed by dual-layer QA checks.

    Metadata

    Each conversation and speaker includes detailed metadata to support fine-tuned training and analysis.

    Participant Metadata: ID, gender, age, region, accent, and dialect.
    Conversation Metadata: Topic, sentiment, call type, sample rate, and technical specs.

    Usage and Applications

    This dataset can be used across a range of healthcare and voice AI use cases:

  16. F

    Mexican Spanish Call Center Data for Retail & E-Commerce AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FutureBee AI (2022). Mexican Spanish Call Center Data for Retail & E-Commerce AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/retail-call-center-conversation-spanish-mexico
    Explore at:
    wavAvailable download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This Mexican Spanish Call Center Speech Dataset for the Retail and E-commerce industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for Spanish speakers. Featuring over 30 hours of real-world, unscripted audio, it provides authentic human-to-human customer service conversations vital for training robust ASR models.

    Curated by FutureBeeAI, this dataset empowers voice AI developers, data scientists, and language model researchers to build high-accuracy, production-ready models across retail-focused use cases.

    Speech Data

    The dataset contains 30 hours of dual-channel call center recordings between native Mexican Spanish speakers. Captured in realistic scenarios, these conversations span diverse retail topics from product inquiries to order cancellations, providing a wide context range for model training and testing.

    Participant Diversity:
    Speakers: 60 native Mexican Spanish speakers from our verified contributor pool.
    Regions: Representing multiple provinces across Mexico to ensure coverage of various accents and dialects.
    Participant Profile: Balanced gender mix (60% male, 40% female) with age distribution from 18 to 70 years.
    Recording Details:
    Conversation Nature: Naturally flowing, unscripted interactions between agents and customers.
    Call Duration: Ranges from 5 to 15 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, at 8kHz and 16kHz sample rates.
    Recording Environment: Captured in clean conditions with no echo or background noise.

    Topic Diversity

    This speech corpus includes both inbound and outbound calls with varied conversational outcomes like positive, negative, and neutral, ensuring real-world scenario coverage.

    Inbound Calls:
    Product Inquiries
    Order Cancellations
    Refund & Exchange Requests
    Subscription Queries, and more
    Outbound Calls:
    Order Confirmations
    Upselling & Promotions
    Account Updates
    Loyalty Program Offers
    Customer Verifications, and others

    Such variety enhances your model’s ability to generalize across retail-specific voice interactions.

    Transcription

    All audio files are accompanied by manually curated, time-coded verbatim transcriptions in JSON format.

    Transcription Includes:
    Speaker-Segmented Dialogues
    30 hours-coded Segments
    Non-speech Tags (e.g., pauses, cough)
    High transcription accuracy with word error rate < 5% due to double-layered quality checks.

    These transcriptions are production-ready, making model training faster and more accurate.

    Metadata

    Rich metadata is available for each participant and conversation:

    Participant Metadata: ID, age, gender, accent, dialect, and location.
    Conversation Metadata: Topic, sentiment, call type, sample rate, and technical specs.

    This granularity supports advanced analytics, dialect filtering, and fine-tuned model evaluation.

    Usage and Applications

    This dataset is ideal for a range of voice AI and NLP applications:

    Automatic Speech Recognition (ASR): Fine-tune Spanish speech-to-text systems.
    <span

  17. 488 Hours - Spanish(Spain) Spontaneous Dialogue Telephony speech dataset

    • nexdata.ai
    • m.nexdata.ai
    Updated Jan 14, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Nexdata (2025). 488 Hours - Spanish(Spain) Spontaneous Dialogue Telephony speech dataset [Dataset]. https://www.nexdata.ai/datasets/speechrecog/1234
    Explore at:
    Dataset updated
    Jan 14, 2025
    Dataset authored and provided by
    Nexdata
    Area covered
    Spain
    Variables measured
    Format, Country, Speaker, Language, Accuracy rate, Content category, Recording device, Recording condition, Language(Region) Code, Features of annotation
    Description

    Spanish(Spain) Spontaneous Dialogue Telephony speech dataset, collected from dialogues based on given topics, covering 20+ domains. Transcribed with text content, speaker's ID, gender, age and other attributes. Our dataset was collected from extensive and diversify speakers(600 native speakers), geographicly speaking, enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring the maintenance of user privacy and legal rights throughout the data collection, storage, and usage processes, our datasets are all GDPR, CCPA, PIPL complied.

  18. m

    Call Center conversation speech datasets in Spanish European for Gas

    • data.macgence.com
    mp3
    Updated Mar 29, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Macgence (2024). Call Center conversation speech datasets in Spanish European for Gas [Dataset]. https://data.macgence.com/dataset/call-center-conversation-speech-datasets-in-spanish-european-for-gas
    Explore at:
    mp3Available download formats
    Dataset updated
    Mar 29, 2024
    Dataset authored and provided by
    Macgence
    License

    https://data.macgence.com/terms-and-conditionshttps://data.macgence.com/terms-and-conditions

    Time period covered
    2025
    Area covered
    Worldwide
    Variables measured
    Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
    Description

    The audio dataset includes Call Center conversations from Gas, featuring Spanish European speakers from Spain ,with detailed metadata.

  19. 145 Hours - Spanish(spain) Children Real-world Casual Conversation and...

    • nexdata.ai
    • m.nexdata.ai
    Updated Dec 8, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Nexdata (2023). 145 Hours - Spanish(spain) Children Real-world Casual Conversation and Monologue speech dataset [Dataset]. https://www.nexdata.ai/datasets/speechrecog/1251
    Explore at:
    Dataset updated
    Dec 8, 2023
    Dataset authored and provided by
    Nexdata
    Area covered
    Spain
    Variables measured
    Age, Format, Country, Accuracy, Language, Content category, Language(Region) Code, Recording environment, Features of annotation
    Description

    Spanish(spain) Children Real-world Casual Conversation and Monologue speech dataset, covers self-media, conversation, live, lecture, variety show and other generic domains, mirrors real-world interactions. Transcribed with text content, speaker's ID, gender, age, accent and other attributes. Our dataset was collected from extensive and diversify speakers(12 years old and younger children), geographicly speaking, enhancing model performance in real and complex tasks.Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring the maintenance of user privacy and legal rights throughout the data collection, storage, and usage processes, our datasets are all GDPR, CCPA, PIPL complied.

  20. E

    US Spanish Speecon database

    • catalogue.elra.info
    Updated Feb 22, 2007
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2007). US Spanish Speecon database [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0211/
    Explore at:
    Dataset updated
    Feb 22, 2007
    Dataset provided by
    ELRA (European Language Resources Association)
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdfhttps://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdfhttps://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Area covered
    United States
    Description

    The US Spanish Speecon database is divided into 2 sets: 1) The first set comprises the recordings of 550 adult Spanish speakers in the US (255 males, 295 females), recorded over 4 microphone channels in 4 recording environments (office, entertainment, car, public place), and consisting of about 208 hours of audio data. 2) The second set comprises the recordings of 50 child Spanish speakers in the US (28 boys, 22 girls), recorded over 4 microphone channels in 1 recording environment (children room), and consisting of about 14.7 hours of audio data. This database is partitioned into 22 DVDs (first set) and 3 DVDs (second set).The speech databases made within the Speecon project were validated by SPEX, the Netherlands, to assess their compliance with the Speecon format and content specifications.Each of the four speech channels is recorded at 16 kHz, 16 bit, uncompressed unsigned integers in Intel format (lo-hi byte order). To each signal file corresponds an ASCII SAM label file which contains the relevant descriptive information.Each speaker uttered the following items:Calibration data: 6 noise recordings The “silence word” recordingFree spontaneous items (adults only):2 minutes (session time) of free spontaneous, rich context items (story telling) (an open number of spontaneous topics out of a set of 30 topics)17 Elicited spontaneous items (adults only):3 dates, 2 times, 3 proper names, 2 city names, 1 letter sequence, 2 answers to questions, 3 telephone numbers, 1 language Read speech:30 phonetically rich sentences uttered by adults and 60 uttered by children5 phonetically rich words (adults only)4 isolated digits1 isolated digit sequence4 connected digit sequences1 telephone number3 natural numbers1 money amount2 time phrases (T1 : analogue, T2 : digital)3 dates (D1 : analogue, D2 : relative and general date, D3 : digital)3 letter sequences1 proper name2 city or street names2 questions2 special keyboard characters 1 Web address1 email address208 application specific words and phrases per session (adults)74 toy commands, 14 phone commands and 34 general commands (children)The following age distribution has been obtained: Adults: 223 speakers are between 15 and 30, 191 speakers are between 31 and 45, and 136 speakers are over 46.Children: 15 speakers are between 8 and 10, 35 speakers are between 11 and 14.A pronunciation lexicon with a phonemic transcription in SAMPA is also included.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
King County (2016). Percent Spanish Speakers [Dataset]. https://gis-kingcounty.opendata.arcgis.com/datasets/percent-spanish-speakers

Percent Spanish Speakers

Explore at:
37 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Aug 10, 2016
Dataset authored and provided by
King County
Area covered
Description

Languages:Percent Spanish Speakers: Basic demographics by census tracts in King County based on current American Community Survey 5 Year Average (ACS). Included demographics are: total population; foreign born; median household income; English language proficiency; languages spoken; race and ethnicity; sex; and age. Numbers and derived percentages are estimates based on the current year's ACS. GEO_ID_TRT is the key field and may be used to join to other demographic Census data tables.

Search
Clear search
Close search
Google apps
Main menu