27 datasets found
  1. Languages in Mexico 2020

    • statista.com
    Updated Apr 15, 2025
    Cite
    Statista (2025). Languages in Mexico 2020 [Dataset]. https://www.statista.com/statistics/275440/languages-in-mexico/
    Dataset updated
    Apr 15, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2020
    Area covered
    Mexico
    Description

    In 2020, about 93.8 percent of the Mexican population was monolingual in Spanish. Around five percent spoke a combination of Spanish and indigenous languages. Spanish is the third-most spoken native language worldwide, after Mandarin Chinese and Hindi.

    Mexican Spanish

    Spanish was first used in Mexico in the 16th century, during the Spanish colonization and the Conquest campaigns in what is now Mexico and the Caribbean. As of 2018, Mexico is the country with the largest number of native Spanish speakers worldwide. Mexican Spanish is influenced by English and Nahuatl, and has about 120 million users. The Mexican government uses Spanish in the majority of its proceedings; however, it recognizes 68 national languages, 63 of which are indigenous.

    Indigenous languages spoken

    Of the indigenous languages spoken, two of the most widely used are Nahuatl and Maya. Due to a history of marginalization of indigenous groups, most indigenous languages are endangered, and many linguists warn they might cease to be used within just a few decades. In recent years, legislative attempts such as the San Andrés Accords have been made to protect indigenous groups, who make up about 25 million of Mexico’s 125 million total inhabitants, though the efficacy of such measures remains to be seen.

  2. Speakers of indigenous languages in Mexico 2020, by language

    • ai-chatbox.pro
    • statista.com
    Updated Jun 2, 2025
    Cite
    Jose Sanchez (2025). Speakers of indigenous languages in Mexico 2020, by language [Dataset]. https://www.ai-chatbox.pro/?_=%2Fstudy%2F115828%2Fdemographics-of-mexico%2F%23XgboD02vawLZsmJjSPEePEUG%2FVFd%2Bik%3D
    Dataset updated
    Jun 2, 2025
    Dataset provided by
    Statista (http://statista.com/)
    Authors
    Jose Sanchez
    Area covered
    Mexico
    Description

    There were more than seven million speakers of indigenous languages in Mexico as of 2020. Nahuatl was the most spoken indigenous language (although it is also considered a group of languages), with more than 1.65 million speakers. The Mayan languages Tseltal and Tsotsil were each spoken by more than 550,000 people. Furthermore, about a third of all indigenous language speakers were located in just two states: Chiapas and Oaxaca.

  3. Speakers of indigenous languages Mexico 2020, by region

    • ai-chatbox.pro
    • statista.com
    Updated Sep 10, 2024
    Cite
    Statista (2024). Speakers of indigenous languages Mexico 2020, by region [Dataset]. https://www.ai-chatbox.pro/?_=%2Fstatistics%2F1323026%2Findigenous-language-speakers-by-state-mexico%2F%23XgboD02vawLbpWJjSPEePEUG%2FVFd%2Bik%3D
    Dataset updated
    Sep 10, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Mar 2, 2020 - Mar 27, 2020
    Area covered
    Mexico
    Description

    There were more than seven million speakers of indigenous languages in Mexico as of 2020. Chiapas and Oaxaca ranked as the federal entities with the largest population aged over three years who speak an indigenous language, with 1.5 and 1.2 million people respectively. Moreover, Nahuatl was the most spoken indigenous language or group of languages.

  4. Number of indigenous language speakers in Nuevo Leon 2020

    • statista.com
    Updated Jul 5, 2024
    Cite
    Statista (2024). Number of indigenous language speakers in Nuevo Leon 2020 [Dataset]. https://www.statista.com/statistics/1385793/number-indigenous-language-speakers-nuevo-leon-mexico/
    Dataset updated
    Jul 5, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2020
    Area covered
    Mexico, Nuevo Leon
    Description

    In 2020, Nahuatl emerged as the most widely spoken indigenous language among the most prominent ones in the Mexican state of Nuevo Leon, boasting 54,110 speakers. Following closely behind was Huasteco, with the substantial figure of 19,460 speakers.

  5. Number of indigenous language speakers in Durango 2020

    • statista.com
    Updated Jul 11, 2025
    Cite
    Statista (2025). Number of indigenous language speakers in Durango 2020 [Dataset]. https://www.statista.com/statistics/1388755/number-indigenous-language-speakers-durango-mexico/
    Dataset updated
    Jul 11, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2020
    Area covered
    Mexico
    Description

    In 2020, the linguistic landscape of the Mexican state of Durango exhibited a rich diversity, with Southern Tepehuán emerging as the primary indigenous language spoken by an estimated ****** individuals. Huichol and Nahuatl were also among the most spoken languages in the region.

  6. Mexican Spanish General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Mexican Spanish General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-spanish-mexico
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    Mexico
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Mexican Spanish General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Spanish speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Mexican Spanish communication.

    Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Spanish speech models that understand and respond to authentic Mexican accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Mexican Spanish. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native Mexican Spanish speakers from FutureBeeAI’s contributor community.
    Regions: Representing various provinces of Mexico to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.
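
The audio format above (stereo, 16-bit, 16 kHz) can be checked against a file's header before training. The following is a minimal sketch using Python's standard `wave` module. The helper names and the file name `demo_conversation.wav` are hypothetical, and a silent file is generated so the check can run without access to the actual dataset.

```python
import os
import tempfile
import wave

# Expected properties, taken from the "Audio Format" line in the listing.
EXPECTED = {"channels": 2, "sample_width_bytes": 2, "sample_rate_hz": 16000}


def read_wav_properties(path):
    """Return channel count, sample width, sample rate and duration of a WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_s": frames / rate,
        }


def matches_spec(props, expected=EXPECTED):
    """True when every expected property matches the file's header."""
    return all(props[key] == value for key, value in expected.items())


def write_silent_wav(path, seconds=1, channels=2, rate=16000):
    """Write a silent 16-bit WAV file (stand-in for a real recording)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * channels * rate * seconds)


# Hypothetical file name; a recording from the dataset would go here instead.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_conversation.wav")
write_silent_wav(demo_path)
props = read_wav_properties(demo_path)
print(props, matches_spec(props))
```

Running the same check over every file in a delivery is a cheap way to catch recordings that deviate from the advertised format.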

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy, achieved through double QA pass, average WER < 5%

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
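
The listing does not say how the WER figure was measured. As a minimal sketch, assuming the standard definition (substitutions plus deletions plus insertions, divided by the number of reference words), the function below computes WER from a word-level edit distance. The function name and the example sentences are illustrative and not taken from the dataset.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,  # deletion
                    curr[j - 1] + 1,  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences: one substituted word out of nine.
reference = "buenos días quiero hacer una consulta sobre mi pedido"
hypothesis = "buenos días quiero hacer una consulta de mi pedido"
wer = word_error_rate(reference, hypothesis)
print(f"WER = {wer:.3f}")
```

In practice the text is usually normalized (case, punctuation, number formatting) before scoring, and the normalization choices can shift the result noticeably.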

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple Spanish speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for Mexican Spanish.
    Voice Assistants: Build smart assistants capable of understanding natural Mexican conversations.

  7. Mexican Spanish Call Center Data for Retail & E-Commerce AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Mexican Spanish Call Center Data for Retail & E-Commerce AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/retail-call-center-conversation-spanish-mexico
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This Mexican Spanish Call Center Speech Dataset for the Retail and E-commerce industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for Spanish speakers. Featuring over 30 hours of real-world, unscripted audio, it provides authentic human-to-human customer service conversations vital for training robust ASR models.

    Curated by FutureBeeAI, this dataset empowers voice AI developers, data scientists, and language model researchers to build high-accuracy, production-ready models across retail-focused use cases.

    Speech Data

    The dataset contains 30 hours of dual-channel call center recordings between native Mexican Spanish speakers. Captured in realistic scenarios, these conversations span diverse retail topics from product inquiries to order cancellations, providing a wide context range for model training and testing.

    Participant Diversity:
    Speakers: 60 native Mexican Spanish speakers from our verified contributor pool.
    Regions: Representing multiple provinces across Mexico to ensure coverage of various accents and dialects.
    Participant Profile: Balanced gender mix (60% male, 40% female) with age distribution from 18 to 70 years.
    Recording Details:
    Conversation Nature: Naturally flowing, unscripted interactions between agents and customers.
    Call Duration: Ranges from 5 to 15 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, at 8kHz and 16kHz sample rates.
    Recording Environment: Captured in clean conditions with no echo or background noise.
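
The recordings are described as dual-channel stereo WAV files. A minimal sketch of separating the two channels for per-speaker processing follows, assuming 16-bit little-endian PCM frames such as those returned by `readframes` in Python's `wave` module. The listing does not state which channel carries the agent and which the customer, so the names `left` and `right` are deliberately neutral.

```python
import struct
import sys
from array import array


def split_stereo_16bit(frames):
    """Deinterleave 16-bit stereo PCM bytes into (left, right) sample arrays."""
    samples = array("h")  # signed 16-bit integers
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV PCM data is little-endian
    if len(samples) % 2:
        raise ValueError("stereo data must contain an even number of samples")
    # Stereo frames are interleaved: L0, R0, L1, R1, ...
    return samples[0::2], samples[1::2]


# Three illustrative stereo frames packed as little-endian 16-bit samples.
frames = struct.pack("<6h", 1, -1, 2, -2, 3, -3)
left, right = split_stereo_16bit(frames)
print(list(left), list(right))
```

Each resulting array can be written back out as a mono file or passed to a recognizer one speaker at a time.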

    Topic Diversity

    This speech corpus includes both inbound and outbound calls with varied conversational outcomes like positive, negative, and neutral, ensuring real-world scenario coverage.

    Inbound Calls:
    Product Inquiries
    Order Cancellations
    Refund & Exchange Requests
    Subscription Queries, and more
    Outbound Calls:
    Order Confirmations
    Upselling & Promotions
    Account Updates
    Loyalty Program Offers
    Customer Verifications, and others

    Such variety enhances your model’s ability to generalize across retail-specific voice interactions.

    Transcription

    All audio files are accompanied by manually curated, time-coded verbatim transcriptions in JSON format.

    Transcription Includes:
    Speaker-Segmented Dialogues
    Time-coded Segments
    Non-speech Tags (e.g., pauses, cough)
    High transcription accuracy with word error rate < 5% due to double-layered quality checks.

    These transcriptions are production-ready, making model training faster and more accurate.

    Metadata

    Rich metadata is available for each participant and conversation:

    Participant Metadata: ID, age, gender, accent, dialect, and location.
    Conversation Metadata: Topic, sentiment, call type, sample rate, and technical specs.

    This granularity supports advanced analytics, dialect filtering, and fine-tuned model evaluation.
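
To illustrate the kind of filtering this metadata allows, here is a minimal sketch. The listing names the metadata fields but not their exact keys or value formats, so the dictionary keys, the values, and the four records below are hypothetical.

```python
from collections import Counter

# Hypothetical conversation metadata records; not the vendor's actual schema.
conversations = [
    {"id": "c1", "call_type": "inbound", "sentiment": "negative",
     "sample_rate_hz": 8000, "topic": "Order Cancellations"},
    {"id": "c2", "call_type": "inbound", "sentiment": "positive",
     "sample_rate_hz": 16000, "topic": "Product Inquiries"},
    {"id": "c3", "call_type": "outbound", "sentiment": "neutral",
     "sample_rate_hz": 8000, "topic": "Order Confirmations"},
    {"id": "c4", "call_type": "inbound", "sentiment": "neutral",
     "sample_rate_hz": 8000, "topic": "Refund & Exchange Requests"},
]


def filter_conversations(records, **criteria):
    """Keep the records whose fields equal every given criterion."""
    return [
        record
        for record in records
        if all(record.get(key) == value for key, value in criteria.items())
    ]


# Select inbound calls recorded at 8 kHz and count outcomes across all calls.
inbound_8k = filter_conversations(conversations, call_type="inbound", sample_rate_hz=8000)
sentiment_counts = Counter(record["sentiment"] for record in conversations)
print([record["id"] for record in inbound_8k], dict(sentiment_counts))
```

The same pattern extends to participant metadata, for example selecting recordings by accent or age band before building an evaluation set.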

    Usage and Applications

    This dataset is ideal for a range of voice AI and NLP applications:

    Automatic Speech Recognition (ASR): Fine-tune Spanish speech-to-text systems.

  8. Number of native Spanish speakers worldwide 2024, by country

    • statista.com
    Updated Jan 15, 2025
    Cite
    Statista (2025). Number of native Spanish speakers worldwide 2024, by country [Dataset]. https://www.statista.com/statistics/991020/number-native-spanish-speakers-country-worldwide/
    Dataset updated
    Jan 15, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    World
    Description

    Mexico is the country with the largest number of native Spanish speakers in the world. As of 2024, 132.5 million people in Mexico spoke Spanish with a native command of the language. Colombia was the nation with the second-highest number of native Spanish speakers, at around 52.7 million. Spain came in third, with 48 million, and Argentina fourth, with 46 million.

    Spanish, a world language

    As of 2023, Spanish ranked as the fourth most spoken language in the world, behind only English, Chinese, and Hindi, with over half a billion speakers. Spanish is the official language of over 20 countries, most of them on the American continent; it is also one of the official languages of Equatorial Guinea in Africa. Spanish also has a strong presence in other countries, such as the United States, Morocco, and Brazil, which are among the non-Hispanic countries with the highest number of Spanish speakers.

    The second most spoken language in the U.S.

    In the most recent data, Spanish ranked as the language other than English with the highest number of speakers in the U.S., with 12 times as many speakers as the second-place language. This comes as no surprise given the long history of migration from Latin American countries to the United States. Moreover, in fiscal year 2022 alone, 5 of the top 10 countries of origin of naturalized people in the U.S. were Spanish-speaking countries.

  9. Mexican Spanish Call Center Data for Realestate AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Mexican Spanish Call Center Data for Realestate AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/realestate-call-center-conversation-spanish-mexico
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    Mexico
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This Mexican Spanish Call Center Speech Dataset for the Real Estate industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for Spanish-speaking Real Estate customers. With over 30 hours of unscripted, real-world audio, this dataset captures authentic conversations between customers and real estate agents, ideal for building robust ASR models.

    Curated by FutureBeeAI, this dataset equips voice AI developers, real estate tech platforms, and NLP researchers with the data needed to create high-accuracy, production-ready models for property-focused use cases.

    Speech Data

    The dataset features 30 hours of dual-channel call center recordings between native Mexican Spanish speakers. Captured in realistic real estate consultation and support contexts, these conversations span a wide array of property-related topics, from inquiries to investment advice, offering deep domain coverage for AI model development.

    Participant Diversity:
    Speakers: 60 native Mexican Spanish speakers from our verified contributor community.
    Regions: Representing different provinces across Mexico to ensure accent and dialect variation.
    Participant Profile: Balanced gender mix (60% male, 40% female) and age range from 18 to 70.
    Recording Details:
    Conversation Nature: Naturally flowing, unscripted agent-customer discussions.
    Call Duration: Average 5–15 minutes per call.
    Audio Format: Stereo WAV, 16-bit, recorded at 8kHz and 16kHz.
    Recording Environment: Captured in noise-free and echo-free conditions.

    Topic Diversity

    This speech corpus includes both inbound and outbound calls, featuring positive, neutral, and negative outcomes across a wide range of real estate scenarios.

    Inbound Calls:
    Property Inquiries
    Rental Availability
    Renovation Consultation
    Property Features & Amenities
    Investment Property Evaluation
    Ownership History & Legal Info, and more
    Outbound Calls:
    New Listing Notifications
    Post-Purchase Follow-ups
    Property Recommendations
    Value Updates
    Customer Satisfaction Surveys, and others

    Such domain-rich variety ensures model generalization across common real estate support conversations.

    Transcription

    All recordings are accompanied by precise, manually verified transcriptions in JSON format.

    Transcription Includes:
    Speaker-Segmented Dialogues
    Time-coded Segments
    Non-speech Tags (e.g., background noise, pauses)
    High transcription accuracy with word error rate below 5% via dual-layer human review.

    These transcriptions streamline ASR and NLP development for Spanish real estate voice applications.

    Metadata

    Detailed metadata accompanies each participant and conversation:

    Participant Metadata: ID, age, gender, location, accent, and dialect.
    Conversation Metadata: Topic, call type, sentiment, sample rate, and technical details.

    This enables smart filtering, dialect-focused model training, and structured dataset exploration.

    Usage and Applications

    This dataset is ideal for voice AI and NLP systems built for the real estate sector:


  10. Number of indigenous language speakers in Sonora 2020

    • statista.com
    Updated Jul 10, 2025
    Cite
    Statista (2025). Number of indigenous language speakers in Sonora 2020 [Dataset]. https://www.statista.com/statistics/1388253/number-indigenous-language-speakers-sonora-mexico/
    Dataset updated
    Jul 10, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2020
    Area covered
    Mexico
    Description

    In 2020, Mayo was the most widely spoken indigenous language in the Mexican state of Sonora, with approximately ****** speakers. Not far behind was Yaqui, with ****** speakers.

  11. Mexican Spanish Call Center Data for Telecom AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Mexican Spanish Call Center Data for Telecom AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/telecom-call-center-conversation-spanish-mexico
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This Mexican Spanish Call Center Speech Dataset for the Telecom industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for Spanish-speaking telecom customers. Featuring over 30 hours of real-world, unscripted audio, it delivers authentic customer-agent interactions across key telecom support scenarios to help train robust ASR models.

    Curated by FutureBeeAI, this dataset empowers voice AI engineers, telecom automation teams, and NLP researchers to build high-accuracy, production-ready models for telecom-specific use cases.

    Speech Data

    The dataset contains 30 hours of dual-channel call center recordings between native Mexican Spanish speakers. Captured in realistic customer support settings, these conversations span a wide range of telecom topics from network complaints to billing issues, offering a strong foundation for training and evaluating telecom voice AI solutions.

    Participant Diversity:
    Speakers: 60 native Mexican Spanish speakers from our verified contributor pool.
    Regions: Representing multiple provinces across Mexico to ensure coverage of various accents and dialects.
    Participant Profile: Balanced gender mix (60% male, 40% female) with age distribution from 18 to 70 years.
    Recording Details:
    Conversation Nature: Naturally flowing, unscripted interactions between agents and customers.
    Call Duration: Ranges from 5 to 15 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, at 8kHz and 16kHz sample rates.
    Recording Environment: Captured in clean conditions with no echo or background noise.
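
The figures above imply rough bounds on the corpus that are useful when planning storage and data splits. The sketch below works them out, assuming the total is exactly 30 hours, that every call falls within the stated 5 to 15 minute range, and that the audio is uncompressed 16-bit stereo PCM. The listing does not say whether every call is provided at both sample rates, so the two sizes are computed separately.

```python
TOTAL_HOURS = 30
MIN_CALL_MINUTES = 5
MAX_CALL_MINUTES = 15


def call_count_bounds(total_hours, min_call_minutes, max_call_minutes):
    """Fewest and most calls that fit into the total duration."""
    total_minutes = total_hours * 60
    fewest = total_minutes // max_call_minutes  # every call at maximum length
    most = total_minutes // min_call_minutes  # every call at minimum length
    return fewest, most


def pcm_size_bytes(hours, sample_rate_hz, channels=2, sample_width_bytes=2):
    """Uncompressed PCM payload size, ignoring WAV header overhead."""
    return hours * 3600 * sample_rate_hz * channels * sample_width_bytes


fewest, most = call_count_bounds(TOTAL_HOURS, MIN_CALL_MINUTES, MAX_CALL_MINUTES)
size_8k = pcm_size_bytes(TOTAL_HOURS, 8000)
size_16k = pcm_size_bytes(TOTAL_HOURS, 16000)
print(f"between {fewest} and {most} calls")
print(f"{size_8k / 1e9:.3f} GB at 8 kHz, {size_16k / 1e9:.3f} GB at 16 kHz")
```

Under these assumptions the corpus holds between 120 and 360 calls and occupies roughly 3.5 GB at 8 kHz or 6.9 GB at 16 kHz.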

    Topic Diversity

    This speech corpus includes both inbound and outbound calls with varied conversational outcomes like positive, negative, and neutral ensuring broad scenario coverage for telecom AI development.

    Inbound Calls:
    Phone Number Porting
    Network Connectivity Issues
    Billing and Payments
    Technical Support
    Service Activation
    International Roaming Enquiry
    Refund Requests and Billing Adjustments
    Emergency Service Access, and others
    Outbound Calls:
    Welcome Calls & Onboarding
    Payment Reminders
    Customer Satisfaction Surveys
    Technical Updates
    Service Usage Reviews
    Network Complaint Status Calls, and more

    This variety helps train telecom-specific models to manage real-world customer interactions and understand context-specific voice patterns.

    Transcription

    All audio files are accompanied by manually curated, time-coded verbatim transcriptions in JSON format.

    Transcription Includes:
    Speaker-Segmented Dialogues
    Time-coded Segments
    Non-speech Tags (e.g., pauses, coughs)
    High transcription accuracy with word error rate < 5% thanks to dual-layered quality checks.

    These transcriptions are production-ready, allowing for faster development of ASR and conversational AI systems in the Telecom domain.

    Metadata

    Rich metadata is available for each participant and conversation:

    Participant Metadata: ID, age, gender, accent, dialect, and location.

  12. Mexican Spanish Call Center Data for Delivery & Logistics AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Mexican Spanish Call Center Data for Delivery & Logistics AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/delivery-call-center-conversation-spanish-mexico
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This Mexican Spanish Call Center Speech Dataset for the Delivery and Logistics industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for Spanish-speaking customers. With over 30 hours of real-world, unscripted call center audio, this dataset captures authentic delivery-related conversations essential for training high-performance ASR models.

    Curated by FutureBeeAI, this dataset empowers AI teams, logistics tech providers, and NLP researchers to build accurate, production-ready models for customer support automation in delivery and logistics.

    Speech Data

    The dataset contains 30 hours of dual-channel call center recordings between native Mexican Spanish speakers. Captured across various delivery and logistics service scenarios, these conversations cover everything from order tracking to missed delivery resolutions offering a rich, real-world training base for AI models.

    Participant Diversity:
    Speakers: 60 native Mexican Spanish speakers from our verified contributor pool.
    Regions: Multiple provinces of Mexico for accent and dialect diversity.
    Participant Profile: Balanced gender distribution (60% male, 40% female) with ages ranging from 18 to 70.
    Recording Details:
    Conversation Nature: Naturally flowing, unscripted customer-agent dialogues.
    Call Duration: 5 to 15 minutes on average.
    Audio Format: Stereo WAV, 16-bit depth, recorded at 8kHz and 16kHz.
    Recording Environment: Captured in clean, noise-free, echo-free conditions.

    Topic Diversity

    This speech corpus includes both inbound and outbound delivery-related conversations, covering varied outcomes (positive, negative, neutral) to train adaptable voice models.

    Inbound Calls:
    Order Tracking
    Delivery Complaints
    Undeliverable Addresses
    Return Process Enquiries
    Delivery Method Selection
    Order Modifications, and more
    Outbound Calls:
    Delivery Confirmations
    Subscription Offer Calls
    Incorrect Address Follow-ups
    Missed Delivery Notifications
    Delivery Feedback Surveys
    Out-of-Stock Alerts, and others

    This comprehensive coverage reflects real-world logistics workflows, helping voice AI systems interpret context and intent with precision.

    Transcription

    All recordings come with high-quality, human-generated verbatim transcriptions in JSON format.

    Transcription Includes:
    Speaker-Segmented Dialogues
    Time-coded Segments
    Non-speech Tags (e.g., pauses, noise)
    High transcription accuracy with word error rate under 5% via dual-layer quality checks.

    These transcriptions support fast, reliable model development for Spanish voice AI applications in the delivery sector.

    Metadata

    Detailed metadata is included for each participant and conversation:

    Participant Metadata: ID, age, gender, region, accent, dialect.
    Conversation Metadata: Topic, call type, sentiment, sample rate, and technical attributes.

    This metadata aids in training specialized models, filtering demographics, and running advanced analytics.

    Usage and Applications


  13. Ethnobotanical Research and Language Documentation of Nahuatl

    • abacus.library.ubc.ca
    txt
    Updated Jul 24, 2023
    Cite
    Abacus Data Network (2023). Ethnobotanical Research and Language Documentation of Nahuatl [Dataset]. https://abacus.library.ubc.ca/dataset.xhtml;jsessionid=97387341fcdaf61438a34c572d6d?persistentId=hdl%3A11272.1%2FAB2%2FEEHKAK&version=&q=&fileTypeGroupFacet=%22Text%22&fileAccess=Restricted
    Available download formats: txt (3132)
    Dataset updated
    Jul 24, 2023
    Dataset provided by
    Abacus Data Network
    Description

    Abstract

    Introduction

    Ethnobotanical Research and Language Documentation of Nahuatl consists of approximately 190 hours of field recordings collected in the Sierra Nororiental and Sierra Norte regions of Puebla, Mexico. The corpus contains audio and video recordings of native Nahuatl speakers during the collection of particular plants; partial transcripts (Nahuatl and Spanish); a Highland Puebla Nahuat dictionary; botanical and ethnobotanical data; and speaker metadata. Nahuatl is one of the most widely spoken indigenous languages in the Americas with approximately 1.5 million speakers in Mexico. Many distinct and sometimes mutually intelligible varieties have been recognized. The recordings in this release were collected between 2008 and 2019 in two different municipalities: Cuetzalan del Progreso and Tepetzintla. Speech from Cuetzalan represents Highland Puebla Nahuat, and speech from Tepetzintla represents Zacatlán-Ahuacatlán-Tepetzintla Nahuatl.

    Data

    The recordings consist of a speaker talking about a plant's nomenclature, classification, and use. Audio files are primarily single-channel 48kHz, 16-bit wav. Some data is also presented as mp3. Video files are presented as mp4. Transcripts are included for the Cuetzalan recordings in Transcriber format. These transcripts have been partially translated into Spanish using ELAN. A Highland Puebla Nahuat dictionary is included in both text and Toolbox XML formats. Botanical and ethnobotanical information is presented as a collection of pdfs, and images as jpegs. Further information about the corpus is available in the included documentation. Note that some folders are empty and are planned to be used in future work.

  14. Number of indigenous language speakers in Chihuahua 2020

    • statista.com
    Updated Jul 9, 2025
    Cite
    Statista (2025). Number of indigenous language speakers in Chihuahua 2020 [Dataset]. https://www.statista.com/statistics/1388524/number-indigenous-language-speakers-chihuahua-mexico/
    Explore at:
    Dataset updated
    Jul 9, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2020
    Area covered
    Mexico
    Description

    In the year 2020, the Mexican state of Chihuahua showcased a rich linguistic variety, with Tarahumara being the most predominant indigenous language, spoken by approximately ****** individuals. Following closely behind was Northern Tepehuán, recording over 10,000 speakers.

  15. Mexican Spanish Call Center Data for BFSI AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Mexican Spanish Call Center Data for BFSI AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/bfsi-call-center-conversation-spanish-mexico
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    Mexico
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This Mexican Spanish Call Center Speech Dataset for the BFSI (Banking, Financial Services, and Insurance) sector is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for Spanish-speaking customers. Featuring over 30 hours of real-world, unscripted audio, it offers authentic customer-agent interactions across a range of BFSI services to train robust and domain-aware ASR models.

    Curated by FutureBeeAI, this dataset empowers voice AI developers, financial technology teams, and NLP researchers to build high-accuracy, production-ready models across BFSI customer service scenarios.

    Speech Data

    The dataset contains 30 hours of dual-channel call center recordings between native Mexican Spanish speakers. Captured in realistic financial support settings, these conversations span diverse BFSI topics from loan enquiries and card disputes to insurance claims and investment options, providing deep contextual coverage for model training and evaluation.

    Participant Diversity:
    Speakers: 60 native Mexican Spanish speakers from our verified contributor pool.
    Regions: Representing multiple states across Mexico to ensure coverage of various accents and dialects.
    Participant Profile: Gender mix of 60% male and 40% female, with ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Nature: Naturally flowing, unscripted interactions between agents and customers.
    Call Duration: Ranges from 5 to 15 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, at 8kHz and 16kHz sample rates.
    Recording Environment: Captured in clean conditions with no echo or background noise.

    Topic Diversity

    This speech corpus includes both inbound and outbound calls with varied conversational outcomes like positive, negative, and neutral, ensuring real-world BFSI voice coverage.

    Inbound Calls:
    Debit Card Block Request
    Transaction Disputes
    Loan Enquiries
    Credit Card Billing Issues
    Account Closure & Claims
    Policy Renewals & Cancellations
    Retirement & Tax Planning
    Investment Risk Queries, and more
    Outbound Calls:
    Loan & Credit Card Offers
    Customer Surveys
    EMI Reminders
    Policy Upgrades
    Insurance Follow-ups
    Investment Opportunity Calls
    Retirement Planning Reviews, and more

    This variety ensures models trained on the dataset are equipped to handle complex financial dialogues with contextual accuracy.

    Transcription

    All audio files are accompanied by manually curated, time-coded verbatim transcriptions in JSON format.

    Transcription Includes:
    Speaker-Segmented Dialogues
    Time-coded Segments
    Non-speech Tags (e.g., pauses, background noise)
    High transcription accuracy with word error rate < 5% due to double-layered quality checks.

    These transcriptions are production-ready, making financial domain model training faster and more accurate.

    Metadata

    Rich metadata is available for each participant and conversation:

    Participant Metadata: ID, age, gender,

  16. Number of indigenous language speakers in Jalisco 2020

    • statista.com
    Updated Jul 9, 2025
    Cite
    Statista (2025). Number of indigenous language speakers in Jalisco 2020 [Dataset]. https://www.statista.com/statistics/1384856/number-indigenous-language-speakers-jalisco-mexico/
    Dataset updated
    Jul 9, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2020
    Area covered
    Jalisco, Mexico
    Description

    In 2020, the most widely spoken indigenous languages in the Mexican state of Jalisco were Huichol and Nahuatl, with ****** and ****** speakers, respectively.

  17. Speech To Speech Translation Market Analysis North America, Europe, APAC,...

    • technavio.com
    Updated Nov 11, 2024
    Cite
    Technavio (2024). Speech To Speech Translation Market Analysis North America, Europe, APAC, South America, Middle East and Africa - US, India, China, UK, Germany, Japan, France, Canada, Italy, Mexico - Size and Forecast 2024-2028 [Dataset]. https://www.technavio.com/report/speech-to-speech-translation-market-industry-analysis
    Dataset updated
    Nov 11, 2024
    Dataset provided by
    TechNavio
    Authors
    Technavio
    Time period covered
    2021 - 2025
    Area covered
    Canada, Germany, United Kingdom, France, United States, Mexico, Global
    Description

    Speech To Speech Translation Market Size 2024-2028

    The speech to speech translation market size is forecast to increase by USD 217.2 million at a CAGR of 8.9% between 2023 and 2028.

    Speech-to-speech translation is a rapidly evolving market, driven by the increasing demand for multilingual communications in various sectors. Contact centers and military organizations are major contributors to this growth, as they require real-time translation services for effective communication. The adoption of AI-enhanced speech-to-speech translation technology is on the rise, with software solutions gaining popularity in mobile applications and web applications. Machine translation (MT) and speech recognition technologies form the foundation of S2S translation systems. The future of S2S translation lies in continued advancements in machine translation, speech recognition, and deep learning technologies. The electronics industry and business tours also present significant opportunities for this market. However, challenges such as varying internet costs and coverage globally, as well as the need for high voice recognition accuracy, remain key considerations. One Hour Translation (OHT) and virtual assistants like Google Assistant are leading the way in providing quick and accurate translation solutions. Overall, the market is expected to grow steadily due to the increasing need for seamless multilingual communication in various industries.
    

    What will be the Size of the Speech To Speech Translation Market During the Forecast Period?

    The speech-to-speech (S2S) translation market is witnessing significant growth due to the increasing need for effective cross-lingual communication in various sectors. This market caters to both business-to-business (B2B) and business-to-consumer (B2C) applications, encompassing international business, tourism, and various industries with language differences. Language barriers pose a significant challenge in international business, leading to communication gaps that can hinder business reach and intercultural exchange. Speech-to-speech translation technology addresses this issue by enabling real-time translation between spoken languages, making it an essential tool for global enterprises. In the travel and tourism industry, S2S translation plays a crucial role in enhancing the travel experience for tourists.
    With the rise in international tourism and smartphone ownership, there is a growing demand for portable translation devices that can provide instant translations, making travel more accessible and convenient. The healthcare sector also benefits significantly from S2S translation technology. Effective communication between medical professionals and patients with language barriers is essential for accurate diagnoses and treatments. S2S translation technology ensures that medical consultations are efficient and effective, improving patient care and outcomes. Speech synthesis, a critical component of S2S translation, converts text into spoken words, enhancing accessibility for individuals with visual impairments. This technology has a wide range of applications, including education, entertainment, and daily life assistance.
    Deep network architecture and advanced algorithms enable high inference speed, ensuring real-time translations that are accurate and efficient. The B2B market for S2S translation is driven by the need for seamless communication between international businesses and their partners. This technology facilitates cross-border transactions, negotiations, and collaborations, reducing the communication gap and increasing business opportunities. In summary, the speech-to-speech translation market is a growing sector that addresses the communication challenges posed by language differences in various industries. This technology bridges the gap between different cultures and languages, enhancing intercultural exchange, improving business efficiency, and providing better accessibility for individuals with visual impairments.
    

    How is this Speech To Speech Translation Industry segmented and which is the largest segment?

    The speech to speech translation industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2024-2028, as well as historical data from 2018-2022 for the following segments.

    Type
      Hardware
      Software
    Geography
      North America
        Canada
        Mexico
        US
      Europe
        Germany
        UK
        France
        Italy
      APAC
        China
        India
        Japan
      South America
      Middle East and Africa

    By Type Insights

    The hardware segment is estimated to witness significant growth during the forecast period.
    

    Speech-to-speech translation technology enables users to convert spoken words into text on their screens, streamlining communication. This technology combines speech

  18. Mexican Spanish Call Center Data for Healthcare AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Mexican Spanish Call Center Data for Healthcare AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/healthcare-call-center-conversation-spanish-mexico
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    Mexico
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This Mexican Spanish Call Center Speech Dataset for the healthcare industry is purpose-built to accelerate the development of Spanish speech recognition, spoken language understanding, and conversational AI systems. With 30 hours of unscripted, real-world conversations, it delivers the linguistic and contextual depth needed to build high-performance ASR models for medical and wellness-related customer service.

    Created by FutureBeeAI, this dataset empowers voice AI teams, NLP researchers, and data scientists to develop domain-specific models for hospitals, clinics, insurance providers, and telemedicine platforms.

    Speech Data

    The dataset features 30 hours of dual-channel call center conversations between native Mexican Spanish speakers. These recordings cover a variety of healthcare support topics, enabling the development of speech technologies that are contextually aware and linguistically rich.

    Participant Diversity:
    Speakers: 60 verified native Mexican Spanish speakers from our contributor community.
    Regions: Diverse states across Mexico to ensure broad dialectal representation.
    Participant Profile: Age range of 18–70 with a gender mix of 60% male and 40% female.
    Recording Details:
    Conversation Nature: Naturally flowing, unscripted conversations.
    Call Duration: Each session ranges from 5 to 15 minutes.
    Audio Format: WAV format, stereo, 16-bit depth at 8kHz and 16kHz sample rates.
    Recording Environment: Captured in clear conditions without background noise or echo.

    Topic Diversity

    The dataset spans inbound and outbound calls, capturing a broad range of healthcare-specific interactions and sentiment types (positive, neutral, negative).

    Inbound Calls:
    Appointment Scheduling
    New Patient Registration
    Surgical Consultation
    Dietary Advice and Consultations
    Insurance Coverage Inquiries
    Follow-up Treatment Requests, and more
    Outbound Calls:
    Appointment Reminders
    Preventive Care Campaigns
    Test Results & Lab Reports
    Health Risk Assessment Calls
    Vaccination Updates
    Wellness Subscription Outreach, and more

    These real-world interactions help build speech models that understand healthcare domain nuances and user intent.

    Transcription

    Every audio file is accompanied by high-quality, manually created transcriptions in JSON format.

    Transcription Includes:
    Speaker-identified Dialogues
    Time-coded Segments
    Non-speech Annotations (e.g., silence, cough)
    High transcription accuracy, with a word error rate below 5%, backed by dual-layer QA checks.

    Metadata

    Each conversation and speaker includes detailed metadata to support fine-tuned training and analysis.

    Participant Metadata: ID, gender, age, region, accent, and dialect.
    Conversation Metadata: Topic, sentiment, call type, sample rate, and technical specs.

    Usage and Applications

    This dataset can be used across a range of healthcare and voice AI use cases:

  19. Mexico: main offshore support languages 2020

    • statista.com
    Updated Nov 28, 2023
    Cite
    Statista (2023). Mexico: main offshore support languages 2020 [Dataset]. https://www.statista.com/statistics/1221566/mexico-offshore-services-support-languages/
    Dataset updated
    Nov 28, 2023
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2020
    Area covered
    Mexico
    Description

    In 2020, the most common language of offshore support in Mexico was Spanish. According to a survey, approximately 92 percent of companies stated that they offered support in this language, while 84 percent reported providing assistance in English. That year, Latin America and the Caribbean was the main recipient region of offshore services from companies in Mexico.

  20. Number of indigenous language speakers in Puebla 2020

    • statista.com
    Updated Jul 9, 2025
    Cite
    Statista (2025). Number of indigenous language speakers in Puebla 2020 [Dataset]. https://www.statista.com/statistics/1387958/number-indigenous-language-speakers-puebla-mexico/
    Dataset updated
    Jul 9, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2020
    Area covered
    Puebla, Mexico
    Description

    In the year 2020, the Mexican state of Puebla showcased a diverse linguistic landscape, with Nahuatl standing out as the predominant indigenous language, spoken by more than ******* individuals. Following closely behind was Totonac, which boasted a substantial number of ******* speakers.
