In 2020, about 93.8 percent of the Mexican population was monolingual in Spanish. Around five percent spoke a combination of Spanish and indigenous languages. Spanish is the third-most spoken native language worldwide, after Mandarin Chinese and Hindi.
Mexican Spanish
Spanish was first used in Mexico in the 16th century, at the time of Spanish colonization during the Conquest campaigns of what is now Mexico and the Caribbean. As of 2018, Mexico is the country with the largest number of native Spanish speakers worldwide. Mexican Spanish is influenced by English and Nahuatl, and has about 120 million users. The Mexican government uses Spanish in the majority of its proceedings; however, it recognizes 68 national languages, 63 of which are indigenous.
Indigenous languages spoken
Of the indigenous languages spoken, two of the most widely used are Nahuatl and Maya. Due to a history of marginalization of indigenous groups, most indigenous languages are endangered, and many linguists warn they might cease to be used within just a few decades. In recent years, legislative attempts such as the San Andrés Accords have been made to protect indigenous groups, who make up about 25 million of Mexico’s 125 million total inhabitants, though the efficacy of such measures is yet to be seen.
There were more than ************* speakers of indigenous languages in Mexico as of 2020. Nahuatl was the most spoken indigenous language (although it is also considered a group of languages), with more than **** million speakers. Both the Mayan languages Tseltal and Tsotsil were spoken by over ******* persons. Furthermore, about ******* of all the indigenous language speakers were located in just two states: Chiapas and Oaxaca.
There were more than seven million speakers of indigenous languages in Mexico as of 2020. Chiapas and Oaxaca ranked as the federal entities with the largest population aged over three years who speak an indigenous language, with 1.5 and 1.2 million people respectively. Moreover, Nahuatl was the most spoken indigenous language or group of languages.
In 2020, Mazahua was the most widely spoken of the prominent indigenous languages in Mexico State, with over 111,000 speakers. Not far behind was Otomi, with 102,600 speakers.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Spanish (Mexico) Language Dataset: a high-quality Spanish (Mexico) TTS dataset for AI and speech models. Dataset type: TTS. Description: Single-utterance recordings, which tend to fall in…
In 2020, the Mexican state of Guerrero exhibited a rich variety of indigenous languages. Among these, Nahuatl emerged as the predominant language, spoken by an estimated 157,740 individuals. Additionally, the presence of Mixtec and Tlapanec languages made a significant impact.
Comprehensive dataset of 2 Chinese language schools in State of Mexico, Mexico as of July, 2025. Includes verified contact information (email, phone), geocoded addresses, customer ratings, reviews, business categories, and operational details. Perfect for market research, lead generation, competitive analysis, and business intelligence. Download a complimentary sample to evaluate data quality and completeness.
In 2020, Nahuatl emerged as the most widely spoken indigenous language among the most prominent ones in the Mexican state of Nuevo Leon, boasting 54,110 speakers. Following closely behind was Huasteco, with the substantial figure of 19,460 speakers.
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
The SALA II Spanish from Mexico database collected in Mexico was recorded within the scope of the SALA II project. The database contains the recordings of 1,075 Mexican speakers (539 males and 536 females) recorded over the Mexican mobile telephone network.
The following acoustic conditions were selected as representative of a mobile user's environment:
* Passenger in moving car, railway, bus, etc. (155 speakers)
* Public place (279 speakers)
* Stationary pedestrian by road side (223 speakers)
* Home/office environment (364 speakers)
* Passenger in moving car using a hands-free kit (54 speakers)
This database is distributed as 1 DVD-ROM. The speech files are stored as sequences of 8-bit, 8 kHz A-law speech files and are not compressed, according to the specifications of SALA II. Each prompt utterance is stored within a separate file and has an accompanying ASCII SAM label file. This speech database was validated by SPEX (the Netherlands) to assess its compliance with the SALA II format and content specifications.
Each speaker uttered the following items:
* 6 application words
* 1 sequence of 10 isolated digits
* 4 connected digits (1 sheet number -6 digits, 1 telephone number -9/11 digits, 1 credit card number -14/16 digits, 1 PIN code -6 digits)
* 3 dates (1 spontaneous date e.g. birthday, 1 word style prompted date, 1 relative and general date expression)
* 2 spotting phrases using an embedded application word
* 2 isolated digits
* 3 spelled words (1 surname, 1 directory assistance city name, 1 real/artificial name for coverage)
* 1 currency money amount
* 1 natural number
* 5 directory assistance names (1 surname out of a set of 500, 1 city of birth/growing up, 1 most frequent city out of a set of 500, 1 most frequent company/agency out of a set of 500, 1 "forename surname" out of a set of 150)
* 2 yes/no questions (1 predominantly "yes" question, 1 predominantly "no" question)
* 9 phonetically rich sentences
* 2 time phrases (1 spontaneous time of day, 1 word style time phrase)
* 4 phonetically rich words
The following age distribution has been obtained: 7 speakers are under 16, 643 speakers are between 16 and 30, 248 speakers are between 31 and 45, 169 speakers are between 46 and 60, and 8 speakers are over 60.
A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
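The 8-bit, 8 kHz A-law storage described above can be expanded to 16-bit linear PCM with the standard G.711 A-law decoding rule. The following is a minimal Python sketch, not part of the SALA II distribution. It assumes each speech file is a headerless (raw) stream of A-law bytes, which the description does not state explicitly; the function names are illustrative.

```python
SAMPLE_RATE_HZ = 8000  # sampling rate stated in the database description


def alaw_to_pcm16(code: int) -> int:
    """Expand one 8-bit G.711 A-law code to a signed 16-bit linear sample."""
    code ^= 0x55                      # undo the even-bit inversion used by A-law
    mantissa = (code & 0x0F) << 4     # 4-bit quantization step within the segment
    segment = (code & 0x70) >> 4      # 3-bit segment (exponent)
    if segment == 0:
        magnitude = mantissa + 8
    else:
        magnitude = (mantissa + 0x108) << (segment - 1)
    # In A-law, a set sign bit marks a positive sample.
    return magnitude if code & 0x80 else -magnitude


def decode_alaw(data: bytes) -> list[int]:
    """Decode a raw A-law byte stream (assumed headerless) to PCM samples."""
    return [alaw_to_pcm16(b) for b in data]


def duration_seconds(data: bytes) -> float:
    """Duration of a raw A-law stream: one byte per sample at 8 kHz."""
    return len(data) / SAMPLE_RATE_HZ
```

A file could then be decoded with `decode_alaw(open(path, "rb").read())`, where `path` is whatever name the distribution uses for a speech file. If the files turn out to carry a header, it would need to be skipped before decoding.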
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Mexican Spanish General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Spanish speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Mexican Spanish communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Spanish speech models that understand and respond to authentic Mexican accents and dialects.
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Mexican Spanish. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
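As a rough illustration of how the paired audio and JSON transcriptions might be fed into a pipeline, here is a minimal Python sketch. The directory layout, the `.wav` extension, and the convention that a transcription shares its audio file's base name are all assumptions for illustration; the dataset's actual file naming and JSON schema are not given here.

```python
import json
from pathlib import Path


def pair_audio_with_transcripts(audio_dir, transcript_dir, audio_ext=".wav"):
    """Return (audio_path, transcript) pairs matched by file stem.

    Assumes (hypothetically) that 'abc.wav' is transcribed in 'abc.json'.
    Audio files without a matching JSON file are skipped.
    """
    audio_dir = Path(audio_dir)
    transcript_dir = Path(transcript_dir)
    pairs = []
    for audio_path in sorted(audio_dir.glob(f"*{audio_ext}")):
        json_path = transcript_dir / f"{audio_path.stem}.json"
        if not json_path.is_file():
            continue  # no transcription found for this recording
        with json_path.open(encoding="utf-8") as handle:
            pairs.append((audio_path, json.load(handle)))
    return pairs
```

The transcript is returned as parsed JSON without inspecting its fields, so the sketch stays valid whatever schema the files actually use.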
The dataset comes with granular metadata for both speakers and recordings:
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
This dataset is a versatile resource for multiple Spanish speech and language AI applications:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Language is the human universal mode of communication, and is dynamic and constantly in flux accommodating user needs as individuals interface with a changing world. However, we know surprisingly little about how language responds to market integration, a pressing force affecting indigenous communities worldwide today. While models of culture change often emphasize the replacement of one language, trait, or phenomenon with another following socioeconomic transitions, we present a more nuanced framework. We use demographic, economic, linguistic, and social network data from a rural Maya community that spans a 27-year period and the transition to market integration. By adopting this multivariate approach for the acquisition and use of languages, we find that while the number of bilingual speakers has significantly increased over time, bilingualism appears stable rather than transitionary. We provide evidence that when indigenous and majority languages provide complementary social and economic payoffs, both can be maintained. Our results predict the circumstances under which indigenous language use may be sustained or at risk. More broadly, the results point to the evolutionary dynamics that shaped the current distribution of the world’s linguistic diversity.
Comprehensive dataset of 2,719 English language schools in Mexico as of July, 2025. Includes verified contact information (email, phone), geocoded addresses, customer ratings, reviews, business categories, and operational details. Perfect for market research, lead generation, competitive analysis, and business intelligence. Download a complimentary sample to evaluate data quality and completeness.
In the year 2020, the linguistic diversity within the Mexican state of Sonora was mostly dominated by Mayo emerging as the primary indigenous language, spoken by approximately ****** individuals. Not far behind was Yaqui, with the significant figure of ****** speakers.
Comprehensive dataset of 32 English language camps in Mexico as of June, 2025. Includes verified contact information (email, phone), geocoded addresses, customer ratings, reviews, business categories, and operational details. Perfect for market research, lead generation, competitive analysis, and business intelligence. Download a complimentary sample to evaluate data quality and completeness.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This dataset tracks annual reading and language arts proficiency from 2011 to 2022 for Mexico Elementary School vs. New York and Mexico Central School District.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This dataset is about books. It has 1 row and is filtered where the book is An areal-typological study of American Indian languages north of Mexico. It features 7 columns including author, publication date, language, and book publisher.
This dataset provides information on 61 in Tamaulipas, Mexico as of June, 2025. It includes details such as email addresses (where publicly available), phone numbers (where publicly available), and geocoded addresses. Explore market trends, identify potential business partners, and gain valuable insights into the industry. Download a complimentary sample of 10 records to see what's included.
Our Spanish language datasets are carefully compiled and annotated by language and linguistic experts; you can find them available for licensing:
Key Features (approximate numbers):
Our Spanish monolingual data reliably offers clear definitions and examples, a large volume of headwords, and comprehensive coverage of the Spanish language.
The bilingual data provides translations in both directions, from English to Spanish and from Spanish to English. It is annually reviewed and updated by our in-house team of language experts. Offers significant coverage of the language, providing a large volume of translated words of excellent quality.
Spanish sentences retrieved from the corpus are ideal for NLP model training, presenting approximately 20 million words. The sentences provide a great coverage of Spanish-speaking countries and are accordingly tagged to a particular country or dialect.
This Spanish language dataset offers a rich collection of synonyms and antonyms, accompanied by detailed definitions and part-of-speech (POS) annotations, making it a comprehensive resource for building linguistically aware AI systems and language technologies.
Curated word-level audio data for the Spanish language, which covers all varieties of world Spanish, providing rich dialectal diversity in the Spanish language.
This language data contains a carefully curated and comprehensive list of 450,000 Spanish words.
Use Cases:
We consistently work with our clients on new use cases as language technology continues to evolve. These include NLP applications, TTS, dictionary display tools, games, translation, word embedding, and word sense disambiguation (WSD).
If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Oxford.Languages@oup.com to start the conversation.
Pricing:
Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.
Contact our team or email us at Oxford.Languages@oup.com to explore pricing options and discover how our language data can support your goals.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Mexican Spanish Scripted Monologue Speech Dataset for the Travel Domain. This meticulously curated dataset is designed to advance the development of Spanish language speech recognition models, particularly for the Travel industry.
This training dataset comprises over 6,000 high-quality scripted prompt recordings in Mexican Spanish. These recordings cover various topics and scenarios relevant to the Travel domain, designed to build robust and accurate customer service speech technology.
Each scripted prompt is crafted to reflect real-life scenarios encountered in the Travel domain, ensuring applicability in training robust natural language processing and speech recognition models.
In addition to high-quality audio recordings, the dataset includes meticulously prepared text files with verbatim transcriptions of each audio file. These transcriptions are essential for training accurate and robust speech recognition models.