Mexico is the country with the largest number of native Spanish speakers in the world. As of 2024, 132.5 million people in Mexico spoke Spanish with a native command of the language. Colombia had the second-highest number of native Spanish speakers, at around 52.7 million. Spain came in third, with 48 million, and Argentina fourth, with 46 million.
Spanish, a world language
As of 2023, Spanish ranked as the fourth most spoken language in the world, behind only English, Chinese, and Hindi, with over half a billion speakers. Spanish is the official language of over 20 countries, most of them in the Americas; it is also one of the official languages of Equatorial Guinea in Africa. Spanish also has a strong presence in other countries, such as the United States, Morocco, and Brazil, which are among the non-Hispanic countries with the highest number of Spanish speakers.
The second most spoken language in the U.S.
In the most recent data, Spanish ranked as the language other than English with the highest number of speakers in the United States, with 12 times more speakers than the second-ranked language. This comes as no surprise given the long history of migration from Latin American countries to the United States. Moreover, in fiscal year 2022 alone, 5 of the top 10 countries of origin of naturalized people in the U.S. were Spanish-speaking countries.
The United States is the non-Hispanic country with the largest number of native Spanish speakers in the world, with approximately 41.89 million people with a native command of the language in 2024. However, the European Union had the largest group of non-native speakers with limited proficiency in Spanish, at around 28 million people. Mexico is the country with the largest number of native Spanish speakers in the world as of 2024.
The United States is the country with the largest number of Spanish language students, at approximately 8.59 million people in 2024. Brazil is second, with around 4.05 million students of the Spanish language. The United States is also the non-Hispanic country with the largest number of native Spanish speakers in the world.
Linguistically annotated Spanish language datasets with headwords, definitions, senses, examples, POS tags, semantic metadata, and usage info. Ideal for dictionary tools, NLP, and TTS model training or fine-tuning.
Our Spanish language datasets are carefully compiled and annotated by language and linguistic experts; you can find them available for licensing:
Key Features (approximate numbers):
Our Spanish monolingual dataset offers clear definitions and examples, a large volume of headwords, and comprehensive coverage of the Spanish language.
The bilingual data provides translations in both directions, from English to Spanish and from Spanish to English. It is reviewed and updated annually by our in-house team of language experts. It offers significant coverage of the language, with a large volume of high-quality translated words.
The Spanish sentences retrieved from the corpus, totaling approximately 20 million words, are ideal for NLP model training. They provide broad coverage of Spanish-speaking countries, and each is tagged with a particular country or dialect.
This Spanish language dataset offers a rich collection of synonyms and antonyms, accompanied by detailed definitions and part-of-speech (POS) annotations, making it a comprehensive resource for building linguistically aware AI systems and language technologies.
Curated word-level audio data for the Spanish language, which covers all varieties of world Spanish, providing rich dialectal diversity in the Spanish language.
This language data contains a carefully curated and comprehensive list of 450,000 Spanish words.
Use Cases:
We consistently work with our clients on new use cases as language technology continues to evolve. These include NLP applications, TTS, dictionary display tools, games, translation, word embedding, and word sense disambiguation (WSD).
If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Oxford.Languages@oup.com to start the conversation.
Pricing:
Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.
Contact our team or email us at Oxford.Languages@oup.com to explore pricing options and discover how our language data can support your goals.
About the sample:
The samples offer a brief overview of one or two language datasets (monolingual and/or bilingual dictionary data). To help you explore the structure and features of our dataset, we provide a sample in CSV format for preview purposes only.
If you need the complete original sample or more details about any dataset, please contact us (Growth.OL@oup.com) to request access or further information.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Spanish (Spain) Spontaneous Dialogue Telephony speech dataset, collected from dialogues based on given topics and covering 20+ domains. Transcribed with text content, speaker ID, gender, age, and other attributes. The dataset was collected from a large and geographically diverse pool of speakers (600 native speakers), enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout data collection, storage, and usage; all our datasets comply with GDPR, CCPA, and PIPL. For more details, please refer to the link: https://www.nexdata.ai/datasets/speechrecog/1234?source=Kaggle
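Format
8kHz 8bit, a-law/u-law pcm, mono channel
Content category
Dialogue based on given topics
Recording condition
Low background noise (indoor)
Recording device
Telephony
Country
Spain(ESP)
Language(Region) Code
es-ES
Language
Spanish
Speaker
600 people in total, 49% male and 51% female
Features of annotation
Transcription text, timestamp, speaker ID, gender
Accuracy Rate
Word accuracy rate(WAR) 98%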
Commercial License
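As a rough illustration of what the 8 kHz, 8-bit a-law/u-law PCM format listed above implies for preprocessing, the sketch below expands G.711 u-law bytes into 16-bit linear PCM samples. It assumes headerless (raw) u-law input, which the listing does not confirm; the function names `ulaw_to_linear` and `decode_ulaw` are hypothetical, and a-law audio would need a different expansion.

```python
def ulaw_to_linear(code: int) -> int:
    """Expand one 8-bit G.711 u-law code into a signed 16-bit linear sample."""
    code = ~code & 0xFF              # u-law bytes are stored bit-inverted
    sign = code & 0x80               # top bit: sign
    exponent = (code >> 4) & 0x07    # next three bits: segment number
    mantissa = code & 0x0F           # low four bits: step within the segment
    # Rebuild the biased magnitude, then remove the encoder bias (0x84).
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_ulaw(data: bytes) -> list[int]:
    """Decode raw u-law bytes (assumed headerless) into linear PCM samples."""
    return [ulaw_to_linear(code) for code in data]
```

The decoded samples keep the original 8 kHz sampling rate; any resampling to match a model's expected input rate would be a separate step.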
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
July 2025 UPDATE: We released version 1.1, adding almost 200k new queries 🎉🎉🎉. Use with:
    import datasets
    country = "full"  # "ar", "bo", ...
    version = "1.1"
    dataset = datasets.load_dataset("spanish-ir/messirve", country, revision=version)
    print(dataset)
Dataset Card for MessIRve
MessIRve is a large-scale dataset for Spanish IR, designed to better capture the information needs of Spanish speakers across different countries. Queries are obtained from Google's autocomplete API… See the full description on the dataset page: https://huggingface.co/datasets/spanish-ir/messirve.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Parental burnout is a unique and context-specific syndrome resulting from a chronic imbalance of risks over resources in the parenting domain. The current research aims to evaluate the psychometric properties of the Spanish version of the Parental Burnout Assessment (PBA) across Spanish-speaking countries with two consecutive studies. In Study 1, we analyzed the data through a bifactor model within an Exploratory Structural Equation Modeling (ESEM) on the pooled sample of participants (N = 1,979) obtaining good fit indices. We then attained measurement invariance across both gender and countries in a set of nested models with gradually increasing parameter constraints. Latent means comparisons across countries showed that among the participants’ countries, Chile had the highest parental burnout score. Likewise, comparisons across gender showed that mothers displayed higher scores than fathers, as shown in previous studies. Reliability coefficients were high. In Study 2 (N = 1,171), we tested the relations between parental burnout and three specific consequences, i.e., escape and suicidal ideations, parental neglect, and parental violence toward one’s children. The medium to large associations found provided support for the PBA’s predictive validity. Overall, we concluded that the Spanish version of the PBA has good psychometric properties. The results support its relevance for the assessment of parental burnout among Spanish-speaking parents, offering new opportunities for cross-cultural research in the parenting domain.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
ACS DEMOGRAPHIC AND HOUSING ESTIMATES HISPANIC OR LATINO AND RACE - DP05
Universe - Total population
Survey-Program - American Community Survey 5-year estimates
Years - 2020, 2021, 2022
The terms “Hispanic,” “Latino,” and “Spanish” are used interchangeably. Some respondents identify with all three terms while others may identify with only one of these three specific terms. People who identify with the terms “Hispanic,” “Latino,” or “Spanish” are those who classify themselves in one of the specific Hispanic, Latino, or Spanish categories listed on the questionnaire (“Mexican, Mexican Am., or Chicano,” “Puerto Rican,” or “Cuban”) as well as those who indicate that they are “another Hispanic, Latino, or Spanish origin.” People who do not identify with one of the specific origins listed on the questionnaire but indicate that they are “another Hispanic, Latino, or Spanish origin” are those whose origins are from Spain, the Spanish-speaking countries of Central or South America, or another Spanish culture or origin. Origin can be viewed as the heritage, nationality group, lineage, or country of birth of the person or the person’s parents or ancestors before their arrival in the United States. People who identify their origin as Hispanic, Latino, or Spanish may be of any race.
In 2023, a Spanish-language e-book cost on average ***** euros in Spain, where such e-books were the most expensive in comparison to other Spanish-speaking countries. Mexico and Peru followed, where Spanish-language e-books cost an average of *** euros and *** euros respectively.
Spanish (Spain) Scripted Monologue Smartphone speech dataset, collected from monologues based on given scripts and covering the generic domain, human-machine interaction, smart home commands, in-car commands, numbers, news, and other domains. Transcribed with text content and other attributes. The dataset was collected from a large and geographically diverse pool of speakers (989 people in total), enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout data collection, storage, and usage; all our datasets comply with GDPR, CCPA, and PIPL.
Format
16kHz, 16bit, uncompressed wav, mono channel;
Recording condition
Low background noise(indoor), without echo;
Content category
Generic domain; news; human-machine interaction; smart home command and control; in-car command and control; numbers
Recording device
Android Smartphone, iPhone;
Speaker
989 speakers in total, 49% male and 51% female; 57% of speakers are aged 17-25, 39% are aged 26-45, and 4% are aged 46-60;
Country
Spain(ESP);
Language(Region) Code
es-ES;
Language
Spanish;
Features of annotation
Transcription text;
Accuracy Rate
Sentence Accuracy Rate (SAR) 95%
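One way to sanity-check downloaded audio against the format stated above (16 kHz, 16-bit, mono, uncompressed WAV) is to read the WAV header with Python's standard `wave` module. This is a minimal sketch, not part of the dataset's tooling; the helper names are hypothetical, and the in-memory silent clip only stands in for a real recording.

```python
import io
import wave


def describe_wav(source):
    """Return (sample_rate_hz, bits_per_sample, channels) for a WAV path or file object."""
    with wave.open(source, "rb") as wav:
        return wav.getframerate(), wav.getsampwidth() * 8, wav.getnchannels()


def matches_spec(source, rate=16000, bits=16, channels=1):
    """True when the WAV header matches the expected rate, bit depth, and channel count."""
    return describe_wav(source) == (rate, bits, channels)


def make_silent_wav(rate=16000, bits=16, channels=1, seconds=0.1):
    """Build a short all-zero WAV in memory (a stand-in for a real recording)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(bits // 8)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * int(rate * seconds) * channels * (bits // 8))
    buffer.seek(0)  # rewind so the buffer can be read back
    return buffer


if __name__ == "__main__":
    print(matches_spec(make_silent_wav()))           # clip built to the stated format
    print(matches_spec(make_silent_wav(rate=8000)))  # telephony-rate clip, different format
```

With real data, `matches_spec` would be called on each file path, and files that fail the check could be set aside for resampling or inspection.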
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Spanish (Mexico) Spontaneous Dialogue Telephony speech dataset, collected from dialogues based on given topics. Transcribed with text content, timestamp, speaker ID, gender, and other attributes. The dataset was collected from a large and geographically diverse pool of speakers (122 native speakers), enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout data collection, storage, and usage; all our datasets comply with GDPR, CCPA, and PIPL. For more details, please refer to the link: https://www.nexdata.ai/datasets/speechrecog/1352?source=Kaggle
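Format
8kHz 8bit, a-law/u-law pcm, mono channel
Content category
Dialogue based on given topics
Recording condition
Low background noise (indoor)
Recording device
Telephony
Country
Mexico(MEX)
Language(Region) Code
es-MX
Language
Spanish
Speaker
122 people in total, 53% male and 47% female
Features of annotation
Transcription text, timestamp, speaker ID, gender, noise
Accuracy Rate
Word accuracy rate(WAR) 98%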
Commercial License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Dataset comprises 488 hours of high-quality telephone audio recordings in Spanish, featuring 600 native speakers and achieving a 95% sentence accuracy rate. Designed for advancing speech recognition models and language processing, this extensive speech data corpus covers diverse topics and domains, making it ideal for training robust automatic speech recognition (ASR) systems.
| Characteristic | Data |
|---|---|
| Description | Audio of telephone dialogues in Spanish for training NLP models in real-world conversational scenarios. |
| Data types | Audio |
| Tasks | Speech recognition, NLP |
| Country | Spain (ESP) |
| Hours of telephone dialogue | 488 |
| Number of speakers | 600 |
| Labeling | Annotation (text content, speaker's ID, gender, age and other attributes) |
| Gender | Male (49%), Female (51%) |
| Recording device | Telephone |
In 2025, around 1.53 billion people worldwide spoke English either natively or as a second language, more than the 1.18 billion Mandarin Chinese speakers at the time of the survey. Hindi and Spanish were the third and fourth most widely spoken languages that year.
Languages in the United States
The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken there vary as a result of its multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which more than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese), 1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year.
Different languages at home
The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage of its population speaking a language other than English is California, where about 45 percent of the population spoke a language other than English at home in 2023.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Data sources:
Travelers in Spain by tourist spots and country of residence (selection of 154 municipalities). Data from the 2017 hotel occupancy survey / Publication date: 09/10/2018 Portal de Datos Abiertos de Esri España. https://opendata.esri.es/datasets/ComunidadSIG::viajeros-entrados-por-puntos-turisticos-y-pais-de-residencia-/explore?location=28.382780%2C-15.044915%2C8.00
Municipal, provincial and regional limits. Centro Nacional de Información Geográfica (CNIG) - National Center for Geographic Information https://centrodedescargas.cnig.es/CentroDescargas/catalogo.do?Serie=LILIM
CartoBase ANE
Centro Nacional de Información Geográfica (CNIG) - National Center for Geographic Information
https://centrodedescargas.cnig.es/CentroDescargas/catalogo.do?Serie=LILIM
Files:
* Visitors_Turist_Sites: This GeoDataFrame is based on a selection of 154 municipalities for which the Hotel Occupancy Survey (Encuesta de Ocupación Hotelera), conducted by the National Institute of Statistics (Spain), distinguishes between visitors' nationalities.
* Spanish_Provinces_Peninsula: Provincial limits of Spain (Iberian Peninsula and Balearic Islands)
* Spanish_Provinces_CanaryIslands: Provincial limits of Spain (Canary Islands)
* Geo_world: The cartographic bases of the National Atlas of Spain (ANE) - World Map.
License: All this data is licensed under CC-BY 4.0. https://creativecommons.org/licenses/by/4.0/deed.es
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Combined Longitudinal Study of the Second Generation in Spain data set, Waves 1, 2, and 3. This is the publicly available version of the ILSEG data (ILSEG is the Spanish acronym for Investigación Longitudinal de la Segunda Generación, Longitudinal Study of the Second Generation). Questions address the situations and plans for the future of young Spaniards who are children of immigrants to Spain and who were living in Madrid and Barcelona and attending secondary school in 2007-2008 (with follow-ups in 2011-2012 and 2015-2016). ILSEG represents the first attempt to conduct a large-scale study of the adaptation of children of immigrants to Spanish society over time. To that end, a large and statistically representative sample of children born in Spain to foreign parents, or brought to the country at an early age, was identified and interviewed in metropolitan Madrid and Barcelona for wave 1. In total, almost 7,000 children of immigrants attending basic secondary school in close to 200 educational centers in both cities took part in the study. Because of sample attrition, wave 2 introduced a replacement sample. Additionally, a native-born sample of children of Spaniards was included to enable comparisons between native and immigrant-origin populations of the same age cohort.
Topics include basic demographics, national origins, Spanish language acquisition, foreign language knowledge and retention, parents' education and employment, respondents' education and aspirations, religion, household arrangements, life experiences, and attitudes about Spanish society. Demographic variables include age, sex, birth country, language proficiency (Spanish and Catalan), language spoken in the home, number of siblings, mother's and father's birth country, religion, national identity, parent's sex, parent's marital status, parent's birth year, and the year the parent arrived in Spain.
https://creativecommons.org/publicdomain/zero/1.0/
Datasets on refugee claims in Spain between 2013 and 2021. This dataset is composed of two data frames. Each data frame is split into male and female requests.
AsiloCA: requests broken down by autonomous community. Some useful feature information:
AsiloEspaña: requests broken down by country of origin. Some useful feature information:
https://www.ine.es/aviso_legal
Migration Statistic: Flow of emigration abroad of people aged 25 and over by year, sex, country of birth (Spanish/foreign) and level of studies (grouping of levels). Annual. National.
In 2023, California had the largest Hispanic population in the United States, with over 15.76 million people claiming Hispanic heritage. Texas, Florida, New York, and Illinois rounded out the top five states for Hispanic residents that year.
History of Hispanic people
Hispanic people are those whose heritage stems from a former Spanish colony. Spanish colonization of the Americas began after Christopher Columbus arrived in 1492, and the Spanish Empire expanded its territory throughout Central America and South America. Its colonization of what is now the United States did not include the Northeast. Although the number of Hispanic people living in the United States has increased, the median income of Hispanic households has fluctuated only slightly since 1990.
Hispanic population in the United States
Hispanic people are the second-largest ethnic group in the United States, making Spanish the second most common language spoken in the country. In 2021, about one-fifth of Hispanic households in the United States made between 50,000 and 74,999 U.S. dollars. The unemployment rate of Hispanic Americans has fluctuated significantly since 1990 but has been on the decline since 2010, with the exception of 2020 and 2021, due to the impact of the coronavirus (COVID-19) pandemic.
Spanish (Spain) Unscripted Call Center Telephony speech dataset covering the telecom domain. It includes the terms and emotions of call center scenarios, mirroring real-world interactions. Transcribed with text content, speaker ID, and other attributes. The dataset was collected from a large and geographically diverse pool of speakers, enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout data collection, storage, and usage; all our datasets comply with GDPR, CCPA, and PIPL.
Format
8kHz 16bit, wav, mono channel
Recording condition
Phone recording system, with low background noise (call center scenario)
Recording content
Spontaneous inbound and outbound calls in typical domains, such as telecom
Country
Spain(ESP),etc.
Language(Region) Code
es-ES, etc.
Language
Spanish
Features of annotation
Transcription text, timestamps, speaker ID, noise symbols, sensitive information
Accuracy
Word Accuracy Rate (WAR) 98% (punctuation, sentence symbols, accent and other non-speech labeling are not included in accuracy statistics due to subjectivity)
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Biometric Attack Dataset, Hispanic People
A similar dataset that includes all ethnicities: Anti Spoofing Real Dataset
The dataset for face anti-spoofing and face recognition includes images and videos of Hispanic people: 32,600+ photos and videos of 16,300 people from 20 countries. The dataset helps enhance model performance by providing a wider range of data for a specific ethnic group. The videos were gathered by capturing faces of genuine individuals… See the full description on the dataset page: https://huggingface.co/datasets/UniqueData/hispanic-people-liveness-detection-video-dataset.