7 datasets found

Spanish Language Datasets | 1.8M+ Sentences | NLP | TTS | Dictionary Display...
datarade.ai
Updated Jul 11, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Oxford Languages (2025). Spanish Language Datasets | 1.8M+ Sentences | NLP | TTS | Dictionary Display | Game | Translations | European & Latin Amer. Coverage [Dataset]. https://datarade.ai/data-products/spanish-language-datasets-1-8m-sentences-nlp-tts-dic-oxford-languages
Explore at:
.csv, .json, .mp3, .txt, .wav, .xls, .xmlAvailable download formats
Dataset updated
Jul 11, 2025
Dataset authored and provided by
Oxford Languageshttps://www.lexico.com/
Area covered
Ecuador, Paraguay, Panama, Bolivia (Plurinational State of), Chile, Cuba, Nicaragua, Colombia, Honduras, Costa Rica
Description
Our Spanish language datasets are carefully compiled and annotated by language and linguistic experts; you can find them available for licensing:

Spanish Monolingual Dictionary Data

Spanish Bilingual Dictionary Data

Spanish Sentences Data

Synonyms and Antonyms Data

Audio Data

Word list Data

Key Features (approximate numbers):

Spanish Monolingual Dictionary Data

Our Spanish monolingual reliably offers clear definitions and examples, a large volume of headwords, and comprehensive coverage of the Spanish language.

Headwords: 73,000

Senses: 123,000

Sentence examples: 104,000

Format: XML and JSON formats

Delivery: Email (link-based file sharing) and REST API

Updated frequency: annually

Spanish Bilingual Dictionary Data

The bilingual data provides translations in both directions, from English to Spanish and from Spanish to English. It is annually reviewed and updated by our in-house team of language experts. Offers significant coverage of the language, providing a large volume of translated words of excellent quality.

Translations: 221,300

Senses: 103,500

Example sentences: 74,500

Example translations: 83,800

Format: XML and JSON formats

Delivery: Email (link-based file sharing) and REST API

Updated frequency: annually

Spanish Sentences Data

Spanish sentences retrieved from the corpus are ideal for NLP model training, presenting approximately 20 million words. The sentences provide a great coverage of Spanish-speaking countries and are accordingly tagged to a particular country or dialect.

Sentences volume: 1,840,000

Format: XML and JSON format

Delivery: Email (link-based file sharing) and REST API

Spanish Synonyms and Antonyms Data

This Spanish language dataset offers a rich collection of synonyms and antonyms, accompanied by detailed definitions and part-of-speech (POS) annotations, making it a comprehensive resource for building linguistically aware AI systems and language technologies.

Synonyms: 127,700

Antonyms: 9,500

Format: XML format

Delivery: Email (link-based file sharing)

Updated frequency: annually

Spanish Audio Data (word-level)

Curated word-level audio data for the Spanish language, which covers all varieties of world Spanish, providing rich dialectal diversity in the Spanish language.

Audio files: 20,900

Format: XLSX (for index), MP3 and WAV (audio files)

Spanish Word List Data

This language data contains a carefully curated and comprehensive list of 450,000 Spanish words.

Wordforms: 450,000

Format: CSV and TXT formats

Delivery: Email (link-based file sharing)

Use Cases:

We consistently work with our clients on new use cases as language technology continues to evolve. These include NLP applications, TTS, dictionary display tools, games, translation, word embedding, and word sense disambiguation (WSD).

If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Oxford.Languages@oup.com to start the conversation.

Pricing:

Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.

Contact our team or email us at Oxford.Languages@oup.com to explore pricing options and discover how our language data can support your goals.
h
messirve
huggingface.co
Updated Apr 26, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Spanish Info Retrieval (2024). messirve [Dataset]. https://huggingface.co/datasets/spanish-ir/messirve
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 26, 2024
Dataset authored and provided by
Spanish Info Retrieval
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Dataset Card for MessIRve

MessIRve is a large-scale dataset for Spanish IR, designed to better capture the information needs of Spanish speakers across different countries. Queries are obtained from Google's autocomplete API (www.google.com/complete), and relevant documents are Spanish Wikipedia paragraphs containing answers from Google Search "featured snippets". This data collection strategy is inspired by GooAQ. The files presented here are the qrels. The style in which they… See the full description on the dataset page: https://huggingface.co/datasets/spanish-ir/messirve.
16kHz Conversational Speech Data | 35,000 Hours | Large Language Model(LLM)...
data.nexdata.ai
Updated Aug 3, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nexdata (2024). 16kHz Conversational Speech Data | 35,000 Hours | Large Language Model(LLM) Data | Speech AI Datasets|Machine Learning (ML) Data [Dataset]. https://data.nexdata.ai/products/nexdata-multilingual-conversational-speech-data-16khz-mob-nexdata
Explore at:
Dataset updated
Aug 3, 2024
Dataset authored and provided by
Nexdata
Area covered
Hong Kong, Ukraine, Bulgaria, Brazil, Egypt, Syrian Arab Republic, Switzerland, Malaysia, Italy, Pakistan
Description
Nexdata has off-the-shelf 35,000 hours Machine Learning (ML) Data of 16kHz conversational speech, covering 100+ countries including English, German, French, Spanish, Italian, Portuguese, Korean, Japanese, Hindi, Russia and etc.
c
Data from: Dataset for 'How brands highlight country of origin in magazine...
datacatalogue.cessda.eu
ssh.datastations.nl
Updated Apr 11, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
J.M.A. Hornikx; J. van den Heuvel; W.F.J. van Meurs; A.J.M. Janssen (2023). Dataset for 'How brands highlight country of origin in magazine advertising: A content analysis' [Dataset]. http://doi.org/10.17026/dans-ztf-w83f
Explore at:
Unique identifier
https://doi.org/10.17026/dans-ztf-w83f
Dataset updated
Apr 11, 2023
Dataset provided by
Radboud University
Authors
J.M.A. Hornikx; J. van den Heuvel; W.F.J. van Meurs; A.J.M. Janssen
Description
Dataset for content analysis published in "Hornikx, J., Meurs, F. van, Janssen, A., & Heuvel, J. van den (2020). How brands highlight country of origin in magazine advertising: A content analysis. Journal of Global Marketing, 33 (1), 34-45."
*Abstract (taken from publication)
Aichner (2014) proposes a classification of ways in which brands communicate their country of origin (COO). The current, exploratory study is the first to empirically investigate the frequency with which brands employ such COO markers in magazine advertisements. An analysis of about 750 ads from the British, Dutch, and Spanish editions of Cosmopolitan showed that the prototypical ‘made in’ marker was rarely used, and that ‘COO embedded in company name’ and ‘use of COO language’ were most frequently employed. In all, 36% of the total number of ads contained at least one COO marker, underlining the importance of the COO construct.
*Methodology (taken from publication)
Sample
The use of COO markers in advertising was examined in print advertisements from three different countries to increase the robustness of the findings. Given the exploratory nature of this study, two practical selection criteria guided our country choice: the three countries included both smaller and larger countries in Europe, and they represented languages that the team was familiar with in order to reliably code the advertisements on the relevant variables. The three European countries selected were the Netherlands, Spain, and the United Kingdom. The dataset for the UK was discarded for testing H1 about the use of English as a foreign language, as will be explained in more detail in the coding procedure.
The magazine Cosmopolitan was chosen as the source of advertisements. The choice for one specific magazine title reduces the generalizability of the findings (i.e., limited to the corresponding products and target consumers), but this magazine was chosen intentionally because an informal analysis suggested that it carried advertising for a large number of product categories that are considered ethnic products, such as cosmetics, watches, and shoes (Usunier & Cestre, 2007). This suggestion was corroborated in the main analysis: the majority of the ads in the corpus referred to a product that Usunier and Cestre (2007) classify as ethnic products. Table 2 provides a description of the product categories and brands referred to in the advertisements. Ethnic products have a prototypical COO in the minds of consumers (e.g., cosmetics – France), which makes it likely that the COOs are highlighted through the use of COO markers.
Cosmopolitan is an international magazine that has different local editions in the three countries. The magazine, which is targeted at younger women (18–35 years old), reaches more than three million young women per month through its online, social and print platforms in the Netherlands (Hearst Netherlands, 2016), has about 517,000 readers per month in Spain (PrNoticias, 2016) and about 1.18 million readers per month in the UK (Hearst Magazine U.K., 2016).
The sample consisted of all advertisements from all monthly issues that appeared in 2016 in the three countries. This whole-year cluster was selected so as to prevent potential seasonal influences (Neuendorf, 2002). In total, the corpus consisted of 745 advertisements, of which 111 were from the Dutch, 367 from the British and 267 from the Spanish Cosmopolitan. Two categories of ads were excluded in the selection process: (1) advertisements for subscription to Cosmopolitan itself, and (2) advertisements that were identical to ads that had appeared in another issue in one of the three countries. As a result, each advertisement was unique.
Coding procedure
For all advertisements, four variables were coded: product type, presence of types of COO markers, COO referred to, and the use of English as a COO marker. In the first place, product type was assessed by the two coders. Coders classified each product to one of the 32 product types. In order to assess the reliability of the codings, ten per cent of the ads were independently coded by a second coder. The interrater reliability of the variable product category was good (κ = .97, p < .000, 97.33% agreement between both coders). Table 2 lists the most frequent product types; the label ‘other’ covers 17 types of product, including charity, education, and furniture.
In the second place, it was recorded whether one or more of the COO markers occurred in a given ad. In the third place, if a marker was identified, it was assessed to which COO the markers referred. Table 1 lists the nine possible COO markers defined by Aichner (2014) and the COOs referred to, with examples taken from the current content analysis. The interrater reliability for the type of COO marker was very good (κ = .80, p < .000, 96.30% agreement between the coders), and the interrater reliability for COO referred to was...
P
FooDI-ML Dataset
paperswithcode.com
Updated Oct 6, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
David Amat Olóndriz; Ponç Palau Puigdevall; Adrià Salvador Palau (2021). FooDI-ML Dataset [Dataset]. https://paperswithcode.com/dataset/foodi-ml
Explore at:
Dataset updated
Oct 6, 2021
Authors
David Amat Olóndriz; Ponç Palau Puigdevall; Adrià Salvador Palau
Description
Food Drinks and groceries Images Multi Lingual (FooDI-ML) is a dataset that contains over 1.5M unique images and over 9.5M store names, product names descriptions, and collection sections gathered from the Glovo application. The data made available corresponds to food, drinks and groceries products from 37 countries in Europe, the Middle East, Africa and Latin America. The dataset comprehends 33 languages, including 870K samples of languages of countries from Eastern Europe and Western Asia such as Ukrainian and Kazakh, which have been so far underrepresented in publicly available visiolinguistic datasets. The dataset also includes widely spoken languages such as Spanish and English.

Description from: FooDI-ML: a large multi-language dataset of food, drinks and groceries images and descriptions
T
Spain Exports
tradingeconomics.com
de.tradingeconomics.com
+13more
csv, excel, json, xml
Updated May 15, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
TRADING ECONOMICS (2025). Spain Exports [Dataset]. https://tradingeconomics.com/spain/exports
Explore at:
excel, csv, xml, jsonAvailable download formats
Dataset updated
May 15, 2025
Dataset authored and provided by
TRADING ECONOMICS
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 31, 1962 - Apr 30, 2025
Area covered
Spain
Description
Exports in Spain decreased to 32510800 EUR Thousand in April from 34119900 EUR Thousand in March of 2025. This dataset provides the latest reported value for - Spain Exports - plus previous releases, historical high and low, short-term forecast and long-term prediction, economic calendar, survey consensus and news.
f
Data_Sheet_1_Spanish Version of the Teachers’ Sense of Efficacy Scale: An...
frontiersin.figshare.com
figshare.com
docx
Updated Jun 6, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Fátima Salas-Rodríguez; Sonia Lara; Martín Martínez (2023). Data_Sheet_1_Spanish Version of the Teachers’ Sense of Efficacy Scale: An Adaptation and Validation Study.docx [Dataset]. http://doi.org/10.3389/fpsyg.2021.714145.s001
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.3389/fpsyg.2021.714145.s001
Dataset updated
Jun 6, 2023
Dataset provided by
Frontiers
Authors
Fátima Salas-Rodríguez; Sonia Lara; Martín Martínez
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Teachers’ Sense of Efficacy Scale (TSES) has been the most widely used instrument to assess teacher efficacy beliefs. However, no study has been carried out concerning the TSES psychometric properties with teachers in Mexico, the country with the highest number of Spanish-speakers worldwide. The purpose of the present study is to examine the reliability, internal and external validity evidence of the TSES (short form) adapted into Spanish with a sample of 190 primary and secondary Mexican teachers from 25 private schools. Results of construct analysis confirm the three-factor-correlated structure of the original scale. Criterion validity evidence was established between self-efficacy and job satisfaction. Differences in self-efficacy were related to teachers’ gender, years of experience and grade level taught. Some limitations are discussed, and future research directions are recommended.
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Spanish Language Datasets | 1.8M+ Sentences | NLP | TTS | Dictionary Display | Game | Translations | European & Latin Amer. Coverage

Explore at:

.csv, .json, .mp3, .txt, .wav, .xls, .xmlAvailable download formats

Dataset updated

Jul 11, 2025

Dataset authored and provided by

Oxford Languageshttps://www.lexico.com/

Area covered

Ecuador, Paraguay, Panama, Bolivia (Plurinational State of), Chile, Cuba, Nicaragua, Colombia, Honduras, Costa Rica

Description

Our Spanish language datasets are carefully compiled and annotated by language and linguistic experts; you can find them available for licensing:

Spanish Monolingual Dictionary Data
Spanish Bilingual Dictionary Data
Spanish Sentences Data
Synonyms and Antonyms Data
Audio Data
Word list Data

Key Features (approximate numbers):

Spanish Monolingual Dictionary Data

Our Spanish monolingual reliably offers clear definitions and examples, a large volume of headwords, and comprehensive coverage of the Spanish language.

Headwords: 73,000
Senses: 123,000
Sentence examples: 104,000
Format: XML and JSON formats
Delivery: Email (link-based file sharing) and REST API
Updated frequency: annually

Spanish Bilingual Dictionary Data

The bilingual data provides translations in both directions, from English to Spanish and from Spanish to English. It is annually reviewed and updated by our in-house team of language experts. Offers significant coverage of the language, providing a large volume of translated words of excellent quality.

Translations: 221,300
Senses: 103,500
Example sentences: 74,500
Example translations: 83,800
Format: XML and JSON formats
Delivery: Email (link-based file sharing) and REST API
Updated frequency: annually

Spanish Sentences Data

Spanish sentences retrieved from the corpus are ideal for NLP model training, presenting approximately 20 million words. The sentences provide a great coverage of Spanish-speaking countries and are accordingly tagged to a particular country or dialect.

Sentences volume: 1,840,000
Format: XML and JSON format
Delivery: Email (link-based file sharing) and REST API

Spanish Synonyms and Antonyms Data

This Spanish language dataset offers a rich collection of synonyms and antonyms, accompanied by detailed definitions and part-of-speech (POS) annotations, making it a comprehensive resource for building linguistically aware AI systems and language technologies.

Synonyms: 127,700
Antonyms: 9,500
Format: XML format
Delivery: Email (link-based file sharing)
Updated frequency: annually

Spanish Audio Data (word-level)

Curated word-level audio data for the Spanish language, which covers all varieties of world Spanish, providing rich dialectal diversity in the Spanish language.

Audio files: 20,900
Format: XLSX (for index), MP3 and WAV (audio files)

Spanish Word List Data

This language data contains a carefully curated and comprehensive list of 450,000 Spanish words.

Wordforms: 450,000
Format: CSV and TXT formats
Delivery: Email (link-based file sharing)

Use Cases:

We consistently work with our clients on new use cases as language technology continues to evolve. These include NLP applications, TTS, dictionary display tools, games, translation, word embedding, and word sense disambiguation (WSD).

If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Oxford.Languages@oup.com to start the conversation.

Pricing:

Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.

Contact our team or email us at Oxford.Languages@oup.com to explore pricing options and discover how our language data can support your goals.

Clear search

Close search

Google apps

Main menu

Spanish Language Datasets | 1.8M+ Sentences | NLP | TTS | Dictionary Display...

messirve

16kHz Conversational Speech Data | 35,000 Hours | Large Language Model(LLM)...

Data from: Dataset for 'How brands highlight country of origin in magazine...

FooDI-ML Dataset

Spain Exports

Data_Sheet_1_Spanish Version of the Teachers’ Sense of Efficacy Scale: An...

Spanish Language Datasets | 1.8M+ Sentences | NLP | TTS | Dictionary Display | Game | Translations | European & Latin Amer. Coverage