7 datasets found
  1. Spanish Language Datasets | 1.8M+ Sentences | NLP | TTS | Dictionary Display...

    • datarade.ai
    Updated Jul 11, 2025
    Cite
    Oxford Languages (2025). Spanish Language Datasets | 1.8M+ Sentences | NLP | TTS | Dictionary Display | Game | Translations | European & Latin Amer. Coverage [Dataset]. https://datarade.ai/data-products/spanish-language-datasets-1-8m-sentences-nlp-tts-dic-oxford-languages
    Available download formats: .csv, .json, .mp3, .txt, .wav, .xls, .xml
    Dataset updated
    Jul 11, 2025
    Dataset authored and provided by
    Oxford Languages (https://www.lexico.com/)
    Area covered
    Ecuador, Paraguay, Panama, Bolivia (Plurinational State of), Chile, Cuba, Nicaragua, Colombia, Honduras, Costa Rica
    Description

    Our Spanish language datasets are carefully compiled and annotated by language and linguistic experts; you can find them available for licensing:

    1. Spanish Monolingual Dictionary Data
    2. Spanish Bilingual Dictionary Data
    3. Spanish Sentences Data
    4. Synonyms and Antonyms Data
    5. Audio Data
    6. Word list Data

    Key Features (approximate numbers):

    1. Spanish Monolingual Dictionary Data

    Our Spanish monolingual dictionary data offers clear definitions and examples, a large volume of headwords, and comprehensive coverage of the Spanish language.

    • Headwords: 73,000
    • Senses: 123,000
    • Sentence examples: 104,000
    • Format: XML and JSON formats
    • Delivery: Email (link-based file sharing) and REST API
    • Update frequency: annually

    2. Spanish Bilingual Dictionary Data

    The bilingual data provides translations in both directions, from English to Spanish and from Spanish to English. It is reviewed and updated annually by our in-house team of language experts, and it offers significant coverage of the language with a large volume of high-quality translations.

    • Translations: 221,300
    • Senses: 103,500
    • Example sentences: 74,500
    • Example translations: 83,800
    • Format: XML and JSON formats
    • Delivery: Email (link-based file sharing) and REST API
    • Update frequency: annually

    3. Spanish Sentences Data

    Spanish sentences retrieved from the corpus are well suited to NLP model training and total approximately 20 million words. The sentences cover a broad range of Spanish-speaking countries, and each is tagged with its country or dialect.

    • Sentences volume: 1,840,000
    • Format: XML and JSON formats
    • Delivery: Email (link-based file sharing) and REST API

    4. Spanish Synonyms and Antonyms Data

    This Spanish language dataset offers a rich collection of synonyms and antonyms, accompanied by detailed definitions and part-of-speech (POS) annotations, making it a comprehensive resource for building linguistically aware AI systems and language technologies.

    • Synonyms: 127,700
    • Antonyms: 9,500
    • Format: XML format
    • Delivery: Email (link-based file sharing)
    • Update frequency: annually

    5. Spanish Audio Data (word-level)

    Curated word-level audio data covering all varieties of world Spanish, providing rich dialectal diversity.

    • Audio files: 20,900
    • Format: XLSX (for index), MP3 and WAV (audio files)

    6. Spanish Word List Data

    This language data contains a carefully curated and comprehensive list of 450,000 Spanish words.

    • Wordforms: 450,000
    • Format: CSV and TXT formats
    • Delivery: Email (link-based file sharing)

    Use Cases:

    We consistently work with our clients on new use cases as language technology continues to evolve. These include NLP applications, TTS, dictionary display tools, games, translation, word embedding, and word sense disambiguation (WSD).

    If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Oxford.Languages@oup.com to start the conversation.

    Pricing:

    Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.

    Contact our team or email us at Oxford.Languages@oup.com to explore pricing options and discover how our language data can support your goals.

  2. messirve

    • huggingface.co
    Updated Apr 26, 2024
    Cite
    Spanish Info Retrieval (2024). messirve [Dataset]. https://huggingface.co/datasets/spanish-ir/messirve
    Dataset updated
    Apr 26, 2024
    Dataset authored and provided by
    Spanish Info Retrieval
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Dataset Card for MessIRve

    MessIRve is a large-scale dataset for Spanish IR, designed to better capture the information needs of Spanish speakers across different countries. Queries are obtained from Google's autocomplete API (www.google.com/complete), and relevant documents are Spanish Wikipedia paragraphs containing answers from Google Search "featured snippets". This data collection strategy is inspired by GooAQ. The files presented here are the qrels. The style in which they… See the full description on the dataset page: https://huggingface.co/datasets/spanish-ir/messirve.

  3. 16kHz Conversational Speech Data | 35,000 Hours | Large Language Model(LLM)...

    • data.nexdata.ai
    Updated Aug 3, 2024
    Cite
    Nexdata (2024). 16kHz Conversational Speech Data | 35,000 Hours | Large Language Model(LLM) Data | Speech AI Datasets|Machine Learning (ML) Data [Dataset]. https://data.nexdata.ai/products/nexdata-multilingual-conversational-speech-data-16khz-mob-nexdata
    Dataset updated
    Aug 3, 2024
    Dataset authored and provided by
    Nexdata
    Area covered
    Hong Kong, Ukraine, Bulgaria, Brazil, Egypt, Syrian Arab Republic, Switzerland, Malaysia, Italy, Pakistan
    Description

    Nexdata offers 35,000 hours of off-the-shelf 16kHz conversational speech data for machine learning (ML), collected in 100+ countries and covering languages including English, German, French, Spanish, Italian, Portuguese, Korean, Japanese, Hindi, and Russian, among others.

  4. Data from: Dataset for 'How brands highlight country of origin in magazine...

    • datacatalogue.cessda.eu
    • ssh.datastations.nl
    Updated Apr 11, 2023
    Cite
    J.M.A. Hornikx; J. van den Heuvel; W.F.J. van Meurs; A.J.M. Janssen (2023). Dataset for 'How brands highlight country of origin in magazine advertising: A content analysis' [Dataset]. http://doi.org/10.17026/dans-ztf-w83f
    Dataset updated
    Apr 11, 2023
    Dataset provided by
    Radboud University
    Authors
    J.M.A. Hornikx; J. van den Heuvel; W.F.J. van Meurs; A.J.M. Janssen
    Description

    Dataset for content analysis published in "Hornikx, J., Meurs, F. van, Janssen, A., & Heuvel, J. van den (2020). How brands highlight country of origin in magazine advertising: A content analysis. Journal of Global Marketing, 33 (1), 34-45."

    *Abstract (taken from publication)
    Aichner (2014) proposes a classification of ways in which brands communicate their country of origin (COO). The current, exploratory study is the first to empirically investigate the frequency with which brands employ such COO markers in magazine advertisements. An analysis of about 750 ads from the British, Dutch, and Spanish editions of Cosmopolitan showed that the prototypical ‘made in’ marker was rarely used, and that ‘COO embedded in company name’ and ‘use of COO language’ were most frequently employed. In all, 36% of the total number of ads contained at least one COO marker, underlining the importance of the COO construct.

    *Methodology (taken from publication)

    Sample
    The use of COO markers in advertising was examined in print advertisements from three different countries to increase the robustness of the findings. Given the exploratory nature of this study, two practical selection criteria guided our country choice: the three countries included both smaller and larger countries in Europe, and they represented languages that the team was familiar with in order to reliably code the advertisements on the relevant variables. The three European countries selected were the Netherlands, Spain, and the United Kingdom. The dataset for the UK was discarded for testing H1 about the use of English as a foreign language, as will be explained in more detail in the coding procedure.

    The magazine Cosmopolitan was chosen as the source of advertisements. The choice for one specific magazine title reduces the generalizability of the findings (i.e., limited to the corresponding products and target consumers), but this magazine was chosen intentionally because an informal analysis suggested that it carried advertising for a large number of product categories that are considered ethnic products, such as cosmetics, watches, and shoes (Usunier & Cestre, 2007). This suggestion was corroborated in the main analysis: the majority of the ads in the corpus referred to a product that Usunier and Cestre (2007) classify as ethnic products. Table 2 provides a description of the product categories and brands referred to in the advertisements. Ethnic products have a prototypical COO in the minds of consumers (e.g., cosmetics – France), which makes it likely that the COOs are highlighted through the use of COO markers.

    Cosmopolitan is an international magazine that has different local editions in the three countries. The magazine, which is targeted at younger women (18–35 years old), reaches more than three million young women per month through its online, social and print platforms in the Netherlands (Hearst Netherlands, 2016), has about 517,000 readers per month in Spain (PrNoticias, 2016) and about 1.18 million readers per month in the UK (Hearst Magazine U.K., 2016).

    The sample consisted of all advertisements from all monthly issues that appeared in 2016 in the three countries. This whole-year cluster was selected so as to prevent potential seasonal influences (Neuendorf, 2002). In total, the corpus consisted of 745 advertisements, of which 111 were from the Dutch, 367 from the British and 267 from the Spanish Cosmopolitan. Two categories of ads were excluded in the selection process: (1) advertisements for subscription to Cosmopolitan itself, and (2) advertisements that were identical to ads that had appeared in another issue in one of the three countries. As a result, each advertisement was unique.

    Coding procedure
    For all advertisements, four variables were coded: product type, presence of types of COO markers, COO referred to, and the use of English as a COO marker. In the first place, product type was assessed by the two coders. Coders classified each product to one of the 32 product types. In order to assess the reliability of the codings, ten per cent of the ads were independently coded by a second coder. The interrater reliability of the variable product category was good (κ = .97, p < .000, 97.33% agreement between both coders). Table 2 lists the most frequent product types; the label ‘other’ covers 17 types of product, including charity, education, and furniture.
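The κ values reported for the coding procedure are Cohen's kappa, which corrects the raw agreement between two coders for the agreement expected by chance. The following is a minimal sketch of that calculation; the function name and the toy labels are illustrative assumptions and are not taken from the dataset or the publication.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each coder's marginal label proportions.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical product-type codes for six ads (not real data).
coder_a = ["cosmetics", "cosmetics", "watches", "shoes", "shoes", "other"]
coder_b = ["cosmetics", "cosmetics", "watches", "shoes", "other", "other"]
kappa = cohens_kappa(coder_a, coder_b)
print(round(kappa, 3))
```

In this toy case the coders agree on five of six ads, but kappa is lower than the raw agreement because some of that agreement would be expected by chance.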

    In the second place, it was recorded whether one or more of the COO markers occurred in a given ad. In the third place, if a marker was identified, it was assessed to which COO the markers referred. Table 1 lists the nine possible COO markers defined by Aichner (2014) and the COOs referred to, with examples taken from the current content analysis. The interrater reliability for the type of COO marker was very good (κ = .80, p < .000, 96.30% agreement between the coders), and the interrater reliability for COO referred to was...

  5. FooDI-ML Dataset

    • paperswithcode.com
    Updated Oct 6, 2021
    + more versions
    Cite
    David Amat Olóndriz; Ponç Palau Puigdevall; Adrià Salvador Palau (2021). FooDI-ML Dataset [Dataset]. https://paperswithcode.com/dataset/foodi-ml
    Dataset updated
    Oct 6, 2021
    Authors
    David Amat Olóndriz; Ponç Palau Puigdevall; Adrià Salvador Palau
    Description

    Food Drinks and groceries Images Multi Lingual (FooDI-ML) is a dataset that contains over 1.5M unique images and over 9.5M store names, product names, descriptions, and collection sections gathered from the Glovo application. The data made available corresponds to food, drinks and groceries products from 37 countries in Europe, the Middle East, Africa and Latin America. The dataset comprises 33 languages, including 870K samples of languages of countries from Eastern Europe and Western Asia such as Ukrainian and Kazakh, which have so far been underrepresented in publicly available visiolinguistic datasets. The dataset also includes widely spoken languages such as Spanish and English.

    Description from: FooDI-ML: a large multi-language dataset of food, drinks and groceries images and descriptions

  6. Spain Exports

    • tradingeconomics.com
    • de.tradingeconomics.com
    • +13 more
    csv, excel, json, xml
    Updated May 15, 2025
    Cite
    TRADING ECONOMICS (2025). Spain Exports [Dataset]. https://tradingeconomics.com/spain/exports
    Available download formats: excel, csv, xml, json
    Dataset updated
    May 15, 2025
    Dataset authored and provided by
    TRADING ECONOMICS
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 31, 1962 - Apr 30, 2025
    Area covered
    Spain
    Description

    Exports in Spain decreased to 32510800 EUR Thousand in April from 34119900 EUR Thousand in March of 2025. This dataset provides the latest reported value for - Spain Exports - plus previous releases, historical high and low, short-term forecast and long-term prediction, economic calendar, survey consensus and news.
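As a rough illustration of the month-over-month change described above, the short sketch below computes the relative decrease from the two reported values (in EUR thousand). The function and variable names are illustrative assumptions, not part of the dataset or any Trading Economics API.

```python
def percent_change(previous, current):
    """Relative change from `previous` to `current`, in percent."""
    if previous == 0:
        raise ValueError("previous value must be non-zero")
    return (current - previous) / previous * 100.0


# Values quoted in the description, in EUR thousand.
march_2025 = 34_119_900
april_2025 = 32_510_800

change = percent_change(march_2025, april_2025)
print(f"{change:.2f}%")
```

With these figures the result is a decline of roughly 4.7 percent.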

  7. Data_Sheet_1_Spanish Version of the Teachers’ Sense of Efficacy Scale: An...

    • frontiersin.figshare.com
    • figshare.com
    docx
    Updated Jun 6, 2023
    Cite
    Fátima Salas-Rodríguez; Sonia Lara; Martín Martínez (2023). Data_Sheet_1_Spanish Version of the Teachers’ Sense of Efficacy Scale: An Adaptation and Validation Study.docx [Dataset]. http://doi.org/10.3389/fpsyg.2021.714145.s001
    Available download formats: docx
    Dataset updated
    Jun 6, 2023
    Dataset provided by
    Frontiers
    Authors
    Fátima Salas-Rodríguez; Sonia Lara; Martín Martínez
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Teachers’ Sense of Efficacy Scale (TSES) has been the most widely used instrument to assess teacher efficacy beliefs. However, no study has been carried out concerning the TSES psychometric properties with teachers in Mexico, the country with the highest number of Spanish-speakers worldwide. The purpose of the present study is to examine the reliability, internal and external validity evidence of the TSES (short form) adapted into Spanish with a sample of 190 primary and secondary Mexican teachers from 25 private schools. Results of construct analysis confirm the three-factor-correlated structure of the original scale. Criterion validity evidence was established between self-efficacy and job satisfaction. Differences in self-efficacy were related to teachers’ gender, years of experience and grade level taught. Some limitations are discussed, and future research directions are recommended.

