4 datasets found
  1. Spanish Language Datasets | 1.8M+ Sentences | NLP | TTS | Dictionary Display...

    • datarade.ai
    Updated Jul 11, 2025
    Cite
    Oxford Languages (2025). Spanish Language Datasets | 1.8M+ Sentences | NLP | TTS | Dictionary Display | Game | Translations | European & Latin Amer. Coverage [Dataset]. https://datarade.ai/data-products/spanish-language-datasets-1-8m-sentences-nlp-tts-dic-oxford-languages
    Available download formats: .csv, .json, .mp3, .txt, .wav, .xls, .xml
    Dataset updated
    Jul 11, 2025
    Dataset authored and provided by
    Oxford Languages (https://www.lexico.com/)
    Area covered
    Chile, Paraguay, Ecuador, Bolivia (Plurinational State of), Panama, Colombia, Nicaragua, Cuba, Honduras, Costa Rica
    Description

    Our Spanish language datasets are carefully compiled and annotated by language and linguistic experts; you can find them available for licensing:

    1. Spanish Monolingual Dictionary Data
    2. Spanish Bilingual Dictionary Data
    3. Spanish Sentences Data
    4. Synonyms and Antonyms Data
    5. Audio Data
    6. Word list Data

    Key Features (approximate numbers):

    1. Spanish Monolingual Dictionary Data

    Our Spanish monolingual dictionary data offers clear definitions and examples, a large volume of headwords, and comprehensive coverage of the Spanish language.

    • Headwords: 73,000
    • Senses: 123,000
    • Sentence examples: 104,000
    • Format: XML and JSON formats
    • Delivery: Email (link-based file sharing) and REST API
    • Update frequency: annually

    2. Spanish Bilingual Dictionary Data

    The bilingual data provides translations in both directions, from English to Spanish and from Spanish to English. It is reviewed and updated annually by our in-house team of language experts. It offers significant coverage of the language, providing a large volume of translated words of excellent quality.

    • Translations: 221,300
    • Senses: 103,500
    • Example sentences: 74,500
    • Example translations: 83,800
    • Format: XML and JSON formats
    • Delivery: Email (link-based file sharing) and REST API
    • Update frequency: annually

    3. Spanish Sentences Data

    Spanish sentences retrieved from the corpus are ideal for NLP model training and total approximately 20 million words. The sentences provide broad coverage of Spanish-speaking countries, and each is tagged with its country or dialect.

    • Sentences volume: 1,840,000
    • Format: XML and JSON formats
    • Delivery: Email (link-based file sharing) and REST API
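
The two approximate figures quoted for the sentences data (1,840,000 sentences, about 20 million words) imply an average sentence length. A minimal Python sketch of that arithmetic follows; the variable names are illustrative, and the result is only as precise as the rounded figures in the listing.

```python
# Approximate figures quoted in the listing for the Spanish sentences data.
sentences = 1_840_000
words = 20_000_000

# Average sentence length implied by the two rounded figures.
avg_words_per_sentence = words / sentences
print(f"about {avg_words_per_sentence:.1f} words per sentence")
```

This is a rough sanity check when sizing a training corpus, not a statistic published by the vendor.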

    4. Spanish Synonyms and Antonyms Data

    This Spanish language dataset offers a rich collection of synonyms and antonyms, accompanied by detailed definitions and part-of-speech (POS) annotations, making it a comprehensive resource for building linguistically aware AI systems and language technologies.

    • Synonyms: 127,700
    • Antonyms: 9,500
    • Format: XML format
    • Delivery: Email (link-based file sharing)
    • Update frequency: annually

    5. Spanish Audio Data (word-level)

    Curated word-level audio data for the Spanish language, covering all varieties of world Spanish and providing rich dialectal diversity.

    • Audio files: 20,900
    • Format: XLSX (for index), MP3 and WAV (audio files)

    6. Spanish Word List Data

    This language data contains a carefully curated and comprehensive list of 450,000 Spanish words.

    • Wordforms: 450,000
    • Format: CSV and TXT formats
    • Delivery: Email (link-based file sharing)

    Use Cases:

    We consistently work with our clients on new use cases as language technology continues to evolve. These include NLP applications, TTS, dictionary display tools, games, translation, word embedding, and word sense disambiguation (WSD).

    If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Oxford.Languages@oup.com to start the conversation.

    Pricing:

    Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.

    Contact our team or email us at Oxford.Languages@oup.com to explore pricing options and discover how our language data can support your goals.

  2. 16kHz Conversational Speech Data | 35,000 Hours | Large Language Model(LLM)...

    • data.nexdata.ai
    Updated Aug 3, 2024
    Cite
    Nexdata (2024). 16kHz Conversational Speech Data | 35,000 Hours | Large Language Model(LLM) Data | Speech AI Datasets|Machine Learning (ML) Data [Dataset]. https://data.nexdata.ai/products/nexdata-multilingual-conversational-speech-data-16khz-mob-nexdata
    Dataset updated
    Aug 3, 2024
    Dataset authored and provided by
    Nexdata
    Area covered
    Ukraine, Hong Kong, Egypt, Bulgaria, Syrian Arab Republic, Brazil, Malaysia, Italy, Pakistan, Switzerland
    Description

    Nexdata offers 35,000 hours of off-the-shelf 16kHz conversational speech data for machine learning (ML), covering 100+ countries, with languages including English, German, French, Spanish, Italian, Portuguese, Korean, Japanese, Hindi, and Russian.

  3. Data from: Dataset for 'How brands highlight country of origin in magazine...

    • ssh.datastations.nl
    • datacatalogue.cessda.eu
    pdf, tsv, txt, xml +1
    Updated Jun 8, 2020
    Cite
    J.M.A. Hornikx; J. van den Heuvel; W.F.J. van Meurs; A.J.M. Janssen (2020). Dataset for 'How brands highlight country of origin in magazine advertising: A content analysis' [Dataset]. http://doi.org/10.17026/DANS-ZTF-W83F
    Available download formats: tsv(40846), zip(32664), xml(11286), txt(782), pdf(126553)
    Dataset updated
    Jun 8, 2020
    Dataset provided by
    DANS Data Station Social Sciences and Humanities
    Authors
    J.M.A. Hornikx; J. van den Heuvel; W.F.J. van Meurs; A.J.M. Janssen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset for content analysis published in "Hornikx, J., Meurs, F. van, Janssen, A., & Heuvel, J. van den (2020). How brands highlight country of origin in magazine advertising: A content analysis. Journal of Global Marketing, 33 (1), 34-45."

    Abstract (taken from publication)

    Aichner (2014) proposes a classification of ways in which brands communicate their country of origin (COO). The current, exploratory study is the first to empirically investigate the frequency with which brands employ such COO markers in magazine advertisements. An analysis of about 750 ads from the British, Dutch, and Spanish editions of Cosmopolitan showed that the prototypical ‘made in’ marker was rarely used, and that ‘COO embedded in company name’ and ‘use of COO language’ were most frequently employed. In all, 36% of the total number of ads contained at least one COO marker, underlining the importance of the COO construct.

    Methodology (taken from publication)

    Sample

    The use of COO markers in advertising was examined in print advertisements from three different countries to increase the robustness of the findings. Given the exploratory nature of this study, two practical selection criteria guided our country choice: the three countries included both smaller and larger countries in Europe, and they represented languages that the team was familiar with in order to reliably code the advertisements on the relevant variables. The three European countries selected were the Netherlands, Spain, and the United Kingdom. The dataset for the UK was discarded for testing H1 about the use of English as a foreign language, as will be explained in more detail in the coding procedure.

    The magazine Cosmopolitan was chosen as the source of advertisements. The choice for one specific magazine title reduces the generalizability of the findings (i.e., limited to the corresponding products and target consumers), but this magazine was chosen intentionally because an informal analysis suggested that it carried advertising for a large number of product categories that are considered ethnic products, such as cosmetics, watches, and shoes (Usunier & Cestre, 2007). This suggestion was corroborated in the main analysis: the majority of the ads in the corpus referred to a product that Usunier and Cestre (2007) classify as ethnic products. Table 2 provides a description of the product categories and brands referred to in the advertisements. Ethnic products have a prototypical COO in the minds of consumers (e.g., cosmetics – France), which makes it likely that the COOs are highlighted through the use of COO markers.

    Cosmopolitan is an international magazine that has different local editions in the three countries. The magazine, which is targeted at younger women (18–35 years old), reaches more than three million young women per month through its online, social and print platforms in the Netherlands (Hearst Netherlands, 2016), has about 517,000 readers per month in Spain (PrNoticias, 2016) and about 1.18 million readers per month in the UK (Hearst Magazine U.K., 2016).

    The sample consisted of all advertisements from all monthly issues that appeared in 2016 in the three countries. This whole-year cluster was selected so as to prevent potential seasonal influences (Neuendorf, 2002). In total, the corpus consisted of 745 advertisements, of which 111 were from the Dutch, 367 from the British and 267 from the Spanish Cosmopolitan. Two categories of ads were excluded in the selection process: (1) advertisements for subscription to Cosmopolitan itself, and (2) advertisements that were identical to ads that had appeared in another issue in one of the three countries. As a result, each advertisement was unique.

    Coding procedure

    For all advertisements, four variables were coded: product type, presence of types of COO markers, COO referred to, and the use of English as a COO marker. In the first place, product type was assessed by the two coders. Coders classified each product to one of the 32 product types. In order to assess the reliability of the codings, ten per cent of the ads were independently coded by a second coder. The interrater reliability of the variable product category was good (κ = .97, p < .000, 97.33% agreement between both coders). Table 2 lists the most frequent product types; the label ‘other’ covers 17 types of product, including charity, education, and furniture.

    In the second place, it was recorded whether one or more of the COO markers occurred in a given ad. In the third place, if a marker was identified, it was assessed to which COO the markers referred. Table 1 lists the nine possible COO markers defined by Aichner (2014) and the COOs referred to, with examples taken from the current content analysis. The interrater reliability for the type of COO marker was very good (κ = .80, p < .000, 96.30% agreement between the coders), and the interrater reliability for COO referred to was excellent (κ = 1.00, p < .000).

    After the independent assessments of the two...
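
The coding procedure reports interrater reliability as κ alongside percentage agreement. The sketch below shows how the two statistics relate, assuming the reported κ is Cohen's kappa for two coders (the description does not name the variant). The function name `cohen_kappa` and the sample labels are hypothetical and are not taken from the dataset.

```python
from collections import Counter

def cohen_kappa(coder_a, coder_b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(coder_a)
    # Observed agreement: share of items given the same label by both coders.
    p_observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement expected from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical product-type codes for six ads, for illustration only.
coder_1 = ["cosmetics", "watches", "shoes", "cosmetics", "other", "cosmetics"]
coder_2 = ["cosmetics", "watches", "shoes", "cosmetics", "other", "shoes"]

kappa = cohen_kappa(coder_1, coder_2)
percent_agreement = 100 * sum(a == b for a, b in zip(coder_1, coder_2)) / len(coder_1)
print(f"kappa = {kappa:.2f}, agreement = {percent_agreement:.2f}%")
```

Kappa discounts the agreement that two coders would reach by chance, which is why it can sit well below the raw percentage agreement, as with the type of COO marker (κ = .80 against 96.30% agreement).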

  4. FooDI-ML Dataset

    • paperswithcode.com
    Updated Oct 6, 2021
    + more versions
    Cite
    David Amat Olóndriz; Ponç Palau Puigdevall; Adrià Salvador Palau (2021). FooDI-ML Dataset [Dataset]. https://paperswithcode.com/dataset/foodi-ml
    Dataset updated
    Oct 6, 2021
    Authors
    David Amat Olóndriz; Ponç Palau Puigdevall; Adrià Salvador Palau
    Description

    Food Drinks and groceries Images Multi Lingual (FooDI-ML) is a dataset that contains over 1.5M unique images and over 9.5M store names, product names, descriptions, and collection sections gathered from the Glovo application. The data made available corresponds to food, drinks and groceries products from 37 countries in Europe, the Middle East, Africa and Latin America. The dataset covers 33 languages, including 870K samples of languages of countries from Eastern Europe and Western Asia such as Ukrainian and Kazakh, which have so far been underrepresented in publicly available visiolinguistic datasets. The dataset also includes widely spoken languages such as Spanish and English.

    Description from: FooDI-ML: a large multi-language dataset of food, drinks and groceries images and descriptions

