5 datasets found
  1. Spanish Language Datasets | 1.8M+ Sentences | NLP | TTS | Dictionary Display...

    • datarade.ai
    Updated Jul 11, 2025
    Cite
    Oxford Languages (2025). Spanish Language Datasets | 1.8M+ Sentences | NLP | TTS | Dictionary Display | Game | Translations | European & Latin Amer. Coverage [Dataset]. https://datarade.ai/data-products/spanish-language-datasets-1-8m-sentences-nlp-tts-dic-oxford-languages
    Available download formats
    .csv, .json, .mp3, .txt, .wav, .xls, .xml
    Dataset authored and provided by
    Oxford Languages (https://www.lexico.com/)
    Area covered
    Nicaragua, Honduras, Costa Rica, Paraguay, Colombia, Chile, Bolivia (Plurinational State of), Ecuador, Panama, Cuba
    Description

    Our Spanish language datasets are carefully compiled and annotated by language and linguistic experts. The following are available for licensing:

    1. Spanish Monolingual Dictionary Data
    2. Spanish Bilingual Dictionary Data
    3. Spanish Sentences Data
    4. Synonyms and Antonyms Data
    5. Audio Data
    6. Word list Data

    Key Features (approximate numbers):

    1. Spanish Monolingual Dictionary Data

    Our Spanish monolingual dictionary data offers clear definitions and examples, a large volume of headwords, and comprehensive coverage of the Spanish language.

    • Headwords: 73,000
    • Senses: 123,000
    • Sentence examples: 104,000
    • Format: XML and JSON formats
    • Delivery: Email (link-based file sharing) and REST API
    • Update frequency: annually
    2. Spanish Bilingual Dictionary Data

    The bilingual data provides translations in both directions, from English to Spanish and from Spanish to English. It is reviewed and updated annually by our in-house team of language experts, and it offers significant coverage of the language, with a large volume of high-quality translated words.

    • Translations: 221,300
    • Senses: 103,500
    • Example sentences: 74,500
    • Example translations: 83,800
    • Format: XML and JSON formats
    • Delivery: Email (link-based file sharing) and REST API
    • Update frequency: annually
    3. Spanish Sentences Data

    Spanish sentences retrieved from the corpus, totalling approximately 20 million words, are well suited to NLP model training. The sentences cover a wide range of Spanish-speaking countries, and each is tagged with its country or dialect.

    • Sentences volume: 1,840,000
    • Format: XML and JSON formats
    • Delivery: Email (link-based file sharing) and REST API
    4. Spanish Synonyms and Antonyms Data

    This Spanish language dataset offers a rich collection of synonyms and antonyms, accompanied by detailed definitions and part-of-speech (POS) annotations, making it a comprehensive resource for building linguistically aware AI systems and language technologies.

    • Synonyms: 127,700
    • Antonyms: 9,500
    • Format: XML format
    • Delivery: Email (link-based file sharing)
    • Update frequency: annually
    5. Spanish Audio Data (word-level)

    Curated word-level audio data for the Spanish language, covering all varieties of world Spanish and offering rich dialectal diversity.

    • Audio files: 20,900
    • Format: XLSX (for index), MP3 and WAV (audio files)
    6. Spanish Word List Data

    This language data contains a carefully curated and comprehensive list of 450,000 Spanish words.

    • Wordforms: 450,000
    • Format: CSV and TXT formats
    • Delivery: Email (link-based file sharing)

    Use Cases:

    We consistently work with our clients on new use cases as language technology continues to evolve. These include NLP applications, TTS, dictionary display tools, games, translation, word embedding, and word sense disambiguation (WSD).

    If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Oxford.Languages@oup.com to start the conversation.

    Pricing:

    Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.

    Contact our team or email us at Oxford.Languages@oup.com to explore pricing options and discover how our language data can support your goals.

  2. ckanext-opensearch - Extensions - CKAN Ecosystem Catalog

    • catalog.civicdataecosystem.org
    Updated Jun 4, 2025
    Cite
    (2025). ckanext-opensearch - Extensions - CKAN Ecosystem Catalog [Dataset]. https://catalog.civicdataecosystem.org/dataset/ckanext-opensearch
    Description

    The ckanext-opensearch extension enhances CKAN by adding an OpenSearch interface, enabling machine-to-machine interaction for dataset discovery. It exposes several endpoints providing XML documents compliant with the OpenSearch standard, facilitating complex search workflows that are particularly useful in Earth Observation contexts. These endpoints serve description documents and Atom feeds representing search results, allowing external applications to query and retrieve datasets from the CKAN instance.

    Key Features:

    • OpenSearch Description Documents: Provides the /opensearch/description.xml endpoint to serve OpenSearch Description Documents (OSDD) that define the available search parameters. The osdd parameter dictates whether to retrieve the document for dataset search, collection search, or a specific collection.
    • Dataset Search Endpoint: Offers the /opensearch/search.atom endpoint for performing standard dataset searches, mirroring the functionality of CKAN's package search API. This allows external systems to query datasets using parameters defined in the dataset description document.
    • Collection Search Endpoint: Implements the /opensearch/collection_search.atom endpoint for searching collections of datasets. This supports a two-step search process, enabling users to first discover relevant collections and then search within those collections for specific datasets.
    • Collection-Specific Search: Allows searching within a specific collection, using parameters that may be unique to that collection and are defined in a dedicated description document for each collection ID.
    • TOML Configuration: Uses TOML files (collectionslist.toml, datasetparameters.toml, etc.) to define collections, parameters, validators, and converters, facilitating easy customization and extension of the search interface.
    • Jinja2 Templating: Employs Jinja2 templates to generate the XML description documents and Atom feeds, simplifying the process of aligning the output with various OpenSearch specifications.
    • Namespaces Management: Utilizes namespaces.toml to manage XML namespaces, ensuring that the generated XML documents adhere to relevant standards.

    Use Cases:

    • Earth Observation Data Discovery: Facilitates the "two-step" search process common in Earth Observation, allowing users to first find collections based on high-level criteria and then search within those collections for specific datasets or products. The extension can also be integrated into NextGEOSS projects.
    • Machine-to-Machine Dataset Retrieval: Provides a standard interface for external applications to programmatically query and retrieve datasets from a CKAN instance.

    Technical Integration:

    The extension adds new API endpoints to CKAN. It leverages Jinja2 templating for XML generation, and TOML configuration files to control the parameters, namespaces, and available collections. The extension uses CKAN's plugins architecture to incorporate the new functionality without altering core CKAN code.

    Benefits & Impact:

    This extension enables machine-readable access to CKAN datasets, adhering to the OpenSearch standard. It supports complex, two-step search workflows (particularly relevant in the Earth Observation domain), allowing users to search for collections and then search within those collections using specific parameters. The use of TOML files and Jinja2 templates facilitates maintainability, customization, and compliance with OpenSearch and related standards.
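
    As a rough illustration of the two-step workflow described above, the sketch below only builds request URLs for the three endpoints named in the description; it makes no network calls. The base URL https://ckan.example.org, the osdd value "collection", and the q and collection_id parameter names are assumptions made for illustration. On a real instance, the valid parameters are whatever its description documents declare.

```python
from urllib.parse import urlencode

BASE_URL = "https://ckan.example.org"  # hypothetical CKAN instance


def opensearch_url(endpoint, **params):
    """Build a URL for one of the extension's /opensearch/ endpoints."""
    url = f"{BASE_URL}/opensearch/{endpoint}"
    query = urlencode(params)
    return f"{url}?{query}" if query else url


# Fetch the description document listing the search parameters.
# The value "collection" for osdd is an assumed example.
description = opensearch_url("description.xml", osdd="collection")

# Step 1: discover collections (the parameter name q is an assumption).
collections = opensearch_url("collection_search.atom", q="sentinel")

# Step 2: search for datasets inside one collection
# (collection_id and its value are assumptions).
datasets = opensearch_url("search.atom", collection_id="EXAMPLE", q="flood")

print(description)
print(collections)
print(datasets)
```

    Keeping URL construction separate from the HTTP call makes it easy to check the query against the description document before sending anything.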

  3. New Oxford Dictionary of English, 2nd Edition

    • live.european-language-grid.eu
    • catalog.elra.info
    Updated Dec 6, 2005
    Cite
    (2005). New Oxford Dictionary of English, 2nd Edition [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/2276
    License

    http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    This is Oxford University Press's most comprehensive single-volume dictionary, with 170,000 entries covering all varieties of English worldwide. The NODE data set constitutes a fully integrated range of formal data types suitable for language engineering and NLP applications. It is available in XML or SGML.

    • Source dictionary data. The NODE data set includes all the information present in the New Oxford Dictionary of English itself, such as definition text, example sentences, grammatical indicators, and encyclopaedic material.
    • Morphological data. Each NODE lemma (both headwords and subentries) has a full listing of all possible syntactic forms (e.g. plurals for nouns, inflections for verbs, comparatives and superlatives for adjectives), tagged to show their syntactic relationships. Each form has an IPA pronunciation. Full morphological data is also given for spelling variants (e.g. typical American variants), and a system of links enables straightforward correlation of variant forms to standard forms. The data set thus provides robust support for all look-up routines, and is equally viable for applications dealing with American and British English.
    • Phrases and idioms. The NODE data set provides a rich and flexible codification of over 10,000 phrasal verbs and other multi-word phrases. It features comprehensive lexical resources enabling applications to identify a phrase not only in the form listed in the dictionary but also in a range of real-world variations, including alternative wording, variable syntactic patterns, inflected verbs, optional determiners, etc.
    • Subject classification. Using a categorization scheme of 200 key domains, over 80,000 words and senses have been associated with particular subject areas, from aeronautics to zoology. As well as facilitating the extraction of subject-specific sub-lexicons, this also provides an extensive resource for document categorization and information retrieval.
    • Semantic relationships. The relationships between every noun and noun sense in the dictionary are being codified using an extensive semantic taxonomy on the model of the Princeton WordNet project. (Mapping to WordNet 1.7 is supported.) This structure allows elements of the basic lexical database to function as a formal knowledge database, enabling functionality such as sense disambiguation and logical inference.

    Derived from the detailed and authoritative corpus-based research of Oxford University Press's lexicographic team, the NODE data set is a powerful asset for any task dealing with real-world contemporary English usage. By integrating a number of different data types into a single structure, it creates a coherent resource which can be queried along numerous axes, allowing open-ended exploitation by many kinds of language-related applications.

  4. ckanext-search-tweaks

    • catalog.civicdataecosystem.org
    Updated Jun 4, 2025
    Cite
    (2025). ckanext-search-tweaks [Dataset]. https://catalog.civicdataecosystem.org/dataset/ckanext-search-tweaks
    Description

    The search-tweaks extension for CKAN provides a suite of plugins designed to enhance and customize the search functionality within the CKAN ecosystem. It gives administrators greater control over search results through features such as query-based relevance boosting, field-based prioritization, and spellcheck suggestions. By enabling various plugins, users can significantly refine the search experience and improve data discovery for end-users. Compatible with CKAN 2.9 and later, it offers a modular approach to search customization, allowing for tailored configurations to meet specific needs.

    Key Features:

    • Base Functionality (search_tweaks plugin): Provides essential logic for all other plugins within the extension, including automatically switching to the EDisMax query parser in Solr if none is specified, and enabling the required ISearchTweaks interface.
    • Query Relevance Boosting (search_tweaks_query_relevance plugin): Dynamically promotes datasets in search results based on the frequency of visits after a particular search query. It collects usage statistics (stored in Redis by default) and converts them into a numeric Solr field for boosting dataset scores. A cron job is required to update the search index.
    • Field Relevance Boosting (search_tweaks_field_relevance plugin): Increases the relevance of datasets based on the values of a specified numeric field. This provides control over result rankings without extra coding.
    • Spellcheck Suggestions (search_tweaks_spellcheck plugin): Exposes search suggestions to CKAN templates via Solr's spellcheck component, delivering a "Did you mean?" feature to guide users towards more accurate search terms. This plugin requires substantial configuration within Solr, including modifications to solrconfig.xml and a periodic cron job to update the suggestions dictionary.
    • Configurable Solr Boosting: Allows administrators to define Solr's boost function to tailor search results to specific needs, promoting datasets according to specified criteria.

    Technical Integration:

    The search-tweaks extension integrates deeply with CKAN's search infrastructure, primarily by leveraging Apache Solr. Enabling the spellcheck functionality requires modifying Solr's solrconfig.xml files, defining spellcheck fields in the schema, and deploying cron jobs for updating suggestions. The core search_tweaks plugin also enables the ckanext.search_tweaks.interfaces.ISearchTweaks interface, providing a way for other extensions or custom code to interact with and further modify the search behavior.

    Benefits & Impact:

    Implementing the search-tweaks extension enables CKAN administrators to optimize search functionality, making data discovery more efficient and user-friendly. The extension addresses the challenge of presenting the most relevant datasets to users by incorporating visit statistics, field values, and spellcheck suggestions into the search ranking algorithm. This ultimately leads to improved data accessibility, user satisfaction, and more effective utilization of the CKAN platform.
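
    To make the "switch to EDisMax if none is specified" behaviour and the boost function more concrete, here is a minimal sketch of the idea, not the extension's actual code. It works on a plain dictionary of Solr query parameters; defType and bf are standard Solr parameter names and field() is a standard Solr function query, while the helper name with_search_defaults and the field name popularity are hypothetical.

```python
def with_search_defaults(params, boost_field=None):
    """Return a copy of Solr query params with eDisMax defaults applied."""
    out = dict(params)
    # Only pick a query parser when the caller has not chosen one.
    out.setdefault("defType", "edismax")
    if boost_field is not None:
        # bf is the (e)DisMax additive boost-function parameter;
        # field(x) uses the numeric value of field x as the boost.
        out["bf"] = f"field({boost_field})"
    return out


# "popularity" is a hypothetical numeric field in the Solr schema.
boosted = with_search_defaults({"q": "water quality"}, boost_field="popularity")
explicit = with_search_defaults({"q": "water", "defType": "lucene"})

print(boosted)
print(explicit)
```

    Using setdefault leaves an explicitly requested parser untouched, which matches the "if none is specified" condition in the description.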

  5. PMC-Patients Dataset

    • figshare.com
    application/x-gzip
    Updated Nov 6, 2023
    Cite
    Zhengyun Zhao (2023). PMC-Patients Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.24504115.v1
    Available download formats
    application/x-gzip
    Dataset provided by
    figshare
    Authors
    Zhengyun Zhao
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The core file of our dataset, containing the patient summaries, demographics, and relational annotations.

    PMC-Patients.json

    Patient summaries are presented as a JSON file, which is a list of dictionaries with the following keys:

    • patient_id: string. A continuous id of patients, starting from 0.
    • patient_uid: string. Unique ID for each patient, with format PMID-x, where PMID is the PubMed Identifier of the source article of the note and x denotes the index of the note in the source article.
    • PMID: string. PMID of the source article.
    • file_path: string. File path of the XML file of the source article.
    • title: string. Source article title.
    • patient: string. Patient note.
    • age: list of tuples. Each entry is in the format (value, unit), where value is a float and unit is one of 'year', 'month', 'week', 'day', and 'hour'. For example, [[1.0, 'year'], [2.0, 'month']] indicates that the patient is a one-year- and two-month-old infant.
    • gender: 'M' or 'F'. Male or Female.
    • relevant_articles: dict. The key is the PMID of a relevant article and the corresponding value is its relevance score (2 or 1, as defined in the "Methods" section).
    • similar_patients: dict. The key is the patient_uid of a similar patient and the corresponding value is its similarity score (2 or 1, as defined in the "Methods" section).
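
    The record layout above can be exercised with a short sketch. The sample record below is invented purely to match the documented keys (it is not taken from the dataset), and the unit-to-year conversion factors are approximations chosen for illustration; the helper names age_in_years and top_scored are hypothetical.

```python
import json

# Approximate conversion factors from each documented age unit to years.
UNIT_IN_YEARS = {
    "year": 1.0,
    "month": 1.0 / 12.0,
    "week": 7.0 / 365.25,
    "day": 1.0 / 365.25,
    "hour": 1.0 / (24.0 * 365.25),
}


def age_in_years(age):
    """Collapse a list of (value, unit) pairs into a single age in years."""
    return sum(value * UNIT_IN_YEARS[unit] for value, unit in age)


def top_scored(scores):
    """Return the keys whose relevance or similarity score is 2, sorted."""
    return sorted(key for key, score in scores.items() if score == 2)


# A made-up record using the documented keys; all values are placeholders.
record = json.loads("""
{
  "patient_id": "0",
  "patient_uid": "1234567-1",
  "PMID": "1234567",
  "title": "Example article title",
  "patient": "Example patient note.",
  "age": [[1.0, "year"], [2.0, "month"]],
  "gender": "F",
  "relevant_articles": {"7654321": 2, "1111111": 1},
  "similar_patients": {"7654321-1": 1}
}
""")

print(round(age_in_years(record["age"]), 3))
print(top_scored(record["relevant_articles"]))
```

    For the real file, json.load on the decompressed PMC-Patients.json would yield a list of such dictionaries, so the same helpers could be applied to each element.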
