5 datasets found

Spanish Language Datasets | 1.8M+ Sentences | NLP | TTS | Dictionary Display...
datarade.ai
Updated Jul 11, 2025
Cite
Oxford Languages (2025). Spanish Language Datasets | 1.8M+ Sentences | NLP | TTS | Dictionary Display | Game | Translations | European & Latin Amer. Coverage [Dataset]. https://datarade.ai/data-products/spanish-language-datasets-1-8m-sentences-nlp-tts-dic-oxford-languages
Available download formats: .csv, .json, .mp3, .txt, .wav, .xls, .xml
Dataset updated
Jul 11, 2025
Dataset authored and provided by
Oxford Languages (https://www.lexico.com/)
Area covered
Nicaragua, Honduras, Costa Rica, Paraguay, Colombia, Chile, Bolivia (Plurinational State of), Ecuador, Panama, Cuba
Description
Our Spanish language datasets are carefully compiled and annotated by language and linguistic experts; you can find them available for licensing:

Spanish Monolingual Dictionary Data

Spanish Bilingual Dictionary Data

Spanish Sentences Data

Synonyms and Antonyms Data

Audio Data

Word list Data

Key Features (approximate numbers):

Spanish Monolingual Dictionary Data

Our Spanish monolingual dictionary data offers clear definitions and examples, a large volume of headwords, and comprehensive coverage of the Spanish language.

Headwords: 73,000

Senses: 123,000

Sentence examples: 104,000

Format: XML and JSON formats

Delivery: Email (link-based file sharing) and REST API

Update frequency: annually

Spanish Bilingual Dictionary Data

The bilingual data provides translations in both directions, from English to Spanish and from Spanish to English. It is reviewed and updated annually by our in-house team of language experts, and it offers significant coverage of the language, with a large volume of high-quality translations.

Translations: 221,300

Senses: 103,500

Example sentences: 74,500

Example translations: 83,800

Format: XML and JSON formats

Delivery: Email (link-based file sharing) and REST API

Update frequency: annually

Spanish Sentences Data

Spanish sentences retrieved from the corpus are suited to NLP model training and total approximately 20 million words. The sentences cover a wide range of Spanish-speaking countries, and each is tagged with a particular country or dialect.

Sentences volume: 1,840,000

Format: XML and JSON formats

Delivery: Email (link-based file sharing) and REST API

Spanish Synonyms and Antonyms Data

This Spanish language dataset offers a rich collection of synonyms and antonyms, accompanied by detailed definitions and part-of-speech (POS) annotations, making it a comprehensive resource for building linguistically aware AI systems and language technologies.

Synonyms: 127,700

Antonyms: 9,500

Format: XML format

Delivery: Email (link-based file sharing)

Update frequency: annually

Spanish Audio Data (word-level)

Curated word-level audio data covering all varieties of world Spanish, providing rich dialectal diversity.

Audio files: 20,900

Format: XLSX (for index), MP3 and WAV (audio files)

Spanish Word List Data

This language data contains a carefully curated and comprehensive list of 450,000 Spanish words.

Wordforms: 450,000

Format: CSV and TXT formats

Delivery: Email (link-based file sharing)

Use Cases:

We consistently work with our clients on new use cases as language technology continues to evolve. These include NLP applications, TTS, dictionary display tools, games, translation, word embedding, and word sense disambiguation (WSD).

If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Oxford.Languages@oup.com to start the conversation.

Pricing:

Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.

Contact our team or email us at Oxford.Languages@oup.com to explore pricing options and discover how our language data can support your goals.
ckanext-opensearch - Extensions - CKAN Ecosystem Catalog
catalog.civicdataecosystem.org
Updated Jun 4, 2025
Cite
(2025). ckanext-opensearch - Extensions - CKAN Ecosystem Catalog [Dataset]. https://catalog.civicdataecosystem.org/dataset/ckanext-opensearch
Dataset updated
Jun 4, 2025
Description
The ckanext-opensearch extension enhances CKAN by adding an OpenSearch interface, enabling machine-to-machine interaction for dataset discovery. It exposes several endpoints providing XML documents compliant with the OpenSearch standard, facilitating complex search workflows that are particularly useful in Earth Observation contexts. These endpoints serve description documents and Atom feeds representing search results, allowing external applications to query and retrieve datasets from the CKAN instance.

Key Features:

OpenSearch Description Documents: Provides the /opensearch/description.xml endpoint to serve OpenSearch Description Documents (OSDD) that define the available search parameters. The osdd parameter dictates whether to retrieve the document for dataset search, collection search, or a specific collection.

Dataset Search Endpoint: Offers the /opensearch/search.atom endpoint for performing standard dataset searches, mirroring the functionality of CKAN's package search API. This allows external systems to query datasets using parameters defined in the dataset description document.

Collection Search Endpoint: Implements the /opensearch/collection_search.atom endpoint for searching collections of datasets. This supports a two-step search process, enabling users to first discover relevant collections and then search within those collections for specific datasets.

Collection-Specific Search: Allows searching within a specific collection, using parameters that may be unique to that collection, defined in a dedicated description document for each collection ID.

TOML Configuration: Uses TOML files (collectionslist.toml, datasetparameters.toml, etc.) to define collections, parameters, validators, and converters, facilitating easy customization and extension of the search interface.

Jinja2 Templating: Employs Jinja2 templates to generate the XML description documents and Atom feeds, simplifying the process of aligning the output with various OpenSearch specifications.

Namespaces Management: Utilizes namespaces.toml to manage XML namespaces, ensuring that the generated XML documents adhere to relevant standards.

Use Cases:

Earth Observation Data Discovery: Facilitates the "two-step" search process common in Earth Observation, allowing users to first find collections based on high-level criteria and then search within those collections for specific datasets or products. The extension can be integrated into NextGEOSS projects.

Machine-to-Machine Dataset Retrieval: Provides a standard interface for external applications to programmatically query and retrieve datasets from a CKAN instance.

Technical Integration:

The extension adds new API endpoints to CKAN. It leverages Jinja2 templating for XML generation, and TOML configuration files to control the parameters, namespaces, and available collections. The extension uses CKAN's plugins architecture to incorporate the new functionality without altering core CKAN code.

Benefits & Impact:

This extension enables machine-readable access to CKAN datasets, adhering to the OpenSearch standard. It supports complex, two-step search workflows (particularly relevant in the Earth Observation domain), allowing users to search for collections and then search within those collections using specific parameters. The use of TOML files and Jinja2 templates facilitates maintainability, customization, and compliance with OpenSearch and related standards.
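As an illustration of the machine-to-machine workflow described above, the following is a minimal Python sketch of how a client might build a request URL for one of the endpoints and read entry titles from the returned Atom feed. The endpoint paths come from the description; the host name ckan.example.org, the query parameter q, the helper names, and the sample feed are assumptions made for illustration only. The parameters a real instance accepts are defined by its description document.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Standard Atom namespace used by Atom feeds.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def build_search_url(base_url, endpoint="/opensearch/search.atom", **params):
    """Join a CKAN base URL, an OpenSearch endpoint path, and query parameters.

    Parameter names are not validated here; they are assumptions until
    checked against the instance's description document.
    """
    url = base_url.rstrip("/") + endpoint
    query = urlencode(sorted(params.items()))
    return f"{url}?{query}" if query else url


def entry_titles(atom_xml):
    """Return the title of every <entry> in an Atom feed string."""
    root = ET.fromstring(atom_xml)
    return [
        entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        for entry in root.findall("atom:entry", ATOM_NS)
    ]


# Hypothetical feed, shaped like a generic Atom document (not real output).
SAMPLE_FEED = (
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    "<title>Results</title>"
    "<entry><title>Dataset A</title></entry>"
    "<entry><title>Dataset B</title></entry>"
    "</feed>"
)

if __name__ == "__main__":
    # Step 1 of the two-step search: look for collections.
    print(build_search_url("https://ckan.example.org",
                           "/opensearch/collection_search.atom", q="sentinel"))
    # Step 2: search for datasets (parameter name "q" is an assumption).
    print(build_search_url("https://ckan.example.org", q="sentinel"))
    print(entry_titles(SAMPLE_FEED))
```

A real client would fetch /opensearch/description.xml first and take the parameter names from it rather than hard-coding them.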
New Oxford Dictionary of English, 2nd Edition
live.european-language-grid.eu
catalog.elra.info
Updated Dec 6, 2005
Cite
(2005). New Oxford Dictionary of English, 2nd Edition [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/2276
Dataset updated
Dec 6, 2005
License
http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
Description
This is Oxford University Press's most comprehensive single-volume dictionary, with 170,000 entries covering all varieties of English worldwide. The NODE data set constitutes a fully integrated range of formal data types suitable for language engineering and NLP applications. It is available in XML or SGML.

- Source dictionary data. The NODE data set includes all the information present in the New Oxford Dictionary of English itself, such as definition text, example sentences, grammatical indicators, and encyclopaedic material.

- Morphological data. Each NODE lemma (both headwords and subentries) has a full listing of all possible syntactic forms (e.g. plurals for nouns, inflections for verbs, comparatives and superlatives for adjectives), tagged to show their syntactic relationships. Each form has an IPA pronunciation. Full morphological data is also given for spelling variants (e.g. typical American variants), and a system of links enables straightforward correlation of variant forms to standard forms. The data set thus provides robust support for all look-up routines, and is equally viable for applications dealing with American and British English.

- Phrases and idioms. The NODE data set provides a rich and flexible codification of over 10,000 phrasal verbs and other multi-word phrases. It features comprehensive lexical resources enabling applications to identify a phrase not only in the form listed in the dictionary but also in a range of real-world variations, including alternative wording, variable syntactic patterns, inflected verbs, optional determiners, etc.

- Subject classification. Using a categorization scheme of 200 key domains, over 80,000 words and senses have been associated with particular subject areas, from aeronautics to zoology. As well as facilitating the extraction of subject-specific sub-lexicons, this also provides an extensive resource for document categorization and information retrieval.

- Semantic relationships. The relationships between every noun and noun sense in the dictionary are being codified using an extensive semantic taxonomy on the model of the Princeton WordNet project. (Mapping to WordNet 1.7 is supported.) This structure allows elements of the basic lexical database to function as a formal knowledge database, enabling functionality such as sense disambiguation and logical inference.

Derived from the detailed and authoritative corpus-based research of Oxford University Press's lexicographic team, the NODE data set is a powerful asset for any task dealing with real-world contemporary English usage. By integrating a number of different data types into a single structure, it creates a coherent resource which can be queried along numerous axes, allowing open-ended exploitation by many kinds of language-related applications.
ckanext-search-tweaks
catalog.civicdataecosystem.org
Updated Jun 4, 2025
Cite
(2025). ckanext-search-tweaks [Dataset]. https://catalog.civicdataecosystem.org/dataset/ckanext-search-tweaks
Dataset updated
Jun 4, 2025
Description
The search-tweaks extension for CKAN provides a suite of plugins designed to enhance and customize the search functionality within the CKAN ecosystem. This extension gives administrators greater control over search results through features such as query-based relevance boosting, field-based prioritization, and spellcheck suggestions. By enabling various plugins, users can refine the search experience and improve data discovery for end-users. Compatible with CKAN 2.9 and later, it offers a modular approach to search customization, allowing for tailored configurations to meet specific needs.

Key Features:

Base Functionality (search_tweaks plugin): Provides essential logic for all other plugins within the extension, including automatically switching to the EDisMax query parser in Solr if none is specified, and enabling the required ISearchTweaks interface.

Query Relevance Boosting (search_tweaks_query_relevance plugin): Dynamically promotes datasets in search results based on the frequency of visits after a particular search query. It collects usage statistics (stored in Redis by default) and converts them into a numeric Solr field for boosting dataset scores. A cron job is required to update the search index.

Field Relevance Boosting (search_tweaks_field_relevance plugin): Increases the relevance of datasets based on the values of a specified numeric field. This provides control over result rankings without extra coding.

Spellcheck Suggestions (search_tweaks_spellcheck plugin): Exposes search suggestions to CKAN templates via Solr's spellcheck component, delivering a "Did you mean?" feature to guide users towards more accurate search terms. This plugin requires substantial configuration within Solr, including modifications to solrconfig.xml and a periodic cron job to update the suggestions dictionary.

Configurable Solr Boosting: Allows administrators to define Solr's boost function to tailor search results, promoting datasets according to specified criteria.

Technical Integration:

The search-tweaks extension integrates deeply with CKAN's search infrastructure, primarily by leveraging Apache Solr. Enabling the spellcheck functionality requires modifying Solr's solrconfig.xml files, defining spellcheck fields in the schema, and deploying cron jobs for updating suggestions. The core search_tweaks plugin also enables the ckanext.search_tweaks.interfaces.ISearchTweaks interface, providing a way for other extensions or custom code to interact with and further modify the search behavior.

Benefits & Impact:

Implementing the search-tweaks extension enables CKAN administrators to optimize search functionality, making data discovery more efficient and user-friendly. The extension addresses the challenge of presenting the most relevant datasets to users by incorporating visit statistics, field values, and spellcheck suggestions into the search ranking algorithm. This leads to improved data accessibility, user satisfaction, and more effective utilization of the CKAN platform.
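The idea behind query relevance boosting (count visits that follow a query, then turn the count into a numeric boost) can be sketched in a few lines of Python. This is not the extension's code: the in-memory VisitStore stands in for the Redis-backed statistics, and the multiplicative formula, the weight value, and all names below are assumptions chosen only to show the principle. In the extension itself the boost is applied by Solr through a numeric field.

```python
class VisitStore:
    """In-memory stand-in for the usage statistics store (hypothetical)."""

    def __init__(self):
        self._visits = {}

    @staticmethod
    def _key(query, dataset_id):
        # Normalise the query so "Rainfall " and "rainfall" share a counter.
        return (query.strip().lower(), dataset_id)

    def record(self, query, dataset_id):
        """Register one visit to dataset_id made after searching for query."""
        key = self._key(query, dataset_id)
        self._visits[key] = self._visits.get(key, 0) + 1

    def count(self, query, dataset_id):
        return self._visits.get(self._key(query, dataset_id), 0)


def boosted_score(base_score, visits, weight=0.1):
    """Scale a base relevance score by the visit count (assumed formula)."""
    return base_score * (1.0 + weight * visits)


def rank(query, base_scores, store):
    """Order dataset ids by boosted score, highest first."""
    scored = {
        dataset_id: boosted_score(score, store.count(query, dataset_id))
        for dataset_id, score in base_scores.items()
    }
    return sorted(scored, key=scored.get, reverse=True)


if __name__ == "__main__":
    store = VisitStore()
    for _ in range(5):
        store.record("rainfall", "dataset-b")
    # dataset-b starts with the lower base score but is visited more often.
    print(rank("rainfall", {"dataset-a": 1.0, "dataset-b": 0.8}, store))
```

With five recorded visits, dataset-b moves ahead of dataset-a for the query "rainfall", while rankings for other queries are unaffected.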
PMC-Patients Dataset
figshare.com
application/x-gzip
Updated Nov 6, 2023
Cite
Zhengyun Zhao (2023). PMC-Patients Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.24504115.v1
Available download formats: application/x-gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.24504115.v1
Dataset updated
Nov 6, 2023
Dataset provided by
figshare
Authors
Zhengyun Zhao
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
PMC-Patients Dataset

The core file of our dataset, containing the patient summaries, demographics, and relational annotations.

PMC-Patients.json

Patient summaries are presented as a JSON file, which is a list of dictionaries with the following keys:

- patient_id: string. A continuous id of patients, starting from 0.
- patient_uid: string. Unique ID for each patient, with format PMID-x, where PMID is the PubMed Identifier of the source article of the note and x denotes the index of the note in the source article.
- PMID: string. PMID of the source article.
- file_path: string. File path of the XML file of the source article.
- title: string. Source article title.
- patient: string. Patient note.
- age: list of tuples. Each entry is in the format (value, unit), where value is a float and unit is one of 'year', 'month', 'week', 'day' and 'hour', indicating the age unit. For example, [[1.0, 'year'], [2.0, 'month']] indicates that the patient is a one-year- and two-month-old infant.
- gender: 'M' or 'F'. Male or Female.
- relevant_articles: dict. The key is the PMID of a relevant article and the corresponding value is its relevance score (2 or 1, as defined in the "Methods" section).
- similar_patients: dict. The key is the patient_uid of a similar patient and the corresponding value is its similarity score (2 or 1, as defined in the "Methods" section).
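To make the record layout concrete, here is a minimal Python sketch that reads one record with the keys listed above, converts the age list to a single value in years, and selects the similar patients with the highest score. The sample record contains made-up values for illustration only, and the helper names and unit conversion factors (365.25 days per year, one twelfth of a year per month) are assumptions, not part of the dataset.

```python
import json

# Approximate length of each age unit in days (assumed conversion factors).
DAYS_PER_UNIT = {
    "year": 365.25,
    "month": 365.25 / 12,
    "week": 7.0,
    "day": 1.0,
    "hour": 1.0 / 24,
}


def age_in_years(age):
    """Collapse a list of (value, unit) pairs into one age in years."""
    total_days = sum(value * DAYS_PER_UNIT[unit] for value, unit in age)
    return total_days / DAYS_PER_UNIT["year"]


def similar_patients_with_score(record, min_score=2):
    """Return patient_uids whose similarity score is at least min_score."""
    return sorted(
        uid for uid, score in record["similar_patients"].items()
        if score >= min_score
    )


# Made-up record that follows the documented keys (not real data).
SAMPLE = json.loads("""
[
  {
    "patient_id": "0",
    "patient_uid": "1234567-1",
    "PMID": "1234567",
    "age": [[1.0, "year"], [2.0, "month"]],
    "gender": "M",
    "relevant_articles": {"7654321": 2},
    "similar_patients": {"1234567-2": 2, "2345678-1": 1}
  }
]
""")

if __name__ == "__main__":
    record = SAMPLE[0]
    print(round(age_in_years(record["age"]), 3))
    print(similar_patients_with_score(record))
```

For the real file, the same functions would be applied to the list returned by json.load on PMC-Patients.json.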