53 datasets found
  1. E

    New Oxford Dictionary of English, 2nd Edition

    • live.european-language-grid.eu
    • catalog.elra.info
    Updated Dec 6, 2005
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2005). New Oxford Dictionary of English, 2nd Edition [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/2276
    Explore at:
    Dataset updated
    Dec 6, 2005
    License

    http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdfhttp://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    This is Oxford University Press's most comprehensive single-volume dictionary, with 170,000 entries covering all varieties of English worldwide. The NODE data set constitutes a fully integrated range of formal data types suitable for language engineering and NLP applications: It is available in XML or SGML. - Source dictionary data. The NODE data set includes all the information present in the New Oxford Dictionary of English itself, such as definition text, example sentences, grammatical indicators, and encyclopaedic material. - Morphological data. Each NODE lemma (both headwords and subentries) has a full listing of all possible syntactic forms (e.g. plurals for nouns, inflections for verbs, comparatives and superlatives for adjectives), tagged to show their syntactic relationships. Each form has an IPA pronunciation. Full morphological data is also given for spelling variants (e.g. typical American variants), and a system of links enables straightforward correlation of variant forms to standard forms. The data set thus provides robust support for all look-up routines, and is equally viable for applications dealing with American and British English. - Phrases and idioms. The NODE data set provides a rich and flexible codification of over 10,000 phrasal verbs and other multi-word phrases. It features comprehensive lexical resources enabling applications to identify a phrase not only in the form listed in the dictionary but also in a range of real-world variations, including alternative wording, variable syntactic patterns, inflected verbs, optional determiners, etc. - Subject classification. Using a categorization scheme of 200 key domains, over 80,000 words and senses have been associated with particular subject areas, from aeronautics to zoology. As well as facilitating the extraction of subject-specific sub-lexicons, this also provides an extensive resource for document categorization and information retrieval. - Semantic relationships. The relationships between every noun and noun sense in the dictionary are being codified using an extensive semantic taxonomy on the model of the Princeton WordNet project. (Mapping to WordNet 1.7 is supported.) This structure allows elements of the basic lexical database to function as a formal knowledge database, enabling functionality such as sense disambiguation and logical inference. - Derived from the detailed and authoritative corpus-based research of Oxford University Press's lexicographic team, the NODE data set is a powerful asset for any task dealing with real-world contemporary English usage. By integrating a number of different data types into a single structure, it creates a coherent resource which can be queried along numerous axes, allowing open-ended exploitation by many kinds of language-related applications.

  2. l

    LScD (Leicester Scientific Dictionary)

    • figshare.le.ac.uk
    docx
    Updated Apr 15, 2020
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Neslihan Suzen (2020). LScD (Leicester Scientific Dictionary) [Dataset]. http://doi.org/10.25392/leicester.data.9746900.v3
    Explore at:
    docxAvailable download formats
    Dataset updated
    Apr 15, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Leicester
    Description

    LScD (Leicester Scientific Dictionary)April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk/suzenneslihan@hotmail.com)Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes[Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus) - Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in description of Version 2 below. We did not repeat the explanation. After pre-processing steps, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are also same as described as for LScD Version 2 below.* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2[Version 2] Getting StartedThis document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). This dictionary is created to be used in future work on the quantification of the meaning of research texts. R code for producing the dictionary from LSC and instructions for usage of the code are available in [2]. The code can be also used for list of texts from other sources, amendments to the code may be required.LSC is a collection of abstracts of articles and proceeding papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from publication date to July 2018. The total number of documents in LSC is 1,673,824.LScD is an ordered list of words from texts of abstracts in LSC.The dictionary stores 974,238 unique words, is sorted by the number of documents containing the word in descending order. All words in the LScD are in stemmed form of words. The LScD contains the following information:1.Unique words in abstracts2.Number of documents containing each word3.Number of appearance of a word in the entire corpusProcessing the LSCStep 1.Downloading the LSC Online: Use of the LSC is subject to acceptance of request of the link by email. To access the LSC for research purposes, please email to ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.Step 2.Importing the Corpus to R: The full R code for processing the corpus can be found in the GitHub [2].All following steps can be applied for arbitrary list of texts from any source with changes of parameter. The structure of the corpus such as file format and names (also the position) of fields should be taken into account to apply our code. The organisation of CSV files of LSC is described in README file for LSC [1].Step 3.Extracting Abstracts and Saving Metadata: Metadata that include all fields in a document excluding abstracts and the field of abstracts are separated. Metadata are then saved as MetaData.R. Fields of metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.Step 4.Text Pre-processing Steps on the Collection of Abstracts: In this section, we presented our approaches to pre-process abstracts of the LSC.1.Removing punctuations and special characters: This is the process of substitution of all non-alphanumeric characters by space. We did not substitute the character “-” in this step, because we need to keep words like “z-score”, “non-payment” and “pre-processing” in order not to lose the actual meaning of such words. A processing of uniting prefixes with words are performed in later steps of pre-processing.2.Lowercasing the text data: Lowercasing is performed to avoid considering same words like “Corpus”, “corpus” and “CORPUS” differently. Entire collection of texts are converted to lowercase.3.Uniting prefixes of words: Words containing prefixes joined with character “-” are united as a word. The list of prefixes united for this research are listed in the file “list_of_prefixes.csv”. The most of prefixes are extracted from [4]. We also added commonly used prefixes: ‘e’, ‘extra’, ‘per’, ‘self’ and ‘ultra’.4.Substitution of words: Some of words joined with “-” in the abstracts of the LSC require an additional process of substitution to avoid losing the meaning of the word before removing the character “-”. Some examples of such words are “z-test”, “well-known” and “chi-square”. These words have been substituted to “ztest”, “wellknown” and “chisquare”. Identification of such words is done by sampling of abstracts form LSC. The full list of such words and decision taken for substitution are presented in the file “list_of_substitution.csv”.5.Removing the character “-”: All remaining character “-” are replaced by space.6.Removing numbers: All digits which are not included in a word are replaced by space. All words that contain digits and letters are kept because alphanumeric characters such as chemical formula might be important for our analysis. Some examples are “co2”, “h2o” and “21st”.7.Stemming: Stemming is the process of converting inflected words into their word stem. This step results in uniting several forms of words with similar meaning into one form and also saving memory space and time [5]. All words in the LScD are stemmed to their word stem.8.Stop words removal: Stop words are words that are extreme common but provide little value in a language. Some common stop words in English are ‘I’, ‘the’, ‘a’ etc. We used ‘tm’ package in R to remove stop words [6]. There are 174 English stop words listed in the package.Step 5.Writing the LScD into CSV Format: There are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written in the file “LScD.csv”.The Organisation of the LScDThe total number of words in the file “LScD.csv” is 974,238. Each field is described below:Word: It contains unique words from the corpus. All words are in lowercase and their stem forms. The field is sorted by the number of documents that contain words in descending order.Number of Documents Containing the Word: In this content, binary calculation is used: if a word exists in an abstract then there is a count of 1. If the word exits more than once in a document, the count is still 1. Total number of document containing the word is counted as the sum of 1s in the entire corpus.Number of Appearance in Corpus: It contains how many times a word occurs in the corpus when the corpus is considered as one large document.Instructions for R CodeLScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as RData file and in CSV format. Outputs of the code are:Metadata File: It includes all fields in a document excluding abstracts. Fields are List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.File of Abstracts: It contains all abstracts after pre-processing steps defined in the step 4.DTM: It is the Document Term Matrix constructed from the LSC[6]. Each entry of the matrix is the number of times the word occurs in the corresponding document.LScD: An ordered list of words from LSC as defined in the previous section.The code can be used by:1.Download the folder ‘LSC’, ‘list_of_prefixes.csv’ and ‘list_of_substitution.csv’2.Open LScD_Creation.R script3.Change parameters in the script: replace with the full path of the directory with source files and the full path of the directory to write output files4.Run the full code.References[1]N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1[2]N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION[3]Web of Science. (15 July). Available: https://apps.webofknowledge.com/[4]A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.[5]C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved porter’s stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.[6]I. Feinerer, "Introduction to the tm Package Text Mining in R," Accessible en ligne: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.

  3. American English Language Datasets | 150+ Years of Research | Textual Data |...

    • datarade.ai
    Updated Jul 29, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Oxford Languages (2025). American English Language Datasets | 150+ Years of Research | Textual Data | NLP | LLMs | TTS | Dictionary Display | Game | US English Coverage [Dataset]. https://datarade.ai/data-products/american-english-language-datasets-150-years-of-research-oxford-languages
    Explore at:
    .csv, .json, .mp3, .wav, .xls, .xmlAvailable download formats
    Dataset updated
    Jul 29, 2025
    Dataset authored and provided by
    Oxford Languageshttps://lexico.com/es
    Area covered
    United States
    Description

    One of our flagship datasets, the American English data, is expertly curated and linguistically annotated by professionals, with annual updates to ensure accuracy and relevance. The below datasets in American English are available for license:

    1. American English Monolingual Dictionary Data
    2. American English Synonyms and Antonyms Data
    3. American English Pronunciations with Audio

    Key Features (approximate numbers):

    1. American English Monolingual Dictionary Data

    Our American English Monolingual Dictionary Data is the foremost authority on American English, including detailed tagging and labelling covering parts of speech (POS), grammar, region, register, and subject, providing rich linguistic information. Additionally, all grammar and usage information is present to ensure relevance and accuracy.

    • Headwords: 140,000
    • Senses: 222,000
    • Sentence examples: 140,000
    • Format: XML and JSON formats
    • Delivery: Email (link-based file sharing) and REST API
    • Updated frequency: annually
    1. American English Synonyms and Antonyms Data

    The American English Synonyms and Antonyms Dataset is a leading resource offering comprehensive and up-to-date coverage of word relationships in contemporary American English. It includes rich linguistic detail such as precise definitions and part-of-speech (POS) tags, making it an essential asset for developing AI systems and language technologies that require deep semantic understanding.

    • Synonyms: 600,000
    • Antonyms: 22,000
    • Format: XML and JSON formats
    • Delivery: Email (link-based file sharing) and REST API
    • Updated frequency: annually
    1. American English Pronunciations with Audio (word-level)

    This dataset provides IPA transcriptions and mapped audio files for words in contemporary American English, with a focus on US speaker usage. It includes syllabified transcriptions, variant spellings, part-of-speech tags, and pronunciation group identifiers. Audio files are supplied separately and linked where available – ideal for TTS, ASR, and pronunciation modeling.

    • Transcriptions (IPA): 250,000
    • Audio files: 180,000
    • Format: XLSX (for transcriptions), MP3 and WAV (audio files)
    • Updated frequency: annually

    Use Cases:

    We consistently work with our clients on new use cases as language technology continues to evolve. These include Natural Language Processing (NLP) applications, TTS, dictionary display tools, games, translations, word embedding, and word sense disambiguation (WSD).

    If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Growth.OL@oup.com to start the conversation.

    Pricing:

    Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.

    Contact our team or email us at Growth.OL@oup.com to explore pricing options and discover how our language data can support your goals.

  4. o

    Armenian-Russian-English Dictionary of Forest Terminology - Dataset - Data...

    • data.opendata.am
    Updated Jul 8, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2023). Armenian-Russian-English Dictionary of Forest Terminology - Dataset - Data Catalog Armenia [Dataset]. https://data.opendata.am/dataset/recc-95f59cfffe224b46a641db49186efe7e
    Explore at:
    Dataset updated
    Jul 8, 2023
    Area covered
    Armenia
    Description

    The dictionary includes about 1650 terms and concepts (in Armenian, Russian and English) used in forest and landscaping sectors with a brief explanation in Armenian.Citation: J.H. Vardanyan, H.T. Sayadyan, Armenian-Russian-English Dictionary of Forest Terminology, Publishing House of the Institute of Botany of NAS RA, Yerevan, 2008.

  5. British English Language Datasets | 150+ Years of Research | Natural...

    • datarade.ai
    Updated Jul 30, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Oxford Languages (2025). British English Language Datasets | 150+ Years of Research | Natural Language Processing (NLP) Data | LLMs | TTS | Dictionary Display | EU Coverage [Dataset]. https://datarade.ai/data-products/british-english-language-datasets-150-years-of-research-oxford-languages
    Explore at:
    .json, .xml, .csv, .xls, .mp3, .wavAvailable download formats
    Dataset updated
    Jul 30, 2025
    Dataset authored and provided by
    Oxford Languageshttps://lexico.com/es
    Area covered
    United Kingdom
    Description

    Our British English language datasets are meticulously curated and annotated by experienced linguistics and language experts, ensuring exceptional accuracy, consistency, and linguistic depth. The below datasets in British English are available for license:

    1. British English Monolingual Dictionary Data
    2. British English Synonyms and Antonyms Data
    3. British English Pronunciations with Audio

    Key Features (approximate numbers):

    1. British English Monolingual Dictionary Data

    Our British English monolingual dataset delivers clear, reliable definitions and authentic usage examples, featuring a high volume of headwords and in-depth coverage of the British English variant of English. As one of the world’s most authoritative lexical resources, it’s trusted by leading academic, AI, and language technology organizations.

    • Headwords: 146,000
    • Senses: 230,000
    • Sentence examples: 149,000
    • Format: XML and JSON format
    • Delivery: Email (link-based file sharing) and REST API
    • Updated frequency: twice a year
    1. British English Synonyms and Antonyms Data

    This British English language dataset offers a rich collection of synonyms and antonyms, accompanied by detailed definitions and part-of-speech (POS) annotations, making it a comprehensive resource for NLP tasks such as semantic search, word sense disambiguation, and language generation.

    • Synonyms: 600,000
    • Antonyms: 22,000
    • Usage Examples: 39,000
    • Format: XML and JSON format
    • Delivery: Email (link-based file sharing)
    • Updated frequency: annually
    1. British English Pronunciations with audio (word-level)

    This dataset provides IPA transcriptions and mapped audio files for words in contemporary British English, with a focus on UK speaker usage. It includes syllabified transcriptions, variant spellings, part-of-speech tags, and pronunciation group identifiers. Audio files are supplied separately and linked where available – ideal for TTS, ASR, and pronunciation modeling.

    • Transcriptions (IPA): 250,000
    • Audio files: 180,000
    • Format: XLSX (for transcriptions), MP3 and WAV (audio files)
    • Updated frequency: annually

    Use Cases:

    We consistently work with our clients on new use cases as language technology continues to evolve. These include Natural Language Processing (NLP) applications, TTS, dictionary display tools, games, translations, word embedding, and word sense disambiguation (WSD).

    If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Growth.OL@oup.com to start the conversation.

    Pricing:

    Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.

    Contact our team or email us at Growth.OL@oup.com to explore pricing options and discover how our language data can support your goals.

  6. Dictionary Graph

    • kaggle.com
    Updated Sep 7, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    bfbarry (2021). Dictionary Graph [Dataset]. https://www.kaggle.com/bfbarry/dictionary-graph/metadata
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 7, 2021
    Dataset provided by
    Kaggle
    Authors
    bfbarry
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dictionary Network Graph

    Contains a graph representation of the the English dictionary, where each word is a node and its edges are defined when a word appears in a definition. The JSON file is of the form: JSON {word: [Each, word, in, its, definition] ... }

    Use this dataset to explore the structure of natural language!

  7. Online Marshallese-English Dictionary

    • pacific-data.sprep.org
    • rmi-data.sprep.org
    Updated Jul 29, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Secretariat of the Pacific Regional Environment Programme (2025). Online Marshallese-English Dictionary [Dataset]. https://pacific-data.sprep.org/dataset/online-marshallese-english-dictionary
    Explore at:
    Dataset updated
    Jul 29, 2025
    Dataset provided by
    Pacific Regional Environment Programmehttps://www.sprep.org/
    License

    http://www.opendefinition.org/licenses/cc-by-sahttp://www.opendefinition.org/licenses/cc-by-sa

    Area covered
    177.495117 18.604601, 157.807617 2.504085)), 157.807617 18.604601, POLYGON ((157.807617 2.504085, 177.495117 2.504085, Republic of the Marshall Islands
    Description

    This dataset contains resources that link to Online Marshallese-English Dictionaries.

  8. n

    English Contractor Monthly Orthodontic Activity - Datasets - Open Data...

    • opendata.nhsbsa.net
    Updated Oct 2, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2023). English Contractor Monthly Orthodontic Activity - Datasets - Open Data Portal [Dataset]. https://opendata.nhsbsa.net/dataset/english-contractor-monthly-orthodontic-activity
    Explore at:
    Dataset updated
    Oct 2, 2023
    License

    Open Government Licence 3.0http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Description

    You can read more about this dataset and how to use it in the English Contractor Monthly General Dental and Orthodontic Activity guidance documentation (ODT: 234KB). You can view all definitions for the fields included in the dataset in the English Contractor Monthly Orthodontic Activity Data Dictionary (XLSX: 10KB) We also publish the English Contractor Monthly General Dental Activity dataset.

  9. LATAM Data Suite | 1.8M+ Sentences | Natural Language Processing (NLP) Data...

    • datarade.ai
    Updated Jul 22, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Oxford Languages (2025). LATAM Data Suite | 1.8M+ Sentences | Natural Language Processing (NLP) Data | TTS | Dictionary Display | Translation Data | LATAM Coverage [Dataset]. https://datarade.ai/data-products/latam-data-suite-1-8m-sentences-nlp-tts-dictionary-d-oxford-languages
    Explore at:
    .csv, .json, .mp3, .wav, .xls, .xmlAvailable download formats
    Dataset updated
    Jul 22, 2025
    Dataset authored and provided by
    Oxford Languageshttps://lexico.com/es
    Area covered
    Spain, Bolivia (Plurinational State of), Uruguay, Colombia, Panama, Ecuador, Dominican Republic, Puerto Rico, Peru, Mexico
    Description

    Discover our expertly curated language datasets in the LATAM Data Suite. Compiled and annotated by language and linguistic experts, this suite offers high-quality resources tailored to your needs. This suite includes:

    • Monolingual and Bilingual Dictionary Data Featuring headwords, definitions, word senses, part-of-speech (POS) tags, and semantic metadata.

    • Sentences Curated examples of real-world usage with contextual annotations.

    • Synonyms & Antonyms Lexical relations to support semantic search, paraphrasing, and language understanding.

    • Audio Data Native speaker recordings for TTS and pronunciation modeling.

    • Word Lists Frequency-ranked and thematically grouped lists.

    Learn more about the datasets included in the data suite:

    1. Portuguese Monolingual Dictionary Data
    2. Portuguese Bilingual Dictionary Data
    3. Spanish Monolingual Dictionary Data
    4. Spanish Bilingual Dictionary Data
    5. Spanish Sentences Data
    6. Spanish Synonyms and Antonyms Data
    7. Spanish Audio Data
    8. Spanish Word List Data
    9. American English Monolingual Dictionary Data
    10. American English Synonyms and Antonyms Data
    11. American English Pronunciations with Audio

    Key Features (approximate numbers):

    1. Portuguese Monolingual Dictionary Data

    Our Portuguese monolingual covers both European and Latin American varieties, featuring clear definitions and examples, a large volume of headwords, and comprehensive coverage of the Portuguese language.

    • Headwords: 143,600
    • Senses: 285,500
    • Sentence examples: 69,300
    • Format: XML format
    • Delivery: Email (link-based file sharing)
    1. Portuguese Bilingual Dictionary Data

    The bilingual data provides translations in both directions, from English to Portuguese and from Portuguese to English. It is annually reviewed and updated by our in-house team of language experts. Offers comprehensive coverage of the language, providing a substantial volume of translated words of excellent quality that span both European and Latin American Portuguese varieties.

    • Translations: 300,000
    • Senses: 158,000
    • Example sentences: 117,800
    • Format: XML and JSON formats
    • Delivery: Email (link-based file sharing) and REST API
    • Updated frequency: annually
    1. Spanish Monolingual Dictionary Data

    Our Spanish monolingual reliably offers clear definitions and examples, a large volume of headwords, and comprehensive coverage of the Spanish language.

    • Headwords: 73,000
    • Senses: 123,000
    • Sentence examples: 104,000
    • Format: XML and JSON formats
    • Delivery: Email (link-based file sharing) and REST API
    • Updated frequency: annually
    1. Spanish Bilingual Dictionary Data

    The bilingual data provides translations in both directions, from English to Spanish and from Spanish to English. It is annually reviewed and updated by our in-house team of language experts. Offers significant coverage of the language, providing a large volume of translated words of excellent quality.

    • Translations: 221,300
    • Senses: 103,500
    • Example sentences: 74,500
    • Example translations: 83,800
    • Format: XML and JSON formats
    • Delivery: Email (link-based file sharing) and REST API
    • Updated frequency: annually
    1. Spanish Sentences Data

    Spanish sentences retrieved from corpus are ideal for NLP model training, presenting approximately 20 million words. The sentences provide a great coverage of Spanish-speaking countries and are accordingly tagged to a particular country or dialect.

    • Sentences volume: 1,840,000
    • Format: XML and JSON formats
    • Delivery: Email (link-based file sharing) and REST API
    1. Spanish Synonyms and Antonyms Data

    This Spanish language dataset offers a rich collection of synonyms and antonyms, accompanied by detailed definitions and part-of-speech (POS) annotations, making it a comprehensive resource for building linguistically aware AI systems and language technologies.

    • Synonyms: 127,700
    • Antonyms: 9,500
    • Format: XML format
    • Delivery: Email (link-based file sharing)
    • Updated frequency: annually
    1. Spanish Audio Data (word-level)

    Curated word-level audio data for the Spanish language, which covers all varieties of world Spanish, providing rich dialectal diversity in the Spanish language.

    • Audio files: 20,900
    • Format: XLSX (for index), MP3 and WAV (audio files)
    1. Spanish Word List Data

    This language data contains a carefully curated and comprehensive list of 450,000 Spanish words.

    • Wordforms: 450,000
    • Format: CSV and TXT formats
    • Delivery: Email (link-based file sharing)
    1. American English Monolingual Dictionary Data

    Our American English Monolingual Dictionary Data is the foremost authority on American English, including detailed tagging and labelling covering parts of speech (POS), grammar, region, register, and subject, providing rich linguistic information. Additionally, all grammar and usage information is present to ensure rel...

  10. l

    Census 21 - English proficiency MSOA

    • data.leicester.gov.uk
    csv, excel, geojson +1
    Updated Aug 22, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2023). Census 21 - English proficiency MSOA [Dataset]. https://data.leicester.gov.uk/explore/dataset/census-21-english-proficiency-msoa/
    Explore at:
    csv, geojson, excel, jsonAvailable download formats
    Dataset updated
    Aug 22, 2023
    License

    Open Government Licence 3.0http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Description

    The census is undertaken by the Office for National Statistics every 10 years and gives us a picture of all the people and households in England and Wales. The most recent census took place in March of 2021.The census asks every household questions about the people who live there and the type of home they live in. In doing so, it helps to build a detailed snapshot of society. Information from the census helps the government and local authorities to plan and fund local services, such as education, doctors' surgeries and roads.Key census statistics for Leicester are published on the open data platform to make information accessible to local services, voluntary and community groups, and residents. There is also a dashboard published showcasing various datasets from the census allowing users to view data for all MSOAs and compare this with Leicester overall statistics.Further information about the census and full datasets can be found on the ONS website - https://www.ons.gov.uk/census/aboutcensus/censusproductsProficiency in EnglishThis dataset provides Census 2021 estimates that classify usual residents in England and Wales by their proficiency in English. The estimates are as at Census Day, 21 March 2021.Definition: How well people whose main language is not English (English or Welsh in Wales) speak English.This dataset provides details for the MSOAs of Leicester city.

  11. g

    Arabic Pronunciation Dictionary Dataset

    • gts.ai
    json
    Updated Feb 13, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    GTS (2023). Arabic Pronunciation Dictionary Dataset [Dataset]. https://gts.ai/case-study/arabic-pronunciation-dictionary-dataset/
    Explore at:
    jsonAvailable download formats
    Dataset updated
    Feb 13, 2023
    Dataset provided by
    GLOBOSE TECHNOLOGY SOLUTIONS PRIVATE LIMITED
    Authors
    GTS
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Explore the Arabic Pronunciation Dictionary Dataset for accurate pronunciation data, a valuable resource for language learners.

  12. Words for Wordle

    • kaggle.com
    Updated Feb 12, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Advait Kale (2022). Words for Wordle [Dataset]. https://www.kaggle.com/datasets/uniquekale/wordle-words
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 12, 2022
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Advait Kale
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is a curated list of frequently used English Dictionary Words. This dataset is mainly designed to help a user solve the Wordle Game in a much faster way.

  13. English Contractor Monthly General Dental Activity - Datasets - Open Data...

    • opendata.nhsbsa.net
    Updated Sep 29, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    nhsbsa.net (2023). English Contractor Monthly General Dental Activity - Datasets - Open Data Portal [Dataset]. https://opendata.nhsbsa.net/dataset/english-contractor-monthly-general-dental-activity
    Explore at:
    Dataset updated
    Sep 29, 2023
    Dataset provided by
    NHS Business Services Authority
    License

    Open Government Licence 3.0http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Description

    The English Contractor Monthly General Dental Activity dataset provides information about the general dental activity carried out by dentists on behalf of the NHS in England. Published monthly, the data is released at individual dental contract level and only relate to contracts that provide General NHS Dental Services. However, they do contain information on all service lines held by that contract in relation to the UDA and FP17s delivered. This dataset allows you to look at the activity provided for each financial year, monthly at national, commissioner and contract level. You can compare the general activity delivered to that commissioned in the contract data. This will give you an idea of how areas of contracts are performing. The dataset will not give the whole picture of NHS dentistry in England as it is possible for other dental services to commissioned locally. You can read more about this dataset and how to use it in the English Contractor Monthly General Dental and Orthodontic Activity guidance documentation (ODT: 234KB). You can view all definitions for the fields included in the dataset in the English Contractor Monthly General Dental Activity Data Dictionary (XLSX: 12KB) We also publish the English Contractor Monthly Orthodontic Dental Activity dataset. Overview of Service

  14. Z

    Dravidian dataset for the Lindgren (2023) study

    • data.niaid.nih.gov
    • zenodo.org
    Updated Mar 7, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Lindgren, Freja (2023). Dravidian dataset for the Lindgren (2023) study [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7668296
    Explore at:
    Dataset updated
    Mar 7, 2023
    Dataset provided by
    Tresoldi, Tiago
    Kotian, Spandana
    Mannby, Emil Magnus
    Lindgren, Freja
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset on Dravidian languages, focusing on Tulu, accompanying the master's thesis of Lindgren (2023). Includes data from the author's fieldwork, contributions from other authors, and data adapted from the Kolipakam et al. (2018) study.

    The author collected a set of lexical items for comparative analysis to address the classification of Tulu, Koraga, and Bellari within the Dravidian language family. This dataset included 114 comparative concepts previously collected for the study of Bellari (Bhat, 1971), the Leipzig-Jakarta list of lexical items, a subset of the 100-word Swadesh list not present in the Leipzig-Jakarta list, and counting words from 1 to 10, and a more extensive list of pronouns, totalling 231 comparative concepts. Word lists were gathered from various sources, including grammars and dictionaries, for Bellari (Bhat, 1971), four Koraga varieties (Onti, Tappu, and Mudu from Bhat, 1971; Ande from Shetty, 2008), Kannada (Kittel, 1894; Učida, Rajapurohit & Takashima, 2018; Spencer, 1950; Zydenbos, 2011; Sridhar, 1990), Malayalam (Moag & Moag, 1967; Asher & Kumari, 1967; Sudha, 1984; Jiang, 2010), Tamil (Borin et al., 2013), Byari (Upadhyaya, 2011), and Pattapu (IRA, 2013), and three Tulu varieties. The Tulu word lists included those from M. M. Bhat (1967), a dictionary of Tulu including words from multiple dialects, a Madhwa Brahmin wordlist collected from Bhatt (1971), and a wordlist from data collected through the author's fieldwork. Data for several other languages, namely Telugu, Koya, Kolami, Gondi, Parji, Ollari Gadba, Kuwi, Kurukh, Malto, Brahui, Yeruva, Kodava, Badga, Toda, Kota, and Betta Kurumba, were included from Kolipakam et al. (2018), as well as additional data for Kannada, Malayalam, Tamil, and Tulu from the same source. The latter’s Kannada, Malayalam, Tamil, and Tulu wordlists were added as separate doculects.

    Concepts not given in Concepticon are marked by an initial asterisk (e.g., "*BETEL LEAF"). Most of these are concepts distinguishing pronouns in the languages, such as marking distinctions between remote and proximate (e.g., "*3SG.I.R" and "*3SG.I.P"), which the author considered essential for comparing the languages also due to the conservativeness of some forms and phonemes (e.g., the presence of /a/ for remote and /i/ for proximate in most languages). The labels for pronoun concepts not given in Concepticon are built with the following constituents:

    "2SG" : second singular

    "3PL" : third plural

    "3SG" : third singular

    "A" : animate

    "F" : female

    "H" : honorific

    "I" : inanimate

    "M" : male

    "P" : proximate

    "R" : remote

    References

    Asher, R. E. & Kumari, T. C. (1997). Malayalam. Descriptive Grammars Series, Descriptive Grammars. London & New York: Routledge.

    Bhat, D. N. S. (1971). The Koraga language. Poona: Deccan College.

    Bhat, M. M. (1967). Tulu-English dictionary. Madras: University of Madras.

    Bhatt, S. L. (1971). A Grammar of Tulu (A Dravidian Language). Ann Arbor: UMI. (Doctoral dissertation, Madison: University of Wisconsin.)

    Borin, L.; Comrie, B. & Saxena, A. (2013). The Intercontinental Dictionary Series – a rich and principled database for language comparison. In Borin, L. & Saxena, A. (eds) Approaches to Measuring Linguistic Differences, 285–302. Berlin: De Gruyter Mouton.

    IRA ISO 639-3 Registration Authority. Change Request Number 2013-020: adopted create ptq. Dallas: SIL International.

    Jiang, H. (2010). Malayalam: a Grammatical Sketch and a Text. Houston: Department of Linguistics, Rice University.

    Kittel, F. (1894). A Kannaḍa-English dictionary. Mangalore: Basel Mission Book and Tract Depository.

    Kolipakam, V.; Jordan, F. M.; Dunn, M.; Greenhill, S. J.; Bouckaert, R.; Gray, R. D. & Verkerk, A. (2018). A Bayesian phylogenetic study of the Dravidian language family. Royal Society open science, 5(3), 171504. http://doi.org/10.1098/rsos.171504

    Moag, R. & Moag, R. (1967). A course in Colloquial Malayalam. Milwaukee, Wisconsin: Peace Corps.

    Shetty, R. (2008). Koraga Grammar. Kuppam: Department of Dravidian Computational Linguistics, Dravidian University.

    Spencer, H. (1950). A Kanarese Grammar. Mysore City: Wesley Press.

    Sridhar, S. N. 1990. Kannada. (Descriptive Grammars Series, Descriptive Grammars.) London & New York: Routledge.

    Sudha, B. B. (1984). Case grammar of standard Malayalam. (Doctoral dissertation, Trivandrum: University of Kerala.)

    Učida, N.; Rajapurohit, B. B. & Takashima, J. (2018). Kannada-English Etymological Dictionary. Tokyo: ILCAA.

    Upadhyaya, S. P. (2011). Beary language: descriptive grammar and comparative study. Mangalore: Karnataka Beary Sahithya Academy.

    Zydenbos, R. (2011). A grammar of Kannada. Ms.

  15. m

    Multidimensional Dataset Of Food Security And Nutrition In Cauca.

    • data.mendeley.com
    Updated Dec 6, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    David Santiago Restrepo (2021). Multidimensional Dataset Of Food Security And Nutrition In Cauca. [Dataset]. http://doi.org/10.17632/wsss65c885.1
    Explore at:
    Dataset updated
    Dec 6, 2021
    Authors
    David Santiago Restrepo
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Cauca
    Description

    A multidimensional dataset created for the department of Cauca based on public data sources is published. The dataset integrates the 4 FAO food security dimensions: physical availability of food, economic and physical access to food, food utilization, and the sustainability of the dimensions mentioned above. It also allows analysis of different variables such as nutritional, socioeconomic, climatic, sociodemographic, among others with statistical techniques or temporal analysis. The dataset can also be used for analysis and extraction of characteristics with computer vision techniques from satellite images, or multimodal machine learning with data of a different nature (images and tabular data).

    The dataset Contains the folders: - Multidimensional dataset of Cauca/: Here are the tabular data of the municipalities of the department of Cauca. The folder contains the files: 1. dictionary(English).xlsx: The dictionary of the static variables for each municipality of Cauca in english. 2. dictionary(Español): The dictionary of the static variables for each municipality of Cauca in spanish. 3. dictionary(English).xlsx: The dictionary of the static variables for each municipality of Cauca in english. 4. MultidimensionalDataset_AllMunicipalities.csv: Nutritional, climatic, sociodemographic, socioeconomic and agricultural data of the 42 municipalities of the department of Cauca, although with some null values due to the lack of data in nutrition surveys of some municipalities. - Satellite Images Popayán/: Here are the monthly Landsat 8 satellite images of the municipality of Popayán in Cauca. The folder contains the folders: 1. RGB/: Contains the RGB images of the municipality of Popayán in the department of Cauca. It contains RGB images of Popayán from April 2013 to December 2020 in a resolution of 15 m / px. The title of each image is image year_month.png. 1. 6 Band Images/: Contains images of Landsat 8 using bands 1 to 8 to generate images of the municipality of Popayán in the department of Cauca. It contains 6 band images in a tif format of Popayán from April 2013 to December 2020 in a resolution of 15 m / px. The title of each image is image year_month.tif.

  16. E

    Data from: HindEnCorp 0.5

    • live.european-language-grid.eu
    binary format
    Updated Mar 20, 2014
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2014). HindEnCorp 0.5 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1076
    Explore at:
    binary formatAvailable download formats
    Dataset updated
    Mar 20, 2014
    License

    Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0)https://creativecommons.org/licenses/by-nc-sa/3.0/
    License information was derived automatically

    Description

    HindEnCorp parallel texts (sentence-aligned) come from the following sources:

    Tides, which contains 50K sentence pairs taken mainly from news articles. This dataset was originally col- lected for the DARPA-TIDES surprise-language con- test in 2002, later refined at IIIT Hyderabad and provided for the NLP Tools Contest at ICON 2008 (Venkatapathy, 2008).

    Commentaries by Daniel Pipes contain 322 articles in English written by a journalist Daniel Pipes and translated into Hindi.

    EMILLE. This corpus (Baker et al., 2002) consists of three components: monolingual, parallel and annotated corpora. There are fourteen monolingual sub- corpora, including both written and (for some lan- guages) spoken data for fourteen South Asian lan- guages. The EMILLE monolingual corpora contain in total 92,799,000 words (including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu). The parallel corpus consists of 200,000 words of text in English and its accompanying translations into Hindi and other languages.

    Smaller datasets as collected by Bojar et al. (2010) include the corpus used at ACL 2005 (a subcorpus of EMILLE), a corpus of named entities from Wikipedia (crawled in 2009), and Agriculture domain parallel corpus.

    For the current release, we are extending the parallel corpus using these sources:

    Intercorp (Čermák and Rosen,2012) is a large multilingual parallel corpus of 32 languages including Hindi. The central language used for alignment is Czech. Intercorp’s core texts amount to 202 million words. These core texts are most suitable for us because their sentence alignment is manually checked and therefore very reliable. They cover predominately short sto- ries and novels. There are seven Hindi texts in Inter- corp. Unfortunately, only for three of them the English translation is available; the other four are aligned only with Czech texts. The Hindi subcorpus of Intercorp contains 118,000 words in Hindi.

    TED talks 3 held in various languages, primarily English, are equipped with transcripts and these are translated into 102 languages. There are 179 talks for which Hindi translation is available.

    The Indic multi-parallel corpus (Birch et al., 2011; Post et al., 2012) is a corpus of texts from Wikipedia translated from the respective Indian language into English by non-expert translators hired over Mechanical Turk. The quality is thus somewhat mixed in many respects starting from typesetting and punctuation over capi- talization, spelling, word choice to sentence structure. A little bit of control could be in principle obtained from the fact that every input sentence was translated 4 times. We used the 2012 release of the corpus.

    Launchpad.net is a software collaboration platform that hosts many open-source projects and facilitates also collaborative localization of the tools. We downloaded all revisions of all the hosted projects and extracted the localization (.po) files.

    Other smaller datasets. This time, we added Wikipedia entities as crawled in 2013 (including any morphological variants of the named entitity that appears on the Hindi variant of the Wikipedia page) and words, word examples and quotes from the Shabdkosh online dictionary.

  17. Spanish Language Datasets | 1.8M+ Sentences | Translation Data | TTS |...

    • datarade.ai
    Updated Jul 11, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Oxford Languages (2025). Spanish Language Datasets | 1.8M+ Sentences | Translation Data | TTS | Dictionary Display | Translations | EU & LATAM Coverage [Dataset]. https://datarade.ai/data-products/spanish-language-datasets-1-8m-sentences-nlp-tts-dic-oxford-languages
    Explore at:
    .json, .xml, .csv, .xls, .txt, .mp3, .wavAvailable download formats
    Dataset updated
    Jul 11, 2025
    Dataset authored and provided by
    Oxford Languageshttps://lexico.com/es
    Area covered
    Chile, Costa Rica, Ecuador, Colombia, Nicaragua, Cuba, Honduras, Paraguay, Bolivia (Plurinational State of), Panama
    Description

    Our Spanish language datasets are carefully compiled and annotated by language and linguistic experts; you can find them available for licensing:

    1. Spanish Monolingual Dictionary Data
    2. Spanish Bilingual Dictionary Data
    3. Spanish Sentences Data
    4. Synonyms and Antonyms Data
    5. Audio Data
    6. Spanish Word List Data

    Key Features (approximate numbers):

    1. Spanish Monolingual Dictionary Data

    Our Spanish monolingual reliably offers clear definitions and examples, a large volume of headwords, and comprehensive coverage of the Spanish language.

    • Headwords: 73,000
    • Senses: 123,000
    • Sentence examples: 104,000
    • Format: XML and JSON formats
    • Delivery: Email (link-based file sharing) and REST API
    • Updated frequency: annually
    1. Spanish Bilingual Dictionary Data

    The bilingual data provides translations in both directions, from English to Spanish and from Spanish to English. It is annually reviewed and updated by our in-house team of language experts. Offers significant coverage of the language, providing a large volume of translated words of excellent quality.

    • Translations: 221,300
    • Senses: 103,500
    • Example sentences: 74,500
    • Example translations: 83,800
    • Format: XML and JSON formats
    • Delivery: Email (link-based file sharing) and REST API
    • Updated frequency: annually
    1. Spanish Sentences Data

    Spanish sentences retrieved from the corpus are ideal for NLP model training, presenting approximately 20 million words. The sentences provide a great coverage of Spanish-speaking countries and are accordingly tagged to a particular country or dialect.

    • Sentences volume: 1,840,000
    • Format: XML and JSON format
    • Delivery: Email (link-based file sharing) and REST API
    1. Spanish Synonyms and Antonyms Data

    This Spanish language dataset offers a rich collection of synonyms and antonyms, accompanied by detailed definitions and part-of-speech (POS) annotations, making it a comprehensive resource for building linguistically aware AI systems and language technologies.

    • Synonyms: 127,700
    • Antonyms: 9,500
    • Format: XML format
    • Delivery: Email (link-based file sharing)
    • Updated frequency: annually
    1. Spanish Audio Data (word-level)

    Curated word-level audio data for the Spanish language, which covers all varieties of world Spanish, providing rich dialectal diversity in the Spanish language.

    • Audio files: 20,900
    • Format: XLSX (for index), MP3 and WAV (audio files)
    1. Spanish Word List Data

    This language data contains a carefully curated and comprehensive list of 450,000 Spanish words.

    • Wordforms: 450,000
    • Format: CSV and TXT formats
    • Delivery: Email (link-based file sharing)

    Use Cases:

    We consistently work with our clients on new use cases as language technology continues to evolve. These include NLP applications, TTS, dictionary display tools, games, translation, word embedding, and word sense disambiguation (WSD).

    If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Oxford.Languages@oup.com to start the conversation.

    Pricing:

    Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.

    Contact our team or email us at Oxford.Languages@oup.com to explore pricing options and discover how our language data can support your goals.

  18. t

    Eurostat web services and dissemination APIs

    • service.tib.eu
    • data.europa.eu
    Updated Jan 8, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2025). Eurostat web services and dissemination APIs [Dataset]. https://service.tib.eu/ldmservice/dataset/eurostat_estat-web-services
    Explore at:
    Dataset updated
    Jan 8, 2025
    Description

    Eurostat data contains many indicators (short-term, structural, theme-specific and others) on the EU-28 and the Eurozone, the Member States and their partners. The database of Eurostat contains always the latest version of the datasets meaning that there is no versioning on the data. Datasets are updated twice a day, at 11:00 and at 23:00, in case new data is available or because of structural change. It is possible to access the datasets through SDMX Web Services, as well as through Json and Unicode Web Services. SDMX Web Services are a programmatic access to Eurostat data, with the possibility to: get a complete list of publicly available datasets; detail the complete structure definition of a given dataset; download a subset of a given dataset or a full dataset. SDMX Web Services: provide access to datasets listed under database by themes, and predefined tables listed under tables by themes; provide data in SDMX 2.0 and 2.1 formats; support both Representation State Transfer (REST) and Simple Object Access Protocol (SOAP) protocols; return responses in English language only; are free of charge. The JSON & UNICODE Web Services are a programmatic access to Eurostat data, with the possibility to download a subset of a given dataset. This operation allows customizing requests for data. You can filter on dimensions to retrieve specific data subsets. The JSON & UNICODE Web Services: provide data in JSON-stat and UNICODE formats; support only Representation State Transfer (REST) protocol; deliver responses in English, French and German language; are free of charge.

  19. German Language Datasets | 393K Translations | NLP | Dictionary Display |...

    • datarade.ai
    Updated Jul 31, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Oxford Languages (2025). German Language Datasets | 393K Translations | NLP | Dictionary Display | Machine Learning (ML) Data | Translations | EU Coverage [Dataset]. https://datarade.ai/data-products/german-language-datasets-393k-translations-nlp-dictiona-oxford-languages
    Explore at:
    .json, .xml, .csv, .txtAvailable download formats
    Dataset updated
    Jul 31, 2025
    Dataset authored and provided by
    Oxford Languageshttps://lexico.com/es
    Area covered
    Belgium, Austria, Germany, Switzerland, Liechtenstein, Luxembourg
    Description

    Our German language datasets are carefully compiled and annotated by language and linguistic experts. The below datasets in German are available for license:

    1. German Monolingual Dictionary Data
    2. German Bilingual Dictionary Data
    3. German Word List Data

    Key Features (approximate numbers):

    1. German Monolingual Dictionary Data

    Our German monolingual features clear definitions, headwords, examples, and comprehensive coverage of the German language spoken today.

    • Headwords: 85,500
    • Senses: 78,000
    • Sentence examples: 55,000
    • Format: XML format
    • Delivery: Email (link-based file sharing)
    1. German Bilingual Dictionary Data

    The bilingual data provides translations in both directions, from English to German and from German to English. It is annually reviewed and updated by our in-house team of language experts. Offers comprehensive coverage of the language, providing a substantial volume of translated words of excellent quality.

    • Translations: 393,000
    • Senses: 207,500
    • Example sentences: 129,500
    • Format: XML and JSON formats
    • Delivery: Email (link-based file sharing) and REST API
    • Updated frequency: annually
    1. German Word List Data

    This language data contains a carefully curated and comprehensive list of 338,000 German words.

    • Wordforms: 338,000
    • Format: CSV and TXT formats
    • Delivery: Email (link-based file sharing)

    Use Cases:

    We consistently work with our clients on new use cases as language technology continues to evolve. These include Natural Language Processing (NLP) applications, TTS, dictionary display tools, games, translations, word embedding, and word sense disambiguation (WSD).

    If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Growth.OL@oup.com to start the conversation.

    Pricing:

    Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.

    Contact our team or email us at Growth.OL@oup.com to explore pricing options and discover how our language data can support your goals.

  20. d

    Historical changes in RÚIAN data for basic data set distributed by...

    • data.gov.cz
    • gimi9.com
    • +1more
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Český úřad zeměměřický a katastrální, Historical changes in RÚIAN data for basic data set distributed by municipalities in the VFR format [Dataset]. https://data.gov.cz/dataset?iri=https%3A%2F%2Fdata.gov.cz%2Fzdroj%2Fdatov%C3%A9-sady%2F00025712%2Fe97dac0271106436b00cde4d04c11c8e
    Explore at:
    Dataset authored and provided by
    Český úřad zeměměřický a katastrální
    Description

    Dataset contains basic descriptive data on territorial elements and units of territorial registration, in which at least one attribute in selected day have changed. Dataset contains no spatial location (polygons, definition lines and centroids of RÚIAN elements). The file contains following elements (in case they have changed): state, cohesion region, higher territorial self-governing entity (VÚSC), municipality with extended competence (ORP), authorized municipal office (POU), regions (old ones – defined in 1960), county, municipality, municipality part, town district (MOMC), Prague city district (MOP), town district of Prague (SOP), cadastral units and basic urban units (ZSJ), streets, building objects and address points. Dataset is provided as Open Data (licence CC-BY 4.0). Data is based on RÚIAN (Register of Territorial Identification, Addresses and Real Estates). Data is created every day (in case any change occurred) in RÚIAN exchange format (VFR), which is based on XML language and fulfils the GML 3.2.1 standard (according to ISO 19136:2007). Dataset is compressed (ZIP) for downloading. More in the Act No. 111/2009 Coll., on the Basic Registers, in Decree No. 359/2011 Coll., on the Basic Register of Territorial Identification, Addresses and Real Estates.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
(2005). New Oxford Dictionary of English, 2nd Edition [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/2276

New Oxford Dictionary of English, 2nd Edition

NODE

Explore at:
9 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Dec 6, 2005
License

http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdfhttp://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

Description

This is Oxford University Press's most comprehensive single-volume dictionary, with 170,000 entries covering all varieties of English worldwide. The NODE data set constitutes a fully integrated range of formal data types suitable for language engineering and NLP applications: It is available in XML or SGML. - Source dictionary data. The NODE data set includes all the information present in the New Oxford Dictionary of English itself, such as definition text, example sentences, grammatical indicators, and encyclopaedic material. - Morphological data. Each NODE lemma (both headwords and subentries) has a full listing of all possible syntactic forms (e.g. plurals for nouns, inflections for verbs, comparatives and superlatives for adjectives), tagged to show their syntactic relationships. Each form has an IPA pronunciation. Full morphological data is also given for spelling variants (e.g. typical American variants), and a system of links enables straightforward correlation of variant forms to standard forms. The data set thus provides robust support for all look-up routines, and is equally viable for applications dealing with American and British English. - Phrases and idioms. The NODE data set provides a rich and flexible codification of over 10,000 phrasal verbs and other multi-word phrases. It features comprehensive lexical resources enabling applications to identify a phrase not only in the form listed in the dictionary but also in a range of real-world variations, including alternative wording, variable syntactic patterns, inflected verbs, optional determiners, etc. - Subject classification. Using a categorization scheme of 200 key domains, over 80,000 words and senses have been associated with particular subject areas, from aeronautics to zoology. As well as facilitating the extraction of subject-specific sub-lexicons, this also provides an extensive resource for document categorization and information retrieval. - Semantic relationships. The relationships between every noun and noun sense in the dictionary are being codified using an extensive semantic taxonomy on the model of the Princeton WordNet project. (Mapping to WordNet 1.7 is supported.) This structure allows elements of the basic lexical database to function as a formal knowledge database, enabling functionality such as sense disambiguation and logical inference. - Derived from the detailed and authoritative corpus-based research of Oxford University Press's lexicographic team, the NODE data set is a powerful asset for any task dealing with real-world contemporary English usage. By integrating a number of different data types into a single structure, it creates a coherent resource which can be queried along numerous axes, allowing open-ended exploitation by many kinds of language-related applications.

Search
Clear search
Close search
Google apps
Main menu