79 datasets found

l
LScD (Leicester Scientific Dictionary)
figshare.le.ac.uk
docx
Updated Apr 15, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Neslihan Suzen (2020). LScD (Leicester Scientific Dictionary) [Dataset]. http://doi.org/10.25392/leicester.data.9746900.v3
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.25392/leicester.data.9746900.v3
Dataset updated
Apr 15, 2020
Dataset provided by
University of Leicester
Authors
Neslihan Suzen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Leicester
Description
LScD (Leicester Scientific Dictionary)April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk/suzenneslihan@hotmail.com)Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes[Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus) - Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in description of Version 2 below. We did not repeat the explanation. After pre-processing steps, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are also same as described as for LScD Version 2 below.* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2[Version 2] Getting StartedThis document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). This dictionary is created to be used in future work on the quantification of the meaning of research texts. R code for producing the dictionary from LSC and instructions for usage of the code are available in [2]. The code can be also used for list of texts from other sources, amendments to the code may be required.LSC is a collection of abstracts of articles and proceeding papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from publication date to July 2018. The total number of documents in LSC is 1,673,824.LScD is an ordered list of words from texts of abstracts in LSC.The dictionary stores 974,238 unique words, is sorted by the number of documents containing the word in descending order. All words in the LScD are in stemmed form of words. The LScD contains the following information:1.Unique words in abstracts2.Number of documents containing each word3.Number of appearance of a word in the entire corpusProcessing the LSCStep 1.Downloading the LSC Online: Use of the LSC is subject to acceptance of request of the link by email. To access the LSC for research purposes, please email to ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.Step 2.Importing the Corpus to R: The full R code for processing the corpus can be found in the GitHub [2].All following steps can be applied for arbitrary list of texts from any source with changes of parameter. The structure of the corpus such as file format and names (also the position) of fields should be taken into account to apply our code. The organisation of CSV files of LSC is described in README file for LSC [1].Step 3.Extracting Abstracts and Saving Metadata: Metadata that include all fields in a document excluding abstracts and the field of abstracts are separated. Metadata are then saved as MetaData.R. Fields of metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.Step 4.Text Pre-processing Steps on the Collection of Abstracts: In this section, we presented our approaches to pre-process abstracts of the LSC.1.Removing punctuations and special characters: This is the process of substitution of all non-alphanumeric characters by space. We did not substitute the character “-” in this step, because we need to keep words like “z-score”, “non-payment” and “pre-processing” in order not to lose the actual meaning of such words. A processing of uniting prefixes with words are performed in later steps of pre-processing.2.Lowercasing the text data: Lowercasing is performed to avoid considering same words like “Corpus”, “corpus” and “CORPUS” differently. Entire collection of texts are converted to lowercase.3.Uniting prefixes of words: Words containing prefixes joined with character “-” are united as a word. The list of prefixes united for this research are listed in the file “list_of_prefixes.csv”. The most of prefixes are extracted from [4]. We also added commonly used prefixes: ‘e’, ‘extra’, ‘per’, ‘self’ and ‘ultra’.4.Substitution of words: Some of words joined with “-” in the abstracts of the LSC require an additional process of substitution to avoid losing the meaning of the word before removing the character “-”. Some examples of such words are “z-test”, “well-known” and “chi-square”. These words have been substituted to “ztest”, “wellknown” and “chisquare”. Identification of such words is done by sampling of abstracts form LSC. The full list of such words and decision taken for substitution are presented in the file “list_of_substitution.csv”.5.Removing the character “-”: All remaining character “-” are replaced by space.6.Removing numbers: All digits which are not included in a word are replaced by space. All words that contain digits and letters are kept because alphanumeric characters such as chemical formula might be important for our analysis. Some examples are “co2”, “h2o” and “21st”.7.Stemming: Stemming is the process of converting inflected words into their word stem. This step results in uniting several forms of words with similar meaning into one form and also saving memory space and time [5]. All words in the LScD are stemmed to their word stem.8.Stop words removal: Stop words are words that are extreme common but provide little value in a language. Some common stop words in English are ‘I’, ‘the’, ‘a’ etc. We used ‘tm’ package in R to remove stop words [6]. There are 174 English stop words listed in the package.Step 5.Writing the LScD into CSV Format: There are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written in the file “LScD.csv”.The Organisation of the LScDThe total number of words in the file “LScD.csv” is 974,238. Each field is described below:Word: It contains unique words from the corpus. All words are in lowercase and their stem forms. The field is sorted by the number of documents that contain words in descending order.Number of Documents Containing the Word: In this content, binary calculation is used: if a word exists in an abstract then there is a count of 1. If the word exits more than once in a document, the count is still 1. Total number of document containing the word is counted as the sum of 1s in the entire corpus.Number of Appearance in Corpus: It contains how many times a word occurs in the corpus when the corpus is considered as one large document.Instructions for R CodeLScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as RData file and in CSV format. Outputs of the code are:Metadata File: It includes all fields in a document excluding abstracts. Fields are List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.File of Abstracts: It contains all abstracts after pre-processing steps defined in the step 4.DTM: It is the Document Term Matrix constructed from the LSC[6]. Each entry of the matrix is the number of times the word occurs in the corresponding document.LScD: An ordered list of words from LSC as defined in the previous section.The code can be used by:1.Download the folder ‘LSC’, ‘list_of_prefixes.csv’ and ‘list_of_substitution.csv’2.Open LScD_Creation.R script3.Change parameters in the script: replace with the full path of the directory with source files and the full path of the directory to write output files4.Run the full code.References[1]N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1[2]N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION[3]Web of Science. (15 July). Available: https://apps.webofknowledge.com/[4]A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.[5]C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved porter’s stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.[6]I. Feinerer, "Introduction to the tm Package Text Mining in R," Accessible en ligne: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.
E
New Oxford Dictionary of English, 2nd Edition
live.european-language-grid.eu
catalog.elra.info
Updated Dec 6, 2005
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2005). New Oxford Dictionary of English, 2nd Edition [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/2276
Explore at:
Dataset updated
Dec 6, 2005
License
http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdfhttp://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
Description
This is Oxford University Press's most comprehensive single-volume dictionary, with 170,000 entries covering all varieties of English worldwide. The NODE data set constitutes a fully integrated range of formal data types suitable for language engineering and NLP applications: It is available in XML or SGML. - Source dictionary data. The NODE data set includes all the information present in the New Oxford Dictionary of English itself, such as definition text, example sentences, grammatical indicators, and encyclopaedic material. - Morphological data. Each NODE lemma (both headwords and subentries) has a full listing of all possible syntactic forms (e.g. plurals for nouns, inflections for verbs, comparatives and superlatives for adjectives), tagged to show their syntactic relationships. Each form has an IPA pronunciation. Full morphological data is also given for spelling variants (e.g. typical American variants), and a system of links enables straightforward correlation of variant forms to standard forms. The data set thus provides robust support for all look-up routines, and is equally viable for applications dealing with American and British English. - Phrases and idioms. The NODE data set provides a rich and flexible codification of over 10,000 phrasal verbs and other multi-word phrases. It features comprehensive lexical resources enabling applications to identify a phrase not only in the form listed in the dictionary but also in a range of real-world variations, including alternative wording, variable syntactic patterns, inflected verbs, optional determiners, etc. - Subject classification. Using a categorization scheme of 200 key domains, over 80,000 words and senses have been associated with particular subject areas, from aeronautics to zoology. As well as facilitating the extraction of subject-specific sub-lexicons, this also provides an extensive resource for document categorization and information retrieval. - Semantic relationships. The relationships between every noun and noun sense in the dictionary are being codified using an extensive semantic taxonomy on the model of the Princeton WordNet project. (Mapping to WordNet 1.7 is supported.) This structure allows elements of the basic lexical database to function as a formal knowledge database, enabling functionality such as sense disambiguation and logical inference. - Derived from the detailed and authoritative corpus-based research of Oxford University Press's lexicographic team, the NODE data set is a powerful asset for any task dealing with real-world contemporary English usage. By integrating a number of different data types into a single structure, it creates a coherent resource which can be queried along numerous axes, allowing open-ended exploitation by many kinds of language-related applications.
American English Language Datasets | 150+ Years of Research | Textual Data |...
datarade.ai
Updated Jul 29, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Oxford Languages (2025). American English Language Datasets | 150+ Years of Research | Textual Data | NLP | LLMs | TTS | Dictionary Display | Game | US English Coverage [Dataset]. https://datarade.ai/data-products/american-english-language-datasets-150-years-of-research-oxford-languages
Explore at:
.csv, .json, .mp3, .wav, .xls, .xmlAvailable download formats
Dataset updated
Jul 29, 2025
Dataset authored and provided by
Oxford Languageshttps://lexico.com/es
Area covered
United States
Description
One of our flagship datasets, the American English data, is expertly curated and linguistically annotated by professionals, with annual updates to ensure accuracy and relevance. The below datasets in American English are available for license:

American English Monolingual Dictionary Data

American English Synonyms and Antonyms Data

American English Pronunciations with Audio

Key Features (approximate numbers):

American English Monolingual Dictionary Data

Our American English Monolingual Dictionary Data is the foremost authority on American English, including detailed tagging and labelling covering parts of speech (POS), grammar, region, register, and subject, providing rich linguistic information. Additionally, all grammar and usage information is present to ensure relevance and accuracy.

Headwords: 140,000

Senses: 222,000

Sentence examples: 140,000

Format: XML and JSON formats

Delivery: Email (link-based file sharing) and REST API

Updated frequency: annually

American English Synonyms and Antonyms Data

The American English Synonyms and Antonyms Dataset is a leading resource offering comprehensive and up-to-date coverage of word relationships in contemporary American English. It includes rich linguistic detail such as precise definitions and part-of-speech (POS) tags, making it an essential asset for developing AI systems and language technologies that require deep semantic understanding.

Synonyms: 600,000

Antonyms: 22,000

Format: XML and JSON formats

Delivery: Email (link-based file sharing) and REST API

Updated frequency: annually

American English Pronunciations with Audio (word-level)

This dataset provides IPA transcriptions and mapped audio files for words in contemporary American English, with a focus on US speaker usage. It includes syllabified transcriptions, variant spellings, part-of-speech tags, and pronunciation group identifiers. Audio files are supplied separately and linked where available – ideal for TTS, ASR, and pronunciation modeling.

Transcriptions (IPA): 250,000

Audio files: 180,000

Format: XLSX (for transcriptions), MP3 and WAV (audio files)

Updated frequency: annually

Use Cases:

We consistently work with our clients on new use cases as language technology continues to evolve. These include Natural Language Processing (NLP) applications, TTS, dictionary display tools, games, translations, word embedding, and word sense disambiguation (WSD).

If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Growth.OL@oup.com to start the conversation.

Pricing:

Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.

Contact our team or email us at Growth.OL@oup.com to explore pricing options and discover how our language data can support your goals.
British English Language Datasets | 150+ Years of Research | Natural...
datarade.ai
Updated Jul 30, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Oxford Languages (2025). British English Language Datasets | 150+ Years of Research | Natural Language Processing (NLP) Data | LLMs | TTS | Dictionary Display | EU Coverage [Dataset]. https://datarade.ai/data-products/british-english-language-datasets-150-years-of-research-oxford-languages
Explore at:
.json, .xml, .csv, .xls, .mp3, .wavAvailable download formats
Dataset updated
Jul 30, 2025
Dataset authored and provided by
Oxford Languageshttps://lexico.com/es
Area covered
United Kingdom
Description
Our British English language datasets are meticulously curated and annotated by experienced linguistics and language experts, ensuring exceptional accuracy, consistency, and linguistic depth. The below datasets in British English are available for license:

British English Monolingual Dictionary Data

British English Synonyms and Antonyms Data

British English Pronunciations with Audio

Key Features (approximate numbers):

British English Monolingual Dictionary Data

Our British English monolingual dataset delivers clear, reliable definitions and authentic usage examples, featuring a high volume of headwords and in-depth coverage of the British English variant of English. As one of the world’s most authoritative lexical resources, it’s trusted by leading academic, AI, and language technology organizations.

Headwords: 146,000

Senses: 230,000

Sentence examples: 149,000

Format: XML and JSON format

Delivery: Email (link-based file sharing) and REST API

Updated frequency: twice a year

British English Synonyms and Antonyms Data

This British English language dataset offers a rich collection of synonyms and antonyms, accompanied by detailed definitions and part-of-speech (POS) annotations, making it a comprehensive resource for NLP tasks such as semantic search, word sense disambiguation, and language generation.

Synonyms: 600,000

Antonyms: 22,000

Usage Examples: 39,000

Format: XML and JSON format

Delivery: Email (link-based file sharing)

Updated frequency: annually

British English Pronunciations with audio (word-level)

This dataset provides IPA transcriptions and mapped audio files for words in contemporary British English, with a focus on UK speaker usage. It includes syllabified transcriptions, variant spellings, part-of-speech tags, and pronunciation group identifiers. Audio files are supplied separately and linked where available – ideal for TTS, ASR, and pronunciation modeling.

Transcriptions (IPA): 250,000

Audio files: 180,000

Format: XLSX (for transcriptions), MP3 and WAV (audio files)

Updated frequency: annually

Use Cases:

We consistently work with our clients on new use cases as language technology continues to evolve. These include Natural Language Processing (NLP) applications, TTS, dictionary display tools, games, translations, word embedding, and word sense disambiguation (WSD).

If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Growth.OL@oup.com to start the conversation.

Pricing:

Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.

Contact our team or email us at Growth.OL@oup.com to explore pricing options and discover how our language data can support your goals.
o
Armenian-Russian-English Dictionary of Forest Terminology - Dataset - Data...
data.opendata.am
Updated Jul 8, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2023). Armenian-Russian-English Dictionary of Forest Terminology - Dataset - Data Catalog Armenia [Dataset]. https://data.opendata.am/dataset/recc-95f59cfffe224b46a641db49186efe7e
Explore at:
Dataset updated
Jul 8, 2023
Area covered
Armenia
Description
The dictionary includes about 1650 terms and concepts (in Armenian, Russian and English) used in forest and landscaping sectors with a brief explanation in Armenian.Citation: J.H. Vardanyan, H.T. Sayadyan, Armenian-Russian-English Dictionary of Forest Terminology, Publishing House of the Institute of Botany of NAS RA, Yerevan, 2008.
Words for Wordle
kaggle.com
Updated Feb 12, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Advait Kale (2022). Words for Wordle [Dataset]. https://www.kaggle.com/datasets/uniquekale/wordle-words
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 12, 2022
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Advait Kale
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset is a curated list of frequently used English Dictionary Words. This dataset is mainly designed to help a user solve the Wordle Game in a much faster way.
Dictionary Graph
kaggle.com
Updated Sep 7, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
bfbarry (2021). Dictionary Graph [Dataset]. https://www.kaggle.com/bfbarry/dictionary-graph/metadata
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 7, 2021
Dataset provided by
Kaggle
Authors
bfbarry
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Dictionary Network Graph

Contains a graph representation of the the English dictionary, where each word is a node and its edges are defined when a word appears in a definition. The JSON file is of the form: JSON {word: [Each, word, in, its, definition] ... }

Use this dataset to explore the structure of natural language!
n
English Contractor Monthly Orthodontic Activity - Datasets - Open Data...
opendata.nhsbsa.net
Updated Oct 2, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2023). English Contractor Monthly Orthodontic Activity - Datasets - Open Data Portal [Dataset]. https://opendata.nhsbsa.net/dataset/english-contractor-monthly-orthodontic-activity
Explore at:
Dataset updated
Oct 2, 2023
License
Open Government Licence 3.0http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Description
You can read more about this dataset and how to use it in the English Contractor Monthly General Dental and Orthodontic Activity guidance documentation (ODT: 234KB). You can view all definitions for the fields included in the dataset in the English Contractor Monthly Orthodontic Activity Data Dictionary (XLSX: 10KB) We also publish the English Contractor Monthly General Dental Activity dataset.
g
Arabic Pronunciation Dictionary Dataset
gts.ai
json
Updated Feb 13, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
GTS (2023). Arabic Pronunciation Dictionary Dataset [Dataset]. https://gts.ai/case-study/arabic-pronunciation-dictionary-dataset/
Explore at:
jsonAvailable download formats
Dataset updated
Feb 13, 2023
Dataset provided by
GLOBOSE TECHNOLOGY SOLUTIONS PRIVATE LIMITED
Authors
GTS
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Explore the Arabic Pronunciation Dictionary Dataset for accurate pronunciation data, a valuable resource for language learners.
Online Marshallese-English Dictionary
pacific-data.sprep.org
rmi-data.sprep.org
Updated Jul 29, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Secretariat of the Pacific Regional Environment Programme (2025). Online Marshallese-English Dictionary [Dataset]. https://pacific-data.sprep.org/dataset/online-marshallese-english-dictionary
Explore at:
Dataset updated
Jul 29, 2025
Dataset provided by
Pacific Regional Environment Programmehttps://www.sprep.org/
License
http://www.opendefinition.org/licenses/cc-by-sahttp://www.opendefinition.org/licenses/cc-by-sa
Area covered
177.495117 18.604601, 157.807617 2.504085)), 177.495117 2.504085, 157.807617 18.604601, POLYGON ((157.807617 2.504085, Republic of the Marshall Islands
Description
This dataset contains resources that link to Online Marshallese-English Dictionaries.
F
Audio Visual Speech Dataset: American English
futurebeeai.com
wav
Updated Aug 1, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). Audio Visual Speech Dataset: American English [Dataset]. https://www.futurebeeai.com/dataset/multi-modal-dataset/american-english-visual-speech-dataset
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement
Area covered
United States
Dataset funded by
FutureBeeAI
Description
Introduction
Welcome to the US English Language Visual Speech Dataset! This dataset is a collection of diverse, single-person unscripted spoken videos supporting research in visual speech recognition, emotion detection, and multimodal communication.
Dataset Content
This visual speech dataset contains 1000 videos in US English language each paired with a corresponding high-fidelity audio track. Each participant is answering a specific question in a video in an unscripted and spontaneous nature.
•Participant Diversity:
•
Speakers: The dataset includes visual speech data from more than 200 participants from different states/provinces of United States of America.

•
Regions: Ensures a balanced representation of Skip 3 accents, dialects, and demographics.

•
Participant Profile: Participants range from 18 to 70 years old, representing both males and females in a 60:40 ratio, respectively.

Video Data
While recording each video extensive guidelines are kept in mind to maintain the quality and diversity.
•Recording Details:
•
File Duration: Average duration of 30 seconds to 3 minutes per video.

•
Formats: Videos are available in MP4 or MOV format.

•
Resolution: Videos are recorded in ultra-high-definition resolution with 30 fps or above.

•
Device: Both the latest Android and iOS devices are used in this collection.

•
Recording Conditions: Videos were recorded under various conditions to ensure diversity and reduce bias:

•
Indoor and Outdoor Settings: Includes both indoor and outdoor recordings.

•
Lighting Variations: Captures videos in daytime, nighttime, and varying lighting conditions.

•
Camera Positions: Includes handheld and fixed camera positions, as well as portrait and landscape orientations.

•
Face Orientation: Contains straight face and tilted face angles.

•
Participant Positions: Records participants in both standing and seated positions.

•
Motion Variations: Features both stationary and moving videos, where participants pass through different lighting conditions.

•
Occlusions: Includes videos where the participant's face is partially occluded by hand movements, microphones, hair, glasses, and facial hair.

•
Focus: In each video, the participant's face remains in focus throughout the video duration, ensuring the face stays within the video frame.

•
Video Content: In each video, the participant answers a specific question in an unscripted manner. These questions are designed to capture various emotions of participants. The dataset contain videos expressing following human emotions:

•Happy
•Sad
•Excited
•Angry
•Annoyed
•Normal
•
Question Diversity: For each human emotion participant answered a specific question expressing that particular emotion.

Metadata
The dataset provides comprehensive metadata for each video recording and participant:
•
w
Dataset of books called A dictionary of English costume
workwithdata.com
Updated Apr 17, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Work With Data (2025). Dataset of books called A dictionary of English costume [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=A+dictionary+of+English+costume
Explore at:
Dataset updated
Apr 17, 2025
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is about books. It has 2 rows and is filtered where the book is A dictionary of English costume. It features 7 columns including author, publication date, language, and book publisher.
Spanish Language Datasets | 1.8M+ Sentences | Translation Data | TTS |...
datarade.ai
Updated Jul 11, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Oxford Languages (2025). Spanish Language Datasets | 1.8M+ Sentences | Translation Data | TTS | Dictionary Display | Translations | EU & LATAM Coverage [Dataset]. https://datarade.ai/data-products/spanish-language-datasets-1-8m-sentences-nlp-tts-dic-oxford-languages
Explore at:
.json, .xml, .csv, .xls, .txt, .mp3, .wavAvailable download formats
Dataset updated
Jul 11, 2025
Dataset authored and provided by
Oxford Languageshttps://lexico.com/es
Area covered
Colombia, Ecuador, Nicaragua, Chile, Paraguay, Cuba, Costa Rica, Honduras, Panama, Bolivia (Plurinational State of)
Description
Our Spanish language datasets are carefully compiled and annotated by language and linguistic experts; you can find them available for licensing:

Spanish Monolingual Dictionary Data

Spanish Bilingual Dictionary Data

Spanish Sentences Data

Synonyms and Antonyms Data

Audio Data

Spanish Word List Data

Key Features (approximate numbers):

Spanish Monolingual Dictionary Data

Our Spanish monolingual reliably offers clear definitions and examples, a large volume of headwords, and comprehensive coverage of the Spanish language.

Headwords: 73,000

Senses: 123,000

Sentence examples: 104,000

Format: XML and JSON formats

Delivery: Email (link-based file sharing) and REST API

Updated frequency: annually

Spanish Bilingual Dictionary Data

The bilingual data provides translations in both directions, from English to Spanish and from Spanish to English. It is annually reviewed and updated by our in-house team of language experts. Offers significant coverage of the language, providing a large volume of translated words of excellent quality.

Translations: 221,300

Senses: 103,500

Example sentences: 74,500

Example translations: 83,800

Format: XML and JSON formats

Delivery: Email (link-based file sharing) and REST API

Updated frequency: annually

Spanish Sentences Data

Spanish sentences retrieved from the corpus are ideal for NLP model training, presenting approximately 20 million words. The sentences provide a great coverage of Spanish-speaking countries and are accordingly tagged to a particular country or dialect.

Sentences volume: 1,840,000

Format: XML and JSON format

Delivery: Email (link-based file sharing) and REST API

Spanish Synonyms and Antonyms Data

This Spanish language dataset offers a rich collection of synonyms and antonyms, accompanied by detailed definitions and part-of-speech (POS) annotations, making it a comprehensive resource for building linguistically aware AI systems and language technologies.

Synonyms: 127,700

Antonyms: 9,500

Format: XML format

Delivery: Email (link-based file sharing)

Updated frequency: annually

Spanish Audio Data (word-level)

Curated word-level audio data for the Spanish language, which covers all varieties of world Spanish, providing rich dialectal diversity in the Spanish language.

Audio files: 20,900

Format: XLSX (for index), MP3 and WAV (audio files)

Spanish Word List Data

This language data contains a carefully curated and comprehensive list of 450,000 Spanish words.

Wordforms: 450,000

Format: CSV and TXT formats

Delivery: Email (link-based file sharing)

Use Cases:

We consistently work with our clients on new use cases as language technology continues to evolve. These include NLP applications, TTS, dictionary display tools, games, translation, word embedding, and word sense disambiguation (WSD).

If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Oxford.Languages@oup.com to start the conversation.

Pricing:

Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.

Contact our team or email us at Oxford.Languages@oup.com to explore pricing options and discover how our language data can support your goals.
LATAM Data Suite | 1.8M+ Sentences | Natural Language Processing (NLP) Data...
datarade.ai
Updated Jul 22, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Oxford Languages (2025). LATAM Data Suite | 1.8M+ Sentences | Natural Language Processing (NLP) Data | TTS | Dictionary Display | Translation Data | LATAM Coverage [Dataset]. https://datarade.ai/data-products/latam-data-suite-1-8m-sentences-nlp-tts-dictionary-d-oxford-languages
Explore at:
.csv, .json, .mp3, .wav, .xls, .xmlAvailable download formats
Dataset updated
Jul 22, 2025
Dataset authored and provided by
Oxford Languageshttps://lexico.com/es
Area covered
Dominican Republic, Uruguay, Bolivia (Plurinational State of), Colombia, Panama, Puerto Rico, Ecuador, Spain, Peru, Mexico
Description
Discover our expertly curated language datasets in the LATAM Data Suite. Compiled and annotated by language and linguistic experts, this suite offers high-quality resources tailored to your needs. This suite includes:

Monolingual and Bilingual Dictionary Data Featuring headwords, definitions, word senses, part-of-speech (POS) tags, and semantic metadata.

Sentences Curated examples of real-world usage with contextual annotations.

Synonyms & Antonyms Lexical relations to support semantic search, paraphrasing, and language understanding.

Audio Data Native speaker recordings for TTS and pronunciation modeling.

Word Lists Frequency-ranked and thematically grouped lists.

Learn more about the datasets included in the data suite:

Portuguese Monolingual Dictionary Data

Portuguese Bilingual Dictionary Data

Spanish Monolingual Dictionary Data

Spanish Bilingual Dictionary Data

Spanish Sentences Data

Spanish Synonyms and Antonyms Data

Spanish Audio Data

Spanish Word List Data

American English Monolingual Dictionary Data

American English Synonyms and Antonyms Data

American English Pronunciations with Audio

Key Features (approximate numbers):

Portuguese Monolingual Dictionary Data

Our Portuguese monolingual covers both European and Latin American varieties, featuring clear definitions and examples, a large volume of headwords, and comprehensive coverage of the Portuguese language.

Headwords: 143,600

Senses: 285,500

Sentence examples: 69,300

Format: XML format

Delivery: Email (link-based file sharing)

Portuguese Bilingual Dictionary Data

The bilingual data provides translations in both directions, from English to Portuguese and from Portuguese to English. It is annually reviewed and updated by our in-house team of language experts. Offers comprehensive coverage of the language, providing a substantial volume of translated words of excellent quality that span both European and Latin American Portuguese varieties.

Translations: 300,000

Senses: 158,000

Example sentences: 117,800

Format: XML and JSON formats

Delivery: Email (link-based file sharing) and REST API

Updated frequency: annually

Spanish Monolingual Dictionary Data

Our Spanish monolingual reliably offers clear definitions and examples, a large volume of headwords, and comprehensive coverage of the Spanish language.

Headwords: 73,000

Senses: 123,000

Sentence examples: 104,000

Format: XML and JSON formats

Delivery: Email (link-based file sharing) and REST API

Updated frequency: annually

Spanish Bilingual Dictionary Data

The bilingual data provides translations in both directions, from English to Spanish and from Spanish to English. It is annually reviewed and updated by our in-house team of language experts. Offers significant coverage of the language, providing a large volume of translated words of excellent quality.

Translations: 221,300

Senses: 103,500

Example sentences: 74,500

Example translations: 83,800

Format: XML and JSON formats

Delivery: Email (link-based file sharing) and REST API

Updated frequency: annually

Spanish Sentences Data

Spanish sentences retrieved from corpus are ideal for NLP model training, presenting approximately 20 million words. The sentences provide a great coverage of Spanish-speaking countries and are accordingly tagged to a particular country or dialect.

Sentences volume: 1,840,000

Format: XML and JSON formats

Delivery: Email (link-based file sharing) and REST API

Spanish Synonyms and Antonyms Data

This Spanish language dataset offers a rich collection of synonyms and antonyms, accompanied by detailed definitions and part-of-speech (POS) annotations, making it a comprehensive resource for building linguistically aware AI systems and language technologies.

Synonyms: 127,700

Antonyms: 9,500

Format: XML format

Delivery: Email (link-based file sharing)

Updated frequency: annually

Spanish Audio Data (word-level)

Curated word-level audio data for the Spanish language, which covers all varieties of world Spanish, providing rich dialectal diversity in the Spanish language.

Audio files: 20,900

Format: XLSX (for index), MP3 and WAV (audio files)

Spanish Word List Data

This language data contains a carefully curated and comprehensive list of 450,000 Spanish words.

Wordforms: 450,000

Format: CSV and TXT formats

Delivery: Email (link-based file sharing)

American English Monolingual Dictionary Data

Our American English Monolingual Dictionary Data is the foremost authority on American English, including detailed tagging and labelling covering parts of speech (POS), grammar, region, register, and subject, providing rich linguistic information. Additionally, all grammar and usage information is present to ensure rel...
Korean - English Parallel Corpus
kaggle.com
Updated Aug 19, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ramzel Renz Loto (2020). Korean - English Parallel Corpus [Dataset]. https://www.kaggle.com/rareloto/naver-dictionary-conversation-of-the-day/tasks
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 19, 2020
Dataset provided by
Kaggle
Authors
Ramzel Renz Loto
Description
Context

As part of my Korean language learning hobby, I write and type out daily conversations from Naver Conversation of the Day. After getting introduced to data science and machine learning, I wanted to use programming to facilitate my learning process by collecting data and trying out projects. So I scraped data from Naver Dictionary using a Python script to be used later when I train a bilingual AI study buddy chatbot or automate Anki flashcards.

Content

This is a corpus of Korean - English paired conversations parallel text extracted from Naver Dictionary. This dataset consists of 4563 parallel text pairs from December 4, 2017 to August 19, 2020 of Naver's Conversation of the Day. The files and their headers are listed below. * conversations.csv * date - 'Conversation of the Day' date * conversation_id - ordered numbering to indicate conversation flow * kor_sent - Korean sentence * eng_sent - English translation * qna_id - from sender or receiver, message or feedback * conversation_titles.csv * date - 'Conversation of the Day' date * kor_title - 'Conversation of the Day' title in Korean * eng_title - English translation of the title * grammar - grammar of the day * grammar_desc - grammar description

Acknowledgements

The data was collected from Naver Dictionary and the conversations were from the Korean Language Institute of Yonsei University.
German Language Datasets | 393K Translations | NLP | Dictionary Display |...
datarade.ai
Updated Jul 31, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Oxford Languages (2025). German Language Datasets | 393K Translations | NLP | Dictionary Display | Machine Learning (ML) Data | Translations | EU Coverage [Dataset]. https://datarade.ai/data-products/german-language-datasets-393k-translations-nlp-dictiona-oxford-languages
Explore at:
.json, .xml, .csv, .txtAvailable download formats
Dataset updated
Jul 31, 2025
Dataset authored and provided by
Oxford Languageshttps://lexico.com/es
Area covered
Belgium, Austria, Switzerland, Germany, Liechtenstein, Luxembourg
Description
Our German language datasets are carefully compiled and annotated by language and linguistic experts. The below datasets in German are available for license:

German Monolingual Dictionary Data

German Bilingual Dictionary Data

German Word List Data

Key Features (approximate numbers):

German Monolingual Dictionary Data

Our German monolingual features clear definitions, headwords, examples, and comprehensive coverage of the German language spoken today.

Headwords: 85,500

Senses: 78,000

Sentence examples: 55,000

Format: XML format

Delivery: Email (link-based file sharing)

German Bilingual Dictionary Data

The bilingual data provides translations in both directions, from English to German and from German to English. It is annually reviewed and updated by our in-house team of language experts. Offers comprehensive coverage of the language, providing a substantial volume of translated words of excellent quality.

Translations: 393,000

Senses: 207,500

Example sentences: 129,500

Format: XML and JSON formats

Delivery: Email (link-based file sharing) and REST API

Updated frequency: annually

German Word List Data

This language data contains a carefully curated and comprehensive list of 338,000 German words.

Wordforms: 338,000

Format: CSV and TXT formats

Delivery: Email (link-based file sharing)

Use Cases:

We consistently work with our clients on new use cases as language technology continues to evolve. These include Natural Language Processing (NLP) applications, TTS, dictionary display tools, games, translations, word embedding, and word sense disambiguation (WSD).

If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Growth.OL@oup.com to start the conversation.

Pricing:

Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.

Contact our team or email us at Growth.OL@oup.com to explore pricing options and discover how our language data can support your goals.
h
hind_encorp
huggingface.co
opendatalab.com
Updated Mar 22, 2014
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Pavel Rychlý (2014). hind_encorp [Dataset]. https://huggingface.co/datasets/pary/hind_encorp
Explore at:
Dataset updated
Mar 22, 2014
Authors
Pavel Rychlý
License
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0)https://creativecommons.org/licenses/by-nc-sa/3.0/
License information was derived automatically
Description
HindEnCorp parallel texts (sentence-aligned) come from the following sources: Tides, which contains 50K sentence pairs taken mainly from news articles. This dataset was originally col- lected for the DARPA-TIDES surprise-language con- test in 2002, later refined at IIIT Hyderabad and provided for the NLP Tools Contest at ICON 2008 (Venkatapathy, 2008).

Commentaries by Daniel Pipes contain 322 articles in English written by a journalist Daniel Pipes and translated into Hindi.

EMILLE. This corpus (Baker et al., 2002) consists of three components: monolingual, parallel and annotated corpora. There are fourteen monolingual sub- corpora, including both written and (for some lan- guages) spoken data for fourteen South Asian lan- guages. The EMILLE monolingual corpora contain in total 92,799,000 words (including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu). The parallel corpus consists of 200,000 words of text in English and its accompanying translations into Hindi and other languages.

Smaller datasets as collected by Bojar et al. (2010) include the corpus used at ACL 2005 (a subcorpus of EMILLE), a corpus of named entities from Wikipedia (crawled in 2009), and Agriculture domain parallel corpus. For the current release, we are extending the parallel corpus using these sources: Intercorp (Čermák and Rosen,2012) is a large multilingual parallel corpus of 32 languages including Hindi. The central language used for alignment is Czech. Intercorp’s core texts amount to 202 million words. These core texts are most suitable for us because their sentence alignment is manually checked and therefore very reliable. They cover predominately short sto- ries and novels. There are seven Hindi texts in Inter- corp. Unfortunately, only for three of them the English translation is available; the other four are aligned only with Czech texts. The Hindi subcorpus of Intercorp contains 118,000 words in Hindi.

TED talks 3 held in various languages, primarily English, are equipped with transcripts and these are translated into 102 languages. There are 179 talks for which Hindi translation is available.

The Indic multi-parallel corpus (Birch et al., 2011; Post et al., 2012) is a corpus of texts from Wikipedia translated from the respective Indian language into English by non-expert translators hired over Mechanical Turk. The quality is thus somewhat mixed in many respects starting from typesetting and punctuation over capi- talization, spelling, word choice to sentence structure. A little bit of control could be in principle obtained from the fact that every input sentence was translated 4 times. We used the 2012 release of the corpus.

Launchpad.net is a software collaboration platform that hosts many open-source projects and facilitates also collaborative localization of the tools. We downloaded all revisions of all the hosted projects and extracted the localization (.po) files.

Other smaller datasets. This time, we added Wikipedia entities as crawled in 2013 (including any morphological variants of the named entitity that appears on the Hindi variant of the Wikipedia page) and words, word examples and quotes from the Shabdkosh online dictionary.
m
Multidimensional Dataset Of Food Security And Nutrition In Cauca.
data.mendeley.com
Updated Dec 6, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
David Santiago Restrepo (2021). Multidimensional Dataset Of Food Security And Nutrition In Cauca. [Dataset]. http://doi.org/10.17632/wsss65c885.1
Explore at:
Unique identifier
https://doi.org/10.17632/wsss65c885.1
Dataset updated
Dec 6, 2021
Authors
David Santiago Restrepo
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Cauca
Description
A multidimensional dataset created for the department of Cauca based on public data sources is published. The dataset integrates the 4 FAO food security dimensions: physical availability of food, economic and physical access to food, food utilization, and the sustainability of the dimensions mentioned above. It also allows analysis of different variables such as nutritional, socioeconomic, climatic, sociodemographic, among others with statistical techniques or temporal analysis. The dataset can also be used for analysis and extraction of characteristics with computer vision techniques from satellite images, or multimodal machine learning with data of a different nature (images and tabular data).

The dataset Contains the folders: - Multidimensional dataset of Cauca/: Here are the tabular data of the municipalities of the department of Cauca. The folder contains the files: 1. dictionary(English).xlsx: The dictionary of the static variables for each municipality of Cauca in english. 2. dictionary(Español): The dictionary of the static variables for each municipality of Cauca in spanish. 3. dictionary(English).xlsx: The dictionary of the static variables for each municipality of Cauca in english. 4. MultidimensionalDataset_AllMunicipalities.csv: Nutritional, climatic, sociodemographic, socioeconomic and agricultural data of the 42 municipalities of the department of Cauca, although with some null values due to the lack of data in nutrition surveys of some municipalities. - Satellite Images Popayán/: Here are the monthly Landsat 8 satellite images of the municipality of Popayán in Cauca. The folder contains the folders: 1. RGB/: Contains the RGB images of the municipality of Popayán in the department of Cauca. It contains RGB images of Popayán from April 2013 to December 2020 in a resolution of 15 m / px. The title of each image is image year_month.png. 1. 6 Band Images/: Contains images of Landsat 8 using bands 1 to 8 to generate images of the municipality of Popayán in the department of Cauca. It contains 6 band images in a tif format of Popayán from April 2013 to December 2020 in a resolution of 15 m / px. The title of each image is image year_month.tif.
English Contractor Monthly General Dental Activity - Datasets - Open Data...
opendata.nhsbsa.net
Updated Sep 29, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
nhsbsa.net (2023). English Contractor Monthly General Dental Activity - Datasets - Open Data Portal [Dataset]. https://opendata.nhsbsa.net/dataset/english-contractor-monthly-general-dental-activity
Explore at:
Dataset updated
Sep 29, 2023
Dataset provided by
NHS Business Services Authority
License
Open Government Licence 3.0http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Description
The English Contractor Monthly General Dental Activity dataset provides information about the general dental activity carried out by dentists on behalf of the NHS in England. Published monthly, the data is released at individual dental contract level and only relate to contracts that provide General NHS Dental Services. However, they do contain information on all service lines held by that contract in relation to the UDA and FP17s delivered. This dataset allows you to look at the activity provided for each financial year, monthly at national, commissioner and contract level. You can compare the general activity delivered to that commissioned in the contract data. This will give you an idea of how areas of contracts are performing. The dataset will not give the whole picture of NHS dentistry in England as it is possible for other dental services to commissioned locally. You can read more about this dataset and how to use it in the English Contractor Monthly General Dental and Orthodontic Activity guidance documentation (ODT: 234KB). You can view all definitions for the fields included in the dataset in the English Contractor Monthly General Dental Activity Data Dictionary (XLSX: 12KB) We also publish the English Contractor Monthly Orthodontic Dental Activity dataset. Overview of Service
e
Database of Analysed Texts of English - DANTE (ELEXIS) - Dataset - B2FIND
b2find.eudat.eu
Updated Oct 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2023). Database of Analysed Texts of English - DANTE (ELEXIS) - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/13b2b237-8737-52c7-91ec-3e2100e569f7
Explore at:
Dataset updated
Oct 31, 2023
Description
DANTE is a lexicographic project where the end product is not a dictionary but a lexical database resulting from in-depth analysis of corpus data. Its lexical entries offer a thorough, systematic, and fine-grained description of the core vocabulary of English, derived entirely from corpus data. The users of DANTE are not the dictionary-using public but the lexicographic teams who will develop dictionaries and computer lexicons from it. This project is the source-language analysis stage of the New English-Irish Dictionary (NEID). The database covers approximately 50,000 headwords and 45,000 compounds, idioms and phrasal verbs, using over 40 datatypes in their lexical description.

Facebook

Twitter

Click to copy link

Link copied

Cite

Neslihan Suzen (2020). LScD (Leicester Scientific Dictionary) [Dataset]. http://doi.org/10.25392/leicester.data.9746900.v3

LScD (Leicester Scientific Dictionary)

Explore at:

2 scholarly articles cite this dataset (View in Google Scholar)

docxAvailable download formats

Unique identifier

https://doi.org/10.25392/leicester.data.9746900.v3

Dataset updated

Apr 15, 2020

Dataset provided by

University of Leicester

Authors

Neslihan Suzen

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Area covered

Leicester

Description

LScD (Leicester Scientific Dictionary)April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk/suzenneslihan@hotmail.com)Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes[Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus) - Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in description of Version 2 below. We did not repeat the explanation. After pre-processing steps, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are also same as described as for LScD Version 2 below.* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2[Version 2] Getting StartedThis document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). This dictionary is created to be used in future work on the quantification of the meaning of research texts. R code for producing the dictionary from LSC and instructions for usage of the code are available in [2]. The code can be also used for list of texts from other sources, amendments to the code may be required.LSC is a collection of abstracts of articles and proceeding papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from publication date to July 2018. The total number of documents in LSC is 1,673,824.LScD is an ordered list of words from texts of abstracts in LSC.The dictionary stores 974,238 unique words, is sorted by the number of documents containing the word in descending order. All words in the LScD are in stemmed form of words. The LScD contains the following information:1.Unique words in abstracts2.Number of documents containing each word3.Number of appearance of a word in the entire corpusProcessing the LSCStep 1.Downloading the LSC Online: Use of the LSC is subject to acceptance of request of the link by email. To access the LSC for research purposes, please email to ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.Step 2.Importing the Corpus to R: The full R code for processing the corpus can be found in the GitHub [2].All following steps can be applied for arbitrary list of texts from any source with changes of parameter. The structure of the corpus such as file format and names (also the position) of fields should be taken into account to apply our code. The organisation of CSV files of LSC is described in README file for LSC [1].Step 3.Extracting Abstracts and Saving Metadata: Metadata that include all fields in a document excluding abstracts and the field of abstracts are separated. Metadata are then saved as MetaData.R. Fields of metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.Step 4.Text Pre-processing Steps on the Collection of Abstracts: In this section, we presented our approaches to pre-process abstracts of the LSC.1.Removing punctuations and special characters: This is the process of substitution of all non-alphanumeric characters by space. We did not substitute the character “-” in this step, because we need to keep words like “z-score”, “non-payment” and “pre-processing” in order not to lose the actual meaning of such words. A processing of uniting prefixes with words are performed in later steps of pre-processing.2.Lowercasing the text data: Lowercasing is performed to avoid considering same words like “Corpus”, “corpus” and “CORPUS” differently. Entire collection of texts are converted to lowercase.3.Uniting prefixes of words: Words containing prefixes joined with character “-” are united as a word. The list of prefixes united for this research are listed in the file “list_of_prefixes.csv”. The most of prefixes are extracted from [4]. We also added commonly used prefixes: ‘e’, ‘extra’, ‘per’, ‘self’ and ‘ultra’.4.Substitution of words: Some of words joined with “-” in the abstracts of the LSC require an additional process of substitution to avoid losing the meaning of the word before removing the character “-”. Some examples of such words are “z-test”, “well-known” and “chi-square”. These words have been substituted to “ztest”, “wellknown” and “chisquare”. Identification of such words is done by sampling of abstracts form LSC. The full list of such words and decision taken for substitution are presented in the file “list_of_substitution.csv”.5.Removing the character “-”: All remaining character “-” are replaced by space.6.Removing numbers: All digits which are not included in a word are replaced by space. All words that contain digits and letters are kept because alphanumeric characters such as chemical formula might be important for our analysis. Some examples are “co2”, “h2o” and “21st”.7.Stemming: Stemming is the process of converting inflected words into their word stem. This step results in uniting several forms of words with similar meaning into one form and also saving memory space and time [5]. All words in the LScD are stemmed to their word stem.8.Stop words removal: Stop words are words that are extreme common but provide little value in a language. Some common stop words in English are ‘I’, ‘the’, ‘a’ etc. We used ‘tm’ package in R to remove stop words [6]. There are 174 English stop words listed in the package.Step 5.Writing the LScD into CSV Format: There are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written in the file “LScD.csv”.The Organisation of the LScDThe total number of words in the file “LScD.csv” is 974,238. Each field is described below:Word: It contains unique words from the corpus. All words are in lowercase and their stem forms. The field is sorted by the number of documents that contain words in descending order.Number of Documents Containing the Word: In this content, binary calculation is used: if a word exists in an abstract then there is a count of 1. If the word exits more than once in a document, the count is still 1. Total number of document containing the word is counted as the sum of 1s in the entire corpus.Number of Appearance in Corpus: It contains how many times a word occurs in the corpus when the corpus is considered as one large document.Instructions for R CodeLScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as RData file and in CSV format. Outputs of the code are:Metadata File: It includes all fields in a document excluding abstracts. Fields are List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.File of Abstracts: It contains all abstracts after pre-processing steps defined in the step 4.DTM: It is the Document Term Matrix constructed from the LSC[6]. Each entry of the matrix is the number of times the word occurs in the corresponding document.LScD: An ordered list of words from LSC as defined in the previous section.The code can be used by:1.Download the folder ‘LSC’, ‘list_of_prefixes.csv’ and ‘list_of_substitution.csv’2.Open LScD_Creation.R script3.Change parameters in the script: replace with the full path of the directory with source files and the full path of the directory to write output files4.Run the full code.References[1]N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1[2]N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION[3]Web of Science. (15 July). Available: https://apps.webofknowledge.com/[4]A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.[5]C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved porter’s stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.[6]I. Feinerer, "Introduction to the tm Package Text Mining in R," Accessible en ligne: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.

Clear search

Close search

Google apps

Main menu

LScD (Leicester Scientific Dictionary)

New Oxford Dictionary of English, 2nd Edition

American English Language Datasets | 150+ Years of Research | Textual Data |...

British English Language Datasets | 150+ Years of Research | Natural...

Armenian-Russian-English Dictionary of Forest Terminology - Dataset - Data...

Words for Wordle

Dictionary Graph

Dictionary Network Graph

English Contractor Monthly Orthodontic Activity - Datasets - Open Data...

Arabic Pronunciation Dictionary Dataset

Online Marshallese-English Dictionary

Audio Visual Speech Dataset: American English

Introduction

Dataset Content

Video Data

Metadata

Dataset of books called A dictionary of English costume

Spanish Language Datasets | 1.8M+ Sentences | Translation Data | TTS |...

LATAM Data Suite | 1.8M+ Sentences | Natural Language Processing (NLP) Data...

Korean - English Parallel Corpus

Context

Content

Acknowledgements

German Language Datasets | 393K Translations | NLP | Dictionary Display |...

hind_encorp

Multidimensional Dataset Of Food Security And Nutrition In Cauca.

English Contractor Monthly General Dental Activity - Datasets - Open Data...

Database of Analysed Texts of English - DANTE (ELEXIS) - Dataset - B2FIND

LScD (Leicester Scientific Dictionary)See More Versions

LScD (Leicester Scientific Dictionary)