Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LScDC Word-Category RIG Matrix
April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com)
Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes

Getting Started
This file describes the Word-Category RIG Matrix for the Leicester Scientific Corpus (LSC) [1] and the procedure used to build the matrix, and introduces the Leicester Scientific Thesaurus (LScT) with its construction process. The Word-Category RIG Matrix is a 103,998 by 252 matrix, where rows correspond to words of the Leicester Scientific Dictionary-Core (LScDC) [2] and columns correspond to 252 Web of Science (WoS) categories [3, 4, 5]. Each entry in the matrix corresponds to a pair (category, word); its value is the Relative Information Gain (RIG) on the belonging of a text from the LSC to the category from observing the word in this text. The CSV file of the Word-Category RIG Matrix in the published archive carries two additional columns: the sum of RIGs over categories and the maximum of RIGs over categories (the last two columns of the matrix). The file ‘Word-Category RIG Matrix.csv’ therefore contains 254 columns in total.
The matrix is created for future research on quantifying meaning in scientific texts, under the assumption that words have scientifically specific meanings in subject categories and that this meaning can be estimated by the information gained about categories from a word. LScT (Leicester Scientific Thesaurus) is a scientific thesaurus of English comprising a list of 5,000 words from the LScDC. We order the words of the LScDC by the sum of their RIGs over categories; that is, words are ranked by their informativeness in the scientific corpus LSC, so the meaningfulness of a word is evaluated by its average informativeness across the categories. The 5,000 most informative words are included in the scientific thesaurus.

Words as a Vector of Frequencies in WoS Categories
Each word of the LScDC is represented as a vector of frequencies in WoS categories. Given the collection of LSC texts, each entry of the vector is the number of texts in the corresponding category that contain the word. Note that texts in a corpus do not necessarily belong to a single category, as they may correspond to multidisciplinary studies, especially in a corpus of scientific texts; in other words, categories may not be exclusive. There are 252 WoS categories, and a text in the LSC is assigned to at least 1 and at most 6 categories. Using a binary calculation of frequencies, we record the presence of a word in a category and create a vector of frequencies for each word, whose dimensions are the categories in the corpus.
The collection of vectors, over all words and categories in the entire corpus, can be shown as a table in which each entry corresponds to a pair (word, category). This table is built for the LScDC with 252 WoS categories and is presented in the published archive with this file. The value of each entry shows how many times a word of the LScDC appears in a WoS category, determined by counting the LSC texts in that category that contain the word.
Words as a Vector of Relative Information Gains Extracted for Categories
In this section, we introduce our approach to representing a word as a vector of relative information gains for categories, under the assumption that the meaning of a word can be quantified by the information it provides about categories. For each category, a function is defined on texts that takes the value 1 if the text belongs to the category, and 0 otherwise. For each word, a function is defined on texts that takes the value 1 if the word belongs to the text, and 0 otherwise. Consider the LSC as a probabilistic sample space (the space of equally probable elementary outcomes). For these Boolean random variables, the joint probability distribution, the entropy and the information gains are defined.
The information gain about the category from the word is the amount of information on the belonging of a text from the LSC to the category obtained from observing the word in the text [6]. We use the Relative Information Gain (RIG), a normalised measure of the Information Gain, which makes information gains comparable across categories. The calculations of entropy, Information Gains and Relative Information Gains can be found in the README file in the published archive. Given a word, we create a vector in which each component corresponds to a category, so each word is represented as a vector of relative information gains whose dimension is the number of categories. The set of vectors forms the Word-Category RIG Matrix, in which each column corresponds to a category, each row corresponds to a word, and each component is the relative information gain from the word to the category. In the Word-Category RIG Matrix, a row vector represents the corresponding word as a vector of RIGs over categories, while a column vector holds the RIGs of all words for an individual category. For any chosen category, words can be ordered by their RIGs from the most informative to the least informative for that category. Beyond ordering words within each category, words can also be ordered by two global criteria: the sum and the maximum of RIGs over categories. The top n words in such a list can be considered the most informative words in scientific texts. For a given word, the sum and the maximum of RIGs are read off the Word-Category RIG Matrix.
RIGs for each word of the LScDC in the 252 categories are calculated and the word vectors are formed; we then assemble the Word-Category RIG Matrix for the LSC. For each word, the sum (S) and maximum (M) of RIGs over categories are calculated and appended as the last two columns of the matrix. The Word-Category RIG Matrix for the LScDC with 252 categories, together with the sum and the maximum of RIGs over categories, can be found in the database.

Leicester Scientific Thesaurus (LScT)
The Leicester Scientific Thesaurus (LScT) is a list of 5,000 words from the LScDC [2]. Words of the LScDC are sorted in descending order by the sum (S) of RIGs over categories, and the top 5,000 words are selected for inclusion in the LScT. We consider these 5,000 words the most meaningful words in the scientific corpus: the meaningfulness of a word is evaluated by its average informativeness across the categories, and the list of these words is treated as a ‘thesaurus’ for science. The LScT, with the sum values, is provided as a CSV file in the published archive.
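For concreteness, the RIG described above can be computed from four document counts per (word, category) pair. A minimal Python sketch, with illustrative names (the authors' exact formulas are given in the README file of the archive):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def rig(n_texts, n_in_cat, n_with_word, n_both):
    """Relative Information Gain of a category from a word, treating the
    corpus as a space of equally probable texts (see definitions above)."""
    p_cat = n_in_cat / n_texts
    h_cat = entropy([p_cat, 1 - p_cat])          # H(category)
    if h_cat == 0.0:
        return 0.0
    # H(category | word): weighted over word-present and word-absent texts.
    h_cond = 0.0
    for n_w, n_c in ((n_with_word, n_both),
                     (n_texts - n_with_word, n_in_cat - n_both)):
        if n_w > 0:
            h_cond += (n_w / n_texts) * entropy([n_c / n_w, 1 - n_c / n_w])
    return (h_cat - h_cond) / h_cat              # IG / H(category)

# Corpus size from the LSC; the remaining counts are invented for illustration.
print(rig(n_texts=1_673_824, n_in_cat=12_000, n_with_word=850, n_both=400))
```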
The published archive contains the following files:
1) Word_Category_RIG_Matrix.csv: A 103,998 by 254 matrix where the columns are the 252 WoS categories plus the sum (S) and the maximum (M) of RIGs over categories (the last two columns), and the rows are words of the LScDC. Each entry in the first 252 columns is the RIG from the word to the category. Words are ordered as in the LScDC.
2) Word_Category_Frequency_Matrix.csv: A 103,998 by 252 matrix where columns are the 252 WoS categories and rows are words of the LScDC. Each entry is the number of texts in the corresponding category that contain the word. Words are ordered as in the LScDC.
3) LScT.csv: List of words of the LScT with their sum (S) values.
4) Text_No_in_Cat.csv: The number of texts in each category.
5) Categories_in_Documents.csv: List of WoS categories for each document of the LSC.
6) README.txt: Description of the Word-Category RIG Matrix, the Word-Category Frequency Matrix and the LScT, and the procedures used to form them.
7) README.pdf: Same as 6, in PDF format.
References
[1] Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
[2] Suzen, Neslihan (2019): LScDC (Leicester Scientific Dictionary-Core). figshare. Dataset. https://doi.org/10.25392/leicester.data.9896579.v3
[3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[4] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
[5] Suzen, N., Mirkes, E. M., & Gorban, A. N. (2019). LScDC-new large scientific dictionary. arXiv preprint arXiv:1912.06858.
[6] Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379-423.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset consists of 42,052 English words and their corresponding definitions. It is a comprehensive collection of words ranging from common terms to more obscure vocabulary. The dataset is ideal for Natural Language Processing (NLP) tasks, educational tools, and various language-related applications.
This dataset is well-suited for a range of language-related use cases.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description: This dataset contains images of the numbers one to fifty written as words, in varying letter cases (one, ONE, One, two, TWO, Two, …). Each image is stored in its respective folder, named after the word (one, two, three, …).
Content: The dataset includes images of numbers written in words from one to fifty in various formats and styles. Images are provided in JPG, JPEG, and PNG formats.
Usage: This dataset can be used to develop machine learning models for optical character recognition (OCR) or image classification tasks. The goal is to train a model that, given an image containing a number word, predicts the word.
Acknowledgements: This dataset was created for the purpose of solving the problem statement: "Develop a machine-learning model to train with images of numbers written in words from one to fifty."
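Given the folder-per-class layout described above, a standard image-classification loader applies directly. A small PyTorch sketch, where the dataset root path and image size are assumptions:

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Folder-per-class layout: number_words/one/*.jpg, number_words/two/*.png, ...
# "number_words/" is an assumed local path, not part of the dataset itself.
transform = transforms.Compose([
    transforms.Grayscale(),
    transforms.Resize((64, 256)),  # word images are wide; the size is a guess
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder("number_words/", transform=transform)
loader = DataLoader(dataset, batch_size=32, shuffle=True)
print(dataset.classes[:5])  # folder names become labels, sorted alphabetically
```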
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The SlangTrack (ST) Dataset is a novel, meticulously curated resource aimed at addressing the complexities of slang detection in natural language processing. This dataset uniquely emphasizes words that exhibit both slang and non-slang contexts, enabling a binary classification system to distinguish between these dual senses. By providing comprehensive examples for each usage, the dataset supports fine-grained linguistic and computational analysis, catering to both researchers and practitioners in NLP.
These features ensure a robust contextual framework for accurate slang detection and semantic analysis.
The target words were carefully chosen to align with the goals of fine-grained analysis.
The final dataset comprises ten target words, each meeting strict selection criteria to ensure linguistic and computational relevance.
The SlangTrack Dataset serves as a public resource, fostering research in slang detection, semantic evolution, and informal language processing. Combining historical and contemporary sources provides a comprehensive platform for exploring the nuances of slang in natural language.
The table below provides a breakdown of the total number of instances categorized as slang or non-slang for each target keyword in the SlangTrack (ST) Dataset.
| Keyword | Non-slang | Slang | Total |
| --- | --- | --- | --- |
| BMW | 1083 | 14 | 1097 |
| Brownie | 582 | 382 | 964 |
| Chronic | 1415 | 270 | 1685 |
| Climber | 520 | 122 | 642 |
| Cucumber | 972 | 79 | 1051 |
| Eat | 2462 | 561 | 3023 |
| Germ | 566 | 249 | 815 |
| Mammy | 894 | 154 | 1048 |
| Rodent | 718 | 349 | 1067 |
| Salty | 543 | 727 | 1270 |
| Total | 9755 | 2907 | 12662 |
The table below provides examples of sentences from the SlangTrack (ST) Dataset, showcasing both slang and non-slang usage of the target keywords. Each example highlights the context in which the target word is used and its corresponding category.
| Example Sentences | Target Keyword | Category |
| --- | --- | --- |
| Today, I heard, for the first time, a short scientific talk given by a man dressed as a rodent...! An interesting experience. | Rodent | Slang |
| On the other. Mr. Taylor took food requests and, with a stern look in his eye, told the children to stay seated until he and his wife returned with the food. The children nodded attentively. After the adults left, the children seemed to relax, talking more freely and playing with one another. When the parents returned, the kids straightened up again, received their food, and began to eat, displaying quiet and gracious manners all the while. | Eat | Non-Slang |
| Greater than this one that washed between the shores of Florida and Mexico. He balanced between the breakers and the turning tide. Small particles of sand churned in the waters around him, and a small fish swam against his leg, a momentary dark streak that vanished in the surf. He began to swim. Buoyant in the salty water, he swam a hundred meters to a jetty that sent small whirlpools around its barnacle rough pilings. | Salty | Non-Slang |
| Mom was totally hating on my dance moves. She's so salty. | Salty | Slang |
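A natural baseline for the binary slang/non-slang task is a linear classifier over TF-IDF features. A hedged sketch, assuming the data has been exported to a CSV with `sentence` and `category` columns (the actual file layout is not specified here):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("slangtrack.csv")  # assumed export: columns sentence, category
X_train, X_test, y_train, y_test = train_test_split(
    df["sentence"], df["category"] == "Slang", test_size=0.2, random_state=0)

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```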
**Licenses**
The SlangTrack (ST) dataset is built using a combination of licensed and publicly available corpora. To ensure compliance with licensing agreements, all data has been extensively preprocessed, modified, and anonymized while preserving linguistic integrity. The dataset has been randomized and structured to support research in slang detection without violating the terms of the original sources.
The **original authors and data providers retain their respective rights**, where applicable. We encourage users to **review the licensing agreements** included with the dataset to understand any potential usage limitations. While some source corpora, such as **COHA, require a paid license and restrict redistribution**, our processed dataset is **legally shareable and publicly available** for **research and development purposes**.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
In this article, we introduce a unique dataset containing all written communication published by the German Bundestag between 1949 and 2017. Increasing numbers of scholars make use of protocols of parliamentary speeches, parliamentary questions, or the texts of legislative drafts in various fields of comparative politics, including representation, responsiveness, professionalization and political careers, and parliamentary agenda studies. Since preparing parliamentary documents is rather resource-intensive, these studies remain limited to single points in time, types of documents and/or policy areas. The long time horizon and the various types of documents covered by our new comprehensive dataset will enable scholars interested in parliaments, parties and representatives to answer various innovative research questions related to legislative studies.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
To create our dataset we combined two resources: the LexMTurk (Horn et al., 2014) and LSeval (De Belder and Moens, 2012) datasets. The instances in both datasets, 929 in total, contain a sentence, a target complex word, and several candidate substitutions ranked according to their simplicity. The candidates in both datasets were suggested and ranked by English speakers from the U.S. To increase its reliability, we applied the following corrections to each instance of our dataset:
Spelling Filtering: We discard any misspelled candidates using Norvig's algorithm, with the spelling model trained over the News Crawl corpus.
Inflection Correction: We inflect all candidates to the tense of the target word using the Text Adorning module of LEXenstein (Paetzold and Specia, 2015; Burns, 2013).
The resulting dataset – BenchLS – contains 929 instances, with an average of 7.37 candidate substitutions per complex word.
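A minimal sketch of the spelling-filtering step in the spirit of Norvig's approach: build unigram counts from a plain-text corpus and discard candidates the model has never seen. The corpus file name stands in for News Crawl and is an assumption; this is not the authors' exact code:

```python
import re
from collections import Counter

def words(text):
    return re.findall(r"[a-z]+", text.lower())

# "news_crawl.txt" is an assumed local file standing in for the News Crawl corpus.
with open("news_crawl.txt", encoding="utf-8") as f:
    WORD_COUNTS = Counter(words(f.read()))

def is_well_spelled(candidate):
    # Treat a candidate as correctly spelled if the model has seen it.
    return WORD_COUNTS[candidate.lower()] > 0

candidates = ["difficult", "dificult", "hard"]
print([c for c in candidates if is_well_spelled(c)])
```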
Derived from over 150 years of lexical research, these comprehensive textual and audio data, focused on American English, provide linguistically annotated data. Ideal for NLP applications, LLM training and/or fine-tuning, as well as educational and game apps.
One of our flagship datasets, the American English data is expertly curated and linguistically annotated by professionals, with annual updates to ensure accuracy and relevance. The below datasets in American English are available for license:
Key Features (approximate numbers):
Our American English Monolingual Dictionary Data is the foremost authority on American English, including detailed tagging and labelling covering parts of speech (POS), grammar, region, register, and subject, providing rich linguistic information. Additionally, all grammar and usage information is present to ensure relevance and accuracy.
The American English Synonyms and Antonyms Dataset is a leading resource offering comprehensive, up-to-date coverage of word relationships in contemporary American English. It includes rich linguistic details such as precise definitions and part-of-speech (POS) tags, making it an essential asset for developing AI systems and language technologies that require deep semantic understanding.
This dataset provides IPA transcriptions and clean audio data in contemporary American English. It includes syllabified transcriptions, variant spellings, POS tags, and pronunciation group identifiers. The audio files are supplied separately and linked where available for seamless integration - perfect for teams building TTS systems, ASR models, and pronunciation engines.
Use Cases:
We consistently work with our clients on new use cases as language technology continues to evolve. These include NLP applications, TTS, dictionary display tools, games, machine translation, AI training and fine-tuning, word embedding, and word sense disambiguation (WSD).
If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Growth.OL@oup.com to start the conversation.
Pricing:
Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.
Contact our team or email us at Growth.OL@oup.com to explore pricing options and discover how our language data can support your goals. Please note that some datasets may have rights restrictions. Contact us for more information.
About the sample:
To help you explore the structure and features of our dataset on this platform, we provide a sample in CSV and/or JSON formats for one of the presented datasets, for preview purposes only, as shown on this page. This sample offers a quick and accessible overview of the data's contents and organization.
Our full datasets are available in various formats, depending on the language and type of data you require. These may include XML, JSON, TXT, XLSX, CSV, WAV, MP3, and other file types. Please contact us (Growth.OL@oup.com) if you would like to receive the original sample with full details.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Corpora of misspelled words, downloaded from the original source and merged into one dataset. For further documentation, refer to the source webpage.
It contains two columns: input and label.
The input column is the misspelled word as the user entered it; the label column is the intended correct spelling.
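For orientation, loading the corpus with pandas might look like this; the file name is an assumption, while the column names are those described above:

```python
import pandas as pd

df = pd.read_csv("misspellings.csv")         # assumed file name
pairs = list(zip(df["input"], df["label"]))  # (misspelled, intended) pairs
print(pairs[:3])
```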
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Word is a dataset for object detection tasks - it contains annotations for 230 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
royzhong/one-word dataset hosted on Hugging Face and contributed by the HF Datasets community
List of most commonly used English words. Simple text file. One word per line.
Used for this competition: https://www.kaggle.com/competitions/llms-you-cant-please-them-all
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Stanford Rare Word (RW) Similarity Dataset
Created by Minh-Thang Luong, Richard Socher, and Christopher D. Manning, Stanford University Computer Science Department. Available at: http://nlp.stanford.edu/~lmthang/morphoNLM Described in: Luong, M.-T., Socher, R., & Manning, C. D. (2013). Better Word Representations with Recursive Neural Networks for Morphology. CoNLL, Sofia, Bulgaria.
Columns
Word1 – First word in the pair
Word2 – Second word in the pair
… See the full description on the dataset page: https://huggingface.co/datasets/almogtavor/stanford-rare-word-similarity-dataset.
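Benchmarks like RW are typically used by correlating a model's similarity scores against the human ratings with Spearman's rank correlation. A sketch, where the file name, the rating column name, and the placeholder scorer are all assumptions:

```python
import difflib

import pandas as pd
from scipy.stats import spearmanr

df = pd.read_csv("rw.csv")  # assumed export of the dataset

def model_similarity(w1, w2):
    # Placeholder scorer; swap in cosine similarity from your word embeddings.
    return difflib.SequenceMatcher(None, w1, w2).ratio()

preds = [model_similarity(a, b) for a, b in zip(df["Word1"], df["Word2"])]
rho, _ = spearmanr(preds, df["AverageRating"])  # assumed rating column name
print(f"Spearman rho: {rho:.3f}")
```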
How frequently a word occurs in a language is an important piece of information for natural language processing and linguists. In natural language processing, very frequent words tend to be less informative than less frequent ones and are often removed during preprocessing. Human language users are also sensitive to word frequency. How often a word is used affects language processing in humans; for example, very frequent words are read and understood more quickly and can be understood more easily in background noise.
This dataset contains the counts of the 333,333 most commonly-used single words on the English language web, as derived from the Google Web Trillion Word Corpus.
Data files were derived from the Google Web Trillion Word Corpus (as described by Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium) by Peter Norvig. More information on these files and the code used to generate them is available on Peter Norvig's site.
The code used to generate this dataset is distributed under the MIT License.
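As a quick illustration of how such counts are used, the sketch below loads them and inspects the rank-frequency (Zipf) pattern; the file and column names are assumptions about the export format:

```python
import pandas as pd

df = pd.read_csv("unigram_freq.csv")  # assumed columns: word, count
df = df.sort_values("count", ascending=False)
df["relative"] = df["count"] / df["count"].sum()
# Under Zipf's law, relative frequency falls roughly as 1/rank:
for rank in (1, 10, 100, 1000):
    row = df.iloc[rank - 1]
    print(rank, row["word"], f"{row['relative']:.6f}")
```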
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A Roman-script dataset of scene word text images, named the Scene Word Dataset (SWord6k), is developed for character segmentation, character recognition, text detection, and script identification. All images of the SWord6k dataset are captured in outdoor environments. The images are collected from banners, advertisements, shop names, and posters at different sources, such as shopping malls, book fairs, and puja pandals. The SWord6k dataset contains 6,661 scene word images in PNG file format. Three types of ground-truth annotations are composed for the SWord6k dataset: (i) component level, i.e. whether a component is text or not; (ii) script level, i.e. identifying the script; (iii) recognition level, i.e. character/word recognition of the text. Each image's ground-truth annotations are stored in XML (Extensible Markup Language) file format.
This dataset was collected in order to provide detailed information about the morphemic structure of English words. Morphemes are the smallest units of meaning in a language, and English words are made up of one or more morphemes. This dataset contains four different CSV files, each containing data about a different aspect of English words.
Unlicense: https://choosealicense.com/licenses/unlicense/
English Valid Words
This repository contains CSV files with valid English words along with their frequency, stem, and stem valid probability. Dataset Github link: https://github.com/Maximax67/English-Valid-Words
Files included
valid_words_sorted_alphabetically.csv:
N: Counter for each word entry.
Word: The English word itself.
Frequency count: The number of occurrences of the word in the 1-grams dataset.
Stem: The stem of the word.
Stem valid probability: Probability… See the full description on the dataset page: https://huggingface.co/datasets/Maximax67/English-Valid-Words.
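A small pandas sketch of filtering by the columns described above; the thresholds are arbitrary illustrative choices:

```python
import pandas as pd

df = pd.read_csv("valid_words_sorted_alphabetically.csv")
common = df[(df["Frequency count"] > 1000) & (df["Stem valid probability"] > 0.9)]
print(common["Word"].head())
```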
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LScD (Leicester Scientific Dictionary)
April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com)
Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes

[Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus), Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in the description of Version 2 below; we do not repeat the explanation. After the pre-processing steps, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are the same as those described for LScD Version 2 below.
* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2

[Version 2] Getting Started
This document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). The dictionary is created to be used in future work on the quantification of the meaning of research texts. R code for producing the dictionary from the LSC, with usage instructions, is available in [2]. The code can also be used for lists of texts from other sources; amendments to the code may be required.
LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains the title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from publication date to July 2018. The total number of documents in the LSC is 1,673,824.
LScD is an ordered list of words from the texts of abstracts in the LSC. The dictionary stores 974,238 unique words and is sorted by the number of documents containing each word, in descending order. All words in the LScD are in stemmed form. The LScD contains the following information:
1. Unique words in abstracts
2. Number of documents containing each word
3. Number of appearances of each word in the entire corpus

Processing the LSC
Step 1. Downloading the LSC Online: Use of the LSC is subject to acceptance of a request for the link by email. To access the LSC for research purposes, please email ns433@le.ac.uk. The data are extracted from the Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.
Step 2. Importing the Corpus to R: The full R code for processing the corpus can be found on GitHub [2]. All following steps can be applied to an arbitrary list of texts from any source with changes of parameters. The structure of the corpus, such as the file format and the names (and positions) of fields, should be taken into account when applying our code. The organisation of the CSV files of the LSC is described in the README file for the LSC [1].
Step 3. Extracting Abstracts and Saving Metadata: Metadata (all fields in a document excluding the abstract) and the abstracts are separated. Metadata are then saved as MetaData.R. Fields of metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.
Step 4. Text Pre-processing Steps on the Collection of Abstracts: In this section, we present our approaches to pre-processing the abstracts of the LSC.
1. Removing punctuation and special characters: All non-alphanumeric characters are substituted by a space. We did not substitute the character “-” in this step, because we need to keep words like “z-score”, “non-payment” and “pre-processing” in order not to lose their actual meaning. A process of uniting prefixes with words is performed in later steps of pre-processing.
2. Lowercasing the text data: Lowercasing is performed to avoid treating words like “Corpus”, “corpus” and “CORPUS” differently. The entire collection of texts is converted to lowercase.
3. Uniting prefixes of words: Words containing prefixes joined with the character “-” are united into one word. The list of prefixes united for this research is given in the file “list_of_prefixes.csv”. Most of the prefixes are extracted from [4]. We also added commonly used prefixes: ‘e’, ‘extra’, ‘per’, ‘self’ and ‘ultra’.
4. Substitution of words: Some words joined with “-” in the abstracts of the LSC require an additional substitution step to avoid losing their meaning before removing the character “-”. Examples of such words are “z-test”, “well-known” and “chi-square”; they are substituted with “ztest”, “wellknown” and “chisquare”. Identification of such words was done by sampling abstracts from the LSC. The full list of such words and the decisions taken for substitution are presented in the file “list_of_substitution.csv”.
5. Removing the character “-”: All remaining “-” characters are replaced by a space.
6. Removing numbers: All digits that are not part of a word are replaced by a space. Words that contain both digits and letters are kept, because alphanumeric strings such as chemical formulas might be important for our analysis. Examples are “co2”, “h2o” and “21st”.
7. Stemming: Stemming is the process of converting inflected words into their word stem. This step unites several forms of words with similar meaning into one form, and also saves memory space and time [5]. All words in the LScD are stemmed to their word stem.
8. Stop word removal: Stop words are words that are extremely common but provide little value in a language. Some common stop words in English are ‘I’, ‘the’, ‘a’, etc. We used the ‘tm’ package in R to remove stop words [6]; there are 174 English stop words listed in the package.
Step 5. Writing the LScD into CSV Format: There are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written to the file “LScD.csv”.

The Organisation of the LScD
The total number of words in the file “LScD.csv” is 974,238. Each field is described below:
Word: Unique words from the corpus. All words are lowercase and in their stem forms. The field is sorted by the number of documents containing each word, in descending order.
Number of Documents Containing the Word: A binary calculation is used: if a word exists in an abstract, it is counted as 1; if the word exists more than once in a document, the count is still 1. The total number of documents containing the word is the sum of these 1s over the entire corpus.
Number of Appearances in Corpus: How many times the word occurs in the corpus when the corpus is considered as one large document.

Instructions for R Code
LScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as an RData file and in CSV format. Outputs of the code are:
Metadata File: All fields in a document excluding abstracts. Fields are List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.
File of Abstracts: All abstracts after the pre-processing steps defined in Step 4.
DTM: The Document Term Matrix constructed from the LSC [6]. Each entry of the matrix is the number of times the word occurs in the corresponding document.
LScD: An ordered list of words from the LSC as defined in the previous section.
To use the code:
1. Download the folder ‘LSC’, ‘list_of_prefixes.csv’ and ‘list_of_substitution.csv’
2. Open the LScD_Creation.R script
3. Change the parameters in the script: replace them with the full path of the directory with source files and the full path of the directory to write output files
4. Run the full code.

References
[1] N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1
[2] N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION
[3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[4] A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.
[5] C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved Porter's stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.
[6] I. Feinerer, "Introduction to the tm Package: Text Mining in R," available online: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.
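The authors' full R pipeline is available in [2]. For orientation only, here is a compressed Python sketch of the main Step 4 operations, using NLTK's Porter stemmer and stop-word list as stand-ins for the R 'tm' tooling; this is an assumption-laden illustration, not the authors' code, and it omits the substitution list of Step 4.4:

```python
import re
from collections import Counter

from nltk.corpus import stopwords  # requires: nltk.download("stopwords")
from nltk.stem import PorterStemmer

PREFIXES = {"pre", "non", "self", "ultra", "extra", "per", "e"}  # abridged list
STOP = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(abstract):
    text = abstract.lower()
    # Replace non-alphanumeric characters by space, keeping "-" (Steps 4.1-4.2).
    text = re.sub(r"[^a-z0-9\- ]", " ", text)
    # Unite listed prefixes with the following word (Step 4.3).
    text = re.sub(r"\b(" + "|".join(PREFIXES) + r")-", r"\1", text)
    # Drop remaining hyphens and standalone numbers (Steps 4.5-4.6).
    text = text.replace("-", " ")
    text = re.sub(r"\b\d+\b", " ", text)
    # Drop stop words and stem (Steps 4.7-4.8).
    return [stemmer.stem(t) for t in text.split() if t not in STOP]

doc_counts = Counter()  # documents containing each word (binary per document)
for abstract in ["Pre-processing of the z-score data improves co2 estimates."]:
    doc_counts.update(set(preprocess(abstract)))
print(doc_counts.most_common(5))
```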
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered to the book 'Better words : a first thesaurus'. It features 7 columns including author, publication date, language, and book publisher.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a manually labeled data set for the task of Event Detection (ED). The task of ED consists of identifying event triggers, i.e. the words that most clearly indicate the occurrence of an event.
The data set consists of 2,200 news extracts from The New York Times (NYT) Annotated Corpus, separated into training (2,000) and testing (200) sets. Each news extract contains the plain text with the labels (event mentions), along with two metadata fields (publication date and an identifier).
Labels description: We consider as an event any ongoing real-world event or situation reported in the news articles. It is important to distinguish events and situations that are in progress (or are reported as fresh events) at the moment the news is delivered from past events that are simply brought back, future events, hypothetical events, or events that will not take place. In our data set, only the first type is labeled as an event. Based on this criterion, some words that are typically considered events are labeled as non-event triggers when they do not refer to ongoing events at the time the analyzed news is released. Take for instance the following news extract: "devaluation is not a realistic option to the current account deficit since it would only contribute to weakening the credibility of economic policies as it did during the last crisis." The only word labeled as an event trigger in this example is "deficit", because it is the only ongoing event referred to in the news. Note that the words "devaluation", "weakening" and "crisis" could be labeled as event triggers in other news extracts, where the context of use of these words is different, but not in the given example.
Further information: For a more detailed description of the data set and the data collection process please visit: https://cs.uns.edu.ar/~mmaisonnave/resources/ED_data.
Data format: The dataset is split into two folders: training and testing. The first folder contains 2,000 XML files; the second folder contains 200 XML files. Each XML file has the following format:
<?xml version="1.0" encoding="UTF-8"?>
The first three tags (pubdate, file-id and sent-idx) contain metadata. The first is the publication date of the news article that contains the text extract. The next two together form a unique identifier for the text extract: file-id uniquely identifies a news article, which can hold several text extracts, and sent-idx identifies the text extract inside the full article.
The last tag (sentence) delimits the beginning and end of the text extract. Inside that text are the annotation tags; each of these tags surrounds one word that was manually labeled as an event trigger.
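Since the trigger tag itself is not named in the description above, the following sketch uses a hypothetical <event> tag (and a hypothetical <extract> root element) purely to illustrate parsing one file with Python's standard library; the metadata tag names are the ones listed above, and the sample values are invented:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample following the tags described above; <extract> and
# <event> are assumptions, and the values are invented. Real files start
# with the XML declaration shown earlier.
SAMPLE = """<extract>
  <pubdate>1998-03-12</pubdate>
  <file-id>0123456</file-id>
  <sent-idx>4</sent-idx>
  <sentence>... to the current account <event>deficit</event> since ...</sentence>
</extract>"""

root = ET.fromstring(SAMPLE)
pubdate = root.findtext("pubdate")
sentence = root.find("sentence")
triggers = [e.text for e in sentence.iter("event")]
print(pubdate, triggers)  # 1998-03-12 ['deficit']
```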
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Slovenian datasets for contextual synonym and antonym detection can be used for training machine learning classifiers as described in the MSc thesis of Jasmina Pegan "Semantic detection of synonyms and antonyms with contextual embeddings" (https://repozitorij.uni-lj.si/IzpisGradiva.php?id=141456). Datasets contain example pairs of synonyms and antonyms in contexts together with additional information on a sense pair. Candidates for synonyms and antonyms were retrieved from the dataset created in the BSc thesis of Jasmina Pegan "Antonym detection with word embeddings" (https://repozitorij.uni-lj.si/IzpisGradiva.php?id=110533). Example sentences were retrieved from The comprehensive Slovenian-Hungarian dictionary (VSMS) (https://www.clarin.si/repository/xmlui/handle/11356/1453). Each dataset is class balanced and contains an equal amount of examples and counterexamples. An example is a pair of example sentences where the two words are synonyms/antonyms. A counterexample is a pair of example sentences where two words are not synonyms/antonyms. Note that a word pair can be synonymous or antonymous in some sense of the two words (but not in the given context).
Datasets are divided into two categories: datasets for synonyms and datasets for antonyms. Each category is further divided into base and updated datasets, each containing three dataset files: train, validation and test. Base datasets include only manually reviewed sense pairs, generated from all pairs of VSMS sense examples for all confirmed pairs of antonym and synonym senses. Updated datasets include automatically generated sense pairs while constraining the maximal number of examples per word; in this way the dataset is more balanced word-wise, but it is not fully manually reviewed and contains less accurate data.
A single dataset entry contains the information on the base word, followed by data on synonym/antonym candidate. The last column discerns whether the sense pair is a pair of synonyms/antonyms or not. More details on this can be found inside the included README file.