Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LScDC Word-Category RIG Matrix
April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com)
Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes

Getting Started
This file describes the Word-Category RIG Matrix for the Leicester Scientific Corpus (LSC) [1] and the procedure used to build the matrix, and introduces the Leicester Scientific Thesaurus (LScT) with its construction process. The Word-Category RIG Matrix is a 103,998 by 252 matrix, where rows correspond to words of the Leicester Scientific Dictionary-Core (LScDC) [2] and columns correspond to 252 Web of Science (WoS) categories [3, 4, 5]. Each entry in the matrix corresponds to a pair (category, word); its value is the Relative Information Gain (RIG) on the belonging of a text from the LSC to the category from observing the word in this text. The CSV file of the Word-Category RIG Matrix in the published archive carries two additional columns: the sum of RIGs over categories and the maximum of RIGs over categories (the last two columns of the matrix). The file ‘Word-Category RIG Matrix.csv’ therefore contains 254 columns in total.
The matrix is created for future research on quantifying meaning in scientific texts, under the assumption that words have scientifically specific meanings in subject categories and that this meaning can be estimated by the information gained about categories from a word. LScT (Leicester Scientific Thesaurus) is a scientific thesaurus of English comprising a list of 5,000 words from the LScDC. We order the words of the LScDC by the sum of their RIGs over categories; that is, words are ranked by their informativeness in the scientific corpus LSC, so the meaningfulness of a word is evaluated by its average informativeness across the categories. The 5,000 most informative words are included in the scientific thesaurus.

Words as a Vector of Frequencies in WoS Categories
Each word of the LScDC is represented as a vector of frequencies in WoS categories. Given the collection of LSC texts, each entry of the vector is the number of texts in the corresponding category that contain the word. Note that texts in a corpus do not necessarily belong to a single category, as they may correspond to multidisciplinary studies, especially in a corpus of scientific texts; in other words, categories may not be exclusive. There are 252 WoS categories, and a text in the LSC is assigned to at least 1 and at most 6 categories. Using a binary calculation of frequencies, we record the presence of a word in a category and create a vector of frequencies for each word, whose dimensions are the categories in the corpus.
The collection of vectors, over all words and categories in the entire corpus, can be shown as a table in which each entry corresponds to a pair (word, category). This table is built for the LScDC with 252 WoS categories and is presented in the published archive with this file. The value of each entry shows how many times a word of the LScDC appears in a WoS category, determined by counting the LSC texts in that category that contain the word.
Words as a Vector of Relative Information Gains Extracted for Categories
In this section, we introduce our approach to representing a word as a vector of relative information gains for categories, under the assumption that the meaning of a word can be quantified by the information it provides about categories. For each category, a function is defined on texts that takes the value 1 if the text belongs to the category, and 0 otherwise. For each word, a function is defined on texts that takes the value 1 if the word belongs to the text, and 0 otherwise. Consider the LSC as a probabilistic sample space (the space of equally probable elementary outcomes). For these Boolean random variables, the joint probability distribution, the entropy and the information gains are defined.
The information gain about the category from the word is the amount of information on the belonging of a text from the LSC to the category obtained from observing the word in the text [6]. We use the Relative Information Gain (RIG), a normalised measure of the Information Gain, which makes information gains comparable across categories. The calculations of entropy, Information Gains and Relative Information Gains can be found in the README file in the published archive. Given a word, we create a vector in which each component corresponds to a category, so each word is represented as a vector of relative information gains whose dimension is the number of categories. The set of vectors forms the Word-Category RIG Matrix, in which each column corresponds to a category, each row corresponds to a word, and each component is the relative information gain from the word to the category. In the Word-Category RIG Matrix, a row vector represents the corresponding word as a vector of RIGs over categories, while a column vector holds the RIGs of all words for an individual category. For any chosen category, words can be ordered by their RIGs from the most informative to the least informative for that category. Beyond ordering words within each category, words can also be ordered by two global criteria: the sum and the maximum of RIGs over categories. The top n words in such a list can be considered the most informative words in scientific texts. For a given word, the sum and the maximum of RIGs are read off the Word-Category RIG Matrix.
RIGs for each word of the LScDC in the 252 categories are calculated and the word vectors are formed; we then assemble the Word-Category RIG Matrix for the LSC. For each word, the sum (S) and maximum (M) of RIGs over categories are calculated and appended as the last two columns of the matrix. The Word-Category RIG Matrix for the LScDC with 252 categories, together with the sum and the maximum of RIGs over categories, can be found in the database.

Leicester Scientific Thesaurus (LScT)
The Leicester Scientific Thesaurus (LScT) is a list of 5,000 words from the LScDC [2]. Words of the LScDC are sorted in descending order by the sum (S) of RIGs over categories, and the top 5,000 words are selected for inclusion in the LScT. We consider these 5,000 words the most meaningful words in the scientific corpus: the meaningfulness of a word is evaluated by its average informativeness across the categories, and the list of these words is treated as a ‘thesaurus’ for science. The LScT, with the sum values, is provided as a CSV file in the published archive.
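For concreteness, the RIG described above can be computed from four document counts per (word, category) pair. A minimal Python sketch, with illustrative names (the authors' exact formulas are given in the README file of the archive):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def rig(n_texts, n_in_cat, n_with_word, n_both):
    """Relative Information Gain of a category from a word, treating the
    corpus as a space of equally probable texts (see definitions above)."""
    p_cat = n_in_cat / n_texts
    h_cat = entropy([p_cat, 1 - p_cat])          # H(category)
    if h_cat == 0.0:
        return 0.0
    # H(category | word): weighted over word-present and word-absent texts.
    h_cond = 0.0
    for n_w, n_c in ((n_with_word, n_both),
                     (n_texts - n_with_word, n_in_cat - n_both)):
        if n_w > 0:
            h_cond += (n_w / n_texts) * entropy([n_c / n_w, 1 - n_c / n_w])
    return (h_cat - h_cond) / h_cat              # IG / H(category)

# Corpus size from the LSC; the remaining counts are invented for illustration.
print(rig(n_texts=1_673_824, n_in_cat=12_000, n_with_word=850, n_both=400))
```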
The published archive contains the following files:
1) Word_Category_RIG_Matrix.csv: A 103,998 by 254 matrix where the columns are the 252 WoS categories plus the sum (S) and the maximum (M) of RIGs over categories (the last two columns), and the rows are words of the LScDC. Each entry in the first 252 columns is the RIG from the word to the category. Words are ordered as in the LScDC.
2) Word_Category_Frequency_Matrix.csv: A 103,998 by 252 matrix where columns are the 252 WoS categories and rows are words of the LScDC. Each entry is the number of texts in the corresponding category that contain the word. Words are ordered as in the LScDC.
3) LScT.csv: List of words of the LScT with their sum (S) values.
4) Text_No_in_Cat.csv: The number of texts in each category.
5) Categories_in_Documents.csv: List of WoS categories for each document of the LSC.
6) README.txt: Description of the Word-Category RIG Matrix, the Word-Category Frequency Matrix and the LScT, and the procedures used to form them.
7) README.pdf: Same as 6, in PDF format.
References
[1] Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
[2] Suzen, Neslihan (2019): LScDC (Leicester Scientific Dictionary-Core). figshare. Dataset. https://doi.org/10.25392/leicester.data.9896579.v3
[3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[4] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
[5] Suzen, N., Mirkes, E. M., & Gorban, A. N. (2019). LScDC-new large scientific dictionary. arXiv preprint arXiv:1912.06858.
[6] Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379-423.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset consists of 42,052 English words and their corresponding definitions. It is a comprehensive collection of words ranging from common terms to more obscure vocabulary. The dataset is ideal for Natural Language Processing (NLP) tasks, educational tools, and various language-related applications.
This dataset is well-suited for a range of language-related use cases.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description: This dataset contains images of the numbers one to fifty written as words, in varying letter cases (one, ONE, One, two, TWO, Two, …). Each image is stored in its respective folder, named after the word (one, two, three, …).
Content: The dataset includes images of numbers written in words from one to fifty in various formats and styles. Images are provided in JPG, JPEG, and PNG formats.
Usage: This dataset can be used to develop machine learning models for optical character recognition (OCR) or image classification tasks. The goal is to train a model that, given an image containing a number word, predicts the word.
Acknowledgements: This dataset was created for the purpose of solving the problem statement: "Develop a machine-learning model to train with images of numbers written in words from one to fifty."
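Given the folder-per-class layout described above, a standard image-classification loader applies directly. A small PyTorch sketch, where the dataset root path and image size are assumptions:

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Folder-per-class layout: number_words/one/*.jpg, number_words/two/*.png, ...
# "number_words/" is an assumed local path, not part of the dataset itself.
transform = transforms.Compose([
    transforms.Grayscale(),
    transforms.Resize((64, 256)),  # word images are wide; the size is a guess
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder("number_words/", transform=transform)
loader = DataLoader(dataset, batch_size=32, shuffle=True)
print(dataset.classes[:5])  # folder names become labels, sorted alphabetically
```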
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The SlangTrack (ST) Dataset is a novel, meticulously curated resource aimed at addressing the complexities of slang detection in natural language processing. This dataset uniquely emphasizes words that exhibit both slang and non-slang contexts, enabling a binary classification system to distinguish between these dual senses. By providing comprehensive examples for each usage, the dataset supports fine-grained linguistic and computational analysis, catering to both researchers and practitioners in NLP.
These features ensure a robust contextual framework for accurate slang detection and semantic analysis.
The target words were carefully chosen to align with the goals of fine-grained analysis.
The final dataset comprises ten target words, each meeting strict selection criteria to ensure linguistic and computational relevance.
The SlangTrack Dataset serves as a public resource, fostering research in slang detection, semantic evolution, and informal language processing. Combining historical and contemporary sources provides a comprehensive platform for exploring the nuances of slang in natural language.
The table below provides a breakdown of the total number of instances categorized as slang or non-slang for each target keyword in the SlangTrack (ST) Dataset.
| Keyword | Non-slang | Slang | Total |
| --- | --- | --- | --- |
| BMW | 1083 | 14 | 1097 |
| Brownie | 582 | 382 | 964 |
| Chronic | 1415 | 270 | 1685 |
| Climber | 520 | 122 | 642 |
| Cucumber | 972 | 79 | 1051 |
| Eat | 2462 | 561 | 3023 |
| Germ | 566 | 249 | 815 |
| Mammy | 894 | 154 | 1048 |
| Rodent | 718 | 349 | 1067 |
| Salty | 543 | 727 | 1270 |
| Total | 9755 | 2907 | 12662 |
The table below provides examples of sentences from the SlangTrack (ST) Dataset, showcasing both slang and non-slang usage of the target keywords. Each example highlights the context in which the target word is used and its corresponding category.
| Example Sentences | Target Keyword | Category |
| --- | --- | --- |
| Today, I heard, for the first time, a short scientific talk given by a man dressed as a rodent...! An interesting experience. | Rodent | Slang |
| On the other. Mr. Taylor took food requests and, with a stern look in his eye, told the children to stay seated until he and his wife returned with the food. The children nodded attentively. After the adults left, the children seemed to relax, talking more freely and playing with one another. When the parents returned, the kids straightened up again, received their food, and began to eat, displaying quiet and gracious manners all the while. | Eat | Non-Slang |
| Greater than this one that washed between the shores of Florida and Mexico. He balanced between the breakers and the turning tide. Small particles of sand churned in the waters around him, and a small fish swam against his leg, a momentary dark streak that vanished in the surf. He began to swim. Buoyant in the salty water, he swam a hundred meters to a jetty that sent small whirlpools around its barnacle rough pilings. | Salty | Non-Slang |
| Mom was totally hating on my dance moves. She's so salty. | Salty | Slang |
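A natural baseline for the binary slang/non-slang task is a linear classifier over TF-IDF features. A hedged sketch, assuming the data has been exported to a CSV with `sentence` and `category` columns (the actual file layout is not specified here):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("slangtrack.csv")  # assumed export: columns sentence, category
X_train, X_test, y_train, y_test = train_test_split(
    df["sentence"], df["category"] == "Slang", test_size=0.2, random_state=0)

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```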
**Licenses**
The SlangTrack (ST) dataset is built using a combination of licensed and publicly available corpora. To ensure compliance with licensing agreements, all data has been extensively preprocessed, modified, and anonymized while preserving linguistic integrity. The dataset has been randomized and structured to support research in slang detection without violating the terms of the original sources.
The **original authors and data providers retain their respective rights**, where applicable. We encourage users to **review the licensing agreements** included with the dataset to understand any potential usage limitations. While some source corpora, such as **COHA, require a paid license and restrict redistribution**, our processed dataset is **legally shareable and publicly available** for **research and development purposes**.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
In this article, we introduce a unique dataset containing all written communication published by the German Bundestag between 1949 and 2017. Increasing numbers of scholars make use of protocols of parliamentary speeches, parliamentary questions, or the texts of legislative drafts in various fields of comparative politics, including representation, responsiveness, professionalization and political careers, and parliamentary agenda studies. Since preparing parliamentary documents is rather resource-intensive, these studies remain limited to single points in time, types of documents and/or policy areas. The long time horizon and the various types of documents covered by our new comprehensive dataset will enable scholars interested in parliaments, parties and representatives to answer various innovative research questions related to legislative studies.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
To create our dataset we combined two resources: the LexMTurk (Horn et al., 2014) and LSeval (De Belder and Moens, 2012) datasets. The instances in both datasets, 929 in total, contain a sentence, a target complex word, and several candidate substitutions ranked according to their simplicity. The candidates in both datasets were suggested and ranked by English speakers from the U.S. To increase its reliability, we applied the following corrections to each instance of our dataset:
Spelling Filtering: We discard any misspelled candidates using Norvig's algorithm, with the spelling model trained over the News Crawl corpus.
Inflection Correction: We inflect all candidates to the tense of the target word using the Text Adorning module of LEXenstein (Paetzold and Specia, 2015; Burns, 2013).
The resulting dataset – BenchLS – contains 929 instances, with an average of 7.37 candidate substitutions per complex word.
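A minimal sketch of the spelling-filtering step in the spirit of Norvig's approach: build unigram counts from a plain-text corpus and discard candidates the model has never seen. The corpus file name stands in for News Crawl and is an assumption; this is not the authors' exact code:

```python
import re
from collections import Counter

def words(text):
    return re.findall(r"[a-z]+", text.lower())

# "news_crawl.txt" is an assumed local file standing in for the News Crawl corpus.
with open("news_crawl.txt", encoding="utf-8") as f:
    WORD_COUNTS = Counter(words(f.read()))

def is_well_spelled(candidate):
    # Treat a candidate as correctly spelled if the model has seen it.
    return WORD_COUNTS[candidate.lower()] > 0

candidates = ["difficult", "dificult", "hard"]
print([c for c in candidates if is_well_spelled(c)])
```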
Derived from over 150 years of lexical research, these comprehensive textual and audio data, focused on American English, provide linguistically annotated data. Ideal for NLP applications, LLM training and/or fine-tuning, as well as educational and game apps.
One of our flagship datasets, the American English data is expertly curated and linguistically annotated by professionals, with annual updates to ensure accuracy and relevance. The below datasets in American English are available for license:
Key Features (approximate numbers):
Our American English Monolingual Dictionary Data is the foremost authority on American English, including detailed tagging and labelling covering parts of speech (POS), grammar, region, register, and subject, providing rich linguistic information. Additionally, all grammar and usage information is present to ensure relevance and accuracy.
The American English Synonyms and Antonyms Dataset is a leading resource offering comprehensive, up-to-date coverage of word relationships in contemporary American English. It includes rich linguistic details such as precise definitions and part-of-speech (POS) tags, making it an essential asset for developing AI systems and language technologies that require deep semantic understanding.
This dataset provides IPA transcriptions and clean audio data in contemporary American English. It includes syllabified transcriptions, variant spellings, POS tags, and pronunciation group identifiers. The audio files are supplied separately and linked where available for seamless integration - perfect for teams building TTS systems, ASR models, and pronunciation engines.
Use Cases:
We consistently work with our clients on new use cases as language technology continues to evolve. These include NLP applications, TTS, dictionary display tools, games, machine translation, AI training and fine-tuning, word embedding, and word sense disambiguation (WSD).
If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Growth.OL@oup.com to start the conversation.
Pricing:
Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.
Contact our team or email us at Growth.OL@oup.com to explore pricing options and discover how our language data can support your goals. Please note that some datasets may have rights restrictions. Contact us for more information.
About the sample:
To help you explore the structure and features of our dataset on this platform, we provide a sample in CSV and/or JSON formats for one of the presented datasets, for preview purposes only, as shown on this page. This sample offers a quick and accessible overview of the data's contents and organization.
Our full datasets are available in various formats, depending on the language and type of data you require. These may include XML, JSON, TXT, XLSX, CSV, WAV, MP3, and other file types. Please contact us (Growth.OL@oup.com) if you would like to receive the original sample with full details.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Corpora of misspelled words, downloaded from the original source and merged into one dataset. For further documentation, refer to the source webpage.
It contains two columns: input and label.
The input column is the misspelled word as the user entered it; the label column is the intended correct spelling.
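For orientation, loading the corpus with pandas might look like this; the file name is an assumption, while the column names are those described above:

```python
import pandas as pd

df = pd.read_csv("misspellings.csv")         # assumed file name
pairs = list(zip(df["input"], df["label"]))  # (misspelled, intended) pairs
print(pairs[:3])
```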
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Word is a dataset for object detection tasks - it contains annotations for 230 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
royzhong/one-word dataset hosted on Hugging Face and contributed by the HF Datasets community
List of most commonly used English words. Simple text file. One word per line.
Used for this competition: https://www.kaggle.com/competitions/llms-you-cant-please-them-all
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Stanford Rare Word (RW) Similarity Dataset
Created by Minh-Thang Luong, Richard Socher, and Christopher D. Manning, Stanford University Computer Science Department. Available at: http://nlp.stanford.edu/~lmthang/morphoNLM Described in: Luong, M.-T., Socher, R., & Manning, C. D. (2013). Better Word Representations with Recursive Neural Networks for Morphology. CoNLL, Sofia, Bulgaria.
Columns
Word1 – First word in the pair
Word2 – Second word in the pair
… See the full description on the dataset page: https://huggingface.co/datasets/almogtavor/stanford-rare-word-similarity-dataset.
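Benchmarks like RW are typically used by correlating a model's similarity scores against the human ratings with Spearman's rank correlation. A sketch, where the file name, the rating column name, and the placeholder scorer are all assumptions:

```python
import difflib

import pandas as pd
from scipy.stats import spearmanr

df = pd.read_csv("rw.csv")  # assumed export of the dataset

def model_similarity(w1, w2):
    # Placeholder scorer; swap in cosine similarity from your word embeddings.
    return difflib.SequenceMatcher(None, w1, w2).ratio()

preds = [model_similarity(a, b) for a, b in zip(df["Word1"], df["Word2"])]
rho, _ = spearmanr(preds, df["AverageRating"])  # assumed rating column name
print(f"Spearman rho: {rho:.3f}")
```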
How frequently a word occurs in a language is an important piece of information for natural language processing and linguists. In natural language processing, very frequent words tend to be less informative than less frequent ones and are often removed during preprocessing. Human language users are also sensitive to word frequency. How often a word is used affects language processing in humans; for example, very frequent words are read and understood more quickly and can be understood more easily in background noise.
This dataset contains the counts of the 333,333 most commonly-used single words on the English language web, as derived from the Google Web Trillion Word Corpus.
Data files were derived from the Google Web Trillion Word Corpus (as described by Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium) by Peter Norvig. More information on these files and the code used to generate them is available on Peter Norvig's site.
The code used to generate this dataset is distributed under the MIT License.
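As a quick illustration of how such counts are used, the sketch below loads them and inspects the rank-frequency (Zipf) pattern; the file and column names are assumptions about the export format:

```python
import pandas as pd

df = pd.read_csv("unigram_freq.csv")  # assumed columns: word, count
df = df.sort_values("count", ascending=False)
df["relative"] = df["count"] / df["count"].sum()
# Under Zipf's law, relative frequency falls roughly as 1/rank:
for rank in (1, 10, 100, 1000):
    row = df.iloc[rank - 1]
    print(rank, row["word"], f"{row['relative']:.6f}")
```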
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A Roman-script dataset of scene word text images, named the Scene Word Dataset (SWord6k), is developed for character segmentation, character recognition, text detection, and script identification. All images of the SWord6k dataset are captured in outdoor environments. The images are collected from banners, advertisements, shop names, and posters at different sources, such as shopping malls, book fairs, and puja pandals. The SWord6k dataset contains 6,661 scene word images in PNG file format. Three types of ground-truth annotations are composed for the SWord6k dataset: (i) component level, i.e. whether a component is text or not; (ii) script level, i.e. identifying the script; (iii) recognition level, i.e. character/word recognition of the text. Each image's ground-truth annotations are stored in XML (Extensible Markup Language) file format.
This dataset was collected in order to provide detailed information about the morphemic structure of English words. Morphemes are the smallest units of meaning in a language, and English words are made up of one or more morphemes. This dataset contains four different CSV files, each containing data about a different aspect of English words.
Unlicense: https://choosealicense.com/licenses/unlicense/
English Valid Words
This repository contains CSV files with valid English words along with their frequency, stem, and stem valid probability. Dataset Github link: https://github.com/Maximax67/English-Valid-Words
Files included
valid_words_sorted_alphabetically.csv:
N: Counter for each word entry.
Word: The English word itself.
Frequency count: The number of occurrences of the word in the 1-grams dataset.
Stem: The stem of the word.
Stem valid probability: Probability… See the full description on the dataset page: https://huggingface.co/datasets/Maximax67/English-Valid-Words.
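A small pandas sketch of filtering by the columns described above; the thresholds are arbitrary illustrative choices:

```python
import pandas as pd

df = pd.read_csv("valid_words_sorted_alphabetically.csv")
common = df[(df["Frequency count"] > 1000) & (df["Stem valid probability"] > 0.9)]
print(common["Word"].head())
```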
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LScD (Leicester Scientific Dictionary)
April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com)
Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes

[Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus), Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in the description of Version 2 below; we do not repeat the explanation. After the pre-processing steps, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are the same as those described for LScD Version 2 below.
* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2

[Version 2] Getting Started
This document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). The dictionary is created to be used in future work on the quantification of the meaning of research texts. R code for producing the dictionary from the LSC, with usage instructions, is available in [2]. The code can also be used for lists of texts from other sources; amendments to the code may be required.
LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains the title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from publication date to July 2018. The total number of documents in the LSC is 1,673,824.
LScD is an ordered list of words from the texts of abstracts in the LSC. The dictionary stores 974,238 unique words and is sorted by the number of documents containing each word, in descending order. All words in the LScD are in stemmed form. The LScD contains the following information:
1. Unique words in abstracts
2. Number of documents containing each word
3. Number of appearances of each word in the entire corpus

Processing the LSC
Step 1. Downloading the LSC Online: Use of the LSC is subject to acceptance of a request for the link by email. To access the LSC for research purposes, please email ns433@le.ac.uk. The data are extracted from the Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.
Step 2. Importing the Corpus to R: The full R code for processing the corpus can be found on GitHub [2]. All following steps can be applied to an arbitrary list of texts from any source with changes of parameters. The structure of the corpus, such as the file format and the names (and positions) of fields, should be taken into account when applying our code. The organisation of the CSV files of the LSC is described in the README file for the LSC [1].
Step 3. Extracting Abstracts and Saving Metadata: Metadata (all fields in a document excluding the abstract) and the abstracts are separated. Metadata are then saved as MetaData.R. Fields of metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.
Step 4. Text Pre-processing Steps on the Collection of Abstracts: In this section, we present our approaches to pre-processing the abstracts of the LSC.
1. Removing punctuation and special characters: All non-alphanumeric characters are substituted by a space. We did not substitute the character “-” in this step, because we need to keep words like “z-score”, “non-payment” and “pre-processing” in order not to lose their actual meaning. A process of uniting prefixes with words is performed in later steps of pre-processing.
2. Lowercasing the text data: Lowercasing is performed to avoid treating words like “Corpus”, “corpus” and “CORPUS” differently. The entire collection of texts is converted to lowercase.
3. Uniting prefixes of words: Words containing prefixes joined with the character “-” are united into one word. The list of prefixes united for this research is given in the file “list_of_prefixes.csv”. Most of the prefixes are extracted from [4]. We also added commonly used prefixes: ‘e’, ‘extra’, ‘per’, ‘self’ and ‘ultra’.
4. Substitution of words: Some words joined with “-” in the abstracts of the LSC require an additional substitution step to avoid losing their meaning before removing the character “-”. Examples of such words are “z-test”, “well-known” and “chi-square”; they are substituted with “ztest”, “wellknown” and “chisquare”. Identification of such words was done by sampling abstracts from the LSC. The full list of such words and the decisions taken for substitution are presented in the file “list_of_substitution.csv”.
5. Removing the character “-”: All remaining “-” characters are replaced by a space.
6. Removing numbers: All digits that are not part of a word are replaced by a space. Words that contain both digits and letters are kept, because alphanumeric strings such as chemical formulas might be important for our analysis. Examples are “co2”, “h2o” and “21st”.
7. Stemming: Stemming is the process of converting inflected words into their word stem. This step unites several forms of words with similar meaning into one form, and also saves memory space and time [5]. All words in the LScD are stemmed to their word stem.
8. Stop word removal: Stop words are words that are extremely common but provide little value in a language. Some common stop words in English are ‘I’, ‘the’, ‘a’, etc. We used the ‘tm’ package in R to remove stop words [6]; there are 174 English stop words listed in the package.
Step 5. Writing the LScD into CSV Format: There are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written to the file “LScD.csv”.

The Organisation of the LScD
The total number of words in the file “LScD.csv” is 974,238. Each field is described below:
Word: Unique words from the corpus. All words are lowercase and in their stem forms. The field is sorted by the number of documents containing each word, in descending order.
Number of Documents Containing the Word: A binary calculation is used: if a word exists in an abstract, it is counted as 1; if the word exists more than once in a document, the count is still 1. The total number of documents containing the word is the sum of these 1s over the entire corpus.
Number of Appearances in Corpus: How many times the word occurs in the corpus when the corpus is considered as one large document.

Instructions for R Code
LScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as an RData file and in CSV format. Outputs of the code are:
Metadata File: All fields in a document excluding abstracts. Fields are List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.
File of Abstracts: All abstracts after the pre-processing steps defined in Step 4.
DTM: The Document Term Matrix constructed from the LSC [6]. Each entry of the matrix is the number of times the word occurs in the corresponding document.
LScD: An ordered list of words from the LSC as defined in the previous section.
To use the code:
1. Download the folder ‘LSC’, ‘list_of_prefixes.csv’ and ‘list_of_substitution.csv’
2. Open the LScD_Creation.R script
3. Change the parameters in the script: replace them with the full path of the directory with source files and the full path of the directory to write output files
4. Run the full code.

References
[1] N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1
[2] N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION
[3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[4] A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.
[5] C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved Porter's stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.
[6] I. Feinerer, "Introduction to the tm Package: Text Mining in R," available online: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.
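The authors' full R pipeline is available in [2]. For orientation only, here is a compressed Python sketch of the main Step 4 operations, using NLTK's Porter stemmer and stop-word list as stand-ins for the R 'tm' tooling; this is an assumption-laden illustration, not the authors' code, and it omits the substitution list of Step 4.4:

```python
import re
from collections import Counter

from nltk.corpus import stopwords  # requires: nltk.download("stopwords")
from nltk.stem import PorterStemmer

PREFIXES = {"pre", "non", "self", "ultra", "extra", "per", "e"}  # abridged list
STOP = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(abstract):
    text = abstract.lower()
    # Replace non-alphanumeric characters by space, keeping "-" (Steps 4.1-4.2).
    text = re.sub(r"[^a-z0-9\- ]", " ", text)
    # Unite listed prefixes with the following word (Step 4.3).
    text = re.sub(r"\b(" + "|".join(PREFIXES) + r")-", r"\1", text)
    # Drop remaining hyphens and standalone numbers (Steps 4.5-4.6).
    text = text.replace("-", " ")
    text = re.sub(r"\b\d+\b", " ", text)
    # Drop stop words and stem (Steps 4.7-4.8).
    return [stemmer.stem(t) for t in text.split() if t not in STOP]

doc_counts = Counter()  # documents containing each word (binary per document)
for abstract in ["Pre-processing of the z-score data improves co2 estimates."]:
    doc_counts.update(set(preprocess(abstract)))
print(doc_counts.most_common(5))
```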
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered to the book 'Better words : a first thesaurus'. It features 7 columns including author, publication date, language, and book publisher.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a manually labeled data set for the task of Event Detection (ED). The task of ED consists of identifying event triggers, i.e. the words that most clearly indicate the occurrence of an event.
The data set consists of 2,200 news extracts from The New York Times (NYT) Annotated Corpus, separated into training (2,000) and testing (200) sets. Each news extract contains the plain text with the labels (event mentions), along with two metadata fields (publication date and an identifier).
Labels description: We consider as an event any ongoing real-world event or situation reported in the news articles. It is important to distinguish events and situations that are in progress (or are reported as fresh events) at the moment the news is delivered from past events that are simply brought back, future events, hypothetical events, or events that will not take place. In our data set, only the first type is labeled as an event. Based on this criterion, some words that are typically considered events are labeled as non-event triggers when they do not refer to ongoing events at the time the analyzed news is released. Take for instance the following news extract: "devaluation is not a realistic option to the current account deficit since it would only contribute to weakening the credibility of economic policies as it did during the last crisis." The only word labeled as an event trigger in this example is "deficit", because it is the only ongoing event referred to in the news. Note that the words "devaluation", "weakening" and "crisis" could be labeled as event triggers in other news extracts, where the context of use of these words is different, but not in the given example.
Further information: For a more detailed description of the data set and the data collection process please visit: https://cs.uns.edu.ar/~mmaisonnave/resources/ED_data.
Data format: The dataset is split into two folders: training and testing. The first folder contains 2,000 XML files; the second folder contains 200 XML files. Each XML file has the following format:
<?xml version="1.0" encoding="UTF-8"?>
The first three tags (pubdate, file-id and sent-idx) contain metadata. The first is the publication date of the news article that contains the text extract. The next two together form a unique identifier for the text extract: file-id uniquely identifies a news article, which can hold several text extracts, and sent-idx identifies the text extract inside the full article.
The last tag (sentence) delimits the beginning and end of the text extract. Inside that text are the annotation tags; each of these tags surrounds one word that was manually labeled as an event trigger.
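Since the trigger tag itself is not named in the description above, the following sketch uses a hypothetical <event> tag (and a hypothetical <extract> root element) purely to illustrate parsing one file with Python's standard library; the metadata tag names are the ones listed above, and the sample values are invented:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample following the tags described above; <extract> and
# <event> are assumptions, and the values are invented. Real files start
# with the XML declaration shown earlier.
SAMPLE = """<extract>
  <pubdate>1998-03-12</pubdate>
  <file-id>0123456</file-id>
  <sent-idx>4</sent-idx>
  <sentence>... to the current account <event>deficit</event> since ...</sentence>
</extract>"""

root = ET.fromstring(SAMPLE)
pubdate = root.findtext("pubdate")
sentence = root.find("sentence")
triggers = [e.text for e in sentence.iter("event")]
print(pubdate, triggers)  # 1998-03-12 ['deficit']
```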
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Slovenian datasets for contextual synonym and antonym detection can be used for training machine learning classifiers as described in the MSc thesis of Jasmina Pegan "Semantic detection of synonyms and antonyms with contextual embeddings" (https://repozitorij.uni-lj.si/IzpisGradiva.php?id=141456). Datasets contain example pairs of synonyms and antonyms in contexts together with additional information on a sense pair. Candidates for synonyms and antonyms were retrieved from the dataset created in the BSc thesis of Jasmina Pegan "Antonym detection with word embeddings" (https://repozitorij.uni-lj.si/IzpisGradiva.php?id=110533). Example sentences were retrieved from The comprehensive Slovenian-Hungarian dictionary (VSMS) (https://www.clarin.si/repository/xmlui/handle/11356/1453). Each dataset is class balanced and contains an equal amount of examples and counterexamples. An example is a pair of example sentences where the two words are synonyms/antonyms. A counterexample is a pair of example sentences where two words are not synonyms/antonyms. Note that a word pair can be synonymous or antonymous in some sense of the two words (but not in the given context).
Datasets are divided into two categories: datasets for synonyms and datasets for antonyms. Each category is further divided into base and updated datasets, each containing three dataset files: train, validation and test. Base datasets include only manually reviewed sense pairs, generated from all pairs of VSMS sense examples for all confirmed pairs of antonym and synonym senses. Updated datasets include automatically generated sense pairs while constraining the maximal number of examples per word; in this way the dataset is more balanced word-wise, but it is not fully manually reviewed and contains less accurate data.
A single dataset entry contains the information on the base word, followed by data on synonym/antonym candidate. The last column discerns whether the sense pair is a pair of synonyms/antonyms or not. More details on this can be found inside the included README file.