100+ datasets found

Words for Wordle
kaggle.com
Updated Feb 12, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Advait Kale (2022). Words for Wordle [Dataset]. https://www.kaggle.com/datasets/uniquekale/wordle-words
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 12, 2022
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Advait Kale
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset is a curated list of frequently used English Dictionary Words. This dataset is mainly designed to help a user solve the Wordle Game in a much faster way.
l
LScD (Leicester Scientific Dictionary)
figshare.le.ac.uk
docx
Updated Apr 15, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Neslihan Suzen (2020). LScD (Leicester Scientific Dictionary) [Dataset]. http://doi.org/10.25392/leicester.data.9746900.v3
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.25392/leicester.data.9746900.v3
Dataset updated
Apr 15, 2020
Dataset provided by
University of Leicester
Authors
Neslihan Suzen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Leicester
Description
LScD (Leicester Scientific Dictionary)April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk/suzenneslihan@hotmail.com)Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes[Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus) - Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in description of Version 2 below. We did not repeat the explanation. After pre-processing steps, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are also same as described as for LScD Version 2 below.* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2[Version 2] Getting StartedThis document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). This dictionary is created to be used in future work on the quantification of the meaning of research texts. R code for producing the dictionary from LSC and instructions for usage of the code are available in [2]. The code can be also used for list of texts from other sources, amendments to the code may be required.LSC is a collection of abstracts of articles and proceeding papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from publication date to July 2018. The total number of documents in LSC is 1,673,824.LScD is an ordered list of words from texts of abstracts in LSC.The dictionary stores 974,238 unique words, is sorted by the number of documents containing the word in descending order. All words in the LScD are in stemmed form of words. The LScD contains the following information:1.Unique words in abstracts2.Number of documents containing each word3.Number of appearance of a word in the entire corpusProcessing the LSCStep 1.Downloading the LSC Online: Use of the LSC is subject to acceptance of request of the link by email. To access the LSC for research purposes, please email to ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.Step 2.Importing the Corpus to R: The full R code for processing the corpus can be found in the GitHub [2].All following steps can be applied for arbitrary list of texts from any source with changes of parameter. The structure of the corpus such as file format and names (also the position) of fields should be taken into account to apply our code. The organisation of CSV files of LSC is described in README file for LSC [1].Step 3.Extracting Abstracts and Saving Metadata: Metadata that include all fields in a document excluding abstracts and the field of abstracts are separated. Metadata are then saved as MetaData.R. Fields of metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.Step 4.Text Pre-processing Steps on the Collection of Abstracts: In this section, we presented our approaches to pre-process abstracts of the LSC.1.Removing punctuations and special characters: This is the process of substitution of all non-alphanumeric characters by space. We did not substitute the character “-” in this step, because we need to keep words like “z-score”, “non-payment” and “pre-processing” in order not to lose the actual meaning of such words. A processing of uniting prefixes with words are performed in later steps of pre-processing.2.Lowercasing the text data: Lowercasing is performed to avoid considering same words like “Corpus”, “corpus” and “CORPUS” differently. Entire collection of texts are converted to lowercase.3.Uniting prefixes of words: Words containing prefixes joined with character “-” are united as a word. The list of prefixes united for this research are listed in the file “list_of_prefixes.csv”. The most of prefixes are extracted from [4]. We also added commonly used prefixes: ‘e’, ‘extra’, ‘per’, ‘self’ and ‘ultra’.4.Substitution of words: Some of words joined with “-” in the abstracts of the LSC require an additional process of substitution to avoid losing the meaning of the word before removing the character “-”. Some examples of such words are “z-test”, “well-known” and “chi-square”. These words have been substituted to “ztest”, “wellknown” and “chisquare”. Identification of such words is done by sampling of abstracts form LSC. The full list of such words and decision taken for substitution are presented in the file “list_of_substitution.csv”.5.Removing the character “-”: All remaining character “-” are replaced by space.6.Removing numbers: All digits which are not included in a word are replaced by space. All words that contain digits and letters are kept because alphanumeric characters such as chemical formula might be important for our analysis. Some examples are “co2”, “h2o” and “21st”.7.Stemming: Stemming is the process of converting inflected words into their word stem. This step results in uniting several forms of words with similar meaning into one form and also saving memory space and time [5]. All words in the LScD are stemmed to their word stem.8.Stop words removal: Stop words are words that are extreme common but provide little value in a language. Some common stop words in English are ‘I’, ‘the’, ‘a’ etc. We used ‘tm’ package in R to remove stop words [6]. There are 174 English stop words listed in the package.Step 5.Writing the LScD into CSV Format: There are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written in the file “LScD.csv”.The Organisation of the LScDThe total number of words in the file “LScD.csv” is 974,238. Each field is described below:Word: It contains unique words from the corpus. All words are in lowercase and their stem forms. The field is sorted by the number of documents that contain words in descending order.Number of Documents Containing the Word: In this content, binary calculation is used: if a word exists in an abstract then there is a count of 1. If the word exits more than once in a document, the count is still 1. Total number of document containing the word is counted as the sum of 1s in the entire corpus.Number of Appearance in Corpus: It contains how many times a word occurs in the corpus when the corpus is considered as one large document.Instructions for R CodeLScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as RData file and in CSV format. Outputs of the code are:Metadata File: It includes all fields in a document excluding abstracts. Fields are List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.File of Abstracts: It contains all abstracts after pre-processing steps defined in the step 4.DTM: It is the Document Term Matrix constructed from the LSC[6]. Each entry of the matrix is the number of times the word occurs in the corresponding document.LScD: An ordered list of words from LSC as defined in the previous section.The code can be used by:1.Download the folder ‘LSC’, ‘list_of_prefixes.csv’ and ‘list_of_substitution.csv’2.Open LScD_Creation.R script3.Change parameters in the script: replace with the full path of the directory with source files and the full path of the directory to write output files4.Run the full code.References[1]N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1[2]N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION[3]Web of Science. (15 July). Available: https://apps.webofknowledge.com/[4]A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.[5]C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved porter’s stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.[6]I. Feinerer, "Introduction to the tm Package Text Mining in R," Accessible en ligne: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.
m
Data from: Dataset for classifying English words into difficulty levels by...
data.mendeley.com
Updated Oct 24, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nisar Kangoo (2023). Dataset for classifying English words into difficulty levels by undergraduate and postgraduate students [Dataset]. http://doi.org/10.17632/p2wrs7hm4z.4
Explore at:
Unique identifier
https://doi.org/10.17632/p2wrs7hm4z.4
Dataset updated
Oct 24, 2023
Authors
Nisar Kangoo
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset contains English words in column B. Corresponding to each word the other columns contain its frequency(fre), length(len), parts of speech(PS), the number of undergraduate students which marked it difficult (difficult_ug) and the number of postgraduate students which marked it difficult (difficult_pg).The dataset has a total of 5368 unique words. The words marked as difficult by undergraduate students are 680; and those marked as difficult by postgraduate students are 151; all the remaining words, viz., 4537, are easy and hence are not marked as difficult either by undergraduate and postgraduate students. The word against which there is hyphen (-) in difficult_ug column means that this word is not present in the text circulated to undergraduate students. Likewise hyphen(-) in difficult_pg column means words not present in text circulated to postgraduate students. The data is collected from the students of Jammu and Kashmir (a Union Territory of India). Latitude and Longitude (32.2778° N, 75.3412° E) The description of files attached is as: The dataset_english CSV file is the original dataset containing English words, its length, frequency, Parts of speech, number of undergraduate and postgraduate students which marked the particular words as difficult.
The dataset_numerical CSV file contains the original dataset along with string fields transformed into numerical. The English language difficulty level measurement -Questionnaire (1-6) & PG1,PG2,PG3,PG4 .docx files contains the questionnaire supplied to students of College and University to underline difficult words in the English text. IGNOU English.zip file contains the Indra Gandhi National Open University (IGNOU) English text books for graduation and post graduation students. The text for above questionnaires were taken from these IGNOU English text books.
s
Wake Word English Dataset
hmn.shaip.com
Updated Aug 9, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shaip (2024). Wake Word English Dataset [Dataset]. https://hmn.shaip.com/offerings/speech-data-catalog/wake-word-english-dataset/
Explore at:
Dataset updated
Aug 9, 2024
Dataset authored and provided by
Shaip
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Home Wake Word English DatasetHigh-Quality English Wake Word Dataset for AI & Speech Models Contact Us OverviewTitleWake Word Lus Askiv DatasetDataset TypeWake WordDescriptionWake Words / Voice Command / Trigger Word…
Z
SignBD-Word: Video-Based Bangla Word-Level Sign Language Dataset
data.niaid.nih.gov
zenodo.org
Updated Mar 3, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ataher Sams (2024). SignBD-Word: Video-Based Bangla Word-Level Sign Language Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6779840
Explore at:
Dataset updated
Mar 3, 2024
Dataset authored and provided by
Ataher Sams
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Bangla sign language (BdSL) is a complete and independent natural sign language with its own linguistic characteristics. While there exists video datasets for well-known sign languages, there is currently no available dataset for word-level BdSL. In this study, we present a video-based word-level dataset for Bangla sign language, called SignBD-Word, consisting of 6000 sign videos representing 200 unique words. The dataset includes full and upper-body views of the signers, along with 2D body pose information. This dataset can also be used as a benchmark for testing sign video classification algorithms.Official Train Test Spllit (for both RGB and bodypose) can be found from the following link: https://sites.google.com/view/signbd-word/datasetThis dataset is part of the following paper:A. Sams, A. H. Akash and S. M. M. Rahman, "SignBD-Word: Video-Based Bangla Word-Level Sign Language and Pose Translation," 2023 14th International Conference on Computing Communication and Networking Technologies (ICCCNT), Delhi, India, 2023, pp. 1-7, doi: 10.1109/ICCCNT56998.2023.10306914.Download the corresponding paper from this link:https://asnsams.github.io/Publications.html
S
Word cloud for data science
scidb.cn
Updated Apr 3, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Word cloud for data science [Dataset]. https://www.scidb.cn/en/detail?dataSetId=c4ea9cbb144c4465a56929851cbf226f
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.57760/sciencedb.07847
Dataset updated
Apr 3, 2023
Dataset provided by
Science Data Bank
Authors
Lili Zhang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset includes a .txt file and a .ipynb file. Raw data are captured through Web of Science as retrieval records on 24 February 2023. Refined by only published articles entitled "data science," 3490 pieces of records with abstracts are selected. Besides, the python code for word cloud analysis is also shared. This package provides supporting details for a paper, Looking Back to the Future: A Glimpse at Twenty Years of Data Science, submitted to the Data science Journal.
s
Wake Word Hebrew Dataset
shaip.com
ta.shaip.com
Updated Nov 8, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shaip (2023). Wake Word Hebrew Dataset [Dataset]. https://www.shaip.com/offerings/speech-data-catalog/wake-word-hebrew-dataset/
Explore at:
Dataset updated
Nov 8, 2023
Dataset authored and provided by
Shaip
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Home Wake Word Hebrew DatasetHigh-Quality Hebrew Wake Word Dataset for AI & Speech Models Contact Us OverviewTitleWake Word Hebrew Language DatasetDataset TypeWake WordDescriptionWake Words / Voice Command / Trigger Word…
Data from: all-words
kaggle.com
Updated Apr 2, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nathan Lauga (2018). all-words [Dataset]. https://www.kaggle.com/nathanlauga/all-words/activity
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 2, 2018
Dataset provided by
Kaggle
Authors
Nathan Lauga
Description
This datasets contains txt file with all words from different languages like english or french for example.
Spanish Language Datasets | 1.8M+ Sentences | NLP | TTS | Dictionary Display...
datarade.ai
Updated Jul 11, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Oxford Languages (2025). Spanish Language Datasets | 1.8M+ Sentences | NLP | TTS | Dictionary Display | Game | Translations | European & Latin Amer. Coverage [Dataset]. https://datarade.ai/data-products/spanish-language-datasets-1-8m-sentences-nlp-tts-dic-oxford-languages
Explore at:
.csv, .json, .mp3, .txt, .wav, .xls, .xmlAvailable download formats
Dataset updated
Jul 11, 2025
Dataset authored and provided by
Oxford Languageshttps://www.lexico.com/
Area covered
Ecuador, Paraguay, Panama, Bolivia (Plurinational State of), Chile, Cuba, Nicaragua, Colombia, Honduras, Costa Rica
Description
Our Spanish language datasets are carefully compiled and annotated by language and linguistic experts; you can find them available for licensing:

Spanish Monolingual Dictionary Data

Spanish Bilingual Dictionary Data

Spanish Sentences Data

Synonyms and Antonyms Data

Audio Data

Word list Data

Key Features (approximate numbers):

Spanish Monolingual Dictionary Data

Our Spanish monolingual reliably offers clear definitions and examples, a large volume of headwords, and comprehensive coverage of the Spanish language.

Headwords: 73,000

Senses: 123,000

Sentence examples: 104,000

Format: XML and JSON formats

Delivery: Email (link-based file sharing) and REST API

Updated frequency: annually

Spanish Bilingual Dictionary Data

The bilingual data provides translations in both directions, from English to Spanish and from Spanish to English. It is annually reviewed and updated by our in-house team of language experts. Offers significant coverage of the language, providing a large volume of translated words of excellent quality.

Translations: 221,300

Senses: 103,500

Example sentences: 74,500

Example translations: 83,800

Format: XML and JSON formats

Delivery: Email (link-based file sharing) and REST API

Updated frequency: annually

Spanish Sentences Data

Spanish sentences retrieved from the corpus are ideal for NLP model training, presenting approximately 20 million words. The sentences provide a great coverage of Spanish-speaking countries and are accordingly tagged to a particular country or dialect.

Sentences volume: 1,840,000

Format: XML and JSON format

Delivery: Email (link-based file sharing) and REST API

Spanish Synonyms and Antonyms Data

This Spanish language dataset offers a rich collection of synonyms and antonyms, accompanied by detailed definitions and part-of-speech (POS) annotations, making it a comprehensive resource for building linguistically aware AI systems and language technologies.

Synonyms: 127,700

Antonyms: 9,500

Format: XML format

Delivery: Email (link-based file sharing)

Updated frequency: annually

Spanish Audio Data (word-level)

Curated word-level audio data for the Spanish language, which covers all varieties of world Spanish, providing rich dialectal diversity in the Spanish language.

Audio files: 20,900

Format: XLSX (for index), MP3 and WAV (audio files)

Spanish Word List Data

This language data contains a carefully curated and comprehensive list of 450,000 Spanish words.

Wordforms: 450,000

Format: CSV and TXT formats

Delivery: Email (link-based file sharing)

Use Cases:

We consistently work with our clients on new use cases as language technology continues to evolve. These include NLP applications, TTS, dictionary display tools, games, translation, word embedding, and word sense disambiguation (WSD).

If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Oxford.Languages@oup.com to start the conversation.

Pricing:

Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.

Contact our team or email us at Oxford.Languages@oup.com to explore pricing options and discover how our language data can support your goals.
E
Data from: Parallel sense-annotated corpus ELEXIS-WSD 1.1
live.european-language-grid.eu
binary format
Updated May 21, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2023). Parallel sense-annotated corpus ELEXIS-WSD 1.1 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/22947
Explore at:
binary formatAvailable download formats
Dataset updated
May 21, 2023
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.1 contains sentences for 10 languages: Bulgarian, Danish, English, Spanish, Estonian, Hungarian, Italian, Dutch, Portuguese, and Slovene.

The corpus was compiled by automatically extracting a set of sentences from WikiMatrix (Schwenk et al., 2019), a large open-access collection of parallel sentences derived from Wikipedia, using an automatic approach based on multilingual sentence embeddings. The sentences were manually validated according to specific formal, lexical and semantic criteria (e.g. by removing incorrect punctuation, morphological errors, notes in square brackets and etymological information typically provided in Wikipedia pages). To obtain a satisfying semantic coverage, we filtered out sentences with less than 5 words and less than 2 polysemous words were filtered out. Subsequently, in order to obtain datasets in the other nine target languages, for each selected sentence in English, the corresponding WikiMatrix translation into each of the other languages was retrieved. If no translation was available, the English sentence was translated manually. The resulting corpus is comprised of 2,024 sentences for each language.

The sentences were tokenized, lemmatized, and tagged with POS tags using UDPipe v2.6 (https://lindat.mff.cuni.cz/services/udpipe/). Senses were annotated using LexTag (https://elexis.babelscape.com/): each content word (noun, verb, adjective, and adverb) was assigned a sense from among the available senses from the sense inventory selected for the language (see below) or BabelNet. Sense inventories were also updated with new senses during annotation.

List of sense inventories BG: Dictionary of Bulgarian DA: DanNet – The Danish WordNet EN: Open English WordNet ES: Spanish Wiktionary ET: The EKI Combined Dictionary of Estonian HU: The Explanatory Dictionary of the Hungarian Language IT: PSC + Italian WordNet NL: Open Dutch WordNet PT: Portuguese Academy Dictionary (DACL) SL: Digital Dictionary Database of Slovene

The corpus is available in the CoNLL-U tab-separated format. In order, the columns contain the token ID, its form, its lemma, its UPOS-tag, five empty columns (reserved for e.g. dependency parsing, which is absent from this version), and the final MISC column containing the following: the token's whitespace information (whether the token is followed by a whitespace or not), the ID of the sense assigned to the token, and the index of the multiword expression (if the token is part of an annotated multiword expression).

Each language has a separate sense inventory containing all the senses (and their definitions) used for annotation in the corpus. Not all the senses from the sense inventory are necessarily included in the corpus annotations: for instance, all occurrences of the English noun "bank" in the corpus might be annotated with the sense of "financial institution", but the sense inventory also contains the sense "edge of a river" as well as all other possible senses to disambiguate between.

For more information, please refer to 00README.txt.

Differences to version 1.0: - Several minor errors were fixed (e.g. a typo in one of the Slovene sense IDs). - The corpus was converted to the true CoNLL-U format (as opposed to the CoNLL-U-like format used in v1.0). - An error was fixed that resulted in missing UPOS tags in version 1.0. - The sentences in all corpora now follow the same order (from 1 to 2024).
s
Wake Word Mandarin Dataset
shaip.com
Updated Mar 22, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shaip (2024). Wake Word Mandarin Dataset [Dataset]. https://www.shaip.com/offerings/speech-data-catalog/wake-word-mandarin-dataset/
Explore at:
Dataset updated
Mar 22, 2024
Dataset authored and provided by
Shaip
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Home Mandarin DatasetHigh-Quality Mandarin Wake Word Dataset for AI & Speech Models Contact Us OverviewTitleMandarin Language DatasetDataset TypeWake WordDescriptionWake Words / Voice Command / Trigger Word / Keyphrase collection of…
r
Annotated data of English antonyms
researchdata.se
data.europa.eu
Updated Sep 26, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Carita Paradis; Joost van de Wejer; Caroline Willners; Simone Löhndorf (2024). Annotated data of English antonyms [Dataset]. http://doi.org/10.5878/4042-mg83
Explore at:
(4528071), (1478603), (256626)Available download formats
Unique identifier
https://doi.org/10.5878/4042-mg83
Dataset updated
Sep 26, 2024
Dataset provided by
Lund University
Authors
Carita Paradis; Joost van de Wejer; Caroline Willners; Simone Löhndorf
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Time period covered
2008 - 2014
Description
The data set consists of a good 500 randomly selected occurrences of each of the above adjectives in their contexts in the BNC (British National Corpus) (some 21,000 occurrences in total). The UNIX command grep was used to retrieve the sentences containing the target words tagged as adjectives in the BNC, and the nominal heads of the adjectives were then identified using a head finder script. The sentence in the written part of the corpus data and the corresponding chunk for the spoken occurrences for each of the adjectives were imported into FileMaker Pro and the adjectives were then manually coded.

The methodological procedure used in the analysis of the data proceeds from the lexical items in each case to their actual discursive interpretations in context, i.e., from lexical items to their contextual readings. For instance, if the actual reading of say short report refers to the paper copy, it was analyzed as a concrete object since its basic domain of instantiation is space/concrete object, and if it refers to the content it was coded in its domain of instantiation which is neither space nor time, but abstract/mental space. Crucially, this method then also involves a close analysis of the combining nominals and the meanings they express in each instance. The method of identifying discursive meanings of the antonymic word pairs in their contexts serves to make it possible to make generalizations across the interpretations of the lexical items rather than focusing on the lexical items as such without taking their meanings into account.
m
Indian sign Language-Real-life Words
data.mendeley.com
Updated Aug 10, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Akansha Tyagi (2022). Indian sign Language-Real-life Words [Dataset]. http://doi.org/10.17632/s6kgb6r3ss.2
Explore at:
Unique identifier
https://doi.org/10.17632/s6kgb6r3ss.2
Dataset updated
Aug 10, 2022
Authors
Akansha Tyagi
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
India
Description
The dataset contains the RGB images of hand gestures of twenty ISL words, namely, ‘afraid’,’agree’,’assistance’,’bad’,’become’,’college’,’doctor’,’from’,’pain’,’pray’, ’secondary’, ’skin’, ’small’, ‘specific’, ‘stand’, ’today’, ‘warn’, ‘which’, ‘work’, ‘you’’ which are commonly used to convey messages or seek support during medical situations. All the words included in this dataset are static. The images were captured from 8 individuals including 6 males and 2 females in the age group of 9 years to 30 years. The dataset contains a 18000 images in jpg format. The images are labelled using the format ISLword_X_YYYY_Z, where: • ISLword corresponds to the words ‘afraid’, ‘agree’, ‘assistance’, ‘bad’, ‘become’, ‘college’, ‘doctor’ ,‘from’, ’pray’, ‘pain’, ‘secondary’, ‘skin’, ‘small’, ‘specific’, ‘stand’, ‘today’, ‘warn’, ‘which’, ‘work’, ‘you’. • X is an image number in the range 1 to 900. • YYYY is an identifier of the participant and is in the range of 1 to 6. • Z corresponds to 01 or 02 that identifies the sample number for each subject. For example, the file named afraid_1_user1_1 is the image sequence of the first sample of the ISL gesture of the word ‘afraid’ presented by the 1st user.
Z
Data and source code for Automatic generation of a large dictionary with...
data.niaid.nih.gov
explore.openaire.eu
Updated Nov 1, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ivanov Vladimir (2021). Data and source code for Automatic generation of a large dictionary with concreteness/abstractness ratings based on a small human dictionary [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5634190
Explore at:
Dataset updated
Nov 1, 2021
Dataset provided by
Ivanov Vladimir
Solovyev Valery
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We present a method for automatic ranking concreteness of words and propose an approach to significantly decrease amount of expert assessment. The method has been evaluated on a large test set for English. The quality of the constructed dictionaries is comparable to the expert ones. The correlation between predicted and expert ratings is higher comparing to the state-of-the-art methods.
10000 Words
kaggle.com
Updated Dec 11, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ramiro Melo (2021). 10000 Words [Dataset]. https://www.kaggle.com/ramiromelo/10000-words/discussion
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 11, 2021
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Ramiro Melo
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
List of 10000 Words in English

Source: https://www.mit.edu/~ecprice/wordlist.10000
n
wordnet-words
networkrepository.com
csv
Updated Jul 4, 2016
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Network Data Repository (2016). wordnet-words [Dataset]. https://networkrepository.com/wordnet-words.php
Explore at:
csvAvailable download formats
Dataset updated
Jul 4, 2016
Dataset authored and provided by
Network Data Repository
License
https://networkrepository.com/policy.phphttps://networkrepository.com/policy.php
Description
WordNet - This is the lexical network of words from the WordNet dataset. Nodes in the network are English words, and links are relationships between them, such as synonymy, antonymy, meronymy, etc. All relationships present in the WordNet dataset are included. The resulting network is undirected.
S
Data from: Readers may not Integrate Words Strictly in the Order in Which...
scidb.cn
Updated Apr 17, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Hui Zhao; Linjieqiong Huang; Xingshan Li (2024). Readers may not Integrate Words Strictly in the Order in Which They Appear in Chinese Reading [Dataset]. http://doi.org/10.57760/sciencedb.18101
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.57760/sciencedb.18101
Dataset updated
Apr 17, 2024
Dataset provided by
Science Data Bank
Authors
Hui Zhao; Linjieqiong Huang; Xingshan Li
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Area covered
China
Description
Data from: Readers may not Integrate Words Strictly in the Order in Which They Appear in Chinese Reading. The dataset includes materials, raw data, and code for data analysis. Compared to the previous version, we have made minor adjustments to the title.
o
Linguistic Parts of Speech Data
opendatabay.com
.undefined
Updated Jul 5, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Datasimple (2025). Linguistic Parts of Speech Data [Dataset]. https://www.opendatabay.com/data/ai-ml/ae1be978-99ec-4500-9840-bc4b63387067
Explore at:
.undefinedAvailable download formats
Dataset updated
Jul 5, 2025
Dataset authored and provided by
Datasimple
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Data Science and Analytics
Description
This dataset provides a collection of words categorised by their parts of speech, specifically designed for educational pursuits. It includes a CSV file detailing statistical counts and separate files for individual words sorted by their respective parts of speech. The information within this dataset is freely available and primarily intended for academic and learning applications in areas such as natural language processing and text analysis. It differentiates between 'pure' and 'impure' parts of speech, noting specific counts for pure adjectives and adverbs.

Columns

The core description file features the following columns:

Parts of Speech: The name given to each part of speech (e.g., noun, verb, adjective).

Count: The total number of words available for each specific part of speech.

Pure (Top): This indicates the count of words that belong exclusively to one part of speech, meaning they cannot function as other parts of speech. For instance, 143 top adjectives and 47 top adverbs are considered pure in this dataset.

Distribution

The dataset typically includes data files in CSV format. It consists of a content description CSV file that provides counts and a folder containing all words, with each word organised into separate files named by its part of speech. The 'Parts of Speech' column itself contains 8 unique values and 8 total values. While specific total row or record counts for the entire word collection are not provided, statistical summaries like percentages (e.g., 63%, 13%, 25%) are included for various categories.

Usage

This dataset is ideally suited for:

Data Science and Analytics projects focusing on linguistic patterns.

Natural Language Processing (NLP) research and development.

Text Mining and Text Cleaning applications.

Studying Languages and grammatical structures.

Educational purposes in Computer Science and related fields.

Coverage

The dataset has a global regional coverage. It was listed on 16th June 2025, with version 1.0. There are no specific notes regarding demographic scope or data availability for particular groups or years beyond its general global reach.

License

CC0

Who Can Use It

This dataset is intended primarily for educational purposes. Ideal users include:

Students and researchers in data science, computer science, and linguistics.

Developers working on NLP models and text analysis tools.

Anyone interested in linguistic data analysis and understanding parts of speech.

Dataset Name Suggestions

Parts of Speech Word Collection

PoS Word Corpus

Linguistic Parts of Speech Data

Grammatical Word Classification

Attributes

Original Data Source: 📕 Words - Parts of Speech 📰 Collection 2022 📌
E
Data from: Frequency lists of words from the GOS 1.0 corpus
live.european-language-grid.eu
binary format
Updated Nov 17, 2019
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2019). Frequency lists of words from the GOS 1.0 corpus [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/8319
Explore at:
binary formatAvailable download formats
Dataset updated
Nov 17, 2019
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Frequency lists of words were extracted from the GOS 1.0 Corpus of Spoken Slovene (http://hdl.handle.net/11356/1040) using the LIST corpus extraction tool (http://hdl.handle.net/11356/1227). The lists contain all words occurring in the corpus along with their absolute and relative frequencies, percentages, and distribution across the text-types included in the corpus taxonomy.

The lists were extracted for each part-of-speech category. For each part-of-speech, two lists were extracted:
1) one containing lemmas and their text-type distribution,
2) one containing lower-case word forms as well as their normalized forms, lemmas, and morphosyntactic tags along with their text-type distribution.

In addition, four lists were extracted from all words (regardless of their part-of-speech category):
1) a list of all lemmas along with their part-of-speech category and text-type distribution;
2) a list of all lower-case word forms with their lemmas, part-of-speech categories, and text-type distribution;
3) a list of all lower-case word forms with their normalized word forms, lemmas, part-of-speech categories, and text-type distribution;
4) a list of all morphosyntactic tags and their text-type distribution (the tags are also split into several columns).
h
word-flag-data
huggingface.co
Updated Feb 20, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Avi Pal (2023). word-flag-data [Dataset]. https://huggingface.co/datasets/ovi054/word-flag-data
Explore at:
Dataset updated
Feb 20, 2023
Authors
Avi Pal
Description
ovi054/word-flag-data dataset hosted on Hugging Face and contributed by the HF Datasets community

Facebook

Twitter

Click to copy link

Link copied

Cite

Advait Kale (2022). Words for Wordle [Dataset]. https://www.kaggle.com/datasets/uniquekale/wordle-words

Words for Wordle

Frequent Words For Solving Wordle

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Feb 12, 2022

Dataset provided by

Kagglehttp://kaggle.com/

Authors

Advait Kale

License

https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

Description

This dataset is a curated list of frequently used English Dictionary Words. This dataset is mainly designed to help a user solve the Wordle Game in a much faster way.

Clear search

Close search

Google apps

Main menu

Words for Wordle

LScD (Leicester Scientific Dictionary)

Data from: Dataset for classifying English words into difficulty levels by...

Wake Word English Dataset

SignBD-Word: Video-Based Bangla Word-Level Sign Language Dataset

Word cloud for data science

Wake Word Hebrew Dataset

Data from: all-words

Spanish Language Datasets | 1.8M+ Sentences | NLP | TTS | Dictionary Display...

Data from: Parallel sense-annotated corpus ELEXIS-WSD 1.1

Wake Word Mandarin Dataset

Annotated data of English antonyms

Indian sign Language-Real-life Words

Data and source code for Automatic generation of a large dictionary with...

10000 Words

wordnet-words

Data from: Readers may not Integrate Words Strictly in the Order in Which...

Linguistic Parts of Speech Data

Columns

Distribution

Usage

Coverage

License

Who Can Use It

Dataset Name Suggestions

Attributes

Data from: Frequency lists of words from the GOS 1.0 corpus

word-flag-data

Words for Wordle

Frequent Words For Solving Wordle