License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
English United States Wake Word Dataset: a high-quality English (United States) wake word dataset for AI & speech models. Overview: Title (Language): English United States Language Dataset; Dataset Types: Wake Word; Country: United States; Description: Wake Words…
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset contains all the Turkish words I've managed to fetch from the web. It has approximately 7 million lines of Turkish word tokens, each separated by " " for easier reading.
Some words are inflected variants of the same word, e.g. "araba", "arabada", "arabadan". Feel free to use lemmatization algorithms to reduce the data size (see the sketch below).
I believe this dataset could be improved upon. It certainly is not finished. I will update this dataset if I can get my hands on new words in the future.
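As a quick illustration of the suggested size reduction, here is a minimal sketch using snowballstemmer's Turkish stemmer; note this is stemming rather than true lemmatization, and the file name is a placeholder:

```python
# Minimal sketch: collapse inflected variants by stem.
# NOTE: snowballstemmer performs stemming, not true lemmatization,
# so the reduced list is only an approximation.
import snowballstemmer

stemmer = snowballstemmer.stemmer("turkish")

with open("turkish_words.txt", encoding="utf-8") as f:  # placeholder file name
    words = f.read().split()

stems = set(stemmer.stemWords(words))  # deduplicate by stem
print(f"{len(words)} tokens -> {len(stems)} unique stems")
```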
My Linkedin: https://www.linkedin.com/in/enistuna/ My Github: https://github.com/enistuna
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
List of 10000 Words in English
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
Bangla sign language (BdSL) is a complete and independent natural sign language with its own linguistic characteristics. While video datasets exist for well-known sign languages, there is currently no available dataset for word-level BdSL. In this study, we present a video-based word-level dataset for Bangla sign language, called SignBD-Word, consisting of 6000 sign videos representing 200 unique words. The dataset includes full- and upper-body views of the signers, along with 2D body-pose information. This dataset can also be used as a benchmark for testing sign-video classification algorithms.
The official train/test split (for both RGB and body pose) can be found at the following link:
https://sites.google.com/view/signbd-word/dataset
This dataset is part of the following paper:
A. Sams, A. H. Akash and S. M. M. Rahman, "SignBD-Word: Video-Based Bangla Word-Level Sign Language and Pose Translation," 2023 14th International Conference on Computing Communication and Networking Technologies (ICCCNT), Delhi, India, 2023, pp. 1-7, doi: 10.1109/ICCCNT56998.2023.10306914.
Download the corresponding paper from this link:
https://asnsams.github.io/Publications.html
License: FutureBeeAI AI Data License Agreement (https://www.futurebeeai.com/policies/ai-data-license-agreement)
The Indian English Wake Word & Voice Command Dataset is expertly curated to support the training and development of voice-activated systems. This dataset includes a large collection of wake words and command phrases, essential for enabling seamless user interaction with voice assistants and other speech-enabled technologies. It’s designed to ensure accurate wake word detection and voice command recognition, enhancing overall system performance and user experience.
This dataset includes 20,000+ audio recordings of wake words and command phrases. Each participant contributed 400 recordings, captured under varied environmental conditions and speaking speeds. The data covers:
This diversity ensures robust training for real-world voice assistant applications.
Each audio file is accompanied by detailed metadata to support advanced filtering and training needs.
APAC Data Suite offers high-quality language datasets. Ideal for NLP, AI, LLMs, translation, and education, it combines linguistic depth and regional authenticity to power scalable, multilingual language technologies.
Discover our expertly curated language datasets in the APAC Data Suite. Compiled and annotated by language and linguistic experts, this suite offers high-quality resources tailored to your needs. This suite includes:
Monolingual and Bilingual Dictionary Data
Featuring headwords, definitions, word senses, part-of-speech (POS) tags, and semantic metadata.
Semi-bilingual Dictionary Data
Each entry features a headword with definitions and/or usage examples in Language 1, followed by a translation of the headword and/or definition in Language 2, enabling efficient cross-lingual mapping (a purely illustrative entry shape appears after this list).
Sentence Corpora
Curated examples of real-world usage with contextual annotations for training and evaluation.
Synonyms & Antonyms
Lexical relations to support semantic search, paraphrasing, and language understanding.
Audio Data
Native speaker recordings for speech recognition, TTS, and pronunciation modeling.
Word Lists
Frequency-ranked and thematically grouped lists for vocabulary training and NLP tasks. The word list data can cover one language or two, such as Tamil words with English translations.
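To make the data-type descriptions above concrete, here is a purely illustrative entry shape for the semi-bilingual dictionary data; every field name is hypothetical, and actual deliveries use the formats listed below (XML, JSON, CSV, etc.):

```python
# Hypothetical semi-bilingual dictionary entry (field names invented for
# illustration; not the actual delivery schema).
entry = {
    "headword": "house",                                   # Language 1
    "pos": "noun",
    "senses": [
        {
            "definition": "a building for people to live in",
            "example": "They bought a house near the river.",
            "translation": "வீடு",                          # Language 2 (Tamil)
        }
    ],
}
print(entry["senses"][0]["translation"])
```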
Each language may contain one or more types of language data. Depending on the dataset, we can provide these in formats such as XML, JSON, TXT, XLSX, CSV, WAV, MP3, and more. Delivery is currently available via email (link-based sharing) or REST API.
If you require more information about a specific dataset, please contact us at Growth.OL@oup.com.
Below are the different types of datasets available for each language, along with their key features and approximate metrics. If you have any questions or require additional assistance, please don't hesitate to contact us.
Assamese Semi-bilingual Dictionary Data: 72,200 words | 83,700 senses | 83,800 translations.
Bengali Bilingual Dictionary Data: 161,400 translations | 71,600 senses.
Bengali Semi-bilingual Dictionary Data: 28,300 words | 37,700 senses | 62,300 translations.
British English Monolingual Dictionary Data: 146,000 words | 230,000 senses | 149,000 example sentences.
British English Synonyms and Antonyms Data: 600,000 synonyms | 22,000 antonyms.
British English Pronunciations with Audio: 250,000 transcriptions (IPA) | 180,000 audio files.
French Monolingual Dictionary Data: 42,000 words | 56,000 senses | 43,000 example sentences.
French Bilingual Dictionary Data: 380,000 translations | 199,000 senses | 146,000 example translations.
Gujarati Monolingual Dictionary Data: 91,800 words | 131,500 senses.
Gujarati Bilingual Dictionary Data: 171,800 translations | 158,200 senses.
Hindi Monolingual Dictionary Data: 46,200 words | 112,700 senses.
Hindi Bilingual Dictionary Data: 263,400 translations | 208,100 senses | 18,600 example translations.
Hindi Synonyms and Antonyms Dictionary Data: 478,100 synonyms | 18,800 antonyms.
Hindi Sentence Data: 216,000 sentences.
Hindi Audio data: 68,000 audio files.
Indonesian Bilingual Dictionary Data: 36,000 translations | 23,700 senses | 12,700 example translations.
Indonesian Monolingual Dictionary Data: 120,000 words | 140,000 senses | 30,000 example sentences.
Korean Bilingual Dictionary Data: 952,500 translations | 449,700 senses | 227,800 example translations.
Mandarin Chinese (simplified) Monolingual Dictionary Data: 81,300 words | 162,400 senses | 80,700 example sentences.
Mandarin Chinese (traditional) Monolingual Dictionary Data: 60,100 words | 144,700 senses | 29,900 example sentences.
Mandarin Chinese (simplified) Bilingual Dictionary Data: 367,600 translations | 204,500 senses | 150,900 example translations.
Mandarin Chinese (traditional) Bilingual Dictionary Data: 215,600 translations | 202,800 senses | 149,700 example translations.
Mandarin Chinese (simplified) Synonyms and Antonyms Data: 3,800 synonyms | 3,180 antonyms.
Malay Bilingual Dictionary Data: 106,100 translations | 53,500 senses.
Malay Monolingual Dictionary Data: 39,800 words | 40,600 senses | 21,100 example sentences.
Malayalam Monolingual Dictionary Data: 91,300 words | 159,200 senses.
Malayalam Bilingual Word List Data: 76,200 translation pairs.
Marathi Bilingual Dictionary Data: 45,400 translations | 32,800 senses | 3,600 example translations.
Nepali Bilingual Dictionary Data: 350,000 translations | 264,200 senses | 1,300 example translations.
New Zealand English Monolingual Dictionary Data: 100,000 words.
Odia Semi-bilingual Dictionary Data: 30,700 words | 69,300 senses | 69,200 translations.
Punjabi ...
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The dataset contains RGB images of hand gestures for twenty ISL (Indian Sign Language) words, namely 'afraid', 'agree', 'assistance', 'bad', 'become', 'college', 'doctor', 'from', 'pain', 'pray', 'secondary', 'skin', 'small', 'specific', 'stand', 'today', 'warn', 'which', 'work', and 'you', which are commonly used to convey messages or seek support during medical situations. All the words included in this dataset are static gestures. The images were captured from 8 individuals (6 males and 2 females) aged 9 to 30 years. The dataset contains 18,000 images in JPG format. The images are labelled using the format ISLword_X_YYYY_Z, where:
• ISLword is one of the twenty words listed above.
• X is an image number in the range 1 to 900.
• YYYY is an identifier of the participant, in the range 1 to 6.
• Z is 01 or 02 and identifies the sample number for each subject.
For example, the file named afraid_1_user1_1 is the image of the first sample of the ISL gesture for the word 'afraid' performed by the first user.
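A hedged sketch of parsing this naming scheme, following the underscore layout of the example afraid_1_user1_1 (the participant field appears as a string such as 'user1'):

```python
# Parse ISLword_X_YYYY_Z labels, e.g. 'afraid_1_user1_1'.
def parse_label(name):
    word, image_no, participant, sample = name.split("_")
    return {
        "word": word,                # one of the twenty ISL words
        "image": int(image_no),      # 1..900
        "participant": participant,  # e.g. 'user1'
        "sample": int(sample),       # 1 or 2
    }

print(parse_label("afraid_1_user1_1"))
# {'word': 'afraid', 'image': 1, 'participant': 'user1', 'sample': 1}
```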
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
In this repository, we release a series of vector space models of Ancient Greek, trained with different architectures and hyperparameter values.
Below is a breakdown of all the released models, with an indication of the training method and hyperparameters. The models are split into 'Diachronica' and 'ALP' models, according to the published paper they are associated with.
[Diachronica:] Stopponi, Silvia, Nilo Pedrazzini, Saskia Peels-Matthey, Barbara McGillivray & Malvina Nissim. Forthcoming. Natural Language Processing for Ancient Greek: Design, Advantages, and Challenges of Language Models, Diachronica.
[ALP:] Stopponi, Silvia, Nilo Pedrazzini, Saskia Peels-Matthey, Barbara McGillivray & Malvina Nissim. 2023. Evaluation of Distributional Semantic Models of Ancient Greek: Preliminary Results and a Road Map for Future Work. Proceedings of the Ancient Language Processing Workshop associated with the 14th International Conference on Recent Advances in Natural Language Processing (RANLP 2023). 49-58. Association for Computational Linguistics (ACL). https://doi.org/10.26615/978-954-452-087-8.2023_006
Diachronica models
Training data
Diorisis corpus (Vatri & McGillivray 2018). Separate models were trained for:
Classical subcorpus
Hellenistic subcorpus
Whole corpus
Models are named according to the (sub)corpus they are trained on (i.e. hel_ or hellenistic_ is appended to the names of models trained on the Hellenistic subcorpus, clas_ or classical_ for the Classical subcorpus, and full_ for the whole corpus).
Models
Count-based
Software used: LSCDetection (Kaiser et al. 2021; https://github.com/Garrafao/LSCDetection)
a. With Positive Pointwise Mutual Information applied (folder PPMI spaces). For each model, a version trained on each subcorpus after removing stopwords is also included (_stopfilt is appended to the model names). Hyperparameter values: window=5, k=1, alpha=0.75.
b. With both Positive Pointwise Mutual Information and dimensionality reduction with Singular Value Decomposition applied (folder PPMI+SVD spaces). For each model, a version trained on each subcorpus after removing stopwords is also included (_stopfilt is appended to the model names). Hyperparameter values: window=5, dimensions=300, gamma=0.0.
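As a rough illustration of the count-based method above (the released models were produced with LSCDetection, so this is a sketch of the technique, not the project's code), a PPMI matrix with context-distribution smoothing, optionally reduced by SVD, can be computed as follows:

```python
# Illustrative PPMI (+ optional SVD) under the listed hyperparameters;
# the actual models were built with LSCDetection.
import numpy as np
from scipy.sparse.linalg import svds

def ppmi_vectors(sentences, window=5, k=1, alpha=0.75,
                 dimensions=None, gamma=0.0):
    vocab = sorted({w for s in sentences for w in s})
    idx = {w: i for i, w in enumerate(vocab)}
    C = np.zeros((len(vocab), len(vocab)))
    for s in sentences:                       # symmetric window counts
        for i, w in enumerate(s):
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if j != i:
                    C[idx[w], idx[s[j]]] += 1.0
    total = C.sum()
    p_w = C.sum(axis=1) / total
    p_c = C.sum(axis=0) ** alpha              # smoothed context distribution
    p_c /= p_c.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((C / total) / np.outer(p_w, p_c))
    # shifted PPMI: subtract log k and clip negatives (k=1 -> plain PPMI)
    ppmi = np.maximum(np.nan_to_num(pmi, neginf=0.0) - np.log(k), 0.0)
    if dimensions is None:                    # plain PPMI space
        return ppmi, vocab
    U, s, _ = svds(ppmi, k=dimensions)        # SVD reduction
    return U * (s ** gamma), vocab            # gamma=0.0 keeps U unweighted

sents = [["ἀνήρ", "ἀγαθός", "λόγος"]] * 5     # toy corpus
M, vocab = ppmi_vectors(sents)                # plain PPMI space
```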
Word2Vec
Software used: CADE (Bianchi et al. 2020; https://github.com/vinid/cade).
a. Continuous-bag-of-words (CBOW). Hyperparameter values: size=30, siter=5, diter=5, workers=4, sg=0, ns=20.
b. Skipgram with Negative Sampling (SGNS). Hyperparameter values: size=30, siter=5, diter=5, workers=4, sg=1, ns=20.
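A hedged sketch of how such aligned slices are trained with CADE; the constructor arguments follow the hyperparameter names in this listing and the CADE README, and the file names are placeholders, so verify against the version you install:

```python
# Sketch of aligned diachronic training with CADE
# (https://github.com/vinid/cade); file names are placeholders.
from cade.cade import CADE

aligner = CADE(size=30, siter=5, diter=5, workers=4, sg=0, ns=20)  # CBOW
aligner.train_compass("full_corpus.txt", overwrite=False)   # shared compass
clas_model = aligner.train_slice("classical.txt", save=True)
hel_model = aligner.train_slice("hellenistic.txt", save=True)
```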
Syntactic word embeddings
Syntactic word embeddings were also trained on the Ancient Greek subcorpus of the PROIEL treebank (Haug & Jøhndal 2008), the Gorman treebank (Gorman 2020), the PapyGreek treebank (Vierros & Henriksson 2021), the Pedalion treebank (Keersmaekers et al. 2019), and the Ancient Greek Dependency Treebank (Bamman & Crane 2011) largely following the SuperGraph method described in Al-Ghezi & Kurimo (2020) and the Node2Vec architecture (Grover & Leskovec 2016) (see https://github.com/npedrazzini/ancientgreek-syntactic-embeddings for more details). Hyperparameter values: window=1, min_count=1.
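A rough sketch of this graph-embedding step, assuming the node2vec Python package and a graph assembled from treebank dependency edges; the edges, dimensions, and walk settings below are placeholders, while window=1 and min_count=1 follow the listing:

```python
# Node2Vec over a syntactic graph; everything except window and min_count
# is a placeholder.
import networkx as nx
from node2vec import Node2Vec

G = nx.Graph()
G.add_edges_from([
    ("λέγω", "λόγος"),      # hypothetical dependency edges
    ("λόγος", "ἀγαθός"),
])
n2v = Node2Vec(G, dimensions=30, walk_length=10, num_walks=50, workers=4)
model = n2v.fit(window=1, min_count=1)   # gensim Word2Vec under the hood
print(model.wv.most_similar("λόγος"))
```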
ALP models
Training data
Archaic, Classical, and Hellenistic portions of the Diorisis corpus (Vatri & McGillivray 2018) merged, stopwords removed according to the list made by Alessandro Vatri, available at https://figshare.com/articles/dataset/Ancient_Greek_stop_words/9724613.
Models
Count-based
Software used: LSCDetection (Kaiser et al. 2021; https://github.com/Garrafao/LSCDetection)
a. With Positive Pointwise Mutual Information applied (folder ppmi_alp). Hyperparameter values: window=5, k=1, alpha=0.75. Stopwords were removed from the training set.
b. With both Positive Pointwise Mutual Information and dimensionality reduction with Singular Value Decomposition applied (folder ppmi_svd_alp). Hyperparameter values: window=5, dimensions=300, gamma=0.0. Stopwords were removed from the training set.
Word2Vec
Software used: Gensim library (Řehůřek and Sojka, 2010)
a. Continuous-bag-of-words (CBOW). Hyperparameter values: size=30, window=5, min_count=5, negative=20, sg=0. Stopwords were removed from the training set.
b. Skipgram with Negative Sampling (SGNS). Hyperparameter values: size=30, window=5, min_count=5, negative=20, sg=1. Stopwords were removed from the training set.
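A minimal gensim sketch matching the listed hyperparameters (gensim >= 4 renames size to vector_size; the toy corpus below stands in for the merged, stopword-filtered Diorisis corpus):

```python
# Gensim Word2Vec with the listed hyperparameters (sg=0 -> CBOW, sg=1 -> SGNS).
from gensim.models import Word2Vec

sentences = [["λόγος", "ἀγαθός", "κακός"]] * 10  # toy stand-in corpus
model = Word2Vec(sentences, vector_size=30, window=5,
                 min_count=5, negative=20, sg=0)
```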
References
Al-Ghezi, Ragheb & Mikko Kurimo. 2020. Graph-based syntactic word embeddings. In Ustalov, Dmitry, Swapna Somasundaran, Alexander Panchenko, Fragkiskos D. Malliaros, Ioana Hulpuș, Peter Jansen & Abhik Jana (eds.), Proceedings of the Graph-based Methods for Natural Language Processing (TextGraphs), 72-78.
Bamman, David & Gregory Crane. 2011. The Ancient Greek and Latin dependency treebanks. In Sporleder, Caroline, Antal van den Bosch & Kalliopi Zervanou (eds.), Language Technology for Cultural Heritage. Selected Papers from the LaTeCH [Language Technology for Cultural Heritage] Workshop Series. Theory and Applications of Natural Language Processing, 79-98. Berlin, Heidelberg: Springer.
Gorman, Vanessa B. 2020. Dependency treebanks of Ancient Greek prose. Journal of Open Humanities Data 6(1).
Grover, Aditya & Jure Leskovec. 2016. Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ‘16), 855-864.
Haug, Dag T. T. & Marius L. Jøhndal. 2008. Creating a parallel treebank of the Old Indo-European Bible translations. In Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH), 27–34.
Keersmaekers, Alek, Wouter Mercelis, Colin Swaelens & Toon Van Hal. 2019. Creating, enriching and valorizing treebanks of Ancient Greek. In Candito, Marie, Kilian Evang, Stephan Oepen & Djamé Seddah (eds.), Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019), 109-117.
Kaiser, Jens, Sinan Kurtyigit, Serge Kotchourko & Dominik Schlechtweg. 2021. Effects of Pre- and Post-Processing on type-based Embeddings in Lexical Semantic Change Detection. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics.
Schlechtweg, Dominik, Anna Hätty, Marco del Tredici & Sabine Schulte im Walde. 2019. A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 732-746, Florence, Italy. ACL.
Vatri, Alessandro & Barbara McGillivray. 2018. The Diorisis Ancient Greek Corpus: Linguistics and Literature. Research Data Journal for the Humanities and Social Sciences 3(1). 55-65. Brill. https://doi.org/10.1163/24523666-01000013
Vierros, Marja & Erik Henriksson. 2021. PapyGreek treebanks: a dataset of linguistically annotated Greek documentary papyri. Journal of Open Humanities Data 7.
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Northeast English Wake Word Dataset: a high-quality Northeast English wake word dataset for AI & speech models. Overview: Title (Language): Northeast English Language Dataset; Dataset Types: Wake Word; Country: United States; Description: Wake Words / Voice Command…
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Dataset containing the images and labels for the Language data used in the CVPR NAS workshop Unseen-Data challenge under the codename "LaMelo".
The Language dataset is constructed using words from aspell dictionaries. Its intention is to require machine learning models to perform not only image classification but also linguistic analysis, working out which letter-frequency profile is associated with each language. For each Language image we selected four six-letter words in the standard Latin alphabet, removing any words with diacritics (such as é or ü) or containing 'y' or 'z'. We encode these words on a grid, with one axis representing the index within the 24-character string (the four words joined together) and the other representing the letter (A-X).
The data is in channels-first format with shape (n, 1, 24, 24), where n is the number of samples in the corresponding set (50,000 for training, 10,000 for validation, and 10,000 for testing). There are ten classes in the dataset, with 7,000 examples of each, distributed evenly between the three subsets. The ten classes and their numerical labels are: English: 0, Dutch: 1, German: 2, Spanish: 3, French: 4, Portuguese: 5, Swahili: 6, Zulu: 7, Finnish: 8, Swedish: 9.
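A short sketch of the encoding described above; which axis holds the character index and which holds the letter is an assumption here:

```python
# Encode four six-letter words as a (1, 24, 24) one-hot grid:
# one axis is the position in the joined 24-character string,
# the other the letter A-X ('y'/'z' were excluded from the words).
import numpy as np

def encode(words):
    assert len(words) == 4 and all(len(w) == 6 for w in words)
    joined = "".join(words).upper()
    img = np.zeros((1, 24, 24), dtype=np.float32)  # channels-first
    for pos, ch in enumerate(joined):
        img[0, pos, ord(ch) - ord("A")] = 1.0      # axis order is an assumption
    return img

x = encode(["sample", "strong", "planet", "bridge"])
print(x.shape)  # (1, 24, 24)
```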
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The SlangTrack (ST) Dataset is a novel, meticulously curated resource aimed at addressing the complexities of slang detection in natural language processing. This dataset uniquely emphasizes words that exhibit both slang and non-slang contexts, enabling a binary classification system to distinguish between these dual senses. By providing comprehensive examples for each usage, the dataset supports fine-grained linguistic and computational analysis, catering to both researchers and practitioners in NLP.
These features ensure a robust contextual framework for accurate slang detection and semantic analysis.
The target words were carefully chosen to align with the goals of fine-grained analysis. Each word in the dataset:
The final dataset comprises ten target words, meeting strict selection criteria to ensure linguistic and computational relevance. Each word:
The SlangTrack Dataset serves as a public resource, fostering research in slang detection, semantic evolution, and informal language processing. By combining historical and contemporary sources, it provides a comprehensive platform for exploring the nuances of slang in natural language.
The table below provides a breakdown of the total number of instances categorized as slang or non-slang for each target keyword in the SlangTrack (ST) Dataset.
| Keyword | Non-slang | Slang | Total |
|---|---|---|---|
| BMW | 1083 | 14 | 1097 |
| Brownie | 582 | 382 | 964 |
| Chronic | 1415 | 270 | 1685 |
| Climber | 520 | 122 | 642 |
| Cucumber | 972 | 79 | 1051 |
| Eat | 2462 | 561 | 3023 |
| Germ | 566 | 249 | 815 |
| Mammy | 894 | 154 | 1048 |
| Rodent | 718 | 349 | 1067 |
| Salty | 543 | 727 | 1270 |
| Total | 9755 | 2907 | 12662 |
The table below provides examples of sentences from the SlangTrack (ST) Dataset, showcasing both slang and non-slang usage of the target keywords. Each example highlights the context in which the target word is used and its corresponding category.
| Example Sentences | Target Keyword | Category |
|---|---|---|
| Today, I heard, for the first time, a short scientific talk given by a man dressed as a rodent...! An interesting experience. | Rodent | Slang |
| On the other. Mr. Taylor took food requests and, with a stern look in his eye, told the children to stay seated until he and his wife returned with the food. The children nodded attentively. After the adults left, the children seemed to relax, talking more freely and playing with one another. When the parents returned, the kids straightened up again, received their food, and began to eat, displaying quiet and gracious manners all the while. | Eat | Non-Slang |
| Greater than this one that washed between the shores of Florida and Mexico. He balanced between the breakers and the turning tide. Small particles of sand churned in the waters around him, and a small fish swam against his leg, a momentary dark streak that vanished in the surf. He began to swim. Buoyant in the salty water, he swam a hundred meters to a jetty that sent small whirlpools around its barnacle rough pilings. | Salty | Non-Slang |
| Mom was totally hating on my dance moves. She's so salty. | Salty | Slang |
**Licenses**
The SlangTrack (ST) dataset is built using a combination of licensed and publicly available corpora. To ensure compliance with licensing agreements, all data has been extensively preprocessed, modified, and anonymized while preserving linguistic integrity. The dataset has been randomized and structured to support research in slang detection without violating the terms of the original sources.
The **original authors and data providers retain their respective rights**, where applicable. We encourage users to **review the licensing agreements** included with the dataset to understand any potential usage limitations. While some source corpora, such as **COHA, require a paid license and restrict redistribution**, our processed dataset is **legally shareable and publicly available** for **research and development purposes**.
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
US Spanish Wake Word Dataset: a high-quality US Spanish wake word dataset for AI & speech models. Overview: Title: US Spanish Language Dataset; Dataset Type: Wake Word; Description: Wake Words / Voice Command / Trigger Word /…
EMEA Data Suite offers 43 high-quality language datasets covering 23 languages spoken in the region. Ideal for NLP, AI, LLMs, translation, and education, it combines linguistic depth and regional authenticity to power scalable, multilingual language technologies.
Discover our expertly curated language datasets in the EMEA Data Suite. Compiled and annotated by language and linguistic experts, this suite offers high-quality resources tailored to your needs. This suite includes:
Monolingual and Bilingual Dictionary Data
Featuring headwords, definitions, word senses, part-of-speech (POS) tags, and semantic metadata.
Sentence Corpora
Curated examples of real-world usage with contextual annotations for training and evaluation.
Synonyms & Antonyms
Lexical relations to support semantic search, paraphrasing, and language understanding.
Audio Data
Native speaker recordings for speech recognition, TTS, and pronunciation modeling.
Word Lists
Frequency-ranked and thematically grouped lists for vocabulary training and NLP tasks.
Each language may contain one or more types of language data. Depending on the dataset, we can provide these in formats such as XML, JSON, TXT, XLSX, CSV, WAV, MP3, and more. Delivery is currently available via email (link-based sharing) or REST API.
If you require more information about a specific dataset, please contact us at Growth.OL@oup.com.
Below are the different types of datasets available for each language, along with their key features and approximate metrics. If you have any questions or require additional assistance, please don't hesitate to contact us.
Arabic Monolingual Dictionary Data: 66,500 words | 98,700 senses | 70,000 example sentences.
Arabic Bilingual Dictionary Data: 116,600 translations | 88,300 senses | 74,700 example translations.
Arabic Synonyms and Antonyms Data: 55,100 synonyms.
British English Monolingual Dictionary Data: 146,000 words | 230,000 senses | 149,000 example sentences.
British English Synonyms and Antonyms Data: 600,000 synonyms | 22,000 antonyms.
British English Pronunciations with Audio: 250,000 transcriptions (IPA) | 180,000 audio files.
Catalan Monolingual Dictionary Data: 29,800 words | 47,400 senses | 25,600 example sentences.
Catalan Bilingual Dictionary Data: 76,800 translations | 109,350 senses | 26,900 example translations.
Croatian Monolingual Dictionary Data: 129,600 words | 164,760 senses | 34,630 example sentences.
Croatian Bilingual Dictionary Data: 100,700 translations | 91,600 senses | 10,180 example translations.
Czech Bilingual Dictionary Data: 426,473 translations | 199,800 senses | 95,000 example translations.
Danish Bilingual Dictionary Data: 129,000 translations | 91,500 senses | 23,000 example translations.
French Monolingual Dictionary Data: 42,000 words | 56,000 senses | 43,000 example sentences.
French Bilingual Dictionary Data: 380,000 translations | 199,000 senses | 146,000 example translations.
German Monolingual Dictionary Data: 85,500 words | 78,000 senses | 55,000 example sentences.
German Bilingual Dictionary Data: 393,000 translations | 207,500 senses | 129,500 example translations.
German Word List Data: 338,000 wordforms.
Greek Monolingual Dictionary Data: 47,800 translations | 46,309 senses | 2,388 example sentences.
Hebrew Monolingual Dictionary Data: 85,600 words | 104,100 senses | 94,000 example sentences.
Hebrew Bilingual Dictionary Data: 67,000 translations | 49,000 senses | 19,500 example translations.
Hungarian Monolingual Dictionary Data: 90,500 words | 155,300 senses | 42,500 example sentences.
Italian Monolingual Dictionary Data: 102,500 words | 231,580 senses | 48,200 example sentences.
Italian Bilingual Dictionary Data: 492,000 translations | 251,600 senses | 157,100 example translations.
Italian Synonyms and Antonyms Data: 197,000 synonyms | 62,000 antonyms.
Latvian Monolingual Dictionary Data: 36,000 words | 43,600 senses | 73,600 example sentences.
Polish Bilingual Dictionary Data: 287,400 translations | 216,900 senses | 19,800 example translations.
Portuguese Monolingual Dictionary Data: 143,600 words | 285,500 senses | 69,300 example sentences.
Portuguese Bilingual Dictionary Data: 300,000 translations | 158,000 senses | 117,800 example translations.
Portuguese Synonyms and Antonyms Data: 196,000 synonyms | 90,000 antonyms.
Romanian Monolingual Dictionary Data: 66,900 words | 113,500 senses | 2,700 example sentences.
Romanian Bilingual Dictionary Data: 77,500 translations | 63,870 senses | 33,730 example translations.
Russian Monolingual Dictionary Data: 65,950 words | 57,500 senses | 51,900 example sentences.
Russian Bilingual Dictionary Data: 230,100 translations | 122,200 senses | 69,600 example translations.
Slovak Bilingual Dictionary Dat...
The provided list contains common stop words used in natural language processing (NLP) tasks. Stop words are words that are filtered out before or after processing of natural language data; they are typically the most common words in a language and carry little meaning on their own, so they are often removed to focus the analysis on the more informative words or tokens in a text. This dataset can be used in NLP applications such as text classification, sentiment analysis, and information retrieval to improve the accuracy and efficiency of text-processing algorithms. By eliminating these stop words, computational resources can be used more effectively and the analysis can focus on the meaningful content of the text.
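As a minimal illustration of how such a list is applied in practice (the entries shown are a small sample, not the actual list):

```python
# Filter stop words from a text before downstream NLP processing.
stop_words = {"the", "a", "an", "and", "of", "to", "is", "on"}  # sample entries

def remove_stop_words(text):
    return [tok for tok in text.lower().split() if tok not in stop_words]

print(remove_stop_words("The cat sat on a mat"))  # ['cat', 'sat', 'mat']
```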
Two experiments investigated Estes and Maddox's (2002) theory that the word-frequency mirror effect in episodic recognition memory is due to word likeness rather than frequency of experience with a word. In Experiment 1, sixteen first-year psychology students at the University of Newcastle studied lists of high- and low-frequency words crossed with high- and low-neighbourhood-density words; they were then given an episodic recognition test and asked to rate words as new or old and to provide confidence ratings on a three-point scale with six possible responses: sure old, probably old, possibly old, possibly new, probably new, and sure new. Experiment 2 included twenty-three first-year psychology students at the University of Newcastle, who were tested using lexical decision task lists of words and nonwords. Testing was undertaken on a computer that presented the stimuli and recorded participants' responses using a program written in Turbo Pascal 6.0 with millisecond-accurate timing. The dataset contains one Microsoft Excel file in .xls format containing data for Experiments 1 and 2.
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
Frequency lists of words were extracted from the GOS 1.0 Corpus of Spoken Slovene (http://hdl.handle.net/11356/1040) using the LIST corpus extraction tool (http://hdl.handle.net/11356/1227). The lists contain all words occurring in the corpus along with their absolute and relative frequencies, percentages, and distribution across the text-types included in the corpus taxonomy.
The lists were extracted for each part-of-speech category. For each part-of-speech, two lists were extracted:
1) one containing lemmas and their text-type distribution,
2) one containing lower-case word forms as well as their normalized forms, lemmas, and morphosyntactic tags along with their text-type distribution.
In addition, four lists were extracted from all words (regardless of their part-of-speech category):
1) a list of all lemmas along with their part-of-speech category and text-type distribution;
2) a list of all lower-case word forms with their lemmas, part-of-speech categories, and text-type distribution;
3) a list of all lower-case word forms with their normalized word forms, lemmas, part-of-speech categories, and text-type distribution;
4) a list of all morphosyntactic tags and their text-type distribution (the tags are also split into several columns).
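As a minimal illustration of how the absolute frequencies, relative frequencies, and percentages in such lists relate (the released lists were produced with the LIST tool, so the column layout below is illustrative only):

```python
# Toy frequency list: absolute frequency, relative frequency, and percentage.
from collections import Counter

tokens = ["beseda", "je", "beseda", "korpus", "je", "beseda"]  # placeholder tokens
counts = Counter(tokens)
total = sum(counts.values())
for word, freq in counts.most_common():
    print(f"{word}\t{freq}\t{freq / total:.4f}\t{100 * freq / total:.2f}%")
```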
Comprehensive German language datasets with linguistic annotations, including headwords, definitions, word senses, usage examples, part-of-speech (POS) tags, semantic metadata, and contextual usage details.
Our German language datasets are carefully compiled and annotated by language and linguistic experts. The following datasets in German are available for licensing:
Key Features (approximate numbers):
Our German monolingual dictionary data features clear definitions, headwords, examples, and comprehensive coverage of the German language as spoken today.
The bilingual data provides translations in both directions, from English to German and from German to English. It is annually reviewed and updated by our in-house team of language experts, and offers comprehensive coverage of the language with a substantial volume of high-quality translations.
This language data contains a carefully curated and comprehensive list of 338,000 German words.
Use Cases:
We consistently work with our clients on new use cases as language technology continues to evolve. These include Natural Language Processing (NLP) applications, TTS, dictionary display tools, games, translations, word embedding, and word sense disambiguation (WSD).
If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Growth.OL@oup.com to start the conversation.
Pricing:
Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.
Contact our team or email us at Growth.OL@oup.com to explore pricing options and discover how our language data can support your goals.
About the sample:
The samples offer a brief overview of one or two language datasets (monolingual and/or bilingual dictionary data). To help you explore the structure and features of our datasets, we provide a sample in CSV format for preview purposes only.
If you need the complete original sample or more details about any dataset, please contact us (Growth.OL@oup.com) to request access or further information.
Comprehensive Chinese language datasets with linguistic annotations, including headwords, definitions, word senses, usage examples, part-of-speech (POS) tags, semantic metadata, and contextual usage details. Covering Simplified and Traditional writing systems.
Our Chinese language datasets are carefully compiled and annotated by language and linguistic experts. The following datasets are available for licensing:
Key Features (approximate numbers):
Our Mandarin Chinese (simplified) monolingual dictionary data features clear definitions, headwords, examples, and comprehensive coverage of the Mandarin Chinese language as spoken today.
Our Mandarin Chinese (traditional) monolingual dictionary data features clear definitions, headwords, examples, and comprehensive coverage of the Mandarin Chinese language as spoken today.
The bilingual data provides translations in both directions, from English to Mandarin Chinese (simplified) and from Mandarin Chinese (simplified) to English. It is annually reviewed and updated by our in-house team of language experts, and offers comprehensive coverage of the language with a substantial volume of high-quality translations.
The bilingual data provides translations in both directions, from English to Mandarin Chinese (traditional) and from Mandarin Chinese (traditional) to English. It is annually reviewed and updated by our in-house team of language experts, and offers comprehensive coverage of the language with a substantial volume of high-quality translations.
The Mandarin Chinese (simplified) Synonyms and Antonyms Dataset is a leading resource offering comprehensive, up-to-date coverage of word relationships in contemporary Mandarin Chinese. It includes rich linguistic detail such as precise definitions and part-of-speech (POS) tags, making it an essential asset for developing AI systems and language technologies that require deep semantic understanding.
Use Cases:
We consistently work with our clients on new use cases as language technology continues to evolve. These include NLP applications, TTS, dictionary display tools, games, translation, word embedding, and word sense disambiguation (WSD).
If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Growth.OL@oup.com to start the conversation.
Pricing:
Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.
Contact our team or email us at Growth.OL@oup.com to explore pricing options and discover how our language data can support your goals.
Please note that some datasets may have rights restrictions. Contact us for more information.
About the sample:
The samples offer a brief overview of one or two language datasets (monolingual and/or bilingual dictionary data). To help you explore the structure and features of our datasets, we provide a sample in CSV format for preview purposes only.
If you need the complete original sample or more details about any dataset, please contact us (Growth.OL@oup.com) to request access or further information.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Abstract: This dataset holds the content of one day's micro-blogs sampled from Weibo (http://weibo.com) in the form of bags-of-words.
Data Set Characteristics: Text. Number of micro-blogs: 189,223. Total number of words: 3,252,492. Size of the vocabulary: 20,942. Associated tasks: short-text topic modeling, etc.
About preprocessing: For tokenization, we use NLPIR. Stop words and words with term frequency less than 20 were removed. Words consisting of only one Chinese character were also removed.
Data format: each line is one document. [document_i] is the i-th document of the dataset and consists of a list of Ni words/terms:
[document_1]
[document_2]
...
[document_M]
[document_i] = [word_i1] [word_i2] ... [word_iNi]
All word_ij are text strings, separated by the blank character.
If you have any questions about the data set, please contact: jichang@buaa.edu.cn.
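Loading the released format is straightforward (the file name below is a placeholder):

```python
# One document per line; words separated by blanks.
with open("weibo_bags_of_words.txt", encoding="utf-8") as f:  # placeholder name
    documents = [line.split() for line in f]

print(len(documents))             # expected: 189,223 micro-blogs
print(sum(map(len, documents)))   # expected: 3,252,492 word tokens
```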
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The LSC (Leicester Scientific Corpus). August 2019, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk), supervised by Prof Alexander Gorban and Dr Evgeny Mirkes. The data is extracted from the Web of Science® [1]. You may not copy or distribute this data in whole or in part without the written consent of Clarivate Analytics.
Getting Started
This text provides background information on the LSC (Leicester Scientific Corpus) and the pre-processing steps applied to abstracts, and describes the structure of the files that organise the corpus. The corpus was created for future work on the quantification of the sense of research texts. One of the goals of publishing the data is to make it available for further analysis and use in Natural Language Processing projects.
LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [1]. Each document contains the title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected online in July 2018 and contains the number of citations from publication date to July 2018. Each document in the corpus contains the following parts:
1. Authors: the list of authors of the paper.
2. Title: the title of the paper.
3. Abstract: the abstract of the paper.
4. Categories: one or more categories from the list of categories [2]. The full list of categories is presented in the file 'List_of_Categories.txt'.
5. Research Areas: one or more research areas from the list of research areas [3]. The full list of research areas is presented in the file 'List_of_Research_Areas.txt'.
6. Total Times Cited: the number of times the paper was cited by other items from all databases within the Web of Science platform [4].
7. Times Cited in Core Collection: the total number of times the paper was cited by other papers within the WoS Core Collection [4].
We describe a document as the collection of information (about a paper) listed above. The total number of documents in LSC is 1,673,824. All documents in LSC have a nonempty abstract, title, categories, research areas, and times cited in WoS databases. There are 119 documents with an empty authors list; we did not exclude these documents.
Data Processing
This section describes the steps taken for the LSC to be collected, cleaned, and made available to researchers. Processing the data consists of six main steps.
Step 1: Downloading the data online. This is the step of collecting the dataset online, done manually by exporting documents as tab-delimited files. All downloaded documents are available online.
Step 2: Importing the dataset into R. This is the process of converting the collection to RData format for processing. The LSC was collected as TXT files; all documents were extracted into R.
Step 3: Cleaning the data of documents with an empty abstract or without a category. Not all papers in the collection have an abstract and categories. As our research is based on the analysis of abstracts and categories, inaccurate documents were detected and removed: all documents with empty abstracts and all documents without categories.
Step 4: Identification and correction of concatenated words in abstracts. Traditionally, abstracts are written as an executive summary in one paragraph of continuous writing, known as an 'unstructured abstract'. However, medicine-related publications in particular use 'structured abstracts', which are divided into sections with distinct headings such as introduction, aim, objective, method, result, conclusion, etc. The tool used for extracting abstracts concatenates section headings with the first word of the section; as a result, some structured abstracts in the LSC require an additional correction step to split such concatenated words. For instance, we observe words such as 'ConclusionHigher' and 'ConclusionsRT' in the corpus. The detection and identification of concatenated words cannot be fully automated: human intervention is needed to identify possible section headings. We only consider concatenated words in section headings, as it is not possible to detect all concatenated words without deep knowledge of the research areas. Identification of such words was done by sampling medicine-related publications. The section headings identified in structured abstracts are listed in List 1.
List 1. Headings of sections identified in structured abstracts: Background, Method(s), Design, Theoretical, Measurement(s), Location, Aim(s), Methodology, Process, Abstract, Population, Approach, Objective(s), Purpose(s), Subject(s), Introduction, Implication(s), Patient(s), Procedure(s), Hypothesis, Measure(s), Setting(s), Limitation(s), Discussion, Conclusion(s), Result(s), Finding(s), Material(s), Rationale(s), Implications for health and nursing policy.
All words including the headings in List 1 were detected in the entire corpus and split into two words; for instance, 'ConclusionHigher' is split into 'Conclusion' and 'Higher'. (A sketch of this splitting step follows this entry's references.)
Step 5: Extracting (sub-setting) the data based on abstract length. After correction of concatenated words, the lengths of abstracts were calculated. 'Length' indicates the total number of words in the text, calculated by the same rule as Microsoft Word's 'word count' [5]. According to the APA style manual [6], an abstract should contain between 150 and 250 words; however, word limits vary from journal to journal. For instance, the Journal of Vascular Surgery recommends that 'Clinical and basic research studies must include a structured abstract of 400 words or less' [7]. In LSC, the length of abstracts varies from 1 to 3,805 words. We decided to limit the length of abstracts to between 30 and 500 words in order to study documents with abstracts of typical length and to avoid the effect of length on the analysis. Documents containing fewer than 30 or more than 500 words in their abstracts were removed.
Step 6: Saving the dataset in CSV format. Corrected and extracted documents were saved into 36 CSV files; the structure of the files is described in the following section.
The Structure of Fields in CSV Files
In the CSV files, the information is organised with one record per line, and the abstract, title, list of authors, list of categories, list of research areas, and times cited are recorded in separate fields. To access the LSC for research purposes, please email ns433@le.ac.uk.
References
[1] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[2] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
[3] Research Areas in WoS. Available: https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html
[4] Times Cited in WoS Core Collection. (15 July). Available: https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Times-Cited-accessibility-and-variation?language=en_US
[5] Word Count. Available: https://support.office.com/en-us/article/show-word-count-3c9e6a11-a04d-43b4-977c-563a0e0d5da3
[6] A. P. Association, Publication Manual. American Psychological Association, Washington, DC, 1983.
[7] P. Gloviczki and P. F. Lawrence, "Information for authors," Journal of Vascular Surgery, vol. 65, no. 1, pp. A16-A22, 2017.
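Step 4 above lends itself to a simple illustration. A minimal sketch, assuming plain-text abstracts and using only a subset of the List 1 headings (the corpus itself was processed in R, so this Python fragment is illustrative only):

```python
# Split section headings fused to the following word,
# e.g. 'ConclusionHigher' -> 'Conclusion Higher'.
import re

HEADINGS = ["Conclusions", "Conclusion", "Background", "Introduction",
            "Methods", "Results", "Objective"]  # subset of List 1

def split_fused_headings(abstract):
    # Try longer headings first so 'Conclusions' wins over 'Conclusion'.
    for h in sorted(HEADINGS, key=len, reverse=True):
        abstract = re.sub(rf"\b{h}(?=[A-Z])", f"{h} ", abstract)
    return abstract

print(split_fused_headings("ConclusionHigher doses were effective."))
# 'Conclusion Higher doses were effective.'
```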