This dataset was created by nohafathi
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
US Spanish Language Dataset. High-Quality US Spanish Wake Word Dataset for AI & Speech Models. Dataset Type: Wake Word. Description: Wake Words / Voice Command / Trigger Word /…
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Mandarin Language Dataset. High-Quality Mandarin Wake Word Dataset for AI & Speech Models. Dataset Type: Wake Word. Description: Wake Words / Voice Command / Trigger Word / Keyphrase collection of…
http://researchdatafinder.qut.edu.au/display/n15252
QUT Research Data Repository Dataset and Resources
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
In this repository, we release a series of vector space models of Ancient Greek, trained following different architectures and with different hyperparameter values.
Below is a breakdown of all the models released, with an indication of the training method and hyperparameters. The models are split into ‘Diachronica’ and ‘ALP’ models, according to the published paper they are associated with.
[Diachronica:] Stopponi, Silvia, Nilo Pedrazzini, Saskia Peels-Matthey, Barbara McGillivray & Malvina Nissim. Forthcoming. Natural Language Processing for Ancient Greek: Design, Advantages, and Challenges of Language Models, Diachronica.
[ALP:] Stopponi, Silvia, Nilo Pedrazzini, Saskia Peels-Matthey, Barbara McGillivray & Malvina Nissim. 2023. Evaluation of Distributional Semantic Models of Ancient Greek: Preliminary Results and a Road Map for Future Work. Proceedings of the Ancient Language Processing Workshop associated with the 14th International Conference on Recent Advances in Natural Language Processing (RANLP 2023). 49-58. Association for Computational Linguistics (ACL). https://doi.org/10.26615/978-954-452-087-8.2023_006
Diachronica models
Training data
Diorisis corpus (Vatri & McGillivray 2018). Separate models were trained for:
Classical subcorpus
Hellenistic subcorpus
Whole corpus
Models are named according to the (sub)corpus they are trained on (i.e. hel_ or hellenistic is appended to the names of the models trained on the Hellenistic subcorpus, clas_ or classical for the Classical subcorpus, full_ for the whole corpus).
Models
Count-based
Software used: LSCDetection (Kaiser et al. 2021; https://github.com/Garrafao/LSCDetection)
a. With Positive Pointwise Mutual Information applied (folder PPMI spaces). For each model, a version trained on each subcorpus after removing stopwords is also included (_stopfilt is appended to the model names). Hyperparameter values: window=5, k=1, alpha=0.75.
b. With both Positive Pointwise Mutual Information and dimensionality reduction with Singular Value Decomposition applied (folder PPMI+SVD spaces). For each model, a version trained on each subcorpus after removing stopwords is also included (_stopfilt is appended to the model names). Hyperparameter values: window=5, dimensions=300, gamma=0.0.
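For readers unfamiliar with the weighting scheme, the shifted PPMI with context-distribution smoothing that these hyperparameters refer to (window-based co-occurrence counts, k=1, alpha=0.75) can be sketched as follows. This is an illustrative reimplementation of the standard formula on a made-up co-occurrence matrix, not the LSCDetection code itself.

```python
import numpy as np

def ppmi(counts, alpha=0.75, k=1):
    """counts: (V, V) co-occurrence matrix; returns shifted PPMI weights."""
    total = counts.sum()
    p_wc = counts / total                 # joint probabilities
    p_w = counts.sum(axis=1) / total      # word marginals
    c = counts.sum(axis=0) ** alpha       # smoothed context counts (alpha=0.75)
    p_c = c / c.sum()                     # smoothed context marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_wc / (p_w[:, None] * p_c[None, :]))
    pmi -= np.log2(k)                     # shift by log k (a no-op when k=1)
    return np.maximum(pmi, 0.0)           # keep only positive associations

# Toy 3-word co-occurrence matrix, purely for illustration:
toy = np.array([[0, 4, 1], [4, 0, 2], [1, 2, 0]], dtype=float)
M = ppmi(toy)
```

In the PPMI+SVD variant, a truncated singular value decomposition (here dimensions=300, gamma=0.0) would then be applied to a matrix like `M` to obtain dense vectors.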
Word2Vec
Software used: CADE (Bianchi et al. 2020; https://github.com/vinid/cade).
a. Continuous-bag-of-words (CBOW). Hyperparameter values: size=30, siter=5, diter=5, workers=4, sg=0, ns=20.
b. Skipgram with Negative Sampling (SGNS). Hyperparameter values: size=30, siter=5, diter=5, workers=4, sg=1, ns=20.
Syntactic word embeddings
Syntactic word embeddings were also trained on the Ancient Greek subcorpus of the PROIEL treebank (Haug & Jøhndal 2008), the Gorman treebank (Gorman 2020), the PapyGreek treebank (Vierros & Henriksson 2021), the Pedalion treebank (Keersmaekers et al. 2019), and the Ancient Greek Dependency Treebank (Bamman & Crane 2011) largely following the SuperGraph method described in Al-Ghezi & Kurimo (2020) and the Node2Vec architecture (Grover & Leskovec 2016) (see https://github.com/npedrazzini/ancientgreek-syntactic-embeddings for more details). Hyperparameter values: window=1, min_count=1.
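The graph-based idea can be illustrated with a minimal sketch: dependency edges from the treebanks form a word graph, and random walks over that graph become the pseudo-sentences a skip-gram model is trained on (with its return and in-out parameters both set to 1, Node2Vec reduces to uniform walks like these). The edge data and function names below are invented for illustration; see the linked repository for the actual pipeline.

```python
import random
from collections import defaultdict

def build_graph(edges):
    """edges: iterable of (head_lemma, dependent_lemma) pairs."""
    adj = defaultdict(set)
    for head, dep in edges:
        adj[head].add(dep)
        adj[dep].add(head)  # treat the dependency graph as undirected
    return {word: sorted(nbrs) for word, nbrs in adj.items()}

def random_walks(adj, walk_len=5, walks_per_node=2, seed=0):
    """Uniform random walks over the graph, one list of lemmas per walk."""
    rng = random.Random(seed)
    walks = []
    for start in adj:
        for _ in range(walks_per_node):
            walk, node = [start], start
            for _ in range(walk_len - 1):
                node = rng.choice(adj[node])
                walk.append(node)
            walks.append(walk)
    return walks

# Invented dependency pairs, standing in for real treebank edges:
edges = [("λέγω", "λόγος"), ("λόγος", "καλός"), ("λέγω", "ἀνήρ")]
adj = build_graph(edges)
walks = random_walks(adj)
```

The resulting walks would then be fed to a Word2Vec-style trainer with the window=1, min_count=1 settings listed above.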
ALP models
Training data
Archaic, Classical, and Hellenistic portions of the Diorisis corpus (Vatri & McGillivray 2018) merged, stopwords removed according to the list made by Alessandro Vatri, available at https://figshare.com/articles/dataset/Ancient_Greek_stop_words/9724613.
Models
Count-based
Software used: LSCDetection (Kaiser et al. 2021; https://github.com/Garrafao/LSCDetection)
a. With Positive Pointwise Mutual Information applied (folder ppmi_alp). Hyperparameter values: window=5, k=1, alpha=0.75. Stopwords were removed from the training set.
b. With both Positive Pointwise Mutual Information and dimensionality reduction with Singular Value Decomposition applied (folder ppmi_svd_alp). Hyperparameter values: window=5, dimensions=300, gamma=0.0. Stopwords were removed from the training set.
Word2Vec
Software used: Gensim library (Řehůřek and Sojka, 2010)
a. Continuous-bag-of-words (CBOW). Hyperparameter values: size=30, window=5, min_count=5, negative=20, sg=0. Stopwords were removed from the training set.
b. Skipgram with Negative Sampling (SGNS). Hyperparameter values: size=30, window=5, min_count=5, negative=20, sg=1. Stopwords were removed from the training set.
References
Al-Ghezi, Ragheb & Mikko Kurimo. 2020. Graph-based syntactic word embeddings. In Ustalov, Dmitry, Swapna Somasundaran, Alexander Panchenko, Fragkiskos D. Malliaros, Ioana Hulpuș, Peter Jansen & Abhik Jana (eds.), Proceedings of the Graph-based Methods for Natural Language Processing (TextGraphs), 72-78.
Bamman, David & Gregory Crane. 2011. The Ancient Greek and Latin dependency treebanks. In Sporleder, Caroline, Antal van den Bosch & Kalliopi Zervanou (eds.), Language Technology for Cultural Heritage. Selected Papers from the LaTeCH [Language Technology for Cultural Heritage] Workshop Series. Theory and Applications of Natural Language Processing, 79-98. Berlin, Heidelberg: Springer.
Gorman, Vanessa B. 2020. Dependency treebanks of Ancient Greek prose. Journal of Open Humanities Data 6(1).
Grover, Aditya & Jure Leskovec. 2016. Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ‘16), 855-864.
Haug, Dag T. T. & Marius L. Jøhndal. 2008. Creating a parallel treebank of the Old Indo-European Bible translations. In Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH), 27–34.
Keersmaekers, Alek, Wouter Mercelis, Colin Swaelens & Toon Van Hal. 2019. Creating, enriching and valorizing treebanks of Ancient Greek. In Candito, Marie, Kilian Evang, Stephan Oepen & Djamé Seddah (eds.), Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019), 109-117.
Kaiser, Jens, Sinan Kurtyigit, Serge Kotchourko & Dominik Schlechtweg. 2021. Effects of Pre- and Post-Processing on type-based Embeddings in Lexical Semantic Change Detection. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics.
Schlechtweg, Dominik, Anna Hätty, Marco del Tredici & Sabine Schulte im Walde. 2019. A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 732-746, Florence, Italy. ACL.
Vatri, Alessandro & Barbara McGillivray. 2018. The Diorisis Ancient Greek Corpus: Linguistics and Literature. Research Data Journal for the Humanities and Social Sciences 3(1). 55-65. https://doi.org/10.1163/24523666-01000013
Vierros, Marja & Erik Henriksson. 2021. PapyGreek treebanks: a dataset of linguistically annotated Greek documentary papyri. Journal of Open Humanities Data 7.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
The SloWIC dataset is a Slovenian dataset for the Word in Context task. Each example in the dataset contains a target word with multiple meanings and two sentences that both contain the target word. Each example is also annotated with a label that shows whether both sentences use the same meaning of the target word. The dataset contains 1,808 manually annotated sentence pairs and an additional 13,150 automatically annotated pairs to help with training larger models. The dataset is stored in JSON format, following the format used in the SuperGLUE version of the Word in Context task (https://super.gluebenchmark.com/).
Each example contains the following data fields:
- word: the target word with multiple meanings
- sentence1: the first sentence containing the target word
- sentence2: the second sentence containing the target word
- idx: the index of the example in the dataset
- label: label showing whether the sentences use the same meaning of the target word
- start1: start of the target word in the first sentence
- start2: start of the target word in the second sentence
- end1: end of the target word in the first sentence
- end2: end of the target word in the second sentence
- version: the version of the annotation
- manual_annotation: Boolean showing whether the label was manually annotated
- group: the group of annotators that labelled the example
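Assuming the SuperGLUE-style JSON layout described above, a single example can be read like this. The field names come from the list of data fields, while the Slovenian sentences and character offsets are invented for illustration.

```python
import json

# One invented SloWIC-style record ("list" can mean leaf or sheet of paper):
record = json.loads("""{
  "word": "list",
  "sentence1": "Napisal je dolg list.",
  "sentence2": "Z drevesa je padel list.",
  "idx": 0, "label": true,
  "start1": 16, "end1": 20,
  "start2": 19, "end2": 23,
  "version": 1, "manual_annotation": true, "group": "A"
}""")

# The character offsets recover the target word in each sentence:
w1 = record["sentence1"][record["start1"]:record["end1"]]
w2 = record["sentence2"][record["start2"]:record["end2"]]
```

A classifier for the task would predict `label` from the two sentences and the marked spans.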
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
ArboSloleks is a dataset containing Slovene word formation trees that have been automatically constructed from word relations (http://hdl.handle.net/11356/1986) extracted from Sloleks 2.0 (http://hdl.handle.net/11356/1230). Each word formation tree begins with a root lexeme from Sloleks (e.g. abolicionizem); morphologically related lexemes are then listed in pairs (original lexeme, related lexeme) along with the levels of word formation (e.g. abolicionizem – abolicionist (Level 1); abolicionist – abolicionistka (Level 2)).
Version 1.0 includes 14,918 word formation trees constructed from 66,360 lexeme pairs. It is available in an ad-hoc .txt format – for information on the structure and how to parse the data, please consult 00README.txt.
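The pair-plus-level structure described above can be sketched as a parent-to-children mapping. The tuples below mirror the abolicionizem example from the text; the actual file layout differs, so consult 00README.txt before parsing the release.

```python
from collections import defaultdict

# (original lexeme, related lexeme, word-formation level), per the example:
pairs = [
    ("abolicionizem", "abolicionist", 1),
    ("abolicionist", "abolicionistka", 2),
]

children = defaultdict(list)
for parent, child, level in pairs:
    children[parent].append((child, level))

def tree_lines(root, indent=0):
    """Render a word formation tree as indented lines, one lexeme each."""
    lines = ["  " * indent + root]
    for child, _level in children.get(root, []):
        lines.extend(tree_lines(child, indent + 1))
    return lines

tree = tree_lines("abolicionizem")
```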
This dataset was created by VishaljiODEDRA
This dataset provides information about the number of properties, residents, and average property values for Word Road cross streets in Lewisburg, TN.
https://www.futurebeeai.com/policies/ai-data-license-agreement
The French Wake Word & Voice Command Dataset is expertly curated to support the training and development of voice-activated systems. This dataset includes a large collection of wake words and command phrases, essential for enabling seamless user interaction with voice assistants and other speech-enabled technologies. It’s designed to ensure accurate wake word detection and voice command recognition, enhancing overall system performance and user experience.
This dataset includes 20,000+ audio recordings of wake words and command phrases. Each participant contributed 400 recordings, captured under varied environmental conditions and speaking speeds. The data covers:
This diversity ensures robust training for real-world voice assistant applications.
Each audio file is accompanied by detailed metadata to support advanced filtering and training needs.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
Bangla sign language (BdSL) is a complete and independent natural sign language with its own linguistic characteristics. While video datasets exist for well-known sign languages, there is currently no available dataset for word-level BdSL. In this study, we present a video-based word-level dataset for Bangla sign language, called SignBD-Word, consisting of 6,000 sign videos representing 200 unique words. The dataset includes full- and upper-body views of the signers, along with 2D body pose information. This dataset can also be used as a benchmark for testing sign video classification algorithms. The official train/test split (for both RGB and body pose) can be found at: https://sites.google.com/view/signbd-word/dataset
This dataset is part of the following paper: A. Sams, A. H. Akash and S. M. M. Rahman, "SignBD-Word: Video-Based Bangla Word-Level Sign Language and Pose Translation," 2023 14th International Conference on Computing Communication and Networking Technologies (ICCCNT), Delhi, India, 2023, pp. 1-7, doi: 10.1109/ICCCNT56998.2023.10306914. The paper can be downloaded from: https://asnsams.github.io/Publications.html
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) (https://creativecommons.org/licenses/by-sa/3.0/)
License information was derived automatically
This dataset is part of a larger mission to transform Tamil into a high-resource language in the field of Natural Language Processing (NLP). As one of the oldest and most culturally rich languages, Tamil has a unique linguistic structure, yet it remains underrepresented in the NLP landscape. This dataset, extracted from Tamil Wikipedia, serves as a foundational resource to support Tamil language processing, text mining, and machine learning applications.
- Text Data: This dataset contains over 569,000 articles in raw text form, extracted from Tamil Wikipedia. The collection is ideal for language model training, word frequency analysis, and text mining.
- Scripts and Processing Tools: Code snippets are provided for processing .bz2 compressed files, generating word counts, and handling data for NLP applications.
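As an illustration of the kind of processing the bundled scripts perform, a .bz2 dump can be streamed line by line and word counts accumulated without decompressing the whole file first. The sample file created below is invented and stands in for the Wikipedia dump.

```python
import bz2
from collections import Counter

def word_counts(path):
    """Stream a bz2-compressed text file and count whitespace-separated tokens."""
    counts = Counter()
    with bz2.open(path, mode="rt", encoding="utf-8") as fh:
        for line in fh:
            counts.update(line.split())
    return counts

# Tiny demonstration file (illustrative, not the actual dump):
with bz2.open("sample.txt.bz2", "wt", encoding="utf-8") as fh:
    fh.write("தமிழ் மொழி\nதமிழ் இலக்கியம்\n")

counts = word_counts("sample.txt.bz2")
```

Real dumps would first need wikitext markup stripped before counting; this sketch assumes plain text input.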
Despite having a documented lexicon of over 100,000 words, only a fraction of these are actively used in everyday communication. The largest available Tamil treebank currently holds only 10,000 words, limiting the scope for training accurate language models. This dataset aims to bridge that gap by providing a robust, open-source corpus for researchers, developers, and linguists who want to work on Tamil language technologies.
- Language Modeling: Train or fine-tune models like BERT, GPT, or LSTM-based language models for Tamil.
- Linguistic Research: Analyze Tamil morphology, syntax, and vocabulary usage.
- Data Augmentation: Use the raw text to generate augmented data for multilingual NLP applications.
- Word Embeddings and Semantic Analysis: Create embeddings for Tamil words, useful in multilingual setups or standalone applications.
I believe that advancing Tamil in NLP cannot be a solo effort. Contributions in the form of additional data, annotations, or even new tools for Tamil language processing are welcome! By working together, we can make Tamil a truly high-resource language in NLP.
This dataset is based on content from Tamil Wikipedia and is shared under the Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA 3.0). Proper attribution to Wikipedia is required when using this data.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
United States Imports from Mexico of Typewriters and word processing machines was US$27.07 Thousand during 2012, according to the United Nations COMTRADE database on international trade. United States Imports from Mexico of Typewriters and word processing machines - data, historical chart and statistics - was last updated in October 2025.
https://www.futurebeeai.com/policies/ai-data-license-agreement
The Thai Wake Word & Voice Command Dataset is expertly curated to support the training and development of voice-activated systems. This dataset includes a large collection of wake words and command phrases, essential for enabling seamless user interaction with voice assistants and other speech-enabled technologies. It’s designed to ensure accurate wake word detection and voice command recognition, enhancing overall system performance and user experience.
This dataset includes 20,000+ audio recordings of wake words and command phrases. Each participant contributed 400 recordings, captured under varied environmental conditions and speaking speeds. The data covers:
This diversity ensures robust training for real-world voice assistant applications.
Each audio file is accompanied by detailed metadata to support advanced filtering and training needs.
This paper gives an overview of all currently available data sets for the European Long-term Ecosystem Research (eLTER) monitoring site Gesäuse-Johnsbachtal. The site is part of the LTSER platform Eisenwurzen in the Alps of the province of Styria, Austria. It contains both protected (National Park Gesäuse) and non-protected areas (Johnsbachtal). Although the main research focus of the eLTER monitoring site Gesäuse-Johnsbachtal is on inland surface running waters, forests and other wooded land, the eLTER whole system (WAILS) approach was followed with regard to data selection, systematically screening all available data for suitability as eLTER's Standard Observations (SOs). Thus, data from all system strata was included, incorporating Geosphere, Atmosphere, Hydrosphere, Biosphere and Sociosphere. In the WAILS approach these SOs are key data for a whole-system approach to long-term ecosystem research. Altogether, 54 data sets have been collected for the eLTER monitoring site Gesäuse-Johnsbachtal and included in the Dynamic Ecological Information Management System Site and Dataset Registry (DEIMS-SDR), which is the eLTER data platform. The presented work provides all these data sets through dedicated data repositories for FAIR use. This paper gives an overview of all compiled data sets and their main properties. Additionally, the available data are evaluated in a concluding gap analysis with regard to the observation data needed according to WAILS, followed by an outlook on how to fill these gaps.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Burundi Imports from Italy of Typewriters and word processing machines was US$17 during 2018, according to the United Nations COMTRADE database on international trade. Burundi Imports from Italy of Typewriters and word processing machines - data, historical chart and statistics - was last updated in December 2025.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
On an annual basis (calendar year), individual LTC facilities report facility-level data on services capacity, utilization, patients, and capital/equipment expenditures.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The dataset contains the RGB images of hand gestures of twenty ISL words, namely ‘afraid’, ‘agree’, ‘assistance’, ‘bad’, ‘become’, ‘college’, ‘doctor’, ‘from’, ‘pain’, ‘pray’, ‘secondary’, ‘skin’, ‘small’, ‘specific’, ‘stand’, ‘today’, ‘warn’, ‘which’, ‘work’ and ‘you’, which are commonly used to convey messages or seek support during medical situations. All the words included in this dataset are static. The images were captured from 8 individuals (6 males and 2 females) in the age group of 9 to 30 years. The dataset contains 18,000 images in JPG format. The images are labelled using the format ISLword_X_YYYY_Z, where:
• ISLword is one of the twenty words listed above.
• X is an image number in the range 1 to 900.
• YYYY is an identifier of the participant and is in the range of 1 to 6.
• Z is 01 or 02 and identifies the sample number for each subject.
For example, the file named afraid_1_user1_1 is the image of the first sample of the ISL gesture for the word ‘afraid’ presented by the 1st user.
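The label format can be split back into its parts with a small parser. This sketch simply splits on underscores and uses the example file name from the text (note that there the participant field is written user1 rather than a bare number).

```python
def parse_label(name):
    """Split an ISLword_X_YYYY_Z label into its four components."""
    word, image_no, participant, sample = name.split("_")
    return {
        "word": word,              # one of the twenty ISL words
        "image": int(image_no),    # image number, 1-900
        "participant": participant,  # participant identifier
        "sample": int(sample),     # sample number for the subject
    }

info = parse_label("afraid_1_user1_1")
```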
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Switzerland Imports from United States of Typewriters and word processing machines was US$3.82 Thousand during 2016, according to the United Nations COMTRADE database on international trade. Switzerland Imports from United States of Typewriters and word processing machines - data, historical chart and statistics - was last updated in November 2025.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Italy Exports of typewriters and word processing machines to Ukraine was US$849.27 Thousand during 2012, according to the United Nations COMTRADE database on international trade. Italy Exports of typewriters and word processing machines to Ukraine - data, historical chart and statistics - was last updated in November 2025.