100+ datasets found
  1. example text data word & CSV format

    • kaggle.com
    zip
    Updated Apr 14, 2025
    Cite
    nohafathi (2025). example text data word & CSV format [Dataset]. https://www.kaggle.com/datasets/nohaaf/example-text-data/discussion?sort=undefined
    Available download formats: zip (10243 bytes)
    Dataset updated
    Apr 14, 2025
    Authors
    nohafathi
    Description

    Dataset

    This dataset was created by nohafathi


  2. Wake Word US Spanish Dataset

    • shaip.com
    Updated Oct 13, 2023
    Cite
    Shaip (2023). Wake Word US Spanish Dataset [Dataset]. https://www.shaip.com/offerings/speech-data-catalog/wake-word-us-spanish-dataset/
    Dataset updated
    Oct 13, 2023
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    High-Quality US Spanish Wake Word Dataset for AI & Speech Models. Title: US Spanish Language Dataset. Dataset type: Wake Word. Description: Wake Words / Voice Command / Trigger Word /…

  3. Wake Word Mandarin Dataset

    • shaip.com
    Updated Mar 22, 2024
    Cite
    Shaip (2024). Wake Word Mandarin Dataset [Dataset]. https://www.shaip.com/offerings/speech-data-catalog/wake-word-mandarin-dataset/
    Dataset updated
    Mar 22, 2024
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    High-Quality Mandarin Wake Word Dataset for AI & Speech Models. Title: Mandarin Language Dataset. Dataset type: Wake Word. Description: Wake Words / Voice Command / Trigger Word / Keyphrase collection of…

  4. Survey Word Vector Data and Movie Review Vector Data

    • data.researchdatafinder.qut.edu.au
    Updated Jan 27, 2018
    Cite
    (2018). Survey Word Vector Data and Movie Review Vector Data [Dataset]. https://data.researchdatafinder.qut.edu.au/dataset/survey-word-vector
    Dataset updated
    Jan 27, 2018
    License

    http://researchdatafinder.qut.edu.au/display/n15252

    Description

    QUT Research Data Repository Dataset and Resources

  5. Data from: Ancient Greek language models

    • data.niaid.nih.gov
    Updated Apr 29, 2024
    Cite
    Stopponi; Pedrazzini; Peels-Matthey; McGillivray; Nissim (2024). Ancient Greek language models [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8369515
    Dataset updated
    Apr 29, 2024
    Dataset provided by
    Silvia
    Barbara
    Nilo
    Malvina
    Saskia
    Authors
    Stopponi; Pedrazzini; Peels-Matthey; McGillivray; Nissim
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this repository, we release a series of vector space models of Ancient Greek, trained following different architectures and with different hyperparameter values.

    Below is a breakdown of all the models released, with an indication of the training method and hyperparameters. The models are split into ‘Diachronica’ and ‘ALP’ models, according to the published paper they are associated with.

    [Diachronica:] Stopponi, Silvia, Nilo Pedrazzini, Saskia Peels-Matthey, Barbara McGillivray & Malvina Nissim. Forthcoming. Natural Language Processing for Ancient Greek: Design, Advantages, and Challenges of Language Models, Diachronica.

    [ALP:] Stopponi, Silvia, Nilo Pedrazzini, Saskia Peels-Matthey, Barbara McGillivray & Malvina Nissim. 2023. Evaluation of Distributional Semantic Models of Ancient Greek: Preliminary Results and a Road Map for Future Work. Proceedings of the Ancient Language Processing Workshop associated with the 14th International Conference on Recent Advances in Natural Language Processing (RANLP 2023). 49-58. Association for Computational Linguistics (ACL). https://doi.org/10.26615/978-954-452-087-8.2023_006

    Diachronica models

    Training data

    Diorisis corpus (Vatri & McGillivray 2018). Separate models were trained for:

    Classical subcorpus

    Hellenistic subcorpus

    Whole corpus

    Models are named according to the (sub)corpus they are trained on: hel_ or hellenestic is appended to the names of models trained on the Hellenistic subcorpus, clas_ or classical for the Classical subcorpus, and full_ for the whole corpus.

    Models

    Count-based

    Software used: LSCDetection (Kaiser et al. 2021; https://github.com/Garrafao/LSCDetection)

    a. With Positive Pointwise Mutual Information applied (folder PPMI spaces). For each model, a version trained on each subcorpus after removing stopwords is also included (_stopfilt is appended to the model names). Hyperparameter values: window=5, k=1, alpha=0.75.

    b. With both Positive Pointwise Mutual Information and dimensionality reduction with Singular Value Decomposition applied (folder PPMI+SVD spaces). For each model, a version trained on each subcorpus after removing stopwords is also included (_stopfilt is appended to the model names). Hyperparameter values: window=5, dimensions=300, gamma=0.0.
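
The PPMI weighting with the hyperparameters listed above can be sketched roughly as follows. This is not the LSCDetection code. It is a minimal pure-Python illustration that assumes k acts as a shift (log k is subtracted from each PMI value) and alpha as a smoothing exponent on context counts, which is a common reading of these parameter names. The toy corpus and the function names are hypothetical.

```python
import math
from collections import Counter


def cooccurrence_counts(sentences, window=5):
    """Count (word, context) pairs within a symmetric window."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


def ppmi(counts, k=1, alpha=0.75):
    """Shifted positive PMI with smoothed context probabilities (assumed reading of k and alpha)."""
    word_totals = Counter()
    context_totals = Counter()
    for (word, context), n in counts.items():
        word_totals[word] += n
        context_totals[context] += n
    total = sum(counts.values())
    # Raise context counts to the power alpha before normalising.
    smoothed = {c: n ** alpha for c, n in context_totals.items()}
    smoothed_total = sum(smoothed.values())
    scores = {}
    for (word, context), n in counts.items():
        p_wc = n / total
        p_w = word_totals[word] / total
        p_c = smoothed[context] / smoothed_total
        value = math.log(p_wc / (p_w * p_c)) - math.log(k)
        if value > 0:  # keep only positive associations
            scores[(word, context)] = value
    return scores


# Toy placeholder corpus, not Ancient Greek data.
toy_corpus = [["a", "b", "c"], ["a", "b", "d"]]
counts = cooccurrence_counts(toy_corpus, window=5)
scores = ppmi(counts, k=1, alpha=0.75)
```

Only positive values are stored, so the result stays sparse. The PPMI+SVD variant would apply a truncated SVD to this matrix, which is not shown here.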

    Word2Vec

    Software used: CADE (Bianchi et al. 2020; https://github.com/vinid/cade).

    a. Continuous-bag-of-words (CBOW). Hyperparameter values: size=30, siter=5, diter=5, workers=4, sg=0, ns=20.

    b. Skipgram with Negative Sampling (SGNS). Hyperparameter values: size=30, siter=5, diter=5, workers=4, sg=1, ns=20.

    Syntactic word embeddings

    Syntactic word embeddings were also trained on the Ancient Greek subcorpus of the PROIEL treebank (Haug & Jøhndal 2008), the Gorman treebank (Gorman 2020), the PapyGreek treebank (Vierros & Henriksson 2021), the Pedalion treebank (Keersmaekers et al. 2019), and the Ancient Greek Dependency Treebank (Bamman & Crane 2011) largely following the SuperGraph method described in Al-Ghezi & Kurimo (2020) and the Node2Vec architecture (Grover & Leskovec 2016) (see https://github.com/npedrazzini/ancientgreek-syntactic-embeddings for more details). Hyperparameter values: window=1, min_count=1.

    ALP models

    Training data

    Archaic, Classical, and Hellenistic portions of the Diorisis corpus (Vatri & McGillivray 2018) merged, stopwords removed according to the list made by Alessandro Vatri, available at https://figshare.com/articles/dataset/Ancient_Greek_stop_words/9724613.

    Models

    Count-based

    Software used: LSCDetection (Kaiser et al. 2021; https://github.com/Garrafao/LSCDetection)

    a. With Positive Pointwise Mutual Information applied (folder ppmi_alp). Hyperparameter values: window=5, k=1, alpha=0.75. Stopwords were removed from the training set.

    b. With both Positive Pointwise Mutual Information and dimensionality reduction with Singular Value Decomposition applied (folder ppmi_svd_alp). Hyperparameter values: window=5, dimensions=300, gamma=0.0. Stopwords were removed from the training set.

    Word2Vec

    Software used: Gensim library (Řehůřek and Sojka, 2010)

    a. Continuous-bag-of-words (CBOW). Hyperparameter values: size=30, window=5, min_count=5, negative=20, sg=0. Stopwords were removed from the training set.

    b. Skipgram with Negative Sampling (SGNS). Hyperparameter values: size=30, window=5, min_count=5, negative=20, sg=1. Stopwords were removed from the training set.

    References

    Al-Ghezi, Ragheb & Mikko Kurimo. 2020. Graph-based syntactic word embeddings. In Ustalov, Dmitry, Swapna Somasundaran, Alexander Panchenko, Fragkiskos D. Malliaros, Ioana Hulpuș, Peter Jansen & Abhik Jana (eds.), Proceedings of the Graph-based Methods for Natural Language Processing (TextGraphs), 72-78.

    Bamman, D. & Gregory Crane. 2011. The Ancient Greek and Latin dependency treebanks. In Sporleder, Caroline, Antal van den Bosch & Kalliopi Zervanou (eds.), Language Technology for Cultural Heritage. Selected Papers from the LaTeCH [Language Technology for Cultural Heritage] Workshop Series. Theory and Applications of Natural Language Processing, 79-98. Berlin, Heidelberg: Springer.

    Gorman, Vanessa B. 2020. Dependency treebanks of Ancient Greek prose. Journal of Open Humanities Data 6(1).

    Grover, Aditya & Jure Leskovec. 2016. Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ‘16), 855-864.

    Haug, Dag T. T. & Marius L. Jøhndal. 2008. Creating a parallel treebank of the Old Indo-European Bible translations. In Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH), 27–34.

    Keersmaekers, Alek, Wouter Mercelis, Colin Swaelens & Toon Van Hal. 2019. Creating, enriching and valorizing treebanks of Ancient Greek. In Candito, Marie, Kilian Evang, Stephan Oepen & Djamé Seddah (eds.), Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019), 109-117.

    Kaiser, Jens, Sinan Kurtyigit, Serge Kotchourko & Dominik Schlechtweg. 2021. Effects of Pre- and Post-Processing on type-based Embeddings in Lexical Semantic Change Detection. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics.

    Schlechtweg, Dominik, Anna Hätty, Marco del Tredici & Sabine Schulte im Walde. 2019. A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 732-746, Florence, Italy. ACL.

    Vatri, Alessandro & Barbara McGillivray. 2018. The Diorisis Ancient Greek Corpus: Linguistics and Literature. Research Data Journal for the Humanities and Social Sciences 3, 1, 55-65, Available From: Brill https://doi.org/10.1163/24523666-01000013

    Vierros, Marja & Erik Henriksson. 2021. PapyGreek treebanks: a dataset of linguistically annotated Greek documentary papyri. Journal of Open Humanities Data 7.

  6. Data from: Slovenian Word in Context dataset SloWiC 1.0

    • clarin.si
    • live.european-language-grid.eu
    Updated Mar 23, 2023
    Cite
    Timotej Knez; Slavko Žitnik (2023). Slovenian Word in Context dataset SloWiC 1.0 [Dataset]. https://clarin.si/repository/xmlui/handle/11356/1781?locale-attribute=en
    Dataset updated
    Mar 23, 2023
    Authors
    Timotej Knez; Slavko Žitnik
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The SloWiC dataset is a Slovenian dataset for the Word in Context task. Each example in the dataset contains a target word with multiple meanings and two sentences that both contain the target word. Each example is also annotated with a label that shows whether both sentences use the same meaning of the target word. The dataset contains 1808 manually annotated sentence pairs and an additional 13150 automatically annotated pairs to help with training larger models. The dataset is stored in JSON, following the format used in the SuperGLUE version of the Word in Context task (https://super.gluebenchmark.com/).

    Each example contains the following data fields:
    - word: The target word with multiple meanings
    - sentence1: The first sentence containing the target word
    - sentence2: The second sentence containing the target word
    - idx: The index of the example in the dataset
    - label: Label showing if the sentences contain the same meaning of the target word
    - start1: Start of the target word in the first sentence
    - start2: Start of the target word in the second sentence
    - end1: End of the target word in the first sentence
    - end2: End of the target word in the second sentence
    - version: The version of the annotation
    - manual_annotation: Boolean showing if the label was manually annotated
    - group: The group of annotators that labelled the example
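
As a rough illustration of how these fields fit together, the sketch below loads one record and slices the target word out of both sentences. The record is invented for illustration and is not taken from the dataset. The code also assumes that start/end are character offsets with an exclusive end and that label is a boolean, neither of which is confirmed by the description.

```python
import json

# Invented record for illustration only; values are hypothetical.
raw = json.dumps({
    "word": "list",
    "sentence1": "Z drevesa je padel list.",
    "sentence2": "Podpisal je list papirja.",
    "idx": 0,
    "label": False,
    "start1": 19, "end1": 23,
    "start2": 12, "end2": 16,
    "version": 1,
    "manual_annotation": True,
    "group": "A",
})


def target_spans(example):
    """Return the target word as it appears in each sentence.

    Assumes character offsets with an exclusive end index.
    """
    first = example["sentence1"][example["start1"]:example["end1"]]
    second = example["sentence2"][example["start2"]:example["end2"]]
    return first, second


example = json.loads(raw)
spans = target_spans(example)
```

Comparing the extracted spans with the word field is a cheap sanity check on the offset convention before training on the data.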

  7. Data from: Dataset of Slovene word formation trees ArboSloleks 1.0

    • live.european-language-grid.eu
    binary format
    Updated Nov 29, 2024
    Cite
    (2024). Dataset of Slovene word formation trees ArboSloleks 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/23752
    Available download formats: binary format
    Dataset updated
    Nov 29, 2024
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    ArboSloleks is a dataset containing Slovene word formation trees that have been automatically constructed from word relations (http://hdl.handle.net/11356/1986) extracted from Sloleks 2.0 (http://hdl.handle.net/11356/1230). Each word formation tree begins with a root lexeme from Sloleks (e.g. abolicionizem); morphologically related lexemes are then listed in pairs (original lexeme, related lexeme) along with the levels of word formation (e.g. abolicionizem – abolicionist (Level 1); abolicionist – abolicionistka (Level 2)).

    Version 1.0 includes 14,918 word formation trees constructed from 66,360 lexeme pairs. It is available in an ad-hoc .txt format; for information on the structure and how to parse the data, please consult 00README.txt.
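
The pair-and-level structure described above can be sketched as a small tree builder. This does not parse the dataset's .txt format, which is documented in 00README.txt. It only assumes that the pairs have already been read into (original lexeme, related lexeme, level) tuples, and it uses the example lexemes from the description. The function names are hypothetical.

```python
from collections import defaultdict

# Pairs taken from the example in the description: (original, related, level).
pairs = [
    ("abolicionizem", "abolicionist", 1),
    ("abolicionist", "abolicionistka", 2),
]


def build_tree(lexeme_pairs):
    """Map each original lexeme to the lexemes derived from it."""
    children = defaultdict(list)
    for original, related, _level in lexeme_pairs:
        children[original].append(related)
    return children


def render(root, children, depth=0):
    """Return the tree as indented lines, one lexeme per line."""
    lines = ["  " * depth + root]
    for child in children.get(root, []):
        lines.extend(render(child, children, depth + 1))
    return lines


tree = build_tree(pairs)
outline = render("abolicionizem", tree)
```

In this sketch the indentation depth of each lexeme matches the level recorded for its pair.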

  8. Potato Export Data of Word in Last 20 Years

    • kaggle.com
    zip
    Updated May 19, 2023
    Cite
    VishaljiODEDRA (2023). Potato Export Data of Word in Last 20 Years [Dataset]. https://www.kaggle.com/datasets/vishaljiodedra/potato-export-data-of-word-in-last-20-years/code
    Available download formats: zip (57529 bytes)
    Dataset updated
    May 19, 2023
    Authors
    VishaljiODEDRA
    Description

    Dataset

    This dataset was created by VishaljiODEDRA


  9. Word Road Cross Street Data in Lewisburg, TN

    • ownerly.com
    Updated Jan 14, 2022
    Cite
    Ownerly (2022). Word Road Cross Street Data in Lewisburg, TN [Dataset]. https://www.ownerly.com/tn/lewisburg/word-rd-home-details
    Dataset updated
    Jan 14, 2022
    Dataset authored and provided by
    Ownerly
    Area covered
    Lewisburg, Tennessee, Word Road
    Description

    This dataset provides information about the number of properties, residents, and average property values for Word Road cross streets in Lewisburg, TN.

  10. French Wake Words & Voice Commands Speech Data

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). French Wake Words & Voice Commands Speech Data [Dataset]. https://www.futurebeeai.com/dataset/wake-words-and-commands-dataset/wake-words-and-commands-french-france
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    French
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The French Wake Word & Voice Command Dataset is expertly curated to support the training and development of voice-activated systems. This dataset includes a large collection of wake words and command phrases, essential for enabling seamless user interaction with voice assistants and other speech-enabled technologies. It’s designed to ensure accurate wake word detection and voice command recognition, enhancing overall system performance and user experience.

    Speech Data

    This dataset includes 20,000+ audio recordings of wake words and command phrases. Each participant contributed 400 recordings, captured under varied environmental conditions and speaking speeds. The data covers:

    Wake words alone
    Wake words followed by command phrases

    Participant Diversity

    Speakers: 50 native French speakers from the FutureBeeAI community
    Regions: Participants from various provinces of France, ensuring broad coverage of accents and dialects
    Demographics: Ages 18–70; 60% male and 40% female participants

    Recording Details

    Type: Scripted wake words and command phrases
    Duration: 1 to 15 seconds per clip
    Format: WAV, stereo, 16-bit, with sample rates ranging from 16 kHz to 48 kHz
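
A clip can be checked against the recording details listed above with Python's standard wave module. The sketch below is only an illustration: the helper names are hypothetical, and a synthetic one-second silent clip stands in for a real dataset file.

```python
import io
import wave


def describe_wav(source):
    """Read basic format properties from a WAV file or file-like object."""
    with wave.open(source, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "bit_depth": wav.getsampwidth() * 8,
            "sample_rate": rate,
            "duration_s": frames / rate,
        }


def matches_spec(info):
    """Check stereo, 16-bit, 16 kHz to 48 kHz, and 1 to 15 seconds."""
    return (
        info["channels"] == 2
        and info["bit_depth"] == 16
        and 16_000 <= info["sample_rate"] <= 48_000
        and 1 <= info["duration_s"] <= 15
    )


# Synthetic stand-in: one second of stereo 16-bit silence at 16 kHz.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(2)
    out.setsampwidth(2)
    out.setframerate(16_000)
    out.writeframes(b"\x00" * (2 * 2 * 16_000))
buffer.seek(0)

info = describe_wav(buffer)
```

For files on disk, describe_wav can be given a path string instead of the in-memory buffer.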

    Dataset Diversity

    Wake Word Types
    Automobile Wake Words: Hey Mercedes, Hey BMW, Hey Porsche, Hey Volvo, Hey Audi, Hi Genesis, Ok Ford, etc.
    Voice Assistant Wake Words: Hey Siri, Ok Google, Alexa, Hey Cortana, Hi Bixby, Hey Celia, etc.
    Home Appliance Wake Words: Hi LG, Ok LG, Hello Lloyd, and more
    Command Types by Use Case
    Automobile: Play music, check directions, voice search, provide feedback, and more
    Voice Assistant: Ask general questions, make calls, control devices, shopping, manage calendars, and more
    Home Appliances: Control appliances, check status, set reminders/alarms, manage shopping lists, etc.
    Recording Environments
    No background noise
    Background traffic noise
    People talking in the background
    Speaking Pace
    Normal speed
    Fast speed

    This diversity ensures robust training for real-world voice assistant applications.

    Metadata

    Each audio file is accompanied by detailed metadata to support advanced filtering and training needs.

    Participant Metadata: Unique ID, age, gender, region, accent, dialect
    Recording Metadata: Transcript, environment, pace, device used, sample rate, bit depth, file format

    Use Cases & Applications

    Voice Assistant Activation: Train models to accurately detect and trigger based on wake words
    Smart Home Devices: Enable responsive voice control in smart appliances

  11. SignBD-Word: Video-Based Bangla Word-Level Sign Language Dataset

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +1more
    Updated Mar 3, 2024
    Cite
    Ataher Sams (2024). SignBD-Word: Video-Based Bangla Word-Level Sign Language Dataset [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_6779840
    Dataset updated
    Mar 3, 2024
    Dataset provided by
    Bangladesh University of Engineering and Technology
    Authors
    Ataher Sams
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Bangla sign language (BdSL) is a complete and independent natural sign language with its own linguistic characteristics. While video datasets exist for well-known sign languages, there is currently no available dataset for word-level BdSL. In this study, we present a video-based word-level dataset for Bangla sign language, called SignBD-Word, consisting of 6000 sign videos representing 200 unique words. The dataset includes full and upper-body views of the signers, along with 2D body pose information. This dataset can also be used as a benchmark for testing sign video classification algorithms.

    The official train/test split (for both RGB and body pose) can be found at the following link: https://sites.google.com/view/signbd-word/dataset

    This dataset is part of the following paper: A. Sams, A. H. Akash and S. M. M. Rahman, "SignBD-Word: Video-Based Bangla Word-Level Sign Language and Pose Translation," 2023 14th International Conference on Computing Communication and Networking Technologies (ICCCNT), Delhi, India, 2023, pp. 1-7, doi: 10.1109/ICCCNT56998.2023.10306914.

    Download the corresponding paper from this link: https://asnsams.github.io/Publications.html

  12. Tamil (Tamizh) Wikipedia Text Dataset for NLP

    • kaggle.com
    zip
    Updated Nov 12, 2024
    Cite
    Younus_Mohamed (2024). Tamil (Tamizh) Wikipedia Text Dataset for NLP [Dataset]. https://www.kaggle.com/datasets/younusmohamed/tamil-tamizh-wikipedia-articles
    Available download formats: zip (339341289 bytes)
    Dataset updated
    Nov 12, 2024
    Authors
    Younus_Mohamed
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    This dataset is part of a larger mission to transform Tamil into a high-resource language in the field of Natural Language Processing (NLP). As one of the oldest and most culturally rich languages, Tamil has a unique linguistic structure, yet it remains underrepresented in the NLP landscape. This dataset, extracted from Tamil Wikipedia, serves as a foundational resource to support Tamil language processing, text mining, and machine learning applications.

    What’s Included

    - Text Data: This dataset contains over 569,000 articles in raw text form, extracted from Tamil Wikipedia. The collection is ideal for language model training, word frequency analysis, and text mining.

    - Scripts and Processing Tools: Code snippets are provided for processing .bz2 compressed files, generating word counts, and handling data for NLP applications.
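
The dataset's own scripts are not reproduced in this listing. As a separate, minimal sketch of the kind of processing mentioned above, the code below streams a .bz2-compressed text source and counts whitespace-separated tokens. The in-memory sample and the function name are hypothetical, and splitting on whitespace is a simplification that ignores punctuation and Tamil-specific tokenization.

```python
import bz2
import io
from collections import Counter


def word_counts(stream):
    """Count whitespace-separated tokens, reading one line at a time."""
    counts = Counter()
    for line in stream:
        counts.update(line.split())
    return counts


# Tiny synthetic sample compressed in memory; a real run would pass a file path.
sample = "தமிழ் மொழி\nதமிழ் இலக்கியம்\n"
compressed = io.BytesIO(bz2.compress(sample.encode("utf-8")))

with bz2.open(compressed, mode="rt", encoding="utf-8") as stream:
    counts = word_counts(stream)

most_common = counts.most_common(1)
```

Reading line by line keeps memory use flat, which matters for an archive of this size.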

    Why This Dataset?

    Although Tamil has a documented lexicon of over 100,000 words, only a fraction of these are actively used in everyday communication. The largest available Tamil treebank currently holds only 10,000 words, limiting the scope for training accurate language models. This dataset aims to bridge that gap by providing a robust, open-source corpus for researchers, developers, and linguists who want to work on Tamil language technologies.

    How You Can Use This Dataset

    - Language Modeling: Train or fine-tune models like BERT, GPT, or LSTM-based language models for Tamil.
    - Linguistic Research: Analyze Tamil morphology, syntax, and vocabulary usage.
    - Data Augmentation: Use the raw text to generate augmented data for multilingual NLP applications.
    - Word Embeddings and Semantic Analysis: Create embeddings for Tamil words, useful in multilingual setups or standalone applications.

    Let’s Collaborate!

    I believe that advancing Tamil in NLP cannot be a solo effort. Contributions in the form of additional data, annotations, or even new tools for Tamil language processing are welcome! By working together, we can make Tamil a truly high-resource language in NLP.

    License

    This dataset is based on content from Tamil Wikipedia and is shared under the Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA 3.0). Proper attribution to Wikipedia is required when using this data.

  13. United States Imports from Mexico of Typewriters and word processing...

    • tradingeconomics.com
    csv, excel, json, xml
    Updated Feb 10, 2020
    Cite
    TRADING ECONOMICS (2020). United States Imports from Mexico of Typewriters and word processing machines [Dataset]. https://tradingeconomics.com/united-states/imports/mexico/typewriters-word-processing-machines
    Available download formats: excel, csv, json, xml
    Dataset updated
    Feb 10, 2020
    Dataset authored and provided by
    TRADING ECONOMICS
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 1990 - Dec 31, 2025
    Area covered
    United States
    Description

    United States Imports from Mexico of Typewriters and word processing machines was US$27.07 thousand during 2012, according to the United Nations COMTRADE database on international trade. The data, historical chart, and statistics were last updated in October 2025.

  14. Thai Wake Words & Voice Commands Speech Data

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Thai Wake Words & Voice Commands Speech Data [Dataset]. https://www.futurebeeai.com/dataset/wake-words-and-commands-dataset/wake-words-and-commands-thai-thailand
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The Thai Wake Word & Voice Command Dataset is expertly curated to support the training and development of voice-activated systems. This dataset includes a large collection of wake words and command phrases, essential for enabling seamless user interaction with voice assistants and other speech-enabled technologies. It’s designed to ensure accurate wake word detection and voice command recognition, enhancing overall system performance and user experience.

    Speech Data

    This dataset includes 20,000+ audio recordings of wake words and command phrases. Each participant contributed 400 recordings, captured under varied environmental conditions and speaking speeds. The data covers:

    Wake words alone
    Wake words followed by command phrases

    Participant Diversity

    Speakers: 50 native Thai speakers from the FutureBeeAI community
    Regions: Participants from various provinces of Thailand, ensuring broad coverage of accents and dialects
    Demographics: Ages 18–70; 60% male and 40% female participants

    Recording Details

    Type: Scripted wake words and command phrases
    Duration: 1 to 15 seconds per clip
    Format: WAV, stereo, 16-bit, with sample rates ranging from 16 kHz to 48 kHz

    Dataset Diversity

    Wake Word Types
    Automobile Wake Words: Hey Mercedes, Hey BMW, Hey Porsche, Hey Volvo, Hey Audi, Hi Genesis, Ok Ford, etc.
    Voice Assistant Wake Words: Hey Siri, Ok Google, Alexa, Hey Cortana, Hi Bixby, Hey Celia, etc.
    Home Appliance Wake Words: Hi LG, Ok LG, Hello Lloyd, and more
    Command Types by Use Case
    Automobile: Play music, check directions, voice search, provide feedback, and more
    Voice Assistant: Ask general questions, make calls, control devices, shopping, manage calendars, and more
    Home Appliances: Control appliances, check status, set reminders/alarms, manage shopping lists, etc.
    Recording Environments
    No background noise
    Background traffic noise
    People talking in the background
    Speaking Pace
    Normal speed
    Fast speed

    This diversity ensures robust training for real-world voice assistant applications.

    Metadata

    Each audio file is accompanied by detailed metadata to support advanced filtering and training needs.

    Participant Metadata: Unique ID, age, gender, region, accent, dialect
    Recording Metadata: Transcript, environment, pace, device used, sample rate, bit depth, file format

    Use Cases & Applications

    Voice Assistant Activation: Train models to accurately detect and trigger based on wake words
    Smart Home Devices: Enable responsive voice control in smart appliances

  15. Data from: A Benchmark Data Set for Long-Term Monitoring in the eLTER Site...

    • data.europa.eu
    • data.gv.at
    pdf
    Updated Mar 18, 2024
    Cite
    Nationalparks Austria (2024). A Benchmark Data Set for Long-Term Monitoring in the eLTER Site Gesäuse-Johnsbachtal [Dataset]. https://data.europa.eu/data/datasets/040137be-5f18-504f-95f7-d15fabe213fe~~1?locale=bg
    Available download formats: pdf
    Dataset updated
    Mar 18, 2024
    Dataset authored and provided by
    Nationalparks Austria
    Description

    This paper gives an overview of all currently available data sets for the European Long-term Ecosystem Research (eLTER) monitoring site Gesäuse-Johnsbachtal. The site is part of the LTSER platform Eisenwurzen in the Alps of the province of Styria, Austria. It contains both protected (National Park Gesäuse) and non-protected areas (Johnsbachtal). Although the main research focus of the eLTER monitoring site Gesäuse-Johnsbachtal is on inland surface running waters, forests and other wooded land, the eLTER whole system (WAILS) approach was followed in the data selection, systematically screening all available data for its suitability as eLTER Standard Observations (SOs). Thus, data from all system strata were included, incorporating Geosphere, Atmosphere, Hydrosphere, Biosphere and Sociosphere. In the WAILS approach these SOs are key data for a whole system approach towards long-term ecosystem research. Altogether, 54 data sets have been collected for the eLTER monitoring site Gesäuse-Johnsbachtal and included in the Dynamical Ecological Information Management System Site and Data Registry (DEIMS-SDR), which is the eLTER data platform. The presented work provides all these data sets through dedicated data repositories for FAIR use. This paper gives an overview of all compiled data sets and their main properties. Additionally, the available data are evaluated in a concluding gap analysis with regard to the needed observation data according to WAILS, followed by an outlook on how to fill these gaps.

  16. Burundi Imports from Italy of Typewriters and word processing machines

    • tradingeconomics.com
    csv, excel, json, xml
    Updated Sep 30, 2023
    Cite
    TRADING ECONOMICS (2023). Burundi Imports from Italy of Typewriters and word processing machines [Dataset]. https://tradingeconomics.com/burundi/imports/italy/typewriters-word-processing-machines
    Available download formats: excel, xml, csv, json
    Dataset updated
    Sep 30, 2023
    Dataset authored and provided by
    TRADING ECONOMICS
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 1990 - Dec 31, 2025
    Area covered
    Burundi
    Description

    Burundi Imports from Italy of Typewriters and word processing machines was US$17 during 2018, according to the United Nations COMTRADE database on international trade. Burundi Imports from Italy of Typewriters and word processing machines - data, historical chart and statistics - was last updated in December 2025.

  17. Long-term Care Facilities Annual Utilization Data

    • data.ca.gov
    • data.chhs.ca.gov
    • +2 more
    aspx, docx, html, pdf +4
    Updated Oct 28, 2025
    Cite
    Department of Health Care Access and Information (2025). Long-term Care Facilities Annual Utilization Data [Dataset]. https://data.ca.gov/dataset/long-term-care-facilities-annual-utilization-data
    Available download formats: xlsx, pdf, html, xlsm, zip, docx, aspx, xls
    Dataset updated
    Oct 28, 2025
    Dataset authored and provided by
    Department of Health Care Access and Information
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    On an annual basis (calendar year), individual LTC facilities report facility-level data on services capacity, utilization, patients, and capital/equipment expenditures.

  18. Indian sign Language-Real-life Words

    • data.mendeley.com
    Updated Aug 10, 2022
    Cite
    Akansha Tyagi (2022). Indian sign Language-Real-life Words [Dataset]. http://doi.org/10.17632/s6kgb6r3ss.2
    Dataset updated
    Aug 10, 2022
    Authors
    Akansha Tyagi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    India
    Description

    The dataset contains RGB images of hand gestures for twenty ISL words: ‘afraid’, ‘agree’, ‘assistance’, ‘bad’, ‘become’, ‘college’, ‘doctor’, ‘from’, ‘pain’, ‘pray’, ‘secondary’, ‘skin’, ‘small’, ‘specific’, ‘stand’, ‘today’, ‘warn’, ‘which’, ‘work’ and ‘you’. These words are commonly used to convey messages or seek support during medical situations. All the words included in this dataset are static. The images were captured from 8 individuals (6 males and 2 females) aged 9 to 30 years. The dataset contains 18,000 images in jpg format. The images are labelled using the format ISLword_X_YYYY_Z, where:

    • ISLword is one of the twenty words listed above.
    • X is an image number in the range 1 to 900.
    • YYYY is an identifier of the participant and is in the range of 1 to 6.
    • Z is 01 or 02 and identifies the sample number for each subject.

    For example, the file named afraid_1_user1_1 is the image sequence of the first sample of the ISL gesture of the word ‘afraid’ presented by the 1st user.
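
    The naming scheme ISLword_X_YYYY_Z can be split into its fields programmatically. The following is a minimal sketch, not code shipped with the dataset: the names `ISLImageName` and `parse_isl_filename` are hypothetical, and the pattern assumes that the participant field may carry a textual prefix (as in the `user1` of the example file name) and that the sample field may appear with or without a leading zero.

```python
import re
from typing import NamedTuple


class ISLImageName(NamedTuple):
    """Fields of a file name in the ISLword_X_YYYY_Z scheme (hypothetical helper type)."""
    word: str          # ISLword, e.g. "afraid"
    image_number: int  # X, described as 1 to 900
    participant: str   # YYYY, kept as text because the example uses "user1"
    sample: int        # Z, described as 01 or 02


# Assumption: words are lowercase letters, and the participant identifier is an
# optional alphabetic prefix followed by digits ("user1" or "1").
_PATTERN = re.compile(
    r"^(?P<word>[a-z]+)_(?P<image>\d+)_(?P<participant>[A-Za-z]*\d+)_(?P<sample>\d+)$"
)


def parse_isl_filename(name: str) -> ISLImageName:
    """Split a file name such as 'afraid_1_user1_1' into its four fields."""
    # Drop a trailing .jpg extension if one is present.
    stem = name.rsplit(".", 1)[0] if name.lower().endswith(".jpg") else name
    match = _PATTERN.match(stem)
    if match is None:
        raise ValueError(f"not in ISLword_X_YYYY_Z form: {name!r}")
    return ISLImageName(
        word=match["word"],
        image_number=int(match["image"]),
        participant=match["participant"],
        sample=int(match["sample"]),
    )
```

    Calling `parse_isl_filename("afraid_1_user1_1")` yields the word `afraid`, image number 1, participant `user1` and sample 1. The sketch does not enforce the stated ranges, since the description and the example file name do not fully agree on how the participant and sample fields are written.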

  19. Switzerland Imports from United States of Typewriters and word processing...

    • tradingeconomics.com
    csv, excel, json, xml
    Updated Dec 1, 2022
    Cite
    TRADING ECONOMICS (2022). Switzerland Imports from United States of Typewriters and word processing machines [Dataset]. https://tradingeconomics.com/switzerland/imports/united-states/typewriters-word-processing-machines
    Available download formats: json, xml, csv, excel
    Dataset updated
    Dec 1, 2022
    Dataset authored and provided by
    TRADING ECONOMICS
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 1990 - Dec 31, 2025
    Area covered
    Switzerland
    Description

    Switzerland Imports from United States of Typewriters and word processing machines was US$3.82 Thousand during 2016, according to the United Nations COMTRADE database on international trade. Switzerland Imports from United States of Typewriters and word processing machines - data, historical chart and statistics - was last updated in November 2025.

  20. Italy Exports of typewriters and word processing machines to Ukraine

    • tradingeconomics.com
    csv, excel, json, xml
    Updated Nov 9, 2023
    Cite
    TRADING ECONOMICS (2023). Italy Exports of typewriters and word processing machines to Ukraine [Dataset]. https://tradingeconomics.com/italy/exports/ukraine/typewriters-word-processing-machines
    Available download formats: xml, csv, excel, json
    Dataset updated
    Nov 9, 2023
    Dataset authored and provided by
    TRADING ECONOMICS
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 1990 - Dec 31, 2025
    Area covered
    Italy
    Description

    Italy Exports of typewriters and word processing machines to Ukraine was US$849.27 Thousand during 2012, according to the United Nations COMTRADE database on international trade. Italy Exports of typewriters and word processing machines to Ukraine - data, historical chart and statistics - was last updated in November 2025.
