100+ datasets found
  1. English United States Wake Word Dataset

    • shaip.com
    • ny.shaip.com
    Updated Sep 30, 2025
    Cite
    Shaip (2025). English United States Wake Word Dataset [Dataset]. https://www.shaip.com/offerings/speech-data-catalog/english-united-states-wake-word-dataset/
    Explore at:
    Dataset updated
    Sep 30, 2025
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    United States
    Description

    High-Quality English United States Wake Word Dataset for AI & Speech Models. Overview: Title (Language): English United States Language Dataset | Dataset Type: Wake Word | Country: United States | Description: Wake Words…

  2. All Turkish Words Dataset 📃🖊️

    • kaggle.com
    zip
    Updated Mar 14, 2024
    Cite
    Enis Tuna (2024). All Turkish Words Dataset 📃🖊️ [Dataset]. https://www.kaggle.com/datasets/enistuna/all-turkish-words-dataset
    Explore at:
    zip (42,391,799 bytes)
    Dataset updated
    Mar 14, 2024
    Authors
    Enis Tuna
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ALL TURKISH WORDS DATASET

    This dataset contains all the Turkish words I've managed to fetch from the web. The dataset has approximately 7 million lines of Turkish word tokens, each separated by " " so it is easier to read.

    Some words are different variations of the same word, e.g. "araba", "arabada", "arabadan". Feel free to use lemmatization algorithms to reduce the data size.
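As a quick illustration of the suggested deduplication, here is a minimal sketch. The file name and the crude suffix list are hypothetical stand-ins; for real use you would substitute a proper Turkish lemmatizer.

```python
# Minimal sketch: load the word list and collapse inflected variants.
# "turkish_words.txt" is a hypothetical file name with whitespace-
# separated tokens, as described above.
def load_words(path):
    with open(path, encoding="utf-8") as f:
        return [tok for line in f for tok in line.split()]

def crude_stem(word, suffixes=("dan", "den", "da", "de")):
    # Naive suffix stripping as a stand-in for a real Turkish
    # lemmatizer; "arabada" and "arabadan" both reduce to "araba".
    for suf in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

words = ["araba", "arabada", "arabadan"]
print(sorted({crude_stem(w) for w in words}))  # ['araba']
```

A real lemmatizer handles vowel harmony and consonant mutation that this sketch ignores; the point is only that collapsing variants can shrink the 7-million-line list considerably.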

    I believe this dataset could be improved upon. It certainly is not finished. I will update this dataset if I can get my hands on new words in the future.

    My Linkedin: https://www.linkedin.com/in/enistuna/ My Github: https://github.com/enistuna

  3. 10000 Words

    • kaggle.com
    zip
    Updated Dec 11, 2021
    Cite
    Ramiro Aires Melo (2021). 10000 Words [Dataset]. https://www.kaggle.com/datasets/ramiromelo/10000-words/code
    Explore at:
    zip (28,628 bytes)
    Dataset updated
    Dec 11, 2021
    Authors
    Ramiro Aires Melo
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    List of 10000 Words in English

    Source: https://www.mit.edu/~ecprice/wordlist.10000

  4. SignBD-Word: Video-Based Bangla Word-Level Sign Language Dataset

    • zenodo.org
    • data.niaid.nih.gov
    • +1 more
    zip
    Updated Mar 3, 2024
    Cite
    Ataher Sams (2024). SignBD-Word: Video-Based Bangla Word-Level Sign Language Dataset [Dataset]. http://doi.org/10.5281/zenodo.6779843
    Explore at:
    zip
    Dataset updated
    Mar 3, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ataher Sams
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Bangla sign language (BdSL) is a complete and independent natural sign language with its own linguistic characteristics. While video datasets exist for well-known sign languages, there is currently no available dataset for word-level BdSL. In this study, we present a video-based word-level dataset for Bangla sign language, called SignBD-Word, consisting of 6000 sign videos representing 200 unique words. The dataset includes full and upper-body views of the signers, along with 2D body pose information. This dataset can also be used as a benchmark for testing sign video classification algorithms.

    The official train/test split (for both RGB and bodypose) can be found at the following link:
    https://sites.google.com/view/signbd-word/dataset

    This dataset is part of the following paper:
    A. Sams, A. H. Akash and S. M. M. Rahman, "SignBD-Word: Video-Based Bangla Word-Level Sign Language and Pose Translation," 2023 14th International Conference on Computing Communication and Networking Technologies (ICCCNT), Delhi, India, 2023, pp. 1-7, doi: 10.1109/ICCCNT56998.2023.10306914.

    Download the corresponding paper from this link:
    https://asnsams.github.io/Publications.html

  5. Indian English Wake Words & Voice Commands Speech Data

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Indian English Wake Words & Voice Commands Speech Data [Dataset]. https://www.futurebeeai.com/dataset/wake-words-and-commands-dataset/wake-words-and-commands-english-india
    Explore at:
    wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    FutureBeeAI AI Data License Agreement: https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The Indian English Wake Word & Voice Command Dataset is expertly curated to support the training and development of voice-activated systems. This dataset includes a large collection of wake words and command phrases, essential for enabling seamless user interaction with voice assistants and other speech-enabled technologies. It’s designed to ensure accurate wake word detection and voice command recognition, enhancing overall system performance and user experience.

    Speech Data

    This dataset includes 20,000+ audio recordings of wake words and command phrases. Each participant contributed 400 recordings, captured under varied environmental conditions and speaking speeds. The data covers:

    Wake words alone
    Wake words followed by command phrases

    Participant Diversity

    Speakers: 50 native Indian English speakers from the FutureBeeAI community
    Regions: Participants from various regions of India, ensuring broad coverage of accents and dialects
    Demographics: Ages 18–70; 60% male and 40% female participants

    Recording Details

    Type: Scripted wake words and command phrases
    Duration: 1 to 15 seconds per clip
    Format: WAV, stereo, 16-bit, with sample rates ranging from 16 kHz to 48 kHz
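Given the stated spec, a downloaded clip can be sanity-checked with only the Python standard library. This is a sketch; the file name clip.wav is a hypothetical example.

```python
# Sketch: verify a clip matches the stated spec (stereo, 16-bit PCM,
# 16-48 kHz sample rate, 1-15 s duration) using the stdlib wave module.
import wave

def check_clip(path):
    with wave.open(path, "rb") as w:
        return {
            "channels": w.getnchannels(),       # 2 for stereo
            "bit_depth": w.getsampwidth() * 8,  # sample width in bytes -> bits
            "sample_rate": w.getframerate(),
            "seconds": w.getnframes() / w.getframerate(),
        }

def within_spec(info):
    return (info["channels"] == 2
            and info["bit_depth"] == 16
            and 16000 <= info["sample_rate"] <= 48000
            and 1 <= info["seconds"] <= 15)
```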

    Dataset Diversity

    Wake Word Types
    Automobile Wake Words: Hey Mercedes, Hey BMW, Hey Porsche, Hey Volvo, Hey Audi, Hi Genesis, Ok Ford, etc.
    Voice Assistant Wake Words: Hey Siri, Ok Google, Alexa, Hey Cortana, Hi Bixby, Hey Celia, etc.
    Home Appliance Wake Words: Hi LG, Ok LG, Hello Lloyd, and more
    Command Types by Use Case
    Automobile: Play music, check directions, voice search, provide feedback, and more
    Voice Assistant: Ask general questions, make calls, control devices, shopping, manage calendars, and more
    Home Appliances: Control appliances, check status, set reminders/alarms, manage shopping lists, etc.
    Recording Environments
    No background noise
    Background traffic noise
    People talking in the background
    Speaking Pace
    Normal speed
    Fast speed

    This diversity ensures robust training for real-world voice assistant applications.

    Metadata

    Each audio file is accompanied by detailed metadata to support advanced filtering and training needs.

    Participant Metadata: Unique ID, age, gender, region, accent, dialect
    Recording Metadata: Transcript, environment, pace, device used, sample rate, bit depth, file format
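As a sketch of the filtering this metadata enables, assuming a hypothetical metadata.csv export with columns named after the fields above (the actual delivery format and column names may differ):

```python
# Sketch: select recordings by metadata fields, e.g. clean recordings
# spoken at normal pace. "metadata.csv" and its column names
# ("environment", "pace") are assumptions for illustration.
import csv

def select(path, environment="no background noise", pace="normal"):
    with open(path, newline="", encoding="utf-8") as f:
        return [row for row in csv.DictReader(f)
                if row["environment"] == environment and row["pace"] == pace]
```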

    Use Cases & Applications

    Voice Assistant Activation: Train models to accurately detect and trigger based on wake words
    Smart Home Devices: Enable responsive voice control in smart appliances

  6. APAC Data Suite | 4M+ Translations | 1.6M+ Words | Natural Language...

    • datarade.ai
    Updated Oct 1, 2025
    Cite
    Oxford Languages (2025). APAC Data Suite | 4M+ Translations | 1.6M+ Words | Natural Language Processing Data | Dictionary Display | Translations | APAC Coverage [Dataset]. https://datarade.ai/data-products/apac-data-suite-4m-translations-1-6m-words-natural-la-oxford-languages
    Explore at:
    .json, .xml, .csv, .txt, .mp3, .wav
    Dataset updated
    Oct 1, 2025
    Dataset authored and provided by
    Oxford Languages (https://lexico.com/es)
    Area covered
    Marshall Islands, Papua New Guinea, China, Kiribati, Thailand, Taiwan, Philippines, Vietnam, Fiji, Australia
    Description

    APAC Data Suite offers high-quality language datasets. Ideal for NLP, AI, LLMs, translation, and education, it combines linguistic depth and regional authenticity to power scalable, multilingual language technologies.

    Discover our expertly curated language datasets in the APAC Data Suite. Compiled and annotated by language and linguistic experts, this suite offers high-quality resources tailored to your needs. This suite includes:

    • Monolingual and Bilingual Dictionary Data
      Featuring headwords, definitions, word senses, part-of-speech (POS) tags, and semantic metadata.

    • Semi-bilingual Dictionary Data
      Each entry features a headword with definitions and/or usage examples in Language 1, followed by a translation of the headword and/or definition in Language 2, enabling efficient cross-lingual mapping.

    • Sentence Corpora
      Curated examples of real-world usage with contextual annotations for training and evaluation.

    • Synonyms & Antonyms
      Lexical relations to support semantic search, paraphrasing, and language understanding.

    • Audio Data
      Native speaker recordings for speech recognition, TTS, and pronunciation modeling.

    • Word Lists
      Frequency-ranked and thematically grouped lists for vocabulary training and NLP tasks. The word list data can cover one language or two, such as Tamil words with English translations.

    Each language may contain one or more types of language data. Depending on the dataset, we can provide these in formats such as XML, JSON, TXT, XLSX, CSV, WAV, MP3, and more. Delivery is currently available via email (link-based sharing) or REST API.

    If you require more information about a specific dataset, please contact us at Growth.OL@oup.com.

    Below are the different types of datasets available for each language, along with their key features and approximate metrics. If you have any questions or require additional assistance, please don't hesitate to contact us.

    1. Assamese Semi-bilingual Dictionary Data: 72,200 words | 83,700 senses | 83,800 translations.

    2. Bengali Bilingual Dictionary Data: 161,400 translations | 71,600 senses.

    3. Bengali Semi-bilingual Dictionary Data: 28,300 words | 37,700 senses | 62,300 translations.

    4. British English Monolingual Dictionary Data: 146,000 words | 230,000 senses | 149,000 example sentences.

    5. British English Synonyms and Antonyms Data: 600,000 synonyms | 22,000 antonyms.

    6. British English Pronunciations with Audio: 250,000 transcriptions (IPA) | 180,000 audio files.

    7. French Monolingual Dictionary Data: 42,000 words | 56,000 senses | 43,000 example sentences.

    8. French Bilingual Dictionary Data: 380,000 translations | 199,000 senses | 146,000 example translations.

    9. Gujarati Monolingual Dictionary Data: 91,800 words | 131,500 senses.

    10. Gujarati Bilingual Dictionary Data: 171,800 translations | 158,200 senses.

    11. Hindi Monolingual Dictionary Data: 46,200 words | 112,700 senses.

    12. Hindi Bilingual Dictionary Data: 263,400 translations | 208,100 senses | 18,600 example translations.

    13. Hindi Synonyms and Antonyms Dictionary Data: 478,100 synonyms | 18,800 antonyms.

    14. Hindi Sentence Data: 216,000 sentences.

    15. Hindi Audio data: 68,000 audio files.

    16. Indonesian Bilingual Dictionary Data: 36,000 translations | 23,700 senses | 12,700 example translations.

    17. Indonesian Monolingual Dictionary Data: 120,000 words | 140,000 senses | 30,000 example sentences.

    18. Korean Monolingual Dictionary Data: 596,100 words | 386,600 senses | 91,700 example sentences.

    19. Korean Bilingual Dictionary Data: 952,500 translations | 449,700 senses | 227,800 example translations.

    20. Mandarin Chinese (simplified) Monolingual Dictionary Data: 81,300 words | 162,400 senses | 80,700 example sentences.

    21. Mandarin Chinese (traditional) Monolingual Dictionary Data: 60,100 words | 144,700 senses | 29,900 example sentences.

    22. Mandarin Chinese (simplified) Bilingual Dictionary Data: 367,600 translations | 204,500 senses | 150,900 example translations.

    23. Mandarin Chinese (traditional) Bilingual Dictionary Data: 215,600 translations | 202,800 senses | 149,700 example translations.

    24. Mandarin Chinese (simplified) Synonyms and Antonyms Data: 3,800 synonyms | 3,180 antonyms.

    25. Malay Bilingual Dictionary Data: 106,100 translations | 53,500 senses.

    26. Malay Monolingual Dictionary Data: 39,800 words | 40,600 senses | 21,100 example sentences.

    27. Malayalam Monolingual Dictionary Data: 91,300 words | 159,200 senses.

    28. Malayalam Bilingual Word List Data: 76,200 translation pairs.

    29. Marathi Bilingual Dictionary Data: 45,400 translations | 32,800 senses | 3,600 example translations.

    30. Nepali Bilingual Dictionary Data: 350,000 translations | 264,200 senses | 1,300 example translations.

    31. New Zealand English Monolingual Dictionary Data: 100,000 words.

    32. Odia Semi-bilingual Dictionary Data: 30,700 words | 69,300 senses | 69,200 translations.

    33. Punjabi ...

  7. Indian sign Language-Real-life Words

    • data.mendeley.com
    Updated Aug 10, 2022
    Cite
    Akansha Tyagi (2022). Indian sign Language-Real-life Words [Dataset]. http://doi.org/10.17632/s6kgb6r3ss.2
    Explore at:
    Dataset updated
    Aug 10, 2022
    Authors
    Akansha Tyagi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    India
    Description

    The dataset contains the RGB images of hand gestures of twenty ISL words, namely ‘afraid’, ‘agree’, ‘assistance’, ‘bad’, ‘become’, ‘college’, ‘doctor’, ‘from’, ‘pain’, ‘pray’, ‘secondary’, ‘skin’, ‘small’, ‘specific’, ‘stand’, ‘today’, ‘warn’, ‘which’, ‘work’, ‘you’, which are commonly used to convey messages or seek support during medical situations. All the words included in this dataset are static. The images were captured from 8 individuals, including 6 males and 2 females, in the age group of 9 to 30 years. The dataset contains 18,000 images in jpg format. The images are labelled using the format ISLword_X_YYYY_Z, where:

    • ISLword corresponds to one of the twenty words listed above.
    • X is an image number in the range 1 to 900.
    • YYYY is an identifier of the participant, in the range 1 to 6.
    • Z is 01 or 02 and identifies the sample number for each subject.

    For example, the file named afraid_1_user1_1 is the image sequence of the first sample of the ISL gesture of the word ‘afraid’ presented by the 1st user.
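The naming scheme can be parsed mechanically. A minimal sketch, based only on the documented ISLword_X_YYYY_Z format and the example file name given above:

```python
# Sketch: split an ISL file name such as "afraid_1_user1_1" into its
# word, image number, participant id, and sample number.
import re

PATTERN = re.compile(
    r"^(?P<word>[a-z]+)_(?P<image>\d+)_user(?P<user>\d+)_(?P<sample>\d+)$"
)

def parse_label(name):
    m = PATTERN.match(name)
    if m is None:
        raise ValueError(f"unexpected file name: {name}")
    d = m.groupdict()
    return d["word"], int(d["image"]), int(d["user"]), int(d["sample"])

print(parse_label("afraid_1_user1_1"))  # ('afraid', 1, 1, 1)
```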

  8. Data from: Ancient Greek language models

    • data.niaid.nih.gov
    Updated Apr 29, 2024
    Cite
    Stopponi; Pedrazzini; Peels-Matthey; McGillivray; Nissim (2024). Ancient Greek language models [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8369515
    Explore at:
    Dataset updated
    Apr 29, 2024
    Authors
    Stopponi; Pedrazzini; Peels-Matthey; McGillivray; Nissim
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this repository, we release a series of vector space models of Ancient Greek, trained following different architectures and with different hyperparameter values.

    Below is a breakdown of all the models released, with an indication of the training method and hyperparameters. The models are split into ‘Diachronica’ and ‘ALP’ models, according to the published paper they are associated with.

    [Diachronica:] Stopponi, Silvia, Nilo Pedrazzini, Saskia Peels-Matthey, Barbara McGillivray & Malvina Nissim. Forthcoming. Natural Language Processing for Ancient Greek: Design, Advantages, and Challenges of Language Models, Diachronica.

    [ALP:] Stopponi, Silvia, Nilo Pedrazzini, Saskia Peels-Matthey, Barbara McGillivray & Malvina Nissim. 2023. Evaluation of Distributional Semantic Models of Ancient Greek: Preliminary Results and a Road Map for Future Work. Proceedings of the Ancient Language Processing Workshop associated with the 14th International Conference on Recent Advances in Natural Language Processing (RANLP 2023). 49-58. Association for Computational Linguistics (ACL). https://doi.org/10.26615/978-954-452-087-8.2023_006

    Diachronica models

    Training data

    Diorisis corpus (Vatri & McGillivray 2018). Separate models were trained for:

    Classical subcorpus

    Hellenistic subcorpus

    Whole corpus

    Models are named according to the (sub)corpus they are trained on (i.e. hel_ or hellenistic is appended to the name of the models trained on the Hellenistic subcorpus, clas_ or classical for the Classical subcorpus, and full_ for the whole corpus).

    Models

    Count-based

    Software used: LSCDetection (Kaiser et al. 2021; https://github.com/Garrafao/LSCDetection)

    a. With Positive Pointwise Mutual Information applied (folder PPMI spaces). For each model, a version trained on each subcorpus after removing stopwords is also included (_stopfilt is appended to the model names). Hyperparameter values: window=5, k=1, alpha=0.75.

    b. With both Positive Pointwise Mutual Information and dimensionality reduction with Singular Value Decomposition applied (folder PPMI+SVD spaces). For each model, a version trained on each subcorpus after removing stopwords is also included (_stopfilt is appended to the model names). Hyperparameter values: window=5, dimensions=300, gamma=0.0.
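For readers unfamiliar with the count-based setup, a toy PPMI computation with the alpha=0.75 context-distribution smoothing named above might look like this. The 2x2 co-occurrence matrix is invented for illustration and is unrelated to the Diorisis corpus.

```python
# Sketch of Positive PMI with smoothed context probabilities:
# PPMI(w, c) = max(0, log(P(w, c) / (P(w) * P_alpha(c)))).
import numpy as np

def ppmi(counts, alpha=0.75):
    total = counts.sum()
    p_wc = counts / total                     # joint probabilities
    p_w = counts.sum(axis=1) / total          # word marginals
    p_c = counts.sum(axis=0) ** alpha         # smoothed context counts
    p_c = p_c / p_c.sum()                     # normalized context dist.
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / np.outer(p_w, p_c))
    return np.maximum(pmi, 0.0)               # keep only positive PMI

counts = np.array([[10.0, 0.0], [2.0, 8.0]])  # toy word x context counts
print(ppmi(counts))
```

Zero co-occurrences give log(0) = -inf, which the final maximum clamps to 0, matching the "positive" part of PPMI.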

    Word2Vec

    Software used: CADE (Bianchi et al. 2020; https://github.com/vinid/cade).

    a. Continuous-bag-of-words (CBOW). Hyperparameter values: size=30, siter=5, diter=5, workers=4, sg=0, ns=20.

    b. Skipgram with Negative Sampling (SGNS). Hyperparameter values: size=30, siter=5, diter=5, workers=4, sg=1, ns=20.

    Syntactic word embeddings

    Syntactic word embeddings were also trained on the Ancient Greek subcorpus of the PROIEL treebank (Haug & Jøhndal 2008), the Gorman treebank (Gorman 2020), the PapyGreek treebank (Vierros & Henriksson 2021), the Pedalion treebank (Keersmaekers et al. 2019), and the Ancient Greek Dependency Treebank (Bamman & Crane 2011) largely following the SuperGraph method described in Al-Ghezi & Kurimo (2020) and the Node2Vec architecture (Grover & Leskovec 2016) (see https://github.com/npedrazzini/ancientgreek-syntactic-embeddings for more details). Hyperparameter values: window=1, min_count=1.

    ALP models

    Training data

    Archaic, Classical, and Hellenistic portions of the Diorisis corpus (Vatri & McGillivray 2018) merged, stopwords removed according to the list made by Alessandro Vatri, available at https://figshare.com/articles/dataset/Ancient_Greek_stop_words/9724613.

    Models

    Count-based

    Software used: LSCDetection (Kaiser et al. 2021; https://github.com/Garrafao/LSCDetection)

    a. With Positive Pointwise Mutual Information applied (folder ppmi_alp). Hyperparameter values: window=5, k=1, alpha=0.75. Stopwords were removed from the training set.

    b. With both Positive Pointwise Mutual Information and dimensionality reduction with Singular Value Decomposition applied (folder ppmi_svd_alp). Hyperparameter values: window=5, dimensions=300, gamma=0.0. Stopwords were removed from the training set.

    Word2Vec

    Software used: Gensim library (Řehůřek and Sojka, 2010)

    a. Continuous-bag-of-words (CBOW). Hyperparameter values: size=30, window=5, min_count=5, negative=20, sg=0. Stopwords were removed from the training set.

    b. Skipgram with Negative Sampling (SGNS). Hyperparameter values: size=30, window=5, min_count=5, negative=20, sg=1. Stopwords were removed from the training set.

    References

    Al-Ghezi, Ragheb & Mikko Kurimo. 2020. Graph-based syntactic word embeddings. In Ustalov, Dmitry, Swapna Somasundaran, Alexander Panchenko, Fragkiskos D. Malliaros, Ioana Hulpuș, Peter Jansen & Abhik Jana (eds.), Proceedings of the Graph-based Methods for Natural Language Processing (TextGraphs), 72-78.

    Bamman, D. & Gregory Crane. 2011. The Ancient Greek and Latin dependency treebanks. In Sporleder, Caroline, Antal van den Bosch & Kalliopi Zervanou (eds.), Language Technology for Cultural Heritage. Selected Papers from the LaTeCH [Language Technology for Cultural Heritage] Workshop Series. Theory and Applications of Natural Language Processing, 79-98. Berlin, Heidelberg: Springer.

    Gorman, Vanessa B. 2020. Dependency treebanks of Ancient Greek prose. Journal of Open Humanities Data 6(1).

    Grover, Aditya & Jure Leskovec. 2016. Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ‘16), 855-864.

    Haug, Dag T. T. & Marius L. Jøhndal. 2008. Creating a parallel treebank of the Old Indo-European Bible translations. In Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH), 27–34.

    Keersmaekers, Alek, Wouter Mercelis, Colin Swaelens & Toon Van Hal. 2019. Creating, enriching and valorizing treebanks of Ancient Greek. In Candito, Marie, Kilian Evang, Stephan Oepen & Djamé Seddah (eds.), Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019), 109-117.

    Kaiser, Jens, Sinan Kurtyigit, Serge Kotchourko & Dominik Schlechtweg. 2021. Effects of Pre- and Post-Processing on type-based Embeddings in Lexical Semantic Change Detection. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics.

    Schlechtweg, Dominik, Anna Hätty, Marco del Tredici & Sabine Schulte im Walde. 2019. A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 732-746, Florence, Italy. ACL.

    Vatri, Alessandro & Barbara McGillivray. 2018. The Diorisis Ancient Greek Corpus: Linguistics and Literature. Research Data Journal for the Humanities and Social Sciences 3, 1, 55-65, Available From: Brill https://doi.org/10.1163/24523666-01000013

    Vierros, Marja & Erik Henriksson. 2021. PapyGreek treebanks: a dataset of linguistically annotated Greek documentary papyri. Journal of Open Humanities Data 7.

  9. Wake Word Northeast English Dataset

    • shaip.com
    • ny.shaip.com
    Updated Sep 30, 2025
    Cite
    Shaip (2025). Wake Word Northeast English Dataset [Dataset]. https://www.shaip.com/offerings/speech-data-catalog/wake-word-northeast-english-dataset/
    Explore at:
    Dataset updated
    Sep 30, 2025
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    High-Quality Northeast English Wake Word Dataset for AI & Speech Models. Overview: Title (Language): Northeast English Language Dataset | Dataset Type: Wake Word | Country: United States | Description: Wake Words / Voice Command…

  10. Language Dataset

    • data.ncl.ac.uk
    json
    Updated Nov 30, 2023
    Cite
    David Towers; Rob Geada; Amir Atapour-Abarghouei; Andrew Stephen McGough (2023). Language Dataset [Dataset]. http://doi.org/10.25405/data.ncl.24574729.v1
    Explore at:
    json
    Dataset updated
    Nov 30, 2023
    Dataset provided by
    Newcastle University
    Authors
    David Towers; Rob Geada; Amir Atapour-Abarghouei; Andrew Stephen McGough
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset containing the images and labels for the Language data used in the CVPR NAS workshop Unseen-data challenge under the codename "LaMelo".

    The Language dataset is a constructed dataset using words from aspell dictionaries. The intention of this dataset is to require machine learning models to not only perform image classification but also linguistic analysis to figure out which letter frequency is associated with each language. For each Language image we selected four six-letter words using the standard Latin alphabet and removed any words with letters that use diacritics (such as é or ü) or include 'y' or 'z'. We encode these words on a graph with one axis representing the index of the 24-character string (the four words joined together) and the other representing the letter (going A-X).

    The data is in a channels-first format with a shape of (n, 1, 24, 24), where n is the number of samples in the corresponding set (50,000 for training, 10,000 for validation, and 10,000 for testing). There are ten classes in the dataset, with 7,000 examples of each, distributed evenly between the three subsets. The ten classes and corresponding numerical labels are: English: 0, Dutch: 1, German: 2, Spanish: 3, French: 4, Portuguese: 5, Swahili: 6, Zulu: 7, Finnish: 8, Swedish: 9.
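The encoding described above can be reconstructed as a one-hot grid. This is a sketch of the scheme, not a reproduction of the released files (exact pixel values may differ); the four words are arbitrary examples.

```python
# Sketch: four six-letter words -> one 24-character string -> a
# (1, 24, 24) channels-first image, rows = character position,
# columns = letter (A-X; 'y' and 'z' are excluded per the dataset).
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwx"  # 24 letters, no diacritics/y/z

def encode(words):
    s = "".join(words)
    assert len(words) == 4 and len(s) == 24
    img = np.zeros((1, 24, 24), dtype=np.float32)
    for pos, ch in enumerate(s):
        img[0, pos, ALPHABET.index(ch)] = 1.0
    return img

img = encode(["jungle", "turtle", "stream", "branch"])
print(img.shape, int(img.sum()))  # (1, 24, 24) 24
```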

  11. SlangTrack (ST) Dataset

    • zenodo.org
    Updated Feb 5, 2025
    Cite
    Afnan Aloraini (2025). SlangTrack (ST) Dataset [Dataset]. http://doi.org/10.5281/zenodo.14744510
    Explore at:
    Dataset updated
    Feb 5, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Afnan Aloraini
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Oct 15, 2022
    Description

    The SlangTrack (ST) Dataset is a novel, meticulously curated resource aimed at addressing the complexities of slang detection in natural language processing. This dataset uniquely emphasizes words that exhibit both slang and non-slang contexts, enabling a binary classification system to distinguish between these dual senses. By providing comprehensive examples for each usage, the dataset supports fine-grained linguistic and computational analysis, catering to both researchers and practitioners in NLP.

    Key Features:

    • Unique Words: 48,508
    • Total Tokens: 310,170
    • Average Post Length: 34.6 words
    • Average Sentences per Post: 3.74

    These features ensure a robust contextual framework for accurate slang detection and semantic analysis.

    Significance of the Dataset:

    1. Unified Annotation: The dataset offers consistent annotations across the corpus, achieving high Inter-Annotator Agreement (IAA) to ensure reliability and accuracy.
    2. Addressing Limitations: It overcomes the constraints of previous corpora, which often lacked differentiation between slang and non-slang meanings or did not provide illustrative examples for each sense.
    3. Comprehensive Coverage: Unlike earlier corpora that primarily supported dictionary-style entries or paraphrasing tasks, this dataset includes rich contextual examples from historical (COHA) and contemporary (Twitter) sources, along with multiple senses for each target word.
    4. Focus on Dual Meanings: The dataset emphasizes words with at least one slang and one dominant non-slang sense, facilitating the exploration of nuanced linguistic patterns.
    5. Applicability to Research: By covering both historical and modern contexts, the dataset provides a platform for exploring slang's semantic evolution and its impact on natural language processing.

    Target Word Selection:

    The target words were carefully chosen to align with the goals of fine-grained analysis. Each word in the dataset:

    • Coexists in the slang SD wordlist and the Corpus of Historical American English (COHA).
    • Has between 2 and 8 distinct senses, including both slang and non-slang meanings.
    • Was cross-referenced using trusted resources such as:
      • Green's Dictionary of Slang
      • Urban Dictionary
      • Online Slang Dictionary
      • Oxford English Dictionary
    • Features at least one slang and one dominant non-slang sense.
    • Excludes proper nouns to maintain linguistic relevance and focus.

    Data Sources and Collection:

    1. Corpus of Historical American English (COHA):

    • Historical examples were extracted from the cleaned version of COHA (CCOHA).
    • Data spans the years 1980–2010, capturing the evolution of target words over time.

    2. Twitter:

    • Twitter was selected for its dynamic, real-time communication, offering rich examples of contemporary slang and informal language.
    • For each target word, 1,000 examples were collected from tweets posted between 2010–2020, reflecting modern usage.

    Dataset Scope:

    The final dataset comprises ten target words, meeting strict selection criteria to ensure linguistic and computational relevance. Each word:

    • Demonstrates semantic diversity, balancing slang and non-slang senses.
    • Offers robust representation across both historical (COHA) and modern (Twitter) contexts.

    The SlangTrack Dataset serves as a public resource, fostering research in slang detection, semantic evolution, and informal language processing. Combining historical and contemporary sources provides a comprehensive platform for exploring the nuances of slang in natural language.

    Data Statistics:

    The table below provides a breakdown of the total number of instances categorized as slang or non-slang for each target keyword in the SlangTrack (ST) Dataset.

    Keyword     Non-slang   Slang   Total
    BMW         1,083       14      1,097
    Brownie     582         382     964
    Chronic     1,415       270     1,685
    Climber     520         122     642
    Cucumber    972         79      1,051
    Eat         2,462       561     3,023
    Germ        566         249     815
    Mammy       894         154     1,048
    Rodent      718         349     1,067
    Salty       543         727     1,270
    Total       9,755       2,907   12,662

    Sample Texts from the Dataset:

    The table below provides examples of sentences from the SlangTrack (ST) Dataset, showcasing both slang and non-slang usage of the target keywords. Each example highlights the context in which the target word is used and its corresponding category.

    | Example Sentence | Target Keyword | Category |
    |---|---|---|
    | Today, I heard, for the first time, a short scientific talk given by a man dressed as a rodent...! An interesting experience. | Rodent | Slang |
    | On the other. Mr. Taylor took food requests and, with a stern look in his eye, told the children to stay seated until he and his wife returned with the food. The children nodded attentively. After the adults left, the children seemed to relax, talking more freely and playing with one another. When the parents returned, the kids straightened up again, received their food, and began to eat, displaying quiet and gracious manners all the while. | Eat | Non-Slang |
    | Greater than this one that washed between the shores of Florida and Mexico. He balanced between the breakers and the turning tide. Small particles of sand churned in the waters around him, and a small fish swam against his leg, a momentary dark streak that vanished in the surf. He began to swim. Buoyant in the salty water, he swam a hundred meters to a jetty that sent small whirlpools around its barnacle rough pilings. | Salty | Non-Slang |
    | Mom was totally hating on my dance moves. She's so salty. | Salty | Slang |

    **Licenses**

    The SlangTrack (ST) dataset is built using a combination of licensed and publicly available corpora. To ensure compliance with licensing agreements, all data has been extensively preprocessed, modified, and anonymized while preserving linguistic integrity. The dataset has been randomized and structured to support research in slang detection without violating the terms of the original sources.

    The **original authors and data providers retain their respective rights**, where applicable. We encourage users to **review the licensing agreements** included with the dataset to understand any potential usage limitations. While some source corpora, such as **COHA, require a paid license and restrict redistribution**, our processed dataset is **legally shareable and publicly available** for **research and development purposes**.

  12. Wake Word US Spanish Dataset

    • shaip.com
    Updated Oct 13, 2023
    Cite
    Shaip (2023). Wake Word US Spanish Dataset [Dataset]. https://www.shaip.com/offerings/speech-data-catalog/wake-word-us-spanish-dataset/
    Explore at:
    Dataset updated
    Oct 13, 2023
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Home | US Spanish Dataset | High-Quality US Spanish Wake Word Dataset for AI & Speech Models | Contact Us. Overview: Title: US Spanish Language Dataset; Dataset Type: Wake Word; Description: Wake Words / Voice Command / Trigger Word /…

  13. EMEA Data Suite | 3.3M Translations | 1.9M Words | 22 Languages | Natural...

    • datarade.ai
    Updated Aug 8, 2025
    Cite
    Oxford Languages (2025). EMEA Data Suite | 3.3M Translations | 1.9M Words | 22 Languages | Natural Language Processing (NLP) Data | Translation Data | TTS | EMEA Coverage [Dataset]. https://datarade.ai/data-products/emea-data-suite-3-3m-translations-1-9m-words-23-languag-oxford-languages
    Explore at:
    .json, .xml, .csv, .xls, .txt, .mp3, .wavAvailable download formats
    Dataset updated
    Aug 8, 2025
    Dataset authored and provided by
    Oxford Languageshttps://lexico.com/es
    Area covered
    Uganda, Seychelles, Israel, Burundi, Syrian Arab Republic, Central African Republic, Spain, Bosnia and Herzegovina, Romania, Morocco
    Description

    EMEA Data Suite offers 43 high-quality language datasets covering 23 languages spoken in the region. Ideal for NLP, AI, LLMs, translation, and education, it combines linguistic depth and regional authenticity to power scalable, multilingual language technologies.

    Discover our expertly curated language datasets in the EMEA Data Suite. Compiled and annotated by language and linguistic experts, this suite offers high-quality resources tailored to your needs. This suite includes:

    • Monolingual and Bilingual Dictionary Data Featuring headwords, definitions, word senses, part-of-speech (POS) tags, and semantic metadata.

    • Sentence Corpora Curated examples of real-world usage with contextual annotations for training and evaluation.

    • Synonyms & Antonyms Lexical relations to support semantic search, paraphrasing, and language understanding.

    • Audio Data Native speaker recordings for speech recognition, TTS, and pronunciation modeling.

    • Word Lists Frequency-ranked and thematically grouped lists for vocabulary training and NLP tasks.

    Each language may contain one or more types of language data. Depending on the dataset, we can provide these in formats such as XML, JSON, TXT, XLSX, CSV, WAV, MP3, and more. Delivery is currently available via email (link-based sharing) or REST API.

    If you require more information about a specific dataset, please contact us Growth.OL@oup.com.

    Below are the different types of datasets available for each language, along with their key features and approximate metrics. If you have any questions or require additional assistance, please don't hesitate to contact us.

    1. Arabic Monolingual Dictionary Data: 66,500 words | 98,700 senses | 70,000 example sentences.

    2. Arabic Bilingual Dictionary Data: 116,600 translations | 88,300 senses | 74,700 example translations.

    3. Arabic Synonyms and Antonyms Data: 55,100 synonyms.

    4. British English Monolingual Dictionary Data: 146,000 words | 230,000 senses | 149,000 example sentences.

    5. British English Synonyms and Antonyms Data: 600,000 synonyms | 22,000 antonyms

    6. British English Pronunciations with Audio: 250,000 transcriptions (IPA) | 180,000 audio files.

    7. Catalan Monolingual Dictionary Data: 29,800 words | 47,400 senses | 25,600 example sentences.

    8. Catalan Bilingual Dictionary Data: 76,800 translations | 109,350 senses | 26,900 example translations.

    9. Croatian Monolingual Dictionary Data: 129,600 words | 164,760 senses | 34,630 example sentences.

    10. Croatian Bilingual Dictionary Data: 100,700 translations | 91,600 senses | 10,180 example translations.

    11. Czech Bilingual Dictionary Data: 426,473 translations | 199,800 senses | 95,000 example translations.

    12. Danish Bilingual Dictionary Data: 129,000 translations | 91,500 senses | 23,000 example translations.

    13. French Monolingual Dictionary Data: 42,000 words | 56,000 senses | 43,000 example sentences.

    14. French Bilingual Dictionary Data: 380,000 translations | 199,000 senses | 146,000 example translations.

    15. German Monolingual Dictionary Data: 85,500 words | 78,000 senses | 55,000 example sentences.

    16. German Bilingual Dictionary Data: 393,000 translations | 207,500 senses | 129,500 example translations.

    17. German Word List Data: 338,000 wordforms.

    18. Greek Monolingual Dictionary Data: 47,800 words | 46,309 senses | 2,388 example sentences.

    19. Hebrew Monolingual Dictionary Data: 85,600 words | 104,100 senses | 94,000 example sentences.

    20. Hebrew Bilingual Dictionary Data: 67,000 translations | 49,000 senses | 19,500 example translations.

    21. Hungarian Monolingual Dictionary Data: 90,500 words | 155,300 senses | 42,500 example sentences.

    22. Italian Monolingual Dictionary Data: 102,500 words | 231,580 senses | 48,200 example sentences.

    23. Italian Bilingual Dictionary Data: 492,000 translations | 251,600 senses | 157,100 example translations.

    24. Italian Synonyms and Antonyms Data: 197,000 synonyms | 62,000 antonyms.

    25. Latvian Monolingual Dictionary Data: 36,000 words | 43,600 senses | 73,600 example sentences.

    26. Polish Bilingual Dictionary Data: 287,400 translations | 216,900 senses | 19,800 example translations.

    27. Portuguese Monolingual Dictionary Data: 143,600 words | 285,500 senses | 69,300 example sentences.

    28. Portuguese Bilingual Dictionary Data: 300,000 translations | 158,000 senses | 117,800 example translations.

    29. Portuguese Synonyms and Antonyms Data: 196,000 synonyms | 90,000 antonyms.

    30. Romanian Monolingual Dictionary Data: 66,900 words | 113,500 senses | 2,700 example sentences.

    31. Romanian Bilingual Dictionary Data: 77,500 translations | 63,870 senses | 33,730 example translations.

    32. Russian Monolingual Dictionary Data: 65,950 words | 57,500 senses | 51,900 example sentences.

    33. Russian Bilingual Dictionary Data: 230,100 translations | 122,200 senses | 69,600 example translations.

    34. Slovak Bilingual Dictionary Dat...

  14. Stop Words Hinglish for NLP

    • kaggle.com
    zip
    Updated Mar 13, 2024
    Cite
    Pranam Shetty (2024). Stop Words Hinglish for NLP [Dataset]. https://www.kaggle.com/datasets/prxshetty/stop-words-hinglish/suggestions
    Explore at:
    zip(2796 bytes)Available download formats
    Dataset updated
    Mar 13, 2024
    Authors
    Pranam Shetty
    Description

    The provided list contains common stop words used in natural language processing (NLP) tasks. Stop words are words that are filtered out before or after processing of natural language data. They are typically the most common words in a language and don't carry significant meaning, thus often removed to focus on the more important words or tokens in a text. This dataset can be used in various NLP applications such as text classification, sentiment analysis, and information retrieval to improve the accuracy and efficiency of text processing algorithms. By eliminating these stop words, the computational resources can be utilized more effectively, and the analysis can focus on the meaningful content of the text.
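    The filtering step described above amounts to a set-membership test per token; a minimal sketch (the stop-word set here is a tiny illustrative Hinglish/English mix, not the dataset itself, and `remove_stop_words` is a name of my own):

    ```python
    # Illustrative subset only; in practice, load the full stop-word list from the dataset file.
    STOP_WORDS = {"hai", "ka", "ki", "ke", "aur", "the", "is", "to", "a"}

    def remove_stop_words(text):
        """Tokenize on whitespace and drop tokens found in the stop-word set."""
        return [tok for tok in text.split() if tok.lower() not in STOP_WORDS]

    # e.g. remove_stop_words("yeh movie aur acchi hai") -> ["yeh", "movie", "acchi"]
    ```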

  15. Data from: Examining the origins of the word frequency effect in episodic...

    • researchdata.edu.au
    • openresearch.newcastle.edu.au
    Updated Apr 9, 2013
    Cite
    Andrew Heathcote (2013). Examining the origins of the word frequency effect in episodic recognition memory and its relationship to the word frequency effect in lexical memory [Dataset]. https://researchdata.edu.au/examining-the-origins-of-the-word-frequency-effect-in-episodic-recognition-memory-and-its-relationship-to-the-word-frequency-effect-in-lexical-memory
    Explore at:
    Dataset updated
    Apr 9, 2013
    Dataset provided by
    The University of Newcastle
    Authors
    Andrew Heathcote
    Description

    Two experiments investigated Estes and Maddox’s (2002) theory that the word frequency mirror effect in episodic recognition memory is due to word likeness rather than frequency of experience with a word. In Experiment 1, sixteen first-year psychology students at the University of Newcastle studied lists of high- and low-frequency words crossed with high- and low-neighbourhood-density words, then took an episodic recognition test in which they rated words as new or old and gave confidence ratings on a scale with six possible responses: sure old, probably old, possibly old, possibly new, probably new, and sure new. In Experiment 2, twenty-three first-year psychology students at the University of Newcastle completed a lexical decision task with lists of words and nonwords. Testing was conducted on a computer that presented the stimuli and recorded the participants’ responses using a program written in Turbo Pascal 6.0 with millisecond-accurate timing. The dataset contains one Microsoft Excel file in .xls format with the data for Experiments 1 and 2.

  16. Data from: Frequency lists of words from the GOS 1.0 corpus

    • live.european-language-grid.eu
    binary format
    Updated Nov 17, 2019
    Cite
    (2019). Frequency lists of words from the GOS 1.0 corpus [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/8319
    Explore at:
    binary formatAvailable download formats
    Dataset updated
    Nov 17, 2019
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Frequency lists of words were extracted from the GOS 1.0 Corpus of Spoken Slovene (http://hdl.handle.net/11356/1040) using the LIST corpus extraction tool (http://hdl.handle.net/11356/1227). The lists contain all words occurring in the corpus along with their absolute and relative frequencies, percentages, and distribution across the text-types included in the corpus taxonomy.

    The lists were extracted for each part-of-speech category. For each part-of-speech, two lists were extracted:

    1) one containing lemmas and their text-type distribution,

    2) one containing lower-case word forms as well as their normalized forms, lemmas, and morphosyntactic tags along with their text-type distribution.

    In addition, four lists were extracted from all words (regardless of their part-of-speech category):

    1) a list of all lemmas along with their part-of-speech category and text-type distribution;

    2) a list of all lower-case word forms with their lemmas, part-of-speech categories, and text-type distribution;

    3) a list of all lower-case word forms with their normalized word forms, lemmas, part-of-speech categories, and text-type distribution;

    4) a list of all morphosyntactic tags and their text-type distribution (the tags are also split into several columns).
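    The absolute and relative frequencies described above can be reproduced for any tokenised corpus; a minimal sketch without the text-type distribution columns (`frequency_list` is a name of my own, not the LIST tool's API):

    ```python
    from collections import Counter

    def frequency_list(tokens):
        """Return (word, absolute_freq, relative_freq) tuples,
        sorted by descending absolute frequency."""
        counts = Counter(tokens)
        total = sum(counts.values())
        return [(word, n, n / total) for word, n in counts.most_common()]
    ```

    Applied to a lemmatised token stream, this yields the lemma list; applied to lower-cased word forms, the word-form list.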

  17. German Language Datasets | 393K Translations | NLP | Dictionary Display |...

    • datarade.ai
    Updated Jul 30, 2025
    Cite
    Oxford Languages (2025). German Language Datasets | 393K Translations | NLP | Dictionary Display | Machine Learning (ML) Data | Translations | EU Coverage [Dataset]. https://datarade.ai/data-products/german-language-datasets-393k-translations-nlp-dictiona-oxford-languages
    Explore at:
    .json, .xml, .csv, .txtAvailable download formats
    Dataset updated
    Jul 30, 2025
    Dataset authored and provided by
    Oxford Languageshttps://lexico.com/es
    Area covered
    Liechtenstein, Belgium, Austria, Luxembourg, Switzerland, Germany
    Description

    Comprehensive German language datasets with linguistic annotations, including headwords, definitions, word senses, usage examples, part-of-speech (POS) tags, semantic metadata, and contextual usage details.

    Our German language datasets are carefully compiled and annotated by language and linguistic experts. The below datasets in German are available for license:

    1. German Monolingual Dictionary Data
    2. German Bilingual Dictionary Data
    3. German Word List Data

    Key Features (approximate numbers):

    1. German Monolingual Dictionary Data

    Our German monolingual dictionary data features clear definitions, headwords, examples, and comprehensive coverage of the German language spoken today.

    • Words: 85,500
    • Senses: 78,000
    • Example sentences: 55,000
    • Format: XML format
    • Delivery: Email (link-based file sharing)
    2. German Bilingual Dictionary Data

    The bilingual data provides translations in both directions, from English to German and from German to English. It is annually reviewed and updated by our in-house team of language experts. Offers comprehensive coverage of the language, providing a substantial volume of translated words of excellent quality.

    • Translations: 393,000
    • Senses: 207,500
    • Example translations: 129,500
    • Format: XML and JSON formats
    • Delivery: Email (link-based file sharing) and REST API
    • Updated frequency: annually
    3. German Word List Data

    This language data contains a carefully curated and comprehensive list of 338,000 German words.

    • Wordforms: 338,000
    • Format: CSV and TXT formats
    • Delivery: Email (link-based file sharing)

    Use Cases:

    We consistently work with our clients on new use cases as language technology continues to evolve. These include Natural Language Processing (NLP) applications, TTS, dictionary display tools, games, translations, word embedding, and word sense disambiguation (WSD).

    If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Growth.OL@oup.com to start the conversation.

    Pricing:

    Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.

    Contact our team or email us at Growth.OL@oup.com to explore pricing options and discover how our language data can support your goals.

    About the sample:

    The samples offer a brief overview of one or two language datasets (monolingual or/and bilingual dictionary data). To help you explore the structure and features of our dataset, we provide a sample in CSV format for preview purposes only.

    If you need the complete original sample or more details about any dataset, please contact us (Growth.OL@oup.com) to request access or further information.

  18. Chinese Language Datasets | 583K Translations | 141K Words | NLP | Dictionary...

    • datarade.ai
    Updated Aug 30, 2025
    Cite
    Oxford Languages (2025). Chinese Language Datasets | 583KTranslations | 141K Words | NLP | Dictionary Display | Translations Data | APAC coverage | Mandarin | Cantonese [Dataset]. https://datarade.ai/data-products/chinese-language-datasets-583ktranslations-178k-words-n-oxford-languages
    Explore at:
    .json, .xml, .csv, .txtAvailable download formats
    Dataset updated
    Aug 30, 2025
    Dataset authored and provided by
    Oxford Languageshttps://lexico.com/es
    Area covered
    Malaysia, Indonesia, Macao, Taiwan, China, Hong Kong, Singapore
    Description

    Comprehensive Chinese language datasets with linguistic annotations, including headwords, definitions, word senses, usage examples, part-of-speech (POS) tags, semantic metadata, and contextual usage details. Covering Simplified and Traditional writing systems.

    Our Chinese language datasets are carefully compiled and annotated by language and linguistic experts. The below datasets are available for license:

    1. Mandarin Chinese (simplified) Monolingual Dictionary Data
    2. Mandarin Chinese (traditional) Monolingual Dictionary Data
    3. Mandarin Chinese (simplified) Bilingual Dictionary Data
    4. Mandarin Chinese (traditional) Bilingual Dictionary Data
    5. Mandarin Chinese (simplified) Synonyms and Antonyms Data

    Key Features (approximate numbers):

    1. Mandarin Chinese (simplified) Monolingual Dictionary Data

    Our Mandarin Chinese (simplified) monolingual dictionary data features clear definitions, headwords, examples, and comprehensive coverage of the Mandarin Chinese language spoken today.

    • Words: 81,300
    • Senses: 62,400
    • Example sentences: 80,700
    • Format: XML and JSON formats
    • Delivery: Email (link-based file sharing) and REST API
    2. Mandarin Chinese (traditional) Monolingual Dictionary Data

    Our Mandarin Chinese (traditional) monolingual dictionary data features clear definitions, headwords, examples, and comprehensive coverage of the Mandarin Chinese language spoken today.

    • Words: 60,100
    • Senses: 144,700
    • Example sentences: 29,900
    • Format: XML format
    • Delivery: Email (link-based file sharing)
    3. Mandarin Chinese (simplified) Bilingual Dictionary Data

    The bilingual data provides translations in both directions, from English to Mandarin Chinese (simplified) and from Mandarin Chinese (simplified) to English. It is annually reviewed and updated by our in-house team of language experts. Offers comprehensive coverage of the language, providing a substantial volume of translated words of excellent quality.

    • Translations: 367,600
    • Senses: 204,500
    • Example translations: 150,900
    • Format: XML and JSON formats
    • Delivery: Email (link-based file sharing) and REST API
    • Updated frequency: annually
    4. Mandarin Chinese (traditional) Bilingual Dictionary Data

    The bilingual data provides translations in both directions, from English to Mandarin Chinese (traditional) and from Mandarin Chinese (traditional) to English. It is annually reviewed and updated by our in-house team of language experts. Offers comprehensive coverage of the language, providing a substantial volume of translated words of excellent quality.

    • Translations: 215,600
    • Senses: 202,800
    • Example translations: 149,700
    • Format: XML format
    • Delivery: Email (link-based file sharing)
    5. Mandarin Chinese (simplified) Synonyms and Antonyms Data

    The Mandarin Chinese (simplified) Synonyms and Antonyms Dataset is a leading resource offering comprehensive, up-to-date coverage of word relationships in contemporary Mandarin Chinese. It includes rich linguistic detail such as precise definitions and part-of-speech (POS) tags, making it an essential asset for developing AI systems and language technologies that require deep semantic understanding.

    • Synonyms: 3,800
    • Antonyms: 3,180
    • Format: XML format
    • Delivery: Email (link-based file sharing)

    Use Cases:

    We consistently work with our clients on new use cases as language technology continues to evolve. These include NLP applications, TTS, dictionary display tools, games, translation, word embedding, and word sense disambiguation (WSD).

    If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Growth.OL@oup.com to start the conversation.

    Pricing:

    Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.

    Contact our team or email us at Growth.OL@oup.com to explore pricing options and discover how our language data can support your goals.

    Please note that some datasets may have rights restrictions. Contact us for more information.

    About the sample:

    The samples offer a brief overview of one or two language datasets (monolingual or/and bilingual dictionary data). To help you explore the structure and features of our dataset, we provide a sample in CSV format for preview purposes only.

    If you need the complete original sample or more details about any dataset, please contact us (Growth.OL@oup.com) to request access or further information.

  19. Datasets of word network topic model

    • figshare.com
    application/x-rar
    Updated Jun 1, 2023
    Cite
    Jichang Zhao (2023). Datasets of word network topic model [Dataset]. http://doi.org/10.6084/m9.figshare.5572588.v1
    Explore at:
    application/x-rarAvailable download formats
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Figsharehttp://figshare.com/
    figshare
    Authors
    Jichang Zhao
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract: This dataset holds the content of one day's micro-blogs sampled from Weibo (http://weibo.com) in the form of bags-of-words.

    Data Set Characteristics: Text. Number of micro-blogs: 189,223. Total number of words: 3,252,492. Size of the vocabulary: 20,942. Associated tasks: short text topic modeling, etc.

    About Preprocessing: For tokenization, we use NLPIR. Stop words and words with a term frequency of less than 20 were removed. Words consisting of only one Chinese character were also removed.

    Data Format: The released data is formatted as [document_1], [document_2], ..., [document_M], with one document per line. [document_i] is the i-th document of the dataset and consists of a list of N_i words: [document_i] = [word_i1] [word_i2] ... [word_iNi], where all word_ij are text strings separated by the blank character.

    If you have any questions about the data set, please contact: jichang@buaa.edu.cn.
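    Given the line-per-document, blank-separated layout described above, loading the corpus is straightforward; a minimal sketch (the function name and file name are illustrative, assuming a UTF-8 plain-text file):

    ```python
    def parse_bow_corpus(text):
        """One document per line; words within a line are separated by blanks."""
        return [line.split() for line in text.splitlines() if line.strip()]

    # docs = parse_bow_corpus(open("weibo_bow.txt", encoding="utf-8").read())
    # docs[i] is then the bag-of-words token list for document i.
    ```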

  20. LSC (Leicester Scientific Corpus)

    • figshare.le.ac.uk
    Updated Apr 15, 2020
    Cite
    Neslihan Suzen (2020). LSC (Leicester Scientific Corpus) [Dataset]. http://doi.org/10.25392/leicester.data.9449639.v1
    Explore at:
    Dataset updated
    Apr 15, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Leicester
    Description

    The LSC (Leicester Scientific Corpus), August 2019, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk), supervised by Prof Alexander Gorban and Dr Evgeny Mirkes. The data is extracted from the Web of Science® [1]. You may not copy or distribute this data in whole or in part without the written consent of Clarivate Analytics.

    Getting Started: This text provides background information on the LSC and the pre-processing steps applied to abstracts, and describes the structure of the files that organise the corpus. The corpus was created for future work on the quantification of the sense of research texts; one goal of publishing the data is to make it available for further analysis and use in Natural Language Processing projects.

    LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [1]. The corpus contains only documents in English. It was collected online in July 2018 and records the number of citations from publication date to July 2018. Each document contains the following parts:

    1. Authors: the list of authors of the paper
    2. Title: the title of the paper
    3. Abstract: the abstract of the paper
    4. Categories: one or more categories from the list of categories [2]; the full list is in the file ‘List_of_Categories.txt’
    5. Research Areas: one or more research areas from the list of research areas [3]; the full list is in the file ‘List_of_Research_Areas.txt’
    6. Total Times Cited: the number of times the paper was cited by other items from all databases within the Web of Science platform [4]
    7. Times Cited in Core Collection: the total number of times the paper was cited by other papers within the WoS Core Collection [4]

    We describe a document as the collection of information (about a paper) listed above. The total number of documents in LSC is 1,673,824. All documents have a nonempty abstract, title, categories, research areas, and times cited in WoS databases. There are 119 documents with an empty authors list; we did not exclude them.

    Data Processing: Processing the data consists of six main steps.

    Step 1: Downloading the data online. The dataset was collected manually by exporting documents as tab-delimited files. All downloaded documents are available online.

    Step 2: Importing the dataset to R. The collection, gathered as TXT files, was converted to RData format for processing; all documents were extracted to R.

    Step 3: Cleaning the data of documents with an empty abstract or without a category. Not all papers in the collection have an abstract and categories. As the research is based on the analysis of abstracts and categories, all documents with empty abstracts or without categories were removed.

    Step 4: Identification and correction of concatenated words in abstracts. Traditionally, abstracts are written as one paragraph of continuous writing, known as an ‘unstructured abstract’; medicine-related publications in particular use ‘structured abstracts’ divided into sections with distinct headings such as introduction, aim, objective, method, result, and conclusion. The tool used for extracting abstracts concatenates section headings with the first word of the section, producing words such as ‘ConclusionHigher’ and ‘ConclusionsRT’. Detection of such words cannot be fully automated; human intervention is needed to identify possible section headings. We consider only concatenated words in section headings, as detecting all concatenated words is not possible without deep knowledge of the research areas. The headings, identified by sampling medicine-related publications, are given in List 1. All occurrences of a List 1 heading fused to the following word are split into two words; for instance, ‘ConclusionHigher’ is split into ‘Conclusion’ and ‘Higher’.

    List 1: Headings of sections identified in structured abstracts: Background, Method(s), Design, Theoretical, Measurement(s), Location, Aim(s), Methodology, Process, Abstract, Population, Approach, Objective(s), Purpose(s), Subject(s), Introduction, Implication(s), Patient(s), Procedure(s), Hypothesis, Measure(s), Setting(s), Limitation(s), Discussion, Conclusion(s), Result(s), Finding(s), Material(s), Rationale(s), Implications for health and nursing policy.

    Step 5: Extracting (sub-setting) the data based on abstract length. After correction of concatenated words, the lengths of abstracts were calculated. ‘Length’ is the total number of words in the text, calculated by the same rule as Microsoft Word’s word count [5]. According to the APA style manual [6], an abstract should contain 150 to 250 words, but word limits vary from journal to journal; for instance, the Journal of Vascular Surgery recommends that ‘Clinical and basic research studies must include a structured abstract of 400 words or less’ [7]. In LSC, abstract length varies from 1 to 3,805 words. To study documents with abstracts of typical length, and to avoid length effects in the analysis, documents with fewer than 30 or more than 500 words in the abstract were removed.

    Step 6: Saving the dataset in CSV format. Corrected and extracted documents are saved into 36 CSV files, whose structure is described in the following section. In these files, each record is on one line, with the abstract, title, list of authors, list of categories, list of research areas, and times cited recorded in separate fields.

    To access the LSC for research purposes, please email ns433@le.ac.uk.

    References:
    [1] Web of Science. Available: https://apps.webofknowledge.com/
    [2] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
    [3] Research Areas in WoS. Available: https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html
    [4] Times Cited in WoS Core Collection. Available: https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Times-Cited-accessibility-and-variation?language=en_US
    [5] Word Count. Available: https://support.office.com/en-us/article/show-word-count-3c9e6a11-a04d-43b4-977c-563a0e0d5da3
    [6] American Psychological Association, Publication Manual. Washington, DC: American Psychological Association, 1983.
    [7] P. Gloviczki and P. F. Lawrence, "Information for authors," Journal of Vascular Surgery, vol. 65, no. 1, pp. A16–A22, 2017.
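    The heading-splitting step can be approximated with a regular expression once the heading inventory is known; this is a sketch over a few of the List 1 headings, not the authors' actual implementation (names are my own):

    ```python
    import re

    # A few headings from List 1; the real inventory is longer.
    HEADINGS = ["Conclusions", "Conclusion", "Background", "Methods", "Method",
                "Results", "Result", "Objective", "Introduction", "Discussion"]

    # Longer alternatives first so "Conclusions" wins over "Conclusion";
    # the lookahead requires an immediately following capital letter (the fused word).
    _PATTERN = re.compile(r"\b(" + "|".join(HEADINGS) + r")(?=[A-Z])")

    def split_fused_headings(text):
        """Insert a space between a section heading and the word fused to it."""
        return _PATTERN.sub(r"\1 ", text)

    # e.g. split_fused_headings("ConclusionHigher doses were safe.")
    #      -> "Conclusion Higher doses were safe."
    ```

    Note the lowercase lookahead failure keeps ordinary words like "Methodology" intact, since "Method" is followed by a lowercase letter there.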
