9 datasets found

OK Aura Wake-up Word Dataset
zenodo.org
data.niaid.nih.gov
Updated Nov 29, 2021
Guillermo Cámbara; Jordi Luque; David Bonet; Fernando López; Mireia Farrús; Pablo Gómez; Carlos Segura (2021). OK Aura Wake-up Word Dataset [Dataset]. http://doi.org/10.5281/zenodo.5734340
Unique identifier
https://doi.org/10.5281/zenodo.5734340
Dataset updated
Nov 29, 2021
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Guillermo Cámbara; Jordi Luque; David Bonet; Fernando López; Mireia Farrús; Pablo Gómez; Carlos Segura
Description
Speech dataset for wake-up word (WuW) detection in Telefónica's home assistant, Aura. It contains 1247 utterances (1.4 hours) from ~80 speakers. Speakers pronounce the wake-up word itself "OK Aura", plus other sentences that might be similar, or not, to "OK Aura".

This dataset contains rich metadata annotations, making it possible to study diverse factors and biases that might affect wake-up word detection performance: accent, gender, prosody/emotion, room size, distance to the microphone, etc. It also contains recordings of sentences that are phonetically similar to "OK Aura", like "Porque Laura..." or "... como Aura...", so that detection can be tested on difficult sentences.
GENIA Bio-medical event dataset
kaggle.com
zip
Updated Dec 5, 2020
Nishanth (2020). GENIA Bio-medical event dataset [Dataset]. https://www.kaggle.com/nishanthsalian/genia-biomedical-event-dataset
Available download formats: zip (813625 bytes)
Dataset updated
Dec 5, 2020
Authors
Nishanth
License
http://www.gnu.org/licenses/agpl-3.0.html
Description
Context

Bio-medical texts contain a lot of information that can be used for developments in the medical field. Traditionally, domain experts extracted such information manually. Automating this information extraction task can help speed up progress in the field. To name a few use cases, bio-medical events show the effects of drugs on a person and can also be used to identify certain medical conditions. Hence, automating the extraction of events from bio-medical texts is very beneficial.

Content

The dataset is a simplified version of the event-annotated GENIA dataset, derived from the version available in TEES.

It consists of the original bio-medical text, the labelled trigger words, the location of each trigger word in the text and the event type associated with it. There are 3 sets of data: train (8k+ sentences), devel (about 3k sentences) and test (about 3k sentences). Each set has 4 columns, namely "Sentence", "TriggerWord", "TriggerWordLoc" and "EventType", which capture the original bio-medical text, the trigger words in the sentence, the location of the trigger words in the sentence and the event type associated with the trigger words, respectively.
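As a rough illustration of the four-column layout, the sketch below reads data in that shape with Python's standard csv module and counts rows per event type. The sample rows are invented for illustration and are not taken from the dataset; the encoding of "TriggerWordLoc" is not documented here, so the sketch keeps it as an uninterpreted string.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the four-column layout named in the description.
# The rows are invented for illustration; they are not taken from the dataset.
SAMPLE_CSV = """Sentence,TriggerWord,TriggerWordLoc,EventType
"IL-2 expression was induced in T cells.",expression,5,Gene_expression
"The kinase phosphorylates STAT3.",phosphorylates,11,Phosphorylation
"IL-4 expression increased.",expression,5,Gene_expression
"""

EXPECTED_COLUMNS = ["Sentence", "TriggerWord", "TriggerWordLoc", "EventType"]


def load_rows(handle):
    """Read rows as dictionaries and check the header matches the description."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    return list(reader)


def count_event_types(rows):
    """Count how many rows carry each event type."""
    return Counter(row["EventType"] for row in rows)


rows = load_rows(io.StringIO(SAMPLE_CSV))
event_counts = count_event_types(rows)
for event_type, count in event_counts.most_common():
    print(event_type, count)
```

To try this on the real data, the in-memory sample could be swapped for an opened train, devel or test file, assuming the header row matches the column names above.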

Acknowledgements

The dataset is a simplified version of the event-annotated GENIA dataset, derived from the version available in TEES. The original source dataset is from BioNLP Shared Task 2011. A complete unprocessed version seems to be present in the genia-event-2011 dataset too.

For TEES licensing information, please refer to this link. For GENIA dataset licensing information, please refer to the file "GE11-LICENSE" present beside the data files (.csv) in this Kaggle dataset.

Photo Credits: Louis Reed on Unsplash
Wake Word Hebrew Dataset
shaip.com
Updated Nov 8, 2023
Shaip (2023). Wake Word Hebrew Dataset [Dataset]. https://www.shaip.com/offerings/speech-data-catalog/wake-word-hebrew-dataset/
Dataset updated
Nov 8, 2023
Dataset authored and provided by
Shaip
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
High-Quality Hebrew Wake Word Dataset for AI & Speech Models
Overview
Title: Wake Word Hebrew Language Dataset
Dataset Type: Wake Word
Description: Wake Words / Voice Command / Trigger Word…
Wake Word Detection AI Market Research Report 2033
dataintelo.com
csv, pdf, pptx
Updated Sep 30, 2025
Dataintelo (2025). Wake Word Detection AI Market Research Report 2033 [Dataset]. https://dataintelo.com/report/wake-word-detection-ai-market
Available download formats: csv, pptx, pdf
Dataset updated
Sep 30, 2025
Dataset authored and provided by
Dataintelo
License
https://dataintelo.com/privacy-and-policy
Time period covered
2024 - 2032
Area covered
Global
Description
Wake Word Detection AI Market Outlook

According to our latest research, the global wake word detection AI market size reached USD 1.42 billion in 2024, reflecting robust adoption across multiple industries. The market is projected to expand at a CAGR of 18.6% from 2025 to 2033, reaching an estimated USD 7.18 billion by 2033. This remarkable growth is being driven by the surging integration of voice-activated technologies in consumer electronics, automotive infotainment systems, and smart home devices, as well as increasing demand for hands-free and intuitive human-machine interactions.
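The relationship between the quoted base value, growth rate and end value can be checked with a short compound-growth calculation. The sketch below is only a sanity check under one assumption: growth compounds once per year from the 2024 base, giving nine periods up to 2033. Under that assumption the projection comes out at roughly USD 6.6 billion, somewhat below the quoted figure, so the report presumably uses a different base year or compounding convention; the summary does not say which.

```python
def project_value(base_value, annual_rate, years):
    """Compound base_value at annual_rate for a whole number of years."""
    return base_value * (1.0 + annual_rate) ** years


# Figures quoted in the report summary (USD billion).
BASE_2024 = 1.42
CAGR = 0.186
REPORTED_2033 = 7.18

# Assumption: annual compounding from the 2024 base, i.e. nine periods.
projected_2033 = project_value(BASE_2024, CAGR, 2033 - 2024)
gap = REPORTED_2033 - projected_2033

print(f"projected 2033 value: {projected_2033:.2f} (reported: {REPORTED_2033})")
print(f"difference from reported value: {gap:.2f}")
```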

One of the most significant growth factors for the wake word detection AI market is the exponential rise in smart device adoption worldwide. Voice assistants such as Amazon Alexa, Google Assistant, and Apple Siri have become ubiquitous, embedded in everything from smartphones and smart speakers to televisions and household appliances. The ability of wake word detection AI to provide seamless, always-on listening capabilities without draining device power or compromising user privacy is a major driver for manufacturers. As the global population becomes increasingly tech-savvy and reliant on digital assistants for daily tasks, the demand for accurate, low-latency wake word detection technology is set to soar. This growth is further bolstered by advancements in edge computing and AI chipsets, which enable real-time processing and reduce the need for constant cloud connectivity, enhancing both privacy and responsiveness.

Another critical factor driving the wake word detection AI market is the rapid evolution of the automotive and healthcare sectors. In automotive, the integration of voice-activated controls is transforming the in-car experience, making infotainment systems safer and more user-friendly by minimizing driver distraction. Leading automakers are partnering with AI solution providers to embed wake word detection in vehicles, enabling drivers to control navigation, music, and climate settings hands-free. Similarly, in healthcare, wake word detection is being increasingly used in medical devices and remote patient monitoring systems, allowing for hands-free operation, improved accessibility for patients with mobility challenges, and faster response times in emergencies. These applications are expanding the market beyond traditional consumer electronics, opening up new avenues for growth and innovation.

The proliferation of IoT and smart home ecosystems is also playing a pivotal role in market expansion. With the growing popularity of home automation products such as smart lights, thermostats, security systems, and connected appliances, the need for reliable wake word detection AI is more critical than ever. Consumers now expect their devices to respond instantly and accurately to voice commands, regardless of ambient noise or multiple users. This has led to a surge in R&D investments aimed at improving the robustness, multilingual capabilities, and contextual understanding of wake word detection algorithms. The convergence of AI, machine learning, and natural language processing is enabling more personalized and secure voice experiences, further accelerating adoption across residential and commercial settings.

From a regional perspective, North America continues to dominate the wake word detection AI market due to the presence of major technology giants, a high concentration of early adopters, and significant investments in AI research. However, Asia Pacific is emerging as the fastest-growing region, fueled by rapid urbanization, rising disposable incomes, and increasing penetration of smart devices in countries like China, Japan, and South Korea. Europe is also witnessing substantial growth, driven by strong regulatory frameworks around data privacy and a burgeoning ecosystem of AI startups. Meanwhile, Latin America and the Middle East & Africa are gradually catching up, with governments and enterprises recognizing the potential of voice AI to drive digital transformation and enhance user engagement.

Component Analysis

The wake word detection AI market can be segmented by component into software, hardware, and services, each playing a distinct role in the overall ecosystem. The software segment is the backbone of wake word detection, comprising advanced algorithms and machine learning models that enable devices to recognize specific trigger words with high accuracy and minimal latency. Over the past few years, there has been a si
EXALT-v1
huggingface.co
Updated May 20, 2024
Pranaydeep Singh (2024). EXALT-v1 [Dataset]. https://huggingface.co/datasets/pranaydeeps/EXALT-v1
Dataset updated
May 20, 2024
Authors
Pranaydeep Singh
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Dataset Card for EXALT-v1

A cross-lingual Emotion Detection dataset with an explainability component, i.e. Trigger Word Detection. We provide training data in English along with test data in 5 languages (Dutch, Russian, Spanish, English, and French). For additional details, please refer to our website at: lt3.ugent.be/exalt

Dataset Details

Dataset Description

A cross-lingual emotion detection & explainability dataset, that consists of two sub-tasks:… See the full description on the dataset page: https://huggingface.co/datasets/pranaydeeps/EXALT-v1.
Automotive Wake Word Detection Market Research Report 2033
dataintelo.com
csv, pdf, pptx
Updated Sep 30, 2025
Dataintelo (2025). Automotive Wake Word Detection Market Research Report 2033 [Dataset]. https://dataintelo.com/report/automotive-wake-word-detection-market
Available download formats: pptx, csv, pdf
Dataset updated
Sep 30, 2025
Dataset authored and provided by
Dataintelo
License
https://dataintelo.com/privacy-and-policy
Time period covered
2024 - 2032
Area covered
Global
Description
Automotive Wake Word Detection Market Outlook

As per our latest research, the global automotive wake word detection market size reached USD 437.2 million in 2024. The market is expected to grow at a robust CAGR of 19.7% during the forecast period, reaching a projected value of USD 1,613.6 million by 2033. This significant growth is primarily driven by the increasing integration of voice-activated technologies in vehicles, the rising demand for hands-free infotainment, and the ongoing advancements in artificial intelligence and natural language processing within the automotive sector.
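Going the other way, the growth rate implied by a start and end value can be recovered by inverting the compound-growth formula. The sketch below assumes annual compounding over the nine years from 2024 to 2033, which the summary does not confirm. Under that assumption the implied rate is roughly 15.6% per year, lower than the stated 19.7%, so the forecast period or base value used in the report may differ from the one assumed here.

```python
def implied_cagr(start_value, end_value, years):
    """Return the constant annual growth rate linking start_value to end_value."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Figures quoted in the report summary (USD million).
START_2024 = 437.2
END_2033 = 1613.6
STATED_CAGR = 0.197

# Assumption: nine annual compounding periods between 2024 and 2033.
rate = implied_cagr(START_2024, END_2033, 2033 - 2024)

print(f"implied CAGR: {rate:.1%} (stated: {STATED_CAGR:.1%})")
```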

One of the primary factors fueling the expansion of the automotive wake word detection market is the surging demand for enhanced in-car user experiences. Consumers today expect seamless, intuitive, and safe interactions with their vehicles, especially as digital connectivity becomes ubiquitous. Wake word detection systems, which allow drivers and passengers to activate voice assistants using specific trigger phrases, are at the heart of this transformation. These systems enable hands-free operation of navigation, infotainment, and vehicle controls, significantly reducing driver distraction and improving overall safety. Additionally, the growing proliferation of connected cars and the integration of smart features have made wake word detection a standard expectation in modern vehicles, further boosting market growth.

Another key growth driver is the rapid advancement in artificial intelligence (AI) and machine learning algorithms. The evolution of these technologies has enabled wake word detection systems to achieve higher accuracy, faster response times, and better contextual understanding, even in noisy automotive environments. The incorporation of advanced natural language processing (NLP) allows these systems to recognize a wide range of accents, dialects, and languages, making them more accessible to global consumers. Moreover, the increasing collaboration between automotive OEMs and technology providers has accelerated the development and deployment of robust and reliable wake word detection solutions, ensuring that vehicles remain at the forefront of digital innovation.

The market is also witnessing growth due to the rising adoption of electric vehicles (EVs) and autonomous driving technologies. As EVs and self-driving cars become more prevalent, the need for advanced human-machine interfaces (HMIs) that facilitate effortless communication between the driver, passengers, and vehicle systems becomes critical. Wake word detection plays a pivotal role in these HMIs, enabling voice-activated commands for climate control, entertainment, and navigation without manual intervention. Furthermore, regulatory bodies worldwide are emphasizing driver safety and minimal distraction, encouraging automakers to integrate sophisticated voice-activated systems, thus propelling the market forward.

Regionally, North America and Europe are currently leading the automotive wake word detection market, driven by high consumer awareness, advanced automotive manufacturing capabilities, and the early adoption of innovative in-car technologies. However, the Asia Pacific region is poised for the fastest growth, thanks to the expanding automotive industry, increasing disposable incomes, and the rapid digitization of vehicles in emerging economies such as China and India. The presence of major automotive OEMs and technology vendors in these regions further accelerates the adoption of wake word detection systems, ensuring a dynamic and competitive market landscape.

Component Analysis

The automotive wake word detection market by component is segmented into software, hardware, and services. The software segment dominates the market, accounting for the largest share in 2024, as sophisticated algorithms and AI-driven models are essential for accurately recognizing wake words in varying acoustic environments. The continuous evolution of software platforms, leveraging deep learning and neural network models, has significantly improved system performance, enabling seamless integration with diverse vehicle architectures. Automotive OEMs are increasingly investing in proprietary and customizable software solutions to differentiate their in-car experiences, further driving the growth of this segment.

The hardware segment is also witnessing notable growth, primarily due to the nee
Grade rules of trigger word frequency.
plos.figshare.com
xls
Updated Jun 1, 2023
Yajun Zhang; Zongtian Liu; Wen Zhou (2023). Grade rules of trigger word frequency. [Dataset]. http://doi.org/10.1371/journal.pone.0160147.t002
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0160147.t002
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Yajun Zhang; Zongtian Liu; Wen Zhou
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Grade rules of trigger word frequency.
Additional file 1 of Filtering large-scale event collections using a...
springernature.figshare.com
tar
Updated Jun 3, 2023
Farrokh Mehryary; Suwisa Kaewphan; Kai Hakala; Filip Ginter (2023). Additional file 1 of Filtering large-scale event collections using a combination of supervised and unsupervised learning for event trigger classification [Dataset]. http://doi.org/10.6084/m9.figshare.c.3627959_D1.v1
Available download formats: tar
Unique identifier
https://doi.org/10.6084/m9.figshare.c.3627959_D1.v1
Dataset updated
Jun 3, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Farrokh Mehryary; Suwisa Kaewphan; Kai Hakala; Filip Ginter
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This tar file contains images of the binary cluster tree, before and after the pruning. The HowToInterpretTreeDiagrams.txt file describes how the diagrams should be interpreted. (TAR 1290 kb)
Event Detection Dataset
data.mendeley.com
datosdeinvestigacion.conicet.gov.ar
Updated Jul 11, 2020
Mariano Maisonnave (2020). Event Detection Dataset [Dataset]. http://doi.org/10.17632/7d54rvzxkr.1
Unique identifier
https://doi.org/10.17632/7d54rvzxkr.1
Dataset updated
Jul 11, 2020
Authors
Mariano Maisonnave
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This is a manually labeled data set for the task of Event Detection (ED). The task of ED consists of identifying event triggers, the words that most clearly indicate the occurrence of an event.

The data set consists of 2,200 news extracts from The New York Times (NYT) Annotated Corpus, separated into training (2,000) and testing (200) sets. Each news extract contains the plain text with the labels (event mentions), along with two metadata fields (publication date and an identifier).

Labels description: We consider as an event any ongoing real-world event or situation reported in the news articles. It is important to distinguish events and situations that are in progress (or are reported as fresh events) at the moment the news is delivered from past events that are simply brought back, future events, hypothetical events, or events that will not take place. In our data set we labeled only the first type as events. Based on this criterion, some words that are typically considered events are labeled as non-event triggers if they do not refer to ongoing events at the time the analyzed news is released. Take for instance the following news extract: "devaluation is not a realistic option to the current account deficit since it would only contribute to weakening the credibility of economic policies as it did during the last crisis." The only word labeled as an event trigger in this example is "deficit", because it is the only ongoing event referred to in the news. Note that the words "devaluation", "weakening" and "crisis" could be labeled as event triggers in other news extracts, where these words are used in a different context, but not in the given example.

Further information: For a more detailed description of the data set and the data collection process please visit: https://cs.uns.edu.ar/~mmaisonnave/resources/ED_data.

Data format: The dataset is split into two folders: training and testing. The first folder contains 2,000 XML files. The second folder contains 200 XML files. Each XML file has the following format.

<?xml version="1.0" encoding="UTF-8"?>

The first three tags (pubdate, file-id and sent-idx) contain metadata. The first is the publication date of the news article that contained the text extract. The next two tags together form a unique identifier for the text extract: file-id uniquely identifies a news article, which can hold several text extracts, and sent-idx is the index that identifies the text extract inside the full article.

The last tag (sentence) defines the beginning and end of the text extract. Inside that text are the tags. Each of these tags surrounds one word that was manually labeled as an event trigger.
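A file in this layout could be read with Python's standard xml.etree.ElementTree, as in the minimal sketch below. The tag names pubdate, file-id, sent-idx and sentence come from the description; the root tag "extract", the trigger tag "event" and the metadata values are assumptions made for illustration, since the original markup did not survive in this listing. The sentence is the example quoted in the labels description.

```python
import xml.etree.ElementTree as ET

# Hypothetical file content. The root tag <extract>, the trigger tag <event>
# and the metadata values are assumptions; only pubdate, file-id, sent-idx
# and sentence are named in the dataset description.
SAMPLE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<extract>
  <pubdate>2000-01-01</pubdate>
  <file-id>12345</file-id>
  <sent-idx>3</sent-idx>
  <sentence>devaluation is not a realistic option to the current account <event>deficit</event> since it would only contribute to weakening the credibility of economic policies as it did during the last crisis.</sentence>
</extract>
"""


def parse_extract(xml_bytes, trigger_tag="event"):
    """Return metadata, plain text and trigger words for one text extract."""
    root = ET.fromstring(xml_bytes)
    sentence = root.find("sentence")
    return {
        "pubdate": root.findtext("pubdate"),
        "file_id": root.findtext("file-id"),
        "sent_idx": root.findtext("sent-idx"),
        # itertext() joins the text around and inside the trigger tags.
        "text": "".join(sentence.itertext()),
        "triggers": [element.text for element in sentence.iter(trigger_tag)],
    }


record = parse_extract(SAMPLE_XML.encode("utf-8"))
print(record["file_id"], record["sent_idx"], record["triggers"])
```

The trigger tag is passed as a parameter so the sketch can be adapted once the actual tag name in the files is known.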