Speech dataset for wake-up word (WuW) detection in Telefónica's home assistant, Aura. It contains 1247 utterances (1.4 hours) from ~80 speakers. Speakers pronounce the wake-up word itself, "OK Aura", plus other sentences that may or may not sound similar to "OK Aura".
This dataset contains rich metadata annotations, so it is possible to study diverse factors and biases that might affect wake-up word detection performance: accent, gender, prosody/emotion, room size, distance to the microphone, etc. It also contains recordings of sentences that are phonetically similar to "OK Aura", like "Porque Laura..." or "... como Aura...", so that detectors can be tested on difficult sentences.
http://www.gnu.org/licenses/agpl-3.0.html
Bio-medical texts contain a lot of information that can support developments in the medical field. Traditionally, domain experts extracted such information manually. Automating this information extraction task can help speed up progress in the field. To name a few use cases, bio-medical events can show the effects of drugs on a person, and they can help identify certain medical conditions in a person. Hence, automating the extraction of events from bio-medical texts is very beneficial.
The dataset is a simplified version of the event-annotated GENIA dataset, derived from the version available in TEES.
It consists of the original bio-medical text, the labelled trigger words, the location of each trigger word in the text, and the event type associated with the trigger word. There are 3 sets of data: train (8k+ sentences), devel (about 3k sentences) and test (about 3k sentences). Each set has 4 columns, namely "Sentence", "TriggerWord", "TriggerWordLoc" and "EventType", which capture the original bio-medical text, the trigger words in the sentence, the location of the trigger words in the sentence, and the event type associated with the trigger words, respectively.
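As a minimal sketch of how the four columns described above might be read, the snippet below parses CSV text with Python's standard csv module and counts the event types. The sample rows, the event type names, and the encoding of "TriggerWordLoc" (shown here as a made-up integer) are assumptions for illustration only; the description does not specify how locations are encoded or whether a sentence with several triggers spans several rows.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows: only the four column names come from the
# dataset description; every value below is invented for illustration.
SAMPLE_CSV = (
    "Sentence,TriggerWord,TriggerWordLoc,EventType\n"
    '"IL-2 expression was induced in T cells.",expression,1,Gene_expression\n'
    '"IL-2 expression was induced in T cells.",induced,3,Positive_regulation\n'
    '"STAT3 phosphorylation was blocked.",phosphorylation,1,Phosphorylation\n'
)


def load_rows(handle):
    """Read all rows into dictionaries keyed by the CSV header names."""
    return list(csv.DictReader(handle))


def count_event_types(rows):
    """Count how often each value of the EventType column occurs."""
    return Counter(row["EventType"] for row in rows)


rows = load_rows(io.StringIO(SAMPLE_CSV))
counts = count_event_types(rows)
for event_type, n in counts.most_common():
    print(event_type, n)
```

To use this with the real files, the io.StringIO object would be replaced by a handle from open(...) on one of the train, devel or test CSV files.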
The original source dataset is from the BioNLP Shared Task 2011. A complete unprocessed version also seems to be present in the genia-event-2011 dataset.
For TEES licensing information, please refer to this link. For GENIA dataset licensing information, please refer to the file "GE11-LICENSE" located beside the data files (.csv) in this Kaggle dataset.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Wake Word Hebrew Dataset: High-Quality Hebrew Wake Word Dataset for AI & Speech Models
Title: Wake Word Hebrew Language Dataset
Dataset Type: Wake Word
Description: Wake Words / Voice Command / Trigger Word…
https://dataintelo.com/privacy-and-policy
According to our latest research, the global wake word detection AI market size reached USD 1.42 billion in 2024, reflecting robust adoption across multiple industries. The market is projected to expand at a CAGR of 18.6% from 2025 to 2033, reaching an estimated USD 7.18 billion by 2033. This remarkable growth is being driven by the surging integration of voice-activated technologies in consumer electronics, automotive infotainment systems, and smart home devices, as well as increasing demand for hands-free and intuitive human-machine interactions.
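The relationship between a starting market size, a compound annual growth rate (CAGR) and a projected end value can be sketched with the standard compounding formula. This is a generic illustration, not the report's methodology: the function names are hypothetical, and the result of plugging in the figures quoted above depends on which base year and how many compounding periods are assumed.

```python
def project(start_value, rate, periods):
    """Compound start_value at a fixed annual rate for the given number of periods."""
    return start_value * (1.0 + rate) ** periods


def implied_cagr(start_value, end_value, periods):
    """Return the constant annual rate that turns start_value into end_value."""
    return (end_value / start_value) ** (1.0 / periods) - 1.0


# Simple self-check with round numbers: 10% a year for 2 years.
grown = project(100.0, 0.10, 2)
rate = implied_cagr(100.0, grown, 2)
print(round(grown, 2), round(rate, 4))
```

Calling implied_cagr with a quoted start value, end value and an assumed number of years gives the rate those endpoints imply, which can then be compared with the stated CAGR.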
One of the most significant growth factors for the wake word detection AI market is the exponential rise in smart device adoption worldwide. Voice assistants such as Amazon Alexa, Google Assistant, and Apple Siri have become ubiquitous, embedded in everything from smartphones and smart speakers to televisions and household appliances. The ability of wake word detection AI to provide seamless, always-on listening capabilities without draining device power or compromising user privacy is a major driver for manufacturers. As the global population becomes increasingly tech-savvy and reliant on digital assistants for daily tasks, the demand for accurate, low-latency wake word detection technology is set to soar. This growth is further bolstered by advancements in edge computing and AI chipsets, which enable real-time processing and reduce the need for constant cloud connectivity, enhancing both privacy and responsiveness.
Another critical factor driving the wake word detection AI market is the rapid evolution of the automotive and healthcare sectors. In automotive, the integration of voice-activated controls is transforming the in-car experience, making infotainment systems safer and more user-friendly by minimizing driver distraction. Leading automakers are partnering with AI solution providers to embed wake word detection in vehicles, enabling drivers to control navigation, music, and climate settings hands-free. Similarly, in healthcare, wake word detection is being increasingly used in medical devices and remote patient monitoring systems, allowing for hands-free operation, improved accessibility for patients with mobility challenges, and faster response times in emergencies. These applications are expanding the market beyond traditional consumer electronics, opening up new avenues for growth and innovation.
The proliferation of IoT and smart home ecosystems is also playing a pivotal role in market expansion. With the growing popularity of home automation products such as smart lights, thermostats, security systems, and connected appliances, the need for reliable wake word detection AI is more critical than ever. Consumers now expect their devices to respond instantly and accurately to voice commands, regardless of ambient noise or multiple users. This has led to a surge in R&D investments aimed at improving the robustness, multilingual capabilities, and contextual understanding of wake word detection algorithms. The convergence of AI, machine learning, and natural language processing is enabling more personalized and secure voice experiences, further accelerating adoption across residential and commercial settings.
From a regional perspective, North America continues to dominate the wake word detection AI market due to the presence of major technology giants, a high concentration of early adopters, and significant investments in AI research. However, Asia Pacific is emerging as the fastest-growing region, fueled by rapid urbanization, rising disposable incomes, and increasing penetration of smart devices in countries like China, Japan, and South Korea. Europe is also witnessing substantial growth, driven by strong regulatory frameworks around data privacy and a burgeoning ecosystem of AI startups. Meanwhile, Latin America and the Middle East & Africa are gradually catching up, with governments and enterprises recognizing the potential of voice AI to drive digital transformation and enhance user engagement.
The wake word detection AI market can be segmented by component into software, hardware, and services, each playing a distinct role in the overall ecosystem. The software segment is the backbone of wake word detection, comprising advanced algorithms and machine learning models that enable devices to recognize specific trigger words with high accuracy and minimal latency. Over the past few years, there has been a si
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for EXALT-v1
A cross-lingual emotion detection dataset with an explainability component, i.e. trigger word detection. We provide training data in English along with test data in 5 languages (Dutch, Russian, Spanish, English, and French). For additional details, please refer to our website at: lt3.ugent.be/exalt
Dataset Details
Dataset Description
A cross-lingual emotion detection & explainability dataset, that consists of two sub-tasks:… See the full description on the dataset page: https://huggingface.co/datasets/pranaydeeps/EXALT-v1.
https://dataintelo.com/privacy-and-policy
As per our latest research, the global automotive wake word detection market size reached USD 437.2 million in 2024. The market is expected to grow at a robust CAGR of 19.7% during the forecast period, reaching a projected value of USD 1,613.6 million by 2033. This significant growth is primarily driven by the increasing integration of voice-activated technologies in vehicles, the rising demand for hands-free infotainment, and the ongoing advancements in artificial intelligence and natural language processing within the automotive sector.
One of the primary factors fueling the expansion of the automotive wake word detection market is the surging demand for enhanced in-car user experiences. Consumers today expect seamless, intuitive, and safe interactions with their vehicles, especially as digital connectivity becomes ubiquitous. Wake word detection systems, which allow drivers and passengers to activate voice assistants using specific trigger phrases, are at the heart of this transformation. These systems enable hands-free operation of navigation, infotainment, and vehicle controls, significantly reducing driver distraction and improving overall safety. Additionally, the growing proliferation of connected cars and the integration of smart features have made wake word detection a standard expectation in modern vehicles, further boosting market growth.
Another key growth driver is the rapid advancement in artificial intelligence (AI) and machine learning algorithms. The evolution of these technologies has enabled wake word detection systems to achieve higher accuracy, faster response times, and better contextual understanding, even in noisy automotive environments. The incorporation of advanced natural language processing (NLP) allows these systems to recognize a wide range of accents, dialects, and languages, making them more accessible to global consumers. Moreover, the increasing collaboration between automotive OEMs and technology providers has accelerated the development and deployment of robust and reliable wake word detection solutions, ensuring that vehicles remain at the forefront of digital innovation.
The market is also witnessing growth due to the rising adoption of electric vehicles (EVs) and autonomous driving technologies. As EVs and self-driving cars become more prevalent, the need for advanced human-machine interfaces (HMIs) that facilitate effortless communication between the driver, passengers, and vehicle systems becomes critical. Wake word detection plays a pivotal role in these HMIs, enabling voice-activated commands for climate control, entertainment, and navigation without manual intervention. Furthermore, regulatory bodies worldwide are emphasizing driver safety and minimal distraction, encouraging automakers to integrate sophisticated voice-activated systems, thus propelling the market forward.
Regionally, North America and Europe are currently leading the automotive wake word detection market, driven by high consumer awareness, advanced automotive manufacturing capabilities, and the early adoption of innovative in-car technologies. However, the Asia Pacific region is poised for the fastest growth, thanks to the expanding automotive industry, increasing disposable incomes, and the rapid digitization of vehicles in emerging economies such as China and India. The presence of major automotive OEMs and technology vendors in these regions further accelerates the adoption of wake word detection systems, ensuring a dynamic and competitive market landscape.
The automotive wake word detection market by component is segmented into software, hardware, and services. The software segment dominates the market, accounting for the largest share in 2024, as sophisticated algorithms and AI-driven models are essential for accurately recognizing wake words in varying acoustic environments. The continuous evolution of software platforms, leveraging deep learning and neural network models, has significantly improved system performance, enabling seamless integration with diverse vehicle architectures. Automotive OEMs are increasingly investing in proprietary and customizable software solutions to differentiate their in-car experiences, further driving the growth of this segment.
The hardware segment is also witnessing notable growth, primarily due to the nee
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Grade rules of trigger word frequency.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This tar file contains images of the binary cluster tree, before and after the pruning. The HowToInterpretTreeDiagrams.txt file describes how the diagrams should be interpreted. (TAR 1290 kb)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a manually labeled data set for the task of Event Detection (ED). The task of ED consists of identifying event triggers, the words that most clearly indicate the occurrence of an event.
The data set consists of 2,200 news extracts from The New York Times (NYT) Annotated Corpus, separated into training (2,000) and testing (200) sets. Each news extract contains the plain text with the labels (event mentions), along with two metadata fields (publication date and an identifier).
Labels description: We consider as an event any ongoing real-world event or situation reported in the news articles. It is important to distinguish events and situations that are in progress (or are reported as fresh events) at the moment the news is delivered from past events that are simply brought back, future events, hypothetical events, and events that will not take place. In our data set, only the first type is labeled as an event. Based on this criterion, some words that are typically considered events are labeled as non-event triggers if they do not refer to events that are ongoing at the time the analyzed news is released. Take for instance the following news extract: "devaluation is not a realistic option to the current account deficit since it would only contribute to weakening the credibility of economic policies as it did during the last crisis." The only word labeled as an event trigger in this example is "deficit", because it is the only ongoing event referred to in the news. Note that the words "devaluation", "weakening" and "crisis" could be labeled as event triggers in other news extracts, where the context of use of these words is different, but not in the given example.
Further information: For a more detailed description of the data set and the data collection process please visit: https://cs.uns.edu.ar/~mmaisonnave/resources/ED_data.
Data format: The dataset is split into two folders: training and testing. The first folder contains 2,000 XML files. The second folder contains 200 XML files. Each XML file has the following format.
<?xml version="1.0" encoding="UTF-8"?>
The first three tags (pubdate, file-id and sent-idx) contain metadata. The pubdate tag holds the publication date of the news article that contained the text extract. The other two tags together form a unique identifier for the text extract: file-id uniquely identifies a news article, which can hold several text extracts, and sent-idx is the index that identifies the text extract inside the full article.
The last tag (sentence) defines the beginning and end of the text extract. Inside that text are the tags that mark event triggers. Each of these tags surrounds one word that was manually labeled as an event trigger.
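The tag layout described above can be read with Python's standard xml.etree.ElementTree module, as in the minimal sketch below. Only the tag names pubdate, file-id, sent-idx and sentence come from the description. The wrapper element <doc>, the trigger tag <t> and all values in the sample are placeholders, since the real root name, trigger tag name and date format are not given here. The code treats every child element of sentence as a trigger, so it does not depend on the trigger tag's actual name.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: <doc> and <t> are placeholder names and the metadata
# values are invented; only pubdate, file-id, sent-idx and sentence are
# tag names taken from the data set description.
SAMPLE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<doc>"
    "<pubdate>2000-01-01</pubdate>"
    "<file-id>123</file-id>"
    "<sent-idx>4</sent-idx>"
    "<sentence>devaluation is not a realistic option to the current "
    "account <t>deficit</t> since it would weaken policies.</sentence>"
    "</doc>"
)


def parse_extract(xml_text):
    """Return the metadata, plain text and trigger words of one extract."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    sentence = root.find(".//sentence")
    return {
        "pubdate": root.findtext(".//pubdate"),
        "file_id": root.findtext(".//file-id"),
        "sent_idx": root.findtext(".//sent-idx"),
        # itertext() joins the text inside and around the trigger tags.
        "text": "".join(sentence.itertext()),
        # Every child element of <sentence> is treated as a trigger tag.
        "triggers": [child.text for child in sentence],
    }


extract = parse_extract(SAMPLE)
print(extract["triggers"], extract["file_id"], extract["sent_idx"])
```

Applying parse_extract to the contents of each file in the training and testing folders would yield one record per text extract, with the pair (file_id, sent_idx) serving as its identifier.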