Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a set of one-second .wav audio files, each containing a single spoken English word or background noise. These words are from a small set of commands, and are spoken by a variety of different speakers. This data set is designed to help train simple machine learning models. This dataset is covered in more detail at https://arxiv.org/abs/1804.03209.
Version 0.01 of the data set (configuration "v0.01") was released on August 3rd, 2017 and contains 64,727 audio files.
In version 0.01, thirty different words were recorded: "Yes", "No", "Up", "Down", "Left", "Right", "On", "Off", "Stop", "Go", "Zero", "One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine", "Bed", "Bird", "Cat", "Dog", "Happy", "House", "Marvin", "Sheila", "Tree", "Wow".
In version 0.02, five more words were added: "Backward", "Forward", "Follow", "Learn", "Visual".
In both versions, ten of the words are used as commands by convention: "Yes", "No", "Up", "Down", "Left", "Right", "On", "Off", "Stop", "Go". The other words are considered auxiliary (in the current implementation this is marked by a True value of the "is_unknown" feature); their function is to teach a model to distinguish core words from unrecognized ones. The _silence_ class contains a set of longer audio clips that are either recordings or a mathematical simulation of noise.
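The feature layout above maps directly onto the Hugging Face datasets library. A minimal loading sketch, assuming the dataset is the one published under the hub ID "speech_commands" with the configurations named above:

    # Minimal sketch: load configuration "v0.01" and inspect one example.
    # Hub ID and feature names ("audio", "label", "is_unknown") are taken
    # from the description above; verify them against the dataset page.
    from datasets import load_dataset

    ds = load_dataset("speech_commands", "v0.01", split="train")
    example = ds[0]
    print(example["audio"]["sampling_rate"])        # 16 kHz, one-second clips
    print(example["label"], example["is_unknown"])  # auxiliary words are True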
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The Google Speech Commands Dataset v0.02 is a curated collection of short (approximately one-second) audio recordings of spoken words, specifically designed for training and benchmarking keyword spotting systems. Each recording captures a single spoken command uttered by a diverse set of speakers, making the dataset highly valuable for developing robust, real-world voice-controlled applications. The commands include common terms such as "yes", "no", "up", "down", "left", "right", "on", "off", "stop", and "go", among others.
In addition to the primary command recordings, the dataset also provides a set of background noise audio files. These files, stored in a dedicated folder, are intended to support data augmentation techniques and help improve model performance in noisy environments. The dataset has been widely adopted in both academic research and industry applications, serving as a benchmark for lightweight and efficient speech recognition systems.
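As a concrete illustration of the augmentation use case, here is a minimal sketch that mixes a random slice of a background-noise file into a command clip at a chosen signal-to-noise ratio. The file paths are hypothetical, and NumPy plus the soundfile library are assumed:

    import numpy as np
    import soundfile as sf

    def mix_noise(speech, noise, snr_db=10.0):
        # Crop a random noise segment to the speech length (the noise
        # files are longer than the one-second command clips).
        start = np.random.randint(0, len(noise) - len(speech) + 1)
        noise = noise[start:start + len(speech)]
        # Scale the noise so the mixture reaches the requested SNR.
        speech_power = np.mean(speech ** 2)
        noise_power = np.mean(noise ** 2) + 1e-10
        scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
        return speech + scale * noise

    speech, sr = sf.read("yes/example.wav")             # hypothetical paths
    noise, _ = sf.read("_background_noise_/noise.wav")
    augmented = mix_noise(speech, noise, snr_db=10.0)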
This dataset was created by Kaladin Stormblessed
Hunzla/google-speech-commands-wav2vec2-960h dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset was created by Neeha Kurelli
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Olggol
Released under Apache 2.0
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Arabic Speech Commands Dataset
This dataset is designed to help train simple machine learning models that serve educational and research purposes in the speech recognition domain, mainly for keyword spotting tasks.
Dataset Description
Our dataset is a list of pairs (x, y), where x is the input speech signal and y is the corresponding keyword. The final dataset consists of 12,000 such pairs, covering 40 keywords. Each audio file is one second long, sampled at 16 kHz. We had 30 participants, each of whom recorded 10 utterances of every keyword. We therefore have 300 audio files per keyword and 30 × 10 × 40 = 12,000 files in total; the recorded keywords together occupy ~384 MB. The dataset also contains several background-noise recordings obtained from various natural noise sources, saved in a separate folder named background_noise with a total size of ~49 MB.
Dataset Structure
There are 40 folders, each of which represents one keyword and contains 300 files. The first eight digits of each file name identify the contributor, while the last two digits identify the round number. For example, the file path rotate/00000021_NO_06.wav indicates that the contributor with the ID 00000021 pronounced the keyword rotate for the 6th time.
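A minimal parsing sketch for this naming scheme, using the example path from the description:

    from pathlib import Path

    path = Path("rotate/00000021_NO_06.wav")
    keyword = path.parent.name          # "rotate"
    contributor_id = path.stem[:8]      # "00000021" (first eight digits)
    round_number = int(path.stem[-2:])  # 6 (last two digits)
    print(keyword, contributor_id, round_number)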
Data Split
We recommend using the provided CSV files in your experiments. We kept 60% of the dataset for training, 20% for validation, and the remaining 20% for testing. In our split method, we guarantee that all recordings of a certain contributor are within the same subset.
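A minimal sketch for verifying that the split is speaker-disjoint, assuming hypothetical CSV file names and that the first column of each row holds a relative file path:

    import csv

    def contributors(csv_path):
        # The first eight digits of each file name identify the contributor.
        with open(csv_path, newline="") as f:
            return {row[0].rsplit("/", 1)[-1][:8] for row in csv.reader(f)}

    train, val, test = map(contributors, ["train.csv", "val.csv", "test.csv"])
    assert train.isdisjoint(val) and train.isdisjoint(test) and val.isdisjoint(test)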
License
This dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. For more details, see the LICENSE file in this folder.
Citations
If you want to use the Arabic Speech Commands dataset in your work, please cite it as:
@article{arabicspeechcommandsv1,
  author    = {Ghandoura, Abdulkader and Hjabo, Farouk and Al Dakkak, Oumayma},
  title     = {Building and Benchmarking an Arabic Speech Commands Dataset for Small-Footprint Keyword Spotting},
  journal   = {Engineering Applications of Artificial Intelligence},
  year      = {2021},
  publisher = {Elsevier}
}
Google researchers published the Speech Commands dataset! I'm publishing only 14 subcategories of the voice data, each clip one second long.
I preprocessed the audio, generated JSON files, and uploaded the dataset. Feel free to use it!
Enjoy, and develop your keyword spotting application efficiently!
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Database description:
The written and spoken digits database is not a new database but one constructed from existing databases, in order to provide a ready-to-use dataset for multimodal fusion.
The written digits database is the original MNIST handwritten digits database [1] with no additional processing. It consists of 70000 images (60000 for training and 10000 for test) of 28 x 28 = 784 dimensions.
The spoken digits database was extracted from Google Speech Commands [2], an audio dataset of spoken words that was proposed to train and evaluate keyword spotting systems. It consists of 105829 utterances of 35 words, amongst which 38908 utterances of the ten digits (34801 for training and 4107 for test). A pre-processing was done via the extraction of the Mel Frequency Cepstral Coefficients (MFCC) with a framing window size of 50 ms and frame shift size of 25 ms. Since the speech samples are approximately 1 s long, we end up with 39 time slots. For each one, we extract 12 MFCC coefficients with an additional energy coefficient. Thus, we have a final vector of 39 x 13 = 507 dimensions. Standardization and normalization were applied on the MFCC features.
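A minimal sketch of an equivalent extraction with librosa (an assumption; the original tooling is not named), using the stated 50 ms windows, 25 ms shifts, and 12 MFCC coefficients plus energy:

    import librosa
    import numpy as np

    y, sr = librosa.load("speech_sample.wav", sr=16000)  # hypothetical ~1 s clip
    win, hop = int(0.050 * sr), int(0.025 * sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12, n_fft=win, hop_length=hop)
    energy = librosa.feature.rms(y=y, frame_length=win, hop_length=hop)
    features = np.vstack([mfcc, energy])   # 13 coefficients per frame
    vector = features[:, :39].T.flatten()  # keep 39 frames -> 39 x 13 = 507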
To construct the multimodal digits dataset, we associated written and spoken digits of the same class, respecting the initial partitioning in [1] and [2] for the training and test subsets. Since we have fewer samples for the spoken digits, we duplicated some random samples to match the number of written digits, giving a multimodal digits database of 70000 samples (60000 for training and 10000 for test).
The dataset is provided in six files as described below. Therefore, if a shuffle is performed on the training or test subsets, it must be performed in unison with the same order for the written digits, spoken digits and labels.
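A minimal sketch of such a unison shuffle, assuming the training subset has been loaded as NumPy arrays (file names are hypothetical):

    import numpy as np

    written = np.load("written_train.npy")  # (60000, 784)
    spoken = np.load("spoken_train.npy")    # (60000, 507)
    labels = np.load("labels_train.npy")    # (60000,)

    perm = np.random.permutation(len(labels))  # one shared order for all three
    written, spoken, labels = written[perm], spoken[perm], labels[perm]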
Files:
References:
[1] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[2] P. Warden, "Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition," arXiv:1804.03209, 2018.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Multilingual Speech Commands Dataset (15 Languages, Augmented)
This dataset contains augmented speech command samples in 15 languages, derived from multiple public datasets. Only commands that overlap with the Google Speech Commands (GSC) vocabulary are included, making the dataset suitable for multilingual keyword spotting tasks aligned with GSC-style classification. Audio samples have been augmented using standard audio techniques to improve model robustness (e.g., time-shifting…). See the full description on the dataset page: https://huggingface.co/datasets/artur-muratov/multilingual-speech-commands-15lang.
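A minimal sketch of the time-shifting augmentation mentioned above, rolling a one-second waveform by a random offset and zeroing the samples that wrap around:

    import numpy as np

    def time_shift(wave, sr=16000, max_shift_s=0.1):
        shift = np.random.randint(-int(max_shift_s * sr), int(max_shift_s * sr) + 1)
        shifted = np.roll(wave, shift)
        if shift > 0:
            shifted[:shift] = 0   # silence the wrapped-around head
        elif shift < 0:
            shifted[shift:] = 0   # silence the wrapped-around tail
        return shifted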
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was created by AntonFilatov
Released under Attribution 4.0 International (CC BY 4.0)
Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/
This dataset was created by Thomas Shoesmith
Released under Database: Open Database License (ODbL); Contents: Database Contents License (DbCL)
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Discover Google wake words and voice commands in US English for seamless interaction with your Google-enabled devices and services.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Google-synth dataset is a synthetic Punjabi speech dataset generated with Google's Cloud Text-to-Speech service. It comprises approximately 50,000 synthesized utterances from four synthetic speakers (two male and two female), amounting to roughly 38 hours of audio. The dataset is pre-divided into three portions: 80% for training, 10% for validation, and 10% for testing. All speech files are stored in the "clips" directory, while the corresponding transcript files (train, dev, and test) sit in the parent directory in TSV (Tab-Separated Values) format. Each line of a transcript file labels one speech sample from the clips directory: the first column contains the path and name of the WAV file, and the second column, separated by a tab, contains the transcript text.
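A minimal sketch of reading one of the transcript files under the layout described above (the exact file name is an assumption):

    import csv

    with open("train.tsv", newline="", encoding="utf-8") as f:
        for wav_path, transcript in csv.reader(f, delimiter="\t"):
            print(wav_path, transcript)  # e.g. clips/<file>.wav + Punjabi text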
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The expansion of Internet connectivity has revolutionized our daily lives, with people increasingly relying on smartphones and laptops for various tasks. This technological evolution has prompted the development of innovative solutions to enhance the quality of life for diverse populations, including the elderly and individuals with disabilities. Among the most impactful advancements are voice-command-enabled technologies such as Siri and Google voice commands, which are built upon Speech Recognition modules, a critical component in facilitating human-machine communication.
Automatic Speech Recognition (ASR) has made significant progress toward human-like performance through data-driven methods. In the context of our research, we have carefully crafted an Arabic voice command dataset to facilitate advancements in ASR and other speech processing tasks. This dataset comprises 10 distinct commands spoken by 10 unique speakers, each repeated 10 times. Despite its modest size, the dataset has demonstrated remarkable performance across a range of speech processing tasks.
The dataset was rigorously evaluated, yielding exceptional results. In ASR, it achieved an accuracy of 95.9%, showcasing its potential for effectively transcribing spoken Arabic commands. Furthermore, the dataset excelled in speaker identification, gender recognition, accent recognition, and spoken language understanding, with macro F1 scores of 99.67%, 100%, 100%, and 97.98%, respectively.
This Arabic voice command dataset represents a valuable resource for researchers and developers in speech processing and human-machine interaction. Its quality and diversity make it a robust foundation for developing and testing ASR and related systems, ultimately contributing to the advancement of voice-command technologies and their widespread accessibility.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a dataset of keywords extracted from conversational-style and command-style speech. It is made up of three subsets: command keywords (ck), extended command keywords (eck), and conversational speech (cs). This dataset is intended solely for research on keyword recognition and speech style analysis.
For each of the three subsets, we asked users to record themselves in a quiet environment, reciting given text excerpts with specific keywords inside them, and to save the recordings as 16 kHz 32-bit PCM WAVE files.
We added an additional folder of background noises, which is a copy from Google's Speech Commands set of background noises.
-------------------------------------------------------------------------------------------------------------------------------------------------------------
For command keywords (ck), the users were given 10 text examples, each containing 10 repetitions of one of the following keywords: on, no, up, off, down, stop, go, right, yes and left. They were asked to pronounce the examples as if answering a device, using an intonation similar to command-style speech; for instance, as if the device asked, "Lights up or lights down?" and they answered "Up!" ten times, with as much variation in intonation between the repetitions as they could produce.
ck sentences:
1) On! (10 times)
2) No! (10 times)
3) Up! (10 times)
4) Off! (10 times)
5) Down! (10 times)
6) Stop! (10 times)
7) Go! (10 times)
8) Right! (10 times)
9) Yes! (10 times)
10) Left! (10 times)
-------------------------------------------------------------------------------------------------------------------------------------------------------------
For conversational speech (cs), users were given 20 more elaborate text examples such as this one: "She put the book on the table. No other books were there. She then looked up to her mum.". They were asked to recite the examples in their normal speaking tone, as if they were telling a story to a friend. The cs recordings contained a total of 10 instances of each of the keywords used for ck.
cs sentences:
1) She put the book on the table. No other books were there. She then looked up to her mum.
2) On Friday they were meeting. He had no idea what she was up to.
3) She climbed up the ladder with no hesitation and stood on the roof.
4) The alarm clock rang, but no lights were on to wake him up.
5) He insisted on going up the mountain, even though no one followed.
6) They were driving up the street on a sunny day. There was no traffic.
7) The power went on, but there was no heat, so she bundled up in a blanket.
8) She started the car up, but no radio was on to play her favourite music.
9) The computer powered up, but no alerts came on the screen.
10) No wonder she woke up tired. She had been on call 3 times this week already.
11) She put off the task, unable to find the right resources. Down the hall, the phone rang.
12) He switched off the lights, quietly stepped down the stairs and turned right.
13) The cat jumped right off the neighbour's fence and darted down our garden alley.
14) She climbed down slowly, then jumped off right at the bottom.
15) He pulled off his hat and looked right down at his shoes.
16) They set off down the road, aiming to reach their destination right at dawn.
17) The papers flew off the table and floated down to the floor, right next to his shoes.
18) She carefully wiped the mud off her shoes and sat down, turning slightly, to her right.
19) They turned right off the busy highway and drove down the country road.
20) She got off the bus and walked down the street, heading right past the blue house.
-------------------------------------------------------------------------------------------------------------------------------------------------------------
For the extended command keywords (eck), we extended the keywords in the ck dataset by an additional 100 ms to the left and right. A few keywords needed to be manually corrected. Two speakers were removed from the dataset because they pronounced the keywords very quickly, which resulted in too many bad keywords.
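A minimal sketch of such a boundary extension, widening hypothetical keyword start/end sample indices by 100 ms on each side within the source recording:

    import soundfile as sf

    PAD = int(0.100 * 16000)  # 100 ms at the 16 kHz sampling rate

    audio, sr = sf.read("full_recording.wav")  # hypothetical source file
    start, end = 24000, 40000                  # hypothetical keyword bounds
    segment = audio[max(0, start - PAD):min(len(audio), end + PAD)]
    sf.write("keyword_extended.wav", segment, sr)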
adialia/voice-commands-google-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset was created by Tingting WANG
License: https://www.futurebeeai.com/policies/ai-data-license-agreement
The Russian Wake Word & Voice Command Dataset is expertly curated to support the training and development of voice-activated systems. This dataset includes a large collection of wake words and command phrases, essential for enabling seamless user interaction with voice assistants and other speech-enabled technologies. It's designed to ensure accurate wake word detection and voice command recognition, enhancing overall system performance and user experience.
This dataset includes 20,000+ audio recordings of wake words and command phrases. Each participant contributed 400 recordings, captured under varied environmental conditions and speaking speeds. The data covers:
This diversity ensures robust training for real-world voice assistant applications.
Each audio file is accompanied by detailed metadata to support advanced filtering and training needs.
https://www.datainsightsmarket.com/privacy-policy
Discover the booming voice recognition software market! This comprehensive analysis reveals market size, CAGR, key trends (AI, cloud solutions), challenges, and top companies. Explore regional breakdowns and future growth projections (2025-2033) for informed decision-making.