https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for Common Voice Corpus 12.0
Dataset Summary
The Common Voice dataset consists of a unique MP3 and corresponding text file. Many of the 26119 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 17127 validated hours in 104 languages, but more voices and languages are always added. Take a look at the Languages page to… See the full description on the dataset page: https://huggingface.co/datasets/mozilla-foundation/common_voice_12_0.
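A minimal sketch of loading this corpus with the Hugging Face datasets library is shown below; it assumes you have accepted the dataset terms on the Hub and are authenticated, and depending on your datasets version you may also need trust_remote_code=True:

```python
# Hedged sketch: stream one English sample from Common Voice 12.0.
# Assumes the dataset terms were accepted on the Hugging Face Hub and that
# you are logged in (e.g. via `huggingface-cli login`).
from datasets import load_dataset

cv = load_dataset(
    "mozilla-foundation/common_voice_12_0",
    "en",            # language config; any Common Voice language code works
    split="train",
    streaming=True,  # avoids downloading the full corpus up front
)

sample = next(iter(cv))
print(sample["sentence"])                 # transcript text
print(sample["audio"]["sampling_rate"])   # decoded audio metadata
```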
https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for Common Voice Corpus 11.0
Dataset Summary
The Common Voice dataset consists of a unique MP3 and corresponding text file. Many of the 24210 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 16413 validated hours in 100 languages, but more voices and languages are always added. Take a look at the Languages page to… See the full description on the dataset page: https://huggingface.co/datasets/taqwa92/cm.trial.
https://creativecommons.org/publicdomain/zero/1.0/
Common Voice is a corpus of speech data read by users on the Common Voice website, and based upon text from a number of public domain sources like user submitted blog posts, old books, movies, and other public speech corpora. Its primary purpose is to enable the training and testing of automatic speech recognition (ASR) systems.
In Google Colab, I downloaded the .tar.gz from Common Voice (Mozilla), placed the compressed file in a folder, marked the folder as a dataset, and uploaded it directly.
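A rough sketch of that workflow, with placeholder paths for where the archive was uploaded and where it should be unpacked:

```python
# Unpack an uploaded Common Voice archive in Colab; both paths below are
# placeholders, not fixed locations.
import tarfile

archive_path = "/content/dataset/cv-corpus.tar.gz"   # hypothetical upload location
extract_dir = "/content/cv-corpus"

with tarfile.open(archive_path, "r:gz") as tar:
    tar.extractall(path=extract_dir)   # the archive contains clips/ and *.tsv metadata
```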
https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for Common Voice Corpus 4
Dataset Summary
The Common Voice dataset consists of a unique MP3 and corresponding text file. Many of the 4257 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 3401 validated hours in 40 languages, but more voices and languages are always added. Take a look at the Languages page to request a… See the full description on the dataset page: https://huggingface.co/datasets/mozilla-foundation/common_voice_4_0.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Release Date: 17.01.22
Welcome to Common Phone 1.0
Legal Information
Common Phone is a subset of the Common Voice corpus collected by Mozilla Corporation. By using Common Phone, you agree to the Common Voice Legal Terms. Common Phone is maintained and distributed by speech researchers at the Pattern Recognition Lab of Friedrich-Alexander-University Erlangen-Nuremberg (FAU) under the CC0 license.
Like for Common Voice, you must not make any attempt to identify speakers that contributed to Common Phone.
About Common Phone
This corpus aims to provide a basis for Machine Learning (ML) researchers and enthusiasts to train and test their models against a wide variety of speakers, hardware/software ecosystems and acoustic conditions to improve generalization and availability of ML in real-world speech applications.
The current version of Common Phone comprises 116.5 hours of speech samples, collected from 11,246 speakers in 6 languages:
| Language | Speakers (train / dev / test) | Hours (train / dev / test) |
|---|---|---|
| English | 4716 / 771 / 774 | 14.1 / 2.3 / 2.3 |
| French | 796 / 138 / 135 | 13.6 / 2.3 / 2.2 |
| German | 1176 / 202 / 206 | 14.5 / 2.5 / 2.6 |
| Italian | 1031 / 176 / 178 | 14.6 / 2.5 / 2.5 |
| Spanish | 508 / 88 / 91 | 16.5 / 3.0 / 3.1 |
| Russian | 190 / 34 / 36 | 12.7 / 2.6 / 2.8 |
| Total | 8417 / 1409 / 1420 | 85.8 / 15.2 / 15.5 |
The presented train, dev and test splits are not identical to those shipped with Common Voice. Speaker separation across splits was achieved by using only speakers who had provided age and gender information, which can only be supplied by registered users on the website. When a contributor is logged in, the session ID of their recordings is always linked to their user account, so recordings could easily be linked to individual speakers. Keep in mind that this would not be possible for unregistered users, whose session ID changes if they decide to contribute more than once.
During speaker selection, we considered that some speakers had contributed to more than one of the six Common Voice datasets (one for each language). In Common Phone, a speaker will only appear in one language.
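The selection criterion described above can be illustrated against the Common Voice per-language TSV metadata. In the sketch below the file name, column names, and split ratios are assumptions for illustration only, not the exact procedure used to build Common Phone:

```python
# Keep only utterances whose speaker provided age and gender, then split at
# the speaker level so no client_id appears in more than one split.
# File name and split ratios are illustrative assumptions.
import pandas as pd

df = pd.read_csv("validated.tsv", sep="\t")
df = df.dropna(subset=["age", "gender"])                 # registered speakers only

speakers = df["client_id"].drop_duplicates().sample(frac=1.0, random_state=0)
n = len(speakers)
train_ids = set(speakers[: int(0.8 * n)])
dev_ids = set(speakers[int(0.8 * n): int(0.9 * n)])

train = df[df["client_id"].isin(train_ids)]
dev = df[df["client_id"].isin(dev_ids)]
test = df[~df["client_id"].isin(train_ids | dev_ids)]
```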
The dataset is structured as follows:
Where does the phonetic annotation come from?
Phonetic annotation was computed via BAS Web Services. We used the regular Pipeline (G2P-MAUS) without ASR to create an alignment of text transcripts with audio signals. We chose International Phonetic Alphabet (IPA) output symbols as they work well even in a multi-lingual setup. Common Phone annotation comprises 101 phonetic symbols, including silence.
Why Common Phone?
Is there any publication available?
Yes, a paper describing Common Phone in detail is currently under revision for LREC 2022. You can access a pre-print version on arXiv entitled “Common Phone: A Multilingual Dataset for Robust Acoustic Modelling”.
https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for Common Voice Corpus 6.0
Dataset Summary
The Common Voice dataset consists of a unique MP3 and corresponding text file. Many of the 9261 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 7327 validated hours in 60 languages, but more voices and languages are always added. Take a look at the Languages page to request… See the full description on the dataset page: https://huggingface.co/datasets/mozilla-foundation/common_voice_6_0.
https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for Common Voice Corpus 17.0
Dataset Summary
The Common Voice dataset consists of a unique MP3 and corresponding text file. Many of the 31175 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 20408 validated hours in 124 languages, but more voices and languages are always added. Take a look at the Languages page to… See the full description on the dataset page: https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
The following dataset is intended to be used for gender recognition using audio files recorded in uncontrolled environments, taken from the Mozilla Common Voice Dataset 10.0. It consists of a table of descriptive statistical characteristics of the fundamental frequency for six tonal languages: Chinese (China), Chinese (Hong Kong), Chinese (Taiwan), Thai, Vietnamese, and Punjabi. In addition, it includes an estimate of the vocal tract length of each speaker.
This dataset contains 18 columns:
- 'client_id': Speaker ID from Mozilla Common Voice
- 'path': Name of the mp3 file
- 'sentence': The sentence spoken by the speaker
- 'age': Age in decades (teens, twenties, etc.)
- 'gender': Binary gender (male or female)
- 'duration': Duration of the mp3 in seconds
- 'vocal_tract_length': Vocal tract length in cm
- 'mean_F4': Mean of the fourth formant in Hz
- 'min_pitch': Minimum pitch of the whole pitch contour in Hz
- 'mean_pitch': Mean pitch of the whole pitch contour in Hz
- 'q1_pitch': First quartile of the whole pitch contour in Hz
- 'median_pitch': Median pitch of the whole pitch contour in Hz
- 'q3_pitch': Third quartile of the whole pitch contour in Hz
- 'max_pitch': Maximum pitch of the whole pitch contour in Hz
- 'stddev_pitch': Standard deviation of the whole pitch contour in Hz
- 'estimated_age': Nominal value (adult or teen)
- 'estimated_age_gender': Nominal value (adult-male, adult-female, teen-male and teen-female)
- 'language': Nominal value (Chinese (China), Chinese (Hong Kong), Chinese (Taiwan), Thai, Vietnamese, and Punjabi)
The methodology for the extraction of these characteristics was the following:
1) Only the audio files listed in the valid.tsv file of the respective language were analyzed (this file is contained in the Mozilla Common Voice Dataset, https://commonvoice.mozilla.org/en/datasets). The voiced speech was extracted using the Praat Vocal Toolkit algorithm (https://www.praatvocaltoolkit.com/extract-voiced-and-unvoiced.html).
2) The vocal tract length was calculated with the Vocal Toolkit algorithm (https://www.praatvocaltoolkit.com/calculate-vocal-tract-length.html) as follows: if the audio came from a teen, the maximum formant was set to 8000 Hz; otherwise it was set to 5000 Hz for men and 5500 Hz for women. Finally, the mean of the fourth formant was calculated for the windows with voiced speech only.
3) The fundamental frequency was calculated using the Praat software with the To Pitch (ac) option and the following parameters: a) time step (s) = 0.0 (= auto); b) pitch floor (Hz) = 75.0; c) max. number of candidates = 15; d) very accurate = True; e) silence threshold = 0.03; f) voicing threshold = 0.45; g) octave cost = 0.01; h) octave jump cost = 0.35; i) voiced/unvoiced cost = 0.14; j) pitch ceiling (Hz) = 350.
4) The statistical characteristics of the fundamental frequency were calculated only in the windows that were detected as voiced speech.
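For illustration, steps 3 and 4 can be reproduced with the parselmouth Python bindings for Praat. The parameter values below mirror the To Pitch (ac) settings listed above; the input file name is a placeholder and this is not the authors' original extraction script:

```python
# Compute the pitch contour with Praat's "To Pitch (ac)" via parselmouth and
# derive the pitch statistics over voiced frames only.
import numpy as np
import parselmouth

snd = parselmouth.Sound("clip.mp3")        # placeholder input file
pitch = snd.to_pitch_ac(
    time_step=None,                        # 0.0 (= auto)
    pitch_floor=75.0,
    max_number_of_candidates=15,
    very_accurate=True,
    silence_threshold=0.03,
    voicing_threshold=0.45,
    octave_cost=0.01,
    octave_jump_cost=0.35,
    voiced_unvoiced_cost=0.14,
    pitch_ceiling=350.0,
)

f0 = pitch.selected_array["frequency"]
f0 = f0[f0 > 0]                            # voiced frames only (step 4)
stats = {
    "min_pitch": f0.min(),
    "q1_pitch": np.percentile(f0, 25),
    "median_pitch": np.median(f0),
    "mean_pitch": f0.mean(),
    "q3_pitch": np.percentile(f0, 75),
    "max_pitch": f0.max(),
    "stddev_pitch": f0.std(ddof=1),
}
print(stats)
```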
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Common Voice is an audio dataset that consists of a unique MP3 and corresponding text file. There are 9,283 recorded hours in the dataset. The dataset also includes demographic metadata like age, sex, and accent. The dataset consists of 7,335 validated hours in 60 languages.
https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for Common Voice Corpus 10.0
Dataset Summary
The Common Voice dataset consists of a unique MP3 and corresponding text file. Many of the 20817 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 15234 validated hours in 96 languages, but more voices and languages are always added. Take a look at the Languages page… See the full description on the dataset page: https://huggingface.co/datasets/gogogogo-1/test.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This audio dataset contains the content types most commonly encountered in file fragment classification, such as music, speech, and phone calls.
https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for Common Voice Corpus 3
Dataset Summary
The Common Voice dataset consists of a unique MP3 and corresponding text file. Many of the 2454 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 1979 validated hours in 29 languages, but more voices and languages are always added. Take a look at the Languages page to request a… See the full description on the dataset page: https://huggingface.co/datasets/mozilla-foundation/common_voice_3_0.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Introducing the US English Scripted Monologue Speech Dataset for the Healthcare Domain, a voice dataset built to accelerate the development and deployment of English language automatic speech recognition (ASR) systems, with a sharp focus on real-world healthcare interactions.
This dataset includes over 6,000 high-quality scripted audio prompts recorded in US English, representing typical voice interactions found in the healthcare industry. The data is tailored for use in voice technology systems that power virtual assistants, patient-facing AI tools, and intelligent customer service platforms.
The prompts span a broad range of healthcare-specific interactions, such as:
To maximize authenticity, the prompts integrate linguistic elements and healthcare-specific terms such as:
These elements make the dataset exceptionally suited for training AI systems to understand and respond to natural healthcare-related speech patterns.
Every audio recording is accompanied by a verbatim, manually verified transcription.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Version 2.0, February 2022.
BirdVox-ANAFCC is a dataset of short audio waveforms, each of them containing a flight call from one of 14 birds of North America: four American sparrows, one cardinal, two thrushes, and seven New World warblers.
* American Tree Sparrow (ATSP)
* Chipping Sparrow (CHSP)
* Savannah Sparrow (SAVS)
* White-throated Sparrow (WTSP)
* Rose-breasted Grosbeak (RBGR)
* Gray-cheeked Thrush (GCTH)
* Swainson's Thrush (SWTH)
* American Redstart (AMRE)
* Bay-breasted Warbler (BBWA)
* Black-throated Blue Warbler (BTBW)
* Canada Warbler (CAWA)
* Common Yellowthroat (COYE)
* Mourning Warbler (MOWA)
* Ovenbird (OVEN)
It also contains other sounds which are often confused for one of the species above. These "confounding factors" encompass flight calls from other species of birds, vocalizations from non-avian animals, as well as some machine beeps.
BirdVox-ANAFCC results from an aggregation of various smaller datasets, integrated under a common taxonomy. For more details on this taxonomy, we refer the reader to [1]:
[1] Cramer, Lostanlen, Salamon, Farnsworth, Bello. Chirping up the right tree: Incorporating biological taxonomies into deep bioacoustic classifiers. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020.
The second version of the BirdVox-ANAFCC dataset (v2.0) contains flight calls from the BirdVox-full-night dataset. These flight calls were present in the ICASSP 2020 benchmark but did not appear in the initial release of BirdVox-ANAFCC.
BirdVox-ANAFCC contains the recordings as HDF5 files, sampled at 22,050 Hz, with a single channel (mono). Each HDF5 file contains flight call vocalizations of a particular species; its name ends in _original.h5. The name of the HDF5 dataset in each file is "waveforms", with the corresponding key for each audio recording varying in format depending on the data source.
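A minimal sketch of reading one of these files with h5py; the glob pattern follows the *_original.h5 naming, and per-recording key formats vary by source as noted above:

```python
# Read one species file and fetch its first waveform.
import glob
import h5py

path = sorted(glob.glob("*_original.h5"))[0]   # pick any species file
with h5py.File(path, "r") as f:
    waveforms = f["waveforms"]                 # one entry per recording
    first_key = next(iter(waveforms.keys()))
    clip = waveforms[first_key][()]            # mono waveform at 22,050 Hz
    print(path, first_key, clip.shape)
```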
The file taxonomy.yaml details the three-level taxonomy structure used in this dataset, reflected in dot-separated three-number codes. Additionally, at any level of the taxonomy, the numeric code "0" is reserved for "other" and the code "X" refers to unknown. For example, 1.1.0 corresponds to an American Sparrow with a species outside of our scope of interest, and 1.1.X corresponds to an American Sparrow of unknown species. At the top level (family), the "other" codes (0.*.*) deviate from the family-order-species scheme in order to capture a variety of other out-of-scope sounds, including anthropophony, non-avian biophony, and biophony of avians outside of the scope of interest.
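As an interpretation aid only (not part of the dataset's tooling), the code conventions can be expressed as a tiny helper; the level names used here are generic placeholders:

```python
# Expand a dot-separated taxonomy code using the "0" = other, "X" = unknown
# conventions described above.
def describe_code(code: str) -> str:
    levels = ("level 1 (family)", "level 2", "level 3 (species)")
    parts = []
    for level, token in zip(levels, code.split(".")):
        if token == "0":
            parts.append(f"{level}: other / out of scope")
        elif token == "X":
            parts.append(f"{level}: unknown")
        else:
            parts.append(f"{level}: class {token}")
    return ", ".join(parts)

print(describe_code("1.1.0"))   # American Sparrow of an out-of-scope species
print(describe_code("1.1.X"))   # American Sparrow of unknown species
```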
When BirdVox-ANAFCC is used for academic research, we would highly appreciate it if scientific publications of works partly based on this dataset cite the following publication:
Cramer, Lostanlen, Salamon, Farnsworth, Bello. Chirping up the right tree: Incorporating biological taxonomies into deep bioacoustic classifiers. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020.
The creation of this dataset was supported by NSF grants 1125098 (BIRDCAST) and 1633259 (BIRDVOX), a Google Faculty Award, the Leon Levy Foundation, and two anonymous donors.
Dataset created by Aurora Cramer, Vincent Lostanlen, Bill Evans, Andrew Farnsworth, Justin Salamon, and Juan Pablo Bello.
The BirdVox-ANAFCC dataset is offered free of charge under the terms of the Creative Commons Attribution International License: https://creativecommons.org/licenses/by/4.0/
The dataset and its contents are made available on an "as is" basis and without warranties of any kind, including without limitation satisfactory quality and conformity, merchantability, fitness for a particular purpose, accuracy or completeness, or absence of errors. Subject to any liability that may not be excluded or limited by law, the authors are not liable for, and expressly exclude all liability for, loss or damage however and whenever caused to anyone by any use of the BirdVox-ANAFCC dataset or any part of it.
Please help us improve BirdVox-ANAFCC by sending your feedback to: vincent.lostanlen@gmail.com and auroracramer@nyu.edu
In case of a problem, please include as many details as possible.
1.0, May 2020: initial version, paired with the ICASSP 2020 publication.
2.0, February 2022: added a missing dataset file (BirdVox-70k); updated name of first author (Aurora Cramer).
Jessie Barry, Ian Davies, Tom Fredericks, Jeff Gerbracht, Sara Keen, Holger Klinck, Anne Klingensmith, Ray Mack, Peter Marchetto, Ed Moore, Matt Robbins, Ken Rosenberg, and Chris Tessaglia-Hymes.
We thank contributors and maintainers of the Macaulay Library and the Xeno-Canto website.
We acknowledge that the land on which the data was collected is the unceded territory of the Cayuga nation, which is part of the Haudenosaunee (Iroquois) confederacy.
https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for Common Voice Corpus 9.0
Dataset Summary
The Common Voice dataset consists of a unique MP3 and corresponding text file. Many of the 20217 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 14973 validated hours in 93 languages, but more voices and languages are always added. Take a look at the Languages page to… See the full description on the dataset page: https://huggingface.co/datasets/mozilla-foundation/common_voice_9_0.
https://creativecommons.org/publicdomain/zero/1.0/
Description
• This dataset contains over 7500 fart recordings that were collected over a period of 37 months. I created this dataset for educational purposes, to allow others to study and experiment with audio, and signal processing more broadly.
• The files are in .wav format. I recorded all of the farts using a voice recording app in whatever environment I was in at the time. Thus, there may be background noise, people talking, and variations in volume and clarity (note: I generally step far away to fart if others are present). I validate every single recording and delete those with low volume, too much background noise, or non-fart sounds that are too similar to farts (e.g. phone vibrations). Over time, I have decreased my threshold for acceptable background noise and strive to have minimal or no extraneous noise. There are numerous types of farts as well. I did not record every fart I produced during the time period; rather, I recorded when I was able to do so. Thus, this data is not inclusive of all emitted farts.
• I did not perform any preprocessing on this data, to maintain its versatility. For most audio tasks, you may consider trimming files to a consistent duration, along with other common audio preprocessing techniques (a minimal trimming sketch follows this list).
• The files are named as integers, starting from 1. The order of files bears no significance.
• If you are using these files specifically for fart classification or fart recognition, please bear in mind that this data is biased towards my farts. Consequently, you may find that a model recognizes someone else's farts, such as your own, with different results.
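A minimal trimming sketch for the preprocessing suggestion above; the file name matches the integer naming scheme, while the sample rate and 3-second target are arbitrary choices rather than properties of the dataset:

```python
# Load one recording and trim or pad it to a fixed duration.
import librosa
import numpy as np

target_sr = 22050
target_len = 3 * target_sr                       # 3 seconds, arbitrary target

y, sr = librosa.load("1.wav", sr=target_sr, mono=True)
y, _ = librosa.effects.trim(y, top_db=30)        # drop leading/trailing silence
y = np.pad(y, (0, max(0, target_len - len(y))))[:target_len]
```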
Suggested Uses
• Unsupervised signal classification - You can experiment with categorizing farts without any preexisting knowledge of defining characteristics and potentially apply these learnings to other signal types - speech, radar, tv, radio, light, EEG.
• Supervised signal recognition - This dataset could be used to experiment with developing deep learning models capable of recognizing whether a sound is a fart. An interesting property of farts is variable frequencies and inconsistent durations.
• Sound effects creation - This dataset could be used by sound designers or audio engineers as a basis to create new sound effects for movies, video games, or other media. You could also simply use it as a publicly available and free source of farts.
• Education and outreach - Educators and scientists can use this dataset as an approach to better engage their audiences in signal processing and deep learning.
License
• This data is publicly and freely available to use and modify however you would like. There is no license and there are no limitations on use. I would appreciate being notified when this data is used publicly, purely for my own entertainment.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a Wolof text-to-speech (TTS) dataset. It contains recordings from two native Wolof voice actors (a male and a female voice).
Each actor recorded more than 20,000 sentences.
The notebook accompanying the dataset contains a brief analysis of the dataset and the code for creating the appropriate train/validation and test sets.
The files [male, female]train, [male, female]validation and [male, female]test are also provided to extract the corresponding audio files from data-commonvoice.zip.
The text data comes from news websites, Wikipedia, and self-curated text. With the help of our Wolof expert, we made sure that the text covers the different phonemes of the Wolof language.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Introducing the Italian Scripted Monologue Speech Dataset for the Healthcare Domain, a voice dataset built to accelerate the development and deployment of Italian language automatic speech recognition (ASR) systems, with a sharp focus on real-world healthcare interactions.
This dataset includes over 6,000 high-quality scripted audio prompts recorded in Italian, representing typical voice interactions found in the healthcare industry. The data is tailored for use in voice technology systems that power virtual assistants, patient-facing AI tools, and intelligent customer service platforms.
The prompts span a broad range of healthcare-specific interactions, such as:
To maximize authenticity, the prompts integrate linguistic elements and healthcare-specific terms such as:
These elements make the dataset exceptionally suited for training AI systems to understand and respond to natural healthcare-related speech patterns.
Every audio recording is accompanied by a verbatim, manually verified transcription.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset for this project comprises audio recordings of the operational states of belt conveyor rollers in a mining environment, covering three conditions: normal roller operation, roller shell cracking, and roller breakage. Combined with machine learning models, this dataset can be used for real-time diagnosis of roller operational states. The database contains two main folders: dataset and code.

The dataset folder includes three subfolders:
- wav: Contains 19 WAV files recorded from 19 microphones, capturing the audio data of belt conveyor rollers at a mining site. Of these, 17 files represent normal roller operation, 1 file captures the audio of a roller with shell cracking, and 1 file captures the audio of a roller with complete breakage.
- csv_dataset: Contains 10 subfolders, each holding audio feature datasets extracted from the WAV files with frame lengths ranging from 100 ms to 1000 ms. Each subfolder contains 19 CSV files, corresponding to the 19 audio recordings. The feature datasets within different frame-length subfolders should not be used interchangeably.
- test_dataset: Contains 17 audio feature datasets with a 200 ms frame length. These datasets include features from the 17 normal-operation recordings combined with features from the roller shell cracking and roller breakage recordings. The combined datasets are shuffled 100 times to ensure an even distribution of features from each operational state. This dataset was used in the paper for validating the accuracy and usability of the audio feature datasets for real-time monitoring of roller states.

The code folder contains two sets of code:
- Matlab code: Extracts 25 audio features from the WAV files and generates the 17 audio feature datasets using a 200 ms frame length.
- Python code: Validates the accuracy and usability of the audio feature datasets for real-time monitoring of belt conveyor roller operational states.

This dataset and code combination supports the real-time diagnosis of belt conveyor roller conditions and provides a foundation for validating the effectiveness of audio features in fault detection.
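As a hedged sketch of how the CSV feature datasets might feed a classifier (the file names below are placeholders, and the assumption that each row is one frame's 25-feature vector should be checked against the actual folder layout):

```python
# Train a simple classifier on frame-level features from the three conditions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

frames = []
for path, label in [
    ("csv_dataset/200ms/normal_roller_01.csv", "normal"),        # placeholder paths
    ("csv_dataset/200ms/roller_shell_cracking.csv", "cracking"),
    ("csv_dataset/200ms/roller_breakage.csv", "breakage"),
]:
    df = pd.read_csv(path)
    df["label"] = label
    frames.append(df)

data = pd.concat(frames, ignore_index=True)
X, y = data.drop(columns=["label"]), data["label"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```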
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This dataset is the "development dataset" for the DCASE 2020 Challenge Task 2 "Unsupervised Detection of Anomalous Sounds for Machine Condition Monitoring" [task description].
The data comprises parts of ToyADMOS and the MIMII Dataset, consisting of the normal/anomalous operating sounds of six types of toy/real machines. Each recording is single-channel audio approximately 10 seconds in length that includes both a target machine's operating sound and environmental noise. The following six types of toy/real machines are used in this task:
Toy-car (ToyADMOS)
Toy-conveyor (ToyADMOS)
Valve (MIMII Dataset)
Pump (MIMII Dataset)
Fan (MIMII Dataset)
Slide rail (MIMII Dataset)
Recording procedure
The ToyADMOS dataset consists of normal/anomalous operating sounds of miniature machines (toys) collected with four microphones, and the MIMII dataset consists of those of real machines collected with eight microphones. Anomalous sounds in these datasets were collected by deliberately damaging the target machines. To simplify the task, we used only the first channel of multi-channel recordings; all recordings are regarded as single-channel recordings from a fixed microphone. The sampling rate of all signals has been downsampled to 16 kHz. From ToyADMOS, we used only IND-type data that contain the operating sounds of the entire operation (i.e., from start to stop) in a recording. We mixed a target machine sound with environmental noise, and only noisy recordings are provided as training/test data. For details of the recording procedure, please refer to the papers on ToyADMOS and the MIMII Dataset.
Data
We first define two important terms in this task: Machine Type and Machine ID. Machine Type means the kind of machine, which in this task is one of six: toy-car, toy-conveyor, valve, pump, fan, and slide rail. Machine ID is the identifier of each individual machine of the same type, of which there are three or four per Machine Type in the training dataset. Each Machine ID's dataset consists of (i) around 1,000 samples of normal sounds for training and (ii) 100-200 samples each of normal and anomalous sounds for the test. The given labels for each training/test sample are Machine Type, Machine ID, and condition (normal/anomaly). Machine Type information is given by the directory name, and Machine ID and condition information are given by the respective file names.
Directory structure
When you unzip the downloaded files from Zenodo, you can see the following directory structure. As described in the previous section, Machine Type information is given by directory name, and Machine ID and condition information are given by file name, as:
/dev_data
/ToyCar
/train (Only normal data for all Machine IDs are included.)
/normal_id_01_00000000.wav
...
/normal_id_01_00000999.wav
/normal_id_02_00000000.wav
...
/normal_id_04_00000999.wav
/test (Normal and anomaly data for all Machine IDs are included.)
/normal_id_01_00000000.wav
...
/normal_id_01_00000349.wav
/anomaly_id_01_00000000.wav
...
/anomaly_id_01_00000263.wav
/normal_id_02_00000000.wav
...
/anomaly_id_04_00000264.wav
/ToyConveyor (The other Machine Types have the same directory structure as ToyCar.)
/fan
/pump
/slider
/valve
The paths of audio files are:
"/dev_data//train/normal_id_[0-9]+.wav"
"/dev_data//test/normal_id_[0-9]+.wav"
"/dev_data//test/anomaly_id_[0-9]+.wav"
For example, the Machine Type and Machine ID of "/ToyCar/train/normal_id_01_00000000.wav" are "ToyCar" and "01", respectively, and its condition is normal. The Machine Type and Machine ID of "/fan/test/anomaly_id_00_00000000.wav" are "fan" and "00", respectively, and its condition is anomalous.
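A small helper showing how that metadata can be recovered from a file path; the regular expression below is an illustration of the naming scheme, not code shipped with the dataset:

```python
# Parse Machine Type, Machine ID and condition from a development-set path.
import re
from pathlib import Path

def parse_dev_path(path: str) -> dict:
    p = Path(path)
    machine_type = p.parts[-3]                                  # e.g. "ToyCar" or "fan"
    m = re.match(r"(normal|anomaly)_id_(\d+)_\d+\.wav", p.name)
    return {"machine_type": machine_type,
            "machine_id": m.group(2),
            "condition": m.group(1)}

print(parse_dev_path("dev_data/ToyCar/train/normal_id_01_00000000.wav"))
# {'machine_type': 'ToyCar', 'machine_id': '01', 'condition': 'normal'}
```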
Baseline system
A simple baseline system is available on the Github repository [URL]. The baseline system provides a simple entry-level approach that gives a reasonable performance in the dataset of Task 2. It is a good starting point, especially for entry-level researchers who want to get familiar with the anomalous-sound-detection task.
Conditions of use
This dataset was created jointly by NTT Corporation and Hitachi, Ltd. and is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
Publication
If you use this dataset, please cite all the following three papers:
Yuma Koizumi, Shoichiro Saito, Noboru Harada, Hisashi Uematsu, and Keisuke Imoto, "ToyADMOS: A Dataset of Miniature-Machine Operating Sounds for Anomalous Sound Detection," in Proc of Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2019. [pdf]
Harsh Purohit, Ryo Tanabe, Kenji Ichige, Takashi Endo, Yuki Nikaido, Kaori Suefusa, and Yohei Kawaguchi, “MIMII Dataset: Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection,” in Proc. 4th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2019. [pdf]
Yuma Koizumi, Yohei Kawaguchi, Keisuke Imoto, Toshiki Nakamura, Yuki Nikaido, Ryo Tanabe, Harsh Purohit, Kaori Suefusa, Takashi Endo, Masahiro Yasuda, and Noboru Harada, "Description and Discussion on DCASE2020 Challenge Task2: Unsupervised Anomalous Sound Detection for Machine Condition Monitoring," in Proc. 5th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2020. [pdf]
Feedback
If there is any problem, please contact us:
Yuma Koizumi, koizumi.yuma@ieee.org
Yohei Kawaguchi, yohei.kawaguchi.xk@hitachi.com
Keisuke Imoto, keisuke.imoto@ieee.org