https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for Common Voice Corpus 12.0
Dataset Summary
Each entry in the Common Voice dataset consists of a unique MP3 and a corresponding text file. Many of the 26119 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 17127 validated hours in 104 languages, and more voices and languages are continually being added. Take a look at the Languages page to… See the full description on the dataset page: https://huggingface.co/datasets/mozilla-foundation/common_voice_12_0.
https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for Common Voice Corpus 11.0
Dataset Summary
Each entry in the Common Voice dataset consists of a unique MP3 and a corresponding text file. Many of the 24210 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 16413 validated hours in 100 languages, and more voices and languages are continually being added. Take a look at the Languages page to… See the full description on the dataset page: https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0.
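The hour counts quoted in the 12.0 and 11.0 summaries can be compared with a little arithmetic. The sketch below only restates the figures given above; the dictionary layout and the `validated_share` helper are illustrative names, not part of any Common Voice tooling.

```python
# Figures copied from the two dataset summaries above.
releases = {
    "11.0": {"recorded_hours": 24210, "validated_hours": 16413, "languages": 100},
    "12.0": {"recorded_hours": 26119, "validated_hours": 17127, "languages": 104},
}


def validated_share(stats: dict) -> float:
    """Fraction of recorded hours that have been validated."""
    return stats["validated_hours"] / stats["recorded_hours"]


for version, stats in releases.items():
    share = validated_share(stats)
    print(f"Common Voice {version}: {share:.1%} of recorded hours validated")

# Validated hours added between the two releases.
growth = releases["12.0"]["validated_hours"] - releases["11.0"]["validated_hours"]
print(f"Validated hours added from 11.0 to 12.0: {growth}")
```

Roughly two thirds of the recorded hours are validated in both releases, so the recorded total overstates the amount of data that is usable for supervised training.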
https://creativecommons.org/publicdomain/zero/1.0/
This dataset is the Arabic subset of the Mozilla Common Voice project, version 11.0.
🔗 Original source: https://commonvoice.mozilla.org/en/datasets
📝 License: CC0 1.0 (Public Domain Dedication)
Common Voice is an open-source voice dataset project by the Mozilla Foundation, aiming to make voice technology more inclusive.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Common Voices Delta 21.0 (Romanian)
Common Voice is an open-source dataset of speech recordings created by Mozilla to improve speech recognition technologies. It consists of crowdsourced voice samples in multiple languages, contributed by volunteers worldwide.
Challenges: The raw dataset included numerous recordings with incorrect transcriptions or those requiring adjustments, such as sampling rate modifications, conversion to .wav format, and other refinements essential… See the full description on the dataset page: https://huggingface.co/datasets/ionut-visan/CommonVoicesDelta21_ro.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CL-MASR Dataset
This is the dataset used in the continual learning for multilingual ASR (CL-MASR) benchmark. It is composed of speech recordings from 20 languages selected from the Common Voice 13 dataset. For each language, it includes up to 10/1/1 hours for train/dev/test, respectively.
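The per-language caps of 10/1/1 hours can be pictured as a duration budget applied to each split. The description does not say how clips were selected, so the sketch below makes an assumption: clips are taken in order until the next one would exceed the budget. `Clip`, `SPLIT_BUDGET_HOURS`, and `cap_split` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    path: str
    duration_s: float  # clip length in seconds


# Hour caps per split, as stated in the description (train/dev/test).
SPLIT_BUDGET_HOURS = {"train": 10.0, "dev": 1.0, "test": 1.0}


def cap_split(clips: list[Clip], budget_hours: float) -> list[Clip]:
    """Keep clips in order until adding the next one would exceed the budget.

    Assumption: greedy in-order selection. The benchmark authors may have
    sampled clips differently.
    """
    budget_s = budget_hours * 3600.0
    kept: list[Clip] = []
    total_s = 0.0
    for clip in clips:
        if total_s + clip.duration_s > budget_s:
            break
        kept.append(clip)
        total_s += clip.duration_s
    return kept


# Usage: 1000 made-up five-second clips, capped to the one-hour dev budget.
demo_clips = [Clip(f"clip_{i}.mp3", 5.0) for i in range(1000)]
dev_clips = cap_split(demo_clips, SPLIT_BUDGET_HOURS["dev"])
print(len(dev_clips), sum(c.duration_s for c in dev_clips) / 3600.0)
```

A language with less data than the budget simply keeps everything, which matches the "up to" wording in the description.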
The CL-MASR benchmark platform is available in the SpeechBrain toolkit (see recipes/CommonVoice):
https://github.com/speechbrain/speechbrain
The original Common Voice 13 data are available at:
https://commonvoice.mozilla.org/en/datasets
List of Languages
- English (en)
- Chinese (zh-CN)
- German (de)
- Spanish (es)
- Russian (ru)
- French (fr)
- Portuguese (pt)
- Japanese (ja)
- Turkish (tr)
- Polish (pl)
- Kinyarwanda (rw)
- Esperanto (eo)
- Kabyle (kab)
- Luganda (lg)
- Meadow Mari (mhr)
- Central Kurdish (ckb)
- Abkhaz (ab)
- Kurmanji Kurdish (kmr)
- Frisian (fy-NL)
- Interlingua (ia)
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
CoVoST 2 is a large-scale multilingual speech translation corpus covering translations from 21 languages into English and from English into 15 languages. The dataset was created using Mozilla's open-source Common Voice database of crowdsourced voice recordings.
Note that, to limit the storage required for preparing this dataset, the audio is stored in the .mp3 format and is not converted to a float32 array. To convert the audio file to a float32 array, use the .map() function as follows:
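import torchaudio

def map_to_array(batch):
    # Decode the audio at batch["file"]; the returned sample rate is discarded.
    speech_array, _ = torchaudio.load(batch["file"])
    batch["speech"] = speech_array.numpy()
    return batch

# Apply to every example and drop the original "file" column.
dataset = dataset.map(map_to_array, remove_columns=["file"])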
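To make concrete what "converted to a float array" means, here is a minimal sketch that uses only the Python standard library. It is not the CoVoST 2 method: it assumes uncompressed 16-bit PCM WAV input rather than MP3 (MP3 decoding needs a library such as torchaudio), and `wav_bytes_to_floats` is a hypothetical helper name.

```python
import array
import io
import sys
import wave


def wav_bytes_to_floats(data: bytes) -> tuple[list[float], int]:
    """Decode 16-bit PCM WAV bytes into floats in [-1.0, 1.0) plus the sample rate.

    Assumption: 16-bit samples. For multi-channel audio the returned
    samples are interleaved, exactly as stored in the file.
    """
    with wave.open(io.BytesIO(data), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        sample_rate = wav.getframerate()
        pcm = array.array("h")
        pcm.frombytes(wav.readframes(wav.getnframes()))
    # WAV data is little-endian; array("h") uses the machine's byte order.
    if sys.byteorder == "big":
        pcm.byteswap()
    # Scale signed 16-bit integers to the conventional [-1.0, 1.0) range.
    return [sample / 32768.0 for sample in pcm], sample_rate
```

Unlike the snippet above, which discards the sample rate, this helper returns it, because downstream models usually expect a specific rate.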
https://choosealicense.com/licenses/cc0-1.0/
Common Voice Corpus 22.0 - Urdu
This dataset contains the Urdu subset of the Mozilla Common Voice 22.0 corpus, released in June 2025. It consists of crowdsourced speech recordings and their corresponding text transcriptions, collected to support open-source speech technology.
Dataset Summary
The Common Voice Corpus 22.0 Urdu dataset provides high-quality speech data for automatic speech recognition (ASR), speaker identification, and linguistic research in Urdu. It includes… See the full description on the dataset page: https://huggingface.co/datasets/azeem-ahmed/Common_Voice_Corpus_22_0_Urdu.