7 datasets found

common_voice_12_0
huggingface.co
Updated Mar 24, 2023
Cite
Mozilla Foundation (2023). common_voice_12_0 [Dataset]. https://huggingface.co/datasets/mozilla-foundation/common_voice_12_0
Dataset updated
Mar 24, 2023
Dataset authored and provided by
Mozilla Foundation (http://mozilla.org/)
License
CC0 1.0 (https://choosealicense.com/licenses/cc0-1.0/)
Description
Dataset Card for Common Voice Corpus 12.0

Dataset Summary

The Common Voice dataset consists of a unique MP3 and corresponding text file. Many of the 26119 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 17127 validated hours in 104 languages, but more voices and languages are always added. Take a look at the Languages page to… See the full description on the dataset page: https://huggingface.co/datasets/mozilla-foundation/common_voice_12_0.
common_voice_11_0
huggingface.co
Updated Nov 3, 2022
Cite
Mozilla Foundation (2022). common_voice_11_0 [Dataset]. https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0
Dataset updated
Nov 3, 2022
Dataset authored and provided by
Mozilla Foundation (http://mozilla.org/)
License
CC0 1.0 (https://choosealicense.com/licenses/cc0-1.0/)
Description
Dataset Card for Common Voice Corpus 11.0

Dataset Summary

The Common Voice dataset consists of a unique MP3 and corresponding text file. Many of the 24210 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 16413 validated hours in 100 languages, but more voices and languages are always added. Take a look at the Languages page to… See the full description on the dataset page: https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0.
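The two summaries above quote recorded hours, validated hours, and language counts for releases 11.0 and 12.0. As a small illustrative sketch (the dictionary layout and the function names `validated_share` and `growth` are assumptions made for this example, not part of either dataset card), the figures can be compared like this:

```python
# Figures copied from the two dataset summaries above.
RELEASES = {
    "common_voice_11_0": {"recorded_hours": 24210, "validated_hours": 16413, "languages": 100},
    "common_voice_12_0": {"recorded_hours": 26119, "validated_hours": 17127, "languages": 104},
}


def validated_share(stats):
    """Return the fraction of recorded hours that are validated."""
    return stats["validated_hours"] / stats["recorded_hours"]


def growth(old, new):
    """Return the per-field difference between two releases (new minus old)."""
    return {key: new[key] - old[key] for key in old}


if __name__ == "__main__":
    for name, stats in RELEASES.items():
        print(f"{name}: {validated_share(stats):.1%} of recorded hours are validated")
    print(growth(RELEASES["common_voice_11_0"], RELEASES["common_voice_12_0"]))
```

On these numbers, release 12.0 adds 1909 recorded hours, 714 validated hours, and 4 languages over release 11.0.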
Arabic Speech Dataset - Mozilla Common Voice
kaggle.com
Updated Jul 2, 2025
Cite
Mayar Jao (2025). Arabic Speech Dataset - Mozilla Common Voice [Dataset]. https://www.kaggle.com/datasets/mayarjao/arabic-tts/code
Dataset updated
Jul 2, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Mayar Jao
License
CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
Description
📌 Attribution

This dataset is the Arabic subset of the Mozilla Common Voice project, version 11.0.

🔗 Original source: https://commonvoice.mozilla.org/en/datasets

📝 License: CC0 1.0 (Public Domain Dedication)

Common Voice is an open-source voice dataset project by the Mozilla Foundation, aiming to make voice technology more inclusive.
CommonVoicesDelta21_ro
huggingface.co
Updated Jun 21, 2025
Cite
Ionut Visan (2025). CommonVoicesDelta21_ro [Dataset]. https://huggingface.co/datasets/ionut-visan/CommonVoicesDelta21_ro
Dataset updated
Jun 21, 2025
Authors
Ionut Visan
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Common Voices Delta 21.0 (Romanian)

Common Voices is an open-source dataset of speech recordings created by Mozilla to improve speech recognition technologies. It consists of crowdsourced voice samples in multiple languages, contributed by volunteers worldwide.

Challenges: The raw dataset included numerous recordings with incorrect transcriptions or those requiring adjustments, such as sampling rate modifications, conversion to .wav format, and other refinements essential… See the full description on the dataset page: https://huggingface.co/datasets/ionut-visan/CommonVoicesDelta21_ro.
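The description above mentions sampling rate modifications and conversion to .wav format but is truncated before it says how they were done. The following is only a minimal standard-library sketch of what such steps can look like. The helper names `resample_linear` and `write_wav_mono16` are hypothetical, and the naive linear interpolation used here (with no anti-aliasing filter) is an assumption for illustration, not the dataset author's method.

```python
import io
import struct
import wave


def resample_linear(samples, src_rate, dst_rate):
    """Resample a list of float samples by linear interpolation (no anti-aliasing)."""
    if not samples:
        return []
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate      # position in the source signal
        left = min(int(pos), last)
        right = min(left + 1, last)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


def write_wav_mono16(file_obj, samples, sample_rate):
    """Write float samples in [-1.0, 1.0] as a mono 16-bit PCM WAV."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    pcm = struct.pack("<%dh" % len(clipped), *(int(s * 32767) for s in clipped))
    with wave.open(file_obj, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)                # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)


if __name__ == "__main__":
    # Downsample a short 48 kHz ramp to 16 kHz and write it to an in-memory WAV.
    ramp = [i / 10 for i in range(6)]
    buffer = io.BytesIO()
    write_wav_mono16(buffer, resample_linear(ramp, 48000, 16000), 16000)
    print(len(buffer.getvalue()), "bytes written")
```

For real audio, a dedicated resampler with proper low-pass filtering would normally be preferable to this interpolation.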
CL-MASR
zenodo.org
data.niaid.nih.gov
application/gzip
Updated Jun 28, 2023
Cite
Luca Della Libera; Pooneh Mousavi; Salah Zaiem; Cem Subakan; Mirco Ravanelli (2023). CL-MASR [Dataset]. http://doi.org/10.5281/zenodo.8065754
Available download formats: application/gzip
Unique identifier
https://doi.org/10.5281/zenodo.8065754
Dataset updated
Jun 28, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Luca Della Libera; Pooneh Mousavi; Salah Zaiem; Cem Subakan; Mirco Ravanelli
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
CL-MASR Dataset

This is the dataset used in the continual learning for multilingual ASR (CL-MASR) benchmark. It is composed of speech recordings from 20 languages selected from the Common Voice 13 dataset. For each language, it includes up to 10/1/1 hours for train/dev/test, respectively.

The CL-MASR benchmark platform is available in the SpeechBrain toolkit (see recipes/CommonVoice):
https://github.com/speechbrain/speechbrain

The original Common Voice 13 data are available at:
https://commonvoice.mozilla.org/en/datasets

List of Languages

- English (en)
- Chinese (zh-CN)
- German (de)
- Spanish (es)
- Russian (ru)
- French (fr)
- Portuguese (pt)
- Japanese (ja)
- Turkish (tr)
- Polish (pl)
- Kinyarwanda (rw)
- Esperanto (eo)
- Kabyle (kab)
- Luganda (lg)
- Meadow Mari (mhr)
- Central Kurdish (ckb)
- Abkhaz (ab)
- Kurmanji Kurdish (kmr)
- Frisian (fy-NL)
- Interlingua (ia)
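The description states a per-language cap of up to 10/1/1 hours for train/dev/test across the 20 languages listed above. As a rough sketch, the implied upper bound on total audio can be computed as follows. The `cap_by_duration` helper shows one hypothetical way a duration budget could be enforced on a list of clips. The source does not say how clips were actually selected, so that rule and all names here are assumptions.

```python
# Language codes copied from the list above.
LANGUAGES = [
    "en", "zh-CN", "de", "es", "ru", "fr", "pt", "ja", "tr", "pl",
    "rw", "eo", "kab", "lg", "mhr", "ckb", "ab", "kmr", "fy-NL", "ia",
]

# Per-language caps in hours, as stated in the description (up to 10/1/1).
MAX_HOURS = {"train": 10, "dev": 1, "test": 1}


def max_total_hours(languages, caps):
    """Upper bound on total hours if every language fills every split."""
    return len(languages) * sum(caps.values())


def cap_by_duration(clips, budget_hours):
    """Keep clips in order until adding the next one would exceed the budget.

    `clips` is a list of (clip_id, duration_in_seconds) pairs. This selection
    rule is hypothetical and not taken from the CL-MASR description.
    """
    budget_seconds = budget_hours * 3600
    kept, used = [], 0.0
    for clip_id, seconds in clips:
        if used + seconds > budget_seconds:
            break
        kept.append(clip_id)
        used += seconds
    return kept


if __name__ == "__main__":
    print(max_total_hours(LANGUAGES, MAX_HOURS), "hours at most")
```

Because the caps are upper limits ("up to"), the actual dataset may contain less audio than this bound for languages with little data.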
covost2
huggingface.co
Updated Jun 3, 2024
Cite
AI at Meta (2024). covost2 [Dataset]. https://huggingface.co/datasets/facebook/covost2
Dataset updated
Jun 3, 2024
Dataset authored and provided by
AI at Meta
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
License information was derived automatically
Description
CoVoST 2 is a large-scale multilingual speech translation corpus covering translations from 21 languages into English and from English into 15 languages. The dataset is created using Mozilla’s open-source Common Voice database of crowdsourced voice recordings.

Note that, to limit the storage required for preparing this dataset, the audio is stored in the .mp3 format and is not converted to a float32 array. To convert the audio file to a float32 array, use the .map() function as follows:

import torchaudio

def map_to_array(batch):
    speech_array, _ = torchaudio.load(batch["file"])
    batch["speech"] = speech_array.numpy()
    return batch

dataset = dataset.map(map_to_array, remove_columns=["file"])
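To make concrete what a "float32 array" of audio looks like, here is a minimal standard-library sketch that scales signed 16-bit PCM samples into 32-bit floats. It is an illustration only: the function name `pcm16_to_float32` is hypothetical, it assumes already-decoded little-endian PCM bytes rather than an .mp3 file, and it is not a description of what torchaudio does internally.

```python
import array
import struct


def pcm16_to_float32(raw):
    """Convert little-endian signed 16-bit PCM bytes to float32 samples in [-1.0, 1.0)."""
    if len(raw) % 2:
        raise ValueError("16-bit PCM needs an even number of bytes")
    count = len(raw) // 2
    ints = struct.unpack("<%dh" % count, raw)
    # Typecode "f" stores each sample as a 32-bit float.
    return array.array("f", (sample / 32768.0 for sample in ints))


if __name__ == "__main__":
    raw = struct.pack("<4h", 0, 16384, -16384, -32768)
    print(list(pcm16_to_float32(raw)))  # prints [0.0, 0.5, -0.5, -1.0]
```

Dividing by 32768 maps the full int16 range onto [-1.0, 1.0), a common convention for floating-point audio.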
Common_Voice_Corpus_22_0_Urdu
huggingface.co
Updated Sep 4, 2025
Cite
Azeem Ahmed (2025). Common_Voice_Corpus_22_0_Urdu [Dataset]. https://huggingface.co/datasets/azeem-ahmed/Common_Voice_Corpus_22_0_Urdu
Dataset updated
Sep 4, 2025
Authors
Azeem Ahmed
License
CC0 1.0 (https://choosealicense.com/licenses/cc0-1.0/)
Description
Common Voice Corpus 22.0 - Urdu

This dataset contains the Urdu subset of the Mozilla Common Voice 22.0 corpus, released in June 2025. It consists of crowdsourced speech recordings and their corresponding text transcriptions, collected to support open-source speech technology.

Dataset Summary

The Common Voice Corpus 22.0 Urdu dataset provides high-quality speech data for automatic speech recognition (ASR), speaker identification, and linguistic research in Urdu. It includes… See the full description on the dataset page: https://huggingface.co/datasets/azeem-ahmed/Common_Voice_Corpus_22_0_Urdu.
4 scholarly articles cite the common_voice_12_0 dataset.