7 datasets found
  1. common_voice_12_0

    • huggingface.co
    Updated Mar 24, 2023
    Cite
    Mozilla Foundation (2023). common_voice_12_0 [Dataset]. https://huggingface.co/datasets/mozilla-foundation/common_voice_12_0
    Dataset updated
    Mar 24, 2023
    Dataset authored and provided by
    Mozilla Foundation (http://mozilla.org/)
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    Dataset Card for Common Voice Corpus 12.0

      Dataset Summary
    

    Each entry in the Common Voice dataset consists of a unique MP3 and a corresponding text file. Many of the 26119 recorded hours in the dataset also include demographic metadata such as age, sex, and accent, which can help improve the accuracy of speech recognition engines. The dataset currently consists of 17127 validated hours in 104 languages, and more voices and languages are continually being added. Take a look at the Languages page to… See the full description on the dataset page: https://huggingface.co/datasets/mozilla-foundation/common_voice_12_0.

  2. common_voice_11_0

    • huggingface.co
    Updated Nov 3, 2022
    Cite
    Mozilla Foundation (2022). common_voice_11_0 [Dataset]. https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0
    Dataset updated
    Nov 3, 2022
    Dataset authored and provided by
    Mozilla Foundation (http://mozilla.org/)
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    Dataset Card for Common Voice Corpus 11.0

      Dataset Summary
    

    Each entry in the Common Voice dataset consists of a unique MP3 and a corresponding text file. Many of the 24210 recorded hours in the dataset also include demographic metadata such as age, sex, and accent, which can help improve the accuracy of speech recognition engines. The dataset currently consists of 16413 validated hours in 100 languages, and more voices and languages are continually being added. Take a look at the Languages page to… See the full description on the dataset page: https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0.
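
    The two summaries above quote recorded and validated hours for Common Voice 11.0 and 12.0. As a rough illustration of how those figures relate, the sketch below computes the validated share of each release. The dictionary and function names are hypothetical; only the four hour counts come from the dataset descriptions.

```python
# Hours quoted in the two dataset summaries: (recorded, validated).
HOURS = {
    "common_voice_11_0": (24210, 16413),
    "common_voice_12_0": (26119, 17127),
}


def validated_share(recorded: int, validated: int) -> float:
    """Return validated hours as a fraction of recorded hours."""
    return validated / recorded


for name, (recorded, validated) in HOURS.items():
    share = validated_share(recorded, validated)
    print(f"{name}: {share:.1%} of recorded hours are validated")
```

    The share is only a coarse indicator: it says nothing about how the validated hours are spread across the individual languages.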

  3. Arabic Speech Dataset - Mozilla Common Voice

    • kaggle.com
    Updated Jul 2, 2025
    Cite
    Mayar Jao (2025). Arabic Speech Dataset - Mozilla Common Voice [Dataset]. https://www.kaggle.com/datasets/mayarjao/arabic-tts/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 2, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mayar Jao
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    📌 Attribution

    This dataset is the Arabic subset of the Mozilla Common Voice project, version 11.0.

    🔗 Original source: https://commonvoice.mozilla.org/en/datasets

    📝 License: CC0 1.0 (Public Domain Dedication)

    Common Voice is an open-source voice dataset project by the Mozilla Foundation, aiming to make voice technology more inclusive.

  4. CommonVoicesDelta21_ro

    • huggingface.co
    Updated Jun 21, 2025
    Cite
    Ionut Visan (2025). CommonVoicesDelta21_ro [Dataset]. https://huggingface.co/datasets/ionut-visan/CommonVoicesDelta21_ro
    Dataset updated
    Jun 21, 2025
    Authors
    Ionut Visan
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Common Voices Delta 21.0 (Romanian)

    Common Voice is an open-source dataset of speech recordings created by Mozilla to improve speech recognition technologies. It consists of crowdsourced voice samples in multiple languages, contributed by volunteers worldwide.

    Challenges: The raw dataset included numerous recordings with incorrect transcriptions, as well as recordings that required adjustments such as sampling rate modifications, conversion to .wav format, and other refinements essential… See the full description on the dataset page: https://huggingface.co/datasets/ionut-visan/CommonVoicesDelta21_ro.

  5. CL-MASR

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Jun 28, 2023
    Cite
    Luca Della Libera; Pooneh Mousavi; Salah Zaiem; Cem Subakan; Mirco Ravanelli (2023). CL-MASR [Dataset]. http://doi.org/10.5281/zenodo.8065754
    Available download formats: application/gzip
    Dataset updated
    Jun 28, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Luca Della Libera; Pooneh Mousavi; Salah Zaiem; Cem Subakan; Mirco Ravanelli
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    CL-MASR Dataset

    This is the dataset used in the continual learning for multilingual ASR (CL-MASR) benchmark. It is composed of speech recordings from 20 languages selected from the Common Voice 13 dataset. For each language, it includes up to 10/1/1 hours for train/dev/test, respectively.

    The CL-MASR benchmark platform is available in the SpeechBrain toolkit (see recipes/CommonVoice):
    https://github.com/speechbrain/speechbrain

    The original Common Voice 13 data are available at:
    https://commonvoice.mozilla.org/en/datasets

    List of Languages

    - English (en)
    - Chinese (zh-CN)
    - German (de)
    - Spanish (es)
    - Russian (ru)
    - French (fr)
    - Portuguese (pt)
    - Japanese (ja)
    - Turkish (tr)
    - Polish (pl)
    - Kinyarwanda (rw)
    - Esperanto (eo)
    - Kabyle (kab)
    - Luganda (lg)
    - Meadow Mari (mhr)
    - Central Kurdish (ckb)
    - Abkhaz (ab)
    - Kurmanji Kurdish (kmr)
    - Frisian (fy-NL)
    - Interlingua (ia)
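
    The description says each language includes up to 10/1/1 hours for train/dev/test. The sketch below shows one way such a per-split cap could be applied to a list of clips. It is an illustration only, not the CL-MASR authors' selection procedure; the record shape, the function name, and the demo durations are all assumptions.

```python
from typing import Iterable, List, Tuple

# Hypothetical record shape: (clip id, duration in seconds).
Clip = Tuple[str, float]

# Per-split caps in hours, as stated in the dataset description.
SPLIT_BUDGET_HOURS = {"train": 10.0, "dev": 1.0, "test": 1.0}


def take_up_to_hours(clips: Iterable[Clip], max_hours: float) -> List[Clip]:
    """Keep clips in order while the running total stays within max_hours."""
    budget_seconds = max_hours * 3600.0
    kept: List[Clip] = []
    total = 0.0
    for clip_id, seconds in clips:
        if total + seconds > budget_seconds:
            continue  # this clip would overshoot; a later, shorter one may still fit
        kept.append((clip_id, seconds))
        total += seconds
    return kept


# Made-up durations, purely for demonstration.
demo = [("a", 1800.0), ("b", 2400.0), ("c", 1200.0), ("d", 600.0)]
dev_subset = take_up_to_hours(demo, SPLIT_BUDGET_HOURS["dev"])
print([clip_id for clip_id, _ in dev_subset])
```

    A greedy pass like this keeps the original clip order; a real pipeline might instead shuffle first or balance speakers before applying the cap.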

  6. covost2

    • huggingface.co
    Updated Jun 3, 2024
    Cite
    AI at Meta (2024). covost2 [Dataset]. https://huggingface.co/datasets/facebook/covost2
    Dataset updated
    Jun 3, 2024
    Dataset authored and provided by
    AI at Meta
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    CoVoST 2 is a large-scale multilingual speech translation corpus covering translations from 21 languages into English and from English into 15 languages. The dataset was created using Mozilla's open-source Common Voice database of crowdsourced voice recordings.

    Note that, to limit the storage required for preparing this dataset, the audio is stored in .mp3 format and is not converted to a float32 array. To convert an audio file to a float32 array, use the .map() function as follows:

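    import torchaudio

    def map_to_array(batch):
      # Decode the audio file; torchaudio.load returns (waveform, sample_rate).
      speech_array, _ = torchaudio.load(batch["file"])
      # Store the decoded waveform as a NumPy array.
      batch["speech"] = speech_array.numpy()
      return batch

    # `dataset` is assumed to be an already loaded CoVoST 2 split.
    dataset = dataset.map(map_to_array, remove_columns=["file"])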
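
    To make the effect of the .map() call with remove_columns easier to see, here is a minimal pure-Python stand-in that applies a function to each record and then drops the listed columns. It is a simplified sketch, not the library's implementation; map_records, fake_decode, and the sample row are hypothetical, and the decoder returns placeholder samples instead of reading audio.

```python
from typing import Callable, Dict, Iterable, List, Sequence

Record = Dict[str, object]


def map_records(
    records: Iterable[Record],
    fn: Callable[[Record], Record],
    remove_columns: Sequence[str] = (),
) -> List[Record]:
    """Apply fn to a copy of each record, then drop the named columns."""
    result: List[Record] = []
    for record in records:
        updated = fn(dict(record))  # copy so the input rows stay untouched
        for column in remove_columns:
            updated.pop(column, None)
        result.append(updated)
    return result


def fake_decode(batch: Record) -> Record:
    """Stand-in for an audio decoder: adds placeholder samples, reads no file."""
    batch["speech"] = [0.0, 0.0, 0.0, 0.0]
    return batch


rows = [{"file": "clip_0.mp3", "sentence": "hello"}]
mapped = map_records(rows, fake_decode, remove_columns=["file"])
print(sorted(mapped[0].keys()))
```

    After mapping, each record carries the new "speech" field and no longer has "file", which mirrors the outcome the snippet above aims for.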
    
  7. Common_Voice_Corpus_22_0_Urdu

    • huggingface.co
    Updated Sep 4, 2025
    Cite
    Azeem Ahmed (2025). Common_Voice_Corpus_22_0_Urdu [Dataset]. https://huggingface.co/datasets/azeem-ahmed/Common_Voice_Corpus_22_0_Urdu
    Dataset updated
    Sep 4, 2025
    Authors
    Azeem Ahmed
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    Common Voice Corpus 22.0 - Urdu

    This dataset contains the Urdu subset of the Mozilla Common Voice 22.0 corpus, released in June 2025. It consists of crowdsourced speech recordings and their corresponding text transcriptions, collected to support open-source speech technology.

      Dataset Summary
    

    The Common Voice Corpus 22.0 Urdu dataset provides high-quality speech data for automatic speech recognition (ASR), speaker identification, and linguistic research in Urdu. It includes… See the full description on the dataset page: https://huggingface.co/datasets/azeem-ahmed/Common_Voice_Corpus_22_0_Urdu.


Cite
Mozilla Foundation (2023). common_voice_12_0 [Dataset]. https://huggingface.co/datasets/mozilla-foundation/common_voice_12_0

common_voice_12_0

Common Voice Corpus 12.0

mozilla-foundation/common_voice_12_0

4 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Mar 24, 2023
Dataset authored and provided by
Mozilla Foundation (http://mozilla.org/)
License

https://choosealicense.com/licenses/cc0-1.0/

