93 datasets found
  1.

    Corpus de llenguatge ofensiu en català

    • mozilladatacollective.com
    Updated Mar 24, 2026
    Cite
    MDC Curators (2026). Corpus de llenguatge ofensiu en català [Dataset]. https://mozilladatacollective.com/datasets/cmn4s1j5d0091nu07e1hgzwgn
    Dataset updated
    Mar 24, 2026
    Dataset authored and provided by
    MDC Curators
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset consists of sentences tagged as offensive language in the version 25.0 release of Mozilla Common Voice for Catalan.

    The sentences are provided to support the development of offensive language detection in Catalan.

  2.

    Mozilla Common Voice Text Language Identification dataset

    • mozilladatacollective.com
    Updated Dec 16, 2025
    Cite
    Common Voice (2025). Mozilla Common Voice Text Language Identification dataset [Dataset]. https://mozilladatacollective.com/datasets/cmj8ddapc02c8mb07l6wyr882
    Dataset updated
    Dec 16, 2025
    Dataset authored and provided by
    Common Voice
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    A dataset for text-based language identification, comprising 19 million sentences in over 300 languages taken from the Mozilla Common Voice scripted (v23) and spontaneous (v1) speech projects.

  3.

    Mozilla Common Voice Spontaneous Speech ASR Shared Task Test Data

    • mozilladatacollective.com
    Updated Dec 1, 2025
    Cite
    Common Voice (2025). Mozilla Common Voice Spontaneous Speech ASR Shared Task Test Data [Dataset]. https://mozilladatacollective.com/datasets/cminc35no007no707hql26lzk
    Dataset updated
    Dec 1, 2025
    Dataset authored and provided by
    Common Voice
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    A bundle of the held-out test data for the Mozilla Common Voice Spontaneous Speech ASR shared task.

  4.

    Common Voice Scripted Speech 25.0 - Aragonese

    • mozilladatacollective.com
    Updated Mar 22, 2026
    Cite
    Common Voice (2026). Common Voice Scripted Speech 25.0 - Aragonese [Dataset]. https://mozilladatacollective.com/datasets/cmn2cpd2m01himm07yj9w1lxn
    Dataset updated
    Mar 22, 2026
    Dataset authored and provided by
    Common Voice
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    A collection of read speech recordings in Aragonese (Aragonés).

  5.

    Common Voice Scripted Speech 25.0 - Wakhi

    • mozilladatacollective.com
    Updated Mar 22, 2026
    Cite
    Common Voice (2026). Common Voice Scripted Speech 25.0 - Wakhi [Dataset]. https://mozilladatacollective.com/datasets/cmn2cq4j601iemm0765824vbi
    Dataset updated
    Mar 22, 2026
    Dataset authored and provided by
    Common Voice
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    A collection of read speech recordings in Wakhi (Wuk̃hikwor).

  6.

    Common Voice Scripted Speech 25.0 - Estonian

    • mozilladatacollective.com
    Updated Mar 22, 2026
    Cite
    Common Voice (2026). Common Voice Scripted Speech 25.0 - Estonian [Dataset]. https://mozilladatacollective.com/datasets/cmn2e880l01kumm07i9upoz99
    Dataset updated
    Mar 22, 2026
    Dataset authored and provided by
    Common Voice
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    A collection of read speech recordings in Estonian (eesti).

  7.

    Common Voice Scripted Speech 25.0 - Dagbani

    • mozilladatacollective.com
    Updated Mar 22, 2026
    Cite
    Common Voice (2026). Common Voice Scripted Speech 25.0 - Dagbani [Dataset]. https://mozilladatacollective.com/datasets/cmn2cy2su01iymm07xfr6ul2b
    Dataset updated
    Mar 22, 2026
    Dataset authored and provided by
    Common Voice
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    A collection of read speech recordings in Dagbani (Dagbanli).

  8.

    Common Voice Scripted Speech 25.0 - Romansh Sursilvan

    • mozilladatacollective.com
    Updated Mar 22, 2026
    Cite
    Common Voice (2026). Common Voice Scripted Speech 25.0 - Romansh Sursilvan [Dataset]. https://mozilladatacollective.com/datasets/cmn2cq76201iimm07avtwokjf
    Dataset updated
    Mar 22, 2026
    Dataset authored and provided by
    Common Voice
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    A collection of read speech recordings in Romansh Sursilvan (romontsch sursilvan).

  9.

    Common Voice Scripted Speech 25.0 - Kabyle

    • mozilladatacollective.com
    Updated Mar 23, 2026
    Cite
    Common Voice (2026). Common Voice Scripted Speech 25.0 - Kabyle [Dataset]. https://mozilladatacollective.com/datasets/cmn38spwm005vmi07bejigyo6
    Dataset updated
    Mar 23, 2026
    Dataset authored and provided by
    Common Voice
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    A collection of read speech recordings in Kabyle (Taqbaylit).

  10.

    Common Voice Scripted Speech 25.0 - Urdu

    • mozilladatacollective.com
    Updated Mar 23, 2026
    Cite
    Common Voice (2026). Common Voice Scripted Speech 25.0 - Urdu [Dataset]. https://mozilladatacollective.com/datasets/cmn2h58bw01mwmm07t3ypteqz
    Dataset updated
    Mar 23, 2026
    Dataset authored and provided by
    Common Voice
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    A collection of read speech recordings in Urdu (اردو).

  11.

    Common Voice Scripted Speech 25.0 - Hindi

    • mozilladatacollective.com
    Updated Mar 22, 2026
    Cite
    Common Voice (2026). Common Voice Scripted Speech 25.0 - Hindi [Dataset]. https://mozilladatacollective.com/datasets/cmn2cxzy701iumm077t5ayw0e
    Dataset updated
    Mar 22, 2026
    Dataset authored and provided by
    Common Voice
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    A collection of read speech recordings in Hindi (हिंदी).

  12.

    Zacatlán Tepetzintla Nahuatl ASR Dataset

    • mozilladatacollective.com
    Updated Feb 18, 2026
    Cite
    Kaltepetlahtol (2026). Zacatlán Tepetzintla Nahuatl ASR Dataset [Dataset]. https://mozilladatacollective.com/datasets/cmls27zfd0043ma07mxvsz8zg
    Dataset updated
    Feb 18, 2026
    Dataset authored and provided by
    Kaltepetlahtol
    License

    Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
    License information was derived automatically

    Area covered
    Zacatlán
    Description

    An ASR dataset of Zacatlán-Ahuacatlán-Tepetzintla (Western Sierra Puebla) Nahuatl, ISO 639-3 nhi. This is a derivative work of the Zacatlán Tepetzintla Nahuatl Audio and Transcriptions datasets. It consists of the subset of the larger audio dataset that has transcriptions (approximately 14 hours), converted to the Mozilla Common Voice Scripted Speech format. The original stereo audio has been split and aligned with the parsed transcriptions.

  13.

    INEL Dolgan Speech Corpus

    • mozilladatacollective.com
    Updated Mar 24, 2026
    Cite
    Institute of Finno-Ugric/Uralic Studies, University of Hamburg (2026). INEL Dolgan Speech Corpus [Dataset]. https://mozilladatacollective.com/datasets/cmn4kqzzt0013nu07caxllg3t
    Dataset updated
    Mar 24, 2026
    Dataset authored and provided by
    Institute of Finno-Ugric/Uralic Studies, University of Hamburg
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This dataset is a machine-learning-ready subset of the INEL Dolgan Corpus (Version 2.0), processed specifically for Automatic Speech Recognition (ASR) / Speech-to-Text (STT) training. It translates the highly detailed EXMARaLDA XML annotations into the standard tabular layout utilized by Mozilla Common Voice. The dataset comprises 13 hours and 5 minutes of perfectly aligned supervised speech data (10,609 individual clips) across recordings spanning from the 1970s to 2017. It features demographic metadata where available, and prioritizes Cyrillic transcriptions while falling back to Latin or Phonological tiers to ensure complete text coverage for acoustic modeling.

  14.

    Common Voice Spontaneous Speech 3.0 - Russian

    • mozilladatacollective.com
    Updated Mar 22, 2026
    Cite
    Common Voice (2026). Common Voice Spontaneous Speech 3.0 - Russian [Dataset]. https://mozilladatacollective.com/datasets/cmn1pnb4n00vfmm07eqydzilq
    Dataset updated
    Mar 22, 2026
    Dataset authored and provided by
    Common Voice
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    A collection of spontaneous responses to questions in Russian (Русский).

  15.

    Common Voice Scripted Speech 25.0 - Russian

    • mozilladatacollective.com
    Updated Mar 23, 2026
    Cite
    Common Voice (2026). Common Voice Scripted Speech 25.0 - Russian [Dataset]. https://mozilladatacollective.com/datasets/cmn2h1dg201gro107lpynbbd6
    Dataset updated
    Mar 23, 2026
    Dataset authored and provided by
    Common Voice
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    A collection of read speech recordings in Russian (Русский).

  16.

    Common Voice Scripted Speech 25.0 - Mongolian

    • mozilladatacollective.com
    Updated Mar 22, 2026
    Cite
    Common Voice (2026). Common Voice Scripted Speech 25.0 - Mongolian [Dataset]. https://mozilladatacollective.com/datasets/cmn2e7nxs01k6mm07you99zve
    Dataset updated
    Mar 22, 2026
    Dataset authored and provided by
    Common Voice
    License

    https://spdx.org/licenses/CC0-1.0.html

    Area covered
    Mongolia
    Description

    A collection of read speech recordings in Mongolian (Монгол хэл).

  17.

    Common Voice Scripted Speech 25.0 - Adyghe

    • mozilladatacollective.com
    Updated Mar 22, 2026
    Cite
    Common Voice (2026). Common Voice Scripted Speech 25.0 - Adyghe [Dataset]. https://mozilladatacollective.com/datasets/cmn2e80ea01kmmm07lzsoe5z9
    Dataset updated
    Mar 22, 2026
    Dataset authored and provided by
    Common Voice
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    A collection of read speech recordings in Adyghe (Адыгабзэ).

  18.

    Common Voice Scripted Speech 25.0 - Indonesian

    • mozilladatacollective.com
    Updated Mar 22, 2026
    Cite
    Common Voice (2026). Common Voice Scripted Speech 25.0 - Indonesian [Dataset]. https://mozilladatacollective.com/datasets/cmn2e8ats01eno107glwgoasv
    Dataset updated
    Mar 22, 2026
    Dataset authored and provided by
    Common Voice
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    A collection of read speech recordings in Indonesian (Bahasa Indonesia).

  19.

    Common Voice Scripted Speech 25.0 - Kalenjin

    • mozilladatacollective.com
    Updated Mar 22, 2026
    Cite
    Common Voice (2026). Common Voice Scripted Speech 25.0 - Kalenjin [Dataset]. https://mozilladatacollective.com/datasets/cmn2e84ge01kqmm07esyp21xq
    Dataset updated
    Mar 22, 2026
    Dataset authored and provided by
    Common Voice
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    A collection of read speech recordings in Kalenjin (kln).

  20.

    Common Voice Spontaneous Speech 3.0 - Kuku

    • mozilladatacollective.com
    Updated Mar 20, 2026
    Cite
    Common Voice (2026). Common Voice Spontaneous Speech 3.0 - Kuku [Dataset]. https://mozilladatacollective.com/datasets/cmmytfqkp00emnz072015yasf
    Dataset updated
    Mar 20, 2026
    Dataset authored and provided by
    Common Voice
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    A collection of spontaneous responses to questions in Kuku (ukv).
