3 datasets found
  1. MaSS - Multilingual corpus of Sentence-aligned Spoken utterances

    • live.european-language-grid.eu
    • zenodo.org
    npy
    Updated Aug 28, 2022
    Cite
    (2022). MaSS - Multilingual corpus of Sentence-aligned Spoken utterances [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7722
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Abstract

    The CMU Wilderness Multilingual Speech Dataset is a newly published multilingual speech dataset based on recorded readings of the New Testament. It provides data to build Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) models for potentially 700 languages. However, the fact that the source content (the Bible) is the same for all languages has not been exploited to date. This article therefore proposes adding multilingual links between speech segments in different languages, and shares a large and clean dataset of 8,130 parallel spoken utterances across 8 languages (56 language pairs). We name this corpus MaSS (Multilingual corpus of Sentence-aligned Spoken utterances). The covered languages (Basque, English, Finnish, French, Hungarian, Romanian, Russian and Spanish) allow research on speech-to-speech alignment as well as on translation for syntactically divergent language pairs. The quality of the final corpus is attested by human evaluation performed on a corpus subset (100 utterances, 8 language pairs).

    Paper | GitHub repository containing the scripts needed to build the data set from scratch (if needed)

    Project structure

    This repository contains 8 NumPy files, one for each featured language, pickled with Python 3.6. Each entry corresponds to the spectrogram of the file listed in verses.csv. There is a direct mapping between the ID of a verse and its index in the list (thus the verse with ID 5634 is located at index 5634 in the NumPy file). Verses not available for a given language (marked "Not Available" in the CSV file) are represented by empty lists in the NumPy files, which ensures a perfect verse-to-verse alignment between files.

    Spectrograms were extracted using Librosa with the following parameters:

    • Pre-emphasis = 0.97
    • Sample rate = 16000
    • Window size = 0.025
    • Window stride = 0.01
    • Window type = 'hamming'
    • Mel coefficients = 40
    • Min frequency = 20
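
The index-based alignment described under "Project structure" can be illustrated with a minimal sketch. It does not use the real MaSS files: it writes two tiny synthetic object arrays and reads them back the way a pickled NumPy file would be read. The file names (english.npy, french.npy), the helper names, and the 40 x 12 spectrogram shape are assumptions made for illustration only.

```python
import os
import tempfile

import numpy as np


def save_language(path, entries):
    """Store one entry per verse ID in an object array and pickle it to .npy."""
    arr = np.empty(len(entries), dtype=object)
    for verse_id, entry in enumerate(entries):
        arr[verse_id] = entry  # a spectrogram array, or [] when unavailable
    np.save(path, arr, allow_pickle=True)


def load_language(path):
    """Load a pickled object array (only enable allow_pickle for trusted files)."""
    return np.load(path, allow_pickle=True)


def parallel_verse_ids(lang_a, lang_b):
    """Return the verse IDs (indices) that are non-empty in both languages."""
    return [
        verse_id
        for verse_id, (a, b) in enumerate(zip(lang_a, lang_b))
        if len(a) > 0 and len(b) > 0
    ]


# Synthetic stand-in for a spectrogram; the (40, 12) shape is an assumption.
spec = np.zeros((40, 12))

# Verse 1 is missing in "English", verse 2 is missing in "French".
english_entries = [spec, [], spec, spec]
french_entries = [spec, spec, [], spec]

with tempfile.TemporaryDirectory() as tmp:
    en_path = os.path.join(tmp, "english.npy")  # hypothetical file name
    fr_path = os.path.join(tmp, "french.npy")   # hypothetical file name
    save_language(en_path, english_entries)
    save_language(fr_path, french_entries)
    english = load_language(en_path)
    french = load_language(fr_path)

shared_ids = parallel_verse_ids(english, french)
print(shared_ids)  # [0, 3]
```

Because every file keeps a slot for every verse ID, the same index can be used across all languages, and a length check is enough to skip verses that are missing on either side.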

  2. MaSS

    • opendatalab.com
    • paperswithcode.com
    zip
    Updated Mar 7, 2023
    Cite
    Grenoble Alpes University (2023). MaSS [Dataset]. https://opendatalab.com/OpenDataLab/MaSS
    Dataset provided by
    Grenoble Alpes University
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    MaSS (Multilingual corpus of Sentence-aligned Spoken utterances) is an extension of the CMU Wilderness Multilingual Speech Dataset, a speech dataset based on recorded readings of the New Testament. MaSS extends it by providing a large and clean dataset of 8,130 parallel spoken utterances across 8 languages (56 language pairs). The covered languages are: Basque, English, Finnish, French, Hungarian, Romanian, Russian and Spanish.
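
The figure of 56 language pairs is consistent with counting ordered (source, target) pairs among the 8 languages, that is 8 × 7. The short sketch below checks that arithmetic; treating direction as significant is an assumption here, since the description does not state how pairs are counted.

```python
from itertools import combinations, permutations

LANGUAGES = [
    "Basque", "English", "Finnish", "French",
    "Hungarian", "Romanian", "Russian", "Spanish",
]

# Ordered pairs: (English, French) and (French, English) count separately.
ordered_pairs = list(permutations(LANGUAGES, 2))

# Unordered pairs: each pair of languages is counted once.
unordered_pairs = list(combinations(LANGUAGES, 2))

print(len(ordered_pairs))    # 56
print(len(unordered_pairs))  # 28
```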

  3. lomwe-speech-text

    • huggingface.co
    Updated Jun 26, 2025
    Cite
    Mich-Seth Owusu (2025). lomwe-speech-text [Dataset]. https://huggingface.co/datasets/michsethowusu/lomwe-speech-text
    Authors
    Mich-Seth Owusu
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Lomwe Speech-Text Parallel Dataset

    This dataset is a collection of aligned audio-text pairs in Lomwe, extracted from the CMU Wilderness dataset. It is useful for tasks such as:

    • Speech recognition (ASR)
    • Text-to-speech (TTS)
    • Language modeling for low-resource languages

    Dataset Structure

    Each entry in the dataset contains:

    • audio: A .wav file sampled at 16 kHz
    • text: A transcription of the spoken audio in Lomwe (digits removed)

    Example

    audio text… See the full description on the dataset page: https://huggingface.co/datasets/michsethowusu/lomwe-speech-text.

MaSS - Multilingual corpus of Sentence-aligned Spoken utterances

8 scholarly articles cite this dataset (View in Google Scholar)