3 datasets found

MaSS - Multilingual corpus of Sentence-aligned Spoken utterances
live.european-language-grid.eu
zenodo.org
npy
Updated Aug 28, 2022
Cite
(2022). MaSS - Multilingual corpus of Sentence-aligned Spoken utterances [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7722
Available download formats: npy
Dataset updated
Aug 28, 2022
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Abstract

The CMU Wilderness Multilingual Speech Dataset is a newly published multilingual speech dataset based on recorded readings of the New Testament. It provides data to build Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) models for potentially 700 languages. However, the fact that the source content (the Bible) is the same for all the languages has not been exploited to date. Therefore, this article proposes to add multilingual links between speech segments in different languages, and shares a large and clean dataset of 8,130 parallel spoken utterances across 8 languages (56 language pairs). We name this corpus MaSS (Multilingual corpus of Sentence-aligned Spoken utterances). The covered languages (Basque, English, Finnish, French, Hungarian, Romanian, Russian and Spanish) allow research on speech-to-speech alignment as well as on translation for syntactically divergent language pairs. The quality of the final corpus is attested by human evaluation performed on a corpus subset (100 utterances, 8 language pairs).

Paper | GitHub Repository containing the scripts needed to build the data set from scratch (if needed)

Project structure

This repository contains 8 NumPy files, one for each featured language, pickled with Python 3.6. Each line corresponds to the spectrogram of the file mentioned in the file verses.csv. There is a direct mapping between the ID of the verse and its index in the list (thus the verse with ID 5634 is located at index 5634 in the NumPy file). Verses not available for a given language (as stated by the value "Not Available" in the CSV file) are represented by empty lists in the NumPy files, thus ensuring a perfect verse-to-verse alignment between the files.

Spectrograms were extracted using Librosa with the following parameters:

Pre-emphasis = 0.97
Sample rate = 16000
Window size = 0.025
Window stride = 0.01
Window type = 'hamming'
Mel coefficients = 40
Min frequency = 20
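The verse-ID-to-index mapping described above can be sketched as follows. This is a minimal illustration, not the repository's own loading code: it builds a small synthetic stand-in for a language file, and the file name, the helper names (make_toy_corpus, get_spectrogram, shared_verse_ids) and the array shapes are assumptions made for the example. The only details taken from the description are that entries are indexed by verse ID, that missing verses are empty lists, and that the files are pickled NumPy data (so loading them most likely needs allow_pickle=True).

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)


def make_toy_corpus(missing_ids, n_verses=4, n_mels=40):
    """Build a synthetic stand-in for one language file (hypothetical helper).

    Each entry is a (frames x n_mels) array, or an empty list when the verse
    is not available, mirroring the layout the description states.
    """
    corpus = np.empty(n_verses, dtype=object)
    for verse_id in range(n_verses):
        if verse_id in missing_ids:
            corpus[verse_id] = []
        else:
            corpus[verse_id] = rng.random((50 + verse_id, n_mels))
    return corpus


def get_spectrogram(corpus, verse_id):
    """Return the spectrogram stored at index verse_id, or None if missing."""
    entry = corpus[verse_id]
    if len(entry) == 0:
        return None
    return np.asarray(entry)


def shared_verse_ids(corpus_a, corpus_b):
    """List verse IDs that are available in both languages."""
    n = min(len(corpus_a), len(corpus_b))
    return [i for i in range(n)
            if len(corpus_a[i]) > 0 and len(corpus_b[i]) > 0]


# Save and reload one toy file to mimic reading a pickled .npy from disk.
path = os.path.join(tempfile.mkdtemp(), "toy_language.npy")  # hypothetical name
np.save(path, make_toy_corpus(missing_ids={1}), allow_pickle=True)
lang_a = np.load(path, allow_pickle=True)  # object arrays need allow_pickle
lang_b = make_toy_corpus(missing_ids={3})

print(get_spectrogram(lang_a, 1))        # None: verse 1 is missing here
print(get_spectrogram(lang_a, 2).shape)  # (52, 40)
print(shared_verse_ids(lang_a, lang_b))  # [0, 2]
```

Because both files share the same indexing, finding parallel utterances only requires checking that the entry at a given index is non-empty in each language.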
MaSS
opendatalab.com
paperswithcode.com
zip
Updated Mar 7, 2023
Cite
Grenoble Alpes University (2023). MaSS [Dataset]. https://opendatalab.com/OpenDataLab/MaSS
Available download formats: zip
Dataset updated
Mar 7, 2023
Dataset provided by
Grenoble Alpes University
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
MaSS (Multilingual corpus of Sentence-aligned Spoken utterances) is an extension of the CMU Wilderness Multilingual Speech Dataset, a speech dataset based on recorded readings of the New Testament. MaSS extends it by providing a large and clean dataset of 8,130 parallel spoken utterances across 8 languages (56 language pairs). The covered languages are: Basque, English, Finnish, French, Hungarian, Romanian, Russian and Spanish.
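The figure of 56 language pairs is consistent with counting ordered (source, target) pairs, since 8 × 7 = 56. That reading is an inference from the arithmetic rather than something this listing states. A short check:

```python
from itertools import combinations, permutations

languages = ["Basque", "English", "Finnish", "French",
             "Hungarian", "Romanian", "Russian", "Spanish"]

# Ordered pairs distinguish direction, e.g. (French, English) vs (English, French).
directed_pairs = list(permutations(languages, 2))
# Unordered pairs ignore direction.
undirected_pairs = list(combinations(languages, 2))

print(len(directed_pairs))    # 56
print(len(undirected_pairs))  # 28
```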
lomwe-speech-text
huggingface.co
Updated Jun 26, 2025
Cite
Mich-Seth Owusu (2025). lomwe-speech-text [Dataset]. https://huggingface.co/datasets/michsethowusu/lomwe-speech-text
Dataset updated
Jun 26, 2025
Authors
Mich-Seth Owusu
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Lomwe Speech-Text Parallel Dataset

This dataset is a collection of aligned audio-text pairs in Lomwe, extracted from the CMU Wilderness dataset. It is useful for tasks such as:

Speech recognition (ASR)
Text-to-speech (TTS)
Language modeling for low-resource languages

Dataset Structure

Each entry in the dataset contains:

audio: A .wav file sampled at 16 kHz
text: A transcription of the spoken audio in Lomwe (digits removed)
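One way to sanity-check an entry against the two stated properties (16 kHz audio, no digits in the text) is sketched below using only the Python standard library. It is an illustration under assumptions, not the dataset's own tooling: the helper names are hypothetical, the audio is a synthetic silent clip generated in memory, and the text is a placeholder rather than real Lomwe data.

```python
import io
import re
import wave

EXPECTED_RATE = 16000  # 16 kHz, as stated in the dataset description


def make_silent_wav(rate, n_frames=1600):
    """Create an in-memory mono 16-bit WAV so the example runs without data."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        writer.setframerate(rate)
        writer.writeframes(b"\x00\x00" * n_frames)
    buf.seek(0)
    return buf


def check_entry(audio_file, text):
    """Report whether an (audio, text) entry matches the stated properties."""
    with wave.open(audio_file, "rb") as reader:
        rate = reader.getframerate()
        duration_s = reader.getnframes() / rate
    return {
        "rate_ok": rate == EXPECTED_RATE,
        "duration_s": duration_s,
        "text_has_digits": bool(re.search(r"\d", text)),
    }


good = check_entry(make_silent_wav(16000), "placeholder transcription")
bad = check_entry(make_silent_wav(8000), "placeholder 12")
print(good["rate_ok"], good["text_has_digits"])  # True False
print(bad["rate_ok"], bad["text_has_digits"])    # False True
```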

Example

audio text… See the full description on the dataset page: https://huggingface.co/datasets/michsethowusu/lomwe-speech-text.

MaSS - Multilingual corpus of Sentence-aligned Spoken utterances

8 scholarly articles cite this dataset (View in Google Scholar)
