95 datasets found
  1. librispeech_asr

    • huggingface.co
    Updated Jun 3, 2024
    Cite
    OpenSLR (2024). librispeech_asr [Dataset]. https://huggingface.co/datasets/openslr/librispeech_asr
    Dataset authored and provided by
    OpenSLR
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LibriSpeech is a corpus of approximately 1000 hours of read English speech with a sampling rate of 16 kHz, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from read audiobooks from the LibriVox project, and has been carefully segmented and aligned.
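
    A minimal sketch of loading this corpus with the Hugging Face datasets library; the "clean" config and "train.100" split names follow the dataset card's conventions, so verify them there before relying on this:

    from datasets import load_dataset

    # streaming=True iterates over the corpus without downloading it in full
    ds = load_dataset("openslr/librispeech_asr", "clean",
                      split="train.100", streaming=True)

    sample = next(iter(ds))
    print(sample["text"])                    # transcript
    print(sample["audio"]["sampling_rate"])  # 16000 per the description above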

  2. LibriSpeech

    • huggingface.co
    • tensorflow.org
    • +2 more
    Updated Feb 1, 2024
    Cite
    k2-fsa (2024). LibriSpeech [Dataset]. https://huggingface.co/datasets/k2-fsa/LibriSpeech
    Dataset authored and provided by
    k2-fsa
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    LibriSpeech is a corpus of approximately 1000 hours of 16kHz read English speech, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from read audiobooks from the LibriVox project, and has been carefully segmented and aligned. Acoustic models trained on this dataset are available at icefall, and language models suitable for evaluation can be found at OpenSLR. For more information, see the paper "LibriSpeech: an ASR corpus based on public domain audio… See the full description on the dataset page: https://huggingface.co/datasets/k2-fsa/LibriSpeech.

  3. Multilingual LibriSpeech Dataset

    • paperswithcode.com
    Updated Apr 12, 2023
    Cite
    Vineel Pratap; Qiantong Xu; Anuroop Sriram; Gabriel Synnaeve; Ronan Collobert (2023). Multilingual LibriSpeech Dataset [Dataset]. https://paperswithcode.com/dataset/multilingual-librispeech
    Authors
    Vineel Pratap; Qiantong Xu; Anuroop Sriram; Gabriel Synnaeve; Ronan Collobert
    Description

    Multilingual LibriSpeech is a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages: English, German, Dutch, Spanish, French, Italian, Portuguese, and Polish. It includes about 44.5K hours of English and a total of about 6K hours for the other languages.

  4. multilingual_librispeech

    • huggingface.co
    Updated Aug 13, 2024
    Cite
    AI at Meta (2024). multilingual_librispeech [Dataset]. https://huggingface.co/datasets/facebook/multilingual_librispeech
    Dataset authored and provided by
    AI at Meta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for MultiLingual LibriSpeech

      Dataset Summary
    

    This is a streamable version of the Multilingual LibriSpeech (MLS) dataset. The data archives were restructured from the original OpenSLR ones to make them easier to stream. The MLS dataset is a large multilingual corpus suitable for speech research. It is derived from read audiobooks from LibriVox and consists of 8 languages: English, German, Dutch, Spanish, French, Italian, Portuguese, and Polish. It… See the full description on the dataset page: https://huggingface.co/datasets/facebook/multilingual_librispeech.
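
    Since streamability is the point of this version, a short sketch with the Hugging Face datasets library follows; the "german" config name is an assumption based on the MLS language list above, and the "text" field name should be checked against the dataset card:

    from datasets import load_dataset

    # Stream one language config without downloading the archives in full.
    mls = load_dataset("facebook/multilingual_librispeech", "german",
                       split="train", streaming=True)

    first = next(iter(mls))
    print(first["text"])  # transcript field name is an assumption; check the card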

  5. torchaudio-LibriSpeech

    • kaggle.com
    Updated Feb 2, 2025
    Cite
    Beomseong Kim (2025). torchaudio-LibriSpeech [Dataset]. https://www.kaggle.com/datasets/beomseongkim/torchaudio
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Beomseong Kim
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset is sourced from the LibriSpeech corpus, a widely used dataset for Automatic Speech Recognition (ASR) tasks. The dataset consists of high-quality English speech recordings paired with corresponding text transcripts.
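
    Since this mirrors the corpus that torchaudio's built-in loader expects, a sketch of the standard torchaudio API is the natural usage example (pointing it at a local copy of this Kaggle dataset assumes the archive preserves the original LibriSpeech layout):

    import torchaudio

    # Downloads (or reuses) the "test-clean" subset under ./data.
    dataset = torchaudio.datasets.LIBRISPEECH("./data", url="test-clean", download=True)

    waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = dataset[0]
    print(sample_rate, transcript)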

  6. librispeech-alignments

    • huggingface.co
    Updated Mar 12, 2024
    Cite
    Kim Gilkey (2024). librispeech-alignments [Dataset]. https://huggingface.co/datasets/gilkeyio/librispeech-alignments
    Authors
    Kim Gilkey
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for Librispeech Alignments

    Librispeech with alignments generated by the Montreal Forced Aligner. The original alignments in TextGrid format can be found here

      Dataset Details

      Dataset Description

    Librispeech is a corpus of read English speech, designed for training and evaluating automatic speech recognition (ASR) systems. The dataset contains 1000 hours of 16kHz read English speech derived from audiobooks. The Montreal Forced Aligner (MFA) was used… See the full description on the dataset page: https://huggingface.co/datasets/gilkeyio/librispeech-alignments.

  7. librispeech

    • huggingface.co
    Updated Sep 12, 2024
    Cite
    Shi Qundong (2024). librispeech [Dataset]. https://huggingface.co/datasets/TwinkStart/librispeech
    Authors
    Shi Qundong
    Description

    This dataset contains only test data; it is integrated into the UltraEval-Audio (https://github.com/OpenBMB/UltraEval-Audio) framework.

    python audio_evals/main.py --dataset librispeech-test-clean --model gpt4o_audio

    python audio_evals/main.py --dataset librispeech-dev-clean --model gpt4o_audio

    python audio_evals/main.py --dataset librispeech-test-other --model gpt4o_audio

    python audio_evals/main.py --dataset librispeech-dev-other --model gpt4o_audio

      🚀 An exceptional experience awaits in UltraEval-Audio 🚀… See the full description on the dataset page: https://huggingface.co/datasets/TwinkStart/librispeech.
    
  8. LibriStutter

    • search.dataone.org
    • borealisdata.ca
    Updated Dec 28, 2023
    Cite
    Kourkounakis, Tedd (2023). LibriStutter [Dataset]. http://doi.org/10.5683/SP3/NKVOGQ
    Dataset provided by
    Borealis
    Authors
    Kourkounakis, Tedd
    Description

    The LibriStutter dataset is a synthetic dataset derived from the public LibriSpeech ASR corpus. It contains artificially stuttered speech, together with time-aligned transcriptions and stutter classification labels for 5 stutter types, and was created for stuttered-speech classification. It was generated from 20 hours of audio selected from the 'train-clean-100' section of the original corpus, consisting of "clean" speech. All corresponding metadata, including speaker, book, and chapter information, is provided in the original public LibriSpeech ASR corpus.

  9. libriSpeech

    • kaggle.com
    Updated Aug 2, 2024
    Cite
    Thế Hiểu Phạm (2024). libriSpeech [Dataset]. https://www.kaggle.com/datasets/hieugiaosu/librispeech
    Dataset updated
    Aug 2, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Thế Hiểu Phạm
    Description

    Dataset

    This dataset was created by Thế Hiểu Phạm


  10. Translation Augmented LibriSpeech Corpus

    • zenodo.org
    Updated Jul 10, 2022
    Cite
    Ali Can Kocabiyikoglu; Alexandre Bérard; Laurent Besacier; Olivier Kraif (2022). Translation Augmented LibriSpeech Corpus [Dataset]. http://doi.org/10.5281/zenodo.6482585
    Available download formats: zip
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ali Can Kocabiyikoglu; Alexandre Bérard; Laurent Besacier; Olivier Kraif
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Large-scale (>200h) and publicly available read audiobook corpus. This corpus is an augmentation of the LibriSpeech ASR Corpus (1000h) [1] and contains English utterances (from audiobooks) automatically aligned with French text. Our dataset offers ~236h of speech aligned to translated text.


    Overview of the corpus:
    +----------+-------+--------------+----------------+
    | Chapters | Books | Duration (h) | Total Segments |
    +----------+-------+--------------+----------------+
    |     1408 |   247 |        ~236h |         131395 |
    +----------+-------+--------------+----------------+


    Speech recordings and source texts are originally from the Gutenberg Project [2], a digital library of public domain books read by volunteers. Our augmentation of LibriSpeech is straightforward: we automatically aligned e-books in a foreign language (French) with the English utterances of LibriSpeech.

    We gathered open-domain e-books in French and extracted the individual chapters available in the LibriSpeech corpus. We then aligned the French chapters with the English utterances in order to provide a corpus of speech recordings aligned with their translations. Our corpus is licensed under a Creative Commons Attribution 4.0 License.

    Further information on how the corpus was obtained can be found in [3].

    Details on the 100h subset:
    ===========================

    This 100h subset was specifically designed for direct speech translation training and evaluation.
    It was first used in [4] (end-to-end automatic speech translation of audiobooks).
    For this subset, we extracted the best 100h according to cross-language alignment scores. The dev and test sets are composed of clean speech segments only.
    Since English (source) transcriptions are available for LibriSpeech, we also translated them using Google Translate. To summarize, the following quadruplet is available for each utterance of our corpus: the English speech signal, its English transcription (which should not be used for direct speech translation experiments), French text translation 1 (from the alignment of e-books), and French text translation 2 (from MT of the English transcripts).

    +---------+----------+--------+--------+-------+------------+------------------+
    | Corpus  |       Total       |       Source (per seg)      | Target (per seg) |
    |         | segments | hours  | frames | chars | (sub)words | chars            |
    +---------+----------+--------+--------+-------+------------+------------------+
    | train 1 |    47271 | 100:00 |    762 |   111 |       20.7 |              143 |
    | train 2 |          |        |        |       |            |              126 |
    +---------+----------+--------+--------+-------+------------+------------------+
    | dev     |     1071 |   2:00 |    673 |    93 |       17.9 |              110 |
    +---------+----------+--------+--------+-------+------------+------------------+
    | test    |     2048 |   3:44 |    657 |    95 |       18.3 |              112 |
    +---------+----------+--------+--------+-------+------------+------------------+

    The following archives correspond to the 100h subset used in [4]:

    For audio files:

    - train_100h.zip (~8.7GB)
    - dev.zip (~180MB)
    - test.zip (~330MB)
    - train_130h_additional.zip (~10.6GB)

    For aligned text files:

    - train_100h_txt.zip
    - dev_txt.zip
    - test_txt.zip
    - train130h_additional_txt.zip

    Other archives provided:
    ========================

    The following archives are available for other potential uses of the corpus:

    - database.zip (~50MB): database describing the corpus (sqlite3)
    - alignments.zip (~1.86GB): all of the intermediate processing files created in the cross-lingual alignment process, along with the English and French raw e-books
    - audio_files.zip (~23GB): all of the speech segments, organized by book and chapter
    - interface.zip (~72MB): static HTML files for alignment visualisation. With the interface, speech utterances can be listened to while visualising each sentence alignment

    Note: in order to listen to speech segments with the HTML interface, the 'audio_files' folder should be placed inside the 'Interface' folder:
    ./Interface
      ./audio_files (audio_files.zip)
      ./css (interface.zip)
      ./js (interface.zip)
      (..)


    Github Page
    ===========
    We provide a Python script to interact with the database and to extract the corpus with different queries. This script, along with all of the code used for the alignment process, can be found at:
    https://github.com/alicank/Translation-Augmented-LibriSpeech-Corpus

    Detailed Corpus Structure
    =========================

    Folder names follow the book IDs used by the LibriSpeech and Gutenberg projects. For instance, folder "11" corresponds to the ID of "Alice's Adventures in Wonderland" by Lewis Carroll in both the Gutenberg and LibriSpeech projects.


    This corpus is composed of three sections:
    - Audio files: resegmented audio files for each book ID in the project
    - HTML alignment visualisation interface: HTML visualisation of the textual alignments, with the audio files available for listening
    - Alignments folder: all of the processing steps (pre-processing, alignment, forced transcriptions, forced alignments, etc.)

    - Interface/
      - audio_files/ : contains ~130,000 audio segments aligned with their translations
        - book_id/
          - chapter_id/
            - book_id-chapter_id-sentence_number.wav
            - reader_id-chapter_id-sentence_number.wav (if the segment comes from the dev/test pool of LibriSpeech)
    - Alignments/ : processing steps used in the different alignment stages (reading [3] is mandatory to understand where these files come from)
      - en/ : preprocessing steps for the English chapters, used before alignment
      - fr/ : preprocessing steps for the French chapters, used before alignment
        - ls_book_id.txt (Gutenberg original text)
        - lc_book_id.format (pdf, epub, txt, ...)
    - db/ : the database containing alignments, metadata, and other information
      - TA-LibriSpeechCorpus.sqlite3
    - index.html (main HTML page of the interface)

    Database Structure
    ==================

    The corpus is provided with several database tables containing useful information. The database is organized as follows:

    Alignment Tables
    - alignments: contains transcriptions, textual alignments, and the name of the audio file associated with each alignment; each row corresponds to an aligned sentence
    - audio: contains the duration of each speech segment (seconds)
    - alignments_evaluations: 200 manually annotated sentences (for the alignment evaluation, see [3])
    - alignments_excluded: marks sentences to be excluded from the corpus (bad alignments)
    - alignments_gTranslate: automatic translation output from Google Translate for each segment (transcriptions)
    - alignments_scores: several cross-lingual alignment scores provided with the corpus, which can be used to sort the corpus from highest score to lowest

    Metadata Tables
    - librispeech: all the books from the LibriSpeech project for which a downloadable link could be found (links may be dead or wrong if they changed after our work)
    - csv, clean100, other: metadata completion for the books provided with the LibriSpeech project
    - nosLivres: some French e-book links gathered from http://www.nosLivres.net
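
    As a sketch of how this sqlite3 database might be queried: the table names below come from the description above, but the column names are illustrative assumptions, so list the real schema first.

    import sqlite3

    con = sqlite3.connect("TA-LibriSpeechCorpus.sqlite3")
    cur = con.cursor()

    # Inspect the actual schema before writing queries against assumed columns.
    cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
    print(cur.fetchall())

    # Hypothetical query sorting alignments by score; replace 'score' and the
    # join key with the actual column names found above.
    # cur.execute("""
    #     SELECT a.*
    #     FROM alignments AS a
    #     JOIN alignments_scores AS s ON s.rowid = a.rowid
    #     ORDER BY s.score DESC
    # """)

    con.close()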

    References
    ==========

    [1] Panayotov, V., Chen, G., Povey, D., & Khudanpur, S. (2015). Librispeech: an ASR corpus based on public domain audio books. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on (pp. 5206-5210). IEEE.
    [2] https://www.gutenberg.org/
    [3] Ali Can Kocabiyikoglu, Laurent Besacier and Olivier Kraif, "Augmenting LibriSpeech with French Translations: A Multimodal Corpus for Direct Speech Translation Evaluation", submitted to LREC, 2018.
    [4] Alexandre Bérard, Laurent Besacier, Ali Can Kocabiyikoglu and Olivier Pietquin, "End-to-End Automatic Speech Translation of Audiobooks", submitted to ICASSP, 2018.

  11. Librispeech Slakh Unmix (LSX)

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 4, 2023
    Cite
    Petermann, Darius (2023). Librispeech Slakh Unmix (LSX) [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_7765139
    Dataset provided by
    Wichern, Gordon
    Petermann, Darius
    Le Roux, Jonathan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    Librispeech Slakh Unmix (LSX) is a proof-of-concept source separation dataset for training and testing algorithms that separate a monaural audio signal using hyperbolic embeddings for hierarchical separation. The dataset is composed of artificial mixtures using audio from the LibriSpeech (clean subset) and Slakh2100 datasets. The dataset was introduced in our paper Hyperbolic Audio Source Separation.

    At a Glance

    The size of the unzipped dataset is ~28 GB.

    Each mixture is 60 s long and corresponds to the first 60 s of the bass, drums, and guitar stems of the associated Slakh2100 track.

    Audio is encoded as 16-bit WAV files at a sampling rate of 16 kHz.

    The data is split into training tr (1390 mixtures), validation cv (348 mixtures), and testing tt (209 mixtures) subsets.

    The directory for each mixture contains eight WAV files:

    - mix.wav: the overall mixture of the five child sources
    - music_mix.wav: the music submix containing guitar, bass, and drums
    - speech_mix.wav: the speech submix containing both male and female speech signals
    - bass.wav: the original bass submix from the Slakh track
    - drums.wav: the original drums submix from the Slakh track
    - guitar.wav: the original guitar submix from the Slakh track
    - speech_male.wav: concatenated male speech utterances filling the length of the song
    - speech_female.wav: concatenated female speech utterances filling the length of the song
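
    A small sanity-check sketch, assuming the file layout above and the soundfile package; the track folder name is hypothetical, and whether mix.wav is exactly the sample-wise sum of the stems (rather than re-normalized) is not stated, hence the loose comparison:

    import numpy as np
    import soundfile as sf

    track = "tr/Track00001"  # hypothetical mixture directory
    mix, sr = sf.read(f"{track}/mix.wav")
    stems = ["bass.wav", "drums.wav", "guitar.wav",
             "speech_male.wav", "speech_female.wav"]
    total = sum(sf.read(f"{track}/{name}")[0] for name in stems)

    print(sr)                         # expect 16000 per the description above
    print(np.abs(mix - total).max())  # small if the mix is a plain sum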

    Other Resources

    PyTorch code for training models, along with our hyperbolic separation interface, is available here

    Citation

    If you use LSX in your research, please cite our paper:

    @InProceedings{Petermann2023ICASSP_hyper,
      author    = {Petermann, Darius and Wichern, Gordon and Subramanian, Aswin and {Le Roux}, Jonathan},
      title     = {Hyperbolic Audio Source Separation},
      booktitle = {Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
      year      = 2023,
      month     = jun
    }

    Copyright and License

    The LSX dataset is released under the CC-BY-4.0 license.

    All data:

    Created by Mitsubishi Electric Research Laboratories (MERL), 2022-2023

    SPDX-License-Identifier: CC-BY-4.0

  12. librispeech-mixes-test

    • kaggle.com
    Updated Nov 12, 2023
    Cite
    Korolev Kirill (2023). librispeech-mixes-test [Dataset]. https://www.kaggle.com/datasets/kafafyf/librispeech-mixes-test/code
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Korolev Kirill
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Korolev Kirill

    Released under CC0: Public Domain


  13. librispeech-spkid-corpus

    • kaggle.com
    Updated Nov 24, 2020
    Cite
    Tommy NgX (2020). librispeech-spkid-corpus [Dataset]. https://www.kaggle.com/tommyngx/librispeech-spkid-corpus/discussion
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Tommy NgX
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Tommy NgX

    Released under CC0: Public Domain

    Contents

    LibriSpeech ASR corpus

  14. user_libri_text

    • tensorflow.org
    Updated Dec 6, 2022
    Cite
    (2022). user_libri_text [Dataset]. https://www.tensorflow.org/datasets/catalog/user_libri_text
    Description

    UserLibri is a dataset containing paired audio and transcripts, plus additional text-only data, for each of 107 users. It is a reformatting of the LibriSpeech dataset found at http://www.openslr.org/12, reorganizing the data into users with an average of 52 LibriSpeech utterances and about 6,700 text example sentences per user. The UserLibriAudio class provides access to the audio-transcript pairs. See UserLibriText for the additional text data.

    To use this dataset:

    import tensorflow_datasets as tfds

    ds = tfds.load('user_libri_text', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

  15. CitySpeechMix: A Simulated Dataset of Speech and Urban Sound Mixtures from...

    • zenodo.org
    Updated May 14, 2025
    Cite
    Modan Tailleur; Mathieu Lagrange; Pierre Aumond; Vincent Tourre (2025). CitySpeechMix: A Simulated Dataset of Speech and Urban Sound Mixtures from LibriSpeech and SONYC-UST [Dataset]. http://doi.org/10.5281/zenodo.15405950
    Available download formats: csv, zip
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Modan Tailleur; Mathieu Lagrange; Pierre Aumond; Vincent Tourre
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    CitySpeechMix is a simulated audio dataset that mixes speech excerpts from LibriSpeech with environmental recordings from SONYC-UST to create controlled mixtures of voice and background noise. Each audio file is accompanied by the corresponding LibriSpeech transcription and the SONYC-UST sound class labels. A mapping is also provided between the selected SONYC-UST sound classes and their corresponding AudioSet categories.

    📊 Dataset Overview

    The dataset consists of 742 audio clips, each 10 seconds long:
    - 371 mixtures of speech over urban background noise
    - 371 voice-free urban environmental recordings

    🛠️ Dataset Construction

    The dataset, included in the `cityspeechmix.zip` archive, is constructed as follows:

    - Environmental sounds are selected from the SONYC-UST v2 evaluation set. Only clips annotated with exactly one of the following seven sound classes are retained: `engine`, `jackhammer`, `chainsaw`, `car horn`, `siren`, `music`, and `dog`.
    - The resulting SONYC subset is balanced to 742 clips (106 per class, selected randomly when more clips are available). Of these, 371 clips are retained for mixing (the `sonyc_librispeech_mixtures` folder), and 371 clips are peak-normalized and left unmixed (`sonyc_unmixed_subset`).
    - 371 speech clips (approximately 10 seconds each) are randomly selected from the LibriSpeech evaluation set and matched randomly to the 371 SONYC audio files selected for mixing.
    - Each pair of SONYC and LibriSpeech clips is resampled to 44.1 kHz and scaled to the same RMS level. To simulate realistic background noise conditions, the SONYC signal is attenuated by 6 dB prior to mixing.
    - The resulting mixtures are peak-normalized.

    📁 Folder Structure

    Inside the `cityspeechmix/` folder:

    - `sonyc_librispeech_mixtures/` — 371 speech + background noise mixtures
    - `sonyc_unmixed_subset/` — 371 voice-free environmental recordings

    The source stems (individual speech and background files for each mixture) are available separately in `stems.zip`.

    📄 Metadata File Description

    Each row in `metadata.csv` corresponds to a 10-second audio clip from the CitySpeechMix dataset. The columns are defined as follows:

    - `fname` — Filename of the resulting audio file (either a mixture or a reference clip).
    - `sonyc_file` — Filename of the SONYC-UST environmental recording used.
    - `librispeech_file` — Filename of the LibriSpeech audio sample used in the mixture. This field is `NaN` for voice-free clips.
    - `script` — Transcription of the spoken content from the LibriSpeech file. This field is `NaN` for voice-free clips.
    - `label1_sonyc` — First SONYC sound class label (e.g., `siren`, `dog`, `engine`) associated with the environmental recording.
    - `label1_audioset` — Corresponding AudioSet-compatible label for `label1_sonyc`.
    - `label2_sonyc` — Second SONYC label, corresponding to the voice label of SONYC-UST. This field is `NaN` for voice-free clips.
    - `label2_audioset` — Corresponding AudioSet-compatible label for `label2_sonyc`. This field is `NaN` for voice-free clips.
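
    A short sketch of reading the metadata with pandas, using only the columns defined above; voice-free clips are identified via the `NaN` convention described for `librispeech_file`:

    import pandas as pd

    meta = pd.read_csv("metadata.csv")

    mixtures   = meta[meta["librispeech_file"].notna()]  # speech + background
    voice_free = meta[meta["librispeech_file"].isna()]   # background only

    print(len(mixtures), len(voice_free))  # expect 371 and 371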

    🔎 Suggested Applications

    - Speech anonymization systems
    - Robust automatic speech recognition (ASR)
    - Urban sound tagging in presence of voice

    📚 Source Datasets

    - LibriSpeech
    Panayotov, V., Chen, G., Povey, D., & Khudanpur, S. (2015).
    Librispeech: An ASR corpus based on public domain audio books.
    [Paper] • [Dataset]

    - SONYC-UST V2
    Cartwright, M., Cramer, J., Bello, J. P., McFee, B., Cartwright, M., & Salamon, J. (2020).
    SONYC-UST V2: An Urban Sound Tagging Dataset with Spatiotemporal Context.
    [Paper] • [Dataset]

    📎 Citation

    If you use CitySpeechMix in your research, please cite it as:

    > CitySpeechMix: A Simulated Dataset of Speech and Urban Sound Mixtures from LibriSpeech and SONYC-UST
    > Modan Tailleur, Mathieu Lagrange, Pierre Aumond, Vincent Tourre. 2025.
    > Zenodo. https://doi.org/10.5281/zenodo.15405950

    @misc{tailleur2025cityspeechmix,
      title     = {CitySpeechMix: A Dataset of Speech and Urban Sound Mixtures},
      author    = {Tailleur, Modan and Lagrange, Mathieu and Aumond, Pierre and Tourre, Vincent},
      year      = 2025,
      publisher = {Zenodo},
      doi       = {10.5281/zenodo.15405950},
      url       = {https://doi.org/10.5281/zenodo.15405950}
    }

  16. Crowdsourced LibriTTS Speech Prominence Annotations

    • data.niaid.nih.gov
    Updated Dec 18, 2023
    Cite
    Morrison, Max (2023). Crowdsourced LibriTTS Speech Prominence Annotations [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10402792
    Dataset authored and provided by
    Morrison, Max
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset corresponding to the ICASSP 2024 paper "Crowdsourced and Automatic Speech Prominence Estimation" [link]

    This dataset is useful for training machine learning models to perform automatic emphasis annotation, as well as for downstream tasks such as emphasis-controlled TTS, emotion recognition, and text summarization. The dataset is described in Section 3 (Emphasis Annotation Dataset) of the paper; the contents of that section are copied below for convenience.

    We used our crowdsourced annotation system to perform human annotation of one eighth of the train-clean-100 partition of the LibriTTS [1] dataset. Specifically, participants annotated 3,626 utterances, with a total length of 6.42 hours and 69,809 words, from 18 speakers (9 male and 9 female). We collected at least one annotation of all 3,626 utterances, at least two annotations of 2,259 of those utterances, at least four annotations of 974 utterances, and at least eight annotations of 453 utterances. We did this in order to explore (in Section 6) whether it is more cost-effective to train a system on multiple annotations of fewer utterances or fewer annotations of more utterances.

    We paid 298 annotators to annotate batches of 20 utterances, where each batch takes approximately 15 minutes. We paid $3.34 for each completed batch (an estimated $13.35 per hour). Annotators each annotated between one and six batches. We recruited US residents on MTurk with an approval rating of at least 99 and at least 1000 approved tasks.

    Today, microlabor platforms like MTurk are plagued by automated task-completion software agents (bots) that randomly fill out surveys. We filtered out bots by excluding annotations from an additional 107 annotators who marked more than 2/3 of words as emphasized in eight or more of the 20 utterances in a batch. Annotators who fail the bot filter are blocked from performing further annotation. We also recorded participants' native country and language, but note these may be unreliable, as many MTurk workers use VPNs to subvert IP region filters on MTurk [2].

    The average Cohen kappa score for annotators with at least one overlapping utterance is 0.226 (i.e., "fair" agreement), but not all annotators annotate the same utterances, and this overemphasizes pairs of annotators with low overlap. Therefore, we use a one-parameter logistic model (i.e., a Rasch model) computed via py-irt [3], which predicts held-out annotations from the scores of overlapping annotators with 77.7% accuracy (50% is random).

    The structure of this dataset is a single JSON file of word-aligned emphasis annotations. The JSON references file stems of the LibriTTS dataset, which can be found here. All code used in the creation of the dataset can be found here. The format of the JSON file is as follows.

    {
      <annotator_id>: {
        "annotations": [
          {
            "score": [<score>, <score>, ...],
            "stem": <stem>,
            "words": [
              [<word>, <start>, <end>],
              [<word>, <start>, <end>],
              ...
            ]
          },
          ...
        ],
        "country": <country>,
        "language": <language>
      },
      ...
    }

    [1] Zen et al., "LibriTTS: A corpus derived from LibriSpeech for text-to-speech," in Interspeech, 2019.
    [2] Moss et al., "Bots or inattentive humans? Identifying sources of low-quality data in online platforms," PsyArXiv preprint PsyArXiv:wr8ds, 2021.
    [3] John Patrick Lalor and Pedro Rodriguez, "py-irt: A scalable item response theory library for Python," INFORMS Journal on Computing, 2023.

  17. libris_clean_100

    • huggingface.co
    Updated Apr 8, 2023
    Cite
    Binh Nguyen (2023). libris_clean_100 [Dataset]. https://huggingface.co/datasets/nguyenvulebinh/libris_clean_100
    Authors
    Binh Nguyen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for librispeech_asr

      Dataset Summary
    

    LibriSpeech is a corpus of approximately 1000 hours of 16kHz read English speech, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from read audiobooks from the LibriVox project, and has been carefully segmented and aligned.

      Supported Tasks and Leaderboards
    

    automatic-speech-recognition, audio-speaker-identification: The dataset can be used to train a model for Automatic… See the full description on the dataset page: https://huggingface.co/datasets/nguyenvulebinh/libris_clean_100.

  18. Multilingual LibriSpeech

    • opendatalab.com
    Updated Sep 28, 2023
    Cite
    Facebook AI Research (2023). Multilingual LibriSpeech [Dataset]. https://opendatalab.com/OpenDataLab/Multilingual_LibriSpeech
    Available download formats: zip
    Dataset provided by
    Facebook AI Research
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Multilingual LibriSpeech (MLS) dataset is a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages: English, German, Dutch, Spanish, French, Italian, Portuguese, and Polish.

  19. PESQ and STOI results for the LibriSpeech dataset

    • figshare.com
    Updated Jan 3, 2024
    Cite
    The citation is currently not available for this dataset.
    Dataset provided by
    PLOS ONE
    Authors
    Zhenqing Li; Abdul Basit; Amil Daraz; Atif Jan
    Description

    Long short-term memory (LSTM) has been effectively used to represent sequential data in recent years. However, LSTM still struggles with capturing long-term temporal dependencies. In this paper, we propose an hourglass-shaped LSTM that is able to capture long-term temporal correlations by reducing the feature resolution without data loss. We use skip connections between non-adjacent layers to avoid gradient decay. In addition, an attention mechanism is incorporated into the skip connections to emphasize the essential spectral features and spectral regions.

    The proposed LSTM model is applied to speech enhancement and recognition applications. It uses no future information, resulting in a causal system suitable for real-time processing. Combined spectral feature sets are used to train the model for improved performance, and the ideal ratio mask (IRM) is estimated as the training objective.

    Experimental evaluations using short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ) have demonstrated that the proposed model, with its robust feature representation, obtains higher speech intelligibility and perceptual quality. On the TIMIT, LibriSpeech, and VoiceBank datasets, the proposed model improved STOI by 16.21%, 16.41%, and 18.33% over noisy speech, while PESQ improved by 31.1%, 32.9%, and 32%. In seen and unseen noisy conditions, the proposed model outperformed existing deep neural networks (DNNs), including a baseline LSTM, a feedforward neural network (FDNN), a convolutional neural network (CNN), and a generative adversarial network (GAN). With the Kaldi toolkit for automatic speech recognition (ASR), the proposed model significantly reduced word error rates (WERs), reaching an average WER of 15.13% in noisy backgrounds.
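
    For context, STOI and PESQ scores like those reported here can be computed with the pystoi and pesq PyPI packages; a rough sketch, with the file names being placeholders:

    import soundfile as sf
    from pesq import pesq
    from pystoi import stoi

    ref, fs = sf.read("clean.wav")     # reference signal (placeholder path)
    deg, _ = sf.read("enhanced.wav")   # degraded/enhanced signal (placeholder path)

    print("STOI:", stoi(ref, deg, fs, extended=False))
    print("PESQ:", pesq(fs, ref, deg, "wb"))  # wideband mode; requires fs == 16000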

  20. tgritsaev-librispeech-mixture

    • kaggle.com
    Updated Nov 9, 2023
    Cite
    Timofei Gritsaev (2023). tgritsaev-librispeech-mixture [Dataset]. https://www.kaggle.com/datasets/timgritsaev/tgritsaev-librispeech-mixture/suggestions
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Timofei Gritsaev
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Timofei Gritsaev

    Released under MIT

