20 datasets found
  1. gigaspeech

    • huggingface.co
    • opendatalab.com
    + more versions
    Cite
    SpeechColab, gigaspeech [Dataset]. https://huggingface.co/datasets/speechcolab/gigaspeech
    Explore at:
    445 scholarly articles cite this dataset (View in Google Scholar)
    Dataset authored and provided by
    SpeechColab
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    GigaSpeech is an evolving, multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training, and 40,000 hours of total audio suitable for semi-supervised and unsupervised training. Around 40,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous speaking styles, and a variety of topics, such as arts, science, sports, etc. A new forced alignment and segmentation pipeline is proposed to create sentence segments suitable for speech recognition training, and to filter out segments with low-quality transcription. For system training, GigaSpeech provides five subsets of different sizes, 10h, 250h, 1000h, 2500h, and 10000h. For our 10,000-hour XL training subset, we cap the word error rate at 4% during the filtering/validation stage, and for all our other smaller training subsets, we cap it at 0%. The DEV and TEST evaluation sets, on the other hand, are re-processed by professional human transcribers to ensure high transcription quality.
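    For quick experimentation, the corpus can also be pulled through the 🤗 Datasets library. The sketch below is illustrative only: it assumes the subset configuration names follow the sizes listed above (e.g. "xs" for the 10h subset), that a "train" split with "text" and "audio" fields is present, and that you have accepted the dataset's terms on Hugging Face and are authenticated.

      # Minimal sketch: stream a small GigaSpeech subset with 🤗 Datasets.
      # Assumptions: config "xs" (10h subset), split "train", fields "text" and
      # "audio"; the repo is gated, so a logged-in Hugging Face token is needed.
      from datasets import load_dataset

      gs_xs = load_dataset(
          "speechcolab/gigaspeech",
          "xs",                # smallest training subset, per the sizes above
          split="train",
          streaming=True,      # avoid downloading the full archives up front
      )

      for example in gs_xs.take(3):
          print(example["text"])                      # transcript
          print(example["audio"]["sampling_rate"])    # decoded audio metadata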

  2. gigaspeech

    • huggingface.co
    Updated Feb 18, 2025
    Cite
    Ultravox.ai (2025). gigaspeech [Dataset]. https://huggingface.co/datasets/fixie-ai/gigaspeech
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 18, 2025
    Dataset provided by
    Ultravox.ai
    Description

    fixie-ai/gigaspeech dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. gigaspeech-part-2

    • huggingface.co
    Updated Jul 6, 2025
    + more versions
    Cite
    Shahd Safarani (2025). gigaspeech-part-2 [Dataset]. https://huggingface.co/datasets/shahdsaf/gigaspeech-part-2
    Explore at:
    Dataset updated
    Jul 6, 2025
    Authors
    Shahd Safarani
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Gigaspeech Part 2

    This is Part 2 of 8 of a large-scale speech dataset, split to accommodate HuggingFace's repository size limits.

      Multi-Part Dataset
    

    This dataset is split across multiple repositories:

    Part 1: shahdsaf/gigaspeech-part-1
    Part 2 (current): shahdsaf/gigaspeech-part-2
    Part 3: shahdsaf/gigaspeech-part-3
    Part 4: shahdsaf/gigaspeech-part-4
    Part 5: shahdsaf/gigaspeech-part-5
    Part 6: shahdsaf/gigaspeech-part-6
    Part 7: shahdsaf/gigaspeech-part-7
    Part 8: … See the full description on the dataset page: https://huggingface.co/datasets/shahdsaf/gigaspeech-part-2.
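    To work with the full corpus, the individual parts can be loaded and concatenated with 🤗 Datasets. The sketch below is a rough illustration only: it assumes every part exposes a plain "train" split with an identical schema, which the card excerpt above does not confirm.

      # Minimal sketch: load and concatenate the eight per-repository parts.
      # Assumptions: each repo has a "train" split and identical features;
      # note that this downloads every part in full.
      from datasets import load_dataset, concatenate_datasets

      parts = [
          load_dataset(f"shahdsaf/gigaspeech-part-{i}", split="train")
          for i in range(1, 9)
      ]
      full = concatenate_datasets(parts)
      print(full)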

  4. gigaspeech-tiny-stage4

    • huggingface.co
    Updated Jul 27, 2024
    + more versions
    Cite
    Helin Wang (2024). gigaspeech-tiny-stage4 [Dataset]. https://huggingface.co/datasets/westbrook/gigaspeech-tiny-stage4
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 27, 2024
    Authors
    Helin Wang
    Description

    westbrook/gigaspeech-tiny-stage4 dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. gigaspeech-tiny-2

    • huggingface.co
    Updated Aug 14, 2024
    + more versions
    Cite
    Helin Wang (2024). gigaspeech-tiny-2 [Dataset]. https://huggingface.co/datasets/westbrook/gigaspeech-tiny-2
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 14, 2024
    Authors
    Helin Wang
    Description

    westbrook/gigaspeech-tiny-2 dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. gigaspeech-tiny-3

    • huggingface.co
    Updated Jul 18, 2024
    + more versions
    Cite
    Helin Wang (2024). gigaspeech-tiny-3 [Dataset]. https://huggingface.co/datasets/westbrook/gigaspeech-tiny-3
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 18, 2024
    Authors
    Helin Wang
    Description

    westbrook/gigaspeech-tiny-3 dataset hosted on Hugging Face and contributed by the HF Datasets community

  7. gigaspeech-processed

    • huggingface.co
    Updated Jul 12, 2024
    Cite
    Helin Wang (2024). gigaspeech-processed [Dataset]. https://huggingface.co/datasets/westbrook/gigaspeech-processed
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 12, 2024
    Authors
    Helin Wang
    Description

    westbrook/gigaspeech-processed dataset hosted on Hugging Face and contributed by the HF Datasets community

  8. indo-split-gigaspeech

    • huggingface.co
    Updated Apr 6, 2025
    Cite
    Bagas S (2025). indo-split-gigaspeech [Dataset]. https://huggingface.co/datasets/bagasshw/indo-split-gigaspeech
    Explore at:
    Dataset updated
    Apr 6, 2025
    Authors
    Bagas S
    Description

    This dataset contains transcribed Indonesian audio. It consists of audio files and a CSV file that maps each audio ID to its transcription.
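    A loading sketch for this audio-plus-CSV layout follows; the file and column names are pure assumptions, since the description does not specify them.

      # Minimal sketch: pair each audio file with its transcription via the CSV.
      # Assumptions (hypothetical names): the CSV is "metadata.csv" with columns
      # "audio_id" and "transcription", and audio sits in "audio/<audio_id>.wav".
      import csv
      from pathlib import Path

      pairs = []
      with open("metadata.csv", newline="", encoding="utf-8") as f:
          for row in csv.DictReader(f):
              wav_path = Path("audio") / f"{row['audio_id']}.wav"
              pairs.append((wav_path, row["transcription"]))

      for wav_path, text in pairs[:3]:
          print(wav_path, "->", text)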

  9. gigaspeech-seed-context-continuation-noise

    • huggingface.co
    Updated Mar 13, 2025
    Cite
    patrick Li (2025). gigaspeech-seed-context-continuation-noise [Dataset]. https://huggingface.co/datasets/patricklifixie/gigaspeech-seed-context-continuation-noise
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 13, 2025
    Authors
    patrick Li
    Description

    patricklifixie/gigaspeech-seed-context-continuation-noise dataset hosted on Hugging Face and contributed by the HF Datasets community

  10. gigaspeech-vi

    • huggingface.co
    Updated Jan 23, 2025
    Cite
    A Pham (2025). gigaspeech-vi [Dataset]. https://huggingface.co/datasets/hoanganhpham/gigaspeech-vi
    Explore at:
    Dataset updated
    Jan 23, 2025
    Authors
    A Pham
    Description

    hoanganhpham/gigaspeech-vi dataset hosted on Hugging Face and contributed by the HF Datasets community

  11. gigaspeech-tiny-0-train

    • huggingface.co
    Updated Jul 31, 2024
    + more versions
    Cite
    Helin Wang (2024). gigaspeech-tiny-0-train [Dataset]. https://huggingface.co/datasets/westbrook/gigaspeech-tiny-0-train
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 31, 2024
    Authors
    Helin Wang
    Description

    westbrook/gigaspeech-tiny-0-train dataset hosted on Hugging Face and contributed by the HF Datasets community

  12. gigaspeech-l_multi_prompts

    • huggingface.co
    Cite
    Dimitris Damianos, gigaspeech-l_multi_prompts [Dataset]. https://huggingface.co/datasets/ddamianos/gigaspeech-l_multi_prompts
    Explore at:
    Authors
    Dimitris Damianos
    Description

    ddamianos/gigaspeech-l_multi_prompts dataset hosted on Hugging Face and contributed by the HF Datasets community

  13. gigaspeech-hubert_large_ll60k-layer_22

    • huggingface.co
    + more versions
    Cite
    Anil Keshwani, gigaspeech-hubert_large_ll60k-layer_22 [Dataset]. https://huggingface.co/datasets/anilkeshwani/gigaspeech-hubert_large_ll60k-layer_22
    Explore at:
    Authors
    Anil Keshwani
    Description

    anilkeshwani/gigaspeech-hubert_large_ll60k-layer_22 dataset hosted on Hugging Face and contributed by the HF Datasets community

  14. gigaspeech-icefall-data

    • huggingface.co
    Cite
    Yifan Yang (2025). gigaspeech-icefall-data [Dataset]. https://huggingface.co/datasets/yfyeung/gigaspeech-icefall-data
    Explore at:
    Authors
    Yifan Yang
    Description

    yfyeung/gigaspeech-icefall-data dataset hosted on Hugging Face and contributed by the HF Datasets community

  15. asr-alignment

    • huggingface.co
    Updated Jan 18, 2024
    Cite
    Binh Nguyen (2024). asr-alignment [Dataset]. https://huggingface.co/datasets/nguyenvulebinh/asr-alignment
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 18, 2024
    Authors
    Binh Nguyen
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Speech Recognition Alignment Dataset

    This dataset is a variation of several widely used ASR datasets, encompassing LibriSpeech, MuST-C, TED-LIUM, VoxPopuli, Common Voice, and GigaSpeech. The difference is that this dataset additionally includes:

    • Precise alignment between audio and text.
    • Text that has been punctuated and made case-sensitive.
    • Identification of named entities in the text.

      Usage
    

    First, install the latest version of the 🤗 Datasets package: pip install --upgrade pip pip… See the full description on the dataset page: https://huggingface.co/datasets/nguyenvulebinh/asr-alignment.
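    A minimal loading sketch for this collection follows. The configuration name ("gigaspeech") and the idea of one config per source corpus are assumptions, since the usage section above is truncated; check the dataset card for the real names before relying on them.

      # Minimal sketch: stream the (assumed) GigaSpeech portion of the
      # alignment dataset and inspect its schema.
      from datasets import load_dataset

      ds = load_dataset(
          "nguyenvulebinh/asr-alignment",
          "gigaspeech",      # hypothetical config name, one per source corpus
          split="train",
          streaming=True,
      )

      sample = next(iter(ds))
      print(sorted(sample.keys()))   # inspect the actual fields (alignments, NER, ...)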

  16. dataset

    • huggingface.co
    Updated Jun 19, 2024
    Cite
    NGUYEN VAN MANH (2024). dataset [Dataset]. https://huggingface.co/datasets/vanmanhnew/dataset
    Explore at:
    Dataset updated
    Jun 19, 2024
    Authors
    NGUYEN VAN MANH
    Description

    GigaSpeech 2

    This is the official repository of the GigaSpeech 2 dataset. For details of how we created the dataset, please refer to our arXiv preprint paper. GigaSpeech 2 version: 2.0 (2024/06/19)

      Download
    

    The dataset is available on Hugging Face and ModelScope. Pre-trained models are available for Thai and Vietnamese.

      Leaderboard
    

    Contributor | Toolkit | Train Recipe | Train Data | Inference | Test CER/WER

    Baseline Icefall… See the full description on the dataset page: https://huggingface.co/datasets/vanmanhnew/dataset.
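    Since the card says the data is distributed through Hugging Face, a generic streaming sketch follows. It is heavily hedged: the excerpt above only shows the GigaSpeech 2 README, so whether this particular repository exposes the audio itself, and under which split names, is not confirmed.

      # Minimal sketch: stream a couple of examples from the repository behind
      # this result. Assumptions: the repo hosts data (not just the README) and
      # exposes a "train" split.
      from datasets import load_dataset

      ds = load_dataset("vanmanhnew/dataset", split="train", streaming=True)
      for example in ds.take(2):
          print({k: type(v).__name__ for k, v in example.items()})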

  17. CapSpeech_GigaSpeech

    • huggingface.co
    Updated Mar 23, 2025
    + more versions
    Cite
    OpenSound (2025). CapSpeech_GigaSpeech [Dataset]. https://huggingface.co/datasets/OpenSound/CapSpeech_GigaSpeech
    Explore at:
    Dataset updated
    Mar 23, 2025
    Authors
    OpenSound
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    CapSpeech-GigaSpeech Audio

    Dataset used for the paper "CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech". Please refer to 🤗CapSpeech for the whole dataset and the 🚀CapSpeech repo for more details.

      Overview
    

    🔥 CapSpeech is a new benchmark designed for style-captioned TTS (CapTTS) tasks, including style-captioned text-to-speech synthesis with sound effects (CapTTS-SE), accent-captioned TTS (AccCapTTS), emotion-captioned TTS (EmoCapTTS) and… See the full description on the dataset page: https://huggingface.co/datasets/OpenSound/CapSpeech_GigaSpeech.

  18. ESPnet2 pretrained model, Shinji...

    • explore.openaire.eu
    Updated Mar 23, 2021
    Cite
    Shinji Watanabe (2021). ESPnet2 pretrained model, Shinji Watanabe/gigaspeech_asr_train_asr_raw_en_bpe5000_valid.acc.ave, fs=16k, lang=en [Dataset]. http://doi.org/10.5281/zenodo.4630405
    Explore at:
    Dataset updated
    Mar 23, 2021
    Authors
    Shinji Watanabe
    Description

    This model was trained by Shinji Watanabe using the gigaspeech recipe in ESPnet.

    Python API
    See https://github.com/espnet/espnet_model_zoo

    Evaluate in the recipe
    git clone https://github.com/espnet/espnet
    cd espnet
    git checkout dcb5bdb2ffa34a9f44255c0b073759c5b9b3f86e
    pip install -e .
    cd egs2/gigaspeech/asr1
    ./run.sh --skip_data_prep false --skip_train true --download_model "Shinji Watanabe/gigaspeech_asr_train_asr_raw_en_bpe5000_valid.acc.ave"

    Results

    Environments
    - date: Tue Mar 23 10:03:49 EDT 2021
    - python version: 3.8.5 (default, Sep 4 2020, 07:30:14) [GCC 7.3.0]
    - espnet version: espnet 0.9.8
    - pytorch version: pytorch 1.7.1
    - Git hash: dcb5bdb2ffa34a9f44255c0b073759c5b9b3f86e
    - Commit date: Sat Mar 13 10:16:16 2021 -0500

    asr_train_asr_raw_en_bpe5000

    WER
    |dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
    |---|---|---|---|---|---|---|---|---|
    |decode_asr_asr_model_valid.acc.ave/dev|2043|51075|92.9|4.5|2.6|2.1|9.2|65.6|
    |decode_asr_asr_model_valid.acc.ave/test|9627|175116|90.5|7.0|2.5|6.1|15.6|69.3|

    CER
    |dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
    |---|---|---|---|---|---|---|---|---|
    |decode_asr_asr_model_valid.acc.ave/dev|2043|271188|97.5|0.9|1.6|1.7|4.2|65.6|
    |decode_asr_asr_model_valid.acc.ave/test|9627|909930|96.5|1.6|1.9|5.6|9.0|69.3|

    TER
    |dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
    |---|---|---|---|---|---|---|---|---|
    |decode_asr_asr_model_valid.acc.ave/dev|2043|63598|93.3|3.9|2.8|2.1|8.8|65.6|
    |decode_asr_asr_model_valid.acc.ave/test|9627|218851|90.8|6.1|3.1|7.0|16.2|69.3|

    ASR config
    config: conf/train_asr.yaml
    print_config: false
    log_level: INFO
    dry_run: false
    iterator_type: sequence
    output_dir: exp/asr_train_asr_raw_en_bpe5000
    ngpu: 1
    seed: 0
    num_workers: 1
    num_att_plot: 3
    dist_backend: nccl
    dist_init_method: env://
    dist_world_size: 4
    dist_rank: 0
    local_rank: 0
    dist_master_addr: localhost
    dist_master_port: 37831
    dist_launcher: null
    multiprocessing_distributed: true
    unused_parameters: false
    sharded_ddp: false
    cudnn_enabled: true
    cudnn_benchmark: false
    cudnn_deterministic: true
    collect_stats: false
    write_collected_feats: false
    max_epoch: 20
    patience: null
    val_scheduler_criterion:
    - valid
    - loss
    early_stopping_criterion:
    - valid
    - loss
    - min
    best_model_criterion:
    - - valid
      - acc
      - max
    keep_nbest_models: 10
    grad_clip: 5.0
    grad_clip_type: 2.0
    grad_noise: false
    accum_grad: 4
    no_forward_run: false
    resume: true
    train_dtype: float32
    use_amp: false
    log_interval: null
    use_tensorboard: true
    use_wandb: false
    wandb_project: null
    wandb_id: null
    detect_anomaly: false
    pretrain_path: null
    init_param: []
    freeze_param: []
    num_iters_per_epoch: null
    batch_size: 20
    valid_batch_size: null
    batch_bins: 35000000
    valid_batch_bins: null
    train_shape_file:
    - exp/asr_stats_raw_en_bpe5000/train/speech_shape
    - exp/asr_stats_raw_en_bpe5000/train/text_shape.bpe
    valid_shape_file:
    - exp/asr_stats_raw_en_bpe5000/valid/speech_shape
    - exp/asr_stats_raw_en_bpe5000/valid/text_shape.bpe
    batch_type: numel
    valid_batch_type: null
    fold_length:
    - 80000
    - 150
    sort_in_batch: descending
    sort_batch: descending
    multiple_iterator: false
    chunk_length: 500
    chunk_shift_ratio: 0.5
    num_cache_chunks: 1024
    train_data_path_and_name_and_type:
    - - dump/raw/train/wav.scp
      - speech
      - kaldi_ark
    - - dump/raw/train/text
      - text
      - text
    valid_data_path_and_name_and_type:
    - - dump/raw/dev/wav.scp
      - speech
      - kaldi_ark
    - - dump/raw/dev/text
      - text
      - text
    allow_variable_data_keys: false
    max_cache_size: 0.0
    max_cache_fd: 32
    valid_max_cache_size: null
    optim: adam
    optim_conf:
      lr: 0.0015
    scheduler: warmuplr
    scheduler_conf:
      warmup_steps: 25000
    token_list: (the 5,000-unit BPE vocabulary; the card lists entries beginning S, ▁THE, ▁TO, ▁OF, ▁A, ▁AND, and so on; list truncated in the source)
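    The "Python API" route above refers to the espnet_model_zoo package. Below is a minimal inference sketch, not taken from the model card: it assumes a recent ESPnet2 release that provides Speech2Text.from_pretrained and a local 16 kHz mono WAV file; exact signatures may differ between ESPnet versions.

      # Minimal sketch (assumptions noted above): download the pretrained model
      # by its zoo name and decode one local utterance.
      import soundfile as sf
      from espnet2.bin.asr_inference import Speech2Text

      speech2text = Speech2Text.from_pretrained(
          "Shinji Watanabe/gigaspeech_asr_train_asr_raw_en_bpe5000_valid.acc.ave"
      )

      speech, rate = sf.read("example.wav")     # hypothetical file; 16 kHz, matching fs=16k
      nbests = speech2text(speech)
      text, tokens, token_ids, hyp = nbests[0]  # best hypothesis first
      print(text)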

  19. gigaspeech_test

    • huggingface.co
    Updated Jul 16, 2024
    Cite
    AudioLLMs (2024). gigaspeech_test [Dataset]. https://huggingface.co/datasets/AudioLLMs/gigaspeech_test
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 16, 2024
    Dataset authored and provided by
    AudioLLMs
    Description

    @article{chen2021gigaspeech,
      title={Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio},
      author={Chen, Guoguo and Chai, Shuzhou and Wang, Guanbo and Du, Jiayu and Zhang, Wei-Qiang and Weng, Chao and Su, Dan and Povey, Daniel and Trmal, Jan and Zhang, Junbo and others},
      journal={arXiv preprint arXiv:2106.06909},
      year={2021}
    }

    @article{wang2024audiobench, title={AudioBench: A Universal Benchmark for Audio Large Language Models}, author={Wang, Bin… See the full description on the dataset page: https://huggingface.co/datasets/AudioLLMs/gigaspeech_test.

  20. gigaspeech2-test

    • huggingface.co
    Updated May 31, 2025
    Cite
    AudioLLMs (2025). gigaspeech2-test [Dataset]. https://huggingface.co/datasets/AudioLLMs/gigaspeech2-test
    Explore at:
    Dataset updated
    May 31, 2025
    Dataset authored and provided by
    AudioLLMs
    Description

    @article{yang2024gigaspeech,
      title={GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement},
      author={Yang, Yifan and Song, Zheshu and Zhuo, Jianheng and Cui, Mingyu and Li, Jinpeng and Yang, Bo and Du, Yexing and Ma, Ziyang and Liu, Xunying and Wang, Ziyuan and others},
      journal={arXiv preprint arXiv:2406.11546},
      year={2024}
    }

    @article{wang2024audiobench, title={AudioBench: A Universal… See the full description on the dataset page: https://huggingface.co/datasets/AudioLLMs/gigaspeech2-test.
