Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
GigaSpeech is an evolving, multi-domain English speech recognition corpus with 10,000 hours of high-quality labeled audio suitable for supervised training, and 40,000 hours of total audio suitable for semi-supervised and unsupervised training. Around 40,000 hours of transcribed audio was first collected from audiobooks, podcasts, and YouTube, covering both read and spontaneous speaking styles and a variety of topics such as arts, science, and sports. A new forced-alignment and segmentation pipeline is proposed to create sentence segments suitable for speech recognition training and to filter out segments with low-quality transcriptions. For system training, GigaSpeech provides five subsets of different sizes: 10h, 250h, 1,000h, 2,500h, and 10,000h. For our 10,000-hour XL training subset, we cap the word error rate at 4% during the filtering/validation stage; for all our other, smaller training subsets, we cap it at 0%. The DEV and TEST evaluation sets, on the other hand, are re-processed by professional human transcribers to ensure high transcription quality.
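For quick experimentation, the corpus can be pulled through the 🤗 Datasets library. Below is a minimal sketch; the repo id speechcolab/gigaspeech and the subset config names ("xs", "s", "m", "l", "xl") are assumptions about the commonly used Hub mirror, and the dataset is gated, so an authorized Hugging Face token is required.

```python
from datasets import load_dataset

# Sketch only: "speechcolab/gigaspeech" and the "xs" subset name are
# assumptions about the Hub mirror; access is gated, so you must have
# accepted the dataset terms with your Hugging Face account.
gs = load_dataset("speechcolab/gigaspeech", "xs", split="train", streaming=True)
sample = next(iter(gs))
print(sample["text"])                     # transcript
print(sample["audio"]["sampling_rate"])   # decoded audio metadata
```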
fixie-ai/gigaspeech dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Gigaspeech Part 2
This is Part 2 of 8 of a large-scale speech dataset, split to accommodate Hugging Face's repository size limits.
Multi-Part Dataset
This dataset is split across multiple repositories:
Part 1: shahdsaf/gigaspeech-part-1
Part 2 (current): shahdsaf/gigaspeech-part-2
Part 3: shahdsaf/gigaspeech-part-3
Part 4: shahdsaf/gigaspeech-part-4
Part 5: shahdsaf/gigaspeech-part-5
Part 6: shahdsaf/gigaspeech-part-6
Part 7: shahdsaf/gigaspeech-part-7
Part 8: …
See the full description on the dataset page: https://huggingface.co/datasets/shahdsaf/gigaspeech-part-2.
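Once all parts are accessible, they can be recombined with 🤗 Datasets. This is a sketch under the assumption that every part exposes a "train" split with an identical schema:

```python
from datasets import load_dataset, concatenate_datasets

# Assumption: each shahdsaf/gigaspeech-part-{i} repo has a "train" split
# sharing one schema, so the parts can simply be concatenated.
parts = [
    load_dataset(f"shahdsaf/gigaspeech-part-{i}", split="train")
    for i in range(1, 9)
]
full = concatenate_datasets(parts)
print(full)
```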
westbrook/gigaspeech-tiny-stage4 dataset hosted on Hugging Face and contributed by the HF Datasets community
westbrook/gigaspeech-tiny-2 dataset hosted on Hugging Face and contributed by the HF Datasets community
westbrook/gigaspeech-tiny-3 dataset hosted on Hugging Face and contributed by the HF Datasets community
westbrook/gigaspeech-processed dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset contains transcribed audio data for Indonesian. It consists of audio files plus a CSV file that maps each audio ID to the transcription of the corresponding audio file.
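Since only the layout (audio files plus an ID-to-transcription CSV) is described, the following sketch pairs the two; the file name "transcripts.csv", the column names, and the "audio/" directory are all hypothetical placeholders:

```python
import csv
from pathlib import Path

# All names here are hypothetical placeholders for the layout described above.
CSV_PATH = "transcripts.csv"   # columns assumed: audio_id, transcription
AUDIO_DIR = Path("audio")      # assumed directory of .wav files

with open(CSV_PATH, newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        wav_path = AUDIO_DIR / f"{row['audio_id']}.wav"
        print(wav_path, "->", row["transcription"])
```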
patricklifixie/gigaspeech-seed-context-continuation-noise dataset hosted on Hugging Face and contributed by the HF Datasets community
hoanganhpham/gigaspeech-vi dataset hosted on Hugging Face and contributed by the HF Datasets community
westbrook/gigaspeech-tiny-0-train dataset hosted on Hugging Face and contributed by the HF Datasets community
ddamianos/gigaspeech-l_multi_prompts dataset hosted on Hugging Face and contributed by the HF Datasets community
anilkeshwani/gigaspeech-hubert_large_ll60k-layer_22 dataset hosted on Hugging Face and contributed by the HF Datasets community
yfyeung/gigaspeech-icefall-data dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Speech Recognition Alignment Dataset
This dataset is a variation of several widely used ASR datasets, encompassing LibriSpeech, MuST-C, TED-LIUM, VoxPopuli, Common Voice, and GigaSpeech. The difference is that this dataset includes:
- Precise alignment between audio and text.
- Text that has been punctuated and made case-sensitive.
- Identification of named entities in the text.
Usage
First, install the latest version of the 🤗 Datasets package:
pip install --upgrade pip
pip… See the full description on the dataset page: https://huggingface.co/datasets/nguyenvulebinh/asr-alignment.
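After installation, loading one of the aligned subsets would look roughly like the sketch below; the configuration name "librispeech" and the split name are assumptions, so check the dataset page for the actual list:

```python
from datasets import load_dataset

# "librispeech" is an assumed config name for one of the source corpora
# listed above; see the dataset page for the real configuration names.
ds = load_dataset("nguyenvulebinh/asr-alignment", "librispeech",
                  split="train", streaming=True)
example = next(iter(ds))
print(example.keys())  # expect audio, cased/punctuated text, alignment, entities
```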
GigaSpeech 2
This is the official repository of the GigaSpeech 2 dataset. For details of how we created the dataset, please refer to our arXiv preprint. GigaSpeech 2 version: 2.0 (2024/06/19)
Download
The dataset is available on Hugging Face and ModelScope. Pre-trained models are available for Thai and Vietnamese.
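A hedged sketch for fetching part of the corpus from the Hub follows; the repo id "speechcolab/gigaspeech2" and the per-language directory layout are assumptions, so verify both on the dataset page:

```python
from huggingface_hub import snapshot_download

# The repo id and the "data/th/*" layout are assumptions for illustration only.
local_dir = snapshot_download(
    repo_id="speechcolab/gigaspeech2",
    repo_type="dataset",
    allow_patterns=["data/th/*"],  # e.g. download only the Thai portion
)
print("Downloaded to", local_dir)
```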
Leaderboard
|Contributor|Toolkit|Train Recipe|Train Data|Inference|Test CER/WER|
|---|---|---|---|---|---|
|Baseline|Icefall|…|
See the full description on the dataset page: https://huggingface.co/datasets/vanmanhnew/dataset.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
CapSpeech-GigaSpeech Audio
Dataset used for the paper: CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech. Please refer to 🤗 CapSpeech for the whole dataset and the 🚀 CapSpeech repo for more details.
Overview
🔥 CapSpeech is a new benchmark designed for style-captioned TTS (CapTTS) tasks, including style-captioned text-to-speech synthesis with sound effects (CapTTS-SE), accent-captioned TTS (AccCapTTS), emotion-captioned TTS (EmoCapTTS) and… See the full description on the dataset page: https://huggingface.co/datasets/OpenSound/CapSpeech_GigaSpeech.
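Loading the audio portion hosted in this repo might look like the sketch below; the split name and field layout are assumptions, so consult the dataset page:

```python
from datasets import load_dataset

# Repo id comes from the dataset page above; the "train" split name is assumed.
cap = load_dataset("OpenSound/CapSpeech_GigaSpeech", split="train", streaming=True)
item = next(iter(cap))
print(item.keys())  # expect audio plus style-caption fields
```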
This model was trained by Shinji Watanabe using the gigaspeech recipe in ESPnet.

Python API

See https://github.com/espnet/espnet_model_zoo

Evaluate in the recipe

```bash
git clone https://github.com/espnet/espnet
cd espnet
git checkout dcb5bdb2ffa34a9f44255c0b073759c5b9b3f86e
pip install -e .
cd egs2/gigaspeech/asr1
# The model name contains a space, so it must be quoted for the shell.
./run.sh --skip_data_prep false --skip_train true --download_model "Shinji Watanabe/gigaspeech_asr_train_asr_raw_en_bpe5000_valid.acc.ave"
```

Results

# RESULTS
## Environments
- date: Tue Mar 23 10:03:49 EDT 2021
- python version: 3.8.5 (default, Sep 4 2020, 07:30:14) [GCC 7.3.0]
- espnet version: espnet 0.9.8
- pytorch version: pytorch 1.7.1
- Git hash: dcb5bdb2ffa34a9f44255c0b073759c5b9b3f86e
- Commit date: Sat Mar 13 10:16:16 2021 -0500

## asr_train_asr_raw_en_bpe5000
### WER
|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_asr_model_valid.acc.ave/dev|2043|51075|92.9|4.5|2.6|2.1|9.2|65.6|
|decode_asr_asr_model_valid.acc.ave/test|9627|175116|90.5|7.0|2.5|6.1|15.6|69.3|

### CER
|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_asr_model_valid.acc.ave/dev|2043|271188|97.5|0.9|1.6|1.7|4.2|65.6|
|decode_asr_asr_model_valid.acc.ave/test|9627|909930|96.5|1.6|1.9|5.6|9.0|69.3|

### TER
|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_asr_model_valid.acc.ave/dev|2043|63598|93.3|3.9|2.8|2.1|8.8|65.6|
|decode_asr_asr_model_valid.acc.ave/test|9627|218851|90.8|6.1|3.1|7.0|16.2|69.3|

ASR config

```yaml
config: conf/train_asr.yaml
print_config: false
log_level: INFO
dry_run: false
iterator_type: sequence
output_dir: exp/asr_train_asr_raw_en_bpe5000
ngpu: 1
seed: 0
num_workers: 1
num_att_plot: 3
dist_backend: nccl
dist_init_method: env://
dist_world_size: 4
dist_rank: 0
local_rank: 0
dist_master_addr: localhost
dist_master_port: 37831
dist_launcher: null
multiprocessing_distributed: true
unused_parameters: false
sharded_ddp: false
cudnn_enabled: true
cudnn_benchmark: false
cudnn_deterministic: true
collect_stats: false
write_collected_feats: false
max_epoch: 20
patience: null
val_scheduler_criterion:
- valid
- loss
early_stopping_criterion:
- valid
- loss
- min
best_model_criterion:
- - valid
  - acc
  - max
keep_nbest_models: 10
grad_clip: 5.0
grad_clip_type: 2.0
grad_noise: false
accum_grad: 4
no_forward_run: false
resume: true
train_dtype: float32
use_amp: false
log_interval: null
use_tensorboard: true
use_wandb: false
wandb_project: null
wandb_id: null
detect_anomaly: false
pretrain_path: null
init_param: []
freeze_param: []
num_iters_per_epoch: null
batch_size: 20
valid_batch_size: null
batch_bins: 35000000
valid_batch_bins: null
train_shape_file:
- exp/asr_stats_raw_en_bpe5000/train/speech_shape
- exp/asr_stats_raw_en_bpe5000/train/text_shape.bpe
valid_shape_file:
- exp/asr_stats_raw_en_bpe5000/valid/speech_shape
- exp/asr_stats_raw_en_bpe5000/valid/text_shape.bpe
batch_type: numel
valid_batch_type: null
fold_length:
- 80000
- 150
sort_in_batch: descending
sort_batch: descending
multiple_iterator: false
chunk_length: 500
chunk_shift_ratio: 0.5
num_cache_chunks: 1024
train_data_path_and_name_and_type:
- - dump/raw/train/wav.scp
  - speech
  - kaldi_ark
- - dump/raw/train/text
  - text
  - text
valid_data_path_and_name_and_type:
- - dump/raw/dev/wav.scp
  - speech
  - kaldi_ark
- - dump/raw/dev/text
  - text
  - text
allow_variable_data_keys: false
max_cache_size: 0.0
max_cache_fd: 32
valid_max_cache_size: null
optim: adam
optim_conf:
  lr: 0.0015
scheduler: warmuplr
scheduler_conf:
  warmup_steps: 25000
token_list:
- <blank>
- <unk>
- S
- ▁THE
- ▁TO
- ▁OF
- ▁A
- ▁AND
- ''''
- ▁THAT
- ▁IN
- ▁YOU
- ▁I
- ▁IT
- T
- ▁IS
- ▁WAS
- ED
- ▁WE
- ▁FOR
- ING
- ▁THIS
- D
- ▁ON
- ▁BE
- ▁WITH
- ▁HAVE
- ▁SO
- ▁HE
- RE
- ▁THEY
- ▁ARE
- ▁NOT
- ▁AS
- ▁LIKE
- ▁AT
- ▁KNOW
- ▁WHAT
- LY
- ▁CAN
- ▁DO
- ▁ABOUT
- ▁ALL
- ▁HIS
- M
- ▁HAD
- '-'
- ▁ONE
- ▁OR
- ▁FROM
- ▁THERE
- ▁ME
- ▁MY
- ▁BUT
- ▁JUST
- ▁YOUR
- ▁AN
- ▁BY
- Y
- ▁IF
- ▁OUT
- ▁PEOPLE
- ▁UP
- ▁HER
- ER
- ▁WERE
- ▁THINK
- E
- N
- ▁WOULD
- ▁SHE
- ▁THEIR
- ▁WHO
- ▁MORE
- ▁OUR
- ▁THEM
- ▁WHEN
- ▁WHICH
- ▁VERY
- ▁WILL
- ▁SOME
- ▁TIME
- ▁BEEN
- R
- ▁GET
- ▁HAS
- ▁GOING
- ▁HIM
- VE
- ▁REALLY
- ▁HOW
- ▁DON
- ▁NO
- ▁THEN
- LL
- ▁GO
- ▁BECAUSE
- ▁NOW
- AL
- ▁INTO
- ▁THESE
- ▁OTHER
- ▁RIGHT
- ▁SEE
- ▁SAID
- ▁HERE
- ▁WAY
- ▁TWO
- ▁US
- ▁WANT
- ▁COULD
- ▁S
- ▁SAY
- ▁OVER
- ▁AH
- ES
- ▁WHERE
- ▁BACK
- ▁ALSO
- ▁THOSE
- ▁THINGS
- ▁MAKE
- ▁KIND
- ▁MUCH
- IN
- ▁WELL
- ▁GOOD
- ▁DID
- L
- ▁FIRST
- ▁THAN
- ▁LITTLE
- ▁RE
- C
- ▁NEW
- ▁WORK
- ▁ANY
- A
- P
- ▁LOT
- ▁DOWN
- ▁SOMETHING
- ▁THING
- OR
- LE
- ▁MAN
- ▁GOT
- B
- ▁COME
- ▁ONLY
- G
- ▁BEING
- ▁ACTUALLY
- ▁LOOK
- O
- ▁TAKE
- ▁EVEN
- ▁NEED
- ▁THROUGH
- W
- ▁GREAT
- ▁WORLD
- ▁MANY
- ▁SHOULD
- ▁YEARS
- ATION
- ▁UM
- ▁MOST
- ▁DAY
- ▁YEAH
- ▁LIFE
- ▁BEFORE
- ▁THREE
- ▁UN
- ION
- ▁DIFFERENT
- ▁DE
- ▁MIGHT
- ▁LET
- ▁MADE
- ▁MEAN
- ▁PART
- IC
- ▁AGAIN
- TH
- ▁AFTER
- ▁OWN
- ▁USE
- ITY
- ABLE
- ▁LONG
- ▁STILL
- ▁MAY
- F
- ▁OFF
- ▁NEVER
- ▁PUT
- ▁C
- ▁SAME
- ...
```
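The Python API mentioned above is documented in espnet_model_zoo; a minimal inference sketch follows, assuming a local 16 kHz mono WAV file named example.wav (a placeholder):

```python
import soundfile as sf
from espnet_model_zoo.downloader import ModelDownloader
from espnet2.bin.asr_inference import Speech2Text

# Download and unpack the pretrained model, then build the recognizer.
d = ModelDownloader()
speech2text = Speech2Text(
    **d.download_and_unpack(
        "Shinji Watanabe/gigaspeech_asr_train_asr_raw_en_bpe5000_valid.acc.ave"
    )
)

# "example.wav" is a placeholder; any 16 kHz mono recording should work.
speech, rate = sf.read("example.wav")
nbests = speech2text(speech)
text, tokens, token_ids, hyp = nbests[0]
print(text)
```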
@article{chen2021gigaspeech,
  title={GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio},
  author={Chen, Guoguo and Chai, Shuzhou and Wang, Guanbo and Du, Jiayu and Zhang, Wei-Qiang and Weng, Chao and Su, Dan and Povey, Daniel and Trmal, Jan and Zhang, Junbo and others},
  journal={arXiv preprint arXiv:2106.06909},
  year={2021}
}
@article{wang2024audiobench,
  title={AudioBench: A Universal Benchmark for Audio Large Language Models},
  author={Wang, Bin…
See the full description on the dataset page: https://huggingface.co/datasets/AudioLLMs/gigaspeech_test.
@article{yang2024gigaspeech,
  title={GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement},
  author={Yang, Yifan and Song, Zheshu and Zhuo, Jianheng and Cui, Mingyu and Li, Jinpeng and Yang, Bo and Du, Yexing and Ma, Ziyang and Liu, Xunying and Wang, Ziyuan and others},
  journal={arXiv preprint arXiv:2406.11546},
  year={2024}
}
@article{wang2024audiobench,
  title={AudioBench: A Universal…
See the full description on the dataset page: https://huggingface.co/datasets/AudioLLMs/gigaspeech2-test.
Not seeing a result you expected?
Learn how you can add new datasets to our index.