Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for LibriTTS
LibriTTS is a multi-speaker English corpus of approximately 585 hours of read English speech at 24kHz sampling rate, prepared by Heiga Zen with the assistance of Google Speech and Google Brain team members. The LibriTTS corpus is designed for TTS research. It is derived from the original materials (mp3 audio files from LibriVox and text files from Project Gutenberg) of the LibriSpeech corpus.
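The duration and sampling rate quoted above give a rough sense of the corpus size. The sketch below works through that arithmetic; the 16-bit mono PCM storage format is an assumption made for illustration, not something stated in this card.

```python
HOURS = 585               # approximate corpus duration quoted in the card
SAMPLE_RATE_HZ = 24_000   # 24 kHz sampling rate quoted in the card
BYTES_PER_SAMPLE = 2      # assumption: uncompressed 16-bit mono PCM


def corpus_samples(hours: float, sample_rate_hz: int) -> int:
    """Total number of audio samples for the given duration."""
    return int(hours * 3600 * sample_rate_hz)


def corpus_bytes(hours: float, sample_rate_hz: int, bytes_per_sample: int) -> int:
    """Uncompressed size in bytes under the assumed sample width."""
    return corpus_samples(hours, sample_rate_hz) * bytes_per_sample


total_samples = corpus_samples(HOURS, SAMPLE_RATE_HZ)
total_gb = corpus_bytes(HOURS, SAMPLE_RATE_HZ, BYTES_PER_SAMPLE) / 1e9
print(f"{total_samples:,} samples, about {total_gb:.1f} GB uncompressed")
```

Compressed formats will take less space than this estimate, so treat it as an upper bound for planning disk usage.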
Overview
This is the LibriTTS dataset, adapted… See the full description on the dataset page: https://huggingface.co/datasets/mythicinfinity/libritts.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset used for loading TTS spectrograms and waveform audio with alignments and a number of configurable "measures", which are extracted from the raw audio.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for LibriTTS-R
LibriTTS-R [1] is a sound quality improved version of the LibriTTS corpus (http://www.openslr.org/60/) which is a multi-speaker English corpus of approximately 585 hours of read English speech at 24kHz sampling rate, published in 2019.
Overview
This is the LibriTTS-R dataset, adapted for the datasets library.
Usage
Splits
There are 7 splits (dots replace dashes from the original dataset, to comply with hf naming… See the full description on the dataset page: https://huggingface.co/datasets/mythicinfinity/libritts_r.
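The renaming rule described above (dots in place of dashes) can be sketched as a one-line mapping. The seven subset names listed here are those of the original OpenSLR release; that they correspond one-to-one to the seven splits on the hub is an assumption, and `to_hub_split_name` is a hypothetical helper, not part of the dataset's loader.

```python
# Subset names from the original LibriTTS / LibriTTS-R release.
ORIGINAL_SPLITS = [
    "dev-clean",
    "dev-other",
    "test-clean",
    "test-other",
    "train-clean-100",
    "train-clean-360",
    "train-other-500",
]


def to_hub_split_name(original: str) -> str:
    """Replace dashes with dots, as the card describes."""
    return original.replace("-", ".")


HUB_SPLITS = [to_hub_split_name(name) for name in ORIGINAL_SPLITS]
print(HUB_SPLITS)
```

Check the dataset page for the authoritative split names before passing them to a loader.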
This dataset is a subset of a minimal version of Google's LibriTTS dataset; for more information on LibriTTS, see this article. It is a minimal version because it contains only the text and audio files, that is, the basics you need to train a text-to-speech model. It is also only a subset, because Kaggle has a size limit for datasets. To access the "full minimal dataset", see the list below:
1. Libri TTS train clean 100 (from the file train-clean-100 of the dataset)
2. Libri TTS train clean 360 part 1 (from the first half of the file train-clean-360)
3. Libri TTS train clean 360 part 2 (from the second half of the same file)
4. Libri TTS train other 500 part 1 (from the first part of the file train-other-500)
5. Libri TTS train other 500 part 2 (from the second part of the same file)
6. Libri TTS test (from the files test-clean and test-other)
7. Libri TTS dev (this dataset, from the files dev-clean and dev-other)
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Prateek Narain
Released under Apache 2.0
azain/LibriTTS-raw dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for Annotated LibriTTS-R
This dataset is an annotated version of LibriTTS-R [1]. LibriTTS-R [1] is a sound-quality-improved version of the LibriTTS corpus, which is a multi-speaker English corpus of approximately 585 hours of read English speech at 24kHz sampling rate, published in 2019. In the text_description column, it provides natural language annotations on the characteristics of speakers and utterances, which have been generated using the Data-Speech repository.… See the full description on the dataset page: https://huggingface.co/datasets/parler-tts/libritts_r_tags_tagged_10k_generated.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
LibriTTS Speaker Voices & Embeddings
Dataset Description
This dataset provides a collection of speaker voice samples from the LibriTTS corpus. For each speaker, a single 30-second audio clip is provided, created by concatenating their speech segments. The dataset is designed for tasks such as speaker identification, speaker verification, and as a voice bank for Text-to-Speech (TTS) models, particularly for voice cloning. In addition to the audio files and their metadata… See the full description on the dataset page: https://huggingface.co/datasets/sdialog/voices-libritts.
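One way to build a fixed-length clip by concatenating a speaker's segments is sketched below. The card does not describe the actual procedure, so the segment order, the truncation of the last segment, and the absence of padding or inserted silence are all assumptions; `build_speaker_clip` is a hypothetical helper.

```python
from typing import Iterable, List, Sequence


def build_speaker_clip(
    segments: Iterable[Sequence[float]],
    sample_rate_hz: int,
    clip_seconds: float = 30.0,
) -> List[float]:
    """Concatenate segments in order until the clip reaches the target length.

    The final segment is truncated so the clip never exceeds the target.
    If the speaker has less audio than requested, the clip is simply shorter.
    """
    target_samples = int(sample_rate_hz * clip_seconds)
    clip: List[float] = []
    for segment in segments:
        remaining = target_samples - len(clip)
        if remaining <= 0:
            break
        clip.extend(segment[:remaining])
    return clip


# Toy usage with a 1 Hz "sample rate" so the numbers stay readable.
toy_segments = [[0.0] * 10, [1.0] * 10, [2.0] * 10]
toy_clip = build_speaker_clip(toy_segments, sample_rate_hz=1, clip_seconds=25)
print(len(toy_clip))
```

In practice the segments would be arrays of audio samples at the corpus sampling rate; plain lists are used here only to keep the sketch dependency-free.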
LibriTTS Enhanced Dataset
Enhanced version of LibriTTS dataset for speech enhancement research.
This model was trained by kan-bayashi using libritts/tts1 recipe in espnet.

Python API
See https://github.com/espnet/espnet_model_zoo

Evaluate in the recipe
    git clone https://github.com/espnet/espnet
    cd espnet
    git checkout 628b46282537ce532d613d6bafb75e826e8455de
    pip install -e .
    cd egs2/libritts/tts1
    # Download the model file here
    ./run.sh --skip_data_prep false --skip_train true --download_model kan-bayashi/libritts_tts_train_xvector_vits_raw_phn_tacotron_g2p_en_no_space_train.total_count.ave

Config
    config: ./conf/tuning/train_xvector_vits.yaml
    print_config: false
    log_level: INFO
    dry_run: false
    iterator_type: sequence
    output_dir: exp/tts_train_xvector_vits_raw_phn_tacotron_g2p_en_no_space
    ngpu: 1
    seed: 777
    num_workers: 4
    num_att_plot: 3
    dist_backend: nccl
    dist_init_method: env://
    dist_world_size: 4
    dist_rank: 0
    local_rank: 0
    dist_master_addr: localhost
    dist_master_port: 60056
    dist_launcher: null
    multiprocessing_distributed: true
    unused_parameters: true
    sharded_ddp: false
    cudnn_enabled: true
    cudnn_benchmark: false
    cudnn_deterministic: false
    collect_stats: false
    write_collected_feats: false
    max_epoch: 100
    patience: null
    val_scheduler_criterion:
    - valid
    - loss
    early_stopping_criterion:
    - valid
    - loss
    - min
    best_model_criterion:
    -   - train
        - total_count
        - max
    keep_nbest_models: 10
    grad_clip: -1
    grad_clip_type: 2.0
    grad_noise: false
    accum_grad: 1
    no_forward_run: false
    resume: true
    train_dtype: float32
    use_amp: false
    log_interval: 50
    use_tensorboard: true
    use_wandb: false
    wandb_project: null
    wandb_id: null
    wandb_entity: null
    wandb_name: null
    wandb_model_log_interval: -1
    detect_anomaly: false
    pretrain_path: null
    init_param: []
    ignore_init_mismatch: false
    freeze_param: []
    num_iters_per_epoch: 10000
    batch_size: 20
    valid_batch_size: null
    batch_bins: 5000000
    valid_batch_bins: null
    train_shape_file:
    - exp/tts_stats_raw_linear_spectrogram_phn_tacotron_g2p_en_no_space/train/text_shape.phn
    - exp/tts_stats_raw_linear_spectrogram_phn_tacotron_g2p_en_no_space/train/speech_shape
    valid_shape_file:
    - exp/tts_stats_raw_linear_spectrogram_phn_tacotron_g2p_en_no_space/valid/text_shape.phn
    - exp/tts_stats_raw_linear_spectrogram_phn_tacotron_g2p_en_no_space/valid/speech_shape
    batch_type: numel
    valid_batch_type: null
    fold_length:
    - 150
    - 204800
    sort_in_batch: descending
    sort_batch: descending
    multiple_iterator: false
    chunk_length: 500
    chunk_shift_ratio: 0.5
    num_cache_chunks: 1024
    train_data_path_and_name_and_type:
    -   - dump/22k/raw/train-clean-460/text
        - text
        - text
    -   - dump/22k/raw/train-clean-460/wav.scp
        - speech
        - sound
    -   - dump/22k/xvector/train-clean-460/xvector.scp
        - spembs
        - kaldi_ark
    valid_data_path_and_name_and_type:
    -   - dump/22k/raw/dev-clean/text
        - text
        - text
    -   - dump/22k/raw/dev-clean/wav.scp
        - speech
        - sound
    -   - dump/22k/xvector/dev-clean/xvector.scp
        - spembs
        - kaldi_ark
    allow_variable_data_keys: false
    max_cache_size: 0.0
    max_cache_fd: 32
    valid_max_cache_size: null
    optim: adamw
    optim_conf:
        lr: 0.0002
        betas:
        - 0.8
        - 0.99
        eps: 1.0e-09
        weight_decay: 0.0
    scheduler: exponentiallr
    scheduler_conf:
        gamma: 0.999875
    optim2: adamw
    optim2_conf:
        lr: 0.0002
        betas:
        - 0.8
        - 0.99
        eps: 1.0e-09
        weight_decay: 0.0
    scheduler2: exponentiallr
    scheduler2_conf:
        gamma: 0.999875
    generator_first: false
    token_list:
    -
    -
    - AH0
    - T
    - N
    - D
    - S
    - R
    - L
    - IH1
    - DH
    - M
    - K
    - Z
    - EH1
    - AE1
    - IH0
    - AH1
    - W
    - ','
    - HH
    - ER0
    - P
    - IY1
    - V
    - F
    - B
    - UW1
    - AA1
    - AY1
    - AO1
    - .
    - EY1
    - IY0
    - OW1
    - NG
    - G
    - SH
    - Y
    - AW1
    - CH
    - ER1
    - UH1
    - TH
    - JH
    - ''''
    - '?'
    - OW0
    - EH2
    - '!'
    - IH2
    - OY1
    - EY2
    - AY2
    - EH0
    - UW0
    - AA2
    - AE2
    - OW2
    - AO2
    - AE0
    - AH2
    - ZH
    - AA0
    - UW2
    - IY2
    - AY0
    - AO0
    - AW2
    - EY0
    - UH2
    - ER2
    - AW0
    - '...'
    - UH0
    - OY2
    - . . .
    - OY0
    - . . . .
    - ..
    - . ...
    - . .
    - . . . . .
    - .. ..
    - '... .'
    -
    odim: null
    model_conf: {}
    use_preprocessor: true
    token_type: phn
    bpemodel: null
    non_linguistic_symbols: null
    cleaner: tacotron
    g2p: g2p_en_no_space
    feats_extract: linear_spectrogram
    feats_extract_conf:
        n_fft: 1024
        hop_length: 256
        win_length: null
    normalize: null
    normalize_conf: {}
    tts: vits
    tts_conf:
        generator_type: vits_generator
        generator_params:
            hidden_channels: 192
            spks: -1
            spk_embed_dim: 512
            global_channels: 256
            segment_size: 32
            text_encoder_attention_heads: 2
            text_encoder_ffn_expand: 4
            text_encoder_blocks: 6
            text_encoder_positionwise_layer_type: conv1d
            text_encoder_positionwise_conv_kernel_size: 3
            text_encoder_positional_encoding_layer_type: rel_pos
            text_encoder_self_attention_layer_type: rel_selfattn
            text_encoder_activation_type: swish
            text_encoder_normalize_before: true
            text_encoder_dropout_rate: 0.1
            text_encoder_positional_dropout_rate: 0.0
            text_encoder_attention_dropout_rate: 0.1
            use_macaron_style_in_text_encoder: true
            use_conformer_conv_in_text_encoder: false
            text_encoder_conformer_kernel_size: -1
            decoder_kernel_size: 7
            decoder_channels: 512
            decoder_upsample_scales:
            - 8
            - 8
            - 2
            - 2
            decoder_upsample_kernel_sizes:
            - 16
            - 16
            - 4
            - 4
            decoder_resblock_kernel_sizes:
            - 3
            - 7
            - 11
            decoder_resblock_dilations:
            -   - 1
                - 3
                - 5
            -   - 1
                - 3
                - 5
            -   - 1
                - 3
                - 5
            use_weight...
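Two relationships among the values in the config above can be checked with a few lines of arithmetic. The sketch below copies the numbers from the config; that the decoder's total upsampling should match hop_length follows from how a VITS-style decoder maps one spectrogram frame to hop_length waveform samples, and that the exponential scheduler is stepped once per epoch is an assumption about the training loop, not something the config states.

```python
from math import prod
from typing import Sequence

# Values copied from the config above.
HOP_LENGTH = 256
DECODER_UPSAMPLE_SCALES = [8, 8, 2, 2]
SEGMENT_SIZE_FRAMES = 32
INITIAL_LR = 0.0002
GAMMA = 0.999875


def upsample_factor(scales: Sequence[int]) -> int:
    """Total upsampling applied by the decoder (product of all stages)."""
    return prod(scales)


def segment_samples(segment_frames: int, hop_length: int) -> int:
    """Waveform samples covered by one training segment of latent frames."""
    return segment_frames * hop_length


def lr_after(steps: int, initial_lr: float, gamma: float) -> float:
    """Learning rate after `steps` scheduler steps of exponential decay."""
    return initial_lr * gamma ** steps


print(upsample_factor(DECODER_UPSAMPLE_SCALES) == HOP_LENGTH)
print(segment_samples(SEGMENT_SIZE_FRAMES, HOP_LENGTH))
# Assumption: one scheduler step per epoch, 100 epochs (max_epoch).
print(lr_after(100, INITIAL_LR, GAMMA))
```

With a gamma this close to 1, the learning rate changes only slightly over the configured 100 epochs under the per-epoch assumption.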
azain/LibriTTS-358-samples dataset hosted on Hugging Face and contributed by the HF Datasets community
ylacombe/libritts-r-text-tags-v4 dataset hosted on Hugging Face and contributed by the HF Datasets community
cmeraki/libritts dataset hosted on Hugging Face and contributed by the HF Datasets community
Nikhil20Sharma/3-LibriTTS-sample dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LibriTTS-R Mimi encoding
This dataset converts all audio in the dev.clean, test.clean, train.100 and train.360 splits of the LibriTTS-R dataset from waveforms to tokens in Kyutai's Mimi neural codec. These tokens are intended as targets for DualAR audio models, but also allow you to simply download all audio in ~50-100x less space, if you're comfortable decoding later on with rustymimi or Transformers. This does NOT contain the original audio, please use the regular LibriTTS-R for… See the full description on the dataset page: https://huggingface.co/datasets/jkeisling/libritts-r-mimi.
200 dialogues generated using SDialog:
ExpO0O5 > DoPaCo > 001 001: both roles use gemma3:27b-it-qat as LLM only doctor gets truncated '?'
Split without persona overlap:
train set: doc 0-59, pat 0-119
dev set: doc 60-79, pat 120-139
test set: doc 80-99, pat 140-199
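The persona-disjoint split above can be expressed as a small lookup. This is a minimal sketch that assumes the ID ranges are inclusive and that "doc" and "pat" are integer doctor and patient persona IDs; `assign_split` is a hypothetical helper, not part of SDialog.

```python
from typing import Dict, Optional

# Inclusive ID ranges from the split description, written as Python ranges.
DOCTOR_RANGES: Dict[str, range] = {
    "train": range(0, 60),
    "dev": range(60, 80),
    "test": range(80, 100),
}
PATIENT_RANGES: Dict[str, range] = {
    "train": range(0, 120),
    "dev": range(120, 140),
    "test": range(140, 200),
}


def split_of(persona_id: int, ranges: Dict[str, range]) -> Optional[str]:
    """Return the split whose ID range contains the persona, if any."""
    for name, ids in ranges.items():
        if persona_id in ids:
            return name
    return None


def assign_split(doctor_id: int, patient_id: int) -> Optional[str]:
    """Return the split for a dialogue, or None if its personas straddle splits."""
    doctor_split = split_of(doctor_id, DOCTOR_RANGES)
    patient_split = split_of(patient_id, PATIENT_RANGES)
    if doctor_split is not None and doctor_split == patient_split:
        return doctor_split
    return None


print(assign_split(61, 125))
```

A dialogue whose doctor and patient fall in different splits returns None, which makes accidental persona leakage easy to detect.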
Audio Setup:
Database of voices built from the LibriTTS dataset
IndexTTS model for utterance generation
dScaper for channel and metadata creation
PyRoomAcoustics for spatialization of the audio
TeodoraR/libritts-r-filtered-speaker-descriptions dataset hosted on Hugging Face and contributed by the HF Datasets community
azain/LibriTTS-dev-clean-16khz-mono-loudnorm-100-random-samples-2024-04-18-17-34-39-similarities dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ryota-komatsu/libritts-r-mhubert-2000units dataset hosted on Hugging Face and contributed by the HF Datasets community
morateng/libritts-r-test-clean dataset hosted on Hugging Face and contributed by the HF Datasets community