This CSTR VCTK Corpus includes speech data uttered by 110 English speakers with various accents. Each speaker reads out about 400 sentences, which were selected from a newspaper, the rainbow passage and an elicitation paragraph used for the speech accent archive.
Note that the 'p315' text was lost due to a hard disk error.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('vctk', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
badayvedat/VCTK dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This CSTR VCTK Corpus includes speech data uttered by 110 English speakers with various accents. Each speaker reads out about 400 sentences, which were selected from a newspaper, the rainbow passage and an elicitation paragraph used for the speech accent archive. The newspaper texts were taken from Herald Glasgow, with permission from Herald & Times Group. Each speaker has a different set of the newspaper texts, selected based on a greedy algorithm that increases the contextual and phonetic coverage. The details of the text selection algorithms are described in the following paper: C. Veaux, J. Yamagishi and S. King, "The voice bank corpus: Design, collection and data analysis of a large regional accent speech database," https://doi.org/10.1109/ICSDA.2013.6709856. The rainbow passage and elicitation paragraph are the same for all speakers. The rainbow passage can be found at the International Dialects of English Archive (http://web.ku.edu/~idea/readings/rainbow.htm). The elicitation paragraph is identical to the one used for the speech accent archive (http://accent.gmu.edu). The details of the speech accent archive can be found at http://www.ualberta.ca/~aacl2009/PDFs/WeinbergerKunath2009AACL.pdf. All speech data was recorded using an identical recording setup: an omni-directional microphone (DPA 4035) and a small diaphragm condenser microphone with very wide bandwidth (Sennheiser MKH 800), a 96 kHz sampling frequency at 24 bits, in a hemi-anechoic chamber of the University of Edinburgh. (However, two speakers, p280 and p315, had technical issues with the audio recordings made with the MKH 800.) All recordings were converted into 16 bits, were downsampled to 48 kHz, and were manually end-pointed.
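As a quick sanity check, a recording's format can be inspected with the soundfile library; this is a minimal sketch, not part of the corpus documentation, and the path below is a hypothetical example (the exact directory layout and file extension depend on the corpus release you download):
import soundfile as sf
# Hypothetical path; adjust to your local copy of the corpus.
path = "VCTK-Corpus/wav48/p225/p225_001.wav"
audio, sample_rate = sf.read(path)
info = sf.info(path)
print(sample_rate)   # expected 48000 (48 kHz after downsampling)
print(info.subtype)  # expected 'PCM_16' (16-bit)
print(audio.shape)   # number of samples (mono)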
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a training checkpoint for speech-to-text-wavenet trained against the VCTK dataset. Training terminated after 20 epochs with a loss of 8.72. Training batch size was reduced from 16 to 4, and revision 91758811f of speech-to-text-wavenet was used, with tensorflow 0.12.1 and sugartensor 0.0.2.3. Training was run in Python 2.7.13 on OS X 10.11.6 with CUDA 7.5 on an NVIDIA GeForce GTX 780M.
Dataset Card for "vctk"
More Information needed
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was created by Desty Nova
Released under Attribution 4.0 International (CC BY 4.0)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
VCTK
This is a processed clone of the VCTK dataset with leading and trailing silence removed using Silero VAD. A fixed 25 ms of padding has been added to both ends of each audio clip to (hopefully) improve training and finetuning. The original dataset is available at: https://datashare.ed.ac.uk/handle/10283/3443.
Reproducing
This repository notably lacks a requirements.txt file. There's likely a missing dependency or two, but roughly: pydub tqdm torch torchaudio… See the full description on the dataset page: https://huggingface.co/datasets/jspaulsen/vctk.
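For reference, a silence-trimming pass in the spirit described above could look roughly like the sketch below; it assumes torch and torchaudio are installed and pulls Silero VAD from torch.hub, and it is not the exact script used to build this clone:
import torch
import torchaudio

# Load Silero VAD from torch.hub (model plus helper utilities).
model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
get_speech_timestamps, _, read_audio, _, _ = utils

sr = 16000                    # Silero VAD operates at 16 kHz (or 8 kHz)
pad = int(0.025 * sr)         # fixed 25 ms of padding, as in this dataset
wav = read_audio('p225_001.wav', sampling_rate=sr)  # hypothetical input file

ts = get_speech_timestamps(wav, model, sampling_rate=sr)
if ts:
    # Keep audio from the first detected speech onset to the last offset,
    # then append 25 ms of silence on both ends.
    speech = wav[ts[0]['start']:ts[-1]['end']]
    trimmed = torch.nn.functional.pad(speech, (pad, pad))
    torchaudio.save('p225_001_trimmed.wav', trimmed.unsqueeze(0), sr)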
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We introduce PodcastMix, a dataset formalizing the task of separating background music and foreground speech in podcasts. It contains audio files at 44.1 kHz and the corresponding metadata. For further details check the following paper and the associated GitHub repository:
N. Schmidt, J. Pons, M. Miron, "Score-informed source separation for multi-channel orchestral recordings", submitted to ICASSP (2022)
https://github.com/MTG/Podcastmix
This dataset contains four parts (we indicate the content of the Zenodo archives in brackets):
[metadata] PodcastMix-synth train: a large and diverse training set that is programmatically generated (with a validation partition). The mixtures are created programmatically with music from Jamendo and speech from the VCTK dataset.
[metadata] PodcastMix-synth test: a programmatically generated test set with reference stems to compute evaluation metrics. The mixtures are created programmatically with music from Jamendo and speech from the VCTK dataset.
[audio and metadata] PodcastMix-real with-reference: a test set with real podcasts with reference stems to compute evaluation metrics. The podcasts are recorded by one of the authors and the music comes from the FMA dataset.
[audio and metadata] PodcastMix-real no-reference: a test set with real podcasts containing only the podcast mixes, for subjective evaluation. The podcasts are compiled from the internet.
This dataset was created by Nicolas Schmidt and Marius Miron (Music Technology Group, Universitat Pompeu Fabra, Barcelona) and Jordi Pons (Dolby Labs). This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0).
Please acknowledge PodcastMix in academic research. When the present dataset is used for academic research, we would highly appreciate it if authors cite the following publication:
N. Schmidt, J. Pons, M. Miron, "Score-informed source separation for multi-channel orchestral recordings", submitted to ICASSP (2022)
The dataset and its contents are made available on an “as is” basis and without warranties of any kind, including without limitation satisfactory quality and conformity, merchantability, fitness for a particular purpose, accuracy or completeness, or absence of errors. Subject to any liability that may not be excluded or limited by law, the UPF is not liable for, and expressly excludes, all liability for loss or damage however and whenever caused to anyone by any use of the dataset or any part of it.
PURPOSES. The data is processed for the general purpose of carrying out research, development and innovation studies, works or projects. In particular, but without limitation, the data is processed for the purpose of communicating with the Licensee regarding administrative and legal/judicial matters.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a new variant of the voice cloning toolkit (VCTK) dataset: device-recorded VCTK (DR-VCTK), where the high-quality speech signals recorded in a semi-anechoic chamber using professional audio devices are played back and re-recorded in office environments using relatively inexpensive consumer devices.
This dataset was created by Kate Dmitrieva
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
yfyeung/vctk dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset was created by Smeu Stefan
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset was generated as part of the paper: DCUnet-Based Multi-Model Approach for Universal Sound Separation, K. Arendt, A. Szumaczuk, B. Jasik, K. Piaskowski, P. Masztalski, M. Matuszewski, K. Nowicki, P. Zborowski. It contains various sounds from the Audio Set [1] and spoken utterances from the VCTK [2] and DNS [3] datasets.
Contents:
sr_8k/ mix_clean/ s1/ s2/ s3/ s4/
sr_16k/ mix_clean/ s1/ s2/ s3/ s4/
sr_48k/ mix_clean/ s1/ s2/ s3/ s4/
Each top-level directory contains 512 audio samples at a different sampling rate (sr_8k - 8 kHz, sr_16k - 16 kHz, sr_48k - 48 kHz). The audio samples for each sampling rate are different, as they were generated randomly and separately. Each directory contains 5 subdirectories:
- mix_clean - mixed sources,
- s1 - source #1 (general sounds),
- s2 - source #2 (speech),
- s3 - source #3 (traffic sounds),
- s4 - source #4 (wind noise).
The sound mixtures were generated by adding s2, s3 and s4 to s1 with SNRs ranging from -10 to 10 dB w.r.t. s1.
REFERENCES:
[1] Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter, "Audio Set: An ontology and human-labeled dataset for audio events," in Proc. IEEE ICASSP 2017, New Orleans, LA, 2017.
[2] Christophe Veaux, Junichi Yamagishi, and Kirsten MacDonald, "CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit [sound]," https://doi.org/10.7488/ds/1994, University of Edinburgh, The Centre for Speech Technology Research (CSTR), 2017.
[3] Chandan K. A. Reddy, Ebrahim Beyrami, Harishchandra Dubey, Vishak Gopal, Roger Cheng, Ross Cutler, Sergiy Matusevych, Robert Aichner, Ashkan Aazami, Sebastian Braun, Puneet Rana, Sriram Srinivasan, and Johannes Gehrke, "The INTERSPEECH 2020 deep noise suppression challenge: Datasets, subjective speech quality and testing framework," 2020.
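For illustration only, mixing a secondary source into s1 at a target SNR as described above can be sketched as follows; this is not the authors' generation code, and the file names used here are hypothetical:
import numpy as np
import soundfile as sf

def mix_at_snr(s1, s2, snr_db):
    # Scale s2 so the s1-to-s2 power ratio equals snr_db, then add it to s1.
    p1 = np.mean(s1 ** 2)
    p2 = np.mean(s2 ** 2) + 1e-12
    gain = np.sqrt(p1 / (p2 * 10 ** (snr_db / 10)))
    return s1 + gain * s2

s1, sr = sf.read('sr_16k/s1/example.wav')   # general sounds (hypothetical name)
s2, _ = sf.read('sr_16k/s2/example.wav')    # speech (hypothetical name)
snr_db = np.random.uniform(-10, 10)         # SNR w.r.t. s1, as in the dataset
mix = mix_at_snr(s1, s2[:len(s1)], snr_db)
sf.write('mix_example.wav', mix, sr)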
This dataset was created by Alexandr Ivanov
ruhullah1/vctk dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
A dataset of parallel fullband speech simultaneously recorded by various microphones in the Wilska Multichannel Anechoic Chamber at the Acoustics Lab of Aalto University, Espoo, Finland.
The source data comes from the VCTK dataset. It was played back using a high-quality professional full-range loudspeaker and recorded by two headset microphones and six studio microphones. The total amount of data per microphone is 23 hours, 11 minutes and 13 seconds. All data was recorded at 48 kHz and 24-bit, and later compressed to .flac format (lossless compression).
The data is structured as follows:
Each folder contains the same number of files. Files with the same name across folders are parallel recordings, meaning that they have identical content and the same number of samples, but captured by each corresponding microphone.
Each file uses the following naming convention:
pXXX_YYY_ZZ.flac
where:
pXXX is the speaker ID, matching exactly the IDs used in the VCTK dataset.
YYY is the utterance ID, also consistent with the VCTK dataset.
ZZ is the segment ID. Utterances were trimmed to remove silent sections, and some were split into multiple segments as a result (e.g., p225_009_00.flac and p225_009_01.flac).
Each folder contains a meta.csv file with metadata for its corresponding subset, including a SHA256 hash to verify data integrity. These were generated using sndls with the following command:
sndls audio --csv meta.csv -e .flac -r --sha256
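The naming convention and the hashes lend themselves to a simple integrity check; the sketch below is not part of the dataset's tooling, and the meta.csv column names ('filename', 'sha256') are assumptions, since the exact layout produced by sndls is not reproduced here:
import csv
import hashlib
from pathlib import Path

def parse_name(path):
    # 'p225_009_00.flac' -> ('p225', '009', '00')
    speaker, utterance, segment = Path(path).stem.split('_')
    return speaker, utterance, segment

def sha256_of(path):
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

with open('meta.csv', newline='') as f:
    for row in csv.DictReader(f):            # assumed columns: filename, sha256
        ok = sha256_of(row['filename']) == row['sha256']
        print(row['filename'], parse_name(row['filename']), 'OK' if ok else 'MISMATCH')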
CC0 1.0 (Public Domain) https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by KC La
Released under CC0: Public Domain
confit/vctk-full dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets used in the paper: "Creating speech zones with self-distributing acoustic swarms"
This deposit contains 2 distinct datasets:
A dataset of speech mixtures containing 2-5 speakers simulated using PyRoomAcoustics. The dataset consists of 8000 training mixtures, 500 validation mixtures and 1000 testing mixtures.
A dataset of speech mixtures containing 3-5 speakers created from synchronized recordings in reverberant rooms with objects cluttering the table. The dataset consists of 500 testing mixtures.
The source sounds are various utterances from the VCTK dataset. For real-world data, the utterances are played over a Rokono Bass+ Mini Speaker. The recordings are captured by an array of 7 microphones carried by our robotic swarm as it distributes itself across the table. The real-world recorded audio has been subjected to audio compression and decompression using the Opus codec to enable multiple simultaneous streams.
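For the synthetic subset, a simplified PyRoomAcoustics simulation of a multi-speaker mixture captured by a 7-microphone array could look like the sketch below; the room geometry, absorption and positions are illustrative assumptions rather than the authors' configuration, and the VCTK file names are hypothetical:
import numpy as np
import pyroomacoustics as pra
import soundfile as sf

fs = 16000
room = pra.ShoeBox([6.0, 5.0, 3.0], fs=fs,
                   materials=pra.Material(0.3), max_order=10)

# Two VCTK utterances as sources (hypothetical paths; assumed resampled to fs).
for path, pos in [('p225_001.wav', [1.5, 2.0, 1.0]),
                  ('p226_001.wav', [4.0, 3.5, 1.0])]:
    sig, _ = sf.read(path)
    room.add_source(pos, signal=sig)

# A small circular array of 7 microphones on the table surface.
angles = np.linspace(0, 2 * np.pi, 7, endpoint=False)
mics = np.stack([3.0 + 0.05 * np.cos(angles),
                 2.5 + 0.05 * np.sin(angles),
                 np.full(7, 0.75)])
room.add_microphone_array(pra.MicrophoneArray(mics, fs))

room.simulate()
mixture = room.mic_array.signals   # shape: (7, num_samples)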
Please see the Readme for more information. Please see related identifiers for other datasets.