WenetSpeech is a multi-domain Mandarin corpus consisting of 10,000+ hours of high-quality labeled speech, 2,400+ hours of weakly labeled speech, and about 10,000 hours of unlabeled speech, for 22,400+ hours in total. The authors collected the data from YouTube and podcasts, covering a variety of speaking styles, scenarios, domains, topics, and noise conditions. An optical character recognition (OCR) based method is introduced to generate audio/text segmentation candidates for the YouTube data from its corresponding video captions.
foreveronly12/WenetSpeech dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset contains only test data, which is integrated into the UltraEval-Audio (https://github.com/OpenBMB/UltraEval-Audio) framework.
python audio_evals/main.py --dataset WenetSpeech-test-meeting --model gpt4o_audio
python audio_evals/main.py --dataset WenetSpeech-test-net --model gpt4o_audio
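If the test data itself needs to be inspected outside the evaluation framework, it can also be pulled with the Hugging Face datasets library. The sketch below is only an assumption-based example: the split name is a guess inferred from the commands above, so check the dataset page for the actual configuration and column names.

```python
# Hedged sketch: browsing the hosted test data directly with the
# `datasets` library. The split name "test" is an assumption; the
# TEST_NET / TEST_MEETING partitions may be exposed differently.
from datasets import load_dataset

ds = load_dataset("TwinkStart/WenetSpeech", split="test")
print(ds)             # number of rows and column names
sample = ds[0]
print(sample.keys())  # typically an audio column plus a transcript column
```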
🚀 An exceptional experience, with UltraEval-Audio 🚀
UltraEval-Audio is the world's first open-source framework to support evaluation of both speech understanding and speech generation. Built specifically for evaluating large audio models, it brings together 34 authoritative benchmarks covering four domains (speech, sound, medical, and music), supports ten languages, and spans twelve task types. With UltraEval-Audio, you get unprecedented convenience and efficiency:
One-click benchmark management… See the full description on the dataset page: https://huggingface.co/datasets/TwinkStart/WenetSpeech.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
pengyizhou/wenetspeech-subset-S dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The study uses a state-of-the-art speech embedding method for WD detection in unstructured connected speech (UCS), combining bi-directional semantic dependencies and attention mechanisms. The feature data file covers 110 native Mandarin-speaking participants: 55 WD patients and 55 sex-matched healthy individuals. The four data columns are labels (0 for healthy individuals and 1 for WD patients), the ComParE feature set, the Wav2vec 2.0 embedded feature set, and the HuBERT embedded feature set. To obtain frame-level speech representations that can be compared and fused with the embedding approaches, we use only the low-level descriptors (LLDs) of ComParE (the latest, 2016 version), which yields 65-dimensional features per time step, with the window length and step length set to 30 ms and 20 ms, respectively. The resulting ComParE feature shape for each participant's 60 s audio is 2999 × 65. To adapt to native speech data, we extract embeddings from pre-trained Wav2vec 2.0 (w2v2) and HuBERT models fine-tuned on 10,000 hours of Chinese speech from WenetSpeech. Considering computational resources and time cost, we use the base versions of the pre-trained models and take the final 768-dimensional hidden layer as the embedding representation of the audio; the last hidden state thus yields a 2999 × 768 representation per audio sample.
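As an illustration of the frame-level embedding extraction described above, here is a minimal sketch using PyTorch, torchaudio, and the Hugging Face transformers API. The checkpoint name and file path are placeholders rather than the exact WenetSpeech-based checkpoints used in the study, and the normalization step simply mirrors what the standard wav2vec 2.0/HuBERT feature extractors apply.

```python
# Minimal sketch (assumed checkpoint and file names): extract frame-level
# HuBERT embeddings for a 60 s Mandarin recording. A wav2vec 2.0 model can
# be substituted by swapping HubertModel for Wav2Vec2Model.
import torch
import torchaudio
from transformers import HubertModel

MODEL_ID = "TencentGameMate/chinese-hubert-base"  # placeholder checkpoint

model = HubertModel.from_pretrained(MODEL_ID)
model.eval()

# Load the recording and resample to the 16 kHz rate the model expects.
wav, sr = torchaudio.load("participant_60s.wav")  # placeholder path
wav = torchaudio.functional.resample(wav, sr, 16000).mean(dim=0)

# Zero-mean / unit-variance normalization, as the standard wav2vec 2.0 /
# HuBERT feature extractors apply before the encoder.
wav = (wav - wav.mean()) / (wav.std() + 1e-7)

with torch.no_grad():
    out = model(wav.unsqueeze(0))  # batch of one utterance

# One 768-dimensional vector roughly every 20 ms: about 2999 x 768
# for a 60 s clip, matching the shape quoted in the description.
embeddings = out.last_hidden_state.squeeze(0)
print(embeddings.shape)
```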