This dataset was created by Rahul Bhalley
https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for Common Voice Corpus 22.0
This dataset is an unofficial version of the Mozilla Common Voice Corpus 22. It was downloaded and converted from the project's website https://commonvoice.mozilla.org/.
Languages
Abkhaz, Albanian, Amharic, Arabic, Armenian, Assamese, Asturian, Azerbaijani, Basaa, Bashkir, Basque, Belarusian, Bengali, Breton, Bulgarian, Cantonese, Catalan, Central Kurdish, Chinese (China), Chinese (Hong Kong), Chinese (Taiwan), Chuvash, Czech… See the full description on the dataset page: https://huggingface.co/datasets/fsicoli/common_voice_22_0.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Introducing the Czech Scripted Monologue Speech Dataset for the Healthcare Domain, a voice dataset built to accelerate the development and deployment of Czech language automatic speech recognition (ASR) systems, with a sharp focus on real-world healthcare interactions.
This dataset includes over 6,000 high-quality scripted audio prompts recorded in Czech, representing typical voice interactions found in the healthcare industry. The data is tailored for use in voice technology systems that power virtual assistants, patient-facing AI tools, and intelligent customer service platforms.
The prompts span a broad range of healthcare-specific interactions, such as:
To maximize authenticity, the prompts integrate linguistic elements and healthcare-specific terms such as:
These elements make the dataset exceptionally suited for training AI systems to understand and respond to natural healthcare-related speech patterns.
Every audio recording is accompanied by a verbatim, manually verified transcription.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Presenting the Czech Scripted Monologue Speech Dataset for the Telecom Domain, a purpose-built dataset created to accelerate the development of Czech speech recognition and voice AI models specifically tailored for the telecommunications industry.
This dataset includes over 6,000 high-quality scripted prompt recordings in Czech, representing real-world telecom customer service scenarios. It’s designed to support the training of speech-based AI systems used in call centers, virtual agents, and voice-powered support tools.
The dataset reflects a wide variety of common telecom customer interactions, including:
To maximize contextual richness, prompts include:
Each audio file is paired with an accurate, verbatim transcription for precise model training:
Detailed metadata is included to enhance dataset usability.