4 datasets found

Common Voice Czech
kaggle.com
Updated Jan 12, 2023
Cite
Rahul Bhalley (2023). Common Voice Czech [Dataset]. https://www.kaggle.com/datasets/rahulbhalley/common-voice-czech
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Jan 12, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Rahul Bhalley
Description
Dataset

This dataset was created by Rahul Bhalley

Contents
common_voice_22_0
huggingface.co
Updated Jun 28, 2025
Cite
Fabio Sicoli (2025). common_voice_22_0 [Dataset]. https://huggingface.co/datasets/fsicoli/common_voice_22_0
Dataset updated
Jun 28, 2025
Authors
Fabio Sicoli
License
https://choosealicense.com/licenses/cc0-1.0/
Description
Dataset Card for Common Voice Corpus 22.0

This dataset is an unofficial version of the Mozilla Common Voice Corpus 22. It was downloaded and converted from the project's website https://commonvoice.mozilla.org/.

Languages

Abkhaz, Albanian, Amharic, Arabic, Armenian, Assamese, Asturian, Azerbaijani, Basaa, Bashkir, Basque, Belarusian, Bengali, Breton, Bulgarian, Cantonese, Catalan, Central Kurdish, Chinese (China), Chinese (Hong Kong), Chinese (Taiwan), Chuvash, Czech… See the full description on the dataset page: https://huggingface.co/datasets/fsicoli/common_voice_22_0.
Czech Scripted Monologue Speech Data for Healthcare
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Czech Scripted Monologue Speech Data for Healthcare [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/healthcare-scripted-speech-monologues-czech-czech-republic
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
Introducing the Czech Scripted Monologue Speech Dataset for the Healthcare Domain, a voice dataset built to accelerate the development and deployment of Czech language automatic speech recognition (ASR) systems, with a sharp focus on real-world healthcare interactions.
Speech Data
This dataset includes over 6,000 high-quality scripted audio prompts recorded in Czech, representing typical voice interactions found in the healthcare industry. The data is tailored for use in voice technology systems that power virtual assistants, patient-facing AI tools, and intelligent customer service platforms.
•Participant Diversity
  •Speakers: 60 native Czech speakers.
  •Regional Balance: Participants are sourced from multiple regions across the Czech Republic, reflecting diverse dialects and linguistic traits.
  •Demographics: Includes a mix of male and female participants (60:40 ratio), aged between 18 and 70 years.
•Recording Specifications
  •Nature of Recordings: Scripted monologues based on healthcare-related use cases.
  •Duration: Each clip ranges from 5 to 30 seconds, offering short, context-rich speech samples.
  •Audio Format: WAV files recorded in mono, with 16-bit depth and sample rates of 8 kHz and 16 kHz.
  •Environment: Clean, echo-free spaces ensure clear, noise-free audio capture.
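The recording specifications above can be checked per file once the data is downloaded. The following is a minimal sketch using Python's standard `wave` module; the function name `check_wav_spec` and the idea of applying the listing's duration range as a hard limit are assumptions for illustration, not part of the dataset's documentation.

```python
import wave

# Values taken from the listing: mono, 16-bit, 8 kHz or 16 kHz, 5-30 s clips.
ALLOWED_RATES = {8000, 16000}
MIN_SECONDS, MAX_SECONDS = 5, 30


def check_wav_spec(path):
    """Return a list of mismatches against the listed spec (empty if none)."""
    problems = []
    with wave.open(str(path), "rb") as wav:
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()  # bytes per sample; 2 means 16-bit
        rate = wav.getframerate()
        duration = wav.getnframes() / rate

    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if sample_width != 2:
        problems.append(f"expected 16-bit, found {sample_width * 8}-bit")
    if rate not in ALLOWED_RATES:
        problems.append(f"unexpected sample rate {rate} Hz")
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        problems.append(f"duration {duration:.1f} s outside 5-30 s")
    return problems
```

Running this over every file before training gives an early warning if a delivery deviates from the advertised format; whether a deviation matters depends on the downstream ASR pipeline.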

Topic Coverage
The prompts span a broad range of healthcare-specific interactions, such as:
•Patient check-in and follow-up communication
•Appointment booking and cancellation dialogues
•Insurance and regulatory support queries
•Medication, test results, and consultation discussions
•General health tips and wellness advice
•Emergency and urgent care communication
•Technical support for patient portals and apps
•Domain-specific scripted statements and FAQs
Contextual Depth
To maximize authenticity, the prompts integrate linguistic elements and healthcare-specific terms such as:
•Names: Gender- and region-appropriate Czech names
•Addresses: Varied local address formats spoken naturally
•Dates & Times: References to appointment dates, times, follow-ups, and schedules
•Medical Terminology: Common medical procedures, symptoms, and treatment references
•Numbers & Measurements: Health data like dosages, vitals, and test result values
•Healthcare Institutions: Names of clinics, hospitals, and diagnostic centers

These elements make the dataset exceptionally suited for training AI systems to understand and respond to natural healthcare-related speech patterns.
Transcription
Every audio recording is accompanied by a verbatim, manually verified transcription.
•Content: The transcription mirrors the exact scripted prompt recorded by the speaker.
•Format: Files are delivered in plain text (.TXT) format with consistent naming conventions for seamless integration.
Czech Scripted Monologue Speech Data for Telecom
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Czech Scripted Monologue Speech Data for Telecom [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/telecom-scripted-speech-monologues-czech-czech-republic
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
Presenting the Czech Scripted Monologue Speech Dataset for the Telecom Domain, a purpose-built dataset created to accelerate the development of Czech speech recognition and voice AI models specifically tailored for the telecommunications industry.
Speech Data
This dataset includes over 6,000 high-quality scripted prompt recordings in Czech, representing real-world telecom customer service scenarios. It’s designed to support the training of speech-based AI systems used in call centers, virtual agents, and voice-powered support tools.
•Participant Diversity
  •Speakers: 60 native Czech speakers
  •Geographic Distribution: Carefully selected from multiple regions across the Czech Republic to capture a wide spectrum of dialects and speaking styles
  •Demographics: Male and female speakers (60:40 ratio), aged 18 to 70 years
•Recording Specifications
  •Type: Scripted monologue prompts focused on telecom industry use cases
  •Duration: Each audio clip ranges from 5 to 30 seconds
  •Format: WAV files in mono, 16-bit depth, with sample rates of 8 kHz and 16 kHz
  •Environment: Clean, echo-free, and noise-controlled settings to ensure optimal audio clarity

Topic Coverage
The dataset reflects a wide variety of common telecom customer interactions, including:
•Customer onboarding and service inquiries
•Billing and payment questions
•Data plans and product information
•Technical support requests
•Network coverage discussions
•Regulatory compliance and policy information
•Upgrades, renewals, and service plan changes
•Domain-specific scripted interactions tailored to real-world telecom use cases
Contextual Depth
To maximize contextual richness, prompts include:
•Localized Names: Common Czech names in various formats
•Addresses: Region-specific address structures for realism
•Dates & Times: Spoken date and time references in typical telecom scenarios (e.g., billing cycles, service activation times)
•Telecom Terminology: Keywords related to mobile data, network, SIM, devices, plans, etc.
•Numbers & Rates: Usage statistics, pricing info, recharge values, and billing figures
•Service Providers: References to telecom companies and third-party service entities

Transcription
Each audio file is paired with an accurate, verbatim transcription for precise model training:
•Content: Transcriptions are direct representations of each recorded prompt
•Format: Plain text (.TXT), with filenames matching their corresponding audio files
•Verification: Every transcription is manually verified by native Czech linguists to ensure consistency and accuracy
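Because the transcripts are described as plain-text files whose names match their audio files, the pairing step can be sketched as below. This assumes that "matching" means the same base name with a different extension and that all files sit somewhere under one root folder; neither detail is confirmed by the listing, and `pair_audio_with_transcripts` is a hypothetical helper name.

```python
from pathlib import Path


def pair_audio_with_transcripts(root):
    """Pair each .wav file with the .txt file that shares its base name.

    Returns (pairs, missing): pairs is a list of (wav_path, txt_path) tuples,
    missing is a sorted list of base names that have audio but no transcript.
    """
    root = Path(root)
    # Compare extensions case-insensitively, since the listing writes ".TXT".
    files = [p for p in root.rglob("*") if p.is_file()]
    wavs = {p.stem: p for p in files if p.suffix.lower() == ".wav"}
    txts = {p.stem: p for p in files if p.suffix.lower() == ".txt"}

    pairs = [(wavs[stem], txts[stem]) for stem in sorted(wavs.keys() & txts.keys())]
    missing = sorted(wavs.keys() - txts.keys())
    return pairs, missing
```

Reporting the unmatched names separately, rather than silently skipping them, makes it easier to spot an incomplete download before training starts.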

Metadata
Detailed metadata is included to enhance dataset usability