100+ datasets found
  1. The "SIVA" Speech Database for Speaker Verification and Identification

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Jun 14, 2005
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2005). The "SIVA" Speech Database for Speaker Verification and Identification [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0028/
    Explore at:
    Dataset updated
    Jun 14, 2005
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Description

    The Italian speech database SIVA ("Speaker Identification and Verification Archives") comprises more than two thousand calls collected over the public switched telephone network and is distributed via ELRA. The SIVA database consists of four speaker categories: male users, female users, male impostors, and female impostors. Speakers were contacted by mail before the test and asked to read the information and instructions carefully before making the call. About 500 speakers were recruited through a company specialized in the selection of population samples; the others were volunteers contacted by the institute concerned. Speakers access the recording system by calling a toll-free number, and an automatic answering system guides them through the three sessions that make up a recording. In the first session, a list of 28 words (including digits and some commands) is recorded using a standard enumerated prompt. The second session is a simple unidirectional dialogue (the caller answers prompted questions) in which personal information is requested (name, age, etc.). In the third session, the speaker reads a continuous passage of phonetically balanced text that resembles a short curriculum vitae. The signal is sampled at 8 kHz and coded in 8-bit mu-law format. The data collected so far consists of:

    • MU: male users, 18 speakers, 20 repetitions
    • FU: female users, 16 speakers, 26 repetitions
    • MI: male impostors, 189 speakers with 2 repetitions and 128 speakers with 1 repetition
    • FI: female impostors, 213 speakers with 2 repetitions and 107 speakers with 1 repetition
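
    Because the audio is 8 kHz, 8-bit G.711 mu-law, raw samples must be expanded to linear PCM before analysis. A minimal numpy sketch of the standard G.711 expansion follows; the file name and the assumption of headerless raw bytes are hypothetical, as the listing does not document the distribution's layout.

```python
import numpy as np

def mulaw_to_linear(codes: np.ndarray) -> np.ndarray:
    """Expand 8-bit G.711 mu-law codewords to float32 PCM in [-1, 1]."""
    u = (~codes).astype(np.uint8)                  # G.711 stores codewords bit-inverted
    sign = (u & 0x80) != 0                         # sign bit of the expanded sample
    exponent = ((u >> 4) & 0x07).astype(np.int32)  # 3-bit segment number
    mantissa = (u & 0x0F).astype(np.int32)         # 4-bit quantization step
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return np.where(sign, -magnitude, magnitude).astype(np.float32) / 32768.0

# Hypothetical raw file; the actual SIVA file layout may differ.
codes = np.fromfile("session1_word01.raw", dtype=np.uint8)
audio = mulaw_to_linear(codes)  # 8 kHz mono samples ready for analysis
```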

  2. MOBIO

    • data.europa.eu
    • data.niaid.nih.gov
    • +1more
    unknown
    Updated Nov 30, 2010
    Cite
    Zenodo (2010). MOBIO [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-4269551?locale=da
    Explore at:
    unknown. Available download formats
    Dataset updated
    Nov 30, 2010
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    Description

    MOBIO is a dataset for mobile face and speaker recognition. The dataset consists of bi-modal (audio and video) data taken from 150 people, with a female-to-male ratio of nearly 1:2 (99 males and 51 females), collected from August 2008 until July 2010 at six different sites in five different countries. This led to a diverse bi-modal dataset with both native and non-native English speakers. In total, 12 sessions were captured for each client: 6 sessions for Phase I and 6 sessions for Phase II. The Phase I data consists of 21 questions of the types Short Response Questions, Short Response Free Speech, Set Speech, and Free Speech. The Phase II data consists of 11 questions of the types Short Response Questions, Set Speech, and Free Speech. A more detailed description of the questions asked of the clients is provided below.

    The database was recorded using two mobile devices: a NOKIA N93i mobile phone and a standard 2008 MacBook laptop. The laptop was used only to capture part of the first session, which therefore consists of data captured on both the laptop and the mobile phone.

    Detailed description of questions. Note that the answers to the Short Response Free Speech and Free Speech questions DO NOT necessarily relate to the question, as the sole purpose is to have the subject produce free speech; the answers to ALL of these questions are therefore assumed to be false.

    1. Short Response Questions: five pre-defined questions.
    • What is your name? (the user supplies their fake name)
    • What is your address? (the user supplies their fake address)
    • What is your birthdate? (the user supplies their fake birthdate)
    • What is your license number? (the user supplies their fake ID card number, the same for each person)
    • What is your credit card number? (the user supplies their fake card number)
    2. Short Response Free Speech: five random questions taken from a list of 30-40 questions. The user answered each question by speaking for approximately 5 seconds of recording (sometimes more, sometimes less).
    3. Set Speech: the users were asked to read a pre-defined text aloud, designed to take longer than 10 seconds to utter; participants were allowed to correct themselves while reading. The text was: "I have signed the MOBIO consent form and I understand that my biometric data is being captured for a database that might be made publicly available for research purposes. I understand that I am solely responsible for the content of my statements and my behaviour. I will ensure that when answering a question I do not provide any personal information in response to any question."
    4. Free Speech: 10 random questions from a list of approximately 30 questions. The answer to each question took approximately 10 seconds (sometimes less, sometimes more).

    Acknowledgements: Elie Khoury, Laurent El-Shafey, Christopher McCool, Manuel Günther, Sébastien Marcel, "Bi-modal biometric authentication on mobile phones in challenging conditions", Image and Vision Computing, Volume 32, Issue 12, 2014. DOI: 10.1016/j.imavis.2013.10.001. https://publications.idiap.ch/index.php/publications/show/2689

  3. Call Center Speech Recognition Dataset

    • kaggle.com
    zip
    Updated Oct 14, 2025
    + more versions
    Cite
    Axon Labs (2025). Call Center Speech Recognition Dataset [Dataset]. https://www.kaggle.com/datasets/axondata/call-center-speech-dataset
    Explore at:
    zip (12766164 bytes). Available download formats
    Dataset updated
    Oct 14, 2025
    Authors
    Axon Labs
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Multilingual Call Center Speech Recognition Dataset: 10,000 Hours

    Dataset Summary

    10,000 hours of real-world call center speech recordings in 7 languages with transcripts. Train speech recognition, sentiment analysis, and conversation AI models on authentic customer support audio. Covers support, sales, billing, finance, and pharma domains

    Dataset Features

    📊 Scale & Quality

    • 10,000 hours of inbound & outbound calls
    • Real-world field recordings - no synthetic audio
    • With transcripts and concise summaries

    🎙️ Audio Specifications

    • Format: Single-channel (mono) telephone speech
    • Sample rate: 8,000 Hz
    • Non-synthetic source audio

    🌍 Languages (7)

    • English, Russian, Polish, French, German, Spanish, Portuguese
    • Non-English calls include English translations
    • Additional languages available on request: Swedish, Dutch, Arabic, Japanese, etc.

    🏢 Domains

    • Support, Billing/Account, Sales, Finance/Account Management, Pharma
    • Each call labeled by domain
    • Speaker roles annotated (Agent/Customer)

    The full version of the dataset is available for commercial use; leave a request on our website (Axon Labs) to purchase it 💰

    Purpose and Usage Scenarios

    • Automatic Speech Recognition, punctuation restoration, and speaker diarization on telephone speech
    • Intent detection, topic classification, and sentiment analysis from customer-service dialogs
    • Post-call concise summaries for QA/quality monitoring and CRM automation
    • Cross-lingual pipelines (original → English) and multilingual support models
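
    Call-center audio like this is distributed at 8 kHz, while many modern ASR models expect 16 kHz input, so resampling is a common first step. A minimal sketch using soundfile and scipy follows; the file name is hypothetical, since the listing does not document the archive layout.

```python
import soundfile as sf
from scipy.signal import resample_poly

# Hypothetical file name; the dataset's actual layout is not documented here.
audio, sr = sf.read("call_0001.wav")            # mono telephone speech, 8000 Hz
assert sr == 8000, "expected 8 kHz telephone audio per the spec above"
audio_16k = resample_poly(audio, up=2, down=1)  # polyphase upsampling, 8 kHz -> 16 kHz
sf.write("call_0001_16k.wav", audio_16k, 16000)
```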
  4. Speech Recognition Dataset [Customer Calls] – Transcribed support...

    • datarade.ai
    Cite
    WiserBrand.com, Speech Recognition Dataset [Customer Calls] – Transcribed support conversations for training voice AI systems [Dataset]. https://datarade.ai/data-products/speech-recognition-dataset-customer-calls-transcribed-sup-wiserbrand-com
    Explore at:
    .json, .csv, .xls, .txt. Available download formats
    Dataset provided by
    WiserBrand
    Area covered
    Moldova (Republic of), Portugal, Denmark, Greece, Poland, United Kingdom, Slovenia, Croatia, Norway, Czech Republic
    Description

    This dataset is designed for building and improving speech recognition systems. It features transcribed customer service calls from real interactions across 160+ industries, including retail, banking, telecom, logistics, healthcare, and entertainment. Calls are natural, unscripted, and emotion-rich — making the data especially valuable for training models that must interpret speech under real-world conditions.

    Each dataset entry includes:

    • Full call transcription (agent + customer dialogue)
    • Human-written call summary
    • Overall sentiment label: positive, neutral, or negative
    • Metadata: call duration, caller location (city, state, country), timestamp
    • Optional: company name and industry tag

    Use this dataset to:

    • Train speech-to-text models on real customer language patterns
    • Benchmark or evaluate speech recognition tools in support settings
    • Improve voice interfaces, chatbots, and IVR systems
    • Model tone, frustration cues, and escalation behaviors
    • Support LLM fine-tuning for tasks involving spoken input

    This dataset provides your speech recognition models with exposure to genuine customer conversations, helping you build tools that can listen, understand, and act in line with how people actually speak.

    The larger the volume you purchase, the lower the price will be.

  5. Speech_Command|Application of Speech Recognition

    • kaggle.com
    zip
    Updated Mar 28, 2022
    Cite
    VK (2022). Speech_Command|Application of Speech Recognition [Dataset]. https://www.kaggle.com/datasets/venkatkumar001/speechcommands
    Explore at:
    zip (820205557 bytes). Available download formats
    Dataset updated
    Mar 28, 2022
    Authors
    VK
    Description

    Google Research published the Speech Commands dataset; this upload contains 14 of its subcategories, each clip one second long.

    I preprocessed the audio and generated JSON files before uploading the dataset. Feel free to use it!

    Enjoy, and build your keyword-spotting application efficiently.

  6. POLYCOST

    • live.european-language-grid.eu
    • catalogue.elra.info
    audio format
    Updated Jun 25, 2017
    Cite
    (2017). POLYCOST [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1509
    Explore at:
    audio format. Available download formats
    Dataset updated
    Jun 25, 2017
    License

    http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    The POLYCOST speech database was recorded during January-March 1996 as a common initiative entitled "Speaker Recognition in Telephony" within the COST 250 action. The main purpose of the database is to compare and validate speaker recognition algorithms. The data was collected via international telephone lines, with more than five sessions per speaker, and with English spoken by foreigners. The database contains 1,285 calls (around 10 sessions per speaker) recorded by 133 subjects (74 male and 59 female speakers) from 13 different countries; each partner provided approximately 10 speakers per country. Each session comprises 15 prompts: one prompt for DTMF detection, 10 prompts with connected digits uttered in English, 2 prompts with sentences uttered in English, and 2 prompts in the speaker's mother tongue, one of which consists of free speech.

    English prompts:
    • 4 prompts distributed throughout the session in which the speaker pronounces his or her 7-digit client code;
    • 5 prompts distributed throughout the session in which the speaker pronounces a sequence of 10 digits (the same from session to session and from speaker to speaker);
    • 2 prompts in which the speaker pronounces the sentences "Joe took father's green shoe bench out" and "He eats several light tacos", as fixed password phrases common to all speakers;
    • 1 prompt in which the speaker is supposed to give his or her international phone number.

    Mother-tongue prompts:
    • 1 prompt in which the speaker gives his or her first name, family name, gender (female/male), town and country;
    • 1 prompt with free speech.

    The database was collected through the European telephone network and recorded through an ISDN card on an XTL SUN platform at an 8 kHz sampling rate. Most of the calls were automatically classified by DTMF detection; manual classification was used in the case of no DTMF or a wrong DTMF PIN code (circa 10% of the database). The English prompts are segmented and labelled at the word level (orthographic transcription and word stretches). The prompts in the mother tongue are simply labelled (an orthographic transcription will be given). The annotation conventions are those defined within the SpeechDat project.

    Character set: ISO-8859-1
    Medium: CD-ROMs. The first CD contains speech data from speakers M001-M069; the second CD contains data from speakers F001-F060 plus M070-M074.
    Total size: CD1 636 MB, CD2 610 MB
    File format: A-law, 8 kHz sampling rate, 8 bits/sample, with no file header.
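
    Since the files are headerless ("no file header") 8-bit A-law at 8 kHz, a general-purpose loader must be told the format explicitly. A sketch using the soundfile library follows; the file name is hypothetical, and the RAW parameters simply restate the file-format note above.

```python
import soundfile as sf

# Headerless (RAW) G.711 A-law, 8 kHz, mono, per the file-format note above.
# The file name is hypothetical; consult the CD layout for real paths.
audio, sr = sf.read(
    "M001_s01_prompt01.raw",
    samplerate=8000,
    channels=1,
    format="RAW",
    subtype="ALAW",
)
print(audio.shape, sr)  # floating-point samples in [-1, 1] at 8000 Hz
```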

  7. Latin American English Accent Speech Dataset — Authentic Local Speaker...

    • datarade.ai
    .wav
    Updated Jul 20, 2025
    Cite
    FileMarket (2025). Latin American English Accent Speech Dataset — Authentic Local Speaker Conversations [Dataset]. https://datarade.ai/data-products/english-accent-speech-dataset-central-america-authentic-l-filemarket
    Explore at:
    .wav. Available download formats
    Dataset updated
    Jul 20, 2025
    Dataset authored and provided by
    FileMarket
    Area covered
    El Salvador, Mexico, Guatemala, Dominican Republic, Colombia, Costa Rica
    Description

    The Central America English Accent Speech Dataset features real conversations from native and bilingual English speakers across Mexico, Colombia, Dominican Republic, Costa Rica, Guatemala, and El Salvador.

    This curated collection provides authentic English speech with distinct regional accents, recorded in natural conversational settings. The dataset is ideal for:

    AI speech training and accent detection

    Automatic speech recognition (ASR) model development

    Natural language processing (NLP) applications

    Conversational AI, chatbots, and voice assistants

    Key Features:

    ✅ Native and bilingual speakers with verified metadata (age, gender, country)

    ✅ Clean audio, human-validated for accent clarity

    ✅ Over 1,000 hours of recordings from 2,000 speakers

    ✅ Comprehensive CSV metadata with accent labels

    ✅ Licensed for commercial AI training and research use

  8. AURORA-5

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Aug 16, 2017
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2017). AURORA-5 [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-AURORA-CD0005/
    Explore at:
    Dataset updated
    Aug 16, 2017
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    The Aurora project was originally set up to establish a worldwide standard for the feature extraction software which forms the core of the front-end of a DSR (Distributed Speech Recognition) system. The AURORA-5 database has been developed mainly to investigate the influence of hands-free speech input in noisy room environments on the performance of automatic speech recognition. Furthermore, two test conditions are included to study the influence of transmitting the speech over a mobile communication system. The earlier three Aurora experiments focused on additive noise and the influence of some telephone frequency characteristics; Aurora-5 tries to cover all effects as they occur in realistic application scenarios. The focus was put on two scenarios. The first is hands-free speech input in the noisy car environment, with the intention of either controlling devices in the car itself or retrieving information from a remote speech server over the telephone. The second covers hands-free speech input in an office or living room, e.g. to control a telephone device or some audio/video equipment.

    The AURORA-5 database contains the following data:
    • Artificially distorted versions of the recordings from adult speakers in the TI-Digits speech database, downsampled to a sampling frequency of 8000 Hz. The distortions consist of additive background noise, the simulation of hands-free speech input in rooms, and the simulation of transmitting speech over cellular telephone networks.
    • A subset of recordings from the meeting recorder project at the International Computer Science Institute, containing sequences of digits uttered by different speakers in hands-free mode in a meeting room.
    • A set of scripts for running recognition experiments on the above-mentioned speech data. The experiments are based on the freely available HTK software package; HTK itself is not part of this resource.

    Further information is available at: http://aurora.hsnr.de

  9. Accented English Speech Dataset | Human-to-Chatbot conversation | 1000+...

    • datarade.ai
    .mp3, .wav
    Updated Aug 5, 2025
    Cite
    FileMarket (2025). Accented English Speech Dataset | Humam-to-Chatbot conversation | 1000+ hours of recordings [Dataset]. https://datarade.ai/data-products/accented-english-speech-dataset-1-5k-recordings-scripted-filemarket
    Explore at:
    .mp3, .wav. Available download formats
    Dataset updated
    Aug 5, 2025
    Dataset authored and provided by
    FileMarket
    Area covered
    Rwanda, Philippines, Russian Federation, Bangladesh, Curaçao, Palestine, Grenada, Poland, Falkland Islands (Malvinas), Netherlands
    Description

    The Accented English Speech Dataset provides over 1,000 hours of authentic conversational recordings designed to strengthen ASR systems, conversational AI, and voice applications. Unlike synthetic or scripted datasets, this collection captures real human-to-human and chatbot-guided dialogues, reflecting natural speech flow, spontaneous phrasing, and diverse accents.

    Off-the-shelf recordings are available from:

    Mexico, Colombia, Guatemala, Costa Rica, El Salvador, Dominican Republic, and South Africa

    This ensures exposure to Latin American, Caribbean, and African English accents, which are often missing from mainstream corpora.

    Beyond these, we support custom collection in any language and any accent worldwide, tailored to specific project requirements.

    Audio Specifications

    • Format: WAV
    • Sample rate: 48 kHz
    • Sample size: 16-bit PCM
    • Channel: mono/stereo
    • Double-track recording: available upon request (clean separation of speakers)

    Data Structure and Metadata

    • Dual-track or single-channel audio depending on project need
    • Metadata includes speaker ID, demographic attributes, accent/region, and context
    • Dialogues include both structured (chatbot/task-based) and free-flow natural conversations
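
    For the double-track option, each speaker sits on a separate channel of one stereo file, so the tracks can be split with a few lines. A minimal sketch using soundfile follows; the file name and the channel-to-speaker assignment are assumptions, not documented in the listing.

```python
import soundfile as sf

# Hypothetical dual-track recording: one speaker per stereo channel.
audio, sr = sf.read("conversation.wav")  # shape: (frames, 2) for stereo
if audio.ndim == 2 and audio.shape[1] == 2:
    sf.write("speaker_a.wav", audio[:, 0], sr)  # first channel (speaker assignment assumed)
    sf.write("speaker_b.wav", audio[:, 1], sr)  # second channel
```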

    Use Cases

    • ASR Training & Benchmarking – Improve transcription across accented English
    • Accent Adaptation – Build robust, inclusive systems that work in real-world scenarios
    • Multilingual Voice Interfaces – Expand IVR and assistants to support more voices
    • Conversational AI – Train chatbots on authentic, unstructured dialogue
    • Voice Biometrics – Support research in identity verification and speaker profiling
    • Model Fine-Tuning – Enrich foundation models with high-quality speech data

    Why It Matters

    Mainstream datasets disproportionately focus on U.S. and U.K. English. This dataset fills the gap with diverse accented English coverage, and the ability to collect any language or accent on demand, enabling the creation of fairer, more accurate, and globally deployable AI solutions.

    Key Highlights

    • 1,000+ hours of accented English speech
    • Ready-to-use coverage from Latin America, Caribbean, and Africa
    • Authentic dialogues: human-to-human and chatbot-guided
    • WAV, 48kHz, 16-bit PCM, mono/stereo, double-track option
    • Metadata-rich recordings for advanced AI research
    • Custom collection in any language and accent
  10. LibriSpeech

    • datasets.activeloop.ai
    • tensorflow.org
    • +2more
    deeplake
    Updated Dec 12, 2022
    Cite
    Google (2022). LibriSpeech [Dataset]. https://datasets.activeloop.ai/docs/ml/datasets/librispeech-dataset/
    Explore at:
    deeplake. Available download formats
    Dataset updated
    Dec 12, 2022
    Dataset authored and provided by
    Google
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    Carnegie Mellon University
    Description

    The LibriSpeech dataset is a corpus of read English speech derived from LibriVox audiobooks, widely used for training and evaluating speech recognition models. It contains roughly 1,000 hours of audio, of which about 960 hours form the training sets, with separate development and test sets. The speech is read by a variety of speakers and covers a wide range of material. The dataset is available for free download.
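
    For reference, LibriSpeech is also packaged in common toolkits; a sketch using torchaudio's built-in dataset class (an alternative to the Deep Lake listing above) that downloads the 100-hour clean training subset:

```python
import torchaudio

# Downloads train-clean-100 on first use and yields utterance-level examples.
dataset = torchaudio.datasets.LIBRISPEECH(
    root="./data", url="train-clean-100", download=True
)
waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = dataset[0]
print(sample_rate, transcript)  # 16000 and the utterance's reference text
```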

  11. Bengali Speech Recognition Dataset (BSRD)

    • kaggle.com
    zip
    Updated Jan 14, 2025
    Cite
    Shuvo Kumar Basak-4004 (2025). Bengali Speech Recognition Dataset (BSRD) [Dataset]. https://www.kaggle.com/datasets/shuvokumarbasak4004/bengali-speech-recognition-dataset-bsrd
    Explore at:
    zip (300882482 bytes). Available download formats
    Dataset updated
    Jan 14, 2025
    Authors
    Shuvo Kumar Basak-4004
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The Bengali Speech Recognition Dataset (BSRD) is a comprehensive dataset designed for the development and evaluation of Bengali speech recognition and text-to-speech systems. It includes a collection of Bengali characters and their corresponding audio files, generated using speech synthesis models, and serves as an essential resource for researchers and developers working on automatic speech recognition (ASR) and text-to-speech (TTS) applications for the Bengali language.

    Key features:
    • Bengali characters: a wide range of Bengali characters, including consonants, vowels, and unique symbols used in the Bengali script, such as 'ক', 'খ', 'গ', and many more.
    • Corresponding speech data: for each Bengali character, an MP3 audio file containing the correct pronunciation of that character, generated by a Bengali text-to-speech model to ensure clear and accurate pronunciation.
    • 1000 audio samples per folder: each character is associated with at least 1000 MP3 files. These multiple samples provide variations of the character's pronunciation, which is essential for training robust speech recognition systems.
    • Language and phonetic diversity: the dataset covers different tones and pronunciations commonly found in spoken Bengali, so it can be used to train models capable of recognizing diverse speech patterns.

    Use cases:
    • Automatic speech recognition (ASR): accurate audio samples linked to specific Bengali characters make BSRD ideal for training ASR systems.
    • Text-to-speech (TTS): researchers can use this dataset to fine-tune TTS systems for generating natural Bengali speech from text.
    • Phonetic analysis: the dataset supports phonetic analysis and models that study the linguistic features of Bengali pronunciation.

    Applications:
    • Voice assistants: build and train voice recognition systems and personal assistants that understand Bengali.
    • Speech-to-text systems: develop accurate transcription systems for Bengali audio content.
    • Language learning tools: create educational tools aimed at teaching Bengali pronunciation.
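
    Given the folder-per-character layout described above, a (file path, character label) index can be built straight from the directory tree. A minimal sketch follows; the extraction directory name is hypothetical.

```python
from pathlib import Path

# Assumed layout: one folder per Bengali character, each holding ~1000 MP3s.
root = Path("BSRD")  # hypothetical extraction directory
index = [
    (mp3, folder.name)  # (audio file, character label)
    for folder in sorted(root.iterdir()) if folder.is_dir()
    for mp3 in sorted(folder.glob("*.mp3"))
]
print(len(index), index[0] if index else "no files found")
```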

    Note for researchers using the dataset

    This dataset was created by Shuvo Kumar Basak. If you use this dataset for research or academic purposes, please cite it appropriately. If you have published research using this dataset, please share a link to your paper. Good luck.

  12. Bengali Speech Recognition - Bangla Real Number Audio Dataset

    • data.mendeley.com
    Updated Feb 4, 2018
    + more versions
    Cite
    Md Mahadi Hasan Nahid (2018). Bengali Speech Recognition - Bangla Real Number Audio Dataset [Dataset]. http://doi.org/10.17632/t33byr6cpt.3
    Explore at:
    Dataset updated
    Feb 4, 2018
    Authors
    Md Mahadi Hasan Nahid
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ========================================================

    This dataset was developed by:
    Md Ashraful Islam, SUST CSE'2010
    Md Mahadi Hasan Nahid, SUST CSE'2010 (nahid-cse@sust.edu)

    Department of Computer Science and Engineering (CSE)
    Shahjalal University of Science and Technology (SUST), www.sust.edu

    Special thanks to:
    Mohammad Al-Amin, SUST CSE'2011
    Md Mazharul Islam Midhat, SUST CSE'2010
    Md Mahedi Hasan Nayem, SUST CSE'2010
    Avro Keyboard, OmicronLab, https://www.omicronlab.com/index.html

    =========================================================

    This is an audio-text parallel corpus containing recorded audio of Bangla real numbers and its corresponding text, specially designed for Bangla speech recognition.

    There are five speakers (alamin, ashraful, midhat, nahid, nayem) in this dataset.

    The vocabulary contains only Bangla real numbers (shunno-ekshoto, hazar, loksho, koti, doshomik, etc.).

    Total number of audio files: 175 (35 from each speaker). Age range of the speakers: 20-23.

    Total Size: 32.4MB

    The TextData.txt file contains the text of the audio set. Each line starts with an opening tag and ends with a closing tag, and the corresponding audio file name is appended in parentheses, linking each line of text to its recorded audio data. The text data was generated using Avro (free, open-source writing software).
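
    Assuming each line wraps the transcript in opening/closing tags and appends the audio file name in parentheses, as described above, the (audio, text) pairs can be recovered with a small parser. The exact tag names are not shown in this listing, so the regular expression below treats them generically.

```python
import re

# Assumed line shape: "<tag> bangla number text </tag> (audio_file_name)".
LINE = re.compile(r"^\s*<[^>]+>\s*(?P<text>.*?)\s*</[^>]+>\s*\((?P<audio>[^)]+)\)\s*$")

pairs = []
with open("TextData.txt", encoding="utf-8") as f:
    for line in f:
        m = LINE.match(line)
        if m:
            pairs.append((m.group("audio"), m.group("text")))
print(pairs[:3])
```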

    ==========================================================

    For Full Data: please contact nahid-cse@sust.edu

  13. British English Scripted Monologue Speech Data for Healthcare

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). British English Scripted Monologue Speech Data for Healthcare [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/healthcare-scripted-speech-monologues-english-uk
    Explore at:
    wav. Available download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    United Kingdom
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Introducing the UK English Scripted Monologue Speech Dataset for the Healthcare Domain, a voice dataset built to accelerate the development and deployment of English language automatic speech recognition (ASR) systems, with a sharp focus on real-world healthcare interactions.

    Speech Data

    This dataset includes over 6,000 high-quality scripted audio prompts recorded in UK English, representing typical voice interactions found in the healthcare industry. The data is tailored for use in voice technology systems that power virtual assistants, patient-facing AI tools, and intelligent customer service platforms.

    Participant Diversity
    Speakers: 60 native UK English speakers.
    Regional Balance: Participants are sourced from multiple regions across the United Kingdom, reflecting diverse dialects and linguistic traits.
    Demographics: Includes a mix of male and female participants (60:40 ratio), aged between 18 and 70 years.
    Recording Specifications
    Nature of Recordings: Scripted monologues based on healthcare-related use cases.
    Duration: Each clip ranges between 5 to 30 seconds, offering short, context-rich speech samples.
    Audio Format: WAV files recorded in mono, with 16-bit depth and sample rates of 8 kHz and 16 kHz.
    Environment: Clean and echo-free spaces ensure clear and noise-free audio capture.

    Topic Coverage

    The prompts span a broad range of healthcare-specific interactions, such as:

    Patient check-in and follow-up communication
    Appointment booking and cancellation dialogues
    Insurance and regulatory support queries
    Medication, test results, and consultation discussions
    General health tips and wellness advice
    Emergency and urgent care communication
    Technical support for patient portals and apps
    Domain-specific scripted statements and FAQs

    Contextual Depth

    To maximize authenticity, the prompts integrate linguistic elements and healthcare-specific terms such as:

    Names: Gender- and region-appropriate United Kingdom names
    Addresses: Varied local address formats spoken naturally
    Dates & Times: References to appointment dates, times, follow-ups, and schedules
    Medical Terminology: Common medical procedures, symptoms, and treatment references
    Numbers & Measurements: Health data like dosages, vitals, and test result values
    Healthcare Institutions: Names of clinics, hospitals, and diagnostic centers

    These elements make the dataset exceptionally suited for training AI systems to understand and respond to natural healthcare-related speech patterns.

    Transcription

    Every audio recording is accompanied by a verbatim, manually verified transcription.

    Content: The transcription mirrors the exact scripted prompt recorded by the speaker.
    Format: Files are delivered in plain text (.TXT) format with consistent naming conventions for seamless integration.
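
    Since every WAV recording has a same-named plain-text transcript, the two can be joined by file stem. A minimal sketch follows; the directory names are hypothetical, as the delivery layout is not documented in this listing.

```python
from pathlib import Path

audio_dir = Path("audio")        # hypothetical folder of WAV prompts
text_dir = Path("transcripts")   # hypothetical folder of .TXT transcripts

# Pair each recording with its verbatim transcript via the shared file stem.
pairs = []
for wav in sorted(audio_dir.glob("*.wav")):
    txt = text_dir / (wav.stem + ".txt")
    if txt.exists():
        pairs.append((wav, txt.read_text(encoding="utf-8").strip()))
print(f"{len(pairs)} aligned audio/transcript pairs")
```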

  14. VoicePA

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    Updated Mar 6, 2023
    Cite
    Serife Kucur Ergünay; Elie Khoury; Alexandros Lazaridis; Sébastien Marcel; Pavel Korshunov; André R. Goncalves; Ricardo P. V. Violato (2023). VoicePA [Dataset]. http://doi.org/10.34777/gmc4-v249
    Explore at:
    Dataset updated
    Mar 6, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Serife Kucur Ergünay; Elie Khoury; Alexandros Lazaridis; Sébastien Marcel; Pavel Korshunov; André R. Goncalves; Ricardo P. V. Violato
    Description

    VoicePA is a dataset for speaker recognition and voice presentation attack detection (anti-spoofing). The dataset contains a set of Bona Fide (genuine) voice samples from 44 speakers and 24 different types of speech presentation attacks (spoofing attacks). The attacks were created using the Bona Fide data recorded for the AVspoof dataset.

    Genuine data

    The genuine (non-attack) data is taken directly from 'AVspoof' database and can be used by both automatic speaker verification (ASV) and presentation attack detection (PAD) systems (folder 'genuine' contains this data). The genuine data acquisition process lasted approximately two months with 44 subjects, each participating in four different sessions configured in different environmental setups. During each recording session, subjects were asked to speak out prepared (read) speech, pass-phrases and free speech recorded with three devices: one laptop with high-quality microphone and two mobile phones (iPhone 3GS and Samsung S3).

    Attack data

    Based on the genuine data, 24 types of presentation attacks were generated. Attacks were recorded in 3 different environments (two typical offices and a large conference room), using 5 different playback devices, including built-in laptop speakers, high quality speakers, and three phones: iPhone 3GS, iPhone 6S, and Samsung S3, and assuming an ASV system running on either laptop, iPhone 3GS, or Samsung S3. In addition to a replay type of attacks (speech is recorded and replayed to the microphone of an ASV system), two types of synthetic speech were also replayed: speech synthesis and voice conversion (for the details on these algorithms, please refer to the paper below published in BTAS 2015 and describing 'AVspoof' database).

    Protocols

    The data in 'voicePA' database is split into three non-overlapping subsets: training (genuine and attack samples from 4 female and 10 male subjects), development or 'Dev' (genuine and attack samples from 4 female and 10 male subjects), and evaluation or 'Eval' (genuine and attack samples from 5 female and 11 male subjects).
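
    Systems trained and tuned on the training and Dev subsets are typically compared on Eval via the equal error rate (EER). A minimal numpy sketch follows; the score arrays are placeholders standing in for a real system's genuine and attack scores.

```python
import numpy as np

def equal_error_rate(genuine: np.ndarray, attacks: np.ndarray) -> float:
    """EER: the operating point where false-accept and false-reject rates meet."""
    thresholds = np.sort(np.concatenate([genuine, attacks]))
    far = np.array([(attacks >= t).mean() for t in thresholds])  # false accepts
    frr = np.array([(genuine < t).mean() for t in thresholds])   # false rejects
    i = np.argmin(np.abs(far - frr))
    return float((far[i] + frr[i]) / 2)

# Placeholder scores; in practice these come from the Eval protocol above.
rng = np.random.default_rng(0)
print(equal_error_rate(rng.normal(2.0, 1.0, 1000), rng.normal(0.0, 1.0, 1000)))
```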

    Reference

    Pavel Korshunov, André R. Goncalves, Ricardo P. V. Violato, Flávio O. Simões and Sébastien Marcel. "On the Use of Convolutional Neural Networks for Speech Presentation Attack Detection", International Conference on Identity, Security and Behavior Analysis, 2018.
    "http://publications.idiap.ch/index.php/publications/show/3779">10.1109/ISBA.2018.8311474
    http://publications.idiap.ch/index.php/publications/show/3779

  15. Customer Service Audio Dataset [Raw Call Recordings, Multi-Industry, U.S.]

    • datarade.ai
    .wav
    Updated Dec 8, 2023
    Cite
    WiserBrand.com (2023). Customer Service Audio Dataset [Raw Call Recordings, Multi-Industry, U.S.] [Dataset]. https://datarade.ai/data-products/customer-service-audio-dataset-raw-call-recordings-multi-in-wiserbrand-com
    Explore at:
    .wav. Available download formats
    Dataset updated
    Dec 8, 2023
    Dataset provided by
    WiserBrand
    Area covered
    United States of America
    Description

    This dataset contains thousands of authentic audio recordings of customer calls to service teams across key U.S. industries. Captured from inbound support channels, these files reflect natural speech in real service contexts, with varied speaker accents, background noise, and emotion levels. Each recording involves only a customer and a customer service agent, preserving a realistic two-party call structure.

    Dataset includes:
    - Thousands of customer service call recordings (WAV/MP3)
    - English language, native and accented speech
    - Real-world acoustic conditions (noise, silence, overlapping speech)
    - Dataset language: English (other languages on request)

    Use this dataset to:
    - Train speech-to-text engines on real-world, noisy support audio
    - Build speaker diarization and audio segmentation models
    - Simulate customer-agent voice interactions for LLM fine-tuning
    - Test multilingual or accent-robust audio pipelines
    - Develop acoustic models for call quality enhancement

    This audio-first dataset is ideal for ASR developers, call center AI builders, and speech researchers looking for real-life, labeled customer service calls.

    The more you purchase, the lower the price will be.

  16. m-ailabs_speech_dataset_fr

    • huggingface.co
    Updated Jun 1, 2022
    Cite
    Théo Gigant (2022). m-ailabs_speech_dataset_fr [Dataset]. https://huggingface.co/datasets/gigant/m-ailabs_speech_dataset_fr
    Explore at:
    Dataset updated
    Jun 1, 2022
    Authors
    Théo Gigant
    License

    https://choosealicense.com/licenses/cc/

    Description


    The M-AILABS Speech Dataset is the first large dataset that we are providing free-of-charge, freely usable as training data for speech recognition and speech synthesis.

    Most of the data is based on LibriVox and Project Gutenberg. The training data consist of nearly a thousand hours of audio and the corresponding text files in a prepared format.

    A transcription is provided for each clip. Clips vary in length from 1 to 20 seconds; the approximate total length per language is given in the respective info.txt files.

    The texts were published between 1884 and 1964, and are in the public domain. The audio was recorded by the LibriVox project and is also in the public domain – except for Ukrainian.

    Ukrainian audio was kindly provided either by Nash Format or Gwara Media for machine learning purposes only (please check the data info.txt files for details).

  17. Indian Languages Audio Dataset

    • kaggle.com
    Updated Nov 3, 2023
    Cite
    HARSHMAN SOLANKI (2023). Indian Languages Audio Dataset [Dataset]. https://www.kaggle.com/datasets/hmsolanki/indian-languages-audio-dataset
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 3, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    HARSHMAN SOLANKI
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Area covered
    India
    Description

    Description: The "Indian Languages Audio Dataset" is a collection of audio samples featuring a diverse set of 10 Indian languages. Each audio sample in this dataset is precisely 5 seconds in duration and is provided in MP3 format. It is important to note that this dataset is a subset of a larger collection known as the "Audio Dataset with 10 Indian Languages." The source of these audio samples is regional videos freely available on YouTube, and none of the audio samples or source videos are owned by the dataset creator.

    Languages Included: 1. Bengali 2. Gujarati 3. Hindi 4. Kannada 5. Malayalam 6. Marathi 7. Punjabi 8. Tamil 9. Telugu 10. Urdu

    This dataset offers a valuable resource for researchers, linguists, and machine learning enthusiasts who are interested in studying and analyzing the phonetics, accents, and linguistic characteristics of the Indian subcontinent. It is a representative sample of the linguistic diversity present in India, encompassing a wide array of languages and dialects. Researchers and developers are encouraged to explore this dataset to build applications or conduct research related to speech recognition, language identification, and other audio processing tasks.

    Additionally, the dataset is not limited to these 10 languages and has the potential for expansion. Given the dynamic nature of language use in India, this dataset can serve as a foundation for future data collection efforts involving additional Indian languages and dialects.

    Access to the "Indian Multilingual Audio Dataset - 10 Languages" is provided with the understanding that users will comply with applicable copyright and licensing restrictions. If users plan to extend this dataset or use it for commercial purposes, it is essential to seek proper permissions and adhere to relevant copyright and licensing regulations.

    By utilizing this dataset responsibly and ethically, users can contribute to the advancement of language technology and research, ultimately benefiting language preservation, speech recognition, and cross-cultural communication.

  18. Call Center Audio Recordings (100,000+ Hours, High-Quality) in Multiple...

    • datarade.ai
    .mp3, .wav
    Updated Jul 23, 2025
    + more versions
    Cite
    FileMarket (2025). Call Center Audio Recordings (100,000+ Hours, High-Quality) in Multiple Languages | Available now (off-the-shelf) [Dataset]. https://datarade.ai/data-products/call-center-audio-recordings-100-000-hours-high-quality-i-filemarket
    Explore at:
    .mp3, .wav. Available download formats
    Dataset updated
    Jul 23, 2025
    Dataset authored and provided by
    FileMarket
    Area covered
    Hong Kong, Chad, Iceland, Yemen, American Samoa, Lesotho, Nauru, Martinique, Afghanistan, French Southern Territories
    Description

    Call Center Audio Recordings Dataset: 100,000+ Hours of Customer-Agent Conversations in Multiple Languages

    FileMarket AI Data Labs presents an extensive call center audio recordings dataset with over 100,000 hours of customer-agent conversations across a diverse range of topics. This dataset is designed to support AI, machine learning, and speech analytics projects requiring high-quality, real-world customer interaction data. Whether you're working on speech recognition, natural language processing (NLP), sentiment analysis, or any other conversational AI task, our dataset offers the breadth and quality you need to build, train, and refine cutting-edge models.

    Our dataset includes a multilingual collection of customer service interactions, recorded across various industries and service sectors. These recordings cover different call center topics such as customer support, sales and telemarketing, technical helpdesk, complaint handling, and information services, ensuring that the dataset provides rich context and variety. With support for a broad spectrum of languages including English, Spanish, French, German, Chinese, Arabic, and more, this dataset allows for training models that cater to global customer bases.

    In addition to the audio recordings, our dataset includes detailed metadata such as call duration, region, language, and call type, ensuring that data is easily usable for targeted applications. All recordings are carefully annotated for speaker separation and high fidelity to meet the highest standards for audio data.

    Our dataset is fully compliant with data protection and privacy regulations, offering consented and ethically sourced data. You can be assured that every data point meets the highest standards for legal compliance, making it safe for use in your commercial, academic, or research projects.

    At FileMarket AI Data Labs, we offer flexibility in terms of data scaling. Whether you need a small sample or a full-scale dataset, we can cater to your requirements. We also provide sample data for evaluation to help you assess quality before committing to the full dataset. Our pricing structure is competitive, with custom pricing options available for large-scale acquisitions.

    We invite you to explore this versatile dataset, which can help accelerate your AI and machine learning initiatives, whether for training conversational models, improving customer service tools, or enhancing multi-language support systems.

  19. Podcast Database - Complete Podcast Metadata, All Countries & Languages

    • datarade.ai
    .json, .csv, .sql
    Updated May 27, 2025
    Cite
    Listen Notes (2025). Podcast Database - Complete Podcast Metadata, All Countries & Languages [Dataset]. https://datarade.ai/data-products/podcast-database-complete-podcast-metadata-all-countries-listen-notes
    Explore at:
    .json, .csv, .sql. Available download formats
    Dataset updated
    May 27, 2025
    Dataset authored and provided by
    Listen Notes
    Area covered
    Zambia, Colombia, Indonesia, Anguilla, Slovenia, Turkey, Iran (Islamic Republic of), Gibraltar, Bosnia and Herzegovina, Guinea-Bissau
    Description

    == Quick facts ==

    The most up-to-date and comprehensive podcast database available
    All languages & all countries
    Includes over 3,600,000 podcasts
    Features 35+ data fields, such as basic metadata, global rank, RSS feed (with audio URLs), Spotify links, and more
    Delivered in SQLite format
    Learn how we build a high-quality podcast database: https://www.listennotes.help/article/105-high-quality-podcast-database-from-listen-notes

    == Use Cases ==

    AI training, including speech recognition, generative AI, voice cloning/synthesis, and news analysis
    Alternative data for investment research, such as sentiment analysis of executive interviews, market research, and tracking investment themes
    PR and marketing, including social monitoring, content research, outreach, and guest booking
    ...

    == Data Attributes ==

    See the full list of data attributes on this page: https://www.listennotes.com/podcast-datasets/fields/?filter=podcast_only

    How to access podcast audio files: Our dataset includes RSS feed URLs for all podcasts. You can retrieve audio for over 170 million episodes directly from these feeds. With access to the raw audio, you’ll have high-quality podcast speech data ideal for AI training and related applications.
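
    As described above, episode audio is reached through each podcast's RSS feed, where episodes carry enclosure URLs. A minimal stdlib sketch follows; the feed URL is a placeholder, and it assumes standard RSS 2.0 enclosure tags.

```python
import urllib.request
import xml.etree.ElementTree as ET

def episode_audio_urls(rss_url: str) -> list[str]:
    """Collect audio enclosure URLs from a standard RSS 2.0 podcast feed."""
    with urllib.request.urlopen(rss_url) as resp:
        root = ET.fromstring(resp.read())
    return [
        enc.attrib["url"]
        for enc in root.iter("enclosure")  # one enclosure per episode
        if enc.attrib.get("type", "").startswith("audio")
    ]

# Placeholder feed URL; real values come from the dataset's RSS feed field.
print(episode_audio_urls("https://example.com/feed.xml")[:5])
```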

    == Custom Offers ==

    We can provide custom datasets based on your needs, such as language-specific data, daily/weekly/monthly update frequency, or one-time purchases.

    We also provide a RESTful API at PodcastAPI.com

    Contact us: hello@listennotes.com

    == Need Help? ==

    If you have any questions about our products, feel free to reach out at hello@listennotes.com

    == About Listen Notes, Inc. ==

    Since 2017, Listen Notes, Inc. has provided the leading podcast search engine and podcast database.

  20. Data from: The TORGO database of acoustic and articulatory speech from...

    • incluset.com
    Updated 2011
    Cite
    Frank Rudzicz; Aravind Kumar Namasivayam; Talya Wolff (2011). The TORGO database of acoustic and articulatory speech from speakers with dysarthria [Dataset]. https://incluset.com/datasets
    Explore at:
    Dataset updated
    2011
    Authors
    Frank Rudzicz; Aravind Kumar Namasivayam; Talya Wolff
    Measurement technique
    It consists of aligned acoustics and measured 3D articulatory features from speakers with either cerebral palsy (CP) or amyotrophic lateral sclerosis (ALS).
    Description

    This database was originally created as a resource for developing advanced models in automatic speech recognition that are better suited to the needs of people with dysarthria.
