100+ datasets found
  1. Norwegian audio dataset for speech recognition 20 hours (5/5)

    • datarade.ai
    Updated Jul 24, 2024
    Available download formats: .json
    Cite
    StageZero (2024). Norwegian audio dataset for speech recognition 20 hours (5/5) [Dataset]. https://datarade.ai/data-products/norwegian-audio-dataset-for-speech-recognition-20-hours-5-5-stagezero
    Dataset authored and provided by
    StageZero
    Area covered
    Norway
    Description

    Specifications:
    - Each user has a unique ID across the entire dataset.
    - Maximum four hours of speech per person in the dataset.
    - Speech is recorded and transcribed on separate tracks.
    - High-quality transcriptions come with the data in JSON format.
    - No noise and high-quality recordings with both male and female speakers.
    - Metadata includes: gender, age, and location.
    - License terms: you pay once and you can use the data commercially in your products, but you cannot resell the data.

  2. Speech & Audio Datasets

    • sapien.io
    Updated Apr 22, 2025
    Cite
    Sapien (2025). Speech & Audio Datasets [Dataset]. https://www.sapien.io/dataset-marketplace/speech-audio-datasets-for-ai-training
    Dataset authored and provided by
    Sapien
    License
    https://www.sapien.io/terms
    Description

    High-quality speech audio datasets designed for AI model training, supporting applications such as speech recognition, voice identification, and multilingual speech modeling.

  3. TORGO Dataset for Dysarthric Speech - Audio Files

    • kaggle.com
    Updated Jun 14, 2023
    Cite
    Pranay Koppula (2023). TORGO Dataset for Dysarthric Speech - Audio Files [Dataset]. https://www.kaggle.com/datasets/pranaykoppula/torgo-audio
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Pranay Koppula
    Description

    Citation: DOI 10.1007/s10579-011-9145-0

    Collection of audio recordings by the Department of Computer Science at the University of Toronto from speakers with and without dysarthria. Useful for tasks such as audio classification, disease detection, and speech processing.

    Directory Structure:

    F_Con : Audio samples of female speakers from the control group, i.e., female speakers without dysarthria. 'FC01' in the folder and file names refers to the first speaker, 'FC02' to the second, and so on.

    F_Dys : Audio samples of female speakers with dysarthria. 'F01' refers to the first speaker, 'F03' to the second, and so on.

    M_Con : Audio samples of male speakers from the control group, i.e., male speakers without dysarthria. 'MC01' refers to the first speaker, 'MC02' to the second, and so on.

    M_Dys : Audio samples of male speakers with dysarthria. 'M01' refers to the first speaker, 'M03' to the second, and so on.

    In all four folders, 'S01' refers to the first recording session with a given speaker, 'S02' to the second, and so on. 'arrayMic' indicates that the audio was recorded with an array microphone; 'headMic' indicates a headpiece microphone.
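    The naming scheme above can be decoded programmatically. A minimal sketch; note that the concrete path layout "F_Dys/F01_S01_arrayMic_0001.wav" is a hypothetical example, since the description only guarantees that the speaker, session, and microphone tokens appear somewhere in folder and file names:

    ```python
    import re

    # Extract the speaker, session, and microphone tokens described above.
    TOKEN_RE = re.compile(
        r"(?P<speaker>[FM]C?\d{2}).*?(?P<session>S\d{2}).*?(?P<mic>arrayMic|headMic)"
    )

    def parse_torgo_path(path):
        m = TOKEN_RE.search(path)
        if m is None:
            raise ValueError(f"no TORGO tokens found in {path!r}")
        info = m.groupdict()
        # Control speakers carry a 'C' (FC01/MC01); dysarthric speakers do not.
        info["dysarthric"] = "C" not in info["speaker"]
        return info

    print(parse_torgo_path("F_Dys/F01_S01_arrayMic_0001.wav"))
    ```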

  4. Lithuanian audio dataset for speech recognition 20 hours (4/5)

    • datarade.ai
    Updated Jul 11, 2024
    Available download formats: .json
    Cite
    StageZero (2024). Lithuanian audio dataset for speech recognition 20 hours (4/5) [Dataset]. https://datarade.ai/data-products/lithuanian-audio-dataset-for-speech-recognition-20-hours-4-5-stagezero
    Dataset authored and provided by
    StageZero
    Area covered
    Lithuania
    Description

    Specifications:
    - Each user has a unique ID across the entire dataset.
    - Maximum four hours of speech per person in the dataset.
    - Speech is recorded and transcribed on separate tracks.
    - High-quality transcriptions come with the data in JSON format.
    - No noise and high-quality recordings with both male and female speakers.
    - Metadata includes: gender, age, and location.
    - License terms: you pay once and you can use the data commercially in your products, but you cannot resell the data.

  5. English (UK) General Conversation Speech Dataset

    • futurebeeai.com
    Updated Aug 1, 2022
    Available download formats: wav
    Cite
    FutureBee AI (2022). English (UK) General Conversation Speech Dataset [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-english-uk
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License
    https://www.futurebeeai.com/policies/ai-data-license-agreement
    Area covered
    United Kingdom
    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Welcome to the English Language General Conversation Speech Dataset, a comprehensive and diverse collection of voice data specifically curated to advance the development of English language speech recognition models, with a particular focus on British accents and dialects.

    With high-quality audio recordings, detailed metadata, and accurate transcriptions, it empowers researchers and developers to enhance natural language processing, conversational AI, and generative voice AI algorithms. Moreover, it facilitates the creation of sophisticated voice assistants and voice bots tailored to the unique linguistic nuances of English as spoken in the United Kingdom.

    Speech Data:

    This training dataset comprises 30 hours of audio recordings covering a wide range of topics and scenarios, ensuring robustness and accuracy in speech technology applications. To achieve this, we collaborated with a diverse network of 40 native English speakers from different regions of the United Kingdom. This collaborative effort guarantees a balanced representation of British accents, dialects, and demographics, reducing biases and promoting inclusivity.

    Each audio recording captures the essence of spontaneous, unscripted conversations between two individuals, with an average duration ranging from 15 to 60 minutes. The speech data is available in WAV format, with stereo channel files having a bit depth of 16 bits and a sample rate of 8 kHz. The recording environment is generally quiet, without background noise and echo.
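    If you need to verify downloaded files against the stated format (stereo, 16-bit, 8 kHz WAV), Python's standard wave module suffices. A minimal sketch; "sample.wav" is a placeholder file name:

    ```python
    import wave

    def check_format(path, channels=2, sampwidth_bytes=2, framerate=8000):
        """Return True if the WAV file matches the expected channel count,
        sample width (2 bytes = 16-bit), and sample rate (8 kHz)."""
        with wave.open(path, "rb") as w:
            return (w.getnchannels() == channels
                    and w.getsampwidth() == sampwidth_bytes
                    and w.getframerate() == framerate)

    # Usage (placeholder path):
    # print(check_format("sample.wav"))
    ```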

    Metadata:

    In addition to the audio recordings, our dataset provides comprehensive metadata for each participant. This metadata includes the participant's age, gender, country, state, and dialect. Furthermore, additional metadata such as recording device detail, topic of recording, bit depth, and sample rate will be provided.

    The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of English language speech recognition models.

    Transcription:

    This dataset provides a manual verbatim transcription of each audio file to enhance your workflow efficiency. The transcriptions are available in JSON format. The transcriptions capture speaker-wise transcription with time-coded segmentation along with non-speech labels and tags.

    Our goal is to expedite the deployment of English language conversational AI and NLP models by offering ready-to-use transcriptions, ultimately saving valuable time and resources in the development process.

    Updates and Customization:

    We understand the importance of collecting data in various environments to build robust ASR models. Therefore, our voice dataset is regularly updated with new audio data captured in diverse real-world conditions.

    If you require a custom training dataset with specific environmental conditions such as in-car, busy street, restaurant, or any other scenario, we can accommodate your request. We can provide voice data with customized sample rates ranging from 8kHz to 48kHz, allowing you to fine-tune your models for different audio recording setups. Additionally, we can also customize the transcription following your specific guidelines and requirements, to further support your ASR development process.

    License:

    This audio dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Whether you are training or fine-tuning speech recognition models, advancing NLP algorithms, exploring generative voice AI, or building cutting-edge voice assistants and bots, our dataset serves as a reliable and valuable resource.

  6. Japanese General Conversation Speech Dataset for ASR

    • futurebeeai.com
    Updated Aug 1, 2022
    Available download formats: wav
    Cite
    FutureBee AI (2022). Japanese General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-japanese-japan
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License
    https://www.futurebeeai.com/policies/ai-data-license-agreement
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Japanese General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Japanese speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Japanese communication.

    Curated by FutureBeeAI, this 40-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Japanese speech models that understand and respond to authentic Japanese accents and dialects.

    Speech Data

    The dataset comprises 40 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Japanese. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 80 verified native Japanese speakers from FutureBeeAI’s contributor community.
    Regions: Representing various provinces of Japan to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy, achieved through double QA pass, average WER < 5%

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
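    A WER figure like the one quoted above is the word-level Levenshtein distance between the reference and hypothesis transcripts, divided by the reference length. A minimal sketch:

    ```python
    def wer(reference, hypothesis):
        """Word error rate: word-level edit distance / reference length."""
        r, h = reference.split(), hypothesis.split()
        # d[i][j] = edit distance between first i reference words and first j hypothesis words
        d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
        for i in range(len(r) + 1):
            d[i][0] = i
        for j in range(len(h) + 1):
            d[0][j] = j
        for i in range(1, len(r) + 1):
            for j in range(1, len(h) + 1):
                sub = 0 if r[i - 1] == h[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + sub)  # substitution / match
        return d[len(r)][len(h)] / max(len(r), 1)

    print(wer("this is a test", "this is test"))  # one deletion over four words -> 0.25
    ```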

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple Japanese speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for Japanese.
    Voice Assistants: Build smart assistants capable of understanding natural Japanese conversations.

  7. Bengali Speech Recognition Dataset (BSRD)

    • kaggle.com
    Updated Jan 14, 2025
    Cite
    Shuvo Kumar Basak-4004 (2025). Bengali Speech Recognition Dataset (BSRD) [Dataset]. http://doi.org/10.34740/kaggle/dsv/10465455
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Shuvo Kumar Basak-4004
    License
    MIT License (https://opensource.org/licenses/MIT); license information was derived automatically
    Description

    The Bengali Speech Recognition Dataset (BSRD) is a comprehensive dataset designed for the development and evaluation of Bengali speech recognition and text-to-speech systems. It includes a collection of Bengali characters and their corresponding audio files, which are generated using speech synthesis models, and serves as an essential resource for researchers and developers working on automatic speech recognition (ASR) and text-to-speech (TTS) applications for the Bengali language.

    Key Features:
    • Bengali Characters: The dataset contains a wide range of Bengali characters, including consonants, vowels, and unique symbols used in the Bengali script, such as 'ক', 'খ', 'গ', and many more.
    • Corresponding Speech Data: For each Bengali character, an MP3 audio file contains the correct pronunciation of that character. This audio is generated by a Bengali text-to-speech model, ensuring clear and accurate pronunciation.
    • 1000 Audio Samples per Folder: Each character is associated with at least 1000 MP3 files. These multiple samples provide variations of the character's pronunciation, which is essential for training robust speech recognition systems.
    • Language and Phonetic Diversity: The dataset covers different tones and pronunciations commonly found in spoken Bengali, so it can be used to train models capable of recognizing diverse speech patterns.
    • Use Cases:
      o Automatic Speech Recognition (ASR): BSRD is ideal for training ASR systems, as it provides accurate audio samples linked to specific Bengali characters.
      o Text-to-Speech (TTS): Researchers can use this dataset to fine-tune TTS systems for generating natural Bengali speech from text.
      o Phonetic Analysis: The dataset can be used for phonetic analysis and for developing models that study the linguistic features of Bengali pronunciation.
    • Applications:
      o Voice Assistants: The dataset can be used to build and train voice recognition systems and personal assistants that understand Bengali.
      o Speech-to-Text Systems: BSRD can aid in developing accurate transcription systems for Bengali audio content.
      o Language Learning Tools: The dataset can help in creating educational tools aimed at teaching Bengali pronunciation.
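    Given the one-folder-per-character layout described above, building a character-to-files index is straightforward. A minimal sketch; "bsrd_root" is a placeholder for wherever the dataset is extracted:

    ```python
    from pathlib import Path

    def index_bsrd(root):
        """Map each character folder name to its sorted list of MP3 sample paths."""
        return {d.name: sorted(d.glob("*.mp3"))
                for d in Path(root).iterdir() if d.is_dir()}

    # Usage (placeholder path):
    # samples = index_bsrd("bsrd_root")
    # print({ch: len(files) for ch, files in samples.items()})
    ```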

    Note for researchers using the dataset:

    This dataset was created by Shuvo Kumar Basak. If you use this dataset for your research or academic purposes, please ensure to cite this dataset appropriately. If you have published your research using this dataset, please share a link to your paper. Good Luck.

  8. General conversation speech datasets in Shona for School

    • data.macgence.com
    Updated May 25, 2024
    Available download formats: mp3
    Cite
    Macgence (2024). General conversation speech datasets in Shona for School [Dataset]. https://data.macgence.com/dataset/general-conversation-speech-datasets-in-shona-for-school
    Dataset authored and provided by
    Macgence
    License
    https://data.macgence.com/terms-and-conditions
    Time period covered
    2025
    Area covered
    Worldwide
    Variables measured
    Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
    Description

    The audio dataset includes general conversation speech, featuring Shona speakers from Africa, with detailed metadata.

  9. medical_asr_recording_dataset

    • huggingface.co
    Updated Oct 19, 2023
    Cite
    Hani. M (2023). medical_asr_recording_dataset [Dataset]. https://huggingface.co/datasets/Hani89/medical_asr_recording_dataset
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Authors
    Hani. M
    License
    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0); license information was derived automatically
    Description

    Data Source
    Kaggle: Medical Speech, Transcription, and Intent

    Context

    8.5 hours of audio utterances paired with text for common medical symptoms.

    Content

    This data contains thousands of audio utterances for common medical symptoms like “knee pain” or “headache,” totaling more than 8 hours in aggregate. Each utterance was created by individual human contributors based on a given symptom. These audio snippets can be used to train conversational agents in the medical field. This Figure Eight… See the full description on the dataset page: https://huggingface.co/datasets/Hani89/medical_asr_recording_dataset.

  10. Japanese Speech Recognition Dataset

    • unidata.pro
    Updated Feb 26, 2025
    Available download formats: wav, u-law/a-law
    Cite
    Unidata L.L.C-FZ (2025). Japanese Speech Recognition Dataset [Dataset]. https://unidata.pro/datasets/japanese-speech-recognition-dataset/
    Dataset authored and provided by
    Unidata L.L.C-FZ
    Description

    Train AI to understand Japanese with Unidata’s dataset, featuring diverse speech samples for better transcription accuracy.

  11. alphanumeric-audio-dataset

    • huggingface.co
    Updated Nov 24, 2024
    Cite
    Sakshee Patil (2024). alphanumeric-audio-dataset [Dataset]. https://huggingface.co/datasets/sakshee05/alphanumeric-audio-dataset
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Authors
    Sakshee Patil
    License
    MIT License (https://opensource.org/licenses/MIT); license information was derived automatically
    Description

    Speech Recognition Bias Reduction Project

    Executive Summary

    Welcome to the Speech Recognition Bias Reduction Project, which aims to create a more inclusive and representative dataset for improving automated speech recognition systems. This project addresses the challenges faced by speakers with non-native English accents, particularly when interacting with automated voice systems that struggle to interpret alphanumeric information such as names, phone numbers, and addresses.… See the full description on the dataset page: https://huggingface.co/datasets/sakshee05/alphanumeric-audio-dataset.

  12. Bangla Real Number Audio Dataset

    • data.mendeley.com
    Updated Feb 4, 2018
    Cite
    Md Mahadi Hasan Nahid (2018). Bangla Real Number Audio Dataset [Dataset]. http://doi.org/10.17632/t33byr6cpt.1
    Authors
    Md Mahadi Hasan Nahid
    License
    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/); license information was derived automatically
    Description

    ========================================================

    This dataset was developed by:
    Md Ashraful Islam, SUST CSE'2010
    Md Mahadi Hasan Nahid, SUST CSE'2010 (nahid-cse@sust.edu)

    Department of Computer Science and Engineering (CSE)
    Shahjalal University of Science and Technology (SUST), www.sust.edu

    Special thanks to:
    Mohammad Al-Amin, SUST CSE'2011
    Md Mazharul Islam Midhat, SUST CSE'2010
    Md Mahedi Hasan Nayem, SUST CSE'2010
    Avro Keyboard, OmicronLab, https://www.omicronlab.com/index.html

    =========================================================

    It is an audio-text parallel corpus. This dataset contains audio recordings of Bangla real numbers and their corresponding text, designed specifically for Bangla speech recognition.

    There are five speakers (alamin, ashraful, midhat, nahid, nayem) in this dataset.

    The vocabulary contains only Bangla real numbers (shunno-ekshoto, hazar, loksho, koti, doshomic, etc.).

    Total number of audio files: 175 (35 from each speaker)
    Age range of the speakers: 20-23

    Total size: 32.4 MB

    The TextData.txt file contains the text of the audio set. Each line starts and ends with a tag, and the name of the corresponding audio file is appended in parentheses; that audio file contains the recorded speech for the line. The text data was generated using Avro (free, open-source writing software).

    ==========================================================

    For Full Data: please contact nahid-cse@sust.edu

  13. Arabic Speech Commands Dataset

    • zenodo.org
    • explore.openaire.eu
    • +1more
    Updated Apr 5, 2021
    Available download formats: zip
    Cite
    Abdulkader Ghandoura (2021). Arabic Speech Commands Dataset [Dataset]. http://doi.org/10.5281/zenodo.4662481
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Abdulkader Ghandoura
    License
    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/); license information was derived automatically
    Description

    Arabic Speech Commands Dataset

    This dataset is designed to help train simple machine learning models that serve educational and research purposes in the speech recognition domain, mainly for keyword spotting tasks.

    Dataset Description

    Our dataset is a list of pairs (x, y), where x is the input speech signal, and y is the corresponding keyword. The final dataset consists of 12000 such pairs, comprising 40 keywords. Each audio file is one-second in length sampled at 16 kHz. We have 30 participants, each of them recorded 10 utterances for each keyword. Therefore, we have 300 audio files for each keyword in total (30 * 10 * 40 = 12000), and the total size of all the recorded keywords is ~384 MB. The dataset also contains several background noise recordings we obtained from various natural sources of noise. We saved these audio files in a separate folder with the name background_noise and a total size of ~49 MB.

    Dataset Structure

    There are 40 folders, each of which represents one keyword and contains 300 files. The first eight digits of each file name identify the contributor, while the last two digits identify the round number. For example, the file path rotate/00000021_NO_06.wav indicates that the contributor with the ID 00000021 pronounced the keyword rotate for the 6th time.

    Data Split

    We recommend using the provided CSV files in your experiments. We kept 60% of the dataset for training, 20% for validation, and the remaining 20% for testing. In our split method, we guarantee that all recordings of a certain contributor are within the same subset.
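    The provided CSV splits should be preferred in experiments, but the contributor-disjoint property can be reproduced from the documented file-name convention (the first eight digits identify the contributor). A minimal illustrative sketch:

    ```python
    import random
    from collections import defaultdict

    def split_by_contributor(paths, seed=0):
        """60/20/20 split that keeps all recordings of a contributor together."""
        groups = defaultdict(list)
        for p in paths:
            contributor = p.split("/")[-1][:8]  # e.g. "00000021" from "rotate/00000021_NO_06.wav"
            groups[contributor].append(p)
        ids = sorted(groups)
        random.Random(seed).shuffle(ids)
        n_train, n_val = int(0.6 * len(ids)), int(0.8 * len(ids))
        subsets = {"train": set(ids[:n_train]),
                   "validation": set(ids[n_train:n_val]),
                   "test": set(ids[n_val:])}
        split = {name: [] for name in subsets}
        for cid, files in groups.items():
            for name, members in subsets.items():
                if cid in members:
                    split[name].extend(files)
        return split
    ```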

    License

    This dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. For more details, see the LICENSE file in this folder.

    Citations

    If you want to use the Arabic Speech Commands dataset in your work, please cite it as:

    @article{arabicspeechcommandsv1,
       author = {Ghandoura, Abdulkader and Hjabo, Farouk and Al Dakkak, Oumayma},
       title = {Building and Benchmarking an Arabic Speech Commands Dataset for Small-Footprint Keyword Spotting},
       journal = {Engineering Applications of Artificial Intelligence},
       year = {2021},
       publisher={Elsevier}
    }

  14. Zambezi Voice Dataset

    • paperswithcode.com
    Updated Sep 29, 2024
    Cite
    Claytone Sikasote; Kalinda Siaminwe; Stanly Mwape; Bangiwe Zulu; Mofya Phiri; Martin Phiri; David Zulu; Mayumbo Nyirenda; Antonios Anastasopoulos (2024). Zambezi Voice Dataset [Dataset]. https://paperswithcode.com/dataset/zambezi-voice
    Authors
    Claytone Sikasote; Kalinda Siaminwe; Stanly Mwape; Bangiwe Zulu; Mofya Phiri; Martin Phiri; David Zulu; Mayumbo Nyirenda; Antonios Anastasopoulos
    Description

    This work introduces Zambezi Voice, an open-source multilingual speech resource for Zambian languages. It contains two collections of datasets: unlabelled audio recordings of radio news and talk-show programs (160 hours) and labelled data (over 80 hours) consisting of read speech recorded from text sourced from publicly available literature books. The dataset is created for speech recognition but can be extended to multilingual speech processing research for both supervised and unsupervised learning approaches. To our knowledge, this is the first multilingual speech dataset created for Zambian languages. We exploit pretraining and cross-lingual transfer learning by fine-tuning the large-scale multilingual pre-trained Wav2Vec2.0 model to build end-to-end (E2E) baseline speech recognition models. The dataset is released publicly under a Creative Commons BY-NC-ND 4.0 license and can be accessed through the project repository.

  15. common_voice_12_0

    • huggingface.co
    Updated Mar 24, 2023
    Cite
    Mozilla Foundation (2023). common_voice_12_0 [Dataset]. https://huggingface.co/datasets/mozilla-foundation/common_voice_12_0
    Dataset authored and provided by
    Mozilla Foundation (http://mozilla.org/)
    License
    https://choosealicense.com/licenses/cc0-1.0/
    Description

    Dataset Card for Common Voice Corpus 12.0

      Dataset Summary
    

    The Common Voice dataset consists of unique MP3 files and their corresponding text files. Many of the 26,119 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. The dataset currently consists of 17,127 validated hours in 104 languages, but more voices and languages are always being added. Take a look at the Languages page to… See the full description on the dataset page: https://huggingface.co/datasets/mozilla-foundation/common_voice_12_0.

  16. African speaker Speech Dataset in Igbo

    • data.macgence.com
    Updated Apr 28, 2024
    Available download formats: mp3
    Cite
    Macgence (2024). African speaker Speech Dataset in Igbo [Dataset]. https://data.macgence.com/dataset/african-speaker-speech-dataset-in-igbo
    Dataset authored and provided by
    Macgence
    License
    https://data.macgence.com/terms-and-conditions
    Time period covered
    2025
    Area covered
    Worldwide
    Variables measured
    Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
    Description

    The audio dataset includes general conversations, featuring Igbo speakers from Africa with detailed metadata.

  17. UAE speaker Speech Dataset in Arabic

    • data.macgence.com
    Updated May 17, 2024
    Available download formats: mp3
    Cite
    Macgence (2024). UAE speaker Speech Dataset in Arabic [Dataset]. https://data.macgence.com/dataset/uae-speaker-speech-dataset-in-arabic
    Dataset authored and provided by
    Macgence
    License
    https://data.macgence.com/terms-and-conditions
    Time period covered
    2025
    Area covered
    United Arab Emirates, Worldwide
    Variables measured
    Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
    Description

    The audio dataset includes general conversations, featuring Arabic speakers from the UAE, with detailed metadata.

  18. m

    USA speaker Speech Dataset in English

    • data.macgence.com
    mp3
    Updated Mar 30, 2024
    Cite
    Macgence (2024). USA speaker Speech Dataset in English [Dataset]. https://data.macgence.com/dataset/usa-speaker-speech-dataset-in-english
    Explore at:
    mp3 (available download format)
    Dataset updated
    Mar 30, 2024
    Dataset authored and provided by
    Macgence
    License

    https://data.macgence.com/terms-and-conditions

    Time period covered
    2025
    Area covered
    United States, Worldwide
    Variables measured
    Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
    Description

    The audio dataset includes general conversations, featuring English speakers from the USA, with detailed metadata.

  19. wolof-audio-data

    • huggingface.co
    Updated Dec 14, 2024
    + more versions
    Cite
    Abdoulaye Diallo (2024). wolof-audio-data [Dataset]. https://huggingface.co/datasets/vonewman/wolof-audio-data
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Dec 14, 2024
    Authors
    Abdoulaye Diallo
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Wolof Audio Dataset

    The Wolof Audio Dataset is a collection of audio recordings and their corresponding transcriptions in Wolof. This dataset is designed to support the development of Automatic Speech Recognition (ASR) models for the Wolof language. It was created by combining three existing datasets:

    - ALFFA: available at serge-wilson/wolof_speech_transcription
    - FLEURS: available at vonewman/fleurs-wolof-dataset
    - Urban Bus Wolof Speech Dataset: available at vonewman/urban-bus-wolof… See the full description on the dataset page: https://huggingface.co/datasets/vonewman/wolof-audio-data.
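When several corpora are merged into one ASR dataset like this, each utterance needs a globally unique ID and a provenance field so the sources can still be told apart. The sketch below is an illustration of that pattern, not the dataset authors' actual pipeline; the source names and records are invented.

```python
# Illustrative sketch: merging several ASR corpora into one manifest while
# keeping provenance and globally unique utterance IDs. Data is invented.

def merge_corpora(corpora):
    """corpora: mapping of source name -> list of {"audio": ..., "text": ...}."""
    manifest = []
    for source, records in corpora.items():
        for i, rec in enumerate(records):
            manifest.append({
                "id": f"{source}-{i:06d}",  # unique across all sources
                "source": source,           # provenance for later filtering
                "audio": rec["audio"],
                "text": rec["text"],
            })
    return manifest

corpora = {
    "alffa": [{"audio": "a0.wav", "text": "salaam aleekum"}],
    "fleurs": [{"audio": "f0.wav", "text": "naka nga def"}],
}
merged = merge_corpora(corpora)
print(len(merged))  # prints 2
```

Keeping the `source` field also makes it easy to hold out one corpus as an evaluation set.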

  20. Bulgarian audio dataset for speech recognition 20 hours (1/4)

    • datarade.ai
    .json
    Updated Jul 11, 2024
    + more versions
    Cite
    StageZero (2024). Bulgarian audio dataset for speech recognition 20 hours (1/4) [Dataset]. https://datarade.ai/data-products/bulgarian-audio-dataset-for-speech-recognition-20-hours-1-4-stagezero
    Explore at:
    .json (available download format)
    Dataset updated
    Jul 11, 2024
    Dataset authored and provided by
    StageZero
    Area covered
    Bulgaria
    Description

    Specifications:
    - Each user has a unique ID across the entire dataset.
    - Maximum four hours of speech per person in the dataset.
    - Speech is recorded and transcribed on separate tracks.
    - High-quality transcriptions are provided with the data in JSON format.
    - Noise-free, high-quality recordings with both male and female speakers.
    - Metadata includes gender, age, and location.
    - License terms: pay once and use the data commercially in your products; reselling the data is not permitted.
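Two of the specifications above (unique speaker IDs, at most four hours per speaker) are checkable from the JSON transcription metadata. The sketch below shows one way such a check might look; the JSON layout and field names are assumptions, not StageZero's documented schema.

```python
# Hedged sketch of a consistency check implied by the specifications above:
# verify that no speaker ID exceeds the four-hour cap. The record schema
# (speaker_id, duration_s) is an assumption for illustration only.
import json

MAX_HOURS_PER_SPEAKER = 4.0

def check_speaker_caps(json_text):
    """Return the speaker IDs whose total recorded time exceeds the cap."""
    records = json.loads(json_text)
    totals = {}
    for rec in records:
        sid = rec["speaker_id"]
        totals[sid] = totals.get(sid, 0.0) + rec["duration_s"]
    return sorted(sid for sid, secs in totals.items()
                  if secs / 3600.0 > MAX_HOURS_PER_SPEAKER)

sample = json.dumps([
    {"speaker_id": "spk-001", "duration_s": 3.5 * 3600},
    {"speaker_id": "spk-001", "duration_s": 1.0 * 3600},  # pushes spk-001 over 4 h
    {"speaker_id": "spk-002", "duration_s": 2.0 * 3600},
])
print(check_speaker_caps(sample))  # -> ['spk-001']
```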
