78 datasets found
  1. Indonesian Media Audio Database

    • gts.ai
    json
    Updated Jan 31, 2024
    Cite
    GTS (2024). Indonesian Media Audio Database [Dataset]. https://gts.ai/case-study/indonesian-media-audio-database-custom-ai-data-collection/
    Explore at:
    json; available download formats
    Dataset updated
    Jan 31, 2024
    Dataset provided by
    GLOBOSE TECHNOLOGY SOLUTIONS PRIVATE LIMITED
    Authors
    GTS
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Our project, “Indonesian Media Audio Database,” is designed to establish a rich and diverse dataset tailored for training advanced machine learning models in language processing, speech recognition, and cultural analysis.

  2. AURORA-5

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Aug 16, 2017
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2017). AURORA-5 [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-AURORA-CD0005/
    Explore at:
    Dataset updated
    Aug 16, 2017
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    The Aurora project was originally set up to establish a worldwide standard for the feature extraction software which forms the core of the front-end of a DSR (Distributed Speech Recognition) system. The AURORA-5 database was developed mainly to investigate the influence of hands-free speech input in noisy room environments on the performance of automatic speech recognition. Furthermore, two test conditions are included to study the influence of transmitting the speech over a mobile communication system. The earlier three Aurora experiments focused on additive noise and the influence of some telephone frequency characteristics. Aurora-5 tries to cover all effects as they occur in realistic application scenarios. The focus was put on two scenarios. The first is hands-free speech input in the noisy car environment, with the intention of either controlling devices in the car itself or retrieving information from a remote speech server over the telephone. The second covers hands-free speech input in an office or living room, e.g. to control a telephone device or some audio/video equipment.

    The AURORA-5 database contains the following data:

    • Artificially distorted versions of the recordings from adult speakers in the TI-Digits speech database, downsampled to a sampling frequency of 8000 Hz. The distortions consist of additive background noise, the simulation of a hands-free speech input in rooms, and the simulation of transmitting speech over cellular telephone networks.
    • A subset of recordings from the meeting recorder project at the International Computer Science Institute. The recordings contain sequences of digits uttered by different speakers in hands-free mode in a meeting room.
    • A set of scripts for running recognition experiments on the above-mentioned speech data. The experiments are based on the freely available software package HTK; HTK itself is not part of this resource.

    Further information is available at the following address: http://aurora.hsnr.de

  3. Audio Noise Dataset

    • kaggle.com
    zip
    Updated Oct 20, 2023
    Cite
    Min Si Thu (2023). Audio Noise Dataset [Dataset]. https://www.kaggle.com/datasets/minsithu/audio-noise-dataset
    Explore at:
    zip (1616602 bytes); available download formats
    Dataset updated
    Oct 20, 2023
    Authors
    Min Si Thu
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

    Audio Noise Dataset

    Noise is unwanted content in audio recordings, yet it plays an important role in machine learning on audio data.

    The dataset can be used for noise filtering, noise generation, and noise recognition in audio classification, audio recognition, audio generation, and other audio-related machine learning tasks (see the augmentation sketch after the list below). I, Min Si Thu, have used this dataset in open-source projects.

    I collected ten types of noise in this dataset.

    Location - Myanmar, Mandalay, Amarapura Township

    Ten types of noise

    • the noise of a crowded place (Myanmar, Mandalay, Amarapura Township)
    • the noise of urban areas with people talking (Myanmar, Mandalay, Amarapura Township)
    • the noise of the restaurant (Myanmar, Mandalay, Amarapura Township, at a random restaurant)
    • the noise of a working place, people's discussion (Myanmar, Mandalay, Amarapura Township, a private company)
    • the noise of mosquitos (Myanmar, Mandalay, Amarapura Township, Dataset creator's home)
    • the noise of car traffic (Myanmar, Mandalay, Amarapura Township, Asia Bank Road, nighttime)
    • the noise of painful sounds (Myanmar, Mandalay, Amarapura Township)
    • the noise of the rainy day (Myanmar, Mandalay, Amarapura Township, Dataset creator's home)
    • the noise of motorbike and people talking (Myanmar, Mandalay, Amarapura Township, NanTawYar Quarter, Cherry Street)
    • the noise of a festival (Myanmar, Mandalay, Chinese Festival)
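
    A common use of noise clips like those listed above is additive noise augmentation at a chosen signal-to-noise ratio; a minimal sketch, with placeholder file names and assuming librosa, numpy and soundfile are installed:

```python
# Sketch: additive noise augmentation at a target SNR.
# File names are placeholders, not files from this dataset.
import numpy as np
import librosa
import soundfile as sf

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise ratio equals `snr_db`, then add it."""
    # Loop or trim the noise to match the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

speech, sr = librosa.load("clean_speech.wav", sr=16000)       # placeholder path
noise, _ = librosa.load("crowded_place_noise.wav", sr=16000)  # placeholder path
sf.write("speech_noisy_5db.wav", mix_at_snr(speech, noise, snr_db=5), sr)
```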
  4. ASR database ARTUR 0.1 (audio)

    • live.european-language-grid.eu
    binary format
    Updated Nov 30, 2022
    + more versions
    Cite
    (2022). ASR database ARTUR 0.1 (audio) [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/20861
    Explore at:
    binary format; available download formats
    Dataset updated
    Nov 30, 2022
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

    ARTUR is a speech database designed for the needs of automatic speech recognition for the Slovenian language. The database includes 1,035 hours of speech, of which 840 hours are transcribed and the remaining 195 hours are without transcription. The data is divided into 4 parts:

    (1) approx. 520 hours of read speech, which includes the reading of pre-defined sentences selected from the Gigafida 2.0 corpus (http://hdl.handle.net/11356/1320); each sentence is contained in one file; speakers are demographically balanced; spelling is included in special files; all with manual transcriptions;
    (2) approx. 204 hours of public speech, which includes media recordings, online recordings of conferences, workshops, education videos, etc.; 56 hours are manually transcribed;
    (3) approx. 110 hours of private speech, which includes monologues and dialogues between two persons, recorded for the purposes of the speech database; the speakers are demographically balanced; two subsets for domain-specific ASR (i.e., smart-home and face-description) are included; 63 hours are manually transcribed;
    (4) approx. 201 hours of parliamentary speech, which includes recordings from the Slovene National Assembly, all with manual transcriptions.

    Audio files are WAV, 44.1 kHz, PCM, 16-bit, mono.

    This entry includes the recordings only; transcriptions are available at http://hdl.handle.net/11356/1718.

  5. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Oct 19, 2024
    + more versions
    Cite
    Steven R. Livingstone; Frank A. Russo (2024). The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [Dataset]. http://doi.org/10.5281/zenodo.1188976
    Explore at:
    zip; available download formats
    Dataset updated
    Oct 19, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Steven R. Livingstone; Frank A. Russo
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Description

    The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 7356 files (total size: 24.8 GB). The dataset contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song contains calm, happy, sad, angry, and fearful emotions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression. All conditions are available in three modality formats: Audio-only (16bit, 48kHz .wav), Audio-Video (720p H.264, AAC 48kHz, .mp4), and Video-only (no sound). Note, there are no song files for Actor_18.

    The RAVDESS was developed by Dr Steven R. Livingstone, who now leads the Affective Data Science Lab, and Dr Frank A. Russo who leads the SMART Lab.

    Citing the RAVDESS

    The RAVDESS is released under a Creative Commons Attribution-NonCommercial-ShareAlike license, so please cite the RAVDESS if it is used in your work in any form. Published academic papers should use the academic paper citation for our PLoS ONE paper. Personal works, such as machine learning projects/blog posts, should provide a URL to this Zenodo page, though a reference to our PLoS ONE paper would also be appreciated.

    Academic paper citation

    Livingstone SR, Russo FA (2018) The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 13(5): e0196391. https://doi.org/10.1371/journal.pone.0196391.

    Personal use citation

    Include a link to this Zenodo page - https://zenodo.org/record/1188976

    Commercial Licenses

    Commercial licenses for the RAVDESS can be purchased. For more information, please visit our license fee page, or contact us at ravdess@gmail.com.

    Contact Information

    If you would like further information about the RAVDESS, to purchase a commercial license, or if you experience any issues downloading files, please contact us at ravdess@gmail.com.

    Example Videos

    Watch a sample of the RAVDESS speech and song videos.

    Emotion Classification Users

    If you're interested in using machine learning to classify emotional expressions with the RAVDESS, please see our new RAVDESS Facial Landmark Tracking data set [Zenodo project page].

    Construction and Validation

    Full details on the construction and perceptual validation of the RAVDESS are described in our PLoS ONE paper - https://doi.org/10.1371/journal.pone.0196391.

    The RAVDESS contains 7356 files. Each file was rated 10 times on emotional validity, intensity, and genuineness. Ratings were provided by 247 individuals who were characteristic of untrained adult research participants from North America. A further set of 72 participants provided test-retest data. High levels of emotional validity, interrater reliability, and test-retest intrarater reliability were reported. Validation data is open-access, and can be downloaded along with our paper from PLoS ONE.

    Contents

    Audio-only files

    Audio-only files of all actors (01-24) are available as two separate zip files (~200 MB each):

    • Speech file (Audio_Speech_Actors_01-24.zip, 215 MB) contains 1440 files: 60 trials per actor x 24 actors = 1440.
    • Song file (Audio_Song_Actors_01-24.zip, 198 MB) contains 1012 files: 44 trials per actor x 23 actors = 1012.

    Audio-Visual and Video-only files

    Video files are provided as separate zip downloads for each actor (01-24, ~500 MB each), and are split into separate speech and song downloads:

    • Speech files (Video_Speech_Actor_01.zip to Video_Speech_Actor_24.zip) collectively contain 2880 files: 60 trials per actor x 2 modalities (AV, VO) x 24 actors = 2880.
    • Song files (Video_Song_Actor_01.zip to Video_Song_Actor_24.zip) collectively contain 2024 files: 44 trials per actor x 2 modalities (AV, VO) x 23 actors = 2024.

    File Summary

    In total, the RAVDESS collection includes 7356 files (2880+2024+1440+1012 files).

    File naming convention

    Each of the 7356 RAVDESS files has a unique filename. The filename consists of a 7-part numerical identifier (e.g., 02-01-06-01-02-01-12.mp4). These identifiers define the stimulus characteristics:

    Filename identifiers

    • Modality (01 = full-AV, 02 = video-only, 03 = audio-only).
    • Vocal channel (01 = speech, 02 = song).
    • Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).
    • Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the 'neutral' emotion.
    • Statement (01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door").
    • Repetition (01 = 1st repetition, 02 = 2nd repetition).
    • Actor (01 to 24. Odd numbered actors are male, even numbered actors are female).


    Filename example: 02-01-06-01-02-01-12.mp4

    1. Video-only (02)
    2. Speech (01)
    3. Fearful (06)
    4. Normal intensity (01)
    5. Statement "dogs" (02)
    6. 1st Repetition (01)
    7. 12th Actor (12)
    8. Female, as the actor ID number is even.
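
    A minimal sketch of decoding this 7-part naming convention (a hypothetical helper, not part of the RAVDESS distribution):

```python
# Sketch: decode the 7-part RAVDESS filename described above.
EMOTIONS = {"01": "neutral", "02": "calm", "03": "happy", "04": "sad",
            "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised"}
MODALITIES = {"01": "full-AV", "02": "video-only", "03": "audio-only"}

def parse_ravdess_name(filename):
    stem = filename.rsplit(".", 1)[0]
    modality, channel, emotion, intensity, statement, repetition, actor = stem.split("-")
    return {
        "modality": MODALITIES[modality],
        "vocal_channel": "speech" if channel == "01" else "song",
        "emotion": EMOTIONS[emotion],
        "intensity": "normal" if intensity == "01" else "strong",
        "statement": "Kids are talking by the door" if statement == "01"
                     else "Dogs are sitting by the door",
        "repetition": int(repetition),
        "actor": int(actor),
        "actor_sex": "male" if int(actor) % 2 == 1 else "female",
    }

print(parse_ravdess_name("02-01-06-01-02-01-12.mp4"))
# -> video-only, speech, fearful, normal intensity, "Dogs ...", repetition 1, actor 12 (female)
```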

    License information

    The RAVDESS is released under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, CC BY-NC-SA 4.0

    Commercial licenses for the RAVDESS can also be purchased. For more information, please visit our license fee page, or contact us at ravdess@gmail.com.

    Related Data sets

  6. SoraniTTS dataset: Central Kurdish (CK) Speech Corpus for Text-to-Speech

    • data.mendeley.com
    Updated Sep 16, 2025
    Cite
    Farhad Rahimi (2025). SoraniTTS dataset : Central Kurdish (CK) Speech Corpus for Text-to-Speech [Dataset]. http://doi.org/10.17632/jmtn248cc9.5
    Explore at:
    Dataset updated
    Sep 16, 2025
    Authors
    Farhad Rahimi
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset represents a comprehensive resource for advancing Kurdish TTS systems. Converting text to speech is one of the important topics in the design and construction of multimedia systems, human-machine communication, and information and communication technology. Its purpose, along with speech recognition, is to establish communication between humans and machines in its most basic and natural form, that is, spoken language.
    For the text corpus, we collected 6,565 sentences from texts in various categories, including news, sport, health, question and exclamation sentences, science, general information, politics, education and literature, story, miscellaneous, and tourism, to create the training sentences. We thoroughly reviewed and normalized the texts, and they were then recorded by a male speaker. Audio was recorded in a voice recording studio at 44,100 Hz, and all audio files were downsampled to 22,050 Hz for the modeling process. The audio ranges from 3 to 36 seconds in length. The resulting speech corpus contains about 6,565 text-audio pairs, amounting to around 19 hours. All audio files are saved in WAV format, and the texts are saved in text files in the corresponding sub-folders. Furthermore, for model training, all of the audio files are gathered in a single folder. Each line in the transcript files is formatted as WAVS | audio file's name.wav | transcript, where the audio file's name includes the extension and the transcript is the speech's text. The audio recording and editing process lasted 90 days and involved capturing over 6,565 WAV files and over 19 hours of recorded speech. The dataset helps researchers develop Kurdish TTS systems sooner, reducing the time required for this process.
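
    A minimal sketch of reading such a pipe-separated transcript file (the file name "transcript.txt" and the exact column order are assumptions; adjust to the actual files):

```python
# Sketch: parse transcript lines of the form described above
# (audio file name | transcript), splitting on the pipe character.
from pathlib import Path

def load_transcripts(path):
    pairs = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        fields = [f.strip() for f in line.split("|")]
        wav_name, text = fields[-2], fields[-1]  # last two fields: file name, transcript
        pairs.append((wav_name, text))
    return pairs

for wav_name, text in load_transcripts("transcript.txt")[:3]:  # placeholder path
    print(wav_name, "->", text)
```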

    Acknowledgments: We would like to express our sincere gratitude to Ayoub Mohammadzadeh for his invaluable support in recording the corpus.

  7. Czech Audio-Visual Speech Corpus for Recognition with Impaired Conditions

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Nov 5, 2008
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2008). Czech Audio-Visual Speech Corpus for Recognition with Impaired Conditions [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0284/
    Explore at:
    Dataset updated
    Nov 5, 2008
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Description

    This is an audio-visual speech database for training and testing of Czech audio-visual continuous speech recognition systems, collected with impaired illumination conditions. The corpus consists of about 20 hours of audio-visual records of 50 speakers in laboratory conditions. Recorded subjects were instructed to remain static. The illumination varied, and chunks of each speaker were recorded under several different conditions, such as full illumination or illumination from one side (left or right) only. These conditions make the database usable for training lip-/head-tracking systems under various illumination conditions, independently of the language. Speakers were asked to read 200 sentences each (50 common to all speakers and 150 specific to each speaker). The average total length of recording per speaker was 23 minutes.

    Acoustic data are stored in wave files using PCM format, sampling frequency 44 kHz, resolution 16 bits. Each speaker's acoustic data set represents about 180 MB of disk space (about 8.8 GB as a whole). Visual data are stored in video files (.avi format) using the digital video (DV) codec. Visual data per speaker take about 3.7 GB of disk space (about 185 GB as a whole) and are stored on an IDE hard disk (NTFS format).

  8. Data from: Written and spoken digits database for multimodal learning

    • zenodo.org
    bin
    Updated Jan 20, 2021
    + more versions
    Cite
    Lyes Khacef; Laurent Rodriguez; Benoit Miramond (2021). Written and spoken digits database for multimodal learning [Dataset]. http://doi.org/10.5281/zenodo.3515935
    Explore at:
    bin; available download formats
    Dataset updated
    Jan 20, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Lyes Khacef; Laurent Rodriguez; Benoit Miramond
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Database description:

    The written and spoken digits database is not a new database but a constructed database from existing ones, in order to provide a ready-to-use database for multimodal fusion.

    The written digits database is the original MNIST handwritten digits database [1] with no additional processing. It consists of 70000 images (60000 for training and 10000 for test) of 28 x 28 = 784 dimensions.

    The spoken digits database was extracted from Google Speech Commands [2], an audio dataset of spoken words that was proposed to train and evaluate keyword spotting systems. It consists of 105829 utterances of 35 words, amongst which 38908 utterances of the ten digits (34801 for training and 4107 for test). A pre-processing was done via the extraction of the Mel Frequency Cepstral Coefficients (MFCC) with a framing window size of 50 ms and frame shift size of 25 ms. Since the speech samples are approximately 1 s long, we end up with 39 time slots. For each one, we extract 12 MFCC coefficients with an additional energy coefficient. Thus, we have a final vector of 39 x 13 = 507 dimensions. Standardization and normalization were applied on the MFCC features.
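
    A rough approximation of this MFCC pre-processing with librosa (the authors' exact toolchain is not stated here; librosa's 0th cepstral coefficient stands in for the energy coefficient, and the file path is a placeholder):

```python
# Sketch: 13 MFCCs with a 50 ms window and 25 ms shift, as described above.
import librosa

y, sr = librosa.load("speech_commands/zero/example.wav", sr=16000)  # ~1 s clip, placeholder
mfcc = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=13,
    n_fft=int(0.050 * sr),       # 50 ms analysis window
    hop_length=int(0.025 * sr),  # 25 ms frame shift
    center=False,
)
# A 1 s clip yields roughly 39 frames x 13 coefficients, flattened to ~507 dims.
features = mfcc.T.flatten()
print(mfcc.shape, features.shape)
```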

    To construct the multimodal digits dataset, we associated written and spoken digits of the same class respecting the initial partitioning in [1] and [2] for the training and test subsets. Since we have less samples for the spoken digits, we duplicated some random samples to match the number of written digits and have a multimodal digits database of 70000 samples (60000 for training and 10000 for test).

    The dataset is provided in six files as described below. Therefore, if a shuffle is performed on the training or test subsets, it must be performed in unison with the same order for the written digits, spoken digits and labels.

    Files:

    • data_wr_train.npy: 60000 samples of 784-dimensional written digits for training;
    • data_sp_train.npy: 60000 samples of 507-dimensional spoken digits for training;
    • labels_train.npy: 60000 labels for the training subset;
    • data_wr_test.npy: 10000 samples of 784-dimensional written digits for test;
    • data_sp_test.npy: 10000 samples of 507-dimensional spoken digits for test;
    • labels_test.npy: 10000 labels for the test subset.
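
    Given the six files above, a minimal sketch of loading the training subset and shuffling it in unison, as the description requires (assuming NumPy and that the files sit in the working directory):

```python
# Sketch: load the training arrays and apply one permutation to all of them.
import numpy as np

data_wr_train = np.load("data_wr_train.npy")   # 60000 x 784 written digits
data_sp_train = np.load("data_sp_train.npy")   # 60000 x 507 spoken digits
labels_train = np.load("labels_train.npy")     # 60000 labels

rng = np.random.default_rng(seed=0)
perm = rng.permutation(len(labels_train))      # one permutation shared by all three arrays
data_wr_train = data_wr_train[perm]
data_sp_train = data_sp_train[perm]
labels_train = labels_train[perm]
```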

    References:

    1. LeCun, Y. & Cortes, C. (1998), “MNIST handwritten digit database”.
    2. Warden, P. (2018), “Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition”.
  9. Speaker Recognition - CMU ARCTIC

    • kaggle.com
    zip
    Updated Nov 21, 2022
    Cite
    Gabriel Lins (2022). Speaker Recognition - CMU ARCTIC [Dataset]. https://www.kaggle.com/datasets/mrgabrielblins/speaker-recognition-cmu-arctic
    Explore at:
    zip (1354293783 bytes); available download formats
    Dataset updated
    Nov 21, 2022
    Authors
    Gabriel Lins
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

    Description
    • Can you predict which speaker is talking?
    • Can you predict what they are saying?

    This dataset makes both of these possible. Perfect for a school project, research project, or resume builder.

    File information

    • train.csv - file containing all the data you need for training, with 4 columns: id (file id), file_path (path to the .wav file), speech (transcription of the audio file), and speaker (target column)
    • test.csv - file containing all the data you need to test your model (20% of total audio files); it has the same columns as train.csv
    • train/ - Folder with training data, subdivided with Speaker's folders
      • aew/ - Folder containing audio files in .wav format for speaker aew
      • ...
    • test/ - Folder containing audio files for test data.

    Column description

    • id - file id (string)
    • file_path - file path to the .wav file (string)
    • speech - transcription of the audio file (string)
    • speaker - speaker name; use this as the target variable if you are doing audio classification (string)
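
    A minimal sketch of loading the metadata and one clip (assuming pandas and librosa are installed and train.csv sits at the dataset root as described above):

```python
# Sketch: inspect the training metadata and read one audio file.
import pandas as pd
import librosa

train = pd.read_csv("train.csv")                # columns: id, file_path, speech, speaker
print(train["speaker"].value_counts())          # samples per speaker (the target)

row = train.iloc[0]
audio, sr = librosa.load(row["file_path"], sr=None)  # keep the native sampling rate
print(row["speaker"], row["speech"], audio.shape, sr)
```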

    More Details

    The CMU_ARCTIC databases were constructed at the Language Technologies Institute at Carnegie Mellon University as phonetically balanced, US-English single-speaker databases designed for unit selection speech synthesis research. A detailed report on the structure and content of the database and the recording environment is available as Carnegie Mellon University Language Technologies Institute Tech Report CMU-LTI-03-177.

    The databases consist of around 1150 utterances carefully selected from out-of-copyright texts from Project Gutenberg. The databases include US English male (bdl) and female (slt) speakers (both experienced voice talent) as well as other accented speakers.

    The 1132 sentence prompt list is available from cmuarctic.data

    The distributions include 16 kHz waveforms and simultaneous EGG signals. Full phonetic labeling was performed with CMU Sphinx using the FestVox-based labeling scripts. Complete runnable Festival voices are included with the database distributions as examples, though better voices can be made by improving the labeling, etc.

    Acknowledgements

    This work was partially supported by the U.S. National Science Foundation under Grant No. 0219687, "ITR/CIS Evaluation and Personalization of Synthetic Voices". Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

  10. Data from: Speech databases of typical children and children with SLI

    • figshare.com
    • live.european-language-grid.eu
    • +1 more
    zip
    Updated Dec 15, 2016
    Cite
    Pavel Grill (2016). Speech databases of typical children and children with SLI [Dataset]. http://doi.org/10.6084/m9.figshare.2360626.v2
    Explore at:
    zip; available download formats
    Dataset updated
    Dec 15, 2016
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Pavel Grill
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Our Laboratory of Artificial Neural Network Applications (LANNA) at the Czech Technical University in Prague (head of the laboratory: professor Jana Tučková) collaborates on a project with the Department of Paediatric Neurology, 2nd Faculty of Medicine of Charles University in Prague, and with the Motol University Hospital (head of clinic: professor Vladimír Komárek), which focuses on the study of children with SLI.

    The speech database contains two subgroups of recordings of children's speech from different types of speakers. The first subgroup (healthy) consists of recordings of children without speech disorders; the second subgroup (patients) consists of recordings of children with SLI. These children have different degrees of severity (1 - mild, 2 - moderate, 3 - severe); speech therapists and specialists from Motol Hospital decided upon this classification. The children's speech was recorded in the period 2003-2013. These databases were commonly created in a schoolroom or a speech therapist's consulting room, in the presence of surrounding background noise. This situation simulates the natural environment in which the children live, and is important for capturing the normal behavior of children.

    The database of healthy children's speech was created as a reference database for the computer processing of children's speech. It was recorded on a SONY digital Dictaphone (sampling frequency fs = 16 kHz, 16-bit resolution, stereo mode, standardized wav format) and on an MD SONY MZ-N710 (sampling frequency fs = 44.1 kHz, 16-bit resolution, stereo mode, standardized wav format). The corpus was recorded in the natural environment of a schoolroom and in a clinic. This subgroup contains a total of 44 native Czech participants (15 boys, 29 girls) aged 4 to 12 years, and was recorded during the period 2003-2005.

    The database of children with SLI was recorded in a private speech therapist's office. The children's speech was captured by means of a SHURE lapel microphone using the solution by the company AVID (MBox USB AD/DA converter and ProTools LE software) on an Apple laptop (iBook G4). The sound recordings are saved in the standardized wav format. The sampling frequency is set to 44.1 kHz with 16-bit resolution in mono mode. This subgroup contains a total of 54 native Czech participants (35 boys, 19 girls) aged 6 to 12 years, and was recorded during the period 2009-2013.

    This package contains wav data sets for developing and testing methods for detecting children with SLI.

    Software pack:

    • FORANA - original software developed for formant analysis, based on the MATLAB programming environment. Its development was mainly driven by the need to perform formant analysis correctly and to fully automate the extraction of formants from recorded speech signals. Development of this application is still ongoing. Developed in the LANNA at CTU FEE in Prague.
    • LABELING - a program used for segmentation of the speech signal; part of the SOMLab program system. Developed in the LANNA at CTU FEE in Prague.
    • PRAAT - acoustic analysis software created by Paul Boersma and David Weenink of the Institute of Phonetic Sciences of the University of Amsterdam. Home page: http://www.praat.org or http://www.fon.hum.uva.nl/praat/.
    • openSMILE - a feature extraction tool that enables you to extract large audio feature spaces in real time. It combines features from Music Information Retrieval and Speech Processing. SMILE is an acronym for Speech & Music Interpretation by Large-space Extraction. It is written in C++ and is available as both a standalone command-line executable and a dynamic library. The main features of openSMILE are its capability of on-line incremental processing and its modularity. Feature extractor components can be freely interconnected to create new and custom features, all via a simple configuration file. New components can be added to openSMILE via an easy binary plugin interface and a comprehensive API. Citing: Florian Eyben, Martin Wöllmer, Björn Schuller: "openSMILE - The Munich Versatile and Fast Open-Source Audio Feature Extractor", in Proc. ACM Multimedia (MM), ACM, Florence, Italy, ISBN 978-1-60558-933-6, pp. 1459-1462, October 2010. doi:10.1145/1873951.1874246
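
    openSMILE-style features can also be computed from Python via audEERING's `opensmile` wrapper package; a minimal sketch (the wrapper, the chosen eGeMAPSv02 feature set, and the file name are assumptions, not part of this data package):

```python
# Sketch: extract a standard acoustic feature set with the `opensmile` Python wrapper.
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)
features = smile.process_file("child_recording.wav")  # placeholder path; returns a DataFrame
print(features.shape)
```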

  11. Data from: Basic Audio Dataset

    • kaggle.com
    Updated Aug 1, 2023
    Cite
    Bhuvi Ranga (2023). Basic Audio Dataset [Dataset]. https://www.kaggle.com/datasets/bhuviranga/basic-audio-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 1, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Bhuvi Ranga
    Description

    This is a small sample from the RAVDESS emotional speech audio, used only for basic audio data analysis.

    "The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)" by Livingstone & Russo is licensed under CC BY-NC-SA 4.0.

  12. Laboratory Conditions Czech Audio-Visual Speech Corpus

    • catalog.elra.info
    • live.european-language-grid.eu
    Updated Nov 5, 2008
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2008). Laboratory Conditions Czech Audio-Visual Speech Corpus [Dataset]. http://catalog.elra.info/en-us/repository/browse/ELRA-S0283/
    Explore at:
    Dataset updated
    Nov 5, 2008
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    http://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    http://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    This is an audio-visual speech database for training and testing of Czech audio-visual continuous speech recognition systems. The corpus consists of about 25 hours of audio-visual records of 65 speakers in laboratory conditions. Data collection was done with static illumination, and recorded subjects were instructed to remain static. The average speaker age was 22 years old. Speakers were asked to read 200 sentences each (50 common for all speakers and 150 specific to each speaker). The average total length of recording per speaker is 23 minutes.

    All audio-visual data are transcribed (.trs files) and divided into sentences (one sentence per file). For each video file there is a description file containing information about the position and size of the region of interest.

    Acoustic data are stored in wave files using PCM format, sampling frequency 44 kHz, resolution 16 bits. Each speaker's acoustic data set represents about 140 MB of disk space (about 9 GB as a whole). Visual data are stored in video files (.avi format) using the digital video (DV) codec. Visual data per speaker take about 3 GB of disk space (about 195 GB as a whole) and are stored on an IDE hard disk (NTFS format).

  13. Podcast Database - Complete Podcast Metadata, All Countries & Languages

    • datarade.ai
    .json, .csv, .sql
    Updated May 27, 2025
    Cite
    Listen Notes (2025). Podcast Database - Complete Podcast Metadata, All Countries & Languages [Dataset]. https://datarade.ai/data-products/podcast-database-complete-podcast-metadata-all-countries-listen-notes
    Explore at:
    .json, .csv, .sql; available download formats
    Dataset updated
    May 27, 2025
    Dataset authored and provided by
    Listen Notes
    Area covered
    Zambia, Gibraltar, Colombia, Guinea-Bissau, Indonesia, Slovenia, Turkey, Iran (Islamic Republic of), Bosnia and Herzegovina, Anguilla
    Description

    == Quick facts ==

    • The most up-to-date and comprehensive podcast database available
    • All languages & all countries
    • Includes over 3,600,000 podcasts
    • Features 35+ data fields, such as basic metadata, global rank, RSS feed (with audio URLs), Spotify links, and more
    • Delivered in SQLite format
    • Learn how we build a high quality podcast database: https://www.listennotes.help/article/105-high-quality-podcast-database-from-listen-notes

    == Use Cases ==

    • AI training, including speech recognition, generative AI, voice cloning / synthesis, and news analysis
    • Alternative data for investment research, such as sentiment analysis of executive interviews, market research and tracking investment themes
    • PR and marketing, including social monitoring, content research, outreach, and guest booking
    • ...

    == Data Attributes ==

    See the full list of data attributes on this page: https://www.listennotes.com/podcast-datasets/fields/?filter=podcast_only

    How to access podcast audio files: Our dataset includes RSS feed URLs for all podcasts. You can retrieve audio for over 170 million episodes directly from these feeds. With access to the raw audio, you’ll have high-quality podcast speech data ideal for AI training and related applications.
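
    A minimal sketch of pulling episode audio URLs from one RSS feed, as described above (assuming the `feedparser` package; the feed URL is a placeholder):

```python
# Sketch: list episode titles and audio enclosure URLs from a podcast RSS feed.
import feedparser

feed = feedparser.parse("https://example.com/podcast/rss")  # placeholder RSS URL
for episode in feed.entries:
    for enclosure in episode.get("enclosures", []):
        print(episode.get("title", ""), enclosure.get("href"))
```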

    == Custom Offers ==

    We can provide custom datasets based on your needs, such as language-specific data, daily/weekly/monthly update frequency, or one-time purchases.

    We also provide a RESTful API at PodcastAPI.com

    Contact us: hello@listennotes.com

    == Need Help? ==

    If you have any questions about our products, feel free to reach out at hello@listennotes.com

    == About Listen Notes, Inc. ==

    Since 2017, Listen Notes, Inc. has provided the leading podcast search engine and podcast database.

  14. Data from: DiPCo -- Dinner Party Corpus

    • zenodo.org
    application/gzip, pdf
    Updated Jul 11, 2024
    Cite
    Maarten Van Segbroeck; Ahmed Zaid; Ksenia Kutsenko; Cirenia Huerta; Tinh Nguyen; Xuewen Luo; Björn Hoffmeister; Jan Trmal; Maurizio Omologo; Roland Maas (2024). DiPCo -- Dinner Party Corpus [Dataset]. http://doi.org/10.21437/interspeech.2020-2800
    Explore at:
    application/gzip, pdf; available download formats
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Maarten Van Segbroeck; Ahmed Zaid; Ksenia Kutsenko; Cirenia Huerta; Tinh Nguyen; Xuewen Luo; Björn Hoffmeister; Jan Trmal; Maurizio Omologo; Roland Maas
    License

    https://cdla.io/permissive-1-0

    Description

    We present a speech data corpus that simulates a "dinner party" scenario taking place in an everyday home environment. The corpus was created by recording multiple groups of four Amazon employee volunteers having a natural conversation in English around a dining table. The participants were recorded by a single-channel close-talk microphone and by five far-field 7-microphone array devices positioned at different locations in the recording room. The dataset contains the audio recordings and human-labeled transcripts of a total of 10 sessions with a duration between 15 and 45 minutes. The corpus was created to advance the field of noise-robust and distant speech processing and is intended to serve as a public research and benchmarking data set.

  15. Pashto Conversational Speech Recognition Corpus (Telephone)

    • catalog.elra.info
    Updated Apr 7, 2020
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2020). Pashto Conversational Speech Recognition Corpus (Telephone) [Dataset]. https://catalog.elra.info/en-us/repository/browse/ELRA-S0228_59/
    Explore at:
    Dataset updated
    Apr 7, 2020
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Description

    The corpus contains 26 pairs of spontaneous conversational speech in Afghanistan Southern Pashto, from 52 speakers (27 male and 25 female). For this collection, the 2 speakers of each pair performed the recording in separate quiet rooms. The database covers 21 topics. The audio duration is 160.3 hours, of which about 50.8 hours is speech, including reasonable leading and trailing silence. The total size of this database is 8.6 GB.

  16. Russian Speech Database

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Jun 3, 2005
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2005). Russian Speech Database [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0050/
    Explore at:
    Dataset updated
    Jun 3, 2005
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    The STC Russian speech database was recorded in 1996-1998. The main purpose of the database is to investigate individual speaker variability and to validate speaker recognition algorithms. The database was recorded through a 16-bit Vibra-16 Creative Labs sound card with an 11,025 Hz sampling rate.

    The database contains Russian read speech of 89 different speakers (54 male, 35 female), including 70 speakers with 15 sessions or more, 10 speakers with 10 sessions or more and 9 speakers with less than 10 sessions. The speakers were recorded in Saint-Petersburg and are within the age range of 18-62. All are native speakers. The corpus consists of 5 sentences. Each speaker read each sentence carefully but fluently 15 times, on different dates over a period of 1-3 months. The corpus contains a total of 6,889 utterances on 2 volumes, with a total size of 700 MB of uncompressed data. The signal of each utterance is stored as a separate file (approx. 126 KB). The total size of data for one speaker is approximately 9,500 KB. Average utterance duration is about 5 seconds.

    A file gives information about the speakers (speaker's age and gender). The orthography and phonetic transcription of the corpus are given in separate files which contain the prompted sentences and their transcription in IPA. The signal files are raw files without any header, 16 bits per sample, linear, 11,025 Hz sample frequency.

    The recording conditions were as follows:

    • Microphone: dynamic omnidirectional high-quality microphone, distance to mouth 5-10 cm
    • Environment: office room
    • Sampling rate: 11,025 Hz
    • Resolution: 16 bit
    • Sound board: Creative Labs Vibra-16
    • Means of delivery: CD-ROM
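
    A minimal sketch of reading one of these headerless signal files (the file name is a placeholder, and little-endian byte order is assumed):

```python
# Sketch: load raw 16-bit linear PCM at 11,025 Hz, as described above.
import numpy as np

SAMPLE_RATE = 11025
samples = np.fromfile("speaker01_sentence1_session01.raw", dtype="<i2")  # 16-bit signed ints
signal = samples.astype(np.float32) / 32768.0   # scale to [-1, 1)
print(len(samples) / SAMPLE_RATE, "seconds")
```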

  17. Data from: The TORGO database of acoustic and articulatory speech from speakers with dysarthria

    • incluset.com
    Updated 2011
    Cite
    Frank Rudzicz; Aravind Kumar Namasivayam; Talya Wolff (2011). The TORGO database of acoustic and articulatory speech from speakers with dysarthria [Dataset]. https://incluset.com/datasets
    Explore at:
    Dataset updated
    2011
    Authors
    Frank Rudzicz; Aravind Kumar Namasivayam; Talya Wolff
    Measurement technique
    It consists of aligned acoustics and measured 3D articulatory features from speakers with either cerebral palsy (CP) or amyotrophic lateral sclerosis (ALS).
    Description

    This database was originally created as a resource for developing advanced automatic speech recognition models better suited to the needs of people with dysarthria.

  18. Data from: Dysarthric speech database for universal access research

    • incluset.com
    Updated 2007
    Cite
    Heejin Kim; Mark Allan Hasegawa-Johnson; Adrienne Perlman; Jon Gunderson; Thomas S Huang; Kenneth Watkin; Simone Frame (2007). Dysarthric speech database for universal access research [Dataset]. https://incluset.com/datasets
    Explore at:
    Dataset updated
    2007
    Authors
    Heejin Kim; Mark Allan Hasegawa-Johnson; Adrienne Perlman; Jon Gunderson; Thomas S Huang; Kenneth Watkin; Simone Frame
    Measurement technique
    Participants read a variety of words, such as digits, letters, computer commands, common words, and uncommon words from Project Gutenberg novels.
    Description

    This dataset was collected to advance research into speech recognition systems for dysarthric speech.

  19. Emotional Voice Messages (EMOVOME) database

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jun 13, 2024
    Cite
    Lucía Gómez-Zaragozá; Rocío del Amor; Elena Parra Vargas; Valery Naranjo; Mariano Alcañiz Raya; Javier Marín-Morales (2024). Emotional Voice Messages (EMOVOME) database [Dataset]. http://doi.org/10.5281/zenodo.10694370
    Explore at:
    zip; available download formats
    Dataset updated
    Jun 13, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Lucía Gómez-Zaragozá; Rocío del Amor; Elena Parra Vargas; Valery Naranjo; Mariano Alcañiz Raya; Javier Marín-Morales
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The Emotional Voice Messages (EMOVOME) database is a speech dataset collected for emotion recognition in real-world conditions. It contains 999 spontaneous voice messages from 100 Spanish speakers, collected from real conversations on a messaging app. EMOVOME includes both expert and non-expert emotional annotations, covering valence and arousal dimensions, along with emotion categories for the expert annotations. Detailed participant information is provided, including sociodemographic data and personality trait assessments using the NEO-FFI questionnaire. Moreover, EMOVOME provides audio recordings of participants reading a given text, as well as transcriptions of all 999 voice messages. Additionally, baseline models for valence and arousal recognition are provided, utilizing both speech and audio transcriptions.

    Description

    For details on the EMOVOME database, please refer to the article:

    "EMOVOME Database: Advancing Emotion Recognition in Speech Beyond Staged Scenarios". Lucía Gómez-Zaragozá, Rocío del Amor, María José Castro-Bleda, Valery Naranjo, Mariano Alcañiz Raya, Javier Marín-Morales. (pre-print available in https://doi.org/10.48550/arXiv.2403.02167)

    Content

    The Zenodo repository contains four files:

    • EMOVOME_agreement.pdf: agreement file required to access the original audio files, detailed in section Usage Notes.
    • labels.csv: ratings of the three non-experts and the expert annotator, independently and combined.
    • participants_ids.csv: table mapping each numerical file ID to its corresponding alphanumeric participant ID.
    • transcriptions.csv: transcriptions of each audio.

    The repository also includes three folders:

    • Audios: it contains the file features_eGeMAPSv02.csv corresponding to the standard acoustic feature set used in the baseline model, and two folders:
      • Lecture: contains the audio files corresponding to the text readings, with each file named according to the participant's ID.
      • Emotions: contains the voice recordings from the messaging app provided by the user, named with a file ID.
    • Questionnaires: it contains two files: 1) sociodemographic_spanish.csv and sociodemographic_english.csv are the sociodemographic data of participants in Spanish and English, respectively, including the demographic information; and 2) NEO-FFI_spanish.csv includes the participants’ answers to the Spanish version of the NEO-FFI questionnaire. The three files include a column indicating the participant's ID to link the information.
    • Baseline_emotion_recognition: it includes three files and two folders. The file partitions.csv specifies the proposed data partition. Particularly, the dataset is divided into 80% for development and 20% for testing using a speaker-independent approach, i.e., samples from the same speaker are not included in both development and test. The development set includes 80 participants (40 female, 40 male) containing the following distribution of labels: 241 negative, 305 neutral and 261 positive valence; and 148 low, 328 neutral and 331 high arousal. The test set includes 20 participants (10 female, 10 male) with the distribution of labels that follows: 57 negative, 62 neutral and 73 positive valence; and 13 low, 70 neutral and 109 high arousal. Files baseline_speech.ipynb and baseline_text.ipynb contain the code used to create the baseline emotion recognition models based on speech and text, respectively. The actual trained models for valence and arousal prediction are provided in folders models_speech and models_text.

    Audio files in “Lecture” and “Emotions” are only provided to the users that complete the agreement file in section Usage Notes. Audio files are in Ogg Vorbis format at 16-bit and 44.1 kHz or 48 kHz. The total size of the “Audios” folder is about 213 MB.

    Usage Notes

    All the data included in the EMOVOME database is publicly available under the Creative Commons Attribution 4.0 International license. The only exception is the original raw audio files, for which an additional step is required as a security measure to safeguard the speakers' privacy. To request access, interested authors should first complete and sign the agreement file EMOVOME_agreement.pdf and send it to the corresponding author (jamarmo@htech.upv.es). The data included in the EMOVOME database is expected to be used for research purposes only. Therefore, the agreement file states that the authors are not allowed to share the data with profit-making companies or organisations. They are also not expected to distribute the data to other research institutions; instead, they are suggested to kindly refer interested colleagues to the corresponding author of this article. By agreeing to the terms of the agreement, the authors also commit to refraining from publishing the audio content on the media (such as television and radio), in scientific journals (or any other publications), as well as on other platforms on the internet. The agreement must bear the signature of the legally authorised representative of the research institution (e.g., head of laboratory/department). Once the signed agreement is received and validated, the corresponding author will deliver the "Audios" folder containing the audio files through a download procedure. A direct connection between the EMOVOME authors and the applicants guarantees that updates regarding additional materials included in the database can be received by all EMOVOME users.

  20. Data from: Speech activity detection datasets

    • kaggle.com
    zip
    Updated Nov 3, 2020
    Cite
    LazyRac00n (2020). Speech activity detection datasets [Dataset]. https://www.kaggle.com/datasets/lazyrac00n/speech-activity-detection-datasets/data
    Explore at:
    zip (144366075 bytes); available download formats
    Dataset updated
    Nov 3, 2020
    Authors
    LazyRac00n
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Description

    Context

    The speech activity detection task discriminates the segments of a signal where human speech occurs from those containing other types of sounds (such as silence and noise). It is critical, since it is the starting point for many speech/audio applications, including speech coding, speech recognition and speech enhancement.

    Since there is no specific public dataset for this kind of task, I collected 719 audio files from three different databases:

    • TIMIT: a corpus of read speech, designed to provide speech data for acoustic and phonetic studies and for the evaluation of automatic speech recognition systems.
    • PTDB-TUG: a speech database for pitch tracking that provides microphone signals of 20 English speakers.
    • Noizeus: contains speech data of 30 sentences. Noise signals (from the AURORA database) are artificially added to the speech; in particular, the database contains audio corrupted with babble (a crowd of people), street, train, train station, car and restaurant noise at an SNR of 5 dB, as well as the original clean recordings.

    Content

    Praat is a useful speech analysis tool which provides a wide range of functionalities; one of them creates an annotation file which records, in a txt format, where the silent and sounding intervals are in the signal. For the data from TIMIT and PTDB-TUG, the silences correspond to NON-SPEECH and the sounding intervals correspond to SPEECH, because these databases don't have any kind of background noise. For the files from Noizeus I worked slightly differently: since the database also has the original audio (without noise), I associated with each noisy file the annotations of the corresponding noiseless one.

    All audio files are .wav, and all the annotation files are .TextGrid. The structure of such files is described in the folder description. The annotations can be read with a suitable Python library: https://pypi.org/project/praat-textgrids/ (see the sketch below).
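
    A minimal sketch of reading one annotation file with that package (the file name is a placeholder; the actual tier names should be taken from the folder description):

```python
# Sketch: read a .TextGrid annotation with the praat-textgrids package.
import textgrids

grid = textgrids.TextGrid("example_annotation.TextGrid")  # placeholder path
print(list(grid.keys()))                      # available tiers
first_tier = list(grid.keys())[0]
for interval in grid[first_tier]:             # iterate intervals of the first tier
    label = interval.text                     # e.g. SPEECH / NON-SPEECH
    print(f"{interval.xmin:.2f}-{interval.xmax:.2f}s: {label}")
```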
