Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Noise is an unwanted component of audio recordings, yet it plays an important role in machine learning on audio data.
The dataset can be used for noise filtering, noise generation, and noise recognition in audio classification, audio recognition, audio generation, and other audio-related machine learning tasks (see the mixing sketch below). I, Min Si Thu, have used this dataset in open-source projects.
The dataset contains ten types of noise.
Location - Myanmar, Mandalay, Amarapura Township
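As a minimal illustration of the noise-generation and augmentation use cases, the sketch below mixes a noise clip from the dataset into a clean recording at a chosen signal-to-noise ratio. The file names are placeholders, and the clips are assumed to share a sample rate.

```python
# Hypothetical sketch: mix a noise clip from this dataset into clean audio
# at a target SNR (file names are placeholders).
import numpy as np
import soundfile as sf

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean/noise power ratio equals `snr_db`, then add."""
    # Loop or trim the noise to the clean signal's length.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

clean, sr = sf.read("clean_speech.wav")   # placeholder path
noise, nsr = sf.read("market_noise.wav")  # one of the ten noise types
assert sr == nsr, "resample first if the rates differ"
if clean.ndim > 1: clean = clean.mean(axis=1)  # work in mono for simplicity
if noise.ndim > 1: noise = noise.mean(axis=1)
sf.write("noisy_speech.wav", mix_at_snr(clean, noise, snr_db=10), sr)
```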
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
'In-the-Wild' Dataset
We present a dataset of audio deepfakes (and corresponding benign audio) for a set of politicians and other public figures, collected from publicly available sources such as social networks and video streaming platforms. For n = 58 celebrities and politicians, we collect both bona-fide and spoofed audio. In total, we collect 20.8 hours of bona-fide and 17.2 hours of spoofed audio. On average, there are 23 minutes of bona-fide and 18 minutes of spoofed audio per speaker.
The dataset is intended to be used for evaluating deepfake detection and voice anti-spoofing machine-learning models. It is especially useful to judge a model's capability to generalize to realistic, in-the-wild audio samples. Find more information in our paper, and download the dataset here.
The most interesting deepfake detection models we used in our experiments are open-source on GitHub:
RawNet 2, RawGAT-ST, and PC-Darts.
This dataset and the associated documentation are licensed under the Apache License, Version 2.0.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset is a sound dataset for malfunctioning industrial machine investigation and inspection (MIMII dataset). It contains the sounds generated from four types of industrial machines, i.e., valves, pumps, fans, and slide rails. Each type of machine includes seven individual product models*1, and the data for each model contains normal sounds (from 5,000 to 10,000 seconds) and anomalous sounds (about 1,000 seconds). To resemble a real-life scenario, various anomalous sounds were recorded (e.g., contamination, leakage, rotating unbalance, and rail damage). Also, background noise recorded in multiple real factories was mixed with the machine sounds. The sounds were recorded with an eight-channel microphone array at a 16 kHz sampling rate and 16 bits per sample. The MIMII dataset serves as a benchmark for sound-based machine fault diagnosis. Users can test performance on specific tasks, e.g., unsupervised anomaly detection, transfer learning, and noise robustness. The details of the dataset are described in [1][2].
This dataset is made available by Hitachi, Ltd. under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.
A baseline sample code for anomaly detection is available on GitHub: https://github.com/MIMII-hitachi/mimii_baseline/
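As a loose illustration of the unsupervised anomaly-detection use case (this is not the official baseline linked above), the following sketch fits an IsolationForest on log-mel frames of normal machine sounds and scores a test clip by its mean frame-level anomaly score. File names are placeholders.

```python
# Minimal unsupervised anomaly-detection sketch, not the MIMII baseline.
import numpy as np
import librosa
from sklearn.ensemble import IsolationForest

def logmel_frames(path, sr=16000, n_mels=64):
    y, _ = librosa.load(path, sr=sr, mono=True)   # average the 8 channels
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                         hop_length=512, n_mels=n_mels)
    return librosa.power_to_db(mel).T             # (frames, n_mels)

# Fit on frames from normal recordings (placeholder file names).
train = np.vstack([logmel_frames(p) for p in ["normal_000.wav", "normal_001.wav"]])
model = IsolationForest(n_estimators=100, random_state=0).fit(train)

test = logmel_frames("unknown_clip.wav")
anomaly_score = -model.score_samples(test).mean()  # higher => more anomalous
print(f"anomaly score: {anomaly_score:.3f}")
```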
*1: This version "public 1.0" contains four models (model IDs 00, 02, 04, and 06). The remaining three models will be released in a future edition.
[1] Harsh Purohit, Ryo Tanabe, Kenji Ichige, Takashi Endo, Yuki Nikaido, Kaori Suefusa, and Yohei Kawaguchi, “MIMII Dataset: Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection,” arXiv preprint arXiv:1909.09347, 2019.
[2] Harsh Purohit, Ryo Tanabe, Kenji Ichige, Takashi Endo, Yuki Nikaido, Kaori Suefusa, and Yohei Kawaguchi, “MIMII Dataset: Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection,” in Proc. 4th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2019.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains Kiswahili text and audio files: 7,108 text files and their corresponding audio files. The Kiswahili dataset was created from open-source, non-copyrighted material: a Kiswahili audio Bible. The authors permit use for non-profit, educational, and public-benefit purposes. The downloaded audio files were longer than 12.5 s, so they were programmatically split into short clips based on silence and then recombined to random lengths such that each resulting audio file lies between 1 and 12.5 s. This was done using Python 3 (a sketch of this preprocessing follows below). The audio files were saved as single-channel, 16-bit PCM WAVE files with a sampling rate of 22.05 kHz. The dataset contains approximately 106,000 Kiswahili words, transcribed at a mean of 14.96 words per text file and saved in CSV format. Each text file is divided into three parts: a unique ID, the transcribed words, and the normalized words. The unique ID is a number assigned to each text file. The transcribed words are the text spoken by the reader. Normalized text expands abbreviations and numbers into full words. Each audio clip is assigned the same unique ID as its text file.
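The splitting and recombination step can be reproduced roughly as follows. This is a minimal sketch assuming pydub is used; the silence thresholds and file names are assumptions, not the authors' exact settings.

```python
# Rough re-creation of the described preprocessing (assumed parameters).
import random
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_file("kiswahili_audio_bible.mp3")   # placeholder
chunks = split_on_silence(audio, min_silence_len=300, silence_thresh=-40)

clips, current = [], AudioSegment.empty()
target_ms = random.uniform(1000, 12500)          # random length in 1-12.5 s
for chunk in chunks:
    current += chunk
    if len(current) >= target_ms:
        clips.append(current[:12500])            # cap at 12.5 s
        current = AudioSegment.empty()
        target_ms = random.uniform(1000, 12500)

for i, clip in enumerate(clips):
    # single channel, 16-bit PCM, 22.05 kHz, as described above
    clip.set_frame_rate(22050).set_channels(1).set_sample_width(2) \
        .export(f"clip_{i:05d}.wav", format="wav")
```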
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Improve your machine learning models with high-quality physician/doctor dictation speech datasets. Deep domain expertise. Fast & Cost-effective. Trusted by industry leaders.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
EDM-HSE Dataset
EDM-HSE is an open audio dataset containing a collection of code-generated drum recordings in the style of modern electronic house music. It includes 8,000 audio loops recorded in uncompressed stereo WAV format, created using custom audio samples and a MIDI drum dataset. The dataset also comes with paired JSON files containing MIDI note numbers (pitch) and tempo data, intended for supervised training of generative AI audio models.
Overview
The EDM-HSE Dataset was developed using an algorithmic framework to generate probable drum notations commonly played by EDM music producers. For supervised training with labeled data, a variational mixing technique was applied to the rendered audio files. This method systematically includes or excludes drum notes, assisting the model in recognizing patterns and relationships between drum instruments, thereby enhancing its generalization capabilities.
The primary purpose of this dataset is to provide accessible content for machine learning applications in music and audio. Potential use cases include generative music, feature extraction, tempo detection, audio classification, rhythm analysis, drum synthesis, music information retrieval (MIR), sound design and signal processing.
Specifications
8,000 audio loops (approximately 17 hours)
16-bit WAV format
Tempo range: 120–130 BPM
Paired label data (WAV + JSON)
Variational drum patterns
Subgenre styles (Big room, electro, minimal, classic)
A JSON file is provided for referencing and converting MIDI note numbers to text labels. You can update the text labels to suit your preferences.
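For example, the mapping file might be used as sketched below. The file names and JSON keys are hypothetical; check the repository's actual schema before relying on them.

```python
# Hypothetical use of the provided JSON mapping from MIDI note numbers to labels.
import json

with open("midi_note_labels.json") as f:          # placeholder file name
    note_to_label = json.load(f)                  # e.g. {"36": "kick", "38": "snare"}

with open("loop_0001.json") as f:                 # paired label file for a loop
    loop = json.load(f)

# Convert each MIDI note number in the loop to its text label (assumed keys).
labels = [note_to_label.get(str(note), "unknown") for note in loop.get("notes", [])]
print(loop.get("tempo"), labels)
```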
License
This dataset was compiled by WaivOps, a crowdsourced music project managed by the sound label company Patchbanks. All recordings have been compiled by verified sources for copyright clearance.
The EDM-HSE dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).
Additional Info
Please note that this dataset has not been fully reviewed and may contain minor notational errors or audio defects.
For audio examples or more information about this dataset, please refer to the GitHub repository.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
WaivOps HH-LFBB Dataset
HH-LFBB is an open audio dataset composed of a series of drum recordings in the style of lofi hip-hop music. The dataset contains 3332 audio loops recorded in uncompressed stereo WAV format, produced with custom drum samples and MIDI-programmed rhythms at various tempo rates.
Dataset
The primary objective of this dataset is to provide accessible content for machine learning applications in music and audio research. Some potential use cases for this dataset include tempo detection and classification, drum rhythm analysis, audio-to-MIDI conversion, source separation, automated mixing, music information retrieval, AI music generation, sound design, and signal processing.
Specifications
License
This dataset was compiled by WaivOps, a crowdsourced music project managed by the sound label company Patchbanks. All recordings have been compiled by verified sources for copyright clearance.
The HH-LFBB dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).
Additional Info
For audio examples or more information about this dataset, please refer to the GitHub repository.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset for this project comprises audio recordings of the operational states of belt conveyor rollers in a mining environment, covering three conditions: normal roller operation, roller shell cracking, and roller breakage. Combined with machine learning models, this dataset can be used for real-time diagnosis of roller operational states. The database contains two main folders: dataset and code.
The dataset folder includes three subfolders:
wav: Contains 19 WAV files recorded from 19 microphones, capturing the audio of belt conveyor rollers at a mining site. Of these, 17 files represent normal roller operation, 1 file captures the audio of a roller with shell cracking, and 1 file captures the audio of a roller with complete breakage.
csv_dataset: Contains 10 subfolders, each holding audio feature datasets extracted from the WAV files with frame lengths ranging from 100 ms to 1000 ms. Each subfolder contains 19 CSV files, corresponding to the 19 audio recordings. The feature datasets within different frame-length subfolders should not be used interchangeably.
test_dataset: Contains 17 audio feature datasets with a 200 ms frame length. These datasets combine features from the 17 normal-operation recordings with features from the roller shell cracking and roller breakage recordings. The combined datasets are shuffled 100 times to ensure an even distribution of features from each operational state. This dataset was used in the paper to validate the accuracy and usability of the audio feature datasets for real-time monitoring of roller states.
The code folder contains two sets of code:
Matlab Code: extracts 25 audio features from the WAV files and generates the 17 audio feature datasets using a 200 ms frame length (a rough Python analogue is sketched below).
Python Code: validates the accuracy and usability of the audio feature datasets for real-time monitoring of belt conveyor roller operational states.
Together, the dataset and code support real-time diagnosis of belt conveyor roller conditions and provide a foundation for validating the effectiveness of audio features in fault detection.
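The sketch below is a rough Python analogue of the frame-based feature extraction. The original Matlab code computes 25 features per 200 ms frame; only a few illustrative features are shown here, and the file path is a placeholder.

```python
# Frame-based feature extraction sketch (illustrative subset of features only).
import numpy as np
import librosa
import pandas as pd

y, sr = librosa.load("roller_normal_01.wav", sr=None)
frame_len = int(0.2 * sr)                         # 200 ms frames, as described

rows = []
for start in range(0, len(y) - frame_len, frame_len):
    frame = y[start:start + frame_len]
    rows.append({
        "rms": float(np.sqrt(np.mean(frame ** 2))),
        "zero_crossing_rate": float(((frame[:-1] * frame[1:]) < 0).mean()),
        "spectral_centroid": float(
            librosa.feature.spectral_centroid(y=frame, sr=sr).mean()),
    })

pd.DataFrame(rows).to_csv("roller_normal_01_features_200ms.csv", index=False)
```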
Kieli labels audio, speech, image, video, and text data, including semantic segmentation, named entity recognition (NER), and POS tagging. Kieli transforms unstructured data into high-quality training data for the refinement of Artificial Intelligence and Machine Learning platforms. For over a decade, hundreds of organizations have relied on Kieli to deliver secure, high-quality training data and model validation for machine learning. At Kieli, we believe that accurate data is the most important factor in production learning models. We are committed to delivering the best quality data for the most enterprising organizations and helping you make strides in Artificial Intelligence. At Kieli, we're passionately dedicated to serving the Arabic, English and French markets. We work in all areas of industry: healthcare, technology and retail.
DEEP-VOICE: Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion
This dataset contains examples of real human speech and DeepFake versions of those speeches generated using Retrieval-based Voice Conversion.
Can machine learning be used to detect when speech is AI-generated?
Introduction
There are growing implications surrounding generative AI in the speech domain that enable voice cloning and real-time voice conversion from one individual to another. This technology poses a significant ethical threat and could lead to breaches of privacy and misrepresentation, thus there is an urgent need for real-time detection of AI-generated speech for DeepFake Voice Conversion.
To address the above emerging issues, we are introducing the DEEP-VOICE dataset. DEEP-VOICE is comprised of real human speech from eight well-known figures and their speech converted to one another using Retrieval-based Voice Conversion.
For each speech, the accompaniment ("background noise") was removed before conversion using RVC. The original accompaniment is then added back to the DeepFake speech:
(Above: Overview of the Retrieval-based Voice Conversion process to generate DeepFake speech with Ryan Gosling's speech converted to Margot Robbie. Conversion is run on the extracted vocals before being layered on the original background ambience.)
Dataset
The dataset is made available in two forms.
First, the raw audio can be found in the "AUDIO" directory. They are arranged within "REAL" and "FAKE" class directories. The audio filenames note which speakers provided the real speech, and which voices they were converted to. For example "Obama-to-Biden" denotes that Barack Obama's speech has been converted to Joe Biden's voice.
Second, the extracted features can be found in the "DATASET-balanced.csv" file. This is the data that was used in the study below. Each feature is extracted from one-second windows of audio, and the classes are balanced through random sampling.
Note: All experimental data is found within the "KAGGLE" directory. The "DEMONSTRATION" directory is used for playing cropped and compressed demos in notebooks due to Kaggle's limitations on file size.
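A minimal sketch of working with the pre-extracted features might look like the following. The label column name is an assumption; inspect the CSV header first.

```python
# Simple real/fake classification sketch on the pre-extracted features.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("DATASET-balanced.csv")
X = df.drop(columns=["LABEL"])                    # assumed label column name
y = df["LABEL"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```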
A successful system could potentially be used as follows:
(Above: Usage of the real-time system. The end user is notified when the machine learning model has processed the speech audio (e.g. a phone or conference call) and predicted that audio chunks contain AI-generated speech.)
Kaggle
The dataset is available on the Kaggle data science platform.
The Kaggle page can be found here: Dataset on Kaggle
Attribution
This dataset was produced from the study "Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion".
The preprint is available on arXiv: Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion
License
This dataset is provided under the MIT License:
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
** Accepted in IEEE Data Descriptions Journal ** This dataset contains 535 recordings of heart and lung sounds captured from a clinical manikin using a digital stethoscope, including both individual and mixed recordings of heart and lung sounds. It includes 50 heart sounds, 50 lung sounds, and 145 mixed sounds. For each mixed sound, the corresponding source heart sound (145 recordings) and source lung sound (145 recordings) were also recorded. It includes recordings from different anatomical chest locations, with both normal and abnormal sounds. Each recording has been filtered to highlight specific sound types, making it valuable for artificial intelligence (AI) research and applications in automated cardiopulmonary disease detection, sound classification, and deep learning algorithms for audio signal processing. If you use this dataset in your research, please cite the following paper:
Y. Torabi, S. Shirani and J. P. Reilly, "Descriptor: Heart and Lung Sounds Dataset Recorded from a Clinical Manikin using Digital Stethoscope (HLS-CMDS)," in IEEE Data Descriptions, https://doi.org/10.1109/IEEEDATA.2025.3566012 .
Data Type: Audio files (.wav), Comma Separated Values (.CSV)
Each category is accompanied by a corresponding CSV file (HS.csv, LS.csv, or Mix.csv) that provides metadata for the respective audio files, including the file name, gender, heart or lung sound type, and the anatomical location where the sound was recorded.
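For instance, the metadata could be queried as in the sketch below. The column names are assumed from the description above and may differ from the actual CSV headers.

```python
# Hypothetical metadata query against HS.csv (assumed column names).
import pandas as pd

hs = pd.read_csv("HS.csv")                        # heart-sound metadata
# List all recordings of a given sound type at a given auscultation landmark.
murmurs = hs[(hs["Sound Type"] == "Mid Systolic Murmur") &
             (hs["Location"] == "Apex")]
print(murmurs["File Name"].tolist())
```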
Sound Types: Normal Heart, Late Diastolic Murmur, Mid Systolic Murmur, Late Systolic Murmur, Atrial Fibrillation, Fourth Heart Sound, Early Systolic Murmur, Third Heart Sound, Tachycardia, Atrioventricular Block, Normal Lung, Wheezing, Fine Crackles, Rhonchi, Pleural Rub, and Coarse Crackles.
Auscultation Landmarks: Right Upper Sternal Border, Left Upper Sternal Border, Lower Left Sternal Border, Right Costal Margin, Left Costal Margin, Apex, Right Upper Anterior, Left Upper Anterior, Right Mid Anterior, Left Mid Anterior, Right Lower Anterior, and Left Lower Anterior.
Applications: AI-based cardiopulmonary disease detection, unsupervised sound separation techniques, and deep learning for audio signal processing.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We've provided a readme.pdf to explain how to use the dataset. Here, we reiterate some of that information to assist others in utilizing the dataset. Please be aware that the files and the dataset are large (approx. 200 GB). It is advised to make sure there is ample storage space for downloading and unzipping. Please download one file at a time.

Dataset: Our data collection method involved cameras, subjects, environments, and guidelines for data simulation to elucidate the specifics of our process. Notably, our dataset, comprising 10,948 clips, stands out as the largest when compared to others that focus on falls recorded through egocentric cameras.

Equipment: Data was amassed using two wearable camera types: the OnReal G1 and CAMMHD Bodycams. The OnReal G1 is a compact mini action camera, with dimensions of 420 x 420 x 200 mm, and can capture videos in resolutions as high as 1080P at 30 fps. Conversely, the CAMMHD Bodycam, a larger camera measuring 800 x 500 x 300 mm, is outfitted with infrared sensors suitable for night vision. These cameras were strategically affixed to the human body at places like the waist and neck, allowing the collection of extensive visual, motion, and audio information across varied environments. The standard setting for data capture was the 1080p video mode at 30 frames per second. It's worth noting that the OnReal G1 frames consist of distinct R, G, B channels, whereas CAMMHD Bodycam frames feature three identical grayscale channels. This dataset, therefore, is a pivotal resource for this thesis, facilitating a thorough analysis of different events and activities.

Subject: For this study, we had 14 volunteer participants: 12 males and 2 females. This included 12 young, healthy individuals and 2 elderly subjects. All participants gave informed consent, understanding their data might be utilized for research and potentially be publicized. Most subjects (11 out of 14) finished the data collection encompassing four types of falls and nine types of non-falls, both indoor and outdoor. However, three participants couldn't complete the entire data collection due to personal reasons. This study yields significant insights into falls and non-fall behaviors, underscoring the dedication of the majority of our participants.

Environment: Our aim was a comprehensive study of both indoor and outdoor environments. We captured data across 14 different outdoor settings and 15 unique indoor spaces. To introduce variability, participants were prompted to change their positions or directions after each activity. Such an approach ensures a diversified dataset, letting us derive more reliable conclusions and insights.

Data Collection: Our data collection approach encompasses two main perspectives: visual and auditory. For visual data, we adhered to guidelines from existing literature; typical falls and related activities have a duration of 1-3 seconds. We proposed an exhaustive set of trials that cover 20 types of falls, each varying in direction and object interaction. Contrarily, specific guidelines for audio data are scarce, as past research largely centered on visual cues. Our audio dataset comprises three categories: subject audio, subject-object audio, and environment audio. To provide participants a realistic feel of falls, we showed them online videos of real-world fall incidents. These videos accurately render the auditory and visual elements of these events. Upon manual inspection of all clips, we discerned prevalent audio patterns.
For falls, subject audio includes elements like yelling and moaning; subject-object audio encapsulates sounds of impacts, and environmental audio captures background noises like traffic or television. Importantly, not all clips contained every sound type. Non-fall activities were bifurcated into three groups based on their audio intensity. Our findings shed light on the audio patterns across activities, potentially enhancing subsequent research in this domain.
L3DAS22: MACHINE LEARNING FOR 3D AUDIO SIGNAL PROCESSING
This dataset supports the L3DAS22 IEEE ICASSP Grand Challenge. The challenge is supported by a Python API that facilitates the dataset download and preprocessing, the training and evaluation of the baseline models, and the submission of results.
Scope of the Challenge
The L3DAS22 Challenge aims at encouraging and fostering research on machine learning for 3D audio signal processing. 3D audio has gained increasing interest in the machine learning community in recent years. The range of applications is incredibly wide, extending from virtual and real conferencing to autonomous driving, surveillance, and many more. In these contexts, a fundamental procedure is to properly identify the nature of the events present in a soundscape, their spatial positions, and eventually remove unwanted noises that can interfere with the useful signal. To this end, the L3DAS22 Challenge presents two tasks: 3D Speech Enhancement and 3D Sound Event Localization and Detection, both relying on first-order Ambisonics recordings in reverberant office environments. Each task involves 2 separate tracks: 1-mic and 2-mic recordings, respectively containing sounds acquired by a single first-order Ambisonics microphone and by an array of two such microphones. The use of two Ambisonics microphones represents one of the main novelties of the L3DAS22 Challenge. We expect higher accuracy/reconstruction quality when taking advantage of the dual spatial perspective of the two microphones. Moreover, we are very interested in identifying other possible advantages of this configuration over standard Ambisonics formats. Interactive demos of our baseline models are available on Replicate. The top 5 ranked teams can submit a regular paper according to the ICASSP guidelines. Prizes will be awarded to the challenge winners thanks to the support of Kuaishou Technology.
Tasks
The tasks we propose are:
* 3D Speech Enhancement: the objective of this task is the enhancement of speech signals immersed in the spatial sound field of a reverberant office environment. Here the models are expected to extract the monophonic voice signal from the 3D mixture containing various background noises. The evaluation metric for this task is a combination of short-time objective intelligibility (STOI) and word error rate (WER); a sketch of this combination follows below.
* 3D Sound Event Localization and Detection: the aim of this task is to detect the temporal activities of a known set of sound event classes and, in particular, to further locate them in space. Here the models must predict a list of the active sound events and their respective locations at regular intervals of 100 milliseconds. Performance on this task is evaluated according to the location-sensitive detection error, which joins the localization and detection error metrics.
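As a rough illustration of the Task 1 metric, the sketch below combines STOI (via the pystoi package) with WER (via jiwer). Averaging STOI with (1 - WER) is an assumption here; the official combination should be taken from the challenge API. File paths and transcripts are placeholders.

```python
# Illustrative combination of intelligibility (STOI) and WER for Task 1.
import soundfile as sf
from pystoi import stoi
from jiwer import wer

clean, fs = sf.read("clean_reference.wav")        # placeholder paths; signals
enhanced, _ = sf.read("model_output.wav")         # must be aligned, same length

stoi_score = stoi(clean, enhanced, fs, extended=False)
wer_score = wer("reference transcript",           # placeholder transcripts
                "asr transcript of enhanced audio")
combined = (stoi_score + (1.0 - wer_score)) / 2.0  # assumed combination rule
print(f"STOI={stoi_score:.3f}  WER={wer_score:.3f}  combined={combined:.3f}")
```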
Dataset Info
The L3DAS22 datasets contain multiple-source and multiple-perspective B-format Ambisonics audio recordings. We sampled the acoustic field of a large office room, placing two first-order Ambisonics microphones in the center of the room and moving a speaker reproducing the analytic signal across 252 fixed spatial positions. Relying on the collected Ambisonics impulse responses (IRs), we augmented existing clean monophonic datasets to obtain synthetic tridimensional sound sources by convolving the original sounds with our IRs. We extracted speech signals from the Librispeech dataset and office-like background noises from the FSD50K dataset. We aimed at creating plausible and varied 3D scenarios to reflect possible real-life situations in which sound and disparate types of background noise coexist in the same 3D reverberant environment. We provide normalized raw waveforms as predictor data; the target data varies according to the task.
The dataset is divided into two main sections, each dedicated to one of the challenge tasks.
The first section is optimized for 3D Speech Enhancement and contains more than 60,000 virtual 3D audio environments with durations of up to 12 seconds. In each sample, a spoken voice is always present alongside other office-like background noises. As target data for this section we provide the clean monophonic voice signals. For each subset we also provide a CSV file in which we annotated, for each data point, the coordinates and spatial distance of the IR convolved with the target voice signal. This may be useful to estimate the delay caused by the virtual time-of-flight of the target voice signal and to perform a sample-level alignment of the input and ground-truth signals (see the sketch below).
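A short sketch of that alignment idea follows. The metadata file name and column name are placeholders, and the speed of sound and sample rate are assumed values.

```python
# Estimate the time-of-flight delay from the annotated source distance.
import pandas as pd

SPEED_OF_SOUND = 343.0                            # m/s, assumed room temperature
SAMPLE_RATE = 16000                               # assumed sample rate

meta = pd.read_csv("task1_train100_metadata.csv")  # placeholder file name
distance_m = meta.loc[0, "distance"]               # assumed column name
delay_samples = int(round(distance_m / SPEED_OF_SOUND * SAMPLE_RATE))
print(f"shift the ground truth by ~{delay_samples} samples to align with the mixture")
```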
The other section is dedicated to the 3D Sound Event Localization and Detection task and contains 900 30-second-long audio files. Each data point contains a simulated 3D office audio environment in which up to 3 simultaneous acoustic events may be active at the same time. In this section, the samples are not forced to contain a spoken voice. As target data for this section we provide a list of the onset and offset time stamps, the class, and the spatial coordinates of each individual sound event present in the data points.
We split both dataset sections into a training set and a development set, paying attention to create similar distributions. The training set of the SE section is divided into two partitions, train360 and train100, and contains speech samples extracted from the corresponding partitions of Librispeech (only samples up to 12 seconds). The train360 partition is split into 2 zip files for more convenient download. All sets of the SELD section are divided into OV1, OV2, and OV3. These partitions refer to the maximum number of possible overlapping sounds, which is 1, 2, or 3, respectively.
L3DAS22 Challenge Supporting API
The supporting GitHub API is aimed at downloading the dataset, pre-processing the sound files and the metadata, training and evaluating the baseline models, and validating the final results. We provide easy-to-use instructions to reproduce the results included in our paper. Moreover, we extensively commented our code for easy customization. For further information, please refer to the challenge website and to the challenge documentation.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Our project, “Indonesian Media Audio Database,” is designed to establish a rich and diverse dataset tailored for training advanced machine learning models in language processing, speech recognition, and cultural analysis.
Environment : quiet indoor environment, without echo;
Recording content : no preset linguistic data; dozens of topics are specified, and the speakers hold dialogues on those topics while the recording is performed;
Demographics : Speakers are evenly distributed across all age groups, covering children, teenagers, middle-aged, elderly, etc.
Annotation : annotation of the transcription text, speaker identification, gender, and noise symbols;
Device : Telephony recording system;
Language : 100+ Languages;
Application scenarios : speech recognition; voiceprint recognition;
Accuracy rate : the word accuracy rate is not less than 98%
The Snippets database has sound/audio recordings across all kinds of venues (restaurants, bars, arenas, churches, movie theaters, retail stores, factories, parks, libraries, gyms, hotels, offices, and many more) and variance in noise levels (Quiet, Moderate, Loud, Very Loud), noise types, and acoustic environments, with valuable metadata.
This is valuable for any audio-based software product/company to run/test its algorithm against various acoustic environments including:
Hearing aid companies wanting to test their software's ability to identify or separate certain sounds and background noise and mitigate them
Audio or video conferencing platforms that want to be able to identify a user's location (i.e., a user joins a call from a coffee shop and the platform can identify and mitigate such sounds for better audio)
Other audio-based use cases
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description: The "Indian Languages Audio Dataset" is a collection of audio samples featuring a diverse set of 10 Indian languages. Each audio sample in this dataset is precisely 5 seconds in duration and is provided in MP3 format. It is important to note that this dataset is a subset of a larger collection known as the "Audio Dataset with 10 Indian Languages." The source of these audio samples is regional videos freely available on YouTube, and none of the audio samples or source videos are owned by the dataset creator.
Languages Included: 1. Bengali 2. Gujarati 3. Hindi 4. Kannada 5. Malayalam 6. Marathi 7. Punjabi 8. Tamil 9. Telugu 10. Urdu
This dataset offers a valuable resource for researchers, linguists, and machine learning enthusiasts who are interested in studying and analyzing the phonetics, accents, and linguistic characteristics of the Indian subcontinent. It is a representative sample of the linguistic diversity present in India, encompassing a wide array of languages and dialects. Researchers and developers are encouraged to explore this dataset to build applications or conduct research related to speech recognition, language identification, and other audio processing tasks.
Additionally, the dataset is not limited to these 10 languages and has the potential for expansion. Given the dynamic nature of language use in India, this dataset can serve as a foundation for future data collection efforts involving additional Indian languages and dialects.
Access to the "Indian Multilingual Audio Dataset - 10 Languages" is provided with the understanding that users will comply with applicable copyright and licensing restrictions. If users plan to extend this dataset or use it for commercial purposes, it is essential to seek proper permissions and adhere to relevant copyright and licensing regulations.
By utilizing this dataset responsibly and ethically, users can contribute to the advancement of language technology and research, ultimately benefiting language preservation, speech recognition, and cross-cultural communication.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
BanglaSER is a Bangla language-based speech emotion recognition dataset. It consists of speech-audio data from 34 participating speakers from diverse age groups between 19 and 47 years, with a balanced 17 male and 17 female nonprofessional participating actors. The dataset contains 1467 Bangla speech-audio recordings of five rudimentary human emotional states: angry, happy, neutral, sad, and surprise. Three trials were conducted for each emotional state. Hence, the total is 3 statements × 3 repetitions × 4 emotional states (angry, happy, sad, and surprise) × 34 speakers = 1224 recordings, plus 3 statements × 3 repetitions × 1 emotional state (neutral) × 27 speakers = 243 recordings, giving 1467 recordings in total. The BanglaSER dataset was collected by recording with smartphones and laptops. It has a balanced number of recordings in each category with evenly distributed male and female actors, preserves a real-life environment, and can serve as an essential training dataset for speech emotion recognition models in terms of generalization. BanglaSER is compatible with various deep learning architectures such as CNN, LSTM, and BiLSTM.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This dataset is a sound dataset for malfunctioning industrial machine investigation and inspection with domain shifts due to changes in operational and environmental conditions (MIMII DUE). The dataset consists of normal and abnormal operating sounds of five different types of industrial machines, i.e., fans, gearboxes, pumps, slide rails, and valves. The data for each machine type includes six subsets called "sections", and each section roughly corresponds to a single product. Each section consists of data from two domains, called the source domain and the target domain, with different conditions such as operating speed and environmental noise. This dataset is a subset of the dataset for DCASE 2021 Challenge Task 2, so it is entirely identical to data included in the development dataset and the additional training dataset. For more information, please see this paper and the pages of the development dataset and the task description for DCASE 2021 Challenge Task 2.
Baseline system
Two simple baseline systems are available in the GitHub repositories [URL] and [URL]. The baseline systems provide a simple entry-level approach that gives reasonable performance on the dataset. They are good starting points, especially for entry-level researchers who want to get familiar with the anomalous-sound-detection task.
Conditions of use
This dataset was made by Hitachi, Ltd. and is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
Publication
If you use this dataset, please cite the following paper:
Ryo Tanabe, Harsh Purohit, Kota Dohi, Takashi Endo, Yuki Nikaido, Toshiki Nakamura, and Yohei Kawaguchi, "MIMII DUE: Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection with Domain Shifts due to Changes in Operational and Environmental Conditions," arXiv preprint arXiv:2105.02702, 2021. [URL]
Feedback
If there is any problem, please contact us:
Ryo Tanabe, ryo.tanabe.rw.xk@hitachi.com
Yohei Kawaguchi, yohei.kawaguchi.xk@hitachi.com
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
GTS Provides Clinical Audio Transcription Dataset that Powers AI and Machine Learning for Better Understanding