Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Noise is an unwanted component of audio recordings, yet it plays an important role in machine learning on audio data.
The dataset can be used for noise filtering, noise generation, and noise recognition in audio classification, audio recognition, audio generation, and other audio-related machine learning tasks (see the mixing sketch below). I, Min Si Thu, have used this dataset in open-source projects.
The dataset contains ten types of noise.
Location - Myanmar, Mandalay, Amarapura Township
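As a minimal illustration of the noise-generation and augmentation use cases, the sketch below mixes a noise clip from the dataset into a clean recording at a chosen signal-to-noise ratio. The file names are placeholders, and the clips are assumed to share a sample rate.

```python
# Hypothetical sketch: mix a noise clip from this dataset into clean audio
# at a target SNR (file names are placeholders).
import numpy as np
import soundfile as sf

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean/noise power ratio equals `snr_db`, then add."""
    # Loop or trim the noise to the clean signal's length.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

clean, sr = sf.read("clean_speech.wav")   # placeholder path
noise, nsr = sf.read("market_noise.wav")  # one of the ten noise types
assert sr == nsr, "resample first if the rates differ"
if clean.ndim > 1: clean = clean.mean(axis=1)  # work in mono for simplicity
if noise.ndim > 1: noise = noise.mean(axis=1)
sf.write("noisy_speech.wav", mix_at_snr(clean, noise, snr_db=10), sr)
```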
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
'In-the-Wild' Dataset
We present a dataset of audio deepfakes (and corresponding benign audio) for a set of politicians and other public figures, collected from publicly available sources such as social networks and video streaming platforms. For n = 58 celebrities and politicians, we collect both bona-fide and spoofed audio. In total, we collect 20.8 hours of bona-fide and 17.2 hours of spoofed audio. On average, there are 23 minutes of bona-fide and 18 minutes of spoofed audio per speaker.
The dataset is intended to be used for evaluating deepfake detection and voice anti-spoofing machine-learning models. It is especially useful to judge a model's capability to generalize to realistic, in-the-wild audio samples. Find more information in our paper, and download the dataset here.
The most interesting deepfake detection models we used in our experiments are open-source on GitHub:
RawNet 2, RawGAT-ST, and PC-Darts.
This dataset and the associated documentation are licensed under the Apache License, Version 2.0.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset is a sound dataset for malfunctioning industrial machine investigation and inspection (MIMII dataset). It contains the sounds generated from four types of industrial machines, i.e., valves, pumps, fans, and slide rails. Each type of machine includes seven individual product models*1, and the data for each model contains normal sounds (from 5,000 to 10,000 seconds) and anomalous sounds (about 1,000 seconds). To resemble a real-life scenario, various anomalous sounds were recorded (e.g., contamination, leakage, rotating unbalance, and rail damage). Also, background noise recorded in multiple real factories was mixed with the machine sounds. The sounds were recorded with an eight-channel microphone array at a 16 kHz sampling rate and 16 bits per sample. The MIMII dataset serves as a benchmark for sound-based machine fault diagnosis. Users can test performance on specific tasks, e.g., unsupervised anomaly detection, transfer learning, and noise robustness. The details of the dataset are described in [1][2].
This dataset is made available by Hitachi, Ltd. under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.
A baseline sample code for anomaly detection is available on GitHub: https://github.com/MIMII-hitachi/mimii_baseline/
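As a loose illustration of the unsupervised anomaly-detection use case (this is not the official baseline linked above), the following sketch fits an IsolationForest on log-mel frames of normal machine sounds and scores a test clip by its mean frame-level anomaly score. File names are placeholders.

```python
# Minimal unsupervised anomaly-detection sketch, not the MIMII baseline.
import numpy as np
import librosa
from sklearn.ensemble import IsolationForest

def logmel_frames(path, sr=16000, n_mels=64):
    y, _ = librosa.load(path, sr=sr, mono=True)   # average the 8 channels
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                         hop_length=512, n_mels=n_mels)
    return librosa.power_to_db(mel).T             # (frames, n_mels)

# Fit on frames from normal recordings (placeholder file names).
train = np.vstack([logmel_frames(p) for p in ["normal_000.wav", "normal_001.wav"]])
model = IsolationForest(n_estimators=100, random_state=0).fit(train)

test = logmel_frames("unknown_clip.wav")
anomaly_score = -model.score_samples(test).mean()  # higher => more anomalous
print(f"anomaly score: {anomaly_score:.3f}")
```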
*1: This version "public 1.0" contains four models (model IDs 00, 02, 04, and 06). The remaining three models will be released in a future edition.
[1] Harsh Purohit, Ryo Tanabe, Kenji Ichige, Takashi Endo, Yuki Nikaido, Kaori Suefusa, and Yohei Kawaguchi, “MIMII Dataset: Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection,” arXiv preprint arXiv:1909.09347, 2019.
[2] Harsh Purohit, Ryo Tanabe, Kenji Ichige, Takashi Endo, Yuki Nikaido, Kaori Suefusa, and Yohei Kawaguchi, “MIMII Dataset: Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection,” in Proc. 4th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2019.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains Kiswahili text and audio files: 7,108 text files and their corresponding audio files. The Kiswahili dataset was created from open-source, non-copyrighted material: a Kiswahili audio Bible. The authors permit use for non-profit, educational, and public-benefit purposes. The downloaded audio files were longer than 12.5 s, so they were programmatically split into short clips based on silence and then recombined to random lengths such that each resulting audio file lies between 1 and 12.5 s. This was done using Python 3 (a sketch of this preprocessing follows below). The audio files were saved as single-channel, 16-bit PCM WAVE files with a sampling rate of 22.05 kHz. The dataset contains approximately 106,000 Kiswahili words, transcribed at a mean of 14.96 words per text file and saved in CSV format. Each text file is divided into three parts: a unique ID, the transcribed words, and the normalized words. The unique ID is a number assigned to each text file. The transcribed words are the text spoken by the reader. Normalized text expands abbreviations and numbers into full words. Each audio clip is assigned the same unique ID as its text file.
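The splitting and recombination step can be reproduced roughly as follows. This is a minimal sketch assuming pydub is used; the silence thresholds and file names are assumptions, not the authors' exact settings.

```python
# Rough re-creation of the described preprocessing (assumed parameters).
import random
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_file("kiswahili_audio_bible.mp3")   # placeholder
chunks = split_on_silence(audio, min_silence_len=300, silence_thresh=-40)

clips, current = [], AudioSegment.empty()
target_ms = random.uniform(1000, 12500)          # random length in 1-12.5 s
for chunk in chunks:
    current += chunk
    if len(current) >= target_ms:
        clips.append(current[:12500])            # cap at 12.5 s
        current = AudioSegment.empty()
        target_ms = random.uniform(1000, 12500)

for i, clip in enumerate(clips):
    # single channel, 16-bit PCM, 22.05 kHz, as described above
    clip.set_frame_rate(22050).set_channels(1).set_sample_width(2) \
        .export(f"clip_{i:05d}.wav", format="wav")
```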
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Improve your machine learning models with high-quality physician/doctor dictation speech datasets. Deep domain expertise. Fast & Cost-effective. Trusted by industry leaders.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
EDM-HSE Dataset
EDM-HSE is an open audio dataset containing a collection of code-generated drum recordings in the style of modern electronic house music. It includes 8,000 audio loops recorded in uncompressed stereo WAV format, created using custom audio samples and a MIDI drum dataset. The dataset also comes with paired JSON files containing MIDI note numbers (pitch) and tempo data, intended for supervised training of generative AI audio models.
Overview
The EDM-HSE Dataset was developed using an algorithmic framework to generate probable drum notations commonly played by EDM music producers. For supervised training with labeled data, a variational mixing technique was applied to the rendered audio files. This method systematically includes or excludes drum notes, assisting the model in recognizing patterns and relationships between drum instruments, thereby enhancing its generalization capabilities.
The primary purpose of this dataset is to provide accessible content for machine learning applications in music and audio. Potential use cases include generative music, feature extraction, tempo detection, audio classification, rhythm analysis, drum synthesis, music information retrieval (MIR), sound design and signal processing.
Specifications
8,000 audio loops (approximately 17 hours)
16-bit WAV format
Tempo range: 120–130 BPM
Paired label data (WAV + JSON)
Variational drum patterns
Subgenre styles (Big room, electro, minimal, classic)
A JSON file is provided for referencing and converting MIDI note numbers to text labels. You can update the text labels to suit your preferences.
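For example, the mapping file might be used as sketched below. The file names and JSON keys are hypothetical; check the repository's actual schema before relying on them.

```python
# Hypothetical use of the provided JSON mapping from MIDI note numbers to labels.
import json

with open("midi_note_labels.json") as f:          # placeholder file name
    note_to_label = json.load(f)                  # e.g. {"36": "kick", "38": "snare"}

with open("loop_0001.json") as f:                 # paired label file for a loop
    loop = json.load(f)

# Convert each MIDI note number in the loop to its text label (assumed keys).
labels = [note_to_label.get(str(note), "unknown") for note in loop.get("notes", [])]
print(loop.get("tempo"), labels)
```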
License
This dataset was compiled by WaivOps, a crowdsourced music project managed by the sound label company Patchbanks. All recordings have been compiled by verified sources for copyright clearance.
The EDM-HSE dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).
Additional Info
Please note that this dataset has not been fully reviewed and may contain minor notational errors or audio defects.
For audio examples or more information about this dataset, please refer to the GitHub repository.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
WaivOps HH-LFBB Dataset
HH-LFBB is an open audio dataset composed of a series of drum recordings in the style of lofi hip-hop music. The dataset contains 3332 audio loops recorded in uncompressed stereo WAV format, produced with custom drum samples and MIDI-programmed rhythms at various tempo rates.
Dataset
The primary objective of this dataset is to provide accessible content for machine learning applications in music and audio research. Some potential use cases for this dataset include tempo detection and classification, drum rhythm analysis, audio-to-MIDI conversion, source separation, automated mixing, music information retrieval, AI music generation, sound design, and signal processing.
Specifications
License
This dataset was compiled by WaivOps, a crowdsourced music project managed by the sound label company Patchbanks. All recordings have been compiled by verified sources for copyright clearance.
The HH-LFBB dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).
Additional Info
For audio examples or more information about this dataset, please refer to the GitHub repository.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset for this project comprises audio recordings of the operational states of belt conveyor rollers in a mining environment, covering three conditions: normal roller operation, roller shell cracking, and roller breakage. Combined with machine learning models, this dataset can be used for real-time diagnosis of roller operational states. The database contains two main folders: dataset and code.
The dataset folder includes three subfolders:
wav: Contains 19 WAV files recorded from 19 microphones, capturing the audio of belt conveyor rollers at a mining site. Of these, 17 files represent normal roller operation, 1 file captures the audio of a roller with shell cracking, and 1 file captures the audio of a roller with complete breakage.
csv_dataset: Contains 10 subfolders, each holding audio feature datasets extracted from the WAV files with frame lengths ranging from 100 ms to 1000 ms. Each subfolder contains 19 CSV files, corresponding to the 19 audio recordings. The feature datasets within different frame-length subfolders should not be used interchangeably.
test_dataset: Contains 17 audio feature datasets with a 200 ms frame length. These datasets combine features from the 17 normal-operation recordings with features from the roller shell cracking and roller breakage recordings. The combined datasets are shuffled 100 times to ensure an even distribution of features from each operational state. This dataset was used in the paper to validate the accuracy and usability of the audio feature datasets for real-time monitoring of roller states.
The code folder contains two sets of code:
Matlab Code: extracts 25 audio features from the WAV files and generates the 17 audio feature datasets using a 200 ms frame length (a rough Python analogue is sketched below).
Python Code: validates the accuracy and usability of the audio feature datasets for real-time monitoring of belt conveyor roller operational states.
Together, the dataset and code support real-time diagnosis of belt conveyor roller conditions and provide a foundation for validating the effectiveness of audio features in fault detection.
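The sketch below is a rough Python analogue of the frame-based feature extraction. The original Matlab code computes 25 features per 200 ms frame; only a few illustrative features are shown here, and the file path is a placeholder.

```python
# Frame-based feature extraction sketch (illustrative subset of features only).
import numpy as np
import librosa
import pandas as pd

y, sr = librosa.load("roller_normal_01.wav", sr=None)
frame_len = int(0.2 * sr)                         # 200 ms frames, as described

rows = []
for start in range(0, len(y) - frame_len, frame_len):
    frame = y[start:start + frame_len]
    rows.append({
        "rms": float(np.sqrt(np.mean(frame ** 2))),
        "zero_crossing_rate": float(((frame[:-1] * frame[1:]) < 0).mean()),
        "spectral_centroid": float(
            librosa.feature.spectral_centroid(y=frame, sr=sr).mean()),
    })

pd.DataFrame(rows).to_csv("roller_normal_01_features_200ms.csv", index=False)
```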
Kieli labels audio, speech, image, video, and text data, including semantic segmentation, named entity recognition (NER), and POS tagging. Kieli transforms unstructured data into high-quality training data for the refinement of Artificial Intelligence and Machine Learning platforms. For over a decade, hundreds of organizations have relied on Kieli to deliver secure, high-quality training data and model validation for machine learning. At Kieli, we believe that accurate data is the most important factor in production learning models. We are committed to delivering the best quality data for the most enterprising organizations and helping you make strides in Artificial Intelligence. At Kieli, we're passionately dedicated to serving the Arabic, English and French markets. We work in all areas of industry: healthcare, technology and retail.
DEEP-VOICE: Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion
This dataset contains examples of real human speech and DeepFake versions of those speeches generated using Retrieval-based Voice Conversion.
Can machine learning be used to detect when speech is AI-generated?
Introduction
There are growing implications surrounding generative AI in the speech domain that enable voice cloning and real-time voice conversion from one individual to another. This technology poses a significant ethical threat and could lead to breaches of privacy and misrepresentation, thus there is an urgent need for real-time detection of AI-generated speech for DeepFake Voice Conversion.
To address the above emerging issues, we are introducing the DEEP-VOICE dataset. DEEP-VOICE is comprised of real human speech from eight well-known figures and their speech converted to one another using Retrieval-based Voice Conversion.
For each speech, the accompaniment ("background noise") was removed before conversion using RVC. The original accompaniment is then added back to the DeepFake speech:
(Above: Overview of the Retrieval-based Voice Conversion process to generate DeepFake speech with Ryan Gosling's speech converted to Margot Robbie. Conversion is run on the extracted vocals before being layered on the original background ambience.)
Dataset
The dataset is made available in two forms.
First, the raw audio can be found in the "AUDIO" directory. They are arranged within "REAL" and "FAKE" class directories. The audio filenames note which speakers provided the real speech, and which voices they were converted to. For example "Obama-to-Biden" denotes that Barack Obama's speech has been converted to Joe Biden's voice.
Second, the extracted features can be found in the "DATASET-balanced.csv" file. This is the data that was used in the study below. Each feature is extracted from one-second windows of audio, and the classes are balanced through random sampling.
Note: All experimental data is found within the "KAGGLE" directory. The "DEMONSTRATION" directory is used for playing cropped and compressed demos in notebooks due to Kaggle's limitations on file size.
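A minimal sketch of working with the pre-extracted features might look like the following. The label column name is an assumption; inspect the CSV header first.

```python
# Simple real/fake classification sketch on the pre-extracted features.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("DATASET-balanced.csv")
X = df.drop(columns=["LABEL"])                    # assumed label column name
y = df["LABEL"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```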
A successful system could potentially be used as follows:
(Above: Usage of the real-time system. The end user is notified when the machine learning model has processed the speech audio (e.g. a phone or conference call) and predicted that audio chunks contain AI-generated speech.)
Kaggle
The dataset is available on the Kaggle data science platform.
The Kaggle page can be found here: Dataset on Kaggle
Attribution
This dataset was produced from the study "Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion".
The preprint is available on arXiv: Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion
License
This dataset is provided under the MIT License:
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
** Accepted in IEEE Data Descriptions Journal ** This dataset contains 535 recordings of heart and lung sounds captured from a clinical manikin using a digital stethoscope, including both individual and mixed recordings of heart and lung sounds. It includes 50 heart sounds, 50 lung sounds, and 145 mixed sounds. For each mixed sound, the corresponding source heart sound (145 recordings) and source lung sound (145 recordings) were also recorded. It includes recordings from different anatomical chest locations, with both normal and abnormal sounds. Each recording has been filtered to highlight specific sound types, making it valuable for artificial intelligence (AI) research and applications in automated cardiopulmonary disease detection, sound classification, and deep learning algorithms for audio signal processing. If you use this dataset in your research, please cite the following paper:
Y. Torabi, S. Shirani and J. P. Reilly, "Descriptor: Heart and Lung Sounds Dataset Recorded from a Clinical Manikin using Digital Stethoscope (HLS-CMDS)," in IEEE Data Descriptions, https://doi.org/10.1109/IEEEDATA.2025.3566012 .
Data Type: Audio files (.wav), Comma Separated Values (.CSV)
Each category is accompanied by a corresponding CSV file (HS.csv, LS.csv, or Mix.csv) that provides metadata for the respective audio files, including the file name, gender, heart or lung sound type, and the anatomical location where the sound was recorded.
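For instance, the metadata could be queried as in the sketch below. The column names are assumed from the description above and may differ from the actual CSV headers.

```python
# Hypothetical metadata query against HS.csv (assumed column names).
import pandas as pd

hs = pd.read_csv("HS.csv")                        # heart-sound metadata
# List all recordings of a given sound type at a given auscultation landmark.
murmurs = hs[(hs["Sound Type"] == "Mid Systolic Murmur") &
             (hs["Location"] == "Apex")]
print(murmurs["File Name"].tolist())
```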
Sound Types: Normal Heart, Late Diastolic Murmur, Mid Systolic Murmur, Late Systolic Murmur, Atrial Fibrillation, Fourth Heart Sound, Early Systolic Murmur, Third Heart Sound, Tachycardia, Atrioventricular Block, Normal Lung, Wheezing, Fine Crackles, Rhonchi, Pleural Rub, and Coarse Crackles.
Auscultation Landmarks: Right Upper Sternal Border, Left Upper Sternal Border, Lower Left Sternal Border, Right Costal Margin, Left Costal Margin, Apex, Right Upper Anterior, Left Upper Anterior, Right Mid Anterior, Left Mid Anterior, Right Lower Anterior, and Left Lower Anterior.
Applications: AI-based cardiopulmonary disease detection, unsupervised sound separation techniques, and deep learning for audio signal processing.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We've provided a readme.pdf to explain how to use the dataset. Here, we reiterate some of that information to assist others in utilizing the dataset. Please be aware that the files and the dataset are large (approx. 200 GB). It is advised to make sure there is ample storage space for downloading and unzipping. Please download one file at a time.

Dataset: Our data collection method involved cameras, subjects, environments, and guidelines for data simulation to elucidate the specifics of our process. Notably, our dataset, comprising 10,948 clips, stands out as the largest when compared to others that focus on falls recorded through egocentric cameras.

Equipment: Data was amassed using two wearable camera types: the OnReal G1 and CAMMHD Bodycams. The OnReal G1 is a compact mini action camera, with dimensions of 420 x 420 x 200 mm, and can capture videos in resolutions as high as 1080P at 30 fps. Conversely, the CAMMHD Bodycam, a larger camera measuring 800 x 500 x 300 mm, is outfitted with infrared sensors suitable for night vision. These cameras were strategically affixed to the human body at places like the waist and neck, allowing the collection of extensive visual, motion, and audio information across varied environments. The standard setting for data capture was the 1080p video mode at 30 frames per second. It's worth noting that the OnReal G1 frames consist of distinct R, G, B channels, whereas CAMMHD Bodycam frames feature three identical grayscale channels. This dataset, therefore, is a pivotal resource for this thesis, facilitating a thorough analysis of different events and activities.

Subject: For this study, we had 14 volunteer participants: 12 males and 2 females. This included 12 young, healthy individuals and 2 elderly subjects. All participants gave informed consent, understanding their data might be utilized for research and potentially be publicized. Most subjects (11 out of 14) finished the data collection encompassing four types of falls and nine types of non-falls, both indoor and outdoor. However, three participants couldn't complete the entire data collection due to personal reasons. This study yields significant insights into falls and non-fall behaviors, underscoring the dedication of the majority of our participants.

Environment: Our aim was a comprehensive study of both indoor and outdoor environments. We captured data across 14 different outdoor settings and 15 unique indoor spaces. To introduce variability, participants were prompted to change their positions or directions after each activity. Such an approach ensures a diversified dataset, letting us derive more reliable conclusions and insights.

Data Collection: Our data collection approach encompasses two main perspectives: visual and auditory. For visual data, we adhered to guidelines from existing literature; typical falls and related activities have a duration of 1-3 seconds. We proposed an exhaustive set of trials that cover 20 types of falls, each varying in direction and object interaction. Contrarily, specific guidelines for audio data are scarce, as past research largely centered on visual cues. Our audio dataset comprises three categories: subject audio, subject-object audio, and environment audio. To provide participants a realistic feel of falls, we showed them online videos of real-world fall incidents. These videos accurately render the auditory and visual elements of these events. Upon manual inspection of all clips, we discerned prevalent audio patterns.
For falls, subject audio includes elements like yelling and moaning; subject-object audio encapsulates sounds of impacts, and environmental audio captures background noises like traffic or television. Importantly, not all clips contained every sound type. Non-fall activities were bifurcated into three groups based on their audio intensity. Our findings shed light on the audio patterns across activities, potentially enhancing subsequent research in this domain.
L3DAS22: MACHINE LEARNING FOR 3D AUDIO SIGNAL PROCESSING
This dataset supports the L3DAS22 IEEE ICASSP Grand Challenge. The challenge is supported by a Python API that facilitates the dataset download and preprocessing, the training and evaluation of the baseline models, and the submission of results.
Scope of the Challenge
The L3DAS22 Challenge aims at encouraging and fostering research on machine learning for 3D audio signal processing. 3D audio has gained increasing interest in the machine learning community in recent years. The range of applications is incredibly wide, extending from virtual and real conferencing to autonomous driving, surveillance, and many more. In these contexts, a fundamental procedure is to properly identify the nature of the events present in a soundscape, their spatial positions, and eventually remove unwanted noises that can interfere with the useful signal. To this end, the L3DAS22 Challenge presents two tasks: 3D Speech Enhancement and 3D Sound Event Localization and Detection, both relying on first-order Ambisonics recordings in reverberant office environments. Each task involves 2 separate tracks: 1-mic and 2-mic recordings, respectively containing sounds acquired by a single first-order Ambisonics microphone and by an array of two such microphones. The use of two Ambisonics microphones represents one of the main novelties of the L3DAS22 Challenge. We expect higher accuracy/reconstruction quality when taking advantage of the dual spatial perspective of the two microphones. Moreover, we are very interested in identifying other possible advantages of this configuration over standard Ambisonics formats. Interactive demos of our baseline models are available on Replicate. The top 5 ranked teams can submit a regular paper according to the ICASSP guidelines. Prizes will be awarded to the challenge winners thanks to the support of Kuaishou Technology.
Tasks
The tasks we propose are:
* 3D Speech Enhancement: the objective of this task is the enhancement of speech signals immersed in the spatial sound field of a reverberant office environment. Here the models are expected to extract the monophonic voice signal from the 3D mixture containing various background noises. The evaluation metric for this task is a combination of short-time objective intelligibility (STOI) and word error rate (WER); a sketch of this combination follows below.
* 3D Sound Event Localization and Detection: the aim of this task is to detect the temporal activities of a known set of sound event classes and, in particular, to further locate them in space. Here the models must predict a list of the active sound events and their respective locations at regular intervals of 100 milliseconds. Performance on this task is evaluated according to the location-sensitive detection error, which joins the localization and detection error metrics.
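As a rough illustration of the Task 1 metric, the sketch below combines STOI (via the pystoi package) with WER (via jiwer). Averaging STOI with (1 - WER) is an assumption here; the official combination should be taken from the challenge API. File paths and transcripts are placeholders.

```python
# Illustrative combination of intelligibility (STOI) and WER for Task 1.
import soundfile as sf
from pystoi import stoi
from jiwer import wer

clean, fs = sf.read("clean_reference.wav")        # placeholder paths; signals
enhanced, _ = sf.read("model_output.wav")         # must be aligned, same length

stoi_score = stoi(clean, enhanced, fs, extended=False)
wer_score = wer("reference transcript",           # placeholder transcripts
                "asr transcript of enhanced audio")
combined = (stoi_score + (1.0 - wer_score)) / 2.0  # assumed combination rule
print(f"STOI={stoi_score:.3f}  WER={wer_score:.3f}  combined={combined:.3f}")
```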
Dataset Info
The L3DAS22 datasets contain multiple-source and multiple-perspective B-format Ambisonics audio recordings. We sampled the acoustic field of a large office room, placing two first-order Ambisonics microphones in the center of the room and moving a speaker reproducing the analytic signal across 252 fixed spatial positions. Relying on the collected Ambisonics impulse responses (IRs), we augmented existing clean monophonic datasets to obtain synthetic tridimensional sound sources by convolving the original sounds with our IRs. We extracted speech signals from the Librispeech dataset and office-like background noises from the FSD50K dataset. We aimed at creating plausible and varied 3D scenarios to reflect possible real-life situations in which sound and disparate types of background noise coexist in the same 3D reverberant environment. We provide normalized raw waveforms as predictor data; the target data varies according to the task.
The dataset is divided into two main sections, each dedicated to one of the challenge tasks.
The first section is optimized for 3D Speech Enhancement and contains more than 60,000 virtual 3D audio environments with durations of up to 12 seconds. In each sample, a spoken voice is always present alongside other office-like background noises. As target data for this section we provide the clean monophonic voice signals. For each subset we also provide a CSV file in which we annotated, for each data point, the coordinates and spatial distance of the IR convolved with the target voice signal. This may be useful to estimate the delay caused by the virtual time-of-flight of the target voice signal and to perform a sample-level alignment of the input and ground-truth signals (see the sketch below).
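A short sketch of that alignment idea follows. The metadata file name and column name are placeholders, and the speed of sound and sample rate are assumed values.

```python
# Estimate the time-of-flight delay from the annotated source distance.
import pandas as pd

SPEED_OF_SOUND = 343.0                            # m/s, assumed room temperature
SAMPLE_RATE = 16000                               # assumed sample rate

meta = pd.read_csv("task1_train100_metadata.csv")  # placeholder file name
distance_m = meta.loc[0, "distance"]               # assumed column name
delay_samples = int(round(distance_m / SPEED_OF_SOUND * SAMPLE_RATE))
print(f"shift the ground truth by ~{delay_samples} samples to align with the mixture")
```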
The other section is dedicated to the 3D Sound Event Localization and Detection task and contains 900 30-second-long audio files. Each data point contains a simulated 3D office audio environment in which up to 3 simultaneous acoustic events may be active at the same time. In this section, the samples are not forced to contain a spoken voice. As target data for this section we provide a list of the onset and offset time stamps, the class, and the spatial coordinates of each individual sound event present in the data points.
We split both dataset sections into a training set and a development set, paying attention to create similar distributions. The training set of the SE section is divided into two partitions, train360 and train100, and contains speech samples extracted from the corresponding partitions of Librispeech (only samples up to 12 seconds). The train360 partition is split into 2 zip files for more convenient download. All sets of the SELD section are divided into OV1, OV2, and OV3. These partitions refer to the maximum number of possible overlapping sounds, which is 1, 2, or 3, respectively.
L3DAS22 Challenge Supporting API
The supporting GitHub API is aimed at downloading the dataset, pre-processing the sound files and the metadata, training and evaluating the baseline models, and validating the final results. We provide easy-to-use instructions to reproduce the results included in our paper. Moreover, we extensively commented our code for easy customization. For further information, please refer to the challenge website and to the challenge documentation.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Our project, “Indonesian Media Audio Database,” is designed to establish a rich and diverse dataset tailored for training advanced machine learning models in language processing, speech recognition, and cultural analysis.
Environment : quiet indoor environment, without echo;
Recording content : no preset linguistic data; dozens of topics are specified, and the speakers hold dialogues on those topics while the recording is performed;
Demographics : Speakers are evenly distributed across all age groups, covering children, teenagers, middle-aged, elderly, etc.
Annotation : annotation of the transcription text, speaker identification, gender, and noise symbols;
Device : Telephony recording system;
Language : 100+ Languages;
Application scenarios : speech recognition; voiceprint recognition;
Accuracy rate : the word accuracy rate is not less than 98%
The Snippets database has sound/audio recordings across all kinds of venues (restaurants, bars, arenas, churches, movie theaters, retail stores, factories, parks, libraries, gyms, hotels, offices, and many more) and variance in noise levels (Quiet, Moderate, Loud, Very Loud), noise types, and acoustic environments, with valuable metadata.
This is valuable for any audio-based software product/company to run/test its algorithm against various acoustic environments including:
Hearing aid companies wanting to test their software's ability to identify or separate certain sounds and background noise and mitigate them
Audio or video conferencing platforms that want to be able to identify a user's location (i.e., a user joins a call from a coffee shop and the platform can identify and mitigate such sounds for better audio)
Other audio-based use cases
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description: The "Indian Languages Audio Dataset" is a collection of audio samples featuring a diverse set of 10 Indian languages. Each audio sample in this dataset is precisely 5 seconds in duration and is provided in MP3 format. It is important to note that this dataset is a subset of a larger collection known as the "Audio Dataset with 10 Indian Languages." The source of these audio samples is regional videos freely available on YouTube, and none of the audio samples or source videos are owned by the dataset creator.
Languages Included: 1. Bengali 2. Gujarati 3. Hindi 4. Kannada 5. Malayalam 6. Marathi 7. Punjabi 8. Tamil 9. Telugu 10. Urdu
This dataset offers a valuable resource for researchers, linguists, and machine learning enthusiasts who are interested in studying and analyzing the phonetics, accents, and linguistic characteristics of the Indian subcontinent. It is a representative sample of the linguistic diversity present in India, encompassing a wide array of languages and dialects. Researchers and developers are encouraged to explore this dataset to build applications or conduct research related to speech recognition, language identification, and other audio processing tasks.
Additionally, the dataset is not limited to these 10 languages and has the potential for expansion. Given the dynamic nature of language use in India, this dataset can serve as a foundation for future data collection efforts involving additional Indian languages and dialects.
Access to the "Indian Multilingual Audio Dataset - 10 Languages" is provided with the understanding that users will comply with applicable copyright and licensing restrictions. If users plan to extend this dataset or use it for commercial purposes, it is essential to seek proper permissions and adhere to relevant copyright and licensing regulations.
By utilizing this dataset responsibly and ethically, users can contribute to the advancement of language technology and research, ultimately benefiting language preservation, speech recognition, and cross-cultural communication.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
BanglaSER is a Bangla language-based speech emotion recognition dataset. It consists of speech-audio data from 34 participating speakers from diverse age groups between 19 and 47 years, with a balanced 17 male and 17 female nonprofessional participating actors. The dataset contains 1467 Bangla speech-audio recordings of five rudimentary human emotional states: angry, happy, neutral, sad, and surprise. Three trials were conducted for each emotional state. Hence, the total is 3 statements × 3 repetitions × 4 emotional states (angry, happy, sad, and surprise) × 34 speakers = 1224 recordings, plus 3 statements × 3 repetitions × 1 emotional state (neutral) × 27 speakers = 243 recordings, giving 1467 recordings in total. The BanglaSER dataset was collected by recording with smartphones and laptops. It has a balanced number of recordings in each category with evenly distributed male and female actors, preserves a real-life environment, and can serve as an essential training dataset for speech emotion recognition models in terms of generalization. BanglaSER is compatible with various deep learning architectures such as CNN, LSTM, and BiLSTM.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This dataset is a sound dataset for malfunctioning industrial machine investigation and inspection with domain shifts due to changes in operational and environmental conditions (MIMII DUE). The dataset consists of normal and abnormal operating sounds of five different types of industrial machines, i.e., fans, gearboxes, pumps, slide rails, and valves. The data for each machine type includes six subsets called "sections", and each section roughly corresponds to a single product. Each section consists of data from two domains, called the source domain and the target domain, with different conditions such as operating speed and environmental noise. This dataset is a subset of the dataset for DCASE 2021 Challenge Task 2, so it is entirely identical to data included in the development dataset and the additional training dataset. For more information, please see this paper and the pages of the development dataset and the task description for DCASE 2021 Challenge Task 2.
Baseline system
Two simple baseline systems are available in the GitHub repositories [URL] and [URL]. The baseline systems provide a simple entry-level approach that gives reasonable performance on the dataset. They are good starting points, especially for entry-level researchers who want to get familiar with the anomalous-sound-detection task.
Conditions of use
This dataset was made by Hitachi, Ltd. and is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
Publication
If you use this dataset, please cite the following paper:
Ryo Tanabe, Harsh Purohit, Kota Dohi, Takashi Endo, Yuki Nikaido, Toshiki Nakamura, and Yohei Kawaguchi, "MIMII DUE: Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection with Domain Shifts due to Changes in Operational and Environmental Conditions," arXiv preprint arXiv:2105.02702, 2021. [URL]
Feedback
If there is any problem, please contact us:
Ryo Tanabe, ryo.tanabe.rw.xk@hitachi.com
Yohei Kawaguchi, yohei.kawaguchi.xk@hitachi.com
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
GTS Provides Clinical Audio Transcription Dataset that Powers AI and Machine Learning for Better Understanding