Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
The ESC dataset is a collection of short environmental recordings available in a unified format (5-second-long clips, 44.1 kHz, single channel, Ogg Vorbis compressed @ 192 kbit/s). All clips have been extracted from public field recordings available through the Freesound.org project. Please see the README files for a detailed attribution list. The dataset is available under the terms of the Creative Commons Attribution-NonCommercial license. The dataset consists of three parts:
ESC-50: a labeled set of 2,000 environmental recordings (50 classes, 40 clips per class).
ESC-10: a labeled set of 400 environmental recordings (10 classes, 40 clips per class); a subset of ESC-50, created initially as a proof-of-concept/standardized selection of easy recordings.
ESC-US: an unlabeled dataset of 250,000 environmental recordings (5-second-long clips), suitable for unsupervised pre-training.
The ESC-US dataset, although not hand-annotated, includes the labels (tags) submitted by the original uploading users, which could potentially be used for weakly-supervised learning (noisy and/or missing labels). The ESC-10 and ESC-50 datasets have been prearranged into 5 uniformly sized folds so that clips extracted from the same original source recording are always contained in a single fold. The labeled datasets are also available as GitHub projects: ESC-50 | ESC-10. For a more thorough description and analysis, please see the original paper and the supplementary IPython notebook. The goal of this project is to facilitate open research initiatives in the field of environmental sound classification, as publicly available datasets in this domain are still quite scarce.
Acknowledgments: I would like to thank Frederic Font Corbera for his help in using the Freesound API.
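A minimal sketch of using the prearranged 5 folds for cross-validation, assuming the metadata layout of the ESC-50 GitHub project (a meta/esc50.csv file with filename, fold, and target columns; adjust paths and column names if your copy differs):

```python
import pandas as pd

# Fold assignments ship with the dataset; clips from the same source
# recording never straddle a fold boundary, so this split avoids leakage.
meta = pd.read_csv("ESC-50-master/meta/esc50.csv")

for held_out in range(1, 6):  # folds are numbered 1 through 5
    train = meta[meta["fold"] != held_out]
    test = meta[meta["fold"] == held_out]
    print(f"fold {held_out}: {len(train)} train / {len(test)} test clips")
```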
Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
The ESC-50 dataset is a collection of environmental sound recordings spanning 50 classes, each with 40 recordings. The recordings are of high quality and have been carefully labeled. The ESC-50 dataset has been used to train and evaluate a variety of environmental sound classification algorithms.
ESC50
Dataset Summary
The ESC-50 dataset is a labeled collection of 2,000 environmental audio recordings suitable for benchmarking methods of environmental sound classification. It comprises 2,000 five-second clips covering 50 classes of natural, human, and domestic sounds, drawn from Freesound.org.
Data Instances
An example of 'train' looks as follows. { "audio": { "path": "ESC-50-master/audio/4-143118-B-7.wav", "array"… See the full description on the dataset page: https://huggingface.co/datasets/yangwang825/esc50.
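A minimal sketch of loading this mirror with the Hugging Face datasets library; the dataset ID comes from the page above, while the split and feature names follow the example instance and are otherwise assumptions to verify against the dataset card:

```python
from datasets import load_dataset

ds = load_dataset("yangwang825/esc50", split="train")

example = ds[0]
print(example["audio"]["path"])         # e.g. "ESC-50-master/audio/4-143118-B-7.wav"
print(example["audio"]["array"].shape)  # decoded waveform as a NumPy array
```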
This is a dataset of vocal imitations, in which the rhythm and pitch of environmental sounds from ESC-50 [1] are replicated or mimicked by voice; it can be used in various tasks involving environmental sounds. The dataset consists of 9,920 vocal imitations (8 imitators per environmental sound). Each imitator is a Japanese speaker. All audio data are 48 kHz/16-bit WAV files.
Each audio file is named as follows:
vocal_imitation/SpeakerID/FileName_SpeakerID.wav
FileName is the original audio file name in ESC-50, and SpeakerID is the ID of the imitator. We recorded vocal imitations for a subset of the sound events in ESC-50; a list of the sound events used is provided in EventList.csv.
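A hypothetical helper for unpacking this naming scheme; the function name and the example path are illustrative, not part of the dataset:

```python
from pathlib import Path

def parse_imitation_path(path: str) -> tuple[str, str]:
    p = Path(path)
    speaker_id = p.parent.name            # the SpeakerID directory
    file_name = p.stem.rsplit("_", 1)[0]  # drop the trailing _SpeakerID suffix
    return speaker_id, file_name

# Illustrative path following vocal_imitation/SpeakerID/FileName_SpeakerID.wav
speaker, original = parse_imitation_path("vocal_imitation/spk01/1-100032-A-0_spk01.wav")
```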
Note that this dataset does not contain the environmental sound files themselves; they can be obtained from the ESC-50 dataset.
The materials may be used free of charge for research purposes, but please refrain from redistribution or from use that is offensive to public order and morals. If you want to use the dataset for commercial purposes, please contact us (Yuki Okamoto or Keisuke Imoto).
If you use this dataset, please cite as follows:
Yuki Okamoto, Keisuke Imoto, Shinnosuke Takamichi, Ryotaro Nagase, Takahiro Fukumori, and Yoichi Yamashita, "Environmental Sound Synthesis from Vocal Imitations and Sound Event Labels," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 411-415, 2024.
If there are any problems, please contact us.
[1] K. J. Piczak, "ESC: Dataset for Environmental Sound Classification," in Proc. 23rd ACM International Conference on Multimedia, 2015, pp. 1015–1018.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Forest environmental sound classification is one use case of ESC that has been widely experimented with to identify illegal activities inside a forest. With no public datasets specific to forest sounds available, there is a need for a benchmark forest environment sound dataset. With this motivation, FSC22 was created as a public benchmark dataset, using audio samples collected from Freesound.org.
This dataset includes 2,025 labeled sound clips, each 5 s long. All audio samples are distributed between six major parent-level classes: Mechanical Sounds, Animal Sounds, Environmental Sounds, Vehicle Sounds, Forest Threat Sounds, and Human Sounds. Further, each class is divided into subclasses that capture specific sounds falling under the main category. Overall, the dataset taxonomy consists of 34 classes. For the first phase of dataset creation, 75 audio samples were collected for each of 27 classes.
We expect that this dataset will help research communities with their work in the forest acoustic monitoring and classification domain.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset contains 2000 spectrogram images converted from the ESC50 audio dataset, which includes 50 categories of environmental sounds such as dog barking, thunder, and clock ticking.
These images are created using the Adaf-Spectrogram method, an adaptive frequency-axis spectrogram representation proposed to improve deep learning performance in classification tasks.
For more information, visit the Adaf-Spectrogram project.
🔧 Spectrogram Generation
All audio clips were processed using the Short-Time Fourier Transform (STFT) to generate time–frequency representations. The transformation was implemented using the scipy library, specifically the scipy.signal module.
The power spectrogram was computed as np.abs(spectrogram)**2 (not in decibel or amplitude scale). After conversion, each spectrogram image was resized to 128×128 pixels using high-quality resampling (LANCZOS filter) via the Pillow (PIL) library.
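A minimal sketch of this pipeline, assuming default STFT parameters and a simple 8-bit normalization step before saving; the Adaf-Spectrogram project may use different settings, and the file names are illustrative:

```python
import numpy as np
from scipy import signal
from scipy.io import wavfile
from PIL import Image

rate, audio = wavfile.read("clip.wav")                      # illustrative file name
f, t, Zxx = signal.stft(audio.astype(np.float64), fs=rate)  # complex STFT
power = np.abs(Zxx) ** 2                                    # linear power, not dB

# Scale to 8-bit grayscale (assumed step), then resize with LANCZOS.
img = (255 * power / power.max()).astype(np.uint8)
Image.fromarray(img).resize((128, 128), Image.LANCZOS).save("clip.png")
```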
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is part of the article titled "A tinyML for risk identification for people with hearing loss" and was developed to train an embedded system called SSIES (Support System for Identifying Environmental Sounds). The goal of the system is to assist individuals with hearing loss by identifying environmental sounds that signal emergencies or require immediate attention. The dataset is primarily based on selected audio samples from the ESC-50 database, including siren, horn, and baby cry sounds. An additional class, "scream," was sourced from Freesound.org and self-recordings, as it was not included in the original ESC-50 set. To minimize false positives, a fifth class called "X" was added, consisting of common non-emergency environmental sounds. The audio files were processed to match the constraints of an embedded environment by downsampling to 16 kHz, trimming to 1–1.5 seconds, and applying data augmentation techniques (pitch and speed variations) using the librosa library. This optimized dataset enables the training of lightweight tinyML models capable of real-time emergency sound recognition.
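A minimal sketch of this preprocessing with librosa; the specific shift and stretch values and the file names are illustrative assumptions, not the article's exact settings:

```python
import librosa
import soundfile as sf

y, sr = librosa.load("siren.wav", sr=16000)  # resample to 16 kHz on load
y = y[: int(1.5 * sr)]                       # trim to at most 1.5 seconds

pitched = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)  # pitch variation
stretched = librosa.effects.time_stretch(y, rate=1.1)       # speed variation

sf.write("siren_pitch.wav", pitched, sr)
sf.write("siren_speed.wav", stretched, sr)
```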
Dataset Details: Audio Source Separation
The dataset for audio source separation is created by combining four different datasets, ensuring diverse and representative audio classes.
Dataset Composition
• Individual audio sources were extracted for each class.
• These sources were mixed in all possible combinations to generate 1,000 mixed WAV files (see the sketch after this list).
• Each mixed file is accompanied by its corresponding true source signals.
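A hypothetical sketch of the mixing step, summing one clip per class at a fixed length and writing the mixture alongside its true sources; the file names, padding length, and combination scheme are illustrative only:

```python
import numpy as np
import soundfile as sf

def mix_sources(paths, out_prefix, length=80000):
    sources, sr = [], None
    for p in paths:
        y, sr = sf.read(p)
        if y.ndim > 1:
            y = y.mean(axis=1)  # collapse to mono
        sources.append(np.pad(y, (0, max(0, length - len(y))))[:length])
    sf.write(f"{out_prefix}_mix.wav", np.sum(sources, axis=0), sr)
    for i, s in enumerate(sources):  # keep the true source signals
        sf.write(f"{out_prefix}_src{i}.wav", s, sr)

mix_sources(["speech.wav", "music.wav", "env.wav", "traffic.wav"], "mix0001")
```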
Source Datasets
1. Speech – LibriVox
2. Music – MUSDB18
3. Environmental Sounds – ESC-50
4. Traffic Sounds – UrbanSound8K
This dataset is designed to support research in audio source separation, machine learning, and signal processing.