FSD50K is an open dataset of human-labeled sound events containing 51,197 Freesound clips unequally distributed in 200 classes drawn from the AudioSet Ontology. FSD50K has been created at the Music Technology Group of Universitat Pompeu Fabra.
Citation
If you use the FSD50K dataset, or part of it, please cite our TASLP paper (available from [arXiv] [TASLP]):
@article{fonseca2022FSD50K,
title={{FSD50K}: an open dataset of human-labeled sound events},
author={Fonseca, Eduardo and Favory, Xavier and Pons, Jordi and Font, Frederic and Serra, Xavier},
journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
volume={30},
pages={829--852},
year={2022},
publisher={IEEE}
}
Paper update: This paper was published in TASLP in early 2022. The accepted camera-ready version includes a number of improvements over the initial submission. The main updates include: an estimation of the amount of label noise in FSD50K, an SNR comparison between FSD50K and AudioSet, an improved description of the evaluation metrics (including equations), clarification of the experimental methodology and some results, and some content moved to the Appendix for readability. The TASLP-accepted camera-ready version is available from arXiv (specifically, v2 on arXiv, which is displayed by default).
Data curators
Eduardo Fonseca, Xavier Favory, Jordi Pons, Mercedes Collado, Ceren Can, Rachit Gupta, Javier Arredondo, Gary Avendano and Sara Fernandez
Contact
You are welcome to contact Eduardo Fonseca should you have any questions, at efonseca@google.com.
ABOUT FSD50K
Freesound Dataset 50k (or FSD50K for short) is an open dataset of human-labeled sound events containing 51,197 Freesound clips unequally distributed in 200 classes drawn from the AudioSet Ontology [1]. FSD50K has been created at the Music Technology Group of Universitat Pompeu Fabra.
What follows is a brief summary of FSD50K's most important characteristics. Please have a look at our paper (especially Section 4) to extend the basic information provided here with relevant details for its usage, as well as discussion, limitations, applications and more.
Basic characteristics:
- 51,197 audio clips unequally distributed across 200 sound classes drawn from the AudioSet Ontology; the class vocabulary is available in vocabulary.csv (see Files section below).
- Dev set: 40,966 clips.
- Eval set: 10,231 clips.
Note: All classes in FSD50K are represented in AudioSet, except Crash cymbal, Human group actions, Human voice, Respiratory sounds, and Domestic sounds, home sounds.
LICENSE
All audio clips in FSD50K are released under Creative Commons (CC) licenses. Each clip has its own license as defined by the clip uploader in Freesound, some of them requiring attribution to their original authors and some forbidding further commercial reuse. Specifically:
The development set consists of 40,966 clips and the evaluation set of 10,231 clips; clip licenses include CC0, CC-BY, CC-BY-NC, and CC Sampling+.
For attribution purposes and to facilitate attribution of these files to third parties, we include a mapping from the audio clips to their corresponding licenses. The licenses are specified in the files dev_clips_info_FSD50K.json and eval_clips_info_FSD50K.json.
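As an illustration, a minimal lookup sketch (the JSON field names, e.g. "license", are assumed from typical Freesound metadata and may differ; inspect one entry first):

import json

# Hypothetical extraction path; adjust to where FSD50K.metadata was unzipped.
with open("FSD50K.metadata/dev_clips_info_FSD50K.json") as f:
    clips_info = json.load(f)

fname = "64760"  # Freesound id used as the key (assumed)
print(clips_info[fname].get("license"))  # e.g. a creativecommons.org URL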
In addition, FSD50K as a whole is the result of a curation process and it has an additional license: FSD50K is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSD50K.doc zip file. We note that the choice of one license for the dataset as a whole is not straightforward as it comprises items with different licenses (such as audio clips, annotations, or data split). The choice of a global license in these cases may warrant further investigation (e.g., by someone with a background in copyright law).
Usage of FSD50K for commercial purposes:
If you'd like to use FSD50K for commercial purposes, please contact Eduardo Fonseca and Frederic Font at efonseca@google.com and frederic.font@upf.edu.
Also, if you are interested in using FSD50K for machine learning competitions, please contact Eduardo Fonseca and Frederic Font at efonseca@google.com and frederic.font@upf.edu.
FILES
FSD50K can be downloaded as a series of zip files with the following directory structure:
root
│
└───FSD50K.dev_audio/                   Audio clips in the dev set
│
└───FSD50K.eval_audio/                  Audio clips in the eval set
│
└───FSD50K.ground_truth/                Files for FSD50K's ground truth
│   │
│   └─── dev.csv                        Ground truth for the dev set
│   │
│   └─── eval.csv                       Ground truth for the eval set
│   │
│   └─── vocabulary.csv                 List of 200 sound classes in FSD50K
│
└───FSD50K.metadata/                    Files for additional metadata
│   │
│   └─── class_info_FSD50K.json         Metadata about the sound classes
│   │
│   └─── dev_clips_info_FSD50K.json     Metadata about the dev clips
│   │
│   └─── eval_clips_info_FSD50K.json    Metadata about the eval clips
│   │
│   └─── pp_pnp_ratings_FSD50K.json     PP/PNP ratings
│   │
│   └─── collection/                    Files for the *sound collection* format
│
└───FSD50K.doc/
    │
    └───README.md                       The dataset description file that you are reading
    │
    └───LICENSE-DATASET                 License of the FSD50K dataset as an entity
Each row (i.e. audio clip) of dev.csv contains the following information:
- fname: the file name without the .wav extension, e.g., the fname 64760 corresponds to the file 64760.wav on disk. This number is the Freesound id. We always use Freesound ids as filenames.
- labels: the class labels (i.e., the ground truth). Note these
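For example, a minimal sketch (not part of the official release) that loads the dev ground truth with pandas, assuming the zip files have been extracted under a common root directory:

import os
import pandas as pd

root = "FSD50K"  # wherever the zip files were extracted (assumed path)
dev = pd.read_csv(os.path.join(root, "FSD50K.ground_truth", "dev.csv"))
# Build the path to each audio clip from its Freesound id.
dev["path"] = dev["fname"].astype(str).apply(
    lambda f: os.path.join(root, "FSD50K.dev_audio", f + ".wav"))
print(dev[["fname", "labels", "path"]].head())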
License: Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
FSD50K: An open dataset of human-labeled sound events
This is a mirror of the FSD50K sound event dataset. The original files were converted from WAV to Opus to reduce the size and accelerate streaming.
Sampling rate: 48 kHz
Channels: 1
Format: Opus
Splits: Dev: 80 hours, 40,966 clips. Eval: 28 hours, 10,231 clips.
License: FSD50K is released under CC-BY. However, each clip has its own licence. Clip licenses include CC0, CC-BY, CC-BY-NC and CC Sampling+. Clip licenses are specified… See the full description on the dataset page: https://huggingface.co/datasets/philgzl/fsd50k.
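As a rough usage sketch (the split name is an assumption; check the dataset page for the exact configuration), the mirror can be streamed with the Hugging Face datasets library:

from datasets import load_dataset

ds = load_dataset("philgzl/fsd50k", split="eval", streaming=True)  # split name assumed
for example in ds.take(2):
    print(example.keys())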
License: Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This dataset is the FSD50K dataset, formatted in the WebDataset format. WebDataset files are essentially tar archives, where each example in the dataset is represented by a pair of files: a WAV audio file and a corresponding JSON metadata file. The JSON file contains the class label and other relevant information for that particular audio sample.
$ tar tvf fsdk50_eval_0000000.tar |head
-r--r--r-- bigdata/bigdata 40 2025-01-12 13:02 45604.json
-r--r--r-- bigdata/bigdata 43066 2025-01-12 13:02 45604.wav
-r--r--r-- bigdata/bigdata 46 2025-01-12 13:02 213293.json
-r--r--r-- bigdata/bigdata 1372242 2025-01-12 13:02 213293.wav
-r--r--r-- bigdata/bigdata 82 2025-01-12 13:02 348174.json
-r--r--r-- bigdata/bigdata 804280 2025-01-12 13:02 348174.wav
-r--r--r-- bigdata/bigdata 71 2025-01-12 13:02 417736.json
-r--r--r-- bigdata/bigdata 2238542 2025-01-12 13:02 417736.wav
-r--r--r-- bigdata/bigdata 43 2025-01-12 13:02 235555.json
-r--r--r-- bigdata/bigdata 542508 2025-01-12 13:02 235555.wav
$ tar -xOf fsdk50_eval_0000000.tar 45604.json
{"soundevent": "Yell;Shout;Human_voice"}
Dataset Card for "FSD50K"
More Information needed
License: Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
ARCA23K is a dataset of labelled sound events created to investigate real-world label noise. It contains 23,727 audio clips originating from Freesound, and each clip belongs to one of 70 classes taken from the AudioSet ontology. The dataset was created using an entirely automated process with no manual verification of the data. For this reason, many clips are expected to be labelled incorrectly.
In addition to ARCA23K, this release includes a companion dataset called ARCA23K-FSD, which is a single-label subset of the FSD50K dataset. ARCA23K-FSD contains the same sound classes as ARCA23K and the same number of audio clips per class. As it is a subset of FSD50K, each clip and its label have been manually verified. Note that only the ground truth data of ARCA23K-FSD is distributed in this release. To download the audio clips, please visit the Zenodo page for FSD50K.
A paper has been published detailing how the dataset was constructed. See the Citing section below.
The source code used to create the datasets is available at https://github.com/tqbl/arca23k-dataset.
Characteristics
Sound Classes
The list of sound classes is given below. They are grouped based on the top-level superclasses of the AudioSet ontology.
Music
Sounds of things
Natural sounds
Human sounds
Animal
Source-ambiguous sounds
License and Attribution
This release is licensed under the Creative Commons Attribution 4.0 International License.
The audio clips distributed as part of ARCA23K were sourced from Freesound and have their own Creative Commons license. The license information and attribution for each audio clip can be found in ARCA23K.metadata/train.json, which also includes the original Freesound URLs.
The files under ARCA23K-FSD.ground_truth/ are an adaptation of the ground truth data provided as part of FSD50K, which is licensed under the Creative Commons Attribution 4.0 International License. The curators of FSD50K are Eduardo Fonseca, Xavier Favory, Jordi Pons, Mercedes Collado, Ceren Can, Rachit Gupta, Javier Arredondo, Gary Avendano, and Sara Fernandez.
Citing
If you wish to cite this work, please cite the following paper:
T. Iqbal, Y. Cao, A. Bailey, M. D. Plumbley, and W. Wang, “ARCA23K: An audio dataset for investigating open-set label noise”, in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021), 2021, Barcelona, Spain, pp. 201–205.
BibTeX:
@inproceedings{Iqbal2021,
author = {Iqbal, T. and Cao, Y. and Bailey, A. and Plumbley, M. D. and Wang, W.},
title = {{ARCA23K}: An audio dataset for investigating open-set label noise},
booktitle = {Proceedings of the Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021)},
pages = {201--205},
year = {2021},
address = {Barcelona, Spain},
}
License: Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
DESCRIPTION:
This audio dataset serves as supplementary material for the DCASE2024 Challenge Task 3: Audio and Audiovisual Sound Event Localization and Detection with Distance Estimation. The dataset consists of synthetic spatial audio mixtures of sound events spatialized for two different spatial formats using real room impulse responses (RIRs) measured in various spaces of Tampere University (TAU). The mixtures are generated using the same process as the one used to generate the recordings of the TAU-NIGENS Spatial Sound Scenes 2021 dataset for the DCASE2021 Challenge Task 3.
The SELD task setup in DCASE2024 is based on spatial recordings of real scenes, captured in the STARSS23 dataset. Since the task setup allows the use of external data, these synthetic mixtures serve as additional training material for the baseline model. For more details on the task setup, please refer to the task description.
Note that the generator code and the collection of room responses used to spatialize the sound samples will also be made available soon. For more details on the recording of RIRs, spatialization, and generation, see the reference available here.
SPECIFICATIONS:
DOWNLOAD INSTRUCTIONS:
Download the zip files and use your preferred compression tool to unzip these split zip files. To extract a split zip archive (named as zip, z01, z02, ...), you could use, for example, the following syntax in Linux or OSX terminal:
zip -s 0 split.zip --out single.zip
unzip single.zip
License: Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Created by
Yu Wang, Nicholas J. Bryan, Justin Salamon, Mark Cartwright, and Juan Pablo Bello
Publication
If using this data in academic work, please cite the following paper, which presented this dataset:
Y. Wang, N. J. Bryan, J. Salamon, M. Cartwright, and J. P. Bello. "Who calls the shots? Rethinking Few-shot Learning for Audio", IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2021
Description
FSD-MIX-CLIPS is an open dataset of programmatically mixed audio clips with a controlled level of polyphony and signal-to-noise ratio. We use single-labeled clips from FSD50K as the source material for the foreground sound events and Brownian noise as the background to generate 281,039 10-second strongly-labeled soundscapes with Scaper. We refer to this (intermediate) dataset of 10s soundscapes as FSD-MIX-SED. Each soundscape contains n events from n different sound classes, where n ranges from 1 to 5. We then extract 614,533 1s clips centered on each sound event in the soundscapes of FSD-MIX-SED to produce FSD-MIX-CLIPS.
Source material and annotations
Due to the large size of the dataset, instead of releasing the raw audio files, we release the source material and soundscape annotations in JAMS format, which can be used to reproduce FSD-MIX-SED using Scaper with the script in the project repository.
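For instance, a minimal sketch (not the official script) of re-rendering one soundscape from its JAMS annotation with Scaper; the file names and extraction paths below are hypothetical:

import scaper

jams_file = "FSD_MIX_SED.annotations/base/train/soundscape_0.jams"  # hypothetical name
scaper.generate_from_jams(
    jams_file,
    audio_outfile="soundscape_0.wav",
    fg_path="FSD_MIX_SED.source/foreground",  # where the source material was extracted
    bg_path="FSD_MIX_SED.source/background",
)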
All clips in FSD-MIX-CLIPS are extracted from FSD-MIX-SED. Therefore, for FSD-MIX-CLIPS, instead of releasing duplicated audio content, we provide annotations that specify the filename in FSD-MIX-SED and the corresponding starting time (in second) of each 1-second clip.
Foreground material from FSD50K
We choose clips shorter than 4s that have a single validated label with the Present and Predominant annotation type. We further trim the silence at the edges of each clip. The resulting subset contains clips that each carry a single, strong label. The 200 sound classes in FSD50K are hierarchically organized. We focus on the leaf nodes and rule out classes with fewer than 20 single-labeled clips. This gives us 89 sound classes. vocab.json contains the list of 89 classes; each class is labeled by its index in the list.
Data splits
FSD-MIX-CLIPS was originally generated for the task of multi-label audio classification under a few-shot continual learning setup. Therefore, the classes are split into disjoint sets of base and novel classes, where novel-class data are only used at inference time. We partition the 89 classes into three splits: base, novel-val, and novel-test with 59, 15, and 15 classes, respectively. Base-class data are used for both training and evaluation, while novel-val/novel-test class data are used for validation/testing only.
Files
FSD_MIX_SED.source.tar.gz contains the background Brownian noise and 10,296 single-labeled sound events from FSD50K in .wav format. The original file size is 1.9GB.
FSD_MIX_SED.annotations.tar.gz contains 281,039 JAMS files. The original file size is 35GB.
FSD_MIX_CLIPS.annotations.tar.gz contains ground truth labels for 1-second clips in each data split in FSD_MIX_SED, specified by filename and starting time (sec).
vocab.json contains the 89 classes.
Foreground sound materials and soundscape annotations in FSD_MIX_SED are organized in a similar folder structure following the data splits:
root folder
│
└───base/ Base classes (label 0-58)
│ │
│ └─── train/
│ │ │
│ │ └─── audio or annotation files
│ │
│ └─── val/
│ │ │
│ │ └─── audio or annotation files
│ │
│ └─── test/
│ │
│ └─── audio or annotation files
│
│
└───val/ Novel-val classes (label 59-73)
│ │
│ └─── audio or annotation files
│
│
└───test/ Novel-test classes (label 74-88)
│
└─── audio or annotation files
References
[1] Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, Xavier Serra. "FSD50K: an Open Dataset of Human-Labeled Sound Events", arXiv:2010.00475, 2020.
License: Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
FSD-FS is a publicly-available database of human-labelled sound events for few-shot learning. It spans 143 classes obtained from the AudioSet Ontology and contains 43,805 raw audio files collected from FSD50K. FSD-FS is curated at the Centre for Digital Music, Queen Mary University of London.
Citation
If you use the FSD-FS dataset, please cite our paper and FSD50K.
@article{liang2022learning,
title={Learning from Taxonomy: Multi-label Few-Shot Classification for Everyday Sound Recognition},
author={Liang, Jinhua and Phan, Huy and Benetos, Emmanouil},
journal={arXiv preprint arXiv:2212.08952},
year={2022}
}
@ARTICLE{9645159,
  author={Fonseca, Eduardo and Favory, Xavier and Pons, Jordi and Font, Frederic and Serra, Xavier},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  title={{FSD50K}: An Open Dataset of Human-Labeled Sound Events},
  year={2022},
  volume={30},
  pages={829--852},
  doi={10.1109/TASLP.2021.3133208}
}
About FSD-FS
FSD-FS is an open database for multi-label few-shot audio classification containing 143 classes drawn from FSD50K; it also inherits the AudioSet Ontology. FSD-FS follows a 7:2:1 ratio to split classes into base, validation, and evaluation sets, so there are 98 classes in the base set, 30 classes in the validation set, and 15 classes in the evaluation set (more details can be found in our paper).
LICENSE
FSD-FS is released under Creative Commons (CC) licenses. As with FSD50K, each clip has its own license as defined by the clip uploader in Freesound, some of them requiring attribution to their original authors and some forbidding further commercial reuse. For more details, please refer to the link.
FILES
FSD-FS is organised in the following structure:
root
|
└─── dev_base
|
└─── dev_val
|
└─── eval
REFERENCES AND LINKS
[1] Gemmeke, Jort F., et al. "Audio set: An ontology and human-labeled dataset for audio events." 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2017. [paper] [link]
[2] Fonseca, Eduardo, et al. "Fsd50k: an open dataset of human-labeled sound events." IEEE/ACM Transactions on Audio, Speech, and Language Processing 30 (2021): 829-852. [paper] [code]
The Free Universal Sound Separation (FUSS) Dataset is a database of arbitrary sound mixtures and source-level references, for use in experiments on arbitrary sound separation.
This is the official sound separation data for the DCASE2020 Challenge Task 4: Sound Event Detection and Separation in Domestic Environments.
Overview: FUSS audio data is sourced from a pre-release of the Freesound Dataset known as FSD50K, a sound event dataset composed of Freesound content annotated with labels from the AudioSet Ontology. Using the FSD50K labels, these source files have been screened such that they likely only contain a single type of sound. Labels are not provided for these source files, and are not considered part of the challenge. For the purpose of the DCASE Task 4 Sound Separation and Event Detection challenge, systems should not use FSD50K labels, even though they may become available upon FSD50K release.
To create mixtures, 10 second clips of sources are convolved with simulated room impulse responses and added together. Each 10 second mixture contains between 1 and 4 sources. Source files longer than 10 seconds are considered "background" sources. Every mixture contains one background source, which is active for the entire duration. We provide: a software recipe to create the dataset, the room impulse responses, and the original source audio.
To use this dataset:
import tensorflow_datasets as tfds

# Load the train split and print a few examples.
ds = tfds.load('fuss', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
This dataset was created by Bao Tran Tong
This dataset was created by Anirudh Vignesh
License: Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
GISE-51 is an open dataset of 51 isolated sound events based on the FSD50K dataset. The release also includes the GISE-51-Mixtures subset, a dataset of 5-second soundscapes with up to three sound events synthesized from GISE-51. The GISE-51 release attempts to address some of the shortcomings of recent sound event datasets, providing an open, reproducible benchmark for future research and the freedom to adapt the included isolated sound events for domain-specific applications, which was not possible with existing large-scale weakly labelled datasets. The GISE-51 release also includes accompanying code for baseline experiments, which can be found at https://github.com/SarthakYadav/GISE-51-pytorch.
Citation
If you use the GISE-51 dataset and/or the released code, please cite our paper:
Sarthak Yadav and Mary Ellen Foster, "GISE-51: A scalable isolated sound events dataset", arXiv:2103.12306, 2021
Since GISE-51 is based on FSD50K, if you use GISE-51 kindly also cite the FSD50K paper:
Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, Xavier Serra. "FSD50K: an Open Dataset of Human-Labeled Sound Events", arXiv:2010.00475, 2020.
About GISE-51 and GISE-51-Mixtures
The following sections summarize key characteristics of the GISE-51 and the GISE-51-Mixtures datasets, including details left out from the paper.
GISE-51
- See meta/lbl_map.csv for the complete vocabulary.
- silence_thresholds.txt lists class bins and their corresponding volume threshold. Files that were determined by sox to contain no audio at all were manually clipped. Code for performing silence filtering can be found in scripts/strip_silence_sox.py in the code repository.

GISE-51-Mixtures
LICENSE
All audio clips (i.e., found in isolated_events.tar.gz) used in the preparation of the Glasgow Isolated Events Dataset (GISE-51) are designated Creative Commons and were obtained from FSD50K. The source data in isolated_events.tar.gz is based on the FSD50K dataset, which is licensed as Creative Commons Attribution 4.0 International (CC BY 4.0) License.
GISE-51 dataset (including GISE-51-Mixtures) is a curated, processed and generated preparation, and is released under Creative Commons Attribution 4.0 International (CC BY 4.0) License. The license is specified in the LICENSE-DATASET file in license.tar.gz.
Baselines
Several sound event recognition experiments were conducted, establishing baseline performance on several prominent convolutional neural network architectures. The experiments are described in Section 4 of our paper, and the implementation for reproducing these experiments is available at https://github.com/SarthakYadav/GISE-51-pytorch.
Files
GISE-51 is available as a collection of several tar archives. All audio files are PCM 16-bit, 22050 Hz. The following lists the contents of these archives in detail:
- isolated_events.tar.gz: the core GISE-51 isolated events dataset, containing train, val and eval subfolders.
- meta.tar.gz: contains lbl_map.json.
- noises.tar.gz: contains background noises used for GISE-51-Mixtures soundscape generation.
- mixtures_jams.tar.gz: contains annotation files in .jams format that, alongside isolated_events.tar.gz and noises.tar.gz, can be reused to generate the exact GISE-51-Mixtures soundscapes. (Optional; we provide the complete set of GISE-51-Mixtures soundscapes as independent tar archives.)
- train.tar.gz: GISE-51-Mixtures train set, containing 60k synthetic soundscapes.
- val.tar.gz: GISE-51-Mixtures val set, containing 10k synthetic soundscapes.
- eval.tar.gz: GISE-51-Mixtures eval set, containing 10k synthetic soundscapes.
- train_*.tar.gz: tar archives containing training mixtures with a varying number of soundscapes, used primarily in Section 4.1 of the paper, which compares val mAP performance vs. the number of training soundscapes. A helper script, prepare_mixtures_lmdb.sh, is provided in the code release to prepare data for the experiments in Section 4.1.
- pretrained-models.tar.gz: contains model checkpoints for all experiments conducted in the paper, including state_dicts for use with transfer learning experiments. More information on these checkpoints can be found in the code release README.
- license.tar.gz: contains dataset license info.
- silence_thresholds.txt: contains volume thresholds for various sound event bins used for silence filtering.

Contact
In case of queries and clarifications, feel free to contact Sarthak at s.yadav.2@research.gla.ac.uk. (Adding [GISE-51] to the subject of the email would be appreciated!)
License: Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Created by
Yu Wang, Mark Cartwright, and Juan Pablo Bello
Publication
If using this data in academic work, please cite the following paper, which presented this dataset:
Y. Wang, M. Cartwright, and J. P. Bello. "Active Few-Shot Learning for Sound Event Detection", INTERSPEECH, 2022
Description
SONYC-FSD-SED is an open dataset of programmatically mixed audio clips that simulates audio data in an environmental sound monitoring system, where sound class occurrences and co-occurrences exhibit seasonal periodic patterns. We use recordings collected from the Sounds of New York City (SONYC) acoustic sensor network as backgrounds, and single-labeled clips from the FSD50K dataset as foreground events, to generate 576,591 10-second strongly-labeled soundscapes with Scaper (including 111,294 additional test data for the sampling-window experiment). Instead of sampling foreground sound events uniformly, we simulate the occurrence probability of each class at different times in a year, creating more realistic temporal characteristics.
Source material and annotations
Due to the large size of the dataset, instead of releasing the raw audio files, we release the source material and soundscape annotations in JAMS format, which can be used to reproduce SONYC-FSD-SED using Scaper with the script in the project repository.
Background material from SONYC recordings
We pick a sensor from the SONYC sensor network and subsample from recordings it collected within a year (2017). We categorize these ∼550k 10-second clips into 96 bins based on timestamps, where each bin represents a unique combination of the month of the year, day of the week (weekday or weekend), and time of day (divided into four 6-hour blocks). Next, we run a pre-trained urban sound event classifier over all recordings and filter out clips with active sound classes. We do not filter out footstep and bird since they appear too frequently; instead, we remove these two classes from the foreground sound material. Then, from each bin, we choose the clip with the lowest sound pressure level, yielding 96 background clips.
Foreground material from FSD50K
We follow the same filtering process as in FSD-MIX-SED to get the subset of FSD50K with short single-labeled clips. In addition, we remove two classes, "Chirp_and_tweet" and "Walk_and_footsteps", that exist in our SONYC background recordings. This results in 87 sound classes. vocab.json contains the list of 87 classes, each class is then labeled by its index in the list. 0-42: train, 43-56: val, 57-86: test.
Occurrence probability modelling
For each class, we model its occurrence probability within a year. We use von Mises probability density functions to simulate the probability distribution over different weeks in a year and hours in a day, considering their cyclic characteristics: f(x | μ, κ) = exp(κ cos(x − μ)) / (2π I₀(κ)), where I₀(κ) is the modified Bessel function of order 0, and μ and 1/κ are analogous to the mean and variance of the normal distribution. We randomly sample (μ_year, μ_day) from [−π, π] and (κ_year, κ_day) from [0, 10]. We also randomly assign p_weekday ∈ [0, 1] and p_weekend = 1 − p_weekday to simulate the probability distribution over different days in a week. Finally, we get the probability distribution over the entire year with a 1-hour resolution. At a given timestamp, we integrate f_year and f_day over the 1-hour window and multiply them together with p_weekday or p_weekend depending on the day. To speed up the subsequent sampling process, we scale the final probability distribution using a temperature parameter randomly sampled from [2, 3].
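A minimal sketch of this scheme using scipy (the parameter values and the pdf-times-window approximation of the integral are illustrative assumptions, not the released code):

import numpy as np
from scipy.stats import vonmises

rng = np.random.default_rng(0)
mu_year, mu_day = rng.uniform(-np.pi, np.pi, size=2)
kappa_year, kappa_day = rng.uniform(1e-3, 10, size=2)  # avoid the degenerate kappa = 0
p_weekday = rng.uniform(0, 1)
p_weekend = 1.0 - p_weekday

def occurrence_prob(hour_of_year, is_weekday):
    # Unnormalised occurrence probability for one class in a 1-hour window.
    hours_per_year = 365 * 24
    theta_year = 2 * np.pi * (hour_of_year % hours_per_year) / hours_per_year - np.pi
    theta_day = 2 * np.pi * (hour_of_year % 24) / 24 - np.pi
    # Approximate the integral over the 1-hour window by pdf * window width.
    f_year = vonmises.pdf(theta_year, kappa_year, loc=mu_year) * (2 * np.pi / hours_per_year)
    f_day = vonmises.pdf(theta_day, kappa_day, loc=mu_day) * (2 * np.pi / 24)
    return f_year * f_day * (p_weekday if is_weekday else p_weekend)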
Files
SONYC_FSD_SED.source.tar.gz: 96 SONYC backgrounds and 10,158 foreground sounds in .wav format. The original file size is 2GB.
SONYC_FSD_SED.annotations.tar.gz: 465,467 JAMS files. The original file size is 57GB.
SONYC_FSD_SED_add_test.annotations.tar.gz: 111,294 JAMS files for additional test data. The original file size is 14GB.
vocab.json: 87 classes.
occ_prob_per_cl.pkl: Occurrence probability for each foreground sound class.
References
[1] J. P. Bello, C. T. Silva, O. Nov, R. L. DuBois, A. Arora, J. Salamon, C. Mydlarz, and H. Doraiswamy, “SONYC: A system for monitoring, analyzing, and mitigating urban noise pollution,” Commun. ACM, 2019
[2] E. Fonseca, X. Favory, J. Pons, F. Font, X. Serra. "FSD50K: an Open Dataset of Human-Labeled Sound Events", arXiv:2010.00475, 2020.
License: Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
L3DAS21: MACHINE LEARNING FOR 3D AUDIO SIGNAL PROCESSING
IEEE MLSP Data Challenge 2021
SCOPE OF THE CHALLENGE
The L3DAS21 Challenge for IEEE MLSP 2021 aims at encouraging and fostering research on machine learning for 3D audio signal processing. In multi-speaker scenarios it is very important to properly understand the nature of a sound event and its position within the environment, what the content of the sound signal is, and how best to leverage it for a specific application (e.g., teleconferencing rather than assistive listening or entertainment, among others). To this end, the L3DAS21 Challenge presents two tasks: 3D Speech Enhancement and 3D Sound Event Localization and Detection, both relying on first-order Ambisonics recordings in a reverberant office environment.
Each task involves 2 separate tracks: 1-mic and 2-mic recordings, respectively containing sounds acquired by one Ambisonics microphone and by an array of two Ambisonics microphones. The use of two first-order Ambisonics microphones definitely represents one of the main novelties of the L3DAS21 Challenge.
Task 1: 3D Speech Enhancement. The objective of this task is the enhancement of speech signals immersed in the spatial sound field of a reverberant office environment. Here the models are expected to extract the monophonic voice signal from the 3D mixture containing various background noises. The evaluation metric for this task is a combination of the short-time objective intelligibility (STOI) and word error rate (WER).
Task 2: 3D Sound Event Localization and Detection. The aim of this task is to detect the temporal activities of a known set of sound event classes and, in particular, to further locate them in space. Here the models must predict a list of the active sound events and their respective locations at regular intervals of 100 milliseconds. Performance on this task is evaluated according to the location-sensitive detection error, which combines the localization and detection errors.
DATASETS
The L3DAS21 datasets contain multiple-source and multiple-perspective B-format Ambisonics audio recordings. We sampled the acoustic field of a large office room, placing two first-order Ambisonics microphones in the center of the room and moving a speaker reproducing the analytic signal in 252 fixed spatial positions. Relying on the collected Ambisonics impulse responses (IRs), we augmented existing clean monophonic datasets to obtain synthetic three-dimensional sound sources by convolving the original sounds with our IRs. We extracted speech signals from the Librispeech dataset and office-like background noises from the FSD50K dataset. We aimed at creating plausible and varied 3D scenarios to reflect possible real-life situations in which sound and disparate types of background noise coexist in the same 3D reverberant environment. We provide normalized raw waveforms as predictor data; the target data varies according to the task.
The dataset is divided into two main sections, respectively dedicated to the two challenge tasks.
The first section is optimized for 3D Speech Enhancement and contains more than 30,000 virtual 3D audio environments with a duration of up to 10 seconds. In each sample, a spoken voice is always present alongside other office-like background noises. As target data for this section we provide the clean monophonic voice signals.
The other section, instead, is dedicated to the 3D Sound Event Localization and Detection task and contains 900 60-second-long audio files. Each data point contains a simulated 3D office audio environment in which up to 3 simultaneous acoustic events may be active at the same time. In this section, the samples are not forced to contain a spoken voice. As target data for this section we provide a list of the onset and offset time stamps, the class, and the spatial coordinates of each individual sound event present in the data points.
We split both dataset sections into a training set (44 hours for SE and 600 hours for SELD) and a test set (6 hours for SE and 5 hours for SELD), paying attention to creating similar distributions. The train set of the SE section is divided into two partitions, train360 and train100, which contain speech samples extracted from the corresponding partitions of Librispeech (only samples up to 10 seconds). All sets of the SELD section are divided into OV1, OV2 and OV3. These partitions refer to the maximum number of overlapping sounds, which is 1, 2 or 3, respectively.
The evaluation test datasets can be downloaded here:
L3DAS21_Task1_test.zip
L3DAS21_Task2_test.zip
CHALLENGE WEBSITE AND CONTACTS
L3DAS21 Challenge Website: www.l3das.com/mlsp2021
GitHub repository: github.com/l3das/L3DAS21
Paper: arxiv.org/abs/2104.05499
IEEE MLSP 2021: 2021.ieeemlsp.org/
Email contact: l3das@uniroma1.it
Twitter: https://twitter.com/das_l3
Zeroshot-Audio-Classification-Instructions
Converts audio classification datasets into zero-shot-format speech instructions, supporting both single-label and multi-label:
VGGSound FSD50k Nonspeech7k urbansound8K VocalSound Emotion Gender ESD Emotion Age Language TAU Urban Acoustic Scenes 2022 CochlScene BirdCLEF_2021 EmoBox AudioSet
We also converted huge WAV files into MP3 at a 16k sample rate to reduce storage size. To prevent leakage, please do not include the test set in the training session.… See the full description on the dataset page: https://huggingface.co/datasets/mesolitica/Zeroshot-Audio-Classification-Instructions.
License: MIT, https://opensource.org/licenses/MIT
🎧 Auditory Scene Analysis 2 (ASA2) Dataset
We constructed a new dataset for multichannel USS and polyphonic audio classification tasks. The proposed dataset is designed to reflect various conditions, including moving sources with temporal onsets and offsets. For foreground sound sources, signals from 13 audio classes were selected from open-source databases (Pixabay¹, FSD50K, Librispeech, MUSDB18, Vocalsound). These signals were resampled to 16 kHz and pre-processed by either padding zeros… See the full description on the dataset page: https://huggingface.co/datasets/donghoney22/ASA2_dataset.
License: Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Introduction:
Divide and Remaster (DnR) is a source separation dataset for training and testing algorithms that separate a monaural audio signal into speech, music, and sound effects/background stems. The dataset is composed of artificial mixtures using audio from LibriSpeech, the Free Music Archive (FMA), and the Freesound Dataset 50k (FSD50K). We introduce it as part of the Cocktail Fork Problem paper.
At a Glance:
The size of the unzipped dataset is ~174GB
Each mixture is 60 seconds long and sources are not fully overlapped
Audio is encoded as 16-bit .wav files at a sampling rate of 44.1 kHz
The data is split into training tr (3,295 mixtures), validation cv (440 mixtures) and testing tt (652 mixtures) subsets
The directory for each mixture contains four .wav files (mix.wav, music.wav, speech.wav, sfx.wav) and annots.csv, which contains the metadata for the original audio used to compose the mixture (transcriptions for speech, sound classes for sfx, and genre labels for music)
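A minimal loading sketch under the layout described above (the mixture directory name is hypothetical):

import os
import pandas as pd
import soundfile as sf

mix_dir = "dnr/tt/0000"  # hypothetical mixture directory
stems = {}
for stem in ("mix", "music", "speech", "sfx"):
    audio, sr = sf.read(os.path.join(mix_dir, stem + ".wav"))  # 44.1 kHz, 16-bit wav
    stems[stem] = audio
annots = pd.read_csv(os.path.join(mix_dir, "annots.csv"))
print({k: v.shape for k, v in stems.items()}, annots.columns.tolist())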
Other Resources:
Demo examples and additional information are available at: https://cocktail-fork.github.io/
For more details about the data generation process, the code used to generate our dataset can be found at the following: https://github.com/darius522/dnr-utils
Contact and Support:
Have an issue, concern, or question about DnR ? If so, please open an issue here.
For any other inquiries, feel free to shoot an email at: firstname.lastname@gmail.com, my name is Darius Petermann ;)
Citation:
If you use DnR please cite our paper in which we introduce the dataset as part of the Cocktail Fork Problem:
@article{Petermann2021cocktail,
  title={The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks},
  author={Darius Petermann and Gordon Wichern and Zhong-Qiu Wang and Jonathan {Le Roux}},
  year={2021},
  journal={arXiv preprint arXiv:2110.09958},
  archivePrefix={arXiv},
  primaryClass={eess.AS}
}
License: Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Open-set Tagging (OST) is a synthetic dataset of 1s clips used to evaluate source-centric representation learning models in the paper Compositional Audio Representation Learning.
Due to the size of the dataset, we only share the source files; the scripts to generate the dataset are available here.
The dataset generation process is as follows:
1. From single-source FSD50K audio files, we generate a dataset of 10s soundscapes called Open-set Soundscapes (OSS) using Scaper.
2. We then center a 1s window around the center of each sound event in the 10s soundscapes to generate Open-set Tagging (OST), which contains ~500k clips.
If you are not going to use OSS, you can choose to synthesize it without audio; this will synthesize only the JAMS annotation files needed for the 1s clips. Using the OSS JAMS files, OST clips can be generated deterministically.
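A minimal sketch of step 2 under stated assumptions: event onsets and durations are read from the Scaper-generated JAMS annotation, and a 1s window is centred on the middle of each event (file names are hypothetical):

import jams
import soundfile as sf

soundscape_wav = "oss_soundscape_0.wav"  # hypothetical OSS file
ann = jams.load("oss_soundscape_0.jams").annotations[0]
sr = sf.info(soundscape_wav).samplerate
for obs in ann.data:
    centre = obs.time + obs.duration / 2.0
    start = int(max(0.0, centre - 0.5) * sr)
    clip, _ = sf.read(soundscape_wav, start=start, stop=start + sr)
    # clip is a ~1 s excerpt centred on this event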
There are five dataset variants (~17GB each), each with a different random assignment of classes to the known and unknown class categories. For further details, refer to our previous paper Multi-label open-set audio classification. In this work, OST dataset variant 1 is referred to as OST for simplicity.
We also introduce a tiny version of the dataset called OST-Tiny, which contains ~20k clips and only 10 known classes. This is convenient for faster prototyping and to evaluate models in a more challenging open-set classification scenario.
License: Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The Free Universal Sound Separation (FUSS) dataset is a database of arbitrary sound mixtures and source-level references, for use in experiments on arbitrary sound separation. FUSS is based on the FSD50K corpus.