FSD50K is an open dataset of human-labeled sound events containing 51,197 Freesound clips unequally distributed in 200 classes drawn from the AudioSet Ontology. FSD50K has been created at the Music Technology Group of Universitat Pompeu Fabra.
Citation
If you use the FSD50K dataset, or part of it, please cite our TASLP paper (available from [arXiv] [TASLP]):
@article{fonseca2022FSD50K,
title={{FSD50K}: an open dataset of human-labeled sound events},
author={Fonseca, Eduardo and Favory, Xavier and Pons, Jordi and Font, Frederic and Serra, Xavier},
journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
volume={30},
pages={829--852},
year={2022},
publisher={IEEE}
}
Paper update: This paper was published in TASLP in early 2022. The accepted camera-ready version includes a number of improvements over the initial submission. The main updates include: an estimation of the amount of label noise in FSD50K, an SNR comparison between FSD50K and AudioSet, an improved description of the evaluation metrics (including equations), clarification of the experimental methodology and some results, and some content moved to the Appendix for readability. The TASLP-accepted camera-ready version is available from arXiv (specifically, v2 on arXiv, which is displayed by default).
Data curators
Eduardo Fonseca, Xavier Favory, Jordi Pons, Mercedes Collado, Ceren Can, Rachit Gupta, Javier Arredondo, Gary Avendano and Sara Fernandez
Contact
You are welcome to contact Eduardo Fonseca should you have any questions, at efonseca@google.com.
ABOUT FSD50K
Freesound Dataset 50k (or FSD50K for short) is an open dataset of human-labeled sound events containing 51,197 Freesound clips unequally distributed in 200 classes drawn from the AudioSet Ontology [1]. FSD50K has been created at the Music Technology Group of Universitat Pompeu Fabra.
What follows is a brief summary of FSD50K's most important characteristics. Please have a look at our paper (especially Section 4) to extend the basic information provided here with relevant details for its usage, as well as discussion, limitations, applications and more.
Basic characteristics:
- 51,197 audio clips unequally distributed across 200 sound classes drawn from the AudioSet Ontology; the class vocabulary is available in vocabulary.csv (see Files section below).
- Dev set: 40,966 clips.
- Eval set: 10,231 clips.
Note: All classes in FSD50K are represented in AudioSet, except Crash cymbal, Human group actions, Human voice, Respiratory sounds, and Domestic sounds, home sounds.
LICENSE
All audio clips in FSD50K are released under Creative Commons (CC) licenses. Each clip has its own license as defined by the clip uploader in Freesound, some of them requiring attribution to their original authors and some forbidding further commercial reuse. Specifically:
The development set consists of 40,966 clips and the evaluation set of 10,231 clips; clip licenses include CC0, CC-BY, CC-BY-NC, and CC Sampling+.
For attribution purposes and to facilitate attribution of these files to third parties, we include a mapping from the audio clips to their corresponding licenses. The licenses are specified in the files dev_clips_info_FSD50K.json and eval_clips_info_FSD50K.json.
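As an illustration, a minimal lookup sketch (the JSON field names, e.g. "license", are assumed from typical Freesound metadata and may differ; inspect one entry first):

import json

# Hypothetical extraction path; adjust to where FSD50K.metadata was unzipped.
with open("FSD50K.metadata/dev_clips_info_FSD50K.json") as f:
    clips_info = json.load(f)

fname = "64760"  # Freesound id used as the key (assumed)
print(clips_info[fname].get("license"))  # e.g. a creativecommons.org URL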
In addition, FSD50K as a whole is the result of a curation process and it has an additional license: FSD50K is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSD50K.doc zip file. We note that the choice of one license for the dataset as a whole is not straightforward as it comprises items with different licenses (such as audio clips, annotations, or data split). The choice of a global license in these cases may warrant further investigation (e.g., by someone with a background in copyright law).
Usage of FSD50K for commercial purposes:
If you'd like to use FSD50K for commercial purposes, please contact Eduardo Fonseca and Frederic Font at efonseca@google.com and frederic.font@upf.edu.
Also, if you are interested in using FSD50K for machine learning competitions, please contact Eduardo Fonseca and Frederic Font at efonseca@google.com and frederic.font@upf.edu.
FILES
FSD50K can be downloaded as a series of zip files with the following directory structure:
root
│
└───FSD50K.dev_audio/                   Audio clips in the dev set
│
└───FSD50K.eval_audio/                  Audio clips in the eval set
│
└───FSD50K.ground_truth/                Files for FSD50K's ground truth
│   │
│   └─── dev.csv                        Ground truth for the dev set
│   │
│   └─── eval.csv                       Ground truth for the eval set
│   │
│   └─── vocabulary.csv                 List of 200 sound classes in FSD50K
│
└───FSD50K.metadata/                    Files for additional metadata
│   │
│   └─── class_info_FSD50K.json         Metadata about the sound classes
│   │
│   └─── dev_clips_info_FSD50K.json     Metadata about the dev clips
│   │
│   └─── eval_clips_info_FSD50K.json    Metadata about the eval clips
│   │
│   └─── pp_pnp_ratings_FSD50K.json     PP/PNP ratings
│   │
│   └─── collection/                    Files for the *sound collection* format
│
└───FSD50K.doc/
    │
    └───README.md                       The dataset description file that you are reading
    │
    └───LICENSE-DATASET                 License of the FSD50K dataset as an entity
Each row (i.e. audio clip) of dev.csv contains the following information:
- fname: the file name without the .wav extension, e.g., the fname 64760 corresponds to the file 64760.wav on disk. This number is the Freesound id. We always use Freesound ids as filenames.
- labels: the class labels (i.e., the ground truth). Note these
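For example, a minimal sketch (not part of the official release) that loads the dev ground truth with pandas, assuming the zip files have been extracted under a common root directory:

import os
import pandas as pd

root = "FSD50K"  # wherever the zip files were extracted (assumed path)
dev = pd.read_csv(os.path.join(root, "FSD50K.ground_truth", "dev.csv"))
# Build the path to each audio clip from its Freesound id.
dev["path"] = dev["fname"].astype(str).apply(
    lambda f: os.path.join(root, "FSD50K.dev_audio", f + ".wav"))
print(dev[["fname", "labels", "path"]].head())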
License: Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
FSD50K: An open dataset of human-labeled sound events
This is a mirror of the FSD50K sound event dataset. The original files were converted from WAV to Opus to reduce the size and accelerate streaming.
Sampling rate: 48 kHz
Channels: 1
Format: Opus
Splits: Dev: 80 hours, 40,966 clips. Eval: 28 hours, 10,231 clips.
License: FSD50K is released under CC-BY. However, each clip has its own licence. Clip licenses include CC0, CC-BY, CC-BY-NC and CC Sampling+. Clip licenses are specified… See the full description on the dataset page: https://huggingface.co/datasets/philgzl/fsd50k.
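As a rough usage sketch (the split name is an assumption; check the dataset page for the exact configuration), the mirror can be streamed with the Hugging Face datasets library:

from datasets import load_dataset

ds = load_dataset("philgzl/fsd50k", split="eval", streaming=True)  # split name assumed
for example in ds.take(2):
    print(example.keys())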
License: Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This dataset is the FSD50K dataset, formatted in the WebDataset format. WebDataset files are essentially tar archives, where each example in the dataset is represented by a pair of files: a WAV audio file and a corresponding JSON metadata file. The JSON file contains the class label and other relevant information for that particular audio sample.
$ tar tvf fsdk50_eval_0000000.tar |head
-r--r--r-- bigdata/bigdata 40 2025-01-12 13:02 45604.json
-r--r--r-- bigdata/bigdata 43066 2025-01-12 13:02 45604.wav
-r--r--r-- bigdata/bigdata 46 2025-01-12 13:02 213293.json
-r--r--r-- bigdata/bigdata 1372242 2025-01-12 13:02 213293.wav
-r--r--r-- bigdata/bigdata 82 2025-01-12 13:02 348174.json
-r--r--r-- bigdata/bigdata 804280 2025-01-12 13:02 348174.wav
-r--r--r-- bigdata/bigdata 71 2025-01-12 13:02 417736.json
-r--r--r-- bigdata/bigdata 2238542 2025-01-12 13:02 417736.wav
-r--r--r-- bigdata/bigdata 43 2025-01-12 13:02 235555.json
-r--r--r-- bigdata/bigdata 542508 2025-01-12 13:02 235555.wav
$ tar -xOf fsdk50_eval_0000000.tar 45604.json
{"soundevent": "Yell;Shout;Human_voice"}
Dataset Card for "FSD50K"
More Information needed
License: Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
ARCA23K is a dataset of labelled sound events created to investigate real-world label noise. It contains 23,727 audio clips originating from Freesound, and each clip belongs to one of 70 classes taken from the AudioSet ontology. The dataset was created using an entirely automated process with no manual verification of the data. For this reason, many clips are expected to be labelled incorrectly.
In addition to ARCA23K, this release includes a companion dataset called ARCA23K-FSD, which is a single-label subset of the FSD50K dataset. ARCA23K-FSD contains the same sound classes as ARCA23K and the same number of audio clips per class. As it is a subset of FSD50K, each clip and its label have been manually verified. Note that only the ground truth data of ARCA23K-FSD is distributed in this release. To download the audio clips, please visit the Zenodo page for FSD50K.
A paper has been published detailing how the dataset was constructed. See the Citing section below.
The source code used to create the datasets is available at https://github.com/tqbl/arca23k-dataset.
Characteristics
Sound Classes
The list of sound classes is given below. They are grouped based on the top-level superclasses of the AudioSet ontology.
Music
Sounds of things
Natural sounds
Human sounds
Animal
Source-ambiguous sounds
License and Attribution
This release is licensed under the Creative Commons Attribution 4.0 International License.
The audio clips distributed as part of ARCA23K were sourced from Freesound and have their own Creative Commons license. The license information and attribution for each audio clip can be found in ARCA23K.metadata/train.json, which also includes the original Freesound URLs.
The files under ARCA23K-FSD.ground_truth/ are an adaptation of the ground truth data provided as part of FSD50K, which is licensed under the Creative Commons Attribution 4.0 International License. The curators of FSD50K are Eduardo Fonseca, Xavier Favory, Jordi Pons, Mercedes Collado, Ceren Can, Rachit Gupta, Javier Arredondo, Gary Avendano, and Sara Fernandez.
Citing
If you wish to cite this work, please cite the following paper:
T. Iqbal, Y. Cao, A. Bailey, M. D. Plumbley, and W. Wang, “ARCA23K: An audio dataset for investigating open-set label noise”, in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021), 2021, Barcelona, Spain, pp. 201–205.
BibTeX:
@inproceedings{Iqbal2021,
author = {Iqbal, T. and Cao, Y. and Bailey, A. and Plumbley, M. D. and Wang, W.},
title = {{ARCA23K}: An audio dataset for investigating open-set label noise},
booktitle = {Proceedings of the Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021)},
pages = {201--205},
year = {2021},
address = {Barcelona, Spain},
}
License: Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
DESCRIPTION:
This audio dataset serves as supplementary material for the DCASE2024 Challenge Task 3: Audio and Audiovisual Sound Event Localization and Detection with Distance Estimation. The dataset consists of synthetic spatial audio mixtures of sound events spatialized for two different spatial formats using real room impulse responses (RIRs) measured in various spaces of Tampere University (TAU). The mixtures are generated using the same process as the one used to generate the recordings of the TAU-NIGENS Spatial Sound Scenes 2021 dataset for the DCASE2021 Challenge Task 3.
The SELD task setup in DCASE2024 is based on spatial recordings of real scenes, captured in the STARSS23 dataset. Since the task setup allows the use of external data, these synthetic mixtures serve as additional training material for the baseline model. For more details on the task setup, please refer to the task description.
Note that the generator code and the collection of room responses used to spatialize the sound samples will also be made available soon. For more details on the recording of RIRs, spatialization, and generation, see the reference available here.
SPECIFICATIONS:
DOWNLOAD INSTRUCTIONS:
Download the zip files and use your preferred compression tool to unzip these split zip files. To extract a split zip archive (named as zip, z01, z02, ...), you could use, for example, the following syntax in Linux or OSX terminal:
zip -s 0 split.zip --out single.zip
unzip single.zip
License: Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Created by
Yu Wang, Nicholas J. Bryan, Justin Salamon, Mark Cartwright, and Juan Pablo Bello
Publication
If using this data in academic work, please cite the following paper, which presented this dataset:
Y. Wang, N. J. Bryan, J. Salamon, M. Cartwright, and J. P. Bello. "Who calls the shots? Rethinking Few-shot Learning for Audio", IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2021
Description
FSD-MIX-CLIPS is an open dataset of programmatically mixed audio clips with a controlled level of polyphony and signal-to-noise ratio. We use single-labeled clips from FSD50K as the source material for the foreground sound events and Brownian noise as the background to generate 281,039 10-second strongly-labeled soundscapes with Scaper. We refer to this (intermediate) dataset of 10s soundscapes as FSD-MIX-SED. Each soundscape contains n events from n different sound classes, where n ranges from 1 to 5. We then extract 614,533 1s clips centered on each sound event in the soundscapes of FSD-MIX-SED to produce FSD-MIX-CLIPS.
Source material and annotations
Due to the large size of the dataset, instead of releasing the raw audio files, we release the source material and soundscape annotations in JAMS format, which can be used to reproduce FSD-MIX-SED using Scaper with the script in the project repository.
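For instance, a minimal sketch (not the official script) of re-rendering one soundscape from its JAMS annotation with Scaper; the file names and extraction paths below are hypothetical:

import scaper

jams_file = "FSD_MIX_SED.annotations/base/train/soundscape_0.jams"  # hypothetical name
scaper.generate_from_jams(
    jams_file,
    audio_outfile="soundscape_0.wav",
    fg_path="FSD_MIX_SED.source/foreground",  # where the source material was extracted
    bg_path="FSD_MIX_SED.source/background",
)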
All clips in FSD-MIX-CLIPS are extracted from FSD-MIX-SED. Therefore, for FSD-MIX-CLIPS, instead of releasing duplicated audio content, we provide annotations that specify the filename in FSD-MIX-SED and the corresponding starting time (in second) of each 1-second clip.
Foreground material from FSD50K
We choose clips shorter than 4s that have a single validated label with the Present and Predominant annotation type. We further trim the silence at the edges of each clip. The resulting subset contains clips that each carry a single, strong label. The 200 sound classes in FSD50K are hierarchically organized. We focus on the leaf nodes and rule out classes with fewer than 20 single-labeled clips. This gives us 89 sound classes. vocab.json contains the list of 89 classes; each class is labeled by its index in the list.
Data splits
FSD-MIX-CLIPS was originally generated for the task of multi-label audio classification under a few-shot continual learning setup. Therefore, the classes are split into disjoint sets of base and novel classes, where novel-class data are only used at inference time. We partition the 89 classes into three splits: base, novel-val, and novel-test with 59, 15, and 15 classes, respectively. Base-class data are used for both training and evaluation, while novel-val/novel-test class data are used for validation/testing only.
Files
FSD_MIX_SED.source.tar.gz contains the background Brownian noise and 10,296 single-labeled sound events from FSD50K in .wav format. The original file size is 1.9GB.
FSD_MIX_SED.annotations.tar.gz contains 281,039 JAMS files. The original file size is 35GB.
FSD_MIX_CLIPS.annotations.tar.gz contains ground truth labels for 1-second clips in each data split in FSD_MIX_SED, specified by filename and starting time (sec).
vocab.json contains the 89 classes.
Foreground sound materials and soundscape annotations in FSD_MIX_SED are organized in a similar folder structure following the data splits:
root folder
│
└───base/ Base classes (label 0-58)
│ │
│ └─── train/
│ │ │
│ │ └─── audio or annotation files
│ │
│ └─── val/
│ │ │
│ │ └─── audio or annotation files
│ │
│ └─── test/
│ │
│ └─── audio or annotation files
│
│
└───val/ Novel-val classes (label 59-73)
│ │
│ └─── audio or annotation files
│
│
└───test/ Novel-test classes (label 74-88)
│
└─── audio or annotation files
References
[1] Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, Xavier Serra. "FSD50K: an Open Dataset of Human-Labeled Sound Events", arXiv:2010.00475, 2020.
License: Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
FSD-FS is a publicly-available database of human-labelled sound events for few-shot learning. It spans 143 classes obtained from the AudioSet Ontology and contains 43,805 raw audio files collected from FSD50K. FSD-FS is curated at the Centre for Digital Music, Queen Mary University of London.
Citation
If you use the FSD-FS dataset, please cite our paper and FSD50K.
@article{liang2022learning,
title={Learning from Taxonomy: Multi-label Few-Shot Classification for Everyday Sound Recognition},
author={Liang, Jinhua and Phan, Huy and Benetos, Emmanouil},
journal={arXiv preprint arXiv:2212.08952},
year={2022}
}
@ARTICLE{9645159,
  author={Fonseca, Eduardo and Favory, Xavier and Pons, Jordi and Font, Frederic and Serra, Xavier},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  title={{FSD50K}: An Open Dataset of Human-Labeled Sound Events},
  year={2022},
  volume={30},
  pages={829--852},
  doi={10.1109/TASLP.2021.3133208}
}
About FSD-FS
FSD-FS is an open database for multi-label few-shot audio classification containing 143 classes drawn from FSD50K; it also inherits the AudioSet Ontology. FSD-FS follows a 7:2:1 ratio to split classes into base, validation, and evaluation sets, so there are 98 classes in the base set, 30 classes in the validation set, and 15 classes in the evaluation set (more details can be found in our paper).
LICENSE
FSD-FS is released under Creative Commons (CC) licenses. As with FSD50K, each clip has its own license as defined by the clip uploader in Freesound, some of them requiring attribution to their original authors and some forbidding further commercial reuse. For more details, please refer to the link.
FILES
FSD-FS is organised in the following structure:
root
|
└─── dev_base
|
└─── dev_val
|
└─── eval
REFERENCES AND LINKS
[1] Gemmeke, Jort F., et al. "Audio set: An ontology and human-labeled dataset for audio events." 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2017. [paper] [link]
[2] Fonseca, Eduardo, et al. "Fsd50k: an open dataset of human-labeled sound events." IEEE/ACM Transactions on Audio, Speech, and Language Processing 30 (2021): 829-852. [paper] [code]
The Free Universal Sound Separation (FUSS) Dataset is a database of arbitrary sound mixtures and source-level references, for use in experiments on arbitrary sound separation.
This is the official sound separation data for the DCASE2020 Challenge Task 4: Sound Event Detection and Separation in Domestic Environments.
Overview: FUSS audio data is sourced from a pre-release of the Freesound Dataset known as FSD50K, a sound event dataset composed of Freesound content annotated with labels from the AudioSet Ontology. Using the FSD50K labels, these source files have been screened such that they likely only contain a single type of sound. Labels are not provided for these source files, and are not considered part of the challenge. For the purpose of the DCASE Task 4 Sound Separation and Event Detection challenge, systems should not use FSD50K labels, even though they may become available upon FSD50K release.
To create mixtures, 10 second clips of sources are convolved with simulated room impulse responses and added together. Each 10 second mixture contains between 1 and 4 sources. Source files longer than 10 seconds are considered "background" sources. Every mixture contains one background source, which is active for the entire duration. We provide: a software recipe to create the dataset, the room impulse responses, and the original source audio.
To use this dataset:
import tensorflow_datasets as tfds

# Load the train split and print a few examples.
ds = tfds.load('fuss', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
This dataset was created by Bao Tran Tong
This dataset was created by Anirudh Vignesh
License: Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
GISE-51 is an open dataset of 51 isolated sound events based on the FSD50K dataset. The release also includes the GISE-51-Mixtures subset, a dataset of 5-second soundscapes with up to three sound events synthesized from GISE-51. The GISE-51 release attempts to address some of the shortcomings of recent sound event datasets, providing an open, reproducible benchmark for future research and the freedom to adapt the included isolated sound events for domain-specific applications, which was not possible with existing large-scale weakly labelled datasets. The GISE-51 release also includes accompanying code for baseline experiments, which can be found at https://github.com/SarthakYadav/GISE-51-pytorch.
Citation
If you use the GISE-51 dataset and/or the released code, please cite our paper:
Sarthak Yadav and Mary Ellen Foster, "GISE-51: A scalable isolated sound events dataset", arXiv:2103.12306, 2021
Since GISE-51 is based on FSD50K, if you use GISE-51 kindly also cite the FSD50K paper:
Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, Xavier Serra. "FSD50K: an Open Dataset of Human-Labeled Sound Events", arXiv:2010.00475, 2020.
About GISE-51 and GISE-51-Mixtures
The following sections summarize key characteristics of the GISE-51 and the GISE-51-Mixtures datasets, including details left out from the paper.
GISE-51
- See meta/lbl_map.csv for the complete vocabulary.
- silence_thresholds.txt lists class bins and their corresponding volume threshold. Files that were determined by sox to contain no audio at all were manually clipped. Code for performing silence filtering can be found in scripts/strip_silence_sox.py in the code repository.

GISE-51-Mixtures
LICENSE
All audio clips (i.e., found in isolated_events.tar.gz) used in the preparation of the Glasgow Isolated Events Dataset (GISE-51) are designated Creative Commons and were obtained from FSD50K. The source data in isolated_events.tar.gz is based on the FSD50K dataset, which is licensed as Creative Commons Attribution 4.0 International (CC BY 4.0) License.
GISE-51 dataset (including GISE-51-Mixtures) is a curated, processed and generated preparation, and is released under Creative Commons Attribution 4.0 International (CC BY 4.0) License. The license is specified in the LICENSE-DATASET file in license.tar.gz.
Baselines
Several sound event recognition experiments were conducted, establishing baseline performance on several prominent convolutional neural network architectures. The experiments are described in Section 4 of our paper, and the implementation for reproducing these experiments is available at https://github.com/SarthakYadav/GISE-51-pytorch.
Files
GISE-51 is available as a collection of several tar archives. All audio files are PCM 16-bit, 22050 Hz. The following lists the contents of these archives in detail:
- isolated_events.tar.gz: the core GISE-51 isolated events dataset, containing train, val and eval subfolders.
- meta.tar.gz: contains lbl_map.json.
- noises.tar.gz: contains background noises used for GISE-51-Mixtures soundscape generation.
- mixtures_jams.tar.gz: contains annotation files in .jams format that, alongside isolated_events.tar.gz and noises.tar.gz, can be reused to generate the exact GISE-51-Mixtures soundscapes. (Optional; we provide the complete set of GISE-51-Mixtures soundscapes as independent tar archives.)
- train.tar.gz: GISE-51-Mixtures train set, containing 60k synthetic soundscapes.
- val.tar.gz: GISE-51-Mixtures val set, containing 10k synthetic soundscapes.
- eval.tar.gz: GISE-51-Mixtures eval set, containing 10k synthetic soundscapes.
- train_*.tar.gz: tar archives containing training mixtures with a varying number of soundscapes, used primarily in Section 4.1 of the paper, which compares val mAP performance vs. the number of training soundscapes. A helper script, prepare_mixtures_lmdb.sh, is provided in the code release to prepare data for the experiments in Section 4.1.
- pretrained-models.tar.gz: contains model checkpoints for all experiments conducted in the paper, including state_dicts for use with transfer learning experiments. More information on these checkpoints can be found in the code release README.
- license.tar.gz: contains dataset license info.
- silence_thresholds.txt: contains volume thresholds for various sound event bins used for silence filtering.

Contact
In case of queries and clarifications, feel free to contact Sarthak at s.yadav.2@research.gla.ac.uk. (Adding [GISE-51] to the subject of the email would be appreciated!)
License: Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Created by
Yu Wang, Mark Cartwright, and Juan Pablo Bello
Publication
If using this data in academic work, please cite the following paper, which presented this dataset:
Y. Wang, M. Cartwright, and J. P. Bello. "Active Few-Shot Learning for Sound Event Detection", INTERSPEECH, 2022
Description
SONYC-FSD-SED is an open dataset of programmatically mixed audio clips that simulates audio data in an environmental sound monitoring system, where sound class occurrences and co-occurrences exhibit seasonal periodic patterns. We use recordings collected from the Sounds of New York City (SONYC) acoustic sensor network as backgrounds, and single-labeled clips from the FSD50K dataset as foreground events, to generate 576,591 10-second strongly-labeled soundscapes with Scaper (including 111,294 additional test data for the sampling-window experiment). Instead of sampling foreground sound events uniformly, we simulate the occurrence probability of each class at different times in a year, creating more realistic temporal characteristics.
Source material and annotations
Due to the large size of the dataset, instead of releasing the raw audio files, we release the source material and soundscape annotations in JAMS format, which can be used to reproduce SONYC-FSD-SED using Scaper with the script in the project repository.
Background material from SONYC recordings
We pick a sensor from the SONYC sensor network and subsample from recordings it collected within a year (2017). We categorize these ∼550k 10-second clips into 96 bins based on timestamps, where each bin represents a unique combination of the month of the year, day of the week (weekday or weekend), and time of day (divided into four 6-hour blocks). Next, we run a pre-trained urban sound event classifier over all recordings and filter out clips with active sound classes. We do not filter out footstep and bird since they appear too frequently; instead, we remove these two classes from the foreground sound material. Then, from each bin, we choose the clip with the lowest sound pressure level, yielding 96 background clips.
Foreground material from FSD50K
We follow the same filtering process as in FSD-MIX-SED to get the subset of FSD50K with short single-labeled clips. In addition, we remove two classes, "Chirp_and_tweet" and "Walk_and_footsteps", that exist in our SONYC background recordings. This results in 87 sound classes. vocab.json contains the list of 87 classes, each class is then labeled by its index in the list. 0-42: train, 43-56: val, 57-86: test.
Occurrence probability modelling
For each class, we model its occurrence probability within a year. We use von Mises probability density functions to simulate the probability distribution over different weeks in a year and hours in a day, considering their cyclic characteristics: f(x | μ, κ) = exp(κ cos(x − μ)) / (2π I₀(κ)), where I₀(κ) is the modified Bessel function of order 0, and μ and 1/κ are analogous to the mean and variance of the normal distribution. We randomly sample (μ_year, μ_day) from [−π, π] and (κ_year, κ_day) from [0, 10]. We also randomly assign p_weekday ∈ [0, 1] and p_weekend = 1 − p_weekday to simulate the probability distribution over different days in a week. Finally, we get the probability distribution over the entire year with a 1-hour resolution. At a given timestamp, we integrate f_year and f_day over the 1-hour window and multiply them together with p_weekday or p_weekend depending on the day. To speed up the subsequent sampling process, we scale the final probability distribution using a temperature parameter randomly sampled from [2, 3].
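A minimal sketch of this scheme using scipy (the parameter values and the pdf-times-window approximation of the integral are illustrative assumptions, not the released code):

import numpy as np
from scipy.stats import vonmises

rng = np.random.default_rng(0)
mu_year, mu_day = rng.uniform(-np.pi, np.pi, size=2)
kappa_year, kappa_day = rng.uniform(1e-3, 10, size=2)  # avoid the degenerate kappa = 0
p_weekday = rng.uniform(0, 1)
p_weekend = 1.0 - p_weekday

def occurrence_prob(hour_of_year, is_weekday):
    # Unnormalised occurrence probability for one class in a 1-hour window.
    hours_per_year = 365 * 24
    theta_year = 2 * np.pi * (hour_of_year % hours_per_year) / hours_per_year - np.pi
    theta_day = 2 * np.pi * (hour_of_year % 24) / 24 - np.pi
    # Approximate the integral over the 1-hour window by pdf * window width.
    f_year = vonmises.pdf(theta_year, kappa_year, loc=mu_year) * (2 * np.pi / hours_per_year)
    f_day = vonmises.pdf(theta_day, kappa_day, loc=mu_day) * (2 * np.pi / 24)
    return f_year * f_day * (p_weekday if is_weekday else p_weekend)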
Files
SONYC_FSD_SED.source.tar.gz: 96 SONYC backgrounds and 10,158 foreground sounds in .wav format. The original file size is 2GB.
SONYC_FSD_SED.annotations.tar.gz: 465,467 JAMS files. The original file size is 57GB.
SONYC_FSD_SED_add_test.annotations.tar.gz: 111,294 JAMS files for additional test data. The original file size is 14GB.
vocab.json: 87 classes.
occ_prob_per_cl.pkl: Occurrence probability for each foreground sound class.
References
[1] J. P. Bello, C. T. Silva, O. Nov, R. L. DuBois, A. Arora, J. Salamon, C. Mydlarz, and H. Doraiswamy, “SONYC: A system for monitoring, analyzing, and mitigating urban noise pollution,” Commun. ACM, 2019
[2] E. Fonseca, X. Favory, J. Pons, F. Font, X. Serra. "FSD50K: an Open Dataset of Human-Labeled Sound Events", arXiv:2010.00475, 2020.
License: Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
L3DAS21: MACHINE LEARNING FOR 3D AUDIO SIGNAL PROCESSING
IEEE MLSP Data Challenge 2021
SCOPE OF THE CHALLENGE
The L3DAS21 Challenge for IEEE MLSP 2021 aims at encouraging and fostering research on machine learning for 3D audio signal processing. In multi-speaker scenarios it is very important to properly understand the nature of a sound event and its position within the environment, what the content of the sound signal is, and how best to leverage it for a specific application (e.g., teleconferencing rather than assistive listening or entertainment, among others). To this end, the L3DAS21 Challenge presents two tasks: 3D Speech Enhancement and 3D Sound Event Localization and Detection, both relying on first-order Ambisonics recordings in a reverberant office environment.
Each task involves 2 separate tracks: 1-mic and 2-mic recordings, respectively containing sounds acquired by one Ambisonics microphone and by an array of two Ambisonics microphones. The use of two first-order Ambisonics microphones definitely represents one of the main novelties of the L3DAS21 Challenge.
Task 1: 3D Speech Enhancement. The objective of this task is the enhancement of speech signals immersed in the spatial sound field of a reverberant office environment. Here the models are expected to extract the monophonic voice signal from the 3D mixture containing various background noises. The evaluation metric for this task is a combination of the short-time objective intelligibility (STOI) and word error rate (WER).
Task 2: 3D Sound Event Localization and Detection. The aim of this task is to detect the temporal activities of a known set of sound event classes and, in particular, to further locate them in space. Here the models must predict a list of the active sound events and their respective locations at regular intervals of 100 milliseconds. Performance on this task is evaluated according to the location-sensitive detection error, which combines the localization and detection errors.
DATASETS
The L3DAS21 datasets contain multiple-source and multiple-perspective B-format Ambisonics audio recordings. We sampled the acoustic field of a large office room, placing two first-order Ambisonics microphones in the center of the room and moving a speaker reproducing the analytic signal in 252 fixed spatial positions. Relying on the collected Ambisonics impulse responses (IRs), we augmented existing clean monophonic datasets to obtain synthetic three-dimensional sound sources by convolving the original sounds with our IRs. We extracted speech signals from the Librispeech dataset and office-like background noises from the FSD50K dataset. We aimed at creating plausible and varied 3D scenarios to reflect possible real-life situations in which sound and disparate types of background noise coexist in the same 3D reverberant environment. We provide normalized raw waveforms as predictor data; the target data varies according to the task.
The dataset is divided into two main sections, respectively dedicated to the two challenge tasks.
The first section is optimized for 3D Speech Enhancement and contains more than 30,000 virtual 3D audio environments with a duration of up to 10 seconds. In each sample, a spoken voice is always present alongside other office-like background noises. As target data for this section we provide the clean monophonic voice signals.
The other section, instead, is dedicated to the 3D Sound Event Localization and Detection task and contains 900 60-second-long audio files. Each data point contains a simulated 3D office audio environment in which up to 3 simultaneous acoustic events may be active at the same time. In this section, the samples are not forced to contain a spoken voice. As target data for this section we provide a list of the onset and offset time stamps, the class, and the spatial coordinates of each individual sound event present in the data points.
We split both dataset sections into a training set (44 hours for SE and 600 hours for SELD) and a test set (6 hours for SE and 5 hours for SELD), paying attention to creating similar distributions. The train set of the SE section is divided into two partitions, train360 and train100, which contain speech samples extracted from the corresponding partitions of Librispeech (only samples up to 10 seconds). All sets of the SELD section are divided into OV1, OV2 and OV3. These partitions refer to the maximum number of overlapping sounds, which is 1, 2 or 3, respectively.
The evaluation test datasets can be downloaded here:
L3DAS21_Task1_test.zip
L3DAS21_Task2_test.zip
CHALLENGE WEBSITE AND CONTACTS
L3DAS21 Challenge Website: www.l3das.com/mlsp2021
GitHub repository: github.com/l3das/L3DAS21
Paper: arxiv.org/abs/2104.05499
IEEE MLSP 2021: 2021.ieeemlsp.org/
Email contact: l3das@uniroma1.it
Twitter: https://twitter.com/das_l3
Zeroshot-Audio-Classification-Instructions
Converts audio classification datasets into zero-shot-format speech instructions, supporting both single-label and multi-label:
VGGSound FSD50k Nonspeech7k urbansound8K VocalSound Emotion Gender ESD Emotion Age Language TAU Urban Acoustic Scenes 2022 CochlScene BirdCLEF_2021 EmoBox AudioSet
We also converted huge WAV files into MP3 at a 16k sample rate to reduce storage size. To prevent leakage, please do not include the test set in the training session.… See the full description on the dataset page: https://huggingface.co/datasets/mesolitica/Zeroshot-Audio-Classification-Instructions.
License: MIT, https://opensource.org/licenses/MIT
🎧 Auditory Scene Analysis 2 (ASA2) Dataset
We constructed a new dataset for multichannel USS and polyphonic audio classification tasks. The proposed dataset is designed to reflect various conditions, including moving sources with temporal onsets and offsets. For foreground sound sources, signals from 13 audio classes were selected from open-source databases (Pixabay¹, FSD50K, Librispeech, MUSDB18, Vocalsound). These signals were resampled to 16 kHz and pre-processed by either padding zeros… See the full description on the dataset page: https://huggingface.co/datasets/donghoney22/ASA2_dataset.
License: Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Introduction:
Divide and Remaster (DnR) is a source separation dataset for training and testing algorithms that separate a monaural audio signal into speech, music, and sound effects/background stems. The dataset is composed of artificial mixtures using audio from LibriSpeech, the Free Music Archive (FMA), and the Freesound Dataset 50k (FSD50K). We introduce it as part of the Cocktail Fork Problem paper.
At a Glance:
The size of the unzipped dataset is ~174GB
Each mixture is 60 seconds long and sources are not fully overlapped
Audio is encoded as 16-bit .wav files at a sampling rate of 44.1 kHz
The data is split into training tr (3,295 mixtures), validation cv (440 mixtures) and testing tt (652 mixtures) subsets
The directory for each mixture contains four .wav files (mix.wav, music.wav, speech.wav, sfx.wav) and annots.csv, which contains the metadata for the original audio used to compose the mixture (transcriptions for speech, sound classes for sfx, and genre labels for music)
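A minimal loading sketch under the layout described above (the mixture directory name is hypothetical):

import os
import pandas as pd
import soundfile as sf

mix_dir = "dnr/tt/0000"  # hypothetical mixture directory
stems = {}
for stem in ("mix", "music", "speech", "sfx"):
    audio, sr = sf.read(os.path.join(mix_dir, stem + ".wav"))  # 44.1 kHz, 16-bit wav
    stems[stem] = audio
annots = pd.read_csv(os.path.join(mix_dir, "annots.csv"))
print({k: v.shape for k, v in stems.items()}, annots.columns.tolist())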
Other Resources:
Demo examples and additional information are available at: https://cocktail-fork.github.io/
For more details about the data generation process, the code used to generate our dataset can be found at the following: https://github.com/darius522/dnr-utils
Contact and Support:
Have an issue, concern, or question about DnR ? If so, please open an issue here.
For any other inquiries, feel free to shoot an email at: firstname.lastname@gmail.com, my name is Darius Petermann ;)
Citation:
If you use DnR please cite our paper in which we introduce the dataset as part of the Cocktail Fork Problem:
@article{Petermann2021cocktail,
  title={The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks},
  author={Darius Petermann and Gordon Wichern and Zhong-Qiu Wang and Jonathan {Le Roux}},
  year={2021},
  journal={arXiv preprint arXiv:2110.09958},
  archivePrefix={arXiv},
  primaryClass={eess.AS}
}
License: Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Open-set Tagging (OST) is a synthetic dataset of 1s clips used to evaluate source-centric representation learning models in the paper Compositional Audio Representation Learning.
Due to the size of the dataset, we only share the source files; the scripts to generate the dataset are available here.
The dataset generation process is as follows:
1. From single-source FSD50K audio files, we generate a dataset of 10s soundscapes called Open-set Soundscapes (OSS) using Scaper.
2. We then center a 1s window around the center of each sound event in the 10s soundscapes to generate Open-set Tagging (OST), which contains ~500k clips.
If you are not going to use OSS, you can choose to synthesize it without audio; this will synthesize only the JAMS annotation files needed for the 1s clips. Using the OSS JAMS files, OST clips can be generated deterministically.
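A minimal sketch of step 2 under stated assumptions: event onsets and durations are read from the Scaper-generated JAMS annotation, and a 1s window is centred on the middle of each event (file names are hypothetical):

import jams
import soundfile as sf

soundscape_wav = "oss_soundscape_0.wav"  # hypothetical OSS file
ann = jams.load("oss_soundscape_0.jams").annotations[0]
sr = sf.info(soundscape_wav).samplerate
for obs in ann.data:
    centre = obs.time + obs.duration / 2.0
    start = int(max(0.0, centre - 0.5) * sr)
    clip, _ = sf.read(soundscape_wav, start=start, stop=start + sr)
    # clip is a ~1 s excerpt centred on this event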
There are five dataset variants (~17GB each), each with a different random assignment of classes to the known and unknown class categories. For further details, refer to our previous paper Multi-label open-set audio classification. In this work, OST dataset variant 1 is referred to as OST for simplicity.
We also introduce a tiny version of the dataset called OST-Tiny, which contains ~20k clips and only 10 known classes. This is convenient for faster prototyping and to evaluate models in a more challenging open-set classification scenario.
License: Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The Free Universal Sound Separation (FUSS) dataset is a database of arbitrary sound mixtures and source-level references, for use in experiments on arbitrary sound separation. FUSS is based on the FSD50K corpus.