23 datasets found
  1. FSD50K

    • zenodo.org
    • opendatalab.com
    +2 more
    bin, zip
    Updated Apr 24, 2022
    + more versions
    Cite
    Eduardo Fonseca; Xavier Favory; Jordi Pons; Frederic Font; Xavier Serra (2022). FSD50K [Dataset]. http://doi.org/10.5281/zenodo.4060432
    Explore at:
    Available download formats: zip, bin
    Dataset updated
    Apr 24, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Eduardo Fonseca; Xavier Favory; Jordi Pons; Frederic Font; Xavier Serra
    Description

    FSD50K is an open dataset of human-labeled sound events containing 51,197 Freesound clips unequally distributed in 200 classes drawn from the AudioSet Ontology. FSD50K has been created at the Music Technology Group of Universitat Pompeu Fabra.

    Citation

    If you use the FSD50K dataset, or part of it, please cite our TASLP paper (available from [arXiv] [TASLP]):

    @article{fonseca2022FSD50K,
     title={{FSD50K}: an open dataset of human-labeled sound events},
     author={Fonseca, Eduardo and Favory, Xavier and Pons, Jordi and Font, Frederic and Serra, Xavier},
     journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
     volume={30},
     pages={829--852},
     year={2022},
     publisher={IEEE}
    }
    

    Paper update: This paper has been published in TASLP at the beginning of 2022. The accepted camera-ready version includes a number of improvements with respect to the initial submission. The main updates include: estimation of the amount of label noise in FSD50K, SNR comparison between FSD50K and AudioSet, improved description of evaluation metrics including equations, clarification of experimental methodology and some results, some content moved to Appendix for readability. The TASLP-accepted camera-ready version is available from arXiv (in particular, it is v2 in arXiv, displayed by default).

    Data curators

    Eduardo Fonseca, Xavier Favory, Jordi Pons, Mercedes Collado, Ceren Can, Rachit Gupta, Javier Arredondo, Gary Avendano and Sara Fernandez

    Contact

    You are welcome to contact Eduardo Fonseca should you have any questions, at efonseca@google.com.

    ABOUT FSD50K

    Freesound Dataset 50k (or FSD50K for short) is an open dataset of human-labeled sound events containing 51,197 Freesound clips unequally distributed in 200 classes drawn from the AudioSet Ontology [1]. FSD50K has been created at the Music Technology Group of Universitat Pompeu Fabra.

    What follows is a brief summary of FSD50K's most important characteristics. Please have a look at our paper (especially Section 4) to extend the basic information provided here with relevant details for its usage, as well as discussion, limitations, applications and more.

    Basic characteristics:

    • FSD50K contains 51,197 audio clips from Freesound, totalling 108.3 hours of multi-labeled audio
    • The dataset encompasses 200 sound classes (144 leaf nodes and 56 intermediate nodes) hierarchically organized with a subset of the AudioSet Ontology.
    • The audio content is composed mainly of sound events produced by physical sound sources and production mechanisms, including human sounds, sounds of things, animals, natural sounds, musical instruments and more. The vocabulary can be inspected in vocabulary.csv (see Files section below).
    • The acoustic material has been manually labeled by humans following a data labeling process using the Freesound Annotator platform [2].
    • Clips are of variable length from 0.3 to 30s, due to the diversity of the sound classes and the preferences of Freesound users when recording sounds.
    • All clips are provided as uncompressed PCM 16 bit 44.1 kHz mono audio files.
    • Ground truth labels are provided at the clip-level (i.e., weak labels).
    • The dataset poses mainly a large-vocabulary multi-label sound event classification problem, but also allows development and evaluation of a variety of machine listening approaches (see Sec. 4D in our paper).
    • In addition to audio clips and ground truth, additional metadata is made available (including raw annotations, sound predominance ratings, Freesound metadata, and more), allowing a variety of analyses and sound event research tasks (see Files section below).
    • The audio clips are grouped into a development (dev) set and an evaluation (eval) set such that the two sets do not share clips from the same Freesound uploader.

    Dev set:

    • 40,966 audio clips totalling 80.4 hours of audio
    • Avg duration/clip: 7.1s
    • 114,271 smeared labels (i.e., labels propagated in the upwards direction to the root of the ontology)
    • Labels are correct but could be occasionally incomplete
    • A train/validation split is provided (Sec. 3H). If a different split is used, it should be specified for reproducibility and fair comparability of results (see Sec. 5C of our paper)

    Eval set:

    • 10,231 audio clips totalling 27.9 hours of audio
    • Avg duration/clip: 9.8s
    • 38,596 smeared labels
    • Eval set is labeled exhaustively (labels are correct and complete for the considered vocabulary)

    Note: All classes in FSD50K are represented in AudioSet, except Crash cymbal, Human group actions, Human voice, Respiratory sounds, and Domestic sounds, home sounds.

    LICENSE

    All audio clips in FSD50K are released under Creative Commons (CC) licenses. Each clip has its own license as defined by the clip uploader in Freesound, some of them requiring attribution to their original authors and some forbidding further commercial reuse. Specifically:

    The development set consists of 40,966 clips with the following licenses:

    • CC0: 14,959
    • CC-BY: 20,017
    • CC-BY-NC: 4,616
    • CC Sampling+: 1,374

    The evaluation set consists of 10,231 clips with the following licenses:

    • CC0: 4,914
    • CC-BY: 3,489
    • CC-BY-NC: 1,425
    • CC Sampling+: 403

    To facilitate attribution of these files to third parties, we include a mapping from the audio clips to their corresponding licenses. The licenses are specified in the files dev_clips_info_FSD50K.json and eval_clips_info_FSD50K.json.
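
    As a quick check of the counts above, a minimal sketch that tallies the license field in dev_clips_info_FSD50K.json; the exact JSON schema (a mapping from Freesound id to a metadata dict with a "license" entry) is an assumption based on the description above.

    import json
    from collections import Counter

    # Assumed schema: {"<freesound_id>": {"license": "...", ...}, ...}
    with open("FSD50K.metadata/dev_clips_info_FSD50K.json") as f:
        dev_info = json.load(f)

    license_counts = Counter(meta.get("license") for meta in dev_info.values())
    for lic, count in license_counts.most_common():
        print(f"{lic}: {count} clips")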

    In addition, FSD50K as a whole is the result of a curation process and it has an additional license: FSD50K is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSD50K.doc zip file. We note that the choice of one license for the dataset as a whole is not straightforward as it comprises items with different licenses (such as audio clips, annotations, or data split). The choice of a global license in these cases may warrant further investigation (e.g., by someone with a background in copyright law).

    Usage of FSD50K for commercial purposes:

    If you'd like to use FSD50K for commercial purposes, please contact Eduardo Fonseca and Frederic Font at efonseca@google.com and frederic.font@upf.edu.

    Also, if you are interested in using FSD50K for machine learning competitions, please contact Eduardo Fonseca and Frederic Font at efonseca@google.com and frederic.font@upf.edu.

    FILES

    FSD50K can be downloaded as a series of zip files with the following directory structure:

    root
    │
    └───FSD50K.dev_audio/                    Audio clips in the dev set
    │
    └───FSD50K.eval_audio/                   Audio clips in the eval set
    │
    └───FSD50K.ground_truth/                 Files for FSD50K's ground truth
    │   │
    │   └─── dev.csv                         Ground truth for the dev set
    │   │
    │   └─── eval.csv                        Ground truth for the eval set
    │   │
    │   └─── vocabulary.csv                  List of 200 sound classes in FSD50K
    │
    └───FSD50K.metadata/                     Files for additional metadata
    │   │
    │   └─── class_info_FSD50K.json          Metadata about the sound classes
    │   │
    │   └─── dev_clips_info_FSD50K.json      Metadata about the dev clips
    │   │
    │   └─── eval_clips_info_FSD50K.json     Metadata about the eval clips
    │   │
    │   └─── pp_pnp_ratings_FSD50K.json      PP/PNP ratings
    │   │
    │   └─── collection/                     Files for the *sound collection* format
    │
    └───FSD50K.doc/
        │
        └─── README.md                       The dataset description file that you are reading
        │
        └─── LICENSE-DATASET                 License of the FSD50K dataset as an entity

    Each row (i.e. audio clip) of dev.csv contains the following information:

    • fname: the file name without the .wav extension, e.g., the fname 64760 corresponds to the file 64760.wav on disk. This number is the Freesound id. We always use Freesound ids as filenames.
    • labels: the class labels (i.e., the ground truth). Note these
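
    A minimal sketch of loading the dev ground truth with pandas, under the assumption (consistent with the FSD50K release) that the labels column holds a comma-separated string of class names:

    import pandas as pd

    # Assumes dev.csv has at least the columns described above: fname and labels.
    dev = pd.read_csv("FSD50K.ground_truth/dev.csv")
    dev["fname"] = dev["fname"].astype(str) + ".wav"   # Freesound id -> file name on disk
    dev["labels"] = dev["labels"].str.split(",")       # assumption: comma-separated class labels
    print(dev.head())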

  2. fsd50k

    • huggingface.co
    Updated Jul 8, 2025
    Cite
    Philippe Gonzalez (2025). fsd50k [Dataset]. https://huggingface.co/datasets/philgzl/fsd50k
    Explore at:
    Dataset updated
    Jul 8, 2025
    Authors
    Philippe Gonzalez
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    FSD50K: An open dataset of human-labeled sound events

    This is a mirror of the FSD50K sound event dataset. The original files were converted from WAV to Opus to reduce the size and accelerate streaming.

    • Sampling rate: 48 kHz
    • Channels: 1
    • Format: Opus
    • Splits: Dev: 80 hours, 40,966 clips. Eval: 28 hours, 10,231 clips.

    License: FSD50K is released under CC-BY. However, each clip has its own licence. Clip licenses include CC0, CC-BY, CC-BY-NC and CC Sampling+. Clip licenses are specified… See the full description on the dataset page: https://huggingface.co/datasets/philgzl/fsd50k.
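
    As a hedged sketch, this mirror can be streamed with the Hugging Face datasets library; the split name "dev" below is an assumption, so check the dataset page for the actual split and configuration names.

    from datasets import load_dataset

    # Streaming avoids downloading the full mirror up front.
    ds = load_dataset("philgzl/fsd50k", split="dev", streaming=True)
    for example in ds.take(3):
        print(example)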

  3. FSD50k in WebDataset Format

    • zenodo.org
    tar
    Updated Feb 14, 2025
    Cite
    Niu Yadong (2025). FSD50k in WebDataset Format [Dataset]. http://doi.org/10.5281/zenodo.14868441
    Explore at:
    Available download formats: tar
    Dataset updated
    Feb 14, 2025
    Dataset provided by
    Xiaomi
    Authors
    Niu Yadong
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is the FSD50K dataset, formatted in the WebDataset format. WebDataset files are essentially tar archives, where each example in the dataset is represented by a pair of files: a WAV audio file and a corresponding JSON metadata file. The JSON file contains the class label and other relevant information for that particular audio sample.

    $ tar tvf fsdk50_eval_0000000.tar |head
    -r--r--r-- bigdata/bigdata 40 2025-01-12 13:02 45604.json
    -r--r--r-- bigdata/bigdata 43066 2025-01-12 13:02 45604.wav
    -r--r--r-- bigdata/bigdata  46 2025-01-12 13:02 213293.json
    -r--r--r-- bigdata/bigdata 1372242 2025-01-12 13:02 213293.wav
    -r--r--r-- bigdata/bigdata   82 2025-01-12 13:02 348174.json
    -r--r--r-- bigdata/bigdata 804280 2025-01-12 13:02 348174.wav
    -r--r--r-- bigdata/bigdata   71 2025-01-12 13:02 417736.json
    -r--r--r-- bigdata/bigdata 2238542 2025-01-12 13:02 417736.wav
    -r--r--r-- bigdata/bigdata   43 2025-01-12 13:02 235555.json
    -r--r--r-- bigdata/bigdata 542508 2025-01-12 13:02 235555.wav
     $ tar -xOf fsdk50_eval_0000000.tar 45604.json
    {"soundevent": "Yell;Shout;Human_voice"}
    
    
  4. FSD50K

    • huggingface.co
    Updated Jul 17, 2023
    Cite
    Xirong Cao (2023). FSD50K [Dataset]. https://huggingface.co/datasets/mikiyax/FSD50K
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 17, 2023
    Authors
    Xirong Cao
    Description

    Dataset Card for "FSD50K"

    More Information needed

  5. ARCA23K

    • zenodo.org
    • data.niaid.nih.gov
    bin, zip
    Updated Feb 25, 2022
    Cite
    Turab Iqbal; Yin Cao; Andrew Bailey; Mark D. Plumbley; Wenwu Wang (2022). ARCA23K [Dataset]. http://doi.org/10.5281/zenodo.5117901
    Explore at:
    Available download formats: zip, bin
    Dataset updated
    Feb 25, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Turab Iqbal; Yin Cao; Andrew Bailey; Mark D. Plumbley; Wenwu Wang
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ARCA23K is a dataset of labelled sound events created to investigate real-world label noise. It contains 23,727 audio clips originating from Freesound, and each clip belongs to one of 70 classes taken from the AudioSet ontology. The dataset was created using an entirely automated process with no manual verification of the data. For this reason, many clips are expected to be labelled incorrectly.

    In addition to ARCA23K, this release includes a companion dataset called ARCA23K-FSD, which is a single-label subset of the FSD50K dataset. ARCA23K-FSD contains the same sound classes as ARCA23K and the same number of audio clips per class. As it is a subset of FSD50K, each clip and its label have been manually verified. Note that only the ground truth data of ARCA23K-FSD is distributed in this release. To download the audio clips, please visit the Zenodo page for FSD50K.

    A paper has been published detailing how the dataset was constructed. See the Citing section below.

    The source code used to create the datasets is available: https://github.com/tqbl/arca23k-dataset

    Characteristics

    • ARCA23K(-FSD) is divided into:
      • A training set containing 17,979 clips (39.6 hours for ARCA23K).
      • A validation set containing 2,264 clips (5.0 hours).
      • A test set containing 3,484 clips (7.3 hours).
    • There are 70 sound classes in total. Each class belongs to the AudioSet ontology.
    • Each audio clip was sourced from the Freesound database. Other than format conversions (e.g. resampling), the audio clips have not been modified.
    • The duration of the audio clips varies from 0.3 seconds to 30 seconds.
    • All audio clips are mono 16-bit WAV files sampled at 44.1 kHz.
    • Based on listening tests (details in paper), 46.4% of the training examples are estimated to be labelled incorrectly. Among the incorrectly-labelled examples, 75.9% are estimated to be out-of-vocabulary.

    Sound Classes

    The list of sound classes is given below. They are grouped based on the top-level superclasses of the AudioSet ontology.

    Music

    • Acoustic guitar
    • Bass guitar
    • Bowed string instrument
    • Crash cymbal
    • Electric guitar
    • Gong
    • Harp
    • Organ
    • Piano
    • Rattle (instrument)
    • Scratching (performance technique)
    • Snare drum
    • Trumpet
    • Wind chime
    • Wind instrument, woodwind instrument

    Sounds of things

    • Boom
    • Camera
    • Coin (dropping)
    • Computer keyboard
    • Crack
    • Dishes, pots, and pans
    • Drawer open or close
    • Drill
    • Gunshot, gunfire
    • Hammer
    • Keys jangling
    • Knock
    • Microwave oven
    • Printer
    • Sawing
    • Scissors
    • Skateboard
    • Slam
    • Splash, splatter
    • Squeak
    • Tap
    • Thump, thud
    • Toilet flush
    • Train
    • Water tap, faucet
    • Whoosh, swoosh, swish
    • Writing
    • Zipper (clothing)

    Natural sounds

    • Crackle
    • Stream
    • Waves, surf
    • Wind

    Human sounds

    • Burping, eructation
    • Chewing, mastication
    • Child speech, kid speaking
    • Clapping
    • Cough
    • Crying, sobbing
    • Fart
    • Female singing
    • Female speech, woman speaking
    • Finger snapping
    • Giggle
    • Male speech, man speaking
    • Run
    • Screaming
    • Walk, footsteps

    Animal

    • Bark
    • Cricket
    • Livestock, farm animals, working animals
    • Meow
    • Rattle

    Source-ambiguous sounds

    • Crumpling, crinkling
    • Crushing
    • Tearing

    License and Attribution

    This release is licensed under the Creative Commons Attribution 4.0 International License.

    The audio clips distributed as part of ARCA23K were sourced from Freesound and have their own Creative Commons license. The license information and attribution for each audio clip can be found in ARCA23K.metadata/train.json, which also includes the original Freesound URLs.

    The files under ARCA23K-FSD.ground_truth/ are an adaptation of the ground truth data provided as part of FSD50K, which is licensed under the Creative Commons Attribution 4.0 International License. The curators of FSD50K are Eduardo Fonseca, Xavier Favory, Jordi Pons, Mercedes Collado, Ceren Can, Rachit Gupta, Javier Arredondo, Gary Avendano, and Sara Fernandez.

    Citing

    If you wish to cite this work, please cite the following paper:

    T. Iqbal, Y. Cao, A. Bailey, M. D. Plumbley, and W. Wang, “ARCA23K: An audio dataset for investigating open-set label noise”, in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021), 2021, Barcelona, Spain, pp. 201–205.

    BibTeX:

    @inproceedings{Iqbal2021,
      author = {Iqbal, T. and Cao, Y. and Bailey, A. and Plumbley, M. D. and Wang, W.},
      title = {{ARCA23K}: An audio dataset for investigating open-set label noise},
      booktitle = {Proceedings of the Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021)},
      pages = {201--205},
      year = {2021},
      address = {Barcelona, Spain},
    }
  6. [DCASE2024 Task 3] Synthetic SELD mixtures for baseline training

    • zenodo.org
    • data.niaid.nih.gov
    Updated Apr 6, 2024
    + more versions
    Cite
    Daniel Aleksander Krause; Archontis Politis (2024). [DCASE2024 Task 3] Synthetic SELD mixtures for baseline training [Dataset]. http://doi.org/10.5281/zenodo.10932241
    Explore at:
    Dataset updated
    Apr 6, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Daniel Aleksander Krause; Archontis Politis
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    DESCRIPTION:

    This audio dataset serves as supplementary material for the DCASE2024 Challenge Task 3: Audio and Audiovisual Sound Event Localization and Detection with Distance Estimation. The dataset consists of synthetic spatial audio mixtures of sound events spatialized for two different spatial formats using real room impulse responses (RIRs) measured in various spaces of Tampere University (TAU). The mixtures are generated using the same process as the one used to generate the recordings of the TAU-NIGENS Spatial Sound Scenes 2021 dataset for the DCASE2021 Challenge Task 3.

    The SELD task setup in DCASE2024 is based on spatial recordings of real scenes, captured in the STARS23 dataset. Since the task setup allows use of external data, these synthetic mixtures serve as additional training material for the baseline model. For more details on the task setup, please refer to the task description.

    Note that the generator code and the collection of room responses used to spatialize the sound samples will also be made available soon. For more details on the recording of RIRs, spatialization, and generation, see:

    • Archontis Politis, Sharath Adavanne, Daniel Krause, Antoine Deleforge, Prerak Srivastava, Tuomas Virtanen (2021). A Dataset of Dynamic Reverberant Sound Scenes with Directional Interferers for Sound Event Localization and Detection. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021), Barcelona, Spain.

    available here.

    SPECIFICATIONS:

    • 13 target sound classes (see task description for details)
    • The sound event samples are sourced from the FSD50K dataset, based on the affinity of the labels in that dataset to the target classes. The selection consisted of distinguishing which labels in FSD50K corresponded to the target ones, then selecting samples that were tagged with only those labels and that additionally had an annotator rating of Present and Predominant (see FSD50K for more details). The list of the selected files is included here.
    • 1200 1-minute long spatial recordings
    • Sampling rate of 24kHz
    • Two 4-channel recording formats, first-order Ambisonics (FOA) and tetrahedral microphone array (MIC)
    • Spatial events spatialized in 9 unique rooms, using measured RIRs for the two formats
    • Maximum polyphony of 3 (with possible same-class events overlapping)
    • Even though the whole set is used for training of the baseline without distinction between the mixtures, we have included a separation into a training and testing split, in case one needs to test the performance purely on those synthetic conditions (for example for comparisons with training on mixed synthetic-real data, fine-tuning on real data, or training on real data only).
    • The training split is indicated as fold1 in the dataset, contains 900 recordings spatialized in 6 rooms (150 recordings/room), and is based on samples from the development set of FSD50K.
    • The testing split is indicated as fold2 in the dataset, contains 300 recordings spatialized in 3 rooms (100 recordings/room), and is based on samples from the evaluation set of FSD50K.
    • Common metadata files for both formats are provided. For the file naming and the metadata format, refer to the task setup.

    DOWNLOAD INSTRUCTIONS:

    Download the zip files and use your preferred compression tool to unzip these split zip files. To extract a split zip archive (named as zip, z01, z02, ...), you could use, for example, the following syntax in Linux or OSX terminal:

    1. Combine the split archive to a single archive:
      zip -s 0 split.zip --out single.zip
    2. Extract the single archive using unzip:
      unzip single.zip
  7. FSD-MIX-CLIPS

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 17, 2021
    Cite
    Yu Wang (2021). FSD-MIX-CLIPS [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5574134
    Explore at:
    Dataset updated
    Oct 17, 2021
    Dataset provided by
    Justin Salamon
    Nicholas J. Bryan
    Mark Cartwright
    Juan Pablo Bello
    Yu Wang
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Created by

    Yu Wang, Nicholas J. Bryan, Justin Salamon, Mark Cartwright, and Juan Pablo Bello

    Publication

    If using this data in academic work, please cite the following paper, which presented this dataset:

    Y. Wang, N. J. Bryan, J. Salamon, M. Cartwright, and J. P. Bello. "Who calls the shots? Rethinking Few-shot Learning for Audio", IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2021

    Description

    FSD-MIX-CLIPS is an open dataset of programmatically mixed audio clips with a controlled level of polyphony and signal-to-noise ratio. We use single-labeled clips from FSD50K as the source material for the foreground sound events and Brownian noise as the background to generate 281,039 10-second strongly-labeled soundscapes with Scaper. We refer to this (intermediate) dataset of 10s soundscapes as FSD-MIX-SED. Each soundscape contains n events from n different sound classes, where n ranges from 1 to 5. We then extract 614,533 1s clips centered on each sound event in the soundscapes in FSD-MIX-SED to produce FSD-MIX-CLIPS.

    Source material and annotations

    Due to the large size of the dataset, instead of releasing the raw audio files, we release the source material and soundscape annotations in JAMS format, which can be used to reproduce FSD-MIX-SED using Scaper with the script in the project repository.
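
    As a hedged sketch of that reproduction step, a single soundscape could be re-rendered from its JAMS annotation with Scaper's generate_from_jams helper; the file and folder names below are placeholders, not the actual release layout.

    import scaper

    # Placeholders: point these at the extracted annotations and source material.
    jams_file = "FSD_MIX_SED.annotations/base/train/soundscape_00000.jams"  # hypothetical name
    scaper.generate_from_jams(
        jams_file,
        audio_outfile="soundscape_00000.wav",
        fg_path="FSD_MIX_SED.source/foreground",  # assumption: adjust to the released folders
        bg_path="FSD_MIX_SED.source/background",
    )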

    All clips in FSD-MIX-CLIPS are extracted from FSD-MIX-SED. Therefore, for FSD-MIX-CLIPS, instead of releasing duplicated audio content, we provide annotations that specify the filename in FSD-MIX-SED and the corresponding starting time (in seconds) of each 1-second clip.
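
    Given such an annotation, a 1-second clip can be cut out on the fly. A minimal sketch using soundfile; the soundscape filename and start time below are hypothetical.

    import soundfile as sf

    def extract_clip(soundscape_path, start_sec, duration_sec=1.0):
        """Read a short excerpt from a soundscape without loading the whole file."""
        info = sf.info(soundscape_path)
        start = int(round(start_sec * info.samplerate))
        frames = int(round(duration_sec * info.samplerate))
        return sf.read(soundscape_path, start=start, frames=frames)

    clip, sr = extract_clip("soundscape_00000.wav", start_sec=3.2)  # hypothetical example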

    Foreground material from FSD50K

    We choose clips shorter than 4s that have a single validated label with the Present and Predominant annotation type. We further trim the silence at the edges of each clip. The resulting subset contains clips, each with a single and strong label. The 200 sound classes in FSD50K are hierarchically organized. We focus on the leaf nodes and rule out classes with fewer than 20 single-labeled clips. This gives us 89 sound classes. vocab.json contains the list of 89 classes; each class is then labeled by its index in the list.

    Data splits

    FSD-MIX-CLIPS is originally generated for the task of multi-label audio classification under a few-shot continual learning setup. Therefore, the classes are split into disjoint sets of base and novel classes where novel class data are only used at inference time. We partition the 89 classes into three splits: base, novel-val, and novel-test with 59, 15, and 15 classes, respectively. Base class data are used for both training and evaluation while novel-val/novel-test class data are used for validation/test only.

    Files

    FSD_MIX_SED.source.tar.gz contains the background Brownian noise and 10,296 single-labeled sound events from FSD50K in .wav format. The original file size is 1.9GB.

    FSD_MIX_SED.annotations.tar.gz contains 281,039 JAMS files. The original file size is 35GB.

    FSD_MIX_CLIPS.annotations.tar.gz contains ground truth labels for 1-second clips in each data split in FSD_MIX_SED, specified by filename and starting time (sec).

    vocab.json contains the 89 classes.

    Foreground sound materials and soundscape annotations in FSD_MIX_SED are organized in a similar folder structure following the data splits:

    root folder
    │
    └───base/            Base classes (label 0-58)
    │   │
    │   └─── train/
    │   │    │
    │   │    └─── audio or annotation files
    │   │
    │   └─── val/
    │   │    │
    │   │    └─── audio or annotation files
    │   │
    │   └─── test/
    │        │
    │        └─── audio or annotation files
    │
    └───val/             Novel-val classes (label 59-73)
    │   │
    │   └─── audio or annotation files
    │
    └───test/            Novel-test classes (label 74-88)
        │
        └─── audio or annotation files

    References

    [1] Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, Xavier Serra. "FSD50K: an Open Dataset of Human-Labeled Sound Events", arXiv:2010.00475, 2020.

  8. FSD-FS

    • zenodo.org
    • data.niaid.nih.gov
    bin, zip
    Updated Jan 22, 2023
    Cite
    Jinhua Liang; Huy Phan; Emmanouil Benetos (2023). FSD-FS [Dataset]. http://doi.org/10.5281/zenodo.7557107
    Explore at:
    Available download formats: bin, zip
    Dataset updated
    Jan 22, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jinhua Liang; Huy Phan; Emmanouil Benetos
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    FSD-FS is a publicly-available database of human-labelled sound events for few-shot learning. It spans 143 classes obtained from the AudioSet Ontology and contains 43,805 raw audio files collected from FSD50K. FSD-FS is curated at the Centre for Digital Music, Queen Mary University of London.

    Citation

    If you use the FSD-FS dataset, please cite our paper and FSD50K.

    @article{liang2022learning,
     title={Learning from Taxonomy: Multi-label Few-Shot Classification for Everyday Sound Recognition},
     author={Liang, Jinhua and Phan, Huy and Benetos, Emmanouil},
     journal={arXiv preprint arXiv:2212.08952},
     year={2022}
    }
    
    @ARTICLE{9645159,
     author={Fonseca, Eduardo and Favory, Xavier and Pons, Jordi and Font, Frederic and Serra, Xavier},
     journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
     title={FSD50K: An Open Dataset of Human-Labeled Sound Events},
     year={2022},
     volume={30},
     pages={829-852},
     doi={10.1109/TASLP.2021.3133208}
    }

    About FSD-FS

    FSD-FS is an open database for multi-label few-shot audio classification containing 143 classes drawn from the FSD50K. It also inherits the AudioSet Ontology. FSD-FS follows the ratio 7:2:1 to split classes into base, validation, and evaluation sets, so there are 98 classes in the base set, 30 classes in the validation set, and 15 classes in the evaluation set (More details can be found in our paper).

    LICENSE

    FSD-FS is released under Creative Commons (CC) licenses. As with FSD50K, each clip has its own license as defined by the clip uploader in Freesound, some of them requiring attribution to their original authors and some forbidding further commercial reuse. For more details, one can refer to the link.

    FILES

    FSD-FS is organised in the following structure:

    root
    |
    └─── dev_base
    |
    └─── dev_val
    |
    └─── eval

    REFERENCES AND LINKS

    [1] Gemmeke, Jort F., et al. "Audio set: An ontology and human-labeled dataset for audio events." 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2017. [paper] [link]

    [2] Fonseca, Eduardo, et al. "Fsd50k: an open dataset of human-labeled sound events." IEEE/ACM Transactions on Audio, Speech, and Language Processing 30 (2021): 829-852. [paper] [code]

  9. fuss

    • tensorflow.org
    Updated Nov 12, 2020
    + more versions
    Cite
    (2020). fuss [Dataset]. https://www.tensorflow.org/datasets/catalog/fuss
    Explore at:
    Dataset updated
    Nov 12, 2020
    Description

    The Free Universal Sound Separation (FUSS) Dataset is a database of arbitrary sound mixtures and source-level references, for use in experiments on arbitrary sound separation.

    This is the official sound separation data for the DCASE2020 Challenge Task 4: Sound Event Detection and Separation in Domestic Environments.

    Overview: FUSS audio data is sourced from a pre-release of the Freesound Dataset known as FSD50K, a sound event dataset composed of Freesound content annotated with labels from the AudioSet Ontology. Using the FSD50K labels, these source files have been screened such that they likely only contain a single type of sound. Labels are not provided for these source files, and are not considered part of the challenge. For the purpose of the DCASE Task 4 Sound Separation and Event Detection challenge, systems should not use FSD50K labels, even though they may become available upon FSD50K release.

    To create mixtures, 10 second clips of sources are convolved with simulated room impulse responses and added together. Each 10 second mixture contains between 1 and 4 sources. Source files longer than 10 seconds are considered "background" sources. Every mixture contains one background source, which is active for the entire duration. We provide: a software recipe to create the dataset, the room impulse responses, and the original source audio.
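
    The mixing recipe described above (convolve each source with a simulated room impulse response, then sum) can be illustrated in a few lines of NumPy/SciPy. This is only a sketch of the idea with random placeholder signals, not the official FUSS generation code.

    import numpy as np
    from scipy.signal import fftconvolve

    def mix_sources(sources, rirs, length):
        """Convolve each source with its RIR and sum everything into one mixture."""
        mixture = np.zeros(length)
        for src, rir in zip(sources, rirs):
            wet = fftconvolve(src, rir)[:length]
            mixture[: len(wet)] += wet
        return mixture

    # Toy example: two random "sources" and short random "RIRs", 10 s at 16 kHz.
    sr, dur = 16000, 10
    sources = [0.1 * np.random.randn(sr * dur) for _ in range(2)]
    rirs = [0.01 * np.random.randn(sr // 2) for _ in range(2)]
    mix = mix_sources(sources, rirs, sr * dur)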

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('fuss', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.

  10. FSD50KMagma-Validationset_NoAug

    • kaggle.com
    Updated Nov 23, 2021
    Cite
    Bao Tran Tong (2021). FSD50KMagma-Validationset_NoAug [Dataset]. https://www.kaggle.com/trantong/fsd5kval-spectrogram/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 23, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Bao Tran Tong
    Description

    Dataset

    This dataset was created by Bao Tran Tong

    Contents

  11. FSD50K Custom Preprocessed Dev

    • kaggle.com
    Updated Jul 15, 2025
    Cite
    Anirudh Vignesh (2025). FSD50K Custom Preprocessed Dev [Dataset]. https://www.kaggle.com/datasets/anirudhvignesh/fsd50k-custom-preprocessed-dev/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 15, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Anirudh Vignesh
    Description

    Dataset

    This dataset was created by Anirudh Vignesh

    Contents

  12. GISE-51

    • zenodo.org
    application/gzip, txt
    Updated Apr 13, 2021
    Cite
    Sarthak Yadav; Mary Ellen Foster (2021). GISE-51 [Dataset]. http://doi.org/10.5281/zenodo.4593514
    Explore at:
    Available download formats: application/gzip, txt
    Dataset updated
    Apr 13, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sarthak Yadav; Mary Ellen Foster
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    GISE-51 is an open dataset of 51 isolated sound events based on the FSD50K dataset. The release also includes the GISE-51-Mixtures subset, a dataset of 5-second soundscapes with up to three sound events synthesized from GISE-51. The GISE-51 release attempts to address some of the shortcomings of recent sound event datasets, providing an open, reproducible benchmark for future research and the freedom to adapt the included isolated sound events for domain-specific applications, which was not possible with existing large-scale weakly labelled datasets. The GISE-51 release also includes accompanying code for baseline experiments, which can be found at https://github.com/SarthakYadav/GISE-51-pytorch.

    Citation

    If you use the GISE-51 dataset and/or the released code, please cite our paper:

    Sarthak Yadav and Mary Ellen Foster, "GISE-51: A scalable isolated sound events dataset", arXiv:2103.12306, 2021

    Since GISE-51 is based on FSD50K, if you use GISE-51 kindly also cite the FSD50K paper:

    Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, Xavier Serra. "FSD50K: an Open Dataset of Human-Labeled Sound Events", arXiv:2010.00475, 2020.

    About GISE-51 and GISE-51-Mixtures

    The following sections summarize key characteristics of the GISE-51 and the GISE-51-Mixtures datasets, including details left out from the paper.

    GISE-51

    • Three subsets: train, val and eval, with 12,465, 1,716, and 2,176 utterances respectively. Subsets are consistent with the FSD50K release.
    • Encompasses 51 sound classes from the FSD50K release
    • View meta/lbl_map.csv for the complete vocabulary.
    • The dataset was obtained from FSD50K using the following steps:
      • Unsmearing annotations to obtain single instances with a single label using the provided metadata and ground truth in FSD50K.
      • Manual inspection to qualitatively evaluate shortlisted utterances.
      • Volume-threshold based automated silence filtering using sox. Different volume thresholds are selected for various sound event class bins using trial-and-error. silence_thresholds.txt lists class bins and their corresponding volume threshold. Files that were determined by sox to contain no audio at all were manually clipped. Code for performing silence filtering can be found in scripts/strip_silence_sox.py in the code repository.
      • Re-evaluate sound event classes, removing ones with too few samples and merging those with high inter-class ambiguity.

    GISE-51-Mixtures

    • Synthetic 5-second soundscapes with up to 3 events created using Scaper.
    • Weighted sampling with replacement for sound event selection, effectively oversampling events with very few samples. The synthetic soundscapes thus generated have a near-equal number of annotations per sound event.
    • The number of soundscapes in the val and eval sets is 10,000 each.
    • The number of soundscapes in the final train set is 60,000. We also provide training sets with 5k-100k soundscapes.
    • GISE-51-Mixtures is our proposed subset that can be used to benchmark the performance of future works.

    LICENSE

    All audio clips (i.e., found in isolated_events.tar.gz) used in the preparation of the Glasgow Isolated Events Dataset (GISE-51) are designated Creative Commons and were obtained from FSD50K. The source data in isolated_events.tar.gz is based on the FSD50K dataset, which is licensed as Creative Commons Attribution 4.0 International (CC BY 4.0) License.

    GISE-51 dataset (including GISE-51-Mixtures) is a curated, processed and generated preparation, and is released under Creative Commons Attribution 4.0 International (CC BY 4.0) License. The license is specified in the LICENSE-DATASET file in license.tar.gz.

    Baselines

    Several sound event recognition experiments were conducted, establishing baseline performance on several prominent convolutional neural network architectures. The experiments are described in Section 4 of our paper, and the implementation for reproducing these experiments is available at https://github.com/SarthakYadav/GISE-51-pytorch.

    Files

    GISE-51 is available as a collection of several tar archives. All audio files are PCM 16 bit, 22050 Hz. The following lists the contents of these files in detail:

    • isolated_events.tar.gz: The core GISE-51 isolated events dataset containing train, val and eval subfolders.
    • meta.tar.gz: contains lbl_map.json
    • noises.tar.gz: contains background noises used for GISE-51-Mixtures soundscape generation
    • mixtures_jams.tar.gz: This file contains annotation files in .jams format that, alongside isolated_events.tar.gz and noises.tar.gz can be reused to generate exact GISE-51-Mixtures soundscapes. (Optional, we provide the complete set of GISE-51-Mixtures soundscapes as independent tar archives.)
    • train.tar.gz: GISE-51-Mixtures train set, containing 60k synthetic soundscapes.
    • val.tar.gz: GISE-51-Mixtures val set, containing 10k synthetic soundscapes.
    • eval.tar.gz: GISE-51-Mixtures eval set, containing 10k synthetic soundscapes.
    • train_*.tar.gz: These are tar archives containing training mixtures with varying numbers of soundscapes, used primarily in Section 4.1 of the paper, which compares val mAP performance vs. the number of training soundscapes. A helper script is provided in the code release, prepare_mixtures_lmdb.sh, to prepare data for experiments in Section 4.1.
    • pretrained-models.tar.gz: Contains model checkpoints for all experiments conducted in the paper. More information on these checkpoints can be found in the code release README.
      • experiments_60k_mixtures: model checkpoints from section 4.2 of the paper.
      • exported_weights_60k: ResNet-18 and EfficientNet-B1 exported as plain state_dicts for use with transfer learning experiments.
      • experiments_audioset: checkpoints from AudioSet Balanced (Sec 4.3.1) experiments
      • experiments_vggsound: checkpoints from Section 4.3.2 of the paper
      • experiments_esc50: ESC-50 dataset checkpoints, from Section 4.3.3
    • license.tar.gz: contains dataset license info.
    • silence_thresholds.txt: contains volume thresholds for various sound event bins used for silence filtering.

    Contact

    In case of queries and clarifications, feel free to contact Sarthak at s.yadav.2@research.gla.ac.uk. (Adding [GISE-51] to the subject of the email would be appreciated!)

  13. FSD50KMagma_NoAug_Model

    • kaggle.com
    Updated Nov 23, 2021
    + more versions
    Cite
    Bao Tran Tong (2021). FSD50KMagma_NoAug_Model [Dataset]. https://www.kaggle.com/datasets/trantong/fsd50k-64-noaug
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 23, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Bao Tran Tong
    Description

    Dataset

    This dataset was created by Bao Tran Tong

    Contents

  14. SONYC-FSD-SED

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 20, 2022
    Cite
    Yu Wang (2022). SONYC-FSD-SED [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6392323
    Explore at:
    Dataset updated
    Sep 20, 2022
    Dataset provided by
    Mark Cartwright
    Juan Pablo Bello
    Yu Wang
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Created by

    Yu Wang, Mark Cartwright, and Juan Pablo Bello

    Publication

    If using this data in academic work, please cite the following paper, which presented this dataset:

    Y. Wang, M. Cartwright, and J. P. Bello. "Active Few-Shot Learning for Sound Event Detection", INTERSPEECH, 2022

    Description

    SONYC-FSD-SED is an open dataset of programmatically mixed audio clips that simulates audio data in an environmental sound monitoring system, where sound class occurrences and co-occurrences exhibit seasonal periodic patterns. We use recordings collected from the Sounds of New York City (SONYC) acoustic sensor network as backgrounds, and single-labeled clips from the FSD50K dataset as foreground events, to generate 576,591 10-second strongly-labeled soundscapes with Scaper (including 111,294 additional test soundscapes for the sampling-window experiment). Instead of sampling foreground sound events uniformly, we simulate the occurrence probability of each class at different times in a year, creating more realistic temporal characteristics.

    Source material and annotations

    Due to the large size of the dataset, instead of releasing the raw audio files, we release the source material and soundscape annotations in JAMS format, which can be used to reproduce SONYC-FSD-SED using Scaper with the script in the project repository.

    Background material from SONYC recordings

    We pick a sensor from the SONYC sensor network and subsample from recordings it collected within a year (2017). We categorize these ∼550k 10-second clips into 96 bins based on timestamps, where each bin represents a unique combination of the month of a year, day of a week (weekday or weekend), and time of a day (divided into four 6-hour blocks). Next, we run a pre-trained urban sound event classifier over all recordings and filter out clips with active sound classes. We do not filter out footstep and bird since they appear too frequently; instead, we remove these two classes from the foreground sound material. Then from each bin, we choose the clip with the lowest sound pressure level, yielding 96 background clips.

    Foreground material from FSD50K

    We follow the same filtering process as in FSD-MIX-SED to get the subset of FSD50K with short single-labeled clips. In addition, we remove two classes, "Chirp_and_tweet" and "Walk_and_footsteps", that exist in our SONYC background recordings. This results in 87 sound classes. vocab.json contains the list of 87 classes, each class is then labeled by its index in the list. 0-42: train, 43-56: val, 57-86: test.

    Occurrence probability modelling

    For each class, we model its occurrence probability within a year. We use von Mises probability density functions to simulate the probability distribution over different weeks in a year and hours in a day, given their cyclic characteristics: $f(x \mid \mu, \kappa) = \frac{e^{\kappa \cos(x - \mu)}}{2\pi I_0(\kappa)}$, where $I_0(\kappa)$ is the modified Bessel function of order 0, and $\mu$ and $1/\kappa$ are analogous to the mean and variance of the normal distribution. We randomly sample $(\mu_{year}, \mu_{day})$ from $[-\pi, \pi]$ and $(\kappa_{year}, \kappa_{day})$ from $[0, 10]$. We also randomly assign $p_{weekday} \in [0, 1]$, $p_{weekend} = 1 - p_{weekday}$ to simulate the probability distribution over different days in a week. Finally, we get the probability distribution over the entire year with a 1-hour resolution. At a given timestamp, we integrate $f_{year}$ and $f_{day}$ over the 1-hour window and multiply them together with $p_{weekday}$ or $p_{weekend}$ depending on the day. To speed up the subsequent sampling process, we scale the final probability distribution using a temperature parameter randomly sampled from $[2, 3]$.
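
    As a hedged illustration of the density above, the sketch below evaluates the von Mises factors at a single week/hour point rather than integrating over the 1-hour window as in the original pipeline; the parameter values are drawn from the ranges described, purely for illustration.

    import numpy as np
    from scipy.special import i0

    def von_mises_pdf(x, mu, kappa):
        """f(x | mu, kappa) = exp(kappa * cos(x - mu)) / (2 * pi * I0(kappa))."""
        return np.exp(kappa * np.cos(x - mu)) / (2 * np.pi * i0(kappa))

    rng = np.random.default_rng(0)
    mu_year, mu_day = rng.uniform(-np.pi, np.pi, size=2)
    kappa_year, kappa_day = rng.uniform(0.0, 10.0, size=2)
    p_weekday = rng.uniform(0.0, 1.0)

    # Occurrence weight for, e.g., week 26 of the year at 14:00 on a weekday.
    x_year = 2 * np.pi * 26 / 52 - np.pi   # week of year mapped onto the circle
    x_day = 2 * np.pi * 14 / 24 - np.pi    # hour of day mapped onto the circle
    weight = (von_mises_pdf(x_year, mu_year, kappa_year)
              * von_mises_pdf(x_day, mu_day, kappa_day)
              * p_weekday)
    print(weight)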

    Files

    SONYC_FSD_SED.source.tar.gz: 96 SONYC backgrounds and 10,158 foreground sounds in .wav format. The original file size is 2GB.

    SONYC_FSD_SED.annotations.tar.gz: 465,467 JAMS files. The original file size is 57GB.

    SONYC_FSD_SED_add_test.annotations.tar.gz: 111,294 JAMS files for additional test data. The original file size is 14GB.

    vocab.json: 87 classes.

    occ_prob_per_cl.pkl: Occurrence probability for each foreground sound class.

    References

    [1] J. P. Bello, C. T. Silva, O. Nov, R. L. DuBois, A. Arora, J. Salamon, C. Mydlarz, and H. Doraiswamy, “SONYC: A system for monitoring, analyzing, and mitigating urban noise pollution,” Commun. ACM, 2019

    [2] E. Fonseca, X. Favory, J. Pons, F. Font, X. Serra. "FSD50K: an Open Dataset of Human-Labeled Sound Events", arXiv:2010.00475, 2020.

  15. L3DAS21 Challenge

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 10, 2021
    Cite
    Danilo Comminiello (2021). L3DAS21 Challenge [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4642004
    Explore at:
    Dataset updated
    May 10, 2021
    Dataset provided by
    Eric Guizzo
    Danilo Comminiello
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    L3DAS21: MACHINE LEARNING FOR 3D AUDIO SIGNAL PROCESSING

    IEEE MLSP Data Challenge 2021

    SCOPE OF THE CHALLENGE

    The L3DAS21 Challenge for the IEEE MLSP 2021 aims at encouraging and fostering research on machine learning for 3D audio signal processing. In multi-speaker scenarios it is very important to properly understand the nature of a sound event and its position within the environment: what the content of the sound signal is and how best to leverage it for a specific application (e.g., teleconferencing rather than assistive listening or entertainment, among others). To this end, the L3DAS21 Challenge presents two tasks: 3D Speech Enhancement and 3D Sound Event Localization and Detection, both relying on first-order Ambisonics recordings in a reverberant office environment.

    Each task involves 2 separate tracks: 1-mic and 2-mic recordings, respectively containing sounds acquired by one Ambisonics microphone and by an array of two Ambisonics microphones. The use of two first-order Ambisonics microphones definitely represents one of the main novelties of the L3DAS21 Challenge.

    Task 1: 3D Speech Enhancement. The objective of this task is the enhancement of speech signals immersed in the spatial sound field of a reverberant office environment. Here the models are expected to extract the monophonic voice signal from the 3D mixture containing various background noises. The evaluation metric for this task is a combination of the short-time objective intelligibility (STOI) and word error rate (WER).

    Task 2: 3D Sound Event Localization and Detection. The aim of this task is to detect the temporal activities of a known set of sound event classes and, in particular, to further locate them in space. Here the models must predict a list of the active sound events and their respective locations at regular intervals of 100 milliseconds. Performance on this task is evaluated according to the location-sensitive detection error, which combines the localization and detection errors.

    DATASETS

    The L3DAS21 datasets contain multiple-source and multiple-perspective B-format Ambisonics audio recordings. We sampled the acoustic field of a large office room, placing two first-order Ambisonics microphones in the center of the room and moving a speaker reproducing the analytic signal in 252 fixed spatial positions. Relying on the collected Ambisonics impulse responses (IRs), we augmented existing clean monophonic datasets to obtain synthetic tridimensional sound sources by convolving the original sounds with our IRs. We extracted speech signals from the Librispeech dataset and office-like background noises from the FSD50K dataset. We aimed at creating plausible and varied 3D scenarios to reflect possible real-life situations in which sound and disparate types of background noise coexist in the same 3D reverberant environment. We provide normalized raw waveforms as predictor data; the target data varies according to the task.

    The dataset is divided in two main sections, respectively dedicated to the challenge tasks.

    The first section is optimized for 3D Speech Enhancement and contains more than 30000 virtual 3D audio environments with a duration of up to 10 seconds. In each sample, a spoken voice is always present alongside other office-like background noises. As target data for this section we provide the clean monophonic voice signals.

    The other section, instead, is dedicated to the 3D Sound Event Localization and Detection task and contains 900 60-second-long audio files. Each data point contains a simulated 3D office audio environment in which up to 3 simultaneous acoustic events may be active at the same time. In this section, the samples are not forced to contain a spoken voice. As target data for this section we provide a list of the onset and offset time stamps, the typology class, and the spatial coordinates of each individual sound event present in the data points.

    We split both dataset sections into a training set (44 hours for SE and 600 hours for SELD) and a test set (6 hours for SE and 5 hours for SELD), paying attention to create similar distributions. The train set of the SE section is divided into two partitions, train360 and train100, which contain speech samples extracted from the corresponding partitions of Librispeech (only samples up to 10 seconds). All sets of the SELD section are divided into OV1, OV2, and OV3. These partitions refer to the maximum number of possibly overlapping sounds, which is 1, 2, or 3, respectively.

    The evaluation test datasets can be downloaded here:

    L3DAS21_Task1_test.zip

    L3DAS21_Task2_test.zip

    CHALLENGE WEBSITE AND CONTACTS

    L3DAS21 Challenge Website: www.l3das.com/mlsp2021

    GitHub repository: github.com/l3das/L3DAS21

    Paper: arxiv.org/abs/2104.05499

    IEEE MLSP 2021: 2021.ieeemlsp.org/

    Email contact: l3das@uniroma1.it

    Twitter: https://twitter.com/das_l3

  16. Zeroshot-Audio-Classification-Instructions

    • huggingface.co
    Updated Jun 28, 2025
    Cite
    Mesolitica (2025). Zeroshot-Audio-Classification-Instructions [Dataset]. https://huggingface.co/datasets/mesolitica/Zeroshot-Audio-Classification-Instructions
    Explore at:
    Dataset updated
    Jun 28, 2025
    Dataset authored and provided by
    Mesolitica
    Description

    Zeroshot-Audio-Classification-Instructions

    Converts audio classification datasets into zero-shot speech instructions, supporting both single-label and multi-label data:

    VGGSound FSD50k Nonspeech7k urbansound8K VocalSound Emotion Gender ESD Emotion Age Language TAU Urban Acoustic Scenes 2022 CochlScene BirdCLEF_2021 EmoBox AudioSet

    We also converted the large WAV files into MP3 at a 16k sample rate to reduce storage size. To prevent leakage, please do not include the test set in the training session.… See the full description on the dataset page: https://huggingface.co/datasets/mesolitica/Zeroshot-Audio-Classification-Instructions.

  17. ASA2_dataset

    • huggingface.co
    Updated Sep 13, 2024
    + more versions
    Cite
    Dongheon Lee (2024). ASA2_dataset [Dataset]. https://huggingface.co/datasets/donghoney22/ASA2_dataset
    Explore at:
    Dataset updated
    Sep 13, 2024
    Authors
    Dongheon Lee
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    🎧 Auditory Scene Analysis 2 (ASA2) Dataset

    We constructed a new dataset for multichannel USS and polyphonic audio classification tasks. The proposed dataset is designed to reflect various conditions, including moving sources with temporal onsets and offsets. For foreground sound sources, signals from 13 audio classes were selected from open-source databases (Pixabay¹, FSD50K, Librispeech, MUSDB18, Vocalsound). These signals were resampled to 16 kHz and pre-processed by either padding zeros… See the full description on the dataset page: https://huggingface.co/datasets/donghoney22/ASA2_dataset.

  18. Divide and Remaster (DnR)

    • data.niaid.nih.gov
    • zenodo.org
    Updated Mar 22, 2023
    Cite
    Petermann, Darius; Wichern, Gordon; Wang, Zhong-Qiu; Le Roux, Jonathan (2023). Divide and Remaster (DnR) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5574712
    Explore at:
    Dataset updated
    Mar 22, 2023
    Dataset provided by
    Mitsubishi Electric Research Laboratories
    Indiana University, Department of Intelligent Systems Engineering
    Authors
    Petermann, Darius; Wichern, Gordon; Wang, Zhong-Qiu; Le Roux, Jonathan
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction:

    Divide and Remaster (DnR) is a source separation dataset for training and testing algorithms that separate a monaural audio signal into speech, music, and sound effects/background stems. The dataset is composed of artificial mixtures using audio from LibriSpeech, the Free Music Archive (FMA), and the Freesound Dataset 50k (FSD50K). We introduce it as part of the Cocktail Fork Problem paper.

    At a Glance:

    The size of the unzipped dataset is ~174GB

    Each mixture is 60 seconds long and sources are not fully overlapped

    Audio is encoded as 16-bit .wav files at a sampling rate of 44.1 kHz

    The data is split into training tr (3,295 mixtures), validation cv (440 mixtures) and testing tt (652 mixtures) subsets

    The directory for each mixture contains four .wav files (mix.wav, music.wav, speech.wav, sfx.wav) and annots.csv, which contains the metadata for the original audio used to compose the mixture (transcriptions for speech, sound classes for sfx, and genre labels for music)
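
    A hedged sketch of reading one mixture directory laid out as above; the directory path is a placeholder.

    import os
    import pandas as pd
    import soundfile as sf

    mix_dir = "dnr/tt/0001"  # placeholder path to one mixture directory

    # Load the mixture and its three stems, plus the per-source metadata.
    stems = {}
    for name in ("mix", "music", "speech", "sfx"):
        stems[name], sr = sf.read(os.path.join(mix_dir, f"{name}.wav"))
    annots = pd.read_csv(os.path.join(mix_dir, "annots.csv"))
    print(sr, {k: v.shape for k, v in stems.items()})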

    Other Resources:

    Demo examples and additional information are available at: https://cocktail-fork.github.io/

    For more details about the data generation process, the code used to generate our dataset can be found at the following: https://github.com/darius522/dnr-utils

    Contact and Support:

    Have an issue, concern, or question about DnR? If so, please open an issue here.

    For any other inquiries, feel free to shoot an email at: firstname.lastname@gmail.com, my name is Darius Petermann ;)

    Citation:

    If you use DnR please cite our paper in which we introduce the dataset as part of the Cocktail Fork Problem:

    @article{Petermann2021cocktail,
      title={The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks},
      author={Darius Petermann and Gordon Wichern and Zhong-Qiu Wang and Jonathan {Le Roux}},
      year={2021},
      journal={arXiv preprint arXiv:2110.09958},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
    }

  19. Open-Set Tagging Dataset (OST)

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Sep 30, 2025
    Cite
    Sripathi Sridhar; Sripathi Sridhar; Mark Cartwright; Mark Cartwright (2025). Open-Set Tagging Dataset (OST) [Dataset]. http://doi.org/10.5281/zenodo.13755902
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Sep 30, 2025
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Sripathi Sridhar; Sripathi Sridhar; Mark Cartwright; Mark Cartwright
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Open-set Tagging (OST) is a synthetic dataset of 1s clips used to evaluate source-centric representation learning models in the paper Compositional Audio Representation Learning.

    Due to the size of the dataset, we only share the source files; the scripts to generate the dataset are available here.

    The dataset generation process is as follows:
    1. From single-source FSD50K audio files, we generate a dataset of 10s soundscapes called Open-set Soundscapes (OSS) using Scaper.

    2. We then center a 1s window around the center of each sound event in the 10s soundscapes to generate Open-set Tagging (OST), which contains ~500k clips.

    If you are not going to use OSS, you can choose to synthesize it without audio; this will produce only the JAMS annotation files needed for the 1s clips. Using the OSS JAMS files, OST clips can be generated deterministically.
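
    The window-centering step in the generation process above can be sketched as follows. Event onsets/offsets would come from the OSS JAMS annotations, and the exact boundary handling in the released scripts may differ; this is only an illustration.

    import numpy as np

    def centered_window(soundscape: np.ndarray, sr: int, onset_s: float,
                        offset_s: float, win_s: float = 1.0) -> np.ndarray:
        """Cut a win_s-second window centered on one sound event in a longer soundscape."""
        center = int((onset_s + offset_s) / 2.0 * sr)
        half = int(win_s * sr) // 2
        start = max(0, min(center - half, len(soundscape) - 2 * half))  # keep window in bounds
        return soundscape[start:start + 2 * half]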

    There are five dataset variants (~17GB each), each with a different random assignment of classes to the known and unknown class categories. For further details, refer to our previous paper Multi-label open-set audio classification. In this work, OST dataset variant 1 is referred to as OST for simplicity.

    We also introduce a tiny version of the dataset called OST-Tiny, which contains ~20k clips and only 10 known classes. This is convenient for faster prototyping and to evaluate models in a more challenging open-set classification scenario.

  20. FUSS (Free Universal Sound Separation)

    • opendatalab.com
    zip
    Updated May 12, 2023
    Cite
    Adobe (2023). FUSS(Free Universal Sound Separation) [Dataset]. https://opendatalab.com/OpenDataLab/FUSS
    Explore at:
    Available download formats: zip (50263329 bytes)
    Dataset updated
    May 12, 2023
    Dataset provided by
    Adobe: http://adobe.com/
    Google Research: https://research.google.com/
    Universitat Pompeu Fabra
    University of Lorraine
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Free Universal Sound Separation (FUSS) dataset is a database of arbitrary sound mixtures and source-level references, for use in experiments on arbitrary sound separation. FUSS is based on the FSD50K corpus.

Cite
Eduardo Fonseca; Eduardo Fonseca; Xavier Favory; Jordi Pons; Frederic Font; Frederic Font; Xavier Serra; Xavier Serra; Xavier Favory; Jordi Pons (2022). FSD50K [Dataset]. http://doi.org/10.5281/zenodo.4060432

FSD50K

Explore at:
8 scholarly articles cite this dataset (View in Google Scholar)
Available download formats: zip, bin
Dataset updated
Apr 24, 2022
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Eduardo Fonseca; Eduardo Fonseca; Xavier Favory; Jordi Pons; Frederic Font; Frederic Font; Xavier Serra; Xavier Serra; Xavier Favory; Jordi Pons
Description

FSD50K is an open dataset of human-labeled sound events containing 51,197 Freesound clips unequally distributed in 200 classes drawn from the AudioSet Ontology. FSD50K has been created at the Music Technology Group of Universitat Pompeu Fabra.

Citation

If you use the FSD50K dataset, or part of it, please cite our TASLP paper (available from [arXiv] [TASLP]):

@article{fonseca2022FSD50K,
 title={{FSD50K}: an open dataset of human-labeled sound events},
 author={Fonseca, Eduardo and Favory, Xavier and Pons, Jordi and Font, Frederic and Serra, Xavier},
 journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
 volume={30},
 pages={829--852},
 year={2022},
 publisher={IEEE}
}

Paper update: This paper has been published in TASLP at the beginning of 2022. The accepted camera-ready version includes a number of improvements with respect to the initial submission. The main updates include: estimation of the amount of label noise in FSD50K, SNR comparison between FSD50K and AudioSet, improved description of evaluation metrics including equations, clarification of experimental methodology and some results, some content moved to Appendix for readability. The TASLP-accepted camera-ready version is available from arXiv (in particular, it is v2 in arXiv, displayed by default).

Data curators

Eduardo Fonseca, Xavier Favory, Jordi Pons, Mercedes Collado, Ceren Can, Rachit Gupta, Javier Arredondo, Gary Avendano and Sara Fernandez

Contact

You are welcome to contact Eduardo Fonseca should you have any questions, at efonseca@google.com.

ABOUT FSD50K

Freesound Dataset 50k (or FSD50K for short) is an open dataset of human-labeled sound events containing 51,197 Freesound clips unequally distributed in 200 classes drawn from the AudioSet Ontology [1]. FSD50K has been created at the Music Technology Group of Universitat Pompeu Fabra.

What follows is a brief summary of FSD50K's most important characteristics. Please have a look at our paper (especially Section 4) to extend the basic information provided here with relevant details for its usage, as well as discussion, limitations, applications and more.

Basic characteristics:

  • FSD50K contains 51,197 audio clips from Freesound, totalling 108.3 hours of multi-labeled audio
  • The dataset encompasses 200 sound classes (144 leaf nodes and 56 intermediate nodes) hierarchically organized with a subset of the AudioSet Ontology.
  • The audio content is composed mainly of sound events produced by physical sound sources and production mechanisms, including human sounds, sounds of things, animals, natural sounds, musical instruments and more. The vocabulary can be inspected in vocabulary.csv (see Files section below).
  • The acoustic material has been manually labeled by humans following a data labeling process using the Freesound Annotator platform [2].
  • Clips are of variable length from 0.3 to 30s, due to the diversity of the sound classes and the preferences of Freesound users when recording sounds.
  • All clips are provided as uncompressed PCM 16 bit 44.1 kHz mono audio files.
  • Ground truth labels are provided at the clip-level (i.e., weak labels).
  • The dataset poses mainly a large-vocabulary multi-label sound event classification problem, but also allows development and evaluation of a variety of machine listening approaches (see Sec. 4D in our paper).
  • In addition to audio clips and ground truth, additional metadata is made available (including raw annotations, sound predominance ratings, Freesound metadata, and more), allowing a variety of analyses and sound event research tasks (see Files section below).
  • The audio clips are grouped into a development (dev) set and an evaluation (eval) set such that they do not have clips from the same Freesound uploader.

Dev set:

  • 40,966 audio clips totalling 80.4 hours of audio
  • Avg duration/clip: 7.1s
  • 114,271 smeared labels (i.e., labels propagated in the upwards direction to the root of the ontology)
  • Labels are correct but could be occasionally incomplete
  • A train/validation split is provided (Sec. 3H). If a different split is used, it should be specified for reproducibility and fair comparability of results (see Sec. 5C of our paper)

Eval set:

  • 10,231 audio clips totalling 27.9 hours of audio
  • Avg duration/clip: 9.8s
  • 38,596 smeared labels
  • Eval set is labeled exhaustively (labels are correct and complete for the considered vocabulary)

Note: All classes in FSD50K are represented in AudioSet, except Crash cymbal, Human group actions, Human voice, Respiratory sounds, and Domestic sounds, home sounds.

LICENSE

All audio clips in FSD50K are released under Creative Commons (CC) licenses. Each clip has its own license as defined by the clip uploader in Freesound, some of them requiring attribution to their original authors and some forbidding further commercial reuse. Specifically:

The development set consists of 40,966 clips with the following licenses:

  • CC0: 14,959
  • CC-BY: 20,017
  • CC-BY-NC: 4,616
  • CC Sampling+: 1,374

The evaluation set consists of 10,231 clips with the following licenses:

  • CC0: 4,914
  • CC-BY: 3,489
  • CC-BY-NC: 1,425
  • CC Sampling+: 403

For attribution purposes and to facilitate attribution of these files to third parties, we include a mapping from the audio clips to their corresponding licenses. The licenses are specified in the files dev_clips_info_FSD50K.json and eval_clips_info_FSD50K.json.
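
To build an attribution list from this mapping, the JSON files can be read directly. The sketch below assumes each entry exposes a "license" field; the exact schema of the metadata files is not documented in this summary, so treat the field name as an assumption.

import json
from collections import Counter

with open("FSD50K.metadata/dev_clips_info_FSD50K.json") as f:
    clips_info = json.load(f)  # maps Freesound id -> clip metadata

# Count dev clips per license; the field name "license" is an assumption.
license_counts = Counter(info["license"] for info in clips_info.values())
print(license_counts.most_common())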

In addition, FSD50K as a whole is the result of a curation process and it has an additional license: FSD50K is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSD50K.doc zip file. We note that the choice of one license for the dataset as a whole is not straightforward as it comprises items with different licenses (such as audio clips, annotations, or data split). The choice of a global license in these cases may warrant further investigation (e.g., by someone with a background in copyright law).

Usage of FSD50K for commercial purposes:

If you'd like to use FSD50K for commercial purposes, please contact Eduardo Fonseca and Frederic Font at efonseca@google.com and frederic.font@upf.edu.

Also, if you are interested in using FSD50K for machine learning competitions, please contact Eduardo Fonseca and Frederic Font at efonseca@google.com and frederic.font@upf.edu.

FILES

FSD50K can be downloaded as a series of zip files with the following directory structure:

root
│
└───FSD50K.dev_audio/                    Audio clips in the dev set
│
└───FSD50K.eval_audio/                   Audio clips in the eval set
│
└───FSD50K.ground_truth/                 Files for FSD50K's ground truth
│   │
│   └─── dev.csv                         Ground truth for the dev set
│   │
│   └─── eval.csv                        Ground truth for the eval set
│   │
│   └─── vocabulary.csv                  List of 200 sound classes in FSD50K
│
└───FSD50K.metadata/                     Files for additional metadata
│   │
│   └─── class_info_FSD50K.json          Metadata about the sound classes
│   │
│   └─── dev_clips_info_FSD50K.json      Metadata about the dev clips
│   │
│   └─── eval_clips_info_FSD50K.json     Metadata about the eval clips
│   │
│   └─── pp_pnp_ratings_FSD50K.json      PP/PNP ratings
│   │
│   └─── collection/                     Files for the *sound collection* format
│
└───FSD50K.doc/
    │
    └─── README.md                       The dataset description file that you are reading
    │
    └─── LICENSE-DATASET                 License of the FSD50K dataset as an entity

Each row (i.e. audio clip) of dev.csv contains the following information:

  • fname: the file name without the .wav extension, e.g., the fname 64760 corresponds to the file 64760.wav on disk. This number is the Freesound id. We always use Freesound ids as filenames.
  • labels: the class labels (i.e., the ground truth). Note these
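
A minimal sketch of reading the dev ground truth with pandas follows; it assumes the labels field is a comma-separated string of class names (fname is the Freesound id, so the corresponding audio file is FSD50K.dev_audio/<fname>.wav).

import pandas as pd

dev = pd.read_csv("FSD50K.ground_truth/dev.csv")

# labels is assumed to be a comma-separated string of class names per clip
dev["label_list"] = dev["labels"].str.split(",")
dev["path"] = "FSD50K.dev_audio/" + dev["fname"].astype(str) + ".wav"
print(dev[["fname", "label_list", "path"]].head())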
