33 datasets found
  1. FSD50K

    • data.niaid.nih.gov
    • opendatalab.com
    • +2 more
    Updated Apr 24, 2022
    Cite
    Eduardo Fonseca (2022). FSD50K [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4060431
    Dataset updated
    Apr 24, 2022
    Dataset provided by
    Jordi Pons
    Xavier Favory
    Eduardo Fonseca
    Xavier Serra
    Frederic Font
    Description

    FSD50K is an open dataset of human-labeled sound events containing 51,197 Freesound clips unequally distributed in 200 classes drawn from the AudioSet Ontology. FSD50K has been created at the Music Technology Group of Universitat Pompeu Fabra.

    Citation

    If you use the FSD50K dataset, or part of it, please cite our TASLP paper (available from [arXiv] [TASLP]):

    @article{fonseca2022FSD50K,
      title={{FSD50K}: an open dataset of human-labeled sound events},
      author={Fonseca, Eduardo and Favory, Xavier and Pons, Jordi and Font, Frederic and Serra, Xavier},
      journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
      volume={30},
      pages={829--852},
      year={2022},
      publisher={IEEE}
    }

    Paper update: This paper has been published in TASLP at the beginning of 2022. The accepted camera-ready version includes a number of improvements with respect to the initial submission. The main updates include: estimation of the amount of label noise in FSD50K, SNR comparison between FSD50K and AudioSet, improved description of evaluation metrics including equations, clarification of experimental methodology and some results, some content moved to Appendix for readability. The TASLP-accepted camera-ready version is available from arXiv (in particular, it is v2 in arXiv, displayed by default).

    Data curators

    Eduardo Fonseca, Xavier Favory, Jordi Pons, Mercedes Collado, Ceren Can, Rachit Gupta, Javier Arredondo, Gary Avendano and Sara Fernandez

    Contact

    You are welcome to contact Eduardo Fonseca should you have any questions, at efonseca@google.com.

    ABOUT FSD50K

    Freesound Dataset 50k (or FSD50K for short) is an open dataset of human-labeled sound events containing 51,197 Freesound clips unequally distributed in 200 classes drawn from the AudioSet Ontology [1]. FSD50K has been created at the Music Technology Group of Universitat Pompeu Fabra.

    What follows is a brief summary of FSD50K's most important characteristics. Please have a look at our paper (especially Section 4) to extend the basic information provided here with relevant details for its usage, as well as discussion, limitations, applications and more.

    Basic characteristics:

    FSD50K contains 51,197 audio clips from Freesound, totalling 108.3 hours of multi-labeled audio

    The dataset encompasses 200 sound classes (144 leaf nodes and 56 intermediate nodes) hierarchically organized with a subset of the AudioSet Ontology.

    The audio content is composed mainly of sound events produced by physical sound sources and production mechanisms, including human sounds, sounds of things, animals, natural sounds, musical instruments and more. The vocabulary can be inspected in vocabulary.csv (see Files section below).

    The acoustic material has been manually labeled by humans following a data labeling process using the Freesound Annotator platform [2].

    Clips are of variable length from 0.3 to 30s, due to the diversity of the sound classes and the preferences of Freesound users when recording sounds.

    All clips are provided as uncompressed PCM 16 bit 44.1 kHz mono audio files.

    Ground truth labels are provided at the clip-level (i.e., weak labels).

    The dataset poses mainly a large-vocabulary multi-label sound event classification problem, but also allows development and evaluation of a variety of machine listening approaches (see Sec. 4D in our paper).

    In addition to audio clips and ground truth, additional metadata is made available (including raw annotations, sound predominance ratings, Freesound metadata, and more), allowing a variety of analyses and sound event research tasks (see Files section below).

    The audio clips are grouped into a development (dev) set and an evaluation (eval) set such that they do not have clips from the same Freesound uploader.

    Dev set:

    40,966 audio clips totalling 80.4 hours of audio

    Avg duration/clip: 7.1s

    114,271 smeared labels (i.e., labels propagated in the upwards direction to the root of the ontology)

    Labels are correct but could be occasionally incomplete

    A train/validation split is provided (Sec. 3H). If a different split is used, it should be specified for reproducibility and fair comparability of results (see Sec. 5C of our paper)

    Eval set:

    10,231 audio clips totalling 27.9 hours of audio

    Avg duration/clip: 9.8s

    38,596 smeared labels

    Eval set is labeled exhaustively (labels are correct and complete for the considered vocabulary)

    Note: All classes in FSD50K are represented in AudioSet, except Crash cymbal, Human group actions, Human voice, Respiratory sounds, and Domestic sounds, home sounds.

    LICENSE

    All audio clips in FSD50K are released under Creative Commons (CC) licenses. Each clip has its own license as defined by the clip uploader in Freesound, some of them requiring attribution to their original authors and some forbidding further commercial reuse. Specifically:

    The development set consists of 40,966 clips with the following licenses:

    CC0: 14,959

    CC-BY: 20,017

    CC-BY-NC: 4616

    CC Sampling+: 1374

    The evaluation set consists of 10,231 clips with the following licenses:

    CC0: 4914

    CC-BY: 3489

    CC-BY-NC: 1425

    CC Sampling+: 403

    For attribution purposes and to facilitate attribution of these files to third parties, we include a mapping from the audio clips to their corresponding licenses. The licenses are specified in the files dev_clips_info_FSD50K.json and eval_clips_info_FSD50K.json.

    In addition, FSD50K as a whole is the result of a curation process and it has an additional license: FSD50K is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSD50K.doc zip file. We note that the choice of one license for the dataset as a whole is not straightforward as it comprises items with different licenses (such as audio clips, annotations, or data split). The choice of a global license in these cases may warrant further investigation (e.g., by someone with a background in copyright law).

    Usage of FSD50K for commercial purposes:

    If you'd like to use FSD50K for commercial purposes, please contact Eduardo Fonseca and Frederic Font at efonseca@google.com and frederic.font@upf.edu.

    Also, if you are interested in using FSD50K for machine learning competitions, please contact Eduardo Fonseca and Frederic Font at efonseca@google.com and frederic.font@upf.edu.

    FILES

    FSD50K can be downloaded as a series of zip files with the following directory structure:

    root
    │
    └───FSD50K.dev_audio/                    Audio clips in the dev set
    │
    └───FSD50K.eval_audio/                   Audio clips in the eval set
    │
    └───FSD50K.ground_truth/                 Files for FSD50K's ground truth
    │   └─── dev.csv                         Ground truth for the dev set
    │   └─── eval.csv                        Ground truth for the eval set
    │   └─── vocabulary.csv                  List of 200 sound classes in FSD50K
    │
    └───FSD50K.metadata/                     Files for additional metadata
    │   └─── class_info_FSD50K.json          Metadata about the sound classes
    │   └─── dev_clips_info_FSD50K.json      Metadata about the dev clips
    │   └─── eval_clips_info_FSD50K.json     Metadata about the eval clips
    │   └─── pp_pnp_ratings_FSD50K.json      PP/PNP ratings
    │   └─── collection/                     Files for the sound collection format
    │
    └───FSD50K.doc/
        └─── README.md                       The dataset description file that you are reading
        └─── LICENSE-DATASET                 License of the FSD50K dataset as an entity

    Each row (i.e. audio clip) of dev.csv contains the following information:

    fname: the file name without the .wav extension, e.g., the fname 64760 corresponds to the file 64760.wav on disk. This number is the Freesound id. We always use Freesound ids as filenames.

    labels: the class labels (i.e., the ground truth). Note these class labels are smeared, i.e., the labels have been propagated in the upwards direction to the root of the ontology. More details about the label smearing process can be found in Appendix D of our paper.

    mids: the Freebase identifiers corresponding to the class labels, as defined in the AudioSet Ontology specification

    split: whether the clip belongs to train or val (see paper for details on the proposed split)

    Rows in eval.csv follow the same format, except that there is no split column.

    Note: We use a slightly different format than AudioSet for the naming of class labels in order to avoid potential problems with spaces, commas, etc. Example: we use Accelerating_and_revving_and_vroom instead of the original Accelerating, revving, vroom. You can go back to the original AudioSet naming using the information provided in vocabulary.csv (class label and mid for the 200 classes of FSD50K) and the AudioSet Ontology specification.
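
    For illustration, a minimal sketch of reading the ground truth with pandas and preparing the label-to-mid mapping (the headerless layout of vocabulary.csv with columns index, label, mid and the comma-separated label strings are assumptions based on the description above):

    import pandas as pd

    # Dev ground truth: columns fname, labels, mids, split.
    gt = pd.read_csv("FSD50K.ground_truth/dev.csv")
    # labels/mids are assumed to be comma-separated strings (multi-label clips).
    gt["labels"] = gt["labels"].str.split(",")
    gt["mids"] = gt["mids"].str.split(",")

    # vocabulary.csv: assumed headerless with columns index, label, mid.
    vocab = pd.read_csv("FSD50K.ground_truth/vocabulary.csv",
                        header=None, names=["index", "label", "mid"])
    label_to_mid = dict(zip(vocab["label"], vocab["mid"]))

    # The mid can then be looked up in the AudioSet Ontology specification
    # to recover the original AudioSet display name.
    train_clips = gt[gt["split"] == "train"]
    print(train_clips[["fname", "labels"]].head())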

    Files with additional metadata (FSD50K.metadata/)

    To allow a variety of analyses and approaches with FSD50K, we provide the following metadata:

    class_info_FSD50K.json: python dictionary where each entry corresponds to one sound class and contains: FAQs utilized during the annotation of the class, examples (representative audio clips), and verification_examples (audio clips presented to raters during annotation as a quality control mechanism). Audio clips are described by the Freesound id. Note: It may be that some of these examples are not included in the FSD50K release.

    dev_clips_info_FSD50K.json: python dictionary where each entry corresponds to one dev clip and contains: title,

  2. Google's Audioset: Reformatted

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 21, 2022
    Cite
    Bakhtin (2022). Google's Audioset: Reformatted [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7096701
    Dataset updated
    Sep 21, 2022
    Dataset authored and provided by
    Bakhtin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Google's AudioSet consistently reformatted

    While working with Google's AudioSet (https://research.google.com/audioset/index.html) I encountered some problems: the Weak (https://research.google.com/audioset/download.html) and Strong (https://research.google.com/audioset/download_strong.html) versions of the dataset use different csv formatting, the labels used in the two datasets are different (https://github.com/audioset/ontology/issues/9), and the labels are also provided in files with different formatting.

    This dataset reformatting aims to unify the formats of the datasets so that it is possible to analyse them in the same pipelines, and also make the dataset files compatible with psds_eval, dcase_util and sed_eval Python packages used in Audio Processing.

    For better formatted documentation and source code of reformatting refer to https://github.com/bakhtos/GoogleAudioSetReformatted

    -Changes in dataset

    All files are converted to tab-separated *.tsv files (i.e. csv files with \t as a separator). All files have a header as the first line.

    -New fields and filenames

    Fields are renamed according to the following table, to be compatible with psds_eval:

    Old field -> New field
    YTID -> filename
    segment_id -> filename
    start_seconds -> onset
    start_time_seconds -> onset
    end_seconds -> offset
    end_time_seconds -> offset
    positive_labels -> event_label
    label -> event_label
    present -> present

    For class label files, id is now the name for the mid label (e.g. /m/09xor) and label for the human-readable label (e.g. Speech). The index given for Weak dataset labels (the index field in class_labels_indices.csv) is not used.

    Files are renamed according to the following table to ensure consistent naming of the form audioset_[weak|strong]_[train|eval]_[balanced|unbalanced|posneg]*.tsv:

    Old name -> New name
    balanced_train_segments.csv -> audioset_weak_train_balanced.tsv
    unbalanced_train_segments.csv -> audioset_weak_train_unbalanced.tsv
    eval_segments.csv -> audioset_weak_eval.tsv
    audioset_train_strong.tsv -> audioset_strong_train.tsv
    audioset_eval_strong.tsv -> audioset_strong_eval.tsv
    audioset_eval_strong_framed_posneg.tsv -> audioset_strong_eval_posneg.tsv
    class_labels_indices.csv -> class_labels.tsv (merged with mid_to_display_name.tsv)
    mid_to_display_name.tsv -> class_labels.tsv (merged with class_labels_indices.csv)

    -Strong dataset changes

    The only changes to the Strong dataset are the renaming of fields and the reordering of columns, so that both the Weak and Strong versions have filename and event_label as the first two columns.

    -Weak dataset changes

    -- Labels are given one per line, instead of comma-separated and quoted list

    -- To make sure that the filename format is the same as in the Strong version, the following format change is made: the value of the start_seconds field is converted to milliseconds and appended to the filename with an underscore. Since all files in the dataset are assumed to be 10 seconds long, this unifies the filename format with the Strong version and makes end_seconds redundant.
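
    As a concrete illustration, a small sketch of this conversion in Python (the YTID and label values are hypothetical, and integer milliseconds are assumed):

    def reformat_weak_segment(ytid: str, start_seconds: float, positive_labels: str):
        # start_seconds in milliseconds is appended to the YTID with an underscore
        filename = f"{ytid}_{int(start_seconds * 1000)}"
        # labels are written one per line instead of a quoted, comma-separated list
        return [(filename, label) for label in positive_labels.split(",")]

    rows = reformat_weak_segment("-0RWZT-miFs", 30.0, "/m/03fwl,/m/04rlf")
    # [('-0RWZT-miFs_30000', '/m/03fwl'), ('-0RWZT-miFs_30000', '/m/04rlf')]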

    -Class labels changes

    Class labels from both datasets are merged into one file and given in alphabetical order of ids. Since the same ids are present in both datasets, sometimes with different human-readable labels, labels from the Strong dataset overwrite those from the Weak one. It is possible to regenerate class_labels.tsv while giving priority to the Weak version of the labels by calling convert_labels(False) from convert.py in the GitHub repository.
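
    A hedged sketch of that merge (the column layouts of the two source files are assumptions; the actual logic lives in convert.py of the GitHub repository):

    import csv

    labels = {}
    # Weak-set labels: class_labels_indices.csv, assumed columns index, mid, display_name.
    with open("class_labels_indices.csv", newline="") as f:
        for row in csv.DictReader(f):
            labels[row["mid"]] = row["display_name"]

    # Strong-set labels overwrite Weak ones: mid_to_display_name.tsv, assumed headerless.
    with open("mid_to_display_name.tsv", newline="") as f:
        for mid, name in csv.reader(f, delimiter="\t"):
            labels[mid] = name

    with open("class_labels.tsv", "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["id", "label"])
        for mid in sorted(labels):   # alphabetical order of ids
            writer.writerow([mid, labels[mid]])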

    -License

    Google's AudioSet was published in two stages - first the Weakly labelled data (Gemmeke, Jort F., et al. "Audio set: An ontology and human-labeled dataset for audio events." 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2017.), then the strongly labelled data (Hershey, Shawn, et al. "The benefit of temporally-strong labels in audio event classification." ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.)

    Both the original dataset and this reworked version are licensed under CC BY 4.0

    Class labels come from the AudioSet Ontology, which is licensed under CC BY-SA 4.0.

  3. FSD50k

    • huggingface.co
    Updated Oct 2, 2020
    Cite
    Nelson Yalta (2020). FSD50k [Dataset]. https://huggingface.co/datasets/Fhrozen/FSD50k
    Dataset updated
    Oct 2, 2020
    Authors
    Nelson Yalta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Freesound Dataset 50k (FSD50K)

      Important
    

    This dataset is a copy of the original one located at Zenodo.

      Citation
    

    If you use the FSD50K dataset, or part of it, please cite our paper:

    Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, Xavier Serra. "FSD50K: an Open Dataset of Human-Labeled Sound Events", arXiv 2020.

      Data curators
    

    Eduardo Fonseca, Xavier Favory, Jordi Pons, Mercedes Collado, Ceren Can, Rachit Gupta, Javier Arredondo, Gary… See the full description on the dataset page: https://huggingface.co/datasets/Fhrozen/FSD50k.

  4. FSD-FS

    • zenodo.org
    • data.niaid.nih.gov
    bin, zip
    Updated Jan 22, 2023
    Cite
    Jinhua Liang; Huy Phan; Emmanouil Benetos (2023). FSD-FS [Dataset]. http://doi.org/10.5281/zenodo.7557107
    Available download formats: bin, zip
    Dataset updated
    Jan 22, 2023
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Jinhua Liang; Huy Phan; Emmanouil Benetos
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    FSD-FS is a publicly-available database of human-labelled sound events for few-shot learning. It spans 143 classes obtained from the AudioSet Ontology and contains 43,805 raw audio files collected from FSD50K. FSD-FS is curated at the Centre for Digital Music, Queen Mary University of London.

    Citation

    If you use the FSD-FS dataset, please cite our paper and FSD50K.

    @article{liang2022learning,
     title={Learning from Taxonomy: Multi-label Few-Shot Classification for Everyday Sound Recognition},
     author={Liang, Jinhua and Phan, Huy and Benetos, Emmanouil},
     journal={arXiv preprint arXiv:2212.08952},
     year={2022}
    }
    
    @ARTICLE{9645159,
     author={Fonseca, Eduardo and Favory, Xavier and Pons, Jordi and Font, Frederic and Serra, Xavier},
     journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
     title={FSD50K: An Open Dataset of Human-Labeled Sound Events},
     year={2022},
     volume={30},
     number={},
     pages={829-852},
     doi={10.1109/TASLP.2021.3133208}
    }

    About FSD-FS

    FSD-FS is an open database for multi-label few-shot audio classification containing 143 classes drawn from the FSD50K. It also inherits the AudioSet Ontology. FSD-FS follows the ratio 7:2:1 to split classes into base, validation, and evaluation sets, so there are 98 classes in the base set, 30 classes in the validation set, and 15 classes in the evaluation set (More details can be found in our paper).

    LICENSE

    FSD-FS is released under Creative Commons (CC) licenses. As with FSD50K, each clip has its own license as defined by the clip uploader in Freesound, some of them requiring attribution to their original authors and some forbidding further commercial reuse. For more details, please refer to the link.

    FILES

    FSD-FS is organised in the following structure:

    root
    |
    └─── dev_base
    |
    └─── dev_val
    |
    └─── eval

    REFERENCES AND LINKS

    [1] Gemmeke, Jort F., et al. "Audio set: An ontology and human-labeled dataset for audio events." 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2017. [paper] [link]

    [2] Fonseca, Eduardo, et al. "Fsd50k: an open dataset of human-labeled sound events." IEEE/ACM Transactions on Audio, Speech, and Language Processing 30 (2021): 829-852. [paper] [code]

  5. FSDKaggle2019

    • data.niaid.nih.gov
    Updated Jan 24, 2020
    Cite
    Daniel P. W. Ellis (2020). FSDKaggle2019 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3612636
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Eduardo Fonseca
    Daniel P. W. Ellis
    Manoj Plakal
    Xavier Serra
    Frederic Font
    Description

    FSDKaggle2019 is an audio dataset containing 29,266 audio files annotated with 80 labels of the AudioSet Ontology. FSDKaggle2019 has been used for the DCASE Challenge 2019 Task 2, which was run as a Kaggle competition titled Freesound Audio Tagging 2019.

    Citation

    If you use the FSDKaggle2019 dataset or part of it, please cite our DCASE 2019 paper:

    Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Serra. "Audio tagging with noisy labels and minimal supervision". Proceedings of the DCASE 2019 Workshop, NYC, US (2019)

    You can also consider citing our ISMIR 2017 paper, which describes how we gathered the manual annotations included in FSDKaggle2019.

    Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, "Freesound Datasets: A Platform for the Creation of Open Audio Datasets", In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017

    Data curators

    Eduardo Fonseca, Manoj Plakal, Xavier Favory, Jordi Pons

    Contact

    You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.

    ABOUT FSDKaggle2019

    Freesound Dataset Kaggle 2019 (or FSDKaggle2019 for short) is an audio dataset containing 29,266 audio files annotated with 80 labels of the AudioSet Ontology [1]. FSDKaggle2019 has been used for Task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2019. Please visit the DCASE2019 Challenge Task 2 website for more information. This task was hosted on the Kaggle platform as a competition titled Freesound Audio Tagging 2019. It was organized by researchers from the Music Technology Group (MTG) of Universitat Pompeu Fabra (UPF) and from the Sound Understanding team at Google AI Perception. The competition intended to provide insight towards the development of broadly-applicable sound event classifiers able to cope with label noise and minimal supervision conditions.

    FSDKaggle2019 employs audio clips from the following sources:

    Freesound Dataset (FSD): a dataset being collected at the MTG-UPF based on Freesound content organized with the AudioSet Ontology

    The soundtracks of a pool of Flickr videos taken from the Yahoo Flickr Creative Commons 100M dataset (YFCC)

    The audio data is labeled using a vocabulary of 80 labels from Google’s AudioSet Ontology [1], covering diverse topics: Guitar and other Musical Instruments, Percussion, Water, Digestive, Respiratory sounds, Human voice, Human locomotion, Hands, Human group actions, Insect, Domestic animals, Glass, Liquid, Motor vehicle (road), Mechanisms, Doors, and a variety of Domestic sounds. The full list of categories can be inspected in vocabulary.csv (see Files & Download below). The goal of the task was to build a multi-label audio tagging system that can predict appropriate label(s) for each audio clip in a test set.

    What follows is a summary of some of the most relevant characteristics of FSDKaggle2019. Nevertheless, it is highly recommended to read our DCASE 2019 paper for a more in-depth description of the dataset and how it was built.

    Ground Truth Labels

    The ground truth labels are provided at the clip-level, and express the presence of a sound category in the audio clip, hence can be considered weak labels or tags. Audio clips have variable lengths (roughly from 0.3 to 30s).

    The audio content from FSD has been manually labeled by humans following a data labeling process using the Freesound Annotator platform. Most labels have inter-annotator agreement but not all of them. More details about the data labeling process and the Freesound Annotator can be found in [2].

    The YFCC soundtracks were labeled using automated heuristics applied to the audio content and metadata of the original Flickr clips. Hence, a substantial amount of label noise can be expected. The label noise can vary widely in amount and type depending on the category, including in- and out-of-vocabulary noises. More information about some of the types of label noise that can be encountered is available in [3].

    Specifically, FSDKaggle2019 features three types of label quality, one for each set in the dataset:

    curated train set: correct (but potentially incomplete) labels

    noisy train set: noisy labels

    test set: correct and complete labels

    Further details can be found below in the sections for each set.

    Format

    All audio clips are provided as uncompressed PCM 16 bit, 44.1 kHz, mono audio files.

    DATA SPLIT

    FSDKaggle2019 consists of two train sets and one test set. The idea is to limit the supervision provided for training (i.e., the manually-labeled, hence reliable, data), thus promoting approaches to deal with label noise.

    Curated train set

    The curated train set consists of manually-labeled data from FSD.

    Number of clips/class: 75, except in a few cases (where there are fewer)

    Total number of clips: 4970

    Avg number of labels/clip: 1.2

    Total duration: 10.5 hours

    The duration of the audio clips ranges from 0.3 to 30s due to the diversity of the sound categories and the preferences of Freesound users when recording/uploading sounds. Labels are correct but potentially incomplete. It can happen that a few of these audio clips present additional acoustic material beyond the provided ground truth label(s).

    Noisy train set

    The noisy train set is a larger set of noisy web audio data from Flickr videos taken from the YFCC dataset [5].

    Number of clips/class: 300

    Total number of clips: 19,815

    Avg number of labels/clip: 1.2

    Total duration: ~80 hours

    The duration of the audio clips ranges from 1s to 15s, with the vast majority lasting 15s. Labels are automatically generated and purposefully noisy. No human validation is involved. The label noise can vary widely in amount and type depending on the category, including in- and out-of-vocabulary noises.

    Considering the numbers above, the per-class data distribution available for training is, for most of the classes, 300 clips from the noisy train set and 75 clips from the curated train set. This means 80% noisy / 20% curated at the clip level, while at the duration level the proportion is more extreme considering the variable-length clips.

    Test set

    The test set is used for system evaluation and consists of manually-labeled data from FSD.

    Number of clips/class: between 50 and 150

    Total number of clips: 4481

    Avg number of labels/clip: 1.4

    Total duration: 12.9 hours

    The acoustic material present in the test set clips is labeled exhaustively using the aforementioned vocabulary of 80 classes. Most labels have inter-annotator agreement but not all of them. Barring human error, the labels are correct and complete considering the target vocabulary; nonetheless, a few clips could still present additional (unlabeled) acoustic content outside the vocabulary.

    During the DCASE2019 Challenge Task 2, the test set was split into two subsets, for the public and private leaderboards, and only the data corresponding to the public leaderboard was provided. In this current package you will find the full test set with all the test labels. To allow comparison with previous work, the file test_post_competition.csv includes a flag to determine the corresponding leaderboard (public or private) for each test clip (see more info in Files & Download below).

    Acoustic mismatch

    As mentioned before, FSDKaggle2019 uses audio clips from two sources:

    FSD: curated train set and test set, and

    YFCC: noisy train set.

    While the sources of audio (Freesound and Flickr) are collaboratively contributed and pretty diverse themselves, a certain acoustic mismatch can be expected between FSD and YFCC. We conjecture this mismatch comes from a variety of reasons. For example, through acoustic inspection of a small sample of both data sources, we find a higher percentage of high quality recordings in FSD. In addition, audio clips in Freesound are typically recorded with the purpose of capturing audio, which is not necessarily the case in YFCC.

    This mismatch can have an impact in the evaluation, considering that most of the train data come from YFCC, while all test data are drawn from FSD. This constraint (i.e., noisy training data coming from a different web audio source than the test set) is sometimes a real-world condition.

    LICENSE

    All clips in FSDKaggle2019 are released under Creative Commons (CC) licenses. For attribution purposes and to facilitate attribution of these files to third parties, we include a mapping from the audio clips to their corresponding licenses.

    Curated train set and test set. All clips in Freesound are released under different modalities of Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound, some of them requiring attribution to their original authors and some forbidding further commercial reuse. The licenses are specified in the files train_curated_post_competition.csv and test_post_competition.csv. These licenses can be CC0, CC-BY, CC-BY-NC and CC Sampling+.

    Noisy train set. Similarly, the licenses of the soundtracks from Flickr used in FSDKaggle2019 are specified in the file train_noisy_post_competition.csv. These licenses can be CC-BY and CC BY-SA.

    In addition, FSDKaggle2019 as a whole is the result of a curation process and it has an additional license. FSDKaggle2019 is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSDKaggle2019.doc zip file.

    FILES & DOWNLOAD

    FSDKaggle2019 can be downloaded as a series of zip files with the following directory structure:

    root
    │
    └───FSDKaggle2019.audio_train_curated/   Audio clips in the curated train set
    │
    └───FSDKaggle2019.audio_train_noisy/     Audio clips in the noisy

  6. AudioSet [Train]

    • kaggle.com
    Updated Dec 3, 2020
    Cite
    ZFTurbo (2020). AudioSet [Train] [Dataset]. https://www.kaggle.com/zfturbo/audioset/discussion
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Dec 3, 2020
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    ZFTurbo
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context

    The AudioSet dataset is a large-scale collection of human-labeled 10-second sound clips drawn from YouTube videos. AudioSet is brought to you by the Sound and Video Understanding teams pursuing Machine Perception research at Google. The official AudioSet site is located here. The main problem is that AudioSet was not released as audio files but only as YouTube links, which are hard to use. In this dataset you can find extracted raw WAV files for the balanced train, evaluation, and manually created test data.

    Content

    Dataset consists of following folders and files:

    • train_wav - folder with audio files in WAV format
    • class_label_indices.csv - file with class_id mapping
    • train.csv - meta-data including target classes for train audio files
    • train_missed.csv - files which are not available (compared with the original dataset)

    Additional data

    Current Problems

    Around 10% of the data is already unavailable due to the removal of some videos from YouTube.

  7. fsd50k

    • huggingface.co
    Updated Jul 8, 2025
    Cite
    Philippe Gonzalez (2025). fsd50k [Dataset]. https://huggingface.co/datasets/philgzl/fsd50k
    Dataset updated
    Jul 8, 2025
    Authors
    Philippe Gonzalez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    FSD50K: An open dataset of human-labeled sound events

    This is a mirror of the FSD50K sound event dataset. The original files were converted from WAV to Opus to reduce the size and accelerate streaming.

    Sampling rate: 48 kHz
    Channels: 1
    Format: Opus
    Splits:
    Dev: 80 hours, 40966 clips.
    Eval: 28 hours, 10231 clips.

    License: FSD50K is released under CC-BY. However, each clip has its own licence. Clip licenses include CC0, CC-BY, CC-BY-NC and CC Sampling+. Clip licenses are specified… See the full description on the dataset page: https://huggingface.co/datasets/philgzl/fsd50k.
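
    A minimal sketch of streaming this mirror with the Hugging Face datasets library (the split name and available fields are assumptions; check the dataset page for the actual configuration):

    from datasets import load_dataset

    # Streaming avoids downloading the full archive up front.
    ds = load_dataset("philgzl/fsd50k", split="dev", streaming=True)  # split name assumed
    first = next(iter(ds))
    print(first.keys())  # inspect the available fields (audio, labels, ...)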

  8. FSC22 Dataset

    • kaggle.com
    Updated Sep 20, 2022
    Cite
    IRMIOT22 (2022). FSC22 Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/4213460
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Sep 20, 2022
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    IRMIOT22
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Forest environmental sound classification is one use case of ESC that has been widely explored to identify illegal activities inside forests. Since no public datasets specific to forest sounds are available, a benchmark forest environment sound dataset is needed. With this motivation, FSC22 was created as a public benchmark dataset, using audio samples collected from Freesound.org.

    This dataset includes 2,025 labeled sound clips, each 5 s long. All audio samples are distributed among six major parent-level classes: Mechanical sounds, Animal sounds, Environmental Sounds, Vehicle Sounds, Forest Threat Sounds, and Human Sounds. Further, each class is divided into subclasses that capture specific sounds falling under the main category. Overall, the dataset taxonomy consists of 34 classes. For the first phase of the dataset creation, 75 audio samples were collected for each of 27 classes.

    We expect that this dataset will help research communities with their work on forest acoustic monitoring and classification.

  9. Test dataset for separation of speech, traffic sounds, wind noise, and...

    • live.european-language-grid.eu
    audio wav
    Updated Apr 24, 2024
    Cite
    (2024). Test dataset for separation of speech, traffic sounds, wind noise, and general sounds [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7681
    Available download formats: audio wav
    Dataset updated
    Apr 24, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset was generated as part of the paper: "DCUnet-Based Multi-Model Approach for Universal Sound Separation", K. Arendt, A. Szumaczuk, B. Jasik, K. Piaskowski, P. Masztalski, M. Matuszewski, K. Nowicki, P. Zborowski. It contains various sounds from AudioSet [1] and spoken utterances from the VCTK [2] and DNS [3] datasets.

    Contents:
    sr_8k/   mix_clean/  s1/  s2/  s3/  s4/
    sr_16k/  mix_clean/  s1/  s2/  s3/  s4/
    sr_48k/  mix_clean/  s1/  s2/  s3/  s4/

    Each directory contains 512 audio samples at a different sampling rate (sr_8k - 8 kHz, sr_16k - 16 kHz, sr_48k - 48 kHz). The audio samples for each sampling rate are different, as they were generated randomly and separately. Each directory contains 5 subdirectories:
    - mix_clean - mixed sources,
    - s1 - source #1 (general sounds),
    - s2 - source #2 (speech),
    - s3 - source #3 (traffic sounds),
    - s4 - source #4 (wind noise).

    The sound mixtures were generated by adding s2, s3, s4 to s1 with SNR ranging from -10 to 10 dB w.r.t. s1 (a sketch of this mixing rule is given at the end of this entry, after the references).

    REFERENCES:
    [1] Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter, "Audio set: An ontology and human-labeled dataset for audio events," in Proc. IEEE ICASSP 2017, New Orleans, LA, 2017.
    [2] Christophe Veaux, Junichi Yamagishi, and Kirsten MacDonald, "CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit, [sound]," https://doi.org/10.7488/ds/1994, University of Edinburgh, The Centre for Speech Technology Research (CSTR), 2017.
    [3] Chandan K. A. Reddy, Ebrahim Beyrami, Harishchandra Dubey, Vishak Gopal, Roger Cheng, Ross Cutler, Sergiy Matusevych, Robert Aichner, Ashkan Aazami, Sebastian Braun, Puneet Rana, Sriram Srinivasan, and Johannes Gehrke, "The INTERSPEECH 2020 deep noise suppression challenge: Datasets, subjective speech quality and testing framework," 2020.
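
    As a hedged illustration of the mixing rule mentioned above (assuming the SNR is defined as the power ratio of the added source to s1, computed over the full clips):

    import numpy as np

    def mix_at_snr(s1: np.ndarray, source: np.ndarray, snr_db: float) -> np.ndarray:
        """Add `source` to `s1`, scaled so that the power ratio of the added
        source to s1 equals snr_db (this reading of "SNR w.r.t. s1" is an assumption)."""
        p_ref = np.mean(s1 ** 2)
        p_src = np.mean(source ** 2)
        gain = np.sqrt(p_ref / p_src * 10 ** (snr_db / 10))
        return s1 + gain * source

    # e.g. add speech (s2) at -10 dB relative to the general-sound track (s1):
    # mixture = mix_at_snr(s1, s2, snr_db=-10.0)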

  10. DCASE2019_task4_synthetic_data

    • data.niaid.nih.gov
    Updated Jan 24, 2020
    Cite
    Shah Ankit Parag (2020). DCASE2019_task4_synthetic_data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2583795
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Turpault Nicolas
    Serizel Romain
    Salamon Justin
    Shah Ankit Parag
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Synthetic data for DCASE 2019 task 4

    Freesound dataset [1,2]: A subset of FSD is used as foreground sound events for the synthetic subset of the dataset for DCASE 2019 task 4. FSD is a large-scale, general-purpose audio dataset composed of Freesound content annotated with labels from the AudioSet Ontology [3].

    SINS dataset [4]: The derivative of the SINS dataset used for DCASE2018 task 5 is used as background for the synthetic subset of the dataset for DCASE 2019 task 4. The SINS dataset contains a continuous recording of one person living in a vacation home over a period of one week. It was collected using a network of 13 microphone arrays distributed over the entire home. The microphone array consists of 4 linearly arranged microphones.

    The synthetic set is composed of 10-second audio clips generated with Scaper [5]. The foreground events are obtained from FSD. Each event audio clip was verified manually to ensure that the sound quality and the event-to-background ratio were sufficient for it to be used as an isolated event. We also verified that the event was actually dominant in the clip, and we checked whether the event onset and offset were present in the clip. Each selected clip was then segmented when needed to remove silences before and after the event, and between events when the file contained multiple occurrences of the event class.
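
    For readers unfamiliar with Scaper, a rough sketch of generating one such 10-second soundscape could look as follows (folder layout, label choices and distribution parameters are placeholders, not the organisers' actual generation script):

    import scaper

    sc = scaper.Scaper(duration=10.0,
                       fg_path="foreground_fsd/",   # isolated FSD events (placeholder path)
                       bg_path="background_sins/")  # SINS-derived backgrounds (placeholder path)
    sc.ref_db = -50

    # One background track spanning the whole clip.
    sc.add_background(label=("choose", []),
                      source_file=("choose", []),
                      source_time=("const", 0))

    # One foreground event at a random position (parameters are illustrative).
    sc.add_event(label=("choose", []),
                 source_file=("choose", []),
                 source_time=("const", 0),
                 event_time=("uniform", 0, 8),
                 event_duration=("const", 2),
                 snr=("uniform", 6, 30),
                 pitch_shift=None,
                 time_stretch=None)

    sc.generate("soundscape.wav", "soundscape.jams")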

    License:

    All sounds coming from FSD are released under Creative Commons licences. Synthetic sounds can only be used for competition purposes until the full CC license list is made available at the end of the competition.

    Further information is available on the DCASE website.

    References:

    [1] F. Font, G. Roma & X. Serra. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia. ACM, 2013.

    [2] E. Fonseca, J. Pons, X. Favory, F. Font, D. Bogdanov, A. Ferraro, S. Oramas, A. Porter & X. Serra. Freesound Datasets: A Platform for the Creation of Open Audio Datasets. In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017.

    [3] Jort F. Gemmeke and Daniel P. W. Ellis and Dylan Freedman and Aren Jansen and Wade Lawrence and R. Channing Moore and Manoj Plakal and Marvin Ritter. Audio Set: An ontology and human-labeled dataset for audio events. In Proceedings IEEE ICASSP 2017, New Orleans, LA, 2017.

    [4] Gert Dekkers, Steven Lauwereins, Bart Thoen, Mulu Weldegebreal Adhana, Henk Brouckxon, Toon van Waterschoot, Bart Vanrumste, Marian Verhelst, and Peter Karsmakers. The SINS database for detection of daily activities in a home environment using an acoustic sensor network. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), 32–36. November 2017.

    [5] J. Salamon, D. MacConnell, M. Cartwright, P. Li, and J. P. Bello. Scaper: A library for soundscape synthesis and augmentation In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, Oct. 2017.

  11. AudioSet2K22

    • huggingface.co
    Updated Sep 30, 2023
    Cite
    Nelson Yalta (2023). AudioSet2K22 [Dataset]. https://huggingface.co/datasets/Fhrozen/AudioSet2K22
    Dataset updated
    Sep 30, 2023
    Authors
    Nelson Yalta
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for audioset2022

      Dataset Summary
    

    The AudioSet ontology is a collection of sound events organized in a hierarchy. The ontology covers a wide range of everyday sounds, from human and animal sounds, to natural and environmental sounds, to musical and miscellaneous sounds. This repository only includes audio files for DCASE 2022 - Task 3. The included labels are limited to:

    Female speech, woman speaking; Male speech, man speaking; Clapping; Telephone; Telephone bell… See the full description on the dataset page: https://huggingface.co/datasets/Fhrozen/AudioSet2K22.

  12. ARCA23K

    • zenodo.org
    bin, zip
    Updated Feb 25, 2022
    Cite
    Turab Iqbal; Yin Cao; Andrew Bailey; Mark D. Plumbley; Wenwu Wang (2022). ARCA23K [Dataset]. http://doi.org/10.5281/zenodo.5117901
    Available download formats: zip, bin
    Dataset updated
    Feb 25, 2022
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Turab Iqbal; Yin Cao; Andrew Bailey; Mark D. Plumbley; Wenwu Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ARCA23K is a dataset of labelled sound events created to investigate real-world label noise. It contains 23,727 audio clips originating from Freesound, and each clip belongs to one of 70 classes taken from the AudioSet ontology. The dataset was created using an entirely automated process with no manual verification of the data. For this reason, many clips are expected to be labelled incorrectly.

    In addition to ARCA23K, this release includes a companion dataset called ARCA23K-FSD, which is a single-label subset of the FSD50K dataset. ARCA23K-FSD contains the same sound classes as ARCA23K and the same number of audio clips per class. As it is a subset of FSD50K, each clip and its label have been manually verified. Note that only the ground truth data of ARCA23K-FSD is distributed in this release. To download the audio clips, please visit the Zenodo page for FSD50K.

    A paper has been published detailing how the dataset was constructed. See the Citing section below.

    The source code used to create the datasets is available: https://github.com/tqbl/arca23k-dataset

    Characteristics

    • ARCA23K(-FSD) is divided into:
      • A training set containing 17,979 clips (39.6 hours for ARCA23K).
      • A validation set containing 2,264 clips (5.0 hours).
      • A test set containing 3,484 clips (7.3 hours).
    • There are 70 sound classes in total. Each class belongs to the AudioSet ontology.
    • Each audio clip was sourced from the Freesound database. Other than format conversions (e.g. resampling), the audio clips have not been modified.
    • The duration of the audio clips varies from 0.3 seconds to 30 seconds.
    • All audio clips are mono 16-bit WAV files sampled at 44.1 kHz.
    • Based on listening tests (details in paper), 46.4% of the training examples are estimated to be labelled incorrectly. Among the incorrectly-labelled examples, 75.9% are estimated to be out-of-vocabulary.

    Sound Classes

    The list of sound classes is given below. They are grouped based on the top-level superclasses of the AudioSet ontology.

    Music

    • Acoustic guitar
    • Bass guitar
    • Bowed string instrument
    • Crash cymbal
    • Electric guitar
    • Gong
    • Harp
    • Organ
    • Piano
    • Rattle (instrument)
    • Scratching (performance technique)
    • Snare drum
    • Trumpet
    • Wind chime
    • Wind instrument, woodwind instrument

    Sounds of things

    • Boom
    • Camera
    • Coin (dropping)
    • Computer keyboard
    • Crack
    • Dishes, pots, and pans
    • Drawer open or close
    • Drill
    • Gunshot, gunfire
    • Hammer
    • Keys jangling
    • Knock
    • Microwave oven
    • Printer
    • Sawing
    • Scissors
    • Skateboard
    • Slam
    • Splash, splatter
    • Squeak
    • Tap
    • Thump, thud
    • Toilet flush
    • Train
    • Water tap, faucet
    • Whoosh, swoosh, swish
    • Writing
    • Zipper (clothing)

    Natural sounds

    • Crackle
    • Stream
    • Waves, surf
    • Wind

    Human sounds

    • Burping, eructation
    • Chewing, mastication
    • Child speech, kid speaking
    • Clapping
    • Cough
    • Crying, sobbing
    • Fart
    • Female singing
    • Female speech, woman speaking
    • Finger snapping
    • Giggle
    • Male speech, man speaking
    • Run
    • Screaming
    • Walk, footsteps

    Animal

    • Bark
    • Cricket
    • Livestock, farm animals, working animals
    • Meow
    • Rattle

    Source-ambiguous sounds

    • Crumpling, crinkling
    • Crushing
    • Tearing

    License and Attribution

    This release is licensed under the Creative Commons Attribution 4.0 International License.

    The audio clips distributed as part of ARCA23K were sourced from Freesound and have their own Creative Commons license. The license information and attribution for each audio clip can be found in ARCA23K.metadata/train.json, which also includes the original Freesound URLs.

    The files under ARCA23K-FSD.ground_truth/ are an adaptation of the ground truth data provided as part of FSD50K, which is licensed under the Creative Commons Attribution 4.0 International License. The curators of FSD50K are Eduardo Fonseca, Xavier Favory, Jordi Pons, Mercedes Collado, Ceren Can, Rachit Gupta, Javier Arredondo, Gary Avendano, and Sara Fernandez.

    Citing

    If you wish to cite this work, please cite the following paper:

    T. Iqbal, Y. Cao, A. Bailey, M. D. Plumbley, and W. Wang, “ARCA23K: An audio dataset for investigating open-set label noise”, in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021), 2021, Barcelona, Spain, pp. 201–205.

    BibTeX:

    @inproceedings{Iqbal2021,
      author = {Iqbal, T. and Cao, Y. and Bailey, A. and Plumbley, M. D. and Wang, W.},
      title = {{ARCA23K}: An audio dataset for investigating open-set label noise},
      booktitle = {Proceedings of the Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021)},
      pages = {201--205},
      year = {2021},
      address = {Barcelona, Spain},
    }
  13. FSD-MIX-CLIPS

    • data.niaid.nih.gov
    Updated Oct 17, 2021
    Cite
    Yu Wang (2021). FSD-MIX-CLIPS [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5574134
    Dataset updated
    Oct 17, 2021
    Dataset provided by
    Justin Salamon
    Nicholas J. Bryan
    Mark Cartwright
    Yu Wang
    Juan Pablo Bello
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Created by

    Yu Wang, Nicholas J. Bryan, Justin Salamon, Mark Cartwright, and Juan Pablo Bello

    Publication

    If using this data in academic work, please cite the following paper, which presented this dataset:

    Y. Wang, N. J. Bryan, J. Salamon, M. Cartwright, and J. P. Bello. "Who calls the shots? Rethinking Few-shot Learning for Audio", IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2021

    Description

    FSD-MIX-CLIPS is an open dataset of programmatically mixed audio clips with a controlled level of polyphony and signal-to-noise ratio. We use single-labeled clips from FSD50K as the source material for the foreground sound events and Brownian noise as the background to generate 281,039 10-second strongly-labeled soundscapes with Scaper. We refer to this (intermediate) dataset of 10 s soundscapes as FSD-MIX-SED. Each soundscape contains n events from n different sound classes, where n ranges from 1 to 5. We then extract 614,533 1 s clips centered on each sound event in the soundscapes in FSD-MIX-SED to produce FSD-MIX-CLIPS.

    Source material and annotations

    Due to the large size of the dataset, instead of releasing the raw audio files, we release the source material and soundscape annotations in JAMS format, which can be used to reproduce FSD-MIX-SED using Scaper with the script in the project repository.

    All clips in FSD-MIX-CLIPS are extracted from FSD-MIX-SED. Therefore, for FSD-MIX-CLIPS, instead of releasing duplicated audio content, we provide annotations that specify the filename in FSD-MIX-SED and the corresponding starting time (in second) of each 1-second clip.

    Foreground material from FSD50K

    We choose clips shorter than 4s that have a single validated label with the Present and Predominant annotation type. We further trim the silence at the edges of each clip. The resulting subset contains clips each with a single and strong label. The 200 sound classes in FSD50K are hierarchically organized. We focus on the leaf nodes and rule out classes with less than 20 single-labeled clips. This gives us 89 sound classes. vocab.json contains the list of 89 classes, each class is then labeled by its index in the list.
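
    A minimal sketch of turning that list into the integer labels used by the dataset (assuming vocab.json is a plain JSON list of the 89 class names, as described):

    import json

    with open("vocab.json") as f:
        classes = json.load(f)   # list of 89 class names

    label_to_index = {name: idx for idx, name in enumerate(classes)}
    index_to_label = dict(enumerate(classes))
    print(len(classes), label_to_index[classes[0]])  # 89 0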

    Data splits

    FSD-MIX-CLIPS was originally generated for the task of multi-label audio classification under a few-shot continual learning setup. Therefore, the classes are split into disjoint sets of base and novel classes, where novel class data are only used at inference time. We partition the 89 classes into three splits: base, novel-val, and novel-test with 59, 15, and 15 classes, respectively. Base class data are used for both training and evaluation while novel-val/novel-test class data are used for validation/test only.

    Files

    FSD_MIX_SED.source.tar.gz contains the background Brownian noise and 10,296 single-labeled sound events from FSD50K in .wav format. The original file size is 1.9GB.

    FSD_MIX_SED.annotations.tar.gz contains 281,039 JAMS files. The original file size is 35GB.

    FSD_MIX_CLIPS.annotations.tar.gz contains ground truth labels for 1-second clips in each data split in FSD_MIX_SED, specified by filename and starting time (sec).

    vocab.json contains the 89 classes.

    Foreground sound materials and soundscape annotations in FSD_MIX_SED are organized in a similar folder structure following the data splits:

    root folder
    │
    └───base/                  Base classes (label 0-58)
    │   │
    │   └─── train/
    │   │    └─── audio or annotation files
    │   │
    │   └─── val/
    │   │    └─── audio or annotation files
    │   │
    │   └─── test/
    │        └─── audio or annotation files
    │
    └───val/                   Novel-val classes (label 59-73)
    │   └─── audio or annotation files
    │
    └───test/                  Novel-test classes (label 74-88)
        └─── audio or annotation files

    References

    [1] Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, Xavier Serra. "FSD50K: an Open Dataset of Human-Labeled Sound Events", arXiv:2010.00475, 2020.

  14. DCASE 2018 Task 4

    • opendatalab.com
    zip
    Cite
    Carnegie Mellon University, DCASE 2018 Task 4 [Dataset]. https://opendatalab.com/OpenDataLab/DCASE_2018_Task_4
    Available download formats: zip (2,496,883 bytes)
    Dataset provided by
    Carnegie Mellon University
    Johannes Kepler University Linz
    University of Lorraine
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    DCASE2018 Task 4 is a dataset for large-scale weakly labeled semi-supervised sound event detection in domestic environments. The data are YouTube video excerpts focusing on the domestic context, which could be used, for example, in ambient assisted living applications. The domain was chosen due to the scientific challenges (wide variety of sounds, time-localized events...) and potential industrial applications. Specifically, the task employs a subset of "Audioset: An Ontology And Human-Labeled Dataset For Audio Events" by Google. AudioSet consists of an expanding ontology of 632 sound event classes and a collection of 2 million human-labeled 10-second sound clips (less than 21% are shorter than 10 seconds) drawn from 2 million YouTube videos. The ontology is specified as a hierarchical graph of event categories, covering a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds. Task 4 focuses on a subset of AudioSet that consists of 10 classes of sound events: speech, dog, cat, alarm bell ringing, dishes, frying, blender, running water, vacuum cleaner, and electric shaver/toothbrush.

  15. Best performances up to statistic significance achieved using...

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Jun 1, 2023
    Cite
    Wenjing Han; Eduardo Coutinho; Huabin Ruan; Haifeng Li; Björn Schuller; Xiaojie Yu; Xuan Zhu (2023). Best performances up to statistic significance achieved using semi-supervised active learning (SSAL), active learning (AL), and passive learning (PL) in pool-based and stream-based scenarios, as well as the number of human-labeled instances (#HLI) needed to achieve that performance. [Dataset]. http://doi.org/10.1371/journal.pone.0162075.t009
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Wenjing Han; Eduardo Coutinho; Huabin Ruan; Haifeng Li; Björn Schuller; Xiaojie Yu; Xuan Zhu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Best performances up to statistic significance achieved using semi-supervised active learning (SSAL), active learning (AL), and passive learning (PL) in pool-based and stream-based scenarios, as well as the number of human-labeled instances (#HLI) needed to achieve that performance.

  16. SONYC-FSD-SED

    • data.niaid.nih.gov
    Updated Sep 20, 2022
    Cite
    Yu Wang (2022). SONYC-FSD-SED [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6392323
    Dataset updated
    Sep 20, 2022
    Dataset provided by
    Mark Cartwright
    Yu Wang
    Juan Pablo Bello
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Created by

    Yu Wang, Mark Cartwright, and Juan Pablo Bello

    Publication

    If using this data in academic work, please cite the following paper, which presented this dataset:

    Y. Wang, M. Cartwright, and J. P. Bello. "Active Few-Shot Learning for Sound Event Detection", INTERSPEECH, 2022

    Description

    SONYC-FSD-SED is an open dataset of programmatically mixed audio clips that simulates audio data in an environmental sound monitoring system, where sound class occurrences and co-occurrences exhibit seasonal periodic patterns. We use recordings collected from the Sound of New York City (SONYC) acoustic sensor network as backgrounds, and single-labeled clips in the FSD50K dataset as foreground events, to generate 576,591 10-second strongly-labeled soundscapes with Scaper (including 111,294 additional test soundscapes for the sampling-window experiment). Instead of sampling foreground sound events uniformly, we simulate the occurrence probability of each class at different times in a year, creating more realistic temporal characteristics.

    Source material and annotations

    Due to the large size of the dataset, instead of releasing the raw audio files, we release the source material and soundscape annotations in JAMS format, which can be used to reproduce SONYC-FSD-SED using Scaper with the script in the project repository.
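
    A minimal sketch of that regeneration step, assuming Scaper's standard generate_from_jams API and placeholder paths for the extracted source material (the script in the project repository is the authoritative reference):

        # Rebuild one soundscape from its JAMS annotation (paths are placeholders).
        import os
        import scaper

        jams_file = "annotations/soundscape_000001.jams"   # hypothetical file name
        audio_out = "audio/soundscape_000001.wav"
        os.makedirs(os.path.dirname(audio_out), exist_ok=True)

        scaper.generate_from_jams(
            jams_file,
            audio_out,
            fg_path="source/foreground",   # FSD50K-derived foreground clips
            bg_path="source/background",   # 96 SONYC background clips
        )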

    Background material from SONYC recordings

    We pick a sensor from the SONYC sensor network and subsample from the recordings it collected within a year (2017). We categorize these ∼550k 10-second clips into 96 bins based on their timestamps, where each bin represents a unique combination of month of the year, day type (weekday or weekend), and time of day (divided into four 6-hour blocks). Next, we run a pre-trained urban sound event classifier over all recordings and filter out clips with active sound classes. We do not filter out footstep and bird since they appear too frequently; instead, we remove these two classes from the foreground sound material. Then, from each bin, we choose the clip with the lowest sound pressure level, yielding 96 background clips.
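
    As an illustration of that binning (12 months × weekday/weekend × four 6-hour blocks = 96 bins), here is a rough sketch of mapping a timestamp to a bin index; the exact bin ordering used by the authors is an assumption:

        from datetime import datetime

        def timestamp_to_bin(ts: datetime) -> int:
            month = ts.month - 1                 # 0..11
            is_weekend = int(ts.weekday() >= 5)  # 0 = weekday, 1 = weekend
            block = ts.hour // 6                 # 0..3, four 6-hour blocks
            return (month * 2 + is_weekend) * 4 + block   # 0..95

        print(timestamp_to_bin(datetime(2017, 7, 15, 14, 30)))  # a 2017 Saturday afternoon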

    Foreground material from FSD50K

    We follow the same filtering process as in FSD-MIX-SED to get the subset of FSD50K with short single-labeled clips. In addition, we remove two classes, "Chirp_and_tweet" and "Walk_and_footsteps", that exist in our SONYC background recordings. This results in 87 sound classes. vocab.json contains the list of the 87 classes, and each class is labeled by its index in the list: 0-42 for train, 43-56 for val, and 57-86 for test.
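
    A small sketch of reading this vocabulary and its index-based split, assuming vocab.json is a flat JSON list of the 87 class names:

        import json

        with open("vocab.json") as f:
            vocab = json.load(f)            # assumed: list of 87 class names

        train_classes = vocab[0:43]         # indices 0-42
        val_classes   = vocab[43:57]        # indices 43-56
        test_classes  = vocab[57:87]        # indices 57-86
        print(len(train_classes), len(val_classes), len(test_classes))  # 43 14 30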

    Occurrence probability modelling

    For each class, we model its occurrence probability within a year. We use von Mises probability density functions to simulate the probability distribution over different weeks in a year and hours in a day, considering their cyclic characteristics: \( f(x \mid \mu, \kappa) = \frac{e^{\kappa \cos(x-\mu)}}{2\pi I_0(\kappa)} \), where \( I_0(\kappa) \) is the modified Bessel function of order 0, and \( \mu \) and \( 1/\kappa \) are analogous to the mean and variance of the normal distribution. We randomly sample \( (\mu_{year}, \mu_{day}) \) from \( [-\pi, \pi] \) and \( (\kappa_{year}, \kappa_{day}) \) from \( [0, 10] \). We also randomly assign \( p_{weekday} \in [0, 1] \) and \( p_{weekend} = 1 - p_{weekday} \) to simulate the probability distribution over the days of a week. Finally, we obtain the probability distribution over the entire year with a 1-hour resolution: at a given timestamp, we integrate \( f_{year} \) and \( f_{day} \) over the 1-hour window and multiply them together with \( p_{weekday} \) or \( p_{weekend} \), depending on the day. To speed up the subsequent sampling process, we scale the final probability distribution using a temperature parameter randomly sampled from \( [2, 3] \).
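
    A rough numerical sketch of this model using scipy; the pipeline above integrates the densities over each 1-hour window, which is approximated here by a point evaluation of the pdf, and the parameter names simply mirror the text:

        import numpy as np
        from scipy.stats import vonmises

        rng = np.random.default_rng(0)
        mu_year, mu_day = rng.uniform(-np.pi, np.pi, size=2)
        kappa_year, kappa_day = rng.uniform(0.1, 10, size=2)   # avoid kappa = 0
        p_weekday = rng.uniform(0, 1)
        p_weekend = 1.0 - p_weekday

        def occurrence_score(week_of_year, hour_of_day, is_weekend):
            # Map week-of-year and hour-of-day onto the circle [-pi, pi).
            x_year = 2 * np.pi * week_of_year / 52 - np.pi
            x_day = 2 * np.pi * hour_of_day / 24 - np.pi
            f_year = vonmises.pdf(x_year, kappa_year, loc=mu_year)
            f_day = vonmises.pdf(x_day, kappa_day, loc=mu_day)
            return f_year * f_day * (p_weekend if is_weekend else p_weekday)

        print(occurrence_score(week_of_year=30, hour_of_day=18, is_weekend=True))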

    Files

    SONYC_FSD_SED.source.tar.gz: 96 SONYC backgrounds and 10,158 foreground sounds in .wav format. The original file size is 2GB.

    SONYC_FSD_SED.annotations.tar.gz: 465,467 JAMS files. The original file size is 57GB.

    SONYC_FSD_SED_add_test.annotations.tar.gz: 111,294 JAMS files for additional test data. The original file size is 14GB.

    vocab.json: 87 classes.

    occ_prob_per_cl.pkl: Occurrence probability for each foreground sound class.

    References

    [1] J. P. Bello, C. T. Silva, O. Nov, R. L. DuBois, A. Arora, J. Salamon, C. Mydlarz, and H. Doraiswamy, “SONYC: A system for monitoring, analyzing, and mitigating urban noise pollution,” Commun. ACM, 2019

    [2] E. Fonseca, X. Favory, J. Pons, F. Font, X. Serra. "FSD50K: an Open Dataset of Human-Labeled Sound Events", arXiv:2010.00475, 2020.

  17. Nonspeech7k dataset

    • zenodo.org
    zip
    Updated Jun 14, 2023
    Cite
    Muhammad Mamunur Rashid; Guiqing Li; Chengrui Du (2023). Nonspeech7k dataset [Dataset]. http://doi.org/10.5281/zenodo.6967442
    Explore at:
    zip (available download formats)
    Dataset updated
    Jun 14, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Muhammad Mamunur Rashid; Guiqing Li; Chengrui Du
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset consists of 7,014 files delivered as 32 kHz mono audio files in .wav format, divided into train and test sets. The train set consists of 6,289 files and the test set of 725 files. The files were manually annotated with a single strong ground-truth label, and the length of each file ranges from 500 milliseconds to 4 seconds.

    The dataset is allowed for non-commercial and academic research purposes only, under the Creative Commons (CC BY-NC-SA 4.0) license. If you use the dataset, please cite our paper and acknowledge the sources (freesound.org, YouTube, and Aigei). More details about the Nonspeech7k dataset are available in our article.

    Article title: "Nonspeech7k dataset: Classification and analysis of human nonspeech sound"

  18. d

    EUROPEAN CITIES Environmental Noise Data | Noise Complaints | GDPR Compliant...

    • datarade.ai
    Updated May 5, 2025
    + more versions
    Cite
    Silencio Network (2025). EUROPEAN CITIES Environmental Noise Data | Noise Complaints | GDPR Compliant | 100% Traceable Consent [Dataset]. https://datarade.ai/data-products/european-cities-environmental-noise-data-noise-complaints-silencio-network
    Explore at:
    .json, .xml, .csv, .xls (available download formats)
    Dataset updated
    May 5, 2025
    Dataset provided by
    Quickkonnect UG
    Authors
    Silencio Network
    Area covered
    Belarus, United States of America, Finland, Italy, Latvia, Austria, Faroe Islands, Switzerland, Norway, Bulgaria, Europe
    Description

    Noise Complaint Dataset: Acoustic Source Detection

    Silencio offers geolocated and categorized noise complaints collected directly through our mobile app. This unique dataset includes not only the location and time of each complaint but also the source of the noise (e.g., traffic, construction, nightlife, neighbors), making it a rare resource for research and monitoring focused on acoustic event classification, noise source identification, and urban sound analysis.

    Unlike standard sound datasets, which often lack real-world context or human-labeled sources, Silencio’s dataset is built entirely from user-submitted reports, providing authentic, ground-truth labels for research. It is ideal for training models in sound recognition, urban noise prediction, acoustic scene analysis, and noise impact assessment.

    Combined with Silencio’s Street Noise-Level Dataset, this complaint dataset allows researchers to correlate objective measurements with subjective community-reported noise events, opening up possibilities for multi-modal AI models that link noise intensity with human perception.

    Data delivery options include:

    • CSV exports
    • S3 bucket delivery
    • (Upcoming) API access

    All data is fully anonymized, GDPR-compliant, and available as both historical and updated datasets. We are open to early-access partnerships and custom formatting to meet AI research needs.

  19. m

    Arabic Natural Audio Dataset

    • data.mendeley.com
    Updated May 30, 2018
    Cite
    Samira Klaylat (2018). Arabic Natural Audio Dataset [Dataset]. http://doi.org/10.17632/xm232yxf7t.1
    Explore at:
    Dataset updated
    May 30, 2018
    Authors
    Samira Klaylat
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the first Arabic Natural Audio Dataset (ANAD), developed to recognize three discrete emotions: happy, angry, and surprised.

    Eight videos of live calls between an anchor and a human outside the studio were downloaded from online Arabic talk shows. Each video was then divided into turns: callers and receivers. To label each video, 18 listeners were asked to listen to it and select whether they perceived a happy, angry, or surprised emotion. Silence, laughs, and noisy chunks were removed. Every remaining chunk was then automatically divided into 1-second speech units, forming a final corpus of 1,384 records.

    Twenty-five acoustic features, also known as low-level descriptors (LLDs), were extracted: intensity, zero crossing rate, MFCC 1-12 (Mel-frequency cepstral coefficients), F0 (fundamental frequency), F0 envelope, probability of voicing, and LSP frequencies 0-7. Nineteen statistical functions were then applied to every feature: maximum, minimum, range, absolute position of maximum, absolute position of minimum, arithmetic mean, linear regression 1, linear regression 2, linear regression A, linear regression Q, standard deviation, kurtosis, skewness, quartiles 1, 2, and 3, and inter-quartile ranges 1-2, 2-3, and 1-3. The delta coefficient of every LLD is also computed as an estimate of the first derivative, leading to a total of (25 LLDs + 25 deltas) × 19 functions = 950 features.
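
    As an illustration (not the exact feature-extraction configuration used by the authors), the following sketch applies a few of those statistical functions to one low-level descriptor and its delta with numpy/scipy:

        import numpy as np
        from scipy.stats import kurtosis, skew

        def delta(x):
            # First-order difference as a simple estimate of the first derivative.
            return np.diff(x, prepend=x[0])

        def functionals(x):
            q1, q2, q3 = np.percentile(x, [25, 50, 75])
            return {"max": x.max(), "min": x.min(), "range": np.ptp(x),
                    "mean": x.mean(), "std": x.std(),
                    "kurtosis": kurtosis(x), "skewness": skew(x),
                    "quartile1": q1, "quartile2": q2, "quartile3": q3,
                    "iqr_1_3": q3 - q1}

        f0 = np.abs(np.random.randn(100)) * 20 + 120      # fake F0 contour for one 1-second unit
        feats = {**{f"F0_{k}": v for k, v in functionals(f0).items()},
                 **{f"F0_delta_{k}": v for k, v in functionals(delta(f0)).items()}}
        print(len(feats), "features from a single LLD and its delta")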

  20. O

    AudioSet

    • opendatalab.com
    • huggingface.co
    zip
    Updated Jul 1, 2023
    Cite
    Google (2023). AudioSet [Dataset]. https://opendatalab.com/OpenDataLab/AudioSet
    Explore at:
    zip (available download formats)
    Dataset updated
    Jul 1, 2023
    Dataset provided by
    Google
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    AudioSet is an audio event dataset consisting of over 2M human-annotated 10-second video clips. The clips are collected from YouTube, so many are of poor quality and contain multiple sound sources. A hierarchical ontology of 632 event classes is used to annotate the data, which means the same sound can receive several labels at different levels of the hierarchy. For example, the sound of barking is annotated as Animal, Pets, and Dog. All clips are split into Evaluation, Balanced-Train, and Unbalanced-Train sets.

FSD50K (featured dataset): additional characteristics and documentation

All clips are provided as uncompressed PCM 16 bit 44.1 kHz mono audio files.

Ground truth labels are provided at the clip-level (i.e., weak labels).

The dataset poses mainly a large-vocabulary multi-label sound event classification problem, but also allows development and evaluation of a variety of machine listening approaches (see Sec. 4D in our paper).

In addition to audio clips and ground truth, additional metadata is made available (including raw annotations, sound predominance ratings, Freesound metadata, and more), allowing a variety of analyses and sound event research tasks (see Files section below).

The audio clips are grouped into a development (dev) set and an evaluation (eval) set such that the two sets do not share clips from the same Freesound uploader.

Dev set:

40,966 audio clips totalling 80.4 hours of audio

Avg duration/clip: 7.1s

114,271 smeared labels (i.e., labels propagated in the upwards direction to the root of the ontology)

Labels are correct but could be occasionally incomplete

A train/validation split is provided (Sec. 3H). If a different split is used, it should be specified for reproducibility and fair comparability of results (see Sec. 5C of our paper)

Eval set:

10,231 audio clips totalling 27.9 hours of audio

Avg duration/clip: 9.8s

38,596 smeared labels

Eval set is labeled exhaustively (labels are correct and complete for the considered vocabulary)

Note: All classes in FSD50K are represented in AudioSet, except Crash cymbal, Human group actions, Human voice, Respiratory sounds, and Domestic sounds, home sounds.

LICENSE

All audio clips in FSD50K are released under Creative Commons (CC) licenses. Each clip has its own license as defined by the clip uploader in Freesound, some of them requiring attribution to their original authors and some forbidding further commercial reuse. Specifically:

The development set consists of 40,966 clips with the following licenses:

CC0: 14,959

CC-BY: 20,017

CC-BY-NC: 4616

CC Sampling+: 1374

The evaluation set consists of 10,231 clips with the following licenses:

CC0: 4914

CC-BY: 3489

CC-BY-NC: 1425

CC Sampling+: 403

To facilitate attribution of these files to third parties, we include a mapping from the audio clips to their corresponding licenses. The licenses are specified in the files dev_clips_info_FSD50K.json and eval_clips_info_FSD50K.json.
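
A small sketch of how such an attribution list could be assembled, assuming each entry in dev_clips_info_FSD50K.json is keyed by the Freesound id and carries a license field (the field names are assumptions based on the description above):

    import json
    from collections import Counter

    with open("FSD50K.metadata/dev_clips_info_FSD50K.json") as f:
        dev_info = json.load(f)

    # Count clips per license; individual entries give the per-clip license
    # needed for attribution.
    license_counts = Counter(info["license"] for info in dev_info.values())
    for lic, n in license_counts.most_common():
        print(f"{lic}: {n} clips")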

In addition, FSD50K as a whole is the result of a curation process and it has an additional license: FSD50K is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSD50K.doc zip file. We note that the choice of one license for the dataset as a whole is not straightforward as it comprises items with different licenses (such as audio clips, annotations, or data split). The choice of a global license in these cases may warrant further investigation (e.g., by someone with a background in copyright law).

Usage of FSD50K for commercial purposes:

If you'd like to use FSD50K for commercial purposes, please contact Eduardo Fonseca and Frederic Font at efonseca@google.com and frederic.font@upf.edu.

Also, if you are interested in using FSD50K for machine learning competitions, please contact Eduardo Fonseca and Frederic Font at efonseca@google.com and frederic.font@upf.edu.

FILES

FSD50K can be downloaded as a series of zip files with the following directory structure:

root
│
└───FSD50K.dev_audio/                       Audio clips in the dev set
│
└───FSD50K.eval_audio/                      Audio clips in the eval set
│
└───FSD50K.ground_truth/                    Files for FSD50K's ground truth
│   │
│   └─── dev.csv                            Ground truth for the dev set
│   │
│   └─── eval.csv                           Ground truth for the eval set
│   │
│   └─── vocabulary.csv                     List of 200 sound classes in FSD50K
│
└───FSD50K.metadata/                        Files for additional metadata
│   │
│   └─── class_info_FSD50K.json             Metadata about the sound classes
│   │
│   └─── dev_clips_info_FSD50K.json         Metadata about the dev clips
│   │
│   └─── eval_clips_info_FSD50K.json        Metadata about the eval clips
│   │
│   └─── pp_pnp_ratings_FSD50K.json         PP/PNP ratings
│   │
│   └─── collection/                        Files for the sound collection format
│
└───FSD50K.doc/
    │
    └───README.md                           The dataset description file that you are reading
    │
    └───LICENSE-DATASET                     License of the FSD50K dataset as an entity

Each row (i.e. audio clip) of dev.csv contains the following information:

fname: the file name without the .wav extension, e.g., the fname 64760 corresponds to the file 64760.wav on disk. This number is the Freesound id. We always use Freesound ids as filenames.

labels: the class labels (i.e., the ground truth). Note these class labels are smeared, i.e., the labels have been propagated in the upwards direction to the root of the ontology. More details about the label smearing process can be found in Appendix D of our paper.

mids: the Freebase identifiers corresponding to the class labels, as defined in the AudioSet Ontology specification

split: whether the clip belongs to train or val (see paper for details on the proposed split)

Rows in eval.csv follow the same format, except that there is no split column.
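
A minimal sketch of reading the ground truth with pandas, assuming the labels field holds comma-separated class labels within each row:

    import pandas as pd

    dev = pd.read_csv("FSD50K.ground_truth/dev.csv")
    dev["labels"] = dev["labels"].str.split(",")   # smeared, multi-label ground truth

    train = dev[dev["split"] == "train"]
    val = dev[dev["split"] == "val"]
    print(len(train), "train clips,", len(val), "val clips")
    print(train.iloc[0]["fname"], train.iloc[0]["labels"][:3])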

Note: We use a slightly different format than AudioSet for the naming of class labels in order to avoid potential problems with spaces, commas, etc. Example: we use Accelerating_and_revving_and_vroom instead of the original Accelerating, revving, vroom. You can go back to the original AudioSet naming using the information provided in vocabulary.csv (class label and mid for the 200 classes of FSD50K) and the AudioSet Ontology specification.
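
A sketch of that mapping, assuming vocabulary.csv has (index, label, mid) columns without a header and using the ontology.json file from the AudioSet Ontology release (both are assumptions about file layout):

    import csv
    import json

    mid_of = {}
    with open("FSD50K.ground_truth/vocabulary.csv") as f:
        for index, label, mid in csv.reader(f):    # assumed column order
            mid_of[label] = mid

    with open("ontology.json") as f:               # AudioSet Ontology specification
        audioset_name = {node["id"]: node["name"] for node in json.load(f)}

    print(audioset_name[mid_of["Accelerating_and_revving_and_vroom"]])
    # expected: "Accelerating, revving, vroom" per the example above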

Files with additional metadata (FSD50K.metadata/)

To allow a variety of analysis and approaches with FSD50K, we provide the following metadata:

class_info_FSD50K.json: python dictionary where each entry corresponds to one sound class and contains: FAQs utilized during the annotation of the class, examples (representative audio clips), and verification_examples (audio clips presented to raters during annotation as a quality control mechanism). Audio clips are described by the Freesound id. Note: It may be that some of these examples are not included in the FSD50K release.

dev_clips_info_FSD50K.json: python dictionary where each entry corresponds to one dev clip and contains: title,
