13 datasets found
  1. Google's Audioset: Reformatted

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 21, 2022
    Cite
    Bakhtin (2022). Google's Audioset: Reformatted [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7096701
    Dataset authored and provided by
    Bakhtin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Google's AudioSet consistently reformatted

    During my work with Google's AudioSet (https://research.google.com/audioset/index.html) I encountered problems because the Weak (https://research.google.com/audioset/download.html) and Strong (https://research.google.com/audioset/download_strong.html) versions of the dataset use different csv formatting, and because the labels used in the two datasets differ (https://github.com/audioset/ontology/issues/9) and are provided in files with different formatting.

    This dataset reformatting aims to unify the formats of the datasets so that it is possible to analyse them in the same pipelines, and also make the dataset files compatible with psds_eval, dcase_util and sed_eval Python packages used in Audio Processing.

    For better formatted documentation and source code of reformatting refer to https://github.com/bakhtos/GoogleAudioSetReformatted

    -Changes in dataset

    All files are converted to tab-separated *.tsv files (i.e. csv files with \t as a separator). All files have a header as the first line.

    -New fields and filenames

    Fields are renamed according to the following table, to be compatible with psds_eval:

    Old field -> New field
    YTID -> filename
    segment_id -> filename
    start_seconds -> onset
    start_time_seconds -> onset
    end_seconds -> offset
    end_time_seconds -> offset
    positive_labels -> event_label
    label -> event_label
    present -> present

    For class label files, id is now the name for the mid label (e.g. /m/09x0r) and label for the human-readable label (e.g. Speech). The label index used in the Weak dataset (the index field in class_labels_indices.csv) is not retained.

    Files are renamed according to the following table to ensure consistent naming of the form audioset_[weak|strong]_[train|eval]_[balanced|unbalanced|posneg]*.tsv:

    Old name -> New name
    balanced_train_segments.csv -> audioset_weak_train_balanced.tsv
    unbalanced_train_segments.csv -> audioset_weak_train_unbalanced.tsv
    eval_segments.csv -> audioset_weak_eval.tsv
    audioset_train_strong.tsv -> audioset_strong_train.tsv
    audioset_eval_strong.tsv -> audioset_strong_eval.tsv
    audioset_eval_strong_framed_posneg.tsv -> audioset_strong_eval_posneg.tsv
    class_labels_indices.csv -> class_labels.tsv (merged with mid_to_display_name.tsv)
    mid_to_display_name.tsv -> class_labels.tsv (merged with class_labels_indices.csv)

    -Strong dataset changes

    The only changes to the Strong dataset are the renaming of fields and the reordering of columns, so that both the Weak and Strong versions have filename and event_label as the first two columns.

    -Weak dataset changes

    -- Labels are given one per line, instead of as a comma-separated, quoted list

    -- To make sure that filename format is the same as in Strong version, the following format change is made: The value of the start_seconds field is converted to milliseconds and appended to the filename with an underscore. Since all files in the dataset are assumed to be 10 seconds long, this unifies the format of filename with the Strong version and makes end_seconds also redundant.
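
    Below is a minimal Python sketch of this filename change, assuming a Weak-label row with YTID, start_seconds and positive_labels values; the example values and the two-column output are illustrative (see the repository above for the actual conversion code).

    import csv

    def reformat_weak_row(ytid, start_seconds, positive_labels):
        # The change described above: start_seconds becomes milliseconds appended
        # to the YTID with an underscore, and labels are emitted one per line.
        start_ms = int(round(float(start_seconds) * 1000))
        filename = f"{ytid}_{start_ms}"
        return [(filename, label) for label in positive_labels.split(",")]

    # Illustrative row values, not taken from the actual csv files
    rows = reformat_weak_row("-0RWZT-miFs", "420.000", "/m/03v3yw,/m/0k4j")
    with open("audioset_weak_train_balanced.tsv", "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["filename", "event_label"])  # header line, as in the reformatted files
        writer.writerows(rows)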

    -Class labels changes

    Class labels from both datasets are merged into one file and given in alphabetical order of ids. Since the same ids are present in both datasets, but sometimes with different human-readable labels, labels from the Strong dataset overwrite those from the Weak one. It is possible to regenerate class_labels.tsv while giving priority to the Weak version of the labels by calling convert_labels(False) from convert.py in the GitHub repository.
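
    A tiny sketch of this merge, assuming the two label files have already been read into id -> label dictionaries (the dictionary contents below are placeholders):

    weak_labels = {"/m/09x0r": "Speech", "/m/0bt9lr": "Dog"}   # placeholder values
    strong_labels = {"/m/09x0r": "Speech"}                     # placeholder values

    merged = {**weak_labels, **strong_labels}   # Strong overwrites Weak for shared ids

    with open("class_labels.tsv", "w") as f:
        f.write("id\tlabel\n")                  # header with the new field names
        for mid in sorted(merged):              # alphabetical order of ids
            f.write(f"{mid}\t{merged[mid]}\n")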

    -License

    Google's AudioSet was published in two stages - first the Weakly labelled data (Gemmeke, Jort F., et al. "Audio set: An ontology and human-labeled dataset for audio events." 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2017.), then the strongly labelled data (Hershey, Shawn, et al. "The benefit of temporally-strong labels in audio event classification." ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.)

    Both the original dataset and this reworked version are licensed under CC BY 4.0

    Class labels come from the AudioSet Ontology, which is licensed under CC BY-SA 4.0.

  2. FSDKaggle2018

    • zenodo.org
    • opendatalab.com
    • +1 more
    zip
    Updated Jan 24, 2020
    Cite
    Eduardo Fonseca; Xavier Favory; Jordi Pons; Frederic Font; Manoj Plakal; Daniel P. W. Ellis; Xavier Serra (2020). FSDKaggle2018 [Dataset]. http://doi.org/10.5281/zenodo.2552860
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Eduardo Fonseca; Xavier Favory; Jordi Pons; Frederic Font; Manoj Plakal; Daniel P. W. Ellis; Xavier Serra
    Description

    FSDKaggle2018 is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology. FSDKaggle2018 has been used for the DCASE Challenge 2018 Task 2, which was run as a Kaggle competition titled Freesound General-Purpose Audio Tagging Challenge.

    Citation

    If you use the FSDKaggle2018 dataset or part of it, please cite our DCASE 2018 paper:

    Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Favory, Jordi Pons, Xavier Serra. "General-purpose Tagging of Freesound Audio with AudioSet Labels: Task Description, Dataset, and Baseline". Proceedings of the DCASE 2018 Workshop (2018)

    You can also consider citing our ISMIR 2017 paper, which describes how we gathered the manual annotations included in FSDKaggle2018.

    Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, "Freesound Datasets: A Platform for the Creation of Open Audio Datasets", In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017

    Contact

    You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.

    About this dataset

    Freesound Dataset Kaggle 2018 (or FSDKaggle2018 for short) is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology [1]. FSDKaggle2018 has been used for the Task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2018. Please visit the DCASE2018 Challenge Task 2 website for more information. This Task was hosted on the Kaggle platform as a competition titled Freesound General-Purpose Audio Tagging Challenge. It was organized by researchers from the Music Technology Group of Universitat Pompeu Fabra, and from Google Research’s Machine Perception Team.

    The goal of this competition was to build an audio tagging system that can categorize an audio clip as belonging to one of a set of 41 diverse categories drawn from the AudioSet Ontology.

    All audio samples in this dataset are gathered from Freesound [2] and are provided here as uncompressed PCM 16 bit, 44.1 kHz, mono audio files. Note that because Freesound content is collaboratively contributed, recording quality and techniques can vary widely.

    The ground truth data provided in this dataset has been obtained after a data labeling process which is described below in the Data labeling process section. FSDKaggle2018 clips are unequally distributed in the following 41 categories of the AudioSet Ontology:

    "Acoustic_guitar", "Applause", "Bark", "Bass_drum", "Burping_or_eructation", "Bus", "Cello", "Chime", "Clarinet", "Computer_keyboard", "Cough", "Cowbell", "Double_bass", "Drawer_open_or_close", "Electric_piano", "Fart", "Finger_snapping", "Fireworks", "Flute", "Glockenspiel", "Gong", "Gunshot_or_gunfire", "Harmonica", "Hi-hat", "Keys_jangling", "Knock", "Laughter", "Meow", "Microwave_oven", "Oboe", "Saxophone", "Scissors", "Shatter", "Snare_drum", "Squeak", "Tambourine", "Tearing", "Telephone", "Trumpet", "Violin_or_fiddle", "Writing".

    Some other relevant characteristics of FSDKaggle2018:

    • The dataset is split into a train set and a test set.

    • The train set is meant to be for system development and includes ~9.5k samples unequally distributed among 41 categories. The minimum number of audio samples per category in the train set is 94, and the maximum 300. The duration of the audio samples ranges from 300ms to 30s due to the diversity of the sound categories and the preferences of Freesound users when recording sounds. The total duration of the train set is roughly 18h.

    • Out of the ~9.5k samples from the train set, ~3.7k have manually-verified ground truth annotations and ~5.8k have non-verified annotations. The non-verified annotations of the train set have a quality estimate of at least 65-70% in each category. Check out the Data labeling process section below for more information about this aspect.

    • Non-verified annotations in the train set are properly flagged in train.csv so that participants can opt to use this information during the development of their systems (a minimal loading sketch is given after this list).

    • The test set is composed of 1.6k samples with manually-verified annotations and with a category distribution similar to that of the train set. The total duration of the test set is roughly 2h.

    • All audio samples in this dataset have a single label (i.e. are only annotated with one label). Check out the Data labeling process section below for more information about this aspect. A single label should be predicted for each file in the test set.
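
    A minimal loading sketch for the verification flag mentioned above; the column names (fname, label, manually_verified) are assumptions, since only the existence of the flag in train.csv is stated here:

    import pandas as pd

    # Assumed column names; adjust to the actual header of the downloaded file.
    train = pd.read_csv("FSDKaggle2018.meta/train_post_competition.csv")

    verified = train[train["manually_verified"] == 1]      # ~3.7k manually-verified clips
    non_verified = train[train["manually_verified"] == 0]  # ~5.8k non-verified clips
    print(len(verified), len(non_verified))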

    Data labeling process

    The data labeling process started from a manual mapping between Freesound tags and AudioSet Ontology categories (or labels), which was carried out by researchers at the Music Technology Group, Universitat Pompeu Fabra, Barcelona. Using this mapping, a number of Freesound audio samples were automatically annotated with labels from the AudioSet Ontology. These annotations can be understood as weak labels since they express the presence of a sound category in an audio sample.

    Then, a data validation process was carried out in which a number of participants listened to the annotated sounds and manually assessed the presence/absence of an automatically assigned sound category, according to the AudioSet category description.

    Audio samples in FSDKaggle2018 are only annotated with a single ground truth label (see train.csv). A total of 3,710 annotations included in the train set of FSDKaggle2018 have been manually validated as present and predominant (some with inter-annotator agreement, but not all of them). This means that in most cases there is no additional acoustic material other than the labeled category. In a few cases there may be some additional sound events, but these additional events won't belong to any of the 41 categories of FSDKaggle2018.

    The rest of the annotations have not been manually validated and therefore some of them could be inaccurate. Nonetheless, we have estimated that at least 65-70% of the non-verified annotations per category in the train set are indeed correct. It can happen that some of these non-verified audio samples present several sound sources even though only one label is provided as ground truth. These additional sources are typically out of the set of the 41 categories, but in a few cases they could be within.

    More details about the data labeling process can be found in [3].

    License

    FSDKaggle2018 has licenses at two different levels, as explained next.

    All sounds in Freesound are released under Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound. For attribution purposes and to facilitate attribution of these files to third parties, we include a relation of the audio clips included in FSDKaggle2018 and their corresponding license. The licenses are specified in the files train_post_competition.csv and test_post_competition_scoring_clips.csv.

    In addition, FSDKaggle2018 as a whole is the result of a curation process and it has an additional license. FSDKaggle2018 is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSDKaggle2018.doc zip file.

    Files

    FSDKaggle2018 can be downloaded as a series of zip files with the following directory structure:

    root
    │
    └───FSDKaggle2018.audio_train/                      Audio clips in the train set
    │
    └───FSDKaggle2018.audio_test/                       Audio clips in the test set
    │
    └───FSDKaggle2018.meta/                             Files for evaluation setup
    │   │
    │   └───train_post_competition.csv                  Data split and ground truth for the train set
    │   │
    │   └───test_post_competition_scoring_clips.csv     Ground truth for the test set
    │
    └───FSDKaggle2018.doc/
        │
        └───README.md                                   The dataset description file you are reading
        │
        └───LICENSE-DATASET

  3. Vocal Imitation Set v1.1.3: Thousands of vocal imitations of hundreds of sounds from the AudioSet ontology

    • zenodo.org
    • explore.openaire.eu
    • +1 more
    zip
    Updated Jan 24, 2020
    Cite
    Bongjun Kim; Bryan Pardo (2020). Vocal Imitation Set v1.1.3 : Thousands of vocal imitations of hundreds of sounds from the AudioSet ontology [Dataset]. http://doi.org/10.5281/zenodo.1340763
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Bongjun Kim; Bryan Pardo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The VocalImitationSet is a collection of crowd-sourced vocal imitations of a large set of diverse sounds collected from Freesound (https://freesound.org/), which were curated based on Google's AudioSet ontology (https://research.google.com/audioset/). We expect that this dataset will help research communities obtain a better understanding of human vocal imitation and build machines that understand imitations as humans do.

    See https://github.com/interactiveaudiolab/VocalImitationSet for more information about this dataset and its latest updates.

    For citations, please use this reference:

    Bongjun Kim, Madhav Ghei, Bryan Pardo, and Zhiyao Duan, "Vocal Imitation Set: a dataset of vocally imitated sound events using the AudioSet ontology," Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), Nov. 2018.

    Contact Info:

    - Interactive Audio Lab: http://music.eecs.northwestern.edu

    - Bongjun Kim bongjun@u.northwestern.edu | http://www.bongjunkim.com

    - Bryan Pardo pardo@northwestern.edu | http://www.bryanpardo.com

  4. FSD50K

    • data.niaid.nih.gov
    • opendatalab.com
    • +2 more
    Updated Apr 24, 2022
    Cite
    Xavier Serra (2022). FSD50K [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4060431
    Dataset provided by
    Xavier Favory
    Jordi Pons
    Eduardo Fonseca
    Frederic Font
    Xavier Serra
    Description

    FSD50K is an open dataset of human-labeled sound events containing 51,197 Freesound clips unequally distributed in 200 classes drawn from the AudioSet Ontology. FSD50K has been created at the Music Technology Group of Universitat Pompeu Fabra.

    Citation

    If you use the FSD50K dataset, or part of it, please cite our TASLP paper (available from [arXiv] [TASLP]):

    @article{fonseca2022FSD50K,
      title={{FSD50K}: an open dataset of human-labeled sound events},
      author={Fonseca, Eduardo and Favory, Xavier and Pons, Jordi and Font, Frederic and Serra, Xavier},
      journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
      volume={30},
      pages={829--852},
      year={2022},
      publisher={IEEE}
    }

    Paper update: This paper has been published in TASLP at the beginning of 2022. The accepted camera-ready version includes a number of improvements with respect to the initial submission. The main updates include: estimation of the amount of label noise in FSD50K, SNR comparison between FSD50K and AudioSet, improved description of evaluation metrics including equations, clarification of experimental methodology and some results, some content moved to Appendix for readability. The TASLP-accepted camera-ready version is available from arXiv (in particular, it is v2 in arXiv, displayed by default).

    Data curators

    Eduardo Fonseca, Xavier Favory, Jordi Pons, Mercedes Collado, Ceren Can, Rachit Gupta, Javier Arredondo, Gary Avendano and Sara Fernandez

    Contact

    You are welcome to contact Eduardo Fonseca should you have any questions, at efonseca@google.com.

    ABOUT FSD50K

    Freesound Dataset 50k (or FSD50K for short) is an open dataset of human-labeled sound events containing 51,197 Freesound clips unequally distributed in 200 classes drawn from the AudioSet Ontology [1]. FSD50K has been created at the Music Technology Group of Universitat Pompeu Fabra.

    What follows is a brief summary of FSD50K's most important characteristics. Please have a look at our paper (especially Section 4) to extend the basic information provided here with relevant details for its usage, as well as discussion, limitations, applications and more.

    Basic characteristics:

    FSD50K contains 51,197 audio clips from Freesound, totalling 108.3 hours of multi-labeled audio

    The dataset encompasses 200 sound classes (144 leaf nodes and 56 intermediate nodes) hierarchically organized with a subset of the AudioSet Ontology.

    The audio content is composed mainly of sound events produced by physical sound sources and production mechanisms, including human sounds, sounds of things, animals, natural sounds, musical instruments and more. The vocabulary can be inspected in vocabulary.csv (see Files section below).

    The acoustic material has been manually labeled by humans following a data labeling process using the Freesound Annotator platform [2].

    Clips are of variable length from 0.3 to 30s, due to the diversity of the sound classes and the preferences of Freesound users when recording sounds.

    All clips are provided as uncompressed PCM 16 bit 44.1 kHz mono audio files.

    Ground truth labels are provided at the clip-level (i.e., weak labels).

    The dataset poses mainly a large-vocabulary multi-label sound event classification problem, but also allows development and evaluation of a variety of machine listening approaches (see Sec. 4D in our paper).

    In addition to audio clips and ground truth, additional metadata is made available (including raw annotations, sound predominance ratings, Freesound metadata, and more), allowing a variety of analyses and sound event research tasks (see Files section below).

    The audio clips are grouped into a development (dev) set and an evaluation (eval) set such that they do not have clips from the same Freesound uploader.

    Dev set:

    40,966 audio clips totalling 80.4 hours of audio

    Avg duration/clip: 7.1s

    114,271 smeared labels (i.e., labels propagated in the upwards direction to the root of the ontology)

    Labels are correct but could be occasionally incomplete

    A train/validation split is provided (Sec. 3H). If a different split is used, it should be specified for reproducibility and fair comparability of results (see Sec. 5C of our paper)

    Eval set:

    10,231 audio clips totalling 27.9 hours of audio

    Avg duration/clip: 9.8s

    38,596 smeared labels

    Eval set is labeled exhaustively (labels are correct and complete for the considered vocabulary)

    Note: All classes in FSD50K are represented in AudioSet, except Crash cymbal, Human group actions, Human voice, Respiratory sounds, and Domestic sounds, home sounds.

    LICENSE

    All audio clips in FSD50K are released under Creative Commons (CC) licenses. Each clip has its own license as defined by the clip uploader in Freesound, some of them requiring attribution to their original authors and some forbidding further commercial reuse. Specifically:

    The development set consists of 40,966 clips with the following licenses:

    CC0: 14,959

    CC-BY: 20,017

    CC-BY-NC: 4616

    CC Sampling+: 1374

    The evaluation set consists of 10,231 clips with the following licenses:

    CC0: 4914

    CC-BY: 3489

    CC-BY-NC: 1425

    CC Sampling+: 403

    For attribution purposes and to facilitate attribution of these files to third parties, we include a mapping from the audio clips to their corresponding licenses. The licenses are specified in the files dev_clips_info_FSD50K.json and eval_clips_info_FSD50K.json.
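
    A minimal sketch of reading the per-clip licenses for attribution; the key and field names below are assumptions, as the exact JSON schema is not spelled out here:

    import json

    # Assumptions: the JSON maps Freesound clip ids to per-clip metadata dictionaries
    # and each entry carries a "license" field; key and field names are illustrative.
    with open("FSD50K.metadata/dev_clips_info_FSD50K.json") as f:
        dev_info = json.load(f)

    licenses = {clip_id: info.get("license") for clip_id, info in dev_info.items()}
    print(list(licenses.items())[:3])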

    In addition, FSD50K as a whole is the result of a curation process and it has an additional license: FSD50K is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSD50K.doc zip file. We note that the choice of one license for the dataset as a whole is not straightforward as it comprises items with different licenses (such as audio clips, annotations, or data split). The choice of a global license in these cases may warrant further investigation (e.g., by someone with a background in copyright law).

    Usage of FSD50K for commercial purposes:

    If you'd like to use FSD50K for commercial purposes, please contact Eduardo Fonseca and Frederic Font at efonseca@google.com and frederic.font@upf.edu.

    Also, if you are interested in using FSD50K for machine learning competitions, please contact Eduardo Fonseca and Frederic Font at efonseca@google.com and frederic.font@upf.edu.

    FILES

    FSD50K can be downloaded as a series of zip files with the following directory structure:

    root
    │
    └───FSD50K.dev_audio/                   Audio clips in the dev set
    │
    └───FSD50K.eval_audio/                  Audio clips in the eval set
    │
    └───FSD50K.ground_truth/                Files for FSD50K's ground truth
    │   │
    │   └───dev.csv                         Ground truth for the dev set
    │   │
    │   └───eval.csv                        Ground truth for the eval set
    │   │
    │   └───vocabulary.csv                  List of 200 sound classes in FSD50K
    │
    └───FSD50K.metadata/                    Files for additional metadata
    │   │
    │   └───class_info_FSD50K.json          Metadata about the sound classes
    │   │
    │   └───dev_clips_info_FSD50K.json      Metadata about the dev clips
    │   │
    │   └───eval_clips_info_FSD50K.json     Metadata about the eval clips
    │   │
    │   └───pp_pnp_ratings_FSD50K.json      PP/PNP ratings
    │   │
    │   └───collection/                     Files for the sound collection format
    │
    └───FSD50K.doc/
        │
        └───README.md                       The dataset description file that you are reading
        │
        └───LICENSE-DATASET                 License of the FSD50K dataset as an entity

    Each row (i.e. audio clip) of dev.csv contains the following information:

    fname: the file name without the .wav extension, e.g., the fname 64760 corresponds to the file 64760.wav on disk. This number is the Freesound id. We always use Freesound ids as filenames.

    labels: the class labels (i.e., the ground truth). Note these class labels are smeared, i.e., the labels have been propagated in the upwards direction to the root of the ontology. More details about the label smearing process can be found in Appendix D of our paper.

    mids: the Freebase identifiers corresponding to the class labels, as defined in the AudioSet Ontology specification

    split: whether the clip belongs to train or val (see paper for details on the proposed split)

    Rows in eval.csv follow the same format, except that there is no split column.
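
    A minimal sketch of loading dev.csv with pandas, assuming labels and mids are stored as comma-separated strings (the exact serialization is not spelled out above):

    import pandas as pd

    dev = pd.read_csv("FSD50K.ground_truth/dev.csv")

    # Assumption: labels and mids are serialized as comma-separated strings per row.
    dev["labels"] = dev["labels"].str.split(",")
    dev["mids"] = dev["mids"].str.split(",")

    # fname is the Freesound id, so the audio path can be reconstructed directly.
    dev["path"] = "FSD50K.dev_audio/" + dev["fname"].astype(str) + ".wav"

    train = dev[dev["split"] == "train"]
    val = dev[dev["split"] == "val"]
    print(len(train), len(val))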

    Note: We use a slightly different format than AudioSet for the naming of class labels in order to avoid potential problems with spaces, commas, etc. Example: we use Accelerating_and_revving_and_vroom instead of the original Accelerating, revving, vroom. You can go back to the original AudioSet naming using the information provided in vocabulary.csv (class label and mid for the 200 classes of FSD50K) and the AudioSet Ontology specification.
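
    A small sketch of mapping FSD50K class labels back to the original AudioSet names via their mids; the vocabulary.csv row layout and the ontology.json fields are assumptions:

    import csv
    import json

    # Assumption: vocabulary.csv has no header and rows of (index, FSD50K_label, mid).
    with open("FSD50K.ground_truth/vocabulary.csv") as f:
        label_to_mid = {label: mid for _, label, mid in csv.reader(f)}

    # Assumption: ontology.json is the AudioSet Ontology specification file, a list of
    # entries whose "id" is the mid and whose "name" is the original AudioSet class name.
    with open("ontology.json") as f:
        mid_to_audioset_name = {node["id"]: node["name"] for node in json.load(f)}

    print(mid_to_audioset_name[label_to_mid["Accelerating_and_revving_and_vroom"]])
    # expected: "Accelerating, revving, vroom" (the original AudioSet naming)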

    Files with additional metadata (FSD50K.metadata/)

    To allow a variety of analysis and approaches with FSD50K, we provide the following metadata:

    class_info_FSD50K.json: python dictionary where each entry corresponds to one sound class and contains: FAQs utilized during the annotation of the class, examples (representative audio clips), and verification_examples (audio clips presented to raters during annotation as a quality control mechanism). Audio clips are described by the Freesound id. Note: It may be that some of these examples are not included in the FSD50K release.

    dev_clips_info_FSD50K.json: python dictionary where each entry corresponds to one dev clip and contains: title,

  5. FSD-FS

    • zenodo.org
    • data.niaid.nih.gov
    bin, zip
    Updated Jan 22, 2023
    Cite
    Jinhua Liang; Huy Phan; Emmanouil Benetos (2023). FSD-FS [Dataset]. http://doi.org/10.5281/zenodo.7557107
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jinhua Liang; Huy Phan; Emmanouil Benetos
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    FSD-FS is a publicly-available database of human-labelled sound events for few-shot learning. It spans 143 classes obtained from the AudioSet Ontology and contains 43,805 raw audio files collected from FSD50K. FSD-FS is curated at the Centre for Digital Music, Queen Mary University of London.

    Citation

    If you use the FSD-FS dataset, please cite our paper and FSD50K.

    @article{liang2022learning,
     title={Learning from Taxonomy: Multi-label Few-Shot Classification for Everyday Sound Recognition},
     author={Liang, Jinhua and Phan, Huy and Benetos, Emmanouil},
     journal={arXiv preprint arXiv:2212.08952},
     year={2022}
    }
    
    @ARTICLE{9645159,
      author={Fonseca, Eduardo and Favory, Xavier and Pons, Jordi and Font, Frederic and Serra, Xavier},
      journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
      title={FSD50K: An Open Dataset of Human-Labeled Sound Events},
      year={2022},
      volume={30},
      number={},
      pages={829-852},
      doi={10.1109/TASLP.2021.3133208}
    }

    About FSD-FS

    FSD-FS is an open database for multi-label few-shot audio classification containing 143 classes drawn from FSD50K; it also inherits the AudioSet Ontology. FSD-FS follows a 7:2:1 ratio to split classes into base, validation, and evaluation sets, so there are 98 classes in the base set, 30 classes in the validation set, and 15 classes in the evaluation set (more details can be found in our paper).

    LICENSE

    FSD-FS is released under Creative Commons (CC) licenses. As with FSD50K, each clip has its own license as defined by the clip uploader in Freesound, some of them requiring attribution to their original authors and some forbidding further commercial reuse. For more details, please refer to the FSD50K license information.

    FILES

    FSD-FS is organised in the following structure:

    root
    |
    └─── dev_base
    |
    └─── dev_val
    |
    └─── eval

    REFERENCES AND LINKS

    [1] Gemmeke, Jort F., et al. "Audio set: An ontology and human-labeled dataset for audio events." 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2017.

    [2] Fonseca, Eduardo, et al. "FSD50K: an open dataset of human-labeled sound events." IEEE/ACM Transactions on Audio, Speech, and Language Processing 30 (2021): 829-852.

  6. FSDKaggle2019

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +1 more
    Updated Jan 24, 2020
    Cite
    Eduardo Fonseca (2020). FSDKaggle2019 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3612636
    Dataset provided by
    Eduardo Fonseca
    Daniel P. W. Ellis
    Frederic Font
    Xavier Serra
    Manoj Plakal
    Description

    FSDKaggle2019 is an audio dataset containing 29,266 audio files annotated with 80 labels of the AudioSet Ontology. FSDKaggle2019 has been used for the DCASE Challenge 2019 Task 2, which was run as a Kaggle competition titled Freesound Audio Tagging 2019.

    Citation

    If you use the FSDKaggle2019 dataset or part of it, please cite our DCASE 2019 paper:

    Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Serra. "Audio tagging with noisy labels and minimal supervision". Proceedings of the DCASE 2019 Workshop, NYC, US (2019)

    You can also consider citing our ISMIR 2017 paper, which describes how we gathered the manual annotations included in FSDKaggle2019.

    Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, "Freesound Datasets: A Platform for the Creation of Open Audio Datasets", In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017

    Data curators

    Eduardo Fonseca, Manoj Plakal, Xavier Favory, Jordi Pons

    Contact

    You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.

    ABOUT FSDKaggle2019

    Freesound Dataset Kaggle 2019 (or FSDKaggle2019 for short) is an audio dataset containing 29,266 audio files annotated with 80 labels of the AudioSet Ontology [1]. FSDKaggle2019 has been used for Task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2019. Please visit the DCASE2019 Challenge Task 2 website for more information. This Task was hosted on the Kaggle platform as a competition titled Freesound Audio Tagging 2019. It was organized by researchers from the Music Technology Group (MTG) of Universitat Pompeu Fabra (UPF), and from the Sound Understanding team at Google AI Perception. The competition intended to provide insight towards the development of broadly-applicable sound event classifiers able to cope with label noise and minimal supervision conditions.

    FSDKaggle2019 employs audio clips from the following sources:

    Freesound Dataset (FSD): a dataset being collected at the MTG-UPF based on Freesound content organized with the AudioSet Ontology

    The soundtracks of a pool of Flickr videos taken from the Yahoo Flickr Creative Commons 100M dataset (YFCC)

    The audio data is labeled using a vocabulary of 80 labels from Google’s AudioSet Ontology [1], covering diverse topics: Guitar and other Musical Instruments, Percussion, Water, Digestive, Respiratory sounds, Human voice, Human locomotion, Hands, Human group actions, Insect, Domestic animals, Glass, Liquid, Motor vehicle (road), Mechanisms, Doors, and a variety of Domestic sounds. The full list of categories can be inspected in vocabulary.csv (see Files & Download below). The goal of the task was to build a multi-label audio tagging system that can predict appropriate label(s) for each audio clip in a test set.

    What follows is a summary of some of the most relevant characteristics of FSDKaggle2019. Nevertheless, it is highly recommended to read our DCASE 2019 paper for a more in-depth description of the dataset and how it was built.

    Ground Truth Labels

    The ground truth labels are provided at the clip-level, and express the presence of a sound category in the audio clip, hence can be considered weak labels or tags. Audio clips have variable lengths (roughly from 0.3 to 30s).

    The audio content from FSD has been manually labeled by humans following a data labeling process using the Freesound Annotator platform. Most labels have inter-annotator agreement but not all of them. More details about the data labeling process and the Freesound Annotator can be found in [2].

    The YFCC soundtracks were labeled using automated heuristics applied to the audio content and metadata of the original Flickr clips. Hence, a substantial amount of label noise can be expected. The label noise can vary widely in amount and type depending on the category, including in- and out-of-vocabulary noises. More information about some of the types of label noise that can be encountered is available in [3].

    Specifically, FSDKaggle2019 features three types of label quality, one for each set in the dataset:

    curated train set: correct (but potentially incomplete) labels

    noisy train set: noisy labels

    test set: correct and complete labels

    Further details can be found below in the sections for each set.

    Format

    All audio clips are provided as uncompressed PCM 16 bit, 44.1 kHz, mono audio files.

    DATA SPLIT

    FSDKaggle2019 consists of two train sets and one test set. The idea is to limit the supervision provided for training (i.e., the manually-labeled, hence reliable, data), thus promoting approaches to deal with label noise.

    Curated train set

    The curated train set consists of manually-labeled data from FSD.

    Number of clips/class: 75, except in a few cases (where there are fewer)

    Total number of clips: 4970

    Avg number of labels/clip: 1.2

    Total duration: 10.5 hours

    The duration of the audio clips ranges from 0.3 to 30s due to the diversity of the sound categories and the preferences of Freesound users when recording/uploading sounds. Labels are correct but potentially incomplete. It can happen that a few of these audio clips present additional acoustic material beyond the provided ground truth label(s).

    Noisy train set

    The noisy train set is a larger set of noisy web audio data from Flickr videos taken from the YFCC dataset [5].

    Number of clips/class: 300

    Total number of clips: 19,815

    Avg number of labels/clip: 1.2

    Total duration: ~80 hours

    The duration of the audio clips ranges from 1s to 15s, with the vast majority lasting 15s. Labels are automatically generated and purposefully noisy. No human validation is involved. The label noise can vary widely in amount and type depending on the category, including in- and out-of-vocabulary noises.

    Considering the numbers above, the per-class data distribution available for training is, for most of the classes, 300 clips from the noisy train set and 75 clips from the curated train set. This means 80% noisy / 20% curated at the clip level, while at the duration level the proportion is more extreme considering the variable-length clips.

    Test set

    The test set is used for system evaluation and consists of manually-labeled data from FSD.

    Number of clips/class: between 50 and 150

    Total number of clips: 4481

    Avg number of labels/clip: 1.4

    Total duration: 12.9 hours

    The acoustic material present in the test set clips is labeled exhaustively using the aforementioned vocabulary of 80 classes. Most labels have inter-annotator agreement, but not all of them. Barring human error, the labels are correct and complete considering the target vocabulary; nonetheless, a few clips could still present additional (unlabeled) acoustic content outside the vocabulary.

    During the DCASE2019 Challenge Task 2, the test set was split into two subsets, for the public and private leaderboards, and only the data corresponding to the public leaderboard was provided. In this current package you will find the full test set with all the test labels. To allow comparison with previous work, the file test_post_competition.csv includes a flag to determine the corresponding leaderboard (public or private) for each test clip (see more info in Files & Download below).

    Acoustic mismatch

    As mentioned before, FSDKaggle2019 uses audio clips from two sources:

    FSD: curated train set and test set, and

    YFCC: noisy train set.

    While the sources of audio (Freesound and Flickr) are collaboratively contributed and pretty diverse themselves, a certain acoustic mismatch can be expected between FSD and YFCC. We conjecture this mismatch comes from a variety of reasons. For example, through acoustic inspection of a small sample of both data sources, we find a higher percentage of high quality recordings in FSD. In addition, audio clips in Freesound are typically recorded with the purpose of capturing audio, which is not necessarily the case in YFCC.

    This mismatch can have an impact in the evaluation, considering that most of the train data come from YFCC, while all test data are drawn from FSD. This constraint (i.e., noisy training data coming from a different web audio source than the test set) is sometimes a real-world condition.

    LICENSE

    All clips in FSDKaggle2019 are released under Creative Commons (CC) licenses. For attribution purposes and to facilitate attribution of these files to third parties, we include a mapping from the audio clips to their corresponding licenses.

    Curated train set and test set. All clips in Freesound are released under different modalities of Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound, some of them requiring attribution to their original authors and some forbidding further commercial reuse. The licenses are specified in the files train_curated_post_competition.csv and test_post_competition.csv. These licenses can be CC0, CC-BY, CC-BY-NC and CC Sampling+.

    Noisy train set. Similarly, the licenses of the soundtracks from Flickr used in FSDKaggle2019 are specified in the file train_noisy_post_competition.csv. These licenses can be CC-BY and CC BY-SA.

    In addition, FSDKaggle2019 as a whole is the result of a curation process and it has an additional license. FSDKaggle2019 is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSDKaggle2019.doc zip file.

    FILES & DOWNLOAD

    FSDKaggle2019 can be downloaded as a series of zip files with the following directory structure:

    root
    │
    └───FSDKaggle2019.audio_train_curated/    Audio clips in the curated train set
    │
    └───FSDKaggle2019.audio_train_noisy/      Audio clips in the noisy

  7. FSDnoisy18k

    • data.niaid.nih.gov
    • paperswithcode.com
    • +2 more
    Updated Jan 24, 2020
    Cite
    FSDnoisy18k [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2529933
    Dataset provided by
    Mercedes Collado
    Eduardo Fonseca
    Xavier Favory
    Daniel P. W. Ellis
    Frederic Font
    Xavier Serra
    Manoj Plakal
    Description

    FSDnoisy18k is an audio dataset collected with the aim of fostering the investigation of label noise in sound event classification. It contains 42.5 hours of audio across 20 sound classes, including a small amount of manually-labeled data and a larger quantity of real-world noisy data.

    Data curators

    Eduardo Fonseca and Mercedes Collado

    Contact

    You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.

    Citation

    If you use this dataset or part of it, please cite the following ICASSP 2019 paper:

    Eduardo Fonseca, Manoj Plakal, Daniel P. W. Ellis, Frederic Font, Xavier Favory, and Xavier Serra, “Learning Sound Event Classifiers from Web Audio with Noisy Labels”, arXiv preprint arXiv:1901.01189, 2019

    You can also consider citing our ISMIR 2017 paper that describes the Freesound Annotator, which was used to gather the manual annotations included in FSDnoisy18k:

    Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, “Freesound Datasets: A Platform for the Creation of Open Audio Datasets”, In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017

    FSDnoisy18k description

    What follows is a summary of the most basic aspects of FSDnoisy18k. For a complete description of FSDnoisy18k, make sure to check:

    the FSDnoisy18k companion site: http://www.eduardofonseca.net/FSDnoisy18k/

    the description provided in Section 2 of our ICASSP 2019 paper

    FSDnoisy18k is an audio dataset collected with the aim of fostering the investigation of label noise in sound event classification. It contains 42.5 hours of audio across 20 sound classes, including a small amount of manually-labeled data and a larger quantity of real-world noisy data.

    The source of audio content is Freesound, a sound sharing site created and maintained by the Music Technology Group, hosting over 400,000 clips uploaded by its community of users, who additionally provide some basic metadata (e.g., tags and title). The 20 classes of FSDnoisy18k are drawn from the AudioSet Ontology and are selected based on data availability as well as on their suitability to allow the study of label noise. The 20 classes are: "Acoustic guitar", "Bass guitar", "Clapping", "Coin (dropping)", "Crash cymbal", "Dishes, pots, and pans", "Engine", "Fart", "Fire", "Fireworks", "Glass", "Hi-hat", "Piano", "Rain", "Slam", "Squeak", "Tearing", "Walk, footsteps", "Wind", and "Writing". FSDnoisy18k was created with the Freesound Annotator, which is a platform for the collaborative creation of open audio datasets.

    We defined a clean portion of the dataset consisting of correct and complete labels. The remaining portion is referred to as the noisy portion. Each clip in the dataset has a single ground truth label (singly-labeled data).

    The clean portion of the data consists of audio clips whose labels are rated as present in the clip and predominant (almost all with full inter-annotator agreement), meaning that the label is correct and, in most cases, there is no additional acoustic material other than the labeled class. A few clips may contain some additional sound events, but they occur in the background and do not belong to any of the 20 target classes. This is more common for some classes that rarely occur alone, e.g., “Fire”, “Glass”, “Wind” or “Walk, footsteps”.

    The noisy portion of the data consists of audio clips that received no human validation. In this case, they are categorized on the basis of the user-provided tags in Freesound. Hence, the noisy portion features a certain amount of label noise.

    Code

    We've released the code for our ICASSP 2019 paper at https://github.com/edufonseca/icassp19. The framework comprises all the basic stages: feature extraction, training, inference and evaluation. After loading the FSDnoisy18k dataset, log-mel energies are computed and a CNN baseline is trained and evaluated. The code also allows testing four noise-robust loss functions. Please check our paper for more details.

    Label noise characteristics

    FSDnoisy18k features real label noise that is representative of audio data retrieved from the web, particularly from Freesound. The analysis of a per-class, random, 15% of the noisy portion of FSDnoisy18k revealed that roughly 40% of the analyzed labels are correct and complete, whereas 60% of the labels show some type of label noise. Please check the FSDnoisy18k companion site for a detailed characterization of the label noise in the dataset, including a taxonomy of label noise for singly-labeled data as well as a per-class description of the label noise.

    FSDnoisy18k basic characteristics

    The dataset's most relevant characteristics are as follows:

    FSDnoisy18k contains 18,532 audio clips (42.5h) unequally distributed in the 20 aforementioned classes drawn from the AudioSet Ontology.

    The audio clips are provided as uncompressed PCM 16 bit, 44.1 kHz, mono audio files.

    The audio clips are of variable length ranging from 300ms to 30s, and each clip has a single ground truth label (singly-labeled data).

    The dataset is split into a test set and a train set. The test set is drawn entirely from the clean portion, while the remainder of data forms the train set.

    The train set is composed of 17,585 clips (41.1h) unequally distributed among the 20 classes. It features a clean subset and a noisy subset. In terms of number of clips their proportion is 10%/90%, whereas in terms of duration the proportion is slightly more extreme (6%/94%). The per-class percentage of clean data within the train set is also imbalanced, ranging from 6.1% to 22.4%. The number of audio clips per class ranges from 51 to 170, and from 250 to 1000 in the clean and noisy subsets, respectively. Further, a noisy small subset is defined, which includes an amount of (noisy) data comparable (in terms of duration) to that of the clean subset.

    The test set is composed of 947 clips (1.4h) that belong to the clean portion of the data. Its class distribution is similar to that of the clean subset of the train set. The number of per-class audio clips in the test set ranges from 30 to 72. The test set enables a multi-class classification problem.

    FSDnoisy18k is an expandable dataset that features a per-class varying degree of types and amount of label noise. The dataset allows investigation of label noise as well as of other approaches, from semi-supervised learning (e.g., self-training) to learning with minimal supervision.

    License

    FSDnoisy18k has licenses at two different levels, as explained next. All sounds in Freesound are released under Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound. In particular, all Freesound clips included in FSDnoisy18k are released under either CC-BY or CC0. For attribution purposes and to facilitate attribution of these files to third parties, we include a relation of audio clips and their corresponding license in the LICENSE-INDIVIDUAL-CLIPS file downloaded with the dataset.

    In addition, FSDnoisy18k as a whole is the result of a curation process and it has an additional license. FSDnoisy18k is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the dataset.

    Files

    FSDnoisy18k can be downloaded as a series of zip files with the following directory structure:

    root
    │
    └───FSDnoisy18k.audio_train/           Audio clips in the train set
    │
    └───FSDnoisy18k.audio_test/            Audio clips in the test set
    │
    └───FSDnoisy18k.meta/                  Files for evaluation setup
    │   │
    │   └───train.csv                      Data split and ground truth for the train set
    │   │
    │   └───test.csv                       Ground truth for the test set
    │
    └───FSDnoisy18k.doc/
        │
        └───README.md                      The dataset description file that you are reading
        │
        └───LICENSE-DATASET                License of the FSDnoisy18k dataset as an entity
        │
        └───LICENSE-INDIVIDUAL-CLIPS.csv   Licenses of the individual audio clips from Freesound

    Each row (i.e. audio clip) of the train.csv file contains the following information:

    fname: the file name

    label: the audio classification label (ground truth)

    aso_id: the id of the corresponding category as per the AudioSet Ontology

    manually_verified: Boolean (1 or 0) flag to indicate whether the clip belongs to the clean portion (1), or to the noisy portion (0) of the train set

    noisy_small: Boolean (1 or 0) flag to indicate whether the clip belongs to the noisy_small portion (1) of the train set

    Each row (i.e. audio clip) of the test.csv file contains the following information:

    fname: the file name

    label: the audio classification label (ground truth)

    aso_id: the id of the corresponding category as per the AudioSet Ontology
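
    A minimal sketch of loading these files and selecting the clean, noisy, and noisy_small subsets using the columns listed above (file paths follow the directory layout shown earlier):

    import pandas as pd

    train = pd.read_csv("FSDnoisy18k.meta/train.csv")
    test = pd.read_csv("FSDnoisy18k.meta/test.csv")

    clean = train[train["manually_verified"] == 1]   # clean portion of the train set
    noisy = train[train["manually_verified"] == 0]   # noisy portion of the train set
    noisy_small = train[train["noisy_small"] == 1]   # duration-matched small noisy subset

    print(len(clean), len(noisy), len(noisy_small), len(test))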

    Links

    Source code for our preprint: https://github.com/edufonseca/icassp19
    Freesound Annotator: https://annotator.freesound.org/
    Freesound: https://freesound.org
    Eduardo Fonseca’s personal website: http://www.eduardofonseca.net/

    Acknowledgments

    This work is partially supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 688382 AudioCommons. Eduardo Fonseca is also sponsored by a Google Faculty Research Award 2017. We thank everyone who contributed to FSDnoisy18k with annotations.

  8. ARCA23K

    • data.niaid.nih.gov
    • paperswithcode.com
    • +1 more
    Updated Feb 25, 2022
    Cite
    Plumbley, Mark D. (2022). ARCA23K [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5117900
    Dataset provided by
    Wang, Wenwu
    Bailey, Andrew
    Iqbal, Turab
    Plumbley, Mark D.
    Cao, Yin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ARCA23K is a dataset of labelled sound events created to investigate real-world label noise. It contains 23,727 audio clips originating from Freesound, and each clip belongs to one of 70 classes taken from the AudioSet ontology. The dataset was created using an entirely automated process with no manual verification of the data. For this reason, many clips are expected to be labelled incorrectly.

    In addition to ARCA23K, this release includes a companion dataset called ARCA23K-FSD, which is a single-label subset of the FSD50K dataset. ARCA23K-FSD contains the same sound classes as ARCA23K and the same number of audio clips per class. As it is a subset of FSD50K, each clip and its label have been manually verified. Note that only the ground truth data of ARCA23K-FSD is distributed in this release. To download the audio clips, please visit the Zenodo page for FSD50K.

    A paper has been published detailing how the dataset was constructed. See the Citing section below.

    The source code used to create the datasets is available: https://github.com/tqbl/arca23k-dataset

    Characteristics

    ARCA23K(-FSD) is divided into:

    A training set containing 17,979 clips (39.6 hours for ARCA23K).

    A validation set containing 2,264 clips (5.0 hours).

    A test set containing 3,484 clips (7.3 hours).

    There are 70 sound classes in total. Each class belongs to the AudioSet ontology.

    Each audio clip was sourced from the Freesound database. Other than format conversions (e.g. resampling), the audio clips have not been modified.

    The duration of the audio clips varies from 0.3 seconds to 30 seconds.

    All audio clips are mono 16-bit WAV files sampled at 44.1 kHz.

    Based on listening tests (details in paper), 46.4% of the training examples are estimated to be labelled incorrectly. Among the incorrectly-labelled examples, 75.9% are estimated to be out-of-vocabulary.

    Sound Classes

    The list of sound classes is given below. They are grouped based on the top-level superclasses of the AudioSet ontology.

    Music

    Acoustic guitar

    Bass guitar

    Bowed string instrument

    Crash cymbal

    Electric guitar

    Gong

    Harp

    Organ

    Piano

    Rattle (instrument)

    Scratching (performance technique)

    Snare drum

    Trumpet

    Wind chime

    Wind instrument, woodwind instrument

    Sounds of things

    Boom

    Camera

    Coin (dropping)

    Computer keyboard

    Crack

    Dishes, pots, and pans

    Drawer open or close

    Drill

    Gunshot, gunfire

    Hammer

    Keys jangling

    Knock

    Microwave oven

    Printer

    Sawing

    Scissors

    Skateboard

    Slam

    Splash, splatter

    Squeak

    Tap

    Thump, thud

    Toilet flush

    Train

    Water tap, faucet

    Whoosh, swoosh, swish

    Writing

    Zipper (clothing)

    Natural sounds

    Crackle

    Stream

    Waves, surf

    Wind

    Human sounds

    Burping, eructation

    Chewing, mastication

    Child speech, kid speaking

    Clapping

    Cough

    Crying, sobbing

    Fart

    Female singing

    Female speech, woman speaking

    Finger snapping

    Giggle

    Male speech, man speaking

    Run

    Screaming

    Walk, footsteps

    Animal

    Bark

    Cricket

    Livestock, farm animals, working animals

    Meow

    Rattle

    Source-ambiguous sounds

    Crumpling, crinkling

    Crushing

    Tearing

    License and Attribution

    This release is licensed under the Creative Commons Attribution 4.0 International License.

    The audio clips distributed as part of ARCA23K were sourced from Freesound and have their own Creative Commons license. The license information and attribution for each audio clip can be found in ARCA23K.metadata/train.json, which also includes the original Freesound URLs.

    The files under ARCA23K-FSD.ground_truth/ are an adaptation of the ground truth data provided as part of FSD50K, which is licensed under the Creative Commons Attribution 4.0 International License. The curators of FSD50K are Eduardo Fonseca, Xavier Favory, Jordi Pons, Mercedes Collado, Ceren Can, Rachit Gupta, Javier Arredondo, Gary Avendano, and Sara Fernandez.

    Citing

    If you wish to cite this work, please cite the following paper:

    T. Iqbal, Y. Cao, A. Bailey, M. D. Plumbley, and W. Wang, “ARCA23K: An audio dataset for investigating open-set label noise”, in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021), 2021, Barcelona, Spain, pp. 201–205.

    BibTeX:

    @inproceedings{Iqbal2021,
      author = {Iqbal, T. and Cao, Y. and Bailey, A. and Plumbley, M. D. and Wang, W.},
      title = {{ARCA23K}: An audio dataset for investigating open-set label noise},
      booktitle = {Proceedings of the Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021)},
      pages = {201--205},
      year = {2021},
      address = {Barcelona, Spain},
    }

  9. VimSketch Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    Cite
    VimSketch Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2596910
    Dataset provided by
    Mark Cartwright
    Bryan Pardo
    Bongjun Kim
    Fatemeh Pishdadian
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    VimSketch Dataset combines two publicly available datasets, created by the Interactive Audio Lab:

    Vocal Imitation Set: a collection of crowd-sourced vocal imitations of a large set of diverse sounds collected from Freesound (https://freesound.org/), which were curated based on Google's AudioSet ontology (https://research.google.com/audioset/).

    VocalSketch Dataset: a dataset containing thousands of vocal imitations of a large set of diverse sounds.

    Publications by the Interactive Audio Lab using VimSketch:

    Fatemeh Pishdadian, Bongjun Kim, Prem Seetharaman, Bryan Pardo. "Classifying Non-speech Vocals: Deep vs Signal Processing Representations," Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), 2019

    Contact information:

  10. LAMA World Music Genre Dataset

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 20, 2023
    Cite
    Lee, Bruce W (2023). LAMA World Music Genre Dataset [Dataset]. http://doi.org/10.7910/DVN/13BPFB
    Explore at:
    Dataset updated
    Nov 20, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Lee, Bruce W
    Description

    LAMA World Music Genre Dataset (LAMA: LatinAmerica, Asia, MiddleEastern, Africa Genre Dataset)

    This dataset consists of .wav audio files classified into four categories: LatinAmerica, Asia, MiddleEastern, and Africa. We went through the Google AudioSet ontology and picked the clips we double-checked to be from each region. We added 1-min audio (.wav), plots (.png), and numerical datapoints for training (.json). I hope that this work can help Deep Learning and Machine Learning projects in Music Genre Classification.

    Getting Started

    The data contained in LAMA can be classified into three categories:

    Section        Format   LatinAmerica   Asia     MiddleEast   Africa
    audio          .wav     535            539      548          645
    graph plots    .png     2140           2156     2192         2580
    numerical      .json    101650         102410   104120       122550

    Overall statistics of LAMA. The numbers in the "audio" and "plots" rows are counts of the files included in each section. The numbers in the "numerical" row are based on the total number of raw datapoints: datapoints from two related files (trainMFCC.json, trainSC.json) were counted, while datapoints from trainZCR.json and trainRMSE.json were not. LatinA refers to Latin America, and MiddleE refers to Middle East.

    What's in?

    audio .wav -> 1-min clip audio files from Latin America, Africa, Asia, and Middle East
    graph plots .png -> MFCC, STFT, FFT, and waveform plots
    numerical .json -> x13 MFCC datapoints, x6 spectral contrast datapoints
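    As an illustration, the following sketch shows how comparable datapoints could be computed from one of the 1-min .wav clips with librosa. The clip path is hypothetical, and the choice of n_bands=5 (to obtain 6 spectral-contrast rows) is our assumption about how the "x6 spectral contrast datapoints" were produced.

        import librosa

        # Load a 1-minute clip from the dataset (hypothetical path, for illustration).
        y, sr = librosa.load("LatinAmerica/clip_001.wav", sr=None)

        # 13 MFCC coefficients per frame, matching the trainMFCC.json datapoints.
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

        # Spectral contrast with n_bands=5 gives 6 rows per frame (assumed to
        # correspond to the "x6 spectral contrast datapoints" described above).
        contrast = librosa.feature.spectral_contrast(y=y, sr=sr, n_bands=5)

        print(mfcc.shape, contrast.shape)  # (13, n_frames), (6, n_frames)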

  11. P

    DCASE 2018 Task 4 Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Jul 26, 2018
    Cite
    (2018). DCASE 2018 Task 4 Dataset [Dataset]. https://paperswithcode.com/dataset/dcase-2018-task-4
    Explore at:
    Dataset updated
    Jul 26, 2018
    Description

    DCASE2018 Task 4 is a dataset for large-scale weakly labeled semi-supervised sound event detection in domestic environments. The data are YouTube video excerpts focusing on the domestic context, which could be used, for example, in ambient assisted living applications. The domain was chosen due to its scientific challenges (wide variety of sounds, time-localized events, ...) and its potential industrial applications. Specifically, the task employs a subset of “Audio Set: An Ontology and Human-Labeled Dataset for Audio Events” by Google. AudioSet consists of an expanding ontology of 632 sound event classes and a collection of 2 million human-labeled 10-second sound clips (of which fewer than 21% are shorter than 10 seconds) drawn from 2 million YouTube videos. The ontology is specified as a hierarchical graph of event categories, covering a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds. Task 4 focuses on a subset of AudioSet consisting of 10 classes of sound events: speech, dog, cat, alarm/bell/ringing, dishes, frying, blender, running water, vacuum cleaner, and electric shaver/toothbrush.
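    For weakly labeled training, only clip-level labels are available. Below is a minimal sketch of turning such labels into a multi-hot target vector over the 10 classes listed above; the exact class-name spellings in the official metadata may differ slightly.

        import numpy as np

        # The 10 target classes of DCASE 2018 Task 4, as listed above.
        CLASSES = ["Speech", "Dog", "Cat", "Alarm/bell/ringing", "Dishes", "Frying",
                   "Blender", "Running water", "Vacuum cleaner", "Electric shaver/toothbrush"]
        CLASS_TO_IDX = {name: i for i, name in enumerate(CLASSES)}

        def weak_target(clip_labels):
            """Build a clip-level multi-hot target vector from weak (clip-level) labels."""
            target = np.zeros(len(CLASSES), dtype=np.float32)
            for label in clip_labels:
                target[CLASS_TO_IDX[label]] = 1.0
            return target

        # Example: a clip weakly labeled with two co-occurring events.
        print(weak_target(["Speech", "Dog"]))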

  12. Z

    STARSS23: Sony-TAu Realistic Spatial Soundscapes 2023

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 26, 2023
    + more versions
    Cite
    Krause, Daniel Alexander (2023). STARSS23: Sony-TAu Realistic Spatial Soundscapes 2023 [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_7709051
    Explore at:
    Dataset updated
    Jun 26, 2023
    Dataset provided by
    Adavanne, Sharath
    Mitsufuji, Yuki
    Virtanen, Tuomas
    Politis, Archontis
    Krause, Daniel Alexander
    Hakala, Aapo
    Shimada, Kazuki
    Takahashi, Shusuke
    Sudarsanam, Parthasaarathy
    Uchida, Kengo
    Koyama, Yuichiro
    Takahashi, Naoya
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    DESCRIPTION:

    The Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23) dataset contains multichannel recordings of sound scenes in various rooms and environments, together with temporal and spatial annotations of prominent events belonging to a set of target classes. The dataset was collected in two different countries: in Tampere, Finland, by the Audio Research Group (ARG) of Tampere University (TAU), and in Tokyo, Japan, by SONY, using a similar setup and annotation procedure. The dataset is delivered in two 4-channel spatial recording formats, a microphone array format (MIC) and a first-order Ambisonics format (FOA). These recordings serve as the development dataset for the DCASE 2023 Sound Event Localization and Detection Task of the DCASE 2023 Challenge.

    The STARSS23 dataset is a continuation of the STARSS22 dataset. It extends the previous version with the following:

    An additional 2.5 hrs of recordings in the development set, from 5 new rooms, distributed across 47 new recording clips.

    Distance labels (in cm) for the spatially annotated sound events, in addition to the azimuth and elevation labels of the previous version.

    360° videos spatially and temporally aligned to the audio recordings of the dataset (apart from 12 audio-only clips).

    Additional new audio and video recordings will be added in the evaluation set of the dataset in a subsequent version.

    Contrary to the three previous datasets of synthetic spatial sound scenes, TAU Spatial Sound Events 2019 (development/evaluation), TAU-NIGENS Spatial Sound Events 2020, and TAU-NIGENS Spatial Sound Events 2021, associated with previous iterations of the DCASE Challenge, the STARSS22-23 dataset contains recordings of real sound scenes and hence avoids some of the pitfalls of synthetic scene generation. Some key properties of these recordings are:

    annotations are based on a combination of human annotators for sound event activity and optical tracking for spatial positions,

    the annotated target event classes are determined by the composition of the real scenes,

    the density, polyphony, occurrences, and co-occurrences of events and sound classes are not random; they follow the actions and interactions of participants in the real scenes.

    The first round of recordings was collected between September 2021 and January 2022. A second round of recordings was collected between November 2022 and February 2023.

    Collection of data from the TAU side has received funding from Google.

    REPORT & REFERENCE:

    If you use this dataset, you can cite this report on its design, capture, and annotation process:

    Archontis Politis, Kazuki Shimada, Parthasaarathy Sudarsanam, Sharath Adavanne, Daniel Krause, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji, Tuomas Virtanen (2022). STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2022 Workshop (DCASE2022), Nancy, France.

    found here.

    A more detailed report on the properties of the new dataset and its audiovisual processing with a suitable baseline for DCASE2023 will be published soon.

    AIM:

    The STARSS22-23 dataset is suitable for training and evaluation of machine-listening models for sound event detection (SED), general sound source localization with diverse sounds or signal-of-interest localization, and joint sound-event-localization-and-detection (SELD). Additionally, the dataset can be used for evaluation of signal processing methods that do not necessarily rely on training, such as acoustic source localization methods and multiple-source acoustic tracking. The dataset allows evaluation of the performance and robustness of the aforementioned applications for diverse types of sounds, and under diverse acoustic conditions.

    Specifically, STARSS23 additionally allows evaluation of audiovisual processing methods, such as audiovisual source localization.

    SPECIFICATIONS:

    General:

    Recordings are taken in two different sites.

    Each recording clip is part of a recording session happening in a unique room.

    Groups of participants, sound making props, and scene scenarios are unique for each session (with a few exceptions).

    To achieve good variability and efficiency in the data, in terms of presence, density, movement, and/or spatial distribution of the sound events, the scenes are loosely scripted.

    13 target classes are identified in the recordings and strongly annotated by humans.

    Spatial annotations for those active events are captured by an optical tracking system.

    Sound events out of the target classes are considered as interference.

    Occurrences of up to 3 simultaneous events are fairly common, while higher numbers of overlapping events (up to 5) can occur but are rare.

    Volume, duration, and data split:

    A total of 16 unique rooms captured in the recordings, 4 in Tokyo and 12 in Tampere (development set).

    70 recording clips of 30 sec ~ 5 min durations, with a total time of ~2hrs, captured in Tokyo (development dataset).

    98 recording clips of 40 sec ~ 9 min durations, with a total time of ~5.5hrs, captured in Tampere (development dataset).

    A training-testing split is provided for reporting results using the development dataset.

    40 recordings contributed by Sony for the training split, captured in 2 rooms (dev-train-sony).

    30 recordings contributed by Sony for the testing split, captured in 2 rooms (dev-test-sony).

    50 recordings contributed by TAU for the training split, captured in 7 rooms (dev-train-tau).

    48 recordings contributed by TAU for the testing split, captured in 5 rooms (dev-test-tau).

    Approximately 3.5 hrs of additional recordings from both sites, captured in rooms different from those of the development set, will be released later as the evaluation set.

    Audio:

    Sampling rate: 24kHz.

    Two 4-channel 3-dimensional recording formats: first-order Ambisonics (FOA) and tetrahedral microphone array (MIC).

    Video:

    Video 360° format: equirectangular

    Video resolution: 1920x960

    Video frames per second (fps): 29.97

    All audio recordings are accompanied by synchronised video recordings, apart from 12 audio recordings with missing videos (fold3_room21_mix001.wav - fold3_room21_mix012.wav).
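    As a simple worked example of the alignment implied by these specifications, an annotated onset time in seconds can be mapped to the nearest video frame (29.97 fps) and audio sample (24 kHz); this is only an illustration, not part of the dataset's tooling:

        VIDEO_FPS = 29.97   # video frame rate stated above
        AUDIO_SR = 24000    # audio sampling rate stated above

        def onset_to_video_frame(onset_seconds: float) -> int:
            """Map an annotated onset time (seconds) to the nearest video frame index."""
            return round(onset_seconds * VIDEO_FPS)

        def onset_to_audio_sample(onset_seconds: float) -> int:
            """Map the same onset to the corresponding audio sample index."""
            return round(onset_seconds * AUDIO_SR)

        print(onset_to_video_frame(2.5), onset_to_audio_sample(2.5))  # 75 60000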

    More detailed information on the dataset can be found in the included README file.

    SOUND CLASSES:

    13 target sound event classes are annotated. The classes loosely follow the AudioSet ontology.

    1. Female speech, woman speaking
    2. Male speech, man speaking
    3. Clapping
    4. Telephone
    5. Laughter
    6. Domestic sounds
    7. Walk, footsteps
    8. Door, open or close
    9. Music
    10. Musical instrument
    11. Water tap, faucet
    12. Bell
    13. Knock

    The content of some of these classes corresponds to events of a limited range of Audioset-related subclasses. For more information see the README file.
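    For convenience when post-processing predictions, the 13 class names above can be kept as a simple lookup table. The 0-based index assignment below is only illustrative and may differ from the class indices used in the dataset's metadata files:

        # The 13 target sound event classes of STARSS23, as listed above.
        STARSS23_CLASSES = [
            "Female speech, woman speaking",
            "Male speech, man speaking",
            "Clapping",
            "Telephone",
            "Laughter",
            "Domestic sounds",
            "Walk, footsteps",
            "Door, open or close",
            "Music",
            "Musical instrument",
            "Water tap, faucet",
            "Bell",
            "Knock",
        ]

        def class_name(index: int) -> str:
            """Map a predicted class index to its human-readable label."""
            return STARSS23_CLASSES[index]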

    EXAMPLE APPLICATION:

    An implementation of a trainable convolutional recurrent neural network model performing joint SELD, trained and evaluated with this dataset, is provided here. This implementation will serve as the baseline method for the audio-only track in the DCASE 2023 Sound Event Localization and Detection Task.

    A baseline for the audiovisual track of DCASE 2023 Sound Event Localization and Detection Task will be published soon and referenced here.

    DEVELOPMENT AND EVALUATION:

    The current version (Version 1.0) of the dataset includes only the 168 development audio/video recordings and labels, used by the participants of Task 3 of the DCASE2023 Challenge to train and validate their submitted systems. Version 1.1 will additionally include the evaluation audio and video recordings without labels, for the evaluation phase of DCASE2023.

    If researchers wish to compare their system against the submissions of DCASE2023 Challenge, they will have directly comparable results if they use the evaluation data as their testing set.

    DOWNLOAD INSTRUCTIONS:

    The file foa_dev.zip corresponds to audio data of the FOA recording format. The file mic_dev.zip corresponds to audio data of the MIC recording format.

    The file video_dev.zip contains the common videos for both audio formats. The file metadata_dev.zip contains the common metadata for both audio formats.

    Download the zip files corresponding to the format of interest and use your favourite compression tool to unzip these zip files.
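    A minimal sketch of this step in Python, assuming the zip files have already been downloaded into the current directory; the extracted folder layout is an assumption, so adjust the glob pattern if needed. The spot-check verifies the stated audio specifications (4 channels, 24 kHz) with the soundfile package:

        import glob
        import zipfile

        import soundfile as sf

        # Extract the FOA development audio and the common metadata.
        for name in ["foa_dev.zip", "metadata_dev.zip"]:
            with zipfile.ZipFile(name) as zf:
                zf.extractall(".")

        # Spot-check one extracted recording: expect 4 channels at 24 kHz.
        wav_files = sorted(glob.glob("foa_dev/**/*.wav", recursive=True))
        data, sr = sf.read(wav_files[0])
        print(wav_files[0], data.shape, sr)  # expected: (n_samples, 4), 24000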

  13. Z

    STARSS22: Sony-TAu Realistic Spatial Soundscapes 2022 dataset

    • data.niaid.nih.gov
    • explore.openaire.eu
    Updated Mar 8, 2023
    + more versions
    Cite
    Shimada, Kazuki (2023). STARSS22: Sony-TAu Realistic Spatial Soundscapes 2022 dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6387879
    Explore at:
    Dataset updated
    Mar 8, 2023
    Dataset provided by
    Adavanne, Sharath
    Mitsufuji, Yuki
    Virtanen, Tuomas
    Politis, Archontis
    Krause, Daniel Alexander
    Shimada, Kazuki
    Takahashi, Shusuke
    Sudarsanam, Parthasaarathy
    Koyama, Yuichiro
    Takahashi, Naoya
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    DESCRIPTION:

    The Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22) dataset contains multichannel recordings of sound scenes in various rooms and environments, together with temporal and spatial annotations of prominent events belonging to a set of target classes. The dataset was collected in two different countries: in Tampere, Finland, by the Audio Research Group (ARG) of Tampere University (TAU), and in Tokyo, Japan, by SONY, using a similar setup and annotation procedure. The dataset is delivered in two 4-channel spatial recording formats, a microphone array format (MIC) and a first-order Ambisonics format (FOA). These recordings serve as the development dataset for the DCASE 2022 Sound Event Localization and Detection Task of the DCASE 2022 Challenge.

    Contrary to the three previous datasets of synthetic spatial sound scenes, TAU Spatial Sound Events 2019 (development/evaluation), TAU-NIGENS Spatial Sound Events 2020, and TAU-NIGENS Spatial Sound Events 2021, associated with the previous iterations of the DCASE Challenge, the STARSS22 dataset contains recordings of real sound scenes and hence avoids some of the pitfalls of synthetic scene generation. Some key properties of these recordings are:

    annotations are based on a combination of human annotators for sound event activity and optical tracking for spatial positions,

    the annotated target event classes are determined by the composition of the real scenes,

    the density, polyphony, occurrences, and co-occurrences of events and sound classes are not random; they follow the actions and interactions of participants in the real scenes.

    The recordings were collected between September 2021 and January 2022. Collection of data from the TAU side has received funding from Google.

    REPORT & REFERENCE:

    If you use this dataset, please cite the report on its creation and the related DCASE2022 task setup:

    Archontis Politis, Kazuki Shimada, Parthasaarathy Sudarsanam, Sharath Adavanne, Daniel Krause, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji, Tuomas Virtanen (2022). STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2022 Workshop (DCASE2022), Nancy, France.

    found here.

    AIM:

    The dataset is suitable for training and evaluation of machine-listening models for sound event detection (SED), general sound source localization with diverse sounds or signal-of-interest localization, and joint sound-event-localization-and-detection (SELD). Additionally, the dataset can be used for evaluation of signal processing methods that do not necessarily rely on training, such as acoustic source localization methods and multiple-source acoustic tracking. The dataset allows evaluation of the performance and robustness of the aforementioned applications for diverse types of sounds, and under diverse acoustic conditions.

    SPECIFICATIONS:

    70 recording clips of 30 sec ~ 5 min durations, with a total time of ~2hrs, contributed by SONY (development dataset).

    51 recording clips of 1 min ~ 5 min durations, with a total time of ~3hrs, contributed by TAU (development dataset).

    52 recording clips with a total time of ~2hrs, contributed by SONY&TAU (evaluation dataset).

    A training-test split is provided for reporting results using the development dataset.

    40 recordings contributed by SONY for the training split, captured in 2 rooms (dev-train-sony).

    30 recordings contributed by SONY for the testing split, captured in 2 rooms (dev-test-sony).

    27 recordings contributed by TAU for the training split, captured in 4 rooms (dev-train-tau).

    24 recordings contributed by TAU for the testing split, captured in 3 rooms (dev-test-tau).

    A total of 11 unique rooms captured in the recordings, 4 from SONY and 7 from TAU (development set).

    Sampling rate 24kHz.

    Two 4-channel 3-dimensional recording formats: first-order Ambisonics (FOA) and tetrahedral microphone array (MIC).

    Recordings are taken in two different countries and two different sites.

    Each recording clip is part of a recording session happening in a unique room.

    Groups of participants, sound making props, and scene scenarios are unique for each session (with a few exceptions).

    To achieve good variability and efficiency in the data, in terms of presence, density, movement, and/or spatial distribution of the sound events, the scenes are loosely scripted.

    13 target classes are identified in the recordings and strongly annotated by humans.

    Spatial annotations for those active events are captured by an optical tracking system.

    Sound events out of the target classes are considered as interference.

    Occurrences of up to 3 simultaneous events are fairly common, while higher numbers of overlapping events (up to 5) can occur but are rare.

    More detailed information on the dataset can be found in the included README file.

    SOUND CLASSES:

    13 target sound event classes are annotated. The classes loosely follow the AudioSet ontology.

    1. Female speech, woman speaking
    2. Male speech, man speaking
    3. Clapping
    4. Telephone
    5. Laughter
    6. Domestic sounds
    7. Walk, footsteps
    8. Door, open or close
    9. Music
    10. Musical instrument
    11. Water tap, faucet
    12. Bell
    13. Knock

    The content of some of these classes corresponds to events of a limited range of Audioset-related subclasses. For more information see the README file.

    EXAMPLE APPLICATION:

    An implementation of a trainable convolutional recurrent neural network model performing joint SELD, trained and evaluated with this dataset, is provided here. This implementation will serve as the baseline method in the DCASE 2022 Sound Event Localization and Detection Task.

    DEVELOPMENT AND EVALUATION:

    The current version (Version 1.1) of the dataset includes the 121 development audio recordings and labels, used by the participants of Task 3 of DCASE2022 Challenge to train and validate their submitted systems, and the 52 evaluation audio recordings without labels, for the evaluation phase of DCASE2022.

    If researchers wish to compare their system against the submissions of DCASE2022 Challenge, they will have directly comparable results if they use the evaluation data as their testing set.

    DOWNLOAD INSTRUCTIONS:

    The file foa_dev.zip corresponds to audio data of the FOA recording format. The file mic_dev.zip corresponds to audio data of the MIC recording format. The file metadata_dev.zip contains the common metadata for both formats.

    The file foa_eval.zip corresponds to audio data of the FOA recording format for the evaluation dataset. The file mic_eval.zip corresponds to audio data of the MIC recording format for the evaluation dataset.

    Download the zip files corresponding to the format of interest and use your favourite compression tool to unzip these zip files.



Google's Audioset: Reformatted


-Class labels changes

Class labels from both datasets are merged into one file and given in alphabetical order of ids. Since the same ids are present in both datasets, sometimes with different human-readable labels, labels from the Strong dataset overwrite those from the Weak dataset. It is possible to regenerate class_labels.tsv while giving priority to the Weak version of the labels by calling convert_labels(False) from convert.py in the GitHub repository.
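A minimal sketch of the merge logic described above, not the repository's actual convert.py implementation; it assumes the Weak and Strong labels have already been read into {id: label} dictionaries and writes class_labels.tsv with the id and label columns described earlier:

    import csv

    def merge_class_labels(weak, strong, out_path="class_labels.tsv", strong_priority=True):
        """Merge two {id: label} mappings and write them as a TSV sorted by id."""
        if strong_priority:
            merged = dict(weak)
            merged.update(strong)   # Strong labels overwrite Weak ones for shared ids
        else:
            merged = dict(strong)
            merged.update(weak)     # Weak labels take priority instead
        with open(out_path, "w", newline="") as f:
            writer = csv.writer(f, delimiter="\t")
            writer.writerow(["id", "label"])
            for mid in sorted(merged):
                writer.writerow([mid, merged[mid]])

    # Example with hypothetical entries:
    # merge_class_labels({"/m/09xor": "Speech"}, {"/m/09xor": "Speech"})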

-License

Google's AudioSet was published in two stages: first the weakly labelled data (Gemmeke, Jort F., et al. "Audio Set: An ontology and human-labeled dataset for audio events." 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017.), then the strongly labelled data (Hershey, Shawn, et al. "The benefit of temporally-strong labels in audio event classification." ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.).

Both the original dataset and this reworked version are licensed under CC BY 4.0.

Class labels come from the AudioSet Ontology, which is licensed under CC BY-SA 4.0.
