Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Google's AudioSet consistently reformatted
While working with Google's AudioSet (https://research.google.com/audioset/index.html) I encountered some problems: the Weak (https://research.google.com/audioset/download.html) and Strong (https://research.google.com/audioset/download_strong.html) versions of the dataset use different csv formatting, and the labels used in the two versions differ (https://github.com/audioset/ontology/issues/9) and are provided in files with different formatting.
This dataset reformatting aims to unify the formats of the datasets so that it is possible to analyse them in the same pipelines, and also make the dataset files compatible with psds_eval, dcase_util and sed_eval Python packages used in Audio Processing.
For better formatted documentation and source code of reformatting refer to https://github.com/bakhtos/GoogleAudioSetReformatted
-Changes in dataset
All files are converted to tab-separated *.tsv files (i.e. csv files with \t as a separator). All files have a header as the first line.
-New fields and filenames
Fields are renamed according to the following table, to be compatible with psds_eval:
Old field -> New field
YTID -> filename
segment_id -> filename
start_seconds -> onset
start_time_seconds -> onset
end_seconds -> offset
end_time_seconds -> offset
positive_labels -> event_label
label -> event_label
present -> present
For class label files, id is now the name for the mid label (e.g. /m/09x0r) and label for the human-readable label (e.g. Speech). The label index used for the Weak dataset labels (the index field in class_labels_indices.csv) is not used.
Files are renamed according to the following table to ensure consistent naming of the form audioset_[weak|strong]_[train|eval]_[balanced|unbalanced|posneg]*.tsv:
Old name -> New name
balanced_train_segments.csv -> audioset_weak_train_balanced.tsv
unbalanced_train_segments.csv -> audioset_weak_train_unbalanced.tsv
eval_segments.csv -> audioset_weak_eval.tsv
audioset_train_strong.tsv -> audioset_strong_train.tsv
audioset_eval_strong.tsv -> audioset_strong_eval.tsv
audioset_eval_strong_framed_posneg.tsv -> audioset_strong_eval_posneg.tsv
class_labels_indices.csv -> class_labels.tsv (merged with mid_to_display_name.tsv)
mid_to_display_name.tsv -> class_labels.tsv (merged with class_labels_indices.csv)
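Because all files now share the same tab-separated layout, they can be read with a single code path. Below is a minimal sketch using pandas (an assumption of this example, not a requirement of the dataset); the file names follow the table above:

```python
import pandas as pd

# Both the Strong and the reformatted Weak files are tab-separated with a header
# and share filename and event_label as their first two columns.
strong = pd.read_csv("audioset_strong_train.tsv", sep="\t")
weak = pd.read_csv("audioset_weak_train_balanced.tsv", sep="\t")

print(strong.columns.tolist())
print(weak.columns.tolist())
```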
-Strong dataset changes
The only changes to the Strong dataset are the renaming of fields and the reordering of columns, so that both the Weak and Strong versions have filename and event_label as the first two columns.
-Weak dataset changes
-- Labels are given one per line, instead of as a comma-separated, quoted list.
-- To make sure that the filename format is the same as in the Strong version, the following format change is made:
The value of the start_seconds field is converted to milliseconds and appended to the filename with an underscore. Since all files in the dataset are assumed to be 10 seconds long, this unifies the filename format with the Strong version and makes end_seconds redundant.
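A minimal sketch of this filename convention (illustrative only; the repository's actual conversion code may format or round differently):

```python
def weak_filename(ytid: str, start_seconds: float) -> str:
    """Build a Strong-style filename from a Weak-label row:
    YTID plus start_seconds in milliseconds, joined with an underscore."""
    return f"{ytid}_{int(start_seconds * 1000)}"

# Hypothetical YTID, used purely for illustration.
print(weak_filename("EXAMPLE_YTID", 30.0))  # -> "EXAMPLE_YTID_30000"
```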
-Class labels changes
Class labels from both datasets are merged into one file and given in alphabetical order of ids. Since the same ids are present in both datasets, but sometimes with different human-readable labels, labels from the Strong dataset overwrite those from the Weak. It is possible to regenerate class_labels.tsv while giving priority to the Weak version of the labels by calling convert_labels(False) from convert.py in the GitHub repository (see the snippet below).
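For example, assuming convert.py from the GoogleAudioSetReformatted repository is importable (e.g. when running from a checkout of the repository):

```python
# Regenerate class_labels.tsv, giving priority to the Weak version of the labels.
from convert import convert_labels

convert_labels(False)
```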
-License
Google's AudioSet was published in two stages: first the weakly labelled data (Gemmeke, Jort F., et al. "Audio set: An ontology and human-labeled dataset for audio events." 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017), then the strongly labelled data (Hershey, Shawn, et al. "The benefit of temporally-strong labels in audio event classification." ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021).
Both the original dataset and this reworked version are licensed under CC BY 4.0.
Class labels come from the AudioSet Ontology, which is licensed under CC BY-SA 4.0.
FSDKaggle2018 is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology. FSDKaggle2018 has been used for the DCASE Challenge 2018 Task 2, which was run as a Kaggle competition titled Freesound General-Purpose Audio Tagging Challenge.
Citation
If you use the FSDKaggle2018 dataset or part of it, please cite our DCASE 2018 paper:
Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Favory, Jordi Pons, Xavier Serra. "General-purpose Tagging of Freesound Audio with AudioSet Labels: Task Description, Dataset, and Baseline". Proceedings of the DCASE 2018 Workshop (2018)
You can also consider citing our ISMIR 2017 paper, which describes how we gathered the manual annotations included in FSDKaggle2018.
Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, "Freesound Datasets: A Platform for the Creation of Open Audio Datasets", In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017
Contact
You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.
About this dataset
Freesound Dataset Kaggle 2018 (or FSDKaggle2018 for short) is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology [1]. FSDKaggle2018 has been used for the Task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2018. Please visit the DCASE2018 Challenge Task 2 website for more information. This Task was hosted on the Kaggle platform as a competition titled Freesound General-Purpose Audio Tagging Challenge. It was organized by researchers from the Music Technology Group of Universitat Pompeu Fabra, and from Google Research’s Machine Perception Team.
The goal of this competition was to build an audio tagging system that can categorize an audio clip as belonging to one of a set of 41 diverse categories drawn from the AudioSet Ontology.
All audio samples in this dataset are gathered from Freesound [2] and are provided here as uncompressed PCM 16 bit, 44.1 kHz, mono audio files. Note that because Freesound content is collaboratively contributed, recording quality and techniques can vary widely.
The ground truth data provided in this dataset has been obtained after a data labeling process which is described below in the Data labeling process section. FSDKaggle2018 clips are unequally distributed in the following 41 categories of the AudioSet Ontology:
"Acoustic_guitar", "Applause", "Bark", "Bass_drum", "Burping_or_eructation", "Bus", "Cello", "Chime", "Clarinet", "Computer_keyboard", "Cough", "Cowbell", "Double_bass", "Drawer_open_or_close", "Electric_piano", "Fart", "Finger_snapping", "Fireworks", "Flute", "Glockenspiel", "Gong", "Gunshot_or_gunfire", "Harmonica", "Hi-hat", "Keys_jangling", "Knock", "Laughter", "Meow", "Microwave_oven", "Oboe", "Saxophone", "Scissors", "Shatter", "Snare_drum", "Squeak", "Tambourine", "Tearing", "Telephone", "Trumpet", "Violin_or_fiddle", "Writing".
Some other relevant characteristics of FSDKaggle2018:
The dataset is split into a train set and a test set.
The train set is meant to be for system development and includes ~9.5k samples unequally distributed among 41 categories. The minimum number of audio samples per category in the train set is 94, and the maximum 300. The duration of the audio samples ranges from 300ms to 30s due to the diversity of the sound categories and the preferences of Freesound users when recording sounds. The total duration of the train set is roughly 18h.
Out of the ~9.5k samples from the train set, ~3.7k have manually-verified ground truth annotations and ~5.8k have non-verified annotations. The non-verified annotations of the train set have a quality estimate of at least 65-70% in each category. Check out the Data labeling process section below for more information about this aspect.
Non-verified annotations in the train set are properly flagged in train.csv so that participants can opt to use this information during the development of their systems (see the filtering sketch after this list).
The test set is composed of 1.6k samples with manually-verified annotations and a category distribution similar to that of the train set. The total duration of the test set is roughly 2h.
All audio samples in this dataset have a single label (i.e. are only annotated with one label). Check out the Data labeling process section below for more information about this aspect. A single label should be predicted for each file in the test set.
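A minimal filtering sketch with pandas, assuming a manually_verified flag column as used in the Kaggle release (check the actual header of train.csv / train_post_competition.csv; the path is illustrative):

```python
import pandas as pd

train = pd.read_csv("FSDKaggle2018.meta/train_post_competition.csv")

verified = train[train["manually_verified"] == 1]      # ~3.7k manually-verified clips
non_verified = train[train["manually_verified"] == 0]  # ~5.8k non-verified clips
print(len(verified), len(non_verified))
```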
Data labeling process
The data labeling process started from a manual mapping between Freesound tags and AudioSet Ontology categories (or labels), which was carried out by researchers at the Music Technology Group, Universitat Pompeu Fabra, Barcelona. Using this mapping, a number of Freesound audio samples were automatically annotated with labels from the AudioSet Ontology. These annotations can be understood as weak labels since they express the presence of a sound category in an audio sample.
Then, a data validation process was carried out in which a number of participants listened to the annotated sounds and manually assessed the presence/absence of an automatically assigned sound category, according to the AudioSet category description.
Audio samples in FSDKaggle2018 are only annotated with a single ground truth label (see train.csv). A total of 3,710 annotations included in the train set of FSDKaggle2018 have been manually validated as present and predominant (some with inter-annotator agreement, but not all of them). This means that in most cases there is no additional acoustic material other than the labeled category. In a few cases there may be some additional sound events, but these additional events won't belong to any of the 41 categories of FSDKaggle2018.
The rest of the annotations have not been manually validated and therefore some of them could be inaccurate. Nonetheless, we have estimated that at least 65-70% of the non-verified annotations per category in the train set are indeed correct. It can happen that some of these non-verified audio samples present several sound sources even though only one label is provided as ground truth. These additional sources are typically out of the set of the 41 categories, but in a few cases they could be within.
More details about the data labeling process can be found in [3].
License
FSDKaggle2018 has licenses at two different levels, as explained next.
All sounds in Freesound are released under Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound. For attribution purposes and to facilitate attribution of these files to third parties, we include a relation of the audio clips included in FSDKaggle2018 and their corresponding license. The licenses are specified in the files train_post_competition.csv and test_post_competition_scoring_clips.csv.
In addition, FSDKaggle2018 as a whole is the result of a curation process and it has an additional license. FSDKaggle2018 is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSDKaggle2018.doc zip file.
Files
FSDKaggle2018 can be downloaded as a series of zip files with the following directory structure:
root
│
└───FSDKaggle2018.audio_train/ Audio clips in the train set
│
└───FSDKaggle2018.audio_test/ Audio clips in the test set
│
└───FSDKaggle2018.meta/ Files for evaluation setup
│   │
│   └───train_post_competition.csv Data split and ground truth for the train set
│   │
│   └───test_post_competition_scoring_clips.csv Ground truth for the test set
│
└───FSDKaggle2018.doc/
│
└───README.md The dataset description file you are reading
│
└───LICENSE-DATASET
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The VocalImitationSet is a collection of crowd-sourced vocal imitations of a large set of diverse sounds collected from Freesound (https://freesound.org/), which were curated based on Google's AudioSet ontology (https://research.google.com/audioset/). We expect that this dataset will help research communities obtain a better understanding of human vocal imitation and build machines that understand imitations as humans do.
See https://github.com/interactiveaudiolab/VocalImitationSet for more information about this dataset and its latest updates.
For citations, please use this reference:
Bongjun Kim, Madhav Ghei, Bryan Pardo, and Zhiyao Duan, "Vocal Imitation Set: a dataset of vocally imitated sound events using the AudioSet ontology," Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), Nov. 2018.
Contact Info:
- Interactive Audio Lab: http://music.eecs.northwestern.edu
- Bongjun Kim bongjun@u.northwestern.edu | http://www.bongjunkim.com
- Bryan Pardo pardo@northwestern.edu | http://www.bryanpardo.com
FSD50K is an open dataset of human-labeled sound events containing 51,197 Freesound clips unequally distributed in 200 classes drawn from the AudioSet Ontology. FSD50K has been created at the Music Technology Group of Universitat Pompeu Fabra.
Citation
If you use the FSD50K dataset, or part of it, please cite our TASLP paper (available from [arXiv] [TASLP]):
@article{fonseca2022FSD50K,
  title={{FSD50K}: an open dataset of human-labeled sound events},
  author={Fonseca, Eduardo and Favory, Xavier and Pons, Jordi and Font, Frederic and Serra, Xavier},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  volume={30},
  pages={829--852},
  year={2022},
  publisher={IEEE}
}
Paper update: This paper has been published in TASLP at the beginning of 2022. The accepted camera-ready version includes a number of improvements with respect to the initial submission. The main updates include: estimation of the amount of label noise in FSD50K, SNR comparison between FSD50K and AudioSet, improved description of evaluation metrics including equations, clarification of experimental methodology and some results, some content moved to Appendix for readability. The TASLP-accepted camera-ready version is available from arXiv (in particular, it is v2 in arXiv, displayed by default).
Data curators
Eduardo Fonseca, Xavier Favory, Jordi Pons, Mercedes Collado, Ceren Can, Rachit Gupta, Javier Arredondo, Gary Avendano and Sara Fernandez
Contact
You are welcome to contact Eduardo Fonseca should you have any questions, at efonseca@google.com.
ABOUT FSD50K
Freesound Dataset 50k (or FSD50K for short) is an open dataset of human-labeled sound events containing 51,197 Freesound clips unequally distributed in 200 classes drawn from the AudioSet Ontology [1]. FSD50K has been created at the Music Technology Group of Universitat Pompeu Fabra.
What follows is a brief summary of FSD50K's most important characteristics. Please have a look at our paper (especially Section 4) to extend the basic information provided here with relevant details for its usage, as well as discussion, limitations, applications and more.
Basic characteristics:
FSD50K contains 51,197 audio clips from Freesound, totalling 108.3 hours of multi-labeled audio
The dataset encompasses 200 sound classes (144 leaf nodes and 56 intermediate nodes) hierarchically organized with a subset of the AudioSet Ontology.
The audio content is composed mainly of sound events produced by physical sound sources and production mechanisms, including human sounds, sounds of things, animals, natural sounds, musical instruments and more. The vocabulary can be inspected in vocabulary.csv (see Files section below).
The acoustic material has been manually labeled by humans following a data labeling process using the Freesound Annotator platform [2].
Clips are of variable length from 0.3 to 30s, due to the diversity of the sound classes and the preferences of Freesound users when recording sounds.
All clips are provided as uncompressed PCM 16 bit 44.1 kHz mono audio files.
Ground truth labels are provided at the clip-level (i.e., weak labels).
The dataset poses mainly a large-vocabulary multi-label sound event classification problem, but also allows development and evaluation of a variety of machine listening approaches (see Sec. 4D in our paper).
In addition to audio clips and ground truth, additional metadata is made available (including raw annotations, sound predominance ratings, Freesound metadata, and more), allowing a variety of analyses and sound event research tasks (see Files section below).
The audio clips are grouped into a development (dev) set and an evaluation (eval) set such that the two sets do not share clips from the same Freesound uploader.
Dev set:
40,966 audio clips totalling 80.4 hours of audio
Avg duration/clip: 7.1s
114,271 smeared labels (i.e., labels propagated in the upwards direction to the root of the ontology)
Labels are correct but could be occasionally incomplete
A train/validation split is provided (Sec. 3H). If a different split is used, it should be specified for reproducibility and fair comparability of results (see Sec. 5C of our paper)
Eval set:
10,231 audio clips totalling 27.9 hours of audio
Avg duration/clip: 9.8s
38,596 smeared labels
Eval set is labeled exhaustively (labels are correct and complete for the considered vocabulary)
Note: All classes in FSD50K are represented in AudioSet, except Crash cymbal, Human group actions, Human voice, Respiratory sounds, and Domestic sounds, home sounds.
LICENSE
All audio clips in FSD50K are released under Creative Commons (CC) licenses. Each clip has its own license as defined by the clip uploader in Freesound, some of them requiring attribution to their original authors and some forbidding further commercial reuse. Specifically:
The development set consists of 40,966 clips with the following licenses:
CC0: 14,959
CC-BY: 20,017
CC-BY-NC: 4,616
CC Sampling+: 1,374
The evaluation set consists of 10,231 clips with the following licenses:
CC0: 4,914
CC-BY: 3,489
CC-BY-NC: 1,425
CC Sampling+: 403
For attribution purposes and to facilitate attribution of these files to third parties, we include a mapping from the audio clips to their corresponding licenses. The licenses are specified in the files dev_clips_info_FSD50K.json and eval_clips_info_FSD50K.json.
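As a minimal sketch of how the per-clip license information might be tallied (assuming each entry in the JSON maps a Freesound clip id to a record with a "license" field; check the actual structure of the files in the release, and the path is illustrative):

```python
import json
from collections import Counter

with open("FSD50K.metadata/dev_clips_info_FSD50K.json") as f:
    dev_info = json.load(f)

# Count how many dev clips carry each license.
license_counts = Counter(info["license"] for info in dev_info.values())
print(license_counts.most_common())
```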
In addition, FSD50K as a whole is the result of a curation process and it has an additional license: FSD50K is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSD50K.doc zip file. We note that the choice of one license for the dataset as a whole is not straightforward as it comprises items with different licenses (such as audio clips, annotations, or data split). The choice of a global license in these cases may warrant further investigation (e.g., by someone with a background in copyright law).
Usage of FSD50K for commercial purposes:
If you'd like to use FSD50K for commercial purposes, please contact Eduardo Fonseca and Frederic Font at efonseca@google.com and frederic.font@upf.edu.
Also, if you are interested in using FSD50K for machine learning competitions, please contact Eduardo Fonseca and Frederic Font at efonseca@google.com and frederic.font@upf.edu.
FILES
FSD50K can be downloaded as a series of zip files with the following directory structure:
root
│
└───FSD50K.dev_audio/ Audio clips in the dev set
│
└───FSD50K.eval_audio/ Audio clips in the eval set
│
└───FSD50K.ground_truth/ Files for FSD50K's ground truth
│ │
│ └─── dev.csv Ground truth for the dev set
│ │
│ └─── eval.csv Ground truth for the eval set
│ │
│ └─── vocabulary.csv List of 200 sound classes in FSD50K
│
└───FSD50K.metadata/ Files for additional metadata
│ │
│ └─── class_info_FSD50K.json Metadata about the sound classes
│ │
│ └─── dev_clips_info_FSD50K.json Metadata about the dev clips
│ │
│ └─── eval_clips_info_FSD50K.json Metadata about the eval clips
│ │
│ └─── pp_pnp_ratings_FSD50K.json PP/PNP ratings
│ │
│ └─── collection/ Files for the sound collection format
│
└───FSD50K.doc/
│
└───README.md The dataset description file that you are reading
│
└───LICENSE-DATASET License of the FSD50K dataset as an entity
Each row (i.e. audio clip) of dev.csv contains the following information:
fname: the file name without the .wav extension, e.g., the fname 64760 corresponds to the file 64760.wav on disk. This number is the Freesound id. We always use Freesound ids as filenames.
labels: the class labels (i.e., the ground truth). Note these class labels are smeared, i.e., the labels have been propagated in the upwards direction to the root of the ontology. More details about the label smearing process can be found in Appendix D of our paper.
mids: the Freebase identifiers corresponding to the class labels, as defined in the AudioSet Ontology specification
split: whether the clip belongs to train or val (see paper for details on the proposed split)
Rows in eval.csv follow the same format, except that there is no split column.
Note: We use a slightly different format than AudioSet for the naming of class labels in order to avoid potential problems with spaces, commas, etc. Example: we use Accelerating_and_revving_and_vroom instead of the original Accelerating, revving, vroom. You can go back to the original AudioSet naming using the information provided in vocabulary.csv (class label and mid for the 200 classes of FSD50K) and the AudioSet Ontology specification.
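As an illustration of the format described above, here is a minimal pandas sketch (paths are illustrative; it assumes vocabulary.csv has no header row and holds the index, class label, and mid columns mentioned above):

```python
import pandas as pd

dev = pd.read_csv("FSD50K.ground_truth/dev.csv")
vocab = pd.read_csv("FSD50K.ground_truth/vocabulary.csv",
                    header=None, names=["index", "label", "mid"])
label_to_mid = dict(zip(vocab["label"], vocab["mid"]))

row = dev.iloc[0]
labels = row["labels"].split(",")          # smeared class labels for this clip
mids = [label_to_mid[l] for l in labels]   # corresponding AudioSet mids
print(row["fname"], row["split"], labels, mids)
```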
Files with additional metadata (FSD50K.metadata/)
To allow a variety of analysis and approaches with FSD50K, we provide the following metadata:
class_info_FSD50K.json: Python dictionary where each entry corresponds to one sound class and contains: FAQs utilized during the annotation of the class, examples (representative audio clips), and verification_examples (audio clips presented to raters during annotation as a quality control mechanism). Audio clips are described by the Freesound id. Note: It may be that some of these examples are not included in the FSD50K release.
dev_clips_info_FSD50K.json: Python dictionary where each entry corresponds to one dev clip and contains: title,
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
FSD-FS is a publicly-available database of human-labelled sound events for few-shot learning. It spans 143 classes obtained from the AudioSet Ontology and contains 43,805 raw audio files collected from FSD50K. FSD-FS is curated at the Centre for Digital Music, Queen Mary University of London.
Citation
If you use the FSD-FS dataset, please cite our paper and FSD50K.
@article{liang2022learning,
title={Learning from Taxonomy: Multi-label Few-Shot Classification for Everyday Sound Recognition},
author={Liang, Jinhua and Phan, Huy and Benetos, Emmanouil},
journal={arXiv preprint arXiv:2212.08952},
year={2022}
}
@ARTICLE{9645159,
  author={Fonseca, Eduardo and Favory, Xavier and Pons, Jordi and Font, Frederic and Serra, Xavier},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  title={FSD50K: An Open Dataset of Human-Labeled Sound Events},
  year={2022},
  volume={30},
  number={},
  pages={829-852},
  doi={10.1109/TASLP.2021.3133208}}
About FSD-FS
FSD-FS is an open database for multi-label few-shot audio classification containing 143 classes drawn from the FSD50K. It also inherits the AudioSet Ontology. FSD-FS follows the ratio 7:2:1 to split classes into base, validation, and evaluation sets, so there are 98 classes in the base set, 30 classes in the validation set, and 15 classes in the evaluation set (More details can be found in our paper).
LICENSE
FSD-FS is released under Creative Commons (CC) licenses. As with FSD50K, each clip has its own license as defined by the clip uploader in Freesound, some of them requiring attribution to their original authors and some forbidding further commercial reuse. For more details, refer to the FSD50K license information.
FILES
FSD-FS is organised in the following structure:
root
|
└─── dev_base
|
└─── dev_val
|
└─── eval
REFERENCES AND LINKS
[1] Gemmeke, Jort F., et al. "Audio set: An ontology and human-labeled dataset for audio events." 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2017. [paper] [link]
[2] Fonseca, Eduardo, et al. "Fsd50k: an open dataset of human-labeled sound events." IEEE/ACM Transactions on Audio, Speech, and Language Processing 30 (2021): 829-852. [paper] [code]
FSDKaggle2019 is an audio dataset containing 29,266 audio files annotated with 80 labels of the AudioSet Ontology. FSDKaggle2019 has been used for the DCASE Challenge 2019 Task 2, which was run as a Kaggle competition titled Freesound Audio Tagging 2019.
Citation
If you use the FSDKaggle2019 dataset or part of it, please cite our DCASE 2019 paper:
Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Serra. "Audio tagging with noisy labels and minimal supervision". Proceedings of the DCASE 2019 Workshop, NYC, US (2019)
You can also consider citing our ISMIR 2017 paper, which describes how we gathered the manual annotations included in FSDKaggle2019.
Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, "Freesound Datasets: A Platform for the Creation of Open Audio Datasets", In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017
Data curators
Eduardo Fonseca, Manoj Plakal, Xavier Favory, Jordi Pons
Contact
You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.
ABOUT FSDKaggle2019
Freesound Dataset Kaggle 2019 (or FSDKaggle2019 for short) is an audio dataset containing 29,266 audio files annotated with 80 labels of the AudioSet Ontology [1]. FSDKaggle2019 has been used for Task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2019. Please visit the DCASE2019 Challenge Task 2 website for more information. This Task was hosted on the Kaggle platform as a competition titled Freesound Audio Tagging 2019. It was organized by researchers from the Music Technology Group (MTG) of Universitat Pompeu Fabra (UPF), and from the Sound Understanding team at Google AI Perception. The competition intended to provide insight towards the development of broadly-applicable sound event classifiers able to cope with label noise and minimal supervision conditions.
FSDKaggle2019 employs audio clips from the following sources:
Freesound Dataset (FSD): a dataset being collected at the MTG-UPF based on Freesound content organized with the AudioSet Ontology
The soundtracks of a pool of Flickr videos taken from the Yahoo Flickr Creative Commons 100M dataset (YFCC)
The audio data is labeled using a vocabulary of 80 labels from Google’s AudioSet Ontology [1], covering diverse topics: Guitar and other Musical Instruments, Percussion, Water, Digestive, Respiratory sounds, Human voice, Human locomotion, Hands, Human group actions, Insect, Domestic animals, Glass, Liquid, Motor vehicle (road), Mechanisms, Doors, and a variety of Domestic sounds. The full list of categories can be inspected in vocabulary.csv (see Files & Download below). The goal of the task was to build a multi-label audio tagging system that can predict appropriate label(s) for each audio clip in a test set.
What follows is a summary of some of the most relevant characteristics of FSDKaggle2019. Nevertheless, it is highly recommended to read our DCASE 2019 paper for a more in-depth description of the dataset and how it was built.
Ground Truth Labels
The ground truth labels are provided at the clip-level, and express the presence of a sound category in the audio clip, hence can be considered weak labels or tags. Audio clips have variable lengths (roughly from 0.3 to 30s).
The audio content from FSD has been manually labeled by humans following a data labeling process using the Freesound Annotator platform. Most labels have inter-annotator agreement but not all of them. More details about the data labeling process and the Freesound Annotator can be found in [2].
The YFCC soundtracks were labeled using automated heuristics applied to the audio content and metadata of the original Flickr clips. Hence, a substantial amount of label noise can be expected. The label noise can vary widely in amount and type depending on the category, including in- and out-of-vocabulary noises. More information about some of the types of label noise that can be encountered is available in [3].
Specifically, FSDKaggle2019 features three types of label quality, one for each set in the dataset:
curated train set: correct (but potentially incomplete) labels
noisy train set: noisy labels
test set: correct and complete labels
Further details can be found below in the sections for each set.
Format
All audio clips are provided as uncompressed PCM 16 bit, 44.1 kHz, mono audio files.
DATA SPLIT
FSDKaggle2019 consists of two train sets and one test set. The idea is to limit the supervision provided for training (i.e., the manually-labeled, hence reliable, data), thus promoting approaches to deal with label noise.
Curated train set
The curated train set consists of manually-labeled data from FSD.
Number of clips/class: 75, except in a few cases (where there are fewer)
Total number of clips: 4970
Avg number of labels/clip: 1.2
Total duration: 10.5 hours
The duration of the audio clips ranges from 0.3 to 30s due to the diversity of the sound categories and the preferences of Freesound users when recording/uploading sounds. Labels are correct but potentially incomplete. It can happen that a few of these audio clips present additional acoustic material beyond the provided ground truth label(s).
Noisy train set
The noisy train set is a larger set of noisy web audio data from Flickr videos taken from the YFCC dataset [5].
Number of clips/class: 300
Total number of clips: 19,815
Avg number of labels/clip: 1.2
Total duration: ~80 hours
The duration of the audio clips ranges from 1s to 15s, with the vast majority lasting 15s. Labels are automatically generated and purposefully noisy. No human validation is involved. The label noise can vary widely in amount and type depending on the category, including in- and out-of-vocabulary noises.
Considering the numbers above, the per-class data distribution available for training is, for most of the classes, 300 clips from the noisy train set and 75 clips from the curated train set. This means 80% noisy / 20% curated at the clip level, while at the duration level the proportion is more extreme considering the variable-length clips.
Test set
The test set is used for system evaluation and consists of manually-labeled data from FSD.
Number of clips/class: between 50 and 150
Total number of clips: 4481
Avg number of labels/clip: 1.4
Total duration: 12.9 hours
The acoustic material present in the test set clips is labeled exhaustively using the aforementioned vocabulary of 80 classes. Most labels have inter-annotator agreement, but not all of them. Barring human error, the labels are correct and complete considering the target vocabulary; nonetheless, a few clips could still present additional (unlabeled) acoustic content out of the vocabulary.
During the DCASE2019 Challenge Task 2, the test set was split into two subsets, for the public and private leaderboards, and only the data corresponding to the public leaderboard was provided. In this current package you will find the full test set with all the test labels. To allow comparison with previous work, the file test_post_competition.csv includes a flag to determine the corresponding leaderboard (public or private) for each test clip (see more info in Files & Download below).
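For instance, a minimal sketch that separates the two leaderboard subsets with pandas (the flag column name "usage" and its values are assumptions; check the header of test_post_competition.csv, and the path is illustrative):

```python
import pandas as pd

test = pd.read_csv("FSDKaggle2019.meta/test_post_competition.csv")

# Column name and values assumed; adjust to the actual flag in the release.
public = test[test["usage"] == "Public"]
private = test[test["usage"] == "Private"]
print(len(public), len(private))
```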
Acoustic mismatch
As mentioned before, FSDKaggle2019 uses audio clips from two sources:
FSD: curated train set and test set, and
YFCC: noisy train set.
While the sources of audio (Freesound and Flickr) are collaboratively contributed and pretty diverse themselves, a certain acoustic mismatch can be expected between FSD and YFCC. We conjecture this mismatch comes from a variety of reasons. For example, through acoustic inspection of a small sample of both data sources, we find a higher percentage of high quality recordings in FSD. In addition, audio clips in Freesound are typically recorded with the purpose of capturing audio, which is not necessarily the case in YFCC.
This mismatch can have an impact on the evaluation, considering that most of the train data come from YFCC, while all test data are drawn from FSD. This constraint (i.e., noisy training data coming from a different web audio source than the test set) is sometimes a real-world condition.
LICENSE
All clips in FSDKaggle2019 are released under Creative Commons (CC) licenses. For attribution purposes and to facilitate attribution of these files to third parties, we include a mapping from the audio clips to their corresponding licenses.
Curated train set and test set. All clips in Freesound are released under different modalities of Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound, some of them requiring attribution to their original authors and some forbidding further commercial reuse. The licenses are specified in the files train_curated_post_competition.csv and test_post_competition.csv. These licenses can be CC0, CC-BY, CC-BY-NC and CC Sampling+.
Noisy train set. Similarly, the licenses of the soundtracks from Flickr used in FSDKaggle2019 are specified in the file train_noisy_post_competition.csv. These licenses can be CC-BY and CC BY-SA.
In addition, FSDKaggle2019 as a whole is the result of a curation process and it has an additional license. FSDKaggle2019 is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSDKaggle2019.doc zip file.
FILES & DOWNLOAD
FSDKaggle2019 can be downloaded as a series of zip files with the following directory structure:
root
│
└───FSDKaggle2019.audio_train_curated/ Audio clips in the curated train set
│
└───FSDKaggle2019.audio_train_noisy/ Audio clips in the noisy
FSDnoisy18k is an audio dataset collected with the aim of fostering the investigation of label noise in sound event classification. It contains 42.5 hours of audio across 20 sound classes, including a small amount of manually-labeled data and a larger quantity of real-world noisy data.
Data curators
Eduardo Fonseca and Mercedes Collado
Contact
You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.
Citation
If you use this dataset or part of it, please cite the following ICASSP 2019 paper:
Eduardo Fonseca, Manoj Plakal, Daniel P. W. Ellis, Frederic Font, Xavier Favory, and Xavier Serra, “Learning Sound Event Classifiers from Web Audio with Noisy Labels”, arXiv preprint arXiv:1901.01189, 2019
You can also consider citing our ISMIR 2017 paper that describes the Freesound Annotator, which was used to gather the manual annotations included in FSDnoisy18k:
Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, “Freesound Datasets: A Platform for the Creation of Open Audio Datasets”, In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017
FSDnoisy18k description
What follows is a summary of the most basic aspects of FSDnoisy18k. For a complete description of FSDnoisy18k, make sure to check:
the FSDnoisy18k companion site: http://www.eduardofonseca.net/FSDnoisy18k/
the description provided in Section 2 of our ICASSP 2019 paper
FSDnoisy18k is an audio dataset collected with the aim of fostering the investigation of label noise in sound event classification. It contains 42.5 hours of audio across 20 sound classes, including a small amount of manually-labeled data and a larger quantity of real-world noisy data.
The source of audio content is Freesound, a sound sharing site created and maintained by the Music Technology Group, hosting over 400,000 clips uploaded by its community of users, who additionally provide some basic metadata (e.g., tags and title). The 20 classes of FSDnoisy18k are drawn from the AudioSet Ontology and are selected based on data availability as well as on their suitability to allow the study of label noise. The 20 classes are: "Acoustic guitar", "Bass guitar", "Clapping", "Coin (dropping)", "Crash cymbal", "Dishes, pots, and pans", "Engine", "Fart", "Fire", "Fireworks", "Glass", "Hi-hat", "Piano", "Rain", "Slam", "Squeak", "Tearing", "Walk, footsteps", "Wind", and "Writing". FSDnoisy18k was created with the Freesound Annotator, which is a platform for the collaborative creation of open audio datasets.
We defined a clean portion of the dataset consisting of correct and complete labels. The remaining portion is referred to as the noisy portion. Each clip in the dataset has a single ground truth label (singly-labeled data).
The clean portion of the data consists of audio clips whose labels are rated as present in the clip and predominant (almost all with full inter-annotator agreement), meaning that the label is correct and, in most cases, there is no additional acoustic material other than the labeled class. A few clips may contain some additional sound events, but they occur in the background and do not belong to any of the 20 target classes. This is more common for some classes that rarely occur alone, e.g., “Fire”, “Glass”, “Wind” or “Walk, footsteps”.
The noisy portion of the data consists of audio clips that received no human validation. In this case, they are categorized on the basis of the user-provided tags in Freesound. Hence, the noisy portion features a certain amount of label noise.
Code
We've released the code for our ICASSP 2019 paper at https://github.com/edufonseca/icassp19. The framework comprises all the basic stages: feature extraction, training, inference and evaluation. After loading the FSDnoisy18k dataset, log-mel energies are computed and a CNN baseline is trained and evaluated. The code also allows testing four noise-robust loss functions. Please check our paper for more details.
Label noise characteristics
FSDnoisy18k features real label noise that is representative of audio data retrieved from the web, particularly from Freesound. The analysis of a per-class, random, 15% of the noisy portion of FSDnoisy18k revealed that roughly 40% of the analyzed labels are correct and complete, whereas 60% of the labels show some type of label noise. Please check the FSDnoisy18k companion site for a detailed characterization of the label noise in the dataset, including a taxonomy of label noise for singly-labeled data as well as a per-class description of the label noise.
FSDnoisy18k basic characteristics
The dataset's most relevant characteristics are as follows:
FSDnoisy18k contains 18,532 audio clips (42.5h) unequally distributed in the 20 aforementioned classes drawn from the AudioSet Ontology.
The audio clips are provided as uncompressed PCM 16 bit, 44.1 kHz, mono audio files.
The audio clips are of variable length ranging from 300ms to 30s, and each clip has a single ground truth label (singly-labeled data).
The dataset is split into a test set and a train set. The test set is drawn entirely from the clean portion, while the remainder of data forms the train set.
The train set is composed of 17,585 clips (41.1h) unequally distributed among the 20 classes. It features a clean subset and a noisy subset. In terms of the number of clips, the clean/noisy proportion is 10%/90%, whereas in terms of duration the proportion is slightly more extreme (6%/94%). The per-class percentage of clean data within the train set is also imbalanced, ranging from 6.1% to 22.4%. The number of audio clips per class ranges from 51 to 170 in the clean subset, and from 250 to 1000 in the noisy subset. Further, a noisy small subset is defined, which includes an amount of (noisy) data comparable (in terms of duration) to that of the clean subset.
The test set is composed of 947 clips (1.4h) that belong to the clean portion of the data. Its class distribution is similar to that of the clean subset of the train set. The number of per-class audio clips in the test set ranges from 30 to 72. The test set enables a multi-class classification problem.
FSDnoisy18k is an expandable dataset that features a per-class varying degree of types and amount of label noise. The dataset allows investigation of label noise as well as other approaches, from semi-supervised learning (e.g., self-training) to learning with minimal supervision.
License
FSDnoisy18k has licenses at two different levels, as explained next. All sounds in Freesound are released under Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound. In particular, all Freesound clips included in FSDnoisy18k are released under either CC-BY or CC0. For attribution purposes and to facilitate attribution of these files to third parties, we include a relation of audio clips and their corresponding license in the LICENSE-INDIVIDUAL-CLIPS file downloaded with the dataset.
In addition, FSDnoisy18k as a whole is the result of a curation process and it has an additional license. FSDnoisy18k is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the dataset.
Files
FSDnoisy18k can be downloaded as a series of zip files with the following directory structure:
root
│
└───FSDnoisy18k.audio_train/ Audio clips in the train set
│
└───FSDnoisy18k.audio_test/ Audio clips in the test set
│
└───FSDnoisy18k.meta/ Files for evaluation setup
│ │
│ └───train.csv Data split and ground truth for the train set
│ │
│ └───test.csv Ground truth for the test set
│
└───FSDnoisy18k.doc/
│
└───README.md The dataset description file that you are reading
│
└───LICENSE-DATASET License of the FSDnoisy18k dataset as an entity
│
└───LICENSE-INDIVIDUAL-CLIPS.csv Licenses of the individual audio clips from Freesound
Each row (i.e. audio clip) of the train.csv file contains the following information:
fname: the file name
label: the audio classification label (ground truth)
aso_id: the id of the corresponding category as per the AudioSet Ontology
manually_verified: Boolean (1 or 0) flag to indicate whether the clip belongs to the clean portion (1), or to the noisy portion (0) of the train set
noisy_small: Boolean (1 or 0) flag to indicate whether the clip belongs to the noisy_small portion (1) of the train set
Each row (i.e. audio clip) of the test.csv file contains the following information:
fname: the file name
label: the audio classification label (ground truth)
aso_id: the id of the corresponding category as per the AudioSet Ontology
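A minimal sketch of reading these files and selecting the subsets with pandas (an assumption of this example; paths follow the directory structure above):

```python
import pandas as pd

train = pd.read_csv("FSDnoisy18k.meta/train.csv")
test = pd.read_csv("FSDnoisy18k.meta/test.csv")

clean = train[train["manually_verified"] == 1]   # clean portion of the train set
noisy = train[train["manually_verified"] == 0]   # noisy portion of the train set
noisy_small = train[train["noisy_small"] == 1]   # noisy_small subset
print(len(clean), len(noisy), len(noisy_small), len(test))
```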
Links
Source code for our preprint: https://github.com/edufonseca/icassp19
Freesound Annotator: https://annotator.freesound.org/
Freesound: https://freesound.org
Eduardo Fonseca’s personal website: http://www.eduardofonseca.net/
Acknowledgments
This work is partially supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 688382 AudioCommons. Eduardo Fonseca is also sponsored by a Google Faculty Research Award 2017. We thank everyone who contributed to FSDnoisy18k with annotations.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
ARCA23K is a dataset of labelled sound events created to investigate real-world label noise. It contains 23,727 audio clips originating from Freesound, and each clip belongs to one of 70 classes taken from the AudioSet ontology. The dataset was created using an entirely automated process with no manual verification of the data. For this reason, many clips are expected to be labelled incorrectly.
In addition to ARCA23K, this release includes a companion dataset called ARCA23K-FSD, which is a single-label subset of the FSD50K dataset. ARCA23K-FSD contains the same sound classes as ARCA23K and the same number of audio clips per class. As it is a subset of FSD50K, each clip and its label have been manually verified. Note that only the ground truth data of ARCA23K-FSD is distributed in this release. To download the audio clips, please visit the Zenodo page for FSD50K.
A paper has been published detailing how the dataset was constructed. See the Citing section below.
The source code used to create the datasets is available: https://github.com/tqbl/arca23k-dataset
Characteristics
ARCA23K(-FSD) is divided into:
A training set containing 17,979 clips (39.6 hours for ARCA23K).
A validation set containing 2,264 clips (5.0 hours).
A test set containing 3,484 clips (7.3 hours).
There are 70 sound classes in total. Each class belongs to the AudioSet ontology.
Each audio clip was sourced from the Freesound database. Other than format conversions (e.g. resampling), the audio clips have not been modified.
The duration of the audio clips varies from 0.3 seconds to 30 seconds.
All audio clips are mono 16-bit WAV files sampled at 44.1 kHz.
Based on listening tests (details in paper), 46.4% of the training examples are estimated to be labelled incorrectly. Among the incorrectly-labelled examples, 75.9% are estimated to be out-of-vocabulary.
Sound Classes
The list of sound classes is given below. They are grouped based on the top-level superclasses of the AudioSet ontology.
Music
Acoustic guitar
Bass guitar
Bowed string instrument
Crash cymbal
Electric guitar
Gong
Harp
Organ
Piano
Rattle (instrument)
Scratching (performance technique)
Snare drum
Trumpet
Wind chime
Wind instrument, woodwind instrument
Sounds of things
Boom
Camera
Coin (dropping)
Computer keyboard
Crack
Dishes, pots, and pans
Drawer open or close
Drill
Gunshot, gunfire
Hammer
Keys jangling
Knock
Microwave oven
Printer
Sawing
Scissors
Skateboard
Slam
Splash, splatter
Squeak
Tap
Thump, thud
Toilet flush
Train
Water tap, faucet
Whoosh, swoosh, swish
Writing
Zipper (clothing)
Natural sounds
Crackle
Stream
Waves, surf
Wind
Human sounds
Burping, eructation
Chewing, mastication
Child speech, kid speaking
Clapping
Cough
Crying, sobbing
Fart
Female singing
Female speech, woman speaking
Finger snapping
Giggle
Male speech, man speaking
Run
Screaming
Walk, footsteps
Animal
Bark
Cricket
Livestock, farm animals, working animals
Meow
Rattle
Source-ambiguous sounds
Crumpling, crinkling
Crushing
Tearing
License and Attribution
This release is licensed under the Creative Commons Attribution 4.0 International License.
The audio clips distributed as part of ARCA23K were sourced from Freesound and have their own Creative Commons license. The license information and attribution for each audio clip can be found in ARCA23K.metadata/train.json, which also includes the original Freesound URLs.
The files under ARCA23K-FSD.ground_truth/ are an adaptation of the ground truth data provided as part of FSD50K, which is licensed under the Creative Commons Attribution 4.0 International License. The curators of FSD50K are Eduardo Fonseca, Xavier Favory, Jordi Pons, Mercedes Collado, Ceren Can, Rachit Gupta, Javier Arredondo, Gary Avendano, and Sara Fernandez.
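A minimal sketch of reading the attribution information (the field names "license" and "freesound_url" are hypothetical; inspect ARCA23K.metadata/train.json in the release for the actual structure):

```python
import json

with open("ARCA23K.metadata/train.json") as f:
    metadata = json.load(f)

# Print license and source URL for a few clips; keys assumed to be clip ids.
for clip_id, info in list(metadata.items())[:5]:
    print(clip_id, info.get("license"), info.get("freesound_url"))
```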
Citing
If you wish to cite this work, please cite the following paper:
T. Iqbal, Y. Cao, A. Bailey, M. D. Plumbley, and W. Wang, “ARCA23K: An audio dataset for investigating open-set label noise”, in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021), 2021, Barcelona, Spain, pp. 201–205.
BibTeX:
@inproceedings{Iqbal2021,
  author = {Iqbal, T. and Cao, Y. and Bailey, A. and Plumbley, M. D. and Wang, W.},
  title = {{ARCA23K}: An audio dataset for investigating open-set label noise},
  booktitle = {Proceedings of the Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021)},
  pages = {201--205},
  year = {2021},
  address = {Barcelona, Spain},
}
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
VimSketch Dataset combines two publicly available datasets, created by the Interactive Audio Lab:
Vocal Imitation Set: a collection of crowd-sourced vocal imitations of a large set of diverse sounds collected from Freesound (https://freesound.org/), which were curated based on Google's AudioSet ontology (https://research.google.com/audioset/).
VocalSketch Dataset: a dataset containing thousands of vocal imitations of a large set of diverse sounds.
Publications by the Interactive Audio Lab using VimSketch:
[pdf] Fatemeh Pishdadian, Bongjun Kim, Prem Seetharaman, Bryan Pardo. "Classifying Non-speech Vocals: Deep vs Signal Processing Representations," Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), 2019
Contact information:
Interactive Audio Lab: http://music.eecs.northwestern.edu
Bryan Pardo pardo@northwestern.edu | http://www.bryanpardo.com
Bongjun Kim bongjun@u.northwestern.edu | http://www.bongjunkim.com
Fatemeh Pishdadian fpishdadian@u.northwestern.edu | http://www.fatemehpishdadian.com
LAMA World Music Genre Dataset
LAMA - LatinAmerica, Asia, MiddleEastern, Africa Genre Dataset
This dataset consists of .wav audio files classified into four categories: LatinAmerica, Asia, MiddleEastern, and Africa. We went through the Google AudioSet ontology and pickled the entries we double-checked to be from each region. We added 1-min audio (.wav), plots (.png), and numerical datapoints for training (.json). I hope that this work can help several Deep Learning and Machine Learning projects in Music Genre Classification.
Getting Started
The data contained in LAMA can be classified into three categories:
Section       Format  LatinAmerica  Asia    MiddleEast  Africa
audio         .wav    535           539     548         645
graph plots   .png    2140          2156    2192        2580
numerical     .json   101650        102410  104120      122550
Overall statistics of LAMA. The numbers in the “audio” and “plots” rows are counts of the included files in each section. The numbers in the “numerical” row are counts of the total raw datapoints; datapoints from two related files (trainMFCC.json, trainSC.json) were counted, while datapoints from trainZCR.json and trainRMSE.json were not counted in this figure. LatinA refers to Latin America, and MiddleE refers to Middle East.
What's in?
audio .wav -> 1-min clip audio files from Latin America, Africa, Asia, and the Middle East
graph plots .png -> MFCC, STFT, FFT, waveform
numerical .json -> x13 MFCC datapoints, x6 spectral contrast datapoints
DCASE2018 Task 4 is a dataset for large-scale weakly labeled semi-supervised sound event detection in domestic environments. The data are YouTube video excerpts focusing on the domestic context, which could be used for example in ambient assisted living applications. The domain was chosen due to the scientific challenges (wide variety of sounds, time-localized events...) and potential industrial applications. Specifically, the task employs a subset of "Audioset: An Ontology And Human-Labeled Dataset For Audio Events" by Google. Audioset consists of an expanding ontology of 632 sound event classes and a collection of 2 million human-labeled 10-second sound clips (less than 21% are shorter than 10 seconds) drawn from 2 million YouTube videos. The ontology is specified as a hierarchical graph of event categories, covering a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds. Task 4 focuses on a subset of Audioset that consists of 10 classes of sound events: speech, dog, cat, alarm bell ringing, dishes, frying, blender, running water, vacuum cleaner, and electric shaver/toothbrush.
MIT License: https://opensource.org/licenses/MIT
DESCRIPTION:
The Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23) dataset contains multichannel recordings of sound scenes in various rooms and environments, together with temporal and spatial annotations of prominent events belonging to a set of target classes. The dataset is collected in two different countries: in Tampere, Finland by the Audio Research Group (ARG) of Tampere University (TAU), and in Tokyo, Japan by SONY, using a similar setup and annotation procedure. The dataset is delivered in two 4-channel spatial recording formats, a microphone array one (MIC) and a first-order Ambisonics one (FOA). These recordings serve as the development dataset for the DCASE 2023 Sound Event Localization and Detection Task of the DCASE 2023 Challenge.
The STARSS23 dataset is a continuation of the STARSS22 dataset. It extends the previous version with the following:
An additional 2.5hrs of recordings in the development set, from 5 new rooms distributed in 47 new recording clips.
Distance labels (in cm) for the spatially annotated sound events, in addition to the azimuth and elevation labels of the previous version.
360° videos spatially and temporally aligned to the audio recordings of the dataset (apart from 12 audio-only clips).
Additional new audio and video recordings will be added in the evaluation set of the dataset in a subsequent version.
Contrary to the three previous datasets of synthetic spatial sound scenes, TAU Spatial Sound Events 2019 (development/evaluation), TAU-NIGENS Spatial Sound Events 2020, and TAU-NIGENS Spatial Sound Events 2021, associated with previous iterations of the DCASE Challenge, the STARSS22-23 dataset contains recordings of real sound scenes and hence avoids some of the pitfalls of synthetic generation of scenes. Some such key properties are:
annotations are based on a combination of human annotators for sound event activity and optical tracking for spatial positions,
the annotated target event classes are determined by the composition of the real scenes,
the density, polyphony, occurrences and co-occurrences of events and sound classes are not random; they follow the actions and interactions of participants in the real scenes.
The first round of recordings was collected between September 2021 and January 2022. A second round of recordings was collected between November 2022 and February 2023.
Collection of data from the TAU side has received funding from Google.
REPORT & REFERENCE:
If you use this dataset you could cite this report on its design, capturing, and annotation process:
Archontis Politis, Kazuki Shimada, Parthasaarathy Sudarsanam, Sharath Adavanne, Daniel Krause, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji, Tuomas Virtanen (2022). STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2022 Workshop (DCASE2022), Nancy, France.
A more detailed report on the properties of the new dataset and its audiovisual processing with a suitable baseline for DCASE2023 will be published soon.
AIM:
The STARSS22-23 dataset is suitable for training and evaluation of machine-listening models for sound event detection (SED), general sound source localization with diverse sounds or signal-of-interest localization, and joint sound-event-localization-and-detection (SELD). Additionally, the dataset can be used for evaluation of signal processing methods that do not necessarily rely on training, such as acoustic source localization methods and multiple-source acoustic tracking. The dataset allows evaluation of the performance and robustness of the aforementioned applications for diverse types of sounds, and under diverse acoustic conditions.
Specifically, STARSS23 additionally allows evaluation of audiovisual processing methods, such as audiovisual source localization.
SPECIFICATIONS:
General:
Recordings are taken in two different sites.
Each recording clip is part of a recording session happening in a unique room.
Groups of participants, sound making props, and scene scenarios are unique for each session (with a few exceptions).
To achieve good variability and efficiency in the data, in terms of presence, density, movement, and/or spatial distribution of the sounds events, the scenes are loosely scripted.
13 target classes are identified in the recordings and strongly annotated by humans.
Spatial annotations for those active events are captured by an optical tracking system.
Sound events out of the target classes are considered as interference.
Occurrences of up to 3 simultaneous events are fairly common, while higher numbers of overlapping events (up to 5) can occur but are rare.
Volume, duration, and data split:
A total of 16 unique rooms are captured in the recordings, 4 in Tokyo and 12 in Tampere (development set).
70 recording clips of 30 sec ~ 5 min durations, with a total time of ~2hrs, captured in Tokyo (development dataset).
98 recording clips of 40 sec ~ 9 min durations, with a total time of ~5.5hrs, captured in Tampere (development dataset).
A training-testing split is provided for reporting results using the development dataset.
40 recordings contributed by Sony for the training split, captured in 2 rooms (dev-train-sony).
30 recordings contributed by Sony for the testing split, captured in 2 rooms (dev-test-sony).
50 recordings contributed by TAU for the training split, captured in 7 rooms (dev-train-tau).
48 recordings contributed by TAU for the testing split, captured in 5 rooms (dev-test-tau).
About 3.5 hrs of additional recordings from both sites, captured in rooms different from those in the development set, will be released later as the evaluation set.
Audio:
Sampling rate: 24kHz.
Two 4-channel 3-dimensional recording formats: first-order Ambisonics (FOA) and tetrahedral microphone array (MIC).
Video:
Video 360° format: equirectangular
Video resolution: 1920x960
Video frames per second (fps): 29.97
All audio recordings are accompanied by synchronised video recordings, apart from 12 audio recordings with missing videos (fold3_room21_mix001.wav - fold3_room21_mix012.wav); a minimal sketch for loading and inspecting the audio and video files is shown after this section.
More detailed information on the dataset can be found in the included README file.
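As a quick illustration of these specifications, the following is a minimal sketch for inspecting one clip against the stated audio and video formats. It is not part of the dataset tooling: it assumes the soundfile and opencv-python packages are installed, and the clip name and folder layout used below are hypothetical placeholders (see the included README for the actual layout).
```python
# Minimal sketch: check one development clip against the stated specifications.
# Requires `pip install soundfile opencv-python`; paths are hypothetical examples.
import soundfile as sf
import cv2

audio_path = "foa_dev/dev-train-tau/fold4_room2_mix001.wav"   # hypothetical path
video_path = "video_dev/dev-train-tau/fold4_room2_mix001.mp4" # hypothetical path

# Audio: expect 4 channels (FOA or MIC) sampled at 24 kHz
audio, sr = sf.read(audio_path)
print(audio.shape, sr)  # e.g. (n_samples, 4) 24000

# Video: expect equirectangular 1920x960 frames at ~29.97 fps
cap = cv2.VideoCapture(video_path)
width = cap.get(cv2.CAP_PROP_FRAME_WIDTH)
height = cap.get(cv2.CAP_PROP_FRAME_HEIGHT)
fps = cap.get(cv2.CAP_PROP_FPS)
print(width, height, fps)  # e.g. 1920.0 960.0 29.97
cap.release()
```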
SOUND CLASSES:
13 target sound event classes are annotated. The classes loosely follow the AudioSet ontology.
The content of some of these classes corresponds to events from a limited range of AudioSet-related subclasses. For more information see the README file.
EXAMPLE APPLICATION:
An implementation of a trainable convolutional recurrent neural network (CRNN) performing joint SELD, trained and evaluated with this dataset, is provided here. This implementation will serve as the baseline method for the audio-only track of the DCASE 2023 Sound Event Localization and Detection Task; a minimal illustrative sketch of such a CRNN is given below.
A baseline for the audiovisual track of DCASE 2023 Sound Event Localization and Detection Task will be published soon and referenced here.
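For orientation only, here is a minimal sketch of a CRNN with an ACCDOA-style output head in PyTorch. It is not the official baseline implementation; the feature layout (7 input channels of mel/intensity features), layer sizes, pooling factors, and the output parameterization are assumptions chosen purely for illustration.
```python
# Minimal illustrative CRNN for joint SELD. NOT the official DCASE baseline;
# all architectural choices below are assumptions for illustration only.
import torch
import torch.nn as nn

class MiniSELDNet(nn.Module):
    def __init__(self, n_input_channels=7, n_classes=13):
        super().__init__()
        # Conv blocks that pool away frequency and downsample time once
        self.conv = nn.Sequential(
            nn.Conv2d(n_input_channels, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((5, 4)),
            nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((1, 4)),
            nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((1, 2)),
        )
        self.gru = nn.GRU(input_size=64 * 2, hidden_size=128,
                          num_layers=2, batch_first=True, bidirectional=True)
        # ACCDOA-style head: 3 Cartesian coordinates per class per frame,
        # whose vector length encodes event activity
        self.fc = nn.Linear(256, 3 * n_classes)

    def forward(self, x):
        # x: (batch, channels, time_frames, mel_bins)
        x = self.conv(x)                      # (batch, 64, T', F')
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        x, _ = self.gru(x)                    # (batch, T', 256)
        return torch.tanh(self.fc(x))         # (batch, T', 3 * n_classes)

# Dummy forward pass: 7 feature channels, 500 feature frames, 64 mel bins
model = MiniSELDNet()
out = model(torch.randn(2, 7, 500, 64))
print(out.shape)  # torch.Size([2, 100, 39])
```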
DEVELOPMENT AND EVALUATION:
The current version (Version 1.0) of the dataset includes only the 168 development audio/video recordings and labels, used by the participants of Task 3 of the DCASE2023 Challenge to train and validate their submitted systems. Version 1.1 will additionally include the evaluation audio and video recordings without labels, for the evaluation phase of DCASE2023.
If researchers wish to compare their system against the submissions of DCASE2023 Challenge, they will have directly comparable results if they use the evaluation data as their testing set.
DOWNLOAD INSTRUCTIONS:
The file foa_dev.zip corresponds to the audio data of the FOA recording format. The file mic_dev.zip corresponds to the audio data of the MIC recording format.
The file video_dev.zip contains the common videos for both audio formats. The file metadata_dev.zip contains the common metadata for both audio formats.
Download the zip files corresponding to the format of interest and use your favourite compression tool to unzip these zip files.
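As a convenience, the unzip step could look like the following sketch. The archive names are the ones listed above; the target folder name is an arbitrary assumption, and any standard compression tool works equally well.
```python
# Sketch: extract the downloaded archives into a common dataset folder.
# Assumes the zips listed above were downloaded to the current directory;
# the "STARSS23" target folder is an arbitrary choice for illustration.
import zipfile
from pathlib import Path

archives = ["foa_dev.zip", "mic_dev.zip", "video_dev.zip", "metadata_dev.zip"]
target = Path("STARSS23")
target.mkdir(exist_ok=True)

for name in archives:
    if Path(name).exists():  # skip formats you did not download
        with zipfile.ZipFile(name) as zf:
            zf.extractall(target)
            print(f"extracted {name} -> {target}/")
```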
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
DESCRIPTION:
The Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22) dataset contains multichannel recordings of sound scenes in various rooms and environments, together with temporal and spatial annotations of prominent events belonging to a set of target classes. The dataset is collected in two different countries: in Tampere, Finland by the Audio Research Group (ARG) of Tampere University (TAU), and in Tokyo, Japan by SONY, using a similar setup and annotation procedure. The dataset is delivered in two 4-channel spatial recording formats, a microphone array format (MIC) and a first-order Ambisonics format (FOA). These recordings serve as the development dataset for the Sound Event Localization and Detection Task of the DCASE 2022 Challenge.
Contrary to the three previous datasets of synthetic spatial sound scenes (TAU Spatial Sound Events 2019 development/evaluation, TAU-NIGENS Spatial Sound Events 2020, and TAU-NIGENS Spatial Sound Events 2021) associated with the previous iterations of the DCASE Challenge, the STARSS22 dataset contains recordings of real sound scenes and hence avoids some of the pitfalls of synthetic scene generation. Some of its key properties are:
annotations are based on a combination of human annotators for sound event activity and optical tracking for spatial positions,
the annotated target event classes are determined by the composition of the real scenes,
the density, polyphony, occurrences and co-occurrences of events and sound classes are not random, and they follow the actions and interactions of participants in the real scenes.
The recordings were collected between September 2021 and January 2022. Collection of data from the TAU side has received funding from Google.
REPORT & REFERENCE:
If you use this dataset please cite the report on its creation, and the related DCASE2022 task setup:
Archontis Politis, Kazuki Shimada, Parthasaarathy Sudarsanam, Sharath Adavanne, Daniel Krause, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji, Tuomas Virtanen (2022). STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2022 Workshop (DCASE2022), Nancy, France.
found here.
AIM:
The dataset is suitable for training and evaluation of machine-listening models for sound event detection (SED), general sound source localization with diverse sounds or signal-of-interest localization, and joint sound-event-localization-and-detection (SELD). Additionally, the dataset can be used for evaluation of signal processing methods that do not necessarily rely on training, such as acoustic source localization methods and multiple-source acoustic tracking. The dataset allows evaluation of the performance and robustness of the aforementioned applications for diverse types of sounds, and under diverse acoustic conditions.
SPECIFICATIONS:
70 recording clips of 30 sec ~ 5 min durations, with a total time of ~2hrs, contributed by SONY (development dataset).
51 recording clips of 1 min ~ 5 min durations, with a total time of ~3hrs, contributed by TAU (development dataset).
52 recording clips with a total time of ~2hrs, contributed by SONY&TAU (evaluation dataset).
A training-test split is provided for reporting results using the development dataset (a minimal sketch for collecting this split is shown after this section).
40 recordings contributed by SONY for the training split, captured in 2 rooms (dev-train-sony).
30 recordings contributed by SONY for the testing split, captured in 2 rooms (dev-test-sony).
27 recordings contributed by TAU for the training split, captured in 4 rooms (dev-train-tau).
24 recordings contributed by TAU for the testing split, captured in 3 rooms (dev-test-tau).
A total of 11 unique rooms are captured in the recordings, 4 from SONY and 7 from TAU (development set).
Sampling rate 24kHz.
Two 4-channel 3-dimensional recording formats: first-order Ambisonics (FOA) and tetrahedral microphone array (MIC).
Recordings are taken in two different countries, at two different sites.
Each recording clip is part of a recording session happening in a unique room.
Groups of participants, sound making props, and scene scenarios are unique for each session (with a few exceptions).
To achieve good variability and efficiency in the data, in terms of presence, density, movement, and/or spatial distribution of the sound events, the scenes are loosely scripted.
13 target classes are identified in the recordings and strongly annotated by humans.
Spatial annotations for those active events are captured by an optical tracking system.
Sound events out of the target classes are considered as interference.
Occurrences of up to 3 simultaneous events are fairly common, while higher numbers of overlapping events (up to 5) can occur but are rare.
More detailed information on the dataset can be found in the included README file.
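As a small illustration of the training-test split listed above, the following sketch collects training and testing file lists by matching the split names (dev-train-sony, dev-test-sony, dev-train-tau, dev-test-tau) in the extracted folder structure. The extraction directory and the assumption that the split names appear in the file paths are illustrative only; the authoritative layout is given in the README.
```python
# Sketch: build the provided training/testing split from the folder names.
# The dataset root and the path layout are assumptions for illustration.
from pathlib import Path

DATASET_ROOT = Path("STARSS22")  # hypothetical extraction directory

def collect_split(root: Path, split: str) -> list[Path]:
    """Return all wav files whose path mentions the given split name."""
    return sorted(p for p in root.rglob("*.wav") if split in str(p))

train_files = (collect_split(DATASET_ROOT, "dev-train-sony")
               + collect_split(DATASET_ROOT, "dev-train-tau"))
test_files = (collect_split(DATASET_ROOT, "dev-test-sony")
              + collect_split(DATASET_ROOT, "dev-test-tau"))
print(len(train_files), "training clips,", len(test_files), "testing clips")
```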
SOUND CLASSES:
13 target sound event classes are annotated. The classes loosely follow the AudioSet ontology.
The content of some of these classes corresponds to events from a limited range of AudioSet-related subclasses. For more information see the README file.
EXAMPLE APPLICATION:
An implementation of a trainable convolutional recurrent neural network (CRNN) performing joint SELD, trained and evaluated with this dataset, is provided here. This implementation will serve as the baseline method in the DCASE 2022 Sound Event Localization and Detection Task.
DEVELOPMENT AND EVALUATION:
The current version (Version 1.1) of the dataset includes the 121 development audio recordings and labels, used by the participants of Task 3 of DCASE2022 Challenge to train and validate their submitted systems, and the 52 evaluation audio recordings without labels, for the evaluation phase of DCASE2022.
If researchers wish to compare their system against the submissions of DCASE2022 Challenge, they will have directly comparable results if they use the evaluation data as their testing set.
DOWNLOAD INSTRUCTIONS:
The file foa_dev.zip corresponds to the audio data of the FOA recording format. The file mic_dev.zip corresponds to the audio data of the MIC recording format. The file metadata_dev.zip contains the common metadata for both formats.
The file foa_eval.zip corresponds to the audio data of the FOA recording format for the evaluation dataset. The file mic_eval.zip corresponds to the audio data of the MIC recording format for the evaluation dataset.
Download the zip files corresponding to the format of interest and use your favourite compression tool to unzip these zip files.