FSD50K is an open dataset of human-labeled sound events containing 51,197 Freesound clips unequally distributed in 200 classes drawn from the AudioSet Ontology. FSD50K has been created at the Music Technology Group of Universitat Pompeu Fabra.
Citation
If you use the FSD50K dataset, or part of it, please cite our TASLP paper (available from [arXiv] [TASLP]):
@article{fonseca2022FSD50K, title={{FSD50K}: an open dataset of human-labeled sound events}, author={Fonseca, Eduardo and Favory, Xavier and Pons, Jordi and Font, Frederic and Serra, Xavier}, journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing}, volume={30}, pages={829--852}, year={2022}, publisher={IEEE} }
Paper update: This paper has been published in TASLP at the beginning of 2022. The accepted camera-ready version includes a number of improvements with respect to the initial submission. The main updates include: estimation of the amount of label noise in FSD50K, SNR comparison between FSD50K and AudioSet, improved description of evaluation metrics including equations, clarification of experimental methodology and some results, some content moved to Appendix for readability. The TASLP-accepted camera-ready version is available from arXiv (in particular, it is v2 in arXiv, displayed by default).
Data curators
Eduardo Fonseca, Xavier Favory, Jordi Pons, Mercedes Collado, Ceren Can, Rachit Gupta, Javier Arredondo, Gary Avendano and Sara Fernandez
Contact
You are welcome to contact Eduardo Fonseca should you have any questions, at efonseca@google.com.
ABOUT FSD50K
Freesound Dataset 50k (or FSD50K for short) is an open dataset of human-labeled sound events containing 51,197 Freesound clips unequally distributed in 200 classes drawn from the AudioSet Ontology [1]. FSD50K has been created at the Music Technology Group of Universitat Pompeu Fabra.
What follows is a brief summary of FSD50K's most important characteristics. Please have a look at our paper (especially Section 4) to extend the basic information provided here with relevant details for its usage, as well as discussion, limitations, applications and more.
Basic characteristics:
FSD50K contains 51,197 audio clips from Freesound, totalling 108.3 hours of multi-labeled audio
The dataset encompasses 200 sound classes (144 leaf nodes and 56 intermediate nodes) hierarchically organized with a subset of the AudioSet Ontology.
The audio content is composed mainly of sound events produced by physical sound sources and production mechanisms, including human sounds, sounds of things, animals, natural sounds, musical instruments and more. The vocabulary can be inspected in vocabulary.csv (see Files section below).
The acoustic material has been manually labeled by humans following a data labeling process using the Freesound Annotator platform [2].
Clips are of variable length from 0.3 to 30s, due to the diversity of the sound classes and the preferences of Freesound users when recording sounds.
All clips are provided as uncompressed PCM 16 bit 44.1 kHz mono audio files.
Ground truth labels are provided at the clip-level (i.e., weak labels).
The dataset poses mainly a large-vocabulary multi-label sound event classification problem, but also allows development and evaluation of a variety of machine listening approaches (see Sec. 4D in our paper).
In addition to audio clips and ground truth, additional metadata is made available (including raw annotations, sound predominance ratings, Freesound metadata, and more), allowing a variety of analyses and sound event research tasks (see Files section below).
The audio clips are grouped into a development (dev) set and an evaluation (eval) set such that no clips from the same Freesound uploader appear in both sets.
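Since all clips share this audio format, a quick sanity check of a downloaded clip might look like the following sketch (the clip id 64760 is the example used in the ground-truth description below; soundfile is an assumed dependency):

```python
import soundfile as sf

# Read one dev clip; FSD50K file names are Freesound ids (see the dev.csv notes below).
path = "FSD50K.dev_audio/64760.wav"
data, sr = sf.read(path)
info = sf.info(path)

print(sr, info.subtype, data.ndim)  # expected: 44100, 'PCM_16', 1 (i.e., mono)
```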
Dev set:
40,966 audio clips totalling 80.4 hours of audio
Avg duration/clip: 7.1s
114,271 smeared labels (i.e., labels propagated in the upwards direction to the root of the ontology)
Labels are correct but could be occasionally incomplete
A train/validation split is provided (Sec. 3H). If a different split is used, it should be specified for reproducibility and fair comparability of results (see Sec. 5C of our paper)
Eval set:
10,231 audio clips totalling 27.9 hours of audio
Avg duration/clip: 9.8s
38,596 smeared labels
Eval set is labeled exhaustively (labels are correct and complete for the considered vocabulary)
Note: All classes in FSD50K are represented in AudioSet, except Crash cymbal, Human group actions, Human voice, Respiratory sounds, and Domestic sounds, home sounds.
LICENSE
All audio clips in FSD50K are released under Creative Commons (CC) licenses. Each clip has its own license as defined by the clip uploader in Freesound, some of them requiring attribution to their original authors and some forbidding further commercial reuse. Specifically:
The development set consists of 40,966 clips with the following licenses:
CC0: 14,959
CC-BY: 20,017
CC-BY-NC: 4,616
CC Sampling+: 1,374
The evaluation set consists of 10,231 clips with the following licenses:
CC0: 4,914
CC-BY: 3,489
CC-BY-NC: 1,425
CC Sampling+: 403
For attribution purposes and to facilitate attribution of these files to third parties, we include a mapping from the audio clips to their corresponding licenses. The licenses are specified in the files dev_clips_info_FSD50K.json and eval_clips_info_FSD50K.json.
In addition, FSD50K as a whole is the result of a curation process and it has an additional license: FSD50K is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSD50K.doc zip file. We note that the choice of one license for the dataset as a whole is not straightforward as it comprises items with different licenses (such as audio clips, annotations, or data split). The choice of a global license in these cases may warrant further investigation (e.g., by someone with a background in copyright law).
Usage of FSD50K for commercial purposes:
If you'd like to use FSD50K for commercial purposes, please contact Eduardo Fonseca and Frederic Font at efonseca@google.com and frederic.font@upf.edu.
Also, if you are interested in using FSD50K for machine learning competitions, please contact Eduardo Fonseca and Frederic Font at efonseca@google.com and frederic.font@upf.edu.
FILES
FSD50K can be downloaded as a series of zip files with the following directory structure:
root
│
└───FSD50K.dev_audio/ Audio clips in the dev set
│
└───FSD50K.eval_audio/ Audio clips in the eval set
│
└───FSD50K.ground_truth/ Files for FSD50K's ground truth
│ │
│ └─── dev.csv Ground truth for the dev set
│ │
│ └─── eval.csv Ground truth for the eval set
│ │
│ └─── vocabulary.csv List of 200 sound classes in FSD50K
│
└───FSD50K.metadata/ Files for additional metadata
│ │
│ └─── class_info_FSD50K.json Metadata about the sound classes
│ │
│ └─── dev_clips_info_FSD50K.json Metadata about the dev clips
│ │
│ └─── eval_clips_info_FSD50K.json Metadata about the eval clips
│ │
│ └─── pp_pnp_ratings_FSD50K.json PP/PNP ratings
│ │
│ └─── collection/ Files for the sound collection format
│
└───FSD50K.doc/
│
└───README.md The dataset description file that you are reading
│
└───LICENSE-DATASET License of the FSD50K dataset as an entity
Each row (i.e. audio clip) of dev.csv contains the following information:
fname: the file name without the .wav extension, e.g., the fname 64760 corresponds to the file 64760.wav on disk. This number is the Freesound id. We always use Freesound ids as filenames.
labels: the class labels (i.e., the ground truth). Note these class labels are smeared, i.e., the labels have been propagated in the upwards direction to the root of the ontology. More details about the label smearing process can be found in Appendix D of our paper.
mids: the Freebase identifiers corresponding to the class labels, as defined in the AudioSet Ontology specification
split: whether the clip belongs to train or val (see paper for details on the proposed split)
Rows in eval.csv follow the same format, except that there is no split column.
Note: We use a slightly different format than AudioSet for the naming of class labels in order to avoid potential problems with spaces, commas, etc. Example: we use Accelerating_and_revving_and_vroom instead of the original Accelerating, revving, vroom. You can go back to the original AudioSet naming using the information provided in vocabulary.csv (class label and mid for the 200 classes of FSD50K) and the AudioSet Ontology specification.
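Putting the above together, the ground-truth files can be parsed along these lines (a hedged sketch: it assumes pandas, the directory layout shown below in the Files section, and that vocabulary.csv is headerless with columns index, label, mid as in the original release):

```python
import pandas as pd

gt = pd.read_csv("FSD50K.ground_truth/dev.csv")
vocab = pd.read_csv("FSD50K.ground_truth/vocabulary.csv",
                    header=None, names=["index", "label", "mid"])

# labels/mids are comma-separated lists inside a quoted CSV field.
gt["labels"] = gt["labels"].str.split(",")
gt["mids"] = gt["mids"].str.split(",")

# Map FSD50K class labels to their AudioSet mids via vocabulary.csv.
label_to_mid = dict(zip(vocab["label"], vocab["mid"]))

row = gt.iloc[0]
print(row["fname"], row["split"], row["labels"][:3])
```

Recovering the original AudioSet display names would additionally require the mid-to-name mapping from the AudioSet Ontology specification.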
Files with additional metadata (FSD50K.metadata/)
To allow a variety of analysis and approaches with FSD50K, we provide the following metadata:
class_info_FSD50K.json: python dictionary where each entry corresponds to one sound class and contains: FAQs utilized during the annotation of the class, examples (representative audio clips), and verification_examples (audio clips presented to raters during annotation as a quality control mechanism). Audio clips are described by the Freesound id. Note: It may be that some of these examples are not included in the FSD50K release.
dev_clips_info_FSD50K.json: python dictionary where each entry corresponds to one dev clip and contains: title,
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Google's AudioSet consistently reformatted
During my work with Google's AudioSet (https://research.google.com/audioset/index.html) I encountered some problems: the Weak (https://research.google.com/audioset/download.html) and Strong (https://research.google.com/audioset/download_strong.html) versions of the dataset use different csv formatting, and the labels used in the two datasets differ (https://github.com/audioset/ontology/issues/9) and are presented in files with different formatting.
This reformatting aims to unify the formats of the two datasets so that they can be analysed in the same pipelines, and to make the dataset files compatible with the psds_eval, dcase_util and sed_eval Python packages used in audio processing.
For better formatted documentation and source code of reformatting refer to https://github.com/bakhtos/GoogleAudioSetReformatted
-Changes in dataset
All files are converted to tab-separated *.tsv files (i.e. csv files with \t as a separator). All files have a header as the first line.
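For illustration, such a reformatted file can be read directly with pandas (a sketch; the column names depend on the file, as described in the next section):

```python
import pandas as pd

# Header is in the first line; fields are tab-separated.
df = pd.read_csv("audioset_strong_train.tsv", sep="\t")
print(df.columns.tolist())
```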
-New fields and filenames
Fields are renamed according to the following table, to be compatible with psds_eval:
Old field -> New field
YTID -> filename
segment_id -> filename
start_seconds -> onset
start_time_seconds -> onset
end_seconds -> offset
end_time_seconds -> offset
positive_labels -> event_label
label -> event_label
present -> present
For class label files, id is now the name for the mid label (e.g. /m/09xor) and label for the human-readable label (e.g. Speech). The label index provided for the Weak dataset labels (the index field in class_labels_indices.csv) is not used.
Files are renamed according to the following table to ensure consistent naming of the form audioset_[weak|strong]_[train|eval]_[balanced|unbalanced|posneg]*.tsv:
Old name -> New name
balanced_train_segments.csv -> audioset_weak_train_balanced.tsv
unbalanced_train_segments.csv -> audioset_weak_train_unbalanced.tsv
eval_segments.csv -> audioset_weak_eval.tsv
audioset_train_strong.tsv -> audioset_strong_train.tsv
audioset_eval_strong.tsv -> audioset_strong_eval.tsv
audioset_eval_strong_framed_posneg.tsv -> audioset_strong_eval_posneg.tsv
class_labels_indices.csv -> class_labels.tsv (merged with mid_to_display_name.tsv)
mid_to_display_name.tsv -> class_labels.tsv (merged with class_labels_indices.csv)
-Strong dataset changes
The only changes to the Strong dataset are the renaming of fields and the reordering of columns, so that both the Weak and Strong versions have filename and event_label as the first two columns.
-Weak dataset changes
-- Labels are given one per line, instead of as a comma-separated and quoted list
-- To make sure that the filename format is the same as in the Strong version, the following format change is made:
The value of the start_seconds field is converted to milliseconds and appended to the filename with an underscore. Since all files in the dataset are assumed to be 10 seconds long, this unifies the format of filename with the Strong version and makes end_seconds redundant.
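A minimal sketch of this conversion (assuming a weak-label row provides a YouTube id and start_seconds as described above; the function name and example id are illustrative):

```python
def weak_row_to_filename(ytid: str, start_seconds: float) -> str:
    # Convert the segment start to milliseconds and append it with an underscore,
    # matching the filename format of the Strong version.
    start_ms = int(round(start_seconds * 1000))
    return f"{ytid}_{start_ms}"

print(weak_row_to_filename("-0RWZT-miFs", 420.0))  # -> "-0RWZT-miFs_420000"
```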
-Class labels changes
Class labels from both datasets are merged into one file and given in alphabetical order of id
s. Since same id
s are present in both datasets, but sometimes with different human-readable labels, labels from Strong dataset overwrite those from Weak. It is possible to regenerate class_labels.tsv
while giving priority to the Weak version of labels by calling convert_labels(False)
from convert.py in the GitHub repository.
-License
Google's AudioSet was published in two stages - first the Weakly labelled data (Gemmeke, Jort F., et al. "Audio set: An ontology and human-labeled dataset for audio events." 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2017.), then the strongly labelled data (Hershey, Shawn, et al. "The benefit of temporally-strong labels in audio event classification." ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.)
Both the original dataset and this reworked version are licensed under CC BY 4.0
Class labels come from the AudioSet Ontology, which is licensed under CC BY-SA 4.0.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Freesound Dataset 50k (FSD50K)
Important
This data set is a copy from the original one located at Zenodo.
Citation
If you use the FSD50K dataset, or part of it, please cite our paper:
Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, Xavier Serra. "FSD50K: an Open Dataset of Human-Labeled Sound Events", arXiv 2020.
Data curators
Eduardo Fonseca, Xavier Favory, Jordi Pons, Mercedes Collado, Ceren Can, Rachit Gupta, Javier Arredondo, Gary… See the full description on the dataset page: https://huggingface.co/datasets/Fhrozen/FSD50k.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
FSD-FS is a publicly-available database of human-labelled sound events for few-shot learning. It spans 143 classes obtained from the AudioSet Ontology and contains 43,805 raw audio files collected from FSD50K. FSD-FS is curated at the Centre for Digital Music, Queen Mary University of London.
Citation
If you use the FSD-FS dataset, please cite our paper and FSD50K.
@article{liang2022learning,
title={Learning from Taxonomy: Multi-label Few-Shot Classification for Everyday Sound Recognition},
author={Liang, Jinhua and Phan, Huy and Benetos, Emmanouil},
journal={arXiv preprint arXiv:2212.08952},
year={2022}
}
@ARTICLE{9645159, author={Fonseca, Eduardo and Favory, Xavier and Pons, Jordi and Font, Frederic and Serra, Xavier}, journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing}, title={FSD50K: An Open Dataset of Human-Labeled Sound Events}, year={2022}, volume={30}, number={}, pages={829-852}, doi={10.1109/TASLP.2021.3133208}}
About FSD-FS
FSD-FS is an open database for multi-label few-shot audio classification containing 143 classes drawn from the FSD50K. It also inherits the AudioSet Ontology. FSD-FS follows the ratio 7:2:1 to split classes into base, validation, and evaluation sets, so there are 98 classes in the base set, 30 classes in the validation set, and 15 classes in the evaluation set (More details can be found in our paper).
LICENSE
FSD-FS is released under Creative Commons (CC) licenses. As with FSD50K, each clip has its own license as defined by the clip uploader in Freesound, some requiring attribution to their original authors and some forbidding further commercial reuse. For more details, please refer to the link.
FILES
FSD-FS is organised in the following structure:
root
|
└─── dev_base
|
└─── dev_val
|
└─── eval
REFERENCES AND LINKS
[1] Gemmeke, Jort F., et al. "Audio set: An ontology and human-labeled dataset for audio events." 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2017. [paper] [link]
[2] Fonseca, Eduardo, et al. "FSD50K: an open dataset of human-labeled sound events." IEEE/ACM Transactions on Audio, Speech, and Language Processing 30 (2022): 829-852. [paper] [code]
FSDKaggle2019 is an audio dataset containing 29,266 audio files annotated with 80 labels of the AudioSet Ontology. FSDKaggle2019 has been used for the DCASE Challenge 2019 Task 2, which was run as a Kaggle competition titled Freesound Audio Tagging 2019.
Citation
If you use the FSDKaggle2019 dataset or part of it, please cite our DCASE 2019 paper:
Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Serra. "Audio tagging with noisy labels and minimal supervision". Proceedings of the DCASE 2019 Workshop, NYC, US (2019)
You can also consider citing our ISMIR 2017 paper, which describes how we gathered the manual annotations included in FSDKaggle2019.
Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, "Freesound Datasets: A Platform for the Creation of Open Audio Datasets", In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017
Data curators
Eduardo Fonseca, Manoj Plakal, Xavier Favory, Jordi Pons
Contact
You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.
ABOUT FSDKaggle2019
Freesound Dataset Kaggle 2019 (or FSDKaggle2019 for short) is an audio dataset containing 29,266 audio files annotated with 80 labels of the AudioSet Ontology [1]. FSDKaggle2019 has been used for the Task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2019. Please visit the DCASE2019 Challenge Task 2 website for more information. This Task was hosted on the Kaggle platform as a competition titled Freesound Audio Tagging 2019. It was organized by researchers from the Music Technology Group (MTG) of Universitat Pompeu Fabra (UPF), and from Sound Understanding team at Google AI Perception. The competition intended to provide insight towards the development of broadly-applicable sound event classifiers able to cope with label noise and minimal supervision conditions.
FSDKaggle2019 employs audio clips from the following sources:
Freesound Dataset (FSD): a dataset being collected at the MTG-UPF based on Freesound content organized with the AudioSet Ontology
The soundtracks of a pool of Flickr videos taken from the Yahoo Flickr Creative Commons 100M dataset (YFCC)
The audio data is labeled using a vocabulary of 80 labels from Google’s AudioSet Ontology [1], covering diverse topics: Guitar and other Musical Instruments, Percussion, Water, Digestive, Respiratory sounds, Human voice, Human locomotion, Hands, Human group actions, Insect, Domestic animals, Glass, Liquid, Motor vehicle (road), Mechanisms, Doors, and a variety of Domestic sounds. The full list of categories can be inspected in vocabulary.csv (see Files & Download below). The goal of the task was to build a multi-label audio tagging system that can predict appropriate label(s) for each audio clip in a test set.
What follows is a summary of some of the most relevant characteristics of FSDKaggle2019. Nevertheless, it is highly recommended to read our DCASE 2019 paper for a more in-depth description of the dataset and how it was built.
Ground Truth Labels
The ground truth labels are provided at the clip-level, and express the presence of a sound category in the audio clip, hence can be considered weak labels or tags. Audio clips have variable lengths (roughly from 0.3 to 30s).
The audio content from FSD has been manually labeled by humans following a data labeling process using the Freesound Annotator platform. Most labels have inter-annotator agreement but not all of them. More details about the data labeling process and the Freesound Annotator can be found in [2].
The YFCC soundtracks were labeled using automated heuristics applied to the audio content and metadata of the original Flickr clips. Hence, a substantial amount of label noise can be expected. The label noise can vary widely in amount and type depending on the category, including in- and out-of-vocabulary noises. More information about some of the types of label noise that can be encountered is available in [3].
Specifically, FSDKaggle2019 features three types of label quality, one for each set in the dataset:
curated train set: correct (but potentially incomplete) labels
noisy train set: noisy labels
test set: correct and complete labels
Further details can be found below in the sections for each set.
Format
All audio clips are provided as uncompressed PCM 16 bit, 44.1 kHz, mono audio files.
DATA SPLIT
FSDKaggle2019 consists of two train sets and one test set. The idea is to limit the supervision provided for training (i.e., the manually-labeled, hence reliable, data), thus promoting approaches to deal with label noise.
Curated train set
The curated train set consists of manually-labeled data from FSD.
Number of clips/class: 75 except in a few cases (where there are fewer)
Total number of clips: 4970
Avg number of labels/clip: 1.2
Total duration: 10.5 hours
The duration of the audio clips ranges from 0.3 to 30s due to the diversity of the sound categories and the preferences of Freesound users when recording/uploading sounds. Labels are correct but potentially incomplete. It can happen that a few of these audio clips present additional acoustic material beyond the provided ground truth label(s).
Noisy train set
The noisy train set is a larger set of noisy web audio data from Flickr videos taken from the YFCC dataset [5].
Number of clips/class: 300
Total number of clips: 19,815
Avg number of labels/clip: 1.2
Total duration: ~80 hours
The duration of the audio clips ranges from 1s to 15s, with the vast majority lasting 15s. Labels are automatically generated and purposefully noisy. No human validation is involved. The label noise can vary widely in amount and type depending on the category, including in- and out-of-vocabulary noises.
Considering the numbers above, the per-class data distribution available for training is, for most of the classes, 300 clips from the noisy train set and 75 clips from the curated train set. This means 80% noisy / 20% curated at the clip level, while at the duration level the proportion is more extreme considering the variable-length clips.
Test set
The test set is used for system evaluation and consists of manually-labeled data from FSD.
Number of clips/class: between 50 and 150
Total number of clips: 4481
Avg number of labels/clip: 1.4
Total duration: 12.9 hours
The acoustic material present in the test set clips is labeled exhaustively using the aforementioned vocabulary of 80 classes. Most labels have inter-annotator agreement but not all of them. Barring human error, the labels are correct and complete considering the target vocabulary; nonetheless, a few clips could still present additional (unlabeled) acoustic content outside the vocabulary.
During the DCASE2019 Challenge Task 2, the test set was split into two subsets, for the public and private leaderboards, and only the data corresponding to the public leaderboard was provided. In this current package you will find the full test set with all the test labels. To allow comparison with previous work, the file test_post_competition.csv includes a flag to determine the corresponding leaderboard (public or private) for each test clip (see more info in Files & Download below).
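For instance, to compare against the DCASE2019 leaderboards, the test clips could be filtered by that flag roughly as follows (a hedged sketch: the path and the column name usage are assumed placeholders; check the actual header of test_post_competition.csv):

```python
import pandas as pd

test = pd.read_csv("FSDKaggle2019.meta/test_post_competition.csv")  # assumed path

# "usage" is an assumed placeholder for the column flagging the leaderboard split.
public = test[test["usage"].str.lower() == "public"]
private = test[test["usage"].str.lower() == "private"]
print(len(public), len(private), len(test))  # the two subsets should sum to 4,481
```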
Acoustic mismatch
As mentioned before, FSDKaggle2019 uses audio clips from two sources:
FSD: curated train set and test set, and
YFCC: noisy train set.
While the sources of audio (Freesound and Flickr) are collaboratively contributed and pretty diverse themselves, a certain acoustic mismatch can be expected between FSD and YFCC. We conjecture this mismatch comes from a variety of reasons. For example, through acoustic inspection of a small sample of both data sources, we find a higher percentage of high quality recordings in FSD. In addition, audio clips in Freesound are typically recorded with the purpose of capturing audio, which is not necessarily the case in YFCC.
This mismatch can have an impact in the evaluation, considering that most of the train data come from YFCC, while all test data are drawn from FSD. This constraint (i.e., noisy training data coming from a different web audio source than the test set) is sometimes a real-world condition.
LICENSE
All clips in FSDKaggle2019 are released under Creative Commons (CC) licenses. For attribution purposes and to facilitate attribution of these files to third parties, we include a mapping from the audio clips to their corresponding licenses.
Curated train set and test set. All clips in Freesound are released under different modalities of Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound, some of them requiring attribution to their original authors and some forbidding further commercial reuse. The licenses are specified in the files train_curated_post_competition.csv and test_post_competition.csv. These licenses can be CC0, CC-BY, CC-BY-NC and CC Sampling+.
Noisy train set. Similarly, the licenses of the soundtracks from Flickr used in FSDKaggle2019 are specified in the file train_noisy_post_competition.csv. These licenses can be CC-BY and CC BY-SA.
In addition, FSDKaggle2019 as a whole is the result of a curation process and it has an additional license. FSDKaggle2019 is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSDKaggle2019.doc zip file.
FILES & DOWNLOAD
FSDKaggle2019 can be downloaded as a series of zip files with the following directory structure:
root
│
└───FSDKaggle2019.audio_train_curated/ Audio clips in the curated train set
│
└───FSDKaggle2019.audio_train_noisy/ Audio clips in the noisy train set
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The AudioSet dataset is a large-scale collection of human-labeled 10-second sound clips drawn from YouTube videos. AudioSet is brought to you by the Sound and Video Understanding teams pursuing Machine Perception research at Google. The official AudioSet site is located here. The main problem is that AudioSet was not released as audio files but only as YouTube links, which were hard to use. In this dataset you can find extracted raw WAV files for the balanced train, evaluation, and manually created test data.
The dataset consists of the following folders and files:
Around 10% of the data is already unavailable due to the removal of some videos from YouTube.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
FSD50K: An open dataset of human-labeled sound events
This is a mirror of the FSD50K sound event dataset. The original files were converted from WAV to Opus to reduce the size and accelerate streaming.
Sampling rate: 48 kHz
Channels: 1
Format: Opus
Splits:
Dev: 80 hours, 40,966 clips.
Eval: 28 hours, 10,231 clips.
License: FSD50K is released under CC-BY. However, each clip has its own licence. Clip licenses include CC0, CC-BY, CC-BY-NC and CC Sampling+. Clip licenses are specified… See the full description on the dataset page: https://huggingface.co/datasets/philgzl/fsd50k.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Forest environmental sound classification is one use case of ESC that has been widely explored to identify illegal activities inside a forest. With the unavailability of public datasets specific to forest sounds, there is a requirement for a benchmark forest environment sound dataset. With this motivation, FSC22 was created as a public benchmark dataset, using audio samples collected from freesound.org.
This dataset includes 2,025 labeled sound clips of 5 s length. All the audio samples are distributed among six major parent-level classes: Mechanical sounds, Animal sounds, Environmental Sounds, Vehicle Sounds, Forest Threat Sounds, and Human Sounds. Further, each class is divided into subclasses that capture specific sounds falling under the main category. Overall, the dataset taxonomy consists of 34 classes as shown below. For the first phase of the dataset creation, 75 audio samples were collected for each of 27 classes.
We expect that this dataset will help research communities with their research work in the forest acoustic monitoring and classification domain.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset was generated as part of the paper: DCUnet-Based Multi-Model Approach for Universal Sound Separation, K. Arendt, A. Szumaczuk, B. Jasik, K. Piaskowski, P. Masztalski, M. Matuszewski, K. Nowicki, P. Zborowski. It contains various sounds from the Audio Set [1] and spoken utterances from the VCTK [2] and DNS [3] datasets.
Contents:
sr_8k/ mix_clean/ s1/ s2/ s3/ s4/
sr_16k/ mix_clean/ s1/ s2/ s3/ s4/
sr_48k/ mix_clean/ s1/ s2/ s3/ s4/
Each directory contains 512 audio samples in a different sampling rate (sr_8k - 8 kHz, sr_16k - 16 kHz, sr_48k - 48 kHz). The audio samples for each sampling rate are different as they were generated randomly and separately. Each directory contains 5 subdirectories:
- mix_clean - mixed sources,
- s1 - source #1 (general sounds),
- s2 - source #2 (speech),
- s3 - source #3 (traffic sounds),
- s4 - source #4 (wind noise).
The sound mixtures were generated by adding s2, s3, s4 to s1 with SNR ranging from -10 to 10 dB w.r.t. s1.
REFERENCES:
[1] Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter, "Audio set: An ontology and human-labeled dataset for audio events," in Proc. IEEE ICASSP 2017, New Orleans, LA, 2017.
[2] Christophe Veaux, Junichi Yamagishi, and Kirsten MacDonald, "CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit, [sound]," https://doi.org/10.7488/ds/1994, University of Edinburgh, The Centre for Speech Technology Research (CSTR), 2017.
[3] Chandan K. A. Reddy, Ebrahim Beyrami, Harishchandra Dubey, Vishak Gopal, Roger Cheng, Ross Cutler, Sergiy Matusevych, Robert Aichner, Ashkan Aazami, Sebastian Braun, Puneet Rana, Sriram Srinivasan, and Johannes Gehrke, "The Interspeech 2020 deep noise suppression challenge: Datasets, subjective speech quality and testing framework," 2020.
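A minimal sketch of how mixtures like the ones described above could be formed (an illustration under assumed array shapes and an assumed SNR convention, not the authors' exact pipeline): each added source is scaled so that s1 sits at the target SNR relative to it, then everything is summed.

```python
import numpy as np

def scale_to_snr(s1: np.ndarray, source: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `source` so that the SNR of s1 w.r.t. the scaled source equals snr_db.
    Assumes 1-D float arrays of equal length."""
    p1 = np.mean(s1 ** 2)
    ps = np.mean(source ** 2) + 1e-12
    gain = np.sqrt(p1 / (ps * 10 ** (snr_db / 10)))
    return gain * source

rng = np.random.default_rng(0)
s1, s2, s3, s4 = (rng.standard_normal(48000) for _ in range(4))  # dummy 1 s at 48 kHz
snrs = rng.uniform(-10, 10, size=3)                              # SNRs w.r.t. s1
mix_clean = s1 + sum(scale_to_snr(s1, s, snr) for s, snr in zip((s2, s3, s4), snrs))
```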
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Synthetic data for DCASE 2019 task 4
Freesound dataset [1,2]: A subset of FSD is used as foreground sound events for the synthetic subset of the dataset for DCASE 2019 task 4. FSD is a large-scale, general-purpose audio dataset composed of Freesound content annotated with labels from the AudioSet Ontology [3].
SINS dataset [4]: The derivative of the SINS dataset used for DCASE2018 task 5 is used as background for the synthetic subset of the dataset for DCASE 2019 task 4. The SINS dataset contains a continuous recording of one person living in a vacation home over a period of one week. It was collected using a network of 13 microphone arrays distributed over the entire home. The microphone array consists of 4 linearly arranged microphones.
The synthetic set is composed of 10 sec audio clips generated with Scaper [5]. The foreground events are obtained from FSD. Each event audio clip was verified manually to ensure that the sound quality and the event-to-background ratio were sufficient for it to be used as an isolated event. We also verified that the event was actually dominant in the clip and we checked that the event onset and offset were present in the clip. Each selected clip was then segmented when needed to remove silences before and after the event and between events when the file contained multiple occurrences of the event class.
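For illustration, a 10-second clip of this kind could be generated with Scaper roughly as follows (a hedged sketch: the folder paths and distribution parameters are placeholders, not the organizers' actual generation protocol):

```python
import scaper

fg_path, bg_path = "foreground", "background"  # placeholder local folders

sc = scaper.Scaper(duration=10.0, fg_path=fg_path, bg_path=bg_path)
sc.ref_db = -50  # reference level for the background

# One background chunk (e.g. SINS-derived) plus one foreground event from FSD.
sc.add_background(label=("choose", []), source_file=("choose", []),
                  source_time=("const", 0))
sc.add_event(label=("choose", []), source_file=("choose", []),
             source_time=("const", 0),
             event_time=("uniform", 0, 8),
             event_duration=("truncnorm", 2.0, 1.0, 0.5, 4.0),
             snr=("uniform", 6, 30),
             pitch_shift=None, time_stretch=None)

# Writes the audio clip together with a JAMS annotation describing the mixture.
sc.generate("soundscape.wav", "soundscape.jams")
```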
License:
All sounds coming from FSD are released under Creative Commons licences. Synthetic sounds can only be used for competition purposes until the full CC license list is made available at the end of the competition.
Further information on dcase website.
References:
[1] F. Font, G. Roma & X. Serra. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia. ACM, 2013.
[2] E. Fonseca, J. Pons, X. Favory, F. Font, D. Bogdanov, A. Ferraro, S. Oramas, A. Porter & X. Serra. Freesound Datasets: A Platform for the Creation of Open Audio Datasets. In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017.
[3] Jort F. Gemmeke and Daniel P. W. Ellis and Dylan Freedman and Aren Jansen and Wade Lawrence and R. Channing Moore and Manoj Plakal and Marvin Ritter. Audio Set: An ontology and human-labeled dataset for audio events. In Proceedings IEEE ICASSP 2017, New Orleans, LA, 2017.
[4] Gert Dekkers, Steven Lauwereins, Bart Thoen, Mulu Weldegebreal Adhana, Henk Brouckxon, Toon van Waterschoot, Bart Vanrumste, Marian Verhelst, and Peter Karsmakers. The SINS database for detection of daily activities in a home environment using an acoustic sensor network. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), 32–36. November 2017.
[5] J. Salamon, D. MacConnell, M. Cartwright, P. Li, and J. P. Bello. Scaper: A library for soundscape synthesis and augmentation. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, Oct. 2017.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for audioset2022
Dataset Summary
The AudioSet ontology is a collection of sound events organized in a hierarchy. The ontology covers a wide range of everyday sounds, from human and animal sounds, to natural and environmental sounds, to musical and miscellaneous sounds. This repository only includes audio files for DCASE 2022 - Task 3 The included labels are limited to:
Female speech, woman speaking; Male speech, man speaking; Clapping; Telephone; Telephone bell… See the full description on the dataset page: https://huggingface.co/datasets/Fhrozen/AudioSet2K22.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ARCA23K is a dataset of labelled sound events created to investigate real-world label noise. It contains 23,727 audio clips originating from Freesound, and each clip belongs to one of 70 classes taken from the AudioSet ontology. The dataset was created using an entirely automated process with no manual verification of the data. For this reason, many clips are expected to be labelled incorrectly.
In addition to ARCA23K, this release includes a companion dataset called ARCA23K-FSD, which is a single-label subset of the FSD50K dataset. ARCA23K-FSD contains the same sound classes as ARCA23K and the same number of audio clips per class. As it is a subset of FSD50K, each clip and its label have been manually verified. Note that only the ground truth data of ARCA23K-FSD is distributed in this release. To download the audio clips, please visit the Zenodo page for FSD50K.
A paper has been published detailing how the dataset was constructed. See the Citing section below.
The source code used to create the datasets is available: https://github.com/tqbl/arca23k-dataset
Characteristics
Sound Classes
The list of sound classes is given below. They are grouped based on the top-level superclasses of the AudioSet ontology.
Music
Sounds of things
Natural sounds
Human sounds
Animal
Source-ambiguous sounds
License and Attribution
This release is licensed under the Creative Commons Attribution 4.0 International License.
The audio clips distributed as part of ARCA23K were sourced from Freesound and have their own Creative Commons license. The license information and attribution for each audio clip can be found in ARCA23K.metadata/train.json, which also includes the original Freesound URLs.
The files under ARCA23K-FSD.ground_truth/ are an adaptation of the ground truth data provided as part of FSD50K, which is licensed under the Creative Commons Attribution 4.0 International License. The curators of FSD50K are Eduardo Fonseca, Xavier Favory, Jordi Pons, Mercedes Collado, Ceren Can, Rachit Gupta, Javier Arredondo, Gary Avendano, and Sara Fernandez.
Citing
If you wish to cite this work, please cite the following paper:
T. Iqbal, Y. Cao, A. Bailey, M. D. Plumbley, and W. Wang, “ARCA23K: An audio dataset for investigating open-set label noise”, in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021), 2021, Barcelona, Spain, pp. 201–205.
BibTeX:
@inproceedings{Iqbal2021, author = {Iqbal, T. and Cao, Y. and Bailey, A. and Plumbley, M. D. and Wang, W.}, title = {{ARCA23K}: An audio dataset for investigating open-set label noise}, booktitle = {Proceedings of the Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021)}, pages = {201--205}, year = {2021}, address = {Barcelona, Spain}, }
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Created by
Yu Wang, Nicholas J. Bryan, Justin Salamon, Mark Cartwright, and Juan Pablo Bello
Publication
If using this data in academic work, please cite the following paper, which presented this dataset:
Y. Wang, N. J. Bryan, J. Salamon, M. Cartwright, and J. P. Bello. "Who calls the shots? Rethinking Few-shot Learning for Audio", IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2021
Description
FSD-MIX-CLIPS is an open dataset of programmatically mixed audio clips with a controlled level of polyphony and signal-to-noise ratio. We use single-labeled clips from FSD50K as the source material for the foreground sound events and Brownian noise as the background to generate 281,039 10-second strongly-labeled soundscapes with Scaper. We refer to this (intermediate) dataset of 10s soundscapes as FSD-MIX-SED. Each soundscape contains n events from n different sound classes, where n ranges from 1 to 5. We then extract 614,533 1s clips centered on each sound event in the soundscapes in FSD-MIX-SED to produce FSD-MIX-CLIPS.
Source material and annotations
Due to the large size of the dataset, instead of releasing the raw audio files, we release the source material and soundscape annotations in JAMS format, which can be used to reproduce FSD-MIX-SED using Scaper with the script in the project repository.
All clips in FSD-MIX-CLIPS are extracted from FSD-MIX-SED. Therefore, for FSD-MIX-CLIPS, instead of releasing duplicated audio content, we provide annotations that specify the filename in FSD-MIX-SED and the corresponding starting time (in second) of each 1-second clip.
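For example, a single soundscape could be regenerated from its JAMS annotation with Scaper along these lines (a hedged sketch: the file paths are placeholders and the script in the project repository should be preferred for a faithful reproduction):

```python
import scaper

# Regenerate one FSD-MIX-SED soundscape from its released JAMS annotation,
# pointing Scaper at local copies of the released source material.
scaper.generate_from_jams(
    jams_infile="FSD_MIX_SED.annotations/base/train/example.jams",  # placeholder name
    audio_outfile="example.wav",
    fg_path="FSD_MIX_SED.source/foreground",  # assumed layout of the source archive
    bg_path="FSD_MIX_SED.source/background",
)
```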
Foreground material from FSD50K
We choose clips shorter than 4s that have a single validated label with the Present and Predominant annotation type. We further trim the silence at the edges of each clip. The resulting subset contains clips each with a single and strong label. The 200 sound classes in FSD50K are hierarchically organized. We focus on the leaf nodes and rule out classes with fewer than 20 single-labeled clips. This gives us 89 sound classes. vocab.json contains the list of 89 classes; each class is then labeled by its index in the list.
Data splits
FSD-MIX-CLIPS was originally generated for the task of multi-label audio classification under a few-shot continual learning setup. Therefore, the classes are split into disjoint sets of base and novel classes, where novel-class data are only used at inference time. We partition the 89 classes into three splits: base, novel-val, and novel-test with 59, 15, and 15 classes, respectively. Base class data are used for both training and evaluation, while novel-val/novel-test class data are used for validation/test only.
Files
FSD_MIX_SED.source.tar.gz contains the background Brownian noise and 10,296 single-labeled sound events from FSD50K in .wav format. The original file size is 1.9GB.
FSD_MIX_SED.annotations.tar.gz contains 281,039 JAMS files. The original file size is 35GB.
FSD_MIX_CLIPS.annotations.tar.gz contains ground truth labels for 1-second clips in each data split in FSD_MIX_SED, specified by filename and starting time (sec).
vocab.json contains the 89 classes.
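As an illustration, a 1-second FSD-MIX-CLIPS example could then be cut from a regenerated FSD-MIX-SED soundscape given one annotation entry (a sketch: the file name and start time are placeholders for whatever the annotation files specify; soundfile is an assumed dependency):

```python
import soundfile as sf

def extract_clip(soundscape_path: str, start_sec: float, duration: float = 1.0):
    """Load a `duration`-second excerpt of a soundscape starting at `start_sec`."""
    with sf.SoundFile(soundscape_path) as f:
        f.seek(int(start_sec * f.samplerate))
        return f.read(int(duration * f.samplerate)), f.samplerate

clip, sr = extract_clip("example.wav", start_sec=3.5)  # placeholder values
```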
Foreground sound materials and soundscape annotations in FSD_MIX_SED are organized in a similar folder structure following the data splits:
root folder
│
└───base/ Base classes (label 0-58)
│ │
│ └─── train/
│ │ │
│ │ └─── audio or annotation files
│ │
│ └─── val/
│ │ │
│ │ └─── audio or annotation files
│ │
│ └─── test/
│ │
│ └─── audio or annotation files
│
│
└───val/ Novel-val classes (label 59-73)
│ │
│ └─── audio or annotation files
│
│
└───test/ Novel-test classes (label 74-88)
│
└─── audio or annotation files
References
[1] Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, Xavier Serra. "FSD50K: an Open Dataset of Human-Labeled Sound Events", arXiv:2010.00475, 2020.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
DCASE2018 Task 4 is a dataset for large-scale weakly labeled semi-supervised sound event detection in domestic environments. The data are YouTube video excerpts focusing on domestic context, which could be used for example in ambient assisted living applications. The domain was chosen due to the scientific challenges (wide variety of sounds, time-localized events...) and potential industrial applications. Specifically, the task employs a subset of "Audioset: An Ontology And Human-Labeled Dataset For Audio Events" by Google. AudioSet consists of an expanding ontology of 632 sound event classes and a collection of 2 million human-labeled 10-second sound clips (less than 21% are shorter than 10 seconds) drawn from 2 million YouTube videos. The ontology is specified as a hierarchical graph of event categories, covering a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds. Task 4 focuses on a subset of AudioSet that consists of 10 classes of sound events: speech, dog, cat, alarm bell ringing, dishes, frying, blender, running water, vacuum cleaner, and electric shaver/toothbrush.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Best performances up to statistical significance achieved using semi-supervised active learning (SSAL), active learning (AL), and passive learning (PL) in pool-based and stream-based scenarios, as well as the number of human-labeled instances (#HLI) needed to achieve that performance.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Created by
Yu Wang, Mark Cartwright, and Juan Pablo Bello
Publication
If using this data in academic work, please cite the following paper, which presented this dataset:
Y. Wang, M. Cartwright, and J. P. Bello. "Active Few-Shot Learning for Sound Event Detection", INTERSPEECH, 2022
Description
SONYC-FSD-SED is an open dataset of programmatically mixed audio clips that simulates audio data in an environmental sound monitoring system, where sound class occurrences and co-occurrences exhibit seasonal periodic patterns. We use recordings collected from the Sounds of New York City (SONYC) acoustic sensor network as backgrounds, and single-labeled clips in the FSD50K dataset as foreground events, to generate 576,591 10-second strongly-labeled soundscapes with Scaper (including 111,294 additional test data for the sampling window experiment). Instead of sampling foreground sound events uniformly, we simulate the occurrence probability of each class at different times in a year, creating more realistic temporal characteristics.
Source material and annotations
Due to the large size of the dataset, instead of releasing the raw audio files, we release the source material and soundscape annotations in JAMS format, which can be used to reproduce SONYC-FSD-SED using Scaper with the script in the project repository.
Background material from SONYC recordings
We pick a sensor from the SONYC sensor network and subsample from recordings it collected within a year (2017). We categorize these ∼550k 10-second clips into 96 bins based on timestamps, where each bin represents a unique combination of the month of a year, day of a week (weekday or weekend), and time of a day (divided into four 6-hour blocks). Next, we run a pre-trained urban sound event classifier over all recordings and filter out clips with active sound classes. We do not filter out footstep and bird since they appear too frequently, instead, we remove these two classes from the foreground sound material. Then from each bin, we choose the clip with the lowest sound pressure level, yielding 96 background clips.
Foreground material from FSD50K
We follow the same filtering process as in FSD-MIX-SED to get the subset of FSD50K with short single-labeled clips. In addition, we remove two classes, "Chirp_and_tweet" and "Walk_and_footsteps", that exist in our SONYC background recordings. This results in 87 sound classes. vocab.json contains the list of 87 classes, each class is then labeled by its index in the list. 0-42: train, 43-56: val, 57-86: test.
Occurrence probability modelling
For each class, we model its occurrence probability within a year. We use von Mises probability density functions to simulate the probability distribution over different weeks in a year and hours in a day, considering their cyclic characteristics: $f(x \mid \mu, \kappa) = \frac{e^{\kappa \cos(x-\mu)}}{2\pi I_0(\kappa)}$, where $I_0(\kappa)$ is the modified Bessel function of order 0, and $\mu$ and $1/\kappa$ are analogous to the mean and variance in the normal distribution. We randomly sample $(\mu_{year}, \mu_{day})$ from $[-\pi, \pi]$ and $(\kappa_{year}, \kappa_{day})$ from $[0, 10]$. We also randomly assign $p_{weekday} \in [0, 1]$ and $p_{weekend} = 1 - p_{weekday}$ to simulate the probability distribution over different days in a week. Finally, we get the probability distribution over the entire year with a 1-hour resolution. At a given timestamp, we integrate $f_{year}$ and $f_{day}$ over the 1-hour window and multiply them together with $p_{weekday}$ or $p_{weekend}$, depending on the day. To speed up the subsequent sampling process, we scale the final probability distribution using a temperature parameter randomly sampled from $[2, 3]$.
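A small sketch of this kind of occurrence-probability model (an illustration with randomly drawn parameters, following the description above rather than the authors' released code):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import vonmises

rng = np.random.default_rng(0)
mu_year, mu_day = rng.uniform(-np.pi, np.pi, size=2)
kappa_year, kappa_day = rng.uniform(0, 10, size=2)
p_weekday = rng.uniform(0, 1)
p_weekend = 1 - p_weekday

def occurrence_prob(week_of_year: int, hour_of_day: int, is_weekday: bool) -> float:
    """Integrate the yearly and daily von Mises densities over a 1-hour window
    and combine them with the weekday/weekend probability (illustrative only)."""
    theta_year = 2 * np.pi * week_of_year / 52 - np.pi   # calendar position as an angle
    theta_day = 2 * np.pi * hour_of_day / 24 - np.pi
    dy = 2 * np.pi / (52 * 7 * 24)                       # one hour of the year, in radians
    dd = 2 * np.pi / 24                                  # one hour of the day, in radians
    f_year = quad(vonmises(kappa_year, loc=mu_year).pdf, theta_year, theta_year + dy)[0]
    f_day = quad(vonmises(kappa_day, loc=mu_day).pdf, theta_day, theta_day + dd)[0]
    return f_year * f_day * (p_weekday if is_weekday else p_weekend)

print(occurrence_prob(week_of_year=26, hour_of_day=14, is_weekday=True))
```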
Files
SONYC_FSD_SED.source.tar.gz: 96 SONYC backgrounds and 10,158 foreground sounds in .wav format. The original file size is 2GB.
SONYC_FSD_SED.annotations.tar.gz: 465,467 JAMS files. The original file size is 57GB.
SONYC_FSD_SED_add_test.annotations.tar.gz: 111,294 JAMS files for additional test data. The original file size is 14GB.
vocab.json: 87 classes.
occ_prob_per_cl.pkl: Occurrence probability for each foreground sound class.
References
[1] J. P. Bello, C. T. Silva, O. Nov, R. L. DuBois, A. Arora, J. Salamon, C. Mydlarz, and H. Doraiswamy, “SONYC: A system for monitoring, analyzing, and mitigating urban noise pollution,” Commun. ACM, 2019
[2] E. Fonseca, X. Favory, J. Pons, F. Font, X. Serra. "FSD50K: an Open Dataset of Human-Labeled Sound Events", arXiv:2010.00475, 2020.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset consists of 7,014 files delivered as 32 kHz, mono audio files in .wav format and divided into train and test sets. The train set consists of 6,289 files and the test set of 725 files. The files were strongly and manually annotated with a single ground-truth label. The length of each file is from 500 milliseconds to 4 seconds.
The dataset is only allowed for non-commercial and academic research purposes under the Creative Commons (CC BY-NC-SA 4.0) license. If you use the dataset, please cite our paper and acknowledge the sources (freesound.org, YouTube, and Aigei). More details about the Nonspeech7k dataset are available in our article.
Article title: "Nonspeech7k dataset: Classification and analysis of human nonspeech sound"
Noise Complaint Dataset: Acoustic Source Detection
Silencio offers geolocated and categorized noise complaints collected directly through our mobile app. This unique dataset includes not only the location and time of each complaint but also the source of the noise (e.g., traffic, construction, nightlife, neighbors), making it a rare resource for research and monitoring focused on acoustic event classification, noise source identification, and urban sound analysis.
Unlike standard sound datasets, which often lack real-world context or human-labeled sources, Silencio’s dataset is built entirely from user-submitted reports, providing authentic, ground-truth labels for research. It is ideal for training models in sound recognition, urban noise prediction, acoustic scene analysis, and noise impact assessment.
Combined with Silencio’s Street Noise-Level Dataset, this complaint dataset allows researchers to correlate objective measurements with subjective community-reported noise events, opening up possibilities for multi-modal AI models that link noise intensity with human perception.
Data delivery options include: • CSV exports • S3 bucket delivery • (Upcoming) API access
All data is fully anonymized, GDPR-compliant, and available as both historical and updated datasets. We are open to early-access partnerships and custom formatting to meet AI research needs.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the first Arabic Natural Audio Dataset (ANAD), developed to recognize 3 discrete emotions: happy, angry, and surprised.
Eight videos of live calls between an anchor and a human outside the studio were downloaded from online Arabic talk shows. Each video was then divided into turns: callers and receivers. To label each video, 18 listeners were asked to listen to each video and select whether they perceive a happy, angry or surprised emotion. Silence, laughs and noisy chunks were removed. Every chunk was then automatically divided into 1 sec speech units forming our final corpus composed of 1384 records.
Twenty-five acoustic features, also known as low-level descriptors (LLDs), were extracted. These features are: intensity, zero crossing rate, MFCC 1-12 (Mel-frequency cepstral coefficients), F0 (fundamental frequency) and F0 envelope, probability of voicing, and LSP frequencies 0-7. Nineteen statistical functions were applied to every feature: maximum, minimum, range, absolute position of maximum, absolute position of minimum, arithmetic mean, linear regression 1, linear regression 2, linear regression A, linear regression Q, standard deviation, kurtosis, skewness, quartiles 1, 2, 3, and inter-quartile ranges 1-2, 2-3, 1-3. The delta coefficient of every LLD is also computed as an estimate of the first derivative, leading to a total of (25 + 25) x 19 = 950 features.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
AudioSet is an audio event dataset consisting of over 2M human-annotated 10-second video clips. These clips are collected from YouTube, and many of them are therefore of poor quality and contain multiple sound sources. A hierarchical ontology of 632 event classes is employed to annotate these data, which means that the same sound could be annotated with different labels. For example, the sound of barking is annotated as Animal, Pets, and Dog. All the videos are split into Evaluation/Balanced-Train/Unbalanced-Train sets.