Clotho is a novel audio captioning dataset, consisting of 4981 audio samples, and each audio sample has five captions (a total of 24 905 captions). Audio samples are of 15 to 30 s duration and captions are eight to 20 words long. Clotho is thoroughly described in our paper:

K. Drossos, S. Lipping and T. Virtanen, "Clotho: an Audio Captioning Dataset," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 736-740, doi: 10.1109/ICASSP40776.2020.9052990.

available online at: https://arxiv.org/abs/1910.09387 and at: https://ieeexplore.ieee.org/document/9052990

If you use Clotho, please cite our paper. To use the dataset, you can use our code at: https://github.com/audio-captioning/clotho-dataset

These are the files for the development and evaluation splits of the Clotho dataset.

== Usage ==

To use the dataset you have to:

Download the audio files: clotho_audio_development.7z and clotho_audio_evaluation.7z
Download the files with the captions: clotho_captions_development.csv and clotho_captions_evaluation.csv
Download the files with the associated metadata: clotho_metadata_development.csv and clotho_metadata_evaluation.csv
Extract the audio files

Then you can use each audio file with its corresponding captions (a minimal pairing sketch in Python is given after the References section).

== License ==

The audio files in the archives:

clotho_audio_development.7z
clotho_audio_evaluation.7z

and the associated meta-data in the CSV files:

clotho_metadata_development.csv
clotho_metadata_evaluation.csv

are under the corresponding licences (mostly CreativeCommons with attribution) of the Freesound [1] platform, mentioned explicitly in the CSV files for each of the audio files. That is, each audio file in the 7z archives is listed in the CSV files with the meta-data. The meta-data for each file are:

File name
Keywords
URL for the original audio file
Start and ending samples for the excerpt that is used in the Clotho dataset
Uploader/user in the Freesound platform (manufacturer)
Link to the licence of the file

The captions in the files:

clotho_captions_development.csv
clotho_captions_evaluation.csv

are under the Tampere University licence, described in the LICENCE file (mainly a non-commercial with attribution licence).

== References ==

[1] Frederic Font, Gerard Roma, and Xavier Serra. 2013. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia (MM '13). ACM, New York, NY, USA, 411-412. DOI: https://doi.org/10.1145/2502081.2502245
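For illustration, here is a minimal Python sketch of the pairing step, assuming the caption CSVs follow the layout described in the paper (a file_name column followed by five caption columns) and that the audio archive has been extracted locally; the directory and column names are assumptions to check against the actual files.

import csv
from pathlib import Path

# Assumed locations; adjust to wherever you extracted the 7z archive.
AUDIO_DIR = Path("clotho_audio_development")
CAPTIONS_CSV = Path("clotho_captions_development.csv")

def load_clotho_pairs(captions_csv, audio_dir):
    """Yield (audio_path, [caption_1, ..., caption_5]) pairs."""
    with open(captions_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            audio_path = audio_dir / row["file_name"]               # assumed column name
            captions = [row[f"caption_{i}"] for i in range(1, 6)]   # assumed column names
            if audio_path.exists():
                yield audio_path, captions

for audio_path, captions in load_clotho_pairs(CAPTIONS_CSV, AUDIO_DIR):
    print(audio_path.name, captions[0])
    break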
The Flickr 8k Audio Caption Corpus contains 40,000 spoken captions of 8,000 natural images. It was collected in 2015 to investigate multimodal learning schemes for unsupervised speech pattern discovery. For a description of the corpus, see:
D. Harwath and J. Glass, "Deep Multimodal Semantic Embeddings for Speech and Images," 2015 IEEE Automatic Speech Recognition and Understanding Workshop, pp. 237-244, Scottsdale, Arizona, USA, December 2015
MIT License: https://opensource.org/licenses/MIT
Dataset Card for Music-Audio-Pseudo Captions
Pseudo music and audio captions from LP-MusicCaps, Music Negation/Temporal Ordering, and WavCaps
Dataset Summary
Compared to other domains, well-written web caption data is hard to obtain for music and audio, and caption annotation is expensive. Therefore, we take the music (LP-MusicCaps, Music Negation/Temporal Ordering) and audio (WavCaps) datasets created with ChatGPT and re-organize them in the form of instructions, input… See the full description on the dataset page: https://huggingface.co/datasets/seungheondoh/music-audio-pseudo-captions.
This is the evaluation split for Task 6, Automated Audio Captioning, in DCASE 2020 Challenge.
This evaluation split is the Clotho testing split, which is thoroughly described in the corresponding paper:
K. Drossos, S. Lipping and T. Virtanen, "Clotho: an Audio Captioning Dataset," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 736-740, doi: 10.1109/ICASSP40776.2020.9052990.
available online at: https://arxiv.org/abs/1910.09387 and at: https://ieeexplore.ieee.org/document/9052990
This evaluation split is meant to be used for the purposes of Task 6 of the DCASE 2020 challenge. This split is not meant to be used for developing audio captioning methods. For developing audio captioning methods, you should use the development and evaluation splits of Clotho.
If you want the development and evaluation splits of the Clotho dataset, you can also find them on Zenodo, at: https://zenodo.org/record/3490684
== License ==
The audio files in the archives:
clotho_audio_test.7z
and the associated meta-data in the CSV file:
clotho_metadata_test.csv
are under the corresponding licences (mostly CreativeCommons with attribution) of Freesound [1] platform, mentioned explicitly in the CSV file for each of the audio files. That is, each audio file in the 7z archive is listed in the CSV file with the meta-data. The meta-data for each file are:
File name
Start and ending samples for the excerpt that is used in the Clotho dataset
Uploader/user in the Freesound platform (manufacturer)
Link to the licence of the file
== References ==

[1] Frederic Font, Gerard Roma, and Xavier Serra. 2013. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia (MM '13). ACM, New York, NY, USA, 411-412. DOI: https://doi.org/10.1145/2502081.2502245
This dataset consists of the Hospital and Car scenes of our Audio Caption dataset. The Hospital scene is detailed in our paper "Audio Caption: Listen and Tell", published at ICASSP 2019, and the Car scene in "Audio Caption in a Car Setting with a Sentence-Level Loss", published at ISCSLP 2021. The original captions are in Mandarin Chinese, with English translations provided.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This dataset is derived from the evaluation subset of the Clotho dataset. It is designed to analyze the behavior of captioning systems under certain perturbations, in order to try to identify some open challenges in automated audio captioning. The original audio clips are transformed with audio_degrader. The transformations applied are the following (a sketch of mixing at a fixed SNR is given after this list):
Microphone response simulation
Mixup with another clip from the dataset (ratio -6dB, -3dB and 0dB)
Additive noise from DESED (ratio -12dB, -6dB, 0dB)
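For illustration, here is a minimal, hedged sketch of how additive mixing at a fixed signal-to-noise ratio could be reproduced with numpy and soundfile; this is not the exact audio_degrader pipeline used for the release, it assumes mono clips at the same sample rate, and the file names are placeholders.

import numpy as np
import soundfile as sf

def mix_at_snr(clean, noise, snr_db):
    """Mix `noise` into `clean` so the clean-to-noise power ratio equals `snr_db` (mono signals)."""
    # Loop or trim the noise to the length of the clean signal.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]

    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that clean/noise power matches the requested SNR.
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    mixed = clean + scale * noise
    # Avoid clipping when writing back to a fixed-point format.
    peak = np.max(np.abs(mixed))
    return mixed / peak if peak > 1.0 else mixed

# Placeholder file names; any two mono clips with matching sample rates will do.
clean, sr = sf.read("clotho_clip.wav")
noise, _ = sf.read("other_clip.wav")
sf.write("clotho_clip_mix_0dB.wav", mix_at_snr(clean, noise, 0.0), sr)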
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
WavCaps
WavCaps is a ChatGPT-assisted weakly-labelled audio captioning dataset for audio-language multimodal research, where the audio clips are sourced from three websites (FreeSound, BBC Sound Effects, and SoundBible) and a sound event detection dataset (AudioSet Strongly-labelled Subset).
Paper: https://arxiv.org/abs/2303.17395 Github: https://github.com/XinhaoMei/WavCaps
Statistics
For each data source (FreeSound, BBC Sound Effects, SoundBible, AudioSet strongly-labelled subset), the statistics table lists the average audio duration (s) and the average text length… See the full description on the dataset page: https://huggingface.co/datasets/cvssp/WavCaps.
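If the files are needed locally, one generic way to fetch them is the Hugging Face Hub snapshot API; this is only a sketch, and whether you want the whole snapshot (the audio archives are large) or only selected files is a choice to make against the dataset page.

from huggingface_hub import snapshot_download

# Download the dataset repository contents into the local Hugging Face cache.
# Fetching everything can be large; check the dataset page for the files you actually need.
local_dir = snapshot_download(repo_id="cvssp/WavCaps", repo_type="dataset")
print("WavCaps files downloaded to:", local_dir)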
This is a dataset containing audio captions and corresponding audio tags for 3930 audio files of the TAU Urban Acoustic Scenes 2019 development dataset (airport, public square, and park). The files were annotated using a web-based tool. Each file is annotated by multiple annotators that provided tags and a one-sentence description of the audio content. The data also includes annotator competence estimated using MACE (Multi-Annotator Competence Estimation).

The annotation procedure, processing and analysis of the data are presented in the following papers:

Irene Martin-Morato, Annamaria Mesaros. What is the ground truth? Reliability of multi-annotator data for audio tagging, 29th European Signal Processing Conference, EUSIPCO 2021

Irene Martin-Morato, Annamaria Mesaros. Diversity and bias in audio captioning datasets, submitted to DCASE 2021 Workshop (to be updated with arxiv link)

Data is provided as two files (a Python loading sketch follows this format description):

MACS.yaml - containing the complete annotations in the following format:

- filename: file1.wav
  annotations:
    - annotator_id: ann_1
      sentence: caption text
      tags:
        - tag1
        - tag2
    - annotator_id: ann_2
      sentence: caption text
      tags:
        - tag1

MACS_competence.csv - containing the estimated annotator competence; for each annotator_id in the yaml file, competence is a number between 0 (considered as annotating at random) and 1:

id [tab] competence

The audio files can be downloaded from https://zenodo.org/record/2589280 and are covered by their own license.
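A minimal sketch of reading both files, assuming PyYAML is installed, the files sit in the working directory, and the field names follow the format shown above; the top-level YAML key and the presence of a CSV header row are assumptions.

import csv
import yaml  # PyYAML

# Load the complete annotations (a list of {filename, annotations: [...]} entries;
# if the file wraps them under a top-level key such as "files", unwrap it first).
with open("MACS.yaml", encoding="utf-8") as f:
    data = yaml.safe_load(f)
files = data["files"] if isinstance(data, dict) else data

# Load the estimated annotator competence (tab-separated "id" and "competence" columns;
# the presence of a header row is an assumption to verify against the actual file).
with open("MACS_competence.csv", newline="", encoding="utf-8") as f:
    competence = {row["id"]: float(row["competence"])
                  for row in csv.DictReader(f, delimiter="\t")}

# Example: list all captions for the first annotated file, with annotator competence.
entry = files[0]
for ann in entry["annotations"]:
    print(entry["filename"], ann["annotator_id"],
          competence.get(ann["annotator_id"]), ann["sentence"])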
Automated audio captioning is a cross-modal translation task for describing the content of audio clips with natural language sentences.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
Example caption: "A retro-futurist drum machine groove drenched in bubbly synthetic sound effects and a hint of an acid bassline."
The Song Describer Dataset (SDD) contains ~1.1k captions for 706 permissively licensed music recordings. It is designed for use in evaluation of models that address music-and-language (M&L) tasks such as music captioning, text-to-music generation and music-language retrieval. More information about the data, collection method and validation is provided in the paper describing the dataset.
If you use this dataset, please cite our paper:
The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation, Manco, Ilaria and Weck, Benno and Doh, Seungheon and Won, Minz and Zhang, Yixiao and Bogdanov, Dmitry and Wu, Yusong and Chen, Ke and Tovstogan, Philip and Benetos, Emmanouil and Quinton, Elio and Fazekas, György and Nam, Juhan, Machine Learning for Audio Workshop at NeurIPS 2023, 2023
@article{mei2024wavcaps,
  title={Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research},
  author={Mei, Xinhao and Meng, Chutong and Liu, Haohe and Kong, Qiuqiang and Ko, Tom and Zhao, Chengqi and Plumbley, Mark D and Zou, Yuexian and Wang, Wenwu},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  year={2024},
  publisher={IEEE}
}
@article{wang2024audiobench, title={AudioBench: A Universal Benchmark for… See the full description on the dataset page: https://huggingface.co/datasets/AudioLLMs/wavcaps_test.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This dataset consists of the Hospital scene of our Audio Caption dataset. Details can be seen in our paper "Audio Caption: Listen and Tell", published at ICASSP 2019.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
SpeechCoco
Introduction
Our corpus is an extension of the MS COCO image recognition and captioning dataset. MS COCO comprises images paired with a set of five captions. Yet, it does not include any speech. Therefore, we used Voxygen's text-to-speech system to synthesise the available captions.
The addition of speech as a new modality enables MS COCO to be used for research in the fields of language acquisition, unsupervised term discovery, keyword spotting, and semantic embedding using speech and vision.
Our corpus is licensed under a Creative Commons Attribution 4.0 License.
Data Set
This corpus contains 616,767 spoken captions from MSCOCO's val2014 and train2014 subsets (respectively 414,113 for train2014 and 202,654 for val2014).
We used 8 different voices. 4 of them have a British accent (Paul, Bronwen, Judith, and Elizabeth) and the 4 others have an American accent (Phil, Bruce, Amanda, Jenny).
In order to make the captions sound more natural, we used SoX's tempo command, which changes the speed without changing the pitch. One third of the captions are 10% slower than the original pace, one third are 10% faster, and the last third was kept untouched.
We also modified approximately 30% of the original captions and added disfluencies such as "um", "uh", "er" so that the captions would sound more natural.
Each WAV file is paired with a JSON file containing various information: timecode of each word in the caption, name of the speaker, name of the WAV file, etc. The JSON files have the following data structure:
{
"duration": float,
"speaker": string,
"synthesisedCaption": string,
"timecode": list,
"speed": float,
"wavFilename": string,
"captionID": int,
"imgID": int,
"disfluency": list
}
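For instance, a minimal sketch of reading one of these JSON files with the standard library; the file name is a placeholder built from the naming convention described below.

import json

# Placeholder name following the imageID_captionID_Speaker_DisfluencyPosition_Speed convention.
with open("298817_26763_Phil_None_0-9.json", encoding="utf-8") as f:
    meta = json.load(f)

print(meta["speaker"], meta["wavFilename"], meta["duration"])
print(meta["synthesisedCaption"])
for disfluency in meta["disfluency"]:
    print("disfluency:", disfluency)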
On average, each caption comprises 10.79 tokens, disfluencies included. The WAV files are on average 3.52 seconds long.
Repository
The repository is organized as follows:
CORPUS-MSCOCO (~75GB once decompressed)
    train2014/ : folder contains 413,915 captions
        json/
        wav/
        translations/
            train_en_ja.txt
            train_translate.sqlite3
        train_2014.sqlite3
    val2014/ : folder contains 202,520 captions
        json/
        wav/
        translations/
            train_en_ja.txt
            train_translate.sqlite3
        val_2014.sqlite3
    speechcoco_API/
        speechcoco/
            __init__.py
            speechcoco.py
        setup.py
Filenames
.wav files contain the spoken version of a caption
.json files contain all the metadata of a given WAV file
.sqlite3 files are SQLite databases containing all the information contained in the JSON files
We adopted the following naming convention for both the WAV and JSON files:
imageID_captionID_Speaker_DisfluencyPosition_Speed[.wav/.json]
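As a small illustration, the convention can be parsed with a few lines of Python (purely a sketch; the example name is taken from the listing further below):

from pathlib import Path

def parse_speechcoco_name(path):
    """Split imageID_captionID_Speaker_DisfluencyPosition_Speed into a dict."""
    img_id, caption_id, speaker, disfluency, speed = Path(path).stem.split("_")
    return {
        "imgID": int(img_id),
        "captionID": int(caption_id),
        "speaker": speaker,
        "disfluencyPosition": disfluency,         # e.g. None, Middle
        "speed": float(speed.replace("-", ".")),  # "0-9" encodes 0.9
    }

print(parse_speechcoco_name("298817_26763_Phil_None_0-9.wav"))
# {'imgID': 298817, 'captionID': 26763, 'speaker': 'Phil', 'disfluencyPosition': 'None', 'speed': 0.9}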
Script
We created a script called speechcoco.py in order to handle the metadata and allow the user to easily find captions according to specific filters. The script uses the SQLite database files.
Features:
Aggregate all the information in the JSON files into a single SQLite database
Find captions according to specific filters (name, gender and nationality of the speaker, disfluency position, speed, duration, and words in the caption). The script automatically builds the SQLite query. The user can also provide his own SQLite query.
The following Python code returns all the captions spoken by a male with an American accent for which the speed was slowed down by 10% and that contain "keys" at any position
from speechcoco.speechcoco import SpeechCoco  # provided by the speechcoco_API package

# create SpeechCoco object
db = SpeechCoco('train_2014.sqlite3', 'train_translate.sqlite3', verbose=True)

# filter captions (returns Caption objects)
captions = db.filterCaptions(gender='Male', nationality='US', speed=0.9, text='%keys%')
for caption in captions:
    print('\n{}\t{}\t{}\t{}\t{}\t{}\t\t{}'.format(caption.imageID,
                                                  caption.captionID,
                                                  caption.speaker.name,
                                                  caption.speaker.nationality,
                                                  caption.speed,
                                                  caption.filename,
                                                  caption.text))
...
298817 26763 Phil 0.9 298817_26763_Phil_None_0-9.wav A group of turkeys with bushes in the background.
108505 147972 Phil 0.9 108505_147972_Phil_Middle_0-9.wav Person using a, um, slider cell phone with blue backlit keys.
258289 154380 Bruce 0.9 258289_154380_Bruce_None_0-9.wav Some donkeys and sheep are in their green pens .
545312 201303 Phil 0.9 545312_201303_Phil_None_0-9.wav A man walking next to a couple of donkeys.
...
Find all the captions belonging to a specific image
captions = db.getImgCaptions(298817)
for caption in captions:
    print('\n{}'.format(caption.text))
Birds wondering through grassy ground next to bushes.
A flock of turkeys are making their way up a hill.
Um, ah. Two wild turkeys in a field walking around.
Four wild turkeys and some bushes trees and weeds.
A group of turkeys with bushes in the background.
Parse the timecodes and have them structured
input:
...
[1926.3068, "SYL", ""],
[1926.3068, "SEPR", " "],
[1926.3068, "WORD", "white"],
[1926.3068, "PHO", "w"],
[2050.7955, "PHO", "ai"],
[2144.6591, "PHO", "t"],
[2179.3182, "SYL", ""],
[2179.3182, "SEPR", " "]
...
output:
print(caption.timecode.parse())
...
{
'begin': 1926.3068,
'end': 2179.3182,
'syllable': [{'begin': 1926.3068,
'end': 2179.3182,
'phoneme': [{'begin': 1926.3068,
'end': 2050.7955,
'value': 'w'},
{'begin': 2050.7955,
'end': 2144.6591,
'value': 'ai'},
{'begin': 2144.6591,
'end': 2179.3182,
'value': 't'}],
'value': 'wait'}],
'value': 'white'
},
...
Convert the timecodes to Praat TextGrid files
caption.timecode.toTextgrid(outputDir, level=3)
Get the words, syllables and phonemes between n seconds/milliseconds
The following Python code returns all the words between 0.2 and 0.6 seconds for which at least 50% of the word's total length is within the specified interval
pprint(caption.getWords(0.20, 0.60, seconds=True, level=1, olapthr=50))
...
404537 827239 Bruce US 0.9 404537_827239_Bruce_None_0-9.wav Eyeglasses, a cellphone, some keys and other pocket items are all laid out on the cloth. .
[
{
'begin': 0.0,
'end': 0.7202778,
'overlapPercentage': 55.53412863758955,
'word': 'eyeglasses'
}
]
...
Get the translations of the selected captions
As of now, only Japanese translations are available. We also used Kytea to tokenize and POS-tag the captions translated with Google Translate.
captions = db.getImgCaptions(298817)
for caption in captions:
    print('\n{}'.format(caption.text))

    # Get translations and POS
    print('\tja_google: {}'.format(db.getTranslation(caption.captionID, "ja_google")))
    print('\t\tja_google_tokens: {}'.format(db.getTokens(caption.captionID, "ja_google")))
    print('\t\tja_google_pos: {}'.format(db.getPOS(caption.captionID, "ja_google")))
    print('\tja_excite: {}'.format(db.getTranslation(caption.captionID, "ja_excite")))
Birds wondering through grassy ground next to bushes.
ja_google: 鳥は茂みの下に茂った地面を抱えています。
ja_google_tokens: 鳥 は 茂み の 下 に 茂 っ た 地面 を 抱え て い ま す 。
ja_google_pos: 鳥/名詞/とり は/助詞/は 茂み/名詞/しげみ の/助詞/の 下/名詞/した に/助詞/に
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Synthetically Spoken COCO
Version 1.0
This dataset contains synthetically generated spoken versions of MS COCO [1] captions. The dataset was created as part of the research reported in [5].
The speech was generated using gTTS [2]. The dataset consists of the following files:
- dataset.json: Captions associated with MS COCO images. This information comes from [3].
- sentid.txt: List of caption IDs. This file can be used to locate the MFCC features of the MP3 files in the numpy array stored in dataset.mfcc.npy.
- mp3.tgz: MP3 files with the audio. Each file name corresponds to a caption ID in dataset.json and in sentid.txt.
- dataset.mfcc.npy: Numpy array with the Mel Frequency Cepstral Coefficients extracted from the audio. Each row corresponds to a caption. The order of the captions corresponds to the ordering in the file sentid.txt. MFCCs were extracted using [4]. (A loading sketch in Python follows the reference list below.)
[1] http://mscoco.org/dataset/#overview
[2] https://pypi.python.org/pypi/gTTS
[3] https://github.com/karpathy/neuraltalk
[4] https://github.com/jameslyons/python_speech_features
[5] https://arxiv.org/abs/1702.01991
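A minimal sketch of aligning sentid.txt with the rows of dataset.mfcc.npy, assuming the files have been downloaded next to the script; the caption-ID-to-row mapping simply follows the ordering described above.

import numpy as np

# Rows of the MFCC array follow the order of the caption IDs in sentid.txt.
with open("sentid.txt", encoding="utf-8") as f:
    sent_ids = [line.strip() for line in f if line.strip()]

mfcc = np.load("dataset.mfcc.npy")
assert len(sent_ids) == mfcc.shape[0], "expected one MFCC row per caption ID"

# Map caption ID -> MFCC feature row.
features = {sid: mfcc[i] for i, sid in enumerate(sent_ids)}
print(len(features), "captions loaded")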
https://www.marketreportanalytics.com/privacy-policy
The global audio accessibility market is experiencing robust growth, driven by increasing awareness of inclusivity and technological advancements. The market, estimated at $1.5 billion in 2025, is projected to witness a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033, reaching approximately $5 billion by 2033. Key drivers include rising accessibility regulations, the proliferation of streaming services demanding closed captions and audio descriptions, and the increasing availability of cost-effective audio description software and services. The growing visually impaired and blind population globally further fuels market expansion. Significant market segments include online audio description services, preferred for their scalability and reach, and offline services which offer more tailored and customized solutions. Leading market players are continually innovating, incorporating AI-powered solutions for improved accuracy and efficiency in audio description generation. While the market presents significant opportunities, challenges remain. High implementation costs, particularly for offline services, and the need for skilled professionals to create high-quality audio descriptions can hinder widespread adoption, particularly in developing regions. However, the ongoing evolution of speech-to-text and text-to-speech technologies, alongside reductions in the cost of AI-driven tools, are likely to mitigate some of these restraints in the coming years. The market is geographically diverse, with North America and Europe currently dominating, but significant growth potential exists in emerging markets like Asia Pacific and the Middle East & Africa as awareness and accessibility legislation increases. The increasing integration of audio description within mainstream media and entertainment platforms will be a crucial factor in expanding market penetration and accessibility across all user segments.
https://choosealicense.com/licenses/afl-3.0/
Dataset of captioned spectrograms (text describing the sound).
We introduce a new audio dataset called SoundDescs that can be used for tasks such as text-to-audio retrieval and audio captioning. This dataset contains 32,979 pairs of audio files and text descriptions. There are 23 categories in SoundDescs, including but not limited to nature, clocks, and fire.
SoundDescs can be downloaded from here and retrieval results for this dataset can be found in the associated paper Audio Retrieval with Natural Language Queries: A Benchmark Study.
Clotho-AQA is an audio question-answering dataset consisting of 1991 audio samples taken from the Clotho dataset [1]. Each audio sample has 6 associated questions collected through crowdsourcing. For each question, the answers are provided by three different annotators, making a total of 35,838 question-answer pairs. For each audio sample, 4 questions are designed to be answered with 'yes' or 'no', while the remaining two questions are designed to be answered in a single word. More details about the data collection and data splitting processes can be found in the following paper.
S. Lipping, P. Sudarsanam, K. Drossos and T. Virtanen, 'Clotho-AQA: A Crowdsourced Dataset for Audio Question Answering.' The paper is available online at: https://arxiv.org/abs/2204.09634
If you use the Clotho-AQA dataset, please cite the paper mentioned above. A sample baseline model to use the Clotho-AQA dataset can be found at: https://github.com/partha2409/AquaNet
To use the dataset,
• Download and extract ‘audio_files.zip’. This contains all the 1991 audio samples in the dataset.
• Download ‘clotho_aqa_train.csv’, ‘clotho_aqa_val.csv’, and ‘clotho_aqa_test.csv’. These files contain the train, validation, and test splits, respectively. They contain the audio file name, questions, answers, and confidence scores provided by the annotators (a loading sketch in Python is given after this list).
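A minimal sketch of reading one split with pandas; the exact header names in the CSVs are assumptions and should be checked against the actual files.

import pandas as pd

# Assumed file location and column names; verify them against the actual CSV headers.
train = pd.read_csv("clotho_aqa_train.csv")
print(train.columns.tolist())          # expected to cover file name, question, answer, confidence
print(len(train), "question-answer rows in the training split")

# Example: keep only the rows with the highest annotator confidence,
# assuming a 'confidence' column exists.
if "confidence" in train.columns:
    confident = train[train["confidence"] == train["confidence"].max()]
    print(len(confident), "rows with the highest confidence score")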
License:
The audio files in the archive ‘audio_files.zip’ are under the corresponding licenses (mostly CreativeCommons with attribution) of Freesound [2] platform, mentioned explicitly in the CSV file ’clotho_aqa_metadata.csv’ for each of the audio files. That is, each audio file in the archive is listed in the CSV file with meta-data. The meta-data for each file are:
• File name
• Keywords
• URL for the original audio file
• Start and ending samples for the excerpt that is used in the Clotho dataset
• Uploader/user in the Freesound platform (manufacturer)
• Link to the license of the file.
The questions and answers in the files:
• clotho_aqa_train.csv
• clotho_aqa_val.csv
• clotho_aqa_test.csv
are under the MIT license, described in the LICENSE file.
References:
[1] K. Drossos, S. Lipping and T. Virtanen, "Clotho: An Audio Captioning Dataset," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 736- 740, doi: 10.1109/ICASSP40776.2020.9052990.
[2] Frederic Font, Gerard Roma, and Xavier Serra. 2013. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia (MM '13). ACM, New York, NY, USA, 411-412. DOI: https://doi.org/10.1145/2502081.2502245
A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research.
AudioCaps is a dataset of sounds with event descriptions that was introduced for the task of audio captioning, with sounds sourced from the AudioSet dataset. Annotators were provided the audio tracks together with category hints (and with additional video hints if needed).