Clotho is a novel audio captioning dataset consisting of 4981 audio samples, each with five captions (24 905 captions in total). Audio samples are 15 to 30 s long and captions are eight to 20 words long.

Clotho is thoroughly described in our paper:

K. Drossos, S. Lipping and T. Virtanen, "Clotho: an Audio Captioning Dataset," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 736-740, doi: 10.1109/ICASSP40776.2020.9052990.

available online at: https://arxiv.org/abs/1910.09387 and at: https://ieeexplore.ieee.org/document/9052990

If you use Clotho, please cite our paper.

To use the dataset, you can use our code at: https://github.com/audio-captioning/clotho-dataset

These are the files for the development and evaluation splits of the Clotho dataset.

--------------------------------------------------------------------------------------------------------
== Usage ==

To use the dataset you have to:

  1. Download the audio files: clotho_audio_development.7z and clotho_audio_evalution.7z
  2. Download the files with the captions: clotho_captions_development.csv and clotho_captions_evaluation.csv
  3. Download the files with the associated metadata: clotho_metadata_development.csv and clotho_metadata_evaluation.csv
  4. Extract the audio files

Then you can use each audio file with its corresponding captions.

--------------------------------------------------------------------------------------------------------
== License ==

The audio files in the archives:

  - clotho_audio_development.7z
  - clotho_audio_evalution.7z

and the associated meta-data in the CSV files:

  - clotho_metadata_development.csv
  - clotho_metadata_evaluation.csv

are under the corresponding licences (mostly CreativeCommons with attribution) of the Freesound [1] platform, mentioned explicitly in the CSV files for each of the audio files. That is, each audio file in the 7z archives is listed in the CSV files with the meta-data.

The meta-data for each file are:

  - File name
  - Keywords
  - URL for the original audio file
  - Start and ending samples for the excerpt that is used in the Clotho dataset
  - Uploader/user in the Freesound platform (manufacturer)
  - Link to the licence of the file

The captions in the files:

  - clotho_captions_development.csv
  - clotho_captions_evaluation.csv

are under the Tampere University licence, described in the LICENCE file (mainly a non-commercial with attribution licence).

--------------------------------------------------------------------------------------------------------
== References ==

[1] Frederic Font, Gerard Roma, and Xavier Serra. 2013. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia (MM '13). ACM, New York, NY, USA, 411-412. DOI: https://doi.org/10.1145/2502081.2502245
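The last usage step, pairing each extracted audio file with its captions, could look roughly like the sketch below. It is an illustration, not the official loader (that lives in the GitHub repository linked above). It assumes that each captions CSV has a `file_name` column plus one column per caption whose header starts with `caption_`; the description does not spell out the column names, so check the header row of your copy. The helper names `load_captions` and `missing_audio` are hypothetical.

```python
import csv
from pathlib import Path


def load_captions(captions_csv, audio_dir):
    """Pair each audio file with its list of captions.

    Assumption (check against the real CSV header): a `file_name` column
    and one column per caption whose header starts with `caption_`.
    """
    audio_dir = Path(audio_dir)
    pairs = []
    with open(captions_csv, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Keep only non-empty cells from the caption columns.
            captions = [
                text
                for column, text in row.items()
                if column and column.startswith("caption_") and text
            ]
            pairs.append((audio_dir / row["file_name"], captions))
    return pairs


def missing_audio(pairs):
    """Return the audio paths named in the CSV that are not on disk."""
    return [path for path, _ in pairs if not path.is_file()]
```

Calling `load_captions("clotho_captions_development.csv", audio_dir)` with `audio_dir` set to wherever clotho_audio_development.7z was extracted returns one `(path, captions)` tuple per CSV row. Passing that result to `missing_audio` is a quick check that the extraction step completed before any training or evaluation starts.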
https://zenodo.org/record/3490684
Clotho is an audio captioning dataset. The dataset focuses on the content of the audio, as well as the diversity of captions, and consists of 4,981 audio samples with 5 captions per audio sample (24,905 captions in total). Audio samples are 15 to 30 seconds long and captions are 8 to 20 words long.