Clotho is a novel audio captioning dataset, consisting of 4981 audio samples, each with five captions (a total of 24 905 captions). Audio samples are 15 to 30 s long and captions are eight to 20 words long. Clotho is thoroughly described in our paper:

K. Drossos, S. Lipping and T. Virtanen, "Clotho: an Audio Captioning Dataset," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 736-740, doi: 10.1109/ICASSP40776.2020.9052990.

available online at: https://arxiv.org/abs/1910.09387 and at: https://ieeexplore.ieee.org/document/9052990

If you use Clotho, please cite our paper. To use the dataset, you can use our code at: https://github.com/audio-captioning/clotho-dataset

These are the files for the development and evaluation splits of the Clotho dataset.

--------------------------------------------------------------------------------------------------------

== Usage ==

To use the dataset you have to:

Download the audio files: clotho_audio_development.7z and clotho_audio_evaluation.7z
Download the files with the captions: clotho_captions_development.csv and clotho_captions_evaluation.csv
Download the files with the associated metadata: clotho_metadata_development.csv and clotho_metadata_evaluation.csv
Extract the audio files

Then you can use each audio file with its corresponding captions; a minimal loading sketch is given after this entry.

--------------------------------------------------------------------------------------------------------

== License ==

The audio files in the archives clotho_audio_development.7z and clotho_audio_evaluation.7z, and the associated meta-data in the CSV files clotho_metadata_development.csv and clotho_metadata_evaluation.csv, are under the corresponding licences (mostly Creative Commons with attribution) of the Freesound [1] platform, mentioned explicitly in the CSV files for each of the audio files. That is, each audio file in the 7z archives is listed in the CSV files with its meta-data. The meta-data for each file are:

File name
Keywords
URL for the original audio file
Start and ending samples for the excerpt that is used in the Clotho dataset
Uploader/user in the Freesound platform (manufacturer)
Link to the licence of the file

The captions in the files clotho_captions_development.csv and clotho_captions_evaluation.csv are under the Tampere University licence, described in the LICENCE file (mainly a non-commercial with attribution licence).

--------------------------------------------------------------------------------------------------------

== References ==

[1] Frederic Font, Gerard Roma, and Xavier Serra. 2013. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia (MM '13). ACM, New York, NY, USA, 411-412. DOI: https://doi.org/10.1145/2502081.2502245
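The loading sketch referenced in the Usage section above: a minimal example, assuming pandas is installed and that the captions CSVs follow the published Clotho layout with a file_name column and caption_1 … caption_5 columns; the paths are placeholders.

    import os
    import pandas as pd

    audio_dir = "clotho_audio_development"            # extracted from clotho_audio_development.7z
    captions = pd.read_csv("clotho_captions_development.csv")

    # Each row pairs one audio file with its five captions
    for _, row in captions.iterrows():
        audio_path = os.path.join(audio_dir, row["file_name"])
        five_captions = [row[f"caption_{i}"] for i in range(1, 6)]
        # feed (audio_path, five_captions) to your captioning pipeline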
This is the evaluation split for Task 6, Automated Audio Captioning, in DCASE 2020 Challenge.
This evaluation split is the Clotho testing split, which is thoroughly described in the corresponding paper:
K. Drossos, S. Lipping and T. Virtanen, "Clotho: an Audio Captioning Dataset," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 736-740, doi: 10.1109/ICASSP40776.2020.9052990.
available online at: https://arxiv.org/abs/1910.09387 and at: https://ieeexplore.ieee.org/document/9052990
This evaluation split is meant to be used for the purposes of Task 6 of the DCASE 2020 scientific challenge. It is not meant to be used for developing audio captioning methods; for that, you should use the development and evaluation splits of Clotho.
If you want the development and evaluation splits of the Clotho dataset, you can also find them on Zenodo, at: https://zenodo.org/record/3490684
== License ==
The audio files in the archives:
clotho_audio_test.7z
and the associated meta-data in the CSV file:
clotho_metadata_test.csv
are under the corresponding licences (mostly Creative Commons with attribution) of the Freesound [1] platform, mentioned explicitly in the CSV file for each of the audio files. That is, each audio file in the 7z archive is listed in the CSV file with its meta-data. The meta-data for each file are:
File name
Start and ending samples for the excerpt that is used in the Clotho dataset
Uploader/user in the Freesound platform (manufacturer)
Link to the licence of the file
== References == [1] Frederic Font, Gerard Roma, and Xavier Serra. 2013. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia (MM '13). ACM, New York, NY, USA, 411-412. DOI: https://doi.org/10.1145/2502081.2502245
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset consists of the Hospital scene of our Audio Caption dataset. Details can be seen in our paper Audio Caption: Listen and Tell, published at ICASSP 2019.
The Car scene is detailed in the paper Audio Caption in a Car Setting with a Sentence-Level Loss, published at ISCSLP 2021.
The original captions are in Mandarin Chinese, with English translations provided.
This is a dataset containing audio captions and corresponding audio tags for 3930 audio files of the TAU Urban Acoustic Scenes 2019 development dataset (airport, public square, and park). The files were annotated using a web-based tool. Each file is annotated by multiple annotators who provided tags and a one-sentence description of the audio content. The data also includes annotator competence estimated using MACE (Multi-Annotator Competence Estimation). The annotation procedure, processing and analysis of the data are presented in the following papers:

Irene Martin-Morato, Annamaria Mesaros. What is the ground truth? Reliability of multi-annotator data for audio tagging. 29th European Signal Processing Conference (EUSIPCO), 2021.
Irene Martin-Morato, Annamaria Mesaros. Diversity and bias in audio captioning datasets. Submitted to DCASE 2021 Workshop (to be updated with arXiv link).

Data is provided as two files:

MACS.yaml - containing the complete annotations in the following format:

    - filename: file1.wav
      annotations:
        - annotator_id: ann_1
          sentence: caption text
          tags:
            - tag1
            - tag2
        - annotator_id: ann_2
          sentence: caption text
          tags:
            - tag1

MACS_competence.csv - containing the estimated annotator competence; for each annotator_id in the yaml file, competence is a number between 0 (considered as annotating at random) and 1:

    id [tab] competence

The audio files can be downloaded from https://zenodo.org/record/2589280 and are covered by their own license.
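A minimal sketch of reading these two files, assuming PyYAML and the layout shown above; the file and field names are taken from the description, and a possible header row in the CSV is skipped.

    import csv
    import yaml  # PyYAML

    # One entry per audio file, each with a list of per-annotator annotations
    with open("MACS.yaml") as f:
        macs = yaml.safe_load(f)

    # annotator_id -> competence in [0, 1]; non-numeric rows (e.g. a header) are skipped
    competence = {}
    with open("MACS_competence.csv") as f:
        for row in csv.reader(f, delimiter="\t"):
            try:
                competence[row[0]] = float(row[1])
            except (ValueError, IndexError):
                continue

    for item in macs:
        for ann in item["annotations"]:
            weight = competence.get(ann["annotator_id"], 0.0)
            print(item["filename"], ann["sentence"], ann["tags"], weight)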
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for Music-Audio-Pseudo Captions
Pseudo music and audio captions from LP-MusicCaps, Music Negation/Temporal Ordering, and WavCaps
Dataset Summary
Compared to other domains, well-written web caption data is hard to obtain for music and audio, and caption annotation is expensive. Therefore, we use the Music (LP-MusicCaps, Music Negation/Temporal Ordering) and Audio (WavCaps) datasets created with ChatGPT and re-organize them in the form of instructions, input… See the full description on the dataset page: https://huggingface.co/datasets/seungheondoh/music-audio-pseudo-captions.
According to our latest research, the global multilingual audio captioning glasses market size reached USD 1.18 billion in 2024, reflecting a robust surge in demand for advanced assistive and wearable technologies. The market is expected to grow at a CAGR of 18.7% from 2025 to 2033, forecasting a substantial rise to USD 6.09 billion by 2033. This impressive growth is primarily driven by increasing adoption of smart wearable devices, heightened awareness about accessibility solutions, and rapid advancements in artificial intelligence and real-time translation technologies.
The growth trajectory of the multilingual audio captioning glasses market is underpinned by several pivotal factors. One of the most significant drivers is the global push for inclusivity and accessibility, particularly for individuals with visual and hearing impairments. Governments, educational institutions, and public service providers are increasingly mandating the use of assistive technologies to enhance the quality of life for differently-abled individuals. This regulatory support, coupled with growing social awareness, is prompting manufacturers to innovate and develop more user-friendly, efficient, and cost-effective solutions. Furthermore, the integration of AI-based captioning and real-time translation features is making these devices more versatile, catering to a broader spectrum of users, including travelers, students, and professionals working in multilingual environments.
Another crucial growth factor is the rapid evolution of core technologies such as speech recognition, natural language processing, and machine learning. These advancements have significantly improved the accuracy, speed, and contextual relevance of audio captioning in multiple languages, making the devices highly reliable for real-time applications. The proliferation of 5G networks and cloud computing infrastructure has further enhanced the capability of these glasses, enabling seamless streaming and processing of audio data. As a result, the market is witnessing a surge in demand not only from the healthcare and education sectors but also from consumer electronics, corporate, and public service domains, where multilingual communication is essential for operational efficiency and user engagement.
The increasing penetration of smart devices and the growing trend of digitization across industries are also contributing to the expansion of the multilingual audio captioning glasses market. As consumers become more tech-savvy, there is a heightened expectation for devices that offer both convenience and enhanced functionality. The ability to receive real-time audio captions and translations through wearable glasses is particularly appealing in globalized workplaces, multicultural educational settings, and international travel scenarios. Additionally, the COVID-19 pandemic has accelerated the adoption of remote communication tools, further highlighting the need for accessible and inclusive technology solutions. This shift in consumer behavior is expected to sustain the market’s momentum over the coming years.
From a regional perspective, North America currently dominates the multilingual audio captioning glasses market, accounting for over 38% of the global revenue in 2024, followed by Europe and Asia Pacific. The high adoption rate in North America can be attributed to strong technological infrastructure, favorable government policies, and the presence of leading market players. However, Asia Pacific is poised to witness the fastest growth during the forecast period, driven by increasing investments in healthcare and education, rising disposable incomes, and a large population base with diverse linguistic needs. Europe remains a significant market, supported by robust regulatory frameworks and active initiatives to promote digital inclusivity.
The product type segment of the multilingual audio captioning glasses market is broadly categorized
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks
📖 Paper | 🛠️ GitHub | 🔊 MECAT-Caption Dataset | 🔊 MECAT-QA Dataset
Dataset Description
MECAT (Multi-Expert Chain for Audio Tasks) is a comprehensive benchmark constructed on large-scale data to evaluate machine understanding of audio content through two core tasks:
Audio Captioning: generating textual descriptions for given audio
Audio Question Answering: answering questions… See the full description on the dataset page: https://huggingface.co/datasets/mispeech/MECAT-QA.
@article{mei2024wavcaps,
  title={Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research},
  author={Mei, Xinhao and Meng, Chutong and Liu, Haohe and Kong, Qiuqiang and Ko, Tom and Zhao, Chengqi and Plumbley, Mark D and Zou, Yuexian and Wang, Wenwu},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  year={2024},
  publisher={IEEE}
}
@article{wang2024audiobench, title={AudioBench: A Universal Benchmark for… See the full description on the dataset page: https://huggingface.co/datasets/AudioLLMs/wavcaps_test.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
WavCaps
WavCaps is a ChatGPT-assisted weakly-labelled audio captioning dataset for audio-language multimodal research, where the audio clips are sourced from three websites (FreeSound, BBC Sound Effects, and SoundBible) and a sound event detection dataset (AudioSet Strongly-labelled Subset).
Paper: https://arxiv.org/abs/2303.17395 Github: https://github.com/XinhaoMei/WavCaps
Statistics
Statistics table (header only; the rows are truncated): Data Source | avg. audio duration (s) | avg. text length
FreeSound… See the full description on the dataset page: https://huggingface.co/datasets/cvssp/WavCaps.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is derived from the evaluation subset of the Clotho dataset. It is designed to analyze the behavior of captioning systems under certain perturbations, in order to identify some open challenges in automated audio captioning. The original audio clips are transformed with audio_degrader. The transformations applied are the following (a sketch of the mixup transformation is given after the list):
Microphone response simulation
Mixup with another clip from the dataset (ratio -6dB, -3dB and 0dB)
Additive noise from DESED (ratio -12dB, -6dB, 0dB)
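As an illustration of the mixup transformation listed above, here is a minimal sketch of mixing one clip with another at a given level ratio. This is a plain NumPy/soundfile approximation rather than a call into audio_degrader itself; mono audio is assumed and the file names are placeholders.

    import numpy as np
    import soundfile as sf

    def _rms(x):
        return np.sqrt(np.mean(x ** 2) + 1e-12)

    def mix_at_ratio(clean, other, ratio_db):
        # Scale `other` so its RMS level sits `ratio_db` dB relative to `clean`, then add it.
        other = np.resize(other, clean.shape)                  # loop/trim to the same length (mono assumed)
        gain = _rms(clean) / _rms(other) * 10.0 ** (ratio_db / 20.0)
        mix = clean + gain * other
        return mix / max(1.0, float(np.max(np.abs(mix))))      # simple peak normalisation to avoid clipping

    clean, sr = sf.read("clip_a.wav")   # placeholder file names
    other, _ = sf.read("clip_b.wav")
    sf.write("clip_a_mix_minus6dB.wav", mix_at_ratio(clean, other, -6.0), sr)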
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description: The wavs/ directory contains 40,000 spoken audio captions in .wav audio format, one for each caption included in the train, dev, and test splits in the original Flickr 8k corpus (as defined by the files Flickr_8k.trainImages.txt, Flickr_8k.devImages.txt, and Flickr_8k.testImages.txt)
The audio is sampled at 16000 Hz with 16-bit depth, and stored in Microsoft WAVE audio format
The file wav2capt.txt contains a mapping from the .wav file names to the corresponding .jpg images and the caption number. The .jpg file names and caption numbers can then be mapped to the caption text via the Flickr8k.token.txt file from the original Flickr 8k corpus.
The file wav2spk.txt contains a mapping from the .wav file names to its speaker. Each unique speaker is numbered consecutively from 1 to 183 (the total number of unique speakers).
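A minimal sketch of reading the two mapping files described above, assuming whitespace-separated columns (the exact layout should be checked against the files; the caption number may carry a leading '#'):

    # wav filename -> speaker id (1..183)
    wav2spk = {}
    with open("wav2spk.txt") as f:
        for line in f:
            wav, spk = line.split()
            wav2spk[wav] = int(spk)

    # wav filename -> (jpg filename, caption number)
    wav2capt = {}
    with open("wav2capt.txt") as f:
        for line in f:
            wav, jpg, capt = line.split()
            wav2capt[wav] = (jpg, int(capt.lstrip("#")))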
Citing:
D. Harwath and J. Glass, "Deep Multimodal Semantic Embeddings for Speech and Images," 2015 IEEE Automatic Speech Recognition and Understanding Workshop, pp. 237-244, Scottsdale, Arizona, USA, December 2015
M. Hodosh, P. Young and J. Hockenmaier (2013) "Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics", Journal of Artificial Intelligence Research, Volume 47, pages 853-899 https://www.jair.org/index.php/jair/article/view/10833/25854
Academic Free License 3.0 (AFL-3.0): https://choosealicense.com/licenses/afl-3.0/
Dataset of captioned spectrograms (text describing the sound).
Automated audio captioning is a cross-modal translation task for describing the content of audio clips with natural language sentences.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset consists of the Hospital scene of our Audio Caption dataset. Details can be seen in our paper Audio Caption: Listen and Tell, published at ICASSP 2019.
This is a dataset containing audio captions for audio files of the TAU Urban Acoustic Scenes 2019 development dataset (airport, public square, and park) for 10 cities.
The files were annotated using a web-based tool as presented in:
Martin-Morato, I., & Mesaros, A. (2021). Diversity and bias in audio captioning datasets. In F. Font, A. Mesaros, D. P. W. Ellis, E. Fonseca, M. Fuentes, & B. Elizalde (Eds.), Proceedings of the 6th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2021) (pp. 90-94).
Each file is annotated by multiple annotators who provided a one-sentence description of the audio content.
Data is provided in csv files:
original = original descriptions, non-translated
translated = descriptions translated using an automatic deep learning tool
900 annotated audio files, with Finnish audio descriptions provided by visually impaired and sighted people.
2050 annotated audio files, with English audio descriptions provided by international students (not necessarily native English speakers).
3930 annotated audio files, with English audio descriptions provided by international students (not necessarily native English speakers), biased by the provided audio tags.
The audio files can be downloaded from https://zenodo.org/record/2589280 and are covered by their own license.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Summary
This is an artifact corresponding to Section 2.3 of the following paper:
Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation
Shih-Lun Wu, Xuankai Chang, Gordon Wichern, Jee-weon Jung, François Germain, Jonathan Le Roux, and Shinji Watanabe
Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP) 2024
[arXiv page] [code]
Upstream Dataset
The original captions come from the development… See the full description on the dataset page: https://huggingface.co/datasets/slseanwu/clotho-chatgpt-mixup-50K.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation
"A retro-futurist drum machine groove drenched in bubbly synthetic sound effects and a hint of an acid bassline."

The Song Describer Dataset (SDD) contains ~1.1k captions for 706 permissively licensed music recordings. It is designed for use in the evaluation of models that address music-and-language (M&L) tasks such as music captioning, text-to-music generation and music-language retrieval. More information about the data, collection method and validation is provided in the paper describing the dataset. If you use this dataset, please cite our paper:

The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation. Manco, Ilaria; Weck, Benno; Doh, Seungheon; Won, Minz; Zhang, Yixiao; Bogdanov, Dmitry; Wu, Yusong; Chen, Ke; Tovstogan, Philip; Benetos, Emmanouil; Quinton, Elio; Fazekas, György; Nam, Juhan. Machine Learning for Audio Workshop at NeurIPS 2023, 2023.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for Video and Audio Aligned Caption Dataset (VAAC)
A dataset that contains different captions for videos with audio.
Dataset Details
We present a framework for annotating videos with audiovisual textual descriptions. Our three-step process involves generating auditory captions from sounds using an audio captioner, generating visual captions from the video content using a video captioner, and using concatenation or instruction fine-tuned large language models… See the full description on the dataset page: https://huggingface.co/datasets/ResearcherT98/VAAC.
A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The Flickr 8k Audio Caption Corpus contains 40,000 spoken captions of 8,000 natural images. It was collected in 2015 to investigate multimodal learning schemes for unsupervised speech pattern discovery. For a description of the corpus, see: D. Harwath and J. Glass, "Deep Multimodal Semantic Embeddings for Speech and Images," 2015 IEEE Automatic Speech Recognition and Understanding Workshop, pp. 237-244, Scottsdale, Arizona, USA, December 2015