Clotho is a novel audio captioning dataset, consisting of 4981 audio samples, each with five captions (a total of 24 905 captions). Audio samples are 15 to 30 s long and captions are eight to 20 words long. Clotho is thoroughly described in our paper:

K. Drossos, S. Lipping and T. Virtanen, "Clotho: an Audio Captioning Dataset," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 736-740, doi: 10.1109/ICASSP40776.2020.9052990.

available online at: https://arxiv.org/abs/1910.09387 and at: https://ieeexplore.ieee.org/document/9052990

If you use Clotho, please cite our paper. To use the dataset, you can use our code at: https://github.com/audio-captioning/clotho-dataset

These are the files for the development and evaluation splits of the Clotho dataset.

--------------------------------------------------------------------------------------------------------

== Usage ==

To use the dataset you have to:

Download the audio files: clotho_audio_development.7z and clotho_audio_evaluation.7z
Download the files with the captions: clotho_captions_development.csv and clotho_captions_evaluation.csv
Download the files with the associated metadata: clotho_metadata_development.csv and clotho_metadata_evaluation.csv
Extract the audio files

Then you can use each audio file with its corresponding captions; a minimal loading sketch is given after this entry.

--------------------------------------------------------------------------------------------------------

== License ==

The audio files in the archives clotho_audio_development.7z and clotho_audio_evaluation.7z, and the associated meta-data in the CSV files clotho_metadata_development.csv and clotho_metadata_evaluation.csv, are under the corresponding licences (mostly Creative Commons with attribution) of the Freesound [1] platform, mentioned explicitly in the CSV files for each of the audio files. That is, each audio file in the 7z archives is listed in the CSV files with its meta-data. The meta-data for each file are:

File name
Keywords
URL for the original audio file
Start and ending samples for the excerpt that is used in the Clotho dataset
Uploader/user in the Freesound platform (manufacturer)
Link to the licence of the file

The captions in the files clotho_captions_development.csv and clotho_captions_evaluation.csv are under the Tampere University licence, described in the LICENCE file (mainly a non-commercial with attribution licence).

--------------------------------------------------------------------------------------------------------

== References ==

[1] Frederic Font, Gerard Roma, and Xavier Serra. 2013. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia (MM '13). ACM, New York, NY, USA, 411-412. DOI: https://doi.org/10.1145/2502081.2502245
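The loading sketch referenced in the Usage section above: a minimal example, assuming pandas is installed and that the captions CSVs follow the published Clotho layout with a file_name column and caption_1 … caption_5 columns; the paths are placeholders.

    import os
    import pandas as pd

    audio_dir = "clotho_audio_development"            # extracted from clotho_audio_development.7z
    captions = pd.read_csv("clotho_captions_development.csv")

    # Each row pairs one audio file with its five captions
    for _, row in captions.iterrows():
        audio_path = os.path.join(audio_dir, row["file_name"])
        five_captions = [row[f"caption_{i}"] for i in range(1, 6)]
        # feed (audio_path, five_captions) to your captioning pipeline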
This is the evaluation split for Task 6, Automated Audio Captioning, in DCASE 2020 Challenge.
This evaluation split is the Clotho testing split, which is thoroughly described in the corresponding paper:
K. Drossos, S. Lipping and T. Virtanen, "Clotho: an Audio Captioning Dataset," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 736-740, doi: 10.1109/ICASSP40776.2020.9052990.
available online at: https://arxiv.org/abs/1910.09387 and at: https://ieeexplore.ieee.org/document/9052990
This evaluation split is meant to be used for the purposes of Task 6 of the DCASE 2020 scientific challenge. It is not meant to be used for developing audio captioning methods; for that, you should use the development and evaluation splits of Clotho.
If you want the development and evaluation splits of the Clotho dataset, you can also find them on Zenodo, at: https://zenodo.org/record/3490684
== License ==
The audio files in the archives:
clotho_audio_test.7z
and the associated meta-data in the CSV file:
clotho_metadata_test.csv
are under the corresponding licences (mostly Creative Commons with attribution) of the Freesound [1] platform, mentioned explicitly in the CSV file for each of the audio files. That is, each audio file in the 7z archive is listed in the CSV file with its meta-data. The meta-data for each file are:
File name
Start and ending samples for the excerpt that is used in the Clotho dataset
Uploader/user in the Freesound platform (manufacturer)
Link to the licence of the file
== References == [1] Frederic Font, Gerard Roma, and Xavier Serra. 2013. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia (MM '13). ACM, New York, NY, USA, 411-412. DOI: https://doi.org/10.1145/2502081.2502245
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset consists of the Hospital scene of our Audio Caption dataset. Details can be seen in our paper Audio Caption: Listen and Tell, published at ICASSP 2019.
The Car scene is detailed in the paper Audio Caption in a Car Setting with a Sentence-Level Loss, published at ISCSLP 2021.
The original captions are in Mandarin Chinese, with English translations provided.
This is a dataset containing audio captions and corresponding audio tags for 3930 audio files of the TAU Urban Acoustic Scenes 2019 development dataset (airport, public square, and park). The files were annotated using a web-based tool. Each file is annotated by multiple annotators who provided tags and a one-sentence description of the audio content. The data also includes annotator competence estimated using MACE (Multi-Annotator Competence Estimation). The annotation procedure, processing and analysis of the data are presented in the following papers:

Irene Martin-Morato, Annamaria Mesaros. What is the ground truth? Reliability of multi-annotator data for audio tagging. 29th European Signal Processing Conference (EUSIPCO), 2021.
Irene Martin-Morato, Annamaria Mesaros. Diversity and bias in audio captioning datasets. Submitted to DCASE 2021 Workshop (to be updated with arXiv link).

Data is provided as two files:

MACS.yaml - containing the complete annotations in the following format:

    - filename: file1.wav
      annotations:
        - annotator_id: ann_1
          sentence: caption text
          tags:
            - tag1
            - tag2
        - annotator_id: ann_2
          sentence: caption text
          tags:
            - tag1

MACS_competence.csv - containing the estimated annotator competence; for each annotator_id in the yaml file, competence is a number between 0 (considered as annotating at random) and 1:

    id [tab] competence

The audio files can be downloaded from https://zenodo.org/record/2589280 and are covered by their own license.
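A minimal sketch of reading these two files, assuming PyYAML and the layout shown above; the file and field names are taken from the description, and a possible header row in the CSV is skipped.

    import csv
    import yaml  # PyYAML

    # One entry per audio file, each with a list of per-annotator annotations
    with open("MACS.yaml") as f:
        macs = yaml.safe_load(f)

    # annotator_id -> competence in [0, 1]; non-numeric rows (e.g. a header) are skipped
    competence = {}
    with open("MACS_competence.csv") as f:
        for row in csv.reader(f, delimiter="\t"):
            try:
                competence[row[0]] = float(row[1])
            except (ValueError, IndexError):
                continue

    for item in macs:
        for ann in item["annotations"]:
            weight = competence.get(ann["annotator_id"], 0.0)
            print(item["filename"], ann["sentence"], ann["tags"], weight)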
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for Music-Audio-Pseudo Captions
Pseudo music and audio captions from LP-MusicCaps, Music Negation/Temporal Ordering, and WavCaps
Dataset Summary
Compared to other domains, well-written web caption data is hard to obtain for music and audio, and caption annotation is expensive. Therefore, we use the Music (LP-MusicCaps, Music Negation/Temporal Ordering) and Audio (WavCaps) datasets created with ChatGPT and re-organize them in the form of instructions, input… See the full description on the dataset page: https://huggingface.co/datasets/seungheondoh/music-audio-pseudo-captions.
According to our latest research, the global multilingual audio captioning glasses market size reached USD 1.18 billion in 2024, reflecting a robust surge in demand for advanced assistive and wearable technologies. The market is expected to grow at a CAGR of 18.7% from 2025 to 2033, forecasting a substantial rise to USD 6.09 billion by 2033. This impressive growth is primarily driven by increasing adoption of smart wearable devices, heightened awareness about accessibility solutions, and rapid advancements in artificial intelligence and real-time translation technologies.
The growth trajectory of the multilingual audio captioning glasses market is underpinned by several pivotal factors. One of the most significant drivers is the global push for inclusivity and accessibility, particularly for individuals with visual and hearing impairments. Governments, educational institutions, and public service providers are increasingly mandating the use of assistive technologies to enhance the quality of life for differently-abled individuals. This regulatory support, coupled with growing social awareness, is prompting manufacturers to innovate and develop more user-friendly, efficient, and cost-effective solutions. Furthermore, the integration of AI-based captioning and real-time translation features is making these devices more versatile, catering to a broader spectrum of users, including travelers, students, and professionals working in multilingual environments.
Another crucial growth factor is the rapid evolution of core technologies such as speech recognition, natural language processing, and machine learning. These advancements have significantly improved the accuracy, speed, and contextual relevance of audio captioning in multiple languages, making the devices highly reliable for real-time applications. The proliferation of 5G networks and cloud computing infrastructure has further enhanced the capability of these glasses, enabling seamless streaming and processing of audio data. As a result, the market is witnessing a surge in demand not only from the healthcare and education sectors but also from consumer electronics, corporate, and public service domains, where multilingual communication is essential for operational efficiency and user engagement.
The increasing penetration of smart devices and the growing trend of digitization across industries are also contributing to the expansion of the multilingual audio captioning glasses market. As consumers become more tech-savvy, there is a heightened expectation for devices that offer both convenience and enhanced functionality. The ability to receive real-time audio captions and translations through wearable glasses is particularly appealing in globalized workplaces, multicultural educational settings, and international travel scenarios. Additionally, the COVID-19 pandemic has accelerated the adoption of remote communication tools, further highlighting the need for accessible and inclusive technology solutions. This shift in consumer behavior is expected to sustain the market’s momentum over the coming years.
From a regional perspective, North America currently dominates the multilingual audio captioning glasses market, accounting for over 38% of the global revenue in 2024, followed by Europe and Asia Pacific. The high adoption rate in North America can be attributed to strong technological infrastructure, favorable government policies, and the presence of leading market players. However, Asia Pacific is poised to witness the fastest growth during the forecast period, driven by increasing investments in healthcare and education, rising disposable incomes, and a large population base with diverse linguistic needs. Europe remains a significant market, supported by robust regulatory frameworks and active initiatives to promote digital inclusivity.
The product type segment of the multilingual audio captioning glasses market is broadly categorized
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks
📖 Paper | 🛠️ GitHub | 🔊 MECAT-Caption Dataset | 🔊 MECAT-QA Dataset
Dataset Description
MECAT (Multi-Expert Chain for Audio Tasks) is a comprehensive benchmark constructed on large-scale data to evaluate machine understanding of audio content through two core tasks:
Audio Captioning: generating textual descriptions for given audio
Audio Question Answering: answering questions… See the full description on the dataset page: https://huggingface.co/datasets/mispeech/MECAT-QA.
@article{mei2024wavcaps,
  title={Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research},
  author={Mei, Xinhao and Meng, Chutong and Liu, Haohe and Kong, Qiuqiang and Ko, Tom and Zhao, Chengqi and Plumbley, Mark D and Zou, Yuexian and Wang, Wenwu},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  year={2024},
  publisher={IEEE}
}
@article{wang2024audiobench, title={AudioBench: A Universal Benchmark for… See the full description on the dataset page: https://huggingface.co/datasets/AudioLLMs/wavcaps_test.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
WavCaps
WavCaps is a ChatGPT-assisted weakly-labelled audio captioning dataset for audio-language multimodal research, where the audio clips are sourced from three websites (FreeSound, BBC Sound Effects, and SoundBible) and a sound event detection dataset (AudioSet Strongly-labelled Subset).
Paper: https://arxiv.org/abs/2303.17395 Github: https://github.com/XinhaoMei/WavCaps
Statistics
Statistics table (header only; the rows are truncated): Data Source | avg. audio duration (s) | avg. text length
FreeSound… See the full description on the dataset page: https://huggingface.co/datasets/cvssp/WavCaps.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is derived from the evaluation subset of the Clotho dataset. It is designed to analyze the behavior of captioning systems under certain perturbations, in order to identify some open challenges in automated audio captioning. The original audio clips are transformed with audio_degrader. The transformations applied are the following (a sketch of the mixup transformation is given after the list):
Microphone response simulation
Mixup with another clip from the dataset (ratio -6dB, -3dB and 0dB)
Additive noise from DESED (ratio -12dB, -6dB, 0dB)
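As an illustration of the mixup transformation listed above, here is a minimal sketch of mixing one clip with another at a given level ratio. This is a plain NumPy/soundfile approximation rather than a call into audio_degrader itself; mono audio is assumed and the file names are placeholders.

    import numpy as np
    import soundfile as sf

    def _rms(x):
        return np.sqrt(np.mean(x ** 2) + 1e-12)

    def mix_at_ratio(clean, other, ratio_db):
        # Scale `other` so its RMS level sits `ratio_db` dB relative to `clean`, then add it.
        other = np.resize(other, clean.shape)                  # loop/trim to the same length (mono assumed)
        gain = _rms(clean) / _rms(other) * 10.0 ** (ratio_db / 20.0)
        mix = clean + gain * other
        return mix / max(1.0, float(np.max(np.abs(mix))))      # simple peak normalisation to avoid clipping

    clean, sr = sf.read("clip_a.wav")   # placeholder file names
    other, _ = sf.read("clip_b.wav")
    sf.write("clip_a_mix_minus6dB.wav", mix_at_ratio(clean, other, -6.0), sr)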
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description: The wavs/ directory contains 40,000 spoken audio captions in .wav audio format, one for each caption included in the train, dev, and test splits in the original Flickr 8k corpus (as defined by the files Flickr_8k.trainImages.txt, Flickr_8k.devImages.txt, and Flickr_8k.testImages.txt)
The audio is sampled at 16000 Hz with 16-bit depth, and stored in Microsoft WAVE audio format
The file wav2capt.txt contains a mapping from the .wav file names to the corresponding .jpg images and the caption number. The .jpg file names and caption numbers can then be mapped to the caption text via the Flickr8k.token.txt file from the original Flickr 8k corpus.
The file wav2spk.txt contains a mapping from the .wav file names to its speaker. Each unique speaker is numbered consecutively from 1 to 183 (the total number of unique speakers).
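A minimal sketch of reading the two mapping files described above, assuming whitespace-separated columns (the exact layout should be checked against the files; the caption number may carry a leading '#'):

    # wav filename -> speaker id (1..183)
    wav2spk = {}
    with open("wav2spk.txt") as f:
        for line in f:
            wav, spk = line.split()
            wav2spk[wav] = int(spk)

    # wav filename -> (jpg filename, caption number)
    wav2capt = {}
    with open("wav2capt.txt") as f:
        for line in f:
            wav, jpg, capt = line.split()
            wav2capt[wav] = (jpg, int(capt.lstrip("#")))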
Citing:
D. Harwath and J. Glass, "Deep Multimodal Semantic Embeddings for Speech and Images," 2015 IEEE Automatic Speech Recognition and Understanding Workshop, pp. 237-244, Scottsdale, Arizona, USA, December 2015
M. Hodosh, P. Young and J. Hockenmaier (2013) "Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics", Journal of Artificial Intelligence Research, Volume 47, pages 853-899 https://www.jair.org/index.php/jair/article/view/10833/25854
Academic Free License 3.0 (AFL-3.0): https://choosealicense.com/licenses/afl-3.0/
Dataset of captioned spectrograms (text describing the sound).
Automated audio captioning is a cross-modal translation task for describing the content of audio clips with natural language sentences.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset consists of the Hospital scene of our Audio Caption dataset. Details can be seen in our paper Audio Caption: Listen and Tell, published at ICASSP 2019.
This is a dataset containing audio captions for audio files of the TAU Urban Acoustic Scenes 2019 development dataset (airport, public square, and park) for 10 cities.
The files were annotated using a web-based tool as presented in:
Martin-Morato, I., & Mesaros, A. (2021). Diversity and bias in audio captioning datasets. In F. Font, A. Mesaros, D. P. W. Ellis, E. Fonseca, M. Fuentes, & B. Elizalde (Eds.), Proceedings of the 6th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2021) (pp. 90-94).
Each file is annotated by multiple annotators who provided a one-sentence description of the audio content.
Data is provided in csv files:
original = original descriptions, non-translated
translated = descriptions translated using an automatic deep learning tool
900 annotated audio files, with Finnish audio descriptions provided by visually impaired and sighted people.
2050 annotated audio files, with English audio descriptions provided by international students (not necessarily native English speakers).
3930 annotated audio files, with English audio descriptions provided by international students (not necessarily native English speakers), biased by the provided audio tags.
The audio files can be downloaded from https://zenodo.org/record/2589280 and are covered by their own license.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Summary
This is an artifact corresponding to Section 2.3 of the following paper:
Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation
Shih-Lun Wu, Xuankai Chang, Gordon Wichern, Jee-weon Jung, François Germain, Jonathan Le Roux, and Shinji Watanabe
Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP) 2024
[arXiv page] [code]
Upstream Dataset
The original captions come from the development… See the full description on the dataset page: https://huggingface.co/datasets/slseanwu/clotho-chatgpt-mixup-50K.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation
"A retro-futurist drum machine groove drenched in bubbly synthetic sound effects and a hint of an acid bassline."

The Song Describer Dataset (SDD) contains ~1.1k captions for 706 permissively licensed music recordings. It is designed for use in the evaluation of models that address music-and-language (M&L) tasks such as music captioning, text-to-music generation and music-language retrieval. More information about the data, collection method and validation is provided in the paper describing the dataset. If you use this dataset, please cite our paper:

The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation. Manco, Ilaria; Weck, Benno; Doh, Seungheon; Won, Minz; Zhang, Yixiao; Bogdanov, Dmitry; Wu, Yusong; Chen, Ke; Tovstogan, Philip; Benetos, Emmanouil; Quinton, Elio; Fazekas, György; Nam, Juhan. Machine Learning for Audio Workshop at NeurIPS 2023, 2023.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for Video and Audio Aligned Caption Dataset (VAAC)
A dataset that contains different captions for videos with audio.
Dataset Details
We present a framework for annotating videos with audiovisual textual descriptions. Our three-step process involves generating auditory captions from sounds using an audio captioner, generating visual captions from the video content using a video captioner, and using concatenation or instruction fine-tuned large language models… See the full description on the dataset page: https://huggingface.co/datasets/ResearcherT98/VAAC.
A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The Flickr 8k Audio Caption Corpus contains 40,000 spoken captions of 8,000 natural images. It was collected in 2015 to investigate multimodal learning schemes for unsupervised speech pattern discovery. For a description of the corpus, see: D. Harwath and J. Glass, "Deep Multimodal Semantic Embeddings for Speech and Images," 2015 IEEE Automatic Speech Recognition and Understanding Workshop, pp. 237-244, Scottsdale, Arizona, USA, December 2015