73 datasets found
  1. Data from: Clotho dataset

    • explore.openaire.eu
    • data.niaid.nih.gov
    • +1more
    Updated Oct 15, 2019
    + more versions
    Cite
    Konstantinos Drossos; Samuel Lipping; Tuomas Virtanen (2019). Clotho dataset [Dataset]. http://doi.org/10.5281/zenodo.3490684
    Dataset updated
    Oct 15, 2019
    Authors
    Konstantinos Drossos; Samuel Lipping; Tuomas Virtanen
    Description

    Clotho is a novel audio captioning dataset, consisting of 4981 audio samples, each with five captions (a total of 24 905 captions). Audio samples are 15 to 30 s in duration and captions are eight to 20 words long. Clotho is thoroughly described in our paper:

    K. Drossos, S. Lipping and T. Virtanen, "Clotho: an Audio Captioning Dataset," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 736-740, doi: 10.1109/ICASSP40776.2020.9052990.

    The paper is available online at: https://arxiv.org/abs/1910.09387 and at: https://ieeexplore.ieee.org/document/9052990

    If you use Clotho, please cite our paper. To use the dataset, you can use our code at: https://github.com/audio-captioning/clotho-dataset

    These are the files for the development and evaluation splits of the Clotho dataset.

    == Usage ==

    To use the dataset you have to:

    1. Download the audio files: clotho_audio_development.7z and clotho_audio_evaluation.7z
    2. Download the files with the captions: clotho_captions_development.csv and clotho_captions_evaluation.csv
    3. Download the files with the associated metadata: clotho_metadata_development.csv and clotho_metadata_evaluation.csv
    4. Extract the audio files

    Then you can use each audio file with its corresponding captions.

    == License ==

    The audio files in the archives clotho_audio_development.7z and clotho_audio_evaluation.7z, and the associated meta-data in the CSV files clotho_metadata_development.csv and clotho_metadata_evaluation.csv, are under the corresponding licences (mostly Creative Commons with attribution) of the Freesound [1] platform, mentioned explicitly in the CSV files for each of the audio files. That is, each audio file in the 7z archives is listed in the CSV files with its meta-data. The meta-data for each file are:

    • File name
    • Keywords
    • URL for the original audio file
    • Start and ending samples for the excerpt that is used in the Clotho dataset
    • Uploader/user in the Freesound platform (manufacturer)
    • Link to the licence of the file

    The captions in the files clotho_captions_development.csv and clotho_captions_evaluation.csv are under the Tampere University licence, described in the LICENCE file (mainly a non-commercial with attribution licence).

    == References ==

    [1] Frederic Font, Gerard Roma, and Xavier Serra. 2013. Freesound technical demo. In Proceedings of the 21st ACM International Conference on Multimedia (MM '13). ACM, New York, NY, USA, 411-412. DOI: https://doi.org/10.1145/2502081.2502245
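
    Once the archives are extracted, pairing each clip with its captions takes a few lines of Python. A minimal sketch, assuming the captions CSV has a file_name column and caption_1 through caption_5 columns (verify against your download):

      # Minimal sketch: pair each extracted Clotho clip with its five captions.
      # Column names (file_name, caption_1 ... caption_5) are assumptions.
      import csv
      from pathlib import Path

      AUDIO_DIR = Path("clotho_audio_development")  # extracted from the .7z archive

      with open("clotho_captions_development.csv", newline="", encoding="utf-8") as f:
          for row in csv.DictReader(f):
              wav_path = AUDIO_DIR / row["file_name"]
              captions = [row[f"caption_{i}"] for i in range(1, 6)]
              if wav_path.exists():
                  print(wav_path.name, "->", captions[0])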

  2. Audio captioning DCASE 2020 evaluation (testing) split

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +1more
    Updated Dec 8, 2020
    Cite
    Konstantinos Drossos (2020). Audio captioning DCASE 2020 evaluation (testing) split [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3865657
    Dataset updated
    Dec 8, 2020
    Dataset provided by
    Samuel Lipping
    Konstantinos Drossos
    Tuomas Virtanen
    Description

    This is the evaluation split for Task 6, Automated Audio Captioning, of the DCASE 2020 Challenge.

    This evaluation split is the Clotho testing split, which is thoroughly described in the corresponding paper:

    K. Drossos, S. Lipping and T. Virtanen, "Clotho: an Audio Captioning Dataset," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 736-740, doi: 10.1109/ICASSP40776.2020.9052990.

    available online at: https://arxiv.org/abs/1910.09387 and at: https://ieeexplore.ieee.org/document/9052990

    This evaluation split is meant to be used for Task 6 of the DCASE 2020 scientific challenge. It is not meant to be used for developing audio captioning methods; for that, you should use the development and evaluation splits of Clotho.

    If you want the development and evaluation splits of the Clotho dataset, you can also find them on Zenodo, at: https://zenodo.org/record/3490684

    == License ==

    The audio files in the archives:

    clotho_audio_test.7z

    and the associated meta-data in the CSV file:

    clotho_metadata_test.csv

    are under the corresponding licences (mostly Creative Commons with attribution) of the Freesound [1] platform, mentioned explicitly in the CSV file for each of the audio files. That is, each audio file in the 7z archive is listed in the CSV file with its meta-data. The meta-data for each file are:

    • File name
    • Start and ending samples for the excerpt that is used in the Clotho dataset
    • Uploader/user in the Freesound platform (manufacturer)
    • Link to the licence of the file

    == References ==

    [1] Frederic Font, Gerard Roma, and Xavier Serra. 2013. Freesound technical demo. In Proceedings of the 21st ACM International Conference on Multimedia (MM '13). ACM, New York, NY, USA, 411-412. DOI: https://doi.org/10.1145/2502081.2502245

  3. Audio Caption Dataset (Hospital & Car)

    • zenodo.org
    application/gzip, bin +1
    Updated Jan 10, 2022
    Cite
    Mengyue Wu; Xuenan Xu; Heinrich Dinkel; Kai Yu (2022). Audio Caption Dataset (Hospital & Car) [Dataset]. http://doi.org/10.5281/zenodo.5833263
    Available download formats: zip, application/gzip, bin
    Dataset updated
    Jan 10, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mengyue Wu; Xuenan Xu; Heinrich Dinkel; Kai Yu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset consists of the Hospital and Car scenes of our Audio Caption dataset. The Hospital scene is detailed in our paper Audio Caption: Listen and Tell, published at ICASSP 2019.

    The Car scene is detailed in Audio Caption in a Car Setting with a Sentence-Level Loss, published at ISCSLP 2021.

    The original captions are in Mandarin Chinese, with English translations provided.

  4. Data from: MACS - Multi-Annotator Captioned Soundscapes

    • explore.openaire.eu
    • producciocientifica.uv.es
    • +2more
    Updated Jul 22, 2021
    + more versions
    Cite
    Irene Martin Morato; Annamaria Mesaros (2021). MACS - Multi-Annotator Captioned Soundscapes [Dataset]. http://doi.org/10.5281/zenodo.5114770
    Dataset updated
    Jul 22, 2021
    Authors
    Irene Martin Morato; Annamaria Mesaros
    Description

    This is a dataset containing audio captions and corresponding audio tags for 3930 audio files of the TAU Urban Acoustic Scenes 2019 development dataset (airport, public square, and park). The files were annotated using a web-based tool. Each file is annotated by multiple annotators who provided tags and a one-sentence description of the audio content. The data also includes annotator competence estimated using MACE (Multi-Annotator Competence Estimation).

    The annotation procedure, processing and analysis of the data are presented in the following papers:

    • Irene Martin-Morato, Annamaria Mesaros. What is the ground truth? Reliability of multi-annotator data for audio tagging, 29th European Signal Processing Conference, EUSIPCO 2021
    • Irene Martin-Morato, Annamaria Mesaros. Diversity and bias in audio captioning datasets, submitted to DCASE 2021 Workshop (to be updated with arxiv link)

    Data is provided as two files:

    MACS.yaml - containing the complete annotations in the following format:

      - filename: file1.wav
        annotations:
          - annotator_id: ann_1
            sentence: caption text
            tags:
              - tag1
              - tag2
          - annotator_id: ann_2
            sentence: caption text
            tags:
              - tag1

    MACS_competence.csv - containing the estimated annotator competence; for each annotator_id in the yaml file, competence is a number between 0 (considered as annotating at random) and 1:

      id [tab] competence

    The audio files can be downloaded from https://zenodo.org/record/2589280 and are covered by their own license.
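
    The YAML layout above maps directly onto standard parsers. A minimal sketch using PyYAML that keeps only captions from annotators above an arbitrary competence threshold; the exact top-level layout of MACS.yaml (a bare list vs. a files: key) should be checked against your download:

      # Minimal sketch: filter MACS captions by estimated annotator competence.
      # Assumes PyYAML is installed; MACS_competence.csv is tab-separated with
      # columns id and competence, as described above.
      import csv
      import yaml

      with open("MACS_competence.csv", newline="", encoding="utf-8") as f:
          competence = {row["id"]: float(row["competence"])
                        for row in csv.DictReader(f, delimiter="\t")}

      with open("MACS.yaml", encoding="utf-8") as f:
          data = yaml.safe_load(f)

      # The annotations may sit at the top level or under a "files:" key.
      items = data if isinstance(data, list) else data.get("files", [])
      for item in items:
          for ann in item["annotations"]:
              if competence.get(ann["annotator_id"], 0.0) > 0.5:  # arbitrary cut-off
                  print(item["filename"], "->", ann["sentence"])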

  5. music-audio-pseudo-captions

    • huggingface.co
    Updated Aug 19, 2023
    Cite
    seungheon.doh (2023). music-audio-pseudo-captions [Dataset]. https://huggingface.co/datasets/seungheondoh/music-audio-pseudo-captions
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 19, 2023
    Authors
    seungheon.doh
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for Music-Audio-Pseudo Captions

    Pseudo music and audio captions from LP-MusicCaps, Music Negation/Temporal Ordering, and WavCaps

      Dataset Summary
    

    Unlike other domains, the music and audio domains lack well-written web caption data, and caption annotation is expensive. We therefore take the music (LP-MusicCaps, Music Negation/Temporal Ordering) and audio (WavCaps) datasets created with ChatGPT and re-organize them in the form of instructions, input… See the full description on the dataset page: https://huggingface.co/datasets/seungheondoh/music-audio-pseudo-captions.
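
    Loading this dataset (and the other huggingface.co entries in this list, e.g. mispeech/MECAT-QA, AudioLLMs/wavcaps_test, cvssp/WavCaps, slseanwu/clotho-chatgpt-mixup-50K) follows the same pattern with the datasets library; split names and features vary per dataset, and some repositories may additionally require a configuration name:

      # Minimal sketch: load a Hugging Face dataset from this list and peek at it.
      # Requires: pip install datasets
      from datasets import load_dataset

      ds = load_dataset("seungheondoh/music-audio-pseudo-captions")
      print(ds)                       # shows the available splits and their features

      first_split = next(iter(ds))    # name of the first split
      print(ds[first_split][0])       # first example in that split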

  6. Multilingual Audio Captioning Glasses Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Jul 5, 2025
    Cite
    Growth Market Reports (2025). Multilingual Audio Captioning Glasses Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/multilingual-audio-captioning-glasses-market
    Available download formats: pptx, csv, pdf
    Dataset updated
    Jul 5, 2025
    Authors
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Multilingual Audio Captioning Glasses Market Outlook

    According to our latest research, the global multilingual audio captioning glasses market size reached USD 1.18 billion in 2024, reflecting a robust surge in demand for advanced assistive and wearable technologies. The market is expected to grow at a CAGR of 18.7% from 2025 to 2033, forecasting a substantial rise to USD 6.09 billion by 2033. This impressive growth is primarily driven by increasing adoption of smart wearable devices, heightened awareness about accessibility solutions, and rapid advancements in artificial intelligence and real-time translation technologies.

    The growth trajectory of the multilingual audio captioning glasses market is underpinned by several pivotal factors. One of the most significant drivers is the global push for inclusivity and accessibility, particularly for individuals with visual and hearing impairments. Governments, educational institutions, and public service providers are increasingly mandating the use of assistive technologies to enhance the quality of life for differently-abled individuals. This regulatory support, coupled with growing social awareness, is prompting manufacturers to innovate and develop more user-friendly, efficient, and cost-effective solutions. Furthermore, the integration of AI-based captioning and real-time translation features is making these devices more versatile, catering to a broader spectrum of users, including travelers, students, and professionals working in multilingual environments.

    Another crucial growth factor is the rapid evolution of core technologies such as speech recognition, natural language processing, and machine learning. These advancements have significantly improved the accuracy, speed, and contextual relevance of audio captioning in multiple languages, making the devices highly reliable for real-time applications. The proliferation of 5G networks and cloud computing infrastructure has further enhanced the capability of these glasses, enabling seamless streaming and processing of audio data. As a result, the market is witnessing a surge in demand not only from the healthcare and education sectors but also from consumer electronics, corporate, and public service domains, where multilingual communication is essential for operational efficiency and user engagement.

    The increasing penetration of smart devices and the growing trend of digitization across industries are also contributing to the expansion of the multilingual audio captioning glasses market. As consumers become more tech-savvy, there is a heightened expectation for devices that offer both convenience and enhanced functionality. The ability to receive real-time audio captions and translations through wearable glasses is particularly appealing in globalized workplaces, multicultural educational settings, and international travel scenarios. Additionally, the COVID-19 pandemic has accelerated the adoption of remote communication tools, further highlighting the need for accessible and inclusive technology solutions. This shift in consumer behavior is expected to sustain the market’s momentum over the coming years.

    From a regional perspective, North America currently dominates the multilingual audio captioning glasses market, accounting for over 38% of the global revenue in 2024, followed by Europe and Asia Pacific. The high adoption rate in North America can be attributed to strong technological infrastructure, favorable government policies, and the presence of leading market players. However, Asia Pacific is poised to witness the fastest growth during the forecast period, driven by increasing investments in healthcare and education, rising disposable incomes, and a large population base with diverse linguistic needs. Europe remains a significant market, supported by robust regulatory frameworks and active initiatives to promote digital inclusivity.

    Product Type Analysis

    The product type segment of the multilingual audio captioning glasses market is broadly categorized

  7. MECAT-QA

    • huggingface.co
    Updated Aug 2, 2025
    + more versions
    Cite
    Horizon Team, Xiaomi MiLM Plus (2025). MECAT-QA [Dataset]. https://huggingface.co/datasets/mispeech/MECAT-QA
    Dataset updated
    Aug 2, 2025
    Dataset authored and provided by
    Horizon Team, Xiaomi MiLM Plus
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks

    📖 Paper | 🛠️ GitHub | 🔊 MECAT-Caption Dataset | 🔊 MECAT-QA Dataset

      Dataset Description
    

    MECAT (Multi-Expert Chain for Audio Tasks) is a comprehensive benchmark constructed on large-scale data to evaluate machine understanding of audio content through two core tasks:

    • Audio Captioning: generating textual descriptions for given audio
    • Audio Question Answering: answering questions… See the full description on the dataset page: https://huggingface.co/datasets/mispeech/MECAT-QA.

  8. wavcaps_test

    • huggingface.co
    Updated Jul 22, 2024
    + more versions
    Cite
    AudioLLMs (2024). wavcaps_test [Dataset]. https://huggingface.co/datasets/AudioLLMs/wavcaps_test
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jul 22, 2024
    Dataset authored and provided by
    AudioLLMs
    Description

    @article{mei2024wavcaps,
      title={Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research},
      author={Mei, Xinhao and Meng, Chutong and Liu, Haohe and Kong, Qiuqiang and Ko, Tom and Zhao, Chengqi and Plumbley, Mark D and Zou, Yuexian and Wang, Wenwu},
      journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
      year={2024},
      publisher={IEEE}
    }

    @article{wang2024audiobench, title={AudioBench: A Universal Benchmark for… See the full description on the dataset page: https://huggingface.co/datasets/AudioLLMs/wavcaps_test.

  9. WavCaps

    • huggingface.co
    Updated Apr 13, 2023
    Cite
    Centre for Vision, Speech and Signal Processing - University of Surrey (2023). WavCaps [Dataset]. https://huggingface.co/datasets/cvssp/WavCaps
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Apr 13, 2023
    Dataset authored and provided by
    Centre for Vision, Speech and Signal Processing - University of Surrey
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    WavCaps

    WavCaps is a ChatGPT-assisted weakly-labelled audio captioning dataset for audio-language multimodal research, where the audio clips are sourced from three websites (FreeSound, BBC Sound Effects, and SoundBible) and a sound event detection dataset (AudioSet Strongly-labelled Subset).

    Paper: https://arxiv.org/abs/2303.17395
    Github: https://github.com/XinhaoMei/WavCaps

      Statistics
    

    A per-source statistics table (columns: Data Source, audio, avg. audio duration (s), avg. text length) is truncated here, beginning with FreeSound… See the full description on the dataset page: https://huggingface.co/datasets/cvssp/WavCaps.

  10. Clotho Analysis Set

    • zenodo.org
    zip
    Updated Jun 3, 2022
    Cite
    Felix Gontier; Romain Serizel; Huang Xie; Samuel Lipping; Tuomas Virtanen; Konstantinos Drossos (2022). Clotho Analysis Set [Dataset]. http://doi.org/10.5281/zenodo.6604109
    Available download formats: zip
    Dataset updated
    Jun 3, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Felix Gontier; Romain Serizel; Huang Xie; Samuel Lipping; Tuomas Virtanen; Konstantinos Drossos
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is derived from the evaluation subset of the Clotho dataset. It is designed to analyze the behavior of captioning systems under certain perturbations, in order to identify some open challenges in automated audio captioning. The original audio clips are transformed with audio_degrader. The transformations applied are the following (a minimal mixing sketch follows the list):

    • Microphone response simulation

    • Mixup with another clip from the dataset (ratio -6dB, -3dB and 0dB)

    • Additive noise from DESED (ratio -12dB, -6dB, 0dB)
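
    The mixup and additive-noise transformations are parameterized by a ratio in dB between the original clip and the added signal. Below is a minimal numpy sketch of mixing at a given ratio; it illustrates the idea and is not the authors' audio_degrader pipeline:

      # Minimal sketch: mix a second signal into a clip at a given ratio in dB
      # (e.g. -6, -3 or 0 dB as listed above). Not the audio_degrader tool itself.
      import numpy as np

      def _rms(s: np.ndarray) -> float:
          return float(np.sqrt(np.mean(s ** 2) + 1e-12))

      def mix_at_ratio(x: np.ndarray, other: np.ndarray, ratio_db: float) -> np.ndarray:
          """Scale `other` so that RMS(x)/RMS(scaled other) equals ratio_db, then add."""
          n = min(len(x), len(other))
          x, other = x[:n], other[:n]
          gain = _rms(x) / (_rms(other) * 10 ** (ratio_db / 20))
          return x + gain * other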

  11. Flickr 8k Audio Caption Corpus

    • kaggle.com
    Updated Apr 27, 2023
    Cite
    Chirag Chauhan (2023). Flickr 8k Audio Caption Corpus [Dataset]. https://www.kaggle.com/datasets/warcoder/flickr-8k-audio-caption-corpus/data
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Apr 27, 2023
    Dataset provided by
    Kaggle
    Authors
    Chirag Chauhan
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The wavs/ directory contains 40,000 spoken audio captions in .wav audio format, one for each caption included in the train, dev, and test splits of the original Flickr 8k corpus (as defined by the files Flickr_8k.trainImages.txt, Flickr_8k.devImages.txt, and Flickr_8k.testImages.txt).

    The audio is sampled at 16000 Hz with 16-bit depth, and stored in Microsoft WAVE audio format

    The file wav2capt.txt contains a mapping from the .wav file names to the corresponding .jpg images and the caption number. The .jpg file names and caption numbers can then be mapped to the caption text via the Flickr8k.token.txt file from the original Flickr 8k corpus.

    The file wav2spk.txt contains a mapping from the .wav file names to its speaker. Each unique speaker is numbered consecutively from 1 to 183 (the total number of unique speakers).
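
    Both mapping files are plain text keyed by the .wav file name, so they can be joined directly. A minimal sketch, assuming whitespace-separated fields as described above (check a few lines of your copy to confirm the exact layout):

      # Minimal sketch: join wav2capt.txt and wav2spk.txt on the .wav name.
      # Field layout is an assumption based on the description above.
      wav2spk = {}
      with open("wav2spk.txt", encoding="utf-8") as f:
          for line in f:
              wav, spk = line.split()
              wav2spk[wav] = int(spk)   # speakers numbered 1..183

      with open("wav2capt.txt", encoding="utf-8") as f:
          for line in f:
              wav, jpg, capt_no = line.split()   # e.g. foo_0.wav foo.jpg #0
              print(wav, jpg, capt_no, "speaker", wav2spk.get(wav))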

    Citing:

    D. Harwath and J. Glass, "Deep Multimodal Semantic Embeddings for Speech and Images," 2015 IEEE Automatic Speech Recognition and Understanding Workshop, pp. 237-244, Scottsdale, Arizona, USA, December 2015

    M. Hodosh, P. Young and J. Hockenmaier (2013) "Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics", Journal of Artificial Intelligence Research, Volume 47, pages 853-899 https://www.jair.org/index.php/jair/article/view/10833/25854

  12. spectrogram-captions

    • huggingface.co
    Updated Dec 25, 2023
    Cite
    Tim Vučina (2023). spectrogram-captions [Dataset]. https://huggingface.co/datasets/vucinatim/spectrogram-captions
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Dec 25, 2023
    Authors
    Tim Vučina
    License

    Academic Free License 3.0 (AFL 3.0): https://choosealicense.com/licenses/afl-3.0/

    Description

    Dataset of captioned spectrograms (text describing the sound).

  13. Clotho v2 - Dataset - LDM

    • service.tib.eu
    Updated Dec 3, 2024
    + more versions
    Cite
    (2024). Clotho v2 - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/clotho-v2
    Dataset updated
    Dec 3, 2024
    Description

    Automated audio captioning is a cross-modal translation task for describing the content of audio clips with natural language sentences.

  14. Audio Caption Hospital Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 10, 2022
    Cite
    Mengyue Wu (2022). Audio Caption Hospital Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3715276
    Dataset updated
    Jan 10, 2022
    Dataset provided by
    Kai Yu
    Heinrich Dinkel
    Mengyue Wu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset consists of the Hospital scene of our Audio Caption dataset. Details can be seen in our paper Audio Caption: Listen and Tell, published at ICASSP 2019.

  15. SiVi-CAFE dataset - Sighted and Visually-impaired Captions for Audio in Finnish and English

    • zenodo.org
    csv
    Updated Jun 12, 2024
    Cite
    Irene Martin Morato; Manu Harju; Maija Hirvonen; Annamaria Mesaros (2024). SiVi-CAFE dataset - Sighted and Visually-impaired Captions for Audio in Finnish and English [Dataset]. http://doi.org/10.5281/zenodo.11505823
    Available download formats: csv
    Dataset updated
    Jun 12, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Irene Martin Morato; Manu Harju; Maija Hirvonen; Annamaria Mesaros
    Description

    This is a dataset containing audio captions for audio files of the TAU Urban Acoustic Scenes 2019 development dataset (airport, public square, and park) for 10 cities.

    The files were annotated using a web-based tool as presented in:

    Martin Morato, I., & Mesaros, A. (2021). Diversity and bias in audio captioning datasets. In F. Font, A. Mesaros, D. P.W. Ellis, E. Fonseca, M. Fuentes, & B. Elizalde (Eds.), Proceedings of the 6th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2021) (pp. 90-94)

    Each file is annotated by multiple annotators who provided a one-sentence description of the audio content.


    Data is provided in csv files:

    • sighted-EN-bias-original
    • sighted-FI-bias-translated
    • sighted-EN-no_bias-original
    • sighted-FI-no_bias-translated
    • visually_impaired-FI-original
    • visually_impaired-EN-translated
    • sighted-FI-original
    • sighted-EN-translated

    original = original descriptions, non-translated
    translated = descriptions translated using an automatic deep learning tool

    900 annotated audio files with Finnish audio descriptions provided by visually-impaired and sighted people.
    2050 annotated audio files with English audio descriptions provided by international students (not necessarily native English speakers).
    3930 annotated audio files with English audio descriptions provided by international students (not necessarily native English speakers), biased by the provided audio tags.

    The audio files can be downloaded from https://zenodo.org/record/2589280 and are covered by their own license.
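
    To recover the multiple captions per clip, each CSV can be grouped by audio file name. A minimal sketch; the .csv extension and the column names file_name and caption are hypothetical, so check the actual headers first:

      # Minimal sketch: group captions per audio file in one SiVi-CAFE CSV.
      # Column names file_name/caption are hypothetical; adjust to the real header.
      import csv
      from collections import defaultdict

      captions = defaultdict(list)
      with open("sighted-EN-no_bias-original.csv", newline="", encoding="utf-8") as f:
          for row in csv.DictReader(f):
              captions[row["file_name"]].append(row["caption"])

      for fname, caps in list(captions.items())[:3]:
          print(fname, f"({len(caps)} captions)")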

  16. clotho-chatgpt-mixup-50K

    • huggingface.co
    Updated Nov 30, 2024
    Cite
    Shih-Lun Wu (2024). clotho-chatgpt-mixup-50K [Dataset]. https://huggingface.co/datasets/slseanwu/clotho-chatgpt-mixup-50K
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Nov 30, 2024
    Authors
    Shih-Lun Wu
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Summary

    This is an artifact corresponding to Section 2.3 of the following paper:

    Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation
    Shih-Lun Wu, Xuankai Chang, Gordon Wichern, Jee-weon Jung, François Germain, Jonathan Le Roux, and Shinji Watanabe
    Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP) 2024

      Upstream Dataset
    

    The original captions come from the development… See the full description on the dataset page: https://huggingface.co/datasets/slseanwu/clotho-chatgpt-mixup-50K.

  17. Song Describer Dataset

    • data.niaid.nih.gov
    • huggingface.co
    • +1more
    Updated Jul 10, 2024
    Cite
    Won, Minz (2024). Song Describer Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10072000
    Dataset updated
    Jul 10, 2024
    Dataset provided by
    Won, Minz
    Bogdanov, Dmitry
    Tovstogan, Philip
    Weck, Benno
    Manco, Ilaria
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation

    "A retro-futurist drum machine groove drenched in bubbly synthetic sound effects and a hint of an acid bassline." (example caption)

    The Song Describer Dataset (SDD) contains ~1.1k captions for 706 permissively licensed music recordings. It is designed for use in the evaluation of models that address music-and-language (M&L) tasks such as music captioning, text-to-music generation and music-language retrieval. More information about the data, collection method and validation is provided in the paper describing the dataset. If you use this dataset, please cite our paper:

    The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation. Manco, Ilaria; Weck, Benno; Doh, Seungheon; Won, Minz; Zhang, Yixiao; Bogdanov, Dmitry; Wu, Yusong; Chen, Ke; Tovstogan, Philip; Benetos, Emmanouil; Quinton, Elio; Fazekas, György; Nam, Juhan. Machine Learning for Audio Workshop at NeurIPS 2023, 2023.

  18. VAAC

    • huggingface.co
    Updated Jun 20, 2024
    Cite
    T (2024). VAAC [Dataset]. https://huggingface.co/datasets/ResearcherT98/VAAC
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jun 20, 2024
    Authors
    T
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for Video and Audio Aligned Caption Dataset (VAAC)

    A dataset that contains different captions for videos with audio.

      Dataset Details
    

    We present a framework for annotating videos with audiovisual textual descriptions. Our three-step process involves generating auditory captions from sounds using an audio captioner, generating visual captions from the video content using a video captioner, and using concatenation or instruction fine-tuned large language models… See the full description on the dataset page: https://huggingface.co/datasets/ResearcherT98/VAAC.

  19. WavCaps Dataset

    • library.toponeai.link
    Updated Mar 2, 2025
    Cite
    Xinhao Mei; Chutong Meng; Haohe Liu; Qiuqiang Kong; Tom Ko; Chengqi Zhao; Mark D. Plumbley; Yuexian Zou; Wenwu Wang (2025). WavCaps Dataset [Dataset]. https://library.toponeai.link/dataset/wavcaps
    Dataset updated
    Mar 2, 2025
    Authors
    Xinhao Mei; Chutong Meng; Haohe Liu; Qiuqiang Kong; Tom Ko; Chengqi Zhao; Mark D. Plumbley; Yuexian Zou; Wenwu Wang
    Description

    A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research.

  20. Flickr Audio Caption Corpus

    • opendatalab.com
    zip
    Cite
    Massachusetts Institute of Technology, Flickr Audio Caption Corpus [Dataset]. https://opendatalab.com/OpenDataLab/Flickr_Audio_Caption_Corpus
    Available download formats: zip (5,323,839,445 bytes)
    Dataset provided by
    Massachusetts Institute of Technology
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The Flickr 8k Audio Caption Corpus contains 40,000 spoken captions of 8,000 natural images. It was collected in 2015 to investigate multimodal learning schemes for unsupervised speech pattern discovery. For a description of the corpus, see: D. Harwath and J. Glass, "Deep Multimodal Semantic Embeddings for Speech and Images," 2015 IEEE Automatic Speech Recognition and Understanding Workshop, pp. 237-244, Scottsdale, Arizona, USA, December 2015
