59 datasets found
  1. Data from: Clotho dataset

    • explore.openaire.eu
    • data.niaid.nih.gov
    • +1more
    Updated Oct 15, 2019
    + more versions
    Cite
    Konstantinos Drossos; Samuel Lipping; Tuomas Virtanen (2019). Clotho dataset [Dataset]. http://doi.org/10.5281/zenodo.3490683
    Explore at:
    Dataset updated
    Oct 15, 2019
    Authors
    Konstantinos Drossos; Samuel Lipping; Tuomas Virtanen
    Description

    Clotho is a novel audio captioning dataset, consisting of 4981 audio samples, each with five captions (a total of 24,905 captions). Audio samples are 15 to 30 s long and captions are eight to 20 words long. Clotho is thoroughly described in our paper:

    K. Drossos, S. Lipping and T. Virtanen, "Clotho: an Audio Captioning Dataset," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 736-740, doi: 10.1109/ICASSP40776.2020.9052990, available online at https://arxiv.org/abs/1910.09387 and https://ieeexplore.ieee.org/document/9052990.

    If you use Clotho, please cite our paper. To use the dataset, you can use our code at https://github.com/audio-captioning/clotho-dataset. These are the files for the development and evaluation splits of the Clotho dataset.

    == Usage ==

    To use the dataset you have to:

    • Download the audio files: clotho_audio_development.7z and clotho_audio_evaluation.7z
    • Download the files with the captions: clotho_captions_development.csv and clotho_captions_evaluation.csv
    • Download the files with the associated metadata: clotho_metadata_development.csv and clotho_metadata_evaluation.csv
    • Extract the audio files

    Then you can use each audio file with its corresponding captions; a minimal loading sketch follows this description.

    == License ==

    The audio files in the archives clotho_audio_development.7z and clotho_audio_evaluation.7z, and the associated metadata in the CSV files clotho_metadata_development.csv and clotho_metadata_evaluation.csv, are under the corresponding licences (mostly Creative Commons with attribution) of the Freesound [1] platform, stated explicitly in the CSV files for each audio file. That is, each audio file in the 7z archives is listed in the CSV files with its metadata. The metadata for each file are:

    • File name
    • Keywords
    • URL for the original audio file
    • Start and ending samples for the excerpt that is used in the Clotho dataset
    • Uploader/user in the Freesound platform (manufacturer)
    • Link to the licence of the file

    The captions in the files clotho_captions_development.csv and clotho_captions_evaluation.csv are under the Tampere University licence, described in the LICENCE file (mainly a non-commercial with attribution licence).

    == References ==

    [1] Frederic Font, Gerard Roma, and Xavier Serra. 2013. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia (MM '13). ACM, New York, NY, USA, 411-412. DOI: https://doi.org/10.1145/2502081.2502245
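    A minimal sketch of the pairing step, assuming the audio archive was extracted to a local folder and that the captions CSV uses the columns file_name and caption_1 ... caption_5 (check the downloaded file before relying on these names):

    # Minimal sketch: pair each extracted Clotho development clip with its five captions.
    # Assumes clotho_audio_development.7z was extracted into ./clotho_audio_development/
    # and that the CSV columns are file_name, caption_1 ... caption_5 (an assumption).
    import os
    import pandas as pd

    captions = pd.read_csv("clotho_captions_development.csv")
    audio_dir = "clotho_audio_development"

    for _, row in captions.head(3).iterrows():
        audio_path = os.path.join(audio_dir, row["file_name"])
        caps = [row[f"caption_{i}"] for i in range(1, 6)]
        print(audio_path, caps)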

  2. Flickr Audio Caption Corpus Dataset

    • paperswithcode.com
    • opendatalab.com
    Cite
    Flickr Audio Caption Corpus Dataset [Dataset]. https://paperswithcode.com/dataset/flickr-audio-caption-corpus
    Explore at:
    Description

    The Flickr 8k Audio Caption Corpus contains 40,000 spoken captions of 8,000 natural images. It was collected in 2015 to investigate multimodal learning schemes for unsupervised speech pattern discovery. For a description of the corpus, see:

    D. Harwath and J. Glass, "Deep Multimodal Semantic Embeddings for Speech and Images," 2015 IEEE Automatic Speech Recognition and Understanding Workshop, pp. 237-244, Scottsdale, Arizona, USA, December 2015

  3. music-audio-pseudo-captions

    • huggingface.co
    Updated Aug 19, 2023
    Cite
    seungheon.doh (2023). music-audio-pseudo-captions [Dataset]. https://huggingface.co/datasets/seungheondoh/music-audio-pseudo-captions
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 19, 2023
    Authors
    seungheon.doh
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset Card for Music-Audio-Pseudo Captions

    Pseudo music and audio captions from LP-MusicCaps, Music Negation/Temporal Ordering, and WavCaps

      Dataset Summary
    

    Unlike some other domains, the music and audio domains lack well-written web caption data, and caption annotation is expensive. Therefore, we use the Music (LP-MusicCaps, Music Negation/Temporal Ordering) and Audio (WavCaps) datasets created with ChatGPT and re-organize them in the form of instructions, input… See the full description on the dataset page: https://huggingface.co/datasets/seungheondoh/music-audio-pseudo-captions.
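    A hedged sketch of loading this dataset with the Hugging Face datasets library; the split name and record fields are assumptions, so inspect the returned object before using it:

    # Hedged sketch: load the pseudo-caption records from the Hugging Face Hub.
    # The split name "train" and the record fields are assumptions; check the dataset page.
    from datasets import load_dataset

    ds = load_dataset("seungheondoh/music-audio-pseudo-captions", split="train")
    print(ds)      # inspect the actual columns
    print(ds[0])   # one instruction-style record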

  4. Audio captioning DCASE 2020 evaluation (testing) split

    • data.niaid.nih.gov
    • explore.openaire.eu
    Updated Dec 8, 2020
    Cite
    Tuomas Virtanen (2020). Audio captioning DCASE 2020 evaluation (testing) split [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3865657
    Explore at:
    Dataset updated
    Dec 8, 2020
    Dataset provided by
    Konstantinos Drossos
    Samuel Lipping
    Tuomas Virtanen
    Description

    This is the evaluation split for Task 6, Automated Audio Captioning, in DCASE 2020 Challenge.

    This evaluation split is the Clotho testing split, which is thoroughly described in the corresponding paper:

    K. Drossos, S. Lipping and T. Virtanen, "Clotho: an Audio Captioning Dataset," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 736-740, doi: 10.1109/ICASSP40776.2020.9052990.

    available online at: https://arxiv.org/abs/1910.09387 and at: https://ieeexplore.ieee.org/document/9052990

    This evaluation split is meant to be used for the purposes of Task 6 at the scientific challenge DCASE 2020. It is not meant to be used for developing audio captioning methods; for that, you should use the development and evaluation splits of Clotho.

    If you want the development and evaluation splits of Clotho dataset, you can find them also in Zenodo, at: https://zenodo.org/record/3490684

    == License ==

    The audio files in the archives:

    clotho_audio_test.7z

    and the associated meta-data in the CSV file:

    clotho_metadata_test.csv

    are under the corresponding licences (mostly Creative Commons with attribution) of the Freesound [1] platform, stated explicitly in the CSV file for each audio file. That is, each audio file in the 7z archive is listed in the CSV file with its metadata. The metadata for each file are:

    File name

    Start and ending samples for the excerpt that is used in the Clotho dataset

    Uploader/user in the Freesound platform (manufacturer)

    Link to the licence of the file

    == References == [1] Frederic Font, Gerard Roma, and Xavier Serra. 2013. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia (MM '13). ACM, New York, NY, USA, 411-412. DOI: https://doi.org/10.1145/2502081.2502245

  5. Audio Caption Dataset (Hospital & Car)

    • explore.openaire.eu
    Updated May 18, 2019
    Cite
    Mengyue Wu; Xuenan Xu; Heinrich Dinkel; Kai Yu (2019). Audio Caption Dataset (Hospital & Car) [Dataset]. http://doi.org/10.5281/zenodo.3715276
    Explore at:
    Dataset updated
    May 18, 2019
    Authors
    Mengyue Wu; Xuenan Xu; Heinrich Dinkel; Kai Yu
    Description

    This dataset consists of the Hospital and Car scenes of our Audio Caption dataset. The Hospital scene is detailed in our paper "Audio Caption: Listen and Tell", published at ICASSP 2019; the Car scene is detailed in "Audio Caption in a Car Setting with a Sentence-Level Loss", published at ISCSLP 2021. The original captions are in Mandarin Chinese, with English translations provided.

  6. Clotho Analysis Set

    • zenodo.org
    zip
    Updated Jun 3, 2022
    Cite
    Felix Gontier; Romain Serizel; Huang Xie; Samuel Lipping; Tuomas Virtanen; Konstantinos Drossos (2022). Clotho Analysis Set [Dataset]. http://doi.org/10.5281/zenodo.6610709
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 3, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Felix Gontier; Romain Serizel; Huang Xie; Samuel Lipping; Tuomas Virtanen; Konstantinos Drossos
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is derived from the evaluation subset of the Clotho dataset. It is designed to analyze the behavior of captioning systems under certain perturbations, in order to identify some open challenges in automated audio captioning. The original audio clips are transformed with audio_degrader. The transformations applied are the following (a rough mixup illustration follows the list):

    • Microphone response simulation

    • Mixup with another clip from the dataset (ratio -6dB, -3dB and 0dB)

    • Additive noise from DESED (ratio -12dB, -6dB, 0dB)
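    As a rough illustration of the mixup perturbation, the sketch below mixes a second clip into the first at a chosen power ratio in dB; "ratio" is interpreted here as the power ratio between the original and the added clip, which may differ from the exact audio_degrader recipe:

    # Hedged sketch: mix clip2 into clip1 at a target power ratio (in dB).
    # Interprets ratio_db as 10*log10(P_clip1 / P_scaled_clip2); the exact
    # audio_degrader parameters may differ.
    import numpy as np

    def mixup(clip1: np.ndarray, clip2: np.ndarray, ratio_db: float) -> np.ndarray:
        n = min(len(clip1), len(clip2))
        x1, x2 = clip1[:n], clip2[:n]
        p1 = np.mean(x1 ** 2)
        p2 = np.mean(x2 ** 2) + 1e-12
        scale = np.sqrt(p1 / (p2 * 10 ** (ratio_db / 10)))
        return x1 + scale * x2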

  7. WavCaps

    • huggingface.co
    Updated Apr 13, 2023
    Cite
    Centre for Vision, Speech and Signal Processing - University of Surrey (2023). WavCaps [Dataset]. https://huggingface.co/datasets/cvssp/WavCaps
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 13, 2023
    Dataset authored and provided by
    Centre for Vision, Speech and Signal Processing - University of Surrey
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    WavCaps

    WavCaps is a ChatGPT-assisted weakly-labelled audio captioning dataset for audio-language multimodal research, where the audio clips are sourced from three websites (FreeSound, BBC Sound Effects, and SoundBible) and a sound event detection dataset (AudioSet Strongly-labelled Subset).

    Paper: https://arxiv.org/abs/2303.17395 Github: https://github.com/XinhaoMei/WavCaps

      Statistics

    The dataset page tabulates, per data source, the number of audio clips, the average audio duration (s), and the average text length. FreeSound… See the full description on the dataset page: https://huggingface.co/datasets/cvssp/WavCaps.

  8. Data from: MACS - Multi-Annotator Captioned Soundscapes

    • explore.openaire.eu
    • producciocientifica.uv.es
    • +1more
    Updated Jul 22, 2021
    + more versions
    Cite
    Irene Martin Morato; Annamaria Mesaros (2021). MACS - Multi-Annotator Captioned Soundscapes [Dataset]. http://doi.org/10.5281/zenodo.5114770
    Explore at:
    Dataset updated
    Jul 22, 2021
    Authors
    Irene Martin Morato; Annamaria Mesaros
    Description

    This is a dataset containing audio captions and corresponding audio tags for 3930 audio files of the TAU Urban Acoustic Scenes 2019 development dataset (airport, public square, and park). The files were annotated using a web-based tool. Each file is annotated by multiple annotators who provided tags and a one-sentence description of the audio content. The data also includes annotator competence estimated using MACE (Multi-Annotator Competence Estimation).

    The annotation procedure, processing and analysis of the data are presented in the following papers:

    • Irene Martin-Morato, Annamaria Mesaros. What is the ground truth? Reliability of multi-annotator data for audio tagging, 29th European Signal Processing Conference, EUSIPCO 2021
    • Irene Martin-Morato, Annamaria Mesaros. Diversity and bias in audio captioning datasets, submitted to DCASE 2021 Workshop (to be updated with arxiv link)

    Data is provided as two files:

    MACS.yaml - contains the complete annotations in the following format:

    - filename: file1.wav
      annotations:
        - annotator_id: ann_1
          sentence: caption text
          tags:
            - tag1
            - tag2
        - annotator_id: ann_2
          sentence: caption text
          tags:
            - tag1

    MACS_competence.csv - contains the estimated annotator competence; for each annotator_id in the yaml file, competence is a number between 0 (considered as annotating at random) and 1, one line per annotator:

    id [tab] competence

    The audio files can be downloaded from https://zenodo.org/record/2589280 and are covered by their own license. A minimal parsing sketch follows this description.
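    A minimal sketch of reading the two files, assuming PyYAML is installed, that MACS.yaml parses to the list-of-files structure shown above, and that MACS_competence.csv is tab-separated:

    # Minimal sketch: read MACS annotations and annotator competence.
    # Assumptions: PyYAML installed; MACS.yaml parses to a list of file entries as
    # shown above (or a dict with a "files" key); MACS_competence.csv is id<TAB>competence.
    import csv
    import yaml

    with open("MACS.yaml") as f:
        macs = yaml.safe_load(f)
    files = macs["files"] if isinstance(macs, dict) else macs

    competence = {}
    with open("MACS_competence.csv") as f:
        for row in csv.reader(f, delimiter="\t"):
            try:
                competence[row[0]] = float(row[1])
            except (IndexError, ValueError):
                pass  # skip a possible header row

    first = files[0]
    for ann in first["annotations"]:
        print(first["filename"], ann["annotator_id"], ann["sentence"],
              competence.get(ann["annotator_id"]))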

  9. Clotho v2 - Dataset - LDM

    • service.tib.eu
    Updated Dec 3, 2024
    + more versions
    Cite
    (2024). Clotho v2 - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/clotho-v2
    Explore at:
    Dataset updated
    Dec 3, 2024
    Description

    Automated audio captioning is a cross-modal translation task for describing the content of audio clips with natural language sentences.

  10. Song Describer Dataset

    • zenodo.org
    • huggingface.co
    • +1more
    csv, pdf, tsv, txt +1
    Updated Jul 10, 2024
    Cite
    Ilaria Manco; Benno Weck; Dmitry Bogdanov; Philip Tovstogan; Minz Won (2024). Song Describer Dataset [Dataset]. http://doi.org/10.5281/zenodo.10072001
    Explore at:
    Available download formats: tsv, csv, zip, txt, pdf
    Dataset updated
    Jul 10, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ilaria Manco; Benno Weck; Dmitry Bogdanov; Philip Tovstogan; Minz Won
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation

    A retro-futurist drum machine groove drenched in bubbly synthetic sound effects and a hint of an acid bassline.

    The Song Describer Dataset (SDD) contains ~1.1k captions for 706 permissively licensed music recordings. It is designed for use in evaluation of models that address music-and-language (M&L) tasks such as music captioning, text-to-music generation and music-language retrieval. More information about the data, collection method and validation is provided in the paper describing the dataset.

    If you use this dataset, please cite our paper:

    The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation, Manco, Ilaria and Weck, Benno and Doh, Seungheon and Won, Minz and Zhang, Yixiao and Bogdanov, Dmitry and Wu, Yusong and Chen, Ke and Tovstogan, Philip and Benetos, Emmanouil and Quinton, Elio and Fazekas, György and Nam, Juhan, Machine Learning for Audio Workshop at NeurIPS 2023, 2023

  11. wavcaps_test

    • huggingface.co
    Updated Jul 22, 2024
    + more versions
    Cite
    AudioLLMs (2024). wavcaps_test [Dataset]. https://huggingface.co/datasets/AudioLLMs/wavcaps_test
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 22, 2024
    Dataset authored and provided by
    AudioLLMs
    Description

    @article{mei2024wavcaps,
      title={Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research},
      author={Mei, Xinhao and Meng, Chutong and Liu, Haohe and Kong, Qiuqiang and Ko, Tom and Zhao, Chengqi and Plumbley, Mark D and Zou, Yuexian and Wang, Wenwu},
      journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
      year={2024},
      publisher={IEEE}
    }

    @article{wang2024audiobench, title={AudioBench: A Universal Benchmark for… See the full description on the dataset page: https://huggingface.co/datasets/AudioLLMs/wavcaps_test.

  12. Audio Caption Hospital Dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, zip
    Updated Jan 10, 2022
    Cite
    Mengyue Wu; Heinrich Dinkel; Kai Yu (2022). Audio Caption Hospital Dataset [Dataset]. http://doi.org/10.5281/zenodo.4671263
    Explore at:
    Available download formats: bin, zip
    Dataset updated
    Jan 10, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mengyue Wu; Heinrich Dinkel; Kai Yu
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset consists of the Hospital scene of our Audio Caption dataset. Details can be seen in our paper Audio Caption: Listen and Tell published at ICASSP2019.

  13. SPEECH-COCO

    • zenodo.org
    • explore.openaire.eu
    • +1more
    xz, zip
    Updated Nov 24, 2020
    + more versions
    Cite
    William N. Havard; Laurent Besacier (2020). SPEECH-COCO [Dataset]. http://doi.org/10.5281/zenodo.4282267
    Explore at:
    Available download formats: zip, xz
    Dataset updated
    Nov 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    William N. Havard; Laurent Besacier
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SpeechCoco

    Introduction

    Our corpus is an extension of the MS COCO image recognition and captioning dataset. MS COCO comprises images paired with a set of five captions. Yet, it does not include any speech. Therefore, we used Voxygen's text-to-speech system to synthesise the available captions.

    The addition of speech as a new modality enables MSCOCO to be used for research in the fields of language acquisition, unsupervised term discovery, keyword spotting, and semantic embedding using speech and vision.

    Our corpus is licensed under a Creative Commons Attribution 4.0 License.

    Data Set

    • This corpus contains 616,767 spoken captions from MSCOCO's val2014 and train2014 subsets (respectively 414,113 for train2014 and 202,654 for val2014).

    • We used 8 different voices. 4 of them have a British accent (Paul, Bronwen, Judith, and Elizabeth) and the 4 others have an American accent (Phil, Bruce, Amanda, Jenny).

    • In order to make the captions sound more natural, we used the SoX tempo command, which changes the speed without changing the pitch. One third of the captions are 10% slower than the original pace and one third are 10% faster; the last third was kept untouched.

    • We also modified approximately 30% of the original captions and added disfluencies such as "um", "uh", "er" so that the captions would sound more natural.

    • Each WAV file is paired with a JSON file containing various information: timecode of each word in the caption, name of the speaker, name of the WAV file, etc. The JSON files have the following data structure:

    {
      "duration": float,
      "speaker": string,
      "synthesisedCaption": string,
      "timecode": list,
      "speed": float,
      "wavFilename": string,
      "captionID": int,
      "imgID": int,
      "disfluency": list
    }
    • On average, each caption comprises 10.79 tokens, disfluencies included. The WAV files are on average 3.52 seconds long.

    Repository

    The repository is organized as follows:

    • CORPUS-MSCOCO (~75GB once decompressed)

      • train2014/ : folder contains 413,915 captions

        • json/

        • wav/

        • translations/

          • train_en_ja.txt

          • train_translate.sqlite3

        • train_2014.sqlite3

      • val2014/ : folder contains 202,520 captions

        • json/

        • wav/

        • translations/

          • train_en_ja.txt

          • train_translate.sqlite3

        • val_2014.sqlite3

      • speechcoco_API/

        • speechcoco/

          • __init__.py

          • speechcoco.py

        • setup.py

    Filenames

    .wav files contain the spoken version of a caption

    .json files contain all the metadata of a given WAV file

    .sqlite3 files are SQLite databases containing all the information contained in the JSON files

    We adopted the following naming convention for both the WAV and JSON files:

    imageID_captionID_Speaker_DisfluencyPosition_Speed[.wav/.json]
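    For example, a small hypothetical helper (not part of the released API) that splits such a filename into its fields:

    # Hypothetical helper (not part of the released API): split a SpeechCoco
    # filename such as "298817_26763_Phil_None_0-9.wav" into its fields.
    import os

    def parse_speechcoco_name(filename: str) -> dict:
        stem, _ = os.path.splitext(os.path.basename(filename))
        img_id, caption_id, speaker, disfluency_pos, speed = stem.split("_")
        return {
            "imageID": int(img_id),
            "captionID": int(caption_id),
            "speaker": speaker,
            "disfluencyPosition": disfluency_pos,     # e.g. "None", "Middle"
            "speed": float(speed.replace("-", ".")),  # "0-9" encodes 0.9
        }

    print(parse_speechcoco_name("298817_26763_Phil_None_0-9.wav"))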

    Script

    We created a script called speechcoco.py in order to handle the metadata and allow the user to easily find captions according to specific filters. The script uses the *.db files.

    Features:

    • Aggregate all the information in the JSON files into a single SQLite database

    • Find captions according to specific filters (name, gender and nationality of the speaker, disfluency position, speed, duration, and words in the caption). The script automatically builds the SQLite query. The user can also provide his own SQLite query.

    The following Python code returns all the captions spoken by a male with an American accent for which the speed was slowed down by 10% and that contain "keys" at any position

    # create SpeechCoco object
    # (import path assumed from the speechcoco_API layout shown above)
    from speechcoco.speechcoco import SpeechCoco

    db = SpeechCoco('train_2014.sqlite3', 'train_translate.sqlite3', verbose=True)

    # filter captions (returns Caption objects)
    captions = db.filterCaptions(gender="Male", nationality="US", speed=0.9, text='%keys%')
    for caption in captions:
        print('\n{}\t{}\t{}\t{}\t{}\t{}\t\t{}'.format(caption.imageID,
                                                      caption.captionID,
                                                      caption.speaker.name,
                                                      caption.speaker.nationality,
                                                      caption.speed,
                                                      caption.filename,
                                                      caption.text))
    ...
    298817   26763  Phil  0.9   298817_26763_Phil_None_0-9.wav     A group of turkeys with bushes in the background.
    108505   147972 Phil  0.9   108505_147972_Phil_Middle_0-9.wav        Person using a, um, slider cell phone with blue backlit keys.
    258289   154380 Bruce  0.9   258289_154380_Bruce_None_0-9.wav        Some donkeys and sheep are in their green pens .
    545312   201303 Phil  0.9   545312_201303_Phil_None_0-9.wav     A man walking next to a couple of donkeys.
    ...
    • Find all the captions belonging to a specific image

    captions = db.getImgCaptions(298817)
    for caption in captions:
        print('\n{}'.format(caption.text))
    Birds wondering through grassy ground next to bushes.
    A flock of turkeys are making their way up a hill.
    Um, ah. Two wild turkeys in a field walking around.
    Four wild turkeys and some bushes trees and weeds.
    A group of turkeys with bushes in the background.
    • Parse the timecodes and have them structured

    input:

    ...
    [1926.3068, "SYL", ""],
    [1926.3068, "SEPR", " "],
    [1926.3068, "WORD", "white"],
    [1926.3068, "PHO", "w"],
    [2050.7955, "PHO", "ai"],
    [2144.6591, "PHO", "t"],
    [2179.3182, "SYL", ""],
    [2179.3182, "SEPR", " "]
    ...

    output:

    print(caption.timecode.parse())
    ...
    {
    'begin': 1926.3068,
    'end': 2179.3182,
    'syllable': [{'begin': 1926.3068,
           'end': 2179.3182,
           'phoneme': [{'begin': 1926.3068,
                  'end': 2050.7955,
                  'value': 'w'},
                 {'begin': 2050.7955,
                  'end': 2144.6591,
                  'value': 'ai'},
                 {'begin': 2144.6591,
                  'end': 2179.3182,
                  'value': 't'}],
           'value': 'wait'}],
    'value': 'white'
    },
    ...
    • Convert the timecodes to Praat TextGrid files

    caption.timecode.toTextgrid(outputDir, level=3)
    • Get the words, syllables and phonemes between n seconds/milliseconds

    The following Python code returns all the words between 0.2 and 0.6 seconds for which at least 50% of the word's total length is within the specified interval

    pprint(caption.getWords(0.20, 0.60, seconds=True, level=1, olapthr=50))
    ...
    404537   827239 Bruce  US   0.9   404537_827239_Bruce_None_0-9.wav        Eyeglasses, a cellphone, some keys and other pocket items are all laid out on the cloth. .
    [
      {
        'begin': 0.0,
        'end': 0.7202778,
        'overlapPercentage': 55.53412863758955,
        'word': 'eyeglasses'
      }
    ]
     ...
    • Get the translations of the selected captions

    As of now, only Japanese translations are available. We also used KyTea to tokenize and tag the captions translated with Google Translate.

    captions = db.getImgCaptions(298817)
    for caption in captions:
        print('\n{}'.format(caption.text))

        # Get translations and POS
        print('\tja_google: {}'.format(db.getTranslation(caption.captionID, "ja_google")))
        print('\t\tja_google_tokens: {}'.format(db.getTokens(caption.captionID, "ja_google")))
        print('\t\tja_google_pos: {}'.format(db.getPOS(caption.captionID, "ja_google")))
        print('\tja_excite: {}'.format(db.getTranslation(caption.captionID, "ja_excite")))

      Birds wondering through grassy ground next to bushes.
      ja_google: 鳥は茂みの下に茂った地面を抱えています。
        ja_google_tokens: 鳥 は 茂み の 下 に 茂 っ た 地面 を 抱え て い ま す 。
        ja_google_pos: 鳥/名詞/とり は/助詞/は 茂み/名詞/しげみ の/助詞/の 下/名詞/した に/助詞/に

  14. Synthetically Spoken COCO

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, bin +2
    Updated Jan 24, 2020
    Cite
    Grzegorz Chrupała; Lieke Gelderloos; Afra Alishahi (2020). Synthetically Spoken COCO [Dataset]. http://doi.org/10.5281/zenodo.400926
    Explore at:
    Available download formats: txt, json, bin, application/gzip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Grzegorz Chrupała; Lieke Gelderloos; Afra Alishahi
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Synthetically Spoken COCO

    Version 1.0

    This dataset contains synthetically generated spoken versions of MS COCO [1] captions. The dataset was created as part of the research reported in [5]. The speech was generated using gTTS [2]. The dataset consists of the following files:

    - dataset.json: Captions associated with MS COCO images. This information comes from [3].
    - sentid.txt: List of caption IDs. This file can be used to locate the MFCC features of the MP3 files in the numpy array stored in dataset.mfcc.npy.
    - mp3.tgz: MP3 files with the audio. Each file name corresponds to a caption ID in dataset.json and in sentid.txt.
    - dataset.mfcc.npy: Numpy array with the Mel Frequency Cepstral Coefficients extracted from the audio. Each row corresponds to a caption. The order of the captions corresponds to the ordering in the file sentid.txt. MFCCs were extracted using [4]. A minimal alignment sketch follows the references below.

    [1] http://mscoco.org/dataset/#overview
    [2] https://pypi.python.org/pypi/gTTS
    [3] https://github.com/karpathy/neuraltalk
    [4] https://github.com/jameslyons/python_speech_features
    [5] https://arxiv.org/abs/1702.01991
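    A minimal sketch of how the files above fit together, assuming the ordering described (one MFCC row per caption ID in sentid.txt); the internal keys of dataset.json are not documented here, so it is only loaded for inspection:

    # Minimal sketch: align caption IDs with their MFCC rows.
    # Assumes sentid.txt, dataset.mfcc.npy and dataset.json sit in the working directory.
    import json
    import numpy as np

    with open("sentid.txt") as f:
        sent_ids = [line.strip() for line in f if line.strip()]

    mfcc = np.load("dataset.mfcc.npy", allow_pickle=True)
    assert len(sent_ids) == len(mfcc), "expected one MFCC row per caption ID"

    with open("dataset.json") as f:
        dataset = json.load(f)        # inspect its structure before use
    print(type(dataset), len(sent_ids))

    row_for_id = {sid: i for i, sid in enumerate(sent_ids)}
    print(mfcc[row_for_id[sent_ids[0]]])  # MFCC features for the first caption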

  15. Audio Accessibility Report

    • marketreportanalytics.com
    doc, pdf, ppt
    Updated Apr 9, 2025
    + more versions
    Cite
    Market Report Analytics (2025). Audio Accessibility Report [Dataset]. https://www.marketreportanalytics.com/reports/audio-accessibility-73846
    Explore at:
    Available download formats: doc, ppt, pdf
    Dataset updated
    Apr 9, 2025
    Dataset authored and provided by
    Market Report Analytics
    License

    https://www.marketreportanalytics.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global audio accessibility market is experiencing robust growth, driven by increasing awareness of inclusivity and technological advancements. The market, estimated at $1.5 billion in 2025, is projected to witness a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033, reaching approximately $5 billion by 2033. Key drivers include rising accessibility regulations, the proliferation of streaming services demanding closed captions and audio descriptions, and the increasing availability of cost-effective audio description software and services. The growing visually impaired and blind population globally further fuels market expansion. Significant market segments include online audio description services, preferred for their scalability and reach, and offline services which offer more tailored and customized solutions. Leading market players are continually innovating, incorporating AI-powered solutions for improved accuracy and efficiency in audio description generation. While the market presents significant opportunities, challenges remain. High implementation costs, particularly for offline services, and the need for skilled professionals to create high-quality audio descriptions can hinder widespread adoption, particularly in developing regions. However, the ongoing evolution of speech-to-text and text-to-speech technologies, alongside reductions in the cost of AI-driven tools, are likely to mitigate some of these restraints in the coming years. The market is geographically diverse, with North America and Europe currently dominating, but significant growth potential exists in emerging markets like Asia Pacific and the Middle East & Africa as awareness and accessibility legislation increases. The increasing integration of audio description within mainstream media and entertainment platforms will be a crucial factor in expanding market penetration and accessibility across all user segments.

  16. spectrogram-captions

    • huggingface.co
    Updated Dec 25, 2023
    Cite
    Tim Vučina (2023). spectrogram-captions [Dataset]. https://huggingface.co/datasets/vucinatim/spectrogram-captions
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 25, 2023
    Authors
    Tim Vučina
    License

    https://choosealicense.com/licenses/afl-3.0/

    Description

    Dataset of captioned spectrograms (text describing the sound).

  17. SoundDescs Dataset

    • paperswithcode.com
    Updated May 8, 2022
    Cite
    A. Sophia Koepke; Andreea-Maria Oncescu; João F. Henriques; Zeynep Akata; Samuel Albanie (2022). SoundDescs Dataset [Dataset]. https://paperswithcode.com/dataset/sounddescs
    Explore at:
    Dataset updated
    May 8, 2022
    Authors
    A. Sophia Koepke; Andreea-Maria Oncescu; João F. Henriques; Zeynep Akata; Samuel Albanie
    Description

    We introduce a new audio dataset called SoundDescs that can be used for tasks such as text-to-audio retrieval and audio captioning. This dataset contains 32,979 pairs of audio files and text descriptions. There are 23 categories in SoundDescs, including but not limited to nature, clocks, and fire.

    SoundDescs can be downloaded from here and retrieval results for this dataset can be found in the associated paper Audio Retrieval with Natural Language Queries: A Benchmark Study.

  18. Clotho-AQA dataset

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +1more
    Updated Apr 22, 2022
    Cite
    Parthasaarathy Sudarsanam (2022). Clotho-AQA dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6473206
    Explore at:
    Dataset updated
    Apr 22, 2022
    Dataset provided by
    Parthasaarathy Sudarsanam
    Konstantinos Drossos
    Samuel Lipping
    Tuomas Virtanen
    Description

    Clotho-AQA is an audio question-answering dataset consisting of 1991 audio samples taken from Clotho dataset [1]. Each audio sample has 6 associated questions collected through crowdsourcing. For each question, the answers are provided by three different annotators making a total of 35,838 question-answer pairs. For each audio sample, 4 questions are designed to be answered with 'yes' or 'no', while the remaining two questions are designed to be answered in a single word. More details about the data collection process and data splitting process can be found in our following paper.

    S. Lipping, P. Sudarsanam, K. Drossos, T. Virtanen, ‘Clotho-AQA: A Crowdsourced Dataset for Audio Question Answering.’ The paper is available online at: https://arxiv.org/abs/2204.09634

    If you use the Clotho-AQA dataset, please cite the paper mentioned above. A sample baseline model for the Clotho-AQA dataset can be found at: https://github.com/partha2409/AquaNet

    To use the dataset,

    • Download and extract ‘audio_files.zip’. This contains all the 1991 audio samples in the dataset.

    • Download ‘clotho_aqa_train.csv’, ‘clotho_aqa_val.csv’, and ‘clotho_aqa_test.csv’. These files contain the train, validation, and test splits, respectively. They contain the audio file name, questions, answers, and confidence scores provided by the annotators.
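    A minimal sketch of inspecting one split after downloading, assuming pandas; the exact column names are not spelled out above, so they are printed before use:

    # Minimal sketch: inspect the Clotho-AQA training split.
    # The column layout is an assumption to verify; it is printed first.
    import pandas as pd

    train = pd.read_csv("clotho_aqa_train.csv")
    print(train.columns.tolist())        # e.g. file name, question, answer, confidence
    print(len(train), "question-answer rows")
    print(train.head())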

    License:

    The audio files in the archive ‘audio_files.zip’ are under the corresponding licenses (mostly Creative Commons with attribution) of the Freesound [2] platform, stated explicitly in the CSV file ‘clotho_aqa_metadata.csv’ for each of the audio files. That is, each audio file in the archive is listed in the CSV file with its metadata. The metadata for each file are:

    • File name

    • Keywords

    • URL for the original audio file

    • Start and ending samples for the excerpt that is used in the Clotho dataset

    • Uploader/user in the Freesound platform (manufacturer)

    • Link to the license of the file.

    The questions and answers in the files:

    • clotho_aqa_train.csv

    • clotho_aqa_val.csv

    • clotho_aqa_test.csv

    are under the MIT license, described in the LICENSE file.

    References:

    [1] K. Drossos, S. Lipping and T. Virtanen, "Clotho: An Audio Captioning Dataset," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 736- 740, doi: 10.1109/ICASSP40776.2020.9052990.

    [2] Frederic Font, Gerard Roma, and Xavier Serra. 2013. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia (MM '13). ACM, New York, NY, USA, 411-412. DOI: https://doi.org/10.1145/2502081.2502245

  19. WavCaps Dataset

    • library.toponeai.link
    • paperswithcode.com
    Updated Apr 30, 2025
    Cite
    Xinhao Mei; Chutong Meng; Haohe Liu; Qiuqiang Kong; Tom Ko; Chengqi Zhao; Mark D. Plumbley; Yuexian Zou; Wenwu Wang (2025). WavCaps Dataset [Dataset]. https://library.toponeai.link/dataset/wavcaps
    Explore at:
    Dataset updated
    Apr 30, 2025
    Authors
    Xinhao Mei; Chutong Meng; Haohe Liu; Qiuqiang Kong; Tom Ko; Chengqi Zhao; Mark D. Plumbley; Yuexian Zou; Wenwu Wang
    Description

    A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research.

  20. AudioCaps Dataset

    • paperswithcode.com
    Updated May 12, 2025
    Cite
    Chris Dongjoo Kim; Byeongchang Kim; Hyunmin Lee; Gunhee Kim (2025). AudioCaps Dataset [Dataset]. https://paperswithcode.com/dataset/audiocaps
    Explore at:
    Dataset updated
    May 12, 2025
    Authors
    Chris Dongjoo Kim; Byeongchang Kim; Hyunmin Lee; Gunhee Kim
    Description

    AudioCaps is a dataset of sounds with event descriptions that was introduced for the task of audio captioning, with sounds sourced from the AudioSet dataset. Annotators were provided the audio tracks together with category hints (and with additional video hints if needed).
