Clotho is a novel audio captioning dataset, consisting of 4981 audio samples, and each audio sample has five captions (a total of 24 905 captions). Audio samples are of 15 to 30 s duration and captions are eight to 20 words long. Clotho is thoroughly described in our paper:

K. Drossos, S. Lipping and T. Virtanen, "Clotho: an Audio Captioning Dataset," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 736-740, doi: 10.1109/ICASSP40776.2020.9052990.

available online at: https://arxiv.org/abs/1910.09387 and at: https://ieeexplore.ieee.org/document/9052990

If you use Clotho, please cite our paper. To use the dataset, you can use our code at: https://github.com/audio-captioning/clotho-dataset

These are the files for the development and evaluation splits of the Clotho dataset.

== Usage ==

To use the dataset you have to:

Download the audio files: clotho_audio_development.7z and clotho_audio_evaluation.7z
Download the files with the captions: clotho_captions_development.csv and clotho_captions_evaluation.csv
Download the files with the associated metadata: clotho_metadata_development.csv and clotho_metadata_evaluation.csv
Extract the audio files

Then you can use each audio file with its corresponding captions (a minimal pairing sketch in Python is given after the References section).

== License ==

The audio files in the archives:

clotho_audio_development.7z
clotho_audio_evaluation.7z

and the associated meta-data in the CSV files:

clotho_metadata_development.csv
clotho_metadata_evaluation.csv

are under the corresponding licences (mostly CreativeCommons with attribution) of the Freesound [1] platform, mentioned explicitly in the CSV files for each of the audio files. That is, each audio file in the 7z archives is listed in the CSV files with the meta-data. The meta-data for each file are:

File name
Keywords
URL for the original audio file
Start and ending samples for the excerpt that is used in the Clotho dataset
Uploader/user in the Freesound platform (manufacturer)
Link to the licence of the file

The captions in the files:

clotho_captions_development.csv
clotho_captions_evaluation.csv

are under the Tampere University licence, described in the LICENCE file (mainly a non-commercial with attribution licence).

== References ==

[1] Frederic Font, Gerard Roma, and Xavier Serra. 2013. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia (MM '13). ACM, New York, NY, USA, 411-412. DOI: https://doi.org/10.1145/2502081.2502245
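For illustration, here is a minimal Python sketch of the pairing step, assuming the caption CSVs follow the layout described in the paper (a file_name column followed by five caption columns) and that the audio archive has been extracted locally; the directory and column names are assumptions to check against the actual files.

import csv
from pathlib import Path

# Assumed locations; adjust to wherever you extracted the 7z archive.
AUDIO_DIR = Path("clotho_audio_development")
CAPTIONS_CSV = Path("clotho_captions_development.csv")

def load_clotho_pairs(captions_csv, audio_dir):
    """Yield (audio_path, [caption_1, ..., caption_5]) pairs."""
    with open(captions_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            audio_path = audio_dir / row["file_name"]               # assumed column name
            captions = [row[f"caption_{i}"] for i in range(1, 6)]   # assumed column names
            if audio_path.exists():
                yield audio_path, captions

for audio_path, captions in load_clotho_pairs(CAPTIONS_CSV, AUDIO_DIR):
    print(audio_path.name, captions[0])
    break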
The Flickr 8k Audio Caption Corpus contains 40,000 spoken captions of 8,000 natural images. It was collected in 2015 to investigate multimodal learning schemes for unsupervised speech pattern discovery. For a description of the corpus, see:
D. Harwath and J. Glass, "Deep Multimodal Semantic Embeddings for Speech and Images," 2015 IEEE Automatic Speech Recognition and Understanding Workshop, pp. 237-244, Scottsdale, Arizona, USA, December 2015
MIT License: https://opensource.org/licenses/MIT
Dataset Card for Music-Audio-Pseudo Captions
Pseudo music and audio captions from LP-MusicCaps, Music Negation/Temporal Ordering, and WavCaps
Dataset Summary
Compared to other domains, well-written web caption data is hard to obtain for music and audio, and caption annotation is expensive. Therefore, we take the music (LP-MusicCaps, Music Negation/Temporal Ordering) and audio (WavCaps) datasets created with ChatGPT and re-organize them in the form of instructions, input… See the full description on the dataset page: https://huggingface.co/datasets/seungheondoh/music-audio-pseudo-captions.
This is the evaluation split for Task 6, Automated Audio Captioning, in DCASE 2020 Challenge.
This evaluation split is the Clotho testing split, which is thoroughly described in the corresponding paper:
K. Drossos, S. Lipping and T. Virtanen, "Clotho: an Audio Captioning Dataset," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 736-740, doi: 10.1109/ICASSP40776.2020.9052990.
available online at: https://arxiv.org/abs/1910.09387 and at: https://ieeexplore.ieee.org/document/9052990
This evaluation split is meant to be used for the purposes of Task 6 of the DCASE 2020 challenge. This split is not meant to be used for developing audio captioning methods. For developing audio captioning methods, you should use the development and evaluation splits of Clotho.
If you want the development and evaluation splits of the Clotho dataset, you can also find them on Zenodo, at: https://zenodo.org/record/3490684
== License ==
The audio files in the archives:
clotho_audio_test.7z
and the associated meta-data in the CSV file:
clotho_metadata_test.csv
are under the corresponding licences (mostly CreativeCommons with attribution) of Freesound [1] platform, mentioned explicitly in the CSV file for each of the audio files. That is, each audio file in the 7z archive is listed in the CSV file with the meta-data. The meta-data for each file are:
File name
Start and ending samples for the excerpt that is used in the Clotho dataset
Uploader/user in the Freesound platform (manufacturer)
Link to the licence of the file
== References ==

[1] Frederic Font, Gerard Roma, and Xavier Serra. 2013. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia (MM '13). ACM, New York, NY, USA, 411-412. DOI: https://doi.org/10.1145/2502081.2502245
This dataset consists of the Hospital and Car scenes of our Audio Caption dataset. The Hospital scene is detailed in our paper "Audio Caption: Listen and Tell", published at ICASSP 2019, and the Car scene in "Audio Caption in a Car Setting with a Sentence-Level Loss", published at ISCSLP 2021. The original captions are in Mandarin Chinese, with English translations provided.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This dataset is derived from the evaluation subset of the Clotho dataset. It is designed to analyze the behavior of captioning systems under certain perturbations, in order to try to identify some open challenges in automated audio captioning. The original audio clips are transformed with audio_degrader. The transformations applied are the following (a sketch of mixing at a fixed SNR is given after this list):
Microphone response simulation
Mixup with another clip from the dataset (ratio -6dB, -3dB and 0dB)
Additive noise from DESED (ratio -12dB, -6dB, 0dB)
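For illustration, here is a minimal, hedged sketch of how additive mixing at a fixed signal-to-noise ratio could be reproduced with numpy and soundfile; this is not the exact audio_degrader pipeline used for the release, it assumes mono clips at the same sample rate, and the file names are placeholders.

import numpy as np
import soundfile as sf

def mix_at_snr(clean, noise, snr_db):
    """Mix `noise` into `clean` so the clean-to-noise power ratio equals `snr_db` (mono signals)."""
    # Loop or trim the noise to the length of the clean signal.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]

    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that clean/noise power matches the requested SNR.
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    mixed = clean + scale * noise
    # Avoid clipping when writing back to a fixed-point format.
    peak = np.max(np.abs(mixed))
    return mixed / peak if peak > 1.0 else mixed

# Placeholder file names; any two mono clips with matching sample rates will do.
clean, sr = sf.read("clotho_clip.wav")
noise, _ = sf.read("other_clip.wav")
sf.write("clotho_clip_mix_0dB.wav", mix_at_snr(clean, noise, 0.0), sr)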
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
WavCaps
WavCaps is a ChatGPT-assisted weakly-labelled audio captioning dataset for audio-language multimodal research, where the audio clips are sourced from three websites (FreeSound, BBC Sound Effects, and SoundBible) and a sound event detection dataset (AudioSet Strongly-labelled Subset).
Paper: https://arxiv.org/abs/2303.17395 Github: https://github.com/XinhaoMei/WavCaps
Statistics
For each data source (FreeSound, BBC Sound Effects, SoundBible, AudioSet strongly-labelled subset), the statistics table lists the average audio duration (s) and the average text length… See the full description on the dataset page: https://huggingface.co/datasets/cvssp/WavCaps.
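If the files are needed locally, one generic way to fetch them is the Hugging Face Hub snapshot API; this is only a sketch, and whether you want the whole snapshot (the audio archives are large) or only selected files is a choice to make against the dataset page.

from huggingface_hub import snapshot_download

# Download the dataset repository contents into the local Hugging Face cache.
# Fetching everything can be large; check the dataset page for the files you actually need.
local_dir = snapshot_download(repo_id="cvssp/WavCaps", repo_type="dataset")
print("WavCaps files downloaded to:", local_dir)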
This is a dataset containing audio captions and corresponding audio tags for 3930 audio files of the TAU Urban Acoustic Scenes 2019 development dataset (airport, public square, and park). The files were annotated using a web-based tool. Each file is annotated by multiple annotators that provided tags and a one-sentence description of the audio content. The data also includes annotator competence estimated using MACE (Multi-Annotator Competence Estimation).

The annotation procedure, processing and analysis of the data are presented in the following papers:

Irene Martin-Morato, Annamaria Mesaros. What is the ground truth? Reliability of multi-annotator data for audio tagging, 29th European Signal Processing Conference, EUSIPCO 2021

Irene Martin-Morato, Annamaria Mesaros. Diversity and bias in audio captioning datasets, submitted to DCASE 2021 Workshop (to be updated with arxiv link)

Data is provided as two files (a Python loading sketch follows this format description):

MACS.yaml - containing the complete annotations in the following format:

- filename: file1.wav
  annotations:
    - annotator_id: ann_1
      sentence: caption text
      tags:
        - tag1
        - tag2
    - annotator_id: ann_2
      sentence: caption text
      tags:
        - tag1

MACS_competence.csv - containing the estimated annotator competence; for each annotator_id in the yaml file, competence is a number between 0 (considered as annotating at random) and 1:

id [tab] competence

The audio files can be downloaded from https://zenodo.org/record/2589280 and are covered by their own license.
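A minimal sketch of reading both files, assuming PyYAML is installed, the files sit in the working directory, and the field names follow the format shown above; the top-level YAML key and the presence of a CSV header row are assumptions.

import csv
import yaml  # PyYAML

# Load the complete annotations (a list of {filename, annotations: [...]} entries;
# if the file wraps them under a top-level key such as "files", unwrap it first).
with open("MACS.yaml", encoding="utf-8") as f:
    data = yaml.safe_load(f)
files = data["files"] if isinstance(data, dict) else data

# Load the estimated annotator competence (tab-separated "id" and "competence" columns;
# the presence of a header row is an assumption to verify against the actual file).
with open("MACS_competence.csv", newline="", encoding="utf-8") as f:
    competence = {row["id"]: float(row["competence"])
                  for row in csv.DictReader(f, delimiter="\t")}

# Example: list all captions for the first annotated file, with annotator competence.
entry = files[0]
for ann in entry["annotations"]:
    print(entry["filename"], ann["annotator_id"],
          competence.get(ann["annotator_id"]), ann["sentence"])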
Automated audio captioning is a cross-modal translation task for describing the content of audio clips with natural language sentences.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
Example caption: "A retro-futurist drum machine groove drenched in bubbly synthetic sound effects and a hint of an acid bassline."
The Song Describer Dataset (SDD) contains ~1.1k captions for 706 permissively licensed music recordings. It is designed for use in evaluation of models that address music-and-language (M&L) tasks such as music captioning, text-to-music generation and music-language retrieval. More information about the data, collection method and validation is provided in the paper describing the dataset.
If you use this dataset, please cite our paper:
The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation, Manco, Ilaria and Weck, Benno and Doh, Seungheon and Won, Minz and Zhang, Yixiao and Bogdanov, Dmitry and Wu, Yusong and Chen, Ke and Tovstogan, Philip and Benetos, Emmanouil and Quinton, Elio and Fazekas, György and Nam, Juhan, Machine Learning for Audio Workshop at NeurIPS 2023, 2023
@article{mei2024wavcaps,
  title={Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research},
  author={Mei, Xinhao and Meng, Chutong and Liu, Haohe and Kong, Qiuqiang and Ko, Tom and Zhao, Chengqi and Plumbley, Mark D and Zou, Yuexian and Wang, Wenwu},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  year={2024},
  publisher={IEEE}
}
@article{wang2024audiobench, title={AudioBench: A Universal Benchmark for… See the full description on the dataset page: https://huggingface.co/datasets/AudioLLMs/wavcaps_test.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This dataset consists of the Hospital scene of our Audio Caption dataset. Details can be seen in our paper "Audio Caption: Listen and Tell", published at ICASSP 2019.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
SpeechCoco
Introduction
Our corpus is an extension of the MS COCO image recognition and captioning dataset. MS COCO comprises images paired with a set of five captions. Yet, it does not include any speech. Therefore, we used Voxygen's text-to-speech system to synthesise the available captions.
The addition of speech as a new modality enables MS COCO to be used for research in the fields of language acquisition, unsupervised term discovery, keyword spotting, and semantic embedding using speech and vision.
Our corpus is licensed under a Creative Commons Attribution 4.0 License.
Data Set
This corpus contains 616,767 spoken captions from MSCOCO's val2014 and train2014 subsets (respectively 414,113 for train2014 and 202,654 for val2014).
We used 8 different voices. 4 of them have a British accent (Paul, Bronwen, Judith, and Elizabeth) and the 4 others have an American accent (Phil, Bruce, Amanda, Jenny).
In order to make the captions sound more natural, we used SoX's tempo command, which changes the speed without changing the pitch. One third of the captions are 10% slower than the original pace, one third are 10% faster, and the last third was kept untouched.
We also modified approximately 30% of the original captions and added disfluencies such as "um", "uh", "er" so that the captions would sound more natural.
Each WAV file is paired with a JSON file containing various information: timecode of each word in the caption, name of the speaker, name of the WAV file, etc. The JSON files have the following data structure:
{
"duration": float,
"speaker": string,
"synthesisedCaption": string,
"timecode": list,
"speed": float,
"wavFilename": string,
"captionID": int,
"imgID": int,
"disfluency": list
}
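For instance, a minimal sketch of reading one of these JSON files with the standard library; the file name is a placeholder built from the naming convention described below.

import json

# Placeholder name following the imageID_captionID_Speaker_DisfluencyPosition_Speed convention.
with open("298817_26763_Phil_None_0-9.json", encoding="utf-8") as f:
    meta = json.load(f)

print(meta["speaker"], meta["wavFilename"], meta["duration"])
print(meta["synthesisedCaption"])
for disfluency in meta["disfluency"]:
    print("disfluency:", disfluency)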
On average, each caption comprises 10.79 tokens, disfluencies included. The WAV files are on average 3.52 seconds long.
Repository
The repository is organized as follows:
CORPUS-MSCOCO (~75GB once decompressed)
    train2014/ : folder contains 413,915 captions
        json/
        wav/
        translations/
            train_en_ja.txt
            train_translate.sqlite3
        train_2014.sqlite3
    val2014/ : folder contains 202,520 captions
        json/
        wav/
        translations/
            train_en_ja.txt
            train_translate.sqlite3
        val_2014.sqlite3
    speechcoco_API/
        speechcoco/
            __init__.py
            speechcoco.py
        setup.py
Filenames
.wav files contain the spoken version of a caption
.json files contain all the metadata of a given WAV file
.sqlite3 files are SQLite databases containing all the information contained in the JSON files
We adopted the following naming convention for both the WAV and JSON files:
imageID_captionID_Speaker_DisfluencyPosition_Speed[.wav/.json]
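As a small illustration, the convention can be parsed with a few lines of Python (purely a sketch; the example name is taken from the listing further below):

from pathlib import Path

def parse_speechcoco_name(path):
    """Split imageID_captionID_Speaker_DisfluencyPosition_Speed into a dict."""
    img_id, caption_id, speaker, disfluency, speed = Path(path).stem.split("_")
    return {
        "imgID": int(img_id),
        "captionID": int(caption_id),
        "speaker": speaker,
        "disfluencyPosition": disfluency,         # e.g. None, Middle
        "speed": float(speed.replace("-", ".")),  # "0-9" encodes 0.9
    }

print(parse_speechcoco_name("298817_26763_Phil_None_0-9.wav"))
# {'imgID': 298817, 'captionID': 26763, 'speaker': 'Phil', 'disfluencyPosition': 'None', 'speed': 0.9}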
Script
We created a script called speechcoco.py in order to handle the metadata and allow the user to easily find captions according to specific filters. The script uses the SQLite database files.
Features:
Aggregate all the information in the JSON files into a single SQLite database
Find captions according to specific filters (name, gender and nationality of the speaker, disfluency position, speed, duration, and words in the caption). The script automatically builds the SQLite query. The user can also provide his own SQLite query.
The following Python code returns all the captions spoken by a male with an American accent for which the speed was slowed down by 10% and that contain "keys" at any position
from speechcoco.speechcoco import SpeechCoco  # provided by the speechcoco_API package

# create SpeechCoco object
db = SpeechCoco('train_2014.sqlite3', 'train_translate.sqlite3', verbose=True)

# filter captions (returns Caption objects)
captions = db.filterCaptions(gender='Male', nationality='US', speed=0.9, text='%keys%')
for caption in captions:
    print('\n{}\t{}\t{}\t{}\t{}\t{}\t\t{}'.format(caption.imageID,
                                                  caption.captionID,
                                                  caption.speaker.name,
                                                  caption.speaker.nationality,
                                                  caption.speed,
                                                  caption.filename,
                                                  caption.text))
...
298817 26763 Phil 0.9 298817_26763_Phil_None_0-9.wav A group of turkeys with bushes in the background.
108505 147972 Phil 0.9 108505_147972_Phil_Middle_0-9.wav Person using a, um, slider cell phone with blue backlit keys.
258289 154380 Bruce 0.9 258289_154380_Bruce_None_0-9.wav Some donkeys and sheep are in their green pens .
545312 201303 Phil 0.9 545312_201303_Phil_None_0-9.wav A man walking next to a couple of donkeys.
...
Find all the captions belonging to a specific image
captions = db.getImgCaptions(298817)
for caption in captions:
    print('\n{}'.format(caption.text))
Birds wondering through grassy ground next to bushes.
A flock of turkeys are making their way up a hill.
Um, ah. Two wild turkeys in a field walking around.
Four wild turkeys and some bushes trees and weeds.
A group of turkeys with bushes in the background.
Parse the timecodes and have them structured
input:
...
[1926.3068, "SYL", ""],
[1926.3068, "SEPR", " "],
[1926.3068, "WORD", "white"],
[1926.3068, "PHO", "w"],
[2050.7955, "PHO", "ai"],
[2144.6591, "PHO", "t"],
[2179.3182, "SYL", ""],
[2179.3182, "SEPR", " "]
...
output:
print(caption.timecode.parse())
...
{
'begin': 1926.3068,
'end': 2179.3182,
'syllable': [{'begin': 1926.3068,
'end': 2179.3182,
'phoneme': [{'begin': 1926.3068,
'end': 2050.7955,
'value': 'w'},
{'begin': 2050.7955,
'end': 2144.6591,
'value': 'ai'},
{'begin': 2144.6591,
'end': 2179.3182,
'value': 't'}],
'value': 'wait'}],
'value': 'white'
},
...
Convert the timecodes to Praat TextGrid files
caption.timecode.toTextgrid(outputDir, level=3)
Get the words, syllables and phonemes between n seconds/milliseconds
The following Python code returns all the words between 0.2 and 0.6 seconds for which at least 50% of the word's total length is within the specified interval
pprint(caption.getWords(0.20, 0.60, seconds=True, level=1, olapthr=50))
...
404537 827239 Bruce US 0.9 404537_827239_Bruce_None_0-9.wav Eyeglasses, a cellphone, some keys and other pocket items are all laid out on the cloth. .
[
{
'begin': 0.0,
'end': 0.7202778,
'overlapPercentage': 55.53412863758955,
'word': 'eyeglasses'
}
]
...
Get the translations of the selected captions
As of now, only Japanese translations are available. We also used Kytea to tokenize and POS-tag the captions translated with Google Translate.
captions = db.getImgCaptions(298817)
for caption in captions:
    print('\n{}'.format(caption.text))

    # Get translations and POS
    print('\tja_google: {}'.format(db.getTranslation(caption.captionID, "ja_google")))
    print('\t\tja_google_tokens: {}'.format(db.getTokens(caption.captionID, "ja_google")))
    print('\t\tja_google_pos: {}'.format(db.getPOS(caption.captionID, "ja_google")))
    print('\tja_excite: {}'.format(db.getTranslation(caption.captionID, "ja_excite")))
Birds wondering through grassy ground next to bushes.
ja_google: 鳥は茂みの下に茂った地面を抱えています。
ja_google_tokens: 鳥 は 茂み の 下 に 茂 っ た 地面 を 抱え て い ま す 。
ja_google_pos: 鳥/名詞/とり は/助詞/は 茂み/名詞/しげみ の/助詞/の 下/名詞/した に/助詞/に
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Synthetically Spoken COCO
Version 1.0
This dataset contains synthetically generated spoken versions of MS COCO [1] captions. The dataset was created as part of the research reported in [5].
The speech was generated using gTTS [2]. The dataset consists of the following files:
- dataset.json: Captions associated with MS COCO images. This information comes from [3].
- sentid.txt: List of caption IDs. This file can be used to locate the MFCC features of the MP3 files in the numpy array stored in dataset.mfcc.npy.
- mp3.tgz: MP3 files with the audio. Each file name corresponds to a caption ID in dataset.json and in sentid.txt.
- dataset.mfcc.npy: Numpy array with the Mel Frequency Cepstral Coefficients extracted from the audio. Each row corresponds to a caption. The order of the captions corresponds to the ordering in the file sentid.txt. MFCCs were extracted using [4]. (A loading sketch in Python follows the reference list below.)
[1] http://mscoco.org/dataset/#overview
[2] https://pypi.python.org/pypi/gTTS
[3] https://github.com/karpathy/neuraltalk
[4] https://github.com/jameslyons/python_speech_features
[5] https://arxiv.org/abs/1702.01991
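A minimal sketch of aligning sentid.txt with the rows of dataset.mfcc.npy, assuming the files have been downloaded next to the script; the caption-ID-to-row mapping simply follows the ordering described above.

import numpy as np

# Rows of the MFCC array follow the order of the caption IDs in sentid.txt.
with open("sentid.txt", encoding="utf-8") as f:
    sent_ids = [line.strip() for line in f if line.strip()]

mfcc = np.load("dataset.mfcc.npy")
assert len(sent_ids) == mfcc.shape[0], "expected one MFCC row per caption ID"

# Map caption ID -> MFCC feature row.
features = {sid: mfcc[i] for i, sid in enumerate(sent_ids)}
print(len(features), "captions loaded")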
https://www.marketreportanalytics.com/privacy-policy
The global audio accessibility market is experiencing robust growth, driven by increasing awareness of inclusivity and technological advancements. The market, estimated at $1.5 billion in 2025, is projected to witness a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033, reaching approximately $5 billion by 2033. Key drivers include rising accessibility regulations, the proliferation of streaming services demanding closed captions and audio descriptions, and the increasing availability of cost-effective audio description software and services. The growing visually impaired and blind population globally further fuels market expansion. Significant market segments include online audio description services, preferred for their scalability and reach, and offline services which offer more tailored and customized solutions. Leading market players are continually innovating, incorporating AI-powered solutions for improved accuracy and efficiency in audio description generation. While the market presents significant opportunities, challenges remain. High implementation costs, particularly for offline services, and the need for skilled professionals to create high-quality audio descriptions can hinder widespread adoption, particularly in developing regions. However, the ongoing evolution of speech-to-text and text-to-speech technologies, alongside reductions in the cost of AI-driven tools, are likely to mitigate some of these restraints in the coming years. The market is geographically diverse, with North America and Europe currently dominating, but significant growth potential exists in emerging markets like Asia Pacific and the Middle East & Africa as awareness and accessibility legislation increases. The increasing integration of audio description within mainstream media and entertainment platforms will be a crucial factor in expanding market penetration and accessibility across all user segments.
https://choosealicense.com/licenses/afl-3.0/
Dataset of captioned spectrograms (text describing the sound).
We introduce a new audio dataset called SoundDescs that can be used for tasks such as text-to-audio retrieval and audio captioning. This dataset contains 32,979 pairs of audio files and text descriptions. There are 23 categories in SoundDescs, including but not limited to nature, clocks, and fire.
SoundDescs can be downloaded from here and retrieval results for this dataset can be found in the associated paper Audio Retrieval with Natural Language Queries: A Benchmark Study.
Clotho-AQA is an audio question-answering dataset consisting of 1991 audio samples taken from the Clotho dataset [1]. Each audio sample has 6 associated questions collected through crowdsourcing. For each question, the answers are provided by three different annotators, making a total of 35,838 question-answer pairs. For each audio sample, 4 questions are designed to be answered with 'yes' or 'no', while the remaining two questions are designed to be answered in a single word. More details about the data collection and data splitting processes can be found in the following paper.
S. Lipping, P. Sudarsanam, K. Drossos and T. Virtanen, 'Clotho-AQA: A Crowdsourced Dataset for Audio Question Answering.' The paper is available online at: https://arxiv.org/abs/2204.09634
If you use the Clotho-AQA dataset, please cite the paper mentioned above. A sample baseline model to use the Clotho-AQA dataset can be found at: https://github.com/partha2409/AquaNet
To use the dataset,
• Download and extract ‘audio_files.zip’. This contains all the 1991 audio samples in the dataset.
• Download ‘clotho_aqa_train.csv’, ‘clotho_aqa_val.csv’, and ‘clotho_aqa_test.csv’. These files contain the train, validation, and test splits, respectively. They contain the audio file name, questions, answers, and confidence scores provided by the annotators (a loading sketch in Python is given after this list).
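A minimal sketch of reading one split with pandas; the exact header names in the CSVs are assumptions and should be checked against the actual files.

import pandas as pd

# Assumed file location and column names; verify them against the actual CSV headers.
train = pd.read_csv("clotho_aqa_train.csv")
print(train.columns.tolist())          # expected to cover file name, question, answer, confidence
print(len(train), "question-answer rows in the training split")

# Example: keep only the rows with the highest annotator confidence,
# assuming a 'confidence' column exists.
if "confidence" in train.columns:
    confident = train[train["confidence"] == train["confidence"].max()]
    print(len(confident), "rows with the highest confidence score")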
License:
The audio files in the archive ‘audio_files.zip’ are under the corresponding licenses (mostly CreativeCommons with attribution) of Freesound [2] platform, mentioned explicitly in the CSV file ’clotho_aqa_metadata.csv’ for each of the audio files. That is, each audio file in the archive is listed in the CSV file with meta-data. The meta-data for each file are:
• File name
• Keywords
• URL for the original audio file
• Start and ending samples for the excerpt that is used in the Clotho dataset
• Uploader/user in the Freesound platform (manufacturer)
• Link to the license of the file.
The questions and answers in the files:
• clotho_aqa_train.csv
• clotho_aqa_val.csv
• clotho_aqa_test.csv
are under the MIT license, described in the LICENSE file.
References:
[1] K. Drossos, S. Lipping and T. Virtanen, "Clotho: An Audio Captioning Dataset," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 736- 740, doi: 10.1109/ICASSP40776.2020.9052990.
[2] Frederic Font, Gerard Roma, and Xavier Serra. 2013. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia (MM '13). ACM, New York, NY, USA, 411-412. DOI: https://doi.org/10.1145/2502081.2502245
A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research.
AudioCaps is a dataset of sounds with event descriptions that was introduced for the task of audio captioning, with sounds sourced from the AudioSet dataset. Annotators were provided the audio tracks together with category hints (and with additional video hints if needed).