https://academictorrents.com/nolicensespecified
AVSpeech is a new, large-scale audio-visual dataset comprising speech video clips with no interfering background noises. The segments are 3-10 seconds long, and in each clip the audible sound in the soundtrack belongs to a single speaking person, visible in the video. In total, the dataset contains roughly 4700 hours* of video segments, from a total of 290k YouTube videos, spanning a wide variety of people, languages and face poses. For more details on how we created the dataset, see our paper, Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation. *UPLOADER'S NOTE: This torrent contains 3000 hours of video segments, not the entire 4700 hours. The remaining 1700 hours were not included because some videos no longer existed on YouTube, had copyright violations, were not available in the United States, or were of poor quality. Over 1 million segments are included in this torrent, each 3-10 seconds long and in 720p resolution.
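For orientation, the official AVSpeech release is usually described as a CSV of YouTube IDs with segment start and end times; a minimal sketch of cutting one such segment with ffmpeg, assuming a row of the form youtube_id,start,end,face_x,face_y and a source video already downloaded (e.g. with yt-dlp), might look like the following. Paths and the column layout are assumptions, not part of this torrent.
import csv
import subprocess

# Hypothetical CSV row: youtube_id, start_sec, end_sec, face_x, face_y
with open("avspeech_train.csv") as f:
    youtube_id, start, end, face_x, face_y = next(csv.reader(f))

source = f"{youtube_id}.mp4"                  # full video fetched beforehand
clip = f"{youtube_id}_{start}_{end}.mp4"

# Re-encode so the cut is frame-accurate at the requested boundaries.
subprocess.run([
    "ffmpeg", "-i", source, "-ss", start, "-to", end,
    "-c:v", "libx264", "-c:a", "aac", clip,
], check=True)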
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
AVSpeech is a large-scale audio-visual dataset comprising speech clips with no interfering background signals. The segments are of varying length, between 3 and 10 seconds long, and in each clip the only visible face in the video and audible sound in the soundtrack belong to a single speaking person. In total, the dataset contains roughly 4700 hours of video segments with approximately 150,000 distinct speakers, spanning a wide variety of people, languages and face poses.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
AVE Speech: A Comprehensive Multi-Modal Dataset for Speech Recognition Integrating Audio, Visual, and Electromyographic Signals
Abstract
AVE Speech is a large-scale Mandarin speech corpus that pairs synchronized audio, lip video and surface electromyography (EMG) recordings. The dataset contains 100 sentences read by 100 native speakers. Each participant repeated the full corpus ten times, yielding over 55 hours of data per modality. These complementary signals enable… See the full description on the dataset page: https://huggingface.co/datasets/MML-Group/AVE-Speech.
http://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
http://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
This is an audio-visual speech database for training and testing Czech audio-visual continuous speech recognition systems. The corpus consists of about 25 hours of audio-visual recordings of 65 speakers in laboratory conditions. Data collection was done under static illumination, and recorded subjects were instructed to remain still. The average speaker age was 22 years. Each speaker was asked to read 200 sentences (50 common to all speakers and 150 specific to each speaker); the average total recording length per speaker is 23 minutes.
All audio-visual data are transcribed (.trs files) and divided into sentences (one sentence per file), and each video file is accompanied by a description file giving the position and size of the region of interest. Acoustic data are stored as PCM wave files with a 44 kHz sampling frequency and 16-bit resolution; each speaker's acoustic data occupy about 140 MB of disk space (about 9 GB in total). Visual data are stored as video files (.avi) using the digital video (DV) codec; visual data per speaker take about 3 GB (about 195 GB in total) and are stored on an IDE hard disk (NTFS format).
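As a quick sanity check of the audio format described above, a short sketch using Python's standard-library wave module can confirm that a sentence file is 16-bit PCM at roughly 44 kHz (the path below is hypothetical; the corpus layout may differ):
import wave

path = "speaker01/sentence_001.wav"   # hypothetical sentence-level file

with wave.open(path, "rb") as w:
    assert w.getframerate() == 44100, "expected ~44 kHz sampling rate"
    assert w.getsampwidth() == 2, "expected 16-bit PCM samples"
    duration = w.getnframes() / w.getframerate()
    print(f"{path}: {w.getnchannels()} channel(s), {duration:.2f} s")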
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The dataset comprises 338 hours of telephone dialogues in Russian, collected from 460 native speakers across various topics and domains, with a 98% word accuracy rate. It is designed for research in speech recognition, covering various recognition models and primarily aimed at meeting the requirements of automatic speech recognition (ASR) systems.
By utilizing this dataset, researchers and developers can advance their understanding and capabilities in automatic speech recognition (ASR), audio transcription, and natural language processing (NLP).
- Audio files: High-quality recordings in WAV format
- Text transcriptions: Accurate and detailed transcripts for each audio segment
- Speaker information: Metadata on native speakers, including gender and other attributes
- Topics: Diverse domains such as general conversations, business, and more
The native speakers and the variety of topics and domains covered make this dataset an ideal resource for the research community, allowing researchers to study spoken language, dialects, and language patterns.
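A minimal sketch of pairing recordings with their transcripts, assuming a hypothetical layout in which each WAV file has a same-named .txt transcript (the actual delivery format may differ; check the dataset documentation):
from pathlib import Path

audio_dir = Path("audio")          # hypothetical folder of WAV recordings
text_dir = Path("transcripts")     # hypothetical folder of matching .txt files

pairs = []
for wav in sorted(audio_dir.glob("*.wav")):
    txt = text_dir / (wav.stem + ".txt")
    if txt.exists():
        pairs.append((wav, txt.read_text(encoding="utf-8").strip()))

print(f"{len(pairs)} audio/transcript pairs found")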
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Positive Transfer Of The Whisper Speech Transformer To Human And Animal Voice Activity Detection
We propose WhisperSeg, which uses the Whisper Transformer, pre-trained for Automatic Speech Recognition (ASR), for both human and animal Voice Activity Detection (VAD). For more details, please refer to our paper:
Positive Transfer of the Whisper Speech Transformer to Human and Animal Voice Activity Detection Nianlong Gu, Kanghwi Lee, Maris Basha, Sumit Kumar Ram, Guanghao You, Richard H.… See the full description on the dataset page: https://huggingface.co/datasets/nccratliri/vad-human-ava-speech.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This is a modified version of the speech audio contained within the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset. The original dataset can be found here. The unmodified version of just the speech audio used as source material for this dataset can be found here. This dataset performs speech enhancement and bandwidth extension on the original speech using HiFi-GAN. HiFi-GAN produces high-quality speech at 48 kHz that contains significantly less noise and reverb relative to the original recordings.
If you use this work as part of an academic publication, please cite the papers corresponding to both the original dataset as well as HiFi-GAN:
Livingstone SR, Russo FA (2018) The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 13(5): e0196391. https://doi.org/10.1371/journal.pone.0196391.
Su, Jiaqi, Zeyu Jin, and Adam Finkelstein. "HiFi-GAN: High-fidelity denoising and dereverberation based on speech deep features in adversarial networks." Proc. Interspeech. October 2020.
Note that there are two recent papers with the name "HiFi-GAN". Please be sure to cite the correct paper as listed here.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Saudi Arabian Arabic Language Visual Speech Dataset! This dataset is a collection of diverse, single-person unscripted spoken videos supporting research in visual speech recognition, emotion detection, and multimodal communication.
This visual speech dataset contains 1000 videos in Saudi Arabian Arabic, each paired with a corresponding high-fidelity audio track. In each video, a participant answers a specific question in an unscripted, spontaneous manner.
Extensive guidelines were followed during the recording of each video to maintain quality and diversity.
The dataset provides comprehensive metadata for each video recording and participant:
An audio dataset of spoken words designed to help train and evaluate keyword spotting systems. Its primary goal is to provide a way to build and test small models that detect when a single word from a set of ten target words is spoken, with as few false positives as possible from background noise or unrelated speech. Note that in the train and validation sets, the label "unknown" is much more prevalent than the labels of the target words or background noise. One difference from the release version is the handling of silent segments: while in the test set the silence segments are regular 1-second files, in the training set they are provided as long recordings under the "background_noise" folder. Here we split this background noise into 1-second clips and also keep one of the files for the validation set (a sketch of this split appears after the loading example below).
To use this dataset:
import tensorflow_datasets as tfds

# Load the training split of the Speech Commands dataset.
ds = tfds.load('speech_commands', split='train')

# Each example contains a raw audio waveform and an integer label.
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
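The splitting of the long background-noise recordings into 1-second clips mentioned above can be reproduced with a sketch like the following, assuming 16 kHz WAV files in the background-noise folder and using the soundfile package (folder names are taken from the description and may differ on disk, e.g. "_background_noise_"):
import soundfile as sf
from pathlib import Path

noise_dir = Path("background_noise")       # folder name per the description above
out_dir = Path("background_noise_1s")
out_dir.mkdir(exist_ok=True)

for wav in noise_dir.glob("*.wav"):
    audio, sr = sf.read(wav)
    # Write non-overlapping 1-second clips; any trailing remainder is dropped.
    for i in range(len(audio) // sr):
        sf.write(out_dir / f"{wav.stem}_{i:04d}.wav", audio[i * sr:(i + 1) * sr], sr)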
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Portuguese Language Visual Speech Dataset! This dataset is a collection of diverse, single-person unscripted spoken videos supporting research in visual speech recognition, emotion detection, and multimodal communication.
This visual speech dataset contains 1000 videos in Portuguese, each paired with a corresponding high-fidelity audio track. In each video, a participant answers a specific question in an unscripted, spontaneous manner.
Extensive guidelines were followed during the recording of each video to maintain quality and diversity.
The dataset provides comprehensive metadata for each video recording and participant:
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This repository contains an L2 (second language) English speech corpus consisting of 74 minutes of recorded audio from 15 non-native English speaking participants. The dataset was created as part of a university course, with all participants being students who are also the authors of this dataset.
The dataset contains individual recordings of non-native English speakers, organized by participant ID. For 8 participants, sentence-level alignments are provided. All recordings were captured in a controlled acoustic environment using an Audio-Technica ATM75 microphone to ensure high audio quality.
The recordings consist of spoken English utterances from each participant. Detailed linguistic profiles for each participant are available in the metadata.xlsx file, which is indexed by participant ID and contains information on native language, proficiency level, language learning history, and other relevant linguistic background data.
The audio files are organized by participant ID, matching the identifiers used in the metadata file for easy cross-referencing between the audio recordings and participant linguistic profiles.
This dataset was created by the student participants themselves as part of their coursework.
Course Instructor: Akhilesh Kakolu Ramarao
Teaching Assistant: Anna Sophia Stein
If you have any questions, you can contact: kakolura@hhu.de
If you use this dataset in your research, please cite:
@dataset{kakolu_ramarao_2024_anglistikvoices,
author = {Kakolu Ramarao, A. and
Stein, A. S. and
Tahiri, A. and
Rodrigues, D. C. and
Antonia Weismann, C. and
Schäfer, O. S. and
Kaczor, J. and
Tran, N. H. and
Elena Telaar, C. and
Bauer, L. and
Jütten, M. and
Mafuta, C. and
Agelopoulou, V. V. and
Grabowski, Q. A. G.},
title = {AnglistikVoices: L2 English speech dataset},
publisher = {Zenodo},
version = {v1.0.0},
year = {2024},
month = jun,
doi = {10.5281/zenodo.12525952},
url = {https://doi.org/10.5281/zenodo.12525952},
note = {LabPhon 19, Hanyang Institute for Phonetics and Cognitive Sciences of Language (HIPCS), Hanyang University in Seoul, Korea}
}
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the French Language Visual Speech Dataset! This dataset is a collection of diverse, single-person unscripted spoken videos supporting research in visual speech recognition, emotion detection, and multimodal communication.
This visual speech dataset contains 1000 videos in French, each paired with a corresponding high-fidelity audio track. In each video, a participant answers a specific question in an unscripted, spontaneous manner.
Extensive guidelines were followed during the recording of each video to maintain quality and diversity.
The dataset provides comprehensive metadata for each video recording and participant:
The real-world speech dataset consists of 3686 segments of English speech spoken with different accents, provided by the Speech Processing Group at Brno University of Technology, Czech Republic. The majority of the data corresponds to an American accent; only 1.65% corresponds to one of seven other accents (these segments are referred to as outliers). The speech segments are represented by 400-dimensional so-called i-vectors, which are widely used state-of-the-art features for speaker and language recognition. Reference: Learning Outlier Ensembles: The Best of Both Worlds – Supervised and Unsupervised. Barbora Micenkova, Brian McWilliams, and Ira Assent, KDD ODD2 Workshop, 2014.
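A minimal sketch of the outlier-detection use case, assuming the i-vectors have already been loaded into a NumPy array of shape (3686, 400) (the on-disk format is not specified here): score each segment by cosine similarity to the mean i-vector and flag the least similar segments as candidate accent outliers.
import numpy as np

ivectors = np.load("ivectors.npy")            # hypothetical file, shape (3686, 400)

# Length-normalise, then score by cosine similarity to the global mean direction.
x = ivectors / np.linalg.norm(ivectors, axis=1, keepdims=True)
mean = x.mean(axis=0)
mean /= np.linalg.norm(mean)
scores = x @ mean                             # high = typical accent, low = outlier-like

n_outliers = round(0.0165 * len(x))           # roughly 1.65% per the description
candidates = np.argsort(scores)[:n_outliers]
print(f"{len(candidates)} candidate accent outliers")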
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Indian English Language Visual Speech Dataset! This dataset is a collection of diverse, single-person unscripted spoken videos supporting research in visual speech recognition, emotion detection, and multimodal communication.
This visual speech dataset contains 1000 videos in Indian English, each paired with a corresponding high-fidelity audio track. In each video, a participant answers a specific question in an unscripted, spontaneous manner.
Extensive guidelines were followed during the recording of each video to maintain quality and diversity.
The dataset provides comprehensive metadata for each video recording and participant:
https://data.macgence.com/terms-and-conditions
Discover a speech dataset of US customers discussing automobiles in English. Perfect for AI development, voice recognition, and automotive research.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 7356 files (total size: 24.8 GB). The dataset contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song contains calm, happy, sad, angry, and fearful emotions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression. All conditions are available in three modality formats: Audio-only (16bit, 48kHz .wav), Audio-Video (720p H.264, AAC 48kHz, .mp4), and Video-only (no sound). Note, there are no song files for Actor_18.
The RAVDESS was developed by Dr Steven R. Livingstone, who now leads the Affective Data Science Lab, and Dr Frank A. Russo who leads the SMART Lab.
Citing the RAVDESS
The RAVDESS is released under a Creative Commons Attribution license, so please cite the RAVDESS if it is used in your work in any form. Published academic papers should use the academic paper citation for our PLoS1 paper. Personal works, such as machine learning projects/blog posts, should provide a URL to this Zenodo page, though a reference to our PLoS1 paper would also be appreciated.
Academic paper citation
Livingstone SR, Russo FA (2018) The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 13(5): e0196391. https://doi.org/10.1371/journal.pone.0196391.
Personal use citation
Include a link to this Zenodo page - https://zenodo.org/record/1188976
Commercial Licenses
Commercial licenses for the RAVDESS can be purchased. For more information, please visit our license fee page, or contact us at ravdess@gmail.com.
Contact Information
If you would like further information about the RAVDESS, to purchase a commercial license, or if you experience any issues downloading files, please contact us at ravdess@gmail.com.
Example Videos
Watch a sample of the RAVDESS speech and song videos.
Emotion Classification Users
If you're interested in using machine learning to classify emotional expressions with the RAVDESS, please see our new RAVDESS Facial Landmark Tracking data set [Zenodo project page].
Construction and Validation
Full details on the construction and perceptual validation of the RAVDESS are described in our PLoS ONE paper - https://doi.org/10.1371/journal.pone.0196391.
The RAVDESS contains 7356 files. Each file was rated 10 times on emotional validity, intensity, and genuineness. Ratings were provided by 247 individuals who were characteristic of untrained adult research participants from North America. A further set of 72 participants provided test-retest data. High levels of emotional validity, interrater reliability, and test-retest intrarater reliability were reported. Validation data is open-access, and can be downloaded along with our paper from PLoS ONE.
Contents
Audio-only files
Audio-only files of all actors (01-24) are available as two separate zip files (~200 MB each):
Speech file (Audio_Speech_Actors_01-24.zip, 215 MB) contains 1440 files: 60 trials per actor x 24 actors = 1440.
Song file (Audio_Song_Actors_01-24.zip, 198 MB) contains 1012 files: 44 trials per actor x 23 actors = 1012.
Audio-Visual and Video-only files
Video files are provided as separate zip downloads for each actor (01-24, ~500 MB each), and are split into separate speech and song downloads:
Speech files (Video_Speech_Actor_01.zip to Video_Speech_Actor_24.zip) collectively contain 2880 files: 60 trials per actor x 2 modalities (AV, VO) x 24 actors = 2880.
Song files (Video_Song_Actor_01.zip to Video_Song_Actor_24.zip) collectively contain 2024 files: 44 trials per actor x 2 modalities (AV, VO) x 23 actors = 2024.
File Summary
In total, the RAVDESS collection includes 7356 files (2880+2024+1440+1012 files).
File naming convention
Each of the 7356 RAVDESS files has a unique filename. The filename consists of a 7-part numerical identifier (e.g., 02-01-06-01-02-01-12.mp4). These identifiers define the stimulus characteristics:
Filename identifiers
Modality (01 = full-AV, 02 = video-only, 03 = audio-only).
Vocal channel (01 = speech, 02 = song).
Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).
Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the 'neutral' emotion.
Statement (01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door").
Repetition (01 = 1st repetition, 02 = 2nd repetition).
Actor (01 to 24. Odd numbered actors are male, even numbered actors are female).
Filename example: 02-01-06-01-02-01-12.mp4
Video-only (02)
Speech (01)
Fearful (06)
Normal intensity (01)
Statement "dogs" (02)
1st Repetition (01)
12th Actor (12)
Female, as the actor ID number is even.
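The naming convention maps directly onto a small parser; a sketch in Python using the code tables above:
# Decode a RAVDESS filename using the 7-part identifier described above.
MODALITY = {"01": "full-AV", "02": "video-only", "03": "audio-only"}
CHANNEL = {"01": "speech", "02": "song"}
EMOTION = {"01": "neutral", "02": "calm", "03": "happy", "04": "sad",
           "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised"}
INTENSITY = {"01": "normal", "02": "strong"}
STATEMENT = {"01": "Kids are talking by the door",
             "02": "Dogs are sitting by the door"}

def parse_ravdess(filename):
    modality, channel, emotion, intensity, statement, repetition, actor = \
        filename.split(".")[0].split("-")
    return {
        "modality": MODALITY[modality],
        "vocal_channel": CHANNEL[channel],
        "emotion": EMOTION[emotion],
        "intensity": INTENSITY[intensity],
        "statement": STATEMENT[statement],
        "repetition": int(repetition),
        "actor": int(actor),
        "actor_sex": "male" if int(actor) % 2 else "female",  # odd = male, even = female
    }

print(parse_ravdess("02-01-06-01-02-01-12.mp4"))
# video-only, speech, fearful, normal, "Dogs are sitting by the door", 1st repetition, actor 12 (female)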
License information
The RAVDESS is released under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, CC BY-NC-SA 4.0
Commercial licenses for the RAVDESS can also be purchased. For more information, please visit our license fee page, or contact us at ravdess@gmail.com.
Related Data sets
RAVDESS Facial Landmark Tracking data set [Zenodo project page].
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Clarity Speech Corpus is a forty-speaker British English speech dataset. The corpus was created for running listening tests to gauge speech intelligibility and quality in the Clarity Project, which has the goal of advancing speech signal processing by hearing aids through a series of challenges. The dataset is suitable for machine learning and other uses in speech and hearing technology, acoustics and psychoacoustics. The data comprise recordings of approximately 10,000 sentences drawn from the British National Corpus (BNC) with suitable length, words and grammatical construction for speech intelligibility testing. The collection process involved selecting a subset of BNC sentences, recording them as produced by 40 British English speakers, and processing the recordings to create individual sentence recordings with associated prompts and metadata.
clarity_utterances.v1_2.tar.gz contains all the recordings as .wav files, together with accompanying metadata such as text prompts in clarity_master.json; further details are given in the readme.
Sample_clarity_utterances.zip contains a sample of 10 utterances.
Please reference the following data paper, which gives details of how the corpus was generated: Graetzer, S., Akeroyd, M.A., Barker, J., Cox, T.J., Culling, J.F., Naylor, G., Porter, E. and Muñoz, R.V., 2022. Dataset of British English speech recordings for psychoacoustics and speech processing research: the Clarity Speech Corpus. Data in Brief, p.107951.
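A minimal sketch of walking the metadata, assuming clarity_master.json holds a list of records that each reference a prompt and a wav file name (the field names below are hypothetical; the readme in the archive documents the real schema):
import json
from pathlib import Path

records = json.loads(Path("clarity_master.json").read_text())
for rec in records[:5]:
    wav = Path("clarity_utterances") / rec.get("wavfile", "")   # hypothetical field
    print(rec.get("prompt", ""), "->", wav, wav.exists())       # hypothetical field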
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Grid Corpus is a large multitalker audiovisual sentence corpus designed to support joint computational-behavioral studies in speech perception. In brief, the corpus consists of high-quality audio and video (facial) recordings of 1000 sentences spoken by each of 34 talkers (18 male, 16 female), for a total of 34000 sentences. Sentences are of the form "put red at G9 now".
audio_25k.zip contains the wav format utterances at a 25 kHz sampling rate in a separate directory per talker
alignments.zip provides word-level time alignments, again separated by talker
s1.zip, s2.zip, etc. contain .jpg videos for each talker [note that due to an oversight, no video for talker t21 is available]
The Grid Corpus is described in detail in the paper jasagrid.pdf included in the dataset.
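A sketch of reading one word-level alignment file, assuming each line holds a "start end word" triple (the time unit is not stated here; inspect a file from alignments.zip to see whether the values are sample indices or another unit):
from pathlib import Path

# Hypothetical path; talker and utterance names follow the per-talker layout above.
for line in Path("alignments/s1/bbaf2n.align").read_text().splitlines():
    start, end, word = line.split()
    print(word, start, end)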
English Technical Speech Dataset
Overview
The English Technical Speech Dataset is a curated collection of English technical vocabulary recordings, designed for applications like Text-to-Speech (TTS), Automatic Speech Recognition (ASR), and Audio Classification. The dataset includes 11,247 entries and provides audio files, transcriptions, and speaker embeddings to support the development of robust technical language models.
Language: English (technical focus) Total… See the full description on the dataset page: https://huggingface.co/datasets/Tejasva-Maurya/English-Technical-Speech-Dataset.
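Since the corpus is hosted on the Hugging Face Hub, it can typically be loaded with the datasets library; the repository id comes from the dataset page above, while the split name and column contents are assumptions:
from datasets import load_dataset

ds = load_dataset("Tejasva-Maurya/English-Technical-Speech-Dataset", split="train")
print(ds)        # column names and sizes
print(ds[0])     # one entry: audio, transcription and speaker embedding per the description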