9 datasets found
  1. The Grid Audio-Visual Speech Corpus

    • live.european-language-grid.eu
    • explore.openaire.eu
    • +1more
    mpeg-2
    Updated May 15, 2022
    Cite
    (2022). The Grid Audio-Visual Speech Corpus [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7769
    Explore at:
    Available download formats: mpeg-2
    Dataset updated
    May 15, 2022
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Grid Corpus is a large multitalker audiovisual sentence corpus designed to support joint computational-behavioral studies in speech perception. In brief, the corpus consists of high-quality audio and video (facial) recordings of 1000 sentences spoken by each of 34 talkers (18 male, 16 female), for a total of 34000 sentences. Sentences are of the form "put red at G9 now".

    • audio_25k.zip contains the wav format utterances at a 25 kHz sampling rate, in a separate directory per talker
    • alignments.zip provides word-level time alignments, again separated by talker
    • s1.zip, s2.zip, etc. contain .jpg videos for each talker [note that due to an oversight, no video for talker t21 is available]

    The Grid Corpus is described in detail in the paper jasagrid.pdf included in the dataset.
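
    For orientation, here is a minimal sketch of how an utterance from audio_25k.zip might be paired with its word-level alignment from alignments.zip once both archives are extracted. The directory layout, the .align extension, and the whitespace-separated "start end word" format are assumptions; check jasagrid.pdf for the authoritative description.

    # Minimal sketch: pair a GRID utterance with its word alignment.
    # Assumes audio_25k.zip and alignments.zip were extracted into
    # audio_25k/<talker>/ and alignments/<talker>/, and that each alignment
    # file holds whitespace-separated "start end word" lines (hypothetical).
    import wave
    from pathlib import Path

    def load_utterance(talker: str, utt_id: str, root: Path = Path(".")):
        wav_path = root / "audio_25k" / talker / f"{utt_id}.wav"
        align_path = root / "alignments" / talker / f"{utt_id}.align"

        with wave.open(str(wav_path), "rb") as wav:
            sample_rate = wav.getframerate()          # expected to be 25000 Hz
            audio = wav.readframes(wav.getnframes())  # raw PCM bytes

        words = []
        for line in align_path.read_text().splitlines():
            start, end, word = line.split()
            words.append((float(start), float(end), word))

        return audio, sample_rate, words

    # Example with hypothetical talker/utterance IDs:
    # audio, sr, words = load_utterance("s1", "bbaf2n")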

  2. Laboratory Conditions Czech Audio-Visual Speech Corpus

    • catalog.elra.info
    • live.european-language-grid.eu
    Updated Nov 5, 2008
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2008). Laboratory Conditions Czech Audio-Visual Speech Corpus [Dataset]. http://catalog.elra.info/en-us/repository/browse/ELRA-S0283/
    Explore at:
    Dataset updated
    Nov 5, 2008
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    http://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    http://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    This is an audio-visual speech database for training and testing Czech audio-visual continuous speech recognition systems. The corpus consists of about 25 hours of audio-visual recordings of 65 speakers in laboratory conditions. Data collection was done with static illumination, and the recorded subjects were instructed to remain static. The average speaker age was 22 years. Speakers were asked to read 200 sentences each (50 common to all speakers and 150 specific to each speaker). The average total length of recording per speaker is 23 minutes.

    All audio-visual data are transcribed (.trs files) and divided into sentences (one sentence per file). For each video file there is a description file containing information about the position and size of the region of interest. Acoustic data are stored in wave files using PCM format, with a sampling frequency of 44 kHz and a resolution of 16 bits. Each speaker's acoustic data set represents about 140 MB of disk space (about 9 GB as a whole). Visual data are stored in video files (.avi format) using the digital video (DV) codec. Visual data per speaker take about 3 GB of disk space (about 195 GB as a whole) and are stored on an IDE hard disk (NTFS format).
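
    The .trs transcriptions mentioned above are the XML format written by the Transcriber tool, so the sentence text can presumably be pulled out with a standard XML parser. The element names below follow the usual Transcriber layout and are an assumption about these particular files.

    # Minimal sketch: read the sentence text from a Transcriber .trs file.
    # Assumes the standard Transcriber XML structure (<Trans>/<Section>/<Turn>);
    # verify against the files actually shipped with the corpus.
    import xml.etree.ElementTree as ET

    def read_trs_text(path: str) -> list[str]:
        tree = ET.parse(path)
        sentences = []
        for turn in tree.getroot().iter("Turn"):
            # Collect all text inside the turn, skipping empty <Sync/> markers.
            text = " ".join(t.strip() for t in turn.itertext() if t.strip())
            if text:
                sentences.append(text)
        return sentences

    # Example with a hypothetical path:
    # sentences = read_trs_text("speaker01/sentence_001.trs")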

  3. EGCLLC

    • huggingface.co
    Updated Dec 29, 2023
    Cite
    SilentSpeak (2023). EGCLLC [Dataset]. https://huggingface.co/datasets/SilentSpeak/EGCLLC
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 29, 2023
    Dataset authored and provided by
    SilentSpeak
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Enhanced GRID Corpus with Lip Landmark Coordinates

      Introduction
    

    This enhanced version of the GRID audiovisual sentence corpus, originally available at Zenodo, incorporates significant new features for auditory-visual speech recognition research. Building upon the preprocessed data from LipNet-PyTorch, we have added lip landmark coordinates to the dataset, providing detailed positional information of key points around the lips. This addition greatly enhances its utility in… See the full description on the dataset page: https://huggingface.co/datasets/SilentSpeak/EGCLLC.
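
    Since the corpus is hosted on the Hugging Face Hub, it can presumably be loaded with the datasets library as sketched below; the split and column names are not documented here, so inspect the returned object rather than relying on the commented example.

    # Minimal sketch: load EGCLLC from the Hugging Face Hub (pip install datasets).
    from datasets import load_dataset

    ds = load_dataset("SilentSpeak/EGCLLC")  # downloads and caches the data
    print(ds)                                # shows the available splits and columns

    # first_example = ds["train"][0]         # "train" is a hypothetical split name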

  4. Czech Audio-Visual Speech Corpus for Recognition with Impaired Conditions

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Nov 5, 2008
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2008). Czech Audio-Visual Speech Corpus for Recognition with Impaired Conditions [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0284/
    Explore at:
    Dataset updated
    Nov 5, 2008
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Description

    This is an audio-visual speech database for training and testing of Czech audio-visual continuous speech recognition systems, collected with impaired illumination conditions. The corpus consists of about 20 hours of audio-visual records of 50 speakers in laboratory conditions. Recorded subjects were instructed to remain static. The illumination varied, and chunks of each speaker were recorded under several different conditions, such as full illumination or illumination from one side (left or right) only. These conditions make the database usable for training lip-/head-tracking systems under various illumination conditions independently of the language. Speakers were asked to read 200 sentences each (50 common for all speakers and 150 specific to each speaker). The average total length of recording per speaker was 23 minutes.

    Acoustic data are stored in wave files using PCM format, sampling frequency 44 kHz, resolution 16 bits. Each speaker's acoustic data set represents about 180 MB of disk space (about 8.8 GB). Visual data are stored in video files (.avi format) using the digital video (DV) codec. Visual data per speaker take about 3.7 GB of disk (about 185 GB as a whole) and are stored on an IDE hard disk (NTFS format).

  5. GRID/CHiME-2 Track 1 - Video Features (25ms, 10ms)

    • zenodo.org
    • live.european-language-grid.eu
    application/gzip
    Updated Jan 24, 2020
    Cite
    Hendrik Meutzner (2020). GRID/CHiME-2 Track 1 - Video Features (25ms, 10ms) [Dataset]. http://doi.org/10.5281/zenodo.260211
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Hendrik Meutzner
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This archive contains the video features in Kaldi's [1] ark format that correspond to the CHiME-2 Track 1 [2] utterances for the isolated data (train, devel, test).

    The video files have been taken from the GRID corpus [3,4]. The features contain the 63-dimensional DCT coefficients of the landmark points extracted using the Viola-Jones algorithm. The features have been end-pointed and interpolated using a differential digital analyser in order to match the length of the utterances when using a frame length of 25ms and a frame shift of 10ms, which is the default configuration of Kaldi's feature extraction scripts.

    [1] http://kaldi-asr.org

    [2] http://spandh.dcs.shef.ac.uk/chime_challenge/chime2013/chime2_task1.html

    [3] http://spandh.dcs.shef.ac.uk/gridcorpus

    [4] Martin Cooke, Jon Barker, and Stuart Cunningham and Xu Shao, "An audio-visual corpus for speech perception and automatic speech recognition", The Journal of the Acoustical Society of America 120, 2421 (2006); http://doi.org/10.1121/1.2229005
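
    The ark archives can also be read outside of Kaldi; the sketch below uses the kaldiio package to iterate over the per-utterance feature matrices, which should come out with 63 columns per 10 ms frame given the description above. The archive filename is a placeholder.

    # Minimal sketch: iterate over the Kaldi ark video features with kaldiio
    # (pip install kaldiio). "video_feats.ark" is a placeholder filename for
    # one of the archives in this release.
    from kaldiio import ReadHelper

    with ReadHelper("ark:video_feats.ark") as reader:
        for utt_id, feats in reader:
            # feats is a NumPy array of shape (num_frames, 63): one
            # 63-dimensional DCT feature vector per 10 ms frame shift.
            print(utt_id, feats.shape)
            break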

  6. VisualTTS_GRID

    • huggingface.co
    Updated Dec 4, 2005
    Cite
    Xin Cheng (2005). VisualTTS_GRID [Dataset]. https://huggingface.co/datasets/ceaglex/VisualTTS_GRID
    Explore at:
    Dataset updated
    Dec 4, 2005
    Authors
    Xin Cheng
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    GRID

    GRID is an audio-visual corpus that has been collected to support the use of common material in speech perception and automatic speech recognition studies. The corpus consists of high-quality audio and video recordings of 1000 sentences spoken by each of 34 talkers. Sentences are simple, syntactically identical phrases such as “place green at B 4 now.” Intelligibility tests using the audio signals suggest that the material is easily identifiable in quiet and low levels of stationary… See the full description on the dataset page: https://huggingface.co/datasets/ceaglex/VisualTTS_GRID.

  7. Data from: Electroencephalography Responses to Simplified Visual Signals...

    • zenodo.org
    bin, zip
    Updated Aug 29, 2023
    Cite
    Enrico Varano; Tobias Reichenbach (2023). Data from: Electroencephalography Responses to Simplified Visual Signals Reveal Differences in Speech-in-Noise Comprehension [Dataset]. http://doi.org/10.5281/zenodo.6855795
    Explore at:
    Available download formats: zip, bin
    Dataset updated
    Aug 29, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Enrico Varano; Tobias Reichenbach
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Contents and Folder Structure:

    EEG Experiment

    • EEG_stimuli: these are the videos that were presented to participants in the EEG experiment, and the code that generates them from the original corpus (link)
    • data_2020>split_trials: contains the raw EEG data starting 1.995s before each trial and ending 1.995s after each trial, with naming convention subXx_VV_YY_N.fif, where X or Xx is the subject number, VV is the video condition (1e for the envelope dot, 1m for the mismatched dot, 4v for the cartoon, bw for the edge detection and nh for the natural condition), YY is the modality condition (AV for audiovisual and V0 for video only), and N is the trial number (between 0 and 4 inclusive); see the filename-parsing sketch after the EEG Experiment listing
      • unprocessed>raw: contains the unprocessed raw EEG data
      • processed>Fs-200>BP-1-80-ASR-INTP-AVR: contains the pre-processed raw EEG data: the output of run_preprocessing.m
      • processed>Fs-200>BP-1-80-ASR-INTP-AVR-ICr: contains the pre-processed raw EEG data after ICA cleaning: the output of run_reject_ICs.m
      • stim>stim_dwnspl: contains the aligned 200 Hz envelopes of the presented speech used as features for the time-lagged models
    • EEG_analysis_code [note: please extract the contents of this folder to match paths]
      • 2_ICA_filt: this folder contains the MATLAB code that performs the pre-processing of the EEG data, including filtering, downsampling, ICA cleaning etc. The main functions are:
        • run_preprocessing.m: downsampling, filtering, ASR cleaning
        • run_reject_ICs.m: ICLabel ICA cleaning
      • 3_analysis: this is the Python code that performs the TRF and backward modelling on the EEG data. The main functions are:
        • multisensory_bw.py: backwards model
        • multisensory_fw.py: forwards model
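
    As a starting point for working with the split_trials files, the sketch below parses the filename convention described above and shows how a single trial might be opened with MNE-Python. The regular expression and the use of mne.io.read_raw_fif are assumptions inferred from the description, not taken from the dataset's own analysis code.

    # Minimal sketch: parse the subXx_VV_YY_N.fif naming convention.
    import re

    FNAME_RE = re.compile(
        r"sub(?P<subject>\d+)_(?P<video>1e|1m|4v|bw|nh)_"
        r"(?P<modality>AV|V0)_(?P<trial>[0-4])\.fif$"
    )

    def parse_trial_fname(fname: str) -> dict:
        m = FNAME_RE.search(fname)
        if m is None:
            raise ValueError(f"unexpected filename: {fname}")
        return m.groupdict()

    # Example with a hypothetical file (requires mne: pip install mne):
    # info = parse_trial_fname("sub12_4v_AV_3.fif")
    # import mne
    # raw = mne.io.read_raw_fif("data_2020/split_trials/sub12_4v_AV_3.fif",
    #                           preload=True)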

    Behavioural Experiment

    • behavioural
      • 0_dataset: these are the videos that were presented to participants in the behavioural experiment, and the code that generates them from the original corpus (AV GRID corpus)
      • 3_analysis: behavioural data analysis script
        • main function: data_grid_v3.py
    • behavioural_data>data_grid: behavioural results
  8. KWAI-AD-AudVis

    • zenodo.org
    • live.european-language-grid.eu
    zip
    Updated Sep 15, 2020
    Cite
    Fei Tao; Xudong Liu; Xiaorong Mei; Lei Yuan; Ji Liu (2020). KWAI-AD-AudVis [Dataset]. http://doi.org/10.5281/zenodo.4023390
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 15, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Fei Tao; Xudong Liu; Xiaorong Mei; Lei Yuan; Ji Liu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    It consists of 85,432 ads videos from Kwai, a popular Chinese short-form video app. The videos were made and uploaded by commercial advertisers rather than personal users. The reason for using ads videos is twofold: 1) the source keeps the videos under some level of quality control, such as high-resolution pictures and intentionally designed scenes; 2) the ads videos mimic the style of those uploaded by personal users, as they are played in between the personal videos in the Kwai app. The collection can therefore be seen as a quality-controlled UGV dataset.

    The dataset was collected in two batches (Batch-1 is our preliminary work) and comes with tags for the ads industry cluster. The videos were randomly picked from a pool formed by selecting ads from several contiguous days. Half of the selected ads had a click-through rate (CTR) in the top 30,000 within that day and the other half had a CTR in the bottom 30,000. Note that the released dataset is a subset of the pool. The audio track has 2 channels (we mixed it down to mono in the study) and was sampled at 44.1 kHz, while the visual track has a resolution of 1280×720 and was sampled at 25 frames per second (FPS).

    This dataset is an extension of the KWAI-AD corpus [3]. It is suitable not only for tasks in the multimodal learning area, but also for ads recommendation. The ads videos have three main characteristics: 1) a video may carry very inconsistent information in its visual or audio streams; for example, it may play a drama-like story at first and then present the product introduction, whose scenes are very different; 2) the correspondence between the audio and visual streams is not clear; for instance, similar visual objects (e.g. a talking salesman) come with very different audio streams; 3) the relationship between audio and video varies across industries; for example, game and E-commerce ads have very different styles. These characteristics make the dataset suitable yet challenging for our study of AVC learning.

    In the folder, you will see: audio_features.tar.gz, meta, README, samples, ad_label.npy, video_fetaures.tar.gz. The details are included in README.
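
    As a starting point, the label file and the feature archives can be opened with standard Python tooling, as sketched below; what ad_label.npy actually contains and how the archives are organised are assumptions to be checked against the README.

    # Minimal sketch: load the label file and unpack a feature archive.
    # The contents of ad_label.npy (e.g. industry tags or CTR labels) are an
    # assumption here; the README in the release documents the details.
    import tarfile
    import numpy as np

    labels = np.load("ad_label.npy", allow_pickle=True)
    print(labels.shape, labels.dtype)

    with tarfile.open("audio_features.tar.gz", "r:gz") as tar:
        tar.extractall("audio_features")  # per-video audio features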

    If you use our dataset, please cite our paper: "Themes Inferred Audio-visual Correspondence Learning" (https://arxiv.org/pdf/2009.06573.pdf)

  9. Polish-English parallel corpus from the website of the National Audiovisual...

    • catalog.elra.info
    • live.european-language-grid.eu
    • +1more
    Updated Feb 27, 2020
    + more versions
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2020). Polish-English parallel corpus from the website of the National Audiovisual Institute (Processed) [Dataset]. https://catalog.elra.info/en-us/repository/browse/ELRA-W0289/
    Explore at:
    Dataset updated
    Feb 27, 2020
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalog.elra.info/static/from_media/metashare/licences/Open_Under_PSI.pdf

    Description

    This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Europe Facility - Automated Translation (CEF.AT) action. For further information on the project, see http://lr-coordination.eu. It is a Polish-English parallel corpus from the website of the National Audiovisual Institute (http://www.nina.gov.pl).
