98 datasets found
  1. Multimodal Vision-Audio-Language Dataset

    • zenodo.org
    • data.niaid.nih.gov
    Updated Jul 11, 2024
    Cite
    Timothy Schaumlöffel; Gemma Roig; Bhavin Choksi (2024). Multimodal Vision-Audio-Language Dataset [Dataset]. http://doi.org/10.5281/zenodo.10060785
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Timothy Schaumlöffel; Gemma Roig; Bhavin Choksi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities.

    Details can be found in the attached report.

    Annotation

    The annotation files are provided as Parquet files. They can be read using Python with the pandas and pyarrow libraries.

    The split into train, validation and test set follows the split of the original datasets.

    Installation

    pip install pandas pyarrow

    Example

    import pandas as pd
    df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
    print(df.iloc[0])

    dataset              AudioSet
    filename             train/---2_BBVHAA.mp3
    captions_visual      [a man in a black hat and glasses.]
    captions_auditory    [a man speaks and dishes clank.]
    tags                 [Speech]
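
    Building on the example above, here is a minimal sketch of how the annotations might be filtered once loaded. It assumes that missing captions are stored as NaN, as noted in the field descriptions below; this should be verified against the actual files.

    import pandas as pd

    # Load the training annotations (file name as in the example above).
    df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')

    # Keep only AudioSet clips that have both visual and auditory captions
    # (assuming missing captions are stored as NaN).
    audioset = df[df['dataset'] == 'AudioSet']
    with_both = audioset.dropna(subset=['captions_visual', 'captions_auditory'])
    print(len(audioset), 'AudioSet clips,', len(with_both), 'with both caption types')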

    Description

    The annotation file consists of the following fields:

    filename: Name of the corresponding file (video or audio file)
    dataset: Source dataset associated with the data point
    captions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual content
    captions_auditory: A list of captions related to the auditory content of the video
    tags: A list of tags, classifying the sound of a file. It can be NaN if no tags are provided

    Data files

    The raw data files for most datasets are not released due to licensing issues and must be downloaded from the original sources. However, if files are missing from the source, we can provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de

  2. BurnedAreaUAV Dataset (v1.1)

    • zenodo.org
    bin, jpeg, png
    Updated Jul 12, 2024
    + more versions
    Cite
    Tiago F. R. Ribeiro; Fernando Silva; José Moreira; Rogério Luís de C. Costa (2024). BurnedAreaUAV Dataset (v1.1) [Dataset]. http://doi.org/10.5281/zenodo.7944963
    Available download formats: bin, png, jpeg
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tiago F. R. Ribeiro; Fernando Silva; José Moreira; Rogério Luís de C. Costa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    General Description

    A manually annotated dataset of video frames and segmentation masks for segmenting the burned area of a forest fire, based on video captured by a UAV. A detailed explanation of the dataset generation is available in the open-access article "Burned area semantic segmentation: A novel dataset and evaluation using convolutional networks".

    Data Collection

    The BurnedAreaUAV dataset derives from a video captured at latitude 41° 23' 37.56" and longitude -7° 37' 0.32", at Torre do Pinhão in northern Portugal, in an area characterized by shrubby to herbaceous vegetation. The video was captured during the evolution of a prescribed fire using a DJI Phantom 4 PRO UAV equipped with an FC6310S RGB camera.

    Video Overview

    The video captures a prescribed fire where the burned area increases progressively. At the beginning of the sequence, a significant portion of the UAV sensor's field of view is already burned, and the burned area expands as time goes by. The video was collected by an RGB sensor mounted on a drone that was kept in a nearly stationary position for the duration of data collection.

    The video is about 15 minutes long with a frame rate of 25 frames per second, amounting to 22,500 images. Throughout this period, the progression of the burned area is observed. The original video has a resolution of 720×1280 and is stored in H.264 (MPEG-4 Part 10) format. No audio signal was collected.

    Manual Annotation

    The annotation was done every 100 frames, which corresponds to a sampling period of 4 seconds. Two classes are considered: burned_area and unburned_area. This annotation covers the entire length of the video. The training set consists of 226 frame–mask pairs and the test set of 23. The training and test annotations are offset by 50 frames.

    We plan to expand this dataset in the future.


    File Organization (BurnedAreaUAV_v1.rar)

    The data is available in PNG, JSON (Labelme format), and WKT (segmentation masks only). The raw video data is also made available.

    Concomitantly, photos were taken that provide metadata about the position of the drone, including height and coordinates, the orientation of the drone and the camera, and more. The geographic data regarding the location of the controlled fire are represented in a KML file that Google Earth and other geospatial software can read. We also provide two high-resolution orthophotos of the area of interest before and after burning.

    The data produced by the segmentation models developed in "Burned area semantic segmentation: A novel dataset and evaluation using convolutional networks", comprising outputs in PNG and WKT formats, is also readily available upon request.

    BurnedAreaUAV_dataset_v1.rar
      MP4_video (folder)
        -- original_prescribed_burn_video.mp4

      PNG (folder)
        train (folder)
          frames (folder)
            -- frame_000000.png (raster image)
            -- frame_000100.png
            -- frame_000200.png
          msks (folder)
            -- mask_000000.png
            -- mask_000100.png
            -- mask_000200.png
        test (folder)
          frames (folder)
            -- frame_020250.png
            -- frame_020350.png
            -- frame_020450.png
          msks (folder)
            -- mask_020250.png
            -- mask_020350.png
            -- mask_020450.png

      JSON (folder)
        train_valid_json (folder)
          -- frame_000000.json (Labelme format)
          -- frame_000100.json
          -- frame_000200.json
          -- frame_000300.json
        test_json (folder)
          -- frame_020250.json
          -- frame_020350.json
          -- frame_020450.json

      WKT_files (folder)
        -- train_valid.wkt (list of mask polygons)
        -- test.wkt

      UAV photos (metadata)
        -- uav_photo1_metadata.JPG
        -- uav_photo2_metadata.JPG

      High-resolution orthophoto files
        -- odm_orthophoto_afterBurning.png
        -- odm_orthophoto_beforeBurning.png

      Keyhole Markup Language file (area under study polygon)
        -- pinhao_cell_precribed_area.kml
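
    For illustration, here is a minimal sketch of reading the WKT segmentation masks with the shapely library. It assumes each line of train_valid.wkt contains one polygon in WKT form, which should be checked against the actual file.

    from shapely import wkt

    # Read one polygon per line from the WKT annotation file (layout assumed).
    with open('WKT_files/train_valid.wkt') as f:
        polygons = [wkt.loads(line) for line in f if line.strip()]

    # Report the number of masks and the area (in pixels^2) of the first burned-area polygon.
    print(len(polygons), 'polygons; first polygon area:', polygons[0].area)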

    Acknowledgements

    This dataset results from activities developed in the context of projects partially funded by FCT - Fundação para a Ciência e a Tecnologia, I.P., through projects MIT-EXPL/ACC/0057/2021 and UIDB/04524/2020, and under the Scientific Employment Stimulus - Institutional Call - CEECINST/00051/2018.

    The source code is available here.

  3. GLips - German Lipreading Dataset - Dataset - B2FIND

    • b2find.dkrz.de
    Updated Oct 29, 2023
    + more versions
    Cite
    (2023). GLips - German Lipreading Dataset - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/aca4d0c4-9e81-560c-b3af-de011686ecc6
    Dataset updated
    Oct 29, 2023
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    The German Lipreading dataset consists of 250,000 publicly available videos of the faces of speakers of the Hessian Parliament, which were processed for word-level lip reading using an automatic pipeline. The format is similar to that of the English-language Lip Reading in the Wild (LRW) dataset: each H264-compressed MPEG-4 video encodes one word of interest in a context of 1.16 seconds duration, which makes the two datasets compatible for studying transfer learning. Choosing video material based on naturally spoken language in a natural environment ensures more robust results for real-world applications than artificially generated datasets with as little noise as possible. The 500 different spoken words, ranging between 4 and 18 characters in length, each have 500 instances with separate MPEG-4 audio and text metadata files, originating from 1,018 parliamentary sessions. Additionally, the complete TextGrid files containing the segmentation information of those sessions are included. The size of the uncompressed dataset is 16 GB. Copyright of original data: Hessian Parliament (https://hessischer-landtag.de). If you use this dataset, you agree to use it for research purposes only and to cite the following reference in any works that make any use of the dataset.

  4. Multimodal WEDAR dataset for attention regulation behaviors, self-reported distractions, reaction time, and knowledge gain in e-reading

    • data.4tu.nl
    • 4tu.edu.hpc.n-helix.com
    zip
    Updated May 9, 2023
    Cite
    Yoon Lee; Marcus Specht (2023). Multimodal WEDAR dataset for attention regulation behaviors, self-reported distractions, reaction time, and knowledge gain in e-reading [Dataset]. http://doi.org/10.4121/8f730aa3-ad04-4419-8a5b-325415d2294b.v1
    Available download formats: zip
    Dataset updated
    May 9, 2023
    Dataset provided by
    4TU.ResearchData
    Authors
    Yoon Lee; Marcus Specht
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Diverse learning theories have been constructed to understand learners' internal states through various tangible predictors. We focus on self-regulatory actions that are subconscious and habitual actions triggered by behavior agents' 'awareness' of their attention loss. We hypothesize that self-regulatory behaviors (i.e., attention regulation behaviors) also occur in e-reading as 'regulators' as found in other behavior models (Ekman, P., & Friesen, W. V., 1969). In this work, we try to define the types and frequencies of attention regulation behaviors in e-reading. We collected various cues that reflect learners' moment-to-moment and page-to-page cognitive states to understand the learners' attention in e-reading.

    The text 'How to make the most of your day at Disneyland Resort Paris' has been implemented on a screen-based e-reader, which we developed in a pdf-reader format. An informative, entertaining text was adopted to capture learners' attentional shifts during knowledge acquisition. The text has 2685 words, distributed over ten pages, with one subtopic on each page. A built-in webcam on Mac Pro and a mouse have been used for the data collection, aiming for real-world implementation only with essential computational devices. A height-adjustable laptop stand has been used to compensate for participants' eye levels.

    Thirty learners in higher education were invited for a screen-based e-reading task (M=16.2, SD=5.2 minutes). A pre-test questionnaire with ten multiple-choice questions was given before the reading to check their prior knowledge of the topic. There was no specific time limit to finish the questionnaire. We collected cues that reflect learners' moment-to-moment and page-to-page cognitive states to understand their attention in e-reading. Learners were asked to report their distractions on two levels during the reading: 1) in-text distraction (e.g., still reading the text with low attentiveness) or 2) out-of-text distraction (e.g., thinking of something else while not reading the text anymore). We implemented two noticeably designed buttons on the right-hand side of the screen interface to minimize possible distraction from the reporting task.

    After a new page was triggered, blur stimuli were applied to the text within a random range of 20 seconds, which ensures that the blur stimuli occur at least once on each page. Participants were asked to click the de-blur button on the text area of the screen to proceed with the reading. The button covers the whole text area, so participants can minimize the effort to find and click it. Reaction time for de-blurring was also measured, to gauge learners' arousal during the reading. We asked participants to answer pre-test and post-test questionnaires about the reading material. Participants were given ten multiple-choice questions before the session, and the same set of questions was given after the reading session (i.e., formative questions) together with added subtopic summarization questions (i.e., summative questions). This provides insights into the quantitative and qualitative knowledge gained through the session and into different learning outcomes based on individual differences.

    A video dataset of 931,440 frames was annotated with the attention regulator behaviors using an annotation tool that plays the long sequence clip by clip, where each clip contains 30 frames. Two annotators (doctoral students) performed two stages of labeling. In the first stage, the annotators were trained on the labeling criteria and annotated the attention regulator behaviors separately based on their own judgment. The labels were summarized and cross-checked in the second round to address inconsistent cases, resulting in five attention regulation behaviors and one neutral state. See WEDAR_readme.csv for detailed descriptions of features.

    The dataset is provided in two forms: 1) raw data, in the form in which it was collected, and 2) preprocessed data, from which we extracted useful features for further learning analytics based on real-time and post-hoc data.

    Reference

    Ekman, P., & Friesen, W. V. (1969). The repertoire of nonverbal behavior: Categories, origins, usage, and coding. Semiotica, 1(1), 49-98.

  5. FSDKaggle2018

    • zenodo.org
    • opendatalab.com
    • +1more
    zip
    Updated Jan 24, 2020
    + more versions
    Cite
    Eduardo Fonseca; Xavier Favory; Jordi Pons; Frederic Font; Manoj Plakal; Daniel P. W. Ellis; Xavier Serra (2020). FSDKaggle2018 [Dataset]. http://doi.org/10.5281/zenodo.2552860
    Available download formats: zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Eduardo Fonseca; Xavier Favory; Jordi Pons; Frederic Font; Manoj Plakal; Daniel P. W. Ellis; Xavier Serra
    Description

    FSDKaggle2018 is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology. FSDKaggle2018 has been used for the DCASE Challenge 2018 Task 2, which was run as a Kaggle competition titled Freesound General-Purpose Audio Tagging Challenge.

    Citation

    If you use the FSDKaggle2018 dataset or part of it, please cite our DCASE 2018 paper:

    Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Favory, Jordi Pons, Xavier Serra. "General-purpose Tagging of Freesound Audio with AudioSet Labels: Task Description, Dataset, and Baseline". Proceedings of the DCASE 2018 Workshop (2018)

    You can also consider citing our ISMIR 2017 paper, which describes how we gathered the manual annotations included in FSDKaggle2018.

    Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, "Freesound Datasets: A Platform for the Creation of Open Audio Datasets", In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017

    Contact

    You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.

    About this dataset

    Freesound Dataset Kaggle 2018 (or FSDKaggle2018 for short) is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology [1]. FSDKaggle2018 has been used for the Task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2018. Please visit the DCASE2018 Challenge Task 2 website for more information. This Task was hosted on the Kaggle platform as a competition titled Freesound General-Purpose Audio Tagging Challenge. It was organized by researchers from the Music Technology Group of Universitat Pompeu Fabra, and from Google Research’s Machine Perception Team.

    The goal of this competition was to build an audio tagging system that can categorize an audio clip as belonging to one of a set of 41 diverse categories drawn from the AudioSet Ontology.

    All audio samples in this dataset are gathered from Freesound [2] and are provided here as uncompressed PCM 16 bit, 44.1 kHz, mono audio files. Note that because Freesound content is collaboratively contributed, recording quality and techniques can vary widely.

    The ground truth data provided in this dataset has been obtained after a data labeling process which is described below in the Data labeling process section. FSDKaggle2018 clips are unequally distributed in the following 41 categories of the AudioSet Ontology:

    "Acoustic_guitar", "Applause", "Bark", "Bass_drum", "Burping_or_eructation", "Bus", "Cello", "Chime", "Clarinet", "Computer_keyboard", "Cough", "Cowbell", "Double_bass", "Drawer_open_or_close", "Electric_piano", "Fart", "Finger_snapping", "Fireworks", "Flute", "Glockenspiel", "Gong", "Gunshot_or_gunfire", "Harmonica", "Hi-hat", "Keys_jangling", "Knock", "Laughter", "Meow", "Microwave_oven", "Oboe", "Saxophone", "Scissors", "Shatter", "Snare_drum", "Squeak", "Tambourine", "Tearing", "Telephone", "Trumpet", "Violin_or_fiddle", "Writing".

    Some other relevant characteristics of FSDKaggle2018:

    • The dataset is split into a train set and a test set.

    • The train set is meant for system development and includes ~9.5k samples unequally distributed among 41 categories. The minimum number of audio samples per category in the train set is 94, and the maximum is 300. The duration of the audio samples ranges from 300 ms to 30 s due to the diversity of the sound categories and the preferences of Freesound users when recording sounds. The total duration of the train set is roughly 18 h.

    • Out of the ~9.5k samples from the train set, ~3.7k have manually-verified ground truth annotations and ~5.8k have non-verified annotations. The non-verified annotations of the train set have a quality estimate of at least 65-70% in each category. Check out the Data labeling process section below for more information about this aspect.

    • Non-verified annotations in the train set are properly flagged in train.csv so that participants can opt to use this information during the development of their systems (a loading sketch follows this list).

    • The test set is composed of 1.6k samples with manually-verified annotations and a category distribution similar to that of the train set. The total duration of the test set is roughly 2 h.

    • All audio samples in this dataset have a single label (i.e., they are only annotated with one label). Check out the Data labeling process section below for more information about this aspect. A single label should be predicted for each file in the test set.
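
    As mentioned in the list above, non-verified annotations are flagged in the train metadata. Below is a minimal pandas sketch of how that flag might be used; it assumes the post-competition CSV (see the directory listing further down) exposes fname, label, and manually_verified columns, which should be checked against the actual file.

    import pandas as pd

    # Load the train metadata (filename taken from the directory listing below; columns assumed).
    train = pd.read_csv('FSDKaggle2018.meta/train_post_competition.csv')

    # Split clips by annotation reliability using the verification flag
    # (column name assumed; adjust if the CSV uses a different one).
    verified = train[train['manually_verified'] == 1]
    non_verified = train[train['manually_verified'] == 0]

    print(len(verified), 'manually verified clips,', len(non_verified), 'non-verified clips')
    print(train['label'].value_counts().head())  # most frequent categories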

    Data labeling process

    The data labeling process started from a manual mapping between Freesound tags and AudioSet Ontology categories (or labels), which was carried out by researchers at the Music Technology Group, Universitat Pompeu Fabra, Barcelona. Using this mapping, a number of Freesound audio samples were automatically annotated with labels from the AudioSet Ontology. These annotations can be understood as weak labels since they express the presence of a sound category in an audio sample.

    Then, a data validation process was carried out in which a number of participants did listen to the annotated sounds and manually assessed the presence/absence of an automatically assigned sound category, according to the AudioSet category description.

    Audio samples in FSDKaggle2018 are only annotated with a single ground truth label (see train.csv). A total of 3,710 annotations included in the train set of FSDKaggle2018 are annotations that have been manually validated as present and predominant (some with inter-annotator agreement, but not all of them). This means that in most cases there is no additional acoustic material other than the labeled category. In a few cases there may be some additional sound events, but these additional events won't belong to any of the 41 categories of FSDKaggle2018.

    The rest of the annotations have not been manually validated and therefore some of them could be inaccurate. Nonetheless, we have estimated that at least 65-70% of the non-verified annotations per category in the train set are indeed correct. It can happen that some of these non-verified audio samples present several sound sources even though only one label is provided as ground truth. These additional sources are typically out of the set of the 41 categories, but in a few cases they could be within.

    More details about the data labeling process can be found in [3].

    License

    FSDKaggle2018 has licenses at two different levels, as explained next.

    All sounds in Freesound are released under Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound. For attribution purposes and to facilitate attribution of these files to third parties, we include a relation of the audio clips included in FSDKaggle2018 and their corresponding license. The licenses are specified in the files train_post_competition.csv and test_post_competition_scoring_clips.csv.

    In addition, FSDKaggle2018 as a whole is the result of a curation process and it has an additional license. FSDKaggle2018 is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSDKaggle2018.doc zip file.

    Files

    FSDKaggle2018 can be downloaded as a series of zip files with the following directory structure:

    root
    │
    └───FSDKaggle2018.audio_train/                        Audio clips in the train set
    │
    └───FSDKaggle2018.audio_test/                         Audio clips in the test set
    │
    └───FSDKaggle2018.meta/                               Files for evaluation setup
    │   │
    │   └───train_post_competition.csv                    Data split and ground truth for the train set
    │   │
    │   └───test_post_competition_scoring_clips.csv       Ground truth for the test set
    │
    └───FSDKaggle2018.doc/
        │
        └───README.md                                     The dataset description file you are reading
        │
        └───LICENSE-DATASET

  6. NFED-fmri

    • zenodo.org
    zip
    Updated Sep 14, 2024
    Cite
    panapn Chen (2024). NFED-fmri [Dataset]. http://doi.org/10.5281/zenodo.13759829
    Available download formats: zip
    Dataset updated
    Sep 14, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    panapn Chen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Sep 13, 2024
    Description
    # A large-scale fMRI dataset in response to short naturalistic facial expression videos
    The Naturalistic Facial Expressions Dataset (NFED) is a large-scale dataset of whole-brain functional magnetic resonance imaging (fMRI) responses to 1,320 short (3 s) facial expression video clips. NFED offers researchers fMRI data that enables them to investigate the neural mechanisms involved in processing emotional information communicated by facial expression videos in real-world environments.
    The dataset contains raw data, pre-processed volume data, pre-processed surface data, and surface-based analyzed data.
    For more details, please refer to the paper at {website} and the dataset at https://openneuro.org/datasets/ds005047

    ## Preprocess procedure
    The MRI data were preprocessed using the pipeline of Kay et al., combining code written in MATLAB with tools from FreeSurfer, SPM, and FSL (http://github.com/kendrickkay). We used FreeSurfer software (http://surfer.nmr.mgh.harvard.edu) to construct the pial and white surfaces of participants from the T1 volume. Additionally, we established an intermediate gray matter surface between the pial and white surfaces for all participants.

    **code: ./volume_pre-process/**

    Detailed usage notes are available in the code; please read them carefully and modify variables to match your own environment.
    ## GLM of main experiment
    We performed a single-trial GLM (GLMsingle), an advanced denoising toolbox in MATLAB used to improve single-trial BOLD response estimates, to model the time-series data in surface format from the pre-processed fMRI data of each participant. Three specific amplitudes of responses (i.e., beta values) were estimated by modeling the BOLD response in relation to each video onset from 1 to 3 seconds in 1-second steps. Using GLMsingle in the manner described above, the BOLD responses evoked by each video were assessed in the runs of each session. In total, we extracted 2 (repetitions) x 3 (seconds) beta estimates for each video condition in the training set, and 10 (repetitions) x 3 (seconds) estimated beta values for each video condition in the testing set.
    **code: ./GLMsingle-main-experiment/matlab/examples.m**

    #### Retinotopic mapping

    The fMRI data from the population receptive field experiment were analyzed with a pRF model implemented in the analyzePRF toolbox (http://cvnlab.net/analyzePRF/) to characterize individual retinotopic representations. Make sure to download the required software mentioned in the code.

    **code: ./Functional-localizer-experiment-analysis/s4a_analysis_prf.m**

    #### fLoc experiment

    We used GLMdenoise, a data-driven denoising method, to analyze the pre-processed fMRI data from the fLoc experiment. We used a "condition-split" strategy to code the 10 stimulus categories, splitting the trials related to each category into individual conditions in each run. Six response estimates (beta values) for each category were produced by using six condition-splits. To quantify selectivity for various categories and domains, we computed t-values from the GLM beta values after fitting the GLM. The resulting maps were used to define category-selective regions of interest for each participant.
    **code: ./Functional-localizer-experiment-analysis/s4a_analysis_floc.m**


    ## Validation
    ### Basic quality control
    **code: ./validation/FD/FD.py**
    **code: ./validation/tSNR/tSNR.py**
    ### Noise ceiling
    The data are available at https://openneuro.org/datasets/ds005047. The "./validation/code/noise_celling/sub-xx" folders store the intermediate files required for running the program.
    **code: ./validation/noise_celling/Noise_Ceiling.py**

    ### Correspondence between human brain and DCNN
    The data are available at https://openneuro.org/datasets/ds005047. We combined the data from the main experiment and the functional localizer experiments to build an encoding model to replicate the hierarchical correspondence of representations between the brain and the DCNN. The encoding models were built to map artificial representations from each layer of the pre-trained VideoMAEv2 to neural representations from each area of the human visual cortex, as defined in the multimodal parcellation atlas.
    **code: ./validation/dnnbrain/**

    ### Semantic metadata of action and expression labels reveal that NFED can encode temporal and spatial stimulus features in the brain
    The data are available at https://openneuro.org/datasets/ds005047. The "./validation/code/semantic_metadata/xx_xx_semantic_metadata" folders store the intermediate files required for running the program.
    **code: ./validation/semantic_metadata/**
    ### GLMsingle-main-experiment
    The data are available at https://openneuro.org/datasets/ds005047. The "./validation/code/noise_celling/sub-xx" folders store the intermediate files required for running the program.
    **code: ./validation/noise_celling/Noise_Ceiling.py**




    ## Results
    The results can be viewed at "https://openneuro.org/datasets/ds005047/derivatives/validation/results/brain_map_individual".

    ## Whole-brain mapping
    The whole-brain data mapped to the cerebral cortex as obtained from the technical validation.
    **code: ./show_results_allbrain/Showresults.m**

    ## Manually prepared environment
    We provide *requirements.txt* to install the Python packages used in these codes. However, some packages like *GLM* and *pre-processing* require external dependencies, and we have provided those packages in the corresponding files.

    ## Stimuli
    The video stimuli used in the NFED experiment are saved in the "stimuli_1" and "stimuli_2" folders.

  7. Data from: A Tape-Reading Molecular Ratchet

    • data.mendeley.com
    Updated Oct 19, 2022
    Cite
    Daniel Tetlow (2022). A Tape-Reading Molecular Ratchet [Dataset]. http://doi.org/10.17632/k5m7fv49xv.1
    Dataset updated
    Oct 19, 2022
    Authors
    Daniel Tetlow
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplementary Data (1H & 13C NMR, CD and MS files) for "A Tape-Reading Molecular Ratchet"

  8. Raw Data for ConfLab: A Data Collection Concept, Dataset, and Benchmark for Machine Analysis of Free-Standing Social Interactions in the Wild

    • data.4tu.nl
    Updated Jun 7, 2022
    + more versions
    Cite
    Chirag Raman; Jose Vargas Quiros; Stephanie Tan; Ashraful Islam; Ekin Gedik; Hayley Hung (2022). Raw Data for ConfLab: A Data Collection Concept, Dataset, and Benchmark for Machine Analysis of Free-Standing Social Interactions in the Wild [Dataset]. http://doi.org/10.4121/20017748.v2
    Dataset updated
    Jun 7, 2022
    Dataset provided by
    4TU.ResearchData
    Authors
    Chirag Raman; Jose Vargas Quiros; Stephanie Tan; Ashraful Islam; Ekin Gedik; Hayley Hung
    License

    https://data.4tu.nl/info/fileadmin/user_upload/Documenten/4TU.ResearchData_Restricted_Data_2022.pdf

    Description

    This file contains raw data for cameras and wearables of the ConfLab dataset.


    ./cameras

    contains the overhead video recordings for 9 cameras (cam2-10) in MP4 files. These cameras cover the whole interaction floor, with camera 2 capturing the bottom of the scene layout and camera 10 capturing the top of the scene layout. Note that cam5 ran out of battery before the other cameras and thus its recordings are cut short. However, cam4 and cam6 overlap significantly with cam5, so any information needed can be reconstructed.


    Note that the annotations are made and provided in 2-minute segments. The annotated portions of the video include the last 3 min 38 s of the x2xxx.MP4 video files and the first 12 min of the x3xxx.MP4 files for cameras 2, 4, 6, 8, and 10, with "x" being the placeholder character in the mp4 file names. If one wishes to separate the video into 2-minute segments as we did, the "video-splitting.sh" script is provided.


    ./camera-calibration contains the camera intrinsic files obtained from https://github.com/idiap/multicamera-calibration. Camera extrinsic parameters can be calculated using the existing intrinsic parameters and the instructions in the multicamera-calibration repo. The coordinates in the image are provided by the crosses marked on the floor, which are visible in the video recordings. The crosses are 1 m (=100 cm) apart.


    ./wearables

    subdirectory includes the IMU, proximity, and audio data from each participant at the ConfLab event (48 in total). In the directory numbered by participant ID, the following data are included:

    1. raw audio file
    2. proximity (Bluetooth) pings (RSSI) file (raw and csv) and a visualization
    3. Tri-axial accelerometer data (raw and csv) and a visualization
    4. Tri-axial gyroscope data (raw and csv) and a visualization
    5. Tri-axial magnetometer data (raw and csv) and a visualization
    6. Game rotation vector (raw and csv), recorded in quaternions


    All files are timestamped. The sampling frequencies are:

    - audio: 1250 Hz
    - rest: around 50 Hz; however, the sample rate is not fixed, so the timestamps should be used.

    For rotation, the game rotation vector's output frequency is limited by the actual sampling frequency of the magnetometer. For more information, please refer to https://invensense.tdk.com/wp-content/uploads/2016/06/DS-000189-ICM-20948-v1.3.pdf


    Audio files in this folder are in raw binary form. The following can be used to convert them to WAV files (1250 Hz):

    ffmpeg -f s16le -ar 1250 -ac 1 -i /path/to/audio/file /path/to/output.wav
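
    If ffmpeg is not available, the same conversion can be done with the Python standard library; a minimal sketch is below, assuming the raw files are signed 16-bit little-endian mono PCM at 1250 Hz as stated above (the paths are placeholders).

    import wave

    def raw_to_wav(raw_path, wav_path, sample_rate=1250):
        """Wrap a raw s16le mono PCM stream in a WAV container."""
        with open(raw_path, 'rb') as f:
            pcm = f.read()
        with wave.open(wav_path, 'wb') as wav:
            wav.setnchannels(1)        # mono
            wav.setsampwidth(2)        # 16-bit samples
            wav.setframerate(sample_rate)
            wav.writeframes(pcm)

    # Example (placeholder paths):
    # raw_to_wav('/path/to/audio/file', '/path/to/output.wav')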


    Synchronization of cameras and wearables data

    Raw videos contain timecode information which matches the timestamps of the data in the "wearables" folder. The starting timecode of a video can be read as:

    ffprobe -hide_banner -show_streams -i /path/to/video


    ./audio

    ./sync: contains WAV files for each subject
    ./sync_files: auxiliary csv files used to sync the audio. Can be used to improve the synchronization.

    The code used for syncing the audio can be found here:
    https://github.com/TUDelft-SPC-Lab/conflab/tree/master/preprocessing/audio

  9. FSDKaggle2019

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +1more
    Updated Jan 24, 2020
    + more versions
    Cite
    Eduardo Fonseca (2020). FSDKaggle2019 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3612636
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Eduardo Fonseca
    Daniel P. W. Ellis
    Manoj Plakal
    Frederic Font
    Xavier Serra
    Description

    FSDKaggle2019 is an audio dataset containing 29,266 audio files annotated with 80 labels of the AudioSet Ontology. FSDKaggle2019 has been used for the DCASE Challenge 2019 Task 2, which was run as a Kaggle competition titled Freesound Audio Tagging 2019.

    Citation

    If you use the FSDKaggle2019 dataset or part of it, please cite our DCASE 2019 paper:

    Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Serra. "Audio tagging with noisy labels and minimal supervision". Proceedings of the DCASE 2019 Workshop, NYC, US (2019)

    You can also consider citing our ISMIR 2017 paper, which describes how we gathered the manual annotations included in FSDKaggle2019.

    Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, "Freesound Datasets: A Platform for the Creation of Open Audio Datasets", In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017

    Data curators

    Eduardo Fonseca, Manoj Plakal, Xavier Favory, Jordi Pons

    Contact

    You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.

    ABOUT FSDKaggle2019

    Freesound Dataset Kaggle 2019 (or FSDKaggle2019 for short) is an audio dataset containing 29,266 audio files annotated with 80 labels of the AudioSet Ontology [1]. FSDKaggle2019 has been used for the Task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2019. Please visit the DCASE2019 Challenge Task 2 website for more information. This Task was hosted on the Kaggle platform as a competition titled Freesound Audio Tagging 2019. It was organized by researchers from the Music Technology Group (MTG) of Universitat Pompeu Fabra (UPF), and from the Sound Understanding team at Google AI Perception. The competition intended to provide insight towards the development of broadly-applicable sound event classifiers able to cope with label noise and minimal supervision conditions.

    FSDKaggle2019 employs audio clips from the following sources:

    Freesound Dataset (FSD): a dataset being collected at the MTG-UPF based on Freesound content organized with the AudioSet Ontology

    The soundtracks of a pool of Flickr videos taken from the Yahoo Flickr Creative Commons 100M dataset (YFCC)

    The audio data is labeled using a vocabulary of 80 labels from Google’s AudioSet Ontology [1], covering diverse topics: Guitar and other Musical Instruments, Percussion, Water, Digestive, Respiratory sounds, Human voice, Human locomotion, Hands, Human group actions, Insect, Domestic animals, Glass, Liquid, Motor vehicle (road), Mechanisms, Doors, and a variety of Domestic sounds. The full list of categories can be inspected in vocabulary.csv (see Files & Download below). The goal of the task was to build a multi-label audio tagging system that can predict appropriate label(s) for each audio clip in a test set.

    What follows is a summary of some of the most relevant characteristics of FSDKaggle2019. Nevertheless, it is highly recommended to read our DCASE 2019 paper for a more in-depth description of the dataset and how it was built.

    Ground Truth Labels

    The ground truth labels are provided at the clip-level, and express the presence of a sound category in the audio clip, hence can be considered weak labels or tags. Audio clips have variable lengths (roughly from 0.3 to 30s).

    The audio content from FSD has been manually labeled by humans following a data labeling process using the Freesound Annotator platform. Most labels have inter-annotator agreement but not all of them. More details about the data labeling process and the Freesound Annotator can be found in [2].

    The YFCC soundtracks were labeled using automated heuristics applied to the audio content and metadata of the original Flickr clips. Hence, a substantial amount of label noise can be expected. The label noise can vary widely in amount and type depending on the category, including in- and out-of-vocabulary noises. More information about some of the types of label noise that can be encountered is available in [3].

    Specifically, FSDKaggle2019 features three types of label quality, one for each set in the dataset:

    curated train set: correct (but potentially incomplete) labels

    noisy train set: noisy labels

    test set: correct and complete labels

    Further details can be found below in the sections for each set.

    Format

    All audio clips are provided as uncompressed PCM 16 bit, 44.1 kHz, mono audio files.

    DATA SPLIT

    FSDKaggle2019 consists of two train sets and one test set. The idea is to limit the supervision provided for training (i.e., the manually-labeled, hence reliable, data), thus promoting approaches to deal with label noise.

    Curated train set

    The curated train set consists of manually-labeled data from FSD.

    Number of clips/class: 75, except in a few cases (where there are fewer)

    Total number of clips: 4970

    Avg number of labels/clip: 1.2

    Total duration: 10.5 hours

    The duration of the audio clips ranges from 0.3 to 30s due to the diversity of the sound categories and the preferences of Freesound users when recording/uploading sounds. Labels are correct but potentially incomplete. It can happen that a few of these audio clips present additional acoustic material beyond the provided ground truth label(s).

    Noisy train set

    The noisy train set is a larger set of noisy web audio data from Flickr videos taken from the YFCC dataset [5].

    Number of clips/class: 300

    Total number of clips: 19,815

    Avg number of labels/clip: 1.2

    Total duration: ~80 hours

    The duration of the audio clips ranges from 1s to 15s, with the vast majority lasting 15s. Labels are automatically generated and purposefully noisy. No human validation is involved. The label noise can vary widely in amount and type depending on the category, including in- and out-of-vocabulary noises.

    Considering the numbers above, the per-class data distribution available for training is, for most of the classes, 300 clips from the noisy train set and 75 clips from the curated train set. This means 80% noisy / 20% curated at the clip level, while at the duration level the proportion is more extreme considering the variable-length clips.
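
    To make the split above concrete, here is a minimal pandas sketch of how the two train sets might be combined. It assumes the post-competition CSVs named later in the LICENSE section (train_curated_post_competition.csv and train_noisy_post_competition.csv) expose at least fname and labels columns, with multiple labels separated by commas; this should be verified against the actual files.

    import pandas as pd

    # Load both train sets (filenames taken from the dataset description; columns assumed).
    curated = pd.read_csv('train_curated_post_competition.csv')
    noisy = pd.read_csv('train_noisy_post_competition.csv')

    # Tag each clip with its source so label quality can be handled downstream.
    curated['source'] = 'curated'
    noisy['source'] = 'noisy'
    train = pd.concat([curated, noisy], ignore_index=True)

    # Multi-label ground truth: split the comma-separated labels into Python lists.
    train['labels'] = train['labels'].str.split(',')

    print(train['source'].value_counts())    # roughly 80% noisy / 20% curated at the clip level
    print(train['labels'].str.len().mean())  # average number of labels per clip (~1.2)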

    Test set

    The test set is used for system evaluation and consists of manually-labeled data from FSD.

    Number of clips/class: between 50 and 150

    Total number of clips: 4481

    Avg number of labels/clip: 1.4

    Total duration: 12.9 hours

    The acoustic material present in the test set clips is labeled exhaustively using the aforementioned vocabulary of 80 classes. Most labels have inter-annotator agreement, but not all of them. Barring human error, the labels are correct and complete considering the target vocabulary; nonetheless, a few clips could still present additional (unlabeled) acoustic content outside the vocabulary.

    During the DCASE2019 Challenge Task 2, the test set was split into two subsets, for the public and private leaderboards, and only the data corresponding to the public leaderboard was provided. In this current package you will find the full test set with all the test labels. To allow comparison with previous work, the file test_post_competition.csv includes a flag to determine the corresponding leaderboard (public or private) for each test clip (see more info in Files & Download below).

    Acoustic mismatch

    As mentioned before, FSDKaggle2019 uses audio clips from two sources:

    FSD: curated train set and test set, and

    YFCC: noisy train set.

    While the sources of audio (Freesound and Flickr) are collaboratively contributed and pretty diverse themselves, a certain acoustic mismatch can be expected between FSD and YFCC. We conjecture this mismatch comes from a variety of reasons. For example, through acoustic inspection of a small sample of both data sources, we find a higher percentage of high quality recordings in FSD. In addition, audio clips in Freesound are typically recorded with the purpose of capturing audio, which is not necessarily the case in YFCC.

    This mismatch can have an impact in the evaluation, considering that most of the train data come from YFCC, while all test data are drawn from FSD. This constraint (i.e., noisy training data coming from a different web audio source than the test set) is sometimes a real-world condition.

    LICENSE

    All clips in FSDKaggle2019 are released under Creative Commons (CC) licenses. For attribution purposes and to facilitate attribution of these files to third parties, we include a mapping from the audio clips to their corresponding licenses.

    Curated train set and test set. All clips in Freesound are released under different modalities of Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound, some of them requiring attribution to their original authors and some forbidding further commercial reuse. The licenses are specified in the files train_curated_post_competition.csv and test_post_competition.csv. These licenses can be CC0, CC-BY, CC-BY-NC and CC Sampling+.

    Noisy train set. Similarly, the licenses of the soundtracks from Flickr used in FSDKaggle2019 are specified in the file train_noisy_post_competition.csv. These licenses can be CC-BY and CC BY-SA.

    In addition, FSDKaggle2019 as a whole is the result of a curation process and it has an additional license. FSDKaggle2019 is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSDKaggle2019.doc zip file.

    FILES & DOWNLOAD

    FSDKaggle2019 can be downloaded as a series of zip files with the following directory structure:

    root
    │
    └───FSDKaggle2019.audio_train_curated/                Audio clips in the curated train set
    │
    └───FSDKaggle2019.audio_train_noisy/                  Audio clips in the noisy train set

  10. The Visual-Inertial Canoe Dataset

    • aws-databank-alb.library.illinois.edu
    • databank.illinois.edu
    Updated Nov 14, 2017
    Cite
    Martin Miller; Soon-Jo Chung; Seth Hutchinson (2017). The Visual-Inertial Canoe Dataset [Dataset]. http://doi.org/10.13012/B2IDB-9342111_V1
    Dataset updated
    Nov 14, 2017
    Authors
    Martin Miller; Soon-Jo Chung; Seth Hutchinson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    Office of Naval Research
    Description

    If you use this dataset, please cite the IJRR data paper (BibTeX is below). We present a dataset collected from a canoe along the Sangamon River in Illinois. The canoe was equipped with a stereo camera, an IMU, and a GPS device, which provide visual data suitable for stereo or monocular applications, inertial measurements, and position data for ground truth. We recorded a canoe trip up and down the river for 44 minutes covering 2.7 km round trip. The dataset adds to those previously recorded in unstructured environments and is unique in that it is recorded on a river, which provides its own set of challenges and constraints that are described in this paper. The data is divided into subsets, which can be downloaded individually. Video previews are available on Youtube: https://www.youtube.com/channel/UCOU9e7xxqmL_s4QX6jsGZSw

    The information below can also be found in the README files provided in the 527 dataset and each of its subsets. The purpose of this document is to assist researchers in using this dataset.

    Images
    ======

    Raw
    ---
    The raw images are stored in the cam0 and cam1 directories in bmp format. They are bayered images that need to be debayered and undistorted before they are used. The camera parameters for these images can be found in camchain-imucam.yaml. Note that the camera intrinsics describe a 1600x1200 resolution image, so the focal length and center pixel coordinates must be scaled by 0.5 before they are used. The distortion coefficients remain the same even for the scaled images. The camera to IMU transformation matrix is also in this file. cam0/ refers to the left camera, and cam1/ refers to the right camera.

    Rectified
    ---------
    Stereo rectified, undistorted, row-aligned, debayered images are stored in the rectified/ directory in the same way as the raw images except that they are in png format. The params.yaml file contains the projection and rotation matrices necessary to use these images. The resolution of these parameters does not need to be scaled as is necessary for the raw images.

    params.yml
    ----------
    The stereo rectification parameters. R0, R1, P0, P1, and Q correspond to the outputs of the OpenCV stereoRectify function except that 1s and 2s are replaced by 0s and 1s, respectively.
    R0: The rectifying rotation matrix of the left camera.
    R1: The rectifying rotation matrix of the right camera.
    P0: The projection matrix of the left camera.
    P1: The projection matrix of the right camera.
    Q: Disparity to depth mapping matrix.
    T_cam_imu: Transformation matrix for a point in the IMU frame to the left camera frame.

    camchain-imucam.yaml
    --------------------
    The camera intrinsic and extrinsic parameters and the camera to IMU transformation usable with the raw images.
    T_cam_imu: Transformation matrix for a point in the IMU frame to the camera frame.
    distortion_coeffs: lens distortion coefficients using the radial tangential model.
    intrinsics: focal length x, focal length y, principal point x, principal point y.
    resolution: resolution of calibration. Scale the intrinsics for use with the raw 800x600 images. The distortion coefficients do not change when the image is scaled.
    T_cn_cnm1: Transformation matrix from the right camera to the left camera.
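
    As an illustration of the intrinsics scaling described above, here is a minimal OpenCV sketch for debayering and undistorting one raw image. The Bayer pattern, the file path, and the example parameter values are assumptions; real values should be taken from camchain-imucam.yaml.

    import cv2
    import numpy as np

    # Intrinsics in camchain-imucam.yaml describe 1600x1200 calibration images
    # (values below are placeholders), so scale focal lengths and principal point
    # by 0.5 for the 800x600 raw images. Distortion coefficients are not scaled.
    fx, fy, cx, cy = 1400.0, 1400.0, 800.0, 600.0   # placeholder calibration values
    scale = 0.5
    K = np.array([[fx * scale, 0.0, cx * scale],
                  [0.0, fy * scale, cy * scale],
                  [0.0, 0.0, 1.0]])
    dist = np.array([-0.3, 0.1, 0.0, 0.0])          # placeholder radial-tangential coefficients

    # Debayer a raw left-camera frame (file name and Bayer pattern assumed),
    # then undistort it with the scaled intrinsics.
    raw = cv2.imread('cam0/000000.bmp', cv2.IMREAD_GRAYSCALE)
    color = cv2.cvtColor(raw, cv2.COLOR_BAYER_BG2BGR)
    undistorted = cv2.undistort(color, K, dist)
    cv2.imwrite('cam0_undistorted_000000.png', undistorted)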

    Sensors
    -------
    Here, each message in name.csv is described.

    ###rawimus###
    time # GPS time in seconds
    message name # rawimus
    acceleration_z # m/s^2 IMU uses right-forward-up coordinates
    -acceleration_y # m/s^2
    acceleration_x # m/s^2
    angular_rate_z # rad/s IMU uses right-forward-up coordinates
    -angular_rate_y # rad/s
    angular_rate_x # rad/s

    ###IMG###
    time # GPS time in seconds
    message name # IMG
    left image filename
    right image filename

    ###inspvas###
    time # GPS time in seconds
    message name # inspvas
    latitude
    longitude
    altitude # ellipsoidal height WGS84 in meters
    north velocity # m/s
    east velocity # m/s
    up velocity # m/s
    roll # right hand rotation about y axis in degrees
    pitch # right hand rotation about x axis in degrees
    azimuth # left hand rotation about z axis in degrees clockwise from north

    ###inscovs###
    time # GPS time in seconds
    message name # inscovs
    position covariance # 9 values xx,xy,xz,yx,yy,yz,zx,zy,zz m^2
    attitude covariance # 9 values xx,xy,xz,yx,yy,yz,zx,zy,zz deg^2
    velocity covariance # 9 values xx,xy,xz,yx,yy,yz,zx,zy,zz (m/s)^2

    ###bestutm###
    time # GPS time in seconds
    message name # bestutm
    utm zone # numerical zone
    utm character # alphabetical zone
    northing # m
    easting # m
    height # m above mean sea level

    Camera logs
    -----------
    The files name.cam0 and name.cam1 are text files that correspond to cameras 0 and 1, respectively. The columns are defined by:
    unused: The first column is all 1s and can be ignored.
    software frame number: This number increments at the end of every iteration of the software loop.
    camera frame number: This number is generated by the camera and increments each time the shutter is triggered. The software and camera frame numbers do not have to start at the same value, but if the difference between the initial and final values is not the same, it suggests that frames may have been dropped.
    camera timestamp: This is the camera's internal timestamp of the frame capture in units of 100 milliseconds.
    PC timestamp: This is the PC time of arrival of the image.

    name.kml
    --------
    The kml file is a mapping file that can be read by software such as Google Earth. It contains the recorded GPS trajectory.

    name.unicsv
    -----------
    This is a csv file of the GPS trajectory in UTM coordinates that can be read by gpsbabel, software for manipulating GPS paths.

    @article{doi:10.1177/0278364917751842,
      author = {Martin Miller and Soon-Jo Chung and Seth Hutchinson},
      title = {The Visual–Inertial Canoe Dataset},
      journal = {The International Journal of Robotics Research},
      volume = {37},
      number = {1},
      pages = {13-20},
      year = {2018},
      doi = {10.1177/0278364917751842},
      URL = {https://doi.org/10.1177/0278364917751842},
      eprint = {https://doi.org/10.1177/0278364917751842}
    }

  11. Data from: DIPSER: A Dataset for In-Person Student Engagement Recognition in the Wild

    • observatorio-cientifico.ua.es
    • scidb.cn
    Updated 2025
    Cite
    Márquez-Carpintero, Luis; Suescun-Ferrandiz, Sergio; Álvarez, Carolina Lorenzo; Fernandez-Herrero, Jorge; Viejo, Diego; Rosabel Roig-Vila; Cazorla, Miguel (2025). DIPSER: A Dataset for In-Person Student Engagement Recognition in the Wild [Dataset]. https://observatorio-cientifico.ua.es/documentos/67321d21aea56d4af0484172
    Dataset updated
    2025
    Authors
    Márquez-Carpintero, Luis; Suescun-Ferrandiz, Sergio; Álvarez, Carolina Lorenzo; Fernandez-Herrero, Jorge; Viejo, Diego; Rosabel Roig-Vila; Cazorla, Miguel
    Description

    Data Description

    The DIPSER dataset is designed to assess student attention and emotion in in-person classroom settings. It consists of RGB camera data, smartwatch sensor data, and labeled attention and emotion metrics. It includes multiple camera angles per student to capture posture and facial expressions, complemented by smartwatch data for inertial and biometric measurements. Attention and emotion labels are derived from self-reports and expert evaluations. The dataset covers diverse demographic groups, with data collected in real-world classroom environments, facilitating the training of machine learning models for predicting attention and correlating it with emotional states.

    Data Collection and Generation Procedures

    The dataset was collected in a natural classroom environment at the University of Alicante, Spain. The recording setup consisted of six general cameras positioned to capture the overall classroom context and individual cameras placed at each student's desk. Additionally, smartwatches were used to collect biometric data, such as heart rate, accelerometer, and gyroscope readings.

    Experimental Sessions

    Nine distinct educational activities were designed to ensure a comprehensive range of engagement scenarios:

    • News Reading – Students read projected or device-displayed news.
    • Brainstorming Session – Idea generation for problem-solving.
    • Lecture – Passive listening to an instructor-led session.
    • Information Organization – Synthesizing information from different sources.
    • Lecture Test – Assessment of lecture content via mobile devices.
    • Individual Presentations – Students present their projects.
    • Knowledge Test – Conducted using Kahoot.
    • Robotics Experimentation – Hands-on session with robotics.
    • MTINY Activity Design – Development of educational activities with computational thinking.

    Technical Specifications

    • RGB Cameras: Individual cameras recorded at 640×480 pixels, while context cameras captured at 1280×720 pixels.
    • Frame Rate: 9–10 FPS depending on the setup.
    • Smartwatch Sensors: Heart rate, accelerometer, gyroscope, rotation vector, and light sensor data collected at frequencies of 1–100 Hz.

    Data Organization and Formats

    The dataset follows a structured directory format:

    /groupX/experimentY/subjectZ.zip

    Each subject-specific folder contains:

    • images/ (individual facial images)
    • watch_sensors/ (sensor readings in JSON format)
    • labels/ (engagement & emotion annotations)
    • metadata/ (subject demographics & session details)

    Annotations and Labeling

    Each data entry includes engagement levels (1–5) and emotional states (9 categories) based on both self-reported labels and evaluations by four independent experts. A custom annotation tool was developed to ensure consistency across evaluations.

    Missing Data and Data Quality

    • Synchronization: A centralized server ensured time alignment across devices. Brightness changes were used to verify synchronization.
    • Completeness: No major missing data, except for occasional random frame drops due to embedded device performance.
    • Data Consistency: Uniform collection methodology across sessions, ensuring high reliability.

    Data Processing Methods

    To enhance usability, the dataset includes preprocessed bounding boxes for face, body, and hands, along with gaze estimation and head pose annotations. These were generated using YOLO, MediaPipe, and DeepFace.

    File Formats and Accessibility

    • Images: Stored in standard JPEG format.
    • Sensor Data: Provided as structured JSON files.
    • Labels: Available as CSV files with timestamps.

    The dataset is publicly available under the CC-BY license and can be accessed, along with the necessary processing scripts, via the DIPSER GitHub repository.

    Potential Errors and Limitations

    • Due to camera angles, some student movements may be out of frame in collaborative sessions.
    • Lighting conditions vary slightly across experiments.
    • Sensor latency variations are minimal but exist due to embedded device constraints.

    Citation

    If you find this project helpful for your research, please cite our work using the following bibtex entry:

    @misc{marquezcarpintero2025dipserdatasetinpersonstudent1,
      title={DIPSER: A Dataset for In-Person Student Engagement Recognition in the Wild},
      author={Luis Marquez-Carpintero and Sergio Suescun-Ferrandiz and Carolina Lorenzo Álvarez and Jorge Fernandez-Herrero and Diego Viejo and Rosabel Roig-Vila and Miguel Cazorla},
      year={2025},
      eprint={2502.20209},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.20209},
    }

    Usage and Reproducibility

    Researchers can use standard tools such as OpenCV, TensorFlow, and PyTorch for analysis. The dataset supports research in machine learning, affective computing, and education analytics, offering a unique resource for engagement and attention studies in real-world classroom environments.
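
    As a quick-start illustration (not part of the official DIPSER tooling), the sketch below browses one extracted subject folder using the layout described above; the subject path, file extensions, and file names inside labels/ and watch_sensors/ are assumptions.

    import json
    from pathlib import Path

    import pandas as pd

    # Hypothetical path: one subject archive extracted from /groupX/experimentY/subjectZ.zip
    subject_dir = Path("group1/experiment1/subject1")

    # Engagement/emotion labels are distributed as CSV files with timestamps.
    for csv_file in sorted((subject_dir / "labels").glob("*.csv")):
        labels = pd.read_csv(csv_file)
        print(csv_file.name, labels.shape)

    # Smartwatch readings are distributed as structured JSON files.
    for json_file in sorted((subject_dir / "watch_sensors").glob("*.json")):
        with open(json_file) as f:
            readings = json.load(f)
        print(json_file.name, type(readings))

    # Individual facial images are stored as JPEGs under images/.
    image_paths = sorted((subject_dir / "images").glob("*.jpg"))
    print(len(image_paths), "images found")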

  12. Z

    GENEA Challenge 2023 Dataset Files

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 31, 2023
    Cite
    Youngwoo Yoon (2023). GENEA Challenge 2023 Dataset Files [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8199132
    Explore at:
    Dataset updated
    Jul 31, 2023
    Dataset provided by
    Taras Kucherenko
    Rajmund Nagy
    Youngwoo Yoon
    Description

    This Zenodo repository contains the main dataset for the GENEA Challenge 2023, which is based on the Talking With Hands 16.2M dataset.

    Notation:

    Please take note of the following nomenclature when reading this document:

    main agent refers to the speaker in the dyadic interaction for which the systems generated motions.

    interlocutor refers to the speaker in front of the main agent.

    Contents:

    The "genea2023_trn" and "genea2023_val" zip files contain audio files (in WAV format), time-aligned transcriptions (in TSV format), and motion files (in BVH format) for the training and validation datasets, respectively.

    The "genea2023_test" zip file contains audio files (in WAV format) and transcriptions (in TSV format) for the test set, but no motion. The corresponding test motion is available at:

    https://zenodo.org/record/8146027

    Each zip file also contains a "metadata.csv" file that contains information for all files regarding the speaker ID and whether or not the motion files contain finger motion.

    Note that the speech audio in the data sometimes has been replaced by silence for the purpose of anonymisation.

    In the test set, files with indices from 0 to 40 correspond to "matched" interactions (the core test set), where main agent and interlocutor data come from the same conversation, whilst file indices from 41 to 69 correspond to "mismatched" interactions (the extended test set), where main agent and interlocutor data come from different conversations.

    Folder structure:

    main-agent/ (main agent): Encapsulates BVH, TSV, WAV data subfolders for the main agent.

    interloctr/ (interlocutor): Encapsulates BVH, TSV, WAV data subfolders for the interlocutor.

    bvh/ (motion): Time-aligned 3D full-body motion-capture data in BVH format from a speaking and gesticulating actor. Each file is a single person, but each data sample contains files for both the main agent and the interlocutor.

    wav/ (audio): Recorded audio data in WAV format from a speaking and gesticulating actor with a close-talking microphone. Parts of the audio recordings have been muted to omit personally identifiable information.

    tsv/ (text): Word-level time-aligned text transcriptions of the above audio recordings in TSV format (tab-separated values). For privacy reasons, the transcriptions do not include references to personally identifiable information, similar to the audio files.
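
    As a quick orientation to this layout, the minimal sketch below (assuming the training zip has been unpacked to genea2023_trn/ and that metadata.csv sits at its root) enumerates the per-agent modalities and inspects the metadata with pandas; column names are not assumed, only printed.

    from pathlib import Path

    import pandas as pd

    root = Path("genea2023_trn")  # unpacked training archive; path and layout are assumptions

    # metadata.csv records the speaker ID and whether each motion file contains finger
    # motion; its exact location and column names may differ, so they are just inspected.
    metadata = pd.read_csv(root / "metadata.csv")
    print(metadata.columns.tolist())
    print(metadata.head())

    # Each agent folder holds bvh/, wav/ and tsv/ subfolders.
    for agent in ("main-agent", "interloctr"):
        for modality in ("bvh", "wav", "tsv"):
            files = sorted((root / agent / modality).glob(f"*.{modality}"))
            print(agent, modality, len(files), "files")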

    Data processing scripts:

    We provide a number of optional scripts for encoding and processing the challenge data:

    Audio: Scripts for extracting basic audio features, such as spectrograms, prosodic features, and mel-frequency cepstral coefficients (MFCCs) can be found at this link.

    Text: A script to encode text transcriptions to word vectors using FastText is available here: tsv2wordvectors.py

    Motion: If you wish to encode the joint angles from the BVH files to and from an exponential map representation, you can use scripts by Simon Alexanderson based on the PyMo library, which are available here:

    bvh2features.py

    features2bvh.py

    Attribution:

    If you use this material, please cite our latest paper on the GENEA Challenge 2023. At the time of writing (2023-07-25) this is our ACM ICMI 2023 paper:

    Taras Kucherenko, Rajmund Nagy, Youngwoo Yoon, Jieyeon Woo, Teodor Nikolov, Mihail Tsakov, and Gustav Eje Henter. 2023. The GENEA Challenge 2023: A large-scale evaluation of gesture generation models in monadic and dyadic settings. In Proceedings of the ACM International Conference on Multimodal Interaction (ICMI ’23). ACM.

    Also, please cite the paper about the original dataset from Meta Research:

    Gilwoo Lee, Zhiwei Deng, Shugao Ma, Takaaki Shiratori, Siddhartha S. Srinivasa, and Yaser Sheikh. 2019. Talking With Hands 16.2M: A large-scale dataset of synchronized body-finger motion and audio for conversational motion analysis and synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV ’19). IEEE, 763–772.

    The motion and audio files are based on the Talking With Hands 16.2M dataset at https://github.com/facebookresearch/TalkingWithHands32M/. The material is available under a CC BY-NC 4.0 (Attribution-NonCommercial 4.0 International) license, with the text provided in LICENSE.txt.

    To find more GENEA Challenge 2023 material on the web, please see:

    https://genea-workshop.github.io/2023/challenge/

    If you have any questions or comments, please contact:

    The GENEA Challenge organisers genea-challenge@googlegroups.com

  13. Z

    FSDnoisy18k

    • data.niaid.nih.gov
    • paperswithcode.com
    • +2more
    Updated Jan 24, 2020
    Cite
    FSDnoisy18k [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2529933
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Mercedes Collado
    Xavier Favory
    Eduardo Fonseca
    Daniel P. W. Ellis
    Manoj Plakal
    Frederic Font
    Xavier Serra
    Description

    FSDnoisy18k is an audio dataset collected with the aim of fostering the investigation of label noise in sound event classification. It contains 42.5 hours of audio across 20 sound classes, including a small amount of manually-labeled data and a larger quantity of real-world noisy data.

    Data curators

    Eduardo Fonseca and Mercedes Collado

    Contact

    You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.

    Citation

    If you use this dataset or part of it, please cite the following ICASSP 2019 paper:

    Eduardo Fonseca, Manoj Plakal, Daniel P. W. Ellis, Frederic Font, Xavier Favory, and Xavier Serra, “Learning Sound Event Classifiers from Web Audio with Noisy Labels”, arXiv preprint arXiv:1901.01189, 2019

    You can also consider citing our ISMIR 2017 paper that describes the Freesound Annotator, which was used to gather the manual annotations included in FSDnoisy18k:

    Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, “Freesound Datasets: A Platform for the Creation of Open Audio Datasets”, In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017

    FSDnoisy18k description

    What follows is a summary of the most basic aspects of FSDnoisy18k. For a complete description of FSDnoisy18k, make sure to check:

    the FSDnoisy18k companion site: http://www.eduardofonseca.net/FSDnoisy18k/

    the description provided in Section 2 of our ICASSP 2019 paper

    FSDnoisy18k is an audio dataset collected with the aim of fostering the investigation of label noise in sound event classification. It contains 42.5 hours of audio across 20 sound classes, including a small amount of manually-labeled data and a larger quantity of real-world noisy data.

    The source of audio content is Freesound, a sound sharing site created and maintained by the Music Technology Group that hosts over 400,000 clips uploaded by its community of users, who additionally provide some basic metadata (e.g., tags and title). The 20 classes of FSDnoisy18k are drawn from the AudioSet Ontology and are selected based on data availability as well as on their suitability to allow the study of label noise. The 20 classes are: "Acoustic guitar", "Bass guitar", "Clapping", "Coin (dropping)", "Crash cymbal", "Dishes, pots, and pans", "Engine", "Fart", "Fire", "Fireworks", "Glass", "Hi-hat", "Piano", "Rain", "Slam", "Squeak", "Tearing", "Walk, footsteps", "Wind", and "Writing". FSDnoisy18k was created with the Freesound Annotator, which is a platform for the collaborative creation of open audio datasets.

    We defined a clean portion of the dataset consisting of correct and complete labels. The remaining portion is referred to as the noisy portion. Each clip in the dataset has a single ground truth label (singly-labeled data).

    The clean portion of the data consists of audio clips whose labels are rated as present in the clip and predominant (almost all with full inter-annotator agreement), meaning that the label is correct and, in most cases, there is no additional acoustic material other than the labeled class. A few clips may contain some additional sound events, but they occur in the background and do not belong to any of the 20 target classes. This is more common for some classes that rarely occur alone, e.g., “Fire”, “Glass”, “Wind” or “Walk, footsteps”.

    The noisy portion of the data consists of audio clips that received no human validation. In this case, they are categorized on the basis of the user-provided tags in Freesound. Hence, the noisy portion features a certain amount of label noise.

    Code

    We've released the code for our ICASSP 2019 paper at https://github.com/edufonseca/icassp19. The framework comprises all the basic stages: feature extraction, training, inference and evaluation. After loading the FSDnoisy18k dataset, log-mel energies are computed and a CNN baseline is trained and evaluated. The code also allows testing four noise-robust loss functions. Please check our paper for more details.
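
    The repository above implements the full pipeline; purely as a standalone illustration of the log-mel feature-extraction step it describes, one clip could be processed along the lines below. The file name and parameter values are generic placeholders, not necessarily those used in the paper.

    import librosa

    # Placeholder file name; FSDnoisy18k clips are 44.1 kHz mono PCM.
    y, sr = librosa.load("FSDnoisy18k.audio_train/example.wav", sr=44100, mono=True)

    # Mel spectrogram followed by log compression yields the log-mel energies.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=96)
    log_mel = librosa.power_to_db(mel)
    print(log_mel.shape)  # (n_mels, n_frames)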

    Label noise characteristics

    FSDnoisy18k features real label noise that is representative of audio data retrieved from the web, particularly from Freesound. The analysis of a per-class, random, 15% of the noisy portion of FSDnoisy18k revealed that roughly 40% of the analyzed labels are correct and complete, whereas 60% of the labels show some type of label noise. Please check the FSDnoisy18k companion site for a detailed characterization of the label noise in the dataset, including a taxonomy of label noise for singly-labeled data as well as a per-class description of the label noise.

    FSDnoisy18k basic characteristics

    The dataset's most relevant characteristics are as follows:

    FSDnoisy18k contains 18,532 audio clips (42.5h) unequally distributed in the 20 aforementioned classes drawn from the AudioSet Ontology.

    The audio clips are provided as uncompressed PCM 16 bit, 44.1 kHz, mono audio files.

    The audio clips are of variable length ranging from 300ms to 30s, and each clip has a single ground truth label (singly-labeled data).

    The dataset is split into a test set and a train set. The test set is drawn entirely from the clean portion, while the remainder of data forms the train set.

    The train set is composed of 17,585 clips (41.1h) unequally distributed among the 20 classes. It features a clean subset and a noisy subset. In terms of number of clips their proportion is 10%/90%, whereas in terms of duration the proportion is slightly more extreme (6%/94%). The per-class percentage of clean data within the train set is also imbalanced, ranging from 6.1% to 22.4%. The number of audio clips per class ranges from 51 to 170, and from 250 to 1000 in the clean and noisy subsets, respectively. Further, a noisy small subset is defined, which includes an amount of (noisy) data comparable (in terms of duration) to that of the clean subset.

    The test set is composed of 947 clips (1.4h) that belong to the clean portion of the data. Its class distribution is similar to that of the clean subset of the train set. The number of per-class audio clips in the test set ranges from 30 to 72. The test set enables a multi-class classification problem.

    FSDnoisy18k is an expandable dataset that features a per-class varying degree of types and amount of label noise. The dataset allows investigation of label noise as well as of other approaches, from semi-supervised learning (e.g., self-training) to learning with minimal supervision.

    License

    FSDnoisy18k has licenses at two different levels, as explained next. All sounds in Freesound are released under Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound. In particular, all Freesound clips included in FSDnoisy18k are released under either CC-BY or CC0. For attribution purposes and to facilitate attribution of these files to third parties, we include a list of audio clips and their corresponding licenses in the LICENSE-INDIVIDUAL-CLIPS file downloaded with the dataset.

    In addition, FSDnoisy18k as a whole is the result of a curation process and it has an additional license. FSDnoisy18k is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the dataset.

    Files

    FSDnoisy18k can be downloaded as a series of zip files with the following directory structure:

    root
    │
    └───FSDnoisy18k.audio_train/          Audio clips in the train set
    │
    └───FSDnoisy18k.audio_test/           Audio clips in the test set
    │
    └───FSDnoisy18k.meta/                 Files for evaluation setup
    │   │
    │   └───train.csv                     Data split and ground truth for the train set
    │   │
    │   └───test.csv                      Ground truth for the test set
    │
    └───FSDnoisy18k.doc/
        │
        └───README.md                     The dataset description file that you are reading
        │
        └───LICENSE-DATASET               License of the FSDnoisy18k dataset as an entity
        │
        └───LICENSE-INDIVIDUAL-CLIPS.csv  Licenses of the individual audio clips from Freesound

    Each row (i.e. audio clip) of the train.csv file contains the following information:

    fname: the file name

    label: the audio classification label (ground truth)

    aso_id: the id of the corresponding category as per the AudioSet Ontology

    manually_verified: Boolean (1 or 0) flag to indicate whether the clip belongs to the clean portion (1), or to the noisy portion (0) of the train set

    noisy_small: Boolean (1 or 0) flag to indicate whether the clip belongs to the noisy_small portion (1) of the train set

    Each row (i.e. audio clip) of the test.csv file contains the following information:

    fname: the file name

    label: the audio classification label (ground truth)

    aso_id: the id of the corresponding category as per the AudioSet Ontology
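
    Based on the fields described above, a minimal pandas sketch for separating the clean, noisy, and noisy_small portions of the train set might look as follows; the paths assume the zip files have been extracted into the working directory.

    from pathlib import Path

    import pandas as pd

    meta_dir = Path("FSDnoisy18k.meta")  # extracted metadata folder

    train = pd.read_csv(meta_dir / "train.csv")
    test = pd.read_csv(meta_dir / "test.csv")

    # manually_verified == 1 marks the clean portion of the train set;
    # noisy_small == 1 marks the small noisy subset comparable in duration to the clean one.
    clean = train[train["manually_verified"] == 1]
    noisy = train[train["manually_verified"] == 0]
    noisy_small = train[train["noisy_small"] == 1]
    print(len(clean), len(noisy), len(noisy_small), len(test))

    # Audio clips live in FSDnoisy18k.audio_train/ and FSDnoisy18k.audio_test/.
    train_paths = [Path("FSDnoisy18k.audio_train") / fname for fname in clean["fname"]]
    print(train_paths[:3])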

    Links

    Source code for our preprint: https://github.com/edufonseca/icassp19
    Freesound Annotator: https://annotator.freesound.org/
    Freesound: https://freesound.org
    Eduardo Fonseca's personal website: http://www.eduardofonseca.net/

    Acknowledgments

    This work is partially supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 688382 AudioCommons. Eduardo Fonseca is also sponsored by a Google Faculty Research Award 2017. We thank everyone who contributed to FSDnoisy18k with annotations.

  14. z

    Data from: Aircraft Marshaling Signals Dataset of FMCW Radar and Event-Based...

    • zenodo.org
    • explore.openaire.eu
    • +1more
    bin, text/x-python
    Updated Dec 11, 2023
    + more versions
    Cite
    Leon Müller; Leon Müller; Manolis Sifalakis; Manolis Sifalakis; Sherif Eissa; Sherif Eissa; Amirreza Yousefzadeh; Amirreza Yousefzadeh; Sander Stuijk; Sander Stuijk; Federico Corradi; Federico Corradi; Paul Detterer; Paul Detterer (2023). Aircraft Marshaling Signals Dataset of FMCW Radar and Event-Based Camera for Sensor Fusion [Dataset]. http://doi.org/10.5281/zenodo.10359770
    Explore at:
    bin, text/x-pythonAvailable download formats
    Dataset updated
    Dec 11, 2023
    Dataset provided by
    IEEE
    Authors
    Leon Müller; Leon Müller; Manolis Sifalakis; Manolis Sifalakis; Sherif Eissa; Sherif Eissa; Amirreza Yousefzadeh; Amirreza Yousefzadeh; Sander Stuijk; Sander Stuijk; Federico Corradi; Federico Corradi; Paul Detterer; Paul Detterer
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    May 1, 2023
    Description

    Dataset Introduction

    The advent of neural networks capable of learning salient features from variance in the radar data has expanded the breadth of radar applications, often as an alternative sensor or a complementary modality to camera vision. Gesture recognition for command control is arguably the most commonly explored application. Nevertheless, more suitable benchmarking datasets than those currently available are needed to assess and compare the merits of the different proposed solutions and to explore a broader range of scenarios than simple hand-gesturing a few centimeters away from a radar transmitter/receiver. Most current publicly available radar datasets used in gesture recognition provide limited diversity, do not provide access to raw ADC data, and are not significantly challenging. To address these shortcomings, we created and make available a new dataset that combines FMCW radar and dynamic vision camera recordings of 10 aircraft marshalling signals (whole body) at several distances and angles from the sensors, recorded from 13 people. The two modalities are hardware-synchronized using the radar's PRI signal. Moreover, in the supporting publication we propose a sparse encoding of the time-domain (ADC) signals that achieves a dramatic data rate reduction (>76%) while retaining the efficacy of the downstream FFT processing (<2% accuracy loss on recognition tasks), and can be used to create a sparse event-based representation of the radar data. In this way the dataset can be used as a two-modality neuromorphic dataset.

    Synchronization of the two modalities

    The PRI pulses from the radar have been hard-wired to the event stream of the DVS sensor, and timestamped using the DVS clock. Based on this signal the DVS event stream has been segmented such that groups of events (time-bins) of the DVS are mapped with individual radar pulses (chirps).

    Data storage

    DVS events (x,y coordinates and timestamps) are stored in structured arrays, and one such structured array object is associated with the data of a radar transmission (pulse/chirp). A radar transmission is a vector of 512 ADC levels that correspond to sampling points of a chirping signal (FMCW radar) that lasts about 1.3 ms. Every 192 radar transmissions are stacked in a matrix called a radar frame (each transmission is a row in that matrix). A data capture (recording) consisting of some thousands of continuous radar transmissions is therefore segmented into a number of radar frames. Finally, radar frames and the corresponding DVS structured arrays are stored in separate containers in a custom-made multi-container file format (extension .rad). We provide a .rad file parser for extracting the data out of these files. There is one file per capture of continuous gesture recording of about 10 s.

    Note that the number of 192 transmissions per radar frame is an ad-hoc segmentation that suits the purpose of obtaining sufficient signal resolution in the 2D FFT typical of radar signal processing, given the range resolution of the specific radar. It also served the purpose of fast streaming storage of the data during capture. For extracting individual data points from the dataset, however, one can pool together (concatenate) all the radar frames from a single capture file and re-segment them as desired. The data loader that we provide offers this, with a default of re-segmenting every 769 transmissions (about 1 s of gesturing).
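
    For orientation, the sketch below shows the kind of 2D FFT (range-Doppler) processing this segmentation is meant to support; the random array stands in for a real 192×512 radar frame obtained with the provided parser or loader, and the windowing choices are illustrative rather than taken from the supporting publication.

    import numpy as np

    # Stand-in for one radar frame loaded with the dataset tools:
    # 192 transmissions (chirps) x 512 ADC samples per chirp.
    frame = np.random.randn(192, 512)

    # Remove the per-range-bin mean, then apply a 2D window to reduce spectral leakage.
    frame = frame - frame.mean(axis=0, keepdims=True)
    window = np.hanning(192)[:, None] * np.hanning(512)[None, :]

    # Range FFT along fast time (samples), Doppler FFT along slow time (chirps).
    range_fft = np.fft.fft(frame * window, axis=1)
    range_doppler = np.fft.fftshift(np.fft.fft(range_fft, axis=0), axes=0)

    magnitude_db = 20 * np.log10(np.abs(range_doppler) + 1e-9)
    print(magnitude_db.shape)  # (Doppler bins, range bins)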

    Data captures directory organization (radar8Ghz-DVS-marshaling_signals_20220901_publication_anonymized.7z)

    The dataset captures (recordings) are organized in a common directory structure which encompasses additional metadata information about the captures.

    dataset_dir/

    Identifiers

    • stage [train, test].
    • room: [conference_room, foyer, open_space].
    • subject: [0-9]. Note that 0 stands for no person, and 1 for an unlabeled, random person (only present in test).
    • gesture: ['none', 'emergency_stop', 'move_ahead', 'move_back_v1', 'move_back_v2', 'slow_down', 'start_engines', 'stop_engines', 'straight_ahead', 'turn_left', 'turn_right'].
    • distance: ['xxx', '100', '150', '200', '250', '300', '350', '400', '450'] (in cm). Note that xxx is used for none gestures when there is no person present in front of the radar (i.e. background samples), or when a person is walking in front of the radar with varying distances but performing no gesture.

    The test data captures contain both subjects that appear in the train data as well as previously unseen subjects. Similarly the test data contain captures from the spaces that train data were recorded at, as well as from a new unseen open space.

    Files List

    radar8Ghz-DVS-marshaling_signals_20220901_publication_anonymized.7z

    This is the actual archive bundle with the data captures (recordings).

    rad_file_parser_2.py

    Parser for individual .rad files, which contain capture data.

    loader.py

    A convenience PyTorch Dataset loader (partly Tonic compatible). In practice, this is all you need to get started quickly if you do not want to delve too deeply into the code. When you initialise a DvsRadarAircraftMarshallingSignals class object, it automatically downloads the dataset archive and the .rad file parser, unpacks the archive, and imports the .rad parser to load the data. One can then request from it a training set, a validation set and a test set as torch.Datasets to work with.

    aircraft_marshalling_signals_howto.ipynb

    Jupyter notebook for exemplary basic use of loader.py
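
    As a rough usage illustration only (the notebook above is the authoritative reference), initialising the loader and requesting a split might look like the sketch below; the accessor method name is a placeholder, not the actual API of loader.py.

    import torch
    from loader import DvsRadarAircraftMarshallingSignals

    # Instantiation downloads the archive and the .rad parser, then unpacks the data.
    dataset = DvsRadarAircraftMarshallingSignals()

    # Hypothetical accessor name; check aircraft_marshalling_signals_howto.ipynb for the real one.
    train_set = dataset.get_train()

    # Each split behaves as a torch.Dataset, so it can be indexed or wrapped in a DataLoader.
    print(len(train_set))
    sample = train_set[0]
    loader = torch.utils.data.DataLoader(train_set, batch_size=8, shuffle=True)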

    Contact

    For further information or questions, please contact M. Sifalakis or F. Corradi first.

  15. c

    Data from: Dataset corresponding to 'The (im)possibility of forgiveness? An...

    • datacatalogue.cessda.eu
    • ssh.datastations.nl
    • +1more
    Updated Apr 11, 2023
    + more versions
    Cite
    D.A. Forster; C.A.M. Hermans (2023). Dataset corresponding to 'The (im)possibility of forgiveness? An empirical intercultural Bible reading of Matthew 18.15-35' [Dataset]. http://doi.org/10.17026/dans-x9b-379m
    Explore at:
    Dataset updated
    Apr 11, 2023
    Dataset provided by
    Radboud University
    Authors
    D.A. Forster; C.A.M. Hermans
    Description

    The study this dataset is part of focuses on intercultural Biblical hermeneutics and was necessary to gain insights into how the participants understood concepts and processes of forgiveness when reading Matthew 18.15-35. Moreover, since social identity complexity theory plays an important role in intercultural hermeneutics, it was necessary to gather the data in a social, or group, reading context. Each of the two participating groups was largely homogeneous in terms of race and culture. Demographic information was an important component in ascertaining to what extent social identity categories (such as race and culture) influence the hermeneutic perspectives of the participants. The participants self-reported their race, age and gender in an individual information session that took place before the first group meeting.

    In order to gain data from the participants related to their intercultural Biblical hermeneutic views of forgiveness, the researcher designed a series of group readings of the Biblical text in which the participants had the opportunity to read the text and then share their understandings of forgiveness with the researcher and the rest of the group. These group readings were structured in the form of focus group encounters in which the conversations were recorded by means of an audio recording device and then transcribed for later analysis.

    This dataset contains the raw data (transcripts) gathered in a series of 6 focus group meetings (datasets D1 - D6). The datasets are each numbered according to the Atlas.ti convention. The data is recorded in Rich Text File (RTF) format. In the heading of each data file, the date and location of the recording of the data is noted (e.g., Church Street Methodist Church, 21/06/2015).
    The data has been anonymised in order to protect the privacy of the participants, except for the original Atlas.ti file (which is only available on request). Each participant has been allocated a code representing their inputs (P1-P12). The P* remains the same for that participant in all of the datasets. The interviewer has been allocated the symbol 'I' or [I].

    A methodology text and a read-me text are part of the dataset and explain its context. In these documents, references are made to parts of the corresponding publication (PhD dissertation):
    Forster, DA. 2017. 'The (im)possibility of forgiveness? An empirical intercultural Bible reading of Matthew 18.15-35'. PHD thesis. Radboud University.

  16. d

    ASR database ARTUR 1.0 (audio) - Dataset - B2FIND

    • b2find.dkrz.de
    Updated Mar 8, 2023
    + more versions
    Cite
    (2023). ASR database ARTUR 1.0 (audio) - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/700ba856-a0b9-56c2-8fa7-056198ffe7d2
    Explore at:
    Dataset updated
    Mar 8, 2023
    Description

    Artur 1.0 is a speech database designed for the needs of automatic speech recognition for the Slovenian language. The database includes 1,067 hours of speech: 884 hours are transcribed, while the remaining 183 hours are recordings only. This repository entry includes audio files only; the transcriptions are available at http://hdl.handle.net/11356/1772.

    The data are structured as follows:

    (1) Artur-B, read speech, 573 hours in total. It includes:
    (1a) Artur-B-Brani, 485 hours: Readings of sentences which were pre-selected from a 10% increment of the Gigafida 2.0 corpus. The sentences were chosen in such a way that they reflect the natural, or actual, distribution of triphones in the words. They were distributed between 1,000 speakers, so that approx. 30 min of read speech was recorded from each speaker. The speakers were balanced according to gender, age and region, and a small proportion of speakers were non-native speakers of Slovene. Each sentence is its own audio file and has a corresponding transcription file.
    (1b) Artur-B-Crkovani, 10 hours: Spellings. Speakers were asked to spell abbreviations and personal names and surnames, all chosen so that all Slovene letters were covered, plus the most common foreign letters.
    (1c) Artur-B-Studio, 51 hours: Designed for the development of speech synthesis. The sentences were read in a studio by a single speaker. Each sentence is its own audio file and has a corresponding transcription file.
    (1d) Artur-B-Izloceno, 27 hours: Recordings that include different types of errors, typically incorrect reading of sentences or a noisy environment.

    (2) Artur-J, public speech, 62 hours in total. It includes:
    (2a) Artur-J-Splosni, 62 hours: Media recordings, online recordings of conferences, workshops, education videos, etc.

    (3) Artur-N, private speech, 74 hours in total. It includes:
    (3a) Artur-N-Obrazi, 6 hours: Speakers were asked to describe faces in pictures. Designed for face-description domain-specific speech recognition.
    (3b) Artur-N-PDom, 7 hours: Speakers were asked to read pre-written sentences, as well as to express instructions for a potential smart-home system freely. Designed for smart-home domain-specific speech recognition.
    (3c) Artur-N-Prosti, 61 hours: Monologues and dialogues between two persons, recorded for the purposes of the Artur database creation. Speakers were asked to converse or explain freely on casual topics.

    (4) Artur-P, parliamentary speech, 201 hours in total. It includes:
    (4a) Artur-P-SejeDZ, 201 hours: Speech from the Slovene National Assembly.

    Further information on the database is available in the Artur-DOC file, which is part of this repository entry.

  17. h

    gigaspeech

    • huggingface.co
    • paperswithcode.com
    • +1more
    + more versions
    Cite
    SpeechColab, gigaspeech [Dataset]. https://huggingface.co/datasets/speechcolab/gigaspeech
    Explore at:
    Dataset authored and provided by
    SpeechColab
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    GigaSpeech is an evolving, multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training, and 40,000 hours of total audio suitable for semi-supervised and unsupervised training. Around 40,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous speaking styles, and a variety of topics, such as arts, science, sports, etc. A new forced alignment and segmentation pipeline is proposed to create sentence segments suitable for speech recognition training, and to filter out segments with low-quality transcription. For system training, GigaSpeech provides five subsets of different sizes, 10h, 250h, 1000h, 2500h, and 10000h. For our 10,000-hour XL training subset, we cap the word error rate at 4% during the filtering/validation stage, and for all our other smaller training subsets, we cap it at 0%. The DEV and TEST evaluation sets, on the other hand, are re-processed by professional human transcribers to ensure high transcription quality.
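
    Since the corpus is distributed through the Hugging Face Hub, it is typically loaded with the datasets library, roughly as sketched below; the subset name "xs", the column names, and the need for an authenticated token that has accepted the dataset's terms are assumptions to verify on the dataset card.

    from datasets import load_dataset

    # Subset name, column names, and authentication requirements are assumptions;
    # see https://huggingface.co/datasets/speechcolab/gigaspeech for the exact options.
    gigaspeech = load_dataset("speechcolab/gigaspeech", "xs", token=True)

    sample = gigaspeech["train"][0]
    print(sample["text"])
    print(sample["audio"]["sampling_rate"])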

  18. h

    ljspeech

    • huggingface.co
    Updated Sep 30, 2023
    + more versions
    Cite
    Artem Ploujnikov (2023). ljspeech [Dataset]. https://huggingface.co/datasets/flexthink/ljspeech
    Explore at:
    Dataset updated
    Sep 30, 2023
    Authors
    Artem Ploujnikov
    Description

    This is a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours.

  19. v

    vHMML Reading Room Datasets

    • vhmml.org
    • ftp.vhmml.org
    csv, json
    Updated Aug 23, 2018
    Cite
    The Hill Museum & Manuscript Library (HMML) (2018). vHMML Reading Room Datasets [Dataset]. https://www.vhmml.org/dataPortal/dataset
    Explore at:
    csv, jsonAvailable download formats
    Dataset updated
    Aug 23, 2018
    Dataset authored and provided by
    The Hill Museum & Manuscript Library (HMML)
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The downloadable Reading Room datasets are updated weekly, include all active records, use the vHMML Schema (https://www.vhmml.org/dataPortal/schema), and are line-delimited.
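
    For the JSON export, line-delimited records can be read directly with pandas, for example as below; the file name is a placeholder for whichever Reading Room dataset file is downloaded.

    import pandas as pd

    # Each line of the export is one record; lines=True handles line-delimited JSON.
    records = pd.read_json("reading_room_export.json", lines=True)
    print(records.shape)
    print(records.columns.tolist())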

  20. h

    lj_speech

    • huggingface.co
    • tensorflow.org
    Updated Feb 15, 2021
    Cite
    Keith Ito (2021). lj_speech [Dataset]. https://huggingface.co/datasets/keithito/lj_speech
    Explore at:
    Dataset updated
    Feb 15, 2021
    Authors
    Keith Ito
    License

    https://choosealicense.com/licenses/unlicense/https://choosealicense.com/licenses/unlicense/

    Description

    This is a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books in English. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours.

    Note that in order to limit the required storage for preparing this dataset, the audio is stored in the .wav format and is not converted to a float32 array. To convert the audio file to a float32 array, please make use of the .map() function as follows:

    import soundfile as sf
    
    # Read the stored audio file into a floating-point array and attach it to the batch.
    def map_to_array(batch):
      speech_array, _ = sf.read(batch["file"])
      batch["speech"] = speech_array
      return batch
    
    dataset = dataset.map(map_to_array, remove_columns=["file"])
    