98 datasets found
  1. Multimodal Vision-Audio-Language Dataset

    • zenodo.org
    • data.niaid.nih.gov
    Updated Jul 11, 2024
    Cite
    Timothy Schaumlöffel; Gemma Roig; Bhavin Choksi (2024). Multimodal Vision-Audio-Language Dataset [Dataset]. http://doi.org/10.5281/zenodo.10060785
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Timothy Schaumlöffel; Gemma Roig; Bhavin Choksi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities.

    Details can be found in the attached report.

    Annotation

    The annotation files are provided as Parquet files. They can be read using Python with the pandas and pyarrow libraries.

    The split into train, validation and test set follows the split of the original datasets.

    Installation

    pip install pandas pyarrow

    Example

    import pandas as pd
    df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
    print(df.iloc[0])

    dataset              AudioSet
    filename             train/---2_BBVHAA.mp3
    captions_visual      [a man in a black hat and glasses.]
    captions_auditory    [a man speaks and dishes clank.]
    tags                 [Speech]
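
    Building on the example above, here is a minimal sketch of how the annotations might be filtered once loaded. It assumes that missing captions are stored as NaN, as noted in the field descriptions below; this should be verified against the actual files.

    import pandas as pd

    # Load the training annotations (file name as in the example above).
    df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')

    # Keep only AudioSet clips that have both visual and auditory captions
    # (assuming missing captions are stored as NaN).
    audioset = df[df['dataset'] == 'AudioSet']
    with_both = audioset.dropna(subset=['captions_visual', 'captions_auditory'])
    print(len(audioset), 'AudioSet clips,', len(with_both), 'with both caption types')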

    Description

    The annotation file consists of the following fields:

    filename: Name of the corresponding file (video or audio file)
    dataset: Source dataset associated with the data point
    captions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual content
    captions_auditory: A list of captions related to the auditory content of the video
    tags: A list of tags, classifying the sound of a file. It can be NaN if no tags are provided

    Data files

    The raw data files for most datasets are not released due to licensing issues and must be downloaded from the original sources. However, if files are missing from the source, we can provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de

  2. BurnedAreaUAV Dataset (v1.1)

    • zenodo.org
    bin, jpeg, png
    Updated Jul 12, 2024
    + more versions
    Cite
    Tiago F. R. Ribeiro; Fernando Silva; José Moreira; Rogério Luís de C. Costa (2024). BurnedAreaUAV Dataset (v1.1) [Dataset]. http://doi.org/10.5281/zenodo.7944963
    Available download formats: bin, png, jpeg
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tiago F. R. Ribeiro; Fernando Silva; José Moreira; Rogério Luís de C. Costa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    General Description

    A manually annotated dataset of video frames and segmentation masks for segmenting the burned area of a forest fire, based on video captured by a UAV. A detailed explanation of the dataset generation is available in the open-access article "Burned area semantic segmentation: A novel dataset and evaluation using convolutional networks".

    Data Collection

    The BurnedAreaUAV dataset derives from a video captured at latitude 41° 23' 37.56" and longitude -7° 37' 0.32", at Torre do Pinhão in northern Portugal, in an area characterized by shrubby to herbaceous vegetation. The video was captured during the evolution of a prescribed fire using a DJI Phantom 4 PRO UAV equipped with an FC6310S RGB camera.

    Video Overview

    The video captures a prescribed fire where the burned area increases progressively. At the beginning of the sequence, a significant portion of the UAV sensor's field of view is already burned, and the burned area expands as time goes by. The video was collected by an RGB sensor mounted on a drone that was kept in a nearly stationary position for the duration of data collection.

    The video is about 15 minutes long with a frame rate of 25 frames per second, amounting to 22,500 images. Throughout this period, the progression of the burned area is observed. The original video has a resolution of 720×1280 and is stored in H.264 (MPEG-4 Part 10) format. No audio signal was collected.

    Manual Annotation

    The annotation was done every 100 frames, which corresponds to a sampling period of 4 seconds. Two classes are considered: burned_area and unburned_area. This annotation covers the entire length of the video. The training set consists of 226 frame–mask pairs and the test set of 23. The training and test annotations are offset by 50 frames.

    We plan to expand this dataset in the future.


    File Organization (BurnedAreaUAV_v1.rar)

    The data is available in PNG, JSON (Labelme format), and WKT (segmentation masks only). The raw video data is also made available.

    Concomitantly, photos were taken that provide metadata about the position of the drone, including height and coordinates, the orientation of the drone and the camera, and more. The geographic data regarding the location of the controlled fire are represented in a KML file that Google Earth and other geospatial software can read. We also provide two high-resolution orthophotos of the area of interest before and after burning.

    The data produced by the segmentation models developed in "Burned area semantic segmentation: A novel dataset and evaluation using convolutional networks", comprising outputs in PNG and WKT formats, is also readily available upon request.

    BurnedAreaUAV_dataset_v1.rar
      MP4_video (folder)
        -- original_prescribed_burn_video.mp4

      PNG (folder)
        train (folder)
          frames (folder)
            -- frame_000000.png (raster image)
            -- frame_000100.png
            -- frame_000200.png
          msks (folder)
            -- mask_000000.png
            -- mask_000100.png
            -- mask_000200.png
        test (folder)
          frames (folder)
            -- frame_020250.png
            -- frame_020350.png
            -- frame_020450.png
          msks (folder)
            -- mask_020250.png
            -- mask_020350.png
            -- mask_020450.png

      JSON (folder)
        train_valid_json (folder)
          -- frame_000000.json (Labelme format)
          -- frame_000100.json
          -- frame_000200.json
          -- frame_000300.json
        test_json (folder)
          -- frame_020250.json
          -- frame_020350.json
          -- frame_020450.json

      WKT_files (folder)
        -- train_valid.wkt (list of mask polygons)
        -- test.wkt

      UAV photos (metadata)
        -- uav_photo1_metadata.JPG
        -- uav_photo2_metadata.JPG

      High-resolution orthophoto files
        -- odm_orthophoto_afterBurning.png
        -- odm_orthophoto_beforeBurning.png

      Keyhole Markup Language file (area under study polygon)
        -- pinhao_cell_precribed_area.kml
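
    For illustration, here is a minimal sketch of reading the WKT segmentation masks with the shapely library. It assumes each line of train_valid.wkt contains one polygon in WKT form, which should be checked against the actual file.

    from shapely import wkt

    # Read one polygon per line from the WKT annotation file (layout assumed).
    with open('WKT_files/train_valid.wkt') as f:
        polygons = [wkt.loads(line) for line in f if line.strip()]

    # Report the number of masks and the area (in pixels^2) of the first burned-area polygon.
    print(len(polygons), 'polygons; first polygon area:', polygons[0].area)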

    Acknowledgements

    This dataset results from activities developed in the context of projects partially funded by FCT - Fundação para a Ciência e a Tecnologia, I.P., through projects MIT-EXPL/ACC/0057/2021 and UIDB/04524/2020, and under the Scientific Employment Stimulus - Institutional Call - CEECINST/00051/2018.

    The source code is available here.

  3. GLips - German Lipreading Dataset - Dataset - B2FIND

    • b2find.dkrz.de
    Updated Oct 29, 2023
    + more versions
    Cite
    (2023). GLips - German Lipreading Dataset - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/aca4d0c4-9e81-560c-b3af-de011686ecc6
    Dataset updated
    Oct 29, 2023
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    The German Lipreading dataset consists of 250,000 publicly available videos of the faces of speakers of the Hessian Parliament, which were processed for word-level lip reading using an automatic pipeline. The format is similar to that of the English-language Lip Reading in the Wild (LRW) dataset: each H264-compressed MPEG-4 video encodes one word of interest in a context of 1.16 seconds duration, which makes the two datasets compatible for studying transfer learning. Choosing video material based on naturally spoken language in a natural environment ensures more robust results for real-world applications than artificially generated datasets with as little noise as possible. The 500 different spoken words, ranging between 4 and 18 characters in length, each have 500 instances with separate MPEG-4 audio and text metadata files, originating from 1,018 parliamentary sessions. Additionally, the complete TextGrid files containing the segmentation information of those sessions are included. The size of the uncompressed dataset is 16 GB. Copyright of original data: Hessian Parliament (https://hessischer-landtag.de). If you use this dataset, you agree to use it for research purposes only and to cite the following reference in any works that make any use of the dataset.

  4. Multimodal WEDAR dataset for attention regulation behaviors, self-reported distractions, reaction time, and knowledge gain in e-reading

    • data.4tu.nl
    • 4tu.edu.hpc.n-helix.com
    zip
    Updated May 9, 2023
    Cite
    Yoon Lee; Marcus Specht (2023). Multimodal WEDAR dataset for attention regulation behaviors, self-reported distractions, reaction time, and knowledge gain in e-reading [Dataset]. http://doi.org/10.4121/8f730aa3-ad04-4419-8a5b-325415d2294b.v1
    Available download formats: zip
    Dataset updated
    May 9, 2023
    Dataset provided by
    4TU.ResearchData
    Authors
    Yoon Lee; Marcus Specht
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Diverse learning theories have been constructed to understand learners' internal states through various tangible predictors. We focus on self-regulatory actions that are subconscious and habitual actions triggered by behavior agents' 'awareness' of their attention loss. We hypothesize that self-regulatory behaviors (i.e., attention regulation behaviors) also occur in e-reading as 'regulators' as found in other behavior models (Ekman, P., & Friesen, W. V., 1969). In this work, we try to define the types and frequencies of attention regulation behaviors in e-reading. We collected various cues that reflect learners' moment-to-moment and page-to-page cognitive states to understand the learners' attention in e-reading.

    The text 'How to make the most of your day at Disneyland Resort Paris' has been implemented on a screen-based e-reader, which we developed in a pdf-reader format. An informative, entertaining text was adopted to capture learners' attentional shifts during knowledge acquisition. The text has 2685 words, distributed over ten pages, with one subtopic on each page. A built-in webcam on Mac Pro and a mouse have been used for the data collection, aiming for real-world implementation only with essential computational devices. A height-adjustable laptop stand has been used to compensate for participants' eye levels.

    Thirty learners in higher education were invited for a screen-based e-reading task (M=16.2, SD=5.2 minutes). A pre-test questionnaire with ten multiple-choice questions was given before the reading to check their prior knowledge of the topic. There was no specific time limit to finish the questionnaire. We collected cues that reflect learners' moment-to-moment and page-to-page cognitive states to understand their attention in e-reading. Learners were asked to report their distractions on two levels during the reading: 1) in-text distraction (e.g., still reading the text with low attentiveness) or 2) out-of-text distraction (e.g., thinking of something else while not reading the text anymore). We implemented two noticeably designed buttons on the right-hand side of the screen interface to minimize possible distraction from the reporting task.

    After a new page was triggered, blur stimuli were applied to the text within a random range of 20 seconds, which ensures that the blur stimuli occur at least once on each page. Participants were asked to click the de-blur button on the text area of the screen to proceed with the reading. The button covers the whole text area, so participants can minimize the effort to find and click it. Reaction time for de-blurring was also measured, to gauge learners' arousal during the reading. We asked participants to answer pre-test and post-test questionnaires about the reading material. Participants were given ten multiple-choice questions before the session, and the same set of questions was given after the reading session (i.e., formative questions) together with added subtopic summarization questions (i.e., summative questions). This provides insights into the quantitative and qualitative knowledge gained through the session and into different learning outcomes based on individual differences.

    A video dataset of 931,440 frames was annotated with the attention regulator behaviors using an annotation tool that plays the long sequence clip by clip, where each clip contains 30 frames. Two annotators (doctoral students) performed two stages of labeling. In the first stage, the annotators were trained on the labeling criteria and annotated the attention regulator behaviors separately based on their own judgment. The labels were summarized and cross-checked in the second round to address inconsistent cases, resulting in five attention regulation behaviors and one neutral state. See WEDAR_readme.csv for detailed descriptions of features.

    The dataset is provided in two forms: 1) raw data, in the form in which it was collected, and 2) preprocessed data, from which we extracted useful features for further learning analytics based on real-time and post-hoc data.

    Reference

    Ekman, P., & Friesen, W. V. (1969). The repertoire of nonverbal behavior: Categories, origins, usage, and coding. Semiotica, 1(1), 49-98.

  5. FSDKaggle2018

    • zenodo.org
    • opendatalab.com
    • +1more
    zip
    Updated Jan 24, 2020
    + more versions
    Cite
    Eduardo Fonseca; Xavier Favory; Jordi Pons; Frederic Font; Manoj Plakal; Daniel P. W. Ellis; Xavier Serra (2020). FSDKaggle2018 [Dataset]. http://doi.org/10.5281/zenodo.2552860
    Available download formats: zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Eduardo Fonseca; Xavier Favory; Jordi Pons; Frederic Font; Manoj Plakal; Daniel P. W. Ellis; Xavier Serra
    Description

    FSDKaggle2018 is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology. FSDKaggle2018 has been used for the DCASE Challenge 2018 Task 2, which was run as a Kaggle competition titled Freesound General-Purpose Audio Tagging Challenge.

    Citation

    If you use the FSDKaggle2018 dataset or part of it, please cite our DCASE 2018 paper:

    Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Favory, Jordi Pons, Xavier Serra. "General-purpose Tagging of Freesound Audio with AudioSet Labels: Task Description, Dataset, and Baseline". Proceedings of the DCASE 2018 Workshop (2018)

    You can also consider citing our ISMIR 2017 paper, which describes how we gathered the manual annotations included in FSDKaggle2018.

    Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, "Freesound Datasets: A Platform for the Creation of Open Audio Datasets", In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017

    Contact

    You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.

    About this dataset

    Freesound Dataset Kaggle 2018 (or FSDKaggle2018 for short) is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology [1]. FSDKaggle2018 has been used for the Task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2018. Please visit the DCASE2018 Challenge Task 2 website for more information. This Task was hosted on the Kaggle platform as a competition titled Freesound General-Purpose Audio Tagging Challenge. It was organized by researchers from the Music Technology Group of Universitat Pompeu Fabra, and from Google Research’s Machine Perception Team.

    The goal of this competition was to build an audio tagging system that can categorize an audio clip as belonging to one of a set of 41 diverse categories drawn from the AudioSet Ontology.

    All audio samples in this dataset are gathered from Freesound [2] and are provided here as uncompressed PCM 16 bit, 44.1 kHz, mono audio files. Note that because Freesound content is collaboratively contributed, recording quality and techniques can vary widely.

    The ground truth data provided in this dataset has been obtained after a data labeling process which is described below in the Data labeling process section. FSDKaggle2018 clips are unequally distributed in the following 41 categories of the AudioSet Ontology:

    "Acoustic_guitar", "Applause", "Bark", "Bass_drum", "Burping_or_eructation", "Bus", "Cello", "Chime", "Clarinet", "Computer_keyboard", "Cough", "Cowbell", "Double_bass", "Drawer_open_or_close", "Electric_piano", "Fart", "Finger_snapping", "Fireworks", "Flute", "Glockenspiel", "Gong", "Gunshot_or_gunfire", "Harmonica", "Hi-hat", "Keys_jangling", "Knock", "Laughter", "Meow", "Microwave_oven", "Oboe", "Saxophone", "Scissors", "Shatter", "Snare_drum", "Squeak", "Tambourine", "Tearing", "Telephone", "Trumpet", "Violin_or_fiddle", "Writing".

    Some other relevant characteristics of FSDKaggle2018:

    • The dataset is split into a train set and a test set.

    • The train set is meant for system development and includes ~9.5k samples unequally distributed among 41 categories. The minimum number of audio samples per category in the train set is 94, and the maximum is 300. The duration of the audio samples ranges from 300 ms to 30 s due to the diversity of the sound categories and the preferences of Freesound users when recording sounds. The total duration of the train set is roughly 18 h.

    • Out of the ~9.5k samples from the train set, ~3.7k have manually-verified ground truth annotations and ~5.8k have non-verified annotations. The non-verified annotations of the train set have a quality estimate of at least 65-70% in each category. Check out the Data labeling process section below for more information about this aspect.

    • Non-verified annotations in the train set are properly flagged in train.csv so that participants can opt to use this information during the development of their systems (a loading sketch follows this list).

    • The test set is composed of 1.6k samples with manually-verified annotations and a category distribution similar to that of the train set. The total duration of the test set is roughly 2 h.

    • All audio samples in this dataset have a single label (i.e., they are only annotated with one label). Check out the Data labeling process section below for more information about this aspect. A single label should be predicted for each file in the test set.
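
    As mentioned in the list above, non-verified annotations are flagged in the train metadata. Below is a minimal pandas sketch of how that flag might be used; it assumes the post-competition CSV (see the directory listing further down) exposes fname, label, and manually_verified columns, which should be checked against the actual file.

    import pandas as pd

    # Load the train metadata (filename taken from the directory listing below; columns assumed).
    train = pd.read_csv('FSDKaggle2018.meta/train_post_competition.csv')

    # Split clips by annotation reliability using the verification flag
    # (column name assumed; adjust if the CSV uses a different one).
    verified = train[train['manually_verified'] == 1]
    non_verified = train[train['manually_verified'] == 0]

    print(len(verified), 'manually verified clips,', len(non_verified), 'non-verified clips')
    print(train['label'].value_counts().head())  # most frequent categories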

    Data labeling process

    The data labeling process started from a manual mapping between Freesound tags and AudioSet Ontology categories (or labels), which was carried out by researchers at the Music Technology Group, Universitat Pompeu Fabra, Barcelona. Using this mapping, a number of Freesound audio samples were automatically annotated with labels from the AudioSet Ontology. These annotations can be understood as weak labels since they express the presence of a sound category in an audio sample.

    Then, a data validation process was carried out in which a number of participants did listen to the annotated sounds and manually assessed the presence/absence of an automatically assigned sound category, according to the AudioSet category description.

    Audio samples in FSDKaggle2018 are only annotated with a single ground truth label (see train.csv). A total of 3,710 annotations included in the train set of FSDKaggle2018 are annotations that have been manually validated as present and predominant (some with inter-annotator agreement, but not all of them). This means that in most cases there is no additional acoustic material other than the labeled category. In a few cases there may be some additional sound events, but these additional events won't belong to any of the 41 categories of FSDKaggle2018.

    The rest of the annotations have not been manually validated and therefore some of them could be inaccurate. Nonetheless, we have estimated that at least 65-70% of the non-verified annotations per category in the train set are indeed correct. It can happen that some of these non-verified audio samples present several sound sources even though only one label is provided as ground truth. These additional sources are typically out of the set of the 41 categories, but in a few cases they could be within.

    More details about the data labeling process can be found in [3].

    License

    FSDKaggle2018 has licenses at two different levels, as explained next.

    All sounds in Freesound are released under Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound. For attribution purposes and to facilitate attribution of these files to third parties, we include a relation of the audio clips included in FSDKaggle2018 and their corresponding license. The licenses are specified in the files train_post_competition.csv and test_post_competition_scoring_clips.csv.

    In addition, FSDKaggle2018 as a whole is the result of a curation process and it has an additional license. FSDKaggle2018 is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSDKaggle2018.doc zip file.

    Files

    FSDKaggle2018 can be downloaded as a series of zip files with the following directory structure:

    root
    │
    └───FSDKaggle2018.audio_train/                        Audio clips in the train set
    │
    └───FSDKaggle2018.audio_test/                         Audio clips in the test set
    │
    └───FSDKaggle2018.meta/                               Files for evaluation setup
    │   │
    │   └───train_post_competition.csv                    Data split and ground truth for the train set
    │   │
    │   └───test_post_competition_scoring_clips.csv       Ground truth for the test set
    │
    └───FSDKaggle2018.doc/
        │
        └───README.md                                     The dataset description file you are reading
        │
        └───LICENSE-DATASET

  6. NFED-fmri

    • zenodo.org
    zip
    Updated Sep 14, 2024
    Cite
    panapn Chen (2024). NFED-fmri [Dataset]. http://doi.org/10.5281/zenodo.13759829
    Available download formats: zip
    Dataset updated
    Sep 14, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    panapn Chen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Sep 13, 2024
    Description
    # A large-scale fMRI dataset in response to short naturalistic facial expression videos
    The Naturalistic Facial Expressions Dataset (NFED) is a large-scale dataset of whole-brain functional magnetic resonance imaging (fMRI) responses to 1,320 short (3 s) facial expression video clips. NFED offers researchers fMRI data that enables them to investigate the neural mechanisms involved in processing emotional information communicated by facial expression videos in real-world environments.
    The dataset contains raw data, pre-processed volume data, pre-processed surface data, and surface-based analyzed data.
    For more details, please refer to the paper at {website} and the dataset at https://openneuro.org/datasets/ds005047

    ## Preprocess procedure
    The MRI data were preprocessed using the pipeline of Kay et al., combining code written in MATLAB with tools from FreeSurfer, SPM, and FSL (http://github.com/kendrickkay). We used FreeSurfer software (http://surfer.nmr.mgh.harvard.edu) to construct the pial and white surfaces of participants from the T1 volume. Additionally, we established an intermediate gray matter surface between the pial and white surfaces for all participants.

    **code: ./volume_pre-process/**

    Detailed usage notes are available in the code; please read them carefully and modify variables to match your own environment.
    ## GLM of main experiment
    We performed a single-trial GLM (GLMsingle), an advanced denoising toolbox in MATLAB used to improve single-trial BOLD response estimates, to model the time-series data in surface format from the pre-processed fMRI data of each participant. Three specific amplitudes of responses (i.e., beta values) were estimated by modeling the BOLD response in relation to each video onset from 1 to 3 seconds in 1-second steps. Using GLMsingle in the manner described above, the BOLD responses evoked by each video were assessed in the runs of each session. In total, we extracted 2 (repetitions) x 3 (seconds) beta estimates for each video condition in the training set, and 10 (repetitions) x 3 (seconds) estimated beta values for each video condition in the testing set.
    **code: ./GLMsingle-main-experiment/matlab/examples.m**

    #### Retinotopic mapping

    The fMRI data from the population receptive field experiment were analyzed with a pRF model implemented in the analyzePRF toolbox (http://cvnlab.net/analyzePRF/) to characterize individual retinotopic representations. Make sure to download the required software mentioned in the code.

    **code: ./Functional-localizer-experiment-analysis/s4a_analysis_prf.m**

    #### fLoc experiment

    We used GLMdenoise, a data-driven denoising method, to analyze the pre-processed fMRI data from the fLoc experiment. We used a "condition-split" strategy to code the 10 stimulus categories, splitting the trials related to each category into individual conditions in each run. Six response estimates (beta values) for each category were produced by using six condition-splits. To quantify selectivity for various categories and domains, we computed t-values from the GLM beta values after fitting the GLM. The resulting maps were used to define category-selective regions of interest for each participant.
    **code: ./Functional-localizer-experiment-analysis/s4a_analysis_floc.m**


    ## Validation
    ### Basic quality control
    **code: ./validation/FD/FD.py**
    **code: ./validation/tSNR/tSNR.py**
    ### Noise ceiling
    The data are available at https://openneuro.org/datasets/ds005047. The "./validation/code/noise_celling/sub-xx" folders store the intermediate files required for running the program.
    **code: ./validation/noise_celling/Noise_Ceiling.py**

    ### Correspondence between human brain and DCNN
    The data are available at https://openneuro.org/datasets/ds005047. We combined the data from the main experiment and the functional localizer experiments to build an encoding model to replicate the hierarchical correspondence of representations between the brain and the DCNN. The encoding models were built to map artificial representations from each layer of the pre-trained VideoMAEv2 to neural representations from each area of the human visual cortex, as defined in the multimodal parcellation atlas.
    **code: ./validation/dnnbrain/**

    ### Semantic metadata of action and expression labels reveal that NFED can encode temporal and spatial stimulus features in the brain
    The data are available at https://openneuro.org/datasets/ds005047. The "./validation/code/semantic_metadata/xx_xx_semantic_metadata" folders store the intermediate files required for running the program.
    **code: ./validation/semantic_metadata/**
    ### GLMsingle-main-experiment
    The data are available at https://openneuro.org/datasets/ds005047. The "./validation/code/noise_celling/sub-xx" folders store the intermediate files required for running the program.
    **code: ./validation/noise_celling/Noise_Ceiling.py**




    ## Results
    The results can be viewed at "https://openneuro.org/datasets/ds005047/derivatives/validation/results/brain_map_individual".

    ## Whole-brain mapping
    The whole-brain data mapped to the cerebral cortex as obtained from the technical validation.
    **code: ./show_results_allbrain/Showresults.m**

    ## Manually prepared environment
    We provide *requirements.txt* to install the Python packages used in these codes. However, some packages like *GLM* and *pre-processing* require external dependencies, and we have provided those packages in the corresponding files.

    ## Stimuli
    The video stimuli used in the NFED experiment are saved in the "stimuli_1" and "stimuli_2" folders.

  7. Data from: A Tape-Reading Molecular Ratchet

    • data.mendeley.com
    Updated Oct 19, 2022
    Cite
    Daniel Tetlow (2022). A Tape-Reading Molecular Ratchet [Dataset]. http://doi.org/10.17632/k5m7fv49xv.1
    Dataset updated
    Oct 19, 2022
    Authors
    Daniel Tetlow
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplementary Data (1H & 13C NMR, CD and MS files) for "A Tape-Reading Molecular Ratchet"

  8. Raw Data for ConfLab: A Data Collection Concept, Dataset, and Benchmark for Machine Analysis of Free-Standing Social Interactions in the Wild

    • data.4tu.nl
    Updated Jun 7, 2022
    + more versions
    Cite
    Chirag Raman; Jose Vargas Quiros; Stephanie Tan; Ashraful Islam; Ekin Gedik; Hayley Hung (2022). Raw Data for ConfLab: A Data Collection Concept, Dataset, and Benchmark for Machine Analysis of Free-Standing Social Interactions in the Wild [Dataset]. http://doi.org/10.4121/20017748.v2
    Dataset updated
    Jun 7, 2022
    Dataset provided by
    4TU.ResearchData
    Authors
    Chirag Raman; Jose Vargas Quiros; Stephanie Tan; Ashraful Islam; Ekin Gedik; Hayley Hung
    License

    https://data.4tu.nl/info/fileadmin/user_upload/Documenten/4TU.ResearchData_Restricted_Data_2022.pdf

    Description

    This file contains raw data for cameras and wearables of the ConfLab dataset.


    ./cameras

    contains the overhead video recordings for 9 cameras (cam2-10) in MP4 files. These cameras cover the whole interaction floor, with camera 2 capturing the bottom of the scene layout and camera 10 capturing the top of the scene layout. Note that cam5 ran out of battery before the other cameras and thus its recordings are cut short. However, cam4 and cam6 overlap significantly with cam5, so any information needed can be reconstructed.


    Note that the annotations are made and provided in 2-minute segments. The annotated portions of the video include the last 3 min 38 s of the x2xxx.MP4 video files and the first 12 min of the x3xxx.MP4 files for cameras 2, 4, 6, 8, and 10, with "x" being the placeholder character in the mp4 file names. If one wishes to separate the video into 2-minute segments as we did, the "video-splitting.sh" script is provided.


    ./camera-calibration contains the camera intrinsic files obtained from https://github.com/idiap/multicamera-calibration. Camera extrinsic parameters can be calculated using the existing intrinsic parameters and the instructions in the multicamera-calibration repo. The coordinates in the image are provided by the crosses marked on the floor, which are visible in the video recordings. The crosses are 1 m (=100 cm) apart.


    ./wearables

    subdirectory includes the IMU, proximity, and audio data from each participant at the ConfLab event (48 in total). In the directory numbered by participant ID, the following data are included:

    1. raw audio file
    2. proximity (Bluetooth) pings (RSSI) file (raw and csv) and a visualization
    3. Tri-axial accelerometer data (raw and csv) and a visualization
    4. Tri-axial gyroscope data (raw and csv) and a visualization
    5. Tri-axial magnetometer data (raw and csv) and a visualization
    6. Game rotation vector (raw and csv), recorded in quaternions


    All files are timestamped. The sampling frequencies are:

    - audio: 1250 Hz
    - rest: around 50 Hz; however, the sample rate is not fixed, so the timestamps should be used.

    For rotation, the game rotation vector's output frequency is limited by the actual sampling frequency of the magnetometer. For more information, please refer to https://invensense.tdk.com/wp-content/uploads/2016/06/DS-000189-ICM-20948-v1.3.pdf


    Audio files in this folder are in raw binary form. The following can be used to convert them to WAV files (1250 Hz):

    ffmpeg -f s16le -ar 1250 -ac 1 -i /path/to/audio/file /path/to/output.wav
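
    If ffmpeg is not available, the same conversion can be done with the Python standard library; a minimal sketch is below, assuming the raw files are signed 16-bit little-endian mono PCM at 1250 Hz as stated above (the paths are placeholders).

    import wave

    def raw_to_wav(raw_path, wav_path, sample_rate=1250):
        """Wrap a raw s16le mono PCM stream in a WAV container."""
        with open(raw_path, 'rb') as f:
            pcm = f.read()
        with wave.open(wav_path, 'wb') as wav:
            wav.setnchannels(1)        # mono
            wav.setsampwidth(2)        # 16-bit samples
            wav.setframerate(sample_rate)
            wav.writeframes(pcm)

    # Example (placeholder paths):
    # raw_to_wav('/path/to/audio/file', '/path/to/output.wav')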


    Synchronization of cameras and wearables data

    Raw videos contain timecode information which matches the timestamps of the data in the "wearables" folder. The starting timecode of a video can be read as:

    ffprobe -hide_banner -show_streams -i /path/to/video


    ./audio

    ./sync: contains WAV files for each subject
    ./sync_files: auxiliary csv files used to sync the audio. Can be used to improve the synchronization.

    The code used for syncing the audio can be found here:
    https://github.com/TUDelft-SPC-Lab/conflab/tree/master/preprocessing/audio

  9. FSDKaggle2019

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +1more
    Updated Jan 24, 2020
    + more versions
    Cite
    Eduardo Fonseca (2020). FSDKaggle2019 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3612636
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Eduardo Fonseca
    Daniel P. W. Ellis
    Manoj Plakal
    Frederic Font
    Xavier Serra
    Description

    FSDKaggle2019 is an audio dataset containing 29,266 audio files annotated with 80 labels of the AudioSet Ontology. FSDKaggle2019 has been used for the DCASE Challenge 2019 Task 2, which was run as a Kaggle competition titled Freesound Audio Tagging 2019.

    Citation

    If you use the FSDKaggle2019 dataset or part of it, please cite our DCASE 2019 paper:

    Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Serra. "Audio tagging with noisy labels and minimal supervision". Proceedings of the DCASE 2019 Workshop, NYC, US (2019)

    You can also consider citing our ISMIR 2017 paper, which describes how we gathered the manual annotations included in FSDKaggle2019.

    Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, "Freesound Datasets: A Platform for the Creation of Open Audio Datasets", In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017

    Data curators

    Eduardo Fonseca, Manoj Plakal, Xavier Favory, Jordi Pons

    Contact

    You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.

    ABOUT FSDKaggle2019

    Freesound Dataset Kaggle 2019 (or FSDKaggle2019 for short) is an audio dataset containing 29,266 audio files annotated with 80 labels of the AudioSet Ontology [1]. FSDKaggle2019 has been used for the Task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2019. Please visit the DCASE2019 Challenge Task 2 website for more information. This Task was hosted on the Kaggle platform as a competition titled Freesound Audio Tagging 2019. It was organized by researchers from the Music Technology Group (MTG) of Universitat Pompeu Fabra (UPF), and from the Sound Understanding team at Google AI Perception. The competition intended to provide insight towards the development of broadly-applicable sound event classifiers able to cope with label noise and minimal supervision conditions.

    FSDKaggle2019 employs audio clips from the following sources:

    Freesound Dataset (FSD): a dataset being collected at the MTG-UPF based on Freesound content organized with the AudioSet Ontology

    The soundtracks of a pool of Flickr videos taken from the Yahoo Flickr Creative Commons 100M dataset (YFCC)

    The audio data is labeled using a vocabulary of 80 labels from Google’s AudioSet Ontology [1], covering diverse topics: Guitar and other Musical Instruments, Percussion, Water, Digestive, Respiratory sounds, Human voice, Human locomotion, Hands, Human group actions, Insect, Domestic animals, Glass, Liquid, Motor vehicle (road), Mechanisms, Doors, and a variety of Domestic sounds. The full list of categories can be inspected in vocabulary.csv (see Files & Download below). The goal of the task was to build a multi-label audio tagging system that can predict appropriate label(s) for each audio clip in a test set.

    What follows is a summary of some of the most relevant characteristics of FSDKaggle2019. Nevertheless, it is highly recommended to read our DCASE 2019 paper for a more in-depth description of the dataset and how it was built.

    Ground Truth Labels

    The ground truth labels are provided at the clip-level, and express the presence of a sound category in the audio clip, hence can be considered weak labels or tags. Audio clips have variable lengths (roughly from 0.3 to 30s).

    The audio content from FSD has been manually labeled by humans following a data labeling process using the Freesound Annotator platform. Most labels have inter-annotator agreement but not all of them. More details about the data labeling process and the Freesound Annotator can be found in [2].

    The YFCC soundtracks were labeled using automated heuristics applied to the audio content and metadata of the original Flickr clips. Hence, a substantial amount of label noise can be expected. The label noise can vary widely in amount and type depending on the category, including in- and out-of-vocabulary noises. More information about some of the types of label noise that can be encountered is available in [3].

    Specifically, FSDKaggle2019 features three types of label quality, one for each set in the dataset:

    curated train set: correct (but potentially incomplete) labels

    noisy train set: noisy labels

    test set: correct and complete labels

    Further details can be found below in the sections for each set.

    Format

    All audio clips are provided as uncompressed PCM 16 bit, 44.1 kHz, mono audio files.

    DATA SPLIT

    FSDKaggle2019 consists of two train sets and one test set. The idea is to limit the supervision provided for training (i.e., the manually-labeled, hence reliable, data), thus promoting approaches to deal with label noise.

    Curated train set

    The curated train set consists of manually-labeled data from FSD.

    Number of clips/class: 75, except in a few cases (where there are fewer)

    Total number of clips: 4970

    Avg number of labels/clip: 1.2

    Total duration: 10.5 hours

    The duration of the audio clips ranges from 0.3 to 30s due to the diversity of the sound categories and the preferences of Freesound users when recording/uploading sounds. Labels are correct but potentially incomplete. It can happen that a few of these audio clips present additional acoustic material beyond the provided ground truth label(s).

    Noisy train set

    The noisy train set is a larger set of noisy web audio data from Flickr videos taken from the YFCC dataset [5].

    Number of clips/class: 300

    Total number of clips: 19,815

    Avg number of labels/clip: 1.2

    Total duration: ~80 hours

    The duration of the audio clips ranges from 1s to 15s, with the vast majority lasting 15s. Labels are automatically generated and purposefully noisy. No human validation is involved. The label noise can vary widely in amount and type depending on the category, including in- and out-of-vocabulary noises.

    Considering the numbers above, the per-class data distribution available for training is, for most of the classes, 300 clips from the noisy train set and 75 clips from the curated train set. This means 80% noisy / 20% curated at the clip level, while at the duration level the proportion is more extreme considering the variable-length clips.
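
    To make the split above concrete, here is a minimal pandas sketch of how the two train sets might be combined. It assumes the post-competition CSVs named later in the LICENSE section (train_curated_post_competition.csv and train_noisy_post_competition.csv) expose at least fname and labels columns, with multiple labels separated by commas; this should be verified against the actual files.

    import pandas as pd

    # Load both train sets (filenames taken from the dataset description; columns assumed).
    curated = pd.read_csv('train_curated_post_competition.csv')
    noisy = pd.read_csv('train_noisy_post_competition.csv')

    # Tag each clip with its source so label quality can be handled downstream.
    curated['source'] = 'curated'
    noisy['source'] = 'noisy'
    train = pd.concat([curated, noisy], ignore_index=True)

    # Multi-label ground truth: split the comma-separated labels into Python lists.
    train['labels'] = train['labels'].str.split(',')

    print(train['source'].value_counts())    # roughly 80% noisy / 20% curated at the clip level
    print(train['labels'].str.len().mean())  # average number of labels per clip (~1.2)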

    Test set

    The test set is used for system evaluation and consists of manually-labeled data from FSD.

    Number of clips/class: between 50 and 150

    Total number of clips: 4481

    Avg number of labels/clip: 1.4

    Total duration: 12.9 hours

    The acoustic material present in the test set clips is labeled exhaustively using the aforementioned vocabulary of 80 classes. Most labels have inter-annotator agreement, but not all of them. Barring human error, the labels are correct and complete considering the target vocabulary; nonetheless, a few clips could still present additional (unlabeled) acoustic content outside the vocabulary.

    During the DCASE2019 Challenge Task 2, the test set was split into two subsets, for the public and private leaderboards, and only the data corresponding to the public leaderboard was provided. In this current package you will find the full test set with all the test labels. To allow comparison with previous work, the file test_post_competition.csv includes a flag to determine the corresponding leaderboard (public or private) for each test clip (see more info in Files & Download below).

    Acoustic mismatch

    As mentioned before, FSDKaggle2019 uses audio clips from two sources:

    FSD: curated train set and test set, and

    YFCC: noisy train set.

    While the sources of audio (Freesound and Flickr) are collaboratively contributed and pretty diverse themselves, a certain acoustic mismatch can be expected between FSD and YFCC. We conjecture this mismatch comes from a variety of reasons. For example, through acoustic inspection of a small sample of both data sources, we find a higher percentage of high quality recordings in FSD. In addition, audio clips in Freesound are typically recorded with the purpose of capturing audio, which is not necessarily the case in YFCC.

    This mismatch can have an impact in the evaluation, considering that most of the train data come from YFCC, while all test data are drawn from FSD. This constraint (i.e., noisy training data coming from a different web audio source than the test set) is sometimes a real-world condition.

    LICENSE

    All clips in FSDKaggle2019 are released under Creative Commons (CC) licenses. For attribution purposes and to facilitate attribution of these files to third parties, we include a mapping from the audio clips to their corresponding licenses.

    Curated train set and test set. All clips in Freesound are released under different modalities of Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound, some of them requiring attribution to their original authors and some forbidding further commercial reuse. The licenses are specified in the files train_curated_post_competition.csv and test_post_competition.csv. These licenses can be CC0, CC-BY, CC-BY-NC and CC Sampling+.

    Noisy train set. Similarly, the licenses of the soundtracks from Flickr used in FSDKaggle2019 are specified in the file train_noisy_post_competition.csv. These licenses can be CC-BY and CC BY-SA.

    In addition, FSDKaggle2019 as a whole is the result of a curation process and it has an additional license. FSDKaggle2019 is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSDKaggle2019.doc zip file.

    FILES & DOWNLOAD

    FSDKaggle2019 can be downloaded as a series of zip files with the following directory structure:

    root
    │
    └───FSDKaggle2019.audio_train_curated/                Audio clips in the curated train set
    │
    └───FSDKaggle2019.audio_train_noisy/                  Audio clips in the noisy train set

  10. The Visual-Inertial Canoe Dataset

    • aws-databank-alb.library.illinois.edu
    • databank.illinois.edu
    Updated Nov 14, 2017
    Cite
    Martin Miller; Soon-Jo Chung; Seth Hutchinson (2017). The Visual-Inertial Canoe Dataset [Dataset]. http://doi.org/10.13012/B2IDB-9342111_V1
    Dataset updated
    Nov 14, 2017
    Authors
    Martin Miller; Soon-Jo Chung; Seth Hutchinson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    Office of Naval Research
    Description

    If you use this dataset, please cite the IJRR data paper (BibTeX is below). We present a dataset collected from a canoe along the Sangamon River in Illinois. The canoe was equipped with a stereo camera, an IMU, and a GPS device, which provide visual data suitable for stereo or monocular applications, inertial measurements, and position data for ground truth. We recorded a canoe trip up and down the river for 44 minutes covering 2.7 km round trip. The dataset adds to those previously recorded in unstructured environments and is unique in that it is recorded on a river, which provides its own set of challenges and constraints that are described in this paper. The data is divided into subsets, which can be downloaded individually. Video previews are available on Youtube: https://www.youtube.com/channel/UCOU9e7xxqmL_s4QX6jsGZSw

    The information below can also be found in the README files provided in the 527 dataset and each of its subsets. The purpose of this document is to assist researchers in using this dataset.

    Images
    ======

    Raw
    ---
    The raw images are stored in the cam0 and cam1 directories in bmp format. They are bayered images that need to be debayered and undistorted before they are used. The camera parameters for these images can be found in camchain-imucam.yaml. Note that the camera intrinsics describe a 1600x1200 resolution image, so the focal length and center pixel coordinates must be scaled by 0.5 before they are used. The distortion coefficients remain the same even for the scaled images. The camera to IMU transformation matrix is also in this file. cam0/ refers to the left camera, and cam1/ refers to the right camera.

    Rectified
    ---------
    Stereo rectified, undistorted, row-aligned, debayered images are stored in the rectified/ directory in the same way as the raw images except that they are in png format. The params.yaml file contains the projection and rotation matrices necessary to use these images. The resolution of these parameters does not need to be scaled as is necessary for the raw images.

    params.yml
    ----------
    The stereo rectification parameters. R0, R1, P0, P1, and Q correspond to the outputs of the OpenCV stereoRectify function except that 1s and 2s are replaced by 0s and 1s, respectively.
    R0: The rectifying rotation matrix of the left camera.
    R1: The rectifying rotation matrix of the right camera.
    P0: The projection matrix of the left camera.
    P1: The projection matrix of the right camera.
    Q: Disparity to depth mapping matrix.
    T_cam_imu: Transformation matrix for a point in the IMU frame to the left camera frame.

    camchain-imucam.yaml
    --------------------
    The camera intrinsic and extrinsic parameters and the camera to IMU transformation usable with the raw images.
    T_cam_imu: Transformation matrix for a point in the IMU frame to the camera frame.
    distortion_coeffs: lens distortion coefficients using the radial tangential model.
    intrinsics: focal length x, focal length y, principal point x, principal point y.
    resolution: resolution of calibration. Scale the intrinsics for use with the raw 800x600 images. The distortion coefficients do not change when the image is scaled.
    T_cn_cnm1: Transformation matrix from the right camera to the left camera.
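
    As an illustration of the intrinsics scaling described above, here is a minimal OpenCV sketch for debayering and undistorting one raw image. The Bayer pattern, the file path, and the example parameter values are assumptions; real values should be taken from camchain-imucam.yaml.

    import cv2
    import numpy as np

    # Intrinsics in camchain-imucam.yaml describe 1600x1200 calibration images
    # (values below are placeholders), so scale focal lengths and principal point
    # by 0.5 for the 800x600 raw images. Distortion coefficients are not scaled.
    fx, fy, cx, cy = 1400.0, 1400.0, 800.0, 600.0   # placeholder calibration values
    scale = 0.5
    K = np.array([[fx * scale, 0.0, cx * scale],
                  [0.0, fy * scale, cy * scale],
                  [0.0, 0.0, 1.0]])
    dist = np.array([-0.3, 0.1, 0.0, 0.0])          # placeholder radial-tangential coefficients

    # Debayer a raw left-camera frame (file name and Bayer pattern assumed),
    # then undistort it with the scaled intrinsics.
    raw = cv2.imread('cam0/000000.bmp', cv2.IMREAD_GRAYSCALE)
    color = cv2.cvtColor(raw, cv2.COLOR_BAYER_BG2BGR)
    undistorted = cv2.undistort(color, K, dist)
    cv2.imwrite('cam0_undistorted_000000.png', undistorted)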

    Sensors
    -------
    Here, each message in name.csv is described.

    ###rawimus###
    time # GPS time in seconds
    message name # rawimus
    acceleration_z # m/s^2 IMU uses right-forward-up coordinates
    -acceleration_y # m/s^2
    acceleration_x # m/s^2
    angular_rate_z # rad/s IMU uses right-forward-up coordinates
    -angular_rate_y # rad/s
    angular_rate_x # rad/s

    ###IMG###
    time # GPS time in seconds
    message name # IMG
    left image filename
    right image filename

    ###inspvas###
    time # GPS time in seconds
    message name # inspvas
    latitude
    longitude
    altitude # ellipsoidal height WGS84 in meters
    north velocity # m/s
    east velocity # m/s
    up velocity # m/s
    roll # right hand rotation about y axis in degrees
    pitch # right hand rotation about x axis in degrees
    azimuth # left hand rotation about z axis in degrees clockwise from north

    ###inscovs###
    time # GPS time in seconds
    message name # inscovs
    position covariance # 9 values xx,xy,xz,yx,yy,yz,zx,zy,zz m^2
    attitude covariance # 9 values xx,xy,xz,yx,yy,yz,zx,zy,zz deg^2
    velocity covariance # 9 values xx,xy,xz,yx,yy,yz,zx,zy,zz (m/s)^2

    ###bestutm###
    time # GPS time in seconds
    message name # bestutm
    utm zone # numerical zone
    utm character # alphabetical zone
    northing # m
    easting # m
    height # m above mean sea level

    Camera logs
    -----------
    The files name.cam0 and name.cam1 are text files that correspond to cameras 0 and 1, respectively. The columns are defined by:
    unused: The first column is all 1s and can be ignored.
    software frame number: This number increments at the end of every iteration of the software loop.
    camera frame number: This number is generated by the camera and increments each time the shutter is triggered. The software and camera frame numbers do not have to start at the same value, but if the difference between the initial and final values is not the same, it suggests that frames may have been dropped.
    camera timestamp: This is the camera's internal timestamp of the frame capture in units of 100 milliseconds.
    PC timestamp: This is the PC time of arrival of the image.

    name.kml
    --------
    The kml file is a mapping file that can be read by software such as Google Earth. It contains the recorded GPS trajectory.

    name.unicsv
    -----------
    This is a csv file of the GPS trajectory in UTM coordinates that can be read by gpsbabel, software for manipulating GPS paths.

    @article{doi:10.1177/0278364917751842,
      author = {Martin Miller and Soon-Jo Chung and Seth Hutchinson},
      title = {The Visual–Inertial Canoe Dataset},
      journal = {The International Journal of Robotics Research},
      volume = {37},
      number = {1},
      pages = {13-20},
      year = {2018},
      doi = {10.1177/0278364917751842},
      URL = {https://doi.org/10.1177/0278364917751842},
      eprint = {https://doi.org/10.1177/0278364917751842}
    }

  11. Data from: DIPSER: A Dataset for In-Person Student Engagement Recognition in the Wild

    • observatorio-cientifico.ua.es
    • scidb.cn
    Updated 2025
    Cite
    Márquez-Carpintero, Luis; Suescun-Ferrandiz, Sergio; Álvarez, Carolina Lorenzo; Fernandez-Herrero, Jorge; Viejo, Diego; Rosabel Roig-Vila; Cazorla, Miguel (2025). DIPSER: A Dataset for In-Person Student Engagement Recognition in the Wild [Dataset]. https://observatorio-cientifico.ua.es/documentos/67321d21aea56d4af0484172
    Dataset updated
    2025
    Authors
    Márquez-Carpintero, Luis; Suescun-Ferrandiz, Sergio; Álvarez, Carolina Lorenzo; Fernandez-Herrero, Jorge; Viejo, Diego; Rosabel Roig-Vila; Cazorla, Miguel
    Description

    Data Description

    The DIPSER dataset is designed to assess student attention and emotion in in-person classroom settings. It consists of RGB camera data, smartwatch sensor data, and labeled attention and emotion metrics. It includes multiple camera angles per student to capture posture and facial expressions, complemented by smartwatch data for inertial and biometric measurements. Attention and emotion labels are derived from self-reports and expert evaluations. The dataset covers diverse demographic groups, with data collected in real-world classroom environments, facilitating the training of machine learning models for predicting attention and correlating it with emotional states.

    Data Collection and Generation Procedures

    The dataset was collected in a natural classroom environment at the University of Alicante, Spain. The recording setup consisted of six general cameras positioned to capture the overall classroom context and individual cameras placed at each student's desk. Additionally, smartwatches were used to collect biometric data, such as heart rate, accelerometer, and gyroscope readings.

    Experimental Sessions

    Nine distinct educational activities were designed to ensure a comprehensive range of engagement scenarios:

    • News Reading – Students read projected or device-displayed news.
    • Brainstorming Session – Idea generation for problem-solving.
    • Lecture – Passive listening to an instructor-led session.
    • Information Organization – Synthesizing information from different sources.
    • Lecture Test – Assessment of lecture content via mobile devices.
    • Individual Presentations – Students present their projects.
    • Knowledge Test – Conducted using Kahoot.
    • Robotics Experimentation – Hands-on session with robotics.
    • MTINY Activity Design – Development of educational activities with computational thinking.

    Technical Specifications

    • RGB Cameras: Individual cameras recorded at 640×480 pixels, while context cameras captured at 1280×720 pixels.
    • Frame Rate: 9–10 FPS depending on the setup.
    • Smartwatch Sensors: Heart rate, accelerometer, gyroscope, rotation vector, and light sensor data collected at frequencies of 1–100 Hz.

    Data Organization and Formats

    The dataset follows a structured directory format:

    /groupX/experimentY/subjectZ.zip

    Each subject-specific folder contains:

    • images/ (individual facial images)
    • watch_sensors/ (sensor readings in JSON format)
    • labels/ (engagement & emotion annotations)
    • metadata/ (subject demographics & session details)

    Annotations and Labeling

    Each data entry includes engagement levels (1–5) and emotional states (9 categories) based on both self-reported labels and evaluations by four independent experts. A custom annotation tool was developed to ensure consistency across evaluations.

    Missing Data and Data Quality

    • Synchronization: A centralized server ensured time alignment across devices. Brightness changes were used to verify synchronization.
    • Completeness: No major missing data, except for occasional random frame drops due to embedded device performance.
    • Data Consistency: Uniform collection methodology across sessions, ensuring high reliability.

    Data Processing Methods

    To enhance usability, the dataset includes preprocessed bounding boxes for face, body, and hands, along with gaze estimation and head pose annotations. These were generated using YOLO, MediaPipe, and DeepFace.

    File Formats and Accessibility

    • Images: Stored in standard JPEG format.
    • Sensor Data: Provided as structured JSON files.
    • Labels: Available as CSV files with timestamps.

    The dataset is publicly available under the CC-BY license and can be accessed, along with the necessary processing scripts, via the DIPSER GitHub repository.

    Potential Errors and Limitations

    • Due to camera angles, some student movements may be out of frame in collaborative sessions.
    • Lighting conditions vary slightly across experiments.
    • Sensor latency variations are minimal but exist due to embedded device constraints.

    Citation

    If you find this project helpful for your research, please cite our work using the following bibtex entry:

    @misc{marquezcarpintero2025dipserdatasetinpersonstudent1,
      title={DIPSER: A Dataset for In-Person Student Engagement Recognition in the Wild},
      author={Luis Marquez-Carpintero and Sergio Suescun-Ferrandiz and Carolina Lorenzo Álvarez and Jorge Fernandez-Herrero and Diego Viejo and Rosabel Roig-Vila and Miguel Cazorla},
      year={2025},
      eprint={2502.20209},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.20209},
    }

    Usage and Reproducibility

    Researchers can use standard tools such as OpenCV, TensorFlow, and PyTorch for analysis. The dataset supports research in machine learning, affective computing, and education analytics, offering a unique resource for engagement and attention studies in real-world classroom environments.
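
    As a quick-start illustration (not part of the official DIPSER tooling), the sketch below browses one extracted subject folder using the layout described above; the subject path, file extensions, and file names inside labels/ and watch_sensors/ are assumptions.

    import json
    from pathlib import Path

    import pandas as pd

    # Hypothetical path: one subject archive extracted from /groupX/experimentY/subjectZ.zip
    subject_dir = Path("group1/experiment1/subject1")

    # Engagement/emotion labels are distributed as CSV files with timestamps.
    for csv_file in sorted((subject_dir / "labels").glob("*.csv")):
        labels = pd.read_csv(csv_file)
        print(csv_file.name, labels.shape)

    # Smartwatch readings are distributed as structured JSON files.
    for json_file in sorted((subject_dir / "watch_sensors").glob("*.json")):
        with open(json_file) as f:
            readings = json.load(f)
        print(json_file.name, type(readings))

    # Individual facial images are stored as JPEGs under images/.
    image_paths = sorted((subject_dir / "images").glob("*.jpg"))
    print(len(image_paths), "images found")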

  12. Z

    GENEA Challenge 2023 Dataset Files

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 31, 2023
    Cite
    Youngwoo Yoon (2023). GENEA Challenge 2023 Dataset Files [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8199132
    Explore at:
    Dataset updated
    Jul 31, 2023
    Dataset provided by
    Taras Kucherenko
    Rajmund Nagy
    Youngwoo Yoon
    Description

    This Zenodo repository contains the main dataset for the GENEA Challenge 2023, which is based on the Talking With Hands 16.2M dataset.

    Notation:

    Please take note of the following nomenclature when reading this document:

    main agent refers to the speaker in the dyadic interaction for which the systems generated motions.

    interlocutor refers to the speaker in front of the main agent.

    Contents:

    The "genea2023_trn" and "genea2023_val" zip files contain audio files (in WAV format), time-aligned transcriptions (in TSV format), and motion files (in BVH format) for the training and validation datasets, respectively.

    The "genea2023_test" zip file contains audio files (in WAV format) and transcriptions (in TSV format) for the test set, but no motion. The corresponding test motion is available at:

    https://zenodo.org/record/8146027

    Each zip file also contains a "metadata.csv" file that contains information for all files regarding the speaker ID and whether or not the motion files contain finger motion.

    Note that the speech audio in the data sometimes has been replaced by silence for the purpose of anonymisation.

    In the test set, files with indices from 0 to 40 correspond to "matched" interactions (the core test set), where main agent and interlocutor data come from the same conversation, whilst file indices from 41 to 69 correspond to "mismatched" interactions (the extended test set), where main agent and interlocutor data come from different conversations.

    Folder structure:

    main-agent/ (main agent): Encapsulates BVH, TSV, WAV data subfolders for the main agent.

    interloctr/ (interlocutor): Encapsulates BVH, TSV, WAV data subfolders for the interlocutor.

    bvh/ (motion): Time-aligned 3D full-body motion-capture data in BVH format from a speaking and gesticulating actor. Each file is a single person, but each data sample contains files for both the main agent and the interlocutor.

    wav/ (audio): Recorded audio data in WAV format from a speaking and gesticulating actor with a close-talking microphone. Parts of the audio recordings have been muted to omit personally identifiable information.

    tsv/ (text): Word-level time-aligned text transcriptions of the above audio recordings in TSV format (tab-separated values). For privacy reasons, the transcriptions do not include references to personally identifiable information, similar to the audio files.
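
    As a quick orientation to this layout, the minimal sketch below (assuming the training zip has been unpacked to genea2023_trn/ and that metadata.csv sits at its root) enumerates the per-agent modalities and inspects the metadata with pandas; column names are not assumed, only printed.

    from pathlib import Path

    import pandas as pd

    root = Path("genea2023_trn")  # unpacked training archive; path and layout are assumptions

    # metadata.csv records the speaker ID and whether each motion file contains finger
    # motion; its exact location and column names may differ, so they are just inspected.
    metadata = pd.read_csv(root / "metadata.csv")
    print(metadata.columns.tolist())
    print(metadata.head())

    # Each agent folder holds bvh/, wav/ and tsv/ subfolders.
    for agent in ("main-agent", "interloctr"):
        for modality in ("bvh", "wav", "tsv"):
            files = sorted((root / agent / modality).glob(f"*.{modality}"))
            print(agent, modality, len(files), "files")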

    Data processing scripts:

    We provide a number of optional scripts for encoding and processing the challenge data:

    Audio: Scripts for extracting basic audio features, such as spectrograms, prosodic features, and mel-frequency cepstral coefficients (MFCCs) can be found at this link.

    Text: A script to encode text transcriptions to word vectors using FastText is available here: tsv2wordvectors.py

    Motion: If you wish to encode the joint angles from the BVH files to and from an exponential map representation, you can use scripts by Simon Alexanderson based on the PyMo library, which are available here:

    bvh2features.py

    features2bvh.py

    Attribution:

    If you use this material, please cite our latest paper on the GENEA Challenge 2023. At the time of writing (2023-07-25) this is our ACM ICMI 2023 paper:

    Taras Kucherenko, Rajmund Nagy, Youngwoo Yoon, Jieyeon Woo, Teodor Nikolov, Mihail Tsakov, and Gustav Eje Henter. 2023. The GENEA Challenge 2023: A large-scale evaluation of gesture generation models in monadic and dyadic settings. In Proceedings of the ACM International Conference on Multimodal Interaction (ICMI ’23). ACM.

    Also, please cite the paper about the original dataset from Meta Research:

    Gilwoo Lee, Zhiwei Deng, Shugao Ma, Takaaki Shiratori, Siddhartha S. Srinivasa, and Yaser Sheikh. 2019. Talking With Hands 16.2M: A large-scale dataset of synchronized body-finger motion and audio for conversational motion analysis and synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV ’19). IEEE, 763–772.

    The motion and audio files are based on the Talking With Hands 16.2M dataset at https://github.com/facebookresearch/TalkingWithHands32M/. The material is available under a CC BY-NC 4.0 (Attribution-NonCommercial 4.0 International) license, with the text provided in LICENSE.txt.

    To find more GENEA Challenge 2023 material on the web, please see:

    https://genea-workshop.github.io/2023/challenge/

    If you have any questions or comments, please contact:

    The GENEA Challenge organisers genea-challenge@googlegroups.com

  13. Z

    FSDnoisy18k

    • data.niaid.nih.gov
    • paperswithcode.com
    • +2more
    Updated Jan 24, 2020
    Cite
    FSDnoisy18k [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2529933
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Mercedes Collado
    Xavier Favory
    Eduardo Fonseca
    Daniel P. W. Ellis
    Manoj Plakal
    Frederic Font
    Xavier Serra
    Description

    FSDnoisy18k is an audio dataset collected with the aim of fostering the investigation of label noise in sound event classification. It contains 42.5 hours of audio across 20 sound classes, including a small amount of manually-labeled data and a larger quantity of real-world noisy data.

    Data curators

    Eduardo Fonseca and Mercedes Collado

    Contact

    You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.

    Citation

    If you use this dataset or part of it, please cite the following ICASSP 2019 paper:

    Eduardo Fonseca, Manoj Plakal, Daniel P. W. Ellis, Frederic Font, Xavier Favory, and Xavier Serra, “Learning Sound Event Classifiers from Web Audio with Noisy Labels”, arXiv preprint arXiv:1901.01189, 2019

    You can also consider citing our ISMIR 2017 paper that describes the Freesound Annotator, which was used to gather the manual annotations included in FSDnoisy18k:

    Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, “Freesound Datasets: A Platform for the Creation of Open Audio Datasets”, In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017

    FSDnoisy18k description

    What follows is a summary of the most basic aspects of FSDnoisy18k. For a complete description of FSDnoisy18k, make sure to check:

    the FSDnoisy18k companion site: http://www.eduardofonseca.net/FSDnoisy18k/

    the description provided in Section 2 of our ICASSP 2019 paper

    FSDnoisy18k is an audio dataset collected with the aim of fostering the investigation of label noise in sound event classification. It contains 42.5 hours of audio across 20 sound classes, including a small amount of manually-labeled data and a larger quantity of real-world noisy data.

    The source of audio content is Freesound, a sound sharing site created and maintained by the Music Technology Group that hosts over 400,000 clips uploaded by its community of users, who additionally provide some basic metadata (e.g., tags and title). The 20 classes of FSDnoisy18k are drawn from the AudioSet Ontology and are selected based on data availability as well as on their suitability to allow the study of label noise. The 20 classes are: "Acoustic guitar", "Bass guitar", "Clapping", "Coin (dropping)", "Crash cymbal", "Dishes, pots, and pans", "Engine", "Fart", "Fire", "Fireworks", "Glass", "Hi-hat", "Piano", "Rain", "Slam", "Squeak", "Tearing", "Walk, footsteps", "Wind", and "Writing". FSDnoisy18k was created with the Freesound Annotator, which is a platform for the collaborative creation of open audio datasets.

    We defined a clean portion of the dataset consisting of correct and complete labels. The remaining portion is referred to as the noisy portion. Each clip in the dataset has a single ground truth label (singly-labeled data).

    The clean portion of the data consists of audio clips whose labels are rated as present in the clip and predominant (almost all with full inter-annotator agreement), meaning that the label is correct and, in most cases, there is no additional acoustic material other than the labeled class. A few clips may contain some additional sound events, but they occur in the background and do not belong to any of the 20 target classes. This is more common for some classes that rarely occur alone, e.g., “Fire”, “Glass”, “Wind” or “Walk, footsteps”.

    The noisy portion of the data consists of audio clips that received no human validation. In this case, they are categorized on the basis of the user-provided tags in Freesound. Hence, the noisy portion features a certain amount of label noise.

    Code

    We've released the code for our ICASSP 2019 paper at https://github.com/edufonseca/icassp19. The framework comprises all the basic stages: feature extraction, training, inference and evaluation. After loading the FSDnoisy18k dataset, log-mel energies are computed and a CNN baseline is trained and evaluated. The code also allows testing four noise-robust loss functions. Please check our paper for more details.
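
    The repository above implements the full pipeline; purely as a standalone illustration of the log-mel feature-extraction step it describes, one clip could be processed along the lines below. The file name and parameter values are generic placeholders, not necessarily those used in the paper.

    import librosa

    # Placeholder file name; FSDnoisy18k clips are 44.1 kHz mono PCM.
    y, sr = librosa.load("FSDnoisy18k.audio_train/example.wav", sr=44100, mono=True)

    # Mel spectrogram followed by log compression yields the log-mel energies.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=96)
    log_mel = librosa.power_to_db(mel)
    print(log_mel.shape)  # (n_mels, n_frames)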

    Label noise characteristics

    FSDnoisy18k features real label noise that is representative of audio data retrieved from the web, particularly from Freesound. The analysis of a per-class, random, 15% of the noisy portion of FSDnoisy18k revealed that roughly 40% of the analyzed labels are correct and complete, whereas 60% of the labels show some type of label noise. Please check the FSDnoisy18k companion site for a detailed characterization of the label noise in the dataset, including a taxonomy of label noise for singly-labeled data as well as a per-class description of the label noise.

    FSDnoisy18k basic characteristics

    The dataset's most relevant characteristics are as follows:

    FSDnoisy18k contains 18,532 audio clips (42.5h) unequally distributed in the 20 aforementioned classes drawn from the AudioSet Ontology.

    The audio clips are provided as uncompressed PCM 16 bit, 44.1 kHz, mono audio files.

    The audio clips are of variable length ranging from 300ms to 30s, and each clip has a single ground truth label (singly-labeled data).

    The dataset is split into a test set and a train set. The test set is drawn entirely from the clean portion, while the remainder of data forms the train set.

    The train set is composed of 17,585 clips (41.1h) unequally distributed among the 20 classes. It features a clean subset and a noisy subset. In terms of number of clips their proportion is 10%/90%, whereas in terms of duration the proportion is slightly more extreme (6%/94%). The per-class percentage of clean data within the train set is also imbalanced, ranging from 6.1% to 22.4%. The number of audio clips per class ranges from 51 to 170, and from 250 to 1000 in the clean and noisy subsets, respectively. Further, a noisy small subset is defined, which includes an amount of (noisy) data comparable (in terms of duration) to that of the clean subset.

    The test set is composed of 947 clips (1.4h) that belong to the clean portion of the data. Its class distribution is similar to that of the clean subset of the train set. The number of per-class audio clips in the test set ranges from 30 to 72. The test set enables a multi-class classification problem.

    FSDnoisy18k is an expandable dataset that features a per-class varying degree of types and amount of label noise. The dataset allows investigation of label noise as well as of other approaches, from semi-supervised learning (e.g., self-training) to learning with minimal supervision.

    License

    FSDnoisy18k has licenses at two different levels, as explained next. All sounds in Freesound are released under Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound. In particular, all Freesound clips included in FSDnoisy18k are released under either CC-BY or CC0. For attribution purposes and to facilitate attribution of these files to third parties, we include a list of audio clips and their corresponding licenses in the LICENSE-INDIVIDUAL-CLIPS file downloaded with the dataset.

    In addition, FSDnoisy18k as a whole is the result of a curation process and it has an additional license. FSDnoisy18k is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the dataset.

    Files

    FSDnoisy18k can be downloaded as a series of zip files with the following directory structure:

    root
    │
    └───FSDnoisy18k.audio_train/          Audio clips in the train set
    │
    └───FSDnoisy18k.audio_test/           Audio clips in the test set
    │
    └───FSDnoisy18k.meta/                 Files for evaluation setup
    │   │
    │   └───train.csv                     Data split and ground truth for the train set
    │   │
    │   └───test.csv                      Ground truth for the test set
    │
    └───FSDnoisy18k.doc/
        │
        └───README.md                     The dataset description file that you are reading
        │
        └───LICENSE-DATASET               License of the FSDnoisy18k dataset as an entity
        │
        └───LICENSE-INDIVIDUAL-CLIPS.csv  Licenses of the individual audio clips from Freesound

    Each row (i.e. audio clip) of the train.csv file contains the following information:

    fname: the file name

    label: the audio classification label (ground truth)

    aso_id: the id of the corresponding category as per the AudioSet Ontology

    manually_verified: Boolean (1 or 0) flag to indicate whether the clip belongs to the clean portion (1), or to the noisy portion (0) of the train set

    noisy_small: Boolean (1 or 0) flag to indicate whether the clip belongs to the noisy_small portion (1) of the train set

    Each row (i.e. audio clip) of the test.csv file contains the following information:

    fname: the file name

    label: the audio classification label (ground truth)

    aso_id: the id of the corresponding category as per the AudioSet Ontology
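
    Based on the fields described above, a minimal pandas sketch for separating the clean, noisy, and noisy_small portions of the train set might look as follows; the paths assume the zip files have been extracted into the working directory.

    from pathlib import Path

    import pandas as pd

    meta_dir = Path("FSDnoisy18k.meta")  # extracted metadata folder

    train = pd.read_csv(meta_dir / "train.csv")
    test = pd.read_csv(meta_dir / "test.csv")

    # manually_verified == 1 marks the clean portion of the train set;
    # noisy_small == 1 marks the small noisy subset comparable in duration to the clean one.
    clean = train[train["manually_verified"] == 1]
    noisy = train[train["manually_verified"] == 0]
    noisy_small = train[train["noisy_small"] == 1]
    print(len(clean), len(noisy), len(noisy_small), len(test))

    # Audio clips live in FSDnoisy18k.audio_train/ and FSDnoisy18k.audio_test/.
    train_paths = [Path("FSDnoisy18k.audio_train") / fname for fname in clean["fname"]]
    print(train_paths[:3])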

    Links

    Source code for our preprint: https://github.com/edufonseca/icassp19
    Freesound Annotator: https://annotator.freesound.org/
    Freesound: https://freesound.org
    Eduardo Fonseca's personal website: http://www.eduardofonseca.net/

    Acknowledgments

    This work is partially supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 688382 AudioCommons. Eduardo Fonseca is also sponsored by a Google Faculty Research Award 2017. We thank everyone who contributed to FSDnoisy18k with annotations.

  14. z

    Data from: Aircraft Marshaling Signals Dataset of FMCW Radar and Event-Based...

    • zenodo.org
    • explore.openaire.eu
    • +1more
    bin, text/x-python
    Updated Dec 11, 2023
    + more versions
    Cite
    Leon Müller; Leon Müller; Manolis Sifalakis; Manolis Sifalakis; Sherif Eissa; Sherif Eissa; Amirreza Yousefzadeh; Amirreza Yousefzadeh; Sander Stuijk; Sander Stuijk; Federico Corradi; Federico Corradi; Paul Detterer; Paul Detterer (2023). Aircraft Marshaling Signals Dataset of FMCW Radar and Event-Based Camera for Sensor Fusion [Dataset]. http://doi.org/10.5281/zenodo.10359770
    Explore at:
    bin, text/x-pythonAvailable download formats
    Dataset updated
    Dec 11, 2023
    Dataset provided by
    IEEE
    Authors
    Leon Müller; Leon Müller; Manolis Sifalakis; Manolis Sifalakis; Sherif Eissa; Sherif Eissa; Amirreza Yousefzadeh; Amirreza Yousefzadeh; Sander Stuijk; Sander Stuijk; Federico Corradi; Federico Corradi; Paul Detterer; Paul Detterer
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    May 1, 2023
    Description

    Dataset Introduction

    The advent of neural networks capable of learning salient features from variance in the radar data has expanded the breadth of radar applications, often as an alternative sensor or a complementary modality to camera vision. Gesture recognition for command control is arguably the most commonly explored application. Nevertheless, more suitable benchmarking datasets than those currently available are needed to assess and compare the merits of the different proposed solutions and to explore a broader range of scenarios than simple hand-gesturing a few centimeters away from a radar transmitter/receiver. Most current publicly available radar datasets used in gesture recognition provide limited diversity, do not provide access to raw ADC data, and are not significantly challenging. To address these shortcomings, we created and make available a new dataset that combines FMCW radar and dynamic vision camera recordings of 10 aircraft marshalling signals (whole body) at several distances and angles from the sensors, recorded from 13 people. The two modalities are hardware-synchronized using the radar's PRI signal. Moreover, in the supporting publication we propose a sparse encoding of the time-domain (ADC) signals that achieves a dramatic data rate reduction (>76%) while retaining the efficacy of the downstream FFT processing (<2% accuracy loss on recognition tasks), and can be used to create a sparse event-based representation of the radar data. In this way the dataset can be used as a two-modality neuromorphic dataset.

    Synchronization of the two modalities

    The PRI pulses from the radar have been hard-wired to the event stream of the DVS sensor, and timestamped using the DVS clock. Based on this signal the DVS event stream has been segmented such that groups of events (time-bins) of the DVS are mapped with individual radar pulses (chirps).

    Data storage

    DVS events (x,y coordinates and timestamps) are stored in structured arrays, and one such structured array object is associated with the data of a radar transmission (pulse/chirp). A radar transmission is a vector of 512 ADC levels that correspond to sampling points of a chirping signal (FMCW radar) that lasts about 1.3 ms. Every 192 radar transmissions are stacked in a matrix called a radar frame (each transmission is a row in that matrix). A data capture (recording) consisting of some thousands of continuous radar transmissions is therefore segmented into a number of radar frames. Finally, radar frames and the corresponding DVS structured arrays are stored in separate containers in a custom-made multi-container file format (extension .rad). We provide a .rad file parser for extracting the data out of these files. There is one file per capture of continuous gesture recording of about 10 s.

    Note that the number of 192 transmissions per radar frame is an ad-hoc segmentation that suits the purpose of obtaining sufficient signal resolution in the 2D FFT typical of radar signal processing, given the range resolution of the specific radar. It also served the purpose of fast streaming storage of the data during capture. For extracting individual data points from the dataset, however, one can pool together (concatenate) all the radar frames from a single capture file and re-segment them as desired. The data loader that we provide offers this, with a default of re-segmenting every 769 transmissions (about 1 s of gesturing).
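
    For orientation, the sketch below shows the kind of 2D FFT (range-Doppler) processing this segmentation is meant to support; the random array stands in for a real 192×512 radar frame obtained with the provided parser or loader, and the windowing choices are illustrative rather than taken from the supporting publication.

    import numpy as np

    # Stand-in for one radar frame loaded with the dataset tools:
    # 192 transmissions (chirps) x 512 ADC samples per chirp.
    frame = np.random.randn(192, 512)

    # Remove the per-range-bin mean, then apply a 2D window to reduce spectral leakage.
    frame = frame - frame.mean(axis=0, keepdims=True)
    window = np.hanning(192)[:, None] * np.hanning(512)[None, :]

    # Range FFT along fast time (samples), Doppler FFT along slow time (chirps).
    range_fft = np.fft.fft(frame * window, axis=1)
    range_doppler = np.fft.fftshift(np.fft.fft(range_fft, axis=0), axes=0)

    magnitude_db = 20 * np.log10(np.abs(range_doppler) + 1e-9)
    print(magnitude_db.shape)  # (Doppler bins, range bins)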

    Data captures directory organization (radar8Ghz-DVS-marshaling_signals_20220901_publication_anonymized.7z)

    The dataset captures (recordings) are organized in a common directory structure which encompasses additional metadata information about the captures.

    dataset_dir/

    Identifiers

    • stage [train, test].
    • room: [conference_room, foyer, open_space].
    • subject: [0-9]. Note that 0 stands for no person, and 1 for an unlabeled, random person (only present in test).
    • gesture: ['none', 'emergency_stop', 'move_ahead', 'move_back_v1', 'move_back_v2', 'slow_down', 'start_engines', 'stop_engines', 'straight_ahead', 'turn_left', 'turn_right'].
    • distance: ['xxx', '100', '150', '200', '250', '300', '350', '400', '450'] (in cm). Note that xxx is used for none gestures when there is no person present in front of the radar (i.e. background samples), or when a person is walking in front of the radar with varying distances but performing no gesture.

    The test data captures contain both subjects that appear in the train data as well as previously unseen subjects. Similarly the test data contain captures from the spaces that train data were recorded at, as well as from a new unseen open space.

    Files List

    radar8Ghz-DVS-marshaling_signals_20220901_publication_anonymized.7z

    This is the actual archive bundle with the data captures (recordings).

    rad_file_parser_2.py

    Parser for individual .rad files, which contain capture data.

    loader.py

    A convenience PyTorch Dataset loader (partly Tonic compatible). In practice, this is all you need to get started quickly if you do not want to delve too deeply into the code. When you initialise a DvsRadarAircraftMarshallingSignals class object, it automatically downloads the dataset archive and the .rad file parser, unpacks the archive, and imports the .rad parser to load the data. One can then request from it a training set, a validation set and a test set as torch.Datasets to work with.

    aircraft_marshalling_signals_howto.ipynb

    Jupyter notebook for exemplary basic use of loader.py
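
    As a rough usage illustration only (the notebook above is the authoritative reference), initialising the loader and requesting a split might look like the sketch below; the accessor method name is a placeholder, not the actual API of loader.py.

    import torch
    from loader import DvsRadarAircraftMarshallingSignals

    # Instantiation downloads the archive and the .rad parser, then unpacks the data.
    dataset = DvsRadarAircraftMarshallingSignals()

    # Hypothetical accessor name; check aircraft_marshalling_signals_howto.ipynb for the real one.
    train_set = dataset.get_train()

    # Each split behaves as a torch.Dataset, so it can be indexed or wrapped in a DataLoader.
    print(len(train_set))
    sample = train_set[0]
    loader = torch.utils.data.DataLoader(train_set, batch_size=8, shuffle=True)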

    Contact

    For further information or questions, please contact M. Sifalakis or F. Corradi first.

  15. c

    Data from: Dataset corresponding to 'The (im)possibility of forgiveness? An...

    • datacatalogue.cessda.eu
    • ssh.datastations.nl
    • +1more
    Updated Apr 11, 2023
    + more versions
    Cite
    D.A. Forster; C.A.M. Hermans (2023). Dataset corresponding to 'The (im)possibility of forgiveness? An empirical intercultural Bible reading of Matthew 18.15-35' [Dataset]. http://doi.org/10.17026/dans-x9b-379m
    Explore at:
    Dataset updated
    Apr 11, 2023
    Dataset provided by
    Radboud University
    Authors
    D.A. Forster; C.A.M. Hermans
    Description

    The study this dataset is part of focuses on intercultural Biblical hermeneutics and was necessary to gain insights into how the participants understood concepts and processes of forgiveness when reading Matthew 18.15-35. Moreover, since social identity complexity theory plays an important role in intercultural hermeneutics, it was necessary to gather the data in a social, or group, reading context. Each of the two participating groups was largely homogeneous in terms of race and culture. Demographic information was an important component in ascertaining to what extent social identity categories (such as race and culture) influence the hermeneutic perspectives of the participants. The participants self-reported their race, age and gender in an individual information session that took place before the first group meeting.

    In order to gain data from the participants related to their intercultural Biblical hermeneutic views of forgiveness, the researcher designed a series of group readings of the Biblical text in which the participants had the opportunity to read the text and then share their understandings of forgiveness with the researcher and the rest of the group. These group readings were structured in the form of focus group encounters in which the conversations were recorded by means of an audio recording device and then transcribed for later analysis.

    This dataset contains the raw data (transcripts) gathered in a series of 6 focus group meetings (datasets D1 - D6). The datasets are each numbered according to the Atlas.ti convention. The data is recorded in Rich Text File (RTF) format. In the heading of each data file, the date and location of the recording of the data is noted (e.g., Church Street Methodist Church, 21/06/2015).
    The data has been anonymised in order to protect the privacy of the participants, except for the original Atlas.ti file (which is only available on request). Each participant has been allocated a code representing their inputs (P1-P12). The P* remains the same for that participant in all of the datasets. The interviewer has been allocated the symbol 'I' or [I].

    A methodology text and a read-me text are part of the dataset and explain its context. In these documents, references are made to parts of the corresponding publication (PhD dissertation):
    Forster, DA. 2017. 'The (im)possibility of forgiveness? An empirical intercultural Bible reading of Matthew 18.15-35'. PHD thesis. Radboud University.

  16. d

    ASR database ARTUR 1.0 (audio) - Dataset - B2FIND

    • b2find.dkrz.de
    Updated Mar 8, 2023
    + more versions
    Cite
    (2023). ASR database ARTUR 1.0 (audio) - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/700ba856-a0b9-56c2-8fa7-056198ffe7d2
    Explore at:
    Dataset updated
    Mar 8, 2023
    Description

    Artur 1.0 is a speech database designed for the needs of automatic speech recognition for the Slovenian language. The database includes 1,067 hours of speech: 884 hours are transcribed, while the remaining 183 hours are recordings only. This repository entry includes audio files only; the transcriptions are available at http://hdl.handle.net/11356/1772.

    The data are structured as follows:

    (1) Artur-B, read speech, 573 hours in total. It includes:
    (1a) Artur-B-Brani, 485 hours: Readings of sentences which were pre-selected from a 10% increment of the Gigafida 2.0 corpus. The sentences were chosen in such a way that they reflect the natural, or actual, distribution of triphones in the words. They were distributed between 1,000 speakers, so that approx. 30 min of read speech was recorded from each speaker. The speakers were balanced according to gender, age and region, and a small proportion of speakers were non-native speakers of Slovene. Each sentence is its own audio file and has a corresponding transcription file.
    (1b) Artur-B-Crkovani, 10 hours: Spellings. Speakers were asked to spell abbreviations and personal names and surnames, all chosen so that all Slovene letters were covered, plus the most common foreign letters.
    (1c) Artur-B-Studio, 51 hours: Designed for the development of speech synthesis. The sentences were read in a studio by a single speaker. Each sentence is its own audio file and has a corresponding transcription file.
    (1d) Artur-B-Izloceno, 27 hours: Recordings that include different types of errors, typically incorrect reading of sentences or a noisy environment.

    (2) Artur-J, public speech, 62 hours in total. It includes:
    (2a) Artur-J-Splosni, 62 hours: Media recordings, online recordings of conferences, workshops, education videos, etc.

    (3) Artur-N, private speech, 74 hours in total. It includes:
    (3a) Artur-N-Obrazi, 6 hours: Speakers were asked to describe faces in pictures. Designed for face-description domain-specific speech recognition.
    (3b) Artur-N-PDom, 7 hours: Speakers were asked to read pre-written sentences, as well as to express instructions for a potential smart-home system freely. Designed for smart-home domain-specific speech recognition.
    (3c) Artur-N-Prosti, 61 hours: Monologues and dialogues between two persons, recorded for the purposes of the Artur database creation. Speakers were asked to converse or explain freely on casual topics.

    (4) Artur-P, parliamentary speech, 201 hours in total. It includes:
    (4a) Artur-P-SejeDZ, 201 hours: Speech from the Slovene National Assembly.

    Further information on the database is available in the Artur-DOC file, which is part of this repository entry.

  17. h

    gigaspeech

    • huggingface.co
    • paperswithcode.com
    • +1more
    + more versions
    Cite
    SpeechColab, gigaspeech [Dataset]. https://huggingface.co/datasets/speechcolab/gigaspeech
    Explore at:
    Dataset authored and provided by
    SpeechColab
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    GigaSpeech is an evolving, multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training, and 40,000 hours of total audio suitable for semi-supervised and unsupervised training. Around 40,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous speaking styles, and a variety of topics, such as arts, science, sports, etc. A new forced alignment and segmentation pipeline is proposed to create sentence segments suitable for speech recognition training, and to filter out segments with low-quality transcription. For system training, GigaSpeech provides five subsets of different sizes, 10h, 250h, 1000h, 2500h, and 10000h. For our 10,000-hour XL training subset, we cap the word error rate at 4% during the filtering/validation stage, and for all our other smaller training subsets, we cap it at 0%. The DEV and TEST evaluation sets, on the other hand, are re-processed by professional human transcribers to ensure high transcription quality.
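
    Since the corpus is distributed through the Hugging Face Hub, it is typically loaded with the datasets library, roughly as sketched below; the subset name "xs", the column names, and the need for an authenticated token that has accepted the dataset's terms are assumptions to verify on the dataset card.

    from datasets import load_dataset

    # Subset name, column names, and authentication requirements are assumptions;
    # see https://huggingface.co/datasets/speechcolab/gigaspeech for the exact options.
    gigaspeech = load_dataset("speechcolab/gigaspeech", "xs", token=True)

    sample = gigaspeech["train"][0]
    print(sample["text"])
    print(sample["audio"]["sampling_rate"])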

  18. h

    ljspeech

    • huggingface.co
    Updated Sep 30, 2023
    + more versions
    Cite
    Artem Ploujnikov (2023). ljspeech [Dataset]. https://huggingface.co/datasets/flexthink/ljspeech
    Explore at:
    Dataset updated
    Sep 30, 2023
    Authors
    Artem Ploujnikov
    Description

    This is a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours.

  19. v

    vHMML Reading Room Datasets

    • vhmml.org
    • ftp.vhmml.org
    csv, json
    Updated Aug 23, 2018
    Cite
    The Hill Museum & Manuscript Library (HMML) (2018). vHMML Reading Room Datasets [Dataset]. https://www.vhmml.org/dataPortal/dataset
    Explore at:
    csv, jsonAvailable download formats
    Dataset updated
    Aug 23, 2018
    Dataset authored and provided by
    The Hill Museum & Manuscript Library (HMML)
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The downloadable Reading Room datasets are updated weekly, include all active records, use the vHMML Schema (https://www.vhmml.org/dataPortal/schema), and are line-delimited.
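
    For the JSON export, line-delimited records can be read directly with pandas, for example as below; the file name is a placeholder for whichever Reading Room dataset file is downloaded.

    import pandas as pd

    # Each line of the export is one record; lines=True handles line-delimited JSON.
    records = pd.read_json("reading_room_export.json", lines=True)
    print(records.shape)
    print(records.columns.tolist())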

  20. h

    lj_speech

    • huggingface.co
    • tensorflow.org
    Updated Feb 15, 2021
    Cite
    Keith Ito (2021). lj_speech [Dataset]. https://huggingface.co/datasets/keithito/lj_speech
    Explore at:
    Dataset updated
    Feb 15, 2021
    Authors
    Keith Ito
    License

    https://choosealicense.com/licenses/unlicense/https://choosealicense.com/licenses/unlicense/

    Description

    This is a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books in English. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours.

    Note that in order to limit the required storage for preparing this dataset, the audio is stored in the .wav format and is not converted to a float32 array. To convert the audio file to a float32 array, please make use of the .map() function as follows:

    import soundfile as sf
    
    # Read the stored audio file into a floating-point array and attach it to the batch.
    def map_to_array(batch):
      speech_array, _ = sf.read(batch["file"])
      batch["speech"] = speech_array
      return batch
    
    dataset = dataset.map(map_to_array, remove_columns=["file"])
    