Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
To read any dataset you can use the following code
>>> import numpy as np
>>> embed_image = np.load('embed_image.npy')
>>> embed_image.shape
(33962, 768)
>>> embed_text = np.load('embed_text.npy')
>>> embed_text.shape
(33962, 768)
>>> import pandas as pd
>>> items = pd.read_csv('items.txt')
>>> m = len(items)
>>> print(f'{m} items in dataset')
33962
>>> users = pd.read_csv('users.txt')
>>> n = len(users)
>>> print(f'{n} users in dataset')
14790
>>> train = pd.read_csv('train.txt')
>>> train
user item
0 13444 23557
1 13444 33739
... ... ...
317109 13506 29993
317110 13506 13931
>>> from scipy.sparse import csr_matrix
>>> train_matrix = csr_matrix((np.ones(len(train)), (train.user, train.item)), shape=(n,m))
This resource contains six datasets. Each dataset is provided with seven combinations of image and text encoders, so you should see 42 folders.
Each folder is named after the dataset and the encoders used for the visual and textual parts. For example: bookcrossing-vit_bert.
The datasets are:
- Clothing, Shoes and Jewelry (Amazon)
- Home and Kitchen (Amazon)
- Musical Instruments (Amazon)
- Movies and TV (Amazon)
- Book-Crossing
- Movielens 25M
And the encoders are:
- CLIP (image and text) (*-clip_clip). This is the main one used in the experiments.
- ViT and BERT (*-vit_bert)
- CLIP (visual only) (*-clip_none)
- ViT only (*-vit_none)
- BERT only (*-none_bert)
- CLIP (text only) (*-none_clip)
- No textual or visual information (*-none_none)
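As a quick illustration of the folder naming convention described above, the sketch below builds a folder name from a dataset and an encoder combination and loads its embeddings; the base directory is a placeholder.

import os
import numpy as np

# Folder name = <dataset>-<image_encoder>_<text_encoder>, e.g. "bookcrossing-vit_bert".
base_dir = "."                                   # placeholder: wherever the 42 folders live
dataset, encoders = "bookcrossing", "clip_clip"  # the main configuration used in the experiments

folder = os.path.join(base_dir, f"{dataset}-{encoders}")
embed_image = np.load(os.path.join(folder, "embed_image.npy"))  # shape (M, E)
embed_text = np.load(os.path.join(folder, "embed_text.npy"))    # shape (M, D)
print(embed_image.shape, embed_text.shape)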
For each dataset, we provide the following files, assuming M items, N users, textual embeddings of dimension D (e.g. 1024), and visual embeddings of dimension E (e.g. 768):
- embed_image.npy: a NumPy array of M×E elements.
- embed_text.npy: a NumPy array of M×D elements.
- items.csv: a CSV with the item ID in the original dataset (e.g. the Amazon ASIN, the movie ID) and the item number, an integer from 0 to M-1.
- users.csv: a CSV with the user ID in the original dataset (e.g. the Amazon reviewer ID) and the user number, an integer from 0 to N-1.
- train.txt, validation.txt and test.txt: CSV files with the train, validation, and test portions of the reviews. Each row is a positive user-item pair, i.e. an item the user liked or reviewed positively.
We consider a review "positive" if the rating is 4 or more (8 or more for Book-Crossing).
If an item has no image or no text, the corresponding embedding vector is zeroed out.
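Since missing modalities are stored as all-zero rows, they can be counted directly from the embedding matrices (a minimal sketch, assuming the files of one dataset folder are in the working directory as in the example above).

import numpy as np

embed_image = np.load('embed_image.npy')
embed_text = np.load('embed_text.npy')

# An all-zero row means the item has no image (or no text).
no_image = ~embed_image.any(axis=1)
no_text = ~embed_text.any(axis=1)
print(f'{no_image.sum()} items without image, {no_text.sum()} items without text')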
| Dataset | Users | Items | Ratings | Density |
|---|---|---|---|---|
| Clothing & Shoes & Jewelry | 23318 | 38493 | 178944 | 0.020% |
| Home & Kitchen | 5968 | 57645 | 135839 | 0.040% |
| Movies & TV | 21974 | 23958 | 216110 | 0.041% |
| Musical Instruments | 14429 | 29040 | 93923 | 0.022% |
| Book-crossing | 14790 | 33962 | 519613 | 0.103% |
| Movielens 25M | 162541 | 59047 | 25000095 | 0.260% |
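The Density column is the number of ratings divided by the number of possible user-item pairs; for example, using the Book-crossing row of the table above:

users, items, ratings = 14790, 33962, 519613
density = ratings / (users * items)
print(f'{density:.3%}')  # ~0.103%, matching the table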
For the Amazon datasets, only a small fraction of the reviews was kept, restricted to a specific date range.
For the Bookcrossing dataset, only items with images were considered.
There are various other minor tweaks in how images and texts were obtained. The repository https://github.com/igui/MultimodalRecomAnalysis contains the notebooks and scripts to reproduce the dataset extraction from scratch.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Multiturn Multimodal
We want to generate synthetic data that captures the position of and relationships between multiple images and multiple audio clips; an example is shown below. All notebooks are at https://github.com/mesolitica/malaysian-dataset/tree/master/chatbot/multiturn-multimodal
multi-images
synthetic-multi-images-relationship.jsonl, 100000 rows, 109MB. Images at https://huggingface.co/datasets/mesolitica/translated-LLaVA-Pretrain/tree/main
Example data
{'filename':… See the full description on the dataset page: https://huggingface.co/datasets/mesolitica/synthetic-multiturn-multimodal.
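A minimal sketch for inspecting the JSONL file described above; only the 'filename' key is visible in the truncated example, so the snippet simply prints whatever keys each record contains.

import json

# Each line of the .jsonl file is a single JSON record.
with open('synthetic-multi-images-relationship.jsonl') as f:
    for line in f:
        row = json.loads(line)
        print(row.get('filename'), sorted(row.keys()))  # 'filename' is shown above; other keys vary
        break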
Custom license: https://heidata.uni-heidelberg.de/api/datasets/:persistentId/versions/2.1/customlicense?persistentId=doi:10.11588/DATA/68HOOP
This dataset contains source code and data used in the PhD thesis "Measuring the Contributions of Vision and Text Modalities in Multimodal Transformers". The dataset is split into five repositories:
- Code and resources related to chapter 2 of the thesis (Section 2.2, method described in "Using Scene Graph Representations and Knowledge Bases").
- Code and resources related to chapter 3 of the thesis (VALSE dataset).
- Code and resources related to chapter 4 of the thesis: MM-SHAP measure and experiments code.
- Code and resources related to chapter 5 of the thesis: CCSHAP measure and experiments code related to large language models (LLMs).
- Code and resources related to the experiments with vision and language model decoders from chapters 3, 4, and 5.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Autonomous driving is a popular research area within the computer vision research community. Since autonomous vehicles are highly safety-critical, ensuring robustness is essential for real-world deployment. While several public multimodal datasets are accessible, they mainly comprise two sensor modalities (camera, LiDAR) which are not well suited for adverse weather. In addition, they lack far-range annotations, making it harder to train neural networks that are the base of a highway assistant function of an autonomous vehicle. Therefore, we introduce a multimodal dataset for robust autonomous driving with long-range perception. The dataset consists of 176 scenes with synchronized and calibrated LiDAR, camera, and radar sensors covering a 360-degree field of view. The collected data was captured in highway, urban, and suburban areas during daytime, night, and rain and is annotated with 3D bounding boxes with consistent identifiers across frames. Furthermore, we trained unimodal and multimodal baseline models for 3D object detection.
The paper describing the dataset can be read here: https://arxiv.org/pdf/2211.09445.pdf
If you use aiMotive Multimodal Dataset in your research, please cite our work by using the following BibTeX entry:
@article{matuszka2022aimotivedataset,
  title = {aiMotive Dataset: A Multimodal Dataset for Robust Autonomous Driving with Long-Range Perception},
  author = {Matuszka, Tamás and Barton, Iván and Butykai, Ádám and Hajas, Péter and Kiss, Dávid and Kovács, Domonkos and Kunsági-Máté, Sándor and Lengyel, Péter and Németh, Gábor and Pető, Levente and Ribli, Dezső and Szeghy, Dávid and Vajna, Szabolcs and Varga, Bálint},
  doi = {10.48550/ARXIV.2211.09445},
  url = {https://arxiv.org/abs/2211.09445},
  publisher = {arXiv},
  year = {2022},
}
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Multimodal Pragmatic Jailbreak on Text-to-image Models
Project page | Paper | Code
The Multimodal Pragmatic Unsafe Prompts (MPUP) is a dataset designed to assess the multimodal pragmatic safety of Text-to-Image (T2I) models. It comprises two key sections: image_prompt and text_prompt.
Dataset Usage
Downloading the Data
To download the dataset, install Huggingface Datasets and then use the following command: from datasets import load_dataset dataset =… See the full description on the dataset page: https://huggingface.co/datasets/tongliuphysics/multimodalpragmatic.
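The truncated command above follows the standard Hugging Face datasets API; a minimal sketch is shown below (the dataset ID is taken from the page URL, and no split or configuration name is given in this description, so defaults are assumed).

from datasets import load_dataset

# If the dataset defines multiple configurations, pass the config name as the second argument.
dataset = load_dataset("tongliuphysics/multimodalpragmatic")
print(dataset)  # shows the available splits and columns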
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Test Multi Modal is a dataset for vision language (multimodal) tasks - it contains Docs annotations for 1,998 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
This resource can be used to generate a multimodal combination of the MIMIC-IV v1.0.0 and MIMIC Chest X-ray (MIMIC-CXR-JPG) v2.0.0 databases, filtered to include only patients that have at least one chest X-ray, with the goal of validating multi-modal predictive analytics in healthcare operations. The multimodal dataset generated through this code contains 34,540 individual patient files in the form of "pickle" Python object structures, covering a total of 7,279 hospitalization stays involving 6,485 unique patients. Additionally, code to extract feature embeddings as well as the list of pre-processed features are included in this repository.
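Each patient file is a pickled Python object; a minimal sketch for loading one is below (the filename is a placeholder, and the internal structure of the object is not described in this summary).

import pickle

# Placeholder path: substitute one of the 34,540 generated patient files.
with open("patient_file.pkl", "rb") as f:
    patient = pickle.load(f)

print(type(patient))  # inspect the stored object structure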
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Multimodal dataset for fall detection. Includes acceleration data collected from a tag and two smartwatches, and location reported by the tag. More details about the data collection procedure can be found in notes.md.
Contents
The repository contains:
- data/location_data.csv and data/full_acceleration – preprocessed acceleration and location data from 10 participants and mannequin-simulated falls, with the target variable identified
- data/subsampled_acceleration_data.csv – subsampled acceleration dataset used for training the AI model
- notes.md – description of activities performed and notes from data collection
- videos – reference videos for performed activities
Authors
Acknowledgements
This work is part of the ASSIST-IoT project that has received funding from the EU’s Horizon 2020 research and innovation programme under grant agreement No 957258.
The Central Institute for Labour Protection – National Research Institute provided facilities and equipment for data collection.
License
The dataset is licensed under the Creative Commons Attribution 4.0 International License.
Abstract:
Video understanding extends the level of temporal action recognition. Taking the example of a video containing rich human action, we can reason about and predict future actions based on the first several actions in the stream. However, when the task comes to a machine, it can still be difficult to make forecasts and plans based on the video features of these daily human actions. We formalize the task as Multi-modal Human Activity Understanding. Given a small fraction of the original video clip and a set of action sequences, a machine should be able to find the most reasonable action sequence in the set which can well represent the future actions of the observed video frames. We design the task in two settings: one relies solely on the understanding of the initial video frames; the other provides both the initial state (video frames) and the goal state (high-level intent). We call them Human Action Forecasting and Human Action Planning, respectively. We then propose the fully annotated benchmark called MUHACU (MUlti-modal Human ACtivity Understanding), consisting of 2.9k videos and 157 action classes from the original Charades [1] videos. We refine the original labels of the Charades videos and add more features to aid our task completion. In addition, we provide two strong baseline systems from two directions, information retrieval and end-to-end training, sharing some insights on potential solutions to this task.
Introduction:
We have tailored and refined the original annotation in the Charades dataset by selecting 2.9k videos and crowdsourcing the corresponding intent in each video. To meet the design of the initial state, we generally choose the first 20% of the length of each video as the initial state. Along with the dataset, the multi-modal knowledge base is crafted semi-automatically. Containing temporal action relationships, visual and textual features of atomic actions, action sequences, and high-level intents, the knowledge base serves the idea of generalization well. We demonstrate that the Multi-modal Human Activity Understanding (MUHACU) task is challenging for machines by evaluating a strong hybrid end-to-end framework in the format of a multi-modal cloze task.
In summary, MUHACU facilitates multi-modal learning systems that observe through visual features, and forecast and plan in language in real-world environments. Our contributions, in brief, are: (1) We propose the first multi-modal knowledge base for temporal activity understanding. (2) We propose baselines demonstrating the effectiveness of the knowledge base. (3) We propose a novel multi-modal benchmark for evaluating models backed by the knowledge base and dataset.
MUHACU contains the following fields:
KB: 2402 videos

| KB | value |
|---|---|
| # of action-level entities | 157 |
| # of activity video entities | 2402 |
| # of intents for each video | 2 |
| # of action video entities | 12118 |
| # of action sequences (non-repeat seq) | 2402 (1969) |
| # of action state templates | 27 |
| avg. action sequence length | 5.04 |
Features in KB:

| feature | num | size |
|---|---|---|
| action visual prototype feat | 157 | [1024,] |
| action textual prototype feat | 157 | [768,] |
| intent feat | 2402*2 | [768,] |
| video-level visual feat | 2402 + 12118 | [1024,] |
| snippet-level visual feat | 2402 + 12118 | [frames//8, 1024] |
Evaluation task: 510 videos for human action planning and human action forecasting

| num | human action planning | human action forecasting |
|---|---|---|
| # of videos (action sequences) | 510 | 510 |
| avg. # of observed acts | 2.79 | 2.79 |
| avg. # of predicted acts | 2.40 | 2.40 |
| avg. # of total acts | 5.19 | 5.19 |
| # of choices | 6 | 6 |
| # of answers | 1 (435) / 2 (75) | 1 |
| # of intent | 0 | 1 |
Training dataset split: we also provide a dataset split for training the baseline model to learn the future ground-truth sequence. The initial 2402 KB videos are distributed by the standard 8:2 split into training (1921 videos) and validation (481 videos).

| train | validation | test |
|---|---|---|
| 1921 | 481 | 510 |
More details about the dataset are in README.txt
Availability:
Our dataset and knowledge base are available online at https://zenodo.org/deposit/4968721 in order to support sustainability. The resource is maintained under a Creative Commons Attribution 4.0 International license, implying re-usability. We follow the widely used FAIR Data principles, which are designed to make resources findable, accessible, interoperable, and re-usable. The GitHub repository containing the complete source code and checkpoints for the baseline systems is available at https://github.com/MUHACU/MUHACU.
[1] Sigurdsson, Gunnar A., et al. "Hollywood in homes: Crowdsourcing data collection for activity understanding." European Conference on Computer Vision. Springer, Cham, 2016.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Multi Modal is a dataset for object detection tasks - it contains Reference Inventory Assembly annotations for 200 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
The MELD Preprocessed Dataset is a multi-modal dataset designed for research on emotion recognition from audio, video, and textual data. The dataset builds upon the original MELD dataset and applies extensive preprocessing steps to extract features from different modalities. Each sample is saved as a .pt file containing a dictionary of preprocessed features, making it easy for developers to load and integrate into PyTorch-based workflows.
The preprocessing script performs several key steps:
Text Cleaning:
- fix_encoding_with_bytes(text): decodes text from bytes using UTF-8, Latin-1, or cp1252, ensuring correct encoding.
- replace_double_encoding(text): fixes issues related to double-encoded characters (e.g., replacing "Â’" with the proper apostrophe).
Audio Processing:
- Uses torchaudio.transforms.MelSpectrogram with 64 mel bins (VGGish format).
Video Processing:
- Extracts a face image from the video (see the face key below); if no face is detected, a default black image is used.
Saving Processed Samples:
- Each sample is saved as a .pt file in a directory structure split by data type (train, dev, and test).
- Files are named after the source video clip (e.g., dia0_utt1.mp4 becomes dia0_utt1.pt).
Each preprocessed sample is stored in a .pt file and contains a dictionary with the following keys:
- utterance (str): The cleaned textual utterance.
- emotion (str/int): The corresponding emotion label.
- video_path (str): Original path to the video file from which the sample was extracted.
- audio (Tensor): Raw audio waveform tensor of shape [channels, time].
- audio_sample_rate (int): The sampling rate of the audio waveform.
- audio_mel (Tensor): The computed log-scaled Mel-spectrogram with shape [channels, n_mels, time].
- face (NumPy array): The extracted face image (RGB format) of shape (224, 224, 3). If no face was detected, a default black image is provided.
The preprocessed files are organized into splits:
preprocessed_data/
├── train/
│ ├── dia0_utt0.pt
│ ├── dia1_utt1.pt
│ └── ...
├── dev/
│ ├── dia0_utt0.pt
│ ├── dia1_utt1.pt
│ └── ...
└── test/
    ├── dia0_utt0.pt
    ├── dia1_utt1.pt
    └── ...
A custom PyTorch dataset and DataLoader are provided to facilitate easy integration:
from torch.utils.data import Dataset
import os
import torch

class PreprocessedMELDDataset(Dataset):
    def __init__(self, data_dir):
        """
        Args:
            data_dir (str): Directory where preprocessed .pt files are stored.
        """
        self.data_dir = data_dir
        self.files = [os.path.join(data_dir, f) for f in os.listdir(data_dir) if f.endswith('.pt')]

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        sample_path = self.files[idx]
        # Each .pt file holds the preprocessed sample dictionary described above.
        sample = torch.load(sample_path)
        return sample

def preprocessed_collate_fn(batch):
    """
    Collates a list of sample dictionaries into a single dictionary with keys mapping to lists.
    Modify this function to pad or stack tensor data if needed.
    """
    collated = {}
    collated['utterance'] = [sample['utterance'] for sample in batch]
    collated['emotion'] = [sample['emotion'] for sample in batch]
    collated['video_path'] = [sample['video_path'] for sample in batch]
    collated['audio'] = [sample['audio'] for sample in batch]
    collated['audio_sample_rate'] = batch[0]['audio_sample_rate']
    collated['audio_mel'] = [sample['audio_mel'] for sample in batch]
    collated['face'] = [sample['face'] for sample in batch]
    return collated
from torch.utils.data import DataLoader
# Define paths for each split
train_data_dir = "preprocessed_data/train"
dev_data_dir = "preproces...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Modality A: Near-Infrared (NIR)
Modality B: three colour channels (in B-G-R order)
Modality A: Fluorescence Images
Modality B: Quantitative Phase Images (QPI)
Modality A: Second Harmonic Generation (SHG)
Modality B: Bright-Field (BF)
The evaluation set created from the above three publicly available 2D datasets consists of images that have undergone 4 levels of (rigid) transformations of increasing size of displacement. The level of transformation is determined by the size of the rotation angle θ and the displacements tx & ty, detailed in this table. Each image sample is transformed exactly once at each transformation level so that all levels have the same number of samples.
Modality A: T1-weighted MRI
Modality B: T2-weighted MRI
(Run make_rire_patches.py to generate the sub-volumes.)
Reference sub-volumes of size 210x210x70 voxels are cropped directly from centres of the (non-displaced) resampled volumes. Similarly as for the aforementioned 2D datasets, random (uniformly-distributed) transformations are composed of rotations θx, θy ∈ [-4, 4] degrees around the x- and y-axes, rotation θz ∈ [-20, 20] degrees around the z-axis, translations tx, ty ∈ [-19.6, 19.6] voxels in x and y directions and translation tz ∈ [-6.5, 6.5] voxels in z direction. 40 rigid transformations of increasing sizes of displacement are applied to each volume. Transformed sub-volumes, of size 210x210x70 voxels, are cropped from centres of the transformed and resampled volumes.
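For illustration only (the actual sub-volumes are produced by make_rire_patches.py), random rigid transformations within the ranges quoted above could be sampled as follows; the 4x4 homogeneous matrix is an assumed representation.

import numpy as np
from scipy.spatial.transform import Rotation

rng = np.random.default_rng()

# Rotation angles (degrees) and translations (voxels), ranges as stated above.
theta_x, theta_y = rng.uniform(-4, 4, size=2)
theta_z = rng.uniform(-20, 20)
tx, ty = rng.uniform(-19.6, 19.6, size=2)
tz = rng.uniform(-6.5, 6.5)

T = np.eye(4)
T[:3, :3] = Rotation.from_euler('xyz', [theta_x, theta_y, theta_z], degrees=True).as_matrix()
T[:3, 3] = [tx, ty, tz]
print(T)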
In total, it contains 864 image pairs created from the aerial dataset, 5040 image pairs created from the cytological dataset, 536 image pairs created from the histological dataset, and metadata with scripts to create the 480 volume pairs from the radiological dataset. Each image pair consists of a reference patch \(I^{\text{Ref}}\) and its corresponding initial transformed patch \(I^{\text{Init}}\) in both modalities, along with the ground-truth transformation parameters to recover it.
Scripts to calculate the registration performance and to plot the overall results can be found in https://github.com/MIDA-group/MultiRegEval, and instructions to generate more evaluation data with different settings can be found in https://github.com/MIDA-group/MultiRegEval/tree/master/Datasets#instructions-for-customising-evaluation-data.
Metadata
In the *.zip files, each row in {Zurich,Balvan}_patches/fold[1-3]/patch_tlevel[1-4]/info_test.csv or Eliceiri_patches/patch_tlevel[1-4]/info_test.csv provides the information of an image pair as follows:
Filename: identifier(ID) of the image pair
X1_Ref: x-coordinate of the upper-left corner of reference patch IRef
Y1_Ref: y-coordinate of the upper-left corner of reference patch IRef
X2_Ref: x-coordinate of the lower-left corner of reference patch IRef
Y2_Ref: y-coordinate of the lower-left corner of reference patch IRef
X3_Ref: x-coordinate of the lower-right corner of reference patch IRef
Y3_Ref: y-coordinate of the lower-right corner of reference patch IRef
X4_Ref: x-coordinate of the upper-right corner of reference patch IRef
Y4_Ref: y-coordinate of the upper-right corner of reference patch IRef
X1_Trans: x-coordinate of the upper-left corner of transformed patch IInit
Y1_Trans: y-coordinate of the upper-left corner of transformed patch IInit
X2_Trans: x-coordinate of the lower-left corner of transformed patch IInit
Y2_Trans: y-coordinate of the lower-left corner of transformed patch IInit
X3_Trans: x-coordinate of the lower-right corner of transformed patch IInit
Y3_Trans: y-coordinate of the lower-right corner of transformed patch IInit
X4_Trans: x-coordinate of the upper-right corner of transformed patch IInit
Y4_Trans: y-coordinate of the upper-right corner of transformed patch IInit
Displacement: mean Euclidean distance between reference corner points and transformed corner points
RelativeDisplacement: the ratio of displacement to the width/height of image patch
Tx: randomly generated translation in the x-direction to synthesise the transformed patch IInit
Ty: randomly generated translation in the y-direction to synthesise the transformed patch IInit
AngleDegree: randomly generated rotation in degrees to synthesise the transformed patch IInit
AngleRad: randomly generated rotation in radian to synthesise the transformed patch IInit
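Given the corner coordinates above, the Displacement column can be recomputed as the mean Euclidean distance between corresponding reference and transformed corners. A minimal sketch, assuming one of the info_test.csv files listed above is at hand (stored values may differ slightly due to rounding):

import numpy as np
import pandas as pd

df = pd.read_csv('Eliceiri_patches/patch_tlevel1/info_test.csv')

ref = df[[f'{a}{i}_Ref' for i in range(1, 5) for a in ('X', 'Y')]].to_numpy().reshape(-1, 4, 2)
trans = df[[f'{a}{i}_Trans' for i in range(1, 5) for a in ('X', 'Y')]].to_numpy().reshape(-1, 4, 2)

# Mean distance between the four corresponding corner points of each patch pair.
displacement = np.linalg.norm(ref - trans, axis=2).mean(axis=1)
print(displacement[:5])
print(df['Displacement'].head().to_numpy())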
In addition, each row in RIRE_patches/fold[1-3]/patch_tlevel[1-4]/info_test.csv has the following columns:
Naming convention
- zh{ID}_{iRow}_{iCol}_{ReferenceOrTransformed}.png
  Example: zh5_03_02_R.png indicates the Reference patch of the 3rd row and 2nd column cut from the image with ID zh5.
- Cytological data: {{cellline}_{treatment}_{fieldofview}_{iFrame}}_{iRow}_{iCol}_{ReferenceOrTransformed}.png
  Example: PNT1A_do_1_f15_02_01_T.png indicates the Transformed patch…
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Robotic manipulation remains a core challenge in robotics, particularly for contact-rich tasks such as industrial assembly and disassembly. Existing datasets have significantly advanced learning in manipulation but are primarily focused on simpler tasks like object rearrangement, falling short of capturing the complexity and physical dynamics involved in assembly and disassembly. To bridge this gap, we present REASSEMBLE (Robotic assEmbly disASSEMBLy datasEt), a new dataset designed specifically for contact-rich manipulation tasks. Built around the NIST Assembly Task Board 1 benchmark, REASSEMBLE includes four actions (pick, insert, remove, and place) involving 17 objects. The dataset contains 4,551 demonstrations, of which 4,035 were successful, spanning a total of 781 minutes. Our dataset features multi-modal sensor data including event cameras, force-torque sensors, microphones, and multi-view RGB cameras. This diverse dataset supports research in areas such as learning contact-rich manipulation, task condition identification, action segmentation, and more. We believe REASSEMBLE will be a valuable resource for advancing robotic manipulation in complex, real-world scenarios.
Each demonstration starts by randomizing the board and object poses, after which an operator teleoperates the robot to assemble and disassemble the board while narrating their actions and marking task segment boundaries with key presses. The narrated descriptions are transcribed using Whisper [1], and the board and camera poses are measured at the beginning using a motion capture system, though continuous tracking is avoided due to interference with the event camera. Sensory data is recorded with rosbag and later post-processed into HDF5 files without downsampling or synchronization, preserving raw data and timestamps for future flexibility. To reduce memory usage, video and audio are stored as encoded MP4 and MP3 files, respectively. Transcription errors are corrected automatically or manually, and a custom visualization tool is used to validate the synchronization and correctness of all data and annotations. Missing or incorrect entries are identified and corrected, ensuring the dataset’s completeness. Low-level Skill annotations were added manually after data collection, and all labels were carefully reviewed to ensure accuracy.
The dataset consists of several HDF5 (.h5) and JSON (.json) files, organized into two directories. The poses directory contains the JSON files, which store the poses of the cameras and the board in the world coordinate frame. The data directory contains the HDF5 files, which store the sensory readings and annotations collected as part of the REASSEMBLE dataset. Each JSON file can be matched with its corresponding HDF5 file based on their filenames, which include the timestamp when the data was recorded. For example, 2025-01-09-13-59-54_poses.json corresponds to 2025-01-09-13-59-54.h5.
The structure of the JSON files is as follows:
{"Hama1": [
[x ,y, z],
[qx, qy, qz, qw]
],
"Hama2": [
[x ,y, z],
[qx, qy, qz, qw]
],
"DAVIS346": [
[x ,y, z],
[qx, qy, qz, qw]
],
"NIST_Board1": [
[x ,y, z],
[qx, qy, qz, qw]
]
}
[x, y, z] represent the position of the object, and [qx, qy, qz, qw] represent its orientation as a quaternion.
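A minimal sketch for reading one of the pose files with the structure shown above (the filename follows the example given earlier).

import json

with open("2025-01-09-13-59-54_poses.json") as f:
    poses = json.load(f)

# Each entry is a pair: [x, y, z] position and [qx, qy, qz, qw] orientation quaternion.
board_position, board_quaternion = poses["NIST_Board1"]
print(board_position, board_quaternion)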
The HDF5 (.h5) format organizes data into two main types of structures: datasets, which hold the actual data, and groups, which act like folders that can contain datasets or other groups. In the diagram below, groups are shown as folder icons, and datasets as file icons. The main group of the file directly contains the video, audio, and event data. To save memory, video and audio are stored as encoded byte strings, while event data is stored as arrays. The robot’s proprioceptive information is kept in the robot_state group as arrays. Because different sensors record data at different rates, the arrays vary in length (signified by the N_xxx variable in the data shapes). To align the sensory data, each sensor’s timestamps are stored separately in the timestamps group. Information about action segments is stored in the segments_info group. Each segment is saved as a subgroup, named according to its order in the demonstration, and includes a start timestamp, end timestamp, a success indicator, and a natural language description of the action. Within each segment, low-level skills are organized under a low_level subgroup, following the same structure as the high-level annotations.
[Diagram of the HDF5 file structure]
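A minimal sketch for browsing one of the HDF5 files with h5py, based on the group layout described above; the exact dataset names inside each group are not listed here, so the snippet only walks the hierarchy.

import h5py

with h5py.File("2025-01-09-13-59-54.h5", "r") as f:
    f.visit(print)  # print the path of every group and dataset in the file

    # Each action segment is a subgroup with start/end timestamps, a success flag,
    # a language description, and a low_level subgroup of skill annotations.
    for name, segment in f["segments_info"].items():
        print(name, list(segment.keys()))
        break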
The splits folder contains two text files which list the .h5 files used for the training and validation splits.
The project website contains more details about the REASSEMBLE dataset. The code for loading and visualizing the data is available on our GitHub repository.
📄 Project website: https://tuwien-asl.github.io/REASSEMBLE_page/
💻 Code: https://github.com/TUWIEN-ASL/REASSEMBLE
| Recording | Issue |
|---|---|
| 2025-01-10-15-28-50.h5 | hand cam missing at beginning |
| 2025-01-10-16-17-40.h5 | missing hand cam |
| 2025-01-10-17-10-38.h5 | hand cam missing at beginning |
| 2025-01-10-17-54-09.h5 | no empty action at |
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Kerta Corpus: Multimodal Code Readability Dataset
Summary
Kerta Corpus is a multimodal dataset for code readability research. This dataset combines:
- Metric features from the Scalabrino tool, which includes the feature definitions of Scalabrino, Buse and Weimer, and Posnett
- Hand-crafted code metrics (56 static metrics) (in progress)
- Rendered code highlight images (PNG format)
- A Java Method Declaration corpus labeled into three readability classes: 0 — Unreadable, 1 —… See the full description on the dataset page: https://huggingface.co/datasets/budsus/kerta.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The original data contains 44k images. This version only contains front-view images.
Text2Human: Text-Driven Controllable Human Image Generation
Yuming Jiang, Shuai Yang, Haonan Qiu, Wayne Wu, Chen Change Loy and Ziwei Liu
In ACM Transactions on Graphics (Proceedings of SIGGRAPH), 2022.
From MMLab@NTU affiliated with S-Lab, Nanyang Technological University and SenseTime Research.
[Project Page] | [Paper] | [Code] | [Demo Video]
DeepFashion-MultiModal is a large-scale high-quality human dataset with rich multi-modal annotations. It has the following properties:
1. It contains 44,096 high-resolution human images, including 12,701 full body human images.
2. For each full body image, we manually annotate the human parsing labels of 24 classes.
3. For each full body image, we manually annotate the keypoints.
4. We extract DensePose for each human image.
5. Each image is manually annotated with attributes for both clothes shapes and textures.
6. We provide a textual description for each image.
@article{jiang2022text2human,
title={Text2Human: Text-Driven Controllable Human Image Generation},
author={Jiang, Yuming and Yang, Shuai and Qiu, Haonan and Wu, Wayne and Loy, Chen Change and Liu, Ziwei},
journal={ACM Transactions on Graphics (TOG)},
volume={41},
number={4},
articleno={162},
pages={1--11},
year={2022},
publisher={ACM New York, NY, USA},
doi={10.1145/3528223.3530104},
}
@inproceedings{liuLQWTcvpr16DeepFashion,
author = {Liu, Ziwei and Luo, Ping and Qiu, Shi and Wang, Xiaogang and Tang, Xiaoou},
title = {DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations},
booktitle = {Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2016}
}
https://cdla.io/permissive-1-0/
The MultiBanFakeDetect dataset consists of 9,600 text–image instances collected from online forums, news websites, and social media. It covers multiple themes — political, social, technology, and entertainment — with a balanced distribution of real and fake instances.
The dataset is split into:
| Type | Training | Testing | Validation |
|---|---|---|---|
| Misinformation | 1,288 | 161 | 162 |
| Rumor | 1,215 | 152 | 151 |
| Clickbait | 1,337 | 167 | 167 |
| Non-fake | 3,840 | 480 | 480 |
| Total | 7,680 | 960 | 960 |
| Label | Training | Testing | Validation |
|---|---|---|---|
| 1 (Fake) | 3,840 | 480 | 480 |
| 0 (Non-Fake) | 3,840 | 480 | 480 |
| Total | 7,680 | 960 | 960 |
| Category | Training | Testing | Validation |
|---|---|---|---|
| Entertainment | 640 | 80 | 80 |
| Sports | 640 | 80 | 80 |
| Technology | 640 | 80 | 80 |
| National | 640 | 80 | 80 |
| Lifestyle | 640 | 80 | 80 |
| Politics | 640 | 80 | 80 |
| Education | 640 | 80 | 80 |
| International | 640 | 80 | 80 |
| Crime | 640 | 80 | 80 |
| Finance | 640 | 80 | 80 |
| Business | 640 | 80 | 80 |
| Miscellaneous | 640 | 80 | 80 |
| Total | 7,680 | 960 | 960 |
@article{FARIA2025100347,
title = {MultiBanFakeDetect: Integrating advanced fusion techniques for multimodal detection of Bangla fake news in under-resourced contexts},
journal = {International Journal of Information Management Data Insights},
volume = {5},
number = {2},
pages = {100347},
year = {2025},
issn = {2667-0968},
doi = {https://doi.org/10.1016/j.jjimei.2025.100347},
url = {https://www.sciencedirect.com/science/article/pii/S2667096825000291},
author = {Fatema Tuj Johora Faria and Mukaffi Bin Moin and Zayeed Hasan and Md. Arafat Alam Khandaker and Niful Islam and Khan Md Hasib and M.F. Mridha},
keywords = {Fake news detection, Multimodal dataset, Textual analysis, Visual analysis, Bangla language, Under-resource, Fusion techniques, Deep learning}}
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data, Codes, and Supplementary Figures for "Leveraging Vision Capabilities of Multimodal LLMs for Automated Data Extraction from Plots" (https://arxiv.org/abs/2503.12326)
This repository contains datasets and tools related to PlotExtract, a pipeline for automated plot digitization using LLM-based vision models. Below is a description of the key components.
Dataset Output Files
- *.out_data – Results of LLM-based visual data extraction from plot images. These files contain the extracted data points in CSV-like format.
- *.out_code – Python code generated by the LLM to recreate the source plot using the extracted data.
- *.out_conversation – Full conversations with the LLM conducted by PlotExtract, including prompts and responses.
- interpolated_* – Visual and statistical comparisons based on interpolation between the LLM-extracted data and the ground truth. These correspond to the interpolation accuracy assessments described in the paper.
- pointwise_* – Visual and statistical comparisons on a point-by-point basis between extracted and ground-truth data. These correspond to pointwise accuracy evaluations from the main text.
- *.stats – Numerical summaries of extraction accuracy, referenced in the associated visual comparisons.
- *.csv – Manually extracted ground-truth data used as reference for evaluating extraction accuracy.
All of the above files are generated automatically during PlotExtract execution.
Published, Synthetic, and chartQA Datasets
The Published Dataset does not include original plot images due to copyright restrictions. Instead, each plot is referenced in source_images.csv, which lists:
- DOI of the source publication
- Figure number
- Filename used in this dataset
The Synthetic Dataset includes synthetic plot images, extracted data, generated replots, and evaluation outputs for benchmarking purposes.
The chartQA Dataset (https://doi.org/10.48550/arXiv.2203.10244) includes chartQA plot images, extracted data, generated replots, and evaluation outputs for benchmarking purposes. There are two equivalent datasets, FULL and CROPPED: the first contains the original images and the second contains images cropped as much as possible to preserve only the plot and remove additional text.
Codes
All source code, including PlotExtract and supporting scripts for evaluation and comparison, is included in MPPolak_DMorgan_PlotExtract_Codes.zip. Each script contains usage instructions in-line and is intended to be self-explanatory for users familiar with Python-based data processing workflows.
Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
This repository contains the TamperedNews dataset introduced in the paper:
Eric Müller-Budack, Jonas Theiner, Sebastian Diering, Maximilian Idahl, and Ralph Ewerth. 2020. Multimodal Analytics for Real-world News using Measures of Cross-modal Entity Consistency. In Proceedings of the 2020 International Conference on Multimedia Retrieval (ICMR '20). Association for Computing Machinery, New York, NY, USA, 16–25. DOI: https://doi.org/10.1145/3372278.3390670
dataset.jsonl containing:
entity_type.jsonl file for each entity type containing the following information for each entity:
The source code to reproduce our results as well as download scripts to crawl news texts and images can be found on our GitHub page: https://github.com/TIBHannover/cross-modal_entity_consistency
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Multi Modal Task 2 is a dataset for object detection tasks - it contains Reference Inventory Assembly YSiY annotations for 200 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
multimodal-open-r1-8192-filtered-tighter
Original dataset structure preserved, filtered by token length and image quality
Dataset Description
This dataset was processed using the data-preproc package for vision-language model training.
Processing Configuration
Base Model: allenai/Molmo-7B-O-0924
Tokenizer: allenai/Molmo-7B-O-0924
Sequence Length: 8192
Processing Type: Vision Language (VL)
Dataset Features
input_ids: Tokenized input sequences… See the full description on the dataset page: https://huggingface.co/datasets/penfever/multimodal-open-r1-8192-filtered-tighter.