Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
To read any dataset you can use the following code
>>> import numpy as np
>>> embed_image = np.load('embed_image.npy')
>>> embed_image.shape
(33962, 768)
>>> embed_text = np.load('embed_text.npy')
>>> embed_text.shape
(33962, 768)
>>> import pandas as pd
>>> items = pd.read_csv('items.txt')
>>> m = len(items)
>>> print(f'{m} items in dataset')
33962
>>> users = pd.read_csv('users.txt')
>>> n = len(users)
>>> print(f'{n} users in dataset')
14790
>>> train = pd.read_csv('train.txt')
>>> train
user item
0 13444 23557
1 13444 33739
... ... ...
317109 13506 29993
317110 13506 13931
>>> from scipy.sparse import csr_matrix
>>> train_matrix = csr_matrix((np.ones(len(train)), (train.user, train.item)), shape=(n,m))
This resource contains six datasets. Each dataset is provided with seven combinations of image and text encoders, so you should see 42 folders.
Each folder is named after the dataset and the encoders used for the visual and textual parts. For example: bookcrossing-vit_bert.
The datasets are:
- Clothing, Shoes and Jewelry (Amazon)
- Home and Kitchen (Amazon)
- Musical Instruments (Amazon)
- Movies and TV (Amazon)
- Book-Crossing
- Movielens 25M
And the encoders are:
- CLIP (image and text) (*-clip_clip). This is the main one used in the experiments.
- ViT and BERT (*-vit_bert)
- CLIP (visual only) (*-clip_none)
- ViT only (*-vit_none)
- BERT only (*-none_bert)
- CLIP (text only) (*-none_clip)
- No textual or visual information (*-none_none)
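As a quick illustration of the folder naming convention described above, the sketch below builds a folder name from a dataset and an encoder combination and loads its embeddings; the base directory is a placeholder.

import os
import numpy as np

# Folder name = <dataset>-<image_encoder>_<text_encoder>, e.g. "bookcrossing-vit_bert".
base_dir = "."                                   # placeholder: wherever the 42 folders live
dataset, encoders = "bookcrossing", "clip_clip"  # the main configuration used in the experiments

folder = os.path.join(base_dir, f"{dataset}-{encoders}")
embed_image = np.load(os.path.join(folder, "embed_image.npy"))  # shape (M, E)
embed_text = np.load(os.path.join(folder, "embed_text.npy"))    # shape (M, D)
print(embed_image.shape, embed_text.shape)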
For each dataset, we provide the following files, assuming M items, N users, textual embeddings of dimension D (e.g. 1024), and visual embeddings of dimension E (e.g. 768):
- embed_image.npy: a NumPy array of M×E elements.
- embed_text.npy: a NumPy array of M×D elements.
- items.csv: a CSV with the item ID in the original dataset (e.g. the Amazon ASIN, the movie ID) and the item number, an integer from 0 to M-1.
- users.csv: a CSV with the user ID in the original dataset (e.g. the Amazon reviewer ID) and the user number, an integer from 0 to N-1.
- train.txt, validation.txt and test.txt: CSV files with the train, validation, and test portions of the reviews. Each row is a positive user-item pair, i.e. an item the user liked or reviewed positively.
We consider a review "positive" if the rating is 4 or more (8 or more for Book-Crossing).
If an item has no image or no text, the corresponding embedding vector is zeroed out.
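Since missing modalities are stored as all-zero rows, they can be counted directly from the embedding matrices (a minimal sketch, assuming the files of one dataset folder are in the working directory as in the example above).

import numpy as np

embed_image = np.load('embed_image.npy')
embed_text = np.load('embed_text.npy')

# An all-zero row means the item has no image (or no text).
no_image = ~embed_image.any(axis=1)
no_text = ~embed_text.any(axis=1)
print(f'{no_image.sum()} items without image, {no_text.sum()} items without text')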
| Dataset | Users | Items | Ratings | Density |
|---|---|---|---|---|
| Clothing & Shoes & Jewelry | 23318 | 38493 | 178944 | 0.020% |
| Home & Kitchen | 5968 | 57645 | 135839 | 0.040% |
| Movies & TV | 21974 | 23958 | 216110 | 0.041% |
| Musical Instruments | 14429 | 29040 | 93923 | 0.022% |
| Book-crossing | 14790 | 33962 | 519613 | 0.103% |
| Movielens 25M | 162541 | 59047 | 25000095 | 0.260% |
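The Density column is the number of ratings divided by the number of possible user-item pairs; for example, using the Book-crossing row of the table above:

users, items, ratings = 14790, 33962, 519613
density = ratings / (users * items)
print(f'{density:.3%}')  # ~0.103%, matching the table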
For the Amazon datasets, only a small fraction of the reviews was kept, restricted to a specific date range.
For the Bookcrossing dataset, only items with images were considered.
There are various other minor tweaks in how images and texts were obtained. The repository https://github.com/igui/MultimodalRecomAnalysis contains the notebooks and scripts to reproduce the dataset extraction from scratch.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Multiturn Multimodal
We want to generate synthetic data that captures the position of and relationships between multiple images and multiple audio clips; an example is shown below. All notebooks are at https://github.com/mesolitica/malaysian-dataset/tree/master/chatbot/multiturn-multimodal
multi-images
synthetic-multi-images-relationship.jsonl, 100000 rows, 109MB. Images at https://huggingface.co/datasets/mesolitica/translated-LLaVA-Pretrain/tree/main
Example data
{'filename':… See the full description on the dataset page: https://huggingface.co/datasets/mesolitica/synthetic-multiturn-multimodal.
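A minimal sketch for inspecting the JSONL file described above; only the 'filename' key is visible in the truncated example, so the snippet simply prints whatever keys each record contains.

import json

# Each line of the .jsonl file is a single JSON record.
with open('synthetic-multi-images-relationship.jsonl') as f:
    for line in f:
        row = json.loads(line)
        print(row.get('filename'), sorted(row.keys()))  # 'filename' is shown above; other keys vary
        break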
Custom license: https://heidata.uni-heidelberg.de/api/datasets/:persistentId/versions/2.1/customlicense?persistentId=doi:10.11588/DATA/68HOOP
This dataset contains source code and data used in the PhD thesis "Measuring the Contributions of Vision and Text Modalities in Multimodal Transformers". The dataset is split into five repositories:
- Code and resources related to chapter 2 of the thesis (Section 2.2, method described in "Using Scene Graph Representations and Knowledge Bases").
- Code and resources related to chapter 3 of the thesis (VALSE dataset).
- Code and resources related to chapter 4 of the thesis: MM-SHAP measure and experiments code.
- Code and resources related to chapter 5 of the thesis: CCSHAP measure and experiments code related to large language models (LLMs).
- Code and resources related to the experiments with vision and language model decoders from chapters 3, 4, and 5.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Autonomous driving is a popular research area within the computer vision research community. Since autonomous vehicles are highly safety-critical, ensuring robustness is essential for real-world deployment. While several public multimodal datasets are accessible, they mainly comprise two sensor modalities (camera, LiDAR) which are not well suited for adverse weather. In addition, they lack far-range annotations, making it harder to train neural networks that are the base of a highway assistant function of an autonomous vehicle. Therefore, we introduce a multimodal dataset for robust autonomous driving with long-range perception. The dataset consists of 176 scenes with synchronized and calibrated LiDAR, camera, and radar sensors covering a 360-degree field of view. The collected data was captured in highway, urban, and suburban areas during daytime, night, and rain and is annotated with 3D bounding boxes with consistent identifiers across frames. Furthermore, we trained unimodal and multimodal baseline models for 3D object detection.
The paper describing the dataset can be read here: https://arxiv.org/pdf/2211.09445.pdf
If you use aiMotive Multimodal Dataset in your research, please cite our work by using the following BibTeX entry:
@article{matuszka2022aimotivedataset,
  title = {aiMotive Dataset: A Multimodal Dataset for Robust Autonomous Driving with Long-Range Perception},
  author = {Matuszka, Tamás and Barton, Iván and Butykai, Ádám and Hajas, Péter and Kiss, Dávid and Kovács, Domonkos and Kunsági-Máté, Sándor and Lengyel, Péter and Németh, Gábor and Pető, Levente and Ribli, Dezső and Szeghy, Dávid and Vajna, Szabolcs and Varga, Bálint},
  doi = {10.48550/ARXIV.2211.09445},
  url = {https://arxiv.org/abs/2211.09445},
  publisher = {arXiv},
  year = {2022},
}
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Multimodal Pragmatic Jailbreak on Text-to-image Models
Project page | Paper | Code
The Multimodal Pragmatic Unsafe Prompts (MPUP) is a dataset designed to assess the multimodal pragmatic safety of Text-to-Image (T2I) models. It comprises two key sections: image_prompt and text_prompt.
Dataset Usage
Downloading the Data
To download the dataset, install Huggingface Datasets and then use the following command: from datasets import load_dataset dataset =… See the full description on the dataset page: https://huggingface.co/datasets/tongliuphysics/multimodalpragmatic.
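The truncated command above follows the standard Hugging Face datasets API; a minimal sketch is shown below (the dataset ID is taken from the page URL, and no split or configuration name is given in this description, so defaults are assumed).

from datasets import load_dataset

# If the dataset defines multiple configurations, pass the config name as the second argument.
dataset = load_dataset("tongliuphysics/multimodalpragmatic")
print(dataset)  # shows the available splits and columns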
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Test Multi Modal is a dataset for vision language (multimodal) tasks - it contains Docs annotations for 1,998 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
This resource can be used to generate a multimodal combination of the MIMIC-IV v1.0.0 and MIMIC Chest X-ray (MIMIC-CXR-JPG) v2.0.0 databases, filtered to include only patients that have at least one chest X-ray, with the goal of validating multi-modal predictive analytics in healthcare operations. The multimodal dataset generated through this code contains 34,540 individual patient files in the form of "pickle" Python object structures, covering a total of 7,279 hospitalization stays involving 6,485 unique patients. Additionally, code to extract feature embeddings as well as the list of pre-processed features are included in this repository.
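Each patient file is a pickled Python object; a minimal sketch for loading one is below (the filename is a placeholder, and the internal structure of the object is not described in this summary).

import pickle

# Placeholder path: substitute one of the 34,540 generated patient files.
with open("patient_file.pkl", "rb") as f:
    patient = pickle.load(f)

print(type(patient))  # inspect the stored object structure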
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Multimodal dataset for fall detection. Includes acceleration data collected from a tag and two smartwatches, and location reported by the tag. More details about the data collection procedure can be found in notes.md.
Contents
The repository contains:
- data/location_data.csv and data/full_acceleration – preprocessed acceleration and location data from 10 participants and mannequin-simulated falls, with the target variable identified
- data/subsampled_acceleration_data.csv – subsampled acceleration dataset used for training the AI model
- notes.md – description of activities performed and notes from data collection
- videos – reference videos for performed activities
Authors
Acknowledgements
This work is part of the ASSIST-IoT project that has received funding from the EU’s Horizon 2020 research and innovation programme under grant agreement No 957258.
The Central Institute for Labour Protection – National Research Institute provided facilities and equipment for data collection.
License
The dataset is licensed under the Creative Commons Attribution 4.0 International License.
Abstract:
Video understanding extends the level of temporal action recognition. Taking the example of a video containing rich human action, we can reason about and predict future actions based on the first several actions in the stream. However, when the task comes to a machine, it can still be difficult to make forecasts and plans based on the video features of these daily human actions. We formalize the task as Multi-modal Human Activity Understanding. Given a small fraction of the original video clip and a set of action sequences, a machine should be able to find the most reasonable action sequence in the set which can well represent the future actions of the observed video frames. We design the task in two settings: one relies solely on the understanding of the initial video frames; the other provides both the initial state (video frames) and the goal state (high-level intent). We call them Human Action Forecasting and Human Action Planning, respectively. We then propose the fully annotated benchmark called MUHACU (MUlti-modal Human ACtivity Understanding), consisting of 2.9k videos and 157 action classes from the original Charades [1] videos. We refine the original labels of the Charades videos and add more features to aid our task completion. In addition, we provide two strong baseline systems from two directions, information retrieval and end-to-end training, sharing some insights on potential solutions to this task.
Introduction:
We have tailored and refined the original annotation in the Charades dataset by selecting 2.9k videos and crowdsourcing the corresponding intent in each video. To meet the design of the initial state, we generally choose the first 20% of the length of each video as the initial state. Along with the dataset, the multi-modal knowledge base is crafted semi-automatically. Containing temporal action relationships, visual and textual features of atomic actions, action sequences, and high-level intents, the knowledge base serves the idea of generalization well. We demonstrate that the Multi-modal Human Activity Understanding (MUHACU) task is challenging for machines by evaluating a strong hybrid end-to-end framework in the format of a multi-modal cloze task.
In summary, MUHACU facilitates multi-modal learning systems that observe through visual features, and forecast and plan in language in real-world environments. Our contributions, in brief, are: (1) We propose the first multi-modal knowledge base for temporal activity understanding. (2) We propose baselines demonstrating the effectiveness of the knowledge base. (3) We propose a novel multi-modal benchmark for evaluating models backed by the knowledge base and dataset.
MUHACU contains the following fields:
KB: 2402 videos

| KB | value |
|---|---|
| # of action-level entities | 157 |
| # of activity video entities | 2402 |
| # of intents for each video | 2 |
| # of action video entities | 12118 |
| # of action sequences (non-repeat seq) | 2402 (1969) |
| # of action state templates | 27 |
| avg. action sequence length | 5.04 |
Features in KB:

| feature | num | size |
|---|---|---|
| action visual prototype feat | 157 | [1024,] |
| action textual prototype feat | 157 | [768,] |
| intent feat | 2402*2 | [768,] |
| video-level visual feat | 2402 + 12118 | [1024,] |
| snippet-level visual feat | 2402 + 12118 | [frames//8, 1024] |
Evaluation task: 510 videos for human action planning and human action forecasting

| num | human action planning | human action forecasting |
|---|---|---|
| # of videos (action sequences) | 510 | 510 |
| avg. # of observed acts | 2.79 | 2.79 |
| avg. # of predicted acts | 2.40 | 2.40 |
| avg. # of total acts | 5.19 | 5.19 |
| # of choices | 6 | 6 |
| # of answers | 1 (435) / 2 (75) | 1 |
| # of intent | 0 | 1 |
Training dataset split: we also provide a dataset split for training the baseline model to learn the future ground-truth sequence. The initial 2402 KB videos are distributed by the standard 8:2 split into training (1921 videos) and validation (481 videos).

| train | validation | test |
|---|---|---|
| 1921 | 481 | 510 |
More details about the dataset are in README.txt
Availability:
Our dataset and knowledge base are available online at https://zenodo.org/deposit/4968721 in order to support sustainability. The resource is maintained under a Creative Commons Attribution 4.0 International license, implying re-usability. We follow the widely used FAIR Data principles, which are designed to make resources findable, accessible, interoperable, and re-usable. The GitHub repository containing the complete source code and checkpoints for the baseline systems is available at https://github.com/MUHACU/MUHACU.
[1] Sigurdsson, Gunnar A., et al. "Hollywood in homes: Crowdsourcing data collection for activity understanding." European Conference on Computer Vision. Springer, Cham, 2016.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Multi Modal is a dataset for object detection tasks - it contains Reference Inventory Assembly annotations for 200 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
The MELD Preprocessed Dataset is a multi-modal dataset designed for research on emotion recognition from audio, video, and textual data. The dataset builds upon the original MELD dataset and applies extensive preprocessing steps to extract features from different modalities. Each sample is saved as a .pt file containing a dictionary of preprocessed features, making it easy for developers to load and integrate into PyTorch-based workflows.
The preprocessing script performs several key steps:
Text Cleaning:
- fix_encoding_with_bytes(text): decodes text from bytes using UTF-8, Latin-1, or cp1252, ensuring correct encoding.
- replace_double_encoding(text): fixes issues related to double-encoded characters (e.g., replacing "Â’" with the proper apostrophe).
Audio Processing:
- Uses torchaudio.transforms.MelSpectrogram with 64 mel bins (VGGish format).
Video Processing:
- Extracts a face image from the video (see the face key below); if no face is detected, a default black image is used.
Saving Processed Samples:
- Each sample is saved as a .pt file in a directory structure split by data type (train, dev, and test).
- Files are named after the source video clip (e.g., dia0_utt1.mp4 becomes dia0_utt1.pt).
Each preprocessed sample is stored in a .pt file and contains a dictionary with the following keys:
- utterance (str): The cleaned textual utterance.
- emotion (str/int): The corresponding emotion label.
- video_path (str): Original path to the video file from which the sample was extracted.
- audio (Tensor): Raw audio waveform tensor of shape [channels, time].
- audio_sample_rate (int): The sampling rate of the audio waveform.
- audio_mel (Tensor): The computed log-scaled Mel-spectrogram with shape [channels, n_mels, time].
- face (NumPy array): The extracted face image (RGB format) of shape (224, 224, 3). If no face was detected, a default black image is provided.
The preprocessed files are organized into splits:
preprocessed_data/
├── train/
│ ├── dia0_utt0.pt
│ ├── dia1_utt1.pt
│ └── ...
├── dev/
│ ├── dia0_utt0.pt
│ ├── dia1_utt1.pt
│ └── ...
└── test/
    ├── dia0_utt0.pt
    ├── dia1_utt1.pt
    └── ...
A custom PyTorch dataset and DataLoader are provided to facilitate easy integration:
from torch.utils.data import Dataset
import os
import torch

class PreprocessedMELDDataset(Dataset):
    def __init__(self, data_dir):
        """
        Args:
            data_dir (str): Directory where preprocessed .pt files are stored.
        """
        self.data_dir = data_dir
        self.files = [os.path.join(data_dir, f) for f in os.listdir(data_dir) if f.endswith('.pt')]

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        sample_path = self.files[idx]
        # Each .pt file holds the preprocessed sample dictionary described above.
        sample = torch.load(sample_path)
        return sample

def preprocessed_collate_fn(batch):
    """
    Collates a list of sample dictionaries into a single dictionary with keys mapping to lists.
    Modify this function to pad or stack tensor data if needed.
    """
    collated = {}
    collated['utterance'] = [sample['utterance'] for sample in batch]
    collated['emotion'] = [sample['emotion'] for sample in batch]
    collated['video_path'] = [sample['video_path'] for sample in batch]
    collated['audio'] = [sample['audio'] for sample in batch]
    collated['audio_sample_rate'] = batch[0]['audio_sample_rate']
    collated['audio_mel'] = [sample['audio_mel'] for sample in batch]
    collated['face'] = [sample['face'] for sample in batch]
    return collated
from torch.utils.data import DataLoader
# Define paths for each split
train_data_dir = "preprocessed_data/train"
dev_data_dir = "preproces...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Modality A: Near-Infrared (NIR)
Modality B: three colour channels (in B-G-R order)
Modality A: Fluorescence Images
Modality B: Quantitative Phase Images (QPI)
Modality A: Second Harmonic Generation (SHG)
Modality B: Bright-Field (BF)
The evaluation set created from the above three publicly available 2D datasets consists of images that have undergone 4 levels of (rigid) transformations of increasing size of displacement. The level of transformation is determined by the size of the rotation angle θ and the displacements tx & ty, detailed in this table. Each image sample is transformed exactly once at each transformation level so that all levels have the same number of samples.
Modality A: T1-weighted MRI
Modality B: T2-weighted MRI
(Run make_rire_patches.py to generate the sub-volumes.)
Reference sub-volumes of size 210x210x70 voxels are cropped directly from centres of the (non-displaced) resampled volumes. Similarly as for the aforementioned 2D datasets, random (uniformly-distributed) transformations are composed of rotations θx, θy ∈ [-4, 4] degrees around the x- and y-axes, rotation θz ∈ [-20, 20] degrees around the z-axis, translations tx, ty ∈ [-19.6, 19.6] voxels in x and y directions and translation tz ∈ [-6.5, 6.5] voxels in z direction. 40 rigid transformations of increasing sizes of displacement are applied to each volume. Transformed sub-volumes, of size 210x210x70 voxels, are cropped from centres of the transformed and resampled volumes.
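For illustration only (the actual sub-volumes are produced by make_rire_patches.py), random rigid transformations within the ranges quoted above could be sampled as follows; the 4x4 homogeneous matrix is an assumed representation.

import numpy as np
from scipy.spatial.transform import Rotation

rng = np.random.default_rng()

# Rotation angles (degrees) and translations (voxels), ranges as stated above.
theta_x, theta_y = rng.uniform(-4, 4, size=2)
theta_z = rng.uniform(-20, 20)
tx, ty = rng.uniform(-19.6, 19.6, size=2)
tz = rng.uniform(-6.5, 6.5)

T = np.eye(4)
T[:3, :3] = Rotation.from_euler('xyz', [theta_x, theta_y, theta_z], degrees=True).as_matrix()
T[:3, 3] = [tx, ty, tz]
print(T)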
In total, it contains 864 image pairs created from the aerial dataset, 5040 image pairs created from the cytological dataset, 536 image pairs created from the histological dataset, and metadata with scripts to create the 480 volume pairs from the radiological dataset. Each image pair consists of a reference patch \(I^{\text{Ref}}\) and its corresponding initial transformed patch \(I^{\text{Init}}\) in both modalities, along with the ground-truth transformation parameters to recover it.
Scripts to calculate the registration performance and to plot the overall results can be found in https://github.com/MIDA-group/MultiRegEval, and instructions to generate more evaluation data with different settings can be found in https://github.com/MIDA-group/MultiRegEval/tree/master/Datasets#instructions-for-customising-evaluation-data.
Metadata
In the *.zip files, each row in {Zurich,Balvan}_patches/fold[1-3]/patch_tlevel[1-4]/info_test.csv or Eliceiri_patches/patch_tlevel[1-4]/info_test.csv provides the information of an image pair as follows:
Filename: identifier(ID) of the image pair
X1_Ref: x-coordinate of the upper-left corner of reference patch IRef
Y1_Ref: y-coordinate of the upper-left corner of reference patch IRef
X2_Ref: x-coordinate of the lower-left corner of reference patch IRef
Y2_Ref: y-coordinate of the lower-left corner of reference patch IRef
X3_Ref: x-coordinate of the lower-right corner of reference patch IRef
Y3_Ref: y-coordinate of the lower-right corner of reference patch IRef
X4_Ref: x-coordinate of the upper-right corner of reference patch IRef
Y4_Ref: y-coordinate of the upper-right corner of reference patch IRef
X1_Trans: x-coordinate of the upper-left corner of transformed patch IInit
Y1_Trans: y-coordinate of the upper-left corner of transformed patch IInit
X2_Trans: x-coordinate of the lower-left corner of transformed patch IInit
Y2_Trans: y-coordinate of the lower-left corner of transformed patch IInit
X3_Trans: x-coordinate of the lower-right corner of transformed patch IInit
Y3_Trans: y-coordinate of the lower-right corner of transformed patch IInit
X4_Trans: x-coordinate of the upper-right corner of transformed patch IInit
Y4_Trans: y-coordinate of the upper-right corner of transformed patch IInit
Displacement: mean Euclidean distance between reference corner points and transformed corner points
RelativeDisplacement: the ratio of displacement to the width/height of image patch
Tx: randomly generated translation in the x-direction to synthesise the transformed patch IInit
Ty: randomly generated translation in the y-direction to synthesise the transformed patch IInit
AngleDegree: randomly generated rotation in degrees to synthesise the transformed patch IInit
AngleRad: randomly generated rotation in radian to synthesise the transformed patch IInit
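Given the corner coordinates above, the Displacement column can be recomputed as the mean Euclidean distance between corresponding reference and transformed corners. A minimal sketch, assuming one of the info_test.csv files listed above is at hand (stored values may differ slightly due to rounding):

import numpy as np
import pandas as pd

df = pd.read_csv('Eliceiri_patches/patch_tlevel1/info_test.csv')

ref = df[[f'{a}{i}_Ref' for i in range(1, 5) for a in ('X', 'Y')]].to_numpy().reshape(-1, 4, 2)
trans = df[[f'{a}{i}_Trans' for i in range(1, 5) for a in ('X', 'Y')]].to_numpy().reshape(-1, 4, 2)

# Mean distance between the four corresponding corner points of each patch pair.
displacement = np.linalg.norm(ref - trans, axis=2).mean(axis=1)
print(displacement[:5])
print(df['Displacement'].head().to_numpy())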
In addition, each row in RIRE_patches/fold[1-3]/patch_tlevel[1-4]/info_test.csv has the following columns:
Naming convention
- zh{ID}_{iRow}_{iCol}_{ReferenceOrTransformed}.png
  Example: zh5_03_02_R.png indicates the Reference patch of the 3rd row and 2nd column cut from the image with ID zh5.
- Cytological data: {{cellline}_{treatment}_{fieldofview}_{iFrame}}_{iRow}_{iCol}_{ReferenceOrTransformed}.png
  Example: PNT1A_do_1_f15_02_01_T.png indicates the Transformed patch…
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Robotic manipulation remains a core challenge in robotics, particularly for contact-rich tasks such as industrial assembly and disassembly. Existing datasets have significantly advanced learning in manipulation but are primarily focused on simpler tasks like object rearrangement, falling short of capturing the complexity and physical dynamics involved in assembly and disassembly. To bridge this gap, we present REASSEMBLE (Robotic assEmbly disASSEMBLy datasEt), a new dataset designed specifically for contact-rich manipulation tasks. Built around the NIST Assembly Task Board 1 benchmark, REASSEMBLE includes four actions (pick, insert, remove, and place) involving 17 objects. The dataset contains 4,551 demonstrations, of which 4,035 were successful, spanning a total of 781 minutes. Our dataset features multi-modal sensor data including event cameras, force-torque sensors, microphones, and multi-view RGB cameras. This diverse dataset supports research in areas such as learning contact-rich manipulation, task condition identification, action segmentation, and more. We believe REASSEMBLE will be a valuable resource for advancing robotic manipulation in complex, real-world scenarios.
Each demonstration starts by randomizing the board and object poses, after which an operator teleoperates the robot to assemble and disassemble the board while narrating their actions and marking task segment boundaries with key presses. The narrated descriptions are transcribed using Whisper [1], and the board and camera poses are measured at the beginning using a motion capture system, though continuous tracking is avoided due to interference with the event camera. Sensory data is recorded with rosbag and later post-processed into HDF5 files without downsampling or synchronization, preserving raw data and timestamps for future flexibility. To reduce memory usage, video and audio are stored as encoded MP4 and MP3 files, respectively. Transcription errors are corrected automatically or manually, and a custom visualization tool is used to validate the synchronization and correctness of all data and annotations. Missing or incorrect entries are identified and corrected, ensuring the dataset’s completeness. Low-level Skill annotations were added manually after data collection, and all labels were carefully reviewed to ensure accuracy.
The dataset consists of several HDF5 (.h5) and JSON (.json) files, organized into two directories. The poses directory contains the JSON files, which store the poses of the cameras and the board in the world coordinate frame. The data directory contains the HDF5 files, which store the sensory readings and annotations collected as part of the REASSEMBLE dataset. Each JSON file can be matched with its corresponding HDF5 file based on their filenames, which include the timestamp when the data was recorded. For example, 2025-01-09-13-59-54_poses.json corresponds to 2025-01-09-13-59-54.h5.
The structure of the JSON files is as follows:
{"Hama1": [
[x ,y, z],
[qx, qy, qz, qw]
],
"Hama2": [
[x ,y, z],
[qx, qy, qz, qw]
],
"DAVIS346": [
[x ,y, z],
[qx, qy, qz, qw]
],
"NIST_Board1": [
[x ,y, z],
[qx, qy, qz, qw]
]
}
[x, y, z] represent the position of the object, and [qx, qy, qz, qw] represent its orientation as a quaternion.
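A minimal sketch for reading one of the pose files with the structure shown above (the filename follows the example given earlier).

import json

with open("2025-01-09-13-59-54_poses.json") as f:
    poses = json.load(f)

# Each entry is a pair: [x, y, z] position and [qx, qy, qz, qw] orientation quaternion.
board_position, board_quaternion = poses["NIST_Board1"]
print(board_position, board_quaternion)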
The HDF5 (.h5) format organizes data into two main types of structures: datasets, which hold the actual data, and groups, which act like folders that can contain datasets or other groups. In the diagram below, groups are shown as folder icons, and datasets as file icons. The main group of the file directly contains the video, audio, and event data. To save memory, video and audio are stored as encoded byte strings, while event data is stored as arrays. The robot’s proprioceptive information is kept in the robot_state group as arrays. Because different sensors record data at different rates, the arrays vary in length (signified by the N_xxx variable in the data shapes). To align the sensory data, each sensor’s timestamps are stored separately in the timestamps group. Information about action segments is stored in the segments_info group. Each segment is saved as a subgroup, named according to its order in the demonstration, and includes a start timestamp, end timestamp, a success indicator, and a natural language description of the action. Within each segment, low-level skills are organized under a low_level subgroup, following the same structure as the high-level annotations.
[Diagram of the HDF5 file structure]
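A minimal sketch for browsing one of the HDF5 files with h5py, based on the group layout described above; the exact dataset names inside each group are not listed here, so the snippet only walks the hierarchy.

import h5py

with h5py.File("2025-01-09-13-59-54.h5", "r") as f:
    f.visit(print)  # print the path of every group and dataset in the file

    # Each action segment is a subgroup with start/end timestamps, a success flag,
    # a language description, and a low_level subgroup of skill annotations.
    for name, segment in f["segments_info"].items():
        print(name, list(segment.keys()))
        break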
The splits folder contains two text files which list the .h5 files used for the training and validation splits.
The project website contains more details about the REASSEMBLE dataset. The code for loading and visualizing the data is available on our GitHub repository.
📄 Project website: https://tuwien-asl.github.io/REASSEMBLE_page/
💻 Code: https://github.com/TUWIEN-ASL/REASSEMBLE
| Recording | Issue |
|---|---|
| 2025-01-10-15-28-50.h5 | hand cam missing at beginning |
| 2025-01-10-16-17-40.h5 | missing hand cam |
| 2025-01-10-17-10-38.h5 | hand cam missing at beginning |
| 2025-01-10-17-54-09.h5 | no empty action at |
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Kerta Corpus: Multimodal Code Readability Dataset
Summary
Kerta Corpus is a multimodal dataset for code readability research. This dataset combines:
- Metric features from the Scalabrino tool, which includes the feature definitions of Scalabrino, Buse and Weimer, and Posnett
- Hand-crafted code metrics (56 static metrics) (in progress)
- Rendered code highlight images (PNG format)
- A Java Method Declaration corpus labeled into three readability classes: 0 — Unreadable, 1 —… See the full description on the dataset page: https://huggingface.co/datasets/budsus/kerta.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The original data contains 44k images. This version only contains front-view images.
Text2Human: Text-Driven Controllable Human Image Generation
Yuming Jiang, Shuai Yang, Haonan Qiu, Wayne Wu, Chen Change Loy and Ziwei Liu
In ACM Transactions on Graphics (Proceedings of SIGGRAPH), 2022.
From MMLab@NTU affiliated with S-Lab, Nanyang Technological University and SenseTime Research.
[Project Page] | [Paper] | [Code] | [Demo Video]
DeepFashion-MultiModal is a large-scale high-quality human dataset with rich multi-modal annotations. It has the following properties:
1. It contains 44,096 high-resolution human images, including 12,701 full body human images.
2. For each full body image, we manually annotate the human parsing labels of 24 classes.
3. For each full body image, we manually annotate the keypoints.
4. We extract DensePose for each human image.
5. Each image is manually annotated with attributes for both clothes shapes and textures.
6. We provide a textual description for each image.
@article{jiang2022text2human,
title={Text2Human: Text-Driven Controllable Human Image Generation},
author={Jiang, Yuming and Yang, Shuai and Qiu, Haonan and Wu, Wayne and Loy, Chen Change and Liu, Ziwei},
journal={ACM Transactions on Graphics (TOG)},
volume={41},
number={4},
articleno={162},
pages={1--11},
year={2022},
publisher={ACM New York, NY, USA},
doi={10.1145/3528223.3530104},
}
@inproceedings{liuLQWTcvpr16DeepFashion,
author = {Liu, Ziwei and Luo, Ping and Qiu, Shi and Wang, Xiaogang and Tang, Xiaoou},
title = {DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations},
booktitle = {Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2016}
}
https://cdla.io/permissive-1-0/
The MultiBanFakeDetect dataset consists of 9,600 text–image instances collected from online forums, news websites, and social media. It covers multiple themes — political, social, technology, and entertainment — with a balanced distribution of real and fake instances.
The dataset is split into:
| Type | Training | Testing | Validation |
|---|---|---|---|
| Misinformation | 1,288 | 161 | 162 |
| Rumor | 1,215 | 152 | 151 |
| Clickbait | 1,337 | 167 | 167 |
| Non-fake | 3,840 | 480 | 480 |
| Total | 7,680 | 960 | 960 |
| Label | Training | Testing | Validation |
|---|---|---|---|
| 1 (Fake) | 3,840 | 480 | 480 |
| 0 (Non-Fake) | 3,840 | 480 | 480 |
| Total | 7,680 | 960 | 960 |
| Category | Training | Testing | Validation |
|---|---|---|---|
| Entertainment | 640 | 80 | 80 |
| Sports | 640 | 80 | 80 |
| Technology | 640 | 80 | 80 |
| National | 640 | 80 | 80 |
| Lifestyle | 640 | 80 | 80 |
| Politics | 640 | 80 | 80 |
| Education | 640 | 80 | 80 |
| International | 640 | 80 | 80 |
| Crime | 640 | 80 | 80 |
| Finance | 640 | 80 | 80 |
| Business | 640 | 80 | 80 |
| Miscellaneous | 640 | 80 | 80 |
| Total | 7,680 | 960 | 960 |
@article{FARIA2025100347,
title = {MultiBanFakeDetect: Integrating advanced fusion techniques for multimodal detection of Bangla fake news in under-resourced contexts},
journal = {International Journal of Information Management Data Insights},
volume = {5},
number = {2},
pages = {100347},
year = {2025},
issn = {2667-0968},
doi = {https://doi.org/10.1016/j.jjimei.2025.100347},
url = {https://www.sciencedirect.com/science/article/pii/S2667096825000291},
author = {Fatema Tuj Johora Faria and Mukaffi Bin Moin and Zayeed Hasan and Md. Arafat Alam Khandaker and Niful Islam and Khan Md Hasib and M.F. Mridha},
keywords = {Fake news detection, Multimodal dataset, Textual analysis, Visual analysis, Bangla language, Under-resource, Fusion techniques, Deep learning}}
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data, Codes, and Supplementary Figures for "Leveraging Vision Capabilities of Multimodal LLMs for Automated Data Extraction from Plots" (https://arxiv.org/abs/2503.12326)
This repository contains datasets and tools related to PlotExtract, a pipeline for automated plot digitization using LLM-based vision models. Below is a description of the key components.
Dataset Output Files
- *.out_data – Results of LLM-based visual data extraction from plot images. These files contain the extracted data points in CSV-like format.
- *.out_code – Python code generated by the LLM to recreate the source plot using the extracted data.
- *.out_conversation – Full conversations with the LLM conducted by PlotExtract, including prompts and responses.
- interpolated_* – Visual and statistical comparisons based on interpolation between the LLM-extracted data and the ground truth. These correspond to the interpolation accuracy assessments described in the paper.
- pointwise_* – Visual and statistical comparisons on a point-by-point basis between extracted and ground-truth data. These correspond to pointwise accuracy evaluations from the main text.
- *.stats – Numerical summaries of extraction accuracy, referenced in the associated visual comparisons.
- *.csv – Manually extracted ground-truth data used as reference for evaluating extraction accuracy.
All of the above files are generated automatically during PlotExtract execution.
Published, Synthetic, and chartQA Datasets
The Published Dataset does not include original plot images due to copyright restrictions. Instead, each plot is referenced in source_images.csv, which lists:
- DOI of the source publication
- Figure number
- Filename used in this dataset
The Synthetic Dataset includes synthetic plot images, extracted data, generated replots, and evaluation outputs for benchmarking purposes.
The chartQA Dataset (https://doi.org/10.48550/arXiv.2203.10244) includes chartQA plot images, extracted data, generated replots, and evaluation outputs for benchmarking purposes. There are two equivalent datasets, FULL and CROPPED: the first contains the original images and the second contains images cropped as much as possible to preserve only the plot and remove additional text.
Codes
All source code, including PlotExtract and supporting scripts for evaluation and comparison, is included in MPPolak_DMorgan_PlotExtract_Codes.zip. Each script contains usage instructions in-line and is intended to be self-explanatory for users familiar with Python-based data processing workflows.
Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
This repository contains the TamperedNews dataset introduced in the paper:
Eric Müller-Budack, Jonas Theiner, Sebastian Diering, Maximilian Idahl, and Ralph Ewerth. 2020. Multimodal Analytics for Real-world News using Measures of Cross-modal Entity Consistency. In Proceedings of the 2020 International Conference on Multimedia Retrieval (ICMR '20). Association for Computing Machinery, New York, NY, USA, 16–25. DOI: https://doi.org/10.1145/3372278.3390670
dataset.jsonl containing:
entity_type.jsonl file for each entity type containing the following information for each entity:
The source code to reproduce our results as well as download scripts to crawl news texts and images can be found on our GitHub page: https://github.com/TIBHannover/cross-modal_entity_consistency
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Multi Modal Task 2 is a dataset for object detection tasks - it contains Reference Inventory Assembly YSiY annotations for 200 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
multimodal-open-r1-8192-filtered-tighter
Original dataset structure preserved, filtered by token length and image quality
Dataset Description
This dataset was processed using the data-preproc package for vision-language model training.
Processing Configuration
Base Model: allenai/Molmo-7B-O-0924
Tokenizer: allenai/Molmo-7B-O-0924
Sequence Length: 8192
Processing Type: Vision Language (VL)
Dataset Features
input_ids: Tokenized input sequences… See the full description on the dataset page: https://huggingface.co/datasets/penfever/multimodal-open-r1-8192-filtered-tighter.