100+ datasets found
  1. Multimodal Recommendation System Datasets

    • kaggle.com
    Updated Aug 21, 2023
    Cite
    Ignacio Avas (2023). Multimodal Recommendation System Datasets [Dataset]. http://doi.org/10.34740/kaggle/dsv/6338676
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 21, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ignacio Avas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Quick start

    To read any dataset, you can use the following code:

    >>> import numpy as np
    >>> embed_image = np.load('embed_image.npy')
    >>> embed_image.shape
    (33962, 768)
    >>> embed_text = np.load('embed_text.npy')
    >>> embed_text.shape
    (33962, 768)
    >>> import pandas as pd
    >>> items = pd.read_csv('items.txt')
    >>> m = len(items)
    >>> print(f'{m} items in dataset')
    33962 items in dataset
    >>> users = pd.read_csv('users.txt')
    >>> n = len(users)
    >>> print(f'{n} users in dataset')
    14790 users in dataset
    >>> train = pd.read_csv('train.txt')
    >>> train
         user  item
    0    13444 23557
    1    13444 33739
    ...    ...  ...
    317109 13506 29993
    317110 13506 13931
    >>> from scipy.sparse import csr_matrix
    >>> train_matrix = csr_matrix((np.ones(len(train)), (train.user, train.item)), shape=(n,m))
    

    Folders

    This resource contains six datasets. Each dataset is provided with seven combinations of image and text encoders, so you should see 42 folders.

    Each folder is named after the dataset and the encoders used for the visual and textual parts. For example: bookcrossing-vit_bert.

    The datasets are:

    • Clothing, Shoes and Jewelry (Amazon)
    • Home and Kitchen (Amazon)
    • Musical Instruments (Amazon)
    • Movies and TV (Amazon)
    • Book-Crossing
    • Movielens 25M

    And the encoders are:

    • CLIP for image and text (*-clip_clip). This is the main one used in the experiments.
    • ViT and BERT (*-vit_bert)
    • CLIP, visual only (*-clip_none)
    • ViT only (*-vit_none)
    • BERT only (*-none_bert)
    • CLIP, text only (*-none_clip)
    • No textual or visual information (*-none_none)
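    As an illustration, a folder can be selected by combining a dataset name with an encoder pair. A minimal sketch, assuming the working directory contains the folders listed above and that a folder such as bookcrossing-clip_clip follows the documented naming pattern:

    import os
    import numpy as np

    # Folder names follow the pattern <dataset>-<visual>_<textual> described above.
    folder = "bookcrossing-clip_clip"  # Book-Crossing with CLIP image and CLIP text encoders
    embed_image = np.load(os.path.join(folder, "embed_image.npy"))
    embed_text = np.load(os.path.join(folder, "embed_text.npy"))
    print(embed_image.shape, embed_text.shape)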

    Files per folder

    For each dataset, we have the following files, considering M items, N users, textual embeddings with D dimensions (like 1024), and visual embeddings with E dimensions (like 768):

    • embed_image.npy – a NumPy array of MxE elements.
    • embed_text.npy – a NumPy array of MxD elements.
    • items.csv – a CSV with the item ID in the original dataset (like the Amazon ASIN, the movie ID, etc.) and the item number, an integer from 0 to M-1.
    • users.csv – a CSV with the user ID in the original dataset (like the Amazon reviewer ID) and the user number, an integer from 0 to N-1.
    • train.txt, validation.txt and test.txt – CSV files with the portions of the reviews for train, validation, and test. Each row is a positive user-item pair, i.e. an item the user liked or reviewed positively.

    We consider a review "positive" if the rating is four or more (or 8 or more for Book-crossing).

    If an item does not have an image or text, the corresponding embedding vector is zeroed out.

    Dataset stats

    Dataset                       Users     Items    Ratings     Density
    Clothing & Shoes & Jewelry    23318     38493    178944      0.020%
    Home & Kitchen                5968      57645    135839      0.040%
    Movies & TV                   21974     23958    216110      0.041%
    Musical Instruments           14429     29040    93923       0.022%
    Book-crossing                 14790     33962    519613      0.103%
    Movielens 25M                 162541    59047    25000095    0.260%
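    As a rough consistency check, the ratings and density figures above can be recomputed from the split files. A minimal sketch, reusing n, m, and train from the Quick start and assuming the three splits together cover all positive interactions:

    import pandas as pd

    validation = pd.read_csv('validation.txt')
    test = pd.read_csv('test.txt')

    ratings = len(train) + len(validation) + len(test)
    density = ratings / (n * m)
    print(f'{ratings} ratings, density {density:.3%}')  # roughly 0.103% for Book-Crossing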

    Modifications from the original source

    For the Amazon datasets, only a tiny fraction of the original data was taken, by considering reviews in a specific date range.

    For the Bookcrossing dataset, only items with images were considered.

    There are various other minor tweaks on how to obtain images and texts. The repo https://github.com/igui/MultimodalRecomAnalysis has the Notebook and scripts to reproduce the dataset extraction from scratch.

  2. synthetic-multiturn-multimodal

    • huggingface.co
    Updated Jan 28, 2024
    Cite
    Mesolitica (2024). synthetic-multiturn-multimodal [Dataset]. https://huggingface.co/datasets/mesolitica/synthetic-multiturn-multimodal
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 28, 2024
    Dataset authored and provided by
    Mesolitica
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Multiturn Multimodal

    We want to generate synthetic data that is able to understand the position of and relationship between multiple images and multiple audio clips; an example is shown below. All notebooks are at https://github.com/mesolitica/malaysian-dataset/tree/master/chatbot/multiturn-multimodal

      multi-images
    

    synthetic-multi-images-relationship.jsonl, 100000 rows, 109MB. Images at https://huggingface.co/datasets/mesolitica/translated-LLaVA-Pretrain/tree/main

      Example data
    

    {'filename':… See the full description on the dataset page: https://huggingface.co/datasets/mesolitica/synthetic-multiturn-multimodal.
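    A minimal sketch of pulling the JSONL file named above from the Hub and inspecting it with pandas, assuming the file sits at the repository root; the fields beyond filename are not shown in the truncated example, so the sketch only prints what is present:

    import pandas as pd
    from huggingface_hub import hf_hub_download

    # Download one of the JSONL files from the dataset repository.
    path = hf_hub_download(
        repo_id="mesolitica/synthetic-multiturn-multimodal",
        filename="synthetic-multi-images-relationship.jsonl",
        repo_type="dataset",
    )
    df = pd.read_json(path, lines=True)  # 100,000 rows per the description above
    print(df.columns.tolist())
    print(df.iloc[0])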

  3. Source code and data for the PhD Thesis "Measuring the Contributions of...

    • heidata.uni-heidelberg.de
    zip
    Updated Dec 20, 2024
    Cite
    Letitia Parcalabescu (2024). Source code and data for the PhD Thesis "Measuring the Contributions of Vision and Text Modalities in Multimodal Transformers" [Dataset]. http://doi.org/10.11588/DATA/68HOOP
    Explore at:
    Available download formats: zip(17206604), zip(456409), zip(488208), zip(489773), zip(854757425)
    Dataset updated
    Dec 20, 2024
    Dataset provided by
    heiDATA
    Authors
    Letitia Parcalabescu
    License

    https://heidata.uni-heidelberg.de/api/datasets/:persistentId/versions/2.1/customlicense?persistentId=doi:10.11588/DATA/68HOOP

    Dataset funded by
    bwHPC and the German Research Foundation (DFG)
    Description

    This dataset contains source code and data used in the PhD thesis "Measuring the Contributions of Vision and Text Modalities in Multimodal Transformers". The dataset is split into five repositories:

    • Code and resources related to chapter 2 of the thesis (Section 2.2, method described in "Using Scene Graph Representations and Knowledge Bases")
    • Code and resources related to chapter 3 of the thesis (VALSE dataset)
    • Code and resources related to chapter 4 of the thesis: MM-SHAP measure and experiments code
    • Code and resources related to chapter 5 of the thesis: CC-SHAP measure and experiments code related to large language models (LLMs)
    • Code and resources related to the experiments with vision and language model decoders from chapters 3, 4, and 5

  4. aiMotive Multimodal Dataset

    • kaggle.com
    zip
    Updated Dec 16, 2022
    Cite
    Tamas Matuszka (2022). aiMotive Multimodal Dataset [Dataset]. https://www.kaggle.com/datasets/tamasmatuszka/aimotive-multimodal-dataset/code
    Explore at:
    Available download formats: zip(84497743865 bytes)
    Dataset updated
    Dec 16, 2022
    Authors
    Tamas Matuszka
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Autonomous driving is a popular research area within the computer vision research community. Since autonomous vehicles are highly safety-critical, ensuring robustness is essential for real-world deployment. While several public multimodal datasets are accessible, they mainly comprise two sensor modalities (camera, LiDAR) which are not well suited for adverse weather. In addition, they lack far-range annotations, making it harder to train neural networks that are the base of a highway assistant function of an autonomous vehicle. Therefore, we introduce a multimodal dataset for robust autonomous driving with long-range perception. The dataset consists of 176 scenes with synchronized and calibrated LiDAR, camera, and radar sensors covering a 360-degree field of view. The collected data was captured in highway, urban, and suburban areas during daytime, night, and rain and is annotated with 3D bounding boxes with consistent identifiers across frames. Furthermore, we trained unimodal and multimodal baseline models for 3D object detection.

    The paper describing the dataset can be read here: https://arxiv.org/pdf/2211.09445.pdf

    If you use aiMotive Multimodal Dataset in your research, please cite our work by using the following BibTeX entry:

    @article{matuszka2022aimotivedataset,
      title = {aiMotive Dataset: A Multimodal Dataset for Robust Autonomous Driving with Long-Range Perception},
      author = {Matuszka, Tamás and Barton, Iván and Butykai, Ádám and Hajas, Péter and Kiss, Dávid and Kovács, Domonkos and Kunsági-Máté, Sándor and Lengyel, Péter and Németh, Gábor and Pető, Levente and Ribli, Dezső and Szeghy, Dávid and Vajna, Szabolcs and Varga, Bálint},
      doi = {10.48550/ARXIV.2211.09445},
      url = {https://arxiv.org/abs/2211.09445},
      publisher = {arXiv},
      year = {2022},
    }

  5. multimodalpragmatic

    • huggingface.co
    Updated Jun 22, 2024
    Cite
    Tong Liu (2024). multimodalpragmatic [Dataset]. https://huggingface.co/datasets/tongliuphysics/multimodalpragmatic
    Explore at:
    Dataset updated
    Jun 22, 2024
    Authors
    Tong Liu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Multimodal Pragmatic Jailbreak on Text-to-image Models

    Project page | Paper | Code

    The Multimodal Pragmatic Unsafe Prompts (MPUP) is a dataset designed to assess multimodal pragmatic safety in Text-to-Image (T2I) models. It comprises two key sections: image_prompt and text_prompt.

      Dataset Usage
    
    
    
    
    
    
    
      Downloading the Data
    

    To download the dataset, install Huggingface Datasets and then use the following command: from datasets import load_dataset dataset =… See the full description on the dataset page: https://huggingface.co/datasets/tongliuphysics/multimodalpragmatic.
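    The truncated command above presumably follows the standard datasets API; a minimal sketch, with the repository id taken from the citation above:

    from datasets import load_dataset

    # Load the MPUP dataset from the Hugging Face Hub.
    dataset = load_dataset("tongliuphysics/multimodalpragmatic")
    print(dataset)  # shows the available splits
    # Per the description, each example carries an image_prompt and a text_prompt field.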

  6. Test Multi Modal Dataset

    • universe.roboflow.com
    zip
    Updated Dec 19, 2024
    Cite
    MyPersonalWorkspace (2024). Test Multi Modal Dataset [Dataset]. https://universe.roboflow.com/mypersonalworkspace-ch3ye/test-multi-modal
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 19, 2024
    Dataset authored and provided by
    MyPersonalWorkspace
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Docs Descriptions
    Description

    Test Multi Modal

    ## Overview
    
    Test Multi Modal is a dataset for vision language (multimodal) tasks - it contains Docs annotations for 1,998 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  7. Code for generating the HAIM multimodal dataset of MIMIC-IV clinical data...

    • physionet.org
    Updated Aug 23, 2022
    Cite
    Luis R Soenksen; Yu Ma; Cynthia Zeng; Leonard David Jean Boussioux; Kimberly Villalobos Carballo; Liangyuan Na; Holly Wiberg; Michael Li; Ignacio Fuentes; Dimitris Bertsimas (2022). Code for generating the HAIM multimodal dataset of MIMIC-IV clinical data and x-rays [Dataset]. http://doi.org/10.13026/3f8d-qe93
    Explore at:
    Dataset updated
    Aug 23, 2022
    Authors
    Luis R Soenksen; Yu Ma; Cynthia Zeng; Leonard David Jean Boussioux; Kimberly Villalobos Carballo; Liangyuan Na; Holly Wiberg; Michael Li; Ignacio Fuentes; Dimitris Bertsimas
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    This resource generates a multimodal combination of the MIMIC-IV v1.0.0 and MIMIC Chest X-ray (MIMIC-CXR-JPG) v2.0.0 databases, filtered to include only patients who have at least one chest X-ray performed, with the goal of validating multi-modal predictive analytics in healthcare operations. The multimodal dataset generated through this code contains 34,540 individual patient files in the form of "pickle" Python object structures, covering a total of 7,279 hospitalization stays involving 6,485 unique patients. Additionally, code to extract feature embeddings as well as the list of pre-processed features are included in this repository.
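    Since each generated patient file is a pickled Python object, a file produced by this code can be inspected with the standard pickle module; a minimal sketch, with a purely hypothetical filename:

    import pickle

    # Hypothetical path to one of the 34,540 generated per-patient files.
    with open('patient_00000001.pkl', 'rb') as f:
        patient = pickle.load(f)

    print(type(patient))  # inspect the object structure produced by the HAIM pipeline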

  8. ASSIST-IoT Multimodal Fall Detection Dataset

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Sep 14, 2023
    Cite
    Anastasiya Danilenka; Piotr Sowiński; Monika Kobus; Anna Dąbrowska; Kajetan Rachwał; Karolina Bogacka; Krzysztof Baszczyński (2023). ASSIST-IoT Multimodal Fall Detection Dataset [Dataset]. http://doi.org/10.5281/zenodo.8340428
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 14, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anastasiya Danilenka; Piotr Sowiński; Monika Kobus; Anna Dąbrowska; Kajetan Rachwał; Karolina Bogacka; Krzysztof Baszczyński
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Multimodal dataset for fall detection. Includes acceleration data collected from a tag and two smartwatches, and location reported by the tag. More details about the data collection procedure can be found in notes.md.

    Contents

    The repository contains:

    • data/location_data.csv and data/full_acceleration – preprocessed acceleration and location data from 10 participants and mannequin simulated falls with target variable identified
    • data/subsampled_acceleration_data.csv – subsampled acceleration dataset used for training the AI model
    • notes.md – description of activities performed and notes from data collection
    • videos – reference videos for performed activities

    Authors

    Acknowledgements

    This work is part of the ASSIST-IoT project that has received funding from the EU’s Horizon 2020 research and innovation programme under grant agreement No 957258.

    The Central Institute for Labour Protection – National Research Institute provided facilities and equipment for data collection.

    License

    The dataset is licensed under the Creative Commons Attribution 4.0 International License.

  9. MUHACU: A Benchmark Dataset for Multi-modal Human Activity Understanding

    • zenodo.org
    • data.niaid.nih.gov
    Updated Jun 18, 2021
    Cite
    Yue Zhuo; Yaqing Liao; Yuecheng Lei; Lizhen Qu; Xiaojun Chang; Zenglin Xu (2021). MUHACU: A Benchmark Dataset for Multi-modal Human Activity Understanding [Dataset]. http://doi.org/10.5281/zenodo.4968721
    Explore at:
    Dataset updated
    Jun 18, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Yue Zhuo; Yaqing Liao; Yuecheng Lei; Lizhen Qu; Xiaojun Chang; Zenglin Xu
    Description

    Abstract:

    Video understanding goes beyond the level of temporal action recognition. Given a video containing rich human action, we can reason about and predict future actions based on the first several actions in the stream. For a machine, however, it is still difficult to forecast and plan based on the video features of these daily human actions. We formalize the task as Multi-modal Human Activity Understanding: given a small fraction of the original video clip and a set of action sequences, a machine should be able to find the most reasonable action sequence in the set, i.e. the one that best represents the future actions of the observed video frames. We design the task in two settings: one relies entirely on the understanding of the initial video frames; the other provides both the initial state (video frames) and the goal state (high-level intent). We call them Human Action Forecasting and Human Action Planning, respectively. We then propose a fully annotated benchmark called MUHACU (MUlti-modal Human ACtivity Understanding), consisting of 2.9k videos and 157 action classes from the original Charades [1] videos. We refine the original Charades video labels and add more features to aid our task. In addition, we provide two strong baseline systems from two directions, information retrieval and end-to-end training, sharing some insights on potential solutions to this task.

    Introduction:

    We have tailored and refined the original annotation in the Charades dataset by selecting 2.9k videos and crowdsourcing the corresponding intent for each video. To meet the design of the initial state, we generally choose the first 20% of each video's length as the initial state. Along with the dataset, a multi-modal knowledge base is crafted semi-automatically. Containing temporal action relationships, visual and textual features of atomic actions, action sequences, and high-level intents, the knowledge base serves the goal of generalization well. We demonstrate that the Multi-modal Human Activity Understanding (MUHACU) task is challenging for machines by evaluating a strong hybrid end-to-end framework in the format of a multi-modal cloze task.

    In summary, MUHACU facilitates multi-modal learning systems that observe through visual features and forecast and plan in language in a real-world environment. Our contributions, in brief, are: (1) we propose the first multi-modal knowledge base for temporal activity understanding; (2) we propose baselines demonstrating the effectiveness of the knowledge base; (3) we propose a novel multi-modal benchmark for evaluating models backed by the knowledge base and dataset.

    MUHACU contains the following fields:

    KB: 2402 videos

    # of action-level entities                 157
    # of activity video entities               2402
    # of intent for each video                 2
    # of action video entities                 12118
    # of action sequences (non-repeat seq)     2402 (1969)
    # of action state templates                27
    avg. # of action sequence length           5.04

    Features in KB:

    feature                          num            size
    action visual prototype feat     157            [1024,]
    action textual prototype feat    157            [768,]
    intent feat                      2402*2         [768,]
    video-level visual feat          2402 + 12118   [1024,]
    snippet-level visual feat        2402 + 12118   [frames//8, 1024]

    evaluation task: 510 videos for human action planning and human action forecasting

                                      human action planning   human action forecasting
    # of videos (action sequences)    510                      510
    avg. # of observed acts           2.79                     2.79
    avg. # of predicted acts          2.40                     2.40
    avg. # of total acts              5.19                     5.19
    # of choices                      6                        6
    # of answers                      1 (435) / 2 (75)         1
    # of intent                       0                        1

    training dataset split: We also provide a dataset split for training the baseline model to learn the future ground-truth sequence. The initial 2402 KB videos are divided with a standard 8:2 split into training (1921 videos) and validation (481 videos).

    train    validation    test
    1921     481           510

    More details about the dataset are in README.txt

    Availability:

    Our dataset and knowledge base are available online at https://zenodo.org/deposit/4968721 in order to support sustainability. The resource is maintained under the Creative Commons Attribution 4.0 International license, implying re-usability. We follow the widely used FAIR Data principles, which are designed to make resources findable, accessible, interoperable, and re-usable. The GitHub repository containing the complete source code and checkpoints for the baseline systems is available at https://github.com/MUHACU/MUHACU.

    [1] Sigurdsson, Gunnar A., et al. "Hollywood in homes: Crowdsourcing data collection for activity understanding." European Conference on Computer Vision. Springer, Cham, 2016.

  10. Multi Modal Dataset

    • universe.roboflow.com
    zip
    Updated Jun 19, 2024
    Cite
    multimodal (2024). Multi Modal Dataset [Dataset]. https://universe.roboflow.com/multimodal-7wsdz/multi-modal/model/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 19, 2024
    Dataset authored and provided by
    multimodal
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Reference Inventory Assembly Bounding Boxes
    Description

    Multi Modal

    ## Overview
    
    Multi Modal is a dataset for object detection tasks - it contains Reference Inventory Assembly annotations for 200 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  11. MELD Preprocessed

    • kaggle.com
    zip
    Updated Mar 1, 2025
    Cite
    Argish Abhangi (2025). MELD Preprocessed [Dataset]. https://www.kaggle.com/datasets/argish/meld-preprocessed
    Explore at:
    Available download formats: zip(3527202381 bytes)
    Dataset updated
    Mar 1, 2025
    Authors
    Argish Abhangi
    Description

    The MELD Preprocessed Dataset is a multi-modal dataset designed for research on emotion recognition from audio, video, and textual data. The dataset builds upon the original MELD dataset and applies extensive preprocessing steps to extract features from different modalities. Each sample is saved as a .pt file containing a dictionary of preprocessed features, making it easy for developers to load and integrate into PyTorch-based workflows.

    Data Sources

    • Audio: Waveforms extracted from the original video files.
    • Video: Video files are processed to sample frames at a target frame rate (default: 2 fps) and to detect faces using a Haar Cascade classifier.
    • Text: Utterances from the dialogue, which are cleaned using custom encoding functions to fix potential byte encoding issues.
    • Emotion Labels: Each sample is associated with an emotion label.

    Preprocessing Pipeline

    The preprocessing script performs several key steps:

    1. Text Cleaning:

      • fix_encoding_with_bytes(text): Decodes text from bytes using UTF-8, Latin-1, or cp1252, ensuring correct encoding.
      • replace_double_encoding(text): Fixes issues related to double-encoded characters (e.g., replacing "Â’" with the proper apostrophe).
    2. Audio Processing:

      • Extracts raw audio waveform from each sample.
      • Computes a Mel-spectrogram using torchaudio.transforms.MelSpectrogram with 64 mel bins (VGGish format).
      • Converts the spectrogram to a logarithmic scale for numerical stability (a short sketch of this step follows the list).
    3. Video Processing:

      • Reads video frames at a specified target FPS (default: 2 fps) using OpenCV.
      • For each video, samples frames evenly based on the original video's FPS.
      • Applies Haar Cascade face detection on the frames to extract the first detected face.
      • Resizes the detected face to 224x224 and converts it to RGB. If no face is detected, a default black image (224x224x3) is returned.
    4. Saving Processed Samples:

      • Each sample is saved as a .pt file in a directory structure split by data type (train, dev, and test).
      • The filename is derived from the original video filename (e.g., dia0_utt1.mp4 becomes dia0_utt1.pt).
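    The audio step (item 2 above) can be sketched as follows; the waveform path, the loading call, and the log epsilon are illustrative assumptions, not the dataset's exact code:

    import torch
    import torchaudio

    # Load a raw waveform (hypothetical file) and compute a 64-bin Mel-spectrogram,
    # then convert it to a logarithmic scale for numerical stability.
    waveform, sample_rate = torchaudio.load("dia0_utt0_audio.wav")
    mel_transform = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_mels=64)
    audio_mel = torch.log(mel_transform(waveform) + 1e-6)
    print(audio_mel.shape)  # [channels, n_mels, time], matching the audio_mel field described below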

    Data Format

    Each preprocessed sample is stored in a .pt file and contains a dictionary with the following keys:

    • utterance (str): The cleaned textual utterance.
    • emotion (str/int): The corresponding emotion label.
    • video_path (str): Original path to the video file from which the sample was extracted.
    • audio (Tensor): Raw audio waveform tensor of shape [channels, time].
    • audio_sample_rate (int): The sampling rate of the audio waveform.
    • audio_mel (Tensor): The computed log-scaled Mel-spectrogram with shape [channels, n_mels, time].
    • face (NumPy array): The extracted face image (RGB format) of shape (224, 224, 3). If no face was detected, a default black image is provided.

    Directory Structure

    The preprocessed files are organized into splits:

    preprocessed_data/
    ├── train/
    │   ├── dia0_utt0.pt
    │   ├── dia1_utt1.pt
    │   └── ...
    ├── dev/
    │   ├── dia0_utt0.pt
    │   ├── dia1_utt1.pt
    │   └── ...
    └── test/
        ├── dia0_utt0.pt
        ├── dia1_utt1.pt
        └── ...

    Loading and Using the Dataset

    A custom PyTorch dataset and DataLoader are provided to facilitate easy integration:

    Dataset Class

    from torch.utils.data import Dataset
    import os
    import torch
    
    class PreprocessedMELDDataset(Dataset):
      def __init__(self, data_dir):
        """
        Args:
          data_dir (str): Directory where preprocessed .pt files are stored.
        """
        self.data_dir = data_dir
        self.files = [os.path.join(data_dir, f) for f in os.listdir(data_dir) if f.endswith('.pt')]
        
      def __len__(self):
        return len(self.files)
      
      def __getitem__(self, idx):
        sample_path = self.files[idx]
        sample = torch.load(sample_path)
        return sample
    

    Custom Collate Function

    def preprocessed_collate_fn(batch):
      """
      Collates a list of sample dictionaries into a single dictionary with keys mapping to lists.
      Modify this function to pad or stack tensor data if needed.
      """
      collated = {}
      collated['utterance'] = [sample['utterance'] for sample in batch]
      collated['emotion'] = [sample['emotion'] for sample in batch]
      collated['video_path'] = [sample['video_path'] for sample in batch]
      collated['audio'] = [sample['audio'] for sample in batch]
      collated['audio_sample_rate'] = batch[0]['audio_sample_rate']
      collated['audio_mel'] = [sample['audio_mel'] for sample in batch]
      collated['face'] = [sample['face'] for sample in batch]
      return collated
    

    Creating DataLoaders

    from torch.utils.data import DataLoader
    
    # Define paths for each split
    train_data_dir = "preprocessed_data/train"
    dev_data_dir = "preproces...
    
  12. Datasets for Evaluation of Multimodal Image Registration

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Oct 11, 2021
    Cite
    Jiahao Lu; Johan Öfverstedt; Joakim Lindblad; Nataša Sladoje (2021). Datasets for Evaluation of Multimodal Image Registration [Dataset]. http://doi.org/10.5281/zenodo.5557568
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 11, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jiahao Lu; Johan Öfverstedt; Joakim Lindblad; Nataša Sladoje
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description

    • Aerial data
    • The Aerial dataset is divided into 3 sub-groups by IDs: {7, 9, 20, 3, 15, 18}, {10, 1, 13, 4, 11, 6, 16}, {14, 8, 17, 5, 19, 12, 2}. Since the images vary in size, each image is subdivided into the maximal number of equal-sized non-overlapping regions such that each region can contain exactly one 300x300 px image patch. Then one 300x300 px image patch is extracted from the centre of each region. This 3-fold grouping followed by splitting results in 72 test samples per evaluation fold.
      • Modality A: Near-Infrared (NIR)

      • Modality B: three colour channels (in B-G-R order)

    • Cytological data
    • The Cytological data contains images from 3 different cell lines; all images from one cell line are treated as one fold in 3-fold cross-validation. Each image in the dataset is subdivided from 600x600 px into 2x2 patches of size 300x300 px, so that there are 420 test samples in each evaluation fold.
      • Modality A: Fluorescence Images

      • Modality B: Quantitative Phase Images (QPI)

    • Histological dataset
    • For the Histological data, to avoid making registration too easy by relying on the circular border of the TMA cores, the evaluation images are created by cutting 834x834 px patches from the centres of the original 134 TMA image pairs.
      • Modality A: Second Harmonic Generation (SHG)

      • Modality B: Bright-Field (BF)

    The evaluation set created from the above three publicly available 2D datasets consists of images that have undergone 4 levels of (rigid) transformations of increasing displacement. The level of transformation is determined by the size of the rotation angle θ and the displacements tx & ty, detailed in the accompanying table. Each image sample is transformed exactly once at each transformation level, so that all levels have the same number of samples.

    • Radiological data
    • The Radiological dataset is divided into 3 sub-groups by patient IDs: {109, 106, 003, 006}, {108, 105, 007, 001}, {107, 102, 005, 009}. Since the Radiological dataset is non-isotropic (and also of varying resolution), it is resampled using B-spline interpolation to 1 mm3 cubic voxels, taking explicit care to not resample twice; displaced volumes are transformed and resampled in one step.
      • Modality A: T1-weighted MRI

      • Modality B: T2-weighted MRI

    (Run make_rire_patches.py to generate the sub-volumes.)

    Reference sub-volumes of size 210x210x70 voxels are cropped directly from centres of the (non-displaced) resampled volumes. Similarly as for the aforementioned 2D datasets, random (uniformly-distributed) transformations are composed of rotations θx, θy ∈ [-4, 4] degrees around the x- and y-axes, rotation θz ∈ [-20, 20] degrees around the z-axis, translations tx, ty ∈ [-19.6, 19.6] voxels in x and y directions and translation tz ∈ [-6.5, 6.5] voxels in z direction. 40 rigid transformations of increasing sizes of displacement are applied to each volume. Transformed sub-volumes, of size 210x210x70 voxels, are cropped from centres of the transformed and resampled volumes.

    In total, it contains 864 image pairs created from the aerial dataset, 5040 image pairs created from the cytological dataset, 536 image pairs created from the histological dataset, and metadata with scripts to create the 480 volume pairs from the radiological dataset. Each image pair consists of a reference patch \(I^{\text{Ref}}\) and its corresponding initial transformed patch \(I^{\text{Init}}\) in both modalities, along with the ground-truth transformation parameters to recover it.

    Scripts to calculate the registration performance and to plot the overall results can be found in https://github.com/MIDA-group/MultiRegEval, and instructions to generate more evaluation data with different settings can be found in https://github.com/MIDA-group/MultiRegEval/tree/master/Datasets#instructions-for-customising-evaluation-data.

    Metadata

    In the *.zip files, each row in {Zurich,Balvan}_patches/fold[1-3]/patch_tlevel[1-4]/info_test.csv or Eliceiri_patches/patch_tlevel[1-4]/info_test.csv provides the information of an image pair as follows:

    • Filename: identifier(ID) of the image pair

    • X1_Ref: x-coordinate of the upper-left corner of reference patch IRef

    • Y1_Ref: y-coordinate of the upper-left corner of reference patch IRef

    • X2_Ref: x-coordinate of the lower-left corner of reference patch IRef

    • Y2_Ref: y-coordinate of the lower-left corner of reference patch IRef

    • X3_Ref: x-coordinate of the lower-right corner of reference patch IRef

    • Y3_Ref: y-coordinate of the lower-right corner of reference patch IRef

    • X4_Ref: x-coordinate of the upper-right corner of reference patch IRef

    • Y4_Ref: y-coordinate of the upper-right corner of reference patch IRef

    • X1_Trans: x-coordinate of the upper-left corner of transformed patch IInit

    • Y1_Trans: y-coordinate of the upper-left corner of transformed patch IInit

    • X2_Trans: x-coordinate of the lower-left corner of transformed patch IInit

    • Y2_Trans: y-coordinate of the lower-left corner of transformed patch IInit

    • X3_Trans: x-coordinate of the lower-right corner of transformed patch IInit

    • Y3_Trans: y-coordinate of the lower-right corner of transformed patch IInit

    • X4_Trans: x-coordinate of the upper-right corner of transformed patch IInit

    • Y4_Trans: y-coordinate of the upper-right corner of transformed patch IInit

    • Displacement: mean Euclidean distance between reference corner points and transformed corner points (a computation sketch follows the column lists below)

    • RelativeDisplacement: the ratio of displacement to the width/height of image patch

    • Tx: randomly generated translation in the x-direction to synthesise the transformed patch IInit

    • Ty: randomly generated translation in the y-direction to synthesise the transformed patch IInit

    • AngleDegree: randomly generated rotation in degrees to synthesise the transformed patch IInit

    • AngleRad: randomly generated rotation in radian to synthesise the transformed patch IInit

    In addition, each row in RIRE_patches/fold[1-3]/patch_tlevel[1-4]/info_test.csv has the following columns:

    • Z1_Ref: z-coordinate of the upper-left corner of reference patch IRef
    • Z2_Ref: z-coordinate of the lower-left corner of reference patch IRef
    • Z3_Ref: z-coordinate of the lower-right corner of reference patch IRef
    • Z4_Ref: z-coordinate of the upper-right corner of reference patch IRef
    • Z1_Trans: z-coordinate of the upper-left corner of transformed patch IInit
    • Z2_Trans: z-coordinate of the lower-left corner of transformed patch IInit
    • Z3_Trans: z-coordinate of the lower-right corner of transformed patch IInit
    • Z4_Trans: z-coordinate of the upper-right corner of transformed patch IInit
    • (...and similarly, coordinates of the 5th-8th corners)
    • Tz: randomly generated translation in z-direction to synthesise the transformed patch IInit
    • AngleDegreeX: randomly generated rotation around X-axis in degrees to synthesise the transformed patch IInit
    • AngleRadX: randomly generated rotation around X-axis in radian to synthesise the transformed patch IInit
    • AngleDegreeY: randomly generated rotation around Y-axis in degrees to synthesise the transformed patch IInit
    • AngleRadY: randomly generated rotation around Y-axis in radian to synthesise the transformed patch IInit
    • AngleDegreeZ: randomly generated rotation around Z-axis in degrees to synthesise the transformed patch IInit
    • AngleRadZ: randomly generated rotation around Z-axis in radian to synthesise the transformed patch IInit
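    The Displacement column described above can be recomputed from the corner coordinates. A minimal sketch for the 2D case, using one example path from the metadata description (stored values may differ slightly due to rounding):

    import numpy as np
    import pandas as pd

    info = pd.read_csv("Zurich_patches/fold1/patch_tlevel1/info_test.csv")

    # Stack the four (x, y) corner points of each patch.
    ref = info[[f"{a}{i}_Ref" for i in range(1, 5) for a in ("X", "Y")]].to_numpy().reshape(-1, 4, 2)
    trans = info[[f"{a}{i}_Trans" for i in range(1, 5) for a in ("X", "Y")]].to_numpy().reshape(-1, 4, 2)

    # Mean Euclidean distance between reference and transformed corners.
    displacement = np.linalg.norm(ref - trans, axis=2).mean(axis=1)
    print(np.abs(displacement - info["Displacement"]).max())  # should be close to zero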

    Naming convention

    • Aerial data
      • zh{ID}_{iRow}_{iCol}_{ReferenceOrTransformed}.png
      • Example: zh5_03_02_R.png indicates the Reference patch of the 3rd row and 2nd column cut from the image with ID zh5.
    • Cytological data
      • {cellline}_{treatment}_{fieldofview}_{iFrame}_{iRow}_{iCol}_{ReferenceOrTransformed}.png
      • Example: PNT1A_do_1_f15_02_01_T.png indicates the Transformed
      
  13. Data from: REASSEMBLE: A Multimodal Dataset for Contact-rich Robotic...

    • researchdata.tuwien.ac.at
    txt, zip
    Updated Jul 15, 2025
    Cite
    Daniel Jan Sliwowski; Shail Jadav; Sergej Stanovcic; Jędrzej Orbik; Johannes Heidersberger; Dongheui Lee (2025). REASSEMBLE: A Multimodal Dataset for Contact-rich Robotic Assembly and Disassembly [Dataset]. http://doi.org/10.48436/0ewrv-8cb44
    Explore at:
    Available download formats: zip, txt
    Dataset updated
    Jul 15, 2025
    Dataset provided by
    TU Wien
    Authors
    Daniel Jan Sliwowski; Shail Jadav; Sergej Stanovcic; Jędrzej Orbik; Johannes Heidersberger; Dongheui Lee
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 9, 2025 - Jan 14, 2025
    Description

    REASSEMBLE: A Multimodal Dataset for Contact-rich Robotic Assembly and Disassembly

    📋 Introduction

    Robotic manipulation remains a core challenge in robotics, particularly for contact-rich tasks such as industrial assembly and disassembly. Existing datasets have significantly advanced learning in manipulation but are primarily focused on simpler tasks like object rearrangement, falling short of capturing the complexity and physical dynamics involved in assembly and disassembly. To bridge this gap, we present REASSEMBLE (Robotic assEmbly disASSEMBLy datasEt), a new dataset designed specifically for contact-rich manipulation tasks. Built around the NIST Assembly Task Board 1 benchmark, REASSEMBLE includes four actions (pick, insert, remove, and place) involving 17 objects. The dataset contains 4,551 demonstrations, of which 4,035 were successful, spanning a total of 781 minutes. Our dataset features multi-modal sensor data including event cameras, force-torque sensors, microphones, and multi-view RGB cameras. This diverse dataset supports research in areas such as learning contact-rich manipulation, task condition identification, action segmentation, and more. We believe REASSEMBLE will be a valuable resource for advancing robotic manipulation in complex, real-world scenarios.

    ✨ Key Features

    • Multimodality: REASSEMBLE contains data from robot proprioception, RGB cameras, Force&Torque sensors, microphones, and event cameras
    • Multitask labels: REASSEMBLE contains labeling which enables research in Temporal Action Segmentation, Motion Policy Learning, Anomaly detection, and Task Inversion.
    • Long horizon: Demonstrations in the REASSEMBLE dataset cover long horizon tasks and actions which usually span multiple steps.
    • Hierarchical labels: REASSEMBLE contains actions segmentation labels at two hierarchical levels.

    🔴 Dataset Collection

    Each demonstration starts by randomizing the board and object poses, after which an operator teleoperates the robot to assemble and disassemble the board while narrating their actions and marking task segment boundaries with key presses. The narrated descriptions are transcribed using Whisper [1], and the board and camera poses are measured at the beginning using a motion capture system, though continuous tracking is avoided due to interference with the event camera. Sensory data is recorded with rosbag and later post-processed into HDF5 files without downsampling or synchronization, preserving raw data and timestamps for future flexibility. To reduce memory usage, video and audio are stored as encoded MP4 and MP3 files, respectively. Transcription errors are corrected automatically or manually, and a custom visualization tool is used to validate the synchronization and correctness of all data and annotations. Missing or incorrect entries are identified and corrected, ensuring the dataset’s completeness. Low-level Skill annotations were added manually after data collection, and all labels were carefully reviewed to ensure accuracy.

    📑 Dataset Structure

    The dataset consists of several HDF5 (.h5) and JSON (.json) files, organized into two directories. The poses directory contains the JSON files, which store the poses of the cameras and the board in the world coordinate frame. The data directory contains the HDF5 files, which store the sensory readings and annotations collected as part of the REASSEMBLE dataset. Each JSON file can be matched with its corresponding HDF5 file based on their filenames, which include the timestamp when the data was recorded. For example, 2025-01-09-13-59-54_poses.json corresponds to 2025-01-09-13-59-54.h5.

    The structure of the JSON files is as follows:

    {"Hama1": [
        [x ,y, z],
        [qx, qy, qz, qw]
     ], 
     "Hama2": [
        [x ,y, z],
        [qx, qy, qz, qw]
     ], 
     "DAVIS346": [
        [x ,y, z],
        [qx, qy, qz, qw]
     ], 
     "NIST_Board1": [
        [x ,y, z],
        [qx, qy, qz, qw]
     ]
    }

    [x, y, z] represent the position of the object, and [qx, qy, qz, qw] represent its orientation as a quaternion.

    The HDF5 (.h5) format organizes data into two main types of structures: datasets, which hold the actual data, and groups, which act like folders that can contain datasets or other groups. In the diagram below, groups are shown as folder icons, and datasets as file icons. The main group of the file directly contains the video, audio, and event data. To save memory, video and audio are stored as encoded byte strings, while event data is stored as arrays. The robot’s proprioceptive information is kept in the robot_state group as arrays. Because different sensors record data at different rates, the arrays vary in length (signified by the N_xxx variable in the data shapes). To align the sensory data, each sensor’s timestamps are stored separately in the timestamps group. Information about action segments is stored in the segments_info group. Each segment is saved as a subgroup, named according to its order in the demonstration, and includes a start timestamp, end timestamp, a success indicator, and a natural language description of the action. Within each segment, low-level skills are organized under a low_level subgroup, following the same structure as the high-level annotations.

    [Diagram of the HDF5 file layout omitted.]
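    A minimal sketch of opening one recording and its matching pose file, assuming only the directory layout and group names described above (the exact dataset names inside each segment group are not assumed, only listed):

    import json
    import h5py

    stamp = "2025-01-09-13-59-54"  # example timestamp from the description above

    # Camera and board poses in the world coordinate frame.
    with open(f"poses/{stamp}_poses.json") as f:
        poses = json.load(f)
    print(poses["NIST_Board1"])  # [[x, y, z], [qx, qy, qz, qw]]

    # Sensor data and annotations.
    with h5py.File(f"data/{stamp}.h5", "r") as f:
        print(list(f.keys()))  # video/audio/event data, robot_state, timestamps, segments_info, ...
        for name, segment in f["segments_info"].items():
            print(name, list(segment.keys()))  # start/end timestamps, success flag, description, low_level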

    The splits folder contains two text files which list the h5 files used for the training and validation splits.

    📌 Important Resources

    The project website contains more details about the REASSEMBLE dataset. The code for loading and visualizing the data is available in our GitHub repository.

    📄 Project website: https://tuwien-asl.github.io/REASSEMBLE_page/
    💻 Code: https://github.com/TUWIEN-ASL/REASSEMBLE

    ⚠️ File comments

    Below is a table listing the recordings that have issues. Issues typically correspond to missing data from one of the sensors.

    Recording                Issue
    2025-01-10-15-28-50.h5   hand cam missing at beginning
    2025-01-10-16-17-40.h5   missing hand cam
    2025-01-10-17-10-38.h5   hand cam missing at beginning
    2025-01-10-17-54-09.h5   no empty action at

  14. Data from: kerta

    • huggingface.co
    Updated Nov 25, 2025
    Cite
    Budi Susanto (2025). kerta [Dataset]. http://doi.org/10.57967/hf/7089
    Explore at:
    Dataset updated
    Nov 25, 2025
    Authors
    Budi Susanto
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Kerta Corpus: Multimodal Code Readability Dataset

      Summary
    

    Kerta Corpus is a multimodal dataset for code readability research. This dataset combines:

    • Metric features from the Scalabrino tool, which includes the feature definitions of Scalabrino, Buse and Weimer, and Posnett
    • Hand-crafted code metrics (56 static metrics) (in progress)
    • Rendered code highlight images (PNG format)
    • A Java Method Declaration corpus labeled into three readability classes: 0 — Unreadable, 1 —… See the full description on the dataset page: https://huggingface.co/datasets/budsus/kerta.
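    A minimal sketch of loading the corpus with the Hugging Face datasets library, using the repository id from the dataset page above; split and column names are not spelled out in the truncated summary, so the sketch only prints them:

    from datasets import load_dataset

    dataset = load_dataset("budsus/kerta")
    print(dataset)  # available splits and columns

    first_split = list(dataset.keys())[0]
    example = dataset[first_split][0]
    print(example.keys())  # per the summary: metric features, rendered image, readability class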

  15. DeepFashion-MultiModal

    • kaggle.com
    • opendatalab.com
    zip
    Updated Sep 16, 2024
    Cite
    silverstone (2024). DeepFashion-MultiModal [Dataset]. https://www.kaggle.com/datasets/silverstone1903/deep-fashion-multimodal/code
    Explore at:
    Available download formats: zip(2025725175 bytes)
    Dataset updated
    Sep 16, 2024
    Authors
    silverstone
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The original dataset contains 44K images; this version contains only the front-view images.

    DeepFashion-MultiModal

    Text2Human: Text-Driven Controllable Human Image Generation
    Yuming Jiang, Shuai Yang, Haonan Qiu, Wayne Wu, Chen Change Loy and Ziwei Liu
    In ACM Transactions on Graphics (Proceedings of SIGGRAPH), 2022.

    From MMLab@NTU, affiliated with S-Lab, Nanyang Technological University, and SenseTime Research.


    [Project Page] | [Paper] | [Code] | [Demo Video]

    DeepFashion-MultiModal is a large-scale high-quality human dataset with rich multi-modal annotations. It has the following properties:

    1. It contains 44,096 high-resolution human images, including 12,701 full-body human images.
    2. For each full-body image, we manually annotate the human parsing labels of 24 classes.
    3. For each full-body image, we manually annotate the keypoints.
    4. We extract DensePose for each human image.
    5. Each image is manually annotated with attributes for both clothes shapes and textures.
    6. We provide a textual description for each image.

    @article{jiang2022text2human,
     title={Text2Human: Text-Driven Controllable Human Image Generation},
     author={Jiang, Yuming and Yang, Shuai and Qiu, Haonan and Wu, Wayne and Loy, Chen Change and Liu, Ziwei},
     journal={ACM Transactions on Graphics (TOG)},
     volume={41},
     number={4},
     articleno={162},
     pages={1--11},
     year={2022},
     publisher={ACM New York, NY, USA},
     doi={10.1145/3528223.3530104},
    }
    
    @inproceedings{liuLQWTcvpr16DeepFashion,
     author = {Liu, Ziwei and Luo, Ping and Qiu, Shi and Wang, Xiaogang and Tang, Xiaoou},
     title = {DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations},
     booktitle = {Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
     month = {June},
     year = {2016}
     }
    
  16. MultiBanFakeDetect: Multimodal Bangla Fake News

    • kaggle.com
    zip
    Updated Aug 14, 2025
    Cite
    Mukaffi Moin (2025). MultiBanFakeDetect: Multimodal Bangla Fake News [Dataset]. https://www.kaggle.com/datasets/mukaffimoin/multibanfakedetect-multimodal-bangla-fake-news/code
    Explore at:
    Available download formats: zip(2608129399 bytes)
    Dataset updated
    Aug 14, 2025
    Authors
    Mukaffi Moin
    License

    https://cdla.io/permissive-1-0/

    Description

    MultiBanFakeDetect Dataset

    The MultiBanFakeDetect dataset consists of 9,600 text–image instances collected from online forums, news websites, and social media. It covers multiple themes — political, social, technology, and entertainment — with a balanced distribution of real and fake instances.

    The dataset is split into:

    • Training: 7,680 instances
    • Testing: 960 instances
    • Validation: 960 instances

    📊 Statistical Overview – Types of Fake News

    Type            Training   Testing   Validation
    Misinformation  1,288      161       162
    Rumor           1,215      152       151
    Clickbait       1,337      167       167
    Non-fake        3,840      480       480
    Total           7,680      960       960

    🏷️ Distribution by Labels

    Label          Training   Testing   Validation
    1 (Fake)       3,840      480       480
    0 (Non-Fake)   3,840      480       480
    Total          7,680      960       960

    🌍 Statistical Overview – Categories of Fake News

    Category        Training   Testing   Validation
    Entertainment   640        80        80
    Sports          640        80        80
    Technology      640        80        80
    National        640        80        80
    Lifestyle       640        80        80
    Politics        640        80        80
    Education       640        80        80
    International   640        80        80
    Crime           640        80        80
    Finance         640        80        80
    Business        640        80        80
    Miscellaneous   640        80        80
    Total           7,680      960       960
    @article{FARIA2025100347,
    title = {MultiBanFakeDetect: Integrating advanced fusion techniques for multimodal detection of Bangla fake news in under-resourced contexts},
    journal = {International Journal of Information Management Data Insights},
    volume = {5},
    number = {2},
    pages = {100347},
    year = {2025},
    issn = {2667-0968},
    doi = {https://doi.org/10.1016/j.jjimei.2025.100347},
    url = {https://www.sciencedirect.com/science/article/pii/S2667096825000291},
    author = {Fatema Tuj Johora Faria and Mukaffi Bin Moin and Zayeed Hasan and Md. Arafat Alam Khandaker and Niful Islam and Khan Md Hasib and M.F. Mridha},
    keywords = {Fake news detection, Multimodal dataset, Textual analysis, Visual analysis, Bangla language, Under-resource, Fusion techniques, Deep learning}}
    
    
  17. Data, Codes, and Supplementary Figures for "Leveraging Vision Capabilities...

    • figshare.com
    zip
    Updated Oct 21, 2025
    Cite
    Dane Morgan; Maciej P. Polak (2025). Data, Codes, and Supplementary Figures for "Leveraging Vision Capabilities of Multimodal LLMs for Automated Data Extraction from Plots" [Dataset]. http://doi.org/10.6084/m9.figshare.28559639.v2
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 21, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Dane Morgan; Maciej P. Polak
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data, Codes, and Supplementary Figures for "Leveraging Vision Capabilities of Multimodal LLMs for Automated Data Extraction from Plots" (https://arxiv.org/abs/2503.12326)

    This repository contains datasets and tools related to PlotExtract, a pipeline for automated plot digitization using LLM-based vision models. Below is a description of the key components.

    Dataset output files

    • *.out_data – results of LLM-based visual data extraction from plot images; these files contain the extracted data points in CSV-like format.
    • *.out_code – Python code generated by the LLM to recreate the source plot using the extracted data.
    • *.out_conversation – full conversations with the LLM conducted by PlotExtract, including prompts and responses.
    • interpolated_* – visual and statistical comparisons based on interpolation between the LLM-extracted data and the ground truth; these correspond to the interpolation accuracy assessments described in the paper.
    • pointwise_* – visual and statistical comparisons on a point-by-point basis between extracted and ground-truth data; these correspond to the pointwise accuracy evaluations from the main text.
    • *.stats – numerical summaries of extraction accuracy, referenced in the associated visual comparisons.
    • *.csv – manually extracted ground-truth data used as reference for evaluating extraction accuracy.

    All of the above files are generated automatically during PlotExtract execution.

    Published, Synthetic, and ChartQA datasets

    The Published dataset does not include original plot images due to copyright restrictions. Instead, each plot is referenced in source_images.csv, which lists the DOI of the source publication, the figure number, and the filename used in this dataset.

    The Synthetic dataset includes synthetic plot images, extracted data, generated replots, and evaluation outputs for benchmarking purposes.

    The ChartQA dataset (https://doi.org/10.48550/arXiv.2203.10244) includes ChartQA plot images, extracted data, generated replots, and evaluation outputs for benchmarking purposes. There are two equivalent versions, FULL and CROPPED: the first contains the original images, and the second contains images cropped as much as possible to preserve only the plot and remove additional text.

    Codes

    All source code, including PlotExtract and supporting scripts for evaluation and comparison, is included in MPPolak_DMorgan_PlotExtract_Codes.zip. Each script contains usage instructions in-line and is intended to be self-explanatory for users familiar with Python-based data processing workflows.

  18. TamperedNews Dataset

    • data.uni-hannover.de
    partaa, partab +15
    Updated Jan 20, 2022
    Cite
    TIB (2022). TamperedNews Dataset [Dataset]. https://data.uni-hannover.de/dataset/tamperednews
    Explore at:
    Available download formats: partae, partao, partaf, tar, partaa, partac, partad, partab, partag, partam, partan, partak, partaj, partai, partah, partap, partal
    Dataset updated
    Jan 20, 2022
    Dataset authored and provided by
    TIB
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description

    Multimodal Analytics for Real-world News using Measures of Cross-modal Entity Consistency

    This repository contains the TamperedNews dataset introduced in the paper:

    Eric Müller-Budack, Jonas Theiner, Sebastian Diering, Maximilian Idahl, and Ralph Ewerth. 2020. Multimodal Analytics for Real-world News using Measures of Cross-modal Entity Consistency. In Proceedings of the 2020 International Conference on Multimedia Retrieval (ICMR '20). Association for Computing Machinery, New York, NY, USA, 16–25. DOI: https://doi.org/10.1145/3372278.3390670

    Content

    • tamperednews.tar.gz:
      • dataset.jsonl containing:
        • Web links to the news texts
        • Web links to the news image
        • Outputs of the named entity recognition and disambiguation (NERD) approach
        • Untampered and tampered entities
      • entity_type.jsonl file for each entity type containing the following information for each entity:
        • Wikidata ID
        • Wikidata label
        • Meta information used for tampering
        • Web links to all reference images crawled from Google, Bing, and Wikidata
      • splits for testing and validation
    • tamperednews_features.tar.gz:
      • Visual features of the news images for persons, locations, and scenes
      • Visual features of the reference images for persons, locations, and scenes
    • tamperednews_wordembeddings.tar.gz: Word embeddings of all nouns in the news texts
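
    Since dataset.jsonl is a JSON Lines file, it can be inspected with a few lines of Python. A minimal sketch, assuming the archive has been unpacked into a tamperednews/ folder (the exact field names are not documented here, so the code only prints the keys):

    import json

    # Path is an assumption; adjust to wherever tamperednews.tar.gz was extracted.
    with open('tamperednews/dataset.jsonl', 'r', encoding='utf-8') as f:
        for line in f:
            document = json.loads(line)        # one news document per line
            print(sorted(document.keys()))     # e.g. links to text/image, NERD output, tampered entities
            break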

    Source Code

    The source code to reproduce our results as well as download scripts to crawl news texts and images can be found on our GitHub page: https://github.com/TIBHannover/cross-modal_entity_consistency

  19. Multi Modal Task 2 Dataset

    • universe.roboflow.com
    zip
    Updated Jun 19, 2024
    Cite
    multimodal (2024). Multi Modal Task 2 Dataset [Dataset]. https://universe.roboflow.com/multimodal-7wsdz/multi-modal-task-2/dataset/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 19, 2024
    Dataset authored and provided by
    multimodal
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Reference Inventory Assembly YSiY Bounding Boxes
    Description

    Multi Modal Task 2

    ## Overview
    
    Multi Modal Task 2 is a dataset for object detection tasks - it contains Reference Inventory Assembly YSiY annotations for 200 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
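
    A minimal sketch of the usual Roboflow Python download flow, using the workspace and project slugs from this dataset's URL (the API key is a placeholder and the "coco" export format is an assumption; other formats are also offered):

    from roboflow import Roboflow

    # Placeholder API key; replace with your own from the Roboflow dashboard.
    rf = Roboflow(api_key="YOUR_API_KEY")
    project = rf.workspace("multimodal-7wsdz").project("multi-modal-task-2")
    dataset = project.version(1).download("coco")
    print(dataset.location)   # local folder with the downloaded images and annotations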
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  20. multimodal-open-r1-8192-filtered-tighter

    • huggingface.co
    Updated Jun 8, 2025
    + more versions
    Cite
    Benjamin Feuer (2025). multimodal-open-r1-8192-filtered-tighter [Dataset]. https://huggingface.co/datasets/penfever/multimodal-open-r1-8192-filtered-tighter
    Explore at:
    Dataset updated
    Jun 8, 2025
    Authors
    Benjamin Feuer
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    multimodal-open-r1-8192-filtered-tighter

    Original dataset structure preserved, filtered by token length and image quality

      Dataset Description
    

    This dataset was processed using the data-preproc package for vision-language model training.

      Processing Configuration
    

    Base Model: allenai/Molmo-7B-O-0924
    Tokenizer: allenai/Molmo-7B-O-0924
    Sequence Length: 8192
    Processing Type: Vision Language (VL)

      Dataset Features
    

    input_ids: Tokenized input sequences… See the full description on the dataset page: https://huggingface.co/datasets/penfever/multimodal-open-r1-8192-filtered-tighter.
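
    A minimal sketch for loading this dataset with the Hugging Face datasets library (the split name is an assumption, and the columns beyond input_ids are not enumerated here):

    from datasets import load_dataset

    # Loads directly from the Hugging Face Hub; "train" split is an assumption.
    ds = load_dataset("penfever/multimodal-open-r1-8192-filtered-tighter", split="train")
    print(ds.features)   # inspect available columns, e.g. input_ids
    print(len(ds))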

Cite
Ignacio Avas (2023). Multimodal Recommendation System Datasets [Dataset]. http://doi.org/10.34740/kaggle/dsv/6338676

Multimodal Recommendation System Datasets

Datasets for AlignMacridVAE (Amazon, BookCrossing, Movielens)

Explore at:
267 scholarly articles cite this dataset (View in Google Scholar)
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 21, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Ignacio Avas
License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Quick start

To read any dataset you can use the following code

>>> import numpy as np
>>> embed_image = np.load('embed_image.npy')
>>> embed_image.shape
(33962, 768)
>>> embed_text = np.load('embed_text.npy')
>>> embed_text.shape
(33962, 768)
>>> import pandas as pd
>>> items = pd.read_csv('items.txt')
>>> m = len(items)
>>> print(f'{m} items in dataset')
33962
>>> users = pd.read_csv('users.txt')
>>> n = len(users)
>>> print(f'{n} users in dataset')
14790
>>> train = pd.read_csv('train.txt')
>>> train
     user  item
0    13444 23557
1    13444 33739
...    ...  ...
317109 13506 29993
317110 13506 13931
>>> from scipy.sparse import csr_matrix
>>> train_matrix = csr_matrix((np.ones(len(train)), (train.user, train.item)), shape=(n,m))

Folders

This dataset contains six datasets. Each dataset is duplicated with seven combinations of different Image and Text encoders, so you should see 42 folders.

Each folder is the name of the dataset and the encoder used for the visual and textual parts. For example: bookcrossing-vit_bert.

The datasets are: - Clothing, Shoes and Jewelry (Amazon) - Home and Kitchen (Amazon) - Musical Instruments (Amazon) - Movies and TV (Amazon) - Book-Crossing - Movielens 25M

And the encoders are: - CLIP (Image and Text) (*-clip_clip). This is the main one used in the experiments. - ViT and BERT (*-vit_bert) - CLIP (only visual data) *-clip_none - ViT only *-vit_none - BERT only *-none_bert - CLIP (text only) *-clip_none - No textual or visual information *-none_none

Files per folder

For each dataset, we have the following files, considering we have M items and N users, textual embeddings with D (like 1024) dimensions, and Visual with E dimensions (like 768) - embed_image.npy A NumPy array of MxE elements. - embed_text.npy A NumPy array of MXD elements. - items.csv A CSV with the Item ID in the original dataset (like the Amazon ASIN, the Movie ID, etc.) and the item number, an integer from 0 to M-1 - users.csv A CSV with the User ID in the original dataset (like the Amazon Reviewer Id) and the item number, an integer from 0 to N-1 - train.txt, validation.txt and test.txt are CSV files with the portions of the reviews for train validation and test. It has the item the user liked or reviewed positively. Each row has a positive user item.

We consider a review "positive" if the rating is four or more (or 8 or more for Book-crossing).

The vector is zeroed out if an Item does not have an image or text.

Dataset stats

Dataset                    | Users  | Items | Ratings  | Density
Clothing & Shoes & Jewelry | 23318  | 38493 | 178944   | 0.020%
Home & Kitchen             | 5968   | 57645 | 135839   | 0.040%
Movies & TV                | 21974  | 23958 | 216110   | 0.041%
Musical Instruments        | 14429  | 29040 | 93923    | 0.022%
Book-crossing              | 14790  | 33962 | 519613   | 0.103%
Movielens 25M              | 162541 | 59047 | 25000095 | 0.260%
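
Density is the number of observed ratings divided by the number of possible user-item pairs (Users × Items). A quick check for the Book-crossing row:

# Values taken from the table above.
users, items, ratings = 14790, 33962, 519613
print(f'{ratings / (users * items):.3%}')   # ≈ 0.103%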

Modifications from the original source

For the Amazon datasets, only a small fraction of the original data was used, obtained by restricting reviews to a specific date range.

For the Bookcrossing dataset, only items with images were considered.

There are various other minor tweaks to how images and texts are obtained. The repository https://github.com/igui/MultimodalRecomAnalysis contains the notebooks and scripts needed to reproduce the dataset extraction from scratch.
