MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for Docmatix
Dataset description
Docmatix is part of the Idefics3 release (stay tuned). It is a massive dataset for Document Visual Question Answering that was used for the fine-tuning of the vision-language model Idefics3.
Load the dataset
To load the dataset, install the datasets library with pip install datasets. Then run:
from datasets import load_dataset
ds = load_dataset("HuggingFaceM4/Docmatix")
If you want the dataset to link to the pdf files… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceM4/Docmatix.
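Because the dataset is large, it can help to inspect a few records before committing to a full download. The following is a minimal sketch, not part of the dataset card: it assumes the repository exposes a "train" split and that streaming mode is acceptable for your use. The helper name peek_examples and its config and loader parameters are hypothetical, introduced here only for illustration.

```python
from itertools import islice


def peek_examples(repo_id="HuggingFaceM4/Docmatix", n=2, config=None, loader=None):
    """Return the first n examples as a list, reading lazily.

    config: optional configuration name passed through unchanged (hypothetical
            parameter; consult the dataset page for the names that exist).
    loader: optional stand-in for datasets.load_dataset, so the helper can be
            exercised without network access (hypothetical parameter).
    """
    if loader is None:
        # Imported lazily so the module loads even without the library installed.
        from datasets import load_dataset as loader  # pip install datasets

    args = (repo_id,) if config is None else (repo_id, config)
    # streaming=True yields examples one at a time instead of downloading everything.
    # split="train" is an assumption about this repository's layout.
    stream = loader(*args, split="train", streaming=True)
    return list(islice(stream, n))


if __name__ == "__main__":
    # Requires network access and the datasets library.
    for example in peek_examples(n=2):
        print(sorted(example.keys()))
```

Streaming trades random access for a small memory and disk footprint, which is usually what you want for a first look at the column names.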
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Docmatix-IR
Docmatix was originally a large dataset designed for fine-tuning large vision-language models on Visual Question Answering tasks. It contains a substantial collection of PDF images (2.4M) and a vast set of questions (9.5M) related to these images. However, many of the questions in the Docmatix dataset are not suitable for open-domain question answering. To address this, we have converted Docmatix into Docmatix-IR, a training set suitable for training document visual… See the full description on the dataset page: https://huggingface.co/datasets/Tevatron/docmatix-ir.
Docmatix used in MoCa Continual Pre-training
Homepage | Code | MoCa-Qwen25VL-7B | MoCa-Qwen25VL-3B | Datasets | Paper
Introduction
This is an interleaved multimodal pre-training dataset used in the modality-aware continual pre-training of MoCa models. It is adapted from Docmatix by concatenating document screenshots and texts. The dataset consists of interleaved multimodal examples. text is a string containing text, while images are image binaries that… See the full description on the dataset page: https://huggingface.co/datasets/moca-embed/docmatix.
jiuhai/docmatix dataset hosted on Hugging Face and contributed by the HF Datasets community
Nayana-cognitivelab/Nayana-DocQA-ta-10k-v1-docmatix dataset hosted on Hugging Face and contributed by the HF Datasets community
EtashGuha/Docmatix dataset hosted on Hugging Face and contributed by the HF Datasets community
Nayana-cognitivelab/Docmatix-single-pdfs dataset hosted on Hugging Face and contributed by the HF Datasets community
danaaubakirova/docmatix-subset dataset hosted on Hugging Face and contributed by the HF Datasets community
sionic-ai/docmatix dataset hosted on Hugging Face and contributed by the HF Datasets community
Nayana-cognitivelab/Nayana-OCRBench-in-0.1k-v1-docmatix dataset hosted on Hugging Face and contributed by the HF Datasets community
Nayana-cognitivelab/Nayana-DocOCR-mr-ta-te-or-pa-15k-v1-docmatix dataset hosted on Hugging Face and contributed by the HF Datasets community
monology/docmatix-multipage dataset hosted on Hugging Face and contributed by the HF Datasets community
mrodriguesoliv/docmatix-turing dataset hosted on Hugging Face and contributed by the HF Datasets community