13 datasets found
  1. Docmatix

    • huggingface.co
    Updated Jul 18, 2024
    Cite
    HuggingFaceM4 (2024). Docmatix [Dataset]. https://huggingface.co/datasets/HuggingFaceM4/Docmatix
    Dataset updated
    Jul 18, 2024
    Dataset authored and provided by
    HuggingFaceM4
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset Card for Docmatix

      Dataset description
    

    Docmatix is part of the Idefics3 release (stay tuned). It is a massive dataset for Document Visual Question Answering that was used for the fine-tuning of the vision-language model Idefics3.

      Load the dataset
    

    To load the dataset, install the datasets library with pip install datasets. Then run:

        from datasets import load_dataset
        ds = load_dataset("HuggingFaceM4/Docmatix")

    If you want the dataset to link to the PDF files… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceM4/Docmatix.
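
The loading step described in this entry could be wrapped as in the minimal sketch below. The helper name `load_docmatix` and its `loader` parameter are hypothetical and not part of the dataset card; only the repository id and the `load_dataset` call come from the listing. The card text here is truncated, so whether a configuration name is needed for the PDF-linked variant is not shown and should be checked on the dataset page.

```python
from typing import Any, Callable, Optional

# Repository id quoted in the listing above.
DOCMATIX_REPO = "HuggingFaceM4/Docmatix"


def load_docmatix(loader: Optional[Callable[..., Any]] = None, **kwargs: Any) -> Any:
    """Load Docmatix via `loader` (hypothetical hook), defaulting to datasets.load_dataset.

    Extra keyword arguments are forwarded unchanged to the loader.
    """
    if loader is None:
        # Imported lazily so this module can be imported without `datasets` installed.
        from datasets import load_dataset
        loader = load_dataset
    return loader(DOCMATIX_REPO, **kwargs)
```

A call such as `ds = load_docmatix(streaming=True)` would forward `streaming=True` to `load_dataset`, which iterates over the data instead of downloading all of it first; that may be preferable for a dataset described as massive.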

  2. docmatix-ir

    • huggingface.co
    Updated Jul 24, 2024
    Cite
    Tevatron (2024). docmatix-ir [Dataset]. https://huggingface.co/datasets/Tevatron/docmatix-ir
    Dataset updated
    Jul 24, 2024
    Dataset authored and provided by
    Tevatron
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Docmatix-IR

    Docmatix was originally a large dataset designed for fine-tuning large vision-language models on Visual Question Answering tasks. It contains a substantial collection of PDF images (2.4M) and a vast set of questions (9.5M) related to these images. However, many of the questions in the Docmatix dataset are not suitable for open-domain question answering. To address this, we have converted Docmatix into Docmatix-IR, a training set suitable for training document visual… See the full description on the dataset page: https://huggingface.co/datasets/Tevatron/docmatix-ir.

  3. docmatix

    • huggingface.co
    Updated Mar 26, 2025
    Cite
    MoCa (2025). docmatix [Dataset]. https://huggingface.co/datasets/moca-embed/docmatix
    Dataset updated
    Mar 26, 2025
    Dataset authored and provided by
    MoCa
    Description

    Docmatix used in MoCa Continual Pre-training

    Homepage | Code | MoCa-Qwen25VL-7B | MoCa-Qwen25VL-3B | Datasets | Paper

      Introduction
    

    This is an interleaved multimodal pre-training dataset used in the modality-aware continual pre-training of MoCa models. It is adapted from Docmatix by concatenating document screenshots and texts. The dataset consists of interleaved multimodal examples. text is a string containing text while images are image binaries that… See the full description on the dataset page: https://huggingface.co/datasets/moca-embed/docmatix.

  4. docmatix

    • huggingface.co
    Updated Sep 13, 2024
    Cite
    JiuhaiChen (2024). docmatix [Dataset]. https://huggingface.co/datasets/jiuhai/docmatix
    Dataset updated
    Sep 13, 2024
    Authors
    JiuhaiChen
    Description

    jiuhai/docmatix dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. Nayana-DocQA-ta-10k-v1-docmatix

    • huggingface.co
    Updated Nov 19, 2024
    + more versions
    Cite
    Nayana-CognitiveLab (2024). Nayana-DocQA-ta-10k-v1-docmatix [Dataset]. https://huggingface.co/datasets/Nayana-cognitivelab/Nayana-DocQA-ta-10k-v1-docmatix
    Dataset updated
    Nov 19, 2024
    Dataset authored and provided by
    Nayana-CognitiveLab
    Description

    Nayana-cognitivelab/Nayana-DocQA-ta-10k-v1-docmatix dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. Docmatix

    • huggingface.co
    + more versions
    Cite
    Etash Guha, Docmatix [Dataset]. https://huggingface.co/datasets/EtashGuha/Docmatix
    Authors
    Etash Guha
    Description

    EtashGuha/Docmatix dataset hosted on Hugging Face and contributed by the HF Datasets community

  7. Docmatix-single-pdfs

    • huggingface.co
    Updated Apr 3, 2025
    + more versions
    Cite
    Nayana-CognitiveLab (2025). Docmatix-single-pdfs [Dataset]. https://huggingface.co/datasets/Nayana-cognitivelab/Docmatix-single-pdfs
    Dataset updated
    Apr 3, 2025
    Dataset authored and provided by
    Nayana-CognitiveLab
    Description

    Nayana-cognitivelab/Docmatix-single-pdfs dataset hosted on Hugging Face and contributed by the HF Datasets community

  8. docmatix-subset

    • huggingface.co
    Updated Jul 26, 2024
    Cite
    Dana Aubakirova (2024). docmatix-subset [Dataset]. https://huggingface.co/datasets/danaaubakirova/docmatix-subset
    Dataset updated
    Jul 26, 2024
    Authors
    Dana Aubakirova
    Description

    danaaubakirova/docmatix-subset dataset hosted on Hugging Face and contributed by the HF Datasets community

  9. docmatix

    • huggingface.co
    Updated Aug 31, 2025
    + more versions
    Cite
    sionic-ai (2025). docmatix [Dataset]. https://huggingface.co/datasets/sionic-ai/docmatix
    Dataset updated
    Aug 31, 2025
    Dataset provided by
    Sionic AI Inc
    Authors
    sionic-ai
    Description

    sionic-ai/docmatix dataset hosted on Hugging Face and contributed by the HF Datasets community

  10. Nayana-OCRBench-in-0.1k-v1-docmatix

    • huggingface.co
    Updated Jul 21, 2025
    + more versions
    Cite
    Nayana-CognitiveLab (2025). Nayana-OCRBench-in-0.1k-v1-docmatix [Dataset]. https://huggingface.co/datasets/Nayana-cognitivelab/Nayana-OCRBench-in-0.1k-v1-docmatix
    Dataset updated
    Jul 21, 2025
    Dataset authored and provided by
    Nayana-CognitiveLab
    Description

    Nayana-cognitivelab/Nayana-OCRBench-in-0.1k-v1-docmatix dataset hosted on Hugging Face and contributed by the HF Datasets community

  11. Nayana-DocOCR-mr-ta-te-or-pa-15k-v1-docmatix

    • huggingface.co
    Updated Nov 14, 2024
    Cite
    Nayana-CognitiveLab (2024). Nayana-DocOCR-mr-ta-te-or-pa-15k-v1-docmatix [Dataset]. https://huggingface.co/datasets/Nayana-cognitivelab/Nayana-DocOCR-mr-ta-te-or-pa-15k-v1-docmatix
    Dataset updated
    Nov 14, 2024
    Dataset authored and provided by
    Nayana-CognitiveLab
    Description

    Nayana-cognitivelab/Nayana-DocOCR-mr-ta-te-or-pa-15k-v1-docmatix dataset hosted on Hugging Face and contributed by the HF Datasets community

  12. docmatix-multipage

    • huggingface.co
    Updated Jun 30, 2025
    Cite
    monology (2025). docmatix-multipage [Dataset]. https://huggingface.co/datasets/monology/docmatix-multipage
    Dataset updated
    Jun 30, 2025
    Authors
    monology
    Description

    monology/docmatix-multipage dataset hosted on Hugging Face and contributed by the HF Datasets community

  13. docmatix-turing

    • huggingface.co
    Updated Feb 10, 2025
    + more versions
    Cite
    Matheus Oliveira (2025). docmatix-turing [Dataset]. https://huggingface.co/datasets/mrodriguesoliv/docmatix-turing
    Dataset updated
    Feb 10, 2025
    Authors
    Matheus Oliveira
    Description

    mrodriguesoliv/docmatix-turing dataset hosted on Hugging Face and contributed by the HF Datasets community


HuggingFaceM4/Docmatix

Explore at:
26 scholarly articles cite this dataset (View in Google Scholar)
