6 datasets found
  1. Cambrian-10M

    • huggingface.co
    Updated Jun 25, 2024
    Cite
    NYU VisionX (2024). Cambrian-10M [Dataset]. https://huggingface.co/datasets/nyu-visionx/Cambrian-10M
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 25, 2024
    Dataset authored and provided by
    NYU VisionX
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Cambrian-10M Dataset

    Please see paper & website for more information:

    https://cambrian-mllm.github.io/ https://arxiv.org/abs/2406.16860

      Overview
    

    Cambrian-10M is a comprehensive dataset designed for instruction tuning, particularly in multimodal settings involving visual interaction data. The dataset is crafted to address the scarcity of high-quality multimodal instruction-tuning data and to maintain the language abilities of multimodal large language models (MLLMs).… See the full description on the dataset page: https://huggingface.co/datasets/nyu-visionx/Cambrian-10M.

  2. Cambrian-Alignment

    • huggingface.co
    Updated Jun 25, 2024
    Cite
    Cambrian-Alignment [Dataset]. https://huggingface.co/datasets/nyu-visionx/Cambrian-Alignment
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 25, 2024
    Dataset authored and provided by
    NYU VisionX
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Cambrian-Alignment Dataset

    Please see paper & website for more information:

    https://cambrian-mllm.github.io/ https://arxiv.org/abs/2406.16860

      Overview
    

    Cambrian-Alignment is a question-answering alignment dataset composed of alignment data from LLaVA, Mini-Gemini, Allava, and ShareGPT4V.

      Getting Started with Cambrian Alignment Data
    

    Before you start, ensure you have sufficient storage space to download and process the data.

    Download the Data Repository… See the full description on the dataset page: https://huggingface.co/datasets/nyu-visionx/Cambrian-Alignment.

  3. CV-Bench

    • huggingface.co
    Updated Jul 7, 2024
    Cite
    NYU VisionX (2024). CV-Bench [Dataset]. https://huggingface.co/datasets/nyu-visionx/CV-Bench
    Dataset updated
    Jul 7, 2024
    Dataset authored and provided by
    NYU VisionX
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Cambrian Vision-Centric Benchmark (CV-Bench)

    This repository contains the Cambrian Vision-Centric Benchmark (CV-Bench), introduced in Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs.

      Files
    

    The test*.parquet files contain the dataset annotations and images pre-loaded for processing with HF Datasets. These can be loaded in 3 different configurations using… See the full description on the dataset page: https://huggingface.co/datasets/nyu-visionx/CV-Bench.

  4. Recharge across the Cambrian Limestone Aquifer

    • researchdata.edu.au
    Updated Jul 28, 2020
    Cite
    Bioregional Assessment Program (2020). Recharge across the Cambrian Limestone Aquifer [Dataset]. https://researchdata.edu.au/recharge-cambrian-limestone-aquifer/2980282
    Dataset updated
    Jul 28, 2020
    Dataset provided by
    Data.gov (https://data.gov/)
    Authors
    Bioregional Assessment Program
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Abstract

    The dataset was derived by the Geological and Bioregional Assessment Program from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement. This is an initial dataset that was published for peer review on 21/07/2020 and will be finalised when the journal paper is revised and accepted for publication. This dataset contains the inputs, outputs and code used to estimate recharge across the extent of the Cambrian Limestone Aquifer. The method is described in a journal paper:

    Crosbie and Rachakonda (2020) Constraining probabilistic chloride mass balance recharge estimates using baseflow and remotely sensed evapotranspiration: The Cambrian Limestone Aquifer northern Australia. Submitted to Hydrogeology Journal.

    A copy of this draft journal paper is included in the dataset.

    Attribution

    Geological and Bioregional Assessment Program

    History

    This dataset contains the inputs, outputs and code used to estimate recharge across the extent of the Cambrian Limestone Aquifer. The method is described in a journal paper:

    Crosbie and Rachakonda (2020) Constraining probabilistic chloride mass balance recharge estimates using baseflow and remotely sensed evapotranspiration: The Cambrian Limestone Aquifer northern Australia. Submitted to Hydrogeology Journal.

    A copy of this draft journal paper is included in the dataset.

  5. Oasis

    • huggingface.co
    Cite
    Letian Zhang, Oasis [Dataset]. https://huggingface.co/datasets/Letian2003/Oasis
    Authors
    Letian Zhang
    Description

    Oasis: One Image is All You Need for Multimodal Instruction Data Synthesis

    This dataset contains the Oasis-500k dataset. [Read the Paper] | [Github Repo]

    All images come from Cambrian-10M. Instructions and responses are generated by an MLLM.

  6. FUSION-Finetune-12M

    • huggingface.co
    Updated Apr 8, 2025
    Cite
    Zheng Liu (2025). FUSION-Finetune-12M [Dataset]. https://huggingface.co/datasets/starriver030515/FUSION-Finetune-12M
    Dataset updated
    Apr 8, 2025
    Authors
    Zheng Liu
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    FUSION-12M Dataset

    Please see paper & website for more information:

    https://arxiv.org/abs/2504.09925 https://github.com/starriver030515/FUSION

      Overview
    

    FUSION-12M is a large-scale, diverse multimodal instruction-tuning dataset used to train FUSION-3B and FUSION-8B models. It builds upon Cambrian-1 by significantly expanding both the quantity and variety of data, particularly in areas such as OCR, mathematical reasoning, and synthetic high-quality Q&A data. The goal is… See the full description on the dataset page: https://huggingface.co/datasets/starriver030515/FUSION-Finetune-12M.


Cite
NYU VisionX (2024). Cambrian-10M [Dataset]. https://huggingface.co/datasets/nyu-visionx/Cambrian-10M

Cambrian-10M

nyu-visionx/Cambrian-10M

Explore at:
7 scholarly articles cite this dataset (View in Google Scholar)
