Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Cambrian-10M Dataset
Please see paper & website for more information:
https://cambrian-mllm.github.io/ https://arxiv.org/abs/2406.16860
Overview
Cambrian-10M is a comprehensive dataset designed for instruction tuning, particularly in multimodal settings involving visual interaction data. The dataset is crafted to address the scarcity of high-quality multimodal instruction-tuning data and to maintain the language abilities of multimodal large language models (LLMs).… See the full description on the dataset page: https://huggingface.co/datasets/nyu-visionx/Cambrian-10M.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Cambrian-Alignment Dataset
Please see paper & website for more information:
https://cambrian-mllm.github.io/ https://arxiv.org/abs/2406.16860
Overview
Cambrian-Alignment is a question-answering alignment dataset composed of alignment data from LLaVA, Mini-Gemini, Allava, and ShareGPT4V.
Getting Started with Cambrian Alignment Data
Before you start, ensure you have sufficient storage space to download and process the data.
Download the Data Repository… See the full description on the dataset page: https://huggingface.co/datasets/nyu-visionx/Cambrian-Alignment.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Cambrian Vision-Centric Benchmark (CV-Bench)
This repository contains the Cambrian Vision-Centric Benchmark (CV-Bench), introduced in Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs.
Files
The test*.parquet files contain the dataset annotations and images pre-loaded for processing with HF Datasets. These can be loaded in 3 different configurations using… See the full description on the dataset page: https://huggingface.co/datasets/nyu-visionx/CV-Bench.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset was derived by the Geological and Bioregional Assessment Program from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement. This is an initial dataset that was published for peer review on 21/07/2020 and will be finalised when the journal paper is revised and accepted for publication. This dataset contains the inputs, outputs and code used to estimate recharge across the extent of the Cambrian Limestone Aquifer. The method is described in a journal paper:
Crosbie and Rachakonda (2020) Constraining probabilistic chloride mass balance recharge estimates using baseflow and remotely sensed evapotranspiration: The Cambrian Limestone Aquifer northern Australia. Submitted to Hydrogeology Journal.
A copy of this draft journal paper is included in the dataset.
Geological and Bioregional Assessment Program
Oasis: One Image is All You Need for Multimodal Instruction Data Synthesis
This dataset contains the Oasis-500k dataset. [Read the Paper] | [Github Repo]
All images come from Cambrian-10M. Instructions and responses are generated by an MLLM.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
FUSION-12M Dataset
Please see paper & website for more information:
https://arxiv.org/abs/2504.09925 https://github.com/starriver030515/FUSION
Overview
FUSION-12M is a large-scale, diverse multimodal instruction-tuning dataset used to train FUSION-3B and FUSION-8B models. It builds upon Cambrian-1 by significantly expanding both the quantity and variety of data, particularly in areas such as OCR, mathematical reasoning, and synthetic high-quality Q&A data. The goal is… See the full description on the dataset page: https://huggingface.co/datasets/starriver030515/FUSION-Finetune-12M.