Introducing Cambrian-1, a family of vision-centric multimodal LLMs (MLLMs).
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
While stronger language models can enhance multimodal capabilities,
the design choices for vision components are often insufficiently explored and disconnected from visual representation learning research.
Cambrian-1 is structured around five key pillars, each offering important insights into the design space of MLLMs:
To this end, Cambrian-1 not only achieves state-of-the-art performance, but also serves as a comprehensive, open cookbook for instruction-tuned MLLMs. See §State-of-the-art MLLM performance. We provide model weights, code, datasets, and detailed instruction-tuning and evaluation recipes. We hope our release will inspire and accelerate advancements in multimodal systems and visual representation learning.
Who's answering: LLM or MLLM? We compare performance between vision-disabled and vision-enabled settings across MLLMs trained with 23 different vision backbones. Our findings reveal that some benchmarks, such as MMMU and AI2D, are less reliant on visual inputs, whereas others, such as MMVP and MME, suffer significant performance drops, indicating that they effectively evaluate multimodality.
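A minimal sketch of this comparison, assuming a hypothetical `evaluate_fn` helper (not part of the Cambrian codebase) that scores one checkpoint on one benchmark and can withhold the image input:

```python
# Sketch of the vision-disabled vs. vision-enabled comparison.
# `evaluate_fn(checkpoint, benchmark, use_vision)` is a hypothetical callable
# returning a single accuracy-style score.

def vision_dependence(checkpoints, benchmarks, evaluate_fn):
    """Return the average score drop per benchmark when images are withheld."""
    drops = {}
    for bench in benchmarks:
        deltas = [
            evaluate_fn(ckpt, bench, use_vision=True)
            - evaluate_fn(ckpt, bench, use_vision=False)
            for ckpt in checkpoints
        ]
        drops[bench] = sum(deltas) / len(deltas)
    # A small drop suggests the benchmark is largely answerable by the LLM alone;
    # a large drop suggests it genuinely tests multimodal ability.
    return drops
```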
Benchmark Clustering and Analysis: Through correlation analysis and principal component analysis of MLLM performances across various benchmarks, distinct clusters emerge categorized as "General," "Knowledge," "Chart & OCR," and "Vision-Centric." We also find that vision-centric benchmarks are underrepresented in the current evaluation landscape.
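As a rough illustration of this kind of analysis (not the paper's exact script), one can correlate benchmarks across many MLLM runs and embed them with PCA, assuming a score matrix with one row per trained model and one column per benchmark:

```python
import numpy as np
from sklearn.decomposition import PCA

def cluster_benchmarks(scores: np.ndarray, names: list[str]):
    """scores: (n_models, n_benchmarks) matrix of MLLM results on each benchmark."""
    # Benchmark-by-benchmark correlation across models; scale differences cancel out.
    corr = np.corrcoef(scores, rowvar=False)
    # Embed each benchmark (a row of its correlations) in 2-D for visual clustering.
    coords = PCA(n_components=2).fit_transform(corr)
    return corr, dict(zip(names, coords))
```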
Cambrian Vision-Centric Benchmark (CV-Bench): To address the scarcity of vision-centric benchmarks, we introduce CV-Bench, repurposing standard vision tasks for multimodal evaluation. CV-Bench contains approximately 2,600 vision-centric VQA questions, addressing the limited size of existing vision-centric benchmarks.
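For intuition, items in a counting-style split can be derived programmatically from standard detection annotations. The sketch below is our own illustration of this repurposing idea, not the released CV-Bench generation code; the field names are assumptions.

```python
import random

def make_count_question(image_id: str, annotations: list, category: str):
    """Turn detection-style annotations into a multiple-choice counting VQA item."""
    count = sum(1 for ann in annotations if ann["category"] == category)
    # Build distractor options around the true count.
    options = sorted({count, max(0, count - 1), count + 1, count + 2})
    random.shuffle(options)
    letters = "ABCD"
    return {
        "image_id": image_id,
        "question": f"How many {category}s are in the image?",
        "choices": {letters[i]: str(o) for i, o in enumerate(options)},
        "answer": letters[options.index(count)],
    }
```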
MLLMs connect pre-trained LLM and vision backbones using a connector such as an MLP projector. Various studies have suggested different optimal training methodologies for MLLMs.
One Stage vs Two Stage Training: Recent work suggests skipping connector pre-training to reduce compute costs without harming performance. Following LLaVA's method, we experiment with 0, 0.5M, and 1.2M data points of adapter data.
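For concreteness, the two recipes differ only in whether the connector is pre-trained before instruction tuning. The sketch below is a schematic, assuming hypothetical `train_fn`, `adapter_loader`, and `sft_loader` helpers rather than the actual Cambrian training code; the learning rates are illustrative LLaVA-style defaults.

```python
# Schematic comparison of one-stage vs. two-stage training.

def two_stage(vision_encoder, connector, llm, adapter_loader, sft_loader, train_fn):
    # Stage 1: freeze the vision encoder and LLM, pre-train only the connector
    # on adapter (caption-style) data.
    for p in list(vision_encoder.parameters()) + list(llm.parameters()):
        p.requires_grad = False
    train_fn(trainable=[connector], data=adapter_loader, lr=1e-3)

    # Stage 2: unfreeze the LLM and instruction-tune connector + LLM together.
    for p in llm.parameters():
        p.requires_grad = True
    train_fn(trainable=[connector, llm], data=sft_loader, lr=2e-5)

def one_stage(connector, llm, sft_loader, train_fn):
    # Skip connector pre-training entirely and instruction-tune directly.
    train_fn(trainable=[connector, llm], data=sft_loader, lr=2e-5)
```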
Freeze vs Unfreeze Vision Encoder: There are also mixed practices regarding freezing or unfreezing the vision backbone during fine-tuning. Some argue that unfreezing the vision backbone significantly degrades performance. Our experiments demonstrate that, with a reasonable vision model learning rate, unfreezing benefits performance across all benchmark categories, with only a marginal change on Knowledge benchmarks.
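One common way to realize "a reasonable vision model learning rate" is to give the unfrozen vision tower its own, smaller parameter group in the optimizer. The PyTorch sketch below is a minimal illustration, assuming the vision backbone's parameters are namespaced under `vision_tower`; the scaling factor is an illustrative value, not the paper's setting.

```python
import torch

def build_optimizer(model, base_lr=2e-5, vision_lr_scale=0.1, weight_decay=0.0):
    """Unfreeze the vision tower but train it with a smaller learning rate."""
    vision_params, other_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # Assumes the vision backbone's parameters live under a "vision_tower" prefix.
        (vision_params if "vision_tower" in name else other_params).append(p)
    return torch.optim.AdamW(
        [
            {"params": other_params, "lr": base_lr},
            {"params": vision_params, "lr": base_lr * vision_lr_scale},
        ],
        weight_decay=weight_decay,
    )
```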
MLLMs provide a more real-world evaluation of visual representations than traditional benchmarks like ImageNet-1k. We use 2-stage instruction tuning with 1.2M adapter data and 737K fine-tuning data to compare a variety of vision models on downstream MLLM performance. Our evaluations show that language-supervised models exhibit strong advantages across all benchmark categories, especially in OCR & chart tasks. However, SSL models like DINOv2, despite being trained on smaller datasets, perform competitively on vision-centric benchmarks.
Narrowing the gap between CLIP and SSL models: Above, we observe that DINOv2 stands midway between SSL models and CLIP models on general VQA and knowledge VQA tasks, even outperforming some CLIP models on vision-centric benchmarks at higher resolution. We investigate unfreezing the vision backbone and increasing the amount of visual fine-tuning data to narrow this gap. In Figure 5, we observe that with the vision backbone unfrozen, the DINOv2-based MLLM fine-tuned on 5M data surpasses an MLLM trained with a CLIP model on 0.7M data. Additionally, the gap between DINOv2 and the CLIP models narrows under the 5M-data setting.
Combining Multiple Vision Encoders: As observed in Figure 4, different vision models excel at different aspects of MLLM performance. We explore combining multiple vision encoders to leverage their distinctive representations. Since different vision encoders use varying architectures and image resolutions, we interpolate each encoder's output visual tokens to a fixed number, 576, and concatenate them (a sketch of this baseline follows the table). The results are tabulated in Table 2, where we observe consistent performance improvements as more models are added.
Table 2: Combining multiple vision encoders. Benchmark columns are grouped into General (MME^P, MMB, SEED^I, GQA), Knowledge (SQA^I, MMMU^V, MathVista^M, AI2D), OCR & Chart (ChartQA, OCRBench, TextVQA, DocVQA), and Vision-Centric (MMVP, RealWorldQA, CV-Bench^2D, CV-Bench^3D).

Encoders | Average | MME^P | MMB | SEED^I | GQA | SQA^I | MMMU^V | MathVista^M | AI2D | ChartQA | OCRBench | TextVQA | DocVQA | MMVP | RealWorldQA | CV-Bench^2D | CV-Bench^3D
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
SigLIP+DINOv2 | 51.61 | 1,432.02 | 61.28 | 65.99 | 63.30 | 68.82 | 35.69 | 29.40 | 60.01 | 43.00 | 35.70 | 60.40 | 37.54 | 30.00 | 53.99 | 55.52 | 53.58 |
SigLIP+DINOv2+ConvNext | 54.52 | 1,503.51 | 63.83 | 67.97 | 63.95 | 70.40 | 35.99 | 29.30 | 60.69 | 48.20 | 36.90 | 64.97 | 45.53 | 34.67 | 58.69 | 55.74 | 60.33 |
SigLIP+DINOv2+ConvNext+CLIP | 54.74 | 1,479.46 | 63.32 | 67.63 | 64.04 | 71.39 | 35.49 | 29.10 | 59.88 | 50.24 | 39.60 | 64.55 | 46.12 | 32.67 | 58.95 | 58.54 | 60.42 |
SigLIP+ConvNext | 54.53 | 1,494.97 | 64.60 | 67.98 | 63.58 | 71.05 | 34.90 | 29.80 | 60.85 | 50.64 | 38.00 | 64.53 | 46.52 | 32.00 | 57.91 | 58.83 | 56.58 |
CLIP+ConvNext | 54.45 | 1,511.08 | 63.83 | 67.41 | 63.63 | 70.80 | 35.09 | 30.40 | 59.91 | 51.32 | 35.00 | 64.45 | 47.88 | 33.33 | 57.25 | 56.32 | 59.08 |
SigLIP+DINOv2+ConvNext | 53.78 | 1,450.64 | 63.57 | 67.79 | 63.63 | 71.34 | 34.80 | 30.20 | 61.04 | 49.32 | 37.70 | 64.05 | 45.83 | 30.00 | 56.21 | 58.08 | 54.33 |
SigLIP+CLIP+ConvNext | 54.53 | 1,507.28 | 63.23 | 68.64 | 63.63 | 71.10 | 35.89 | 30.90 | 59.97 | 52.36 | 38.50 | 65.40 | 47.92 | 28.67 | 57.25 | 57.66 | 55.92 |
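As a concrete illustration of the interpolate-and-concatenate baseline used for Table 2, the sketch below resizes each encoder's feature map to a 24×24 grid (576 tokens) and concatenates along the channel dimension. This is a simplified rendition of the strategy described above, not the released implementation.

```python
import torch
import torch.nn.functional as F

def concat_vision_features(feature_maps, num_tokens: int = 576):
    """feature_maps: list of (B, C_i, H_i, W_i) outputs from different vision encoders."""
    side = int(num_tokens ** 0.5)  # 24 x 24 grid for 576 tokens
    resized = [
        F.interpolate(f, size=(side, side), mode="bilinear", align_corners=False)
        for f in feature_maps
    ]
    # (B, sum(C_i), 24, 24) -> (B, 576, sum(C_i)) token sequence for the connector.
    fused = torch.cat(resized, dim=1)
    return fused.flatten(2).transpose(1, 2)
```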
However, this strategy has two limitations: 1) it employs interpolation, which can potentially lead to information loss, especially on vision encoders with high-resolution feature maps, and 2) it treats each model equally by simple concatenation. Therefore, we seek a more effective strategy that fully leverages model combinations with less information loss and more flexibility.
To effectively aggregate features from multiple vision encoders and reduce information loss from interpolation, we use a set of learnable latent queries that interact with multiple vision features through cross-attention layers; we refer to this module as the Spatial Vision Aggregator (SVA).
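A minimal sketch of this idea, with learnable latent queries cross-attending to the concatenated tokens of all encoders, is shown below. It omits the spatial inductive bias and multi-layer aggregation of the full SVA, and all dimensions and module names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LatentQueryAggregator(nn.Module):
    """Simplified cross-attention aggregator over multiple vision encoders."""

    def __init__(self, encoder_dims, d_model=1024, num_queries=576, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d_model) * 0.02)
        # Project each encoder's features into a shared width before attention.
        self.proj = nn.ModuleList(nn.Linear(d, d_model) for d in encoder_dims)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, features):
        """features: list of (B, N_i, C_i) token sequences, one per vision encoder."""
        batch = features[0].shape[0]
        # Concatenate all projected tokens into one key/value sequence.
        kv = torch.cat([p(f) for p, f in zip(self.proj, features)], dim=1)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        out, _ = self.attn(q, kv, kv)  # (B, num_queries, d_model)
        return out
```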
Previous work highlights the importance of data in training MLLMs, but explicit investigations are limited. In this study, we gather all available instruction tuning data and examine data curation by enhancing diversity, balancing sources, and improving mixtures.
Collecting Instruction Tuning Data from Existing Data Sources: We first use existing multimodal benchmarks and datasets involving visual interaction data, such as Visual Question Answering (VQA) and OCR data. We also collect a small volume of high-quality language-only instruction-following data to maintain the model's language ability.
Targeted Internet Data Collection Engine: We also introduce a data engine designed to create large-scale, reliable, high-quality knowledge-based multimodal instruction tuning data.
Cambrian-10M: To this end, we create a large pool of instruction tuning data, which we refer to as Cambrian-10M. This pool contains approximately 9784k data points, offering a diverse range of data for our work and future research. We visualize its composition in Figure 7.
Cambrian-10M is a large pool of instruction tuning data drawn from a variety of sources, with an unbalanced data ratio between categories. Here, we take a preliminary step toward data curation by improving data balancing and adjusting data ratios.
Data Balancing: We follow previous work and set a threshold t on the number of data points drawn from any single data source. We try t = 150k, 250k, 350k, and 450k in this section and observe an elbow effect in Table 3, finding that a threshold between 250k and 350k works best for Cambrian-10M.
Threshold t | Average | General | Knowledge | OCR & Chart | Vision-Centric
---|---|---|---|---|---
150k | 53.7 | 68.0 | 51.3 | 45.2 | 50.5 |
250k | 54.3 | 68.1 | 51.5 | 45.3 | 52.2 |
350k | 54.3 | 67.4 | 51.4 | 46.0 | 52.3 |
450k | 54.2 | 68.0 | 52.2 | 45.5 | 50.7 |
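The capping rule itself is straightforward. Below is a minimal sketch, assuming each example carries a `source` field and that the cap is enforced by random subsampling; neither assumption is spelled out in the text above.

```python
import random
from collections import defaultdict

def cap_per_source(examples, threshold=350_000, seed=0):
    """Randomly keep at most `threshold` examples from each data source."""
    by_source = defaultdict(list)
    for ex in examples:
        by_source[ex["source"]].append(ex)
    rng = random.Random(seed)
    balanced = []
    for source, items in by_source.items():
        if len(items) > threshold:
            items = rng.sample(items, threshold)
        balanced.extend(items)
    rng.shuffle(balanced)
    return balanced
```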
Data Ratio: Given the various capabilities of different types of visual instruction tuning data, it is essential to balance the ratio of these data types. We conduct pilot experiments with a fixed dataset size of 1350k, examining the impact of different data ratios on downstream performance. We visualize the results in Figure 10 and summarize our findings as follows: (i) balancing General, OCR, and Language data is crucial; (ii) performance on knowledge-intensive tasks is influenced by multiple factors, often requiring a mix of OCR, chart, reasoning, and general perception data.
Cambrian-7M: By applying data filtering to Cambrian-10M with our identified data ratio, we create a smaller but higher-quality dataset called Cambrian-7M. Table 4 showcases the benefits of a well-balanced and carefully curated dataset: despite having fewer samples, Cambrian-7M demonstrates improved performance.
Dataset | Average | General | Knowledge | OCR & Chart | Vision-Centric
---|---|---|---|---|---
LLaVA-665K | 40.7 | 64.7 | 45.2 | 20.8 | 32.0 |
Cambrian-10M | 54.8 | 68.7 | 51.6 | 47.3 | 51.4 |
Cambrian-7M | 55.9 | 69.6 | 52.6 | 47.3 | 54.1 |
Here, we investigate a phenomenon we term the "answer machine phenomenon." We observe that a well-trained MLLM may excel at VQA benchmarks, but lack basic conversational abilities and default to outputting short, curt responses (see examples in Figure 5).
To address this, we find that incorporating additional system prompts during training mitigates the phenomenon. We add prompts such as "Answer the question using a single word or phrase." to questions whose responses consist of a single word or phrase. After integrating these system prompts, the model's benchmark performance remains unchanged while its conversational ability improves significantly.
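A sketch of how such prompts might be attached during data curation is shown below; the word-count heuristic for deciding which examples qualify is our assumption, not the paper's exact rule.

```python
SHORT_ANSWER_PROMPT = "Answer the question using a single word or phrase."

def add_short_answer_prompts(examples, max_words: int = 3):
    """Attach the short-answer instruction to questions whose target response is terse."""
    updated = []
    for ex in examples:
        question = ex["question"]
        # Heuristic (assumed): responses of a few words count as "single word or phrase".
        if len(ex["answer"].split()) <= max_words:
            question = f"{question}\n{SHORT_ANSWER_PROMPT}"
        updated.append({**ex, "question": question})
    return updated
```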
Finally, we leverage the insights from all of our previous studies to train a high-performance Cambrian model. We train with three different sizes of LLM backbones: LLaMA-3-Instruct-8B, Vicuna-1.5-13B, and Hermes-2-Yi-34B. Our visual tower uses a combination of four models—SigLIP, CLIP, DINOv2, and OpenCLIP ConvNeXt (see Combining Multiple Vision Encoders) with the Spatial Vision Aggregator. We use 2.5M adapter data and Cambrian-7M instruction tuning data (see Data Curation). We evaluate our models on the categorized benchmarks, and tabulate the results in Table 5. Cambrian-1 exceeds other open-source models such as LLaVA-NeXT and Mini-Gemini, and achieves comparable performance on a number of benchmarks with the best proprietary models such as GPT-4V, Gemini-Pro, and MM-1.
Table 5: Comparison with other leading MLLMs. Within each category, a per-category average (Avg) precedes the individual benchmarks: General (MME^P, MMB, SEED^I, GQA), Knowledge (SQA^I, MMMU^V, MathVista^M, AI2D), OCR & Chart (ChartQA, OCRBench, TextVQA, DocVQA), and Vision-Centric (MMVP, RealWorldQA, CV-Bench^2D, CV-Bench^3D).

Method | # Vis Tok. | Avg | MME^P | MMB | SEED^I | GQA | Avg | SQA^I | MMMU^V | MathVista^M | AI2D | Avg | ChartQA | OCRBench | TextVQA | DocVQA | Avg | MMVP | RealWorldQA | CV-Bench^2D | CV-Bench^3D
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
GPT-4V | UNK. | 63.0 | 1409.4 | 75.8 | 69.1 | 36.8 | 65.2 | 75.7 | 56.8 | 49.9 | 78.2 | 77.4 | 78.5 | 64.5 | 78.0 | 88.4 | 62.4 | 50.0 | 61.4 | 64.3 | 73.8 |
Gemini-1.0 Pro | UNK. | - | 1496.6 | 73.6 | 70.7 | - | - | 79.5 | 47.9 | 45.2 | - | - | - | 65.9 | - | - | - | - | - | - | - |
Gemini-1.5 Pro | UNK. | - | - | - | - | - | - | - | 58.5 | 52.1 | 80.3 | - | 81.3 | - | 73.5 | 86.5 | - | - | 67.5 | - | - |
Grok-1.5 | UNK. | - | - | - | - | - | - | - | 53.6 | 52.8 | 88.3 | - | 76.1 | - | 78.1 | 85.6 | - | - | 68.7 | - | - |
MM-1-8B | 144 | - | 1529.3 | 72.3 | 69.9 | - | - | 72.6 | 37.0 | 35.9 | - | - | - | - | - | - | - | - | - | - | - |
MM-1-30B | 144 | - | 1637.6 | 75.1 | 72.1 | - | - | 81.0 | 44.7 | 39.4 | - | - | - | - | - | - | - | - | - | - | - |
Base LLM: Llama-3-Ins-8B | |||||||||||||||||||||
Mini-Gemini-HD-8B | 2880 | 72.7 | 1606.0 | 72.7 | 73.2 | 64.5 | 55.7 | 75.1 | 37.3 | 37.0 | 73.5 | 62.9 | 59.1 | 47.7 | 70.2 | 74.6 | 51.5 | 18.7 | 62.1 | 62.2 | 63.0 |
LLaVA-NeXT-8B | 2880 | 72.5 | 1603.7 | 72.1 | 72.7 | 65.2 | 55.6 | 72.8 | 41.7 | 36.3 | 71.6 | 63.9 | 69.5 | 49.0 | 64.6 | 72.6 | 56.6 | 38.7 | 60.1 | 62.2 | 65.3 |
Cambrian-1-8B | 576 | 73.1 | 1,547.1 | 75.9 | 74.7 | 64.6 | 61.3 | 80.4 | 42.7 | 49.0 | 73.0 | 71.3 | 73.3 | 62.4 | 71.7 | 77.8 | 65.0 | 51.3 | 64.2 | 72.3 | 72.0 |
Base LLM: Vicuna-1.5-13B | |||||||||||||||||||||
Mini-Gemini-HD-13B | 2880 | 70.7 | 1597.0 | 68.6 | 70.6 | 63.7 | 54.1 | 71.9 | 37.3 | 37.0 | 70.1 | 60.8 | 56.6 | 46.6 | 70.2 | 69.8 | 49.4 | 19.3 | 57.5 | 53.6 | 67.3 |
LLaVA-NeXT-13B | 2880 | 69.9 | 1575.0 | 70.0 | 65.6 | 65.4 | 53.7 | 73.5 | 36.2 | 35.1 | 70.0 | 62.9 | 62.2 | 51.4 | 67.1 | 70.9 | 55.9 | 36.0 | 59.1 | 62.7 | 65.7 |
Cambrian-1-13B | 576 | 73.7 | 1,610.4 | 75.7 | 74.4 | 64.3 | 60.2 | 79.3 | 40.0 | 48.0 | 73.6 | 71.3 | 73.8 | 61.9 | 72.8 | 76.8 | 62.2 | 41.3 | 63.0 | 72.5 | 71.8 |
Base LLM: Hermes2-Yi-34B | |||||||||||||||||||||
Mini-Gemini-HD-34B | 2880 | 76.2 | 1659.0 | 80.6 | 75.3 | 65.8 | 62.4 | 77.7 | 48.0 | 43.4 | 80.5 | 68.1 | 67.6 | 51.8 | 74.1 | 78.9 | 63.8 | 37.3 | 67.2 | 71.5 | 79.2 |
LLaVA-NeXT-34B | 2880 | 76.0 | 1633.2 | 79.3 | 75.9 | 67.1 | 62.5 | 81.8 | 46.7 | 46.5 | 74.9 | 67.7 | 68.7 | 54.5 | 69.5 | 78.1 | 64.0 | 47.3 | 61.0 | 73.0 | 74.8 |
Cambrian-1-34B | 576 | 76.8 | 1689.3 | 81.4 | 75.3 | 65.8 | 67.0 | 85.6 | 49.7 | 53.2 | 79.7 | 71.9 | 75.6 | 60.0 | 76.7 | 75.5 | 68.5 | 52.7 | 67.8 | 74.0 | 79.7 |
To conclude, Cambrian-1 is a family of state-of-the-art MLLMs that achieve top performance across diverse benchmarks and excel in vision-centric tasks. We provide model weights, open-source code, datasets, and detailed recipes for model training and evaluation. We hope our work will strengthen the open research community and accelerate research in both visual representation learning and multimodal systems.
@article{tong2024cambrian,
title={{Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs}},
author={Tong, Shengbang and Brown, Ellis and Wu, Penghao and Woo, Sanghyun and Middepogu, Manoj and Akula, Sai Charitha and Yang, Jihan and Yang, Shusheng and Iyer, Adithya and Pan, Xichen and Wang, Austin and Fergus, Rob and LeCun, Yann and Xie, Saining},
journal={arXiv preprint arXiv:2406.16860},
year={2024}
}