Introducing Cambrian-1, a family of vision-centric multimodal LLMs (MLLMs).
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
While stronger language models can enhance multimodal capabilities,
the design choices for vision components are often insufficiently explored and disconnected from visual representation learning research.
Cambrian-1 is structured around five key pillars, each offering important insights into the design space of MLLMs:
To this end, Cambrian-1 not only achieves state-of-the-art performance, but also serves as a comprehensive, open cookbook for instruction-tuned MLLMs. See §State-of-the-art MLLM performance. We provide model weights, code, datasets, and detailed instruction-tuning and evaluation recipes. We hope our release will inspire and accelerate advancements in multimodal systems and visual representation learning.
Who's answering: LLM or MLLM? We compare performance between vision-disabled and vision-enabled settings across MLLMs trained with 23 different vision backbones. Our findings reveal that some benchmarks, such as MMMU and AI2D, are less reliant on visual inputs, whereas others, such as MMVP and MME, suffer significant performance drops, indicating that they effectively evaluate multimodality.
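A minimal sketch of this comparison, assuming a hypothetical `evaluate_fn` helper (not part of the Cambrian codebase) that scores one checkpoint on one benchmark and can withhold the image input:

```python
# Sketch of the vision-disabled vs. vision-enabled comparison.
# `evaluate_fn(checkpoint, benchmark, use_vision)` is a hypothetical callable
# returning a single accuracy-style score.

def vision_dependence(checkpoints, benchmarks, evaluate_fn):
    """Return the average score drop per benchmark when images are withheld."""
    drops = {}
    for bench in benchmarks:
        deltas = [
            evaluate_fn(ckpt, bench, use_vision=True)
            - evaluate_fn(ckpt, bench, use_vision=False)
            for ckpt in checkpoints
        ]
        drops[bench] = sum(deltas) / len(deltas)
    # A small drop suggests the benchmark is largely answerable by the LLM alone;
    # a large drop suggests it genuinely tests multimodal ability.
    return drops
```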
Benchmark Clustering and Analysis: Through correlation analysis and principal component analysis of MLLM performances across various benchmarks, distinct clusters emerge categorized as "General," "Knowledge," "Chart & OCR," and "Vision-Centric." We also find that vision-centric benchmarks are underrepresented in the current evaluation landscape.
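As a rough illustration of this kind of analysis (not the paper's exact script), one can correlate benchmarks across many MLLM runs and embed them with PCA, assuming a score matrix with one row per trained model and one column per benchmark:

```python
import numpy as np
from sklearn.decomposition import PCA

def cluster_benchmarks(scores: np.ndarray, names: list[str]):
    """scores: (n_models, n_benchmarks) matrix of MLLM results on each benchmark."""
    # Benchmark-by-benchmark correlation across models; scale differences cancel out.
    corr = np.corrcoef(scores, rowvar=False)
    # Embed each benchmark (a row of its correlations) in 2-D for visual clustering.
    coords = PCA(n_components=2).fit_transform(corr)
    return corr, dict(zip(names, coords))
```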
Cambrian Vision-Centric Benchmark (CV-Bench): To address the scarcity of vision-centric benchmarks, we introduce CV-Bench, repurposing standard vision tasks for multimodal evaluation. CV-Bench contains approximately 2,600 vision-centric VQA questions, addressing the limited size of existing vision-centric benchmarks.
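For intuition, items in a counting-style split can be derived programmatically from standard detection annotations. The sketch below is our own illustration of this repurposing idea, not the released CV-Bench generation code; the field names are assumptions.

```python
import random

def make_count_question(image_id: str, annotations: list, category: str):
    """Turn detection-style annotations into a multiple-choice counting VQA item."""
    count = sum(1 for ann in annotations if ann["category"] == category)
    # Build distractor options around the true count.
    options = sorted({count, max(0, count - 1), count + 1, count + 2})
    random.shuffle(options)
    letters = "ABCD"
    return {
        "image_id": image_id,
        "question": f"How many {category}s are in the image?",
        "choices": {letters[i]: str(o) for i, o in enumerate(options)},
        "answer": letters[options.index(count)],
    }
```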
MLLMs connect pre-trained LLM and vision backbones using a connector such as an MLP projector. Various studies have suggested different optimal training methodologies for MLLMs.
One Stage vs Two Stage Training: Recent work suggests skipping connector pre-training to reduce compute costs without harming performance. Following LLaVA's method, we experiment with 0, 0.5M, and 1.2M data points of adapter data.
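For concreteness, the two recipes differ only in whether the connector is pre-trained before instruction tuning. The sketch below is a schematic, assuming hypothetical `train_fn`, `adapter_loader`, and `sft_loader` helpers rather than the actual Cambrian training code; the learning rates are illustrative LLaVA-style defaults.

```python
# Schematic comparison of one-stage vs. two-stage training.

def two_stage(vision_encoder, connector, llm, adapter_loader, sft_loader, train_fn):
    # Stage 1: freeze the vision encoder and LLM, pre-train only the connector
    # on adapter (caption-style) data.
    for p in list(vision_encoder.parameters()) + list(llm.parameters()):
        p.requires_grad = False
    train_fn(trainable=[connector], data=adapter_loader, lr=1e-3)

    # Stage 2: unfreeze the LLM and instruction-tune connector + LLM together.
    for p in llm.parameters():
        p.requires_grad = True
    train_fn(trainable=[connector, llm], data=sft_loader, lr=2e-5)

def one_stage(connector, llm, sft_loader, train_fn):
    # Skip connector pre-training entirely and instruction-tune directly.
    train_fn(trainable=[connector, llm], data=sft_loader, lr=2e-5)
```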
Freeze vs Unfreeze Vision Encoder: There are also mixed practices regarding freezing or unfreezing the vision backbone during fine-tuning. Some argue that unfreezing the vision backbone significantly degrades performance. Our experiments demonstrate that, with a reasonable vision model learning rate, unfreezing benefits performance across all benchmark categories, with only a marginal change on Knowledge benchmarks.
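One common way to realize "a reasonable vision model learning rate" is to give the unfrozen vision tower its own, smaller parameter group in the optimizer. The PyTorch sketch below is a minimal illustration, assuming the vision backbone's parameters are namespaced under `vision_tower`; the scaling factor is an illustrative value, not the paper's setting.

```python
import torch

def build_optimizer(model, base_lr=2e-5, vision_lr_scale=0.1, weight_decay=0.0):
    """Unfreeze the vision tower but train it with a smaller learning rate."""
    vision_params, other_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # Assumes the vision backbone's parameters live under a "vision_tower" prefix.
        (vision_params if "vision_tower" in name else other_params).append(p)
    return torch.optim.AdamW(
        [
            {"params": other_params, "lr": base_lr},
            {"params": vision_params, "lr": base_lr * vision_lr_scale},
        ],
        weight_decay=weight_decay,
    )
```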
MLLMs provide a more real-world evaluation of visual representations than traditional benchmarks like ImageNet-1k. We use 2-stage instruction tuning with 1.2M adapter data and 737K fine-tuning data to compare a variety of vision models on downstream MLLM performance. Our evaluations show that language-supervised models exhibit strong advantages across all benchmark categories, especially in OCR & chart tasks. However, SSL models like DINOv2, despite being trained on smaller datasets, perform competitively on vision-centric benchmarks.
Narrowing the gap between CLIP and SSL models: Above, we observe that DINOv2 stands midway between SSL models and CLIP models on general VQA and knowledge VQA tasks, even outperforming some CLIP models on vision-centric benchmarks at higher resolution. We investigate unfreezing the vision backbone and increasing the amount of visual fine-tuning data to narrow this gap. In Figure 5, we observe that with the vision backbone unfrozen, the DINOv2-based MLLM fine-tuned on 5M data surpasses an MLLM trained with a CLIP model on 0.7M data. Additionally, the gap between DINOv2 and the CLIP models narrows under the 5M-data setting.
Combining Multiple Vision Encoders: As observed in Figure 4, different vision models excel at different aspects of MLLM performance. We explore combining multiple vision encoders to leverage their distinctive representations. Since different vision encoders use varying architectures and image resolutions, we interpolate each encoder's output visual tokens to a fixed number, 576, and concatenate them (a sketch of this baseline follows the table). The results are tabulated in Table 2, where we observe consistent performance improvements as more models are added.
Table 2: Combining multiple vision encoders. Benchmark columns are grouped into General (MME^P, MMB, SEED^I, GQA), Knowledge (SQA^I, MMMU^V, MathVista^M, AI2D), OCR & Chart (ChartQA, OCRBench, TextVQA, DocVQA), and Vision-Centric (MMVP, RealWorldQA, CV-Bench^2D, CV-Bench^3D).

Encoders | Average | MME^P | MMB | SEED^I | GQA | SQA^I | MMMU^V | MathVista^M | AI2D | ChartQA | OCRBench | TextVQA | DocVQA | MMVP | RealWorldQA | CV-Bench^2D | CV-Bench^3D
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
SigLIP+DINOv2 | 51.61 | 1,432.02 | 61.28 | 65.99 | 63.30 | 68.82 | 35.69 | 29.40 | 60.01 | 43.00 | 35.70 | 60.40 | 37.54 | 30.00 | 53.99 | 55.52 | 53.58 |
SigLIP+DINOv2+ConvNext | 54.52 | 1,503.51 | 63.83 | 67.97 | 63.95 | 70.40 | 35.99 | 29.30 | 60.69 | 48.20 | 36.90 | 64.97 | 45.53 | 34.67 | 58.69 | 55.74 | 60.33 |
SigLIP+DINOv2+ConvNext+CLIP | 54.74 | 1,479.46 | 63.32 | 67.63 | 64.04 | 71.39 | 35.49 | 29.10 | 59.88 | 50.24 | 39.60 | 64.55 | 46.12 | 32.67 | 58.95 | 58.54 | 60.42 |
SigLIP+ConvNext | 54.53 | 1,494.97 | 64.60 | 67.98 | 63.58 | 71.05 | 34.90 | 29.80 | 60.85 | 50.64 | 38.00 | 64.53 | 46.52 | 32.00 | 57.91 | 58.83 | 56.58 |
CLIP+ConvNext | 54.45 | 1,511.08 | 63.83 | 67.41 | 63.63 | 70.80 | 35.09 | 30.40 | 59.91 | 51.32 | 35.00 | 64.45 | 47.88 | 33.33 | 57.25 | 56.32 | 59.08 |
SigLIP+DINOv2+ConvNext | 53.78 | 1,450.64 | 63.57 | 67.79 | 63.63 | 71.34 | 34.80 | 30.20 | 61.04 | 49.32 | 37.70 | 64.05 | 45.83 | 30.00 | 56.21 | 58.08 | 54.33 |
SigLIP+CLIP+ConvNext | 54.53 | 1,507.28 | 63.23 | 68.64 | 63.63 | 71.10 | 35.89 | 30.90 | 59.97 | 52.36 | 38.50 | 65.40 | 47.92 | 28.67 | 57.25 | 57.66 | 55.92 |
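As a concrete illustration of the interpolate-and-concatenate baseline used for Table 2, the sketch below resizes each encoder's feature map to a 24×24 grid (576 tokens) and concatenates along the channel dimension. This is a simplified rendition of the strategy described above, not the released implementation.

```python
import torch
import torch.nn.functional as F

def concat_vision_features(feature_maps, num_tokens: int = 576):
    """feature_maps: list of (B, C_i, H_i, W_i) outputs from different vision encoders."""
    side = int(num_tokens ** 0.5)  # 24 x 24 grid for 576 tokens
    resized = [
        F.interpolate(f, size=(side, side), mode="bilinear", align_corners=False)
        for f in feature_maps
    ]
    # (B, sum(C_i), 24, 24) -> (B, 576, sum(C_i)) token sequence for the connector.
    fused = torch.cat(resized, dim=1)
    return fused.flatten(2).transpose(1, 2)
```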
However, this strategy has two limitations: 1) it employs interpolation, which can potentially lead to information loss, especially on vision encoders with high-resolution feature maps, and 2) it treats each model equally by simple concatenation. Therefore, we seek a more effective strategy that fully leverages model combinations with less information loss and more flexibility.
To effectively aggregate features from multiple vision encoders and reduce information loss from interpolation, we use a set of learnable latent queries that interact with multiple vision features through cross-attention layers; we refer to this module as the Spatial Vision Aggregator (SVA).
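A minimal sketch of this idea, with learnable latent queries cross-attending to the concatenated tokens of all encoders, is shown below. It omits the spatial inductive bias and multi-layer aggregation of the full SVA, and all dimensions and module names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LatentQueryAggregator(nn.Module):
    """Simplified cross-attention aggregator over multiple vision encoders."""

    def __init__(self, encoder_dims, d_model=1024, num_queries=576, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d_model) * 0.02)
        # Project each encoder's features into a shared width before attention.
        self.proj = nn.ModuleList(nn.Linear(d, d_model) for d in encoder_dims)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, features):
        """features: list of (B, N_i, C_i) token sequences, one per vision encoder."""
        batch = features[0].shape[0]
        # Concatenate all projected tokens into one key/value sequence.
        kv = torch.cat([p(f) for p, f in zip(self.proj, features)], dim=1)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        out, _ = self.attn(q, kv, kv)  # (B, num_queries, d_model)
        return out
```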
Previous work highlights the importance of data in training MLLMs, but explicit investigations are limited. In this study, we gather all available instruction tuning data and examine data curation by enhancing diversity, balancing sources, and improving mixtures.
Collecting Instruction Tuning Data from Existing Data Sources: We first use existing multimodal benchmarks and datasets involving visual interaction data, such as Visual Question Answering (VQA) and OCR data. We also collect a small volume of high-quality language-only instruction-following data to maintain the model's language ability.
Targeted Internet Data Collection Engine: We also introduce a data engine designed to create large-scale, reliable, high-quality knowledge-based multimodal instruction tuning data.
Cambrian-10M: To this end, we create a large pool of instruction tuning data, which we refer to as Cambrian-10M. This pool contains approximately 9784k data points, offering a diverse range of data for our work and future research. We visualize its composition in Figure 7.
Cambrian-10M is a large pool of instruction tuning data drawn from a variety of sources, with an unbalanced data ratio between categories. Here, we take a preliminary step toward data curation by improving data balancing and adjusting data ratios.
Data Balancing: We follow previous work and set a threshold t on the number of data points drawn from any single data source. We try t = 150k, 250k, 350k, and 450k in this section and observe an elbow effect in Table 3, finding that a threshold between 250k and 350k works best for Cambrian-10M.
Threshold t | Average | General | Knowledge | OCR & Chart | Vision-Centric
---|---|---|---|---|---
150k | 53.7 | 68.0 | 51.3 | 45.2 | 50.5 |
250k | 54.3 | 68.1 | 51.5 | 45.3 | 52.2 |
350k | 54.3 | 67.4 | 51.4 | 46.0 | 52.3 |
450k | 54.2 | 68.0 | 52.2 | 45.5 | 50.7 |
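The capping rule itself is straightforward. Below is a minimal sketch, assuming each example carries a `source` field and that the cap is enforced by random subsampling; neither assumption is spelled out in the text above.

```python
import random
from collections import defaultdict

def cap_per_source(examples, threshold=350_000, seed=0):
    """Randomly keep at most `threshold` examples from each data source."""
    by_source = defaultdict(list)
    for ex in examples:
        by_source[ex["source"]].append(ex)
    rng = random.Random(seed)
    balanced = []
    for source, items in by_source.items():
        if len(items) > threshold:
            items = rng.sample(items, threshold)
        balanced.extend(items)
    rng.shuffle(balanced)
    return balanced
```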
Data Ratio: Given the various capabilities of different types of visual instruction tuning data, it is essential to balance the ratio of these data types. We conduct pilot experiments with a fixed dataset size of 1350k, examining the impact of different data ratios on downstream performance. We visualize the results in Figure 10 and summarize our findings as follows: (i) balancing General, OCR, and Language data is crucial; (ii) performance on knowledge-intensive tasks is influenced by multiple factors, often requiring a mix of OCR, chart, reasoning, and general perception data.
Cambrian-7M: By applying data filtering to Cambrian-10M with our identified data ratio, we create a smaller but higher-quality dataset called Cambrian-7M. Table 4 showcases the benefits of a well-balanced and carefully curated dataset: despite having fewer samples, Cambrian-7M demonstrates improved performance.
Dataset | Average | General | Knowledge | OCR & Chart | Vision-Centric
---|---|---|---|---|---
LLaVA-665K | 40.7 | 64.7 | 45.2 | 20.8 | 32.0 |
Cambrian-10M | 54.8 | 68.7 | 51.6 | 47.3 | 51.4 |
Cambrian-7M | 55.9 | 69.6 | 52.6 | 47.3 | 54.1 |
Here, we investigate a phenomenon we term the "answer machine phenomenon." We observe that a well-trained MLLM may excel at VQA benchmarks, but lack basic conversational abilities and default to outputting short, curt responses (see examples in Figure 5).
To address this, we find that incorporating additional system prompts during training mitigates the phenomenon. We add prompts such as "Answer the question using a single word or phrase." to questions whose responses consist of a single word or phrase. After integrating these system prompts, the model's benchmark performance remains unchanged while its conversational ability improves significantly.
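A sketch of how such prompts might be attached during data curation is shown below; the word-count heuristic for deciding which examples qualify is our assumption, not the paper's exact rule.

```python
SHORT_ANSWER_PROMPT = "Answer the question using a single word or phrase."

def add_short_answer_prompts(examples, max_words: int = 3):
    """Attach the short-answer instruction to questions whose target response is terse."""
    updated = []
    for ex in examples:
        question = ex["question"]
        # Heuristic (assumed): responses of a few words count as "single word or phrase".
        if len(ex["answer"].split()) <= max_words:
            question = f"{question}\n{SHORT_ANSWER_PROMPT}"
        updated.append({**ex, "question": question})
    return updated
```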
Finally, we leverage the insights from all of our previous studies to train a high-performance Cambrian model. We train with three different sizes of LLM backbones: LLaMA-3-Instruct-8B, Vicuna-1.5-13B, and Hermes-2-Yi-34B. Our visual tower uses a combination of four models—SigLIP, CLIP, DINOv2, and OpenCLIP ConvNeXt (see Combining Multiple Vision Encoders) with the Spatial Vision Aggregator. We use 2.5M adapter data and Cambrian-7M instruction tuning data (see Data Curation). We evaluate our models on the categorized benchmarks, and tabulate the results in Table 5. Cambrian-1 exceeds other open-source models such as LLaVA-NeXT and Mini-Gemini, and achieves comparable performance on a number of benchmarks with the best proprietary models such as GPT-4V, Gemini-Pro, and MM-1.
Table 5: Comparison with other leading MLLMs. Within each category, a per-category average (Avg) precedes the individual benchmarks: General (MME^P, MMB, SEED^I, GQA), Knowledge (SQA^I, MMMU^V, MathVista^M, AI2D), OCR & Chart (ChartQA, OCRBench, TextVQA, DocVQA), and Vision-Centric (MMVP, RealWorldQA, CV-Bench^2D, CV-Bench^3D).

Method | # Vis Tok. | Avg | MME^P | MMB | SEED^I | GQA | Avg | SQA^I | MMMU^V | MathVista^M | AI2D | Avg | ChartQA | OCRBench | TextVQA | DocVQA | Avg | MMVP | RealWorldQA | CV-Bench^2D | CV-Bench^3D
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
GPT-4V | UNK. | 63.0 | 1409.4 | 75.8 | 69.1 | 36.8 | 65.2 | 75.7 | 56.8 | 49.9 | 78.2 | 77.4 | 78.5 | 64.5 | 78.0 | 88.4 | 62.4 | 50.0 | 61.4 | 64.3 | 73.8 |
Gemini-1.0 Pro | UNK. | - | 1496.6 | 73.6 | 70.7 | - | - | 79.5 | 47.9 | 45.2 | - | - | - | 65.9 | - | - | - | - | - | - | - |
Gemini-1.5 Pro | UNK. | - | - | - | - | - | - | - | 58.5 | 52.1 | 80.3 | - | 81.3 | - | 73.5 | 86.5 | - | - | 67.5 | - | - |
Grok-1.5 | UNK. | - | - | - | - | - | - | - | 53.6 | 52.8 | 88.3 | - | 76.1 | - | 78.1 | 85.6 | - | - | 68.7 | - | - |
MM-1-8B | 144 | - | 1529.3 | 72.3 | 69.9 | - | - | 72.6 | 37.0 | 35.9 | - | - | - | - | - | - | - | - | - | - | - |
MM-1-30B | 144 | - | 1637.6 | 75.1 | 72.1 | - | - | 81.0 | 44.7 | 39.4 | - | - | - | - | - | - | - | - | - | - | - |
Base LLM: Llama-3-Ins-8B | |||||||||||||||||||||
Mini-Gemini-HD-8B | 2880 | 72.7 | 1606.0 | 72.7 | 73.2 | 64.5 | 55.7 | 75.1 | 37.3 | 37.0 | 73.5 | 62.9 | 59.1 | 47.7 | 70.2 | 74.6 | 51.5 | 18.7 | 62.1 | 62.2 | 63.0 |
LLaVA-NeXT-8B | 2880 | 72.5 | 1603.7 | 72.1 | 72.7 | 65.2 | 55.6 | 72.8 | 41.7 | 36.3 | 71.6 | 63.9 | 69.5 | 49.0 | 64.6 | 72.6 | 56.6 | 38.7 | 60.1 | 62.2 | 65.3 |
Cambrian-1-8B | 576 | 73.1 | 1,547.1 | 75.9 | 74.7 | 64.6 | 61.3 | 80.4 | 42.7 | 49.0 | 73.0 | 71.3 | 73.3 | 62.4 | 71.7 | 77.8 | 65.0 | 51.3 | 64.2 | 72.3 | 72.0 |
Base LLM: Vicuna-1.5-13B | |||||||||||||||||||||
Mini-Gemini-HD-13B | 2880 | 70.7 | 1597.0 | 68.6 | 70.6 | 63.7 | 54.1 | 71.9 | 37.3 | 37.0 | 70.1 | 60.8 | 56.6 | 46.6 | 70.2 | 69.8 | 49.4 | 19.3 | 57.5 | 53.6 | 67.3 |
LLaVA-NeXT-13B | 2880 | 69.9 | 1575.0 | 70.0 | 65.6 | 65.4 | 53.7 | 73.5 | 36.2 | 35.1 | 70.0 | 62.9 | 62.2 | 51.4 | 67.1 | 70.9 | 55.9 | 36.0 | 59.1 | 62.7 | 65.7 |
Cambrian-1-13B | 576 | 73.7 | 1,610.4 | 75.7 | 74.4 | 64.3 | 60.2 | 79.3 | 40.0 | 48.0 | 73.6 | 71.3 | 73.8 | 61.9 | 72.8 | 76.8 | 62.2 | 41.3 | 63.0 | 72.5 | 71.8 |
Base LLM: Hermes2-Yi-34B | |||||||||||||||||||||
Mini-Gemini-HD-34B | 2880 | 76.2 | 1659.0 | 80.6 | 75.3 | 65.8 | 62.4 | 77.7 | 48.0 | 43.4 | 80.5 | 68.1 | 67.6 | 51.8 | 74.1 | 78.9 | 63.8 | 37.3 | 67.2 | 71.5 | 79.2 |
LLaVA-NeXT-34B | 2880 | 76.0 | 1633.2 | 79.3 | 75.9 | 67.1 | 62.5 | 81.8 | 46.7 | 46.5 | 74.9 | 67.7 | 68.7 | 54.5 | 69.5 | 78.1 | 64.0 | 47.3 | 61.0 | 73.0 | 74.8 |
Cambrian-1-34B | 576 | 76.8 | 1689.3 | 81.4 | 75.3 | 65.8 | 67.0 | 85.6 | 49.7 | 53.2 | 79.7 | 71.9 | 75.6 | 60.0 | 76.7 | 75.5 | 68.5 | 52.7 | 67.8 | 74.0 | 79.7 |
To conclude, Cambrian-1 is a family of state-of-the-art MLLMs that achieve top performance across diverse benchmarks and excel in vision-centric tasks. We provide model weights, open-source code, datasets, and detailed recipes for model training and evaluation. We hope our work will strengthen the open research community and accelerate research in both visual representation learning and multimodal systems.
@article{tong2024cambrian,
title={{Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs}},
author={Tong, Shengbang and Brown, Ellis and Wu, Penghao and Woo, Sanghyun and Middepogu, Manoj and Akula, Sai Charitha and Yang, Jihan and Yang, Shusheng and Iyer, Adithya and Pan, Xichen and Wang, Austin and Fergus, Rob and LeCun, Yann and Xie, Saining},
journal={arXiv preprint arXiv:2406.16860},
year={2024}
}