Camera pose matters. The position and orientation of each viewpoint define a shared spatial coordinate frame that relates observations across video frames. Yet this signal is largely absent from multimodal LLMs (MLLMs) for video understanding, which process frames as isolated 2D snapshots rather than as views of the persistent scene that humans perceive. We revisit pose as a lightweight supervisory signal and introduce Cambrian-P, a video MLLM augmented with per-frame learnable camera tokens and a pose regression head. With a carefully designed sampling scheme, the model gains 4.5–6.5% on spatial reasoning benchmarks such as VSI-Bench, generalizes across eight additional spatial and general video QA benchmarks, and, as a byproduct, achieves state-of-the-art streaming pose estimation on ScanNet. Surprisingly, training on pseudo-annotated poses from in-the-wild video further improves general video QA benchmarks, showing that pose helps beyond spatial reasoning. Together, these results position camera pose as a fundamental signal for video models that reason about the physical world.

Cambrian-P

Why pose? Video is the 2D observation of a dynamic 3D scene from a coherent sequence of viewpoints. Each viewpoint is defined by an observer's pose, i.e., its 3D position and orientation, specifying how the camera is embedded in the physical world. Pose is the lightest 3D signal: it compactly encodes how views relate geometrically, enforces global consistency through rigid-body constraints, and disentangles camera motion from scene dynamics. Camera pose is not merely a useful auxiliary cue, but a foundational inductive bias for spatially aware video understanding.

Architecture. Cambrian-P builds on the Cambrian-S backbone (SigLIP2-SO400m vision encoder + Qwen2.5 LLM). To enable pose estimation inside the LLM's feature space, we introduce two learnable camera tokens, c_first and c_rest, appended after the visual tokens of each frame. After the LLM forward pass, a linear projector and a four-layer self-attention head regress a 9-D pose encoding (translation, quaternion, horizontal and vertical field-of-view) for every frame, with all poses expressed in the coordinate system of the first camera.
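
The 9-D target described above can be illustrated with a small sketch. It is not the authors' code: the function names, the (w, x, y, z) quaternion order, the camera-to-world convention, and the layout [translation, quaternion, fov_h, fov_v] are all assumptions made for illustration. It re-expresses every pose in the first camera's coordinate system and packs the result into nine numbers per frame.

```python
import math

def q_conj(q):
    """Conjugate of a quaternion (w, x, y, z); the inverse for unit quaternions."""
    w, x, y, z = q
    return (w, -x, -y, -z)

def q_mul(a, b):
    """Hamilton product of two quaternions in (w, x, y, z) order."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )

def q_rotate(q, v):
    """Rotate the 3-vector v by the unit quaternion q."""
    _, x, y, z = q_mul(q_mul(q, (0.0, *v)), q_conj(q))
    return (x, y, z)

def encode_relative(poses, fovs):
    """poses: list of (translation, quaternion), assumed camera-to-world.
    fovs: list of (fov_h, fov_v). Returns one 9-D list per frame,
    expressed in the coordinate system of the first camera."""
    t0, q0 = poses[0]
    q0_inv = q_conj(q0)
    encoded = []
    for (t, q), (fov_h, fov_v) in zip(poses, fovs):
        delta = tuple(a - b for a, b in zip(t, t0))
        t_rel = q_rotate(q0_inv, delta)   # translation in the first camera's frame
        q_rel = q_mul(q0_inv, q)          # rotation relative to the first camera
        encoded.append([*t_rel, *q_rel, fov_h, fov_v])
    return encoded

# Two cameras sharing a 90-degree yaw, two units apart along world y.
h = math.sqrt(0.5)
yaw90 = (h, 0.0, 0.0, h)
example = encode_relative(
    [((1.0, 0.0, 0.0), yaw90), ((1.0, 2.0, 0.0), yaw90)],
    [(1.0, 0.8), (1.0, 0.8)],
)
```

Under these conventions the first frame always encodes to zero translation and the identity quaternion, so the regression head only has to predict motion relative to the start of the clip.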

Training objective. The total loss combines next-token prediction with a pose regression loss. The pose loss is a weighted L1 over translation, quaternion rotation, and field-of-view. Translation is normalized by the average consecutive-frame distance \( \bar d \) so that indoor and outdoor scenes contribute comparable gradients, and a stop-gradient least-squares scale aligns predictions to ground truth on non-metric datasets.
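
A minimal sketch of the pose loss described above, in plain Python so the arithmetic is easy to follow. The loss weights, the dictionary layout, and the assumption that the model predicts translation in units of the average step are illustrative choices, not details confirmed by the text. In an autodiff framework the scale `s` would be treated as a constant (the stop-gradient); plain Python has no gradients, so that step is only marked in a comment.

```python
import math

def mean_step(translations):
    """Average distance between consecutive camera positions (d-bar)."""
    steps = [math.dist(a, b) for a, b in zip(translations, translations[1:])]
    return sum(steps) / len(steps)

def ls_scale(pred, target):
    """Scalar s minimizing sum ||s * pred - target||^2 over all frames."""
    num = sum(p * g for pv, gv in zip(pred, target) for p, g in zip(pv, gv))
    den = sum(p * p for pv in pred for p in pv)
    return num / den if den > 0 else 1.0

def l1(a, b):
    """Mean absolute difference over all elements of two lists of vectors."""
    diffs = [abs(x - y) for u, v in zip(a, b) for x, y in zip(u, v)]
    return sum(diffs) / len(diffs)

def pose_loss(pred, gt, w_t=1.0, w_q=1.0, w_f=1.0, metric=True):
    """pred / gt: dicts with keys "t" (3-vectors), "q" (4-vectors), "fov" (2-vectors).
    The weights w_t, w_q, w_f are hypothetical placeholders."""
    d_bar = max(mean_step(gt["t"]), 1e-8)            # guard against static cameras
    gt_t = [[x / d_bar for x in v] for v in gt["t"]]  # normalized translation target
    pred_t = pred["t"]
    if not metric:
        s = ls_scale(pred_t, gt_t)                    # constant w.r.t. gradients (stop-gradient)
        pred_t = [[s * x for x in v] for v in pred_t]
    # Plain L1 on quaternions; handling of the q / -q sign ambiguity is omitted here.
    return (w_t * l1(pred_t, gt_t)
            + w_q * l1(pred["q"], gt["q"])
            + w_f * l1(pred["fov"], gt["fov"]))
```

Dividing by the average step makes a clip with 5 cm camera steps and one with 5 m steps produce translation targets of similar magnitude, which is the stated reason for the normalization.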

**Figure 3: Interleaved training.** *Top:* augmented pose-only samples using dynamic frame sampling and only pose supervision. *Bottom:* samples using uniform frame sampling with both VQA and pose supervision. `L` is the total number of frames in the video.

Bridging the training-dynamics gap. VQA and pose estimation have conflicting needs: VQA prefers uniform temporal sampling and minimal augmentation, while pose estimation prefers random starts, variable intervals, and heavy augmentation. We resolve this with two ingredients:

Interleaved training: we mix VQA-only, pose-only, and joint VQA+pose samples in training. Dedicated pose-only samples use the sampling and augmentation favored by pose models (random starts, variable intervals, heavy augmentation) and are supervised only by \( \mathcal{L}_{\text{pose}} \), allowing pose training to scale independently of VQA.
Random-jitter frame sampling: each uniformly sampled frame index for VQA is perturbed by a small random offset, breaking the memorization of fixed frame–pose correspondences without harming VQA coverage.
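
Random-jitter frame sampling can be sketched as follows. This is an illustrative version only: the function names, the segment-center choice for the uniform indices, and the `max_offset` parameter are assumptions, since the text does not specify the jitter distribution or its range.

```python
import random

def uniform_indices(total_frames, num_samples):
    """Centers of num_samples equal-length segments of the video."""
    segment = total_frames / num_samples
    return [int(segment * (i + 0.5)) for i in range(num_samples)]

def jittered_indices(total_frames, num_samples, max_offset, rng=random):
    """Perturb each uniform index by a random integer offset in
    [-max_offset, max_offset], clamped to the valid frame range."""
    jittered = []
    for idx in uniform_indices(total_frames, num_samples):
        shifted = idx + rng.randint(-max_offset, max_offset)
        jittered.append(min(max(shifted, 0), total_frames - 1))
    return sorted(jittered)  # keep temporal order even for large offsets

# Example: 4 frames from a 100-frame video, each shifted by at most 3 frames.
frames = jittered_indices(100, 4, max_offset=3, rng=random.Random(0))
```

Keeping `max_offset` well below half the segment length preserves the even temporal coverage that VQA prefers, while the exact frame indices still change from one epoch to the next.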

Improved Spatial Video QA

Cambrian-P is fine-tuned from Cambrian-S-7B (stage 3) on VSI-590K and achieves state-of-the-art performance on VSI-Bench. Compared to its no-pose counterpart, it gains 4.5% on VSI-Bench (69.2 → 73.7 average), with the largest improvements on tasks that demand global spatial understanding, such as absolute distance (+6.4), relative direction (+7.5), and route planning (+13.9).

Model	LM	Avg.	Numerical Answer				Multiple-Choice Answer
Model	LM	Avg.	Obj. Count	Abs. Dist.	Obj. Size	Room Size	Rel. Dist.	Rel. Dir.	Route Plan	Appr. Order
General-purpose Models
GPT-4o	Unk.	34.0	46.2	5.3	43.8	38.2	37.0	41.3	31.5	28.5
Gemini-2.5 Pro	Unk.	51.5	43.8	34.9	64.3	42.8	61.1	47.8	45.9	71.3
Qwen2.5VL-7B	Qwen2.5-7B	29.3	25.2	10.5	36.4	29.6	38.4	38.0	29.8	26.8
InternVL-3 8B	Qwen2.5-7B	42.1	68.1	39.0	48.4	33.6	48.3	36.4	27.3	35.4
InternVL-3.5 8B	Qwen3-8B	56.3	–	–	–	–	–	–	–	–
Qwen3-VL 8B	Qwen3-8B	56.6	–	–	–	–	–	–	–	–
Spatial-specialist Models
VST 7B	Qwen2.5-7B	61.2	71.6	43.8	75.5	69.2	60.0	55.6	44.3	69.2
VLM-3R 7B	Qwen2-7B	60.9	70.2	49.4	69.2	67.1	65.4	80.5	45.4	40.1
VG-LLM 8B	Qwen2.5-7B	50.7	67.9	37.7	58.6	62.0	46.6	40.7	32.4	59.2
Cambrian-S 7B	Qwen2.5-7B	67.5	73.2	50.5	74.9	72.2	71.1	76.2	41.8	80.1
SenseNova-SI 8B	Qwen2.5-7B	68.7	–	–	–	–	–	–	–	–
GeoThinker 7B	Qwen2.5-7B	68.5	–	–	–	–	–	–	–	–
GeoThinker 8B	Qwen3-8B	72.6	–	–	–	–	–	–	–	–
Cambrian-S-7B^†	Qwen2.5-7B	69.2	73.6	53.7	75.2	74.7	71.5	82.0	38.7	84.3
Cambrian-P	Qwen2.5-7B	73.7	74.9	60.1	76.0	76.9	74.8	89.5	52.6	85.0

Table 1: VSI-Bench results. ^†Cambrian-S fine-tuned only on VSI-590K. Best results in bold.
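
As a quick check of the per-task gains, the snippet below subtracts the Cambrian-S-7B^† row from the Cambrian-P row of Table 1. The numbers are copied from the table; the variable names are illustrative only.

```python
tasks = ["Avg.", "Obj. Count", "Abs. Dist.", "Obj. Size", "Room Size",
         "Rel. Dist.", "Rel. Dir.", "Route Plan", "Appr. Order"]
# Rows copied from Table 1.
cambrian_s_ft = [69.2, 73.6, 53.7, 75.2, 74.7, 71.5, 82.0, 38.7, 84.3]
cambrian_p = [73.7, 74.9, 60.1, 76.0, 76.9, 74.8, 89.5, 52.6, 85.0]

# Per-column gain of Cambrian-P over the no-pose baseline, in points.
gains = {t: round(p - s, 1) for t, s, p in zip(tasks, cambrian_s_ft, cambrian_p)}

# The three tasks with the largest gains (excluding the average column).
top3 = sorted((t for t in tasks if t != "Avg."), key=gains.get, reverse=True)[:3]
print(gains["Avg."], top3)  # 4.5 ['Route Plan', 'Rel. Dir.', 'Abs. Dist.']
```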

Camera-Movement Understanding (VSTemporalI-Bench)

We further fine-tune Cambrian-P on VSI-590K + VLM-3R data and evaluate on VSTemporalI-Bench, which probes camera-motion understanding. Pose supervision yields a +20% jump on the camera-movement-direction subtask over the no-pose baseline.

Method	Avg.	Cam-Obj Abs. Dist.	Cam. Displace.	Cam. Mov. Dir.	Obj-Obj Rel. Pos.	Cam-Obj Rel. Dist.
GPT-4o	38.2	29.5	23.4	37.3	58.1	42.5
Gemini-1.5 Flash	32.1	28.5	20.9	24.4	52.6	33.9
LLaVA-NeXT-Video-72B	44.0	32.3	10.5	48.1	78.3	50.9
VLM-3R-7B	58.8	39.4	39.6	60.6	86.5	68.6
GeoThinker-8B	67.4	38.4	45.8	84.2	93.6	75.2
Cambrian-P (w/o pose)	62.4	39.4	40.6	67.7	92.2	72.0
Cambrian-P	68.9	42.5	46.6	87.7	94.3	73.2

Table 2: VSTemporalI-Bench results. Pose supervision yields +20% on the camera-movement-direction subtask.

Out-of-Distribution Generalization

Although Cambrian-P is fine-tuned only on VSI-590K, the local-to-global video understanding it acquires through pose supervision transfers to a wide range of out-of-distribution spatial and general video QA benchmarks.

Model	SparBench	MMSIBench	MMSIVideo	MindCube	MVBench	EgoSchema	Percept. Test	Tomato
Cambrian-P (w/o pose)	32.7	26.2	20.1	34.3	51.9	49.6	56.4	20.4
Cambrian-P	35.9	28.0	22.9	38.4	53.5	52.5	58.4	26.7

Table 3: Out-of-distribution generalization across 8 spatial and general video QA benchmarks.

Scaling Pose Supervision with Pseudo Labels

Ground-truth camera pose annotations are only available in a few 3D datasets (ScanNet, ScanNet++, ARKitScenes inside VSI-590K). To scale pose supervision to general-domain videos, we pseudo-annotate CamS-590K, a 590K-clip subsample of Cambrian-S-3M, the open-domain video instruction-tuning corpus used in Cambrian-S Stage 3. The pipeline runs (i) a scene-cut detector, (ii) a Qwen3-VL pose-aware quality filter that drops synthetic, text-overlaid, screen-recorded, blurry, or through-glass clips, (iii) VIPE for streaming pose recovery, and (iv) a trajectory-quality pass that interpolates frames violating per-frame velocity / acceleration / rotation limits. The resulting pseudo poses are routed through the same interleaved training recipe as GT poses, with no architecture change.
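
Step (iv) of the pipeline can be illustrated with a simplified sketch. It covers only a per-frame velocity limit on translation and repairs flagged frames by linear interpolation; the acceleration and rotation checks are omitted, and the function names and the `max_step` threshold are hypothetical rather than taken from the actual pipeline.

```python
import math

def flag_outliers(positions, max_step):
    """Mark frames whose distance from the last accepted frame exceeds
    max_step per elapsed frame. Frame 0 is always accepted."""
    ok = [True] * len(positions)
    last = 0
    for i in range(1, len(positions)):
        if math.dist(positions[i], positions[last]) > max_step * (i - last):
            ok[i] = False
        else:
            last = i
    return ok

def interpolate_flagged(positions, ok):
    """Replace flagged positions by linear interpolation between the nearest
    accepted neighbors; trailing flagged frames copy the last accepted one."""
    fixed = [tuple(p) for p in positions]
    valid = [i for i, good in enumerate(ok) if good]
    for i, good in enumerate(ok):
        if good:
            continue
        prev = max(j for j in valid if j < i)
        later = [j for j in valid if j > i]
        if not later:
            fixed[i] = tuple(positions[prev])
            continue
        nxt = later[0]
        a = (i - prev) / (nxt - prev)
        fixed[i] = tuple((1 - a) * p + a * q
                         for p, q in zip(positions[prev], positions[nxt]))
    return fixed

# A trajectory with one implausible jump at frame 2.
track = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (50.0, 0.0, 0.0),
         (3.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
mask = flag_outliers(track, max_step=1.5)
cleaned = interpolate_flagged(track, mask)
```

A production version would presumably interpolate rotations as well (for example with spherical interpolation of quaternions), which this sketch leaves out.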

Adding CamS-590K boosts general video QA but slightly dips VSI-Bench, showing that generic video tuning does not by itself preserve spatial reasoning. GT pose restores the spatial gain without giving up the general-VQA improvements, and layering VIPE pseudo poses on CamS data pushes all four benchmarks up further. These results suggest that pseudo poses, even when derived from noisy in-the-wild videos, provide a scalable supervision signal for video understanding.

Training Data	Pose Sup.	% Pose Sup.	VSI-Bench	MVBench	Perception Test	EgoSchema
Spatial VQA Data Only
VSI-590K	—	0%	71.2	51.7	56.7	48.5
VSI-590K	GT	49%	73.7	53.8	58.1	51.3
Spatial VQA + General VQA Data
VSI-590K + CamS-590K	—	0%	70.9	68.0	66.9	71.2
VSI-590K + CamS-590K	GT	25%	73.7	67.9	67.8	71.7
VSI-590K + CamS-590K	GT + Pseudo	48%	73.9	69.3	67.9	73.6

Table 4: Cambrian-P with general VQA training data and pseudo-pose supervision (128 input frames). Adding CamS-590K boosts general video QA but slightly hurts VSI-Bench without pose; GT pose restores the spatial gain; VIPE pseudo poses on CamS-590K improve all four benchmarks under an otherwise unchanged recipe.

Streaming Camera Pose Estimation

As a byproduct of pose supervision, Cambrian-P doubles as a competitive streaming camera-pose estimator. It achieves the lowest ATE on ScanNet among all streaming methods, despite using a standard MLLM backbone (causal SigLIP encoder, no DINOv2, no bidirectional transformer). On TUM-dynamic and Sintel it remains competitive with specialist models like CUT3R, StreamVGGT, and Point3R.

Model	ScanNet			TUM-dynamic			Sintel
Model	ATE ↓	RPE-t ↓	RPE-r ↓	ATE ↓	RPE-t ↓	RPE-r ↓	ATE ↓	RPE-t ↓	RPE-r ↓
Offline Models
VGGT	0.035	0.015	0.380	0.009	0.008	0.350	0.172	0.061	0.470
π³	0.031	0.013	0.347	0.014	0.009	0.312	0.074	0.040	0.282
MapAnything	0.052	0.025	0.720	0.029	0.023	0.370	0.226	0.077	0.640
MASt3R-GA	0.078	0.020	0.475	0.038	0.012	0.448	0.185	0.060	1.496
MonST3R-GA	0.077	0.018	0.529	0.098	0.019	0.935	0.111	0.044	0.869
Streaming Models
StreamVGGT	0.127	0.041	1.880	0.062	0.030	0.690	0.273	0.109	0.850
CUT3R	0.096	0.022	0.590	0.045	0.015	0.440	0.215	0.070	0.630
Point3R	0.097	0.035	2.791	0.058	0.031	0.758	0.442	0.154	1.897
Spann3R	0.096	0.023	0.661	0.056	0.021	0.591	0.329	0.110	4.471
G²VLM	0.148	0.048	1.220	0.129	0.044	0.700	0.301	0.135	1.450
Cambrian-P (Ours)	0.078	0.023	0.880	0.046	0.020	0.580	0.239	0.081	2.440

Table 5: Camera pose estimation on ScanNet, TUM-dynamic, and Sintel. Best values among streaming methods in bold.

Analysis

Scaling. The pose objective scales gracefully along all three axes (model size, data size, and training iterations), with the gap over the no-pose baseline widening as we scale up.

**Figure 5: Scaling.** Two panels: VQA (VSI-Bench score) and pose ATE. Larger models and more data both yield higher VSI-Bench scores and lower pose ATE across benchmarks.

Camera pose > depth. Adding a depth head yields smaller VQA gains than pose, and combining pose + depth slightly hurts both objectives. Within the pose loss, both translation and rotation contribute meaningfully; field-of-view alone behaves comparably to depth.

Pose makes MLLMs think more globally. When VSI-Bench questions are grouped by normalized object distance (near / medium / far), pose supervision delivers the largest gains on far object pairs (e.g., +6.6% on relative distance and +10.2% on relative direction for far questions), suggesting that pose enables more global spatial reasoning.

**Figure 6: Camera pose improves global spatial reasoning.** The performance gain of Cambrian-P over its no-pose counterpart increases with normalized object distance for both relative-distance and relative-direction questions in VSI-Bench.

Latency. Despite having 6–10× more parameters than specialist 3D reconstruction models, Cambrian-P achieves lower per-frame latency thanks to (i) compact SigLIP visual tokens, (ii) causal-attention FLOPs savings, and (iii) the standard KV-cache mechanism of LLM inference. In offline mode it processes 90 ScanNet frames in 2.16 s (0.02 s per frame); in streaming mode its per-frame cost (0.06 s) is on par with CUT3R's (0.07 s).

Method	#Params	Offline / seq (s)	Offline / frame (s)	Streaming / seq (s)	Streaming / frame (s)
VGGT	1.26B	9.90	0.11	—	—
CUT3R	0.80B	5.22	0.06	6.03	0.07
StreamVGGT	1.26B	—	—	9.00	0.10
Cambrian-P (Ours)	8.20B	2.16	0.02	5.76	0.06

Table 6: Inference latency on the ScanNet test set (NVIDIA L40S, 90-frame sequences).
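
The per-frame columns and the parameter ratio quoted above follow directly from the per-sequence numbers in Table 6. The snippet below reproduces that arithmetic; the dictionary layout is illustrative and the values are copied from the table.

```python
FRAMES_PER_SEQUENCE = 90

# (method, mode) -> seconds per 90-frame sequence, copied from Table 6.
seq_seconds = {
    ("VGGT", "offline"): 9.90,
    ("CUT3R", "offline"): 5.22,
    ("CUT3R", "streaming"): 6.03,
    ("StreamVGGT", "streaming"): 9.00,
    ("Cambrian-P", "offline"): 2.16,
    ("Cambrian-P", "streaming"): 5.76,
}
per_frame = {k: round(v / FRAMES_PER_SEQUENCE, 2) for k, v in seq_seconds.items()}

# Parameter counts in billions, copied from Table 6.
params_b = {"VGGT": 1.26, "CUT3R": 0.80, "StreamVGGT": 1.26, "Cambrian-P": 8.20}
ratios = {m: params_b["Cambrian-P"] / p for m, p in params_b.items() if m != "Cambrian-P"}
```

The ratios come out at roughly 6.5× (VGGT, StreamVGGT) and 10.25× (CUT3R), which is where the 6–10× range comes from.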

Cambrian-P

Pose-Grounded Video Understanding
