Cambrian-P

Pose-Grounded Video Understanding

Pose-Grounded MLLM: We introduce per-frame learnable camera tokens and a lightweight pose head that regresses camera pose from the LLM's hidden states.
Interleaved Training: Pose estimation and video question answering (VQA) disagree on how to sample and augment frames. We interleave pose-only batches with jittered VQA batches, balancing gradients from both objectives in a single training run.
Spatial VQA: 4.5–6.5% gains on VSI-Bench and VSTemporalI-Bench (VSTI-Bench), plus consistent improvements across eight OOD spatial and general video benchmarks.
Streaming Pose Estimation: State-of-the-art streaming camera pose accuracy on ScanNet (lowest ATE among streaming methods), surpassing specialist streaming reconstruction models.

Overview

Figure 1: Cambrian-P in video QA. Cambrian-P equips current video MLLMs with native camera-pose prediction via an extra camera token per frame. By positioning each video frame in a shared spatial coordinate frame, the model effectively reasons about the underlying 3D world projected in video.

Camera pose matters. The position and orientation of each viewpoint define a shared spatial coordinate frame that relates observations across video frames. Yet this signal is largely absent from multimodal LLMs (MLLMs) for video understanding, which process frames as isolated 2D snapshots, instead of the persistent scene humans perceive. We revisit pose as a lightweight supervisory signal and introduce Cambrian-P, a video MLLM augmented with per-frame learnable camera tokens and a pose regression head. With a carefully designed sampling scheme, the model achieves substantial gains of 4.5–6.5% on spatial reasoning benchmarks such as VSI-Bench, generalizes across eight additional spatial and general video QA benchmarks, and, as a byproduct, achieves state-of-the-art streaming pose estimation on ScanNet. Surprisingly, training on pseudo-annotated poses from in-the-wild video further improves general video QA benchmarks, showing pose helps beyond spatial reasoning. Together, these results position camera pose as a fundamental signal for video models that reason about the physical world.


Cambrian-P

Why pose? Video is the 2D observation of a dynamic 3D scene from a coherent sequence of viewpoints. Each viewpoint is defined by an observer's pose, i.e., its 3D position and orientation, specifying how the camera is embedded in the physical world. Pose is the lightest 3D signal: it compactly encodes how views relate geometrically, enforces global consistency through rigid-body constraints, and disentangles camera motion from scene dynamics. Camera pose is not merely a useful auxiliary cue, but a foundational inductive bias for spatially aware video understanding.

Figure 2: Cambrian-P architecture. Cambrian-P makes minimal modifications to existing MLLM architectures, introducing only two learnable camera tokens and a lightweight pose head. The camera tokens are appended to the visual tokens of each frame and precede the text embeddings.

Architecture. Cambrian-P builds on the Cambrian-S backbone (SigLIP2-SO400m vision encoder + Qwen2.5 LLM). To enable pose estimation inside the LLM's feature space, we introduce two learnable camera tokens, c_first and c_rest, appended after the visual tokens of each frame. After the LLM forward pass, a linear projector and a four-layer self-attention head regress a 9-D pose encoding (translation, quaternion, horizontal and vertical field-of-view) for every frame, with all poses expressed in the coordinate system of the first camera.
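
The token layout and the 9-D pose encoding described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: the function names and symbolic token strings are hypothetical, the reading that c_first marks the first (reference) frame while c_rest marks every later frame is an assumption, and so is the ordering of the 9-D vector (translation, then quaternion, then field of view).

```python
import math
from typing import List, Sequence, Tuple


def build_token_layout(num_frames: int, visual_tokens_per_frame: int,
                       text_tokens: Sequence[str]) -> Tuple[List[str], List[int]]:
    """Return a symbolic input sequence and the position of each frame's camera token."""
    sequence: List[str] = []
    camera_positions: List[int] = []
    for frame in range(num_frames):
        # Visual tokens of this frame come first.
        sequence.extend(f"v{frame}_{k}" for k in range(visual_tokens_per_frame))
        # Assumption: c_first is used for the first frame, c_rest for all later frames.
        sequence.append("c_first" if frame == 0 else "c_rest")
        camera_positions.append(len(sequence) - 1)
    # Text embeddings follow the tokens of all frames.
    sequence.extend(text_tokens)
    return sequence, camera_positions


def split_pose_encoding(pose9: Sequence[float]):
    """Split a 9-D pose vector into translation (3), unit quaternion (4), and FoV (2)."""
    if len(pose9) != 9:
        raise ValueError("expected a 9-D pose encoding")
    translation = list(pose9[0:3])
    quat = list(pose9[3:7])
    norm = math.sqrt(sum(c * c for c in quat))
    if norm == 0.0:
        raise ValueError("quaternion has zero norm")
    quat = [c / norm for c in quat]  # renormalize so the quaternion is a valid rotation
    fov_h, fov_v = pose9[7], pose9[8]
    return translation, quat, (fov_h, fov_v)


# Three frames with two visual tokens each, followed by a two-token question.
layout, camera_positions = build_token_layout(3, 2, ["<q0>", "<q1>"])
```

The hidden states at `camera_positions` are the ones a pose head would read. Because every frame contributes exactly one camera token, the number of predicted poses always equals the number of input frames.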

Training objective. The total loss combines next-token prediction with a pose regression loss. The pose loss is a weighted L1 over translation, quaternion rotation, and field-of-view. Translation is normalized by the average consecutive-frame distance \( \bar d \) so that indoor and outdoor scenes contribute comparable gradients, and a stop-gradient least-squares scale aligns predictions to ground truth on non-metric datasets.
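
A minimal sketch of the pose term described above, in plain Python for readability. The loss weights, the dictionary layout of a pose, and the function names are assumptions made for illustration, and the exact formulation used for Cambrian-P may differ. In an autograd framework the least-squares scale would be detached, which is what "stop-gradient" refers to.

```python
import math
from typing import Dict, List, Sequence

Pose = Dict[str, List[float]]  # hypothetical layout: "t" (3), "q" (4), "fov" (2)


def mean_consecutive_distance(translations: Sequence[Sequence[float]]) -> float:
    """Average distance between consecutive camera positions (d-bar in the text)."""
    steps = [math.dist(a, b) for a, b in zip(translations, translations[1:])]
    d_bar = sum(steps) / len(steps) if steps else 0.0
    return d_bar if d_bar > 0.0 else 1.0  # guard against a static camera


def least_squares_scale(pred, gt) -> float:
    """Closed-form s minimizing sum_i ||s * pred_i - gt_i||^2."""
    num = sum(p * g for pv, gv in zip(pred, gt) for p, g in zip(pv, gv))
    den = sum(p * p for pv in pred for p in pv)
    return num / den if den > 0.0 else 1.0


def mean_l1(pred, gt) -> float:
    diffs = [abs(p - g) for pv, gv in zip(pred, gt) for p, g in zip(pv, gv)]
    return sum(diffs) / len(diffs)


def pose_loss(pred: Sequence[Pose], gt: Sequence[Pose], metric: bool = True,
              w_t: float = 1.0, w_q: float = 1.0, w_fov: float = 1.0) -> float:
    """Weighted L1 over translation, quaternion, and field of view."""
    pred_t = [p["t"] for p in pred]
    gt_t = [g["t"] for g in gt]
    if not metric:
        # Non-metric data: align the predicted scale to ground truth first.
        # With autograd, s would be treated as a constant (stop-gradient).
        s = least_squares_scale(pred_t, gt_t)
        pred_t = [[s * c for c in t] for t in pred_t]
    # Normalize by the average consecutive-frame distance of the ground truth.
    d_bar = mean_consecutive_distance(gt_t)
    pred_t = [[c / d_bar for c in t] for t in pred_t]
    gt_t = [[c / d_bar for c in t] for t in gt_t]
    loss_t = mean_l1(pred_t, gt_t)
    loss_q = mean_l1([p["q"] for p in pred], [g["q"] for g in gt])
    loss_fov = mean_l1([p["fov"] for p in pred], [g["fov"] for g in gt])
    return w_t * loss_t + w_q * loss_q + w_fov * loss_fov
```

Dividing by the average consecutive-frame distance means a trajectory with 5 cm steps and one with 5 m steps produce translation errors on the same order, which is the stated motivation for the normalization.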

Figure 3: Interleaved training. Top: augmented pose-only samples using dynamic frame sampling and only pose supervision. Bottom: samples using uniform frame sampling with both VQA and pose supervision. L is the total number of frames in the video.

Bridging the training-dynamics gap. VQA and pose estimation have conflicting needs: VQA prefers uniform temporal sampling and minimal augmentation, while pose estimation prefers random starts, variable intervals, and heavy augmentation. We resolve this with two ingredients:

  • Interleaved training: we mix VQA-only, pose-only, and joint VQA+pose samples in training. Dedicated pose-only samples use pose-model sampling and augmentation and are supervised only by \( \mathcal{L}_{\text{pose}} \), allowing pose training to scale independently of VQA.
  • Random-jitter frame sampling: each uniformly sampled frame index for VQA is perturbed by a small random offset, breaking the memorization of fixed frame–pose correspondences without harming VQA coverage.
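
The two sampling regimes and the sample mixing can be sketched as below. This is an illustration under assumptions rather than the training code: the function names, the jitter range, the use of a single random stride per pose-only clip, and the mixing weights are all hypothetical choices.

```python
import random
from typing import List, Sequence

SAMPLE_TYPES = ["vqa_only", "pose_only", "vqa_pose"]


def jittered_uniform_indices(total_frames: int, num_samples: int,
                             max_jitter: int, rng: random.Random) -> List[int]:
    """Uniform frame indices, each perturbed by a small random offset (VQA samples)."""
    if num_samples == 1:
        base = [(total_frames - 1) // 2]
    else:
        step = (total_frames - 1) / (num_samples - 1)
        base = [round(i * step) for i in range(num_samples)]
    jittered = []
    for idx in base:
        offset = rng.randint(-max_jitter, max_jitter)
        # Clamp so the perturbed index stays inside the video.
        jittered.append(min(max(idx + offset, 0), total_frames - 1))
    return sorted(jittered)


def dynamic_pose_indices(total_frames: int, num_samples: int,
                         max_stride: int, rng: random.Random) -> List[int]:
    """Random start and random stride (pose-only samples)."""
    if num_samples > total_frames:
        raise ValueError("cannot sample more frames than the video contains")
    if num_samples == 1:
        return [rng.randrange(total_frames)]
    # Largest stride that still fits num_samples frames into the video.
    feasible = (total_frames - 1) // (num_samples - 1)
    stride = rng.randint(1, max(1, min(max_stride, feasible)))
    start = rng.randint(0, total_frames - 1 - stride * (num_samples - 1))
    return [start + i * stride for i in range(num_samples)]


def pick_sample_type(rng: random.Random,
                     weights: Sequence[float] = (1.0, 1.0, 1.0)) -> str:
    """Choose which kind of sample to draw next; the weights are placeholders."""
    return rng.choices(SAMPLE_TYPES, weights=weights, k=1)[0]


rng = random.Random(0)
vqa_frames = jittered_uniform_indices(total_frames=100, num_samples=8, max_jitter=3, rng=rng)
pose_frames = dynamic_pose_indices(total_frames=100, num_samples=8, max_stride=5, rng=rng)
```

With a jitter smaller than half the uniform spacing, the sampled frames keep their temporal order and still span the whole video, while the exact index-to-pose mapping changes from epoch to epoch.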

Qualitative Visualization

Improved Spatial Video QA

Cambrian-P is fine-tuned from Cambrian-S-7B (stage 3) on VSI-590K and yields state-of-the-art performance on VSI-Bench. Compared to its no-pose counterpart, it gains 4.5 points on VSI-Bench (69.2 → 73.7), with the largest improvements on tasks that demand global spatial understanding, such as absolute distance (+6.4), relative direction (+7.5), and route planning (+13.9).

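Numerical-answer tasks: Obj. Count, Abs. Dist., Obj. Size, Room Size. Multiple-choice tasks: Rel. Dist., Rel. Dir., Route Plan, Appr. Order.

| Model | LM | Avg. | Obj. Count | Abs. Dist. | Obj. Size | Room Size | Rel. Dist. | Rel. Dir. | Route Plan | Appr. Order |
|---|---|---|---|---|---|---|---|---|---|---|
| General-purpose Models | | | | | | | | | | |
| GPT-4o | Unk. | 34.0 | 46.2 | 5.3 | 43.8 | 38.2 | 37.0 | 41.3 | 31.5 | 28.5 |
| Gemini-2.5 Pro | Unk. | 51.5 | 43.8 | 34.9 | 64.3 | 42.8 | 61.1 | 47.8 | 45.9 | 71.3 |
| Qwen2.5VL-7B | Qwen2.5-7B | 29.3 | 25.2 | 10.5 | 36.4 | 29.6 | 38.4 | 38.0 | 29.8 | 26.8 |
| InternVL-3 8B | Qwen2.5-7B | 42.1 | 68.1 | 39.0 | 48.4 | 33.6 | 48.3 | 36.4 | 27.3 | 35.4 |
| InternVL-3.5 8B | Qwen3-8B | 56.3 | - | - | - | - | - | - | - | - |
| Qwen3-VL 8B | Qwen3-8B | 56.6 | - | - | - | - | - | - | - | - |
| Spatial-specialist Models | | | | | | | | | | |
| VST 7B | Qwen2.5-7B | 61.2 | 71.6 | 43.8 | 75.5 | 69.2 | 60.0 | 55.6 | 44.3 | 69.2 |
| VLM-3R 7B | Qwen2-7B | 60.9 | 70.2 | 49.4 | 69.2 | 67.1 | 65.4 | 80.5 | 45.4 | 40.1 |
| VG-LLM 8B | Qwen2.5-7B | 50.7 | 67.9 | 37.7 | 58.6 | 62.0 | 46.6 | 40.7 | 32.4 | 59.2 |
| Cambrian-S 7B | Qwen2.5-7B | 67.5 | 73.2 | 50.5 | 74.9 | 72.2 | 71.1 | 76.2 | 41.8 | 80.1 |
| SenseNova-SI 8B | Qwen2.5-7B | 68.7 | - | - | - | - | - | - | - | - |
| GeoThinker 7B | Qwen2.5-7B | 68.5 | - | - | - | - | - | - | - | - |
| GeoThinker 8B | Qwen3-8B | 72.6 | - | - | - | - | - | - | - | - |
| Cambrian-S-7B† | Qwen2.5-7B | 69.2 | 73.6 | 53.7 | 75.2 | 74.7 | 71.5 | 82.0 | 38.7 | 84.3 |
| Cambrian-P | Qwen2.5-7B | 73.7 | 74.9 | 60.1 | 76.0 | 76.9 | 74.8 | 89.5 | 52.6 | 85.0 |

Table 1: VSI-Bench results. †Cambrian-S fine-tuned only on VSI-590K. Best results in bold.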

Camera-Movement Understanding (VSTemporalI-Bench)

We further fine-tune Cambrian-P on VSI-590K + VLM-3R data and evaluate on VSTemporalI-Bench, which probes camera-motion understanding. Pose supervision yields a 20.0-point jump (67.7 → 87.7) on the camera-movement-direction subtask over the no-pose baseline.

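| Method | Avg. | Cam-Obj Abs. Dist. | Cam. Displace. | Cam. Mov. Dir. | Obj-Obj Rel. Pos. | Cam-Obj Rel. Dist. |
|---|---|---|---|---|---|---|
| GPT-4o | 38.2 | 29.5 | 23.4 | 37.3 | 58.1 | 42.5 |
| Gemini-1.5 Flash | 32.1 | 28.5 | 20.9 | 24.4 | 52.6 | 33.9 |
| LLaVA-NeXT-Video-72B | 44.0 | 32.3 | 10.5 | 48.1 | 78.3 | 50.9 |
| VLM-3R-7B | 58.8 | 39.4 | 39.6 | 60.6 | 86.5 | 68.6 |
| GeoThinker-8B | 67.4 | 38.4 | 45.8 | 84.2 | 93.6 | 75.2 |
| Cambrian-P (w/o pose) | 62.4 | 39.4 | 40.6 | 67.7 | 92.2 | 72.0 |
| Cambrian-P | 68.9 | 42.5 | 46.6 | 87.7 | 94.3 | 73.2 |
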
Table 2: VSTemporalI-Bench results. Pose supervision yields +20% on the camera-movement-direction subtask.

Out-of-Distribution Generalization

Although Cambrian-P is fine-tuned only on VSI-590K, the local-to-global video understanding it acquires through pose supervision transfers to a wide range of out-of-distribution spatial and general video QA benchmarks.

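| Model | SparBench | MMSIBench | MMSIVideo | MindCube | MVBench | EgoSchema | Percept. Test | Tomato |
|---|---|---|---|---|---|---|---|---|
| Cambrian-P (w/o pose) | 32.7 | 26.2 | 20.1 | 34.3 | 51.9 | 49.6 | 56.4 | 20.4 |
| Cambrian-P | 35.9 | 28.0 | 22.9 | 38.4 | 53.5 | 52.5 | 58.4 | 26.7 |
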
Table 3: Out-of-distribution generalization across 8 spatial and general video QA benchmarks.

Scaling Pose Supervision with Pseudo Labels

Ground-truth camera pose annotations are only available in a few 3D datasets (ScanNet, ScanNet++, ARKitScenes inside VSI-590K). To scale pose supervision to general-domain videos, we pseudo-annotate CamS-590K, a 590K-clip subsample of Cambrian-S-3M, the open-domain video instruction-tuning corpus used in Cambrian-S Stage 3. The pipeline runs (i) a scene-cut detector, (ii) a Qwen3-VL pose-aware quality filter that drops synthetic, text-overlaid, screen-recorded, blurry, or through-glass clips, (iii) VIPE for streaming pose recovery, and (iv) a trajectory-quality pass that interpolates frames violating per-frame velocity / acceleration / rotation limits. The resulting pseudo poses are routed through the same interleaved training recipe as GT poses, with no architecture change.
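
Step (iv) can be pictured with the sketch below, which covers only the velocity check on camera positions; acceleration and rotation limits would follow the same pattern. The function names, the speed threshold, and the choice of linear interpolation between the nearest accepted frames are assumptions for illustration, not the pipeline's actual implementation.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def flag_speed_outliers(positions: Sequence[Vec3], dt: float,
                        max_speed: float) -> List[bool]:
    """Flag frames whose speed relative to the last accepted frame exceeds max_speed."""
    flagged = [False] * len(positions)
    last_ok = 0  # the first frame is always accepted
    for i in range(1, len(positions)):
        elapsed = (i - last_ok) * dt
        speed = math.dist(positions[i], positions[last_ok]) / elapsed
        if speed > max_speed:
            flagged[i] = True
        else:
            last_ok = i
    return flagged


def interpolate_flagged(positions: Sequence[Vec3],
                        flagged: Sequence[bool]) -> List[Vec3]:
    """Replace flagged positions by linear interpolation between accepted neighbours.

    Assumes frame 0 is not flagged, as guaranteed by flag_speed_outliers.
    """
    accepted = [i for i, bad in enumerate(flagged) if not bad]
    repaired = [tuple(p) for p in positions]
    for i, bad in enumerate(flagged):
        if not bad:
            continue
        prev_ok = max(j for j in accepted if j < i)
        later = [j for j in accepted if j > i]
        if not later:
            # Trailing outlier: hold the last accepted position.
            repaired[i] = repaired[prev_ok]
            continue
        next_ok = later[0]
        w = (i - prev_ok) / (next_ok - prev_ok)
        repaired[i] = tuple((1 - w) * a + w * b
                            for a, b in zip(positions[prev_ok], positions[next_ok]))
    return repaired


# Frame 2 jumps 49 units in one step and is replaced by the midpoint of frames 1 and 3.
track = [(0, 0, 0), (1, 0, 0), (50, 0, 0), (3, 0, 0), (4, 0, 0)]
flags = flag_speed_outliers(track, dt=1.0, max_speed=2.0)
clean = interpolate_flagged(track, flags)
```

Measuring speed against the last accepted frame, rather than the immediately preceding one, keeps a single bad pose from also invalidating the good frame that follows it.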

Adding CamS-590K boosts general video QA but slightly dips VSI-Bench, showing that generic video tuning does not by itself preserve spatial reasoning. GT pose restores the spatial gain without giving up the general-VQA improvements, and layering VIPE pseudo poses on CamS data pushes all four benchmarks up further. These results suggest that pseudo poses, even when derived from noisy in-the-wild videos, provide a scalable supervision signal for video understanding.

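| Training Data | Pose Sup. | % Pose Sup. | VSI-Bench | MVBench | Perception Test | EgoSchema |
|---|---|---|---|---|---|---|
| Spatial VQA Data Only | | | | | | |
| VSI-590K | - | 0% | 71.2 | 51.7 | 56.7 | 48.5 |
| VSI-590K | GT | 49% | 73.7 | 53.8 | 58.1 | 51.3 |
| Spatial VQA + General VQA Data | | | | | | |
| VSI-590K + CamS-590K | - | 0% | 70.9 | 68.0 | 66.9 | 71.2 |
| VSI-590K + CamS-590K | GT | 25% | 73.7 | 67.9 | 67.8 | 71.7 |
| VSI-590K + CamS-590K | GT + Pseudo | 48% | 73.9 | 69.3 | 67.9 | 73.6 |
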
Table 4: Cambrian-P with general VQA training data and pseudo-pose supervision (128 input frames). Adding CamS-590K boosts general video QA but slightly hurts VSI-Bench without pose; GT pose restores the spatial gain; VIPE pseudo poses on CamS-590K improve all four benchmarks under an otherwise unchanged recipe.

Streaming Camera Pose Estimation

As a byproduct of pose supervision, Cambrian-P doubles as a competitive streaming camera-pose estimator. It achieves the lowest ATE on ScanNet among all streaming methods, despite using a standard MLLM backbone (causal SigLIP encoder, no DINOv2, no bidirectional transformer). On TUM-dynamic and Sintel it remains competitive with specialist models like CUT3R, StreamVGGT, and Point3R.

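| Model | ScanNet ATE ↓ | ScanNet RPE-t ↓ | ScanNet RPE-r ↓ | TUM-dynamic ATE ↓ | TUM-dynamic RPE-t ↓ | TUM-dynamic RPE-r ↓ | Sintel ATE ↓ | Sintel RPE-t ↓ | Sintel RPE-r ↓ |
|---|---|---|---|---|---|---|---|---|---|
| Offline Models | | | | | | | | | |
| VGGT | 0.035 | 0.015 | 0.380 | 0.009 | 0.008 | 0.350 | 0.172 | 0.061 | 0.470 |
| π³ | 0.031 | 0.013 | 0.347 | 0.014 | 0.009 | 0.312 | 0.074 | 0.040 | 0.282 |
| MapAnything | 0.052 | 0.025 | 0.720 | 0.029 | 0.023 | 0.370 | 0.226 | 0.077 | 0.640 |
| MASt3R-GA | 0.078 | 0.020 | 0.475 | 0.038 | 0.012 | 0.448 | 0.185 | 0.060 | 1.496 |
| MonST3R-GA | 0.077 | 0.018 | 0.529 | 0.098 | 0.019 | 0.935 | 0.111 | 0.044 | 0.869 |
| Streaming Models | | | | | | | | | |
| StreamVGGT | 0.127 | 0.041 | 1.880 | 0.062 | 0.030 | 0.690 | 0.273 | 0.109 | 0.850 |
| CUT3R | 0.096 | 0.022 | 0.590 | 0.045 | 0.015 | 0.440 | 0.215 | 0.070 | 0.630 |
| Point3R | 0.097 | 0.035 | 2.791 | 0.058 | 0.031 | 0.758 | 0.442 | 0.154 | 1.897 |
| Spann3R | 0.096 | 0.023 | 0.661 | 0.056 | 0.021 | 0.591 | 0.329 | 0.110 | 4.471 |
| G²VLM | 0.148 | 0.048 | 1.220 | 0.129 | 0.044 | 0.700 | 0.301 | 0.135 | 1.450 |
| Cambrian-P (Ours) | 0.078 | 0.023 | 0.880 | 0.046 | 0.020 | 0.580 | 0.239 | 0.081 | 2.440 |
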
Table 5: Camera pose estimation on ScanNet, TUM-dynamic, and Sintel. Best values among streaming methods in bold.
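
For readers unfamiliar with the three metrics, the sketch below shows how absolute trajectory error (ATE) and relative pose error (RPE-t, RPE-r) are commonly computed from 4x4 camera-to-world matrices. It is an illustration under assumptions, not the evaluation code behind Table 5: the trajectory alignment that normally precedes ATE is assumed to have been done already, RMSE aggregation and degrees for RPE-r are choices made here, and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Mat4 = List[List[float]]  # 4x4 camera-to-world transform, row-major


def matmul(a: Mat4, b: Mat4) -> Mat4:
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def rigid_inverse(t: Mat4) -> Mat4:
    """Inverse of a rigid transform [R | p], which is [R^T | -R^T p]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    inv_p = [-sum(r_t[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [r_t[i] + [inv_p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def ate_rmse(est: Sequence[Mat4], gt: Sequence[Mat4]) -> float:
    """RMSE of camera-position error; est is assumed to be aligned to gt beforehand."""
    sq = [sum((e[i][3] - g[i][3]) ** 2 for i in range(3)) for e, g in zip(est, gt)]
    return math.sqrt(sum(sq) / len(sq))


def rpe(est: Sequence[Mat4], gt: Sequence[Mat4]) -> Tuple[float, float]:
    """Return (RPE-t, RPE-r in degrees) over consecutive frame pairs, as RMSE."""
    t_sq, r_sq = [], []
    for i in range(len(gt) - 1):
        rel_gt = matmul(rigid_inverse(gt[i]), gt[i + 1])
        rel_est = matmul(rigid_inverse(est[i]), est[i + 1])
        # Residual motion between the estimated and true frame-to-frame transforms.
        err = matmul(rigid_inverse(rel_gt), rel_est)
        t_sq.append(sum(err[k][3] ** 2 for k in range(3)))
        cos_angle = (err[0][0] + err[1][1] + err[2][2] - 1.0) / 2.0
        angle = math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))
        r_sq.append(angle ** 2)
    n = len(t_sq)
    return math.sqrt(sum(t_sq) / n), math.sqrt(sum(r_sq) / n)
```

ATE measures global drift of the whole trajectory, while the two RPE terms measure local frame-to-frame consistency, which is why a method can rank differently on the two.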

Analysis

Scaling. The pose objective scales gracefully along all three axes (model size, data size, and training iterations), with the gap over the no-pose baseline widening as we scale up.

Scaling of Cambrian-P (VQA) Scaling of Cambrian-P (pose ATE)
Figure 5: Scaling. Larger models and more data both yield higher VSI-Bench scores and lower pose ATE across benchmarks.

Camera pose > depth. Adding a depth head yields smaller VQA gains than pose, and combining pose + depth slightly hurts both objectives. Within the pose loss, both translation and rotation contribute meaningfully; field-of-view alone behaves comparably to depth.

Pose makes MLLMs think more globally. Grouping VSI-Bench questions by normalized object distance (near / medium / far), pose supervision delivers the largest gains on far object pairs (e.g., +6.6% on relative distance and +10.2% on relative direction for far questions), suggesting that pose enables more global spatial reasoning.

Figure 6: Camera pose improves global spatial reasoning. Performance gain of Cambrian-P over its no-pose counterpart increases with normalized object distance for both relative-distance and relative-direction questions in VSI-Bench.
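
The grouping behind this analysis can be sketched as follows. The bin thresholds, the record layout, and the function names are hypothetical, and how object distance is normalized is not specified here, so the snippet only illustrates the bookkeeping: assign each question to a bin, then compare accuracy with and without pose supervision inside each bin.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (normalized object distance, baseline correct?, pose-supervised model correct?)
Record = Tuple[float, bool, bool]


def distance_bin(normalized_distance: float,
                 near_max: float = 1 / 3, medium_max: float = 2 / 3) -> str:
    """Map a normalized distance in [0, 1] to near / medium / far (placeholder thresholds)."""
    if normalized_distance < near_max:
        return "near"
    if normalized_distance < medium_max:
        return "medium"
    return "far"


def gain_by_bin(records: Iterable[Record]) -> Dict[str, float]:
    """Accuracy gain of the pose-supervised model per bin, in percentage points."""
    counts: Dict[str, int] = defaultdict(int)
    base_hits: Dict[str, int] = defaultdict(int)
    pose_hits: Dict[str, int] = defaultdict(int)
    for distance, base_ok, pose_ok in records:
        name = distance_bin(distance)
        counts[name] += 1
        base_hits[name] += int(base_ok)
        pose_hits[name] += int(pose_ok)
    return {name: 100.0 * (pose_hits[name] - base_hits[name]) / counts[name]
            for name in counts}


toy_records = [(0.1, True, True), (0.2, False, False),
               (0.9, False, True), (0.8, True, True)]
gains = gain_by_bin(toy_records)
```

On the toy records the near bin shows no change and the far bin gains 50 points, mirroring the qualitative pattern in Figure 6 with made-up data.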

Latency. Despite having 6–10× more parameters than specialist 3D reconstruction models, Cambrian-P achieves lower per-frame latency thanks to (i) compact SigLIP visual tokens, (ii) causal-attention FLOPs savings, and (iii) the standard KV-cache mechanism of LLM inference. In offline mode it processes 90 ScanNet frames in 2.16 s; in streaming mode it matches CUT3R's per-frame cost.

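| Method | #Params | Offline / seq (s) | Offline / frame (s) | Streaming / seq (s) | Streaming / frame (s) |
|---|---|---|---|---|---|
| VGGT | 1.26B | 9.90 | 0.11 | - | - |
| CUT3R | 0.80B | 5.22 | 0.06 | 6.03 | 0.07 |
| StreamVGGT | 1.26B | - | - | 9.00 | 0.10 |
| Cambrian-P (Ours) | 8.20B | 2.16 | 0.02 | 5.76 | 0.06 |
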
Table 6: Inference latency on the ScanNet test set (NVIDIA L40S, 90-frame sequences).
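
The per-frame columns follow directly from the per-sequence timings and the 90-frame sequence length. The short script below redoes that arithmetic for the two methods that report both modes; the helper names are illustrative, and the timings are copied from the table rather than re-measured.

```python
FRAMES_PER_SEQUENCE = 90  # sequence length stated for Table 6

# Per-sequence latencies in seconds, copied from Table 6.
sequence_latency_s = {
    "Cambrian-P offline": 2.16,
    "Cambrian-P streaming": 5.76,
    "CUT3R offline": 5.22,
    "CUT3R streaming": 6.03,
}


def per_frame_latency(seq_seconds: float, num_frames: int = FRAMES_PER_SEQUENCE) -> float:
    """Average time spent on one frame."""
    return seq_seconds / num_frames


def frames_per_second(seq_seconds: float, num_frames: int = FRAMES_PER_SEQUENCE) -> float:
    """Throughput implied by the per-sequence latency."""
    return num_frames / seq_seconds


for name, seconds in sequence_latency_s.items():
    print(f"{name}: {per_frame_latency(seconds):.3f} s/frame, "
          f"{frames_per_second(seconds):.1f} frames/s")
```

Rounded to two decimals, these reproduce the per-frame entries in the table (0.02 and 0.06 s for Cambrian-P, 0.06 and 0.07 s for CUT3R).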

Conclusion

We introduce Cambrian-P, a pose-grounded video understanding model that equips standard MLLMs with the capability to connect individual frames in a shared 3D space. With a simple yet scalable architectural design and tailored training dynamics, Cambrian-P improves spatial and general video QA, and achieves competitive streaming pose estimation against state-of-the-art methods. Our results position camera pose as an important missing signal for video MLLMs: it grounds frames in a globally consistent 3D space and encourages learning cross-frame correspondences. Cambrian-P advances MLLMs toward real-world grounded video understanding.

Figure 4: Camera pose trajectories on ScanNet test scenes. Ground truth (gray dashed) vs. predictions (blue) from Cambrian-P, CUT3R, StreamVGGT, and G²VLM. These scenes are disjoint from the VSI-Bench evaluation sequences, and Cambrian-P generalizes to these unseen indoor environments, capturing both global path shape and fine-grained turns.

BibTeX

@article{yang2026cambrianp,
  title   = {Cambrian-P: Pose-Grounded Video Understanding},
  author  = {Yang, Jihan and Zhao, Zifan and Pan, Xichen and Yang, Shusheng and Zhang, Junyi and Kang, Bingyi and Xu, Hu and Xie, Saining},
  journal = {arXiv preprint arXiv:2605.22819},
  year    = {2026},
}