Planning in metric 3D — not in pixels — keeps execution geometrically grounded.
Hierarchical Vision-Language-Action (VLA) models split manipulation into a high-level planner and a low-level policy. Today's strongest low-level policies operate on 3D point clouds, yet existing planners predict only 2D pixel trajectories — forcing each waypoint to inherit the depth of whatever surface lies beneath it.
3D HAMSTER removes that mismatch.
Our depth-aware VLM predicts metric 3D end-effector trajectories directly — every waypoint a real point in the scene, not a pixel whose depth must be inferred at execution. The two views below let you scrutinize that geometry yourself: paired renders that lock the same trajectory to two camera angles, and free-orbit point clouds you can explore from any viewpoint. Across DroidSpatial-Bench and our real-world benchmark, our planner stays tight on ground truth where strong general-purpose VLMs — RoboBrain 2.5 and Gemini 3.0 Pro — visibly drift in depth.
The planner takes a single RGB-D observation and a language instruction, and directly outputs a metric 3D end-effector trajectory — a real path in the scene. Rotating it makes clear the prediction lives in 3D, not on the image plane.


How it is trained. We build on Qwen3-VL-8B and add a dedicated depth encoder plus a dense depth-reconstruction loss that keeps the model's internal geometry metrically faithful. Training runs in two stages — Stage 1 aligns the depth features with the VLM space (encoders frozen), Stage 2 fine-tunes for trajectory prediction with LoRA. The planner learns from a mixture of 3D-capability data (RGB-D trajectories and spatial reasoning) and 2D preservation data (RGB-only pointing, detection, and VQA) that keeps its general vision-language ability intact.
| Source | Env. | Modality | Tasks | Size |
|---|---|---|---|---|
| 3D Capability Data | | | | |
| RLBench | Sim | RGB-D | 2D / 3D / 2D→3D | 606K |
| DROID | Real | RGB-D | 2D / 3D / 2D→3D | 123K |
| InternData-M1 | Sim | RGB-D | 2D / 3D / 2D→3D | 1.5M |
| RefSpatial | Mix | RGB-D | Spatial QA / Vacant loc. | 2.2M |
| Preservation Data | | | | |
| RoboPoint | Sim | RGB | 2D pointing | 666K |
| PixMo | Real | RGB | 2D pointing | 171K |
| LVIS | Real | RGB | 2D bbox det. | 138K |
| Honey-1M | Web | RGB | General VQA | 749K |
| Total | | | | ≈ 6.15M |
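
As a quick check on the table above, the sketch below recomputes the total and the share of 3D-capability data from the listed sizes. The source does not state how the sources are weighted during training, so the size-proportional shares here are an assumption for illustration only, and the 1.5M and 2.2M entries are rounded figures from the table.

```python
# Sizes in thousands of samples, copied from the data-mixture table.
CAPABILITY_3D = {"RLBench": 606, "DROID": 123, "InternData-M1": 1500, "RefSpatial": 2200}
PRESERVATION_2D = {"RoboPoint": 666, "PixMo": 171, "LVIS": 138, "Honey-1M": 749}


def mixture_shares(*groups):
    """Return (share per source, pooled total in thousands).

    Assumption: shares are proportional to dataset size. The actual
    sampling weights used for training are not given in the source.
    """
    pooled = {name: size for group in groups for name, size in group.items()}
    total = sum(pooled.values())
    return {name: size / total for name, size in pooled.items()}, total


shares, total_k = mixture_shares(CAPABILITY_3D, PRESERVATION_2D)
share_3d = sum(shares[name] for name in CAPABILITY_3D)

print(f"total = {total_k / 1000:.2f}M samples")
print(f"3D capability share = {share_3d:.1%}")
```

Under this reading the listed sizes sum to roughly 6.15M, matching the Total row, with a little under three quarters of the samples coming from the 3D-capability sources.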

Example prompts shown with the data mixture: “Put the ring on the red peg”, “Take the marker out of the cup”, “Set the lantern on the vintage monitor”, “Which point is farther back, 1 or 2?”, “Spot the item left of the red box”, “Where can I find the cups?”, “Locate the kid in this image”, “Tell me about this steam locomotive”.
Three real-world tasks, three different scenes — each rendered with the ground-truth trajectory and the predictions from our VLM, RoboBrain 2.5, and Gemini 3.0 Pro overlaid on the same point cloud. Our planner threads the correct 3D path in every scene, while the baselines drift in depth or miss the target surface.
The same comparison on DroidSpatial-Bench — three benchmark scenes where our planner threads the gripper onto the correct target surface, while RoboBrain 2.5 and Gemini 3.0 Pro drift in depth.
Don’t take our word for it — orbit freely and check the geometry from any angle. Each viewer shows our 3D HAMSTER prediction (a red→blue gradient trajectory) on the scene point cloud. From every viewpoint, our trajectory stays glued to the correct object surface.
We roll out 3D HAMSTER on a real Franka Panda across three task families and the five condition shifts from our quantitative protocol (§05): In-distribution, Language, Spatial, Visual, and Multiple. Pick a task to scan all five conditions at a glance, then select one to inspect its rollout next to the predicted 3D trajectory in an interactive point cloud.
The same real-robot scenes run head-to-head against strong baselines. 3D HAMSTER’s metrically grounded guidance holds where surface-level 2D guidance drifts.
Same four methods on the real Franka Panda, evaluated under shifted lighting conditions. 3D depth cues remain stable across illumination changes while surface-level 2D guidance degrades.
On the real Franka Panda, the target object is placed inside a visually crowded scene populated with distractor objects. The instruction uses an unseen referring expression, further stressing language grounding. 3D HAMSTER's metrically grounded trajectory guidance keeps the policy on-task while 2D baselines drift onto visually salient distractors.
HAMSTER predicts 2D pixel trajectories that cling to the scene surface (the “graffiti effect”), so execution drifts off the target. 3D HAMSTER predicts metric 3D trajectories that stay geometrically grounded — turning the same failures into successes.
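
The “graffiti effect” can be reproduced with a few lines of pinhole-camera geometry. The sketch below is illustrative only and is not the authors' code: the function name `lift_pixels_to_3d`, the intrinsics, and the toy scene are all assumptions. It back-projects a 2D pixel plan through a depth map, which pins every waypoint to the visible surface, and compares that with a metric 3D plan that rises above the surface yet projects to the same pixels.

```python
import numpy as np


def lift_pixels_to_3d(pixels_uv, depth_m, fx, fy, cx, cy):
    """Back-project (u, v) pixel waypoints into camera-frame 3D points.

    Each waypoint takes the depth of whatever surface is visible at
    that pixel, so the lifted path can only lie on the scene surface.
    """
    uv = np.asarray(pixels_uv, dtype=int)
    u, v = uv[:, 0], uv[:, 1]
    z = depth_m[v, u]                 # depth image is indexed (row, col) = (v, u)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)


# Toy scene (hypothetical numbers): a flat table 1.0 m in front of the camera.
depth = np.full((480, 640), 1.0)
fx = fy = 500.0
cx, cy = 320.0, 240.0

# A 2D plan that sweeps left to right through the image centre.
pixel_plan = [(220, 240), (320, 240), (420, 240)]
lifted = lift_pixels_to_3d(pixel_plan, depth, fx, fy, cx, cy)

# A metric 3D plan for the same motion that clears the table by 20 cm mid-way.
# The middle waypoint sits on the optical axis, so it still projects to (320, 240).
metric_plan = lifted.copy()
metric_plan[1, 2] -= 0.20

print("lifted 2D plan z:", lifted[:, 2])
print("metric 3D plan z:", metric_plan[:, 2])
```

Both paths look identical in the image, but only the metric plan encodes the lift; the back-projected plan stays flat on the table because its depth was read from the surface.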
Our depth-encoder-augmented VLM produces more accurate 3D trajectories than strong baselines including proprietary VLMs and open-source models.
| Model | Input | Start (δ = 5 cm) | End (δ = 5 cm) | Both (δ = 5 cm) | Start (δ = 10 cm) | End (δ = 10 cm) | Both (δ = 10 cm) |
|---|---|---|---|---|---|---|---|
| Proprietary API Models | | | | | | | |
| Claude Sonnet 4.6 | RGB | 1.4 | 8.8 | 0.7 | 6.1 | 16.2 | 2.0 |
| GPT-5.2 | RGB | 6.8 | 29.7 | 2.7 | 29.7 | 45.3 | 16.2 |
| Gemini-3.0-Pro | RGB | 29.1 | 44.0 | 16.2 | 43.2 | 56.1 | 29.7 |
| Open-Source Models | | | | | | | |
| RoboBrain-2.5-8B | RGB | 61.5 | 58.1 | 39.2 | 80.4 | 74.3 | 60.1 |
| 3D HAMSTER (Ours) | | | | | | | |
| Qwen3-VL-8B | RGB | 0.7 | 9.5 | 0.7 | 0.7 | 14.9 | 0.7 |
| + 3D Traj. Data | RGB | 50.0 | 50.0 | 27.7 | 71.6 | 72.3 | 50.0 |
| + Depth Encoder | RGBD | 62.8 | 62.2 | 42.6 | 83.8 | 75.0 | 62.8 |
| + L_depth | RGBD | 63.5 | 66.2 | 41.9 | 80.4 | 82.4 | 65.5 |
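
One plausible reading of the Start / End / Both columns is sketched below. The exact metric definition is not given in this section, so this is an assumption: a prediction counts for Start (or End) when its first (or last) waypoint lies within δ of the ground-truth one in Euclidean distance, and for Both when both conditions hold. The function name `endpoint_accuracy` and the toy trajectories are hypothetical.

```python
import numpy as np


def endpoint_accuracy(preds, gts, delta_m):
    """Percentage of trajectories whose start / end / both endpoints are within delta_m.

    preds, gts: sequences of (N_i, 3) arrays of waypoints in metres.
    Assumed metric definition; see the lead-in above.
    """
    start_ok = np.array([np.linalg.norm(p[0] - g[0]) <= delta_m for p, g in zip(preds, gts)])
    end_ok = np.array([np.linalg.norm(p[-1] - g[-1]) <= delta_m for p, g in zip(preds, gts)])
    return {
        "start": 100.0 * start_ok.mean(),
        "end": 100.0 * end_ok.mean(),
        "both": 100.0 * (start_ok & end_ok).mean(),
    }


# Two toy scenes sharing one ground-truth path (hypothetical numbers).
gt = np.array([[0.0, 0.0, 0.0], [0.3, 0.0, 0.2]])
pred_a = np.array([[0.03, 0.0, 0.0], [0.3, 0.0, 0.28]])   # start off by 3 cm, end by 8 cm
pred_b = np.array([[0.12, 0.0, 0.0], [0.3, 0.02, 0.2]])   # start off by 12 cm, end by 2 cm

acc_5cm = endpoint_accuracy([pred_a, pred_b], [gt, gt], delta_m=0.05)
acc_10cm = endpoint_accuracy([pred_a, pred_b], [gt, gt], delta_m=0.10)
print(acc_5cm)
print(acc_10cm)
```

Under this definition Both can never exceed the smaller of Start and End, and it can be much lower when different scenes fail at different endpoints, as in the toy case at δ = 5 cm.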
Per-variation success rates averaged across 11 tasks on Colosseum, which stress-tests policies under 14 perturbation axes. 3D guidance provides both appearance invariance and geometric robustness.
| Method | None | MO | RO | Light | Table | Distract | BG Tex | RLB Var | Cam Pose | All Var | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3DFA | 53.8 | 39.0 | 27.2 | 38.0 | 30.3 | 36.4 | 43.3 | 50.0 | 46.8 | 0.8 | 36.6 |
| 3DFA + HAMSTER | 49.5 | 38.1 | 28.8 | 38.8 | 42.3 | 40.8 | 43.3 | 50.5 | 48.4 | 7.2 | 38.8 |
| 3DFA + 3D HAMSTER | 62.9 | 49.4 | 36.3 | 54.4 | 39.6 | 44.8 | 52.0 | 52.5 | 49.2 | 7.2 | 44.8 |
Success rates (%) on a real Franka Panda arm across three task families. 3D HAMSTER achieves the highest average success in every task family, with the largest gains under visual and spatial distribution shifts.
| Method | In-D | Lang | Spatial | Visual | Multiple | Avg. |
|---|---|---|---|---|---|---|
| Button Pressing | | | | | | |
| π0.5 | 100 | 80 | 40 | 90 | 60 | 74 |
| 3DFA | 90 | 30 | 0 | 30 | 40 | 38 |
| 3DFA + HAMSTER | 80 | 50 | 20 | 80 | 70 | 60 |
| 3DFA + 3D HAMSTER | 100 | 90 | 50 | 100 | 60 | 80 |
| Pouring | | | | | | |
| π0.5 | 60 | 15 | 35 | 65 | 30 | 41 |
| 3DFA | 80 | 45 | 50 | 50 | 25 | 50 |
| 3DFA + HAMSTER | 75 | 45 | 40 | 35 | 30 | 45 |
| 3DFA + 3D HAMSTER | 95 | 75 | 65 | 65 | 40 | 68 |
| Pick-and-Place | | | | | | |
| π0.5 | 100 | 50 | 15 | 30 | 5 | 40 |
| 3DFA | 65 | 35 | 55 | 30 | 20 | 41 |
| 3DFA + HAMSTER | 65 | 65 | 35 | 30 | 35 | 46 |
| 3DFA + 3D HAMSTER | 90 | 70 | 55 | 50 | 45 | 62 |
@INPROCEEDINGS{hwang20263dhamster,
author={Hwang, Dongyoon and Lee, Byungkun and Kim, Dongjin and Jang, Hyojin and Jin, Hoiyeong and Mun, Jueun and Park, Minho and Lee, Hojoon and Kim, Hyunseung and Choo, Jaegul},
booktitle={2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
title={{3D HAMSTER}: Bridging Planning and Control in Hierarchical Vision Language Action Models through {3D} Trajectory Guidance},
year={2026}}