IROS 2026

3D HAMSTER: Bridging Planning and Control in Hierarchical Vision Language Action Models through 3D Trajectory Guidance

Dongyoon Hwang* 1, Byungkun Lee* 1, Dongjin Kim* 1, Hyojin Jang1, Hoiyeong Jin1, Jueun Mun2, Minho Park1, Hojoon Lee1, Hyunseung Kim1, 3, Jaegul Choo† 1
1KAIST AI 2POSTECH 3KRAFTON AI
*Equal contribution   †Corresponding author

Overview

TL;DR: 3D HAMSTER predicts metrically grounded 3D end-effector trajectories from a depth-augmented VLM and executes them through a pointcloud-based low-level policy, enabling robust manipulation across diverse real-world scenes, instructions, and visual conditions.
Depth-Aware VLM Planner: Predicts metric 3D end-effector trajectories from RGB-D input.
Point-Cloud Policy: Executes the predicted 3D trajectory on the observed point cloud.
Robust Manipulation: Generalizes across language, spatial, and visual distribution shifts.

Planning in metric 3D — not in pixels — keeps execution geometrically grounded.

Method

Hierarchical Vision-Language-Action (VLA) models split manipulation into a high-level planner and a low-level policy. Today's strongest low-level policies operate on 3D point clouds, yet existing planners predict only 2D pixel trajectories — forcing each waypoint to inherit the depth of whatever surface lies beneath it.

3D HAMSTER removes that mismatch. We build:

  1. A depth-aware VLM planner — augmented with a geometry encoder and a dense depth-reconstruction objective — that predicts metric 3D end-effector trajectories. Visualized in §02.
  2. A 3D-trajectory-conditioned point-cloud policy (rectified-flow, 3DFA) trained to roll out the planner's predictions on real hardware. Real-world rollouts in §03.
Architecture diagram of 3D HAMSTER showing the depth-augmented VLM planner producing 3D waypoints that are unprojected and fused into the point cloud consumed by the low-level policy.
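The diagram describes waypoints being fused into the point cloud that the low-level policy consumes. One simple way to picture that step is sketched below. The function name `fuse_trajectory`, the extra `is_waypoint` and `progress` channels, and all coordinates are illustrative assumptions, not the feature layout used by 3D HAMSTER or 3DFA.

```python
def fuse_trajectory(scene_points, waypoints):
    """Append trajectory waypoints to a scene point cloud.

    Each output row is (x, y, z, is_waypoint, progress). The two extra
    channels are an illustrative assumption, not the paper's feature layout.
    """
    # Scene points carry a zero flag and zero progress.
    fused = [(x, y, z, 0.0, 0.0) for (x, y, z) in scene_points]
    last = max(len(waypoints) - 1, 1)  # avoids division by zero for a single waypoint
    for i, (x, y, z) in enumerate(waypoints):
        # Waypoints are flagged and tagged with their normalized position along the path.
        fused.append((x, y, z, 1.0, i / last))
    return fused


# Toy data in metres (hypothetical): three table points and a three-waypoint plan.
scene = [(0.40, -0.10, 0.02), (0.45, 0.00, 0.02), (0.50, 0.10, 0.02)]
plan = [(0.40, -0.10, 0.05), (0.45, 0.00, 0.20), (0.50, 0.10, 0.08)]
cloud = fuse_trajectory(scene, plan)
```

Because the waypoints already live in the same metric frame as the scene points, this kind of fusion needs no 2D-to-3D lookup at execution time.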
Key Contributions
  1. 3D-native hierarchical VLA. The VLM planner outputs metric 3D end-effector trajectories that feed directly into a point-cloud low-level policy — no 2D-to-3D translation in between.
  2. Recipe for metric-3D VLM prediction. A depth encoder, a depth-reconstruction objective, and a curated data mixture turn Qwen3-VL into a planner whose (u, v, d) waypoints unproject to consistent world coordinates.
  3. Geometry-grounded generalization. Planning in metric 3D — instead of pixel paths whose depth must be inferred at execution — yields manipulation that holds across novel scenes, instructions, and lighting.
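
As a rough illustration of how a (u, v, d) waypoint unprojects to world coordinates, the sketch below uses the standard pinhole camera model. It assumes d is metric depth along the optical axis, and the intrinsics, the identity extrinsics, and the waypoint values are all hypothetical placeholders rather than values from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole intrinsics in pixels (hypothetical values are used below)."""
    fx: float
    fy: float
    cx: float
    cy: float


def unproject(u, v, d, K):
    """Lift pixel (u, v) with metric depth d (metres, along the optical axis) to camera-frame XYZ."""
    x = (u - K.cx) * d / K.fx
    y = (v - K.cy) * d / K.fy
    return (x, y, d)


def to_world(p_cam, R, t):
    """Apply a rigid camera-to-world transform: p_world = R @ p_cam + t."""
    return tuple(sum(R[i][j] * p_cam[j] for j in range(3)) + t[i] for i in range(3))


# Placeholder calibration: identity rotation and zero translation (assumed).
K = Intrinsics(fx=600.0, fy=600.0, cx=320.0, cy=240.0)
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]

# Three hypothetical (u, v, d) waypoints: pixels plus depth in metres.
waypoints_uvd = [(320.0, 240.0, 0.50), (380.0, 240.0, 0.50), (380.0, 300.0, 0.60)]
trajectory_world = [to_world(unproject(u, v, d, K), R, t) for u, v, d in waypoints_uvd]
```

With a calibrated camera, the same two functions map every predicted waypoint into the frame of the observed point cloud, which is what lets the planner's output feed the low-level policy directly.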

Depth-Aware VLM Planner

Our depth-aware VLM predicts metric 3D end-effector trajectories directly — every waypoint a real point in the scene, not a pixel whose depth must be inferred at execution. The two views below let you scrutinize that geometry yourself: paired renders that lock the same trajectory to two camera angles, and free-orbit point clouds you can explore from any viewpoint. Across DroidSpatial-Bench and our real-world benchmark, our planner stays tight on ground truth where strong general-purpose VLMs — RoboBrain 2.5 and Gemini 3.0 Pro — visibly drift in depth.

From RGB-D + Instruction to a 3D Trajectory

The planner takes a single RGB-D observation and a language instruction, and directly outputs a metric 3D end-effector trajectory — a real path in the scene. Rotating it makes clear the prediction lives in 3D, not on the image plane.

Pipeline figure: RGB input + depth input + prompt (“pick up the blue car and put it in the pink bowl”) → 3D HAMSTER VLM → metric 3D trajectory.

How it is trained. We build on Qwen3-VL-8B and add a dedicated depth encoder plus a dense depth-reconstruction loss that keeps the model's internal geometry metrically faithful. Training runs in two stages: Stage 1 aligns the depth features with the VLM space (encoders frozen), and Stage 2 fine-tunes for trajectory prediction with LoRA. The planner learns from a mixture of 3D-capability data (RGB-D trajectories and spatial reasoning) and 2D preservation data (RGB-only pointing, detection, and VQA); the preservation data keeps its general vision-language ability intact.

Training data composition. Eight sources across two categories — 3D-capability (RGB-D) for metric trajectory prediction, and preservation (RGB) to retain general ability.
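| Source | Env. | Modality | Tasks | Size |
|---|---|---|---|---|
| 3D Capability Data | | | | |
| RLBench | Sim | RGB-D | 2D / 3D / 2D→3D | 606K |
| DROID | Real | RGB-D | 2D / 3D / 2D→3D | 123K |
| InternData-M1 | Sim | RGB-D | 2D / 3D / 2D→3D | 1.5M |
| RefSpatial | Mix | RGB-D | Spatial QA / Vacant loc. | 2.2M |
| Preservation Data | | | | |
| RoboPoint | Sim | RGB | 2D pointing | 666K |
| PixMo | Real | RGB | 2D pointing | 171K |
| LVIS | Real | RGB | 2D bbox det. | 138K |
| Honey-1M | Web | RGB | General VQA | 749K |
| Total | | | | ≈ 6.15M |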
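
To make the balance of the mixture concrete, the short sketch below recomputes the total and the share of each category from the sizes listed above. The sizes are the rounded figures from the table, so the shares are approximate. They are derived arithmetic, not sampling weights reported by the authors.

```python
# (source, category, size) with sizes rounded as in the table above.
SOURCES = [
    ("RLBench", "3d", 606_000),
    ("DROID", "3d", 123_000),
    ("InternData-M1", "3d", 1_500_000),
    ("RefSpatial", "3d", 2_200_000),
    ("RoboPoint", "preservation", 666_000),
    ("PixMo", "preservation", 171_000),
    ("LVIS", "preservation", 138_000),
    ("Honey-1M", "preservation", 749_000),
]

total = sum(size for _, _, size in SOURCES)

# Sum sizes per category, then convert to a fraction of the whole mixture.
category_size = {}
for _, category, size in SOURCES:
    category_size[category] = category_size.get(category, 0) + size
category_share = {c: s / total for c, s in category_size.items()}

for category, share in category_share.items():
    print(f"{category}: {category_size[category]:,} samples ({share:.1%})")
```

On these rounded sizes, roughly 72% of the samples are 3D-capability data and roughly 28% are preservation data.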
Representative samples. One real training example per source, with its ground-truth label (trajectory, points, or box) overlaid.
RLBench (RGB-D), 3D trajectory overlay: “Put the ring on the red peg”
DROID (RGB-D), 3D trajectory overlay: “Take the marker out of the cup”
InternData-M1 (RGB-D), 3D trajectory overlay: “Set the lantern on the vintage monitor”
RefSpatial (RGB-D), spatial reasoning: “Which point is farther back, 1 or 2?”
RoboPoint (RGB), affordance points: “Spot the item left of the red box”
PixMo (RGB), pointing: “Where can I find the cups?”
LVIS (RGB), bounding-box detection: “Locate the kid in this image”
Honey-1M (RGB), general VQA: “Tell me about this steam locomotive”

Side-by-Side Renders

Three real-world tasks, three different scenes — each rendered with the ground-truth trajectory and the predictions from our VLM, RoboBrain 2.5, and Gemini 3.0 Pro overlaid on the same point cloud. Our planner threads the correct 3D path in every scene, while the baselines drift in depth or miss the target surface.

Legend: ground-truth trajectory, 3D HAMSTER (ours), and baseline predictions.
Pick up the blue car and place it in the basket
Pick up the lying book and insert it into the gap between the other books
Pick up the banana and place it on the white mat

The same comparison on DroidSpatial-Bench — three benchmark scenes where our planner threads the gripper onto the correct target surface, while RoboBrain 2.5 and Gemini 3.0 Pro drift in depth.

Put the white plate in the white dish tub in the sink
Pick the green lid from the table and place it on the small bottle
Put the wooden spoon in the open drawer

Explore Interactively

Don’t take our word for it — orbit freely and check the geometry from any angle. Each viewer shows our 3D HAMSTER prediction (a red→blue gradient trajectory) on the scene point cloud. From every viewpoint, our trajectory stays glued to the correct object surface.


Real Robot Results

Ours: 3D HAMSTER, Every Task & Condition

We roll out 3D HAMSTER on a real Franka Panda across three task families and the five evaluation conditions from our quantitative protocol (§05): In-distribution, Language, Spatial, Visual, and Multiple. Each rollout is shown next to the predicted 3D trajectory in an interactive point cloud.


vs. Baselines: 3D HAMSTER vs. Prior Methods

The same real-robot scenes run head-to-head against strong baselines. 3D HAMSTER’s metrically grounded guidance holds where surface-level 2D guidance drifts.

Varied Lighting

The same four methods run on the real Franka Panda under shifted lighting conditions. 3D depth cues remain stable across illumination changes, while surface-level 2D guidance degrades.

“Pick up the volleyball and put it in the pink bowl”

3D HAMSTER (Ours): Success
HAMSTER: Failure
3DFA: Failure
π0.5: Failure

Cluttered Scenes

On the real Franka Panda, the target object is placed inside a visually crowded scene populated with distractor objects. The instruction uses an unseen referring expression, further stressing language grounding. 3D HAMSTER's metrically grounded trajectory guidance keeps the policy on-task while 2D baselines drift onto visually salient distractors.

“Pick up the ingredient that makes wine and put it in the lemon-colored bowl”

3D HAMSTER (Ours): Success
HAMSTER: Failure
3DFA: Failure
π0.5: Failure

HAMSTER vs. 3D HAMSTER

HAMSTER predicts 2D pixel trajectories that cling to the scene surface (the “graffiti effect”), so execution drifts off the target. 3D HAMSTER predicts metric 3D trajectories that stay geometrically grounded — turning the same failures into successes.

“Pick up the volleyball and place it in the basket”

HAMSTER (2D): Graffiti effect, Failure
3D HAMSTER (Ours): Grounded in 3D, Success
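
The graffiti effect can be reproduced on a toy example. The sketch below assumes a single image row seen by an overhead camera, with made-up depths in metres. It lifts a 2D pixel path by reading the sensor depth under each pixel, then compares the result with a true gripper path that rises above the table in transit. None of the numbers come from the paper.

```python
# Toy overhead view: depth is the distance from the camera to the first surface per pixel.
TABLE_DEPTH = 1.00
surface_depth = {u: TABLE_DEPTH for u in range(5)}  # one image row, five pixels
surface_depth[0] = 0.95  # top of the object to pick (hypothetical)
surface_depth[4] = 0.90  # rim of the target container (hypothetical)

# True gripper path as (pixel, gripper depth): it lifts off the table between pick and place.
true_path = [(0, 0.95), (1, 0.80), (2, 0.75), (3, 0.80), (4, 0.90)]


def lift_with_surface_depth(pixels):
    """Give each 2D waypoint the depth of whatever surface lies under its pixel."""
    return [(u, surface_depth[u]) for u in pixels]


lifted = lift_with_surface_depth([u for u, _ in true_path])
depth_error = [abs(d_true - d_lift) for (_, d_true), (_, d_lift) in zip(true_path, lifted)]
worst = max(depth_error)
```

In this toy case the endpoints land correctly because the gripper touches a surface there, but every mid-air waypoint collapses onto the table, up to about 25 cm away from the intended height. A planner that predicts depth per waypoint avoids this by construction.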

Quantitative Results

3D Trajectory Prediction on DroidSpatial-Bench

Our depth-encoder-augmented VLM produces more accurate 3D trajectories than strong baselines including proprietary VLMs and open-source models.

Table I. 3D Trajectory Prediction on DroidSpatial-Bench. Accuracy at two metric tolerances (δ = 5 cm and 10 cm) for start, end, and both endpoints. Best per column is marked with *.

| Model | Input | Start (δ = 5 cm) | End (δ = 5 cm) | Both (δ = 5 cm) | Start (δ = 10 cm) | End (δ = 10 cm) | Both (δ = 10 cm) |
|---|---|---|---|---|---|---|---|
| Proprietary API Models | | | | | | | |
| Claude Sonnet 4.6 | RGB | 1.4 | 8.8 | 0.7 | 6.1 | 16.2 | 2.0 |
| GPT-5.2 | RGB | 6.8 | 29.7 | 2.7 | 29.7 | 45.3 | 16.2 |
| Gemini-3.0-Pro | RGB | 29.1 | 44.0 | 16.2 | 43.2 | 56.1 | 29.7 |
| Open-Source Models | | | | | | | |
| RoboBrain-2.5-8B | RGB | 61.5 | 58.1 | 39.2 | 80.4 | 74.3 | 60.1 |
| 3D HAMSTER (Ours) | | | | | | | |
| Qwen3-VL-8B | RGB | 0.7 | 9.5 | 0.7 | 0.7 | 14.9 | 0.7 |
| + 3D Traj. Data | RGB | 50.0 | 50.0 | 27.7 | 71.6 | 72.3 | 50.0 |
| + Depth Encoder | RGBD | 62.8 | 62.2 | 42.6* | 83.8* | 75.0 | 62.8 |
| + L_depth | RGBD | 63.5* | 66.2* | 41.9 | 80.4 | 82.4* | 65.5* |
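
For readers who want to reproduce this style of metric, the sketch below shows one plausible reading of accuracy at tolerance δ. It assumes an endpoint counts as correct when its Euclidean distance to the ground-truth endpoint is at most δ, and that "Both" requires the start and the end to pass together. The benchmark's exact rule may differ, and the function name and toy trajectories are hypothetical.

```python
import math


def endpoint_accuracy(preds, gts, delta):
    """Percentage of trajectories whose start, end, or both endpoints fall within delta metres.

    preds and gts are equal-length lists of trajectories; each trajectory is a
    list of (x, y, z) points in metres. Euclidean distance is an assumption.
    """
    start_hits = end_hits = both_hits = 0
    for pred, gt in zip(preds, gts):
        start_ok = math.dist(pred[0], gt[0]) <= delta
        end_ok = math.dist(pred[-1], gt[-1]) <= delta
        start_hits += start_ok
        end_hits += end_ok
        both_hits += start_ok and end_ok
    n = len(gts)
    return {
        "start": 100.0 * start_hits / n,
        "end": 100.0 * end_hits / n,
        "both": 100.0 * both_hits / n,
    }


# Two toy trajectories (hypothetical values, metres).
gts = [
    [(0.00, 0.00, 0.50), (0.20, 0.00, 0.50)],
    [(0.10, 0.10, 0.60), (0.30, 0.10, 0.40)],
]
preds = [
    [(0.03, 0.00, 0.50), (0.20, 0.08, 0.50)],  # start off by 3 cm, end off by 8 cm
    [(0.10, 0.10, 0.72), (0.30, 0.10, 0.41)],  # start off by 12 cm, end off by 1 cm
]

acc_5cm = endpoint_accuracy(preds, gts, delta=0.05)
acc_10cm = endpoint_accuracy(preds, gts, delta=0.10)
```

The "Both" column can never exceed the smaller of the "Start" and "End" columns under this definition, which matches the pattern in every row of Table I.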

Simulation: Colosseum Benchmark

Per-variation success rates averaged across 11 tasks on Colosseum, which stress-tests policies under 14 perturbation axes. 3D guidance provides both appearance invariance and geometric robustness.

Table II. Colosseum simulation: per-variation success rate averaged across 11 tasks.
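| Method | None | MO | RO | Light | Table | Distract | BG Tex | RLB Var | Cam Pose | All Var | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3DFA | 53.8 | 39.0 | 27.2 | 38.0 | 30.3 | 36.4 | 43.3 | 50.0 | 46.8 | 0.8 | 36.6 |
| 3DFA + HAMSTER | 49.5 | 38.1 | 28.8 | 38.8 | 42.3 | 40.8 | 43.3 | 50.5 | 48.4 | 7.2 | 38.8 |
| 3DFA + 3D HAMSTER | 62.9 | 49.4 | 36.3 | 54.4 | 39.6 | 44.8 | 52.0 | 52.5 | 49.2 | 7.2 | 44.8 |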

Real-World Manipulation

Success rates (%) on a real Franka Panda arm across three task families. 3D HAMSTER achieves the highest average success in every task family, with the largest gains under visual and spatial distribution shifts.

Table III. Real-world manipulation success rates (%) on a Franka Panda arm across three task families and five condition shifts (In-distribution, Language, Spatial, Visual, Multiple).
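| Method | In-D | Lang | Spatial | Visual | Multiple | Avg. |
|---|---|---|---|---|---|---|
| Button Pressing | | | | | | |
| π0.5 | 100 | 80 | 40 | 90 | 60 | 74 |
| 3DFA | 90 | 30 | 0 | 30 | 40 | 38 |
| 3DFA + HAMSTER | 80 | 50 | 20 | 80 | 70 | 60 |
| 3DFA + 3D HAMSTER | 100 | 90 | 50 | 100 | 60 | 80 |
| Pouring | | | | | | |
| π0.5 | 60 | 15 | 35 | 65 | 30 | 41 |
| 3DFA | 80 | 45 | 50 | 50 | 25 | 50 |
| 3DFA + HAMSTER | 75 | 45 | 40 | 35 | 30 | 45 |
| 3DFA + 3D HAMSTER | 95 | 75 | 65 | 65 | 40 | 68 |
| Pick-and-Place | | | | | | |
| π0.5 | 100 | 50 | 15 | 30 | 5 | 40 |
| 3DFA | 65 | 35 | 55 | 30 | 20 | 41 |
| 3DFA + HAMSTER | 65 | 65 | 35 | 30 | 35 | 46 |
| 3DFA + 3D HAMSTER | 90 | 70 | 55 | 50 | 45 | 62 |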
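
As a quick cross-check of Table III, the sketch below recomputes each Avg. entry from the five condition columns and then averages those per-family figures across the three task families. The cross-family mean is derived arithmetic for illustration only. It is not a number reported by the authors, and it weights each family equally.

```python
# Success rates (%) per condition: In-D, Lang, Spatial, Visual, Multiple (from Table III).
RESULTS = {
    "Button Pressing": {
        "π0.5": [100, 80, 40, 90, 60],
        "3DFA": [90, 30, 0, 30, 40],
        "3DFA + HAMSTER": [80, 50, 20, 80, 70],
        "3DFA + 3D HAMSTER": [100, 90, 50, 100, 60],
    },
    "Pouring": {
        "π0.5": [60, 15, 35, 65, 30],
        "3DFA": [80, 45, 50, 50, 25],
        "3DFA + HAMSTER": [75, 45, 40, 35, 30],
        "3DFA + 3D HAMSTER": [95, 75, 65, 65, 40],
    },
    "Pick-and-Place": {
        "π0.5": [100, 50, 15, 30, 5],
        "3DFA": [65, 35, 55, 30, 20],
        "3DFA + HAMSTER": [65, 65, 35, 30, 35],
        "3DFA + 3D HAMSTER": [90, 70, 55, 50, 45],
    },
}


def mean(values):
    return sum(values) / len(values)


# Per-family average over the five conditions (this reproduces the Avg. column).
family_avg = {
    family: {method: mean(rates) for method, rates in rows.items()}
    for family, rows in RESULTS.items()
}

# Equal-weight mean of the per-family averages, per method.
methods = list(RESULTS["Button Pressing"])
overall = {m: mean([family_avg[f][m] for f in RESULTS]) for m in methods}

for method in methods:
    print(f"{method}: {overall[method]:.1f}")
```

Under this equal-weight averaging, 3DFA + 3D HAMSTER comes out at 70.0, ahead of π0.5 at about 51.7, 3DFA + HAMSTER at about 50.3, and 3DFA at 43.0.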

BibTeX

@INPROCEEDINGS{hwang20263dhamster,
  author={Hwang, Dongyoon and Lee, Byungkun and Kim, Dongjin and Jang, Hyojin and Jin, Hoiyeong and Mun, Jueun and Park, Minho and Lee, Hojoon and Kim, Hyunseung and Choo, Jaegul},
  booktitle={2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
  title={{3D HAMSTER}: Bridging Planning and Control in Hierarchical Vision Language Action Models through {3D} Trajectory Guidance},
  year={2026}}