3D HAMSTER: Bridging Planning and Control through 3D Trajectory Guidance

Overview

TL;DR: 3D HAMSTER predicts metrically grounded 3D end-effector trajectories from a depth-augmented VLM and executes them through a pointcloud-based low-level policy, enabling robust manipulation across diverse real-world scenes, instructions, and visual conditions.

Depth-Aware VLM Planner

Predicts metric 3D end-effector trajectories from RGB-D input.

Point-Cloud Policy

Executes the predicted 3D trajectory on the observed point cloud.

Robust Manipulation

Generalizes across language, spatial, and visual distribution shifts.

Predicts a metric 3D trajectory

Rolls out on the observed point cloud

Robust across diverse tasks & conditions

Planning in metric 3D — not in pixels — keeps execution geometrically grounded.

Approach

Method

Hierarchical Vision-Language-Action (VLA) models split manipulation into a high-level planner and a low-level policy. Today's strongest low-level policies operate on 3D point clouds, yet existing planners predict only 2D pixel trajectories — forcing each waypoint to inherit the depth of whatever surface lies beneath it.

3D HAMSTER removes that mismatch. We build:

A depth-aware VLM planner — augmented with a geometry encoder and a dense depth-reconstruction objective — that predicts metric 3D end-effector trajectories. Visualized in §02.
A 3D-trajectory-conditioned point-cloud policy (rectified-flow, 3DFA) trained to roll out the planner's predictions on real hardware. Real-world rollouts in §03.

Figure: Architecture of 3D HAMSTER. The depth-augmented VLM planner produces 3D waypoints that are unprojected and fused into the point cloud consumed by the low-level policy.
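One way to picture the fusion step named in the architecture figure is the sketch below. It is an illustration under stated assumptions, not the released implementation: the function name `fuse_trajectory` and the extra indicator channel that marks trajectory points are hypothetical choices made here for clarity.

```python
import numpy as np

def fuse_trajectory(scene_xyz, traj_xyz):
    """Append trajectory waypoints to a scene point cloud.

    scene_xyz: (N, 3) scene points, traj_xyz: (M, 3) waypoints, same frame.
    Returns an (N + M, 4) array whose last column is 0 for scene points
    and 1 for trajectory points (an assumed encoding, for illustration).
    """
    scene_xyz = np.asarray(scene_xyz, dtype=float)
    traj_xyz = np.asarray(traj_xyz, dtype=float)
    scene = np.concatenate([scene_xyz, np.zeros((len(scene_xyz), 1))], axis=1)
    traj = np.concatenate([traj_xyz, np.ones((len(traj_xyz), 1))], axis=1)
    return np.concatenate([scene, traj], axis=0)

# Toy data: 100 random scene points and a 3-waypoint trajectory.
rng = np.random.default_rng(0)
scene_points = rng.uniform(-0.5, 0.5, size=(100, 3))
trajectory = np.array([[0.0, 0.0, 0.1], [0.1, 0.0, 0.3], [0.2, 0.0, 0.1]])
fused = fuse_trajectory(scene_points, trajectory)
```

Because the waypoints live in the same metric frame as the scene points, no per-waypoint depth lookup is needed at this stage.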

Key Contributions

3D-native hierarchical VLA. The VLM planner outputs metric 3D end-effector trajectories that feed directly into a point-cloud low-level policy — no 2D-to-3D translation in between.
Recipe for metric-3D VLM prediction. A depth encoder, a depth-reconstruction objective, and a curated data mixture turn Qwen3-VL into a planner whose (u, v, d) waypoints unproject to consistent world coordinates.
Geometry-grounded generalization. Planning in metric 3D — instead of pixel paths whose depth must be inferred at execution — yields manipulation that holds across novel scenes, instructions, and lighting.
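To make the (u, v, d) waypoint format concrete, here is a minimal unprojection sketch. It assumes a pinhole camera, depth d measured along the camera z-axis in meters, and an optional camera-to-world transform; the function name `unproject_waypoints` and the intrinsics values are illustrative, not taken from the project.

```python
import numpy as np

def unproject_waypoints(uvd, K, T_world_cam=None):
    """Lift (u, v, d) waypoints to 3D points.

    uvd: (N, 3) array of pixel column u, pixel row v, metric depth d.
    K: (3, 3) pinhole intrinsics matrix.
    T_world_cam: optional (4, 4) camera-to-world transform.
    """
    uvd = np.asarray(uvd, dtype=float)
    u, v, d = uvd[:, 0], uvd[:, 1], uvd[:, 2]
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    # Standard pinhole back-projection into the camera frame.
    x = (u - cx) * d / fx
    y = (v - cy) * d / fy
    pts_cam = np.stack([x, y, d], axis=1)
    if T_world_cam is None:
        return pts_cam
    # Homogeneous transform into the world frame.
    pts_h = np.concatenate([pts_cam, np.ones((len(pts_cam), 1))], axis=1)
    return (T_world_cam @ pts_h.T).T[:, :3]

# Assumed intrinsics for a 640x480 image.
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
waypoints_uvd = np.array([[320.0, 240.0, 0.80],
                          [380.0, 240.0, 0.60]])
points_cam = unproject_waypoints(waypoints_uvd, K)
```

The first waypoint sits at the principal point and lands on the optical axis at 0.80 m; the second is 60 pixels to the right at 0.60 m depth, which maps to 0.06 m of lateral offset.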

Trajectory Prediction

Depth-Aware VLM Planner

Our depth-aware VLM predicts metric 3D end-effector trajectories directly — every waypoint a real point in the scene, not a pixel whose depth must be inferred at execution. The two views below let you scrutinize that geometry yourself: paired renders that lock the same trajectory to two camera angles, and free-orbit point clouds you can explore from any viewpoint. Across DroidSpatial-Bench and our real-world benchmark, our planner stays tight on ground truth where strong general-purpose VLMs — RoboBrain 2.5 and Gemini 3.0 Pro — visibly drift in depth.

From RGB-D + Instruction to a 3D Trajectory

The planner takes a single RGB-D observation and a language instruction, and directly outputs a metric 3D end-effector trajectory — a real path in the scene. Rotating it makes clear the prediction lives in 3D, not on the image plane.

How it is trained. We build on Qwen3-VL-8B and add a dedicated depth encoder plus a dense depth-reconstruction loss that keeps the model's internal geometry metrically faithful. Training runs in two stages: Stage 1 aligns the depth features with the VLM space (encoders frozen), and Stage 2 fine-tunes for trajectory prediction with LoRA. The planner learns from a mixture of 3D-capability data (RGB-D trajectories and spatial reasoning) and 2D preservation data (RGB-only pointing, detection, and VQA) that keeps its general vision-language ability intact.

Training data composition. Eight sources across two categories — 3D-capability (RGB-D) for metric trajectory prediction, and preservation (RGB) to retain general ability.

Source	Env.	Modality	Tasks	Size
3D Capability Data
RLBench	Sim	RGB-D	2D / 3D / 2D→3D	606K
DROID	Real	RGB-D	2D / 3D / 2D→3D	123K
InternData-M1	Sim	RGB-D	2D / 3D / 2D→3D	1.5M
RefSpatial	Mix	RGB-D	Spatial QA / Vacant loc.	2.2M
Preservation Data
RoboPoint	Sim	RGB	2D pointing	666K
PixMo	Real	RGB	2D pointing	171K
LVIS	Real	RGB	2D bbox det.	138K
Honey-1M	Web	RGB	General VQA	749K
Total				≈ 6.15M
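The table's total and the split between the two categories can be checked with a few lines of arithmetic. The sizes below are copied from the table in thousands of samples; because 1.5M and 2.2M are themselves rounded, the result is approximate.

```python
# Sizes from the table, in thousands of samples.
sizes_k = {
    "RLBench": 606, "DROID": 123, "InternData-M1": 1500, "RefSpatial": 2200,
    "RoboPoint": 666, "PixMo": 171, "LVIS": 138, "Honey-1M": 749,
}
capability = ["RLBench", "DROID", "InternData-M1", "RefSpatial"]

total_k = sum(sizes_k.values())                            # 6153 -> about 6.15M
capability_k = sum(sizes_k[name] for name in capability)   # 4429
preservation_k = total_k - capability_k                    # 1724
capability_share = capability_k / total_k                  # about 0.72

print(f"total ~ {total_k / 1000:.2f}M, 3D-capability share ~ {capability_share:.0%}")
```

By sample count, roughly 72% of the mixture is 3D-capability data and 28% is preservation data. How the sources are weighted during training is not stated on this page.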

RLBench sample with a 3D trajectory overlay (RLBench, RGB-D): “Put the ring on the red peg”

DROID sample with a 3D trajectory overlay (DROID, RGB-D).

Side-by-Side Renders

Three real-world tasks, three different scenes — each rendered with the ground-truth trajectory and the predictions from our VLM, RoboBrain 2.5, and Gemini 3.0 Pro overlaid on the same point cloud. Our planner threads the correct 3D path in every scene, while the baselines drift in depth or miss the target surface.

Legend: ground-truth trajectory, 3D HAMSTER (ours), and baseline predictions.

Pick up the blue car and place it in the basket

Pick up the lying book and insert it into the gap between the other books

Pick up the banana and place it on the white mat

The same comparison on DroidSpatial-Bench — three benchmark scenes where our planner threads the gripper onto the correct target surface, while RoboBrain 2.5 and Gemini 3.0 Pro drift in depth.

Put the white plate in the white dish tub in the sink

Pick the green lid from the table and place it on the small bottle

Put the wooden spoon in the open drawer

Explore Interactively

Don’t take our word for it — orbit freely and check the geometry from any angle. Each viewer shows our 3D HAMSTER prediction (a red→blue gradient trajectory) on the scene point cloud. From every viewpoint, our trajectory stays glued to the correct object surface.
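For readers who want to reproduce a similar overlay, the sketch below assigns a red-to-blue gradient to trajectory waypoints by linear interpolation in RGB. It assumes red marks the first waypoint and blue the last; the function name `gradient_colors` is illustrative and not part of the project's viewer.

```python
import numpy as np

def gradient_colors(n_waypoints):
    """Return (n_waypoints, 3) RGB colors in [0, 1], red first, blue last."""
    t = np.linspace(0.0, 1.0, n_waypoints)[:, None]
    red = np.array([1.0, 0.0, 0.0])
    blue = np.array([0.0, 0.0, 1.0])
    # Linear blend: t = 0 gives red, t = 1 gives blue.
    return (1.0 - t) * red + t * blue

colors = gradient_colors(5)
```

The resulting array can be passed as per-point colors to most point-cloud viewers.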

Interactive viewers: DroidSpatial-Bench (15 scenes) and Real-World (5 scenes).

Robot Rollouts · Real-World

Real Robot Results

3D HAMSTER (Ours): Every Task & Condition

We roll out 3D HAMSTER on a real Franka Panda across three task families and the five condition shifts from our quantitative protocol (§05): In-distribution, Language, Spatial, Visual, and Multiple. Pick a task to scan all five conditions at a glance, then select one to inspect its rollout next to the predicted 3D trajectory in an interactive point cloud.

3D HAMSTER vs. Prior Methods

The same real-robot scenes run head-to-head against strong baselines. 3D HAMSTER’s metrically grounded guidance holds where surface-level 2D guidance drifts.

Varied Lighting

The same four methods are evaluated on the real Franka Panda under shifted lighting conditions. 3D depth cues remain stable across illumination changes, while surface-level 2D guidance degrades.

“Pick up the volleyball and put it in the pink bowl”

Method	Outcome
3D HAMSTER (Ours)	Success
HAMSTER	Failure
3DFA	Failure
π0.5	Failure

Cluttered Scenes

On the real Franka Panda, the target object is placed inside a visually crowded scene populated with distractor objects. The instruction uses an unseen referring expression, further stressing language grounding. 3D HAMSTER's metrically grounded trajectory guidance keeps the policy on-task while 2D baselines drift onto visually salient distractors.

“Pick up the ingredient that makes wine and put it in the lemon-colored bowl”

Method	Outcome
3D HAMSTER (Ours)	Success
HAMSTER	Failure
3DFA	Failure
π0.5	Failure

2D vs. 3D Trajectory Guidance

HAMSTER vs. 3D HAMSTER

HAMSTER predicts 2D pixel trajectories that cling to the scene surface (the “graffiti effect”), so execution drifts off the target. 3D HAMSTER predicts metric 3D trajectories that stay geometrically grounded — turning the same failures into successes.
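The graffiti effect can be reproduced on a toy scene. The sketch below assumes a camera looking straight down at a flat table 1.00 m away; the pixel path and the predicted depths are made-up values chosen only to illustrate the difference between looking up depth from the surface and predicting it.

```python
import numpy as np

# Synthetic top-down depth map: a flat table 1.00 m from the camera (assumed).
table_depth = 1.00
depth_map = np.full((480, 640), table_depth)

# A pixel path across the image, as (u, v) = (column, row).
path_uv = np.array([[200, 240], [320, 240], [440, 240]])

# 2D guidance: each waypoint inherits the depth of the surface under its pixel.
surface_depth = depth_map[path_uv[:, 1], path_uv[:, 0]]

# 3D guidance: depth is part of the prediction (hypothetical lift-and-carry arc).
predicted_depth = np.array([0.98, 0.80, 0.95])

# Height of each waypoint above the table along the viewing direction.
height_2d = table_depth - surface_depth    # all zeros: painted on the table
height_3d = table_depth - predicted_depth  # 0.02, 0.20, 0.05 m above it
```

The surface-lifted path has zero clearance everywhere, so a carry motion that must pass above the table cannot be expressed, while the predicted depths encode the lift directly.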

“Pick up the volleyball and place it in the basket”

Method	Predicted trajectory	Rollout outcome
HAMSTER (2D)	Graffiti effect	Failure
3D HAMSTER (Ours)	Grounded in 3D	Success

Benchmarks · Accuracy & Success Rate

Quantitative Results

3D Trajectory Prediction on DroidSpatial-Bench

Our depth-encoder-augmented VLM produces more accurate 3D trajectories than strong baselines including proprietary VLMs and open-source models.

Table I. 3D Trajectory Prediction on DroidSpatial-Bench. Accuracy (%) at two metric tolerances (δ = 5 cm and 10 cm) for start, end, and both endpoints.

Model	Input	Start (δ = 5 cm)	End (δ = 5 cm)	Both (δ = 5 cm)	Start (δ = 10 cm)	End (δ = 10 cm)	Both (δ = 10 cm)
Proprietary API Models
Claude Sonnet 4.6	RGB	1.4	8.8	0.7	6.1	16.2	2.0
GPT-5.2	RGB	6.8	29.7	2.7	29.7	45.3	16.2
Gemini-3.0-Pro	RGB	29.1	44.0	16.2	43.2	56.1	29.7
Open-Source Models
RoboBrain-2.5-8B	RGB	61.5	58.1	39.2	80.4	74.3	60.1
3D HAMSTER (Ours)
Qwen3-VL-8B	RGB	0.7	9.5	0.7	0.7	14.9	0.7
+ 3D Traj. Data	RGB	50.0	50.0	27.7	71.6	72.3	50.0
+ Depth Encoder	RGB-D	62.8	62.2	42.6	83.8	75.0	62.8
+ L_depth	RGB-D	63.5	66.2	41.9	80.4	82.4	65.5
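The Start, End, and Both columns can be read as thresholded endpoint errors. The sketch below is one plausible reading, assuming Euclidean distance in meters between predicted and ground-truth endpoints and equal-length trajectories; the benchmark's exact protocol may differ, and the function name `endpoint_accuracy` is illustrative.

```python
import numpy as np

def endpoint_accuracy(pred, gt, delta_m):
    """Percent of samples whose start, end, or both endpoints fall within delta_m.

    pred, gt: (B, T, 3) trajectories in meters.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    start_ok = np.linalg.norm(pred[:, 0] - gt[:, 0], axis=-1) <= delta_m
    end_ok = np.linalg.norm(pred[:, -1] - gt[:, -1], axis=-1) <= delta_m
    return {
        "start": 100.0 * start_ok.mean(),
        "end": 100.0 * end_ok.mean(),
        "both": 100.0 * (start_ok & end_ok).mean(),
    }

# Two toy samples with two waypoints each (made-up values).
gt = np.array([[[0.0, 0.0, 0.0], [0.3, 0.0, 0.0]],
               [[0.0, 0.0, 0.0], [0.3, 0.0, 0.0]]])
pred = np.array([[[0.03, 0.0, 0.0], [0.3, 0.0, 0.08]],
                 [[0.0, 0.12, 0.0], [0.3, 0.04, 0.0]]])

acc_5cm = endpoint_accuracy(pred, gt, 0.05)
acc_10cm = endpoint_accuracy(pred, gt, 0.10)
```

Under this reading, Both can never exceed the smaller of Start and End, which matches every row of Table I.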

Simulation: Colosseum Benchmark

Per-variation success rates averaged across 11 tasks on Colosseum, which stress-tests policies under 14 perturbation axes. 3D guidance provides both appearance invariance and geometric robustness.

Table II. Colosseum simulation: per-variation success rate averaged across 11 tasks.

Method	None	MO	RO	Light	Table	Distract	BG Tex	RLB Var	Cam Pose	All Var	Avg.
3DFA	53.8	39.0	27.2	38.0	30.3	36.4	43.3	50.0	46.8	0.8	36.6
3DFA + HAMSTER	49.5	38.1	28.8	38.8	42.3	40.8	43.3	50.5	48.4	7.2	38.8
3DFA + 3D HAMSTER	62.9	49.4	36.3	54.4	39.6	44.8	52.0	52.5	49.2	7.2	44.8
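The Avg. column in Table II is consistent with an unweighted mean over the ten listed columns, including None and All Var. The check below copies the per-variation numbers from the table.

```python
# Per-variation success rates from Table II, in column order:
# None, MO, RO, Light, Table, Distract, BG Tex, RLB Var, Cam Pose, All Var.
rows = {
    "3DFA": [53.8, 39.0, 27.2, 38.0, 30.3, 36.4, 43.3, 50.0, 46.8, 0.8],
    "3DFA + HAMSTER": [49.5, 38.1, 28.8, 38.8, 42.3, 40.8, 43.3, 50.5, 48.4, 7.2],
    "3DFA + 3D HAMSTER": [62.9, 49.4, 36.3, 54.4, 39.6, 44.8, 52.0, 52.5, 49.2, 7.2],
}

averages = {name: round(sum(vals) / len(vals), 1) for name, vals in rows.items()}
gain_over_3dfa = round(averages["3DFA + 3D HAMSTER"] - averages["3DFA"], 1)
```

This reproduces the reported averages of 36.6, 38.8, and 44.8, a gain of 8.2 points for 3D guidance over the 3DFA baseline.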

Real-World Manipulation

Success rates (%) on a real Franka Panda arm across three task families. 3D HAMSTER achieves the highest average success across all tasks, with the largest gains under visual and spatial distribution shifts.

Table III. Real-world manipulation success rates (%) on a Franka Panda arm across three task families and five condition shifts (In-distribution, Language, Spatial, Visual, Multiple).

Method	In-D	Lang	Spatial	Visual	Multiple	Avg.
Button Pressing
π0.5	100	80	40	90	60	74
3DFA	90	30	0	30	40	38
3DFA + HAMSTER	80	50	20	80	70	60
3DFA + 3D HAMSTER	100	90	50	100	60	80
Pouring
π0.5	60	15	35	65	30	41
3DFA	80	45	50	50	25	50
3DFA + HAMSTER	75	45	40	35	30	45
3DFA + 3D HAMSTER	95	75	65	65	40	68
Pick-and-Place
π0.5	100	50	15	30	5	40
3DFA	65	35	55	30	20	41
3DFA + HAMSTER	65	65	35	30	35	46
3DFA + 3D HAMSTER	90	70	55	50	45	62

Citation

BibTeX

@INPROCEEDINGS{hwang20263dhamster,
  author={Hwang, Dongyoon and Lee, Byungkun and Kim, Dongjin and Jang, Hyojin and Jin, Hoiyeong and Mun, Jueun and Park, Minho and Lee, Hojoon and Kim, Hyunseung and Choo, Jaegul},
  booktitle={2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
  title={{3D HAMSTER}: Bridging Planning and Control in Hierarchical Vision Language Action Models through {3D} Trajectory Guidance},
  year={2026}}
