Comparison between Vanilla GR00T-N1-2B and ACG (Ours) in real-world strawberry picking with SO-101.
The vanilla GR00T-N1-2B often exhibits severe jittering, occasionally striking and displacing the target object.
In contrast, ACG maintains smooth and stable motions, successfully grasping the strawberry without disturbance.

Abstract

Diffusion and flow-matching models have emerged as powerful robot policies, enabling Vision-Language-Action (VLA) models to generalize across diverse scenes and instructions. Yet, when trained via imitation learning, their high generative capacity makes them sensitive to noise in human demonstrations, such as jerks, pauses, and jitter, which reduces action coherence. Reduced action coherence causes instability and trajectory drift during deployment, and these failures are catastrophic in fine-grained manipulation, where precision is crucial. In this paper, we present Action Coherence Guidance (ACG) for VLA models, a training-free test-time guidance algorithm that improves action coherence and thereby yields performance gains. Evaluated on RoboCasa, DexMimicGen, and real-world SO-101 tasks, ACG consistently improves action coherence and boosts success rates across diverse manipulation tasks.

ACG: Guiding Policy with Incoherent Action Generation

ACG constructs an incoherent vector field and combines it with the original vector field, extrapolating away from the incoherent field to steer sampling toward coherent action sequences.
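The extrapolation step can be sketched in a few lines. This is a minimal illustration, not the released implementation: the function names `guided_velocity` and `euler_step`, the guidance `scale` parameter, and the specific linear form `v + scale * (v - v_incoherent)` are assumptions borrowed from how classifier-free-style guidance is commonly written.

```python
from typing import List

Vector = List[float]


def guided_velocity(v_orig: Vector, v_incoherent: Vector, scale: float) -> Vector:
    """Push the original prediction away from the incoherent one.

    Assumed form: v_orig + scale * (v_orig - v_incoherent).
    scale = 0 recovers the original (unguided) vector field.
    """
    return [vo + scale * (vo - vi) for vo, vi in zip(v_orig, v_incoherent)]


def euler_step(x: Vector, v: Vector, dt: float) -> Vector:
    """One explicit Euler step of flow-matching sampling: x <- x + dt * v."""
    return [xi + dt * vi for xi, vi in zip(x, v)]


# Toy example with a 2-dimensional "action" vector.
v = guided_velocity([1.0, 2.0], [0.5, 3.0], scale=1.0)
x_next = euler_step([0.0, 0.0], v, dt=0.5)
```

With `scale=1.0`, each component moves as far beyond the original prediction as the incoherent prediction lies from it, in the opposite direction. Larger scales extrapolate further.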

Conceptual illustration of ACG
Method overview
Starting from the original model architecture (left), we modify self-attention by replacing the attention map with the identity map to generate an incoherent action sequence (right). The coherent denoising vector is then guided using the opposite direction of the incoherent denoising vector (middle).
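To make the attention modification concrete, the sketch below shows single-head self-attention with an optional identity attention map. It is a simplified illustration under stated assumptions: the function `self_attention`, its `identity_map` flag, and the plain-list matrix representation are hypothetical, and the actual model's layers (multi-head projections, which layers are modified) are not shown here.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(q: Matrix, k: Matrix, v: Matrix, identity_map: bool = False) -> Matrix:
    """Single-head attention over n action tokens.

    identity_map=False: standard softmax(Q K^T / sqrt(d)) V.
    identity_map=True: the attention map is replaced by the identity,
    so each token attends only to itself and the output equals V.
    """
    n, d = len(q), len(q[0])
    if identity_map:
        attn = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    else:
        scores = [
            [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
            for qi in q
        ]
        attn = [softmax(r) for r in scores]
    cols = len(v[0])
    return [
        [sum(attn[i][j] * v[j][c] for j in range(n)) for c in range(cols)]
        for i in range(n)
    ]


# Three action tokens with 2-dimensional values.
values = [[1.0, 0.0], [2.0, 4.0], [3.0, 8.0]]
zeros = [[0.0, 0.0] for _ in range(3)]
mixed = self_attention(zeros, zeros, values)                        # tokens exchange information
isolated = self_attention(zeros, zeros, values, identity_map=True)  # no cross-token mixing
```

With the identity map, no information flows between time steps of the action sequence, which is why the resulting prediction is temporally incoherent and can serve as the direction to guide away from.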

Qualitative Comparison

Visual comparison of different methods on real-world strawberry-picking tasks.
ACG produces smoother and more coherent motions than baselines, resulting in higher success rates.

Real-world (SO-101)

Vanilla

Exhibits severe jerks and jittering, often knocking strawberries away from the target.

Ensemble 2

Improves temporal stability but hesitates between targets due to averaging across inferences.

Ensemble 5

Improves temporal stability but hesitates between targets due to averaging across inferences.

Action Smoothing

Reduces fluctuations but produces inaccurate motions since no task-specific prior is applied.

Feature Smoothing

Reduces fluctuations but produces inaccurate motions since no task-specific prior is applied.

CFG

Shows no noticeable improvement in coherence over the vanilla baseline.

White Noise

Generates smooth but imprecise motions, occasionally pushing strawberries away.

ACG (Ours)

Performs smooth, coherent, and precise picking motions, consistently achieving successful strawberry grasps.

Simulation (RoboCasa)

Coffee Serving Task

w/o ACG

The gripper shakes during grasping, leading to frequent failures during placement.

w/ ACG

Grasps and serves the cup with stable, controlled movements without shaking or dropping.

Microwave Button Task

w/o ACG

The gripper jitters and fails to press the button correctly.

w/ ACG

Accurately presses the microwave button in a single, stable motion.

Simulation (DexMimicGen)

TwoArmThreading

w/o ACG

Fails to insert the rod due to jittery and imprecise motion generation.

w/ ACG

Successfully threads the rod through the hole with coordinated actions.

TwoArmLiftTray

w/o ACG

Fails to lift the tray properly due to hesitation during grasping.

w/ ACG

Lifts the tray smoothly with a stable, balanced grasp on both handles.

Quantitative Results

Result 1: ACG consistently outperforms baselines by significant margins.


Result 2: ACG significantly improves temporal action coherence.


Result 3: ACG remains effective across various VLA backbones.
