The vanilla GR00T-N1-2B often exhibits severe jittering, occasionally striking and displacing the target object.
In contrast, ACG maintains smooth and stable motions, successfully grasping the strawberry without disturbance.
Diffusion and flow-matching models have emerged as powerful robot policies, enabling Vision-Language-Action (VLA) models to generalize across diverse scenes and instructions. Yet, when trained via imitation learning, their high generative capacity makes them sensitive to noise in human demonstrations, such as jerks, pauses, and jitter, which reduces action coherence. Reduced action coherence causes instability and trajectory drift during deployment, and these failures are catastrophic in fine-grained manipulation, where precision is crucial. In this paper, we present Action Coherence Guidance (ACG) for VLA models, a training-free test-time guidance algorithm that improves action coherence and thereby yields performance gains. Evaluated on RoboCasa, DexMimicGen, and real-world SO-101 tasks, ACG consistently improves action coherence and boosts success rates across diverse manipulation tasks.
ACG constructs an incoherent vector field and extrapolates the original vector field away from it, steering sampling toward coherent action sequences.
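The extrapolation step can be illustrated with a minimal sketch. It assumes a guidance rule in the style of classifier-free guidance, where the guided velocity is the original velocity plus a scaled difference from the incoherent one. The function names, the `scale` parameter, and the Euler sampler are hypothetical choices made for illustration. How the incoherent vector field is built is not specified here, so it is passed in as an opaque callable.

```python
import numpy as np

def guided_velocity(v_orig, v_incoh, scale):
    """Push the original velocity away from the incoherent one.

    Assumed form: v_orig + scale * (v_orig - v_incoh).
    scale = 0 recovers the unguided policy.
    """
    return v_orig + scale * (v_orig - v_incoh)

def sample_actions(v_orig_fn, v_incoh_fn, x0, scale=1.0, num_steps=10):
    """Euler-integrate the guided vector field from t = 0 to t = 1.

    v_orig_fn and v_incoh_fn are hypothetical callables (x, t) -> velocity
    with the same shape as x, e.g. (horizon, action_dim).
    """
    x = np.asarray(x0, dtype=float).copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = guided_velocity(v_orig_fn(x, t), v_incoh_fn(x, t), scale)
        x = x + dt * v
    return x
```

In this sketch each denoising step evaluates both fields, so guidance costs roughly one extra forward pass per step. No retraining is involved, which matches the training-free, test-time setting described above.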
Visual comparison of different methods on real-world strawberry-picking tasks.
ACG produces smoother and more coherent motions than baselines, resulting in higher success rates.
Exhibits severe jerks and jittering, often knocking strawberries away from the target.
Improves temporal stability but hesitates between targets due to averaging across inferences.
Reduces fluctuations but produces inaccurate motions since no task-specific prior is applied.
Shows no noticeable improvement in coherence over the vanilla baseline.
Generates smooth but imprecise motions, occasionally pushing strawberries away.
Performs smooth, coherent, and precise picking motions, consistently achieving successful strawberry grasps.
The gripper shakes during grasping, leading to frequent failures during placement.
Grasps and serves the cup with stable, controlled movements without shaking or dropping.
The gripper jitters and fails to press the button correctly.
Accurately presses the microwave button in a single, stable motion.
Fails to insert the rod due to jittery and imprecise motion generation.
Successfully threads the rod through the hole with coordinated actions.
Fails to lift the tray properly due to hesitation during grasping.
Lifts the tray smoothly with a stable, balanced grasp on both handles.
Result 1: ACG consistently outperforms baselines by significant margins.
Result 2: ACG significantly improves temporal action coherence.
Result 3: ACG remains effective across various VLA backbones.