ACG: Action Coherence Guidance for Flow-based VLA models

Comparison between vanilla GR00T-N1-2B and ACG (Ours) in real-world strawberry picking with SO-101.
The vanilla GR00T-N1-2B often exhibits severe jittering, occasionally striking and displacing the target object.
In contrast, ACG maintains smooth and stable motions, successfully grasping the strawberry without disturbance.

Abstract

Diffusion and flow-matching models have emerged as powerful robot policies, enabling Vision-Language-Action (VLA) models to generalize across diverse scenes and instructions. Yet, when trained via imitation learning, their high generative capacity makes them sensitive to noise in human demonstrations: jerks, pauses, and jitter, all of which reduce action coherence. Reduced action coherence causes instability and trajectory drift during deployment, failures that are catastrophic in fine-grained manipulation where precision is crucial. In this paper, we present Action Coherence Guidance (ACG) for VLA models, a training-free test-time guidance algorithm that improves action coherence and thereby yields performance gains. Evaluated on RoboCasa, DexMimicGen, and real-world SO-101 tasks, ACG consistently improves action coherence and boosts success rates across diverse manipulation tasks.

ACG: Guiding Policy with Incoherent Action Generation

ACG constructs an incoherent vector field and extrapolates the original vector field away from it, which steers sampling toward coherent action sequences.
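The extrapolation step can be illustrated with a minimal sketch. It assumes the combination takes the familiar classifier-free-guidance-style form `v + w * (v - v_incoherent)`; that exact form, the function name `guided_velocity`, and the guidance scale `w` are illustrative assumptions rather than details stated here.

```python
import numpy as np

def guided_velocity(v_orig, v_incoherent, w=1.0):
    """Push the original vector field away from the incoherent one.

    Assumed form (not confirmed by the text): v + w * (v - v_incoherent).
    With w = 0 this reduces to the original, unguided vector field.
    """
    v_orig = np.asarray(v_orig, dtype=float)
    v_incoherent = np.asarray(v_incoherent, dtype=float)
    return v_orig + w * (v_orig - v_incoherent)

# Toy action chunk: 4 timesteps x 2 action dimensions (hypothetical sizes).
v = np.ones((4, 2))
v_ic = np.zeros((4, 2))
v_guided = guided_velocity(v, v_ic, w=0.5)
```

In a flow-matching sampler, a vector of this kind would take the place of the model's raw output at each integration step, so no retraining is involved.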

Method overview — **Starting from the original model architecture (left)**, we modify self-attention by replacing the attention map with the identity map to generate an incoherent action sequence (right). The coherent denoising vector is then guided using the opposite direction of the incoherent denoising vector (middle).
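To make the attention modification concrete, here is a minimal single-head sketch in NumPy. The function `self_attention`, the flag `identity_map`, the weight matrices, and the toy dimensions are all hypothetical; the actual model's attention layers (multiple heads, batching, conditioning) are not described here and will differ.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v, identity_map=False):
    """Single-head self-attention over a sequence of action tokens.

    x: (T, d) array, one row per action timestep.
    identity_map=True replaces the softmax attention map with the
    identity matrix, so each token attends only to itself.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    if identity_map:
        attn = np.eye(x.shape[0])
    else:
        scores = q @ k.T / np.sqrt(q.shape[-1])
        scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
        attn = np.exp(scores)
        attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v

rng = np.random.default_rng(0)
T, d = 5, 8  # hypothetical sequence length and feature width
x = rng.normal(size=(T, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

coherent = self_attention(x, w_q, w_k, w_v)
incoherent = self_attention(x, w_q, w_k, w_v, identity_map=True)
```

With the identity map, the output is just the value projection of each token, so no information is exchanged across timesteps. That missing temporal communication is what makes the resulting action sequence incoherent.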

Qualitative Comparison

Visual comparison of different methods on real-world strawberry-picking tasks.
ACG produces smoother and more coherent motions than baselines, resulting in higher success rates.