Vision–Language–Action · SO(2) Equivariance

EquiVLA: A General Framework for Rotationally Equivariant Vision–Language–Action Models

“Rotating the workspace should rotate the action — not require more data.”

Anonymous Authors
Under review
  • LIBERO average success: 92.6% (+14.5pp vs. GR00T N1.5)
  • CALVIN ABCD→D average sequence length: 4.03 (+0.58)
  • Mobile ALOHA real-robot success: 54% → 72% (+18pp)

Abstract

Vision–Language–Action (VLA) models have emerged as a powerful paradigm for generalist robot manipulation, yet they lack geometric inductive biases: policies trained at specific orientations require substantially more data to generalize across rotational configurations. We present EquiVLA, the first general framework for end-to-end $SO(2)$-equivariant VLA models, applicable to any architecture coupling a frozen vision–language backbone with a flow-matching Diffusion Transformer action head.

EquiVLA introduces two modules: EquiPerceptor, which produces approximately $SO(2)$-equivariant visual representations from frozen ViT features, and EquiActor, an exactly $SO(2)$-equivariant flow-matching Diffusion Transformer action head. Together, they establish an approximate $SO(2)$ equivariance chain from camera observations to predicted action sequences.

Instantiated on GR00T N1.5 and evaluated across four LIBERO suites, CALVIN ABCD→D, and five real-robot tasks on Mobile ALOHA, EquiVLA achieves 92.6% average success on LIBERO (vs. 78.1% baseline), an average sequence length of 4.03 on CALVIN (vs. 3.45), and improves real-robot success from 54% to 72%.

Method Overview

EquiVLA imposes $SO(2)$ equivariance on VLA architectures through two composable modules — with no modification to pretrained VLM weights.


Figure 2. The EquiVLA pipeline. An approximate $SO(2)$ equivariance chain is preserved from camera observations to predicted actions.
EquiPerceptor: Approximate $SO(2)$-Equivariant Visual Representations

EquiPerceptor extends Frame Averaging from globally pooled vectors to spatially indexed ViT patch-token sequences. Rotating the input image displaces patch tokens within the grid, so naively averaging features over rotated copies destroys spatial localization. For each rotation $h$ in the finite rotation group $G$, EquiPerceptor therefore applies the inverse group action jointly at two levels: a spatial permutation $\tau(h^{-1})$ that maps each displaced token back to its canonical patch position, and a feature-space transformation $\rho_{\text{reg}}(h^{-1})$ in the regular representation of $G$.

This yields two streams:

  • $z_{\text{eq}}$: equivariant token map, which transforms with the scene
  • $z_{\text{inv}}$: invariant token map, which is rotation-independent
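
The frame-averaging construction behind the equivariant stream can be illustrated with a minimal NumPy sketch. It assumes the cyclic group $C_4$ (exact 90° grid rotations), a toy position-dependent encoder standing in for the frozen ViT, and hypothetical helper names (`rot`, `act`, `toy_encoder`, `equi_perceptor`); none of these come from the authors' code. With exact grid rotations the averaged map is exactly equivariant, so the approximation error of the real system is not modelled here.

```python
import numpy as np

N = 4                      # order of the toy group C4 (assumption)
H = W = 4                  # square patch grid so 90-degree rotations are exact
C, K = 3, 2                # input channels per patch, number of regular-rep fields

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(C, K * N))      # toy "frozen" weights
pos = rng.normal(size=(H, W, K * N))     # position embedding breaks equivariance

def rot(x, k):
    """Spatial permutation tau(g^k): rotate the token grid by k * 90 degrees."""
    return np.rot90(x, k=k, axes=(0, 1))

def act(z, k):
    """Joint action on (H, W, K, N) tokens: tau(g^k) on the grid, rho_reg(g^k) on features."""
    return np.roll(rot(z, k), shift=k, axis=-1)

def toy_encoder(img):
    """Stand-in for the frozen ViT: per-patch map plus position embedding."""
    tokens = np.tanh(img @ W_enc + pos)
    return tokens.reshape(H, W, K, N)

def equi_perceptor(img):
    """Average over frames: encode h.x, then undo h on both grid and features."""
    frames = [act(toy_encoder(rot(img, k)), -k) for k in range(N)]
    return np.mean(frames, axis=0)

img = rng.normal(size=(H, W, C))
z_eq = equi_perceptor(img)
z_eq_rot = equi_perceptor(rot(img, 1))

# Raw encoder: rotating the input does not simply rotate the tokens.
print(np.allclose(toy_encoder(rot(img, 1)), act(toy_encoder(img), 1)))
# Frame-averaged map: rotating the input rotates the output by the joint action.
print(np.allclose(z_eq_rot, act(z_eq, 1)))
```

The first check is expected to fail for a generic encoder and the second to hold up to floating-point error, because re-indexing the sum over frames absorbs the extra rotation.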

$z_{\text{inv}}$ is fed into the frozen VLM alongside wrist-camera and language tokens to produce language-grounded context tokens $z_{\text{ctx}}$. An Equivariant Adapter then fuses $z_{\text{eq}}$ and $z_{\text{ctx}}$ via a learned invariant gate. Because every gate input is an invariant quantity, the fusion preserves equivariance.
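
Why an invariant gate preserves equivariance can be seen in a small NumPy sketch. The adapter's actual design is not specified here, so everything below is an assumption made for illustration: the gate weights `w_eq` and `w_ctx`, the use of per-field norms as invariant features, a single pooled context vector, and the function name `invariant_gate_fuse`.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K, N, D = 5, 3, 4, 6          # tokens, fields, group order, context width (illustrative)

w_eq = rng.normal(size=K)        # hypothetical gate weights on invariant features of z_eq
w_ctx = rng.normal(size=D)       # hypothetical gate weights on the context vector

def shift(z, steps):
    """rho_reg(g^steps): cyclic shift along the group axis."""
    return np.roll(z, steps, axis=-1)

def invariant_gate_fuse(z_eq, z_ctx):
    """Scale each equivariant token by a scalar gate built only from invariants."""
    # The norm over the group axis is unchanged by cyclic shifts.
    inv_feat = np.linalg.norm(z_eq, axis=-1)          # (T, K)
    logits = inv_feat @ w_eq + z_ctx @ w_ctx          # (T,)
    gate = 1.0 / (1.0 + np.exp(-logits))              # sigmoid of invariants is invariant
    return gate[:, None, None] * z_eq                 # (T, K, N)

z_eq = rng.normal(size=(T, K, N))
z_ctx = rng.normal(size=D)       # pooled VLM context, assumed rotation-invariant

out = invariant_gate_fuse(z_eq, z_ctx)
out_rot = invariant_gate_fuse(shift(z_eq, 1), z_ctx)
print(np.allclose(out_rot, shift(out, 1)))
```

Multiplying an equivariant feature by an invariant scalar commutes with the group action, which is the property the check exercises.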

EquiActor: Exactly $SO(2)$-Equivariant Flow-Matching DiT

EquiActor replaces the standard DiT action head with a fully $SO(2)$-equivariant counterpart built on steerable escnn layers in the regular feature space. All linear projections, attention Q/K/V matrices, state/action encoders, and the action decoder are $G$-steerable layers. Attention scores use the geometric inner product $\langle q, k\rangle = \sum_{g} q[g]\cdot k[g]$, which is $G$-invariant because $q$ and $k$ are permuted identically under the regular representation. The attention weights are therefore invariant, and combining them with equivariant values $V$ yields an equivariant output.
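
The invariance of the inner product can be checked numerically. The sketch below is a plain NumPy illustration, not the escnn-based implementation: it samples $q$, $k$, $v$ directly as regular-representation features of shape (tokens, fields, group order) instead of producing them with steerable layers, and the function names and sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K, N = 6, 2, 8                 # tokens, regular-rep fields, group order (illustrative)

def shift(x, steps):
    """rho_reg(g^steps): cyclic shift along the group axis."""
    return np.roll(x, steps, axis=-1)

def equivariant_attention(q, k, v):
    """Attention over (T, K, N) regular features with invariant scores."""
    # <q_i, k_j> summed over fields and group elements.
    scores = np.einsum("ifg,jfg->ij", q, k) / np.sqrt(K * N)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights, np.einsum("ij,jfg->ifg", weights, v)

q, k, v = (rng.normal(size=(T, K, N)) for _ in range(3))
w, out = equivariant_attention(q, k, v)
w_g, out_g = equivariant_attention(shift(q, 3), shift(k, 3), shift(v, 3))

print(np.allclose(w, w_g))                 # attention weights are invariant
print(np.allclose(out_g, shift(out, 3)))   # outputs shift with the inputs
```

Shifting $q$ and $k$ by the same group element only reorders the terms of the sum over $g$, so the scores, and hence the softmax weights, are unchanged.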

Action representation. End-effector position and orientation transform as $SO(2)$ vectors under scene rotation, while gripper width is rotation-invariant. The group action on the action vector decomposes into irreducible representations depending on the control mode:

  • Absolute control:  $\rho_1^{3} \oplus (\rho_1 \oplus \rho_0) \oplus \rho_0$ — where $\rho_1^{3}$ encodes the 6D end-effector rotation as three 2D vector pairs, $\rho_1 \oplus \rho_0$ encodes $xy$ translation and $z$ height, and $\rho_0$ encodes gripper width.
  • Relative control:  $\rho_0^{6} \oplus \rho_1^{4} \oplus \rho_2$ — where $\rho_0^{6}$ encodes invariant scalar offsets and $\rho_2$ captures the frequency-2 component from quadratic terms in the relative rotation decomposition.
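
The two decompositions above can be written out as block-diagonal matrices. The following sketch assumes the standard real irreps of $SO(2)$ (a 1×1 identity for $\rho_0$, a 2×2 rotation by $k\theta$ for $\rho_k$) and an illustrative component ordering for the absolute-control action vector; the ordering and the helper names `irrep` and `direct_sum` are assumptions, not taken from the paper.

```python
import numpy as np

def irrep(freq, theta):
    """Real irrep rho_freq of SO(2): 1x1 identity for freq 0, else a 2x2 rotation by freq*theta."""
    if freq == 0:
        return np.ones((1, 1))
    c, s = np.cos(freq * theta), np.sin(freq * theta)
    return np.array([[c, -s], [s, c]])

def direct_sum(freqs, theta):
    """Block-diagonal matrix for the direct sum of the listed irreps."""
    blocks = [irrep(f, theta) for f in freqs]
    dim = sum(b.shape[0] for b in blocks)
    rho = np.zeros((dim, dim))
    i = 0
    for b in blocks:
        n = b.shape[0]
        rho[i:i + n, i:i + n] = b
        i += n
    return rho

ABS_FREQS = [1, 1, 1] + [1, 0] + [0]        # rho_1^3 + (rho_1 + rho_0) + rho_0
REL_FREQS = [0] * 6 + [1] * 4 + [2]         # rho_0^6 + rho_1^4 + rho_2

theta = np.pi / 2
rho_abs = direct_sum(ABS_FREQS, theta)      # 10 x 10
rho_rel = direct_sum(REL_FREQS, theta)      # 16 x 16

# Assumed component order: 6D rotation (3 pairs), xy, z, gripper width.
action = np.array([1, 0, 0, 1, 0, 0, 0.3, 0.0, 0.5, 0.04])
rotated = rho_abs @ action
print(np.round(rotated[6:], 3))             # xy rotates; z and gripper width are unchanged
```

Under a 90° scene rotation the $xy$ translation turns with the scene while the height and gripper width pass through untouched, which is what the $\rho_1 \oplus \rho_0 \oplus \rho_0$ tail of the decomposition encodes.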

EquiActor is trained from scratch, as steerable layers are structurally incompatible with unconstrained pretrained weights. The equivariant inductive bias compensates for the loss of pretrained action-head initialization.

End-to-End Equivariance Guarantee

Together, EquiPerceptor and EquiActor establish an approximate $SO(2)$ equivariance chain:

$$\hat{a}_t\big(g \cdot o_t,\; \rho_s(g)\, s_t\big) \;\approx\; \rho_a(g) \cdot \hat{a}_t\big(o_t, s_t\big) \qquad \forall\, g \in C_u$$

EquiActor alone satisfies this relation exactly when paired with invariant VLM context tokens. The approximation error in the full system stems from EquiPerceptor's token-level discretization and is bounded formally in the paper.
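
One way to probe this relation empirically is to measure the worst-case gap between "rotate, then predict" and "predict, then rotate" over a cyclic group. The sketch below is a toy diagnostic under strong assumptions: observations are 2D keypoints instead of images, state and action are both planar vectors transforming under $\rho_1$, and both example policies are invented for illustration. It is not the formal bound from the paper.

```python
import numpy as np

def rot2(theta):
    """2x2 rotation matrix."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def equivariance_error(policy, obs, state, order):
    """Max over g in C_order of || policy(g.o, g.s) - g.policy(o, s) ||."""
    base = policy(obs, state)
    worst = 0.0
    for k in range(order):
        R = rot2(2 * np.pi * k / order)
        lhs = policy(obs @ R.T, R @ state)      # rotate inputs, then predict
        rhs = R @ base                          # predict, then rotate the output
        worst = max(worst, float(np.linalg.norm(lhs - rhs)))
    return worst

def exact_policy(obs, state):
    """Toy equivariant policy: move toward the centroid of the observed points."""
    return obs.mean(axis=0) - state

def biased_policy(obs, state):
    """Same policy plus a fixed world-frame offset, which breaks equivariance."""
    return exact_policy(obs, state) + np.array([0.1, 0.0])

rng = np.random.default_rng(0)
obs = rng.normal(size=(16, 2))      # toy observation: 2D keypoints
state = rng.normal(size=2)          # toy proprioceptive state: planar position

err_exact = equivariance_error(exact_policy, obs, state, order=8)
err_biased = equivariance_error(biased_policy, obs, state, order=8)
print(err_exact, err_biased)
```

The exact policy gives an error at floating-point level, while the fixed offset produces a gap that is largest at the 180° rotation, where the offset and its rotated copy point in opposite directions.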

Results

LIBERO Benchmark

4 suites (LIBERO-10, Goal, Object, Spatial) × 10 tasks × 50 rollouts; per-suite training; relative and absolute EEF control; replanning every timestep; averaged over 2 seeds.

| Method | Ctrl | LIBERO-10 | Goal | Object | Spatial | Avg ↑ |
|---|---|---|---|---|---|---|
| π₀ | Rel. | 73.0 | 93.0 | 86.0 | 90.0 | 86.0 |
| OpenVLA | Rel. | 55.0 | 79.2 | 88.4 | 84.7 | 76.8 |
| SmolVLA | Rel. | 61.0 | 61.4 | 66.0 | 74.0 | 65.6 |
| GR00T N1.5 (baseline) | Rel. | 72.0 | 75.0 | 83.4 | 82.0 | 78.1 |
| GR00T N1.5 + EquiActor | Rel. | 82.6 | 88.0 | 95.2 | 98.2 | 91.0 |
| EquiVLA (ours) | Rel. | 87.6 | 89.4 | 98.0 | 95.4 | 92.6 |
| GR00T N1.5 (baseline) | Abs. | 52.0 | 55.2 | 74.6 | 68.6 | 62.6 |
| GR00T N1.5 + EquiActor | Abs. | 63.0 | 70.0 | 79.4 | 82.0 | 73.6 |
| EquiVLA (ours) | Abs. | 73.6 | 70.4 | 83.0 | 77.6 | 76.1 |

Gains are cumulative: EquiActor alone accounts for most of the improvement (+12.9pp under relative control, +11.0pp under absolute control), and EquiPerceptor contributes a further +1.6pp (relative control) and +2.5pp (absolute control) on average.

CALVIN ABCD→D Benchmark

Single-frame observations (image + proprioception); trained on environments A, B, C, and D; evaluated on environment D; 1000 instruction chains of up to 5 sequential tasks.

| Method | T1 | T2 | T3 | T4 | T5 | Avg ↑ |
|---|---|---|---|---|---|---|
| HULC (multi-frame†) | 88.9 | 73.3 | 58.7 | 47.5 | 38.3 | 3.07 |
| MoDE (multi-frame†) | 97.1 | 92.5 | 87.9 | 83.5 | 77.9 | 4.39 |
| GR00T N1.5 (baseline) | 89.0 | 79.2 | 68.7 | 59.4 | 48.5 | 3.45 |
| GR00T N1.5 + EquiActor | 93.7 | 85.8 | 77.8 | 70.1 | 61.9 | 3.89 |
| EquiVLA (ours) | 95.0 | 88.5 | 81.1 | 73.8 | 64.3 | 4.03 |

† Multi-frame baselines use temporal history; not directly comparable.

Gains are largest at the tail of the chain — Task 5 improves from 48.5% to 64.3% (+15.8pp), suggesting equivariance mitigates error accumulation in long-horizon execution. EquiVLA approaches MoDE (4.39) despite using single-frame observations only.
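
The Avg column follows directly from the per-position success rates: T$i$ is the share of chains that complete at least $i$ tasks in a row, so the expected number of completed tasks is the sum of those shares. The short sketch below recomputes it from the table values; the dictionary and function names are illustrative.

```python
# Per-position success rates (%) copied from the CALVIN table above.
success = {
    "GR00T N1.5 (baseline)":  [89.0, 79.2, 68.7, 59.4, 48.5],
    "GR00T N1.5 + EquiActor": [93.7, 85.8, 77.8, 70.1, 61.9],
    "EquiVLA (ours)":         [95.0, 88.5, 81.1, 73.8, 64.3],
}

def avg_sequence_length(rates_percent):
    """Expected number of consecutive tasks completed per chain."""
    return sum(rates_percent) / 100.0

avg_len = {name: round(avg_sequence_length(r), 2) for name, r in success.items()}
for name, value in avg_len.items():
    print(f"{name}: {value:.2f}")
```

This reproduces the reported averages of 3.45, 3.89, and 4.03, and makes clear that improvements at later positions such as T5 contribute to the average just as much as improvements at T1.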

Real Robot — Mobile ALOHA

5 tabletop manipulation tasks; 150 teleoperated demonstrations each; 20 trials per task. Four tasks use the right arm only; Shorts Folding is bimanual.

| Task | GR00T N1.5 | EquiVLA (ours) | Δ |
|---|---|---|---|
| Banana in Pot | 12/20 (60%) | 15/20 (75%) | +15pp |
| Block Storing | 9/20 (45%) | 11/20 (55%) | +10pp |
| House Building | 3/20 (15%) | 10/20 (50%) | +35pp |
| Letter Aligning | 13/20 (65%) | 19/20 (95%) | +30pp |
| Shorts Folding | 17/20 (85%) | 17/20 (85%) | 0pp |
| Average | 54% | 72% | +18pp |

  • Largest gains on orientation-variant tasks: House Building (+35pp, triangular block grasped at varied angles) and Letter Aligning (+30pp, I block at arbitrary orientations).
  • No observed cost when rotational symmetry is absent: Shorts Folding succeeds in 17/20 trials (85%) for both models, suggesting the equivariant inductive bias carries no performance penalty on this task.
[Figure panels, EquiVLA vs. GR00T N1.5 success: Banana in Pot (75% vs. 60%); Block Storing (55% vs. 45%); House Building (50% vs. 15%); Letter Aligning (95% vs. 65%); Shorts Folding (85% vs. 85%).]
Figure 5. Real-robot tasks on Mobile ALOHA. For each task the first and last frames show the initial and goal states. Success rates: EquiVLA (blue) vs. GR00T N1.5 over 20 trials.

Per-Task Videos

Banana in Pot
Block Storing
House Building
Letter Aligning
Shorts Folding

Video

Video. Overview of EquiVLA — framework motivation, architecture, and real-robot demonstrations across all five tasks.