Vision–Language–Action · SO(2) Equivariance

EquiVLA: A General Framework for Rotationally Equivariant Vision–Language–Action Models

“Rotating the workspace should rotate the action — not require more data.”

Anonymous Authors

Under review


92.6%: LIBERO avg success (+14.5pp) vs. GR00T N1.5

4.03: CALVIN ABCD→D avg seq. length (+0.58)

54→72%: Mobile ALOHA real-robot success (+18pp)

Abstract

Vision–Language–Action (VLA) models have emerged as a powerful paradigm for generalist robot manipulation, yet they lack geometric inductive biases: policies trained at specific orientations require substantially more data to generalize across rotational configurations. We present EquiVLA, the first general framework for end-to-end $SO(2)$-equivariant VLA models, applicable to any architecture coupling a frozen vision–language backbone with a flow-matching Diffusion Transformer action head.

EquiVLA introduces EquiPerceptor, which produces approximately $SO(2)$-equivariant visual representations from frozen ViT features; and EquiActor, an exactly $SO(2)$-equivariant flow-matching Diffusion Transformer action head. Together, they establish an approximate $SO(2)$ equivariance chain from camera observations to predicted action sequences.

Instantiated on GR00T N1.5 and evaluated across four LIBERO suites, CALVIN ABCD→D, and five real-robot tasks on Mobile ALOHA, EquiVLA achieves 92.6% average success on LIBERO (vs. 78.1% baseline), an average sequence length of 4.03 on CALVIN (vs. 3.45), and improves real-robot success from 54% to 72%.

Method Overview

EquiVLA imposes $SO(2)$ equivariance on VLA architectures through two composable modules — with no modification to pretrained VLM weights.

Figure 2. The EquiVLA pipeline. An approximate $SO(2)$ equivariance chain is preserved from camera observations to predicted actions.

EquiPerceptor: Approximate $SO(2)$-Equivariant Visual Representations

EquiPerceptor extends Frame Averaging from globally pooled vectors to spatially indexed ViT patch-token sequences. When the input image is rotated, patch tokens are displaced in the grid, so naive averaging across rotated views destroys spatial localization. EquiPerceptor instead applies the inverse group action jointly at two levels: a spatial permutation $\tau(h^{-1})$ that maps each displaced token back to its canonical patch position, and a feature-space transformation $\rho_{\text{reg}}(h^{-1})$ in the regular representation of the rotation group $G$ (the cyclic group $C_u$ in the guarantee below).

This yields two streams:

$z^{\text{eq}}$: equivariant token map, which transforms with the scene
$z^{\text{inv}}$: invariant token map, which is rotation-independent
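The lifting step behind the equivariant stream can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes $G = C_4$ (so every rotation maps the patch grid onto itself exactly), and `toy_backbone`, `lift`, and `rotate_lifted` are hypothetical names, with `toy_backbone` standing in for the frozen ViT. Each rotated view is encoded, its token grid is rotated back (the spatial permutation), and the results are stacked along a new group axis, on which a rotation of the input acts as a cyclic shift (the regular representation).

```python
import numpy as np

N = 4  # assumption: G = C_4, so every rotation maps the patch grid onto itself

def toy_backbone(img, weights):
    """Hypothetical stand-in for a frozen ViT: one token per 2x2 pixel patch."""
    h = img.shape[0] // 2
    patches = img.reshape(h, 2, h, 2).transpose(0, 2, 1, 3).reshape(h, h, 4)
    return np.tanh(patches @ weights)  # token grid of shape (h, h, D)

def lift(img, weights):
    """Stack tau(g^-1) f(g . img) over g; the new leading axis is the group axis."""
    return np.stack([
        np.rot90(toy_backbone(np.rot90(img, g), weights), -g, axes=(0, 1))
        for g in range(N)
    ])  # shape (N, h, h, D)

def rotate_lifted(z, k):
    """Expected change of the lifted map when the image is rotated by k * 90 degrees:
    tokens move on the grid and the group axis is cyclically permuted."""
    return np.rot90(np.roll(z, -k, axis=0), k, axes=(1, 2))

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))
weights = rng.normal(size=(4, 16))

z = lift(image, weights)
for k in range(N):
    # Rotating the input is equivalent to rotating the lifted token map.
    assert np.allclose(lift(np.rot90(image, k), weights), rotate_lifted(z, k))

# The raw backbone alone does not satisfy the analogous identity.
raw_gap = np.abs(
    toy_backbone(np.rot90(image), weights)
    - np.rot90(toy_backbone(image, weights), axes=(0, 1))
).max()
```

With rotation angles that do not map the grid onto itself, the rotate-back step needs resampling, which is one way an exact identity like the one checked here becomes approximate.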

$z^{\text{inv}}$ is fed into the frozen VLM alongside wrist-camera and language tokens to produce language-grounded context tokens $z^{\text{ctx}}$. An Equivariant Adapter then fuses $z^{\text{eq}}$ and $z^{\text{ctx}}$ via a learned invariant gate. All gate inputs are restricted to invariant quantities, so the fusion preserves equivariance.

EquiActor: Exactly $SO(2)$-Equivariant Flow-Matching DiT

EquiActor replaces the standard DiT action head with a fully $SO(2)$-equivariant counterpart built on steerable escnn layers in the regular feature space. All linear projections, attention Q/K/V matrices, state/action encoders, and the action decoder are $G$-steerable layers. Equivariant attention uses the inner product $\langle q, k\rangle = \sum_{g} q[g]\cdot k[g]$. Because the regular representation only permutes the group index $g$, this sum is unchanged when $q$ and $k$ are transformed by the same group element, so the attention weights are $G$-invariant and the output inherits the equivariance of $V$.
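The invariance of the attention scores can be checked numerically. The sketch below is plain NumPy rather than escnn, and it assumes the queries, keys, and values are already regular-representation features of shape (tokens, channels, $|G|$) with $|G| = 8$. The names `regular_attention` and `act` are hypothetical.

```python
import numpy as np

def regular_attention(q, k, v):
    """Single-head attention over tokens whose features have shape (channels, |G|)."""
    # <q, k> sums over channels and over the group axis.
    scores = np.einsum("tcg,scg->ts", q, k) / np.sqrt(q.shape[1] * q.shape[2])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return np.einsum("ts,scg->tcg", weights, v)

def act(x, h):
    """Regular-representation action of the h-th rotation: cyclic shift of the group axis."""
    return np.roll(x, h, axis=-1)

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(5, 3, 8)) for _ in range(3))  # 5 tokens, 3 channels, |G| = 8
out = regular_attention(q, k, v)

for h in range(8):
    # Transforming q, k, v together transforms the output in the same way.
    assert np.allclose(regular_attention(act(q, h), act(k, h), act(v, h)), act(out, h))
```

The scores never see the shift because it reorders the terms of the sum identically for $q$ and $k$; only the values carry the transformation through to the output.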

Action representation. End-effector position and orientation transform as $SO(2)$ vectors under scene rotation, while gripper width is rotation-invariant. The group action on the action vector decomposes into irreducible representations depending on the control mode:

Absolute control: $\rho_1^{3} \oplus (\rho_1 \oplus \rho_0) \oplus \rho_0$ — where $\rho_1^{3}$ encodes the 6D end-effector rotation as three 2D vector pairs, $\rho_1 \oplus \rho_0$ encodes $xy$ translation and $z$ height, and $\rho_0$ encodes gripper width.
Relative control: $\rho_0^{6} \oplus \rho_1^{4} \oplus \rho_2$ — where $\rho_0^{6}$ encodes invariant scalar offsets and $\rho_2$ captures the frequency-2 component from quadratic terms in the relative rotation decomposition.
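As an illustration of the absolute-control decomposition, the sketch below builds the corresponding block-diagonal matrix for a rotation by angle `theta`. The component ordering (six rotation entries as three 2D pairs, then $xy$, then $z$, then gripper width) is an assumption made for this example, and `rot2` and `rho_abs` are hypothetical helper names.

```python
import numpy as np

def rot2(theta):
    """2D rotation matrix, i.e. the frequency-1 irrep rho_1."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def rho_abs(theta):
    """Block-diagonal rho_1^3 (+) (rho_1 (+) rho_0) (+) rho_0 acting on a 10-D action."""
    # Four rho_1 blocks (three for rotation, one for xy), two trivial blocks (z, gripper).
    blocks = [rot2(theta)] * 4 + [np.eye(1)] * 2
    rho = np.zeros((10, 10))
    start = 0
    for block in blocks:
        size = block.shape[0]
        rho[start:start + size, start:start + size] = block
        start += size
    return rho

action = np.arange(1.0, 11.0)           # placeholder action vector
rotated = rho_abs(np.pi / 2) @ action   # scene rotated by 90 degrees

# z height and gripper width (last two entries) are unchanged; each 2D pair rotates.
assert np.allclose(rotated[8:], action[8:])
assert np.allclose(rotated[:2], [-2.0, 1.0])
# rho_abs is a group homomorphism: composing rotations adds angles.
assert np.allclose(rho_abs(0.3) @ rho_abs(0.5), rho_abs(0.8))
```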

EquiActor is trained from scratch, as steerable layers are structurally incompatible with unconstrained pretrained weights. The equivariant inductive bias compensates for the loss of pretrained action-head initialization.

End-to-End Equivariance Guarantee

Together, EquiPerceptor and EquiActor establish an approximate $SO(2)$ equivariance chain:

$$\hat{a}_t\big(g \cdot o_t,\; \rho_s(g)\, s_t\big) \;\approx\; \rho_a(g) \cdot \hat{a}_t\big(o_t, s_t\big) \qquad \forall\, g \in C_u$$

EquiActor alone satisfies this relation exactly when paired with invariant VLM context tokens. The approximation error in the full system stems from EquiPerceptor's token-level discretization and is bounded formally in the paper.
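One way to quantify how closely a policy satisfies this relation is to measure the relative gap between its two sides. The sketch below does so for $C_4$ with a toy setup: `toy_policy` (an intensity centroid plus a state term) is a hypothetical stand-in for the full model, the action and state are both 2D vectors, and `equivariance_error` is an illustrative metric, not the bound derived in the paper.

```python
import numpy as np

def rot2(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def toy_policy(obs, state):
    """Hypothetical equivariant policy: intensity centroid of the image plus a state term."""
    n = obs.shape[0]
    offsets = np.arange(n) - (n - 1) / 2
    rows, cols = np.meshgrid(offsets, offsets, indexing="ij")
    centroid = np.array([(obs * rows).sum(), (obs * cols).sum()]) / obs.sum()
    return centroid + 0.5 * state

def equivariance_error(policy, obs, state, k):
    """Relative gap between the two sides of the relation for a rotation by k * 90 degrees."""
    # In (row, col) offsets from the image centre, np.rot90(obs, k) acts as this rotation.
    rho = rot2(k * np.pi / 2)
    lhs = policy(np.rot90(obs, k), rho @ state)
    rhs = rho @ policy(obs, state)
    return np.linalg.norm(lhs - rhs) / np.linalg.norm(rhs)

def biased_policy(obs, state):
    """Same policy with a fixed offset, which breaks the symmetry."""
    return toy_policy(obs, state) + np.array([0.3, 0.0])

rng = np.random.default_rng(0)
obs = rng.uniform(0.1, 1.0, size=(9, 9))
state = rng.normal(size=2)

exact_errors = [equivariance_error(toy_policy, obs, state, k) for k in range(4)]
biased_errors = [equivariance_error(biased_policy, obs, state, k) for k in range(4)]
```

Here `exact_errors` stays at floating-point noise, while `biased_errors` is nonzero for every non-identity rotation. An approximately equivariant model would sit between these two cases.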

Results

LIBERO Benchmark

4 suites (LIBERO-10, Goal, Object, Spatial) × 10 tasks × 50 rollouts; per-suite training; relative and absolute EEF control; replanning every timestep; averaged over 2 seeds.

Method	Ctrl	LIBERO-10	Goal	Object	Spatial	Avg ↑
π₀	Rel.	73.0	93.0	86.0	90.0	86.0
OpenVLA	Rel.	55.0	79.2	88.4	84.7	76.8
SmolVLA	Rel.	61.0	61.4	66.0	74.0	65.6
GR00T N1.5 (baseline)	Rel.	72.0	75.0	83.4	82.0	78.1
GR00T N1.5 + EquiActor	Rel.	82.6	88.0	95.2	98.2	91.0
EquiVLA (ours)	Rel.	87.6	89.4	98.0	95.4	92.6
GR00T N1.5 (baseline)	Abs.	52.0	55.2	74.6	68.6	62.6
GR00T N1.5 + EquiActor	Abs.	63.0	70.0	79.4	82.0	73.6
EquiVLA (ours)	Abs.	73.6	70.4	83.0	77.6	76.1

Gains are progressive: under relative control, EquiActor alone accounts for most of the improvement (+12.9pp, from 78.1% to 91.0%) and EquiPerceptor contributes a further +1.6pp; under absolute control, the corresponding gains are +11.0pp and +2.5pp.
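The arithmetic behind these gains can be reproduced from the table. The short script below uses only numbers copied from the GR00T N1.5 and EquiVLA rows; it checks that each reported average is the unweighted mean of the four suites (up to rounding) and computes the stepwise gains from the reported averages. The helper name `stepwise_gains` is hypothetical.

```python
# Per-suite success rates (%) from the table: LIBERO-10, Goal, Object, Spatial.
suites = {
    ("GR00T N1.5", "Rel."): [72.0, 75.0, 83.4, 82.0],
    ("GR00T N1.5 + EquiActor", "Rel."): [82.6, 88.0, 95.2, 98.2],
    ("EquiVLA", "Rel."): [87.6, 89.4, 98.0, 95.4],
    ("GR00T N1.5", "Abs."): [52.0, 55.2, 74.6, 68.6],
    ("GR00T N1.5 + EquiActor", "Abs."): [63.0, 70.0, 79.4, 82.0],
    ("EquiVLA", "Abs."): [73.6, 70.4, 83.0, 77.6],
}
# Averages as reported in the "Avg" column.
reported_avg = {
    ("GR00T N1.5", "Rel."): 78.1,
    ("GR00T N1.5 + EquiActor", "Rel."): 91.0,
    ("EquiVLA", "Rel."): 92.6,
    ("GR00T N1.5", "Abs."): 62.6,
    ("GR00T N1.5 + EquiActor", "Abs."): 73.6,
    ("EquiVLA", "Abs."): 76.1,
}

# Largest gap between the unweighted suite mean and the reported average.
max_deviation = max(
    abs(sum(vals) / len(vals) - reported_avg[key]) for key, vals in suites.items()
)

def stepwise_gains(ctrl):
    """(gain from EquiActor, further gain from EquiPerceptor) in percentage points."""
    base = reported_avg[("GR00T N1.5", ctrl)]
    actor = reported_avg[("GR00T N1.5 + EquiActor", ctrl)]
    full = reported_avg[("EquiVLA", ctrl)]
    return round(actor - base, 1), round(full - actor, 1)

print(stepwise_gains("Rel."), stepwise_gains("Abs."), round(max_deviation, 2))
```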

CALVIN ABCD→D Benchmark

Single-frame observations (image + proprioception); trained on environments A, B, C, and D; evaluated in environment D; 1000 instruction chains of up to 5 sequential tasks.

Method	T1	T2	T3	T4	T5	Avg ↑
HULC (multi-frame†)	88.9	73.3	58.7	47.5	38.3	3.07
MoDE (multi-frame†)	97.1	92.5	87.9	83.5	77.9	4.39
GR00T N1.5 (baseline)	89.0	79.2	68.7	59.4	48.5	3.45
GR00T N1.5 + EquiActor	93.7	85.8	77.8	70.1	61.9	3.89
EquiVLA (ours)	95.0	88.5	81.1	73.8	64.3	4.03

† Multi-frame baselines use temporal history; not directly comparable.

Gains are largest at the tail of the chain — Task 5 improves from 48.5% to 64.3% (+15.8pp), suggesting equivariance mitigates error accumulation in long-horizon execution. EquiVLA approaches MoDE (4.39) despite using single-frame observations only.
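The relationship between the per-step columns and the average sequence length can be made explicit. The sketch below assumes T1 to T5 are the usual CALVIN "first k tasks completed in a row" success rates, in which case the expected number of completed tasks is their sum divided by 100. The numbers are copied from the table and the variable names are hypothetical.

```python
# Success rates (%) for completing the first k tasks of a chain, k = 1..5.
rates = {
    "GR00T N1.5": [89.0, 79.2, 68.7, 59.4, 48.5],
    "GR00T N1.5 + EquiActor": [93.7, 85.8, 77.8, 70.1, 61.9],
    "EquiVLA": [95.0, 88.5, 81.1, 73.8, 64.3],
}

# Expected chain length = sum over k of P(at least k tasks completed).
avg_len = {name: round(sum(r) / 100, 2) for name, r in rates.items()}

# Per-step gain of EquiVLA over the baseline, in percentage points.
step_gains = [
    round(ours - base, 1)
    for ours, base in zip(rates["EquiVLA"], rates["GR00T N1.5"])
]

print(avg_len)
print(step_gains)
```

Under this reading the computed lengths match the reported Avg column, and the per-step gains grow monotonically along the chain.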

Real Robot — Mobile ALOHA

5 tabletop manipulation tasks; 150 teleoperated demonstrations each; 20 trials per task. Four tasks use the right arm only; Shorts Folding is bimanual.

Task	GR00T N1.5	EquiVLA (ours)	Δ
Banana in Pot	12/20 (60%)	15/20 (75%)	+15pp
Block Storing	9/20 (45%)	11/20 (55%)	+10pp
House Building	3/20 (15%)	10/20 (50%)	+35pp
Letter Aligning	13/20 (65%)	19/20 (95%)	+30pp
Shorts Folding	17/20 (85%)	17/20 (85%)	0pp
Average	54%	72%	+18pp

Largest gains on orientation-variant tasks: House Building (+35pp, triangular block grasped at varied angles), Letter Aligning (+30pp, I block at arbitrary orientations).
No measured cost when rotational symmetry is absent: Shorts Folding succeeds in 17/20 trials (85%) for both methods, suggesting that the equivariant inductive bias carries no performance penalty on this task.
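The per-task deltas and the averages follow directly from the raw success counts. The snippet below recomputes them from the table (20 trials per task); the variable names are hypothetical.

```python
TRIALS = 20
# Successes out of 20 trials: (GR00T N1.5, EquiVLA), copied from the table.
successes = {
    "Banana in Pot": (12, 15),
    "Block Storing": (9, 11),
    "House Building": (3, 10),
    "Letter Aligning": (13, 19),
    "Shorts Folding": (17, 17),
}

# Per-task improvement in percentage points.
delta_pp = {task: 100 * (ours - base) / TRIALS for task, (base, ours) in successes.items()}

# Every task has the same number of trials, so the mean of per-task rates
# equals the pooled rate over all 100 trials.
total = TRIALS * len(successes)
baseline_avg = 100 * sum(base for base, _ in successes.values()) / total
equivla_avg = 100 * sum(ours for _, ours in successes.values()) / total

print(delta_pp, baseline_avg, equivla_avg)
```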