“Rotating the workspace should rotate the action — not require more data.”
Vision–Language–Action (VLA) models have emerged as a powerful paradigm for generalist robot manipulation, yet they lack geometric inductive biases: policies trained at specific orientations require substantially more data to generalize across rotational configurations. We present EquiVLA, the first general framework for end-to-end $SO(2)$-equivariant VLA models, applicable to any architecture coupling a frozen vision–language backbone with a flow-matching Diffusion Transformer action head.
EquiVLA introduces two modules: EquiPerceptor, which produces approximately $SO(2)$-equivariant visual representations from frozen ViT features, and EquiActor, an exactly $SO(2)$-equivariant flow-matching Diffusion Transformer action head. Together, they establish an approximate $SO(2)$ equivariance chain from camera observations to predicted action sequences.
Instantiated on GR00T N1.5 and evaluated across four LIBERO suites, CALVIN ABCD→D, and five real-robot tasks on Mobile ALOHA, EquiVLA achieves 92.6% average success on LIBERO (vs. 78.1% baseline), an average sequence length of 4.03 on CALVIN (vs. 3.45), and improves real-robot success from 54% to 72%.
EquiVLA imposes $SO(2)$ equivariance on VLA architectures through two composable modules — with no modification to pretrained VLM weights.
EquiPerceptor extends Frame Averaging from globally pooled vectors to spatially-indexed ViT patch token sequences. When the input image is rotated, patch tokens are displaced in the grid, so naive averaging destroys spatial localization. EquiPerceptor applies the inverse group action jointly at two levels: a spatial permutation $\tau(h^{-1})$ that maps each displaced token back to its canonical patch position, and a feature-space transformation $\rho_{\text{reg}}(h^{-1})$ in the regular representation of $G$.
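The two-level inverse action can be illustrated with a small NumPy sketch. It rests on several assumptions that are not taken from the paper: the group is $C_4$ (so rotations permute a square patch grid exactly), `toy_encoder` is a hypothetical stand-in for the frozen ViT, and `lift` is an illustrative name. Each rotated copy of the image is encoded, its token grid is rotated back to canonical patch positions (the spatial permutation), and the copies are stacked along a group axis on which a rotation of the input acts by a cyclic shift (the regular representation).

```python
import numpy as np

N = 4  # assumption: C4, so rotations permute a square patch grid exactly


def toy_encoder(img, W, patch=4):
    """Hypothetical stand-in for a frozen ViT: one feature vector per patch."""
    n = img.shape[0] // patch
    # (n, patch, n, patch, C) -> (n, n, patch * patch * C)
    p = img.reshape(n, patch, n, patch, -1).transpose(0, 2, 1, 3, 4).reshape(n, n, -1)
    return np.tanh(p @ W)


def lift(img, W):
    """Encode every rotated copy, un-rotate its token grid, stack: (N, n, n, D)."""
    out = []
    for g in range(N):
        tokens = toy_encoder(np.rot90(img, g, axes=(0, 1)), W)
        out.append(np.rot90(tokens, -g, axes=(0, 1)))  # spatial inverse action
    return np.stack(out)


rng = np.random.default_rng(0)
img = rng.normal(size=(16, 16, 3))
W = rng.normal(size=(4 * 4 * 3, 8))

# The raw encoder does not commute with rotation ...
naive_error = float(np.abs(
    toy_encoder(np.rot90(img, 1, axes=(0, 1)), W)
    - np.rot90(toy_encoder(img, W), 1, axes=(0, 1))
).max())

# ... but the lifted features do: rotating the input rotates the token grid
# and cyclically shifts the group axis.
F = lift(img, W)
F_rot = lift(np.rot90(img, 1, axes=(0, 1)), W)
expected = np.rot90(np.roll(F, -1, axis=0), 1, axes=(1, 2))
lift_error = float(np.abs(F_rot - expected).max())
```

In this toy setting `lift_error` is at floating-point level while `naive_error` is not. With a real ViT and rotation angles that do not map the patch grid onto itself, the property would hold only approximately.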
This yields two streams:
$z_{\text{inv}}$ is fed into the frozen VLM alongside wrist-camera and language tokens to produce language-grounded context tokens $z_{\text{ctx}}$. An Equivariant Adapter then fuses $z_{\text{eq}}$ and $z_{\text{ctx}}$ via a learned invariant gate. All gate inputs are restricted to invariant quantities, which preserves equivariance throughout.
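Why an invariant gate preserves equivariance can be seen in a few lines. This is a minimal sketch, not the paper's adapter: the function name, the weight vectors, the group-axis mean used as the invariant summary, and the scalar sigmoid gate are all assumptions made for illustration. The equivariant features are stored as `(fields, |G|)` arrays on which the group acts by a cyclic shift of the last axis.

```python
import numpy as np


def invariant_gate_fuse(z_eq, z_ctx, w_eq, w_ctx):
    """Scale equivariant features by a gate computed from invariant inputs only.

    z_eq:  (fields, |G|) regular-representation features (hypothetical layout)
    z_ctx: (d,) invariant context vector
    """
    pooled = z_eq.mean(axis=-1)  # mean over the group axis ignores cyclic shifts
    gate = 1.0 / (1.0 + np.exp(-(pooled @ w_eq + z_ctx @ w_ctx)))  # scalar in (0, 1)
    return gate * z_eq


rng = np.random.default_rng(0)
z_eq, z_ctx = rng.normal(size=(6, 8)), rng.normal(size=16)
w_eq, w_ctx = rng.normal(size=6), rng.normal(size=16)

fused = invariant_gate_fuse(z_eq, z_ctx, w_eq, w_ctx)
fused_shifted = invariant_gate_fuse(np.roll(z_eq, 3, axis=-1), z_ctx, w_eq, w_ctx)

# Acting on the input and then fusing equals fusing and then acting.
gate_error = float(np.abs(fused_shifted - np.roll(fused, 3, axis=-1)).max())
```

Because the gate depends only on quantities that do not change under the group action, multiplying by it commutes with the action, and `gate_error` stays at floating-point level.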
EquiActor replaces the standard DiT action head with a fully $SO(2)$-equivariant counterpart built on steerable escnn layers in the regular feature space. All linear projections, attention Q/K/V matrices, state/action encoders, and the action decoder are $G$-steerable layers. Equivariant attention is achieved via the geometric inner product $\langle q, k\rangle = \sum_{g} q[g]\cdot k[g]$, which is $G$-invariant, yielding equivariant output when combined with equivariant $V$.
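The invariance of the inner product, and the resulting equivariance of attention, can be checked numerically. The sketch below uses plain NumPy rather than escnn and omits the steerable Q/K/V projections. The `(tokens, fields, |G|)` feature layout, the function names, and the choice $|G| = 8$ are assumptions for illustration. The group acts on regular features by a cyclic shift of the last axis.

```python
import numpy as np


def regular_attention(q, k, v):
    """Softmax attention with scores summed over fields and group channels."""
    scores = np.einsum("tcg,scg->ts", q, k) / np.sqrt(q.shape[1] * q.shape[2])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return np.einsum("ts,scg->tcg", weights, v)


def act(x, shift):
    """Regular-representation action: cyclic shift along the group axis."""
    return np.roll(x, shift, axis=-1)


rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(5, 3, 8)) for _ in range(3))

out = regular_attention(q, k, v)
out_shifted = regular_attention(act(q, 2), act(k, 2), act(v, 2))

# Scores are unchanged by the shift, so the output shifts exactly like V does.
attention_error = float(np.abs(out_shifted - act(out, 2)).max())
```

Shifting $q$ and $k$ together only reorders the terms of $\sum_g q[g]\cdot k[g]$, so the attention weights are unchanged and the weighted sum of shifted values equals the shifted weighted sum.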
Action representation. End-effector position and orientation transform as $SO(2)$ vectors under scene rotation, while gripper width is rotation-invariant. The group action on the action vector decomposes into irreducible representations depending on the control mode:
EquiActor is trained from scratch, as steerable layers are structurally incompatible with unconstrained pretrained weights. The equivariant inductive bias compensates for the loss of pretrained action-head initialization.
Together, EquiPerceptor and EquiActor establish an approximate $SO(2)$ equivariance chain:
$$\hat{a}_t\big(g \cdot o_t,\; \rho_s(g)\, s_t\big) \;\approx\; \rho_a(g) \cdot \hat{a}_t\big(o_t, s_t\big) \qquad \forall\, g \in C_u$$
EquiActor alone satisfies this relation exactly when paired with invariant VLM context tokens. The approximation error in the full system stems from EquiPerceptor's token-level discretization and is bounded formally in the paper.
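The equivariance relation suggests a simple empirical check: rotate the inputs, run the policy, and compare against the rotated output. The sketch below is a toy under stated assumptions, none of which come from the paper: the observation is reduced to a 2D object position instead of a camera image, the action layout is `[dx, dy, dz, gripper]`, the group has 8 rotations, and both policies are hand-written illustrations.

```python
import numpy as np


def rot2(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])


def rho_a(theta):
    """Assumed action layout [dx, dy, dz, gripper]: rotate (dx, dy), keep the rest."""
    m = np.eye(4)
    m[:2, :2] = rot2(theta)
    return m


def equivariant_policy(obj_xy, ee_xy):
    """Toy policy built from the relative vector and its (invariant) length."""
    delta = obj_xy - ee_xy
    dist = np.linalg.norm(delta)
    return np.array([*(0.5 * delta), -0.1 * dist, float(dist < 0.05)])


def biased_policy(obj_xy, ee_xy):
    """Same policy plus a fixed world-frame offset, which breaks equivariance."""
    return equivariant_policy(obj_xy, ee_xy) + np.array([0.1, 0.0, 0.0, 0.0])


def equivariance_error(policy, obj_xy, ee_xy, n_rot=8):
    """Largest deviation between policy(g.o, g.s) and rho_a(g) policy(o, s)."""
    base = policy(obj_xy, ee_xy)
    errs = []
    for i in range(n_rot):
        theta = 2 * np.pi * i / n_rot
        R = rot2(theta)
        lhs = policy(R @ obj_xy, R @ ee_xy)
        rhs = rho_a(theta) @ base
        errs.append(np.abs(lhs - rhs).max())
    return float(max(errs))


obj, ee = np.array([0.4, 0.1]), np.array([0.1, -0.2])
err_equivariant = equivariance_error(equivariant_policy, obj, ee)
err_biased = equivariance_error(biased_policy, obj, ee)
```

The first policy satisfies the relation up to floating-point error, while the biased one does not. Applying the same measurement to a learned policy would give an empirical estimate of the approximation gap on the sampled inputs.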
4 suites (LIBERO-10, Goal, Object, Spatial) × 10 tasks × 50 rollouts; per-suite training; relative and absolute EEF control; replanning every timestep; averaged over 2 seeds.
| Method | Ctrl | LIBERO-10 | Goal | Object | Spatial | Avg ↑ |
|---|---|---|---|---|---|---|
| π₀ | Rel. | 73.0 | 93.0 | 86.0 | 90.0 | 86.0 |
| OpenVLA | Rel. | 55.0 | 79.2 | 88.4 | 84.7 | 76.8 |
| SmolVLA | Rel. | 61.0 | 61.4 | 66.0 | 74.0 | 65.6 |
| GR00T N1.5 (baseline) | Rel. | 72.0 | 75.0 | 83.4 | 82.0 | 78.1 |
| GR00T N1.5 + EquiActor | Rel. | 82.6 | 88.0 | 95.2 | 98.2 | 91.0 |
| EquiVLA (ours) | Rel. | 87.6 | 89.4 | 98.0 | 95.4 | 92.6 |
| GR00T N1.5 (baseline) | Abs. | 52.0 | 55.2 | 74.6 | 68.6 | 62.6 |
| GR00T N1.5 + EquiActor | Abs. | 63.0 | 70.0 | 79.4 | 82.0 | 73.6 |
| EquiVLA (ours) | Abs. | 73.6 | 70.4 | 83.0 | 77.6 | 76.1 |
Gains are progressive: EquiActor alone accounts for most of the improvement (+12.9pp under relative control, +11.0pp under absolute control); EquiPerceptor contributes a further +1.6pp (relative) and +2.5pp (absolute).
Single-frame observations (image + proprioception); trained on environments A, B, C, and D; evaluated on environment D (the ABCD→D split); 1000 instruction chains of up to 5 sequential tasks.
| Method | T1 | T2 | T3 | T4 | T5 | Avg ↑ |
|---|---|---|---|---|---|---|
| HULC (multi-frame†) | 88.9 | 73.3 | 58.7 | 47.5 | 38.3 | 3.07 |
| MoDE (multi-frame†) | 97.1 | 92.5 | 87.9 | 83.5 | 77.9 | 4.39 |
| GR00T N1.5 (baseline) | 89.0 | 79.2 | 68.7 | 59.4 | 48.5 | 3.45 |
| GR00T N1.5 + EquiActor | 93.7 | 85.8 | 77.8 | 70.1 | 61.9 | 3.89 |
| EquiVLA (ours) | 95.0 | 88.5 | 81.1 | 73.8 | 64.3 | 4.03 |
† Multi-frame baselines use temporal history; not directly comparable.
Gains are largest at the tail of the chain — Task 5 improves from 48.5% to 64.3% (+15.8pp), suggesting equivariance mitigates error accumulation in long-horizon execution. EquiVLA approaches MoDE (4.39) despite using single-frame observations only.
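The Avg column can be reproduced from the per-task columns. The snippet assumes, as is standard for CALVIN, that Tk reports the percentage of chains in which at least k tasks are completed in a row, so the expected number of completed tasks is the sum of the five rates. The function name is illustrative.

```python
def average_chain_length(success_rates_percent):
    """Expected number of consecutively completed tasks from T1..T5 rates (in %)."""
    return sum(success_rates_percent) / 100.0


# Rows copied from the table above.
baseline = average_chain_length([89.0, 79.2, 68.7, 59.4, 48.5])
equiactor = average_chain_length([93.7, 85.8, 77.8, 70.1, 61.9])
equivla = average_chain_length([95.0, 88.5, 81.1, 73.8, 64.3])

print(round(baseline, 2), round(equiactor, 2), round(equivla, 2))  # 3.45 3.89 4.03
```

This also makes explicit why tail improvements matter: each percentage point gained at T5 adds as much to the average as a point gained at T1, but the later tasks have more room to improve.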
5 tabletop manipulation tasks; 150 teleoperated demonstrations each; 20 trials per task. Four tasks use the right arm only; Shorts Folding is bimanual.
| Task | GR00T N1.5 | EquiVLA (ours) | Δ |
|---|---|---|---|
| Banana in Pot | 12/20 (60%) | 15/20 (75%) | +15pp |
| Block Storing | 9/20 (45%) | 11/20 (55%) | +10pp |
| House Building | 3/20 (15%) | 10/20 (50%) | +35pp |
| Letter Aligning | 13/20 (65%) | 19/20 (95%) | +30pp |
| Shorts Folding | 17/20 (85%) | 17/20 (85%) | 0pp |
| Average | 54% | 72% | +18pp |