Equivariance is dead, long live equivariance?
When should you bake symmetries into your architecture versus just scaling up — an attempt at a nuanced take for molecular modelling.

I’m currently writing my PhD thesis, tentatively titled “Geometric Deep Learning for Molecular Modelling and Design”. As I look back over the past three and a half years, a central theme of my research has been the physical symmetries that govern molecular systems, and the question of whether to implement these symmetries as inductive biases in deep learning architectures.
Through this post, I want to discuss the engineering and computational aspects of molecular modelling, particularly the notion of the hardware lottery — the marriage of architectures and hardware that determines which research ideas rise to prominence. The discussion centres on Transformers versus Graph Neural Networks, and on roto-translation equivariance versus symmetries learnt at scale, the two main architectural debates in my research.1
Transformers are winning the hardware lottery
Transformers are GNNs that implement a fully-connected message passing scheme via dense matrix multiplications. In contrast, GNNs typically implement sparse message passing over locally connected structures. Modern GPUs are hyper-optimised for dense computation, so sparse message passing operations are significantly slower at the size ranges of typical molecular systems.
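To make the contrast concrete, here is a minimal PyTorch sketch of the two regimes. The shapes, the plain sum aggregation, and the random edge list standing in for a radius graph are illustrative assumptions, not any particular model:

```python
import torch

n, d = 256, 64                      # number of atoms, feature dimension
h = torch.randn(n, d)               # per-atom features

# Transformer: fully-connected message passing as dense matrix multiplications.
# The entire (n x n) interaction is a couple of large matmuls, which is exactly
# the workload GPUs are optimised for.
attn = torch.softmax(h @ h.T / d**0.5, dim=-1)        # (n, n) attention weights
h_dense = attn @ h                                    # (n, d) updated features

# Sparse GNN: messages flow only along edges of a local graph. The gather
# (h[src]) and scatter (index_add_) are irregular, memory-bound operations
# that use GPUs far less efficiently at these sizes.
edge_index = torch.randint(0, n, (2, 8 * n))          # stand-in for a radius graph
src, dst = edge_index
messages = h[src]                                     # gather neighbour features
h_sparse = torch.zeros_like(h).index_add_(0, dst, messages)   # scatter-sum per atom
```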
For molecular modelling, the state-of-the-art architecture is generally an Equivariant GNN.2 The best models rely on higher-order tensor representations to achieve maximum expressivity while preserving roto-translation symmetries. This results in a significant increase in memory usage and computational complexity, making equivariant networks orders of magnitude slower to train and scale up than standard Transformers on current hardware.
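To get a rough sense of where the overhead comes from: an equivariant feature carrying spherical tensors up to degree lmax has 2l + 1 components per degree, so the per-channel size grows as (lmax + 1)^2, and the tensor products that couple degrees grow faster still. A quick back-of-the-envelope sketch:

```python
# Per-channel size of an equivariant feature carrying degrees 0..lmax:
# each degree l contributes (2l + 1) components, giving (lmax + 1)^2 in total.
for lmax in range(4):
    size = sum(2 * l + 1 for l in range(lmax + 1))
    print(f"lmax={lmax}: {size} components per channel")   # 1, 4, 9, 16
```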
The evolution from AlphaFold2 to AlphaFold3 exemplifies a paradigm shift in recent years. AlphaFold3’s architecture is considerably simpler than that of AlphaFold2, which explicitly incorporated roto-translation equivariance when predicting the 3D coordinates of protein structures. Instead, AlphaFold3 uses a Transformer-based architecture and data augmentation when learning to predict 3D coordinates. This approach is easier to scale and generalises naturally to all-atom biomolecular complexes. AlphaFold3 is a very effective demonstration of geometric symmetries learnt at scale by a sufficiently expressive model.
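As a caricature of that recipe, here is a minimal sketch of rotation augmentation for coordinate prediction. This shows the general idea only, not AlphaFold3’s actual training procedure, and the model interface and loss are assumptions:

```python
import torch

def random_rotation(device=None):
    # Sample a random 3D rotation: QR-decompose a Gaussian matrix, fix the
    # column signs, and flip one column if needed so that det(R) = +1.
    Q, R = torch.linalg.qr(torch.randn(3, 3, device=device))
    Q = Q * torch.sign(torch.diagonal(R))
    if torch.det(Q) < 0:
        Q[:, -1] = -Q[:, -1]
    return Q

def augmented_loss(model, x_in, x_target):
    # Rotate inputs and targets by the same random rotation at every step.
    # Seen enough rotated copies of the data, an unconstrained model (e.g. a
    # Transformer) learns approximate equivariance instead of having it built in.
    R = random_rotation(device=x_in.device)
    pred = model(x_in @ R.T)                    # (n_atoms, 3) predicted coordinates
    return ((pred - x_target @ R.T) ** 2).mean()
```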
In the near term, the hardware lottery will likely continue to favour Transformers. If you believe in scaling and want to train molecular foundation models with billions of parameters on large datasets, a Transformer would probably be the architecture of choice. Training equivariant networks at such scales would simply be prohibitively expensive at present.
There’s a lot more nuance to this discussion, though.
A problem-centric approach to architectures
It would be naive to conclude that equivariant networks are inferior to unconstrained architectures — the choice of inductive biases depends fundamentally on the problem at hand.
Case 1: Interatomic potentials for molecular dynamics
When data is limited or strict symmetry guarantees are essential — such as in molecular property prediction and dynamics simulation — explicitly enforcing symmetries provides greater data efficiency and generalisation. For instance, equivariant GNNs with higher-order tensors are the current state-of-the-art in interatomic potentials for molecular simulation.
What’s the intuition here? For most practical applications in molecular simulation, models must learn physically meaningful and smooth energy landscapes. Here, local interactions and equivariant representations that transform predictably under roto-translation provide essential inductive biases for capturing the underlying physics.3
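One common way this plays out in practice, shown in a toy sketch below where a hand-written pairwise energy stands in for a learned interatomic potential: if the model predicts a rotation- and translation-invariant scalar energy, then forces taken as its negative gradient are automatically equivariant and conservative.

```python
import torch

def toy_energy(pos):
    # Invariant by construction: depends only on relative displacements, so
    # rotating or translating all atoms together leaves the energy unchanged.
    diff = pos.unsqueeze(0) - pos.unsqueeze(1)          # (n, n, 3) displacements
    return (diff ** 2).sum()

def toy_forces(pos):
    # Forces as the negative gradient of the invariant energy: they transform
    # predictably (equivariantly) under rotation, as checked below.
    pos = pos.clone().requires_grad_(True)
    return -torch.autograd.grad(toy_energy(pos), pos)[0]

x = torch.randn(8, 3)
Q, _ = torch.linalg.qr(torch.randn(3, 3))               # random orthogonal matrix
assert torch.allclose(toy_forces(x @ Q.T), toy_forces(x) @ Q.T, atol=1e-3)
```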

In contrast, when large-scale training data is available and exact symmetry guarantees are not crucial, implicit or learned symmetry constraints can have an advantage. Diffusion-based generative models exemplify this scenario.
Case 2: Generating molecular structure with diffusion
In diffusion models, a denoiser network learns the underlying data distribution by observing molecular structures under varying noise levels and iteratively reconstructing valid configurations. What matters most is that the denoiser produces valid molecular structures given noisy inputs. If the denoiser produces different outputs from rotated versions of the same noisy input, this may not be problematic as long as both outputs represent physically plausible structures.
Again, what’s the intuition? If you’ve trained some diffusion models, you may agree that learning from each sample in the data distribution under different noise levels is crucial for optimal generative modelling. This boils down to performing as many epochs of training as possible with a sufficiently expressive denoiser, as approximate roto-translation equivariance often emerges in unconstrained networks when trained at scale. Since the noisy intermediate steps do not represent physically meaningful structures, the inductive biases of explicit equivariance and locality become less critical.
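A schematic training step makes this concrete. The per-sample noise level, the denoiser signature, and the clean-structure prediction loss are generic assumptions rather than any specific paper’s recipe:

```python
import torch

def diffusion_training_step(denoiser, optimizer, x0):
    # x0: (batch, n_atoms, 3) clean coordinates. Every step draws a fresh noise
    # level per structure, so the more iterations you can afford, the more
    # corrupted views of each structure the denoiser learns from.
    sigma = torch.rand(x0.shape[0], 1, 1) * 2.0           # per-sample noise level
    x_noisy = x0 + sigma * torch.randn_like(x0)           # corrupt the structure
    loss = ((denoiser(x_noisy, sigma) - x0) ** 2).mean()  # predict the clean structure
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```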

This phenomenon helps explain the strong performance of recent Transformer-based diffusion models for molecular generation.4 The hardware lottery enables Transformers to be trained for many more iterations than equivariant networks within the same computational budget, leading to improved performance.
Closing thoughts
Overall, roto-translation equivariance is a powerful inductive bias and strong guarantee of physical correctness. However, equivariance can also be viewed as a hard constraint that ultimately limits model expressivity. A similar argument can be made regarding locality.
Here’s a pragmatic perspective that I’d like to leave you with: Architectures are tools for solving problems. The choice of architecture should be driven by the problem at hand, the available data and the computational resources.
In other words, the best models are the ones you can train today!
1. It’s important to caveat that both Transformers and GNNs also bake in the inductive bias of permutation equivariance over sets of inputs (and use positional encodings when handling ordered inputs). I think permutation symmetry will continue to be an important inductive bias in architectures for molecules and beyond.
2. Depending on whether local interactions are a useful inductive bias, Equivariant GNNs may operate on sparse graphs or consider all pairwise interactions.
3. Notable examples of this paradigm include MACE from Cambridge and UMA from FAIR Chemistry. Two recent papers have also discussed these intuitions in great detail with similar conclusions: eSEN and The dark side of forces.
4. Some examples are Molecular Conformer Fields from Apple, AlphaFold3 from DeepMind, and All-atom Diffusion Transformers from FAIR Chemistry and yours truly.