Hew Phipps

Modelling Protein Folding with Graph Networks Bridges Geometry and Kinetics

hew.phipps@live.co.uk (Hew Phipps) — Tue, 20 Jan 2026 13:55:06 +0000

Introduction

Protein folding remains a difficult problem. Although AlphaFold made great strides in static structure prediction, the dynamic process that takes a protein from random coil to equilibrium structure is still hard to model. Gō-based Ising-like energy models such as the WSME model were among the first successful mathematical models of protein folding as a kinetic process. The approach is as follows: given a native folded conformation, we can classify each residue in any alternative conformation as either native-like or not. One way to define this is by the residue’s formation of native-like contacts ($C_\alpha - C_\alpha$ distance $< 8\,\text{Å}$). In folding this could be the difference between a random coil unfolded state, where a given residue makes few native-like contacts with spatially proximal residues, and the folded state, where that residue may sit in an alpha helix.

The Ising-like approach is to simply formulate the protein with a binary state for each residue, taking 1 if that residue is in the native-like conformation and 0 otherwise. The WSME model itself is specifically formulated to enforce sequence contiguity, and the associated Ising-like Hamiltonian is combined with an entropic parameter for each residue, resulting in a canonical free energy functional. The model is therefore kinetically meaningful and approximates a protein’s folding free energy landscape. For a more detailed description of the WSME model see Ooka, Liu & Arai (2022).

What I’m hoping to illustrate with reference to the WSME model is that modelling protein folding as a dynamical process of state transitions is well established. This blog explores how we might exploit graph networks, which lend themselves well to such state-based formulations, for some slightly different approaches to modelling protein folding. The analysis I undertake is a more interpretative representation of folding that prioritises mechanistic intuition whilst drawing on key concepts from the more established framework of Markov State Models (MSMs). By tracking observed transitions between states we construct heuristic estimates of relative contact flip rates for individual residue pairs. While these are not true kinetics or free energies directly comparable to the typically uniform WSME contact energies, they serve as an exploratory proxy for illustrating how kinetic heterogeneity can emerge in contact formation/breaking. It also has the perk of cool visualisations which, in this case, actually portray a meaningful picture of protein folding.

Throughout this blog I will specifically consider molecular dynamics (MD) simulations of protein folding, namely that of the 10 residue artificial fast folder Chignolin. Naturally this means we are modelling numerical physics-based approximations of protein folding rather than experimental data such as SAXS and HX. The reason is that MD trajectories are trivial to interpret and prepare and, for a number of well described fast folding proteins, are accessible and reproducible. As a TLDR though, this method becomes intractable for proteins >30 residues: the clustering required to even build these graphs makes my later manifold edge-based kinetic barrier analysis largely useless at that scale. I thought this was interesting enough to warrant a blog regardless, and I hope that it can at least serve as a somewhat intuitive introduction to elements of MSM theory.

Contents

Methods - Building a Graph
Identifying the Transition State Ensemble
Bridging Geometry with Energy
Take Homes

Methods - Building a Graph

Like the WSME model, I’m going to rely on residue contacts to define a protein’s state in folding. For this we’ll define a residue contact as any pair of residues whose $C_\alpha - C_\alpha$ distance is $< 8\,\text{Å}$ and which are separated in sequence by at least 4 residues ($|i-j| \geq 4$). We will consider every unique contact map ($N_{\text{res}} \times N_{\text{res}}$ boolean matrix) of a protein as a possible topological state that the protein can take during folding. For a graph $G = (V, E)$ of nodes $v \in V$ and edges $e \in E$, these will be the nodes of our graph.
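As a minimal sketch of the contact definition above, the snippet below builds a boolean contact map from $C_\alpha$ coordinates and flattens it into a hashable state key. The function names `contact_map` and `state_key` and the toy hairpin coordinates are illustrative assumptions of mine, not taken from the analysis code or from Chignolin itself.

```python
import math

def contact_map(ca_coords, cutoff=8.0, min_sep=4):
    """Boolean N x N contact map from C-alpha coordinates given in angstroms."""
    n = len(ca_coords)
    cmap = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_sep, n):  # enforce |i - j| >= min_sep
            if math.dist(ca_coords[i], ca_coords[j]) < cutoff:
                cmap[i][j] = cmap[j][i] = True
    return cmap

def state_key(cmap, min_sep=4):
    """Flatten the allowed upper-triangle entries into a hashable 0/1 tuple."""
    n = len(cmap)
    return tuple(int(cmap[i][j]) for i in range(n) for j in range(i + min_sep, n))

# Toy hairpin (not Chignolin): residues 0-4 run along x, residues 5-9 run back
# along a second strand 5 angstroms away.
coords = [(3.8 * i, 0.0, 0.0) for i in range(5)]
coords += [(3.8 * (9 - j), 5.0, 0.0) for j in range(5, 10)]
cmap = contact_map(coords)
key = state_key(cmap)  # one graph node per unique key
```

Using a tuple as the state key means identical contact maps from different frames collapse onto the same node simply by being used as dictionary keys.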

Edges $E= \{(v_i, v_j) \}$ represent relationships between states, and we can define them as follows. Given an MD trajectory of protein folding, we draw an edge between two nodes if their respective contact maps are at any point adjacent in the trajectory - in other words, if node $v_i$ is the contact map of frame $t$, we draw an edge from $v_i$ to $v_j$ if the contact map of either frame $t-1$ or $t+1$ belongs to node $v_j$. The image below is one such graph network for the artificial fast folding protein Chignolin (10 residues), using the trajectory data from Majewski et al. 2022, who repeated the original simulations from Kresten Lindorff-Larsen et al., How Fast-Folding Proteins Fold.

Figure 1: Graph where nodes represent unique contact maps and edges connect temporally adjacent nodes. Node size corresponds to its respective contact map’s count in the folding trajectory. Nodes are coloured by the index in the trajectory of the first occurrence of their respective contact map. The black-outlined node indicates the starting frame node (the unfolded conformation). Blue edge colouring highlights the shortest path to the yellow folded node whilst red edge highlighting indicates post-folded state transitions. The red star sits in the centre of the folded node.
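The temporal graph construction described above can be sketched in a few lines. This is a toy illustration under my own assumptions: `temporal_graph` is a hypothetical helper, and the single-letter state labels stand in for real contact-map keys.

```python
from collections import Counter

def temporal_graph(state_sequence):
    """Build node counts, first-occurrence indices and undirected edge counts."""
    node_counts = Counter(state_sequence)   # node size in the figure
    first_seen = {}                         # node colour in the figure
    for t, state in enumerate(state_sequence):
        first_seen.setdefault(state, t)
    edge_counts = Counter()                 # transitions between distinct states
    for a, b in zip(state_sequence, state_sequence[1:]):
        if a != b:
            edge_counts[frozenset((a, b))] += 1
    return node_counts, first_seen, edge_counts

# Hypothetical trajectory of state labels, one per frame.
trajectory = ["U", "U", "I", "U", "I", "F", "F", "G", "F"]
node_counts, first_seen, edge_counts = temporal_graph(trajectory)
```

Storing edges as `frozenset` pairs makes the graph undirected, which matches the undirected treatment used for the committor later on.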

Immediately we can spot something interesting. The largest node (yellow) represents the most commonly visited unique contact map in the trajectory which, given this is a folding simulation, we might assume is the folded conformation. This state is surrounded by a number of nodes that are visited only after the folded state is reached. Indeed by colouring edges appearing after reaching that state with red, we can see that the protein, upon reaching its folded state, seems to bounce around exclusively within this small basin. On the other hand we see the comparatively large space explored by the protein before reaching this ensemble, depicting the protein’s search over the folding landscape.

However, as this graph is constructed purely from temporal edge information there is no physical meaning to its coordinate system - if nodes are close together this is just a product of the graph building algorithm. Let’s keep our temporal edges visible but overlay them on a graph whose real edges are defined by adjacency in the space of all contact maps. In other words, nodes $i$ and $j$ are connected if they differ by exactly one contact. This forms an $\binom{N_\text{res}}{2}$-dimensional discrete hypercube manifold* which we can project to 2D coordinates by embedding with UMAP to get the following:

Figure 2: Unlike the previous graph, node connectivity is defined by adjacency on the discrete contact map configuration space manifold and projected down to a 2D representation to approximate topological difference in structure between nodes. Violet edge colouring represents the shortest path adhering strictly to the underlying manifold edges (the manifold geodesic).

*Actually the manifold is heavily constrained, down to a $\mathcal{M} = \sum_{i=1}^{N_\text{res}-4}\left(N_\text{res} - (i+3)\right) = 21$ dimensional hypercube for the 10 residue Chignolin, because our contact definition requires a minimum sequence separation of 4 residues. This means the total number of unique possible states is $2^{21}$.
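The dimension count in the footnote is easy to check by enumeration. The sketch below also shows how manifold edges (states differing by exactly one contact) could be found with a Hamming distance test; `hamming` and `manifold_edges` are illustrative names of mine, and the pairwise comparison is only sensible for small numbers of visited states.

```python
from itertools import combinations

n_res, min_sep = 10, 4
# All residue pairs allowed by the sequence separation rule |i - j| >= 4.
pairs = [(i, j) for i, j in combinations(range(n_res), 2) if j - i >= min_sep]
n_dims = len(pairs)      # dimension of the hypercube
n_states = 2 ** n_dims   # number of possible contact maps

def hamming(a, b):
    """Number of contacts that differ between two 0/1 state tuples."""
    return sum(x != y for x, y in zip(a, b))

def manifold_edges(states):
    """Connect every pair of visited states that differ by exactly one contact."""
    return [(a, b) for a, b in combinations(states, 2) if hamming(a, b) == 1]

# Three toy states on a 3-contact hypercube.
toy_states = [(0, 0, 0), (1, 0, 0), (1, 1, 0)]
toy_edges = manifold_edges(toy_states)
```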

Again we observe the folding process as an apparently somewhat inefficient search over the contact map space, bottlenecked by a number of key nodes. This raises the question of why all of the space to the right of the starting state is explored before reaching the first transition, considering that the progression to the folded state is clearly left-directional (although I must emphasise the inequivalence between geometry and reaction coordinates in this embedded space). So, is this due to inefficiencies in nature’s folding machine, inaccuracies in the approximate physics of MD, or perhaps an apt visualisation of the initial random search theory of protein folding posited by Dill & Chan (1997)?

One thing we can reveal by taking the intersection of manifold and temporal edges (edges that appear in both sets) is that the unfolded state is disconnected from the folded state along the manifold: no continuous sequence of observed single-contact flips links the two. This isn’t anything surprising; it just means the missing flip is likely hidden within temporal transitions consisting of multiple contact flips. Specifically, the third edge on the manifold geodesic (ASP2, THR5 formation) is never observed in the set of temporal edges; the protein navigates around it rather than making the direct link. (ASP2, THR5) is also not a native contact.

Naturally we might think that applying the shortest path to these graphs would give insight into optimal folding paths. But clearly Chignolin takes neither the shortest path nor the manifold geodesic. Why? It’s likely that some energetically favourable conformation must be reached before the large jump into the folded conformation can be achieved. But are these graphs just pretty, or could they help us explore how to identify these transition states and ways to engineer proteins to reach them more efficiently without all the random search? Perhaps we can formalise these sentiments into a quantitative framework by drawing on the properties of these graphs.

Identifying the Transition State Ensemble

One property we can easily exploit is the betweenness centrality $C_B(v)$ which is a metric that balances how connected a given node is with how much interconnectivity it provides, in other words how often it appears in the shortest paths between all other pairs of nodes.

$$ C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}} $$

Where $\sigma_{st}$ is the total number of shortest paths from node $s$ to node $t$ and $\sigma_{st}(v)$ is the number of those shortest paths that pass through $v$. This is a good descriptor of topological bottleneck states, which are likely key to successful folding but may be naively confused with the Transition State Ensemble (TSE). Let’s consider the top 10 betweenness centrality nodes:

Figure 3: Chignolin protein folding graph on the contact space manifold (PCA and UMAP embedded). This time crosses indicate bottleneck nodes with a high betweenness centrality. The size of the crosses and their colour represents the magnitude of the betweenness centrality score.
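For intuition, the betweenness centrality defined above can be computed by brute force on a toy graph. This sketch is my own illustration on an unweighted, undirected graph, counting each unordered pair $(s,t)$ once; a real analysis would more likely rely on a graph library, and the path graph here is hypothetical rather than the Chignolin graph.

```python
from collections import deque
from itertools import combinations

def bfs_counts(adj, source):
    """Shortest-path distances and shortest-path counts from one source node."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:               # first time we reach w
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:      # u is a predecessor of w
                sigma[w] += sigma[u]
    return dist, sigma

def betweenness(adj):
    """C_B(v): sum over pairs s != v != t of sigma_st(v) / sigma_st."""
    info = {s: bfs_counts(adj, s) for s in adj}
    score = {v: 0.0 for v in adj}
    for s, t in combinations(adj, 2):
        dist_s, sigma_s = info[s]
        if t not in dist_s:                 # s and t are disconnected
            continue
        for v in adj:
            if v in (s, t) or v not in dist_s:
                continue
            dist_v, sigma_v = info[v]
            # v lies on a shortest s-t path iff the distances add up exactly
            if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                score[v] += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return score

# Toy path graph 0-1-2-3-4: the middle node is the strongest bottleneck.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
score = betweenness(adj)
top_nodes = sorted(score, key=score.get, reverse=True)[:2]
```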

Notice how many of the most significant bottlenecks lie on the shortest paths but there are still some distributed over the unfolded space. A more appropriate explanation of what high betweenness centrality nodes represent is structural hubs that are easily accessible to lots of other states. These nodes are likely vital for the protein to find before branching off into attempts at folding as they increase accessibility to perhaps more energetically rugged regions of the landscape.

But structural accessibility does not necessarily mean transition state. Instead we can take a more rigorous approach to identifying the TSE, specifically by drawing from MSM theory, which utilises another key graph property called the committor probability. Committor probabilities exploit the fact that in graphs we know what possible paths lie ahead via node connectivity, so for each node $i$ we can compute the probability that the protein will reach the folded state before returning to the unfolded state (start node).

Imagine we have a weighted undirected graph $G = (V,E,W)$ with nodes $V = \{v_1,...,v_N\}$, each representing a unique state (contact map), and edges $E = \{ (v_i,v_j)\}$ with weights $w_{ij} \geq 0$ given by transition counts between nodes $(v_i,v_j)$. In standard MSM modelling we would typically employ a clustered graph with states representing conformational ensembles, validated for Markovianity at a chosen lag time. In our case we preserve interpretability and geometric meaning by keeping unique contact state nodes.

To compute the committor on this undirected graph we treat it as a reversible flux-weighted graph. We first regularise the transition counts by adding a small regularisation factor $\alpha$:

$$ \hat{w}_{ij} = w_{ij} + \alpha $$

This allows us to symmetrise the matrix of counts $C$ to enforce reversibility:

$$ c_{ij} = \frac{\hat{w}_{ij} + \hat{w}_{ji}}{2} $$

This enforces detailed balance by construction without having to assume ergodicity, meaning the transition counts $C=\{c_{ij} \}$ now reflect the probabilities of a random walk over the graph:

$$ P_{ij} = \frac{c_{ij}}{\sum_k c_{ik}} = \frac{c_{ij}}{D_{ii}} $$

Where $D$ is the diagonal degree matrix with $D_{ii} = \sum_k c_{ik}$ the degree of node $i$. We can now exploit something called the Laplacian of the graph, which measures how a signal on a node differs from its neighbours and is given by:

$$ L = D - C $$

Written element-wise, each diagonal entry of the symmetric graph Laplacian sums a node’s weights over all its edges:

$$ L_{ii} = \sum_{j \neq i} c_{ij}, \quad L_{ij} = -c_{ij} \;\; (i\neq j) $$
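The chain from raw counts to random-walk probabilities and the Laplacian can be sketched as below. This is a minimal illustration on a hypothetical 3-state count matrix; `graph_matrices` is a name of mine, and I assume $\alpha$ is added to every entry (including the diagonal), which the text above does not specify.

```python
def graph_matrices(w, alpha=1e-6):
    """Regularise, symmetrise and normalise a square matrix of transition counts."""
    n = len(w)
    # w_hat_ij = w_ij + alpha  (assumption: applied to every entry)
    w_hat = [[w[i][j] + alpha for j in range(n)] for i in range(n)]
    # c_ij = (w_hat_ij + w_hat_ji) / 2  enforces reversibility
    c = [[0.5 * (w_hat[i][j] + w_hat[j][i]) for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in c]          # D_ii = sum_k c_ik
    # P_ij = c_ij / D_ii  is the random-walk transition matrix
    p = [[c[i][j] / degree[i] for j in range(n)] for i in range(n)]
    # L = D - C
    lap = [[(degree[i] if i == j else 0.0) - c[i][j] for j in range(n)]
           for i in range(n)]
    return c, degree, p, lap

# Hypothetical directed transition counts between three states.
w = [[0, 5, 0],
     [3, 0, 2],
     [0, 1, 0]]
c, degree, p, lap = graph_matrices(w, alpha=0.5)
```

Because $C$ is symmetric, $D_{ii} P_{ij} = D_{jj} P_{ji}$ holds automatically, which is the detailed balance property mentioned above.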

The committor probability $q_i$ can be computed exactly for small graphs as the probability that state $i$ reaches the boundary folded node $B$ before the unfolded node $A$. Imposing the Dirichlet boundary conditions $q(A) = 0, \quad q(B) = 1$ results in the discrete harmonic equation:

$$ \sum_j c_{ij} (q_i - q_j) = 0, \quad \forall i \notin A \cup B $$

In practice, this construction yields a discrete, symmetric committor defined directly on the contact-map graph. Because the graph is undirected and regularised, the resulting committor reflects both kinetic connectivity and entropic structure in contact space. While this object is only an approximation of the true committor of the underlying MD, it provides a geometry-aware reaction coordinate consistent with our undirected treatment of contact-flip kinetics.
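One way to solve the discrete harmonic equation above is a simple Gauss-Seidel iteration, where each interior node is repeatedly set to the weighted average of its neighbours. This is a sketch under my own assumptions (a hypothetical `committor` helper and a toy 5-state chain with uniform weights); a direct linear solve on the Laplacian would be an equally valid choice, and nodes with no edges would need special handling.

```python
def committor(c, source, target, tol=1e-12, max_iter=100_000):
    """Solve sum_j c_ij (q_i - q_j) = 0 with q[source] = 0 and q[target] = 1."""
    n = len(c)
    q = [0.5] * n
    q[source], q[target] = 0.0, 1.0          # Dirichlet boundary conditions
    for _ in range(max_iter):
        delta = 0.0
        for i in range(n):
            if i in (source, target):
                continue
            weight = sum(c[i][j] for j in range(n) if j != i)
            # harmonic condition: q_i is the weighted mean of its neighbours
            new = sum(c[i][j] * q[j] for j in range(n) if j != i) / weight
            delta = max(delta, abs(new - q[i]))
            q[i] = new
        if delta < tol:
            break
    return q

# Toy chain 0-1-2-3-4 with unit symmetric weights between neighbours.
n = 5
c = [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]
q = committor(c, source=0, target=n - 1)

# Rank states by closeness to q = 0.5, as a stand-in for picking a TSE.
tse = sorted(range(n), key=lambda i: abs(q[i] - 0.5))[:1]
```

On this chain the committor rises linearly from the unfolded to the folded end, so the middle state is the one closest to $q = 0.5$.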

As a proxy for a true committor probability on our non-MSM graph, $q_i$ gives us a reaction coordinate-like state ordering along the folding pathway. With such a reaction coordinate we can more explicitly identify the transition state, which is canonically defined as the bottleneck of the reaction. If $q=0$ means the protein is certain to unfold before folding (unfolded state) and $q=1$ means it is certain to fold before unfolding (folded state), then any node with $q \approx 0.5$ is sitting on the precipice of falling towards either state - imagine it sitting on a ridge between the unfolded and folded energy basins. This will be how we define the transition state ensemble.

Using this approach we can take the 10 nodes with $q$ values closest to 0.5 as the transition state ensemble and compare to our previous estimations:

Figure 4: Chignolin protein folding graph on the contact space manifold (PCA and UMAP embedded). This time all edges except those between nodes with $q>0.5$ are removed and they are coloured by $q$. Nodes are sized by $q$. Crosses label the top 10 nodes with committor probabilities closest to $0.5$. The size and colour of the crosses represents their committor probabilities. Note the yellow star, which is the only node shared between this q-based TSE and the top 10 betweenness centrality nodes. The blue star represents the folded node.

The discrepancy between these two sets is informative. For our graph, the high betweenness centrality nodes are more often energetic traps than true transition states; they do not necessarily lead to the folded state. In other words, geometric connectivity does not align with propensity to fold. Indeed there is no correlation between the values across all nodes:

Figure 5: Comparison of the committor probabilities and betweenness centralities of the q-based TSE (top 10 nodes closest to q=0.5) and the top 10 betweenness centrality nodes.

The lack of an apparent relationship means that generally for Chignolin the transition states are largely not accessed by the protein - it explicitly avoids highly connected geometric intersections to fold. This is a strong indicator that the folding process in this case may be inefficient in its initial random search.

This difference between our geometric measure and kinetic measure exposes the robustness of more established methods in MSMs. However, this does not mean the geometric representation we gain from our graph is not useful. While committors provide kinetic insight, our graph’s geometric structure offers a unique opportunity to estimate energetic parameters directly.

Bridging Geometry with Energy

Some geometry that is meaningful is the vector between the unfolded and folded states, the geodesic shortest path adhering to the manifold. Even if the shortest path isn’t energetically optimal, it serves as a useful directional bias in the contact-map space as a vector that points in the direction of folding. Further, the distances on our graph (specifically the number of edges / Hamming distance) are a geometric measure of the number of contact flips needed to get from one state to another.

What’s powerful about our manifold + temporal edges graph is that temporal transitions can have Hamming distances > 1 (more than one contact flip), indicating more than one unique contact change happened within the 200 ps timestep of this trajectory data. This gives two useful properties I will use to attempt to infer individual contact flip kinetics:

A higher Hamming distance in one timestep means the barrier for the combined event is low relative to the timestep, or intermediates are short-lived and unresolved. Such edges represent cooperative events.
We can infer the sequence of manifold jumps taken in a temporal edge by the collection of shortest paths on the manifold to achieve it.

A simple histogram of Hamming distances for all temporal edges (where, if the Hamming distance is >1, we infer the distance by shortest path on the manifold) nicely shows the appropriateness of a 200 ps timestep for capturing single flip events:

Figure 6: Histogram showing the distribution of temporal transitions’ inferred Hamming distance. Hamming distances inferred by shortest path on the manifold edges between the temporal nodes for each temporal edge.

Now here’s the really unique part: we can use these properties in a cool way to estimate the kinetic cost of flipping a contact (either forming or breaking a contact between a pair of residues). This is particularly interesting for the likes of the WSME model, which uses contact energies $\epsilon_{ij}$ but canonically defines them with a uniform value - although to be explicit, what I describe here is a kinetic rate barrier, which is not equivalent to an energetic cost such as those used by the WSME model. However, in future work I’m hoping to extend this to a real energetic cost using a thermodynamic approach.

To estimate these contact flip rates I developed a relatively crude algorithm inspired by standards in MSM kinetics analysis (apologies for any cringe this causes experienced MSM modellers), namely:

Observed temporal transition counts can be converted into relative rate-like quantities (under the fixed lag time of our 200 ps timestep for this Chignolin simulation data).
We can apply an Arrhenius-like transform to these rates to give relative kinetic energy barriers.

I say crude because as I mentioned these are not real energies, even if we transform Arrhenius style. The Arrhenius equation gives a rate $k$ from the energy change $\Delta G$:

$$ k = A \exp\left(-\frac{\Delta G}{k_B T}\right) $$

In our case we rearrange for the energy given the rate for the contact flip between residues $k(a,b)$ (note this is not directional as we are currently only working with an undirected graph):

$$ \Delta G(a,b) = -k_B T \ln \left(\frac{k(a,b)}{A} \right) $$

The problem with this is that the attempt frequency $A$ is not arbitrary, it is specific to each system. For this work instead of determining an exact $A$ we can replace it with a selected reference rate with the consequence that $\Delta G^\dagger$ becomes a relative kinetic barrier as opposed to an absolute activation energy:

$$ \Delta G^\dagger(a,b) = -k_B T \ln \left(\frac{k(a,b)}{k_\text{ref}} \right) $$

In practice, $k_{\text{ref}}$ can be taken as the median observed flip rate, yielding a measure of how difficult each contact flip is relative to a typical event. Alternatively, choosing $k_{\text{ref}} = \max_{(a,b)} k(a,b)$ assigns the fastest observed flip the lowest barrier (zero), with all other contact flips measured relative to it. Given this formulation for the relative kinetic barrier of a contact pair marginalised over all contact map configurations, I use my crude algorithm to estimate the non-directional rate $k(a,b)$ of flipping for any given contact pair on the graph.
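Before turning to the rate estimate itself, the relative barrier transform above can be sketched as follows. The rates, the 300 K temperature and the `relative_barriers` helper are all hypothetical placeholders of mine, chosen only to show how the choice of $k_\text{ref}$ shifts the zero of the scale.

```python
import math
from statistics import median

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)

def relative_barriers(rates, temperature=300.0, reference="median"):
    """dG(a,b) = -k_B T ln(k(a,b) / k_ref) for a dict of per-contact flip rates."""
    values = list(rates.values())
    k_ref = median(values) if reference == "median" else max(values)
    kT = K_B * temperature
    return {pair: -kT * math.log(k / k_ref) for pair, k in rates.items()}

# Hypothetical flip rates (per ps) for three contact pairs.
rates = {(1, 6): 1e-4, (2, 7): 1e-3, (3, 8): 1e-2}
vs_median = relative_barriers(rates, reference="median")
vs_fastest = relative_barriers(rates, reference="max")
```

With the median as reference, barriers are signed (slower-than-typical flips are positive, faster ones negative); with the maximum as reference every barrier is non-negative.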

Algorithm

For every temporal edge between nodes ($v_l$,$v_k$) with observed transition counts $C_{lk}^{\text{obs}}$:

Find all shortest paths between $v_l$ and $v_k$ on the manifold graph, where $N_\text{paths}$ is the number of degenerate shortest paths.
Assign each path a flux* $C_{lk}^{\text{obs}}/N_\text{paths}$.
For each path, iterate through its constituent manifold edges $(v_i,v_j)$, adding the flux to each edge’s inferred count $C_{ij}^{\text{inf}}$.

Note this algorithm yields the original temporal transition counts for temporal edges of Hamming distance 1.

*As our graph actually exists in a 21 dimensional hypercube there are $d!$ possible shortest paths for any temporal transition with Hamming distance $d$. To reconcile this we can simply use a distributed-flux maximum entropy approach (the least assumptive choice), $C_{lk}^{\text{obs}}/N_\text{paths}$, where we uniformly split the transition counts between the paths. This naturally captures heterogeneity in inferred manifold edge counts despite arising from uniform splits, because of differential usage of these edges within different temporal edge paths. E.g. if two temporal edges of Hamming distance 3 share some manifold edges in their shortest manifold paths, their flux is additive on the shared edges, meaning unshared manifold edges carry a different flux. You can imagine how this stacks over hundreds of temporal edges and constituent manifold edge combinations.
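The flux-splitting steps above can be sketched as follows. I assume here that every intermediate state on the full hypercube is allowed, so the shortest paths are exactly the $d!$ orderings of the differing contacts; `infer_manifold_counts` and the 2-contact toy states are illustrative and not the code from the repo. Enumerating $d!$ paths is only practical for small $d$.

```python
from collections import defaultdict
from itertools import permutations
from math import factorial

def infer_manifold_counts(temporal_counts):
    """Split each temporal edge's count uniformly over its shortest manifold paths.

    temporal_counts maps (state_u, state_v) -> observed count, where states
    are 0/1 tuples. Returns inferred counts per undirected manifold edge.
    """
    edge_counts = defaultdict(float)
    for (u, v), count in temporal_counts.items():
        flips = [k for k, (a, b) in enumerate(zip(u, v)) if a != b]
        if not flips:
            continue
        flux = count / factorial(len(flips))   # C_obs / N_paths
        for order in permutations(flips):       # one shortest path per ordering
            state = list(u)
            for k in order:
                prev = tuple(state)
                state[k] = 1 - state[k]         # flip one contact
                edge_counts[frozenset((prev, tuple(state)))] += flux
    return dict(edge_counts)

# Toy example: one double-flip transition seen 4 times, one single flip seen 3 times.
temporal_counts = {((0, 0), (1, 1)): 4, ((0, 0), (1, 0)): 3}
inferred = infer_manifold_counts(temporal_counts)
```

In the toy example the manifold edge between `(0, 0)` and `(1, 0)` is shared by both temporal edges, so it accumulates more flux than the other three edges even though each split is uniform.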

Given that the same contact flip can be made by different nodes with different contact map contexts we must aggregate counts across all manifold edges involving a given contact flip $a,b$ to get a marginal score for that contact:

$$ C^\text{inf}(a,b) = \sum_{(v_i,v_j) \in \mathcal{E}_{ab}} C^\text{inf}_{ij} $$

where $\mathcal{E}_{ab}$ is the set of all manifold edges that represent flipping contact $(a,b)$:

$$ \mathcal{E}_{ab} = \{(v_i \rightarrow v_j) : \text{nodes } v_i \text{ and } v_j \text{ differ only in contact } (a,b)\} $$

Now we can infer the transition probability that contact $(a,b)$ flips within a timestep by normalising over the number of opportunities for that flip:

$$ P(a,b) = \frac{C^\text{inf}(a,b)}{N_\text{adj}(a,b)} $$

where $N_\text{adj}(a,b)$ is the combined state population count of all nodes that are involved in an $(a,b)$ contact flip:

$$ N_\text{adj}(a,b) = \sum_{(v_i \rightarrow v_j ) \in \mathcal{E}_{ab}} N_i $$

Note we only sum over $i$ because, as the graph is undirected, $\mathcal{E}_{ab}$ implicitly iterates over the reverse $(v_j,v_i)$ as well. Given we know our timestep (lag time) is $\tau = 200$ ps, we could approximate the flip rate as $k \approx P/\tau$, which holds when $P$ is small. More generally, under the approximation that contact flips are memoryless over the chosen lag time, we can map the discrete-time flip probability to an effective continuous-time rate using the standard exponential waiting-time relationship:

$$ k(a,b) = -\frac{1}{\tau}\ln(1-P(a,b)) $$
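Putting the last few equations together, the aggregation per contact, the flip probability and the rate could be computed as in this sketch. The helper `flip_rates`, the toy states and the populations are assumptions of mine; contacts are identified by their index in the state tuple, and I take $\mathcal{E}_{ab}$ to be whatever manifold edges are passed in.

```python
import math
from collections import defaultdict

def flip_rates(edge_counts, populations, tau_ps=200.0):
    """Per-contact flip probability P and rate k = -ln(1 - P) / tau.

    edge_counts maps a pair of 0/1 state tuples differing in one contact to an
    inferred count; populations maps each state to its number of frames.
    """
    c_inf = defaultdict(float)   # C_inf(a,b): summed inferred counts
    n_adj = defaultdict(float)   # N_adj(a,b): populations of both endpoints
    for edge, flux in edge_counts.items():
        u, v = tuple(edge)
        contact = next(i for i, (x, y) in enumerate(zip(u, v)) if x != y)
        c_inf[contact] += flux
        n_adj[contact] += populations.get(u, 0) + populations.get(v, 0)
    result = {}
    for contact, count in c_inf.items():
        p = count / n_adj[contact]
        # log1p(-p) = ln(1 - p); the rate diverges if p reaches 1
        k = -math.log1p(-p) / tau_ps if p < 1.0 else math.inf
        result[contact] = (p, k)
    return result

# Toy example: one manifold edge flipping contact 0, seen 10 times.
s0, s1 = (0, 0), (1, 0)
rates = flip_rates({(s0, s1): 10.0}, {s0: 60, s1: 40})
p0, k0 = rates[0]
```

For small $P$ the result is close to $P/\tau$, and the two diverge as $P$ approaches 1.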

Using this rate with our Arrhenius-like equation we get the kinetic barriers of the contact pairs found in our graph, which we can visualise nicely with a contact map matrix:

Figure 7: (left) Chignolin contact matrix coloured by inferred contact flip relative kinetic barrier $\Delta G^\dagger(a,b)$ for 12 of the 14 native state contacts (right). The native state is taken as the most occupied node rather than the crystal structure PDB.

This gives us a clear indication of which contacts flip more readily in the observed dynamics. For example, residue pair $(1,6)$ flips far more slowly than the rest, suggesting there is a high kinetic barrier to forming (or breaking) this contact. Again I emphasise that as our graph is undirected we cannot resolve separate barriers for forming and breaking a contact; the kinetic barrier we have inferred is joint over both. Perhaps in a future blog I’ll decompose this into formation and breaking barriers.

Either way, we can perform one last brief analysis to characterise the folding profile of Chignolin. As is typical in MSMs, we can plot energy against a reaction coordinate - however, very much unorthodoxly, in this case we will use our contact pair kinetic barrier values in place of the energy. By taking the sum of kinetic barriers over the contacts in every node’s contact map we get an idea of how slowly the trajectory enters and leaves that state. Plotted against committor probability as the reaction coordinate, we get a clear visualisation of the well established two-state folding nature of Chignolin:

Figure 8: Chignolin two-state folding. Total kinetic barrier, taken as the sum of all native-like contacts’ kinetic barriers, is calculated for each unique contact map and plotted against the respective state’s reaction coordinate given by the committor probability q. Background shading is a density function of the scatter points. Central funnelling is characteristic of two-state folding.

This is all rather theoretical and unvalidated. One of the reasons for this is that, as I discuss below, the choice of contact maps as discrete states scales poorly, so reasonably sized proteins such as Chymotrypsin inhibitor 2 (CI2), for which there is phi value experimental data to compare against directly, are largely intractable (at least on my laptop). Hence why I’ve left this work to a blog.

Take Homes

In this blog I’ve introduced modelling protein folding (specifically MD trajectories of folding) with discrete states as contact maps using graph networks. We’ve explored some useful graph properties and dipped our toes into the mature field of MSMs for which I refer you to more dedicated work for a more accredited introduction. Although very much explorative I’ve outlined a somewhat physics-based method for inferring the kinetic barriers associated with residue contact pair flips. As we mentioned this is unfortunately not decomposable to formation or breaking costs and is also not an absolute free energy. I would like to emphasise that it is 100% hypothetical and I have not rigorously experimented with it against experimental data but I hope it might have piqued your interest or inspired you.

Referring back to our opening statements, could any of this be useful in protein engineering applications? The answer is, quite possibly. For one our inferred contact flip kinetic rates present a clear picture of which contacts have high barriers, either slowing folding significantly or representing heavy structural constraints. Targeting such residues with rational modifications may well lead to intended changes in folding rates.

However, as I’ve mentioned, the most troublesome limitation is scaling with protein size. The problem is that the number of unique contact maps scales as $2^{\binom{N_\text{res}}{2}}$ with protein size $N_{\text{res}}$, and so for any reasonably sized protein that is not some artificial fast folder like the 10 residue Chignolin, the node sizes converge to 1 as the number of possible contact maps vastly exceeds the number of samples.

There is a quick and dirty solution. We can cluster contact maps that are proximal on the manifold of contact space, thus reducing the number of nodes to levels amenable to data sizes and visualisation requirements. For example, let’s consider the largest protein from the Kresten Lindorff-Larsen et al. folding dataset, the lambda repressor. Here’s its folding graph embedded with PCA and UMAP onto the contact map manifold then clustered with HDBSCAN:

Figure 9: Lambda repressor folding graph in the embedded contact map space with nodes representing clusters of unique contact maps close in embedded PCA -> UMAP space. Grey edges connect temporally adjacent nodes (cluster labels adjacent in the trajectory, given each frame in the trajectory is assigned a cluster label). The blue shortest path operates on the temporal edges.

Also, although we used unique contact maps as our nodes, one could easily employ other state representations such as secondary structure content, which would have $4^{N_{\text{res}}}$ configurations as opposed to the contact map’s $2^{\binom{N_\text{res}}{2}}$. For MSMs it’s much more common to cluster on RMSD to the native state, or even torsional angles.

In the end it comes down to scale vs mechanistic interpretability. Clustering is the only way we can even construct the graph for larger proteins and it does not lend itself well to our kinetic contact flip energy analysis thanks to sacrificing manifold edge fidelity into clusters. However, for more MSM-like free energy and committor analysis it is functional under the subjectivity of clustering choices.

Regardless, I’m going to keep looking into the contact flip energies on the side so stay tuned for anything useful I might find whether through blogs or publication (wish me luck!).

The code used in this analysis and links to the MD simulation data are publicly available in the following GitHub repo.

Flow Matching from the Mathematics

hew.phipps@live.co.uk (Hew Phipps) — Wed, 01 Oct 2025 13:55:06 +0000

Introduction

In the world of computational structural biology you might have heard of diffusion models as the current big thing in generative modelling. Diffusion models are great primarily because they look cool when you visualise the denoising process to generate a protein structure (check out the RFdiffusion Colab notebook), but also because they are state of the art at diverse and designable protein backbone structure generation.

Diffusion models originally emerged from computer vision, and a lot of work has since been built up around their application to macromolecules - especially exciting is their harmonious union with geometric deep learning in the case of SE(3) equivariance (see FrameDiff). I don’t know about you but I get particularly excited about geometric deep learning, mostly because it involves objectively dope words like “manifold” and “Riemannian”, better yet “Riemannian manifolds” - woah! (See Bronstein’s Geometric Deep Learning for more fun vocabulary to add to your vernacular, like “geodesic”.)

But we’re getting sidetracked. Diffusion is a square-to-rectangle case of score-based generative models, with the clause that diffusion refers explicitly to learning a time-dependent score function, typically via a denoising process. Check out Jakub Tomczak’s blog for more on diffusion and score-based generative models. Flow matching, although technically different to score-based generative models, also makes use of transformations to a Gaussian but is generally faster and not constrained to discrete time steps (or even Gaussian priors). So the big question is, how does one flow match?

This question is particularly personal to me as my current DPhil focuses heavily on utilising flow matching for solving some particularly exciting problems in biology. However, despite hours of nose-in-book literature and dreams of electric maths, my understanding of the deeper theory is sketchy at best, so I’m using this blog as a way to further that understanding whilst hopefully being somewhat educationally useful for others by giving very much a noob’s walkthrough of Flow Matching. Specifically, because maths looks cool and doing it might help me get cool internships, this blog will focus on the underpinning mathematics of it all, coming from the perspective of a traditionally trained biochemist and mathematical amateur like myself.

Flow Matching is a powerful generalisation of diffusion in that it steps away from the discrete time step denoising approach to a broader definition of a “flow” over a continuous timescale - imagine some time-dependent field shaping/morphing the source distribution into our target data distribution. Let’s say we have a large number of images as examples that represent some higher level data distribution encompassing the set of all images. Much of the effort of deep learning in recent years has gone into improving generative models that can reliably sample from this distribution (with or without learning it). Diffusion does this by learning the inverse of a sequential noising process that transforms the data into Gaussian noise. In flow matching we similarly use a Gaussian as a source distribution but increase our abstraction by viewing each intermediary step as a probability distribution obtained from the source distribution by a diffeomorphism, whose transformation from the source is described by a time-dependent vector field in continuous time rather than a discrete set of steps. We can then sample from the model simply by integrating the learned vector field, which happens to be significantly faster than the denoising process.

Flow Mat(c)h

Much of the math below I’ve recited from this fantastic resource from Meta and the original flow matching paper, and tried to wrap it in a more accessible description. This is also very much the simplest formulation of flow matching so if you’re interested in reading further, particularly for non-Euclidean approaches, I refer back to Meta’s paper.

In flow matching we aim to learn the parameters $\theta$ of a velocity field $v_t^{\theta}$ approximating the true field $v_t$. This field generates a flow $\psi_t(x)$ describing the change over time of some sample $x$, with $\psi_0(x) = x$, formulated as the ODE $\frac{d}{dt}\psi_t(x) = v_t(\psi_t(x))$. The true $v_t$ is complex and generally intractable, so $v_t^{\theta}$ is typically learned by a neural network; we will see how we resolve the intractability later. $v_t$ provides deterministic trajectories morphing a source distribution $p_0$ into a distribution $p_t$ at time point $t \in [0,1]$, transporting distributions forward in time. Thus, $p_t$ represents a probability path that is realised as a probability distribution at any $t$. The velocity field is learned from a set of training data $X_1 \sim q$ functioning as an empirical approximation of an unknown underlying distribution $q$, where we assign $p_1 = q$, together with a Gaussian source distribution $p_0 = \mathcal{N}(x | 0,I)$. Learning $v_t^{\theta}$ allows generative interpolation from any sample $x_0 = X_0 \sim p_0$ to $x_1 = X_1 \sim p_1$ from any time point $t$, so that the distribution of generated samples approximates $q$. In other words, to generatively sample from $q$ we can simply sample from the Gaussian source $p_0$ then integrate to $t=1$ (or an earlier time point to achieve results similar to partial diffusion).

A quick note on definitions to avoid some of the confusion I encountered:

$q$ is the underlying unknown probability distribution we aim to model and from which we have some samples of data which makes up our empirical training set approximation $p_1 \approx q$.
$X_1$ is the random variable $X_1 \sim p_1$ which represents performing a single draw from the training set.
$x_1$ is the realisation or actual value of the draw $X_1$; in the case of the MNIST dataset it would be a single image in the training set.
Similarly, $X_0$ is the random variable $X_0 \sim p_0$ where $p_0$ is our source distribution which we choose to be something we can easily sample from - a Gaussian in this case.
So $p_t$ is a path of probability distributions (probability path) which is a function that returns a probability distribution for a given timepoint $t$.

The flow $\psi_t(x)$ and velocity field $v_t$ are a bit more abstract. The flow $\psi_t(x)$ describes the transport of a sample through time which I interpret as it being a mapping function that simply maps a sample $x_0$ at $t=0$ to its respective position at $t$. This is the determinism of flow matching. The flow is often also described as a push-forward operator but that is perhaps more useful when discussing non-euclidean spaces. $v_t$ can be seen as a forcefield that gives every point $\psi_t(x)$ in the space of each distribution in $p_t$ a direction and speed telling it where it goes next.

The core approach of Flow Matching, specifically Conditional Flow Matching (CFM) as introduced by Lipman et al., is how we construct the probability path $p_t$ for learning the vector field $v_t^{\theta}$. Specifically, we can formulate $p_t$ as an average over the set of conditional probability paths $p_{t|1}$, one for each data endpoint (training point) $X_1 = x_1$. With a Gaussian source distribution we get the following conditional for each separate datapoint $x_1$:

$$p_{t|1}(x|x_1) = \mathcal{N}(x|tx_1, (1-t)^2I)$$

This is useful as it makes things actually tractable. We can recover the full marginal probability path $p_t(x)$ over all $x_1$ by essentially taking an average over each separate conditional datapoint in the empirical training set - something known as mixture of conditionals:

$$p_t(x) = \int p_{t|1}(x|x_1)q(x_1)dx_1$$

Note this is specifically a marginalisation of $X_1$ with respect to $q$. It is not the same as marginalising a joint distribution over several variables (which here would be the variables at other time points), which is why the above is an average and does not make use of the product rule. Apologies if this was obvious but it was a point of terminology confusion for me.

In the case of $t=0$ we see how the above resolves into the source distribution:

$$p_{0|1}(x|x_1) = \mathcal{N}(x|0 \cdot x_1, (1-0)^2I)$$

Where the mean of the Normal becomes $0 \times x_1 = 0$:

$$p_0(x) = \int \mathcal{N}(x|0,I)q(x_1) dx_1 = \mathcal{N}(x|0,I)$$

For $t=1$ we see how with mean $tx_1 = x_1$ and variance $(1-t)^2I = 0$ we get $p_{1|1}(x|x_1) = \mathcal{N}(x|x_1,0)$ which becomes the Dirac delta measure $\delta(x-x_1)$ which is essentially assigning all probability mass to $x=x_1$, therefore:

$$p_1(x) = \int \delta(x - x_1)q(x_1)dx_1 = q(x)$$

Having recovered our source and target distributions with this formulation we have shown how $p_t(x)$ satisfies a conditional optimal-transport, also known as a linear path, which allows us to define the random variable $X_t \sim p_t$ as a linear combination of $X_0 \sim p_0$ and $X_1 \sim q$:

$$X_t = tX_1 + (1-t)X_0 \sim p_t$$

Flow matching is constructed in this way as it enables the above closed form and tractable formulation of a probability path $p_t$ with smooth interpolation between the Gaussian source and target distribution allowing generation of samples from any intermediate distribution by a linear combination of a Gaussian and the data. In other words we are defining a linear interpolation between pairs of samples from $X_0$ and $X_1$ that is deterministic and linear in time from start to end corresponding to the straight line (or geodesic) in Euclidean space.
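As a quick sanity check on the conditional path, here is a minimal NumPy sketch. The 2D endpoint `x1`, the time `t` and the sample count are made-up values for illustration, not anything from the referenced papers. It draws many $X_0$ samples, interpolates towards a fixed $x_1$, and compares the empirical mean and standard deviation with $tx_1$ and $(1-t)$.

```python
import numpy as np

rng = np.random.default_rng(0)

def interpolate(x0, x1, t):
    """Linear (conditional optimal-transport) path: x_t = t*x1 + (1-t)*x0."""
    return t * x1 + (1.0 - t) * x0

x1 = np.array([2.0, -1.0])             # hypothetical fixed data point
t = 0.3                                # arbitrary time point in [0, 1)
x0 = rng.standard_normal((50_000, 2))  # samples from the Gaussian source p_0
xt = interpolate(x0, x1, t)

emp_mean = xt.mean(axis=0)             # should be close to t * x1
emp_std = xt.std(axis=0)               # should be close to (1 - t)
```

With this many samples both statistics should sit close to the values predicted by $\mathcal{N}(x|tx_1, (1-t)^2I)$.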

This linear path is associated with the true velocity field $v_t$ which we can now learn by randomly sampling timepoints $t \sim \mathcal{U}[0,1]$ and determining $x_t$ by the above linear interpolation allowing us to learn a parameterised $v_t^{\theta}(x)$ by neural network with the following Mean Squared Error (MSE) loss:

$$ \mathcal{L}\_{\text{FM}}^{\theta} = \mathbb{E}\_{t,X_t} ||v\_t^{\theta}(X\_t) - v\_t(X\_t) ||^2 $$

When conditioned on a single randomly selected target example $X_1 = x_1$ we reduce the complexity of this joint over two high-dimensional distributions $p(X_t, X_1)$ to a lower dimension $p(X_t | X_1 = x_1)$ by fixing the endpoint which allows tractable sampling. Thus, we adjust the linear combination to the conditional case:

$$ X_{t|1} = tx_1 + (1-t)X_0 \quad \sim \quad p_{t|1}(\cdot|x_1) = \mathcal{N}(\cdot | tx_1, (1-t)^2I) $$

By differentiating the above with respect to $t$ we can get the rate of change of $X_{t|1}$ which is its instantaneous velocity given the fixed endpoint $x_1$:

$$ \frac{d}{dt} X_{t|1} = x_1 - X_0 $$

So, for the conditional process we simply get a constant vector of $x_1 - X_0$. Assuming we have a randomly selected timepoint $t \sim \mathcal{U}[0,1]$ and have drawn a sample $x = X_t \sim p_t$ at that time point then we also have $x = X_t = tx_1 + (1-t)X_0$ where in the case of $t \neq 1$ (as that would lead to division by 0) we rearrange to get $X_0$:

$$ X_0 = \frac{x - tx_1}{1-t} $$

Which we plug into our conditional velocity leading to the instantaneous velocity at $t$ given the current point $x$ and corresponding datapoint $x_1$:

$$ v_t(x|x_1) = \frac{x_1 - x}{1 - t} $$
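The two forms of the conditional velocity are easy to confuse, so here is a small numerical check that they agree along the linear path. This is only a sketch: the random arrays stand in for real data, and $t$ is kept below 1 to avoid the division by zero noted above.

```python
import numpy as np

rng = np.random.default_rng(1)

def conditional_velocity(x, x1, t):
    """v_t(x | x1) = (x1 - x) / (1 - t), valid for t < 1."""
    return (x1 - x) / (1.0 - t)

x0 = rng.standard_normal((8, 3))         # source samples
x1 = rng.standard_normal((8, 3))         # stand-ins for data points
t = rng.uniform(0.0, 0.99, size=(8, 1))  # one time point per sample, t < 1
xt = t * x1 + (1.0 - t) * x0             # points on the linear path

v = conditional_velocity(xt, x1, t)
same = np.allclose(v, x1 - x0)           # both expressions give x1 - x0
```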

This generates the conditional probability path $p_{t|1}(x_t|x_1)$. Now this simple conditional velocity field can be used to express the marginal velocity field as the conditional expectation over the posterior of possible endpoints $X_1$ conditioned on the current state $X_t = x$:

$$ v_t(x) = \mathbb{E}[v_t(x|X_1) | X_t = x] = \frac{\mathbb{E}[X_1 | X_t=x] - x}{1-t} $$

Showing how the instantaneous velocity of $x$ is the posterior expectation $\mathbb{E}[X_1 | X_t=x]$ of the training data minus the current position and divided by the time remaining - simple enough! This can also be seen as a weighted average over the probability paths:

$$ v_t(x) = \int v_t(x|x_1)\frac{p_{t|1}(x|x_1)q(x_1)}{p_t(x)}dx_1 $$

Which similarly gives the marginal vector field. So, we are left with a simple training recipe where, upon sampling a random timepoint $t$, $X_0$, $x_1 = X_1$, and interpolating $X_t = tX_1 + (1-t)X_0$, we train the learnable vector field $v_t^{\theta}$ by averaging over the conditional velocity vectors for training-source pairs then regressing the learnable vector field $v_t^{\theta}$ to the conditional velocity vector $\frac{x_1 - x_t}{1-t}$ (see the algorithm at the end).
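To make the weighted average concrete, the sketch below evaluates the marginal velocity at a single point for a tiny, made-up two-point "training set", treating the empirical data as $q$ with equal weight on every point. The function name and the data are illustrative assumptions only.

```python
import numpy as np

def marginal_velocity(x, t, data):
    """Posterior-weighted average of conditional velocities at point x.

    data: (K, D) array of training points, treated as an empirical q.
    """
    # log N(x | t*x1_k, (1-t)^2 I), up to a constant shared by every k
    sq_dist = np.sum((x - t * data) ** 2, axis=1)
    log_w = -0.5 * sq_dist / (1.0 - t) ** 2
    w = np.exp(log_w - log_w.max())
    w /= w.sum()                      # posterior p(x1_k | X_t = x)
    cond_v = (data - x) / (1.0 - t)   # v_t(x | x1_k) for every k
    return w @ cond_v, w

data = np.array([[2.0, 0.0], [-2.0, 0.0]])  # hypothetical two-point dataset
x = np.array([0.0, 1.0])                    # equidistant from both paths
v, w = marginal_velocity(x, t=0.5, data=data)
```

Because `x` is equally compatible with both endpoints, the posterior weights are equal and the horizontal components cancel, leaving a velocity that points towards the posterior mean of the data.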

A final powerful observation, whose derivation I leave to the pros in the referenced papers, is that both the marginal and conditional velocity field loss functions have the same gradients for learning $v_t^{\theta}$:

$$ \nabla\_{\theta}\mathcal{L}\_{\text{FM}}(\theta) = \nabla\_{\theta}\mathcal{L}\_{\text{CFM}}(\theta) $$

This represents one of the most desirable properties of Flow Matching (specifically Conditional Flow Matching - CFM) which allows one to train just using per-sample conditionals. In other words, we Flow Match on individual samples. Thus, the simplest implementation of Flow Matching using a Gaussian source distribution with training data in the Euclidean space and exploiting the conditional loss gives the following final form:

$$ \mathcal{L}\_{\text{CFM}}(\theta) = \mathbb{E}\_{t,X\_0,X\_1}||v\_t^{\theta}(X\_t) - (X\_1 - X\_0) ||^2 $$

We can also define the more general realisation which applies to non-Euclidean space when considering $\psi_t(x)$ as the appropriate push-forward:

$$ \mathcal{L}\_{\text{CFM}}(\theta) = \mathbb{E}\_{t,X\_0,X\_1}||v\_t^{\theta}(\psi_t(x\_0)) - \frac{d}{dt}\psi_t(x\_0)||^2 $$

Where in the Euclidean space the second differential term simplifies to $x_1 - x_0$. Things get even more interesting applying the above generalised definition to the Riemannian world but this blog is long enough and has depleted my IQ reserves. So, to answer the question of how to flow match? Do it on individual samples!

Vs Diffusion

For reference, here are the pros of Flow Matching against typical Diffusion:

Faster. More efficient sampling as we only need to solve an ODE.
Sample from any timepoint without needing to employ tricks like consistency models.
Determinism
Beyond Gaussian source distributions

Everyone Loves an Algorithm

Given source distribution $p_0 = \mathcal{N}(0,I)$ and target distribution $p_1 = q$ defined by a set of training data examples $x_1 = X_1 \sim q$, learn a vector field $v_t^{\theta}$ with parameters $\theta$ using a neural network that takes $(x,t)$ as input. Train by repeating for $N$ epochs:

Draw data example $x_1 \sim q$ (or minibatch of data points)
Sample $x_0 \sim p_0$ (or minibatch of samples)
Sample $T$ timepoints $t \sim \mathcal{U}[0,1], \quad t\neq 1$ for each sample (or minibatch of samples)
Determine $x_t$ by linear interpolation: $x_t = tx_1 + (1-t)x_0$
Compute per-sample target conditional velocity: $v_t(x_t|x_1) = \frac{x_1 - x_t}{1 - t} = x_1 - x_0$
Predict estimate of $v_t^{\theta}(x_t)$ from neural network for each sample
Compute loss: $\mathcal{L}\_{\text{CFM}}(\theta) = \mathbb{E}\_{t,x\_0,x\_1} || v\_t^{\theta}(x\_t) - (x\_1 - x\_0)||^2$
To sample we can simply draw $x_0 \sim p_0$ and integrate the ODE $\frac{d}{dt}x_t = v_t^{\theta}(x_t)$ over $t \in [0,1]$ with a standard ODE solver.
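The recipe above can be sketched in a few lines of NumPy. To keep the example self-contained no network is trained here: I assume a toy 1D Gaussian target $q = \mathcal{N}(\mu, s^2)$ (the values of `MU` and `S` are made up), for which the marginal velocity has a closed form, and use that closed form (`v_exact`) as a stand-in for a trained $v_t^{\theta}$. In a real setup `v_model` would be a neural network and `cfm_loss` would be minimised by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
MU, S = 3.0, 0.5   # hypothetical 1D Gaussian target q = N(MU, S^2)

def cfm_batch(n):
    """Draw x1, x0 and t, interpolate, and build the regression target."""
    x1 = MU + S * rng.standard_normal(n)
    x0 = rng.standard_normal(n)
    t = rng.uniform(0.0, 1.0, size=n)   # samples lie in [0, 1), so t != 1
    xt = t * x1 + (1.0 - t) * x0
    return xt, t, x1 - x0

def cfm_loss(v_model, n=20_000):
    """Monte Carlo estimate of the conditional flow matching loss."""
    xt, t, target = cfm_batch(n)
    return np.mean((v_model(xt, t) - target) ** 2)

def v_exact(x, t):
    """Closed-form marginal velocity for this Gaussian-to-Gaussian toy case."""
    var_t = (t * S) ** 2 + (1.0 - t) ** 2
    return MU + (t * S**2 - (1.0 - t)) / var_t * (x - t * MU)

def euler_sample(v_model, n, steps=200):
    """Draw x0 ~ p_0 and integrate dx/dt = v(x, t) from t=0 to t=1."""
    x = rng.standard_normal(n)
    h = 1.0 / steps
    for k in range(steps):
        x = x + h * v_model(x, k * h)
    return x

loss_exact = cfm_loss(v_exact)
loss_zero = cfm_loss(lambda x, t: np.zeros_like(x))  # trivial baseline field
samples = euler_sample(v_exact, n=10_000)
```

The fixed-step Euler loop is the simplest possible ODE solver and is used only for illustration; any standard solver could replace it. The generated `samples` should have mean and standard deviation close to `MU` and `S`, and the correct field should score a much lower CFM loss than the zero field.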

Footnote $p_{t|1}$ is notational weirdness that means $p_t(x|X_1 = x_1)$ meaning probability distribution at time $t$ conditioned on the single endpoint sample $x_1$ NOT time $t$ given 1 or given $t=1$. For example $x_1$ can be a single image from the MNIST set where $X_1$ is the random variable meaning when we sample the random variable $X_1$ representing the training set we draw a particular instance or single example image $x_1$

Blog 3: Protein Modelling Pt.3 - Thermodynamics and Statistical Mechanics

hew.phipps@live.co.uk (Hew Phipps) — Thu, 11 Sep 2025 13:55:06 +0000

Introduction

In the previous blog we described the first method for approximating the Potts model via message passing as proposed by Weigt et al. The resultant method was somewhat inefficient, relying on a slow iterative belief propagation. In this blog I will walk you through the next iteration in methods for approximating the Potts model, specifically the Mean Field approximation approach pioneered by the same group that introduced message passing. We will walk through the paper by Morcos et al. and their more elegant and efficient solution to the Potts model through a marriage of statistical mechanics and thermodynamics with evolutionary sequence analysis.

Contents

Empirical Frequencies
Derivation of the Mean Field Approach
Inference

As a reminder, we start with the canonical Potts protein model where we are given a sequence $\boldsymbol{\sigma} = \{\sigma_1, ..., \sigma_N \}$ of length $N$, and a multiple sequence alignment of $M$ homologous sequences arranged in a matrix $\mathcal{D}$ of size $M \times N$. Each element is one of the $20$ canonical amino acids or the gap character, $a \in \{1,...,q\}, \quad q=21$, and is indexed by $\sigma_i^{(m)}$ for $i \in \{1,...,N\}$ and $m \in \{1,...,M\}$. We want to use this data to approximate the following Potts model by estimating the parameters $\boldsymbol{h}$ and $\boldsymbol{J}$ representing the single-site and coupling parameters respectively:

$$ P(\boldsymbol{\sigma}) = \frac{1}{Z} \exp\left\{\sum_{i}^N \boldsymbol{h}_i(\sigma_i) + \sum_{i,j > i}^N \boldsymbol{J}_{ij}(\sigma_i,\sigma_j) \right\} $$

Where $Z$ is the partition function that normalises the Hamiltonian to a probability value between 0 and 1.

Mean Field Approximation

The Mean Field approach we will describe allows inference up to $10^4$ times faster than via message passing. As before we start with a definition of empirical frequencies.

Empirical Frequencies

We start by obtaining single-site and pairwise empirical frequencies as sample data estimates of our $\boldsymbol{h}$ and $\boldsymbol{J}$ parameters as with the previous approach, although with a slightly different formulation:

$$ f_i(a) = \frac{1}{\lambda + M_{eff}}\left(\frac{\lambda}{q} + \sum_{k=1}^M \frac{1}{m_k}\delta(\sigma_i^k = a) \right) $$

$$ f_{ij}(a,b) = \frac{1}{\lambda + M_{eff}}\left(\frac{\lambda}{q^2} + \sum_{k=1}^M \frac{1}{m_k}\delta(\sigma_i^k = a, \sigma_j^k = b) \right) $$

Notice the two new terms, $M_{eff}$ and $m_k$. These are included to counter a further problem with MSA data, alongside regularisation: MSAs are prone to sampling bias along phylogenetic trees of high sequence similarity, which over-represents closely related sequences relative to distantly related homologues, on top of implicit biases in the sequencing actions of humans. To counter this, each sequence $k$ is reweighted by $1/m_k$, balanced by an effective sequence count $M_{eff}$. Typically we reweight so that any sequences above a threshold sequence similarity, typically 0.8, are down-weighted, reducing the effective number of sequences to $M_{eff} \leq M$. Thus $M_{eff} = \sum_{k=1}^M 1/m_k$ is a sum of sequence weights rather than the total sequence count. The per-sequence count $m_k$ is the number of sequences in the alignment (including sequence $k$ itself) whose identity to sequence $k$ exceeds the threshold:

$$ m_k = |\{ b \in \{1,...,M\}|\operatorname{seqid}(\boldsymbol{\sigma}^{k}, \boldsymbol{\sigma}^{b}) > 0.8 \}| $$
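As a rough illustration of the reweighting, the sketch below computes the per-sequence neighbour counts (`m` in the code) and $M_{eff}$ for a tiny integer-encoded alignment. The three-row MSA is invented for the example, and the all-pairs comparison is a simple approach that would not scale to large alignments.

```python
import numpy as np

def sequence_weights(msa, threshold=0.8):
    """Return the neighbour counts m and M_eff for an integer-encoded MSA.

    msa: (M, N) array with entries in {0, ..., q-1}.
    """
    # Pairwise fractional identity between all rows, shape (M, M)
    identity = (msa[:, None, :] == msa[None, :, :]).mean(axis=2)
    m = (identity > threshold).sum(axis=1)   # each sequence counts itself
    return m, np.sum(1.0 / m)

msa = np.array([
    [0, 1, 2, 3, 4],
    [0, 1, 2, 3, 4],   # exact duplicate of row 0
    [4, 3, 2, 1, 0],   # distant homologue, 20% identical to row 0
])
m, m_eff = sequence_weights(msa)
```

The two identical sequences each get weight 1/2 and the distant one keeps weight 1, so three sequences count as an effective two.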

Following this the authors introduce the connected correlations as a measure of pairwise interdependence in a similar vein to the Mutual Information (actually the covariance of the single site marginals):

$$ C_{ij}(a,b) = P_{ij}(a,b) - P_i(a)P_j(b) $$

Where the empirical approximation to this follows:

$$ C_{ij}(a,b) = f_{ij}(a,b) - f_i(a)f_j(b) $$
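Putting the frequency formulas and the connected correlations together, a minimal sketch might look as follows. The random toy alignment, the uniform weights and the pseudocount value are all assumptions for illustration; in practice the weights would come from the sequence reweighting described above.

```python
import numpy as np

def frequencies(msa, q, weights, lam):
    """Reweighted, pseudocounted single-site and pairwise frequencies."""
    onehot = np.eye(q)[msa]        # (M, N, q) indicator of each residue
    norm = lam + weights.sum()     # lambda + M_eff
    fi = (lam / q + np.einsum("k,kia->ia", weights, onehot)) / norm
    fij = (lam / q**2
           + np.einsum("k,kia,kjb->iajb", weights, onehot, onehot)) / norm
    return fi, fij

def connected_correlations(fi, fij):
    """C_ij(a, b) = f_ij(a, b) - f_i(a) f_j(b), stored with shape (N, q, N, q)."""
    return fij - np.einsum("ia,jb->iajb", fi, fi)

rng = np.random.default_rng(0)
q = 4
msa = rng.integers(0, q, size=(30, 5))   # random toy alignment, not real data
weights = np.ones(len(msa))              # uniform weights, i.e. every m = 1
fi, fij = frequencies(msa, q, weights, lam=1.0)
C = connected_correlations(fi, fij)
```

A useful check is that the frequencies stay normalised after adding the pseudocount, and that summing $C_{ij}(a,b)$ over $b$ gives zero.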

The computational efficiency and simplistic appeal of Mean Field approximation as a method for approximating the Potts model arises from the simple relationship:

$$ C_{ij}^{-1} = - \boldsymbol{J}_{ij} $$

So the largely iterative procedure of the message passing approach is reduced to a matrix inversion which, given fixed gauges and regularisation, is solvable. The actual derivation of this requires some key tools from physics which I will not attempt to explain here (I’m not a physicist!) but I will walk you through the process regardless.

Derivation of the Mean Field Approach

Essentially the authors apply something called a Legendre transform to change the variables of our Potts model from the Hamiltonian formulation’s fields (single-site and pairwise) to a Gibbs potential expressed in terms of “spin” distributions (or in this case amino acid probability distributions), allowing inference of the partition function by a statistical approach using our multiple sequence alignment. They first introduce the (perturbed) Hamiltonian as follows:

$$ \mathcal{H}(\alpha) = -\alpha \sum_{1 \leq i < j \leq N} \boldsymbol{J}_{ij}(\sigma_i,\sigma_j) - \sum_{i=1}^N \boldsymbol{h}_i(\sigma_i) $$

Where the parameter $\alpha$ is introduced to allow for interpolation between an independent model at $\alpha = 0$ and the full model at $\alpha = 1$. In physics the Hamiltonian describes the energy of a given configuration of the system (for us a protein sequence in the space of protein sequences) and is related to the free energy (Helmholtz free energy) $F$ by:

$$ F = U - TS $$

This is classic thermodynamics, where $U$ is the energy, $T$ the temperature and $S$ the entropy. In statistical mechanics the entropy and average energy give the same free energy:

$$ F = -k_B T \ln Z $$

Where $k_B$ is the Boltzmann constant and $Z$ the partition function which we know is a sum over all possible configurations of the system, each weighted by its Boltzmann factor $e^{-\beta \mathcal{H}}$:

$$ Z = \sum_{\{\boldsymbol{\sigma_i} \}} e^{-\beta \mathcal{H} (\{\boldsymbol{\sigma_i}\})} $$

Combining this with the above we get the formal Free energy (note how the Boltzmann constant and temperature are combined into the constant $\frac{1}{\beta}$):

$$ F = - \frac{1}{\beta} \ln \left(\sum_{\{\boldsymbol{\sigma_i}\}}e^{-\beta \mathcal{H} (\{\boldsymbol{\sigma_i} \})} \right) $$

This is all parameterised by energy terms which we do not have access to but want to infer from a statistical distribution of sequences in a multiple sequence alignment. We can use a Legendre transform of $F$ to change variables switching from the free energy to the Gibbs potential $\mathcal{G}$ which changes dependence on the fields $\boldsymbol{h_i}$ to so-called magnetizations $m_i$, or probability marginals $P_i$. The Legendre transform of $F[h]$ to $G[P]$ is as follows:

$$ \mathcal{G}[P] = F[\boldsymbol{h}] + \sum_i\boldsymbol{h}_iP_i $$

Where $P_i = -\frac{\partial F}{\partial h_i}$ is the negative partial derivative of $F$ with respect to the single site fields. This is true because, after some differentiation of the above free energy equations:

$$ \frac{\partial F}{\partial \boldsymbol{h}_i(a)} = -\frac{1}{\beta}(\beta \langle \delta_{\sigma_i,a}\rangle) = - \langle \delta_{\sigma_i,a}\rangle $$

and the expectation over the Kronecker delta indicator function, which is 1 if $\sigma_i = a$ and 0 otherwise, is exactly equal to the single-site marginal $P_i(a)$.
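This relationship is easy to check numerically on a model small enough to enumerate. The sketch below assumes a toy Potts model with $N=3$ sites, $q=3$ states, random fields and couplings, and $\beta = 1$ (all made-up choices), and compares a finite-difference derivative of $F = -\ln Z$ with the exact marginal.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
N, q = 3, 3                        # toy sizes so Z can be enumerated
h = rng.normal(size=(N, q))
J = rng.normal(size=(N, N, q, q))  # only the i < j entries are used
configs = np.array(list(itertools.product(range(q), repeat=N)))

def log_weights(h, J):
    """Negative energy -H(sigma) of every configuration (beta = 1)."""
    e = h[np.arange(N), configs].sum(axis=1)
    for i in range(N):
        for j in range(i + 1, N):
            e = e + J[i, j, configs[:, i], configs[:, j]]
    return e

def free_energy(h, J):
    return -np.log(np.exp(log_weights(h, J)).sum())   # F = -ln Z

def marginal(h, J, i, a):
    w = np.exp(log_weights(h, J))
    return w[configs[:, i] == a].sum() / w.sum()

i, a, eps = 1, 2, 1e-5
h_plus, h_minus = h.copy(), h.copy()
h_plus[i, a] += eps
h_minus[i, a] -= eps
# Central finite difference of F with respect to h_i(a)
dF_dh = (free_energy(h_plus, J) - free_energy(h_minus, J)) / (2 * eps)
p_ia = marginal(h, J, i, a)
```

Up to finite-difference error, `dF_dh` should equal `-p_ia`.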

Now $\mathcal{G}$ depends on the system’s state (probabilities) not the external forces (fields), allowing inference from our statistical distribution. So for the Potts model the Gibbs potential is defined as:

$$ -\mathcal{G}(\alpha) = \ln \left[\sum_{\{\sigma_i | i = 1, ..., N\}} e^{-H(\alpha)} \right] - \sum_{i=1}^N \sum_{b=1}^{q-1} \boldsymbol{h}_i(b) P_i(b) $$

Where the first term is the statistical free energy $F = -\ln Z$ with the partition function $Z$ and the second term (with sign rearrangements) is the added marginals. Taking derivatives of $G(\alpha, \{P_i\})$ with respect to $P_i(a)$ gives the fields:

$$ \boldsymbol{h}_i(\sigma_i) = \frac{\partial \mathcal{G}(\alpha)}{\partial P_i(\sigma_i)} $$

Here’s the hat trick. If we take the derivative of these single site fields $\boldsymbol{h}_i$ with respect to $P_j(\sigma_j)$ we get:

$$ \frac{\partial \boldsymbol{h}_i(\sigma_i)}{\partial P_j(\sigma_j)} = \frac{\partial^2 \mathcal{G}(\alpha)}{\partial P_i(\sigma_i)\partial P_j(\sigma_j)} $$

Remember the connected correlations $C_{ij}$? In statistical physics it is well established that the second derivative of the free energy (with respect to the fields) gives the connected correlations, up to a sign. Given our first derivative of the free energy from earlier:

$$ \frac{\partial F}{\partial \boldsymbol{h}_i(\sigma_i)}= - \langle \delta_{\sigma_i,a}\rangle = -P_i(\sigma_i) $$

Taking the second derivative, this time with respect to the single site field at site $j$, gives:

$$ \frac{\partial^2 F}{\partial \boldsymbol{h}_i(\sigma_i)\partial \boldsymbol{h}_j(\sigma_j)} = -\frac{\partial P_i(\sigma_i)}{\partial \boldsymbol{h}_j(\sigma_j)} $$

The derivative on the right-hand side is exactly the connected correlations:

$$ C_{ij}(\sigma_i,\sigma_j) = \langle \delta_{\sigma_i,a}\delta_{\sigma_j,b}\rangle - P_i(\sigma_i)P_j(\sigma_j) = \frac{\partial P_i(\sigma_i)}{\partial \boldsymbol{h}_j(\sigma_j)} $$

Where $\langle \delta_{\sigma_i,a}\delta_{\sigma_j,b}\rangle = P_{ij}(\sigma_i,\sigma_j)$. Notice how we are left with the inverse of the derivative of the single site fields with respect to $P_j(\sigma_j)$. In other words the inverse of the connected correlations describes the change in fields in response to the marginals:

$$ (C^{-1})_{ij}(\sigma_i,\sigma_j) = \frac{\partial \boldsymbol{h}_i(\sigma_i)}{\partial P_j(\sigma_j)} $$
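The link between the response of a marginal to a field and the connected correlation can also be verified by brute force. As before this is only a sketch on an assumed toy model ($N=3$, $q=3$, random parameters, $\beta=1$); the chosen indices `i, a, j, b` are arbitrary.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
N, q = 3, 3
h = rng.normal(size=(N, q))
J = rng.normal(size=(N, N, q, q))  # only the i < j entries are used
configs = np.array(list(itertools.product(range(q), repeat=N)))

def boltzmann(h):
    """Exact P(sigma) over all q**N configurations (beta = 1)."""
    e = h[np.arange(N), configs].sum(axis=1)
    for i in range(N):
        for j in range(i + 1, N):
            e = e + J[i, j, configs[:, i], configs[:, j]]
    w = np.exp(e)
    return w / w.sum()

def p_single(h, i, a):
    return boltzmann(h)[configs[:, i] == a].sum()

i, a, j, b, eps = 0, 1, 2, 0, 1e-5
p = boltzmann(h)
p_ij = p[(configs[:, i] == a) & (configs[:, j] == b)].sum()
c_exact = p_ij - p_single(h, i, a) * p_single(h, j, b)   # C_ij(a, b)

# Central finite difference of P_i(a) with respect to the field h_j(b)
h_plus, h_minus = h.copy(), h.copy()
h_plus[j, b] += eps
h_minus[j, b] -= eps
c_fd = (p_single(h_plus, i, a) - p_single(h_minus, i, a)) / (2 * eps)
```

The two numbers `c_exact` and `c_fd` should agree to within finite-difference error.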

The authors emphasise that this relationship holds for any value of $\alpha$ and that the $q-1$ gauge makes this matrix invertible by removing “trivial linear dependencies resulting from the normalisation of $P_{ij}$”. Given that the connected correlations hold information on our pairwise marginals, $C_{ij}(\sigma_i,\sigma_j) = P_{ij}(\sigma_i,\sigma_j) - P_i(\sigma_i)P_j(\sigma_j)$, we can use them to obtain the marginals via a simple rearrangement:

$$ P_{ij}(\sigma_i,\sigma_j) = P_i(\sigma_i)P_j(\sigma_j) + C_{ij}(\sigma_i,\sigma_j) $$

However thanks to the earlier proof we can obtain $C_{ij}$ by inverting the Hessian of the Gibbs potential $\mathcal{G}$:

$$ C = \left(\frac{\partial^2 \mathcal{G}(\alpha)}{\partial P_i(\sigma_i)\partial P_j(\sigma_j)}\right)^{-1} $$

So to compute the pairwise marginals we must evaluate the inverse of the Hessian of $\mathcal{G}(\alpha)$ and add the product of the single-site marginals. To do so we need an approximation of $\mathcal{G}(\alpha)$. The authors achieve this via a two-term Taylor expansion of $\mathcal{G}$ around $\alpha = 0$ (the independent Hamiltonian), also known as a first-order mean field expansion. In other words, the Gibbs potential is approximated by the value of the independent system $\mathcal{G}(0)$ plus an interaction potential acting through the first-order (second) term, which is the mean-field correction:

$$ \mathcal{G}(\alpha) \approx \mathcal{G}(0) + \alpha \frac{d \mathcal{G}}{d \alpha}|_{\alpha = 0} $$

This works easily for the independent single site case (no couplings) and in the paper they also extend this to the pairwise couplings interaction energy with a small coupling expansion (Taylor series on 0) which I refer the reader to for a more rigorous description than anything I could attempt (Morcos et al.).

Considering the Gibbs potential in the independent case $\alpha = 0$ where the Gibbs potential is equivalent to the negative entropy of an ensemble of $N$ uncoupled Potts spins with marginals $P_i(\sigma_i)$, the free energy equals the average energy minus the entropy. For $\alpha = 0$ the Legendre transform removes the complete average energy leaving the entropy of uncoupled spins:

$$ \mathcal{G}(0) = \sum_{i=1}^N \sum_{a=1}^q P_i(a) \ln P_i(a) $$

Due to the gauge the authors modify this to eliminate $P_i(q)$ reducing the expression to independent variables:

$$ \mathcal{G}(0) = \sum_{i=1}^N \sum_{a=1}^{q-1} P_i(a) \ln P_i(a) + \sum_{i=1}^N\left[1 - \sum_{a=1}^{q-1} P_i(a)\right] \ln \left[1 - \sum_{a=1}^{q-1} P_i(a) \right] $$

For $\alpha = 0$ we look to obtain $\frac{d\mathcal{G}(\alpha)}{d\alpha}$ from $\mathcal{G}(\alpha)$ for which we get:

$$ \frac{d\mathcal{G}(\alpha)}{d\alpha} = -\left\langle \sum_{i<j}\boldsymbol{J}_{ij}(\sigma_i,\sigma_j) \right\rangle_\alpha $$

Or the average of the couplings term in the Hamiltonian. At $\alpha = 0$ this is trivial as we can factorise the joint distribution of all variables over single sites:

$$ \frac{d\mathcal{G}(\alpha)}{d\alpha} |_{\alpha=0} = -\sum_{i<j}\sum_{a,b} \boldsymbol{J}_{ij}(a,b)P_i(a)P_j(b) $$

Inserting our equations for $\mathcal{G}(0)$ and $\frac{d\mathcal{G}(\alpha)}{d\alpha} |_{\alpha=0}$ into our expansion of the Gibbs potential, and evaluating at the full model $\alpha = 1$, returns its first-order approximation:

$$ \sum_{i=1}^N \sum_{a=1}^{q-1} P_i(a) \ln P_i(a) + \sum_{i=1}^N\left[1 - \sum_{a=1}^{q-1} P_i(a)\right] \ln \left[1 - \sum_{a=1}^{q-1} P_i(a) \right] -\sum_{i<j}\sum_{a,b} \boldsymbol{J}_{ij}(a,b)P_i(a)P_j(b) $$

Or its mean-field approximation.

Now we’re nearly there! The final step of this procedure is to obtain so-called “self-consistent mean-field equations” for the single-site fields and connected correlations. Drawing from Physics once more we refer to the variational principle in thermodynamics that minimises the free energy (Gibbs potential) with respect to the marginal distributions $P_i(a)$, ignoring all correlations except those of the mean fields, to converge on the true underlying Boltzmann distribution $P(\{\boldsymbol{\sigma}\})$. This requires a massive differentiation over the Gibbs potential with respect to $P_i(a)$ whilst normalising to $\sum_{a=1}^q P_i(a) = 1$. As with any minimisation we look for the point at which the gradient equals 0, which through the magic of the authors’ hard work leaves the single-site mean field equation:

$$ \frac{P_i(\sigma_i)}{P_i(q)} = \exp \left\{\boldsymbol{h}_i(\sigma_i) + \sum_{j\neq i}\sum_{\sigma_j} \boldsymbol{J}_{ij}(\sigma_i,\sigma_j)P_j(\sigma_j) \right\} $$

That is, each site’s marginal depends on the average state of all the others.

Inference

Finally, returning to our connected correlations, where we required an approximation of the Gibbs potential, we need to derive the Hessian of the above first-order mean field Gibbs potential with respect to $P_i(\sigma_i)$ and then $P_j(\sigma_j)$, which gives the inverse connected correlations. Again this is a complex differentiation which, if you’re really keen, you can try yourself (keeping in mind that the marginals are normalised to sum to 1), but I will just provide the end result. We get two cases as a result of this differentiation. For the case of $i \neq j$ we explicitly obtain the inference solution to the Potts model using mean field approximation:

$$ (C^{-1})_{ij}(\sigma_i, \sigma_j) = -\boldsymbol{J}_{ij}(\sigma_i,\sigma_j) $$

And that’s it! This inspiring medley of thermodynamics, statistical mechanics and evolutionary sequence analysis starting from a 21 state Ising model has led to rapid inference through the simple inversion of the connected correlations matrix which we have an empirical approximation for.
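To see the whole pipeline end to end, here is a compact sketch of the inversion on a deliberately artificial alignment. Several choices are my own assumptions rather than details taken from this post: sequence reweighting is skipped (uniform weights), the pseudocount is set equal to the number of sequences, the same-site blocks of $f_{ij}$ are set to $\delta_{ab}f_i(a)$ so the matrix is a proper covariance, and the last state is dropped as the $q-1$ gauge.

```python
import numpy as np

def mean_field_couplings(msa, q, lam):
    """Mean-field sketch: J = -C^{-1} in the gauge that drops state q-1."""
    M, N = msa.shape
    onehot = np.eye(q)[msa]                          # (M, N, q)
    fi = (lam / q + onehot.sum(axis=0)) / (lam + M)
    fij = (lam / q**2
           + np.einsum("kia,kjb->iajb", onehot, onehot)) / (lam + M)
    for i in range(N):                               # same-site blocks
        fij[i, :, i, :] = np.diag(fi[i])
    C = fij - np.einsum("ia,jb->iajb", fi, fi)
    # Keep only the first q-1 states per site and flatten to a square matrix
    C = C[:, : q - 1, :, : q - 1].reshape(N * (q - 1), N * (q - 1))
    J = -np.linalg.inv(C)
    return J.reshape(N, q - 1, N, q - 1)

q = 3
# Toy alignment: sites 0 and 1 always agree, site 2 varies independently
msa = np.array([[s, s, r] for s in range(q) for r in range(q)])
J = mean_field_couplings(msa, q, lam=float(len(msa)))
coupled = np.linalg.norm(J[0, :, 1, :])     # strength of the 0-1 coupling
uncoupled = np.linalg.norm(J[0, :, 2, :])   # strength of the 0-2 coupling
```

In this constructed example the inferred coupling block between the two co-varying sites is large while the block involving the independent site vanishes, which is the behaviour we want from a contact predictor.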

Blog 2: Protein Modelling Pt.2 - Graph Network Messaging

hew.phipps@live.co.uk (Hew Phipps) — Tue, 08 Jul 2025 13:55:06 +0000

Introduction

In the previous blog we outlined the statistical basis of the Potts model. The resultant Ising-like form is largely intractable for any reasonably sized protein and multiple sequence alignment. Although no closed form solution exists, methods exploiting numerical approaches to approximate the energy function have shown remarkable success when tested for their accuracy in predicting protein residue contacts. In this blog we’ll discuss the two earliest approaches to this.

Contents

Message Passing

$$ P(\boldsymbol{\sigma}) = \frac{1}{Z} \exp\left\{\sum_{i}^N \boldsymbol{h}_i(\sigma_i) + \sum_{i,j > i}^N \boldsymbol{J}_{ij}(\sigma_i,\sigma_j) \right\} $$

Where $Z$ is the partition function that normalises the Hamiltonian to a probability value between 0 and 1.

Solving the above provides a sequence-based model that encodes structural information via estimation of a protein’s contact map as well as other functionality such as generative sampling. Specifically, by explicitly modelling a global joint distribution of the sequence we are able to reliably disentangle true 3D contacts between residue pairs whilst standard covariance analysis and Mutual Information approaches get confused by indirect residue correlations (Figure 1).

Figure 1: (left) multiple sequence alignment showing three highly correlated positions, suggesting they are all in contact (2-4-8-2), but in reality in 3D space positions 2 and 8 are only transitively connected via position 4 (right). Direct contact analysis is able to disentangle these transitive correlations from real contacts to identify the real couplings (2-4, 4-8).

One can imagine three residues in a protein sequence co-evolving where residue pair 1-2 and 2-3 are close in 3D space but 1-3 are not yet their indirect correlation through residue 2 suggests structural proximity. This false positive contact is successfully disentangled by the global probability distribution of the Potts model which, unlike Mutual Information, encodes information about the relationships with every other residue of a given residue and thus incorporates knowledge of shared relationships with a second residue other than the direct relationship between the two of them.

Message Passing

The first attempt at approximating the Potts model for small proteins was a message passing approach taken by Weigt et al. who used a graph network model to iteratively update the edges between nodes (residues), eventually converging on the true contact map and successfully disentangling true direct contacts from indirect interactions. Weigt et al. specifically do this for the sensor histidine kinase (SK) and response regulator (RR) protein pair of bacterial two-component signalling systems, meaning their analysis is on interacting or coupled contact pairs between two proteins rather than within a single monomer. In their work they concatenate the sequences of the two proteins for each sequence pair, essentially treating them as a single monomer. Without going too deep into the nitty gritty of message passing I’ll go through how they used it to learn the Potts model.

In their paper they consider specifically two interacting proteins with the aim of identifying their interacting residues with the same logic as intradomain residues in contact. They do this by taking paired MSAs for both proteins and concatenating the two paired sequences along each row. The resultant combined MSA is of length $N=(N_1 + N_2)$ with $M$ protein pairs with sequences $\boldsymbol{\sigma}$ capable of taking any value from the canonical amino acid alphabet + a gap character $\boldsymbol{\sigma} \in \{1,...,q=21\}$.

Empirical Frequencies

In the previous blog we noted how the solution to the Potts model comes from a Maximum entropy approach constrained by the frequency counts for each amino acid at single $f_i(A_i)$ and pairwise sites $f_{ij}(A_i,A_j)$. Weigt et al. and other subsequent papers utilise variations of weighted and regularised frequency counts instead of just plain frequencies. Specifically in the Weigt et al. paper:

$$ f_i(a) = \frac{1}{\lambda q+ M}\left(\lambda + \sum_{k=1}^M \delta_i(\sigma_i^k=a) \right) $$

$$ f_{ij}(a,b) = \frac{1}{\lambda q + M}\left(\frac{\lambda}{q} + \sum_{k=1}^M \delta_{ij}(\sigma_i^k=a,\sigma_j^k=b) \right) $$

Where $\delta(\cdot)$ is the indicator function, taking the value 1 if the identity inside holds and 0 if not. The above give weighted regularised frequency counts which compensate for a major issue with learning a model from multiple sequence alignments. Specifically, regularisation is employed by adding a pseudocount $\lambda$, a small number (an adjustable hyperparameter) added to our frequency counts so that even if a particular amino acid is not seen in a column it still has a non-zero frequency value. This works to prevent overfitting, as it is almost never the case that there is absolutely 0 chance of a residue being accessible to evolution at any position of any protein. Any 0 frequency observed is generally just an artefact of the MSA sampling.
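As a rough illustration of the two formulas above, here is a minimal pure-Python sketch that computes the pseudocount-regularised frequencies from a toy alignment. The function name `regularised_frequencies`, the four-letter alphabet and the toy MSA are my own assumptions for the example, not anything taken from the paper, and no sequence reweighting is applied.

```python
from collections import Counter
from itertools import product

def regularised_frequencies(msa, alphabet, lam=1.0):
    """Pseudocount-regularised single-site and pairwise frequencies.

    msa      : list of equal-length aligned sequences (strings)
    alphabet : string of the q allowed symbols, gap included
    lam      : the pseudocount (lambda in the equations above)
    """
    M, N, q = len(msa), len(msa[0]), len(alphabet)
    norm = lam * q + M
    f_i = []
    for i in range(N):
        counts = Counter(seq[i] for seq in msa)
        f_i.append({a: (lam + counts[a]) / norm for a in alphabet})
    f_ij = {}
    for i in range(N):
        for j in range(i + 1, N):
            counts = Counter((seq[i], seq[j]) for seq in msa)
            f_ij[(i, j)] = {
                (a, b): (lam / q + counts[(a, b)]) / norm
                for a, b in product(alphabet, repeat=2)
            }
    return f_i, f_ij

# Hypothetical toy alignment: 4 sequences, 3 columns, alphabet of q = 4 symbols
toy_msa = ["AC-", "AC-", "GCA", "A-A"]
f_i, f_ij = regularised_frequencies(toy_msa, alphabet="ACG-", lam=0.5)
```

With these definitions both sets of frequencies still sum to 1, and summing $f_{ij}(a,b)$ over $b$ returns $f_i(a)$, which is why the pairwise pseudocount is $\lambda/q$ rather than $\lambda$.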

Returning to the paper, in the previous blog we discussed gauge fixing so as to prevent parameter freedom for the same probability (or Hamiltonian). In this approach the authors use a zero-sum gauge. Specifically:

$$ \sum_{\sigma_i=1}^q \boldsymbol{J}_{ij}(\sigma_i,\sigma_j) = \sum_{\sigma_j=1}^q\boldsymbol{J}_{ij}(\sigma_i,\sigma_j) = 0 , \quad \forall i,j $$

$$ \sum_{\sigma_i=1}^q(\boldsymbol{h}_i(\sigma_i)) = 0 $$

Meaning that the solution to our maximum entropy approach becomes unique. The zero-sum condition also reduces the number of parameters to be determined: only $q-1$ values per site are free, as the last one can be inferred from the sum. Having established these prerequisites we can begin to consider how we learn these $\boldsymbol{h}$ and $\boldsymbol{J}$ parameters.
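To make the zero-sum condition on the couplings concrete, the sketch below double-centres a single $q \times q$ coupling block so that every row and column sums to zero. The helper name `to_zero_sum` and the example values are hypothetical. Note that this only shows the projection of one block: to leave the probability distribution unchanged, the row and column means removed here would also have to be absorbed into the fields, which is not shown.

```python
def to_zero_sum(J):
    """Double-centre a q x q coupling block so each row and column sums to zero."""
    q = len(J)
    row_mean = [sum(J[a]) / q for a in range(q)]
    col_mean = [sum(J[a][b] for a in range(q)) / q for b in range(q)]
    total_mean = sum(row_mean) / q
    return [
        [J[a][b] - row_mean[a] - col_mean[b] + total_mean for b in range(q)]
        for a in range(q)
    ]

# Hypothetical coupling block for q = 3 states
J_block = [[1.0, 2.0, 0.0],
           [0.5, -1.0, 3.0],
           [2.0, 0.0, 1.0]]
J_zero_sum = to_zero_sum(J_block)
```

Applying the projection a second time changes nothing, since a block that already satisfies the gauge has zero row, column and overall means.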

Inference

Inference is achieved in the paper through gradient descent following this 2-step algorithm:

For a trial Hamiltonian, marginal distributions for single and pairwise positions are calculated.
Parameters are updated with gradient descent by the difference between the above marginals and the empirical frequencies.

We start by setting the couplings to 0 and choosing the single-site fields so that the single-site marginals equal the empirical frequencies. The parameter update scheme is then defined as:

$$ \Delta J_{ij}(\sigma_i,\sigma_j) = \epsilon \left[f_{ij}(\sigma_i,\sigma_j) - P_{ij}(\sigma_i,\sigma_j) - \frac{f_i(\sigma_i) + f_j(\sigma_j) - P_i(\sigma_i) - P_j(\sigma_j)}{q} \right] $$

$$ \Delta h_i(\sigma_i) = \epsilon [ f_i(\sigma_i) - P_i(\sigma_i)] $$

Where $\epsilon$ controls the step size. Even though we initialise the single-site fields so that their marginals equal the empirical frequencies, we must still include an update rule for them at each step: updating the couplings to fit the pairwise marginals can shift the single-site marginals, so we re-evaluate the single-site fields to restore them.
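The update rules can be sketched for a single pair of sites as below, assuming the model marginals have already been computed by some other means. The function name `update_step` and the $q=2$ toy tables are my own illustrative choices.

```python
def update_step(f_i, f_j, f_ij, P_i, P_j, P_ij, eps=0.1):
    """One gradient step for the field at site i and the coupling block (i, j).

    f_* are empirical frequencies, P_* the current model marginals,
    all given as plain lists indexed by amino acid state.
    """
    q = len(f_i)
    dh_i = [eps * (f_i[a] - P_i[a]) for a in range(q)]
    dJ_ij = [
        [
            eps * (f_ij[a][b] - P_ij[a][b]
                   - (f_i[a] + f_j[b] - P_i[a] - P_j[b]) / q)
            for b in range(q)
        ]
        for a in range(q)
    ]
    return dh_i, dJ_ij

# Hypothetical q = 2 example: empirical tables versus a uniform trial model
f_i, f_j = [0.7, 0.3], [0.4, 0.6]
f_ij = [[0.28, 0.42], [0.12, 0.18]]
uniform_1, uniform_2 = [0.5, 0.5], [[0.25, 0.25], [0.25, 0.25]]
dh_i, dJ_ij = update_step(f_i, f_j, f_ij, uniform_1, uniform_1, uniform_2, eps=1.0)
```

When the pairwise tables are consistent with the single-site ones, the correction term divided by $q$ makes each column of the coupling update sum to zero, so a step of this form does not move the couplings out of the zero-sum gauge.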

The problem is that the single and two-site marginals are computationally intractable for any reasonably sized protein as the marginals are computed by summation over all positions except those in question:

$$ P_i(\sigma_i) = \sum_{\sigma_{-i}}P(\boldsymbol{\sigma}) $$

and the same for the pairwise sites. The authors resolve this with the application of message passing. The neat solution they develop is to actually replace the gradient descent approach altogether with message passing.

Message Passing

The single-site marginals are evaluated by standard belief propagation (BP), a method for approximating marginals on a probability graph with pairwise interactions without summing over the entire combinatorial space. Such an approach works by exchange of messages or beliefs between sites. Nodes (or residue sites) pass messages about what they believe to be the marginal at their site if the neighbour they are passing to was removed. Specifically we denote a message being passed about a marginal at site $i$ to neighbour $j$ as $P_{i \rightarrow j}(A_i)$. The reason we exclude the neighbour $j$ is because we don’t want $i$ to send a message about itself to $j$ including $j$’s own belief about $i$ itself which would be a form of double counting leading to recursive updates.

To reiterate, if $i$’s belief about its own marginal comes from all other nodes, we want to explicitly exclude the influence of $j$ when passing a message to $j$ about $i$. This recursive thinking extends down the tree of messages, where the computation of $i$’s marginal is an aggregate of the messages from all the other nodes, which we multiply together, giving the product in the BP update equation below:

$$ P_{i \rightarrow j}(\sigma_i) \sim \exp \left( h_i(\sigma_i)\right) \prod_{k \neq i,j}\left[ \sum_{\sigma_k}\exp (-J_{ki}(\sigma_k,\sigma_i))P_{k\rightarrow i}(\sigma_k) \right] $$

First all messages are initialised randomly before iterative updates using the above are repeated until no message has been updated by more than $10^{-5}$. The true marginals are then computed by:

$$ P_{i}(\sigma_i) \sim \exp \left( h_i(\sigma_i)\right) \prod_{k \neq i}\left[ \sum_{\sigma_k}\exp (-J_{ki}(\sigma_k,\sigma_i))P_{k\rightarrow i}(\sigma_k) \right] $$

Where we no longer exclude $j$. You can read up more on BP through this blog by Andy Jones. The core innovation of this paper however is the inversion of this approach. Consider that we already know the marginal distributions we want, they are our empirical frequencies $P_i(\sigma_i) = f_i(\sigma_i)$. So if we can formulate our message passing to not rely on $h_i(\sigma_i)$ we can then solve for it directly by inference using BP. A neat way to eliminate $h_i(\sigma_i)$ from the above two equations to obtain this is to simply take their ratio and rearrange:

$$ \frac{P_{i \rightarrow j}(\sigma_i)}{P_{i}(\sigma_i)} \propto \frac{\exp \left( h_i(\sigma_i)\right) \prod_{k \neq i,j}\left[ \sum_{\sigma_k}\exp (-J_{ki}(\sigma_k,\sigma_i))P_{k\rightarrow i}(\sigma_k) \right]} {\exp \left( h_i(\sigma_i)\right) \prod_{k \neq i}\left[ \sum_{\sigma_k}\exp (-J_{ki}(\sigma_k,\sigma_i))P_{k\rightarrow i}(\sigma_k) \right]} $$

Cancelling terms except for the denominator’s $k=j$ we simplify to:

$$ \frac{P_{i \rightarrow j}(\sigma_i)}{P_{i}(\sigma_i)} \propto \frac{1} {\sum_{\sigma_j}\exp (-J_{ij}(\sigma_i,\sigma_j))P_{j\rightarrow i}(\sigma_j)} $$

then rearrange:

$$ P_{i \rightarrow j}(\sigma_i) \propto \frac{P_{i}(\sigma_i)} {\sum_{\sigma_j}\exp (-J_{ij}(\sigma_i,\sigma_j))P_{j\rightarrow i}(\sigma_j)} $$

Which gives a message update solution reliant on the marginals rather than $h_i(\sigma_i)$ successfully inverting the problem. After convergence we can then compute $h_i(\sigma_i)$ by simply rearranging the previous marginals equation and replacing $P_i(\sigma_i)$ with $f_i(\sigma_i)$:

$$ \exp(h_i(\sigma_i)) \propto \frac{f_i(\sigma_i)}{\prod_{j\neq i}\left[ \sum_{\sigma_j}\exp(-J_{ij}(\sigma_i,\sigma_j))P_{j \rightarrow i}(\sigma_j)\right]} $$

Where we have replaced index $k$ with $j$ for notational consistency.
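The inverted message update and the final expression for $h_i$ can be tried out on a toy problem. The sketch below is my own illustration, not the authors’ implementation: it uses a three-site chain with $q=2$, hypothetical couplings and target marginals, uniform rather than random initial messages (for reproducibility), and the $\exp(h)\exp(-J)$ sign convention of the BP equations in this post. Because a chain has no loops, the recovered fields can be checked against exact enumeration.

```python
from itertools import product
from math import exp, log

def normalise(v):
    total = sum(v)
    return [x / total for x in v]

def incoming(J, msg, i, j, a, q):
    # sum over sigma_j of exp(-J_ij(a, sigma_j)) * P_{j->i}(sigma_j)
    return sum(exp(-J[i][j][a][b]) * msg[j][i][b] for b in range(q))

def inverse_bp_fields(f, J, q, tol=1e-12, max_iter=10_000):
    """Recover fields h[i][a] from target marginals f[i][a] and couplings J[i][j][a][b]."""
    n = len(f)
    # msg[i][j][a] holds P_{i->j}(a); uniform start is an assumption of this sketch
    msg = [[[1.0 / q] * q for _ in range(n)] for _ in range(n)]
    for _ in range(max_iter):
        new = [[row[:] for row in site] for site in msg]
        for i in range(n):
            for j in range(n):
                if i != j:
                    new[i][j] = normalise(
                        [f[i][a] / incoming(J, msg, i, j, a, q) for a in range(q)]
                    )
        change = max(
            abs(new[i][j][a] - msg[i][j][a])
            for i in range(n) for j in range(n) for a in range(q)
        )
        msg = new
        if change < tol:
            break
    # exp(h_i(a)) is proportional to f_i(a) / prod_{j != i} incoming(i, j, a)
    return [
        [log(f[i][a])
         - sum(log(incoming(J, msg, i, j, a, q)) for j in range(n) if j != i)
         for a in range(q)]
        for i in range(n)
    ]

def exact_marginals(h, J, q):
    """Brute-force marginals of P ~ exp(sum_i h_i - sum_{i<j} J_ij); toy sizes only."""
    n = len(h)
    marg = [[0.0] * q for _ in range(n)]
    for s in product(range(q), repeat=n):
        log_w = sum(h[i][s[i]] for i in range(n))
        log_w -= sum(J[i][j][s[i]][s[j]] for i in range(n) for j in range(i + 1, n))
        for i in range(n):
            marg[i][s[i]] += exp(log_w)
    return [normalise(row) for row in marg]

def transpose(block):
    return [list(col) for col in zip(*block)]

q = 2
zero = [[0.0] * q for _ in range(q)]
J01 = [[0.5, -0.3], [0.2, 0.1]]    # hypothetical couplings
J12 = [[-0.4, 0.6], [0.3, 0.0]]
# Three-site chain 0 - 1 - 2: sites 0 and 2 are not directly coupled
J = [
    [zero, J01, zero],
    [transpose(J01), zero, J12],
    [zero, transpose(J12), zero],
]
f = [[0.7, 0.3], [0.4, 0.6], [0.55, 0.45]]   # hypothetical target marginals

h = inverse_bp_fields(f, J, q)
recovered = exact_marginals(h, J, q)   # should reproduce f on this loop-free graph
```

On a fully connected graph, as in the real protein problem, the same update would only give approximate fields, since BP is exact only when the graph has no loops. The fields are also only determined up to an additive constant per site, which is where the gauge choice comes in.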

Having now formulated a solution for single site fields inference we need to consider the couplings. This is somewhat more complicated than a satisfying rearrangement of BP. The authors employ something called susceptibility propagation which I will defer to the paper (or a future blog).

Unfortunately, although exhibiting a significant speed up compared to an MCMC approach this method still suffers from being a largely iterative procedure. In the next part in this series we will address this issue with a powerful alternative formulation developed soon after this approach - Mean Field Approximation. So stay tuned for the next blog where I’ll take us through this beautiful convergence of statistical mechanics and thermodynamics with biology.

Blog 1: Protein Modelling Pt.1 - From Similarity to Structure Prediction

hew.phipps@live.co.uk (Hew Phipps) — Mon, 03 Mar 2025 13:55:06 +0000

Introduction

Biology has typically been the least quantitative of the sciences. However, over recent decades the surge in sequencing data, made possible by next generation sequencing, has facilitated the application of statistical and machine learning to biology. In this blog I will describe what first got me into computational biology and what helped power the first drastic improvements in protein structure prediction introduced by AlphaFold2. Specifically, I will discuss what is now called evolutionary or Direct Coupling Analysis (DCA) and focus on several key papers that formulated this fascinating application of statistical mechanical principles to biology.

Contents

Introduction
What is Protein Modelling?
Pairwise Modelling
Learning the Potts Model

What is Protein Modelling?

Starting from the very top, what are proteins? Proteins are molecular machines which, at the simplest level, can be characterised by their amino acid sequence - a 1 dimensional ordered string of single letter representations of each amino acid molecule (residue) making up the complete molecular structure of the protein. In some ways, this is the most primitive “model” of a protein we have. It tells us the exact chemical composition of that protein, yet we can’t even infer its exact respective genetic sequence without further information (due to the redundancy of the genetic code, where several codons encode the same amino acid).

Let’s level up our model. Commonly across biology we are given protein sequences without any knowledge of the function or properties of that protein. To resolve this we have developed various methods for comparing protein sequences in an effort to determine their degree of similarity and thus infer function from similarity to proteins with known function. Naturally this requires a quantitative approach which has come in the form of sequence alignments. At the simplest level there’s the pairwise alignment which simply sums the number of mismatches in two sequences of the same length. For sequences of different lengths things get a bit more complicated, but we can save that for another blog. Further complexity can be achieved by scoring mismatches differently depending on the relatedness of the amino acid pair’s chemical properties with a substitution matrix (a popular one being the BLOSUM62 matrix).
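The two scoring schemes just described can be sketched in a few lines. The names `mismatch_count` and `substitution_score` are hypothetical, and the tiny substitution table holds made-up illustrative values, not real BLOSUM62 entries.

```python
def mismatch_count(seq_a, seq_b):
    """Number of mismatched positions between two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    return sum(x != y for x, y in zip(seq_a, seq_b))

def substitution_score(seq_a, seq_b, matrix, default=-1):
    """Sum per-position scores from a symmetric substitution table."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    total = 0
    for x, y in zip(seq_a, seq_b):
        # Look the pair up in either order, falling back to a default penalty
        total += matrix.get((x, y), matrix.get((y, x), default))
    return total

# Toy scores only: identical residues score high, chemically similar pairs
# score mildly positive, everything else takes the default penalty.
toy_matrix = {("L", "L"): 4, ("I", "I"): 4, ("D", "D"): 6, ("L", "I"): 2}

n_mismatch = mismatch_count("LID", "LLD")
score = substitution_score("LID", "LLD", toy_matrix)
```

The plain mismatch count treats the I to L swap the same as any other change, whereas the substitution table rewards it for being chemically conservative.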

At all levels these are models of a protein, in this case providing functionality via the ability to quantify relatedness between sequences. Taking a step further we encounter Hidden Markov Models which are a particularly powerful model of proteins, again describing similarity but this time probabilistically, allowing us to determine the likelihood that a given protein sequence is related to a collection of others (a family of homologs) - this is called homology modelling. What I’m going to describe in this blog is what I see as the next step up in protein modelling, the Potts Model. I’ll describe its emergence from a beautiful application of statistical mechanics by way of the Ising Model and how its functionality extends beyond just similarity/homology modelling to generative protein design and 3D structure prediction.

A Note on Residue Contacts

This approach to modelling proteins is known generally as Direct Coupling Analysis (DCA). Its connection to structure comes from the fact that proteins can be modelled by so-called contact maps, which define a contact as any residue pair with at least one atom each within a threshold distance of each other - typically 4 Å. Contact maps, typically plotted residue index against residue index, are symmetrical scatter plots with points at each residue pair index that satisfies this condition. These maps are remarkably powerful in protein structure analysis and prediction. Indeed much of the initial success of AlphaFold2 can be attributed to their use of multiple sequence alignment DCA to predict contact maps for a given sequence within their Evoformer module.
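The contact definition above translates almost directly into code. This is a minimal sketch assuming each residue is given as a list of $(x, y, z)$ atom coordinates in Å; the function name `contact_map` and the toy coordinates are hypothetical, and a real pipeline would parse these from a structure file.

```python
from math import dist

def contact_map(residue_atoms, threshold=4.0):
    """Boolean contact matrix: True where any atom pair is closer than threshold.

    residue_atoms : list of residues, each a list of (x, y, z) tuples in Angstrom
    """
    n = len(residue_atoms)
    contacts = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if any(dist(a, b) < threshold
                   for a in residue_atoms[i] for b in residue_atoms[j]):
                contacts[i][j] = contacts[j][i] = True   # the map is symmetric
    return contacts

# Toy coordinates: residues 0 and 1 are close, residue 2 is far from both
toy_residues = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(3.5, 0.0, 0.0)],
    [(10.0, 0.0, 0.0)],
]
toy_contacts = contact_map(toy_residues)
```

In practice, residues adjacent in sequence are usually trivially in contact, so analyses often only consider pairs separated by some minimum number of positions. That filter is left out here.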

Pairwise Modelling

The models we discussed earlier consider each residue in the protein sequence independently of the others in the sequence. In reality, every residue in the protein is likely influenced by (and influences) every other residue. This is particularly evident in the case of epistasis - a phenomenon in which mutating a specified residue in a protein sequence with and without a mutation in a secondary residue in the sequence will have drastically varying effects on the protein (in terms of stability, fitness and other protein properties).

This implies a level of inter-residue dependence in protein sequences. Drawing from probability theory we get a nice mathematical formalisation for this. If two variables are independent:

$$ P(A,B) = P(A)P(B) $$

This is equally represented in epistasis which is defined as:

$$ E(\text{seq}^{mut1,mut2}) \neq E(\text{seq}^{mut1})E(\text{seq}^{mut2}) $$

So in the case of proteins we can quantify the degree of dependence between two residues $i$ and $j$ by their Mutual Information (MI):

$$ \text{MI}_{ij} = \sum_{a,b} P(\text{seq}_{i}=a,\text{seq}_{j}=b) \log \frac{P(\text{seq}_{i}=a,\text{seq}_{j}=b)}{P(\text{seq}_{i}=a)P(\text{seq}_{j}=b)} $$

This is good because just like HMMs we want a probabilistic formalism for our protein model that allows us to statistically determine the model parameters but also provide a degree of certainty (and be generative, but we will discuss that later). Starting at a basic level, as we touched on the dependency between residues we can look at modelling a protein as a multi-dimensional distribution over the residues in its sequence which we write as a joint distribution over the whole sequence where each position in the sequence is a variable $x_i$ taking on one of the 20 amino acids and a gap character:

$$ P(\text{seq}) = P(x_1, \dots, x_N) $$

where $N$ is the length of the sequence. So we now have a complex joint probability distribution of a sequence composed of a number of dependent variables. Following the chain rule of probability we can decompose this joint distribution into conditional probabilities for each individual residue given the residues that precede it:

$$ P(\boldsymbol{x}) = \prod_{i=1}^N P(x_i \mid x_1, \dots, x_{i-1}) $$

We can also marginalise over the rest of the sequence for any residue to get the probability of that residue:

$$ P(x_i = a) = \sum_{x_{-i}} P(x_i = a, x_{-i}) $$

This probabilistic formalism can be represented graphically as an undirected graph and is known as a Markov Random Field (MRF) in the field of machine learning. The characteristics of this are that every variable (in our case, every residue) is represented as a node in a graph with edges connecting residues that are dependent. For reasons that will be clearer later, every node in this graph is connected to one another meaning that every residue exhibits some non-zero dependence on every other residue in the sequence. This also leads to the problem of making the graph particularly difficult to solve for any realistically sized protein. The problem of solving (or approximating) this so called Potts model for different proteins is the challenge addressed by the papers I will discuss in this blog series.
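As a small worked example for this section, the Mutual Information between two alignment columns can be estimated from plain frequency counts using the standard definition $\text{MI} = \sum_{a,b} P(a,b)\log\frac{P(a,b)}{P(a)P(b)}$. The function name `mutual_information` and the toy columns are my own, and no pseudocounts or sequence weighting are applied.

```python
from collections import Counter
from math import log

def mutual_information(col_i, col_j):
    """MI (in nats) between two alignment columns given as equal-length strings."""
    M = len(col_i)
    count_i = Counter(col_i)
    count_j = Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), n_ab in count_ij.items():
        p_ab = n_ab / M
        # Pairs never observed together contribute nothing to the sum
        mi += p_ab * log(p_ab / ((count_i[a] / M) * (count_j[b] / M)))
    return mi

# Perfectly co-varying columns versus statistically independent ones
mi_coupled = mutual_information("AACC", "GGTT")
mi_independent = mutual_information("AACC", "GTGT")
```

The co-varying pair gives $\log 2$ nats and the independent pair gives 0. MI is a purely pairwise statistic, so it cannot tell whether such a correlation is direct or mediated by a third column.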

Learning the Potts Model

To keep this first blog post to a reasonable size I will briefly discuss the analytical form of the Potts model before walking through some of the first methods for approximating the model for proteins in subsequent blogs.

How do we parameterise a probabilistic model? Mathematically, parameterisation means defining a functional form for our probability distribution from which we can adjust the parameters controlling the characteristics of the distribution to best fit some data.

We previously mentioned the vast amount of sequence data available since the explosion in high throughput sequencing - this data is already used to parameterise HMMs. HMMs are typically generated from alignments of similar sequences, multiple sequence alignments, by considering each individual column in the alignment (which refers to a distinct position in the query sequence unless gapped) and computing some empirical frequencies that we use to infer the HMM parameters. I will hopefully have a separate post describing HMMs in more detail later - the point is that we parameterise our models with empirical data.

Of course, this means that the data we use has a significant impact on the parameterisation and therefore accuracy of our model. We want our model to capture the rules that define our sequences and their relationship with each other so the more data we can obtain the more accurately our distribution can be parameterised. Although standard sequence alignments are quite capable of collecting up to thousands of sequence homologs for a given query sequence, for the Potts model we really want within the region of 10,000s. One of the benefits of HMMs over standard sequence alignment is they are better at identifying sequences that are likely functional or structurally similar (homologues) without necessarily being similar in exact sequence composition (typical alignment methods) meaning we can capture much more information on the query protein by collecting more of its related (often evolutionarily ancestral) sequences together. So the first step of learning our model is collecting all the related sequences returned from the query and stacking them into a large multiple sequence alignment.

Parameterising the Potts model

Let’s look at this from a frequentist approach. We have 10,000s of homologous protein sequences where the alignment algorithm has inserted gaps to represent insertions/deletions, making all sequences (rows) the same length. This is a matrix of size $m \times n$ where $m$ is the number of sequences (rows) in the alignment and $n$ is the length (columns) of the (gapped) query sequence. The classic frequentist approach would be to compute the per-residue statistics at each column $f_i(a)$ where $i$ is the column index and $a$ the residue, which can take values from $a \in \{1,...,q\}$ with $q$ being the length of the alphabet, typically $21$. The expectation is that proteins from the same family would have similar alignments and therefore similar per-residue single site frequencies. The more basic Position-Specific Scoring Matrix (PSSM) uses these to compute a log-odds score by comparing to the frequency of the given amino acid in all proteins.
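A PSSM of this kind can be sketched as follows. The function name `pssm`, the four-letter toy alphabet, the uniform background and the add-one pseudocount are all assumptions made for the example. A real PSSM would use background frequencies measured across a large protein database.

```python
from collections import Counter
from math import log2

def pssm(msa, alphabet, background, pseudocount=1.0):
    """Per-column log-odds scores (in bits) of observed vs background frequency."""
    m, n, q = len(msa), len(msa[0]), len(alphabet)
    scores = []
    for i in range(n):
        counts = Counter(seq[i] for seq in msa)
        column = {}
        for a in alphabet:
            f = (counts[a] + pseudocount) / (m + pseudocount * q)
            column[a] = log2(f / background[a])
        scores.append(column)
    return scores

# Toy alignment over a hypothetical 4-letter alphabet with a uniform background
toy_alphabet = "ACDE"
toy_background = {a: 0.25 for a in toy_alphabet}
toy_scores = pssm(["AC", "AD", "AE", "AC"], toy_alphabet, toy_background)
```

Column 0 is fully conserved, so `A` scores positively there and every other residue scores negatively, while column 1 is more variable and its scores are closer to zero.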

This is a valid model but it doesn’t capture any of the pairwise dependencies we are interested in. So how about we generate a second matrix (or in this case tensor) which captures the pairwise statistics $f_{ij}(a,b)$ - meaning the frequency of residue $a$ at position $x_i$ whilst residue $b$ is present at position $x_j$. Graph networks provide a powerful solution to modelling such large pairwise probabilistic models by employing a system of nodes and edges where an edge links two nodes that are conditionally dependent (as in $P(A,B) \neq P(A)P(B)$). In this way we configure the graph so that any independent variables do not have linking edges and we can design an algorithm that is capable of computing the joint distribution without marginalising over the entire graph.

However our protein model is not capable of making assumptions about the independence of residue pairs, as we are trying to determine these inter-residue dependencies. One approach to take is the Maximum Entropy approach, where we take the simplest possible parameterisation and fit our parameters to satisfy the empirical data in the multiple sequence alignment (single site and pairwise frequencies). This is the formal Potts model, whose form is a consequence of the Maximum Entropy approach:

$$ P(x) = \frac{1}{Z} \exp \left(\sum_{i=1}^n h_i(x_i) + \sum_{1\leq i<j \leq n} J_{ij}(x_i,x_j)\right) $$

Let’s take it apart bit by bit. Inside the brackets there are two summation terms. The first contains our single-site parameters $h_i(x_i)$, a function for each column $i$ that takes a possible amino acid and returns a value. The second term contains our pairwise parameters $J_{ij}(x_i,x_j)$, a function for each unique pair of positions that takes an amino acid input for each position $i,j$. Notice the subscript of the second summation: the pairwise parameters $J_{ij}$ form a pairwise matrix that is symmetrical as $J_{ij} = J_{ji}$, so we explicitly sum over unique pairs only, which is what the subscript dictates. $Z$ is just a normalising factor that makes sure the resultant model is a valid probability distribution that sums to 1. In reality, $Z$, known as the partition function in statistical mechanics, is our biggest problem with this approach as it represents a summation over all possible configurations of the system. For any reasonably sized protein this is intractable due to the $21^{n}$ possible configurations, so we must approximate $Z$. Luckily this is a problem thoroughly explored by physicists so we can draw on them for inspiration.
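To see both the structure of the exponent and why $Z$ is the bottleneck, here is a minimal sketch that evaluates the model by brute-force enumeration on a toy system. The function names and the parameter layout (`h[i][a]` for fields, `J[(i, j)][a][b]` for couplings with $i<j$) are my own choices for illustration.

```python
from itertools import product
from math import exp

def potts_log_weight(x, h, J):
    """Exponent of the Potts model: sum_i h_i(x_i) + sum_{i<j} J_ij(x_i, x_j)."""
    n = len(x)
    total = sum(h[i][x[i]] for i in range(n))
    total += sum(J[(i, j)][x[i]][x[j]] for i in range(n) for j in range(i + 1, n))
    return total

def partition_function(h, J, q):
    """Z by explicit summation over all q**n sequences; feasible for toy n only."""
    n = len(h)
    return sum(exp(potts_log_weight(x, h, J)) for x in product(range(q), repeat=n))

def probability(x, h, J, q):
    return exp(potts_log_weight(x, h, J)) / partition_function(h, J, q)

# Toy system: n = 3 positions, q = 2 states, hypothetical parameter values
q, n = 2, 3
h_toy = [[0.1, -0.1], [0.0, 0.3], [-0.2, 0.2]]
J_toy = {(0, 1): [[0.5, -0.5], [-0.5, 0.5]],
         (0, 2): [[0.0, 0.0], [0.0, 0.0]],
         (1, 2): [[-0.3, 0.3], [0.3, -0.3]]}

total_probability = sum(probability(x, h_toy, J_toy, q)
                        for x in product(range(q), repeat=n))
n_terms_real_protein = 21 ** 200   # terms in Z for a 200-residue protein
```

The toy sum has only $2^3 = 8$ terms, whereas `n_terms_real_protein` is a number with several hundred digits, which is why the rest of this series is about approximations.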

Indeed, this form is particularly interesting because it exactly resembles something called the Ising model (pronounced ees-ing model as I’m told). The Ising model emerged from condensed matter physics where it was developed to statistically model lattices of particles exhibiting either up or down spins. The model was capable of capturing non-local interactions between these particles. The Potts model is essentially an expansion of the Ising model beyond binary variables. In other words, we have a model that specialises in capturing long range dependencies between residues in a protein sequence -> epistasis!

Under the max entropy approach, with no constraints at all the most probable distribution is uniform, so the single-site and pairwise marginals would be $1/q$ and $1/q^2$ for every column and unique column pair respectively. We then “learn” the optimal values of $h_i$ and $J_{ij}$ from our sequence alignments so that the parameters satisfy the empirical data, or marginals:

$$ \sum_x P(x) \delta_i(x_i = a) = f_i(a) $$

$$ \sum_x P(x) \delta_{ij}(x_i = a, x_j = b) = f_{ij}(a,b) $$

Where the Kronecker delta function $\delta_i(\cdot)$ is equal to 1 if the condition in the arguments is true and 0 if not. In other words, we are free to choose parameters so long as the respective site-wise marginals are enforced. This is possible because there is a certain degree of overparameterisation in this model. Specifically, we have $q \times n$ possible single site values and roughly $\frac{q^2n^2}{2}$ possible pairwise site parameters. For a protein of 200 residues this is 4,200 + 8,820,000 = 8,824,200 possible unique parameter values, whilst we are typically working with multiple sequence alignments of around $10^4$ sequences, far fewer than the number of parameters.

Realistically, many unique parameterisations give the same probability distribution - imagine shifting all $h_i$ values by some value like 0.1 in the same direction, the distribution stays the same but the values are different. To account for this we utilise gauge invariance where we fix a gauge such as the Ising gauge:

$$ \sum_{a=1}^q h_i(a) = 0 \quad \forall i $$

$$ \sum_{a=1}^q J_{ij}(a,b) = 0 \quad \forall j,b, \quad \sum_{b=1}^q J_{ij}(a,b) = 0 \quad \forall i,a $$

Which enforces that the single-site fields sum to 0 at each site and that each row and column of the coupling matrix $J_{ij}$ sums to 0. This choice is also commonly referred to as the zero-sum gauge, with its coupling condition written compactly as:

$$ \sum_a J_{ij}(a,b) = \sum_b J_{ij}(a,b) = 0 $$

Which enforces that all pairwise parameters for a given site sum to 0 so any form of parameter redundancy e.g. by shifting all pairwise values at a site by 0.1, is invalidated. Finally, in a similar light there is also the reference-state gauge which enforces a chosen residue’s parameters (typically the 21st or gap state) are equal to 0 so all fields and pairwise parameters are relative to it:

$$ h_i(q) = 0 \quad \forall i $$

$$ J_{ij}(a,q) = J_{ij}(q,b) = 0 \quad \forall i,j,a,b $$

This reduces the number of free parameters from $\binom{n}{2}q^2 + nq$ to $\binom{n}{2}(q-1)^2 + n(q-1)$, which is the same as the number of independent constraints, making the solution to our max ent. approach unique. Having established the analytical form of our distribution and asserted the uniqueness of its solution, the question is how we obtain the parameters $\boldsymbol{h}$ and $\boldsymbol{J}$. In the next blog I will address how we go about learning the parameters of this model, of which there are numerous approaches. I hope to see you there!
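For a sense of scale, the two parameter counts above can be evaluated directly. The helper name `n_parameters` is hypothetical, and the numbers assume a 200-residue protein with $q=21$.

```python
from math import comb

def n_parameters(length, q, gauge_fixed=False):
    """Count Potts parameters: one q x q block per site pair plus q fields per site.

    With a gauge fixed, only q - 1 values per site (and per block row/column) are free.
    """
    k = q - 1 if gauge_fixed else q
    return comb(length, 2) * k ** 2 + length * k

before = n_parameters(200, 21)                    # 8,780,100
after = n_parameters(200, 21, gauge_fixed=True)   # 7,964,000
removed = before - after                          # 816,100 redundant parameters
```

The exact count before gauge fixing is slightly lower than the 8,824,200 estimated earlier, because $\binom{n}{2}$ counts only the $n(n-1)/2$ unique pairs rather than $n^2/2$.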