The original method to obtain such potentials is the quasi-chemical approximation, due to Miyazawa and Jernigan.[2] It was later followed by the potential of mean force (statistical PMF [Note 1]), developed by Sippl.[3] Although the obtained scores are often considered as approximations of the free energy—thus referred to as pseudo-energies—this physical interpretation is incorrect.[4][5] Nonetheless, they are applied with success in many cases, because they frequently correlate with actual Gibbs free energy differences.[6]
Overview
Possible features to which a pseudo-energy can be assigned include:
The classic application is, however, based on pairwise amino acid contacts or distances, thus producing statistical interatomic potentials. For pairwise amino acid contacts, a statistical potential is formulated as an interaction matrix that assigns a weight or energy value to each possible pair of standard amino acids. The energy of a particular structural model is then the combined energy of all pairwise contacts (defined as two amino acids within a certain distance of each other) in the structure. The energies are determined using statistics on amino acid contacts in a database of known protein structures (obtained from the PDB).
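The contact-based scoring described above can be sketched in a few lines. This is a minimal illustration, not a published potential: the three-letter alphabet, the weights in `TOY_MATRIX`, the 6.5 Å cutoff, the `min_separation` filter and the function name `contact_energy` are all assumptions made for the example.

```python
from itertools import combinations
from math import dist

# Illustrative, made-up weights for a toy three-residue alphabet.
# A real interaction matrix would cover all pairs of the 20 standard amino acids
# and would be derived from contact statistics in known structures.
TOY_MATRIX = {
    ("L", "L"): -1.0,
    ("L", "S"): 0.2,
    ("L", "V"): -0.8,
    ("S", "S"): -0.1,
    ("S", "V"): 0.3,
    ("V", "V"): -0.7,
}


def contact_energy(residues, coords, matrix, cutoff=6.5, min_separation=2):
    """Sum the matrix weights over all residue pairs that are in contact.

    residues: sequence of one-letter amino acid codes
    coords:   one representative (x, y, z) point per residue
    A pair is in contact when its points lie within `cutoff` of each other.
    Pairs fewer than `min_separation` positions apart in the chain are skipped
    (an assumed convention to ignore trivially close chain neighbours).
    """
    total = 0.0
    for i, j in combinations(range(len(residues)), 2):
        if j - i < min_separation:
            continue
        if dist(coords[i], coords[j]) <= cutoff:
            key = tuple(sorted((residues[i], residues[j])))
            total += matrix[key]
    return total


# Four residues placed on the corners of a 3.8 x 3.8 square (hypothetical model).
residues = ["V", "S", "L", "L"]
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
score = contact_energy(residues, coords, TOY_MATRIX)
```

Lower scores correspond to models whose contacts are, according to the matrix, more favourable; ranking several candidate models by this score is the typical use.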
History
Initial development
Many textbooks present the statistical PMFs as proposed by Sippl [3] as a simple consequence of the Boltzmann distribution, as applied to pairwise distances between amino acids. This is incorrect, but it is a useful starting point for introducing the construction of the potential in practice.
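The Boltzmann distribution, applied to a specific pair of amino acids, is given by:
$$P(r) = \frac{1}{Z} e^{-\frac{F(r)}{kT}}$$
where $r$ is the distance, $k$ is the Boltzmann constant, $T$ is the temperature and $Z$ is the partition function, with
$$Z = \int e^{-\frac{F(r)}{kT}} \, dr$$
The quantity $F(r)$ is the free energy assigned to the pairwise system. Simple rearrangement results in the inverse Boltzmann formula, which expresses the free energy $F(r)$ as a function of $P(r)$:
$$F(r) = -kT \ln P(r) - kT \ln Z$$
To construct a PMF, one then introduces a so-called reference state with a corresponding distribution $Q_R$ and partition function $Z_R$, and calculates the following free energy difference:
$$\Delta F(r) = -kT \ln \frac{P(r)}{Q_R(r)} - kT \ln \frac{Z}{Z_R}$$
The reference state typically results from a hypothetical system in which the specific interactions between the amino acids are absent. The second term, involving $Z$ and $Z_R$, can be ignored, as it is a constant.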
In practice, $P(r)$ is estimated from the database of known protein structures, while $Q_R(r)$ typically results from calculations or simulations. For example, $P(r)$ could be the conditional probability of finding the C$\beta$ atoms of a valine and a serine at a given distance $r$ from each other, giving rise to the free energy difference $\Delta F$. The total free energy difference of a protein, $\Delta F_T$, is then claimed to be the sum of all the pairwise free energies:
$$\Delta F_T = \sum_{i<j} \Delta F(r_{ij} \mid a_i, a_j) = -kT \sum_{i<j} \ln \frac{P(r_{ij} \mid a_i, a_j)}{Q_R(r_{ij} \mid a_i, a_j)}$$
where the sum runs over all amino acid pairs $a_i, a_j$ (with $i<j$) and $r_{ij}$ is their corresponding distance. In many studies $Q_R$ does not depend on the amino acid sequence.[7]
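The construction above can be sketched numerically for a single pair type. The example estimates a binned free energy difference as minus kT times the logarithm of the ratio between the observed distance distribution and the reference distribution, then sums it over a set of distances. The counts, the bin width, the pseudocount and the value of `KT` are illustrative assumptions, and the function names are hypothetical.

```python
import math

KT = 0.593  # approximately kT in kcal/mol near 298 K; an assumed unit choice


def pairwise_pmf(observed_counts, reference_counts, kT=KT, pseudocount=1.0):
    """Return one free energy difference per distance bin.

    Each value is -kT * ln(P / Q_R), where P and Q_R are the normalised
    observed and reference histograms. The pseudocount (an assumption of this
    sketch) avoids taking the logarithm of zero for empty bins.
    """
    n_bins = len(observed_counts)
    obs_total = sum(observed_counts) + pseudocount * n_bins
    ref_total = sum(reference_counts) + pseudocount * n_bins
    pmf = []
    for obs, ref in zip(observed_counts, reference_counts):
        p = (obs + pseudocount) / obs_total
        q = (ref + pseudocount) / ref_total
        pmf.append(-kT * math.log(p / q))
    return pmf


def total_score(distances, pmf, bin_width=1.0):
    """Sum the binned values over a list of pairwise distances.

    Distances beyond the last bin contribute nothing.
    """
    total = 0.0
    for r in distances:
        b = int(r / bin_width)
        if b < len(pmf):
            total += pmf[b]
    return total


# Made-up histograms for one pair type (for example valine-serine), 1 A bins.
observed = [5, 40, 30, 25]
reference = [25, 25, 25, 25]
pmf = pairwise_pmf(observed, reference)
score = total_score([1.5, 0.5, 9.0], pmf)
```

Bins that are over-represented relative to the reference receive negative values and under-represented bins receive positive ones. A full potential would hold one such table per amino acid pair type.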
Conceptual issues
Intuitively, it is clear that a low value for $\Delta F_T$ indicates that the set of distances in a structure is more likely in proteins than in the reference state. However, the physical meaning of these statistical PMFs has been widely disputed since their introduction.[4][5][8][9] The main issues are:
The wrong interpretation of this "potential" as a true, physically valid potential of mean force;
The nature of the so-called reference state and its optimal formulation;
The validity of generalizations beyond pairwise distances.
Controversial analogy
In response to the issue regarding the physical validity, the first justification of statistical PMFs was attempted by Sippl.[10] It was based on an analogy with the statistical physics of liquids. For liquids, the potential of mean force is related to the radial distribution function $g(r)$, which is given by:[11]
$$g(r) = \frac{P(r)}{Q_R(r)}$$
where $P(r)$ and $Q_R(r)$ are the respective probabilities of finding two particles at a distance $r$ from each other in the liquid and in the reference state. For liquids, the reference state is clearly defined; it corresponds to the ideal gas, consisting of non-interacting particles. The two-particle potential of mean force $W(r)$ is related to $g(r)$ by:
$$W(r) = -kT \ln g(r) = -kT \ln \frac{P(r)}{Q_R(r)}$$
According to the reversible work theorem, the two-particle potential of mean force $W(r)$ is the reversible work required to bring two particles in the liquid from infinite separation to a distance $r$ from each other.[11]
Sippl justified the use of statistical PMFs, a few years after he introduced them for use in protein structure prediction, by appealing to the analogy with the reversible work theorem for liquids. For liquids, $g(r)$ can be experimentally measured using small angle X-ray scattering; for proteins, $P(r)$ is obtained from the set of known protein structures, as explained in the previous section. However, as Ben-Naim wrote in a publication on the subject:[5]
[...] the quantities, referred to as "statistical potentials," "structure
based potentials," or "pair potentials of mean force", as derived from
the protein data bank (PDB), are neither "potentials" nor "potentials of
mean force," in the ordinary sense as used in the literature on
liquids and solutions.
Moreover, this analogy does not solve the issue of how to specify a suitable reference state for proteins.
Machine learning
In the mid-2000s, authors started to combine multiple statistical potentials, derived from different structural features, into composite scores.[12] For that purpose, they used machine learning techniques, such as support vector machines (SVMs). Probabilistic neural networks (PNNs) have also been applied for the training of a position-specific distance-dependent statistical potential.[13] In 2016, the DeepMind artificial intelligence research laboratory started to apply deep learning techniques to the development of a torsion- and distance-dependent statistical potential.[14] The resulting method, named AlphaFold, won the 13th Critical Assessment of Techniques for Protein Structure Prediction (CASP) by predicting the most accurate structure for 25 out of 43 free modelling domains.
Explanation
Bayesian probability
Baker and co-workers [15] justified statistical PMFs from a Bayesian point of view and used these insights in the construction of the coarse grained ROSETTA energy function. According to Bayesian probability calculus, the conditional probability $P(X \mid A)$ of a structure $X$, given the amino acid sequence $A$, can be written as:
$$P(X \mid A) = \frac{P(A \mid X) P(X)}{P(A)} \propto P(A \mid X) P(X)$$
$P(X \mid A)$ is proportional to the product of the likelihood $P(A \mid X)$ times the prior $P(X)$. By assuming that the likelihood can be approximated as a product of pairwise probabilities, and applying Bayes' theorem, the likelihood can be written as:
$$P(A \mid X) \approx \prod_{i<j} P(a_i, a_j \mid r_{ij}) \propto \prod_{i<j} \frac{P(r_{ij} \mid a_i, a_j)}{P(r_{ij})}$$
where the product runs over all amino acid pairs $a_i, a_j$ (with $i<j$), and $r_{ij}$ is the distance between amino acids $i$ and $j$. The negative of the logarithm of this expression has the same functional form as the classic pairwise distance statistical PMFs, with the denominator playing the role of the reference state. This explanation has two shortcomings: it relies on the unfounded assumption that the likelihood can be expressed as a product of pairwise probabilities, and it is purely qualitative.
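The correspondence with the classic pairwise PMF can be made explicit with one worked step. The sketch below assumes the pairwise factorization of the likelihood described above, writes $a_i, a_j$ for the amino acid types of a pair and $r_{ij}$ for their distance, and treats the pair frequency $P(a_i, a_j)$ as a constant because it does not depend on the structure.

```latex
% Bayes' theorem applied to a single pair:
P(a_i, a_j \mid r_{ij}) = \frac{P(r_{ij} \mid a_i, a_j)\, P(a_i, a_j)}{P(r_{ij})}

% Negative logarithm of the pairwise product, with structure-independent terms
% collected into a constant:
-\ln P(A \mid X) \approx -\sum_{i<j} \ln \frac{P(r_{ij} \mid a_i, a_j)}{P(r_{ij})} + \text{const}
```

Up to the factor $kT$ and the additive constant, the sum has the form of a total pairwise free energy difference in which the sequence-independent distance distribution $P(r_{ij})$ stands in for the reference state.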
Probability kinematics
Hamelryck and co-workers [6] later gave a quantitative explanation for the statistical potentials, according to which they approximate a form of probabilistic reasoning due to Richard Jeffrey, known as probability kinematics. This variant of Bayesian thinking (sometimes called "Jeffrey conditioning") allows updating a prior distribution based on new information on the probabilities of the elements of a partition on the support of the prior. From this point of view, (i) it is not necessary to assume that the database of protein structures used to build the potentials follows a Boltzmann distribution, (ii) statistical potentials generalize readily beyond pairwise distances, and (iii) the reference ratio is determined by the prior distribution.
Reference ratio
Expressions that resemble statistical PMFs naturally result from the application of probability theory to solve a fundamental problem that arises in protein structure prediction: how to improve an imperfect probability distribution $Q(X)$ over a first variable $X$ using a probability distribution $P(Y)$ over a second variable $Y$, with $Y = f(X)$.[6] Typically, $X$ and $Y$ are fine and coarse grained variables, respectively. For example, $Q(X)$ could concern the local structure of the protein, while $P(Y)$ could concern the pairwise distances between the amino acids. In that case, $X$ could for example be a vector of dihedral angles that specifies all atom positions (assuming ideal bond lengths and angles).
In order to combine the two distributions, such that the local structure will be distributed according to $Q(X)$, while the pairwise distances will be distributed according to $P(Y)$, the following expression is needed:
$$P(X) = \frac{P(Y)}{Q(Y)} Q(X)$$
where $Q(Y)$ is the distribution over $Y$ implied by $Q(X)$. Typically, $Q(X)$ is brought in by sampling (usually from a fragment library) and is not explicitly evaluated; the ratio $P(Y)/Q(Y)$, which in contrast is explicitly evaluated, corresponds to Sippl's PMF. This explanation is quantitative, and allows the generalization of statistical PMFs from pairwise distances to arbitrary coarse grained variables. It also provides a rigorous definition of the reference state, which is implied by $Q(X)$. Conventional applications of pairwise distance statistical PMFs usually lack two necessary features to make them fully rigorous: the use of a proper probability distribution over pairwise distances in proteins, and the recognition that the reference state is rigorously defined by $Q(Y)$, the distribution implied by $Q(X)$.
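A small discrete example may make the reference ratio concrete. The sketch below multiplies a fine grained distribution by the ratio of the target coarse grained distribution to the coarse grained distribution it implies, where the coarse variable is a function `f` of the fine one. The four-state variable, the coarse-graining function `coarse` and all probabilities are made-up assumptions; in practice the fine grained distribution is usually only sampled, not enumerated.

```python
from collections import defaultdict


def reference_ratio(q_x, p_y, f):
    """Combine a fine grained Q(X) with a target coarse grained P(Y), Y = f(X).

    Returns P(x) = P(f(x)) / Q(f(x)) * Q(x) for every state x, where Q(Y) is the
    distribution over Y implied by Q(X) (the reference state).
    """
    q_y = defaultdict(float)
    for x, q in q_x.items():
        q_y[f(x)] += q
    return {x: p_y[f(x)] / q_y[f(x)] * q for x, q in q_x.items()}


def marginal(p_x, f):
    """Distribution over Y = f(X) implied by a distribution over X."""
    out = defaultdict(float)
    for x, p in p_x.items():
        out[f(x)] += p
    return dict(out)


def coarse(x):
    # Hypothetical coarse graining: states 0, 1 map to 0 and states 2, 3 map to 1.
    return x // 2


q_x = {0: 0.1, 1: 0.3, 2: 0.2, 3: 0.4}  # imperfect fine grained model
p_y = {0: 0.7, 1: 0.3}                  # desired coarse grained distribution

p_x = reference_ratio(q_x, p_y, coarse)
p_y_check = marginal(p_x, coarse)
```

After the update the coarse grained marginal equals the target, while the relative probabilities of fine grained states that share the same coarse value are unchanged, which is the behaviour expected of Jeffrey conditioning.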
Applications
Statistical potentials are used as energy functions in the assessment of an ensemble of structural models produced by homology modeling or protein threading. Many differently parameterized statistical potentials have been shown to successfully identify the native state structure from an ensemble of decoy or non-native structures.[16] Statistical potentials are not only used for protein structure prediction, but also for modelling the protein folding pathway.[17][18]
Sippl MJ (1990). "Calculation of conformational ensembles from potentials of mean force. An approach to the knowledge-based prediction of local structures in globular proteins". J Mol Biol. 213 (4): 859–883. doi:10.1016/s0022-2836(05)80269-4. PMID 2359125.
Thomas PD, Dill KA (1996). "Statistical potentials extracted from protein structures: how accurate are they?". J Mol Biol. 257 (2): 457–469. doi:10.1006/jmbi.1996.0175. PMID 8609636.
Ben-Naim A (1997). "Statistical potentials extracted from protein structures: Are these meaningful potentials?". J Chem Phys. 107 (9): 3698–3706. Bibcode:1997JChPh.107.3698B. doi:10.1063/1.474725.
Rooman M, Wodak S (1995). "Are database-derived potentials valid for scoring both forward and inverted protein folding?". Protein Eng. 8 (9): 849–858. doi:10.1093/protein/8.9.849. PMID 8746722.
Koppensteiner WA, Sippl MJ (1998). "Knowledge-based potentials–back to the roots". Biochemistry Mosc. 63 (3): 247–252. PMID 9526121.
Sippl MJ, Ortner M, Jaritz M, Lackner P, Flockner H (1996). "Helmholtz free energies of atom pair interactions in proteins". Fold Des. 1 (4): 289–298. doi:10.1016/s1359-0278(96)00042-9. PMID 9079391.
Chandler D (1987). Introduction to Modern Statistical Mechanics. New York: Oxford University Press, USA.
Simons KT, Kooperberg C, Huang E, Baker D (1997). "Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions". J Mol Biol. 268 (1): 209–225. CiteSeerX 10.1.1.579.5647. doi:10.1006/jmbi.1997.0959. PMID 9149153.