
First we will go back to basics and wonder what structures are and why we determine them. If you have a reasonable knowledge of structural biology already, you can skip this page! (Although you will probably want to have a look at the three movies at the bottom of this page anyway ...)

Attention Copenhagen students (January, 2009)!

This page is optional - it summarises things most of you probably already know. (But check out the movies at the bottom of this page anyway!)


What is a structure?

A model of a structure is a three-dimensional representation of a molecule that contains information about the spatial arrangements of groups of atoms.

There are actually different levels of structure:

- low-resolution models, which reveal only the overall shape (the envelope) of a molecule;
- medium-resolution models, which reveal the fold (the path of the backbone through space);
- high-resolution models, which reveal the positions of (almost) all individual atoms.

It is important to realise that all three can help answer biological questions, although in order to answer questions at the level of atoms (enzyme catalysis, ligand-binding, protein-protein interaction, protein-nucleic acid interaction, etc.) high resolution is obviously the most useful.

Apropos the concept of "resolution": crystallographers tend to quantify the resolution of their models based on the quality and quantity of their experimental data. You should remember that small numbers signify high resolution and vice versa. For instance, 4 Å is considered low resolution for a crystal structure, whereas 1 Å is very high resolution!

Q. 1. Do you expect to see more atomic detail in a 2 Å structure or in a 3 Å structure?


Why determine structures?

Why should we determine structures of biomacromolecules? In general, we do this because we want to answer biological questions, and understand biological processes at the atomic level (in terms of chemistry and physics).

The knowledge we gain from structure determination finds many applications. Understanding a structure (apart from satisfying our intellectual curiosity) usually allows us to explain or interpret previous experiments (pertaining to activity, specificity, effect of point mutations, etc.), and it sometimes enables us to suggest further experiments to answer yet other questions. In other cases, it enables us to suggest alterations in natural systems (e.g., a point mutation to improve the thermostability of an enzyme) that improve such systems or at least modify them in a predictable (we hope ...) fashion. Alternatively, we may be able to design compounds that interfere with a natural system or process (e.g., a compound that binds to and inhibits an enzyme that is crucial for the survival of a pathogenic bacterium).

Since the work of Anfinsen (PubMed) we know that protein structure is determined by the amino-acid sequence. Unfortunately, we still don't know the "rules" that Nature uses to determine structure from sequence (it's a pretty safe bet to assume that there's a Nobel Prize waiting for you if you manage to solve this "protein folding problem"). In addition, structure is modulated by the environment (pH, solvent, temperature, ...) and by interactions with other (small and large) molecules (substrate, inhibitor, ion, DNA, ligand, ...). Furthermore, the picture is complicated by metal ion centres, chaperones, isomerases, oligomerisation, domain swapping, glycosylation, etc. And even if none of these factors play a role, there is still the fact that proteins are not static entities. At temperatures above absolute zero, there is always thermal motion of the atoms.

You might counter that proteins are just simple organic molecules (except more so), and since we know the typical values for single carbon-carbon bond lengths, etc., we should be able to calculate an approximate structure by simply using typical bond lengths and bond angles. Unfortunately, the information contained in protein structures lies essentially in the conformational torsion angles. Even if we assume that every amino-acid residue has only three such torsion angles (phi, psi and chi-1, say), and that each of these three can only assume one of three "ideal" values (e.g., 60, 180 and -60 degrees for chi-1), this still leaves us with 27 possible conformations per residue. For a typical 200-residue protein this works out to 27^200, which is roughly 1.87 × 10^286 possible conformations. But can't we just generate all these conformations, calculate their energy and see which conformation has the lowest energy? Well, even if we had a perfect force field (a way to calculate the energy of the protein), and if we were able to evaluate 10^9 conformations per second, this would still keep us busy for 4 × 10^259 times the current age of the universe (give or take a day or two).
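(You can check this arithmetic yourself with a few lines of Python. The counting scheme and the evaluation rate below are just the illustrative assumptions from the paragraph above, not real physics.)

    # Back-of-the-envelope check of the numbers above: 3 torsion angles per
    # residue, 3 "ideal" values each, 200 residues, and an assumed rate of
    # 10^9 conformation-energy evaluations per second.

    conformations_per_residue = 3 ** 3                 # phi, psi, chi-1
    total = conformations_per_residue ** 200           # 27^200

    rate = 1e9                                         # conformations per second
    age_of_universe_s = 13.8e9 * 365.25 * 24 * 3600    # ~4.35e17 seconds

    universe_ages = (total / rate) / age_of_universe_s

    print(f"total conformations: {float(total):.2e}")           # ~1.87e+286
    print(f"ages of the universe needed: {universe_ages:.1e}")  # ~4.3e+259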

There are cleverer ways to predict protein structures without collecting large amounts of experimental data. If a protein's sequence has more than ~40% identical residues to the sequence of another protein whose structure is known, then a reasonable model can often be generated using so-called homology modelling (a.k.a. comparative modelling) techniques. At lower levels of sequence identity other techniques (fold recognition through threading or profile methods, and entirely ab initio predictions) can be used, but the results on the whole are still rather poor. (See the results of the biennial CASP meetings.)


Experimental structure determination

In practice, most biomolecular structures are determined using one of three techniques:

- X-ray crystallography;
- nuclear magnetic resonance (NMR) spectroscopy;
- electron microscopy.

However, none of these techniques allows us to calculate "the structure" directly from the data. In the case of X-ray crystallography one obtains the distribution of the electrons (rather than of the nuclei), and the major source of information obtained from NMR experiments concerns limits on the distances between pairs of protons (hydrogen-atom nuclei). This means that there is always an element of subjective (and error-prone) interpretation of the experimental data that leads to a model - a hypothesis of the structure that gave rise to the experimental data that we collected. This is also the major reason why validation of such models is very important.

In the case of X-ray crystallography, one obtains a so-called electron-density map - the distribution of the electrons in space. Where there are many electrons (and, hence, heavier atoms) the density is higher than in places where (on average) there are few electrons.

The blue "chicken wire" contour lines show the electron density contoured at its root-mean-square (RMS) level. The resolution is reasonably high (~1.8 Å) and it is fairly easy to see that this blob of density is likely to be due to a leucine residue. The atomic model that has been built into the density has yellow carbon atoms, red oxygen atoms and blue nitrogen atoms.

In recent years the initial crystallographic model-building process has been partially automated. If good high-resolution data is available, these methods can build almost complete models without human intervention. However, many interesting proteins and complexes do not crystallise in such a way that high-resolution data can be collected. In general, the model-building process becomes more and more difficult as the resolution of the data becomes lower. Simultaneously, the probability of making mistakes increases at lower resolution.

Another (more technical) problem that complicates model building is that of phase error. In short, there are two parts to the quantities that are needed to calculate the electron density, but only one of these (intensity) can be measured. The other (phase) has to be derived somehow (don't worry about the "somehow"). Unfortunately, the impact of these phases on the appearance of the electron density is much greater than that of the intensities (see the "Animal Magic" in Kevin Cowtan's Book of Fourier). In fact, with perfect phases, even at ~4 Å resolution map interpretation would not be a major problem (indeed, this situation often arises in the study of viruses).
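To get a feeling for why the phases dominate, here is a small Python (numpy) sketch in the spirit of Cowtan's "Animal Magic" and of the movies at the bottom of this page. Two synthetic two-dimensional "maps" (a disc and a cross, both made up purely for illustration) are Fourier-transformed, and the amplitudes of one are combined with the phases of the other; the hybrid map resembles the phase donor, not the amplitude donor.

    # Combine the Fourier AMPLITUDES of one synthetic "map" with the PHASES
    # of another; the hybrid should resemble the phase donor (the cross) far
    # more than the amplitude donor (the disc).
    import numpy as np

    n = 128
    y, x = np.mgrid[0:n, 0:n]

    disc = (((x - n / 2) ** 2 + (y - n / 2) ** 2) < (n / 6) ** 2).astype(float)
    cross = ((abs(x - n / 2) < n / 16) | (abs(y - n / 2) < n / 16)).astype(float)

    F_disc = np.fft.fft2(disc)
    F_cross = np.fft.fft2(cross)

    # amplitudes from the disc, phases from the cross
    hybrid = np.fft.ifft2(np.abs(F_disc) * np.exp(1j * np.angle(F_cross))).real

    def corr(a, b):
        """Linear correlation coefficient between two maps."""
        return np.corrcoef(a.ravel(), b.ravel())[0, 1]

    print(f"correlation with the phase donor (cross): {corr(hybrid, cross):.2f}")
    print(f"correlation with the amplitude donor (disc): {corr(hybrid, disc):.2f}")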


Refinement and rebuilding

In most cases, the initial model that a crystallographer (or a computer program) builds on the basis of the experimental data will contain errors (that one wants to identify and correct) and it will be incomplete (e.g., there will be missing loops, ligands, water molecules). In other words: the initial model is almost invariably inaccurate and imprecise. In order to be able to draw conclusions regarding the biologically interesting aspects of the model, it needs to be improved. This is usually an iterative process involving:

- refinement of the model, i.e. adjusting its parameters to improve the agreement with the experimental data;
- inspection of the model and the electron-density maps to detect errors and missing parts;
- manual rebuilding of the model to correct the errors and to add the missing parts.

These three steps are usually iterated until the model is as complete as the data will allow, until the errors in the model have been removed as much as is humanly possible, and until no more improvement can be obtained by further refinement.
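Schematically, the rebuild-and-refine cycle can be written down as a small Python sketch. The three worker functions below are hypothetical placeholders (in real life, refinement is carried out by dedicated programs and rebuilding is done interactively by a crystallographer at a graphics workstation); only the loop structure and the stopping criteria mirror the description above.

    # A schematic of the iterative rebuild/refine cycle described above.
    # The three worker functions are hypothetical placeholders, not real APIs.

    def refine(model, data):
        """Optimise the model's parameters against the experimental data."""
        return model  # placeholder

    def find_problems(model, data):
        """Inspect the model and maps; return a list of errors/missing parts."""
        return []     # placeholder: pretend the model is now error-free

    def rebuild(model, problems):
        """Fix errors and add missing loops, ligands and water molecules."""
        return model  # placeholder

    def improve(model, data, max_cycles=20):
        for _ in range(max_cycles):
            model = refine(model, data)            # 1. refinement
            problems = find_problems(model, data)  # 2. inspection
            if not problems:                       # stop when nothing is left
                break                              #    to fix in the model
            model = rebuild(model, problems)       # 3. manual rebuilding
        return model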


See for yourself!

James Holton at Berkeley has produced a number of movies that demonstrate the importance of resolution, amplitudes and phases for the quality of the resulting electron-density map (his movie page can be found here; if you're a crystallographer, you may want to check out some of his other movies as well!). James has kindly given permission to incorporate a couple of his movies into this practical.

The effect of resolution

"This movie displays a calculated electron density map, contoured at 1 sigma, as the resolution limit is adjusted slowly from 0.5Å to 6Å. [...] The phases are perfect, and so are the amplitudes (R-factor = 0.0%) for all the resolutions displayed. Note that, even for a perfect map, you expect side chains to poke out of density at 3.5Å."

(Click on the image to start the movie. If it doesn't load or is very slow, you can also try the original version.)
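You can reproduce the gist of this movie in one dimension with a few lines of Python: two Gaussian blobs stand in for atoms 1.5 Å apart, and a sharp Fourier cut-off stands in for the resolution limit (all numbers below are made up for illustration). At a 1 Å cut-off the two atoms are resolved; at 3 Å they merge into a single blob.

    # Toy 1-D illustration of resolution truncation: keep only Fourier terms
    # with |s| <= 1/d_min and see whether two nearby "atoms" stay resolved.
    import numpy as np

    L = 20.0                              # box length in Å
    n = 2048
    x = np.linspace(0.0, L, n, endpoint=False)

    def atom(x0, width=0.3):              # a Gaussian stand-in for an atom
        return np.exp(-((x - x0) ** 2) / (2 * width ** 2))

    rho = atom(9.25) + atom(10.75)        # two "atoms" 1.5 Å apart

    freqs = np.fft.fftfreq(n, d=L / n)    # spatial frequencies in 1/Å

    def truncate(density, d_min):
        """Apply a resolution cut-off: discard all terms beyond 1/d_min."""
        F = np.fft.fft(density)
        F[np.abs(freqs) > 1.0 / d_min] = 0.0
        return np.fft.ifft(F).real

    i_atom = np.argmin(np.abs(x - 9.25))  # grid point at one atom
    i_mid = np.argmin(np.abs(x - 10.0))   # grid point midway between the atoms
    for d_min in (1.0, 3.0):
        r = truncate(rho, d_min)
        resolved = r[i_atom] > r[i_mid]   # a dip between the atoms = resolved
        print(f"{d_min:.1f} Å cut-off: {'two peaks' if resolved else 'one blob'}")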

The importance of amplitudes

"This movie displays the effect of calculating a map with "wrong" amplitudes. [...] The images in this movie represent the slow changing of all the amplitudes to a different set of randomly selected values while holding the phases constant. It is interesting to note that the map hardly changes at all until the R-factor gets higher than 30%. The maximum R-factor you can get for two random data sets is 75%, which is the end of the movie. Kinda spookey how it still looks traceable, isn't it? The resolution here is 1.5Å, and the phases are always perfect."

(Click on the image to start the movie. If it doesn't load or is very slow, you can also try the original version.)

The importance of phases

"This movie displays the effect of calculating a map with "wrong" phases. The "figure of merit" (cosine of the error in the phase) is displayed as "m". The images in this movie were calculated by merging a perfect calculated map with another map, calculated with the same amplitudes, but with phases obtained from a model with randomly positioned atoms. Merging these two maps always preserves the amplitudes, but changes the phases slowly to a new set of values. At what point do you think the map becomes untraceable? The resolution here is 1.5Å, and the R-factor is always 0.0%."

(Click on the image to start the movie. If it doesn't load or is very slow, you can also try the original version.)




Practical "Model Validation" - EMBO Bioinformatics Course - Uppsala 2001 - © 2001-2009 Gerard Kleywegt (Check links)

Latest update on 26 January, 2009.