11 Autopsy of a PDB file
Atom coordinates of protein and nucleic acid structures are distributed as PDB files. These are fixed-format text files with 80 columns per line, which has the advantage of being platform independent.
The first six columns are reserved for a keyword describing the type of information that follows on the line (such as HEADER, JRNL, REMARK, ATOM, HETATM, and so on).
A typical PDB file starts with a header containing information about the entry, literature references, and additional remarks that may describe how the protein was crystallised, the resolution, and so on.
A partial example of a PDB file is given below; removed parts are marked by (...).
         1         2         3         4         5         6         7         8
12345678901234567890123456789012345678901234567890123456789012345678901234567890
--------------------------------------------------------------------------------
HEADER    OXIDOREDUCTASE(NAD(A)-CHOH(D))          12-APR-89   4MDH      4MDH   3
COMPND    CYTOPLASMIC MALATE DEHYDROGENASE (E.C.1.1.1.37)               4MDH   4
SOURCE    PORCINE (SUS $SCROFA) HEART                                   4MDH   5
AUTHOR    J.J.BIRKTOFT,L.J.BANASZAK                                     4MDH   6
REVDAT   3   15-APR-92 4MDHB   3       ATOM                             4MDHB  1
REVDAT   2   15-JAN-90 4MDHA   1       JRNL                             4MDHA  1
REVDAT   1   19-APR-89 4MDH    0                                        4MDH   7
SPRSDE     19-APR-89 4MDH      2MDH                                     4MDH   8
JRNL        AUTH   J.J.BIRKTOFT,G.RHODES,L.J.BANASZAK                   4MDH   9
JRNL        TITL   REFINED CRYSTAL STRUCTURE OF CYTOPLASMIC MALATE      4MDHA  2
JRNL        TITL 2 DEHYDROGENASE AT 2.5-*ANGSTROMS RESOLUTION           4MDHA  3
JRNL        REF    BIOCHEMISTRY                  V.  28  6065 1989      4MDHA  4
JRNL        REFN   ASTM BICHAW  US ISSN 0006-2960                  033  4MDHA  5
REMARK   1                                                              4MDH  14
REMARK   1 REFERENCE 1                                                  4MDH  15
REMARK   1  AUTH   J.J.BIRKTOFT,Z.FU,G.E.CARNAHAN,G.RHODES,             4MDH  16
REMARK   1  AUTH 2 S.L.RODERICK,L.J.BANASZAK                            4MDH  17
REMARK   1  TITL   COMPARISON OF THE MOLECULAR STRUCTURES OF            4MDH  18
REMARK   1  TITL 2 CYTOPLASMIC AND MITOCHONDRIAL MALATE DEHYDROGENASE   4MDH  19
REMARK   1  REF    TO BE PUBLISHED                                      4MDH  20
REMARK   1  REFN                                                   353  4MDH  21
(...)
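Because the record keyword always occupies the first six columns, a program can classify a line by slicing those columns rather than by splitting on whitespace (HETATM fills all six columns, so a five-digit atom number would follow it with no space in between). The following is a minimal Python sketch of that idea; the function names are illustrative, and the sample lines are abbreviated from the excerpt above.

```python
from collections import Counter

def record_type(line):
    """Return the keyword held in columns 1-6, with padding removed."""
    return line[:6].strip()

def count_record_types(lines):
    """Count how many lines of each record type appear (blank lines skipped)."""
    return Counter(record_type(line) for line in lines if line.strip())

# Lines abbreviated from the 4MDH excerpt; only columns 1-6 matter here.
sample = [
    "HEADER    OXIDOREDUCTASE(NAD(A)-CHOH(D))",
    "COMPND    CYTOPLASMIC MALATE DEHYDROGENASE (E.C.1.1.1.37)",
    "JRNL        AUTH   J.J.BIRKTOFT,G.RHODES,L.J.BANASZAK",
    "REMARK   1 REFERENCE 1",
    "REMARK   1  REF    TO BE PUBLISHED",
]

for keyword, n in count_record_types(sample).items():
    print(keyword, n)
```

The same slicing approach applies to every other record type, since each field has a fixed column range.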
The next section provides information on the amino-acid sequence of each chain. The current example contains two chains (A and B).
SEQRES   1 A  334  ACE SER GLU PRO ILE ARG VAL LEU VAL THR GLY ALA ALA  4MDH 163
SEQRES   2 A  334  GLY GLN ILE ALA TYR SER LEU LEU TYR SER ILE GLY ASN  4MDH 164
SEQRES   3 A  334  GLY SER VAL PHE GLY LYS ASP GLN PRO ILE ILE LEU VAL  4MDH 165
(...)
SEQRES  24 A  334  VAL GLU GLY LEU PRO ILE ASN ASP PHE SER ARG GLU LYS  4MDH 186
SEQRES  25 A  334  MET ASP LEU THR ALA LYS GLU LEU ALA GLU GLU LYS GLU  4MDH 187
SEQRES  26 A  334  THR ALA PHE GLU PHE LEU SER SER ALA                  4MDH 188
SEQRES   1 B  334  ACE SER GLU PRO ILE ARG VAL LEU VAL THR GLY ALA ALA  4MDH 189
SEQRES   2 B  334  GLY GLN ILE ALA TYR SER LEU LEU TYR SER ILE GLY ASN  4MDH 190
SEQRES   3 B  334  GLY SER VAL PHE GLY LYS ASP GLN PRO ILE ILE LEU VAL  4MDH 191
(...)
SEQRES  24 B  334  VAL GLU GLY LEU PRO ILE ASN ASP PHE SER ARG GLU LYS  4MDH 212
SEQRES  25 B  334  MET ASP LEU THR ALA LYS GLU LEU ALA GLU GLU LYS GLU  4MDH 213
SEQRES  26 B  334  THR ALA PHE GLU PHE LEU SER SER ALA                  4MDH 214
(...)
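To illustrate how the SEQRES records can be read back into one sequence per chain, here is a short Python sketch. It assumes the usual SEQRES layout (chain identifier in column 12, residue names in columns 20-70); the function names are illustrative, and mapping unknown residue names such as ACE to "X" is a choice made for this example only.

```python
# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

def sequences_from_seqres(lines):
    """Collect the residue names of each chain from SEQRES records."""
    chains = {}
    for line in lines:
        if line[:6].strip() != "SEQRES":
            continue
        chain_id = line[11]               # column 12
        residues = line[19:70].split()    # columns 20-70, entry ID excluded
        chains.setdefault(chain_id, []).extend(residues)
    return chains

def one_letter(residues):
    """Translate residue names to one-letter codes ('X' for anything else)."""
    return "".join(THREE_TO_ONE.get(name, "X") for name in residues)

# Three records from the excerpt, padded to column 72 before the entry ID.
sample = [
    "SEQRES   1 A  334  ACE SER GLU PRO ILE ARG VAL LEU VAL THR GLY ALA ALA".ljust(72) + "4MDH 163",
    "SEQRES   2 A  334  GLY GLN ILE ALA TYR SER LEU LEU TYR SER ILE GLY ASN".ljust(72) + "4MDH 164",
    "SEQRES   1 B  334  ACE SER GLU PRO ILE ARG VAL LEU VAL THR GLY ALA ALA".ljust(72) + "4MDH 189",
]

for chain_id, residues in sequences_from_seqres(sample).items():
    print(chain_id, one_letter(residues))
```

Slicing up to column 70 keeps the entry identifier in columns 73-80 out of the residue list.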
The next section contains optional information about HET groups (see the HETATM section that will follow for a more detailed description).
HET    NAD  A   1      44     NAD CO-ENZYME                             4MDH 219
HET    SUL  A   2       5     SULFATE                                   4MDH 220
HET    NAD  B   1      44     NAD CO-ENZYME                             4MDH 221
HET    SUL  B   2       5     SULFATE                                   4MDH 222
FORMUL   3  NAD    2(C21 H28 N7 O14 P2)                                 4MDH 223
FORMUL   4  SUL    2(O4 S1)                                             4MDH 224
FORMUL   5  HOH   *471(H2 O1)                                           4MDH 225
(...)
The next section describes secondary structure elements (HELIX, SHEET and TURN) as provided by the crystallographer. This assignment can be subjective, as the definition of these secondary structure elements is loose.
HELIX    1 1BA GLY A   13  LEU A   20  1                                4MDH 226
HELIX    2 2BA LEU A   20  GLY A   26  1                                4MDH 227
HELIX    3  CA MET A   45  ALA A   60  1                                4MDH 228
(...)
SHEET    1 S1A 6 LEU A  63  THR A  70  0                                4MDH 250
SHEET    2 S1A 6 PRO A  34  ASP A  41  1                                4MDH 251
SHEET    3 S1A 6 ILE A   4  GLY A  10  1                                4MDH 252
(...)
TURN     1  T1 VAL A   8  ALA A  11                                     4MDH 274
TURN     2  T2 GLY A  10  GLY A  13                                     4MDH 275
TURN     3  T3 GLY A  26  PHE A  29                                     4MDH 276
(...)
The next section gives crystallographic information: the unit cell and space group (CRYST1), followed by coordinate transformation matrices (ORIGX, SCALE and MTRIX).
CRYST1  139.200   86.600   58.800  90.00  90.00  90.00 P 21 21 2     8  4MDH 328
ORIGX1      1.000000  0.000000  0.000000        0.00000                 4MDH 329
ORIGX2      0.000000  1.000000  0.000000        0.00000                 4MDH 330
ORIGX3      0.000000  0.000000  1.000000        0.00000                 4MDH 331
SCALE1      0.007184  0.000000  0.000000        0.00000                 4MDH 332
SCALE2      0.000000  0.011547  0.000000        0.00000                 4MDH 333
SCALE3      0.000000  0.000000  0.017007        0.00000                 4MDH 334
MTRIX1   1 -0.865540  0.467810 -0.178880       55.21400    1            4MDH 335
MTRIX2   1  0.499790  0.829880 -0.248020       -1.79900    1            4MDH 336
MTRIX3   1  0.032420 -0.304070 -0.952100       89.13300    1            4MDH 337
(...)
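The SCALE records hold the matrix that converts orthogonal coordinates (in angstroms) into fractional coordinates of the unit cell. When all cell angles are 90 degrees, as in this entry, that matrix is diagonal and its entries are simply 1/a, 1/b and 1/c. The Python sketch below checks this against the excerpt; it covers only the orthogonal case with a zero translation vector, and the function names are illustrative.

```python
def scale_diagonal(a, b, c):
    """Diagonal of the SCALE matrix for a cell whose angles are all 90 degrees."""
    return (1.0 / a, 1.0 / b, 1.0 / c)

def to_fractional(xyz, cell_lengths):
    """Convert orthogonal coordinates (angstroms) to fractional coordinates.

    Assumes an orthogonal cell and a zero translation vector, as in the
    SCALE records of the excerpt.
    """
    return tuple(s * v for s, v in zip(scale_diagonal(*cell_lengths), xyz))

cell = (139.200, 86.600, 58.800)   # a, b, c taken from the CRYST1 record

print(["%.6f" % s for s in scale_diagonal(*cell)])
# ['0.007184', '0.011547', '0.017007'], the diagonal of SCALE1-SCALE3

# Fractional position of the CA atom of SER 1 (coordinates from the ATOM records).
print(to_fractional((12.901, 4.557, 33.573), cell))
```

For non-orthogonal cells the matrix has off-diagonal terms that depend on the cell angles, which this sketch does not attempt to compute.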
Finally, the atom coordinates of the amino acids (or nucleotides) are given. Each line starts with the ATOM keyword, followed by the atom number, atom name, residue name, chain identifier, residue number, X, Y and Z coordinates, occupancy, and B-factor. The B-factor (temperature factor) can be viewed as a measure of uncertainty, typically between 0 and 100: a low B-factor means that the position of the atom has been determined accurately. Typically, the B-factors of alpha carbons (CA) are lower than those of atoms located at side-chain extremities.
ATOM      1  C   ACE A   0      11.590   2.938  35.017  1.00 45.90      4MDHB  5
ATOM      2  O   ACE A   0      12.581   2.371  35.517  1.00 28.75      4MDHB  6
ATOM      3  CH3 ACE A   0      10.179   2.477  35.417  1.00 36.75      4MDHB  7
ATOM      4  N   SER A   1      11.648   3.946  34.081  1.00 49.10      4MDH 341
ATOM      5  CA  SER A   1      12.901   4.557  33.573  1.00 52.42      4MDH 342
ATOM      6  C   SER A   1      12.733   5.624  32.482  1.00 48.48      4MDH 343
ATOM      7  O   SER A   1      13.238   5.432  31.363  1.00 57.03      4MDH 344
ATOM      8  CB  SER A   1      13.990   3.553  33.162  1.00 41.45      4MDH 345
ATOM      9  OG  SER A   1      15.105   3.679  34.039  1.00 42.59      4MDH 346
ATOM     10  N   GLU A   2      12.073   6.774  32.772  1.00 37.72      4MDH 347
ATOM     11  CA  GLU A   2      11.948   7.788  31.721  1.00 20.88      4MDH 348
ATOM     12  C   GLU A   2      12.042   9.235  32.169  1.00 28.31      4MDH 349
ATOM     13  O   GLU A   2      11.285   9.654  33.030  1.00 14.56      4MDH 350
ATOM     14  CB  GLU A   2      10.925   7.482  30.621  1.00 18.66      4MDH 351
ATOM     15  CG  GLU A   2      10.188   8.729  30.102  1.00 39.41      4MDH 352
ATOM     16  CD  GLU A   2       8.693   8.532  30.110  1.00 55.62      4MDH 353
ATOM     17  OE1 GLU A   2       7.885   9.153  29.379  1.00 55.67      4MDH 354
ATOM     18  OE2 GLU A   2       8.352   7.589  30.997  1.00 68.00      4MDH 355
(...)
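The fields listed above sit at fixed column positions: atom number in columns 7-11, atom name in 13-16, residue name in 18-20, chain in 22, residue number in 23-26, X, Y and Z in 31-38, 39-46 and 47-54, occupancy in 55-60 and B-factor in 61-66. A minimal Python sketch of a parser built on those positions follows; the `Atom` type and `parse_atom` function are names chosen for this example.

```python
from typing import NamedTuple

class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float

def parse_atom(line):
    """Parse one ATOM or HETATM record using its fixed column positions."""
    return Atom(
        serial=int(line[6:11]),           # columns 7-11
        name=line[12:16].strip(),         # columns 13-16
        res_name=line[17:20].strip(),     # columns 18-20
        chain_id=line[21].strip(),        # column 22 (may be blank)
        res_seq=int(line[22:26]),         # columns 23-26
        x=float(line[30:38]),             # columns 31-38
        y=float(line[38:46]),             # columns 39-46
        z=float(line[46:54]),             # columns 47-54
        occupancy=float(line[54:60]),     # columns 55-60
        b_factor=float(line[60:66]),      # columns 61-66
    )

# The CA atom of SER 1 from the excerpt.
line = "ATOM      5  CA  SER A   1      12.901   4.557  33.573  1.00 52.42      4MDH 342"
atom = parse_atom(line)
print(atom.name, atom.res_name, atom.chain_id, atom.res_seq)
print(atom.x, atom.y, atom.z, atom.occupancy, atom.b_factor)
```

Slicing by column is more robust than splitting on whitespace, because adjacent fields can touch when a value fills its whole width and the chain column can be empty.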
Many enzymes are crystallised in the presence of cofactors or substrate analogues, which also have to be described. As the number of possible compounds is too large for each to have its own record type, a generic record named HETATM groups all atoms belonging to compounds other than amino acids or nucleotides. In the following example, NAD (nicotinamide adenine dinucleotide) and SO4 (sulphate) are described as HETATM records. Solvent molecules (H2O) that are seen in the electron density map also appear in this section.
HETATM 5158  AP  NAD B   1      42.641  30.361  41.284  1.00 26.73      4MDH5495
HETATM 5159  AO1 NAD B   1      43.440  31.570  40.868  1.00 20.69      4MDH5496
HETATM 5160  AO2 NAD B   1      41.161  30.484  41.376  1.00 33.73      4MDH5497
HETATM 5161 AO5* NAD B   1      43.117  29.802  42.683  1.00 20.55      4MDH5498
HETATM 5162 AC5* NAD B   1      44.483  29.615  43.002  1.00 17.23      4MDH5499
(...)
HETATM 5202  S   SO4 B   2      44.842  24.424  31.662  1.00 72.77      4MDH5539
HETATM 5203  O1  SO4 B   2      45.916  23.890  32.631  1.00 31.43      4MDH5540
HETATM 5204  O2  SO4 B   2      44.065  23.296  30.916  1.00 26.35      4MDH5541
HETATM 5205  O3  SO4 B   2      45.570  25.307  30.620  1.00 52.53      4MDH5542
HETATM 5206  O4  SO4 B   2      43.834  25.257  32.482  1.00 47.91      4MDH5543
HETATM 5207  O   HOH     0      15.379   1.907   3.295  1.00 58.12      4MDH5544
HETATM 5208  O   HOH     1      58.861   0.984  17.024  1.00 37.58      4MDH5545
HETATM 5209  O   HOH     2      24.384   1.184  74.398  1.00 35.92      4MDH5546
(...)
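Since cofactors, ions and water all share the HETATM keyword, a program usually tells them apart by residue name, chain and residue number. The Python sketch below groups HETATM atoms that way and singles out water; the function names are illustrative, and the sample is a handful of lines taken from the excerpt.

```python
from collections import Counter

def het_group_sizes(lines):
    """Count HETATM atoms per (residue name, chain, residue number)."""
    sizes = Counter()
    for line in lines:
        if line[:6] == "HETATM":
            key = (line[17:20].strip(), line[21].strip(), int(line[22:26]))
            sizes[key] += 1
    return sizes

def is_water(res_name):
    """Water molecules are stored under the residue name HOH."""
    return res_name == "HOH"

sample = [
    "HETATM 5158  AP  NAD B   1      42.641  30.361  41.284  1.00 26.73      4MDH5495",
    "HETATM 5159  AO1 NAD B   1      43.440  31.570  40.868  1.00 20.69      4MDH5496",
    "HETATM 5202  S   SO4 B   2      44.842  24.424  31.662  1.00 72.77      4MDH5539",
    "HETATM 5207  O   HOH     0      15.379   1.907   3.295  1.00 58.12      4MDH5544",
    "HETATM 5208  O   HOH     1      58.861   0.984  17.024  1.00 37.58      4MDH5545",
]

sizes = het_group_sizes(sample)
for (res_name, chain, res_seq), n in sizes.items():
    kind = "water" if is_water(res_name) else "ligand"
    print(res_name, chain or "-", res_seq, n, kind)
```

On a complete file, the atom count obtained for each group can be compared with the number of atoms announced in the corresponding HET record.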
HETATM records give only atom positions; because they describe non-standard groups, programs cannot know which atoms are bonded to each other. This information is found in the CONECT records. In the example below, atom number 74 is connected to atoms 69 and 75. In the absence of CONECT information, atoms are usually considered bonded if they are closer than 2 angstroms.
CONECT   74   69   75                                                   4MDH6015
CONECT   77   76                                                        4MDH6016
CONECT   92   90   93                                                   4MDH6017
CONECT   99   98                                                        4MDH6018
(...)
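Both ways of finding bonds can be sketched in a few lines of Python. The first function reads a CONECT record as five-column fields (the atom in columns 7-11, bonded atoms in columns 12-31); the second applies the 2 angstrom distance rule mentioned above. The function names and the default cutoff argument are choices made for this example, and `math.dist` requires Python 3.8 or later.

```python
import math

def parse_conect(line):
    """Return (atom number, [bonded atom numbers]) from one CONECT record.

    Only columns 7-31 are read, as five fields of five columns each, so the
    entry identifier stored in columns 73-80 is never touched.
    """
    fields = [line[i:i + 5].strip() for i in range(6, 31, 5)]
    serials = [int(field) for field in fields if field]
    return serials[0], serials[1:]

def bonded_by_distance(p, q, cutoff=2.0):
    """Fallback rule: two atoms are bonded if closer than `cutoff` angstroms."""
    return math.dist(p, q) < cutoff

record = "CONECT   74   69   75".ljust(72) + "4MDH6015"
print(parse_conect(record))

# Coordinates of three atoms of SER 1, taken from the ATOM records.
n = (11.648, 3.946, 34.081)
ca = (12.901, 4.557, 33.573)
og = (15.105, 3.679, 34.039)
print(round(math.dist(n, ca), 2), bonded_by_distance(n, ca))   # N and CA are bonded
print(round(math.dist(n, og), 2), bonded_by_distance(n, og))   # N and OG are not
```

A single fixed cutoff is only a rough heuristic, which is why explicit CONECT records are preferred when they are present.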