
4.4.4 SCOP: Structural Classification of Proteins

Introduction:

Nearly all proteins have structural similarities with other proteins and, in some of these cases, share a common evolutionary origin. A knowledge of these relationships is crucial to our understanding of the evolution of proteins and of development. It will also play an important role in the analysis of the sequence data that is being produced by worldwide genome projects.

The scop database aims to provide a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known, including all entries in the Protein Data Bank (PDB). It is available as a set of tightly linked hypertext documents which make the large database comprehensible and accessible. In addition, the hypertext pages offer a panoply of representations of proteins, including links to PDB entries, sequences, references, images and interactive display systems. The entry point to the database on the World Wide Web is http://scop.mrc-lmb.cam.ac.uk/scop/ (MRC site).

Existing automatic sequence and structure comparison tools cannot identify all structural and evolutionary relationships between proteins. The scop classification of proteins has been constructed manually by visual inspection and comparison of structures, but with the assistance of tools to make the task manageable and help provide generality. The job is made more challenging, and theoretically daunting, by the fact that the entities being organized are not homogeneous: sometimes it makes more sense to organize by individual domains, and at other times by whole multi-domain proteins.

Classification:

Proteins are classified to reflect both structural and evolutionary relatedness. Many levels exist in the hierarchy, but the principal levels are family, superfamily and fold, described below. The exact positions of the boundaries between these levels are to some degree subjective. Our evolutionary classification is generally conservative: where any doubt about relatedness exists, we have made new divisions at the family and superfamily levels. Thus, some researchers may prefer to focus on the higher levels of the classification tree, where proteins with structural similarity are clustered.

The different major levels in the hierarchy are:

Family: Clear evolutionary relationship

Superfamily: Probable common evolutionary origin

Fold: Major structural similarity
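
Because these levels nest inside one another, the relationship between two classified domains can be read off from the most specific level they share. The following is a minimal Python sketch of that idea, not part of scop itself: the record type, the function name and the example entries are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass

# Principal scop levels, ordered from broadest to most specific.
LEVELS = ("fold", "superfamily", "family")


@dataclass(frozen=True)
class Classification:
    """Hypothetical record holding one domain's position in the hierarchy."""
    fold: str
    superfamily: str
    family: str


def shared_level(a, b):
    """Return the most specific level at which a and b are grouped together.

    Returns None if the two domains do not even share a fold.
    """
    deepest = None
    for level in LEVELS:
        if getattr(a, level) != getattr(b, level):
            break
        deepest = level
    return deepest


# Placeholder names, not real scop entries.
d1 = Classification("fold-X", "superfamily-X1", "family-X1a")
d2 = Classification("fold-X", "superfamily-X1", "family-X1b")
d3 = Classification("fold-X", "superfamily-X2", "family-X2a")

print(shared_level(d1, d2))  # superfamily
print(shared_level(d1, d3))  # fold
```

Read against the list above, a result of "superfamily" would suggest a probable common evolutionary origin, while "fold" would indicate major structural similarity only.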

Usage:

We hope that scop will have broad utility that will attract a wide range of users. Experimental structural biologists may wish to explore the region of "structure space" near the proteins they are currently studying, while theoreticians will likely find it most useful to browse the wide range of protein folds currently known. Molecular biologists may find the classification helpful because the categorization assists in locating proteins of interest and the links make exploration easy. We also hope that scop will find pedagogical use, for it organizes structures in an easily comprehensible manner and makes them accessible from even a simple personal computer.

Table I

As of Release 1.48, 9912 PDB entries have been classified into 22140 domains (excluding nucleic acids and theoretical models).

Class	Number of folds	Number of superfamilies
All alpha proteins	126	175
All beta proteins	81	147
Alpha and beta proteins (a/b)	87	135
Alpha and beta proteins (a+b)	151	214
Multi-domain proteins (alpha and beta)	21	21
Membrane and cell surface proteins and peptides	10	16
Small proteins	44	63
Total	520	771
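
As a quick consistency check, the per-class counts in Table I can be summed and compared with the Total row. The short Python sketch below does only that arithmetic; the variable names are illustrative, and the numbers are copied directly from the table.

```python
# (class name, number of folds, number of superfamilies), copied from Table I.
TABLE_I = [
    ("All alpha proteins", 126, 175),
    ("All beta proteins", 81, 147),
    ("Alpha and beta proteins (a/b)", 87, 135),
    ("Alpha and beta proteins (a+b)", 151, 214),
    ("Multi-domain proteins (alpha and beta)", 21, 21),
    ("Membrane and cell surface proteins and peptides", 10, 16),
    ("Small proteins", 44, 63),
]

total_folds = sum(folds for _, folds, _ in TABLE_I)
total_superfamilies = sum(sf for _, _, sf in TABLE_I)

print(total_folds, total_superfamilies)  # 520 771

# Average number of superfamilies per fold across the whole release.
print(round(total_superfamilies / total_folds, 2))  # 1.48
```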

Murzin A. G., Brenner S. E., Hubbard T., Chothia C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536-540.

4.4.5 CATH: Classification of protein structures

Introduction:

CATH is a novel hierarchical classification of protein domain structures, which clusters proteins at four major levels: class (C), architecture (A), topology (T) and homologous superfamily (H). Class, derived from secondary structure content, is assigned automatically for more than 90% of protein structures. Architecture, which describes the gross orientation of secondary structures, independent of connectivities, is currently assigned manually. The topology level clusters structures according to their topological connections and numbers of secondary structures. The homologous superfamilies cluster proteins with highly similar structures and functions. The assignments of structures to topology families and homologous superfamilies are made by sequence and structure comparisons.
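
One way to picture the four levels is to treat each domain's classification as a path (C, A, T, H) and to cluster domains by a prefix of that path. The Python sketch below illustrates this under stated assumptions: the tuple representation, the function name, the domain identifiers and the numeric labels are all hypothetical and are not taken from the CATH database.

```python
from collections import defaultdict

# The four major CATH levels, from broadest to most specific.
CATH_LEVELS = ("class", "architecture", "topology", "homologous_superfamily")


def group_by_level(assignments, level):
    """Cluster domain ids by their classification path down to `level`."""
    depth = CATH_LEVELS.index(level) + 1
    groups = defaultdict(list)
    for domain_id, path in assignments.items():
        groups[path[:depth]].append(domain_id)
    return dict(groups)


# Hypothetical domain ids and numeric labels, for illustration only.
assignments = {
    "domA": (1, 10, 5, 2),
    "domB": (1, 10, 5, 7),
    "domC": (1, 10, 8, 1),
    "domD": (3, 40, 2, 1),
}

by_topology = group_by_level(assignments, "topology")
by_class = group_by_level(assignments, "class")
print(by_topology)
print(by_class)
```

In this toy data, domA and domB fall into the same topology family but different homologous superfamilies, while domC shares only class and architecture with them.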

CATH can be reached on the Web at this URL: http://www.biochem.ucl.ac.uk/bsm/cath/index.html

Domains are regions of contiguous polypeptide chain that have been described as compact, local, and semi-independent units (Richardson, 1981). Within a protein, domains can be anything from independent globular units joined only by a flexible length of polypeptide chain, to units which have a very extensive interface.

CATH is now a classification of protein domains. Each protein structure in the PDB has been cut into its constituent domains, and each domain has been classified separately. The assignment of domain definitions has been made using a consensus procedure (DBS; Jones et al., 1996), based on three independent algorithms for domain recognition: DETECTIVE (Swindells, 1995), PUU (Holm & Sander, 1994) and DOMAK (Siddiqui and Barton, 1995). This currently allows approximately 53% of the proteins to be defined as single or multidomain proteins automatically. The remaining structures are assigned domain definitions manually, by choosing what was determined to be the best assignment made by one of the algorithms, a new assignment, or an alternative assignment obtained from the literature.
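
The general shape of such a consensus step can be sketched in Python as follows. This is an illustration only, not the published DBS procedure: the agreement criterion (same number of domains, with every boundary matching within a fixed tolerance), the tolerance value of 10 residues, the function name and the example boundary positions are all assumptions made for the sake of the example.

```python
from itertools import combinations


def consensus_domains(predictions, tolerance=10):
    """Return agreed domain boundaries, or None if manual assignment is needed.

    predictions maps an algorithm name to a sorted list of residue positions
    at which one domain ends and the next begins (empty list = single domain).
    The agreement rule and the tolerance are assumptions, not the DBS criteria.
    """
    boundary_lists = list(predictions.values())
    for first, second in combinations(boundary_lists, 2):
        if len(first) != len(second):
            return None  # algorithms disagree on the number of domains
        if any(abs(a - b) > tolerance for a, b in zip(first, second)):
            return None  # a boundary differs by more than the tolerance
    # All algorithms agree: average each boundary position across them.
    return [round(sum(vals) / len(vals)) for vals in zip(*boundary_lists)]


# Hypothetical boundary predictions for one two-domain protein.
agreed = consensus_domains({"DETECTIVE": [120], "PUU": [118], "DOMAK": [125]})
print(agreed)  # [121]

# One algorithm sees a single domain, so the case is left for manual review.
disputed = consensus_domains({"DETECTIVE": [120], "PUU": [], "DOMAK": [125]})
print(disputed)  # None
```

A result of None corresponds to the structures that the text describes as being assigned manually; an empty list would mean all three algorithms agree on a single-domain protein.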

Table II

Version 1.6 of CATH includes 7703 PDB entries (which contain 13 103 chains and 18 557 domains).

CATH Level	Number
Class	4
Architecture	35
Topology Family	672
Homologous Superfamily	1028
Sequence Family	1784
Near Identical Structures	3487
Identical Structures	6274

Orengo C. A., Michie A. D., Jones S., Jones D. T., Swindells M. B., Thornton J. M. (1997). CATH: a hierarchic classification of protein domain structures. Structure 5(8), 1093-1108.