Introduction to Patterns, Profiles, and HMMs

1. PSI-BLAST (SOLUTIONS)

After a few rounds we found good homology with the ABC transporter family involved in a multicomponent binding-protein-dependent transport system for glycine betaine/L-proline (for example, see PROV_SALTY).

There are two distinct regions, at the N-terminus and the C-terminus of the protein, which could be protein domains.

Looking at the alignment between PROV_SALTY and myseq (in the region aligned by PSI-BLAST), the domains appear to be swapped.

Without the low-complexity filter we obtain a completely different result: the homology we find is with collagen. This is because the biased amino acid composition of collagen proteins matches our low-complexity region.
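
The effect of a biased composition can be illustrated with a small sketch that computes the Shannon entropy of sliding windows and flags the low-entropy ones. This is only an illustration of the idea: it is not the SEG algorithm used by the BLAST filter, the window width and threshold are arbitrary assumptions, and both toy sequences are hypothetical (they are not myseq).

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Shannon entropy (in bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def low_complexity_windows(seq, width=12, threshold=2.2):
    """Start positions of windows whose entropy is below the threshold.

    width and threshold are illustrative values, not tuned parameters.
    """
    return [i for i in range(len(seq) - width + 1)
            if shannon_entropy(seq[i:i + width]) < threshold]

# Hypothetical toy sequences (assumptions for the example only).
collagen_like = "GPPGPAGPPGPPGAPGPQ"   # Gly-X-Y repeats, rich in G and P
diverse = "ACDEFGHIKLMNPQRSTVWY"       # every residue type exactly once

print(len(low_complexity_windows(collagen_like)))  # number of flagged windows
print(len(low_complexity_windows(diverse)))
```

A region made of very few residue types gets a low entropy in every window, which is the kind of region that attracts spurious hits from compositionally biased proteins when the filter is switched off.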

2. Build a pattern (SOLUTIONS)

Given the following MSA:

	Seq1  WFFKGIADKDAERHLLA
	Seq2  WFFKNLEQKDAEARLLA
	Seq3  WFFKR---KDAERQLLA
	Seq4  WFFGTI---DAERQLLA
	Seq5  WFFKDIPTKDAERQLLA
	Seq6  WYFG----RESERLLLA
	Seq7  WYFGKIPLKDAERQLLA
	Seq8  WYFGKLRAKDTERLLLL

A possible pattern is the following one (this is not the only solution!):

W-[FY]-F-[KG]-x(0,4)-[KR](0,1)-[DE]-x-E-[RA]-x-L(2)-[AL]

Running a search on SWISS-PROT we found 14 matches, all annotated as tyrosine-protein kinase.

This pattern finds no false positives when searched against a reversed SWISS-PROT database.
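
As a quick sanity check, the pattern can be translated into a regular expression and tested against the eight sequences of the MSA (gaps removed) and against their reversed copies. The translator below is a minimal sketch: the function name prosite_to_regex is hypothetical, it handles only the syntax elements used in this exercise (residues, [..], {..}, x and repeat counts), and it ignores anchors such as < and >.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex."""
    parts = []
    for element in pattern.split("-"):
        # Separate an optional repeat such as (2) or (0,4) from the element.
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element).groups()
        if core == "x":                      # x = any residue
            core = "."
        elif core.startswith("{"):           # {AB} = any residue except A, B
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high is not None else "") + "}"
        parts.append(core + repeat)
    return "".join(parts)

PATTERN = "W-[FY]-F-[KG]-x(0,4)-[KR](0,1)-[DE]-x-E-[RA]-x-L(2)-[AL]"

MSA = [
    "WFFKGIADKDAERHLLA", "WFFKNLEQKDAEARLLA", "WFFKR---KDAERQLLA",
    "WFFGTI---DAERQLLA", "WFFKDIPTKDAERQLLA", "WYFG----RESERLLLA",
    "WYFGKIPLKDAERQLLA", "WYFGKLRAKDTERLLLL",
]

regex = re.compile(prosite_to_regex(PATTERN))
sequences = [s.replace("-", "") for s in MSA]   # drop alignment gaps

forward_hits = [bool(regex.search(s)) for s in sequences]
reverse_hits = [bool(regex.search(s[::-1])) for s in sequences]
print(regex.pattern)
print(forward_hits)
print(reverse_hits)
```

Searching the reversed sequences mimics, on a tiny scale, the search against a reversed database: the composition is preserved but the biological signal is not.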

Here is a possible pattern for the second set of sequences:

	[ED]-R-x(2)-R

This pattern returns false positives (in a random database). Patterns of this kind, although useful, require additional evidence to be validated.
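
A rough estimate shows why such a short pattern matches by chance. The sketch below assumes, as a simplification, that all 20 amino acids are equally frequent (real databases have a biased composition), and the database size of one million residues is a hypothetical round number.

```python
from fractions import Fraction

def match_probability(allowed_per_position):
    """Chance that a random position matches the pattern, assuming each of
    the 20 amino acids occurs with probability 1/20 (a simplification)."""
    p = Fraction(1)
    for n_allowed in allowed_per_position:
        p *= Fraction(n_allowed, 20)
    return p

# [ED]-R-x(2)-R: 2 residues allowed, then 1, then any (20), any (20), then 1.
short_pattern = [2, 1, 20, 20, 1]
p = match_probability(short_pattern)

database_residues = 1_000_000            # hypothetical database size
expected_hits = p * database_residues    # expected number of chance matches

print(p, float(expected_hits))
```

Under these assumptions about one position in 4000 matches purely by chance, so a hit on its own says little about function.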

3. Search the Prosite pattern database (SOLUTIONS)

The number of hits decreases if patterns with a high probability of occurrence are excluded.

The masked patterns are characteristically short and/or degenerate, which results in a large number of hits. Matches to these patterns must be considered very preliminary information, and other information must be used to confirm them (biological information, bench experiments, sequence environment, ...).

A search of a random database with a pattern with a high probability of occurrence returns a series of matches. This indicates that the pattern match information alone is not reliable and other information is required to validate the result in a real sequence.
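
The same behaviour can be reproduced with a small simulation. The sketch below builds a random database with a uniform amino acid composition (an assumption, as are the database size and the seed) and counts the matches of two hand-translated regular expressions: the short pattern [ED]-R-x(2)-R and the long pattern from exercise 2.

```python
import random
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Regex translations of the two patterns from exercise 2.
SHORT = r"[ED]R.{2}R"
LONG = r"W[FY]F[KG].{0,4}[KR]{0,1}[DE].E[RA].L{2}[AL]"

def random_database(n_seqs, length, seed=0):
    """Random sequences with uniform residue frequencies (a simplification)."""
    rng = random.Random(seed)
    return ["".join(rng.choices(AMINO_ACIDS, k=length)) for _ in range(n_seqs)]

def count_hits(regex, database):
    """Total number of non-overlapping matches over all sequences."""
    compiled = re.compile(regex)
    return sum(len(compiled.findall(seq)) for seq in database)

database = random_database(n_seqs=500, length=400)   # 200,000 residues
print("short pattern:", count_hits(SHORT, database))
print("long pattern: ", count_hits(LONG, database))
```

The short, degenerate pattern collects matches in sequences that carry no biological signal at all, while the long pattern is specific enough that chance matches are extremely unlikely in a database of this size.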

4. Build PSSMs with MEME (SOLUTIONS)

MEME finds 3 possible motifs (see section DATABASE AND MOTIFS of the third mail):

	MOTIFS  (peptide)
	MOTIF WIDTH BEST POSSIBLE MATCH
	----- ----- -------------------
	  1    20   LWNHPWFHGKIPREEAEAIL
	  2     9   DGTFLVRES
	  3    50   AKAKYDFCARDDDELSFKRGDIIKILNKKCDQGWWKGEINGKGGWFPKNY

Motifs 1 and 2 occur together in all 6 sequences, while motif 3 is restricted to only 2 sequences (in one of them it is repeated).

The motifs described by MEME correspond to 2 protein domains: motif1 + motif2 = SH2, motif3 = SH3.
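
To make the notion of a PSSM concrete, here is a minimal sketch that builds a log-odds matrix from an ungapped block and scores a segment with it. It is not what MEME does internally (MEME discovers motifs in unaligned sequences by expectation maximization). The input block is the last eight, ungapped columns of the MSA from exercise 2, and the pseudocount and the uniform background frequency are assumptions.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned, pseudocount=1.0, background=1 / 20):
    """One dict per column, mapping residue -> log2(frequency / background)."""
    n_seqs = len(aligned)
    total = n_seqs + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in zip(*aligned):             # iterate over alignment columns
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm

def score(pssm, segment):
    """Sum of the column scores for a segment as long as the PSSM."""
    return sum(col[aa] for col, aa in zip(pssm, segment))

# Last eight (ungapped) columns of the MSA given in exercise 2.
block = ["DAERHLLA", "DAEARLLA", "DAERQLLA", "DAERQLLA",
         "DAERQLLA", "ESERLLLA", "DAERQLLA", "DTERLLLL"]

pssm = build_pssm(block)
print(round(score(pssm, "DAERQLLA"), 2))     # consensus-like segment
print(round(score(pssm, "GGGGGGGG"), 2))     # residues never seen in the block
```

Unlike a pattern, which either matches or not, the PSSM gives a graded score, so a segment with one unusual residue is penalised instead of being rejected outright.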

5. Protein function discovery (SOLUTIONS)

Pfam and Prosite return similar results. Uncertain matches with a low score are marked with status:? by the pfscan server. InterPro returns a much larger result, because a full search is run against a number of databases (Pfam, Prosite, Smart, ProDom, PRINTS, TIGRfam).

By reading the documentation of each domain, it is possible to infer that the protein is implicated in post-transcriptional gene silencing (RNAi), probably in the degradation of double-stranded RNA.

Any questions? Mail Lorenzo Cerutti.