After a few rounds we found good homology with the ABC transporter family, which is involved in a multicomponent binding-protein-dependent transport system for glycine betaine/L-proline (for example, see PROV_SALTY).
There are two distinct regions at the N-terminus and C-terminus of the protein, which could be protein domains.
Looking at the alignment between PROV_SALTY and myseq (in the alignment region of the PSI-BLAST), it looks like the domains are swapped.
Without the low complexity filter we obtain a completely different result: the homology we find is with collagen. This is because collagen proteins have a biased amino acid composition that matches our low complexity region.
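To illustrate why a compositionally biased region attracts spurious hits, here is a minimal sketch of a low complexity mask based on the Shannon entropy of a sliding window. This is a simplification and not the algorithm used by the BLAST filter; the function names, the window size of 12 and the threshold of 2.2 bits are assumptions chosen for illustration, and the collagen-like repeat in the usage note is a toy sequence.

```python
import math
from collections import Counter


def shannon_entropy(segment: str) -> float:
    """Shannon entropy (in bits) of the residue composition of a segment."""
    counts = Counter(segment)
    total = len(segment)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mask_low_complexity(seq: str, window: int = 12, threshold: float = 2.2) -> str:
    """Replace every residue covered by a low-entropy window with 'X'.

    window and threshold are illustrative values, not tuned parameters.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        # Entropy is always computed on the original sequence, not the masked one.
        if shannon_entropy(seq[start:start + window]) < threshold:
            masked[start:start + window] = "X" * window
    return "".join(masked)
```

For example, mask_low_complexity("GPPGPPGPPGPPGPPGPP") masks the whole toy repeat, while a window containing many different residues is left untouched.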
Given the following MSA:
Seq1 WFFKGIADKDAERHLLA
Seq2 WFFKNLEQKDAEARLLA
Seq3 WFFKR---KDAERQLLA
Seq4 WFFGTI---DAERQLLA
Seq5 WFFKDIPTKDAERQLLA
Seq6 WYFG----RESERLLLA
Seq7 WYFGKIPLKDAERQLLA
Seq8 WYFGKLRAKDTERLLLL
A possible pattern could be the following one, but this is not the only solution!!
W-[FY]-F-[KG]-x(0,4)-[KR](0,1)-[DE]-x-E-[RA]-x-L(2)-[AL]
Running a search on SWISS-PROT we found 14 matches, all annotated as tyrosine-protein kinase.
This pattern doesn't find any false positives in a reversed SWISS-PROT database.
Here is a possible pattern for the second set of sequences:
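As a quick sanity check, the tyrosine kinase pattern can be translated into a regular expression and tested against the eight aligned sequences (with gaps removed). The converter below is a minimal sketch: prosite_to_regex is a hypothetical helper, and it only handles the syntax elements used in this exercise (single residues, x, [..] and {..} classes, and repetition counts), not the full PROSITE syntax.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    element_re = re.compile(
        r"([A-Zx]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?"
    )
    parts = []
    for element in pattern.split("-"):
        m = element_re.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = m.groups()
        if residue == "x":
            residue = "."                        # any residue
        elif residue.startswith("{"):
            residue = "[^" + residue[1:-1] + "]"  # excluded residues
        if low is not None:
            # (n) becomes {n}, (n,m) becomes {n,m}
            residue += "{" + low + ("," + high if high else "") + "}"
        parts.append(residue)
    return "".join(parts)


PATTERN = "W-[FY]-F-[KG]-x(0,4)-[KR](0,1)-[DE]-x-E-[RA]-x-L(2)-[AL]"

# The eight sequences of the alignment, with gap characters removed.
SEQUENCES = [
    "WFFKGIADKDAERHLLA", "WFFKNLEQKDAEARLLA", "WFFKRKDAERQLLA",
    "WFFGTIDAERQLLA", "WFFKDIPTKDAERQLLA", "WYFGRESERLLLA",
    "WYFGKIPLKDAERQLLA", "WYFGKLRAKDTERLLLL",
]

regex = prosite_to_regex(PATTERN)
all_match = all(re.fullmatch(regex, s) for s in SEQUENCES)
print(regex, all_match)
```

Matching every input sequence only shows that the pattern is not too strict; whether it is specific enough still has to be checked against a real and a reversed or random database, as done above.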
[ED]-R-x(2)-R
This pattern returns false positives (in a random database). Patterns of this kind, although useful, require additional evidence to be validated.
The number of hits decreases if patterns with a high probability of occurrence are excluded.
The masked patterns are typically short and/or degenerate, which results in a large number of hits. Matches to these patterns have to be considered very preliminary information, and other information must be used to confirm them (biological information, bench experiments, sequence environment, ...).
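A rough calculation shows why such a short pattern matches by chance. The sketch below assumes that all 20 amino acids are equally likely at every position, which is not true of real proteins, and uses a hypothetical database size of one million residues; match_probability is an illustrative helper name.

```python
from fractions import Fraction

ALPHABET_SIZE = 20  # assumption: uniform amino acid frequencies


def match_probability(allowed_per_position):
    """Probability that a random window matches, given the number of
    allowed residues at each pattern position."""
    p = Fraction(1)
    for n_allowed in allowed_per_position:
        p *= Fraction(n_allowed, ALPHABET_SIZE)
    return p


# [ED]-R-x(2)-R: 2 allowed, 1 allowed, any, any, 1 allowed
p = match_probability([2, 1, 20, 20, 1])

db_residues = 1_000_000  # hypothetical database size
# Approximation: every residue is treated as a possible start position.
expected_hits = float(p * db_residues)
print(p, expected_hits)  # 1/4000 250.0
```

Under these assumptions about one window in 4000 matches, so a database of realistic size is expected to contain many chance hits.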
A search of a random database with a pattern with a high probability of occurrence returns a series of matches. This indicates that the pattern match information alone is not reliable and other information is required to validate the result in a real sequence.
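The same point can be checked empirically. The following minimal sketch builds a small random database and counts the matches of a short and a long pattern, both written directly as regular expressions. The database size, the seed and the uniform residue frequencies are assumptions made for illustration; a real test would use a shuffled or reversed copy of the actual database.

```python
import random
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_database(n_sequences: int, length: int, seed: int = 0):
    """Generate random protein sequences with uniform residue frequencies."""
    rng = random.Random(seed)
    return ["".join(rng.choices(AMINO_ACIDS, k=length)) for _ in range(n_sequences)]


def count_matches(regex: str, database) -> int:
    """Count match start positions, including overlapping matches."""
    lookahead = re.compile(f"(?={regex})")
    return sum(len(lookahead.findall(seq)) for seq in database)


db = random_database(n_sequences=200, length=500)

short_hits = count_matches(r"[ED]R.{2}R", db)
long_hits = count_matches(r"W[FY]F[KG].{0,4}[KR]{0,1}[DE].E[RA].L{2}[AL]", db)
print(short_hits, long_hits)
```

The short pattern is expected to hit the random sequences a few dozen times, whereas the long one should essentially never match, which is why a match to the short pattern alone says little about a real sequence.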
MEME finds 3 possible motifs (see section DATABASE AND MOTIFS of the third mail):
MOTIFS (peptide)
MOTIF WIDTH BEST POSSIBLE MATCH
----- ----- -------------------
  1    20   LWNHPWFHGKIPREEAEAIL
  2     9   DGTFLVRES
  3    50   AKAKYDFCARDDDELSFKRGDIIKILNKKCDQGWWKGEINGKGGWFPKNY
Motifs 1 and 2 are present together in all 6 sequences, while motif 3 is restricted to only 2 sequences (in one of which it is repeated).
The motifs described by MEME correspond to 2 protein domains: motif1 + motif2 = SH2, motif3 = SH3.
Pfam and Prosite return similar results. Unsure matches with a low score are marked with status:? by the pfscan server. InterPro returns a much larger result, because a full search is done against a number of databases (Pfam, Prosite, Smart, ProDom, PRINTS, TIGRfam).
By reading the documentation of each domain it is possible to infer that the protein is implicated in post-transcriptional gene silencing (RNAi), probably in the degradation of double-stranded RNA.
Any questions? Mail to Lorenzo Cerutti.