Regular Expression for Protein Motif Search

Since these motifs can be represented as regular expressions, it is important to understand regular expressions in general, and how to use them with the Python language in particular (see Appendix ).
Search for a protein motif with regular expression in protein databases
Evaluation of a regular expression is called pattern matching.
Regular expressions are made up of terms, operators and modifiers.
Terms are strings or substrings. For example, in the string "Protein Kinase", the term "Kinase" matches a substring.
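As a small illustration of matching a term, here is a minimal sketch with Python's re module. The variable names are only for illustration.

```python
import re

text = "Protein Kinase"

# Pattern matching: look for the term "Kinase" anywhere in the string.
match = re.search("Kinase", text)

if match:
    # group() is the matched substring, span() its start and end offsets.
    print(match.group(), match.span())
```

re.search returns None when the term is absent, so it is worth testing the result before calling group().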
Operators combine terms and expressions. For example
alternation of expressions or literal characters, like zinc|transmem
repetition with * + ? {min,max}, which specify how many times the preceding expression may match.
But what about concatenation (combining two substrings)? That is implicit. If you want to combine two expressions, you simply write one after the other.
Operators have precedence, like arithmetic operators. By grouping the expression you can change the precedence.
The regexp MA{5} would match MAAAAA whereas
(MA){5} would match MAMAMAMAMA.
Is it not time to write a regexp to find a tandem repeat ?
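One possible answer, as a sketch only: a back reference lets a captured unit repeat itself. The unit length of 2 to 6 residues and the test sequence are assumptions made up for this example, not values taken from any database.

```python
import re

# Repetition binds tighter than concatenation: {5} applies to 'A' only.
print(bool(re.fullmatch(r"MA{5}", "MAAAAA")))
# Grouping changes the precedence: {5} now applies to the unit 'MA'.
print(bool(re.fullmatch(r"(MA){5}", "MAMAMAMAMA")))

# Tandem repeat: a unit of 2-6 residues (group 1) followed directly
# by one or more further copies of itself (\1+).
tandem = re.compile(r"([A-Z]{2,6})\1+")

seq = "MKQAGQAGQAGTL"   # made-up sequence containing QAG three times
m = tandem.search(seq)
print(m.group(0), m.group(1))   # whole repeat region, repeated unit
```

Here group(1) holds the repeated unit and group(0) the whole repeat region, so the number of copies is len(m.group(0)) // len(m.group(1)).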
Modifiers change the rules, like the compilation flag IGNORECASE, which we can set with re.I or re.IGNORECASE. Other flags are re.M for multiline matching and re.S, which makes '.' match any character including newline.
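The effect of these three flags can be seen in a short sketch. The two-line test string is made up for the example.

```python
import re

text = "protein kinase\nZINC finger"   # made-up two-line string

# re.I: letter case is ignored.
ignore_case = re.findall(r"zinc", text, re.I)

# Without re.M, '^' matches only at the start of the whole string;
# with re.M it also matches at the start of every line.
line_starts = re.findall(r"^\w+", text)
line_starts_m = re.findall(r"^\w+", text, re.M)

# Without re.S, '.' does not match the newline between the two lines.
dot_default = re.search(r"kinase.ZINC", text)
dot_all = re.search(r"kinase.ZINC", text, re.S)

# Flags can be combined with '|'.
both = re.search(r"KINASE.zinc", text, re.I | re.S)

print(ignore_case, line_starts, line_starts_m)
print(dot_default, bool(dot_all), bool(both))
```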
The parts of a regexp, shown on an annotated example:
^([a-z]\w+).*(Ala|Met)\1$ with the modifier I
^ and $ : anchor terms
( ) : grouping
[a-z] : character class
+ and * : repeat operators
Ala and Met : literal characters
| : alternation
\1 : back reference
I : modifier
The regexp notations we need to represent aa in a motif are: N-terminal residue, C-terminal residue, any residue, optional residue, and representation of gaps (i.e. variable length regexps).
Before doing that let us familiarize ourselves with the following regexp metacharacter table.
Note that the following characters must be escaped (preceded by a backslash) if we want to match them literally, but for our purpose it is unlikely that one of these characters is going to represent an amino acid!
\ ^ $ + * [ { ( ) ? . |
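A short sketch of what escaping changes, using a made-up string that ends with a literal '*'. re.escape is a standard function of the re module that adds the backslashes for us.

```python
import re

text = "GKK*"   # made-up string ending with a literal '*'

# Unescaped, '*' is the repeat operator: 'K' followed by zero or more 'K'.
as_operator = re.search(r"KK*", text)

# Escaped with a backslash, '*' is matched literally.
as_literal = re.search(r"KK\*", text)

# re.escape() escapes every metacharacter in a string.
escaped = re.escape("KK*")
via_escape = re.search(escaped, text)

print(as_operator.group(), as_literal.group(), via_escape.group())
```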
| Meta character | Name | Meaning | How do we use them? |
|---|---|---|---|
| . | dot | Any character except newline | Any aa/gap |
| [a-z] | character class | Match any character from a to z | Amino acid in one-letter code |
| [^a-z] | negated character class | Match any character except a to z | Reject certain amino acids |
| ? | optional character | Match the previous character if it is there, otherwise skip it | Optionally match the previous amino acid |
| * | star | Match zero or more times | Find >=0 'clones' |
| + | plus | Match one or more times | Find >=1 'clones' |
| ^ | caret | Match at the beginning | N-terminal |
| $ | dollar | Match at the end | C-terminal |
| \| | alternation | Match either of the expressions it separates | Match either/or |
| {m,n} | range specifier (operator) | m = minimum required, n = maximum allowed | Match a variable number of aa's or allow variable gaps |
The N-terminal part of a motif can be represented by the caret character '^'. Example: '^SMART'
The C-terminal part of a motif can be represented by the dollar character '$'. Example: 'SMART$'
For a position that must be occupied but may hold any residue, we can use the dot character '.'. Example: 'SMA.T'
For an optional residue we can use the question mark '?'. Example: 'SMAR?T' matches SMART as well as SMAT
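These four notations can be tried out directly. The sketch below uses a handful of made-up peptides and a small helper, hits, that is only a convenience for this example.

```python
import re

# Made-up peptides for illustration only.
peptides = ["SMARTKL", "KLSMART", "SMAKT", "SMAT", "KSMARTK"]

def hits(pattern):
    """Return the peptides in which the pattern is found."""
    return [p for p in peptides if re.search(pattern, p)]

n_terminal = hits(r"^SMART")    # motif at the N-terminal
c_terminal = hits(r"SMART$")    # motif at the C-terminal
any_residue = hits(r"SMA.T")    # any residue at the fourth position
optional = hits(r"SMAR?T")      # R may be present or absent

print(n_terminal, c_terminal)
print(any_residue)
print(optional)
```

Note that 'SMA.T' does not match SMAT, because the dot requires a residue to be present, while 'SMAR?T' does not match SMAKT, because only R is allowed at the optional position.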
Representing gaps within a motif is a bit tricky. We can combine the dot character with range specifiers to accomplish this.
SMAR.{0,4}T is shorthand for the set of patterns
SMART
SMAR.T
SMAR..T
SMAR...T
SMAR....T
Here, what we mean by a 'gap' is that two sub-motifs are separated by a variable number of unconserved amino acids. We treat these unconserved amino acids as gaps in the regexps.
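That the range specifier really is shorthand for the five explicit patterns can be checked with a small sketch. The test strings are made up, with 0 to 5 residues between SMAR and T, and fullmatch is used so that the whole string has to fit the pattern.

```python
import re

gapped = re.compile(r"SMAR.{0,4}T")

# The five explicit patterns SMART, SMAR.T, ... SMAR....T
explicit = [re.compile("SMAR" + "." * n + "T") for n in range(5)]

# Made-up strings with a gap of 0, 1, 2, 3, 4 and 5 residues.
tests = ["SMART", "SMARKT", "SMARKLT", "SMARKLGT", "SMARKLGAT", "SMARKLGAAT"]

results = []
for s in tests:
    by_range = bool(gapped.fullmatch(s))
    by_list = any(p.fullmatch(s) for p in explicit)
    results.append((s, by_range, by_list))
    print(s, by_range, by_list)
```

Both ways agree on every string: gaps of 0 to 4 residues are accepted and the gap of 5 residues is rejected.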
Suppose that we have the following alignment and we want to define a motif.
PCNA binders
The alignment for 'pcna binders' is
                 10                19
1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9
G R K R R Q T S M T D F Y H S K R R L # Seq 1
G R K R R Q T S L T D F Y H S K R R L # Seq 2
    R K R Q P K I T E F M K E R K R L # Seq 3
R Q G S T Q G R L D D F F K V T G S L # Seq 4
S K T I P Q G R L D S F F K P V P S S # Seq 5
L K S G I Q G R L D G F F Q V V P K T # Seq 6
D A Q A T Q L R I D S F F R L A Q Q E # Seq 8
Q F V G T Q S N L T Q F F E G G N T N # Seq 9
K K K G K Q K R I N E F F P R E Y I S # Seq 10
        M Q R S I M S F F H P K K E G # Seq 11
R M S T R Q S D I S N F F I S S A S H # Seq 12
G K K P K Q A T L A R F F T S M K N K # Seq 13
            M D I R K F F G V I P S G # Seq 14
      M S N S D I R S F F G G G N A Q # Seq 15
          M V N I S D F F G K N K K S # Seq 16
    S T R Q T T I T S H F A K G P A K # Seq 17
    T T R Q T T I T A H F T K G P T K # Seq 18
    A G K Q P T I L S M F S K G S T K # Seq 19
    M I G Q K T L Y S F F S P S P A R # Seq 20
G L K L K Q P R L D N F F K T N T S S # Seq 21
A R K R K Q T T I E D F F G T K K S T # Seq 22
Q S K P Q Q K S I M S F F G K K *     # Seq 23
K R L K K Q G T L E S F F K R K A K * # Seq 24
G K A N R Q V S I T G F F Q R K *     # Seq 25
1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9
                 10                19
Start from the 6th column:
[QN]|^M
This means the motif starts either with Q or N anywhere in a protein sequence, or it starts at the N-terminal with methionine.
The next column (7th) is variable, so let us denote it with '.'
([QN]|^M).
The 8th column accepts all amino acids except proline, because the motif is in a helix
([QN]|^M).[^P]
The 9th column is one of M, I or L, so the motif grows like
([QN]|^M).[^P][MIL]
The 10th and 11th columns are again all aa's except proline,
([QN]|^M).[^P][MIL][^P][^P]
The 12th column is one of F, H or M and the 13th column is one of F, Y or M
([QN]|^M).[^P][MIL][^P][^P][FHM][FYM]
We can also write this with a range specifier as
([QN]|^M).[^P][MIL][^P]{2}[FHM][FYM]
If you run this pattern against all of the above peptides, it will miss at least one sequence,
M D I R K F F G V I P S G
Let us examine it in detail,
M    D    I      R
^M   .    [^P]   [MIL]
Our pattern fails at the fourth amino acid, arginine (R), because the fourth amino acid should be one of [MIL]. The third amino acid, I, is matched by [^P] and so is not available for [MIL]. Notice that if you make the second amino acid optional when the N-terminal residue is Met, everything becomes fine.
So the refined pattern
([QN].|^M.?)[^P][MIL][^P]{2}[FHM][FYM] # This one works !
matches all of the above sequences. Finally we are there :-)
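We can let Python confirm this. The sketch below copies the peptides from the alignment (without the spaces and without the trailing '*' found on three lines) and runs the first and the refined pattern against them. Both patterns are written without spaces around '|', because Python would otherwise treat those spaces as literal characters. The dictionary and variable names are only for illustration.

```python
import re

# Peptides from the alignment, keyed by sequence number (there is no Seq 7).
peptides = {
    1: "GRKRRQTSMTDFYHSKRRL", 2: "GRKRRQTSLTDFYHSKRRL",
    3: "RKRQPKITEFMKERKRL", 4: "RQGSTQGRLDDFFKVTGSL",
    5: "SKTIPQGRLDSFFKPVPSS", 6: "LKSGIQGRLDGFFQVVPKT",
    8: "DAQATQLRIDSFFRLAQQE", 9: "QFVGTQSNLTQFFEGGNTN",
    10: "KKKGKQKRINEFFPREYIS", 11: "MQRSIMSFFHPKKEG",
    12: "RMSTRQSDISNFFISSASH", 13: "GKKPKQATLARFFTSMKNK",
    14: "MDIRKFFGVIPSG", 15: "MSNSDIRSFFGGGNAQ",
    16: "MVNISDFFGKNKKS", 17: "STRQTTITSHFAKGPAK",
    18: "TTRQTTITAHFTKGPTK", 19: "AGKQPTILSMFSKGSTK",
    20: "MIGQKTLYSFFSPSPAR", 21: "GLKLKQPRLDNFFKTNTSS",
    22: "ARKRKQTTIEDFFGTKKST", 23: "QSKPQQKSIMSFFGKK",
    24: "KRLKKQGTLESFFKRKAK", 25: "GKANRQVSITGFFQRK",
}

first = re.compile(r"([QN]|^M).[^P][MIL][^P]{2}[FHM][FYM]")
refined = re.compile(r"([QN].|^M.?)[^P][MIL][^P]{2}[FHM][FYM]")

# Sequence numbers for which search() finds no match.
missed_by_first = [n for n, p in peptides.items() if not first.search(p)]
missed_by_refined = [n for n, p in peptides.items() if not refined.search(p)]

print(missed_by_first)    # [14]
print(missed_by_refined)  # []
```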
There might be a situation where you cannot come up with a single pattern. We could also write the above final pattern as two patterns,
([QN]|^M).[^P][MIL][^P]{2}[FHM][FYM]
and
^M.[MIL][^P]{2}[FHM][FYM]
for the missing one.
Then we could combine them
([QN]|^M).[^P][MIL][^P]{2}[FHM][FYM]|^M.[MIL][^P]{2}[FHM][FYM]
This also works fine. When you are not able to write an optimal regular expression for a pattern consider writing more than one regular expression. There is no harm. Any regular expression is fine as long as it works.
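Both ways of using more than one regular expression can be sketched in Python: joining the patterns with '|', or keeping them in a list and accepting a peptide when any of them matches. The three peptides are Seq 1, Seq 14 and Seq 16 from the alignment, the last string is a made-up negative control, and the helper matches_any is only for illustration. The patterns are written without spaces, since Python would match those literally.

```python
import re

first = r"([QN]|^M).[^P][MIL][^P]{2}[FHM][FYM]"
extra = r"^M.[MIL][^P]{2}[FHM][FYM]"

# Way 1: one combined pattern built with alternation.
combined = re.compile(first + "|" + extra)

# Way 2: a list of patterns, tried one after the other.
patterns = [re.compile(first), re.compile(extra)]

def matches_any(peptide):
    """True if at least one pattern in the list is found in the peptide."""
    return any(p.search(peptide) for p in patterns)

tests = ["GRKRRQTSMTDFYHSKRRL", "MDIRKFFGVIPSG", "MVNISDFFGKNKKS",
         "AAAAAAAAAA"]   # the last one is a made-up non-binder

for pep in tests:
    print(pep, bool(combined.search(pep)), matches_any(pep))
```

The list form is often easier to maintain, because each pattern can be tested and documented on its own.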
We have not seen much about how to use regular expressions with Python. I would point you to http://www.python.org/doc for a detailed description about how to use regexp.
PROSITE writes its patterns in its own notation, for example
[AC]-x-V-x(4)-{ED}.
where [] has the same meaning as we have seen, x stands for any amino acid, () specifies a repeat count or range, and {} specifies the residues that are not allowed, which is equivalent to the [^] we have seen. The '-' character is used for readability and the '.' shows that the pattern ends. PROSITE also uses '<' for the N-terminal and '>' for the C-terminal.
The POSIX equivalent of the above PROSITE regexp (which can be used with the Python re engine) is
[AC].V.{4}[^ED]
which is more readable I think.
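The translation can also be done by a few lines of Python. The function below, prosite_to_regex, is a hypothetical helper written for this page, not part of any library, and it is only a sketch: it handles the elements described above and nothing else (for instance, it assumes that 'x' is the only lowercase letter in the pattern). The test peptide is made up.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into Python re syntax."""
    pattern = pattern.rstrip(".")                 # the trailing '.' ends the pattern
    parts = []
    for element in pattern.split("-"):            # '-' only separates positions
        element = element.replace("<", "^")       # N-terminal anchor
        element = element.replace(">", "$")       # C-terminal anchor
        element = element.replace("{", "[^").replace("}", "]")  # {ED} -> [^ED]
        element = element.replace("(", "{").replace(")", "}")   # x(4) -> x{4}
        element = element.replace("x", ".")       # x -> any residue
        parts.append(element)
    return "".join(parts)

regex = prosite_to_regex("[AC]-x-V-x(4)-{ED}.")
print(regex)

# Made-up peptide: A, any residue, V, four residues, then not E or D.
print(bool(re.search(regex, "KAGVLLLLK")))
```

The order of the replacements matters: the braces are turned into a negated character class first, so that the braces created from the parentheses afterwards are left alone.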
© Copyright Chenna Ramu, EMBL, Heidelberg, Germany