Regular Expression for Protein Motif Search



ELM Meeting, Sept 30th - Oct 1st, 2001
Universita Di Roma, Tor Vergata, Italy.
Chenna Ramu, EMBL, Heidelberg, Germany


ABSTRACT
 
The Eukaryotic Linear Motif (ELM) database will devote one of its fields to protein sequence patterns, or motifs. Protein motifs become important when it is too difficult to detect an unknown protein's resemblance to a protein of known structure by aligning the two. These motifs can arise because of a particular requirement on the structure of specific regions, so we can use them to characterise an unknown protein. Motifs can be created by clustering a family of proteins and identifying the conserved positions.

Since these motifs can be represented as regular expressions, it is important to understand regular expressions in general, and how to use them with the Python language in particular (see Appendix).

Search for a protein motif with regular expression in protein databases

Regular Expressions (regexp)

A regular expression (regexp for short) is a powerful notational algebra that describes a string or a set of strings. It can be used whenever one wants to find patterns in strings.

Pattern Matching

A regular expression matches a text string if some substring of that string fits the pattern of the regexp. A regexp therefore evaluates to true for the strings it matches and to false for the strings it doesn't.

Evaluation of a regular expression is called pattern matching.
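The true/false behaviour can be sketched with Python's re module. This is only an illustration; the two sample strings are made up. re.search returns a match object (which counts as true) when some substring matches, and None (which counts as false) otherwise.

```python
import re

# re.search scans the whole string and returns a match object or None.
hit = re.search("Kinase", "Protein Kinase")
miss = re.search("Kinase", "Protein Phosphatase")

print(bool(hit))     # a match object counts as true
print(bool(miss))    # None counts as false
print(hit.group(), hit.start(), hit.end())   # matched text and its position
```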

Regular expressions are made up of terms, operators and modifiers.

Terms are strings or substrings. For example, the term "Kinase" matches a substring of the string "Protein Kinase".

Operators combine terms and expressions. For example, the alternation operator '|' in "Ala|Met" matches either "Ala" or "Met".

Modifiers change the rules of matching. An example is the compilation flag IGNORECASE, which we can set with re.I or re.IGNORECASE. Other flags are re.M for multiline matching and re.S, which makes '.' match any character, including a newline.
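As a small sketch of how these flags change a match, the snippet below runs the same terms with and without re.I, re.M and re.S. The two-line sample text and the variable names are invented for the example.

```python
import re

text = "protein kinase\nKINASE domain"   # made-up two-line string

# No flag: matching is case-sensitive.
case_sensitive = re.findall("kinase", text)
# re.I (re.IGNORECASE): upper and lower case are treated alike.
ignore_case = re.findall("kinase", text, re.I)
# re.M (re.MULTILINE): '^' also matches at the start of every line.
line_starts = re.findall("^kinase", text, re.I | re.M)
# re.S (re.DOTALL): '.' matches a newline as well.
dot_no_newline = re.search("kinase.KINASE", text)
dot_with_newline = re.search("kinase.KINASE", text, re.S)

print(case_sensitive, ignore_case, line_starts)
print(dot_no_newline is not None, dot_with_newline is not None)
```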

Anatomy of a regexp pattern

Regexp Anatomy

      Grouping
      |
      |    Character
      |     class  \
      |    |    |    \               Literal         Anchor 
      |    |    |      \           characters        terms
      |    |    |        \            / \              |
      |    |    |         |          /   \             |
      |    |    |         |         /     \            |
      |    |    |         |        /       \           |

  ^   (  [a-z]  \w  +  )  .  *  (  Ala  |  Met  )  \1  $  I   
  |                 \        /          .           |     |
  |                  \      /           .           |     |
  |                   \    /            .           |     |
  |                    \  /             .           |   Modifier
  |                     \/         Alternation      |
  |                     Repeat                     Back
  |                     operators                  reference
  |
  Anchor
  terms 
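The pattern in the diagram can be tried out directly. In this sketch the trailing modifier I is assumed to correspond to the re.I flag in Python, and the test strings are invented. Group 1 captures a word at the start, and the back reference \1 requires the same text again after Ala or Met at the end.

```python
import re

# ^ and $ anchor the match, ( ) group, [a-z] and \w are character classes,
# + and * repeat, | alternates, \1 refers back to the first group.
anatomy = re.compile(r"^([a-z]\w+).*(Ala|Met)\1$", re.I)

m = anatomy.search("abc xyz Metabc")
print(m.group(1))   # text captured by the first group
print(m.group(2))   # the alternative that matched

# The back reference fails here because the string does not end in "abc".
no_match = anatomy.search("abc xyz Metabd")
print(no_match)
```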

Protein Motifs can be described by regular expressions.

Most often the binding sites of proteins have particular requirements that limit the number of residues as well as the amino acids that are part of the binding site. Clustering these sites enables us to define a pattern for a particular site. Once the pattern is defined and made regexp friendly, it is easy to write computer programs that search a new protein sequence for the motif.

Python and regexp.

Python has a regexp module called re. Python regular expressions are often compiled into regexp objects. These objects have methods for different tasks such as searching for patterns, splitting a string and substituting strings. We will use compile to compile our regexps, and match and search to look through a string of amino acids.
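A minimal sketch of these calls, using the toy motif SMART and a made-up peptide. Note the difference between the two: match only succeeds at the start of the string, while search looks for the motif anywhere in it.

```python
import re

motif = re.compile("SMART")     # compile once, reuse many times
sequence = "MKSMARTLL"          # made-up peptide

at_start = motif.match(sequence)    # None: the peptide does not begin with SMART
anywhere = motif.search(sequence)   # finds SMART inside the peptide

print(at_start)
print(anywhere.start(), anywhere.group())

# Other methods of the compiled object: splitting and substitution.
pieces = motif.split(sequence)
masked = motif.sub("-----", sequence)
print(pieces, masked)
```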

Writing regular expressions.

Describing protein motifs needs only a limited part of the regular expression syntax, and we will learn only that necessary subset. For a full description of the other features of regular expressions refer to http://www.python.org/doc

The regexp notations we need to represent amino acids (aa) in a motif are


   N-terminal residue
   C-terminal residue
   Any residue
   Optional residue
   Representation of gaps (i.e. variable length regexps)

Before doing that let us familiarize ourselves with the following regexp metacharacter table.

Note that the following characters must be escaped (preceded by a backslash) if we want to match them literally, but for our purpose it is unlikely that one of these characters is going to represent an amino acid!

         \ ^ $ + * | [ { ( ) ? .
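One place where escaping can matter is the '*' that ends some peptides in the alignment further down. The sketch below assumes we want to treat that '*' as a plain character; re.escape is the standard helper that adds the backslashes for us.

```python
import re

peptide = "QSKPQQKSIMSFFGKK*"

# Escaped: '\*' is a literal asterisk, here required at the end of the string.
stop = re.compile(r"K\*$")
literal_hit = stop.search(peptide)

# Unescaped: 'K*' means "zero or more K", so it even matches a string without K.
unescaped_hit = re.search("K*$", "AAA")

# re.escape() inserts the backslashes where they are needed.
escaped = re.escape("K*")

print(literal_hit is not None, unescaped_hit is not None, escaped)
```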

Summary

Metacharacter  Name                        Meaning                                     How do we use them?
.              dot                         any character except newline                any aa/gap
[a-z]          character class             any character from a to z                   amino acids in one-letter code
[^a-z]         negated character class     any character except a to z                 reject certain amino acids
?              optional character          match the previous char if it is present    optionally match the previous amino acid
*              star                        match zero or more times                    find >=0 'clones'
+              plus                        match one or more times                     find >=1 'clones'
^              caret                       match at the beginning                      N-terminal
$              dollar                      match at the end                            C-terminal
|              alternation                 match either expression it separates        match either/or
{m,n}          range specifier (operator)  at least m and at most n repeats            variable number of aa's or variable gaps

The N-terminal part of a motif can be represented by the caret character '^'. Example: '^SMART'

The C-terminal part of a motif can be represented by the dollar character '$'. Example: 'SMART$'

For a position where any residue is allowed we can use the dot character '.'. Example: 'SMA.T'

For an optional residue we can use the question mark '?'. Example: 'SMAR?T', which matches SMART as well as SMAT
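The four notations can be checked against a few invented peptides. The helper found is a hypothetical name introduced only for this sketch.

```python
import re

def found(pattern, peptide):
    """True if the pattern matches somewhere in the peptide."""
    return re.search(pattern, peptide) is not None

examples = [
    ("^SMART", "SMARTKK"),   # motif at the N-terminus
    ("^SMART", "KKSMART"),   # motif present, but not at the N-terminus
    ("SMART$", "KKSMART"),   # motif at the C-terminus
    ("SMA.T",  "KSMAKTK"),   # any residue in the fourth position
    ("SMA.T",  "KSMATK"),    # the dot must consume exactly one residue
    ("SMAR?T", "SMAT"),      # optional R absent
    ("SMAR?T", "SMARRT"),    # '?' allows at most one R
]
for pattern, peptide in examples:
    print(f"{pattern:8} {peptide:8} {found(pattern, peptide)}")
```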

Representing gaps within a motif is a bit tricky. We can combine the dot character with range specifiers to accomplish this.

Example

          SMAR.{0,4}T is shorthand for the set of patterns:
          SMART
          SMAR.T
          SMAR..T
          SMAR...T
          SMAR....T
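The expansion listed above can be generated and checked in a few lines. The peptides with K as the gap residue are made up; fullmatch is used so that the whole peptide has to fit the pattern.

```python
import re

gapped = re.compile(r"SMAR.{0,4}T")

# Spell out the five fixed-length patterns that the range specifier stands for.
expanded = ["SMAR" + "." * n + "T" for n in range(5)]
print(expanded)

# Gaps of 0, 2 and 4 residues fit; a gap of 5 residues is too long.
results = {}
for peptide in ["SMART", "SMARKKT", "SMARKKKKT", "SMARKKKKKT"]:
    results[peptide] = gapped.fullmatch(peptide) is not None
    print(peptide, results[peptide])
```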

What we mean by a 'gap' here is that two sub-motifs are separated by a variable number of unconserved amino acids. We treat these unconserved amino acids as gaps in the regexps.

Suppose that we have the following alignment and we want to define a motif.

PCNA binders
The alignment for 'PCNA binders' is

                           10                19 
          1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9

          G R K R R Q T S M T D F Y H S K R R L     # Seq  1
          G R K R R Q T S L T D F Y H S K R R L     # Seq  2
              R K R Q P K I T E F M K E R K R L     # Seq  3

          R Q G S T Q G R L D D F F K V T G S L     # Seq  4
          S K T I P Q G R L D S F F K P V P S S     # Seq  5
          L K S G I Q G R L D G F F Q V V P K T     # Seq  6

          D A Q A T Q L R I D S F F R L A Q Q E     # Seq  8
          Q F V G T Q S N L T Q F F E G G N T N     # Seq  9
          K K K G K Q K R I N E F F P R E Y I S     # Seq 10

                  M Q R S I M S F F H P K K E G     # Seq 11
          R M S T R Q S D I S N F F I S S A S H     # Seq 12
          G K K P K Q A T L A R F F T S M K N K     # Seq 13

                      M D I R K F F G V I P S G     # Seq 14
                M S N S D I R S F F G G G N A Q     # Seq 15
                    M V N I S D F F G K N K K S     # Seq 16

              S T R Q T T I T S H F A K G P A K     # Seq 17
              T T R Q T T I T A H F T K G P T K     # Seq 18
              A G K Q P T I L S M F S K G S T K     # Seq 19

              M I G Q K T L Y S F F S P S P A R     # Seq 20
          G L K L K Q P R L D N F F K T N T S S     # Seq 21
          A R K R K Q T T I E D F F G T K K S T     # Seq 22

          Q S K P Q Q K S I M S F F G K K *         # Seq 23
          K R L K K Q G T L E S F F K R K A K *     # Seq 24
          G K A N R Q V S I T G F F Q R K *         # Seq 25

          1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9
                           10                19 
Now the sixth column is conserved for Q, N and M. Since Met in this position is always the initiator amino acid, we have to treat it separately. So the first amino acid specification for a regexp would be

from the 6th column

    [QN]|^M

This means that the motif starts either with Q or N anywhere in a protein sequence, or with methionine at the N-terminus. The pattern is written without spaces because the re module treats a space as a literal character unless the re.X (re.VERBOSE) flag is set.
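A short sketch of what this first element does on its own. The helper first_hit and the peptide fragments are invented for the illustration. The pattern is written as [QN]|^M without spaces, since a space in a Python regexp is a literal character unless the re.X flag is given.

```python
import re

first_residue = re.compile(r"[QN]|^M")

def first_hit(peptide):
    """Return (position, residue) of the first hit, or None if there is none."""
    m = first_residue.search(peptide)
    return (m.start(), m.group()) if m else None

# M counts only at the N-terminus; Q and N count anywhere.
for peptide in ["MVNISD", "AMVKISD", "GRKRRQTS"]:
    print(peptide, first_hit(peptide))

# With spaces the pattern asks for literal blanks, unless re.X is set.
with_spaces = re.search("[QN] | ^M", "MVNISD")
verbose = re.search("[QN] | ^M", "MVNISD", re.X)
print(with_spaces, verbose.group())
```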

The next column (7th) is variable, so let us denote it with '.'

    ([QN]|^M).

The 8th column allows all amino acids except proline, because the motif is in a helix

    ([QN]|^M).[^P]

The 9th column is one of M, I or L, so the motif grows to

    ([QN]|^M).[^P][MIL]

The 10th and 11th columns again allow all aa's except proline

    ([QN]|^M).[^P][MIL][^P][^P]

The 12th column is one of F, H or M, and the 13th column is one of F, Y or M

    ([QN]|^M).[^P][MIL][^P][^P][FHM][FYM]

We can also write this with a range specifier as

    ([QN]|^M).[^P][MIL][^P]{2}[FHM][FYM]

If you run this pattern against all of the above peptides, it will miss at least one sequence:

M D I R K F F G V I P S G


Let us examine this in detail:


    M D I R K F F G V I P S G 
    . . . 
Our pattern fails at the fourth amino acid, arginine (R), because the fourth amino acid should be one of [MIL]. The I in the third position is consumed by [^P], so it is not available to match [MIL]. Notice that if we make the second amino acid optional when the N-terminal residue is Met, everything becomes fine. So the refined pattern

        ([QN].|^M.?)[^P][MIL][^P]{2}[FHM][FYM]   # This one works ! 

satisfies all above sequences. Finally we are there :-)
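Both claims can be checked with a short script. This is a sketch under two assumptions: the peptides are transcribed from the alignment above with the gaps closed up and the trailing '*' dropped, and the function name missed is made up for the example.

```python
import re

# Peptides from the alignment, keyed by their "Seq" number.
peptides = {
    1: "GRKRRQTSMTDFYHSKRRL",   2: "GRKRRQTSLTDFYHSKRRL",
    3: "RKRQPKITEFMKERKRL",     4: "RQGSTQGRLDDFFKVTGSL",
    5: "SKTIPQGRLDSFFKPVPSS",   6: "LKSGIQGRLDGFFQVVPKT",
    8: "DAQATQLRIDSFFRLAQQE",   9: "QFVGTQSNLTQFFEGGNTN",
    10: "KKKGKQKRINEFFPREYIS",  11: "MQRSIMSFFHPKKEG",
    12: "RMSTRQSDISNFFISSASH",  13: "GKKPKQATLARFFTSMKNK",
    14: "MDIRKFFGVIPSG",        15: "MSNSDIRSFFGGGNAQ",
    16: "MVNISDFFGKNKKS",       17: "STRQTTITSHFAKGPAK",
    18: "TTRQTTITAHFTKGPTK",    19: "AGKQPTILSMFSKGSTK",
    20: "MIGQKTLYSFFSPSPAR",    21: "GLKLKQPRLDNFFKTNTSS",
    22: "ARKRKQTTIEDFFGTKKST",  23: "QSKPQQKSIMSFFGKK",
    24: "KRLKKQGTLESFFKRKAK",   25: "GKANRQVSITGFFQRK",
}

first = re.compile(r"([QN]|^M).[^P][MIL][^P]{2}[FHM][FYM]")
refined = re.compile(r"([QN].|^M.?)[^P][MIL][^P]{2}[FHM][FYM]")

def missed(pattern):
    """Return the numbers of the peptides that the pattern does not match."""
    return [n for n, seq in peptides.items() if pattern.search(seq) is None]

print("first pattern misses:  ", missed(first))
print("refined pattern misses:", missed(refined))
```

Listing the misses rather than the hits makes it easy to see which sequence a draft pattern still has to accommodate.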

There might be a situation where you cannot come up with a single pattern. We could also write the above final pattern as two patterns,


        ([QN]|^M).[^P][MIL][^P]{2}[FHM][FYM]
and
        ^M.[MIL][^P]{2}[FHM][FYM]

for the missing one.

Then we could combine them

        ([QN]|^M).[^P][MIL][^P]{2}[FHM][FYM]|^M.[MIL][^P]{2}[FHM][FYM]
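The two-pattern approach might look as follows in Python. The names general, n_terminal and hits_any are chosen for this sketch only, the patterns are written without spaces, and four peptides from the alignment serve as samples. The patterns can either be joined with '|' or kept in a list and tried one after the other.

```python
import re

general = r"([QN]|^M).[^P][MIL][^P]{2}[FHM][FYM]"
n_terminal = r"^M.[MIL][^P]{2}[FHM][FYM]"

# Option 1: join the two patterns with the alternation operator.
combined = re.compile(general + "|" + n_terminal)

# Option 2: keep them separate and accept a hit from either one.
def hits_any(peptide, patterns=(general, n_terminal)):
    """True if at least one of the patterns matches the peptide."""
    return any(re.search(p, peptide) for p in patterns)

samples = ["GRKRRQTSMTDFYHSKRRL", "MDIRKFFGVIPSG",
           "MSNSDIRSFFGGGNAQ", "MVNISDFFGKNKKS"]
for seq in samples:
    m = combined.search(seq)
    print(seq, m.group() if m else None, hits_any(seq))
```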

This also works fine. When you are not able to write an optimal regular expression for a pattern consider writing more than one regular expression. There is no harm. Any regular expression is fine as long as it works.


We have not seen much about how to use regular expressions with Python. I would point you to http://www.python.org/doc for a detailed description about how to use regexp.

A Note about the Prosite database pattern

The PROSITE database describes motifs with a different pattern syntax. For example

[AC]-x-V-x(4)-{ED}.

where [] has the same meaning as we have seen, x stands for any amino acid, () specifies the range, and {} specifies an excluded character class, which is equivalent to the [^] we have seen. The '-' character is used for readability and the '.' shows that the pattern ends. PROSITE also uses '<' for the N-terminal and '>' for the C-terminal.

The POSIX equivalent of the above PROSITE pattern (which can be used with the Python re engine) is

[AC].V.{4}[^ED]
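The translation rules just described are mechanical enough to sketch in code. The function prosite_to_regex below is a hypothetical helper, not part of PROSITE or of the re module, and it covers only the notations mentioned here, applied by plain string replacement.

```python
import re

def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern into a Python regexp (simplified sketch)."""
    pattern = pattern.rstrip(".")           # the trailing '.' only ends the pattern
    parts = []
    for element in pattern.split("-"):      # '-' only separates the positions
        element = element.replace("{", "[^").replace("}", "]")   # {ED} -> [^ED]
        element = element.replace("(", "{").replace(")", "}")    # x(4) -> x{4}
        element = element.replace("x", ".")                      # x    -> .
        element = element.replace("<", "^").replace(">", "$")    # terminal anchors
        parts.append(element)
    return "".join(parts)

regex = prosite_to_regex("[AC]-x-V-x(4)-{ED}.")
print(regex)

# Made-up peptides: the last residue must not be E or D.
print(re.search(regex, "AKVGGGGK") is not None)
print(re.search(regex, "AKVGGGGE") is not None)
```

The order of the replacements matters: the curly brackets are rewritten before the round ones, so the newly created {4} is not mistaken for an excluded class.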

which is more readable I think.

