Exploring Modular Protein Architecture


Check what’s already known about your protein!

It may seem obvious, but you'd be surprised how often people ask us to predict the function of a protein without first checking which modules have already been identified or experimentally demonstrated!

An obvious first place to check is the UniProt record for the protein of interest - this provides a predominantly manually-curated description of its function, along with links to a wide array of different databases/data resources containing information about the protein.

However, this information inevitably becomes out of date. Depending on how much effort you are prepared to expend (which usually reflects how important understanding the function of the protein is to you), you may therefore want to investigate additional data resources. Note that some (but not all) of the data found in such resources may also be found in the published literature.

There are very many different bioinformatic data resources available - too many for us to have used or even heard of all of them, and certainly too many to list here. When looking for a resource to fit your needs, obviously try normal internet searching, but you can also try visiting these sites that provide lists of such resources:
Additionally, there are many tools/sites available that aim to integrate some/many of these kinds of resources to make it easier and quicker to navigate through this jungle of resources. We don't have time to look at these now, but if you're interested you can look into these yourself:


We check the human EPN1_HUMAN sequence for phosphorylation sites in PhosphoSite, the Human Protein Reference Database, and phospho.ELM, and compare them to the ones found in the UniProt record.

Note that we often have the choice with such resources to query both with names/accession numbers of proteins, and with their sequences - there are advantages and disadvantages to both of these query methods.

Note also that there is overlapping but different information stored in the different data resources - highlighting the importance of checking several different sites if you are very interested in a particular protein.
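The comparison between resources can be made systematic once you have noted down the residue positions each one reports. The following is a minimal sketch, assuming you have collected the positions by hand into Python sets. The function name `compare_site_sets`, the resource labels, and the positions are all hypothetical placeholders, not real annotations for any protein.

```python
from typing import Dict, Set


def compare_site_sets(sites_by_resource: Dict[str, Set[int]]) -> Dict[str, Set[int]]:
    """Summarise how residue positions reported by several resources overlap."""
    all_sets = list(sites_by_resource.values())
    # Positions that every resource reports.
    shared = set.intersection(*all_sets) if all_sets else set()
    summary = {"shared_by_all": shared}
    for name, sites in sites_by_resource.items():
        # Union of the positions reported by all the *other* resources.
        others = set().union(
            *(s for n, s in sites_by_resource.items() if n != name)
        )
        summary[f"only_in_{name}"] = sites - others
    return summary


# Placeholder positions for illustration only - NOT real annotations.
example = {
    "uniprot": {10, 25, 40},
    "resource_b": {10, 40, 77},
    "resource_c": {10, 25, 90},
}

for label, positions in compare_site_sets(example).items():
    print(label, sorted(positions))
```

Positions that appear in only one resource are the ones worth tracing back to the original literature.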


Try out similar queries using PAX6_HUMAN, CBP_HUMAN, and/or your own sequences of interest. Explore the data records for these proteins from these different data resources - each of them provides different information associated with the proteins and their modules.

Do you find any of these resources better/worse than the others? If so, why? Does one of the sites tend to have more comprehensive coverage of functional modules/sites in proteins?

Identify Modules Predicted with High Confidence

Many protein modules (in particular globular domains) can be predicted with a very low false positive rate, and the prediction servers are typically parameterised to achieve this. Therefore, if one of these tools predicts the presence of a module, you can be almost certain that it is indeed present in the protein. Such predictions are useful in two ways. First, we can be confident (although not certain) that the protein has the functions associated with this module. Second, they provide a reference against which to interpret other predictions: in the case of incompatible overlap with another prediction, we prefer to assume that the high-confidence prediction is correct and the other is incorrect.


We will use the SMART, PFAM, and CDD webservers to identify protein modules that are predicted with high confidence.

Bookmark and/or save the results of these analyses (and the others you do in this practical), as in some cases you will be asked to return to them later in the practical, as we will be using additional information/ideas to interpret them differently. Give the files names such as "EPN1_HUMAN_smart.html".

We will use these tools to investigate the modular structure of EPN1_HUMAN.


Carry out similar queries using the PAX2_HUMAN sequence.

How well do the boundaries of the modules identified using these different tools agree with the boundaries given in UniProt?

If you have time, try a similar query with SRC_HUMAN.

Is one of the servers clearly best/worst at accurately defining the boundaries of the modules?

You will have seen that there is considerable variation in the results of the three servers, all of which are attempting to predict protein modules using similar methods (HMMs/PSSMs). This highlights the fact that, if you are keen to determine possible functions of a protein, it is a good idea to try several bioinformatic servers to predict protein modules. Relying on just one, you might miss out on identifying important features.

Match linear motif patterns

The sequence of a module that can be predicted with high confidence has a pattern of conservation/a set of constraints that greatly differentiates it from "typical" or "random" protein sequence. Therefore, if you find this pattern in your sequence, you can be almost certain that it does indeed contain the module i.e. has the characteristic structure and function of the module.

In contrast, other modules (in particular short functional sites/linear motifs such as phosphorylation, acetylation, cyclin-binding sites etc.) place few constraints on the protein sequence, such that the sequence of regions that contain the module is difficult to differentiate from regions that do not contain the module. For example, tyrosine kinases are generally able to phosphorylate pretty much any tyrosine residue they encounter - however we know that many/most tyrosines in the proteome are not phosphorylated.


We will use ELM, ScanSite, Prosite (switching off "exclude patterns with a high probability of occurrence"), to explore sites of interest in EPN1_HUMAN.

You can see that many sites are matched, most of which are unlikely to be functional. Therefore we try to use additional information to help discriminate between true and false positives (this is the purpose of the ELM filters).

For example, most of the sites predicted with low confidence are short (often between 3 and 10 amino acids long), and need to be in relatively flexible, disordered regions of protein structure.
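To see why short patterns match so often, it helps to run one by hand. The sketch below is only an illustration, not how the ELM, ScanSite or Prosite servers are implemented. It translates the PROSITE N-glycosylation pattern PS00001, N-{P}-[ST]-{P}, into a regular expression and scans a sequence with it. The function name `find_motif_matches` and the toy sequence are made up for this example.

```python
import re
from typing import List, Tuple

# PROSITE PS00001 (N-glycosylation site): N-{P}-[ST]-{P}, written as a regex.
N_GLYCOSYLATION = r"N[^P][ST][^P]"


def find_motif_matches(sequence: str, pattern: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, matched_text) using 1-based inclusive coordinates.

    The pattern is wrapped in a lookahead so that overlapping matches
    are reported as well.
    """
    lookahead = re.compile(f"(?=({pattern}))")
    hits = []
    for match in lookahead.finditer(sequence.upper()):
        text = match.group(1)
        start = match.start() + 1
        hits.append((start, start + len(text) - 1, text))
    return hits


# Invented 20-residue sequence, used only to show the mechanics.
toy_sequence = "MKNASTNPSAANGTLNNSSK"

for start, end, text in find_motif_matches(toy_sequence, N_GLYCOSYLATION):
    print(f"{start}-{end}\t{text}")
```

Even this 20-residue toy sequence gives four matches, which illustrates why pattern matches alone are weak evidence of function.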


Using the servers mentioned above, examine the sequence of EPN1_HUMAN to identify possible short functional sites.

Can you identify any such sites you consider more likely to be true positives/truly functional than others using these tools?

Predict globular/non-globular regions

One way to help discriminate between functional and non-functional matches to linear motif patterns is to predict whether the site lies within an appropriate structural context - such sites typically need to be in non-globular/unstructured/disordered regions.

There are several different websites available that predict the disorder tendency of protein sequences e.g. RONN, IUPred, GlobPlot, DisProt. As the training sets and methods used to develop these servers tend to be rather different from each other, they will often predict the order/disorder of a region differently. However, in other cases the tools agree - then, by taking a consensus view, we can obtain fairly confident predictions of ordered/disordered regions (although there will be other regions, where the servers disagree with each other, where we cannot make a confident prediction of disordered/ordered regions).
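The consensus idea can be written down in a few lines. This is a minimal sketch, assuming you have already converted each server's output into one True/False call per residue (True = disordered). The function name, the predictor labels, and the calls are hypothetical, and requiring unanimous agreement is just one possible definition of "consensus".

```python
from typing import Dict, List


def consensus_disorder(calls: Dict[str, List[bool]]) -> str:
    """Combine per-residue disorder calls (True = disordered) from several predictors.

    Returns one character per residue:
      'D' = all predictors say disordered
      'O' = all predictors say ordered
      '?' = the predictors disagree
    """
    lengths = {len(v) for v in calls.values()}
    if len(lengths) != 1:
        raise ValueError("all predictors must cover the same number of residues")
    states = []
    for column in zip(*calls.values()):
        if all(column):
            states.append("D")
        elif not any(column):
            states.append("O")
        else:
            states.append("?")
    return "".join(states)


# Invented calls for a 6-residue stretch, for illustration only.
example_calls = {
    "predictor_a": [True, True, False, False, True, False],
    "predictor_b": [True, True, False, True, True, False],
    "predictor_c": [True, False, False, False, True, False],
}
print(consensus_disorder(example_calls))
```

Residues marked '?' are the regions where no confident order/disorder prediction can be made.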

However, as already mentioned, if these predictions disagree with high-confidence predictions (e.g. a globular domain is predicted in a region that is also predicted to be disordered), we assume that the order/disorder prediction is wrong.
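This precedence rule is simple to apply programmatically. The sketch below assumes the disorder prediction is stored as a string with one character per residue ('D' for disordered, 'O' for ordered) and that the high-confidence domains are given as 1-based inclusive (start, end) pairs. The function name and the example values are hypothetical.

```python
from typing import List, Tuple


def apply_domain_override(disorder: str, domains: List[Tuple[int, int]]) -> str:
    """Force residues inside high-confidence domains to ordered ('O').

    disorder: one character per residue, 'D' = disordered, 'O' = ordered.
    domains:  1-based inclusive (start, end) coordinates of predicted domains.
    """
    states = list(disorder)
    for start, end in domains:
        if start < 1 or end > len(states) or start > end:
            raise ValueError(
                f"domain ({start}, {end}) does not fit a sequence of length {len(states)}"
            )
        for i in range(start - 1, end):
            states[i] = "O"
    return "".join(states)


# Invented 10-residue prediction with one domain spanning residues 3-7.
print(apply_domain_override("DDDOODDDDD", [(3, 7)]))
```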


We'll use these four disorder-prediction servers to identify regions of potential order and disorder in the EPN1_HUMAN sequence.


Using the same set of four predictors, investigate the order/disorder predictions for SRC_HUMAN.
Looking at the disorder/order prediction for PAX2_HUMAN, and focusing on the EH1-interaction motif - does the disorder prediction provide additional evidence for or against this motif being a true instance?

Multiple Sequence Alignments (MSAs)

MSAs play a central role in many different sequence analysis tools. Additionally, they can provide an invaluable tool/platform against which to compare and integrate experimental data obtained from members of a protein family sampled from different model organisms and orthologous groups.

Additionally, MSAs can help to predict protein modules - features of the sequences that are important for the function of the proteins are likely to be conserved over relatively long periods of time (as many characteristics of protein function evolve relatively slowly, for example, in comparison to the relatively quick evolution of gene-expression patterns).

In particular, observing matches to linear motifs that are more conserved than their flanking sequences is a strong indicator that the motif is functional in many of these protein sequences.
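One rough way to quantify "more conserved than the flanking sequences" is to compare the average column conservation inside the motif with that of the neighbouring columns. The sketch below uses a deliberately simple conservation score (the fraction of sequences sharing the most common residue in a column). Alignment viewers such as JalView use more sophisticated measures. The function names and the toy alignment are made up for illustration.

```python
from collections import Counter
from typing import List, Tuple


def column_conservation(alignment: List[str]) -> List[float]:
    """Fraction of sequences sharing the most common residue in each column.

    Gap characters ('-') never count towards the most common residue.
    """
    if not alignment or len({len(row) for row in alignment}) != 1:
        raise ValueError("alignment must contain rows of equal length")
    n_rows = len(alignment)
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / n_rows)
    return scores


def _mean(values: List[float]) -> float:
    return sum(values) / len(values) if values else 0.0


def motif_vs_flank(alignment: List[str], start: int, end: int,
                   flank: int = 5) -> Tuple[float, float]:
    """Mean conservation of the motif columns and of the flanking columns.

    start/end are 0-based column indices, end exclusive.
    """
    scores = column_conservation(alignment)
    motif = scores[start:end]
    flanks = scores[max(0, start - flank):start] + scores[end:end + flank]
    return _mean(motif), _mean(flanks)


# Invented alignment: a conserved 4-column core flanked by variable columns.
toy_alignment = [
    "AKFSILGPT",
    "GRFSILSAQ",
    "TEFSILN-E",
    "SDFSVLHKD",
]
print(motif_vs_flank(toy_alignment, 2, 6, flank=2))
```

A motif score well above the flank score is the pattern to look for. Equal scores in a uniformly conserved globular domain say little about the motif itself.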

Demonstration - Creating an alignment "from scratch"

We will begin by calculating an MSA "from scratch" i.e. beginning with a sequence of interest (in this case PAX2_HUMAN), and continuing until we have an MSA to examine.

We begin by using BLAST at the EBI - we are working with protein sequences, so we use the blastp program. BLAST allows us to collect a set of related sequences from those available in the public sequence databases.

We can get a quick pseudo-MSA out of these results by using the MVIEW display option from the result page.

Based on the result of this search, we may want to do the BLAST search again using different parameters.

Once we are happy with the results, we can select sequences to download, and collect them in FASTA format.
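Before aligning, it is worth checking that the downloaded FASTA file contains what you expect. The following is a minimal sketch of a FASTA reader, assuming the file contents have been read into a string. The function name `parse_fasta` and the example records are hypothetical, and established libraries such as Biopython provide more robust parsers.

```python
from typing import Dict


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA-formatted text into {identifier: sequence}.

    The identifier is the first whitespace-delimited word after '>'.
    """
    records: Dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("found a '>' line without an identifier")
            name = header[0]
            if name in records:
                raise ValueError(f"duplicate sequence identifier: {name}")
            records[name] = ""
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[name] += line
    return records


# Invented records, for illustration only.
example_text = """>seq1 first example
MKT
AAL
>seq2
MRT
"""
for identifier, sequence in parse_fasta(example_text).items():
    print(identifier, len(sequence))
```

Duplicate identifiers are rejected here because they are a common source of confusion once sequences from several BLAST searches are merged.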

We can do a similar search using the BLAST tools available at the NCBI - this interface has different features from those available at the EBI.

Once we have collected the FASTA format sequences, we can align them to each other.

There is a range of different software available to do this - we will try out some of the tools available at the EBI: ClustalW, T-COFFEE, MAFFT, and MUSCLE.

Note that it is also possible to install these software packages locally on your machines and run them there - if you want an alignment for many, long sequences, you would typically want to do it locally.

At the EBI, we then have the option to download these alignments and view/edit them locally e.g. using CLUSTALX, SEAVIEW, or JalView.

Alternatively, from some of these tools we have the option to open JalView in our webbrowser (or to use the MVIEW output we saw before).

Usually, we would prefer to download the sequences and view them locally as we then have more control over the way the alignment is displayed, and we are able to save the results of any editing etc. we do of the alignment locally.

Having loaded the alignment into JalView locally, we can now examine, for example, the conservation of the EH1 interaction motif in the PAX alignment.

The alignment also highlights many common features of protein alignments - for example, characteristic differences between the alignments of globular and non-globular regions, and different patterns of conservation/substitution at different positions in the alignment.


Using a similar procedure, create an MSA for the protein EPN1_HUMAN - examine the functional sites annotated for the protein in its UniProt record. Are they conserved in many sequences within the alignment? Examine the conservation of other sites in the sequence, for example those predicted by ScanSite using "high stringency" - can you identify any good candidates for true functional sites using the alignment?

Demonstration - Obtaining pre-calculated alignments

If you can find a pre-calculated alignment that already contains a set of sequences that is useful for addressing your problem, you may well be able to save lots of time. Using the databases/resources below, we'll look at how you can get access to alignments that you can download yourself and examine locally.


Using these (and perhaps other alignment resources you might find), find a pre-calculated MSA that will be a good starting place for building an alignment to investigate:


Why might someone be interested in interpreting or estimating a phylogeny?

For a start, phylogenies/evolutionary trees are a central component of any evolutionary analysis, whether this is explicitly stated, or implicitly assumed.

They may be interesting and valuable in themselves - we are curious to understand the order in which the different lineages of life diverged from each other, to understand better the path that evolution has taken, and to attempt to reconstruct the past history of these lineages.

Or we may be interested in estimating the rate of some kind of evolutionary transition - for example the rate of synonymous DNA base substitution events, the rate of intron loss/gain - or aiming to explicitly reconstruct the state of some feature of an ancestral organism or sequence e.g. did the most recent common ancestor of mammals lay eggs or not? If we are interested in such questions, we need to take into account how the sequences/organisms are related to each other. Otherwise we will not be able to determine whether only closely-related sequences have the same features (indicating, for example, that the character changes relatively quickly), or whether a set of very distantly-related sequences have the same state (indicating that the character is relatively stable/slowly-changing). Given that most evolutionary studies involve either estimating the rate of some character changes, or reconstructing ancestral character states, it is clearly important that we make use of phylogenies in these analyses.

Additionally, phylogenies/dendrograms form an important component of many different functional prediction tools (for example STRING, or various other comparative genomics-based applications). Note also that almost all the phylogenies published today, and that you are likely to encounter (and perhaps estimate yourselves), will have been estimated from sequence alignments (either DNA or protein).

Thus, there are many reasons why you might be interested in estimating or interpreting a phylogeny.

Modern state-of-the-art methods for phylogeny reconstruction are relatively sophisticated, and a reasonably good level of understanding of how the methods work is required to use them critically - therefore, given the relatively short amount of time we have to look at this topic, we are going to avoid these issues. Instead we will focus on how to interpret and manipulate the results of such analyses - the phylogenetic trees themselves.

Demonstration - Displaying phylogenetic trees


Here we will look at some different software tools that can be used to display/draw phylogenetic trees. This should help you start exploring both pre-calculated trees, and ones you have estimated yourself, along with helping you become more familiar with looking at and thinking about trees.

NJPLOT is a relatively simple program initially developed to show the results of bootstrapped Neighbour-Joining phylogenies. It is often useful as a relatively quick and easy way of viewing the results of an analysis. While it can accept NEWICK-format unrooted trees as input, it is only able to display such trees with a root. It won't accept trees with any polytomies apart from a single trichotomy (used to specify the position of the root).

DENDROSCOPE is a more comprehensive program that is able to produce trees of nearly publication quality relatively quickly (although in most cases the resulting images require additional editing in other software such as Adobe Illustrator). It can display both rooted and unrooted trees, with scaled and unscaled formats.

MESQUITE is an extremely versatile and flexible software package for carrying out a wide range of different kinds of evolutionary analyses. The price of this flexibility is a relatively steep learning curve. We will use it to visualise AND EDIT evolutionary trees: both DENDROSCOPE and NJPLOT are only able to view the data provided in a NEWICK format tree file, while MESQUITE provides the ability to manually alter the topology and branch lengths of a tree (which can be useful for preparing illustrative figures, or for preparing trees for input to other software or for other analyses within MESQUITE).

The software can also be rather tricky to install correctly - therefore today we are only demonstrating it to you, you won't do any exercises using it.

We will begin by downloading a pre-calculated alignment, using the alignment to obtain a rough estimate of the phylogeny of the sequences, and will then visualise the resulting tree using NJPLOT, DENDROSCOPE, and MESQUITE.

Suppose we are interested in investigating the phylogeny of the drosophilid Oskar genes. To begin our analysis we go to the set of pre-calculated alignments of sets of "orthologous groups" of genes from Chris Ponting's group's OPTIC resource (these are available for other groups e.g. mammals, yeasts, not just drosophilids).

To identify the relevant group of genes, we need to find the relevant identifier for the Oskar genes - we do this by searching flybase.

We query the OPTIC database with CG10901 (the relevant ID - although one actually needs to query with "CG10901-RA").

We click on the link that leads us to the alignments (the rather cryptic "group 3297")

Choose Alignments->Transcripts-AminoAcids, and then export in Fasta format

We then use ClustalX to calculate a quick bootstrapped tree from this alignment
We look at the resulting tree in NJPLOT
Dendroscope can also examine this tree with no trouble
To work on the tree in Mesquite, we need to edit the file
Mesquite, as mentioned, allows you to edit the tree
As you've seen, often tree-viewing software has problems displaying the trees obtained from some sources - we'll look at trees from the TreeFam database, which some of the software has problems with.

To get the Oskar sequences from TreeFam, we query either with the External Accession number from UniProt "P25158" or with "CG10901" under "gene name" (we could find out the relevant accession number for our gene by BLASTing with our sequence of interest against UniProt - this should identify an ID that will be within TreeFam) - here's the alignment.

Loads OK into NJPLOT (it includes the node labels)

To load it into Dendroscope, we need to delete all of the comments (i.e. everything within "[....]") - here it is without the comments.

Again, to get it into Mesquite we need to put everything onto one line - here it is.
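Both clean-up steps (removing the bracketed comments and putting the tree onto one line) can be done in a text editor, but they are also easy to script. This is a minimal sketch, assuming that square brackets are only used for comments and that no taxon label contains spaces or quoted text. The function names and the example tree (including the contents of its comments) are invented for illustration.

```python
import re


def strip_newick_comments(newick: str) -> str:
    """Remove every [...] comment from a Newick string."""
    return re.sub(r"\[[^\]]*\]", "", newick)


def to_single_line(newick: str) -> str:
    """Join a multi-line Newick tree into one line by dropping all whitespace.

    Assumes labels contain no spaces (quoted labels would be damaged).
    """
    return "".join(newick.split())


# Invented tree with bracketed comments spread over several lines.
raw_tree = """(A:0.1[&&NHX:S=example],
 (B:0.2,C:0.3)90[&&NHX:D=N]:0.05
);"""

cleaned = to_single_line(strip_newick_comments(raw_tree))
print(cleaned)
```

Keeping the two steps as separate functions means you can apply only the comment removal when preparing a file for Dendroscope, and both when preparing one for Mesquite.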


Download this tree file (TF101051, cdc6 relatives from TreeFam with several of the branches removed)

Load the tree into NJPLOT. Ignoring formatting, rearrange the branches of the tree in NJPLOT to make them the same as in the image below.

Load the tree into DENDROSCOPE. Carry out the same exercise, but this time use the formatting options to make the tree as similar as possible to the one in the image below.

[Image: Dendroscope tree]

Visit this link to the gnathostome page on the Tree of Life.

Using a text editor, create a file to represent the tree at the top of the Tree of Life webpage that can be successfully displayed in NJPLOT. If that's too easy, try doing the same thing with their amniotes page.
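If the nesting of parentheses in the NEWICK format is unfamiliar, it can help to see how a tree written as nested groups maps onto the text format. The sketch below is only an aid to understanding the format, not a replacement for the exercise. The function name `to_newick` and the taxon names are placeholders, and branch lengths and support values are left out.

```python
from typing import Tuple, Union

# A tree is either a leaf name or a tuple of sub-trees.
Tree = Union[str, Tuple["Tree", ...]]


def to_newick(tree: Tree) -> str:
    """Convert a nested-tuple tree into a Newick string (topology only)."""

    def render(node: Tree) -> str:
        if isinstance(node, str):
            return node
        # Each internal node becomes a parenthesised, comma-separated group.
        return "(" + ",".join(render(child) for child in node) + ")"

    return render(tree) + ";"


# Placeholder taxa: (A,B) is one clade, and C is sister to the clade (D,E).
example_tree = (("taxon_A", "taxon_B"), ("taxon_C", ("taxon_D", "taxon_E")))
print(to_newick(example_tree))
```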

If you have extra time, look through either "Tree of Life" or "History of Life" to find a group of organisms that are interesting for you (I find it easiest to just start from the root nodes and work up), and try the same exercise.

Alternatively, working with a sequence alignment you either create yourself, or that you download pre-calculated, attempt to build your own phylogeny using ClustalX. To give yourself a focus for preparing these alignments, try using them to estimate whether the tree provides evidence of gene duplication/loss events, and if so, how many and when?

NOTE! The method used by ClustalX to estimate phylogenies is relatively simple and is subject to a number of serious systematic errors. If you are serious about preparing a phylogeny in your work, you should certainly use alternative, more accurate methods. However, to obtain an initial overview of a phylogeny, just to get a feel for how the sequences might be related, ClustalX is worth using due to its speed and its relative ease of use.

Estimating Phylogenies with RAxML


  1. Obtain the seed alignment for TreeFam family TF105048.
  2. Load the alignment into ClustalX and examine it - we remove some of the sequences to make the ML analysis quicker
  3. Save in FASTA format
  4. Rename sequences to make them short and alphanumeric only
  5. Run FASTA format file through GBLOCKS to remove columns with gaps, and which are likely to be misaligned, saving the resulting outfile locally (could do this editing "manually" using SEAVIEW or JALVIEW)
  6. Load GBLOCKed fasta file into CLUSTALX and save in PHYLIP format
  7. Load PHYLIP format file into RAxML webserver provided at CIPRES
    1. No bootstrapping
    2. Choose Analysis: WAG + Gamma + F
    3. Choose Tool: RAxML
    4. Give email address
    5. Run Analysis
  8. View resulting tree in NJPLOT/Dendroscope
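The renaming in step 4 is tedious to do by hand for many sequences, and it is easy to lose track of which short name belongs to which original sequence. The following is a minimal sketch, assuming the sequences are held in a Python dictionary of {name: sequence}. The function name `shorten_names`, the "s001"-style prefix, and the example names are all hypothetical. The default limit of 10 characters reflects the name length used by strict PHYLIP format.

```python
import re
from typing import Dict, Tuple


def shorten_names(records: Dict[str, str],
                  max_len: int = 10) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Give every sequence a short, unique, alphanumeric-only name.

    Returns (renamed_records, mapping from new name back to old name).
    Each new name starts with 's' plus a 3-digit counter, which guarantees
    uniqueness, followed by as much of the cleaned old name as fits.
    """
    if len(records) > 999:
        raise ValueError("this sketch supports at most 999 sequences")
    if max_len < 4:
        raise ValueError("max_len must be at least 4")
    renamed: Dict[str, str] = {}
    mapping: Dict[str, str] = {}
    for index, (old_name, sequence) in enumerate(records.items(), start=1):
        cleaned = re.sub(r"[^A-Za-z0-9]", "", old_name)
        new_name = f"s{index:03d}" + cleaned[: max_len - 4]
        renamed[new_name] = sequence
        mapping[new_name] = old_name
    return renamed, mapping


# Invented names and sequences, for illustration only.
example = {
    "gene-1|species_one": "MK-T",
    "gene-2|species_two": "MR-T",
}
renamed, mapping = shorten_names(example)
for new_name, old_name in mapping.items():
    print(new_name, "<-", old_name)
```

Keep the mapping, so that the original names can be restored on the final tree.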


Carry out a similar analysis using either your own alignment of interest, or using TF105084

Phylogenetic Estimation - from start to finish

In these exercises you are asked to go from having a question that can be addressed using phylogenetic approaches, through the process of obtaining an appropriate set of sequences for the analysis, aligning them, estimating a phylogeny from them, and examining the final result in the light of your initial question.

Imagine you are working on the FGF10 protein in humans. You have planned some experiments to be carried out in chickens to test some of your hypotheses, therefore you want to determine:
To do this, go through the complete analysis including
In particular, think carefully about

Links to the UniProt records for some of the sequences used in these exercises

Back to Gibson Team course pages at EMBL.