Exploring Modular Protein Architecture
Check what’s already known about your protein!
It's perhaps obvious - however, you'd be surprised how often people
come asking us to predict the function of a protein without first
checking which modules have already been identified/experimentally
verified.
An obvious first place to check is the UniProt record for the protein
of interest - this provides a predominantly manually-curated
description of its function, along with links to a wide array of
different databases/data resources containing information about the
protein.
However this information inevitably becomes out of date - therefore
(depending how much effort you are prepared to expend - correlated with
the extent to which understanding the function of the protein is
important to you) you may want to investigate additional data
resources. Note that some (but not all) of the data found in such
resources may be found in the published literature.
There are very many different bioinformatic data resources available -
too many for us to have used or even heard of all of them, and
certainly too many to list here. When looking for a resource to fit
your needs, obviously try normal internet searching, but you can also
try visiting these sites that provide lists of such resources:
Additionally, there are many tools/sites available that aim to
integrate some/many of these kinds of resources to make it easier and quicker to
navigate through this jungle of resources. We don't have time to look
at these now, but if you're interested you can look into these yourself:
We check the human EPN1_HUMAN sequence for phosphorylation sites in
the Human Protein Reference Database and phospho.ELM,
and compare them to the ones found in the UniProt record.
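Once the site positions have been copied out of each resource, the comparison itself can be scripted. The sketch below is only an illustration: the residue positions are made up (they are not the real EPN1_HUMAN annotations), and the function name `compare_sites` is our own.

```python
# Hypothetical phosphosite positions, for illustration only - these are
# NOT the real annotations for EPN1_HUMAN.
sites = {
    "UniProt": {10, 42, 105},
    "HPRD": {42, 105, 230},
    "phospho.ELM": {42, 310},
}


def compare_sites(sites_by_resource):
    """Return (sites reported by every resource, sites unique to each resource)."""
    shared = set.intersection(*sites_by_resource.values())
    unique = {}
    for name, positions in sites_by_resource.items():
        # Pool the sites from all the *other* resources
        others = set().union(
            *(s for n, s in sites_by_resource.items() if n != name)
        )
        unique[name] = positions - others
    return shared, unique


shared, unique = compare_sites(sites)
print("In all resources:", sorted(shared))
for name, positions in unique.items():
    print(f"Only in {name}:", sorted(positions))
```

Sites reported by only one resource are the ones worth checking against the literature cited in that resource.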
Note that we often have the choice with such resources to query both
with names/accession numbers of proteins, and with their sequences -
there are advantages and disadvantages to both of these query methods.
Note also that there is overlapping but different information stored
in the different data resources - highlighting the importance of checking
several different sites if you are very interested in a particular
protein.
Try out similar queries using PAX6_HUMAN, CBP_HUMAN, and/or your own
sequences of interest. Explore the data records for these proteins from
these different data resources - each of them provides different
information associated with the proteins and their modules.
Do you find any of these resources better/worse than the others? If so,
why? Does one of the sites tend to have more comprehensive coverage of
functional modules/sites in proteins?
Identify Modules Predicted with High Confidence
Many protein modules (in particular globular domains) can be predicted
with a very low false positive rate - and typically are, because the
prediction servers are parameterised to achieve this.
Therefore, if one of these tools predicts the presence of a module, you
can be almost certain that it is indeed present in the protein.
Obviously such predictions are useful - we can be confident (although
not certain) that the protein has the functions associated with this
module. However such predictions are also useful in that they provide a
reference against which to interpret other predictions - in the case of
incompatible overlap with other predictions, we prefer to assume that
the high-confidence prediction is correct and the other is incorrect.
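The rule of preferring the high-confidence prediction can be written down as a small filter. This is a minimal sketch under simplifying assumptions: every overlap with a high-confidence prediction is treated as incompatible, and the `Prediction` type, the names and the coordinates are all hypothetical.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    name: str
    start: int             # 1-based, inclusive
    end: int               # 1-based, inclusive
    high_confidence: bool


def overlaps(a, b):
    """True if the two predicted regions share at least one residue."""
    return a.start <= b.end and b.start <= a.end


def resolve(predictions):
    """Keep all high-confidence predictions, plus any other prediction
    that does not overlap one of them (simplifying assumption: any
    overlap counts as incompatible)."""
    trusted = [p for p in predictions if p.high_confidence]
    kept = list(trusted)
    for p in predictions:
        if not p.high_confidence and not any(overlaps(p, t) for t in trusted):
            kept.append(p)
    return sorted(kept, key=lambda p: p.start)


# Hypothetical predictions for a made-up protein
predictions = [
    Prediction("globular_domain", 10, 150, True),
    Prediction("motif_inside_domain", 100, 105, False),
    Prediction("motif_outside_domain", 300, 306, False),
]
for p in resolve(predictions):
    print(p.name, p.start, p.end)
```

In practice you would decide case by case whether an overlap really is incompatible, rather than discarding every overlapping prediction automatically.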
We will use the SMART, PFAM, and CDD
webservers to identify protein modules that are predicted with high
confidence.
Bookmark and/or save the results of these analyses (and the others you
do in this practical), as in some cases you will be asked to return to
them later in the practical, as we will be using additional
information/ideas to interpret them differently.
Give the files names such as "EPN1_HUMAN_smart.html".
We will use these tools to investigate the modular structure of
Carry out similar queries using the PAX2_HUMAN sequence
How well do the boundaries of the modules identified using these
different tools agree with the boundaries given in UniProt?
If you've time, try a similar query with SRC_HUMAN
Is one of the servers clearly best/worst at accurately defining the
boundaries of the modules?
You will have seen that there is considerable variation in the results
of the three servers, all of which are attempting to predict protein
modules using similar methods (HMMs/PSSMs). This highlights the fact
that, if you are keen to determine possible functions of a protein, it
is a good idea to try several bioinformatic servers to predict protein
modules. If you rely on just one, you might miss out on identifying
some of them.
Match linear motif patterns
The sequences of the modules which can be predicted with high
confidence have a pattern of conservation/a set of constraints that
greatly differentiates them from "typical" or "random" protein sequence.
Therefore, if you find this pattern in your sequence, you can be almost
certain that it does indeed contain the module i.e. has the
characteristic structure and function of the module.
In contrast, other modules (in particular short functional sites/linear
motifs such as phosphorylation, acetylation, cyclin-binding sites etc.)
place few constraints on the protein sequence, such that the sequence
of regions that contain the module is difficult to differentiate from
regions that do not contain the module. For example, tyrosine kinases
are generally able to phosphorylate pretty much any tyrosine residue
they encounter - however we know that many/most tyrosines are not
phosphorylated.
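To see why short, weakly constrained patterns match so often, it helps to scan a sequence with one. The sketch below uses a toy sequence and a deliberately loose illustrative pattern (serine or threonine followed by proline); neither is taken from ELM, ScanSite or Prosite, and the helper name `find_matches` is our own.

```python
import re


def find_matches(pattern, sequence):
    """Return (1-based start, matched text) for every match of a regular
    expression, including overlapping matches."""
    # A lookahead lets the scan restart at every position, so overlapping
    # matches are not skipped.
    lookahead = re.compile(f"(?=({pattern}))")
    return [(m.start() + 1, m.group(1)) for m in lookahead.finditer(sequence)]


# Toy sequence and illustrative pattern - not real data
toy_sequence = "MSTPAKSPLLTPGSYSPRR"
for position, text in find_matches("[ST]P", toy_sequence):
    print(position, text)
```

Even in this 19-residue toy sequence the pattern matches several times, which is why a match on its own is weak evidence that a site is functional.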
We will use ELM, ScanSite and Prosite (switching off
"exclude patterns with a high probability of occurrence") to explore
motif matches of potential interest in EPN1_HUMAN.
You can see that many sites are matched, most of which are unlikely
to be functional. Therefore we try to use additional information to
help discriminate between true and false positives (this is the
purpose of the ELM filters).
For example, most of the sites predicted with low confidence are short
(often between 3 and 10 amino acids long), and need to be in relatively
accessible regions of protein structure.
Using the servers mentioned above, examine the sequence of EPN1_HUMAN
to identify possible short functional sites.
Can you identify any such sites you consider more likely to be true
positives/truly functional than others using these tools?
Predict globular/non-globular regions
One way to help discriminate between functional and non-functional
matches to linear motif patterns is to predict whether the site lies
within an appropriate structural context - such sites typically need to
be in accessible, non-globular regions.
There are several different websites available that predict the
disorder tendency of protein sequences e.g. RONN,
IUPred, GlobPlot, DisProt. As
the training sets and methods used to develop these servers tend to be
rather different from each other, they will often predict the
order/disorder of a region differently. However, in other cases, the
tools agree - then, by taking a consensus view, we can obtain fairly
confident predictions of ordered/disordered regions (although there
will be other regions, where the servers disagree with each other, where
we cannot make a confident prediction of disordered/ordered regions).
However, as already mentioned, if these predictions disagree with
high-confidence predictions (e.g. a globular domain is predicted in a
region that is also predicted to be disordered), we assume that the
order/disorder prediction is wrong.
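The consensus step can be sketched as follows, assuming each server's output has already been reduced by hand to one call per residue ('D' for disordered, 'O' for ordered). The predictor names and calls are invented for illustration, and the sketch covers only the consensus, not the comparison with high-confidence domain predictions.

```python
def consensus(calls_by_predictor):
    """Combine per-residue calls ('D' = disordered, 'O' = ordered).

    Returns a string with the agreed call where all predictors agree,
    and '?' where they disagree (no confident prediction possible).
    """
    tracks = list(calls_by_predictor.values())
    if len({len(t) for t in tracks}) != 1:
        raise ValueError("all predictors must cover the same number of residues")
    result = []
    for column in zip(*tracks):
        result.append(column[0] if len(set(column)) == 1 else "?")
    return "".join(result)


# Invented calls for a 10-residue toy sequence
calls = {
    "predictor_a": "DDDDOOOOOD",
    "predictor_b": "DDDOOOOODD",
    "predictor_c": "DDDDOOOOOD",
}
print(consensus(calls))
```

The '?' positions tend to cluster at the boundaries between ordered and disordered regions, which is where the servers most often disagree.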
We'll use these four disorder-prediction servers to identify regions of
potential order and disorder in the EPN1_HUMAN sequence.
Using the same set of four predictors, investigate the order/disorder
predictions for SRC_HUMAN.
Looking at the disorder/order prediction for PAX2_HUMAN, and focusing
on the EH1-interaction motif - does the disorder prediction provide
evidence for or against this motif being a true instance?
- Are there regions of consensus in the predictions of the four
servers i.e. where they all predict the sequence to be either globular
or non-globular?
- How well do the different servers predict the boundaries of
globular/non-globular regions (in comparison with the UniProt record
for the protein)?
- Is the output from any of these tools easier/more difficult to
interpret? If so, then which features of these tools are responsible
for this?
Multiple Sequence Alignments (MSAs)
MSAs play a central role in many different sequence analysis tools.
Additionally, they can provide an invaluable tool/platform against
which to compare and integrate experimental data obtained from members
of a protein family sampled from different model organisms and
Additionally, MSAs can help predict protein modules - features of
MSAs that are important for the function of the proteins are likely to
be conserved over relatively long periods of time (as many
characteristics of protein function evolve relatively slowly, for
example, in comparison to the relatively quick evolution of
In particular, observing matches to linear motifs that are more
conserved than their flanking sequences is a strong indicator that the
motif is functional in many of these protein sequences.
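One simple way to put a number on "more conserved than the flanking sequences" is to score each alignment column by the frequency of its most common residue. This is a minimal sketch with a made-up four-sequence alignment; the scoring scheme and the function names are our own assumptions, not the method used by any particular server.

```python
from collections import Counter


def column_conservation(alignment, col):
    """Fraction of sequences that carry the most common non-gap residue."""
    residues = [seq[col] for seq in alignment]
    counts = Counter(r for r in residues if r != "-")
    if not counts:
        return 0.0
    return counts.most_common(1)[0][1] / len(residues)


def mean_conservation(alignment, start, end):
    """Mean column conservation over columns start..end-1 (0-based)."""
    scores = [column_conservation(alignment, c) for c in range(start, end)]
    return sum(scores) / len(scores)


# Made-up alignment: the candidate motif occupies columns 3-6
toy_alignment = [
    "AKTFSIKLPGE",
    "GRSFSIRMPND",
    "-QAFSIKIASE",
    "TE-FSIKLQGK",
]
motif = mean_conservation(toy_alignment, 3, 7)
flanks = (mean_conservation(toy_alignment, 0, 3)
          + mean_conservation(toy_alignment, 7, 11)) / 2
print(f"motif: {motif:.2f}  flanks: {flanks:.2f}")
```

A motif score well above the flank score is the pattern to look for; a real analysis would use many more sequences and a substitution-aware score.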
Demonstration - Creating an alignment "from scratch"
We will begin by calculating an MSA "from scratch" i.e. beginning with a
sequence of interest (in this case PAX2_HUMAN), and continuing until we
have an MSA to examine.
We begin by using BLAST at the EBI - we are working with protein
sequences, so we use the
blastp program. BLAST allows us to collect a set of related sequences
from those available in the public sequence databases.
We can get a quick pseudo-MSA out of these results by using the MVIEW
display option from the result page.
Based on the result of this search, we may want to do the BLAST search
again using different parameters.
Once we are happy with the results, we can select sequences to
download, and collect them in FASTA format.
We can do a similar search using the BLAST tools available
at the NCBI - this interface has different features from those
available at the EBI.
Once we have collected the FASTA format sequences, we can align them to
There is a range of different software available to do this - we will
try out some of the tools available at the EBI: ClustalW, T-COFFEE, MAFFT, and MUSCLE.
Note that it is also possible to install these software packages
locally on your machines and run them there - if you want an alignment
for many, long sequences, you would typically want to do it locally.
At the EBI, we then have the option to download these alignments and
view/edit them locally e.g. using CLUSTALX, SEAVIEW, or JalView.
Alternatively, from some of these tools we have the option to open
JalView in our webbrowser (or to use the MVIEW output we saw before).
Usually, we would prefer to download the sequences and view them
locally, as we then have more control over the way the alignment is
displayed, and we are able to save the results of any editing etc. we
do of the alignment.
Having loaded the alignment into JalView locally, we can now examine,
for example, the conservation of the EH1 interaction motif in the PAX
The alignment also highlights many common features of protein
alignments - for example, characteristic differences between the
alignments of globular and non-globular regions, and different patterns
of conservation/substitution at different positions in the alignment.
Using a similar procedure, create an MSA for the protein EPN1_HUMAN -
examine the functional sites annotated for the protein in its UniProt
record. Are they conserved in many sequences within the alignment?
Examine the conservation of other sites in the sequence, for example
those predicted by ScanSite using "high stringency" - can you identify
any good candidates for true functional sites using the alignment?
Demonstration - Obtaining pre-calculated alignments
If you can find a pre-calculated alignment that already contains a set
of sequences that is useful for addressing your problem, you may well
be able to save lots of time. Using the databases/resources below,
we'll look at how you can get access to alignments that you can
download yourself and examine locally.
>SRC_HUMAN|P12931|Proto-oncogene tyrosine-protein kinase Src (EC
Using these (and perhaps other alignment resources you might find) find
a pre-calculated MSA that will be a good starting place for
building an alignment to investigate:
- the evolution of FGF genes (e.g. FGF4_HUMAN) in the early
- variation in the 3D structure of tyrosine kinase domains within
Why might someone be interested in interpreting or estimating a
phylogeny?
For a start, phylogenies/evolutionary trees are a central component of
any evolutionary analysis, whether this is explicitly stated, or
not.
They may be interesting and valuable in themselves - we are curious to
understand the order in which the different lineages of life diverged
from each other, to understand better the path that evolution has
taken, and to attempt to reconstruct the past history of these lineages.
Or we may be interested in estimating the rate of some kind of
evolutionary transition - for example the rate of synonymous DNA base
substitution events, the rate of intron loss/gain - or aiming to
explicitly reconstruct the state of some feature of an ancestral
organism or sequence e.g. did the most recent common ancestor of
mammals lay eggs or not? If we are interested in such questions, we
need to take into account how the sequences/organisms are related to
each other. Otherwise we will not be able to determine whether only
closely-related sequences have the same features (indicating, for
example, that the character changes relatively quickly), or whether a
set of very distantly-related sequences have the same state (indicating
that the character is relatively stable/slowly-changing). Given that
most evolutionary studies involve either estimating the rate of some
character changes, or reconstructing ancestral character states, it is
clearly important that we make use of phylogenies in these analyses.
Additionally, phylogenies/dendrograms form an important component of
many different functional prediction tools (for example STRING, or
various other comparative genomics-based applications). Note also that
almost all the phylogenies published today, and that you
are likely to encounter (and perhaps estimate yourselves), will have
been estimated from sequence alignments (either DNA or protein).
Thus, there are many reasons why you might be interested in estimating
or interpreting a phylogeny.
Modern state-of-the-art methods for phylogeny reconstruction are
relatively sophisticated, and a reasonably good level of understanding
of how the methods work is required to use them critically - therefore,
given the relatively short amount of time we have to look at this
topic, we are going to avoid these issues. Instead we will focus on how
to interpret and manipulate the results of such analyses - the
phylogenetic trees themselves.
Demonstration - Displaying phylogenetic trees
Here we will look at some different software tools that can be used
to display/draw phylogenetic trees. This should help you start
exploring both pre-calculated trees, and ones you have estimated
yourself, along with helping you become more familiar with looking at
and thinking about trees.
NJPLOT is a relatively simple program initially developed to show the
results of bootstrapped Neighbour-Joining phylogenies. It is often
useful as a relatively quick and easy way of viewing the results of an
analysis. While it can accept NEWICK-format un-rooted trees as
input, it is only able to display such trees with a root. It won't
accept trees with any polytomies apart from a single trichotomy (used to
specify the position of the root).
DENDROSCOPE is a more comprehensive program that is able to produce
trees of nearly publication quality relatively quickly (although in
most cases the resulting images require additional editing in other
software such as Adobe Illustrator). It can display both rooted and
unrooted trees, with scaled and unscaled formats.
MESQUITE is an
extremely versatile and flexible software package for carrying out a
wide range of different kinds of evolutionary analyses. This
flexibility comes with a relatively steep learning-curve. We will use it
to visualise AND EDIT evolutionary trees; both DENDROSCOPE and NJPLOT
are only able to view the data provided in a NEWICK format tree file -
while MESQUITE provides the ability to manually alter the topology and
branch-lengths of a tree (which can be useful for preparing
illustrative figures, or for preparing trees for input to other
software or for other analyses within MESQUITE).
The software can also be rather tricky to install correctly - therefore
today we are only demonstrating it to you, you won't do any exercises
with it.
We will begin by downloading a pre-calculated alignment, using the
alignment to obtain a rough estimate of the phylogeny of the sequences,
and will then visualise the resulting tree using NJPLOT, DENDROSCOPE,
and MESQUITE.
Suppose we are interested in investigating the phylogeny of the
drosophilid Oskar genes. To begin our analysis we go to the set of
pre-calculated alignments of sets of "orthologous groups" of genes from
Chris Ponting's group's OPTIC
resource (these are available for other groups e.g. mammals, yeasts,
not just drosophilids).
To identify the relevant group of genes, we need to find the relevant
identifier for the Oskar genes - we do this by searching flybase.
We query the OPTIC database with CG10901 (the relevant ID - although one
actually needs to query with "CG10901-RA").
We click on the link that leads us to the alignments (the rather
cryptic "group 3297")
Choose Alignments->Transcripts-AminoAcids, and then export in Fasta
format.
We then use ClustalX to calculate a quick bootstrapped tree from this
alignment
- Trees->Exclude positions with gaps
- Trees->Correct for multiple substitutions
- Trees->Bootstrap NJ Tree
We look at the resulting tree in NJPLOT
- re-rooting tree
- examining subtrees
- rotating branches
- accepts long names and hyphens in names
Dendroscope can also examine this tree with no trouble
- different tree representations
- rotate branches (after selecting a NODE)
- hide Edge Labels to lose BS values
- change formatting of branches/labels
- zoom in on the tree
- search for taxa names
To work on the tree in Mesquite, we need to edit the file
- put all text on one line
- remove all the hyphens
Mesquite, as mentioned, allows you to edit the tree
- "interchange branches" tool
- adjust branch-lengths
As you've seen, often tree-viewing software has problems displaying the
trees obtained from some sources - we'll look at trees from the TreeFam
database, which some of the software has problems with.
To get the Oskar sequences from TreeFam,
we query either with the External Accession number from UniProt "P25158"
or with "CG10901" under "gene name" (we could find out the relevant
accession number for our gene by BLASTing with our sequence of interest
against UniProt - this should identify an ID that will be within
TreeFam) - here's the alignment.
Loads OK into NJPLOT (it includes the node labels)
To load it into Dendroscope, we need to delete all of the comments (i.e.
everything within "[....]") - here it is without the comments.
Again, to get it into Mesquite we need to put everything onto one line
- here it is.
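Both clean-up steps (deleting the bracketed comments and putting the tree on one line) are easy to script rather than doing by hand. The sketch below is a minimal illustration on a made-up tree; the function names are our own, and it assumes that taxon labels contain no spaces and that comments are not nested.

```python
import re


def strip_newick_comments(newick):
    """Remove every [...] comment from a Newick string
    (assumes comments are not nested)."""
    return re.sub(r"\[[^\]]*\]", "", newick)


def to_single_line(newick):
    """Join a multi-line Newick string into one line by dropping all
    whitespace (assumes labels themselves contain no spaces)."""
    return "".join(newick.split())


# Made-up tree with bracketed comments, spread over several lines
raw_tree = """(
(A[&&NHX:S=human]:0.1,B[&&NHX:S=mouse]:0.2):0.05,
C:0.3
);"""

clean_tree = to_single_line(strip_newick_comments(raw_tree))
print(clean_tree)
```

It is worth keeping the original file as well, since the comments may carry information (such as species or bootstrap values) that you want to look up later.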
Download this tree file
(TF101051, cdc6 relatives from TreeFam with several of the branches
Load the tree into NJPLOT. Ignoring formatting, rearrange the branches of
the tree in NJPLOT to make them the same as in the image below.
Load the tree into DENDROSCOPE. Carry out the same exercise, but this
time use the formatting options to make the tree as similar as possible
to the one in the image below.
Visit this link to the gnathostome page on the Tree of Life.
Using a text editor, create a file to represent the tree at the top of
the Tree of Life webpage that can be successfully displayed in NJPLOT.
If that's too easy, try doing the same thing with their amniotes page.
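For larger trees, typing the nested brackets by hand becomes error-prone, and it can help to generate the NEWICK text from a nested structure instead. This is a minimal sketch with placeholder taxon names (not the actual Tree of Life groups); the function name `to_newick` is our own, and branch lengths are left out.

```python
def to_newick(node):
    """Convert nested tuples of taxon names into a NEWICK subtree string.

    A string is a leaf; a tuple is an internal node whose members are
    its children.
    """
    if isinstance(node, str):
        return node
    return "(" + ",".join(to_newick(child) for child in node) + ")"


# Placeholder taxa - replace with the groups shown on the web page
tree = (("TaxonA", "TaxonB"), ("TaxonC", ("TaxonD", "TaxonE")))
newick = to_newick(tree) + ";"
print(newick)
```

The printed line can be pasted into a text file and opened in a tree viewer. Note that the sketch does not check for polytomies, so a tuple with more than two children will produce a tree that NJPLOT may refuse to load.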
If you have extra time, look through either "Tree of Life" or "History
of Life" to find a group of organisms that are interesting for you (I
find it easiest to just start from the root nodes and work up), and try
the same exercise.
Alternatively, working with a sequence alignment you either create
yourself, or that you download pre-calculated, attempt to build your
own phylogeny using ClustalX. To give yourself a focus for preparing
these alignments, try using them to estimate whether the tree provides
evidence of gene duplication/loss events, and if so, how many and when?
NOTE! The method used by ClustalX to estimate phylogenies is relatively
simple and is subject to a number of serious systematic errors. If you
are serious about preparing a phylogeny in your work, you should
certainly use alternative, more accurate methods. However, to obtain an
initial overview of a phylogeny, just to get a feel for how the
sequences might be related, ClustalX is worth using due to its speed
and its relative ease of use.
Estimating Phylogenies with RAxML
- Obtain the seed alignment for TreeFam
- Load the alignment into ClustalX and examine it - we remove some
of the sequences to make the ML analysis quicker
- Save in FASTA format
- Rename sequences to make them short and alpha-numeric only
- Run FASTA format file through GBLOCKS to
remove columns with gaps, and which are likely to be misaligned, saving
the resulting outfile locally (could do this editing "manually" using
SEAVIEW or JALVIEW)
- Load GBLOCKed fasta file into CLUSTALX and save in PHYLIP format
- Load PHYLIP format file into RAxML webserver
provided at CIPRES
- No bootstrapping
- Choose Analysis: WAG + Gamma + F
- Choose Tool: RAxML
- Give email address
- Run Analysis
- View resulting tree in NJPLOT/Dendroscope
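The renaming step in the list above is tedious by hand and easy to script. The sketch below is one possible approach, assuming plain FASTA text as input; the function name, the `seq` prefix and the toy records are our own. It also returns a lookup table so the original names can be restored on the final tree.

```python
def rename_fasta(fasta_text, prefix="seq"):
    """Replace each FASTA header with a short alphanumeric ID.

    Returns (renamed FASTA text, dict mapping new ID -> original header).
    """
    mapping = {}
    out_lines = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # IDs such as seq0001 stay within the 10-character name
            # limit of strict PHYLIP format
            new_id = f"{prefix}{len(mapping) + 1:04d}"
            mapping[new_id] = line[1:].strip()
            out_lines.append(">" + new_id)
        else:
            out_lines.append(line)
    return "\n".join(out_lines) + "\n", mapping


# Toy records, not real sequences
toy_fasta = ">sp|P12345|TOY_HUMAN Toy protein\nMKT\n>tr|Q99999|TOY_MOUSE\nMKS\n"
renamed, lookup = rename_fasta(toy_fasta)
print(renamed)
for new_id, original in lookup.items():
    print(new_id, "=", original)
```

Keeping the lookup table in a separate file makes it straightforward to relabel the leaves once the tree has been estimated.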
Carry out a similar analysis using either your own alignment of
interest, or using TF105084
Phylogenetic Estimation - from start to finish
In these exercises you are asked to go from having a question that can
be addressed using phylogenetic approaches, through the process of
obtaining an appropriate set of sequences for the analysis, aligning
them, estimating a phylogeny from them, and examining the final result
in the light of your initial question.
Imagine you are working on the FGF10 protein in humans.
You have planned some experiments to be carried out in chickens to test
some of your hypotheses, therefore you want to determine:
- whether there is an ortholog for this gene in chickens
- if so, how many orthologs
To do this, go through the complete analysis including
- sequence similarity search
- initial alignment of sequences
- examination of alignment, discarding of some sequences, perhaps
collection of additional sequences and inclusion in the alignment
- removal of potentially mis-aligned regions
- estimation of phylogeny
- rooting of phylogeny
In particular, think carefully about
- what would be an appropriate set of sequences to collect
- once you have an initial set of sequences, which ones to keep,
which ones to discard
- once you have a tree, where to root it
Links to the UniProt records for some of the sequences used in these
to Gibson Team course pages at EMBL.