Friday 30th January 2009
Activity 1: Evolution of Hemoglobin Structure and Function
Multiple Sequence Alignment (MSA)
Cut and paste the contents of this protein
sequence file (in FASTA format)
into the EBI MUSCLE
webserver (into the box below the text "Enter or Paste a set of
Sequences in any supported format:"). Run the software, wait a few
moments, then click on the JalView button to view the results with the
JalView MSA editor.
(Here is a pre-calculated
version of this file that can be viewed using a locally installed
version of an alignment viewer/editor such as JalView or CLUSTALX)
You will notice that different amino acids have different colours,
although sometimes they are also left uncoloured/white. If they are
coloured, then it is always the same colour e.g. if aspartate (D) is
coloured, it is always coloured purple. You'll also notice that there
are several cases where different amino acids have the same colour e.g.
both aspartate (D) and glutamate (E) are purple.
Using the tables and diagram from the introduction, which are the
common properties shared by the different colours of amino acids?
What does the "Conservation" chart (at the bottom of the JalView
The "same" protein in different organisms, or different members of the
same protein family in the same organism, may have different amino acid
sequences (as can be seen from the MSA). These differences are due to "point mutations"
in the underlying DNA sequence of the genes that code for these proteins.
The aim of an MSA is to place amino acid residues from different
proteins in the same column, such that residues in the same column are
related via point mutations (or by no mutations at all). Note
that many different residues could in principle be aligned
in the same column for two different sequences; however, a given residue
is related by point mutations (or no mutations at all) to at most one
residue in another sequence.
Note, however, that some positions in the alignment contain the "-"
character, indicating a "blank" or "gap", i.e. that there are no residues
in these sequences at those positions that are related via point
mutations (or no mutations at all).
What kinds of genetic events do you think might have caused these gaps?
You will also have noticed that different columns in the alignment
look different from each other.
Describe these differences in terms of (a) the number of
different amino acids found in the column (e.g. is it only ever D and
E, or is it many different ones e.g. V, I, L, D, G, etc.) (b) the
properties of the amino acids (are they all hydrophobic, or a mixture
of hydrophobic/aromatic/polar etc.) (c) the presence of empty ("gap")
characters.
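If you would like to check your observations by counting, the column-by-column description above can be sketched in a few lines of Python. This is only an illustration: the three short sequences are a made-up toy alignment (not the hemoglobin MSA from this activity), and the function name `describe_columns` is our own.

```python
# Hypothetical toy alignment, invented for illustration only.
toy_alignment = [
    "MVDL-K",
    "MIEL-K",
    "MLDLGR",
]

def describe_columns(alignment):
    """Summarise each alignment column: which residues occur, and any gaps."""
    summaries = []
    for column in zip(*alignment):  # zip(*rows) walks the alignment column by column
        residues = {c for c in column if c != "-"}
        summaries.append({
            "distinct": sorted(residues),   # the different amino acids seen
            "n_distinct": len(residues),    # how many different amino acids
            "has_gap": "-" in column,       # does any sequence have a gap here?
        })
    return summaries

for number, summary in enumerate(describe_columns(toy_alignment), start=1):
    note = "contains gap" if summary["has_gap"] else ""
    print(number, summary["n_distinct"], "".join(summary["distinct"]), note)
```

In the toy data, column 3 only ever contains D and E, whereas column 2 contains three different residues (V, I and L), and column 5 contains gaps.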
These observations can help us answer some questions concerning
patterns of protein evolution:
- Does the rate of point mutations at different positions in a
protein seem to be (a) the same for all positions (b) different in
different positions?
- Do all positions in the protein seem to be able to accept gap
characters, or are there some positions in the protein that do not seem
to accept gaps?
- Do all positions in a protein have to preserve the same kind
of chemical/physical property? Or are there some positions that seem to
accept amino acids with many different properties?
- Are certain properties more likely to be conserved than others?
Leave the alignment for these sequences open before continuing.
Protein Structure and Amino Acid Properties:
Load the following
file for human beta-chain hemoglobin into PyMOL.
To give you some experience of manipulating and examining 3D protein
structures, use the sequence information given at the top of the PyMOL
window to colour the residues of the chain different colours, depending
on which group you are assigned to (this exercise will also help reveal
some important patterns of protein evolution.)
For some of these classifications you should be able to see a trend in
the localisation of the residues of different colours e.g. tend to be
absent from either the interior or exterior of the protein, or tend to
be near the haem molecule - can you identify any such trends?
|G A V L I C M F P W
|R N D E Q H K S T Y
|K R H D E N Q
|A C G I L M F P S T W Y V
|F H W Y
|A C D E G I K L M N P Q R S T V
(These files show the protein already labeled using these different
classifications; please don't open them until you have finished trying this
for yourself! They are provided for use at a later date.)
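If typing long selections by hand becomes tedious, the sketch below turns one row of one-letter codes from the table above into a PyMOL `color` command that you can paste into the PyMOL command line. It assumes the structure uses the standard three-letter residue names; the function name `pymol_color_command` and the colour choices are our own, purely for illustration.

```python
# Standard one-letter to three-letter amino acid codes.
THREE_LETTER = {
    "A": "ALA", "R": "ARG", "N": "ASN", "D": "ASP", "C": "CYS",
    "Q": "GLN", "E": "GLU", "G": "GLY", "H": "HIS", "I": "ILE",
    "L": "LEU", "K": "LYS", "M": "MET", "F": "PHE", "P": "PRO",
    "S": "SER", "T": "THR", "W": "TRP", "Y": "TYR", "V": "VAL",
}

def pymol_color_command(colour, one_letter_codes):
    """Build a PyMOL 'color' command for a space-separated row of one-letter codes."""
    names = "+".join(THREE_LETTER[code] for code in one_letter_codes.split())
    # 'resn' selects by residue name; '+' joins alternatives in a PyMOL selection.
    return f"color {colour}, resn {names}"

# Two rows copied from the table above (colours chosen arbitrarily).
print(pymol_color_command("orange", "F H W Y"))
print(pymol_color_command("white", "A C D E G I K L M N P Q R S T V"))
```

The first line printed is `color orange, resn PHE+HIS+TRP+TYR`.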
Protein Structure and Alignments:
Use the "Conservation" column to identify columns in the MSA that are
the same in all sequences in the alignment (these are marked with a
Identify the sequence in the alignment corresponding to the human
beta-chain hemoglobin ("humanHemoglobinB")
Using the initial all-white human beta-chain hemoglobin PyMOL file from
above, label red all the residues in this chain that are the same in
all sequences in the alignment.
Are these highly-conserved
residues preferentially located (a) in the core (b) on the surface (c)
in the haem-binding pocket?
(This file is the human beta hemoglobin chain with the invariant residues
already coloured red)
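Finding the invariant columns by eye, and then working out which residue of the human beta chain each one corresponds to, is easy to get wrong because gaps shift the numbering. The sketch below shows one way this bookkeeping could be done. The alignment is a made-up toy example with hypothetical sequence names, and the residue numbers it produces count along the ungapped reference sequence starting from 1, which may not match the numbering used in a given structure file.

```python
def invariant_residue_numbers(alignment, reference_name):
    """Return residue numbers (1-based, in the ungapped reference sequence)
    of columns where every sequence has the same residue.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    """
    reference = alignment[reference_name]
    numbers = []
    position = 0  # running residue number within the reference sequence
    for column in zip(reference, *alignment.values()):
        ref_char = column[0]
        if ref_char == "-":
            continue  # a gap in the reference does not consume a residue number
        position += 1
        if len(set(column)) == 1:  # identical in all sequences (so no gaps either)
            numbers.append(position)
    return numbers

# Hypothetical toy alignment, invented for illustration only.
toy = {
    "seqA": "MV-HLT",
    "seqB": "MVAHIT",
    "seqC": "MV-HLT",
}

numbers = invariant_residue_numbers(toy, "seqA")
print(numbers)
# A PyMOL command that would colour those residue numbers red:
print("color red, resi " + "+".join(str(n) for n in numbers))
```

Note that the same invariant column can have a different residue number depending on which sequence you take as the reference: the fourth column of the toy alignment is residue 3 of seqA but residue 4 of seqB.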
When talking about how similar two sequences are to each other, it
often helps to be able to measure/quantify this similarity. One common
way of doing this is to calculate the "percentage identity" (or "%ID")
of a pair of sequences in the context of a given alignment.
%ID is calculated by (a) ignoring all columns that contain any gap
characters, then (b) for the remaining columns, counting the number of
columns where both sequences have exactly the same amino acid residue,
and expressing this count as a percentage of all the remaining columns.
For example, in the alignment below (a region of an alignment between a
pair of hemoglobin sequences)
- the alignment contains a total of 22 columns
- there are two columns that contain gaps (columns 8 and 9) so the
total number of columns that do not contain any gaps is 20
- of these twenty columns, 13 have the same amino acid residue in
both sequences (columns 3 4 5 6 7 12 13 15 17 18 19 21 22)
- therefore the percentage identity for this alignment is: 13 * 100
/ 20 i.e. 65%
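The calculation described above can be sketched as a short Python function. This is a minimal illustration, not the method used by any particular alignment program; the two short aligned sequences are invented for the example (they are not the hemoglobin region discussed above), and the name `percent_identity` is our own.

```python
def percent_identity(seq_a, seq_b):
    """%ID of two aligned sequences: identical columns / gap-free columns * 100."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0   # columns with no gap in either sequence
    identical = 0  # gap-free columns where both residues match
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue  # step (a): ignore columns containing a gap
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return 100.0 * identical / compared

# Hypothetical toy alignment: 10 columns, 2 with gaps, 6 identical.
print(percent_identity("MVHLT-EEKS", "MVKLTPA-KS"))  # 6 * 100 / 8 = 75.0

# The same two sequences with the gaps placed differently:
print(percent_identity("-MVHLTEEKS", "MVKLTPAKS-"))  # 0 * 100 / 8 = 0.0

# The arithmetic from the worked example above:
print(13 * 100 / 20)  # 65.0
```

The second call shows how strongly the result depends on where the gaps are placed: the sequences are unchanged, but the %ID drops from 75% to 0%.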
Note the importance of the alignment for this measurement - if we take
the following alignment (which is between the same pair of sequences)
we calculate %ID = 20% which is much lower than the 65%
calculated from the previous alignment
With a %ID of 20% we might conclude that the sequences are relatively
dissimilar - if we hadn't looked at the alignment and noticed that
placing gaps differently could give us an alignment with a much higher %ID.
Returning to the MSA, compare the sequences for the human hemoglobin A
and B chains ("humanHemoglobinA" and "humanHemoglobinB").
What is the percentage identity/%ID for these two sequences?
Given the differences between the two sequences, do you
expect the structures to be different?
If so, different in what ways?
(e.g. including both beta-sheets and alpha-helices, rather than just
alpha helices? the same/different numbers of helices? etc.)
This file shows the two structures (unaligned) of the two chains. Are the
differences what you expected? If not, in what way are you surprised?
Align the two chains to each other using PyMOL.
Does this show
the structures to be more or less different than they appeared before
aligning them? What are the biggest differences between the structures?
What is the %ID of the following alignment (the
sequences are the human beta hemoglobin chain, and a leghemoglobin from
Now compare the aligned structures of these two proteins as found
You might be surprised how similar the two structures
are to each other given their low %ID, especially when compared with
the differences in structure you saw between the human alpha and beta
hemoglobin chains at their %ID.
Protein Structure, Structure Alignment, and Disease:
The mutation responsible for causing the sickle-cell anemia disease is
the change of residue 7 from a glutamate to a valine (E->V).
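At the sequence level this change can be written down in a couple of lines. The sketch below applies the E->V substitution at residue 7 to the start of the human beta hemoglobin sequence. The 19-residue fragment is our assumption of how the UniProt HBB_HUMAN sequence begins (counting the initial methionine as residue 1), so please check it against the record itself; the function name `apply_point_mutation` is our own.

```python
def apply_point_mutation(sequence, position, expected, replacement):
    """Replace the residue at a 1-based position, checking the original residue first."""
    index = position - 1
    if sequence[index] != expected:
        raise ValueError(
            f"expected {expected} at position {position}, found {sequence[index]}"
        )
    return sequence[:index] + replacement + sequence[index + 1:]

# Assumed start of the human beta hemoglobin sequence (initial methionine = residue 1).
wildtype_start = "MVHLTPEEKSAVTALWGKV"

mutant_start = apply_point_mutation(wildtype_start, 7, "E", "V")
print(wildtype_start)
print(mutant_start)
```

The two sequences differ at a single position, so their %ID over this fragment stays very high.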
Given what you have learnt about the relationship between sequence
and structure, would you expect this mutation to make a large
difference to the structure of the protein?
This file contains the structures of the mutant and wildtype chains.
Which differences can you identify between the two structures? Are you
surprised by the answer? If so, why?
To help understand how this mutation might cause disease, examine this
PyMOL file, which shows the structure of sickle-cell-mutant
hemoglobin in what is believed to be a valid biological unit. The
mutant residues have been labeled in orange, the chain interacting
with one of these mutant residues has been shown in surface view, water
molecules near the interaction have been shown in cyan, and polar
contacts between these waters and other atoms have been shown with
dotted yellow lines.
Based on examination of
this structure, can you suggest a reason why this mutant might cause disease?
Align the human sickle-cell mutant protein structure with one of the
mouse hemoglobin beta structures, as provided in this PyMOL file.
Based on this alignment, which residue in the mouse structure would
you expect, if mutated, to yield a
sickle-cell-like disease in mice?
Can you think of any reasons why,
given the resistance to malaria induced by heterozygous human
wt/sickle-cell genotypes, we might not find such variants in mouse
populations? Consider that the malaria parasite that infects humans is
not a parasite of mice.
The information and data included in the material we have presented
here was all acquired, free, from the internet - how did we do this,
and where did we find it?
The first step, when working with proteins, is to identify the record
describing your protein of interest in the UniProt database. This database
provides some of the highest-quality (i.e. most likely to be
accurate/true) information concerning proteins, along with
links to many different kinds of information about the protein in a
range of other databases.
To identify the record we need, we query the EBI EB-eye search tool with the
phrase "hemoglobin sickle cell",
and follow the links to the UniProt record for HBB_HUMAN (which is,
indeed, the protein whose mutations lead to sickle-cell anemia).
Note that, knowing the accession number of your record of interest in
the database, you can also go quickly to that record.
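For those who prefer to script this step, the sketch below shows how a sequence might be retrieved directly by accession number. It assumes the UniProt REST address pattern `https://rest.uniprot.org/uniprotkb/<accession>.fasta`, which is not part of the original material and may change over time; the function names are our own, and the short FASTA record used to demonstrate the parser is invented.

```python
from urllib.request import urlopen

# Assumed URL pattern for retrieving a UniProt entry in FASTA format.
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"

def uniprot_fasta_url(accession):
    """Build the (assumed) download address for one accession number."""
    return UNIPROT_FASTA_URL.format(accession=accession)

def parse_fasta(text):
    """Split a single-record FASTA string into (header, sequence)."""
    lines = text.strip().splitlines()
    header = lines[0].lstrip(">")
    sequence = "".join(line.strip() for line in lines[1:])
    return header, sequence

def fetch_sequence(accession):
    """Download and parse one record (needs an internet connection)."""
    with urlopen(uniprot_fasta_url(accession)) as response:
        return parse_fasta(response.read().decode("utf-8"))

print(uniprot_fasta_url("P02088"))

# Parsing demonstrated on an invented record, so no network access is needed here.
header, sequence = parse_fasta(">demo|X00000|DEMO toy record\nMVHLT\nPEEKS\n")
print(header, len(sequence))
```

With a working connection, `fetch_sequence("P02088")` would return the header line and sequence of that record, from which the length of the protein can be read off with `len()`.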
We will show you where this record provides information about:
- the length of the protein
- the name of the organism the protein comes from
- diseases it is involved in
- variations in the sequence found in populations
- the sequence of the protein
- links to databases with information about the structure of the
protein
Use the EB-eye search tool to find out similar information for a
different protein with accession number "P02088".
Is there any other information in the record that you find
particularly interesting? If so, please make a note of this as we'd be
interested to hear about particular areas of interest for teachers.