Activity 1

Evolution of Hemoglobin Structure and Function


Multiple Sequence Alignment (MSA)

We will begin with an exercise that gives us an idea of how hemoglobin protein sequences have changed in the time since the human and mouse lineages diverged from each other (estimated at around 80-90 million years ago).

We will do this by building a protein multiple sequence alignment (MSA) of mouse and human hemoglobin sequences.

Cut and paste the contents of this protein sequence file (in FASTA format) into the EBI MUSCLE webserver - i.e. into the box on the webpage that lies below the text "Enter or Paste a set of Sequences in any supported format:".

Run the MUSCLE software (click the "Run" button) - this will build a multiple sequence alignment from the sequences in the FASTA-format file.
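Before pasting a FASTA file into the webserver, it can help to check that it contains the sequences you expect. The following is a minimal sketch of how FASTA-format text could be read in Python; the function name and the two short records are hypothetical placeholders, not the hemoglobin file used in this activity.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping identifier -> sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The identifier is taken to be the first word after '>'.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = ""
        elif name is not None:
            # A sequence may be wrapped over several lines.
            records[name] += line
    return records


# Hypothetical two-record example (made-up fragments, not real data).
example = """>seqA demo record
MVLSPADK
TNVK
>seqB
MVHLTPEEK
"""

records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq))
```

Each record starts with a ">" header line, and everything up to the next header is treated as one sequence.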



Wait a few moments, then click on the JalView button to view the results with the JalView MSA editor.



(Here is a pre-calculated version of this file that can be viewed using a locally installed version of an alignment viewer/editor such as JalView or CLUSTALX).

You will notice that, as represented in JalView, different amino acids are shown in different colours - although sometimes they are also left uncoloured/white. If a given amino acid is coloured, then it is always the same colour e.g. if aspartate (D) is coloured, it is always coloured magenta (purple). You'll also notice that there are several cases where different amino acids have the same colour e.g. both aspartate (D) and glutamate (E) are magenta.

The colouring scheme used to display the different amino acids has been chosen so that amino acids with similar chemical/physical properties are shown in the same colour. Taking the example of aspartate and glutamate mentioned above: of the 20 amino acids commonly found in proteins, they are the only two acidic amino acids, and so the only two that are negatively charged under normal physiological conditions. Thus they are shown in the same colour (magenta), and none of the other amino acids are shown in this colour.

The colours used in this colouring scheme are:

blue
red
green
pink
magenta
orange
cyan
yellow

Examining Protein Multiple Sequence Alignments - Questions
Examining the multiple sequence alignment:
  1. Make a list of the amino acids that are shown for each of the colours e.g. for magenta/purple, this is aspartate (D) and glutamate (E).
  2. Identify the common properties of amino acids shown with the same colour, using the tables [amino acid structures] and diagrams [amino acid properties Venn diagram] describing amino acid properties shown in the introduction e.g. the magenta/purple amino acids are charged but not positive (i.e. the magenta amino acids are negatively charged).
  3. What does the "Conservation" chart (at the bottom of the JalView screen) describe?
You can check the answers to these questions here

Point mutations, insertions, and deletions
The "same" protein in different organisms, or different members of the same protein family in the same organism, may have different amino acid sequences (as can be seen from the MSA) - e.g. the sequence of amino acids found in the human hemoglobin B2 chain is different from that in the mouse B1 chain (and indeed is different from all the other sequences in the alignment).

The differences in these sequences are due to their evolution from a single common ancestral hemoglobin gene, over the course of many millions of years.

A range of different kinds of genetic events/mutations may be responsible for the differences we observe - one source of such changes is "point mutations" in the underlying DNA sequence of the genes that code for these proteins.

The aim of an MSA is to place amino acid residues from different proteins in the same column, such that residues in the same column are related either via point mutations or by no mutations at all.

Note, however, that some positions in the alignment contain the "-" character, indicating a "blank" or "gap" i.e. that there are no residues in these sequences at those positions that are believed to be related via point mutations. These gaps might be due to insertion or deletion mutations.


Patterns in Protein Sequence Evolution - Questions
A first look at a multiple sequence alignment shows very clearly that different regions of the alignment look different from each other. For example:
This indicates that different positions in an alignment (and thus different positions in a protein sequence) accept different kinds of mutations at different rates. For example, in the image shown below, it would be possible to explain the diversity of amino acid residues found in the first column (outlined in red) by just a single point mutation event. In contrast, the diversity of residues in the second column (outlined in blue) requires us to infer a much larger number of point mutations.

Different substitution rates at different alignment positions

It is important to note that this does not indicate that the rate of mutation is different at these different positions in the alignment - rather, that the rate of "accepted" mutations is different. For example, if the rate of mutation is the same at all positions in the alignment, but mutations at some positions in the protein cause the organism to be less fit than mutations at other positions, then the rate of accepted point mutations (or "substitutions") will differ between the positions - the rate being lower at the alignment positions where mutations are more likely to make the organism less fit.
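One rough way to compare columns is to count how many different residues, and how many gaps, each column contains. The sketch below does this for a small, made-up alignment (the sequence names and residues are hypothetical). Note that the number of distinct residues is only a crude proxy for the number of substitutions, as it ignores the evolutionary relationships between the sequences.

```python
def column_summary(alignment):
    """Return a list of (distinct residue count, gap count), one per column."""
    sequences = list(alignment.values())
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    summary = []
    for column in zip(*sequences):
        residues = {c for c in column if c != "-"}  # ignore gap characters
        summary.append((len(residues), column.count("-")))
    return summary


# Hypothetical toy alignment (not the hemoglobin MSA).
toy_alignment = {
    "seq1": "MKVLA-G",
    "seq2": "MRVIADG",
    "seq3": "MKVFA-G",
    "seq4": "MKVWA-G",
}

summary = column_summary(toy_alignment)
for index, (n_residues, n_gaps) in enumerate(summary, start=1):
    print(index, n_residues, n_gaps)
```

In this toy example, column 2 contains two different residues and column 4 contains four, while column 6 is the only column with gaps.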

Observing these features of multiple sequence alignments allows us to answer some fundamental questions concerning patterns of protein evolution:
  1. Does the rate of substitution at different positions in a protein seem to be (a) the same for all positions (b) different in different positions?
  2. Do (a) all positions in the protein seem equally able to tolerate gap characters (i.e. insertion/deletion mutations), or (b) are there some positions in the protein that do not seem to accept gaps?
  3. Do the substitutions observed in protein sequence alignments (a) always maintain the same chemical/physical property in that column (b) or are there some positions that seem to accept substitutions to amino acids with many different properties?
  4. Are certain properties more likely to be conserved than others e.g. are there certain colours (i.e. conserved properties) of amino acids that are more or less likely to be conserved in the alignment, as a result of substitutions?
With reference to the multiple sequence alignment, answer the questions above, identifying regions of the alignment that demonstrate your answer.

Follow this link for answers to these questions

Please do not close the alignment i.e. keep the JalView window open, displaying the hemoglobin alignment  - as we'll be using this in the next section.

Protein Structure and Amino Acid Properties
The following exercise aims to give you some experience of manipulating and examining 3D protein structures. It also illustrates a further pattern in the evolution of protein sequences and structures: whether amino acids with particular kinds of properties tend to be located in particular regions of the protein structure (e.g. internal/core or external/solvent-exposed).

This page provides a set of instructions on how to use and access some of the different features of the PyMOL 3D structure viewer.

Load the following file for human beta-chain hemoglobin into PyMOL.

At the top of the PyMOL window, you will see the sequence of the protein whose structure is being displayed (this link explains how to switch the sequence on if it's not immediately visible)

You will be assigned to one of three different groups - those of you in group A should select and re-colour all the hydrophobic residues blue in PyMOL, those in group B should do the same but colouring instead the polar residues red, and those in C the same but colouring the aromatic residues yellow (see the table below).

Group A
  Hydrophobic (blue): G A V L I C M F P W
  Non-hydrophobic (white): R N D E Q H K S T Y
Group B
  Polar (red): K R H D E N Q
  Non-polar (white): A C G I L M F P S T W Y V
Group C
  Aromatic (yellow): F H W Y
  Non-aromatic (white): A C D E G I K L M N P Q R S T V
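Structure files name residues by their three-letter codes, so the one-letter lists in the table need translating before they can be used to select residues. As a sketch (the helper name is hypothetical), the following Python builds command lines that could be typed into the PyMOL command line, assuming PyMOL's "resn" selector and "+"-separated lists of residue names:

```python
ONE_TO_THREE = {
    "A": "ALA", "R": "ARG", "N": "ASN", "D": "ASP", "C": "CYS",
    "Q": "GLN", "E": "GLU", "G": "GLY", "H": "HIS", "I": "ILE",
    "L": "LEU", "K": "LYS", "M": "MET", "F": "PHE", "P": "PRO",
    "S": "SER", "T": "THR", "W": "TRP", "Y": "TYR", "V": "VAL",
}


def resn_selection(one_letter_codes):
    """Turn a space-separated list of one-letter codes into a 'resn' selection."""
    names = [ONE_TO_THREE[code] for code in one_letter_codes.split()]
    return "resn " + "+".join(names)


# Group A: colour everything white first, then the hydrophobic residues blue.
hydrophobic = resn_selection("G A V L I C M F P W")
print("color white, all")
print("color blue, " + hydrophobic)
```

The same helper can be used for groups B and C by passing in the corresponding row of the table and changing the colour name.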

For some of these classifications you should be able to see a trend in the localisation of the residues of different colours - in particular, one of the classes is preferentially located in the interior of the protein.

Which class of amino acids is preferentially located in the interior of the protein? (Follow this link to read an answer to this question).

(These files show the protein already labeled using these different colouring-schemes: hydrophobic; polar; aromatic - please don't open them until you have finished trying this for yourself!)

3D Locations of Strongly Conserved Residues

Use the "Conservation" chart in JalView to identify columns in the MSA that are the same in all sequences in the alignment (these are marked with a "*" in the conservation chart).

Identify the sequence in the alignment corresponding to the human beta-chain hemoglobin ("humanHemoglobinB")

Using the initial all-white human beta-chain hemoglobin PyMOL file from above, label red all the residues in this chain that are the same in all sequences.
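Because of gaps, the number of an alignment column is not necessarily the number of the residue within a particular sequence. The sketch below, using a made-up alignment (only the name "humanHemoglobinB" is taken from this activity; the other names and all residues are hypothetical), finds the columns that are identical in every sequence and reports them as residue numbers in the reference sequence. It assumes residues are numbered from 1 at the start of the aligned sequence; numbering in a structure file may be offset from this, so check it against the sequence shown in PyMOL.

```python
def invariant_residue_numbers(alignment, reference):
    """Residue numbers (1-based, in `reference`) of fully conserved columns."""
    ref_seq = alignment[reference]
    numbers = []
    position = 0  # residue counter for the reference sequence
    for i, ref_char in enumerate(ref_seq):
        if ref_char == "-":
            continue  # a gap in the reference has no residue number
        position += 1
        column = {seq[i] for seq in alignment.values()}
        if len(column) == 1:  # every sequence has the same residue here
            numbers.append(position)
    return numbers


# Hypothetical toy alignment (not the real hemoglobin MSA).
toy = {
    "humanHemoglobinB": "MV-HLTPE",
    "otherSeq1": "MVAHLSPA",
    "otherSeq2": "M-AHLTPD",
}

conserved = invariant_residue_numbers(toy, "humanHemoglobinB")
selection = "resi " + "+".join(str(n) for n in conserved)
print("color red, " + selection)
```

The printed line uses PyMOL's "resi" selector with a "+"-separated list of residue numbers.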

Are these strongly-conserved residues preferentially located (a) in the core (b) on the surface (c) in the haem-binding pocket? (Follow this link to read an answer to this question)

(This is the human beta hemoglobin chain with the invariant residues already coloured red)

Measuring Sequence Similarity

When talking about how similar two sequences are to each other, it often helps to be able to measure/quantify this similarity. One common way of doing this is to calculate the "percentage identity" (or "%ID") of a pair of sequences in the context of a given alignment.

%ID is calculated by (a) ignoring all columns that contain any gap characters (b) for the remaining columns, counting the number of columns where both sequences have exactly the same amino acid residue in the same column, and calculating this as a percentage of all non-gapped columns.

For example, in the alignment below (a region of an alignment between a pair of hemoglobin sequences)


Note the importance of the alignment for this measurement - if we take the following alignment (which is between the same pair of sequences)

we calculate %ID = 20% which is much lower than the 65% calculated from the previous alignment

With a %ID of 20% we might conclude that the sequences are relatively dissimilar - if we hadn't looked at the alignment and noticed that placing gaps differently could give us an alignment with much higher %ID!
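The %ID calculation described above can be sketched in a few lines of Python. The function name and the toy sequences are hypothetical (they are not the hemoglobin alignments discussed above); the two toy alignments contain the same pair of sequences, but with the gap placed differently.

```python
def percent_identity(aligned_a, aligned_b):
    """%ID of two aligned sequences, ignoring columns that contain a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # (a) keep only the columns without gap characters
    ungapped = [(a, b) for a, b in zip(aligned_a, aligned_b)
                if a != "-" and b != "-"]
    if not ungapped:
        return 0.0
    # (b) count the columns where both sequences have the same residue
    matches = sum(1 for a, b in ungapped if a == b)
    return 100.0 * matches / len(ungapped)


# The same two toy sequences, aligned in two different ways.
good = percent_identity("ACDEFGHIK", "-CDEFGHIK")
bad = percent_identity("ACDEFGHIK", "CDEFGHIK-")
print(good, bad)
```

Moving the single gap from one end of the second sequence to the other changes the result from 100% to 0%, which is an extreme version of the effect described above.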

Returning to the MSA, compare the sequences for human hemoglobin A and B chains ("humanHemoglobinA" and "humanHemoglobinB") (this file contains an alignment of only these two sequences - this may be easier to work with than the complete MSA)

What is the percentage identity/%ID for these two sequences? (Here is the answer)

Given the differences between the two sequences, do you expect the structures to be different?

If so, different in what ways? (e.g. including both beta-sheets and alpha-helices, rather than just alpha helices? the same/different numbers of helices? etc.)


View the two structures (unaligned) of the two chains.

Are the differences what you expected? If not, in what way are you surprised?

Align the two chains to each other using PyMOL (instructions on how to do this can be found here)

Does this show the structures to be more or less different than they did before aligning them? What are the biggest differences between the structures?

What is the %ID of the following alignment (the sequences are the human beta hemoglobin chain, and a leghemoglobin from a plant)?



Now compare the aligned structures of these two proteins, as found in this PyMOL file


You might be surprised how similar the two structures are to each other given their low %ID, especially in comparison with the structural differences you saw between the human alpha and beta hemoglobin chains.

Protein Structure, Structure Alignment, and Disease:

The mutation responsible for sickle-cell anemia is the change of residue 7 from a glutamate to a valine (E->V).
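At the sequence level, this change can be written out as a simple substitution. The sketch below uses a hypothetical helper function and, as an assumption, the first ten residues of the human beta chain with the initial methionine counted as residue 1 (residue numbering in structure files may differ from this).

```python
def apply_point_mutation(sequence, position, expected, replacement):
    """Replace the residue at a 1-based position, checking the original residue."""
    if not 1 <= position <= len(sequence):
        raise ValueError("position is outside the sequence")
    found = sequence[position - 1]
    if found != expected:
        raise ValueError(f"expected {expected} at position {position}, found {found}")
    return sequence[:position - 1] + replacement + sequence[position:]


# Assumed start of the human beta chain, counting the initial M as residue 1.
wildtype_start = "MVHLTPEEKS"
mutant_start = apply_point_mutation(wildtype_start, 7, "E", "V")
print(wildtype_start)
print(mutant_start)
```

Checking the expected residue before replacing it guards against applying the change at the wrong position when two numbering schemes are mixed up.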

Given what you have learnt about the relationship between sequence and structure, would you expect this mutation to make a large difference to the structure of the protein?

This PyMOL file contains the structures of the mutant and wildtype chains - align them.

Which differences can you identify between the two structures? Are you surprised by the answer? If so, why?

To help understand how this mutation might cause disease, examine this PyMOL file, which shows the structure of sickle-cell-mutant hemoglobin in what is believed to be a valid biological unit. The mutant residues have been labeled in orange, the chain interacting with one of these mutant residues has been shown in surface view, water molecules near the interaction have been shown in cyan, and polar contacts between these waters and other atoms have been shown with dotted yellow lines.

Based on examination of this structure, can you suggest a reason why this mutant might cause disease?

Align the human sickle-cell mutant protein structure with one of the mouse hemoglobin beta structures, as provided in this PyMOL file.

Based on this alignment, which residue in the mouse structure would you expect to yield a sickle-cell-like disease in mice if it were mutated?

Can you think of any reasons why, given the resistance to malaria induced by heterozygous human wt/sickle-cell genotypes, we might not find such variants in mouse populations? Consider that the malaria parasite that infects humans is not a parasite of mice.

Database Records

The information and data included in the material we have presented here was all acquired, free, from the internet - how did we do this, and where did we find it?

The first step, when working with proteins, is to identify the record describing your protein of interest in the UniProt database - this database provides some of the highest-quality (i.e. most likely to be accurate/true) information concerning proteins, along with links to many different kinds of information about the protein in a range of other databases.

To identify the record we need, we query the EBI EB-eye search tool with the phrase "hemoglobin sickle cell", and follow the links to the UniProt record for HBB_HUMAN (which is, indeed, the protein which experiences mutations that lead to sickle cell anemia).

Note that, knowing the accession number of your record of interest in the database, you can also go quickly to that record.
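As an illustration, the sketch below checks whether a piece of text has the general shape of a UniProt accession number, and builds a web address for the corresponding record. The function names are hypothetical, and the address pattern is an assumption about how the UniProt website is currently organised, so it may need adjusting.

```python
import re

# General pattern for UniProt accession numbers (e.g. "P02088").
ACCESSION_PATTERN = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def looks_like_accession(text):
    """True if the whole text has the shape of a UniProt accession number."""
    return ACCESSION_PATTERN.fullmatch(text) is not None


def entry_url(accession):
    """Build a record address (assumed URL layout, not taken from this page)."""
    if not looks_like_accession(accession):
        raise ValueError(accession + " does not look like an accession number")
    return "https://www.uniprot.org/uniprotkb/" + accession


print(looks_like_accession("P02088"))
print(looks_like_accession("HBB_HUMAN"))
print(entry_url("P02088"))
```

Note that an entry name such as HBB_HUMAN is not an accession number, which is why the check rejects it.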

We will show you where this record provides information about:
Use the EB-eye search tool to find out similar information for a different protein with accession number "P02088"

Is there any other information in the record that you find particularly interesting? If so, please make a note of this as we'd be interested to hear about particular areas of interest for teachers.

Preparing a Protein Information Worksheet

Complete the following worksheet using information from UniProt record for human hemoglobin beta chain, and from the links to other databases found in the UniProt record:
If you have time, you could try and fill out a similar worksheet for some other proteins of interest:
