Protein Evolution

Friday 30th January 2009

Molecular Biotechnology Center [MBC], Torino, Italy

ELLS LearningLAB: Molecular evolution: modern evidence for Darwin's theory

Francesca Diella and Aidan Budd

Activity 1: Evolution of Hemoglobin Structure and Function

Multiple Sequence Alignment (MSA)

Cut and paste the contents of this protein sequence file (in FASTA format) into the EBI MUSCLE webserver (into the box below the text "Enter or Paste a set of Sequences in any supported format:"). Run the software (click the "Run" button).

Wait a few moments, then click on the JalView button to view the results with the JalView MSA editor.

(Here is a pre-calculated version of this file that can be viewed using a locally installed version of an alignment viewer/editor such as JalView or CLUSTALX)

You will notice that different amino acids have different colours - although sometimes they are also left uncoloured/white. If an amino acid is coloured, then it is always the same colour, e.g. if aspartate (D) is coloured, it is always coloured purple. You'll also notice that there are several cases where different amino acids have the same colour, e.g. both aspartate (D) and glutamate (E) are purple.

Using the tables and diagram from the introduction, which are the common properties shared by the different colours of amino acids?

What does the "Conservation" chart (at the bottom of the JalView screen) describe?

The "same" protein in different organisms, or different members of the same protein family in the same organism, may have different amino acid sequences (as can be seen from the MSA). These differences are due to "point mutations" in the underlying DNA sequence of the genes that code for these proteins.

The aim of an MSA is to place amino acid residues from different proteins in the same column, such that residues in the same column are related to each other either via point mutations or by no mutations at all. Note that there are many different possible residues that could be aligned in the same column for two different sequences; however, a given residue is related in this way to (at most) only one residue in another sequence.

Note, however, that some positions in the alignment contain the "-" character, indicating a "blank" or "gap", i.e. that there are no residues in these sequences at those positions that are related via point mutations (or no mutations at all).

What kinds of genetic events do you think might have caused these gaps?

You will also have noticed that different columns in the alignment look different from each other.

Describe these differences in terms of (a) the number of different amino acids found in the column (e.g. is it only ever D and E, or is it many different ones, e.g. V, I, L, D, G, etc.), (b) the properties of the amino acids (are they all hydrophobic, or a mixture of hydrophobic/aromatic/polar etc.), and (c) the presence of empty ("gap") characters.
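For those who would like to check their observations with a short script, here is a minimal Python sketch of points (a) and (c): for each column it lists the distinct amino acids and notes whether the column contains a gap. The function name summarise_columns and the toy alignment are invented for illustration; they are not part of the workshop files, and the amino acid properties of point (b) are left to the tables from the introduction.

```python
from collections import Counter


def summarise_columns(aligned_seqs, gap="-"):
    """Return one summary dict per alignment column (hypothetical helper)."""
    lengths = {len(seq) for seq in aligned_seqs}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    summaries = []
    # zip(*rows) walks through the alignment one column at a time
    for index, column in enumerate(zip(*aligned_seqs), start=1):
        residues = Counter(aa for aa in column if aa != gap)
        summaries.append({
            "column": index,
            "distinct_residues": sorted(residues),
            "has_gap": gap in column,
            # invariant: one residue type only, and no gaps
            "invariant": len(residues) == 1 and gap not in column,
        })
    return summaries


# Toy alignment, invented for illustration (not the hemoglobin data)
toy_alignment = [
    "MKDLV-A",
    "MKELIGA",
    "MRDLL-A",
]
for summary in summarise_columns(toy_alignment):
    print(summary)
```

In this toy example columns 1, 4 and 7 are invariant, column 5 holds three different residues, and column 6 contains gaps.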

These observations can help us answer some questions concerning patterns of protein evolution:
  1. Does the rate of point mutations at different positions in a protein seem to be (a) the same for all positions (b) different in different positions?
  2. Do all positions in the protein seem to be able to accept gap characters, or are there some positions in the protein that do not seem to accept gaps?
  3. Do all positions in a protein have to preserve the same kind of chemical/physical property? Or are there some positions that seem to accept amino acids with many different properties?
  4. Are certain properties more likely to be conserved than others?
Leave the alignment for these sequences open before continuing with the next section.

Protein Structure and Amino Acid Properties:

Load the following file for human beta-chain hemoglobin into PyMOL.

To give you some experience of manipulating and examining 3D protein structures, use the sequence information given at the top of the PyMOL window to colour the residues of the chain different colours, depending on which group you are assigned to (this exercise will also help reveal some important patterns of protein evolution.)
Hydrophobic (blue)
Non-hydrophobic (white)
Polar (red)
Non-polar (white)
Aromatic (yellow)
Large (white)
For some of these classifications you should be able to see a trend in the localisation of the residues of different colours e.g. tend to be absent from either the interior or exterior of the protein, or tend to be near the haem molecule - can you identify any such trends?

(These files show the protein already labeled using these different colouring-schemes: hydrophobic; polar; small - please don't open them until you have finished trying this for yourself! They are provided for use at a later date)

Protein Structure and Alignments:

Use the "Conservation" column to identify columns in the MSA that are the same in all sequences in the alignment (these are marked with a "*").

Identify the sequence in the alignment corresponding to the human beta-chain hemoglobin ("humanHemoglobinB").

Using the initial all-white human beta-chain hemoglobin PyMOL file from above, label red all the residues in this chain that are the same in all sequences.
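If you prefer not to count positions by hand, the step from "invariant column in the MSA" to "residue number in one chain" can be sketched in Python as below. This is only an illustration under stated assumptions: the function names and the toy alignment are invented, and the first_residue parameter is there because the residue numbering in a structure file may be offset from the position in the sequence, so please check a few residues by eye before trusting the numbers. The only PyMOL feature relied on is the selection syntax "resi 1+4+6".

```python
def invariant_residue_numbers(aligned_seqs, reference_index, gap="-", first_residue=1):
    """Residue numbers, in the reference sequence, of columns identical in all sequences."""
    numbers = []
    residue_number = first_residue - 1
    for column in zip(*aligned_seqs):
        ref_residue = column[reference_index]
        if ref_residue == gap:
            continue  # the reference sequence has no residue in this column
        residue_number += 1
        if all(aa == ref_residue for aa in column):
            numbers.append(residue_number)
    return numbers


def pymol_selection(numbers):
    """Build a PyMOL selection string such as 'resi 1+4+6'."""
    if not numbers:
        return "none"
    return "resi " + "+".join(str(n) for n in numbers)


# Toy alignment, invented for illustration (not the hemoglobin data)
toy_alignment = [
    "MKDLV-A",
    "MKELIGA",
    "MRDLL-A",
]
numbers = invariant_residue_numbers(toy_alignment, reference_index=0)
print(pymol_selection(numbers))  # resi 1+4+6
```

The printed string can then be used in the PyMOL command line, e.g. "color red, resi 1+4+6". Note that the gap in the first toy sequence shifts its numbering: the final alanine is in alignment column 7 but is residue 6 of that sequence.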

Are these highly-conserved residues preferentially located (a) in the core (b) on the surface (c) in the haem-binding pocket?

(This is the human beta hemoglobin chain with the invariant residues already coloured red)

When talking about how similar two sequences are to each other, it often helps to be able to measure/quantify this similarity. One common way of doing this is to calculate the "percentage identity" (or "%ID") of a pair of sequences in the context of a given alignment.

%ID is calculated by (a) ignoring all columns that contain any gap characters (b) for the remaining columns, counting the number of columns where both sequences have exactly the same amino acid residue in the same column, and calculating this as a percentage of all non-gapped columns.

For example, in the alignment below (a region of an alignment between a pair of hemoglobin sequences)

Note the importance of the alignment for this measurement - if we take the following alignment (which is between the same pair of sequences)

we calculate %ID = 20% which is much lower than the 65% calculated from the previous alignment

With a %ID of 20% we might conclude that the sequences are relatively dissimilar - if we hadn't looked at the alignment and noticed that placing gaps differently could give us an alignment with much higher %ID!
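The %ID calculation described above, and its dependence on where the gaps are placed, can be sketched in Python as follows. The function name percent_identity and the two toy sequences are invented for illustration; they are not the hemoglobin sequences from the workshop alignments.

```python
def percent_identity(seq_a, seq_b, gap="-"):
    """%ID of two sequences taken from the same alignment (hypothetical helper)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    # (a) ignore every column in which either sequence has a gap
    ungapped = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not ungapped:
        raise ValueError("no ungapped columns to compare")
    # (b) count the remaining columns where both residues are identical
    identical = sum(1 for a, b in ungapped if a == b)
    return 100.0 * identical / len(ungapped)


# The same pair of toy sequences, aligned in two different ways
print(percent_identity("ACDEFGHIKLM", "ACDEF-HIKLM"))  # 100.0
print(percent_identity("ACDEFGHIKLM", "ACDEFHIKLM-"))  # 50.0
```

In both cases the second sequence is the same ten residues; only the position of the gap differs, yet the %ID drops from 100% to 50%.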

Returning to the MSA, compare the sequences for the human hemoglobin A and B chains ("humanHemoglobinA" and "humanHemoglobinB").

What is the percentage identity/%ID for these two sequences?

Given the differences between the two sequences, do you expect the structures to be different?

If so, different in what ways? (e.g. including both beta-sheets and alpha-helices, rather than just alpha helices? the same/different numbers of helices? etc.)

View the two structures (unaligned) of the two chains.

Are the differences what you expected? If not, in what way are you surprised?

Align the two chains to each other using PyMOL.

Does this show the structures to be more or less different than they did before aligning them? What are the biggest differences between the structures?

What is the %ID of the following alignment (the sequences are the human beta hemoglobin chain, and a leghemoglobin from a plant)?

Now compare the aligned structures of these two proteins, as found in this PyMOL file.

You might be surprised how similar the two structures are to each other given their low similarity/%ID, particularly when you compare this with the differences in structure, and the %ID, that you found for the human alpha and beta hemoglobin chains.

Protein Structure, Structure Alignment, and Disease:

The mutation responsible for sickle-cell anemia is the change of residue 7 from a glutamate to a valine (E->V).
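To make the size of this change concrete, here is a small Python sketch that applies the E->V substitution to the start of the beta-chain sequence. Two assumptions are made that do not come from the workshop files: the fragment "MVHLTPEEK" is taken to be the start of the human beta chain including the initiator methionine (which is what makes the glutamate residue number 7), and the codon change shown is the commonly reported GAG to GTG substitution. The function name is invented for illustration.

```python
def apply_point_mutation(sequence, position, expected, replacement):
    """Return a copy of sequence with the residue at a 1-based position replaced."""
    index = position - 1
    if not 0 <= index < len(sequence):
        raise ValueError("position lies outside the sequence")
    if sequence[index] != expected:
        raise ValueError(
            f"expected {expected} at position {position}, found {sequence[index]}"
        )
    return sequence[:index] + replacement + sequence[index + 1:]


# Assumed start of the human beta chain, counting the initiator methionine
wildtype_start = "MVHLTPEEK"
mutant_start = apply_point_mutation(wildtype_start, 7, "E", "V")
print(mutant_start)  # MVHLTPVEK

# Assumed codons: GAG codes for glutamate (E), GTG for valine (V)
wildtype_codon, mutant_codon = "GAG", "GTG"
changed_bases = sum(1 for a, b in zip(wildtype_codon, mutant_codon) if a != b)
print(changed_bases)  # 1
```

A single changed base in the DNA, and so a single changed residue in the protein, is all that separates the two chains.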

Given what you have learnt about the relationship between sequence and structure, would you expect this mutation to make a large difference to the structure of the protein?

This PyMOL file contains the structures of the mutant and wildtype chains - align them.

Which differences can you identify between the two structures? Are you surprised by the answer? If so, why?

To help understand how this mutation might cause disease, examine this PyMOL file, which shows the structure of sickle-cell-mutant hemoglobin in what is believed to be a valid biological unit. The mutant residues have been labeled in orange, the chain interacting with one of these mutant residues has been shown in surface view, water molecules near the interaction have been shown in cyan, and polar contacts between these waters and other atoms have been shown with dotted yellow lines.

Based on examination of this structure, can you suggest a reason why this mutant might cause disease?

Align the human sickle-cell mutant protein structure with one of the mouse hemoglobin beta structures, as provided in this PyMOL file.

Based on this alignment, which residue in the mouse structure would you expect, if mutated, to yield a sickle-cell-like disease in mice?

Can you think of any reasons why, given the resistance to malaria induced by heterozygous human wt/sickle-cell genotypes, we might not find such variants in mouse populations? Consider that the malaria parasite that infects humans is not a parasite of mice.

Database Records

The information and data included in the material we have presented here was all acquired, free, from the internet - how did we do this, and where did we find it?

The first step, when working with proteins, is to identify the record describing your protein of interest in the UniProt database - this database provides some of the highest-quality (i.e. most likely to be accurate/true) information concerning proteins, along with links to many different kinds of information about the protein in a range of other databases.

To identify the record we need, we query the EBI EB-eye search tool with the phrase "hemoglobin sickle cell", and follow the links to the UniProt record for HBB_HUMAN (which is, indeed, the protein which experiences mutations that lead to sickle cell anemia).

Note that, knowing the accession number of your record of interest in the database, you can also go quickly to that record.
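As an illustration of going straight to a record from its accession number, here is a hedged Python sketch. It assumes that UniProt serves FASTA files at an address of the form https://rest.uniprot.org/uniprotkb/ACCESSION.fasta; this address pattern and the function names are assumptions made for this example, not something taken from the workshop material, so check the current UniProt help pages if the download fails.

```python
from urllib.request import urlopen

# Assumed address pattern for UniProt FASTA downloads (see the note above)
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def uniprot_fasta_url(accession):
    """Build the (assumed) download address for one accession number."""
    accession = accession.strip().upper()
    if not accession.isalnum():
        raise ValueError("an accession number contains only letters and digits")
    return UNIPROT_FASTA_URL.format(accession=accession)


def fetch_fasta(accession, timeout=30):
    """Download the FASTA record as text (needs an internet connection)."""
    with urlopen(uniprot_fasta_url(accession), timeout=timeout) as response:
        return response.read().decode("utf-8")


print(uniprot_fasta_url("P02088"))
```

Calling fetch_fasta("P02088") should then return the sequence in the same FASTA format used at the start of this activity.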

We will show you where this record provides information about:
Use the EB-eye search tool to find similar information for a different protein, with accession number "P02088".

Is there any other information in the record that you find particularly interesting? If so, please make a note of this as we'd be interested to hear about particular areas of interest for teachers.
