Manual Editing and Adjustment of MSAs
Introductory Notes
The situation often arises that an automatic
alignment program provides a good alignment, but that there are regions
of the alignment that appear to be less-well aligned. As we have seen
in the previous practical, the presence of poorly-aligned regions in an
alignment can have a significant (and, as one would expect, negative)
influence on the quality of an analysis that makes use of that
alignment.
With that in mind, one way to improve the quality of your results is to
take the time to try and obtain the best MSA possible. However, this is
a non-trivial task, and one typically needs to invest a reasonable
amount of time examining and adjusting the alignment before coming up
with something that can be considered as closely approaching "ideal".
This taks/process is one of central importance in many aspects of
bioinformatics, particularly as used by wet-lab biologists, as the
difference between working with a good or a bad alignment can be the
difference between success and failure in the analysis.
Indeed, it is this particular session of the course that was the
motivation behind initiating this course in the first place - more than
once I have been confronted by unhappy biologists who just do not know
where to begin when faced with the task of improving their alignments
(having been told that this is what they should do), and I wanted to
try and pass on some of hints and tips that have been passed on to me
through the years to other people.
However, you will be pleased to hear that this process of examining and
adjusting alignments is something that gets easier with practice, and
such practice is exactly what you will be obtaining during this
practical.
Note that while choosing example datasets for use in this class, I
began with hand-edited alignments, that were created by comparison
to proteins with known structure. I removed all gapped residues from
these alignments, and then realigned the sequences using the CLUSTAL
automatic alignment software. However, I did not choose CLUSTAL for
this task because, of all available automatic sequence alignment
software it produces the best results - rather, I chose it precisely
because it is not
the best, with several other software packages generally doing better
than it (e.g. MUSCLE, MAFFT, PROBCONS) muscle, mafft, probcons). By
using CLUSTAL we get
alignments that have more errors in them than if we had used one of the
other programmes, giving us more errors to correct manually - and as
the point of these exercises is to give you experience of correcting
such errors, to some extent "the more the merrier". However, it does
not follow that automatic alignments created by other software would not
need to be manually edited - thus the skills you acquire in this
exercise are
directly transferable to your efforts on your own sequences (for which
I would recommend you begin by creating alignments using several
different pieces of software, using several different parameters.)
Suspected "bad" or Missing Sequence
Typically, we think of "hand editing" an alignment once we have
obtained an alignment that contains a set of sequences that we think
are likely to be used in our final analysis. This "hand editing" then
involves adjusting the positions of the gaps in the alignment
However sometimes, before reaching this stage, we come up with a set of
sequences we are intersted in analysing further, but we suspect that
some of the sequences in this set are either missing regions (i.e. if
we could determine the sequence of the full-length peptide, we would
find that it were longer than the sequence that currently resides in
the alignment.) Such a situation could easily occur if, for example,
the protein sequence was based on an incompletely-sequenced cDNA.
Alternatively, based on an unusual pattern of divergence in a
particular sequence, we suspect that one or more regions of the
sequence are simply "wrong" or "bad" - one scenario that might account
for this would be that, if the sequence was based on prediction from
genomic DNA, an exon boundry could have been mis-identified, leading
the predicted peptide to include regions that were based on what is
actually non-coding DNA. Such regions of a sequence may be obvious,
with some portions of a sequence displaying a 'typical' degree of
divergence from the other sequences in an alignment, while others are
obviously much more divergent compared to other sequences.
Due to the errors inherent in the process of DNA sequencing, genome
assembly, and gene prediction, sequence databases typically contain
many sequences that share some of these problems. Therefore, as we
typically begin building an MSA from sequences
extracted from a database, sequences of this kind are therefore often
observed in an initial MSA.
Obviously, missing sequence (e.g. missing exons) reduce the amount of
information in the alignment. In particular, the more accurate
phylogenetic analysis approaches almost all exclude from analysis any
column that contains even just a single gap character - thus, a single
'fragment' can lead to an entire region of an alignment being excluded
from the analysis, obviously causing a potentially major reduction in
the precision/sensitivity of the analysis.
"Bad" sequence is perhaps even worse, as it has the potential to also
reduce the accuracy of an analysis, by contributing a set of residues
to the alignment that are not related to the other residues in their
columns by a process of substitution.
The decision about how to deal with these issues depends on the
analysis
being carried out. As a general rule, if the sequences are not
particularly important for the results of the analysis, one simply
discards them. For example, consider an exploration of vertebrate
phylogeny. This will typically include, not just vertebrate sequences,
but also some additional, non-vertebrate sequences in the analysis, to
fix a root on the vertebrate tree. If you have several such
invertebrate sequences, and one of them is missing exons or appears to
contain 'wrong' sequence, you would simply discard the sequence.
However, if you are particularly interested in the position of this
sequence within the tree, you will either keep the sequence in the
alignment or - even better - attempt to collect the missing/correct
sequence.
We will here look at a case where the initial set of sequences
collected for an analysis contains a sequence containing apparently
'wrong' sequence. This sequence is from the puffer-fish Tetraodon
nigroviridis, and is for the signal recognition particle 72KDa
protein - we assume that
the aim of the analysis is to obtain an overview of vertebrate
phylogeny.
Collect an initial set of sequences related to the human SRP72KDa
protein SRP72_HUMAN by obtaining this sequence via SRS, and then
BLASTing
this sequence against the UNIPROT database at the EBI.
Load the resulting sequences into CLUSTALX and align them
automatically.
Several of the sequences in this initial alignment are fragments -
remove these.
Now estimate a bootstrapped NJ tree using CLUSTAL with default
settings, and display the tree using NJPLOT.
Note that in general the sequences in the tree that are from
vertebrates show the expected relationships - however, the TETNG
(Tetradodon nigroviridis) sequence, which should cluster as a
sister
group to the zebrafish sequences (BRARE), is placed as sister group to
all other vertebrate sequences, with a high bootstrap value on this
branch.
Looking at around 700 residues into the alignment, if the "Show
Low-Scorring Residues" option is on, you can see that there is a string
of around 50 amino acids that appear to be widely divergent from the
other sequences. (If this you cannot find this region, then perhaps the
sequence has recently been "repaired" in the databases - for the sake
of this exercise, in this case load the sequence directly into the
alignment from this link.)
As we mentioned, this might be due to the inclusion of non-coding DNA
sequence in the gene prediction software used to obtain this protein
sequence. As we imagine we are very interested in obtaining the correct
sequence for this gene, we will attempt to locate the 'correct'
sequence for this gene.
To do this, we will attempt to align one of the other sequences closely
related to the TETNG sequence against the genomic DNA from which the
TETNG sequence was predicted. We will do this using GENEWISE, software
developed for addressing exactly this kind of question.
Firstly, click here collect the
zebra-fish peptide sequence to use as a query against the Tetraodon
genomic DNA.
Secondly, collect the genomic DNA sequence for the Tetraodon gene
from the Ensembl website.
To do this, search the Tetraodon predicted proteome using
BLAST. From the result page, follow the links to find the Ensembl gene
that corresponds to this peptide. Follow the "Peptide Info" link from
the gene page to get back to the record related to the peptide.
(Note that for Ensembl release 42, December 2006, the gene was
GSTENG00024510001 - however I realise that Ensembl has updated to
release 43 TODAY so who knows what we will find...)
Check using the search facility in your browser whether the peptide
predicted for this gene does indeed contain the 'wrong' sequence by
searching for e.g. "SGGPV" (part of the wrong sequence) - it should be
there...
Follow the link from the "Gene" page to the genomic location of the
gene (link should read something like "2,779,399-2,783,601")
Go to the link on the left of the page under the "export data" heading,
to export the sequence as fasta (collect it as "text"), and save this
sequence into a file.
Use the genewise server at the ebi to compare the zebrafish sequence to
the pufferfish genomic sequence. In the output you obtain, check using
the search facility of your browser whether the 'correct' sequence is
in the new gene prediction e.g. using "QVAQL", a sequence found in all
the zebrafish proteins in this region.
Repeat this process using the predicted TETNG gene to see if you an
find where the 'wrong' sequence came from
(it looks to me as though a
splice site was missed, so an exon boundary was ignored, and then a
small exon was missed.)
Using a text editor, construct a file containg the 'correct' fasta
sequence for the Tetraodon protein. Load this into CLUSTALX
(with the previous alignment still loaded) using "Append" from the File
menu.
Delete the initial Tetraodon sequence containing the "bad" sequence,
and recalculate the alignment.
Estimate a new tree from
this alignment using bootstrapping.
Q Has the sequence found a new,
better position in the tree?
Q Have the bootstrap values for any regions of the tree improved?
I have included this exercise under the heading of "Manual edition"
as it involves specific attempts to adjust regions of an alignment, and
often involves "manually" creating new sequences to add to the
alignment, sometimes joining together regions of sequence from a range
of different sources.
Hand-Editing Alignments With SEAVIEW
Assuming you have already obtained an automaticly-calculated MSA that
contains a set of sequences appropriate for your analysis, the next
step is to examine the alignment in detail, to identify potential
regions that are misaligned, and to alter (and hence hopefully correct)
these regions.
To give you practice in this task, download this alignment of sequences that
contain KH domains, and which was prepared by automatic alignment of
these sequences by CLUSTALX.
Load the alignment into CLUSTALX and examine the alignment for
regions that may have been misaligned.
Remember back to the presentation that was made on the topic of
manually editing sequences - be aware that, when making your judgement
about whether or not a region is badly aligned, you are basing your
decision on your perception of how well the pattern of amino acids in
the sequence region you are examining fits into the pattern observed in
the other sequences in the alignment in the same region.
Remember, also, that one of the things that makes this process so
difficult (and perhaps also so interesting/challenging??) is that the
same amino acid e.g. glycine, can take on mamy different kinds of roles
when it appears in different proteins, and in different positions in
the same proteins. For example - in some proteins, possessing a
particular glycine residue at a particular part of the structure, may
be vital for the folding and/or function of the protein. Such a
position is likely to be very highly conserved amongst all relatives of
that protein, and therefore a single sequence that does not contain a
glycine at in such a highly-conserved column makes one suspicious that
this single sequence may be misaligned with respect to the other
sequences in the alignment. However, in a different protein (or in the
same protein at a different position in the sequence) a glycine might
be just one of many different residues that could be used at that
position, in which case oberving a different amino acid in that column
To consider another example - one positions in a structure demand an
aromatic residue, which would mean a substition by anything other than
a tryptophan, tyrosine, phenyalanine, or a histitdine (W, Y, F, or H)
would indicate a possible misalignment. At another position, the
residue may need to be polar, in which case it would likely tolerate a
tyrosine or histitdine, but not a tryptophan or a phenyalanine. Thus,
in the first column, seeing a W and a Y in the same column would not
raise any warning signs - while in the second column, this would
indicate possible mis-alignment of the sequence containing the W.
Further, note that in the examples I describe above, I consider
patterns of amino acid preference that are only one amino acid long -
in practice, one is often looking for a longer pattern e.g. hydrophobic
- small - polar - aromatic. This search for patterns is further
informed by our knowledge of the patterns associated with different
types of secondary structure elements - For a summary of such features,
check out the following page, which shows a colletion of alignments of
protein structural elements, accompanied by the representation of these
elements in the 3D protein structure and the associated PDB files for
you to examine yourselves if you are interested. Currently the page
includes only a few such links, but hopefully this will be expanded in
the future. STRUCTURE GALLERY
As an aid to the process of identifying (and then checking whether many
sequences in the alignment share) these kinds of patterns, follow this
link to the, in my opinion extremelly useful, summary of the different
contexts the different amino acids are likely to function in LINK TO AA
PICTURE
Once you have identified potentially mis-aligned regions, look in
the surrounding amino acids of these regions, in the sequence you
suspect of being mis-aligned, and try and identify substrings that
might match the pattern you have identified in the other sequences
You will want to try and position these regions in place of the
potentially mis-aligned sequence to see if the resulting alignment is
an improvement on the initial one.
Before editing the alignment, make a new copy of the initial
alignment file that you will use for editing, using a name that
indicates that the new copy will have been edited (I usually add "ED"
before the ".aln" of the original name).
To begin editing the alignment, load the alignment into the alignment
editor SEAVIEW.
While the representation of alignments in SEAVIEW is not as
attractive or as configurable as in JALVIEW (a more recently-developed
sequence alignment editor), I use SEAVIEW for two reasons (i) I find
the interface quick and intuitive to use (ii) much more importantly, I
have used it for many years, and have never experienced that it
corrupted my alignment (a problem associated with several other older
editors). After investing so much time in creating your alignment, you
certainly do not want, at some late stage, to have errors introduced by
the software into your alignment - particularly in the case where you
do not even notice them occuring...!
To carry out your edits
- select the set of sequences into which you want to insert gaps on
the left side of the screen
Note - depending on whether you want to try a new alignment with the
potentially mis-aligned regions shifted C or N terminal with respect to
the other sequences, you will either need to select, respectively (i)
the potentially mis-aligned sequences or (ii) all sequences other than
the mis-aligned sequences
- insert the "gap" character "-" into the sequences in the
position needed to create the new alignment
- assuming that the region being edited is flanked on the
C-terminal (right) side by regions you believe to be correctly aligned,
you will need to move the cursor to somewhere C-terminal of your
initial insertion, in a region that contains only gapped residues, and
then delete (using the backspace key) the same number of residues that
you just inserted, to bring the rest of the residues in these sequences
back into line.
Once you have carried out an edit (or a few edits), save the alignment,
and load it back into CLUSTALX, and use this programme to judge whether
you edits have either improved, not changed, or made worse, the
alignment.
Assuming there are additional changes that you identify as potentially
improving the alignment, repeat this loop of loading into SEAVIEW,
editing, and loading back into CLUSTALX to check the alignment (make a
point of saving your edited alignments frequently, with different names
(I usually just add a "1" after the "ED" I previously added to the copy
of the alignment file name) - this is important as sometimes one can
get into a complete mess, and one will then want to return to a recent
version of the file (not begining again back at the begining!)
Something you might like to try out, is to identify groups of
closely-related sequences in the alignment, and to save these groups in
a group of separate files (using the profile interface of CLUSTALX) -
you will aim to create a good quality alignment of each of these
subsets of sequences, these alignments presumably then being easier to
align against each other. Once the individual groups have been well
aligned, they can be fused back together with other groups, also using
the profile options in CLUSTALX - the resulting alignments can then be
further edited using SEAVIEW. Note that, in general, alignment sofrware
will not allow you to load multiple sequences with the same name -
therefore, you need to be careful to make sure that each sequence is
segreagated into one and only one file when
Once you have made all the changes you think are possible, compare your
alignment to this one: hand-edited
alignment of KH-domain sequences. This alignment was prepared
manually, by comparison to the structure of a member of the alignment
that has been solved (this can inform ones decisons about where to
place gaps in the alignment of the sequence whose structure is known -
gaps have a very strong tendency to occur BETWEEN rather than WITHIN
secondary structure elements i.e. the localise preferentially to loops
in the protein structure)
Compare the alignment you come up with to the hand-edited alignment
here, to see how close you got to the ideal alignment.
Here are some other sets of sequences for you to try hand-editing. Each
of these files
is a high-quality manually-curated alignment based on comparison to
structures - to obtain your low-quality alignment that needs improving,
remove all gaps from the alignment using CLUSTALX, and then realign
using your
alignment software of choice. Then try and fix the resulting
(presumably non-ideal) alignments and compare them to the initial
alignment. You might like to try editing the several alignments of the
same set of sequences that differ in terms of the alignment software
you used to create them.
Extra Exercises
Here are some additional exercises for you to try out if you get
through the ones above with time to spare (don't go through them in
order - just pick out those that sound more interesting to you).
1) Carry out a similar search for sequence that appears to be either
missing or "bad" in the following set of sequences that are taken from
TreeFam TF105801 (for instance,
there may be a region of the rat sequence that is "bad").
2) Additional alignments for you to try hand-editing (each of these
files is a high-quality manually-curated alignment based on comparison
to structures - to obtain your low-quality alignment that needs
improving, remove all gaps from the alignment, and then realign using
your alignment software of choice. Then try and fix the resulting
(presumably non-ideal) alignments:
3) The sets of sequences you used in the hand-editing exercises above
required some effort to obtain - note that none of the sequences seems
to be a fragment or to contain "bad" sequence. Attempt to obtain a set
of sequences similar to that in the S1 Protease dataset above - this
will probably involve carrying out several searches and combing the
results - you will also have to
watch out for the problem of collecting the same sequences (with the
same name!) twice, which
you are then not able to load into the same piece of software.
This exercise will hopefully have helped
you
- distinguish between potentailly mis-aligned and "bad" sequence
- gain experience identifying potentially mis-aligned regions of
an alignment
- gain experience in the process of manually editing alignments
using CLUSTALX and SEAVIEW
- demonstrated strategies that can be used to deal with
fragments and "bad" sequence
Note that throughout these exercises the
following formating is
used to
specify different types of text
Bold non-italic text like this gives
you instructions about tasks you should carry out e.g. "View the
following webpage"
Italic text specifies questions for
you to answer
Back
to Gibson Team course pages at EMBL.