Manual Editing and Adjustment of MSAs

Introductory Notes

The situation often arises that an automatic alignment program provides a good alignment, but that there are regions of the alignment that appear to be less-well aligned. As we have seen in the previous practical, the presence of poorly-aligned regions in an alignment can have a significant (and, as one would expect, negative) influence on the quality of an analysis that makes use of that alignment.

With that in mind, one way to improve the quality of your results is to take the time to try and obtain the best MSA possible. However, this is a non-trivial task, and one typically needs to invest a reasonable amount of time examining and adjusting the alignment before coming up with something that can be considered as closely approaching "ideal". This taks/process is one of central importance in many aspects of bioinformatics, particularly as used by wet-lab biologists, as the difference between working with a good or a bad alignment can be the difference between success and failure in the analysis.

Indeed, it is this particular session of the course that was the motivation behind initiating this course in the first place - more than once I have been confronted by unhappy biologists who just do not know where to begin when faced with the task of improving their alignments (having been told that this is what they should do), and I wanted to try and pass on some of hints and tips that have been passed on to me through the years to other people.

However, you will be pleased to hear that this process of examining and adjusting alignments is something that gets easier with practice, and such practice is exactly what you will be obtaining during this practical.

Note that while choosing example datasets for use in this class, I began with hand-edited alignments, that were created by comparison to proteins with known structure. I removed all gapped residues from these alignments, and then realigned the sequences using the CLUSTAL automatic alignment software. However, I did not choose CLUSTAL for this task because, of all available automatic sequence alignment software it produces the best results - rather, I chose it precisely because it is not the best, with several other software packages generally doing better than it (e.g. MUSCLE, MAFFT, PROBCONS) muscle, mafft, probcons). By using CLUSTAL we get alignments that have more errors in them than if we had used one of the other programmes, giving us more errors to correct manually - and as the point of these exercises is to give you experience of correcting such errors, to some extent "the more the merrier". However, it does not follow that automatic alignments created by other software would not need to be manually edited - thus the skills you acquire in this exercise are directly transferable to your efforts on your own sequences (for which I would recommend you begin by creating alignments using several different pieces of software, using several different parameters.)

Suspected "bad" or Missing Sequence

Typically, we think of "hand editing" an alignment once we have obtained an alignment that contains a set of sequences that we think are likely to be used in our final analysis. This "hand editing" then involves adjusting the positions of the gaps in the alignment

However sometimes, before reaching this stage, we come up with a set of sequences we are intersted in analysing further, but we suspect that some of the sequences in this set are either missing regions (i.e. if we could determine the sequence of the full-length peptide, we would find that it were longer than the sequence that currently resides in the alignment.) Such a situation could easily occur if, for example, the protein sequence was based on an incompletely-sequenced cDNA.

Alternatively, based on an unusual pattern of divergence in a particular sequence, we suspect that one or more regions of the sequence are simply "wrong" or "bad" - one scenario that might account for this would be that, if the sequence was based on prediction from genomic DNA, an exon boundry could have been mis-identified, leading the predicted peptide to include regions that were based on what is actually non-coding DNA. Such regions of a sequence may be obvious, with some portions of a sequence displaying a 'typical' degree of divergence from the other sequences in an alignment, while others are obviously much more divergent compared to other sequences.

Due to the errors inherent in the process of DNA sequencing, genome assembly, and gene prediction, sequence databases typically contain many sequences that share some of these problems. Therefore, as we typically begin building an MSA from sequences extracted from a database, sequences of this kind are therefore often observed in an initial MSA.

Obviously, missing sequence (e.g. missing exons) reduce the amount of information in the alignment. In particular, the more accurate phylogenetic analysis approaches almost all exclude from analysis any column that contains even just a single gap character - thus, a single 'fragment' can lead to an entire region of an alignment being excluded from the analysis, obviously causing a potentially major reduction in the precision/sensitivity of the analysis.

"Bad" sequence is perhaps even worse, as it has the potential to also reduce the accuracy of an analysis, by contributing a set of residues to the alignment that are not related to the other residues in their columns by a process of substitution.

The decision about how to deal with these issues depends on the analysis being carried out. As a general rule, if the sequences are not particularly important for the results of the analysis, one simply discards them. For example, consider an exploration of vertebrate phylogeny. This will typically include, not just vertebrate sequences, but also some additional, non-vertebrate sequences in the analysis, to fix a root on the vertebrate tree. If you have several such invertebrate sequences, and one of them is missing exons or appears to contain 'wrong' sequence, you would simply discard the sequence. However, if you are particularly interested in the position of this sequence within the tree, you will either keep the sequence in the alignment or - even better - attempt to collect the missing/correct sequence.

We will here look at a case where the initial set of sequences collected for an analysis contains a sequence containing apparently 'wrong' sequence. This sequence is from the puffer-fish Tetraodon nigroviridis, and is for the signal recognition particle 72KDa protein - we assume that the aim of the analysis is to obtain an overview of vertebrate phylogeny.

Collect an initial set of sequences related to the human SRP72KDa protein SRP72_HUMAN by obtaining this sequence via SRS, and then BLASTing this sequence against the UNIPROT database at the EBI.

Load the resulting sequences into CLUSTALX and align them automatically.

Several of the sequences in this initial alignment are fragments - remove these.

Now estimate a bootstrapped NJ tree using CLUSTAL with default settings, and display the tree using NJPLOT.

Note that in general the sequences in the tree that are from vertebrates show the expected relationships - however, the TETNG (Tetradodon nigroviridis) sequence, which should cluster as a sister group to the zebrafish sequences (BRARE), is placed as sister group to all other vertebrate sequences, with a high bootstrap value on this branch.

Looking at around 700 residues into the alignment, if the "Show Low-Scorring Residues" option is on, you can see that there is a string of around 50 amino acids that appear to be widely divergent from the other sequences. (If this you cannot find this region, then perhaps the sequence has recently been "repaired" in the databases - for the sake of this exercise, in this case load the sequence directly into the alignment from this link.)

As we mentioned, this might be due to the inclusion of non-coding DNA sequence in the gene prediction software used to obtain this protein sequence. As we imagine we are very interested in obtaining the correct sequence for this gene, we will attempt to locate the 'correct' sequence for this gene.

To do this, we will attempt to align one of the other sequences closely related to the TETNG sequence against the genomic DNA from which the TETNG sequence was predicted. We will do this using GENEWISE, software developed for addressing exactly this kind of question.

Firstly, click here collect the zebra-fish peptide sequence to use as a query against the Tetraodon genomic DNA.

Secondly, collect the genomic DNA sequence for the Tetraodon gene from the Ensembl website.

To do this, search the Tetraodon predicted proteome using BLAST. From the result page, follow the links to find the Ensembl gene that corresponds to this peptide. Follow the "Peptide Info" link from the gene page to get back to the  record related to the peptide. (Note that for Ensembl release 42, December 2006, the gene was GSTENG00024510001 - however I realise that Ensembl has updated to release 43 TODAY so who knows what we will find...)

Check using the search facility in your browser whether the peptide predicted for this gene does indeed contain the 'wrong' sequence by searching for e.g. "SGGPV" (part of the wrong sequence) - it should be there...

Follow the link from the "Gene" page to the genomic location of the gene (link should read something like "2,779,399-2,783,601")

Go to the link on the left of the page under the "export data" heading, to export the sequence as fasta (collect it as "text"), and save this sequence into a file.

Use the genewise server at the ebi to compare the zebrafish sequence to the pufferfish genomic sequence. In the output you obtain, check using the search facility of your browser whether the 'correct' sequence is in the new gene prediction e.g. using "QVAQL", a sequence found in all the zebrafish proteins in this region.

Repeat this process using the predicted TETNG gene to see if you an find where the 'wrong' sequence came from

(it looks to me as though a splice site was missed, so an exon boundary was ignored, and then a small exon was missed.)

Using a text editor, construct a file containg the 'correct' fasta sequence for the Tetraodon protein. Load this into CLUSTALX (with the previous alignment still loaded) using "Append" from the File menu.

Delete the initial Tetraodon sequence containing the "bad" sequence, and recalculate the alignment.

Estimate a new tree from this alignment using bootstrapping.

Q Has the sequence found a new, better position in the tree?

Q Have the bootstrap values for any regions of the tree improved?

I have included this exercise under the heading of "Manual edition" as it involves specific attempts to adjust regions of an alignment, and often involves "manually" creating new sequences to add to the alignment, sometimes joining together regions of sequence from a range of different sources.

Hand-Editing Alignments With SEAVIEW

Assuming you have already obtained an automaticly-calculated MSA that contains a set of sequences appropriate for your analysis, the next step is to examine the alignment in detail, to identify potential regions that are misaligned, and to alter (and hence hopefully correct) these regions.

To give you practice in this task, download this alignment of sequences that contain KH domains, and which was prepared by automatic alignment of these sequences by CLUSTALX.

Load the alignment into CLUSTALX and examine the alignment for regions that may have been misaligned.

Remember back to the presentation that was made on the topic of manually editing sequences - be aware that, when making your judgement about whether or not a region is badly aligned, you are basing your decision on your perception of how well the pattern of amino acids in the sequence region you are examining fits into the pattern observed in the other sequences in the alignment in the same region.

Remember, also, that one of the things that makes this process so difficult (and perhaps also so interesting/challenging??) is that the same amino acid e.g. glycine, can take on mamy different kinds of roles when it appears in different proteins, and in different positions in the same proteins. For example - in some proteins, possessing a particular glycine residue at a particular part of the structure, may be vital for the folding and/or function of the protein. Such a position is likely to be very highly conserved amongst all relatives of that protein, and therefore a single sequence that does not contain a glycine at in such a highly-conserved column makes one suspicious that this single sequence may be misaligned with respect to the other sequences in the alignment. However, in a different protein (or in the same protein at a different position in the sequence) a glycine might be just one of many different residues that could be used at that position, in which case oberving a different amino acid in that column

To consider another example - one positions in a structure demand an aromatic residue, which would mean a substition by anything other than a tryptophan, tyrosine, phenyalanine, or a histitdine (W, Y, F, or H) would indicate a possible misalignment. At another position, the residue may need to be polar, in which case it would likely tolerate a tyrosine or histitdine, but not a tryptophan or a phenyalanine. Thus, in the first column, seeing a W and a Y in the same column would not raise any warning signs - while in the second column, this would indicate possible mis-alignment of the sequence containing the W.

Further, note that in the examples I describe above, I consider patterns of amino acid preference that are only one amino acid long - in practice, one is often looking for a longer pattern e.g. hydrophobic - small - polar - aromatic. This search for patterns is further informed by our knowledge of the patterns associated with different types of secondary structure elements - For a summary of such features, check out the following page, which shows a colletion of alignments of protein structural elements, accompanied by the representation of these elements in the 3D protein structure and the associated PDB files for you to examine yourselves if you are interested. Currently the page includes only a few such links, but hopefully this will be expanded in the future. STRUCTURE GALLERY

As an aid to the process of identifying (and then checking whether many sequences in the alignment share) these kinds of patterns, follow this link to the, in my opinion extremelly useful, summary of the different contexts the different amino acids are likely to function in LINK TO AA PICTURE

Once you have identified potentially mis-aligned regions, look in the surrounding amino acids of these regions, in the sequence you suspect of being mis-aligned, and try and identify substrings that might match the pattern you have identified in the other sequences

You will want to try and position these regions in place of the potentially mis-aligned sequence to see if the resulting alignment is an improvement on the initial one.

Before editing the alignment, make a new copy of the initial alignment file that you will use for editing, using a name that indicates that the new copy will have been edited (I usually add "ED" before the ".aln" of the original name).

To begin editing the alignment, load the alignment into the alignment editor SEAVIEW.

While the representation of alignments in SEAVIEW is not as attractive or as configurable as in JALVIEW (a more recently-developed sequence alignment editor), I use SEAVIEW for two reasons (i) I find the interface quick and intuitive to use (ii) much more importantly, I have used it for many years, and have never experienced that it corrupted my alignment (a problem associated with several other older editors). After investing so much time in creating your alignment, you certainly do not want, at some late stage, to have errors introduced by the software into your alignment - particularly in the case where you do not even notice them occuring...!

To carry out your edits
Note - depending on whether you want to try a new alignment with the potentially mis-aligned regions shifted C or N terminal with respect to the other sequences, you will either need to select, respectively (i) the potentially mis-aligned sequences or (ii) all sequences other than the mis-aligned sequences
Once you have carried out an edit (or a few edits), save the alignment, and load it back into CLUSTALX, and use this programme to judge whether you edits have either improved, not changed, or made worse, the alignment.

Assuming there are additional changes that you identify as potentially improving the alignment, repeat this loop of loading into SEAVIEW, editing, and loading back into CLUSTALX to check the alignment (make a point of saving your edited alignments frequently, with different names (I usually just add a "1" after the "ED" I previously added to the copy of the alignment file name) - this is important as sometimes one can get into a complete mess, and one will then want to return to a recent version of the file (not begining again back at the begining!)

Something you might like to try out, is to identify groups of closely-related sequences in the alignment, and to save these groups in a group of separate files (using the profile interface of CLUSTALX) - you will aim to create a good quality alignment of each of these subsets of sequences, these alignments presumably then being easier to align against each other. Once the individual groups have been well aligned, they can be fused back together with other groups, also using the profile options in CLUSTALX - the resulting alignments can then be further edited using SEAVIEW. Note that, in general, alignment sofrware will not allow you to load multiple sequences with the same name - therefore, you need to be careful to make sure that each sequence is segreagated into one and only one file when

Once you have made all the changes you think are possible, compare your alignment to this one: hand-edited alignment of KH-domain sequences. This alignment was prepared manually, by comparison to the structure of a member of the alignment that has been solved (this can inform ones decisons about where to place gaps in the alignment of the sequence whose structure is known - gaps have a very strong tendency to occur BETWEEN rather than WITHIN secondary structure elements i.e. the localise preferentially to loops in the protein structure)

Compare the alignment you come up with to the hand-edited alignment here, to see how close you got to the ideal alignment.

Here are some other sets of sequences for you to try hand-editing. Each of these files is a high-quality manually-curated alignment based on comparison to structures - to obtain your low-quality alignment that needs improving, remove all gaps from the alignment using CLUSTALX, and then realign using your alignment software of choice. Then try and fix the resulting (presumably non-ideal) alignments and compare them to the initial alignment. You might like to try editing the several alignments of the same set of sequences that differ in terms of the alignment software you used to create them.

Extra Exercises

Here are some additional exercises for you to try out if you get through the ones above with time to spare (don't go through them in order - just pick out those that sound more interesting to you).

1) Carry out a similar search for sequence that appears to be either missing or "bad" in the following set of sequences that are taken from TreeFam TF105801 (for instance, there may be a region of the rat sequence that is "bad").

2) Additional alignments for you to try hand-editing (each of these files is a high-quality manually-curated alignment based on comparison to structures - to obtain your low-quality alignment that needs improving, remove all gaps from the alignment, and then realign using your alignment software of choice. Then try and fix the resulting (presumably non-ideal) alignments:

3) The sets of sequences you used in the hand-editing exercises above required some effort to obtain - note that none of the sequences seems to be a fragment or to contain "bad" sequence. Attempt to obtain a set of sequences similar to that in the S1 Protease dataset above - this will probably involve carrying out several searches and combing the results - you will also have to watch out for the problem of collecting the same sequences (with the same name!) twice, which you are then not able to load into the same piece of software.

This exercise will hopefully have helped you

Note that throughout these exercises the following formating is used to specify different types of text

Bold non-italic text like this gives you instructions about tasks you should carry out e.g. "View the following webpage"

Italic text specifies questions for you to answer

Back to Gibson Team course pages at EMBL.