Introduction to Multiple Sequence Alignment (MSA)

Introductory Notes

The first exercise centers on collecting and aligning a group of proteins known as the "geminins", and then examining the resulting alignments. These proteins are important in the control of DNA replication within the cell cycle, where they block replication by preventing incorporation of the MCM complex into the pre-replication complex (pre-RC).

We will use several different approaches to collect sets of these proteins, align them with several different pieces of software, and then examine the alignments in several different viewers. This will give you experience of the mechanics of searching for sequences in several different places using different software and approaches, of loading sequences into several different automatic alignment software packages, and of visualising alignments in several different pieces of software. You should thus become more comfortable with manipulating sequence files and inputting data into different software, while also acquiring an overview of the features of different alignment-related software.

Retrieving sequences for alignment

The first step in using a multiple sequence alignment (MSA) in your analysis is to obtain a set of sequences with which to create an initial alignment. With this in mind, we will now carry out a set of exercises aimed at introducing you to some of the tools that can be used to acquire these sequences.

Retrieving sequences using Entrez at the NCBI

Entrez is a database system used by the NCBI to query its diverse set of biomedical databases. We will begin by using it to collect a set of sequences by keyword.

Begin by visiting the NCBI Entrez website.

Switch the search to be for "Proteins".

Search the database using the keyword query "geminin".

This will identify those records within the Entrez protein sequence databases that mention the word "geminin".

(As an aside, note that this interface can be useful if you want to collect sequences as specified by a set of accession numbers e.g. try with the following list of NCBI "GI"s by executing the query "10181218  7705682 20454909 20454908 3249005" against the Protein databases)
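
If you would rather script this kind of retrieval than click through the web interface, the NCBI also exposes Entrez through its E-utilities web service. The sketch below only builds the request URLs for a keyword search and for fetching a list of records in fasta format; it does not contact the server. The function names are invented for this illustration, and whether the GI numbers listed above still resolve is not checked here.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities web service.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="protein", retmax=100):
    """Build a URL that asks Entrez for the record IDs matching a keyword.

    (Hypothetical helper name; only the URL is built, nothing is downloaded.)
    """
    query = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?" + urlencode(query)


def efetch_fasta_url(ids, db="protein"):
    """Build a URL that asks Entrez for the given records as plain-text fasta."""
    query = {"db": db, "id": ",".join(ids), "rettype": "fasta", "retmode": "text"}
    return f"{EUTILS}/efetch.fcgi?" + urlencode(query)


# The keyword query and the list of identifiers used in the exercise above.
print(esearch_url("geminin"))
print(efetch_fasta_url(["10181218", "7705682", "20454909", "20454908", "3249005"]))
```

Pasting either printed URL into a browser should show what the service returns. For anything beyond a handful of requests, check the NCBI usage guidelines first.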

We want to be able to load these sequences into our sequence alignment software, and for that we need to obtain the sequences in a format that the software can interpret. The format that is probably accepted by the widest range of sequence analysis software is known as "fasta" format, so we will collect the sequences in this format.

View the sequences in fasta format by switching "Display" to "fasta"

(if you want to get more information about the sequences, you could save them instead in "genpept" format.)

Save the results into a file by switching "send to" to "file", and store this file somewhere you can find it again, with a name you can use to recognise it e.g. entrezGeminin.fasta
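
To get a feel for what a fasta file contains, it can help to read one yourself. A fasta record is a description line beginning with ">" followed by one or more lines of sequence. The following is a minimal sketch of a reader, assuming a well-formed file; the function name and the toy sequences are made up for illustration.

```python
def read_fasta(lines):
    """Yield (description, sequence) pairs from an iterable of fasta lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is not None:
        yield header, "".join(chunks)


# Toy input (not real geminin sequences).
example = """>seq1 toy protein
MNPSM
KQKQ
>seq2 another toy protein
MNLSMKQ
""".splitlines()

for description, sequence in read_fasta(example):
    print(description.split()[0], len(sequence))
```

To read the file you saved, pass an open file handle instead of the toy lines, e.g. `with open("entrezGeminin.fasta") as handle: records = list(read_fasta(handle))`.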

To look at the sequences, load them into your local version of CLUSTALX on your machine.

Open a terminal on your machine.

Type "clustalx" at the prompt and press return.

Use the "File" menu to load the downloaded sequences into CLUSTALX.

Align the sequences using the "Alignment" menu.

We will later compare this set of sequences and their alignment with the sequences and alignments collected in other ways, to allow us to compare the features of these different methods of retrieving sequences for use in an MSA.

Retrieving sequences using SRS

There are many different SRS (Sequence Retrieval System) servers available around the world, many of which could be used for the following exercises. However, we will focus on the servers made available at the EBI and EMBL, as these contain the databases we will need for these exercises. If at some point one of these servers does not appear to be responding correctly, switch over to the other server.

Go to the initial query page of the EBI SRS server
(as mentioned above, you can also try this at the EMBL SRS server, although the interface is slightly different there)

Carry out a "Quick Text Search" against "Proteins" rather than "Nucleotides", again using the query term "geminin".

Collect the records identified by this search using the "Save" box on the left side of the query page (switch output to "File", Save As to "ASCII text/table - FastaSeqs" and click save).

As for the Entrez dataset, load this file into CLUSTALX, align the sequences, and save the results.

Retrieving sequences using BLAST

A further way to collect sequences for an MSA is to use a sequence similarity search, for example with BLAST, which we will run here at the EBI.

Use the SRS server in the same way as above to query with the swissprot ID of the human geminin protein, GEMI_HUMAN.

To obtain the sequence in fasta format, click through to the record for this database entry and move to the bottom of the record, where you should find the sequence in fasta format.

Cut and paste the fasta format sequence into the EBI NCBI-BLAST webserver.

"Run BLAST" using the default settings.

From the result page "Check All" sequences, and then "Download" in fasta format. Download the resulting page to your computer, load it into CLUSTALX, and align the sequences.

Examining MSAs

We will now examine the three sets of sequences and alignments you have just obtained. We may ask some of you to present the conclusions you reach by doing these comparisons to the rest of the class, so please take notes while doing this.

To examine the alignments, load each of the alignments into CLUSTALX.

Explore the options provided in the "Colors" and "Quality" menus, as these will alter the appearance of the alignments, sometimes in ways that may help you to better assess the quality of the alignments.

Assuming that the aim of collecting these sequences is to obtain a good overview of the variation of geminin sequences, make notes about each of the alignments in turn, focusing on the following issues:

Q Does the alignment contain any sequences that appear not to be geminins?
(These would appear to be very dissimilar to the majority of sequences within the alignment, and if we want to be looking only at related sequences in an alignment, we do not want such sequences in the alignment. Note that one sign that sequences are related is that the MSA contains blocks/regions of residues that appear to be relatively similar between many different sequences - can you find any/many such blocks within these alignments?)
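
Conserved blocks can also be looked for numerically rather than by eye. The sketch below scores each column by the fraction of sequences that share its most common residue, then reports runs of well-conserved columns. It is a rough illustration only: the scoring scheme, thresholds, function names and toy alignment are all assumptions made here, and the "Quality" display in CLUSTALX uses its own measure.

```python
from collections import Counter


def column_conservation(aligned):
    """Per column: fraction of sequences sharing the most common residue.

    Gaps ("-") never count as the shared residue, so gappy columns score low.
    """
    n = len(aligned)
    scores = []
    for column in zip(*aligned):
        residues = [c for c in column if c != "-"]
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / n)
    return scores


def conserved_blocks(scores, threshold=0.75, min_length=3):
    """Return (start, end) column ranges, 0-based with end exclusive,
    in which every column scores at least `threshold`."""
    blocks, start = [], None
    for i, score in enumerate(scores + [-1.0]):  # sentinel closes a final block
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            if i - start >= min_length:
                blocks.append((start, i))
            start = None
    return blocks


# Toy alignment: four aligned sequences of equal length.
toy = ["MKLV-AGT", "MKLVQAGS", "MKLI-CGT", "MRLVQ-GT"]
scores = column_conservation(toy)
print(scores)
print(conserved_blocks(scores))
```

An alignment of unrelated sequences would be expected to give few or no blocks under any reasonable threshold, whereas a sequence that breaks up otherwise clear blocks is a candidate for closer inspection.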

Q Are there any sequences that appear to be related but highly divergent from other sequences in the alignment? If so, why do you suspect that they are "true" geminins rather than just "contamination" by unrelated sequences? Why do you think they have highly divergent sequences (the sequences might be from organisms distantly related to the others in the alignment, or from very rapidly evolving paralogs, or just be poorly sequenced)?

(To help you think about these issues, note that the sequence descriptions in the data from the EBI should include an upper case abbreviation of the organism name, built from the first three letters of the genus name and the first two of the species name, e.g. the African clawed frog, Xenopus laevis, is labeled XENLA. Some organisms with many sequences in the databases are known by a common abbreviation instead, e.g. the mouse, Mus musculus, is labeled MOUSE, humans HUMAN, cows (Bos taurus) BOVIN etc.)
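
The naming rule can be written down in a couple of lines, which is handy when you want to tally which organisms are present in a set of sequences. This is a sketch of the general pattern only (three letters of the genus plus two of the species, as in XENLA); real codes are assigned by the database curators and do not always follow it, and the function names and the small exceptions table are assumptions made for this example.

```python
# A few well-known organisms use a common abbreviation instead of the rule.
COMMON_CODES = {
    "Homo sapiens": "HUMAN",
    "Mus musculus": "MOUSE",
}


def species_code(binomial):
    """Guess the organism code for a Latin binomial such as 'Xenopus laevis'."""
    if binomial in COMMON_CODES:
        return COMMON_CODES[binomial]
    genus, species = binomial.split()[:2]
    return (genus[:3] + species[:2]).upper()


def code_from_entry_name(entry_name):
    """Extract the organism part of an entry name such as 'GEMI_HUMAN'."""
    return entry_name.rsplit("_", 1)[1]


print(species_code("Xenopus laevis"))      # XENLA
print(species_code("Mus musculus"))        # MOUSE
print(code_from_entry_name("GEMI_HUMAN"))  # HUMAN
```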

Q Does the alignment contain sequences from a wide range of different organisms? (this is what one would be hoping for if the aim was to obtain a broad overview of the sequence variation within a family)

Q Does the alignment contain lots of sequences that are very similar to each other?
(If there are too many very similar sequences present in an alignment it can make it harder to view the entire alignment at once, and can cause us to lose our overview)
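
One way to put a number on "very similar" is the percent identity between each pair of aligned sequences. The sketch below counts identical residues over the columns where neither sequence has a gap; other definitions divide by the alignment length or by the shorter sequence, so treat the choice of denominator, the 90% cutoff, and the function names as assumptions of this example.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percent identical residues over columns where neither sequence is gapped."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def similar_pairs(alignment, cutoff=90.0):
    """List (name1, name2, identity) for pairs at or above the cutoff.

    `alignment` maps sequence names to aligned sequences of equal length.
    """
    hits = []
    for (n1, s1), (n2, s2) in combinations(alignment.items(), 2):
        identity = percent_identity(s1, s2)
        if identity >= cutoff:
            hits.append((n1, n2, identity))
    return hits


# Toy alignment: "a" and "b" are identical, "c" is divergent.
toy = {"a": "MKLVQAGT", "b": "MKLVQAGT", "c": "MRLI-CGS"}
print(similar_pairs(toy))
```

Pairs reported here are candidates for being redundant records; sequences that fall well below the others against every partner are the ones to examine as possible non-geminins or divergent homologs.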

Q Why are these sequences so similar to each other?
One reason might be that the sequences come from different records for the same gene. One way to check this is to see whether apparently identical sequences come from the same organism. One way to avoid this situation is to carry out your queries against non-redundant databases that attempt to maintain only a single record per transcript, e.g. SWISSPROT.

Having looked at each of the alignments in this way, decide which you find the most useful, and why.

Suggest ways it might be possible to improve the alignments to more accurately provide an overview of geminin diversity (perhaps you would alter the queries? discard some of the sequences in the alignment?)

Note that collecting a set of sequences is almost always only the first step in arriving at the final set of sequences in an alignment you then go on to use in your analysis - one will often want to discard some very similar sequences, some very dissimilar sequences, fragmented sequences etc. However, if one begins with a set of sequences fairly close to what one is looking for, with well-annotated description lines, this can save lots of time.
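
Some of this pruning can be done before alignment. As a minimal sketch, the function below drops exact duplicate sequences and sequences much shorter than the median length, which are likely to be fragments. The 50% length cutoff, the function name and the toy records are assumptions for illustration; deciding what counts as "too similar" or "too dissimilar" still needs your own judgement.

```python
from statistics import median


def filter_sequences(records, min_length_fraction=0.5):
    """Drop exact duplicates and likely fragments.

    `records` is a list of (description, unaligned sequence) pairs.
    A sequence is treated as a fragment if it is shorter than
    `min_length_fraction` times the median sequence length.
    """
    if not records:
        return []
    cutoff = min_length_fraction * median(len(seq) for _, seq in records)
    seen, kept = set(), []
    for description, seq in records:
        if seq in seen:
            continue  # identical to a sequence already kept
        if len(seq) < cutoff:
            continue  # probably a fragment
        seen.add(seq)
        kept.append((description, seq))
    return kept


# Toy records: "b" duplicates "a", "c" is a short fragment.
toy = [("a", "MKLVQAGT"), ("b", "MKLVQAGT"), ("c", "MKL"), ("d", "MRLIQCGS")]
print([name for name, _ in filter_sequences(toy)])
```

Only the first record of each duplicate group is kept, so if the description lines differ in quality it is worth ordering the input so that the best-annotated record comes first.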

Comparing automatic MSA software

You should also note that using different automatic MSA software on the same set of sequences will almost always result in a different MSA - additionally, using different parameters for the same software will also usually give different alignments.

To get some feeling for the variation in the results obtained from different pieces of software, use the following sequence alignment (of a set of S1 proteases) to examine the relative performance of CLUSTALX, MAFFT, MUSCLE, and PROBCONS (four of the most widely-used software packages of this type) at aligning these sequences.

For each of these programmes, carry out an automatic alignment of these sequences (follow the instructions on the "How To" page).
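
The "How To" page remains the reference for running these programmes on the course machines. For orientation only, the sketch below lists typical command lines for running each aligner non-interactively. It assumes ClustalW 2.x (as the command-line counterpart of CLUSTALX) and MUSCLE 3.x (later MUSCLE versions use different options), that the executables are installed under these names, and a hypothetical input file called s1_proteases.fasta. Nothing is executed unless you call `run` yourself.

```python
import shutil
import subprocess


def alignment_commands(infile):
    """Return {name: (command, stdout_file)} for four aligners.

    `stdout_file` is None when the programme writes its own output file,
    otherwise the alignment is printed to standard output and must be captured.
    All output file names are hypothetical.
    """
    return {
        "clustalw2": (["clustalw2", f"-INFILE={infile}", "-ALIGN",
                       "-OUTPUT=FASTA", "-OUTFILE=clustalw.fasta"], None),
        "mafft": (["mafft", "--auto", infile], "mafft.fasta"),
        "muscle": (["muscle", "-in", infile, "-out", "muscle.fasta"], None),
        "probcons": (["probcons", infile], "probcons.fasta"),
    }


def run(name, command, stdout_file=None):
    """Run one aligner if its executable can be found; return True on success."""
    if shutil.which(command[0]) is None:
        print(f"{name}: executable not found on PATH, skipping")
        return False
    if stdout_file is None:
        subprocess.run(command, check=True)
    else:
        with open(stdout_file, "w") as out:
            subprocess.run(command, stdout=out, check=True)
    return True


# Print the commands rather than running them.
for name, (command, stdout_file) in alignment_commands("s1_proteases.fasta").items():
    redirect = f" > {stdout_file}" if stdout_file else ""
    print(" ".join(command) + redirect)
```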

Compare these alignments with each other, and also with this hand-edited version of the alignment.

Q Which programme gives the best alignment?

When considering the quality of the alignments, think about:

Comparing Sequence Examination Software

There are many different pieces of software used to visualise alignments. Taking just one of the geminin alignments, compare the way that three different software packages do this (CLUSTALX, SEAVIEW, and JALVIEW).

Consider again the set of questions provided above for examining the quality of an alignment. This time, however, ask yourself how well each of these pieces of alignment visualisation software allows you to answer them.

Q Which features of the different visualisation software packages do you like and dislike?

Extra Exercises

Here are some additional exercises for you to try out if you get through the ones above with time to spare...

1) Evaluate the default settings of CLUSTAL, MUSCLE, MAFFT and PROBCONS using the two alignments below. These offer specific difficult problems to the alignment software - in the first there is one sequence that is missing a portion in the middle - in the second there is one sequence much more divergent than the others.
Q Is the same software consistently the best performer amongst these different tests?

2) Repeat exercise (1) above, but run the alignment software using a range of different parameters - aim to identify the parameters that give the best results.

3) If you have experience in coding, try using BioPerl (or normal Perl, or Python, or...) to write a script that will allow you to compare two alignments with each other, measuring how many columns they have which are the same.
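
If you are unsure where to start with exercise (3), one possible approach (in Python here, and by no means the only one) is to describe each column by which residue of each sequence it contains: the first residue of sequence 1, the third residue of sequence 2, a gap in sequence 3, and so on. Two columns from different alignments are then "the same" if these descriptions match. The sketch assumes both alignments contain the same sequences in the same order; the function names and toy alignments are made up for illustration.

```python
def column_signatures(aligned):
    """Describe each column as a tuple of residue positions (None for a gap).

    Position i in a tuple refers to sequence i; the number is the index of
    the residue within that ungapped sequence.
    """
    counters = [0] * len(aligned)
    signatures = []
    for column in zip(*aligned):
        signature = []
        for i, char in enumerate(column):
            if char == "-":
                signature.append(None)
            else:
                signature.append(counters[i])
                counters[i] += 1
        signatures.append(tuple(signature))
    return signatures


def shared_columns(alignment_a, alignment_b):
    """Count columns of alignment_a that also occur in alignment_b.

    Columns consisting only of gaps are ignored.
    """
    in_b = set(column_signatures(alignment_b))
    return sum(
        1
        for signature in column_signatures(alignment_a)
        if signature in in_b and any(p is not None for p in signature)
    )


# Two toy alignments of the same three sequences (MKLV, MKV, MKLV);
# they differ only in where the gap in the second sequence is placed.
alignment_a = ["MKLV", "MK-V", "MKLV"]
alignment_b = ["MKLV", "M-KV", "MKLV"]
print(shared_columns(alignment_a, alignment_b), "of", len(alignment_a[0]))
```

Extending this to read the alignments from files, and to report which columns differ rather than just how many, is left as part of the exercise.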

This set of exercises should have:

Note that throughout these exercises the following formatting is used to specify different types of text:

Bold non-italic text like this gives you instructions about tasks you should carry out e.g. "View the following webpage"

Italic text specifies questions for you to answer
