Introduction to Multiple Sequence Alignment (MSA)
Introductory Notes
The first exercise centers on collecting, aligning, and examining
a group of proteins known as the "geminins". These
proteins are important in the control of DNA replication within the
cell cycle, where they inhibit replication by preventing
incorporation of the MCM complex into the pre-replication complex
(pre-RC).
We will use several different approaches to collect sets of these
proteins, align them with several different programs, and then examine
the alignments in several different viewers. This will give you
experience of the mechanics of searching for sequences in different
places, of loading sequences into several automatic alignment software
packages, and of visualising alignments. By the end you should be more
comfortable with manipulating sequence files and supplying data to
different software, and should have an overview of the features of
different alignment-related software.
Retrieving sequences for alignment
The first step in using a multiple sequence alignment (MSA) in your
analysis is to obtain a set of sequences with which to create an
initial alignment. With this in mind, we will now carry out a set of
exercises aimed at introducing you to some of the tools that can be
used to acquire these sequences.
Retrieving sequences using Entrez at the NCBI
Entrez is a database system used by the NCBI to query its diverse set
of biomedical databases. We will begin by using it to collect a set of
sequences by keyword.
Begin by visiting the NCBI Entrez website.
Switch the search to be for "Proteins".
Search the database using the keyword query "geminin".
This will identify those records within the Entrez protein sequence
databases that mention the word "geminin".
(As an aside, note that this interface is also useful if you want to
collect sequences specified by a set of accession numbers, e.g. try
the following list of NCBI "GI"s by executing the query
"10181218 7705682 20454909 20454908 3249005" against the Protein
databases.)
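The keyword and identifier queries above can also be written as requests to the NCBI E-utilities web service. The sketch below only builds the request URLs and does not contact the server. The helper names are illustrative, and it assumes that the esearch and efetch endpoints still accept the query term and GI numbers used in this exercise.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities web service.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def keyword_search_url(term, db="protein", retmax=100):
    """Build an esearch URL listing record IDs that match a keyword."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def fasta_fetch_url(ids, db="protein"):
    """Build an efetch URL returning the given records in FASTA format."""
    params = {"db": db, "id": ",".join(ids), "rettype": "fasta", "retmode": "text"}
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


# The GI list from the aside above, split into individual identifiers.
gi_list = "10181218 7705682 20454909 20454908 3249005".split()
print(keyword_search_url("geminin"))
print(fasta_fetch_url(gi_list))
```

Opening either printed URL in a browser should show the raw response, which can be a quick way to check a query before saving anything to disk.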
We want to be able to load these sequences into our sequence alignment
software, and for that we need to obtain the sequences in a format that
the software can interpret. Probably the most widely supported format
among sequence analysis software is "fasta" format, so we will collect
the sequences in this format.
View the sequences in fasta format by switching "Display" to "fasta"
(if you want to get more information about the sequences, you could
save them instead in "genpept" format.)
Save the results into a file by switching "send to" to "file", and
store this file somewhere you can find it again, with a name you can
use to recognise it e.g. entrezGeminin.fasta
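A fasta file is plain text: each record starts with a ">" description line, followed by one or more lines of sequence. As a rough check that a download worked, a minimal reader could look like the sketch below. The function name and the two example records are made up for illustration, and the reader assumes a well-formed file.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up records used purely to illustrate the format.
example = """>seq1 hypothetical example protein
MKTAYIAKQR
QISFVKSHFS
>seq2 another hypothetical example
MKTAYIAKQQ
"""
records = list(read_fasta(example.splitlines()))
for header, seq in records:
    print(header.split()[0], len(seq))
```

To read the saved file instead, pass an open file handle, for example `with open("entrezGeminin.fasta") as handle: records = list(read_fasta(handle))`.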
To look at the sequences, load them into the copy of CLUSTALX installed
on your machine.
Open a terminal on your machine.
Type "clustalx" at the prompt and press return.
Use the "File" menu to load the downloaded sequences into CLUSTALX.
Align the sequences using the "Alignment" menu.
We will later compare this set of sequences and their alignment with
the sequences and alignments collected in other ways, to allow us to
compare the features of these different methods of retrieving sequences
for use in an MSA.
Retrieving sequences using SRS
There are many different SRS (Sequence Retrieval System) servers
available around the world, many of which could be used for the
following exercises. However, we will focus on the servers made
available at the EBI and EMBL, as these contain the databases we will
need for these exercises. If at some point one of these servers does
not appear to be responding correctly, switch over to the other
server.
Go to the initial query page of the EBI SRS server
(as mentioned above, you can also try this at the EMBL SRS server,
although the interface is slightly different there).
Carry out a "Quick Text Search" against "Proteins" rather than
"Nucleotides", again using the query term "geminin".
Collect the records identified by this search using the "Save" box
on the left side of the query page (switch output to "File", Save As to
"ASCII text/table - FastaSeqs" and click save).
As for the Entrez dataset, load this file into CLUSTALX, align the
sequences, and save the results.
Retrieving sequences using BLAST
A further way to collect sequences for an MSA is to use sequence
similarity searches, for example with BLAST, which we will run here at
the EBI.
Use the SRS server in the same way as above to query with the
Swiss-Prot ID of the human geminin protein, GEMI_HUMAN.
To obtain the sequence in fasta format, click through to the record for
this database entry and move to the bottom of the record, where you
should find the sequence in fasta format.
Copy and paste the fasta format sequence into the EBI NCBI-BLAST webserver.
"Run BLAST" using the default settings.
From the result page "Check All" sequences, and then "Download" in
fasta format. Download the resulting page to your computer, load it
into CLUSTALX, and align the sequences.
Examining MSAs
We will now examine the three sets of sequences and alignments you have
just obtained. We may ask some of you to present the conclusions
you reach by doing these comparisons to the rest of the class, so
please take notes while doing this.
To examine the alignments, load each of the alignments into CLUSTALX.
Explore the options provided in the "Colors" and "Quality" menus, as
these will alter the appearance of the alignments, sometimes in ways
that may help you to better assess the quality of the alignments.
Assuming that the aim of collecting these sequences is to obtain a good
overview of the variation of geminin sequences, make
notes about each of the alignments in turn, focusing on the following
issues:
Q Does the alignment contain any sequences that appear not to be
geminins?
(These would appear to be very dissimilar to the majority of
sequences within the alignment, and if we want to be looking only at
related sequences in an alignment, we do not want such sequences in the
alignment. Note that one sign that sequences are related is that the
MSA contains blocks/regions of residues that appear to be relatively
similar between many different sequences - can you find any/many such
blocks within these alignments?)
Q Are there any sequences that appear to be related but highly
divergent from other sequences in the alignment? If so, why do you
suspect that they are "true" geminins rather than just "contamination"
by unrelated sequences? Why do you think they have highly divergent
sequences (the sequences might be from organisms distantly related to
the others in the alignment, or from very rapidly evolving paralogs,
or just be poorly sequenced)?
(To help you think about these issues, note that the data from the EBI
should include in its sequence description an upper-case abbreviation
of the genus and species name of the source organism, made from the
first three letters of the genus name and the first two of the species
name, e.g. the African clawed frog, Xenopus laevis, is labeled XENLA.
Some organisms with many sequences in the databases are instead known
by a common abbreviation, e.g. the mouse, Mus musculus, is labeled
MOUSE, humans HUMAN, cows (Bos taurus) BOVIN etc.)
Q Does the alignment contain sequences from a wide range of different
organisms? (This is what one would be hoping for if the aim was to
obtain a broad overview of the sequence variation within a family.)
Q Does the alignment contain lots of sequences that are very similar
to each other?
(If there are too many very similar sequences present in an
alignment it can make it harder to view the entire alignment at once,
and can cause us to lose our overview.)
Q Why are these sequences so similar to each other?
One reason for this might be
that the sequences are from different records for the same gene - one
way to check this would be to consider which organisms the sequences
come from, so check to see whether apparently identical sequences come
from the same organism. One way to avoid this situation is to carry out
your queries against non-redundant databases that attempt to maintain
only a single record per transcript e.g. SWISSPROT.
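Two of the checks above, which organisms are represented and whether any sequences are exact duplicates, are easy to automate once the sequences have been read in. The sketch below assumes that each description line contains a UniProt-style entry name such as GEMI_HUMAN, which may not hold for every download. The function names, accession-like labels, and sequences are invented for illustration.

```python
import re
from collections import Counter, defaultdict

# Assumed header layout: a UniProt-style entry name such as GEMI_HUMAN
# appears somewhere in the description line.
ENTRY_NAME = re.compile(r"\b[A-Z0-9]{1,10}_([A-Z0-9]{1,5})\b")


def organism_code(header):
    """Return the organism code from a header, or None if none is found."""
    match = ENTRY_NAME.search(header)
    return match.group(1) if match else None


def summarise(records):
    """Tally organism codes and group headers sharing an identical sequence."""
    species = Counter(organism_code(header) for header, _ in records)
    by_sequence = defaultdict(list)
    for header, seq in records:
        by_sequence[seq.upper()].append(header)
    duplicates = [hs for hs in by_sequence.values() if len(hs) > 1]
    return species, duplicates


# Invented records for illustration only; these are not real geminin entries.
records = [
    ("sp|X00001|GEMI_HUMAN example", "MNPSMKQKQE"),
    ("tr|X00002|X00002_HUMAN example", "MNPSMKQKQE"),
    ("sp|X00003|GEMI_XENLA example", "MNTNKKQRLD"),
]
species, duplicates = summarise(records)
print(species)      # how many sequences carry each organism code
print(duplicates)   # groups of headers with identical sequences
```

Identical sequences from the same organism code are good candidates for redundant records of one gene. Near-identical sequences are not caught by this exact-match test and still need to be judged in the alignment itself.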
Having looked at each of the alignments in this way, decide which you
find the most useful, and why.
Suggest ways it might be possible to improve the alignments so that
they provide a more accurate overview of geminin diversity (perhaps
you would alter the queries, or discard some of the sequences in the
alignment?).
Note that collecting a set of sequences is almost always only the first
step in arriving at the final set of sequences in an alignment you then
go on to use in your analysis - one will often want to discard some
very similar sequences, some very dissimilar sequences, fragmented
sequences etc. However, if one begins with a set of sequences fairly
close to what one is looking for, with well-annotated description
lines, this can save lots of time.
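Discarding fragmented sequences is one of the pruning steps that can be partly automated. The sketch below keeps only sequences whose length is at least a chosen fraction of the median length. The function name and the default threshold of 0.5 are arbitrary assumptions for illustration, not recommended values, and a short sequence is not necessarily a fragment.

```python
from statistics import median


def drop_fragments(records, min_fraction=0.5):
    """Keep (header, sequence) pairs at least min_fraction of the median length."""
    cutoff = min_fraction * median(len(seq) for _, seq in records)
    return [(header, seq) for header, seq in records if len(seq) >= cutoff]


# Invented records: three full-length sequences and one short fragment.
records = [
    ("full_a", "M" * 200),
    ("full_b", "M" * 210),
    ("full_c", "M" * 190),
    ("fragment", "M" * 60),
]
kept = drop_fragments(records)
print([header for header, _ in kept])
```

Any sequence removed this way is worth a quick look before it is discarded for good, since a genuinely short family member would be filtered out as well.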
Comparing automatic MSA software
You should also note that using different automatic MSA software on the
same set of sequences will almost always result in a different MSA.
Additionally, using different parameters for the same software will
also usually give different alignments.
To get some feeling for the variation in the results obtained
from different pieces of software, use the following sequence
alignment (of a set of S1 proteases) to examine the relative
performance of CLUSTALX, MAFFT, MUSCLE, and PROBCONS (four of the most
widely used software packages of this type) at aligning these
sequences.
For each of these programmes, carry out an automatic alignment of these
sequences (follow the instructions on the "How To" page).
Compare these alignments with each other, and also with this hand-edited version of the alignment.
Q Which programme gives the best alignment?
When considering the quality of the alignments, think about:
- How many columns the calculated alignments have that are the same
as in the reference alignment
- Whether there are any sequences that are clearly mis-aligned for
the majority of their length
Comparing Sequence Examination Software
There are many different pieces of software used to visualise
alignments. Taking just one of the geminin alignments, compare the way
that three different software packages do this (CLUSTALX, SEAVIEW, and
JALVIEW).
Consider again the set of questions provided above for examining the
quality of an alignment. This time, however, ask yourself how well each
piece of alignment visualisation software allows you to answer them.
Q Which features of the different visualisation packages do you like
and dislike?
Extra Exercises
Here are some additional exercises for you to try out if you get
through the ones above with time to spare...
1) Evaluate the default settings of CLUSTAL, MUSCLE, MAFFT and PROBCONS
using the two alignments below. These pose specific, difficult problems
for the alignment software: in the first, one sequence is missing a
portion in the middle; in the second, one sequence is much more
divergent than the others.
- Alignment containing fragment
- Alignment containing divergent sequence
Q Is the same software consistently the best performer across these
different tests?
2) Repeat exercise (1) above, but run the alignment software using a
range of different parameters - aim to identify the parameters that
give the best results.
3) If you have experience in coding, try using BioPerl (or normal Perl,
or Python, or...) to write a script that will allow you to compare two
alignments with each other, measuring how many columns they have which
are the same.
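As one possible starting point for exercise (3), the sketch below is written in plain Python. It treats a column as "the same" in two alignments when it places exactly the same residue of every sequence together. It assumes that both alignments are already loaded as dictionaries mapping sequence names to gapped strings, that "-" is the only gap character, and that both contain the same sequences. The function names and toy alignments are invented for illustration.

```python
def column_signatures(alignment):
    """Describe each column by the residue number it holds for every sequence.

    Gaps ("-") are recorded as None, and all-gap columns are ignored.
    """
    names = sorted(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("aligned sequences must all have the same length")
    counters = {name: 0 for name in names}  # residues seen so far per sequence
    signatures = set()
    for col in range(length):
        signature = []
        for name in names:
            if alignment[name][col] == "-":
                signature.append(None)
            else:
                signature.append(counters[name])
                counters[name] += 1
        if any(pos is not None for pos in signature):
            signatures.add(tuple(signature))
    return signatures


def shared_column_fraction(test, reference):
    """Fraction of reference columns reproduced exactly in the test alignment."""
    if set(test) != set(reference):
        raise ValueError("alignments must contain the same sequence names")
    ref_cols = column_signatures(reference)
    return len(ref_cols & column_signatures(test)) / len(ref_cols)


# Toy alignments, invented for illustration.
reference = {"a": "MKT-AY", "b": "MKTQAY", "c": "MR--AY"}
test = {"a": "MKTA-Y", "b": "MKTQAY", "c": "MR-A-Y"}
print(shared_column_fraction(test, reference))
```

Comparing residue numbers rather than the letters in each column avoids counting two columns as equal just because they happen to contain the same amino acids. This whole-column measure is strict: a single misplaced residue makes the column count as different.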
This set of exercises should have:
- demonstrated (and given you experience of) different ways of
obtaining sets of sequences for MSA analysis using keyword and sequence
similarity searches
- given you experience of what to look out for when considering the
quality of an MSA
- demonstrated that different automatic alignment software packages
tend to provide different results when applied to the same set of
sequences
Note that throughout these exercises the following formatting is used
to specify different types of text:
Bold non-italic text like this gives you instructions about tasks you
should carry out e.g. "View the following webpage"
Italic text specifies questions for you to answer
Back to Gibson Team course pages at EMBL.