Multiple Sequence Alignments
Exercises and Demonstrations
Structural and Sequence Alignments
Demonstration
Serine protease sequences
>5ptp
IVGGYTCGANTVPYQVSLNSGYHFCGGSLINSQWVVSAAHCYKSGIQVRLGEDNINVVEG
NEQFISASKSIVHPSYNSNTLNNDIMLIKLKSAASLNSRVASISLPTSCASAGTQCLISG
WGNTKSSGTSYPDVLKCLKAPILSDSSCKSAYPGQITSNMFCAGYLEGGKDSCQGDSGGP
VVCSGKLQGIVSWGSGCAQKNKPGVYTKVCNYVSWIKQTIASN
>1ton
IVGGYKCEKNSQPWQVAVINEYLCGGVLIDPSWVITAAHCYSNNYQVLLGRNNLFKDEPF
AQRRLVRQSFRHPDYIPLIVTNDTEQPVHDHSNDLMLLHLSEPADITGGVKVIDLPTKEP
KVGSTCLASGWGSTNPSEMVVSHDLQCVNIHLLSNEKCIETYKDNVTDVMLCAGEMEGGK
DTCAGDSGGPLICDGVLQGITSGGATPCAKPKTPAIYAKLIKFTSWIKKVMKENP
(both taken from
http://bips.u-strasbg.fr/fr/Products/Databases/BAliBASE2/ref1/test2/5ptp_ref1.html)
Initial PDB files 5ptp 1ton
Aligned PDB files 5ptp 1ton
(It's chain A in both cases)
Exercise
Suppose you are working with the following amylase sequence:
>1smd
GRTSIVHLFEWRWVDIALECERYLAPKGFGGVQVSPPNENVAIHNPFRPWWERYQPVSYK
LCTRSGNEDEFRNMVTRCNNVGVRIYVDAVINHMCGNAVSAGTSSTCGSYFNPGSRDFPA
VPYSGWDFNDGKCKTGSGDIENYNDATQVRDCRLSGLLDLALGKDYVRSKIAEYMNHLID
IGVAGFRIDASKHMWPGDIKAILDKLHNLNSNWFPEGSKPFIYQEVIDLGGEPIKSSDYF
GNGRVTEFKYGAKLGTVIRKWNGEKMSYLKNWGEGWGFMPSDRALVFVDNHDNQRGHGAG
GASILTFWDARLYKMAVGFMLAHPYGFTRVMSSYRWPRYFENGKDVNDWVGPPNDNGVTK
EVTINPDTTCGNDWVCEHRWRQIRNMVNFRNVVDGQPFTNWYDNGSNQVAFGRGNRGFIV
FNNDDWTFSLTLQTGLPAGTYCDVISGDKINGNCTGIKIYVSDDGKAHFSISNSAEDPFI
AIHAESKL
The three residues forming the active site of the enzyme are the D, E,
and D labeled in red in the sequence.
You can find out more about the sequence from its SwissProt entry P04745.
Assume further that you know the sequence below is also that of an
amylase:
>1bf2
DVIYEVHVRGFTEQDTSIPAQYRGTYYGAGLKASYLASLGVTAVEFLPVQETQNDANDVV
PNSDANQNYWGYMTENYFSPDRRYAYNKAAGGPTAEFQAMVQAFHNAGIKVYMDVVYNHT
AEGGTWTSSDPTTATIYSWRGLDNATYYELTSGNQYFYDNTGIGANFNTYNTVAQNLIVD
SLAYWANTMGVDGFRFDLASVLGNSCLNGAYTASAPNCPNGGYNFDAADSNVAINRILRE
FTVRPAAGGSGLDLFAEPWAIGGNSYQLGGFPQGWSEWNGLFRDSLRQAQNELGSMTIYV
TQDANDFSGSSNLFQSSGRSPWNSINFIDVHDGMTLKDVYSCNGANNSQAWPYGPSDGGT
STNYSWDQGMSAGTGAAVDQRRAARTGMAFEMLSAGTPLMQGGDEYLRTLQCNNNAYNLD
SSANWLTYSWTTDQSNFYTFAQRLIAFRKAHPALRPSSWYSGSQLTWYQPSGAVADSNYW
NNTSNYAIAYAINGPSLGDSNSIYVAYNGWSSSVTFTLPAPPSGTQWYRVTDTCDWNDGA
STFVAPGSETLIGGAGTTYGQCGQSLLLLISK
(Checking the SwissProt record for this sequence, P10342, will show
that it is indeed an amylase. For now, however, we imagine that we do
not know this.)
To identify possible candidate residues in the 1bf2 sequence that may
be active site residues, attempt to align the sequences using
Compare these results to a structural alignment using CE. To the right are
links to the PDB structure files 1smd and 1bf2, each of which contains just a
single chain.
- Do the sequence- and structure-based alignments identify the
same residues in 1bf2 as equivalent to the active site residues in 1smd?
- Do gaps in the structural alignment tend to be associated with
secondary structure elements (alpha helices or beta strands) or with
loop regions?
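Once you have a pairwise alignment (from either a sequence-based or a structure-based method), the residues in one sequence that are equivalent to chosen positions in the other can be read off by walking along the two aligned rows. The following is a minimal Python sketch of that bookkeeping. The function name map_positions and the toy rows are made up for illustration and are not real amylase data.

```python
def map_positions(aligned_a, aligned_b, positions_a):
    """Map 1-based residue positions in sequence A to sequence B.

    aligned_a and aligned_b are the two gapped rows of a pairwise
    alignment ('-' marks a gap). Returns a dict
    {pos_a: (residue_a, pos_b, residue_b)}; pos_b and residue_b are
    None where A's residue faces a gap in B.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have the same length")
    wanted = set(positions_a)
    mapping = {}
    i = j = 0  # residues consumed so far in A and in B
    for ca, cb in zip(aligned_a, aligned_b):
        if ca != "-":
            i += 1
        if cb != "-":
            j += 1
        if ca != "-" and i in wanted:
            mapping[i] = (ca, j, cb) if cb != "-" else (ca, None, None)
    return mapping


# Toy alignment rows (hypothetical, for illustration only).
row_a = "ACD-EFG"
row_b = "A-DKEYG"
result = map_positions(row_a, row_b, [2, 3, 5])
print(result)
```

In practice you would paste in the two rows produced by your alignment tool and pass the positions of the three active site residues of 1smd.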
You can look at an MSA including these two sequences in BaliBase, where
the sequences make up part of reference set 1pamA_ref1.
If you finish quickly, then browse the BaliBase alignments, and choose
a further family of interest.
You can download the alignments in several different formats, then load
them into CLUSTALX, remove all the gaps, and save in FASTA format to get
the ungapped sequences of the proteins represented in the family.
Search for the PDB structures that the alignments are based on using the PDB
website (the 4-character identifiers should be the same as those used in
the PDB, e.g. 1smd).
SwissProt/UniProt accession numbers are provided on the BaliBase pages.
You can now align pairs of sequences using BLAST2Sequences and/or
Smith-Waterman/Needleman-Wunsch, and compare these to the structural
alignments of the same pairs of proteins. Check whether the trends in
the placement of gaps and the ability to identify equivalent residues
are similar to those seen for the amylase sequences above.
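To see what a global (Needleman-Wunsch) alignment does under the hood, here is a minimal Python sketch. It is not what the web tools run: the scores (match +1, mismatch -1, linear gap -2) are arbitrary assumptions chosen for readability, whereas real tools normally use a substitution matrix and affine gap penalties. The two fragments are the first 12 residues of the 5ptp and 1ton sequences from the demonstration above.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty.

    Returns (aligned_a, aligned_b, score). Scoring values are
    illustrative assumptions, not those of any particular tool.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal path.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


# First 12 residues of 5ptp and 1ton (taken from the sequences above).
row_1, row_2, total = needleman_wunsch("IVGGYTCGANTV", "IVGGYKCEKNSQ")
print(row_1)
print(row_2)
print(total)
```

Changing the gap penalty and re-running is a quick way to see how strongly gap placement depends on the scoring scheme, which is worth bearing in mind when comparing against the structural alignments.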
Finding Pre-Calculated Alignments
Demonstration
If you can find a pre-calculated alignment that already contains a set
of sequences that is useful for addressing your problem, you may well
be able to save lots of time. Using the databases/resources below,
we'll look at how you can get access to alignments that you can
download yourself and examine locally.
>SRC_HUMAN|P12931|Proto-oncogene tyrosine-protein kinase Src (EC 2.7.10.2) ENSG00000197122
GSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGPSAAFAPAAAEP
KLFGGFNSSDTVTSPQRAGPLAGGVTTFVALYDYESRTETDLSFKKGERLQIVNNTEGDW
WLAHSLSTGQTGYIPSNYVAPSDSIQAEEWYFGKITRRESERLLLNAENPRGTFLVRESE
TTKGAYCLSVSDFDNAKGLNVKHYKIRKLDSGGFYITSRTQFNSLQQLVAYYSKHADGLC
HRLTTVCPTSKPQTQGLAKDAWEIPRESLRLEVKLGQGCFGEVWMGTWNGTTRVAIKTLK
PGTMSPEAFLQEAQVMKKLRHEKLVQLYAVVSEEPIYIVTEYMSKGSLLDFLKGETGKYL
RLPQLVDMAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVADFGLARLIEDNEYTA
RQGAKFPIKWTAPEAALYGRFTIKSDVWSFGILLTELTTKGRVPYPGMVNREVLDQVERG
YRMPCPPECPESLHDLMCQCWRKEPEERPTFEYLQAFLEDYFTSTEPQYQPGENL
Exercise
Using these (and perhaps other alignment resources you might find), find
a pre-calculated MSA that will be a good starting point for
building an alignment to investigate:
- the evolution of FGF genes (e.g. FGF4_HUMAN) in the early
vertebrate lineage
- variation in the 3D structure of tyrosine kinase domains within
the animals
Editing Alignments
Demonstration
BaliBase2 kinase2_ref5
realigned (with errors) by ClustalX.
We examine the alignment using ClustalX and correct some of the
mistakes using JalView.
Exercise
Try correcting the mistakes made by CLUSTALX in this alignment of a set
of hepatitis proteinases taken from BaliBase (here is the
link to the reference alignment). Use JalView to edit
the alignment.
It will help to examine the alignment using CLUSTALX (with the
Quality->Show Low-Scoring Segments option switched on to identify
potentially mis-aligned regions).
Another such example is this BaliBase
alignment of thermitases re-aligned
by CLUSTALX.
The following alignment of KH
domains was created by CLUSTALX - it contains several mis-aligned
regions.
Compare the results of your edits with this manually-curated alignment of the KH
domains.
If you want more practice of this kind, then take the following
manually-curated alignments, realign them using some automatic
alignment tool, and again try to identify mis-aligned regions, and
correct them using JalView. Compare your edits with the initial
alignment.
Automatic MSA Tools
Demonstration
As you'll do in the exercises below, here we'll look at what happens
when we re-align a reference alignment from BaliBase (1aboA_ref1)
using a range of different pieces of automatic alignment software.
Exercises
You will be told which one to take from the following list of BaliBase2
reference alignments:
- 1idy_ref1
- 1ycc_ref4
- kinase1_ref4
- 1eft_ref5
- 1pfc_ref1
- 1wit_ref1
- 1csp_ref1
- 1ldg_ref1
- sh3_ref6
- 1mrj_ref1
- 1led_ref1
- 5ptp_ref1
Download the RSF format of the alignments of interest.
Load the alignment into CLUSTALX locally, remove all gaps from all
sequences (Edit->Select all sequences, Edit->Remove all Gaps) and
save in FASTA format.
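If you prefer to strip the gaps with a script rather than through the CLUSTALX menus, the step can be sketched in a few lines of Python. This sketch assumes the alignment has already been exported as gapped FASTA (it does not read RSF), and the helper names read_fasta, degap and write_fasta are made up for this example.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def degap(records, gap_chars="-."):
    """Remove gap characters from every sequence."""
    return [(h, "".join(c for c in s if c not in gap_chars))
            for h, s in records]


def write_fasta(records, width=60):
    """Format records as FASTA text, wrapping sequences at `width`."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Toy gapped alignment (hypothetical rows, for illustration only).
aligned = """>seq1
IVGG-YTC
>seq2
IVGGY--C
"""
ungapped = write_fasta(degap(read_fasta(aligned)))
print(ungapped)
```

As with the menu route, write the result to a new file so that the reference alignment is not overwritten.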
Re-align the sequences locally using CLUSTALX (Alignment->Do
Complete Alignment) - REMEMBER TO CHOOSE A DIFFERENT NAME FOR THE
ALIGNMENT OUTFILE!
Compare the automatic alignment created by CLUSTALX with the reference
BaliBase alignment - it will help if you re-arrange the (vertical)
order of the sequences in the CLUSTALX alignment.
How many columns are mis-aligned in the CLUSTALX alignment, i.e.
columns in which residues are aligned together in the CLUSTALX alignment
but are not aligned together in the BaliBase alignment?
(Only consider those columns in the BaliBase alignment which are
underlined/in upper case - these are the ones which are assessed as
being reliably aligned according to structural comparisons)
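Counting columns by eye gets tedious, so here is one way the comparison could be sketched in Python. Each column is reduced to a "signature" recording which residue (by its index within its own sequence) each sequence contributes, and a reference column counts as reproduced only if the automatic alignment contains a column with exactly the same signature. This is a simplified column score under stated assumptions: alignments are given as dicts of name to gapped row, and the reliably aligned columns are passed in explicitly as core_columns (a hypothetical parameter) rather than being detected from upper case or underlining.

```python
def column_signatures(alignment, names):
    """Return one tuple per column: for each sequence (in `names` order),
    the 1-based index of the residue in that column, or None for a gap."""
    counters = {name: 0 for name in names}
    length = len(alignment[names[0]])
    signatures = []
    for col in range(length):
        sig = []
        for name in names:
            ch = alignment[name][col]
            if ch in "-.":
                sig.append(None)
            else:
                counters[name] += 1
                sig.append(counters[name])
        signatures.append(tuple(sig))
    return signatures


def column_score(reference, automatic, core_columns=None):
    """Count reference columns reproduced exactly in the automatic alignment.

    Returns (matched, considered). `core_columns` is a list of 0-based
    reference column indices to assess; all columns if None.
    """
    names = sorted(reference)
    if sorted(automatic) != names:
        raise ValueError("both alignments must contain the same sequences")
    ref_sigs = column_signatures(reference, names)
    auto_sigs = set(column_signatures(automatic, names))
    if core_columns is None:
        core_columns = range(len(ref_sigs))
    core = [ref_sigs[c] for c in core_columns]
    matched = sum(1 for sig in core if sig in auto_sigs)
    return matched, len(core)


# Toy alignments (hypothetical, for illustration only).
reference = {"a": "AC-DE", "b": "ACGDE"}
automatic = {"a": "ACD-E", "b": "ACGDE"}
matched, considered = column_score(reference, automatic)
print(matched, "of", considered, "reference columns reproduced")
```

Because sequences are matched by name, the vertical order of the rows in the two alignments does not matter here.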
Do the same for other alignment tools such as:
Which of the different tools creates the best alignments when
compared to the BaliBase alignment?
Is there any obvious feature of these alignments that this method of
assessing alignment quality ignores?
If you have time, take some additional BaliBase alignments and repeat
these analyses. (Note - you will make life easier for yourselves if you
focus on short alignments with only a few sequences.) Ideally, take
them from different categories of alignments, e.g. from different
BaliBase reference sets.
Is there a trend for certain tools to always be better/worse than
others, or do the results vary a lot from alignment to alignment?
Are particular types of alignments easier/harder to align correctly?
(e.g. are alignments where average identity is high more reliably
aligned?)
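To put a number on "average identity", you could compute the mean pairwise percent identity over all pairs of rows in an alignment. The Python sketch below uses one common convention as an assumption: identity is the number of identical residue pairs divided by the number of columns in which both sequences have a residue. Other conventions (for example dividing by the length of the shorter sequence) give different values.

```python
from itertools import combinations


def pairwise_identity(row_a, row_b):
    """Percent identity over columns where both rows have a residue."""
    aligned = identical = 0
    for ca, cb in zip(row_a.upper(), row_b.upper()):
        if ca in "-." or cb in "-.":
            continue  # skip columns with a gap in either row
        aligned += 1
        if ca == cb:
            identical += 1
    return 100.0 * identical / aligned if aligned else 0.0


def average_identity(rows):
    """Mean pairwise percent identity over all pairs of aligned rows."""
    pairs = list(combinations(rows, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


# Toy alignment rows (hypothetical, for illustration only).
rows = ["ACDE", "ACDF", "A-DE"]
print(round(average_identity(rows), 1))
```

Comparing this value with the column score of each automatic alignment gives a rough way to test whether high-identity families are aligned more reliably.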
Building an alignment from scratch
Demonstration
Imagine we have noticed that the protein P53_HUMAN contains a
match to a cyclin-binding site pattern, and we would like to see
whether this site is (a) conserved in other P53 sequences and (b) more
strongly conserved than adjacent sequences.
- We query the ELM resource to
confirm the match to the cyclin-binding site regular expression
- We collect sequences similar/related to P53_HUMAN using BLAST
at the NCBI
- Then we download a set of appropriate-looking sequences locally,
and align them automatically e.g. using MUSCLE
- We examine the sequences using JalView and ClustalX
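The first step above, matching a motif regular expression against a sequence, can also be reproduced locally. The Python sketch below scans a sequence for all (possibly overlapping) matches and reports 1-based positions. The pattern shown is an assumed form of the cyclin-binding motif and may not match the current ELM entry, so copy the expression from ELM before relying on it. The toy sequence is made up and is not P53.

```python
import re

# Assumed form of the cyclin-binding motif; check ELM for the current one.
CYCLIN_PATTERN = r"[RK].L.{0,1}[FYLIVMP]"


def find_motif(sequence, pattern):
    """Return (start, end, matched_text) for every match of `pattern`.

    Positions are 1-based and inclusive. A lookahead is used so that
    overlapping matches are reported as well.
    """
    regex = re.compile("(?=(" + pattern + "))")
    return [(m.start(1) + 1, m.end(1), m.group(1))
            for m in regex.finditer(sequence)]


# Toy sequence (hypothetical, for illustration only).
toy = "MKRALSFAA"
hits = find_motif(toy, CYCLIN_PATTERN)
print(hits)
```

Running the same function over each ungapped sequence in your alignment shows quickly which homologues retain a match at the corresponding position.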
Exercise
Either use BLAST (or one of the databases of multiple sequence
alignments) to address a question of your own using MSAs, or try out
one of the following exercises:
- An investigation similar to that for P53 above, but looking at LIG_Clathr_ClatBox_1
motifs in EPN1_HUMAN
- Create an alignment of Cytochrome c oxidase subunit 5A sequences
e.g. COX5A_HUMAN to
examine the phylogenetic location of birds compared to other major
vertebrate groups
Back to Gibson Team course pages at EMBL.