Using CLUSTALX



Introduction

CLUSTALX provides a graphical user interface to the CLUSTALW multiple sequence alignment (MSA) algorithm.

The original publications describing CLUSTALX and CLUSTALW are some of the highest-cited papers ever - when they were released they provided a major increase in the accuracy and speed with which users could build MSAs, with the focus on protein sequences - in the case of CLUSTALX additionally packaged in an easy-to-use graphical user interface (GUI) - leading to their great popularity.

There are now several other pieces of MSA software that are, for many sets of sequences and analyses, a better choice than CLUSTALX if your aim is to obtain an MSA of a set of protein sequences - and certainly if you are aligning DNA sequences (something CLUSTALW/X has never focused on).

However, despite this, CLUSTALX remains useful - partly because in some circumstances it still does a good job at aligning sequences (judged by todays standards) - but perhaps most importantly due to several features of the GUI that provide easy-to-use interfaces to several key tasks that are needed in many cases to build a good MSA.

The notes supplied here focus on some of the more important of these additional features.

Documentation and tutorials

The documentation file for CLUSTALX is available online from the University of Strasbourg.

Several articles have been published that provide tutorials and advice on using CLUSTALX - a list of them can be found here on the CLUSTAL homepage.

This is a link to a PDF describing a CLUSTALX tutorial as part of the Molecular Systematics and Evolution workshop run annually in Rio de Janeiro, Brazil.

This is a link to Jacques van Helden's phylogenetics tutorial that includes sequence alignment using CLUSTALX.

Getting CLUSTALX

CLUSTALX can be downloaded here from the University of Strasbourg.

It is also available here from the CLUSTAL home page

A web interface to CLUSTALW can be found here at the EBI

Overview of CLUSTALX Graphical User Interface (GUI)


The image above shows the main features of the CLUSTALX GUI.
The image below shows CLUSTALX in Profile Alignment Mode, where the sequences in the alignment are divided between an upper ("Profile 1") and a lower ("Profile 2") Alignment Display Areas, each accompanied by its own Sequence Identifier Panel, Consensus Line, Alignment Ruler and Alignment Conservation Display Area

Opening a sequence file

Using the CLUSTALX menu bar:

File->Load Sequences

Then select the MSA file on your local file system that you want to load into CLUSTALX.

All gaps need to be coded as "-" characters. Some sources of pre-calculated alignments code gaps as "." characters - if you are working with such an alignment, you will need to replace all "." characters by "-", which should be easy using a text editor's Replace All functionality.

Note also that CLUSTALX accepts sequence identifiers containing a reasonably diverse set of characters, but the text shown in the Sequence Identifier Panel is only the first word of the descriptor provided in the input file i.e. the identifier is shown only until the first whitespace in the input file sequence identifiers.

Writing MSA output files - changing the format of an MSA

CLUSTALX can write an MSA to an output file in several different formats.

To specify the format of an output file use the Menu Bar:

File->Save Sequences as...

In the window that appears, in the format field, check the boxes next to the formats you want to use for your output file(s) - if you select several boxes, then CLUSTALX will write multiple files, one for each of the different formats you specify. The names of these files are specified using the path provided in the Save Sequences As: (File extension will be appended) text box, with appropriate extensions as described below to indicate which file constrains the alignment in which format.

To save the alignment, click the "OK" button.

IMPORTANT NOTE!
If you load a file into CLUSTALX, modify the alignment in some way, and then save the modified alignment in the same format as the file you loaded into the software, by default CLUSTALX will overwrite your initial alignment file with the modified alignment.

Often you will want to preserve the initial alignment file!

Therefore, either initially load into CLUSTALX a COPY of the file you want to modify, or be careful to change the name of the file you save the modified alignment into.

For example, in the image shown below, CLUSTALX will write two files (test_alignment.aln in CLUSTAL format and test_alignment.fasta in FASTA format) in the directory /home/myHome/Desktop



As CLUSTALX accepts as input and writes as output MSAs in several different formats, it can function as an MSA sequence format convertor - loading an alignment in one format into CLUSTALX, and saving it in a different format.

Given the diversity of MSA formats available, the user-friendliness of the CLUSTALX interface, and the relatively robust input sequence format parser of CLUSTALX (e.g. it allows long identifier names containing special characters such as "|", unlike much MSA software), CLUSTALX is often used simply to carry out this fairly basic yet important operation.

Saving alignments in PHYLIP format

As outlined above, CLUSTALX can be used to convert alignment formats.

Much software that processes MSAs in a phylogenetic context accepts the MSAs as input in PHYLIP format.

Additionally, much of this software only accepts PHYLIP format where sequence identifiers are restricted in length to a maximum of 10 characters.

Therefore, to maintain compatibility with such software, CLUSTALX truncates the sequence identifiers used in writing PHYLIP format alignment files to a maximum of 10 characters - this is done by discarding all by the first 10 characters from the sequence identifiers supplied in the input MSA file.

In some cases this can cause problems, as most MSA/phylogenetics software requires that the identifiers used in their input alignments are unique - i.e. the software will not accept alignments where the same identifier is associated with more than one sequence - and following such truncation, in some cases the sequence identifiers are no longer unique.

Therefore, before using CLUSTALX to export to PHYLIP format, it is usually a good idea to edit a FASTA format version of the alignment file using a text editor so that all sequence identifiers

Identifying poorly-aligned regions of an MSA - using the "Show Low-Scoring Segments" option

CLUSTALX offers several different ways to assess the quality of an alignment in a given region.

One of these (Show Low-Scoring Segments) is particularly useful for this - indeed, it provides one of the easiest-to-use and powerful methods of identifying potentially mis-aligned regions of an MSA that is available generally.

Given that assessing the quality of an MSA is one of the key steps in the process of building a good MSA, this is one of the most important features provided by CLUSTALX.

This feature is activated using the Menu Bar:

Quality->Show Low-Scoring Segments

To show how good this option is at highlighting mis-aligned regions, see the images below.

The first is an alignment that is judged to be good (the BaliBase2 1cpt_ref2 reference alignment) with Show Low-Scoring Segments switched on - this causes some regions of the alignment (those with low-scoring segments, as you might expect!) to be highlighted in black) - this link leads to this alignment in FASTA format.



The next image shows the above alignment, but with a gap introduced in the y08q_myctu sequence to create a misalignment - this link leads to the alignment containing the manually-introduced gap.

The images below show this alignment both with and without Show Low-Scoring Segments switched on




The mis-alignment of the y08q_myctu sequence is clearly much more obvious with this option switched on.

Thus, scanning an alignment to check its quality - and to identify regions that might benefit from manual adjustment/editing of the alignment, are best carried out in CLUSTALX with this option switched on.

Note that, if you alter the alignment in some way - for example by changing the position of one of the sequences - the highlighting due to this option will disappear - to bring it back simply reapply the option.

Removing/deleting sequences from an alignment

Select a sequence by left-clicking on its sequence identifier the Sequence Identifier Panel.

You can select multiple sequences by:
To remove the selected sequences from the alignment:
Edit->Cut Sequences

Adding (and aligning) sequences to a pre-existing alignment

When building MSAs, we often encounter the situation that we have a set of sequences that are well-enough aligned for our purposes, and we want to then include additional sequences in that alignment i.e. align these additional sequences to the alignment.

For example, you may have been using an alignment for some time, and have identified new members of the family represented in the alignment, and want to build one alignment combining the old alignment and the new sequences together, while preserving the organisation of the old alignment.

Alternatively, you might have obtained an alignment from one sources e.g. TreeFam, and want to integrate that with related sequences obtained from another source e.g. from a BLAST search of the UniProt database.

CLUSTALX can align a set of sequences against an existing MSA by:
  1. Loading the pre-calculated MSA into CLUSTALX
  2. Switching to Profile Alignment Mode
  3. Loading the file containing the unaligned sequences using
  4. Align the contents of the unaligned sequence file to the initial MSA using
  5. Assuming you do not want to overwrite the contents of the initial alignment file BE SURE TO CHOOSE A DIFFERENT NAME FOR THE OUTPUT FILE THAT WILL CONTAIN THE NEW ALIGNMENT!!
  6. Return to Multiple Alignment Mode to save the newly-created alignment to a file

Applying the CLUSTAL alignment algorithm

While, in many cases, more recently developed MSA software outperforms CLUSTALX, there are still occasions where it is convenient use CLUSTALX to perform an alignment (for example, if you prefer working with a GUI, are offline, and are working with a relatively small number of relatively similar sequences)
  1. Load a set of sequences (or an existing alignment) into CLUSTALX
  2. NOTE that, if presented by a set of pre-aligned sequences, the CLUSTAL algorithm incorporates information about gap positions present in this initial alignment when calculating a new alignment. Thus, you may want, before calculating the alignment, to begin by running
  3. Alignment->Do Complete Alignment
  4. Specify a different name for the output file as that offered by default, so as not to overwrite the contents of your initial file, and click OK to run the alignment

Estimating phylogeny using CLUSTALX

CLUSTALX can also be used to estimate reasonably quickly and inaccurately a phylogeny from a set of aligned sequences.

If you have any serious interest in the phylogeny of the sequences, you should CERTAINLY use an alternative method to estimate the phylogeny e.g. using RAxML, given the inaccuracy of the phylogeny estimates provided by CLUSTALX.

However, the trees provided by CLUSTALX are useful for providing a first quick overview of the phylogeny/similarities between a set of  sequences.

Assuming there are a reasonable number of columns in the alignment which do not contain any gaps (and if this isn't the case, you should think carefully about whether it makes any sense to attempt to estimate a tree from the sequences), use the following steps to estimate a tree from the alignment:
  1. Trees->Correct for Multiple Substitutions
  2. Trees->Exclude Positions with Gaps
  3. Trees->Draw Tree
  4. click OK
The resulting tree can be examined using NJplot or other tree viewing software.

Preserving input sequence order during alignment

When aligning a set of sequences with CLUSTALX, by default the software orders the sequences in the alignment so that more-similar sequences are closer to each other than more divergent to each other. This makes a lot of sense, as it makes it easier for the user to visually identify trends in patterns of similarity between the sequences.

In some cases, however, you want to compare the results of an alignment created by CLUSTALX with another alignment - in which case it is convenient to keep the order of the sequences in the alignment the same as in the input file.

This can be implemented using:

Alignment->Output Format Options->Output order->Input

Obtaining a matrix of pairwise identities

There are some situations in which a description of the pairwise identities between a set of sequences in the context of an MSA is required - although if you find that you are using this information to describe the relationships/similarities between a set of sequences, you may want to consider whether it wouldn't be more appropriate to do this in the context of a tree.

To obtain such a matrix:
  1. Trees->Output Format Options and check the "% identity matrix" box
  2. calculate a tree using CLUSTALX as described above
Along with a file containing a phylogeny estimate for the sequences, an additional file (with file-extension ".pim") will be created with a matrix of the pairwise identities for all sequences in the alignment, in the context of the alignment

CLUSTALX output formats and file extensions

Alignment Format
File Extension
CLUSTAL
.aln
GCG/MSF
.msf
GDE
.gde
FASTA
.fasta
NBRF/PIR
.pir
PHYLIP
.phy
NEXUS
.nxs


Author: Aidan Budd
Back To Gibson Team Training Pages