Obtaining Pre-calculated Trees and Alignments
Introduction
There are many websites that provide access to pre-calculated trees and
multiple sequence alignments (MSAs). Given that it can take
considerable effort and time to estimate trees and alignments "from
scratch", these websites are a valuable resource, provided one can
identify and obtain appropriate data.
The primary purpose of these resources, however, is often not to
provide this kind of data, so in many cases it is not immediately
obvious how to extract tree or alignment files from these sites.
The aim of this page is therefore to show how to use several such sites
to obtain data of this kind.
Note, however, that resources of this kind are not static, and may be
changed such that these instructions are no longer valid - in which
case, please let me know!
Text-based searches
In most cases, text-based queries are the quickest way to identify the
pages/records corresponding to your family/protein of interest.
If you do not already have information about the name/accession number
of a protein, then two obvious places to start looking for them are the
EB-eye search tool or Google searches. Alternatively, if you are
beginning with an HTML document that describes the proteins you are
interested in, you could also try the REFLECT tool (which can also be
installed as a plugin for the Firefox web browser).
Any tree/MSA you will be looking for will have at least three sequences
in it - given the choice, it makes sense to begin by searching for
these using identifiers for sequences from organisms about whose
proteome our knowledge is more extensive. This makes it less likely
that you fail to identify a record in the databases for your query
protein. For example, one would usually begin by searching with
appropriate human proteins (assuming your family of interest contains
human sequences - if you're interested in bacteria, then probably
not!), as this proteome is well studied, annotated, and presumably
fairly comprehensive.
If your initial search does not identify a relevant record in the
resource you are querying, then move on to use the identifiers of
sequences related to your sequence of interest.
However, note that several features of biological entities can cause
problems with such searches. These issues are common to many different
kinds of biological entities e.g. genes, proteins, organisms, chemicals
etc. However, as our focus is on protein sequences (and on alignments of
these sequences and trees estimated from them), the discussion below
uses "protein", although in most cases the message also holds for other
kinds of entity.
Multiple "names" for a given protein
Firstly, a protein may be referred to by several different
names/identifiers.
For example, looking at the UniProt record for the human src oncogene
(SRC_HUMAN), we see that three whole sections of the record are devoted
to listing the range of different names, accession numbers, and
identifiers associated with the protein (the "Names and origin",
"Cross-references", and "Entry Information" sections). The protein/gene
is referred to, in different places, as "SRC", "SRC1", "c-Src",
"p60-Src" etc.; UniProt itself has several different accession numbers
for the protein (P12931, Q86VB9, Q9H5A8); and the GeneCards database
stores information for the protein under accession number "GC20P035406".
There is no guarantee that any resource will store a complete set of
the names/identifiers used for a given protein - indeed, this is almost
certainly NOT the case - therefore, when a text query fails to find any
records, the obvious thing to try next is simply a different identifier.
Some identifiers are more likely than others to be found in a database
- in particular, the UniProt entry names (e.g. SRC_HUMAN) and primary
accession numbers (e.g. P12931) are used by many resources, so these are
an obvious first choice when carrying out a query of this kind.
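The "try the next identifier" strategy described above can be sketched in
Python as follows. This is only an illustration: first_hit and the search
callable are hypothetical names, and search stands in for whatever query
mechanism a given resource offers (it is assumed to return a list of
matching records, which is empty when nothing is found).

```python
from typing import Callable, Iterable, List, Optional, Tuple


def first_hit(
    identifiers: Iterable[str],
    search: Callable[[str], List[str]],
) -> Optional[Tuple[str, List[str]]]:
    """Try each identifier in turn; return the first one that yields records.

    `search` is a hypothetical stand-in for a database query: it takes one
    identifier and returns a (possibly empty) list of matching records.
    """
    for identifier in identifiers:
        records = search(identifier)
        if records:
            return identifier, records
    return None  # nothing matched; consider identifiers of related sequences


# Identifiers for human src taken from the text, most widely used first.
SRC_IDENTIFIERS = ["P12931", "SRC_HUMAN", "SRC", "c-Src", "p60-Src"]

# A toy "database" that only knows the UniProt entry name (an assumption
# made purely for this example).
toy_db = {"SRC_HUMAN": ["toy record for human src"]}
result = first_hit(SRC_IDENTIFIERS, lambda query: toy_db.get(query, []))
```

Ordering the list so that the primary accession number and entry name come
first means the identifiers most likely to be recognised are tried before
the more ambiguous names.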
Multiple proteins with the same "name"
Secondly, there are many cases where two very different proteins have
the same name. For example, the UniProt records P00395 and P23219 are both
associated with the name "COX1", despite being very different sequences
carrying out very different functions.
Thus, when you identify a record via text query, it is important that
you verify that the record you identify does indeed correspond to the
protein you are looking for e.g. by checking that the record links to
literature you know refers to your entity of interest, or if you
already know the sequence of your protein, you can check whether it
corresponds to that of the record returned by your search.
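If you already have the sequence of your protein, the comparison with the
sequence in the record can be done with a few lines of Python. The sketch
below is a minimal example under stated assumptions: the function names
are made up for illustration, the sequences are pasted as plain text or
FASTA, and only exact identity is tested (isoforms or updated database
versions of the same protein may differ slightly, and would need an
alignment to compare properly).

```python
def normalise_sequence(raw: str) -> str:
    """Drop FASTA header lines, whitespace, digits and symbols; upper-case."""
    lines = [line for line in raw.splitlines() if not line.startswith(">")]
    return "".join(ch for ch in "".join(lines) if ch.isalpha()).upper()


def same_protein(seq_a: str, seq_b: str) -> bool:
    """True if the two sequences are identical after normalisation."""
    return normalise_sequence(seq_a) == normalise_sequence(seq_b)


# Toy sequences (not real proteins), one in FASTA form and one as bare text.
mine = ">my_protein\nMKTA YIAK\n"
from_record = "mktayiak"
matches = same_protein(mine, from_record)
```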
Many/most databases have a (for that database) unique identifier
associated with each record - for example, for the human SRC protein in
UniProt, this "primary accession number" is P12931. If you can find
such unique identifiers for your sequence, then they make good search
terms, as it is less likely that two different proteins will have the
same accession number.
Additional annotation
Many protein databases associate with a given protein record
considerable information about the known function of that protein. In
many cases, this will include statements such as "interacts with the src
protein", i.e. statements that mention proteins other than the one
described by the record. A text query that also looks for matches in
such regions of a record can cause problems similar to those caused by
several different proteins being associated with the same name - this
is yet another reason why it is important to check that the records
returned by a search do indeed correspond to the protein you are
interested in.
TreeFam
TreeFam clusters sequences into families, with the aim that each family
represents a set of genes that were present in at most a
single copy in the ancestral metazoan/animal genome. Thus, while the
resource does contain some non-metazoan sequences, the emphasis is
strongly on metazoan genes, in particular those present in ENSEMBL
release 47.
This link leads
to the PubMed abstract for the paper describing TreeFam that
appeared in NAR.
Identify family of interest
The most convenient way to identify your family of interest is to query
with the UniProt identifier (or
name) of one of the sequences in your family of interest e.g. SRC_HUMAN
or P12931 if we are looking for a tree/alignment containing the
human src protein.
Begin by navigating to the TreeFam front page by following this link.
At the top left of the frontpage, follow the slightly hidden link to
the "Search" page i.e. the link outlined by the red box in the image
below.

Note that while the TreeFam front page has search facilities (the blue
box on the left side of the page), these do not allow you to
search with external identifiers (such as SRC_HUMAN or P12931,
which are not the identifiers used internally within TreeFam to refer
to the human src protein). These other boxes can be used to query, for
example, with a TreeFam family identifier e.g. "TF101008" for a cyclin
H family (in TreeFam release 7.0), in the "Go for families" box.
Follow the link to the "Search" page shown below.
Type the external accession number into the bottom of the three text
fields, and click the "Go" button to the right of this field to run the
query.

The results of the search should be returned very quickly, giving a
result page similar to that shown below (this link
points to the result page).
Follow the link in the "AC" column of the result page (shown
below highlighted in purple) to reach the record associated with this
TreeFam family.

Obtaining MSA/tree files
To obtain an MSA/tree from the TreeFam record page corresponding to
this family:
- Find the "Plain File" section on the left side of the record
page, and click on the arrow button to access the list of possible
files that can be downloaded for this family.
- Scroll down the list to choose the file you want, and release the
mouse
- The name of the file you've chosen should now be shown in the
list - click "Go" to the right of the filename to download the file
Files whose name ends in ".aln" are alignments
Files whose name ends in ".nhx" are trees
Explore the TreeFam website and articles for information on the
difference between the various alignment and tree files available (e.g.
the difference between full.aln, seed.aln and clean.aln).
See below for a screen shot showing these final steps.
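The ".nhx" tree files use the NHX format, which is NEWICK with extra
annotations embedded in square-bracket comments of the form
"[&&NHX:...]". If a program you want to use does not accept these
annotations (an assumption - many programs read NHX directly), one option
is to strip them to leave plain NEWICK. The sketch below shows one way to
do this in Python; the function name and the toy tree are made up for
illustration.

```python
import re

# NHX annotations are NEWICK comments of the form [&&NHX:key=value:...]
NHX_TAG = re.compile(r"\[&&NHX:[^\]]*\]")


def nhx_to_newick(nhx_text: str) -> str:
    """Remove NHX annotation blocks, leaving a plain NEWICK string."""
    return NHX_TAG.sub("", nhx_text).strip()


# A toy three-leaf tree with species (S=) and duplication (D=) tags.
toy_nhx = "((A:0.1[&&NHX:S=human],B:0.2[&&NHX:S=mouse]):0.3[&&NHX:D=N],C:0.4);"
plain = nhx_to_newick(toy_nhx)
```

Note that this discards the annotations, so keep the original ".nhx" file
if you may need that information later.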

ENSEMBL
The ENSEMBL project hosts an extensive website providing access to a
very diverse set of data associated with a range of different genomes.
At the time of writing (ENSEMBL release 53), the main focus of the
project is vertebrate genomes, although some other model organisms are
also included. However, over the next set of releases (there is a new
release on average every two months), many more organisms will be
included, for example plants, fungi, and eubacteria.
Identify family of interest
Given the impressively fast release cycle of ENSEMBL, there are frequent
changes to the web interface for the project. Thus, in contrast with
our description of using TreeFam above, we will not focus on showing
screenshots to explain how to obtain MSAs/trees, as these tend to
become outdated rather quickly for ENSEMBL.
The instructions provided below are for release 53 of
ENSEMBL - once release 54 and later are available, this link should
lead you to an archived version of release 53.
- In the prominent "Search ENSEMBL" field on the ENSEMBL front page
- choose the organism that contains your gene of interest
- enter your text query in the box below the species list
- click "Go"
- Usually it is better to query with UniProt accession numbers
rather than UniProt entry names, as this yields fewer hits which are
thus quicker to examine, and should contain the record you are looking
for.
- From the results page, navigate through the list of matches to
your query, looking for matches that are "Genes" (rather than "Protein
Alignment Features" etc.), and follow the link to the appropriate Gene
page.
Obtaining tree files
- The panel on the left side of the Gene page should have a link
to "Gene Tree" - click on the "+" sign to the left of this link, and
you should be shown a link to "Gene Tree (text)". (The "Gene Tree" link
itself might need to be revealed by clicking on a "+" sign to the left
of the "Comparative Genomics" link.)
- Follow the "Gene Tree (text)" link, and the right of the
screen should show, in NEWICK format, a tree that includes the
gene.
- Copy and paste this NEWICK format into a text editor, and save
the resulting file.
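Copying and pasting from a browser can silently truncate a long tree, so
it may be worth checking the saved text before using it. The sketch below
is a rough sanity check rather than a full NEWICK parser: the function
name is made up for illustration, and it assumes that labels are unquoted
and contain no parentheses or commas.

```python
def check_newick(text: str) -> int:
    """Return the number of leaves, or raise ValueError if malformed.

    Simplifying assumption: labels are unquoted and contain no
    parentheses or commas.
    """
    tree = text.strip()
    if not tree.endswith(";"):
        raise ValueError("NEWICK string should end with ';'")
    depth = 0
    for ch in tree:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unmatched ')'")
    if depth != 0:
        raise ValueError("unmatched '('")
    # Each internal node with k children contributes k - 1 commas,
    # so the number of leaves is the number of commas plus one.
    return tree.count(",") + 1


n_leaves = check_newick("((A:0.1,B:0.2):0.3,C:0.4);")
```

Comparing the leaf count with the number of genes shown on the ENSEMBL
Gene Tree page gives a quick indication of whether the whole tree was
copied.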
Obtaining MSA files
ENSEMBL provides two kinds of MSAs for download.
The first is associated with the gene tree described above.
To access this alignment, follow the "Gene Tree (Alignment)" link that is
just below the "Gene Tree (text)" link.
A CLUSTALW-format alignment should appear in the same place as the
NEWICK format tree appeared. As before, copy and paste this text into a
text editor to save it locally.
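Once saved, a CLUSTALW-format file can be read with a short script, for
example to check that all sequences were copied completely. The parser
below is a minimal sketch, assuming the usual layout of this format (a
header line starting with "CLUSTAL", then blocks of "name sequence" lines
separated by blank lines and conservation lines); the function name is
made up for illustration. Biopython's AlignIO module can also read this
format, if you have it installed.

```python
def parse_clustal(text: str) -> dict:
    """Parse CLUSTAL-format text into {sequence name: aligned sequence}."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("CLUSTAL"):
        raise ValueError("expected a header line starting with 'CLUSTAL'")
    alignment = {}
    for line in lines[1:]:
        # Skip blank lines and conservation lines (only '*', ':', '.', ' ')
        if set(line) <= set("*:. "):
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        # Any third field (a running residue count) is ignored
        name, chunk = fields[0], fields[1]
        alignment[name] = alignment.get(name, "") + chunk
    return alignment


# A toy two-sequence alignment in two blocks (not real data).
toy_aln = """CLUSTAL W (1.81) multiple sequence alignment

seqA    MKT-AY
seqB    MKTQAY
        *** **

seqA    IAK
seqB    IGK
        * *
"""
aligned = parse_clustal(toy_aln)
lengths = {len(seq) for seq in aligned.values()}
```

In a complete alignment every sequence has the same aligned length, so
lengths should contain a single value.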
The second (larger and more divergent) alignment is associated with the
ENSEMBL family.
Follow the "Protein families" link - this should provide you with
(amongst other things) a set of "Jalview" start buttons.
Choose the alignment you want to download by selecting the appropriate
Jalview button, and wait for the Applet version of Jalview to start.
Once the Jalview Applet has started, you should be able to choose the
"File" menu for Jalview (note that, at least on Macs, to see this menu
you need to have the parent browser window selected, rather than the
Jalview applet window).
Choose "Output alignment via text box" and in the window that then
opens, choose the "Apply" button, and copy-and-paste the resulting text
into a text editor to save a local copy.
HOVERGEN
HOVERGEN
provides information on groups of vertebrate genes that are related to
each other.
Identify family of interest
The procedure below describes how to query for alignments of protein
sequences using a UniProt accession number - the example taken is again
that for the human src protein.
Query HOVERGEN using the WWW-Query system:
- Select using radio buttons "Search for families, alignments and
trees" and "Protein databank"
- To the right of the protein/nucleotide databank selection
buttons, the database list should show "HOVERGEN (AA)", which is indeed
the database we want to query.
- From the selection criteria section of the query page, select
"Accession number" rather than "Keyword" for the first criterion, and
supply the UniProt accession number for a sequence from your family of
interest in the corresponding text box (as shown below)
- Note that in this case, you must use the accession number - the
UniProt entry names (in this case "SRC_HUMAN") cannot be used as search
terms when querying the accession numbers.
- Click the "Submit" button to execute the query
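Because this query only accepts accession numbers, it can help to check
which kind of identifier you have before submitting it. The sketch below
uses the accession number pattern documented by UniProt; the entry name
pattern is a simplified assumption (an upper-case mnemonic, an underscore,
and a species code), and the function name is made up for illustration.

```python
import re

# Accession number pattern as documented by UniProt.
ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)
# Simplified assumption for entry names such as SRC_HUMAN.
ENTRY_NAME = re.compile(r"[A-Z0-9]{1,10}_[A-Z0-9]{1,5}")


def classify_identifier(identifier: str) -> str:
    """Return 'accession', 'entry_name' or 'other' for a UniProt-style string."""
    if ACCESSION.fullmatch(identifier):
        return "accession"
    if ENTRY_NAME.fullmatch(identifier):
        return "entry_name"
    return "other"


# Classify some of the src identifiers mentioned earlier on this page.
kinds = {name: classify_identifier(name)
         for name in ["P12931", "Q86VB9", "SRC_HUMAN", "c-Src"]}
```

The patterns are case-sensitive, so identifiers should be given in upper
case as they appear in UniProt.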

Obtaining MSA/tree files
From the results page returned by this query:
- click the check-box to the left of the HOVERGEN accession number
for your family of interest (in the "Family" column of the result table)
- instead of "Select families", use the menu list to select
"Download alignments/trees"
- click "Save selection"

- Follow the link to the "SELECTED.tar" file that has been created
on the University of Lyon 1 FTP site to download it to your local
machine.
- Untar this file (by moving to the directory that contains the
SELECTED.tar file, then using the UNIX command "tar xvf SELECTED.tar")
- The alignment and tree for this family will be found in the
subdirectories of the untarred archive e.g.
- SELECTED/CLUSTALW/HBG008761.aln (the alignment in CLUSTALW
format)
- SELECTED/TREE/HBG008761.phb (the tree in NEWICK format)
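If you prefer not to use the command line, the same unpacking step can be
done with Python's standard tarfile module. The sketch below is a minimal
example, assuming the archive has the layout described above (".aln"
alignments and ".phb" trees in subdirectories); the function name is made
up for illustration. As with the tar command, only unpack archives from a
source you trust.

```python
import tarfile
from pathlib import Path


def extract_selection(tar_path: str, dest_dir: str) -> dict:
    """Unpack the archive into dest_dir and list the files of interest."""
    with tarfile.open(tar_path) as archive:
        archive.extractall(dest_dir)  # like "tar xvf" run inside dest_dir
    root = Path(dest_dir)
    return {
        # Paths are reported relative to dest_dir, with "/" separators
        "alignments": sorted(
            p.relative_to(root).as_posix() for p in root.rglob("*.aln")
        ),
        "trees": sorted(
            p.relative_to(root).as_posix() for p in root.rglob("*.phb")
        ),
    }
```

For example, extract_selection("SELECTED.tar", ".") unpacks the archive
into the current directory and returns the relative paths of the
alignment and tree files it found.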
Other sources of pre-calculated trees/MSAs
Author: Aidan Budd