Obtaining Pre-calculated Trees and Alignments



Introduction

There are many websites that provide access to pre-calculated trees and multiple sequence alignments (MSAs). Given that it can take considerable effort and time to estimate trees and alignments "from scratch", these websites are a valuable resource, provided one can identify and obtain appropriate data.

However, providing this kind of data is often not the primary purpose of these resources, so in many cases it is not immediately obvious how to extract files containing trees or alignments from them.

The aim of this page is therefore to show how to use several such sites to obtain data of this kind.

Note, however, that resources of this kind are not static, and may be changed such that these instructions are no longer valid - in which case, please let me know!

Text-based searches

In most cases, text-based queries are the quickest way to identify the pages/records corresponding to your family/protein of interest.

If you do not already know the name/accession number of a protein, two obvious places to start looking are the EB-eye search tool and Google. Alternatively, if you are beginning with an HTML document that describes the proteins you are interested in, you could also try the REFLECT tool (which can also be installed as a plugin for the Firefox web browser).

Any tree/MSA you will be looking for will have at least three sequences in it. Given the choice, it makes sense to begin by searching with identifiers for sequences from organisms whose proteomes are more extensively characterised, as this makes it less likely that you will fail to identify a database record for your query protein. For example, one would usually begin by searching with appropriate human proteins (assuming your family of interest contains human sequences - if you're interested in bacteria, then probably not!), as the human proteome is well studied, well annotated, and presumably fairly comprehensive.

If your initial search does not identify a relevant record (for example, in TreeFam), move on to searching with the identifiers of sequences related to your sequence of interest.

However, note that several features of biological entities can cause problems with such searches. These issues are common to many different kinds of biological entity, e.g. genes, proteins, organisms, chemicals etc. However, as our focus here is on protein sequences (and on the alignments, and trees estimated from those alignments, that contain them), the discussion below uses "protein" even where the message is, in most cases, also true for other kinds of entity.
Multiple "names" for a given protein
Firstly, a protein may be referred to by several different names/identifiers.

For example, looking at the UniProt record for the human src oncogene SRC_HUMAN, we see that three whole sections of the record (the "Names and origin", "Cross-references", and "Entry Information" sections) are devoted to listing the range of different names, accession numbers, and identifiers associated with the protein. Thus, the protein/gene is referred to, in different places, as "SRC", "SRC1", "c-Src", "p60-Src" etc.; UniProt itself has several different accession numbers for the protein (P12931, Q86VB9, Q9H5A8); the GeneCards database stores information for the protein under accession number "GC20P035406"; and so on.

There is no guarantee that any resource will store a complete set of the names/identifiers used for a given protein - indeed, this is almost certainly NOT the case - therefore, when a text query fails to find any records, the obvious thing to try next is simply a different identifier.

Some identifiers are more likely than others to be found in a database. In particular, UniProt entry names (e.g. SRC_HUMAN) and primary accession numbers (e.g. P12931) are used by many resources, so these are an obvious first choice when carrying out a query of this kind.
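When working through a list of candidate search terms, it can help to sort out which of them look like UniProt accession numbers or entry names, so that those are tried first. The sketch below is only an illustration: the accession pattern follows the one given in the UniProt documentation, while the entry-name pattern and the function name classify_identifier are assumptions made for this example.

```python
import re

# Accession number pattern as given in the UniProt documentation.
ACCESSION_RE = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)

# Assumed pattern for entry names such as SRC_HUMAN: a short mnemonic,
# an underscore, then a species code of up to five characters.
ENTRY_NAME_RE = re.compile(r"^[A-Z0-9]{1,10}_[A-Z0-9]{1,5}$")


def classify_identifier(text):
    """Return 'accession', 'entry_name' or 'other' for a query string."""
    token = text.strip().upper()
    if ACCESSION_RE.match(token):
        return "accession"
    if ENTRY_NAME_RE.match(token):
        return "entry_name"
    return "other"


if __name__ == "__main__":
    for term in ["P12931", "SRC_HUMAN", "c-Src", "GC20P035406"]:
        print(term, classify_identifier(term))
```

A term classified as "other" is not necessarily a bad query; it is simply less likely to be shared between resources.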
Multiple proteins with the same "name"
Secondly, there are many cases where two very different proteins have the same name. For example, the UniProt records P00395 and P23219 are both associated with the name "COX1", despite being very different sequences carrying out very different functions.

Thus, when you identify a record via a text query, it is important to verify that the record does indeed correspond to the protein you are looking for. For example, you can check that the record links to literature you know refers to your entity of interest or, if you already know the sequence of your protein, you can check whether it matches the sequence of the record returned by your search.
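If you already have the sequence of your protein, the comparison with a record's sequence can be done with a few lines of code rather than by eye. The following is a minimal sketch, assuming the record's sequence has been saved as FASTA text; the function names and the short toy sequence are invented for illustration. Note that it tests for an exact match only, so a different isoform or a fragment of the same protein would be reported as a mismatch.

```python
def read_fasta_sequence(fasta_text):
    """Return the first sequence in FASTA text as one upper-case string."""
    residues = []
    seen_header = False
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seen_header:
                break  # a second header starts another record: stop here
            seen_header = True
            continue
        residues.append(line.upper())
    return "".join(residues)


def same_protein_sequence(known_sequence, record_fasta):
    """True if the known sequence exactly equals the record's sequence."""
    # Remove any spaces or line breaks from the pasted sequence.
    cleaned = "".join(known_sequence.split()).upper()
    return cleaned == read_fasta_sequence(record_fasta)


if __name__ == "__main__":
    # Toy record, not a real database entry.
    record = ">toy_record\nMGSNK\nSKPKD\n"
    print(same_protein_sequence("mgsnk skpkd", record))
```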

Many/most databases have a (for that database) unique identifier associated with each record - for example, for the human SRC protein in UniProt, this "primary accession number" is P12931. If you can find such unique identifiers for your sequence, then they make good search terms, as it is less likely that two different proteins will have the same accession number.
Additional annotation
Many protein databases associate with a given protein record considerable information about the known function of that protein. In many cases, this will include statements such as "interacts with the src protein", i.e. statements that mention proteins other than the one described by the record. A text query that also looks for matches in such regions of a record can therefore cause problems similar to those caused by several different proteins being associated with the same name. This is yet another reason why it is important to check that the records returned by a search do indeed correspond to the protein you are interested in.

TreeFam

TreeFam clusters sequences into families, with the aim that each family represents a set of genes that was present as at most a single copy in the ancestral metazoan/animal genome. Thus, while the resource does contain some non-metazoan sequences, the emphasis is strongly on metazoan genes, in particular those present in ENSEMBL release 47.

This link leads to the PubMed abstract for the paper describing TreeFam that appeared in NAR.
Identify family of interest
The most convenient way to identify your family of interest is to query with the UniProt identifier (or name) of one of the sequences in the family, e.g. SRC_HUMAN or P12931 if we are looking for a tree/alignment containing the human src protein.

Begin by navigating to the TreeFam front page by following this link.

At the top left of the front page, follow the slightly hidden link to the "Search" page, i.e. the link outlined by the red box in the image below.

Note that while the TreeFam front page has search facilities (the blue box on the left side of the page), these do not allow you to search with external identifiers (e.g. SRC_HUMAN or P12931, which are not the identifiers used internally within TreeFam to refer to the human src protein). These front-page boxes can instead be used to query with, for example, a TreeFam family identifier such as "TF101008" for a cyclin H family (in TreeFam release 7.0), entered in the "Go for families" box.

Follow the link to the "Search" page shown below.

Type the external accession number into the bottom one of the three text fields, and click the "Go" button to the right of this field to run the query.


The results of the search should be returned very quickly, giving a results page similar to that shown below (this link points to the results page).

Follow the link in the "AC" column of the results page (shown below highlighted in purple) to reach the record associated with this TreeFam family.

Obtaining MSA/tree files
To obtain an MSA/tree from the TreeFam record page corresponding to this family:
  1. Find the "Plain File" section on the left side of the record page, and click on the arrow button to access the list of files that can be downloaded for this family.
  2. Scroll down the list to the file you want, and release the mouse button.
  3. The name of the file you've chosen should now be shown in the list - click "Go" to the right of the filename to download the file.
Files whose names end in ".aln" are alignments.
Files whose names end in ".nhx" are trees.
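If you download several files for a family, the naming convention above makes it easy to separate alignments from trees automatically. The snippet below is a small sketch of that idea; the function name sort_treefam_downloads and the file names in the example are hypothetical.

```python
from pathlib import Path

# Suffix-to-content mapping taken from the naming convention above.
KIND_BY_SUFFIX = {".aln": "alignment", ".nhx": "tree"}


def sort_treefam_downloads(filenames):
    """Group file names into alignments, trees and anything else."""
    groups = {"alignment": [], "tree": [], "other": []}
    for name in filenames:
        kind = KIND_BY_SUFFIX.get(Path(name).suffix.lower(), "other")
        groups[kind].append(name)
    return groups


if __name__ == "__main__":
    # Hypothetical file names, used only to show the grouping.
    print(sort_treefam_downloads(["full.aln", "seed.aln", "example.nhx"]))
```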


Explore the TreeFam website and articles for information on the differences between the various alignment and tree files available (e.g. the difference between full.aln, seed.aln and clean.aln).

See below for a screen shot showing these final steps.

ENSEMBL

The ENSEMBL project hosts an extensive website providing access to a very diverse set of data associated with a range of different genomes. At the time of writing (ENSEMBL release 53), the main focus of the project is vertebrate genomes, although some other model organisms are also included. However, over the next set of releases (there is a new release every two months on average), many more organisms will be included, for example plants, fungi, and eubacteria.
Identify family of interest
Given the impressively fast release cycle of ENSEMBL, there are frequent changes to the web interface for the project. Thus, in contrast with our description of using TreeFam above, we will not rely on screenshots to explain how to obtain MSAs/trees, as these tend to become outdated rather quickly for ENSEMBL.

The instructions provided below are for release 53 of ENSEMBL - once release 54 and later are available, this link should lead you to an archived version of release 53.

  1. In the prominent "Search ENSEMBL" field on the ENSEMBL front page, enter the name or identifier of your protein/gene of interest and run the search.
  2. From the results page, navigate through the list of matches to your query, looking for matches that are "Genes" (rather than "Protein Alignment Features" etc.), and follow the link to the appropriate Gene page.
Obtaining tree files
  1. The panel on the left side of the Gene page should have a link to "Gene Tree" - click on the "+" sign to the left of this link, and you should be shown a link to "Gene Tree (text)". (The "Gene Tree" link itself might need to be revealed by clicking on a "+" sign to the left of the "Comparative Genomics" link.)
  2. Follow the "Gene Tree (text)" link, and a tree that includes the gene should be shown, in NEWICK format, on the right of the screen.
  3. Copy and paste this NEWICK format into a text editor, and save the resulting file.
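Because the tree is transferred by copy and paste, it is worth checking that nothing was lost along the way. The sketch below is one simple way to do this, assuming plain NEWICK text with unquoted labels: it checks that the text ends with a semicolon and that the parentheses balance, and then lists the leaf names so you can confirm that your gene is present. The function name newick_leaf_names is made up for this example, and this is a sanity check rather than a full NEWICK parser.

```python
import re


def newick_leaf_names(newick_text):
    """Return leaf labels from a NEWICK string, after basic sanity checks."""
    text = newick_text.strip()
    # Drop bracketed comments, e.g. NHX annotations like [&&NHX:...].
    text = re.sub(r"\[[^\]]*\]", "", text)
    if not text.endswith(";"):
        raise ValueError("tree text does not end with ';' (truncated paste?)")
    depth = 0
    for char in text:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced ')' in tree text")
    if depth != 0:
        raise ValueError("unbalanced '(' in tree text")
    # A leaf label directly follows '(' or ','; internal labels follow ')'.
    return re.findall(r"[(,]\s*([^\s(),:;]+)", text)


if __name__ == "__main__":
    # Toy tree with invented leaf names.
    print(newick_leaf_names("((A:0.1,B:0.2):0.05,C:0.3);"))
```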
Obtaining MSA files
ENSEMBL provides two kinds of MSAs for download.

The first is associated with the gene tree described above.

To access this alignment, follow the "Gene Tree (Alignment)" link that is just below the "Gene Tree (text)" link.

A CLUSTALW-format alignment should appear in the same place as the NEWICK format tree appeared. As before, copy and paste this text into a text editor to save it locally.
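As with the tree, a quick check of the saved alignment can catch an incomplete paste. The following sketch reads CLUSTAL-format text into a dictionary of sequences and checks that all aligned sequences have the same length. It assumes the usual layout (a header line beginning "CLUSTAL", then blocks of "name sequence" lines, with conservation lines that start with whitespace); the function name parse_clustal and the toy alignment are invented for illustration.

```python
def parse_clustal(text):
    """Parse CLUSTAL-format text into a dict of name -> aligned sequence."""
    lines = text.strip().splitlines()
    if not lines or not lines[0].upper().startswith("CLUSTAL"):
        raise ValueError("text does not start with a CLUSTAL header line")
    sequences = {}
    for line in lines[1:]:
        # Skip blank lines and conservation lines (these begin with spaces).
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        # An optional third field (a residue count) is ignored.
        name, chunk = fields[0], fields[1]
        sequences[name] = sequences.get(name, "") + chunk
    lengths = {len(seq) for seq in sequences.values()}
    if len(lengths) > 1:
        raise ValueError("aligned sequences differ in length (incomplete paste?)")
    return sequences


if __name__ == "__main__":
    # Toy alignment, not taken from ENSEMBL.
    toy = "\n".join([
        "CLUSTAL W (1.81) multiple sequence alignment",
        "",
        "seqA    MKT-AY",
        "seqB    MKTQAY",
        "        *** **",
    ])
    for name, seq in parse_clustal(toy).items():
        print(name, seq)
```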

The second (larger and more divergent) alignment is associated with the ENSEMBL family.

Follow the "Protein families" link - this should provide you with (amongst other things) a set of "Jalview" start buttons.

Choose the alignment you want to download by selecting the appropriate Jalview button, and wait for the Applet version of Jalview to start.

Once the Jalview Applet has started, you should be able to choose the "File" menu for Jalview (note that, at least on Macs, to see this menu you need to have the parent browser window selected, rather than the Jalview applet window).

Choose "Output alignment via text box" and in the window that then opens, choose the "Apply" button, and copy-and-paste the resulting text into a text editor to save a local copy.

HOVERGEN

HOVERGEN provides information on groups of vertebrate genes that are related to each other.
Identify family of interest
The procedure below describes how to query for alignments of protein sequences using a UniProt accession number - the example taken is again that for the human src protein.

Query HOVERGEN using the WWW-Query system:
Obtaining MSA/tree files
From the results page returned by this query:

Other sources of pre-calculated trees/MSAs



Author: Aidan Budd