AQUA - Alignment Quality Assessment




Overview

What is done?

The program tackles the problem of assessing the alignment quality on the following three different levels.

Level 1: Pairs

The alignment between a search sequence (U) and a protein from the database is assessed by:
  1. checking whether the two proteins are true homologues,
  2. compiling the percentage of correctly aligned residues,
  3. compiling an alignment shift score that depends exponentially on the number of residues by which the predicted alignment is displaced with respect to the true alignment (e.g. a structural alignment),
  4. and, for alignments between proteins of known structure, comparing the similarity of the best structural superposition of the pair with the similarity of the superposition resulting from the predicted alignment.

Level 2: Search list

All alignments listed for a search with U against the database are assessed by:

  1. successively applying Level 1 to each pair,
  2. compiling overall scores for the entire list which reflect the balance between correct hits ('true' homologues) and incorrect ones ('false' positives), and the quality of the first correct (resp. all correct) alignment(s),
  3. compiling the accuracy as a function of the 'reliability score' (often a z-score) provided by the prediction method.

Level 3: Test set

All alignments listed for the searches with a set of sequences {U} against the database are assessed. The final overall scores are the same as the ones compiled on Level 2. Additionally, the following results are given (all values are averages over all search lists):

  1. the correctness of all hits down to rank R,
  2. the quality of all alignments down to rank R,
  3. the coverage / accuracy ratio down to rank R.


When may you need AQUA?

Typical applications of AQUA are the evaluation of methods for:

The precondition for using AQUA is the existence of some standard of truth. This can be a list of database proteins known to be homologous to the search sequence U (similar function and/or similar structure). For the assessment of alignment quality we provide, by default, structural alignments generated by DALI (Holm & Sander, 1993); 'true' alignments are taken from the latest version of the FSSP database of structural alignments (Holm & Sander, 1994).

Given a standard of truth, you can use AQUA on three levels: (1) to assess the quality of a pair alignment (pair); (2) to evaluate a search of one sequence U against a database (list); and (3) to evaluate an entire test set of search sequences {U} aligned against a database (test set). On all levels, scores will also be compiled if you are only interested in evaluating rank lists, i.e. whether or not a pair predicted to be homologous was a 'true' homologous pair.


What is the input?

Your prediction

Standard inputs to AQUA are a pair alignment, a list of pair alignments from a search with one sequence U against a database, or the results for a test set, i.e. lists of pair alignments from searches with a set of sequences {U} against a database (optional input described below).

Note: a special case is a simple hit list that addresses the following question: if I search the database for all proteins similar to U, at which rank does protein x appear in the final list, and at which ranks are other correct hits (homologues) to be found?

Note: all pairs of the prediction are evaluated (if your alignments are provided in the DAF format, you can mark pairs you want to be excluded from the analysis). Overall scores (e.g. accuracy and coverage) refer to the list of alignments as provided by your prediction.

Standard of truth

Additionally, you can provide the standard of truth (its filename is the second argument on the command line). The standard of truth consists of a list of pairs considered to be true homologues and - optionally - the alignments considered to be true alignments (e.g. structural alignments of all homologous pairs). The default 'standard of truth' is the set of structural alignments taken from FSSP (Holm & Sander, 1994).

Search set

As an option you can provide the list of identifiers of the proteins you used as the database, i.e. against which you aligned U. The effect of the search set is that all pairs found in the standard of truth but missing from your search set will be ignored.

Exclude set

As an option you can provide the list of identifiers of the proteins that ought to be excluded from the analysis. For instance, for remote homology detection these would be all proteins that have significant levels of pairwise sequence identity (>25%) to the search sequence U.
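The combined effect of the search set and the exclude set on the standard of truth can be sketched as follows. This is only an illustration in Python, not AQUA's own code (AQUA is a Perl script); the function name `filter_truth`, the tuple representation of pairs, and the example identifiers are assumptions made for this sketch.

```python
def filter_truth(truth_pairs, search_set=None, exclude_set=None):
    """Return the truth pairs that remain after applying both optional sets.

    truth_pairs: iterable of (search_id, database_id) tuples (hypothetical layout).
    search_set:  ids of the proteins used as the database, or None if not given.
    exclude_set: ids of proteins to leave out of the analysis, or None.
    """
    exclude = set(exclude_set or ())
    kept = []
    for query, hit in truth_pairs:
        if search_set is not None and hit not in search_set:
            continue  # pair is missing from the search set: ignore it
        if hit in exclude:
            continue  # e.g. >25% pairwise sequence identity to U
        kept.append((query, hit))
    return kept


# Hypothetical identifiers, for illustration only.
truth = [("U", "1abc"), ("U", "2xyz"), ("U", "3def")]
print(filter_truth(truth, search_set={"1abc", "2xyz"}, exclude_set={"2xyz"}))
```

With neither set given, the standard of truth is returned unchanged.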

Reliability indices (or z-scores)

If you provide a score reflecting the reliability of a given hit (z-scores, energy values, probabilities), AQUA will return an analysis of accuracy vs. coverage under the assumption that the list is terminated at various cut-off values defined by your reliability score. These score-specific results are compiled at intervals:

reliability score of pair i > maximal reliability score / k, with k = 1, ..., 5.

Note that the reliability-dependent results can be compiled only if you define a value for the maximal reliability.
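A minimal sketch of the cut-off scheme above, written in Python for illustration (it is not taken from AQUA). It assumes that a higher score means a more reliable hit, that accuracy is the percentage of retained hits that are correct, and that coverage is the percentage of true homologues retained; the function name and the example values are hypothetical.

```python
def reliability_analysis(hits, n_true, z_max, k_max=5):
    """Accuracy and coverage for the cut-offs z_max / k, k = 1, ..., k_max.

    hits:   list of (reliability_score, is_correct) tuples for one search list.
    n_true: number of true homologues in the standard of truth.
    z_max:  maximal reliability score (must be supplied by the user).
    Returns a list of (cutoff, accuracy_percent, coverage_percent).
    """
    results = []
    for k in range(1, k_max + 1):
        cutoff = z_max / k
        # Keep only the pairs whose score exceeds the cut-off.
        kept = [ok for score, ok in hits if score > cutoff]
        correct = sum(1 for ok in kept if ok)
        accuracy = 100.0 * correct / len(kept) if kept else 0.0
        coverage = 100.0 * correct / n_true if n_true else 0.0
        results.append((cutoff, accuracy, coverage))
    return results


# Hypothetical search list: (score, correct?) per hit.
example_hits = [(12.0, True), (8.0, True), (5.0, False), (3.0, True)]
for row in reliability_analysis(example_hits, n_true=4, z_max=10.0):
    print(row)
```

Lowering the cut-off (larger k) admits more hits, so coverage can only grow while accuracy may drop.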

Evaluating test sets

If you want to provide the results from a search with sequence U against a database, all pairs have to be merged into a single file. For providing the results from an entire test set the following options are possible:

The following file formats will be accepted (xx: after completion of the contributions from Reinhard and Liisa).

File formats for prediction

File formats for standard of truth

Note: if you use the DAF format for the representation of the standard of truth, you can use a column named 'confidence' that assigns to a given pair a value on a scale from 0 (low) to 9 (high), reflecting your confidence in that pair. This serves to mark putative homologues or, more generally, to weight a pair that may be correct differently from one that clearly is not.

File formats for search and exclude set

These optional input files have to meet the following specifications (example):

Formats for evaluating performance on test sets


What is the output?

Two different output files are generated:

Both files are written out in two formats:

The notation used for that file is explained in its header. Various files with more detailed information will be created for debugging purposes (use the word debug on the command line to make sure debugging output will not be deleted).

If you want to evaluate the performance of your method on an entire test set, all results compiled by AQUA will by default be appended to one file (with the command line argument manyFiles, the results are written into one RDB output file for each search sequence U).


How to run the program?

Standard

The standard use is by providing your prediction in one file:

/home/rost/pub/aqua.pl filePred.daf

Get help

To get some explanations on the use of AQUA, type:

/home/rost/pub/aqua.pl filePred.daf

The result will be:


-------------------------------------------------------------------------------
--- 
--- Perl script 
--- Task:  	   	assessing the quality of alignments 
--- Input: 	   	filePred fileTrue options 
--- 
--- Optional: 
--- fileSetSearch=   
---             	lists the identifiers of the search set 
---			(one line per id)
--- fileExclude= 	lists the identifiers of proteins to be excluded from
---             	analysis (one line per id; e.g. >25% seq identity)  
--- title=       	title of output files (will be named: 'title.rdb', asf)
---                
--- noAnaShift   	the 'alignment shift' scores will not be compiled 
--- noAnaModel   	the quality of the alignment model will not be evaluated
---                
--- debug        	intermediately created files will NOT be deleted  
--- notScreen    	no information written onto screen 
--- dirIn=       	input directory        default: local 
--- dirOut=      	output directory       default: local 
--- dirWork=      	working directory      default: local
---
--- Further command line arguments as in 'Defaults.aqua'.  To list optional keys:
---            	 	'aqua opt'  
--- 
-------------------------------------------------------------------------------

Selected optional arguments

The following additional options are available on the command line (or via the default file):

Default file

You can give all optional arguments on the command line or in a default file (copy /home/rost/pub/Defaults.aqua). Note that command line arguments override the settings in the default file.



Definition of scores

The scores used to assess the quality of your prediction can be viewed in a hierarchy: pair, list of pairs (search set), list of lists of pairs (test set). The following scores will be compiled by AQUA. Note that most of the scores defined are explained in some detail elsewhere (Rost, 1995a; Rost, 1995b; Rost et al., 1996).


Level 1: Pairs

Rank analysis

On the level of pairs, the rank analysis consists simply of checking whether or not the pair is a true homologue (binary). If it is not, all further analysis will be omitted.

Percentage of correctly aligned residues

The simplest way to assess the alignment quality is to compile the percentage of correctly aligned residues, i.e. of residues aligned identically in the predicted and the true alignment.
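As an illustration, the percentage could be computed as in the following Python sketch. Two assumptions are made here that the text does not confirm: an alignment is represented as a mapping from residue positions in U to residue positions in the database protein, and the count is normalised by the number of residues aligned in the true alignment. The function name is hypothetical.

```python
def percent_correct(predicted, true):
    """Percentage of truly aligned residues that the prediction aligns identically.

    predicted, true: dicts mapping a residue position in U to the position of
    its partner residue in the database protein (assumed representation).
    """
    if not true:
        return 0.0
    same = sum(1 for i, j in true.items() if predicted.get(i) == j)
    return 100.0 * same / len(true)  # assumed normalisation: true alignment length


true_ali = {1: 1, 2: 2, 3: 3, 4: 4}
pred_ali = {1: 1, 2: 2, 3: 4, 5: 6}
print(percent_correct(pred_ali, true_ali))  # 2 of 4 true pairs reproduced: 50.0
```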

Alignment shift analysis

A more advanced way to compare the predicted and true alignment on a residue-by-residue basis is the computation of the alignment shift score. This score is defined by:

The definition uses the index i to label the aligned residues, the number of residues aligned in the true alignment, a scaling factor α (currently set to 0.8), and the shift, which is defined by:

The final accuracy is defined as:

In one of the two extreme cases, the shift accuracy is identical to the percentage of residues identically aligned in the true and predicted alignment: if all residues are correctly aligned, the accuracy is 100%. However, if no residue is correctly aligned, the shift accuracy may still be > 0. The accuracy reflects the correctness of the aligned residues. It may be that, compared with the region aligned by structural superposition, most residues are correctly aligned in the prediction, but that many additional residues are falsely aligned. This is captured by the shift coverage:

The shift coverage involves the number of residues aligned in the prediction. A coverage > 1 reflects an over-prediction. Under-prediction (coverage < 1) is implicitly reflected by the definition of accuracy, as that value can become 100% only if all truly aligned residues are predicted.

Additionally, a histogram is reported that gives the number of residues shifted by values of S = 0, ..., 5 (5 is a default value, which you can change by a command line argument).

Is the alignment shift score of the predicted alignment impressive or not? One way to answer this question is by introducing a random background: assume the N- and C-termini of the aligned pair are fixed. The only degree of freedom left for aligning the pair is the introduction of insertions. AQUAshift generates a number of random alignments (for fixed ends) and compiles the resulting levels of accuracy. These values are reported as the 'random shift accuracy'.
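The shift histogram mentioned above can be sketched in Python as follows. Since the exact definition of the shift is not reproduced here, the sketch assumes that the shift of residue i is the absolute difference between the partner positions assigned to i by the predicted and by the true alignment. Residues that are unaligned in the prediction, or shifted by more than the maximum, are simply not counted. All names are hypothetical.

```python
def shift_histogram(predicted, true, max_shift=5):
    """Count residues shifted by S = 0, ..., max_shift.

    predicted, true: dicts mapping a residue position in U to the position of
    its partner residue in the database protein (assumed representation).
    Returns a list whose element S is the number of residues shifted by S.
    """
    counts = [0] * (max_shift + 1)
    for i, j_true in true.items():
        j_pred = predicted.get(i)
        if j_pred is None:
            continue  # residue not aligned in the prediction
        shift = abs(j_pred - j_true)  # assumed definition of the shift
        if shift <= max_shift:
            counts[shift] += 1
    return counts


true_ali = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}
pred_ali = {1: 1, 2: 3, 3: 4, 4: 10, 6: 6}
print(shift_histogram(pred_ali, true_ali))  # [1, 2, 0, 0, 0, 0]
```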

Fitness of alignment for 3D modelling

xx to be filled in by BR


Level 2: Search list

The overall analysis of the performance for a list of pairs (search with sequence U against a database) consists of three parts: (1) overall list scores reflecting the accuracy for the first correct hit and the relation between coverage and accuracy for the entire list; (2) averages compiled over the list; and (3) the results for all pair analyses (see above). To keep the output simple, only some overall scores for the rank analysis are compiled on the level of a single search list.

First correct hit

For one search list only the rank of the first correct hit will be given (=0, if none found).
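A short Python sketch of this score, assuming the search list is given as a sequence of booleans ordered by rank (True for a true homologue); the function name is hypothetical.

```python
def first_correct_rank(hit_flags):
    """Return the 1-based rank of the first correct hit, or 0 if there is none."""
    for rank, is_correct in enumerate(hit_flags, start=1):
        if is_correct:
            return rank
    return 0


print(first_correct_rank([False, False, True, True]))  # 3
print(first_correct_rank([False, False]))              # 0
```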

Half decay rank

Two different scores describing the half-good, half-bad rank may be of practical interest. The first is the rank at which the number of correct hits equals the number of false positives above this rank:

The higher this number, the better the method. If no rank is found that fulfils this condition a value of zero is returned. If more than one is found, the maximum is taken. A similar concept is that of the rank at which the number of correct hits below this rank equals the number of false positives above:

For a perfect method, this rank would be the number of true homologues. Thus, for identical lists (numbers of true homologues), a lower value reflects a better method. This rank is unique if it exists, but it may not be defined for a given list. In such a case a zero is returned.
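The first of the two half-decay ranks can be sketched in Python as follows. The sketch rests on one reading of the verbal definition, which is an assumption: both counts are taken cumulatively from the top of the list down to and including the rank in question. The largest rank with equal counts is returned, and zero if there is none. The second score is not sketched here.

```python
def half_decay_rank(hit_flags):
    """Largest rank at which correct hits and false positives so far are equal.

    hit_flags: booleans ordered by rank, True for a true homologue.
    Returns 0 if no rank fulfils the condition.
    """
    correct = 0
    wrong = 0
    best = 0
    for rank, is_correct in enumerate(hit_flags, start=1):
        if is_correct:
            correct += 1
        else:
            wrong += 1
        if correct == wrong:
            best = rank  # keep the maximum if several ranks qualify
    return best


# Ranks 1-2 and 4 are correct; the counts are balanced (3 vs 3) at rank 6.
print(half_decay_rank([True, True, False, True, False, False, False]))  # 6
```

Under this reading, a list that opens with many correct hits balances late, which matches the statement that a higher value indicates a better method.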

Reliability (z-score) dependent rank analysis

Given a reliability score (Z) for each pair (z-score, energy value, probability, etc.), accuracy and coverage of the rank will be compiled for different intervals of:

where the index i labels the pair and v the interval. For each interval, all pairs that do not fall into the interval are ignored. The z-dependent accuracy is defined by:

with:

where the rank is defined by the z cut-off of interval v. The z-dependent coverage is defined by:

The coverage is normalised by the number of true homologues. This cumulative coverage gives the percentage of all possible correct hits that are predicted if the list is terminated at a given rank.

Cumulative accuracy and coverage vs. rank

The rank-dependent cumulative accuracy and coverage are computed in analogy to the z-dependent scores.
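A possible Python sketch of the rank-dependent scores for one search list. It assumes that the cumulative accuracy at rank R is the percentage of the first R hits that are correct, and that the cumulative coverage is the number of correct hits among the first R as a percentage of all true homologues. The function name and example are hypothetical.

```python
def cumulative_scores(hit_flags, n_true):
    """Cumulative accuracy and coverage for every rank of one search list.

    hit_flags: booleans ordered by rank, True for a true homologue.
    n_true:    number of true homologues in the standard of truth.
    Returns a list of (rank, accuracy_percent, coverage_percent).
    """
    scores = []
    correct = 0
    for rank, is_correct in enumerate(hit_flags, start=1):
        if is_correct:
            correct += 1
        accuracy = 100.0 * correct / rank
        coverage = 100.0 * correct / n_true if n_true else 0.0
        scores.append((rank, accuracy, coverage))
    return scores


for row in cumulative_scores([True, False, True, False], n_true=4):
    print(row)
```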

Alignment shift and 3D comparison

All scores compiled on the level of pairs to describe the accuracy of a given alignment apply, as well, on the other levels. Currently, AQUA returns, for simplicity, the assessment of the alignment accuracy for the first correct hit, only.


Level 3: Test set

To assess the performance of a method on entire test sets of proteins {U}, each aligned against a database, a hierarchy is imposed by the simplicity of the task. The simplest question for evaluating a method is: how often is the first hit correct? Various more detailed scores for the rank correlation are compiled (rank-dependent cumulative accuracy, average half-decay ranks, z-dependent accuracy and coverage). A stronger demand is that the correctly predicted homologue is also correctly aligned (alignment shift scores for the first correct hit). The final question is, of course: how well can we predict 3D structure given an alignment of a sequence with a protein of known structure (alignment 3D test)? The following scores are compiled to capture the performance of methods on test sets.

Rank-dependent cumulative accuracy

The rank-dependent cumulative accuracy (usually compiled for ranks R = 1, ..., 50) is defined by:

with:

The normalisation uses the number of test sequences U in the test set. This accuracy addresses the following question: how often is at least one correct hit to be found among the first R hits? Or: for which percentage of the test proteins is a correct hit found if the lists are cut off at rank R? The cap min{1, x} ensures that each list is counted at most once, however many correct hits it contains within the first R ranks. The special case Q(1), i.e. the percentage of correct first hits, is one of the most important simple descriptors for, e.g., threading methods (Rost, 1995a; Rost, 1995b; Rost et al., 1996).
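Following the verbal description, Q(R) can be sketched in Python as the percentage of search lists with at least one correct hit among the first R ranks. The data layout (one list of booleans per search sequence) and the function name are assumptions made for this illustration.

```python
def q_of_r(test_set, max_rank):
    """Rank-dependent cumulative accuracy Q(R) for R = 1, ..., max_rank.

    test_set: one list of booleans per search sequence U, ordered by rank,
              True for a true homologue.
    Returns a dict mapping R to the percentage of lists with a correct hit
    among their first R ranks.
    """
    n_lists = len(test_set)
    result = {}
    for r in range(1, max_rank + 1):
        # min(1, x) counts each list at most once, however many hits it has.
        found = sum(min(1, sum(1 for ok in flags[:r] if ok)) for flags in test_set)
        result[r] = 100.0 * found / n_lists if n_lists else 0.0
    return result


lists = [
    [True, False, False],
    [False, True, False],
    [False, False, False],
    [False, False, True],
]
print(q_of_r(lists, max_rank=3))  # {1: 25.0, 2: 50.0, 3: 75.0}
```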

Half-decay rank

The half-decay ranks are simply averaged over all lists. The average over the rank at which the number of correct hits equals the number of false positives above this rank is:

The higher this number, the better the method. The average over the rank at which the number of correct hits below this rank equals the number of false positives above is:

where:

For identical test sets, a lower value reflects a better method. (The normalisation accounts for lists for which this rank was not defined.)
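One way such a normalisation could work is sketched below in Python: the average is taken only over the lists for which the rank is defined. This is an assumption about how undefined ranks are handled, and it relies on the convention stated for single lists that an undefined rank is reported as zero. The function name is hypothetical.

```python
def average_defined_rank(ranks):
    """Average of per-list half-decay ranks, skipping undefined (zero) entries.

    ranks: one half-decay rank per search list; 0 marks 'not defined'.
    Returns 0.0 if the rank is undefined for every list.
    """
    defined = [r for r in ranks if r > 0]
    if not defined:
        return 0.0
    return sum(defined) / len(defined)


print(average_defined_rank([6, 0, 2, 4]))  # (6 + 2 + 4) / 3 = 4.0
```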

z-dependent cumulative accuracy

Given a reliability score (Z) for each pair (z-score, energy value, probability, etc.), accuracy and coverage of the rank will be compiled for different intervals of:

where the index i labels the pair in the list for one test sequence p, and v the interval. The z-dependent accuracy is defined by:

where the rank is defined by the z cut-off of interval v. This score answers the question: what percentage of pairs is predicted correctly if all predictions with z-scores smaller than the cut-off of interval v are ignored?

z-dependent coverage

Most methods will have to find a balance between accuracy and coverage, i.e., between the number of hits (predicted homologues) correctly predicted and the number of correct hits (true homologues) predicted. This coverage is of particular importance if all predictions that do not pass a given cut-off z-score are ignored. The following definition for the z-dependent coverage is used:

The definition involves the number of true homologues for test sequence p. This cumulative coverage gives the number of hits predicted correctly up to the cut-off rank determined by the z-score, as a percentage of the hits that could have been predicted correctly up to that rank. Quoting both scores, cumulative accuracy and coverage, makes it possible to compare two methods even if the lists are not equally long and the z-score intervals are defined differently.

Alignment shift and 3D comparison

All scores compiled on the level of pairs to describe the accuracy of a given alignment apply, as well, on the other levels. Currently, AQUA returns, for simplicity, the assessment of the alignment accuracy for the first correct hit, only. The average shift score is:

The definition involves the number of residues aligned in the true alignment and the number of residues by which residue i in the first correct pair alignment of the prediction for search sequence p is shifted with respect to the corresponding residue in the true alignment. Additionally, the histogram of the number of residue shifts of 0, ..., 20 for the first correct hits in all test proteins is returned.


Analysis of alignment shift

to be continued by BR



Analysis of alignment in 3D

to be filled in by LH



Nomenclature



References

