AQUA - Alignment Quality Assessment
TOC for Technical Description
- Overview
- What is done?
Level 1: Pairs, Level 2: Search list, Level 3: Test set
- When may you need AQUA?
- What is the input?
- DAF format
- Your prediction , File formats for prediction
- Standard of truth , File formats for standard of truth
- Search set , File format for search set
- Exclude set , File format for exclude set
- Reliability (or z-scores)
- Evaluating test sets , Formats for test sets
- What is the output?
Example Full ---
Example List Accuracy
- How to run the program?
Standard input , Get help , Selected optional arguments, Use default file
- Definition of scores
- Level 1: Pairs
Rank analysis , Percentage of correctly aligned residues , Alignment shift analysis , Fitness of alignment for 3D modelling
- Level 2: Search list
First correct hit , Half decay rank , Reliability z-score dependent rank analysis , Rank-dependent cumulative accuracy and coverage , Alignment shift and 3D comparison
- Level 3: Test set
Cumulative rank accuracy , Half-decay rank , z-dependent accuracy , z-dependent coverage , alignment shift and 3D comparison
- Analysis of alignment shift
- Analysis of alignment in 3D
- Nomenclature
- References
The program tackles the problem of assessing the alignment quality on the following three different levels.
The alignment between a search sequence (U) and a protein from the database is assessed by:
- checking whether the two proteins are true homologues,
- compiling the percentage of correctly aligned residues,
- compiling an alignment shift score that decays exponentially with the number of residues by which the predicted alignment is displaced with respect to the true alignment (e.g. the structural alignment),
- and, for alignments between proteins of known structure, comparing the similarity for the best structural superposition of the pair with the similarity for the superposition resulting from the predicted alignment.
All alignments listed for a search with U against the database are assessed by:
- successively applying Level 1 to each pair,
- compiling overall scores for the entire list which reflect the balance between correct hits ('true' homologues) and incorrect ones ('false' positives), and the quality of the first correct (resp. all correct) alignment(s),
- compiling the accuracy as a function of the 'reliability score' (often a z-score) provided by the prediction method.
All alignments listed for a search with U against the database are assessed. The final overall scores are the same as the ones compiled on Level 2 . Additionally, the following results are given (all values are averages over all search lists):
- the correctness of all hits down to rank R ,
- the quality of all alignments down to rank R ,
- the coverage / accuracy ratio down to rank R.
Typical applications of AQUA are the evaluation of methods for:
- fold recognition (threading),
- sequence alignments,
- and for multiple sequence alignments.
The precondition for using AQUA is the existence of some standard of truth. This can be a list of database proteins for which it is known that they are homologous to the search sequence U (similar function; and/or similar structure). For the assessment of alignment quality we provide, by default, structural alignments generated by DALI (Holm & Sander, 1993); 'true' alignments are taken from the latest version of the FSSP database of structural alignments (Holm & Sander, 1994). Given a standard of truth you can use AQUA on three levels: (1) to assess the quality of a pair alignment (pair ); (2) to evaluate a search of one sequence U against a database (list ); and (3) to evaluate an entire test set of search sequences {U} aligned against a database (test set). On all levels, scores will also be compiled if you are only interested in evaluating rank lists, i.e. whether or not a pair predicted to be homologous was a 'true' homologous pair.
Standard inputs to AQUA are a pair alignment, a list of pair alignments from a search with one sequence U against a database, or the results for a test set, i.e. lists of pair alignments from searches with a set of sequences {U} against a database (optional input described below). Note: a special case is a simple hit list that would address the following question: if I search in the database for all proteins similar to U , at which rank appears protein x in the final list; and at which ranks are other correct hits (homologues) to be found? Note: all pairs of the prediction are evaluated (if your alignments are provided in the DAF format, you can mark pairs you want to be excluded from analysis). Overall scores (e.g. accuracy and coverage) refer to the list of alignments as provided by your prediction.
Additionally, you can provide the standard of truth (filename is second argument on command line). The standard of truth consists of a list of pairs considered to be true homologues and - optionally - the alignments considered to be true alignments (e.g. structural alignments of all homologous pairs). The default 'standard of truth' are structural alignments taken from FSSP (Holm & Sander, 1994).
As an option you can provide the list of identifiers of the proteins you used as the database , i.e. against which you aligned U . The effect of the search set is that all pairs found in the standard of truth that are missing in your search set will be ignored.
As an option you can provide the list of identifiers of the proteins that ought to be excluded from the analysis. For instance, for remote homology detection these would be all proteins that have significant levels of pairwise sequence identity (>25%) to the search sequence U .
If you provide a score reflecting the reliability of a given hit (z-scores, energy values, probabilities), AQUA will return an analysis of accuracy vs. coverage for the assumptions that the list is terminated at various cut off values defined by your reliability score. These score-specific results are compiled at intervals:
reliability score pair i > maximal reliability score / k ,
with k = 1, ..., 5.
Note that the reliability-dependent results can be compiled only if you define a value for the maximal reliability.
If you want to provide the results from a search with sequence U against a database, all pairs have to be merged into a single file. For providing the results from an entire test set the following options are possible:
- list of filenames (where each file contains the results from aligning one sequence against a database),
- or all results merged into a single DAF formatted file.
The following file formats will be accepted (xx: after completion of the contributions from Reinhard and Liisa).
- DAF Dirty Alignment Format: a simple format for listing pair alignments
- HSSP format used for the HSSP database of multiple sequence alignments (Sander & Schneider, 1994)
- MSF Multiple Sequence alignment format: as output by, e.g., the GCG package
- DAF Dirty Alignment Format: a simple format for listing pair alignments
- FSSP format used for the FSSP database of structural alignments (Holm & Sander, 1994)
Note: if you use the DAF format for the representation of the standard of truth, you can use a column named 'confidence' that assigns, on a scale from 0 (low) to 9 (high), a value reflecting the confidence you assign to a given pair. This allows you to mark putative homologues or, more generally, to weight pairs differently, e.g. a pair that may be correct versus one that clearly is not.
These optional input files have to meet the following specifications (example):
- lines beginning with a hash (#) are ignored
- one line per identifier
- or: one line per pair, pairs separated by multiple blanks or tabulators (this is if you want to provide the exclude set for a test set!)
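The three rules above can be sketched as a small parser. This is a minimal illustration, not AQUA's own reader: the function name `parse_id_list` and the identifiers in the usage example are hypothetical, and skipping blank lines is an added assumption that the specification does not state.

```python
from typing import Iterable, List, Tuple


def parse_id_list(lines: Iterable[str]) -> List[Tuple[str, ...]]:
    """Parse a search-set or exclude-set listing.

    Each returned entry is a tuple holding one identifier (one id per line)
    or two identifiers (one pair per line, separated by blanks or tabs).
    """
    entries: List[Tuple[str, ...]] = []
    for raw in lines:
        line = raw.strip()
        # Lines beginning with a hash are ignored; blank lines are skipped
        # as well (assumption, not stated in the format description).
        if not line or line.startswith("#"):
            continue
        # str.split() without arguments splits on any run of blanks or tabs.
        entries.append(tuple(line.split()))
    return entries
```

For example, `parse_id_list(open("exclude.list"))` would return one tuple per non-comment line of a hypothetical file `exclude.list`.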
- List of files (in formats DAF, HSSP, or MSF)
- DAF (all pair lists merged into one file)
Two different output files are generated:
- full table, i.e., all results from Aqua for one list (example),
- and the cumulative accuracy vs. the rank (example)
Both files are written out in two formats:
- RDB , i.e. Relational Data Base format (tabulator separated columns),
- and HTML, this is a conversion of the previous files (example).
(The notation used for that file is explained in its header.) Various files with more detailed information will be created for debugging purposes (use the word debug on the command line to make sure debugging output will not be deleted). If you want to evaluate the performance of your method on an entire test set, all results compiled by AQUA will by default be appended to one file (with the command line argument manyFiles, the results for each search sequence U will be written into a separate RDB output file).
The standard use is by providing your prediction in one file:
/home/rost/pub/aqua.pl filePred.daf
To get some explanations on the use of AQUA, type:
/home/rost/pub/aqua.pl filePred.daf
The result will be:
-------------------------------------------------------------------------------
---
--- Perl script
--- Task: assessing the quality of alignments
--- Input: filePred fileTrue options
---
--- Optional:
--- fileSetSearch=
--- lists the identifiers of the search set
--- (one line per id)
--- fileExclude= lists the identifiers of proteins to be excluded from
--- analysis (one line per id; e.g. >25% seq identity)
--- title= title of output files (will be named: 'title.rdb', asf)
---
--- noAnaShift the 'alignment shift' scores will not be compiled
--- noAnaModel the quality of the alignment model will not be evaluated
---
--- debug intermediately created files will NOT be deleted
--- notScreen no information written onto screen
--- dirIn= input directory default: local
--- dirOut= output directory default: local
--- dirWork= working directory default: local
---
--- Further command line arguments as in 'Defaults.aqua'. To list optional keys:
--- 'aqua opt'
---
-------------------------------------------------------------------------------
The following additional options are available on the command line (or via the default file):
- inclRaliPred = x analyse predicted pair only if: 2*lenAliPred / (lenSeqPred + lenStrPred) > x
- inclRaliTrue =x analyse correctly predicted pair only if for the true pair: 2*lenAliTrue / (lenSeqTrue + lenStrTrue) > x
- inclRali = x effect: inclRaliPred=inclRaliTrue=inclRali
- inclRelPred = x analyse predicted pair only if: zscore > x
- inclRelTrue = x analyse correctly predicted pair only if for the true pair: zscore > x (e.g. zDali)
- inclRel = x effect: inclRelPred=inclRelTrue=inclRel
- inclConfMin = x all true alignments with a value for the confidence > x will be regarded as true hits
- confMax = x maximum of score used for confidence interval (default = 9)
- confMin = x minimum of score used for confidence interval (default = 0)
- shiftNhisto = N number of histogram points for shift score, i.e. report number of residues with shifts 0, ..., N (default = 5)
- shiftAlpha = x shiftScore = exp(-x * shift) (default = 1.5)
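The inclRaliPred and inclRaliTrue criteria listed above compare the alignment length with the lengths of the two proteins. A minimal sketch of that test follows; the function names are hypothetical and the numbers in the usage note are made up for illustration.

```python
def alignment_length_ratio(len_ali: int, len_seq: int, len_str: int) -> float:
    """Return 2 * lenAli / (lenSeq + lenStr), the ratio used by inclRali*."""
    return 2.0 * len_ali / (len_seq + len_str)


def passes_incl_rali(len_ali: int, len_seq: int, len_str: int, x: float) -> bool:
    """True if the pair would be analysed for a threshold inclRali = x."""
    return alignment_length_ratio(len_ali, len_seq, len_str) > x
```

With an alignment of 80 residues between two proteins of 100 residues each, the ratio is 0.8, so the pair would be kept for a threshold of 0.5 and dropped for a threshold of 0.9.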
You can give all optional arguments on the command line or in a default file (copy /home/rost/pub/Defaults.aqua ). Note that command line arguments will override the settings in the default file.
The scores used to assess the quality of your prediction can be viewed in a hierarchy: pair, list of pairs (search set), list of lists of pairs (test set). The following scores will be compiled by AQUA. Note that most of the scores defined are explained in some detail elsewhere (Rost, 1995a; Rost, 1995b; Rost et al., 1996).
On the level of pairs, the rank analysis consists simply of checking whether or not the pair is a true homologue (binary). If it is not, all further analysis will be omitted.
The simplest way to assess the alignment quality is to compile the percentage of correctly aligned residues, i.e. the percentage of residues aligned identically in the predicted and the true alignment.
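A more advanced way to compare the predicted and true alignment on a residue-by-residue basis is the computation of the alignment shift score. This score is defined by:

$$ \mathrm{Score}_{shift} = \sum_{i=1}^{N_{ali}^{true}} \exp(-\alpha \, S_i) $$

where the index i labels the aligned residues; $N_{ali}^{true}$ is the number of residues aligned in the true alignment, $\alpha$ is a scaling factor (currently set to 0.8), and the shift $S_i$ is defined by:

$$ S_i = | j_i^{pred} - j_i^{true} | $$

i.e. the number of residues by which the partner $j_i^{pred}$ of residue i in the predicted alignment is displaced with respect to its partner $j_i^{true}$ in the true alignment.
The final accuracy is defined as:

$$ \mathrm{Acc}_{shift} = 100 \cdot \frac{\mathrm{Score}_{shift}}{N_{ali}^{true}} $$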
For one of the two extreme cases, the shift accuracy is identical to the percentage of residues identically aligned in the true and predicted alignment: if all residues are correctly aligned, the accuracy is 100%. However, if no residue is correctly aligned, the shift accuracy may be > 0.
The accuracy reflects the correctness of the aligned residues. It may be that, compared with the region aligned by structural superposition, most residues are correctly aligned in the prediction, but that many additional residues are falsely aligned. This is captured by the shift coverage:

$$ \mathrm{Cov}_{shift} = \frac{N_{ali}^{pred}}{N_{ali}^{true}} $$

where $N_{ali}^{pred}$ is the number of residues aligned in the prediction and $N_{ali}^{true}$ is the number of residues aligned in the true alignment. A coverage > 1 reflects an over-prediction. Under-prediction (coverage < 1) is implicitly reflected by the definition of accuracy, as that value can become 100% only if all truly aligned residues are predicted.
Additionally, a histogram is reported that gives the number of residues shifted by values of S = 0, ..., 5 (5 is the default value, which you can change with the command line argument shiftNhisto).
Is the alignment shift score of the predicted alignment impressive or not? One way to answer this question is to introduce a random background: assume the N- and C-termini of the aligned pair are fixed. The only degree of freedom left for aligning the pair is the introduction of insertions. AQUAshift generates a number of random alignments (for fixed ends) and compiles the resulting levels of accuracy. These values are reported as the 'random shift accuracy'.
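The shift accuracy, shift coverage, and shift histogram can be illustrated with a short sketch. It is not AQUA's implementation and rests on several assumptions: an alignment is represented as a dictionary mapping a residue index of U to the residue index of its partner in the database protein; the shift of a residue is the absolute difference between its predicted and true partner; residues aligned in the truth but missing from the prediction contribute zero; and the score is normalised by the number of truly aligned residues. The default alpha of 0.8 follows the text above (the shiftAlpha option lists 1.5 as its default). All function names are hypothetical.

```python
import math
from typing import Dict, List

Alignment = Dict[int, int]  # residue index in U -> residue index in partner


def shift_accuracy(pred: Alignment, true: Alignment, alpha: float = 0.8) -> float:
    """Percentage score: 100 / N_true * sum over true residues of exp(-alpha * shift)."""
    if not true:
        raise ValueError("true alignment must contain at least one residue")
    total = 0.0
    for residue, true_partner in true.items():
        if residue in pred:  # unpredicted residues contribute 0 (assumption)
            total += math.exp(-alpha * abs(pred[residue] - true_partner))
    return 100.0 * total / len(true)


def shift_coverage(pred: Alignment, true: Alignment) -> float:
    """Residues aligned in the prediction per residue aligned in the truth."""
    return len(pred) / len(true)


def shift_histogram(pred: Alignment, true: Alignment, n_histo: int = 5) -> List[int]:
    """Count residues with shift 0, 1, ..., n_histo (larger shifts are dropped)."""
    counts = [0] * (n_histo + 1)
    for residue, true_partner in true.items():
        if residue in pred:
            shift = abs(pred[residue] - true_partner)
            if shift <= n_histo:
                counts[shift] += 1
    return counts
```

A prediction identical to the truth scores 100; a prediction displaced by one residue everywhere scores 100 * exp(-alpha), which shows why the shift accuracy can be above zero even when no residue is aligned correctly.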
xx to be filled in by BR
The overall analysis of the performance for a list of pairs (search with sequence U against a database) consists of three parts: (1) overall list scores reflecting the accuracy for the first correct hit and the relation between coverage and accuracy for the entire list; (2) averages compiled over the list; and (3) the results for all pair analyses (see above). To keep the output simple, on the level of a single search list only some overall scores for the rank analysis are compiled.
For one search list only the rank of the first correct hit will be given (=0, if none found).
Of practical interest may be two different scores describing the half-good, half-bad rank. The first is the rank $R_{1/2}$ at which the number of correct hits equals the number of false positives above this rank:

$$ \sum_{i=1}^{R_{1/2}} \delta_i = \sum_{i=1}^{R_{1/2}} (1 - \delta_i) $$

where $\delta_i = 1$ if the pair at rank i is a true homologue, and $\delta_i = 0$ otherwise. The higher this number, the better the method. If no rank is found that fulfils this condition, a value of zero is returned. If more than one is found, the maximum is taken. A similar concept is that of the rank $R'_{1/2}$ at which the number of correct hits below this rank equals the number of false positives above:

$$ \sum_{i > R'_{1/2}} \delta_i = \sum_{i=1}^{R'_{1/2}} (1 - \delta_i) $$

For a perfect method, $R'_{1/2}$ would be the number of true homologues. Thus, for identical lists (numbers of true homologues), a lower value of $R'_{1/2}$ reflects a better method. The value of $R'_{1/2}$ is uniquely defined, but it may not be defined for a given list. In such a case a zero is returned.
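The rank of the first correct hit and the two half-decay ranks can be sketched as follows. This is an illustrative reading of the definitions above, not AQUA's code. A search list is represented as a sequence of booleans ordered by rank (True for a true homologue). Two assumptions are made: "above this rank" includes the rank itself, and "correct hits below" counts only hits that appear in the list. The function names are hypothetical.

```python
from typing import Sequence


def first_correct_rank(hits: Sequence[bool]) -> int:
    """Rank (1-based) of the first correct hit, or 0 if there is none."""
    for rank, is_hit in enumerate(hits, start=1):
        if is_hit:
            return rank
    return 0


def half_decay_rank_equal(hits: Sequence[bool]) -> int:
    """Largest rank at which correct hits equal false positives down to
    that rank; 0 if no such rank exists."""
    best = 0
    n_correct = 0
    for rank, is_hit in enumerate(hits, start=1):
        n_correct += 1 if is_hit else 0
        if n_correct == rank - n_correct:
            best = rank  # keep the maximum, as the text prescribes
    return best


def half_decay_rank_split(hits: Sequence[bool]) -> int:
    """First rank at which the correct hits below the rank equal the false
    positives down to the rank; 0 if no such rank exists."""
    n_listed_true = sum(1 for is_hit in hits if is_hit)
    n_correct = 0
    for rank, is_hit in enumerate(hits, start=1):
        n_correct += 1 if is_hit else 0
        false_above = rank - n_correct
        correct_below = n_listed_true - n_correct
        if false_above == correct_below:
            return rank
    return 0
```

For a perfectly sorted list with three true homologues followed by false positives, `half_decay_rank_split` returns 3, the number of true homologues, in line with the statement above.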
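Given a reliability score (Z) for each pair (z-score, energy value, probability, etc.), accuracy and coverage of the rank will be compiled for different intervals of:

$$ Z_i > Z_{max} / v , \qquad v = 1, ..., 5 $$

where the index i labels the pair, v the interval, and $Z_{max}$ is the maximal reliability score. For each interval, all pairs that do not fall into the interval will be ignored. The z-dependent accuracy is defined by:

$$ Q_z(v) = \frac{100}{R_v} \sum_{i=1}^{R_v} \delta_i $$

with:

$$ \delta_i = 1 \text{ if pair } i \text{ is a true homologue, } 0 \text{ otherwise} $$

where the rank $R_v$ is defined by the z cut-off of interval v. The z-dependent coverage is defined by:

$$ C_z(v) = \frac{100}{N_{hom}} \sum_{i=1}^{R_v} \delta_i $$

where $N_{hom}$ gives the number of true homologues. This cumulative coverage gives the percentage of all possible correct hits that are predicted if the list is terminated at a given rank $R_v$.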
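A minimal sketch of the z-dependent accuracy and coverage for one search list follows. It assumes that a higher reliability score means a more reliable hit (energy values would need their sign flipped first), that the cut-offs are the maximal reliability score divided by v = 1, ..., 5 as described for the reliability input, and that an interval without any pair gets an accuracy of 0. The function name and the data in the usage note are hypothetical.

```python
from typing import List, Sequence, Tuple


def z_dependent_scores(
    pairs: Sequence[Tuple[float, bool]],
    z_max: float,
    n_true: int,
    n_intervals: int = 5,
) -> List[Tuple[int, float, float, float]]:
    """Return (v, cut-off, accuracy %, coverage %) for v = 1..n_intervals.

    pairs  : (reliability score, is true homologue) for every listed pair
    z_max  : maximal reliability score, supplied by the user
    n_true : number of true homologues in the standard of truth
    """
    results = []
    for v in range(1, n_intervals + 1):
        cutoff = z_max / v
        # Keep only the pairs whose score exceeds the cut-off of interval v.
        kept = [is_hit for score, is_hit in pairs if score > cutoff]
        n_correct = sum(kept)
        accuracy = 100.0 * n_correct / len(kept) if kept else 0.0
        coverage = 100.0 * n_correct / n_true if n_true else 0.0
        results.append((v, cutoff, accuracy, coverage))
    return results
```

Calling `z_dependent_scores([(10, True), (6, True), (4, False), (3, True)], z_max=8, n_true=3)` shows the usual trade-off: the strict cut-offs keep only correct hits but miss homologues, while the lenient cut-offs recover all homologues at a lower accuracy.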
The rank-dependent cumulative accuracy and coverage are computed in analogy to the z-dependent scores.
All scores compiled on the level of pairs to describe the accuracy of a given alignment apply, as well, on the other levels. Currently, AQUA returns, for simplicity, the assessment of the alignment accuracy for the first correct hit, only.
For finally assessing the performance of a method on entire test sets of proteins {U} each aligned against a database a hierarchy is imposed by the simplicity of the task. The most simple question to evaluate a method is: how often is the first hit correct? Various more detailed scores for the rank correlation are compiled (rank-dependent cumulative accuracy, average half-decay ranks, z-dependent accuracy and coverage). A stronger demand is that the correctly predicted homologue is correctly aligned (alignment shift scores for first correct hit). The final question is of course, how well can we predict 3D structure given an alignment of a sequence with a protein of known structure (alignment 3D test)? The following scores are compiled to capture the performance of methods on test sets.
The rank-dependent cumulative accuracy (usually compiled for ranks R = 1, ..., 50) is defined by:

$$ Q(R) = \frac{100}{N_{prot}} \sum_{p=1}^{N_{prot}} \min \Big\{ 1 , \sum_{i=1}^{R} \delta_i^p \Big\} $$

with:

$$ \delta_i^p = 1 \text{ if the hit at rank } i \text{ for test sequence } p \text{ is a true homologue, } 0 \text{ otherwise} $$

where $N_{prot}$ is the number of test sequences U in the test set. This accuracy addresses the following question: how often is at least one correct hit to be found among the first R hits? Or: in which percentage of the test proteins is the correct hit to be found if the lists are cut off at rank R? The use of min{1, x} ensures that each list is counted at most once, i.e. that only the best hit in each list is considered. The special case Q(1), i.e., the percentage of correct first hits, is one of the most important simple descriptors for, e.g., threading methods (Rost, 1995a; Rost, 1995b; Rost et al., 1996).
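The cumulative rank accuracy can be sketched in a few lines. The sketch assumes that each search list is given as a sequence of booleans ordered by rank (True for a true homologue) and that a list contributes 1 if at least one of its first R hits is correct, which is one way to read the min{1, x} construction. The function name is hypothetical.

```python
from typing import Sequence


def cumulative_rank_accuracy(lists: Sequence[Sequence[bool]], rank: int) -> float:
    """Percentage of search lists with at least one correct hit in the top `rank`."""
    if not lists:
        raise ValueError("test set must contain at least one search list")
    # min(1, number of correct hits in the top ranks) is 1 or 0 per list.
    n_found = sum(min(1, sum(hits[:rank])) for hits in lists)
    return 100.0 * n_found / len(lists)
```

`cumulative_rank_accuracy(lists, 1)` then corresponds to Q(1), the percentage of test sequences whose first hit is correct.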
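The half-decay ranks are simply averaged over all lists. The average over the rank $R_{1/2}^p$ at which the number of correct hits equals the number of false positives above this rank is:

$$ \langle R_{1/2} \rangle = \frac{1}{N_{prot}} \sum_{p=1}^{N_{prot}} R_{1/2}^p $$

where $N_{prot}$ is the number of test sequences in the test set. The higher this number, the better the method. The average over the rank $R'^{\,p}_{1/2}$ at which the number of correct hits below this rank equals the number of false positives above is:

$$ \langle R'_{1/2} \rangle = \frac{1}{N'_{prot}} \sum_{p} R'^{\,p}_{1/2} $$

where:

$$ N'_{prot} = \text{number of lists for which } R'^{\,p}_{1/2} \text{ is defined} $$

For identical test sets, a lower value of $\langle R'_{1/2} \rangle$ reflects a better method. (The normalisation with $N'_{prot}$ accounts for lists for which this rank was not defined.)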
Given a reliability score (Z) for each pair (z-score, energy value, probability, etc.), accuracy and coverage of the rank will be compiled for different intervals of:

$$ Z_i^p > Z_{max} / v , \qquad v = 1, ..., 5 $$

where the index i labels the pair in the list for test sequence p, and v the interval. The z-dependent accuracy is defined by:

where the rank
is defined by the z cut-off of interval v . This score answers the question: what is the percentage of pairs predicted correctly if all predictions for z-scores smaller than the one describing v are ignored?
Most methods will have to find a balance between accuracy and coverage, i.e., between the number of hits (predicted homologues) correctly predicted and the number of correct hits (true homologues) predicted. This coverage is of particular importance if all predictions above a given cut-off z-score are ignored. The following definition for the z-dependent coverage is used:

where
is the number of true homologues for test sequence p . This cumulative coverage gives the number of hits predicted correctly up to the z-score determined cut-off rank
as a percentage of hits that could have been predicted correctly up to that rank. Quoting both scores, cumulative accuracy and coverage, allows two methods to be compared even if the lists are not equally long and the z-score intervals are defined differently.
All scores compiled on the level of pairs to describe the accuracy of a given alignment apply, as well, on the other levels. Currently, AQUA returns, for simplicity, the assessment of the alignment accuracy for the first correct hit, only. The average shift score is:

where
is the number of residues aligned in the true alignment; and
is the number of residues which residue i in the first correct pair alignment of the prediction for search sequence p is shifted with respect to the corresponding residue in the true alignment. Additionally, the histogram of the number of residue shifts of 0, ..., 20 for the first correct hits in all test proteins is returned.
to be continued by BR
to be filled in be LH
- U: search sequence, i.e. the protein for which homologues are searched. In applications of the methods U is a protein of unknown structure and/or unknown function.
- database: here describing all proteins used for the alignment search (this can be a single protein, a list of unique proteins, a list of domains, the entire PDB or SWISS-PROT).
- search sequence: U
- search set: the database
- test set: set of search sequences {U} which are aligned successively against the database.
- homologue: protein of similar function or structure. The decision on either of the two (function or structure) - when in conflict - is made by the given standard of truth.
- standard of truth: the assignment of true homologues and - if applicable - the true alignment of the pair being assessed. Technically speaking, the standard of truth is a file in which the true assignments and alignments are listed.
- Holm, L. & Sander, C. (1993). Protein Structure Comparison by Alignment of Distance Matrices. J. Mol. Biol., 233, 123-138.
- Holm, L. & Sander, C. (1994). The FSSP database of structurally aligned protein fold families. Nucl. Acids Res., 22, 3600-3609.
- Rost, B. (1995a). Fitting 1-D predictions into 3-D structures. In Protein folds: a distance based approach (Bohr, H. & Brunak, S., eds.), pp. 132-151, CRC Press, Boca Raton, Florida.
- Rost, B. (1995b). TOPITS: Threading One-dimensional Predictions Into Three-dimensional Structures. In Third International Conference on Intelligent Systems for Molecular Biology (Rawlings, C., Clark, D., Altman, R., Hunter, L., Lengauer, T. et al., eds.), pp. 314-321, Menlo Park, CA: AAAI Press, Cambridge, England.
- Rost, B., Schneider, R. & Sander, C. (1996). Protein fold recognition by prediction-based threading. J. Mol. Biol., submitted Nov 27, 1995.
- Sander, C. & Schneider, R. (1994). The HSSP database of protein structure-sequence alignment. Nucl. Acids Res., 22, 3597-3599.