Notations in DAF format
Header of file
- First line :
the first line HAS to start with '# DAF'
- SOURCE :
name of file used to generate the DAF file
- ALISYM :
symbols allowed in alignments; e.g. 'ACDEFGHIKLMNPQRSTVWY.acdefghiklmnpqrstvwy'
- NPAIRS :
number of pairs listed
- NSEARCH :
number of search sequences listed (for evaluating entire test sets)
- ADDKEYS :
names of additional columns that you define yourself. AQUA does not interpret these columns; their data are simply mirrored in the output.
- RELKEYS :
names of columns describing reliability of predictions (e.g. zscores). Note: only the first name given will be used, all other zscores will only be mirrored in the output.
- RELHISTO :
defines the intervals for the analysis of accuracy vs. coverage at various zscore thresholds. For example, for the values 3, 2, 1, the first interval contains all pairs with z(i)>3 (i=1,...,NPAIRS), the second all pairs with z(i)>2, and the third all pairs with z(i)>1. Note: only the first of the zscores listed in the row 'RELKEYS' is used.
- n1,n2,...,nM :
for evaluating 1,...,M proteins of an entire test set: the commas separate the numbers valid for each of the search sequences U1, U2, ..., UM.
- ALIGNMENTS :
keyword to note that the next line will list the names of the columns used.
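
The header rules above can be illustrated with a minimal Python sketch. It is not part of AQUA; the function names are hypothetical, and it only covers what the descriptions state: the mandatory '# DAF' start of the first line, the ALISYM symbol set, and the cumulative RELHISTO thresholds (z>3, z>2, z>1). The zscores in the demo are made up.

```python
def is_daf_first_line(line):
    """True if the line starts with the mandatory '# DAF' marker."""
    return line.startswith("# DAF")


def illegal_symbols(alignment, alisym):
    """Return the sorted characters of `alignment` that ALISYM does not allow."""
    return sorted(set(alignment) - set(alisym))


def relhisto_counts(zscores, thresholds):
    """For each RELHISTO threshold t, count the pairs with zscore strictly above t.

    The intervals are cumulative: a pair with z > 3 is also counted for z > 2.
    """
    return [sum(1 for z in zscores if z > t) for t in thresholds]


# Demo with made-up values (hypothetical data, for illustration only).
ALISYM = "ACDEFGHIKLMNPQRSTVWY.acdefghiklmnpqrstvwy"
print(is_daf_first_line("# DAF"))                         # True
print(illegal_symbols("ACD.xEF", ALISYM))                 # ['x']
print(relhisto_counts([3.5, 2.4, 1.2, 0.3], [3, 2, 1]))   # [1, 2, 3]
```

Only the first column named in RELKEYS would supply the zscores passed to relhisto_counts.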
Body of file
- idSeq :
identifier of the search sequence U (dubbed 'sequence', with FOSFOS in mind: fitness of sequence for structure). This first sequence is regarded as the sequence for which the prediction is made, i.e., typically the input for the search with one protein against a database. For instance, this is the first sequence in an HSSP file. Notation for PDB IDs: 1pdbC, where 'C' gives the chain (if specified).
- idStr :
identifier of the database protein (dubbed 'structure', with FOSFOS in mind: fitness of sequence for structure). This second sequence is regarded as the target, e.g. for threading this is the protein found to be remotely homologous to the search protein. Notation for PDB IDs: 1pdbC, where 'C' gives the chain (if specified).
- conf :
when the DAF format is used to define the standard of truth, you can give a number between 0 (low) and 9 (high) reflecting your confidence that the pair is a true homologue. Currently, AQUA ignores all pairs with values < 9, i.e., conf<9 has the same effect as excluding the pair.
- rank :
rank at which the current pair appears in your alignment of a search sequence against a database.
- lenSeq :
length of search protein (or chain, or domain)
- lenStr :
length of the target protein (or chain, or domain)
- lenAli :
number of residues aligned between 'seq' and 'str'
- pide :
percentage (!!) of sequence identity = 100*(number of residues identical/lenAli)
- seq :
Full sequence of the search protein ('sequence'). YOU OUGHT to give correct and entire DSSP sequences here if you want an analysis of the alignment quality!!!
- str :
Full sequence of the target protein ('structure'). YOU OUGHT to give correct and entire DSSP sequences here if you want an analysis of the alignment quality!!!
- weightSeq :
residue-specific weight you may attach to the alignment. Each weight ought to be an integer between 0 and 9; if you cannot live with that, separate the residue weights by commas.
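
A few of the body columns can be sketched in Python as follows. This is an illustration under stated assumptions, not AQUA's own code: all function names are hypothetical, the PDB code is assumed to be exactly four characters, and a weightSeq field without commas is assumed to hold one digit per residue.

```python
def pide(n_identical, len_ali):
    """Percentage sequence identity: 100 * (identical residues / lenAli)."""
    if len_ali <= 0:
        raise ValueError("lenAli must be positive")
    return 100.0 * n_identical / len_ali


def is_counted(conf):
    """Pairs with conf < 9 are currently ignored, so only conf >= 9 counts."""
    return conf >= 9


def parse_weights(field):
    """Parse weightSeq: comma-separated values, or one digit (0-9) per residue.

    Assumption: a field containing no comma is read digit by digit.
    """
    field = field.strip()
    if "," in field:
        return [float(w) for w in field.split(",")]
    return [float(ch) for ch in field]


def split_pdb_id(identifier):
    """Split '1pdbC' into ('1pdb', 'C'); the chain is None if not specified.

    Assumption: the PDB code is exactly four characters long.
    """
    return identifier[:4], (identifier[4:] or None)


# Demo with made-up values (hypothetical data, for illustration only).
print(pide(30, 120))               # 25.0
print(is_counted(8))               # False
print(parse_weights("0195"))       # [0.0, 1.0, 9.0, 5.0]
print(parse_weights("0.5,12,3"))   # [0.5, 12.0, 3.0]
print(split_pdb_id("1pdbC"))       # ('1pdb', 'C')
```

Note that pide is a percentage, so a value such as 0.25 would mean 0.25% identity, not 25%.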