1. Selecting the range of sequence to clone and express
2. Class of protein
3. Errors, truncations, frame-shifts, incorrect exon assembly
4. Defining the borders of domains
5. Making a multiple sequence alignment
Before cloning your protein, it pays to think carefully about what you will do with the expressed material. Will you solve a structure, do some biochemistry, or both? In the worst case, if you clone the wrong bit, it may be useless for any kind of experiment, let alone for solving the structure. If you need help in analysing the sequence, you are welcome to contact the Sequence Analysis Service.
If you want to practise using some sequence analysis software, try the sequence analysis course pages.
Choosing the right bit of protein to express is vital to the success of an expression experiment. In this section, we want to get you thinking about the following topics.
- Is the protein globular, non-globular, membrane or multi-domain?
- Have you guarded against errors, truncations and frame-shifts in the sequence?
- Have you defined the borders of the domain(s)?
- Have you examined a multiple sequence alignment of the homologues?
Many classical metabolic enzymes consist of a single folded domain, and the purified proteins are therefore relatively easy to handle. However, the majority of proteins (at least in Eukaryotes) are not like this at all. Leaving aside the specialist field of membrane proteins, the first thing to worry about is whether the protein is globular at all: if it isn't, you will not be able to determine its structure (except as an induced fit in a complex), although you will still be able to investigate the protein by CD and NMR. Multi-domain proteins very often consist of a mixture of globular domains interspersed with non-globular regions. The non-globular regions can be important for the function (e.g. containing phosphorylation, SH2-binding, SH3-binding sites and so on). To gain structural insight, it is often necessary to solve one module at a time. Defining the domain boundaries is important, since the presence of random coil can confound crystallisation attempts and may lead to unpredictable properties of the protein at NMR concentrations. When the borders are very hard to define precisely, it is best to clone several constructs in parallel and then work with the smallest construct that folds. Here is a checklist for the fundamental protein types:
- Elongated proteins with repetitive sequences
- Integral membrane
- May combine elements of any other class
You will need to do a multiple sequence alignment to be sure that you know about your protein.
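When domain borders are uncertain and several constructs are to be cloned in parallel, it can help to enumerate the candidates systematically. The following is a minimal sketch, not a prescribed method: the function name `candidate_constructs`, the toy sequence and the boundary guesses are all hypothetical and only illustrate the bookkeeping.

```python
from itertools import product


def candidate_constructs(sequence, n_starts, c_ends):
    """Return (start, end, subsequence) for every start/end combination.

    Positions are 1-based and inclusive. The result is sorted with the
    shortest construct first.
    """
    constructs = []
    for start, end in product(sorted(n_starts), sorted(c_ends)):
        # Skip combinations that fall outside the sequence or are inverted.
        if 1 <= start <= end <= len(sequence):
            constructs.append((start, end, sequence[start - 1:end]))
    constructs.sort(key=lambda construct: len(construct[2]))
    return constructs


# Hypothetical 60-residue sequence and boundary guesses, for illustration only.
toy_sequence = "ACDEFGHIKLMNPQRSTVWY" * 3
for start, end, sub in candidate_constructs(toy_sequence, [5, 10], [40, 50]):
    print(f"{start}-{end}\t{len(sub)} aa")
```

Sorting by length puts the smallest candidate first, which matches the advice to work with the smallest folded construct; whether a given construct actually folds still has to be tested experimentally.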
Unfortunately, some sequences in databases can be of low quality. Usually you will get away with "point mutation" errors that affect only one amino acid, although these will cause confusion when, as you should, you check your sample by mass spectrometry.
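To see why a single wrong residue shows up in a mass spectrometry check, the expected mass can be estimated from the sequence. This is a rough sketch under stated assumptions: it uses standard average residue masses, and it ignores tags, post-translational modifications, disulphide bonds and removal of the initiator methionine. The function name `average_mass` and both example sequences are hypothetical.

```python
# Average residue masses in daltons (standard values; verify against your own reference).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # one water per chain accounts for the free termini


def average_mass(sequence):
    """Average mass of an unmodified polypeptide chain, in daltons."""
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in sequence.upper()) + WATER


database_seq = "MKTAYIAKQR"  # hypothetical sequence as deposited
observed_seq = "MKTSYIAKQR"  # the same sequence with one A -> S substitution
shift = average_mass(observed_seq) - average_mass(database_seq)
print(f"expected mass shift: {shift:+.2f} Da")
```

A discrepancy of this size between the calculated and measured mass is a hint that the database entry and the cloned sequence differ by a substitution, although a modification of similar mass cannot be excluded from the mass alone.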
However, sequences of incorrect length are likely to cause severe problems. Many sequences have not been reliably determined at their N-termini, and an arbitrary internal methionine may have been designated as the N-terminus. Occasionally the C-terminus is also incomplete. Either terminus can be affected by frameshift sequencing errors, which can also cause internal problems, concatenate adjacent reading frames, and so on. Another major (and increasing) class of error in sequence databases is incorrect gene prediction and exon assembly. Its consequences can be at least as dramatic as those of frameshift errors.
The only way to guard against errors like this is to do comparative sequence analysis: again you need to do a multiple sequence alignment.
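Once an alignment is available, one simple comparative check is to look for rows that begin much later, or end much earlier, than their homologues. The sketch below assumes the alignment is held as a dictionary of equal-length gapped strings with `-` as the gap character; the function names, the `tolerance` value and the toy alignment are illustrative assumptions.

```python
from statistics import median


def terminal_offsets(aligned):
    """Map each name to (leading_gaps, trailing_gaps) in its aligned row."""
    offsets = {}
    for name, row in aligned.items():
        leading = len(row) - len(row.lstrip("-"))
        trailing = len(row) - len(row.rstrip("-"))
        offsets[name] = (leading, trailing)
    return offsets


def suspicious_termini(aligned, tolerance=10):
    """Flag rows whose terminal gaps exceed the median by more than `tolerance` columns."""
    offsets = terminal_offsets(aligned)
    typical_lead = median(lead for lead, _ in offsets.values())
    typical_trail = median(trail for _, trail in offsets.values())
    flagged = {}
    for name, (lead, trail) in offsets.items():
        problems = []
        if lead - typical_lead > tolerance:
            problems.append("N-terminus may be truncated")
        if trail - typical_trail > tolerance:
            problems.append("C-terminus may be truncated")
        if problems:
            flagged[name] = problems
    return flagged


# Toy alignment, invented for illustration: seq_c lacks the first 15 columns.
full = "ACDEFGHIKLMNPQRSTVWYACDEFGHIKL"
toy_alignment = {"seq_a": full, "seq_b": full, "seq_c": "-" * 15 + full[15:]}
print(suspicious_termini(toy_alignment))
```

A flagged row is only a prompt for closer inspection. The short sequence may be a genuine variant, or the longer ones may themselves be wrongly extended, so the decision still rests on examining the alignment.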
They will show you many known domains in your protein. But you cannot be completely sure that they have the domain borders correct. You need to examine the set of proteins that contain a given domain. Often, an N- or C-terminal location can be used to define a domain border. Tandemly repeated domains define the domain size but it is easy to wrongly permute the domain borders.
You can only solve the structure of a single domain if it is autonomously folded, say an SH2 or PH domain. However, there is a common class of small rod-forming repetitive domains that are not autonomous, including LRR, ankyrin, tetratricopeptide, armadillo, HEAT and so forth. These typically consist of two or three secondary structure elements and can only be expressed and folded in groups, so they are not usually suitable for NMR analysis (which delayed the structural investigation of this repeat class, although most have representative structures now).
There are also exceptions to the usual organisation of autonomous globular domains that can cause trouble. Sometimes large non-globular inserts are found: these are likely to cause both refolding and crystallisation problems. Sometimes, domains are fused together - either when adjacent or if one domain is inserted into a loop of another domain. Sometimes, unforeseeable domain extensions are present, as in the WW domain structure solved at EMBL some years ago: if the minimal domain is unfolded, trial and error with larger constructs and monitoring by CD & NMR is the only way to find out what is going on, unless you want to try a different sequence.
Since a multiple sequence alignment is the best way to protect yourself from many potential problems, if you don't have one already to hand, now is the time to do it. Here is a brief guide to collecting sequences and aligning them. If more help is needed, contact the Sequence Analysis Service.
There are two ways to collect a set of related proteins and in practice both methods are usually used. You can retrieve them by keyword search with a tool such as SRS. In this case, it is difficult to collect a clean set as keywords are inconsistently applied and it is very common to bring in some completely wrong proteins, as well as missing some real ones.
The alternative is to do a database search with BLAST and collect the sequences that way. Web outputs from good servers should let you collect all the sequences that are clearly related. Often you will need to go through a process of eliminating annoying sequence fragments that will not help in any future analysis.
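Weeding out fragments from a collected set can be partly automated. The following sketch assumes the hits have been saved as FASTA text and uses a crude length criterion, dropping anything much shorter than the median; `parse_fasta`, `drop_fragments` and the `min_fraction` threshold are hypothetical names and values chosen for illustration.

```python
from statistics import median


def parse_fasta(text):
    """Parse FASTA text into {identifier: sequence}, using the first word of each header."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else f"unnamed_{len(records) + 1}"
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def drop_fragments(records, min_fraction=0.8):
    """Keep sequences at least `min_fraction` of the median length."""
    typical = median(len(seq) for seq in records.values())
    return {n: s for n, s in records.items() if len(s) >= min_fraction * typical}


# Toy FASTA input, invented for illustration.
toy_fasta = """>full_1 example protein
ACDEFGHIKLMNPQRSTVWY
>full_2 example homologue
ACDEFGHIKLMNPQRSTVWF
>frag_1 partial sequence
ACDEFGHI
"""
kept = drop_fragments(parse_fasta(toy_fasta))
print(sorted(kept))
```

Length alone is a blunt instrument: homologues of multi-domain proteins can differ greatly in total length while still containing the domain of interest intact, so the filter is best applied to the aligned region you care about and the rejected entries checked by eye.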
Having got your alignment, you should now check the sequences as discussed above, e.g. at the termini. You need to look for "globularity" of the sequences too, and this is simply done by finding buried core residues, i.e. conserved columns of hydrophobic residues, which will be interspersed with unconserved columns for the surface residues. Often, conservation periodicities indicate whether a block of sequence is alpha-helix or beta-strand. Non-globular regions are characterised by hydrophilic residues, prolines and glycines, and generally poor conservation with high tolerance of INDELs. Sometimes it will be useful to have a secondary structure prediction too: in this case, try submitting the alignment to the PredictProtein server.
You cannot guarantee to design a perfectly successful cloning and expression experiment, but you can optimise your chances of success. Comparative sequence analysis is the best way to control for success in designing a protein expression experiment, just as Mass Spec, CD and 1D-NMR are essential controls to analyse the behaviour of the expressed product. Using these tools can save a lot of frustration. In severe cases, the experimental controls will reveal that you have to start again, so these approaches may need to be used iteratively until you get something to work.
This reference gives some practical hints on analysing protein domains: