Biocomputing | Gibson Group | EMBL
A Practical Course on Accessing and Analysing Sequences in On-line Biological Databases
The human genome consists of about 3 billion base pairs of DNA. It is hard to remember them all - especially, I find, as you get older. So it is obvious that computers are needed to store and investigate the flood of data generated by modern biology. Fortunately, the growth of the internet means that much of this data is accessible to the many scientists who cannot afford big computers. In fact, not just scientists but anyone with a Mac or PC can access publicly available on-line resources. Today we are going to investigate both protein and genomic sequences using web servers that you can connect to when you get back home. The exercises we've chosen are based on ones that we use for teaching EMBL researchers. If they are too difficult, please give us some feedback, as we might be doing this again next year.
In the first part, we are going to find out about the protein responsible for the inherited disease Marfan Syndrome. In the second part, we will investigate some human genome sequence servers, since the genome is now largely complete. The Ensembl server is a collaboration between the EBI and the Sanger Centre to provide "real time" human genome annotation as the sequence is generated. The annotation is generated automatically from a combination of gene prediction, homology of the encoded proteins and EST (expressed sequence tag) matches: obviously it is not going to be perfect. The Human Genome Browser developed in Santa Cruz is another interface, useful for getting an overview of the genome. Ensembl and the Genome Browser have links to each other.
Getting Started
We are going to use PCs running the Linux operating system, but we only need to know enough to log in and start Netscape for browsing the web.
Now we are ready to start surfing...
Part 1. Using biological databases to find out about the Marfan Syndrome protein
Modern molecular genetics has revealed the genes that are disrupted in many human inherited diseases. In some cases, knowing the gene provides helpful insight and leads to treatment; in other cases it is less helpful (so far). We'll take a look at one example using various database resources.
Getting a sequence from a database:
SRS is a WWW tool for retrieving information from biological databases. We have retrieved an entry for a protein sequence from the SWISS-PROT database. The entry includes the protein sequence, information about the protein and links to some other useful databases.
Examine the database entry:
Using the links in the entry:
The links took us to two other database resources: PubMed, which collects abstracts of medical and biological literature, and OMIM (Online Mendelian Inheritance in Man), which catalogues medically important human genes and genetic disorders. These are very useful to researchers. We could have clicked on other links that would have taken us to DNA, Protein Structure and Protein Domain databases.
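To give a feel for what such an entry looks like to a program, here is a minimal Python sketch that reads a record laid out in the style of a SWISS-PROT flat file. In that layout each line starts with a two-letter code (ID for the identifier, DE for the description, DR for cross-references to other databases, SQ for the start of the sequence) and the record ends with "//". The tiny record in the code is invented and heavily shortened: it is not the entry used in the exercise, all identifiers are placeholders, and the function name parse_entry is our own.

```python
from collections import defaultdict

# Invented, heavily shortened record in the style of a SWISS-PROT flat file.
# It is NOT a real database entry; every identifier is a placeholder.
TOY_ENTRY = """\
ID   TOY_HUMAN      STANDARD;      PRT;    20 AA.
DE   Toy protein used only to illustrate the layout.
DR   EMBL; X00000; -.
DR   MIM; 000000; -.
DR   PDB; 0XXX; -.
SQ   SEQUENCE   20 AA;
     ACDEFGHIKL MNPQRSTVWY
//
"""


def parse_entry(text):
    """Collect the data from each line type of one flat-file record."""
    fields = defaultdict(list)
    sequence = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):          # end of the record
            break
        code, data = line[:2], line[5:]    # two-letter code, then the data
        if in_sequence:
            # Sequence lines are indented and split into blocks by spaces.
            sequence.append(data.replace(" ", ""))
        elif code == "SQ":
            in_sequence = True             # following lines hold the sequence
        else:
            fields[code].append(data.strip())
    # Cross-references look like "DATABASE; identifier; ...".
    links = [tuple(part.strip(" .") for part in dr.split(";")[:2])
             for dr in fields["DR"]]
    return {
        "id": fields["ID"][0].split()[0],
        "description": " ".join(fields["DE"]),
        "links": links,
        "sequence": "".join(sequence),
    }


entry = parse_entry(TOY_ENTRY)
print(entry["id"], len(entry["sequence"]), "residues")
for database, identifier in entry["links"]:
    print("link to", database, identifier)
```

The DR lines are what a web interface turns into clickable links: each one names another database and the identifier of the related record there.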
Checking for domains in the sequence:
SMART compares a sequence against a database of known protein domains. If a match scores above a statistical significance threshold, SMART displays the result graphically. Matching sequences to domains is an imperfect activity: highly divergent domains may not be detected. In fact this is generally true of sequence comparisons and reflects the process of gradual sequence change driven by molecular evolution.
These exercises introduced a few databases and tools available on the web. There are plenty more available for biologists to use, especially at major sites such as the EBI, NCBI and ExPASy.
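The idea of scoring a match and applying a threshold can be sketched in a few lines of Python. This is only a toy: it uses a simple table of position-specific scores, whereas SMART's own models are considerably more sophisticated (hidden Markov models built from multiple alignments). The six-residue "domain", all the scores, the threshold and the two test sequences are invented for illustration.

```python
# Toy position-specific scores for a made-up six-residue "domain".
# Each dict gives the score for residues accepted at that position;
# any other residue scores MISMATCH. All numbers are invented.
PROFILE = [
    {"C": 3},
    {"A": 1, "S": 1, "G": 1},
    {"D": 2, "E": 2, "N": 1},
    {"G": 2},
    {"Y": 2, "F": 2, "W": 1},
    {"C": 3},
]
MISMATCH = -2
THRESHOLD = 9   # hypothetical cut-off; real tools derive theirs statistically


def window_score(window):
    """Add up the per-position scores for one stretch of sequence."""
    return sum(column.get(residue, MISMATCH)
               for column, residue in zip(PROFILE, window))


def find_domains(sequence, threshold=THRESHOLD):
    """Slide the profile along the sequence and keep windows above threshold."""
    width = len(PROFILE)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = window_score(sequence[start:start + width])
        if score >= threshold:
            hits.append((start, score))
    return hits


conserved = "MKTCADGYCLLP"   # contains a good copy of the toy domain
divergent = "MKTCAKGHCLLP"   # same copy after two substitutions

print("conserved:", find_domains(conserved))
print("divergent:", find_domains(divergent))
```

The second sequence still contains a recognisable relative of the toy domain, but two substitutions are enough to push its score below the cut-off, so nothing is reported. Lowering the threshold would recover it at the price of more chance matches.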
Part 2. Using some human genome web servers
The human genome sequence was published with much publicity earlier this year. Only problem - it isn't quite finished yet! It should take a few more years to tidy up. Even so it is already becoming very useful to researchers. Everyone can access the genome through various academic servers. However, presenting the information available about the genome to the user is very much in its infancy. It can be hard to get at the bits of information you want. We'll have a look at two genome servers.
Exercise 1. Querying Ensembl with a protein sequence
Ensembl can be queried by keyword or by sequence similarity, depending on your needs. It offers flexible combinations of protein or DNA queries against protein or DNA databases. For example, it is well suited to probing with ESTs of interest. Today we are going to use a protein sequence as the query.
Getting the Query:
Querying Ensembl with a protein sequence:
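A protein query cannot be compared with genomic DNA directly: one common approach is to translate the DNA in all six reading frames (three on each strand) and compare the query with the resulting protein sequences. The Python sketch below shows that step using the standard genetic code. We are not claiming that this is how Ensembl works internally; the short DNA string is invented, the function names are our own, and real searches score approximate matches rather than looking for an exact copy of the query.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order, "*" marks a stop.
CODON_TABLE = {"".join(codon): amino_acid
               for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Translate both strands at each of the three possible offsets."""
    dna = dna.upper()
    frames = {}
    for strand_name, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand_name}{offset + 1}"] = translate(strand[offset:])
    return frames


def find_protein(query, dna):
    """Return the frames whose translation contains the query exactly."""
    return [frame for frame, protein in six_frames(dna).items()
            if query in protein]


toy_dna = "CCATGAAAGGTTGGTAA"        # invented example sequence
for frame, protein in six_frames(toy_dna).items():
    print(frame, protein)
print(find_protein("MKGW", toy_dna))
```

Because the query may be encoded on either strand, searching the reverse complement of the same DNA finds the peptide in one of the "-" frames instead.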
Exercise 2. Using the Genome Browser to evaluate whether two chromosomes are related
The Human Genome Browser is another useful interface to the human genome. We can easily click between Ensembl and the Browser and back again. You may find that some things are easier in one of these browsers than the other - though this may change over time as they are both being rapidly developed. The exercise we are going to do now could be done in Ensembl too, but was easier in the Browser when we tried it - but next year who knows?
Do you know what a tetraploid is? A typical animal genome, including our own, is diploid - having two copies of each chromosome - while a sperm or egg cell is haploid with a single copy. It turns out that in many species the complete genome can easily be duplicated: the resulting organisms are called tetraploids. For example, durum (pasta) wheat is a tetraploid that arose from the fusion of the genomes of two wild grass species; bread wheat gained a third genome in prehistoric farming and is a hexaploid. Tetraploidy gives rise to many redundant genes (genes with overlapping functions) that can persist long after the genome duplication happened. Vertebrates, including humans, have many redundant genes. There is currently some controversy over whether the vertebrate genome underwent tetraploidy or even octoploidy a long time ago in a common ancestor.
We know that there are eight Src paralogues (related genes) in the human genome. One of them, HCK, is close to Src on chromosome 20: this gene pair was formed by a "tandem duplication" within the chromosome. Another, LYN, is found on chromosome 8, so it might have arisen by genome duplication. We are going to see whether there are other genes close to Src and LYN that exhibit synteny: syntenic chromosomal regions possess similar sets of genes.
Getting the Src locus in the Genome Browser:
Finding a set of genes linked to Src:
Finding a set of genes linked to LYN:
Notes:
This exercise shows quite well the current state of the human genome sequence. There are some nearly finished regions and some less good ones. There also seem to be gaps between clone contigs, so a contig could turn out to be inverted, changing the gene order. In addition, the order of sequence contigs within a clone is not always correct - although the project does try to order them - which could mess up local gene prediction. In other words, fine mapping is not reliable in the draft sequence. It will be interesting to see how much the gene order changes in these two regions in the future, and whether they become a bit more like each other.
We hope you enjoyed today's exercises. They give a taste of the on-line resources currently available to researchers who need to analyse genes and genomes. If you make a career in biology, you will need to use computational resources yourself. In the future, we can expect that there will be more and better services to access.
The gene orders for Part 2: Exercise 2 using the December 2000 release of the human genome are:
Chromosome 20: SRC, HNF4A, PKIG, MATN4, EYA2, NCOA3, BIG2, STAU
Chromosome 8: LYN, BIG1, NCOA2, EYA1, STAU2, HNF4G, PKIA, MATN2
By the way, LYN is actually a closer relative of HCK than of SRC, but HCK is further from the rest of these genes than SRC is, so we used SRC as the starting point. It will be interesting to see, as the sequence is improved, whether this reflects the true gene order resulting from a chromosomal inversion.
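The comparison of the two gene orders listed above can also be done by a short program. The Python sketch below pairs each gene in the SRC region with its relative in the LYN region and reports where that relative sits in the other list. One assumption is made loudly here: the pairing into families is read off the gene names (for example EYA2 with EYA1), whereas a real analysis would establish the relationships by comparing the sequences. The names FAMILY and compare_orders are our own.

```python
# Gene orders reported above for the December 2000 release.
SRC_REGION = ["SRC", "HNF4A", "PKIG", "MATN4", "EYA2", "NCOA3", "BIG2", "STAU"]
LYN_REGION = ["LYN", "BIG1", "NCOA2", "EYA1", "STAU2", "HNF4G", "PKIA", "MATN2"]

# ASSUMPTION: families are assigned by reading the gene names only.
FAMILY = {
    "SRC": "SRC family", "LYN": "SRC family",
    "HNF4A": "HNF4", "HNF4G": "HNF4",
    "PKIG": "PKI", "PKIA": "PKI",
    "MATN4": "MATN", "MATN2": "MATN",
    "EYA2": "EYA", "EYA1": "EYA",
    "NCOA3": "NCOA", "NCOA2": "NCOA",
    "BIG2": "BIG", "BIG1": "BIG",
    "STAU": "STAU", "STAU2": "STAU",
}


def compare_orders(first, second):
    """For each gene in `first`, find its relative in `second` and its position."""
    by_family = {FAMILY[gene]: (gene, index) for index, gene in enumerate(second)}
    pairs = []
    for gene in first:
        match = by_family.get(FAMILY[gene])
        if match is not None:                 # family present in both regions
            relative, position = match
            pairs.append((gene, relative, position))
    return pairs


pairs = compare_orders(SRC_REGION, LYN_REGION)
for index, (gene, relative, position) in enumerate(pairs):
    print(f"{index}: {gene:6s} <-> {relative:6s} at position {position}")
print("positions in the LYN region:", [position for _, _, position in pairs])
```

Every family turns up in both regions, but not in the same order: read in SRC-region order, the relatives sit at positions 0, 5, 6, 7, 3, 2, 1, 4 in the LYN region. Runs such as 5, 6, 7 and the reversed 3, 2, 1 are the kind of pattern that inversions, or misordered contigs in the draft sequence, could produce.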
You can find this page at http://www.embl-heidelberg.de/~seqanal/courses/ScienceSchool01/ScienceCourse.html