Introduction to Bioinformatics

Thursday 27 November 2008, EMBL Heidelberg

Aidan Budd and Evangelos Pafilis

Presentation


Demonstration - BLAST

BLAST webserver at EBI - we'll run it using one of the smaller sequence databases (UniProtKB/SwissProt) using the SRC_HUMAN sequence.

Note: The following demonstration was carried out on an OSX Macintosh, using the Firefox browser. While much will be similar using a different operating system and browser, there will certainly also be differences (for example, on OSX the Safari browser has quite a lot of problems recognising some of the files available for downloading from this page, and with displaying full information from the Ensembl website).

Demonstration - Hemoglobin

There is a bewildering diversity of bioinformatic data and tools - there will be corners (or even huge swathes) of the field that even a professional bioinformatician will know little or nothing about.

Thus, when using bioinformatic tools, it is important to focus on addressing very specific questions, to avoid becoming overwhelmed by the multitude of opportunities and choices available.

With this in mind, this demonstration begins by focusing on a very specific question:

What effect does the mutation that causes sickle-cell anemia have on the 3D structure of haemoglobin?

We begin by identifying the UniProt (the major protein primary sequence database) record for the human hemoglobin involved in sickle-cell anemia.
NOTE: Searching bioinformatics databases usually matches multiple records that the user has to choose between to find the one(s) of interest.
NOTE: Cross-references between biological databases make finding additional information about the protein/entities much easier and quicker. Almost all bioinformatic databases contain such links.
NOTE: The unformatted version of the data is relatively difficult for humans to interpret (it's designed to be relatively easy to process by computers) - therefore a key component of bioinformatics focuses on presenting the data in a way that is accessible to human users.
NOTE: This example shows even more clearly why we need visualisation tools for bioinformatic data!
NOTE: CE, PyMOL, and EB-eye are examples of another key component of bioinformatics - TOOLS (as opposed to pure data) developed to help explore and analyse the basic bioinformatic data. One way of deconstructing bioinformatics is to divide it into "data" and "tools".
NOTE: All the tools and data we use here are FREE - with that in mind, it's not surprising that they are often not particularly easy to use. It took me at least 15 minutes to work out how to change the file so that it could be viewed properly in PyMOL. Problems like this are typical for bioinformatic tools.
We have now answered our original question!
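To illustrate the note above about unformatted data being designed for computers, here is a minimal Python sketch that reads a few lines written in the style of a UniProt flat-file record. The excerpt is abbreviated and hand-typed for illustration (a real record contains many more lines), and the helper name parse_flat_file is hypothetical, not part of any UniProt tool. It assumes each line starts with a two-letter line code followed by three spaces.

```python
# Abbreviated, hand-typed excerpt in the style of a UniProt flat-file record.
# Only the ID, AC and DE lines are shown; a real record has many more lines.
RECORD = """\
ID   HBB_HUMAN               Reviewed;         147 AA.
AC   P68871;
DE   RecName: Full=Hemoglobin subunit beta;
"""


def parse_flat_file(text):
    """Group the content of each line under its two-letter line code.

    Hypothetical helper: assumes the code occupies the first two columns
    and the content starts at column six.
    """
    fields = {}
    for line in text.splitlines():
        if len(line) < 6:
            continue  # skip blank or truncated lines
        code, value = line[:2], line[5:].strip()
        fields.setdefault(code, []).append(value)
    return fields


fields = parse_flat_file(RECORD)
entry_name = fields["ID"][0].split()[0]                 # first token on the ID line
primary_accession = fields["AC"][0].split(";")[0].strip()  # first accession listed
print(entry_name, primary_accession)
```

The fixed line codes make such records simple for a program to split up, even though the raw text is tiring for a person to read.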

Mining the internet to obtain this range of information about a protein involves (as you've seen) using and understanding many different resources. Tools such as EB-eye help integrate many such resources (again, as we've seen) - however, other tools have been developed that in many cases make the initial stages of this process (gaining an overview of what's known about a protein) very easy (and present the results very attractively).

REFLECT is one such tool and has an impressively wide range of different applications - we'll see here how it can help extract relevant biological information from different types of text.

Returning to the hemoglobin 3D structure - this might be a useful starting point for many other questions e.g.:
Or using completely different structures:

Returning to haemoglobin as an example for other kinds of questions:

Where are the haemoglobin genes in the human genome (i.e. which chromosomes)?
NOTE: HBB_HUMAN is (sort of) an accession number for this protein used by UniProt (it's actually the entry's "name" - the primary accession number is P68871) - knowing what it is makes it much easier to find information relating to the protein in other databases. Associating information with an accession number is a very common feature of bioinformatic databases.
Do the different haemoglobin genes have similar or different numbers of introns/exons? Do they have one or more transcripts for each locus?
How many copies of the gene are there in humans? In other mammals? When did the genes duplicate?
NOTE: TF333268 is the accession number for this family in the TreeFam database

How are the protein sequences of the members of the family different i.e. how have they changed during evolution?
NOTE: ClustalX (like PyMOL previously) is a tool that we need to have installed directly on the computer we are working on - it is termed a "local" tool. This is in contrast to tools such as CE that are "web" or "remote" tools. The local/remote distinction is another way of describing features of different bioinformatic tools.
How have the DNA sequences changed during evolution?

There are many other directions one could take in developing tasks/projects/packages for school classrooms:

Understanding bioinformatic tools (e.g. sequence alignment)

Sequence alignment is CENTRAL to ALMOST ALL sequence analysis bioinformatic tools. The aim of calculating an alignment is to place equivalent residues from different sequences in the same column.

For example, we can align two haemoglobin sequences, HBA_HUMAN and HBB_HUMAN, using Smith-Waterman alignments at EBI using EMBOSS, or Blast2Sequences at the NCBI. To understand how these tools work, students could try to do this themselves without using these tools, e.g. using paper and scissors or editing a simple text file.
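For classes that include some programming, the same exercise can also be sketched in code. The following is a minimal Python sketch of a Smith-Waterman style local alignment, not the EMBOSS implementation. The scoring values (match +2, mismatch -1, gap -2) and the toy input strings are assumptions chosen for illustration; real tools typically use a substitution matrix and more elaborate gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned a, aligned b) for a local alignment.

    Illustrative scoring only: fixed match/mismatch values and a
    linear gap penalty, all of which are assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the score matrix; a cell never drops below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# Toy strings (not real haemoglobin sequences) sharing the segment GATTACA.
print(smith_waterman("AAAGATTACAGGG", "CCGATTACATT"))
# -> (14, 'GATTACA', 'GATTACA')
```

Students could paste in sequences downloaded from UniProt and compare the output with what the web tools report; differences are expected because the scoring schemes are not the same.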

There are many other such tools/tasks that could be investigated in this way e.g. matching linear motif regular expressions in ELM
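As a rough illustration of what matching a motif regular expression involves, here is a minimal Python sketch. The pattern P..P (a proline, any two residues, a proline) is a deliberately simplified pattern chosen for illustration and is not copied from an ELM entry; the sequence is a made-up toy string, and find_motif is a hypothetical helper name.

```python
import re

# Simplified, illustrative pattern: proline, any two residues, proline.
# This is an assumption for the example, not an ELM motif definition.
MOTIF = "P..P"


def find_motif(pattern, sequence):
    """Return (1-based start position, matched text) for every match.

    The pattern is wrapped in a lookahead so that overlapping
    matches are also reported.
    """
    regex = re.compile("(?=(" + pattern + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


toy_sequence = "MAPPLPPRSPAVP"  # made-up sequence, not a real protein
for position, text in find_motif(MOTIF, toy_sequence):
    print(position, text)
```

Even this short pattern matches four times in a 13-residue toy sequence, which hints at why short motifs produce many chance matches and why matches need careful interpretation.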

Specific systems/data of interest

As mentioned, there are MANY MANY different datatypes, databases, and tools available. Some of the following websites provide lists of these, perhaps some of them would be particularly interesting for school students.

As an example, PhenoBank contains data on the role of all C. elegans genes in the first two rounds of mitotic cell division, obtained by genome-wide RNAi screening and time-lapse video microscopy of the early embryo, e.g. the wild-type video.

Some websites to visit that provide lists of many different bioinformatic tools and resources:

Cross-over with Informatics

An obvious area of interdisciplinary cross-over is with informatics. One could envisage many different tasks for computational students that would incorporate biological data e.g.:
Another obvious cross-over field is mathematics.

Conclusion

We've looked here at numerous examples of questions that can be addressed using bioinformatic tools, and possible ideas for how such tools/ideas/approaches might be incorporated into the classroom. In summary here are two points I'd particularly like to emphasise:

Novel Approaches for Automated Information Collection

So far you have seen that drawing biology-related conclusions involves querying many different resources to collect the relevant pieces of knowledge. This is a tedious and time-consuming process that prevents researchers from spending their time on other research activities.

Clicking on the following button will cause the current page to be opened in a new window.

Tools such as Reflect aim to automate the task of collecting relevant biological information by presenting users with useful summaries about the proteins and small chemicals mentioned in a piece of text. Now visit the Wikipedia entry of a protein or small chemical molecule of your choice, e.g. the prion protein, insulin, or Prozac.

Copy and paste the URL of this Wikipedia page into the input field of Reflect. You should see similar marking up of protein names as before.

The same thing can be done with many different URLs e.g.:
Thus, without specialised bioinformatics skills or prior knowledge of biological and chemical databases, users can now get access to essential pieces of knowledge and easily see, for example, how theoretical concepts such as a 3D structure materialise in practice.