Introduction to Bioinformatics
Thursday, 27 November 2008, EMBL Heidelberg
Demonstration - BLAST
BLAST webserver at EBI - we'll run it against one of the smaller sequence
databases (UniProtKB/SwissProt), using the SRC_HUMAN sequence as the query.
Note: The following demonstration was carried out on an OSX
Macintosh, using the Firefox browser. While much will be similar using
a different operating system and
browser, there will certainly also be differences (for example, on OSX
the Safari browser has quite a lot of problems recognising some of the
files available for downloading from this page, and with displaying
full information from the Ensembl website).
Demonstration - Hemoglobin
There is a bewildering diversity of bioinformatic data and tools -
there will be corners (or even huge swathes) of the field that even a
professional bioinformatician will know little or nothing about.
Thus, when using bioinformatic tools, it is important to focus on
addressing very specific questions, to avoid becoming overwhelmed by
the multitude of opportunities and choices available.
With this in mind, this demonstration begins by focusing on a very
specific question:
What effect does the mutation that causes sickle-cell anemia have on
the 3D structure of haemoglobin?
We begin by identifying the UniProt
(the major protein primary sequence database) record for human
hemoglobin involved in sickle-cell anemia
NOTE: Searching bioinformatics databases usually matches multiple
records that the user has to choose between to find the one(s) of
interest
- Following the "Protein Sequence" link we are led to the UniProt record for
the protein, providing information about MANY aspects of the structure
and function of the protein:
- Taxonomy of the organism (human)
- Links to primary literature describing features of the protein
- DNA sequence of the gene
- 3D structure of the protein
- Phosphorylation state of the protein
- Protein sequence encoded by the gene
- etc.
NOTE: Cross-references between biological databases make finding
additional information about the protein/entities much easier and
quicker. Almost all bioinformatic databases contain such links
- From here we get the primary amino acid
sequence of the protein (following the link that says "FASTA")
- We can also find the mutation in the protein that causes
sickle-cell anemia (searching the page for the word "sickle")
- The UNFORMATTED version of this data looks like this
NOTE: The unformatted version of the data is relatively difficult
for humans to interpret (it's designed to be relatively easy to process
by computers) - therefore a key component of bioinformatics focuses on
presenting the data in a way that is accessible to human users.
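As a small illustration of how a program reads such plain-text data, here is a minimal Python sketch that parses FASTA text (the format reached via the "FASTA" link) and applies a single amino-acid substitution. The function names `parse_fasta` and `substitute` and the header line are invented for this example, and the sequence is only a short N-terminal fragment typed in by hand rather than the full record. It assumes UniProt numbering (counting the initiator methionine), in which the sickle-cell variant replaces the glutamate at position 7 with valine; check this against the variant listed in the record itself.

```python
def parse_fasta(text):
    """Return a dict mapping each FASTA header (without '>') to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    # Sequences may be wrapped over several lines, so join the pieces.
    return {name: "".join(parts) for name, parts in records.items()}


def substitute(sequence, position, new_residue):
    """Replace the residue at a 1-based position with new_residue."""
    index = position - 1
    return sequence[:index] + new_residue + sequence[index + 1:]


# Hypothetical header; the sequence is only the first residues of the protein.
example = """>example_fragment (hand-typed, not the full record)
MVHLTPEEK
SAVTALWGKV
"""

wild_type = parse_fasta(example)["example_fragment (hand-typed, not the full record)"]
sickle = substitute(wild_type, 7, "V")  # assumed: Glu -> Val at position 7
print(wild_type)
print(sickle)
```

Comparing the two printed lines shows how small the change is at the sequence level, which is why the structural comparison that follows is useful.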
- The UniProt record (or the EB-eye search) links to information
about the 3D structures of the protein in the PDBsum database
NOTE: This example shows even more clearly why we need visualisation
tools for bioinformatic data!
- We can look at the structures of the wildtype and sickle-cell forms
of human haemoglobin beta by loading them both into PyMOL
- NOTE - loading the structures into PyMOL may be tricky. You
might try:
- drag and drop the files onto the PyMOL program icon
- change the names of the files to ones expected by PyMOL
e.g. ending in ".pdb"
- Use the command prompt in the program to load the file
e.g. "load /Users/budd/Desktop/pdb1gzx_hemoglobin_wt.ent.txt"
- To begin to compare the structures we need to change the
representations somewhat
- Without placing the two structures directly on top of each
other it's difficult to see the differences. Therefore we will...
- Align the two structures using the CE webserver, comparing
both B chains:
NOTE: CE, PyMOL, and EB-eye are examples of another key component
of bioinformatics - TOOLS (as opposed to pure data) developed to help
explore and analyse the
basic bioinformatic data. One way of deconstructing bioinformatics is
to divide it into "data" and "tools".
- We can edit the
results file (removing the "ENDMDL" and
"MODEL 2" lines) and
view the alignment in PyMOL
NOTE: All the tools and data we use here are FREE - with that in
mind, it's not surprising that they are often not particularly easy to
use. It took me at least 15 minutes to work out how to change the
file to be able to view it properly in PyMOL. Problems like this are
typical for bioinformatic tools.
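The manual edit described above (removing the "ENDMDL" and "MODEL 2" lines) can also be scripted. The following is a minimal Python sketch under the assumption that the CE results file is PDB-style text in which each record name is the first whitespace-separated word on a line; the function name `strip_model_records` and the sample lines are invented for illustration and are not real CE output.

```python
def strip_model_records(pdb_text):
    """Drop every ENDMDL line and the 'MODEL 2' line from PDB-style text."""
    kept = []
    for line in pdb_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "ENDMDL":
            continue
        if len(fields) > 1 and fields[0] == "MODEL" and fields[1] == "2":
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"


# Invented stand-in for a two-model results file.
sample = (
    "MODEL        1\n"
    "ATOM      1  CA  VAL A   1\n"
    "ENDMDL\n"
    "MODEL        2\n"
    "ATOM      1  CA  VAL B   1\n"
    "ENDMDL\n"
    "END\n"
)
print(strip_model_records(sample))
```

To apply it to a real file, read the file into a string, pass it through the function, and write the result to a new file (keeping the original untouched in case the edit needs to be redone).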
- We can also edit the appearance of the structures in PyMOL ourselves,
or load a pre-edited file directly into PyMOL, to show the
differences between the structures of the two chains
- Note - again, you may have problems loading the edited file
into PyMOL - in this case, try:
- changing the name to end ".pse"
- loading the file from the command prompt e.g. "load
/Users/budd/Desktop/ceAlignmentHemoglobin_a_wt_b_sickleCell.pse"
- How would you describe the differences between the two
structures?
We have now answered our original question!
Mining the internet to obtain this range of
information about a protein involves (as you've seen) using and
understanding many different resources. Tools such as EB-eye help
to integrate many such resources (again, as we've seen). However, other
tools have been developed that make the initial stages of this process
(gaining an overview of what's known about a protein) in many cases very
easy (and very attractively presented).
REFLECT is one such tool and has an
impressively wide range of different applications - we'll see here how it
can help to extract relevant biological information from different
types of text.
Returning to the hemoglobin 3D structure -
this might be a useful starting point for many other questions e.g.:
- What is the difference in the structure of human haemoglobin
alpha and beta (providing insight into the evolution of the protein
family)? (CE alignment, PyMOL session)
- What is the effect on haemoglobin beta structure of other
mutations (we can see a list of naturally occurring variations in the
UniProt record for the protein)?
Or using completely different structures:
- Short functional peptides and protein disorder: FHA domain
interacting with a phosphothreonine peptide (PyMOL session)
- Much of the proteome is DISORDERED i.e. has no globular
structure
- Disordered sequence is VITAL for many important functions
(especially signalling e.g. phosphorylation - cancer, development, etc.)
- This NMR structure shows an example of a disordered peptide
- DNA complexed with
HMGB protein
- Can they read the DNA sequence by looking at the structure of
the
- Does the binding of the protein change the structure of the DNA
(could include
- Which regions of the structure do they expect to be
more/less flexible? (Show the movie)
- Comparison with DNA
in a non-double-helix conformation - what is the difference in
how the bases interact with each other?
Returning to haemoglobin as an example for
other kinds of questions:
Where are the haemoglobin genes in the human genome (i.e. which
chromosomes)?
NOTE: HBB_HUMAN is (sort of) an accession number for this protein
used by UniProt (it's actually the entry's "name" - the primary
accession number is P68871) - knowing what it is makes it much easier
to find information relating to the protein in other databases.
Associating information with an accession number is a very common
feature of bioinformatic databases
- Search within the gene page for "Family" - we can follow this
link to information about where the other members of the family are
located (all on chromosome 11)
- Are they randomly distributed through the genome or are they
clustered in one area?
Do the different haemoglobin genes have similar or different numbers
of introns/exons? Do they have one or more transcripts for each locus?
- Follow the links to each of the genes from the Ensembl family
page and count!
How many copies of the gene are there in humans? In other mammals?
When did the genes duplicate?
- Query TreeFam with the
accession number - this should lead to TF333268
- from here we can see when the duplication events occurred
NOTE: TF333268 is the accession number for this family in the
TreeFam database
How are the protein sequences of the members of the family different
i.e. how have they changed during evolution?
- From TreeFam choose "clean.aln"
and click "Go"
- Download the alignment and view with ClustalX locally
NOTE: ClustalX (like PyMOL previously) is a tool that we need to
have installed directly on the computer we are working on - it is
termed a "local" tool. This is in contrast to tools such as CE that are
"web" or "remote" tools. The local/remote distinction is another way of
describing features of different bioinformatic tools.
- This shows which mutations have occurred within the family
during evolution
- Is the pattern of substitution
the same at all positions?
- If there is a difference, which parts of the 3D structure are
the fast/slow changing residues?
- Are there other changes in the sequences? (Yes -
insertions/deletions)
- Do insertions/deletions occur preferentially in particular
parts of the protein sequence/structure?
- Which
parts of the structure have slower/faster changes? etc.
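One way to make the "fast/slow changing" question concrete is to score each alignment column by how dominant its most common residue is. The sketch below is a simple illustration in Python, not what ClustalX itself computes; the function name `column_conservation` and the three short aligned sequences are invented for the example (they are not taken from the TreeFam alignment), and gaps are written as "-".

```python
from collections import Counter


def column_conservation(aligned_sequences):
    """Return, per column, the fraction of sequences sharing the commonest residue.

    Gap characters ('-') are not counted as residues but still count
    towards the number of sequences, so gappy columns score lower.
    """
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*aligned_sequences):
        residues = [char for char in column if char != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


# Invented toy alignment: three sequences, four columns.
toy_alignment = ["MV-L", "MVHL", "MAHL"]
for position, score in enumerate(column_conservation(toy_alignment), start=1):
    print(position, round(score, 2))
```

Columns scoring close to 1 are the slowly changing ones; students could then look up where those positions sit in the 3D structure.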
How have the DNA sequences changed during evolution?
- Choose "clean.nucl.mfa"
and click "Go".
- Download the alignment and view with ClustalX
- This illustrates how mutation patterns vary between different
positions in the alignment
- Are there any obvious patterns/tendencies in how different
positions in the sequence change?
- Taking a short piece of DNA coding sequence, can they deduce
the protein sequence it codes for?
- This can be done automatically using transeq
from the EMBOSS package at EBI
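The translation step can also be written out by hand, which shows what a tool such as transeq is doing in the simplest case. This is a minimal Python sketch using the standard genetic code, reading frame 1 only; it is not the transeq implementation, the function name `translate` is invented here, and the short DNA string is a hand-typed example intended to resemble the start of a haemoglobin beta coding sequence rather than a verified database record.

```python
BASES = "TCAG"
# Standard genetic code, ordered by first, second, third base over T, C, A, G.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA coding sequence in frame 1; '*' marks a stop codon.

    Codons containing anything other than A, C, G, T become 'X', and an
    incomplete codon at the end is ignored.
    """
    dna = dna.upper()
    protein = []
    for start in range(0, len(dna) - len(dna) % 3, 3):
        protein.append(CODON_TABLE.get(dna[start:start + 3], "X"))
    return "".join(protein)


# Hand-typed illustrative fragment (assumption, see text above).
fragment = "ATGGTGCATCTGACTCCTGAGGAGAAG"
print(translate(fragment))
# Changing the seventh codon from GAG to GTG swaps glutamate for valine.
print(translate(fragment[:18] + "GTG" + fragment[21:]))
```

Students could first translate the fragment on paper with a codon table and then use the function to check their answer.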
There are many other directions one could try
and go in developing tasks/projects/packages for school classrooms:
Understanding bioinformatic tools (e.g. sequence alignment)
Sequence alignment is CENTRAL to ALMOST ALL sequence analysis
bioinformatic tools. The aim of calculating an alignment is to place
equivalent residues from different sequences in the same column.
For example, we can align two haemoglobin sequences, HBA_HUMAN and HBB_HUMAN, with a
Smith-Waterman alignment at EBI (using EMBOSS) or with Blast2Sequences
at the NCBI.
To understand how these tools work, students could try and do this
themselves without using these tools e.g. using paper and scissors or
editing a simple text file.
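For classes with some programming experience, the same exercise can be done in code. Below is a minimal Python sketch of Smith-Waterman local alignment with a deliberately simple scoring scheme (fixed match, mismatch and gap scores chosen for illustration). The EBI/EMBOSS service uses substitution matrices and other gap penalties, so its results will differ. The function name `smith_waterman` is invented here, and the two sequences are hand-typed N-terminal fragments meant to resemble HBA_HUMAN and HBB_HUMAN, not full database entries.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; scores never drop below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + pair,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# Hand-typed illustrative fragments (assumption, see text above).
alpha_fragment = "MVLSPADKTNVKAAWGKV"
beta_fragment = "MVHLTPEEKSAVTALWGKV"
result = smith_waterman(alpha_fragment, beta_fragment)
print(result[0])
print(result[1])
print(result[2])
```

Changing the `match`, `mismatch` and `gap` values and re-running is a quick way to see how strongly the scoring scheme shapes the alignment.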
There are many other such tools/tasks that could be investigated in
this way e.g. matching linear motif regular expressions in ELM
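Regular-expression matching is easy to try out with a few lines of Python. The sketch below finds every (possibly overlapping) occurrence of a pattern in a protein sequence and reports 1-based start positions. The pattern and the sequence are invented for illustration and are not taken from ELM; real motif definitions should be copied from the ELM entries themselves. The function name `find_motif` is also an assumption of this example, and the pattern passed in must not contain capturing groups of its own.

```python
import re


def find_motif(pattern, sequence):
    """Return (1-based start, matched text) for every match, overlaps included."""
    # Wrapping the pattern in a lookahead lets matches overlap each other.
    lookahead = re.compile("(?=(" + pattern + "))")
    return [(m.start() + 1, m.group(1)) for m in lookahead.finditer(sequence)]


# Invented pattern: a threonine, two arbitrary residues, then I, L or V.
toy_pattern = "T..[ILV]"
toy_sequence = "MSTPELATAAVK"
print(find_motif(toy_pattern, toy_sequence))
```

A paper version of the same task is to slide the pattern along the sequence by hand and mark each position where every pattern element is satisfied.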
Specific systems/data of interest
As mentioned, there are MANY MANY different datatypes, databases, and
tools available. Some of the following websites provide lists of these,
perhaps some of them would be particularly interesting for school
students.
As an example, PhenoBank
contains data on the role of all C. elegans genes in
the first two rounds of mitotic cell division, obtained via
genome-wide RNAi screening and time-lapse video microscopy of the early
embryo (e.g. wild-type video).
Some websites to visit that provide lists of many different
bioinformatic tools and resources:
Cross-over with Informatics
An obvious area of interdisciplinary cross-over is with informatics.
One could envisage many different tasks for computational students
that would incorporate biological data e.g.:
- Designing their own visualisation tools
- 3D structure e.g. PyMOL
- text/HTML formatting of database records e.g. as with UniProt,
where this raw data becomes this marked-up HTML
- interaction networks e.g. search with "HBB_HUMAN" in STITCH
- ...
- Parsing useful information from large biological datasets
- Implementing bioinformatic algorithms
- pairwise alignment
- regular-expression mapping
- mass-spec data interpretation
- ...
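As a taste of the formatting task in the list above, here is a minimal Python sketch that turns lines of a raw, line-coded record into an HTML table. It assumes each line starts with a short line code followed by spaces and a value, as in UniProt's text format; the function name `record_to_html` is invented here, and the three sample lines are abbreviated and typed by hand, so they should not be read as the actual record.

```python
from html import escape


def record_to_html(raw_record):
    """Render 'CODE   value' lines as rows of a two-column HTML table."""
    rows = []
    for line in raw_record.splitlines():
        if not line.strip():
            continue
        # The line code is everything before the first space.
        code, _, value = line.partition(" ")
        rows.append(
            "<tr><th>" + escape(code) + "</th><td>"
            + escape(value.strip()) + "</td></tr>"
        )
    return "<table>\n" + "\n".join(rows) + "\n</table>"


# Abbreviated, hand-typed lines in the style of a raw record (assumption).
sample_record = (
    "ID   HBB_HUMAN\n"
    "AC   P68871;\n"
    "OS   Homo sapiens (Human).\n"
)
print(record_to_html(sample_record))
```

A natural extension for students is to turn accession numbers into hyperlinks, which is how cross-references between databases end up clickable.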
Another obvious cross-over field is mathematics.
Conclusion
We've looked here at numerous examples of questions that can be
addressed using bioinformatic tools, and possible ideas for how such
tools/ideas/approaches might be incorporated into the classroom. In
summary here are two points I'd particularly like to emphasise:
- All the data and tools used above are FREE to
access, download, and work
with/analyse
- Bioinformatic data and tools cover a very wide range of different
aspects of biology - thus there may be many parts of the biology
syllabus where they could be included
Novel Approaches for Automated Information Collection
So far you have seen that drawing biology-related conclusions involves
querying many different resources to collect the relevant pieces of
knowledge.
This is a tedious and time-consuming process that prevents researchers
from spending their time on other research activities.
Clicking on the following button will cause the current page to be
opened in a new window.

- Have any protein names been highlighted within the page?
- What happens if somebody clicks on the highlighted names?
Tools such as Reflect
aim at automating the task of collecting relevant biological
information by presenting the users with useful summaries about the
proteins and small chemicals mentioned in a piece of text.
Now visit the Wikipedia entry for a protein or small chemical
molecule of your choice e.g. the prion protein, insulin, or Prozac.
Copy and paste the URL of this Wikipedia page into the input field of
Reflect. You
should see similar marking up of protein names as before.
The same thing can be done with many different URLs e.g.:
Thus, without specialised bioinformatics skills or without prior
knowledge of biological and chemical databases users can now get access
to essential pieces of knowledge and easily realise, for example, how
theoretical concepts such as a 3D structure materialise in practice.