Introduction to Bioinformatics

Aidan Budd and Venkata Satagopam

We begin by exploring some common features of bioinformatics tools and analyses - this will help provide a framework for understanding and criticising your analyses.

1. What does "Bioinformatics" mean?

2. Bioinformatics involves many different types of data

3. ALL bioinformatic applications depend on identifying similarity between datasets

4. ALL bioinformatic tools are designed to address one or both of these questions:

Exercises - Recognising features of bioinformatic tools

In discussion with your neighbours, consider a situation (or several situations) where you have used bioinformatic tools in the past.

While using these tools:

  • Which of these questions were you trying to answer:
    • "What is already known about my sequence/system of interest?"
    • "What can I predict about my sequence/system that I can use as a hypothesis to test in a wet-lab experiment?"
    • Or perhaps in your opinion you were addressing neither of these questions? If so, what is the central question that was being addressed?
  • Did the tool work via identifying similarity between datasets in a similar way to that described above?
    • If not, then can you describe in similar terms how the tool functioned?
    • If yes, which were the datasets being compared? And how was similarity between the datasets being assessed?

If you have only limited experience working with bioinformatic tools, then select one (or more) of the tools described in the publications listed below and use them in this exercise.

ProtTest: selection of best-fit models of protein evolution.

Abascal et al. 2005 Bioinformatics 21(9) 2104-5 PMID: 15647292

JIGSAW: integration of multiple sources of evidence for gene prediction.

Allen and Salzberg 2005, Bioinformatics 21(18) 3596-603 PMID:16076884

IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content.

Dosztanyi et al. 2005 Bioinformatics 21(16) 3433-4 PMID:15955779

4DXpress: a database for cross-species expression pattern comparisons.

Haudry et al. 2007 Nucleic Acids Research 36(Database issue):D847-53 PMID:17916571

5. Bioinformatics databases

Being aware of some of the typical features of bioinformatic database records/files can help us find what we need in this jungle of data.

A. Primary accession numbers and cross-references

Primary accession numbers aim to provide a string/number that unambiguously identifies just one record within a given database, distinguishing it from all other records in that database.


We'll show you the difference between trying to find the UniProt human Src kinase record without, and then with, knowledge of the accession number.


To illustrate the importance of accession numbers in navigating through biological databases, attempt to find records in the following databases without using accession numbers. Search either for records associated with human hemoglobin subunit beta, or for records associated with your own protein of interest. (If using the example of human hemoglobin beta, we've provided links to the relevant accession numbers for you to compare with the results of your own searches.)

Note that in some cases there may be several accession numbers referring to the same entity e.g. 1A00 and 1BBB are both PDB accession numbers for 3D structures of hemoglobin.
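The difference between a free-text search and an accession-number lookup can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the accession numbers (X00001 and so on), the record descriptions, and the function names are invented for the example and do not correspond to any real database or to the way any particular resource implements its search.

```python
# Toy "database": accession numbers and descriptions are invented
# for illustration and do not come from any real resource.
RECORDS = {
    "X00001": "Hemoglobin subunit beta (human)",
    "X00002": "Hemoglobin subunit beta (mouse)",
    "X00003": "Hemoglobin subunit alpha (human)",
    "X00004": "Tyrosine-protein kinase Src (human)",
}


def search_by_text(query):
    """Return the accessions whose description contains every query word."""
    words = query.lower().split()
    return sorted(
        accession
        for accession, description in RECORDS.items()
        if all(word in description.lower() for word in words)
    )


def fetch_by_accession(accession):
    """Return the one record stored under this accession, or None."""
    return RECORDS.get(accession)


# A text search can match several records that then need checking by hand.
text_hits = search_by_text("hemoglobin beta")

# An accession number identifies at most one record.
single_record = fetch_by_accession("X00001")
```

In this toy example the text search returns two candidate records (human and mouse), while the accession lookup returns exactly one.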

B. Overlapping bioinformatic resources


Another common situation you'll find when searching bioinformatic resources is that, in many cases, several different resources provide the same/similar kinds of information. Often you will find that there is some, but not complete, overlap between the information provided by these different sites.

For example, PhosphoSite, Human Protein Reference Database, Phospho.ELM, and UniProt all provide information about experimentally-demonstrated phosphorylation sites, e.g. for the human epsin-1 sequence (UniProt name EPN1_HUMAN).

For this reason, if you have the time and are very interested in finding evidence of particular features of your protein, it makes sense to examine more than just one resource for information of this kind.


Using the resources above, determine the extent of overlap in the set of sites described to be phosphorylated in the protein CBP_HUMAN (or in your own protein of interest).
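Once you have noted down the phosphorylated positions reported by each resource, the extent of overlap can be worked out with simple set operations. The sketch below assumes you have collected the site positions by hand; the resource labels and the positions are made-up placeholders, not real annotations for CBP_HUMAN or any other protein, and `overlap_summary` is a hypothetical helper name.

```python
def overlap_summary(sites_by_resource):
    """Summarise the overlap between sets of site positions.

    sites_by_resource maps a resource name to a set of residue positions.
    """
    all_sets = list(sites_by_resource.values())
    shared_by_all = set.intersection(*all_sets) if all_sets else set()
    reported_anywhere = set.union(*all_sets) if all_sets else set()

    # Sites that only a single resource reports.
    unique = {}
    for name, sites in sites_by_resource.items():
        others = set().union(
            *(s for other, s in sites_by_resource.items() if other != name)
        )
        unique[name] = sites - others

    return {
        "shared_by_all": shared_by_all,
        "reported_anywhere": reported_anywhere,
        "unique": unique,
    }


# Placeholder positions, invented for illustration only.
example = {
    "resource_A": {10, 25, 40},
    "resource_B": {25, 40, 77},
    "resource_C": {25, 90},
}
summary = overlap_summary(example)
```

With these placeholder values only position 25 is reported by all three resources, and each resource also reports one site that the others lack, which is the "some, but not complete, overlap" situation described above.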

C. User-friendly data visualisation

Providing representations of biological data in an easy-to-digest form is a key aspect of bioinformatics development and research.


To give an example of the difference between the raw data, and more user-friendly representations, contrast the information found in the following links, all of which present the UniProt record for CBP_HUMAN:

For each of these sites, try to find the parts of the record that refer to:

  • Domains/Families predicted in the sequence by PFAM
  • GO (Gene Ontology) Accession numbers/IDs with which the record is annotated e.g. GO:0000123
  • The amino-acid sequence

Is one of the versions of the record easier to use than the others?

Are there particular features of the representations that make them easier/more difficult for you to navigate through?

6. SRS

Exercise 1

  • When we first start to learn about a gene, a protein or a disease, we begin with a quick overview of the information related to it.
Q1: Which genes, proteins, domains, functions, pathways and structures are related to the diseases Alzheimer’s and Huntington’s?
This exercise aims to demonstrate:
• How to use the results overview in a quick search of SRS

Exercise 2

  • Now we want to run some slightly bigger queries with more than one search field
Q2a: Find all ‘Huntington’s’ related proteins.
For Q2b and Q2c, try each of the three Boolean operators AND (&), OR (|) and BUT NOT (!) and find out how their results differ.
Q2b: Find the human-specific ‘Huntington’s’ disease related proteins.
Q2c: Find the mouse-specific ‘Huntington’s’ disease related proteins.
Comparing the results from Q2a with those from Q2b or Q2c shows how to narrow down the search results to a very particular kind of data.
Q2d: Select all the results from Q2c and find any macromolecular structure information associated with them.
Q2e: Are any of the above results associated with pathways? If so, which pathways?
Now we will look at very specific information related to three proteins: TMEDA_HUMAN, HD_HUMAN and NR1H2. For each of these proteins answer the following questions:
Q2f: What are the different gene names that the proteins are known by?
Q2g: Get the FASTA sequence.
Q2h: Which papers/evidence have been associated with these proteins?
Q2i: Has the 3D structure of any of these proteins been solved?
Interpreting the result of mutational analysis is best done in the context of 3D structural information about the protein being mutated - just one of many reasons why it can be useful to learn whether the 3D structure of a protein has already been solved.
Q2j: Find out which biological processes the proteins are involved in by following the links to GO
A useful summary of the function of a protein can be obtained from examining its description using the Gene Ontology (GO). Finding other proteins described using the same GO terms is one way of obtaining a set of functionally-related proteins.
This exercise aims to demonstrate:
• How to use query builder and Boolean operators.
• How to get the related information.
• That UniProt is a very useful resource in the way it can provide an easily-overviewed summary of a huge amount of data collected about a particular protein.
• Examples of some of the different kinds of information contained in a UniProt entry.
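The effect of the three Boolean operators used in Q2b and Q2c can be pictured as operations on sets of search hits. The sketch below uses Python sets as a stand-in; the protein identifiers are invented placeholders, and this is not SRS query syntax (Python happens to share & and | with SRS, but writes BUT NOT as a minus sign).

```python
# Hypothetical hit lists: the identifiers are placeholders, not real entries.
huntington_hits = {"P1", "P2", "P3", "P4"}   # entries matching the disease term
human_hits = {"P1", "P2", "P5"}              # entries from human
mouse_hits = {"P3", "P5", "P6"}              # entries from mouse

# AND (&): entries that appear in both hit lists.
huntington_and_human = huntington_hits & human_hits

# OR (|): entries that appear in either hit list.
human_or_mouse = human_hits | mouse_hits

# BUT NOT (! in SRS, - for Python sets): entries in the first list
# that are absent from the second.
huntington_but_not_human = huntington_hits - human_hits
```

AND narrows a result set, OR widens it, and BUT NOT removes the unwanted part of it, which is why combining a disease term with an organism term using AND gives the organism-specific list.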

Exercise 3

  • Now run a linking query
Q3a: Get all the human-specific genes associated with ‘Parkinson’s’ disease, then get the proteins corresponding to them, and then find out whether there are any structures associated with those proteins.
Q3b: Start with the protein Q54PA5_DICDI and find all other proteins which contain the same domains.
This exercise aims to demonstrate:
• The linking facility, a unique feature of SRS
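A linking query such as Q3a can be thought of as following cross-references from one database to the next: genes to proteins, then proteins to structures. The sketch below illustrates that idea with plain Python dictionaries. All identifiers and link tables are invented for the example, `follow_links` is a hypothetical helper, and nothing here reflects how SRS stores or resolves its links internally.

```python
def follow_links(start_ids, link_table):
    """Collect every entry that the starting entries cross-reference."""
    linked = set()
    for identifier in start_ids:
        # Entries without any cross-reference simply contribute nothing.
        linked.update(link_table.get(identifier, ()))
    return linked


# Placeholder cross-reference tables, invented for illustration.
gene_to_protein = {
    "geneA": {"protA1", "protA2"},
    "geneB": {"protB"},
    "geneC": set(),
}
protein_to_structure = {
    "protA1": {"struct1"},
    "protB": {"struct2", "struct3"},
}

genes = {"geneA", "geneB", "geneC"}
proteins = follow_links(genes, gene_to_protein)
structures = follow_links(proteins, protein_to_structure)
```

Each step takes the result set of the previous step as its starting point, so the final set contains only structures that can be reached from the original genes.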

Learning Objectives

  • Outlining the scope of the field of bioinformatics
  • Making participants aware of common features of many/most bioinformatics tools/analyses, and providing practice at deconstructing the tools - an important step towards critically evaluating them
  • Raising awareness of some common features of biological databases
Page last modified on December 02, 2008, at 07:21 PM CET