From Tomasz Żok

Workshop: BioJavaForStructuralBioinformatics

Structural bioinformatics primer

Description (source: http://en.wikipedia.org/wiki/Structural_bioinformatics)

Structural bioinformatics is the branch of bioinformatics which is related to the analysis and prediction of the three-dimensional structure of biological macromolecules such as proteins, RNA, and DNA. It deals with generalizations about macromolecular 3D structure such as comparisons of overall folds and local motifs, principles of molecular folding, evolution, and binding interactions, and structure/function relationships, working both from experimentally solved structures and from computational models. The term structural has the same meaning as in structural biology, and structural bioinformatics can be seen as a part of computational structural biology.

Introduction

Description (source: http://biojava.org/)

BioJava is an open-source project dedicated to providing a Java framework for processing biological data. It provides analytical and statistical routines, parsers for common file formats and allows the manipulation of sequences and 3D structures. The goal of the biojava project is to facilitate rapid application development for bioinformatics.
BioJava is licensed under LGPL 2.1.


Publication:

BioJava: an open-source framework for bioinformatics in 2012

Andreas Prlic; Andrew Yates; Spencer E. Bliven; Peter W. Rose; Julius Jacobsen; Peter V. Troshin; Mark Chapman; Jianjiong Gao; Chuan Hock Koh; Sylvain Foisy; Richard Holland; Gediminas Rimsa; Michael L. Heuer; H. Brandstatter-Muller; Philip E. Bourne; Scooter Willis

Bioinformatics 2012


Some statistics for 2013-04-02 (source: http://www.ohloh.net/p/biojava)

BioJava:


Performance benchmarks: http://biojava.org/wiki/BioJava:Performance


Important links:


Installation steps:

  1. Install JDK
  2. Install Eclipse Classic
  3. Download a sample project
  4. Run Eclipse
  5. Select File→Import... and then Existing projects into workspace
  6. Choose Select archive file and Browse... for the downloaded bit13-project.zip
  7. Right-click on the project name and select Refactor→Rename...
  8. Type a new name which corresponds to the exercise title (e.g. BIT13-01-pdb)

Important concepts


Basic tasks

Loading a PDB file


Advanced tasks

RMSD calculation

Both sets must have the same size, and the atoms need to be paired (i.e. the i-th atom from structure S corresponds to the i-th atom from the other structure S').
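The pairing requirement follows directly from the definition {$$ RMSD(S, S') = \sqrt{\dfrac{1}{n}\sum_{i=1}^{n} \lVert s_i - s'_i \rVert^2} $$} where s_i and s'_i are the positions of the i-th atom in each set. Below is a minimal sketch in plain Java, assuming the coordinates were already extracted into arrays of {x, y, z} rows. The class name Rmsd and its method are illustrative and not part of BioJava, and no superposition is performed.

```java
final class Rmsd {
    private Rmsd() {}

    /**
     * Root-mean-square deviation between two paired coordinate sets.
     * Each row is one atom given as {x, y, z}; row i of a is paired with row i of b.
     * The sets are compared as they are, without any superposition.
     */
    public static double rmsd(double[][] a, double[][] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Both sets must have the same size");
        }
        if (a.length == 0) {
            throw new IllegalArgumentException("At least one atom pair is required");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < a.length; i++) {
            for (int k = 0; k < 3; k++) {
                double diff = a[i][k] - b[i][k];
                sumOfSquares += diff * diff;
            }
        }
        return Math.sqrt(sumOfSquares / a.length);
    }
}
```

If the two structures are not yet in the same frame of reference, the value computed this way mostly reflects their relative placement, so a superposition step would normally come first.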


Structure superposition


Statistics about the structure elements

The methods Structure.getChains(), Chain.getAtomGroups() and Group.getAtoms() each return a Java collection which implements the Iterable interface, so we can use a for-each loop construct like this:

for (Chain chain : structure.getChains()) {
    for (Group group : chain.getAtomGroups()) {
        for (Atom atom : group.getAtoms()) {
            // here we will visit every atom!
        }
    }
}

A Multiset (also referred to as a bag) is a collection similar to a set, except that it allows multiple occurrences of the same element. Every addition of an element increments that element's implicit counter. It is an elegant data structure for counting elements of different types.

We will use an implementation from the Google Guava library.
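To make the counting idea concrete, here is a minimal sketch that uses only the standard library. The class ElementCounter is a hypothetical stand-in written for illustration, not Guava's API; with Guava, HashMultiset.create() together with add(element) and count(element) plays the same role.

```java
import java.util.HashMap;
import java.util.Map;

final class ElementCounter<E> {
    // Maps every element to the number of times it was added.
    private final Map<E, Integer> counts = new HashMap<>();

    /** Adds one occurrence of the element, incrementing its counter. */
    public void add(E element) {
        counts.merge(element, 1, Integer::sum);
    }

    /** Returns how many times the element was added (0 if it never was). */
    public int count(E element) {
        return counts.getOrDefault(element, 0);
    }

    /** Returns the total number of occurrences over all elements. */
    public int size() {
        int total = 0;
        for (int value : counts.values()) {
            total += value;
        }
        return total;
    }
}
```

Inside the innermost for-each loop one could, for example, call add() with the atom name or the residue name and read the counters once the traversal is finished.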

The center of mass is the point whose coordinates are the mass-weighted average of the atoms' coordinates (the atoms' masses are the weights):

{$$ CoM(\mathbf{X}, \mathbf{Y}, \mathbf{Z}) = \left[ \begin{matrix} \dfrac{\sum_{i=1}^{n}m_i \cdot X_{i}}{M} & \dfrac{\sum_{i=1}^{n}m_i \cdot Y_{i}}{M} & \dfrac{\sum_{i=1}^{n}m_i \cdot Z_{i}}{M} \end{matrix} \right] $$}

Where n is the number of atoms, m_i is the mass of the i-th atom, X_i, Y_i and Z_i are its coordinates, and M is the total mass (the sum of all m_i).
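The formula above translates almost line by line into code. The sketch below is plain Java and assumes that masses and coordinates were already collected into arrays; the class name CenterOfMass is illustrative, and how the mass of each atom is obtained (for example from its element) is left out.

```java
final class CenterOfMass {
    private CenterOfMass() {}

    /**
     * Mass-weighted average of atom positions.
     * masses[i] is the mass of atom i and coords[i] is its position {x, y, z}.
     */
    public static double[] compute(double[] masses, double[][] coords) {
        if (masses.length != coords.length || masses.length == 0) {
            throw new IllegalArgumentException("Need one mass per atom and at least one atom");
        }
        double totalMass = 0.0;
        double[] center = new double[3];
        for (int i = 0; i < masses.length; i++) {
            totalMass += masses[i];
            for (int k = 0; k < 3; k++) {
                center[k] += masses[i] * coords[i][k];
            }
        }
        if (totalMass <= 0.0) {
            throw new IllegalArgumentException("Total mass must be positive");
        }
        for (int k = 0; k < 3; k++) {
            center[k] /= totalMass;
        }
        return center;
    }
}
```

Passing the same mass for every atom reduces the result to the geometric center (centroid) of the coordinates.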


Structural alignment

Combinatorial Extension (CE) and Flexible structure AlignmenT by Chaining AFPs (Aligned Fragment Pairs) with Twists (FATCAT) are well-known and popular methods for the structural alignment of proteins. For details, please refer to the following publications:

  1. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path, Shindyalov IN, Bourne PE, Protein Eng. 1998 Sep; 11(9):739-47
  2. Flexible structure alignment by chaining aligned fragment pairs allowing twists, Ye Y, Godzik A, Bioinformatics. 2003 Oct; 19 Suppl 2:ii246-55

Unfortunately, these implementations work only with proteins (not RNAs), and their API and documentation are very poor. In my opinion, this is a relatively new and less mature part of the BioJava package.


Torsion angle calculation

Previously you learned that we can change the abstraction level by moving to a more and more detailed view, i.e. Structure → Chain → Group → Atom. You can also go the other way round:

Some actions change the values stored in instances of Atom, Group, Chain and Structure. If you develop a bigger project, remember to work on clones of the original data. Otherwise, different parts of your code may see the same structure in different states. For example, you could end up in a situation like this:

  1. structure = reader.getStructureById(STRUCTURE_ID);
  2. chain = structure.getChainByPDB(CHAIN_ID);
  3. Some complex calculations now...
  4. Now you could observe that:
    • !structure.equals(chain.getParent()), or
    • !chain.equals(structure.getChainByPDB(CHAIN_ID))

This is not very likely: in general, Java libraries tend to minimize such unintuitive behaviour, and BioJava is no different. However, you should be cautious when using this library in bigger projects.
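As for the torsion angle itself, it is defined by four consecutive atoms and measures the rotation around the bond between the two middle ones. Below is a minimal sketch in plain Java that works on raw {x, y, z} coordinates. The class name Torsion and its helpers are illustrative and not part of BioJava. The result is given in degrees in the range from -180 to 180.

```java
final class Torsion {
    private Torsion() {}

    /** Torsion (dihedral) angle in degrees defined by the points p1-p2-p3-p4. */
    public static double degrees(double[] p1, double[] p2, double[] p3, double[] p4) {
        double[] b1 = subtract(p2, p1);
        double[] b2 = subtract(p3, p2);
        double[] b3 = subtract(p4, p3);
        double[] n1 = cross(b1, b2); // normal of the plane p1-p2-p3
        double[] n2 = cross(b2, b3); // normal of the plane p2-p3-p4
        double y = norm(b2) * dot(b1, n2);
        double x = dot(n1, n2);
        return Math.toDegrees(Math.atan2(y, x));
    }

    private static double[] subtract(double[] u, double[] v) {
        return new double[] {u[0] - v[0], u[1] - v[1], u[2] - v[2]};
    }

    private static double[] cross(double[] u, double[] v) {
        return new double[] {
            u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]
        };
    }

    private static double dot(double[] u, double[] v) {
        return u[0] * v[0] + u[1] * v[1] + u[2] * v[2];
    }

    private static double norm(double[] u) {
        return Math.sqrt(dot(u, u));
    }
}
```

For a nucleotide, the four points could be the coordinates of the C5', C4', C3' and O3' atoms of one residue, which gives its δ angle.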


Tasks to do on your own

  1. Statistics about amino acids:
    • Load some protein (e.g. 3CFM)
    • For every amino acid type, remember the first one encountered (a reference amino acid) and calculate RMSD against it for every subsequent amino acid of this type
    • Calculate the average and standard deviation of RMSD for every amino acid type
    • Print a ranking with the most variable amino acid (i.e. the one with the highest average RMSD) at the top and the least variable at the bottom
  2. A table of torsion angle values:
    • Load some RNA (e.g. 1EHZ)
    • Calculate the δ angle for every residue (it is defined by the following atoms: C5'-C4'-C3'-O3')
    • Print a table of values (each line has two numbers: residue index and δ angle value for this residue)
  3. RMSD with a three-residue-long window
    • Load 1EHZ and 1EVV structures
    • Read backbone atoms (P, C5', C4', C3', C2', C1', O5', O4', O3' and O2') from the first three residues of 1EHZ
    • Iterate over all triplets of residues in 1EVV (residues 1-2-3, then 2-3-4, then 3-4-5, etc.) and calculate RMSD against the atoms selected in the previous step
    • Print a ranking (each line has two numbers: the starting residue of the triplet in 1EVV and its RMSD)
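
As a starting point for the third task, the window iteration can be sketched in plain Java as below. This is only an outline under stated assumptions: coordinates are already extracted into arrays, every residue supplies the same atoms in the same order (real files may lack some atoms), and no superposition is performed. The class name WindowRmsd and its methods are illustrative and not part of BioJava.

```java
import java.util.ArrayList;
import java.util.List;

final class WindowRmsd {
    private WindowRmsd() {}

    /**
     * Slides a window of reference.length residues over the target and
     * returns one RMSD per window start. Arrays are indexed as
     * [residue][atom][x, y or z].
     */
    public static double[] scan(double[][][] reference, double[][][] target) {
        int window = reference.length;
        int windowCount = target.length - window + 1;
        if (window == 0 || windowCount <= 0) {
            throw new IllegalArgumentException("Target must be at least as long as a non-empty reference");
        }
        double[][] referenceAtoms = flatten(reference, 0, window);
        double[] result = new double[windowCount];
        for (int start = 0; start < windowCount; start++) {
            result[start] = rmsd(referenceAtoms, flatten(target, start, window));
        }
        return result;
    }

    /** Collects the atoms of residues start..start+length-1 into one list of rows. */
    private static double[][] flatten(double[][][] residues, int start, int length) {
        List<double[]> atoms = new ArrayList<>();
        for (int r = start; r < start + length; r++) {
            for (double[] atom : residues[r]) {
                atoms.add(atom);
            }
        }
        return atoms.toArray(new double[0][]);
    }

    /** RMSD between paired atoms, without superposition. */
    private static double rmsd(double[][] a, double[][] b) {
        if (a.length != b.length || a.length == 0) {
            throw new IllegalArgumentException("Atom sets must be non-empty and of equal size");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < a.length; i++) {
            for (int k = 0; k < 3; k++) {
                double diff = a[i][k] - b[i][k];
                sumOfSquares += diff * diff;
            }
        }
        return Math.sqrt(sumOfSquares / a.length);
    }
}
```

Index 0 of the returned array corresponds to the first triplet, index 1 to the second, and so on; sorting these pairs by value gives the ranking.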
Retrieved from http://www.cs.put.poznan.pl/tzok/wiki/index.php?n=Workshop.BioJavaForStructuralBioinformatics
Page last modified on 2013 Jun Tue 25 15:46