Structural bioinformatics primer

Description (source: http://en.wikipedia.org/wiki/Structural_bioinformatics)

Structural bioinformatics is the branch of bioinformatics which is related to the analysis and prediction of the three-dimensional structure of biological macromolecules such as proteins, RNA, and DNA. It deals with generalizations about macromolecular 3D structure such as comparisons of overall folds and local motifs, principles of molecular folding, evolution, and binding interactions, and structure/function relationships, working both from experimentally solved structures and from computational models. The term structural has the same meaning as in structural biology, and structural bioinformatics can be seen as a part of computational structural biology.
• Three-dimensional structures in silico are represented as files in PDB format
• A PDB file contains metadata (about the authors, the experimental method, technicalities, etc.) and most importantly - information about the atoms
• Every atom is described by a list of values (index, residue number, name, coordinates x, y, z, etc.)
• Loading a PDB file means to read all these information. And analyzing the structure means to process this huge amount of information in a logical and purposeful manner
• Example: structure was rotated 15° about the X axis in techincal terms means a matrix with coordinates of all atoms was multiplied by a rotation matrix {$$R_x(15^{\circ}) = \left[ \begin{matrix} 1 & 0 & 0 \\ 0 & \cos 15^{\circ} & -\sin 15^{\circ} \\ 0 & \sin 15^{\circ} & \cos 15^{\circ} \end{matrix} \right]$$}
• This is where various libraries like BioJava come in hand. Most of the tedious work (dealing with errors in the input files, parsing the PDB structures, rotations and many others) are handled by the library. Structural bioinformaticians can focus on solving the real problem instead of dealing with low-level difficulties

Introduction

Description (source: http://biojava.org/)

BioJava is an open-source project dedicated to providing a Java framework for processing biological data. It provides analytical and statistical routines, parsers for common file formats and allows the manipulation of sequences and 3D structures. The goal of the biojava project is to facilitate rapid application development for bioinformatics.

Publication:

Andreas Prlic; Andrew Yates; Spencer E. Bliven; Peter W. Rose; Julius Jacobsen; Peter V. Troshin; Mark Chapman; Jianjiong Gao; Chuan Hock Koh; Sylvain Foisy; Richard Holland; Gediminas Rimsa; Michael L. Heuer; H. Brandstatter-Muller; Philip E. Bourne; Scooter Willis

Bioinformatics 2012

Some statistics for 2013-04-02 (source: http://www.ohloh.net/p/biojava)

BioJava:

• has had 6,442 commits made by 70 contributors, representing 290,608 lines of code
• is mostly written in Java with an average number of source code comments
• has a well established, mature codebase, maintained by a large development team, with decreasing Y-O-Y commits
• took an estimated 76 years of effort (COCOMO model), starting with its first commit in January, 2000, ending with its most recent commit 3 months ago

Performance benchmarks: http://biojava.org/wiki/BioJava:Performance

Installation steps:

1. Install JDK
2. Install Eclipse Classic
4. Run Eclipse
5. Select File→Import... and then Existing projects into workspace
7. Right-click on the project name and select Refactor→Rename...
8. Type a new name which corresponds to the exercise title (e.g. BIT13-01-pdb)

Important concepts

• In BioJava we can deal with various level of abstraction seen as Java interfaces:
• org.biojava.bio.structure.Structure (implementation: org.biojava.bio.structure.StructureImpl)
• org.biojava.bio.structure.Chain (implementation: org.biojava.bio.structure.ChainImpl)
• org.biojava.bio.structure.Group (implementations: org.biojava.bio.structure.AminoAcidImpl, org.biojava.bio.structure.NucleotideImpl and org.biojava.bio.structure.HetatomImpl)
• org.biojava.bio.structure.Atom (implementation: org.biojava.bio.structure.AtomImpl)
• A PDB file can contain multiple models. In BioJava, every model is a separate Structure object.
• Most classes in BioJava override the toString() method and so their content can be easily represented as a string

• Create an instance of a PDBFileReader and enable automatic download of PDBs
• Load a Structure object using:
• reader.getStructure(String path)
• reader.getStructure(File path)
File file = new File("/tmp/1EVV.pdb");
if (file.exists()) {
}
• reader.getStructure(URL url)
URL url = new URL("http://www.cs.put.poznan.pl/tzok/files/2013-06-bit13-biojava/broken.pdb");
• reader.getStructureById(String id)
• Handle the IOException
• Full source code of the example: (set path to the file from previous bullet point in the lines pointed by comments)

package pl.org.ptbi;

import java.io.File;
import java.io.IOException;
import java.net.URL;

import org.biojava.bio.structure.Structure;

public class Main {
public static void main(String[] args) {

Structure structure;
try {
// SET PATH HERE!
System.out.println(structure);

// SET PATH HERE!
File file = new File("/tmp/1EVV.pdb");
if (file.exists()) {
System.out.println(structure);
}

URL url = new URL("http://www.cs.put.poznan.pl/"
+ "tzok/files/2013-06-bit13-biojava/broken.pdb");
System.out.println(structure);

System.out.println(structure);
} catch (IOException e) {
e.printStackTrace();
}
}
}

RMSD calculation

• Root Mean Square Deviation (RMSD) is a formula to calculate an average distance between two sets of corresponding points: {$$RMSD = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\delta_i^2}$$} {$$\delta_i = \sqrt{(x_i-x_i')^2 + (y_i - y_i')^2 + (z_i - z_i')^2}$$}

Both sets must have the same size! And the atoms needs to be paired (i.e. i-th atom from structure S corresponds to the i-th atom from the other structure S')

• In structural bioinformatics, points = atoms
• Let's calculate RMSD for two tRNAs:
• Both structures have a single chain A with 76 nucleotides (good for RMSD calculation)
• Get their only chain A by using the Structure.getChainbyPDB(String id) method (important: it throws org.biojava.bio.structure.StructureException upon problems):
left.getChainByPDB("A");
right.getChainByPDB("A");
• We will focus on phosphorus atoms only (there is exactly one per residue). We can iterate over all residues using the Chain.getAtomGroups() method and we can extract the given atom (P for phosphorus) with the Group.getAtom(String name) method (important: it throws StructureException upon error). Let's use the following helper function:
private static List<Atom> getPhosphorus(Chain chain) {
List<Atom> list = new ArrayList<>();

for (Group group : chain.getAtomGroups()) {
try {
Atom atom = group.getAtom("P");
} catch (StructureException e) {
System.err.println(e.getMessage());
}
}
return list;
}
• Now we can calculate euclidean distances between corresponding points. Let's use the following helper function:
private static double calculateDistance(Atom left, Atom right) {
double dx = left.getX() - right.getX();
double dy = left.getY() - right.getY();
double dz = left.getZ() - right.getZ();
return Math.sqrt(dx * dx + dy * dy + dz * dz);
}
• The last step for now is to apply the RMSD formula in Java code:
double rmsd = 0;
for (int i = 0; i < size; i++) {
Atom atomLeft = listLeft.get(i);
Atom atomRight = listRight.get(i);

double distance = calculateDistance(atomLeft, atomRight);
rmsd += distance * distance;
}
rmsd /= size;
rmsd = Math.sqrt(rmsd);
• The whole code for this task:

package pl.org.ptbi;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.biojava.bio.structure.Atom;
import org.biojava.bio.structure.Chain;
import org.biojava.bio.structure.Group;
import org.biojava.bio.structure.Structure;
import org.biojava.bio.structure.StructureException;

public class Main {
public static void main(String[] args) {

try {

List<Atom> listLeft = getPhosphorus(left.getChainByPDB("A"));
List<Atom> listRight = getPhosphorus(right.getChainByPDB("A"));

assert listLeft.size() == listRight.size();
double rmsd = 0;
for (int i = 0; i < listLeft.size(); i++) {
Atom atomLeft = listLeft.get(i);
Atom atomRight = listRight.get(i);

double distance = calculateDistance(atomLeft, atomRight);
rmsd += distance * distance;
}
rmsd /= listLeft.size();
rmsd = Math.sqrt(rmsd);
System.out.println(rmsd);
} catch (IOException | StructureException e) {
e.printStackTrace();
}
}

private static List<Atom> getPhosphorus(Chain chain) {
List<Atom> list = new ArrayList<>();

for (Group group : chain.getAtomGroups()) {
try {
Atom atom = group.getAtom("P");
} catch (StructureException e) {
System.err.println(e.getMessage());
}
}
return list;
}

private static double calculateDistance(Atom left, Atom right) {
double dx = left.getX() - right.getX();
double dy = left.getY() - right.getY();
double dz = left.getZ() - right.getZ();
return Math.sqrt(dx * dx + dy * dy + dz * dz);
}
}
• Now use java.util.Collections.shuffle(List list) for listLeft and listRight to randomize the order of atoms. How does it influence the result? Why

Structure superposition

• Is the calculated RMSD a valid measure of similarity? No! Because, the atoms in a PDB file are arranged differently in space and the formula does not account for this
• For a real comparison, one has to superimpose the structures (move and rotate them to minimize the final RMSD)
• We can use the org.biojava.bio.structure.SVDSuperimposer class which calculates the rotation matrix and translation vector and the org.biojava.bio.structure.Calc to apply them to the input structures:
Atom[] atomSet1 = listLeft.toArray(new Atom[listLeft.size()]);
Atom[] atomSet2 = listRight.toArray(new Atom[listRight.size()]);
SVDSuperimposer superimposer = new SVDSuperimposer(atomSet1, atomSet2);
Calc.rotate(right, superimposer.getRotation());
Calc.shift(right, superimposer.getTranslation());

// proceed with regular RMSD calculation from now on
• As you can see, we calculate the superposition based on a selection of atoms (phosphorus only), then apply it to the whole structure (identificator is right) to finally calculate RMSD again only for the selected atoms
• There is a static method SVDSuperimposer.getRMS(Atom[] left, Atom[] right) which calculates RMSD. We can thus replace the loop from previous exercise with a library call:
// the commented code below is functionally equal to a one-liner library call in the end
/*
double rmsd = 0;
for (int i = 0; i < size; i++) {
Atom atomLeft = listLeft.get(i);
Atom atomRight = listRight.get(i);

double distance = calculateDistance(atomLeft, atomRight);
rmsd += distance * distance;
}
rmsd /= size;
rmsd = Math.sqrt(rmsd);
*/

double rmsd = SVDSuperimposer.getRMS(atomSet1, atomSet2);
• The final code for this task is the following:

package pl.org.ptbi;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.biojava.bio.structure.Atom;
import org.biojava.bio.structure.Calc;
import org.biojava.bio.structure.Chain;
import org.biojava.bio.structure.Group;
import org.biojava.bio.structure.SVDSuperimposer;
import org.biojava.bio.structure.Structure;
import org.biojava.bio.structure.StructureException;

public class Main {
public static void main(String[] args) {

try {

List<Atom> listLeft = getPhosphorus(left.getChainByPDB("A"));
List<Atom> listRight = getPhosphorus(right.getChainByPDB("A"));

Atom[] atomSet1 = listLeft.toArray(new Atom[listLeft.size()]);
Atom[] atomSet2 = listRight.toArray(new Atom[listRight.size()]);
SVDSuperimposer superimposer = new SVDSuperimposer(atomSet1, atomSet2);
Calc.rotate(right, superimposer.getRotation());
Calc.shift(right, superimposer.getTranslation());

double rmsd = SVDSuperimposer.getRMS(atomSet1, atomSet2);
System.out.println(rmsd);
} catch (IOException | StructureException e) {
e.printStackTrace();
}
}

private static List<Atom> getPhosphorus(Chain chain) {
List<Atom> list = new ArrayList<>();

for (Group group : chain.getAtomGroups()) {
try {
Atom atom = group.getAtom("P");
} catch (StructureException e) {
System.err.println(e.getMessage());
}
}
return list;
}
}
• Now modify the code above to calculate RMSD for P, C5', C4', C3', C2', C1', O5', O4', O3' and O2' atoms. How does it influence the RMSD? Why?

• BioJava allows to look on a structure on any abstraction level
• In this task, we will gather some straightforward information about a given structure

All of these methods Structure.getChains(), Chain.getAtomGroups() and Group.getAtoms() return a Java collection which implements the Iterable interface. We can use a for-each loop construct like this:

for (Chain chain : structure.getChains()) {
for (Group group : chain.getAtomGroups()) {
for (Atom atom : group.getAtoms()) {
// here we will visit every atom!
}
}
}

• Every Atom object carries with itself some chemo-physical information inside of the embedded org.biojava.bio.structure.Element instance
• Element objects carry a lot of important information, see the Javadoc entry for this type:
• Let's collect information about total atom count, total mass and histogram of elements present in the given structure

A Multiset (referred to also as a bag collection) is a set which allows for multiple occurrences of elements. Every addition of a new element increments the implicit counter. It is an elegant data structure for counting of elements of different types.

We will use an implementation from Google Guava library.

• To obtain Element instance from any given Atom, you need to call Atom.getElement()
• The full source code for this task is the following:

package pl.org.ptbi;

import java.io.IOException;

import org.biojava.bio.structure.Atom;
import org.biojava.bio.structure.Chain;
import org.biojava.bio.structure.Element;
import org.biojava.bio.structure.Group;
import org.biojava.bio.structure.Structure;

public class Main {
public static void main(String[] args) {

Structure structure;
try {
} catch (IOException e) {
e.printStackTrace();
return;
}

SortedMultiset<Element> bag = TreeMultiset.create();
double mass = 0;
for (Chain chain : structure.getChains()) {
for (Group group : chain.getAtomGroups()) {
for (Atom atom : group.getAtoms()) {
Element element = atom.getElement();
mass += element.getAtomicMass();
}
}
}

System.out.println("Total mass: " + mass);
System.out.println("Atom count: " + bag.size());
System.out.println("Elements present:");
for (Element element : bag.elementSet()) {
System.out.println(element + " : " + bag.count(element));
}
}
}
• Now modify the above code to calculate the center of mass of the given structure

Center of mass is a point with coordinates equal to weighted average of mass (input atoms' coordinates are the weights) {$$CoM(\mathbf{X}, \mathbf{Y}, \mathbf{Z}) = \left[ \begin{matrix} \dfrac{\sum_{i=1}^{n}m_i \cdot X_{i}}{M} & \dfrac{\sum_{i=1}^{n}m_i \cdot Y_{i}}{M} & \dfrac{\sum_{i=1}^{n}m_i \cdot Z_{i}}{M} \end{matrix} \right]$$} Where:

• {$\mathbf{X}$} - a list of x coordinates of atoms
• {$\mathbf{Y}$} - a list of y coordinates of atoms
• {$\mathbf{Z}$} - a list of z coordinates of atoms
• {$n$} - the number of atoms
• {$m_i$} - the mass of i-th atom
• {$M$} - the total mass

Structural alignment

• In one of the previous tasks we have used the superposition of structures. To be exact, SVD superposition is aimed at minimization of RMSD distance
• Is this always desirable? No! In some cases, the objective of superposition can be different. In general we talk about structural alignment where we try to find such a superposition that the structures as a whole or as a set of fragments are positioned optimally with regards to some chosen criterion
• With proteins, the criterion could be to maximize the number of aligned (meaning: spatially close) C-α atoms
• Structural alignment is also more general as it does not require the input structures to be of the same size
• BioJava provides a set of ready-to-use implementations of the interface org.biojava.bio.structure.align.StructureAlignment

Combinatorial Extension (CE) and Flexible structure AlignmenT by Chaining AFPs (Aligned Fragment Pairs) with Twists (FATCAT) are well-known and popular structural alignment methods of proteins. Please refer to the following publications:

Unfortunately, these implementations work only with proteins (not RNAs). And their API/documentation is very poor - in my opinion, this is a less mature and relatively new part of BioJava package

• In this task, we will try to look at properties of these implementations. For this purpose, let's try to align a transthyretin protein (PDB id: 3CFM) and a transthyretin-like one (PDB id: 3UAF). These are structures different in size and shape, but somehow similar to each other - let's check out how!
• Thanks to the polymorphism feature of Java, we can substitute the alignment algorithm in runtime. Each of the presented implementations has a common interface of StructureAlignment. So we can have a single variable of this type, and provide it with various implementations in a loop
StructureAlignment[] algorithms = new StructureAlignment[] { new CeMain(),
new SmithWaterman3Daligner(), new BioJavaStructureAlignment(),
new FatCatRigid(), new FatCatFlexible() };
for (StructureAlignment algorithm : algorithms) {
...
}
• These algorithms work only on C-α, so we will have to extract them first. Let's use a modified version of the code which extracted phosphorus from RNAs
private static Atom[] getCarbonAlpha(Structure structure) {
List<Atom> list = new ArrayList<>();

for (Chain chain : structure.getChains()) {
for (Group group : chain.getAtomGroups()) {
try {
Atom atom = group.getAtom("CA");
} catch (StructureException e) {
System.err.println(e.getMessage());
}
}
}
return list.toArray(new Atom[list.size()]);
}
• To align the structures use StructureAlignment.align(Atom[] ca1, Atom[] ca2) method. It returns an instance of org.biojava.bio.structure.align.model.AFPChain class. We can use AFPChain.toCE() method to get a preformatted output.
• To sum up, to solve this task:
2. Extract C-α atoms
3. Construct an array of algorithm implementations
4. In a loop, reuse each algorithm to align the structures and print a formatted output
• The full source code for this task:
package pl.org.ptbi;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.biojava.bio.structure.Atom;
import org.biojava.bio.structure.Chain;
import org.biojava.bio.structure.Group;
import org.biojava.bio.structure.Structure;
import org.biojava.bio.structure.StructureException;
import org.biojava.bio.structure.align.BioJavaStructureAlignment;
import org.biojava.bio.structure.align.StructureAlignment;
import org.biojava.bio.structure.align.ce.CeMain;
import org.biojava.bio.structure.align.fatcat.FatCatFlexible;
import org.biojava.bio.structure.align.fatcat.FatCatRigid;
import org.biojava.bio.structure.align.model.AFPChain;
import org.biojava.bio.structure.align.seq.SmithWaterman3Daligner;

public class Main {
public static void main(String[] args) {

try {

Atom[] ca1 = getCarbonAlpha(left);
Atom[] ca2 = getCarbonAlpha(right);

StructureAlignment[] algorithms = new StructureAlignment[] { new CeMain(),
new SmithWaterman3Daligner(), new BioJavaStructureAlignment(),
new FatCatRigid(), new FatCatFlexible() };
for (StructureAlignment algorithm : algorithms) {
System.out.println(algorithm.getAlgorithmName());
AFPChain alignment = algorithm.align(ca1, ca2);
System.out.println(alignment.toCE(ca1, ca2));
}
} catch (IOException | StructureException e) {
e.printStackTrace();
}
}

private static Atom[] getCarbonAlpha(Structure structure) {
List<Atom> list = new ArrayList<>();

for (Chain chain : structure.getChains()) {
for (Group group : chain.getAtomGroups()) {
try {
Atom atom = group.getAtom("CA");
} catch (StructureException e) {
System.err.println(e.getMessage());
}
}
}
return list.toArray(new Atom[list.size()]);
}
}

Torsion angle calculation

• Torsion angle is generally speaking an angle between two planes (see: http://en.wikipedia.org/wiki/Dihedral_angle)
• In structural biology, there are several torsion angles defined between selected fours of atoms (A-B-C-D, here A-B-C constitute one plane and B-C-D the second one)
• They describe the shape and fold of the structure, it is an alternative representation to the Cartesian one (atoms as points)
• We can use the org.biojava.bio.structure.Calc.torsionAngle(Atom a, Atom b, Atom c, Atom d) static method to calculate any given torsion angle
• Let's reuse the code from previous task used for selection of C-α atoms
• The resulting array should be divided into overlapping fours (1-2-3-4, then 2-3-4-5, then 3-4-5-6, etc.). Calculate torsion angle for every such four of atoms.
• The full source code for this task is the following:

package pl.org.ptbi;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.biojava.bio.structure.Atom;
import org.biojava.bio.structure.Calc;
import org.biojava.bio.structure.Chain;
import org.biojava.bio.structure.Group;
import org.biojava.bio.structure.Structure;
import org.biojava.bio.structure.StructureException;

public class Main {
public static void main(String[] args) {

Structure structure;
try {
Atom[] atoms = getCarbonAlpha(structure);
for (int i = 0; i + 3 < atoms.length; i++) {
double torsion = Calc.torsionAngle(atoms[i], atoms[i + 1], atoms[i + 2], atoms[i + 3]);
System.out.println(torsion);
}
} catch (IOException | StructureException e) {
e.printStackTrace();
}
}

private static Atom[] getCarbonAlpha(Structure structure) {
List<Atom> list = new ArrayList<>();

for (Chain chain : structure.getChains()) {
for (Group group : chain.getAtomGroups()) {
try {
Atom atom = group.getAtom("CA");
} catch (StructureException e) {
System.err.println(e.getMessage());
}
}
}
return list.toArray(new Atom[list.size()]);
}
}
• The loaded 3CFM structure has two chains. When we select all C-α atoms, we ignore this fact. Modify the code above so that if the chain identifier is not equal for the whole four of atoms, the torsion angle is not calculated (and a message "A different chain id found!" is printed)

Previously you have learned that we can change abstraction level by going into more and more detailed view i.e. Structure → Chain → Group →Atom. The truth is that you can also go the other way round:

• Atom.getGroup() returns the amino acid or nucleotide that contains this atom
• Group.getChain() returns the chain that the given group belongs to
• Chain.getParent() returns the structure that has the given chain

Some actions change the values in the instances of Atom, Group, Chain and Structure. If you develop a bigger project, you should keep in mind to make your work on the clones of the original data. Otherwise, the same structure after modification will be seen differently. For example you could end up in a situation like this:

1. structure = reader.getStructureById(STRUCTURE_ID);
2. chain = structure.getChainByPDB(CHAIN_ID);
3. Some complex calculations now...
4. Now you could observe that:
• !structure.equals(chain.getParent());, or
• !chain.equals(structure.getChainByPDB(CHAIN_ID));

This is not very likely - in general Java libraries tend to minimize occurrences of such unintuitive behaviour and BioJava is no different. However, you should be cautious in bigger projects using this library.