Tomasz Żok | Workshop / BioJavaForStructuralBioinformatics

Structural bioinformatics primer

Description (source: http://en.wikipedia.org/wiki/Structural_bioinformatics)

Structural bioinformatics is the branch of bioinformatics which is related to the analysis and prediction of the three-dimensional structure of biological macromolecules such as proteins, RNA, and DNA. It deals with generalizations about macromolecular 3D structure such as comparisons of overall folds and local motifs, principles of molecular folding, evolution, and binding interactions, and structure/function relationships, working both from experimentally solved structures and from computational models. The term structural has the same meaning as in structural biology, and structural bioinformatics can be seen as a part of computational structural biology.

Three-dimensional structures in silico are represented as files in PDB format
A PDB file contains metadata (about the authors, the experimental method, technicalities, etc.) and most importantly - information about the atoms
Every atom is described by a list of values (index, residue number, name, coordinates x, y, z, etc.)
Loading a PDB file means to read all these information. And analyzing the structure means to process this huge amount of information in a logical and purposeful manner
Example: structure was rotated 15° about the X axis in techincal terms means a matrix with coordinates of all atoms was multiplied by a rotation matrix {$$ R_x(15^{\circ}) = \left[ \begin{matrix} 1 & 0 & 0 \\ 0 & \cos 15^{\circ} & -\sin 15^{\circ} \\ 0 & \sin 15^{\circ} & \cos 15^{\circ} \end{matrix} \right] $$}
This is where various libraries like BioJava come in hand. Most of the tedious work (dealing with errors in the input files, parsing the PDB structures, rotations and many others) are handled by the library. Structural bioinformaticians can focus on solving the real problem instead of dealing with low-level difficulties

Introduction

Description (source: http://biojava.org/)

BioJava is an open-source project dedicated to providing a Java framework for processing biological data. It provides analytical and statistical routines, parsers for common file formats and allows the manipulation of sequences and 3D structures. The goal of the biojava project is to facilitate rapid application development for bioinformatics.

BioJava is licensed under LGPL 2.1.

Publication:

BioJava: an open-source framework for bioinformatics in 2012

Andreas Prlic; Andrew Yates; Spencer E. Bliven; Peter W. Rose; Julius Jacobsen; Peter V. Troshin; Mark Chapman; Jianjiong Gao; Chuan Hock Koh; Sylvain Foisy; Richard Holland; Gediminas Rimsa; Michael L. Heuer; H. Brandstatter-Muller; Philip E. Bourne; Scooter Willis

Bioinformatics 2012

Some statistics for 2013-04-02 (source: http://www.ohloh.net/p/biojava)

BioJava:

has had 6,442 commits made by 70 contributors, representing 290,608 lines of code
is mostly written in Java with an average number of source code comments
has a well established, mature codebase, maintained by a large development team, with decreasing Y-O-Y commits
took an estimated 76 years of effort (COCOMO model), starting with its first commit in January, 2000, ending with its most recent commit 3 months ago

Performance benchmarks: http://biojava.org/wiki/BioJava:Performance

Important links:

the documentation: http://www.biojava.org/docs/api/index.html
the CookBook (code samples): http://biojava.org/wiki/BioJava:CookBook

Installation steps:

Install JDK
Install Eclipse Classic
Download a sample project
Run Eclipse
Select File→Import... and then Existing projects into workspace
Choose Select archive file and Browse... for the downloaded bit13-project.zip
Right-click on the project name and select Refactor→Rename...
Type a new name which corresponds to the exercise title (e.g. BIT13-01-pdb)

Important concepts

In BioJava we can deal with various level of abstraction seen as Java interfaces:
- org.biojava.bio.structure.Structure (implementation: org.biojava.bio.structure.StructureImpl)
- org.biojava.bio.structure.Chain (implementation: org.biojava.bio.structure.ChainImpl)
- org.biojava.bio.structure.Group (implementations: org.biojava.bio.structure.AminoAcidImpl, org.biojava.bio.structure.NucleotideImpl and org.biojava.bio.structure.HetatomImpl)
- org.biojava.bio.structure.Atom (implementation: org.biojava.bio.structure.AtomImpl)
A PDB file can contain multiple models. In BioJava, every model is a separate Structure object.
Most classes in BioJava override the toString() method and so their content can be easily represented as a string

Basic tasks

Loading a PDB file

Create an instance of a PDBFileReader and enable automatic download of PDBs

PDBFileReader reader = new PDBFileReader();
reader.setAutoFetch(true);
reader.setPath(System.getProperty("java.io.tmpdir"));

[$[Get Code]]
Load a Structure object using:
- reader.getStructure(String path)
  
  structure = reader.getStructure("/tmp/1EHZ.pdb.gz");
  
  [$[Get Code]]
- reader.getStructure(File path)
  
  File file = new File("/tmp/1EVV.pdb");
  if (file.exists()) {
  reader.getStructure(file);
  }
  
  [$[Get Code]]
- reader.getStructure(URL url)
  
  URL url = new URL("http://www.cs.put.poznan.pl/tzok/files/2013-06-bit13-biojava/broken.pdb");
  reader.getStructure(url);
  
  [$[Get Code]]
- reader.getStructureById(String id)
  
  reader.getStructureById("1EHZ");
  
  [$[Get Code]]
Handle the IOException
For this task, please download a sample PDB file (remember the path!)
Full source code of the example: (set path to the file from previous bullet point in the lines pointed by comments)

package pl.org.ptbi;

import java.io.File;
import java.io.IOException;
import java.net.URL;

import org.biojava.bio.structure.Structure;
import org.biojava.bio.structure.io.PDBFileReader;

public class Main {
public static void main(String[] args) {
PDBFileReader reader = new PDBFileReader();
reader.setAutoFetch(true);
reader.setPath(System.getProperty("java.io.tmpdir"));

Structure structure;
try {
// SET PATH HERE!
structure = reader.getStructure("/tmp/1EHZ.pdb.gz");
System.out.println(structure);

// SET PATH HERE!
File file = new File("/tmp/1EVV.pdb");
if (file.exists()) {
structure = reader.getStructure(file);
System.out.println(structure);
}

URL url = new URL("http://www.cs.put.poznan.pl/"
+ "tzok/files/2013-06-bit13-biojava/broken.pdb");
structure = reader.getStructure(url);
System.out.println(structure);

structure = reader.getStructureById("1EHZ");
System.out.println(structure);
} catch (IOException e) {
e.printStackTrace();
}
}
}

[$[Get Code]]

Advanced tasks

RMSD calculation

Root Mean Square Deviation (RMSD) is a formula to calculate an average distance between two sets of corresponding points: {$$ RMSD = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\delta_i^2} $$} {$$ \delta_i = \sqrt{(x_i-x_i')^2 + (y_i - y_i')^2 + (z_i - z_i')^2} $$}

Both sets must have the same size! And the atoms needs to be paired (i.e. i-th atom from structure S corresponds to the i-th atom from the other structure S')

In structural bioinformatics, points = atoms
Let's calculate RMSD for two tRNAs:
- 1EHZ
- 1EVV
Both structures have a single chain A with 76 nucleotides (good for RMSD calculation)
Load these two structures:

Structure left = reader.getStructureById("1EHZ");
Structure right = reader.getStructureById("1EHZ");

[$[Get Code]]
Get their only chain A by using the Structure.getChainbyPDB(String id) method (important: it throws org.biojava.bio.structure.StructureException upon problems):

left.getChainByPDB("A");
right.getChainByPDB("A");

[$[Get Code]]
We will focus on phosphorus atoms only (there is exactly one per residue). We can iterate over all residues using the Chain.getAtomGroups() method and we can extract the given atom (P for phosphorus) with the Group.getAtom(String name) method (important: it throws StructureException upon error). Let's use the following helper function:

private static List<Atom> getPhosphorus(Chain chain) {
List<Atom> list = new ArrayList<>();

for (Group group : chain.getAtomGroups()) {
try {
Atom atom = group.getAtom("P");
list.add(atom);
} catch (StructureException e) {
System.err.println(e.getMessage());
}
}
return list;
}

[$[Get Code]]
Now we can calculate euclidean distances between corresponding points. Let's use the following helper function:

private static double calculateDistance(Atom left, Atom right) {
double dx = left.getX() - right.getX();
double dy = left.getY() - right.getY();
double dz = left.getZ() - right.getZ();
return Math.sqrt(dx * dx + dy * dy + dz * dz);
}

[$[Get Code]]
The last step for now is to apply the RMSD formula in Java code:

double rmsd = 0;
for (int i = 0; i < size; i++) {
Atom atomLeft = listLeft.get(i);
Atom atomRight = listRight.get(i);

double distance = calculateDistance(atomLeft, atomRight);
rmsd += distance * distance;
}
rmsd /= size;
rmsd = Math.sqrt(rmsd);

[$[Get Code]]
The whole code for this task:

package pl.org.ptbi;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.biojava.bio.structure.Atom;
import org.biojava.bio.structure.Chain;
import org.biojava.bio.structure.Group;
import org.biojava.bio.structure.Structure;
import org.biojava.bio.structure.StructureException;
import org.biojava.bio.structure.io.PDBFileReader;

public class Main {
public static void main(String[] args) {
PDBFileReader reader = new PDBFileReader();
reader.setAutoFetch(true);
reader.setPath(System.getProperty("java.io.tmpdir"));

try {
Structure left = reader.getStructureById("1EHZ");
Structure right = reader.getStructureById("1EVV");

List<Atom> listLeft = getPhosphorus(left.getChainByPDB("A"));
List<Atom> listRight = getPhosphorus(right.getChainByPDB("A"));

assert listLeft.size() == listRight.size();
double rmsd = 0;
for (int i = 0; i < listLeft.size(); i++) {
Atom atomLeft = listLeft.get(i);
Atom atomRight = listRight.get(i);

double distance = calculateDistance(atomLeft, atomRight);
rmsd += distance * distance;
}
rmsd /= listLeft.size();
rmsd = Math.sqrt(rmsd);
System.out.println(rmsd);
} catch (IOException | StructureException e) {
e.printStackTrace();
}
}

private static List<Atom> getPhosphorus(Chain chain) {
List<Atom> list = new ArrayList<>();

for (Group group : chain.getAtomGroups()) {
try {
Atom atom = group.getAtom("P");
list.add(atom);
} catch (StructureException e) {
System.err.println(e.getMessage());
}
}
return list;
}

private static double calculateDistance(Atom left, Atom right) {
double dx = left.getX() - right.getX();
double dy = left.getY() - right.getY();
double dz = left.getZ() - right.getZ();
return Math.sqrt(dx * dx + dy * dy + dz * dz);
}
}

[$[Get Code]]
Now use java.util.Collections.shuffle(List list) for listLeft and listRight to randomize the order of atoms. How does it influence the result? Why

Structure superposition

Is the calculated RMSD a valid measure of similarity? No! Because, the atoms in a PDB file are arranged differently in space and the formula does not account for this
For a real comparison, one has to superimpose the structures (move and rotate them to minimize the final RMSD)
Curious people can read about the details here
We can use the org.biojava.bio.structure.SVDSuperimposer class which calculates the rotation matrix and translation vector and the org.biojava.bio.structure.Calc to apply them to the input structures:

Atom[] atomSet1 = listLeft.toArray(new Atom[listLeft.size()]);
Atom[] atomSet2 = listRight.toArray(new Atom[listRight.size()]);
SVDSuperimposer superimposer = new SVDSuperimposer(atomSet1, atomSet2);
Calc.rotate(right, superimposer.getRotation());
Calc.shift(right, superimposer.getTranslation());

// proceed with regular RMSD calculation from now on

[$[Get Code]]
As you can see, we calculate the superposition based on a selection of atoms (phosphorus only), then apply it to the whole structure (identificator is right) to finally calculate RMSD again only for the selected atoms
There is a static method SVDSuperimposer.getRMS(Atom[] left, Atom[] right) which calculates RMSD. We can thus replace the loop from previous exercise with a library call:

// the commented code below is functionally equal to a one-liner library call in the end
/*
double rmsd = 0;
for (int i = 0; i < size; i++) {
Atom atomLeft = listLeft.get(i);
Atom atomRight = listRight.get(i);

double distance = calculateDistance(atomLeft, atomRight);
rmsd += distance * distance;
}
rmsd /= size;
rmsd = Math.sqrt(rmsd);
*/
double rmsd = SVDSuperimposer.getRMS(atomSet1, atomSet2);

[$[Get Code]]
The final code for this task is the following:

package pl.org.ptbi;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.biojava.bio.structure.Atom;
import org.biojava.bio.structure.Calc;
import org.biojava.bio.structure.Chain;
import org.biojava.bio.structure.Group;
import org.biojava.bio.structure.SVDSuperimposer;
import org.biojava.bio.structure.Structure;
import org.biojava.bio.structure.StructureException;
import org.biojava.bio.structure.io.PDBFileReader;

public class Main {
public static void main(String[] args) {
PDBFileReader reader = new PDBFileReader();
reader.setAutoFetch(true);
reader.setPath(System.getProperty("java.io.tmpdir"));

try {
Structure left = reader.getStructureById("1EHZ");
Structure right = reader.getStructureById("1EVV");

List<Atom> listLeft = getPhosphorus(left.getChainByPDB("A"));
List<Atom> listRight = getPhosphorus(right.getChainByPDB("A"));

Atom[] atomSet1 = listLeft.toArray(new Atom[listLeft.size()]);
Atom[] atomSet2 = listRight.toArray(new Atom[listRight.size()]);
SVDSuperimposer superimposer = new SVDSuperimposer(atomSet1, atomSet2);
Calc.rotate(right, superimposer.getRotation());
Calc.shift(right, superimposer.getTranslation());

double rmsd = SVDSuperimposer.getRMS(atomSet1, atomSet2);
System.out.println(rmsd);
} catch (IOException | StructureException e) {
e.printStackTrace();
}
}

private static List<Atom> getPhosphorus(Chain chain) {
List<Atom> list = new ArrayList<>();

for (Group group : chain.getAtomGroups()) {
try {
Atom atom = group.getAtom("P");
list.add(atom);
} catch (StructureException e) {
System.err.println(e.getMessage());
}
}
return list;
}
}

[$[Get Code]]
Now modify the code above to calculate RMSD for P, C5', C4', C3', C2', C1', O5', O4', O3' and O2' atoms. How does it influence the RMSD? Why?

Statistics about the structure elements

BioJava allows to look on a structure on any abstraction level
In this task, we will gather some straightforward information about a given structure

All of these methods Structure.getChains(), Chain.getAtomGroups() and Group.getAtoms() return a Java collection which implements the Iterable interface. We can use a for-each loop construct like this:

for (Chain chain : structure.getChains()) {
for (Group group : chain.getAtomGroups()) {
for (Atom atom : group.getAtoms()) {
// here we will visit every atom!
}
}
}

[$[Get Code]]

Every Atom object carries with itself some chemo-physical information inside of the embedded org.biojava.bio.structure.Element instance
Element objects carry a lot of important information, see the Javadoc entry for this type:
Let's collect information about total atom count, total mass and histogram of elements present in the given structure

A Multiset (referred to also as a bag collection) is a set which allows for multiple occurrences of elements. Every addition of a new element increments the implicit counter. It is an elegant data structure for counting of elements of different types.

We will use an implementation from Google Guava library.

To obtain Element instance from any given Atom, you need to call Atom.getElement()
The full source code for this task is the following:

package pl.org.ptbi;

import java.io.IOException;

import org.biojava.bio.structure.Atom;
import org.biojava.bio.structure.Chain;
import org.biojava.bio.structure.Element;
import org.biojava.bio.structure.Group;
import org.biojava.bio.structure.Structure;
import org.biojava.bio.structure.io.PDBFileReader;

import com.google.common.collect.SortedMultiset;
import com.google.common.collect.TreeMultiset;

public class Main {
public static void main(String[] args) {
PDBFileReader reader = new PDBFileReader();
reader.setAutoFetch(true);
reader.setPath(System.getProperty("java.io.tmpdir"));

Structure structure;
try {
structure = reader.getStructureById("1EHZ");
} catch (IOException e) {
e.printStackTrace();
return;
}

SortedMultiset<Element> bag = TreeMultiset.create();
double mass = 0;
for (Chain chain : structure.getChains()) {
for (Group group : chain.getAtomGroups()) {
for (Atom atom : group.getAtoms()) {
Element element = atom.getElement();
bag.add(element);
mass += element.getAtomicMass();
}
}
}

System.out.println("Total mass: " + mass);
System.out.println("Atom count: " + bag.size());
System.out.println("Elements present:");
for (Element element : bag.elementSet()) {
System.out.println(element + " : " + bag.count(element));
}
}
}

[$[Get Code]]
Now modify the above code to calculate the center of mass of the given structure

Center of mass is a point with coordinates equal to weighted average of mass (input atoms' coordinates are the weights) {$$ CoM(\mathbf{X}, \mathbf{Y}, \mathbf{Z}) = \left[ \begin{matrix} \dfrac{\sum_{i=1}^{n}m_i \cdot X_{i}}{M} & \dfrac{\sum_{i=1}^{n}m_i \cdot Y_{i}}{M} & \dfrac{\sum_{i=1}^{n}m_i \cdot Z_{i}}{M} \end{matrix} \right] $$} Where:

{$ \mathbf{X} $} - a list of x coordinates of atoms
{$ \mathbf{Y} $} - a list of y coordinates of atoms
{$ \mathbf{Z} $} - a list of z coordinates of atoms
{$ n $} - the number of atoms
{$ m_i $} - the mass of i-th atom
{$ M $} - the total mass

Structural alignment

In one of the previous tasks we have used the superposition of structures. To be exact, SVD superposition is aimed at minimization of RMSD distance
Is this always desirable? No! In some cases, the objective of superposition can be different. In general we talk about structural alignment where we try to find such a superposition that the structures as a whole or as a set of fragments are positioned optimally with regards to some chosen criterion
With proteins, the criterion could be to maximize the number of aligned (meaning: spatially close) C-α atoms
Structural alignment is also more general as it does not require the input structures to be of the same size
BioJava provides a set of ready-to-use implementations of the interface org.biojava.bio.structure.align.StructureAlignment

Combinatorial Extension (CE) and Flexible structure AlignmenT by Chaining AFPs (Aligned Fragment Pairs) with Twists (FATCAT) are well-known and popular structural alignment methods of proteins. Please refer to the following publications:

Unfortunately, these implementations work only with proteins (not RNAs). And their API/documentation is very poor - in my opinion, this is a less mature and relatively new part of BioJava package

In this task, we will try to look at properties of these implementations. For this purpose, let's try to align a transthyretin protein (PDB id: 3CFM) and a transthyretin-like one (PDB id: 3UAF). These are structures different in size and shape, but somehow similar to each other - let's check out how!
Thanks to the polymorphism feature of Java, we can substitute the alignment algorithm in runtime. Each of the presented implementations has a common interface of StructureAlignment. So we can have a single variable of this type, and provide it with various implementations in a loop

StructureAlignment[] algorithms = new StructureAlignment[] { new CeMain(),
new SmithWaterman3Daligner(), new BioJavaStructureAlignment(),
new FatCatRigid(), new FatCatFlexible() };
for (StructureAlignment algorithm : algorithms) {
...
}

[$[Get Code]]
These algorithms work only on C-α, so we will have to extract them first. Let's use a modified version of the code which extracted phosphorus from RNAs

private static Atom[] getCarbonAlpha(Structure structure) {
List<Atom> list = new ArrayList<>();

for (Chain chain : structure.getChains()) {
for (Group group : chain.getAtomGroups()) {
try {
Atom atom = group.getAtom("CA");
list.add(atom);
} catch (StructureException e) {
System.err.println(e.getMessage());
}
}
}
return list.toArray(new Atom[list.size()]);
}

[$[Get Code]]
To align the structures use StructureAlignment.align(Atom[] ca1, Atom[] ca2) method. It returns an instance of org.biojava.bio.structure.align.model.AFPChain class. We can use AFPChain.toCE() method to get a preformatted output.
To sum up, to solve this task:
1. Load two PDB structures
2. Extract C-α atoms
3. Construct an array of algorithm implementations
4. In a loop, reuse each algorithm to align the structures and print a formatted output
The full source code for this task:

package pl.org.ptbi;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.biojava.bio.structure.Atom;
import org.biojava.bio.structure.Chain;
import org.biojava.bio.structure.Group;
import org.biojava.bio.structure.Structure;
import org.biojava.bio.structure.StructureException;
import org.biojava.bio.structure.align.BioJavaStructureAlignment;
import org.biojava.bio.structure.align.StructureAlignment;
import org.biojava.bio.structure.align.ce.CeMain;
import org.biojava.bio.structure.align.fatcat.FatCatFlexible;
import org.biojava.bio.structure.align.fatcat.FatCatRigid;
import org.biojava.bio.structure.align.model.AFPChain;
import org.biojava.bio.structure.align.seq.SmithWaterman3Daligner;
import org.biojava.bio.structure.io.PDBFileReader;

public class Main {
public static void main(String[] args) {
PDBFileReader reader = new PDBFileReader();
reader.setAutoFetch(true);
reader.setPath(System.getProperty("java.io.tmpdir"));

try {
Structure left = reader.getStructureById("3CFM");
Structure right = reader.getStructureById("3UAF");

Atom[] ca1 = getCarbonAlpha(left);
Atom[] ca2 = getCarbonAlpha(right);

StructureAlignment[] algorithms = new StructureAlignment[] { new CeMain(),
new SmithWaterman3Daligner(), new BioJavaStructureAlignment(),
new FatCatRigid(), new FatCatFlexible() };
for (StructureAlignment algorithm : algorithms) {
System.out.println(algorithm.getAlgorithmName());
AFPChain alignment = algorithm.align(ca1, ca2);
System.out.println(alignment.toCE(ca1, ca2));
}
} catch (IOException | StructureException e) {
e.printStackTrace();
}
}

private static Atom[] getCarbonAlpha(Structure structure) {
List<Atom> list = new ArrayList<>();

for (Chain chain : structure.getChains()) {
for (Group group : chain.getAtomGroups()) {
try {
Atom atom = group.getAtom("CA");
list.add(atom);
} catch (StructureException e) {
System.err.println(e.getMessage());
}
}
}
return list.toArray(new Atom[list.size()]);
}
}

[$[Get Code]]

Torsion angle calculation

Torsion angle is generally speaking an angle between two planes (see: http://en.wikipedia.org/wiki/Dihedral_angle)
In structural biology, there are several torsion angles defined between selected fours of atoms (A-B-C-D, here A-B-C constitute one plane and B-C-D the second one)
They describe the shape and fold of the structure, it is an alternative representation to the Cartesian one (atoms as points)
We can use the org.biojava.bio.structure.Calc.torsionAngle(Atom a, Atom b, Atom c, Atom d) static method to calculate any given torsion angle
Let's reuse the code from previous task used for selection of C-α atoms
The resulting array should be divided into overlapping fours (1-2-3-4, then 2-3-4-5, then 3-4-5-6, etc.). Calculate torsion angle for every such four of atoms.
The full source code for this task is the following:

package pl.org.ptbi;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.biojava.bio.structure.Atom;
import org.biojava.bio.structure.Calc;
import org.biojava.bio.structure.Chain;
import org.biojava.bio.structure.Group;
import org.biojava.bio.structure.Structure;
import org.biojava.bio.structure.StructureException;
import org.biojava.bio.structure.io.PDBFileReader;

public class Main {
public static void main(String[] args) {
PDBFileReader reader = new PDBFileReader();
reader.setAutoFetch(true);
reader.setPath(System.getProperty("java.io.tmpdir"));

Structure structure;
try {
structure = reader.getStructureById("3CFM");
Atom[] atoms = getCarbonAlpha(structure);
for (int i = 0; i + 3 < atoms.length; i++) {
double torsion = Calc.torsionAngle(atoms[i], atoms[i + 1], atoms[i + 2], atoms[i + 3]);
System.out.println(torsion);
}
} catch (IOException | StructureException e) {
e.printStackTrace();
}
}

private static Atom[] getCarbonAlpha(Structure structure) {
List<Atom> list = new ArrayList<>();

for (Chain chain : structure.getChains()) {
for (Group group : chain.getAtomGroups()) {
try {
Atom atom = group.getAtom("CA");
list.add(atom);
} catch (StructureException e) {
System.err.println(e.getMessage());
}
}
}
return list.toArray(new Atom[list.size()]);
}
}

[$[Get Code]]
The loaded 3CFM structure has two chains. When we select all C-α atoms, we ignore this fact. Modify the code above so that if the chain identifier is not equal for the whole four of atoms, the torsion angle is not calculated (and a message "A different chain id found!" is printed)

Previously you have learned that we can change abstraction level by going into more and more detailed view i.e. Structure → Chain → Group →Atom. The truth is that you can also go the other way round:

Atom.getGroup() returns the amino acid or nucleotide that contains this atom
Group.getChain() returns the chain that the given group belongs to
Chain.getParent() returns the structure that has the given chain

Some actions change the values in the instances of Atom, Group, Chain and Structure. If you develop a bigger project, you should keep in mind to make your work on the clones of the original data. Otherwise, the same structure after modification will be seen differently. For example you could end up in a situation like this:

structure = reader.getStructureById(STRUCTURE_ID);
chain = structure.getChainByPDB(CHAIN_ID);
Some complex calculations now...
Now you could observe that:
- !structure.equals(chain.getParent());, or
- !chain.equals(structure.getChainByPDB(CHAIN_ID));

This is not very likely - in general Java libraries tend to minimize occurrences of such unintuitive behaviour and BioJava is no different. However, you should be cautious in bigger projects using this library.

Tasks to do on your own

Statistics about amino acids:
- Load some protein (e.g. 3CFM)
- For every amino acid type, remember the first one encountered (a reference amino acid) and calculate RMSD for every next one of this type
- Calculate an average and standard deviation of RMSD for every amino acid type.
- Print a ranking (the most variable (i.e. with highest average RMSD) amino acid on top and the least on the bottom)
A table of torsion angle values:
- Load some RNA (e.g. 1EHZ)
- Calculate δ angle for every residue (it is made out of the following atoms: C5'-C4'-C3'-O3')
- Print a table of values (each line has two numbers: residue index and δ angle value for this residue)
RMSD with a three-residue-long window
- Load 1EHZ and 1EVV structures
- Read backbone atoms (P, C5', C4', C3', C2', C1', O5', O4', O3' and O2') from the first three residues of 1EHZ
- Iterate over all triplets of residues in 1EVV (residues 1-2-3, then 2-3-4, then 3-4-5, etc.) and calculate RMSD against the selected atoms in the previous step
- Print a ranking (each lines has two numbers: starting residue of the triplet in 1EVV and its RMSD)

From Tomasz Żok

Workshop: BioJavaForStructuralBioinformatics