M. Chlopkowski, M. Antczak, M. Slusarczyk, A. Wdowinski, M. Zajaczkowski, M. Kasprzak
"High-order statistical compressor for long-term storage of DNA sequencing data"
RAIRO Operations Research
 50 (2016) 351-361.
The original publication is available at www.rairo-ro.org,
with EDP Sciences as the copyright owner, via
http://www.rairo-ro.org/articles/ro/abs/2016/02/ro150039-s/ro150039-s.html 
Abstract:
We present a specialized compressor designed for efficient data storage of 
FASTQ files produced by high-throughput DNA sequencers. Since the method 
has been optimized for compression quality, it is especially suitable for 
long-term storage and for genome research centers processing huge amounts 
of data (counted in petabytes). The proposed compressor uses high-order 
statistical models for range encoding, similar to Markov models, but the 
whole input is considered when building a symbol context. DNA reads are 
compressed in an LZ-style manner using a 5th- to 7th-order model, while 
nucleotide quality scores are encoded with a 3rd-order model.
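
To make the idea of a high-order statistical model more concrete, here is a minimal Python sketch of an adaptive order-k context model over the DNA alphabet. It is an illustration only and not the compressor described in the paper: it omits the LZ-style matching, does not build contexts from the whole input, and only reports the ideal code length that a range coder driven by such a model would approach. The function name adaptive_code_length, the fixed ACGT alphabet, and the add-one smoothing of the counts are all assumptions made for this example.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumed alphabet; real FASTQ reads may also contain N


def adaptive_code_length(seq, order):
    """Ideal code length (in bits) of seq under an adaptive order-k model.

    Each symbol is predicted from the `order` symbols before it. Counts start
    at 1 for every symbol (add-one smoothing, an assumption of this sketch)
    and are updated after each symbol is coded, so a decoder could rebuild
    exactly the same model.
    """
    index = {s: i for i, s in enumerate(ALPHABET)}
    counts = defaultdict(lambda: [1] * len(ALPHABET))
    bits = 0.0
    for pos, sym in enumerate(seq):
        # Context is shorter than `order` at the very start of the sequence.
        context = seq[max(0, pos - order):pos]
        freq = counts[context]
        p = freq[index[sym]] / sum(freq)
        bits += -math.log2(p)      # cost a range coder would approach
        freq[index[sym]] += 1      # adapt the model after coding
    return bits


if __name__ == "__main__":
    read = "ACGT" * 50
    for k in (0, 3):
        print(k, round(adaptive_code_length(read, k), 1))
```

On a strongly repetitive input such as the one above, the order-3 model needs far fewer bits than the order-0 model, which is the effect that higher-order contexts are meant to exploit.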
30 Mar 2016