M. Chlopkowski, M. Antczak, M. Slusarczyk, A. Wdowinski, M. Zajaczkowski, M. Kasprzak
"High-order statistical compressor for long-term storage of DNA sequencing data"
RAIRO Operations Research
50 (2016) 351-361.
The original publication is available at www.rairo-ro.org,
with EDP Sciences as the copyright owner, via
http://www.rairo-ro.org/articles/ro/abs/2016/02/ro150039-s/ro150039-s.html
Abstract:
We present a specialized compressor designed for efficient data storage of
FASTQ files produced by high-throughput DNA sequencers. Since the method
has been optimized for compression quality, it is especially suitable for
long-term storage and for genome research centers processing huge amounts
of data (counted in petabytes). The proposed compressor uses high-order
statistical models for range encoding, similar to Markov models, but the
whole input is considered in building a symbol context. DNA reads are
compressed in an LZ-style manner using a 5th- to 7th-order model, while
nucleotide quality scores are encoded with a 3rd-order model.
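
To illustrate what a high-order statistical model contributes to an entropy
coder, the sketch below estimates the ideal code length of a DNA string under
an adaptive order-k context model. It is not the compressor described in the
paper: the function name estimated_bits, the fixed-length preceding context,
and the add-one smoothing are all assumptions made for this example, and no
actual range coder is implemented. The returned value is only the number of
bits an ideal entropy coder would need given the model's predictions.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: reads contain only these four symbols


def estimated_bits(seq, order):
    """Ideal code length (in bits) of seq under an adaptive order-k model."""
    # One frequency table per context. Every count starts at 1 (add-one
    # smoothing) so that no symbol is ever assigned probability zero.
    counts = defaultdict(lambda: {s: 1 for s in ALPHABET})
    total_bits = 0.0
    for i, sym in enumerate(seq):
        # The context is the (up to) `order` symbols preceding position i.
        ctx = seq[max(0, i - order):i]
        table = counts[ctx]
        p = table[sym] / sum(table.values())
        total_bits += -math.log2(p)  # cost of coding sym with probability p
        table[sym] += 1              # adapt the model after coding
    return total_bits


if __name__ == "__main__":
    read = "ACGT" * 50
    for k in (0, 2, 5):
        print(k, round(estimated_bits(read, k), 1))
```

On repetitive input, a longer context makes the next symbol far more
predictable, so the estimated size falls as the order grows. On short or
highly varied input, a high order can cost more, because each context is seen
too rarely for its counts to become informative.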
30 Mar 2016