Marta Kasprzak - Article's abstract

A. Swiercz, B. Bosak, M. Chlopkowski, A. Hoffa, M. Kasprzak, K. Kurowski, T. Piontek, J. Blazewicz
"Preprocessing and storing high-throughput sequencing data"
Computational Methods in Science and Technology 20 (2014) 9-20.

Abstract:

DNA sequencing is the process of determining the DNA sequences of genomes. It consists of reading short sequences, which are subsequences of a genome, and merging them into longer sequences, preferably the whole genome. In the first phase, as many as billions of short sequences are read at once. To simplify and speed up the second phase, we develop a pipeline for preprocessing the initial set of short sequences, that is, for removing low-quality reads and duplicated reads. We also propose a method for preliminary joining of overlapping sequences, which decreased the cardinality of the initial sets to 13.9% and 27.8%. We also examine possible ways to store the huge amount of experimental data. We compare different compression methods, of which the best proved to be DSRC, developed for a special type of text file containing short sequences and their quality. Finally, we test TCP data transfer parameters to find the best transfer rate.
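The two filtering steps named in the abstract (removing low-quality reads and duplicated reads) can be illustrated with a minimal sketch. This is not the authors' pipeline: the read representation, the function names `mean_phred` and `preprocess`, the Phred+33 quality encoding, the mean-quality threshold of 20, and the exact-match definition of a duplicate are all assumptions made for illustration.

```python
from typing import Iterable, List, Tuple

# Hypothetical read representation: (sequence, quality string).
Read = Tuple[str, str]


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean quality of a read, assuming Phred+33 ASCII encoding."""
    return sum(ord(c) - offset for c in quality) / len(quality)


def preprocess(reads: Iterable[Read], min_mean_quality: float = 20.0) -> List[Read]:
    """Drop low-quality reads and exact duplicate sequences.

    The threshold and the duplicate criterion are illustrative
    assumptions, not values taken from the article.
    """
    seen = set()
    kept: List[Read] = []
    for seq, qual in reads:
        # Skip malformed records (empty or mismatched lengths).
        if not seq or len(seq) != len(qual):
            continue
        # Step 1: discard reads whose mean quality is too low.
        if mean_phred(qual) < min_mean_quality:
            continue
        # Step 2: discard reads whose sequence was already kept.
        if seq in seen:
            continue
        seen.add(seq)
        kept.append((seq, qual))
    return kept


if __name__ == "__main__":
    sample = [
        ("ACGT", "IIII"),  # high quality, kept
        ("ACGT", "IIII"),  # duplicate, removed
        ("TTTT", "!!!!"),  # low quality, removed
        ("GGCC", "IIII"),  # high quality, kept
    ]
    for seq, qual in preprocess(sample):
        print(seq, qual)
```

Keeping a set of already seen sequences makes duplicate detection a single pass, at the cost of holding every distinct sequence in memory; for billions of reads a real pipeline would likely need an external or hashed representation instead.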

2 Apr 2014