A. Swiercz, B. Bosak, M. Chlopkowski, A. Hoffa, M. Kasprzak, K. Kurowski,
T. Piontek, J. Blazewicz
"Preprocessing and storing high-throughput sequencing data"
Computational Methods in Science and Technology 20 (2014) 9-20.
Abstract:
DNA sequencing is the process of determining the DNA sequence of a genome. The
process consists of reading short sequences, which are subsequences of the genome,
and merging them into longer sequences, preferably the whole genome.
In the first phase, up to billions of short sequences are read at once.
To simplify and speed up the second phase, we develop a pipeline for
preprocessing the initial set of short sequences that removes low-quality
reads and duplicated reads. We also propose a method for preliminary joining
of overlapping sequences, which decreased the cardinality of the
initial sets to 13.9% and 27.8%. We also examine possible ways to store
the huge amount of experimental data. We compare different compression methods,
of which the best proved to be DSRC, developed for a special type of text
file containing short sequences and their quality values. We test the parameters
of TCP data transfer to find the best transfer rate.
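
The preprocessing step described in the abstract (dropping low-quality reads and duplicated reads) can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract does not state the quality criterion or how duplicates are defined, so the mean-Phred threshold, the Phred+33 encoding, the exact-match duplicate test, and the names `mean_phred` and `preprocess` are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, Tuple

# A read is (identifier, bases, quality string); this layout is an assumption.
Read = Tuple[str, str, str]


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    if not quality:
        return 0.0
    return sum(ord(c) - offset for c in quality) / len(quality)


def preprocess(reads: Iterable[Read],
               min_mean_quality: float = 20.0) -> Iterator[Read]:
    """Yield reads that pass a quality filter and are not exact duplicates.

    Assumptions (not taken from the paper): a read is low quality when its
    mean Phred score is below min_mean_quality, and two reads are duplicates
    when their base strings are identical.
    """
    seen = set()
    for name, bases, quality in reads:
        if mean_phred(quality) < min_mean_quality:
            continue  # drop low-quality read
        if bases in seen:
            continue  # drop duplicate of an earlier read
        seen.add(bases)
        yield name, bases, quality


# Toy input, invented for this example.
reads = [
    ("r1", "ACGT", "IIII"),  # high quality, kept
    ("r2", "ACGT", "IIII"),  # same bases as r1, dropped as duplicate
    ("r3", "TTGA", "!!!!"),  # lowest possible quality, dropped
    ("r4", "GGCA", "5555"),  # mean quality exactly at the threshold, kept
]
kept = list(preprocess(reads))
```

Keeping only the base strings in `seen` holds memory proportional to the number of distinct reads, which would not scale to billions of reads without hashing or external sorting; the sketch is meant only to show the filtering logic.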
2 Apr 2014