Cloud-based Bioinformatics Framework for Next-Generation Sequencing Data

Huang L (2019)
Bielefeld: Universität Bielefeld.

Bielefeld E-Dissertation | English
 
Abstract / Note
The increasing amount of next-generation sequencing data poses a fundamental challenge for large-scale genomic analytics. Storing and processing large amounts of sequencing data requires considerable hardware resources and efficient software that can fully utilize these resources. Nowadays, both industrial enterprises and nonprofit institutes provide robust and easily accessible cloud services for studies in the life sciences. To facilitate genomic data analyses on such powerful computing resources, distributed bioinformatics tools are needed. However, most existing tools scale poorly on distributed cloud infrastructure. Thus, in this thesis, I developed a cloud-based bioinformatics framework that mainly addresses two computational challenges: (i) the runtime-intensive sequence mapping process and (ii) the memory-intensive de novo genome assembly process.

For sequence mapping, I have natively implemented an Apache Spark-based distributed sequence mapping tool called Sparkhit. It uses a q-gram filter and the pigeonhole principle to accelerate fragment recruitment and short-read mapping. These algorithms are implemented in Spark's extended MapReduce model. Sparkhit runs 92–157 times faster than MetaSpark on metagenomic fragment recruitment and 18–32 times faster than Crossbow on data pre-processing.
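The two filtering ideas named above can be illustrated with a small single-machine sketch. It is not Sparkhit's implementation: every function name below is hypothetical, the reference is a plain in-memory string, and only mismatches (no insertions or deletions) are verified. The pigeonhole principle says that a read matching with at most k mismatches, when cut into k + 1 non-overlapping segments, must contain at least one segment that matches exactly. The q-gram lemma says that a read of length n matching with at most k errors shares at least n + 1 - (k + 1)q q-grams with its match.

```python
from collections import Counter


def qgrams(seq, q):
    """Return all overlapping substrings of length q."""
    return [seq[i:i + q] for i in range(len(seq) - q + 1)]


def qgram_threshold(read_len, q, max_errors):
    """q-gram lemma: minimum number of shared q-grams for a valid match."""
    return read_len + 1 - (max_errors + 1) * q


def shared_qgrams(a, b, q):
    """Count q-grams common to both sequences (multiset intersection)."""
    return sum((Counter(qgrams(a, q)) & Counter(qgrams(b, q))).values())


def pigeonhole_seeds(read, max_mismatches):
    """Cut the read into max_mismatches + 1 non-overlapping (offset, segment) seeds."""
    parts = max_mismatches + 1
    base, extra = divmod(len(read), parts)
    seeds, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        seeds.append((start, read[start:start + size]))
        start += size
    return seeds


def candidate_positions(read, reference, max_mismatches):
    """Reference start positions at which at least one seed matches exactly."""
    candidates = set()
    for offset, seed in pigeonhole_seeds(read, max_mismatches):
        pos = reference.find(seed)
        while pos != -1:
            start = pos - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
            pos = reference.find(seed, pos + 1)
    return sorted(candidates)


def map_read(read, reference, max_mismatches):
    """Verify each candidate by Hamming distance and return the accepted starts."""
    hits = []
    for start in candidate_positions(read, reference, max_mismatches):
        window = reference[start:start + len(read)]
        if sum(x != y for x, y in zip(read, window)) <= max_mismatches:
            hits.append(start)
    return hits
```

In this sketch the expensive verification step runs only at positions that survive the seed lookup, which is where the speed-up of such filters comes from. The sketch assumes the read is at least as long as the number of seeds.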

For de novo genome assembly, I have invented a new data structure called Reflexible Distributed K-mer (RDK) and natively implemented a distributed genome assembler called Reflexiv. Reflexiv is built on top of the Apache Spark platform, uses Spark's Resilient Distributed Dataset (RDD) to distribute large numbers of k-mers across the cluster, and assembles the genome recursively. As a result, Reflexiv runs 8–17 times faster than the Ray assembler and 5–18 times faster than the ABySS assembler on clusters deployed in the de.NBI cloud.
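To make the idea of k-mer-based assembly concrete, the following is a minimal single-machine sketch. It does not reproduce the RDK data structure or Reflexiv's recursive, distributed procedure, whose details are not given in this abstract. It only shows the generic principle of cutting reads into k-mers and extending a contig while exactly one k-mer overlaps the current one by k - 1 bases. The function names and the error-free toy reads are assumptions made for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def extend_right(seed, kmers):
    """Extend a seed k-mer to the right while the successor is unambiguous.

    A successor is a known k-mer whose first k-1 bases equal the last k-1
    bases of the current k-mer. Extension stops at a dead end, at a branch
    (several successors), or when a k-mer would be visited twice.
    """
    contig = seed
    current = seed
    seen = {seed}
    while True:
        suffix = current[1:]
        successors = [suffix + base for base in "ACGT" if suffix + base in kmers]
        if len(successors) != 1 or successors[0] in seen:
            break
        current = successors[0]
        seen.add(current)
        contig += current[-1]
    return contig
```

A short usage example under these assumptions: three overlapping, error-free reads drawn from a toy 14-base sequence are cut into 5-mers, and extension from the first 5-mer recovers the full sequence.

```python
genome = "ATGGCGTGCAATCC"
reads = [genome[0:8], genome[3:11], genome[6:14]]
kmers = count_kmers(reads, 5)
contig = extend_right(genome[:5], kmers)
```

Real data would additionally require handling sequencing errors, reverse complements, and repeats, none of which this sketch attempts.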

In addition, I have incorporated a variety of analytical methods into the framework. I have also developed a tool wrapper to distribute external tools and Docker containers on the Spark cluster. As a large-scale genomic use case, my framework processed 100 terabytes of data across four genomic projects on the Amazon cloud in 21 hours. Furthermore, an analysis of the entire Human Microbiome Project shotgun sequencing dataset was completed in 2 hours, demonstrating an approach to easily associate large amounts of public datasets with reference data.
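The tool-wrapper idea, running an unmodified external program on each slice of the data, can be sketched without a cluster as follows. This is an illustration only and not the framework's wrapper: the function names are hypothetical, partitions are processed sequentially rather than on worker nodes, and a tiny Python one-liner stands in for a real bioinformatics tool or container. On an actual Spark cluster, the `pipe` method of an RDD offers a comparable mechanism by streaming each partition through an external command.

```python
import subprocess
import sys


def partition(records, num_partitions):
    """Split records round-robin into num_partitions lists."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def pipe_partition(records, command):
    """Send records, one per line, to the command's stdin; return its stdout lines."""
    result = subprocess.run(
        command,
        input="\n".join(records) + "\n",
        capture_output=True,
        text=True,
        check=True,  # raise if the external tool exits with an error
    )
    return result.stdout.splitlines()


def run_tool(records, command, num_partitions):
    """Run the external command once per non-empty partition and collect the output."""
    output = []
    for part in partition(records, num_partitions):
        if part:
            output.extend(pipe_partition(part, command))
    return output


# Hypothetical stand-in for an external tool: upper-cases every input line.
UPPERCASE_TOOL = [
    sys.executable,
    "-c",
    "import sys\nfor line in sys.stdin: sys.stdout.write(line.upper())",
]
```

Calling `run_tool(["acgt", "ttga", "ccat"], UPPERCASE_TOOL, 2)` launches the stand-in tool twice, once per partition. Because partitions are formed round-robin, the output order follows the partitions rather than the input.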

Thus, my work contributes to interdisciplinary research at the intersection of life science and distributed cloud computing by improving existing methods with a new data structure, new algorithms, and robust distributed implementations.
Year
2019
Page URI
https://pub.uni-bielefeld.de/record/2936599

Cite

Huang, L. (2019). Cloud-based Bioinformatics Framework for Next-Generation Sequencing Data. Bielefeld: Universität Bielefeld. https://doi.org/10.4119/unibi/2936599
Full text(s)
Access Level
OA Open Access
Last Uploaded
2019-09-06T09:19:08Z
MD5 Checksum
fcb2883d5cf01274dbf92e10755cccf7

