Optimizing high-performance cloud computing to enable real-time high-throughput SARS-CoV-2 whole genome sequence analysis

Published in 28th International Dynamics & Evolution of Human Viruses, 2021

Recommended citation: Moshiri N (2021). "Optimizing high-performance cloud computing to enable real-time high-throughput SARS-CoV-2 whole genome sequence analysis." 28th International Dynamics & Evolution of Human Viruses. Poster.

Throughout the COVID-19 pandemic, molecular epidemiology has proven vital for surveilling the spread of novel SARS-CoV-2 variants across the world. Such surveillance can yield actionable public health information, but only if it is delivered in real time. Advances in next-generation sequencing machines (e.g., Illumina NovaSeq) have enabled the generation of ultra-large SARS-CoV-2 whole genome sequence (WGS) datasets (close to 1,000 samples in a single run), but the sheer volume of data poses a new challenge: how can standard bioinformatics analysis workflows be executed rapidly enough to keep up? In this talk, I will present profiling results for the steps of a standard SARS-CoV-2 WGS analysis workflow, and I will demonstrate a cloud-based high-performance computing setup, implemented on the Google Cloud Platform, for bulk analysis of SARS-CoV-2 WGS datasets that can process close to 1,000 samples in under 2 hours.