HyperGen: Compact and Efficient Genome Sketching using Hyperdimensional Vectors

Published in bioRxiv, 2024

Recommended citation: Xu W, Hsu PK, Moshiri N, Yu S, Rosing T (2024). "HyperGen: Compact and Efficient Genome Sketching using Hyperdimensional Vectors." bioRxiv. doi:10.1101/2024.03.05.583605

Motivation: Genomic distance estimation is a critical workload because exact computation of whole-genome similarity metrics such as Average Nucleotide Identity (ANI) incurs prohibitive runtime overhead. Genome sketching is a fast and memory-efficient way to estimate ANI by distilling representative k-mers from the original sequences. In this work, we present HyperGen, which improves accuracy, runtime performance, and memory efficiency for large-scale ANI estimation. Unlike existing genome sketching algorithms that convert large genome files into discrete k-mer hashes, HyperGen leverages emerging hyperdimensional computing (HDC) to encode genomes into quasi-orthogonal vectors (hypervectors, HVs) in high-dimensional space. An HV is compact and preserves more information, allowing accurate ANI estimation with smaller sketches. In particular, the HV sketch representation allows ANI to be estimated using vector multiplication, which naturally benefits from highly optimized general matrix multiply (GEMM) routines. As a result, HyperGen enables efficient sketching and ANI estimation for massive genome collections.
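The idea of estimating ANI from hypervector dot products can be illustrated with a minimal Python sketch. It is not HyperGen's actual implementation. The hash function (BLAKE2b), the PRNG-seeded bipolar encoding, the constants `K`, `DIM`, and `SCALE`, and all function names are assumptions chosen for illustration. The conversion from Jaccard similarity to ANI uses the Mash distance formula, also as an assumption.

```python
import hashlib
import math
import random

K = 11      # k-mer length (illustrative assumption)
DIM = 1024  # hypervector dimension (illustrative assumption)
SCALE = 4   # keep roughly 1 in SCALE k-mer hashes (illustrative assumption)


def kmer_hashes(seq, k=K, scale=SCALE):
    """Return the set of sampled 64-bit hashes of the k-mers in seq."""
    hashes = set()
    for i in range(len(seq) - k + 1):
        digest = hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        if h % scale == 0:  # simple hash-based subsampling
            hashes.add(h)
    return hashes


def hash_to_hv(h, dim=DIM):
    """Expand one hash into a pseudo-random bipolar (+1/-1) vector."""
    bits = random.Random(h).getrandbits(dim)  # deterministic per hash
    return [1 if (bits >> j) & 1 else -1 for j in range(dim)]


def sketch(seq, dim=DIM):
    """Sum the vectors of all sampled k-mers into one sketch hypervector."""
    hv = [0] * dim
    for h in kmer_hashes(seq):
        for j, v in enumerate(hash_to_hv(h, dim)):
            hv[j] += v
    return hv


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def estimate_ani(hv_a, hv_b, k=K):
    """Estimate ANI from two sketches.

    Random bipolar vectors are nearly orthogonal, so the dot product of two
    sketches is roughly DIM * |A and B shared hashes|, and the squared norm
    of a sketch is roughly DIM * |its hash set|. DIM cancels in the ratio.
    """
    inter = dot(hv_a, hv_b)
    union = dot(hv_a, hv_a) + dot(hv_b, hv_b) - inter
    if inter <= 0 or union <= 0:
        return 0.0
    jaccard = min(inter / union, 1.0)
    # Mash-style conversion from Jaccard similarity to identity.
    return max(0.0, 1.0 + math.log(2.0 * jaccard / (1.0 + jaccard)) / k)
```

In this toy version a single pair is compared with one dot product. Stacking many sketches as rows of a matrix would turn an all-against-all comparison into one matrix product, which is where GEMM routines become useful.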

Results: We evaluate HyperGen’s sketching and database search performance using several genome datasets at various scales. HyperGen achieves ANI estimation error and linearity comparable or superior to those of other sketch-based tools. Relative to these baselines, HyperGen offers comparable sketching speed and up to 4.3x faster database search, while its sketches are on average 1.8-2.7x smaller.

Availability: A Rust implementation of HyperGen is freely available under the MIT license as an open-source software project at https://github.com/wh-xu/Hyper-Gen.