MambaVirus
Abstract:
The genomic instructions for all life on Earth are encoded in DNA. Cells grow and replicate by transcribing this DNA template into RNA, and the resulting RNA transcripts are then translated into proteins.
All living organisms use this machinery; viruses, however, take advantage of the replicative processes of their host. The enzymes involved in replication are not perfect, and the small mistakes they introduce accumulate in offspring over time. This cycle is central to evolution. Forecasting the future evolution of viral genomes with highly accurate models could pave the way for early-warning research and drug development, as well as an improved understanding of our own cellular immune responses. Applications of Large Language Models (LLMs) in genome design are still mostly unexplored. With the objective of advancing our understanding of virus evolution, and to test the current genomic capabilities of state-of-the-art LLM architectures, we introduce MambaVirus.
MambaVirus is a 1-billion-parameter genomic foundation model trained on labeled viral genomes, genes, and instruction prompts. MambaVirus is instructable and can generate full genomes from prompts. Here, we evaluate its ability to follow instructions, assess the plausibility of the genomes it generates, and attempt to demonstrate a practical use case in evolution prediction. We hope MambaVirus can be integrated into existing prediction systems to advance understanding of viral evolution, improve virus infection models, and lead to more powerful personalized medical treatments.