SOMBRA: A New Frontier in HCMV Evolutionary Research


Introduction to Human Cytomegalovirus (HCMV) Evolution

Understanding the evolution of human cytomegalovirus (HCMV) is critical for deciphering its genetic diversity, adaptation mechanisms, and the broader implications for human health. Recent phylogenetic studies have begun to reveal significant evolutionary patterns and relationships among geographically distinct HCMV strains, with notable contributions from researchers like Charles and Venturini et al. (1).

However, the current state of research is constrained by two primary challenges:


System for Operational Modeling of Biological Replication and Adaptation (SOMBRA)

SOMBRA Rules System

Figure 4: Overview of the SOMBRA system for operational modeling of biological replication and adaptation.

SOMBRA offers a novel approach to addressing these challenges by simulating the evolutionary processes of HCMV through advanced computational techniques.

Key Features of SOMBRA

  1. Sliding Window Approach: A window is moved across the alignment to extract trends and identify conserved regions. In windows with less than 90% agreement, consensus voting is used to assign the nucleotide at each position.

  2. k-mer Index Generation: A k-mer index, generated from the MAFFT alignment, enables rapid alignment of newly generated sequences with the reference alignment.

  3. Geographical Data Extraction: Each sequence includes the country of isolation in its header, with the continent extrapolated from this information. SOMBRA uses this data to organize sequences into geographical groups.
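The first two features can be illustrated with a minimal sketch. It is not SOMBRA's actual implementation: the function names, the window size, and the choice to score a window by its mean per-column majority fraction are all assumptions made for illustration. Majority voting is applied to every column, and windows that reach the threshold are reported as conserved.

```python
from collections import Counter, defaultdict


def window_consensus(alignment, window=10, threshold=0.9):
    """Majority-vote consensus plus a list of conserved windows.

    `alignment` is a list of equal-length aligned sequences (hypothetical input).
    A window counts as conserved when its mean per-column agreement
    (fraction of sequences carrying the majority base) is >= threshold.
    """
    n_seqs = len(alignment)
    length = len(alignment[0])
    consensus = []
    conserved = []  # (start, end) pairs, end exclusive
    for start in range(0, length, window):
        end = min(start + window, length)
        agreements = []
        for col in range(start, end):
            counts = Counter(seq[col] for seq in alignment)
            base, votes = counts.most_common(1)[0]  # majority vote for this column
            consensus.append(base)
            agreements.append(votes / n_seqs)
        if sum(agreements) / len(agreements) >= threshold:
            conserved.append((start, end))
    return "".join(consensus), conserved


def build_kmer_index(sequence, k=4):
    """Map every k-mer in `sequence` to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index


alignment = ["ACGTACGT", "ACGTACTA", "ACGTACGT"]
consensus, conserved = window_consensus(alignment, window=4, threshold=0.9)
index = build_kmer_index(consensus, k=4)
```

A lookup such as `index["ACGT"]` then returns candidate anchor positions for placing a new sequence against the consensus. A real pipeline would also need a policy for gap characters and tied votes, which this sketch ignores.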

Ancestral Sequence Generation

Evolutionary Simulation

Future Directions


As we continue to refine and expand SOMBRA, it’s essential to evaluate how well the simulated HCMV genomes reflect patterns observed in clinical samples. This analysis helps validate the approach and guides further development.

Simulated HCMV Genomes Have Ancestral Patterns That Parallel Clinical Samples

MDS Scatterplot

The scatterplot above depicts a multidimensional scaling (MDS) analysis of the merged reference dataset, revealing genomic clusters associated with the continent where each sample was collected. Notably, African strains cluster on the periphery of the European strains. Strains from the Americas cluster near Europe, although they have a larger range. This finding is consistent with recent publications (1).
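For readers unfamiliar with MDS, the sketch below shows classical (Torgerson) MDS, which turns a matrix of pairwise distances into 2D coordinates. The text does not say which MDS variant or distance measure was used, so this is an illustrative assumption and `classical_mds` is a hypothetical helper.

```python
import numpy as np


def classical_mds(dist, n_components=2):
    """Embed points in `n_components` dimensions from a pairwise distance matrix."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centre the squared distances to obtain a Gram (inner product) matrix.
    gram = -0.5 * centering @ (d ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]  # largest eigenvalues first
    top_vals = np.clip(eigvals[order], 0.0, None)  # guard against tiny negatives
    return eigvecs[:, order] * np.sqrt(top_vals)


# Toy distance matrix for three points on a line at positions 0, 1, and 3.
toy_dist = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
coords = classical_mds(toy_dist)
```

Each row of `coords` is one sample. Colouring the rows by continent of collection would give a scatterplot of the kind described above.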

Generating Artificial HCMV Genomes

Figure 1: MDS scatterplot showing genomic clusters associated with continent of origin.

SOMBRA-generated artificial HCMV genomes largely follow this pattern, with African strains clustering at the periphery of the European and American groups. Panel A depicts the original output from the MDS analyses, while Panel B shows an inversion of point positions around the centroid, revealing similar patterns with Asian, African, and Oceanic strains on the periphery and European and American strains at the center.

Figure 2: Inversion of point positions around the centroid in MDS analysis.

Figure 3 (Panels A and B): Genome length comparison and US21 protein prediction comparison between a generated and a reference African strain.

Fig. 3, Panel A, compares the length of a generated African strain with that of a reference African strain; the two differ notably in size (~25%). Panel B compares protein predictions for the membrane-spanning protein US21, showing that the generated strain contains the ORF but that it terminates early. We suspect this is due to inaccurate k-mer indexing, which we are working to correct.
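An early-terminating ORF of this kind can be flagged by translating both sequences up to their first stop codon and comparing the protein lengths. The sketch below uses the standard genetic code. The function names and toy sequences are hypothetical, and it is not the protein prediction tool used for Fig. 3.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering; "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_orf(dna):
    """Translate from the first base, stopping before the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3].upper(), "X")  # "X" for ambiguous codons
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def truncation_report(reference_orf, generated_orf):
    """Compare translated lengths to detect a premature stop in the generated ORF."""
    ref = translate_orf(reference_orf)
    gen = translate_orf(generated_orf)
    return {
        "reference_length": len(ref),
        "generated_length": len(gen),
        "terminates_early": len(gen) < len(ref),
    }


# Toy sequences (not real US21): a substitution creates a TAA stop at codon 3.
report = truncation_report("ATGGCTAAATGGTAA", "ATGGCTTAATGGTAA")
```

Here `report` records a reference protein of 4 residues and a generated protein of 2, so `terminates_early` is true.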

Genetic Distance and Lineage Patterns Across Continents

Panels C and D of Fig. 3 compare the genetic distances of sequences grouped by continent. Artificial sequences from the same continent are highly similar to one another, whereas genetic distances between continental groups are notably higher. American-derived genomes exhibit higher genetic distances than genomes from other continents.
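The within-group versus between-group comparison can be sketched as follows. The distance measure behind Fig. 3 is not stated, so the simple p-distance (proportion of differing aligned sites) used here is an assumption, and the function names and toy data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean


def p_distance(a, b):
    """Proportion of differing sites, skipping columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)


def group_distances(sequences, groups):
    """Mean pairwise distance within each group and between each pair of groups."""
    within = defaultdict(list)
    between = defaultdict(list)
    for i, j in combinations(range(len(sequences)), 2):
        d = p_distance(sequences[i], sequences[j])
        if groups[i] == groups[j]:
            within[groups[i]].append(d)
        else:
            between[tuple(sorted((groups[i], groups[j])))].append(d)
    return (
        {g: mean(v) for g, v in within.items()},
        {g: mean(v) for g, v in between.items()},
    )


# Toy aligned sequences labelled by continent.
seqs = ["AAAA", "AAAT", "CCCC", "CCCA"]
labels = ["Africa", "Africa", "Europe", "Europe"]
within_means, between_means = group_distances(seqs, labels)
```

In this toy case the within-group means are 0.25 and the Africa-Europe mean is 0.9375, the same qualitative pattern of tight groups separated by larger distances.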

Distribution of Artificial HCMV Genomes

Figure 3C: Distribution of genetic distances within artificial HCMV genomes.

Distribution of Reference HCMV Genomes

Figure 3D: Distribution of genetic distances within reference HCMV genomes.

Phylogenetic Tree of Artificial HCMV Genomes

Figure 4: Phylogenetic tree showing lineage patterns of artificial HCMV genomes.

Despite the challenges posed by genome size, the generated sequences reveal distinct lineage patterns (Fig. 4). Differences in branch lengths between continents reflect the diversity within the reference data, with overrepresentation of reference strains from Europe and the Americas contributing to a broader distribution among their generated counterparts.

Towards Neural Network Integration and a Multi-Agent System

Rules-based and LLM-based simulations each involve trade-offs. Rules-based simulations are programmed with explicit instructions and offer interpretable results. However, because the instructions are explicit, their predictions are bounded by our current understanding of genomic patterns.

Neural networks are often referred to as “black boxes” because of the difficulty in understanding how complex relationships between seemingly unrelated variables are formed during training. Nonetheless, they can help guide the development of hypothesis-based experiments to explain underlying biological realities connected to machine-learned patterns.

Though LLMs present a massive leap for natural language processing, it is up to our community to test and adapt these models to tackle biological questions.

MambaVirus, SOMBRA, and the Future

The context length problem, the limit on how much sequence a model can process at once, restricts herpesvirus researchers’ ability to use LLMs. As machine learning researchers continue to tackle this problem in other domains, we should continue to adapt their findings to answer our biological questions.
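To make the constraint concrete, the sketch below shows a generic workaround: splitting a long sequence into overlapping windows that each fit a model's context. This is an illustration only, not a description of how VIRUSBERT or MambaVirus handle input, and `chunk_sequence` and its parameters are hypothetical.

```python
def chunk_sequence(sequence, window, overlap=0):
    """Split `sequence` into (start, chunk) pairs of at most `window` characters.

    Consecutive chunks share `overlap` characters so that features spanning a
    boundary appear intact in at least one chunk.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("require window > 0 and 0 <= overlap < window")
    step = window - overlap
    chunks = []
    for start in range(0, len(sequence), step):
        chunks.append((start, sequence[start:start + window]))
        if start + window >= len(sequence):
            break  # the final chunk already reaches the end of the sequence
    return chunks


chunks = chunk_sequence("ACGTACGTAC", window=4, overlap=2)
```

Chunking keeps every base visible to the model but discards dependencies longer than one window, which is one reason architectures with longer native context are attractive for whole-genome work.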

The progress in increasing context size, as illustrated by HyenaDNA (5) in the left panel of Fig. 6, demonstrates how researchers are adapting state-of-the-art architectures to address this challenge. We have taken an analogous approach with Mamba.

Model architectures vary and are useful for different tasks. The differences between BERT and Mamba architectures (Figs. 5 and 7) guide our usage of them.

VIRUSBERT’s success in DNA classification tasks makes it a candidate for use in detecting fatal mutations in DNA. We are currently compiling a dataset to fine-tune VIRUSBERT for this purpose.

Alongside VIRUSBERT, we foresee MambaVirus as a tool to correct and regenerate the sequences that VIRUSBERT flags.

Integrating our trained language models could greatly improve SOMBRA’s ability to generate new genomes. We are continuously training and testing new models for integration. Future development of SOMBRA aims to tackle broken protein sequences (Fig. 3, Panel B) and develop autonomous agents that function independently to make gene modifications.

We hope that the continued development of SOMBRA will lead to a powerful tool for modeling viral evolution.

Dataset

The map below displays the geographical location where each sample was collected, along with its strain name and ID.

Legend

Continent
Country
State (USA only)

Strain List