SOMBRA: A New Frontier in HCMV Evolutionary Research

Introduction to Human Cytomegalovirus (HCMV) Evolution

Understanding the evolution of human cytomegalovirus (HCMV) is critical for deciphering its genetic diversity and adaptation mechanisms, and their broader implications for human health. Recent phylogenetic studies, notably by Charles and Venturini et al. (1), have begun to reveal significant evolutionary patterns and relationships among geographically distinct HCMV strains.

However, the current state of research is constrained by two primary challenges:

Limited Data Availability: Only 351 complete HCMV genomes are publicly accessible in the NCBI Virus database, limiting the scope of these analyses.
Geographic Bias: The uneven geographic distribution of these sequences poses a significant barrier to achieving high-resolution phylogenetic insights.

System for Operational Modeling of Biological Replication and Adaptation (SOMBRA)

Figure 4: Overview of the SOMBRA system for operational modeling of biological replication and adaptation.

SOMBRA addresses these challenges by simulating HCMV evolution computationally: it infers ancestral sequences for each continent from a reference alignment, then applies substitutions, indels, and recombination to generate synthetic genomes.

Key Features of SOMBRA

Sliding Window Approach: This method extracts trends and identifies conserved regions across the alignment. Consensus voting is applied in windows with less than 90% agreement to assign nucleotides.
k-mer Index Generation: A k-mer index, generated from the MAFFT alignment, enables rapid alignment of newly generated sequences with the reference alignment.
Geographical Data Extraction: Each sequence includes the country of isolation in its header, with the continent extrapolated from this information. SOMBRA uses this data to organize sequences into geographical groups.
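The first two features can be illustrated with a minimal sketch. It is not SOMBRA's implementation: the function names, the toy alignment, the window size of 4, and k = 3 are all assumptions chosen for readability. Only the 90% agreement threshold comes from the description above. The k-mer index here records the alignment column where each k-mer starts, which is one plausible way to place new sequences against the reference alignment.

```python
from collections import Counter, defaultdict

def column_agreement(alignment):
    """Fraction of sequences sharing the most common symbol in each column."""
    agreement = []
    for column in zip(*alignment):
        top_count = Counter(column).most_common(1)[0][1]
        agreement.append(top_count / len(column))
    return agreement

def low_agreement_windows(alignment, window=4, step=4, threshold=0.9):
    """Return (start, end, mean_agreement) for windows below the threshold."""
    scores = column_agreement(alignment)
    flagged = []
    for start in range(0, len(scores) - window + 1, step):
        mean = sum(scores[start:start + window]) / window
        if mean < threshold:
            flagged.append((start, start + window, mean))
    return flagged

def build_kmer_index(sequence, k=3):
    """Map each k-mer to the alignment columns where it starts (gaps skipped)."""
    index = defaultdict(list)
    # Remember the alignment column of every non-gap base.
    bases = [(col, base) for col, base in enumerate(sequence) if base != "-"]
    for i in range(len(bases) - k + 1):
        kmer = "".join(base for _, base in bases[i:i + k])
        index[kmer].append(bases[i][0])
    return dict(index)

# Toy aligned sequences (hypothetical data, equal length, "-" marks a gap).
alignment = ["ACGTACGT", "ACGTACGA", "ACGTTCGC", "ACGTGCG-"]
print(low_agreement_windows(alignment))        # [(4, 8, 0.6875)]
print(build_kmer_index(alignment[3])["GTG"])   # [2]
```

Windows flagged this way are the ones where a consensus vote would be needed to assign nucleotides. Windows at or above the threshold can be treated as conserved.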

Ancestral Sequence Generation

Group Assignment: Sequences are grouped based on their continental origins. Ancestral sequences are generated for each continent using information gathered during initialization.
Consensus Voting: Variable positions are determined by consensus voting across each continental group.
Sequence Alignment: The inferred ancestral sequences for each continent are aligned with reference sequences to ensure consistency.
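The grouping and consensus-voting steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the SOMBRA code: the record format (continent, aligned sequence), the function names, and the tie-breaking rule are hypothetical, and the input sequences are assumed to be already aligned to equal length.

```python
from collections import Counter, defaultdict

def consensus(sequences):
    """Majority symbol per aligned column.

    Ties are broken by character order, an arbitrary but deterministic
    choice made for this sketch.
    """
    result = []
    for column in zip(*sequences):
        counts = Counter(column)
        best = max(counts.values())
        result.append(min(sym for sym, n in counts.items() if n == best))
    return "".join(result)

def ancestral_by_continent(records):
    """Group (continent, aligned_sequence) pairs and vote within each group."""
    groups = defaultdict(list)
    for continent, seq in records:
        groups[continent].append(seq)
    return {continent: consensus(seqs) for continent, seqs in groups.items()}

# Hypothetical toy records.
records = [
    ("Africa", "ACGTAC"),
    ("Africa", "ACGTTC"),
    ("Africa", "ACCTTC"),
    ("Europe", "ACGAAC"),
    ("Europe", "ACGAAC"),
]
print(ancestral_by_continent(records))
# {'Africa': 'ACGTTC', 'Europe': 'ACGAAC'}
```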

Evolutionary Simulation

Mutation Application: The average variability is calculated from differences between ancestral and reference sequences from the same continent. This variability guides the number of mutations applied to simulate evolutionary changes.
Base Substitution: Precomputed base frequencies for each position provide probabilities for substitutions.
Indel Hotspots: Positions within indel hotspots are subject to stochastic insertions or deletions.
Recombination Events: These are simulated by mixing segments from different sequences at random breakpoints, enhancing genetic diversity.
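The four simulation steps above can be sketched in a few functions. Everything beyond the prose description is an assumption made for illustration: the function names, single-base indels with an even insertion/deletion split, a single crossover per recombination event, and the uniform per-position frequencies in the usage example. A substitution may redraw the base already present, so the number of visible changes can be lower than the number of mutations applied.

```python
import random

def mean_differences(ancestor, references):
    """Average number of mismatching columns between ancestor and references."""
    diffs = [sum(a != b for a, b in zip(ancestor, ref)) for ref in references]
    return sum(diffs) / len(diffs)

def substitute(seq, n_mutations, position_freqs, rng):
    """Mutate n distinct positions, drawing each base from per-position frequencies."""
    seq = list(seq)
    for pos in rng.sample(range(len(seq)), min(n_mutations, len(seq))):
        bases, weights = zip(*position_freqs[pos].items())
        seq[pos] = rng.choices(bases, weights=weights, k=1)[0]
    return "".join(seq)

def apply_indels(seq, hotspots, rate, rng):
    """At each hotspot position, insert or delete one base with probability rate."""
    out = []
    for pos, base in enumerate(seq):
        if pos in hotspots and rng.random() < rate:
            if rng.random() < 0.5:
                continue                      # deletion: drop this base
            out.append(rng.choice("ACGT"))    # insertion before this base
        out.append(base)
    return "".join(out)

def recombine(parent_a, parent_b, rng):
    """Single crossover: prefix of parent_a joined to suffix of parent_b."""
    cut = rng.randint(1, min(len(parent_a), len(parent_b)) - 1)
    return parent_a[:cut] + parent_b[cut:]

# Hypothetical toy data; a seeded generator keeps runs reproducible.
rng = random.Random(0)
ancestor = "ACGTACGTAC"
references = ["ACGTACGTAC", "ACGAACGTTC", "ACGTACCTAC"]
freqs = [{"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}] * len(ancestor)

n = round(mean_differences(ancestor, references))  # mean of 0, 2, 1 differences
child = substitute(ancestor, n, freqs, rng)
child = apply_indels(child, hotspots={4, 5}, rate=0.5, rng=rng)
child = recombine(child, references[1], rng)
print(n, child)
```

Passing an explicit `random.Random` instance rather than using the module-level functions makes each simulated genome reproducible from its seed, which helps when comparing generated sequences against the reference set.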

Future Directions

Generative Models Integration: Future versions of SOMBRA will integrate generative models to further improve the biological relevance of the newly generated sequences.
Final Output: The synthetic sequences are saved in FASTA and TSV formats, ready for further analysis.

As we continue to refine and expand SOMBRA, it’s essential to evaluate how well the simulated HCMV genomes reflect patterns observed in clinical samples. This analysis helps validate the approach and guides further development.

Simulated HCMV Genomes Have Ancestral Patterns That Parallel Clinical Samples

MDS Scatterplot

The scatterplot (Fig. 1) depicts a multidimensional scaling (MDS) analysis of the merged reference dataset, revealing genomic clusters associated with the continent where each sample was collected. Notably, African strains cluster on the periphery of the European strains. Strains from the Americas cluster near the European strains, although they span a wider range. This finding is consistent with recent publications (1).

Generating Artificial HCMV Genomes

Figure 1: MDS scatterplot showing genomic clusters associated with continent of origin.

SOMBRA-generated artificial HCMV genomes largely follow this pattern, with African strains clustering at the periphery of the European and American groups. Panel A depicts the original output from the MDS analyses, while Panel B shows an inversion of point positions around the centroid, revealing similar patterns with Asian, African, and Oceanic strains on the periphery and European and American strains at the center.
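One way to read the inversion in Panel B is as a point reflection through the centroid, where each point p maps to 2c - p for centroid c. The sketch below assumes that reading, along with 2-D coordinates and a hypothetical function name. The source does not specify the exact transformation used. Because MDS coordinates are only defined up to rotation, reflection, and translation, such a reflection preserves every pairwise distance and only changes the orientation of the plot.

```python
def invert_about_centroid(points):
    """Reflect each 2-D point through the centroid: p' = 2c - p."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return [(2 * cx - x, 2 * cy - y) for x, y in points]

# Hypothetical MDS coordinates; their centroid is (1.0, 1.0).
coords = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
print(invert_about_centroid(coords))
# [(2.0, 2.0), (0.0, 2.0), (1.0, -1.0)]
```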

Figure 2: Inversion of point positions around the centroid in MDS analysis.

Fig. 3, Panel A, depicts a length comparison between a generated African strain and a reference African strain. There is a notable size difference (~25%). Panel B displays a protein prediction comparison for the membrane-spanning protein US21, showing that the generated strain contains the ORF but terminates early. We suspect this is due to inaccurate k-mer indexing, which we are working to correct.

Genetic Distance and Lineage Patterns Across Continents

Panels C and D of Fig. 3 compare the genetic distances of sequences grouped by continent. Artificial sequences within a continent show high in-group similarity, whereas genetic distances between continental groups are notably higher. Genomes derived from the Americas exhibit higher genetic distances than those from other continents.
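A within-group versus between-group comparison of this kind can be sketched with a simple p-distance, the proportion of differing sites between two aligned sequences. The distance measure actually used for Fig. 3 is not stated, so p-distance, the gap handling, the function names, and the toy records below are all assumptions.

```python
from itertools import combinations

def p_distance(a, b):
    """Proportion of differing sites, skipping columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("sequences share no ungapped columns")
    return sum(x != y for x, y in pairs) / len(pairs)

def group_distances(records):
    """Split all pairwise distances into within-group and between-group lists."""
    within, between = [], []
    for (g1, s1), (g2, s2) in combinations(records, 2):
        (within if g1 == g2 else between).append(p_distance(s1, s2))
    return within, between

# Hypothetical aligned records as (continent, sequence) pairs.
records = [
    ("Africa", "ACGTACGT"),
    ("Africa", "ACGTACGA"),
    ("Europe", "TCGAACGT"),
    ("Europe", "TCGAACGT"),
]
within, between = group_distances(records)
print(sum(within) / len(within), sum(between) / len(between))
# 0.0625 0.3125
```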

Figure 3C: Distribution of genetic distances within artificial HCMV genomes.

Figure 3D: Distribution of genetic distances within reference HCMV genomes.

Phylogenetic Tree of Artificial HCMV Genomes

Figure 4: Phylogenetic tree showing lineage patterns of artificial HCMV genomes.

Despite the challenges posed by genome size, the generated sequences reveal distinct lineage patterns (Fig. 4). Differences in branch lengths between continents reflect the diversity within the reference data, with overrepresentation of reference strains from Europe and the Americas contributing to a broader distribution among their generated counterparts.

Towards Neural Network Integration and a Multi-Agent System

Rules-based and LLM-based simulations each involve trade-offs. Rules-based simulations are programmed with explicit instructions and offer interpretable results. However, because the instructions are explicit, their predictions are bounded by our current understanding of genomic patterns.

Neural networks are often referred to as “black boxes” because of the difficulty in understanding how complex relationships between seemingly unrelated variables are formed during training. Nonetheless, they can help guide the development of hypothesis-based experiments to explain underlying biological realities connected to machine-learned patterns.

Though LLMs represent a major advance in natural language processing, it is up to our community to test and adapt these models to tackle biological questions.

MambaVirus, SOMBRA, and the Future

The context length problem limits herpesvirus researchers’ ability to use LLMs, because herpesvirus genomes are far longer than the input most language models can process at once. As machine learning researchers continue to tackle this problem in other domains, we should continue to adapt their findings to answer our biological questions.

The progress in increasing context size, as illustrated by HyenaDNA (5) in the left panel of Fig. 6, demonstrates how researchers are adapting state-of-the-art architectures to address this challenge. We have taken an analogous approach with Mamba.

Model architectures vary and are useful for different tasks. The differences between BERT and Mamba architectures (Figs. 5 and 7) guide our usage of them.

VIRUSBERT’s success in DNA classification tasks makes it a candidate for use in detecting fatal mutations in DNA. We are currently compiling a dataset to fine-tune VIRUSBERT for this purpose.

In addition to VIRUSBERT, we foresee MambaVirus as a tool to correct and regenerate sequences identified by VIRUSBERT.

Integrating our trained language models could greatly improve SOMBRA’s ability to generate new genomes. We are continuously training and testing new models for integration. Future development of SOMBRA aims to tackle broken protein sequences (Fig. 3, Panel B) and develop autonomous agents that function independently to make gene modifications.

We hope that the continued development of SOMBRA will yield a powerful tool for modeling evolution.

Dataset

The map below displays the geographical location where each sample was collected, along with its strain and ID.

Legend: Continent, Country, State (USA only)