Sequence-Based Virus Host Prediction

A Curated Dataset and Generalizable Framework for Training Artificial Intelligence to Identify Viruses of Humans

Department of Biochemistry, Microbiology, and Immunology, Wayne State University School of Medicine, Detroit, Michigan, USA
Preprint - February 24, 2025
Figure 1: t-SNE visualization comparing raw sequence features versus learned embeddings, showing clear clustering patterns that discriminate sequences of human and non-human virus genomes.
Abstract: Understanding how viruses evolve to infect specific hosts is crucial for reliably predicting and preventing emerging virus diseases. While current approaches rely heavily on observed phenotypic traits, the virus genome itself contains evolutionary signatures shaped by host adaptation. To enable systematic investigation of these genomic patterns, we present a comprehensive dataset of 58,046 virus genomes spanning 15 families that represent diverse genome architectures and host ranges. Each sequence was systematically classified for human host compatibility based on isolation source, creating a foundation for studying sequence-level determinants of host specificity.

Through analysis of k-mer frequency patterns across viral taxa, we demonstrate that sequence-based features can accurately predict human host compatibility and can generalize to members of virus families not included in the training set. Predictive performance varies between families, which suggests different levels of host-specific signal in virus genomes and may reflect distinct evolutionary constraints across virus groups. Neural network analysis revealed underlying structure in sequence feature space that enables the identification of human-compatible viruses, indicating shared sequence patterns despite diverse genome architectures. This resource and the associated analyses provide new opportunities for identifying the genomic basis of virus host range evolution.

Key Findings

  • Created a comprehensive dataset of 58,046 virus genomes across 15 families.
  • Neural network model accurately predicts human host compatibility based solely on sequence features.
  • K-mer frequency patterns (5-mers) provided the strongest predictive signals.
  • Evolutionary analysis revealed a gradient of human adaptation, from highly adapted viruses to those with minimal compatibility.
  • Cross-family validation demonstrated model generalization to unseen virus families.
  • Low predicted adaptation signals often coincide with higher virulence in humans.

Dataset Composition

Our dataset comprises 58,046 complete virus genomes representing 15 distinct virus families, with an approximately equal distribution between human-associated (52.0%) and non-human-associated (48.0%) viruses.

Virus Genome Distribution
| Virus Family | Total | Human | Non-Human |
| --- | --- | --- | --- |
| Flaviviridae | 9,665 | 2,655 | 7,010 |
| Picornaviridae | 9,256 | 7,253 | 2,003 |
| Sedoreoviridae | 8,775 | 3,151 | 5,624 |
| Hepadnaviridae | 8,281 | 7,780 | 501 |
| Papillomaviridae | 3,814 | 3,073 | 741 |
| Rhabdoviridae | 3,029 | 39 | 2,990 |
| Spinareoviridae | 2,582 | 73 | 2,509 |
| Adenoviridae | 2,305 | 1,747 | 558 |
| Parvoviridae | 2,045 | 479 | 1,566 |
| Polyomaviridae | 2,012 | 1,447 | 565 |
| Togaviridae | 1,971 | 811 | 1,160 |
| Poxviridae | 1,445 | 835 | 610 |
| Orthoherpesviridae | 1,192 | 629 | 563 |
| Orthomyxoviridae | 929 | 77 | 852 |
| Astroviridae | 745 | 132 | 613 |
| Total | 58,046 | 30,181 | 27,865 |

Methods

1. Data Collection and Preparation

  • Retrieved 82,513 complete virus genomes from the NCBI Virus database.
  • Filtered for quality and completeness to yield 58,046 high-quality genomes.
  • Implemented a three-tier classification system for host labeling using direct string matching, pattern recognition, and AI-driven analysis.

2. Feature Extraction

  • Generated k-mer frequency vectors for each sequence (k=3 to k=8).
  • Utilized UMAP for dimensionality reduction and visualization.
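
The k-mer frequency step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `kmer_frequencies`, the use of overlapping windows, the normalization by the number of valid windows, and the choice to skip windows containing ambiguous bases are all assumptions, since the source does not specify them.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=5, alphabet="ACGT"):
    """Return a normalized k-mer frequency vector in lexicographic k-mer order.

    Assumptions (not stated in the source): overlapping windows, single-strand
    counting, and windows with non-ACGT characters are ignored.
    """
    seq = sequence.upper()
    # All 4**k possible k-mers, in a fixed order so vectors are comparable.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Keep only windows made of unambiguous bases.
    valid = [counts.get(kmer, 0) for kmer in kmers]
    total = sum(valid)
    return [c / total if total else 0.0 for c in valid]


# Toy sequence for illustration only; the window spanning "N" is skipped.
vector = kmer_frequencies("ACGTACGTNACGT", k=2)
print(len(vector))  # 4**2 entries
```

With k=5 this yields a 1,024-dimensional vector per genome, and with k=8 a 65,536-dimensional one, which is why the feature space grows quickly across the k=3 to k=8 range.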

3. Neural Network Architecture

  • Developed a shallow feed-forward architecture with two hidden layers.
  • Employed GELU activation functions, L2 regularization, and dropout.
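
As a rough sketch of such an architecture, the NumPy code below implements only the forward pass of a two-hidden-layer network with GELU activations, inverted dropout, and an L2 penalty term. The class name `ShallowNet`, the hidden-layer widths, the initialization scheme, the dropout rate, and the weight-decay value are hypothetical, and the tanh approximation of GELU is used for brevity; none of these details are given in the source.

```python
import numpy as np


def gelu(x):
    # Tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


class ShallowNet:
    """Feed-forward binary classifier with two hidden layers (forward pass only)."""

    def __init__(self, n_features, hidden=(256, 64), seed=0):
        # Hidden widths are illustrative assumptions, not the authors' values.
        self.rng = np.random.default_rng(seed)
        sizes = (n_features, *hidden, 1)
        self.weights = [
            self.rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n))
            for m, n in zip(sizes[:-1], sizes[1:])
        ]
        self.biases = [np.zeros(n) for n in sizes[1:]]

    def forward(self, x, dropout=0.0, training=False):
        h = x
        for w, b in zip(self.weights[:-1], self.biases[:-1]):
            h = gelu(h @ w + b)
            if training and dropout > 0.0:
                # Inverted dropout: zero units at random, rescale the rest.
                mask = self.rng.random(h.shape) >= dropout
                h = h * mask / (1.0 - dropout)
        logits = h @ self.weights[-1] + self.biases[-1]
        # Sigmoid output read as the probability of the "human" class.
        return 1.0 / (1.0 + np.exp(-logits[:, 0]))

    def l2_penalty(self, weight_decay=1e-4):
        # L2 regularization term that would be added to the training loss.
        return weight_decay * sum(float(np.sum(w**2)) for w in self.weights)


net = ShallowNet(n_features=1024)  # 1,024 = number of possible 5-mers
batch = np.random.default_rng(1).random((4, 1024))
probabilities = net.forward(batch)
```

In practice a deep learning framework would supply the training loop, gradients, and optimizer; the sketch is meant only to show how the listed components fit together.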

4. Model Evaluation

  • Used Matthews Correlation Coefficient (MCC) as the primary evaluation metric.
  • Performed cross-family validation to assess generalization capabilities.
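
For reference, the MCC of a binary classifier is computed from the confusion-matrix counts as (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below implements that formula and one plausible form of cross-family validation, in which each family is held out in turn. The helper names `mcc` and `leave_one_family_out` are hypothetical, the hold-one-family-out design is an assumption about what "cross-family validation" means here, and returning 0 when the denominator vanishes is a common convention rather than something stated in the source.

```python
import math


def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels (1 = human, 0 = non-human)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: MCC is 0 when any marginal count is zero
    # (for example, when every prediction falls in one class).
    return (tp * tn - fp * fn) / denom if denom else 0.0


def leave_one_family_out(families):
    """Yield (held_out_family, train_indices, test_indices) for each family."""
    for held_out in sorted(set(families)):
        train = [i for i, f in enumerate(families) if f != held_out]
        test = [i for i, f in enumerate(families) if f == held_out]
        yield held_out, train, test


# Toy labels for illustration only.
print(mcc([1, 1, 0, 0], [1, 0, 0, 0]))
for family, train_idx, test_idx in leave_one_family_out(
    ["Flaviviridae", "Togaviridae", "Flaviviridae"]
):
    print(family, train_idx, test_idx)
```

Unlike accuracy, MCC stays near 0 for a model that always predicts the majority class, which matters for families with strongly skewed class ratios.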

Key Results & Visualizations

Model Performance Across k-mer Sizes

We evaluated the predictive power of k-mer frequency patterns as features for human host prediction. The Neural Network model consistently outperformed both Logistic Regression and Random Forest baselines across different k-mer sizes, with 5-mer features yielding the highest Matthews Correlation Coefficient (MCC = 0.820).

Matthews Correlation Coefficient (MCC) scores across k-mer sizes and model architectures
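| k-mer | Logistic Regression MCC | Random Forest MCC | Neural Network MCC |
| --- | --- | --- | --- |
| k3 | 0.723 | 0.759 | 0.810 |
| k4 | 0.775 | 0.757 | 0.804 |
| k5 | 0.767 | 0.741 | 0.820 |
| k6 | 0.654 | 0.750 | 0.808 |
| k7 | 0.669 | 0.746 | 0.779 |
| k8 | 0.594 | 0.749 | 0.760 |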

Family-Specific Performance Analysis

We conducted a detailed analysis of model performance across individual virus families, revealing substantial variation in how well sequence features predict human compatibility. This variability suggests differing levels of host-specific genomic signatures among virus groups, which may reflect their evolutionary histories and host adaptation strategies.

Matthews Correlation Coefficient comparison across virus families, sorted by dataset size
| Family | k3 | k4 | k5 | k6 | Sequences (human/non-human) |
| --- | --- | --- | --- | --- | --- |
| Flaviviridae | 0.442 | 0.294 | -0.027 | 0.049 | 9665 (2655/7010) |
| Picornaviridae | 0.390 | 0.257 | 0.285 | 0.292 | 9256 (7253/2003) |
| Sedoreoviridae | 0.248 | 0.068 | 0.276 | 0.289 | 8775 (3151/5624) |
| Hepadnaviridae | -0.320 | 0.341 | 0.738 | 0.828 | 8281 (7780/501) |
| Papillomaviridae | 0.283 | 0.539 | 0.634 | 0.633 | 3814 (3073/741) |
| Rhabdoviridae | -0.020 | -0.019 | -0.012 | -0.005 | 3029 (39/2990) |
| Spinareoviridae | -0.049 | -0.002 | -0.019 | 0.013 | 2582 (73/2509) |
| Adenoviridae | -0.144 | -0.205 | -0.050 | 0.000 | 2305 (1747/558) |
| Parvoviridae | 0.212 | 0.308 | 0.099 | 0.311 | 2045 (479/1566) |
| Polyomaviridae | 0.300 | -0.072 | -0.077 | 0.144 | 2012 (1447/565) |
| Togaviridae | -0.454 | 0.000 | 0.000 | 0.000 | 1971 (811/1160) |
| Poxviridae | 0.436 | -0.304 | -0.155 | -0.110 | 1445 (835/610) |
| Orthoherpesviridae | -0.257 | -0.224 | -0.032 | 0.155 | 1192 (629/563) |
| Orthomyxoviridae | 0.059 | 0.101 | 0.000 | -0.046 | 929 (77/852) |
| Astroviridae | 0.181 | 0.013 | 0.126 | 0.047 | 745 (132/613) |

Embedding Analysis and Visualization

t-distributed stochastic neighbor embedding (t-SNE) is a widely used nonlinear dimensionality reduction technique that preserves local relationships between data points when projecting high-dimensional data into a lower-dimensional space. To visualize and analyze the relationships between virus sequences, we used t-SNE to generate low-dimensional projections of both the raw k-mer frequency vectors and the learned embeddings from the neural network's final layer. Projections were computed for both the training and test datasets using scikit-learn.
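
A projection of this kind can be produced with scikit-learn's `TSNE` class, as in the sketch below. The random matrix stands in for real k-mer vectors or learned embeddings, and the perplexity, initialization, and seed are illustrative assumptions; the source does not report the settings that were used.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for a (n_sequences, n_features) feature matrix.
rng = np.random.default_rng(0)
features = rng.random((60, 32))

# Perplexity must be smaller than the number of samples; 15 is an arbitrary choice.
tsne = TSNE(n_components=2, perplexity=15.0, init="pca", random_state=0)
projection = tsne.fit_transform(features)

print(projection.shape)  # one 2-D point per input row
```

Because t-SNE has no transform for unseen points, training and test sets would each be projected with their own `fit_transform` call (or projected jointly), so distances between separate plots are not directly comparable.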

The resulting visualizations were colored according to three different classification schemes: human/non-human host tropism, model prediction probabilities, and prediction uncertainty. The uncertainty u for a prediction probability p was calculated as:

\[ u = 1 - 2|p - 0.5| \]

where maximum uncertainty (1.0) occurs at p=0.5 and minimum uncertainty (0) at p=0 or p=1.
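
The formula translates directly into code. The function name `prediction_uncertainty` is hypothetical; the two example probabilities are the 5-mer human-host probabilities reported on this page for SARS-CoV-2 (0.445) and rabies virus (0.063).

```python
def prediction_uncertainty(p):
    """Uncertainty u = 1 - 2|p - 0.5| for a predicted probability p in [0, 1]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    return 1.0 - 2.0 * abs(p - 0.5)


# A probability near 0.5 gives high uncertainty, one near 0 or 1 gives low uncertainty.
print(round(prediction_uncertainty(0.445), 3))  # SARS-CoV-2 example
print(round(prediction_uncertainty(0.063), 3))  # rabies virus example
```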

t-SNE Visualization 1: Human Host Probability

Interactive visualization showing the distribution of predicted human-host probabilities across the sequence embedding space.

t-SNE Visualization 2: Prediction Uncertainty

Distribution of prediction uncertainty across the sequence embedding space, highlighting areas where the model is most and least confident.

Pathogenicity and Adaptation Correlation

We observed a striking inverse relationship between predicted human adaptation and virulence. Viruses with high predicted human-host compatibility typically exhibit lower fatality rates in humans, while those with minimal adaptation signals (e.g., rabies virus, Marburg virus) tend to cause more severe or fatal disease when they do infect humans. This finding suggests potential evolutionary trade-offs between adaptation and virulence.

Viruses and their reported human-fatality rate sorted by human-host probability from 5-mer frequency
| Accession | Virus | Family | Fatality Rate | Probability |
| --- | --- | --- | --- | --- |
| NC_038889 | Human papillomavirus 30 | Papillomaviridae | ~0.0001% | 0.980 |
| NC_006273 | Human cytomegalovirus | Orthoherpesviridae | ~0.0001% | 0.920 |
| NC_001538 | BK polyomavirus | Polyomaviridae | ~0.0001% | 0.910 |
| NC_003977 | Hepatitis B virus | Hepadnaviridae | ~0.0001% | 0.871 |
| MG953831 | Human bocavirus 2 | Parvoviridae | ~0.0001% | 0.730 |
| NC_063383 | Monkeypox virus | Poxviridae | 3–6% | 0.680 |
| KX010994 | Yellow fever virus | Flaviviridae | 20–60% | 0.550 |
| NC_001802 | HIV | Retroviridae* | ~0.0001% | 0.543 |
| NC_045512 | SARS-CoV-2 | Coronaviridae* | ~1% | 0.445 |
| NC_001477 | Dengue virus type 1 | Flaviviridae | ~1% | 0.350 |
| NC_001475 | Dengue virus type 3 | Flaviviridae | ~1% | 0.330 |
| NC_002640 | Dengue virus type 4 | Flaviviridae | ~1% | 0.320 |
| NC_044855 | Norovirus | Caliciviridae* | 0.001–0.1% | 0.260 |
| NC_075022 | Venezuelan equine encephalitis virus | Togaviridae | 1% | 0.214 |
| NC_001474 | Dengue virus type 2 | Flaviviridae | ~1% | 0.170 |
| NC_001563 | West Nile virus | Flaviviridae | 3–15% | 0.168 |
| NC_005062 | Omsk hemorrhagic fever virus | Flaviviridae | 0.5–3% | 0.168 |
| NC_019843 | MERS | Coronaviridae* | 35–36% | 0.140 |
| NC_002728 | Nipah virus | Paramyxoviridae* | 40–75% | 0.124 |
| NC_001608 | Marburg virus | Filoviridae* | 50% | 0.093 |
| NC_006432 | Sudan ebolavirus | Filoviridae* | 40–75% | 0.079 |
| NC_004812 | Macacine alphaherpesvirus 1 | Orthoherpesviridae | 100% | 0.075 |
| NC_003899 | Eastern equine encephalitis virus | Togaviridae | 30% | 0.064 |
| NC_001542 | Rabies virus | Rhabdoviridae | 100% | 0.063 |
| NC_075802 | Salmonid herpesvirus 2 | Alloherpesviridae* | NA | 0.042 |
| NC_079185 | Vibrio phage | Chaseviridae* | NA | 0.037 |
| NC_079140 | Gordonia phage Azira | Caudoviricetes* | NA | 0.036 |
| OR795895 | Cucumber mosaic virus | Bromoviridae* | NA | 0.035 |
| NC_078671 | Grapevine line pattern virus | Bromoviridae* | NA | 0.034 |
| NC_077680 | Tomato mottle leaf curl virus | Geminiviridae* | NA | 0.028 |
| NC_004181 | Colorado tick fever virus | Spinareoviridae | 0.2% | 0.022 |

* Taxon not among the 15 virus families in the dataset (see Dataset Composition).

SARS-CoV-2 Analysis

The model correctly identified human SARS-CoV-2 sequences as possessing distinct human-host signatures relative to non-human betacoronaviruses. A gradual increase in predicted human-host probability over time (2019-2024) was observed, averaging an increase of 6.89×10⁻³ per month.
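
A rate of 6.89×10⁻³ per month corresponds to roughly 0.083 per year (6.89×10⁻³ × 12). The source does not state how the monthly rate was estimated; one simple possibility, sketched below as an assumption, is the ordinary least-squares slope of predicted probability against months since the first sampled genome. The function name `ols_slope` and the example values are made up for illustration and are not data from the study.

```python
def ols_slope(months, probabilities):
    """Ordinary least-squares slope of probability versus time in months."""
    n = len(months)
    if n < 2 or n != len(probabilities):
        raise ValueError("need two or more paired observations")
    mean_x = sum(months) / n
    mean_y = sum(probabilities) / n
    sxx = sum((x - mean_x) ** 2 for x in months)
    if sxx == 0:
        raise ValueError("all observations share the same time point")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, probabilities))
    return sxy / sxx


# Made-up values: months since first sample and predicted human-host probability.
example_months = [0, 12, 24, 36, 48]
example_probabilities = [0.40, 0.47, 0.58, 0.63, 0.72]
slope = ols_slope(example_months, example_probabilities)
print(f"{slope:.5f} per month, {slope * 12:.3f} per year")
```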

COVID-19 Adaptation Trend

Figure 4: Temporal analysis of SARS-CoV-2 genome sequences showing increasing human host adaptation probability from 2019 to 2024. Each point represents a viral genome isolate, with the trend line showing a consistent increase in predicted human adaptation.

Citation

@article{carbajo2025virus, title={}, author={}, journal={}, year={2025}, }

Acknowledgements

We extend our gratitude to the data contributors and maintainers of public genomic databases. We also acknowledge the institutional support and feedback from colleagues during the development of this dataset.