Key Results & Visualizations
Model Performance Across k-mer Sizes
We evaluated the predictive power of k-mer frequency patterns as features for human host prediction. The Neural Network model outperformed both the Logistic Regression and Random Forest baselines at every k-mer size tested (k = 3 to 8), with 5-mer features yielding the highest Matthews Correlation Coefficient (MCC = 0.820).
Matthews Correlation Coefficient (MCC) scores across k-mer sizes and model architectures
| k-mer | Logistic Regression MCC | Random Forest MCC | Neural Network MCC |
| --- | --- | --- | --- |
| k3 | 0.723 | 0.759 | 0.810 |
| k4 | 0.775 | 0.757 | 0.804 |
| k5 | 0.767 | 0.741 | 0.820 |
| k6 | 0.654 | 0.750 | 0.808 |
| k7 | 0.669 | 0.746 | 0.779 |
| k8 | 0.594 | 0.749 | 0.760 |
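The k-mer frequency features compared above can be illustrated with a minimal sketch. It assumes overlapping windows, a plain A/C/G/T alphabet, and normalization by the total number of valid windows; the function name `kmer_frequencies` is hypothetical, and the preprocessing actually used for the models may differ.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence: str, k: int) -> list[float]:
    """Return normalized k-mer frequencies in lexicographic A/C/G/T order."""
    seq = sequence.upper()
    # All 4**k possible k-mers define a fixed feature order.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Assumption: windows containing ambiguous bases (e.g. N) are ignored.
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


vec = kmer_frequencies("ACGTACGTNACG", k=3)
print(len(vec))  # 64 features for k = 3 (4**3)
```

The feature vector has 4^k entries, so it grows from 64 features at k = 3 to 65,536 at k = 8.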
Family-Specific Performance Analysis
We conducted a detailed analysis of model performance across individual virus families, revealing substantial variation in how well sequence features predict human compatibility. This variability suggests differing levels of host-specific genomic signatures among virus groups, which may reflect their evolutionary histories and host adaptation strategies.
Matthews Correlation Coefficient comparison across virus families, sorted by dataset size
| Family | k3 | k4 | k5 | k6 | Sequences, total (human/non-human) |
| --- | --- | --- | --- | --- | --- |
| Flaviviridae | 0.442 | 0.294 | -0.027 | 0.049 | 9665 (2655/7010) |
| Picornaviridae | 0.390 | 0.257 | 0.285 | 0.292 | 9256 (7253/2003) |
| Sedoreoviridae | 0.248 | 0.068 | 0.276 | 0.289 | 8775 (3151/5624) |
| Hepadnaviridae | -0.320 | 0.341 | 0.738 | 0.828 | 8281 (7780/501) |
| Papillomaviridae | 0.283 | 0.539 | 0.634 | 0.633 | 3814 (3073/741) |
| Rhabdoviridae | -0.020 | -0.019 | -0.012 | -0.005 | 3029 (39/2990) |
| Spinareoviridae | -0.049 | -0.002 | -0.019 | 0.013 | 2582 (73/2509) |
| Adenoviridae | -0.144 | -0.205 | -0.050 | 0.000 | 2305 (1747/558) |
| Parvoviridae | 0.212 | 0.308 | 0.099 | 0.311 | 2045 (479/1566) |
| Polyomaviridae | 0.300 | -0.072 | -0.077 | 0.144 | 2012 (1447/565) |
| Togaviridae | -0.454 | 0.000 | 0.000 | 0.000 | 1971 (811/1160) |
| Poxviridae | 0.436 | -0.304 | -0.155 | -0.110 | 1445 (835/610) |
| Orthoherpesviridae | -0.257 | -0.224 | -0.032 | 0.155 | 1192 (629/563) |
| Orthomyxoviridae | 0.059 | 0.101 | 0.000 | -0.046 | 929 (77/852) |
| Astroviridae | 0.181 | 0.013 | 0.126 | 0.047 | 745 (132/613) |
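For reference, the MCC reported in these tables can be computed from the four confusion-matrix counts. The sketch below assumes the common convention of returning 0 when the denominator is zero, which happens when a model predicts a single class for every sequence in a group. That is one possible reading of the exact 0.000 entries above, not something the results confirm. The function name `mcc` and the example counts are hypothetical.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: an undefined MCC (zero denominator) is reported as 0.
    return numerator / denominator if denominator else 0.0


# Hypothetical counts: every sequence is predicted non-human,
# so tp = fp = 0 and the denominator vanishes.
print(mcc(tp=0, tn=90, fp=0, fn=10))  # 0.0
```

Because MCC uses all four counts, it stays informative for the heavily imbalanced families in the table, where accuracy alone would look high for a model that always predicts the majority class.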
Embedding Analysis and Visualization
t-distributed stochastic neighbor embedding (t-SNE) is a widely used nonlinear dimensionality reduction technique that preserves local relationships between data points when projecting high-dimensional data into a lower-dimensional space. To visualize and analyze the relationships between virus sequences, we employed t-SNE to generate low-dimensional projections of both the raw k-mer frequency vectors and the learned embeddings from the neural network's final layer. This was done on both training and test datasets using scikit-learn.
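A minimal sketch of such a projection with scikit-learn's `TSNE` is given here. The perplexity, PCA initialization, random seed, and the helper name `project_2d` are assumptions chosen for illustration, and the random matrix is placeholder data standing in for k-mer vectors or final-layer embeddings; the settings actually used are not stated in this section.

```python
import numpy as np
from sklearn.manifold import TSNE


def project_2d(features: np.ndarray, perplexity: float = 30.0, seed: int = 0) -> np.ndarray:
    """Project an (n_samples, n_features) matrix to two dimensions with t-SNE."""
    # Perplexity must be smaller than the number of samples.
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=seed)
    return tsne.fit_transform(features)


# Placeholder data: 60 samples with 32 features each.
rng = np.random.default_rng(0)
demo = rng.random((60, 32))
coords = project_2d(demo, perplexity=10.0)
print(coords.shape)  # (60, 2)
```

t-SNE has no transform for unseen points, so training and test sets are each projected with their own `fit_transform` call, or jointly in a single call.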
The resulting visualizations were colored according to three schemes: human/non-human host label, model prediction probability, and prediction uncertainty. The uncertainty u for a prediction probability p was calculated as:
\[ u = 1 - 2|p - 0.5| \]
where maximum uncertainty (1.0) occurs at p=0.5 and minimum uncertainty (0) at p=0 or p=1.
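The uncertainty formula translates directly into code. In this sketch the function name `prediction_uncertainty` and the input check are illustrative additions.

```python
def prediction_uncertainty(p: float) -> float:
    """Return u = 1 - 2|p - 0.5| for a predicted probability p in [0, 1]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    return 1.0 - 2.0 * abs(p - 0.5)


print(prediction_uncertainty(0.5))   # 1.0 (maximum uncertainty)
print(prediction_uncertainty(1.0))   # 0.0 (minimum uncertainty)
print(prediction_uncertainty(0.75))  # 0.5
```

The measure is symmetric around p = 0.5, so a confident non-human prediction and a confident human prediction receive the same low uncertainty.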
t-SNE Visualization 1: Human Host Probability
Interactive visualization showing the distribution of predicted human-host probabilities across the sequence embedding space.
t-SNE Visualization 2: Prediction Uncertainty
Distribution of prediction uncertainty across the sequence embedding space, highlighting areas where the model is most and least confident.
Pathogenicity and Adaptation Correlation
We observed a striking inverse relationship between predicted human adaptation and virulence. Viruses with high predicted human-host compatibility typically exhibit lower fatality rates in humans, while those with minimal adaptation signals (e.g., rabies virus, Marburg virus) tend to cause more severe or fatal disease when they do infect humans. This finding suggests potential evolutionary trade-offs between adaptation and virulence.
Viruses and their reported human fatality rates, sorted by predicted human-host probability (5-mer frequency features)
| Accession | Virus | Family | Fatality Rate | Probability |
| --- | --- | --- | --- | --- |
| NC_038889 | Human papillomavirus 30 | Papillomaviridae | ~0.0001% | 0.980 |
| NC_006273 | Human cytomegalovirus | Orthoherpesviridae | ~0.0001% | 0.920 |
| NC_001538 | BK polyomavirus | Polyomaviridae | ~0.0001% | 0.910 |
| NC_003977 | Hepatitis B virus | Hepadnaviridae | ~0.0001% | 0.871 |
| MG953831 | Human bocavirus 2 | Parvoviridae | ~0.0001% | 0.730 |
| NC_063383 | Monkeypox virus | Poxviridae | 3–6% | 0.680 |
| KX010994 | Yellow fever virus | Flaviviridae | 20–60% | 0.550 |
| NC_001802 | HIV | Retroviridae* | ~0.0001% | 0.543 |
| NC_045512 | SARS-CoV-2 | Coronaviridae* | ~1% | 0.445 |
| NC_001477 | Dengue virus type 1 | Flaviviridae | ~1% | 0.350 |
| NC_001475 | Dengue virus type 3 | Flaviviridae | ~1% | 0.330 |
| NC_002640 | Dengue virus type 4 | Flaviviridae | ~1% | 0.320 |
| NC_044855 | Norovirus | Caliciviridae* | 0.001–0.1% | 0.260 |
| NC_075022 | Venezuelan equine encephalitis virus | Togaviridae | 1% | 0.214 |
| NC_001474 | Dengue virus type 2 | Flaviviridae | ~1% | 0.170 |
| NC_001563 | West Nile virus | Flaviviridae | 3–15% | 0.168 |
| NC_005062 | Omsk hemorrhagic fever | Flaviviridae | 0.5–3% | 0.168 |
| NC_019843 | MERS | Coronaviridae* | 35–36% | 0.140 |
| NC_002728 | Nipah virus | Paramyxoviridae* | 40–75% | 0.124 |
| NC_001608 | Marburg virus | Filoviridae* | 50% | 0.093 |
| NC_006432 | Sudan ebola virus | Filoviridae* | 40–75% | 0.079 |
| NC_004812 | Macacine alphaherpesvirus 1 | Orthoherpesviridae | 100% | 0.075 |
| NC_003899 | Eastern equine encephalitis virus | Togaviridae | 30% | 0.064 |
| NC_001542 | Rabies virus | Rhabdoviridae | 100% | 0.063 |
| NC_075802 | Salmonid herpesvirus 2 | Alloherpesviridae* | NA | 0.042 |
| NC_079185 | Vibrio phage | Chaseviridae* | NA | 0.037 |
| NC_079140 | Gordonia phage Azira | Caudoviricetes* | NA | 0.036 |
| OR795895 | Cucumber mosaic virus | Bromoviridae* | NA | 0.035 |
| NC_078671 | Grapevine line pattern virus | Bromoviridae* | NA | 0.034 |
| NC_077680 | Tomato mottle leaf curl virus | Geminiviridae* | NA | 0.028 |
| NC_004181 | Colorado tick fever virus | Spinareoviridae | 0.2% | 0.022 |
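One way to quantify an inverse relationship of this kind is a Spearman rank correlation between fatality rate and predicted probability. The sketch below is illustrative only: it uses a small hand-picked subset of rows that report a single fatality value, treats the approximate values as exact, and is not a reproduction of any analysis behind the table. The helper names `average_ranks` and `spearman` are hypothetical.

```python
import math


def average_ranks(values: list[float]) -> list[float]:
    """Assign 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    return ranks


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman correlation: the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


# Subset of table rows: HPV 30, hepatitis B, SARS-CoV-2, Marburg, rabies.
fatality_percent = [0.0001, 0.0001, 1.0, 50.0, 100.0]
probability = [0.980, 0.871, 0.445, 0.093, 0.063]
rho = spearman(fatality_percent, probability)
print(rho)  # negative for this subset
```

A fuller analysis would need a rule for the range-valued fatality rates (for example midpoints) and for the rows marked NA, both of which are left out here.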
SARS-CoV-2 Analysis
The model correctly identified human SARS-CoV-2 sequences as possessing distinct human-host signatures relative to non-human betacoronaviruses. Predicted human-host probability increased gradually over time (2019 to 2024), with an average increase of 6.89×10⁻³ per month.
COVID-19 Adaptation Trend
Figure 4: Temporal analysis of SARS-CoV-2 genome sequences showing increasing human host adaptation probability from 2019 to 2024. Each point represents a viral genome isolate, with the trend line showing a consistent increase in predicted human adaptation.
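A monthly rate like the one reported above can be estimated as the slope of an ordinary least-squares line through (month, probability) pairs. The fitting method is an assumption here, since this section does not state how the trend line was computed. The data in the sketch are synthetic, generated from an arbitrary intercept of 0.4, and the function name `monthly_slope` is hypothetical.

```python
def monthly_slope(months: list[float], probabilities: list[float]) -> float:
    """Least-squares slope of probability against months since first sample."""
    n = len(months)
    mean_x = sum(months) / n
    mean_y = sum(probabilities) / n
    sxx = sum((x - mean_x) ** 2 for x in months)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, probabilities))
    return sxy / sxx


# Synthetic, noise-free example: 60 months rising by 6.89e-3 per month.
months = [float(m) for m in range(60)]
probabilities = [0.4 + 6.89e-3 * m for m in months]
slope = monthly_slope(months, probabilities)
print(f"{slope:.5f}")  # 0.00689
```

With real isolates, each genome would contribute one point, with its collection date converted to months since the earliest sample.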