Sequence Based Virus Host Prediction: A Curated Dataset and Generalizable Framework for Training Artificial Intelligence to Identify Viruses of Humans

Figure 1: t-SNE visualization comparing raw sequence features versus learned embeddings, showing clear clustering patterns that discriminate sequences of human and non-human virus genomes.

Abstract: Understanding how viruses evolve to infect specific hosts is crucial for reliably predicting and preventing emerging virus diseases. While current approaches rely heavily on observed phenotypic traits, the virus genome itself contains evolutionary signatures shaped by host adaptation. To enable systematic investigation of these genomic patterns, we present a comprehensive dataset of 58,046 virus genomes spanning 15 families that represent diverse genome architectures and host ranges. Each sequence was systematically classified for human host compatibility based on isolation source, creating a foundation for studying sequence-level determinants of host specificity.

Through analysis of k-mer frequency patterns across viral taxa, we demonstrate that sequence-based features can accurately predict human host compatibility and can even generalize to members of virus families not included in the training set. Predictive performance varies considerably among families, which suggests different levels of host-specific signal in virus genomes and may reflect distinct evolutionary constraints across virus groups. Neural network analysis revealed underlying structure in sequence feature space that enables the identification of human-compatible viruses, indicating shared sequence patterns despite diverse genome architectures. This resource and the associated analyses provide new opportunities for identifying the genomic basis of virus host range evolution.

Key Findings

Created a comprehensive dataset of 58,046 virus genomes across 15 families.
Neural network model predicts human host compatibility from sequence features alone (best MCC = 0.820 with 5-mers).
K-mer frequency patterns (5-mers) provided the strongest predictive signals.
Evolutionary analysis revealed a gradient of human adaptation, from highly adapted viruses to those with minimal compatibility.
Cross-family validation demonstrated model generalization to unseen virus families.
Low predicted adaptation often correlates with higher virulence in humans.

Dataset Composition

Our dataset comprises 58,046 complete virus genomes representing 15 distinct virus families, with an approximately equal distribution between human-associated (52.0%) and non-human-associated (48.0%) viruses.

Virus Genome Distribution
Virus Family	Total	Human	Non-Human
Flaviviridae	9,665	2,655	7,010
Picornaviridae	9,256	7,253	2,003
Sedoreoviridae	8,775	3,151	5,624
Hepadnaviridae	8,281	7,780	501
Papillomaviridae	3,814	3,073	741
Rhabdoviridae	3,029	39	2,990
Spinareoviridae	2,582	73	2,509
Adenoviridae	2,305	1,747	558
Parvoviridae	2,045	479	1,566
Polyomaviridae	2,012	1,447	565
Togaviridae	1,971	811	1,160
Poxviridae	1,445	835	610
Orthoherpesviridae	1,192	629	563
Orthomyxoviridae	929	77	852
Astroviridae	745	132	613
Total	58,046	30,181	27,865
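
The overall split is close to even, but the balance within individual families is not. As a quick check on the arithmetic, the sketch below recomputes the human-associated share from the counts in the table above. The names COUNTS and human_share are illustrative and not part of the released code.

```python
# (total, human) genome counts per family, copied from the table above.
COUNTS = {
    "Flaviviridae": (9665, 2655),
    "Picornaviridae": (9256, 7253),
    "Sedoreoviridae": (8775, 3151),
    "Hepadnaviridae": (8281, 7780),
    "Papillomaviridae": (3814, 3073),
    "Rhabdoviridae": (3029, 39),
    "Spinareoviridae": (2582, 73),
    "Adenoviridae": (2305, 1747),
    "Parvoviridae": (2045, 479),
    "Polyomaviridae": (2012, 1447),
    "Togaviridae": (1971, 811),
    "Poxviridae": (1445, 835),
    "Orthoherpesviridae": (1192, 629),
    "Orthomyxoviridae": (929, 77),
    "Astroviridae": (745, 132),
}


def human_share(total: int, human: int) -> float:
    """Percentage of genomes labeled human-associated."""
    return 100.0 * human / total


all_total = sum(total for total, _ in COUNTS.values())
all_human = sum(human for _, human in COUNTS.values())
print(f"overall: {all_human}/{all_total} = {human_share(all_total, all_human):.1f}% human")

# Families ordered from most to least human-skewed.
for family, (total, human) in sorted(COUNTS.items(), key=lambda kv: -human_share(*kv[1])):
    print(f"{family:<20}{human_share(total, human):5.1f}%")
```

The per-family listing makes the skew visible: Rhabdoviridae, for example, has only 39 human-associated genomes out of 3,029 (about 1.3%), which is worth keeping in mind when reading the per-family scores.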

Methods

1. Data Collection and Preparation

Retrieved 82,513 complete virus genomes from the NCBI Virus database.
Filtered for quality and completeness to yield 58,046 high-quality genomes.
Implemented a three-tier classification system for host labeling using direct string matching, pattern recognition, and AI-driven analysis.

2. Feature Extraction

Generated k-mer frequency vectors for each sequence (k=3 to k=8).
Utilized t-SNE for dimensionality reduction and visualization.
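
As an illustration of the feature extraction step, here is a minimal sketch of a normalized k-mer frequency vector. It assumes overlapping k-mers, a fixed lexicographic A/C/G/T ordering, and that k-mers containing ambiguous bases are skipped; the function name kmer_frequencies is hypothetical, and the released pipeline may differ in any of these choices.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: only unambiguous bases contribute to the features


def kmer_frequencies(sequence: str, k: int) -> list[float]:
    """Return the normalized k-mer frequency vector (length 4**k) for one sequence."""
    seq = sequence.upper().replace("U", "T")  # treat RNA genomes as DNA text
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Fixed ordering so every genome maps to the same feature positions.
    vocabulary = ["".join(p) for p in product(ALPHABET, repeat=k)]
    # Windows containing N or other ambiguity codes are not in the vocabulary.
    total = sum(counts[kmer] for kmer in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[kmer] / total for kmer in vocabulary]


vector = kmer_frequencies("ATGCGATACGCTTGA", k=3)
print(len(vector), round(sum(vector), 6))
```

The vector length grows as 4^k (64 features at k=3, 65,536 at k=8), so larger k gives sparser vectors for a genome of fixed length.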

3. Neural Network Architecture

Developed a shallow feed-forward architecture with two hidden layers.
Employed GELU activation functions, L2 regularization, and dropout.
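
To make the architecture concrete, the following dependency-free sketch runs a forward pass through two GELU hidden layers with dropout and computes an L2 penalty. The layer widths (16 and 8), dropout rate, weight-decay value, initialization, and sigmoid output are all assumptions chosen for illustration; the page does not state the actual hyperparameters.

```python
import math
import random


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def dense(inputs, weights, biases):
    """Affine layer: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


def dropout(values, rate, rng, training):
    """Inverted dropout: zero units with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def init_layer(n_in, n_out, rng):
    """Small uniform random weights and zero biases (illustrative initialization)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(x, layers, rate=0.2, rng=None, training=False):
    """Two GELU hidden layers with dropout, then a single sigmoid output unit."""
    rng = rng or random.Random(0)
    h = x
    for weights, biases in layers[:-1]:
        h = dropout([gelu(z) for z in dense(h, weights, biases)], rate, rng, training)
    (logit,) = dense(h, *layers[-1])
    return 1.0 / (1.0 + math.exp(-logit))


def l2_penalty(layers, weight_decay=1e-4):
    """L2 regularization term that would be added to the training loss."""
    return weight_decay * sum(w * w for weights, _ in layers for row in weights for w in row)


rng = random.Random(42)
n_features = 4 ** 3  # length of a 3-mer frequency vector
layers = [init_layer(n_features, 16, rng), init_layer(16, 8, rng), init_layer(8, 1, rng)]
x = [1.0 / n_features] * n_features  # placeholder input: a uniform k-mer profile
print(forward(x, layers), l2_penalty(layers))
```

Dropout is active only when training=True, so inference is deterministic for fixed weights.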

4. Model Evaluation

Used Matthews Correlation Coefficient (MCC) as the primary evaluation metric.
Performed cross-family validation to assess generalization capabilities.
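
For reference, MCC is computed from the binary confusion matrix as (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below implements it together with one plausible form of cross-family validation, a leave-one-family-out split. The exact splitting protocol is not described here, so leave_one_family_out is an assumption, and returning 0 when the denominator vanishes is a common convention rather than a documented choice of this study.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """MCC for binary labels (1 = human, 0 = non-human)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: MCC is 0 when any marginal is empty (e.g. one class is never predicted).
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def leave_one_family_out(families):
    """Yield (held_out_family, train_indices, test_indices) for each family."""
    for held_out in sorted(set(families)):
        train = [i for i, fam in enumerate(families) if fam != held_out]
        test = [i for i, fam in enumerate(families) if fam == held_out]
        yield held_out, train, test


# Toy labels and predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
print(round(matthews_corrcoef(y_true, y_pred), 3))

families = ["Flaviviridae", "Flaviviridae", "Togaviridae", "Poxviridae"]
for held_out, train_idx, test_idx in leave_one_family_out(families):
    print(held_out, train_idx, test_idx)
```

Unlike accuracy, MCC stays near zero for a classifier that always predicts the majority class, which matters for families with very few human-associated genomes.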

Key Results & Visualizations

Model Performance Across k-mer Sizes

We evaluated the predictive power of k-mer frequency patterns as features for human host prediction. The Neural Network model consistently outperformed both Logistic Regression and Random Forest baselines across different k-mer sizes, with 5-mer features yielding the highest Matthews Correlation Coefficient (MCC = 0.820).

Matthews Correlation Coefficient (MCC) scores across k-mer sizes and model architectures
k-mer	Logistic Regression MCC	Random Forest MCC	Neural Network MCC
k3	0.723	0.759	0.810
k4	0.775	0.757	0.804
k5	0.767	0.741	0.820
k6	0.654	0.750	0.808
k7	0.669	0.746	0.779
k8	0.594	0.749	0.760

Family-Specific Performance Analysis

We conducted a detailed analysis of model performance across individual virus families, revealing substantial variation in how well sequence features predict human compatibility. This variability suggests differing levels of host-specific genomic signatures among virus groups, which may reflect their evolutionary histories and host adaptation strategies.

Matthews Correlation Coefficient comparison across virus families, sorted by dataset size
Family	k3	k4	k5	k6	Sequences (Human/Non-Human)
Flaviviridae	0.442	0.294	-0.027	0.049	9665 (2655/7010)
Picornaviridae	0.390	0.257	0.285	0.292	9256 (7253/2003)
Sedoreoviridae	0.248	0.068	0.276	0.289	8775 (3151/5624)
Hepadnaviridae	-0.320	0.341	0.738	0.828	8281 (7780/501)
Papillomaviridae	0.283	0.539	0.634	0.633	3814 (3073/741)
Rhabdoviridae	-0.020	-0.019	-0.012	-0.005	3029 (39/2990)
Spinareoviridae	-0.049	-0.002	-0.019	0.013	2582 (73/2509)
Adenoviridae	-0.144	-0.205	-0.050	0.000	2305 (1747/558)
Parvoviridae	0.212	0.308	0.099	0.311	2045 (479/1566)
Polyomaviridae	0.300	-0.072	-0.077	0.144	2012 (1447/565)
Togaviridae	-0.454	0.000	0.000	0.000	1971 (811/1160)
Poxviridae	0.436	-0.304	-0.155	-0.110	1445 (835/610)
Orthoherpesviridae	-0.257	-0.224	-0.032	0.155	1192 (629/563)
Orthomyxoviridae	0.059	0.101	0.000	-0.046	929 (77/852)
Astroviridae	0.181	0.013	0.126	0.047	745 (132/613)

Embedding Analysis and Visualization

t-distributed stochastic neighbor embedding (t-SNE) is a widely used nonlinear dimensionality reduction technique that preserves local relationships between data points when projecting high-dimensional data into a lower-dimensional space. To visualize and analyze the relationships between virus sequences, we employed t-SNE to generate low-dimensional projections of both the raw k-mer frequency vectors and the learned embeddings from the neural network's final layer. This was done on both training and test datasets using scikit-learn.

The resulting visualizations were colored according to three different classification schemes: human/non-human host tropism, model prediction probabilities, and prediction uncertainty. The uncertainty u for a prediction probability p was calculated as:

\[ u = 1 - 2|p - 0.5| \]

where maximum uncertainty (1.0) occurs at p=0.5 and minimum uncertainty (0) at p=0 or p=1.
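
The uncertainty formula translates directly into code. The function name prediction_uncertainty below is illustrative.

```python
def prediction_uncertainty(p: float) -> float:
    """u = 1 - 2|p - 0.5| for a predicted human-host probability p in [0, 1]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    return 1.0 - 2.0 * abs(p - 0.5)


for p in (0.0, 0.445, 0.5, 0.98, 1.0):
    print(p, round(prediction_uncertainty(p), 3))
```

A confident prediction such as p = 0.980 gives u = 0.04, whereas a borderline prediction such as p = 0.445 gives u = 0.89.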

t-SNE Visualization 1: Human Host Probability

Interactive visualization showing the distribution of predicted human-host probabilities across the sequence embedding space.

t-SNE Visualization 2: Prediction Uncertainty

Distribution of prediction uncertainty across the sequence embedding space, highlighting areas where the model is most and least confident.

Pathogenicity and Adaptation Correlation

We observed a striking inverse relationship between predicted human adaptation and virulence. Viruses with high predicted human-host compatibility typically exhibit lower fatality rates in humans, while those with minimal adaptation signals (e.g., rabies virus, Marburg virus) tend to cause more severe or fatal disease when they do infect humans. This finding suggests potential evolutionary trade-offs between adaptation and virulence.

Viruses and their reported human fatality rates, sorted by predicted human-host probability from 5-mer frequencies. Asterisks (*) mark taxa outside the 15 families in the dataset.
Accession	Virus	Family	Fatality Rate	Probability
NC_038889	Human papillomavirus 30	Papillomaviridae	~0.0001%	0.980
NC_006273	Human cytomegalovirus	Orthoherpesviridae	~0.0001%	0.920
NC_001538	BK polyomavirus	Polyomaviridae	~0.0001%	0.910
NC_003977	Hepatitis B virus	Hepadnaviridae	~0.0001%	0.871
MG953831	Human bocavirus 2	Parvoviridae	~0.0001%	0.730
NC_063383	Monkeypox virus	Poxviridae	3–6%	0.680
KX010994	Yellow fever virus	Flaviviridae	20–60%	0.550
NC_001802	HIV	Retroviridae*	~0.0001%	0.543
NC_045512	SARS-CoV-2	Coronaviridae*	~1%	0.445
NC_001477	Dengue virus type 1	Flaviviridae	~1%	0.350
NC_001475	Dengue virus type 3	Flaviviridae	~1%	0.330
NC_002640	Dengue virus type 4	Flaviviridae	~1%	0.320
NC_044855	Norovirus	Caliciviridae*	0.001–0.1%	0.260
NC_075022	Venezuelan equine encephalitis virus	Togaviridae	1%	0.214
NC_001474	Dengue virus type 2	Flaviviridae	~1%	0.170
NC_001563	West Nile virus	Flaviviridae	3–15%	0.168
NC_005062	Omsk hemorrhagic fever	Flaviviridae	0.5–3%	0.168
NC_019843	MERS	Coronaviridae*	35–36%	0.140
NC_002728	Nipah virus	Paramyxoviridae*	40–75%	0.124
NC_001608	Marburg virus	Filoviridae*	50%	0.093
NC_006432	Sudan ebola virus	Filoviridae*	40–75%	0.079
NC_004812	Macacine alphaherpesvirus 1	Orthoherpesviridae	100%	0.075
NC_003899	Eastern equine encephalitis virus	Togaviridae	30%	0.064
NC_001542	Rabies virus	Rhabdoviridae	100%	0.063
NC_075802	Salmonid herpesvirus 2	Alloherpesviridae*	NA	0.042
NC_079185	Vibrio phage	Chaseviridae*	NA	0.037
NC_079140	Gordonia phage Azira	Caudoviricetes*	NA	0.036
OR795895	Cucumber mosaic virus	Bromoviridae*	NA	0.035
NC_078671	Grapevine line pattern virus	Bromoviridae*	NA	0.034
NC_077680	Tomato mottle leaf curl virus	Geminiviridae*	NA	0.028
NC_004181	Colorado tick fever virus	Spinareoviridae	0.2%	0.022

SARS-CoV-2 Analysis

The model correctly identified human SARS-CoV-2 sequences as possessing distinct human-host signatures relative to non-human betacoronaviruses. A gradual increase in predicted human-host probability over time (2019-2024) was observed, averaging an increase of 6.89×10⁻³ per month.
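
A per-month rate like this can be estimated by regressing predicted probability on collection date. The sketch below uses ordinary least squares on months elapsed since the earliest isolate; the fitting method actually used is not stated here, so OLS is an assumption, and the isolates listed are synthetic placeholders rather than results.

```python
from datetime import date


def months_between(start: date, end: date) -> float:
    """Elapsed time in months, using the mean Gregorian month length."""
    return (end - start).days / 30.4375


def ols_slope(xs, ys):
    """Ordinary least-squares slope of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic placeholder isolates: (collection date, predicted human-host probability).
isolates = [
    (date(2020, 1, 1), 0.40),
    (date(2021, 1, 1), 0.48),
    (date(2022, 1, 1), 0.57),
    (date(2023, 1, 1), 0.66),
]
start = min(d for d, _ in isolates)
months = [months_between(start, d) for d, _ in isolates]
probs = [p for _, p in isolates]
print(f"slope: {ols_slope(months, probs):.4f} per month")
```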

COVID-19 Adaptation Trend

Figure 4: Temporal analysis of SARS-CoV-2 genome sequences showing increasing human host adaptation probability from 2019 to 2024. Each point represents a viral genome isolate, with the trend line showing a consistent increase in predicted human adaptation.

Citation

@article{carbajo2025virus, title={}, author={}, journal={}, year={2025}, }

Data Availability

Data splits and vectorized dataset: GitHub
Trained model weights: Hugging Face
Parquet format with metadata: Hugging Face Dataset
Demo space: Host Classifier

Acknowledgements

We extend our gratitude to the data contributors and maintainers of public genomic databases. We also acknowledge the institutional support and feedback from colleagues during the development of this dataset.