Multimodal Variant Effect Prediction with Foundation Models

Jan 1, 2025


Accurate prediction of pathogenic genetic variants remains a central problem in human genetics. Existing tools often treat genomic and protein information separately or rely on hand-crafted features. We explore whether multi-omic embeddings from large DNA and protein language models, fused into a single variant representation, can improve variant effect prediction.

We build a pipeline that extracts DNA sequence context and corresponding protein sequences for each ClinVar single-nucleotide variant (SNV), then encodes them using Evo2-7B and ProGen2. We represent each variant using delta embeddings (variant minus reference) at both DNA and protein levels and train classifiers on DNA-only, protein-only, and multimodal feature sets.
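
As a rough illustration of the delta-embedding step, the sketch below builds the three feature sets from reference and variant embeddings that have already been computed. It makes several assumptions that the text above does not state: each sequence has been reduced to one fixed-length vector, the multimodal set is formed by simple concatenation, and the names delta_embedding and build_feature_sets are hypothetical. The encoder calls are omitted and toy vectors stand in for real model outputs.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def delta_embedding(variant: Sequence[float], reference: Sequence[float]) -> Vector:
    """Return the element-wise difference: variant minus reference."""
    if len(variant) != len(reference):
        raise ValueError("variant and reference embeddings must have the same length")
    return [v - r for v, r in zip(variant, reference)]


def build_feature_sets(
    dna_ref: Sequence[float],
    dna_var: Sequence[float],
    prot_ref: Sequence[float],
    prot_var: Sequence[float],
) -> Dict[str, Vector]:
    """Build DNA-only, protein-only, and multimodal features for one variant."""
    dna_delta = delta_embedding(dna_var, dna_ref)
    prot_delta = delta_embedding(prot_var, prot_ref)
    return {
        "dna_only": dna_delta,
        "protein_only": prot_delta,
        # Assumption: fusion by concatenation; other fusion schemes are possible.
        "multimodal": dna_delta + prot_delta,
    }


# Toy 3- and 2-dimensional vectors; real model embeddings are far larger.
features = build_feature_sets(
    dna_ref=[0.5, 1.0, -2.0],
    dna_var=[1.5, 1.0, -1.0],
    prot_ref=[0.25, 0.0],
    prot_var=[0.75, -1.0],
)
```

Each of the three vectors can then be passed to a separate classifier, so that the DNA-only, protein-only, and multimodal settings are compared on the same variants.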