Tracing the evolutionary histories of ultra-rare variants using variational dating of large ancestral recombination graphs.

Pope NS., Tallman S., Jeffery B., Robertson D., Wong Y., Karthikeyan S., Ralph PL., Kelleher J.

Ultra-rare variants dominate whole-genome sequencing datasets, yet their interpretation is limited by allele frequency, which provides little information at very low counts and is highly sensitive to uneven ancestry representation. Allele age offers an ancestry-agnostic alternative, but existing methods do not scale to biobank-sized cohorts. Here we present a scalable variational algorithm for dating Ancestral Recombination Graphs (ARGs), implemented in tsdate, together with new distributed methods enabling practical biobank-scale ARG inference using tsinfer. Applied to 47,535 genomes from the Genomics England 100,000 Genomes Project, we infer contiguous ARGs spanning 206 Mb and estimate ages for 23.2 million variants, including 11.8 million singletons. ARG-based allele ages remain accurate under extreme sampling imbalance and, in real data, reveal signatures of purifying selection and clinically relevant heterogeneity among variants with identical observed frequencies. Estimates for recent mutations are precise only at large sample sizes, highlighting the information accessible in the haplotype structure of large datasets. Biobank-scale ARGs therefore enable robust, ancestry-agnostic age estimation for ultra-rare variation with broad utility for statistical and clinical genomics.

DOI

10.64898/2026.01.07.698223

Type

Journal article

Publication Date

2026-01-12
