Improving viral genomic data analyses to gain new insights into HIV transmission and evolution

Project Overview

Huge progress has been made in the prevention of HIV/AIDS. Nonetheless, rates of new infection remain stubbornly high, and so there remains an urgent need for new science, tools and technologies. Next-generation sequencing (NGS) has revolutionised disease surveillance, resulting in large databases of viral genetic diversity collected from the most affected populations.

There are challenges unique to HIV in interpreting NGS data. Each infection consists of a swarm of related viruses: NGS characterises this swarm through millions of small fragments of thousands of closely related viral genomes. These must be assembled into sensible units of analysis. 

Phylogenetic methods are used to infer evolutionary history and epidemic dynamics. Though very powerful, existing methods have several major limitations, including that they ignore recombination. 

The primary aim of this DPhil will be to develop novel bioinformatic, phylogenetic and statistical methods to improve the interpretation of HIV sequence data. The student will have access to three large databases of HIV NGS data generated in projects led by the Fraser group (N>25,000 samples).

The new methods will build on and improve existing algorithms for interpreting within-host diversity; see, for example, references [1,2]. The student will aim to infer whole-genome haplotypes and associated recombination events, and so improve our understanding of HIV epidemic dynamics. There will be extensive opportunities to link the findings of the project to clinical and epidemiological data, and to interact with wide multidisciplinary teams. Findings (data and software) will be shared with the consortia and beyond, so that the public health utility of the work can be maximised.

  1. Wymant C et al. PHYLOSCANNER: Inferring Transmission from Within- and Between-Host Pathogen Genetic Diversity. Mol Biol Evol. 2017.
  2. Palmer DS et al. Mapping the drivers of within-host pathogen evolution using massive data sets. Nat Commun. 2019.

Training Opportunities

This project will provide broad opportunities to develop expertise in data science and big data. There will be specific streams in bioinformatics, statistical genetics, phylogenetics, machine learning and inference, and communication to a clinical and public health audience.

The project will be co-supervised by Prof Christophe Fraser, Prof Gil McVean and Dr Tanya Golubchik, providing broad expertise in the core topics of the DPhil. There will be opportunities for close collaboration with Dr Katrina Lythgoe and Prof Oliver Pybus, who will provide alternative expertise and advice.

The student will be based in the Big Data Institute, and will have access to the new training in data science that will be offered as part of the Institute's Doctoral Training Centre. The student will also be eligible to attend relevant courses in other departments, e.g. statistics, zoology. The student will be expected to interact regularly with postdoctoral scientists in the Fraser and McVean groups, and with the members of wider consortia that generated the data.

The student will be exposed to a multidisciplinary environment including epidemiology, public health, clinical science, phylodynamics and mathematical modelling. As a result, the student will acquire key communication skills needed for working in a multidisciplinary data science project.


Genetics & Genomics and Immunology & Infectious Disease


Project reference number: 1011

Funding and admissions information


Name Department Institution Country Email
Christophe Fraser Big Data Institute Oxford University, Henry Wellcome Building of Genomic Medicine GBR
Gil McVean FMedSci, FRS Big Data Institute Oxford University, Henry Wellcome Building of Genomic Medicine GBR

Wymant C, Hall M, Ratmann O, Bonsall D, Golubchik T, de Cesare M, Gall A, Cornelissen M, Fraser C, STOP-HCV Consortium, The Maela Pneumococcal Collaboration, and The BEEHIVE Collaboration. 2017. PHYLOSCANNER: Inferring Transmission from Within- and Between-Host Pathogen Genetic Diversity. Mol. Biol. Evol.

A central feature of pathogen genomics is that different infectious particles (virions, bacterial cells, etc.) within an infected individual may be genetically distinct, with patterns of relatedness amongst infectious particles being the result of both within-host evolution and transmission from one host to the next. Here we present a new software tool, phyloscanner, which analyses pathogen diversity from multiple infected hosts. phyloscanner provides unprecedented resolution into the transmission process, allowing inference of the direction of transmission from sequence data alone. Multiply infected individuals are also identified, as they harbour subpopulations of infectious particles that are not connected by within-host evolution, except where recombinant types emerge. Low-level contamination is flagged and removed. We illustrate phyloscanner on both viral and bacterial pathogens, namely HIV-1 sequenced on Illumina and Roche 454 platforms, HCV sequenced with the Oxford Nanopore MinION platform, and Streptococcus pneumoniae with sequences from multiple colonies per individual. phyloscanner is available from

Palmer DS, Turner I, Fidler S, Frater J, Goedhals D, Goulder P, Huang KG, Oxenius A, Phillips R, Shapiro R, Vuuren CV, McLean AR, McVean G. 2019. Mapping the drivers of within-host pathogen evolution using massive data sets. Nat Commun, 10 (1), pp. 3017.

Differences among hosts, resulting from genetic variation in the immune system or heterogeneity in drug treatment, can impact within-host pathogen evolution. Genetic association studies can potentially identify such interactions. However, extensive and correlated genetic population structure in hosts and pathogens presents a substantial risk of confounding analyses. Moreover, the multiple testing burden of interaction scanning can potentially limit power. We present a Bayesian approach for detecting host influences on pathogen evolution that exploits vast existing data sets of pathogen diversity to improve power and control for stratification. The approach models key processes, including recombination and selection, and identifies regions of the pathogen genome affected by host factors. Our simulations and empirical analysis of drug-induced selection on the HIV-1 genome show that the method recovers known associations and has superior precision-recall characteristics compared to other approaches. We build a high-resolution map of HLA-induced selection in the HIV-1 genome, identifying novel epitope-allele combinations.