Dr, PI/Group Leader
My research program is divided into two parts:
Computational and Statistical Methods:
I work on computational methods for detecting and representing genetic variation, particularly from high-throughput sequencing data. Many genomic regions of considerable biomedical interest (e.g. the HLA genes in humans, many surface antigens in P. falciparum, and many antibiotic-resistance-conferring mobile elements in bacteria) are too diverse, and have evolutionary histories too complex, to be accessible to standard computational approaches. I develop algorithms and data structures for representing and detecting both simple and complex genetic variation without using a reference genome. These methods are based on so-called "de novo assembly", so named because they make no prior assumptions about what the genome looks like. Typical challenges include dealing with large data volumes (efficiency of data structures and software implementation), developing appropriate algorithms to analyse complex variants, and incorporating ideas from population genetics into assembly. I have spent a lot of time working in the 1000 Genomes Project, where I am now co-lead of the de novo assembly subgroup. A major current focus is improving methods to handle the diversity and number of genomes required for modern and future pathogen analysis.
Applications to pathogens:
I work with colleagues to apply these methods to pathogens, to enable better surveillance of pathogen evolution (within host, within hospital and across the world) and to better understand variation in important drug-resistance and immune-target genes. My major focuses are:
- The study of P. falciparum, the parasite responsible for malaria. Plasmodium is particularly challenging to study for several reasons: the genome has an extremely high repeat content; it has its own non-allelic recombination and mutation processes, which occur in the subtelomeric regions that contain extremely important antigen/virulence genes; and in some parts of the world mixed infections are very common. Assembly approaches are proving tremendously powerful, especially because, for many surface antigen genes, different strains can carry astonishingly diverged sequences. These methods are allowing us, for the first time, to understand variation at a number of critical genes.
- The study of bacteria. Bacterial species tend to have smaller and less repetitive genomes, but levels of diversity can be extremely high, and samples can be highly diverged from any reference. By building representations of all known sequence and variation, and by developing more reliable variant discovery, I seek to bridge the gap between research and clinical application of sequencing.
Key collaborators are Dominic Kwiatkowski (MalariaGEN consortium), Derrick Crook (the Modernising Medical Microbiology Consortium) and Henk den Bakker (Cornell and New York State Department of Health).
Same-Day Diagnostic and Surveillance Data for Tuberculosis via Whole-Genome Sequencing of Direct Respiratory Samples
Votintseva AA. et al, (2017), Journal of Clinical Microbiology, 55, 1285 - 1298
Using reference-free compressed data structures to analyze sequencing reads from thousands of human genomes
Dolle DD. et al, (2017), Genome Research, 27, 300 - 309
A reference data set of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree
Eberle MA. et al, (2017), Genome Research, 27, 157 - 164
Same-day diagnostic and surveillance data for tuberculosis via whole genome sequencing of direct respiratory samples
Votintseva AA. et al, (2016)
High-Accuracy HLA Type Inference from Whole-Genome Sequencing Data Using Population Reference Graphs
Dilthey AT. et al, (2016), PLOS Computational Biology, 12, e1005151 - e1005151