Tracking future infection threats using genomic data and electronic health records

Project Overview

The UK Chief Medical Officer has noted that "infectious disease is as great a threat to national security as climate change", with more than one new threatening disease identified each year, and that "the difficulty and challenge in identifying and tracking future threats is not the acquisition of trusted physiological and genomic data, but in using these huge databases with new computer science methods for analysis."

Existing lab-based methods for testing for infectious disease can take up to four weeks to perform, whereas whole-genome analysis of pathogen DNA, combined with machine learning across the Electronic Health Records (EHR), could be performed near same-day, greatly improving our ability to identify and fight outbreaks of infectious disease. Here, the engineering challenge is to develop probabilistic models to explore relationships between the massively multivariate genomic data of the pathogen (the "genotype") and the corresponding patient data from the EHR (the "phenotype"), including lab test results, blood tests, biochemistry, and patient physiology, for large cohorts of patients with fully labelled EHR data (including sequenced blood cultures of bacteria). The emphasis is on delivering the right antibiotic to the right patient, and on using "novelty detection" to identify new resistance strains.

This project involves the development of novel models, including machine learning methods along with techniques from computational biostatistics, to identify genomic markers of antibiotic resistance and susceptibility within various pathogens.

David Clifton is an Associate Professor in the Department of Engineering Science of the University of Oxford. He is a Research Fellow of the Royal Academy of Engineering. His research focuses on the development of "big data" machine learning for tracking the health of complex systems. Danny Wilson is a Sir Henry Dale Fellow working on microbial genomics at the University of Oxford Nuffield Department of Medicine. He is an associate member of the Department of Statistics and a principal investigator at the Oxford Martin School Institute of Emerging Infections.

This project is part of the Modernising Medical Microbiology Project, an international consortium led by Professor Derrick Crook, professor of microbiology at the University of Oxford and director of the National Infection Service at Public Health England. The consortium is supported by a multi-million-pound funding portfolio, with funders including the Wellcome Trust, the Medical Research Council and the Bill and Melinda Gates Foundation. We have a strong record of publishing in the top internationally recognized journals, and our research has influenced the delivery of public health and microbiology in Britain and beyond. For more information visit

Training Opportunities

The Modernising Medical Microbiology consortium provides an excellent research environment in which to develop new skills and train among world-leading scientists in their field. Based at the John Radcliffe Hospital, the University of Oxford team consists of a community of research groups led by Profs Derrick Crook, Tim Peto, Sarah Walker and Drs David Clifton, Kate Dingle, Phil Fowler, Zam Iqbal, Danny Wilson and David Wyllie. We have specialist expertise in microbiology, genomics, statistics, epidemiology and bioinformatics, and our work focuses on understanding the causes of infectious disease in populations. Training is provided by weekly supervisory meetings, weekly Modernising Medical Microbiology work-in-progress meetings, journal clubs, seminar series, and external opportunities including attending national and international conferences. The department and university run training courses, while the Department for Continuing Education and the Language Centre offer further opportunities for personal development to research students at Oxford.


Genetics & Genomics and Immunology & Infectious Disease


Project reference number: 845

Funding and admissions information


Name | Department | Institution | Country
Professor Daniel J Wilson | Experimental Medicine Division | Oxford University, John Radcliffe Hospital | GBR
Professor David Clifton | Department of Engineering Science | University of Oxford | GBR

De Silva D, Peters J, Cole K, Cole MJ, Cresswell F, Dean G, Dave J, Thomas DR, Foster K, Waldram A, Wilson DJ, Didelot X, Grad YH, Crook DW, Peto TE, Walker AS, Paul J, Eyre DW. 2016. Whole-genome sequencing to determine transmission of Neisseria gonorrhoeae: an observational study. Lancet Infect Dis, 16 (11), pp. 1295-1303.

BACKGROUND: New approaches are urgently required to address increasing rates of gonorrhoea and the emergence and global spread of antibiotic-resistant Neisseria gonorrhoeae. We used whole-genome sequencing to study transmission and track resistance in N gonorrhoeae isolates. METHODS: We did whole-genome sequencing of isolates obtained from samples collected from patients attending sexual health services in Brighton, UK, between Jan 1, 2011, and March 9, 2015. We also included isolates from other UK locations, historical isolates from Brighton, and previous data from a US study. Samples from symptomatic patients and asymptomatic sexual health screening underwent nucleic acid amplification testing; positive samples and all samples from symptomatic patients were cultured for N gonorrhoeae, and resulting isolates were whole-genome sequenced. Cefixime susceptibility testing was done in selected isolates by agar incorporation, and we used sequence data to determine multi-antigen sequence types and penA genotypes. We derived a transmission nomogram to determine the plausibility of direct or indirect transmission between any two cases depending on the time between samples: estimated mutation rates, plus diversity noted within patients across anatomical sites and probable transmission pairs, were used to fit a coalescent model to determine the number of single nucleotide polymorphisms expected. FINDINGS: 1407 (98%) of 1437 Brighton isolates between Jan 1, 2011, and March 9, 2015 were successfully sequenced. We identified 1061 infections from 907 patients. 281 (26%) of these infections were indistinguishable (ie, differed by zero single nucleotide polymorphisms) from one or more previous cases, and 786 (74%) had evidence of a sampled direct or indirect Brighton source. We observed multiple related samples across geographical locations. Of 1273 infections in Brighton (including historical data), 225 (18%) were linked to another case elsewhere in the UK, and 115 (9%) to a case in the USA. Four lineages initially identified in Brighton could be linked to 70 USA sequences, including 61 from a lineage carrying the mosaic penA XXXIV allele, which is associated with reduced cefixime susceptibility. INTERPRETATION: We present a whole-genome-sequencing-based tool for genomic contact tracing of N gonorrhoeae and demonstrate local, national, and international transmission. Whole-genome sequencing can be applied across geographical boundaries to investigate gonorrhoea transmission and to track antimicrobial resistance. FUNDING: Oxford National Institute for Health Research Health Protection Research Unit and Biomedical Research Centre.

Das S, Lindemann C, Young BC, Muller J, Österreich B, Ternette N, Winkler AC, Paprotka K, Reinhardt R, Förstner KU, Allen E, Flaxman A, Yamaguchi Y, Rollier CS, van Diemen P, Blättner S, Remmele CW, Selle M, Dittrich M, Müller T, Vogel J, Ohlsen K, Crook DW, Massey R, Wilson DJ, Rudel T, Wyllie DH, Fraunholz MJ. 2016. Natural mutations in a Staphylococcus aureus virulence regulator attenuate cytotoxicity but permit bacteremia and abscess formation. Proc. Natl. Acad. Sci. U.S.A., 113 (22), pp. E3101-10.

Staphylococcus aureus is a major bacterial pathogen, which causes severe blood and tissue infections that frequently emerge by autoinfection with asymptomatically carried nose and skin populations. However, recent studies report that bloodstream isolates differ systematically from those found in the nose and skin, exhibiting reduced toxicity toward leukocytes. In two patients, an attenuated toxicity bloodstream infection evolved from an asymptomatically carried high-toxicity nasal strain by loss-of-function mutations in the gene encoding the transcription factor repressor of surface proteins (rsp). Here, we report that rsp knockout mutants lead to global transcriptional and proteomic reprofiling, and they exhibit the greatest signal in a genome-wide screen for genes influencing S. aureus survival in human cells. This effect is likely to be mediated in part via SSR42, a long-noncoding RNA. We show that rsp controls SSR42 expression, is induced by hydrogen peroxide, and is required for normal cytotoxicity and hemolytic activity. Rsp inactivation in laboratory- and bacteremia-derived mutants attenuates toxin production, but up-regulates other immune subversion proteins and reduces lethality during experimental infection. Crucially, inactivation of rsp preserves bacterial dissemination, because it affects neither formation of deep abscesses in mice nor survival in human blood. Thus, we have identified a spontaneously evolving, attenuated-cytotoxicity, nonhemolytic S. aureus phenotype, controlled by a pleiotropic transcriptional regulator/noncoding RNA virulence regulatory system, capable of causing S. aureus bloodstream infections. Such a phenotype could promote deep infection with limited early clinical manifestations, raising concerns that bacterial evolution within the human body may contribute to severe infection.

Earle SG, Wu CH, Charlesworth J, Stoesser N, Gordon NC, Walker TM, Spencer CC, Iqbal Z, Clifton DA, Hopkins KL, Woodford N, Smith EG, Ismail N, Llewelyn MJ, Peto TE, Crook DW, McVean G, Walker AS, Wilson DJ. 2016. Identifying lineage effects when controlling for population structure improves power in bacterial association studies. Nat Microbiol, 1 pp. 16041.

Bacteria pose unique challenges for genome-wide association studies because of strong structuring into distinct strains and substantial linkage disequilibrium across the genome(1,2). Although methods developed for human studies can correct for strain structure(3,4), this risks considerable loss-of-power because genetic differences between strains often contribute substantial phenotypic variability(5). Here, we propose a new method that captures lineage-level associations even when locus-specific associations cannot be fine-mapped. We demonstrate its ability to detect genes and genetic variants underlying resistance to 17 antimicrobials in 3,144 isolates from four taxonomically diverse clonal and recombining bacteria: Mycobacterium tuberculosis, Staphylococcus aureus, Escherichia coli and Klebsiella pneumoniae. Strong selection, recombination and penetrance confer high power to recover known antimicrobial resistance mechanisms and reveal a candidate association between the outer membrane porin nmpC and cefazolin resistance in E. coli. Hence, our method pinpoints locus-specific effects where possible and boosts power by detecting lineage-level differences when fine-mapping is intractable.

Walker TM, Kohl TA, Omar SV, Hedge J, Del Ojo Elias C, Bradley P, Iqbal Z, Feuerriegel S, Niehaus KE, Wilson DJ, Clifton DA, Kapatai G, Ip CL, Bowden R, Drobniewski FA, Allix-Béguec C, Gaudin C, Parkhill J, Diel R, Supply P, Crook DW, Smith EG, Walker AS, Ismail N, Niemann S, Peto TE, Modernizing Medical Microbiology (MMM) Informatics Group. 2015. Whole-genome sequencing for prediction of Mycobacterium tuberculosis drug susceptibility and resistance: a retrospective cohort study. Lancet Infect Dis, 15 (10), pp. 1193-202.

BACKGROUND: Diagnosing drug-resistance remains an obstacle to the elimination of tuberculosis. Phenotypic drug-susceptibility testing is slow and expensive, and commercial genotypic assays screen only common resistance-determining mutations. We used whole-genome sequencing to characterise common and rare mutations predicting drug resistance, or consistency with susceptibility, for all first-line and second-line drugs for tuberculosis. METHODS: Between Sept 1, 2010, and Dec 1, 2013, we sequenced a training set of 2099 Mycobacterium tuberculosis genomes. For 23 candidate genes identified from the drug-resistance scientific literature, we algorithmically characterised genetic mutations as not conferring resistance (benign), resistance determinants, or uncharacterised. We then assessed the ability of these characterisations to predict phenotypic drug-susceptibility testing for an independent validation set of 1552 genomes. We sought mutations under similar selection pressure to those characterised as resistance determinants outside candidate genes to account for residual phenotypic resistance. FINDINGS: We characterised 120 training-set mutations as resistance determining, and 772 as benign. With these mutations, we could predict 89·2% of the validation-set phenotypes with a mean 92·3% sensitivity (95% CI 90·7-93·7) and 98·4% specificity (98·1-98·7). 10·8% of validation-set phenotypes could not be predicted because uncharacterised mutations were present. With an in-silico comparison, characterised resistance determinants had higher sensitivity than the mutations from three line-probe assays (85·1% vs 81·6%). No additional resistance determinants were identified among mutations under selection pressure in non-candidate genes. INTERPRETATION: A broad catalogue of genetic mutations enable data from whole-genome sequencing to be used clinically to predict drug resistance, drug susceptibility, or to identify drug phenotypes that cannot yet be genetically predicted. This approach could be integrated into routine diagnostic workflows, phasing out phenotypic drug-susceptibility testing while reporting drug resistance early. FUNDING: Wellcome Trust, National Institute of Health Research, Medical Research Council, and the European Union.
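
The catalogue-based prediction described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the mutation names, the catalogue entries, and the decision rule (any resistance determinant predicts resistance; otherwise any uncharacterised mutation blocks a prediction; otherwise susceptible) are assumptions made purely for illustration.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical catalogue mapping a mutation to its characterisation.
# These entries are placeholders, not real catalogue content.
CATALOGUE = {
    "geneA_S100L": "resistant",
    "geneA_R50R": "benign",
    "geneB_A20T": "benign",
}


def predict(mutations: List[str]) -> Optional[str]:
    """Return "R", "S", or None when no prediction can be made."""
    labels = [CATALOGUE.get(m, "uncharacterised") for m in mutations]
    if "resistant" in labels:
        return "R"  # assumed rule: one resistance determinant is enough
    if "uncharacterised" in labels:
        return None  # assumed rule: unknown mutations block a prediction
    return "S"  # only benign mutations (or none) were seen


def sensitivity_specificity(
    predicted: Sequence[Optional[str]], observed: Sequence[str]
) -> Tuple[float, float]:
    """Sensitivity and specificity over isolates that received a prediction."""
    pairs = [(p, o) for p, o in zip(predicted, observed) if p is not None]
    tp = sum(p == "R" and o == "R" for p, o in pairs)
    fn = sum(p == "S" and o == "R" for p, o in pairs)
    tn = sum(p == "S" and o == "S" for p, o in pairs)
    fp = sum(p == "R" and o == "S" for p, o in pairs)
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec
```

Isolates for which `predict` returns `None` are left out of the sensitivity and specificity calculation, mirroring the abstract's separate reporting of phenotypes that could not be predicted.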

Bradley P, Gordon NC, Walker TM, Dunn L, Heys S, Huang B, Earle S, Pankhurst LJ, Anson L, de Cesare M, Piazza P, Votintseva AA, Golubchik T, Wilson DJ, Wyllie DH, Diel R, Niemann S, Feuerriegel S, Kohl TA, Ismail N, Omar SV, Smith EG, Buck D, McVean G, Walker AS, Peto TE, Crook DW, Iqbal Z. 2015. Rapid antibiotic-resistance predictions from genome sequence data for Staphylococcus aureus and Mycobacterium tuberculosis. Nat Commun, 6 pp. 10063.

The rise of antibiotic-resistant bacteria has led to an urgent need for rapid detection of drug resistance in clinical samples, and improvements in global surveillance. Here we show how de Bruijn graph representation of bacterial diversity can be used to identify species and resistance profiles of clinical isolates. We implement this method for Staphylococcus aureus and Mycobacterium tuberculosis in a software package ('Mykrobe predictor') that takes raw sequence data as input, and generates a clinician-friendly report within 3 minutes on a laptop. For S. aureus, the error rates of our method are comparable to gold-standard phenotypic methods, with sensitivity/specificity of 99.1%/99.6% across 12 antibiotics (using an independent validation set, n=470). For M. tuberculosis, our method predicts resistance with sensitivity/specificity of 82.6%/98.5% (independent validation set, n=1,609); sensitivity is lower here, probably because of limited understanding of the underlying genetic mechanisms. We give evidence that minor alleles improve detection of extremely drug-resistant strains, and demonstrate feasibility of the use of emerging single-molecule nanopore sequencing techniques for these purposes.
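
The idea of calling resistance alleles directly from raw reads can be sketched with plain k-mer matching, the basic ingredient of de Bruijn graph methods. The Python below is a deliberate simplification and not the Mykrobe predictor algorithm: it uses k-mer sets rather than a graph, ignores sequencing errors and read depth, and the allele names, sequences, and the value of `k` are illustrative assumptions.

```python
from typing import Dict, Iterable, Set


def kmers(seq: str, k: int) -> Set[str]:
    """All distinct substrings of length k in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sample_kmers(reads: Iterable[str], k: int) -> Set[str]:
    """Union of the k-mers found in every read of a sample."""
    found: Set[str] = set()
    for read in reads:
        found |= kmers(read, k)
    return found


def allele_coverage(allele: str, observed: Set[str], k: int) -> float:
    """Fraction of an allele's k-mers that are present in the sample."""
    probe = kmers(allele, k)
    if not probe:
        return 0.0
    return len(probe & observed) / len(probe)


def call_allele(alleles: Dict[str, str], reads: Iterable[str], k: int = 5) -> str:
    """Name of the candidate allele best covered by the sample's k-mers."""
    observed = sample_kmers(reads, k)
    # Ties are broken alphabetically because candidates are sorted first.
    return max(
        sorted(alleles),
        key=lambda name: allele_coverage(alleles[name], observed, k),
    )
```

For example, with hypothetical alleles `{"susceptible": "ACGTACGTAGG", "resistant": "ACGTATGTAGG"}` and reads `["ACGTATG", "TATGTAGG"]`, every 5-mer of the resistant allele is observed, so `call_allele` returns `"resistant"`.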

De Maio N, Wu CH, O'Reilly KM, Wilson D. 2015. New Routes to Phylogeography: A Bayesian Structured Coalescent Approximation. PLoS Genet., 11 (8), pp. e1005421.

Phylogeographic methods aim to infer migration trends and the history of sampled lineages from genetic data. Applications of phylogeography are broad, and in the context of pathogens include the reconstruction of transmission histories and the origin and emergence of outbreaks. Phylogeographic inference based on bottom-up population genetics models is computationally expensive, and as a result faster alternatives based on the evolution of discrete traits have become popular. In this paper, we show that inference of migration rates and root locations based on discrete trait models is extremely unreliable and sensitive to biased sampling. To address this problem, we introduce BASTA (BAyesian STructured coalescent Approximation), a new approach implemented in BEAST2 that combines the accuracy of methods based on the structured coalescent with the computational efficiency required to handle more than just few populations. We illustrate the potentially severe implications of poor model choice for phylogeographic analyses by investigating the zoonotic transmission of Ebola virus. Whereas the structured coalescent analysis correctly infers that successive human Ebola outbreaks have been seeded by a large unsampled non-human reservoir population, the discrete trait analysis implausibly concludes that undetected human-to-human transmission has allowed the virus to persist over the past four decades. As genomics takes on an increasingly prominent role informing the control and prevention of infectious diseases, it will be vital that phylogeographic inference provides robust insights into transmission history.

Everitt RG, Didelot X, Batty EM, Miller RR, Knox K, Young BC, Bowden R, Auton A, Votintseva A, Larner-Svensson H, Charlesworth J, Golubchik T, Ip CL, Godwin H, Fung R, Peto TE, Walker AS, Crook DW, Wilson DJ. 2014. Mobile elements drive recombination hotspots in the core genome of Staphylococcus aureus. Nat Commun, 5 pp. 3956.

Horizontal gene transfer is an important driver of bacterial evolution, but genetic exchange in the core genome of clonal species, including the major pathogen Staphylococcus aureus, is incompletely understood. Here we reveal widespread homologous recombination in S. aureus at the species level, in contrast to its near-complete absence between closely related strains. We discover a patchwork of hotspots and coldspots at fine scales falling against a backdrop of broad-scale trends in rate variation. Over megabases, homoplasy rates fluctuate 1.9-fold, peaking towards the origin-of-replication. Over kilobases, we find core recombination hotspots of up to 2.5-fold enrichment situated near fault lines in the genome associated with mobile elements. The strongest hotspots include regions flanking conjugative transposon ICE6013, the staphylococcal cassette chromosome (SCC) and genomic island νSaα. Mobile element-driven core genome transfer represents an opportunity for adaptation and challenges our understanding of the recombination landscape in predominantly clonal pathogens, with important implications for genotype-phenotype mapping.

Laabei M, Recker M, Rudkin JK, Aldeljawi M, Gulay Z, Sloan TJ, Williams P, Endres JL, Bayles KW, Fey PD, Yajjala VK, Widhelm T, Hawkins E, Lewis K, Parfett S, Scowen L, Peacock SJ, Holden M, Wilson D, Read TD, van den Elsen J, Priest NK, Feil EJ, Hurst LD, Josefsson E, Massey RC. 2014. Predicting the virulence of MRSA from its genome sequence. Genome Res., 24 (5), pp. 839-49.

Microbial virulence is a complex and often multifactorial phenotype, intricately linked to a pathogen's evolutionary trajectory. Toxicity, the ability to destroy host cell membranes, and adhesion, the ability to adhere to human tissues, are the major virulence factors of many bacterial pathogens, including Staphylococcus aureus. Here, we assayed the toxicity and adhesiveness of 90 MRSA (methicillin resistant S. aureus) isolates and found that while there was remarkably little variation in adhesion, toxicity varied by over an order of magnitude between isolates, suggesting different evolutionary selection pressures acting on these two traits. We performed a genome-wide association study (GWAS) and identified a large number of loci, as well as a putative network of epistatically interacting loci, that significantly associated with toxicity. Despite this apparent complexity in toxicity regulation, a predictive model based on a set of significant single nucleotide polymorphisms (SNPs) and insertion and deletions events (indels) showed a high degree of accuracy in predicting an isolate's toxicity solely from the genetic signature at these sites. Our results thus highlight the potential of using sequence data to determine clinically relevant parameters and have further implications for understanding the microbial virulence of this opportunistic pathogen.

Walker TM, Ip CL, Harrell RH, Evans JT, Kapatai G, Dedicoat MJ, Eyre DW, Wilson DJ, Hawkey PM, Crook DW, Parkhill J, Harris D, Walker AS, Bowden R, Monk P, Smith EG, Peto TE. 2013. Whole-genome sequencing to delineate Mycobacterium tuberculosis outbreaks: a retrospective observational study. Lancet Infect Dis, 13 (2), pp. 137-46.

BACKGROUND: Tuberculosis incidence in the UK has risen in the past decade. Disease control depends on epidemiological data, which can be difficult to obtain. Whole-genome sequencing can detect microevolution within Mycobacterium tuberculosis strains. We aimed to estimate the genetic diversity of related M tuberculosis strains in the UK Midlands and to investigate how this measurement might be used to investigate community outbreaks. METHODS: In a retrospective observational study, we used Illumina technology to sequence M tuberculosis genomes from an archive of frozen cultures. We characterised isolates into four groups: cross-sectional, longitudinal, household, and community. We measured pairwise nucleotide differences within hosts and between hosts in household outbreaks and estimated the rate of change in DNA sequences. We used the findings to interpret network diagrams constructed from 11 community clusters derived from mycobacterial interspersed repetitive-unit-variable-number tandem-repeat data. FINDINGS: We sequenced 390 separate isolates from 254 patients, including representatives from all five major lineages of M tuberculosis. The estimated rate of change in DNA sequences was 0.5 single nucleotide polymorphisms (SNPs) per genome per year (95% CI 0.3-0.7) in longitudinal isolates from 30 individuals and 25 families. Divergence is rarely higher than five SNPs in 3 years. 109 (96%) of 114 paired isolates from individuals and households differed by five or fewer SNPs. More than five SNPs separated isolates from none of 69 epidemiologically linked patients, two (15%) of 13 possibly linked patients, and 13 (17%) of 75 epidemiologically unlinked patients (three-way comparison exact p<0.0001). Genetic trees and clinical and epidemiological data suggest that super-spreaders were present in two community clusters. INTERPRETATION: Whole-genome sequencing can delineate outbreaks of tuberculosis and allows inference about direction of transmission between cases. The technique could identify super-spreaders and predict the existence of undiagnosed cases, potentially leading to early treatment of infectious patients and their contacts. FUNDING: Medical Research Council, Wellcome Trust, National Institute for Health Research, and the Health Protection Agency.
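
The pairwise comparison described in this abstract can be sketched in a few lines of Python, assuming the isolates have already been aligned to sequences of equal length. The isolate names, the toy sequences, and the choice to ignore ambiguous positions (`N` and `-`) are illustrative assumptions; only the five-SNP threshold is taken from the abstract.

```python
from itertools import combinations
from typing import Dict, List, Tuple

SNP_THRESHOLD = 5  # "five or fewer SNPs", as reported in the abstract


def snp_distance(seq_a: str, seq_b: str) -> int:
    """Count positions where two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    ignored = {"N", "-"}  # assumption: skip uncalled bases and gaps
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a not in ignored and b not in ignored
    )


def linked_pairs(
    isolates: Dict[str, str], threshold: int = SNP_THRESHOLD
) -> List[Tuple[str, str, int]]:
    """Return (name, name, distance) for pairs within the SNP threshold."""
    links = []
    for (name_a, seq_a), (name_b, seq_b) in combinations(sorted(isolates.items()), 2):
        distance = snp_distance(seq_a, seq_b)
        if distance <= threshold:
            links.append((name_a, name_b, distance))
    return links
```

A small SNP distance is only consistent with recent transmission, not proof of it; the study interprets such distances alongside clinical and epidemiological data.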

Eyre DW, Cule ML, Wilson DJ, Griffiths D, Vaughan A, O'Connor L, Ip CL, Golubchik T, Batty EM, Finney JM, Wyllie DH, Didelot X, Piazza P, Bowden R, Dingle KE, Harding RM, Crook DW, Wilcox MH, Peto TE, Walker AS. 2013. Diverse sources of C. difficile infection identified on whole-genome sequencing. N. Engl. J. Med., 369 (13), pp. 1195-205.

BACKGROUND: It has been thought that Clostridium difficile infection is transmitted predominantly within health care settings. However, endemic spread has hampered identification of precise sources of infection and the assessment of the efficacy of interventions. METHODS: From September 2007 through March 2011, we performed whole-genome sequencing on isolates obtained from all symptomatic patients with C. difficile infection identified in health care settings or in the community in Oxfordshire, United Kingdom. We compared single-nucleotide variants (SNVs) between the isolates, using C. difficile evolution rates estimated on the basis of the first and last samples obtained from each of 145 patients, with 0 to 2 SNVs expected between transmitted isolates obtained less than 124 days apart, on the basis of a 95% prediction interval. We then identified plausible epidemiologic links among genetically related cases from data on hospital admissions and community location. RESULTS: Of 1250 C. difficile cases that were evaluated, 1223 (98%) were successfully sequenced. In a comparison of 957 samples obtained from April 2008 through March 2011 with those obtained from September 2007 onward, a total of 333 isolates (35%) had no more than 2 SNVs from at least 1 earlier case, and 428 isolates (45%) had more than 10 SNVs from all previous cases. Reductions in incidence over time were similar in the two groups, a finding that suggests an effect of interventions targeting the transition from exposure to disease. Of the 333 patients with no more than 2 SNVs (consistent with transmission), 126 patients (38%) had close hospital contact with another patient, and 120 patients (36%) had no hospital or community contact with another patient. Distinct subtypes of infection continued to be identified throughout the study, which suggests a considerable reservoir of C. difficile. CONCLUSIONS: Over a 3-year period, 45% of C. difficile cases in Oxfordshire were genetically distinct from all previous cases. Genetically diverse sources, in addition to symptomatic patients, play a major part in C. difficile transmission. (Funded by the U.K. Clinical Research Collaboration Translational Infection Research Initiative and others.)

Young BC, Golubchik T, Batty EM, Fung R, Larner-Svensson H, Votintseva AA, Miller RR, Godwin H, Knox K, Everitt RG, Iqbal Z, Rimmer AJ, Cule M, Ip CL, Didelot X, Harding RM, Donnelly P, Peto TE, Crook DW, Bowden R, Wilson DJ. 2012. Evolutionary dynamics of Staphylococcus aureus during progression from carriage to disease. Proc. Natl. Acad. Sci. U.S.A., 109 (12), pp. 4550-5.

Whole-genome sequencing offers new insights into the evolution of bacterial pathogens and the etiology of bacterial disease. Staphylococcus aureus is a major cause of bacteria-associated mortality and invasive disease and is carried asymptomatically by 27% of adults. Eighty percent of bacteremias match the carried strain. However, the role of evolutionary change in the pathogen during the progression from carriage to disease is incompletely understood. Here we use high-throughput genome sequencing to discover the genetic changes that accompany the transition from nasal carriage to fatal bloodstream infection in an individual colonized with methicillin-sensitive S. aureus. We found a single, cohesive population exhibiting a repertoire of 30 single-nucleotide polymorphisms and four insertion/deletion variants. Mutations accumulated at a steady rate over a 13-mo period, except for a cluster of mutations preceding the transition to disease. Although bloodstream bacteria differed by just eight mutations from the original nasally carried bacteria, half of those mutations caused truncation of proteins, including a premature stop codon in an AraC-family transcriptional regulator that has been implicated in pathogenicity. Comparison with evolution in two asymptomatic carriers supported the conclusion that clusters of protein-truncating mutations are highly unusual. Our results demonstrate that bacterial diversity in vivo is limited but nonetheless detectable by whole-genome sequencing, enabling the study of evolutionary dynamics within the host. Regulatory or structural changes that occur during carriage may be functionally important for pathogenesis; therefore identifying those changes is a crucial step in understanding the biological causes of invasive bacterial disease.