Increasing deployment and sophistication of electronic healthcare records in hospital and community settings provides a rich and complex resource for early detection of emerging infectious disease threats and of broader pressures on healthcare systems. Examples of the data available include patient admissions, diagnoses, ward movements, and contact networks; real-time patient observations and laboratory results defining illness severity; and data on healthcare interventions such as antibiotic prescribing. Detailed microbiological data is also available for many infections, including the causative organism, its susceptibility to antibiotics, and, as increasingly pioneered in Oxford, pathogen whole-genome sequence data, allowing investigation of genomic determinants of disease severity.
The project benefits significantly from foundational work integrating diverse healthcare databases from across Oxfordshire. The aim of this project is to exploit this data using a range of statistical and machine learning approaches to detect new infectious disease threats in Oxfordshire. Potential threats include increases in the incidence or severity of particular diseases or clinical syndromes, as well as decreases in the effectiveness of treatments, e.g. through the emergence of resistance to antibiotics. Approaches to allow better hospital resource management and infection control will also be investigated, for example to detect seasonal infectious disease fluctuations earlier, to anticipate increased bed pressures, and to detect rises in hospital-acquired infections.
The machine learning techniques to be applied include Bayesian methods that allow principled inference in the presence of uncertainty, facilitating the use of incomplete or inconsistent healthcare data and providing predictions, with a known degree of confidence, based upon high-dimensional multivariate statistical models. They also include techniques that support the analysis of large, complex datasets by learning automatically and updating themselves as new information becomes available.
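As a minimal illustration of the kind of Bayesian updating described above, and not the project's actual model, the sketch below uses a conjugate Beta-Binomial update to estimate the proportion of isolates resistant to an antibiotic. The prior parameters, the counts, and the function names are all hypothetical and chosen only for illustration.

```python
def beta_binomial_update(prior_a, prior_b, resistant, tested):
    """Return posterior Beta(a, b) parameters after observing
    `resistant` resistant isolates out of `tested` (hypothetical helper)."""
    if not 0 <= resistant <= tested:
        raise ValueError("resistant must be between 0 and tested")
    return prior_a + resistant, prior_b + (tested - resistant)


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Hypothetical data: uniform prior Beta(1, 1), 3 resistant of 18 isolates tested.
post_a, post_b = beta_binomial_update(1.0, 1.0, 3, 18)
posterior_mean = beta_mean(post_a, post_b)
```

Because the posterior is a full distribution rather than a single number, the same update can be repeated as each new batch of results arrives, which is one simple sense in which such a model "updates itself" with new information.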
Dr David Clifton is an Associate Professor in the Department of Engineering Science of the University of Oxford, and a Governing Body fellow of Balliol College, Oxford. He is a Research Fellow of the Royal Academy of Engineering. His research focuses on the development of "big data" machine learning for tracking the health of complex systems. His previous research resulted in patented systems for jet-engine health monitoring, used with the engines of the Airbus A380, the Boeing 787 "Dreamliner", and the Eurofighter Typhoon. Since 2008, he has translated his work into the biomedical context for healthcare applications. He has worked on Visensia, the world's first FDA-approved multivariate patient monitoring system, and the SEND system, which is now used to monitor 20,000 patients each month in the NHS. His research has been commercialised via university spin-out companies OBS Medical, Oxehealth, and Medyc, in addition to collaboration with multinational industrial bodies.
Dr David Eyre is an infectious diseases and microbiology clinical lecturer, epidemiologist, and data scientist. His previous research includes use of healthcare databases and whole-genome sequencing to investigate transmission of Clostridium difficile, gonorrhoea, and other pathogens. Professor Tim Peto is a Professor of Medicine at the University of Oxford. He is the co-leader for the Infection Theme of the Oxford Biomedical Research Centre and a National Institute for Health Research Senior Investigator. Both Dr Eyre and Prof Peto are jointly based at the Nuffield Department of Medicine, University of Oxford and Oxford University Hospitals.
This project is part of the Modernising Medical Microbiology Project, an international consortium led by Professor Derrick Crook, professor of microbiology at the University of Oxford and director of the National Infection Service at Public Health England. The consortium is supported by a multi-million pound funding portfolio, including from the Wellcome Trust, the Medical Research Council and the Bill and Melinda Gates Foundation. We have a strong record of publishing in the top internationally recognized journals, and our research has impacted on the delivery of public health and microbiology in Britain and beyond. For more information visit www.modmedmicro.ac.uk.
The Modernising Medical Microbiology consortium provides an excellent research environment in which to develop new skills and train among world-leading scientists in their field. Based at the John Radcliffe Hospital, the University of Oxford team consists of a community of research groups led by Profs Derrick Crook, Tim Peto, Sarah Walker and Drs David Clifton, Kate Dingle, Phil Fowler, Zam Iqbal, Danny Wilson and David Wyllie. We have specialist expertise in microbiology, genomics, statistics, epidemiology and bioinformatics, and our work focuses on understanding the causes of infectious disease in populations. Training is provided by weekly supervisory meetings, weekly Modernising Medical Microbiology work-in-progress meetings, journal clubs, seminar series, and external opportunities including attending national and international conferences. The project will be strengthened by the establishment of the Oxford Big Data Institute where there will be world leading expertise in managing, processing, analysing and visualising very large complex datasets. The department and university run training courses, while the Department for Continuing Education and the Language Centre offer further opportunities for personal development to research students at Oxford.
Project reference number: 842
|Professor Tim Peto||Experimental Medicine Division||Oxford University, John Radcliffe Hospital||GBR||email@example.com|
|Professor David Clifton||Department of Engineering Science, University of Oxford||GBR|
|Dr David Eyre||Experimental Medicine Division||Oxford University||firstname.lastname@example.org|
OBJECTIVE: The aim of the study was to evaluate the ability of a data-fusion patient status index (PSI) to detect patient deterioration in the emergency department (ED) in comparison with track-and-trigger (T&T). MATERIALS AND METHODS: A single-centre observational cohort study was conducted in a medium-sized teaching hospital ED. Vital sign data and any documented T&T scores (paper T&T) were collected from adults attending the resuscitation room, majors or observation ward. For each set of vital signs, we retrospectively calculated T&T (eT&T). PSI was calculated retrospectively from the continuous vital sign data using a statistical model of normality. Clinical notes were examined to identify 'escalation' events, and the numbers of these escalations identified by paper T&T, eT&T and PSI were retrospectively calculated. RESULTS: Data from 472 patient episodes were examined. A total of 20 patients had PSI data at the time of an escalation related to vital sign abnormalities that occurred during their ED stay (vs. on arrival). Only four patient events were detected at the time by paper T&T. In all, 17 were detected retrospectively by eT&T and 15 by PSI. PSI had a calculated false-alert rate of 1.13 alerts/bed-day. CONCLUSION: Electronic data capture offers opportunities for increased detection of deteriorating patients in a busy clinical environment compared with paper charts. Sample size in this study is insufficient to determine which electronic method (eT&T or PSI) offers superior detection of the need for escalation.
Respiratory rate (RR) is a key vital sign that is monitored to assess the health of patients. With the increasing availability of wearable devices, it is important that RR is extracted in a robust and noninvasive manner from the photoplethysmogram (PPG) acquired from pulse oximeters and similar devices. However, existing methods of noninvasive RR estimation suffer from a lack of robustness, with the result that they are not used in clinical practice. We propose a Bayesian approach to fusing the outputs of many RR estimation algorithms to improve the overall robustness of the resulting estimates. Our method estimates the accuracy of each algorithm and jointly infers the fused RR estimate in an unsupervised manner, with the aim of producing a fused estimate that is more accurate than any of the algorithms taken individually. This approach is novel: the literature has so far concentrated on attempting to produce single algorithms for RR estimation, without resulting in systems that have penetrated into clinical practice. A publicly-available dataset, Capnobase, was used to validate the performance of our proposed model. Our proposed methodology was compared to the best-performing individual algorithm from the literature, as well as to the results of using common fusing methodologies such as averaging, median, and maximum likelihood (ML). Our proposed methodology resulted in a mean-absolute-error (MAE) of 1.98 breaths per minute (bpm), outperforming other fusing strategies (mean fusion: 2.95 bpm; median fusion: 2.33 bpm; ML: 2.30 bpm). It also outperformed the best single algorithm (2.39 bpm) and the benchmark algorithm proposed for use with Capnobase (2.22 bpm). We conclude that the proposed fusion methodology can be used to combine RR estimates from multiple sources derived from the PPG, to infer a reliable and robust estimation of the respiratory rate in an unsupervised manner.
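The mean and median fusion baselines named in the abstract above, and the MAE metric used to compare them, can be sketched as follows. This is only an illustration of those simple baselines, not of the Bayesian fusion method itself; the function names and the example estimates are hypothetical.

```python
from statistics import mean, median


def fuse_estimates(estimates_bpm, method="median"):
    """Combine RR estimates (breaths per minute) from several algorithms
    into one value using a simple baseline rule."""
    if not estimates_bpm:
        raise ValueError("at least one estimate is required")
    if method == "mean":
        return mean(estimates_bpm)
    if method == "median":
        return median(estimates_bpm)
    raise ValueError(f"unknown fusion method: {method}")


def mean_absolute_error(predicted, reference):
    """Mean absolute difference between fused and reference RR values."""
    if len(predicted) != len(reference):
        raise ValueError("inputs must have the same length")
    return mean(abs(p - r) for p, r in zip(predicted, reference))


# Hypothetical outputs of four RR algorithms for one window; one is an outlier.
window = [14.0, 15.0, 16.0, 31.0]
fused_mean = fuse_estimates(window, "mean")
fused_median = fuse_estimates(window, "median")
```

In this made-up window the single outlying estimate pulls the mean well above the other three values, while the median is barely affected, which is one reason median fusion is a common robust baseline.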
We consider an integrated patient monitoring system, combining electronic patient records with high-rate acquisition of patient physiological data. There remain many challenges in increasing the robustness of "e-health" applications to a level at which they are clinically useful, particularly in the use of automated algorithms used to detect and cope with artifact in data contained within the electronic patient record, and in analyzing and communicating the resultant data for reporting to clinicians. There is a consequential "plague of pilots," in which engineering prototype systems do not enter into clinical use. This paper describes an approach in which, for the first time, the Emergency Department (ED) of a major research hospital has adopted such systems for use during a large clinical trial. We describe the disadvantages of existing evaluation metrics when applied to such large trials, and propose a solution suitable for large-scale validation. We demonstrate that machine learning technologies embedded within healthcare information systems can provide clinical benefit, with the potential to improve patient outcomes in the busy environment of a major ED and other high-dependence areas of patient care.
The ability to determine patient acuity (or severity of illness) has immediate practical use for clinicians. We evaluate the use of multivariate timeseries modeling with the multi-task Gaussian process (GP) models using noisy, incomplete, sparse, heterogeneous and unevenly-sampled clinical data, including both physiological signals and clinical notes. The learned multi-task GP (MTGP) hyperparameters are then used to assess and forecast patient acuity. Experiments were conducted with two real clinical data sets acquired from ICU patients: firstly, estimating cerebrovascular pressure reactivity, an important indicator of secondary damage for traumatic brain injury patients, by learning the interactions between intracranial pressure and mean arterial blood pressure signals, and secondly, mortality prediction using clinical progress notes. In both cases, MTGPs provided improved results: an MTGP model provided better results than single-task GP models for signal interpolation and forecasting (0.91 vs 0.69 RMSE), and the use of MTGP hyperparameters obtained improved results when used as additional classification features (0.812 vs 0.788 AUC).
OBJECTIVES: To review how health informatics systems based on machine learning methods have impacted the clinical management of patients, by affecting clinical practice. METHODS: We reviewed literature from 2010-2015 from databases such as Pubmed, IEEE xplore, and INSPEC, in which methods based on machine learning are likely to be reported. We bring together a broad body of literature, aiming to identify those leading examples of health informatics that have advanced the methodology of machine learning. While individual methods may have further examples that might be added, we have chosen some of the most representative, informative exemplars in each case. RESULTS: Our survey highlights that, while much research is taking place in this high-profile field, examples of those that affect the clinical management of patients are seldom found. We show that substantial progress is being made in terms of methodology, often by data scientists working in close collaboration with clinical groups. CONCLUSIONS: Health informatics systems based on machine learning are in their infancy and the translation of such systems into clinical management has yet to be performed at scale.
BACKGROUND: It has been thought that Clostridium difficile infection is transmitted predominantly within health care settings. However, endemic spread has hampered identification of precise sources of infection and the assessment of the efficacy of interventions. METHODS: From September 2007 through March 2011, we performed whole-genome sequencing on isolates obtained from all symptomatic patients with C. difficile infection identified in health care settings or in the community in Oxfordshire, United Kingdom. We compared single-nucleotide variants (SNVs) between the isolates, using C. difficile evolution rates estimated on the basis of the first and last samples obtained from each of 145 patients, with 0 to 2 SNVs expected between transmitted isolates obtained less than 124 days apart, on the basis of a 95% prediction interval. We then identified plausible epidemiologic links among genetically related cases from data on hospital admissions and community location. RESULTS: Of 1250 C. difficile cases that were evaluated, 1223 (98%) were successfully sequenced. In a comparison of 957 samples obtained from April 2008 through March 2011 with those obtained from September 2007 onward, a total of 333 isolates (35%) had no more than 2 SNVs from at least 1 earlier case, and 428 isolates (45%) had more than 10 SNVs from all previous cases. Reductions in incidence over time were similar in the two groups, a finding that suggests an effect of interventions targeting the transition from exposure to disease. Of the 333 patients with no more than 2 SNVs (consistent with transmission), 126 patients (38%) had close hospital contact with another patient, and 120 patients (36%) had no hospital or community contact with another patient. Distinct subtypes of infection continued to be identified throughout the study, which suggests a considerable reservoir of C. difficile. CONCLUSIONS: Over a 3-year period, 45% of C. difficile cases in Oxfordshire were genetically distinct from all previous cases. Genetically diverse sources, in addition to symptomatic patients, play a major part in C. difficile transmission. (Funded by the U.K. Clinical Research Collaboration Translational Infection Research Initiative and others.)
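The SNV thresholds reported in the abstract above (no more than 2 SNVs consistent with transmission, more than 10 SNVs genetically distinct) can be illustrated with a short sketch. It assumes pre-aligned sequences of equal length and ignores the time between samples, ambiguous bases, and recombination, all of which a real analysis would need to handle; the function names and the "intermediate" label are hypothetical.

```python
def snv_distance(seq_a, seq_b):
    """Count positions at which two aligned, equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(seq_a, seq_b) if x != y)


def classify_pair(snvs):
    """Label a pair of isolates using the thresholds quoted in the abstract.
    The 'intermediate' label for 3 to 10 SNVs is an assumption of this sketch."""
    if snvs <= 2:
        return "consistent with transmission"
    if snvs > 10:
        return "genetically distinct"
    return "intermediate"


# Toy aligned fragments (hypothetical, far shorter than a real genome).
distance = snv_distance("ACGTACGT", "ACGTACGA")
label = classify_pair(distance)
```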
OBJECTIVES: To investigate the prospects of newly available benchtop sequencers to provide rapid whole-genome data in routine clinical practice. Next-generation sequencing has the potential to resolve uncertainties surrounding the route and timing of person-to-person transmission of healthcare-associated infection, which has been a major impediment to optimal management. DESIGN: The authors used Illumina MiSeq benchtop sequencing to undertake case studies investigating potential outbreaks of methicillin-resistant Staphylococcus aureus (MRSA) and Clostridium difficile. SETTING: Isolates were obtained from potential outbreaks associated with three UK hospitals. PARTICIPANTS: Isolates were sequenced from a cluster of eight MRSA carriers and an associated bacteraemia case in an intensive care unit, another MRSA cluster of six cases and two clusters of C difficile. Additionally, all C difficile isolates from cases over 6 weeks in a single hospital were rapidly sequenced and compared with local strain sequences obtained in the preceding 3 years. MAIN OUTCOME MEASURE: Whole-genome genetic relatedness of the isolates within each epidemiological cluster. RESULTS: Twenty-six MRSA and 15 C difficile isolates were successfully sequenced and analysed within 5 days of culture. Both MRSA clusters were identified as outbreaks, with most sequences in each cluster indistinguishable and all within three single nucleotide variants (SNVs). Epidemiologically unrelated isolates of the same spa-type were genetically distinct (≥21 SNVs). In both C difficile clusters, closely epidemiologically linked cases (in one case sharing the same strain type) were shown to be genetically distinct (≥144 SNVs). A reconstruction applying rapid sequencing in C difficile surveillance provided early outbreak detection and identified previously undetected probable community transmission. CONCLUSIONS: This benchtop sequencing technology is widely generalisable to human bacterial pathogens. 
The findings provide several good examples of how rapid and precise sequencing could transform identification of transmission of healthcare-associated infection and therefore improve hospital infection control and patient outcomes in routine clinical practice.
BACKGROUND: New approaches are urgently required to address increasing rates of gonorrhoea and the emergence and global spread of antibiotic-resistant Neisseria gonorrhoeae. We used whole-genome sequencing to study transmission and track resistance in N gonorrhoeae isolates. METHODS: We did whole-genome sequencing of isolates obtained from samples collected from patients attending sexual health services in Brighton, UK, between Jan 1, 2011, and March 9, 2015. We also included isolates from other UK locations, historical isolates from Brighton, and previous data from a US study. Samples from symptomatic patients and asymptomatic sexual health screening underwent nucleic acid amplification testing; positive samples and all samples from symptomatic patients were cultured for N gonorrhoeae, and resulting isolates were whole-genome sequenced. Cefixime susceptibility testing was done in selected isolates by agar incorporation, and we used sequence data to determine multi-antigen sequence types and penA genotypes. We derived a transmission nomogram to determine the plausibility of direct or indirect transmission between any two cases depending on the time between samples: estimated mutation rates, plus diversity noted within patients across anatomical sites and probable transmission pairs, were used to fit a coalescent model to determine the number of single nucleotide polymorphisms expected. FINDINGS: 1407 (98%) of 1437 Brighton isolates between Jan 1, 2011, and March 9, 2015 were successfully sequenced. We identified 1061 infections from 907 patients. 281 (26%) of these infections were indistinguishable (ie, differed by zero single nucleotide polymorphisms) from one or more previous cases, and 786 (74%) had evidence of a sampled direct or indirect Brighton source. We observed multiple related samples across geographical locations. 
Of 1273 infections in Brighton (including historical data), 225 (18%) were linked to another case elsewhere in the UK, and 115 (9%) to a case in the USA. Four lineages initially identified in Brighton could be linked to 70 USA sequences, including 61 from a lineage carrying the mosaic penA XXXIV allele, which is associated with reduced cefixime susceptibility. INTERPRETATION: We present a whole-genome-sequencing-based tool for genomic contact tracing of N gonorrhoeae and demonstrate local, national, and international transmission. Whole-genome sequencing can be applied across geographical boundaries to investigate gonorrhoea transmission and to track antimicrobial resistance. FUNDING: Oxford National Institute for Health Research Health Protection Research Unit and Biomedical Research Centre.
BACKGROUND: Clostridium difficile infection (CDI) is a leading cause of antibiotic-associated diarrhoea and is endemic in hospitals, hindering the identification of sources and routes of transmission based on shared time and space alone. This may compromise rational control despite costly prevention strategies. This study aimed to investigate ward-based transmission of C. difficile, by subdividing outbreaks into distinct lineages defined by multi-locus sequence typing (MLST). METHODS AND FINDINGS: All C. difficile toxin enzyme-immunoassay-positive and culture-positive samples over 2.5 y from a geographically defined population of ~600,000 persons underwent MLST. Sequence types (STs) were combined with admission and ward movement data from an integrated comprehensive healthcare system incorporating three hospitals (1,700 beds) providing all acute care for the defined geographical population. Networks of cases and potential transmission events were constructed for each ST. Potential infection sources for each case and transmission timescales were defined by prior ward-based contact with other cases sharing the same ST. From 1 September 2007 to 31 March 2010, there were means of 102 tests and 9.4 CDIs per 10,000 overnight stays in inpatients, and 238 tests and 15.7 CDIs per month in outpatients/primary care. In total, 1,276 C. difficile isolates of 69 STs were studied. From MLST, no more than 25% of cases could be linked to a potential ward-based inpatient source, ranging from 37% in renal/transplant, 29% in haematology/oncology, and 28% in acute/elderly medicine to 6% in specialist surgery. Most of the putative transmissions identified occurred shortly (≤ 1 wk) after the onset of symptoms (141/218, 65%), with few >8 wk (21/218, 10%). Most incubation periods were ≤ 4 wk (132/218, 61%), with few >12 wk (28/218, 13%). 
Allowing for persistent ward contamination following ward discharge of a CDI case did not increase the proportion of linked cases after allowing for random meeting of matched controls. CONCLUSIONS: In an endemic setting with well-implemented infection control measures, ward-based contact with symptomatic enzyme-immunoassay-positive patients cannot account for most new CDI cases.
BACKGROUND: Tuberculosis incidence in the UK has risen in the past decade. Disease control depends on epidemiological data, which can be difficult to obtain. Whole-genome sequencing can detect microevolution within Mycobacterium tuberculosis strains. We aimed to estimate the genetic diversity of related M tuberculosis strains in the UK Midlands and to investigate how this measurement might be used to investigate community outbreaks. METHODS: In a retrospective observational study, we used Illumina technology to sequence M tuberculosis genomes from an archive of frozen cultures. We characterised isolates into four groups: cross-sectional, longitudinal, household, and community. We measured pairwise nucleotide differences within hosts and between hosts in household outbreaks and estimated the rate of change in DNA sequences. We used the findings to interpret network diagrams constructed from 11 community clusters derived from mycobacterial interspersed repetitive-unit-variable-number tandem-repeat data. FINDINGS: We sequenced 390 separate isolates from 254 patients, including representatives from all five major lineages of M tuberculosis. The estimated rate of change in DNA sequences was 0.5 single nucleotide polymorphisms (SNPs) per genome per year (95% CI 0.3-0.7) in longitudinal isolates from 30 individuals and 25 families. Divergence is rarely higher than five SNPs in 3 years. 109 (96%) of 114 paired isolates from individuals and households differed by five or fewer SNPs. More than five SNPs separated isolates from none of 69 epidemiologically linked patients, two (15%) of 13 possibly linked patients, and 13 (17%) of 75 epidemiologically unlinked patients (three-way comparison exact p<0.0001). Genetic trees and clinical and epidemiological data suggest that super-spreaders were present in two community clusters. INTERPRETATION: Whole-genome sequencing can delineate outbreaks of tuberculosis and allows inference about direction of transmission between cases. 
The technique could identify super-spreaders and predict the existence of undiagnosed cases, potentially leading to early treatment of infectious patients and their contacts. FUNDING: Medical Research Council, Wellcome Trust, National Institute for Health Research, and the Health Protection Agency.
BACKGROUND: Despite substantial interest in biomarkers, their impact on clinical outcomes and variation with bacterial strain has rarely been explored using integrated databases. METHODS: From September 2006 to May 2011, strains isolated from Clostridium difficile toxin enzyme immunoassay (EIA)-positive fecal samples from Oxfordshire, United Kingdom (approximately 600,000 people) underwent multilocus sequence typing. Fourteen-day mortality and levels of 15 baseline biomarkers were compared between consecutive C. difficile infections (CDIs) from different clades/sequence types (STs) and EIA-negative controls using Cox and normal regression adjusted for demographic/clinical factors. RESULTS: Fourteen-day mortality was 13% in 2222 adults with 2745 EIA-positive samples (median, 78 years) vs 5% in 20,722 adults with 27,550 EIA-negative samples (median, 74 years) (absolute attributable mortality, 7.7%; 95% CI, 6.4%-9.0%). Mortality was highest in clade 5 CDIs (25% [16 of 63]; polymerase chain reaction (PCR) ribotype 078/ST 11), then clade 2 (20% [111 of 560]; 99% PCR ribotype 027/ST 1) versus clade 1 (12% [137 of 1168]; adjusted P < .0001). Within clade 1, 14-day mortality was only 4% (3 of 84) in ST 44 (PCR ribotype 015) (adjusted P = .05 vs other clade 1). Mean baseline neutrophil counts also varied significantly by genotype: 12.4, 11.6, and 9.5 × 10^9 neutrophils/L for clades 5, 2 and 1, respectively, vs 7.0 × 10^9 neutrophils/L in EIA-negative controls (P < .0001) and 7.9 × 10^9 neutrophils/L in ST 44 (P = .08). There were strong associations between C. difficile-type-specific effects on mortality and neutrophil/white cell counts (rho = 0.48), C-reactive-protein (rho = 0.43), eosinophil counts (rho = -0.45), and serum albumin (rho = -0.47). Biomarkers predicted 30%-40% of clade-specific mortality differences. CONCLUSIONS: C. difficile genotype predicts mortality, and excess mortality correlates with genotype-specific changes in biomarkers, strongly implicating inflammatory pathways as a major influence on poor outcome after CDI. PCR ribotype 078/ST 11 (clade 5) leads to severe CDI; thus ongoing surveillance remains essential.
Clostridium difficile surveillance allows outbreaks of cases clustered in time and space to be identified and further transmission prevented. Traditionally, manual detection of groups of cases diagnosed in the same ward or hospital, often followed by retrospective reference laboratory genotyping, has been used to identify outbreaks. However, integrated healthcare databases offer the prospect of automated real-time outbreak detection based on statistically robust methods, and accounting for contacts between cases, including those distant to the ward of diagnosis. Complementary to this, rapid benchtop whole genome sequencing, and other highly discriminatory genotyping, has the potential to distinguish which cases are part of an outbreak with high precision and in clinically relevant timescales. These new technologies are likely to shape future surveillance.
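One very simple form that automated, statistically based outbreak detection could take is sketched below: the observed number of cases in a period is compared with an expected baseline under a Poisson assumption, and an unusually high count is flagged. The Poisson model, the threshold `alpha`, the function names, and the example counts are all assumptions made for illustration; the abstract above does not specify a method.

```python
import math


def poisson_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected)."""
    cdf_below = sum(
        math.exp(-expected) * expected**k / math.factorial(k)
        for k in range(observed)
    )
    return 1.0 - cdf_below


def flag_excess(observed, expected, alpha=0.01):
    """Flag a period whose case count is improbably high under the baseline."""
    return poisson_tail(observed, expected) < alpha


# Hypothetical ward with a baseline of 2 cases per week.
alert_quiet_week = flag_excess(3, 2.0)
alert_busy_week = flag_excess(7, 2.0)
```

A count-based rule like this ignores contacts between cases and genotype; the abstract's point is that linked patient-movement data and sequencing can refine which flagged cases actually belong to an outbreak.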
OBJECTIVES: Whole-genome sequencing potentially represents a single, rapid and cost-effective approach to defining resistance mechanisms and predicting phenotype, and strain type, for both clinical and epidemiological purposes. This retrospective study aimed to determine the efficacy of whole genome-based antimicrobial resistance prediction in clinical isolates of Escherichia coli and Klebsiella pneumoniae. METHODS: Seventy-four E. coli and 69 K. pneumoniae bacteraemia isolates from Oxfordshire, UK, were sequenced (Illumina HiSeq 2000). Resistance phenotypes were predicted from genomic sequences using BLASTn-based comparisons of de novo-assembled contigs with a study database of >100 known resistance-associated loci, including plasmid-associated and chromosomal genes. Predictions were made for seven commonly used antimicrobials: amoxicillin, co-amoxiclav, ceftriaxone, ceftazidime, ciprofloxacin, gentamicin and meropenem. Comparisons were made with phenotypic results obtained in duplicate by broth dilution (BD Phoenix). Discrepancies, either between duplicate BD Phoenix results or between genotype and phenotype, were resolved with gradient diffusion analyses. RESULTS: A wide variety of antimicrobial resistance genes were identified, including blaCTX-M, blaLEN, blaOKP, blaOXA, blaSHV, blaTEM, aac(3')-Ia, aac-(3')-IId, aac-(3')-IIe, aac(6')-Ib-cr, aadA1a, aadA4, aadA5, aadA16, aph(6')-Id, aph(3')-Ia, qnrB and qnrS, as well as resistance-associated mutations in chromosomal gyrA and parC genes. The sensitivity of genome-based resistance prediction across all antibiotics for both species was 0.96 (95% CI: 0.94-0.98) and the specificity was 0.97 (95% CI: 0.95-0.98). Very major and major error rates were 1.2% and 2.1%, respectively. CONCLUSIONS: Our method was as sensitive and specific as routinely deployed phenotypic methods. Validation against larger datasets and formal assessments of cost and turnaround time in a routine laboratory setting are warranted.
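For readers unfamiliar with the metrics quoted above, the sketch below shows how sensitivity and specificity are computed when genotype-based predictions are compared against phenotypic results, treating "resistant" as the positive class. The counts are hypothetical and are not taken from the study; the function name is likewise illustrative.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts.

    tp: phenotypically resistant, predicted resistant
    fn: phenotypically resistant, predicted susceptible
    tn: phenotypically susceptible, predicted susceptible
    fp: phenotypically susceptible, predicted resistant
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be represented")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical counts for illustration only.
sens, spec = sensitivity_specificity(tp=45, fn=5, tn=90, fp=10)
```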