
Gil McVean FMedSci, FRS

Research Area: Genetics and Genomics
Technology Exchange: Bioinformatics, SNP typing and Statistical genetics
Scientific Themes: Genetics & Genomics

My research covers several areas in the analysis of genetic variation, combining the development of methods for analyzing high-throughput sequencing data with theoretical work and empirical analysis. Of particular interest are the analysis of recombination from population genetic data, the dissection of signals of disease association within the HLA, methods for inferring genealogical history from DNA sequence data, and de novo sequence assembly for the discovery of genetic variation. I am a member of the Department of Statistics and Director of the Oxford Big Data Institute.

Name | Department | Institution | Country
Dominic Kwiatkowski | Wellcome Trust Centre for Human Genetics | Oxford University, Henry Wellcome Building of Genomic Medicine | United Kingdom
Peter Donnelly FRS FMedSci | Wellcome Trust Centre for Human Genetics | Oxford University, Henry Wellcome Building of Genomic Medicine | United Kingdom
Dr Luke Jostins | Nuffield Department of Medicine | University of Oxford | United Kingdom
Chris Holmes | Wellcome Trust Centre for Human Genetics | Oxford University, Henry Wellcome Building of Genomic Medicine | United Kingdom
Jenny Taylor | Wellcome Trust Centre for Human Genetics | Oxford University, Henry Wellcome Building of Genomic Medicine | United Kingdom
Adrian Hill | Jenner Institute | Oxford University, Old Road Campus Research Building | United Kingdom
Professor Paul Flicek | | Wellcome Trust Sanger Institute | United Kingdom
Professor Stephen Sawcer | Department of Clinical Neurosciences | University of Cambridge | United Kingdom
Dr David Bentley | | Illumina Inc. | United Kingdom
Dr Mike Eberle | | Illumina Inc. | United States
Dr Ian Henderson | | University of Cambridge | United Kingdom
Kelleher J, Wong Y, Wohns AW, Fadil C, Albers PK, McVean G. 2019. Publisher Correction: Inferring whole-genome histories in large population datasets. Nat Genet, 51 (11), pp. 1660-1660.

An amendment to this paper has been published and can be accessed via a link at the top of the paper.

Kelleher J, Wong Y, Wohns AW, Fadil C, Albers PK, McVean G. 2019. Inferring whole-genome histories in large population datasets. Nat Genet, 51 (9), pp. 1330-1338.

Inferring the full genealogical history of a set of DNA sequences is a core problem in evolutionary biology, because this history encodes information about the events and forces that have influenced a species. However, current methods are limited, and the most accurate techniques are able to process no more than a hundred samples. As datasets that consist of millions of genomes are now being collected, there is a need for scalable and efficient inference methods to fully utilize these resources. Here we introduce an algorithm that is able to not only infer whole-genome histories with comparable accuracy to the state-of-the-art but also process four orders of magnitude more sequences. The approach also provides an 'evolutionary encoding' of the data, enabling efficient calculation of relevant statistics. We apply the method to human data from the 1000 Genomes Project, Simons Genome Diversity Project and UK Biobank, showing that the inferred genealogies are rich in biological signal and efficient to process.

Mentzer AJ, Brenner N, Allen N, Littlejohns TJ, Chong AY, Cortes A, Almond R, Hill M, Sheard S, McVean G et al. Identification of host-pathogen-disease relationships using a scalable Multiplex Serology platform in UK Biobank.

Background: Certain infectious agents are recognised causes of cancer and potentially other chronic diseases. Identifying associations and understanding pathological mechanisms involving infectious agents and subsequent chronic disease risk will be possible through measuring exposure to multiple infectious agents in large-scale prospective cohorts such as UK Biobank. Methods: Following expert consensus we designed a Multiplex Serology platform capable of simultaneously measuring quantitative antibody responses against 45 antigens from 20 infectious agents implicated in non-communicable diseases, including human herpes, hepatitis, polyoma, papilloma, and retroviruses, as well as Chlamydia trachomatis, Helicobacter pylori and Toxoplasma gondii. This panel was assayed in a random subset of UK Biobank participants (n=9,695) to test associations between infectious agents and recognised demographic and genetic risk factors and disease outcomes. Findings: Seroprevalence estimates for each infectious agent were consistent with those expected from the literature. The data confirmed epidemiological associations of infectious agent antibody responses with sociodemographic characteristics (e.g. lifetime sexual partners with C. trachomatis; P=1.8×10^-149), genetic variants (e.g. rs6927022 with Epstein-Barr virus (EBV) EBNA1 antibodies, P=9.5×10^-91) and disease outcomes including human papillomavirus-16 seropositivity and cervical intraepithelial neoplasia (odds ratio 2.28, 95% confidence interval 1.38-3.63), and quantitative EBV viral capsid antigen responses and multiple sclerosis through genetic correlation (MHC rG=0.30, P=0.01). Interpretation: This dataset, intended as a pilot study to demonstrate applicability of Multiplex Serology in epidemiological studies, is itself one of the largest studies to date covering diverse infectious agents in a prospective UK cohort including those traditionally under-represented in population cohorts such as human immunodeficiency virus-1 and C. trachomatis. Our results emphasise the validity of our Multiplex Serology approach in large-scale epidemiological studies opening up opportunities for improving our understanding of host-pathogen-disease relationships. These data are available to researchers interested in examining the relationship between infectious agents and human health.

Zhu SJ, Hendry JA, Almagro-Garcia J, Pearson RD, Amato R, Miles A, Weiss DJ, Lucas TC, Nguyen M, Gething PW et al. 2019. The origins and relatedness structure of mixed infections vary with local prevalence of P. falciparum malaria. Elife, 8.

Individual malaria infections can carry multiple strains of Plasmodium falciparum with varying levels of relatedness. Yet, how local epidemiology affects the properties of such mixed infections remains unclear. Here, we develop an enhanced method for strain deconvolution from genome sequencing data, which estimates the number of strains, their proportions, identity-by-descent (IBD) profiles and individual haplotypes. Applying it to the Pf3k data set, we find that the rate of mixed infection varies from 29% to 63% across countries and that 51% of mixed infections involve more than two strains. Furthermore, we estimate that 47% of symptomatic dual infections contain sibling strains likely to have been co-transmitted from a single mosquito, and find evidence of mixed infections propagated over successive infection cycles. Finally, leveraging data from the Malaria Atlas Project, we find that prevalence correlates within Africa, but not Asia, with both the rate of mixed infection and the level of IBD.

Garimella KV, Iqbal Z, Krause MA, Campino S, Kekre M, Drury E, Kwiatkowski D, Sa JM, Wellems TE, McVean G. Detection of simple and complex de novo mutations without, with, or with multiple reference sequences.

The characterization of de novo mutations in regions of high sequence and structural diversity from whole genome sequencing data remains highly challenging. Complex structural variants tend to arise in regions of high repetitiveness and low complexity, challenging both de novo assembly, where short reads do not capture the long-range context required for resolution, and mapping approaches, where improper alignment of reads to a reference genome that is highly diverged from that of the sample can lead to false or partial calls. Long-read technologies can potentially solve such problems but are currently unfeasible to use at scale. Here we present Corticall, a graph-based method that combines the advantages of multiple technologies and prior data sources to detect arbitrary classes of genetic variant. We construct multi-sample, coloured de Bruijn graphs from short-read data for all samples, align long-read-derived haplotypes and multiple reference data sources to restore graph connectivity information, and call variants using graph path-finding algorithms and a model for simultaneous alignment and recombination. We validate and evaluate the approach using extensive simulations and use it to characterize the rate and spectrum of de novo mutation events in 119 progeny from four Plasmodium falciparum experimental crosses, using long-read data on the parents to inform reconstructions of the progeny and to detect several known and novel non-allelic homologous recombination events.

Palmer DS, Turner I, Fidler S, Frater J, Goedhals D, Goulder P, Huang K-HG, Oxenius A, Phillips R, Shapiro R et al. 2019. Mapping the drivers of within-host pathogen evolution using massive data sets. Nat Commun, 10 (1), pp. 3017.

Differences among hosts, resulting from genetic variation in the immune system or heterogeneity in drug treatment, can impact within-host pathogen evolution. Genetic association studies can potentially identify such interactions. However, extensive and correlated genetic population structure in hosts and pathogens presents a substantial risk of confounding analyses. Moreover, the multiple testing burden of interaction scanning can potentially limit power. We present a Bayesian approach for detecting host influences on pathogen evolution that exploits vast existing data sets of pathogen diversity to improve power and control for stratification. The approach models key processes, including recombination and selection, and identifies regions of the pathogen genome affected by host factors. Our simulations and empirical analysis of drug-induced selection on the HIV-1 genome show that the method recovers known associations and has superior precision-recall characteristics compared to other approaches. We build a high-resolution map of HLA-induced selection in the HIV-1 genome, identifying novel epitope-allele combinations.

Fang H, ULTRA-DD Consortium, De Wolf H, Knezevic B, Burnham KL, Osgood J, Sanniti A, Lledó Lara A, Kasela S, De Cesco S et al. 2019. A genetics-led approach defines the drug target landscape of 30 immune-related traits. Nat Genet, 51 (7), pp. 1082-1091.

Most candidate drugs currently fail later-stage clinical trials, largely due to poor prediction of efficacy on early target selection [1]. Drug targets with genetic support are more likely to be therapeutically valid [2,3], but the translational use of genome-scale data such as from genome-wide association studies for drug target discovery in complex diseases remains challenging [4-6]. Here, we show that integration of functional genomic and immune-related annotations, together with knowledge of network connectivity, maximizes the informativeness of genetics for target validation, defining the target prioritization landscape for 30 immune traits at the gene and pathway level. We demonstrate how our genetics-led drug target prioritization approach (the priority index) successfully identifies current therapeutics, predicts activity in high-throughput cellular screens (including L1000, CRISPR, mutagenesis and patient-derived cell assays), enables prioritization of under-explored targets and allows for determination of target-level trait relationships. The priority index is an open-access, scalable system accelerating early-stage drug target selection for immune-mediated disease.

Nelson D, Kelleher J, Ragsdale AP, McVean G, Gravel S. Coupling Wright-Fisher and coalescent dynamics for realistic simulation of population-scale datasets.

Coalescent simulations are widely used to examine the effects of evolution and demographic history on the genetic makeup of populations. Thanks to recent progress in algorithms and data structures, simulators such as the widely-used msprime [1] now provide genome-wide simulations for millions of individuals. However, this software relies on classic coalescent theory and the corresponding assumptions that sample sizes are small relative to effective population size and that the region being simulated is short. Here we show that coalescent simulations of long regions of the genome exhibit large biases in identity-by-descent (IBD), long-range linkage disequilibrium (LD), and ancestry patterns, particularly when sample size is large. We present a Wright-Fisher extension to msprime, and show that it produces more realistic distributions of IBD, LD, and ancestry proportions, while also addressing more subtle biases of the coalescent. Further, these extensions are more computationally efficient than state-of-the-art coalescent simulations when simulating long regions, including whole-genome data. For shorter regions, efficiency and accuracy can be maintained via a flexible hybrid model which simulates the recent past under the Wright-Fisher model and uses coalescent simulations in the distant past.
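The small-sample assumption mentioned in this abstract can be illustrated with a toy backward-in-time simulation. The sketch below is not msprime and not the authors' extension: it assumes a haploid population of constant size with no recombination, and the function name and parameters are hypothetical. Each generation, every ancestral lineage picks a parent uniformly at random. When the sample is large relative to the population, several lineages typically merge in a single generation, whereas the standard coalescent allows only one pairwise merger at a time.

```python
import random


def wright_fisher_lineage_counts(sample_size, population_size, seed=0):
    """Trace a haploid sample backwards in time under a toy Wright-Fisher model.

    Returns the number of distinct ancestral lineages at each generation,
    starting with the sample itself and ending when one lineage remains.
    """
    rng = random.Random(seed)
    lineages = sample_size
    counts = [lineages]
    while lineages > 1:
        # Every lineage chooses its parent independently and uniformly;
        # lineages that choose the same parent merge.
        parents = {rng.randrange(population_size) for _ in range(lineages)}
        lineages = len(parents)
        counts.append(lineages)
    return counts


if __name__ == "__main__":
    counts = wright_fisher_lineage_counts(sample_size=50, population_size=100)
    # The first few entries show how many lineages are lost per generation.
    print(counts[:5], "... generations to common ancestor:", len(counts) - 1)
```

Comparing the per-generation drops for a small and a large sample_size at the same population_size gives a feel for when the pairwise-merger approximation starts to break down.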

Dilthey AT, Mentzer AJ, Carapito R, Cutland C, Cereb N, Madhi SA, Rhie A, Koren S, Bahram S, McVean G, Phillippy AM. 2019. HLA*LA-HLA typing from linearly projected graph alignments. Bioinformatics, 35 (21), pp. 4394-4396.

SUMMARY: HLA*LA implements a new graph alignment model for human leukocyte antigen (HLA) type inference, based on the projection of linear alignments onto a variation graph. It enables accurate HLA type inference from whole-genome (99% accuracy) and whole-exome (93% accuracy) Illumina data; from long-read Oxford Nanopore and Pacific Biosciences data (98% accuracy for whole-genome and targeted data) and from genome assemblies. Computational requirements for a typical sample vary between 0.7 and 14 CPU hours per sample. AVAILABILITY AND IMPLEMENTATION: HLA*LA is implemented in C++ and Perl and freely available as a bioconda package or from https://github.com/DiltheyLab/HLA-LA (GPL v3). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Trochet H, Pirinen M, Band G, Jostins L, McVean G, Spencer CCA. 2019. Bayesian meta-analysis across genome-wide association studies of diverse phenotypes. Genet Epidemiol, 43 (5), pp. 532-547.

Genome-wide association studies (GWAS) are a powerful tool for understanding the genetic basis of diseases and traits, but most studies have been conducted in isolation, with a focus on either a single or a set of closely related phenotypes. We describe MetABF, a simple Bayesian framework for performing integrative meta-analysis across multiple GWAS using summary statistics. The approach is applicable across a wide range of study designs and can increase the power by 50% compared with standard frequentist tests when only a subset of studies have a true effect. We demonstrate its utility in a meta-analysis of 20 diverse GWAS which were part of the Wellcome Trust Case Control Consortium 2. The novelty of the approach is its ability to explore, and assess the evidence for a range of possible true patterns of association across studies in a computationally efficient framework.
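One way to see how summary statistics can support a comparison of association patterns across studies is the sketch below. It is not the MetABF implementation. It assumes a normal prior on each effect size, independent effects across studies, and a uniform prior over subsets of studies; the function names and the prior standard deviation of 0.2 are illustrative choices. Under those assumptions, the Bayes factor for "exactly this subset of studies has a non-zero effect" is the product of the per-study approximate Bayes factors in the subset.

```python
import math
from itertools import combinations


def approximate_bayes_factor(beta, se, prior_sd):
    """Bayes factor for 'effect ~ N(0, prior_sd^2)' versus 'effect = 0'.

    Uses only the effect estimate and its standard error: the estimate is
    N(0, se^2 + prior_sd^2) under the alternative and N(0, se^2) under the null.
    """
    v = se ** 2
    w = prior_sd ** 2
    z_squared = (beta / se) ** 2
    return math.sqrt(v / (v + w)) * math.exp(0.5 * z_squared * w / (v + w))


def subset_posteriors(studies, prior_sd=0.2):
    """Posterior probability of each subset of studies carrying a true effect.

    `studies` is a list of (beta, se) pairs. Effects are assumed independent
    across studies and all subsets are equally likely a priori (assumptions).
    """
    abfs = [approximate_bayes_factor(b, s, prior_sd) for b, s in studies]
    scores = {}
    for size in range(len(studies) + 1):
        for subset in combinations(range(len(studies)), size):
            # The empty subset is the global null, with Bayes factor 1.
            scores[subset] = math.prod(abfs[i] for i in subset)
    total = sum(scores.values())
    return {subset: score / total for subset, score in scores.items()}


if __name__ == "__main__":
    # Hypothetical summary statistics (beta, standard error) for three studies.
    studies = [(0.30, 0.05), (0.01, 0.05), (0.25, 0.06)]
    posteriors = subset_posteriors(studies)
    best = max(posteriors, key=posteriors.get)
    print("most probable subset of studies:", best)
```

Enumerating every subset is exponential in the number of studies, so this only scales to a handful of studies without a smarter search or a restricted set of candidate patterns.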

Bradley P, den Bakker HC, Rocha EPC, McVean G, Iqbal Z. 2019. Ultrafast search of all deposited bacterial and viral genomic data. Nat Biotechnol, 37 (2), pp. 152-159.

Exponentially increasing amounts of unprocessed bacterial and viral genomic sequence data are stored in the global archives. The ability to query these data for sequence search terms would facilitate both basic research and applications such as real-time genomic epidemiology and surveillance. However, this is not possible with current methods. To solve this problem, we combine knowledge of microbial population genomics with computational methods devised for web search to produce a searchable data structure named BItsliced Genomic Signature Index (BIGSI). We indexed the entire global corpus of 447,833 bacterial and viral whole-genome sequence datasets using four orders of magnitude less storage than previous methods. We applied our BIGSI search function to rapidly find resistance genes MCR-1, MCR-2, and MCR-3, determine the host-range of 2,827 plasmids, and quantify antibiotic resistance in archived datasets. Our index can grow incrementally as new (unprocessed or assembled) sequence datasets are deposited and can scale to millions of datasets.
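The idea of a bit-sliced signature index can be sketched in a few lines. This is a toy illustration, not the BIGSI code. It assumes each dataset is summarised by a Bloom filter of its k-mers and that the filters are stored row-wise, so that each row records which datasets set a given bit. The hash scheme, filter size, and function names are illustrative assumptions. A query hashes each of its k-mers, fetches the matching rows, and intersects them; the result can contain false positives but never false negatives.

```python
import hashlib


def bit_positions(kmer, num_bits, num_hashes):
    """Bloom-filter bit positions for one k-mer (salted SHA-256, an illustrative choice)."""
    positions = []
    for seed in range(num_hashes):
        digest = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
        positions.append(int.from_bytes(digest[:8], "big") % num_bits)
    return positions


def kmers(sequence, k):
    """Set of all k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def build_index(datasets, k, num_bits=256, num_hashes=3):
    """Return one integer per bit position; bit j of each integer stands for dataset j."""
    rows = [0] * num_bits
    for column, sequence in enumerate(datasets):
        for kmer in kmers(sequence, k):
            for position in bit_positions(kmer, num_bits, num_hashes):
                rows[position] |= 1 << column
    return rows


def query(rows, sequence, k, num_datasets, num_hashes=3):
    """Indices of datasets that may contain every k-mer of the query sequence."""
    hits = (1 << num_datasets) - 1  # start with every dataset as a candidate
    for kmer in kmers(sequence, k):
        for position in bit_positions(kmer, len(rows), num_hashes):
            hits &= rows[position]  # a dataset must have all of these bits set
    return [j for j in range(num_datasets) if (hits >> j) & 1]


if __name__ == "__main__":
    datasets = ["ACGTACGTGG", "TTTTCCCCAA", "GGACGTACAA"]
    rows = build_index(datasets, k=4)
    print(query(rows, "ACGTAC", k=4, num_datasets=len(datasets)))
```

Adding a dataset only sets one more bit column in each row, which is consistent with the incremental growth described in the abstract.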

Auburn S, Getachew S, Pearson RD, Amato R, Miotto O, Trimarsanto H, Zhu SJ, Rumaseb A, Marfurt J, Noviyanti R et al. 2019. Genomic analysis of Plasmodium vivax in southern Ethiopia reveals selective pressures in multiple parasite mechanisms. J Infect Dis, 220 (11), pp. 1738-1749.

The Horn of Africa harbours the largest reservoir of Plasmodium vivax in the continent. Most of sub-Saharan Africa has remained relatively vivax-free due to a high prevalence of the human Duffy-negative trait, but the emergence of strains able to invade Duffy-negative reticulocytes poses a major public health threat. We undertook the first population genomic investigation of P. vivax from the region, comparing the genomes of 24 Ethiopian isolates against data from Southeast Asia to identify important local adaptations. The prevalence of the Duffy binding protein amplification in Ethiopia was 79%, potentially reflecting adaptation to Duffy-negativity. There was also evidence of selection in a region upstream of the chloroquine resistance transporter, a putative chloroquine-resistance determinant. Strong signals of selection were observed in genes involved in immune evasion and regulation of gene expression, highlighting the need for a multifaceted intervention approach to combat P. vivax in the region.

Maher GJ, Ralph HK, Ding Z, Koelling N, Mlcochova H, Giannoulatou E, Dhami P, Paul DS, Stricker SH, Beck S et al. 2018. Selfish mutations dysregulating RAS-MAPK signaling are pervasive in aged human testes. Genome Res, 28 (12), pp. 1779-1790.

Mosaic mutations present in the germline have important implications for reproductive risk and disease transmission. We previously demonstrated a phenomenon occurring in the male germline, whereby specific mutations arising spontaneously in stem cells (spermatogonia) lead to clonal expansion, resulting in elevated mutation levels in sperm over time. This process, termed "selfish spermatogonial selection," explains the high spontaneous birth prevalence and strong paternal age-effect of disorders such as achondroplasia and Apert, Noonan and Costello syndromes, with direct experimental evidence currently available for specific positions of six genes (FGFR2, FGFR3, RET, PTPN11, HRAS, and KRAS). We present a discovery screen to identify novel mutations and genes showing evidence of positive selection in the male germline, by performing massively parallel simplex PCR using RainDance technology to interrogate mutational hotspots in 67 genes (51.5 kb in total) in 276 biopsies of testes from five men (median age, 83 yr). Following ultradeep sequencing (about 16,000×), development of a low-frequency variant prioritization strategy, and targeted validation, we identified 61 distinct variants present at frequencies as low as 0.06%, including 54 variants not previously directly associated with selfish selection. The majority (80%) of variants identified have previously been implicated in developmental disorders and/or oncogenesis and include mutations in six newly associated genes (BRAF, CBL, MAP2K1, MAP2K2, RAF1, and SOS1), all of which encode components of the RAS-MAPK pathway and activate signaling. Our findings extend the link between mutations dysregulating the RAS-MAPK pathway and selfish selection, and show that the aging male germline is a repository for such deleterious mutations.

James KL, de Silva TI, Brown K, Whittle H, Taylor S, McVean G, Esbjörnsson J, Rowland-Jones SL. 2019. Low-Bias RNA Sequencing of the HIV-2 Genome from Blood Plasma. J Virol, 93 (1).

Accurate determination of the genetic diversity present in the HIV quasispecies is critical for the development of a preventative vaccine: in particular, little is known about viral genetic diversity for the second type of HIV, HIV-2. A better understanding of HIV-2 biology is relevant to the HIV vaccine field because a substantial proportion of infected people experience long-term viral control, and prior HIV-2 infection has been associated with slower HIV-1 disease progression in coinfected subjects. The majority of traditional and next-generation sequencing methods have relied on target amplification prior to sequencing, introducing biases that may obscure the true signals of diversity in the viral population. Additionally, target enrichment through PCR requires a priori sequence knowledge, which is lacking for HIV-2. Therefore, a target-enrichment-free method of library preparation would be valuable for the field. We applied an RNA shotgun sequencing (RNA-Seq) method without PCR amplification to cultured viral stocks and patient plasma samples from HIV-2-infected individuals. Libraries generated from total plasma RNA were analyzed with a two-step pipeline: (i) de novo genome assembly, followed by (ii) read remapping. By this approach, whole-genome sequences were generated with a 28× to 67× mean depth of coverage. Assembled reads showed a low level of GC bias, and comparison of the genome diversities at the intrahost level showed low diversity in the accessory gene vpx in all patients. Our study demonstrates that RNA-Seq is a feasible full-genome de novo sequencing method for blood plasma samples collected from HIV-2-infected individuals. IMPORTANCE: An accurate picture of viral genetic diversity is critical for the development of a globally effective HIV vaccine. However, sequencing strategies are often complicated by target enrichment prior to sequencing, introducing biases that can distort variant frequencies, which are not easily corrected for in downstream analyses. Additionally, detailed a priori sequence knowledge is needed to inform robust primer design when employing PCR amplification, a factor that is often lacking when working with tropical diseases localized in developing countries. Previous work has demonstrated that direct RNA shotgun sequencing (RNA-Seq) can be used to circumvent these issues for hepatitis C virus (HCV) and norovirus. We applied RNA-Seq to total RNA extracted from HIV-2 blood plasma samples, demonstrating the applicability of this technique to HIV-2 and allowing us to generate a dynamic picture of genetic diversity over the whole genome of HIV-2 in the context of low-bias sequencing.

Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, Motyer A, Vukcevic D, Delaneau O, O'Connell J et al. 2018. The UK Biobank resource with deep phenotyping and genomic data. Nature, 562 (7726), pp. 203-209.

The UK Biobank project is a prospective cohort study with deep genetic and phenotypic data collected on approximately 500,000 individuals from across the United Kingdom, aged between 40 and 69 at recruitment. The open resource is unique in its size and scope. A rich variety of phenotypic and health-related information is available on each participant, including biological measurements, lifestyle indicators, biomarkers in blood and urine, and imaging of the body and brain. Follow-up information is provided by linking health and medical records. Genome-wide genotype data have been collected on all participants, providing many opportunities for the discovery of new genetic associations and the genetic bases of complex traits. Here we describe the centralized analysis of the genetic data, including genotype quality, properties of population structure and relatedness of the genetic data, and efficient phasing and genotype imputation that increases the number of testable variants to around 96 million. Classical allelic variation at 11 human leukocyte antigen genes was imputed, resulting in the recovery of signals with known associations between human leukocyte antigen alleles and many diseases.

Albers PK, McVean G. Dating genomic variants and shared ancestry in population-scale sequencing data.

The origin and fate of new mutations within species is the fundamental process underlying evolution. However, while much attention has been focused on characterizing the presence, frequency, and phenotypic impact of genetic variation, the evolutionary histories of most variants are largely unexplored. We have developed a non-parametric approach for estimating the date of origin of genetic variants in large-scale sequencing data sets. The accuracy and robustness of the approach are demonstrated through simulation. Using data from two publicly available human genomic diversity resources, we estimated the age of more than 45 million single nucleotide polymorphisms (SNPs) in the human genome and release the Atlas of Variant Age as a public online database. We characterize the relationship between variant age and frequency in different geographical regions, and demonstrate the value of age information in interpreting variants of functional and selective importance. Finally, we use allele age estimates to power a rapid approach for inferring the ancestry shared between individual genomes, to quantify genealogical relationships at different points in the past, as well as describe and explore the evolutionary history of modern human populations.

Frot B, Jostins L, McVean G. 2019. Graphical Model Selection for Gaussian Conditional Random Fields in the Presence of Latent Variables. J Am Stat Assoc, 114 (526), pp. 723-734.

We consider the problem of learning a conditional Gaussian graphical model in the presence of latent variables. Building on recent advances in this field, we suggest a method that decomposes the parameters of a conditional Markov random field into the sum of a sparse and a low-rank matrix. We derive convergence bounds for this estimator and show that it is well-behaved in the high-dimensional regime as well as "sparsistent" (i.e., capable of recovering the graph structure). We then show how proximal gradient algorithms and semi-definite programming techniques can be employed to fit the model to thousands of variables. Through extensive simulations, we illustrate the conditions required for identifiability and show that there is a wide range of situations in which this model performs significantly better than its counterparts, for example, by accommodating more latent variables. Finally, the suggested method is applied to two datasets comprising individual-level data on genetic variants and metabolite levels. We show that our results replicate better than alternative approaches and show enriched biological signal. Supplementary materials for this article are available online.

Turner I, Garimella KV, Iqbal Z, McVean G. 2018. Integrating long-range connectivity information into de Bruijn graphs. Bioinformatics, 34 (15), pp. 2556-2565.

Motivation: The de Bruijn graph is a simple and efficient data structure that is used in many areas of sequence analysis including genome assembly, read error correction and variant calling. The data structure has a single parameter k, is straightforward to implement and is tractable for large genomes with high sequencing depth. It also enables representation of multiple samples simultaneously to facilitate comparison. However, unlike the string graph, a de Bruijn graph does not retain long range information that is inherent in the read data. For this reason, applications that rely on de Bruijn graphs can produce sub-optimal results given their input data. Results: We present a novel assembly graph data structure: the Linked de Bruijn Graph (LdBG). Constructed by adding annotations on top of a de Bruijn graph, it stores long range connectivity information through the graph. We show that with error-free data it is possible to losslessly store and recover sequence from a Linked de Bruijn graph. With assembly simulations we demonstrate that the LdBG data structure outperforms both our de Bruijn graph and the String Graph Assembler (SGA). Finally we apply the LdBG to Klebsiella pneumoniae short read data to make large (12 kbp) variant calls, which we validate using PacBio sequencing data, and to characterize the genomic context of drug-resistance genes. Availability and implementation: Linked de Bruijn Graphs and associated algorithms are implemented as part of McCortex, which is available under the MIT license at https://github.com/mcveanlab/mccortex. Supplementary information: Supplementary data are available at Bioinformatics online.
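The loss of long-range information described in this abstract is easy to reproduce on a toy example. The sketch below is not McCortex and does not implement the link annotations of the LdBG. It uses one common formulation in which nodes are (k-1)-mers and each k-mer in a read contributes an edge; the function names and the two reads are made up for illustration. Two reads that share a short repeat produce a branching node, and the graph alone no longer records which left flank belongs with which right flank.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Nodes with more than one successor, where a traversal has to guess."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


if __name__ == "__main__":
    # Two toy reads sharing the repeat "GGG" with different flanking sequence.
    reads = ["AAGGGTT", "CCGGGAC"]

    # With k=4 the repeat collapses into one node with two ways out.
    print(branching_nodes(build_de_bruijn(reads, k=4)))

    # With k=5 each k-mer spans the repeat plus flanking sequence, so the ambiguity disappears.
    print(branching_nodes(build_de_bruijn(reads, k=5)))
```

Raising k resolves this particular repeat, but real repeats can be much longer than any practical k, which is the gap that read-derived connectivity annotations are meant to fill.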

Flannick J, Fuchsberger C, Mahajan A, Teslovich TM, Agarwala V, Gaulton KJ, Caulkins L, Koesterer R, Ma C, Moutsianas L et al. 2018. Erratum: Sequence data and association statistics from 12,940 type 2 diabetes cases and controls. Sci Data, 5 (1), pp. 180002.

This corrects the article DOI: 10.1038/sdata.2017.179.

Flannick J, Fuchsberger C, Mahajan A, Teslovich TM, Agarwala V, Gaulton KJ, Caulkins L, Koesterer R, Ma C, Moutsianas L et al. 2017. Sequence data and association statistics from 12,940 type 2 diabetes cases and controls. Sci Data, 4 (1), pp. 170179.

To investigate the genetic basis of type 2 diabetes (T2D) to high resolution, the GoT2D and T2D-GENES consortia catalogued variation from whole-genome sequencing of 2,657 European individuals and exome sequencing of 12,940 individuals of multiple ancestries. Over 27M SNPs, indels, and structural variants were identified, including 99% of low-frequency (minor allele frequency [MAF] 0.1-5%) non-coding variants in the whole-genome sequenced individuals and 99.7% of low-frequency coding variants in the whole-exome sequenced individuals. Each variant was tested for association with T2D in the sequenced individuals, and, to increase power, most were tested in larger numbers of individuals (>80% of low-frequency coding variants in ~82 K Europeans via the exome chip, and ~90% of low-frequency non-coding variants in ~44 K Europeans via genotype imputation). The variants, genotypes, and association statistics from these analyses provide the largest reference to date of human genetic information relevant to T2D, for use in activities such as T2D-focused genotype imputation, functional characterization of variants or genes, and other novel analyses to detect associations between sequence variation and T2D.

Davies B, Brown LA, Cais O, Watson J, Clayton AJ, Chang VT, Biggs D, Preece C, Hernandez-Pliego P, Krohn J et al. 2017. A point mutation in the ion conduction pore of AMPA receptor GRIA3 causes dramatically perturbed sleep patterns as well as intellectual disability. Hum Mol Genet, 26 (20), pp. 3869-3882.

The discovery of genetic variants influencing sleep patterns can shed light on the physiological processes underlying sleep. As part of a large clinical sequencing project, WGS500, we sequenced a family in which the two male children had severe developmental delay and a dramatically disturbed sleep-wake cycle, with very long wake and sleep durations, reaching up to 106 h awake and 48 h asleep. The most likely causal variant identified was a novel missense variant in the X-linked GRIA3 gene, which has been implicated in intellectual disability. GRIA3 encodes GluA3, a subunit of AMPA-type ionotropic glutamate receptors (AMPARs). The mutation (A653T) falls within the highly conserved transmembrane domain of the ion channel gate, immediately adjacent to the analogous residue in the Grid2 (glutamate receptor) gene, which is mutated in the mouse neurobehavioral mutant, Lurcher. In vitro, the GRIA3(A653T) mutation stabilizes the channel in a closed conformation, in contrast to Lurcher. We introduced the orthologous mutation into a mouse strain by CRISPR-Cas9 mutagenesis and found that hemizygous mutants displayed significant differences in the structure of their activity and sleep compared to wild-type littermates. Typically, mice are polyphasic, exhibiting multiple bouts of sleep several minutes long within a 24-h period. The Gria3(A653T) mouse showed significantly fewer brief bouts of activity and sleep than the wild-types. Furthermore, Gria3(A653T) mice showed enhanced period lengthening under constant light compared to wild-type mice, suggesting an increased sensitivity to light. Our results suggest a role for GluA3 channel activity in the regulation of sleep behavior in both mice and humans.

Cutts A, Venn O, Dilthey A, Gupta A, Vavoulis D, Dreau H, Middleton M, McVean G, Taylor JC, Schuh A. 2017. Characterisation of the changing genomic landscape of metastatic melanoma using cell free DNA. NPJ Genom Med, 2 (1), pp. 25.

Cancer is characterised by complex somatically acquired genetic aberrations that manifest as intra-tumour and inter-tumour genetic heterogeneity and can lead to treatment resistance. In this case study, we characterise the genome-wide somatic mutation dynamics in a metastatic melanoma patient during therapy using low-input (50 ng) PCR-free whole genome sequencing of cell-free DNA from pre-treatment and post-relapse blood samples. We identify de novo tumour-specific somatic mutations from cell-free DNA, while the sequence context of single nucleotide variants showed the characteristic UV-damage mutation signature of melanoma. To investigate the behaviour of individual somatic mutations during proto-oncogene B-Raf-targeted and immune checkpoint inhibition, amplicon-based deep sequencing was used to verify and track frequencies of 212 single nucleotide variants at 10 distinct time points over 13 months of treatment. Under checkpoint inhibition therapy, we observed an increase in mutant allele frequencies indicating progression on therapy 88 days before clinical determination of non-response by positron emission tomography-computed tomography. We also revealed mutations from whole genome sequencing of cell-free DNA that were not present in the tissue biopsy, but that later contributed to relapse. Our findings have potential clinical applications where high quality tumour-tissue derived DNA is not available.

Zhu SJ, Almagro-Garcia J, McVean G. 2018. Deconvolution of multiple infections in Plasmodium falciparum from high throughput sequencing data. Bioinformatics, 34 (1), pp. 9-15.

Motivation: The presence of multiple infecting strains of the malarial parasite Plasmodium falciparum affects key phenotypic traits, including drug resistance and risk of severe disease. Advances in protocols and sequencing technology have made it possible to obtain high-coverage genome-wide sequencing data from blood samples and blood spots taken in the field. However, analyzing and interpreting such data is challenging because of the high rate of multiple infections present. Results: We have developed a statistical method and implementation for deconvolving multiple genome sequences present in an individual with mixed infections. The software package DEploid uses haplotype structure within a reference panel of clonal isolates as a prior for haplotypes present in a given sample. It estimates the number of strains, their relative proportions and the haplotypes presented in a sample, allowing researchers to study multiple infection in malaria with an unprecedented level of detail. Availability and implementation: The open source implementation DEploid is freely available at https://github.com/mcveanlab/DEploid under the conditions of the GPLv3 license. An R version is available at https://github.com/mcveanlab/DEploid-r. Contact: joe.zhu@bdi.ox.ac.uk or gil.mcvean@bdi.ox.ac.uk. Supplementary information: Supplementary data are available at Bioinformatics online.

Cortes A, Dendrou CA, Motyer A, Jostins L, Vukcevic D, Dilthey A, Donnelly P, Leslie S, Fugger L, McVean G. 2017. Bayesian analysis of genetic association across tree-structured routine healthcare data in the UK Biobank. Nat Genet, 49 (9), pp. 1311-1318.

Genetic discovery from the multitude of phenotypes extractable from routine healthcare data can transform understanding of the human phenome and accelerate progress toward precision medicine. However, a critical question when analyzing high-dimensional and heterogeneous data is how best to interrogate increasingly specific subphenotypes while retaining statistical power to detect genetic associations. Here we develop and employ a new Bayesian analysis framework that exploits the hierarchical structure of diagnosis classifications to analyze genetic variants against UK Biobank disease phenotypes derived from self-reporting and hospital episode statistics. Our method displays a more than 20% increase in power to detect genetic effects over other approaches and identifies new associations between classical human leukocyte antigen (HLA) alleles and common immune-mediated diseases (IMDs). By applying the approach to genetic risk scores (GRSs), we show the extent of genetic sharing among IMDs and expose differences in disease perception or diagnosis with potential clinical implications.

Kaur G, Gras S, Mobbs JI, Vivian JP, Cortes A, Barber T, Kuttikkatte SB, Jensen LT, Attfield KE, Dendrou CA et al. 2017. Structural and regulatory diversity shape HLA-C protein expression levels. Nat Commun, 8 (1), pp. 15924.

Expression of HLA-C varies widely across individuals in an allele-specific manner. This variation in expression can influence efficacy of the immune response, as shown for infectious and autoimmune diseases. MicroRNA binding partially influences differential HLA-C expression, but the additional contributing factors have remained undetermined. Here we use functional and structural analyses to demonstrate that HLA-C expression is modulated not just at the RNA level, but also at the protein level. Specifically, we show that variation in exons 2 and 3, which encode the α1/α2 domains, drives differential expression of HLA-C allomorphs at the cell surface by influencing the structure of the peptide-binding cleft and the diversity of peptides bound by the HLA-C molecules. Together with a phylogenetic analysis, these results highlight the diversity and long-term balancing selection of regulatory factors that modulate HLA-C expression.

Palmer DS, Turner I, Fidler S, Frater J, Goulder P, Goedhals D, Huang K-HG, Oxenius A, Phillips R, Shapiro R et al. Mapping the drivers of within-host pathogen evolution using massive data sets.

Differences among hosts, resulting from genetic variation in the immune system or heterogeneity in drug treatment, can impact within-host pathogen evolution. Identifying such interactions can potentially be achieved through genetic association studies. However, extensive and correlated genetic population structure in hosts and pathogens presents a substantial risk of confounding analyses. Moreover, the multiple testing burden of interaction scanning can potentially limit power. To address these problems, we have developed a Bayesian approach for detecting host influences on pathogen evolution that makes use of vast existing data sets of pathogen diversity to improve power and control for stratification. The approach models key processes, including recombination and selection, and identifies regions of the pathogen genome affected by host factors. Using simulations and empirical analysis of drug-induced selection on the HIV-1 genome we demonstrate the power of the method to recover known associations and show greatly improved precision-recall characteristics compared to other approaches. We build a high-resolution map of HLA-induced selection in the HIV-1 genome, identifying novel epitope-allele combinations.

Turner I, Garimella KV, Iqbal Z, McVean G. Integrating long-range connectivity information into de Bruijn graphs.

Motivation: The de Bruijn graph is a simple and efficient data structure that is used in many areas of sequence analysis including genome assembly, read error correction and variant calling. The data structure has a single parameter k, is straightforward to implement and is tractable for large genomes with high sequencing depth. It also enables representation of multiple samples simultaneously to facilitate comparison. However, unlike the string graph, a de Bruijn graph does not retain long range information that is inherent in the read data. For this reason, applications that rely on de Bruijn graphs can produce sub-optimal results given their input. Results: We present a novel assembly graph data structure: the Linked de Bruijn Graph (LdBG). Constructed by adding annotations on top of a de Bruijn graph, it stores long range connectivity information through the graph. We show that with error-free data it is possible to losslessly store and recover sequence from a Linked de Bruijn graph. With assembly simulations we demonstrate that the LdBG data structure outperforms both the de Bruijn graph and the String Graph Assembler (SGA). Finally we apply the LdBG to Klebsiella pneumoniae short read data to make large (12 kbp) variant calls, which we validate using PacBio sequencing data, and to characterise the genomic context of drug-resistance genes. Availability: Linked de Bruijn Graphs and associated algorithms are implemented as part of McCortex, available under the MIT license at https://github.com/mcvean/mccortex. Contact: turner.isaac@gmail.com.

Giannoulatou E, Maher GJ, Ding Z, Gillis AJM, Dorssers LCJ, Hoischen A, Rajpert-De Meyts E, WGS500 Consortium, McVean G, Wilkie AOM et al. 2017. Whole-genome sequencing of spermatocytic tumors provides insights into the mutational processes operating in the male germline. PLoS One, 12 (5), pp. e0178169.

Adult male germline stem cells (spermatogonia) proliferate by mitosis and, after puberty, generate spermatocytes that undertake meiosis to produce haploid spermatozoa. Germ cells are under evolutionary constraint to curtail mutations and maintain genome integrity. Despite constant turnover, spermatogonia very rarely form tumors, so-called spermatocytic tumors (SpT). In line with the previous identification of FGFR3 and HRAS selfish mutations in a subset of cases, candidate gene screening of 29 SpTs identified an oncogenic NRAS mutation in two cases. To gain insights into the etiology of SpT and into properties of the male germline, we performed whole-genome sequencing of five tumors (4/5 with matched normal tissue). The acquired single nucleotide variant load was extremely low (~0.2 per Mb), with an average of 6 (2-9) non-synonymous variants per tumor, none of which is likely to be oncogenic. The observed mutational signature of SpTs is strikingly similar to that of germline de novo mutations, mostly involving C>T transitions with a significant enrichment in the ACG trinucleotide context. The tumors exhibited extensive aneuploidy (50-99 autosomes/tumor) involving whole-chromosomes, with recurrent gains of chr9 and chr20 and loss of chr7, suggesting that aneuploidy itself represents the initiating oncogenic event. We propose that SpT etiology recapitulates the unique properties of male germ cells; because of evolutionary constraints to maintain low point mutation rate, rare tumorigenic driver events are caused by a combination of gene imbalance mediated via whole-chromosome aneuploidy. Finally, we propose a general framework of male germ cell tumor pathology that accounts for their mutational landscape, timing and cellular origin.

Ansari MA, Pedergnana V, L C Ip C, Magri A, Von Delft A, Bonsall D, Chaturvedi N, Bartha I, Smith D, Nicholson G et al. 2017. Genome-to-genome analysis highlights the effect of the human innate and adaptive immune systems on the hepatitis C virus. Nat Genet, 49 (5), pp. 666-673.

Outcomes of hepatitis C virus (HCV) infection and treatment depend on viral and host genetic factors. Here we use human genome-wide genotyping arrays and new whole-genome HCV viral sequencing technologies to perform a systematic genome-to-genome study of 542 individuals who were chronically infected with HCV, predominantly genotype 3. We show that both alleles of genes encoding human leukocyte antigen molecules and genes encoding components of the interferon lambda innate immune system drive viral polymorphism. Additionally, we show that IFNL4 genotypes determine HCV viral load through a mechanism dependent on a specific amino acid residue in the HCV NS5A protein. These findings highlight the interplay between the innate immune system and the viral genome in HCV control.

Manning A, Highland HM, Gasser J, Sim X, Tukiainen T, Fontanillas P, Grarup N, Rivas MA, Mahajan A, Locke AE et al. 2017. A Low-Frequency Inactivating AKT2 Variant Enriched in the Finnish Population Is Associated With Fasting Insulin Levels and Type 2 Diabetes Risk. Diabetes, 66 (7), pp. 2019-2032.

To identify novel coding association signals and facilitate characterization of mechanisms influencing glycemic traits and type 2 diabetes risk, we analyzed 109,215 variants derived from exome array genotyping together with an additional 390,225 variants from exome sequence in up to 39,339 normoglycemic individuals from five ancestry groups. We identified a novel association between the coding variant (p.Pro50Thr) in AKT2 and fasting plasma insulin (FI), a gene in which rare fully penetrant mutations are causal for monogenic glycemic disorders. The low-frequency allele is associated with a 12% increase in FI levels. This variant is present at 1.1% frequency in Finns but virtually absent in individuals from other ancestries. Carriers of the FI-increasing allele had increased 2-h insulin values, decreased insulin sensitivity, and increased risk of type 2 diabetes (odds ratio 1.05). In cellular studies, the AKT2-Thr50 protein exhibited a partial loss of function. We extend the allelic spectrum for coding variants in AKT2 associated with disorders of glucose homeostasis and demonstrate bidirectional effects of variants within the pleckstrin homology domain of AKT2.

Cortes A, Dendrou CA, Motyer A, Jostins L, Vukcevic D, Dilthey A, Donnelly P, Leslie S, Fugger L, McVean G. Bayesian analysis of genetic association across tree-structured routine healthcare data in the UK Biobank.

Genetic discovery from the multitude of phenotypes extractable from routine healthcare data has the ability to radically transform our understanding of the human phenome, thereby accelerating progress towards precision medicine. However, a critical question when analysing high-dimensional and heterogeneous data is how to interrogate increasingly specific subphenotypes whilst retaining statistical power to detect genetic associations. Here we develop and employ a novel Bayesian analysis framework that exploits the hierarchical structure of diagnosis classifications to jointly analyse genetic variants against UK Biobank healthcare phenotypes. Our method displays a more than 20% increase in power to detect genetic effects over other approaches, such that we uncover the broader burden of genetic variation: we identify associations with over 2,000 diagnostic terms. We find novel associations with common immune-mediated diseases (IMD), we reveal the extent of genetic sharing between specific IMDs, and we expose differences in disease perception or diagnosis with potential clinical implications.

Motyer A, Vukcevic D, Dilthey A, Donnelly P, McVean G, Leslie S. Practical Use of Methods for Imputation of HLA Alleles from SNP Genotype Data.

The human leukocyte antigen (HLA) genes play an essential role in immune function. Typing of HLA alleles is critical for transplantation and is informative for many disease associations. The high cost of accurate lab-based HLA typing has precluded its use in large-scale disease-association studies. The development of statistical methods to type alleles using linkage disequilibrium with nearby SNPs, called HLA imputation, has allowed large cohorts of individuals to be typed accurately, so that massive numbers of affected individuals and controls may be studied. This has resulted in many important findings. Several HLA imputation methods have been widely used; however, their relative performance has not been adequately addressed. We have conducted a comprehensive study to evaluate the most widely used HLA imputation methods. We assembled a multi-ethnic panel of 10,561 individuals with SNP genotype data and lab-based typing of alleles at 11 HLA genes at two-field resolution, and used it to train and validate each method. Use of this panel leads to imputation accuracy far superior to what is currently publicly available. We present a highly accurate new imputation method, HLA*IMP:03. We address the question of optimal use of HLA imputations in tests of genetic association, showing that it is usually not necessary to apply a probability threshold to achieve maximal power. We also investigated the effect on accuracy of SNP density and population stratification at the continental level and show that neither of these is a significant concern.

Dendrou CA, McVean G, Fugger L. 2016. Neuroinflammation - using big data to inform clinical practice. Nat Rev Neurol, 12 (12), pp. 685-698.

Neuroinflammation is emerging as a central process in many neurological conditions, either as a causative factor or as a secondary response to nervous system insult. Understanding the causes and consequences of neuroinflammation could, therefore, provide insight that is needed to improve therapeutic interventions across many diseases. However, the complexity of the pathways involved necessitates the use of high-throughput approaches to extensively interrogate the process, and appropriate strategies to translate the data generated into clinical benefit. Use of 'big data' aims to generate, integrate and analyse large, heterogeneous datasets to provide in-depth insights into complex processes, and has the potential to unravel the complexities of neuroinflammation. Limitations in data analysis approaches currently prevent the full potential of big data being reached, but some aspects of big data are already yielding results. The implementation of 'omics' analyses in particular is becoming routine practice in biomedical research, and neuroimaging is producing large sets of complex data. In this Review, we evaluate the impact of the drive to collect and analyse big data on our understanding of neuroinflammation in disease. We describe the breadth of big data that are leading to an evolution in our understanding of this field, exemplify how these data are beginning to be of use in a clinical setting, and consider possible future directions.

Eberle MA, Fritzilas E, Krusche P, Källberg M, Moore BL, Bekritsky MA, Iqbal Z, Chuang H-Y, Humphray SJ, Halpern AL et al. 2017. A reference data set of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree. Genome Res, 27 (1), pp. 157-164.

Improvement of variant calling in next-generation sequence data requires a comprehensive, genome-wide catalog of high-confidence variants called in a set of genomes for use as a benchmark. We generated deep, whole-genome sequence data of 17 individuals in a three-generation pedigree and called variants in each genome using a range of currently available algorithms. We used haplotype transmission information to create a phased "Platinum" variant catalog of 4.7 million single-nucleotide variants (SNVs) plus 0.7 million small (1-50 bp) insertions and deletions (indels) that are consistent with the pattern of inheritance in the parents and 11 children of this pedigree. Platinum genotypes are highly concordant with the current catalog of the National Institute of Standards and Technology for both SNVs (>99.99%) and indels (99.92%) and add a validated truth catalog that has 26% more SNVs and 45% more indels. Analysis of 334,652 SNVs that were consistent between informatics pipelines yet inconsistent with haplotype transmission ("nonplatinum") revealed that the majority of these variants are de novo and cell-line mutations or reside within previously unidentified duplications and deletions. The reference materials from this study are a resource for objective assessment of the accuracy of variant calls throughout genomes.

Dendrou CA, Cortes A, Shipman L, Evans HG, Attfield KE, Jostins L, Barber T, Kaur G, Kuttikkatte SB, Leach OA et al. 2016. Resolving TYK2 locus genotype-to-phenotype differences in autoimmunity. Sci Transl Med, 8 (363), pp. 363ra149.

Thousands of genetic variants that contribute to the development of complex diseases have been identified, but determining how to elucidate their biological consequences for translation into clinical benefit is challenging. Conflicting evidence regarding the functional impact of genetic variants in the tyrosine kinase 2 (TYK2) gene, which is differentially associated with common autoimmune diseases, currently obscures the potential of TYK2 as a therapeutic target. We aimed to resolve this conflict by performing genetic meta-analysis across disorders; subsequent molecular, cellular, in vivo, and structural functional follow-up; and epidemiological studies. Our data revealed a protective homozygous effect that defined a signaling optimum between autoimmunity and immunodeficiency and identified TYK2 as a potential drug target for certain common autoimmune disorders.

Dilthey AT, Gourraud P-A, Mentzer AJ, Cereb N, Iqbal Z, McVean G. 2016. High-Accuracy HLA Type Inference from Whole-Genome Sequencing Data Using Population Reference Graphs. PLoS Comput Biol, 12 (10), pp. e1005151.

Genetic variation at the Human Leucocyte Antigen (HLA) genes is associated with many autoimmune and infectious disease phenotypes, is an important element of the immunological distinction between self and non-self, and shapes immune epitope repertoires. Determining the allelic state of the HLA genes (HLA typing) as a by-product of standard whole-genome sequencing data would therefore be highly desirable and enable the immunogenetic characterization of samples in currently ongoing population sequencing projects. Extensive hyperpolymorphism and sequence similarity between the HLA genes, however, pose problems for accurate read mapping and make HLA type inference from whole-genome sequencing data a challenging problem. We describe how to address these challenges in a Population Reference Graph (PRG) framework. First, we construct a PRG for 46 (mostly HLA) genes and pseudogenes, their genomic context and their characterized sequence variants, integrating a database of over 10,000 known allele sequences. Second, we present a sequence-to-PRG paired-end read mapping algorithm that enables accurate read mapping for the HLA genes. Third, we infer the most likely pair of underlying alleles at G group resolution from the IMGT/HLA database at each locus, employing a simple likelihood framework. We show that HLA*PRG, our algorithm, outperforms existing methods by a wide margin. We evaluate HLA*PRG on six classical class I and class II HLA genes (HLA-A, -B, -C, -DQA1, -DQB1, -DRB1) and on a set of 14 samples (3 samples with 2 x 100bp, 11 samples with 2 x 250bp Illumina HiSeq data). Of 158 alleles tested, we correctly infer 157 alleles (99.4%). We also identify and re-type two erroneous alleles in the original validation data. We conclude that HLA*PRG for the first time achieves accuracies comparable to gold-standard reference methods from standard whole-genome sequencing data, though high computational demands (currently ~30-250 CPU hours per sample) remain a significant challenge to practical application.

Miles A, Iqbal Z, Vauterin P, Pearson R, Campino S, Theron M, Gould K, Mead D, Drury E, O'Brien J et al. 2016. Indels, structural variation, and recombination drive genomic diversity in Plasmodium falciparum. Genome Res, 26 (9), pp. 1288-1299.

The malaria parasite Plasmodium falciparum has a great capacity for evolutionary adaptation to evade host immunity and develop drug resistance. Current understanding of parasite evolution is impeded by the fact that a large fraction of the genome is either highly repetitive or highly variable and thus difficult to analyze using short-read sequencing technologies. Here, we describe a resource of deep sequencing data on parents and progeny from genetic crosses, which has enabled us to perform the first genome-wide, integrated analysis of SNP, indel and complex polymorphisms, using Mendelian error rates as an indicator of genotypic accuracy. These data reveal that indels are exceptionally abundant, being more common than SNPs and thus the dominant mode of polymorphism within the core genome. We use the high density of SNP and indel markers to analyze patterns of meiotic recombination, confirming a high rate of crossover events and providing the first estimates for the rate of non-crossover events and the length of conversion tracts. We observe several instances of meiotic recombination within copy number variants associated with drug resistance, demonstrating a mechanism whereby fitness costs associated with resistance mutations could be compensated and greater phenotypic plasticity could be acquired.

Fuchsberger C, Flannick J, Teslovich TM, Mahajan A, Agarwala V, Gaulton KJ, Ma C, Fontanillas P, Moutsianas L, McCarthy DJ et al. 2016. The genetic architecture of type 2 diabetes. Nature, 536 (7614), pp. 41-47.

The genetic architecture of common traits, including the number, frequency, and effect sizes of inherited variants that contribute to individual risk, has been long debated. Genome-wide association studies have identified scores of common variants associated with type 2 diabetes, but in aggregate, these explain only a fraction of the heritability of this disease. Here, to test the hypothesis that lower-frequency variants explain much of the remainder, the GoT2D and T2D-GENES consortia performed whole-genome sequencing in 2,657 European individuals with and without diabetes, and exome sequencing in 12,940 individuals from five ancestry groups. To increase statistical power, we expanded the sample size via genotyping and imputation in a further 111,548 subjects. Variants associated with type 2 diabetes after sequencing were overwhelmingly common and most fell within regions previously identified by genome-wide association studies. Comprehensive enumeration of sequence variation is necessary to identify functional alleles that provide important clues to disease pathophysiology, but large-scale sequencing does not support the idea that lower-frequency variants have a major role in predisposition to type 2 diabetes.

Choi K, Reinhard C, Serra H, Ziolkowski PA, Underwood CJ, Zhao X, Hardcastle TJ, Yelina NE, Griffin C, Jackson M et al. 2016. Recombination Rate Heterogeneity within Arabidopsis Disease Resistance Genes. PLoS Genet, 12 (7), pp. e1006179.

Meiotic crossover frequency varies extensively along chromosomes and is typically concentrated in hotspots. As recombination increases genetic diversity, hotspots are predicted to occur at immunity genes, where variation may be beneficial. A major component of plant immunity is recognition of pathogen Avirulence (Avr) effectors by resistance (R) genes that encode NBS-LRR domain proteins. Therefore, we sought to test whether NBS-LRR genes would overlap with meiotic crossover hotspots using experimental genetics in Arabidopsis thaliana. NBS-LRR genes tend to physically cluster in plant genomes; for example, in Arabidopsis most are located in large clusters on the south arms of chromosomes 1 and 5. We experimentally mapped 1,439 crossovers within these clusters and observed NBS-LRR gene associated hotspots, which were also detected as historical hotspots via analysis of linkage disequilibrium. However, we also observed NBS-LRR gene coldspots, which in some cases correlate with structural heterozygosity. To study recombination at the fine-scale we used high-throughput sequencing to analyze ~1,000 crossovers within the RESISTANCE TO ALBUGO CANDIDA1 (RAC1) R gene hotspot. This revealed elevated intragenic crossovers, overlapping nucleosome-occupied exons that encode the TIR, NBS and LRR domains. The highest RAC1 recombination frequency was promoter-proximal and overlapped CTT-repeat DNA sequence motifs, which have previously been associated with plant crossover hotspots. Additionally, we show a significant influence of natural genetic variation on NBS-LRR cluster recombination rates, using crosses between Arabidopsis ecotypes. In conclusion, we show that a subset of NBS-LRR genes are strong hotspots, whereas others are coldspots. This reveals a complex recombination landscape in Arabidopsis NBS-LRR genes, which we propose results from varying coevolutionary pressures exerted by host-pathogen relationships, and is influenced by structural heterozygosity.

Hellner K, Miranda F, Fotso Chedom D, Herrero-Gonzalez S, Hayden DM, Tearle R, Artibani M, KaramiNejadRanjbar M, Williams R, Gaitskell K et al. 2016. Premalignant SOX2 overexpression in the fallopian tubes of ovarian cancer patients: Discovery and validation studies. EBioMedicine, 10 pp. 137-149.

Current screening methods for ovarian cancer can only detect advanced disease. Earlier detection has proved difficult because the molecular precursors involved in the natural history of the disease are unknown. To identify early driver mutations in ovarian cancer cells, we used dense whole genome sequencing of micrometastases and microscopic residual disease collected at three time points over three years from a single patient during treatment for high-grade serous ovarian cancer (HGSOC). The functional and clinical significance of the identified mutations was examined using a combination of population-based whole genome sequencing, targeted deep sequencing, multi-center analysis of protein expression, loss of function experiments in an in-vivo reporter assay and mammalian models, and gain of function experiments in primary cultured fallopian tube epithelial (FTE) cells. We identified frequent mutations involving a 40kb distal repressor region for the key stem cell differentiation gene SOX2. In the apparently normal FTE, the region was also mutated. This was associated with a profound increase in SOX2 expression (p<2(-16)), which was not found in patients without cancer (n=108). Importantly, we show that SOX2 overexpression in FTE is nearly ubiquitous in patients with HGSOCs (n=100), and common in BRCA1-BRCA2 mutation carriers (n=71) who underwent prophylactic salpingo-oophorectomy. We propose that the finding of SOX2 overexpression in FTE could be exploited to develop biomarkers for detecting disease at a premalignant stage, which would reduce mortality from this devastating disease.

Kelleher J, Etheridge AM, McVean G. 2016. Efficient Coalescent Simulation and Genealogical Analysis for Large Sample Sizes. PLoS Comput Biol, 12 (5), pp. e1004842.

A central challenge in the analysis of genetic variation is to provide realistic genome simulation across millions of samples. Present day coalescent simulations do not scale well, or use approximations that fail to capture important long-range linkage properties. Analysing the results of simulations also presents a substantial challenge, as current methods to store genealogies consume a great deal of space, are slow to parse and do not take advantage of shared structure in correlated trees. We solve these problems by introducing sparse trees and coalescence records as the key units of genealogical analysis. Using these tools, exact simulation of the coalescent with recombination for chromosome-sized regions over hundreds of thousands of samples is possible, and substantially faster than present-day approximate methods. We can also analyse the results orders of magnitude more quickly than with existing methods.

Bradley P, Gordon NC, Walker TM, Dunn L, Heys S, Huang B, Earle S, Pankhurst LJ, Anson L, de Cesare M et al. 2016. Corrigendum: Rapid antibiotic-resistance predictions from genome sequence data for Staphylococcus aureus and Mycobacterium tuberculosis. Nat Commun, 7 (1), pp. 11465.

Earle SG, Wu C-H, Charlesworth J, Stoesser N, Gordon NC, Walker TM, Spencer CCA, Iqbal Z, Clifton DA, Hopkins KL et al. 2016. Identifying lineage effects when controlling for population structure improves power in bacterial association studies. Nat Microbiol, 1 (5), pp. 16041.

Bacteria pose unique challenges for genome-wide association studies because of strong structuring into distinct strains and substantial linkage disequilibrium across the genome(1,2). Although methods developed for human studies can correct for strain structure(3,4), this risks considerable loss-of-power because genetic differences between strains often contribute substantial phenotypic variability(5). Here, we propose a new method that captures lineage-level associations even when locus-specific associations cannot be fine-mapped. We demonstrate its ability to detect genes and genetic variants underlying resistance to 17 antimicrobials in 3,144 isolates from four taxonomically diverse clonal and recombining bacteria: Mycobacterium tuberculosis, Staphylococcus aureus, Escherichia coli and Klebsiella pneumoniae. Strong selection, recombination and penetrance confer high power to recover known antimicrobial resistance mechanisms and reveal a candidate association between the outer membrane porin nmpC and cefazolin resistance in E. coli. Hence, our method pinpoints locus-specific effects where possible and boosts power by detecting lineage-level differences when fine-mapping is intractable.

MalariaGEN Plasmodium falciparum Community Project. 2016. Genomic epidemiology of artemisinin resistant malaria. eLife, 5 (March 2016).

The current epidemic of artemisinin resistant Plasmodium falciparum in Southeast Asia is the result of a soft selective sweep involving at least 20 independent kelch13 mutations. In a large global survey, we find that kelch13 mutations which cause resistance in Southeast Asia are present at low frequency in Africa. We show that African kelch13 mutations have originated locally, and that kelch13 shows a normal variation pattern relative to other genes in Africa, whereas in Southeast Asia there is a great excess of non-synonymous mutations, many of which cause radical amino-acid changes. Thus, kelch13 is not currently undergoing strong selection in Africa, despite a deep reservoir of variations that could potentially allow resistance to emerge rapidly. The practical implications are that public health surveillance for artemisinin resistance should not rely on kelch13 data alone, and interventions to prevent resistance must account for local evolutionary conditions, shown by genomic epidemiology to differ greatly between geographical regions.

Jostins L, McVean G. 2016. Trinculo: Bayesian and frequentist multinomial logistic regression for genome-wide association studies of multi-category phenotypes. Bioinformatics, 32 (12), pp. 1898-1900.

MOTIVATION: For many classes of disease the same genetic risk variants underlie many related phenotypes or disease subtypes. Multinomial logistic regression provides an attractive framework to analyze multi-category phenotypes, and explore the genetic relationships between these phenotype categories. We introduce Trinculo, a program that implements a wide range of multinomial analyses in a single fast package that is designed to be easy to use by users of standard genome-wide association study software. AVAILABILITY AND IMPLEMENTATION: An open source C implementation, with code and binaries for Linux and Mac OSX, is available for download at http://sourceforge.net/projects/trinculo SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. CONTACT: lj4@well.ox.ac.uk.

Bradley P, Gordon NC, Walker TM, Dunn L, Heys S, Huang B, Earle S, Pankhurst LJ, Anson L, de Cesare M et al. 2015. Rapid antibiotic-resistance predictions from genome sequence data for Staphylococcus aureus and Mycobacterium tuberculosis. Nat Commun, 6 (1), pp. 10063.

The rise of antibiotic-resistant bacteria has led to an urgent need for rapid detection of drug resistance in clinical samples, and improvements in global surveillance. Here we show how de Bruijn graph representation of bacterial diversity can be used to identify species and resistance profiles of clinical isolates. We implement this method for Staphylococcus aureus and Mycobacterium tuberculosis in a software package ('Mykrobe predictor') that takes raw sequence data as input, and generates a clinician-friendly report within 3 minutes on a laptop. For S. aureus, the error rates of our method are comparable to gold-standard phenotypic methods, with sensitivity/specificity of 99.1%/99.6% across 12 antibiotics (using an independent validation set, n=470). For M. tuberculosis, our method predicts resistance with sensitivity/specificity of 82.6%/98.5% (independent validation set, n=1,609); sensitivity is lower here, probably because of limited understanding of the underlying genetic mechanisms. We give evidence that minor alleles improve detection of extremely drug-resistant strains, and demonstrate feasibility of the use of emerging single-molecule nanopore sequencing techniques for these purposes.

Singhal S, Leffler EM, Sannareddy K, Turner I, Venn O, Hooper DM, Strand AI, Li Q, Raney B, Balakrishnan CN et al. 2015. Stable recombination hotspots in birds. Science, 350 (6263), pp. 928-932.

The DNA-binding protein PRDM9 has a critical role in specifying meiotic recombination hotspots in mice and apes, but it appears to be absent from other vertebrate species, including birds. To study the evolution and determinants of recombination in species lacking the gene that encodes PRDM9, we inferred fine-scale genetic maps from population resequencing data for two bird species: the zebra finch, Taeniopygia guttata, and the long-tailed finch, Poephila acuticauda. We found that both species have recombination hotspots, which are enriched near functional genomic elements. Unlike in mice and apes, most hotspots are shared between the two species, and their conservation seems to extend over tens of millions of years. These observations suggest that in the absence of PRDM9, recombination targets functional features that both enable access to the genome and constrain its evolution.

Vukcevic D, Traherne JA, Næss S, Ellinghaus E, Kamatani Y, Dilthey A, Lathrop M, Karlsen TH, Franke A, Moffatt M et al. 2015. Imputation of KIR Types from SNP Variation Data. Am J Hum Genet, 97 (4), pp. 593-607.

Large population studies of immune system genes are essential for characterizing their role in diseases, including autoimmune conditions. Of key interest are a group of genes encoding the killer cell immunoglobulin-like receptors (KIRs), which have known and hypothesized roles in autoimmune diseases, resistance to viruses, reproductive conditions, and cancer. These genes are highly polymorphic, which makes typing expensive and time consuming. Consequently, despite their importance, KIRs have been little studied in large cohorts. Statistical imputation methods developed for other complex loci (e.g., human leukocyte antigen [HLA]) on the basis of SNP data provide an inexpensive high-throughput alternative to direct laboratory typing of these loci and have enabled important findings and insights for many diseases. We present KIR*IMP, a method for imputation of KIR copy number. We show that KIR*IMP is highly accurate and thus allows the study of KIRs in large cohorts and enables detailed investigation of the role of KIRs in human disease.

1000 Genomes Project Consortium, Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR. 2015. A global reference for human genetic variation. Nature, 526 (7571), pp. 68-74.

The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.

Moutsianas L, Jostins L, Beecham AH, Dilthey AT, Xifara DK, Ban M, Shah TS, Patsopoulos NA, Alfredsson L, Anderson CA et al. 2015. Class II HLA interactions modulate genetic risk for multiple sclerosis. Nat Genet, 47 (10), pp. 1107-1113.

Association studies have greatly refined the understanding of how variation within the human leukocyte antigen (HLA) genes influences risk of multiple sclerosis. However, the extent to which major effects are modulated by interactions is poorly characterized. We analyzed high-density SNP data on 17,465 cases and 30,385 controls from 11 cohorts of European ancestry, in combination with imputation of classical HLA alleles, to build a high-resolution map of HLA genetic risk and assess the evidence for interactions involving classical HLA alleles. Among new and previously identified class II risk alleles (HLA-DRB1*15:01, HLA-DRB1*13:03, HLA-DRB1*03:01, HLA-DRB1*08:01 and HLA-DQB1*03:02) and class I protective alleles (HLA-A*02:01, HLA-B*44:02, HLA-B*38:01 and HLA-B*55:01), we find evidence for two interactions involving pairs of class II alleles: HLA-DQA1*01:01-HLA-DRB1*15:01 and HLA-DQB1*03:01-HLA-DQB1*03:02. We find no evidence for interactions between classical HLA alleles and non-HLA risk-associated variants and estimate a minimal effect of polygenic epistasis in modulating major risk alleles.

Tyler-Smith C, Yang H, Landweber LF, Dunham I, Knoppers BM, Donnelly P, Mardis ER, Snyder M, McVean G. 2015. Where Next for Genetics and Genomics? PLoS Biol, 13 (7), pp. e1002216.

The last few decades have utterly transformed genetics and genomics, but what might the next ten years bring? PLOS Biology asked eight leaders spanning a range of related areas to give us their predictions. Without exception, the predictions are for more data on a massive scale and of more diverse types. All are optimistic and predict enormous positive impact on scientific understanding, while a recurring theme is the benefit of such data for the transformation and personalization of medicine. Several also point out that the biggest changes will very likely be those that we don't foresee, even now.

Parham LR, Briley LP, Li L, Shen J, Newcombe PJ, King KS, Slater AJ, Dilthey A, Iqbal Z, McVean G et al. 2016. Comprehensive genome-wide evaluation of lapatinib-induced liver injury yields a single genetic signal centered on known risk allele HLA-DRB1*07:01. Pharmacogenomics J, 16 (2), pp. 180-185.

Lapatinib is associated with a low incidence of serious liver injury. Previous investigations have identified and confirmed the Class II allele HLA-DRB1*07:01 to be strongly associated with lapatinib-induced liver injury; however, the moderate positive predictive value limits its clinical utility. To assess whether additional genetic variants located within the major histocompatibility complex locus or elsewhere in the genome may influence lapatinib-induced liver injury risk, and potentially lead to a genetic association with improved predictive qualities, we have taken two approaches: a genome-wide association study and a whole-genome sequencing study. This evaluation did not reveal additional associations other than the previously identified association for HLA-DRB1*07:01. The present study represents the most comprehensive genetic evaluation of drug-induced liver injury (DILI) or hypersensitivity, and suggests that investigation of possible human leukocyte antigen associations with DILI and other hypersensitivities represents an important first step in understanding the mechanism of these events.

Taylor JC, Martin HC, Lise S, Broxholme J, Cazier J-B, Rimmer A, Kanapin A, Lunter G, Fiddy S, Allan C et al. 2015. Factors influencing success of clinical genome sequencing across a broad spectrum of disorders. Nat Genet, 47 (7), pp. 717-726.

To assess factors influencing the success of whole-genome sequencing for mainstream clinical diagnosis, we sequenced 217 individuals from 156 independent cases or families across a broad spectrum of disorders in whom previous screening had identified no pathogenic variants. We quantified the number of candidate variants identified using different strategies for variant calling, filtering, annotation and prioritization. We found that jointly calling variants across samples, filtering against both local and external databases, deploying multiple annotation tools and using familial transmission above biological plausibility contributed to accuracy. Overall, we identified disease-causing variants in 21% of cases, with the proportion increasing to 34% (23/68) for mendelian disorders and 57% (8/14) in family trios. We also discovered 32 potentially clinically actionable variants in 18 genes unrelated to the referral disorder, although only 4 were ultimately considered reportable. Our results demonstrate the value of genome sequencing for routine clinical diagnosis but also highlight many outstanding challenges.

Dilthey A, Cox C, Iqbal Z, Nelson MR, McVean G. 2015. Improved genome inference in the MHC using a population reference graph. Nat Genet, 47 (6), pp. 682-688.

Although much is known about human genetic variation, such information is typically ignored in assembling new genomes. Instead, reads are mapped to a single reference, which can lead to poor characterization of regions of high sequence or structural diversity. We introduce a population reference graph, which combines multiple reference sequences and catalogs of variation. The genomes of new samples are reconstructed as paths through the graph using an efficient hidden Markov model, allowing for recombination between different haplotypes and additional variants. By applying the method to the 4.5-Mb extended MHC region on human chromosome 6, combining 8 assembled haplotypes, the sequences of known classical HLA alleles and 87,640 SNP variants from the 1000 Genomes Project, we demonstrate, using simulations, SNP genotyping, and short-read and long-read data, how the method improves the accuracy of genome inference and identifies regions where the current set of reference sequences is substantially incomplete.

Moutsianas L, Agarwala V, Fuchsberger C, Flannick J, Rivas MA, Gaulton KJ, Albers PK, GoT2D Consortium, McVean G, Boehnke M et al. 2015. The power of gene-based rare variant methods to detect disease-associated variation and test hypotheses about complex disease. PLoS Genet, 11 (4), pp. e1005165.

Genome and exome sequencing in large cohorts enables characterization of the role of rare variation in complex diseases. Success in this endeavor, however, requires investigators to test a diverse array of genetic hypotheses which differ in the number, frequency and effect sizes of underlying causal variants. In this study, we evaluated the power of gene-based association methods to interrogate such hypotheses, and examined the implications for study design. We developed a flexible simulation approach, using 1000 Genomes data, to (a) generate sequence variation at human genes in up to 10K case-control samples, and (b) quantify the statistical power of a panel of widely used gene-based association tests under a variety of allelic architectures, locus effect sizes, and significance thresholds. For loci explaining ~1% of phenotypic variance underlying a common dichotomous trait, we find that all methods have low absolute power to achieve exome-wide significance (~5-20% power at α = 2.5 × 10⁻⁶) in 3K individuals; even in 10K samples, power is modest (~60%). The combined application of multiple methods increases sensitivity, but does so at the expense of a higher false positive rate. MiST, SKAT-O, and KBAC have the highest individual mean power across simulated datasets, but we observe wide architecture-dependent variability in the individual loci detected by each test, suggesting that inferences about disease architecture from analysis of sequencing studies can differ depending on which methods are used. Our results imply that tens of thousands of individuals, extensive functional annotation, or highly targeted hypothesis testing will be required to confidently detect or exclude rare variant signals at complex disease loci.

Mahajan A, Sim X, Ng HJ, Manning A, Rivas MA, Highland HM, Locke AE, Grarup N, Im HK, Cingolani P et al. 2015. Identification and functional characterization of G6PC2 coding variants influencing glycemic traits define an effector transcript at the G6PC2-ABCB11 locus. PLoS Genet, 11 (1), pp. e1004876.

Genome wide association studies (GWAS) for fasting glucose (FG) and insulin (FI) have identified common variant signals which explain 4.8% and 1.2% of trait variance, respectively. It is hypothesized that low-frequency and rare variants could contribute substantially to unexplained genetic variance. To test this, we analyzed exome-array data from up to 33,231 non-diabetic individuals of European ancestry. We found exome-wide significant (P<5×10⁻⁷) evidence for two loci not previously highlighted by common variant GWAS: GLP1R (p.Ala316Thr, minor allele frequency (MAF)=1.5%) influencing FG levels, and URB2 (p.Glu594Val, MAF = 0.1%) influencing FI levels. Coding variant associations can highlight potential effector genes at (non-coding) GWAS signals. At the G6PC2/ABCB11 locus, we identified multiple coding variants in G6PC2 (p.Val219Leu, p.His177Tyr, and p.Tyr207Ser) influencing FG levels, conditionally independent of each other and the non-coding GWAS signal. In vitro assays demonstrate that these associated coding alleles result in reduced protein abundance via proteasomal degradation, establishing G6PC2 as an effector gene at this locus. Reconciliation of single-variant associations and functional effects was only possible when haplotype phase was considered. In contrast to earlier reports suggesting that, paradoxically, glucose-raising alleles at this locus are protective against type 2 diabetes (T2D), the p.Val219Leu G6PC2 variant displayed a modest but directionally consistent association with T2D risk. Coding variant associations for glycemic traits in GWAS signals highlight PCSK1, RREB1, and ZHX3 as likely effector transcripts. These coding variant association signals do not have a major impact on the trait variance explained, but they do provide valuable biological insights.

Miotto O, Amato R, Ashley EA, MacInnis B, Almagro-Garcia J, Amaratunga C, Lim P, Mead D, Oyola SO, Dhorda M et al. 2015. Genetic architecture of artemisinin-resistant Plasmodium falciparum. Nat Genet, 47 (3), pp. 226-234.

We report a large multicenter genome-wide association study of Plasmodium falciparum resistance to artemisinin, the frontline antimalarial drug. Across 15 locations in Southeast Asia, we identified at least 20 mutations in kelch13 (PF3D7_1343700) affecting the encoded propeller and BTB/POZ domains, which were associated with a slow parasite clearance rate after treatment with artemisinin derivatives. Nonsynonymous polymorphisms in fd (ferredoxin), arps10 (apicoplast ribosomal protein S10), mdr2 (multidrug resistance protein 2) and crt (chloroquine resistance transporter) also showed strong associations with artemisinin resistance. Analysis of the fine structure of the parasite population showed that the fd, arps10, mdr2 and crt polymorphisms are markers of a genetic background on which kelch13 mutations are particularly likely to arise and that they correlate with the contemporary geographical boundaries and population frequencies of artemisinin resistance. These findings indicate that the risk of new resistance-causing mutations emerging is determined by specific predisposing genetic factors in the underlying parasite population.

Panoutsopoulou K, Hatzikotoulas K, Xifara DK, Colonna V, Farmaki A-E, Ritchie GRS, Southam L, Gilly A, Tachmazidou I, Fatumo S et al. 2014. Genetic characterization of Greek population isolates reveals strong genetic drift at missense and trait-associated variants. Nat Commun, 5 (1), pp. 5345.

Isolated populations are emerging as a powerful study design in the search for low-frequency and rare variant associations with complex phenotypes. Here we genotype 2,296 samples from two isolated Greek populations, the Pomak villages (HELIC-Pomak) in the North of Greece and the Mylopotamos villages (HELIC-MANOLIS) in Crete. We compare their genomic characteristics to the general Greek population and establish them as genetic isolates. In the MANOLIS cohort, we observe an enrichment of missense variants among the variants that have drifted up in frequency by more than fivefold. In the Pomak cohort, we find novel associations at variants on chr11p15.4 showing large allele frequency increases (from 0.2% in the general Greek population to 4.6% in the isolate) with haematological traits, for example, with mean corpuscular volume (rs7116019, P=2.3 × 10⁻²⁶). We replicate this association in a second set of Pomak samples (combined P=2.0 × 10⁻³⁶). We demonstrate significant power gains in detecting medical trait associations.

Wilkie A, Maher G, Giannoulatou E, McVean G, Goriely A. 2014. Selfish mutations in spermatogenesis and paternal age effects. Mutagenesis, 29 (6), pp. 546-546.

Majithia AR, Flannick J, Shahinian P, Guo M, Bray M-A, Fontanillas P, Gabriel SB, GoT2D Consortium, NHGRI JHS/FHS Allelic Spectrum Project, SIGMA T2D Consortium et al. 2014. Rare variants in PPARG with decreased activity in adipocyte differentiation are associated with increased risk of type 2 diabetes. Proc Natl Acad Sci U S A, 111 (36), pp. 13127-13132.

Peroxisome proliferator-activated receptor gamma (PPARG) is a master transcriptional regulator of adipocyte differentiation and a canonical target of antidiabetic thiazolidinedione medications. In rare families, loss-of-function (LOF) mutations in PPARG are known to cosegregate with lipodystrophy and insulin resistance; in the general population, the common P12A variant is associated with a decreased risk of type 2 diabetes (T2D). Whether and how rare variants in PPARG and defects in adipocyte differentiation influence risk of T2D in the general population remains undetermined. By sequencing PPARG in 19,752 T2D cases and controls drawn from multiple studies and ethnic groups, we identified 49 previously unidentified, nonsynonymous PPARG variants (MAF < 0.5%). Considered in aggregate (with or without computational prediction of functional consequence), these rare variants showed no association with T2D (OR = 1.35; P = 0.17). The function of the 49 variants was experimentally tested in a novel high-throughput human adipocyte differentiation assay, and nine were found to have reduced activity in the assay. Carrying any of these nine LOF variants was associated with a substantial increase in risk of T2D (OR = 7.22; P = 0.005). The combination of large-scale DNA sequencing and functional testing in the laboratory reveals that approximately 1 in 1,000 individuals carries a variant in PPARG that reduces function in a human adipocyte differentiation assay and is associated with a substantial risk of T2D.

Mathieson I, McVean G. 2014. Demography and the age of rare variants. PLoS Genet, 10 (8), pp. e1004528.

Large whole-genome sequencing projects have provided access to much rare variation in human populations, which is highly informative about population structure and recent demography. Here, we show how the age of rare variants can be estimated from patterns of haplotype sharing and how these ages can be related to historical relationships between populations. We investigate the distribution of the age of variants occurring exactly twice (f₂ variants) in a worldwide sample sequenced by the 1000 Genomes Project, revealing enormous variation across populations. The median age of haplotypes carrying f₂ variants is 50 to 160 generations across populations within Europe or Asia, and 170 to 320 generations within Africa. Haplotypes shared between continents are much older with median ages for haplotypes shared between Europe and Asia ranging from 320 to 670 generations. The distribution of the ages of f₂ haplotypes is informative about their demography, revealing recent bottlenecks, ancient splits, and more modern connections between populations. We see the effect of selection in the observation that functional variants are significantly younger than nonfunctional variants of the same frequency. This approach is relatively insensitive to mutation rate and complements other nonparametric methods for demographic inference.

Rimmer A, Phan H, Mathieson I, Iqbal Z, Twigg SRF, WGS500 Consortium, Wilkie AOM, McVean G, Lunter G. 2014. Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications. Nat Genet, 46 (8), pp. 912-918.

High-throughput DNA sequencing technology has transformed genetic research and is starting to make an impact on clinical practice. However, analyzing high-throughput sequencing data remains challenging, particularly in clinical settings where accuracy and turnaround times are critical. We present a new approach to this problem, implemented in a software package called Platypus. Platypus achieves high sensitivity and specificity for SNPs, indels and complex polymorphisms by using local de novo assembly to generate candidate variants, followed by local realignment and probabilistic haplotype estimation. It is an order of magnitude faster than existing tools and generates calls from raw aligned read data without preprocessing. We demonstrate the performance of Platypus in clinically relevant experimental designs by comparing with SAMtools and GATK on whole-genome and exome-capture data, by identifying de novo variation in 15 parent-offspring trios with high sensitivity and specificity, and by estimating human leukocyte antigen genotypes directly from variant calls.

Colonna V, Ayub Q, Chen Y, Pagani L, Luisi P, Pybus M, Garrison E, Xue Y, Tyler-Smith C, 1000 Genomes Project Consortium et al. 2014. Human genomic regions with exceptionally high levels of population differentiation identified from 911 whole-genome sequences. Genome Biol, 15 (6), pp. R88.

BACKGROUND: Population differentiation has proved to be effective for identifying loci under geographically localized positive selection, and has the potential to identify loci subject to balancing selection. We have previously investigated the pattern of genetic differentiation among human populations at 36.8 million genomic variants to identify sites in the genome showing high frequency differences. Here, we extend this dataset to include additional variants, survey sites with low levels of differentiation, and evaluate the extent to which highly differentiated sites are likely to result from selective or other processes. RESULTS: We demonstrate that while sites with low differentiation represent sampling effects rather than balancing selection, sites showing extremely high population differentiation are enriched for positive selection events and that one half may be the result of classic selective sweeps. Among these, we rediscover known examples, where we actually identify the established functional SNP, and discover novel examples including the genes ABCA12, CALD1 and ZNF804, which we speculate may be linked to adaptations in skin, calcium metabolism and defense, respectively. CONCLUSIONS: We identify known and many novel candidate regions for geographically restricted positive selection, and suggest several directions for further research.

Venn O, Turner I, Mathieson I, de Groot N, Bontrop R, McVean G. 2014. Nonhuman genetics. Strong male bias drives germline mutation in chimpanzees. Science, 344 (6189), pp. 1272-1275.

Germline mutation determines rates of molecular evolution, genetic diversity, and fitness load. In humans, the average point mutation rate is 1.2 × 10⁻⁸ per base pair per generation, with every additional year of father's age contributing two mutations across the genome and males contributing three to four times as many mutations as females. To assess whether such patterns are shared with our closest living relatives, we sequenced the genomes of a nine-member pedigree of Western chimpanzees, Pan troglodytes verus. Our results indicate a mutation rate of 1.2 × 10⁻⁸ per base pair per generation, but a male contribution seven to eight times that of females and a paternal age effect of three mutations per year of father's age. Thus, mutation rates and patterns differ between closely related species.

Martin HC, Kim GE, Pagnamenta AT, Murakami Y, Carvill GL, Meyer E, Copley RR, Rimmer A, Barcia G, Fleming MR et al. 2014. Clinical whole-genome sequencing in severe early-onset epilepsy reveals new genes and improves molecular diagnosis. Hum Mol Genet, 23 (12), pp. 3200-3211.

In severe early-onset epilepsy, precise clinical and molecular genetic diagnosis is complex, as many metabolic and electro-physiological processes have been implicated in disease causation. The clinical phenotypes share many features such as complex seizure types and developmental delay. Molecular diagnosis has historically been confined to sequential testing of candidate genes known to be associated with specific sub-phenotypes, but the diagnostic yield of this approach can be low. We conducted whole-genome sequencing (WGS) on six patients with severe early-onset epilepsy who had previously been refractory to molecular diagnosis, and their parents. Four of these patients had a clinical diagnosis of Ohtahara Syndrome (OS) and two patients had severe non-syndromic early-onset epilepsy (NSEOE). In two OS cases, we found de novo non-synonymous mutations in the genes KCNQ2 and SCN2A. In a third OS case, WGS revealed paternal isodisomy for chromosome 9, leading to identification of the causal homozygous missense variant in KCNT1, which produced a substantial increase in potassium channel current. The fourth OS patient had a recessive mutation in PIGQ that led to exon skipping and defective glycophosphatidyl inositol biosynthesis. The two patients with NSEOE had likely pathogenic de novo mutations in CBL and CSNK1G1, respectively. Mutations in these genes were not found among 500 additional individuals with epilepsy. This work reveals two novel genes for OS, KCNT1 and PIGQ. It also uncovers unexpected genetic mechanisms and emphasizes the power of WGS as a clinical tool for making molecular diagnoses, particularly for highly heterogeneous disorders.

Tachmazidou I, Dedoussis G, Southam L, Farmaki A-E, Ritchie GRS, Xifara DK, Matchan A, Hatzikotoulas K, Rayner NW, Chen Y et al. 2013. A rare functional cardioprotective APOC3 variant has risen in frequency in distinct population isolates. Nat Commun, 4 (1), pp. 2872.

Isolated populations can empower the identification of rare variation associated with complex traits through next generation association studies, but the generalizability of such findings remains unknown. Here we genotype 1,267 individuals from a Greek population isolate on the Illumina HumanExome Beadchip, in search of functional coding variants associated with lipid traits. We find genome-wide significant evidence for association of R19X, a functional variant in APOC3, with increased high-density lipoprotein and decreased triglyceride levels. Approximately 3.8% of individuals are heterozygous for this cardioprotective variant, which was previously thought to be private to the Amish founder population. R19X is rare (<0.05% frequency) in outbred European populations. The increased frequency of R19X enables discovery of this lipid-trait signal at genome-wide significance in a small sample size. This work exemplifies the value of isolated populations in successfully detecting transferable rare variant associations of high medical relevance.

Giannoulatou E, McVean G, Taylor IB, McGowan SJ, Maher GJ, Iqbal Z, Pfeifer SP, Turner I, Burkitt Wright EMM, Shorto J et al. 2013. Contributions of intrinsic mutation rate and selfish selection to levels of de novo HRAS mutations in the paternal germline. Proc Natl Acad Sci U S A, 110 (50), pp. 20152-20157.

The RAS proto-oncogene Harvey rat sarcoma viral oncogene homolog (HRAS) encodes a small GTPase that transduces signals from cell surface receptors to intracellular effectors to control cellular behavior. Although somatic HRAS mutations have been described in many cancers, germline mutations cause Costello syndrome (CS), a congenital disorder associated with predisposition to malignancy. Based on the epidemiology of CS and the occurrence of HRAS mutations in spermatocytic seminoma, we proposed that activating HRAS mutations become enriched in sperm through a process akin to tumorigenesis, termed selfish spermatogonial selection. To test this hypothesis, we quantified the levels, in blood and sperm samples, of HRAS mutations at the p.G12 codon and compared the results to changes at the p.A11 codon, at which activating mutations do not occur. The data strongly support the role of selection in determining HRAS mutation levels in sperm, and hence the occurrence of CS, but we also found differences from the mutation pattern in tumorigenesis. First, the relative prevalence of mutations in sperm correlates weakly with their in vitro activating properties and occurrence in cancers. Second, specific tandem base substitutions (predominantly GC>TT/AA) occur in sperm but not in cancers; genomewide analysis showed that this same mutation is also overrepresented in constitutional pathogenic and polymorphic variants, suggesting a heightened vulnerability to these mutations in the germline. We developed a statistical model to show how both intrinsic mutation rate and selfish selection contribute to the mutational burden borne by the paternal germline.

International Multiple Sclerosis Genetics Consortium (IMSGC), Beecham AH, Patsopoulos NA, Xifara DK, Davis MF, Kemppinen A, Cotsapas C, Shah TS, Spencer C, Booth D et al. 2013. Analysis of immune-related loci identifies 48 new susceptibility variants for multiple sclerosis. Nat Genet, 45 (11), pp. 1353-1360.

Using the ImmunoChip custom genotyping array, we analyzed 14,498 subjects with multiple sclerosis and 24,091 healthy controls for 161,311 autosomal variants and identified 135 potentially associated regions (P < 1.0 × 10⁻⁴). In a replication phase, we combined these data with previous genome-wide association study (GWAS) data from an independent 14,802 subjects with multiple sclerosis and 26,703 healthy controls. In these 80,094 individuals of European ancestry, we identified 48 new susceptibility variants (P < 5.0 × 10⁻⁸), 3 of which we found after conditioning on previously identified variants. Thus, there are now 110 established multiple sclerosis risk variants at 103 discrete loci outside of the major histocompatibility complex. With high-resolution Bayesian fine mapping, we identified five regions where one variant accounted for more than 50% of the posterior probability of association. This study enhances the catalog of multiple sclerosis risk variants and illustrates the value of fine mapping in the resolution of GWAS signals.

Choi K, Zhao X, Kelly KA, Venn O, Higgins JD, Yelina NE, Hardcastle TJ, Ziolkowski PA, Copenhaver GP, Franklin FCH et al. 2013. Arabidopsis meiotic crossover hot spots overlap with H2A.Z nucleosomes at gene promoters. Nat Genet, 45 (11), pp. 1327-1336.

PRDM9 directs human meiotic crossover hot spots to intergenic sequence motifs, whereas budding yeast hot spots overlap regions of low nucleosome density (LND) in gene promoters. To investigate hot spots in plants, which lack PRDM9, we used coalescent analysis of genetic variation in Arabidopsis thaliana. Crossovers increased toward gene promoters and terminators, and hot spots were associated with active chromatin modifications, including H2A.Z, histone H3 Lys4 trimethylation (H3K4me3), LND and low DNA methylation. Hot spot-enriched A-rich and CTT-repeat DNA motifs occurred upstream and downstream, respectively, of transcriptional start sites. Crossovers were asymmetric around promoters and were most frequent over CTT-repeat motifs and H2A.Z nucleosomes. Pollen typing, segregation and cytogenetic analysis showed decreased numbers of crossovers in the arp6 H2A.Z deposition mutant at multiple scales. During meiosis, H2A.Z forms overlapping chromosomal foci with the DMC1 and RAD51 recombinases. As arp6 reduced the number of DMC1 or RAD51 foci, H2A.Z may promote the formation or processing of meiotic DNA double-strand breaks. We propose that gene chromatin ancestrally designates hot spots within eukaryotes and PRDM9 is a derived state within vertebrates.

Palles C, Cazier J-B, Howarth KM, Domingo E, Jones AM, Broderick P, Kemp Z, Spain SL, Almeida EG, Salguero I et al. 2013. Germline mutations affecting the proofreading domains of POLE and POLD1 predispose to colorectal adenomas and carcinomas (vol 45, pg 136, 2013). Nat Genet, 45 (6), pp. 713-713.

Zilversmit MM, Chase EK, Chen DS, Awadalla P, Day KP, McVean G. 2013. Hypervariable antigen genes in malaria have ancient roots. BMC Evol Biol, 13 (1), pp. 110.

BACKGROUND: The var genes of the human malaria parasite Plasmodium falciparum are highly polymorphic loci coding for the erythrocyte membrane proteins 1 (PfEMP1), which are responsible for the cytoadherence of P. falciparum infected red blood cells to the human vasculature. Cytoadhesion, coupled with differential expression of var genes, contributes to virulence and allows the parasite to establish chronic infections by evading detection from the host's immune system. Although studying genetic diversity is a major focus of recent work on the var genes, little is known about the gene family's origin and evolutionary history. RESULTS: Using a novel hidden Markov model-based approach and var sequences assembled from additional isolates and species, we are able to reveal elements of both the early evolution of the var genes as well as recent diversifying events. We compare sequences of the var gene DBLα domains from divergent isolates of P. falciparum (3D7 and HB3), and a closely-related species, Plasmodium reichenowi. We find that the gene family is equally large in P. reichenowi and P. falciparum -- with a minimum of 51 var genes in the P. reichenowi genome (compared to 61 in 3D7 and a minimum of 48 in HB3). In addition, we are able to define large, continuous blocks of homologous sequence among P. falciparum and P. reichenowi var gene DBLα domains. These results reveal that the contemporary structure of the var gene family was present before the divergence of P. falciparum and P. reichenowi, estimated to be between 2.5 and 6 million years ago. We also reveal that recombination has played an important and traceable role in both the establishment, and the maintenance, of diversity in the sequences. CONCLUSIONS: Despite the remarkable diversity and rapid evolution found in these loci within and among P. falciparum populations, the basic structure of these domains and the gene family is surprisingly old and stable. Revealing a common structure as well as conserved sequence between two species also has implications for developing new primate-parasite models for studying the pathology and immunology of falciparum malaria, and for studying the population genetics of var genes and associated virulence phenotypes.

Babbs C, Roberts NA, Sanchez-Pulido L, McGowan SJ, Ahmed MR, Brown JM, Sabry MA, WGS500 Consortium, Bentley DR, McVean GA et al. 2013. Homozygous mutations in a predicted endonuclease are a novel cause of congenital dyserythropoietic anemia type I. Haematologica, 98 (9), pp. 1383-1387.

The congenital dyserythropoietic anemias are a heterogeneous group of rare disorders primarily affecting erythropoiesis with characteristic morphological abnormalities and a block in erythroid maturation. Mutations in the CDAN1 gene, which encodes Codanin-1, underlie the majority of congenital dyserythropoietic anemia type I cases. However, no likely pathogenic CDAN1 mutation has been detected in approximately 20% of cases, suggesting the presence of at least one other locus. We used whole genome sequencing and segregation analysis to identify a homozygous T to A transversion (c.533T>A), predicted to lead to a p.L178Q missense substitution in C15ORF41, a gene of unknown function, in a consanguineous pedigree of Middle-Eastern origin. Sequencing C15ORF41 in other CDAN1 mutation-negative congenital dyserythropoietic anemia type I pedigrees identified a homozygous transition (c.281A>G), predicted to lead to a p.Y94C substitution, in two further pedigrees of Southeast Asian origin. The haplotype surrounding the c.281A>G change suggests a founder effect for this mutation in Pakistan. Detailed sequence similarity searches indicate that C15ORF41 encodes a novel restriction endonuclease that is a member of the Holliday junction resolvase family of proteins.

Palmer D, Frater J, Phillips R, McLean AR, McVean G. 2013. Integrating genealogical and dynamical modelling to infer escape and reversion rates in HIV epitopes. Proc Biol Sci, 280 (1762), pp. 20130696.

The rates of escape and reversion in response to selection pressure arising from the host immune system, notably the cytotoxic T-lymphocyte (CTL) response, are key factors determining the evolution of HIV. Existing methods for estimating these parameters from cross-sectional population data using ordinary differential equations (ODEs) ignore information about the genealogy of sampled HIV sequences, which has the potential to cause systematic bias and overestimate certainty. Here, we describe an integrated approach, validated through extensive simulations, which combines genealogical inference and epidemiological modelling, to estimate rates of CTL escape and reversion in HIV epitopes. We show that there is substantial uncertainty about rates of viral escape and reversion from cross-sectional data, which arises from the inherent stochasticity in the evolutionary process. By application to empirical data, we find that point estimates of rates from a previously published ODE model and the integrated approach presented here are often similar, but can also differ several-fold depending on the structure of the genealogy. The model-based approach we apply provides a framework for the statistical analysis and hypothesis testing of escape and reversion in population data and highlights the need for longitudinal and denser cross-sectional sampling to enable accurate estimation of these key parameters.

Mathieson I, McVean G. 2013. Reply to: "FaST-LMM-Select for addressing confounding from spatial structure and rare variants". Nat Genet, 45 (5), pp. 471.

Miotto O, Almagro-Garcia J, Manske M, Macinnis B, Campino S, Rockett KA, Amaratunga C, Lim P, Suon S, Sreng S et al. 2013. Multiple populations of artemisinin-resistant Plasmodium falciparum in Cambodia. Nat Genet, 45 (6), pp. 648-655.

We describe an analysis of genome variation in 825 P. falciparum samples from Asia and Africa that identifies an unusual pattern of parasite population structure at the epicenter of artemisinin resistance in western Cambodia. Within this relatively small geographic area, we have discovered several distinct but apparently sympatric parasite subpopulations with extremely high levels of genetic differentiation. Of particular interest are three subpopulations, all associated with clinical resistance to artemisinin, which have skewed allele frequency spectra and high levels of haplotype homozygosity, indicative of founder effects and recent population expansion. We provide a catalog of SNPs that show high levels of differentiation in the artemisinin-resistant subpopulations, including codon variants in transporter proteins and DNA mismatch repair proteins. These data provide a population-level genetic framework for investigating the biological origins of artemisinin resistance and for defining molecular markers to assist in its elimination.

Montgomery SB, Goode DL, Kvikstad E, Albers CA, Zhang ZD, Mu XJ, Ananda G, Howie B, Karczewski KJ, Smith KS et al. 2013. The origin, evolution, and functional impact of short insertion-deletion variants identified in 179 human genomes. Genome Res, 23 (5), pp. 749-761.

Short insertions and deletions (indels) are the second most abundant form of human genetic variation, but our understanding of their origins and functional effects lags behind that of other types of variants. Using population-scale sequencing, we have identified a high-quality set of 1.6 million indels from 179 individuals representing three diverse human populations. We show that rates of indel mutagenesis are highly heterogeneous, with 43%-48% of indels occurring in 4.03% of the genome, whereas in the remaining 96% their prevalence is 16 times lower than SNPs. Polymerase slippage can explain upwards of three-fourths of all indels, with the remainder being mostly simple deletions in complex sequence. However, insertions do occur and are significantly associated with pseudo-palindromic sequence features compatible with the fork stalling and template switching (FoSTeS) mechanism more commonly associated with large structural variations. We introduce a quantitative model of polymerase slippage, which enables us to identify indel-hypermutagenic protein-coding genes, some of which are associated with recurrent mutations leading to disease. Accounting for mutational rate heterogeneity due to sequence context, we find that indels across functional sequence are generally subject to stronger purifying selection than SNPs. We find that indel length modulates selection strength, and that indels affecting multiple functionally constrained nucleotides undergo stronger purifying selection. We further find that indels are enriched in associations with gene expression and find evidence for a contribution of nonsense-mediated decay. Finally, we show that indels can be integrated in existing genome-wide association studies (GWAS); although we do not find direct evidence that potentially causal protein-coding indels are enriched with associations to known disease-associated SNPs, our findings suggest that the causal variant underlying some of these associations may be indels.

Mathieson I, McVean G. 2013. Estimating selection coefficients in spatially structured populations from time series data of allele frequencies. Genetics, 193 (3), pp. 973-984.

Inferring the nature and magnitude of selection is an important problem in many biological contexts. Typically when estimating a selection coefficient for an allele, it is assumed that samples are drawn from a panmictic population and that selection acts uniformly across the population. However, these assumptions are rarely satisfied. Natural populations are almost always structured, and selective pressures are likely to act differentially. Inference about selection ought therefore to take account of structure. We do this by considering evolution in a simple lattice model of spatial population structure. We develop a hidden Markov model based maximum-likelihood approach for estimating the selection coefficient in a single population from time series data of allele frequencies. We then develop an approximate extension of this to the structured case to provide a joint estimate of migration rate and spatially varying selection coefficients. We illustrate our method using classical data sets of moth pigmentation morph frequencies, but it has wide applications in settings ranging from ecology to human evolution.
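
The single-population step described in this abstract (a hidden Markov model whose likelihood is maximized over the selection coefficient) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a haploid Wright-Fisher population of known size N, one binomial sample per generation, a flat prior on the initial allele count, and a simple grid search in place of a numerical optimizer. All function names and the example data are hypothetical.

```python
import math


def binom_pmf(k, n, p):
    """Binomial probability P(X = k) for X ~ Bin(n, p)."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def transition_matrix(N, s):
    """Wright-Fisher transitions between allele counts 0..N (assumed model).

    Selection shifts the frequency p to p(1+s)/(1+ps); the next generation
    is then a binomial draw of N individuals. Requires s > -1.
    """
    rows = []
    for i in range(N + 1):
        p = i / N
        p_sel = p * (1.0 + s) / (1.0 + p * s)
        rows.append([binom_pmf(j, N, p_sel) for j in range(N + 1)])
    return rows


def log_likelihood(samples, N, s):
    """Forward algorithm: log P(samples | s).

    `samples` is a list of (sample_size, allele_count) pairs, one per
    consecutive generation. The hidden state is the true allele count.
    """
    T = transition_matrix(N, s)
    alpha = [1.0 / (N + 1)] * (N + 1)  # flat prior on the initial count
    loglik = 0.0
    for t, (n, x) in enumerate(samples):
        if t > 0:  # propagate the hidden state by one generation
            alpha = [sum(alpha[i] * T[i][j] for i in range(N + 1))
                     for j in range(N + 1)]
        # weight each hidden count by the probability of the observed sample
        alpha = [a * binom_pmf(x, n, j / N) for j, a in enumerate(alpha)]
        total = sum(alpha)
        loglik += math.log(total)
        alpha = [a / total for a in alpha]  # rescale to avoid underflow
    return loglik


def estimate_s(samples, N, grid):
    """Grid-search maximum-likelihood estimate of s."""
    return max(grid, key=lambda s: log_likelihood(samples, N, s))


# Hypothetical data: an allele rising quickly in samples of 20 individuals.
samples = [(20, 2), (20, 4), (20, 7), (20, 11), (20, 15), (20, 18)]
grid = [i / 10 for i in range(-8, 21)]  # candidate s values, -0.8 to 2.0
print(estimate_s(samples, N=50, grid=grid))
```

The forward recursion sums over every possible hidden allele count, so sampling noise and genetic drift are both accounted for when comparing values of s. The structured (lattice) extension in the paper, with migration and spatially varying selection, is beyond this sketch.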

Leffler EM, Gao Z, Pfeifer S, Ségurel L, Auton A, Venn O, Bowden R, Bontrop R, Wall JD, Sella G et al. 2013. Multiple instances of ancient balancing selection shared between humans and chimpanzees. Science, 339 (6127), pp. 1578-1582.

Instances in which natural selection maintains genetic variation in a population over millions of years are thought to be extremely rare. We conducted a genome-wide scan for long-lived balancing selection by looking for combinations of SNPs shared between humans and chimpanzees. In addition to the major histocompatibility complex, we identified 125 regions in which the same haplotypes are segregating in the two species, all but two of which are noncoding. In six cases, there is evidence for an ancestral polymorphism that persisted to the present in humans and chimpanzees. Regions with shared haplotypes are significantly enriched for membrane glycoproteins, and a similar trend is seen among shared coding polymorphisms. These findings indicate that ancient balancing selection has shaped human variation and point to genes involved in host-pathogen interactions as common targets.

Newey PJ, Nesbit MA, Rimmer AJ, Head RA, Gorvin CM, Attar M, Gregory L, Wass JAH, Buck D, Karavitaki N et al. 2013. Whole-exome sequencing studies of nonfunctioning pituitary adenomas. J Clin Endocrinol Metab, 98 (4), pp. E796-E800.

CONTEXT: The tumorigenic role of genetic abnormalities in sporadic pituitary nonfunctioning adenomas (NFAs), which usually originate from gonadotroph cells, is unknown. OBJECTIVE: The objective of the study was to identify somatic genetic abnormalities in sporadic pituitary NFAs. DESIGN: Whole-exome sequencing was performed using DNA from 7 pituitary NFAs and leukocyte samples obtained from the same patients. Somatic variants were confirmed by dideoxynucleotide sequencing, and candidate driver genes were assessed in an additional 24 pituitary NFAs. RESULTS: Whole-exome sequencing achieved a high degree of coverage such that approximately 97% of targeted bases were represented by more than 10 base reads; 24 somatic variants were identified and confirmed in the discovery set of 7 pituitary NFAs (mean 3.5 variants/tumor; range 1-7). Approximately 80% of variants occurred as missense single nucleotide variants and the remainder were synonymous changes or small frameshift deletions. Each of the 24 mutations occurred in independent genes with no recurrent mutations. Mutations were not observed in genes previously associated with pituitary tumorigenesis, although somatic variants in putative driver genes including platelet-derived growth factor D (PDGFD), N-myc down-regulated gene family member 4 (NDRG4), and Zipper sterile-α-motif kinase (ZAK) were identified; however, DNA sequence analysis of these in the validation set of 24 pituitary NFAs did not reveal any mutations indicating that these genes are unlikely to contribute significantly in the etiology of sporadic pituitary NFAs. CONCLUSIONS: Pituitary NFAs harbor few somatic mutations consistent with their low proliferation rates and benign nature, but mechanisms other than somatic mutation are likely involved in the etiology of sporadic pituitary NFAs.

Dilthey A, Leslie S, Moutsianas L, Shen J, Cox C, Nelson MR, McVean G. 2013. Multi-population classical HLA type imputation. PLoS Comput Biol, 9 (2), pp. e1002877.

Statistical imputation of classical HLA alleles in case-control studies has become established as a valuable tool for identifying and fine-mapping signals of disease association in the MHC. Imputation into diverse populations has, however, remained challenging, mainly because of the additional haplotypic heterogeneity introduced by combining reference panels of different sources. We present an HLA type imputation model, HLA*IMP:02, designed to operate on a multi-population reference panel. HLA*IMP:02 is based on a graphical representation of haplotype structure. We present a probabilistic algorithm to build such models for the HLA region, accommodating genotyping error, haplotypic heterogeneity and the need for maximum accuracy at the HLA loci, generalizing the work of Browning and Browning (2007) and Ron et al. (1998). HLA*IMP:02 achieves an average 4-digit imputation accuracy on diverse European panels of 97% (call rate 97%). On non-European samples, 2-digit performance is over 90% for most loci and ethnicities where data are available. HLA*IMP:02 supports imputation of HLA-DPB1 and HLA-DRB3-5, is highly tolerant of missing data in the imputation panel and works on standard genotype data from popular genotyping chips. It is publicly available in source code and as a user-friendly web service framework.

Palles C, Cazier J-B, Howarth KM, Domingo E, Jones AM, Broderick P, Kemp Z, Spain SL, Guarino E, Salguero I et al. 2013. Germline mutations affecting the proofreading domains of POLE and POLD1 predispose to colorectal adenomas and carcinomas. Nat Genet, 45 (2), pp. 136-144.

Many individuals with multiple or large colorectal adenomas or early-onset colorectal cancer (CRC) have no detectable germline mutations in the known cancer predisposition genes. Using whole-genome sequencing, supplemented by linkage and association analysis, we identified specific heterozygous POLE or POLD1 germline variants in several multiple-adenoma and/or CRC cases but in no controls. The variants associated with susceptibility, POLE p.Leu424Val and POLD1 p.Ser478Asn, have high penetrance, and POLD1 mutation was also associated with endometrial cancer predisposition. The mutations map to equivalent sites in the proofreading (exonuclease) domain of DNA polymerases ɛ and δ and are predicted to cause a defect in the correction of mispaired bases inserted during DNA replication. In agreement with this prediction, the tumors from mutation carriers were microsatellite stable but tended to acquire base substitution mutations, as confirmed by yeast functional assays. Further analysis of published data showed that the recently described group of hypermutant, microsatellite-stable CRCs is likely to be caused by somatic POLE mutations affecting the exonuclease domain.

Fugger L, McVean G, Bell JI. 2012. Genomewide association studies and common disease--realizing clinical utility. N Engl J Med, 367 (25), pp. 2370-2371.

Nesbit MA, Hannan FM, Howles SA, Reed AAC, Cranston T, Thakker CE, Gregory L, Rimmer AJ, Rust N, Graham U et al. 2013. Mutations in AP2S1 cause familial hypocalciuric hypercalcemia type 3. Nat Genet, 45 (1), pp. 93-97.

Adaptor protein-2 (AP2), a central component of clathrin-coated vesicles (CCVs), is pivotal in clathrin-mediated endocytosis, which internalizes plasma membrane constituents such as G protein-coupled receptors (GPCRs). AP2, a heterotetramer of α, β, μ and σ subunits, links clathrin to vesicle membranes and binds to tyrosine- and dileucine-based motifs of membrane-associated cargo proteins. Here we show that missense mutations of AP2 σ subunit (AP2S1) affecting Arg15, which forms key contacts with dileucine-based motifs of CCV cargo proteins, result in familial hypocalciuric hypercalcemia type 3 (FHH3), an extracellular calcium homeostasis disorder affecting the parathyroids, kidneys and bone. We found AP2S1 mutations in >20% of cases of FHH without mutations in calcium-sensing GPCR (CASR), which cause FHH1. AP2S1 mutations decreased the sensitivity of CaSR-expressing cells to extracellular calcium and reduced CaSR endocytosis, probably through loss of interaction with a C-terminal CaSR dileucine-based motif, whose disruption also decreased intracellular signaling. Thus, our results identify a new role for AP2 in extracellular calcium homeostasis.

Lise S, Clarkson Y, Perkins E, Kwasniewska A, Sadighi Akha E, Schnekenberg RP, Suminaite D, Hope J, Baker I, Gregory L et al. 2012. Recessive mutations in SPTBN2 implicate β-III spectrin in both cognitive and motor development. PLoS Genet, 8 (12), pp. e1003074.

β-III spectrin is present in the brain and is known to be important in the function of the cerebellum. Heterozygous mutations in SPTBN2, the gene encoding β-III spectrin, cause Spinocerebellar Ataxia Type 5 (SCA5), an adult-onset, slowly progressive, autosomal-dominant pure cerebellar ataxia. SCA5 is sometimes known as "Lincoln ataxia," because the largest known family is descended from relatives of the United States President Abraham Lincoln. Using targeted capture and next-generation sequencing, we identified a homozygous stop codon in SPTBN2 in a consanguineous family in which childhood developmental ataxia co-segregates with cognitive impairment. The cognitive impairment could result from mutations in a second gene, but further analysis using whole-genome sequencing combined with SNP array analysis did not reveal any evidence of other mutations. We also examined a mouse knockout of β-III spectrin in which ataxia and progressive degeneration of cerebellar Purkinje cells has been previously reported and found morphological abnormalities in neurons from prefrontal cortex and deficits in object recognition tasks, consistent with the human cognitive phenotype. These data provide the first evidence that β-III spectrin plays an important role in cortical brain development and cognition, in addition to its function in the cerebellum; and we conclude that cognitive impairment is an integral part of this novel recessive ataxic syndrome, Spectrin-associated Autosomal Recessive Cerebellar Ataxia type 1 (SPARCA1). In addition, the identification of SPARCA1 and normal heterozygous carriers of the stop codon in SPTBN2 provides insights into the mechanism of molecular dominance in SCA5 and demonstrates that the cell-specific repertoire of spectrin subunits underlies a novel group of disorders, the neuronal spectrinopathies, which includes SCA5, SPARCA1, and a form of West syndrome.

Iqbal Z, Turner I, McVean G. 2013. High-throughput microbial population genomics using the Cortex variation assembler. Bioinformatics, 29 (2), pp. 275-276.

SUMMARY: We have developed a software package, Cortex, designed for the analysis of genetic variation by de novo assembly of multiple samples. This allows direct comparison of samples without using a reference genome as intermediate and incorporates discovery and genotyping of single-nucleotide polymorphisms, indels and larger events in a single framework. We introduce pipelines which simplify the analysis of microbial samples and increase discovery power; these also enable the construction of a graph of known sequence and variation in a species, against which new samples can be compared rapidly. We demonstrate the ease-of-use and power by reproducing the results of studies using both long and short reads. AVAILABILITY: http://cortexassembler.sourceforge.net (GPLv3 license). CONTACT: zam@well.ox.ac.uk, mcvean@well.ox.ac.uk SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

1000 Genomes Project Consortium, Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature, 491 (7422), pp. 56-65.

By characterizing the geographic and functional spectrum of human genetic variation, the 1000 Genomes Project aims to build a resource to help to understand the genetic contribution to disease. Here we describe the genomes of 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing. By developing methods to integrate information across several algorithms and diverse data sources, we provide a validated haplotype map of 38 million single nucleotide polymorphisms, 1.4 million short insertions and deletions, and more than 14,000 larger deletions. We show that individuals from different populations carry different profiles of rare and common variants, and that low-frequency variants show substantial geographic differentiation, which is further increased by the action of purifying selection. We show that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites. This resource, which captures up to 98% of accessible single nucleotide polymorphisms at a frequency of 1% in related populations, enables analysis of common and low-frequency variants in individuals from diverse, including admixed, populations.

Wellcome Trust Case Control Consortium, Maller JB, McVean G, Byrnes J, Vukcevic D, Palin K, Su Z, Howson JMM, Auton A, Myers S et al. 2012. Bayesian refinement of association signals for 14 loci in 3 common diseases. Nat Genet, 44 (12), pp. 1294-1301.

To further investigate susceptibility loci identified by genome-wide association studies, we genotyped 5,500 SNPs across 14 associated regions in 8,000 samples from a control group and 3 diseases: type 2 diabetes (T2D), coronary artery disease (CAD) and Graves' disease. We defined, using Bayes theorem, credible sets of SNPs that were 95% likely, based on posterior probability, to contain the causal disease-associated SNPs. In 3 of the 14 regions, TCF7L2 (T2D), CTLA4 (Graves' disease) and CDKN2A-CDKN2B (T2D), much of the posterior probability rested on a single SNP, and, in 4 other regions (CDKN2A-CDKN2B (CAD) and CDKAL1, FTO and HHEX (T2D)), the 95% sets were small, thereby excluding most SNPs as potentially causal. Very few SNPs in our credible sets had annotated functions, illustrating the limitations in understanding the mechanisms underlying susceptibility to common diseases. Our results also show the value of more detailed mapping to target sequences for functional studies.
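
As an illustration of the credible-set idea in the abstract above, the following minimal Python sketch assumes exactly one causal SNP per region, an equal prior probability for every SNP, and a precomputed Bayes factor per SNP. The function name, SNP labels and Bayes factor values are hypothetical and are not taken from the paper.

```python
def credible_set(bayes_factors, coverage=0.95):
    """Return the smallest set of SNPs whose posterior mass reaches `coverage`.

    Assumes one causal SNP per region and equal priors, so each SNP's
    posterior probability is its Bayes factor divided by the regional total.
    """
    total = sum(bayes_factors.values())
    posteriors = {snp: bf / total for snp, bf in bayes_factors.items()}

    # Rank SNPs from most to least probable, then accumulate posterior mass.
    ranked = sorted(posteriors, key=posteriors.get, reverse=True)
    chosen, cumulative = [], 0.0
    for snp in ranked:
        chosen.append(snp)
        cumulative += posteriors[snp]
        if cumulative >= coverage:
            break
    return chosen, posteriors


# Hypothetical Bayes factors for four SNPs in one region.
example = {"snpA": 90.0, "snpB": 6.0, "snpC": 3.0, "snpD": 1.0}
chosen, posteriors = credible_set(example)
print(chosen)
```

With these made-up values the two top-ranked SNPs carry posterior probabilities of 0.90 and 0.06, so together they pass the 95% threshold and the other two SNPs are excluded.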

Irish Schizophrenia Genomics Consortium and the Wellcome Trust Case Control Consortium 2. 2012. Genome-wide association study implicates HLA-C*01:02 as a risk factor at the major histocompatibility complex locus in schizophrenia. Biol Psychiatry, 72 (8), pp. 620-628.

BACKGROUND: We performed a genome-wide association study (GWAS) to identify common risk variants for schizophrenia. METHODS: The discovery scan included 1606 patients and 1794 controls from Ireland, using 6,212,339 directly genotyped or imputed single nucleotide polymorphisms (SNPs). A subset of this sample (270 cases and 860 controls) was subsequently included in the Psychiatric GWAS Consortium-schizophrenia GWAS meta-analysis. RESULTS: One hundred eight SNPs were taken forward for replication in an independent sample of 13,195 cases and 31,021 control subjects. The most significant associations in discovery, corrected for genomic inflation (rs204999, p combined = 1.34 × 10⁻⁹), and in combined samples (rs2523722, p combined = 2.88 × 10⁻¹⁶) both mapped to the major histocompatibility complex (MHC) region. We imputed classical human leukocyte antigen (HLA) alleles at the locus; the most significant finding was with HLA-C*01:02. This association was distinct from the top SNP signal. The HLA alleles DRB1*03:01 and B*08:01 were protective, replicating a previous study. CONCLUSIONS: This study provides further support for involvement of MHC class I molecules in schizophrenia. We found evidence of association with previously reported risk alleles at the TCF4, VRK2, and ZNF804A loci.

Newey PJ, Nesbit MA, Rimmer AJ, Attar M, Head RT, Christie PT, Gorvin CM, Stechman M, Gregory L, Mihai R et al. 2012. Whole-exome sequencing studies of nonhereditary (sporadic) parathyroid adenomas. J Clin Endocrinol Metab, 97 (10), pp. E1995-E2005.

CONTEXT: Genetic abnormalities, such as those of multiple endocrine neoplasia type 1 (MEN1) and Cyclin D1 (CCND1) genes, occur in <50% of nonhereditary (sporadic) parathyroid adenomas. OBJECTIVE: To identify genetic abnormalities in nonhereditary parathyroid adenomas by whole-exome sequence analysis. DESIGN: Whole-exome sequence analysis was performed on parathyroid adenomas and leukocyte DNA samples from 16 postmenopausal women without a family history of parathyroid tumors or MEN1 and in whom primary hyperparathyroidism due to single-gland disease was cured by surgery. Somatic variants confirmed in this discovery set were assessed in 24 other parathyroid adenomas. RESULTS: Over 90% of targeted exons were captured and represented by more than 10 base reads. Analysis identified 212 somatic variants (median eight per tumor; range, 2-110), with the majority being heterozygous nonsynonymous single-nucleotide variants that predicted missense amino acid substitutions. Somatic MEN1 mutations occurred in six of 16 (∼35%) parathyroid adenomas, in association with loss of heterozygosity on chromosome 11. However, no other gene was mutated in more than one tumor. Mutations in several genes that may represent low-frequency driver mutations were identified, including a protection of telomeres 1 (POT1) mutation that resulted in exon skipping and disruption to the single-stranded DNA-binding domain, which may contribute to increased genomic instability and the observed high mutation rate in one tumor. CONCLUSIONS: Parathyroid adenomas typically harbor few somatic variants, consistent with their low proliferation rates. MEN1 mutation represents the major driver in sporadic parathyroid tumorigenesis although multiple low-frequency driver mutations likely account for tumors not harboring somatic MEN1 mutations.

Gregory AP, Dendrou CA, Attfield KE, Haghikia A, Xifara DK, Butter F, Poschmann G, Kaur G, Lambert L, Leach OA et al. 2012. TNF receptor 1 genetic risk mirrors outcome of anti-TNF therapy in multiple sclerosis. Nature, 488 (7412), pp. 508-511.

Although there has been much success in identifying genetic variants associated with common diseases using genome-wide association studies (GWAS), it has been difficult to demonstrate which variants are causal and what role they have in disease. Moreover, the modest contribution that these variants make to disease risk has raised questions regarding their medical relevance. Here we have investigated a single nucleotide polymorphism (SNP) in the TNFRSF1A gene, that encodes tumour necrosis factor receptor 1 (TNFR1), which was discovered through GWAS to be associated with multiple sclerosis (MS), but not with other autoimmune conditions such as rheumatoid arthritis, psoriasis and Crohn’s disease. By analysing MS GWAS data in conjunction with the 1000 Genomes Project data we provide genetic evidence that strongly implicates this SNP, rs1800693, as the causal variant in the TNFRSF1A region. We further substantiate this through functional studies showing that the MS risk allele directs expression of a novel, soluble form of TNFR1 that can block TNF. Importantly, TNF-blocking drugs can promote onset or exacerbation of MS, but they have proven highly efficacious in the treatment of autoimmune diseases for which there is no association with rs1800693. This indicates that the clinical experience with these drugs parallels the disease association of rs1800693, and that the MS-associated TNFR1 variant mimics the effect of TNF-blocking drugs. Hence, our study demonstrates that clinical practice can be informed by comparing GWAS across common autoimmune diseases and by investigating the functional consequences of the disease-associated genetic variation.

Fernando MMA, Freudenberg J, Lee A, Morris DL, Boteva L, Rhodes B, Gonzalez-Escribano MF, Lopez-Nevot MA, Navarra SV, Gregersen PK et al. 2012. Transancestral mapping of the MHC region in systemic lupus erythematosus identifies new independent and interacting loci at MSH5, HLA-DPB1 and HLA-G. Ann Rheum Dis, 71 (5), pp. 777-784.

OBJECTIVES: Systemic lupus erythematosus (SLE) is a chronic multisystem genetically complex autoimmune disease characterised by the production of autoantibodies to nuclear and cellular antigens, tissue inflammation and organ damage. Genome-wide association studies have shown that variants within the major histocompatibility complex (MHC) region on chromosome 6 confer the greatest genetic risk for SLE in European and Chinese populations. However, the causal variants remain elusive due to tight linkage disequilibrium across disease-associated MHC haplotypes, the highly polymorphic nature of many MHC genes and the heterogeneity of the SLE phenotype. METHODS: A high-density case-control single nucleotide polymorphism (SNP) study of the MHC region was undertaken in SLE cohorts of Spanish and Filipino ancestry using a custom Illumina chip in order to fine-map association signals in these haplotypically diverse populations. In addition, comparative analyses were performed between these two datasets and a northern European UK SLE cohort. A total of 1433 cases and 1458 matched controls were examined. RESULTS: Using this transancestral SNP mapping approach, novel independent loci were identified within the MHC region in UK, Spanish and Filipino patients with SLE with some evidence of interaction. These loci include HLA-DPB1, HLA-G and MSH5 which are independent of each other and HLA-DRB1 alleles. Furthermore, the established SLE-associated HLA-DRB1*15 signal was refined to an interval encompassing HLA-DRB1 and HLA-DQA1. Increased frequencies of MHC region risk alleles and haplotypes were found in the Filipino population compared with Europeans, suggesting that the greater disease burden in non-European SLE may be due in part to this phenomenon. 
CONCLUSION: These data highlight the usefulness of mapping disease susceptibility loci using a transancestral approach, particularly in a region as complex as the MHC, and offer a springboard for further fine-mapping, resequencing and transcriptomic analysis.

Goriely A, Maher G, Lim J, Taylor IB, McGowan SJ, Pfeifer S, Rajpert-DeMeyts E, McVean GAT, Wilkie AOM. 2012. Paternal age effect and selfish mutations. Schizophrenia Research, 136 pp. S4-S4.

Auton A, Fledel-Alon A, Pfeifer S, Venn O, Ségurel L, Street T, Leffler EM, Bowden R, Aneas I, Broxholme J et al. 2012. A fine-scale chimpanzee genetic map from population sequencing. Science, 336 (6078), pp. 193-198.

To study the evolution of recombination rates in apes, we developed methodology to construct a fine-scale genetic map from high-throughput sequence data from 10 Western chimpanzees, Pan troglodytes verus. Compared to the human genetic map, broad-scale recombination rates tend to be conserved, but with exceptions, particularly in regions of chromosomal rearrangements and around the site of ancestral fusion in human chromosome 2. At fine scales, chimpanzee recombination is dominated by hotspots, which show no overlap with those of humans even though rates are similarly elevated around CpG islands and decreased within genes. The hotspot-specifying protein PRDM9 shows extensive variation among Western chimpanzees, and there is little evidence that any sequence motifs are enriched in hotspots. The contrasting locations of hotspots provide a natural experiment, which demonstrates the impact of recombination on base composition.

Mathieson I, McVean G. 2012. Differential confounding of rare and common variants in spatially structured populations. Nat Genet, 44 (3), pp. 243-246.

Well-powered genome-wide association studies, now made possible through advances in technology and large-scale collaborative projects, promise to characterize the contribution of rare variants to complex traits and disease. However, while population structure is a known confounder of association studies, it remains unknown whether methods developed to control stratification are equally effective for rare variants. Here, we demonstrate that rare variants can show a stratification that is systematically different from, and typically stronger than, common variants, and this is not necessarily corrected by existing methods. We show that the same process leads to inflation for load-based tests and can obscure signals at truly associated variants. Furthermore, we show that populations can display spatial structure in rare variants, even when Wright's fixation index F_ST is low, but that allele frequency-dependent metrics of allele sharing can reveal localized stratification. These results underscore the importance of collecting and integrating spatial information in the genetic analysis of complex traits.

Chen H, Hayashi G, Lai OY, Dilthey A, Kuebler PJ, Wong TV, Martin MP, Fernandez Vina MA, McVean G, Wabl M et al. 2012. Psoriasis patients are enriched for genetic variants that protect against HIV-1 disease. PLoS Genet, 8 (2), pp. e1002514.

An important paradigm in evolutionary genetics is that of a delicate balance between genetic variants that favorably boost host control of infection but which may unfavorably increase susceptibility to autoimmune disease. Here, we investigated whether patients with psoriasis, a common immune-mediated disease of the skin, are enriched for genetic variants that limit the ability of HIV-1 virus to replicate after infection. We analyzed the HLA class I and class II alleles of 1,727 Caucasian psoriasis cases and 3,581 controls and found that psoriasis patients are significantly more likely than controls to have gene variants that are protective against HIV-1 disease. This includes several HLA class I alleles associated with HIV-1 control; amino acid residues at HLA-B positions 67, 70, and 97 that mediate HIV-1 peptide binding; and the deletion polymorphism rs67384697 associated with high surface expression of HLA-C. We also found that the compound genotype KIR3DS1 plus HLA-B Bw4-80I, which respectively encode a natural killer cell activating receptor and its putative ligand, significantly increased psoriasis susceptibility. This compound genotype has also been associated with delay of progression to AIDS. Together, our results suggest that genetic variants that contribute to anti-viral immunity may predispose to the development of psoriasis.

Iqbal Z, Caccamo M, Turner I, Flicek P, McVean G. 2012. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat Genet, 44 (2), pp. 226-232.

Detecting genetic variants that are highly divergent from a reference sequence remains a major challenge in genome sequencing. We introduce de novo assembly algorithms using colored de Bruijn graphs for detecting and genotyping simple and complex genetic variants in an individual or population. We provide an efficient software implementation, Cortex, the first de novo assembler capable of assembling multiple eukaryotic genomes simultaneously. Four applications of Cortex are presented. First, we detect and validate both simple and complex structural variations in a high-coverage human genome. Second, we identify more than 3 Mb of sequence absent from the human reference genome, in pooled low-coverage population sequence data from the 1000 Genomes Project. Third, we show how population information from ten chimpanzees enables accurate variant calls without a reference sequence. Last, we estimate classical human leukocyte antigen (HLA) genotypes at HLA-B, the most variable gene in the human genome.

Auton A, McVean G. 2012. Estimating recombination rates from genetic variation in humans. Methods Mol Biol, 856 pp. 217-237.

Recombination acts to shuffle the existing genetic variation within a population, leading to various approaches for detecting its action and estimating the rate at which it occurs. Here, we discuss the principal methodological and analytical approaches taken to understanding the distribution of recombination across the human genome. We first discuss the detection of recent crossover events in both well-characterised pedigrees and larger populations with extensive recent shared ancestry. We then describe approaches for learning about the fine-scale structure of recombination rate variation from patterns of genetic variation in unrelated individuals. Finally, we show how related approaches using individuals of admixed ancestry can provide an alternative approach to analysing recombination. Approaches differ not only in the statistical methods used, but also in the resolution of inference, the timescale over which recombination events are detected, and the extent to which inter-individual variation can be identified.

Reshef DN, Reshef YA, Finucane HK, Grossman SR, McVean G, Turnbaugh PJ, Lander ES, Mitzenmacher M, Sabeti PC. 2011. Detecting novel associations in large data sets. Science, 334 (6062), pp. 1518-1524.

Identifying interesting relationships between pairs of variables in large data sets is increasingly important. Here, we present a measure of dependence for two-variable relationships: the maximal information coefficient (MIC). MIC captures a wide range of associations both functional and not, and for functional relationships provides a score that roughly equals the coefficient of determination (R²) of the data relative to the regression function. MIC belongs to a larger class of maximal information-based nonparametric exploration (MINE) statistics for identifying and classifying relationships. We apply MIC and MINE to data sets in global health, gene expression, major-league baseball, and the human gut microbiota and identify known and novel relationships.

International Multiple Sclerosis Genetics Consortium, Wellcome Trust Case Control Consortium 2, Sawcer S, Hellenthal G, Pirinen M, Spencer CCA, Patsopoulos NA, Moutsianas L, Dilthey A, Su Z et al. 2011. Genetic risk and a primary role for cell-mediated immune mechanisms in multiple sclerosis. Nature, 476 (7359), pp. 214-219.

Multiple sclerosis is a common disease of the central nervous system in which the interplay between inflammatory and neurodegenerative processes typically results in intermittent neurological disturbance followed by progressive accumulation of disability. Epidemiological studies have shown that genetic factors are primarily responsible for the substantially increased frequency of the disease seen in the relatives of affected individuals, and systematic attempts to identify linkage in multiplex families have confirmed that variation within the major histocompatibility complex (MHC) exerts the greatest individual effect on risk. Modestly powered genome-wide association studies (GWAS) have enabled more than 20 additional risk loci to be identified and have shown that multiple variants exerting modest individual effects have a key role in disease susceptibility. Most of the genetic architecture underlying susceptibility to the disease remains to be defined and is anticipated to require the analysis of sample sizes that are beyond the numbers currently available to individual research groups. In a collaborative GWAS involving 9,772 cases of European descent collected by 23 research groups working in 15 different countries, we have replicated almost all of the previously suggested associations and identified at least a further 29 novel susceptibility loci. Within the MHC we have refined the identity of the HLA-DRB1 risk alleles and confirmed that variation in the HLA-A gene underlies the independent protective effect attributable to the class I region. Immunologically relevant genes are significantly overrepresented among those mapping close to the identified loci and particularly implicate T-helper-cell differentiation in the pathogenesis of multiple sclerosis.

Evans DM, Spencer CCA, Pointon JJ, Su Z, Harvey D, Kochan G, Oppermann U, Dilthey A, Pirinen M, Stone MA et al. 2011. Interaction between ERAP1 and HLA-B27 in ankylosing spondylitis implicates peptide handling in the mechanism for HLA-B27 in disease susceptibility. Nat Genet, 43 (8), pp. 761-767.

Ankylosing spondylitis is a common form of inflammatory arthritis predominantly affecting the spine and pelvis that occurs in approximately 5 out of 1,000 adults of European descent. Here we report the identification of three variants in the RUNX3, LTBR-TNFRSF1A and IL12B regions convincingly associated with ankylosing spondylitis (P < 5 × 10⁻⁸ in the combined discovery and replication datasets) and a further four loci at PTGER4, TBKBP1, ANTXR2 and CARD9 that show strong association across all our datasets (P < 5 × 10⁻⁶ overall, with support in each of the three datasets studied). We also show that polymorphisms of ERAP1, which encodes an endoplasmic reticulum aminopeptidase involved in peptide trimming before HLA class I presentation, only affect ankylosing spondylitis risk in HLA-B27-positive individuals. These findings provide strong evidence that HLA-B27 operates in ankylosing spondylitis through a mechanism involving aberrant processing of antigenic peptides.

Didelot X, Bowden R, Street T, Golubchik T, Spencer C, McVean G, Sangal V, Anjum MF, Achtman M, Falush D, Donnelly P. 2011. Recombination and population structure in Salmonella enterica. PLoS Genet, 7 (7), pp. e1002191.

Salmonella enterica is a bacterial pathogen that causes enteric fever and gastroenteritis in humans and animals. Although its population structure was long described as clonal, based on high linkage disequilibrium between loci typed by enzyme electrophoresis, recent examination of gene sequences has revealed that recombination plays an important evolutionary role. We sequenced around 10% of the core genome of 114 isolates of enterica using a resequencing microarray. Application of two different analysis methods (Structure and ClonalFrame) to our genomic data allowed us to define five clear lineages within S. enterica subspecies enterica, one of which is five times older than the other four and two thirds of the age of the whole subspecies. We show that some of these lineages display more evidence of recombination than others. We also demonstrate that some level of sexual isolation exists between the lineages, so that recombination has occurred predominantly between members of the same lineage. This pattern of recombination is compatible with expectations from the previously described ecological structuring of the enterica population as well as mechanistic barriers to recombination observed in laboratory experiments. In spite of their relatively low level of genetic differentiation, these lineages might therefore represent incipient species.

Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST et al. 2011. The variant call format and VCFtools. Bioinformatics, 27 (15), pp. 2156-2158.

SUMMARY: The variant call format (VCF) is a generic format for storing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations. VCF is usually stored in a compressed manner and can be indexed for fast data retrieval of variants from a range of positions on the reference genome. The format was developed for the 1000 Genomes Project, and has also been adopted by other projects such as UK10K, dbSNP and the NHLBI Exome Project. VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging, comparing and also provides a general Perl API. AVAILABILITY: http://vcftools.sourceforge.net
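
For readers unfamiliar with the format, the sketch below parses the eight fixed, tab-separated columns of a single VCF data line (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) in plain Python. It does not use VCFtools or its Perl API, it skips header lines and genotype columns, and the function name and example record are illustrative assumptions.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line):
    """Parse the eight fixed columns of one VCF data line into a dict."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:8]))

    record["POS"] = int(record["POS"])        # 1-based reference position
    record["ALT"] = record["ALT"].split(",")  # one entry per alternate allele

    # INFO is a semicolon-separated list of KEY=VALUE pairs and bare flags.
    info = {}
    if record["INFO"] != ".":
        for item in record["INFO"].split(";"):
            key, _, value = item.partition("=")
            info[key] = value if value else True
    record["INFO"] = info
    return record


# An illustrative SNP record: G to A at position 14370 on chromosome 20.
line = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2"
record = parse_vcf_line(line)
print(record["CHROM"], record["POS"], record["REF"], record["ALT"], record["INFO"])
```

INFO values are left as strings here; a fuller parser would convert them using the type declarations in the file's header lines.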

Moutsianas L, Enciso-Mora V, Ma YP, Leslie S, Dilthey A, Broderick P, Sherborne A, Cooke R, Ashworth A, Swerdlow AJ et al. 2011. Multiple Hodgkin lymphoma-associated loci within the HLA region at chromosome 6p21.3. Blood, 118 (3), pp. 670-674.

Since an association between the human leukocyte antigen (HLA) region and Hodgkin lymphoma (HL) was first reported in 1967, many studies have reported associations between HL risk and both single nucleotide polymorphism (SNP) and classic HLA allele variation in the major histocompatibility complex. However, population stratification and the extent and complexity of linkage disequilibrium within the major histocompatibility complex have hindered efforts to fine-map causal signals. Using SNP data to impute alleles at classic HLA loci, we have conducted an integrated analysis of HL risk within the HLA region in 582 early-onset HL cases and 4736 controls. We confirm that the strongest signal of association comes from an SNP located in the class II region, rs6903608 (odds ratio [OR] = 1.79, P = 6.63 × 10⁻¹⁹), which is unlikely to be driven by association to HLA-DRB, DQA, or DQB alleles. In addition, we identify independent signals at rs2281389 (OR = 1.73, P = 6.31 × 10⁻¹³), a SNP that maps closely to HLA-DPB1, and the class II HLA allele DQA1*02:01 (OR = 0.56, P = 1.51 × 10⁻⁷). These data suggest that multiple independent loci within the HLA class II region contribute to the risk of developing early-onset HL.

Jiang H, Li N, Gopalan V, Zilversmit MM, Varma S, Nagarajan V, Li J, Mu J, Hayton K, Henschen B et al. 2011. High recombination rates and hotspots in a Plasmodium falciparum genetic cross. Genome Biol, 12 (4), pp. R33.

BACKGROUND: The human malaria parasite Plasmodium falciparum survives pressures from the host immune system and antimalarial drugs by modifying its genome. Genetic recombination and nucleotide substitution are the two major mechanisms that the parasite employs to generate genome diversity. A better understanding of these mechanisms may provide important information for studying parasite evolution, immune evasion and drug resistance. RESULTS: Here, we used a high-density tiling array to estimate the genetic recombination rate among 32 progeny of a P. falciparum genetic cross (7G8 × GB4). We detected 638 recombination events and constructed a high-resolution genetic map. Comparing genetic and physical maps, we obtained an overall recombination rate of 9.6 kb per centimorgan and identified 54 candidate recombination hotspots. Similar to centromeres in other organisms, the sequences of P. falciparum centromeres are found in chromosome regions largely devoid of recombination activity. Motifs enriched in hotspots were also identified, including a 12-bp G/C-rich motif with 3-bp periodicity that may interact with a protein containing 11 predicted zinc finger arrays. CONCLUSIONS: These results show that the P. falciparum genome has a high recombination rate, although it also follows the overall rule of meiosis in eukaryotes with an average of approximately one crossover per chromosome per meiosis. GC-rich repetitive motifs identified in the hotspot sequences may play a role in the high recombination rate observed. The lack of recombination activity in centromeric regions is consistent with the observations of reduced recombination near the centromeres of other organisms.

Sainudiin R, Thornton K, Harlow J, Booth J, Stillman M, Yoshida R, Griffiths R, McVean G, Donnelly P. 2011. Experiments with the site frequency spectrum. Bull Math Biol, 73 (4), pp. 829-872.

Evaluating the likelihood function of parameters in highly-structured population genetic models from extant deoxyribonucleic acid (DNA) sequences is computationally prohibitive. In such cases, one may approximately infer the parameters from summary statistics of the data such as the site-frequency-spectrum (SFS) or its linear combinations. Such methods are known as approximate likelihood or Bayesian computations. Using a controlled lumped Markov chain and computational commutative algebraic methods, we compute the exact likelihood of the SFS and many classical linear combinations of it at a non-recombining locus that is neutrally evolving under the infinitely-many-sites mutation model. Using a partially ordered graph of coalescent experiments around the SFS, we provide a decision-theoretic framework for approximate sufficiency. We also extend a family of classical hypothesis tests of standard neutrality at a non-recombining locus based on the SFS to a more powerful version that conditions on the topological information provided by the SFS.
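
To make the summary statistic concrete, here is a minimal Python sketch of how an unfolded site frequency spectrum can be tabulated from a small haplotype matrix. It only counts derived alleles per site; it does not implement the exact-likelihood or commutative-algebra methods of the paper. The function name and the toy data are assumptions for illustration.

```python
def site_frequency_spectrum(haplotypes):
    """Return the unfolded SFS of a 0/1 haplotype matrix.

    haplotypes: list of n equal-length lists, 0 = ancestral, 1 = derived.
    Entry i-1 of the result counts the sites at which exactly i of the
    n sampled sequences carry the derived allele (i = 1 .. n-1).
    """
    n = len(haplotypes)
    n_sites = len(haplotypes[0])
    sfs = [0] * (n - 1)
    for site in range(n_sites):
        derived = sum(h[site] for h in haplotypes)
        if 0 < derived < n:          # skip sites that are not segregating
            sfs[derived - 1] += 1
    return sfs


# Toy sample: 4 sequences, 4 segregating sites (made-up data).
haps = [
    [1, 0, 0, 1],
    [1, 0, 1, 0],
    [0, 0, 1, 0],
    [0, 1, 1, 0],
]
sfs = site_frequency_spectrum(haps)  # two singletons, one doubleton, one tripleton
```

Classical summaries such as the number of segregating sites are linear combinations of these counts; for example, `sum(sfs)` gives the number of segregating sites in the toy sample.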

Hernandez RD, Kelley JL, Elyashiv E, Melton SC, Auton A, McVean G, 1000 Genomes Project, Sella G, Przeworski M. 2011. Classic selective sweeps were rare in recent human evolution. Science, 331 (6019), pp. 920-924.

Efforts to identify the genetic basis of human adaptations from polymorphism data have sought footprints of "classic selective sweeps" (in which a beneficial mutation arises and rapidly fixes in the population). Yet it remains unknown whether this form of natural selection was common in our evolution. We examined the evidence for classic sweeps in resequencing data from 179 human genomes. As expected under a recurrent-sweep model, we found that diversity levels decrease near exons and conserved noncoding regions. In contrast to expectation, however, the trough in diversity around human-specific amino acid substitutions is no more pronounced than around synonymous substitutions. Moreover, relative to the genome background, amino acid and putative regulatory sites are not significantly enriched in alleles that are highly differentiated between populations. These findings indicate that classic sweeps were not a dominant mode of human adaptation over the past ~250,000 years.

Dilthey AT, Moutsianas L, Leslie S, McVean G. 2011. HLA*IMP--an integrated framework for imputing classical HLA alleles from SNP genotypes. Bioinformatics, 27 (7), pp. 968-972.

MOTIVATION: Genetic variation at classical HLA alleles influences many phenotypes, including susceptibility to autoimmune disease, resistance to pathogens and the risk of adverse drug reactions. However, classical HLA typing methods are often prohibitively expensive for large-scale studies. We previously described a method for imputing classical alleles from linked SNP genotype data. Here, we present a modification of the original algorithm implemented in a freely available software suite that combines local data preparation and QC with probabilistic imputation through a remote server. RESULTS: We introduce two modifications to the original algorithm. First, we present a novel SNP selection function that leads to pronounced increases (up by 40% in some scenarios) in call rate. Second, we develop a parallelized model building algorithm that allows us to process a reference set of over 2500 individuals. In a validation experiment, we show that our framework produces highly accurate HLA type imputations at class I and class II loci for independent datasets: at call rates of 95-99%, imputation accuracy is between 92% and 98% at the four-digit level and over 97% at the two-digit level. We demonstrate utility of the method through analysis of a genome-wide association study for psoriasis where there is a known classical HLA risk allele (HLA-C*06:02). We show that the imputed allele shows stronger association with disease than any single SNP within the region. The imputation framework, HLA*IMP, provides a powerful tool for dissecting the architecture of genetic risk within the HLA. AVAILABILITY: HLA*IMP, implemented in C++ and Perl, is available from http://oxfordhla.well.ox.ac.uk and is free for academic use.

Mills RE, Walter K, Stewart C, Handsaker RE, Chen K, Alkan C, Abyzov A, Yoon SC, Ye K, Cheetham RK et al. 2011. Mapping copy number variation by population-scale genome sequencing. Nature, 470 (7332), pp. 59-65.

Genomic structural variants (SVs) are abundant in humans, differing from other forms of variation in extent, origin and functional impact. Despite progress in SV characterization, the nucleotide resolution architecture of most SVs remains unknown. We constructed a map of unbalanced SVs (that is, copy number variants) based on whole genome DNA sequencing data from 185 human genomes, integrating evidence from complementary SV discovery approaches with extensive experimental validations. Our map encompassed 22,025 deletions and 6,000 additional SVs, including insertions and tandem duplications. Most SVs (53%) were mapped to nucleotide resolution, which facilitated analysing their origin and functional impact. We examined numerous whole and partial gene deletions with a genotyping approach and observed a depletion of gene disruptions amongst high frequency deletions. Furthermore, we observed differences in the size spectra of SVs originating from distinct formation mechanisms, and constructed a map of SV hotspots formed by common mechanisms. Our analytical framework and SV map serve as a resource for sequencing-based association studies.

Hosking FJ, Leslie S, Dilthey A, Moutsianas L, Wang Y, Dobbins SE, Papaemmanuil E, Sheridan E, Kinsey SE, Lightfoot T et al. 2011. MHC variation and risk of childhood B-cell precursor acute lymphoblastic leukemia. Blood, 117 (5), pp. 1633-1640.

A role for specific human leukocyte antigen (HLA) variants in the etiology of childhood acute lymphoblastic leukemia (ALL) has been extensively studied over the last 30 years, but no unambiguous association has been identified. To comprehensively study the relationship between genetic variation within the 4.5 Mb major histocompatibility complex genomic region and precursor B-cell (BCP) ALL risk, we analyzed 1075 observed and 8176 imputed single nucleotide polymorphisms and their related haplotypes in 824 BCP-ALL cases and 4737 controls. Using these genotypes we also imputed both common and rare alleles at class I (HLA-A, HLA-B, and HLA-C) and class II (HLA-DRB1, HLA-DQA1, and HLA-DQB1) HLA loci. Overall, we found no statistically significant association between variants and BCP-ALL risk. We conclude that major histocompatibility complex-defined variation in immune-mediated response is unlikely to be a major risk factor for BCP-ALL.

Genetic Analysis of Psoriasis Consortium & the Wellcome Trust Case Control Consortium 2, Strange A, Capon F, Spencer CCA, Knight J, Weale ME, Allen MH, Barton A, Band G, Bellenguez C et al. 2010. A genome-wide association study identifies new psoriasis susceptibility loci and an interaction between HLA-C and ERAP1. Nat Genet, 42 (11), pp. 985-990.

To identify new susceptibility loci for psoriasis, we undertook a genome-wide association study of 594,224 SNPs in 2,622 individuals with psoriasis and 5,667 controls. We identified associations at eight previously unreported genomic loci. Seven loci harbored genes with recognized immune functions (IL28RA, REL, IFIH1, ERAP1, TRAF3IP2, NFKBIA and TYK2). These associations were replicated in 9,079 European samples (six loci with a combined P < 5 × 10⁻⁸ and two loci with a combined P < 5 × 10⁻⁷). We also report compelling evidence for an interaction between the HLA-C and ERAP1 loci (combined P = 6.95 × 10⁻⁶). ERAP1 plays an important role in MHC class I peptide processing. ERAP1 variants only influenced psoriasis susceptibility in individuals carrying the HLA-C risk allele. Our findings implicate pathways that integrate epidermal barrier dysfunction with innate and adaptive immune dysregulation in psoriasis pathogenesis.

1000 Genomes Project Consortium, Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA. 2010. A map of human genome variation from population-scale sequencing. Nature, 467 (7319), pp. 1061-1073.

The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four populations; high-coverage sequencing of two mother-father-child trios; and exon-targeted sequencing of 697 individuals from seven populations. We describe the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast majority of common variation, over 95% of the currently accessible variants found in any individual are present in this data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base substitution mutations to be approximately 10⁻⁸ per base pair per generation. We explore the data with regard to signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes, due to selection at linked sites. These methods and public data will support the next phase of human genetic research.

Albers CA, Lunter G, MacArthur DG, McVean G, Ouwehand WH, Durbin R. 2011. Dindel: accurate indel calls from short-read data. Genome Res, 21 (6), pp. 961-973.

Small insertions and deletions (indels) are a common and functionally important type of sequence polymorphism. Most of the focus of studies of sequence variation is on single nucleotide variants (SNVs) and large structural variants. In principle, high-throughput sequencing studies should allow identification of indels just as SNVs. However, inference of indels from next-generation sequence data is challenging, and so far methods for identifying indels lag behind methods for calling SNVs in terms of sensitivity and specificity. We propose a Bayesian method to call indels from short-read sequence data in individuals and populations by realigning reads to candidate haplotypes that represent alternative sequence to the reference. The candidate haplotypes are formed by combining candidate indels and SNVs identified by the read mapper, while allowing for known sequence variants or candidates from other methods to be included. In our probabilistic realignment model we account for base-calling errors, mapping errors, and also, importantly, for increased sequencing error indel rates in long homopolymer runs. We show that our method is sensitive and achieves low false discovery rates on simulated and real data sets, although challenges remain. The algorithm is implemented in the program Dindel, which has been used in the 1000 Genomes Project call sets.

Sudmant PH, Kitzman JO, Antonacci F, Alkan C, Malig M, Tsalenko A, Sampas N, Bruhn L, Shendure J, 1000 Genomes Project, Eichler EE. 2010. Diversity of human copy number variation and multicopy genes. Science, 330 (6004), pp. 641-646.

Copy number variants affect both disease and normal phenotypic variation, but those lying within heavily duplicated, highly identical sequence have been difficult to assay. By analyzing short-read mapping depth for 159 human genomes, we demonstrated accurate estimation of absolute copy number for duplications as small as 1.9 kilobase pairs, ranging from 0 to 48 copies. We identified 4.1 million "singly unique nucleotide" positions informative in distinguishing specific copies and used them to genotype the copy and content of specific paralogs within highly duplicated gene families. These data identify human-specific expansions in genes associated with brain development, reveal extensive population genetic diversity, and detect signatures consistent with gene conversion in the human species. Our approach makes ~1000 genes accessible to genetic studies of disease association.

McVean G, Myers S. 2010. PRDM9 marks the spot. Nat Genet, 42 (10), pp. 821-822.

A new study demonstrates that PRDM9 variation in humans leads to profound differences in the activity of hotspots for both allelic recombination and genomic instability. Although PRDM9 is found to play a role in many more human hotspots than previously suspected, the search remains for additional, undetermined factors involved in defining hotspot locations and intensities. © 2010 Nature America, Inc. All rights reserved.

International HapMap 3 Consortium, Altshuler DM, Gibbs RA, Peltonen L, Dermitzakis E, Schaffner SF, Yu F et al. 2010. Integrating common and rare genetic variation in diverse human populations. Nature, 467 (7311), pp. 52-58.

Despite great progress in identifying genetic variants that influence human disease, most inherited risk remains unexplained. A more complete understanding requires genome-wide studies that fully examine less common alleles in populations with a wide range of ancestry. To inform the design and interpretation of such studies, we genotyped 1.6 million common single nucleotide polymorphisms (SNPs) in 1,184 reference individuals from 11 global populations, and sequenced ten 100-kilobase regions in 692 of these individuals. This integrated data set of common and rare alleles, called 'HapMap 3', includes both SNPs and copy number polymorphisms (CNPs). We characterized population-specific differences among low-frequency variants, measured the improvement in imputation accuracy afforded by the larger reference panel, especially in imputing SNPs with a minor allele frequency of ≤5%, and demonstrated the feasibility of imputing newly discovered CNPs and SNPs. This expanded public resource of genome variants in global populations supports deeper interrogation of genomic variation and its role in human disease, and serves as a step towards a high-resolution map of the landscape of human genetic variation.

McVean G. 2010. What drives recombination hotspots to repeat DNA in humans? Philos Trans R Soc Lond B Biol Sci, 365 (1544), pp. 1213-1218.

Recombination between homologous, but non-allelic, stretches of DNA such as gene families, segmental duplications and repeat elements is an important source of mutation. In humans, recent studies have identified short DNA motifs that both determine the location of 40 per cent of meiotic cross-over hotspots and are significantly enriched at the breakpoints of recurrent non-allelic homologous recombination (NAHR) syndromes. Unexpectedly, the most highly penetrant form of the motif occurs on the background of an inactive repeat element family (THE1 elements) and the motif also has strong recombinogenic activity on currently active element families including Alu and LINE2 elements. Analysis of genetic variation among members of these repeat families indicates an important role for NAHR in their evolution. Given the potential for double-strand breaks within repeat DNA to cause pathological rearrangement, the association between repeats and hotspots is surprising. Here we consider possible explanations for why selection acting against NAHR has not eliminated hotspots from repeat DNA including mechanistic constraints, possible benefits to repeat DNA from recruiting hotspots and rapid evolution of the recombination machinery. I suggest that rapid evolution of hotspot motifs may, surprisingly, tend to favour sequences present in repeat DNA and outline the data required to differentiate between hypotheses.

Wellcome Trust Case Control Consortium, Craddock N, Hurles ME, Cardin N, Pearson RD, Plagnol V, Robson S, Vukcevic D, Barnes C, Conrad DF et al. 2010. Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls. Nature, 464 (7289), pp. 713-720.

Copy number variants (CNVs) account for a major proportion of human genetic polymorphism and have been predicted to have an important role in genetic susceptibility to common disease. To address this we undertook a large, direct genome-wide study of association between CNVs and eight common human diseases. Using a purpose-designed array we typed approximately 19,000 individuals into distinct copy-number classes at 3,432 polymorphic CNVs, including an estimated approximately 50% of all common CNVs larger than 500 base pairs. We identified several biological artefacts that lead to false-positive associations, including systematic CNV differences between DNAs derived from blood and cell lines. Association testing and follow-up replication analyses confirmed three loci where CNVs were associated with disease (IRGM for Crohn's disease, HLA for Crohn's disease, rheumatoid arthritis and type 1 diabetes, and TSPAN8 for type 2 diabetes), although in each case the locus had previously been identified in single nucleotide polymorphism (SNP)-based studies, reflecting our observation that most common CNVs that are well-typed on our array are well tagged by SNPs and so have been indirectly explored through SNP studies. We conclude that common CNVs that can be typed on existing platforms are unlikely to contribute greatly to the genetic basis of common human diseases.

Myers S, Bowden R, Tumian A, Bontrop RE, Freeman C, MacFie TS, McVean G, Donnelly P. 2010. Drive against hotspot motifs in primates implicates the PRDM9 gene in meiotic recombination. Science, 327 (5967), pp. 876-879.

Although present in both humans and chimpanzees, recombination hotspots, at which meiotic crossover events cluster, differ markedly in their genomic location between the species. We report that a 13-base pair sequence motif previously associated with the activity of 40% of human hotspots does not function in chimpanzees and is being removed by self-destructive drive in the human lineage. Multiple lines of evidence suggest that the rapidly evolving zinc-finger protein PRDM9 binds to this motif and that sequence changes in the protein may be responsible for hotspot differences between species. The involvement of PRDM9, which causes histone H3 lysine 4 trimethylation, implies that there is a common mechanism for recombination hotspots in eukaryotes but raises questions about what forces have driven such rapid change.

International MHC and Autoimmunity Genetics Network, Rioux JD, Goyette P, Vyse TJ, Hammarström L, Fernando MMA, Green T, De Jager PL, Foisy S, Wang J et al. 2009. Mapping of multiple susceptibility variants within the MHC region for 7 immune-mediated diseases. Proc Natl Acad Sci U S A, 106 (44), pp. 18680-18685.

The human MHC represents the strongest susceptibility locus for autoimmune diseases. However, the identification of the true predisposing gene(s) has been handicapped by the strong linkage disequilibrium across the region. Furthermore, most studies to date have been limited to the examination of a subset of the HLA and non-HLA genes with a marker density and sample size insufficient for mapping all independent association signals. We genotyped a panel of 1,472 SNPs to capture the common genomic variation across the 3.44 megabase (Mb) classic MHC region in 10,576 DNA samples derived from patients with systemic lupus erythematosus, Crohn's disease, ulcerative colitis, rheumatoid arthritis, myasthenia gravis, selective IgA deficiency, multiple sclerosis, and appropriate control samples. We identified the primary association signals for each disease and performed conditional regression to identify independent secondary signals. The data demonstrate that MHC associations with autoimmune diseases result from complex, multilocus effects that span the entire region.

Goriely A, Hansen RMS, Taylor IB, Olesen IA, Jacobsen GK, McGowan SJ, Pfeifer SP, McVean GAT, Rajpert-De Meyts E, Wilkie AOM. 2009. Activating mutations in FGFR3 and HRAS reveal a shared genetic origin for congenital disorders and testicular tumors. Nat Genet, 41 (11), pp. 1247-1252.

Genes mutated in congenital malformation syndromes are frequently implicated in oncogenesis, but the causative germline and somatic mutations occur in separate cells at different times of an organism's life. Here we unify these processes to a single cellular event for mutations arising in male germ cells that show a paternal age effect. Screening of 30 spermatocytic seminomas for oncogenic mutations in 17 genes identified 2 mutations in FGFR3 (both 1948A>G, encoding K650E, which causes thanatophoric dysplasia in the germline) and 5 mutations in HRAS. Massively parallel sequencing of sperm DNA showed that levels of the FGFR3 mutation increase with paternal age and that the mutation spectrum at the Lys650 codon is similar to that observed in bladder cancer. Most spermatocytic seminomas show increased immunoreactivity for FGFR3 and/or HRAS. We propose that paternal age-effect mutations activate a common 'selfish' pathway supporting proliferation in the testis, leading to diverse phenotypes in the next generation including fetal lethality, congenital syndromes and cancer predisposition.

McVean G. 2009. A genealogical interpretation of principal components analysis. PLoS Genet, 5 (10), pp. e1000686.

Principal components analysis, PCA, is a statistical method commonly used in population genetics to identify structure in the distribution of genetic variation across geographical location and ethnic background. However, while the method is often used to inform about historical demographic processes, little is known about the relationship between fundamental demographic parameters and the projection of samples onto the primary axes. Here I show that for SNP data the projection of samples onto the principal components can be obtained directly from considering the average coalescent times between pairs of haploid genomes. The result provides a framework for interpreting PCA projections in terms of underlying processes, including migration, geographical isolation, and admixture. I also demonstrate a link between PCA and Wright's f_st and show that SNP ascertainment has a largely simple and predictable effect on the projection of samples. Using examples from human genetics, I discuss the application of these results to empirical data and the implications for inference.

Myers S, Freeman C, Auton A, Donnelly P, McVean G. 2008. A common sequence motif associated with recombination hot spots and genome instability in humans. Nat Genet, 40 (9), pp. 1124-1129.

In humans, most meiotic crossover events are clustered into short regions of the genome known as recombination hot spots. We have previously identified DNA motifs that are enriched in hot spots, particularly the 7-mer CCTCCCT. Here we use the increased hot-spot resolution afforded by the Phase 2 HapMap and novel search methods to identify an extended family of motifs based around the degenerate 13-mer CCNCCNTNNCCNC, which is critical in recruiting crossover events to at least 40% of all human hot spots and which operates on diverse genetic backgrounds in both sexes. Furthermore, these motifs are found in hypervariable minisatellites and are clustered in the breakpoint regions of both disease-causing nonallelic homologous recombination hot spots and common mitochondrial deletion hot spots, implicating the motif as a driver of genome instability.
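
The degenerate 13-mer quoted above can be located in a sequence with a simple pattern search. The Python sketch below treats each N as "any of A, C, G or T" and scans the forward strand only; it is an illustration of the motif notation, not the search method used in the paper, and the function names and example sequence are assumptions.

```python
import re

MOTIF_13MER = "CCNCCNTNNCCNC"   # degenerate 13-mer from the abstract
MOTIF_7MER = "CCTCCCT"          # 7-mer from the abstract


def motif_to_regex(motif):
    """Compile a motif into a regex, reading N as any single base.

    A lookahead is used so that overlapping occurrences are also reported.
    """
    return re.compile("(?=(" + motif.replace("N", "[ACGT]") + "))")


def find_motif(sequence, motif=MOTIF_13MER):
    """Return 0-based start positions of the motif on the forward strand."""
    return [m.start() for m in motif_to_regex(motif).finditer(sequence.upper())]


# Made-up sequence that embeds one instance of the 13-mer at position 2.
seq = "AACCTCCCTAGCCACGG"
hits_13 = find_motif(seq)
hits_7 = find_motif(seq, MOTIF_7MER)
```

A fuller scan would also search the reverse complement, since a motif can act from either strand.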

McVean G. 2008. Linkage Disequilibrium, Recombination and Selection. 2, pp. 909-944.

Leslie S, Donnelly P, McVean G. 2008. A statistical method for predicting classical HLA alleles from SNP data. Am J Hum Genet, 82 (1), pp. 48-56.

Genetic variation at classical HLA alleles is a crucial determinant of transplant success and susceptibility to a large number of infectious and autoimmune diseases. However, large-scale studies involving classical type I and type II HLA alleles might be limited by the cost of allele-typing technologies. Although recent studies have shown that some common HLA alleles can be tagged with small numbers of markers, SNP-based tagging does not offer a complete solution to predicting HLA alleles. We have developed a new statistical methodology to use SNP variation within the region to predict alleles at key class I (HLA-A, HLA-B, and HLA-C) and class II (HLA-DRB1, HLA-DQA1, and HLA-DQB1) loci. Our results indicate that a single panel of approximately 100 SNPs typed across the region is sufficient for predicting both rare and common HLA alleles with up to 95% accuracy in both African and non-African populations. Furthermore, we show that HLA alleles can be successfully predicted by using previously genotyped SNPs that are within the MHC and that had not been chosen for their ability to predict HLA alleles, such as those included on genome-wide products. These results indicate that our methodology, combined with an extended database of reference haplotypes, will facilitate large-scale experiments, including disease-association studies and vaccine trials, in which detailed information about HLA type is valuable.

Stone GN, Atkinson RJ, Rokas A, Aldrey J-LN, Melika G, Acs Z, Csóka G, Hayward A, Bailey R, Buckee C, McVean GAT. 2008. Evidence for widespread cryptic sexual generations in apparently purely asexual Andricus gallwasps. Mol Ecol, 17 (2), pp. 652-665.

Oak gallwasps (Hymenoptera, Cynipidae, Cynipini) are one of seven major animal taxa that commonly reproduce by cyclical parthenogenesis (CP). A major question in research on CP taxa is the frequency with which lineages lose their sexual generations, and diversify as purely asexual radiations. Most oak gallwasp species are only known from an asexual generation, and secondary loss of sex has been conclusively demonstrated in several species, particularly members of the holarctic genus Andricus. This raises the possibility of widespread secondary loss of sex in the Cynipini, and of diversification within purely parthenogenetic lineages. We use two approaches based on analyses of allele frequency data to test for cryptic sexual generations in eight apparently asexual European species distributed through a major western palaearctic lineage of the gallwasp genus Andricus. All species showing adequate levels of polymorphism (7/8) showed signatures of sex compatible with cyclical parthenogenesis. We also use DNA sequence data to test the hypothesis that ignorance of these sexual generations (despite extensive study on this group) results from failure to discriminate among known but morphologically indistinguishable sexual generations. This hypothesis is supported: 35 sequences attributed by leading cynipid taxonomists to a single sexual adult morphospecies, Andricus burgundus, were found to represent the sexual generations of at least six Andricus species. We confirm cryptic sexual generations in a total of 11 Andricus species, suggesting that secondary loss of sex is rare in Andricus.

International HapMap Consortium, Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, Gibbs RA, Belmont JW, Boudreau A, Hardenbol P et al. 2007. A second generation human haplotype map of over 3.1 million SNPs. Nature, 449 (7164), pp. 851-861.

We describe the Phase II HapMap, which characterizes over 3.1 million human single nucleotide polymorphisms (SNPs) genotyped in 270 individuals from four geographically diverse populations and includes 25-35% of common SNP variation in the populations surveyed. The map is estimated to capture untyped common variation with an average maximum r² of between 0.9 and 0.96 depending on population. We demonstrate that the current generation of commercial genome-wide genotyping products captures common Phase II SNPs with an average maximum r² of up to 0.8 in African and up to 0.95 in non-African populations, and that potential gains in power in association studies can be obtained through imputation. These data also reveal novel aspects of the structure of linkage disequilibrium. We show that 10-30% of pairs of individuals within a population share at least one region of extended genetic identity arising from recent ancestry and that up to 1% of all common variants are untaggable, primarily because they lie within recombination hotspots. We show that recombination rates vary systematically around genes and between genes of different function. Finally, we demonstrate increased differentiation at non-synonymous, compared to synonymous, SNPs, resulting from systematic differences in the strength or efficacy of natural selection between populations.

Gay J, Myers S, McVean G. 2007. Estimating meiotic gene conversion rates from population genetic data. Genetics, 177 (2), pp. 881-894.

Gene conversion plays an important part in shaping genetic diversity in populations, yet estimating the rate at which it occurs is difficult because of the short lengths of DNA involved. We have developed a new statistical approach to estimating gene conversion rates from genetic variation, by extending an existing model for haplotype data in the presence of crossover events. We show, by simulation, that when the rate of gene conversion events is at least comparable to the rate of crossover events, the method provides a powerful approach to the detection of gene conversion and estimation of its rate. Application of the method to data from the telomeric X chromosome of Drosophila melanogaster, in which crossover activity is suppressed, indicates that gene conversion occurs approximately 400 times more often than crossover events. We also extend the method to estimating variable crossover and gene conversion rates and estimate the rate of gene conversion to be approximately 1.5 times higher than the crossover rate in a region of human chromosome 1 with known recombination hotspots.

Auton A, McVean G. 2007. Recombination rate estimation in the presence of hotspots. Genome Res, 17 (8), pp. 1219-1227.

Fine-scale estimation of recombination rates remains a challenging problem. Experimental techniques can provide accurate estimates at fine scales but are technically challenging and cannot be applied on a genome-wide scale. An alternative source of information comes from patterns of genetic variation. Several statistical methods have been developed to estimate recombination rates from randomly sampled chromosomes. However, most such methods either make poor assumptions about recombination rate variation, or simply assume that there is no rate variation. Since the discovery of recombination hotspots, it is clear that recombination rates can vary over many orders of magnitude at the fine scale. We present a method for the estimation of recombination rates in the presence of recombination hotspots. We demonstrate that the method is able to detect and accurately quantify recombination rate heterogeneity, and is a substantial improvement over a commonly used method. We then use the method to reanalyze genetic variation data from the HLA and MS32 regions of the human genome and demonstrate that the method is able to provide accurate rate estimates and simultaneously detect hotspots.

Marchini J, Howie B, Myers S, McVean G, Donnelly P. 2007. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet, 39 (7), pp. 906-913.

Genome-wide association studies are set to become the method of choice for uncovering the genetic basis of human diseases. A central challenge in this area is the development of powerful multipoint methods that can detect causal variants that have not been directly genotyped. We propose a coherent analysis framework that treats the problem as one involving missing or uncertain genotypes. Central to our approach is a model-based imputation method for inferring genotypes at observed or unobserved SNPs, leading to improved power over existing methods for multipoint association mapping. Using real genome-wide association study data, we show that our approach (i) is accurate and well calibrated, (ii) provides detailed views of associated regions that facilitate follow-up studies and (iii) can be used to validate and correct data at genotyped markers. A notable future use of our method will be to boost power by combining data from genome-wide scans that use different SNP sets.

Barry AE, Leliwa-Sytek A, Tavul L, Imrie H, Migot-Nabias F, Brown SM, McVean G, Day KP. 2007. Erratum: Population genomics of the immune evasion (var) genes of Plasmodium falciparum (PLoS Pathogens (2007) 3,3 (e70) DOI: 10.1371/journal.ppat.0030034). PLoS Pathogens, 3 (5), pp. 0690.

Barry AE, Leliwa-Sytek A, Tavul L, Imrie H, Migot-Nabias F, Brown SM, McVean GAV, Day KP. 2007. Population genomics of the immune evasion (var) genes of Plasmodium falciparum. PLoS Pathog, 3 (3), pp. e34.

Var genes encode the major surface antigen (PfEMP1) of the blood stages of the human malaria parasite Plasmodium falciparum. Differential expression of up to 60 diverse var genes in each parasite genome underlies immune evasion. We compared the diversity of the DBLalpha domain of var genes sampled from 30 parasite isolates from a malaria endemic area of Papua New Guinea (PNG) and 59 from widespread geographic origins (global). Overall, we obtained over 8,000 quality-controlled DBLalpha sequences. Within our sampling frame, the global population had a total of 895 distinct DBLalpha "types" and negligible overlap among repertoires. This indicated that var gene diversity on a global scale is so immense that many genomes would need to be sequenced to capture its true extent. In contrast, we found a much lower diversity in PNG of 185 DBLalpha types, with an average of approximately 7% overlap among repertoires. While we identify marked geographic structuring, nearly 40% of types identified in PNG were also found in samples from different countries showing a cosmopolitan distribution for much of the diversity. We also present evidence to suggest that recombination plays a key role in maintaining the unprecedented levels of polymorphism found in these immune evasion genes. This population genomic framework provides a cost effective molecular epidemiological tool to rapidly explore the geographic diversity of var genes.

McVean G. 2007. The structure of linkage disequilibrium around a selective sweep. Genetics, 175 (3), pp. 1395-1406.

The fixation of advantageous mutations by natural selection has a profound impact on patterns of linked neutral variation. While it has long been appreciated that such selective sweeps influence the frequency spectrum of nearby polymorphism, it has only recently become clear that they also have dramatic effects on local linkage disequilibrium. By extending previous results on the relationship between genealogical structure and linkage disequilibrium, I obtain simple expressions for the influence of a selective sweep on patterns of allelic association. I show that sweeps can increase, decrease, or even eliminate linkage disequilibrium (LD) entirely depending on the relative position of the selected and neutral loci. I also show the importance of the age of the neutral mutations in predicting their degree of association and describe the consequences of such results for the interpretation of empirical data. In particular, I demonstrate that while selective sweeps can eliminate LD, they generate patterns of genetic variation very different from those expected from recombination hotspots.

Eyheramendy S, Marchini J, McVean G, Myers S, Donnelly P. 2007. A model-based approach to capture genetic variation for future association studies. Genome Res, 17 (1), pp. 88-95.

Genome-wide association studies are still constrained by the cost of genotyping. For this reason, the selection of a reduced set of markers or tags able to capture a significant proportion of the genetic variation is an important aspect of these studies. Most tagging SNP selection methods have been successful in capturing the genetic variation of the data from which the tags have been chosen. However, when these tags are used in an independent data set, a significant proportion of the remaining SNPs (non-tags) are not captured and, in most cases, there is no information on which SNPs are captured. We propose to use a probabilistic model to predict the non-tags based on a set of tags, as a way to capture genetic variation. An important advantage of this method is that it directly predicts the genotype of the non-tags with which we can test for association with the phenotype and which could help to elucidate the location of genes responsible for increasing disease susceptibility. Additionally, this method provides an estimate of the probabilities with which the predictions are made, which reflects the confidence of the probabilistic model. We also propose new methods to select the tagging SNPs. We empirically show by using HapMap data that our approach is able to capture significantly more genetic variation than methods based solely on a pairwise LD measure.

Sabeti PC, Varilly P, Fry B, Lohmueller J, Hostetter E, Cotsapas C, Xie X, Byrne EH, McCarroll SA, Gaudet R et al. 2007. Genome-wide detection and characterization of positive selection in human populations. Nature, 449 (7164), pp. 913-918.

With the advent of dense maps of human genetic variation, it is now possible to detect positive natural selection across the human genome. Here we report an analysis of over 3 million polymorphisms from the International HapMap Project Phase 2 (HapMap2). We used 'long-range haplotype' methods, which were developed to identify alleles segregating in a population that have undergone recent selection, and we also developed new methods that are based on cross-population comparisons to discover alleles that have swept to near-fixation within a population. The analysis reveals more than 300 strong candidate regions. Focusing on the strongest 22 regions, we develop a heuristic for scrutinizing these regions to identify candidate targets of selection. In a complementary analysis, we identify 26 non-synonymous, coding, single nucleotide polymorphisms showing regional evidence of positive selection. Examination of these candidates highlights three cases in which two genes in a common biological process have apparently undergone positive selection in the same population: LARGE and DMD, both related to infection by the Lassa virus, in West Africa; SLC24A5 and SLC45A2, both involved in skin pigmentation, in Europe; and EDAR and EDA2R, both involved in development of hair follicles, in Asia.

Mu J, Awadalla P, Duan J, McGee KM, Keebler J, Seydel K, McVean GAT, Su X-Z. 2007. Genome-wide variation and identification of vaccine targets in the Plasmodium falciparum genome. Nat Genet, 39 (1), pp. 126-130.

One goal in sequencing the Plasmodium falciparum genome, the agent of the most lethal form of malaria, is to discover vaccine and drug targets. However, identifying those targets in a genome in which approximately 60% of genes have unknown functions is an enormous challenge. Because the majority of known malaria antigens and drug-resistant genes are highly polymorphic and under various selective pressures, genome-wide analysis for signatures of selection may lead to discovery of new vaccine and drug candidates. Here we surveyed 3,539 P. falciparum genes (approximately 65% of the predicted genes) for polymorphisms and identified various highly polymorphic loci and genes, some of which encode new antigens that we confirmed using human immune sera. Our collections of genome-wide SNPs (approximately 65% nonsynonymous) and polymorphic microsatellites and indels provide a high-resolution map (one marker per approximately 4 kb) for mapping parasite traits and studying parasite populations. In addition, we report new antigens, providing urgently needed vaccine candidates for disease control.

McVean G, Spencer CCA. 2006. Scanning the human genome for signals of selection. Curr Opin Genet Dev, 16 (6), pp. 624-629.

The search for adaptive evolution in the human genome has reached a new era with the advent of genome-wide surveys of genetic variation. However, making sense, let alone use, of such experiments is far from straightforward. Key problems include the way in which the data have been collected, the need to control for factors such as population history and variable recombination rates, which influence the discovery rates for both true and false positives, and the inherent difficulty of falsification. Nevertheless, recent work has shown that genome scans can be used to identify both functional polymorphisms underlying selected traits and entire classes of genes enriched for signals of adaptation.

Mu J, Awadalla P, Duan J, McGee KM, Keebler J, Seydel K, McVean GAT, Su X-Z. 2006. Genome-wide variation and identification of vaccine targets in the Plasmodium falciparum genome. Am J Trop Med Hyg, 75 (5), pp. 87-87.

Gregory SG, Barlow KF, McLay KE, Kaul R, Swarbreck D, Dunham A, Scott CE, Howe KL, Woodfine K, Spencer CCA et al. 2006. Erratum: The DNA sequence and biological annotation of human chromosome 1. Nature, 443 (7114), pp. 1013-1013.

de Bakker PIW, McVean G, Sabeti PC, Miretti MM, Green T, Marchini J, Ke X, Monsuur AJ, Whittaker P, Delgado M et al. 2006. A high-resolution HLA and SNP haplotype map for disease association studies in the extended human MHC. Nat Genet, 38 (10), pp. 1166-1172.

The proteins encoded by the classical HLA class I and class II genes in the major histocompatibility complex (MHC) are highly polymorphic and are essential in self versus non-self immune recognition. HLA variation is a crucial determinant of transplant rejection and susceptibility to a large number of infectious and autoimmune diseases. Yet identification of causal variants is problematic owing to linkage disequilibrium that extends across multiple HLA and non-HLA genes in the MHC. We therefore set out to characterize the linkage disequilibrium patterns between the highly polymorphic HLA genes and background variation by typing the classical HLA genes and >7,500 common SNPs and deletion-insertion polymorphisms across four population samples. The analysis provides informative tag SNPs that capture much of the common variation in the MHC region and that could be used in disease association studies, and it provides new insight into the evolutionary dynamics and ancestral origins of the HLA loci and their haplotypes.

Spencer CCA, Deloukas P, Hunt S, Mullikin J, Myers S, Silverman B, Donnelly P, Bentley D, McVean G. 2006. The influence of recombination on human genetic diversity. PLoS Genet, 2 (9), pp. e148.

In humans, the rate of recombination, as measured on the megabase scale, is positively associated with the level of genetic variation, as measured at the genic scale. Despite considerable debate, it is not clear whether these factors are causally linked or, if they are, whether this is driven by the repeated action of adaptive evolution or molecular processes such as double-strand break formation and mismatch repair. We introduce three innovations to the analysis of recombination and diversity: fine-scale genetic maps estimated from genotype experiments that identify recombination hotspots at the kilobase scale, analysis of an entire human chromosome, and the use of wavelet techniques to identify correlations acting at different scales. We show that recombination influences genetic diversity only at the level of recombination hotspots. Hotspots are also associated with local increases in GC content and the relative frequency of GC-increasing mutations but have no effect on substitution rates. Broad-scale association between recombination and diversity is explained through covariance of both factors with base composition. To our knowledge, these results are the first evidence of a direct and local influence of recombination hotspots on genetic variation and the fate of individual mutations. However, that hotspots have no influence on substitution rates suggests that they are too ephemeral on an evolutionary time scale to have a strong influence on broader scale patterns of base composition and long-term molecular evolution.

Myers S, Spencer CCA, Auton A, Bottolo L, Freeman C, Donnelly P, McVean G. 2006. The distribution and causes of meiotic recombination in the human genome. Biochem Soc Trans, 34 (Pt 4), pp. 526-530.

Using the statistical analysis of genetic variation, we have developed a high-resolution genetic map of recombination hotspots and recombination rate variation across the human genome. This map, which has a resolution several orders of magnitude greater than previous studies, identifies over 25,000 recombination hotspots and gives new insights into the distribution and determination of recombination. Wavelet-based analysis demonstrates scale-specific influences of base composition, coding context and DNA repeats on recombination rates, though, in contrast with other species, no association with DNase I hypersensitivity. We have also identified specific DNA motifs that are strongly associated with recombination hotspots and whose activity is influenced by local context. Comparative analysis of recombination rates in humans and chimpanzees demonstrates very high rates of evolution of the fine-scale structure of the recombination landscape. In the light of these observations, we suggest possible resolutions of the hotspot paradox.

Gregory SG, Barlow KF, McLay KE, Kaul R, Swarbreck D, Dunham A, Scott CE, Howe KL, Woodfine K, Spencer CCA et al. 2006. The DNA sequence and biological annotation of human chromosome 1. Nature, 441 (7091), pp. 315-321.

The reference sequence for each human chromosome provides the framework for understanding genome function, variation and evolution. Here we report the finished sequence and biological annotation of human chromosome 1. Chromosome 1 is gene-dense, with 3,141 genes and 991 pseudogenes, and many coding sequences overlap. Rearrangements and mutations of chromosome 1 are prevalent in cancer and many other diseases. Patterns of sequence variation reveal signals of recent selection in specific genes that may contribute to human fitness, and also in regions where no function is evident. Fine-scale recombination occurs in hotspots of varying intensity along the sequence, and is enriched near genes. These and other studies of human biology and disease encoded within chromosome 1 are made possible with the highly accurate annotated sequence, as part of the completed set of chromosome sequences that comprise the reference human genome.

Wilson DJ, McVean G. 2006. Estimating diversifying selection and functional constraint in the presence of recombination. Genetics, 172 (3), pp. 1411-1425.

Models of molecular evolution that incorporate the ratio of nonsynonymous to synonymous polymorphism (dN/dS ratio) as a parameter can be used to identify sites that are under diversifying selection or functional constraint in a sample of gene sequences. However, when there has been recombination in the evolutionary history of the sequences, reconstructing a single phylogenetic tree is not appropriate, and inference based on a single tree can give misleading results. In the presence of high levels of recombination, the identification of sites experiencing diversifying selection can suffer from a false-positive rate as high as 90%. We present a model that uses a population genetics approximation to the coalescent with recombination and use reversible-jump MCMC to perform Bayesian inference on both the dN/dS ratio and the recombination rate, allowing each to vary along the sequence. We demonstrate that the method has the power to detect variation in the dN/dS ratio and the recombination rate and does not suffer from a high false-positive rate. We use the method to analyze the porB gene of Neisseria meningitidis and verify the inferences using prior sensitivity analysis and model criticism techniques.

Hanchard NA, Rockett KA, Spencer C, Coop G, Pinder M, Jallow M, Kimber M, McVean G, Mott R, Kwiatkowski DP. 2006. Screening for recently selected alleles by analysis of human haplotype similarity. Am J Hum Genet, 78 (1), pp. 153-159.

There is growing interest in the use of haplotype-based methods for detecting recent selection. Here, we describe a method that uses a sliding window to estimate similarity among the haplotypes associated with any given single-nucleotide polymorphism (SNP) allele. We used simulations of natural selection to provide estimates of the empirical power of the method to detect recently selected alleles and found it to be comparable in power to the popular long-range haplotype test and more powerful than methods based on nucleotide diversity. We then applied the method to a recently selected allele (the sickle mutation at the HBB locus) and found it to have a signal of selection that was significantly stronger than that of simulated models both with and without strong selection. Using this method, we also evaluated >4,000 SNPs on chromosome 20, indicating the applicability of the method to regional data sets.

International HapMap Consortium. 2005. A haplotype map of the human genome. Nature, 437 (7063), pp. 1299-1320.

Inherited genetic variation has a critical but as yet largely uncharacterized role in human disease. Here we report a public database of common variation in the human genome: more than one million single nucleotide polymorphisms (SNPs) for which accurate and complete genotypes have been obtained in 269 DNA samples from four populations, including ten 500-kilobase regions in which essentially all information about common DNA variation has been extracted. These data document the generality of recombination hotspots, a block-like structure of linkage disequilibrium and low haplotype diversity, leading to substantial correlations of SNPs with many of their neighbours. We show how the HapMap resource can guide the design and analysis of genetic association studies, shed light on structural variation and recombination, and identify loci that may have been subject to natural selection during human evolution.

Myers S, Bottolo L, Freeman C, McVean G, Donnelly P. 2005. A fine-scale map of recombination rates and hotspots across the human genome. Science, 310 (5746), pp. 321-324.

Genetic maps, which document the way in which recombination rates vary over a genome, are an essential tool for many genetic analyses. We present a high-resolution genetic map of the human genome, based on statistical analyses of genetic variation data, and identify more than 25,000 recombination hotspots, together with motifs and sequence contexts that play a role in hotspot activity. Differences between the behavior of recombination rates over large (megabase) and small (kilobase) scales lead us to suggest a two-stage model for recombination in which hotspots are stochastic features, within a framework in which large-scale rates are constrained.

Mu J, Awadalla P, Duan J, McGee KM, Joy DA, McVean GAT, Su X-Z. 2005. Recombination hotspots and population structure in Plasmodium falciparum. PLoS Biol, 3 (10), pp. e335.

Understanding the influences of population structure, selection, and recombination on polymorphism and linkage disequilibrium (LD) is integral to mapping genes contributing to drug resistance or virulence in Plasmodium falciparum. The parasite's short generation time, coupled with a high cross-over rate, can cause rapid LD break-down. However, observations of low genetic variation have led to suggestions of effective clonality: selfing, population admixture, and selection may preserve LD in populations. Indeed, extensive LD surrounding drug-resistant genes has been observed, indicating that recombination and selection play important roles in shaping recent parasite genome evolution. These studies, however, provide only limited information about haplotype variation at local scales. Here we describe the first (to our knowledge) chromosome-wide SNP haplotype and population recombination maps for a global collection of malaria parasites, including the 3D7 isolate, whose genome has been sequenced previously. The parasites are clustered according to continental origin, but alternative groupings were obtained using SNPs at 37 putative transporter genes that are potentially under selection. Geographic isolation and highly variable multiple infection rates are the major factors affecting haplotype structure. Variation in effective recombination rates is high, both among populations and along the chromosome, with recombination hotspots conserved among populations at chromosome ends. This study supports the feasibility of genome-wide association studies in some parasite populations.

McVean G, Spencer CCA, Chaix R. 2005. Perspectives on human genetic variation from the HapMap Project. PLoS Genet, 1 (4), pp. e54.

The completion of the International HapMap Project marks the start of a new phase in human genetics. The aim of the project was to provide a resource that facilitates the design of efficient genome-wide association studies, through characterising patterns of genetic variation and linkage disequilibrium in a sample of 270 individuals across four geographical populations. In total, over one million SNPs have been typed across these genomes, providing an unprecedented view of human genetic diversity. In this review we focus on what the HapMap Project has taught us about the structure of human genetic variation and the fundamental molecular and evolutionary processes that shape it.

McVean GAT, Cardin NJ. 2005. Approximating the coalescent with recombination. Philos Trans R Soc Lond B Biol Sci, 360 (1459), pp. 1387-1393.

The coalescent with recombination describes the distribution of genealogical histories and resulting patterns of genetic variation in samples of DNA sequences from natural populations. However, using the model as the basis for inference is currently severely restricted by the computational challenge of estimating the likelihood. We discuss why the coalescent with recombination is so challenging to work with and explore whether simpler models, under which inference is more tractable, may prove useful for genealogy-based inference. We introduce a simplification of the coalescent process in which coalescence between lineages with no overlapping ancestral material is banned. The resulting process has a simple Markovian structure when generating genealogies sequentially along a sequence, yet has very similar properties to the full model, both in terms of describing patterns of genetic variation and as the basis for statistical inference.

Goriely A, McVean GAT, van Pelt AMM, O'Rourke AW, Wall SA, de Rooij DG, Wilkie AOM. 2005. Gain-of-function amino acid substitutions drive positive selection of FGFR2 mutations in human spermatogonia. Proc Natl Acad Sci U S A, 102 (17), pp. 6051-6056.

Despite the importance of mutation in genetics, there are virtually no experimental data on the occurrence of specific nucleotide substitutions in human gametes. C>G transversions at position 755 of FGF receptor 2 (FGFR2) cause Apert syndrome; this mutation, encoding the gain-of-function substitution Ser252Trp, occurs with a birth rate elevated 200- to 800-fold above background and originates exclusively from the unaffected father. We previously demonstrated high levels of both 755C>G and 755C>T FGFR2 mutations in human sperm and proposed that these particular mutations are enriched because the encoded proteins confer a selective advantage to spermatogonial cells. Here, we examine three corollaries of this hypothesis. First, we show that mutation levels at the adjacent FGFR2 nucleotides 752-754 are low, excluding any general increase in local mutation rate. Second, we present three instances of double-nucleotide changes involving 755C, expected to be extremely rare as chance events. Two of these double-nucleotide substitutions are shown, either by assessment of the pedigree or by direct analysis of sperm, to have arisen in sequential steps; the third (encoding Ser252Tyr) was predicted from structural considerations. Finally, we demonstrate that both major alternative spliceforms of FGFR2 (Fgfr2b and Fgfr2c) are expressed in rat spermatogonial stem cell lines. Taken together, these observations show that specific FGFR2 mutations attain high levels in sperm because they encode proteins with gain-of-function properties, favoring clonal expansion of mutant spermatogonial cells. Among FGFR2 mutations, those causing Apert syndrome may be especially prevalent because they enhance signaling by FGF ligands specific for each of the major expressed isoforms.

Winckler W, Myers SR, Richter DJ, Onofrio RC, McDonald GJ, Bontrop RE, McVean GAT, Gabriel SB, Reich D, Donnelly P, Altshuler D. 2005. Comparison of fine-scale recombination rates in humans and chimpanzees. Science, 308 (5718), pp. 107-111.

We compared fine-scale recombination rates at orthologous loci in humans and chimpanzees by analyzing polymorphism data in both species. Strong statistical evidence for hotspots of recombination was obtained in both species. Despite approximately 99% identity at the level of DNA sequence, however, recombination hotspots were found rarely (if at all) at the same positions in the two species, and no correlation was observed in estimates of fine-scale recombination rates. Thus, local patterns of recombination rate have evolved rapidly, in a manner disproportionate to the change in DNA sequence.

Jolley KA, Wilson DJ, Kriz P, McVean G, Maiden MCJ. 2005. The influence of mutation, recombination, population history, and selection on patterns of genetic diversity in Neisseria meningitidis (vol 22, pg 562, 2005). Mol Biol Evol, 22 (4), pp. 1158-1158.

Jolley KA, Wilson DJ, Kriz P, McVean G, Maiden MCJ. 2005. The influence of mutation, recombination, population history, and selection on patterns of genetic diversity in Neisseria meningitidis. Mol Biol Evol, 22 (3), pp. 562-569.

Patterns of genetic diversity within populations of human pathogens, shaped by the ecology of host-microbe interactions, contain important information about the epidemiological history of infectious disease. Exploiting this information, however, requires a systematic approach that distinguishes the genetic signal generated by epidemiological processes from the effects of other forces, such as recombination, mutation, and population history. Here, a variety of quantitative techniques were employed to investigate multilocus sequence information from isolate collections of Neisseria meningitidis, a major cause of meningitis and septicemia worldwide. This allowed quantitative evaluation of alternative explanations for the observed population structure. A coalescent-based approach was employed to estimate the rate of mutation, the rate of recombination, and the size distribution of recombination fragments from samples of disease-associated and carried meningococci obtained in the Czech Republic in 1993 and from a global collection of disease-associated isolates collected from 1937 to 1996. The parameter estimates were used to reject a model in which genetic structure arose by chance in small populations, and analysis of molecular variation showed that geographically restricted gene flow was unlikely to be the cause of the genetic structure. The genetic differentiation between disease and carriage isolate collections indicated that, whereas certain genotypes were overrepresented among the disease-isolate collections (the "hyperinvasive" lineages), disease-associated and carried meningococci exhibited remarkably little differentiation at the level of individual nucleotide polymorphisms. In combination, these results indicated the repeated action of natural selection on meningococcal populations, possibly arising from the coevolutionary dynamic of host-pathogen interactions.

Shrivastava J, Qian BZ, McVean G, Webster JP. 2005. An insight into the genetic variation of Schistosoma japonicum in mainland China using DNA microsatellite markers. Mol Ecol, 14 (3), pp. 839-849.

This study presents the first microsatellite investigation into the level of genetic variation among Schistosoma japonicum from different geographical origins. S. japonicum isolates were obtained from seven endemic provinces across mainland China: Zhejiang (Jiashan County), Anhui (Guichi County), Jiangxi (Yongxiu County), Hubei (Wuhan County), Hunan (Yueyang area), Sichuan 1 (Maoshan County), Sichuan 2 (Tianquan County), Yunnan (Dali County), and also one province in the Philippines (Sorsogon). DNA from 20 individuals from each origin was screened against 11 recently isolated and characterized S. japonicum microsatellites, and a set of nine loci was selected based on their polymorphic information content. High levels of polymorphism were obtained between and within population samples, with Chinese and Philippine strains appearing to follow different lineages, and with distinct branching between provinces. Moreover, across mainland China, genotype clustering appeared to be related to habitat type and/or intermediate host morph. These results highlight the suitability of microsatellites for population genetic studies of S. japonicum and suggest that there may be different strains of S. japonicum circulating in mainland China.

Wilson DJ, Falush D, McVean G. 2005. Germs, genomes and genealogies. Trends Ecol Evol, 20 (1), pp. 39-45.

Genetic diversity in pathogen species contains information about evolutionary and epidemiological processes, including the origins and history of disease, the nature of the selective forces acting on pathogen genes and the role of recombination in generating genetic novelty. Here, we review recent developments in these fields and compare the use of population genetic, or population-model based, approaches to phylogenetic, or population-model free, methodologies. We show how simple epidemiological models can be related to the ancestral, or coalescent, process underlying samples from pathogen species, enabling detailed inference about pathogen biology from patterns of molecular variation.

Cited: 68 (Scopus)

Mu J, Awadalla P, Duan J, McGee KM, Joy DA, McVean GAT, Su XZ. 2005. Recombination hotspots and population structure in Plasmodium falciparum. PLoS Biology, 3 (10).

Understanding the influences of population structure, selection, and recombination on polymorphism and linkage disequilibrium (LD) is integral to mapping genes contributing to drug resistance or virulence in Plasmodium falciparum. The parasite's short generation time, coupled with a high cross-over rate, can cause rapid LD break-down. However, observations of low genetic variation have led to suggestions of effective clonality: selfing, population admixture, and selection may preserve LD in populations. Indeed, extensive LD surrounding drug-resistant genes has been observed, indicating that recombination and selection play important roles in shaping recent parasite genome evolution. These studies, however, provide only limited information about haplotype variation at local scales. Here we describe the first (to our knowledge) chromosome-wide SNP haplotype and population recombination maps for a global collection of malaria parasites, including the 3D7 isolate, whose genome has been sequenced previously. The parasites are clustered according to continental origin, but alternative groupings were obtained using SNPs at 37 putative transporter genes that are potentially under selection. Geographic isolation and highly variable multiple infection rates are the major factors affecting haplotype structure. Variation in effective recombination rates is high, both among populations and along the chromosome, with recombination hotspots conserved among populations at chromosome ends. This study supports the feasibility of genome-wide association studies in some parasite populations.

Cited: 72 (Scopus)

Harding RM, McVean G. 2004. A structured ancestral population for the evolution of modern humans. Curr Opin Genet Dev, 14 (6), pp. 667-674.

The view that modern humans evolved through a bottleneck from a single founding group of archaic Homo is being challenged by new analyses of contemporary genetic variation. A wide range of middle to late Pleistocene ages for gene genealogies and evidence for early population structures point to a diverse and scattered ancestry associated with a metapopulation history of local extinctions, re-colonization and admixture. A different balance of the same processes has shaped chimpanzee diversity.

International HapMap Consortium. 2004. Integrating ethics and science in the International HapMap Project. Nat Rev Genet, 5 (6), pp. 467-475.

Genomics resources that use samples from identified populations raise scientific, social and ethical issues that are, in many ways, inextricably linked. Scientific decisions about which populations to sample to produce the HapMap, an international genetic variation resource, have raised questions about the relationships between the social identities used to recruit participants and the biological findings of studies that will use the HapMap. The sometimes problematic implications of those complex relationships have led to questions about how to conduct genetic variation research that uses identified populations in an ethical way, including how to involve members of a population in evaluating the risks and benefits posed for everyone who shares that identity. The ways in which these issues are linked are increasingly drawing the scientific and ethical spheres of genomics research closer together.

McVean GAT, Myers SR, Hunt S, Deloukas P, Bentley DR, Donnelly P. 2004. The fine-scale structure of recombination rate variation in the human genome. Science, 304 (5670), pp. 581-584.

The nature and scale of recombination rate variation are largely unknown for most species. In humans, pedigree analysis has documented variation at the chromosomal level, and sperm studies have identified specific hotspots in which crossing-over events cluster. To address whether this picture is representative of the genome as a whole, we have developed and validated a method for estimating recombination rates from patterns of genetic variation. From extensive single-nucleotide polymorphism surveys in European and African populations, we find evidence for extreme local rate variation spanning four orders of magnitude, in which 50% of all recombination events take place in less than 10% of the sequence. We demonstrate that recombination hotspots are a ubiquitous feature of the human genome, occurring on average every 200 kilobases or less, but recombination occurs preferentially outside genes.

Stumpf MPH, McVean GAT. 2003. Estimating recombination rates from population-genetic data. Nat Rev Genet, 4 (12), pp. 959-968.

Obtaining an accurate measure of how recombination rates vary across the genome has implications for understanding the molecular basis of recombination, its evolutionary significance and the distribution of linkage disequilibrium in natural populations. Although measuring the recombination rate is experimentally challenging, good estimates can be obtained by applying population-genetic methods to DNA sequences taken from natural populations. Statistical methods are now providing insights into the nature and scale of variation in the recombination rate, particularly in humans. Such knowledge will become increasingly important owing to the growing use of population-genetic methods in biomedical research.

Tyler-Smith C, McVean G. 2003. The comings and goings of a Y polymorphism. Nat Genet, 35 (3), pp. 201-202.

A deletion that removes 1.6 Mb of DNA and several genes from the Y chromosome increases its carriers' chance of infertility, yet is present in ∼2% of men. The Y chromosome can no longer be regarded as a neutral locus in evolutionary studies.

Goriely A, McVean GAT, Röjmyr M, Ingemarsson B, Wilkie AOM. 2003. Evidence for selective advantage of pathogenic FGFR2 mutations in the male germ line. Science, 301 (5633), pp. 643-646.

Observed mutation rates in humans appear higher in male than female gametes and often increase with paternal age. This bias, usually attributed to the accumulation of replication errors or inefficient repair processes, has been difficult to study directly. Here, we describe a sensitive method to quantify substitutions at nucleotide 755 of the fibroblast growth factor receptor 2 (FGFR2) gene in sperm. Although substitution levels increase with age, we show that even high levels originate from infrequent mutational events. We propose that these FGFR2 mutations, although harmful to embryonic development, are paradoxically enriched because they confer a selective advantage to the spermatogonial cells in which they arise.

International HapMap Consortium. 2003. The International HapMap Project. Nature, 426 (6968), pp. 789-796.

The goal of the International HapMap Project is to determine the common patterns of DNA sequence variation in the human genome and to make this information freely available in the public domain. An international consortium is developing a map of these patterns across the genome by determining the genotypes of one million or more sequence variants, their frequencies and the degree of association between them, in DNA samples from populations with ancestry from parts of Africa, Asia and Europe. The HapMap will allow the discovery of sequence variants that affect common disease, will facilitate development of diagnostic tools, and will enhance our ability to choose targets for therapeutic intervention.

McVean GAT. 2002. A genealogical interpretation of linkage disequilibrium. Genetics, 162 (2), pp. 987-991.

The degree of association between alleles at different loci, or linkage disequilibrium, is widely used to infer details of evolutionary processes. Here I explore how associations between alleles relate to properties of the underlying genealogy of sequences. Under the neutral, infinite-sites assumption I show that there is a direct correspondence between the covariance in coalescence times at different parts of the genome and the degree of linkage disequilibrium. These covariances can be calculated exactly under the standard neutral model and by Monte Carlo simulation under different demographic models. I show that the effects of population growth, population bottlenecks, and population structure on linkage disequilibrium can be described through their effects on the covariance in coalescence times.

Reich DE, Schaffner SF, Daly MJ, McVean G, Mullikin JC, Higgins JM, Richter DJ, Lander ES, Altshuler D. 2002. Human genome sequence variation and the influence of gene history, mutation and recombination. Nat Genet, 32 (1), pp. 135-142.

Variation in the human genome sequence is key to understanding susceptibility to disease in modern populations and the history of ancestral populations. Unlocking this information requires knowledge of the patterns and underlying causes of human sequence diversity. By applying a new population-genetic framework to two genome-wide polymorphism surveys, we find that the human genome contains sizeable regions (stretching over tens of thousands of base pairs) that have intrinsically high and low rates of sequence variation. We show that the primary determinant of these patterns is shared genealogical history. Only a fraction of the variation (at most 25%) is due to the local mutation rate. By measuring the average distance over which genealogical histories are typically preserved, these data provide the first genome-wide estimate of the average extent of correlation among variants (linkage disequilibrium). The results are best explained by extreme variability in the recombination rate at a fine scale, and provide the first empirical evidence that such recombination 'hot spots' are a general feature of the human genome and have a principal role in shaping genetic variation in the human population.

Balding DJ, Carothers AD, Marchini JL, Cardon LR, Vetta A, Griffiths B, Weir BS, Hill WG, Goldstein D, Strimmer K et al. 2002. Discussion on the meeting on 'Statistical modelling and analysis of genetic data'. Journal of the Royal Statistical Society. Series B: Statistical Methodology, 64 (4), pp. 737-775.

McVean G, Awadalla P, Fearnhead P. 2002. A coalescent-based method for detecting and estimating recombination from gene sequences. Genetics, 160 (3), pp. 1231-1241.

Determining the amount of recombination in the genealogical history of a sample of genes is important to both evolutionary biology and medical population genetics. However, recurrent mutation can produce patterns of genetic diversity similar to those generated by recombination and can bias estimates of the population recombination rate. Hudson (2001) has suggested an approximate-likelihood method based on coalescent theory to estimate the population recombination rate, 4N(e)r, under an infinite-sites model of sequence evolution. Here we extend the method to the estimation of the recombination rate in genomes, such as those of many viruses and bacteria, where the rate of recurrent mutation is high. In addition, we develop a permutation-based method for detecting recombination that is both more powerful than other permutation-based methods and robust to misspecification of the model of sequence evolution. We apply the method to sequence data from viruses, bacteria, and human mitochondrial DNA. The extremely high level of recombination detected in both HIV1 and HIV2 sequences demonstrates that recombination cannot be ignored in the analysis of viral population genetic data.

Atkinson RJ, McVean GAT, Stone GN. 2002. Use of population genetic data to infer oviposition behaviour: species-specific patterns in four oak gallwasps (Hymenoptera: Cynipidae). Proc Biol Sci, 269 (1489), pp. 383-390.

Many species of oak gallwasp (Hymenoptera: Cynipidae: Cynipini) induce galls containing more than one larva (multilocular galls) on their host plant. To date, it has remained unclear whether multilocular galls result solely from clustered oviposition by a single female, or include the aggregated offspring of several females (multiple founding). We have developed a novel maximum-likelihood approach for use with population genetic data that estimates the number and genotypes of parents contributing to offspring from each gall. We apply this method to allozyme data from multiple populations of four oak gallwasps whose asexual generations develop in multilocular galls (Andricus coriarius, A. lucidus, A. panteli and A. seckendorffi). We find strong evidence for multiple founding in all four species, and show the data to be compatible with multiple founding rather than founding by a single foundress mated with multiple males. The extent of multiple founding differs among species: in A. lucidus and A. seckendorffi most galls are induced by a single female, whereas in A. coriarius and A. panteli over half of the galls sampled were multiple founded. We suggest that variation in levels of multiple founding may be due to consistent ecological differences between the four species.

McVean GA. 2001. What do patterns of genetic variability reveal about mitochondrial recombination? Heredity (Edinb), 87 (Pt 6), pp. 613-620.

Recent claims that patterns of genetic variability in human mitochondria show evidence for recombination have provoked considerable argument and much correspondence concerning the quality of the data, the nature of the analyses, and the biological realism of mitochondrial recombination. While the majority of evidence now points towards a lack of effective recombination, at least in humans, the debate has highlighted how difficult the detection of recombination can be in genomes with unusual mutation processes and complex demographic histories. A major difficulty is the lack of consensus about how to measure linkage disequilibrium. I show that measures differ in the way they treat data that are uninformative about recombination, and that when just those pairwise comparisons that are informative about recombination are used, there is agreement between different statistics. In this light, the significant negative correlation between linkage disequilibrium and distance, in at least some of the data sets, is a real pattern that requires explanation. I discuss whether plausible mutational and selective processes can give rise to such a pattern.

Charlesworth D, Charlesworth B, McVean GAT. 2001. Genome sequences and evolutionary biology, a two-way interaction. Trends Ecol Evol, 16 (5), pp. 235-242.

Complete genome sequences are accumulating rapidly, culminating with the announcement of the human genome sequence in February 2001. In addition to cataloguing the diversity of genes and other sequences, genome sequences will provide the first detailed and complete data on gene families and genome organization, including data on evolutionary changes. Reciprocally, evolutionary biology will make important contributions to the efforts to understand functions of genes and other sequences in genomes. Large-scale, detailed and unbiased comparisons between species will illuminate the evolution of genes and genomes, and population genetics methods will enable detection of functionally important genes or sequences, including sequences that have been involved in adaptive changes.

McVean GA, Vieira J. 2001. Inferring parameters of mutation, selection and demography from patterns of synonymous site evolution in Drosophila. Genetics, 157 (1), pp. 245-257.

Selection acting on codon usage can cause patterns of synonymous evolution to deviate considerably from those expected under neutrality. To investigate the quantitative relationship between parameters of mutation, selection, and demography, and patterns of synonymous site divergence, we have developed a novel combination of population genetic models and likelihood methods of phylogenetic sequence analysis. Comparing 50 orthologous gene pairs from Drosophila melanogaster and D. virilis and 27 from D. melanogaster and D. simulans, we show considerable variation between amino acids and genes in the strength of selection acting on codon usage and find evidence for both long-term and short-term changes in the strength of selection between species. Remarkably, D. melanogaster shows no evidence of current selection on codon usage, while its sister species D. simulans experiences only half the selection pressure for codon usage of their common ancestor. We also find evidence for considerable base asymmetries in the rate of mutation, such that the average synonymous mutation rate is 20-30% higher than in noncoding regions. A Bayesian approach is adopted to investigate how accounting for selection on codon usage influences estimates of the parameters of mutation.

McVean G. 2000. Evolutionary genetics: what is driving male mutation? Curr Biol, 10 (22), pp. R834-R835.

In mammals, most new mutations occur in males. But a study of the evolution of a human X to Y chromosomal translocation has revealed a sex bias much lower than previous estimates. Patterns of substitution suggest that differential methylation between male and female germ lines is a key determinant of the mutation rate.

McVean GA, Charlesworth B. 2000. The effects of Hill-Robertson interference between weakly selected mutations on patterns of molecular evolution and variation. Genetics, 155 (2), pp. 929-944.

Associations between selected alleles and the genetic backgrounds on which they are found can reduce the efficacy of selection. We consider the extent to which such interference, known as the Hill-Robertson effect, acting between weakly selected alleles, can restrict molecular adaptation and affect patterns of polymorphism and divergence. In particular, we focus on synonymous-site mutations, considering the fate of novel variants in a two-locus model and the equilibrium effects of interference with multiple loci and reversible mutation. We find that weak selection Hill-Robertson (wsHR) interference can considerably reduce adaptation, e.g., codon bias, and, to a lesser extent, levels of polymorphism, particularly in regions of low recombination. Interference causes the frequency distribution of segregating sites to resemble that expected from more weakly selected mutations and also generates specific patterns of linkage disequilibrium. While the selection coefficients involved are small, the fitness consequences of wsHR interference across the genome can be considerable. We suggest that wsHR interference is an important force in the evolution of nonrecombining genomes and may explain the unexpected constancy of codon bias across species of very different census population sizes, as well as several unusual features of codon usage in Drosophila.

McAllister BF, McVean GA. 2000. Neutral evolution of the sex-determining gene transformer in Drosophila. Genetics, 154 (4), pp. 1711-1720.

The amino acid sequence of the transformer (tra) gene exhibits an extremely rapid rate of evolution among Drosophila species, although the gene performs a critical step in sex determination. These changes in amino acid sequence are the result of either natural selection or neutral evolution. To differentiate between selective and neutral causes of this evolutionary change, analyses of both intraspecific and interspecific patterns of molecular evolution of tra gene sequences are presented. Sequences of 31 tra alleles were obtained from Drosophila americana. Many replacement and silent nucleotide variants are present among the alleles; however, the distribution of this sequence variation is consistent with neutral evolution. Sequence evolution was also examined among six species representative of the genus Drosophila. For most lineages and most regions of the gene, both silent and replacement substitutions have accumulated in a constant, clock-like manner. In exon 3 of D. virilis and D. americana we find evidence for an elevated rate of nonsynonymous substitution, but no statistical support for a greater rate of nonsynonymous relative to synonymous substitutions. Both levels of analysis of the tra sequence suggest that, although the gene is evolving at a rapid pace, these changes are neutral in function.

McVean GA, Hurst GD. 2000. Evolutionary lability of context-dependent codon bias in bacteria. J Mol Evol, 50 (3), pp. 264-275.

In bacteria, synonymous codon usage can be considerably affected by base composition at neighboring sites. Such context-dependent biases may be caused by either selection against specific nucleotide motifs or context-dependent mutation biases. Here we consider the evolutionary conservation of context-dependent codon bias across 11 completely sequenced bacterial genomes. In particular, we focus on two contextual biases previously identified in Escherichia coli; the avoidance of out-of-frame stop codons and AGG motifs. By identifying homologues of E. coli genes, we also investigate the effect of gene expression level in Haemophilus influenzae and Mycoplasma genitalium. We find that while context-dependent codon biases are widespread in bacteria, few are conserved across all species considered. Avoidance of out-of-frame stop codons does not apply to all stop codons or amino acids in E. coli, does not hold for different species, does not increase with gene expression level, and is not relaxed in Mycoplasma spp., in which the canonical stop codon, TGA, is recognized as tryptophan. Avoidance of AGG motifs shows some evolutionary conservation and increases with gene expression level in E. coli, suggestive of the action of selection, but the cause of the bias differs between species. These results demonstrate that strong context-dependent forces, both selective and mutational, operate on synonymous codon usage but that these differ considerably between genomes.

Cited: 111 (Scopus)

McVean GAT, Charlesworth B. 1999. A population genetic model for the evolution of synonymous codon usage: Patterns and predictions. Genetical Research, 74 (2), pp. 145-158.

Patterns of synonymous codon usage are determined by the forces of mutation, selection and drift. We elaborate on previous population genetic models of codon usage to incorporate parameters of population polymorphism, and demonstrate that the degree of codon bias expected in a single sequence picked at random from the population is accurately predicted by previous models, irrespective of population polymorphism. This new model is used to explore the relationships between synonymous codon usage, nucleotide site diversity and the rate of substitution. We derive the equilibrium frequency distribution of weakly selected segregating sites under the infinite-sites model, and the expected nucleotide site diversity. Contrary to intuition, levels of silent-site diversity can increase with the strength of selection acting on codon usage. We also predict the effects of background selection on statistics of synonymous codon usage and derive simple formulae to predict patterns of codon usage at amino acids with more than two synonymous codons, and the effects of variation in selection coefficient between sites within a gene. We show that patterns of silent-site variation and synonymous codon usage on the X chromosome and autosomes in Drosophila are compatible with recessivity of the fitness effects of unpreferred codons. Finally, we suggest that there still exist considerable discrepancies between current models and data.

McVean GA, Vieira J. 1999. The evolution of codon preferences in Drosophila: a maximum-likelihood approach to parameter estimation and hypothesis testing. J Mol Evol, 49 (1), pp. 63-75.

Synonymous codon usage in related species may differ as a result of variation in mutation biases, differences in the overall strength and efficiency of selection, and shifts in codon preference (the selective hierarchy of codons within and between amino acids). We have developed a maximum-likelihood method to employ explicit population genetic models to analyze the evolution of parameters determining codon usage. The method is applied to twofold degenerate amino acids in 50 orthologous genes from D. melanogaster and D. virilis. We find that D. virilis has significantly reduced selection on codon usage for all amino acids, but the data are incompatible with a simple model in which there is a single difference in the long-term Ne, or overall strength of selection, between the two species, indicating shifts in codon preference. The strength of selection acting on codon usage in D. melanogaster is estimated to be |Nes| approximately 0.4 for most CT-ending twofold degenerate amino acids, but 1.7 times greater for cysteine and 1.4 times greater for AG-ending codons. In D. virilis, the strength of selection acting on codon usage for most amino acids is only half that acting in D. melanogaster but is considerably greater than half for cysteine, perhaps indicating the dual selection pressures of translational efficiency and accuracy. Selection coefficients in orthologues are highly correlated (rho = 0.46), but a number of genes deviate significantly from this relationship.

Cited: 28 (Scopus)

Stephan W, Charlesworth B, McVean G. 1999. The effect of background selection at a single locus on weakly selected, partially linked variants. Genetical Research, 73 (2), pp. 133-146.

Previous work has shown that genetic diversity at a neutral locus is affected by background selection due to recurrent deleterious mutations as though the effective population size N(e) is reduced by a factor that is calculable from genetic parameters such as mutation rates, selection coefficients, and the rates of recombination between sites subject to selection and the neutral locus. Given that silent changes at third coding positions are often subject to weak selection pressures, it is important to develop similar quantitative predictions of the effects of background selection on variation and evolution at weakly selected sites. A diffusion approximation is derived that describes the effects of the presence of a single locus subject to mutation and strongly deleterious selection on variation and evolution at a partially linked, weakly selected locus. The results are validated by computer simulations using the Ito pseudo-sampling method. We show that both nucleotide site diversity and rates of molecular evolution at a weakly selected locus are affected by background selection as though N(e) is reduced in the same way as for a neutral locus. Heuristic arguments are presented as to why the change in N(e) for the neutral case also applies with weak selection. As in the case of a neutral locus, the number of segregating sites in the population is poorly predicted from the change in N(e). The potential significance of the results in relation to the effects of recombinational environment on molecular variation and evolution is discussed.

Hurst GDD, McVean GAT. 1998. Parasitic male-killing bacteria and the evolution of clutch size. Ecological Entomology, 23 (3), pp. 350-353.

Hurst LD, McVean GT. 1998. Do we understand the evolution of genomic imprinting? Curr Opin Genet Dev, 8 (6), pp. 701-708.

The conflict theory is the only hypothesis to have attracted any critical attention for the evolution of genomic imprinting. Although the earliest data appeared supportive, recent systematic analyses have not confirmed the model's predictions. The status of theory remains undecided, however, as post-hoc explanations can be provided as to why these predictions are not borne out.

McVean GT, Hurst LD. 1997. Evidence for a selectively favourable reduction in the mutation rate of the X chromosome. Nature, 386 (6623), pp. 388-392.

The equilibrium per-genome mutation rate in sexual species is thought to result from a trade-off between the benefits of reducing the deleterious mutation rate and the costs of increasing fidelity. We propose that selection will often favour a lower mutation rate on the X chromosome than on autosomes, owing to the exposure of deleterious recessive mutations on hemizygous chromosomes. We tested this hypothesis by examining 33 X-linked genes that have been sequenced in both mouse and rat, and compared their rate of evolution against 238 autosomal genes. The X-linked genes were found to have a significantly lower rate of synonymous substitution than the autosomal genes. Neither the supposed higher mutation rate in males nor stronger purifying selection against slightly deleterious mutations on the X chromosome can account for the low value. The most parsimonious explanation is that rodents have a lower mutation rate on the X chromosome than on autosomes. It is therefore likely that previous indirect estimates of the excess male mutation rate are inaccurate. Indeed, after correction we find no evidence for a male-biased mutation rate in rodents. Furthermore, the rate of synonymous substitution in Y-linked genes is not significantly different from that in autosomal ones. The extent to which enhanced male mutation rates are problematic for the mutational deterministic model of the evolution of sex must, in turn, be questioned.

Cited: 42 (Scopus)

McVean GT, Hurst LD. 1997. Molecular evolution of imprinted genes: No evidence for antagonistic coevolution. Proceedings of the Royal Society B: Biological Sciences, 264 (1382), pp. 739-746.

Genomically imprinted genes are those for which expression is dependent on the sex of the parent from which they are derived. Numerous theories have been proposed for the evolution of genomic imprinting; one theory is that it is an intra-individual manifestation of classical parent-offspring conflict. This theory is unique in predicting that an arms race may develop between maternally and paternally derived genes for the control of foetal growth demands. Such antagonistic coevolution may be mediated through changes in the structure of the proteins concerned. Comparable coevolution is the most likely explanation for the rapid changes seen in antigenic components of parasites and antigen recognition components of immune systems. We have examined the evolution of insulin-like growth factor (Igf2) and its antagonistic receptor (Igf2r) and find that, in contrast to immune genes, at the sites of mutual binding they are highly conserved. In addition, we have analysed the rate of molecular evolution of seven imprinted genes (including Igf2 and Igf2r), sequenced in both mouse and rat, and find that this is the same as that of non-imprinted receptors and significantly lower than that of immune genes (controlling for differences in mutation rate). Contrary to the expectations of the conflict hypothesis, we hence find no evidence for antagonistic coevolution of imprinted genes mediated by changes in sequence.

Hurst LD, McVean GT. 1997. Growth effects of uniparental disomies and the conflict theory of genomic imprinting. Trends Genet, 13 (11), pp. 436-443.

While numerous theories have been proposed for the evolution of genomic imprinting, few have been tested. The conflict theory proposes that imprinting is an intra-individual manifestation of classical parent-offspring conflict. This theory is unique in predicting that imprinted genes expressed from the paternally derived genome should be enhancers of pre- and post-natal growth, while those expressed from the maternally derived genome should be growth suppressors. We examine this prediction by reviewing the literature on growth of human and mouse progeny that have inherited both copies (or part thereof) of a particular chromosome from only one parent. Perhaps surprisingly, we find that much of the data do not support the hypothesis.

McVean G, Hurst LD. 1996. Genetic conflicts and the paradox of sex determination: three paths to the evolution of female intersexuality in a mammal. J Theor Biol, 179 (3), pp. 199-211.

That sex determining systems ever change is paradoxical but can be explained by noting that conflict between selfish elements and their modifiers will often cause a shift in sex determining strategy. The evolution of the novel sex determining system of moles (Talpa europaea and T. occidentalis) may, we argue, be an example of just such a process. Three different models for the evolution of female intersexuality are presented. These all attempt to account for (1) the fact that a few years ago populations of moles had high frequencies of sterile XX individuals that were either morphologically male or intersex (other XX individuals were normal females) and (2) that presently, the XX individuals in the same population are exclusively fertile intersexes that are functionally female; i.e. have follicle producing ovotestes. This case history is compared to that of the wood lemming and two similarities are discussed. First, in both cases it is noted that one end product could be approached from different routes. Second, selfish elements may be involved in the evolution of both systems. In general, it is suggested that XY sex determination, far from being resilient to evolutionary change, is vulnerable to take-over by selfish elements. This is particularly the case in mammals in which transplacental interactions could allow manipulation of sex determination in one foetus by another. This, we also suggest, is a good candidate explanation for the evolution of novel sex determination in Talpa.

Hurst LD, McVean G, Moore T. 1996. Imprinted genes have few and small introns. Nat Genet, 12 (3), pp. 234-237.

Hurst L, McVean G. 1996. Erratum: News and views (Nature (1996) 381 (650-651)). Nature, 381 (6585), pp. 742.

McVean GT, Hurst LD, Moore T. 1996. Genomic evolution in mice and men: imprinted genes have little intronic content. Bioessays, 18 (9), pp. 773-775.

Hurst LD, McVean GT. 1996. Evolutionary genetics...and scandalous symbionts. Nature, 381 (6584), pp. 650-651.

Cited: 71 (Scopus)

Hurst LD, McVean GT. 1996. Clade selection, reversible evolution and the persistence of selfish elements: The evolutionary dynamics of cytoplasmic incompatibility. Proceedings of the Royal Society B: Biological Sciences, 263 (1366), pp. 97-104.

Every sexual species is potentially vulnerable to a wide range of heritable factors that disrupt the normal patterns of inheritance. Why these selfish elements are not more common is not well understood. Here, by reference to the dynamics of cytoplasmic incompatibility, we propose a novel solution, namely that under certain conditions, the most likely trajectory is for the selfish element to invade, followed by slow decay to complete loss. Cytoplasmic incompatibility is found in several arthropod species and is characterized by the fact that crosses between males infected with vertically transmitted bacteria of the genus Wolbachia and uninfected females produce significantly fewer adult progeny than the other mating combinations. Early models revealed that such bacteria are expected to spread very rapidly and persist at high frequencies. This being so, cytoplasmic incompatibility should be stably maintained through cladogenesis, thus most sub-populations within a species should be affected. Contrary to these expectations is the finding that cytoplasmic incompatibility within species is patchily distributed and rare. More recent analyses suggest that although Wolbachia should persist in populations the sterilizing effect may wane. We extend one of these analyses and find that through most parameter space, following the invasion of cytoplasmic incompatibility inducing Wolbachia into an uninfected population, not only will the sterilizing effect wane but the conditions become permissive for the spread of the uninfected cytotype. This system is then expected to proceed to fixation of uninfecteds at which point the population will have gone full circle (reversible evolution). This evolutionary trajectory is supported by mitochondrial haplotype/Wolbachia cytotype covariance in Drosophila melanogaster. 
Given that there is no evidence of intra-specific horizontal transmission and that the most robust equilibrium is fixation of uninfecteds, a further consequence is that one can deduce that clade selection most likely favoured Wolbachia's ability to undergo inter-specific transmission. This possibility is supported by comparison of inter-species horizontal transmission rates of mutualistic and deleterious symbionts. Reversible evolution may well be a property of many (but not all) selfish genetic elements.

Hurst LD, McVean GT. 1996. A difficult phase for introns-early. Molecular evolution. Curr Biol, 6 (5), pp. 533-536.

Close analysis of intron phase - the position of introns within codons - is claimed to provide novel evidence supporting the view that introns predate the divergence of bacteria and eukaryotes and, via 'exon shuffling', played a crucial role in protein evolution. But just how compelling is this evidence?

McVean GT. 1995. Fractious chromosomes: hybrid disruption and the origin of selfish genetic elements. Bioessays, 17 (7), pp. 579-582.

Supernumerary B chromosomes are dispensable elements of the genome which can be retained in populations at high frequencies, despite being deleterious, through the ability to undergo non-Mendelian inheritance. Their mode of origin is, however, obscure. Recent work on gynogenetic fish has demonstrated the incorporation of small, unstable, centromere-containing microchromosomes, probably of interspecific derivation, into an asexual lineage (1). That these resemble B chromosomes both in structure and behaviour is consistent with the proposal that hybridisation between closely related species may be a significant mode of origin for such selfish genetic elements. Additional work on the B chromosome of a parasitoid wasp and observations on patterns of chromosome breakage from somatic cell hybrids also support this hypothesis.

Palmer DS, Adland E, Frater JA, Goulder PJR, Ndung'u T, Matthews PC, Phillips RE, Shapiro R, McVean G, McLean AR. Predictable patterns of CTL escape and reversion across host populations and viral subtypes in HIV-1 evolution.

The twin processes of viral evolutionary escape and reversion in response to host immune pressure, in particular the cytotoxic T-lymphocyte (CTL) response, shape Human Immunodeficiency Virus-1 sequence evolution in infected host populations. The tempo of CTL escape and reversion is known to differ between CTL escape variants in a given host population. Here, we ask: are rates of escape and reversion comparable across infected host populations? For three cohorts taken from three continents, we estimate escape and reversion rates at 23 escape sites in optimally defined Gag epitopes. We find consistent escape rate estimates across the examined cohorts. Reversion rates are also consistent between a Canadian and South African infected host population. Certain Gag escape variants that incur a large replicative fitness cost are known to revert rapidly upon transmission. However, the relationship between escape/reversion rates and viral replicative capacity across a large number of epitopes has not been interrogated. We investigate this relationship by examining in vitro replicative capacities of viral sequences with minimal variation: point escape mutants induced in a lab strain. Remarkably, despite the complexities of epistatic effects exemplified by pathways to escape in famous epitopes, and the diversity of both hosts and viruses, CTL escape mutants which escape rapidly tend to be those with the highest replicative capacity when applied as a single point mutation. Similarly, mutants inducing the greatest costs to viral replicative capacity tend to revert more quickly. These data suggest that escape rates in Gag are consistent across host populations, and that in general these rates are dominated by site specific effects upon viral replicative capacity.

Zhu SJ, Almagro-Garcia J, McVean G. 2018. Deconvolution of multiple infections in Plasmodium falciparum from high throughput sequencing data. Bioinformatics, 34 (1), pp. 9-15.

Motivation: The presence of multiple infecting strains of the malarial parasite Plasmodium falciparum affects key phenotypic traits, including drug resistance and risk of severe disease. Advances in protocols and sequencing technology have made it possible to obtain high-coverage genome-wide sequencing data from blood samples and blood spots taken in the field. However, analyzing and interpreting such data is challenging because of the high rate of multiple infections present. Results: We have developed a statistical method and implementation for deconvolving multiple genome sequences present in an individual with mixed infections. The software package DEploid uses haplotype structure within a reference panel of clonal isolates as a prior for haplotypes present in a given sample. It estimates the number of strains, their relative proportions and the haplotypes presented in a sample, allowing researchers to study multiple infection in malaria with an unprecedented level of detail. Availability and implementation: The open source implementation DEploid is freely available at https://github.com/mcveanlab/DEploid under the conditions of the GPLv3 license. An R version is available at https://github.com/mcveanlab/DEploid-r. Contact: joe.zhu@bdi.ox.ac.uk or gil.mcvean@bdi.ox.ac.uk. Supplementary information: Supplementary data are available at Bioinformatics online.

Cortes A, Dendrou CA, Motyer A, Jostins L, Vukcevic D, Dilthey A, Donnelly P, Leslie S, Fugger L, McVean G. 2017. Bayesian analysis of genetic association across tree-structured routine healthcare data in the UK Biobank. Nat Genet, 49 (9), pp. 1311-1318.

Genetic discovery from the multitude of phenotypes extractable from routine healthcare data can transform understanding of the human phenome and accelerate progress toward precision medicine. However, a critical question when analyzing high-dimensional and heterogeneous data is how best to interrogate increasingly specific subphenotypes while retaining statistical power to detect genetic associations. Here we develop and employ a new Bayesian analysis framework that exploits the hierarchical structure of diagnosis classifications to analyze genetic variants against UK Biobank disease phenotypes derived from self-reporting and hospital episode statistics. Our method displays a more than 20% increase in power to detect genetic effects over other approaches and identifies new associations between classical human leukocyte antigen (HLA) alleles and common immune-mediated diseases (IMDs). By applying the approach to genetic risk scores (GRSs), we show the extent of genetic sharing among IMDs and expose differences in disease perception or diagnosis with potential clinical implications.

Miles A, Iqbal Z, Vauterin P, Pearson R, Campino S, Theron M, Gould K, Mead D, Drury E, O'Brien J et al. 2016. Indels, structural variation, and recombination drive genomic diversity in Plasmodium falciparum. Genome Res, 26 (9), pp. 1288-1299.

The malaria parasite Plasmodium falciparum has a great capacity for evolutionary adaptation to evade host immunity and develop drug resistance. Current understanding of parasite evolution is impeded by the fact that a large fraction of the genome is either highly repetitive or highly variable and thus difficult to analyze using short-read sequencing technologies. Here, we describe a resource of deep sequencing data on parents and progeny from genetic crosses, which has enabled us to perform the first genome-wide, integrated analysis of SNP, indel and complex polymorphisms, using Mendelian error rates as an indicator of genotypic accuracy. These data reveal that indels are exceptionally abundant, being more common than SNPs and thus the dominant mode of polymorphism within the core genome. We use the high density of SNP and indel markers to analyze patterns of meiotic recombination, confirming a high rate of crossover events and providing the first estimates for the rate of non-crossover events and the length of conversion tracts. We observe several instances of meiotic recombination within copy number variants associated with drug resistance, demonstrating a mechanism whereby fitness costs associated with resistance mutations could be compensated and greater phenotypic plasticity could be acquired.

1000 Genomes Project Consortium, Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR. 2015. A global reference for human genetic variation. Nature, 526 (7571), pp. 68-74.

The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.

Taylor JC, Martin HC, Lise S, Broxholme J, Cazier J-B, Rimmer A, Kanapin A, Lunter G, Fiddy S, Allan C et al. 2015. Factors influencing success of clinical genome sequencing across a broad spectrum of disorders. Nat Genet, 47 (7), pp. 717-726.

To assess factors influencing the success of whole-genome sequencing for mainstream clinical diagnosis, we sequenced 217 individuals from 156 independent cases or families across a broad spectrum of disorders in whom previous screening had identified no pathogenic variants. We quantified the number of candidate variants identified using different strategies for variant calling, filtering, annotation and prioritization. We found that jointly calling variants across samples, filtering against both local and external databases, deploying multiple annotation tools and using familial transmission above biological plausibility contributed to accuracy. Overall, we identified disease-causing variants in 21% of cases, with the proportion increasing to 34% (23/68) for mendelian disorders and 57% (8/14) in family trios. We also discovered 32 potentially clinically actionable variants in 18 genes unrelated to the referral disorder, although only 4 were ultimately considered reportable. Our results demonstrate the value of genome sequencing for routine clinical diagnosis but also highlight many outstanding challenges.

Dilthey A, Cox C, Iqbal Z, Nelson MR, McVean G. 2015. Improved genome inference in the MHC using a population reference graph. Nat Genet, 47 (6), pp. 682-688.

Although much is known about human genetic variation, such information is typically ignored in assembling new genomes. Instead, reads are mapped to a single reference, which can lead to poor characterization of regions of high sequence or structural diversity. We introduce a population reference graph, which combines multiple reference sequences and catalogs of variation. The genomes of new samples are reconstructed as paths through the graph using an efficient hidden Markov model, allowing for recombination between different haplotypes and additional variants. By applying the method to the 4.5-Mb extended MHC region on human chromosome 6, combining 8 assembled haplotypes, the sequences of known classical HLA alleles and 87,640 SNP variants from the 1000 Genomes Project, we demonstrate using simulations, SNP genotyping, and short-read and long-read data how the method improves the accuracy of genome inference and identifies regions where the current set of reference sequences is substantially incomplete.

Cited: 54 (WOS)

Venn O, Turner I, Mathieson I, de Groot N, Bontrop R, McVean G. 2014. Strong male bias drives germline mutation in chimpanzees. Science, 344 (6189), pp. 1272-1275.

Zilversmit MM, Chase EK, Chen DS, Awadalla P, Day KP, McVean G. 2013. Hypervariable antigen genes in malaria have ancient roots. BMC Evol Biol, 13 (1), pp. 110.

BACKGROUND: The var genes of the human malaria parasite Plasmodium falciparum are highly polymorphic loci coding for the erythrocyte membrane proteins 1 (PfEMP1), which are responsible for the cytoadherence of P. falciparum infected red blood cells to the human vasculature. Cytoadhesion, coupled with differential expression of var genes, contributes to virulence and allows the parasite to establish chronic infections by evading detection from the host's immune system. Although studying genetic diversity is a major focus of recent work on the var genes, little is known about the gene family's origin and evolutionary history. RESULTS: Using a novel hidden Markov model-based approach and var sequences assembled from additional isolates and species, we are able to reveal elements of both the early evolution of the var genes as well as recent diversifying events. We compare sequences of the var gene DBLα domains from divergent isolates of P. falciparum (3D7 and HB3), and a closely-related species, Plasmodium reichenowi. We find that the gene family is equally large in P. reichenowi and P. falciparum, with a minimum of 51 var genes in the P. reichenowi genome (compared to 61 in 3D7 and a minimum of 48 in HB3). In addition, we are able to define large, continuous blocks of homologous sequence among P. falciparum and P. reichenowi var gene DBLα domains. These results reveal that the contemporary structure of the var gene family was present before the divergence of P. falciparum and P. reichenowi, estimated to be between 2.5 and 6 million years ago. We also reveal that recombination has played an important and traceable role in both the establishment, and the maintenance, of diversity in the sequences. CONCLUSIONS: Despite the remarkable diversity and rapid evolution found in these loci within and among P. falciparum populations, the basic structure of these domains and the gene family is surprisingly old and stable. Revealing a common structure as well as conserved sequence between two species also has implications for developing new primate-parasite models for studying the pathology and immunology of falciparum malaria, and for studying the population genetics of var genes and associated virulence phenotypes.

Leffler EM, Gao Z, Pfeifer S, Ségurel L, Auton A, Venn O, Bowden R, Bontrop R, Wall JD, Sella G et al. 2013. Multiple instances of ancient balancing selection shared between humans and chimpanzees. Science, 339 (6127), pp. 1578-1582.

Instances in which natural selection maintains genetic variation in a population over millions of years are thought to be extremely rare. We conducted a genome-wide scan for long-lived balancing selection by looking for combinations of SNPs shared between humans and chimpanzees. In addition to the major histocompatibility complex, we identified 125 regions in which the same haplotypes are segregating in the two species, all but two of which are noncoding. In six cases, there is evidence for an ancestral polymorphism that persisted to the present in humans and chimpanzees. Regions with shared haplotypes are significantly enriched for membrane glycoproteins, and a similar trend is seen among shared coding polymorphisms. These findings indicate that ancient balancing selection has shaped human variation and point to genes involved in host-pathogen interactions as common targets.

1000 Genomes Project Consortium, Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature, 491 (7422), pp. 56-65.

By characterizing the geographic and functional spectrum of human genetic variation, the 1000 Genomes Project aims to build a resource to help to understand the genetic contribution to disease. Here we describe the genomes of 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing. By developing methods to integrate information across several algorithms and diverse data sources, we provide a validated haplotype map of 38 million single nucleotide polymorphisms, 1.4 million short insertions and deletions, and more than 14,000 larger deletions. We show that individuals from different populations carry different profiles of rare and common variants, and that low-frequency variants show substantial geographic differentiation, which is further increased by the action of purifying selection. We show that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites. This resource, which captures up to 98% of accessible single nucleotide polymorphisms at a frequency of 1% in related populations, enables analysis of common and low-frequency variants in individuals from diverse, including admixed, populations.

Gregory AP, Dendrou CA, Attfield KE, Haghikia A, Xifara DK, Butter F, Poschmann G, Kaur G, Lambert L, Leach OA et al. 2012. TNF receptor 1 genetic risk mirrors outcome of anti-TNF therapy in multiple sclerosis. Nature, 488 (7412), pp. 508-511.

Although there has been much success in identifying genetic variants associated with common diseases using genome-wide association studies (GWAS), it has been difficult to demonstrate which variants are causal and what role they have in disease. Moreover, the modest contribution that these variants make to disease risk has raised questions regarding their medical relevance. Here we have investigated a single nucleotide polymorphism (SNP) in the TNFRSF1A gene, which encodes tumour necrosis factor receptor 1 (TNFR1), and which was discovered through GWAS to be associated with multiple sclerosis (MS), but not with other autoimmune conditions such as rheumatoid arthritis, psoriasis and Crohn’s disease. By analysing MS GWAS data in conjunction with the 1000 Genomes Project data we provide genetic evidence that strongly implicates this SNP, rs1800693, as the causal variant in the TNFRSF1A region. We further substantiate this through functional studies showing that the MS risk allele directs expression of a novel, soluble form of TNFR1 that can block TNF. Importantly, TNF-blocking drugs can promote onset or exacerbation of MS, but they have proven highly efficacious in the treatment of autoimmune diseases for which there is no association with rs1800693. This indicates that the clinical experience with these drugs parallels the disease association of rs1800693, and that the MS-associated TNFR1 variant mimics the effect of TNF-blocking drugs. Hence, our study demonstrates that clinical practice can be informed by comparing GWAS across common autoimmune diseases and by investigating the functional consequences of the disease-associated genetic variation.

Auton A, Fledel-Alon A, Pfeifer S, Venn O, Ségurel L, Street T, Leffler EM, Bowden R, Aneas I, Broxholme J et al. 2012. A fine-scale chimpanzee genetic map from population sequencing. Science, 336 (6078), pp. 193-198.

To study the evolution of recombination rates in apes, we developed methodology to construct a fine-scale genetic map from high-throughput sequence data from 10 Western chimpanzees, Pan troglodytes verus. Compared to the human genetic map, broad-scale recombination rates tend to be conserved, but with exceptions, particularly in regions of chromosomal rearrangements and around the site of ancestral fusion in human chromosome 2. At fine scales, chimpanzee recombination is dominated by hotspots, which show no overlap with those of humans even though rates are similarly elevated around CpG islands and decreased within genes. The hotspot-specifying protein PRDM9 shows extensive variation among Western chimpanzees, and there is little evidence that any sequence motifs are enriched in hotspots. The contrasting locations of hotspots provide a natural experiment, which demonstrates the impact of recombination on base composition.

Improving viral genomic data analyses to gain new insights into HIV transmission and evolution

Huge progress has been made in the prevention of HIV/AIDS. Nonetheless, rates of new infection remain stubbornly high, and so there remains an urgent need for new science, tools and technologies. Next-generation sequencing (NGS) has revolutionised disease surveillance, resulting in large databases of viral genetic diversity collected from the most affected populations. There are challenges unique to HIV in interpreting NGS data. Each infection consists of a swarm of related viruses: NGS ...
