Professor Gil McVean FRS FMedSci

Research Area: Genetics and Genomics
Technology Exchange: Bioinformatics, SNP typing and Statistical genetics
Scientific Themes: Genetics & Genomics
Keywords: Population genetics, Coalescent modelling, HLA imputation, de novo assembly and Statistical Genetics
My research covers several areas in the analysis of genetic variation, combining the development of methods for analysing high-throughput sequencing data with theoretical work and empirical analysis. Of particular interest are the analysis of recombination from population genetic data, dissecting signals of disease association within the HLA, methods for inferring genealogical history from DNA sequence data, and de novo sequence assembly for the discovery of genetic variation. I am a member of the Mathematical Genetics Group in the Department of Statistics and Acting Director of the new Oxford Big Data Institute.

Name | Department | Institution | Country
Professor Dominic Kwiatkowski | Wellcome Trust Centre for Human Genetics | Oxford University, Henry Wellcome Building of Genomic Medicine | United Kingdom
Professor Peter Donnelly FRS | Wellcome Trust Centre for Human Genetics | Oxford University, Henry Wellcome Building of Genomic Medicine | United Kingdom
Prof Mark McCarthy (RDM) | OCDEM | Oxford University, Oxford Centre for Diabetes, Endocrinology & Metabolism | United Kingdom
Prof Lars Fugger (RDM) | Weatherall Institute of Molecular Medicine | Oxford University, Weatherall Institute of Molecular Medicine | United Kingdom
Dr Luke Jostins | Nuffield Department of Medicine | University of Oxford | United Kingdom
Professor Chris Holmes | Wellcome Trust Centre for Human Genetics | Oxford University, Henry Wellcome Building of Genomic Medicine | United Kingdom
Prof Andrew OM Wilkie FRS FMedSci FRCP (RDM) | Nuffield Division of Clinical Laboratory Sciences | Oxford University, Weatherall Institute of Molecular Medicine | United Kingdom
Professor Jenny Taylor | Wellcome Trust Centre for Human Genetics | Oxford University, Henry Wellcome Building of Genomic Medicine | United Kingdom
Professor Adrian VS Hill | Jenner Institute | Oxford University, Old Road Campus Research Building | United Kingdom
Professor Paul Flicek | | Wellcome Trust Sanger Institute | United Kingdom
Professor Stephen Sawcer | Department of Clinical Neurosciences | University of Cambridge | United Kingdom
Prof Doug Higgs FRS (RDM) | Nuffield Division of Clinical Laboratory Sciences | Oxford University, Weatherall Institute of Molecular Medicine | United Kingdom
Prof Anne Goriely (RDM) | Nuffield Division of Clinical Laboratory Sciences | Oxford University, Weatherall Institute of Molecular Medicine | United Kingdom
Dr David Bentley | | Illumina Inc. | United Kingdom
Dr Mike Eberle | | Illumina Inc. | United States
Dr Ian Henderson | | University of Cambridge | United Kingdom

Miles A, Iqbal Z, Vauterin P, Pearson R, Campino S, Theron M, Gould K, Mead D, Drury E, O'Brien J et al. 2016. Indels, structural variation, and recombination drive genomic diversity in Plasmodium falciparum. Genome Res, 26 (9), pp. 1288-1299.

The malaria parasite Plasmodium falciparum has a great capacity for evolutionary adaptation to evade host immunity and develop drug resistance. Current understanding of parasite evolution is impeded by the fact that a large fraction of the genome is either highly repetitive or highly variable and thus difficult to analyze using short-read sequencing technologies. Here, we describe a resource of deep sequencing data on parents and progeny from genetic crosses, which has enabled us to perform the first genome-wide, integrated analysis of SNP, indel and complex polymorphisms, using Mendelian error rates as an indicator of genotypic accuracy. These data reveal that indels are exceptionally abundant, being more common than SNPs and thus the dominant mode of polymorphism within the core genome. We use the high density of SNP and indel markers to analyze patterns of meiotic recombination, confirming a high rate of crossover events and providing the first estimates for the rate of non-crossover events and the length of conversion tracts. We observe several instances of meiotic recombination within copy number variants associated with drug resistance, demonstrating a mechanism whereby fitness costs associated with resistance mutations could be compensated and greater phenotypic plasticity could be acquired.

Fuchsberger C, Flannick J, Teslovich TM, Mahajan A, Agarwala V, Gaulton KJ, Ma C, Fontanillas P, Moutsianas L, McCarthy DJ et al. 2016. The genetic architecture of type 2 diabetes. Nature, 536 (7614), pp. 41-47.

The genetic architecture of common traits, including the number, frequency, and effect sizes of inherited variants that contribute to individual risk, has been long debated. Genome-wide association studies have identified scores of common variants associated with type 2 diabetes, but in aggregate, these explain only a fraction of the heritability of this disease. Here, to test the hypothesis that lower-frequency variants explain much of the remainder, the GoT2D and T2D-GENES consortia performed whole-genome sequencing in 2,657 European individuals with and without diabetes, and exome sequencing in 12,940 individuals from five ancestry groups. To increase statistical power, we expanded the sample size via genotyping and imputation in a further 111,548 subjects. Variants associated with type 2 diabetes after sequencing were overwhelmingly common and most fell within regions previously identified by genome-wide association studies. Comprehensive enumeration of sequence variation is necessary to identify functional alleles that provide important clues to disease pathophysiology, but large-scale sequencing does not support the idea that lower-frequency variants have a major role in predisposition to type 2 diabetes.

Hellner K, Miranda F, Fotso Chedom D, Herrero-Gonzalez S, Hayden DM, Tearle R, Artibani M, KaramiNejadRanjbar M, Williams R, Gaitskell K et al. 2016. Premalignant SOX2 overexpression in the fallopian tubes of ovarian cancer patients: Discovery and validation studies. EBioMedicine, 10, pp. 137-149.

Current screening methods for ovarian cancer can only detect advanced disease. Earlier detection has proved difficult because the molecular precursors involved in the natural history of the disease are unknown. To identify early driver mutations in ovarian cancer cells, we used dense whole genome sequencing of micrometastases and microscopic residual disease collected at three time points over three years from a single patient during treatment for high-grade serous ovarian cancer (HGSOC). The functional and clinical significance of the identified mutations was examined using a combination of population-based whole genome sequencing, targeted deep sequencing, multi-center analysis of protein expression, loss of function experiments in an in-vivo reporter assay and mammalian models, and gain of function experiments in primary cultured fallopian tube epithelial (FTE) cells. We identified frequent mutations involving a 40kb distal repressor region for the key stem cell differentiation gene SOX2. In the apparently normal FTE, the region was also mutated. This was associated with a profound increase in SOX2 expression (p<2^-16), which was not found in patients without cancer (n=108). Importantly, we show that SOX2 overexpression in FTE is nearly ubiquitous in patients with HGSOCs (n=100), and common in BRCA1-BRCA2 mutation carriers (n=71) who underwent prophylactic salpingo-oophorectomy. We propose that the finding of SOX2 overexpression in FTE could be exploited to develop biomarkers for detecting disease at a premalignant stage, which would reduce mortality from this devastating disease.

Choi K, Reinhard C, Serra H, Ziolkowski PA, Underwood CJ, Zhao X, Hardcastle TJ, Yelina NE, Griffin C, Jackson M et al. 2016. Recombination Rate Heterogeneity within Arabidopsis Disease Resistance Genes. PLoS Genet, 12 (7), pp. e1006179.

Meiotic crossover frequency varies extensively along chromosomes and is typically concentrated in hotspots. As recombination increases genetic diversity, hotspots are predicted to occur at immunity genes, where variation may be beneficial. A major component of plant immunity is recognition of pathogen Avirulence (Avr) effectors by resistance (R) genes that encode NBS-LRR domain proteins. Therefore, we sought to test whether NBS-LRR genes would overlap with meiotic crossover hotspots using experimental genetics in Arabidopsis thaliana. NBS-LRR genes tend to physically cluster in plant genomes; for example, in Arabidopsis most are located in large clusters on the south arms of chromosomes 1 and 5. We experimentally mapped 1,439 crossovers within these clusters and observed NBS-LRR gene associated hotspots, which were also detected as historical hotspots via analysis of linkage disequilibrium. However, we also observed NBS-LRR gene coldspots, which in some cases correlate with structural heterozygosity. To study recombination at the fine-scale we used high-throughput sequencing to analyze ~1,000 crossovers within the RESISTANCE TO ALBUGO CANDIDA1 (RAC1) R gene hotspot. This revealed elevated intragenic crossovers, overlapping nucleosome-occupied exons that encode the TIR, NBS and LRR domains. The highest RAC1 recombination frequency was promoter-proximal and overlapped CTT-repeat DNA sequence motifs, which have previously been associated with plant crossover hotspots. Additionally, we show a significant influence of natural genetic variation on NBS-LRR cluster recombination rates, using crosses between Arabidopsis ecotypes. In conclusion, we show that a subset of NBS-LRR genes are strong hotspots, whereas others are coldspots. This reveals a complex recombination landscape in Arabidopsis NBS-LRR genes, which we propose results from varying coevolutionary pressures exerted by host-pathogen relationships, and is influenced by structural heterozygosity.

Kelleher J, Etheridge AM, McVean G. 2016. Efficient Coalescent Simulation and Genealogical Analysis for Large Sample Sizes. PLoS Comput Biol, 12 (5), pp. e1004842.

A central challenge in the analysis of genetic variation is to provide realistic genome simulation across millions of samples. Present day coalescent simulations do not scale well, or use approximations that fail to capture important long-range linkage properties. Analysing the results of simulations also presents a substantial challenge, as current methods to store genealogies consume a great deal of space, are slow to parse and do not take advantage of shared structure in correlated trees. We solve these problems by introducing sparse trees and coalescence records as the key units of genealogical analysis. Using these tools, exact simulation of the coalescent with recombination for chromosome-sized regions over hundreds of thousands of samples is possible, and substantially faster than present-day approximate methods. We can also analyse the results orders of magnitude more quickly than with existing methods.

Jostins L, McVean G. 2016. Trinculo: Bayesian and frequentist multinomial logistic regression for genome-wide association studies of multi-category phenotypes. Bioinformatics, 32 (12), pp. 1898-1900.

MOTIVATION: For many classes of disease the same genetic risk variants underlie many related phenotypes or disease subtypes. Multinomial logistic regression provides an attractive framework to analyze multi-category phenotypes, and explore the genetic relationships between these phenotype categories. We introduce Trinculo, a program that implements a wide range of multinomial analyses in a single fast package that is designed to be easy to use by users of standard genome-wide association study software. AVAILABILITY AND IMPLEMENTATION: An open source C implementation, with code and binaries for Linux and Mac OSX, is available for download at http://sourceforge.net/projects/trinculo. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. CONTACT: lj4@well.ox.ac.uk.

Earle SG, Wu CH, Charlesworth J, Stoesser N, Gordon NC, Walker TM, Spencer CC, Iqbal Z, Clifton DA, Hopkins KL et al. 2016. Identifying lineage effects when controlling for population structure improves power in bacterial association studies. Nat Microbiol, 1 (5), pp. 16041.

Bacteria pose unique challenges for genome-wide association studies because of strong structuring into distinct strains and substantial linkage disequilibrium across the genome(1,2). Although methods developed for human studies can correct for strain structure(3,4), this risks considerable loss-of-power because genetic differences between strains often contribute substantial phenotypic variability(5). Here, we propose a new method that captures lineage-level associations even when locus-specific associations cannot be fine-mapped. We demonstrate its ability to detect genes and genetic variants underlying resistance to 17 antimicrobials in 3,144 isolates from four taxonomically diverse clonal and recombining bacteria: Mycobacterium tuberculosis, Staphylococcus aureus, Escherichia coli and Klebsiella pneumoniae. Strong selection, recombination and penetrance confer high power to recover known antimicrobial resistance mechanisms and reveal a candidate association between the outer membrane porin nmpC and cefazolin resistance in E. coli. Hence, our method pinpoints locus-specific effects where possible and boosts power by detecting lineage-level differences when fine-mapping is intractable.

MalariaGEN Plasmodium falciparum Community Project. 2016. Genomic epidemiology of artemisinin resistant malaria. eLife, 5 (MARCH2016).

The current epidemic of artemisinin resistant Plasmodium falciparum in Southeast Asia is the result of a soft selective sweep involving at least 20 independent kelch13 mutations. In a large global survey, we find that kelch13 mutations which cause resistance in Southeast Asia are present at low frequency in Africa. We show that African kelch13 mutations have originated locally, and that kelch13 shows a normal variation pattern relative to other genes in Africa, whereas in Southeast Asia there is a great excess of non-synonymous mutations, many of which cause radical amino-acid changes. Thus, kelch13 is not currently undergoing strong selection in Africa, despite a deep reservoir of variations that could potentially allow resistance to emerge rapidly. The practical implications are that public health surveillance for artemisinin resistance should not rely on kelch13 data alone, and interventions to prevent resistance must account for local evolutionary conditions, shown by genomic epidemiology to differ greatly between geographical regions.

Bradley P, Gordon NC, Walker TM, Dunn L, Heys S, Huang B, Earle S, Pankhurst LJ, Anson L, de Cesare M et al. 2015. Rapid antibiotic-resistance predictions from genome sequence data for Staphylococcus aureus and Mycobacterium tuberculosis. Nat Commun, 6, pp. 10063.

The rise of antibiotic-resistant bacteria has led to an urgent need for rapid detection of drug resistance in clinical samples, and improvements in global surveillance. Here we show how de Bruijn graph representation of bacterial diversity can be used to identify species and resistance profiles of clinical isolates. We implement this method for Staphylococcus aureus and Mycobacterium tuberculosis in a software package ('Mykrobe predictor') that takes raw sequence data as input, and generates a clinician-friendly report within 3 minutes on a laptop. For S. aureus, the error rates of our method are comparable to gold-standard phenotypic methods, with sensitivity/specificity of 99.1%/99.6% across 12 antibiotics (using an independent validation set, n=470). For M. tuberculosis, our method predicts resistance with sensitivity/specificity of 82.6%/98.5% (independent validation set, n=1,609); sensitivity is lower here, probably because of limited understanding of the underlying genetic mechanisms. We give evidence that minor alleles improve detection of extremely drug-resistant strains, and demonstrate feasibility of the use of emerging single-molecule nanopore sequencing techniques for these purposes.

Singhal S, Leffler EM, Sannareddy K, Turner I, Venn O, Hooper DM, Strand AI, Li Q, Raney B, Balakrishnan CN et al. 2015. Stable recombination hotspots in birds. Science, 350 (6263), pp. 928-932.

The DNA-binding protein PRDM9 has a critical role in specifying meiotic recombination hotspots in mice and apes, but it appears to be absent from other vertebrate species, including birds. To study the evolution and determinants of recombination in species lacking the gene that encodes PRDM9, we inferred fine-scale genetic maps from population resequencing data for two bird species: the zebra finch, Taeniopygia guttata, and the long-tailed finch, Poephila acuticauda. We found that both species have recombination hotspots, which are enriched near functional genomic elements. Unlike in mice and apes, most hotspots are shared between the two species, and their conservation seems to extend over tens of millions of years. These observations suggest that in the absence of PRDM9, recombination targets functional features that both enable access to the genome and constrain its evolution.

Vukcevic D, Traherne JA, Næss S, Ellinghaus E, Kamatani Y, Dilthey A, Lathrop M, Karlsen TH, Franke A, Moffatt M et al. 2015. Imputation of KIR Types from SNP Variation Data. Am J Hum Genet, 97 (4), pp. 593-607.

Large population studies of immune system genes are essential for characterizing their role in diseases, including autoimmune conditions. Of key interest are a group of genes encoding the killer cell immunoglobulin-like receptors (KIRs), which have known and hypothesized roles in autoimmune diseases, resistance to viruses, reproductive conditions, and cancer. These genes are highly polymorphic, which makes typing expensive and time consuming. Consequently, despite their importance, KIRs have been little studied in large cohorts. Statistical imputation methods developed for other complex loci (e.g., human leukocyte antigen [HLA]) on the basis of SNP data provide an inexpensive high-throughput alternative to direct laboratory typing of these loci and have enabled important findings and insights for many diseases. We present KIR∗IMP, a method for imputation of KIR copy number. We show that KIR∗IMP is highly accurate and thus allows the study of KIRs in large cohorts and enables detailed investigation of the role of KIRs in human disease.

1000 Genomes Project Consortium, Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR. 2015. A global reference for human genetic variation. Nature, 526 (7571), pp. 68-74.

The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.

Tyler-Smith C, Yang H, Landweber LF, Dunham I, Knoppers BM, Donnelly P, Mardis ER, Snyder M, McVean G. 2015. Where Next for Genetics and Genomics? PLoS Biol, 13 (7), pp. e1002216.

The last few decades have utterly transformed genetics and genomics, but what might the next ten years bring? PLOS Biology asked eight leaders spanning a range of related areas to give us their predictions. Without exception, the predictions are for more data on a massive scale and of more diverse types. All are optimistic and predict enormous positive impact on scientific understanding, while a recurring theme is the benefit of such data for the transformation and personalization of medicine. Several also point out that the biggest changes will very likely be those that we don't foresee, even now.

Parham LR, Briley LP, Li L, Shen J, Newcombe PJ, King KS, Slater AJ, Dilthey A, Iqbal Z, McVean G et al. 2016. Comprehensive genome-wide evaluation of lapatinib-induced liver injury yields a single genetic signal centered on known risk allele HLA-DRB1*07:01. Pharmacogenomics J, 16 (2), pp. 180-185.

Lapatinib is associated with a low incidence of serious liver injury. Previous investigations have identified and confirmed the Class II allele HLA-DRB1*07:01 to be strongly associated with lapatinib-induced liver injury; however, the moderate positive predictive value limits its clinical utility. To assess whether additional genetic variants located within the major histocompatibility complex locus or elsewhere in the genome may influence lapatinib-induced liver injury risk, and potentially lead to a genetic association with improved predictive qualities, we have taken two approaches: a genome-wide association study and a whole-genome sequencing study. This evaluation did not reveal additional associations other than the previously identified association for HLA-DRB1*07:01. The present study represents the most comprehensive genetic evaluation of drug-induced liver injury (DILI) or hypersensitivity, and suggests that investigation of possible human leukocyte antigen associations with DILI and other hypersensitivities represents an important first step in understanding the mechanism of these events.

Taylor JC, Martin HC, Lise S, Broxholme J, Cazier JB, Rimmer A, Kanapin A, Lunter G, Fiddy S, Allan C et al. 2015. Factors influencing success of clinical genome sequencing across a broad spectrum of disorders. Nat Genet, 47 (7), pp. 717-726.

To assess factors influencing the success of whole-genome sequencing for mainstream clinical diagnosis, we sequenced 217 individuals from 156 independent cases or families across a broad spectrum of disorders in whom previous screening had identified no pathogenic variants. We quantified the number of candidate variants identified using different strategies for variant calling, filtering, annotation and prioritization. We found that jointly calling variants across samples, filtering against both local and external databases, deploying multiple annotation tools and using familial transmission above biological plausibility contributed to accuracy. Overall, we identified disease-causing variants in 21% of cases, with the proportion increasing to 34% (23/68) for mendelian disorders and 57% (8/14) in family trios. We also discovered 32 potentially clinically actionable variants in 18 genes unrelated to the referral disorder, although only 4 were ultimately considered reportable. Our results demonstrate the value of genome sequencing for routine clinical diagnosis but also highlight many outstanding challenges.

Dilthey A, Cox C, Iqbal Z, Nelson MR, McVean G. 2015. Improved genome inference in the MHC using a population reference graph. Nat Genet, 47 (6), pp. 682-688.

Although much is known about human genetic variation, such information is typically ignored in assembling new genomes. Instead, reads are mapped to a single reference, which can lead to poor characterization of regions of high sequence or structural diversity. We introduce a population reference graph, which combines multiple reference sequences and catalogs of variation. The genomes of new samples are reconstructed as paths through the graph using an efficient hidden Markov model, allowing for recombination between different haplotypes and additional variants. By applying the method to the 4.5-Mb extended MHC region on human chromosome 6, combining 8 assembled haplotypes, the sequences of known classical HLA alleles and 87,640 SNP variants from the 1000 Genomes Project, we demonstrate, using simulations, SNP genotyping, and short-read and long-read data, how the method improves the accuracy of genome inference, and we identify regions where the current set of reference sequences is substantially incomplete.

Moutsianas L, Agarwala V, Fuchsberger C, Flannick J, Rivas MA, Gaulton KJ, Albers PK, GoT2D Consortium, McVean G, Boehnke M et al. 2015. The power of gene-based rare variant methods to detect disease-associated variation and test hypotheses about complex disease. PLoS Genet, 11 (4), pp. e1005165.

Genome and exome sequencing in large cohorts enables characterization of the role of rare variation in complex diseases. Success in this endeavor, however, requires investigators to test a diverse array of genetic hypotheses which differ in the number, frequency and effect sizes of underlying causal variants. In this study, we evaluated the power of gene-based association methods to interrogate such hypotheses, and examined the implications for study design. We developed a flexible simulation approach, using 1000 Genomes data, to (a) generate sequence variation at human genes in up to 10K case-control samples, and (b) quantify the statistical power of a panel of widely used gene-based association tests under a variety of allelic architectures, locus effect sizes, and significance thresholds. For loci explaining ~1% of phenotypic variance underlying a common dichotomous trait, we find that all methods have low absolute power to achieve exome-wide significance (~5-20% power at α = 2.5 × 10^-6) in 3K individuals; even in 10K samples, power is modest (~60%). The combined application of multiple methods increases sensitivity, but does so at the expense of a higher false positive rate. MiST, SKAT-O, and KBAC have the highest individual mean power across simulated datasets, but we observe wide architecture-dependent variability in the individual loci detected by each test, suggesting that inferences about disease architecture from analysis of sequencing studies can differ depending on which methods are used. Our results imply that tens of thousands of individuals, extensive functional annotation, or highly targeted hypothesis testing will be required to confidently detect or exclude rare variant signals at complex disease loci.

Miotto O, Amato R, Ashley EA, MacInnis B, Almagro-Garcia J, Amaratunga C, Lim P, Mead D, Oyola SO, Dhorda M et al. 2015. Genetic architecture of artemisinin-resistant Plasmodium falciparum. Nat Genet, 47 (3), pp. 226-234.

We report a large multicenter genome-wide association study of Plasmodium falciparum resistance to artemisinin, the frontline antimalarial drug. Across 15 locations in Southeast Asia, we identified at least 20 mutations in kelch13 (PF3D7_1343700) affecting the encoded propeller and BTB/POZ domains, which were associated with a slow parasite clearance rate after treatment with artemisinin derivatives. Nonsynonymous polymorphisms in fd (ferredoxin), arps10 (apicoplast ribosomal protein S10), mdr2 (multidrug resistance protein 2) and crt (chloroquine resistance transporter) also showed strong associations with artemisinin resistance. Analysis of the fine structure of the parasite population showed that the fd, arps10, mdr2 and crt polymorphisms are markers of a genetic background on which kelch13 mutations are particularly likely to arise and that they correlate with the contemporary geographical boundaries and population frequencies of artemisinin resistance. These findings indicate that the risk of new resistance-causing mutations emerging is determined by specific predisposing genetic factors in the underlying parasite population.

Mahajan A, Sim X, Ng HJ, Manning A, Rivas MA, Highland HM, Locke AE, Grarup N, Im HK, Cingolani P et al. 2015. Identification and functional characterization of G6PC2 coding variants influencing glycemic traits define an effector transcript at the G6PC2-ABCB11 locus. PLoS Genet, 11 (1), pp. e1004876.

Genome wide association studies (GWAS) for fasting glucose (FG) and insulin (FI) have identified common variant signals which explain 4.8% and 1.2% of trait variance, respectively. It is hypothesized that low-frequency and rare variants could contribute substantially to unexplained genetic variance. To test this, we analyzed exome-array data from up to 33,231 non-diabetic individuals of European ancestry. We found exome-wide significant (P<5×10^-7) evidence for two loci not previously highlighted by common variant GWAS: GLP1R (p.Ala316Thr, minor allele frequency (MAF)=1.5%) influencing FG levels, and URB2 (p.Glu594Val, MAF = 0.1%) influencing FI levels. Coding variant associations can highlight potential effector genes at (non-coding) GWAS signals. At the G6PC2/ABCB11 locus, we identified multiple coding variants in G6PC2 (p.Val219Leu, p.His177Tyr, and p.Tyr207Ser) influencing FG levels, conditionally independent of each other and the non-coding GWAS signal. In vitro assays demonstrate that these associated coding alleles result in reduced protein abundance via proteasomal degradation, establishing G6PC2 as an effector gene at this locus. Reconciliation of single-variant associations and functional effects was only possible when haplotype phase was considered. In contrast to earlier reports suggesting that, paradoxically, glucose-raising alleles at this locus are protective against type 2 diabetes (T2D), the p.Val219Leu G6PC2 variant displayed a modest but directionally consistent association with T2D risk. Coding variant associations for glycemic traits in GWAS signals highlight PCSK1, RREB1, and ZHX3 as likely effector transcripts. These coding variant association signals do not have a major impact on the trait variance explained, but they do provide valuable biological insights.

Moutsianas L, Jostins L, Beecham AH, Dilthey AT, Xifara DK, Ban M, Shah TS, Patsopoulos NA, Alfredsson L, Anderson CA et al. 2015. Class II HLA interactions modulate genetic risk for multiple sclerosis. Nat Genet, 47 (10), pp. 1107-1113.

Association studies have greatly refined the understanding of how variation within the human leukocyte antigen (HLA) genes influences risk of multiple sclerosis. However, the extent to which major effects are modulated by interactions is poorly characterized. We analyzed high-density SNP data on 17,465 cases and 30,385 controls from 11 cohorts of European ancestry, in combination with imputation of classical HLA alleles, to build a high-resolution map of HLA genetic risk and assess the evidence for interactions involving classical HLA alleles. Among new and previously identified class II risk alleles (HLA-DRB1*15:01, HLA-DRB1*13:03, HLA-DRB1*03:01, HLA-DRB1*08:01 and HLA-DQB1*03:02) and class I protective alleles (HLA-A*02:01, HLA-B*44:02, HLA-B*38:01 and HLA-B*55:01), we find evidence for two interactions involving pairs of class II alleles: HLA-DQA1*01:01-HLA-DRB1*15:01 and HLA-DQB1*03:01-HLA-DQB1*03:02. We find no evidence for interactions between classical HLA alleles and non-HLA risk-associated variants and estimate a minimal effect of polygenic epistasis in modulating major risk alleles.

Dilthey A, Cox C, Iqbal Z, Nelson MR, McVean G. 2015. Improved genome inference in the MHC using a population reference graph. Nat Genet, 47 (6), pp. 682-688.

Although much is known about human genetic variation, such information is typically ignored in assembling new genomes. Instead, reads are mapped to a single reference, which can lead to poor characterization of regions of high sequence or structural diversity. We introduce a population reference graph, which combines multiple reference sequences and catalogs of variation. The genomes of new samples are reconstructed as paths through the graph using an efficient hidden Markov model, allowing for recombination between different haplotypes and additional variants. By applying the method to the 4.5-Mb extended MHC region on human chromosome 6, combining 8 assembled haplotypes, the sequences of known classical HLA alleles and 87,640 SNP variants from the 1000 Genomes Project, we demonstrate using simulations, SNP genotyping, and short-read and long-read data how the method improves the accuracy of genome inference and identify regions where the current set of reference sequences is substantially incomplete.

Wilkie A, Maher G, Giannoulatou E, McVean G, Goriely A. 2014. Selfish mutations in spermatogenesis and paternal age effects. Mutagenesis, 29 (6), pp. 546.

Majithia AR, Flannick J, Shahinian P, Guo M, Bray MA, Fontanillas P, Gabriel SB, GoT2D Consortium, NHGRI JHS/FHS Allelic Spectrum Project, SIGMA T2D Consortium et al. 2014. Rare variants in PPARG with decreased activity in adipocyte differentiation are associated with increased risk of type 2 diabetes. Proc Natl Acad Sci U S A, 111 (36), pp. 13127-13132.

Peroxisome proliferator-activated receptor gamma (PPARG) is a master transcriptional regulator of adipocyte differentiation and a canonical target of antidiabetic thiazolidinedione medications. In rare families, loss-of-function (LOF) mutations in PPARG are known to cosegregate with lipodystrophy and insulin resistance; in the general population, the common P12A variant is associated with a decreased risk of type 2 diabetes (T2D). Whether and how rare variants in PPARG and defects in adipocyte differentiation influence risk of T2D in the general population remains undetermined. By sequencing PPARG in 19,752 T2D cases and controls drawn from multiple studies and ethnic groups, we identified 49 previously unidentified, nonsynonymous PPARG variants (MAF < 0.5%). Considered in aggregate (with or without computational prediction of functional consequence), these rare variants showed no association with T2D (OR = 1.35; P = 0.17). The function of the 49 variants was experimentally tested in a novel high-throughput human adipocyte differentiation assay, and nine were found to have reduced activity in the assay. Carrying any of these nine LOF variants was associated with a substantial increase in risk of T2D (OR = 7.22; P = 0.005). The combination of large-scale DNA sequencing and functional testing in the laboratory reveals that approximately 1 in 1,000 individuals carries a variant in PPARG that reduces function in a human adipocyte differentiation assay and is associated with a substantial risk of T2D.

Mathieson I, McVean G. 2014. Demography and the age of rare variants. PLoS Genet, 10 (8), pp. e1004528.

Large whole-genome sequencing projects have provided access to much rare variation in human populations, which is highly informative about population structure and recent demography. Here, we show how the age of rare variants can be estimated from patterns of haplotype sharing and how these ages can be related to historical relationships between populations. We investigate the distribution of the age of variants occurring exactly twice (ƒ(2) variants) in a worldwide sample sequenced by the 1000 Genomes Project, revealing enormous variation across populations. The median age of haplotypes carrying ƒ(2) variants is 50 to 160 generations across populations within Europe or Asia, and 170 to 320 generations within Africa. Haplotypes shared between continents are much older with median ages for haplotypes shared between Europe and Asia ranging from 320 to 670 generations. The distribution of the ages of ƒ(2) haplotypes is informative about their demography, revealing recent bottlenecks, ancient splits, and more modern connections between populations. We see the effect of selection in the observation that functional variants are significantly younger than nonfunctional variants of the same frequency. This approach is relatively insensitive to mutation rate and complements other nonparametric methods for demographic inference.
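As a small illustration of the definition used in the abstract above, the sketch below picks out ƒ(2) variants (sites whose derived allele is carried by exactly two haplotypes in the sample) from a toy 0/1 haplotype matrix. The function name, the data layout and the toy data are assumptions made for this example only; it does not attempt the haplotype-sharing age estimation described in the paper.

```python
from typing import List, Tuple


def find_f2_variants(haplotypes: List[List[int]]) -> List[Tuple[int, Tuple[int, int]]]:
    """Return (site_index, (carrier_a, carrier_b)) for each site whose
    derived allele (coded 1) appears on exactly two haplotypes."""
    if not haplotypes:
        return []
    n_sites = len(haplotypes[0])
    f2_sites = []
    for site in range(n_sites):
        # Indices of the haplotypes carrying the derived allele at this site.
        carriers = [i for i, hap in enumerate(haplotypes) if hap[site] == 1]
        if len(carriers) == 2:
            f2_sites.append((site, (carriers[0], carriers[1])))
    return f2_sites


# Hypothetical toy sample: 4 haplotypes (rows) by 5 sites (columns).
toy_haplotypes = [
    [0, 1, 0, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
]

f2_sites = find_f2_variants(toy_haplotypes)
print(f2_sites)
```

In this toy sample only the second site (index 1) qualifies, shared by haplotypes 0 and 1; the pair of carriers is the starting point for asking how long a haplotype the two share around the variant.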

Rimmer A, Phan H, Mathieson I, Iqbal Z, Twigg SR, WGS500 Consortium, Wilkie AO, McVean G, Lunter G. 2014. Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications. Nat Genet, 46 (8), pp. 912-918.

High-throughput DNA sequencing technology has transformed genetic research and is starting to make an impact on clinical practice. However, analyzing high-throughput sequencing data remains challenging, particularly in clinical settings where accuracy and turnaround times are critical. We present a new approach to this problem, implemented in a software package called Platypus. Platypus achieves high sensitivity and specificity for SNPs, indels and complex polymorphisms by using local de novo assembly to generate candidate variants, followed by local realignment and probabilistic haplotype estimation. It is an order of magnitude faster than existing tools and generates calls from raw aligned read data without preprocessing. We demonstrate the performance of Platypus in clinically relevant experimental designs by comparing with SAMtools and GATK on whole-genome and exome-capture data, by identifying de novo variation in 15 parent-offspring trios with high sensitivity and specificity, and by estimating human leukocyte antigen genotypes directly from variant calls.

Venn O, Turner I, Mathieson I, de Groot N, Bontrop R, McVean G. 2014. Nonhuman genetics. Strong male bias drives germline mutation in chimpanzees. Science, 344 (6189), pp. 1272-1275.

Germline mutation determines rates of molecular evolution, genetic diversity, and fitness load. In humans, the average point mutation rate is 1.2 × 10(-8) per base pair per generation, with every additional year of father's age contributing two mutations across the genome and males contributing three to four times as many mutations as females. To assess whether such patterns are shared with our closest living relatives, we sequenced the genomes of a nine-member pedigree of Western chimpanzees, Pan troglodytes verus. Our results indicate a mutation rate of 1.2 × 10(-8) per base pair per generation, but a male contribution seven to eight times that of females and a paternal age effect of three mutations per year of father's age. Thus, mutation rates and patterns differ between closely related species.

Martin HC, Kim GE, Pagnamenta AT, Murakami Y, Carvill GL, Meyer E, Copley RR, Rimmer A, Barcia G, Fleming MR et al. 2014. Clinical whole-genome sequencing in severe early-onset epilepsy reveals new genes and improves molecular diagnosis. Hum Mol Genet, 23 (12), pp. 3200-3211.

In severe early-onset epilepsy, precise clinical and molecular genetic diagnosis is complex, as many metabolic and electro-physiological processes have been implicated in disease causation. The clinical phenotypes share many features such as complex seizure types and developmental delay. Molecular diagnosis has historically been confined to sequential testing of candidate genes known to be associated with specific sub-phenotypes, but the diagnostic yield of this approach can be low. We conducted whole-genome sequencing (WGS) on six patients with severe early-onset epilepsy who had previously been refractory to molecular diagnosis, and their parents. Four of these patients had a clinical diagnosis of Ohtahara Syndrome (OS) and two patients had severe non-syndromic early-onset epilepsy (NSEOE). In two OS cases, we found de novo non-synonymous mutations in the genes KCNQ2 and SCN2A. In a third OS case, WGS revealed paternal isodisomy for chromosome 9, leading to identification of the causal homozygous missense variant in KCNT1, which produced a substantial increase in potassium channel current. The fourth OS patient had a recessive mutation in PIGQ that led to exon skipping and defective glycophosphatidyl inositol biosynthesis. The two patients with NSEOE had likely pathogenic de novo mutations in CBL and CSNK1G1, respectively. Mutations in these genes were not found among 500 additional individuals with epilepsy. This work reveals two novel genes for OS, KCNT1 and PIGQ. It also uncovers unexpected genetic mechanisms and emphasizes the power of WGS as a clinical tool for making molecular diagnoses, particularly for highly heterogeneous disorders.

Panoutsopoulou K, Hatzikotoulas K, Xifara DK, Colonna V, Farmaki AE, Ritchie GR, Southam L, Gilly A, Tachmazidou I, Fatumo S et al. 2014. Genetic characterization of Greek population isolates reveals strong genetic drift at missense and trait-associated variants. Nat Commun, 5, pp. 5345.

Isolated populations are emerging as a powerful study design in the search for low-frequency and rare variant associations with complex phenotypes. Here we genotype 2,296 samples from two isolated Greek populations, the Pomak villages (HELIC-Pomak) in the North of Greece and the Mylopotamos villages (HELIC-MANOLIS) in Crete. We compare their genomic characteristics to the general Greek population and establish them as genetic isolates. In the MANOLIS cohort, we observe an enrichment of missense variants among the variants that have drifted up in frequency by more than fivefold. In the Pomak cohort, we find novel associations at variants on chr11p15.4 showing large allele frequency increases (from 0.2% in the general Greek population to 4.6% in the isolate) with haematological traits, for example, with mean corpuscular volume (rs7116019, P=2.3 × 10(-26)). We replicate this association in a second set of Pomak samples (combined P=2.0 × 10(-36)). We demonstrate significant power gains in detecting medical trait associations.

Tachmazidou I, Dedoussis G, Southam L, Farmaki AE, Ritchie GR, Xifara DK, Matchan A, Hatzikotoulas K, Rayner NW, Chen Y et al. 2013. A rare functional cardioprotective APOC3 variant has risen in frequency in distinct population isolates. Nat Commun, 4, pp. 2872.

Isolated populations can empower the identification of rare variation associated with complex traits through next generation association studies, but the generalizability of such findings remains unknown. Here we genotype 1,267 individuals from a Greek population isolate on the Illumina HumanExome Beadchip, in search of functional coding variants associated with lipid traits. We find genome-wide significant evidence for association of R19X, a functional variant in APOC3, with increased high-density lipoprotein and decreased triglyceride levels. Approximately 3.8% of individuals are heterozygous for this cardioprotective variant, which was previously thought to be private to the Amish founder population. R19X is rare (<0.05% frequency) in outbred European populations. The increased frequency of R19X enables discovery of this lipid trait signal at genome-wide significance in a small sample size. This work exemplifies the value of isolated populations in successfully detecting transferable rare variant associations of high medical relevance.

Giannoulatou E, McVean G, Taylor IB, McGowan SJ, Maher GJ, Iqbal Z, Pfeifer SP, Turner I, Burkitt Wright EM, Shorto J et al. 2013. Contributions of intrinsic mutation rate and selfish selection to levels of de novo HRAS mutations in the paternal germline. Proc Natl Acad Sci U S A, 110 (50), pp. 20152-20157.

The RAS proto-oncogene Harvey rat sarcoma viral oncogene homolog (HRAS) encodes a small GTPase that transduces signals from cell surface receptors to intracellular effectors to control cellular behavior. Although somatic HRAS mutations have been described in many cancers, germline mutations cause Costello syndrome (CS), a congenital disorder associated with predisposition to malignancy. Based on the epidemiology of CS and the occurrence of HRAS mutations in spermatocytic seminoma, we proposed that activating HRAS mutations become enriched in sperm through a process akin to tumorigenesis, termed selfish spermatogonial selection. To test this hypothesis, we quantified the levels, in blood and sperm samples, of HRAS mutations at the p.G12 codon and compared the results to changes at the p.A11 codon, at which activating mutations do not occur. The data strongly support the role of selection in determining HRAS mutation levels in sperm, and hence the occurrence of CS, but we also found differences from the mutation pattern in tumorigenesis. First, the relative prevalence of mutations in sperm correlates weakly with their in vitro activating properties and occurrence in cancers. Second, specific tandem base substitutions (predominantly GC>TT/AA) occur in sperm but not in cancers; genomewide analysis showed that this same mutation is also overrepresented in constitutional pathogenic and polymorphic variants, suggesting a heightened vulnerability to these mutations in the germline. We developed a statistical model to show how both intrinsic mutation rate and selfish selection contribute to the mutational burden borne by the paternal germline.

International Multiple Sclerosis Genetics Consortium (IMSGC), Beecham AH, Patsopoulos NA, Xifara DK, Davis MF, Kemppinen A, Cotsapas C, Shah TS, Spencer C, Booth D et al. 2013. Analysis of immune-related loci identifies 48 new susceptibility variants for multiple sclerosis. Nat Genet, 45 (11), pp. 1353-1360.

Using the ImmunoChip custom genotyping array, we analyzed 14,498 subjects with multiple sclerosis and 24,091 healthy controls for 161,311 autosomal variants and identified 135 potentially associated regions (P < 1.0 × 10(-4)). In a replication phase, we combined these data with previous genome-wide association study (GWAS) data from an independent 14,802 subjects with multiple sclerosis and 26,703 healthy controls. In these 80,094 individuals of European ancestry, we identified 48 new susceptibility variants (P < 5.0 × 10(-8)), 3 of which we found after conditioning on previously identified variants. Thus, there are now 110 established multiple sclerosis risk variants at 103 discrete loci outside of the major histocompatibility complex. With high-resolution Bayesian fine mapping, we identified five regions where one variant accounted for more than 50% of the posterior probability of association. This study enhances the catalog of multiple sclerosis risk variants and illustrates the value of fine mapping in the resolution of GWAS signals.

Choi K, Zhao X, Kelly KA, Venn O, Higgins JD, Yelina NE, Hardcastle TJ, Ziolkowski PA, Copenhaver GP, Franklin FC et al. 2013. Arabidopsis meiotic crossover hot spots overlap with H2A.Z nucleosomes at gene promoters. Nat Genet, 45 (11), pp. 1327-1336.

PRDM9 directs human meiotic crossover hot spots to intergenic sequence motifs, whereas budding yeast hot spots overlap regions of low nucleosome density (LND) in gene promoters. To investigate hot spots in plants, which lack PRDM9, we used coalescent analysis of genetic variation in Arabidopsis thaliana. Crossovers increased toward gene promoters and terminators, and hot spots were associated with active chromatin modifications, including H2A.Z, histone H3 Lys4 trimethylation (H3K4me3), LND and low DNA methylation. Hot spot-enriched A-rich and CTT-repeat DNA motifs occurred upstream and downstream, respectively, of transcriptional start sites. Crossovers were asymmetric around promoters and were most frequent over CTT-repeat motifs and H2A.Z nucleosomes. Pollen typing, segregation and cytogenetic analysis showed decreased numbers of crossovers in the arp6 H2A.Z deposition mutant at multiple scales. During meiosis, H2A.Z forms overlapping chromosomal foci with the DMC1 and RAD51 recombinases. As arp6 reduced the number of DMC1 or RAD51 foci, H2A.Z may promote the formation or processing of meiotic DNA double-strand breaks. We propose that gene chromatin ancestrally designates hot spots within eukaryotes and PRDM9 is a derived state within vertebrates.

Babbs C, Roberts NA, Sanchez-Pulido L, McGowan SJ, Ahmed MR, Brown JM, Sabry MA, WGS500 Consortium, Bentley DR, McVean GA et al. 2013. Homozygous mutations in a predicted endonuclease are a novel cause of congenital dyserythropoietic anemia type I. Haematologica, 98 (9), pp. 1383-1387.

The congenital dyserythropoietic anemias are a heterogeneous group of rare disorders primarily affecting erythropoiesis with characteristic morphological abnormalities and a block in erythroid maturation. Mutations in the CDAN1 gene, which encodes Codanin-1, underlie the majority of congenital dyserythropoietic anemia type I cases. However, no likely pathogenic CDAN1 mutation has been detected in approximately 20% of cases, suggesting the presence of at least one other locus. We used whole genome sequencing and segregation analysis to identify a homozygous T to A transversion (c.533T>A), predicted to lead to a p.L178Q missense substitution in C15ORF41, a gene of unknown function, in a consanguineous pedigree of Middle-Eastern origin. Sequencing C15ORF41 in other CDAN1 mutation-negative congenital dyserythropoietic anemia type I pedigrees identified a homozygous transition (c.281A>G), predicted to lead to a p.Y94C substitution, in two further pedigrees of SouthEast Asian origin. The haplotype surrounding the c.281A>G change suggests a founder effect for this mutation in Pakistan. Detailed sequence similarity searches indicate that C15ORF41 encodes a novel restriction endonuclease that is a member of the Holliday junction resolvase family of proteins.

Palmer D, Frater J, Phillips R, McLean AR, McVean G. 2013. Integrating genealogical and dynamical modelling to infer escape and reversion rates in HIV epitopes. Proc Biol Sci, 280 (1762), pp. 20130696.

The rates of escape and reversion in response to selection pressure arising from the host immune system, notably the cytotoxic T-lymphocyte (CTL) response, are key factors determining the evolution of HIV. Existing methods for estimating these parameters from cross-sectional population data using ordinary differential equations (ODEs) ignore information about the genealogy of sampled HIV sequences, which has the potential to cause systematic bias and overestimate certainty. Here, we describe an integrated approach, validated through extensive simulations, which combines genealogical inference and epidemiological modelling, to estimate rates of CTL escape and reversion in HIV epitopes. We show that there is substantial uncertainty about rates of viral escape and reversion from cross-sectional data, which arises from the inherent stochasticity in the evolutionary process. By application to empirical data, we find that point estimates of rates from a previously published ODE model and the integrated approach presented here are often similar, but can also differ several-fold depending on the structure of the genealogy. The model-based approach we apply provides a framework for the statistical analysis and hypothesis testing of escape and reversion in population data and highlights the need for longitudinal and denser cross-sectional sampling to enable accurate estimation of these key parameters.

Mathieson I, McVean G. 2013. Reply to: "FaST-LMM-Select for addressing confounding from spatial structure and rare variants". Nat Genet, 45 (5), pp. 471.

Miotto O, Almagro-Garcia J, Manske M, Macinnis B, Campino S, Rockett KA, Amaratunga C, Lim P, Suon S, Sreng S et al. 2013. Multiple populations of artemisinin-resistant Plasmodium falciparum in Cambodia. Nat Genet, 45 (6), pp. 648-655.

We describe an analysis of genome variation in 825 P. falciparum samples from Asia and Africa that identifies an unusual pattern of parasite population structure at the epicenter of artemisinin resistance in western Cambodia. Within this relatively small geographic area, we have discovered several distinct but apparently sympatric parasite subpopulations with extremely high levels of genetic differentiation. Of particular interest are three subpopulations, all associated with clinical resistance to artemisinin, which have skewed allele frequency spectra and high levels of haplotype homozygosity, indicative of founder effects and recent population expansion. We provide a catalog of SNPs that show high levels of differentiation in the artemisinin-resistant subpopulations, including codon variants in transporter proteins and DNA mismatch repair proteins. These data provide a population-level genetic framework for investigating the biological origins of artemisinin resistance and for defining molecular markers to assist in its elimination.

Montgomery SB, Goode DL, Kvikstad E, Albers CA, Zhang ZD, Mu XJ, Ananda G, Howie B, Karczewski KJ, Smith KS et al. 2013. The origin, evolution, and functional impact of short insertion-deletion variants identified in 179 human genomes. Genome Res, 23 (5), pp. 749-761.

Short insertions and deletions (indels) are the second most abundant form of human genetic variation, but our understanding of their origins and functional effects lags behind that of other types of variants. Using population-scale sequencing, we have identified a high-quality set of 1.6 million indels from 179 individuals representing three diverse human populations. We show that rates of indel mutagenesis are highly heterogeneous, with 43%-48% of indels occurring in 4.03% of the genome, whereas in the remaining 96% their prevalence is 16 times lower than SNPs. Polymerase slippage can explain upwards of three-fourths of all indels, with the remainder being mostly simple deletions in complex sequence. However, insertions do occur and are significantly associated with pseudo-palindromic sequence features compatible with the fork stalling and template switching (FoSTeS) mechanism more commonly associated with large structural variations. We introduce a quantitative model of polymerase slippage, which enables us to identify indel-hypermutagenic protein-coding genes, some of which are associated with recurrent mutations leading to disease. Accounting for mutational rate heterogeneity due to sequence context, we find that indels across functional sequence are generally subject to stronger purifying selection than SNPs. We find that indel length modulates selection strength, and that indels affecting multiple functionally constrained nucleotides undergo stronger purifying selection. We further find that indels are enriched in associations with gene expression and find evidence for a contribution of nonsense-mediated decay. Finally, we show that indels can be integrated in existing genome-wide association studies (GWAS); although we do not find direct evidence that potentially causal protein-coding indels are enriched with associations to known disease-associated SNPs, our findings suggest that the causal variant underlying some of these associations may be indels.

Mathieson I, McVean G. 2013. Estimating selection coefficients in spatially structured populations from time series data of allele frequencies. Genetics, 193 (3), pp. 973-984.

Inferring the nature and magnitude of selection is an important problem in many biological contexts. Typically when estimating a selection coefficient for an allele, it is assumed that samples are drawn from a panmictic population and that selection acts uniformly across the population. However, these assumptions are rarely satisfied. Natural populations are almost always structured, and selective pressures are likely to act differentially. Inference about selection ought therefore to take account of structure. We do this by considering evolution in a simple lattice model of spatial population structure. We develop a hidden Markov model based maximum-likelihood approach for estimating the selection coefficient in a single population from time series data of allele frequencies. We then develop an approximate extension of this to the structured case to provide a joint estimate of migration rate and spatially varying selection coefficients. We illustrate our method using classical data sets of moth pigmentation morph frequencies, but it has wide applications in settings ranging from ecology to human evolution.
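To make the estimation problem in the abstract above concrete, here is a deliberately simplified sketch. It is not the hidden Markov model approach of the paper: it swaps in a deterministic haploid selection model for a single panmictic population, in which the odds p/(1-p) of the allele are multiplied by (1+s) each generation, so a least-squares line fitted to the logit of the frequencies has slope ln(1+s). It ignores genetic drift, sampling noise and spatial structure, and all function names are assumptions for illustration.

```python
import math
from typing import List, Sequence


def simulate_frequencies(p0: float, s: float, generations: int) -> List[float]:
    """Deterministic haploid selection: p' = p(1+s) / (1 + s*p)."""
    freqs = [p0]
    for _ in range(generations):
        p = freqs[-1]
        freqs.append(p * (1 + s) / (1 + s * p))
    return freqs


def estimate_selection_coefficient(times: Sequence[float], freqs: Sequence[float]) -> float:
    """Least-squares fit of logit(frequency) against time.

    Frequencies must lie strictly between 0 and 1. The fitted slope is
    ln(1+s) under the deterministic model, so s = exp(slope) - 1.
    """
    logits = [math.log(p / (1 - p)) for p in freqs]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(logits) / n
    cov = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, logits))
    var = sum((t - mean_t) ** 2 for t in times)
    slope = cov / var
    return math.exp(slope) - 1


# Hypothetical example: an allele starting at 5% with s = 0.1, followed for 30 generations.
true_s = 0.1
freqs = simulate_frequencies(0.05, true_s, 30)
s_hat = estimate_selection_coefficient(list(range(len(freqs))), freqs)
print(round(s_hat, 6))
```

On noise-free simulated data the fit recovers the simulated coefficient; with real time series, finite samples and drift add uncertainty that this sketch does not model, which is part of the motivation for a likelihood-based treatment.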

Newey PJ, Nesbit MA, Rimmer AJ, Head RA, Gorvin CM, Attar M, Gregory L, Wass JA, Buck D, Karavitaki N et al. 2013. Whole-exome sequencing studies of nonfunctioning pituitary adenomas. J Clin Endocrinol Metab, 98 (4), pp. E796-E800.

CONTEXT: The tumorigenic role of genetic abnormalities in sporadic pituitary nonfunctioning adenomas (NFAs), which usually originate from gonadotroph cells, is unknown. OBJECTIVE: The objective of the study was to identify somatic genetic abnormalities in sporadic pituitary NFAs. DESIGN: Whole-exome sequencing was performed using DNA from 7 pituitary NFAs and leukocyte samples obtained from the same patients. Somatic variants were confirmed by dideoxynucleotide sequencing, and candidate driver genes were assessed in an additional 24 pituitary NFAs. RESULTS: Whole-exome sequencing achieved a high degree of coverage such that approximately 97% of targeted bases were represented by more than 10 base reads; 24 somatic variants were identified and confirmed in the discovery set of 7 pituitary NFAs (mean 3.5 variants/tumor; range 1-7). Approximately 80% of variants occurred as missense single nucleotide variants and the remainder were synonymous changes or small frameshift deletions. Each of the 24 mutations occurred in independent genes with no recurrent mutations. Mutations were not observed in genes previously associated with pituitary tumorigenesis, although somatic variants in putative driver genes including platelet-derived growth factor D (PDGFD), N-myc down-regulated gene family member 4 (NDRG4), and Zipper sterile-α-motif kinase (ZAK) were identified; however, DNA sequence analysis of these in the validation set of 24 pituitary NFAs did not reveal any mutations indicating that these genes are unlikely to contribute significantly in the etiology of sporadic pituitary NFAs. CONCLUSIONS: Pituitary NFAs harbor few somatic mutations consistent with their low proliferation rates and benign nature, but mechanisms other than somatic mutation are likely involved in the etiology of sporadic pituitary NFAs.

Dilthey A, Leslie S, Moutsianas L, Shen J, Cox C, Nelson MR, McVean G. 2013. Multi-population classical HLA type imputation. PLoS Comput Biol, 9 (2), pp. e1002877.

Statistical imputation of classical HLA alleles in case-control studies has become established as a valuable tool for identifying and fine-mapping signals of disease association in the MHC. Imputation into diverse populations has, however, remained challenging, mainly because of the additional haplotypic heterogeneity introduced by combining reference panels of different sources. We present an HLA type imputation model, HLA*IMP:02, designed to operate on a multi-population reference panel. HLA*IMP:02 is based on a graphical representation of haplotype structure. We present a probabilistic algorithm to build such models for the HLA region, accommodating genotyping error, haplotypic heterogeneity and the need for maximum accuracy at the HLA loci, generalizing the work of Browning and Browning (2007) and Ron et al. (1998). HLA*IMP:02 achieves an average 4-digit imputation accuracy on diverse European panels of 97% (call rate 97%). On non-European samples, 2-digit performance is over 90% for most loci and ethnicities where data are available. HLA*IMP:02 supports imputation of HLA-DPB1 and HLA-DRB3-5, is highly tolerant of missing data in the imputation panel and works on standard genotype data from popular genotyping chips. It is publicly available in source code and as a user-friendly web service framework.

Zilversmit MM, Chase EK, Chen DS, Awadalla P, Day KP, McVean G. 2013. Hypervariable antigen genes in malaria have ancient roots. BMC Evol Biol, 13 (1), pp. 110.

BACKGROUND: The var genes of the human malaria parasite Plasmodium falciparum are highly polymorphic loci coding for the erythrocyte membrane proteins 1 (PfEMP1), which are responsible for the cytoadherence of P. falciparum infected red blood cells to the human vasculature. Cytoadhesion, coupled with differential expression of var genes, contributes to virulence and allows the parasite to establish chronic infections by evading detection from the host's immune system. Although studying genetic diversity is a major focus of recent work on the var genes, little is known about the gene family's origin and evolutionary history. RESULTS: Using a novel hidden Markov model-based approach and var sequences assembled from additional isolates and species, we are able to reveal elements of both the early evolution of the var genes as well as recent diversifying events. We compare sequences of the var gene DBLα domains from divergent isolates of P. falciparum (3D7 and HB3), and a closely-related species, Plasmodium reichenowi. We find that the gene family is equally large in P. reichenowi and P. falciparum -- with a minimum of 51 var genes in the P. reichenowi genome (compared to 61 in 3D7 and a minimum of 48 in HB3). In addition, we are able to define large, continuous blocks of homologous sequence among P. falciparum and P. reichenowi var gene DBLα domains. These results reveal that the contemporary structure of the var gene family was present before the divergence of P. falciparum and P. reichenowi, estimated to be between 2.5 and 6 million years ago. We also reveal that recombination has played an important and traceable role in both the establishment, and the maintenance, of diversity in the sequences. CONCLUSIONS: Despite the remarkable diversity and rapid evolution found in these loci within and among P. falciparum populations, the basic structure of these domains and the gene family is surprisingly old and stable. Revealing a common structure as well as conserved sequence between two species also has implications for developing new primate-parasite models for studying the pathology and immunology of falciparum malaria, and for studying the population genetics of var genes and associated virulence phenotypes.

Leffler EM, Gao Z, Pfeifer S, Ségurel L, Auton A, Venn O, Bowden R, Bontrop R, Wall JD, Sella G et al. 2013. Multiple instances of ancient balancing selection shared between humans and chimpanzees. Science, 339 (6127), pp. 1578-1582.

Instances in which natural selection maintains genetic variation in a population over millions of years are thought to be extremely rare. We conducted a genome-wide scan for long-lived balancing selection by looking for combinations of SNPs shared between humans and chimpanzees. In addition to the major histocompatibility complex, we identified 125 regions in which the same haplotypes are segregating in the two species, all but two of which are noncoding. In six cases, there is evidence for an ancestral polymorphism that persisted to the present in humans and chimpanzees. Regions with shared haplotypes are significantly enriched for membrane glycoproteins, and a similar trend is seen among shared coding polymorphisms. These findings indicate that ancient balancing selection has shaped human variation and point to genes involved in host-pathogen interactions as common targets.

Nesbit MA, Hannan FM, Howles SA, Reed AAC, Cranston T, Thakker CE, Gregory L, Rimmer AJ, Rust N, Graham U et al. 2013. Mutations in AP2S1 cause familial hypocalciuric hypercalcemia type 3. Nat Genet, 45 (1), pp. 93-97.

Adaptor protein-2 (AP2), a central component of clathrin-coated vesicles (CCVs), is pivotal in clathrin-mediated endocytosis, which internalizes plasma membrane constituents such as G protein-coupled receptors (GPCRs). AP2, a heterotetramer of α, β, μ and σ subunits, links clathrin to vesicle membranes and binds to tyrosine- and dileucine-based motifs of membrane-associated cargo proteins. Here we show that missense mutations of AP2 σ subunit (AP2S1) affecting Arg15, which forms key contacts with dileucine-based motifs of CCV cargo proteins, result in familial hypocalciuric hypercalcemia type 3 (FHH3), an extracellular calcium homeostasis disorder affecting the parathyroids, kidneys and bone. We found AP2S1 mutations in >20% of cases of FHH without mutations in calcium-sensing GPCR (CASR), which cause FHH1. AP2S1 mutations decreased the sensitivity of CaSR-expressing cells to extracellular calcium and reduced CaSR endocytosis, probably through loss of interaction with a C-terminal CaSR dileucine-based motif, whose disruption also decreased intracellular signaling. Thus, our results identify a new role for AP2 in extracellular calcium homeostasis.

Palles C, Cazier JB, Howarth KM, Domingo E, Jones AM, Broderick P, Kemp Z, Spain SL, Guarino E, Salguero I et al. 2013. Germline mutations affecting the proofreading domains of POLE and POLD1 predispose to colorectal adenomas and carcinomas. Nat Genet, 45 (2), pp. 136-144.

Many individuals with multiple or large colorectal adenomas or early-onset colorectal cancer (CRC) have no detectable germline mutations in the known cancer predisposition genes. Using whole-genome sequencing, supplemented by linkage and association analysis, we identified specific heterozygous POLE or POLD1 germline variants in several multiple-adenoma and/or CRC cases but in no controls. The variants associated with susceptibility, POLE p.Leu424Val and POLD1 p.Ser478Asn, have high penetrance, and POLD1 mutation was also associated with endometrial cancer predisposition. The mutations map to equivalent sites in the proofreading (exonuclease) domain of DNA polymerases ɛ and δ and are predicted to cause a defect in the correction of mispaired bases inserted during DNA replication. In agreement with this prediction, the tumors from mutation carriers were microsatellite stable but tended to acquire base substitution mutations, as confirmed by yeast functional assays. Further analysis of published data showed that the recently described group of hypermutant, microsatellite-stable CRCs is likely to be caused by somatic POLE mutations affecting the exonuclease domain.

Fugger L, McVean G, Bell JI. 2012. Genomewide association studies and common disease--realizing clinical utility. N Engl J Med, 367 (25), pp. 2370-2371.

Lise S, Clarkson Y, Perkins E, Kwasniewska A, Sadighi Akha E, Schnekenberg RP, Suminaite D, Hope J, Baker I, Gregory L et al. 2012. Recessive mutations in SPTBN2 implicate β-III spectrin in both cognitive and motor development. PLoS Genet, 8 (12), pp. e1003074.

β-III spectrin is present in the brain and is known to be important in the function of the cerebellum. Heterozygous mutations in SPTBN2, the gene encoding β-III spectrin, cause Spinocerebellar Ataxia Type 5 (SCA5), an adult-onset, slowly progressive, autosomal-dominant pure cerebellar ataxia. SCA5 is sometimes known as "Lincoln ataxia," because the largest known family is descended from relatives of the United States President Abraham Lincoln. Using targeted capture and next-generation sequencing, we identified a homozygous stop codon in SPTBN2 in a consanguineous family in which childhood developmental ataxia co-segregates with cognitive impairment. The cognitive impairment could result from mutations in a second gene, but further analysis using whole-genome sequencing combined with SNP array analysis did not reveal any evidence of other mutations. We also examined a mouse knockout of β-III spectrin in which ataxia and progressive degeneration of cerebellar Purkinje cells has been previously reported and found morphological abnormalities in neurons from prefrontal cortex and deficits in object recognition tasks, consistent with the human cognitive phenotype. These data provide the first evidence that β-III spectrin plays an important role in cortical brain development and cognition, in addition to its function in the cerebellum; and we conclude that cognitive impairment is an integral part of this novel recessive ataxic syndrome, Spectrin-associated Autosomal Recessive Cerebellar Ataxia type 1 (SPARCA1). In addition, the identification of SPARCA1 and normal heterozygous carriers of the stop codon in SPTBN2 provides insights into the mechanism of molecular dominance in SCA5 and demonstrates that the cell-specific repertoire of spectrin subunits underlies a novel group of disorders, the neuronal spectrinopathies, which includes SCA5, SPARCA1, and a form of West syndrome.

Iqbal Z, Turner I, McVean G. 2013. High-throughput microbial population genomics using the Cortex variation assembler. Bioinformatics, 29 (2), pp. 275-276.

SUMMARY: We have developed a software package, Cortex, designed for the analysis of genetic variation by de novo assembly of multiple samples. This allows direct comparison of samples without using a reference genome as intermediate and incorporates discovery and genotyping of single-nucleotide polymorphisms, indels and larger events in a single framework. We introduce pipelines which simplify the analysis of microbial samples and increase discovery power; these also enable the construction of a graph of known sequence and variation in a species, against which new samples can be compared rapidly. We demonstrate the ease-of-use and power by reproducing the results of studies using both long and short reads. AVAILABILITY: http://cortexassembler.sourceforge.net (GPLv3 license). CONTACT: zam@well.ox.ac.uk, mcvean@well.ox.ac.uk SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Wellcome Trust Case Control Consortium, Maller JB, McVean G, Byrnes J, Vukcevic D, Palin K, Su Z, Howson JM, Auton A, Myers S et al. 2012. Bayesian refinement of association signals for 14 loci in 3 common diseases. Nat Genet, 44 (12), pp. 1294-1301.

To further investigate susceptibility loci identified by genome-wide association studies, we genotyped 5,500 SNPs across 14 associated regions in 8,000 samples from a control group and 3 diseases: type 2 diabetes (T2D), coronary artery disease (CAD) and Graves' disease. We defined, using Bayes theorem, credible sets of SNPs that were 95% likely, based on posterior probability, to contain the causal disease-associated SNPs. In 3 of the 14 regions, TCF7L2 (T2D), CTLA4 (Graves' disease) and CDKN2A-CDKN2B (T2D), much of the posterior probability rested on a single SNP, and, in 4 other regions (CDKN2A-CDKN2B (CAD) and CDKAL1, FTO and HHEX (T2D)), the 95% sets were small, thereby excluding most SNPs as potentially causal. Very few SNPs in our credible sets had annotated functions, illustrating the limitations in understanding the mechanisms underlying susceptibility to common diseases. Our results also show the value of more detailed mapping to target sequences for functional studies.
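
As a rough illustration of the credible-set idea described in this abstract, the following Python sketch ranks SNPs by posterior probability and accumulates them until the requested level is reached. It assumes a single causal SNP per region and equal prior weight on every SNP, so that posteriors are proportional to Bayes factors. The function name, SNP labels and Bayes factor values are hypothetical and are not taken from the paper.

```python
def credible_set(bayes_factors, level=0.95):
    """Return the smallest list of SNPs whose posterior mass reaches `level`.

    Assumption (not from the paper): exactly one causal SNP per region and a
    flat prior, so each posterior is the SNP's Bayes factor divided by the sum.
    """
    total = sum(bayes_factors.values())
    # Rank SNPs from most to least probable.
    ranked = sorted(bayes_factors.items(), key=lambda kv: kv[1], reverse=True)
    chosen, cumulative = [], 0.0
    for snp, bf in ranked:
        chosen.append(snp)
        cumulative += bf / total
        if cumulative >= level:
            break
    return chosen


# Hypothetical Bayes factors for four SNPs in one region.
example = {"snpA": 90.0, "snpB": 6.0, "snpC": 3.0, "snpD": 1.0}
print(credible_set(example))
```

A small credible set, as in this toy region, would exclude most SNPs as candidate causal variants; a flat set of Bayes factors would instead return nearly every SNP.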

Strange A, Riley BP, Spencer CCA, Morris DW, Pirinen M, O'Dushlaine CT, Su Z, Maher BS, Freeman C, Cormican P et al. 2012. Genome-Wide Association Study Implicates HLA-C*01:02 as a Risk Factor at the Major Histocompatibility Complex Locus in Schizophrenia. Biol Psychiatry, 72 (8), pp. 620-628.

Newey PJ, Nesbit MA, Rimmer AJ, Attar M, Head RT, Christie PT, Gorvin CM, Stechman M, Gregory L, Mihai R et al. 2012. Whole-exome sequencing studies of nonhereditary (sporadic) parathyroid adenomas. J Clin Endocrinol Metab, 97 (10), pp. E1995-E2005.

CONTEXT: Genetic abnormalities, such as those of multiple endocrine neoplasia type 1 (MEN1) and Cyclin D1 (CCND1) genes, occur in <50% of nonhereditary (sporadic) parathyroid adenomas. OBJECTIVE: To identify genetic abnormalities in nonhereditary parathyroid adenomas by whole-exome sequence analysis. DESIGN: Whole-exome sequence analysis was performed on parathyroid adenomas and leukocyte DNA samples from 16 postmenopausal women without a family history of parathyroid tumors or MEN1 and in whom primary hyperparathyroidism due to single-gland disease was cured by surgery. Somatic variants confirmed in this discovery set were assessed in 24 other parathyroid adenomas. RESULTS: Over 90% of targeted exons were captured and represented by more than 10 base reads. Analysis identified 212 somatic variants (median eight per tumor; range, 2-110), with the majority being heterozygous nonsynonymous single-nucleotide variants that predicted missense amino acid substitutions. Somatic MEN1 mutations occurred in six of 16 (∼35%) parathyroid adenomas, in association with loss of heterozygosity on chromosome 11. However, no other gene was mutated in more than one tumor. Mutations in several genes that may represent low-frequency driver mutations were identified, including a protection of telomeres 1 (POT1) mutation that resulted in exon skipping and disruption to the single-stranded DNA-binding domain, which may contribute to increased genomic instability and the observed high mutation rate in one tumor. CONCLUSIONS: Parathyroid adenomas typically harbor few somatic variants, consistent with their low proliferation rates. MEN1 mutation represents the major driver in sporadic parathyroid tumorigenesis although multiple low-frequency driver mutations likely account for tumors not harboring somatic MEN1 mutations.

Gregory AP, Dendrou CA, Attfield KE, Haghikia A, Xifara DK, Butter F, Poschmann G, Kaur G, Lambert L, Leach OA et al. 2012. TNF receptor 1 genetic risk mirrors outcome of anti-TNF therapy in multiple sclerosis. Nature, 488 (7412), pp. 508-511.

Although there has been much success in identifying genetic variants associated with common diseases using genome-wide association studies (GWAS), it has been difficult to demonstrate which variants are causal and what role they have in disease. Moreover, the modest contribution that these variants make to disease risk has raised questions regarding their medical relevance. Here we have investigated a single nucleotide polymorphism (SNP) in the TNFRSF1A gene, that encodes tumour necrosis factor receptor 1 (TNFR1), which was discovered through GWAS to be associated with multiple sclerosis (MS), but not with other autoimmune conditions such as rheumatoid arthritis, psoriasis and Crohn’s disease. By analysing MS GWAS data in conjunction with the 1000 Genomes Project data we provide genetic evidence that strongly implicates this SNP, rs1800693, as the causal variant in the TNFRSF1A region. We further substantiate this through functional studies showing that the MS risk allele directs expression of a novel, soluble form of TNFR1 that can block TNF. Importantly, TNF-blocking drugs can promote onset or exacerbation of MS, but they have proven highly efficacious in the treatment of autoimmune diseases for which there is no association with rs1800693. This indicates that the clinical experience with these drugs parallels the disease association of rs1800693, and that the MS-associated TNFR1 variant mimics the effect of TNF-blocking drugs. Hence, our study demonstrates that clinical practice can be informed by comparing GWAS across common autoimmune diseases and by investigating the functional consequences of the disease-associated genetic variation.

Mathieson I, McVean G. 2012. Differential confounding of rare and common variants in spatially structured populations. Nat Genet, 44 (3), pp. 243-246.

Well-powered genome-wide association studies, now made possible through advances in technology and large-scale collaborative projects, promise to characterize the contribution of rare variants to complex traits and disease. However, while population structure is a known confounder of association studies, it remains unknown whether methods developed to control stratification are equally effective for rare variants. Here, we demonstrate that rare variants can show a stratification that is systematically different from, and typically stronger than, common variants, and this is not necessarily corrected by existing methods. We show that the same process leads to inflation for load-based tests and can obscure signals at truly associated variants. Furthermore, we show that populations can display spatial structure in rare variants, even when Wright's fixation index F_ST is low, but that allele frequency-dependent metrics of allele sharing can reveal localized stratification. These results underscore the importance of collecting and integrating spatial information in the genetic analysis of complex traits.

Chen H, Hayashi G, Lai OY, Dilthey A, Kuebler PJ, Wong TV, Martin MP, Fernandez Vina MA, McVean G, Wabl M et al. 2012. Psoriasis patients are enriched for genetic variants that protect against HIV-1 disease. PLoS Genet, 8 (2), pp. e1002514.

An important paradigm in evolutionary genetics is that of a delicate balance between genetic variants that favorably boost host control of infection but which may unfavorably increase susceptibility to autoimmune disease. Here, we investigated whether patients with psoriasis, a common immune-mediated disease of the skin, are enriched for genetic variants that limit the ability of HIV-1 virus to replicate after infection. We analyzed the HLA class I and class II alleles of 1,727 Caucasian psoriasis cases and 3,581 controls and found that psoriasis patients are significantly more likely than controls to have gene variants that are protective against HIV-1 disease. This includes several HLA class I alleles associated with HIV-1 control; amino acid residues at HLA-B positions 67, 70, and 97 that mediate HIV-1 peptide binding; and the deletion polymorphism rs67384697 associated with high surface expression of HLA-C. We also found that the compound genotype KIR3DS1 plus HLA-B Bw4-80I, which respectively encode a natural killer cell activating receptor and its putative ligand, significantly increased psoriasis susceptibility. This compound genotype has also been associated with delay of progression to AIDS. Together, our results suggest that genetic variants that contribute to anti-viral immunity may predispose to the development of psoriasis.

Iqbal Z, Caccamo M, Turner I, Flicek P, McVean G. 2012. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat Genet, 44 (2), pp. 226-232.

Detecting genetic variants that are highly divergent from a reference sequence remains a major challenge in genome sequencing. We introduce de novo assembly algorithms using colored de Bruijn graphs for detecting and genotyping simple and complex genetic variants in an individual or population. We provide an efficient software implementation, Cortex, the first de novo assembler capable of assembling multiple eukaryotic genomes simultaneously. Four applications of Cortex are presented. First, we detect and validate both simple and complex structural variations in a high-coverage human genome. Second, we identify more than 3 Mb of sequence absent from the human reference genome, in pooled low-coverage population sequence data from the 1000 Genomes Project. Third, we show how population information from ten chimpanzees enables accurate variant calls without a reference sequence. Last, we estimate classical human leukocyte antigen (HLA) genotypes at HLA-B, the most variable gene in the human genome.
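
To make the notion of a colored de Bruijn graph concrete, here is a minimal Python sketch in which every k-mer is a node and each node records the set of samples ("colors") that contain it. This is a toy illustration only, not the Cortex implementation: it ignores reverse complements, sequencing errors and coverage, and all function names and sequences are invented for the example.

```python
from collections import defaultdict


def build_colored_graph(samples, k):
    """Map each k-mer to the set of sample names (colors) in which it occurs."""
    colors = defaultdict(set)
    for name, sequences in samples.items():
        for seq in sequences:
            for i in range(len(seq) - k + 1):
                colors[seq[i:i + k]].add(name)
    return colors


def successors(colors, kmer, color):
    """k-mers carrying `color` that overlap `kmer` by its last k-1 bases."""
    return [kmer[1:] + base for base in "ACGT"
            if color in colors.get(kmer[1:] + base, ())]


def private_kmers(colors, color):
    """k-mers seen in exactly one sample: the branches of a variant 'bubble'."""
    return {kmer for kmer, seen in colors.items() if seen == {color}}


# Two hypothetical samples that differ by a single base (A vs T).
toy = {"ref": ["ACGTACGA"], "alt": ["ACGTTCGA"]}
graph = build_colored_graph(toy, k=3)
print(successors(graph, "CGT", "ref"), successors(graph, "CGT", "alt"))
print(sorted(private_kmers(graph, "alt")))
```

In this toy graph the two colors share a path, diverge after the node CGT, and rejoin at CGA; following each color separately through such a bubble is the basic intuition behind calling a variant without a reference.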

Auton A, Fledel-Alon A, Pfeifer S, Venn O, Ségurel L, Street T, Leffler EM, Bowden R, Aneas I, Broxholme J et al. 2012. A fine-scale chimpanzee genetic map from population sequencing. Science, 336 (6078), pp. 193-198.

To study the evolution of recombination rates in apes, we developed methodology to construct a fine-scale genetic map from high-throughput sequence data from 10 Western chimpanzees, Pan troglodytes verus. Compared to the human genetic map, broad-scale recombination rates tend to be conserved, but with exceptions, particularly in regions of chromosomal rearrangements and around the site of ancestral fusion in human chromosome 2. At fine scales, chimpanzee recombination is dominated by hotspots, which show no overlap with those of humans even though rates are similarly elevated around CpG islands and decreased within genes. The hotspot-specifying protein PRDM9 shows extensive variation among Western chimpanzees, and there is little evidence that any sequence motifs are enriched in hotspots. The contrasting locations of hotspots provide a natural experiment, which demonstrates the impact of recombination on base composition.

1000 Genomes Project Consortium, Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature, 491 (7422), pp. 56-65.

By characterizing the geographic and functional spectrum of human genetic variation, the 1000 Genomes Project aims to build a resource to help to understand the genetic contribution to disease. Here we describe the genomes of 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing. By developing methods to integrate information across several algorithms and diverse data sources, we provide a validated haplotype map of 38 million single nucleotide polymorphisms, 1.4 million short insertions and deletions, and more than 14,000 larger deletions. We show that individuals from different populations carry different profiles of rare and common variants, and that low-frequency variants show substantial geographic differentiation, which is further increased by the action of purifying selection. We show that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites. This resource, which captures up to 98% of accessible single nucleotide polymorphisms at a frequency of 1% in related populations, enables analysis of common and low-frequency variants in individuals from diverse, including admixed, populations.

Auton A, McVean G. 2012. Estimating recombination rates from genetic variation in humans. Methods Mol Biol, 856 pp. 217-237.

Recombination acts to shuffle the existing genetic variation within a population, leading to various approaches for detecting its action and estimating the rate at which it occurs. Here, we discuss the principal methodological and analytical approaches taken to understanding the distribution of recombination across the human genome. We first discuss the detection of recent crossover events in both well-characterised pedigrees and larger populations with extensive recent shared ancestry. We then describe approaches for learning about the fine-scale structure of recombination rate variation from patterns of genetic variation in unrelated individuals. Finally, we show how related approaches using individuals of admixed ancestry can provide an alternative approach to analysing recombination. Approaches differ not only in the statistical methods used, but also in the resolution of inference, the timescale over which recombination events are detected, and the extent to which inter-individual variation can be identified.

Reshef DN, Reshef YA, Finucane HK, Grossman SR, McVean G, Turnbaugh PJ, Lander ES, Mitzenmacher M, Sabeti PC. 2011. Detecting novel associations in large data sets. Science, 334 (6062), pp. 1518-1524.

Identifying interesting relationships between pairs of variables in large data sets is increasingly important. Here, we present a measure of dependence for two-variable relationships: the maximal information coefficient (MIC). MIC captures a wide range of associations both functional and not, and for functional relationships provides a score that roughly equals the coefficient of determination (R²) of the data relative to the regression function. MIC belongs to a larger class of maximal information-based nonparametric exploration (MINE) statistics for identifying and classifying relationships. We apply MIC and MINE to data sets in global health, gene expression, major-league baseball, and the human gut microbiota and identify known and novel relationships.

Didelot X, Bowden R, Street T, Golubchik T, Spencer C, McVean G, Sangal V, Anjum MF, Achtman M, Falush D, Donnelly P. 2011. Recombination and population structure in Salmonella enterica. PLoS Genet, 7 (7), pp. e1002191.

Salmonella enterica is a bacterial pathogen that causes enteric fever and gastroenteritis in humans and animals. Although its population structure was long described as clonal, based on high linkage disequilibrium between loci typed by enzyme electrophoresis, recent examination of gene sequences has revealed that recombination plays an important evolutionary role. We sequenced around 10% of the core genome of 114 isolates of enterica using a resequencing microarray. Application of two different analysis methods (Structure and ClonalFrame) to our genomic data allowed us to define five clear lineages within S. enterica subspecies enterica, one of which is five times older than the other four and two thirds of the age of the whole subspecies. We show that some of these lineages display more evidence of recombination than others. We also demonstrate that some level of sexual isolation exists between the lineages, so that recombination has occurred predominantly between members of the same lineage. This pattern of recombination is compatible with expectations from the previously described ecological structuring of the enterica population as well as mechanistic barriers to recombination observed in laboratory experiments. In spite of their relatively low level of genetic differentiation, these lineages might therefore represent incipient species.

Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST et al. 2011. The variant call format and VCFtools. Bioinformatics, 27 (15), pp. 2156-2158.

SUMMARY: The variant call format (VCF) is a generic format for storing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations. VCF is usually stored in a compressed manner and can be indexed for fast data retrieval of variants from a range of positions on the reference genome. The format was developed for the 1000 Genomes Project, and has also been adopted by other projects such as UK10K, dbSNP and the NHLBI Exome Project. VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging, comparing and also provides a general Perl API. AVAILABILITY: http://vcftools.sourceforge.net
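
For readers unfamiliar with the format, the sketch below splits one tab-delimited VCF data line into its eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) plus any per-sample genotype fields. It is a minimal plain-Python illustration, not part of VCFtools or its Perl API; the function name is hypothetical and the record is an illustrative example rather than real project data. It does no validation and leaves all INFO values as strings.

```python
def parse_vcf_line(line):
    """Parse a single VCF data line (not a '#' header line) into a dict."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, filt, info = fields[:8]

    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags.
    info_dict = {}
    for item in info.split(";"):
        if item == ".":
            continue
        key, sep, value = item.partition("=")
        info_dict[key] = value if sep else True

    record = {
        "chrom": chrom,
        "pos": int(pos),          # 1-based position on the reference
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),    # several ALT alleles are comma-separated
        "qual": qual,
        "filter": filt,
        "info": info_dict,
    }
    # Optional genotype block: a FORMAT column followed by one column per sample.
    if len(fields) > 9:
        keys = fields[8].split(":")
        record["samples"] = [dict(zip(keys, s.split(":"))) for s in fields[9:]]
    return record


# Illustrative record with two samples.
line = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB\tGT:GQ\t0|0:48\t1|0:48"
rec = parse_vcf_line(line)
print(rec["chrom"], rec["pos"], rec["ref"], rec["alt"], rec["samples"][1]["GT"])
```

For real data sets, a dedicated parser is preferable, since files are usually compressed and indexed and the header defines the type of each INFO and FORMAT field.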

Moutsianas L, Enciso-Mora V, Ma YP, Leslie S, Dilthey A, Broderick P, Sherborne A, Cooke R, Ashworth A, Swerdlow AJ et al. 2011. Multiple Hodgkin lymphoma-associated loci within the HLA region at chromosome 6p21.3. Blood, 118 (3), pp. 670-674.

Since an association between the human leukocyte antigen (HLA) region and Hodgkin lymphoma (HL) was first reported in 1967, many studies have reported associations between HL risk and both single nucleotide polymorphism (SNP) and classic HLA allele variation in the major histocompatibility complex. However, population stratification and the extent and complexity of linkage disequilibrium within the major histocompatibility complex have hindered efforts to fine-map causal signals. Using SNP data to impute alleles at classic HLA loci, we have conducted an integrated analysis of HL risk within the HLA region in 582 early-onset HL cases and 4736 controls. We confirm that the strongest signal of association comes from an SNP located in the class II region, rs6903608 (odds ratio [OR] = 1.79, P = 6.63 × 10⁻¹⁹), which is unlikely to be driven by association to HLA-DRB, DQA, or DQB alleles. In addition, we identify independent signals at rs2281389 (OR = 1.73, P = 6.31 × 10⁻¹³), a SNP that maps closely to HLA-DPB1, and the class II HLA allele DQA1*02:01 (OR = 0.56, P = 1.51 × 10⁻⁷). These data suggest that multiple independent loci within the HLA class II region contribute to the risk of developing early-onset HL.

Sainudiin R, Thornton K, Harlow J, Booth J, Stillman M, Yoshida R, Griffiths R, McVean G, Donnelly P. 2011. Experiments with the site frequency spectrum. Bull Math Biol, 73 (4), pp. 829-872.

Evaluating the likelihood function of parameters in highly-structured population genetic models from extant deoxyribonucleic acid (DNA) sequences is computationally prohibitive. In such cases, one may approximately infer the parameters from summary statistics of the data such as the site-frequency-spectrum (SFS) or its linear combinations. Such methods are known as approximate likelihood or Bayesian computations. Using a controlled lumped Markov chain and computational commutative algebraic methods, we compute the exact likelihood of the SFS and many classical linear combinations of it at a non-recombining locus that is neutrally evolving under the infinitely-many-sites mutation model. Using a partially ordered graph of coalescent experiments around the SFS, we provide a decision-theoretic framework for approximate sufficiency. We also extend a family of classical hypothesis tests of standard neutrality at a non-recombining locus based on the SFS to a more powerful version that conditions on the topological information provided by the SFS.
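To make the summary statistics named in this abstract concrete, the sketch below computes the unfolded site frequency spectrum from a small set of aligned haplotypes, together with two classical linear combinations of it (the number of segregating sites and the mean pairwise diversity). It assumes haplotypes are given as equal-length strings of 0/1 characters with 1 marking the derived allele. The function names are illustrative, and this is not the exact-likelihood method of the paper.

```python
from collections import Counter


def site_frequency_spectrum(haplotypes):
    """Return the unfolded SFS [xi_1, ..., xi_(n-1)] for n aligned 0/1 haplotypes.

    xi_i is the number of sites at which exactly i of the n haplotypes
    carry the derived allele ('1').
    """
    n = len(haplotypes)
    # zip(*haplotypes) walks the alignment column by column (site by site).
    derived_counts = Counter(
        sum(int(allele) for allele in site) for site in zip(*haplotypes)
    )
    return [derived_counts.get(i, 0) for i in range(1, n)]


def segregating_sites(sfs):
    """S, the number of polymorphic sites: the plain sum of the SFS."""
    return sum(sfs)


def pairwise_diversity(sfs):
    """Mean number of pairwise differences, a weighted sum of the SFS."""
    n = len(sfs) + 1
    n_pairs = n * (n - 1) / 2
    # A site with i derived copies separates i * (n - i) of the pairs.
    return sum(i * (n - i) * x for i, x in enumerate(sfs, start=1)) / n_pairs


# Four made-up haplotypes typed at five sites.
haps = ["10010", "10100", "01100", "00100"]
sfs = site_frequency_spectrum(haps)
```

For this toy sample the last site is monomorphic, so it contributes to no entry of the spectrum.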

Hernandez RD, Kelley JL, Elyashiv E, Melton SC, Auton A, McVean G, 1000 Genomes Project, Sella G, Przeworski M. 2011. Classic selective sweeps were rare in recent human evolution. Science, 331 (6019), pp. 920-924.

Efforts to identify the genetic basis of human adaptations from polymorphism data have sought footprints of "classic selective sweeps" (in which a beneficial mutation arises and rapidly fixes in the population). Yet it remains unknown whether this form of natural selection was common in our evolution. We examined the evidence for classic sweeps in resequencing data from 179 human genomes. As expected under a recurrent-sweep model, we found that diversity levels decrease near exons and conserved noncoding regions. In contrast to expectation, however, the trough in diversity around human-specific amino acid substitutions is no more pronounced than around synonymous substitutions. Moreover, relative to the genome background, amino acid and putative regulatory sites are not significantly enriched in alleles that are highly differentiated between populations. These findings indicate that classic sweeps were not a dominant mode of human adaptation over the past ~250,000 years.

Dilthey AT, Moutsianas L, Leslie S, McVean G. 2011. HLA*IMP--an integrated framework for imputing classical HLA alleles from SNP genotypes. Bioinformatics, 27 (7), pp. 968-972.

MOTIVATION: Genetic variation at classical HLA alleles influences many phenotypes, including susceptibility to autoimmune disease, resistance to pathogens and the risk of adverse drug reactions. However, classical HLA typing methods are often prohibitively expensive for large-scale studies. We previously described a method for imputing classical alleles from linked SNP genotype data. Here, we present a modification of the original algorithm implemented in a freely available software suite that combines local data preparation and QC with probabilistic imputation through a remote server. RESULTS: We introduce two modifications to the original algorithm. First, we present a novel SNP selection function that leads to pronounced increases (up by 40% in some scenarios) in call rate. Second, we develop a parallelized model building algorithm that allows us to process a reference set of over 2500 individuals. In a validation experiment, we show that our framework produces highly accurate HLA type imputations at class I and class II loci for independent datasets: at call rates of 95-99%, imputation accuracy is between 92% and 98% at the four-digit level and over 97% at the two-digit level. We demonstrate utility of the method through analysis of a genome-wide association study for psoriasis where there is a known classical HLA risk allele (HLA-C*06:02). We show that the imputed allele shows stronger association with disease than any single SNP within the region. The imputation framework, HLA*IMP, provides a powerful tool for dissecting the architecture of genetic risk within the HLA. AVAILABILITY: HLA*IMP, implemented in C++ and Perl, is available from http://oxfordhla.well.ox.ac.uk and is free for academic use.

Mills RE, Walter K, Stewart C, Handsaker RE, Chen K, Alkan C, Abyzov A, Yoon SC, Ye K, Cheetham RK et al. 2011. Mapping copy number variation by population-scale genome sequencing. Nature, 470 (7332), pp. 59-65.

Genomic structural variants (SVs) are abundant in humans, differing from other forms of variation in extent, origin and functional impact. Despite progress in SV characterization, the nucleotide resolution architecture of most SVs remains unknown. We constructed a map of unbalanced SVs (that is, copy number variants) based on whole genome DNA sequencing data from 185 human genomes, integrating evidence from complementary SV discovery approaches with extensive experimental validations. Our map encompassed 22,025 deletions and 6,000 additional SVs, including insertions and tandem duplications. Most SVs (53%) were mapped to nucleotide resolution, which facilitated analysing their origin and functional impact. We examined numerous whole and partial gene deletions with a genotyping approach and observed a depletion of gene disruptions amongst high frequency deletions. Furthermore, we observed differences in the size spectra of SVs originating from distinct formation mechanisms, and constructed a map of SV hotspots formed by common mechanisms. Our analytical framework and SV map serve as a resource for sequencing-based association studies.

Hosking FJ, Leslie S, Dilthey A, Moutsianas L, Wang Y, Dobbins SE, Papaemmanuil E, Sheridan E, Kinsey SE, Lightfoot T et al. 2011. MHC variation and risk of childhood B-cell precursor acute lymphoblastic leukemia. Blood, 117 (5), pp. 1633-1640.

A role for specific human leukocyte antigen (HLA) variants in the etiology of childhood acute lymphoblastic leukemia (ALL) has been extensively studied over the last 30 years, but no unambiguous association has been identified. To comprehensively study the relationship between genetic variation within the 4.5 Mb major histocompatibility complex genomic region and precursor B-cell (BCP) ALL risk, we analyzed 1075 observed and 8176 imputed single nucleotide polymorphisms and their related haplotypes in 824 BCP-ALL cases and 4737 controls. Using these genotypes we also imputed both common and rare alleles at class I (HLA-A, HLA-B, and HLA-C) and class II (HLA-DRB1, HLA-DQA1, and HLA-DQB1) HLA loci. Overall, we found no statistically significant association between variants and BCP-ALL risk. We conclude that major histocompatibility complex-defined variation in immune-mediated response is unlikely to be a major risk factor for BCP-ALL.

Evans DM, Spencer CC, Pointon JJ, Su Z, Harvey D, Kochan G, Oppermann U, Dilthey A, Pirinen M, Stone MA et al. 2011. Interaction between ERAP1 and HLA-B27 in ankylosing spondylitis implicates peptide handling in the mechanism for HLA-B27 in disease susceptibility. Nat Genet, 43 (8), pp. 761-767.

Ankylosing spondylitis is a common form of inflammatory arthritis predominantly affecting the spine and pelvis that occurs in approximately 5 out of 1,000 adults of European descent. Here we report the identification of three variants in the RUNX3, LTBR-TNFRSF1A and IL12B regions convincingly associated with ankylosing spondylitis (P < 5 × 10⁻⁸ in the combined discovery and replication datasets) and a further four loci at PTGER4, TBKBP1, ANTXR2 and CARD9 that show strong association across all our datasets (P < 5 × 10⁻⁶ overall, with support in each of the three datasets studied). We also show that polymorphisms of ERAP1, which encodes an endoplasmic reticulum aminopeptidase involved in peptide trimming before HLA class I presentation, only affect ankylosing spondylitis risk in HLA-B27-positive individuals. These findings provide strong evidence that HLA-B27 operates in ankylosing spondylitis through a mechanism involving aberrant processing of antigenic peptides.

International Multiple Sclerosis Genetics Consortium, Wellcome Trust Case Control Consortium 2, Sawcer S, Hellenthal G, Pirinen M, Spencer CC, Patsopoulos NA, Moutsianas L, Dilthey A, Su Z et al. 2011. Genetic risk and a primary role for cell-mediated immune mechanisms in multiple sclerosis. Nature, 476 (7359), pp. 214-219.

Multiple sclerosis is a common disease of the central nervous system in which the interplay between inflammatory and neurodegenerative processes typically results in intermittent neurological disturbance followed by progressive accumulation of disability. Epidemiological studies have shown that genetic factors are primarily responsible for the substantially increased frequency of the disease seen in the relatives of affected individuals, and systematic attempts to identify linkage in multiplex families have confirmed that variation within the major histocompatibility complex (MHC) exerts the greatest individual effect on risk. Modestly powered genome-wide association studies (GWAS) have enabled more than 20 additional risk loci to be identified and have shown that multiple variants exerting modest individual effects have a key role in disease susceptibility. Most of the genetic architecture underlying susceptibility to the disease remains to be defined and is anticipated to require the analysis of sample sizes that are beyond the numbers currently available to individual research groups. In a collaborative GWAS involving 9,772 cases of European descent collected by 23 research groups working in 15 different countries, we have replicated almost all of the previously suggested associations and identified at least a further 29 novel susceptibility loci. Within the MHC we have refined the identity of the HLA-DRB1 risk alleles and confirmed that variation in the HLA-A gene underlies the independent protective effect attributable to the class I region. Immunologically relevant genes are significantly overrepresented among those mapping close to the identified loci and particularly implicate T-helper-cell differentiation in the pathogenesis of multiple sclerosis.

Jiang H, Li N, Gopalan V, Zilversmit MM, Varma S, Nagarajan V, Li J, Mu J, Hayton K, Henschen B et al. 2011. High recombination rates and hotspots in a Plasmodium falciparum genetic cross. Genome Biol, 12 (4), pp. R33.

BACKGROUND: The human malaria parasite Plasmodium falciparum survives pressures from the host immune system and antimalarial drugs by modifying its genome. Genetic recombination and nucleotide substitution are the two major mechanisms that the parasite employs to generate genome diversity. A better understanding of these mechanisms may provide important information for studying parasite evolution, immune evasion and drug resistance. RESULTS: Here, we used a high-density tiling array to estimate the genetic recombination rate among 32 progeny of a P. falciparum genetic cross (7G8 × GB4). We detected 638 recombination events and constructed a high-resolution genetic map. Comparing genetic and physical maps, we obtained an overall recombination rate of 9.6 kb per centimorgan and identified 54 candidate recombination hotspots. Similar to centromeres in other organisms, the sequences of P. falciparum centromeres are found in chromosome regions largely devoid of recombination activity. Motifs enriched in hotspots were also identified, including a 12-bp G/C-rich motif with 3-bp periodicity that may interact with a protein containing 11 predicted zinc finger arrays. CONCLUSIONS: These results show that the P. falciparum genome has a high recombination rate, although it also follows the overall rule of meiosis in eukaryotes with an average of approximately one crossover per chromosome per meiosis. GC-rich repetitive motifs identified in the hotspot sequences may play a role in the high recombination rate observed. The lack of recombination activity in centromeric regions is consistent with the observations of reduced recombination near the centromeres of other organisms.

Genetic Analysis of Psoriasis Consortium & the Wellcome Trust Case Control Consortium 2, Strange A, Capon F, Spencer CC, Knight J, Weale ME, Allen MH, Barton A, Band G, Bellenguez C et al. 2010. A genome-wide association study identifies new psoriasis susceptibility loci and an interaction between HLA-C and ERAP1. Nat Genet, 42 (11), pp. 985-990.

To identify new susceptibility loci for psoriasis, we undertook a genome-wide association study of 594,224 SNPs in 2,622 individuals with psoriasis and 5,667 controls. We identified associations at eight previously unreported genomic loci. Seven loci harbored genes with recognized immune functions (IL28RA, REL, IFIH1, ERAP1, TRAF3IP2, NFKBIA and TYK2). These associations were replicated in 9,079 European samples (six loci with a combined P < 5 × 10⁻⁸ and two loci with a combined P < 5 × 10⁻⁷). We also report compelling evidence for an interaction between the HLA-C and ERAP1 loci (combined P = 6.95 × 10⁻⁶). ERAP1 plays an important role in MHC class I peptide processing. ERAP1 variants only influenced psoriasis susceptibility in individuals carrying the HLA-C risk allele. Our findings implicate pathways that integrate epidermal barrier dysfunction with innate and adaptive immune dysregulation in psoriasis pathogenesis.

1000 Genomes Project Consortium, Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA. 2010. A map of human genome variation from population-scale sequencing. Nature, 467 (7319), pp. 1061-1073.

The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four populations; high-coverage sequencing of two mother-father-child trios; and exon-targeted sequencing of 697 individuals from seven populations. We describe the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast majority of common variation, over 95% of the currently accessible variants found in any individual are present in this data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base substitution mutations to be approximately 10⁻⁸ per base pair per generation. We explore the data with regard to signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes, due to selection at linked sites. These methods and public data will support the next phase of human genetic research.

Albers CA, Lunter G, MacArthur DG, McVean G, Ouwehand WH, Durbin R. 2011. Dindel: accurate indel calls from short-read data. Genome Res, 21 (6), pp. 961-973.

Small insertions and deletions (indels) are a common and functionally important type of sequence polymorphism. Most of the focus of studies of sequence variation is on single nucleotide variants (SNVs) and large structural variants. In principle, high-throughput sequencing studies should allow identification of indels just as SNVs. However, inference of indels from next-generation sequence data is challenging, and so far methods for identifying indels lag behind methods for calling SNVs in terms of sensitivity and specificity. We propose a Bayesian method to call indels from short-read sequence data in individuals and populations by realigning reads to candidate haplotypes that represent alternative sequence to the reference. The candidate haplotypes are formed by combining candidate indels and SNVs identified by the read mapper, while allowing for known sequence variants or candidates from other methods to be included. In our probabilistic realignment model we account for base-calling errors, mapping errors, and also, importantly, for increased sequencing error indel rates in long homopolymer runs. We show that our method is sensitive and achieves low false discovery rates on simulated and real data sets, although challenges remain. The algorithm is implemented in the program Dindel, which has been used in the 1000 Genomes Project call sets.

McVean G, Myers S. 2010. PRDM9 marks the spot. Nat Genet, 42 (10), pp. 821-822.

International HapMap 3 Consortium, Altshuler DM, Gibbs RA, Peltonen L, Dermitzakis E, Schaffner SF, Yu F et al. 2010. Integrating common and rare genetic variation in diverse human populations. Nature, 467 (7311), pp. 52-58.

Despite great progress in identifying genetic variants that influence human disease, most inherited risk remains unexplained. A more complete understanding requires genome-wide studies that fully examine less common alleles in populations with a wide range of ancestry. To inform the design and interpretation of such studies, we genotyped 1.6 million common single nucleotide polymorphisms (SNPs) in 1,184 reference individuals from 11 global populations, and sequenced ten 100-kilobase regions in 692 of these individuals. This integrated data set of common and rare alleles, called 'HapMap 3', includes both SNPs and copy number polymorphisms (CNPs). We characterized population-specific differences among low-frequency variants, measured the improvement in imputation accuracy afforded by the larger reference panel, especially in imputing SNPs with a minor allele frequency of ≤5%, and demonstrated the feasibility of imputing newly discovered CNPs and SNPs. This expanded public resource of genome variants in global populations supports deeper interrogation of genomic variation and its role in human disease, and serves as a step towards a high-resolution map of the landscape of human genetic variation.

McVean G. 2010. What drives recombination hotspots to repeat DNA in humans? Philos Trans R Soc Lond B Biol Sci, 365 (1544), pp. 1213-1218.

Recombination between homologous, but non-allelic, stretches of DNA such as gene families, segmental duplications and repeat elements is an important source of mutation. In humans, recent studies have identified short DNA motifs that both determine the location of 40 per cent of meiotic cross-over hotspots and are significantly enriched at the breakpoints of recurrent non-allelic homologous recombination (NAHR) syndromes. Unexpectedly, the most highly penetrant form of the motif occurs on the background of an inactive repeat element family (THE1 elements) and the motif also has strong recombinogenic activity on currently active element families including Alu and LINE2 elements. Analysis of genetic variation among members of these repeat families indicates an important role for NAHR in their evolution. Given the potential for double-strand breaks within repeat DNA to cause pathological rearrangement, the association between repeats and hotspots is surprising. Here we consider possible explanations for why selection acting against NAHR has not eliminated hotspots from repeat DNA including mechanistic constraints, possible benefits to repeat DNA from recruiting hotspots and rapid evolution of the recombination machinery. I suggest that rapid evolution of hotspot motifs may, surprisingly, tend to favour sequences present in repeat DNA and outline the data required to differentiate between hypotheses.

Wellcome Trust Case Control Consortium, Craddock N, Hurles ME, Cardin N, Pearson RD, Plagnol V, Robson S, Vukcevic D, Barnes C, Conrad DF et al. 2010. Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls. Nature, 464 (7289), pp. 713-720.

Copy number variants (CNVs) account for a major proportion of human genetic polymorphism and have been predicted to have an important role in genetic susceptibility to common disease. To address this we undertook a large, direct genome-wide study of association between CNVs and eight common human diseases. Using a purpose-designed array we typed approximately 19,000 individuals into distinct copy-number classes at 3,432 polymorphic CNVs, including an estimated 50% of all common CNVs larger than 500 base pairs. We identified several biological artefacts that lead to false-positive associations, including systematic CNV differences between DNAs derived from blood and cell lines. Association testing and follow-up replication analyses confirmed three loci where CNVs were associated with disease (IRGM for Crohn's disease; HLA for Crohn's disease, rheumatoid arthritis and type 1 diabetes; and TSPAN8 for type 2 diabetes), although in each case the locus had previously been identified in single nucleotide polymorphism (SNP)-based studies, reflecting our observation that most common CNVs that are well-typed on our array are well tagged by SNPs and so have been indirectly explored through SNP studies. We conclude that common CNVs that can be typed on existing platforms are unlikely to contribute greatly to the genetic basis of common human diseases.

Myers S, Bowden R, Tumian A, Bontrop RE, Freeman C, MacFie TS, McVean G, Donnelly P. 2010. Drive against hotspot motifs in primates implicates the PRDM9 gene in meiotic recombination. Science, 327 (5967), pp. 876-879.

Although present in both humans and chimpanzees, recombination hotspots, at which meiotic crossover events cluster, differ markedly in their genomic location between the species. We report that a 13-base pair sequence motif previously associated with the activity of 40% of human hotspots does not function in chimpanzees and is being removed by self-destructive drive in the human lineage. Multiple lines of evidence suggest that the rapidly evolving zinc-finger protein PRDM9 binds to this motif and that sequence changes in the protein may be responsible for hotspot differences between species. The involvement of PRDM9, which causes histone H3 lysine 4 trimethylation, implies that there is a common mechanism for recombination hotspots in eukaryotes but raises questions about what forces have driven such rapid change.

International MHC and Autoimmunity Genetics Network, Rioux JD, Goyette P, Vyse TJ, Hammarström L, Fernando MM, Green T, De Jager PL, Foisy S, Wang J et al. 2009. Mapping of multiple susceptibility variants within the MHC region for 7 immune-mediated diseases. Proc Natl Acad Sci U S A, 106 (44), pp. 18680-18685.

The human MHC represents the strongest susceptibility locus for autoimmune diseases. However, the identification of the true predisposing gene(s) has been handicapped by the strong linkage disequilibrium across the region. Furthermore, most studies to date have been limited to the examination of a subset of the HLA and non-HLA genes with a marker density and sample size insufficient for mapping all independent association signals. We genotyped a panel of 1,472 SNPs to capture the common genomic variation across the 3.44 megabase (Mb) classic MHC region in 10,576 DNA samples derived from patients with systemic lupus erythematosus, Crohn's disease, ulcerative colitis, rheumatoid arthritis, myasthenia gravis, selective IgA deficiency, multiple sclerosis, and appropriate control samples. We identified the primary association signals for each disease and performed conditional regression to identify independent secondary signals. The data demonstrate that MHC associations with autoimmune diseases result from complex, multilocus effects that span the entire region.

Goriely A, Hansen RM, Taylor IB, Olesen IA, Jacobsen GK, McGowan SJ, Pfeifer SP, McVean GA, Rajpert-De Meyts E, Wilkie AO. 2009. Activating mutations in FGFR3 and HRAS reveal a shared genetic origin for congenital disorders and testicular tumors. Nat Genet, 41 (11), pp. 1247-1252.

Genes mutated in congenital malformation syndromes are frequently implicated in oncogenesis, but the causative germline and somatic mutations occur in separate cells at different times of an organism's life. Here we unify these processes to a single cellular event for mutations arising in male germ cells that show a paternal age effect. Screening of 30 spermatocytic seminomas for oncogenic mutations in 17 genes identified 2 mutations in FGFR3 (both 1948A>G, encoding K650E, which causes thanatophoric dysplasia in the germline) and 5 mutations in HRAS. Massively parallel sequencing of sperm DNA showed that levels of the FGFR3 mutation increase with paternal age and that the mutation spectrum at the Lys650 codon is similar to that observed in bladder cancer. Most spermatocytic seminomas show increased immunoreactivity for FGFR3 and/or HRAS. We propose that paternal age-effect mutations activate a common 'selfish' pathway supporting proliferation in the testis, leading to diverse phenotypes in the next generation including fetal lethality, congenital syndromes and cancer predisposition.

McVean G. 2009. A genealogical interpretation of principal components analysis. PLoS Genet, 5 (10), pp. e1000686.

Principal components analysis, PCA, is a statistical method commonly used in population genetics to identify structure in the distribution of genetic variation across geographical location and ethnic background. However, while the method is often used to inform about historical demographic processes, little is known about the relationship between fundamental demographic parameters and the projection of samples onto the primary axes. Here I show that for SNP data the projection of samples onto the principal components can be obtained directly from considering the average coalescent times between pairs of haploid genomes. The result provides a framework for interpreting PCA projections in terms of underlying processes, including migration, geographical isolation, and admixture. I also demonstrate a link between PCA and Wright's F_ST and show that SNP ascertainment has a largely simple and predictable effect on the projection of samples. Using examples from human genetics, I discuss the application of these results to empirical data and the implications for inference.

Myers S, Freeman C, Auton A, Donnelly P, McVean G. 2008. A common sequence motif associated with recombination hot spots and genome instability in humans. Nat Genet, 40 (9), pp. 1124-1129.

In humans, most meiotic crossover events are clustered into short regions of the genome known as recombination hot spots. We have previously identified DNA motifs that are enriched in hot spots, particularly the 7-mer CCTCCCT. Here we use the increased hot-spot resolution afforded by the Phase 2 HapMap and novel search methods to identify an extended family of motifs based around the degenerate 13-mer CCNCCNTNNCCNC, which is critical in recruiting crossover events to at least 40% of all human hot spots and which operates on diverse genetic backgrounds in both sexes. Furthermore, these motifs are found in hypervariable minisatellites and are clustered in the breakpoint regions of both disease-causing nonallelic homologous recombination hot spots and common mitochondrial deletion hot spots, implicating the motif as a driver of genome instability.
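As a small illustration of what "degenerate 13-mer" means here, the sketch below scans a DNA string for occurrences of CCNCCNTNNCCNC, treating N as any of the four bases. The helper names and the test sequence are made up for the example. It searches the forward strand only and says nothing about how the motif was discovered in the paper.

```python
import re

# Only the symbols needed for this motif; N matches any base.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "[ACGT]"}


def motif_to_regex(motif):
    """Compile a degenerate motif into a regex that also finds overlapping hits."""
    body = "".join(IUPAC[base] for base in motif)
    # The zero-width lookahead lets finditer report matches that overlap.
    return re.compile("(?=(" + body + "))")


def find_motif(sequence, motif="CCNCCNTNNCCNC"):
    """Return (0-based start, matched 13-mer) pairs on the forward strand."""
    pattern = motif_to_regex(motif)
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]


# Made-up sequence containing one instance of the motif, which itself
# begins with the 7-mer CCTCCCT mentioned in the abstract.
hits = find_motif("aaCCTCCCTAACCACgg")
```

To cover both strands, one would run the same search on the reverse complement of the sequence as well.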

McVean G. 2008. Linkage Disequilibrium, Recombination and Selection 2 pp. 909-944.

Leslie S, Donnelly P, McVean G. 2008. A statistical method for predicting classical HLA alleles from SNP data. Am J Hum Genet, 82 (1), pp. 48-56.

Genetic variation at classical HLA alleles is a crucial determinant of transplant success and susceptibility to a large number of infectious and autoimmune diseases. However, large-scale studies involving classical type I and type II HLA alleles might be limited by the cost of allele-typing technologies. Although recent studies have shown that some common HLA alleles can be tagged with small numbers of markers, SNP-based tagging does not offer a complete solution to predicting HLA alleles. We have developed a new statistical methodology to use SNP variation within the region to predict alleles at key class I (HLA-A, HLA-B, and HLA-C) and class II (HLA-DRB1, HLA-DQA1, and HLA-DQB1) loci. Our results indicate that a single panel of approximately 100 SNPs typed across the region is sufficient for predicting both rare and common HLA alleles with up to 95% accuracy in both African and non-African populations. Furthermore, we show that HLA alleles can be successfully predicted by using previously genotyped SNPs that are within the MHC and that had not been chosen for their ability to predict HLA alleles, such as those included on genome-wide products. These results indicate that our methodology, combined with an extended database of reference haplotypes, will facilitate large-scale experiments, including disease-association studies and vaccine trials, in which detailed information about HLA type is valuable.

Stone GN, Atkinson RJ, Rokas A, Aldrey JL, Melika G, Acs Z, Csóka G, Hayward A, Bailey R, Buckee C, McVean GA. 2008. Evidence for widespread cryptic sexual generations in apparently purely asexual Andricus gallwasps. Mol Ecol, 17 (2), pp. 652-665.

Oak gallwasps (Hymenoptera, Cynipidae, Cynipini) are one of seven major animal taxa that commonly reproduce by cyclical parthenogenesis (CP). A major question in research on CP taxa is the frequency with which lineages lose their sexual generations, and diversify as purely asexual radiations. Most oak gallwasp species are only known from an asexual generation, and secondary loss of sex has been conclusively demonstrated in several species, particularly members of the holarctic genus Andricus. This raises the possibility of widespread secondary loss of sex in the Cynipini, and of diversification within purely parthenogenetic lineages. We use two approaches based on analyses of allele frequency data to test for cryptic sexual generations in eight apparently asexual European species distributed through a major western palaearctic lineage of the gallwasp genus Andricus. All species showing adequate levels of polymorphism (7/8) showed signatures of sex compatible with cyclical parthenogenesis. We also use DNA sequence data to test the hypothesis that ignorance of these sexual generations (despite extensive study on this group) results from failure to discriminate among known but morphologically indistinguishable sexual generations. This hypothesis is supported: 35 sequences attributed by leading cynipid taxonomists to a single sexual adult morphospecies, Andricus burgundus, were found to represent the sexual generations of at least six Andricus species. We confirm cryptic sexual generations in a total of 11 Andricus species, suggesting that secondary loss of sex is rare in Andricus.

International HapMap Consortium, Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, Gibbs RA, Belmont JW, Boudreau A, Hardenbol P et al. 2007. A second generation human haplotype map of over 3.1 million SNPs. Nature, 449 (7164), pp. 851-861.

We describe the Phase II HapMap, which characterizes over 3.1 million human single nucleotide polymorphisms (SNPs) genotyped in 270 individuals from four geographically diverse populations and includes 25-35% of common SNP variation in the populations surveyed. The map is estimated to capture untyped common variation with an average maximum r² of between 0.9 and 0.96 depending on population. We demonstrate that the current generation of commercial genome-wide genotyping products captures common Phase II SNPs with an average maximum r² of up to 0.8 in African and up to 0.95 in non-African populations, and that potential gains in power in association studies can be obtained through imputation. These data also reveal novel aspects of the structure of linkage disequilibrium. We show that 10-30% of pairs of individuals within a population share at least one region of extended genetic identity arising from recent ancestry and that up to 1% of all common variants are untaggable, primarily because they lie within recombination hotspots. We show that recombination rates vary systematically around genes and between genes of different function. Finally, we demonstrate increased differentiation at non-synonymous, compared to synonymous, SNPs, resulting from systematic differences in the strength or efficacy of natural selection between populations.
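
The "maximum r2" coverage measure mentioned in this abstract can be illustrated with a small sketch. It assumes phased haplotypes coded as 0/1 alleles and uses the standard definition of r2 between two biallelic sites; the function names `r_squared` and `max_r_squared` are hypothetical and are not taken from the HapMap analysis pipeline.

```python
def r_squared(a, b):
    """Squared correlation (r^2) between two biallelic SNPs.

    a and b hold the 0/1 allele carried at each SNP by the same
    set of phased haplotypes, in the same order.
    """
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("a and b must be non-empty and of equal length")
    p_a = sum(a) / n                                # frequency of allele 1 at SNP a
    p_b = sum(b) / n                                # frequency of allele 1 at SNP b
    p_ab = sum(x & y for x, y in zip(a, b)) / n     # frequency of the 1-1 haplotype
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0                                  # a monomorphic site carries no LD information
    d = p_ab - p_a * p_b                            # linkage disequilibrium coefficient D
    return d * d / denom


def max_r_squared(target, tags):
    """Best r^2 between an untyped target SNP and any typed (tag) SNP."""
    return max((r_squared(target, t) for t in tags), default=0.0)
```

Averaging `max_r_squared` over all untyped common SNPs in a region gives one number summarising how well a genotyping product captures them, which is the kind of summary the abstract reports. This toy version ignores sampling error and multi-marker predictors.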

Gay J, Myers S, McVean G. 2007. Estimating meiotic gene conversion rates from population genetic data. Genetics, 177 (2), pp. 881-894.

Gene conversion plays an important part in shaping genetic diversity in populations, yet estimating the rate at which it occurs is difficult because of the short lengths of DNA involved. We have developed a new statistical approach to estimating gene conversion rates from genetic variation, by extending an existing model for haplotype data in the presence of crossover events. We show, by simulation, that when the rate of gene conversion events is at least comparable to the rate of crossover events, the method provides a powerful approach to the detection of gene conversion and estimation of its rate. Application of the method to data from the telomeric X chromosome of Drosophila melanogaster, in which crossover activity is suppressed, indicates that gene conversion occurs approximately 400 times more often than crossover events. We also extend the method to estimating variable crossover and gene conversion rates and estimate the rate of gene conversion to be approximately 1.5 times higher than the crossover rate in a region of human chromosome 1 with known recombination hotspots.

Auton A, McVean G. 2007. Recombination rate estimation in the presence of hotspots. Genome Res, 17 (8), pp. 1219-1227.

Fine-scale estimation of recombination rates remains a challenging problem. Experimental techniques can provide accurate estimates at fine scales but are technically challenging and cannot be applied on a genome-wide scale. An alternative source of information comes from patterns of genetic variation. Several statistical methods have been developed to estimate recombination rates from randomly sampled chromosomes. However, most such methods either make poor assumptions about recombination rate variation, or simply assume that there is no rate variation. Since the discovery of recombination hotspots, it is clear that recombination rates can vary over many orders of magnitude at the fine scale. We present a method for the estimation of recombination rates in the presence of recombination hotspots. We demonstrate that the method is able to detect and accurately quantify recombination rate heterogeneity, and is a substantial improvement over a commonly used method. We then use the method to reanalyze genetic variation data from the HLA and MS32 regions of the human genome and demonstrate that the method is able to provide accurate rate estimates and simultaneously detect hotspots.

Marchini J, Howie B, Myers S, McVean G, Donnelly P. 2007. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet, 39 (7), pp. 906-913.

Genome-wide association studies are set to become the method of choice for uncovering the genetic basis of human diseases. A central challenge in this area is the development of powerful multipoint methods that can detect causal variants that have not been directly genotyped. We propose a coherent analysis framework that treats the problem as one involving missing or uncertain genotypes. Central to our approach is a model-based imputation method for inferring genotypes at observed or unobserved SNPs, leading to improved power over existing methods for multipoint association mapping. Using real genome-wide association study data, we show that our approach (i) is accurate and well calibrated, (ii) provides detailed views of associated regions that facilitate follow-up studies and (iii) can be used to validate and correct data at genotyped markers. A notable future use of our method will be to boost power by combining data from genome-wide scans that use different SNP sets.

Barry AE, Leliwa-Sytek A, Tavul L, Imrie H, Migot-Nabias F, Brown SM, McVean GA, Day KP. 2007. Population genomics of the immune evasion (var) genes of Plasmodium falciparum. PLoS Pathog, 3 (3), pp. e34.

Var genes encode the major surface antigen (PfEMP1) of the blood stages of the human malaria parasite Plasmodium falciparum. Differential expression of up to 60 diverse var genes in each parasite genome underlies immune evasion. We compared the diversity of the DBLalpha domain of var genes sampled from 30 parasite isolates from a malaria endemic area of Papua New Guinea (PNG) and 59 from widespread geographic origins (global). Overall, we obtained over 8,000 quality-controlled DBLalpha sequences. Within our sampling frame, the global population had a total of 895 distinct DBLalpha "types" and negligible overlap among repertoires. This indicated that var gene diversity on a global scale is so immense that many genomes would need to be sequenced to capture its true extent. In contrast, we found a much lower diversity in PNG of 185 DBLalpha types, with an average of approximately 7% overlap among repertoires. While we identify marked geographic structuring, nearly 40% of types identified in PNG were also found in samples from different countries showing a cosmopolitan distribution for much of the diversity. We also present evidence to suggest that recombination plays a key role in maintaining the unprecedented levels of polymorphism found in these immune evasion genes. This population genomic framework provides a cost effective molecular epidemiological tool to rapidly explore the geographic diversity of var genes.

McVean G. 2007. The structure of linkage disequilibrium around a selective sweep. Genetics, 175 (3), pp. 1395-1406.

The fixation of advantageous mutations by natural selection has a profound impact on patterns of linked neutral variation. While it has long been appreciated that such selective sweeps influence the frequency spectrum of nearby polymorphism, it has only recently become clear that they also have dramatic effects on local linkage disequilibrium. By extending previous results on the relationship between genealogical structure and linkage disequilibrium, I obtain simple expressions for the influence of a selective sweep on patterns of allelic association. I show that sweeps can increase, decrease, or even eliminate linkage disequilibrium (LD) entirely depending on the relative position of the selected and neutral loci. I also show the importance of the age of the neutral mutations in predicting their degree of association and describe the consequences of such results for the interpretation of empirical data. In particular, I demonstrate that while selective sweeps can eliminate LD, they generate patterns of genetic variation very different from those expected from recombination hotspots.

Eyheramendy S, Marchini J, McVean G, Myers S, Donnelly P. 2007. A model-based approach to capture genetic variation for future association studies. Genome Res, 17 (1), pp. 88-95.

Genome-wide association studies are still constrained by the cost of genotyping. For this reason, the selection of a reduced set of markers or tags able to capture a significant proportion of the genetic variation is an important aspect of these studies. Most tagging SNP selection methods have been successful in capturing the genetic variation of the data from which the tags have been chosen. However, when these tags are used in an independent data set, a significant proportion of the remaining SNPs (non-tags) are not captured and, in most cases, there is no information on which SNPs are captured. We propose to use a probabilistic model to predict the non-tags based on a set of tags, as a way to capture genetic variation. An important advantage of this method is that it directly predicts the genotype of the non-tags with which we can test for association with the phenotype and which could help to elucidate the location of genes responsible for increasing disease susceptibility. Additionally, this method provides an estimate of the probabilities with which the predictions are made, which reflects the confidence of the probabilistic model. We also propose new methods to select the tagging SNPs. We empirically show by using HapMap data that our approach is able to capture significantly more genetic variation than methods based solely on a pairwise LD measure.

Sabeti PC, Varilly P, Fry B, Lohmueller J, Hostetter E, Cotsapas C, Xie X, Byrne EH, McCarroll SA, Gaudet R et al. 2007. Genome-wide detection and characterization of positive selection in human populations. Nature, 449 (7164), pp. 913-918.

With the advent of dense maps of human genetic variation, it is now possible to detect positive natural selection across the human genome. Here we report an analysis of over 3 million polymorphisms from the International HapMap Project Phase 2 (HapMap2). We used 'long-range haplotype' methods, which were developed to identify alleles segregating in a population that have undergone recent selection, and we also developed new methods that are based on cross-population comparisons to discover alleles that have swept to near-fixation within a population. The analysis reveals more than 300 strong candidate regions. Focusing on the strongest 22 regions, we develop a heuristic for scrutinizing these regions to identify candidate targets of selection. In a complementary analysis, we identify 26 non-synonymous, coding, single nucleotide polymorphisms showing regional evidence of positive selection. Examination of these candidates highlights three cases in which two genes in a common biological process have apparently undergone positive selection in the same population: LARGE and DMD, both related to infection by the Lassa virus, in West Africa; SLC24A5 and SLC45A2, both involved in skin pigmentation, in Europe; and EDAR and EDA2R, both involved in development of hair follicles, in Asia.

Mu J, Awadalla P, Duan J, McGee KM, Keebler J, Seydel K, McVean GA, Su XZ. 2007. Genome-wide variation and identification of vaccine targets in the Plasmodium falciparum genome. Nat Genet, 39 (1), pp. 126-130.

One goal in sequencing the Plasmodium falciparum genome, the agent of the most lethal form of malaria, is to discover vaccine and drug targets. However, identifying those targets in a genome in which approximately 60% of genes have unknown functions is an enormous challenge. Because the majority of known malaria antigens and drug-resistant genes are highly polymorphic and under various selective pressures, genome-wide analysis for signatures of selection may lead to discovery of new vaccine and drug candidates. Here we surveyed 3,539 P. falciparum genes (approximately 65% of the predicted genes) for polymorphisms and identified various highly polymorphic loci and genes, some of which encode new antigens that we confirmed using human immune sera. Our collections of genome-wide SNPs (approximately 65% nonsynonymous) and polymorphic microsatellites and indels provide a high-resolution map (one marker per approximately 4 kb) for mapping parasite traits and studying parasite populations. In addition, we report new antigens, providing urgently needed vaccine candidates for disease control.

McVean G, Spencer CC. 2006. Scanning the human genome for signals of selection. Curr Opin Genet Dev, 16 (6), pp. 624-629.

The search for adaptive evolution in the human genome has reached a new era with the advent of genome-wide surveys of genetic variation. However, making sense, let alone use, of such experiments is far from straightforward. Key problems include the way in which the data have been collected, the need to control for factors such as population history and variable recombination rates, which influence the discovery rates for both true and false positives, and the inherent difficulty of falsification. Nevertheless, recent work has shown that genome scans can be used to identify both functional polymorphisms underlying selected traits and entire classes of genes enriched for signals of adaptation.

de Bakker PI, McVean G, Sabeti PC, Miretti MM, Green T, Marchini J, Ke X, Monsuur AJ, Whittaker P, Delgado M et al. 2006. A high-resolution HLA and SNP haplotype map for disease association studies in the extended human MHC. Nat Genet, 38 (10), pp. 1166-1172.

The proteins encoded by the classical HLA class I and class II genes in the major histocompatibility complex (MHC) are highly polymorphic and are essential in self versus non-self immune recognition. HLA variation is a crucial determinant of transplant rejection and susceptibility to a large number of infectious and autoimmune diseases. Yet identification of causal variants is problematic owing to linkage disequilibrium that extends across multiple HLA and non-HLA genes in the MHC. We therefore set out to characterize the linkage disequilibrium patterns between the highly polymorphic HLA genes and background variation by typing the classical HLA genes and >7,500 common SNPs and deletion-insertion polymorphisms across four population samples. The analysis provides informative tag SNPs that capture much of the common variation in the MHC region and that could be used in disease association studies, and it provides new insight into the evolutionary dynamics and ancestral origins of the HLA loci and their haplotypes.

Spencer CC, Deloukas P, Hunt S, Mullikin J, Myers S, Silverman B, Donnelly P, Bentley D, McVean G. 2006. The influence of recombination on human genetic diversity. PLoS Genet, 2 (9), pp. e148.

In humans, the rate of recombination, as measured on the megabase scale, is positively associated with the level of genetic variation, as measured at the genic scale. Despite considerable debate, it is not clear whether these factors are causally linked or, if they are, whether this is driven by the repeated action of adaptive evolution or molecular processes such as double-strand break formation and mismatch repair. We introduce three innovations to the analysis of recombination and diversity: fine-scale genetic maps estimated from genotype experiments that identify recombination hotspots at the kilobase scale, analysis of an entire human chromosome, and the use of wavelet techniques to identify correlations acting at different scales. We show that recombination influences genetic diversity only at the level of recombination hotspots. Hotspots are also associated with local increases in GC content and the relative frequency of GC-increasing mutations but have no effect on substitution rates. Broad-scale association between recombination and diversity is explained through covariance of both factors with base composition. To our knowledge, these results are the first evidence of a direct and local influence of recombination hotspots on genetic variation and the fate of individual mutations. However, that hotspots have no influence on substitution rates suggests that they are too ephemeral on an evolutionary time scale to have a strong influence on broader scale patterns of base composition and long-term molecular evolution.

Myers S, Spencer CC, Auton A, Bottolo L, Freeman C, Donnelly P, McVean G. 2006. The distribution and causes of meiotic recombination in the human genome. Biochem Soc Trans, 34 (Pt 4), pp. 526-530.

Using the statistical analysis of genetic variation, we have developed a high-resolution genetic map of recombination hotspots and recombination rate variation across the human genome. This map, which has a resolution several orders of magnitude greater than previous studies, identifies over 25,000 recombination hotspots and gives new insights into the distribution and determination of recombination. Wavelet-based analysis demonstrates scale-specific influences of base composition, coding context and DNA repeats on recombination rates, though, in contrast with other species, no association with DNase I hypersensitivity. We have also identified specific DNA motifs that are strongly associated with recombination hotspots and whose activity is influenced by local context. Comparative analysis of recombination rates in humans and chimpanzees demonstrates very high rates of evolution of the fine-scale structure of the recombination landscape. In the light of these observations, we suggest possible resolutions of the hotspot paradox.

Gregory SG, Barlow KF, McLay KE, Kaul R, Swarbreck D, Dunham A, Scott CE, Howe KL, Woodfine K, Spencer CC et al. 2006. The DNA sequence and biological annotation of human chromosome 1. Nature, 441 (7091), pp. 315-321.

The reference sequence for each human chromosome provides the framework for understanding genome function, variation and evolution. Here we report the finished sequence and biological annotation of human chromosome 1. Chromosome 1 is gene-dense, with 3,141 genes and 991 pseudogenes, and many coding sequences overlap. Rearrangements and mutations of chromosome 1 are prevalent in cancer and many other diseases. Patterns of sequence variation reveal signals of recent selection in specific genes that may contribute to human fitness, and also in regions where no function is evident. Fine-scale recombination occurs in hotspots of varying intensity along the sequence, and is enriched near genes. These and other studies of human biology and disease encoded within chromosome 1 are made possible with the highly accurate annotated sequence, as part of the completed set of chromosome sequences that comprise the reference human genome.

Wilson DJ, McVean G. 2006. Estimating diversifying selection and functional constraint in the presence of recombination. Genetics, 172 (3), pp. 1411-1425.

Models of molecular evolution that incorporate the ratio of nonsynonymous to synonymous polymorphism (dN/dS ratio) as a parameter can be used to identify sites that are under diversifying selection or functional constraint in a sample of gene sequences. However, when there has been recombination in the evolutionary history of the sequences, reconstructing a single phylogenetic tree is not appropriate, and inference based on a single tree can give misleading results. In the presence of high levels of recombination, the identification of sites experiencing diversifying selection can suffer from a false-positive rate as high as 90%. We present a model that uses a population genetics approximation to the coalescent with recombination and use reversible-jump MCMC to perform Bayesian inference on both the dN/dS ratio and the recombination rate, allowing each to vary along the sequence. We demonstrate that the method has the power to detect variation in the dN/dS ratio and the recombination rate and does not suffer from a high false-positive rate. We use the method to analyze the porB gene of Neisseria meningitidis and verify the inferences using prior sensitivity analysis and model criticism techniques.

Hanchard NA, Rockett KA, Spencer C, Coop G, Pinder M, Jallow M, Kimber M, McVean G, Mott R, Kwiatkowski DP. 2006. Screening for recently selected alleles by analysis of human haplotype similarity. Am J Hum Genet, 78 (1), pp. 153-159.

There is growing interest in the use of haplotype-based methods for detecting recent selection. Here, we describe a method that uses a sliding window to estimate similarity among the haplotypes associated with any given single-nucleotide polymorphism (SNP) allele. We used simulations of natural selection to provide estimates of the empirical power of the method to detect recently selected alleles and found it to be comparable in power to the popular long-range haplotype test and more powerful than methods based on nucleotide diversity. We then applied the method to a recently selected allele (the sickle mutation at the HBB locus) and found it to have a signal of selection that was significantly stronger than that of simulated models both with and without strong selection. Using this method, we also evaluated >4,000 SNPs on chromosome 20, indicating the applicability of the method to regional data sets.
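
The general idea of a sliding-window similarity scan can be sketched as follows. This is an illustration only: the similarity measure used here (the fraction of carrier haplotype pairs that are identical across the window) is an assumed stand-in and is not the statistic defined in the paper. The names `window_similarity` and `scan` are hypothetical.

```python
from itertools import combinations


def window_similarity(haplotypes, focal, allele, half_width):
    """Similarity among haplotypes carrying `allele` at SNP index `focal`.

    haplotypes: list of equal-length sequences of 0/1 alleles (phased).
    The window spans `half_width` SNPs on each side of the focal SNP,
    truncated at the ends of the region.
    Returns the fraction of carrier pairs that are identical across the
    window, or None when fewer than two haplotypes carry the allele.
    """
    carriers = [h for h in haplotypes if h[focal] == allele]
    lo = max(0, focal - half_width)
    hi = min(len(haplotypes[0]), focal + half_width + 1)
    windows = [tuple(h[lo:hi]) for h in carriers]
    pairs = list(combinations(windows, 2))
    if not pairs:
        return None
    return sum(x == y for x, y in pairs) / len(pairs)


def scan(haplotypes, allele, half_width):
    """Slide the window along the region, one focal SNP at a time."""
    n_snps = len(haplotypes[0])
    return [window_similarity(haplotypes, i, allele, half_width)
            for i in range(n_snps)]
```

A recently selected allele is expected to sit on unusually similar haplotypes, so high values for one allele relative to the other at the same SNP, or relative to simulations, would be the signal of interest. Assessing significance, as the abstract describes, requires simulated data and is outside this sketch.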

International HapMap Consortium. 2005. A haplotype map of the human genome. Nature, 437 (7063), pp. 1299-1320.

Inherited genetic variation has a critical but as yet largely uncharacterized role in human disease. Here we report a public database of common variation in the human genome: more than one million single nucleotide polymorphisms (SNPs) for which accurate and complete genotypes have been obtained in 269 DNA samples from four populations, including ten 500-kilobase regions in which essentially all information about common DNA variation has been extracted. These data document the generality of recombination hotspots, a block-like structure of linkage disequilibrium and low haplotype diversity, leading to substantial correlations of SNPs with many of their neighbours. We show how the HapMap resource can guide the design and analysis of genetic association studies, shed light on structural variation and recombination, and identify loci that may have been subject to natural selection during human evolution.

Myers S, Bottolo L, Freeman C, McVean G, Donnelly P. 2005. A fine-scale map of recombination rates and hotspots across the human genome. Science, 310 (5746), pp. 321-324.

Genetic maps, which document the way in which recombination rates vary over a genome, are an essential tool for many genetic analyses. We present a high-resolution genetic map of the human genome, based on statistical analyses of genetic variation data, and identify more than 25,000 recombination hotspots, together with motifs and sequence contexts that play a role in hotspot activity. Differences between the behavior of recombination rates over large (megabase) and small (kilobase) scales lead us to suggest a two-stage model for recombination in which hotspots are stochastic features, within a framework in which large-scale rates are constrained.

McVean G, Spencer CC, Chaix R. 2005. Perspectives on human genetic variation from the HapMap Project. PLoS Genet, 1 (4), pp. e54.

The completion of the International HapMap Project marks the start of a new phase in human genetics. The aim of the project was to provide a resource that facilitates the design of efficient genome-wide association studies, through characterising patterns of genetic variation and linkage disequilibrium in a sample of 270 individuals across four geographical populations. In total, over one million SNPs have been typed across these genomes, providing an unprecedented view of human genetic diversity. In this review we focus on what the HapMap Project has taught us about the structure of human genetic variation and the fundamental molecular and evolutionary processes that shape it.

Mu J, Awadalla P, Duan J, McGee KM, Joy DA, McVean GA, Su XZ. 2005. Recombination hotspots and population structure in Plasmodium falciparum. PLoS Biol, 3 (10), pp. e335.

Understanding the influences of population structure, selection, and recombination on polymorphism and linkage disequilibrium (LD) is integral to mapping genes contributing to drug resistance or virulence in Plasmodium falciparum. The parasite's short generation time, coupled with a high cross-over rate, can cause rapid LD break-down. However, observations of low genetic variation have led to suggestions of effective clonality: selfing, population admixture, and selection may preserve LD in populations. Indeed, extensive LD surrounding drug-resistant genes has been observed, indicating that recombination and selection play important roles in shaping recent parasite genome evolution. These studies, however, provide only limited information about haplotype variation at local scales. Here we describe the first (to our knowledge) chromosome-wide SNP haplotype and population recombination maps for a global collection of malaria parasites, including the 3D7 isolate, whose genome has been sequenced previously. The parasites are clustered according to continental origin, but alternative groupings were obtained using SNPs at 37 putative transporter genes that are potentially under selection. Geographic isolation and highly variable multiple infection rates are the major factors affecting haplotype structure. Variation in effective recombination rates is high, both among populations and along the chromosome, with recombination hotspots conserved among populations at chromosome ends. This study supports the feasibility of genome-wide association studies in some parasite populations.

McVean GA, Cardin NJ. 2005. Approximating the coalescent with recombination. Philos Trans R Soc Lond B Biol Sci, 360 (1459), pp. 1387-1393.

The coalescent with recombination describes the distribution of genealogical histories and resulting patterns of genetic variation in samples of DNA sequences from natural populations. However, using the model as the basis for inference is currently severely restricted by the computational challenge of estimating the likelihood. We discuss why the coalescent with recombination is so challenging to work with and explore whether simpler models, under which inference is more tractable, may prove useful for genealogy-based inference. We introduce a simplification of the coalescent process in which coalescence between lineages with no overlapping ancestral material is banned. The resulting process has a simple Markovian structure when generating genealogies sequentially along a sequence, yet has very similar properties to the full model, both in terms of describing patterns of genetic variation and as the basis for statistical inference.

Goriely A, McVean GA, van Pelt AM, O'Rourke AW, Wall SA, de Rooij DG, Wilkie AO. 2005. Gain-of-function amino acid substitutions drive positive selection of FGFR2 mutations in human spermatogonia. Proc Natl Acad Sci U S A, 102 (17), pp. 6051-6056.

Despite the importance of mutation in genetics, there are virtually no experimental data on the occurrence of specific nucleotide substitutions in human gametes. C>G transversions at position 755 of FGF receptor 2 (FGFR2) cause Apert syndrome; this mutation, encoding the gain-of-function substitution Ser252Trp, occurs with a birth rate elevated 200- to 800-fold above background and originates exclusively from the unaffected father. We previously demonstrated high levels of both 755C>G and 755C>T FGFR2 mutations in human sperm and proposed that these particular mutations are enriched because the encoded proteins confer a selective advantage to spermatogonial cells. Here, we examine three corollaries of this hypothesis. First, we show that mutation levels at the adjacent FGFR2 nucleotides 752-754 are low, excluding any general increase in local mutation rate. Second, we present three instances of double-nucleotide changes involving 755C, expected to be extremely rare as chance events. Two of these double-nucleotide substitutions are shown, either by assessment of the pedigree or by direct analysis of sperm, to have arisen in sequential steps; the third (encoding Ser252Tyr) was predicted from structural considerations. Finally, we demonstrate that both major alternative spliceforms of FGFR2 (Fgfr2b and Fgfr2c) are expressed in rat spermatogonial stem cell lines. Taken together, these observations show that specific FGFR2 mutations attain high levels in sperm because they encode proteins with gain-of-function properties, favoring clonal expansion of mutant spermatogonial cells. Among FGFR2 mutations, those causing Apert syndrome may be especially prevalent because they enhance signaling by FGF ligands specific for each of the major expressed isoforms.

Winckler W, Myers SR, Richter DJ, Onofrio RC, McDonald GJ, Bontrop RE, McVean GA, Gabriel SB, Reich D, Donnelly P, Altshuler D. 2005. Comparison of fine-scale recombination rates in humans and chimpanzees. Science, 308 (5718), pp. 107-111.

We compared fine-scale recombination rates at orthologous loci in humans and chimpanzees by analyzing polymorphism data in both species. Strong statistical evidence for hotspots of recombination was obtained in both species. Despite approximately 99% identity at the level of DNA sequence, however, recombination hotspots were found rarely (if at all) at the same positions in the two species, and no correlation was observed in estimates of fine-scale recombination rates. Thus, local patterns of recombination rate have evolved rapidly, in a manner disproportionate to the change in DNA sequence.

Shrivastava J, Qian BZ, McVean G, Webster JP. 2005. An insight into the genetic variation of Schistosoma japonicum in mainland China using DNA microsatellite markers. Mol Ecol, 14 (3), pp. 839-849.

This study presents the first microsatellite investigation into the level of genetic variation among Schistosoma japonicum from different geographical origins. S. japonicum isolates were obtained from seven endemic provinces across mainland China: Zhejiang (Jiashan County), Anhui (Guichi County), Jiangxi (Yongxiu County), Hubei (Wuhan County), Hunan (Yueyang area), Sichuan 1 (Maoshan County), Sichuan 2 (Tianquan County), Yunnan (Dali County), and also one province in the Philippines (Sorsogon). DNA from 20 individuals from each origin were screened against 11 recently isolated and characterized S. japonicum microsatellites, and a set of nine loci were selected based on their polymorphic information content. High levels of polymorphism were obtained between and within population samples, with Chinese and Philippine strains appearing to follow different lineages, and with distinct branching between provinces. Moreover, across mainland China, genotype clustering appeared to be related to habitat type and/or intermediate host morph. These results highlight the suitability of microsatellites for population genetic studies of S. japonicum and suggest that there may be different strains of S. japonicum circulating in mainland China.

Jolley KA, Wilson DJ, Kriz P, McVean G, Maiden MC. 2005. The influence of mutation, recombination, population history, and selection on patterns of genetic diversity in Neisseria meningitidis. Mol Biol Evol, 22 (3), pp. 562-569.

Patterns of genetic diversity within populations of human pathogens, shaped by the ecology of host-microbe interactions, contain important information about the epidemiological history of infectious disease. Exploiting this information, however, requires a systematic approach that distinguishes the genetic signal generated by epidemiological processes from the effects of other forces, such as recombination, mutation, and population history. Here, a variety of quantitative techniques were employed to investigate multilocus sequence information from isolate collections of Neisseria meningitidis, a major cause of meningitis and septicemia worldwide. This allowed quantitative evaluation of alternative explanations for the observed population structure. A coalescent-based approach was employed to estimate the rate of mutation, the rate of recombination, and the size distribution of recombination fragments from samples of disease-associated and carried meningococci obtained in the Czech Republic in 1993 and from a global collection of disease-associated isolates collected from 1937 to 1996. The parameter estimates were used to reject a model in which genetic structure arose by chance in small populations, and analysis of molecular variation showed that geographically restricted gene flow was unlikely to be the cause of the genetic structure. The genetic differentiation between disease and carriage isolate collections indicated that, whereas certain genotypes were overrepresented among the disease-isolate collections (the "hyperinvasive" lineages), disease-associated and carried meningococci exhibited remarkably little differentiation at the level of individual nucleotide polymorphisms. In combination, these results indicated the repeated action of natural selection on meningococcal populations, possibly arising from the coevolutionary dynamic of host-pathogen interactions.

Wilson DJ, Falush D, McVean G. 2005. Germs, genomes and genealogies. Trends Ecol Evol, 20 (1), pp. 39-45.

Genetic diversity in pathogen species contains information about evolutionary and epidemiological processes, including the origins and history of disease, the nature of the selective forces acting on pathogen genes and the role of recombination in generating genetic novelty. Here, we review recent developments in these fields and compare the use of population genetic, or population-model based, approaches to phylogenetic, or population-model free, methodologies. We show how simple epidemiological models can be related to the ancestral, or coalescent, process underlying samples from pathogen species, enabling detailed inference about pathogen biology from patterns of molecular variation.

McVean G, Spencer CCA, Chaix R. 2005. Perspectives on human genetic variation from the HapMap project. PLoS Genetics, 1 (4), pp. 0413-0418.

The completion of the International HapMap Project marks the start of a new phase in human genetics. The aim of the project was to provide a resource that facilitates the design of efficient genome-wide association studies, through characterising patterns of genetic variation and linkage disequilibrium in a sample of 270 individuals across four geographical populations. In total, over one million SNPs have been typed across these genomes, providing an unprecedented view of human genetic diversity. In this review we focus on what the HapMap project has taught us about the structure of human genetic variation and the fundamental molecular and evolutionary processes that shape it.

Mu J, Awadalla P, Duan J, McGee KM, Joy DA, McVean GAT, Su XZ. 2005. Recombination hotspots and population structure in Plasmodium falciparum. PLoS Biol, 3 (10).

Understanding the influences of population structure, selection, and recombination on polymorphism and linkage disequilibrium (LD) is integral to mapping genes contributing to drug resistance or virulence in Plasmodium falciparum. The parasite's short generation time, coupled with a high cross-over rate, can cause rapid LD break-down. However, observations of low genetic variation have led to suggestions of effective clonality: selfing, population admixture, and selection may preserve LD in populations. Indeed, extensive LD surrounding drug-resistant genes has been observed, indicating that recombination and selection play important roles in shaping recent parasite genome evolution. These studies, however, provide only limited information about haplotype variation at local scales. Here we describe the first (to our knowledge) chromosome-wide SNP haplotype and population recombination maps for a global collection of malaria parasites, including the 3D7 isolate, whose genome has been sequenced previously. The parasites are clustered according to continental origin, but alternative groupings were obtained using SNPs at 37 putative transporter genes that are potentially under selection. Geographic isolation and highly variable multiple infection rates are the major factors affecting haplotype structure. Variation in effective recombination rates is high, both among populations and along the chromosome, with recombination hotspots conserved among populations at chromosome ends. This study supports the feasibility of genome-wide association studies in some parasite populations.

Harding RM, McVean G. 2004. A structured ancestral population for the evolution of modern humans. Curr Opin Genet Dev, 14 (6), pp. 667-674.

The view that modern humans evolved through a bottleneck from a single founding group of archaic Homo is being challenged by new analyses of contemporary genetic variation. A wide range of middle to late Pleistocene ages for gene genealogies and evidence for early population structures point to a diverse and scattered ancestry associated with a metapopulation history of local extinctions, re-colonization and admixture. A different balance of the same processes has shaped chimpanzee diversity.

McVean GA, Myers SR, Hunt S, Deloukas P, Bentley DR, Donnelly P. 2004. The fine-scale structure of recombination rate variation in the human genome. Science, 304 (5670), pp. 581-584.

The nature and scale of recombination rate variation are largely unknown for most species. In humans, pedigree analysis has documented variation at the chromosomal level, and sperm studies have identified specific hotspots in which crossing-over events cluster. To address whether this picture is representative of the genome as a whole, we have developed and validated a method for estimating recombination rates from patterns of genetic variation. From extensive single-nucleotide polymorphism surveys in European and African populations, we find evidence for extreme local rate variation spanning four orders in magnitude, in which 50% of all recombination events take place in less than 10% of the sequence. We demonstrate that recombination hotspots are a ubiquitous feature of the human genome, occurring on average every 200 kilobases or less, but recombination occurs preferentially outside genes.

Stumpf MP, McVean GA. 2003. Estimating recombination rates from population-genetic data. Nat Rev Genet, 4 (12), pp. 959-968.

Obtaining an accurate measure of how recombination rates vary across the genome has implications for understanding the molecular basis of recombination, its evolutionary significance and the distribution of linkage disequilibrium in natural populations. Although measuring the recombination rate is experimentally challenging, good estimates can be obtained by applying population-genetic methods to DNA sequences taken from natural populations. Statistical methods are now providing insights into the nature and scale of variation in the recombination rate, particularly in humans. Such knowledge will become increasingly important owing to the growing use of population-genetic methods in biomedical research.

Tyler-Smith C, McVean G. 2003. The comings and goings of a Y polymorphism. Nat Genet, 35 (3), pp. 201-202.

A deletion that removes 1.6 Mb of DNA and several genes from the Y chromosome increases its carriers' chance of infertility, yet is present in ∼2% of men. The Y chromosome can no longer be regarded as a neutral locus in evolutionary studies.

Goriely A, McVean GA, Röjmyr M, Ingemarsson B, Wilkie AO. 2003. Evidence for selective advantage of pathogenic FGFR2 mutations in the male germ line. Science, 301 (5633), pp. 643-646.

Observed mutation rates in humans appear higher in male than female gametes and often increase with paternal age. This bias, usually attributed to the accumulation of replication errors or inefficient repair processes, has been difficult to study directly. Here, we describe a sensitive method to quantify substitutions at nucleotide 755 of the fibroblast growth factor receptor 2 (FGFR2) gene in sperm. Although substitution levels increase with age, we show that even high levels originate from infrequent mutational events. We propose that these FGFR2 mutations, although harmful to embryonic development, are paradoxically enriched because they confer a selective advantage to the spermatogonial cells in which they arise.

Gibbs RA, Belmont JW, Hardenbol P, Willis TD, Yu FL, Yang HM, Ch'ang LY, Huang W, Liu B, Shen Y et al. 2003. The International HapMap Project. Nature, 426 (6968), pp. 789-796.

The goal of the International HapMap Project is to determine the common patterns of DNA sequence variation in the human genome and to make this information freely available in the public domain. An international consortium is developing a map of these patterns across the genome by determining the genotypes of one million or more sequence variants, their frequencies and the degree of association between them, in DNA samples from populations with ancestry from parts of Africa, Asia and Europe. The HapMap will allow the discovery of sequence variants that affect common disease, will facilitate development of diagnostic tools, and will enhance our ability to choose targets for therapeutic intervention.

McVean GA. 2002. A genealogical interpretation of linkage disequilibrium. Genetics, 162 (2), pp. 987-991.

The degree of association between alleles at different loci, or linkage disequilibrium, is widely used to infer details of evolutionary processes. Here I explore how associations between alleles relate to properties of the underlying genealogy of sequences. Under the neutral, infinite-sites assumption I show that there is a direct correspondence between the covariance in coalescence times at different parts of the genome and the degree of linkage disequilibrium. These covariances can be calculated exactly under the standard neutral model and by Monte Carlo simulation under different demographic models. I show that the effects of population growth, population bottlenecks, and population structure on linkage disequilibrium can be described through their effects on the covariance in coalescence times.

Reich DE, Schaffner SF, Daly MJ, McVean G, Mullikin JC, Higgins JM, Richter DJ, Lander ES, Altshuler D. 2002. Human genome sequence variation and the influence of gene history, mutation and recombination. Nat Genet, 32 (1), pp. 135-142.

Variation in the human genome sequence is key to understanding susceptibility to disease in modern populations and the history of ancestral populations. Unlocking this information requires knowledge of the patterns and underlying causes of human sequence diversity. By applying a new population-genetic framework to two genome-wide polymorphism surveys, we find that the human genome contains sizeable regions (stretching over tens of thousands of base pairs) that have intrinsically high and low rates of sequence variation. We show that the primary determinant of these patterns is shared genealogical history. Only a fraction of the variation (at most 25%) is due to the local mutation rate. By measuring the average distance over which genealogical histories are typically preserved, these data provide the first genome-wide estimate of the average extent of correlation among variants (linkage disequilibrium). The results are best explained by extreme variability in the recombination rate at a fine scale, and provide the first empirical evidence that such recombination 'hot spots' are a general feature of the human genome and have a principal role in shaping genetic variation in the human population.

Balding DJ, Carothers AD, Marchini JL, Cardon LR, Vetta A, Griffiths B, Weir BS, Hill WG, Goldstein D, Strimmer K et al. 2002. Discussion on the meeting on 'Statistical modelling and analysis of genetic data'. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64 (4), pp. 737-775.

McVean G, Awadalla P, Fearnhead P. 2002. A coalescent-based method for detecting and estimating recombination from gene sequences. Genetics, 160 (3), pp. 1231-1241.

Determining the amount of recombination in the genealogical history of a sample of genes is important to both evolutionary biology and medical population genetics. However, recurrent mutation can produce patterns of genetic diversity similar to those generated by recombination and can bias estimates of the population recombination rate. Hudson (2001) has suggested an approximate-likelihood method based on coalescent theory to estimate the population recombination rate, 4N(e)r, under an infinite-sites model of sequence evolution. Here we extend the method to the estimation of the recombination rate in genomes, such as those of many viruses and bacteria, where the rate of recurrent mutation is high. In addition, we develop a permutation-based method for detecting recombination that is both more powerful than other permutation-based methods and robust to misspecification of the model of sequence evolution. We apply the method to sequence data from viruses, bacteria, and human mitochondrial DNA. The extremely high level of recombination detected in both HIV-1 and HIV-2 sequences demonstrates that recombination cannot be ignored in the analysis of viral population genetic data.

Atkinson RJ, McVean GA, Stone GN. 2002. Use of population genetic data to infer oviposition behaviour: species-specific patterns in four oak gallwasps (Hymenoptera: Cynipidae). Proc Biol Sci, 269 (1489), pp. 383-390.

Many species of oak gallwasp (Hymenoptera: Cynipidae: Cynipini) induce galls containing more than one larva (multilocular galls) on their host plant. To date, it has remained unclear whether multilocular galls result solely from clustered oviposition by a single female, or include the aggregated offspring of several females (multiple founding). We have developed a novel maximum-likelihood approach for use with population genetic data that estimates the number and genotypes of parents contributing to offspring from each gall. We apply this method to allozyme data from multiple populations of four oak gallwasps whose asexual generations develop in multilocular galls (Andricus coriarius, A. lucidus, A. panteli and A. seckendorffi). We find strong evidence for multiple founding in all four species, and show the data to be compatible with multiple founding rather than founding by a single foundress mated with multiple males. The extent of multiple founding differs among species: in A. lucidus and A. seckendorffi most galls are induced by a single female, whereas in A. coriarius and A. panteli over half of the galls sampled were multiple founded. We suggest that variation in levels of multiple founding may be due to consistent ecological differences between the four species.

McVean GA. 2001. What do patterns of genetic variability reveal about mitochondrial recombination? Heredity (Edinb), 87 (Pt 6), pp. 613-620.

Recent claims that patterns of genetic variability in human mitochondria show evidence for recombination have provoked considerable argument and much correspondence concerning the quality of the data, the nature of the analyses, and the biological realism of mitochondrial recombination. While the majority of evidence now points towards a lack of effective recombination, at least in humans, the debate has highlighted how difficult the detection of recombination can be in genomes with unusual mutation processes and complex demographic histories. A major difficulty is the lack of consensus about how to measure linkage disequilibrium. I show that measures differ in the way they treat data that are uninformative about recombination, and that when just those pairwise comparisons that are informative about recombination are used, there is agreement between different statistics. In this light, the significant negative correlation between linkage disequilibrium and distance, in at least some of the data sets, is a real pattern that requires explanation. I discuss whether plausible mutational and selective processes can give rise to such a pattern.

Charlesworth D, Charlesworth B, McVean GA. 2001. Genome sequences and evolutionary biology, a two-way interaction. Trends Ecol Evol, 16 (5), pp. 235-242.

Complete genome sequences are accumulating rapidly, culminating with the announcement of the human genome sequence in February 2001. In addition to cataloguing the diversity of genes and other sequences, genome sequences will provide the first detailed and complete data on gene families and genome organization, including data on evolutionary changes. Reciprocally, evolutionary biology will make important contributions to the efforts to understand functions of genes and other sequences in genomes. Large-scale, detailed and unbiased comparisons between species will illuminate the evolution of genes and genomes, and population genetics methods will enable detection of functionally important genes or sequences, including sequences that have been involved in adaptive changes.

McVean GA, Vieira J. 2001. Inferring parameters of mutation, selection and demography from patterns of synonymous site evolution in Drosophila. Genetics, 157 (1), pp. 245-257.

Selection acting on codon usage can cause patterns of synonymous evolution to deviate considerably from those expected under neutrality. To investigate the quantitative relationship between parameters of mutation, selection, and demography, and patterns of synonymous site divergence, we have developed a novel combination of population genetic models and likelihood methods of phylogenetic sequence analysis. Comparing 50 orthologous gene pairs from Drosophila melanogaster and D. virilis and 27 from D. melanogaster and D. simulans, we show considerable variation between amino acids and genes in the strength of selection acting on codon usage and find evidence for both long-term and short-term changes in the strength of selection between species. Remarkably, D. melanogaster shows no evidence of current selection on codon usage, while its sister species D. simulans experiences only half the selection pressure for codon usage of their common ancestor. We also find evidence for considerable base asymmetries in the rate of mutation, such that the average synonymous mutation rate is 20-30% higher than in noncoding regions. A Bayesian approach is adopted to investigate how accounting for selection on codon usage influences estimates of the parameters of mutation.

McVean G. 2000. Evolutionary genetics: what is driving male mutation? Curr Biol, 10 (22), pp. R834-R835.

In mammals, most new mutations occur in males. But a study of the evolution of a human X to Y chromosomal translocation has revealed a sex bias much lower than previous estimates. Patterns of substitution suggest that differential methylation between male and female germ lines is a key determinant of the mutation rate.

McVean GA, Charlesworth B. 2000. The effects of Hill-Robertson interference between weakly selected mutations on patterns of molecular evolution and variation. Genetics, 155 (2), pp. 929-944.

Associations between selected alleles and the genetic backgrounds on which they are found can reduce the efficacy of selection. We consider the extent to which such interference, known as the Hill-Robertson effect, acting between weakly selected alleles, can restrict molecular adaptation and affect patterns of polymorphism and divergence. In particular, we focus on synonymous-site mutations, considering the fate of novel variants in a two-locus model and the equilibrium effects of interference with multiple loci and reversible mutation. We find that weak selection Hill-Robertson (wsHR) interference can considerably reduce adaptation, e.g., codon bias, and, to a lesser extent, levels of polymorphism, particularly in regions of low recombination. Interference causes the frequency distribution of segregating sites to resemble that expected from more weakly selected mutations and also generates specific patterns of linkage disequilibrium. While the selection coefficients involved are small, the fitness consequences of wsHR interference across the genome can be considerable. We suggest that wsHR interference is an important force in the evolution of nonrecombining genomes and may explain the unexpected constancy of codon bias across species of very different census population sizes, as well as several unusual features of codon usage in Drosophila.

McAllister BF, McVean GA. 2000. Neutral evolution of the sex-determining gene transformer in Drosophila. Genetics, 154 (4), pp. 1711-1720.

The amino acid sequence of the transformer (tra) gene exhibits an extremely rapid rate of evolution among Drosophila species, although the gene performs a critical step in sex determination. These changes in amino acid sequence are the result of either natural selection or neutral evolution. To differentiate between selective and neutral causes of this evolutionary change, analyses of both intraspecific and interspecific patterns of molecular evolution of tra gene sequences are presented. Sequences of 31 tra alleles were obtained from Drosophila americana. Many replacement and silent nucleotide variants are present among the alleles; however, the distribution of this sequence variation is consistent with neutral evolution. Sequence evolution was also examined among six species representative of the genus Drosophila. For most lineages and most regions of the gene, both silent and replacement substitutions have accumulated in a constant, clock-like manner. In exon 3 of D. virilis and D. americana we find evidence for an elevated rate of nonsynonymous substitution, but no statistical support for a greater rate of nonsynonymous relative to synonymous substitutions. Both levels of analysis of the tra sequence suggest that, although the gene is evolving at a rapid pace, these changes are neutral in function.

McVean GA, Hurst GD. 2000. Evolutionary lability of context-dependent codon bias in bacteria. J Mol Evol, 50 (3), pp. 264-275.

In bacteria, synonymous codon usage can be considerably affected by base composition at neighboring sites. Such context-dependent biases may be caused by either selection against specific nucleotide motifs or context-dependent mutation biases. Here we consider the evolutionary conservation of context-dependent codon bias across 11 completely sequenced bacterial genomes. In particular, we focus on two contextual biases previously identified in Escherichia coli; the avoidance of out-of-frame stop codons and AGG motifs. By identifying homologues of E. coli genes, we also investigate the effect of gene expression level in Haemophilus influenzae and Mycoplasma genitalium. We find that while context-dependent codon biases are widespread in bacteria, few are conserved across all species considered. Avoidance of out-of-frame stop codons does not apply to all stop codons or amino acids in E. coli, does not hold for different species, does not increase with gene expression level, and is not relaxed in Mycoplasma spp., in which the canonical stop codon, TGA, is recognized as tryptophan. Avoidance of AGG motifs shows some evolutionary conservation and increases with gene expression level in E. coli, suggestive of the action of selection, but the cause of the bias differs between species. These results demonstrate that strong context-dependent forces, both selective and mutational, operate on synonymous codon usage but that these differ considerably between genomes.

Dilthey AT, Gourraud PA, Mentzer AJ, Cereb N, Iqbal Z, McVean G. 2016. High-Accuracy HLA Type Inference from Whole-Genome Sequencing Data Using Population Reference Graphs. PLoS Comput Biol, 12 (10), pp. e1005151.

Genetic variation at the Human Leucocyte Antigen (HLA) genes is associated with many autoimmune and infectious disease phenotypes, is an important element of the immunological distinction between self and non-self, and shapes immune epitope repertoires. Determining the allelic state of the HLA genes (HLA typing) as a by-product of standard whole-genome sequencing data would therefore be highly desirable and enable the immunogenetic characterization of samples in currently ongoing population sequencing projects. Extensive hyperpolymorphism and sequence similarity between the HLA genes, however, pose problems for accurate read mapping and make HLA type inference from whole-genome sequencing data a challenging problem. We describe how to address these challenges in a Population Reference Graph (PRG) framework. First, we construct a PRG for 46 (mostly HLA) genes and pseudogenes, their genomic context and their characterized sequence variants, integrating a database of over 10,000 known allele sequences. Second, we present a sequence-to-PRG paired-end read mapping algorithm that enables accurate read mapping for the HLA genes. Third, we infer the most likely pair of underlying alleles at G group resolution from the IMGT/HLA database at each locus, employing a simple likelihood framework. We show that HLA*PRG, our algorithm, outperforms existing methods by a wide margin. We evaluate HLA*PRG on six classical class I and class II HLA genes (HLA-A, -B, -C, -DQA1, -DQB1, -DRB1) and on a set of 14 samples (3 samples with 2 x 100bp, 11 samples with 2 x 250bp Illumina HiSeq data). Of 158 alleles tested, we correctly infer 157 alleles (99.4%). We also identify and re-type two erroneous alleles in the original validation data. We conclude that HLA*PRG for the first time achieves accuracies comparable to gold-standard reference methods from standard whole-genome sequencing data, though high computational demands (currently ~30-250 CPU hours per sample) remain a significant challenge to practical application.

Miles A, Iqbal Z, Vauterin P, Pearson R, Campino S, Theron M, Gould K, Mead D, Drury E, O'Brien J et al. 2016. Indels, structural variation, and recombination drive genomic diversity in Plasmodium falciparum. Genome Res, 26 (9), pp. 1288-1299.

The malaria parasite Plasmodium falciparum has a great capacity for evolutionary adaptation to evade host immunity and develop drug resistance. Current understanding of parasite evolution is impeded by the fact that a large fraction of the genome is either highly repetitive or highly variable and thus difficult to analyze using short-read sequencing technologies. Here, we describe a resource of deep sequencing data on parents and progeny from genetic crosses, which has enabled us to perform the first genome-wide, integrated analysis of SNP, indel and complex polymorphisms, using Mendelian error rates as an indicator of genotypic accuracy. These data reveal that indels are exceptionally abundant, being more common than SNPs and thus the dominant mode of polymorphism within the core genome. We use the high density of SNP and indel markers to analyze patterns of meiotic recombination, confirming a high rate of crossover events and providing the first estimates for the rate of non-crossover events and the length of conversion tracts. We observe several instances of meiotic recombination within copy number variants associated with drug resistance, demonstrating a mechanism whereby fitness costs associated with resistance mutations could be compensated and greater phenotypic plasticity could be acquired.

1000 Genomes Project Consortium, Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR. 2015. A global reference for human genetic variation. Nature, 526 (7571), pp. 68-74.

The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.

Taylor JC, Martin HC, Lise S, Broxholme J, Cazier JB, Rimmer A, Kanapin A, Lunter G, Fiddy S, Allan C et al. 2015. Factors influencing success of clinical genome sequencing across a broad spectrum of disorders. Nat Genet, 47 (7), pp. 717-726.

To assess factors influencing the success of whole-genome sequencing for mainstream clinical diagnosis, we sequenced 217 individuals from 156 independent cases or families across a broad spectrum of disorders in whom previous screening had identified no pathogenic variants. We quantified the number of candidate variants identified using different strategies for variant calling, filtering, annotation and prioritization. We found that jointly calling variants across samples, filtering against both local and external databases, deploying multiple annotation tools and using familial transmission above biological plausibility contributed to accuracy. Overall, we identified disease-causing variants in 21% of cases, with the proportion increasing to 34% (23/68) for mendelian disorders and 57% (8/14) in family trios. We also discovered 32 potentially clinically actionable variants in 18 genes unrelated to the referral disorder, although only 4 were ultimately considered reportable. Our results demonstrate the value of genome sequencing for routine clinical diagnosis but also highlight many outstanding challenges.

Dilthey A, Cox C, Iqbal Z, Nelson MR, McVean G. 2015. Improved genome inference in the MHC using a population reference graph. Nat Genet, 47 (6), pp. 682-688.

Although much is known about human genetic variation, such information is typically ignored in assembling new genomes. Instead, reads are mapped to a single reference, which can lead to poor characterization of regions of high sequence or structural diversity. We introduce a population reference graph, which combines multiple reference sequences and catalogs of variation. The genomes of new samples are reconstructed as paths through the graph using an efficient hidden Markov model, allowing for recombination between different haplotypes and additional variants. By applying the method to the 4.5-Mb extended MHC region on human chromosome 6, combining 8 assembled haplotypes, the sequences of known classical HLA alleles and 87,640 SNP variants from the 1000 Genomes Project, we demonstrate using simulations, SNP genotyping, and short-read and long-read data how the method improves the accuracy of genome inference and identifies regions where the current set of reference sequences is substantially incomplete.

Venn O, Turner I, Mathieson I, de Groot N, Bontrop R, McVean G. 2014. Strong male bias drives germline mutation in chimpanzees. Science, 344 (6189), pp. 1272-1275.

Zilversmit MM, Chase EK, Chen DS, Awadalla P, Day KP, McVean G. 2013. Hypervariable antigen genes in malaria have ancient roots. BMC Evol Biol, 13 (1), pp. 110.

BACKGROUND: The var genes of the human malaria parasite Plasmodium falciparum are highly polymorphic loci coding for the erythrocyte membrane proteins 1 (PfEMP1), which are responsible for the cytoadherence of P. falciparum infected red blood cells to the human vasculature. Cytoadhesion, coupled with differential expression of var genes, contributes to virulence and allows the parasite to establish chronic infections by evading detection from the host's immune system. Although studying genetic diversity is a major focus of recent work on the var genes, little is known about the gene family's origin and evolutionary history. RESULTS: Using a novel hidden Markov model-based approach and var sequences assembled from additional isolates and species, we are able to reveal elements of both the early evolution of the var genes as well as recent diversifying events. We compare sequences of the var gene DBLα domains from divergent isolates of P. falciparum (3D7 and HB3), and a closely-related species, Plasmodium reichenowi. We find that the gene family is equally large in P. reichenowi and P. falciparum -- with a minimum of 51 var genes in the P. reichenowi genome (compared to 61 in 3D7 and a minimum of 48 in HB3). In addition, we are able to define large, continuous blocks of homologous sequence among P. falciparum and P. reichenowi var gene DBLα domains. These results reveal that the contemporary structure of the var gene family was present before the divergence of P. falciparum and P. reichenowi, estimated to be between 2.5 and 6 million years ago. We also reveal that recombination has played an important and traceable role in both the establishment, and the maintenance, of diversity in the sequences. CONCLUSIONS: Despite the remarkable diversity and rapid evolution found in these loci within and among P. falciparum populations, the basic structure of these domains and the gene family is surprisingly old and stable. Revealing a common structure as well as conserved sequence among two species also has implications for developing new primate-parasite models for studying the pathology and immunology of falciparum malaria, and for studying the population genetics of var genes and associated virulence phenotypes.

Leffler EM, Gao Z, Pfeifer S, Ségurel L, Auton A, Venn O, Bowden R, Bontrop R, Wall JD, Sella G et al. 2013. Multiple instances of ancient balancing selection shared between humans and chimpanzees. Science, 339 (6127), pp. 1578-1582.

Instances in which natural selection maintains genetic variation in a population over millions of years are thought to be extremely rare. We conducted a genome-wide scan for long-lived balancing selection by looking for combinations of SNPs shared between humans and chimpanzees. In addition to the major histocompatibility complex, we identified 125 regions in which the same haplotypes are segregating in the two species, all but two of which are noncoding. In six cases, there is evidence for an ancestral polymorphism that persisted to the present in humans and chimpanzees. Regions with shared haplotypes are significantly enriched for membrane glycoproteins, and a similar trend is seen among shared coding polymorphisms. These findings indicate that ancient balancing selection has shaped human variation and point to genes involved in host-pathogen interactions as common targets.

Gregory AP, Dendrou CA, Attfield KE, Haghikia A, Xifara DK, Butter F, Poschmann G, Kaur G, Lambert L, Leach OA et al. 2012. TNF receptor 1 genetic risk mirrors outcome of anti-TNF therapy in multiple sclerosis. Nature, 488 (7412), pp. 508-511.

Although there has been much success in identifying genetic variants associated with common diseases using genome-wide association studies (GWAS), it has been difficult to demonstrate which variants are causal and what role they have in disease. Moreover, the modest contribution that these variants make to disease risk has raised questions regarding their medical relevance. Here we have investigated a single nucleotide polymorphism (SNP) in the TNFRSF1A gene, that encodes tumour necrosis factor receptor 1 (TNFR1), which was discovered through GWAS to be associated with multiple sclerosis (MS), but not with other autoimmune conditions such as rheumatoid arthritis, psoriasis and Crohn’s disease. By analysing MS GWAS data in conjunction with the 1000 Genomes Project data we provide genetic evidence that strongly implicates this SNP, rs1800693, as the causal variant in the TNFRSF1A region. We further substantiate this through functional studies showing that the MS risk allele directs expression of a novel, soluble form of TNFR1 that can block TNF. Importantly, TNF-blocking drugs can promote onset or exacerbation of MS, but they have proven highly efficacious in the treatment of autoimmune diseases for which there is no association with rs1800693. This indicates that the clinical experience with these drugs parallels the disease association of rs1800693, and that the MS-associated TNFR1 variant mimics the effect of TNF-blocking drugs. Hence, our study demonstrates that clinical practice can be informed by comparing GWAS across common autoimmune diseases and by investigating the functional consequences of the disease-associated genetic variation.

Mathieson I, McVean G. 2012. Differential confounding of rare and common variants in spatially structured populations. Nat Genet, 44 (3), pp. 243-246.

Well-powered genome-wide association studies, now made possible through advances in technology and large-scale collaborative projects, promise to characterize the contribution of rare variants to complex traits and disease. However, while population structure is a known confounder of association studies, it remains unknown whether methods developed to control stratification are equally effective for rare variants. Here, we demonstrate that rare variants can show a stratification that is systematically different from, and typically stronger than, common variants, and this is not necessarily corrected by existing methods. We show that the same process leads to inflation for load-based tests and can obscure signals at truly associated variants. Furthermore, we show that populations can display spatial structure in rare variants, even when Wright's fixation index F(ST) is low, but that allele frequency-dependent metrics of allele sharing can reveal localized stratification. These results underscore the importance of collecting and integrating spatial information in the genetic analysis of complex traits.

Iqbal Z, Caccamo M, Turner I, Flicek P, McVean G. 2012. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat Genet, 44 (2), pp. 226-232.

Detecting genetic variants that are highly divergent from a reference sequence remains a major challenge in genome sequencing. We introduce de novo assembly algorithms using colored de Bruijn graphs for detecting and genotyping simple and complex genetic variants in an individual or population. We provide an efficient software implementation, Cortex, the first de novo assembler capable of assembling multiple eukaryotic genomes simultaneously. Four applications of Cortex are presented. First, we detect and validate both simple and complex structural variations in a high-coverage human genome. Second, we identify more than 3 Mb of sequence absent from the human reference genome, in pooled low-coverage population sequence data from the 1000 Genomes Project. Third, we show how population information from ten chimpanzees enables accurate variant calls without a reference sequence. Last, we estimate classical human leukocyte antigen (HLA) genotypes at HLA-B, the most variable gene in the human genome.
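To make the idea of a colored de Bruijn graph concrete, here is a minimal sketch in which each k-mer is labelled with the set of samples ("colors") that contain it. It is a toy illustration only, not the Cortex implementation: it ignores reverse complements, sequencing errors and coverage, and the function names and sequences are hypothetical.

```python
from collections import defaultdict


def build_colored_graph(samples, k):
    """Map each k-mer to the set of sample names (colors) that contain it.

    samples: dict of sample name -> list of sequences.
    Edges are implicit: two k-mers are adjacent when they overlap by k-1 bases.
    """
    colors = defaultdict(set)
    for name, seqs in samples.items():
        for seq in seqs:
            for i in range(len(seq) - k + 1):
                colors[seq[i:i + k]].add(name)
    return colors


def private_kmers(colors, name):
    """Return k-mers seen only in the given sample."""
    return {kmer for kmer, seen_in in colors.items() if seen_in == {name}}


# Two toy sequences differing by a single base (A -> T at position 5).
samples = {
    "ref": ["ACGTACGGA"],
    "sample": ["ACGTTCGGA"],
}
graph = build_colored_graph(samples, k=4)
sample_only = private_kmers(graph, "sample")
```

In this toy case the k-mers shared by both colors flank the difference, while the k-mers private to each color trace the two diverging paths between them. Locating such diverging paths is, loosely, how variants can be found without mapping reads to a reference.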

Auton A, Fledel-Alon A, Pfeifer S, Venn O, Ségurel L, Street T, Leffler EM, Bowden R, Aneas I, Broxholme J et al. 2012. A fine-scale chimpanzee genetic map from population sequencing. Science, 336 (6078), pp. 193-198.

To study the evolution of recombination rates in apes, we developed methodology to construct a fine-scale genetic map from high-throughput sequence data from 10 Western chimpanzees, Pan troglodytes verus. Compared to the human genetic map, broad-scale recombination rates tend to be conserved, but with exceptions, particularly in regions of chromosomal rearrangements and around the site of ancestral fusion in human chromosome 2. At fine scales, chimpanzee recombination is dominated by hotspots, which show no overlap with those of humans even though rates are similarly elevated around CpG islands and decreased within genes. The hotspot-specifying protein PRDM9 shows extensive variation among Western chimpanzees, and there is little evidence that any sequence motifs are enriched in hotspots. The contrasting locations of hotspots provide a natural experiment, which demonstrates the impact of recombination on base composition.

1000 Genomes Project Consortium, Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature, 491 (7422), pp. 56-65.

By characterizing the geographic and functional spectrum of human genetic variation, the 1000 Genomes Project aims to build a resource to help to understand the genetic contribution to disease. Here we describe the genomes of 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing. By developing methods to integrate information across several algorithms and diverse data sources, we provide a validated haplotype map of 38 million single nucleotide polymorphisms, 1.4 million short insertions and deletions, and more than 14,000 larger deletions. We show that individuals from different populations carry different profiles of rare and common variants, and that low-frequency variants show substantial geographic differentiation, which is further increased by the action of purifying selection. We show that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites. This resource, which captures up to 98% of accessible single nucleotide polymorphisms at a frequency of 1% in related populations, enables analysis of common and low-frequency variants in individuals from diverse, including admixed, populations.

Mapping genetic substructure among autoimmune disorders from routine healthcare data

The analysis of genomic data linked to routine healthcare and self-reported information offers huge potential to learn about the causes, consequences and potential for intervention in complex disease.  However, there exist many sources of bias, noise and heterogeneity within such data that complicate their analysis compared to clinically-recruited cohorts.  In this project, we will investigate whether genetic information can be used to assess the veracity of routinely derived data and to derive ...
