
Professor Jonathan Marchini

Research Area: Genetics and Genomics
Scientific Themes: Genetics & Genomics
Keywords: Statistical Genetics, Genome-wide Association Studies, Computationally Intensive Statistics, Bayesian Statistics and Image Analysis

The main focus of my research is the development of statistical methods for the localization and detection of disease genes in genome-wide association studies. These studies consist of measurements on thousands of individuals at up to 1 million locations throughout the genome. We aim to develop powerful methods that can extract the signal of association but at the same time account for the many confounding factors that affect these studies. Recently this has involved working on genotype calling algorithms, detection of copy number variants, genotype imputation and haplotype phase inference, fine mapping, non-parametric association tests, detection of gene-gene interactions and algorithms for the detection and characterization of population structure. Much of this work has been stimulated by my involvement as an analysis group member of the International HapMap Project and the Wellcome Trust Case Control Consortium. I also have a continuing interest in spatio-temporal statistics applied to the area of functional MRI of the brain in collaboration with the Oxford Centre for Functional Magnetic Resonance Imaging of the Brain.

Name: Professor Simon Myers
Department: Wellcome Trust Centre for Human Genetics
Institution: Oxford University, Henry Wellcome Building of Genomic Medicine
Country: United Kingdom
Justice AE, Winkler TW, Feitosa MF, Graff M, Fisher VA, Young K, Barata L, Deng X, Czajkowski J, Hadley D et al. 2017. Genome-wide meta-analysis of 241,258 adults accounting for smoking behaviour identifies novel loci for obesity traits. Nat Commun, 8 pp. 14977.

Few genome-wide association studies (GWAS) account for environmental exposures, like smoking, potentially impacting the overall trait variance when investigating the genetic contribution to obesity-related traits. Here, we use GWAS data from 51,080 current smokers and 190,178 nonsmokers (87% European descent) to identify loci influencing BMI and central adiposity, measured as waist circumference and waist-to-hip ratio both adjusted for BMI. We identify 23 novel genetic loci, and 9 loci with convincing evidence of gene-smoking interaction (GxSMK) on obesity-related traits. We show consistent direction of effect for all identified loci and significance for 18 novel and for 5 interaction loci in an independent study sample. These loci highlight novel biological functions, including response to oxidative stress, addictive behaviour, and regulatory functions emphasizing the importance of accounting for environment in genetic analyses. Our results suggest that tobacco smoking may alter the genetic susceptibility to overall adiposity and body fat distribution.

Cai N, Bigdeli TB, Kretzschmar WW, Li Y, Liang J, Hu J, Peterson RE, Bacanu S, Webb BT, Riley B et al. 2017. 11,670 whole-genome sequences representative of the Han Chinese population from the CONVERGE project. Sci Data, 4 pp. 170011.

The China, Oxford and Virginia Commonwealth University Experimental Research on Genetic Epidemiology (CONVERGE) project on Major Depressive Disorder (MDD) sequenced 11,670 female Han Chinese at low-coverage (1.7X), providing the first large-scale whole genome sequencing resource representative of the largest ethnic group in the world. Samples are collected from 58 hospitals from 23 provinces around China. We are able to call 22 million high quality single nucleotide polymorphisms (SNP) from the nuclear genome, representing the largest SNP call set from an East Asian population to date. We use these variants for imputation of genotypes across all samples, and this has allowed us to perform a successful genome wide association study (GWAS) on MDD. The utility of these data can be extended to studies of genetic ancestry in the Han Chinese and evolutionary genetics when integrated with data from other populations. Molecular phenotypes, such as copy number variations and structural variations can be detected, quantified and analysed in similar ways.

Astle WJ, Elding H, Jiang T, Allen D, Ruklisa D, Mann AL, Mead D, Bouman H, Riveros-Mckay F, Kostadima MA et al. 2016. The Allelic Landscape of Human Blood Cell Trait Variation and Links to Common Complex Disease. Cell, 167 (5), pp. 1415-1429.e19.

Many common variants have been associated with hematological traits, but identification of causal genes and pathways has proven challenging. We performed a genome-wide association analysis in the UK Biobank and INTERVAL studies, testing 29.5 million genetic variants for association with 36 red cell, white cell, and platelet properties in 173,480 European-ancestry participants. This effort yielded hundreds of low frequency (<5%) and rare (<1%) variants with a strong impact on blood cell phenotypes. Our data highlight general properties of the allelic architecture of complex traits, including the proportion of the heritable component of each blood trait explained by the polygenic signal across different genome regulatory domains. Finally, through Mendelian randomization, we provide evidence of shared genetic pathways linking blood cell indices with complex pathologies, including autoimmune diseases, schizophrenia, and coronary heart disease and evidence suggesting previously reported population associations between blood cell indices and cardiovascular disease may be non-causal.

Amos CI, Dennis J, Wang Z, Byun J, Schumacher FR, Gayther SA, Casey G, Hunter DJ, Sellers TA, Gruber SB et al. 2017. The OncoArray Consortium: A Network for Understanding the Genetic Architecture of Common Cancers. Cancer Epidemiol Biomarkers Prev, 26 (1), pp. 126-135.

BACKGROUND: Common cancers develop through a multistep process often including inherited susceptibility. Collaboration among multiple institutions, and funding from multiple sources, has allowed the development of an inexpensive genotyping microarray, the OncoArray. The array includes a genome-wide backbone, comprising 230,000 SNPs tagging most common genetic variants, together with dense mapping of known susceptibility regions, rare variants from sequencing experiments, pharmacogenetic markers, and cancer-related traits. METHODS: The OncoArray can be genotyped using a novel technology developed by Illumina to facilitate efficient genotyping. The consortium developed standard approaches for selecting SNPs for study, for quality control of markers, and for ancestry analysis. The array was genotyped at selected sites and with prespecified replicate samples to permit evaluation of genotyping accuracy among centers and by ethnic background. RESULTS: The OncoArray consortium genotyped 447,705 samples. A total of 494,763 SNPs passed quality control steps with a sample success rate of 97% of the samples. Participating sites performed ancestry analysis using a common set of markers and a scoring algorithm based on principal components analysis. CONCLUSIONS: Results from these analyses will enable researchers to identify new susceptibility loci, perform fine-mapping of new or known loci associated with either single or multiple cancers, assess the degree of overlap in cancer causation and pleiotropic effects of loci that have been identified for disease-specific risk, and jointly model genetic, environmental, and lifestyle-related exposures. IMPACT: Ongoing analyses will shed light on etiology and risk assessment for many types of cancer. Cancer Epidemiol Biomarkers Prev; 26(1); 126-35. ©2016 AACR.

Hore V, Viñuela A, Buil A, Knight J, McCarthy MI, Small K, Marchini J. 2016. Tensor decomposition for multiple-tissue gene expression experiments. Nat Genet, 48 (9), pp. 1094-1100.

Genome-wide association studies of gene expression traits and other cellular phenotypes have successfully identified links between genetic variation and biological processes. The majority of discoveries have uncovered cis-expression quantitative trait locus (eQTL) effects via mass univariate testing of SNPs against gene expression in single tissues. Here we present a Bayesian method for multiple-tissue experiments focusing on uncovering gene networks linked to genetic variation. Our method decomposes the 3D array (or tensor) of gene expression measurements into a set of latent components. We identify sparse gene networks that can then be tested for association against genetic variation across the genome. We apply our method to a data set of 845 individuals from the TwinsUK cohort with gene expression measured via RNA-seq analysis in adipose, lymphoblastoid cell lines (LCLs) and skin. We uncover several gene networks with a genetic basis and clear biological and statistical significance. Extensions of this approach will allow integration of different omics, environmental and phenotypic data sets.

McCarthy S, Das S, Kretzschmar W, Delaneau O, Wood AR, Teumer A, Kang HM, Fuchsberger C, Danecek P, Sharp K et al. 2016. A reference panel of 64,976 haplotypes for genotype imputation. Nat Genet, 48 (10), pp. 1279-1283.

We describe a reference panel of 64,976 human haplotypes at 39,235,157 SNPs constructed using whole-genome sequence data from 20 studies of predominantly European ancestry. Using this resource leads to accurate genotype imputation at minor allele frequencies as low as 0.1% and a large increase in the number of SNPs tested in association studies, and it can help to discover and refine causal loci. We describe remote server resources that allow researchers to carry out imputation and phasing consistently and efficiently.

van der Laan SW, Fall T, Soumaré A, Teumer A, Sedaghat S, Baumert J, Zabaneh D, van Setten J, Isgum I, Galesloot TE et al. 2016. Cystatin C and Cardiovascular Disease: A Mendelian Randomization Study. J Am Coll Cardiol, 68 (9), pp. 934-945.

BACKGROUND: Epidemiological studies show that high circulating cystatin C is associated with risk of cardiovascular disease (CVD), independent of creatinine-based renal function measurements. It is unclear whether this relationship is causal, arises from residual confounding, and/or is a consequence of reverse causation. OBJECTIVES: The aim of this study was to use Mendelian randomization to investigate whether cystatin C is causally related to CVD in the general population. METHODS: We incorporated participant data from 16 prospective cohorts (n = 76,481) with 37,126 measures of cystatin C and added genetic data from 43 studies (n = 252,216) with 63,292 CVD events. We used the common variant rs911119 in CST3 as an instrumental variable to investigate the causal role of cystatin C in CVD, including coronary heart disease, ischemic stroke, and heart failure. RESULTS: Cystatin C concentrations were associated with CVD risk after adjusting for age, sex, and traditional risk factors (relative risk: 1.82 per doubling of cystatin C; 95% confidence interval [CI]: 1.56 to 2.13; p = 2.12 × 10(-14)). The minor allele of rs911119 was associated with decreased serum cystatin C (6.13% per allele; 95% CI: 5.75 to 6.50; p = 5.95 × 10(-211)), explaining 2.8% of the observed variation in cystatin C. Mendelian randomization analysis did not provide evidence for a causal role of cystatin C, with a causal relative risk for CVD of 1.00 per doubling cystatin C (95% CI: 0.82 to 1.22; p = 0.994), which was statistically different from the observational estimate (p = 1.6 × 10(-5)). A causal effect of cystatin C was not detected for any individual component of CVD. CONCLUSIONS: Mendelian randomization analyses did not support a causal role of cystatin C in the etiology of CVD. As such, therapeutics targeted at lowering circulating cystatin C are unlikely to be effective in preventing CVD.
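
The single-instrument reasoning in the abstract above can be illustrated with a Wald ratio, a textbook Mendelian randomization estimator: the variant-outcome effect divided by the variant-exposure effect. The sketch below is a minimal illustration under that assumption and is not necessarily the estimator used in the study. The function name `wald_ratio` and all input values are hypothetical and are not taken from the paper.

```python
from typing import Tuple


def wald_ratio(beta_zx: float, beta_zy: float, se_zy: float) -> Tuple[float, float]:
    """Single-instrument Wald ratio estimate of the causal effect of X on Y.

    beta_zx: effect of the genetic variant on the exposure (assumed known).
    beta_zy: effect of the genetic variant on the outcome.
    se_zy:   standard error of beta_zy.

    The standard error uses the first-order approximation that ignores
    uncertainty in beta_zx.
    """
    if beta_zx == 0:
        raise ValueError("the instrument must be associated with the exposure")
    estimate = beta_zy / beta_zx
    std_err = se_zy / abs(beta_zx)
    return estimate, std_err


# Hypothetical inputs for illustration only (not values from the study).
est, se = wald_ratio(beta_zx=0.5, beta_zy=0.25, se_zy=0.05)
low, high = est - 1.96 * se, est + 1.96 * se
print(f"causal estimate {est:.3f} (95% CI {low:.3f} to {high:.3f})")
```

A confidence interval that covers the null value while excluding the observational estimate is the pattern the abstract reports for cystatin C.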

Sharp K, Kretzschmar W, Delaneau O, Marchini J. 2016. Phasing for medical sequencing using rare variants and large haplotype reference panels. Bioinformatics, 32 (13), pp. 1974-1980.

MOTIVATION: There is growing recognition that estimating haplotypes from high coverage sequencing of single samples in clinical settings is an important problem. At the same time very large datasets consisting of tens and hundreds of thousands of high-coverage sequenced samples will soon be available. We describe a method that takes advantage of these huge human genetic variation resources and rare variant sharing patterns to estimate haplotypes on single sequenced samples. Sharing rare variants between two individuals is more likely to arise from a recent common ancestor and, hence, also more likely to indicate similar shared haplotypes over a substantial flanking region of sequence. RESULTS: Our method exploits this idea to select a small set of highly informative copying states within a Hidden Markov Model (HMM) phasing algorithm. Using rare variants in this way allows us to avoid iterative MCMC methods to infer haplotypes. Compared to other approaches that do not explicitly use rare variants we obtain significant gains in phasing accuracy, less variation over phasing runs and improvements in speed. For example, using a reference panel of 7420 haplotypes from the UK10K project, we are able to reduce switch error rates by up to 50% when phasing samples sequenced at high-coverage. In addition, a single step rephasing of the UK10K panel, using rare variant information, has a downstream impact on phasing performance. These results represent a proof of concept that rare variant sharing patterns can be utilized to phase large high-coverage sequencing studies such as the 100 000 Genomes Project dataset. AVAILABILITY AND IMPLEMENTATION: A webserver that includes an implementation of this new method and allows phasing of high-coverage clinical samples is available at https://phasingserver.stats.ox.ac.uk/ CONTACT: marchini@stats.ox.ac.uk SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

O'Connell J, Sharp K, Shrine N, Wain L, Hall I, Tobin M, Zagury JF, Delaneau O, Marchini J. 2016. Haplotype estimation for biobank-scale data sets. Nat Genet, 48 (7), pp. 817-820.

The UK Biobank (UKB) has recently released genotypes on 152,328 individuals together with extensive phenotypic and lifestyle information. We present a new phasing method, SHAPEIT3, that can handle such biobank-scale data sets and results in switch error rates as low as ∼0.3%. The method exhibits O(NlogN) scaling with sample size N, enabling fast and accurate phasing of even larger cohorts.
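
Switch error rate, the accuracy measure quoted above, is commonly defined as the fraction of consecutive heterozygous site pairs at which the inferred phase flips relative to the true phase. The following is a minimal sketch of that common definition for a single individual, not code from SHAPEIT3. The function name `switch_error_rate` and the toy haplotypes are hypothetical, and the sketch assumes the inferred genotypes agree with the truth at every site.

```python
from typing import Sequence, Tuple

Haplotypes = Tuple[Sequence[int], Sequence[int]]


def switch_error_rate(truth: Haplotypes, inferred: Haplotypes) -> float:
    """Fraction of adjacent heterozygous site pairs where the phase switches.

    Each argument is a pair of haplotypes coded as 0/1 alleles per site.
    Assumes inferred and true genotypes match at every site.
    """
    t1, t2 = truth
    i1, _ = inferred
    # At each heterozygous site, record whether inferred haplotype 1
    # carries the same allele as true haplotype 1.
    orientation = [i1[k] == t1[k] for k in range(len(t1)) if t1[k] != t2[k]]
    if len(orientation) < 2:
        return 0.0  # no adjacent heterozygous pair, so no switch is possible
    switches = sum(a != b for a, b in zip(orientation, orientation[1:]))
    return switches / (len(orientation) - 1)


# Toy example (hypothetical data): four heterozygous sites, one phase switch.
truth = ([0, 1, 0, 1, 1], [1, 0, 1, 0, 1])
inferred = ([0, 1, 1, 0, 1], [1, 0, 0, 1, 1])
print(switch_error_rate(truth, inferred))
```

Swapping the two inferred haplotypes leaves the result unchanged, because only changes in orientation between adjacent heterozygous sites are counted.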

Dahl A, Iotchkova V, Baud A, Johansson Å, Gyllensten U, Soranzo N, Mott R, Kranis A, Marchini J. 2016. A multiple-phenotype imputation method for genetic studies. Nat Genet, 48 (4), pp. 466-472.

Genetic association studies have yielded a wealth of biological discoveries. However, these studies have mostly analyzed one trait and one SNP at a time, thus failing to capture the underlying complexity of the data sets. Joint genotype-phenotype analyses of complex, high-dimensional data sets represent an important way to move beyond simple genome-wide association studies (GWAS) with great potential. The move to high-dimensional phenotypes will raise many new statistical problems. Here we address the central issue of missing phenotypes in studies with any level of relatedness between samples. We propose a multiple-phenotype mixed model and use a computationally efficient variational Bayesian algorithm to fit the model. On a variety of simulated and real data sets from a range of organisms and trait types, we show that our method outperforms existing state-of-the-art methods from the statistics and machine learning literature and can boost signals of association.

Soler Artigas M, Wain LV, Miller S, Kheirallah AK, Huffman JE, Ntalla I, Shrine N, Obeidat M, Trochet H, McArdle WL et al. 2015. Sixteen new lung function signals identified through 1000 Genomes Project reference panel imputation. Nat Commun, 6 pp. 8658.

Lung function measures are used in the diagnosis of chronic obstructive pulmonary disease. In 38,199 European ancestry individuals, we studied genome-wide association of forced expiratory volume in 1 s (FEV1), forced vital capacity (FVC) and FEV1/FVC with 1000 Genomes Project (phase 1)-imputed genotypes and followed up top associations in 54,550 Europeans. We identify 14 novel loci (P<5 × 10(-8)) in or near ENSA, RNU5F-1, KCNS3, AK097794, ASTN2, LHX3, CCDC91, TBX3, TRIP11, RIN3, TEKT5, LTBP4, MN1 and AP1S2, and two novel signals at known loci NPNT and GPR126, providing a basis for new understanding of the genetic determinants of these traits and pulmonary diseases in which they are altered.

UK10K Consortium, Walter K, Min JL, Huang J, Crooks L, Memari Y, McCarthy S, Perry JR, Xu C, Futema M et al. 2015. The UK10K project identifies rare variants in health and disease. Nature, 526 (7571), pp. 82-90.

The contribution of rare and low-frequency variants to human traits is largely unexplored. Here we describe insights from sequencing whole genomes (low read depth, 7×) or exomes (high read depth, 80×) of nearly 10,000 individuals from population-based and disease collections. In extensively phenotyped cohorts we characterize over 24 million novel sequence variants, generate a highly accurate imputation reference panel and identify novel alleles associated with levels of triglycerides (APOB), adiponectin (ADIPOQ) and low-density lipoprotein cholesterol (LDLR and RGAG1) from single-marker and rare variant aggregation tests. We describe population structure and functional annotation of rare and low-frequency variants, use the data to estimate the benefits of sequencing for association studies, and summarize lessons from disease-specific collections. Finally, we make available an extensive resource, including individual-level genetic and phenotypic data and web-based tools to facilitate the exploration of association results.

1000 Genomes Project Consortium, Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR. 2015. A global reference for human genetic variation. Nature, 526 (7571), pp. 68-74.

The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.

Wain LV, Shrine N, Miller S, Jackson VE, Ntalla I, Soler Artigas M, Billington CK, Kheirallah AK, Allen R, Cook JP et al. 2015. Novel insights into the genetics of smoking behaviour, lung function, and chronic obstructive pulmonary disease (UK BiLEVE): a genetic association study in UK Biobank. Lancet Respir Med, 3 (10), pp. 769-781.

BACKGROUND: Understanding the genetic basis of airflow obstruction and smoking behaviour is key to determining the pathophysiology of chronic obstructive pulmonary disease (COPD). We used UK Biobank data to study the genetic causes of smoking behaviour and lung health. METHODS: We sampled individuals of European ancestry from UK Biobank, from the middle and extremes of the forced expiratory volume in 1 s (FEV1) distribution among heavy smokers (mean 35 pack-years) and never smokers. We developed a custom array for UK Biobank to provide optimum genome-wide coverage of common and low-frequency variants, dense coverage of genomic regions already implicated in lung health and disease, and to assay rare coding variants relevant to the UK population. We investigated whether there were shared genetic causes between different phenotypes defined by extremes of FEV1. We also looked for novel variants associated with extremes of FEV1 and smoking behaviour and assessed regions of the genome that had already shown evidence for a role in lung health and disease. We set genome-wide significance at p<5 × 10(-8). FINDINGS: UK Biobank participants were recruited from March 15, 2006, to July 7, 2010. Sample selection for the UK BiLEVE study started on Nov 22, 2012, and was completed on Dec 20, 2012. We selected 50,008 unique samples: 10,002 individuals with low FEV1, 10,000 with average FEV1, and 5002 with high FEV1 from each of the heavy smoker and never smoker groups. We noted a substantial sharing of genetic causes of low FEV1 between heavy smokers and never smokers (p=2.29 × 10(-16)) and between individuals with and without doctor-diagnosed asthma (p=6.06 × 10(-11)). We discovered six novel genome-wide significant signals of association with extremes of FEV1, including signals at four novel loci (KANSL1, TSEN54, TET2, and RBM19/TBX5) and independent signals at two previously reported loci (NPNT and HLA-DQB1/HLA-DQA2). 
These variants also showed association with COPD, including in individuals with no history of smoking. The number of copies of a 150 kb region containing the 5' end of KANSL1, a gene that is important for epigenetic gene regulation, was associated with extremes of FEV1. We also discovered five new genome-wide significant signals for smoking behaviour, including a variant in NCAM1 (chromosome 11) and a variant on chromosome 2 (between TEX41 and PABPC1P2) that has a trans effect on expression of NCAM1 in brain tissue. INTERPRETATION: By sampling from the extremes of the lung function distribution in UK Biobank, we identified novel genetic causes of lung function and smoking behaviour. These results provide new insight into the specific mechanisms underlying airflow obstruction, COPD, and tobacco addiction, and show substantial shared genetic architecture underlying airflow obstruction across individuals, irrespective of smoking behaviour and other airway disease. FUNDING: Medical Research Council.

Huang J, Howie B, McCarthy S, Memari Y, Walter K, Min JL, Danecek P, Malerba G, Trabetti E, Zheng HF et al. 2015. Improved imputation of low-frequency and rare variants using the UK10K haplotype reference panel. Nat Commun, 6 pp. 8111.

Imputing genotypes from reference panels created by whole-genome sequencing (WGS) provides a cost-effective strategy for augmenting the single-nucleotide polymorphism (SNP) content of genome-wide arrays. The UK10K Cohorts project has generated a data set of 3,781 whole genomes sequenced at low depth (average 7x), aiming to exhaustively characterize genetic variation down to 0.1% minor allele frequency in the British population. Here we demonstrate the value of this resource for improving imputation accuracy at rare and low-frequency variants in both a UK and an Italian population. We show that large increases in imputation accuracy can be achieved by re-phasing WGS reference panels after initial genotype calling. We also present a method for combining WGS panels to improve variant coverage and downstream imputation accuracy, which we illustrate by integrating 7,562 WGS haplotypes from the UK10K project with 2,184 haplotypes from the 1000 Genomes Project. Finally, we introduce a novel approximation that maintains speed without sacrificing imputation accuracy for rare variants.

CONVERGE consortium. 2015. Sparse whole-genome sequencing identifies two loci for major depressive disorder. Nature, 523 (7562), pp. 588-591.

Major depressive disorder (MDD), one of the most frequently encountered forms of mental illness and a leading cause of disability worldwide, poses a major challenge to genetic analysis. To date, no robustly replicated genetic loci have been identified, despite analysis of more than 9,000 cases. Here, using low-coverage whole-genome sequencing of 5,303 Chinese women with recurrent MDD selected to reduce phenotypic heterogeneity, and 5,337 controls screened to exclude MDD, we identified, and subsequently replicated in an independent sample, two loci contributing to risk of MDD on chromosome 10: one near the SIRT1 gene (P = 2.53 × 10(-10)), the other in an intron of the LHPP gene (P = 6.45 × 10(-12)). Analysis of 4,509 cases with a severe subtype of MDD, melancholia, yielded an increased genetic signal at the SIRT1 locus. We attribute our success to the recruitment of relatively homogeneous cases with severe illness.

Schmidts M, Hou Y, Cortés CR, Mans DA, Huber C, Boldt K, Patel M, van Reeuwijk J, Plaza JM, van Beersum SE et al. 2015. TCTEX1D2 mutations underlie Jeune asphyxiating thoracic dystrophy with impaired retrograde intraflagellar transport. Nat Commun, 6 pp. 7074.

The analysis of individuals with ciliary chondrodysplasias can shed light on sensitive mechanisms controlling ciliogenesis and cell signalling that are essential to embryonic development and survival. Here we identify TCTEX1D2 mutations causing Jeune asphyxiating thoracic dystrophy with partially penetrant inheritance. Loss of TCTEX1D2 impairs retrograde intraflagellar transport (IFT) in humans and the protist Chlamydomonas, accompanied by destabilization of the retrograde IFT dynein motor. We thus define TCTEX1D2 as an integral component of the evolutionarily conserved retrograde IFT machinery. In complex with several IFT dynein light chains, it is required for correct vertebrate skeletal formation but may be functionally redundant under certain conditions.

Cai N, Chang S, Li Y, Li Q, Hu J, Liang J, Song L, Kretzschmar W, Gan X, Nicod J et al. 2015. Molecular signatures of major depression. Curr Biol, 25 (9), pp. 1146-1156.

Adversity, particularly in early life, can cause illness. Clues to the responsible mechanisms may lie with the discovery of molecular signatures of stress, some of which include alterations to an individual's somatic genome. Here, using genome sequences from 11,670 women, we observed a highly significant association between a stress-related disease, major depression, and the amount of mtDNA (p = 9.00 × 10(-42), odds ratio 1.33 [95% confidence interval [CI] = 1.29-1.37]) and telomere length (p = 2.84 × 10(-14), odds ratio 0.85 [95% CI = 0.81-0.89]). While both telomere length and mtDNA amount were associated with adverse life events, conditional regression analyses showed the molecular changes were contingent on the depressed state. We tested this hypothesis with experiments in mice, demonstrating that stress causes both molecular changes, which are partly reversible and can be elicited by the administration of corticosterone. Together, these results demonstrate that changes in the amount of mtDNA and telomere length are consequences of stress and entering a depressed state. These findings identify increased amounts of mtDNA as a molecular marker of MD and have important implications for understanding how stress causes the disease.

Martin HC, Christ R, Hussin JG, O'Connell J, Gordon S, Mbarek H, Hottenga JJ, McAloney K, Willemsen G, Gasparini P et al. 2015. Multicohort analysis of the maternal age effect on recombination. Nat Commun, 6 pp. 7846.

Several studies have reported that the number of crossovers increases with maternal age in humans, but others have found the opposite. Resolving the true effect has implications for understanding the maternal age effect on aneuploidies. Here, we revisit this question in the largest sample to date using single nucleotide polymorphism (SNP)-chip data, comprising over 6,000 meioses from nine cohorts. We develop and fit a hierarchical model to allow for differences between cohorts and between mothers. We estimate that over 10 years, the expected number of maternal crossovers increases by 2.1% (95% credible interval (0.98%, 3.3%)). Our results are not consistent with the larger positive and negative effects previously reported in smaller cohorts. We see heterogeneity between cohorts that is likely due to chance effects in smaller samples, or possibly to confounders, emphasizing that care should be taken when interpreting results from any specific cohort about the effect of maternal age on recombination.

Taylor PN, Porcu E, Chew S, Campbell PJ, Traglia M, Brown SJ, Mullin BH, Shihab HA, Min J, Walter K et al. 2015. Whole-genome sequence-based analysis of thyroid function. Nat Commun, 6 pp. 5681.

Normal thyroid function is essential for health, but its genetic architecture remains poorly understood. Here, for the heritable thyroid traits thyrotropin (TSH) and free thyroxine (FT4), we analyse whole-genome sequence data from the UK10K project (N=2,287). Using additional whole-genome sequence and deeply imputed data sets, we report meta-analysis results for common variants (MAF≥1%) associated with TSH and FT4 (N=16,335). For TSH, we identify a novel variant in SYN2 (MAF=23.5%, P=6.15 × 10(-9)) and a new independent variant in PDE8B (MAF=10.4%, P=5.94 × 10(-14)). For FT4, we report a low-frequency variant near B4GALT6/SLC25A52 (MAF=3.2%, P=1.27 × 10(-9)) tagging a rare TTR variant (MAF=0.4%, P=2.14 × 10(-11)). All common variants explain ≥20% of the variance in TSH and FT4. Analysis of rare variants (MAF<1%) using sequence kernel association testing reveals a novel association with FT4 in NRG1. Our results demonstrate that increased coverage in whole-genome sequence association studies identifies novel variants associated with thyroid function.

Delaneau O, Marchini J, 1000 Genomes Project Consortium. 2014. Integrating sequence and array data to create an improved 1000 Genomes Project haplotype reference panel. Nat Commun, 5 pp. 3934.

A major use of the 1000 Genomes Project (1000 GP) data is genotype imputation in genome-wide association studies (GWAS). Here we develop a method to estimate haplotypes from low-coverage sequencing data that can take advantage of single-nucleotide polymorphism (SNP) microarray genotypes on the same samples. First the SNP array data are phased to build a backbone (or 'scaffold') of haplotypes across each chromosome. We then phase the sequence data 'onto' this haplotype scaffold. This approach can take advantage of relatedness between sequenced and non-sequenced samples to improve accuracy. We use this method to create a new 1000 GP haplotype reference set for use by the human genetic community. Using a set of validation genotypes at SNP and bi-allelic indels we show that these haplotypes have lower genotype discordance and improved imputation performance into downstream GWAS samples, especially at low-frequency variants.

O'Connell J, Gurdasani D, Delaneau O, Pirastu N, Ulivi S, Cocca M, Traglia M, Huang J, Huffman JE, Rudan I et al. 2014. A general approach for haplotype phasing across the full spectrum of relatedness. PLoS Genet, 10 (4), pp. e1004234.

Many existing cohorts contain a range of relatedness between genotyped individuals, either by design or by chance. Haplotype estimation in such cohorts is a central step in many downstream analyses. Using genotypes from six cohorts from isolated populations and two cohorts from non-isolated populations, we have investigated the performance of different phasing methods designed for nominally 'unrelated' individuals. We find that SHAPEIT2 produces much lower switch error rates in all cohorts compared to other methods, including those designed specifically for isolated populations. In particular, when large amounts of IBD sharing is present, SHAPEIT2 infers close to perfect haplotypes. Based on these results we have developed a general strategy for phasing cohorts with any level of implicit or explicit relatedness between individuals. First SHAPEIT2 is run ignoring all explicit family information. We then apply a novel HMM method (duoHMM) to combine the SHAPEIT2 haplotypes with any family information to infer the inheritance pattern of each meiosis at all sites across each chromosome. This allows the correction of switch errors, detection of recombination events and genotyping errors. We show that the method detects numbers of recombination events that align very well with expectations based on genetic maps, and that it infers far fewer spurious recombination events than Merlin. The method can also detect genotyping errors and infer recombination events in otherwise uninformative families, such as trios and duos. The detected recombination events can be used in association scans for recombination phenotypes. The method provides a simple and unified approach to haplotype estimation, that will be of interest to researchers in the fields of human, animal and plant genetics.
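The switch error rate used above to compare phasing methods can be illustrated with a minimal sketch. It assumes the true and inferred haplotype pairs are coded as 0/1 alleles and agree on the genotype at every site; the function name and data layout are illustrative and are not taken from SHAPEIT2 or duoHMM.

```python
from typing import Sequence, Tuple

Haplotypes = Tuple[Sequence[int], Sequence[int]]


def switch_error_rate(truth: Haplotypes, inferred: Haplotypes) -> float:
    """Fraction of consecutive heterozygous sites whose relative phase differs.

    Assumption: both haplotype pairs describe the same genotypes, so only
    the phase (which allele sits on which haplotype) can disagree.
    """
    t0, t1 = truth
    i0, _ = inferred
    # Only heterozygous sites carry phase information.
    het_sites = [k for k in range(len(t0)) if t0[k] != t1[k]]
    # At each heterozygous site, note whether the first inferred haplotype
    # carries the same allele as the first true haplotype.
    matches = [i0[k] == t0[k] for k in het_sites]
    if len(matches) < 2:
        return 0.0
    # A switch error is a change in that match status between neighbours.
    switches = sum(1 for a, b in zip(matches, matches[1:]) if a != b)
    return switches / (len(matches) - 1)


# Toy example: four heterozygous sites, one switch between the 2nd and 3rd.
truth = ([0, 1, 0, 1, 1], [1, 0, 0, 0, 0])
inferred = ([0, 1, 0, 0, 0], [1, 0, 0, 1, 1])
rate = switch_error_rate(truth, inferred)
```

Swapping the two inferred haplotypes leaves the rate unchanged, because only changes in relative phase between neighbouring heterozygous sites are counted.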

Dahl A, Hore V, Iotchkova V, Marchini J. Network inference in matrix-variate Gaussian models with non-independent noise.

Inferring a graphical model or network from observational data from a large number of variables is a well studied problem in machine learning and computational statistics. In this paper we consider a version of this problem that is relevant to the analysis of multiple phenotypes collected in genetic studies. In such datasets we expect correlations between phenotypes and between individuals. We model observations as a sum of two matrix normal variates such that the joint covariance function is a sum of Kronecker products. This model, which generalizes the Graphical Lasso, assumes observations are correlated due to known genetic relationships and corrupted with non-independent noise. We have developed a computationally efficient EM algorithm to fit this model. On simulated datasets we illustrate substantially improved performance in network reconstruction by allowing for a general noise distribution.
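As a rough illustration of the "sum of Kronecker products" covariance described above, the sketch below builds the covariance of vec(Y) for Y = Z + E, where Z is matrix normal with row covariance K and column covariance C, and E is matrix normal with row covariance I and column covariance D. The symbols K, C and D and the toy values are assumptions made for this example; the paper's own parameterization may differ.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


def add(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def identity(n):
    """n x n identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def joint_covariance(K, C, D):
    """Cov(vec(Y)) = C (x) K + D (x) I for the two-component model."""
    return add(kron(C, K), kron(D, identity(len(K))))


# Toy example with 2 individuals and 2 phenotypes (illustrative values only).
K = [[1.0, 0.5], [0.5, 1.0]]  # known relatedness between individuals
C = [[1.0, 0.3], [0.3, 1.0]]  # genetic covariance between phenotypes
D = [[0.5, 0.2], [0.2, 0.5]]  # noise covariance; off-diagonals make noise non-independent
S = joint_covariance(K, C, D)
```

Setting the off-diagonal entries of D to zero would give independent noise across phenotypes, which is the restriction this model is described as relaxing.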

Stephens SH, Hartz SM, Hoft NR, Saccone NL, Corley RC, Hewitt JK, Hopfer CJ, Breslau N, Coon H, Chen X et al. 2013. Distinct loci in the CHRNA5/CHRNA3/CHRNB4 gene cluster are associated with onset of regular smoking. Genet Epidemiol, 37 (8), pp. 846-859.

Neuronal nicotinic acetylcholine receptor (nAChR) genes (CHRNA5/CHRNA3/CHRNB4) have been reproducibly associated with nicotine dependence, smoking behaviors, and lung cancer risk. Of the few reports that have focused on early smoking behaviors, association results have been mixed. This meta-analysis examines early smoking phenotypes and SNPs in the gene cluster to determine: (1) whether the most robust association signal in this region (rs16969968) for other smoking behaviors is also associated with early behaviors, and/or (2) if additional statistically independent signals are important in early smoking. We focused on two phenotypes: age of tobacco initiation (AOI) and age of first regular tobacco use (AOS). This study included 56,034 subjects (41 groups) spanning nine countries and evaluated five SNPs including rs1948, rs16969968, rs578776, rs588765, and rs684513. Each dataset was analyzed using a centrally generated script. Meta-analyses were conducted from summary statistics. AOS yielded significant associations with SNPs rs578776 (beta = 0.02, P = 0.004), rs1948 (beta = 0.023, P = 0.018), and rs684513 (beta = 0.032, P = 0.017), indicating protective effects. There were no significant associations for the AOI phenotype. Importantly, rs16969968, the most replicated signal in this region for nicotine dependence, cigarettes per day, and cotinine levels, was not associated with AOI (P = 0.59) or AOS (P = 0.92). These results provide important insight into the complexity of smoking behavior phenotypes, and suggest that association signals in the CHRNA5/A3/B4 gene cluster affecting early smoking behaviors may be different from those affecting the mature nicotine dependence phenotype.
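The abstract states that meta-analyses were conducted from summary statistics but does not say which estimator was used. One common choice is fixed-effect inverse-variance weighting, sketched below under that assumption; the function name and the input values are hypothetical and are not data from the study.

```python
import math


def fixed_effect_meta(betas, ses):
    """Inverse-variance weighted fixed-effect meta-analysis.

    betas: per-study effect estimates; ses: their standard errors.
    Returns the pooled effect, its standard error, z score and two-sided p-value.
    """
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    # Two-sided p-value under a standard normal null.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, z, p


# Hypothetical per-study summary statistics for a single SNP.
pooled_beta, pooled_se, z_score, p_value = fixed_effect_meta(
    betas=[0.02, 0.04], ses=[0.01, 0.01]
)
```

Studies with smaller standard errors receive more weight, so the pooled estimate is dominated by the most precise studies.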

Delaneau O, Howie B, Cox AJ, Zagury JF, Marchini J. 2013. Haplotype estimation using sequencing reads. Am J Hum Genet, 93 (4), pp. 687-696.

High-throughput sequencing technologies produce short sequence reads that can contain phase information if they span two or more heterozygote genotypes. This information is not routinely used by current methods that infer haplotypes from genotype data. We have extended the SHAPEIT2 method to use phase-informative sequencing reads to improve phasing accuracy. Our model incorporates the read information in a probabilistic model through base quality scores within each read. The method is primarily designed for high-coverage sequence data or data sets that already have genotypes called. One important application is phasing of single samples sequenced at high coverage for use in medical sequencing and studies of rare diseases. Our method can also use existing panels of reference haplotypes. We tested the method by using a mother-father-child trio sequenced at high-coverage by Illumina together with the low-coverage sequence data from the 1000 Genomes Project (1000GP). We found that use of phase-informative reads increases the mean distance between switch errors by 22% from 274.4 kb to 328.6 kb. We also used male chromosome X haplotypes from the 1000GP samples to simulate sequencing reads with varying insert size, read length, and base error rate. When using short 100 bp paired-end reads, we found that using mixtures of insert sizes produced the best results. When using longer reads with high error rates (5-20 kb read with 4%-15% error per base), phasing performance was substantially improved.

Montgomery SB, Goode DL, Kvikstad E, Albers CA, Zhang ZD, Mu XJ, Ananda G, Howie B, Karczewski KJ, Smith KS et al. 2013. The origin, evolution, and functional impact of short insertion-deletion variants identified in 179 human genomes. Genome Res, 23 (5), pp. 749-761.

Short insertions and deletions (indels) are the second most abundant form of human genetic variation, but our understanding of their origins and functional effects lags behind that of other types of variants. Using population-scale sequencing, we have identified a high-quality set of 1.6 million indels from 179 individuals representing three diverse human populations. We show that rates of indel mutagenesis are highly heterogeneous, with 43%-48% of indels occurring in 4.03% of the genome, whereas in the remaining 96% their prevalence is 16 times lower than SNPs. Polymerase slippage can explain upwards of three-fourths of all indels, with the remainder being mostly simple deletions in complex sequence. However, insertions do occur and are significantly associated with pseudo-palindromic sequence features compatible with the fork stalling and template switching (FoSTeS) mechanism more commonly associated with large structural variations. We introduce a quantitative model of polymerase slippage, which enables us to identify indel-hypermutagenic protein-coding genes, some of which are associated with recurrent mutations leading to disease. Accounting for mutational rate heterogeneity due to sequence context, we find that indels across functional sequence are generally subject to stronger purifying selection than SNPs. We find that indel length modulates selection strength, and that indels affecting multiple functionally constrained nucleotides undergo stronger purifying selection. We further find that indels are enriched in associations with gene expression and find evidence for a contribution of nonsense-mediated decay. Finally, we show that indels can be integrated in existing genome-wide association studies (GWAS); although we do not find direct evidence that potentially causal protein-coding indels are enriched with associations to known disease-associated SNPs, our findings suggest that the causal variant underlying some of these associations may be indels.

Thorgeirsson TE, Gudbjartsson DF, Sulem P, Besenbacher S, Styrkarsdottir U, Thorleifsson G, Walters GB, TAG Consortium, Oxford-GSK Consortium, ENGAGE consortium et al. 2013. A common biological basis of obesity and nicotine addiction. Transl Psychiatry, 3 (10), pp. e308.

Smoking influences body weight such that smokers weigh less than non-smokers and smoking cessation often leads to weight increase. The relationship between body weight and smoking is partly explained by the effect of nicotine on appetite and metabolism. However, the brain reward system is involved in the control of the intake of both food and tobacco. We evaluated the effect of single-nucleotide polymorphisms (SNPs) affecting body mass index (BMI) on smoking behavior, and tested the 32 SNPs identified in a meta-analysis for association with two smoking phenotypes, smoking initiation (SI) and the number of cigarettes smoked per day (CPD) in an Icelandic sample (N=34,216 smokers). Combined according to their effect on BMI, the SNPs correlate with both SI (r=0.019, P=0.00054) and CPD (r=0.032, P=8.0 × 10(-7)). These findings replicate in a second large data set (N=127,274, thereof 76,242 smokers) for both SI (P=1.2 × 10(-5)) and CPD (P=9.3 × 10(-5)). Notably, the variant most strongly associated with BMI (rs1558902-A in FTO) did not associate with smoking behavior. The association with smoking behavior is not due to the effect of the SNPs on BMI. Our results strongly point to a common biological basis of the regulation of our appetite for tobacco and food, and thus the vulnerability to nicotine addiction and obesity.

Delaneau O, Zagury JF, Marchini J. 2013. Improved whole-chromosome phasing for disease and population genetic studies. Nat Methods, 10 (1), pp. 5-6.

Hancock DB, Soler Artigas M, Gharib SA, Henry A, Manichaikul A, Ramasamy A, Loth DW, Imboden M, Koch B, McArdle WL et al. 2012. Genome-wide joint meta-analysis of SNP and SNP-by-smoking interaction identifies novel loci for pulmonary function. PLoS Genet, 8 (12), pp. e1003098.

Genome-wide association studies have identified numerous genetic loci for spirometric measures of pulmonary function, forced expiratory volume in one second (FEV(1)), and its ratio to forced vital capacity (FEV(1)/FVC). Given that cigarette smoking adversely affects pulmonary function, we conducted genome-wide joint meta-analyses (JMA) of single nucleotide polymorphism (SNP) and SNP-by-smoking (ever-smoking or pack-years) associations on FEV(1) and FEV(1)/FVC across 19 studies (total N = 50,047). We identified three novel loci not previously associated with pulmonary function. SNPs in or near DNER (smallest P(JMA) = 5.00×10(-11)), HLA-DQB1 and HLA-DQA2 (smallest P(JMA) = 4.35×10(-9)), and KCNJ2 and SOX9 (smallest P(JMA) = 1.28×10(-8)) were associated with FEV(1)/FVC or FEV(1) in meta-analysis models including SNP main effects, smoking main effects, and SNP-by-smoking (ever-smoking or pack-years) interaction. The HLA region has been widely implicated for autoimmune and lung phenotypes, unlike the other novel loci, which have not been widely implicated. We evaluated DNER, KCNJ2, and SOX9 and found them to be expressed in human lung tissue. DNER and SOX9 further showed evidence of differential expression in human airway epithelium in smokers compared to non-smokers. Our findings demonstrated that joint testing of SNP and SNP-by-environment interaction identified novel loci associated with complex traits that are missed when considering only the genetic main effects.

Churchhouse C, Marchini J. 2013. Multiway admixture deconvolution using phased or unphased ancestral panels. Genet Epidemiol, 37 (1), pp. 1-12.

We describe a novel method for inferring the local ancestry of admixed individuals from dense genome-wide single nucleotide polymorphism data. The method, called MULTIMIX, allows multiple source populations, models population linkage disequilibrium between markers and is applicable to datasets in which the sample and source populations are either phased or unphased. The model is based upon a hidden Markov model of switches in ancestry between consecutive windows of loci. We model the observed haplotypes within each window using a multivariate normal distribution with parameters estimated from the ancestral panels. We present three methods to fit the model: Markov chain Monte Carlo sampling, the Expectation Maximization algorithm, and a Classification Expectation Maximization algorithm. The performance of our method on individuals simulated to be admixed with European and West African ancestry shows it to be comparable to HAPMIX, the ancestry calls of the two methods agreeing at 99.26% of loci across the three parameter groups. In addition to it being faster than HAPMIX, it is also found to perform well over a range of extent of admixture in a simulation involving three ancestral populations. In an analysis of real data, we estimate the contribution of European, West African and Native American ancestry to each locus in the Mexican samples of HapMap, giving estimates of ancestral proportions that are consistent with those previously reported.

Menelaou A, Marchini J. 2013. Genotype calling and phasing using next-generation sequencing reads and a haplotype scaffold. Bioinformatics, 29 (1), pp. 84-91.

MOTIVATION: Given the current costs of next-generation sequencing, large studies carry out low-coverage sequencing followed by application of methods that leverage linkage disequilibrium to infer genotypes. We propose a novel method that assumes study samples are sequenced at low coverage and genotyped on a genome-wide microarray, as in the 1000 Genomes Project (1KGP). We assume polymorphic sites have been detected from the sequencing data and that genotype likelihoods are available at these sites. We also assume that the microarray genotypes have been phased to construct a haplotype scaffold. We then phase each polymorphic site using an MCMC algorithm that iteratively updates the unobserved alleles based on the genotype likelihoods at that site and local haplotype information. We use a multivariate normal model to capture both allele frequency and linkage disequilibrium information around each site. When sequencing data are available from trios, Mendelian transmission constraints are easily accommodated into the updates. The method is highly parallelizable, as it analyses one position at a time. RESULTS: We illustrate the performance of the method compared with other methods using data from Phase 1 of the 1KGP in terms of genotype accuracy, phasing accuracy and downstream imputation performance. We show that the haplotype panel we infer in African samples, which was based on a trio-phased scaffold, increases downstream imputation accuracy for rare variants (R² increases by >0.05 for minor allele frequency <1%), and this will translate into a boost in power to detect associations. These results highlight the value of incorporating microarray genotypes when calling variants from next-generation sequence data. AVAILABILITY: The method (called MVNcall) is implemented in a C++ program and is available from http://www.stats.ox.ac.uk/~marchini/#software.

Hartz SM, Short SE, Saccone NL, Culverhouse R, Chen L, Schwantes-An TH, Coon H, Han Y, Stephens SH, Sun J et al. 2012. Increased genetic vulnerability to smoking at CHRNA5 in early-onset smokers. Arch Gen Psychiatry, 69 (8), pp. 854-860.

CONTEXT: Recent studies have shown an association between cigarettes per day (CPD) and a nonsynonymous single-nucleotide polymorphism in CHRNA5, rs16969968. OBJECTIVE: To determine whether the association between rs16969968 and smoking is modified by age at onset of regular smoking. DATA SOURCES: Primary data. STUDY SELECTION: Available genetic studies containing measures of CPD and the genotype of rs16969968 or its proxy. DATA EXTRACTION: Uniform statistical analysis scripts were run locally. Starting with 94,050 ever-smokers from 43 studies, we extracted the heavy smokers (CPD >20) and light smokers (CPD ≤10) with age-at-onset information, reducing the sample size to 33,348. Each study was stratified into early-onset smokers (age at onset ≤16 years) and late-onset smokers (age at onset >16 years), and a logistic regression of heavy vs light smoking with the rs16969968 genotype was computed for each stratum. Meta-analysis was performed within each age-at-onset stratum. DATA SYNTHESIS: Individuals with 1 risk allele at rs16969968 who were early-onset smokers were significantly more likely to be heavy smokers in adulthood (odds ratio [OR] = 1.45; 95% CI, 1.36-1.55; n = 13,843) than were carriers of the risk allele who were late-onset smokers (OR = 1.27; 95% CI, 1.21-1.33, n = 19,505) (P = .01). CONCLUSION: These results highlight an increased genetic vulnerability to smoking in early-onset smokers.

O'Connell J, Marchini J. 2012. Joint genotype calling with array and sequence data. Genet Epidemiol, 36 (6), pp. 527-537.

Analysis of rare variants is currently a major focus of genetic studies of human disease. Single-nucleotide polymorphism (SNP) genotypes can be assayed using microarray genotyping or by sequencing, but neither technology produces perfect genotype calls, especially at rare SNPs. Studies that collect both types of data are becoming increasingly common, so it may be possible to combine data types to increase accuracy. We present a method, called Chiamante, which calls genotypes on individuals with either array data, sequence data, or both. The model adapts to data quality and can estimate when either the array or the sequence data should be ignored when calling the genotypes at each SNP. As a special case, our method will call genotypes from only array data and outperforms existing methods in this scenario. We have applied our method to array and sequence data from Phase I of the 1000 Genomes Project and show that it provides improved performance, especially at rare SNPs. This method provides a foundation for future efforts to fuse genetic data from different sources, for example, when combining data from exome sequencing and exome microarrays.

Wang J, Bansal AT, Martin M, Germer S, Benayed R, Essioux L, Lee JS, Begovich A, Hemmings A, Kenwright A et al. 2013. Genome-wide association analysis implicates the involvement of eight loci with response to tocilizumab for the treatment of rheumatoid arthritis. Pharmacogenomics J, 13 (3), pp. 235-241.

Rheumatoid arthritis (RA) is an immune-mediated inflammatory disease affecting the joints. A heterogeneous response to available therapies demonstrates the need to identify those patients likely to benefit from a particular therapy. Our objective was to identify genetic factors associated with response to tocilizumab, a humanized monoclonal antibody targeting the interleukin (IL)-6 receptor, recently approved for treating RA. We report the first genome-wide association study on the response to tocilizumab in 1683 subjects with RA from six clinical studies. Putative associations were identified with eight loci, previously unrecognized as linked to the IL-6 pathway or associated with RA risk. This study suggests that it is unlikely that a major genetic determinant of response exists, and it illustrates the complexity of performing genome-wide association scans in clinical trials.

1000 Genomes Project Consortium, Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature, 491 (7422), pp. 56-65.

By characterizing the geographic and functional spectrum of human genetic variation, the 1000 Genomes Project aims to build a resource to help to understand the genetic contribution to disease. Here we describe the genomes of 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing. By developing methods to integrate information across several algorithms and diverse data sources, we provide a validated haplotype map of 38 million single nucleotide polymorphisms, 1.4 million short insertions and deletions, and more than 14,000 larger deletions. We show that individuals from different populations carry different profiles of rare and common variants, and that low-frequency variants show substantial geographic differentiation, which is further increased by the action of purifying selection. We show that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites. This resource, which captures up to 98% of accessible single nucleotide polymorphisms at a frequency of 1% in related populations, enables analysis of common and low-frequency variants in individuals from diverse, including admixed, populations.

Howie B, Fuchsberger C, Stephens M, Marchini J, Abecasis GR. 2012. Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat Genet, 44 (8), pp. 955-959.

The 1000 Genomes Project and disease-specific sequencing efforts are producing large collections of haplotypes that can be used as reference panels for genotype imputation in genome-wide association studies (GWAS). However, imputing from large reference panels with existing methods imposes a high computational burden. We introduce a strategy called 'pre-phasing' that maintains the accuracy of leading methods while reducing computational costs. We first statistically estimate the haplotypes for each individual within the GWAS sample (pre-phasing) and then impute missing genotypes into these estimated haplotypes. This reduces the computational cost because (i) the GWAS samples must be phased only once, whereas standard methods would implicitly repeat phasing with each reference panel update, and (ii) it is much faster to match a phased GWAS haplotype to one reference haplotype than to match two unphased GWAS genotypes to a pair of reference haplotypes. We implemented our approach in the MaCH and IMPUTE2 frameworks, and we tested it on data sets from the Wellcome Trust Case Control Consortium 2 (WTCCC2), the Genetic Association Information Network (GAIN), the Women's Health Initiative (WHI) and the 1000 Genomes Project. This strategy will be particularly valuable for repeated imputation as reference panels evolve.

Delaneau O, Marchini J, Zagury JF. 2011. A linear complexity phasing method for thousands of genomes. Nat Methods, 9 (2), pp. 179-181.

Human-disease etiology can be better understood with phase information about diploid sequences. We present a method for estimating haplotypes, using genotype data from unrelated samples or small nuclear families, that leads to improved accuracy and speed compared to several widely used methods. The method, segmented haplotype estimation and imputation tool (SHAPEIT), scales linearly with the number of haplotypes used in each iteration and can be run efficiently on whole chromosomes.

Howie B, Marchini J, Stephens M. 2011. Genotype imputation with thousands of genomes. G3 (Bethesda), 1 (6), pp. 457-470.

Genotype imputation is a statistical technique that is often used to increase the power and resolution of genetic association studies. Imputation methods work by using haplotype patterns in a reference panel to predict unobserved genotypes in a study dataset, and a number of approaches have been proposed for choosing subsets of reference haplotypes that will maximize accuracy in a given study population. These panel selection strategies become harder to apply and interpret as sequencing efforts like the 1000 Genomes Project produce larger and more diverse reference sets, which led us to develop an alternative framework. Our approach is built around a new approximation that uses local sequence similarity to choose a custom reference panel for each study haplotype in each region of the genome. This approximation makes it computationally efficient to use all available reference haplotypes, which allows us to bypass the panel selection step and to improve accuracy at low-frequency variants by capturing unexpected allele sharing among populations. Using data from HapMap 3, we show that our framework produces accurate results in a wide range of human populations. We also use data from the Malaria Genetic Epidemiology Network (MalariaGEN) to provide recommendations for imputation-based studies in Africa. We demonstrate that our approximation improves efficiency in large, sequence-based reference panels, and we discuss general computational strategies for modern reference datasets. Genome-wide association studies will soon be able to harness the power of thousands of reference genomes, and our work provides a practical way for investigators to use this rich information. New methodology from this study is implemented in the IMPUTE2 software package.
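
The idea of choosing a custom reference panel by local sequence similarity can be illustrated with a minimal sketch. It assumes that similarity is measured by Hamming distance at the typed SNPs in a region and that the k closest reference haplotypes are kept; the function names, the 0/1 allele coding and the toy data are hypothetical and are not taken from the IMPUTE2 source code.

```python
from typing import List, Sequence


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the positions at which two equal-length haplotypes differ."""
    if len(a) != len(b):
        raise ValueError("haplotypes must cover the same SNPs")
    return sum(x != y for x, y in zip(a, b))


def select_reference_panel(
    study_hap: Sequence[int],
    reference_haps: Sequence[Sequence[int]],
    k: int,
) -> List[int]:
    """Return the indices of the k reference haplotypes closest to study_hap.

    Ties are broken by the original order of the reference haplotypes,
    because Python's sort is stable.
    """
    order = sorted(
        range(len(reference_haps)),
        key=lambda i: hamming(study_hap, reference_haps[i]),
    )
    return order[:k]


# Toy region with five typed SNPs (alleles coded 0/1; illustrative data only).
study = [0, 1, 1, 0, 1]
reference = [
    [0, 1, 1, 0, 1],  # identical to the study haplotype
    [1, 0, 0, 1, 0],  # differs at every SNP
    [0, 1, 0, 0, 1],  # differs at one SNP
    [0, 0, 0, 0, 1],  # differs at two SNPs
]
panel = select_reference_panel(study, reference, k=2)
```

In this sketch the panel is recomputed for each study haplotype and each region, so the full reference set stays available while each imputation step only conditions on a small, locally similar subset.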

Soler Artigas M, Loth DW, Wain LV, Gharib SA, Obeidat M, Tang W, Zhai G, Zhao JH, Smith AV, Huffman JE et al. 2011. Genome-wide association and large-scale follow up identifies 16 new loci influencing lung function. Nat Genet, 43 (11), pp. 1082-1090.

Pulmonary function measures reflect respiratory health and are used in the diagnosis of chronic obstructive pulmonary disease. We tested genome-wide association with forced expiratory volume in 1 second and the ratio of forced expiratory volume in 1 second to forced vital capacity in 48,201 individuals of European ancestry with follow up of the top associations in up to an additional 46,411 individuals. We identified new regions showing association (combined P < 5 × 10(-8)) with pulmonary function in or near MFAP2, TGFB2, HDAC4, RARB, MECOM (also known as EVI1), SPATA9, ARMC2, NCR3, ZKSCAN3, CDC123, C10orf11, LRP1, CCDC38, MMP15, CFDP1 and KCNE2. Identification of these 16 new loci may provide insight into the molecular mechanisms regulating pulmonary function and into molecular targets for future therapy to alleviate reduced lung function.

Cardin N, Holmes C, Wellcome Trust Case Control Consortium, Donnelly P, Marchini J. 2011. Bayesian hierarchical mixture modeling to assign copy number from a targeted CNV array. Genet Epidemiol, 35 (6), pp. 536-548.

Accurate assignment of copy number at known copy number variant (CNV) loci is important both for increasing understanding of the structural evolution of genomes and for carrying out association studies of copy number with disease. As with calling SNP genotypes, the task can be framed as a clustering problem, but for a number of reasons assigning copy number is much more challenging. CNV assays have lower signal-to-noise ratios than SNP assays, often display heavy-tailed and asymmetric intensity distributions, contain outlying observations and may exhibit systematic technical differences among different cohorts. In addition, the number of copy-number classes at a CNV in the population may be unknown a priori. Due to these complications, automatic and robust assignment of copy number from array data remains a challenging problem. We have developed a copy number assignment algorithm, CNVCALL, for a targeted CNV array, such as that used by the Wellcome Trust Case Control Consortium's recent CNV association study. We use a Bayesian hierarchical mixture model that robustly identifies both the number of different copy number classes at a specific locus and the relative copy number for each individual in the sample. This approach is fully automated, which is a critical requirement when analyzing large numbers of CNVs. We illustrate the method's performance using real data from the Wellcome Trust Case Control Consortium's CNV association study and using simulated data.

Soler Artigas M, Wain LV, Repapi E, Obeidat M, Sayers I, Burton PR, Johnson T, Zhao JH, Albrecht E, Dominiczak AF et al. 2011. Effect of 5 Genetic Variants Associated with Lung Function on the Risk of COPD, and their Joint Effects on Lung Function. Am J Respir Crit Care Med.

RATIONALE: Genomic loci are associated with forced expiratory volume in one second (FEV1) or the ratio of FEV1 to forced vital capacity (FVC) in population samples, but their association with COPD has not yet been proven, nor have their combined effects on lung function and COPD been studied. OBJECTIVES: To test association with COPD of variants at five loci (TNS1, GSTCD, HTR4, AGER and THSD4) and evaluate joint effects on lung function and COPD of these SNPs, and variants at the previously reported locus near HHIP. METHODS: Sampling from 12 population-based studies (n=31,422), we obtained genotype data on 3,284 COPD cases and 17,538 controls for sentinel SNPs in TNS1, GSTCD, HTR4, AGER and THSD4. In 24,648 individuals (including 2,890 COPD cases and 13,862 controls), we additionally obtained genotypes for rs12504628 near HHIP. Each allele associated with lung function decline at these six SNPs contributed to a risk score. We studied the association of the risk score with lung function and COPD. RESULTS: Association with COPD was significant for three loci (TNS1, GSTCD, HTR4) and the previously reported HHIP locus, and suggestive and directionally consistent for AGER and THSD4. Compared with the baseline group (7 risk alleles), carrying 10-12 risk alleles was associated with a reduction in FEV1 (β=-72.21 ml, P=3.90×10(-4)) and FEV1/FVC (β=-1.53%, P=6.35×10(-6)), and with COPD (OR=1.63, P=1.46×10(-5)). CONCLUSIONS: Variants in TNS1, GSTCD and HTR4 are associated with COPD. Our highest risk score category was associated with 1.6-fold higher COPD risk than the population average score.

Su Z, Marchini J, Donnelly P. 2011. HAPGEN2: simulation of multiple disease SNPs. Bioinformatics, 27 (16), pp. 2304-2305.

MOTIVATION: Performing experiments with simulated data is an inexpensive approach to evaluating competing experimental designs and analysis methods in genome-wide association studies. Simulation based on resampling known haplotypes is fast and efficient and can produce samples with patterns of linkage disequilibrium (LD), which mimic those in real data. However, the inability of current methods to simulate multiple nearby disease SNPs on the same chromosome can limit their application. RESULTS: We introduce a new simulation algorithm based on a successful resampling method, HAPGEN, that can simulate multiple nearby disease SNPs on the same chromosome. The new method, HAPGEN2, retains many advantages of resampling methods and expands the range of disease models that current simulators offer. AVAILABILITY: HAPGEN2 is freely available from http://www.stats.ox.ac.uk/~marchini/software/gwas/gwas.html. CONTACT: zhan@well.ox.ac.uk SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Tomlinson IP, Carvajal-Carmona LG, Dobbins SE, Tenesa A, Jones AM, Howarth K, Palles C, Broderick P, Jaeger EE, Farrington S et al. 2011. Multiple common susceptibility variants near BMP pathway loci GREM1, BMP4, and BMP2 explain part of the missing heritability of colorectal cancer. PLoS Genet, 7 (6), pp. e1002105.

Genome-wide association studies (GWAS) have identified 14 tagging single nucleotide polymorphisms (tagSNPs) that are associated with the risk of colorectal cancer (CRC), and several of these tagSNPs are near bone morphogenetic protein (BMP) pathway loci. The penalty of multiple testing implicit in GWAS increases the attraction of complementary approaches for disease gene discovery, including candidate gene- or pathway-based analyses. The strongest candidate loci for additional predisposition SNPs are arguably those already known both to have functional relevance and to be involved in disease risk. To investigate this proposition, we searched for novel CRC susceptibility variants close to the BMP pathway genes GREM1 (15q13.3), BMP4 (14q22.2), and BMP2 (20p12.3) using sample sets totalling 24,910 CRC cases and 26,275 controls. We identified new, independent CRC predisposition SNPs close to BMP4 (rs1957636, P = 3.93×10(-10)) and BMP2 (rs4813802, P = 4.65×10(-11)). Near GREM1, we found using fine-mapping that the previously-identified association between tagSNP rs4779584 and CRC actually resulted from two independent signals represented by rs16969681 (P = 5.33×10(-8)) and rs11632715 (P = 2.30×10(-10)). As low-penetrance predisposition variants become harder to identify, owing to small effect sizes and/or low risk allele frequencies, approaches based on informed candidate gene selection may become increasingly attractive. Our data emphasise that genetic fine-mapping studies can deconvolute associations that have arisen owing to independent correlation of a tagSNP with more than one functional SNP, thus explaining some of the apparently missing heritability of common diseases.

Marchini J. 2011. Genotype Imputation. pp. 157-175.

Genotype imputation is the term used to describe the process of predicting or imputing genotypes that are not directly assayed in a sample of individuals. Imputation methods attempt to identify sharing between the underlying haplotypes of the study individuals and the haplotypes in the reference set and use this sharing to impute the missing alleles in study individuals. There are strong connections between the models and methods used to infer haplotype phase and those used to perform genotype imputation, as well as strong connections to tagging SNP-based approaches and methods used in linkage studies. This chapter covers the main uses of imputation and describes and compares the different methods that have been proposed for it. It also discusses the factors that affect imputation performance and how imputed genotypes can be used to test for association. Imputation is used to boost power, to fine-map associations, and to facilitate meta-analysis. The methods proposed for genotype imputation can be broadly grouped into two main classes: SNP tagging-based approaches that use only a small number of tagSNPs to impute each SNP, and approaches that use more of the flanking genotype data to impute each SNP through the use of various types of hidden Markov models. © 2011 Elsevier Inc. All rights reserved.

Soler Artigas M, Wain LV, Repapi E, Obeidat M, Sayers I, Burton PR, Johnson T, Zhao JH, Albrecht E, Dominiczak AF et al. 2011. Effect of five genetic variants associated with lung function on the risk of chronic obstructive lung disease, and their joint effects on lung function. Am J Respir Crit Care Med, 184 (7), pp. 786-795.

RATIONALE: Genomic loci are associated with FEV1 or the ratio of FEV1 to FVC in population samples, but their association with chronic obstructive pulmonary disease (COPD) has not yet been proven, nor have their combined effects on lung function and COPD been studied. OBJECTIVES: To test association with COPD of variants at five loci (TNS1, GSTCD, HTR4, AGER, and THSD4) and to evaluate joint effects on lung function and COPD of these single-nucleotide polymorphisms (SNPs), and variants at the previously reported locus near HHIP. METHODS: By sampling from 12 population-based studies (n = 31,422), we obtained genotype data on 3,284 COPD case subjects and 17,538 control subjects for sentinel SNPs in TNS1, GSTCD, HTR4, AGER, and THSD4. In 24,648 individuals (including 2,890 COPD case subjects and 13,862 control subjects), we additionally obtained genotypes for rs12504628 near HHIP. Each allele associated with lung function decline at these six SNPs contributed to a risk score. We studied the association of the risk score with lung function and COPD. MEASUREMENTS AND MAIN RESULTS: Association with COPD was significant for three loci (TNS1, GSTCD, and HTR4) and the previously reported HHIP locus, and suggestive and directionally consistent for AGER and THSD4. Compared with the baseline group (7 risk alleles), carrying 10-12 risk alleles was associated with a reduction in FEV1 (β = -72.21 ml, P = 3.90 × 10(-4)) and FEV1/FVC (β = -1.53%, P = 6.35 × 10(-6)), and with COPD (odds ratio = 1.63, P = 1.46 × 10(-5)). CONCLUSIONS: Variants in TNS1, GSTCD, and HTR4 are associated with COPD. Our highest risk score category was associated with a 1.6-fold higher COPD risk than the population average score.
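
The risk score described in this abstract can be sketched as a simple count of risk alleles over the six SNPs. This is an illustrative reading of the abstract, assuming an unweighted count with each genotype coded as 0, 1 or 2 copies of the risk allele; the function names and the example genotype are hypothetical, and only the 10-12 allele category is taken from the text.

```python
from typing import Dict

# The six loci named in the abstract (five sentinel SNPs plus the SNP near HHIP).
RISK_LOCI = ["TNS1", "GSTCD", "HTR4", "AGER", "THSD4", "HHIP"]


def risk_score(risk_allele_counts: Dict[str, int]) -> int:
    """Sum the number of risk alleles (0, 1 or 2 per locus) over the six loci."""
    total = 0
    for locus in RISK_LOCI:
        count = risk_allele_counts[locus]
        if count not in (0, 1, 2):
            raise ValueError(f"{locus}: genotype must carry 0, 1 or 2 risk alleles")
        total += count
    return total


def in_highest_category(score: int) -> bool:
    """True if the score falls in the 10-12 risk allele category of the abstract."""
    return 10 <= score <= 12


# Hypothetical individual, for illustration only.
individual = {"TNS1": 2, "GSTCD": 2, "HTR4": 1, "AGER": 2, "THSD4": 2, "HHIP": 1}
score = risk_score(individual)
```

With six diallelic SNPs the score ranges from 0 to 12, which is consistent with the baseline group of 7 risk alleles and the highest category of 10-12 reported above.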

Southam L, Panoutsopoulou K, Rayner NW, Chapman K, Durrant C, Ferreira T, Arden N, Carr A, Deloukas P, Doherty M et al. 2011. The effect of genome-wide association scan quality control on imputation outcome for common variants. Eur J Hum Genet, 19 (5), pp. 610-614.

Imputation is an extremely valuable tool in conducting and synthesising genome-wide association studies (GWASs). Directly typed SNP quality control (QC) is thought to affect imputation quality. It is, therefore, common practice to use quality-controlled (QCed) data as an input for imputing genotypes. This study aims to determine the effect of commonly applied QC steps on imputation outcomes. We performed several iterations of imputing SNPs across chromosome 22 in a dataset consisting of 3177 samples with Illumina 610k (Illumina, San Diego, CA, USA) GWAS data, applying different QC steps each time. The imputed genotypes were compared with the directly typed genotypes. In addition, we investigated the correlation between alternatively QCed data. We also applied a series of post-imputation QC steps balancing elimination of poorly imputed SNPs and information loss. We found that the difference between the unQCed data and the fully QCed data on imputation outcome was minimal. Our study shows that imputation of common variants is generally very accurate and robust to GWAS QC, which is not a major factor affecting imputation outcome. A minority of common-frequency SNPs with particular properties cannot be accurately imputed regardless of QC stringency. These findings may not generalise to the imputation of low frequency and rare variants.

Ferreira T, Marchini J. 2011. Modeling interactions with known risk loci: a Bayesian model averaging approach. Ann Hum Genet, 75 (1), pp. 1-9.

Genome-wide association studies (GWAS) are now clearly established as a powerful method for detecting loci involved in the etiology of common complex diseases. Most diseases and traits studied using the GWAS approach now have several loci that have been shown to be convincingly replicated. It is generally the case that these loci have been identified using single locus association scans of genotyped or imputed SNPs and very few loci have been identified by taking interactions into account. We propose a method that assesses the evidence of association at each SNP by modeling the effect of the locus in combination with other known loci. We use a Bayesian model averaging approach that combines the evidence across several different plausible models for the way in which the loci interact. We show that the method has good power both when the association is the result of marginal effects only, and when interaction with a known locus occurs. The method is implemented as an option in the program SNPTEST.
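
The step of combining evidence across several plausible models can be sketched in general terms. Under Bayesian model averaging, the overall Bayes factor against the null is the prior-weighted average of the per-model Bayes factors. The code below is a minimal illustration of that identity only; the model labels, Bayes factor values and prior weights are hypothetical, and it does not reproduce the SNPTEST implementation.

```python
from typing import Sequence


def model_averaged_bayes_factor(
    bayes_factors: Sequence[float],
    prior_weights: Sequence[float],
) -> float:
    """Average per-model Bayes factors (each against the same null model).

    prior_weights are the prior probabilities of the alternative models,
    conditional on some alternative being true, so they must sum to 1.
    """
    if len(bayes_factors) != len(prior_weights):
        raise ValueError("need one prior weight per model")
    if any(w < 0 for w in prior_weights):
        raise ValueError("prior weights must be non-negative")
    if abs(sum(prior_weights) - 1.0) > 1e-9:
        raise ValueError("prior weights must sum to 1")
    return sum(w * bf for w, bf in zip(prior_weights, bayes_factors))


# Hypothetical models for a test SNP in the presence of one known risk locus:
#   marginal effect only, additive effects at both loci, and interaction.
bayes_factors = [12.0, 3.0, 40.0]
prior_weights = [0.5, 0.25, 0.25]
bma_bf = model_averaged_bayes_factor(bayes_factors, prior_weights)
```

Because the average is dominated by whichever model fits best, a SNP can score well whether its signal comes from a marginal effect or from an interaction, which matches the power behaviour described in the abstract.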

1000 Genomes Project Consortium, Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA. 2010. A map of human genome variation from population-scale sequencing. Nature, 467 (7319), pp. 1061-1073.

The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four populations; high-coverage sequencing of two mother-father-child trios; and exon-targeted sequencing of 697 individuals from seven populations. We describe the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast majority of common variation, over 95% of the currently accessible variants found in any individual are present in this data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base substitution mutations to be approximately 10(-8) per base pair per generation. We explore the data with regard to signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes, due to selection at linked sites. These methods and public data will support the next phase of human genetic research.

Marchini J, Howie B. 2010. Genotype imputation for genome-wide association studies. Nat Rev Genet, 11 (7), pp. 499-511.

In the past few years genome-wide association (GWA) studies have uncovered a large number of convincingly replicated associations for many complex human diseases. Genotype imputation has been used widely in the analysis of GWA studies to boost power, fine-map associations and facilitate the combination of results across studies using meta-analysis. This Review describes the details of several different statistical methods for imputing genotypes, illustrates and discusses the factors that influence imputation performance, and reviews methods that can be used to assess imputation performance and test association at imputed SNPs.

Wellcome Trust Case Control Consortium, Craddock N, Hurles ME, Cardin N, Pearson RD, Plagnol V, Robson S, Vukcevic D, Barnes C, Conrad DF et al. 2010. Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls. Nature, 464 (7289), pp. 713-720.

Copy number variants (CNVs) account for a major proportion of human genetic polymorphism and have been predicted to have an important role in genetic susceptibility to common disease. To address this we undertook a large, direct genome-wide study of association between CNVs and eight common human diseases. Using a purpose-designed array we typed approximately 19,000 individuals into distinct copy-number classes at 3,432 polymorphic CNVs, including an estimated 50% of all common CNVs larger than 500 base pairs. We identified several biological artefacts that lead to false-positive associations, including systematic CNV differences between DNAs derived from blood and cell lines. Association testing and follow-up replication analyses confirmed three loci where CNVs were associated with disease (IRGM for Crohn's disease; HLA for Crohn's disease, rheumatoid arthritis and type 1 diabetes; and TSPAN8 for type 2 diabetes), although in each case the locus had previously been identified in single nucleotide polymorphism (SNP)-based studies, reflecting our observation that most common CNVs that are well-typed on our array are well tagged by SNPs and so have been indirectly explored through SNP studies. We conclude that common CNVs that can be typed on existing platforms are unlikely to contribute greatly to the genetic basis of common human diseases.

Liu JZ, Tozzi F, Waterworth DM, Pillai SG, Muglia P, Middleton L, Berrettini W, Knouff CW, Yuan X, Waeber G et al. 2010. Meta-analysis and imputation refines the association of 15q25 with smoking quantity. Nat Genet, 42 (5), pp. 436-440.

Smoking is a leading global cause of disease and mortality. We established the Oxford-GlaxoSmithKline study (Ox-GSK) to perform a genome-wide meta-analysis of SNP association with smoking-related behavioral traits. Our final data set included 41,150 individuals drawn from 20 disease, population and control cohorts. Our analysis confirmed an effect on smoking quantity at a locus on 15q25 (P = 9.45 x 10(-19)) that includes CHRNA5, CHRNA3 and CHRNB4, three genes encoding neuronal nicotinic acetylcholine receptor subunits. We used data from the 1000 Genomes project to investigate the region using imputation, which allowed for analysis of virtually all common SNPs in the region and offered a fivefold increase in marker density over HapMap2 as an imputation reference panel. Our fine-mapping approach identified a SNP showing the highest significance, rs55853698, located within the promoter region of CHRNA5. Conditional analysis also identified a secondary locus (rs6495308) in CHRNA3.

Imielinski M, Baldassano RN, Griffiths A, Russell RK, Annese V, Dubinsky M, Kugathasan S, Bradfield JP, Walters TD, Sleiman P et al. 2009. Common variants at five new loci associated with early-onset inflammatory bowel disease. Nat Genet, 41 (12), pp. 1335-1340.

The inflammatory bowel diseases (IBD) Crohn's disease and ulcerative colitis are common causes of morbidity in children and young adults in the western world. Here we report the results of a genome-wide association study in early-onset IBD involving 3,426 affected individuals and 11,963 genetically matched controls recruited through international collaborations in Europe and North America, thereby extending the results from a previous study of 1,011 individuals with early-onset IBD. We have identified five new regions associated with early-onset IBD susceptibility, including 16p11 near the cytokine gene IL27 (rs8049439, P = 2.41 x 10(-9)), 22q12 (rs2412973, P = 1.55 x 10(-9)), 10q22 (rs1250550, P = 5.63 x 10(-9)), 2q37 (rs4676410, P = 3.64 x 10(-8)) and 19q13.11 (rs10500264, P = 4.26 x 10(-10)). Our scan also detected associations at 23 of 32 loci previously implicated in adult-onset Crohn's disease and at 8 of 17 loci implicated in adult-onset ulcerative colitis, highlighting the close pathogenetic relationship between early- and adult-onset IBD.

Su Z, Cardin N, Donnelly P, Marchini J, Wellcome Trust Case Control Consortium. 2009. A Bayesian Method for Detecting and Characterizing Allelic Heterogeneity and Boosting Signals in Genome-Wide Association Studies. Statistical Science, 24 (4), pp. 430-450.

The standard paradigm for the analysis of genome-wide association studies involves carrying out association tests at both typed and imputed SNPs. These methods will not be optimal for detecting the signal of association at SNPs that are not currently known or in regions where allelic heterogeneity occurs. We propose a novel association test, complementary to the SNP-based approaches, that attempts to extract further signals of association by explicitly modeling and estimating both unknown SNPs and allelic heterogeneity at a locus. At each site we estimate the genealogy of the case-control sample by taking advantage of the HapMap haplotypes across the genome. Allelic heterogeneity is modeled by allowing more than one mutation on the branches of the genealogy. Our use of Bayesian methods allows us to assess directly the evidence for a causative SNP not well correlated with known SNPs and for allelic heterogeneity at each locus. Using simulated data and real data from the WTCCC project, we show that our method (i) produces a significant boost in signal and accurately identifies the form of the allelic heterogeneity in regions where it is known to exist, (ii) can suggest new signals that are not found by testing typed or imputed SNPs and (iii) can provide more accurate estimates of effect sizes in regions of association.

Zheng G, Marchini J, Geller NL. 2009. Introduction to the Special Issue: Genome-Wide Association Studies. Statistical Science, 24 (4), pp. 387-387.

Howie BN, Donnelly P, Marchini J. 2009. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet, 5 (6), pp. e1000529.

Genotype imputation methods are now being widely used in the analysis of genome-wide association studies. Most imputation analyses to date have used the HapMap as a reference dataset, but new reference panels (such as controls genotyped on multiple SNP chips and densely typed samples from the 1,000 Genomes Project) will soon allow a broader range of SNPs to be imputed with higher accuracy, thereby increasing power. We describe a genotype imputation method (IMPUTE version 2) that is designed to address the challenges presented by these new datasets. The main innovation of our approach is a flexible modelling framework that increases accuracy and combines information across multiple reference panels while remaining computationally feasible. We find that IMPUTE v2 attains higher accuracy than other methods when the HapMap provides the sole reference panel, but that the size of the panel constrains the improvements that can be made. We also find that imputation accuracy can be greatly enhanced by expanding the reference panel to contain thousands of chromosomes and that IMPUTE v2 outperforms other methods in this setting at both rare and common SNPs, with overall error rates that are 15%-20% lower than those of the closest competing method. One particularly challenging aspect of next-generation association studies is to integrate information across multiple reference panels genotyped on different sets of SNPs; we show that our approach to this problem has practical advantages over other suggested solutions.

Newton-Cheh C, Johnson T, Gateva V, Tobin MD, Bochud M, Coin L, Najjar SS, Zhao JH, Heath SC, Eyheramendy S et al. 2009. Genome-wide association study identifies eight loci associated with blood pressure. Nat Genet, 41 (6), pp. 666-676.

Elevated blood pressure is a common, heritable cause of cardiovascular disease worldwide. To date, identification of common genetic variants influencing blood pressure has proven challenging. We tested 2.5 million genotyped and imputed SNPs for association with systolic and diastolic blood pressure in 34,433 subjects of European ancestry from the Global BPgen consortium and followed up findings with direct genotyping (N ≤ 71,225 European ancestry, N ≤ 12,889 Indian Asian ancestry) and in silico comparison (CHARGE consortium, N = 29,136). We identified association between systolic or diastolic blood pressure and common variants in eight regions near the CYP17A1 (P = 7 × 10(-24)), CYP1A2 (P = 1 × 10(-23)), FGF5 (P = 1 × 10(-21)), SH2B3 (P = 3 × 10(-18)), MTHFR (P = 2 × 10(-13)), c10orf107 (P = 1 × 10(-9)), ZNF652 (P = 5 × 10(-9)) and PLCD3 (P = 1 × 10(-8)) genes. All variants associated with continuous blood pressure were associated with dichotomous hypertension. These associations between common variants and blood pressure and hypertension offer mechanistic insights into the regulation of blood pressure and may point to novel targets for interventions to prevent cardiovascular disease.

Jallow M, Teo YY, Small KS, Rockett KA, Deloukas P, Clark TG, Kivinen K, Bojang KA, Conway DJ, Pinder M et al. 2009. Genome-wide and fine-resolution association analysis of malaria in West Africa. Nat Genet, 41 (6), pp. 657-665.

We report a genome-wide association (GWA) study of severe malaria in The Gambia. The initial GWA scan included 2,500 children genotyped on the Affymetrix 500K GeneChip, and a replication study included 3,400 children. We used this to examine the performance of GWA methods in Africa. We found considerable population stratification, and also that signals of association at known malaria resistance loci were greatly attenuated owing to weak linkage disequilibrium (LD). To investigate possible solutions to the problem of low LD, we focused on the HbS locus, sequencing this region of the genome in 62 Gambian individuals and then using these data to conduct multipoint imputation in the GWA samples. This increased the signal of association, from P = 4 × 10(-7) to P = 4 × 10(-14), with the peak of the signal located precisely at the HbS causal variant. Our findings provide proof of principle that fine-resolution multipoint imputation, based on population-specific sequencing data, can substantially boost authentic GWA signals and enable fine mapping of causal variants in African populations.

Spencer CC, Su Z, Donnelly P, Marchini J. 2009. Designing genome-wide association studies: sample size, power, imputation, and the choice of genotyping chip. PLoS Genet, 5 (5), pp. e1000477.

Genome-wide association studies are revolutionizing the search for the genes underlying human complex diseases. The main decisions to be made at the design stage of these studies are the choice of the commercial genotyping chip to be used and the numbers of case and control samples to be genotyped. The most common method of comparing different chips is using a measure of coverage, but this fails to properly account for the effects of sample size, the genetic model of the disease, and linkage disequilibrium between SNPs. In this paper, we argue that the statistical power to detect a causative variant should be the major criterion in study design. Because of the complicated pattern of linkage disequilibrium (LD) in the human genome, power cannot be calculated analytically and must instead be assessed by simulation. We describe in detail a method of simulating case-control samples at a set of linked SNPs that replicates the patterns of LD in human populations, and we used it to assess power for a comprehensive set of available genotyping chips. Our results allow us to compare the performance of the chips to detect variants with different effect sizes and allele frequencies, to look at how power changes with sample size in different populations or when using multi-marker tags and genotype imputation approaches, and to see how performance compares to a hypothetical chip that contains every SNP in HapMap. A main conclusion of this study is that marked differences in genome coverage may not translate into appreciable differences in power and that, when taking budgetary considerations into account, the most powerful design may not always correspond to the chip with the highest coverage. We also show that genotype imputation can be used to boost the power of many chips up to the level obtained from a hypothetical "complete" chip containing all the SNPs in HapMap. Our results have been encapsulated into an R software package that allows users to design future association studies and our methods provide a framework with which new chip sets can be evaluated.
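
The general idea of assessing power by simulation can be shown with a deliberately simplified sketch. It assumes a single causal SNP that is typed directly, a multiplicative per-allele odds ratio, controls drawn at the population allele frequency, and a 1-df allelic chi-square test. It ignores LD between SNPs entirely, so it does not reproduce the haplotype-based simulation method of the paper or its R package; all function names and parameter values are hypothetical.

```python
import math
import random


def allelic_chi2_pvalue(case_alt: int, case_n: int, ctrl_alt: int, ctrl_n: int) -> float:
    """P-value of the 1-df chi-square test on a 2x2 table of allele counts."""
    total = case_n + ctrl_n
    alt = case_alt + ctrl_alt
    ref = total - alt
    if alt == 0 or ref == 0:
        return 1.0  # monomorphic sample: no evidence of association
    chi2 = 0.0
    for observed, row_n, col_total in (
        (case_alt, case_n, alt),
        (case_n - case_alt, case_n, ref),
        (ctrl_alt, ctrl_n, alt),
        (ctrl_n - ctrl_alt, ctrl_n, ref),
    ):
        expected = row_n * col_total / total
        chi2 += (observed - expected) ** 2 / expected
    # Survival function of a chi-square with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def simulate_power(n_cases: int, n_controls: int, allele_freq: float,
                   odds_ratio: float, alpha: float, n_reps: int, seed: int) -> float:
    """Fraction of simulated studies in which the causal SNP reaches p < alpha."""
    rng = random.Random(seed)
    # Per-allele odds ratio shifts the allele odds in cases.
    case_odds = odds_ratio * allele_freq / (1.0 - allele_freq)
    case_freq = case_odds / (1.0 + case_odds)
    hits = 0
    for _ in range(n_reps):
        # Each individual contributes two alleles.
        case_alt = sum(rng.random() < case_freq for _ in range(2 * n_cases))
        ctrl_alt = sum(rng.random() < allele_freq for _ in range(2 * n_controls))
        p = allelic_chi2_pvalue(case_alt, 2 * n_cases, ctrl_alt, 2 * n_controls)
        if p < alpha:
            hits += 1
    return hits / n_reps


# Illustrative comparison of a true effect with the null (odds ratio of 1).
power_alt = simulate_power(500, 500, 0.3, 1.5, 0.05, 200, seed=1)
power_null = simulate_power(500, 500, 0.3, 1.0, 0.05, 200, seed=1)
```

Under the null the estimated "power" should sit near the significance level, which is a useful sanity check on any such simulation. A chip comparison in the spirit of the abstract would replace the directly typed causal SNP with the best tag available on each chip, which requires realistic LD and is outside the scope of this sketch.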

O'Donovan MC, Norton N, Williams H, Peirce T, Moskvina V, Nikolov I, Hamshere M, Carroll L, Georgieva L, Dwyer S et al. 2009. Analysis of 10 independent samples provides evidence for association between schizophrenia and a SNP flanking fibroblast growth factor receptor 2. Mol Psychiatry, 14 (1), pp. 30-36.

We and others have previously reported linkage to schizophrenia on chromosome 10q25-q26 but, to date, a susceptibility gene in the region has not been identified. We examined data from 3606 single-nucleotide polymorphisms (SNPs) mapping to 10q25-q26 that had been typed in a genome-wide association study (GWAS) of schizophrenia (479 UK cases/2937 controls). SNPs with P<0.01 (n=40) were genotyped in an additional 163 UK cases and those markers that remained nominally significant at P<0.01 (n=22) were genotyped in replication samples from Ireland, Germany and Bulgaria consisting of a total of 1664 cases with schizophrenia and 3541 controls. Only one SNP, rs17101921, was nominally significant after meta-analyses across the replication samples and this was genotyped in an additional six samples from the United States/Australia, Germany, China, Japan, Israel and Sweden (n=5142 cases/6561 controls). Across all replication samples, the allele at rs17101921 that was associated in the GWAS showed evidence for association independent of the original data (OR 1.17 (95% CI 1.06-1.29), P=0.0009). The SNP maps 85 kb from the nearest gene encoding fibroblast growth factor receptor 2 (FGFR2) making this a potential susceptibility gene for schizophrenia.

Barnes C, Plagnol V, Fitzgerald T, Redon R, Marchini J, Clayton D, Hurles ME. 2008. A robust statistical method for case-control association testing with copy number variation. Nat Genet, 40 (10), pp. 1245-1252.

Copy number variation (CNV) is pervasive in the human genome and can play a causal role in genetic diseases. The functional impact of CNV cannot be fully captured through linkage disequilibrium with SNPs. These observations motivate the development of statistical methods for performing direct CNV association studies. We show through simulation that current tests for CNV association are prone to false-positive associations in the presence of differential errors between cases and controls, especially if quantitative CNV measurements are noisy. We present a statistical framework for performing case-control CNV association studies that applies likelihood ratio testing of quantitative CNV measurements in cases and controls. We show that our methods are robust to differential errors and noisy data and can achieve maximal theoretical power. We illustrate the power of these methods for testing for association with binary and quantitative traits, and have made this software available as the R package CNVtools.

Marchini J, Howie B. 2008. Comparing algorithms for genotype imputation. Am J Hum Genet, 83 (4), pp. 535-539.

Barrett JC, Hansoul S, Nicolae DL, Cho JH, Duerr RH, Rioux JD, Brant SR, Silverberg MS, Taylor KD, Barmada MM et al. 2008. Genome-wide association defines more than 30 distinct susceptibility loci for Crohn's disease. Nat Genet, 40 (8), pp. 955-962.

Several risk factors for Crohn's disease have been identified in recent genome-wide association studies. To advance gene discovery further, we combined data from three studies on Crohn's disease (a total of 3,230 cases and 4,829 controls) and carried out replication in 3,664 independent cases with a mixture of population-based and family-based controls. The results strongly confirm 11 previously reported loci and provide genome-wide significant evidence for 21 additional loci, including the regions containing STAT3, JAK2, ICOSLG, CDKAL1 and ITLN1. The expanded molecular understanding of the basis of this disease offers promise for informed therapeutic development.

Loos RJ, Lindgren CM, Li S, Wheeler E, Zhao JH, Prokopenko I, Inouye M, Freathy RM, Attwood AP, Beckmann JS et al. 2008. Common variants near MC4R are associated with fat mass, weight and risk of obesity. Nat Genet, 40 (6), pp. 768-775.

To identify common variants influencing body mass index (BMI), we analyzed genome-wide association data from 16,876 individuals of European descent. After previously reported variants in FTO, the strongest association signal (rs17782313, P = 2.9 x 10(-6)) mapped 188 kb downstream of MC4R (melanocortin-4 receptor), mutations of which are the leading cause of monogenic severe childhood-onset obesity. We confirmed the BMI association in 60,352 adults (per-allele effect = 0.05 Z-score units; P = 2.8 x 10(-15)) and 5,988 children aged 7-11 (0.13 Z-score units; P = 1.5 x 10(-8)). In case-control analyses (n = 10,583), the odds for severe childhood obesity reached 1.30 (P = 8.0 x 10(-11)). Furthermore, we observed overtransmission of the risk allele to obese offspring in 660 families (P (pedigree disequilibrium test average; PDT-avg) = 2.4 x 10(-4)). The SNP location and patterns of phenotypic associations are consistent with effects mediated through altered MC4R function. Our findings establish that common variants near MC4R influence fat mass, weight and obesity risk at the population level and reinforce the need for large-scale data integration to identify variants influencing continuous biomedical traits.

O'Donovan MC, Craddock N, Norton N, Williams H, Peirce T, Moskvina V, Nikolov I, Hamshere M, Carroll L, Georgieva L et al. 2008. Identification of loci associated with schizophrenia by genome-wide association and follow-up. Nat Genet, 40 (9), pp. 1053-1055.

We carried out a genome-wide association study of schizophrenia (479 cases, 2,937 controls) and tested loci with P < 10(-5) in up to 16,726 additional subjects. Of 12 loci followed up, 3 had strong independent support (P < 5 x 10(-4)), and the overall pattern of replication was unlikely to occur by chance (P = 9 x 10(-8)). Meta-analysis provided strongest evidence for association around ZNF804A (P = 1.61 x 10(-7)) and this strengthened when the affected phenotype included bipolar disorder (P = 9.96 x 10(-9)).

Wellcome Trust Case Control Consortium, Australo-Anglo-American Spondylitis Consortium (TASC), Burton PR, Clayton DG, Cardon LR, Craddock N, Deloukas P, Duncanson A, Kwiatkowski DP, McCarthy MI et al. 2007. Association scan of 14,500 nonsynonymous SNPs in four diseases identifies autoimmunity variants. Nat Genet, 39 (11), pp. 1329-1337.

We have genotyped 14,436 nonsynonymous SNPs (nsSNPs) and 897 major histocompatibility complex (MHC) tag SNPs from 1,000 independent cases of ankylosing spondylitis (AS), autoimmune thyroid disease (AITD), multiple sclerosis (MS) and breast cancer (BC). Comparing these data against a common control dataset derived from 1,500 randomly selected healthy British individuals, we report initial association and independent replication in a North American sample of two new loci related to ankylosing spondylitis, ARTS1 and IL23R, and confirmation of the previously reported association of AITD with TSHR and FCRL3. These findings, enabled in part by increased statistical power resulting from the expansion of the control reference group to include individuals from the other disease groups, highlight notable new possibilities for autoimmune regulation and suggest that IL23R may be a common susceptibility factor for the major 'seronegative' diseases.

International HapMap Consortium, Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, Gibbs RA, Belmont JW, Boudreau A, Hardenbol P et al. 2007. A second generation human haplotype map of over 3.1 million SNPs. Nature, 449 (7164), pp. 851-861.

We describe the Phase II HapMap, which characterizes over 3.1 million human single nucleotide polymorphisms (SNPs) genotyped in 270 individuals from four geographically diverse populations and includes 25-35% of common SNP variation in the populations surveyed. The map is estimated to capture untyped common variation with an average maximum r² of between 0.9 and 0.96 depending on population. We demonstrate that the current generation of commercial genome-wide genotyping products captures common Phase II SNPs with an average maximum r² of up to 0.8 in African and up to 0.95 in non-African populations, and that potential gains in power in association studies can be obtained through imputation. These data also reveal novel aspects of the structure of linkage disequilibrium. We show that 10-30% of pairs of individuals within a population share at least one region of extended genetic identity arising from recent ancestry and that up to 1% of all common variants are untaggable, primarily because they lie within recombination hotspots. We show that recombination rates vary systematically around genes and between genes of different function. Finally, we demonstrate increased differentiation at non-synonymous, compared to synonymous, SNPs, resulting from systematic differences in the strength or efficacy of natural selection between populations.

Marchini J, Howie B, Myers S, McVean G, Donnelly P. 2007. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet, 39 (7), pp. 906-913.

Genome-wide association studies are set to become the method of choice for uncovering the genetic basis of human diseases. A central challenge in this area is the development of powerful multipoint methods that can detect causal variants that have not been directly genotyped. We propose a coherent analysis framework that treats the problem as one involving missing or uncertain genotypes. Central to our approach is a model-based imputation method for inferring genotypes at observed or unobserved SNPs, leading to improved power over existing methods for multipoint association mapping. Using real genome-wide association study data, we show that our approach (i) is accurate and well calibrated, (ii) provides detailed views of associated regions that facilitate follow-up studies and (iii) can be used to validate and correct data at genotyped markers. A notable future use of our method will be to boost power by combining data from genome-wide scans that use different SNP sets.

Wellcome Trust Case Control Consortium. 2007. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447 (7145), pp. 661-678.

There is increasing evidence that genome-wide association (GWA) studies represent a powerful approach to the identification of genes involved in common human diseases. We describe a joint GWA study (using the Affymetrix GeneChip 500K Mapping Array Set) undertaken in the British population, which has examined approximately 2,000 individuals for each of 7 major diseases and a shared set of approximately 3,000 controls. Case-control comparisons identified 24 independent association signals at P < 5 x 10(-7): 1 in bipolar disorder, 1 in coronary artery disease, 9 in Crohn's disease, 3 in rheumatoid arthritis, 7 in type 1 diabetes and 3 in type 2 diabetes. On the basis of prior findings and replication studies thus-far completed, almost all of these signals reflect genuine susceptibility effects. We observed association at many previously identified loci, and found compelling evidence that some loci confer risk for more than one of the diseases studied. Across all diseases, we identified a large number of further signals (including 58 loci with single-point P values between 10(-5) and 5 x 10(-7)) likely to yield additional susceptibility loci. The importance of appropriately large samples was confirmed by the modest effect sizes observed at most loci identified. This study thus represents a thorough validation of the GWA approach. It has also demonstrated that careful use of a shared control group represents a safe and effective approach to GWA analyses of multiple disease phenotypes; has generated a genome-wide genotype database for future studies of common diseases in the British population; and shown that, provided individuals with non-European ancestry are excluded, the extent of population stratification in the British population is generally modest. Our findings offer new avenues for exploring the pathophysiology of these important disorders. We anticipate that our data, results and software, which will be widely available to other investigators, will provide a powerful resource for human genetics research.

Zeggini E, Weedon MN, Lindgren CM, Frayling TM, Elliott KS, Lango H, Timpson NJ, Perry JR, Rayner NW, Freathy RM et al. 2007. Replication of genome-wide association signals in UK samples reveals risk loci for type 2 diabetes. Science, 316 (5829), pp. 1336-1341.

The molecular mechanisms involved in the development of type 2 diabetes are poorly understood. Starting from genome-wide genotype data for 1924 diabetic cases and 2938 population controls generated by the Wellcome Trust Case Control Consortium, we set out to detect replicated diabetes association signals through analysis of 3757 additional cases and 5346 controls and by integration of our findings with equivalent data from other international consortia. We detected diabetes susceptibility loci in and around the genes CDKAL1, CDKN2A/CDKN2B, and IGF2BP2 and confirmed the recently described associations at HHEX/IDE and SLC30A8. Our findings provide insight into the genetic architecture of type 2 diabetes, emphasizing the contribution of multiple variants of modest effect. The regions identified underscore the importance of pathways influencing pancreatic beta cell development and function in the etiology of type 2 diabetes.

Eyheramendy S, Marchini J, McVean G, Myers S, Donnelly P. 2007. A model-based approach to capture genetic variation for future association studies. Genome Res, 17 (1), pp. 88-95.

Genome-wide association studies are still constrained by the cost of genotyping. For this reason, the selection of a reduced set of markers or tags able to capture a significant proportion of the genetic variation is an important aspect of these studies. Most tagging SNP selection methods have been successful in capturing the genetic variation of the data from which the tags have been chosen. However, when these tags are used in an independent data set, a significant proportion of the remaining SNPs (non-tags) are not captured and, in most cases, there is no information on which SNPs are captured. We propose to use a probabilistic model to predict the non-tags based on a set of tags, as a way to capture genetic variation. An important advantage of this method is that it directly predicts the genotype of the non-tags with which we can test for association with the phenotype and which could help to elucidate the location of genes responsible for increasing disease susceptibility. Additionally, this method provides an estimate of the probabilities with which the predictions are made, which reflects the confidence of the probabilistic model. We also propose new methods to select the tagging SNPs. We empirically show by using HapMap data that our approach is able to capture significantly more genetic variation than methods based solely on a pairwise LD measure.

Sabeti PC, Varilly P, Fry B, Lohmueller J, Hostetter E, Cotsapas C, Xie X, Byrne EH, McCarroll SA, Gaudet R et al. 2007. Genome-wide detection and characterization of positive selection in human populations. Nature, 449 (7164), pp. 913-918.

With the advent of dense maps of human genetic variation, it is now possible to detect positive natural selection across the human genome. Here we report an analysis of over 3 million polymorphisms from the International HapMap Project Phase 2 (HapMap2). We used 'long-range haplotype' methods, which were developed to identify alleles segregating in a population that have undergone recent selection, and we also developed new methods that are based on cross-population comparisons to discover alleles that have swept to near-fixation within a population. The analysis reveals more than 300 strong candidate regions. Focusing on the strongest 22 regions, we develop a heuristic for scrutinizing these regions to identify candidate targets of selection. In a complementary analysis, we identify 26 non-synonymous, coding, single nucleotide polymorphisms showing regional evidence of positive selection. Examination of these candidates highlights three cases in which two genes in a common biological process have apparently undergone positive selection in the same population: LARGE and DMD, both related to infection by the Lassa virus, in West Africa; SLC24A5 and SLC45A2, both involved in skin pigmentation, in Europe; and EDAR and EDA2R, both involved in development of hair follicles, in Asia.

Nejentsev S, Howson JM, Walker NM, Szeszko J, Field SF, Stevens HE, Reynolds P, Hardy M, King E, Masters J et al. 2007. Localization of type 1 diabetes susceptibility to the MHC class I genes HLA-B and HLA-A. Nature, 450 (7171), pp. 887-892.

The major histocompatibility complex (MHC) on chromosome 6 is associated with susceptibility to more common diseases than any other region of the human genome, including almost all disorders classified as autoimmune. In type 1 diabetes the major genetic susceptibility determinants have been mapped to the MHC class II genes HLA-DQB1 and HLA-DRB1 (refs 1-3), but these genes cannot completely explain the association between type 1 diabetes and the MHC region. Owing to the region's extreme gene density, the multiplicity of disease-associated alleles, strong associations between alleles, limited genotyping capability, and inadequate statistical approaches and sample sizes, which, and how many, loci within the MHC determine susceptibility remains unclear. Here, in several large type 1 diabetes data sets, we analyse a combined total of 1,729 polymorphisms, and apply statistical methods (recursive partitioning and regression) to pinpoint disease susceptibility to the MHC class I genes HLA-B and HLA-A (risk ratios >1.5; P(combined) = 2.01 x 10(-19) and 2.35 x 10(-13), respectively) in addition to the established associations of the MHC class II genes. Other loci with smaller and/or rarer effects might also be involved, but to find these, future searches must take into account both the HLA class II and class I genes and use even larger samples. Taken together with previous studies, we conclude that MHC-class-I-mediated events, principally involving HLA-B*39, contribute to the aetiology of type 1 diabetes.

de Bakker PI, McVean G, Sabeti PC, Miretti MM, Green T, Marchini J, Ke X, Monsuur AJ, Whittaker P, Delgado M et al. 2006. A high-resolution HLA and SNP haplotype map for disease association studies in the extended human MHC. Nat Genet, 38 (10), pp. 1166-1172.

The proteins encoded by the classical HLA class I and class II genes in the major histocompatibility complex (MHC) are highly polymorphic and are essential in self versus non-self immune recognition. HLA variation is a crucial determinant of transplant rejection and susceptibility to a large number of infectious and autoimmune diseases. Yet identification of causal variants is problematic owing to linkage disequilibrium that extends across multiple HLA and non-HLA genes in the MHC. We therefore set out to characterize the linkage disequilibrium patterns between the highly polymorphic HLA genes and background variation by typing the classical HLA genes and >7,500 common SNPs and deletion-insertion polymorphisms across four population samples. The analysis provides informative tag SNPs that capture much of the common variation in the MHC region and that could be used in disease association studies, and it provides new insight into the evolutionary dynamics and ancestral origins of the HLA loci and their haplotypes.

Evans DM, Marchini J, Morris AP, Cardon LR. 2006. Two-stage two-locus models in genome-wide association. PLoS Genet, 2 (9), pp. e157.

Studies in model organisms suggest that epistasis may play an important role in the etiology of complex diseases and traits in humans. With the era of large-scale genome-wide association studies fast approaching, it is important to quantify whether it will be possible to detect interacting loci using realistic sample sizes in humans and to what extent undetected epistasis will adversely affect power to detect association when single-locus approaches are employed. We therefore investigated the power to detect association for an extensive range of two-locus quantitative trait models that incorporated varying degrees of epistasis. We compared the power to detect association using a single-locus model that ignored interaction effects, a full two-locus model that allowed for interactions, and, most important, two two-stage strategies whereby a subset of loci initially identified using single-locus tests were analyzed using the full two-locus model. Despite the penalty introduced by multiple testing, fitting the full two-locus model performed better than single-locus tests for many of the situations considered, particularly when compared with attempts to detect both individual loci. Using a two-stage strategy reduced the computational burden associated with performing an exhaustive two-locus search across the genome but was not as powerful as the exhaustive search when loci interacted. Two-stage approaches also increased the risk of missing interacting loci that contributed little effect at the margins. Based on our extensive simulations, our results suggest that an exhaustive search involving all pairwise combinations of markers across the genome might provide a useful complement to single-locus scans in identifying interacting loci that contribute to moderate proportions of the phenotypic variance.

Marchini J, Cutler D, Patterson N, Stephens M, Eskin E, Halperin E, Lin S, Qin ZS, Munro HM, Abecasis GR et al. 2006. A comparison of phasing algorithms for trios and unrelated individuals. Am J Hum Genet, 78 (3), pp. 437-450.

Knowledge of haplotype phase is valuable for many analysis methods in the study of disease, population, and evolutionary genetics. Considerable research effort has been devoted to the development of statistical and computational methods that infer haplotype phase from genotype data. Although a substantial number of such methods have been developed, they have focused principally on inference from unrelated individuals, and comparisons between methods have been rather limited. Here, we describe the extension of five leading algorithms for phase inference for handling father-mother-child trios. We performed a comprehensive assessment of the methods applied to both trios and to unrelated individuals, with a focus on genomic-scale problems, using both simulated data and data from the HapMap project. The most accurate algorithm was PHASE (v2.1). For this method, the percentages of genotypes whose phase was incorrectly inferred were 0.12%, 0.05%, and 0.16% for trios from simulated data, HapMap Centre d'Etude du Polymorphisme Humain (CEPH) trios, and HapMap Yoruban trios, respectively, and 5.2% and 5.9% for unrelated individuals in simulated data and the HapMap CEPH data, respectively. The other methods considered in this work had comparable but slightly worse error rates. The error rates for trios are similar to the levels of genotyping error and missing data expected. We thus conclude that all the methods considered will provide highly accurate estimates of haplotypes when applied to trio data sets. Running times differ substantially between methods. Although it is one of the slowest methods, PHASE (v2.1) was used to infer haplotypes for the 1 million-SNP HapMap data set. Finally, we evaluated methods of estimating the value of r² between a pair of SNPs and concluded that all methods estimated r² well when the estimated value was ≥0.8.

International HapMap Consortium. 2005. A haplotype map of the human genome. Nature, 437 (7063), pp. 1299-1320.

Inherited genetic variation has a critical but as yet largely uncharacterized role in human disease. Here we report a public database of common variation in the human genome: more than one million single nucleotide polymorphisms (SNPs) for which accurate and complete genotypes have been obtained in 269 DNA samples from four populations, including ten 500-kilobase regions in which essentially all information about common DNA variation has been extracted. These data document the generality of recombination hotspots, a block-like structure of linkage disequilibrium and low haplotype diversity, leading to substantial correlations of SNPs with many of their neighbours. We show how the HapMap resource can guide the design and analysis of genetic association studies, shed light on structural variation and recombination, and identify loci that may have been subject to natural selection during human evolution.

Marchini J, Donnelly P, Cardon LR. 2005. Genome-wide strategies for detecting multiple loci that influence complex diseases. Nat Genet, 37 (4), pp. 413-417.

After nearly 10 years of intense academic and commercial research effort, large genome-wide association studies for common complex diseases are now imminent. Although these conditions involve a complex relationship between genotype and phenotype, including interactions between unlinked loci, the prevailing strategies for analysis of such studies focus on the locus-by-locus paradigm. Here we consider analytical methods that explicitly look for statistical interactions between loci. We show first that they are computationally feasible, even for studies of hundreds of thousands of loci, and second that even with a conservative correction for multiple testing, they can be more powerful than traditional analyses under a range of models for interlocus interactions. We also show that plausible variations across populations in allele frequencies among interacting loci can markedly affect the power to detect their marginal effects, which may account in part for the well-known difficulties in replicating association results. These results suggest that searching for interactions among genetic loci can be fruitfully incorporated into analysis strategies for genome-wide association studies.

Marchini J, Presanis A. 2004. Comparing methods of analyzing fMRI statistical parametric maps. Neuroimage, 22 (3), pp. 1203-1213.

Approaches for the analysis of statistical parametric maps (SPMs) can be crudely grouped into three main categories in which different philosophies are applied to delineate activated regions: type I error control thresholding, false discovery rate (FDR) control thresholding and posterior probability thresholding. To better understand the properties of these main approaches, we carried out a simulation study to compare the approaches as they would be used on real data sets. Using default settings, we find that posterior probability thresholding is the most powerful approach, and type I error control thresholding provides the lowest levels of type I error. False discovery rate control thresholding performs in between the other approaches for both these criteria, although for some parameter settings this approach can approximate the performance of posterior probability thresholding. Based on these results, we discuss the relative merits of the three approaches in an attempt to decide upon an optimal approach. We conclude that viewing the problem of delineating areas of activation as a classification problem provides a highly interpretable framework for comparing the methods. Within this framework, we highlight the role of the loss function, which explicitly penalizes the types of errors that may occur in a given analysis.
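
To make the idea of FDR control thresholding concrete, here is a minimal sketch of the widely used Benjamini-Hochberg step-up procedure applied to a list of voxel-wise p-values. It is an illustration only: the abstract does not say which FDR procedure or settings the paper used, and the function name `benjamini_hochberg` and the level `q` are assumptions introduced here.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return a list of booleans marking which p-values are declared
    significant at FDR level q (Benjamini-Hochberg step-up procedure).

    Illustrative sketch; not taken from the paper summarized above.
    """
    m = len(pvalues)
    # Indices of the p-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])

    # Find the largest rank k with p_(k) <= q * k / m.
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= q * rank / m:
            k = rank

    # Everything up to and including rank k is rejected (step-up rule).
    rejected = [False] * m
    for i in order[:k]:
        rejected[i] = True
    return rejected


if __name__ == "__main__":
    # Hypothetical p-values for five voxels.
    print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.9]))
```

Because the rule is step-up, a p-value that misses its own rank threshold is still rejected whenever a larger p-value meets its threshold.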

Marchini J, Cardon LR, Phillips MS, Donnelly P. 2004. The effects of human population structure on large genetic association studies. Nat Genet, 36 (5), pp. 512-517.

Large-scale association studies hold substantial promise for unraveling the genetic basis of common human diseases. A well-known problem with such studies is the presence of undetected population structure, which can lead to both false positive results and failures to detect genuine associations. Here we examine approximately 15,000 genome-wide single-nucleotide polymorphisms typed in three population groups to assess the consequences of population structure on the coming generation of association studies. The consequences of population structure on association outcomes increase markedly with sample size. For the size of study needed to detect typical genetic effects in common diseases, even the modest levels of population structure within population groups cannot safely be ignored. We also examine one method for correcting for population structure (Genomic Control). Although it often performs well, it may not correct for structure if too few loci are used and may overcorrect in other settings, leading to substantial loss of power. The results of our analysis can guide the design of large-scale association studies.
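
As a rough illustration of the Genomic Control correction mentioned above, the sketch below estimates the inflation factor from a set of 1-df chi-square association statistics as their median divided by the median of the chi-square distribution with one degree of freedom (about 0.455), then rescales every statistic by that factor. The function name, the example numbers, and the convention of never letting the factor fall below 1 are assumptions made here for illustration; the abstract does not specify the implementation used in the paper.

```python
from statistics import median

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control(chi2_stats):
    """Return (lambda, adjusted statistics) for 1-df chi-square tests.

    Illustrative sketch: lambda is floored at 1 so that statistics are
    never inflated, which is a common convention assumed here.
    """
    lam = max(1.0, median(chi2_stats) / CHI2_1DF_MEDIAN)
    return lam, [s / lam for s in chi2_stats]


if __name__ == "__main__":
    # Hypothetical test statistics from loci assumed to be null.
    lam, adjusted = genomic_control([0.1, 0.9098728, 5.0])
    print(lam, adjusted)
```

With few loci the median is a noisy estimate of the inflation, which is in line with the abstract's caution that the correction may fail if too few loci are used.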

Henry CJ, Lightowler HJ, Marchini J. 2003. Intra-individual variation in resting metabolic rate during the menstrual cycle. Br J Nutr, 89 (6), pp. 811-817.

Little information exists on the extent of day-to-day intra-individual variation in resting metabolic rate (RMR) in women. The present study has investigated the intra-individual variation in RMR of women during the menstrual cycle. Nineteen women (naturally cycling non-pill users) were recruited to the study. Anthropometric and RMR measurements were taken at least three times per week for the duration of one complete menstrual cycle; measurements were taken for a second, consecutive cycle in eight of the nineteen subjects. RMR was measured by indirect calorimetry using a ventilated hood system under standardized conditions. The measurements made throughout each complete menstrual cycle were averaged and the levels of inter- and intra-individual variation in RMR were assessed by determining the CV (%). Mean RMR of the group was 5686 (SD 674) kJ/d; inter-individual variation in RMR was 11.8 %. There were wide differences in the intra-individual variation in RMR of women (CV range 1.7-10.4 %). The CV in ten subjects was small (2-4 %), while the CV in nine women was high (5-10 %), indicating a significant variation in RMR during the menstrual cycle in certain subjects. Using statistical models, it has been shown that there was a significant effect on RMR due to a subject-specific level of variability; this was the case even when accounting for a possible training effect. In conclusion, the findings from our present study show that RMR cannot be assumed to be 'stable' in all women. The implications of intra-individual variation in RMR and its impact on energy balance need further research.
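The coefficient of variation used above is the standard deviation expressed as a percentage of the mean. The short sketch below shows the calculation for one subject; the RMR readings are hypothetical and are not data from the study, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def cv_percent(values):
    """Coefficient of variation (%), using the sample standard deviation."""
    return 100.0 * stdev(values) / mean(values)


# Hypothetical repeated RMR readings (kJ/d) for a single subject.
rmr_readings = [5500, 5700, 5600, 5800, 5400]
cv = cv_percent(rmr_readings)
print(f"intra-individual CV = {cv:.2f} %")
```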

Marchini JL, Smith SM. 2003. On bias in the estimation of autocorrelations for fMRI voxel time-series analysis. Neuroimage, 18 (1), pp. 83-90.

For fMRI time-series analysis to be statistically valid, it is important to deal correctly with temporal autocorrelation in the noise. Most of the approaches in the literature adopt a two-stage approach in which the autocorrelation structure is estimated using the residuals of an initial model fit. This estimate is then used to "prewhiten" the data and the model before the model is refit to obtain final activation parameter estimates. An assumption implicit in this scheme is that the residuals from the initial model fit represent a realization of the "true" noise process. In general this assumption will not be correct as certain components of the noise will be removed by the model fit. In this paper we examine (i) the form of the bias induced by the initial model fit, (ii) methods of correcting for the bias, and (iii) the impact of bias correction on the model parameter estimates. We find that while bias correction does result in more accurate estimates of the correlation structure, this does not translate into improved estimates of the model parameters. In fact estimates of the model parameters and their standard errors are seen to be so accurate that we conclude that bias correction is unnecessary.
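The two-stage scheme described above can be sketched on a toy example. This is a simplified illustration under stated assumptions, not the paper's analysis: the noise is taken to be AR(1), the model has a single boxcar regressor with no intercept, and the prewhitening step is the classical Cochrane-Orcutt transform. All function names and simulation settings are hypothetical.

```python
import random


def ols_slope(x, y):
    """Least-squares slope for a single regressor with no intercept."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def lag1_autocorr(resid):
    """Lag-1 autocorrelation estimate from (approximately zero-mean) residuals."""
    num = sum(resid[t] * resid[t - 1] for t in range(1, len(resid)))
    return num / sum(r * r for r in resid)


def prewhiten(series, rho):
    """AR(1) prewhitening: s_t - rho * s_{t-1}; the first sample is dropped."""
    return [series[t] - rho * series[t - 1] for t in range(1, len(series))]


def two_stage_fit(x, y):
    # Stage 1: initial fit, then estimate autocorrelation from its residuals.
    beta_initial = ols_slope(x, y)
    resid = [yi - beta_initial * xi for xi, yi in zip(x, y)]
    rho_hat = lag1_autocorr(resid)
    # Stage 2: prewhiten both data and model, then refit.
    beta_final = ols_slope(prewhiten(x, rho_hat), prewhiten(y, rho_hat))
    return beta_initial, rho_hat, beta_final


# Simulated voxel time series: boxcar stimulus plus AR(1) noise (all assumed).
random.seed(0)
n = 400
x = [1.0 if (t // 10) % 2 else 0.0 for t in range(n)]
noise, e = [], 0.0
for _ in range(n):
    e = 0.5 * e + random.gauss(0.0, 0.5)
    noise.append(e)
y = [2.0 * xi + ei for xi, ei in zip(x, noise)]

beta_initial, rho_hat, beta_final = two_stage_fit(x, y)
print(f"initial beta = {beta_initial:.3f}, rho = {rho_hat:.3f}, final beta = {beta_final:.3f}")
```

Because the residuals come from a fitted model rather than from the true noise process, the autocorrelation estimate in stage 1 is the quantity whose bias the abstract examines.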

Gibbs RA, Belmont JW, Hardenbol P, Willis TD, Yu FL, Yang HM, Ch'ang LY, Huang W, Liu B, Shen Y et al. 2003. The International HapMap Project. Nature, 426 (6968), pp. 789-796.

The goal of the International HapMap Project is to determine the common patterns of DNA sequence variation in the human genome and to make this information freely available in the public domain. An international consortium is developing a map of these patterns across the genome by determining the genotypes of one million or more sequence variants, their frequencies and the degree of association between them, in DNA samples from populations with ancestry from parts of Africa, Asia and Europe. The HapMap will allow the discovery of sequence variants that affect common disease, will facilitate development of diagnostic tools, and will enhance our ability to choose targets for therapeutic intervention.

Balding DJ, Carothers AD, Marchini JL, Cardon LR, Vetta A, Griffiths B, Weir BS, Hill WG, Goldstein D, Strimmer K et al. 2002. Discussion on the meeting on 'statistical modelling and analysis of genetic data'. Journal of the Royal Statistical Society. Series B: Statistical Methodology, 64 (4), pp. 737-775.

Marchini JL, Ripley BD. 2000. A new statistical approach to detecting significant activation in functional MRI. Neuroimage, 12 (4), pp. 366-380.

There are many ways to detect activation patterns in a time series of observations at a single voxel in a functional magnetic resonance imaging study. The critical problem is to estimate the statistical significance, which depends on the estimation of both the magnitude of the response to the stimulus and the serial dependence of the time series and especially on the assumptions made in that estimation. We show that for experimental designs with periodic stimuli, only a few aspects of the serial dependence are important and these can be estimated reliably via nonparametric estimation of the spectral density of the time series, whereas existing techniques are biased by their assumptions. The linear model with (stationary) serially dependent errors can be analyzed entirely in frequency domain, and doing so provides many insights. In particular, we introduce a technique to detect periodic activations and show that it has a distribution theory that enables us to assign significance levels down to 1 in 100,000, levels which are needed when a whole brain image is under consideration. Nonparametric spectral density estimation is shown to be self-calibrating and accurate when compared to several other time-domain approaches. The technique is especially resistant to high frequency artefacts that we have found in some datasets and we demonstrate that time-domain approaches may be sufficiently susceptible to these effects to give misleading results. The method is easily generalized to handle event-related designs. We found it necessary to consider the trends in the time series carefully and use nonlinear filters to remove the trends and robust techniques to remove "spikes." Using this in connection with our techniques allows us to detect activations in clumps of a few (even one) voxel in periodic designs, yet produce essentially no false positive detections at any voxels in null datasets.

O'Connell J, Sharp K, Shrine N, Wain L, Hall I, Tobin M, Zagury JF, Delaneau O, Marchini J. 2016. Haplotype estimation for biobank-scale data sets. Nat Genet, 48 (7), pp. 817-820.

The UK Biobank (UKB) has recently released genotypes on 152,328 individuals together with extensive phenotypic and lifestyle information. We present a new phasing method, SHAPEIT3, that can handle such biobank-scale data sets and results in switch error rates as low as ∼0.3%. The method exhibits O(N log N) scaling with sample size N, enabling fast and accurate phasing of even larger cohorts.

Dahl A, Iotchkova V, Baud A, Johansson Å, Gyllensten U, Soranzo N, Mott R, Kranis A, Marchini J. 2016. A multiple-phenotype imputation method for genetic studies. Nat Genet, 48 (4), pp. 466-472.

Genetic association studies have yielded a wealth of biological discoveries. However, these studies have mostly analyzed one trait and one SNP at a time, thus failing to capture the underlying complexity of the data sets. Joint genotype-phenotype analyses of complex, high-dimensional data sets represent an important way to move beyond simple genome-wide association studies (GWAS) with great potential. The move to high-dimensional phenotypes will raise many new statistical problems. Here we address the central issue of missing phenotypes in studies with any level of relatedness between samples. We propose a multiple-phenotype mixed model and use a computationally efficient variational Bayesian algorithm to fit the model. On a variety of simulated and real data sets from a range of organisms and trait types, we show that our method outperforms existing state-of-the-art methods from the statistics and machine learning literature and can boost signals of association.

Delaneau O, Zagury JF, Marchini J. 2013. Improved whole-chromosome phasing for disease and population genetic studies. Nat Methods, 10 (1), pp. 5-6.

1000 Genomes Project Consortium, Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature, 491 (7422), pp. 56-65.

By characterizing the geographic and functional spectrum of human genetic variation, the 1000 Genomes Project aims to build a resource to help to understand the genetic contribution to disease. Here we describe the genomes of 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing. By developing methods to integrate information across several algorithms and diverse data sources, we provide a validated haplotype map of 38 million single nucleotide polymorphisms, 1.4 million short insertions and deletions, and more than 14,000 larger deletions. We show that individuals from different populations carry different profiles of rare and common variants, and that low-frequency variants show substantial geographic differentiation, which is further increased by the action of purifying selection. We show that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites. This resource, which captures up to 98% of accessible single nucleotide polymorphisms at a frequency of 1% in related populations, enables analysis of common and low-frequency variants in individuals from diverse, including admixed, populations.

Howie B, Fuchsberger C, Stephens M, Marchini J, Abecasis GR. 2012. Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat Genet, 44 (8), pp. 955-959.

The 1000 Genomes Project and disease-specific sequencing efforts are producing large collections of haplotypes that can be used as reference panels for genotype imputation in genome-wide association studies (GWAS). However, imputing from large reference panels with existing methods imposes a high computational burden. We introduce a strategy called 'pre-phasing' that maintains the accuracy of leading methods while reducing computational costs. We first statistically estimate the haplotypes for each individual within the GWAS sample (pre-phasing) and then impute missing genotypes into these estimated haplotypes. This reduces the computational cost because (i) the GWAS samples must be phased only once, whereas standard methods would implicitly repeat phasing with each reference panel update, and (ii) it is much faster to match a phased GWAS haplotype to one reference haplotype than to match two unphased GWAS genotypes to a pair of reference haplotypes. We implemented our approach in the MaCH and IMPUTE2 frameworks, and we tested it on data sets from the Wellcome Trust Case Control Consortium 2 (WTCCC2), the Genetic Association Information Network (GAIN), the Women's Health Initiative (WHI) and the 1000 Genomes Project. This strategy will be particularly valuable for repeated imputation as reference panels evolve.

Cardin N, Holmes C, Wellcome Trust Case Control Consortium, Donnelly P, Marchini J. 2011. Bayesian hierarchical mixture modeling to assign copy number from a targeted CNV array. Genet Epidemiol, 35 (6), pp. 536-548.

Accurate assignment of copy number at known copy number variant (CNV) loci is important both for increasing understanding of the structural evolution of genomes and for carrying out association studies of copy number with disease. As with calling SNP genotypes, the task can be framed as a clustering problem, but for a number of reasons assigning copy number is much more challenging. CNV assays have lower signal-to-noise ratios than SNP assays, often display heavy-tailed and asymmetric intensity distributions, contain outlying observations and may exhibit systematic technical differences among different cohorts. In addition, the number of copy-number classes at a CNV in the population may be unknown a priori. Due to these complications, automatic and robust assignment of copy number from array data remains a challenging problem. We have developed a copy number assignment algorithm, CNVCALL, for a targeted CNV array, such as that used by the Wellcome Trust Case Control Consortium's recent CNV association study. We use a Bayesian hierarchical mixture model that robustly identifies both the number of different copy number classes at a specific locus and the relative copy number for each individual in the sample. This approach is fully automated, which is a critical requirement when analyzing large numbers of CNVs. We illustrate the method's performance using real data from the Wellcome Trust Case Control Consortium's CNV association study and using simulated data.

1000 Genomes Project Consortium, Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA. 2010. A map of human genome variation from population-scale sequencing. Nature, 467 (7319), pp. 1061-1073.

The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four populations; high-coverage sequencing of two mother-father-child trios; and exon-targeted sequencing of 697 individuals from seven populations. We describe the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast majority of common variation, over 95% of the currently accessible variants found in any individual are present in this data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base substitution mutations to be approximately 10^-8 per base pair per generation. We explore the data with regard to signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes, due to selection at linked sites. These methods and public data will support the next phase of human genetic research.

Liu JZ, Tozzi F, Waterworth DM, Pillai SG, Muglia P, Middleton L, Berrettini W, Knouff CW, Yuan X, Waeber G et al. 2010. Meta-analysis and imputation refines the association of 15q25 with smoking quantity. Nat Genet, 42 (5), pp. 436-440.

Smoking is a leading global cause of disease and mortality. We established the Oxford-GlaxoSmithKline study (Ox-GSK) to perform a genome-wide meta-analysis of SNP association with smoking-related behavioral traits. Our final data set included 41,150 individuals drawn from 20 disease, population and control cohorts. Our analysis confirmed an effect on smoking quantity at a locus on 15q25 (P = 9.45 × 10^-19) that includes CHRNA5, CHRNA3 and CHRNB4, three genes encoding neuronal nicotinic acetylcholine receptor subunits. We used data from the 1000 Genomes project to investigate the region using imputation, which allowed for analysis of virtually all common SNPs in the region and offered a fivefold increase in marker density over HapMap2 (ref. 2) as an imputation reference panel. Our fine-mapping approach identified a SNP showing the highest significance, rs55853698, located within the promoter region of CHRNA5. Conditional analysis also identified a secondary locus (rs6495308) in CHRNA3.

Su Z, Cardin N, Donnelly P, Marchini J, Wellcome Trust Case Control Consortium. 2009. A Bayesian Method for Detecting and Characterizing Allelic Heterogeneity and Boosting Signals in Genome-Wide Association Studies. Statistical Science, 24 (4), pp. 430-450.

The standard paradigm for the analysis of genome-wide association studies involves carrying out association tests at both typed and imputed SNPs. These methods will not be optimal for detecting the signal of association at SNPs that are not currently known or in regions where allelic heterogeneity occurs. We propose a novel association test, complementary to the SNP-based approaches, that attempts to extract further signals of association by explicitly modeling and estimating both unknown SNPs and allelic heterogeneity at a locus. At each site we estimate the genealogy of the case-control sample by taking advantage of the HapMap haplotypes across the genome. Allelic heterogeneity is modeled by allowing more than one mutation on the branches of the genealogy. Our use of Bayesian methods allows us to assess directly the evidence for a causative SNP not well correlated with known SNPs and for allelic heterogeneity at each locus. Using simulated data and real data from the WTCCC project, we show that our method (i) produces a significant boost in signal and accurately identifies the form of the allelic heterogeneity in regions where it is known to exist, (ii) can suggest new signals that are not found by testing typed or imputed SNPs and (iii) can provide more accurate estimates of effect sizes in regions of association.

Howie BN, Donnelly P, Marchini J. 2009. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet, 5 (6), pp. e1000529.

Genotype imputation methods are now being widely used in the analysis of genome-wide association studies. Most imputation analyses to date have used the HapMap as a reference dataset, but new reference panels (such as controls genotyped on multiple SNP chips and densely typed samples from the 1,000 Genomes Project) will soon allow a broader range of SNPs to be imputed with higher accuracy, thereby increasing power. We describe a genotype imputation method (IMPUTE version 2) that is designed to address the challenges presented by these new datasets. The main innovation of our approach is a flexible modelling framework that increases accuracy and combines information across multiple reference panels while remaining computationally feasible. We find that IMPUTE v2 attains higher accuracy than other methods when the HapMap provides the sole reference panel, but that the size of the panel constrains the improvements that can be made. We also find that imputation accuracy can be greatly enhanced by expanding the reference panel to contain thousands of chromosomes and that IMPUTE v2 outperforms other methods in this setting at both rare and common SNPs, with overall error rates that are 15%-20% lower than those of the closest competing method. One particularly challenging aspect of next-generation association studies is to integrate information across multiple reference panels genotyped on different sets of SNPs; we show that our approach to this problem has practical advantages over other suggested solutions.

Spencer CC, Su Z, Donnelly P, Marchini J. 2009. Designing genome-wide association studies: sample size, power, imputation, and the choice of genotyping chip. PLoS Genet, 5 (5), pp. e1000477.

Genome-wide association studies are revolutionizing the search for the genes underlying human complex diseases. The main decisions to be made at the design stage of these studies are the choice of the commercial genotyping chip to be used and the numbers of case and control samples to be genotyped. The most common method of comparing different chips is using a measure of coverage, but this fails to properly account for the effects of sample size, the genetic model of the disease, and linkage disequilibrium between SNPs. In this paper, we argue that the statistical power to detect a causative variant should be the major criterion in study design. Because of the complicated pattern of linkage disequilibrium (LD) in the human genome, power cannot be calculated analytically and must instead be assessed by simulation. We describe in detail a method of simulating case-control samples at a set of linked SNPs that replicates the patterns of LD in human populations, and we used it to assess power for a comprehensive set of available genotyping chips. Our results allow us to compare the performance of the chips to detect variants with different effect sizes and allele frequencies, look at how power changes with sample size in different populations or when using multi-marker tags and genotype imputation approaches, and how performance compares to a hypothetical chip that contains every SNP in HapMap. A main conclusion of this study is that marked differences in genome coverage may not translate into appreciable differences in power and that, when taking budgetary considerations into account, the most powerful design may not always correspond to the chip with the highest coverage. We also show that genotype imputation can be used to boost the power of many chips up to the level obtained from a hypothetical "complete" chip containing all the SNPs in HapMap. Our results have been encapsulated into an R software package that allows users to design future association studies and our methods provide a framework with which new chip sets can be evaluated.
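The idea of assessing power by simulation can be illustrated with a deliberately simplified sketch. It is not the method or the R package described above: it simulates a single directly typed SNP with no linkage disequilibrium, uses an allelic 1-df chi-square test at a nominal 5% level, and every frequency, sample size and function name is an assumption made for illustration.

```python
import random

# Upper 5% critical value of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_CRIT_05 = 3.841


def allelic_chi2(case_alt, case_n, ctrl_alt, ctrl_n):
    """Pearson chi-square statistic for a 2x2 table of allele counts."""
    a, b = case_alt, case_n - case_alt
    c, d = ctrl_alt, ctrl_n - ctrl_alt
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: no evidence either way
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / denom


def simulate_power(freq_case, freq_ctrl, n_alleles, n_reps, rng):
    """Fraction of simulated studies in which the test exceeds the critical value."""
    hits = 0
    for _ in range(n_reps):
        # Draw allele counts for cases and controls (binomial sampling).
        case_alt = sum(rng.random() < freq_case for _ in range(n_alleles))
        ctrl_alt = sum(rng.random() < freq_ctrl for _ in range(n_alleles))
        if allelic_chi2(case_alt, n_alleles, ctrl_alt, n_alleles) > CHI2_1DF_CRIT_05:
            hits += 1
    return hits / n_reps


rng = random.Random(1)
# 1000 alleles per group corresponds to 500 cases and 500 controls.
power_alt = simulate_power(0.30, 0.20, 1000, 200, rng)
power_null = simulate_power(0.20, 0.20, 1000, 200, rng)
print(f"estimated power = {power_alt:.2f}, type I error rate = {power_null:.2f}")
```

A realistic design study would replace the single-SNP sampler with one that reproduces LD between typed and untyped SNPs, and would use a genome-wide significance threshold rather than 5%.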
