# Professor Chris Holmes

Research Area: | Genetics and Genomics |
---|---|

Scientific Themes: | Genetics & Genomics |

Keywords: | Bayesian Statistics, Pattern Recognition, Spatial Statistics, Statistical Genetics, Statistical Genomics and Genetic Epidemiology. |

I have a broad interest in the theory, methods and applications of statistics and statistical modelling. My background and beliefs lie in Bayesian statistics which provides a unified framework to stochastic modelling and information processing. I am particularly interested in pattern recognition and nonlinear, nonparametric methods.

There are no collaborations listed for this principal investigator.

© 2017 Biometrika Trust. Bayesian robustness under model misspecification is a current area of active research. Among recent ideas is that of raising the likelihood function to a power. In this paper we discuss the choice of appropriate power and provide examples.

BACKGROUND: Single-cell RNA-Seq can be a valuable and unbiased tool to dissect cellular heterogeneity, despite the transcriptome's limitations in describing higher functional phenotypes and protein events. Perhaps the most important shortfall with transcriptomic 'snapshots' of cell populations is that they risk being descriptive, only cataloging heterogeneity at one point in time, and without microenvironmental context. Studying the genetic ('nature') and environmental ('nurture') modifiers of heterogeneity, and how cell population dynamics unfold over time in response to these modifiers is key when studying highly plastic cells such as macrophages. RESULTS: We introduce the programmable Polaris™ microfluidic lab-on-chip for single-cell sequencing, which performs live-cell imaging while controlling for the culture microenvironment of each cell. Using gene-edited macrophages we demonstrate how previously unappreciated knockout effects of SAMHD1, such as an altered oxidative stress response, have a large paracrine signaling component. Furthermore, we demonstrate single-cell pathway enrichments for cell cycle arrest and APOBEC3G degradation, both associated with the oxidative stress response and altered proteostasis. Interestingly, SAMHD1 and APOBEC3G are both HIV-1 inhibitors ('restriction factors'), with no known co-regulation. CONCLUSION: As single-cell methods continue to mature, so will the ability to move beyond simple 'snapshots' of cell populations towards studying the determinants of population dynamics. By combining single-cell culture, live-cell imaging, and single-cell sequencing, we have demonstrated the ability to study cell phenotypes and microenvironmental influences. It's these microenvironmental components - ignored by standard single-cell workflows - that likely determine how macrophages, for example, react to inflammation and form treatment resistant HIV reservoirs.

Big Datasets are endemic, but are often notoriously difficult to analyse because of their size, heterogeneity and quality. The purpose of this paper is to open a discourse on the potential for modern decision theoretic optimal experimental design methods, which by their very nature have traditionally been applied prospectively, to improve the analysis of Big Data through retrospective designed sampling in order to answer particular questions of interest. By appealing to a range of examples, it is suggested that this perspective on Big Data modelling and analysis has the potential for wide generality and advantageous inferential and computational properties. We highlight current hurdles and open research questions surrounding efficient computational optimisation in using retrospective designs, and in part this paper is a call to the optimisation and experimental design communities to work together in the field of Big Data analysis.

© 2016 Informa UK Limited, trading as Taylor & Francis Group. This work characterizes the dispersion of some popular random probability measures, including the bootstrap, the Bayesian bootstrap, and the Pólya tree prior. This dispersion is measured in terms of the variation of the Kullback–Leibler divergence of a random draw from the process to that of its baseline centring measure. By providing a quantitative expression of this dispersion around the baseline distribution, our work provides insight for comparing different parameterizations of the models and for the setting of prior parameters in applied Bayesian settings. This highlights some limitations of the existing canonical choice of parameter settings in the Pólya tree process.

© 2016 The Author(s). Published with license by Taylor & Francis. Hidden Markov models (HMMs) are one of the most widely used statistical methods for analyzing sequence data. However, the reporting of output from HMMs has largely been restricted to the presentation of the most-probable (MAP) hidden state sequence, found via the Viterbi algorithm, or the sequence of most probable marginals using the forward–backward algorithm. In this article, we expand the amount of information we could obtain from the posterior distribution of an HMM by introducing linear-time dynamic programming recursions that, conditional on a user-specified constraint in the number of segments, allow us to (i) find MAP sequences, (ii) compute posterior probabilities, and (iii) simulate sample paths. We collectively call these recursions k-segment algorithms and illustrate their utility using simulated and real examples. We also highlight the prospective and retrospective use of k-segment constraints for fitting HMMs or exploring existing model fits. Supplementary materials for this article are available online.

© 2016, Institute of Mathematical Statistics. All rights reserved. In this article we propose novel Bayesian nonparametric methods using Dirichlet Process Mixture (DPM) models for detecting pairwise dependence between random variables while accounting for uncertainty in the form of the underlying distributions. A key criteria is that the procedures should scale to large data sets. In this regard we find that the formal calculation of the Bayes factor for a dependent-vs.-independent DPM joint probability measure is not feasible computationally. To address this we present Bayesian diagnostic measures for characterising evidence against a “null model” of pairwise independence. In simulation studies, as well as for a real data analysis, we show that our approach provides a useful tool for the exploratory nonparametric Bayesian analysis of large multivariate data sets.

Decisions based partly or solely on predictions from probabilistic models may be sensitive to model misspecification. Statisticians are taught from an early stage that "all models are wrong, but some are useful" however, little formal guidance exists on how to assess the impact of model approximation on decision making, or how to proceed when optimal actions appear sensitive to model fidelity. This article presents an overview of recent developments across different disciplines to address this. We review diagnostic techniques, including graphical approaches and summary statistics, to help highlight decisions made through minimised expected loss that are sensitive to model misspecification. We then consider formal methods for decision making under model misspecification by quantifying stability of optimal actions to perturbations to the model within a neighbourhood of model space. This neighbourhood is defined in either one of two ways. First, in a strong sense via an information (Kullback-Leibler) divergence around the approximating model. Second, using a Bayesian nonparametric model (prior) centred on the approximating model, in order to "average out" over possible misspecifications. This is presented in the context of recent work in the robust control, macroeconomics and financial mathematics literature. We adopt a Bayesian approach throughout although the presentation is agnostic to this position.

© 2016, Institute of Mathematical Statistics. All rights reserved. We present a novel Bayesian nonparametric regression model for covariates X and continuous response variable Y ∈ R. The model is parametrized in terms of marginal distributions for Y and X and a regression function which tunes the stochastic ordering of the conditional distributions F(y|x). By adopting an approximate composite likelihood approach, we show that the resulting posterior inference can be decoupled for the separate components of the model. This procedure can scale to very large datasets and allows for the use of standard, existing, software from Bayesian nonparametric density estimation and Plackett-Luce ranking estimation to be applied. As an illustration, we show an application of our approach to a US Census dataset, with over 1,300,000 data points and more than 100 covariates.

We propose a framework for general Bayesian inference. We argue that a valid update of a prior belief distribution to a posterior can be made for parameters which are connected to observations through a loss function rather than the traditional likelihood function, which is recovered as a special case. Modern application areas make it increasingly challenging for Bayesians to attempt to model the true data-generating mechanism. For instance, when the object of interest is low dimensional, such as a mean or median, it is cumbersome to have to achieve this via a complete model for the whole data distribution. More importantly, there are settings where the parameter of interest does not directly index a family of density functions and thus the Bayesian approach to learning about such parameters is currently regarded as problematic. Our framework uses loss functions to connect information in the data to functionals of interest. The updating of beliefs then follows from a decision theoretic approach involving cumulative loss functions. Importantly, the procedure coincides with Bayesian updating when a true likelihood is known yet provides coherent subjective inference in much more general settings. Connections to other inference frameworks are highlighted.

As cardiac cell models become increasingly complex, a correspondingly complex 'genealogy' of inherited parameter values has also emerged. The result has been the loss of a direct link between model parameters and experimental data, limiting both reproducibility and the ability to re-fit to new data. We examine the ability of approximate Bayesian computation (ABC) to infer parameter distributions in the seminal action potential model of Hodgkin and Huxley, for which an immediate and documented connection to experimental results exists. The ability of ABC to produce tight posteriors around the reported values for the gating rates of sodium and potassium ion channels validates the precision of this early work, while the highly variable posteriors around certain voltage dependency parameters suggests that voltage clamp experiments alone are insufficient to constrain the full model. Despite this, Hodgkin and Huxley's estimates are shown to be competitive with those produced by ABC, and the variable behaviour of posterior parametrized models under complex voltage protocols suggests that with additional data the model could be fully constrained. This work will provide the starting point for a full identifiability analysis of commonly used cardiac models, as well as a template for informative, data-driven parametrization of newly proposed models.

The function of the majority of genes in the mouse and human genomes remains unknown. The mouse embryonic stem cell knockout resource provides a basis for the characterization of relationships between genes and phenotypes. The EUMODIC consortium developed and validated robust methodologies for the broad-based phenotyping of knockouts through a pipeline comprising 20 disease-oriented platforms. We developed new statistical methods for pipeline design and data analysis aimed at detecting reproducible phenotypes with high power. We acquired phenotype data from 449 mutant alleles, representing 320 unique genes, of which half had no previous functional annotation. We captured data from over 27,000 mice, finding that 83% of the mutant lines are phenodeviant, with 65% demonstrating pleiotropy. Surprisingly, we found significant differences in phenotype annotation according to zygosity. New phenotypes were uncovered for many genes with previously unknown function, providing a powerful basis for hypothesis generation and further investigation in diverse systems.

To assess factors influencing the success of whole-genome sequencing for mainstream clinical diagnosis, we sequenced 217 individuals from 156 independent cases or families across a broad spectrum of disorders in whom previous screening had identified no pathogenic variants. We quantified the number of candidate variants identified using different strategies for variant calling, filtering, annotation and prioritization. We found that jointly calling variants across samples, filtering against both local and external databases, deploying multiple annotation tools and using familial transmission above biological plausibility contributed to accuracy. Overall, we identified disease-causing variants in 21% of cases, with the proportion increasing to 34% (23/68) for mendelian disorders and 57% (8/14) in family trios. We also discovered 32 potentially clinically actionable variants in 18 genes unrelated to the referral disorder, although only 4 were ultimately considered reportable. Our results demonstrate the value of genome sequencing for routine clinical diagnosis but also highlight many outstanding challenges.

© 2015 International Society for Bayesian Analysis. In this article we describe Bayesian nonparametric procedures for twosample hypothesis testing. Namely, given two sets of samples y (1) iid F (1) and y (2) iid F (2) , with F (1) , F (2) unknown, we wish to evaluate the evidence for the null hypothesis H 0 : F (1) = F (2) versus the alternative H 1 : F (1) ≠ F (2) . Our method is based upon a nonparametric Pólya tree prior centered either subjectively or using an empirical procedure. We show that the Pólya tree prior leads to an analytic expression for the marginal likelihood under the two hypotheses and hence an explicit measure of the probability of the null Pr(H 0 |{y (1) , y (2) }).

Upper- and lower-body fat depots exhibit opposing associations with obesity-related metabolic disease. We defined the relationship between DEXA-quantified fat depots and diabetes/cardiovascular risk factors in a healthy population-based cohort (n = 3,399). Gynoid fat mass correlated negatively with insulin resistance after total fat mass adjustment, whereas the opposite was seen for abdominal fat. Paired transcriptomic analysis of gluteal subcutaneous adipose tissue (GSAT) and abdominal subcutaneous adipose tissue (ASAT) was performed across the BMI spectrum (n = 49; 21.4-45.5 kg/m(2)). In both depots, energy-generating metabolic genes were negatively associated and inflammatory genes were positively associated with obesity. However, associations were significantly weaker in GSAT. At the systemic level, arteriovenous release of the proinflammatory cytokine interleukin-6 (n = 34) was lower from GSAT than ASAT. Isolated preadipocytes retained a depot-specific transcriptional "memory" of embryonic developmental genes and exhibited differential promoter DNA methylation of selected genes (HOTAIR, TBX5) between GSAT and ASAT. Short hairpin RNA-mediated silencing identified TBX5 as a regulator of preadipocyte proliferation and adipogenic differentiation in ASAT. In conclusion, intrinsic differences in the expression of developmental genes in regional adipocytes provide a mechanistic basis for diversity in adipose tissue (AT) function. The less inflammatory nature of lower-body AT offers insight into the opposing metabolic disease risk associations between upper- and lower-body obesity.

BACKGROUND: Reliable measures of anti-malarial resistance are crucial for malaria control. Resistance is typically a complex trait: multiple mutations in a single parasite (a haplotype or genotype) are necessary for elaboration of the resistant phenotype. The frequency of a genetic motif (proportion of parasite clones in the parasite population that carry a given allele, haplotype or genotype) is a useful measure of resistance. In areas of high endemicity, malaria patients generally harbour multiple parasite clones; they have multiplicities of infection (MOIs) greater than one. However, most standard experimental procedures only allow measurement of marker prevalence (proportion of patient blood samples that test positive for a given mutation or combination of mutations), not frequency. It is misleading to compare marker prevalence between sites that have different mean MOIs; frequencies are required instead. METHODS: A Bayesian statistical model was developed to estimate Plasmodium falciparum genetic motif frequencies from prevalence data collected in the field. To assess model performance and computational speed, a detailed simulation study was implemented. Application of the model was tested using datasets from five sites in Uganda. The datasets included prevalence data on markers of resistance to sulphadoxine-pyrimethamine and an average MOI estimate for each study site. RESULTS: The simulation study revealed that the genetic motif frequencies that were estimated using the model were more accurate and precise than conventional estimates based on direct counting. Importantly, the model did not require measurements of the MOI in each patient; it used the average MOI in the patient population. Furthermore, if a dataset included partially genotyped patient blood samples, the model imputed the data that were missing. Using the model and the Ugandan data, genotype frequencies were estimated and four biologically relevant genotypes were identified. CONCLUSIONS: The model allows fast, accurate, reliable estimation of the frequency of genetic motifs associated with resistance to anti-malarials using prevalence data collected from malaria patients. The model does not require per-patient MOI measurements and can easily analyse data from five markers. The model will be a valuable tool for monitoring markers of anti-malarial drug resistance, including markers of resistance to artemisinin derivatives and partner drugs.

Bladder cancers are a leading cause of death from malignancy. Molecular markers might predict disease progression and behaviour more accurately than the available prognostic factors. Here we use whole-genome sequencing to identify somatic mutations and chromosomal changes in 14 bladder cancers of different grades and stages. As well as detecting the known bladder cancer driver mutations, we report the identification of recurrent protein-inactivating mutations in CDKN1A and FAT1. The former are not mutually exclusive with TP53 mutations or MDM2 amplification, showing that CDKN1A dysfunction is not simply an alternative mechanism for p53 pathway inactivation. We find strong positive associations between higher tumour stage/grade and greater clonal diversity, the number of somatic mutations and the burden of copy number changes. In principle, the identification of sub-clones with greater diversity and/or mutation burden within early-stage or low-grade tumours could identify lesions with a high risk of invasive progression.

Cited:

55

Scopus

OBJECTIVES:Microsatellite instability (MSI) is an established marker of good prognosis in colorectal cancer (CRC). Chromosomal instability (CIN) is strongly negatively associated with MSI and has been shown to be a marker of poor prognosis in a small number of studies. However, a substantial group of double-negative (MSI-/CIN-) CRCs exists. The prognosis of these patients is unclear. Furthermore, MSI and CIN are each associated with specific molecular changes, such as mutations in KRAS and BRAF, that have been associated with prognosis. It is not known which of MSI, CIN, and the specific gene mutations are primary predictors of survival.METHODS:We evaluated the prognostic value (disease-free survival, DFS) of CIN, MSI, mutations in KRAS, NRAS, BRAF, PIK3CA, FBXW7, and TP53, and chromosome 18q loss-of-heterozygosity (LOH) in 822 patients from the VICTOR trial of stage II/III CRC. We followed up promising associations in an Australian community-based cohort (N=375).RESULTS:In the VICTOR patients, no specific mutation was associated with DFS, but individually MSI and CIN showed significant associations after adjusting for stage, age, gender, tumor location, and therapy. A combined analysis of the VICTOR and community-based cohorts showed that MSI and CIN were independent predictors of DFS (for MSI, hazard ratio (HR)=0.58, 95% confidence interval (CI) 0.36-0.93, and P=0.021; for CIN, HR=1.54, 95% CI 1.14-2.08, and P=0.005), and joint CIN/MSI testing significantly improved the prognostic prediction of MSI alone (P=0.028). Higher levels of CIN were monotonically associated with progressively poorer DFS, and a semi-quantitative measure of CIN was a better predictor of outcome than a simple CIN+/-variable. All measures of CIN predicted DFS better than the recently described Watanabe LOH ratio.CONCLUSIONS:MSI and CIN are independent predictors of DFS for stage II/III CRC. Prognostic molecular tests for CRC relapse should currently use MSI and a quantitative measure of CIN rather than specific gene mutations.

BACKGROUND: In order to better understand cancer as a complex disease with multiple genetic and epigenetic factors, it is vital to model the fundamental biological relationships among these alterations as well as their relationships with important clinical outcomes. METHODS: We develop an integrative network-based Bayesian analysis (iNET) approach that allows us to jointly analyze multi-platform high-dimensional genomic data in a computationally efficient manner. The iNET approach is formulated as an objective Bayesian model selection problem for Gaussian graphical models to model joint dependencies among platform-specific features using known biological mechanisms. Using both simulated datasets and a glioblastoma (GBM) study from The Cancer Genome Atlas (TCGA), we illustrate the iNET approach via integrating three data types, microRNA, gene expression (mRNA), and patient survival time. RESULTS: We show that the iNET approach has greater power in identifying cancer-related microRNAs than non-integrative approaches based on realistic simulated datasets. In the TCGA GBM study, we found many mRNA-microRNA pairs and microRNAs that are associated with patient survival time, with some of these associations identified in previous studies. CONCLUSIONS: The iNET discovers relationships consistent with the underlying biological mechanisms among these variables, as well as identifying important biomarkers that are potentially relevant to patient survival. In addition, we identified some microRNAs that can potentially affect patient survival which are missed by non-integrative approaches.

OBJECTIVES: Microsatellite instability (MSI) is an established marker of good prognosis in colorectal cancer (CRC). Chromosomal instability (CIN) is strongly negatively associated with MSI and has been shown to be a marker of poor prognosis in a small number of studies. However, a substantial group of "double-negative" (MSI-/CIN-) CRCs exists. The prognosis of these patients is unclear. Furthermore, MSI and CIN are each associated with specific molecular changes, such as mutations in KRAS and BRAF, that have been associated with prognosis. It is not known which of MSI, CIN, and the specific gene mutations are primary predictors of survival. METHODS: We evaluated the prognostic value (disease-free survival, DFS) of CIN, MSI, mutations in KRAS, NRAS, BRAF, PIK3CA, FBXW7, and TP53, and chromosome 18q loss-of-heterozygosity (LOH) in 822 patients from the VICTOR trial of stage II/III CRC. We followed up promising associations in an Australian community-based cohort (N=375). RESULTS: In the VICTOR patients, no specific mutation was associated with DFS, but individually MSI and CIN showed significant associations after adjusting for stage, age, gender, tumor location, and therapy. A combined analysis of the VICTOR and community-based cohorts showed that MSI and CIN were independent predictors of DFS (for MSI, hazard ratio (HR)=0.58, 95% confidence interval (CI) 0.36-0.93, and P=0.021; for CIN, HR=1.54, 95% CI 1.14-2.08, and P=0.005), and joint CIN/MSI testing significantly improved the prognostic prediction of MSI alone (P=0.028). Higher levels of CIN were monotonically associated with progressively poorer DFS, and a semi-quantitative measure of CIN was a better predictor of outcome than a simple CIN+/- variable. All measures of CIN predicted DFS better than the recently described Watanabe LOH ratio. CONCLUSIONS: MSI and CIN are independent predictors of DFS for stage II/III CRC. Prognostic molecular tests for CRC relapse should currently use MSI and a quantitative measure of CIN rather than specific gene mutations.

This paper is concerned with statistical methods for the segmental classification of linear sequence data where the task is to segment and classify the data according to an underlying hidden discrete state sequence. Such analysis is commonplace in the empirical sciences including genomics, finance and speech processing. In particular, we are interested in answering the following question: given data y and a statistical model π(x, y) of the hidden states x, what should we report as the prediction x̂ under the posterior distribution π(x|y)? That is, how should you make a prediction of the underlying states? We demonstrate that traditional approaches such as reporting the most probable state sequence or most probable set of marginal predictions can give undesirable classification artefacts and offer limited control over the properties of the prediction. We propose a decision theoretic approach using a novel class of Markov loss functions and report x̂ via the principle of minimum expected loss (maximum expected utility).We demonstrate that the sequence of minimum expected loss under the Markov loss function can be enumerated exactly using dynamic programming methods and that it offers flexibility and performance improvements over existing techniques. The result is generic and applicable to any probabilistic model on a sequence, such as Hidden Markov models, change point or product partition models. © Institute of Mathematical Statistics, 2013.

Gene expression in multiple individual cells from a tissue or culture sample varies according to cell-cycle, genetic, epigenetic and stochastic differences between the cells. However, single-cell differences have been largely neglected in the analysis of the functional consequences of genetic variation. Here we measure the expression of 92 genes affected by Wnt signaling in 1,440 single cells from 15 individuals to associate single-nucleotide polymorphisms (SNPs) with gene-expression phenotypes, while accounting for stochastic and cell-cycle differences between cells. We provide evidence that many heritable variations in gene function--such as burst size, burst frequency, cell cycle-specific expression and expression correlation/noise between cells--are masked when expression is averaged over many cells. Our results demonstrate how single-cell analyses provide insights into the mechanistic and network effects of genetic variability, with improved statistical power to model these effects on gene expression.

MOTIVATION: The identification of nucleosomes along the chromatin is key to understanding their role in the regulation of gene expression and other DNA-related processes. However, current experimental methods (MNase-ChIP, MNase-Seq) sample nucleosome positions from a cell population and contain biases, making thus the precise identification of individual nucleosomes not straightforward. Recent works have only focused on the first point, where noise reduction approaches have been developed to identify nucleosome positions. RESULTS: In this article, we propose a new approach, termed NucleoFinder, that addresses both the positional heterogeneity across cells and experimental biases by seeking nucleosomes consistently positioned in a cell population and showing a significant enrichment relative to a control sample. Despite the absence of validated dataset, we show that our approach (i) detects fewer false positives than two other nucleosome calling methods and (ii) identifies two important features of the nucleosome organization (the nucleosome spacing downstream of active promoters and the enrichment/depletion of GC/AT dinucleotides at the centre of in vitro nucleosomes) with equal or greater ability than the other two methods.

Cited:

57

Scopus

The stochastic nature of generating eukaryotic transcripts challenges conventional methods for obtaining and analyzing single-cell gene expression data. In order to address the inherent noise, detailed methods are described on how to collect data on multiple genes in a large number of single cells using microfluidic arrays. As part of a study exploring the effect of genotype on Wnt pathway activation, data were collected for 96 qPCR assays on 1440 lymphoblastoid cells. The description of methods includes preliminary data processing steps. The methods used in the collection and analysis of single-cell qPCR data are contrasted with those used in conventional qPCR. © 2012 Elsevier Inc.

Many individuals with multiple or large colorectal adenomas or early-onset colorectal cancer (CRC) have no detectable germline mutations in the known cancer predisposition genes. Using whole-genome sequencing, supplemented by linkage and association analysis, we identified specific heterozygous POLE or POLD1 germline variants in several multiple-adenoma and/or CRC cases but in no controls. The variants associated with susceptibility, POLE p.Leu424Val and POLD1 p.Ser478Asn, have high penetrance, and POLD1 mutation was also associated with endometrial cancer predisposition. The mutations map to equivalent sites in the proofreading (exonuclease) domain of DNA polymerases ɛ and δ and are predicted to cause a defect in the correction of mispaired bases inserted during DNA replication. In agreement with this prediction, the tumors from mutation carriers were microsatellite stable but tended to acquire base substitution mutations, as confirmed by yeast functional assays. Further analysis of published data showed that the recently described group of hypermutant, microsatellite-stable CRCs is likely to be caused by somatic POLE mutations affecting the exonuclease domain.

SUMMARY: GREVE has been developed to assist with the identification of recurrent genomic aberrations across cancer samples. The exact characterization of such aberrations remains a challenge despite the availability of increasing amount of data, from SNParray to next-generation sequencing. Furthermore, genomic aberrations in cancer are especially difficult to handle because they are, by nature, unique to the patients. However, their recurrence in specific regions of the genome has been shown to reflect their relevance in the development of tumors. GREVE makes use of previously characterized events to identify such regions and focus any further analysis. AVAILABILITY: GREVE is available through a web interface and open-source application (http://www.well.ox.ac.uk/GREVE).

Significance testing one SNP at a time has proven useful for identifying genomic regions that harbor variants affecting human disease. But after an initial genome scan has identified a "hit region" of association, single-locus approaches can falter. Local linkage disequilibrium (LD) can make both the number of underlying true signals and their identities ambiguous. Simultaneous modeling of multiple loci should help. However, it is typically applied ad hoc: conditioning on the top SNPs, with limited exploration of the model space and no assessment of how sensitive model choice was to sampling variability. Formal alternatives exist but are seldom used. Bayesian variable selection is coherent but requires specifying a full joint model, including priors on parameters and the model space. Penalized regression methods (e.g., LASSO) appear promising but require calibration, and, once calibrated, lead to a choice of SNPs that can be misleadingly decisive. We present a general method for characterizing uncertainty in model choice that is tailored to reprioritizing SNPs within a hit region under strong LD. Our method, LASSO local automatic regularization resample model averaging (LLARRMA), combines LASSO shrinkage with resample model averaging and multiple imputation, estimating for each SNP the probability that it would be included in a multi-SNP model in alternative realizations of the data. We apply LLARRMA to simulations based on case-control genome-wide association studies data, and find that when there are several causal loci and strong LD, LLARRMA identifies a set of candidates that is enriched for true signals relative to single locus analysis and to the recently proposed method of Stability Selection.

Metabolic Syndrome (MetS) is highly prevalent and has considerable public health impact, but its underlying genetic factors remain elusive. To identify gene networks involved in MetS, we conducted whole-genome expression and genotype profiling on abdominal (ABD) and gluteal (GLU) adipose tissue, and whole blood (WB), from 29 MetS cases and 44 controls. Co-expression network analysis for each tissue independently identified nine, six, and zero MetS-associated modules of coexpressed genes in ABD, GLU, and WB, respectively. Of 8,992 probesets expressed in ABD or GLU, 685 (7.6%) were expressed in ABD and 51 (0.6%) in GLU only. Differential eigengene network analysis of 8,256 shared probesets detected 22 shared modules with high preservation across adipose depots (D(ABD-GLU) = 0.89), seven of which were associated with MetS (FDR P<0.01). The strongest associated module, significantly enriched for immune response-related processes, contained 94/620 (15%) genes with inter-depot differences. In an independent cohort of 145/141 twins with ABD and WB longitudinal expression data, median variability in ABD due to familiality was greater for MetS-associated versus un-associated modules (ABD: 0.48 versus 0.18, P = 0.08; GLU: 0.54 versus 0.20, P = 7.8×10(-4)). Cis-eQTL analysis of probesets associated with MetS (FDR P<0.01) and/or inter-depot differences (FDR P<0.01) provided evidence for 32 eQTLs. Corresponding eSNPs were tested for association with MetS-related phenotypes in two GWAS of >100,000 individuals; rs10282458, affecting expression of RARRES2 (encoding chemerin), was associated with body mass index (BMI) (P = 6.0×10(-4)); and rs2395185, affecting inter-depot differences of HLA-DRB1 expression, was associated with high-density lipoprotein (P = 8.7×10(-4)) and BMI-adjusted waist-to-hip ratio (P = 2.4×10(-4)). Since many genes and their interactions influence complex traits such as MetS, integrated analysis of genotypes and coexpression networks across multiple tissues relevant to clinical traits is an efficient strategy to identify novel associations.

UNLABELLED: High-throughput technologies can identify genes whose expression profiles correlate with specific phenotypes; however, placing these genes into a biological context remains challenging. To help address this issue, we developed nested Expression Analysis Systematic Explorer (nEASE). nEASE complements traditional gene ontology enrichment approaches by determining statistically enriched gene ontology subterms within a list of genes based on co-annotation. Here, we overview an open-source software version of the nEASE algorithm. nEASE can be used either stand-alone or as part of a pathway discovery pipeline. AVAILABILITY: nEASE is implemented within the Multiple Experiment Viewer software package available at http://www.tm4.org/mev. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Genome-wide association study (GWAS) data on a disease are increasingly available from multiple related populations. In this scenario, meta-analyses can improve power to detect homogeneous genetic associations, but if there exist ancestry-specific effects, via interactions on genetic background or with a causal effect that co-varies with genetic background, then these will typically be obscured. To address this issue, we have developed a robust statistical method for detecting susceptibility gene-ancestry interactions in multi-cohort GWAS based on closely-related populations. We use the leading principal components of the empirical genotype matrix to cluster individuals into "ancestry groups" and then look for evidence of heterogeneous genetic associations with disease or other trait across these clusters. Robustness is improved when there are multiple cohorts, as the signal from true gene-ancestry interactions can then be distinguished from gene-collection artefacts by comparing the observed interaction effect sizes in collection groups relative to ancestry groups. When applied to colorectal cancer, we identified a missense polymorphism in iron-absorption gene CYBRD1 that associated with disease in individuals of English, but not Scottish, ancestry. The association replicated in two additional, independently-collected data sets. Our method can be used to detect associations between genetic variants and disease that have been obscured by population genetic heterogeneity. It can be readily extended to the identification of genetic interactions on other covariates such as measured environmental exposures. We envisage our methodology being of particular interest to researchers with existing GWAS data, as ancestry groups can be easily defined and thus tested for interactions.

We explore the use of generalized t priors on regression coefficients to help understand the nature of association signal within "hit regions" of genome-wide association studies. The particular generalized t distribution we adopt is a Student distribution on the absolute value of its argument. For low degrees of freedom, we show that the generalized t exhibits "sparsity-prior" properties with some attractive features over other common forms of sparse priors and includes the well known double-exponential distribution as the degrees of freedom tends to infinity. We pay particular attention to graphical representations of posterior statistics obtained from sparsity-path-analysis (SPA) where we sweep over the setting of the scale (shrinkage/precision) parameter in the prior to explore the space of posterior models obtained over a range of complexities, from very sparse models with all coefficient distributions heavily concentrated around zero, to models with diffuse priors and coefficients distributed around their maximum likelihood estimates. The SPA plots are akin to LASSO plots of maximum a posteriori (MAP) estimates but they characterise the complete marginal posterior distributions of the coefficients plotted as a function of the precision of the prior. Generating posterior distributions over a range of prior precisions is computationally challenging but naturally amenable to sequential Monte Carlo (SMC) algorithms indexed on the scale parameter. We show how SMC simulation on graphic-processing-units (GPUs) provides very efficient inference for SPA. We also present a scale-mixture representation of the generalized t prior that leads to an expectation-maximization (EM) algorithm to obtain MAP estimates should only these be required.

Genome-wide array approaches and sequencing analyses are powerful tools for identifying genetic aberrations in cancers, including leukemias and lymphomas. However, the clinical and biological significance of such aberrations and their subclonal distribution are poorly understood. Here, we present the first genome-wide array based study of pre-treatment and relapse samples from patients with B-cell chronic lymphocytic leukemia (B-CLL) that uses the computational statistical tool OncoSNP. We show that quantification of the proportion of copy number alterations (CNAs) and copy neutral loss of heterozygosity regions (cnLOHs) in each sample is feasible. Furthermore, we (i) reveal complex changes in the subclonal architecture of paired samples at relapse compared with pre-treatment, (ii) provide evidence supporting an association between increased genomic complexity and poor clinical outcome (iii) report previously undefined, recurrent CNA/cnLOH regions that expand or newly occur at relapse and therefore might harbor candidate driver genes of relapse and/or chemotherapy resistance. Our findings are likely to impact on future therapeutic strategies aimed towards selecting effective and individually tailored targeted therapies.

In biomarker discovery studies, uncertainty associated with case and control labels is often overlooked. By omitting to take into account label uncertainty, model parameters and the predictive risk can become biased, sometimes severely. The most common situation is when the control set contains an unknown number of undiagnosed, or future, cases. This has a marked impact in situations where the model needs to be well-calibrated, e.g., when the prediction performance of a biomarker panel is evaluated. Failing to account for class label uncertainty may lead to underestimation of classification performance and bias in parameter estimates. This can further impact on meta-analysis for combining evidence from multiple studies. Using a simulation study, we outline how conventional statistical models can be modified to address class label uncertainty leading to well-calibrated prediction performance estimates and reduced bias in meta-analysis. We focus on the problem of mislabeled control subjects in case-control studies, i.e., when some of the control subjects are undiagnosed cases, although the procedures we report are generic. The uncertainty in control status is a particular situation common in biomarker discovery studies in the context of genomic and molecular epidemiology, where control subjects are commonly sampled from the general population with an established expected disease incidence rate.

We have performed a metabolite quantitative trait locus (mQTL) study of the (1)H nuclear magnetic resonance spectroscopy ((1)H NMR) metabolome in humans, building on recent targeted knowledge of genetic drivers of metabolic regulation. Urine and plasma samples were collected from two cohorts of individuals of European descent, with one cohort comprised of female twins donating samples longitudinally. Sample metabolite concentrations were quantified by (1)H NMR and tested for association with genome-wide single-nucleotide polymorphisms (SNPs). Four metabolites' concentrations exhibited significant, replicable association with SNP variation (8.6×10(-11)<p<2.8×10(-23)). Three of these-trimethylamine, 3-amino-isobutyrate, and an N-acetylated compound-were measured in urine. The other-dimethylamine-was measured in plasma. Trimethylamine and dimethylamine mapped to a single genetic region (hence we report a total of three implicated genomic regions). Two of the three hit regions lie within haplotype blocks (at 2p13.1 and 10q24.2) that carry the genetic signature of strong, recent, positive selection in European populations. Genes NAT8 and PYROXD2, both with relatively uncharacterized functional roles, are good candidates for mediating the corresponding mQTL associations. The study's longitudinal twin design allowed detailed variance-components analysis of the sources of population variation in metabolite levels. The mQTLs explained 40%-64% of biological population variation in the corresponding metabolites' concentrations. These effect sizes are stronger than those reported in a recent, targeted mQTL study of metabolites in serum using the targeted-metabolomics Biocrates platform. By re-analysing our plasma samples using the Biocrates platform, we replicated the mQTL findings of the previous study and discovered a previously uncharacterized yet substantial familial component of variation in metabolite levels in addition to the heritability contribution from the corresponding mQTL effects.

Accurate assignment of copy number at known copy number variant (CNV) loci is important for both increasing understanding of the structural evolution of genomes as well as for carrying out association studies of copy number with disease. As with calling SNP genotypes, the task can be framed as a clustering problem but for a number of reasons assigning copy number is much more challenging. CNV assays have lower signal-to-noise ratios than SNP assays, often display heavy tailed and asymmetric intensity distributions, contain outlying observations and may exhibit systematic technical differences among different cohorts. In addition, the number of copy-number classes at a CNV in the population may be unknown a priori. Due to these complications, automatic and robust assignment of copy number from array data remains a challenging problem. We have developed a copy number assignment algorithm, CNVCALL, for a targeted CNV array, such as that used by the Wellcome Trust Case Control Consortium's recent CNV association study. We use a Bayesian hierarchical mixture model that robustly identifies both the number of different copy number classes at a specific locus as well as relative copy number for each individual in the sample. This approach is fully automated which is a critical requirement when analyzing large numbers of CNVs. We illustrate the methods performance using real data from the Wellcome Trust Case Control Consortium's CNV association study and using simulated data.

In this article we develop a class of stochastic boosting (SB) algorithms, which build upon the work of Holmes and Pintore (Bayesian Stat. 8, Oxford University Press, Oxford, 2007). They introduce boosting algorithms which correspond to standard boosting (e. g. Bühlmann and Hothorn, Stat. Sci. 22:477-505, 2007) except that the optimization algorithms are randomized; this idea is placed within a Bayesian framework. We show that the inferential procedure in Holmes and Pintore (Bayesian Stat. 8, Oxford University Press, Oxford, 2007) is incorrect and further develop interpretational, computational and theoretical results which allow one to assess SB's potential for classification and regression problems. To use SB, sequential Monte Carlo (SMC) methods are applied. As a result, it is found that SB can provide better predictions for classification problems than the corresponding boosting algorithm. A theoretical result is also given, which shows that the predictions of SB are not significantly worse than boosting, when the latter provides the best prediction. We also investigate the method on a real case study from machine learning. © 2010 Springer Science+Business Media, LLC.

We propose a hierarchical Bayesian nonparametric mixture model for clustering when some of the covariates are assumed to be of varying relevance to the clustering problem. This can be thought of as an issue in variable selection for unsupervised learning. We demonstrate that by defining a hierarchical population based nonparametric prior on the cluster locations scaled by the inverse covariance matrices of the likelihood we arrive at a 'sparsity prior' representation which admits a conditionally conjugate prior. This allows us to perform full Gibbs sampling to obtain posterior distributions over parameters of interest including an explicit measure of each covariate's relevance and a distribution over the number of potential clusters present in the data. This also allows for individual cluster specific variable selection. We demonstrate improved inference on a number of canonical problems.

AIMS: We introduce an innovative multilocus test for disease association. It is an extension of an existing score test that gains power over alternative methods by incorporating a parsimonious one-degree-of-freedom model for interaction. We use our method in applications designed to detect interactions that generate hypotheses about the functionality of prostate cancer (PRCA) susceptibility regions. METHODS: Our proposed score test is designed to gain additional power through the use of a retrospective likelihood that exploits an assumption of independence between unlinked loci in the underlying population. Its performance is validated through simulation. The method is used in conditional scans with data from stage II of the Cancer Genetic Markers of Susceptibility PRCA genome-wide association study. RESULTS: Our proposed method increases power to detect susceptibility loci in diverse settings. It identified two high-ranking, biologically interesting interactions: (1) rs748120 of NR2C2 and subregions of 8q24 that contain independent susceptibility loci specific to PRCA and (2) rs4810671 of SULF2 and both JAZF1 and HNF1B that are associated with PRCA and type 2 diabetes. CONCLUSIONS: Our score test is a promising multilocus tool for genetic epidemiology. The results of our applications suggest functionality for poorly understood PRCA susceptibility regions. They motivate replication study.

To understand how miRNAs contribute to the molecular phenotype of adipose tissues and related traits, we performed global miRNA expression profiling in subcutaneous abdominal and gluteal adipose tissue of 70 human subjects and characterised which miRNAs were differentially expressed between these tissues. We found that 12% of the miRNAs were significantly differentially expressed between abdominal and gluteal adipose tissue (FDR adjusted p<0.05) in the primary study, of which 59 replicated in a follow-up study of 40 additional subjects. Further, 14 miRNAs were found to be associated with metabolic syndrome case-control status in abdominal tissue and three of these replicated (primary study: FDR adjusted p<0.05, replication: p<0.05 and directionally consistent effect). Genome-wide genotyping was performed in the 70 subjects to enable miRNA expression quantitative trait loci (eQTL) analysis. Candidate miRNA eQTLs were followed-up in the additional 40 subjects and six significant, independent cis-located miRNA eQTLs (primary study: p<0.001; replication: p<0.05 and directionally consistent effect) were identified. Finally, global mRNA expression profiling was performed in both tissues to enable association analysis between miRNA and target mRNA expression levels. We find 22% miRNAs in abdominal and 9% miRNAs in gluteal adipose tissue with expression levels significantly associated with the expression of corresponding target mRNAs (FDR adjusted p<0.05). Taken together, our results indicate a clear difference in the miRNA molecular phenotypic profile of abdominal and gluteal adipose tissue, that the expressions of some miRNAs are influenced by cis-located genetic variants and that miRNAs are associated with expression levels of their predicted mRNA targets.

¹H Nuclear Magnetic Resonance spectroscopy (¹H NMR) is increasingly used to measure metabolite concentrations in sets of biological samples for top-down systems biology and molecular epidemiology. For such purposes, knowledge of the sources of human variation in metabolite concentrations is valuable, but currently sparse. We conducted and analysed a study to create such a resource. In our unique design, identical and non-identical twin pairs donated plasma and urine samples longitudinally. We acquired ¹H NMR spectra on the samples, and statistically decomposed variation in metabolite concentration into familial (genetic and common-environmental), individual-environmental, and longitudinally unstable components. We estimate that stable variation, comprising familial and individual-environmental factors, accounts on average for 60% (plasma) and 47% (urine) of biological variation in ¹H NMR-detectable metabolite concentrations. Clinically predictive metabolic variation is likely nested within this stable component, so our results have implications for the effective design of biomarker-discovery studies. We provide a power-calculation method which reveals that sample sizes of a few thousand should offer sufficient statistical precision to detect ¹H NMR-based biomarkers quantifying predisposition to disease.

We present a case-study on the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods. Graphics cards, containing multiple Graphics Processing Units (GPUs), are self-contained parallel computational devices that can be housed in conventional desktop and laptop computers and can be thought of as prototypes of the next generation of many-core processors. For certain classes of population-based Monte Carlo algorithms they offer massively parallel simulation, with the added advantage over conventional distributed multi-core processors that they are cheap, easily accessible, easy to maintain, easy to code, dedicated local devices with low power consumption. On a canonical set of stochastic simulation examples including population-based Markov chain Monte Carlo methods and Sequential Monte Carlo methods, we nd speedups from 35 to 500 fold over conventional single-threaded computer code. Our findings suggest that GPUs have the potential to facilitate the growth of statistical modelling into complex data rich domains through the availability of cheap and accessible many-core computation. We believe the speedup we observe should motivate wider use of parallelizable simulation methods and greater methodological attention to their design.

Waist-hip ratio (WHR) is a measure of body fat distribution and a predictor of metabolic consequences independent of overall adiposity. WHR is heritable, but few genetic variants influencing this trait have been identified. We conducted a meta-analysis of 32 genome-wide association studies for WHR adjusted for body mass index (comprising up to 77,167 participants), following up 16 loci in an additional 29 studies (comprising up to 113,636 subjects). We identified 13 new loci in or near RSPO3, VEGFA, TBX15-WARS2, NFE2L3, GRB14, DNM3-PIGC, ITPR2-SSPN, LY86, HOXC13, ADAMTS9, ZNRF3-KREMEN1, NISCH-STAB1 and CPEB4 (P = 1.9 × 10⁻⁹ to P = 1.8 × 10⁻⁴⁰) and the known signal at LYPLAL1. Seven of these loci exhibited marked sexual dimorphism, all with a stronger effect on WHR in women than men (P for sex difference = 1.9 × 10⁻³ to P = 1.2 × 10⁻¹³). These findings provide evidence for multiple loci that modulate body fat distribution independent of overall adiposity and reveal strong gene-by-sex interactions.

MOTIVATION: Quantifying differences in linkage disequilibrium (LD) between sub-groups can highlight genetic regions or sites under selection and/or associated with disease, and may have utility in trans-ethnic mapping studies. RESULTS: We present a novel pseudo Bayes factor (PBF) approach that assess differences in covariance of genotype frequencies from single nucleotide polymorphism (SNP) data from a genome-wide study. The magnitude of the PBF reflects the strength of evidence for a difference, while accounting for the sample size and number of SNPs, without the requirement for permutation testing to establish statistical significance. Application of the PBF to HapMap and Gambian malaria SNP data reveals regional LD differences, some known to be under selection. AVAILABILITY AND IMPLEMENTATION: The PBF approach has been implemented in the BALD (Bayesian analysis of LD differences) C++ software, and is available from http://homepages.lshtm.ac.uk/tgclark/downloads.

Standard techniques for single marker quantitative trait mapping perform poorly in detecting complex interacting genetic influences. When a genetic marker interacts with other genetic markers and/or environmental factors to influence a quantitative trait, a sample of individuals will show different effects according to their exposure to other interacting factors. This paper presents a Bayesian mixture model, which effectively models heterogeneous genetic effects apparent at a single marker. We compute approximate Bayes factors which provide an efficient strategy for screening genetic markers (genome-wide) for evidence of a heterogeneous effect on a quantitative trait. We present a simulation study which demonstrates that the approximation is good and provide a real data example which identifies a population-specific genetic effect on gene expression in the HapMap CEU and YRI populations. We advocate the use of the model as a strategy for identifying candidate interacting markers without any knowledge of the nature or order of the interaction. The source of heterogeneity can be modeled as an extension.

Copy number variants (CNVs) account for a major proportion of human genetic polymorphism and have been predicted to have an important role in genetic susceptibility to common disease. To address this we undertook a large, direct genome-wide study of association between CNVs and eight common human diseases. Using a purpose-designed array we typed approximately 19,000 individuals into distinct copy-number classes at 3,432 polymorphic CNVs, including an estimated approximately 50% of all common CNVs larger than 500 base pairs. We identified several biological artefacts that lead to false-positive associations, including systematic CNV differences between DNAs derived from blood and cell lines. Association testing and follow-up replication analyses confirmed three loci where CNVs were associated with disease-IRGM for Crohn's disease, HLA for Crohn's disease, rheumatoid arthritis and type 1 diabetes, and TSPAN8 for type 2 diabetes-although in each case the locus had previously been identified in single nucleotide polymorphism (SNP)-based studies, reflecting our observation that most common CNVs that are well-typed on our array are well tagged by SNPs and so have been indirectly explored through SNP studies. We conclude that common CNVs that can be typed on existing platforms are unlikely to contribute greatly to the genetic basis of common human diseases.

GIPC1 is a cytoplasmic scaffold protein that interacts with numerous receptor signaling complexes, and emerging evidence suggests that it plays a role in tumorigenesis. GIPC1 is highly expressed in a number of human malignancies, including breast, ovarian, gastric, and pancreatic cancers. Suppression of GIPC1 in human pancreatic cancer cells inhibits in vivo tumor growth in immunodeficient mice. To better understand GIPC1 function, we suppressed its expression in human breast and colorectal cancer cell lines and human mammary epithelial cells (HMECs) and assayed both gene expression and cellular phenotype. Suppression of GIPC1 promotes apoptosis in MCF-7, MDA-MD231, SKBR-3, SW480, and SW620 cells and impairs anchorage-independent colony formation of HMECs. These observations indicate GIPC1 plays an essential role in oncogenic transformation, and its expression is necessary for the survival of human breast and colorectal cancer cells. Additionally, a GIPC1 knock-down gene signature was used to interrogate publically available breast and ovarian cancer microarray datasets. This GIPC1 signature statistically correlates with a number of breast and ovarian cancer phenotypes and clinical outcomes, including patient survival. Taken together, these data indicate that GIPC1 inhibition may represent a new target for therapeutic development for the treatment of human cancers.

BACKGROUND: Array comparative genomic hybridization (aCGH) to detect copy number variants (CNVs) in mammalian genomes has led to a growing awareness of the potential importance of this category of sequence variation as a cause of phenotypic variation. Yet there are large discrepancies between studies, so that the extent of the genome affected by CNVs is unknown. We combined molecular and aCGH analyses of CNVs in inbred mouse strains to investigate this question. PRINCIPAL FINDINGS: Using a 2.1 million probe array we identified 1,477 deletions and 499 gains in 7 inbred mouse strains. Molecular characterization indicated that approximately one third of the CNVs detected by the array were false positives and we estimate the false negative rate to be more than 50%. We show that low concordance between studies is largely due to the molecular nature of CNVs, many of which consist of a series of smaller deletions and gains interspersed by regions where the DNA copy number is normal. CONCLUSIONS: Our results indicate that CNVs detected by arrays may be the coincidental co-localization of smaller CNVs, whose presence is more likely to perturb an aCGH hybridization profile than the effect of an isolated, small, copy number alteration. Our findings help explain the hitherto unexplored discrepancies between array-based studies of copy number variation in the mouse genome.

We describe a statistical method for the characterization of genomic aberrations in single nucleotide polymorphism microarray data acquired from cancer genomes. Our approach allows us to model the joint effect of polyploidy, normal DNA contamination and intra-tumour heterogeneity within a single unified Bayesian framework. We demonstrate the efficacy of our method on numerous datasets including laboratory generated mixtures of normal-cancer cell lines and real primary tumours.

MOTIVATION: Identifying the network structure through which genes and their products interact can help to elucidate normal cell physiology as well as the genetic architecture of pathological phenotypes. Recently, a number of gene network inference tools have appeared based on Gaussian graphical model representations. Following this, we introduce a novel Boosting approach to learn the structure of a high-dimensional Gaussian graphical model motivated by the applications in genomics. A particular emphasis is paid to the inclusion of partial prior knowledge on the structure of the graph. With the increasing availability of pathway information and large-scale gene expression datasets, we believe that conditioning on prior knowledge will be an important aspect in raising the statistical power of structural learning algorithms to infer true conditional dependencies. RESULTS: Our Boosting approach, termed BoostiGraph, is conceptually and algorithmically simple. It complements recent work on the network inference problem based on Lasso-type approaches. BoostiGraph is computationally cheap and is applicable to very high-dimensional graphs. For example, on graphs of order 5000 nodes, it is able to map out paths for the conditional independence structure in few minutes. Using computer simulations, we investigate the ability of our method with and without prior information to infer Gaussian graphical models from artificial as well as actual microarray datasets. The experimental results demonstrate that, using our method, it is possible to recover the true network topology with relatively high accuracy. AVAILABILITY: This method and all other associated files are freely available from http://www.stats.ox.ac.uk/~anjum/.

Deafness is the most common sensory disorder in humans and the aetiology of genetic deafness is complex. Mouse mutants have been crucial in identifying genes involved in hearing. However, many deafness genes remain unidentified. Using N-ethyl N-nitrosourea (ENU) mutagenesis to generate new mouse models of deafness, we identified a novel semi-dominant mouse mutant, Cloth-ears (Clth). Cloth-ears mice show reduced acoustic startle response and mild hearing loss from approximately 30 days old. Auditory-evoked brainstem response (ABR) and distortion product otoacoustic emission (DPOAE) analyses indicate that the peripheral neural auditory pathway is impaired in Cloth-ears mice, but that cochlear function is normal. In addition, both Clth/Clth and Clth/+ mice display paroxysmal tremor episodes with behavioural arrest. Clth/Clth mice also show a milder continuous tremor during movement and rest. Longitudinal phenotypic analysis showed that Clth/+ and Clth/Clth mice also have complex defects in behaviour, growth, neurological and motor function. Positional cloning of Cloth-ears identified a point mutation in the neuronal voltage-gated sodium channel alpha-subunit gene, Scn8a, causing an aspartic acid to valine (D981V) change six amino acids downstream of the sixth transmembrane segment of the second domain (D2S6). Complementation testing with a known Scn8a mouse mutant confirmed that this mutation is responsible for the Cloth-ears phenotype. Our findings suggest a novel role for Scn8a in peripheral neural hearing loss and paroxysmal motor dysfunction.

Cited:

87

Scopus

Highly recombinant populations derived from inbred lines, such as advanced intercross lines and heterogeneous stocks, can be used to map loci far more accurately than is possible with standard intercrosses. However, the varying degrees of relatedness that exist between individuals complicate analysis, potentially leading to many false positive signals. We describe a method to deal with these problems that does not require pedigree information and accounts for model uncertainty through model averaging. In our method, we select multiple quantitative trait loci (QTL) models using forward selection applied to resampled data sets obtained by nonparametric bootstrapping and subsampling. We provide model-averaged statistics about the probability of loci or of multilocus regions being included in model selection, and this leads to more accurate identification of QTL than by single-locus mapping. The generality of our approach means it can potentially be applied to any population of unknown structure. Copyright © 2009 by the Genetics Society of America.

MOTIVATION: Short interfering RNA (siRNA)-induced RNA interference is an endogenous pathway in sequence-specific gene silencing. The potency of different siRNAs to inhibit a common target varies greatly and features affecting inhibition are of high current interest. The limited success in predicting siRNA potency being reported so far could originate in the small number and the heterogeneity of available datasets in addition to the knowledge-driven, empirical basis on which features thought to be affecting siRNA potency are often chosen. We attempt to overcome these problems by first constructing a meta-dataset of 6483 publicly available siRNAs (targeting mammalian mRNA), the largest to date, and then applying a Bayesian analysis which accommodates feature set uncertainty. A stochastic logistic regression-based algorithm is designed to explore a vast model space of 497 compositional, structural and thermodynamic features, identifying associations with siRNA potency. RESULTS: Our algorithm reveals a number of features associated with siRNA potency that are, to the best of our knowledge, either under reported in literature, such as anti-sense 5' -3' motif 'UCU', or not reported at all, such as the anti-sense 5' -3' motif 'ACGA'. These findings should aid in improving future siRNA potency predictions and might offer further insights into the working of the RNA-induced silencing complex (RISC).

With the emergence of Biobanks alongside large-scale genome-wide association studies (GWAS) we will soon be in the enviable situation of obtaining precise estimates of population allele frequencies for SNPs which make up the panels in standard genotyping arrays, such as those produced from Illumina and Affymetrix. For disease association studies it is well known that for rare diseases with known population minor allele frequencies (pMAFs) a case-only design is most powerful. That is, for a fixed budget the optimal procedure is to genotype only cases (affecteds). In such tests experimenters look for a divergence from allele distribution in cases from that of the known population pMAF; in order to test the null hypothesis of no association between the disease status and the allele frequency. However, what has not been previously characterized is the utility of controls (known unaffecteds) when available. In this study we consider frequentist and Bayesian statistical methods for testing for SNP genotype association when population MAFs are known and when both cases and controls are available. We demonstrate that for rare diseases the most powerful frequentist design is, somewhat counterintuitively, to actively discard the controls even though they contain information on the association. In contrast we develop a Bayesian test which uses all available information (cases and controls) and appears to exhibit uniformaly greater power than all frequentist methods we considered.

In this article we propose a modification to the output fromMetropolis-within-Gibbs samplers that can lead to substantial reductions in the variance over standard estimates. The idea is simple: at each time step of the algorithm, introduce an extra sample into the estimate that is negatively correlated with the current sample, the rationale being that this provides a two-sample numerical approximation to a Rao-Blackwellized estimate. As the conditional sampling distribution at each step has already been constructed, the generation of the antithetic sample often requires negligible computational effort. Our method is implementable whenever one subvector of the state can be sampled from its full conditional and the corresponding distribution function may be inverted, or the full conditional has a symmetric density. We demonstrate our approach in the context of logistic regression and hierarchical Poisson models. The data and computer code used in this article are available online. © 2009 American Statistical Association, Institute of Mathematical Statistics, and Interface Foundation of North America.

We have cultured Plasmodium falciparum directly from the blood of infected individuals to examine patterns of mature-stage gene expression in patient isolates. Analysis of the transcriptome of P. falciparum is complicated by the highly periodic nature of gene expression because small variations in the stage of parasite development between samples can lead to an apparent difference in gene expression values. To address this issue, we have developed statistical likelihood-based methods to estimate cell cycle progression and commitment to asexual or sexual development lineages in our samples based on microscopy and gene expression patterns. In cases subsequently matched for temporal development, we find that transcriptional patterns in ex vivo culture display little variation across patients with diverse clinical profiles and closely resemble transcriptional profiles that occur in vitro. These statistical methods, available to the research community, assist in the design and interpretation of P. falciparum expression profiling experiments where it is difficult to separate true differential expression from cell-cycle dependent expression. We reanalyze an existing dataset of in vivo patient expression profiles and conclude that previously observed discrete variation is consistent with the commitment of a varying proportion of the parasite population to the sexual development lineage.

MOTIVATION: Conventional phylogenetic analysis for characterizing the relatedness between taxa typically assumes that a single relationship exists between species at every site along the genome. This assumption fails to take into account recombination which is a fundamental process for generating diversity and can lead to spurious results. Recombination induces a localized phylogenetic structure which may vary along the genome. Here, we generalize a hidden Markov model (HMM) to infer changes in phylogeny along multiple sequence alignments while accounting for rate heterogeneity; the hidden states refer to the unobserved phylogenic topology underlying the relatedness at a genomic location. The dimensionality of the number of hidden states (topologies) and their structure are random (not known a priori) and are sampled using Markov chain Monte Carlo algorithms. The HMM structure allows us to analytically integrate out over all possible changepoints in topologies as well as all the unknown branch lengths. RESULTS: We demonstrate our approach on simulated data and also to the genome of a suspected HIV recombinant strain as well as to an investigation of recombination in the sequences of 15 laboratory mouse strains sequenced by Perlegen Sciences. Our findings indicate that our method allows us to distinguish between rate heterogeneity and variation in phylogeny caused by recombination without being restricted to 4-taxa data.

UNLABELLED: Current genotyping algorithms typically call genotypes by clustering allele-specific intensity data on a single nucleotide polymorphism (SNP) by SNP basis. This approach assumes the availability of a large number of control samples that have been sampled on the same array and platform. We have developed a SNP genotyping algorithm for the Illumina Infinium SNP genotyping assay that is entirely within-sample and does not require the need for a population of control samples nor parameters derived from such a population. Our algorithm exhibits high concordance with current methods and >99% call accuracy on HapMap samples. The ability to call genotypes using only within-sample information makes the method computationally light and practical for studies involving small sample sizes and provides a valuable independent quality control metric for other population-based approaches. AVAILABILITY: http://www.stats.ox.ac.uk/~giannoul/GenoSNP/.

Cited:

26

Scopus

The methodology of interacting sequential Monte Carlo (SMC) samplers is introduced. SMC samplers are methods for sampling from a sequence of densities on a common measurable space using a combination of Markov chain Monte Carlo (MCMC) and sequential importance sampling/resampling (SIR) methodology. One of the main problems with SMC samplers when simulating from trans-dimensional, multimodal static targets is that transition kernels do not mix which leads to low particle diversity. In such situations poor Monte Carlo estimates may be derived. To deal with this problem an interacting SMC approach for static inference is introduced. The method proceeds by running SMC samplers in parallel on, initially, different regions of the state space and then moving the corresponding samples onto the entire state space. Once the samplers reach a common space the samplers are combined and allowed to interact. The method is intended to increase the diversity of the population of samples. It is established that interacting SMC admit a Feynman-Kac representation; this provides a framework for the convergence results that are developed. In addition, the methodology is demonstrated on a trans-dimensional inference problem in Bayesian mixture modelling and also, using adaptive methods, a mixture modelling problem in population genetics. © 2007 Elsevier B.V. All rights reserved.

Genome-wide single nucleotide polymorphism (SNP) genotyping platforms have made an important contribution to population genetics and genetic epidemiology. Recently there has been a realisation that these SNP platforms can also be used for typing copy number variants (CNVs). This allows for 'generalised' genotyping of both SNPs and CNVs simultaneously on a common sample set, with advantages in terms of cost and unified analysis. In this article we review various statistical approaches to calling CNVs from SNP data. We highlight three tiers of algorithms depending on the level of information used.

Cited:

49

Scopus

We present an extension of population-based Markov chain Monte Carlo to the transdimensional case. A major challenge is that of simulating from high- and transdimensional target measures. In such cases, Markov chain Monte Carlo methods may not adequately traverse the support of the target; the simulation results will be unreliable. We develop population methods to deal with such problems, and give a result proving the uniform ergodicity of these population algorithms, under mild assumptions. This result is used to demonstrate the superiority, in terms of convergence rate, of a population transition kernel over a reversible jump sampler for a Bayesian variable selection problem. We also give an example of a population algorithm for a Bayesian multivariate mixture model with an unknown number of components. This is applied to gene expression data of 1000 data points in six dimensions and it is demonstrated that our algorithm outperforms some competing Markov chain samplers. In this example, we show how to combine the methods of parallel chains (Geyer, 1991), tempering (Geyer & Thompson, 1995), snooker algorithms (Gilks et al., 1994), constrained sampling and delayed rejection (Green & Mira, 2001). © 2007 Biometrika Trust.

Cited:

93

Scopus

In this paper we present a review of population-based simulation for static inference problems. Such methods can be described as generating a collection of random variables {X n } n=1,N in parallel in order to simulate from some target density π (or potentially sequence of target densities). Population-based simulation is important as many challenging sampling problems in applied statistics cannot be dealt with successfully by conventional Markov chain Monte Carlo (MCMC) methods. We summarize population-based MCMC (Geyer, Computing Science and Statistics: The 23rd Symposium on the Interface, pp. 156-163, 1991; Liang and Wong, J. Am. Stat. Assoc. 96, 653-666, 2001) and sequential Monte Carlo samplers (SMC) (Del Moral, Doucet and Jasra, J. Roy. Stat. Soc. Ser. B 68, 411-436, 2006a), providing a comparison of the approaches. We give numerical examples from Bayesian mixture modelling (Richard son and Green, J. Roy. Stat. Soc. Ser. B 59, 731-792, 1997). © 2007 Springer Science+Business Media, LLC.

This paper focuses on interest rate models with regime switching and extends previous nonlinear threshold models by relaxing the assumption of a fixed number of regimes. Instead we suggest automatic model determination through Bayesian inference via the reversible jump Markov Chain Monte Carlo (MCMC) algorithm. Moreover, we allow the thresholds in the volatility to be driven not only by the interest rate but also by other economic factors. We illustrate our methodology by applying it to interest rates and other economic factors of the American economy.

Array-based technologies have been used to detect chromosomal copy number changes (aneuploidies) in the human genome. Recent studies identified numerous copy number variants (CNV) and some are common polymorphisms that may contribute to disease susceptibility. We developed, and experimentally validated, a novel computational framework (QuantiSNP) for detecting regions of copy number variation from BeadArray SNP genotyping data using an Objective Bayes Hidden-Markov Model (OB-HMM). Objective Bayes measures are used to set certain hyperparameters in the priors using a novel re-sampling framework to calibrate the model to a fixed Type I (false positive) error rate. Other parameters are set via maximum marginal likelihood to prior training data of known structure. QuantiSNP provides probabilistic quantification of state classifications and significantly improves the accuracy of segmental aneuploidy identification and mapping, relative to existing analytical tools (Beadstudio, Illumina), as demonstrated by validation of breakpoint boundaries. QuantiSNP identified both novel and validated CNVs. QuantiSNP was developed using BeadArray SNP data but it can be adapted to other platforms and we believe that the OB-HMM framework has widespread applicability in genomic research. In conclusion, QuantiSNP is a novel algorithm for high-resolution CNV/aneuploidy detection with application to clinical genetics, cancer and disease association studies.

Cited:

26

WOS

We present a new approach for modelling annealing of fission tracks in apatite, aiming to address various problems with existing models. We cast the model in a fully Bayesian context, which allows us explicitly to deal with data and parameter uncertainties and correlations, and also to deal with the predictive uncertainties. We focus on a well-known annealing algorithm [Laslett, G.M., Green, P.F., Duddy, I.R., Gleadow. A.J.W., 1987. Thermal annealing of fission tracks in apatite. 2. A quantitative-analysis. Chem. Geol., 65 (1), 1-13], and build a hierachical Bayesian model to incorporate both laboratory and geological timescale data as direct constraints. Relative to the original model calibration, we find a better (in terms of likelihood) model conditioned just on the reported laboratory data. We then include the uncertainty on the temperatures recorded during the laboratory annealing experiments. We again find a better model, but the predictive uncertainty when extrapolated to geological timescales is increased due to the uncertainty on the laboratory temperatures. Finally, we explictly include a data set [Vrolijk, P., Donelick, R.A., Quenq, J., Cloos. M., 1992. Testing models of fission track annealing in apatite in a simple thermal setting: site 800, leg 129. In: Larson, R., Lancelet, Y. (Eds.), Proceedings of the Ocean Drilling Program, Scientific Results, vol. 129, pp. 169-176] which provides low-temperature geological timescale constraints for the model calibration. When combined with the laboratory data, we find a model which satisfies both the low-temperature and high-temperature geological timescale benchmarks, although the fit to the original laboratory data is degraded. However, when extrapolated to geological timescales, this combined model significantly reduces the well-known rapid recent cooling artifact found in many published thermal models for geological samples. © 2006 Elsevier Inc. All rights reserved.

BK channels regulate vascular tone by hyperpolarizing smooth muscle in response to fluctuating calcium concentrations. Oestrogen has been reported to lower blood pressure by increasing BK channel open probability through direct binding to the regulatory beta1-subunit(s) associated with the channel. The present investigation demonstrates that 17beta-oestradiol activates the BK channel complex by increasing the burst duration of channel openings. A subconductance state was observed in 25% of recordings following the addition of 17beta-oestradiol and could reflect uncoupling between the pore forming alpha1-subunit and the regulatory beta1-subunit. We also present evidence that more than one beta1-subunit is required to facilitate binding of 17beta-oestradiol to the channel complex.

Cited:

31

WOS

In this paper we develop a generalized statistical methodology for characterizing geochronological data, represented by a distribution of single mineral ages. The main characteristics of such data are the heterogeneity and error associated with its collection. The former property means that mixture models are often appropriate for their analysis, in order to identify discrete age components in the overall distribution. We demonstrate that current methods (e.g., Sambridge and Compston, 1994) for analyzing such problems are not always suitable due to the restriction of the class of component densities that may be fitted to the data. This is of importance, when modelling geochronological data, as it is often the case that skewed and heavy tailed distributions will fit the data well. We concentrate on developing (Bayesian) mixture models with flexibility in the class of component densities, using Markov chain Monte Carlo (MCMC) methods to fit the models. Our method allows us to use any component density to fit the data, as well as returning a probability distribution for the number of components. Furthermore, rather than dealing with the observed ages, as in previous approaches, we make the inferences of components from the "true" ages, i.e., the ages had we been able to observe them without measurement error. We demonstrate our approach on two data sets: uranium-lead (U-Pb) zircon ages from the Khorat basin of northern Thailand and the Carrickalinga Head formation of southern Australia. © Springer Science+Business Media, Inc. 2006.

Cited:

30

WOS

We use a reproducing kernel Hilbert space representation to derive the smoothing spline solution when the smoothness penalty is a function λ(t) of the design space t, thereby allowing the model to adapt to various degrees of smoothness in the structure of the data. We propose a convenient form for the smoothness penalty function and discuss computational algorithms for automatic curve fitting using a generalised crossvalidation measure. © 2006 Biometrika Trust.

Cited:

105

Scopus

Malaria represents one of the major worldwide challenges to public health. A recent breakthrough in the study of the disease follows the annotation of the genome of the malaria parasite Plasmodium falciparum and the mosquito vector (an organism that spreads an infectious disease) Anopheles. Of particular interest is the molecular biology underlying the immune response system of Anopheles, which actively fights against Plasmodium infection. This article reports a statistical analysis of gene expression time profiles from mosquitoes that have been infected with a bacterial agent. Specifically, we introduce a Bayesian model-based hierarchical clustering algorithm for curve data to investigate mechanisms of regulation in the genes concerned; that is, we aim to cluster genes having similar expression profiles. Genes displaying similar, interesting profiles can then be highlighted for further investigation by the experimenter. We show how our approach reveals structure within the data not captured by other approaches. One of the most pertinent features of the data is the sample size, which records the expression levels of 2,771 genes at 6 time points. Additionally, the time points are unequally spaced, and there is expected nonstationary behavior in the gene profiles. We demonstrate our approach to be readily implementable under these conditions, and highlight some crucial computational savings that can be made in the context of a fully Bayesian analysis. © 2006 American Statistical Association.

Cited:

22

WOS

Cited:

148

WOS

This article proposes a new Bayesian approach to prediction on continuous covariates. The Bayesian partition model constructs arbitrarily complex regression and classification surfaces by splitting the covariate space into an unknown number of disjoint regions. Within each region the data are assumed to be exchangeable and come from some simple distribution. Using conjugate priors, the marginal likelihoods of the models can be obtained analytically for any proposed partitioning of the space where the number and location of the regions is assumed unknown a priori. Markov chain Monte Carlo simulation techniques are used to obtain predictive distributions at the design points by averaging across posterior samples of partitions. © 2005 American Statistical Association, Institute of Mathematical Statistics, and Interface Foundation of North America.

We present a method for Bayesian model-based hierarchical coclustering of gene expression data and use it to study the temporal transcription responses of an Anopheles gambiae cell line upon challenge with multiple microbial elicitors. The method fits statistical regression models to the gene expression time series for each experiment and performs coclustering on the genes by optimizing a joint probability model, characterizing gene coregulation between multiple experiments. We compute the model using a two-stage Expectation-Maximization-type algorithm, first fixing the cross-experiment covariance structure and using efficient Bayesian hierarchical clustering to obtain a locally optimal clustering of the gene expression profiles and then, conditional on that clustering, carrying out Bayesian inference on the cross-experiment covariance using Markov chain Monte Carlo simulation to obtain an expectation. For the problem of model choice, we use a cross-validatory approach to decide between individual experiment modeling and varying levels of coclustering. Our method successfully generates tightly coregulated clusters of genes that are implicated in related processes and therefore can be used for analysis of global transcript responses to various stimuli and prediction of gene functions.

Cited:

53

Scopus

Low-temperature thermochronology is a powerful method for constraining the time-temperature history of rocks and provides constraints on denudation chronologies, landscape evolution and tectonic history of geological terrains. We present a strategy for modelling thermal histories constrained by thermochronological data from multiple samples in vertical profiles. The thermal history is specified to be similar in form for each sample, and we include extra parameters to determine the offset temperature between the uppermost and lowermost samples. We combine the likelihood from each sample to produce a joint likelihood for all samples together, and using initially stochastic search then directed search methods we try to identify a maximum likelihood solution. We also implement a test (Bayesian Information Criterion) which allows us to assess whether we have overparameterised the thermal history and potentially introduced unwarranted complexity. This test allows us to simplify the thermal history model without compromising the acceptable data fit. Subsequently, we use a sampling approach to determine the uncertainty or resolution of the thermal history. Markov chain Monte Carlo is straightforward to implement and is used to produce joint and marginal probability distributions, and from these we can infer credible intervals on the model parameters. We demonstrate that combining the samples is preferable in that the final model is easier to interpret and generally has smaller uncertainties than the case where we model all samples independently. We consider both synthetic and real data examples, focusing on apatite fission track analysis, but the general approach is applicable to other analytical methods such as (U-Th)/He dating and 40 Ar/ 39 Ar analysis and in principle it is straightforward to combine these different data types into one joint thermal history model. © 2005 Elsevier B.V. All rights reserved.

Cited:

51

Scopus

In many problems in geostatistics the response variable of interest is strongly related to the underlying geology of the spatial location. In these situations there is often little correlation in the responses found in different rock strata, so the underlying covariance structure shows sharp changes at the boundaries of the rock types. Conventional stationary and nonstationary spatial methods are inappropriate, because they typically assume that the covariance between points is a smooth function of distance. In this article we propose a generic method for the analysis of spatial data with sharp changes in the underlying covariance structure. Our method works by automatically decomposin g the spatial domain into disjoint regions within which the process is assumed to be stationary, but the data are assumed independent across regions. Uncertainty in the number of disjoint regions, their shapes, and the model within regions is dealt with in a fully Bayesian fashion. We illustrate our approach on a previously unpublished dataset relating to soil permeability of the Schneider Buda oil field in Wood County, Texas. © 2005 American Statistical Association.

Cited:

257

Scopus

In the past ten years there has been a dramatic increase of interest in the Bayesian analysis of finite mixture models. This is primarily because of the emergence of Markov chain Monte Carlo (MCMC) methods. While MCMC provides a convenient way to draw inference from complicated statistical models, there are many, perhaps underappreciated, problems associated with the MCMC analysis of mixtures. The problems are mainly caused by the nonidentifiability of the components under symmetric priors, which leads to so-called label switching in the MCMC output. This means that ergodic averages of component specific quantities will be identical and thus useless for inference. We review the solutions to the label switching problem, such as artificial identifiability constraints, relabelling algorithms and label invariant loss functions. We also review various MCMC sampling schemes that have been suggested for mixture models and discuss posterior sensitivity to prior specification. © Institute of Mathematical Statistics, 2005.

We have given an overview of a modeling strategy aimed at exploiting the spatial geometry of the sample distributions in order to maximize the retrieval of thermal history information from the thermochronological data. Philosophically, we aim to find thermal history solutions that fit the observations well, but do not have unwarranted complexity. These two requirements are quantified through the Bayesian Information Criterion, which combines the data likelihood and the number of model parameters. The overall approach relies on exploiting the spatial geometry of the sample locations to combine data from individual samples and identify a common thermal history. The combination of different data sets has the advantage of improving the resolution in the inferred thermal history, and also reducing the complexity. Markov chain Monte Carlo sampling provides a means of constructing reliable representations on the probability distributions for the parameters. 1D modeling is relevant to vertical profiles, and provides an estimate of the paleotemperature gradient directly from the data. The 2D approach relies on a partition model, in which each partition contains a subgroup of the samples with a common thermal history. The partition model approach allows for an unknown number of discontinuities, whose locations are also unknown. The extension to 3D combines the 1D and 2D approaches to find partitions in which samples at different elevations have experienced a common form of thermal history, but the actual temperatures depend on the elevation. As presented here, it is implicit that the spatial relationship between samples has not changed over time, at least not in a way that will lead to differential thermal histories. The approach presented here is different to 3D thermal models (Braun 2003, 2005; Ehlers 2005) but complementary. Thus, we infer the thermal history directly from the data, while the other 3D models are specified and certain parameters are adjusted to match the observed data. Both approaches assume thatthe predictive models for fission track annealing or helium diffusion in apatite are correct. In principle, this assumption can be relaxed and appropriate predictive model parameters can be estimated as part of the modelling process. However, this will lead to significant trade-off between annealing of diffusion parameters and the thermal history (Gallagher and Evans 1991). Future modifications to this approcah will include more generalized sampling of the thermal histories during th MCMC sampling of the partition structure, incorporation of mutiple data types (e.g., apatite fission track and (U-Th)/He data) and potentially allowing for irregularly shaped partition boundaries. Copyright © Mineralogical Society of America.

It is widely supposed that the tissue specificity of gene expression indicates gene function. Now, an extensive analysis of gene expression in the mouse reveals that quantitative measurement of expression levels in different tissues can contribute powerfully to the prediction of gene function.

The technique of kriging is widely known to be limited by its assumption of stationarity, and performs poorly when the data involve localized effects such as discontinuities or nonlinear trends. A Bayesian partition model (BPM) is compared with results from ordinary kriging for various synthetic discontinuous 1-D functions, as well as for 1986 precipitation data from Switzerland. This latter dataset has been analysed during a comparison of spatial interpolation techniques, and has been interpreted as a stationary distribution and one thus suited to kriging. The results demonstrate that the BPM outperformed kriging in all of the datasets compared (when tested for prediction accuracy at a number of validation points), with improvements by a factor of up to 6 for the synthetic functions. © The Geological Society of London.

Cited:

31

Scopus

A Bayesian method is presented for the nonparametric modeling of univariate and multivariate non-Gaussian response data. Data-adaptive multivariate regression splines are used where the number and location of the knot points are treated as random. The posterior model space is explored using a reversible-jump Markov chain Monte Carlo sampler. Computational difficulties are partly alleviated by introducing a random residual effect in the model that leaves many of the posterior conditional distributions of the model parameters in standard form. The use of the latent residual effect provides a convenient vehicle for modeling correlation in multivariate response data, and as such our method can be seen to generalize the seemingly unrelated regression model to non-Gaussian data. We illustrate the method on a number of examples, including two previously unpublished datasets relating to the spatial smoothing of multivariate accident data in Texas and the modeling of credit card use across multiple retail sectors.

Traditionally the neighbourhood size k in the k-nearest-neighbour algorithm is either fixed at the first nearest neighbour or is selected on the basis of a crossvalidation study. In this paper we present an alternative approach that develops the k-nearest-neighbour algorithm using likelihood-based inference. Our method takes the form of a generalised linear regression on a set of k-nearest-neighbour autocovariates. By defining the k-nearest-neighbour algorithm in this way we are able to extend the method to accommodate the original predictor variables as possible linear effects as well as allowing for the inclusion of multiple nearest-neighbour terms. The choice of the final model proceeds via a stepwise regression procedure. It is shown that our method incorporates a conventional generalised linear model and a conventional k-nearest-neighbour algorithm as special cases. Empirical results suggest that the method out-performs the standard k-nearest-neighbour method in terms of misclassification rate on a wide variety of datasets.

We introduce a procedure for generalized monotonic curve fitting that is based on a Bayesian analysis of the isotonic regression model. Conventional isotonic regression fits monotonically increasing step functions to data. In our approach we treat the number and location of the steps as random. For each step level we adopt the conjugate prior to the sampling distribution of the data as if the curve was unconstrained. We then propose to use Markov chain Monte Carlo simulation to draw samples from the unconstrained model space and retain only those samples for which the monotonic constraint holds. The proportion of the samples collected for which the constraint holds can be used to provide a value for the weight of evidence in terms of Bayes factors for monotonicity given the data. Using the samples, probability statements can be made about other quantities of interest such as the number of change points in the data and posterior distributions on the location of the change points can be provided. The method is illustrated throughout by a reanalysis of the leukaemia data studied by Schell and Singh.

We present a new method for classification using a Bayesian version of the Multivariate Adaptive Regression Spline (MARS) model of J.H. Friedman (Annals of Statistics, 19, 1-141, 1991). Special attention is paid to the use of Markov chain Monte Carlo (MCMC) simulation to gain inference under the model. In particular we discuss three important developments in MCMC methodology. First, we describe the reversible jump MCMC algorithm of P.J. Green (Biometrika, 82, 711-732, 1995) which allows inference on a varying dimensional, possibly uncountable, model space. This allows us to consider MARS models of differing numbers and positions of splines. Secondly, we discuss marginalisation which is used to reduce the effective dimension of the parameter space under consideration. Thirdly, we describe the use of latent variables to improve the MCMC computation. Our methods are generic and can be applied to any basis function model including, wavelets, artificial neural nets and radial basis functions. We present examples to show that the Bayesian MARS classifier is competitive with other approaches on a number of benchmark data sets.

This article considers inference in a Bayesian seemingly unrelated regression (SUR) model where the set of regressors is assumed unknown a priori. That is, we allow for uncertainty in the covariate set by defining a prior distribution on the model space. The posterior inference is analytically intractable and we adopt computer-intensive simulation using variable dimension Markov chain Monte Carlo algorithms to approximate quantities of interest. Applications are given for vector autoregression (VAR) models of unknown order and multivariate spline models with unknown knot points.

The coupling from the past (CFTP) procedure is a protocol for finite-state Markov chain Monte Carlo (MCMC) methods whereby the algorithm itself can determine the necessary runtime to convergence. In this paper, we demonstrate how this protocol can be applied to the problem of signal reconstruction using Bayesian wavelet analysis where the dimensionality of the wavelet basis set is unknown, and the observations are distorted by Gaussian white noise of unknown variance. MCMC simulation is used to account for model uncertainty by drawing samples of wavelet bases for approximating integrals (or summations) on the model space that are either too complex or too computationally demanding to perform analytically. We extend the CFTP protocol by making use of the central limit theorem to show how the algorithm can also monitor its own approximation error induced by MCMC. In this way, we can assess the number of MCMC samples needed to approximate the integral to within a user specified tolerance level. Hence, the method automatically ensures convergence and determines the necessary number of iterations needed to meet the error criteria.

Cited:

82

Scopus

Nearest neighbour algorithms are among the most popular methods used in statistical pattern recognition. The models are conceptually simple and empirical studies have shown that their performance is highly competitive against other techniques. However, the lack of a formal framework for choosing the size of the neighbourhood k is problematic. Furthermore, the method can only make discrete predictions by reporting the relative frequency of the classes in the neighbourhood of the prediction point. We present a probabilistic framework for the k-nearest-neighbour method that largely overcomes these difficulties. Uncertainty is accommodated via a prior distribution on k as well as in the strength of the interaction between neighbours, These prior distributions propagate uncertainty through to proper probabilistic predictions that have continuous support on (0, 1). The method makes no assumptions about the distribution of the predictor variables. The method is also fully automatic with no user-set parameters and empirically it proves to be highly accurate on many bench-mark data sets.

Cited:

46

Scopus

Problems in data analysis often require the unsupervised partitioning of a data set into classes. Several methods exist for such partitioning but many have the weakness of being formulated via strict parametric models (e.g., each class is modeled by a single Gaussian) or being computationally intensive in high-dimensional data spaces. We reconsider the notion of such cluster analysis in information-theoretic terms and show that an efficient partitioning may be given via a minimization of partition entropy. A reversible-jump sampling is introduced to explore the variable-dimension space of partition models.

This paper presents a Bayesian nonlinear approach for the analysis of spatial count data. It extends the Bayesian partition methodology of Holmes, Denison, and Mallick (1999, Bayesian partitioning for classification and regression, Technical Report, Imperial College, London) to handle data that involve counts. A demonstration involving incidence rates of leukemia in New York state is used to highlight the methodology. The model allows us to make probability statements on the incidence rates around point sources without making any parametric assumptions about the nature of the influence between the sources and the surrounding location.

Cited:

33

WOS

Radial wavelet networks have recently been proposed as a method for nonparametric regression. In this paper we analyze their performance within a Bayesian framework. We derive probability distributions over both the dimension of the networks and the network coefficients by placing a prior on the degrees of freedom of the model. This process bypasses the need to test or select a finite number of networks during the modeling process. Predictions are formed by mixing over many models of varying dimension and parameterization.We show that the complexity of the models adapts to the complexity of the data and produces good results on a number of benchmark test series.

Cited:

70

Scopus

A Bayesian framework for the analysis of radial basis functions (RBF) is proposed that accommodates uncertainty in the dimension of the model. A distribution is denned over the space of all RBF models of a given basis function, and posterior densities are computed using reversible jump Markov chain Monte Carlo samplers (Green, 1995). This alleviates the need to select the architecture during the modeling process. The resulting networks are shown to adjust their size to the complexity of the data.

In modern applications, statisticians are faced with integrating heterogeneous data modalities relevant for an inference, prediction, or decision problem. In such circumstances, it is convenient to use a graphical model to represent the statistical dependencies, via a set of connected "modules", each relating to a specific data modality, and drawing on specific domain expertise in their development. In principle, given data, the conventional statistical update then allows for coherent uncertainty quantification and information propagation through and across the modules. However, misspecification of any module can contaminate the estimate and update of others, often in unpredictable ways. In various settings, particularly when certain modules are trusted more than others, practitioners have preferred to avoid learning with the full model in favor of approaches that restrict the information propagation between modules, for example by restricting propagation to only particular directions along the edges of the graph. In this article, we investigate why these modular approaches might be preferable to the full model in misspecified settings. We propose principled criteria to choose between modular and full-model approaches. The question arises in many applied settings, including large stochastic dynamical systems, meta-analysis, epidemiological models, air pollution models, pharmacokinetics-pharmacodynamics, and causal inference with propensity scores.

Bayesian methods are advantageous for biological modelling studies due to their ability to quantify and characterize posterior variability in model parameters. When Bayesian methods cannot be applied, due either to non-determinism in the model or limitations on system observability, approximate Bayesian computation (ABC) methods can be used to similar effect, despite producing inflated estimates of the true posterior variance. Owing to generally differing application domains, there are few studies comparing Bayesian and ABC methods, and thus there is little understanding of the properties and magnitude of this uncertainty inflation. To address this problem, we present two popular strategies for ABC sampling that we have adapted to perform exact Bayesian inference, and compare them on several model problems. We find that one sampler was impractical for exact inference due to its sensitivity to a key normalizing constant, and additionally highlight sensitivities of both samplers to various algorithmic parameters and model conditions. We conclude with a study of the O'Hara-Rudy cardiac action potential model to quantify the uncertainty amplification resulting from employing ABC using a set of clinically relevant biomarkers. We hope that this work serves to guide the implementation and comparative assessment of Bayesian and ABC sampling techniques in biological models.

© 2017 Remi Bardenet, Arnaud Doucet, and Chris Holmes. Markov chain Monte Carlo methods are often deemed too computationally intensive to be of any practical use for big data applications, and in particular for inference on datasets containing a large number n of individual data points, also known as tall datasets. In scenarios where data are assumed independent, various approaches to scale up the Metropolis- Hastings algorithm in a Bayesian inference context have been recently proposed in machine learning and computational statistics. These approaches can be grouped into two categories: divide-and-conquer approaches and, subsampling-based algorithms. The aims of this article are as follows. First, we present a comprehensive review of the existing literature, commenting on the underlying assumptions and theoretical guarantees of each method. Second, by leveraging our understanding of these limitations, we propose an original subsampling-based approach relying on a control variate method which samples under regularity conditions from a distribution provably close to the posterior distribution of interest, yet can require less than O(n) data point likelihood evaluations at each iteration for certain statistical models in favourable scenarios. Finally, we emphasize that we have only been able so far to propose subsampling-based methods which display good performance in scenarios where the Bernstein-von Mises approximation of the target posterior distribution is excellent. It remains an open challenge to develop such methods in scenarios where the Bernstein-von Mises approximation is poor.

© 2017 International Society for Bayesian Analysis. Nonparametric and nonlinear measures of statistical dependence between pairs of random variables are important tools in modern data analysis. In particular the emergence of large data sets can now support the relaxation of linearity assumptions implicit in traditional association scores such as correlation. Here we describe a Bayesian nonparametric procedure that leads to a tractable, explicit and analytic quantification of the relative evidence for dependence vs independence. Our approach uses Pólya tree priors on the space of probability measures which can then be embedded within a decision theoretic test for dependence. Pólya tree priors can accommodate known uncertainty in the form of the underlying sampling distribution and provides an explicit posterior probability measure of both dependence and independence. Well known advantages of having an explicit probability measure include: easy comparison of evidence across different studies; encoding prior information; quantifying changes in dependence across different experimental conditions, and the integration of results within formal decision analysis.

Characterizing the technical precision of measurements is a necessary stage in the planning of experiments and in the formal sample size calculation for optimal design. Instruments that measure multiple analytes simultaneously, such as in high-throughput assays arising in biomedical research, pose particular challenges from a statistical perspective. The current most popular method for assessing precision of high-throughput assays is by scatterplotting data from technical replicates. Here, we question the statistical rationale of this approach from both an empirical and theoretical perspective, illustrating our discussion using four example data sets from different genomic platforms. We demonstrate that such scatterplots convey little statistical information of relevance and are potentially highly misleading. We present an alternative framework for assessing the precision of high-throughput assays and planning biomedical experiments. Our methods are based on repeatability-a long-established statistical quantity also known as the intraclass correlation coefficient. We provide guidance and software for estimation and visualization of repeatability of high-throughput assays, and for its incorporation into study design. © 2016 The Authors. Statistics in Medicine Published by John Wiley & Sons Ltd.

We have performed a metabolite quantitative trait locus (mQTL) study of the (1)H nuclear magnetic resonance spectroscopy ((1)H NMR) metabolome in humans, building on recent targeted knowledge of genetic drivers of metabolic regulation. Urine and plasma samples were collected from two cohorts of individuals of European descent, with one cohort comprised of female twins donating samples longitudinally. Sample metabolite concentrations were quantified by (1)H NMR and tested for association with genome-wide single-nucleotide polymorphisms (SNPs). Four metabolites' concentrations exhibited significant, replicable association with SNP variation (8.6×10(-11)<p<2.8×10(-23)). Three of these-trimethylamine, 3-amino-isobutyrate, and an N-acetylated compound-were measured in urine. The other-dimethylamine-was measured in plasma. Trimethylamine and dimethylamine mapped to a single genetic region (hence we report a total of three implicated genomic regions). Two of the three hit regions lie within haplotype blocks (at 2p13.1 and 10q24.2) that carry the genetic signature of strong, recent, positive selection in European populations. Genes NAT8 and PYROXD2, both with relatively uncharacterized functional roles, are good candidates for mediating the corresponding mQTL associations. The study's longitudinal twin design allowed detailed variance-components analysis of the sources of population variation in metabolite levels. The mQTLs explained 40%-64% of biological population variation in the corresponding metabolites' concentrations. These effect sizes are stronger than those reported in a recent, targeted mQTL study of metabolites in serum using the targeted-metabolomics Biocrates platform. By re-analysing our plasma samples using the Biocrates platform, we replicated the mQTL findings of the previous study and discovered a previously uncharacterized yet substantial familial component of variation in metabolite levels in addition to the heritability contribution from the corresponding mQTL effects.

We present a case-study on the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods. Graphics cards, containing multiple Graphics Processing Units (GPUs), are self-contained parallel computational devices that can be housed in conventional desktop and laptop computers and can be thought of as prototypes of the next generation of many-core processors. For certain classes of population-based Monte Carlo algorithms they offer massively parallel simulation, with the added advantage over conventional distributed multi-core processors that they are cheap, easily accessible, easy to maintain, easy to code, dedicated local devices with low power consumption. On a canonical set of stochastic simulation examples including population-based Markov chain Monte Carlo methods and Sequential Monte Carlo methods, we nd speedups from 35 to 500 fold over conventional single-threaded computer code. Our findings suggest that GPUs have the potential to facilitate the growth of statistical modelling into complex data rich domains through the availability of cheap and accessible many-core computation. We believe the speedup we observe should motivate wider use of parallelizable simulation methods and greater methodological attention to their design.

Copy number variants (CNVs) account for a major proportion of human genetic polymorphism and have been predicted to have an important role in genetic susceptibility to common disease. To address this we undertook a large, direct genome-wide study of association between CNVs and eight common human diseases. Using a purpose-designed array we typed approximately 19,000 individuals into distinct copy-number classes at 3,432 polymorphic CNVs, including an estimated approximately 50% of all common CNVs larger than 500 base pairs. We identified several biological artefacts that lead to false-positive associations, including systematic CNV differences between DNAs derived from blood and cell lines. Association testing and follow-up replication analyses confirmed three loci where CNVs were associated with disease-IRGM for Crohn's disease, HLA for Crohn's disease, rheumatoid arthritis and type 1 diabetes, and TSPAN8 for type 2 diabetes-although in each case the locus had previously been identified in single nucleotide polymorphism (SNP)-based studies, reflecting our observation that most common CNVs that are well-typed on our array are well tagged by SNPs and so have been indirectly explored through SNP studies. We conclude that common CNVs that can be typed on existing platforms are unlikely to contribute greatly to the genetic basis of common human diseases.

We describe a statistical method for the characterization of genomic aberrations in single nucleotide polymorphism microarray data acquired from cancer genomes. Our approach allows us to model the joint effect of polyploidy, normal DNA contamination and intra-tumour heterogeneity within a single unified Bayesian framework. We demonstrate the efficacy of our method on numerous datasets including laboratory generated mixtures of normal-cancer cell lines and real primary tumours.

We have cultured Plasmodium falciparum directly from the blood of infected individuals to examine patterns of mature-stage gene expression in patient isolates. Analysis of the transcriptome of P. falciparum is complicated by the highly periodic nature of gene expression because small variations in the stage of parasite development between samples can lead to an apparent difference in gene expression values. To address this issue, we have developed statistical likelihood-based methods to estimate cell cycle progression and commitment to asexual or sexual development lineages in our samples based on microscopy and gene expression patterns. In cases subsequently matched for temporal development, we find that transcriptional patterns in ex vivo culture display little variation across patients with diverse clinical profiles and closely resemble transcriptional profiles that occur in vitro. These statistical methods, available to the research community, assist in the design and interpretation of P. falciparum expression profiling experiments where it is difficult to separate true differential expression from cell-cycle dependent expression. We reanalyze an existing dataset of in vivo patient expression profiles and conclude that previously observed discrete variation is consistent with the commitment of a varying proportion of the parasite population to the sexual development lineage.