Big datasets are becoming increasingly prevalent in modern statistics. This poses a major challenge for statisticians, as such datasets can be difficult to analyse due to their size, complexity and quality. In this research, we follow the work of Drovandi et al. (2015) and use experimental design techniques to extract from big data the relevant information needed to answer specific questions. Such an approach can significantly reduce the size of the dataset to be analysed, and can potentially overcome concerns about poor quality due to, for example, sample bias and missingness. We focus on a sequential design approach for extracting informative data. When fitting relatively complex models (for example, nonlinear models), the performance of a design in answering specific questions will generally depend upon the assumed model and the corresponding values of its parameters. As such, it is useful to incorporate prior information into such sequential design problems. As in Drovandi et al. (2015), we argue that this can be obtained in big data settings through an initial learning phase, in which data are extracted from the big dataset so that appropriate models for analysis can be explored and prior distributions for parameters can be formed. Given such prior information, sequential design is then undertaken to identify informative data to extract from the big dataset. This approach is demonstrated in an example where the interest is in determining how particular covariates affect the chance of an individual defaulting on their mortgage, and we also explore the appropriateness of a model developed in the literature for the chance of a late arrival in domestic air travel. Lastly, we show that this designed approach can provide a methodology for identifying gaps in big data, which may reveal limitations in the types of inferences that can be drawn.
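To make the idea concrete, the following is a minimal sketch of the kind of sequential design loop described above, under illustrative assumptions: a logistic regression for a binary response (such as mortgage default), an initial random extraction standing in for the learning phase, and a D-optimality-style utility for choosing the next point. The synthetic data, the helper fit_logistic(), and the candidate-subsampling step are hypothetical choices for illustration, not the chapter's exact method.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic "big" dataset: covariates X (with intercept) and binary response y.
N, p = 100_000, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p - 1))])
true_beta = np.array([-2.0, 1.0, -0.5])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_beta)))

def fit_logistic(Xs, ys, iters=25):
    """Plain Newton-Raphson fit of a logistic regression."""
    beta = np.zeros(Xs.shape[1])
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-Xs @ beta))
        w = mu * (1.0 - mu)
        H = Xs.T @ (Xs * w[:, None]) + 1e-8 * np.eye(Xs.shape[1])
        beta += np.linalg.solve(H, Xs.T @ (ys - mu))
    return beta

# Initial learning phase: a small random extraction used to form prior
# information (here simply an initial parameter estimate).
selected = list(rng.choice(N, size=200, replace=False))
beta_hat = fit_logistic(X[selected], y[selected])

# Expected Fisher information of the selected points at the current estimate.
mu_s = 1.0 / (1.0 + np.exp(-X[selected] @ beta_hat))
M = X[selected].T @ (X[selected] * (mu_s * (1.0 - mu_s))[:, None])

# Sequential design: at each step extract the candidate point that most
# increases det(M), a D-optimality-style utility.
for step in range(100):
    cand = rng.choice(N, size=5_000, replace=False)   # random candidate pool for speed
    mu_c = 1.0 / (1.0 + np.exp(-X[cand] @ beta_hat))
    w_c = mu_c * (1.0 - mu_c)
    Minv = np.linalg.inv(M)
    # Matrix-determinant lemma: det(M + w x x') = det(M) * (1 + w x' M^{-1} x).
    gain = 1.0 + w_c * np.einsum("ij,jk,ik->i", X[cand], Minv, X[cand])
    idx = int(np.argmax(gain))
    best = cand[idx]
    selected.append(best)
    M += w_c[idx] * np.outer(X[best], X[best])
    if (step + 1) % 25 == 0:                          # refit periodically as data accrue
        beta_hat = fit_logistic(X[selected], y[selected])

print("estimate from", len(selected), "designed points:", np.round(beta_hat, 2))

In this sketch the gap-identification point made above would correspond to candidate regions of the covariate space where no point ever yields an appreciable utility gain, signalling parts of the design space the big dataset simply does not cover.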

Original publication

DOI: 10.1016/B978-0-12-803732-4.00006-4

Type: Chapter

Book title: Computational and Statistical Methods for Analysing Big Data with Applications

Publication date: 01/01/2016

Pages: 111–129