Big data and design of experiments
McGree JM., Drovandi CC., Ryan EG., Mengersen K., Holmes C., Richardson S.
Big datasets are becoming more prevalent in modern statistics. This poses a major challenge for statisticians, as such datasets can be difficult to analyse due to their size, complexity and quality. In this research, we follow the work of Drovandi et al. (2015) and apply experimental design techniques to the analysis of big data as a way of extracting relevant information to answer specific questions. Such an approach can significantly reduce the size of the dataset to be analysed, and can potentially overcome concerns about poor quality due to, for example, sample bias and missingness. We focus on a sequential design approach for extracting informative data. When fitting relatively complex models (for example, those that are nonlinear), the performance of a design in answering specific questions will generally depend upon the assumed model and the corresponding values of the parameters. It is therefore useful to incorporate prior information into these sequential design problems. As in Drovandi et al. (2015), we argue that such information can be obtained in big data settings through an initial learning phase, in which data are extracted from the big dataset so that appropriate models for analysis can be explored and prior distributions of parameters can be formed. Given this prior information, sequential design is undertaken to identify informative data to extract from the big dataset. The approach is demonstrated in an example where there is interest in determining how particular covariates affect the chance of an individual defaulting on their mortgage, and we also explore the appropriateness of a model developed in the literature for the chance of a late arrival in domestic air travel. Lastly, we show that this design-based approach provides a methodology for identifying gaps in big data, which may reveal limitations in the types of inferences that can be drawn.
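The sketch below is a minimal, illustrative rendering of the two-stage idea described above, not the authors' implementation. It assumes a logistic regression model (as in the mortgage default example), uses a randomly drawn initial subsample as the learning phase to form an approximate prior via a maximum-likelihood fit and its observed information, and then greedily selects further rows from a (simulated, stand-in) big dataset using a simple D-optimality utility. All variable names, the simulated data, and the specific utility are assumptions made for illustration.

```python
# A minimal sketch of sequential design for subsampling a big dataset,
# assuming a logistic regression model, a normal (Laplace-style) posterior
# approximation, and a D-optimality utility. The data are simulated here
# purely for illustration; this is not the authors' implementation.
import numpy as np

rng = np.random.default_rng(0)

# --- Simulated stand-in for a big dataset: covariates X, binary responses y --
N, p = 100_000, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p - 1))])
true_beta = np.array([-1.0, 0.8, -0.5])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_beta)))

def weights(X, beta):
    """Fitted probabilities and Bernoulli variances for logistic regression."""
    mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return mu, mu * (1.0 - mu)

def fit_logistic(X, y, n_iter=25):
    """Maximum-likelihood fit via Newton-Raphson; returns estimate and information."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu, w = weights(X, beta)
        info = (X * w[:, None]).T @ X
        beta += np.linalg.solve(info, X.T @ (y - mu))
    return beta, info

# --- Initial learning phase: a random subsample forms an approximate prior ---
init = rng.choice(N, size=500, replace=False)
beta_hat, prec = fit_logistic(X[init], y[init])     # prior mean and precision

# --- Sequential design: greedily select rows maximising a D-optimality gain --
pool = rng.choice(N, size=5_000, replace=False)     # thinned candidate pool
chosen = []
for _ in range(200):
    _, w = weights(X[pool], beta_hat)
    # Log-determinant gain from adding each candidate, computed with the
    # matrix determinant lemma: log(1 + w_i * x_i' prec^{-1} x_i).
    solved = np.linalg.solve(prec, X[pool].T).T     # shape (n_pool, p)
    gain = np.log1p(w * np.einsum("ij,ij->i", X[pool], solved))
    k = int(np.argmax(gain))
    chosen.append(pool[k])
    prec += w[k] * np.outer(X[pool[k]], X[pool[k]]) # information update
    pool = np.delete(pool, k)                       # do not reselect the same row
    # In a full analysis, beta_hat (and hence the prior) would be re-estimated
    # from the responses of the selected rows as the design proceeds.

print(f"Selected {len(chosen)} informative rows out of {N:,}.")
```

In this toy version the utility is evaluated over a thinned candidate pool rather than the full dataset, and the parameter estimate is held fixed at its learning-phase value; both are simplifications of the sequential design strategy described in the abstract.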