Christopher Yau: Big Data
Over the past decade, data-driven science has produced enormous sets of data. The convergence of statistics and computer science, in the field known as machine learning, provides the means to understand these large datasets. Ultimately, machine learning algorithms will be developed into clinical decision-making support systems.
Q: Can you explain what big data is?
Christopher Yau: Big data is a phenomenon that has arisen out of a long period of data-driven science. Over the last decade, we have seen the cost of genomic technologies rapidly reduce; at the same time, the number of samples and people that we are analysing has rapidly increased. In big data we are looking at the massive data repositories that we have generated over the last ten years and will be generating in years to come, and trying to extract knowledge from these data sets to answer both specific and multiple scientific questions.
Q: How is big data helping you explain differences in cancer?
CY: My group works with the ovarian cancer laboratory in Oxford, and in one of our studies we have been able to take 40 tumour samples from a single ovarian cancer patient. These tumour samples have come from different parts of the body where the disease has spread and at different times during the patient's diagnosis. We have also taken samples before and after chemotherapy. We have sequenced each of these 40 tumour samples and the process has generated a massive 40 terabytes of raw data for this individual patient. From this data we can gain critical insights into how the cancer arose in this particular individual, how it evolved and spread, and how this tumour reacted to chemotherapy. This has been really important, as this comprehensive profiling has allowed us to gain insights into the tumour evolution in this patient, which we couldn't have got from just a single tumour sample. What is exciting in the next few years is that we will be applying the same technique to further ovarian cancer patients, and putting together a massive comprehensive profile of ovarian cancers in different patients.
Q: Why is it important to understand these differences in tumours?
CY: Patients respond differently to cancer treatments, and at the moment we don't have a full understanding or the ability to predict exactly how they will respond. By looking at the genetic differences between the cancers and genetic differences between cells within the same tumour, we can learn about the mechanisms of drug resistance and also of radiotherapy resistance in these patients. By relating the genetic changes to how patients respond, we hope to produce better and more effective treatment plans in the future.
Q: What are the most important lines of research that have arisen in the last five or ten years?
CY: What has been really exciting in the last ten years has been the convergence of two different fields. On the one hand we have statistics, which is traditionally a mathematical discipline, and on the other hand we have computer science, which is more technologically driven. As these two fields have converged, in the field known as machine learning, what we have seen is an explosion of new ideas for analysing data and developing smarter, more efficient computational algorithms. This has been really important, because in parallel, in genetics we have seen an explosion in the amount of data we can collect and generate; without these new ideas coming from machine learning for interpreting this data, we would have the means of generating lots of data but no means of understanding it.
Q: Why does your research matter and why should we put money into it?
CY: One of the most interesting things that has occurred this year has been the announcement of the Genomics England 100,000 Genomes Project. This project is going to be very important, as it will sequence 100,000 genomes and provide another resource of data for us to study. It will also help the NHS prepare for the challenge of integrating genomic technologies into modern healthcare. However, whilst it's easy to buy more sequencers to meet the capacity challenge of integrating genome sequencing into healthcare, it isn't quite so easy to hire and train new data analysts. So, a lot of my work is concerned with developing machine learning algorithms that make the task of processing and analysing complex genomic data sets much easier and much faster. It is important also to consider that machine learning in a biomedical context is a lot different to machine learning in other applications. Most people are probably familiar with the use of automatic face tagging in social networking sites, or speech recognition software on their phones. In these applications, if you make an error it is generally not a terrible thing and might in fact be quite amusing; but in a clinical setting we can't afford to make errors, and we need to engineer machine learning algorithms to respect much more stringent specifications for robustness and reliability.
Q: How does your research fit into translational medicine within the Department?
CY: Ultimately what we would like to do is turn the machine learning algorithms that we develop from research tools into clinical decision-making support systems. For example, we have been working with the Biomedical Research Centre here in Oxford to develop a diagnostic system for leukaemia, which allows us to translate complex genomic data coming from a particular type of technology into a simple 1-2 page report that can be used by clinicians to develop a patient treatment plan. In the future this automation is going to become more important because we will see multiple genomic technologies being used, and also the need to integrate these genomic technologies with imaging technologies and other types of information that are being gathered about patients.