David Hall is a bioinformatician with expertise in the development of algorithms, software tools, and data systems for the management and analysis of large biological data sets for biotechnology and biomedical research applications. He joined Rho in June 2014 and is currently overseeing capabilities development in the areas of bioinformatics and big biomedical data. He holds a B.S. in Computer Science from Wake Forest University and a Ph.D. in Genetics with an emphasis in Computational Biology from the University of Georgia.
Data is the new bacon, as the saying goes. And Big Data is all the rage as people in the business world realize that you can make a lot of money by finding patterns in data that let you target marketing to the most likely buyers. Big Data and a type of artificial intelligence called machine learning are closely connected. Machine learning involves teaching a computer to make predictions by training it to find and exploit patterns in Big Data. Whenever you see a computer make predictions—from estimating how much a home is worth, to picking the best time to buy an airline ticket, to suggesting which movies you will like—Big Data and machine learning are probably behind it.
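To make the idea concrete, here is a minimal sketch of that training-and-predicting loop using the home-price example: fit a line to a handful of (square footage, sale price) pairs, then predict the price of a home the model has never seen. The numbers are made up purely for illustration; real systems use far more data and far richer models.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# "Training data": square footage and sale price (hypothetical numbers).
sqft = [1000, 1500, 2000, 2500, 3000]
price = [200000, 290000, 405000, 495000, 610000]

slope, intercept = fit_line(sqft, price)

# Predict the price of an 1800-square-foot home not in the training set.
predicted = slope * 1800 + intercept
```

This is the simplest possible version of the pattern: the "learning" step finds parameters that summarize the training data, and the "prediction" step applies those parameters to new cases.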
However, Big Data and machine learning are nothing new to people in the sciences. We have been collecting big datasets and looking for patterns for decades. Most people in the biomedical sciences consider the Big Data era to have started in the early to mid-1990s as various genome sequencing projects ramped up. The Human Genome Project wrapped up in 2003, took more than 10 years, and cost somewhere north of $500 million. And that was to sequence just one genome. A few years later, the 1000 Genomes Project started, whose goal was to characterize genetic differences across 1,000 diverse individuals so that, among other things, we can predict who is susceptible to various diseases. This effort was partially successful, but we learned that 1,000 genomes is not enough.
The cost to sequence a human genome has since fallen to around $1,000, and the ambition and scale of big biomedical data have increased proportionately. Researchers in the UK are undertaking a project to sequence the genomes of 100,000 individuals. In the US, the Precision Medicine Initiative will sequence 1 million individuals. Combining this data with detailed clinical and health data will allow machine learning and other techniques to more accurately predict a wider range of disease susceptibilities and responses to treatments. Private companies are undertaking their own big genomic projects and are even sequencing the “microbiomes” of research participants to see what role good and bad microbes play in health.
Much as Moore’s law predicted the vast increase in computing power, the amount of biomedical data we can collect is on a similar trajectory. Genomics data combined with electronic medical records, data from wearables and mobile apps, and environmental data will one day shroud each individual in a data cloud. In the not too distant future, medicine may involve feeding a patient’s data cloud to an artificial intelligence that has learned to make diagnoses and recommendations by looking through millions of other personal data clouds. It seems hard to conceive, but this is the trajectory of precision medicine. Technology has a way of sneaking up on us, and the pace of change keeps getting faster. Note that the management and analysis of all of this data will be very hard. I’ll cover that in a future post.