Petra LeBeau, ScD, is a Senior Biostatistician and Lead of the Bioinformatics Analytics Team at Rho. She has over 13 years of experience in providing statistical support for clinical trials and observational studies, from study design to reporting. Her experience includes 3+ years of working with genomic data sets (e.g. transcriptome and metagenome). Her current interest is in machine learning using clinical trial and high-dimensional data.
Agustin Calatroni, MS, is a Principal Statistical Scientist at Rho. His academic background includes a master’s degree in economics from the Université Paris 1 Panthéon-Sorbonne and a master’s degree in statistics from North Carolina State University. In the last 5 years, he has participated in a number of competitions to develop prediction models. He is particularly interested in the use of stacking models to combine several machine learning techniques into one predictive model in order to decrease variance (bagging), decrease bias (boosting), and improve predictive accuracy.
Derek Lawrence, Senior Clinical Data Manager, has 9 years of data management and analysis experience in the health care / pharmaceutical industry. Derek serves as Rho’s Operational Service Leader in Clinical Data Management, an internal expert responsible for disseminating the application of new technology, best practices, and processes.
Artificial Intelligence (AI) may seem like rocket science, but most people use it every day without realizing it. Ride-sharing apps, airline ticket aggregators, ATMs, recommendations for your next eBook or superstore purchase, the photo library within your smartphone: all of these common applications use machine learning algorithms to improve the user experience.
Machine learning (ML) algorithms make predictions and, in turn, learn from the results of those predictions, so that their performance improves over time. ML has slowly been making its way into health research and the healthcare system, due in part to exponential growth in data stemming from new technologies such as genomics. Rho supports many studies with large datasets, including the microbiome, proteome, metabolome, and transcriptome. The rapid growth of health-related data will continue, along with the development of new methodologies, such as systems biology (i.e., the computational and mathematical modeling of interactions within biological systems), that leverage these data. ML will continue to be a key enabler in these areas. Ever-increasing computational power, improvements in data storage devices, and falling computational costs have given clinical trial centers the opportunity to apply ML techniques to large and complex data, which would not have been possible a decade ago.
In general, ML is divided into two main types of techniques: (1) supervised learning, in which a model is trained on known input and output data in order to predict future outputs, and (2) unsupervised learning, in which, instead of predicting outputs, the system tries to find naturally occurring patterns or groups within the data. Each type of ML has a large number of existing algorithms. Supervised learning algorithms include random forests, boosted trees, neural networks, and deep neural networks, to name just a few. Unsupervised learning likewise has a plethora of algorithms.
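To make the distinction between the two types of learning concrete, here is a minimal Python sketch. It assumes a single hypothetical biomarker measured on six participants; the values, labels, and function names are illustrative only and are not drawn from any Rho study. A 1-nearest-neighbor rule stands in for supervised learning and a two-cluster split stands in for unsupervised learning; both are deliberately simpler than the algorithms named above.

```python
# Hypothetical biomarker values for six participants (illustrative only).
values = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]
# Known outcomes, available only in the supervised setting.
labels = ["control", "control", "control", "case", "case", "case"]


def nearest_neighbor_predict(train_x, train_y, x):
    """Supervised: predict the label of x from the closest labeled example."""
    closest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[closest]


def two_means(data, iterations=10):
    """Unsupervised: split 1-D data into two groups without using labels."""
    center_low, center_high = min(data), max(data)
    low, high = list(data), []
    for _ in range(iterations):
        # Assign each value to the nearer of the two current centers.
        low = [v for v in data if abs(v - center_low) <= abs(v - center_high)]
        high = [v for v in data if abs(v - center_low) > abs(v - center_high)]
        if not high:  # all values identical: nothing to separate
            break
        # Move each center to the mean of the values assigned to it.
        center_low = sum(low) / len(low)
        center_high = sum(high) / len(high)
    return low, high


# Supervised: uses the labels to predict the outcome for a new value.
predicted = nearest_neighbor_predict(values, labels, 5.1)
# Unsupervised: never sees the labels, only looks for structure.
group_low, group_high = two_means(values)
print(predicted)
print(group_low, group_high)
```

The supervised function needs the outcome labels to make a prediction, while the clustering function recovers the two groups from the values alone.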
Lately, it has become clear that substantially increasing the accuracy of a predictive model often calls for an ensemble of models. The idea behind ensembles is that by combining a diverse set of models, one is able to produce a stronger, higher-performing model, which in turn results in better predictions. A well-built ensemble can improve the accuracy, precision, and stability of our predictions. The power of the ensemble technique can be intuited with a real-world example: In the early 20th century, the famous English statistician Francis Galton (who created the statistical concept of correlation) attended a local fair. While there, he came across a contest that involved guessing the weight of an ox. He looked around and noticed a very diverse crowd; there were people like him who perhaps had little knowledge about cattle, and there were farmers and butchers whose guesses would be considered those of experts. As expected, the diverse audience gave a wide variety of responses. He wondered what would happen if he combined all of these responses, expert and non-expert alike, into a single average. What he found was that the average was remarkably close to the true weight of the ox, closer than the great majority of the individual guesses. This phenomenon has been called the “wisdom of crowds.” Similarly, today’s best prediction models are often the result of an ensemble of various models, which together provide better overall prediction accuracy than any individual model would be capable of.
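The ox-weighing story can be imitated with a small simulation. The sketch below is illustrative only: the true weight, the number of guessers, and the spread of expert and non-expert guesses are all assumed values, not Galton's data. It compares the error of the averaged guess with the errors of the individual guesses.

```python
import random
from statistics import mean, median

random.seed(42)  # fixed seed so the simulation is repeatable

TRUE_WEIGHT = 1200.0  # hypothetical true weight of the ox, in pounds

# Assumed crowd: a few "experts" with small errors, many others with large ones.
expert_guesses = [random.gauss(TRUE_WEIGHT, 30) for _ in range(50)]
other_guesses = [random.gauss(TRUE_WEIGHT, 150) for _ in range(450)]
guesses = expert_guesses + other_guesses

# Error of each individual guess and of the crowd's average guess.
individual_errors = [abs(g - TRUE_WEIGHT) for g in guesses]
crowd_error = abs(mean(guesses) - TRUE_WEIGHT)

# Fraction of individuals whose guess was worse than the crowd average.
share_beaten = sum(e > crowd_error for e in individual_errors) / len(guesses)

print(f"crowd error:             {crowd_error:.1f} lb")
print(f"median individual error: {median(individual_errors):.1f} lb")
print(f"share of guesses beaten: {share_beaten:.0%}")
```

The error of the averaged guess can never exceed the average of the individual errors, because individual overestimates and underestimates partly cancel. The same cancellation is what an ensemble relies on when it averages the predictions of diverse models.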
As far as data management is concerned, the current clinical research model is centered on electronic data capture (EDC) systems, in which a database is constructed that comprises the vast majority of the data for a particular study or trial. Getting all of the data into a single system involves a significant investment in the form of external data imports, redundant data entry, transcription from paper sources, transfers from electronic medical/health record (EMR/EHR) systems, and the like. Additionally, the time and effort required to build, test, and validate complicated multivariate edit checks in the EDC system to help clean the data as they are entered is substantial, and these checks can only use data that currently exist in the EDC system itself. As data source variety increases, along with surges in data volume and velocity, this model becomes less and less effective at identifying anomalous data.
At Rho, we are investing in talent and technology that in the near future will use ML ensemble models in the curation and maintenance of clinical databases. Our current efforts to develop tools that aggregate data from a variety of sources will be a key enabler. Similar to the ways the banking industry uses ML to identify ‘normal’ and ‘abnormal’ spending patterns and make real-time decisions to allow or decline purchases, ML algorithms can identify univariate and multivariate clusters of anomalous data for manual review. These continually learning algorithms will enable a focused review of potentially erroneous data without the development of the traditional EDC infrastructure, not only saving time on data reviews but also identifying potential issues of which we would otherwise have been unaware.
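As a rough illustration of flagging anomalous values for manual review, the Python sketch below handles the simplest univariate case. It is not the ML ensemble described above: it uses a modified z-score based on the median absolute deviation, a plain statistical stand-in. The blood pressure values, the function name, and the cutoff of 3.5 (a commonly used convention) are all assumptions made for this example.

```python
from statistics import median


def flag_for_review(values, threshold=3.5):
    """Return (index, value) pairs whose modified z-score exceeds threshold.

    The modified z-score uses the median and the median absolute deviation
    (MAD), so one extreme value does not distort the notion of "normal".
    """
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:  # no spread at all: nothing can be scored
        return []
    flagged = []
    for index, value in enumerate(values):
        score = 0.6745 * (value - center) / mad
        if abs(score) > threshold:
            flagged.append((index, value))
    return flagged


# Hypothetical systolic blood pressure entries (mmHg); 1200 mimics a typo.
systolic = [118, 122, 120, 125, 119, 121, 1200, 117]
print(flag_for_review(systolic))
```

Only the implausible entry is sent for review, and no range check had to be written in advance for this field. Multivariate anomalies, such as a value that is unusual only in combination with other fields, need methods beyond this sketch.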