Petra LeBeau, ScD (@LebeauPetra), is a Senior Biostatistician and Lead of the Bioinformatics Analytics Team at Rho. She has over 13 years of experience providing statistical support in all areas of clinical trials and observational studies. Her experience includes 3+ years of working with genomic data sets (e.g., transcriptome and metagenome). Her current interest is in machine learning using clinical trial and high-dimensional data.
Agustin Calatroni, MS (@acalatr), is a Principal Statistical Scientist at Rho. His academic background includes a master’s degree in economics from the Université Paris 1 Panthéon-Sorbonne and a master’s degree in statistics from North Carolina State University. In the last 5 years, he has participated in a number of competitions to develop prediction models. He is particularly interested in the use of stacking models, which combine several machine learning techniques into one predictive model in order to decrease variance (as in bagging), decrease bias (as in boosting), and improve predictive accuracy.
At Rho, we are proud of our commitment to supporting education and fostering innovative problem-solving for the next generation of scientists, researchers, and statisticians. One way we enjoy promoting innovation is by participating in the annual Industrial Math/Stat Modeling Workshop for Graduate Students (IMSM) hosted by the National Science Foundation-supported Statistical and Applied Mathematical Sciences Institute (SAMSI). IMSM is a 10-day program that exposes graduate students in mathematics, statistics, and computational science to challenging and exciting real-world projects arising in industrial and government laboratory research. The workshop is held in SAS Hall on the campus of North Carolina State University. This summer marked our 8th consecutive year as an IMSM Problem Presenter. We were joined by industry leaders from Sandia National Laboratories, MIT Lincoln Laboratories, US Army Corps of Engineers (USACE), US Environmental Protection Agency (EPA), and Savvysherpa.
SAMSI participants 2018: Agustin Calatroni (first from left), Petra LeBeau (first from right), and Emily Lei Kang (second from right) with students from the SAMSI program.
Rho was represented at the 2018 workshop by investigators Agustin Calatroni and Petra LeBeau, with the assistance of Dr. Emily Lei Kang from the University of Cincinnati. Rho’s problem for this year was Visualizing and Interpreting Machine Learning Models for Liver Disease Detection.
Machine learning (ML) interpretability is a hot topic. Over the last couple of years, many tools (including a variety of very user-friendly ones) have become available that can create fairly accurate ML models, but the constructs that could help us explain and trust these black-box models are still under development.
The success of ML algorithms in medicine and multi-omics studies over the last decade has come as no surprise to ML researchers. This success can be largely attributed to their superior predictive accuracy and their ability to work on both large-volume and high-dimensional datasets. The key notion behind their performance is self-improvement. That is, these algorithms make predictions and improve them over time by analyzing mistakes made in earlier predictions and avoiding these errors in future predictions. The difficulty with this “predict and learn” paradigm is that these algorithms suffer from diminished interpretability, usually due to the high number of nonlinear interactions within the resulting models. This is often referred to as the “black-box” nature of ML methods.
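To make the “predict and learn” idea concrete, here is a minimal sketch of one family of algorithms that works this way: boosting with squared-error loss, where each round fits a one-split decision stump to the residuals (the mistakes) of the ensemble built so far. The toy data, the function names `fit_stump`, `predict` and `boost`, and the learning rate are illustrative assumptions, not the model or code used in the workshop.

```python
from typing import List, Tuple

Stump = Tuple[float, float, float]  # (threshold, value if x <= threshold, value otherwise)


def fit_stump(xs: List[float], residuals: List[float]) -> Stump:
    """Find the single threshold split that minimizes squared error on the residuals."""
    best_sse = float("inf")
    best_stump: Stump = (0.0, 0.0, 0.0)
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left) if left else 0.0
        right_mean = sum(right) / len(right) if right else 0.0
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if sse < best_sse:
            best_sse, best_stump = sse, (threshold, left_mean, right_mean)
    return best_stump


def predict(base: float, stumps: List[Stump], rate: float, x: float) -> float:
    """Ensemble prediction: a constant plus the damped sum of all stump corrections."""
    total = base
    for threshold, left_value, right_value in stumps:
        total += rate * (left_value if x <= threshold else right_value)
    return total


def boost(xs: List[float], ys: List[float], rounds: int = 50, rate: float = 0.3):
    """Each round looks at the current mistakes and adds a stump that corrects them."""
    base = sum(ys) / len(ys)
    stumps: List[Stump] = []
    errors: List[float] = []  # training MSE measured at the start of each round
    for _ in range(rounds):
        residuals = [y - predict(base, stumps, rate, x) for x, y in zip(xs, ys)]
        errors.append(sum(r * r for r in residuals) / len(ys))
        stumps.append(fit_stump(xs, residuals))
    return base, stumps, errors


# Made-up one-dimensional data with a noisy step from 0 to 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0]
base, stumps, errors = boost(xs, ys)
print(f"training MSE before round 1: {errors[0]:.3f}; before round {len(errors)}: {errors[-1]:.3f}")
```

Even in this tiny example the final prediction is a sum of 50 small corrections, which hints at why such ensembles are hard to read directly.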
In cases where interpretability is crucial, for instance in studies of disease pathologies, ad-hoc methods leveraging the strong predictive nature of these methods have to be implemented. These methods are used as aids for ML users to answer questions like: ‘why did the algorithm make certain decisions?’, ‘what variables were the most important in predictions?’, and/or ‘is the model trustworthy?’
The IMSM students were challenged to study the interpretability of a particular class of ML methods, gradient boosting machines (GBM), applied to predicting whether a subject had liver disease. Rho investigators provided a curated data set and pre-built the model for the students. The model was constructed using the open-source Indian Liver Patient Dataset, which contains records of 583 patients from the north east of Andhra Pradesh, India (Dheeru and Karra Taniskidou, 2017). The dataset contains eleven variables: a response variable indicating disease status of the patient (416 with disease, 167 without) and ten clinical predictor variables (Age, Gender, Total Bilirubin, Direct Bilirubin, Alkaline Phosphatase, Alanine Aminotransferase, Aspartate Aminotransferase, Total Proteins, Albumin, Albumin and Globulin Ratio). The data were divided into 467 training and 116 test records for model building.
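As a quick check on these numbers, the class counts and the split sizes both sum to 583, roughly 71% of records are disease cases, and the test set is about 20% of the data. The sketch below verifies the arithmetic and shows one way such a split could be drawn. The stratification by disease status, the function name `stratified_split` and the random seed are assumptions for illustration; the post does not say how the 467/116 split was actually made.

```python
import random
from typing import List, Tuple

n_disease, n_no_disease = 416, 167
n_train, n_test = 467, 116
n_total = n_disease + n_no_disease
assert n_total == n_train + n_test == 583

print(f"disease prevalence: {n_disease / n_total:.1%}")  # 71.4%
print(f"test fraction:      {n_test / n_total:.1%}")     # 19.9%


def stratified_split(labels: List[int], n_test: int, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Return (train, test) index lists that keep class proportions roughly equal.

    Per-class test counts are rounded, so in general they may miss n_test by a
    record or two; for the counts used here they add up exactly.
    """
    rng = random.Random(seed)
    train_idx: List[int] = []
    test_idx: List[int] = []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        k = round(len(idx) * n_test / len(labels))
        test_idx.extend(idx[:k])
        train_idx.extend(idx[k:])
    return train_idx, test_idx


# Placeholder labels only: 1 = disease, 0 = no disease (no real patient data).
labels = [1] * n_disease + [0] * n_no_disease
train_idx, test_idx = stratified_split(labels, n_test)
print(len(train_idx), len(test_idx))
```

Stratifying matters more when classes are imbalanced, as they are here, because a purely random 20% sample could over- or under-represent the smaller class.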
The scope of work for the students was not to improve or optimize the performance of the GBM model but to explain and visualize the method’s intrinsic latent behavior.
The IMSM students decided to break interpretability down into two areas: global, where the entire dataset is used for interpretation, and local, where a subset of the data is used to derive an interpretive analysis of the model. The details of these methods will be discussed further in two additional blog posts.
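The difference between the two views can be sketched with a toy example. The global measure below shuffles one feature across the entire dataset and averages how far the predictions move; it is a simplified, label-free relative of permutation importance, which normally measures the increase in prediction error. The local measure looks at a single record and asks how its prediction changes when one feature is replaced by a reference value. Everything here is hypothetical: the `black_box` function stands in for a fitted model, and the feature names, coefficients and data are made up. These are not necessarily the methods the students used.

```python
import random
from typing import Callable, List

Row = List[float]
FEATURES = ["total_bilirubin", "albumin", "noise"]  # hypothetical feature names


def black_box(row: Row) -> float:
    """Hypothetical stand-in for a fitted model; returns a risk score."""
    bilirubin, albumin, _noise = row  # the third feature is deliberately ignored
    return 0.5 + 0.15 * bilirubin - 0.10 * albumin


def permutation_sensitivity(model: Callable[[Row], float], rows: List[Row],
                            column: int, rng: random.Random) -> float:
    """Global view: shuffle one column across ALL rows, average the shift in predictions."""
    shuffled = [row[column] for row in rows]
    rng.shuffle(shuffled)
    shifts = []
    for row, value in zip(rows, shuffled):
        altered = list(row)
        altered[column] = value
        shifts.append(abs(model(altered) - model(row)))
    return sum(shifts) / len(shifts)


def local_effects(model: Callable[[Row], float], row: Row, reference: Row) -> List[float]:
    """Local view: for ONE record, swap each feature for a reference value in turn."""
    effects = []
    for column in range(len(row)):
        altered = list(row)
        altered[column] = reference[column]
        effects.append(model(row) - model(altered))
    return effects


rng = random.Random(0)
rows = [[rng.uniform(0, 3), rng.uniform(2, 5), rng.uniform(0, 1)] for _ in range(200)]

global_scores = [permutation_sensitivity(black_box, rows, c, rng) for c in range(3)]
reference = [sum(r[c] for r in rows) / len(rows) for c in range(3)]  # column means
patient = [2.8, 2.2, 0.9]  # made-up record: high bilirubin, low albumin
local_scores = local_effects(black_box, patient, reference)

for name, g_score, l_score in zip(FEATURES, global_scores, local_scores):
    print(f"{name:16s} global={g_score:.3f} local={l_score:+.3f}")
```

In this sketch the ignored `noise` feature scores zero in both views, while the local scores show that, for this particular record, both the high bilirubin and the low albumin push the score up relative to an average record.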
Rho is honored to have the opportunity to work with exceptional students and faculty to apply state-of-the-art mathematical and statistical techniques to solve real-world problems and advance our knowledge of human diseases.
You can visit the IMSM Workshop website to learn more about the program, including the problem Rho presented and the students’ solution.
With thanks to the IMSM students Adams Kusi Appiah¹, Sharang Chaudhry², Chi Chen³, Simona Nallon⁴, Upeksha Perera⁵, Manisha Singh⁶, and Ruyu Tan⁷, and advisor Dr. Emily Lei Kang from the University of Cincinnati.
¹Department of Biostatistics, University of Nebraska Medical Center; ²Department of Mathematical Sciences, University of Nevada, Las Vegas; ³Department of Biostatistics, State University of New York at Buffalo; ⁴Department of Statistics, California State University, East Bay; ⁵Department of Mathematics and Statistics, Sam Houston State University; ⁶Department of Information Science, University of Massachusetts; ⁷Department of Applied Mathematics, University of Colorado at Boulder
Dheeru, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository.