Role of Prediction and Causal Inference in Clinical Research

November 17, 2020


As part of Cytel’s "New Horizons Webinar Series", Alind Gupta, Senior Data Scientist, presents case studies from his research on applying machine learning to predictive analysis and evidence generation.

The biopharmaceutical and healthcare industries now collect more data than ever before, thanks to a growing variety of information sources and the ability to store vast quantities of diverse data. Sophisticated machine learning (ML) and AI techniques make it possible to access and analyze many combinations of these data sources. Views of traditional controlled sources are shifting in light of new evidence emerging from real-world data. In his presentation, Alind introduces the concepts of ML and causal inference and discusses case studies from randomized clinical trials and real-world data.



What is the Difference between Prediction and Causal Inference?

Machine learning (ML) aims to discover patterns in data that can be used for prediction. It is typically not focused on the task of causal inference. The goals of ML are accuracy, meaning closeness between the prediction and the observed outcome, and model generalizability.

In causal inference, we infer cause-effect relationships from data (interventional or observational). Domain expertise is essential because it supplies information that cannot be mined from the data alone. The goals of causal inference are invariance, unbiasedness and transportability of effects.
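To make the distinction concrete, here is a minimal simulation sketch. It is not taken from the webinar: the variable names, the single binary confounder ("severe" disease) and all probabilities are illustrative assumptions. Sicker patients are more likely to be treated and more likely to have the event, so the crude association between treatment and outcome, which a purely predictive model would happily pick up, has the opposite sign from the causal effect built into the simulation. Adjusting for the confounder by standardization, which requires knowing from domain expertise that severity matters, recovers the built-in effect.

```python
import random

def simulate(n=200_000, seed=0):
    """Simulate observational data with one binary confounder (assumed setup)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        severe = rng.random() < 0.5
        # Confounding: severe patients are treated more often.
        treated = rng.random() < (0.8 if severe else 0.2)
        # Built-in causal effect: treatment lowers event risk by 0.10.
        event_risk = 0.2 + 0.4 * severe - 0.1 * treated
        event = rng.random() < event_risk
        rows.append((severe, treated, event))
    return rows

def observed_risk(rows, treated, severe=None):
    """Event proportion among patients with the given treatment (and stratum)."""
    selected = [e for s, t, e in rows
                if t == treated and (severe is None or s == severe)]
    return sum(selected) / len(selected)

def naive_difference(rows):
    """Crude association: risk in treated minus risk in untreated."""
    return observed_risk(rows, True) - observed_risk(rows, False)

def standardized_difference(rows):
    """Risk difference within each severity stratum, weighted by stratum size."""
    n = len(rows)
    total = 0.0
    for level in (False, True):
        weight = sum(1 for s, _, _ in rows if s == level) / n
        total += weight * (observed_risk(rows, True, level)
                           - observed_risk(rows, False, level))
    return total

rows = simulate()
print("naive difference:       ", round(naive_difference(rows), 3))
print("standardized difference:", round(standardized_difference(rows), 3))
```

With these assumed parameters the crude difference is about +0.14 in expectation, while the standardized difference is close to the built-in effect of -0.10. The adjustment works here only because the single confounder is measured and known; real analyses rarely have that luxury.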

What are Some Considerations for Applying Machine Learning for Predictive Analysis?

Alind presents a case study on applying machine learning for prediction in the context of targeted therapies for advanced cancer. Targeted therapies for advanced cancer have shown higher efficacy and a better safety profile than cytotoxic chemotherapy. More than 40 such therapies have been approved by the FDA to treat specific cancer types.

Randomized clinical trials (RCTs) routinely collect large amounts of data on covariates and outcomes. ML can be useful for mining this painstakingly collected data for individual-level prediction and knowledge discovery. However, machine learning cannot replace standard inference methods for RCT analysis, because ML methods typically do not exploit the randomization of treatment assignment.

What are Some Considerations for ML in Clinical Research?

The fields of medicine and public health are undergoing a data revolution. The increasing availability of data has brought about a growing interest in machine-learning algorithms. But clinical research is a field where the stakes are high, and this drives many of the considerations for ML. These considerations include transparency and interpretability of models. We need to be able to understand exactly how the model uses a set of predictors to make predictions about the outcome. The FDA has also released guidance on Software as a Medical Device (SaMD); the agency cares about transparency and keeping clinicians in the loop. This becomes a challenge with a black-box machine learning model, whose inner workings are difficult to understand.

Rigorous external validation and an understanding of limitations are also necessary because we are learning from data acquired in the real world. In clinical research, we generally work with “small data”, particularly in targeted therapies. Even when we use observational data, the real-world data set is generally very small, and missing data also needs to be accounted for.

ML can do many things, but established methods often work quite well, so we need to ask whether there is actual demand for it and to demonstrate its added value.

What is the Scope of Causal Inference for Clinical Research?

RCTs are the gold standard for comparing effectiveness of interventions. Randomization reduces selection bias and confounding and provides causal estimates of effects of taking drug A vs drug B. Also, decision-making and health policy need to be informed by causal knowledge about comparative effectiveness and safety.
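Why randomization yields causal estimates can be illustrated with a small simulation. This is a sketch under assumed, made-up parameters rather than anything shown in the webinar. The same hypothetical patient population is analyzed twice: once with treatment assigned by a coin flip, and once with sicker patients preferentially treated. Only the assignment mechanism differs; the built-in treatment effect (a 0.10 reduction in event risk) is identical.

```python
import random

def crude_risk_difference(randomized, n=200_000, seed=1):
    """Risk in treated minus risk in untreated, with no adjustment.

    All probabilities below are illustrative assumptions.
    """
    rng = random.Random(seed)
    events = {True: 0, False: 0}
    counts = {True: 0, False: 0}
    for _ in range(n):
        severe = rng.random() < 0.5
        if randomized:
            # Coin flip: assignment is independent of disease severity.
            treated = rng.random() < 0.5
        else:
            # Observational setting: severe patients are treated more often.
            treated = rng.random() < (0.8 if severe else 0.2)
        event_risk = 0.2 + 0.4 * severe - 0.1 * treated
        event = rng.random() < event_risk
        counts[treated] += 1
        events[treated] += event
    return events[True] / counts[True] - events[False] / counts[False]

print("randomized:   ", round(crude_risk_difference(randomized=True), 3))
print("observational:", round(crude_risk_difference(randomized=False), 3))
```

Under randomization the unadjusted comparison lands near the built-in effect of -0.10, because severity is balanced across arms by design. In the observational setting the same unadjusted comparison is about +0.14 in expectation, so the treatment wrongly appears harmful.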

However, randomized trials are not always feasible: they are expensive, sometimes unethical, and often impractical or untimely. Moreover, decisions must be made even when no randomized trial addresses the question. Maintaining the status quo, for example, is also a decision.

Alternatively, we can emulate a randomized trial using real-world data. For each observational analysis for causal inference, we can imagine the hypothetical randomized trial that we would prefer to conduct to answer the causal question. We refer to such a trial as a target trial. A causal analysis of observational data can be viewed as an attempt to emulate some target trial. If we cannot explain what our target trial is, then chances are that our causal question is not well-defined.
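One way to make the "can we explain our target trial?" check tangible is to write the protocol down as a structured record and see which parts are still blank. The sketch below is hypothetical: the class name, field names and example values are assumptions, and the list of components follows the protocol elements commonly listed in the target trial literature rather than anything specified in the webinar.

```python
from dataclasses import dataclass, fields

@dataclass
class TargetTrialProtocol:
    """Hypothetical record of the trial we would ideally run."""
    eligibility_criteria: str = ""
    treatment_strategies: str = ""
    assignment_procedure: str = ""
    follow_up_period: str = ""
    outcome: str = ""
    causal_contrast: str = ""
    analysis_plan: str = ""

    def missing_components(self):
        """Names of protocol components that have not been specified yet."""
        return [f.name for f in fields(self)
                if not getattr(self, f.name).strip()]

    def is_well_defined(self):
        """True only when every component has been filled in."""
        return not self.missing_components()

# Illustrative draft for a drug A vs drug B question (made-up values).
draft = TargetTrialProtocol(
    eligibility_criteria="Adults with advanced disease, no prior therapy",
    treatment_strategies="Start drug A vs start drug B at baseline",
    outcome="Overall survival",
)
print(draft.is_well_defined())
print(draft.missing_components())
```

A draft with blank components, like the one above, signals that the causal question needs more work before any observational data are analyzed.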


In his talk, Alind introduces ongoing work at Cytel on applying methods from causal analysis and reinforcement learning to comparative effectiveness research and dynamic treatment regimes.

In summary, causal inference is a complex scientific task that relies on triangulating evidence from multiple sources and on applying a variety of methodological approaches. In clinical research, decision makers use causal inference to choose among different courses of action by comparing the outcomes, effectiveness and safety expected under each. One way to learn what works is to conduct randomized trials. In the absence of a randomized clinical trial, we can try to estimate causal effects from real-world data, which can complement existing trials at a fraction of the time and cost. The approach of emulating target trials makes causal inference and head-to-head comparative effectiveness research using real-world data a reality.

Watch the on-demand webinar to learn more.


About Alind Gupta


Alind Gupta is a senior data scientist at Cytel, specializing in probabilistic graphical models and Bayesian inference. His current work focuses on the use of Bayesian networks and Markov models for modelling heterogeneity in response to cancer immunotherapy and for long-term survival prediction using clinical trial and real-world data. Alind holds a PhD from the University of Toronto, where he studied the genetics of rare diseases.