Wen X, van der Laan M, Wyss R, Platt RW, Pajouheshnia R, Schneeweiss S. Machine learning to optimize and automate confounding control in studies using high-dimensional healthcare data. Presented at the Virtual 2020 36th ICPE International Conference on Pharmacoepidemiology & Therapeutic Risk Management; September 16, 2020. [abstract] Pharmacoepidemiol Drug Saf. 2020 Oct 15; 29(S3):384. doi: 10.1002/pds.5114


BACKGROUND: Using machine learning (ML) methods, including superlearner, hybrid learner, and collaborative targeted maximum likelihood estimation (CTMLE), in high-dimensional confounding adjustment promises to reduce unmeasured confounding and optimize control in a given data set. However, concerns remain about whether these methods can avoid bias amplification from inappropriate adjustment for instrumental variables or colliders.

OBJECTIVES: To provide an in-depth overview of recent advancements that expand the frequently used high-dimensional propensity score (hdPS) framework with ML to help automate covariate identification and prioritization for improved causal effect estimation. Researchers working on high-dimensional real-world data, comprising electronic health record, administrative, clinical, and other healthcare data, will benefit from attending this symposium.

DESCRIPTION: We will first describe how hdPS optimizes confounder identification and selection free of cognitive biases through automated feature generation and prioritization and why it may reduce bias from unmeasured confounding through proxy adjustment. Insight will be provided into the conditions under which an assumption of reduced unmeasured confounding is likely to hold, and whether this might outweigh the potential cost of adjusting for colliders or instrumental variables, which is known to potentially increase residual bias. Additional ML algorithms are thought to optimize covariate prioritization in the framework of causal inference. We will summarize recently proposed hdPS extensions that incorporate additional ML algorithms and their ability to address potential complications to causal interpretations of findings. The discussion around the pros and cons of these approaches will focus on: a) data structures in which the methods might perform similarly and/or differently; and b) cases of potential model misspecification due to investigator-driven variable selection approaches to propensity score estimation. Finally, we will introduce CTMLE that incorporates ML in estimation of both the treatment mechanism and the outcome regression, and provides statistically efficient estimations of target estimands in causal inference. The CTMLE framework and its cross-validation for selecting tuning parameters of the embedded ML algorithms will be discussed.
