Machine Learning from Irregularly-Sampled Temporal Data: A Case Study in Predicting Across Most Diseases in Electronic Health Records - David Page
From Desmond J Gardfrey
Much of the world’s real data on people is irregularly sampled, temporal, and observational (meaning we don’t get to experiment as in a randomized clinical trial). For example, customers make purchases on dates of their own choosing, not necessarily once a week or once a month, and we only observe their decisions rather than intervene in them. Patients visit the doctor whenever they feel the need, and we observe their doctors’ entries in the electronic health record (EHR), without the ability to randomize patient treatments. We show that despite this lack of control and sampling regularity, we can predict future events from such data with surprising accuracy: for example, better than 80% on average across a variety of diagnosis codes in the EHR, one month in advance. We further show that despite many types of potential confounding, we can actually discover causal factors (e.g., the effect of a drug on a disease or on a measurement such as blood pressure) at similar levels of accuracy for real problems. The key to doing so is modeling person-specific, time-varying baseline levels, e.g., of a measurement such as blood pressure or of a risk such as that of heart attack. On the applied side, this talk will focus entirely on medical applications, but the approaches developed and employed are general-purpose machine learning algorithms with broad potential applicability.
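To make the notion of a person-specific, time-varying baseline concrete, here is a minimal sketch in Python. It is not the method presented in the talk, which the abstract does not specify. It simply assumes that a person's baseline at any visit can be approximated by an exponentially time-decayed mean of that person's own earlier measurements, so that irregular gaps between visits are handled by the elapsed time rather than by a fixed sampling grid. The function name `time_varying_baseline`, the `half_life_days` parameter, and the sample readings are all hypothetical.

```python
import math
from collections import defaultdict


def time_varying_baseline(observations, half_life_days=90.0):
    """Estimate a per-person baseline at each irregularly timed observation.

    observations: iterable of (person_id, day, value) tuples in any order.
    Returns a list of (person_id, day, value, baseline) tuples, where
    baseline is the time-decayed mean of that person's EARLIER values,
    or None when the person has no earlier value.
    """
    # Group measurements by person so each baseline uses only that person's data.
    by_person = defaultdict(list)
    for person_id, day, value in observations:
        by_person[person_id].append((day, value))

    # Assumed decay rate: a measurement's weight halves every half_life_days.
    decay = math.log(2) / half_life_days

    rows = []
    for person_id, series in by_person.items():
        series.sort()  # chronological order within the person
        weighted_sum = 0.0
        weight_total = 0.0
        last_day = None
        for day, value in series:
            if last_day is not None:
                # Shrink all earlier weights by the time elapsed since the last visit.
                shrink = math.exp(-decay * (day - last_day))
                weighted_sum *= shrink
                weight_total *= shrink
            baseline = weighted_sum / weight_total if weight_total > 0 else None
            rows.append((person_id, day, value, baseline))
            # Only now fold the current value in, so it never informs its own baseline.
            weighted_sum += value
            weight_total += 1.0
            last_day = day
    return rows


# Hypothetical blood pressure readings taken on irregular days.
readings = [("a", 0, 120.0), ("a", 10, 130.0), ("a", 200, 125.0), ("b", 5, 90.0)]
for person_id, day, value, baseline in time_varying_baseline(readings):
    deviation = None if baseline is None else value - baseline
    print(person_id, day, value, baseline, deviation)
```

Under these assumptions, the quantity of interest is the deviation of each new measurement from the person's own baseline rather than the raw value, which is one simple way to compare a person against themselves over time instead of against other people.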