Over the course of 2020, artificial intelligence researchers published at least 117 peer-reviewed papers.1 Each showcased a new clinically oriented machine learning model performing tasks such as estimating the pre-test probability of SARS-CoV-2 infection, reading chest imaging, or predicting prognostic risk. As enticing as these new technologies seem, facilities should take a cautious approach to using these models in clinical practice, just as with other areas of innovation adoption in medicine. Models should be tested in each individual environment before use.
Drugs and devices undergo strict vetting processes before hospitals can use them. The FDA considers concrete bodies of evidence, including laboratory studies and clinical trials, to ensure safety and efficacy before permitting market entry. In contrast, the FDA does not evaluate the majority of the clinically applicable models developed for COVID-19. They fall under a designation, clinical decision support, that exempts them from regulatory oversight.2 Clinical decision support systems present information to a provider, who then aggregates it with other data to make a clinical decision about patient care. As such, the use of these models occurs in a caveat emptor environment.
This exemption does not equate to suitability for use. Specifically, given the lack of formal validation, poorly constructed machine learning tools may lead decision-making astray. While models can fail for a number of reasons, the COVID-19 pandemic poses three unique challenges.
First, models are only as good as the data used to train them. The novel nature of COVID-19, particularly for models developed earlier in the pandemic, limited the size of the datasets available for training. In extreme cases, models marketed to help COVID-19 patients were trained without any COVID-19 patients at all.3 One such model, the Epic Deterioration Index, is easily accessible in our EHR and a simple addition to a user’s screen. Although Epic has advertised this model for COVID-19 risk stratification, it has not made the underlying data or performance results public.4 Generalization studies show conflicting results, generating confusion as to whether, for instance, the model overestimates or underestimates the risk of poor outcomes.5 The harms of suboptimal data use have been documented across a variety of outcomes, including the potential exacerbation of healthcare inequities.6,7
The second unique issue concerns generalizability. A model’s usefulness derives not from how well it performs on the retrospective data used to train it but from how well it performs on prospective data it has not seen before. If the patient data used for training differ substantially from the patient data the model will be applied to, the tool will perform worse than expected. These considerations are particularly pertinent for a pandemic that, over its course thus far, has shifted among geographic epicenters with different demographics. A single snapshot taken at one time point may fail to inform a subsequent stage of the pandemic. These concerns have been borne out in practice in external validations of published models.8
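To make this point concrete, the sketch below shows one minimal form of temporal validation: training on admissions from an earlier period and reporting discrimination on later, unseen admissions rather than on the retrospective training data. The file name, feature columns, outcome label, and split date are all hypothetical and stand in for whatever a given facility actually has available.

```python
# Minimal sketch of temporal validation (hypothetical data, columns, and dates):
# train on an earlier wave, then report discrimination on a later, unseen wave.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

df = pd.read_csv("admissions.csv", parse_dates=["admit_date"])  # assumed extract
features = ["age", "spo2", "crp", "lymphocyte_count"]           # assumed predictors

# Split by calendar time rather than at random: later admissions stand in for
# the prospective population on which the model will actually be used.
train = df[df["admit_date"] < "2020-07-01"]
test = df[df["admit_date"] >= "2020-07-01"]

model = LogisticRegression(max_iter=1000).fit(train[features], train["poor_outcome"])

auc_retro = roc_auc_score(train["poor_outcome"],
                          model.predict_proba(train[features])[:, 1])
auc_temporal = roc_auc_score(test["poor_outcome"],
                             model.predict_proba(test[features])[:, 1])
print(f"Retrospective AUC: {auc_retro:.2f}; temporal-validation AUC: {auc_temporal:.2f}")
```

A gap between the two numbers is exactly the kind of degradation that retrospective reporting alone would hide.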
Third, applications should include robust performance-monitoring infrastructure. Models should be interrogated continually against new incoming data to ensure they are performing as expected. This process protects against model drift, the expected and well-documented decline in model quality that results from inevitable shifts in the characteristics of disease processes, treatments, or affected populations. Perhaps the most pressing concern in the context of COVID-19 is the emergence of novel variants. The efficacy of models should be re-evaluated in the same manner that the efficacy of vaccines and therapeutics is re-evaluated. For example, we have programming code that runs daily to validate a favorable outcome model deployed for an ongoing randomized controlled trial.9,10 Models designed for parsimonious use by physicians do not include monitoring tools, and shifts in performance cannot easily be detected.
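As a rough illustration only (not the trial code cited above), a daily monitoring job might take the following shape: re-score the deployed model’s recent predictions against observed outcomes over a trailing window and raise an alert when discrimination drops below a pre-specified floor. The data source, column names, window length, and threshold are all assumptions.

```python
# Rough illustration of daily performance monitoring (not the trial code cited
# in the text): recompute discrimination on a trailing window of cases and
# flag possible drift. File name, columns, window, and threshold are assumed.
from datetime import date, timedelta

import pandas as pd
from sklearn.metrics import roc_auc_score

AUC_FLOOR = 0.70    # assumed minimum acceptable discrimination
WINDOW_DAYS = 30    # assumed trailing evaluation window


def check_model_drift(scored_cases: pd.DataFrame) -> None:
    """scored_cases has one row per patient with columns
    'score_date', 'predicted_prob', and 'favorable_outcome' (0/1)."""
    cutoff = pd.Timestamp(date.today() - timedelta(days=WINDOW_DAYS))
    recent = scored_cases[scored_cases["score_date"] >= cutoff]
    if recent["favorable_outcome"].nunique() < 2:
        print("Too few observed outcomes in the window to evaluate; skipping.")
        return
    auc = roc_auc_score(recent["favorable_outcome"], recent["predicted_prob"])
    print(f"Trailing {WINDOW_DAYS}-day AUC: {auc:.2f}")
    if auc < AUC_FLOOR:
        print("ALERT: performance below threshold; review the model before further use.")


if __name__ == "__main__":
    check_model_drift(pd.read_csv("scored_cases.csv", parse_dates=["score_date"]))
```

In practice, a script of this kind could be scheduled to run daily (for example, with cron) and its alerts routed to the team responsible for the deployed model.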
Our concerns are compounded by the accessibility of machine learning tools. Developers publishing clinically relevant models often limit the number of input features and present their tools in clinician-friendly forms. Models can be found as online calculators, whether on developer-created websites or at centralized resources such as MDCalc. These forms simplify use but also remove barriers to using models that have not been vetted.
Facilities internationally have seen relatively unpredictable ebbs and surges in patient volumes.11 Should hospitals become overwhelmed, medical teams may reach for tools to help sick patients. The use of artificial intelligence to address unmet needs in the clinical setting will only grow from this point on. The particularities of this market’s “buyer beware” environment highlight the need for hospitals to rigorously test any model they seek to apply clinically against their own data. Artificial intelligence holds great promise to assist patients and physicians, but only when applied carefully and thoughtfully.