Applications of artificial intelligence in diagnostic imaging in neurology

Neurological disorders are the leading cause of disability and the second leading cause of death worldwide (Feigin et al., 2020). In neurology, diagnostic imaging relies mainly on modalities that generate large amounts of complex data, such as magnetic resonance imaging (MRI), computed tomography (CT), and nuclear medicine. For this reason, much of the research on artificial intelligence (AI) applications in radiology has focused on neurological disorders. Indeed, between 29% and 38% of all commercially available AI-based applications in radiology target the brain or spine, a larger share than for any other anatomical region (AI Central).

Most of these applications aim to assist radiologists, either by streamlining image interpretation or by extending their capabilities, for example by providing more detailed quantification of neuroimaging data (Olthof et al., 2020). This book describes the most common applications of AI in neuroradiology and examines the evidence supporting them.

Intracranial hemorrhage

Acute intracranial hemorrhage (ICH) affects around 3.4 million people worldwide each year (World Stroke Organization, 2022). ICH carries high morbidity and mortality and often requires rapid neurosurgical intervention or close clinical and imaging follow-up (Broderick et al., 2007; van Asch et al., 2010). In patients with acute neurological deficits and suspected stroke, detecting acute ICH is crucial because it is an absolute contraindication to intravenous thrombolysis (Fugate & Rabinstein, 2015).

In the emergency setting, suspected ICH is usually evaluated first with non-contrast CT (NCCT) of the head, because CT is widely available, fast, and highly sensitive for ICH, with relatively few contraindications (A. Jain et al., 2021). As an alternative, MRI is more sensitive for small and chronic hemorrhages, but it is slower, less widely available, more expensive, and contraindicated in some patients (Chalela et al., 2007).

In a study designed to identify error patterns among radiology residents in detecting ICH, the investigators found discrepancies in 4.6% of overnight examinations read by residents; of these discrepancies, 13.6% involved hemorrhage that was missing from or inaccurately described in the residents' reports (Strub et al., 2007). ICH can be subdivided into intraparenchymal, intraventricular, subdural, epidural, and subarachnoid hemorrhage. Of these, subdural and subarachnoid hemorrhages are often missed, particularly when they are very small (Strub et al., 2007).

In addition, radiology residents often mistake normal brain anatomy and imaging artifacts for intracranial hemorrhage (Erly et al., 2002).

The vast majority of AI-based applications that aim to detect and classify intracranial hemorrhage take NCCT as input and are based on convolutional neural networks. With a few exceptions (Bar et al., 2019; Wang et al., 2021; Ye et al., 2019), detailed descriptions of the network architecture are rarely available. The quantity and quality of the data used to train these algorithms vary widely, from hundreds (Bar et al., 2019; Heit et al., 2021) to thousands (McLouth et al., 2021; Rava, Seymour, et al., 2021) or tens of thousands (Chilamkurthy et al., 2018; Gibson et al., 2022; Ginat, 2021) of NCCT scans.

Beyond classifying the presence or absence of ICH, AI-based algorithms have also been used to categorize ICH subtypes (Chilamkurthy et al., 2018; Gibson et al., 2022; Wang et al., 2021; Ye et al., 2019), to detect associated findings such as mass effect, midline shift, and fractures (Chilamkurthy et al., 2018), and to perform hemorrhage segmentation and volumetry (Bar et al., 2019; Gibson et al., 2022; Heit et al., 2021). One AI-based application also estimates the uncertainty of the algorithm's decision to help the radiologist interpret its output (Gibson et al., 2022).
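As a rough illustration of the volumetry step, a hemorrhage volume can be derived from a binary segmentation mask by counting the segmented voxels and multiplying by the volume of one voxel. The sketch below is not taken from any of the cited applications; the function name `lesion_volume_ml`, the nested-list mask layout, and the example voxel spacing are assumptions made purely for illustration.

```python
def lesion_volume_ml(mask, spacing_mm):
    """Return the volume, in millilitres, of a binary 3D segmentation mask.

    mask: nested lists indexed [slice][row][column]; 1 marks a hemorrhage voxel.
    spacing_mm: (slice thickness, row spacing, column spacing) in millimetres.
    """
    voxel_mm3 = spacing_mm[0] * spacing_mm[1] * spacing_mm[2]
    n_voxels = sum(value for slice_ in mask for row in slice_ for value in row)
    return n_voxels * voxel_mm3 / 1000.0  # 1 ml equals 1000 cubic millimetres


# Hypothetical two-slice mask: 5 mm slices, 0.5 mm x 0.5 mm in-plane pixels.
example_mask = [
    [[0, 1, 1], [0, 1, 0]],
    [[0, 0, 1], [0, 0, 0]],
]
example_volume = lesion_volume_ml(example_mask, (5.0, 0.5, 0.5))
print(f"Segmented volume: {example_volume:.5f} ml")
```

In practice the mask and voxel spacing would come from the segmentation output and the image header rather than from hand-written lists.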

Overall, among ICH subtypes, the studies cited above report the highest sensitivity for intraventricular hemorrhage (Chilamkurthy et al., 2018; Gibson et al., 2022; McLouth et al., 2021; Wang et al., 2021), probably because of the large difference in CT density between cerebrospinal fluid and blood. However, the sensitivity of all applications is relatively low for subarachnoid hemorrhage (Gibson et al., 2022; McLouth et al., 2021; Rava, Seymour, et al., 2021; Wang et al., 2021; Ye et al., 2019), possibly because these hemorrhages tend to be small and/or to lie adjacent to bony structures or hyperdense CT artifacts (for example, in the basal cisterns). Other applications have also shown relatively low sensitivity for subdural hemorrhage, particularly in less common locations such as along the falx cerebri (Chilamkurthy et al., 2018; Rao et al., 2021; Wang et al., 2021; Ye et al., 2019). Sensitivity also tends to be lower for smaller hemorrhages, defined as <1.5 ml or <5 ml depending on the study (Heit et al., 2021; McLouth et al., 2021; Rava, Seymour, et al., 2021). Only one of the studies cited has systematically investigated how scanner vendor and acquisition parameters affect the diagnostic performance of AI-based applications for detecting intracranial hemorrhage (McLouth et al., 2021).

Some studies have directly compared the performance of AI-based applications with that of experts. In one study of 160 head and neck NCCT scans (49% with ICH), the assessment of a neuroradiology specialist served as the reference standard. In that study, a U-Net convolutional neural network (CNN) achieved a sensitivity of 91% and a specificity of 89%, compared with a sensitivity of 99-100% and a specificity of 98% for two neuroradiology residents (Schmitt et al., 2022). In another study, the interpretations of an FDA-approved, CE-marked AI-based application were compared with the readings of a group of three neuroradiology residents who defined the reference standard. The AI-based application matched the sensitivity of a neuroradiology resident (91.9%), although its specificity was substantially lower (application: 84.4%; resident: 99.6%) (Eldaya et al., 2022). Compared with radiology trainees, one AI-based application showed higher sensitivity and slightly lower specificity for detecting ICH (Ye et al., 2019). Dural thickening, dural and intraparenchymal calcifications, and motion or streak artifacts are the findings most likely to be mistaken for ICH by AI-based applications (Bar et al., 2019; Eldaya et al., 2022; Rao et al., 2021).
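For readers less familiar with the two metrics used throughout these comparisons, sensitivity is the fraction of scans with ICH that are flagged, and specificity is the fraction of scans without ICH that are correctly left unflagged. A minimal sketch follows; the function name and the counts are hypothetical and are not data from any of the studies cited.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Compute sensitivity and specificity from confusion-matrix counts.

    tp: scans with hemorrhage that were flagged (true positives)
    fn: scans with hemorrhage that were missed (false negatives)
    tn: scans without hemorrhage that were not flagged (true negatives)
    fp: scans without hemorrhage that were flagged (false positives)
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical reader study: 50 scans with ICH, 100 scans without.
sens, spec = sensitivity_specificity(tp=45, fn=5, tn=90, fp=10)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

A tool with high sensitivity but lower specificity, as in the comparison above, misses few hemorrhages but raises more false alarms that a radiologist then has to dismiss.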

Although numerous studies have addressed the diagnostic accuracy of AI-based applications for detecting ICH, these applications offer a further benefit: potentially faster reading of the scan, which could lead to faster treatment. Fewer studies have assessed the effect of AI-based triage on turnaround time, but some reports support a reduction in reading times. In a study of 620 NCCT scans, the time from scan completion to reporting of the results was 73 minutes when AI alerted the human reader to findings, compared with 132 minutes when no such alert was issued (Wismüller & Stockmaster, 2020). In addition, the use of AI-based applications has been associated with shorter emergency department stays (561 minutes vs. 781 minutes without AI) (Chien et al., 2022).

Acute ischemic stroke

Large vessel occlusion

In patients with acute ischemic stroke, rapid identification of large vessel occlusions in the brain is essential for timely treatment. In general, the term "large vessel occlusion" (LVO) refers to occlusion of arteries large enough to be amenable to mechanical thrombectomy. Currently, this includes the internal carotid artery (ICA), the proximal segments of the middle (M1 and M2), anterior (A1), and posterior (P1) cerebral arteries, and the basilar artery (Mokin et al., 2019; Pirson et al., 2022).

LVOs are detected directly with digital subtraction angiography, CT angiography, or MR angiography, or indirectly with non-angiographic techniques. On angiography, vessel occlusions appear as an abrupt cutoff of contrast filling of an artery (on contrast-enhanced angiography) or of the flow signal (on non-contrast techniques such as time-of-flight MR angiography). This may occur with or without contrast filling or flow signal distal to the site of occlusion. On non-angiographic techniques, indirect imaging signs of LVO include a hyperdense vessel on NCCT, which represents the occluding thrombus (Gács et al., 1983), and the susceptibility thrombus sign on T2*-weighted or susceptibility-weighted MRI (Flacke et al., 2000).

Most AI-based solutions for LVO detection use CT angiography (Amukotuwa et al., 2019; Murray et al., 2020; Rava, Peterson, et al., 2021; Wardlaw et al., 2022; Yahav-Dovrat et al., 2021), although some also rely on NCCT (Lisowska et al., 2017; Olive-Gadea et al., 2020).

Likewise, most applications have focused on LVOs of the intracranial arteries of the anterior circulation (Adhya et al., 2021; Amukotuwa et al., 2019; Dehkharghani et al., 2021; Rava, Peterson, et al., 2021), reflecting the fact that mechanical thrombectomy is performed far less often for occlusions of posterior circulation vessels (Adusumilli et al., 2022).

In a review of data on AI-based applications for LVO detection, sensitivities ranged from 80% to 96% and specificities from 90% to 98% (Wardlaw et al., 2022). In the studies reviewed, false positives were commonly related to arterial stenosis, intracranial hemorrhage, hypervascular tumors, or distal vessel occlusions that did not meet the criteria for LVO (Amukotuwa et al., 2019; Yahav-Dovrat et al., 2021). Unfortunately, no published performance data are available for several CE-marked AI-based solutions, including some of those designed for LVO detection (van Leeuwen et al., 2021).

At the time of writing, only one study had investigated the cost-effectiveness of AI-based tools for LVO detection. Its analysis showed that, assuming physicians miss 6% of LVOs and that AI can help halve this figure, cost savings of 11 million dollars per year could be achieved in the United Kingdom (van Leeuwen, Meijer, et al., 2021).
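The assumption behind that estimate can be made concrete with simple arithmetic. The sketch below only illustrates the effect of halving the miss rate on the number of missed LVOs; the annual number of LVO cases is a hypothetical placeholder, and the monetary saving depends on cost inputs that are not given in this text and are therefore not modelled.

```python
def missed_cases(n_lvo_cases, miss_rate):
    """Expected number of LVOs missed for a given miss rate."""
    return n_lvo_cases * miss_rate


N_LVO_CASES = 10_000        # hypothetical annual LVO count, for illustration only
BASELINE_MISS_RATE = 0.06   # 6% missed without AI (assumption stated in the text)
AI_MISS_RATE = BASELINE_MISS_RATE / 2  # AI assumed to halve the miss rate

missed_without_ai = missed_cases(N_LVO_CASES, BASELINE_MISS_RATE)
missed_with_ai = missed_cases(N_LVO_CASES, AI_MISS_RATE)
additional_detected = missed_without_ai - missed_with_ai

print(f"Missed without AI: {missed_without_ai:.0f}")
print(f"Missed with AI: {missed_with_ai:.0f}")
print(f"Additional LVOs detected: {additional_detected:.0f}")
```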

Because radiologists and radiology residents rarely miss LVOs on angiographic techniques (Duvekot et al., 2021), the main potential benefit of AI-based LVO detection is faster treatment through faster assessment. Some currently available applications need between 1 and 3.5 minutes to process the data and reach a decision on the presence or absence of an LVO (Amukotuwa et al., 2019; Dehkharghani et al., 2021; Olive-Gadea et al., 2020). Some tools have been associated with reductions of about 22.5 minutes in the time from imaging to transfer of the patient to a hospital capable of mechanical thrombectomy (Hassan et al., 2020), about 15 minutes in the time from hospital arrival to notification of the neuroendovascular team (Morey et al., 2021), and about 25 minutes in the time from imaging to groin puncture for mechanical thrombectomy (Adhya et al., 2021).

Early ischemic changes in brain tissue

On CT, early ischemia-related changes in brain tissue include tissue swelling and reduced tissue attenuation due to ionic edema (Marks et al., 1999). These changes are incorporated into the visual grading tools used by radiologists, the most common being the Alberta Stroke Program Early CT Score (ASPECTS). ASPECTS can help predict both functional outcome and the development of symptomatic intracranial hemorrhage after intravenous thrombolysis (Schröder & Thomalla, 2016). Most AI-based applications that aim to detect early ischemic changes on NCCT do so by providing an automated ASPECTS assessment (Wardlaw et al., 2022). Other applications, however, identify these changes on CT angiography (Abdelkhaleq et al., 2021; Öman et al., 2019) or CT perfusion (Hakim et al., 2021).
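ASPECTS divides the middle cerebral artery territory into 10 regions and subtracts one point from 10 for each region showing early ischemic change, so a normal scan scores 10. The following is a minimal sketch of that scoring rule only; the function name and the use of string labels for regions are illustrative assumptions, and the hard part that automated tools address (deciding which regions are affected on the image) is not shown.

```python
# The ten standard ASPECTS regions: caudate (C), lentiform nucleus (L),
# internal capsule (IC), insular ribbon (I), and cortical regions M1 to M6.
ASPECTS_REGIONS = ("C", "L", "IC", "I", "M1", "M2", "M3", "M4", "M5", "M6")


def aspects_score(affected_regions):
    """Return the ASPECTS value given the labels of the affected regions."""
    affected = set(affected_regions)
    unknown = affected - set(ASPECTS_REGIONS)
    if unknown:
        raise ValueError(f"Unknown ASPECTS regions: {sorted(unknown)}")
    return 10 - len(affected)


# Hypothetical scan with early changes in the insula, lentiform nucleus, and M2.
print(aspects_score(["I", "L", "M2"]))
```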

Most AI-based algorithms for identifying early ischemic changes on CT use visual assessment of NCCT by radiologists, neuroradiologists, or other physicians as the reference standard (Goebel et al., 2018; Hoelter et al., 2020; Kniep et al., 2020; Maegerlein et al., 2019; Seker et al., 2019), although others rely on diffusion-weighted MRI (Abdelkhaleq et al., 2021; Herweh et al., 2016; H. Kuang et al., 2019; Qiu et al., 2020) or the infarct core defined on CT perfusion (Olive-Gadea et al., 2019). Most of these applications use random forests (Guberina et al., 2018; Herweh et al., 2016; Kniep et al., 2020; H. Kuang et al., 2019; Maegerlein et al., 2019; Nagel et al., 2017; Olive-Gadea et al., 2019; Qiu et al., 2020) or convolutional neural networks (Öman et al., 2019). In addition, many studies have focused on automated identification of early ischemic changes on diffusion-weighted MRI (Boldsen et al., 2018; Mohd Saad et al., 2019; Nazari-Farsani et al., 2020; Siddique et al., 2022; Song, 2019; Wong et al., 2022), which is highly sensitive but not widely available in acute settings.

As with LVO applications, no public performance data are available for some of the CE-marked AI-based solutions for detecting early ischemic changes (van Leeuwen et al., 2021). The algorithm with the most published data is a random forest approach to ASPECTS assessment, which was shown to be non-inferior to neuroradiologists, with a sensitivity of 44% and a specificity of 93%, using follow-up CT as the reference (Nagel et al., 2017). Another study using the same algorithm and verification method found that the algorithm had higher sensitivity (83% vs. 73%) but lower specificity (57% vs. 84%) for ASPECTS scoring than neuroradiologists (Guberina et al., 2018). In a third study, this algorithm also outperformed neurologists and neurology residents in ASPECTS scoring and performed similarly to neuroradiologists (Ferreti et al., 2020).

Overall, few studies have directly compared different AI-based applications for detecting early ischemic changes on NCCT (Goebel et al., 2018; Hoelter et al., 2020). One study compared three commercially available applications (two based on machine learning and one on densitometry) in 131 patients (Hoelter et al., 2020). The AI-based applications had an area under the curve (AUC) of 0.73 to 0.76 relative to the consensus of three neuroradiologists.
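The AUC reported here can be read as the probability that a randomly chosen positive case receives a higher algorithm score than a randomly chosen negative case, with ties counted as one half. A minimal sketch of that pairwise definition is given below; the function name and the scores are hypothetical, and the cited study may have computed AUC with different software or at a different unit of analysis.

```python
def auc_pairwise(scores_positive, scores_negative):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts the fraction of (positive, negative) pairs in which the positive
    case has the higher score; tied pairs count as half a win.
    """
    if not scores_positive or not scores_negative:
        raise ValueError("Both groups must contain at least one score.")
    wins = 0.0
    for pos in scores_positive:
        for neg in scores_negative:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(scores_positive) * len(scores_negative))


# Hypothetical algorithm outputs for cases with and without ischemic change.
with_change = [0.9, 0.8, 0.4]
without_change = [0.7, 0.3, 0.2, 0.1]
print(f"AUC = {auc_pairwise(with_change, without_change):.3f}")
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which puts the reported range of 0.73 to 0.76 in context.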

Visual assessment of early ischemic changes on NCCT is particularly difficult in the posterior fossa, where artifacts often hamper interpretation (Hwang et al., 2012). In a cohort of 69 patients with basilar artery occlusion who underwent NCCT within 6 hours of symptom onset, a random forest-based algorithm identified early ischemic changes in the posterior circulation with an AUC ranging from 0.70 (cerebellum) to 0.82 (thalamus), using follow-up NCCT as the reference (Kniep et al., 2020).

Several factors besides anatomical location influence the detectability of early ischemic changes on NCCT. One study found that the accuracy of ASPECTS assessment varies with the type of CT reconstruction used, although an automated algorithm performed more consistently across CT reconstructions than radiology residents or consultants (Seker et al., 2019). In addition, the accuracy of both human and AI-based ASPECTS assessment increases with the time from symptom onset to NCCT, because early ischemic changes become more pronounced (Potreck et al., 2022).

Stroke with unknown time of onset

Knowing how much time has passed since the onset of stroke symptoms is crucial for guiding treatment, because intravenous thrombolysis is indicated only within 4.5 hours of symptom onset (Powers et al., 2018). The time of onset is not always known, for example in patients with wake-up stroke. Wake-up stroke accounts for approximately 14% of patients, according to a population-based study of patients presenting to an emergency department (Mackey et al., 2011). Several imaging-based approaches have been proposed to identify patients within the thrombolysis time window.

To date, the presence of acute stroke lesions on diffusion-weighted imaging (DWI) combined with their absence on fluid-attenuated inversion recovery (FLAIR) MRI has been investigated thoroughly (Ebinger et al., 2010; Thomalla et al., 2011; Thomalla et al., 2018). Automated interpretation of DWI and FLAIR images has also become a target for AI-based algorithms designed to assist radiologists.

AI-based approaches to classifying stroke onset time have included convolutional neural networks (CNNs) (Polson et al., 2022) or a combination of several machine learning algorithms (Jiang et al., 2022; H. Lee et al., 2020; Zhu et al., 2021). Some studies have used a radiomics approach, which involves segmenting DWI and FLAIR lesions, extracting a range of image features, and then feeding these features to different classification algorithms (Jiang et al., 2022; H. Lee et al., 2020; Zhu et al., 2021).

In several studies, AI-based classification of stroke onset time has yielded higher sensitivity but lower specificity than visual assessment by radiologists (H. Lee et al., 2020; Polson et al., 2022). Reported sensitivities range from 73% to 86% and specificities from 68% to 85% (Jiang et al., 2022; H. Lee et al., 2020; Polson et al., 2022; Zhu et al., 2021). A study using a radiomics approach based solely on DWI and T1-weighted images, combined with a deep learning algorithm, found a sensitivity of 95% and a specificity of 50% for identifying patients within the thrombolysis time window (Y.-Q. Zhang et al., 2022).

Traumatic brain injury

Acute traumatic brain injury (TBI) is a sudden physical trauma that damages the brain. Its manifestations include ICH, diffuse axonal injury, and skull and facial fractures. Imaging can also detect the consequences of some of these manifestations, such as midline shift and brain herniation, which may require emergency treatment when severe (Schweitzer et al., 2019).

Although nondisplaced skull fractures without associated ICH are managed conservatively (Skull Fractures, n.d.), few studies have addressed their detection using AI-based techniques. Nevertheless, some recent attempts have been made to classify skull fractures detected on NCCT.

An algorithm based on a multi-label learning approach and trained on 174 NCCT scans (103 with fractures) showed an accuracy of 98% and a specificity of 92% for detecting skull fractures (Emon et al., 2022). Accuracy and specificity were lowest for depressed fractures and highest for linear and facial fractures. A deep learning application designed to detect critical findings on non-contrast head CT showed a sensitivity of 81.2% to 87.2% and a specificity of 77.5% to 86.1% (depending on the test dataset) for detecting skull fractures (Chilamkurthy et al., 2018). In the same study, midline shift and mass effect, both common consequences of trauma-related ICH, were identified with sensitivities of 87.5% to 90.1% and 70.9% to 81.2% and specificities of 83.7% to 89.4% and 61.6% to 73.4%, respectively (depending on the test dataset). An algorithm that combined extraction of skull morphological features with a CNN, trained on 25 NCCT scans and tested on 10 NCCT scans from patients with TBI, achieved an average precision of 60% for detecting skull fractures (Z. Kuang et al., 2020). Another deep learning algorithm achieved a sensitivity of 91.4% and a specificity of 87.5% for identifying skull fractures in a series of 150 postmortem head CT scans (Heimer et al., 2018).

Neurodegenerative diseases

Many neurological conditions can be described as neurodegenerative, although the term is usually applied to chronic neurological diseases that are associated with gradual loss of brain tissue and that typically cause dementia and/or motor dysfunction (Lamptey et al., 2022). More than one fifth of CE-certified or FDA-approved AI-based algorithms in neuroradiology focus on patients with dementia (AI for Radiology, n.d.). Most of them automatically compute regional brain volumes, measure cortical thickness, and quantify white matter lesions caused by cerebral small vessel disease (AI for Radiology, n.d.).

Many disease-specific AI-based algorithms focus on Alzheimer's disease (AD), which is characterized pathologically by extracellular plaques composed of β-amyloid protein and intracellular neurofibrillary tangles containing tau protein, and which causes progressive cognitive decline, both amnestic and non-amnestic (Knopman et al., 2021). Some of these algorithms can distinguish people with AD from cognitively normal people using MRI, with sensitivities ranging from 78% to 99.1% and specificities from 70% to 92.68% (Battineni et al., 2022). An approach based on nonlinear support vector machines was able to differentiate AD from other dementia syndromes, such as frontotemporal lobar degeneration, with an accuracy of 84% (Davatzikos et al., 2008).

Efforts have also been made to predict conversion from the prodromal phase of AD to clinical AD, because the prodromal phase is thought to be when therapeutic interventions could be particularly effective (Crous-Bou et al., 2017).

Mild cognitive impairment (MCI) describes a state in which people have cognitive deficits that are greater than expected for their age but do not significantly interfere with daily activities (Petersen, 2016). Various AI-based approaches have been used to predict conversion from MCI to AD, with accuracies of 66% to 92% (Amoroso et al., 2018; Bron et al., 2015; Lebedev et al., 2014; G. Lee et al., 2019; Lu et al., 2018; Moradi et al., 2015; Ocasio & Duong, 2021; Salvatore et al., 2015; Spasov et al., 2019).

Early diagnosis is also crucial for effective treatment of Parkinson's disease (PD), another common neurodegenerative condition. PD is characterized pathologically by degeneration of dopaminergic neurons in the substantia nigra (Pagan, 2012). By the time motor symptoms pointing to a clinical diagnosis of PD appear, more than 60% of the brain's dopaminergic neurons are estimated to have been lost (GBD 2016 Parkinson's Disease Collaborators, 2018). Several machine learning approaches have been developed to distinguish PD from healthy controls using morphological features derived from structural MRI (Adeli et al., 2016; Chakraborty et al., 2020; Peng et al., 2017), functional MRI (Long et al., 2012; Pläschke et al., 2017; Tang et al., 2017), positron emission tomography (PET) (Piccardo et al., 2021), and single-photon emission computed tomography (SPECT) (Choi et al., 2017; Hirschauer et al., 2015; Ozsahin et al., 2020), often in combination with clinical scores.

Because the motor symptoms of PD overlap with those of other neurological conditions, clinical features alone are often insufficient for a confident diagnosis of PD (Rizzo et al., 2016). Distinguishing idiopathic PD from atypical parkinsonian syndromes, such as multiple system atrophy and progressive supranuclear palsy, on the basis of clinical features is especially difficult (Rizzo et al., 2016). Drawing on the potential of neuroimaging, an early study applied a support vector machine to diffusion tensor imaging to classify idiopathic PD versus other causes of parkinsonism, achieving a sensitivity of 94% and a specificity of 100% (Haller et al., 2012). Other studies have shown high accuracy in distinguishing idiopathic PD from atypical parkinsonism using structural MRI (Duchesne et al., 2009; Focke et al., 2011; Huppertz et al., 2016; Marquand et al., 2013; Salvatore et al., 2014), susceptibility-weighted imaging (Haller et al., 2013), and a combination of diffusion tensor imaging and structural MRI (Cherubini et al., 2014).

Studies have also used machine learning models to help guide PD treatment. In a study of 67 patients with PD, MRI-derived features were found to classify optimal versus non-optimal deep brain stimulation parameters with an accuracy of 88% (Boutet et al., 2021). This could help streamline the current long, costly, and cumbersome process of extensive clinical testing needed to optimize deep brain stimulation parameters in patients with PD.

Multiple sclerosis

Multiple sclerosis (MS) is a common autoimmune disorder of the central nervous system, characterized pathologically by inflammatory demyelination, that gives rise to a wide range of neurological manifestations (McGinley et al., 2021). MRI plays an important role in the diagnosis and management of MS and is the imaging technique of choice for quantifying and classifying MS lesions in the brain and spinal cord (Matthews et al., 2016). Imaging features are a crucial part of the diagnostic criteria for MS (Thompson et al., 2018), and guidelines recommend MRI for monitoring patients and guiding treatment (Wattjes et al., 2015). Several AI-based algorithms have received FDA clearance and CE certification for quantifying brain atrophy and for automated lesion segmentation in MS (Cavedo et al., 2022; Qubiotech Neurocloud Vol, 2021; Zaki et al., 2022).

Many AI-based algorithms used in MS focus on automated extraction of imaging features (Afzal et al., 2022; Bonacchi et al., 2022; Eichinger et al., 2020; Moazami et al., 2021). Visual assessment of the presence of MS lesions and their progression over time is a crucial part of MS diagnosis and follow-up, but it is laborious and complex (Danelakis et al., 2018). As an alternative, several traditional machine learning (Brosch et al., 2016; Goldberg-Zimring et al., 1998; Karimian & Jafari, 2015; Samarasekera et al., 1997; Schmidt et al., 2012; S. Zhang et al., 2018) and deep learning (Birenbaum & Greenspan, 2017; Deshpande et al., 2015; Roy et al., 2018; Valverde et al., 2017, 2019) approaches have been developed to segment MS lesions automatically. Around 30% of these studies use CNNs and 40% use support vector machine approaches (Afzal et al., 2022).

Deep learning approaches have yielded Dice similarity coefficients (a measure of spatial overlap ranging from 0 to 1) of 0.52 to 0.67 relative to manual lesion segmentations (Afzal et al., 2022). Several AI-based approaches (Dolz et al., 2018; Kushibar et al., 2018; Wachinger et al., 2018) have also been investigated for automatically quantifying brain atrophy, another imaging-based prognostic factor for the course of MS (Andravizou et al., 2019).
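The Dice similarity coefficient is defined as twice the number of voxels shared by two segmentations divided by the total number of voxels in both. The sketch below shows this definition on flattened binary masks; the function name, the handling of two empty masks, and the example masks are illustrative assumptions rather than the evaluation code used in the studies cited.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice similarity coefficient between two flattened binary masks.

    Returns 2 * |A and B| / (|A| + |B|). Two empty masks are treated as a
    perfect match (1.0), which is a convention chosen here for illustration.
    """
    if len(mask_a) != len(mask_b):
        raise ValueError("Masks must have the same number of voxels.")
    a = [bool(value) for value in mask_a]
    b = [bool(value) for value in mask_b]
    intersection = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(a) + sum(b)
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Hypothetical manual and automated lesion masks over six voxels.
manual_mask = [0, 1, 1, 1, 0, 0]
automated_mask = [0, 0, 1, 1, 1, 0]
print(f"Dice = {dice_coefficient(manual_mask, automated_mask):.3f}")
```

Because small lesions contribute few voxels, a shift of one or two voxels can lower the coefficient substantially, which is one reason values for MS lesion segmentation tend to be modest.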

AI-based algorithms have also been used to identify MRI abnormalities that are not clearly visible to the naked eye and are not included in current MS diagnostic criteria. These include, for example, abnormalities of the cerebral veins and iron deposition detected on susceptibility-weighted imaging (Lopatina et al., 2020), and abnormalities in normal-appearing white and gray matter on both conventional (Eitel et al., 2019) and advanced (Neeb & Schenk, 2019; Saccà et al., 2019; Yoo et al., 2018; Zurita et al., 2018) MRI sequences.

In diagnosing multiple sclerosis, it is crucial to exclude other diseases with a similar clinical presentation, which can sometimes be difficult (Wildner et al., 2020). Using features extracted from MRI, random forest models and convolutional neural networks (CNNs) have produced accurate results in distinguishing multiple sclerosis from neuromyelitis optica spectrum disorders (Eshaghi et al., 2016; Rocca et al., 2021), as well as from non-inflammatory white matter disorders (Mangeat et al., 2020; Theocharakis et al., 2009), migraine (Rocca et al., 2021), central nervous system vasculitis (Rocca et al., 2021) and brain tumours (Ekşi et al., 2021).

MS is divided into several clinical phenotypes that differ in prognosis and optimal treatment strategy (Lublin et al., 2014). Several studies have evaluated the potential of AI-based approaches to distinguish between clinical phenotypes of MS using diffusion tensor MRI (Kocevar et al., 2016; Marzullo et al., 2019), magnetic resonance spectroscopy (Ekşi et al., 2020; Ion-Mărgineanu et al., 2017) and MRI-based measures of atrophy (Bonacchi et al., 2020).

MS treatment is personalized according to clinical, demographic, laboratory and imaging prognostic markers (Rotstein & Montalban, 2019). Several AI-based algorithms have been evaluated for their ability to predict, from MRI features, conversion from the first clinical episode suggestive of a chronic inflammatory disease of the central nervous system (CNS), known as “clinically isolated syndrome”, to definite MS, with a sensitivity of 64 % to 77 % and a specificity of 66 % to 78 % (Bendfeldt et al., 2019; Wottschel et al., 2015, 2019). AI-based algorithms that combine clinical and MRI data have also been designed to predict disease course and clinical disability (Filippi et al., 2013; Roca et al., 2020; Tommasin et al., 2021; Zhao et al., 2017, 2020). Using support vector machines and extremely randomized trees, one study found that a high-dimensional imaging “fingerprint” derived from T1-weighted and FLAIR images predicted the response to MS treatment better than conventional MRI-derived measures of treatment response, such as brain volume and lesion number and volume (AUC 0.89 vs 0.69) (Kanber et al., 2019).

In addition, AI-based algorithms have shown potential for optimizing the MRI protocols used in multiple sclerosis. This includes extracting information from conventional MRI sequences and generating synthetic sequences from acquired images, for example contrast-enhanced images from non-contrast MRI (Bonacchi et al., 2022).



In the span of roughly a decade, research on AI applications in neuroradiology has advanced remarkably. AI has been especially useful in helping to diagnose conditions such as stroke and intracranial haemorrhage, in which early detection is crucial. There is also growing evidence that AI could be used to track the course of neurological diseases, predict outcomes and, ultimately, enable more personalized and effective treatment strategies. Future research on AI-based algorithms should be complemented by analyses of the cost-effectiveness of these applications and by measurement of the effect of their implementation on overall patient outcomes. These applications should also be supported by more published data on their performance to encourage their use. In neuroradiology, AI appears highly promising for improving the quality of patient care.

Guide to Artificial Intelligence in Radiology

    Artificial intelligence (AI) is playing a growing role in all our lives and has shown promise in addressing some of the greatest current and upcoming societal challenges we face. The healthcare industry, though notoriously complex and resistant to disruption, potentially has a lot to gain from the use of AI. With an established history of leading digital transformation in healthcare and an urgent need for improved efficiency, radiology has been at the forefront of harnessing AI’s potential.

    This book covers how and why AI can address challenges faced by radiology departments, provides an overview of the fundamental concepts related to AI, and describes some of the most promising use cases for AI in radiology. In addition, the major challenges associated with the adoption of AI into routine radiological practice are discussed. The book also covers some crucial points radiology departments should keep in mind when deciding on which AI-based solutions to purchase. Finally, it provides an outlook on what new and evolving aspects of AI in radiology to expect in the near future.

    The healthcare industry has experienced a number of trends over the past few decades that demand a change in the way certain things are done. These trends are particularly salient in radiology, where the diagnostic quality of imaging scans has improved dramatically while scan times have decreased. As a result, the amount and complexity of medical imaging data acquired have increased substantially over the past few decades (Smith-Bindman et al., 2019; Winder et al., 2021) and are expected to continue to increase (Tsao, 2020). This issue is complicated by a widespread global shortage of radiologists (AAMC Report Reinforces Mounting Physician Shortage, 2021; Clinical Radiology UK Workforce Census 2019 Report, 2019). Healthcare workers, including radiologists, have an increasing workload (Bruls & Kwee, 2020; Levin et al., 2017) that contributes to burnout and medical errors (Harry et al., 2021). Because radiology is an essential service provider to virtually all other hospital departments, staff shortages within radiology have significant effects that spread throughout the hospital and to society as a whole (England & Improvement, 2019; Sutherland et al., n.d.).

    With an ageing global population and a rising burden of chronic illnesses, these issues are expected to pose even more of a challenge to the healthcare industry in the future.

    AI-based medical imaging solutions have the potential to ameliorate these challenges for several reasons. They are particularly suited to handling large, complex datasets (Alzubaidi et al., 2021). Moreover, they are well suited to automate some of the tasks traditionally performed by radiologists and radiographers, potentially freeing up time and making workflows within radiology departments more efficient (Allen et al., 2021; Baltruschat et al., 2021; Kalra et al., 2020; O’Neill et al., 2021; van Leeuwen et al., 2021; Wong et al., 2019). AI is also capable of detecting complex patterns in data that humans cannot necessarily find or quantify (Dance, 2021; Korteling et al., 2021; Kühl et al., 2020).

    The term “artificial intelligence” refers to the use of computer systems to solve specific problems in a way that simulates human reasoning. One fundamental characteristic of AI is that, like humans, these systems can tailor their solutions to changing circumstances. Note that, while these systems are meant to mimic on a fundamental level how humans think, their capacity to do so (e.g. in terms of the amount of data they can handle at one time, the nature and amount of patterns they can find in the data, and the speed at which they do so) often exceeds that of humans.

    AI solutions come in the form of computer algorithms, which are pieces of computer code representing instructions to be followed to solve a specific problem. In its most fundamental form, the algorithm takes data as an input, performs some computation on that data, and returns an output.

    An AI algorithm can be explicitly programmed to solve a specific task, analogous to a step-by-step recipe for baking a cake. On the other hand, the algorithm can be programmed to look for patterns within the data in order to solve the problem. These types of algorithms are known as machine learning algorithms. Thus, all machine learning algorithms are AI, but not all AI is machine learning. The patterns in the data that the algorithm can be explicitly programmed to look for or that it can “discover” by itself are known as features. An important characteristic of machine learning is that such algorithms learn from the data itself, and their performance improves the more data they are given.

    One of the most common uses of machine learning is classification, that is, assigning a piece of data a particular label. For example, a machine learning algorithm might be used to tell if a photo (the input) shows a dog or a cat (the label). The algorithm can learn to do so in a supervised or unsupervised way.

    Supervised learning

    In supervised learning, the machine learning algorithm is given data that has been labelled with the ground truth, in this example, photos of dogs and cats that have been labelled as such. The process then goes through the following phases:

    1. Training phase: The algorithm learns the features associated with dogs and cats using the aforementioned data (training data).
    2. Test phase: The algorithm is then given a new set of photos (the test data) to label, and its performance on that data is assessed.

    In some cases, there is a phase in between training and test, known as the validation phase. In this phase, the algorithm is given a new set of photos (not included in either the training or test data), its performance is assessed on this data, and the model is tweaked and retrained on the training data. This is repeated until some predefined performance-based criterion is reached, and the algorithm then enters the test phase.
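
    The three phases described above rely on keeping the training, validation and test data separate. The following is a minimal sketch of such a split, assuming a 60/20/20 ratio and hypothetical photo file names (neither is taken from the text); it is meant as an illustration rather than a recommended protocol.

```python
import random


def split_dataset(items, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle items and split them into training, validation and test subsets.

    The fractions are illustrative assumptions; whatever is left after the
    training and validation subsets becomes the test subset.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a reproducible split
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical labelled photos of dogs and cats.
photos = [f"photo_{i:03d}.jpg" for i in range(100)]
train, val, test = split_dataset(photos)
print(len(train), len(val), len(test))
```

    The model would be fitted on `train`, tuned by repeatedly checking its performance on `val`, and assessed once on `test` at the very end.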

    Unsupervised learning

    In unsupervised learning, the algorithm identifies features within the input data that allow it to assign classes to the individual data points without being told explicitly what those classes are or should be. Such algorithms can identify patterns or group data points together without human intervention and include clustering and dimensionality reduction algorithms. Not all machine learning algorithms perform classification. Some are used to predict a continuous metric (e.g. the temperature in four weeks’ time) instead of a discrete label (e.g. cats vs dogs). These are known as regression algorithms.
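
    As an illustration of the clustering mentioned above, the sketch below groups one-dimensional measurements into clusters without using any labels. It uses k-means, one common clustering algorithm chosen here as an example; the function name, the starting centres and the sample values are all assumptions made for this illustration.

```python
def kmeans_1d(values, k=2, iterations=20):
    """Group 1-D values into k clusters without using any labels."""
    ordered = sorted(values)
    # Assumption: spread the initial centres evenly over the sorted values.
    centres = [ordered[(i * (len(ordered) - 1)) // max(k - 1, 1)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each value joins the cluster with the nearest centre.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centres[c]))
            clusters[nearest].append(v)
        # Update step: move each centre to the mean of its cluster
        # (an empty cluster keeps its previous centre).
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres, clusters


# Hypothetical measurements forming two obvious groups.
measurements = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
centres, clusters = kmeans_1d(measurements, k=2)
print(centres)
print(clusters)
```

    The algorithm is never told which group a measurement belongs to; the two groups emerge from the distances between the values alone.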

    Neural networks and deep learning

    A neural network is made up of an input layer and an output layer, which are themselves composed of nodes. In simple neural networks, features that are manually derived from a dataset are fed into the input layer, which performs some computations, the results of which are relayed to the output layer. In deep learning, multiple “hidden” layers exist between the input and output layers. Each node of the hidden layers performs calculations using certain weights and relays the output to the next hidden layer until the output layer is reached.

    In the beginning, random values are assigned to the weights and the accuracy of the algorithm is calculated. The values of the weights are then iteratively adjusted until a set of weight values that maximize accuracy is found. This iterative adjustment of the weight values is usually done by moving backwards from the output layer to the input layer, a technique called backpropagation. This entire process is done on the training data.
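
    The weight adjustment described above can be sketched for a very small network with two inputs, two hidden nodes and one output node. Several choices here are assumptions made for illustration only: the sigmoid activation, the squared-error loss (a differentiable stand-in for accuracy), the learning rate, and the toy training data (the logical OR function).

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(weights, inputs):
    """Return the hidden activations and the output of a 2-2-1 network."""
    hidden = [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
              for row, b in zip(weights["w_hidden"], weights["b_hidden"])]
    output = sigmoid(sum(w * h for w, h in zip(weights["w_out"], hidden))
                     + weights["b_out"])
    return hidden, output


def train_step(weights, inputs, target, lr=0.5):
    """One backpropagation update for one example; returns its squared error."""
    hidden, output = forward(weights, inputs)
    error = output - target
    # Output layer: gradient of 0.5 * error**2 passed through the sigmoid.
    delta_out = error * output * (1.0 - output)
    # Hidden layer: the output gradient is propagated backwards through w_out.
    delta_hidden = [delta_out * w * h * (1.0 - h)
                    for w, h in zip(weights["w_out"], hidden)]
    for j, h in enumerate(hidden):
        weights["w_out"][j] -= lr * delta_out * h
    weights["b_out"] -= lr * delta_out
    for j, d in enumerate(delta_hidden):
        for i, x in enumerate(inputs):
            weights["w_hidden"][j][i] -= lr * d * x
        weights["b_hidden"][j] -= lr * d
    return 0.5 * error ** 2


rng = random.Random(0)  # random initial weights, seeded for reproducibility
weights = {
    "w_hidden": [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(2)],
    "b_hidden": [0.0, 0.0],
    "w_out": [rng.uniform(-1, 1) for _ in range(2)],
    "b_out": 0.0,
}
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]  # logical OR
losses = [sum(train_step(weights, x, t) for x, t in data) for _ in range(5000)]
print(f"training loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

    Each call to `train_step` computes the error at the output and passes it backwards to update first the output weights and then the hidden weights, which is the backward movement that gives backpropagation its name.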

    Performance evaluation

    Understanding how the performance of AI algorithms is assessed is key to interpreting the AI literature. Several performance metrics exist for assessing how well a model performs certain tasks. No single metric is perfect, so a combination of several metrics provides a fuller picture of model performance.

    In regression, the most commonly used metrics include:

    • Mean absolute error (MAE): the average absolute difference between the predicted values and the ground truth.
    • Root mean square error (RMSE): the differences between the predicted values and the ground truth are squared and then averaged over the sample. Then the square root of the average is taken. Unlike the MAE, the RMSE thus gives higher weight to larger differences.
    • R²: the proportion of the total variance in the ground truth that is explained by the predicted values. It typically ranges from 0 to 1, although it can be negative when the predictions fit worse than simply predicting the mean.
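
    The three regression metrics above can be computed in a few lines. The sketch below assumes the common definition of R² as one minus the ratio of the residual sum of squares to the total sum of squares; the function name and the example values are made up for illustration.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for paired ground-truth and predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total variation
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2


# Hypothetical example: three perfect predictions and one large miss.
y_true = [2.0, 4.0, 6.0, 8.0]
y_pred = [2.0, 4.0, 6.0, 12.0]
mae, rmse, r2 = regression_metrics(y_true, y_pred)
print(mae, rmse, round(r2, 3))
```

    In this example the single large error pulls the RMSE (2.0) well above the MAE (1.0), which illustrates the higher weight the RMSE gives to larger differences.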

    The following metrics are commonly used in classification tasks:

    • Accuracy: this is the proportion of all predictions that were predicted correctly. It ranges from 0 to 1.
    • Sensitivity: also known as the true positive rate (TPR) or recall, this is the proportion of actual positives that were predicted correctly. It ranges from 0 to 1.
    • Specificity: also known as the true negative rate (TNR), this is the proportion of actual negatives that were predicted correctly. It ranges from 0 to 1.
    • Precision: also known as positive predictive value (PPV), this is the proportion of positive predictions that were actually positive. It ranges from 0 to 1.
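
    All four classification metrics above follow from the counts of true positives, false positives, true negatives and false negatives. The sketch below shows the arithmetic on a made-up set of ten cases; the function name and the data are assumptions for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity, specificity and precision from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


# Hypothetical example: 4 positive and 6 negative cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

    Here one positive case is missed and one negative case is flagged, giving an accuracy of 8/10, a sensitivity of 3/4, a specificity of 5/6 and a precision of 3/4.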

    An inherent trade-off exists between sensitivity and specificity. The relative importance of each, as well as their interpretation, depends heavily on the specific research question and classification task.

    Importantly, although classification models are meant to reach a binary conclusion, they are inherently probability-based. This means that these models will output a probability that a data point belongs to one class or another. In order to reach a conclusion on the most likely class, a threshold is used. Metrics such as accuracy, sensitivity, specificity and precision refer to the performance of the algorithm based on a certain threshold. The area under the receiver operating characteristic curve (AUC) is a threshold-independent performance metric. The AUC can be interpreted as the probability that a random positive example is ranked higher by the algorithm than a random negative example.
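
    The difference between threshold-dependent metrics and the AUC can be made concrete with a small sketch. It computes sensitivity and specificity at two thresholds, and computes the AUC directly from its ranking interpretation by comparing every positive example with every negative example. The scores and labels are invented for illustration, and counting ties as half a win is a common convention assumed here.

```python
def metrics_at_threshold(labels, scores, threshold):
    """Sensitivity and specificity after converting scores to 0/1 predictions."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for l, p in zip(labels, predicted) if l == 1 and p == 1)
    fn = sum(1 for l, p in zip(labels, predicted) if l == 1 and p == 0)
    tn = sum(1 for l, p in zip(labels, predicted) if l == 0 and p == 0)
    fp = sum(1 for l, p in zip(labels, predicted) if l == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc_by_ranking(labels, scores):
    """Fraction of positive/negative pairs in which the positive scores higher."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # assumption: a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical model outputs (probability of the positive class).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
sens_high, spec_high = metrics_at_threshold(labels, scores, 0.5)
sens_low, spec_low = metrics_at_threshold(labels, scores, 0.35)
auc = auc_by_ranking(labels, scores)
print(sens_high, spec_high, sens_low, spec_low, auc)
```

    Lowering the threshold from 0.5 to 0.35 raises the sensitivity from 2/3 to 1 in this example, whereas the AUC (8/9) is a single value that does not depend on any threshold.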

    In image segmentation tasks, which are a type of classification task, the following metrics are commonly used:

    • Dice similarity coefficient: a measure of overlap between two sets (e.g. two images) that is calculated as two times the number of elements common to the sets divided by the sum of the number of elements in each set. It ranges from 0 (no overlap) to 1 (perfect overlap).
    • Hausdorff distance: a measure of how far apart two sets (e.g. two images) are within a space. It is essentially the largest distance from a point in one set to the closest point in the other set.
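
    Both segmentation metrics can be illustrated by treating each segmentation as a set of pixel coordinates. The sketch below assumes two small, non-empty, made-up masks and uses the symmetric form of the Hausdorff distance, which takes the larger of the two directed distances.

```python
import math


def dice(a, b):
    """Dice similarity coefficient between two non-empty sets of pixels."""
    return 2 * len(a & b) / (len(a) + len(b))


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty sets of pixels."""
    def directed(src, dst):
        # Largest distance from a point in src to its closest point in dst.
        return max(min(math.dist(p, q) for q in dst) for p in src)
    return max(directed(a, b), directed(b, a))


# Hypothetical 2 x 2 pixel masks, shifted by one pixel relative to each other.
manual = {(0, 0), (0, 1), (1, 0), (1, 1)}
automatic = {(0, 1), (1, 1), (0, 2), (1, 2)}
print(dice(manual, automatic), hausdorff(manual, automatic))
```

    The two masks share two of their four pixels each, so the Dice coefficient is 0.5, and no pixel in either mask is more than one pixel away from the other mask, so the Hausdorff distance is 1.0.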

    Internal and external validity

    Internally valid models perform well in their task on the data being used to train and validate them. The degree to which they are internally valid is assessed using the performance metrics outlined above and depends on the characteristics of the model itself and the quality of the data that the model was trained and validated on.

    Externally valid models perform well in their tasks on new data (Ramspek et al., 2021). The better the model performs on data that differs from the data the models were trained and validated on, the higher the external validity. In practice, this often requires the performance of the models to be tested on data from hospitals or geographical areas that were not part of the model’s training and validation datasets.

    Guidelines for evaluating AI research

    Several guidelines have been developed to assess the evidence behind AI-based interventions in healthcare (X. Liu et al., 2020; Mongan et al., 2020; Shelmerdine et al., 2021; Weikert et al., 2021). These provide a template for those doing AI research in healthcare and ensure that relevant information is reported transparently and comprehensively, but can also be used by other stakeholders to assess the quality of published research. This helps ensure that AI-based solutions with substantial potential or actual limitations, particularly those caused by poor reporting (Bozkurt et al., 2020; D. W. Kim et al., 2019; X. Liu et al., 2019; Nagendran et al., 2020; Yusuf et al., 2020), are not prematurely adopted (CONSORT-AI and SPIRIT-AI Steering Group, 2019). Guidelines have also been proposed for evaluating the trustworthiness of AI-based solutions in terms of transparency, confidentiality, security, and accountability (Buruk et al., 2020; Lekadir et al., 2021; Zicari et al., 2021).

    Over the past few years, AI has shown great potential in addressing a broad range of tasks within a medical imaging department, including many that happen before the patient is scanned. Implementations of AI to improve the efficiency of radiology workflows prior to patient scanning are sometimes referred to as “upstream AI” (Kapoor et al., 2020; M. L. Richardson et al., 2021).


    One promising upstream AI application is predicting which patients are likely to miss their scan appointments. Missed appointments are associated with significantly increased workload and costs (Dantas et al., 2018). Using a Gradient Boosting approach, Nelson et al. predicted missed hospital magnetic resonance imaging (MRI) appointments in the United Kingdom’s National Health Service (NHS) with high accuracy (Nelson et al., 2019). Their simulations also suggested that acting on the predictions of this model by targeting patients who are likely to miss their appointments would potentially yield a net benefit of several pounds per appointment across a range of model thresholds and missed appointment rates (Nelson et al., 2019). Similar results were recently found in a study of a single hospital in Singapore. In the 6-month period following deployment of the predictive tool, the no-show rate fell significantly from 19.3 % to 15.9 %, which translated into a potential economic benefit of $180,000 (Chong et al., 2020).

    Scheduling scans in a radiology department is a challenging endeavour because, although it is largely an administrative task, it depends heavily on medical information. The task of assigning patients to specific appointments thus often requires the input of someone with domain knowledge, which means that either the person making the appointments must be a radiologist or radiology technician, or these people must provide input regularly. In either scenario, the process is somewhat inefficient and can potentially be streamlined using AI-based algorithms that check scan indications and contraindications and provide the people scheduling the scans with information about scan urgency (Letourneau-Guillon et al., 2020).


    Depending on hospital or clinic policy, the decision on what exact scan protocol a patient receives is usually made based on the information on the referring physician’s scan request and the judgement of the radiologist. This is often supplemented by direct communication between the referring physician and radiologist and the radiologist’s review of the patient’s medical information. This process improves patient care (Boland et al., 2014) but can be time-consuming and inefficient, particularly with modalities like MRI, where a large number of protocol permutations exist. In one study, protocolling alone accounted for about 6 % of the radiologist’s working time (Schemmel et al., 2016). Radiologists are also often interrupted by tasks such as protocolling when interpreting images, despite the fact that the latter is considered a radiologist’s primary responsibility (Balint et al., 2014; J.-P. J. Yu et al., 2014).

    Interpretation of the narrative text of the referring physician’s scan request has been attempted using natural language classifiers, the same technology used in chatbots and virtual assistants. Natural language classifiers based on deep learning have shown promise in assigning patients to either a contrast-enhanced or non-enhanced MRI protocol for musculoskeletal MRI, with an accuracy of 83 % (Trivedi et al., 2018) and 94 % (Y. H. Lee, 2018). Similar algorithms have shown an accuracy of 95 % for predicting the appropriate brain MRI protocol using a combination of up to 41 different MRI sequences (Brown & Marotta, 2018). Across a wide range of body regions, a deep-learning-based natural language classifier decided based on the narrative text of the scan requests whether to automatically assign a specific computed tomography (CT) or MRI protocol (which it did with 95 % accuracy) or, in more difficult cases, recommend a list of three most appropriate protocols to the radiologist (which it did with 92 % accuracy) (Kalra et al., 2020).

    AI has also been used to decide whether already protocolled scans need to be extended, a decision which has to be made in real-time while the patient is inside the scanner. One such example is in prostate MRI, where a decision on whether to administer a contrast agent is often made after the non-contrast sequences. Hötker et al. found that a convolutional neural network (CNN) assigned 78 % of patients to the appropriate prostate MRI protocol (Hötker et al., 2021). The sensitivity of the CNN for the need for contrast was 94.4 % with a specificity of 68.8 %, and only 2 % of patients in their study would have had to be called back for a contrast-enhanced scan (Hötker et al., 2021).

    Image quality improvement and monitoring

    Many AI-based solutions that work in the background of radiology workflows to improve image quality have recently been established. These include solutions for monitoring image quality, reducing image artefacts, improving spatial resolution, and speeding up scans.

    Such solutions are entering the radiology mainstream, particularly for computed tomography, which for decades used established but artefact-prone methods for reconstructing interpretable images from the raw sensor data (Deák et al., 2013; Singh et al., 2010).

    These are gradually being replaced by deep-learning-based reconstruction methods, which improve image quality while maintaining low radiation doses (Akagi et al., 2019; H. Chen et al., 2017; Choe et al., 2019; Shan et al., 2019). This reconstruction is performed on supercomputers on the CT scanner itself or in the cloud. The balance between radiation dose and image quality can be adjusted on a protocol-specific basis to tailor scans to individual patients and clinical scenarios (McLeavy et al., 2021; Willemink & Noël, 2019). Such approaches have found particular use when scanning children, pregnant women, and obese patients, as well as in CT scans of the urinary tract and heart (McLeavy et al., 2021).

    AI-based solutions have also been used to speed up scans while maintaining diagnostic quality. Scan time reduction not only improves overall efficiency but also contributes to an overall better patient experience and compliance with imaging examination. A multi-centre study of spine MRI showed that a deep-learning-based image reconstruction algorithm that enhanced images using filtering and detail-preserving noise reduction reduced scan times by 40 % (Bash, Johnson, et al., 2021). For T1-weighted MRI scans of the brain, a similar algorithm that improves image sharpness and reduces image noise reduced scan times by 60 % while maintaining the accuracy of brain region volumetry compared to standard scans (Bash, Wang, et al., 2021).

    In routine radiological practice, images often contain artefacts that reduce their interpretability. These artefacts are the result of characteristics of the specific imaging modality or protocol used or factors intrinsic to the patient being scanned, such as the presence of foreign bodies or the patient moving during the scan. Particularly with MRI, imaging protocols that demand fast scanning often introduce certain artefacts to the reconstructed image. In one study, a deep-learning-based algorithm reduced banding artefacts associated with balanced steady-state free precession MRI sequences of the brain and knee (K. H. Kim & Park, 2017). For real-time imaging of the heart using MRI, another study found that the aliasing artefacts introduced by the data undersampling were reduced by using a deep-learning-based approach (Hauptmann et al., 2019). The presence of metallic foreign bodies such as dental, orthopaedic or vascular implants is a common patient-related factor causing image artefacts in both CT and MRI (Boas & Fleischmann, 2012; Hargreaves et al., 2011). Although not yet well established, several deep-learning-based approaches for reducing these artefacts have been investigated (Ghani & Clem Karl, 2019; Puvanasunthararajah et al., 2021; Zhang & Yu, 2018). Similar approaches are being tested for reducing motion-related artefacts in MRI (Tamada et al., 2020; B. Zhao et al., 2022).

    AI-based solutions for monitoring image quality potentially reduce the need to call patients back to repeat imaging examinations, which is a common problem (Schreiber-Zinaman & Rosenkrantz, 2017). A deep-learning-based algorithm that identifies the radiographic view acquired and extracts quality-related metrics from ankle radiographs was able to predict image quality with about 94 % accuracy (Mairhöfer et al., 2021). Another deep-learning-based approach was capable of predicting nondiagnostic liver MRI scans with a negative predictive value of between 86 % and 94 % (Esses et al., 2018). This real-time automated quality control potentially allows radiology technicians to rerun scans or run additional scans with greater diagnostic value.

    Scan reading prioritization

    With staff shortages and increasing scan numbers, radiologists face long reading lists. To optimize efficiency and patient care, AI-based solutions have been suggested as a way to prioritize which scans radiologists read and report first, usually by screening acquired images for findings that require urgent intervention (O’Connor & Bhalla, 2021). This has been most extensively studied in neuroradiology, where moving CT scans that were found to have intracranial haemorrhage by an AI-based tool to the top of the reading list reduced the time it took radiologists to view the scans by several minutes (O’Neill et al., 2021). Another study found that the time-to-diagnosis (which includes the time from image acquisition to viewing by the radiologist and the time to read and report the scans) was reduced from 512 to 19 minutes in an outpatient setting when such a worklist prioritization was used (Arbabshirani et al., 2018). A simulation study using AI-based worklist prioritization based on identifying urgent findings on chest radiographs (such as pneumothorax, pleural effusions, and foreign bodies) also found a substantial reduction in the time it took to view and report the scans compared to standard workflow prioritization (Baltruschat et al., 2021).
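
    The idea of worklist prioritization described above can be sketched as a simple reordering of the reading list. The example below is a deliberately simplified illustration and does not reproduce how any of the cited tools work: the `Scan` fields, the scan identifiers and the rule of placing flagged scans first and otherwise keeping arrival order are all assumptions.

```python
from typing import NamedTuple


class Scan(NamedTuple):
    scan_id: str           # hypothetical identifier
    arrival_minute: int    # minutes since the start of the shift
    ai_flag_urgent: bool   # True if an AI tool flagged a possible urgent finding


def prioritize(worklist):
    """Place AI-flagged scans first; keep arrival order within each group."""
    return sorted(worklist, key=lambda s: (not s.ai_flag_urgent, s.arrival_minute))


worklist = [
    Scan("CT-A", 0, False),
    Scan("CT-B", 5, True),
    Scan("CT-C", 7, False),
    Scan("CT-D", 9, True),
]
reading_order = [scan.scan_id for scan in prioritize(worklist)]
print(reading_order)
```

    In this sketch the two flagged scans (CT-B and CT-D) move ahead of the two unflagged scans, which keep their original arrival order behind them.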

    Image interpretation

    Currently, the majority of commercially available AI-based solutions in medical imaging focus on some aspect of analyzing and interpreting images (Rezazade Mehrizi et al., 2021; van Leeuwen et al., 2021). This includes segmenting parts of the image (for surgical or radiation therapy targeting, for example), bringing suspicious areas to radiologists’ attention, extracting imaging biomarkers (radiomics), comparing images across time, and reaching specific imaging diagnoses.


    Neuroradiology: 29–38 % of commercially available AI-based applications in radiology (Rezazade Mehrizi et al., 2021; van Leeuwen et al., 2021).

    Most commercially available AI-based solutions targeted at neuroimaging data aim to detect and characterize ischemic stroke, intracranial haemorrhage, dementia, and multiple sclerosis (Olthof et al., 2020). Several studies have shown excellent accuracy of AI-based methods for the detection and classification of intraparenchymal, subarachnoid, and subdural haemorrhage on head CT (Flanders et al., 2020; Ker et al., 2019; Kuo et al., 2019). Subsequent studies showed that, compared to radiologists, some AI-based solutions have substantially lower false positive and negative rates (Ginat, 2020; Rao et al., 2021). In ischemic stroke, AI-based solutions have largely focused on the quantification of the infarct core (Goebel et al., 2018; Maegerlein et al., 2019), the detection of large vessel occlusion (Matsoukas et al., 2022; Morey et al., 2021; Murray et al., 2020; Shlobin et al., 2022), and the prediction of stroke outcomes (Bacchi et al., 2020; Nielsen et al., 2018; Y. Yu et al., 2020, 2021).

    In multiple sclerosis, AI has been used to identify and segment lesions (Nair et al., 2020; S.-H. Wang et al., 2018), which can be particularly helpful for the longitudinal follow-up of patients. It has also been used to extract imaging features associated with progressive disease and conversion from clinically isolated syndrome to definite multiple sclerosis (Narayana et al., 2020; Yoo et al., 2019). Other applications of AI in neuroradiology include the detection of intracranial aneurysms (Faron et al., 2020; Nakao et al., 2018; Ueda et al., 2019) and the segmentation of brain tumours (Kao et al., 2019; Mlynarski et al., 2019; Zhou et al., 2020), as well as the prediction of brain tumour genetic markers from imaging data (Choi et al., 2019; J. Zhao et al., 2020).


    Chest radiology: 24–31 % of commercially available AI-based applications in radiology (Rezazade Mehrizi et al., 2021; van Leeuwen et al., 2021).

    When interpreting chest radiographs, radiologists detected substantially more critical and urgent findings when aided by a deep-learning-based algorithm, and did so much faster than without the algorithm (Nam et al., 2021). Deep-learning-based image interpretation algorithms have also been found to improve radiology residents’ sensitivity for detecting urgent findings on chest radiographs from 66 % to 73 % (E. J. Hwang, Nam, et al., 2019). Another study which focused on a broader range of findings on chest radiographs also found that radiologists aided by a deep-learning-based algorithm had higher diagnostic accuracy than radiologists who read the radiographs without assistance (Seah et al., 2021). The uses of AI in chest radiology also extend to cross-sectional imaging like CT. A deep learning algorithm was found to detect pulmonary embolism on CT scans with high accuracy (AUC = 0.85) (Huang, Kothari, et al., 2020). Moreover, a deep learning algorithm was 90 % accurate in detecting aortic dissection on non-contrast-enhanced CT scans, similar to the performance of radiologists (Hata et al., 2021).

    Outside the emergency setting, AI-based solutions have been widely tested and implemented for tuberculosis screening on chest radiographs (E. J. Hwang, Park, et al., 2019; S. Hwang et al., 2016; Khan et al., 2020; Qin et al., 2019; WHO Operational Handbook on Tuberculosis Module 2: Screening – Systematic Screening for Tuberculosis Disease, n.d.). In addition, they have been useful for lung cancer screening both in terms of detecting lung nodules on CT (Setio et al., 2017) and chest radiographs (Li et al., 2020) and by classifying whether nodules are likely to be malignant or benign (Ardila et al., 2019; Bonavita et al., 2020; Ciompi et al., 2017; B. Wu et al., 2018). AI-based solutions also show great promise for the diagnosis of pneumonia, chronic obstructive pulmonary disease, and interstitial lung disease (F. Liu et al., 2021).


    Breast radiology: 11 % of commercially available AI-based applications in radiology (Rezazade Mehrizi et al., 2021; van Leeuwen et al., 2021).

    So far, many of the AI-based algorithms targeting breast imaging aim to reduce the workload of radiologists reading mammograms. Ways to do this have included using AI-based algorithms to triage out negative mammograms, which in one study was associated with a reduction in radiologists’ workload by almost one-fifth (Yala et al., 2019). Other studies that have replaced second readers of mammograms with AI-based algorithms have shown that this leads to fewer false positives and false negatives and reduces the workload of the second reader by 88 % (McKinney et al., 2020).

    AI-based solutions for mammography have also been found to increase the diagnostic accuracy of radiologists (McKinney et al., 2020; Rodríguez-Ruiz et al., 2019; Watanabe et al., 2019) and some have been found to be highly accurate in independently detecting and classifying breast lesions (Agnes et al., 2019; Al-Antari et al., 2020; Rodriguez-Ruiz et al., 2019).
    Despite this, a recent systematic review of 36 AI-based algorithms found that the underlying studies were of poor methodological quality and that all algorithms were less accurate than the consensus of two or more radiologists (Freeman et al., 2021). AI-based algorithms have nonetheless shown potential for extracting cancer-predictive features from mammograms beyond mammographic breast density (Arefan et al., 2020; Dembrower et al., 2020; Hinton et al., 2019). Beyond mammography, AI-based solutions have been developed for detecting and classifying breast lesions on ultrasound (Akkus et al., 2019; Park et al., 2019; G.-G. Wu et al., 2019) and MRI (Herent et al., 2019).


    • 11 % of commercially available AI-based applications in radiology (Rezazade Mehrizi et al., 2021; van Leeuwen et al., 2021).

    Cardiac radiology has always been particularly challenging because of the difficulties inherent in acquiring images of a constantly moving organ. Because of this, it has benefited immensely from advances in imaging technology and seems set to benefit greatly from AI as well (Sermesant et al., 2021). Most AI-based applications for the cardiovascular system use MRI, CT or ultrasound data (Weikert et al., 2021). Prominent examples include the automated calculation of ejection fraction on echocardiography, quantification of coronary artery calcification on cardiac CT, determination of right ventricular volume on CT pulmonary angiography, and determination of heart chamber size and thickness on cardiac MRI (Medical AI Evaluation, n.d.; The Medical Futurist, n.d.). AI-based solutions that use imaging and clinical parameters to predict which patients are likely to respond favourably to cardiac interventions, such as cardiac resynchronization therapy, have also shown great promise (Cikes et al., 2019; Hu et al., 2019). Changes in cardiac MRI that are not readily visible to human readers but are potentially useful for distinguishing between types of cardiomyopathies can also be detected using AI through texture analysis (Neisius et al., 2019; J. Wang et al., 2020) and other radiomic approaches (Mancio et al., 2022).


    • 7–11 % of commercially available AI-based applications in radiology (Rezazade Mehrizi et al., 2021; van Leeuwen et al., 2021).

    Promising applications of AI in the assessment of muscles, bones and joints include applications where human readers generally show poor between- and within-rater reliability, such as the determination of skeletal age based on bone radiographs (Halabi et al., 2019; Thodberg et al., 2009) and screening for osteoporosis on radiographs (Kathirvelu et al., 2019; J.-S. Lee et al., 2019) and CT (Pan et al., 2020). AI-based solutions have also shown promise for detecting fractures on radiographs and CT (Lindsey et al., 2018; Olczak et al., 2017; Urakawa et al., 2019). One systematic review of AI-based solutions for fracture detection in several different body parts showed AUCs ranging from 0.94 to 1.00 and accuracies of 77 % to 98 % (Langerhuizen et al., 2019). AI-based solutions have also achieved accuracies similar to those of radiologists for classifying the severity of degenerative changes of the spine (Jamaludin et al., 2017) and extremity joints (F. Liu et al., 2018; Thomas et al., 2020). AI-based solutions have also been developed to determine the origin of skeletal metastases (Lang et al., 2019) and to classify primary bone tumours (Do et al., 2017).

    Abdomen and pelvis

    • 4 % of commercially available AI-based applications in radiology (Rezazade Mehrizi et al., 2021; van Leeuwen et al., 2021).

    Much of the effort in using AI in abdominal imaging has thus far concentrated on the automated segmentation of organs such as the liver (Dou et al., 2017), spleen (Moon et al., 2019), pancreas (Oktay et al., 2018), and kidneys (Sharma et al., 2017). In addition, a systematic review of 11 studies using deep learning for the detection of malignant liver masses showed accuracies of up to 97 % and AUCs of up to 0.92 (Azer, 2019).

    Other applications of AI in abdominal radiology include the detection of liver fibrosis (He et al., 2019; Yasaka et al., 2018), fatty liver disease, hepatic iron content, the detection of free abdominal gas on CT, and automated volumetry and segmentation of the prostate (AI for Radiology, n.d.).

    Despite the great potential of AI in medical imaging, it has yet to find widespread implementation and impact in routine clinical practice. This research-to-clinic translation is being hindered by several complex and interrelated issues that directly or indirectly lower the likelihood of AI-based solutions being adopted. One major way they do so is by creating a lack of trust in AI-based solutions among key stakeholders such as regulators, healthcare professionals and patients (Cadario et al., 2021; Esmaeilzadeh, 2020; J. P. Richardson et al., 2021; Tucci et al., 2022).


    One major challenge is to develop AI-based solutions that continue to perform well in new, real-world scenarios. In a large systematic review, almost half of the studied AI-based medical imaging algorithms reported a greater than 0.05 decrease in the AUC when tested on new data (A. C. Yu et al., 2022). Such a lack of generalizability means that a model’s performance in real-world use can fall short of what was reported during development.

    If a solution performs poorly when tested on a dataset with a similar or identical distribution to the training dataset, it is said to lack narrow generalizability, which is often a consequence of overfitting (Eche et al., 2021). Potential remedies for overfitting include using larger training datasets and reducing the model’s complexity. If a solution performs poorly when tested on a dataset with a different distribution to the training dataset (e.g. a different distribution of patient ethnicities), it is said to lack broad generalizability (Eche et al., 2021). Approaches to addressing poor broad generalizability include stress-testing the model on datasets with different distributions from the training dataset (Eche et al., 2021).
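
    The kind of stress test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (auc, generalization_gap) and the toy labels and scores are hypothetical, and the 0.05 threshold simply mirrors the AUC decrease mentioned in the review by A. C. Yu et al. (2022) rather than any established cut-off. The AUC is computed by pairwise comparison of positive and negative cases, which is equivalent to the area under the ROC curve.

```python
def auc(labels, scores):
    """AUC via pairwise comparison of positive and negative cases."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive case ranked above negative case
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


def generalization_gap(internal, external, threshold=0.05):
    """Compare AUC on an internal test set with AUC on an external one.

    `threshold` is an illustrative assumption, not a validated cut-off.
    Returns (internal AUC, external AUC, drop, flagged).
    """
    auc_internal = auc(*internal)
    auc_external = auc(*external)
    drop = auc_internal - auc_external
    return auc_internal, auc_external, drop, drop > threshold


# Toy data (hypothetical): (true labels, model scores)
internal_set = ([1, 1, 1, 0, 0, 0], [0.9, 0.8, 0.7, 0.3, 0.2, 0.1])
external_set = ([1, 1, 1, 0, 0, 0], [0.9, 0.4, 0.6, 0.5, 0.2, 0.7])

auc_int, auc_ext, drop, flagged = generalization_gap(internal_set, external_set)
print(f"internal AUC={auc_int:.2f}, external AUC={auc_ext:.2f}, flagged={flagged}")
```

    In this toy case the model separates the internal test cases perfectly but ranks several external cases incorrectly, so the drop in AUC is flagged. In practice the external dataset would come from a different site, scanner vendor or patient population.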

    AI solutions are often developed in high-resource environments such as large technology companies and academic medical centres in wealthy countries. It is likely that findings and performance in these high-resource contexts will fail to generalize to lower-resource contexts such as smaller hospitals, rural areas or poorer countries (Price & Nicholson, 2019), which complicates the issue further.

    Risk of bias

    Biases can arise in AI-based solutions due to data or human factors. The former occurs when the data used to train the AI solution does not adequately represent the target population. Datasets can be unrepresentative when they are too small or have been collected in a way that misrepresents a certain population category. AI solutions trained on unrepresentative data perpetuate biases and perform poorly in the population categories underrepresented or misrepresented in the training data. The presence of such biases has been empirically shown in many AI-based medical imaging studies (Larrazabal et al., 2020; Seyyed-Kalantari et al., 2021).
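
    One simple way to look for the data-related bias described above is to break a performance metric down by population category. The following Python sketch does this for sensitivity. It is a hypothetical illustration: the function name sensitivity_by_group, the group labels "A" and "B", and the toy records are all assumptions, and real audits would use larger samples and confidence intervals.

```python
from collections import defaultdict


def sensitivity_by_group(records):
    """Compute sensitivity (true-positive rate) separately for each group.

    `records` is an iterable of (group, true_label, predicted_label),
    where labels are 1 (disease present) or 0 (disease absent).
    Groups with no positive cases are omitted from the result.
    """
    true_pos = defaultdict(int)
    false_neg = defaultdict(int)
    for group, truth, prediction in records:
        if truth == 1:
            if prediction == 1:
                true_pos[group] += 1
            else:
                false_neg[group] += 1
    groups = set(true_pos) | set(false_neg)
    return {g: true_pos[g] / (true_pos[g] + false_neg[g]) for g in groups}


# Toy records (hypothetical): group B is underrepresented
records = [
    ("A", 1, 1), ("A", 1, 1), ("A", 1, 1), ("A", 1, 0), ("A", 0, 0),
    ("B", 1, 1), ("B", 1, 0), ("B", 0, 0),
]

per_group = sensitivity_by_group(records)
gap = max(per_group.values()) - min(per_group.values())
print(per_group, f"gap={gap:.2f}")
```

    A large gap between groups does not by itself prove that the training data were unrepresentative, but it indicates where a closer look is warranted.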

    During their development, AI-based solutions are subject to several subjective and sometimes implicitly or explicitly prejudiced human decisions. These human factors include how the training data is selected, how it is labelled, and how the decision is made to focus on the specific problem the AI-based solution intends to solve (Norori et al., 2021). Some recommendations and tools are available to help minimize the risk of bias in AI research (AIF360: A Comprehensive Set of Fairness Metrics for Datasets and Machine Learning Models, Explanations for These Metrics, and Algorithms to Mitigate Bias in Datasets and Models, n.d.; IBM Watson Studio - Model Risk Management, n.d.; Silberg & Manyika, 2019).

    Data quantity, quality and variety

    Problems such as bias and lack of generalizability can be mitigated by ensuring that training data is of sufficient quantity, quality and variety. However, this is difficult to do because patients are often reluctant to share their data for commercial purposes (Aggarwal, Farag, et al., 2021; Ghafur et al., 2020; Trinidad et al., 2020), hospitals and clinics are usually not equipped to make this data available in a useable and secure manner, and organizing and labelling the data is time-consuming and expensive.

    Many datasets can be used for a number of different purposes, and sharing data between companies can help make the process of data collection and organization more efficient, as well as increase the amount of data available for each application. However, developers are often reluctant to share data with each other, or even reveal the exact source of their data, to stay competitive.

    Data protection and privacy

    The development and implementation of AI-based solutions require that patients are explicitly informed about, and give their consent to, the use of their data for a particular purpose and by certain people. This data also has to be adequately protected from data breaches and misuse. Failure to ensure this greatly undermines the public’s trust in AI-based solutions and hinders their adoption. While regulations governing health data privacy state that the collection of fully anonymized data does not require explicit patient consent (General Data Protection Regulation (GDPR) – Official Legal Text, 2016; Office for Civil Rights (OCR), 2012) and in theory protects from the data being misused, whether or not imaging data can be fully anonymized is controversial (Lotan et al., 2020; Murdoch, 2021). Whether consent can be truly informed considering the complexity of the data being acquired, and the resulting myriad of potential future uses of the data, is also disputed (Vayena & Blasimme, 2017).

    IT infrastructure

    Among hospital departments, radiology has always been at the forefront of digitalization. AI-based solutions that focus on image processing and interpretation are likely to find the prerequisite infrastructure in most radiology departments, for example for linking imaging equipment to computers for analysis and for archiving images and other outputs. However, most radiology departments are likely to require significant infrastructure upgrades for other applications of AI, particularly those requiring the integration of information from multiple sources and having complex outputs. Moreover, it is important to keep in mind that the distribution of necessary infrastructure is highly unequal across and within countries (Health Ethics & Governance, 2021).

    In terms of computing power, radiology departments will either have to invest resources into the hardware and personnel necessary to run these AI-based solutions or opt for cloud-based solutions. The former comes with an extra cost but allows data processing within the confines of the hospital or clinic’s local network. Cloud-based solutions for computing (known as “infrastructure as a service” or “IaaS”) are often considered the less secure and less trustworthy option, but this depends on a number of factors and is thus not always true (Baccianella & Gough, n.d.). Guidelines on what to consider when procuring cloud-based solutions in healthcare are available (Cloud Security for Healthcare Services, 2021).

    Lack of standardization, interoperability, and integrability

    The problem of infrastructure becomes even more complicated when considering how fragmented the AI medical imaging market currently is (Alexander et al., 2020). It is therefore likely that in the near future a single department will have several dozen AI-based solutions from different vendors running simultaneously. Having a separate self-contained infrastructure (e.g. a workstation or server) for each of these would be incredibly complicated and difficult to manage. Suggested solutions for this have included AI solution “marketplaces”, similar to app stores (Advanced AI Solutions for Radiology, n.d.; Curated Marketplace, 2018; Imaging AI Marketplace - Overview, n.d.; Sectra Amplifier Marketplace, 2021; The Nuance AI Marketplace for Diagnostic Imaging, n.d.), and development of an overarching vendor-neutral infrastructure (Leiner et al., 2021). The successful implementation of such solutions requires close partnerships between AI solution developers, imaging vendors and information technology companies.


    It is often impossible to understand exactly how AI-based solutions come to their conclusions, particularly with complex approaches like deep learning. This reduces how transparent the decision-making process for procuring and approving these solutions can be, makes the identification of biases difficult, and makes it harder for clinicians to explain the outputs of these solutions to their patients and to determine whether a solution is working properly or has malfunctioned (Char et al., 2018; Reddy et al., 2020; Vayena et al., 2018; Whittlestone et al., 2019). Some have suggested that techniques that help humans understand how AI-based algorithms made certain decisions or predictions (“interpretable” or “explainable” AI) might help mitigate these challenges. However, others have argued that currently available techniques are unsuitable for understanding individual decisions of an algorithm and have warned against relying on them for ensuring that algorithms work in a safe and reliable way (Ghassemi et al., 2021).


    In healthcare systems, a framework of accountability ensures that healthcare workers and medical institutions can be held responsible for adverse effects resulting from their actions. The question of who should be held accountable for the failures of an AI-based solution is complicated. For pharmaceuticals, for example, the accountability for inherent failures in the product or its use often lies with either the manufacturer or the prescriber. One key difference is that AI-based systems are continuously evolving and learning, and so inherently work in a way that is independent of what their developers could have foreseen (Yeung, 2018). To end-users such as healthcare workers, the AI-based solution may be opaque, so they may not be able to tell when the solution is malfunctioning or inaccurate (Habli et al., 2020; Yeung, 2018).


    Despite substantial progress in their development over the past few years, deep learning algorithms are still surprisingly brittle. This means that, when the algorithm faces a scenario that differs substantially from what it faced during training, it cannot contextualize and often produces nonsensical or inaccurate results. This happens because, unlike humans, most algorithms learn to perceive things within the confines of certain assumptions, but fail to generalize outside these assumptions. As an example of how this can be abused with malicious intent, subtle changes to medical images, imperceptible to humans, can render the results of disease-classifying algorithms inaccurate (Finlayson et al., 2018). The lack of interpretability of many AI-based solutions compounds this problem because it makes it difficult to troubleshoot how they reached the wrong conclusion.
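
    How a change that is tiny for each pixel can still flip a classifier’s output can be illustrated with a deliberately simplified Python sketch. It is not the method used by Finlayson et al. (2018): the model here is a hypothetical linear (logistic) classifier with made-up weights, and the perturbation follows the sign of each weight, in the style of the fast gradient sign method. All names and numbers are assumptions chosen for illustration.

```python
import math


def predict(weights, bias, pixels):
    """Probability of 'disease' from a toy logistic model (hypothetical)."""
    logit = bias + sum(w * p for w, p in zip(weights, pixels))
    return 1.0 / (1.0 + math.exp(-logit))


def perturb(weights, pixels, epsilon):
    """Shift every pixel by at most `epsilon` in the direction that lowers the score.

    For a linear model the gradient of the logit with respect to a pixel is
    its weight, so stepping against the sign of the weight lowers the logit.
    """
    def sign(w):
        return 1 if w > 0 else -1 if w < 0 else 0
    return [p - epsilon * sign(w) for w, p in zip(weights, pixels)]


# Toy "image" of 1000 pixels on a 0-1 intensity scale (made-up values)
n_pixels = 1000
weights = [0.1 if i % 2 == 0 else -0.1 for i in range(n_pixels)]
bias = 1.0
image = [0.5] * n_pixels

epsilon = 0.02  # each pixel changes by 2 % of the intensity range
adversarial_image = perturb(weights, image, epsilon)

p_clean = predict(weights, bias, image)
p_adversarial = predict(weights, bias, adversarial_image)
max_change = max(abs(a - b) for a, b in zip(image, adversarial_image))
print(f"clean={p_clean:.2f}, perturbed={p_adversarial:.2f}, max change={max_change:.3f}")
```

    Because the small shifts all push in the same direction across many pixels, their combined effect moves the prediction from above 0.5 to below 0.5, even though no single pixel changes by more than 0.02.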

    So far, more than 100 AI-based products have gained Conformité Européenne (CE) marking or Food and Drug Administration (FDA) clearance. These products can be found in continuously updated and searchable online databases curated by the FDA (Center for Devices & Radiological Health, n.d.), the American College of Radiology (Assess-AI, n.d.), and others (AI for Radiology, n.d.; The Medical Futurist, n.d.; E. Wu et al., 2021). The increasing number of available products, the inherent complexity of many of these solutions, and the fact that many people who usually make purchasing decisions in hospitals are not familiar with evaluating such products make it important to think carefully when deciding on which product to purchase. Such decisions will need to be made after incorporating input from healthcare workers, information technology (IT) professionals, as well as management, finance, legal, and human resources professionals within hospitals.

    Deciding on whether to purchase an AI-based solution in radiology, as well as which of the increasing number of commercially available solutions to purchase, includes considerations of quality, safety, and finances. Over the past few years, several guidelines have emerged to help potential buyers make these decisions (A Buyer’s Guide to AI in Health and Care, 2020; Omoumi et al., 2021; Reddy et al., 2021), and these guidelines are likely to evolve in the future with changing expectations from customers, regulatory bodies, and stakeholders involved in reimbursement decisions.

    First of all, it has to be clear to the potential buyer what the problem is and whether AI is the appropriate approach to solving it, or whether alternatives exist that are more advantageous on balance. If AI is the appropriate approach, buyers should know the exact scope of a potential AI-based product, i.e. what specific problem it is designed to solve and in what specific circumstances. This includes whether the solution is intended for screening, diagnosis, monitoring, treatment recommendation or another application. It also includes the intended users of the solution and what kind of specific qualifications or training they are expected to have in order to be able to operate the solution and interpret its outputs. It needs to be clear to buyers whether the solution is intended to replace certain tasks that would normally be performed by the end-user, act as a double-reader, serve as a triaging mechanism, or perform other tasks like quality control. Buyers should also understand whether the solution is intended to provide “new” information (i.e. information that would otherwise be unavailable to the user without the solution), to improve the performance of an existing task beyond that of a human or a non-AI-based solution, or to save time or other resources.

    Buyers should also have access to information that allows them to assess the potential benefits of the AI solution, and this should be backed up by published scientific evidence for the efficacy and cost-efficiency of the solution. How this is done will depend highly on the solution itself and the context in which it is expected to be deployed, but guidelines for this are available (National Institute for Health and Care Excellence (NICE), n.d.). Some questions to ask here would be: How much of an influence will the solution have on patient management? Will it improve diagnostic performance? Will it save time and money? Will it affect patients’ quality of life? It should also be clear to the buyer who exactly is expected to benefit from the use of this solution (Radiologists? Clinicians? Patients? The healthcare system or society as a whole?).

    As with any healthcare intervention, all AI-based solutions come with potential risks, and these should be made clear to the buyer. Some of these risks might have legal consequences, such as the potential for misdiagnosis. These risks should be quantified, and potential buyers should have a framework for dealing with them, including identifying a framework for accountability within the organizations implementing these solutions. Buyers should also ensure they clearly understand the potential negative effects on radiologists’ training and the potential disruption to radiologists’ workflows associated with the use of these solutions.

    Specifics of the AI solution’s design are also relevant to the decision on whether or not to purchase it. These include how robust the solution is to differences between vendors and scanning parameters, the circumstances under which the algorithm was trained (including potential confounding factors), and the way that performance was assessed. It should also be clear to buyers if and how potential sources of bias were accounted for during development. Because a core characteristic of AI-based solutions is their ability to continuously learn from new data, whether and how exactly this retraining is incorporated into the solution with time should also be clear to the buyer, including whether or not new regulatory approval is needed with each iteration. This also includes whether or not retraining is required, for example, due to changes in imaging equipment at the buyer’s institution.

    The main selling points of many AI-based solutions are ease-of-use and improved workflows. Therefore, potential buyers should carefully scrutinize how these solutions are to be integrated into existing workflows, including interoperability with PACS and electronic medical record systems. Whether the solution requires extra hardware (e.g. graphics processing units) or software (e.g. for visualization of the solution’s outputs), or can readily be integrated into the existing information technology infrastructure of the buyer’s organization, influences the overall cost of the solution for the buyer and is therefore also a critical consideration. In addition, the degree of manual interaction required, both under normal circumstances and for troubleshooting, should be known to the buyer. All potential users of the AI solution should be involved in the purchasing process to ensure that they are familiar with it and that it meets their professional ethical standards and suits their needs.

    From a regulatory perspective, it should be clear to the buyer whether the solution complies with medical device and data protection regulations. Has the solution been approved in the buyer’s country? If so, under which risk classification? Buyers should also consider creating data flow maps that display how the data flows in the operation of the AI-based solution, including who has access to the data.

    Finally, there are other factors to consider which are not necessarily unique to AI-based solutions and which buyers might be familiar with from purchasing other types of solutions. These include the licensing model of the solution, how users are to be trained on using the solution, how the solution is maintained, how failures in the solution are dealt with, and whether additional costs are to be expected when scaling up the solution’s implementation (e.g. using the solution for more imaging equipment or more users). Knowing these factors allows the potential buyer to anticipate the current and future costs of purchasing the solution.

    The past decade of increasing interest and progress in AI-based solutions for medical imaging has set the stage for a number of trends that are likely to appear or intensify in the near future.

    Firstly, there is an increasing sentiment that, although AI holds a great deal of promise for interpretive applications (such as the detection of pathology), non-interpretive AI-based solutions might hold the most potential in terms of instilling efficiency into radiology workflows and improving patient experiences. This trend towards involving AI earlier in the patient management process is likely to extend to AI increasingly acting as a clinical decision support system to guide when and which imaging scans are performed.

    For this to happen, AI needs to be integrated into existing clinical information systems, and the specific algorithms used need to be able to handle more varied data. This will likely pave the way for the development of algorithms that are capable of integrating demographic, clinical, and laboratory patient data to make recommendations about patient management (Huang, Pareek, et al., 2020; Rockenbach, 2021). The previously mentioned natural language processing algorithms that have been used to interpret scan requests may be useful candidates for this.

    In addition, we are likely to see AI algorithms that can interpret multiple different types of imaging data from the same patient. Currently, fewer than 5 % of commercially available AI-based solutions in medical imaging work with more than one imaging modality (Rezazade Mehrizi et al., 2021; van Leeuwen et al., 2021), even though the typical patient in a hospital receives multiple imaging scans during their stay (Shinagare et al., 2014). It is also likely that more AI-based solutions will be developed that target hitherto neglected modalities such as nuclear imaging techniques and ultrasound.

    The current market for AI-based solutions in radiology is spread across a relatively large number of companies (Alexander et al., 2020). Potential users are likely to expect a streamlined integration of these products in their workflows, which can be challenging in such a fragmented market. Improved integration can be achieved in several different ways, including with vendor-neutral marketplaces or by the gradual consolidation of providers of AI-based solutions.

    With the expanding use of AI, the issue of trust between AI developers, healthcare professionals, regulators, and patients will become more relevant. It is therefore likely that efforts will intensify to take steps towards strengthening that trust. This will potentially include raising the expected standards of evidence for AI-based solutions (Aggarwal, Sounderajah, et al., 2021; X. Liu et al., 2019; van Leeuwen et al., 2021; Yusuf et al., 2020), making them more transparent through the use and improvement of interpretable AI techniques (Holzinger et al., 2017; Reyes et al., 2020; “Towards Trustable Machine Learning,” 2018), and enhancing techniques for maintaining patient data privacy (G. Kaissis et al., 2021; G. A. Kaissis et al., 2020).

    Furthermore, while most existing regulations stipulate that AI-based algorithms cannot be modified after regulatory approval, this is likely to change in the future. The potential for these algorithms to learn from data acquired after approval and adapt to changing circumstances is a major advantage of AI. Still, frameworks for doing so have thus far been lacking in the healthcare sector. However, promising ideas have recently emerged, including adapting existing hospital quality assurance and improvement frameworks to monitor AI-based algorithms’ performance and the data they are trained on and update the algorithms accordingly (Feng et al., 2022). This will likely require the development of multidisciplinary teams within hospitals consisting of clinicians, IT professionals, and biostatisticians who closely collaborate with model developers and regulators (Feng et al., 2022).
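
    The idea of monitoring an algorithm’s performance after deployment can be sketched as a simple rolling check, written here in Python. This is not the framework proposed by Feng et al. (2022): the class name PerformanceMonitor, the baseline accuracy, the tolerance and the window size are all hypothetical values chosen for illustration, and a real quality assurance programme would use more robust statistics and defined escalation procedures.

```python
from collections import deque


class PerformanceMonitor:
    """Flag when rolling accuracy falls below a baseline minus a tolerance."""

    def __init__(self, baseline_accuracy, tolerance, window_size):
        self.threshold = baseline_accuracy - tolerance
        self.window = deque(maxlen=window_size)  # keeps only the latest cases

    def record(self, prediction, confirmed_label):
        """Add one case; return True if the rolling accuracy is below threshold."""
        self.window.append(prediction == confirmed_label)
        if len(self.window) < self.window.maxlen:
            return False  # not enough cases yet for a stable estimate
        accuracy = sum(self.window) / len(self.window)
        return accuracy < self.threshold


# Hypothetical settings: accuracy at approval 0.90, alert if below 0.75
monitor = PerformanceMonitor(baseline_accuracy=0.90, tolerance=0.15, window_size=10)

# Toy stream of (algorithm output, confirmed diagnosis): 10 correct, then 3 wrong
cases = [(1, 1)] * 10 + [(1, 0)] * 3
alerts = [monitor.record(prediction, truth) for prediction, truth in cases]
print(alerts)
```

    In this toy stream the alert is raised only once three of the last ten cases are wrong. An alert of this kind would be the trigger for the multidisciplinary review and possible retraining discussed in this section, not an automatic update of the algorithm.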

    While the obstacles discussed in previous sections might slow down the adoption of AI in radiology somewhat, the fear of AI potentially replacing radiologists is unlikely to be one of them. A recent survey from Europe showed that most radiologists did not perceive a reduction in their clinical workload after adopting AI-based solutions (European Society of Radiology (ESR), 2022), likely because, at the same time, demand for radiologists’ services has been continuously rising. Studies from around the world have shown that radiology professionals, particularly those with AI exposure and experience, are generally optimistic about the role of AI in their practice (Y. Chen et al., 2021; Huisman et al., 2021; Ooi et al., 2021; Santomartino & Yi, 2022; Scott et al., 2021).

    AI has shown promise in positively impacting virtually every facet of a radiology department’s work - from scheduling and protocolling patient scans to interpreting images and reaching diagnoses. Promising research on AI-based tools in radiology has not yet been widely translated to adoption in routine practice, however, because of a number of complex, partially intertwined issues. Potential solutions exist for many of these challenges, but many of these solutions require further refinement and testing. In the meantime, guidelines are emerging to help potential users of AI-based solutions in radiology navigate the increasing number of commercial products. This encourages their adoption in real-world scenarios, thus allowing their true potential to be uncovered, as well as their weaknesses to be identified and addressed in a safe and effective way. As these incremental improvements are made, these tools will likely evolve to handle more varied data, become integrated into consolidated workflows, become more transparent, and ultimately more useful for increasing efficiency and improving patient care.

    AI Central. (n.d.). Retrieved July 2, 2022, from https://aicentral.com/

    AI for Radiology. (n.d.). Retrieved July 2, 2022, from https://grand-challenge.org/aiforradiology/

