Using Artificial Intelligence to Enable Low-Cost Medical Imaging – Phillip Lippe interviews Keelin Murphy

Medical imaging is a cornerstone of medicine for the diagnosis of disease, treatment selection, and quantification of treatment effects. Now, with the help of deep learning, researchers and engineers strive to enable the widespread use of low-cost medical imaging devices that automatically interpret medical images. This could help low- and middle-income countries meet their clinical demand and help radiologists reduce diagnostic time. In this interview, Phillip Lippe, a PhD student at the QUVA Lab, interviewed Keelin Murphy, a researcher at the Thira Lab, to learn more about the lab’s research and the development of the BabyChecker project.

Keelin Murphy is an Assistant Professor at the Diagnostic Image Analysis Group in Radboud University Medical Center. Her research interests are in AI for low-cost imaging modalities with a focus on applications for low and middle-income countries. This includes chest X-ray applications for the detection of tuberculosis and other abnormalities, as well as ultrasound AI for applications including prenatal screening.
Phillip Lippe is a PhD student in the QUVA Lab at the University of Amsterdam and part of the ELLIS PhD program in cooperation with Qualcomm. His research focuses on the intersection of causality and machine learning, particularly causal representation learning and temporal data. Before starting his PhD, he completed his Master’s degree in Artificial Intelligence at the University of Amsterdam.

The QUVA Lab is a collaboration between Qualcomm and the University of Amsterdam. The mission of the QUVA Lab is to perform world-class research on deep vision: using deep learning to automatically interpret what happens where, when, and why in images and video.

The Thira Lab is a collaboration between Thirona, Delft Imaging, and Radboud UMC. The mission of the lab is to perform world-class research to strengthen healthcare with innovative imaging solutions. Research projects in the lab focus on the recognition, detection, and quantification of objects and structures in images, with an initial focus on applications in the area of chest CT, radiography, and retinal imaging.

In this interview, both Labs come together to discuss the challenges in deep learning regarding the medical imaging domain.

Phillip: Keelin, you witnessed the transition from simple AI to deep learning. What do you think deep learning has to offer in medical image analysis?

I believe deep learning has a huge role to play in medical image analysis. Firstly, radiology equipment is expensive and requires dedicated, trained physicians, which means that low- and middle-income countries cannot meet their clinical radiology demands. Deep learning-powered image analysis therefore has the potential to make access to medical imaging far more equal around the world.

Secondly, even in richer countries such as the Netherlands, we can use deep learning to reduce the costs of radiology clinics. Every minute a radiologist spends looking at an X-ray, for example, is expensive, and radiologists have to review a lot of X-rays every day. While every X-ray still requires the radiologist’s utmost attention, many of them actually show no signs of disease. Deep learning could be used here to prioritize the radiologists’ worklist, putting cases that seem normal at the bottom and cases that seem urgent at the top. Once artificial intelligence can really be relied upon, we could even start removing items from the radiologists’ workflow entirely.
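To make the triage idea concrete, here is a minimal sketch of worklist prioritization. The case IDs, scores, and function names are made up for illustration; in practice the abnormality score would come from a trained model.

```python
# Toy sketch of AI-based worklist triage (hypothetical case IDs and scores).
# A model assigns each study an abnormality score in [0, 1]; urgent-looking
# cases go to the top of the radiologist's worklist, normal-looking ones to
# the bottom.

def prioritize_worklist(studies):
    """Sort studies from most to least suspicious."""
    return sorted(studies, key=lambda s: s["abnormality_score"], reverse=True)

worklist = [
    {"case_id": "CXR-001", "abnormality_score": 0.04},  # looks normal
    {"case_id": "CXR-002", "abnormality_score": 0.91},  # looks urgent
    {"case_id": "CXR-003", "abnormality_score": 0.37},
]

for study in prioritize_worklist(worklist):
    print(study["case_id"], study["abnormality_score"])
```

The urgent-looking case CXR-002 is printed first, the normal-looking CXR-001 last.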

Phillip: You mentioned that you use deep learning, which of course has many facets of neural networks, such as graph neural networks (GNNs) or transformers. Since you are working in imaging analysis, I assume you mostly work with computer vision models. Are you using convolutional neural networks (CNNs) for classification and segmentation or do you even go beyond that scope?

As you mentioned, we almost always use a CNN, where the type of CNN depends on the application. More often than not, due to the confidential nature of medical data, the most important factor in determining which model to use is actually the amount of data that is available. Training on too little data risks overfitting and introduces a lot of uncertainty, so we have to use models cleverly to mitigate this, for example through class balancing, data augmentation, or adjusting the network architecture. Another factor is the size of the model. For instance, to enable global use of the BabyChecker, the model has to fit on a mobile phone, which constrains the size of the network we can use.
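The class balancing and augmentation mentioned above can be sketched in a few lines. This is a generic illustration, not the lab's actual pipeline: the toy dataset, labels, and the horizontal-flip "augmentation" are all hypothetical stand-ins.

```python
import random

random.seed(0)  # make the oversampling reproducible

def balance_by_oversampling(dataset):
    """Duplicate minority-class samples until all classes are equally frequent."""
    by_class = {}
    for image, label in dataset:
        by_class.setdefault(label, []).append((image, label))
    target = max(len(samples) for samples in by_class.values())
    balanced = []
    for samples in by_class.values():
        balanced.extend(samples)
        # re-draw random samples from under-represented classes to reach `target`
        balanced.extend(random.choices(samples, k=target - len(samples)))
    return balanced

def flip_horizontal(image):
    """A stand-in augmentation: mirror a 2D image (a list of pixel rows)."""
    return [row[::-1] for row in image]

# Hypothetical toy dataset: three normal scans, one abnormal scan.
dataset = [("scan1", "normal"), ("scan2", "normal"),
           ("scan3", "normal"), ("scan4", "abnormal")]
balanced = balance_by_oversampling(dataset)
print([label for _, label in balanced].count("abnormal"))  # prints 3
```

Real pipelines typically combine this with richer augmentations (rotations, intensity shifts) so the duplicated samples do not merely repeat the same pixels.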

Phillip: We know that deep learning models can create false predictions, so it might happen that the system indicates that a measurement looks all good, while that person actually needs to go to the hospital. How do you deal with this uncertainty and possible mistakes?

First, we should acknowledge that uncertainty is inevitable. Radiologists make mistakes, just like models can make mistakes. Only through strict quality control processes can we ensure that these models are reliable enough so that they do more good than harm. Especially in the medical field, this poses many challenges. For instance, on the technical side, we should figure out how to deal with domain shifts. On the legal side, we should determine who is responsible if the model makes a mistake and what legal actions can be taken. Those things are incredibly unclear at the moment.

Right now, I still see that artificial intelligence has a big role to play as a suggestive assistant to a radiologist if one is present, or as a screening tool when one isn’t. For instance, in Africa tuberculosis is very prevalent but most often there is no physician available. One of the products developed in our group and now scaled by Delft Imaging is able to detect tuberculosis-related abnormalities in inexpensive chest X-rays and refer patients for a more accurate but expensive microbiological test when necessary. While this product is not flawless, it does allow us to reach people we could not have helped otherwise. So until we reach the stage where systems are sufficiently quality controlled, using deep learning for screening and suggestions can be really useful.

Phillip: This sounds similar to challenges in autonomous driving, where it is hard to determine who is really at fault in an accident. We know that another problem is that neural networks tend to be overconfident, also in situations where they should not be. Are there ways to address this problem?

Yes, I haven’t mentioned this yet, but it is actually really important for getting artificial intelligence accepted into the clinical workflow. Sometimes an image of pure noise accidentally makes its way into the database due to a malfunctioning scanner. If the system still gives you a score for emphysema, you lose faith in that system. In such cases we want the system to report that the image is very different from the images it was trained on and that it cannot classify it. It would be even better if the system provided an interpretable explanation for why it made a certain prediction, since transparency in the prediction process is crucial for clinicians to be able to trust the system.

Phillip: You mentioned interpretability, a topic that has gotten a lot of popularity recently. Especially due to discussions about whether interpretability techniques are truly interpretable. Have you already tried out interpretability methods for neural networks or are those methods still a bit too noisy?

While interpretability methods work well in theory, in my view the field is still too under-researched to have practical value. One popular way of explaining predictions in the medical field is to produce heat maps from the network’s weights and activations. However, such methods are hard to evaluate quantitatively and tend to look prettier than they are useful as explanations.

Phillip: In the low data regime, where models are trained on small amounts of data, the explanations might also quickly overfit to random noise.

Yes, indeed.

Phillip: When clinicians hear about these topics in AI, are they reluctant to participate in research on artificial intelligence?

My experience is really positive, but I work mainly with doctors who are interested in what artificial intelligence has to offer. I believe that clinicians recognize that AI is coming to their field and they either have to get on board or they are going to be left behind. As I mentioned before though, the technology in most cases is not ready to be left unattended. Therefore, use cases that researchers and clinicians prefer right now, often assume a suggestive/assistant role for the AI algorithm or a screening role in scenarios where no trained reader is available.

Phillip: Since we talked a lot about data sparsity, how important is it for you to have collaborations across hospitals or medical companies to get access to data?

Collaboration with your partners is super important, in lots of ways. I believe that researchers should never try to develop medical image analysis solutions without collaborating with clinicians. First off, to do research, you are dependent on the availability and quality of the data that is gathered. If we want to move the field forward, we should communicate more about the following topics: which data can be used for which research purposes; how the data should be gathered so that its quality is as high as possible; and how we can get consent from patients to use their data more often than we do now.

Secondly, there is knowledge sharing involved. Sometimes I read a paper that had no clinical input, and you can really see the difference: either the researchers made mistakes that could have been prevented, or the research has no practical value.

Phillip: Do you consider the agreement of a patient to use their data to be the biggest hurdle for developing something like a medical ImageNet?

The problem is that patients are not asked often enough whether their data can be used for commercial purposes. And even when they are asked, patients may be reluctant to share such private data without knowing how and by whom it will be used. While everybody working in artificial intelligence knows that data is the cornerstone of everything, we should think about how to communicate this effectively to the public, for instance by providing more education to create awareness of what AI is and why large amounts of data are necessary to build successful solutions.

Phillip: From the perspective of a patient, it is a small thing to give, but for the research domain, every single patient who is willing to share their data makes a big difference in enabling better medical image analysis.

Yes indeed. Still, there are challenges that need to be addressed. For instance, do patients feel comfortable with sharing their data with all companies or do they prefer to share their data selectively? What does it mean for competition if all companies have access to the same data? These are questions that we need to find an answer to, together with the community.

Phillip: Yes, maybe patients even want to go as far as approving the specific applications their data can be used for. As scientists, we of course assume that data will be used for good, but we need to make sure that data is really only used for beneficial applications and not for applications that might harm people.

Yes, and we should also make sure that data is completely de-identified, so that an image can never be traced back to the person it was taken of.

Phillip: Now, what is the research focus of the Thira lab?

Our research focuses on two things: the scalability of existing methods in the medical domain, and the reliability of the predictions made by the new methods we are developing. Whatever research we do, I would say the common thread is always the clinical applicability of our solutions rather than pure theoretical knowledge.

Phillip: When developing a device like the BabyChecker, do you only use data acquired with that device to train the model, or is there some domain adaptation involved?

In general, in the minimum viable product stage, we only use data acquired with the actual device, so no domain adaptation is necessary. At this early stage, BabyChecker’s software works with a selected ultrasound probe so that early adopters in our projects can gain easy access to BabyChecker. Over 70 operators who have been trained to use BabyChecker are scanning pregnant women in Tanzania, Ghana, Sierra Leone, Ethiopia, and soon Uganda as well. The data comes back to our partner Delft Imaging, where experts keep a close check on how well the software is working and where physicians assess the quality of the data. This way we make sure that the system is rigorous and that patients get the correct care.

Phillip: You have already mentioned some future improvements to the BabyChecker, where do you want to be in four years?

At the moment, the BabyChecker checks a few things: 1) the gestational age of the baby, to determine the estimated due date; 2) the position of the baby, so that when the baby is in a breech position, the woman can make sure to deliver in a hospital; and 3) the presence of twins, since this is also a high-risk pregnancy where the woman should go to the hospital to deliver. Additionally, we are looking to perform placenta localization and detect the fetal heartbeat to discover possible pregnancy complications.

Phillip: Let’s say that in four years the field of AI has made at least one or multiple steps forward. Where do you see that AI needs to improve, especially in the medical domain?

In general, I would like to see how we can use low-cost x-rays and ultrasounds for lots of other diagnoses. For example, heart failure or lung disease. However, in order for such applications to be feasible, we need AI methods that can work well with small amounts of training data. I think that is really the biggest challenge that we have to overcome.

Phillip: In terms of evaluation, when would you consider your research to be successful? Is it when doctors use the products that you have developed or is it when you feel like there is nothing to improve in the short term?

While I believe I will never feel like there is nothing to improve, I would say my research is successful if we can reliably screen large amounts of people in low-resource settings for all sorts of illnesses and possible complications and get them referred for the treatment they need.

On October 6th, 2022, the Thira Lab and the QUVA Lab will present their current work during the lunch Meetup of ‘ICAI: The Labs’ on AI for Computer Vision in the Netherlands. Want to join? Sign up!