How should medical AI be regulated, and who is responsible when deadly mistakes occur?
In February, San Francisco-based Doximity, a telehealth and medical professional networking company, rolled out a beta version of its medical chatbot DocsGPT. The tool was intended to help doctors with multiple tasks, including writing discharge instructions for patients, taking notes, and responding to other medical prompts, from answering questions about medical conditions to performing calculations for clinical algorithms like estimating kidney function.
However, as I have reported, the app was also engaging in “race-norming” and amplifying race-based medical inaccuracies that could be dangerous to patients who are Black. Although doctors could use it to answer a variety of questions and perform tasks that would impact medical care, the chatbot itself is not classified as a medical device—as doctors aren’t technically supposed to input medically sensitive information (though several doctors and researchers have stated that many still do). As such, companies are free to develop and release these applications without going through a regulatory process that makes sure these apps actually work as intended.
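To see what race-norming looks like inside a kidney-function calculation, consider a minimal sketch (in Python) of the older, race-adjusted 2009 CKD-EPI creatinine equation, the kind of formula a chatbot might be asked to compute. The coefficients below are the commonly published ones, but the example is purely illustrative and not a clinical calculator; the article does not say which equation DocsGPT used, and the 2021 revision of CKD-EPI removed the race term entirely.

```python
def egfr_ckd_epi_2009(scr_mg_dl: float, age: int, female: bool, black: bool) -> float:
    """Estimated GFR (mL/min/1.73 m^2) from the 2009 CKD-EPI creatinine
    equation, including its race coefficient. Coefficients as commonly
    published; for illustration only, not for clinical use."""
    kappa = 0.7 if female else 0.9
    alpha = -0.329 if female else -0.411
    egfr = (141
            * min(scr_mg_dl / kappa, 1.0) ** alpha
            * max(scr_mg_dl / kappa, 1.0) ** -1.209
            * 0.993 ** age)
    if female:
        egfr *= 1.018
    if black:
        egfr *= 1.159  # the race multiplier: inflates the estimate by about 16%
    return egfr

# Same labs, same patient; flipping the race flag alone changes the estimate,
# which can push a Black patient across a referral or treatment threshold.
print(round(egfr_ckd_epi_2009(1.4, 60, female=False, black=False), 1))  # ~54
print(round(egfr_ckd_epi_2009(1.4, 60, female=False, black=True), 1))   # ~63
```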
Still, many companies are developing chatbots and generative artificial intelligence models for integration into health care settings, from medical scribes to diagnostic chatbots, raising wide-ranging concerns over AI regulation and liability. Stanford University data scientist and dermatologist Roxana Daneshjou tells proto.life that part of the problem is figuring out whether the models even work.
“Epic [Systems], a major electronic health care record system, and Microsoft have already created a partnership to try to bring large language models [the technology behind chatbots] to electronic medical record systems,” she says. “Things are moving really fast, faster than usual for technology and medicine.”
(Author’s note: Unrelated to this story and my interview with Daneshjou, I contributed to a research article she published this fall, which found that other chatbots, including ChatGPT and Bard, also propagated race-based medicine in their responses to prompts.)
Verona, Wisconsin-based Epic Systems came under scrutiny in 2021 because the AI algorithm it developed for assessing sepsis risk, deployed to more than 100 health systems nationwide, missed two-thirds of sepsis cases while still issuing plenty of false alarms.
So when something inevitably goes terribly wrong, who is liable? W. Nicholson Price, a professor of law at the University of Michigan, tells proto.life that largely depends on what goes wrong. “I think it’s certainly possible that having big tech centrally involved in regulatory discussions might result in less liability for them,” he says. “I assume that’s part of what they want, and what they will argue will promote innovation and access.”
When is software medicine or health care?
Lots of software in daily use in hospitals and clinics across the United States has never been, and never will be, considered for regulation by the U.S. Food and Drug Administration (FDA). For example, there’s no reason for the FDA to regulate which email clients, spreadsheet programs, or internet browsers should be used in the hospital (though their use still falls under other regulations, like privacy laws). On the other end of the spectrum is software used to run X-ray scanners, MRI machines, and other medical devices, along with clinical algorithms; these are regulated with FDA oversight. But what about everything in between?
As more and more new AI technologies are used for medical and health care purposes, the question emerges: Which of them need to be tested and approved as medical treatments—and how?
Some 500 FDA-approved AI models are built for a single function, for example, screening mammograms for signs of cancer and flagging telltale cases for priority review by human radiologists. However, many companies want to roll out AI tools as informational health devices that technically don’t make any diagnostic claims, pointing to the image recognition app Google Lens as an example.
The Google Lens app was used to build an AI program called DermAssist, which allows people to take a picture of their skin and ask whether a spot looks like a pathogenic lesion or cyst. In 2021, scientists criticized the application because the data used to train the algorithm included few darker skin tones, making its results questionable for people with darker skin. These apps always come with a disclaimer, but that can still confuse consumers. “Some apps try to put disclaimers [that they aren’t diagnostic medical devices approved by the FDA], but they’re essentially doing diagnostic tasks,” Daneshjou says.
Therein lies the loophole: As long as companies don’t explicitly declare that their AI application is intended to be used as a diagnostic or medical tool, they don’t need to seek regulatory approval from the FDA. In the real world, that means chatbots that people use for companionship or for mental health advice are unregulated.
In March of this year, the U.S.-based National Eating Disorders Association made headlines after it fired its helpline staff, replacing them with an AI chatbot called Tessa. A month later, the chatbot was removed from the site after an activist, Sharon Maxwell, shared harmful advice it was offering to people with eating disorders, telling them to eat at a 500- to 1,000-calorie deficit and to track their weight weekly, advice that perpetuates the unhealthy eating habits associated with the disorder and increases the risk of relapse.
That same month, a Belgian man died by suicide after talking with an AI chatbot on an app called Chai for six weeks. He suffered from ecological anxiety, a chronic fear of impending environmental doom. The chatbot apparently convinced him it would be better for the planet if he were to end his own life. So he did.
Key Silicon Valley leaders see AI chatbots as a game changer for health and wellness despite such risks, and the lack of guardrails on these apps as little more than a bump in the road to a brighter future. On September 27, 2023, Ilya Sutskever, the chief scientist at OpenAI, tweeted, “In the future, once the robustness of our models will exceed some threshold, we will have *wildly effective* and dirt cheap AI therapy. Will lead to a radical improvement in people’s experience of life. One of the applications I’m most eagerly awaiting.”
Google launches AI health tool for skin conditions in Europe https://t.co/oneAuvgBGp The algorithm was developed based on training data with less than 4% dark skin types. It should come with a warning BEWARE OF RESULTS IF BLACK!!!
— Ade Adamson, MD MPP (@AdeAdamson) May 18, 2021
Proving that an algorithm works
Safety is always the first concern when submitting software for FDA approval, but the other issue is showing that an algorithm actually works. Eric Topol, an expert on AI in medicine and the founder and director of the Scripps Research Translational Institute in La Jolla, California, says there’s a lack of transparency and little public disclosure around the 500 or so FDA-approved AI models already in use. Those models are markedly less complex than large language models or chatbots, though it’s hard to say how, specifically, because most companies aren’t publishing the data they use to train them, claiming it’s proprietary.
“Lack of transparency hasn’t increased the confidence among clinicians who are [suspicious] of the data,” Topol says, adding that the algorithms are rarely tested in a real-world, prospective setting where doctors and researchers can track the outcomes of patients. “Without compelling data, without transparency, there will be problems with implementation.”
To meet the highest standards of care in medicine, an algorithm should not only provide an answer, but offer a correct one—clearly and effectively. It should improve health outcomes in a meaningful way.
A recent study published in the journal JAMA Network Open tested an algorithm that predicts hospital-acquired blood clots in children. Although the algorithm effectively predicted the clots, it didn’t improve patient outcomes compared to standard care. Another recent study, published in the journal Radiology, found that radiologists who used an AI assistant to screen mammograms for signs of cancer were likely to defer to the algorithm’s judgment, even when the radiologists were highly experienced. That deference often swayed them toward the wrong diagnosis because they assumed the AI had spotted something they hadn’t, lowering the accuracy of highly experienced radiologists from 80 percent to 45 percent (less experienced practitioners performed even worse). Basically, the biases built into the algorithm beget more bias from the user.
This becomes considerably more difficult when assessing generative AI or other algorithms that can be used for multiple types of tasks. Chatbots built on large language models work by predicting which words in a sentence are likely to come next, but those outputs are probabilistic: The model might produce five different answers even if the same prompt is used five times. “People don’t even have a way to grade the outputs,” Daneshjou says. The other problem is that generative AI algorithms can literally make things up, a problem experts call hallucination, and their performance can worsen over time, which is known as “AI aging.”
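To make that probabilistic behavior concrete, here is a minimal sketch of temperature sampling over a toy next-token distribution, showing how the same prompt can yield a different answer on every run. The prompt, vocabulary, and probabilities are invented for illustration; real chatbots do this over tens of thousands of possible tokens at every step of a response.

```python
import random

# Toy next-token distribution a model might assign after the prompt
# "The patient's rash is most consistent with" (values invented for illustration).
next_token_probs = {
    "eczema": 0.45,
    "psoriasis": 0.30,
    "a drug reaction": 0.15,
    "cellulitis": 0.10,
}

def sample_next_token(probs: dict, temperature: float = 1.0) -> str:
    """Sample one token. Higher temperature flattens the distribution,
    making less likely completions more probable; lower temperature
    makes the output more deterministic."""
    weights = [p ** (1.0 / temperature) for p in probs.values()]
    return random.choices(list(probs.keys()), weights=weights, k=1)[0]

# The same prompt, asked five times, can give a different answer each run.
for _ in range(5):
    print(sample_next_token(next_token_probs, temperature=1.0))
```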
Regulating dynamic algorithms
Who decides that an algorithm has shown enough promise to be approved for use in a medical setting? And what happens when it inevitably ages or changes as it receives more data as input? It’s not clear how those updates will be regulated, or even whether they could be.
“The biggest challenge is that these technologies are different in nature than the AI technologies before it,” says Bertalan Meskó, director of the Budapest-based Medical Futurist Institute. “Generative AI is close to adaptive AI that changes with every decision it makes, therefore it will be different on the day it was deployed from the version [that] got approved.”
To address this, regulators want companies to outline precisely what they expect will change over time. “The FDA is developing a predetermined change control plan structure, where companies can say what parameters they expect will change, and within what limits, and if [the agency] okays the plan, those changes don’t need to go back to FDA,” says Price, the University of Michigan law professor.
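As a rough illustration of what such a plan might spell out, the sketch below lists the kinds of parameters and limits a company could predeclare for a hypothetical imaging model. The device name, field names, and thresholds are all invented; the FDA does not prescribe this particular format.

```python
# Hypothetical predetermined change control plan, expressed as data.
# Every name and threshold here is invented for illustration only.
change_control_plan = {
    "device": "ExampleCAD mammography triage model",  # hypothetical device
    "changes_allowed_without_new_review": [
        {
            "change": "retrain on additional screening exams from existing sites",
            "limits": {
                "min_validation_sensitivity": 0.90,
                "min_validation_specificity": 0.85,
                "max_accuracy_drop_in_any_subgroup": 0.02,
            },
        },
        {
            "change": "recalibrate the decision threshold",
            "limits": {"threshold_must_stay_within": [0.30, 0.50]},
        },
    ],
    "changes_requiring_new_fda_review": [
        "a new intended use or patient population",
        "a new imaging modality",
        "a change to the model architecture",
    ],
}
```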
With multimodal AI applications that can do multiple medical tasks, the AI may need to repeat the regulatory process for each indication. “What is appropriate for one indication may or may not be appropriate for another indication,” says Jim McKinney, a press officer for the FDA. “One type of cancer may present differently than another type of cancer, and the FDA would need to evaluate the device’s potential benefits and risk for each desired indication of use.”
But what happens when an AI is updated for an already-approved indication?
McKinney explained over email that a cornerstone of the FDA’s current approach to regulating AI and machine learning-based software is assessing a risk/benefit profile for each device depending on its use, along with an evaluation of potential biases. The agency recognizes “the need for improved methodologies,” he says, to identify biased algorithms and improve them.
Even if the review process is perfect, however, specific algorithms might escape regulation as medical devices.
Epic’s sepsis algorithm, for example, wasn’t technically approved as a medical device: It was used only within electronic health records as a predictive or screening tool, without making the kind of health claims that, at the time, would have required FDA review. In 2022, the FDA released updated guidance saying these tools should now be regulated as medical devices, but the agency will not pursue any actions against similar algorithms currently in use.
Several physicians proto.life spoke to admitted that they’ve heard of cases where colleagues are already using tools like ChatGPT in practice. In many cases the tasks are innocuous: drafting form letters to insurance companies and otherwise offloading small, onerous office duties. More significantly, the physicians say, it is sometimes used as a diagnostic aid.
But in all these cases, physicians may be putting sensitive health data into these models, which may violate health care privacy laws. It is unclear what happens to the data once it goes into ChatGPT, Bard, or other similar AI services—or how those companies might use it. Additionally, it isn’t clear how reliable these tools are, because assessing their effectiveness for these specific uses is challenging.
What does this tell us about the future?
Some AI researchers have questioned with whom U.S. regulators are working to develop these regulations. The Senate recently hosted a private meeting on AI, which predominantly featured celebrity CEOs from multiple tech companies, most of whom don’t have a research background in AI methodology or ethics and who have financial incentives to argue for looser regulations. The guests included Meta’s Mark Zuckerberg, Elon Musk, Google’s Sundar Pichai, and OpenAI’s Sam Altman.
Notably absent from the meeting were reporters, members of the public, patient advocates, academic ethics experts, creators whose work is used for free to train the AIs, workers in the global south (like those in Kenya who earn less than $2 an hour to help make chatbots’ responses less toxic), and all the rank-and-file tech workers who will do the heavy lifting of writing AI code.
Excluding these voices from the regulatory process could make it easier for these companies to market chatbots and apps like DermAssist as non-medical devices, even when they are being used in medical settings. There are many bombastic claims about how AI can revolutionize medicine, but researchers warn we need better studies, transparency, and data before we can be sure.
On October 30, U.S. President Joe Biden signed an executive order aimed at reducing some of the risks posed by AI. Invoking the Defense Production Act, it requires developers of AI systems that pose risks to national security, the economy, or public health to share safety data with the government before releasing them to the public, and it instructs agencies to set standards for these programs.
“To realize the promise of AI and avoid the risk, we need to govern this technology,” Biden said in a press conference. “In the wrong hands AI can make it easier for hackers to exploit vulnerabilities in the software that makes our society run.”
Still, there’s no answer to the question of who would be responsible when something goes wrong. That remains up in the air; so far, no court cases have leveled blame at individual doctors, hospital administrators, companies, or regulators themselves.
Ade Adamson, a dermatologist and researcher at the University of Texas at Austin, tweeted that Google’s dermatology app should come with a warning that says “beware of results if Black.” But these apps are also used outside of medical settings, beyond the ability of doctors to oversee, so it can be difficult for people to heed such warnings. And patients’ own doctors could be using potentially biased chatbots or algorithms to make important decisions on their behalf, without their knowledge.
In the end, there is no opt out—not for consumers, not for health providers, and not for the FDA. Like it or not, AI is here to stay.