The COVID-19 pandemic exposed significant corruption and ineptitude within the medical community. Ham-fisted federal policies were amplified without question by local health officials and executed, sometimes without a second thought, by your own physician.
From debates over the efficacy of facemasks for kids to discussions about vaccine mandates for those already recovered from the virus, it became evident that medical professionals - like workers in almost any industry - will just do what they’re told.
In our own world, it was extremely difficult for Jenny and me to find a physician who would treat our kids with dignity and our opinions with the respect we deserved.
Enter Artificial Intelligence
Now, a new study brings to bear one of the cutting-edge technologies I spent much of last year investigating: artificial intelligence. (See my A.I. blog here.)
A multi-center, randomized clinical vignette study explored the impact of GPT-4, a large language model developed by OpenAI, on physicians' diagnostic abilities. The study aimed to assess whether this AI could aid doctors in improving their diagnostic reasoning compared to traditional resources.
THE BOTTOM LINE: Doctors were tasked with diagnosing cases—half of them had access to GPT-4, while the other half worked solo. The control group scored a median of about 74% on the cases, and the GPT-4-assisted group about 76%. Not exactly groundbreaking.
Here’s the kicker: GPT-4 alone scored 92%!
Seems like the docs didn’t feel like listening to the AI.
Study Overview
Objective
To assess the impact of the GPT-4 large language model on physicians’ diagnostic reasoning compared to conventional diagnostic resources.
Methodology
Design: Multi-center, randomized clinical vignette study.
Participants: 50 resident and attending physicians specializing in family medicine, internal medicine, or emergency medicine.
Intervention:
GPT-4 Group: Physicians had access to GPT-4 in addition to conventional diagnostic resources.
Control Group: Physicians used only conventional diagnostic resources.
Procedure: Participants had 60 minutes to review up to six clinical vignettes adapted from established diagnostic reasoning exams.
Primary Outcome: Diagnostic performance based on differential diagnosis accuracy, appropriateness of supporting and opposing factors, and next diagnostic evaluation steps.
Secondary Outcomes: Time spent per case and final diagnosis accuracy.
Key Findings
Diagnostic Performance:
GPT-4 Group: Median diagnostic reasoning score per case was 76.3%.
Control Group: Median score was 73.7%.
Adjusted Difference: 1.6 percentage points (95% CI -4.4 to 7.6; p=0.60), not statistically significant.
Time Efficiency:
GPT-4 Group: Median time spent per case was 519 seconds.
Control Group: Median time was 565 seconds.
Adjusted Time Difference: GPT-4 group spent 82 fewer seconds per case (95% CI -195 to 31; p=0.20), not statistically significant.
GPT-4 Alone:
GPT-4 independently scored a median of 92.1% per case.
Outperformed the control group by 15.5 percentage points (95% CI 1.5 to 29.5; p=0.03), a statistically significant difference.
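For readers curious what a result like "a difference of X percentage points with a 95% confidence interval" actually involves, here is a minimal sketch in Python that compares median per-case scores between two groups with a percentile bootstrap. The score arrays are hypothetical stand-ins, not the study's data, and the study's own analysis used an adjusted statistical model rather than this simple resampling, so treat it purely as an illustration of the kind of calculation behind these numbers.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-case diagnostic reasoning scores (percent) -- NOT the study's data.
gpt4_assisted = np.array([80, 72, 78, 76, 81, 70, 77, 75, 79, 74], dtype=float)
control = np.array([75, 70, 74, 73, 76, 68, 72, 74, 71, 77], dtype=float)

def median_diff(a, b):
    # Difference in median scores (a minus b), in percentage points.
    return np.median(a) - np.median(b)

observed = median_diff(gpt4_assisted, control)

# Percentile bootstrap: resample each group with replacement and
# recompute the median difference many times.
boot = [
    median_diff(
        rng.choice(gpt4_assisted, size=gpt4_assisted.size, replace=True),
        rng.choice(control, size=control.size, replace=True),
    )
    for _ in range(10_000)
]
lo, hi = np.percentile(boot, [2.5, 97.5])

print(f"Observed median difference: {observed:.1f} points")
print(f"95% bootstrap CI: ({lo:.1f}, {hi:.1f})")

If the resulting interval straddles zero, as the reported interval did for the assisted-versus-control comparison, the difference is not statistically significant at the conventional 5% level; the GPT-4-alone comparison's interval excluded zero, which is why it was reported as significant.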
Discussion
The study reveals that while GPT-4 achieved high diagnostic accuracy on its own, physicians did not significantly improve their performance when using it as an aid. Several factors might contribute to this outcome:
Integration Challenges: Physicians may find it difficult to effectively incorporate AI suggestions into their diagnostic process, especially under time constraints.
Trust and Skepticism: There may be reluctance to rely on AI recommendations due to concerns about accuracy or unfamiliarity with the technology.
Workflow Disruption: Using AI tools might not seamlessly fit into existing clinical workflows, potentially hindering their effective use.
Implications
Enhancing Physician-AI Collaboration:
Training and Education: Physicians may benefit from training on how to interact with AI tools effectively.
Interface Improvements: Optimizing AI interfaces to align with clinical workflows could facilitate better integration.
Maximizing AI Potential:
Diagnostic Accuracy: AI models like GPT-4 can serve as valuable tools to improve diagnostic accuracy and patient outcomes.
Efficiency Gains: Potential reductions in time per case could alleviate physician workload.
Addressing Barriers:
Building Trust: Demonstrating AI reliability through continued research and real-world applications can increase physician confidence.
Policy and Guidelines: Establishing clear guidelines for AI use in clinical settings can provide a framework for safe and effective integration.
Conclusion
The study underscores the promising capabilities of AI in medical diagnostics but highlights the need for strategies to improve physician-AI collaboration. As AI technologies advance, fostering effective partnerships between physicians and AI tools will be essential to enhance patient care and outcomes.
Final Thoughts
Think about your own line of work, the place where you spend your days. How many people just go through the motions, doing exactly what they’ve been taught, versus those who push boundaries and sometimes take the heat for it? What’s the split—50/50? More like 10 to 90, right? Especially in a profession like medicine, where following orders and checklists is the norm. The pandemic made some of us wonder: maybe it’s time to let AI take the reins instead.
For all the reasons others are sharing in comments, as well as what I consider to be the most important reason, NO to AI for health decisions!! No! No! No!
All that would do is lock in allopathic medicine, aka "evidence-based medicine" (EBM), as the one and only officially and legally recognized system of health that exists, to the exclusion of other systems of health that are FAR superior, with less risk to their patients. Like Ayurveda, a system of health practiced primarily in India for thousands of years, whose practitioners look down their noses (rightfully) at allopathy as dangerous and ineffective. And Traditional Chinese Medicine practitioners. And homeopaths - practitioners of what was the dominant health system in the US until Rockefeller and his Bernays/Flexner associates made petrochemical-based "medicine" the only one permitted here - even while Rockefeller himself kept his own personal homeopathic doctor at his bedside until his last breath in his mansion.
Naturopathy, herbalism, and all of the holistic health systems are much gentler, safer, and more effective than allopathy at healing pretty much any condition a person experiences, because they focus on healing the person, not treating the condition. Iatrocide, death-by-doctor at the hands of allopathic physicians, is the second leading cause of death in this nation. A system in which AI sorts through only allopathic research, studies, reports, and academia (because that system is the only one studied and reduced to the 1s and 0s an AI can draw from) will be fundamentally flawed from inception. Granting it unimpeachable status that law and authorities would defer to would be a catastrophe for humanity.
AI cannot capture art. Healing is an art. And healing is by its very nature driven by nature, not man. Energies and systems of the earth, frequencies, elements of our Creator and creation are beyond our ability to comprehend, even though we imagine we have the power to dominate and control nature. Nature defies man's control; allopathy and medical "science" are the product of the dangerous hubris of man, who imagines otherwise. AI is a tool of man's invention that serves to validate man's systems and ideas of power over nature, when reality is the other way around.
Our collapsing and repugnant state of medicine is the inevitable result of the *system* of medicine itself. Allopathy fails at its most fundamental point, the premise it's based on: that man controls nature, with poisons and butchery. The only area of health where it is superior involves that butchery process: cutting and slicing in order to sew up the rips, breaks, and tears that happen when car accidents, stabbings, and bullets tear our flesh and organs and break our bones. That's it. That's all allopathy does well. If AI can be used to study the most effective ways to heal those injuries, then have at it. For the rest of our health, keep those damn death machines away from us!
Justin, as someone who has spent extensive time in medical AI since Shortliffe, I'd say this paper and your conversation miss the point. For pre-canned situations like these, any generative AI will look at 1,000,000 priors and come up with an average answer that will likely be close to the truth 88% of the time, which is consonant with what this study shows.
This has been going on for a long time. From AI round one (Shortliffe/Feigenbaum, in the late 1970s) to round two (Weed) to later rounds (I was part of the initial Watson testing which was round six), the "AI will save medicine" cohort has been trying and failing. Now we are in round nine. We will have the same end result.
The generative/LLM/Deep Learning engines of today are essentially the same as the first engines -- there are larger training sets and far more iterations, but exactly the same limitations. Generative AI uses neural networks to compute CORRELATIONS -- NOT INTELLIGENCE. There is no intelligence in these tools whatsoever. Because there is no way for these engines to ascertain "truth," all outputs are a probabilistic stab based on word/data association numbers and how others wrote about similar situations. There is no ability to trace logic or to back-trace. This is foundational to the technology. The results may be impressive and are designed to LOOK like intelligence. But they are not. Examples are legion. Black swans, hallucinations, and other known issues with this technology will persist. RAG and other tools may limit some of the edge conditions -- but the issues cannot be eliminated.
This leads to the most important observation about any generative AI -- IT IS UNSAFE TO ASK ANY QUESTION TO WHICH YOU DO NOT ALREADY KNOW THE ANSWER. Anyone who has used these tools has discovered that they can ask a question, recognize when the answer comes back that it is wrong, re-ask the question, and sometimes then get the right answer. But if you are not smart or informed enough to recognize that an answer is wrong, you may proceed with the wrong information with, in the case of health care, potentially fatal results. As Weed discovered, this is foundationally why these tools are of relatively little use to practitioners who actually know things. Either you know the answer already, and dealing with the AI's mix of right and wrong is just a waste of time, or you know you do not know enough to discern the right answer, so you refer the patient. Nothing has changed in these measurements for 30 years.
But these pivotal points about generative AI entirely miss the two key points about medical AI, which explain why it has never succeeded and will not succeed as most envision. First, every patient is their own science experiment (as I always tell my patients). The fact that a bunch of people with similar symptoms reacted in a particular way to a particular diagnosis or treatment is contributory to one's analysis, but may or may not have anything to do with the patient in front of you. This is a truism in health care, which is why there are doctors and nurses. (Early in my career I published one of the earliest computer simulation pieces on open heart surgery showing that using an "average patient" approach killed 10% of all open-heart patients. Only a per-patient approach could address this, and this approach is now uniformly used worldwide.)
Second, something that succeeds at the 90% level (harking back to the aforementioned open heart situation) is permanently unacceptable in health care. Repeated studies have shown that this is about the maximum (even forgetting the individual variability issues) performance of any generative AI. Recently an AI company that claimed it could do better with medical records had to sign a consent letter with the Texas DA for fraud -- it cannot do better.
So combining the fact that the answer must be known to you before you ask the question (which means the asking is generally a waste of time), the 90% top-out on correctness, and the fact that the individualization of medical care is the foundation of all care, this is just not a tool to be used this way. It might be great for appointment scheduling or for further frustrating patients looking for help -- but making "AI doctors" is not coming via this mechanic.
P.S. There are other variations of things called AI (like Cognitive AI) that might have a chance of making headway since they have a "truth" anchor -- but these are not part of the generative AI conversation. If you are interested in the AI space and its limitations, the following article is illuminating: https://towardsdatascience.com/the-rise-of-cognitive-ai-a29d2b724ccc