The MRI Second Opinion: What Happens When AI Outgrows Its Label

When a developer named Antoine published a post called "I used Claude Code to get a second opinion on my MRI," Hacker News responded with 406 upvotes and 528 comments. That's not the reaction you'd expect for a medical curiosity piece. It's the reaction you get when something true and slightly uncomfortable gets said out loud.

The story isn't really about MRI analysis. It's about the growing chasm between what AI tools are marketed as and what frontier models can actually do. Claude Code is a developer tool — a coding assistant built to write Python, scaffold repositories, and debug APIs. It is emphatically not a radiology platform. But when Antoine fed it MRI results, Opus delivered. And that fact has implications running far beyond one person's health scare.

For small teams building SaaS products, health-adjacent tools, or any kind of AI-assisted workflow, what this post reveals isn't a curiosity. It's a strategic signal. Frontier models have crossed a capability threshold that marketing copy hasn't caught up with. The question now is how to think clearly about that gap — and whether to move on it.

What Is This Actually?

Claude Code is Anthropic's agentic coding assistant, positioned primarily as a developer productivity tool. It operates as a CLI-based agent with direct access to the local filesystem, terminal commands, web browsing, and multi-modal input including images. Developers use it to write and refactor code, navigate large codebases, and run multi-step technical tasks autonomously with minimal handholding.

The Opus model powering it is Anthropic's deepest reasoning model. It isn't a specialist — it's a generalist trained on a vast corpus of text and image data that includes scientific literature, clinical guidelines, radiology reports, and a meaningful volume of medical documentation. When prompted carefully, it can reason across medical images and accompanying text with a level of nuance that surprises people who assume AI models are bounded by their product category.

What Antoine did was use Claude Code's file-access capabilities to pass MRI scan images and associated reports directly to the model, then prompt it to analyze the findings in a structured way. Claude Code's agentic design meant it could systematically work through multiple files, cross-reference findings, and produce a coherent analysis rather than answering a single isolated question in a chat window. The filesystem access mattered — it turned a one-shot question into a systematic workflow.

The result, according to the post, was detailed enough to be genuinely useful. The model identified findings, placed them in clinical context, flagged what warranted attention, and communicated uncertainty in a calibrated way. Antoine framed the output explicitly as a "second opinion" — supplementary to actual medical advice, not a substitute for it.

That framing matters enormously. This story isn't "AI replaces radiologist." It's "person with access to a frontier AI model got a better understanding of their own health situation than they would have otherwise." Those are different claims, and the second one is both accurate and hard to argue against.

The technical underpinning is multimodal reasoning. Opus can process images alongside text, meaning MRI scan images — or photographs of printed scan results — can be analyzed in context with textual radiology reports, symptom descriptions, and follow-up questions. This capability appeared in usable form with GPT-4V in late 2023, but what's changed is quality. Today's frontier models can reason carefully, acknowledge uncertainty, and produce output that hedges appropriately rather than projecting false confidence.

The Hacker News thread made one thing clear: Antoine wasn't alone. Dozens of commenters described doing essentially the same thing independently — feeding lab results, X-rays, pathology reports, and discharge summaries to Claude, GPT-4o, or Gemini and getting back analyses they found more useful than what their healthcare providers had time to explain. The post just named the experience publicly, which is why it hit a nerve.

Why This Matters Right Now

Twelve months ago, the medical AI conversation was dominated by narrow specialist tools: AI trained specifically on chest X-rays to detect pulmonary nodules, or systems fine-tuned for diabetic retinopathy screening. These were impressive within constrained domains but brittle outside them. The implicit assumption was that general-purpose AI couldn't match what specialist medical AI could do.

That assumption is now broken. Opus 4, GPT-4o, and Gemini 1.5 Pro are general-purpose models that can reason across medical domains with enough accuracy to be meaningfully useful for health-literate individuals. They're not replacing radiologists. They're doing something subtler — filling the interpretation gap between "you have a report" and "you understand what the report actually means."

That gap is substantial, and it's a real structural problem. Healthcare systems globally are under pressure. Wait times for specialist consultations are measured in weeks or months. Primary care physicians average 15–20 minutes per appointment. Patients routinely leave with reports they can't interpret, diagnoses they don't fully understand, and follow-up questions that won't get answered until the next visit. AI doesn't fix the system. But for someone who already has a report in hand, it can fill in the comprehension gap.

The timing matters for another reason: data portability is improving. FHIR standards, patient portal apps, and wearable health data are putting more medical records directly in patients' hands. When people have the raw data, the next question is always "what does this mean?" That question used to have exactly one answer: wait and ask your doctor. Now it has two.

Regulatory frameworks haven't caught up with this shift. The FDA has cleared over 700 AI/ML-based software devices, but those are purpose-built medical AI systems, not general-purpose foundation models used off-label by individuals interpreting their own scans. The legal and liability landscape for someone using Claude Code to interpret their own MRI is genuinely murky — which creates both risk and opportunity depending on where a product sits in that landscape.

There's also the capability trajectory to consider. Models are improving fast enough that today's "impressive but limited" benchmark looks different in 12 months. Small teams building health AI tools need to plan for a world where the underlying AI layer is materially stronger than it is today. Building now, with appropriate humility about current limitations, is still a viable path — provided you design for a stronger floor rather than trying to compensate for a weak one.

Practical Implications for Small Teams

The MRI story opens up several distinct angles for small teams, freelancers, and agencies working at the intersection of AI and health information. Not all of them require building a medical product.

The patient-facing interpretation layer. The largest unmet need this story surfaces is the gap between having a medical document and understanding it. There are almost no good consumer products that help patients interpret their own health data in plain language. A team that built a focused product for helping people understand lab results, discharge summaries, or imaging reports — not diagnose, just comprehend — would be addressing genuine and growing demand. The AI capability is already there. The real product work is intake flows, document parsing, appropriate framing, and UX — not model development.

Clinical documentation for small practices. Solo practitioners and independent clinics are systematically underserved by enterprise health IT vendors like Epic and Cerner, whose minimum contract sizes effectively exclude them. A freelancer or small agency that builds AI-assisted clinical note summarization, referral letter drafting, or patient letter automation for small practices is going after a market with real pain and real budgets. The MRI story demonstrates that Opus-class models can reason in clinical language accurately enough to be useful in production — which materially lowers the capability risk for tools like this.

Health content at different economics. Agencies producing health content for publishers, insurance companies, or wellness brands are doing work that is labor-intensive and expertise-dependent. Frontier AI can draft, research, and structure health content at a pace that fundamentally changes the economics — but only if the team understands the model's failure modes well enough to build an appropriate QA layer around it. The valuable person in this workflow isn't the one who uses the AI. It's the one who can tell when the AI is wrong.

Second-opinion infrastructure for underserved markets. In markets where specialist access is genuinely scarce — much of Southeast Asia, sub-Saharan Africa, and rural areas everywhere — the "AI second opinion" use case isn't a curiosity, it's a public health lever. A team building for these markets faces a different ethical and regulatory calculus than a US-based startup does, because the counterfactual isn't "see a specialist" — it's "get no specialist input at all." That shifts the risk-benefit math considerably.

One additional angle that the Antoine story highlights for technical founders: Claude Code's agentic, filesystem-aware architecture lets developers prototype sophisticated analysis pipelines without standing up full ML infrastructure. A developer could write a script that accepts DICOM files from a local directory, preprocesses them, and passes them to Claude via API with a structured prompt — outputting findings to a JSON template or formatted report. This is a prototype, not a production system, but the time to a working proof of concept is now measured in hours. That's a meaningful change from where things stood 18 months ago.

How to Respond and Act on This

The right response for a small team depends heavily on where they sit relative to this story. Here's a practical framework for thinking it through before writing any code.

Map your regulatory exposure first. If you're building something that helps patients interpret their own records — rather than providing clinical diagnosis — you may be operating outside FDA medical device classification in the United States. But that requires legal clarity before you ship anything customer-facing. The distinction between "health information" and "medical advice" is real and regulatorily meaningful. A product that explains what an HbA1c number means sits in a different category than a product that tells someone they have Type 2 diabetes.

Prototype with Claude Code before building with the API. If you want to understand what frontier AI can actually do with medical documents, run your own structured tests. Upload anonymized sample reports — never use real patient data without proper consent frameworks in place — and evaluate the outputs systematically. Build a rubric: Is it accurate? Does it appropriately hedge uncertainty? Does it say "I don't know" when it doesn't? Does it flag urgent findings without over-alarming on benign ones? The answers will tell you whether the AI layer is ready for your use case and what guardrails you need to build around it.

Build the product wrapper, not the model. No small team should be attempting to fine-tune their own medical AI. The frontier models from Anthropic, OpenAI, and Google are already capable enough, and they're improving faster than any small team can independently match. The durable value a small team can add is in the product layer: intake flows, document parsing, structured output formatting, appropriate disclaimers, escalation logic, and user experience design. That's where engineering time belongs.

Take data privacy seriously from day one, not as an afterthought. Medical data is among the most sensitive personal information in existence. HIPAA compliance in the US, GDPR in Europe, or equivalent frameworks elsewhere are foundational constraints, not optional considerations. Anthropic's Business and Enterprise tiers offer BAAs for HIPAA compliance. If your product handles actual patient data, you need to be operating under a BAA with your AI provider — not under standard consumer terms of service.

Consider a two-model architecture. For health information products, using a fast lower-cost model like Claude Haiku or GPT-4o Mini for initial document triage and classification — then routing complex or high-stakes analyses to a frontier model like Opus or GPT-4o — keeps costs manageable while applying maximum reasoning capability where it matters most. This isn't just a cost optimization; it's a sensible risk architecture.

Document your accuracy testing before you ship. Whatever you build, run it against a set of known-good examples and document the error profile systematically. This protects you legally and reputationally if something goes wrong. Health AI that ships without systematic accuracy validation is a liability waiting to surface.

Comparison: AI Models for Medical Document Analysis

Tool	Best for	Free plan	Starting price	Key differentiator
Claude Opus 4 (Anthropic)	Complex multi-document reasoning, long clinical reports	No	~$20/mo (Pro); API usage-based	Deepest reasoning; HIPAA BAA on Enterprise
GPT-4o (OpenAI)	Fast vision + text analysis; broad ecosystem	Yes (limited)	~$20/mo (Plus)	Strongest third-party integrations; BAA on Enterprise
Gemini 1.5 Pro (Google)	Very long-context document sets (up to 1M tokens)	Yes (limited)	~$22/mo (AI Premium)	Best for large document volumes; Google Workspace native
Claude Haiku (Anthropic)	High-volume triage at low cost	No	~$0.80/M input tokens (API)	Fastest Anthropic model; good for preprocessing
Rad AI (dedicated)	Radiology workflow automation for clinical settings	No	Enterprise	FDA-cleared; purpose-built for radiology workflows

The key distinction between general-purpose models and specialist tools like Rad AI is regulatory status. General-purpose AI used for medical interpretation is off-label and uncleared. Purpose-built tools carry FDA clearance and the liability framework that accompanies it. For small teams, the general-purpose route is faster and cheaper to start with; the specialist route becomes necessary for anything entering formal clinical workflows.

What the HN Community Is Saying

The Hacker News thread on this post is rich and worth analyzing as a temperature check on where informed technical opinion actually sits.

The optimists — a sizable portion of the thread — are largely people who have done exactly what Antoine did, and found it valuable. The recurring theme isn't "AI is better than doctors." It's "AI gave me enough context to have a better conversation with my doctor." Multiple commenters describe using Claude or GPT-4o to pre-interpret lab results before appointments, arriving with specific questions rather than diffuse anxiety. The utility isn't diagnostic — it's navigational.

The skeptics raise legitimate concerns. Hallucination risk in a medical context is qualitatively different from hallucination risk in a coding context. If Opus confidently misidentifies a benign finding as alarming (or vice versa), the consequences aren't a broken build. The counter-argument, also present in the thread, is that patients who already have a report are going to interpret it somehow — through Google, through anxious forum-reading, or by sitting with vague worry until the next appointment. Compared to those alternatives, a well-calibrated frontier AI with appropriate uncertainty language may actually represent an improvement.

Practitioners — people identifying as physicians, radiologists, or clinical researchers in the thread — are split. Some are dismissive, noting that a model doesn't know what it can't see, whereas an experienced radiologist carries years of pattern recognition that doesn't live in any report. Others are genuinely curious and note that the outputs Antoine described sound appropriately hedged and plausible. A few point out that this is structurally similar to what AI scribes and ambient documentation tools already do in clinical settings — just applied by the patient rather than the provider.

The privacy discussion is pointed. Several commenters flag that uploading actual medical images to a commercial AI service raises real questions about data handling and what Anthropic does with uploaded imagery under standard consumer terms. The consensus is murkier than most users assume, and anyone doing this should at minimum be on a paid subscription with explicit data commitments — ideally an enterprise plan with a BAA in place.

The meta-thread — "what does it mean that we're using a coding tool as a medical resource?" — surfaces repeatedly. The sharpest observation: the tool category label is becoming increasingly irrelevant. What matters is model capability. Claude Code is Claude Opus with filesystem access. Of course it can analyze an MRI.

Risks and Things to Watch

The capability story here is real. So are the risks, and they deserve equal attention.

Calibration without clinical context. Frontier models are good at reasoning from what they're given. They're less good at knowing what they can't see. A radiologist reading an MRI has seen thousands of scans; they know what normal variation looks like across age groups, they factor in clinical history that may not appear in the report, and they know which findings are incidental and which require urgent action. A language model reasoning over scan images and report text doesn't carry that embedded clinical judgment. It can produce plausible-sounding analysis that is technically accurate about visible findings while missing clinical significance entirely.

The confidence calibration trap. Frontier models have gotten better at expressing uncertainty, but calibration isn't perfect. In a domain where "this is probably nothing" can have real consequences if wrong, imperfect calibration matters more than it does in most other contexts. Health-literate users who know how to weight AI output appropriately are a different population from users who take any confident-sounding statement at face value. Product design has to account for the latter group.

Data privacy and consent. When a patient uploads their own MRI to a commercial AI service, HIPAA doesn't cleanly apply — the regulation is designed to govern covered entities, not individuals managing their own records. But the AI provider's commercial data policies still govern what happens to that imagery. Standard consumer accounts with some providers have historically allowed use of conversations for model training. Medical imagery in a training pipeline is a meaningful risk that most users aren't considering when they paste their scan into a chat window.

Regulatory tightening is likely, not hypothetical. The current environment around AI medical interpretation is permissive largely through inertia and speed of change. The EU AI Act classifies medical AI as high-risk. FDA guidance on AI-enabled software devices is actively evolving and trending toward tighter classification. Small teams building health AI tools should design for a rising compliance bar rather than assuming the current gray zone persists. Building compliant from the start is cheaper than retrofitting.

The liability gap. If someone acts on AI-generated medical interpretation and comes to harm, the liability question is unresolved. Anthropic's terms explicitly disclaim medical advice. The user's physician likely wasn't involved. This isn't a reason not to use these tools, but it is a reason to be explicit about what your product is and isn't — and to make that explicit in the product itself, not buried in a terms-of-service page.

Frequently Asked Questions

Is using Claude to analyze an MRI actually safe? The safety question needs to be framed relative to alternatives. Using Claude to better understand a report you already have — while continuing to rely on actual medical providers for clinical decisions — is probably safer than the typical alternative: anxious late-night Googling of symptoms and medical jargon. The risk profile changes meaningfully if someone uses AI interpretation to avoid seeking care they actually need. That's a misuse case that product designers in this space should actively work to prevent, not assume away.

Can frontier AI models actually read MRI images accurately? Models like Opus 4 and GPT-4o can process images and reason over them, but they're not reading raw DICOM data the way radiology software does. Most practical use cases involve photographs or screenshots of scan images, or textual radiology reports. The model's reasoning is language-model reasoning applied to visual input — it reasons about what it sees, not performing pixel-level segmentation or precise measurement. Useful for interpretation and contextualization; less useful for technical precision tasks a clinical system would perform.

Does Claude Code specifically offer advantages over just using Claude.ai for this? Claude Code's advantage is its filesystem access and agentic workflow capabilities. If you have many files — multiple scan series, a long radiology report, historical comparison scans — Claude Code can systematically work through all of them in a single analysis session, cross-reference findings, and write structured output to files. For a single document or image, Claude.ai's interface is simpler and produces comparable results. The Code version matters when the workflow is multi-file, multi-step, and benefits from systematic organization.

What's the actual difference between using Claude for this versus asking a doctor friend informally? Asking a physician friend is genuinely different — they have clinical judgment, can ask follow-up questions, factor in context you haven't mentioned, and bear some professional accountability for their input. Claude has broader literature coverage, no time pressure, availability at 3am, and infinite patience for follow-up questions. Neither is a substitute for a formal consultation with your own physician who has your complete medical history. They serve different purposes; the AI's primary advantage is accessibility, not expertise depth.

If I want to build a product in this space, where do I actually start? Start with the legal and ethical framework before writing any code. Understand what "health information" versus "medical advice" means in your jurisdiction. Read FDA guidance on software as a medical device. Understand HIPAA's scope and whether your use case falls inside or outside it. Then prototype with anonymized data to evaluate what the AI can actually do, and build your product layer — intake, disclaimers, UX, escalation logic — around a realistic assessment of both model capability and model failure modes.

Are there existing products doing patient-side health interpretation well? The space is genuinely underdeveloped relative to the opportunity. Companies like Abridge and Nabla are focused on clinical documentation for providers, not patient interpretation. The patient-side layer — helping people understand their own health data — is thinner. Most existing tools are either too narrow (single-test interpretation), too expensive (direct primary care apps that charge concierge fees), or too cautious about liability to provide meaningful interpretation at all. The gap is real and is part of why the Antoine post resonated so strongly.

How do I handle HIPAA if I'm building a health AI product? If your product handles Protected Health Information as a service provider to covered entities — hospitals, clinics, insurers — you need a Business Associate Agreement with your AI provider. Both Anthropic and OpenAI offer BAAs on their Enterprise plans. If your product is consumer-facing, with patients using it for their own data, the HIPAA framework is different and you need specific legal advice about your configuration. HIPAA-adjacent state laws like California's Confidentiality of Medical Information Act may apply even when federal HIPAA doesn't — and those are often stricter.

What happens when these models get materially better? The trajectory is clear and one-directional. Future models trained with more medical data, potentially with expert clinical feedback loops, and with access to real-time literature will be measurably stronger than current ones. Products built on this capability layer should be designed for that stronger floor — meaning the value you add should live in the product and workflow layer, not in compensating for model limitations that will diminish over time. Bet on the model getting better. Build what the model can't do for itself.

Final Verdict

The MRI second opinion story matters because it exposes a category error that a lot of people in tech are still making. The assumption is that AI capability is bounded by product category — that a coding assistant can code, a writing assistant can write, and medical AI belongs to medical AI companies. That assumption was always somewhat thin. In mid-2026, it's functionally false.

Opus 4 and its peers are reasoning engines. They reason over whatever input you provide, with whatever context you supply, and they produce output calibrated to that context. When the input is an MRI and the context is a request for clinical interpretation, they produce clinical interpretation. Not perfectly. Not with the embedded judgment of a specialist who has spent 20 years looking at scans. But well enough to be genuinely useful to a health-literate individual who treats the output as one data point among several, not as a final word.

For small teams, the takeaway depends on where you sit. If you're building SaaS tools that could incorporate health information interpretation — even peripherally — this story signals that the capability barrier is lower than you likely assumed. The work isn't making the AI smart enough. It's building the product wrapper: intake flows, data privacy architecture, output framing, escalation logic. That's product work, not AI research, and it's within reach of a small team with domain knowledge and good judgment.

If you work in health content, clinical documentation support, or patient-facing information services, the window to build meaningful tools with frontier AI is open now. The regulatory environment is uncertain but not closed. The AI capability is real but imperfect. The market need is unambiguous and growing. Teams that move thoughtfully — taking data privacy seriously from day one, building appropriate guardrails, testing accuracy systematically, and communicating limitations clearly to users — have a genuine opportunity ahead of better-resourced but slower-moving incumbents.

If you're a healthcare provider or work with small practices, the Antoine story should prompt a different question: what are your patients learning about their own health from AI systems you have no visibility into? The patients who interpret their MRI before their appointment are going to arrive with specific questions. Building AI-assisted patient communication tools that give people better-framed information before they show up — rather than leaving that gap for general-purpose AI to fill unguided — is a competitive differentiator available to small practices right now.

The people who should wait are those tempted to build diagnostic-adjacent tools without the legal groundwork in place. The gap between "health information tool" and "medical device" is real, and regulatory movement is toward tighter classification, not looser. Design for where that line is going, not just where it currently sits. The teams that treat compliance as architecture rather than afterthought are the ones who will still be operating two years from now when the regulatory environment firms up.

What Antoine actually demonstrated isn't that AI is ready to be your physician. It's that AI is ready to be the health-literate friend who actually read the report — the one who explains what findings mean, flags what seems worth raising with your doctor, and helps you arrive at your next appointment prepared rather than confused. That's genuinely valuable. Building products that deliver it reliably and responsibly is the actual opportunity the story is pointing at.