AI Governance in Insurance: What Regulators Are Actually Asking

Jun 22, 2026 9:00:00 AM

• Brandon Robinson •

TL;DR

Accuracy isn't on the exam. Every framework that has emerged — the NAIC Model Bulletin, Colorado's Regulation 10-1-1, NY DFS guidance, and the EU AI Act — converges on three demands: a model inventory, a decision audit trail, and human accountability at the decision point. None of them asks how accurate your model is.
Regulators are applying old law to new tools. As NY DFS put it in December 2025, many of the laws it enforces are "technology-agnostic." The unfair-discrimination and consumer-protection statutes already on the books were written assuming a human was accountable for the decision.
The 2026 calendar is real. 24–25 states plus DC have adopted the NAIC bulletin, a 12-state evaluation pilot runs March–September 2026, and Colorado's insurance evidence-of-compliance deadline lands July 1, 2026.
Agentic AI is the quiet compliance risk. Only one in five companies has a mature governance model for autonomous agents. Systems that take action without a human in the loop break the accountability link the entire regulatory framework presumes.
Human-in-the-loop is the architecture that survives an exam — because the reviewer's action is the audit trail, and the override authority is the accountability record.

Every vendor pitching insurance AI right now leads with the same slide: an accuracy number. 97%. 99.2%. Some impressive percentage with a confident decimal point.

Here's what regulators don't ask about: the accuracy number.

The NAIC, Colorado, New York's Department of Financial Services, and the EU AI Act have all converged on the same short list of demands — and "how accurate is your model" isn't on it. If you are a Chief Compliance Officer, General Counsel, or Head of Underwriting Ops sponsoring an AI project, the gap between what your vendor is selling and what an examiner will ask for is the most expensive thing on your roadmap that nobody has priced in yet.

The Accuracy Trap

Accuracy is the easiest thing to measure and the easiest thing to sell. It fits on a slide, sounds like progress, and lets a buyer feel like due diligence is done. The problem is that it answers a question no regulator is asking.

A model can be 99% accurate and still produce a disparate impact on a protected class. It can be 99% accurate and leave no record of why it reached a decision. It can be 99% accurate and take an action no authorized human ever reviewed. None of those failures show up in the accuracy number — and every one of them is exactly what an examiner is trained to find.

This is the same disconnect we see when carriers evaluate intake tools on extraction accuracy alone. We've argued before that insurance-specific AI with HITL outperforms generic models precisely because the score on a benchmark is not the thing that determines whether the system holds up in production — or in an exam.

What Regulators Are Actually Asking — The Three Convergent Demands

Read the NAIC Model Bulletin, Colorado's Regulation 10-1-1, New York's circular-letter guidance, and the EU AI Act side by side and a pattern emerges. They use different words, but they ask for the same three things.

Inventory. What AI systems do you use, for which decisions, on which populations? You cannot govern what you cannot list.
Audit trail. For any consumer outcome influenced by AI, can you reconstruct the inputs, the decision, and the reasoning behind it?
Human accountability. Who was authorized to override, escalate, or finalize that decision — and what did they actually do?

Notice what's absent. There's no requirement to hit a model-performance threshold. The frameworks don't certify your AI as "good enough." They ask whether you can explain and account for what it did. That distinction is the whole game.

"Many of the laws that DFS enforces are technology-agnostic, meaning the core regulatory obligations are the same for manual processes as they are for AI models and systems."
— Acting Superintendent Kaitlin Asrow, NY Department of Financial Services, statement to the NY State Assembly, December 16, 2025

Translation: regulators are not, for the most part, writing brand-new rules for AI. They are applying the existing body of unfair-discrimination, model-risk, and consumer-protection law to AI outputs (NY DFS, December 2025). And that body of law was written assuming a human was accountable at the decision point. When you remove the human, you don't escape the obligation — you just lose the evidence that you met it.

The 2026 Regulatory Calendar — What's Live, What's Imminent

If this still feels theoretical, look at the dates. The regulatory tide stopped being a forecast and became a calendar:

NAIC Model Bulletin: 24–25 states plus DC have adopted it or substantially similar guidance as of early 2026 (NAIC). For most carriers, this is already the operating standard in your largest markets.
NAIC AI Systems Evaluation Tool: a 12-state pilot runs March–September 2026. This is the examiner's questionnaire — the structured tool regulators will use to review AI governance during market-conduct exams. Its results are expected to shape the long-term framework adopted at the NAIC fall meeting.
Colorado Regulation 10-1-1: evidence of compliance is due July 1, 2026 for private-passenger auto and health insurers, after the amended rule took effect October 15, 2025 (Colorado DOI). This is the most concrete near-term deadline most carriers face.
EU AI Act high-risk obligations: deferred to December 2, 2027. AI used for risk assessment and pricing in life and health insurance is classified high-risk under Annex III. The original August 2, 2026 deadline has moved: the EU's Digital Omnibus — provisionally agreed in May 2026 and endorsed by the European Parliament on June 16, 2026 — pushes stand-alone Annex III high-risk obligations to December 2, 2027, pending final formal adoption. The obligations themselves aren't going away, and the Act's transparency requirements still apply on the original August 2, 2026 date.
NAIC Third-Party Data and Models Working Group: a registration regime for outside data and model vendors is being drafted through late 2026 — the budget-cycle item that turns "is your vendor compliant?" into a filing requirement.

The two-budget-cycle problem

Some of this is enforced now (Colorado, the adopted NAIC bulletins). Some is roughly two budget cycles away (third-party model registration, the matured NAIC evaluation framework). The carriers that treat the imminent items as "later" are the ones who will be assembling an audit trail retroactively — which, as anyone who has tried it knows, is the hardest possible way to produce one.

Figure: Timeline of insurance AI regulatory deadlines, from the March 2026 NAIC pilot through the deferred December 2027 EU AI Act high-risk deadline.

What an Exam Actually Looks Like — The Five Categories of the NAIC Tool

The NAIC AI Systems Evaluation Tool organizes an examiner's review into five focus areas. Walk through them with one question in mind — what would I have to produce, on demand, to satisfy each?

Model inventory. A complete register of AI systems, what each one decides, and which populations it touches. The examiner asks for the list; you produce it — or you don't.
Data governance. Where the data comes from, how it's controlled, and whether it contains proxies for protected characteristics. You produce lineage and quality controls.
Bias testing. Documented, quantitative testing for disparate impact on protected classes. You produce the test results and the remediation record.
Performance monitoring. Evidence that you watch the system over time and catch drift. You produce monitoring logs.
Decision audit trails. For a sampled decision, the inputs, the output, the reasoning, and the human who authorized it. You produce the full reconstruction.
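
The last category lends itself to a quick self-test: sample some decisions and check whether each element can actually be produced. The sketch below is a minimal illustration under stated assumptions. The field names (`inputs`, `output`, `reasoning`, `authorized_by`), the function names, and the sample records are all hypothetical and are not drawn from the NAIC tool itself.

```python
# Hypothetical element names, mirroring the four things the text says an
# examiner wants for a sampled decision. Not taken from the NAIC tool.
REQUIRED_ELEMENTS = ("inputs", "output", "reasoning", "authorized_by")


def missing_elements(decision_record):
    """Return the required elements that are absent or empty in one record."""
    return [key for key in REQUIRED_ELEMENTS if not decision_record.get(key)]


def reconstruction_report(sampled_decisions):
    """Map each incomplete decision id to the elements it cannot produce."""
    report = {}
    for decision_id, record in sampled_decisions.items():
        gaps = missing_elements(record)
        if gaps:
            report[decision_id] = gaps
    return report


# Made-up sample: one fully reconstructable decision, one with gaps.
sample = {
    "D-001": {
        "inputs": {"document": "application.pdf"},
        "output": "refer",
        "reasoning": "prior losses above appetite threshold",
        "authorized_by": "underwriter_17",
    },
    "D-002": {
        "inputs": {"document": "loss_run.pdf"},
        "output": "approve",
        "reasoning": "",
        "authorized_by": None,
    },
}

print(reconstruction_report(sample))
```

A non-empty report is the useful signal here: it names the decisions that could not be fully reconstructed if an examiner sampled them.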

This is the place to be precise about the contrarian claim. Bias testing is on the examiner's list, so "regulators don't test your model" would be wrong. But notice what bias testing measures: the disparate impact of the output on protected classes, not the overall correctness of the model. A 99%-accurate model with a 6-point approval gap between demographic groups can still fail category three. Accuracy and fairness are different axes, and only one of them is on the exam.
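
The arithmetic behind that claim is easy to check. The sketch below uses made-up counts for two hypothetical groups, A and B (not data from any carrier or regulator), to show how the two numbers are computed independently. The function names are illustrative, and a raw approval-rate gap is only one simple measure; the testing method a given regulation requires may differ.

```python
from collections import defaultdict


def build_records():
    """Build hypothetical (group, model_approved, correct_decision) records."""
    # Each cell: (group, model_approved, correct_decision, count).
    cells = [
        ("A", True, True, 695), ("A", True, False, 5),
        ("A", False, False, 295), ("A", False, True, 5),
        ("B", True, True, 635), ("B", True, False, 5),
        ("B", False, False, 355), ("B", False, True, 5),
    ]
    records = []
    for group, predicted, actual, count in cells:
        records.extend([(group, predicted, actual)] * count)
    return records


def accuracy(records):
    """Share of records where the model's decision matches the correct one."""
    correct = sum(1 for _, predicted, actual in records if predicted == actual)
    return correct / len(records)


def approval_rates(records):
    """Share of each group that the model approved."""
    approved = defaultdict(int)
    total = defaultdict(int)
    for group, predicted, _ in records:
        total[group] += 1
        approved[group] += int(predicted)
    return {group: approved[group] / total[group] for group in total}


records = build_records()
rates = approval_rates(records)
gap = rates["A"] - rates["B"]

print(f"accuracy: {accuracy(records):.2%}")
print(f"approval rate A: {rates['A']:.0%}, B: {rates['B']:.0%}")
print(f"approval gap: {gap * 100:.1f} points")
```

With these counts the model is wrong on 20 of 2,000 decisions, so accuracy is 99%, while group A is approved 70% of the time and group B 64% of the time. Nothing in the accuracy figure reveals the gap.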

Here's the punchline that most governance checklists miss: four of the five categories collapse if you can't show humans were in the loop. Inventory needs an owner. Data governance needs an approver. Monitoring needs someone who acts on the alert. And the decision audit trail is, by definition, the record of who reviewed and authorized the outcome. Strip the human out of the workflow and you haven't just lost a control — you've lost the evidence the examiner came to see.

The $2.5M Lesson — How Enforcement Actually Lands

The closest analogue to what insurance carriers should expect didn't come from an insurance regulator at all. On July 10, 2025, the Massachusetts Attorney General announced a $2.5 million settlement with Earnest Operations, a student-loan lender, over allegations that its AI-driven underwriting produced a disparate impact on Black, Hispanic, and non-citizen applicants (Massachusetts AG).

The detail that matters for insurers: the AG didn't need a new AI statute to act. The case rested on existing consumer-protection and fair-lending law — the technology-agnostic body of law Asrow described. The findings read like the NAIC tool's failure modes: models untested for disparate impact, decisions applicants couldn't get explained, and no governance structure documenting who was accountable. The remedy required exactly what the frameworks ask for going forward — bias testing, written governance, and ongoing reporting.

Why this matters for carriers

If a state AG can reach a multimillion-dollar settlement over AI underwriting using laws that predate the technology, the absence of a finalized AI statute in your state is not protection. The exposure already exists. What's new is that examiners now have a structured tool to go looking for it.

Why Agentic AI Is Quietly the Highest-Risk Architecture for Compliance

This is the part most carriers haven't thought through. The hottest pitch in insurance AI right now is "agentic" — systems that don't just extract or score, but take autonomous action: routing, deciding, finalizing, with no human in the loop. The efficiency story is genuinely compelling. The compliance story is the opposite.

Every framework we've discussed presumes a human is accountable at the decision point. The EU AI Act is explicit that natural persons must be able to effectively oversee high-risk AI. An agentic system that acts without human review by design is, structurally, the thing that breaks that link. There is no reviewer whose action becomes the audit trail, because the design goal was to remove the reviewer.

And the market isn't ready for the burden. Deloitte's State of AI in the Enterprise 2026 found that only one in five companies has a mature governance model for autonomous AI agents (Deloitte, 2026), even as adoption climbs sharply. Carriers signing for agentic AI in 2026 are accepting a compliance liability that isn't written into the contract — and won't surface until the first exam or the first complaint.

This is not an argument against automation. It's an argument for being deliberate about where the human sits. The teams getting this right are the ones thinking carefully about how to staff a human-in-the-loop operation rather than designing the human out of it. BCG's framing on human oversight reaches the same conclusion from the strategy side: even an AI-first insurer needs the human firmly in the loop.

Why HITL Is the Compliance Architecture — Not Just a Feature

Here is the connection that nobody currently ranking for "AI governance insurance" makes explicitly. The three demands — inventory, audit trail, human accountability — aren't features you bolt onto a model. In a human-in-the-loop architecture, they are produced by construction.

The reviewer's action is the audit trail. When a person confirms, corrects, or escalates an AI-suggested value, that interaction is logged with the inputs, the suggestion, and the outcome. You don't have to reconstruct the decision later — the system recorded it as it happened.
The override authority is the accountability record. Routing by confidence threshold and document type means every decision has a defined owner. "Who was authorized to finalize this?" has a built-in answer.
The inventory writes itself. When AI suggestions flow through a controlled review workflow, you already know which systems touch which decisions for which populations — because that's how the work gets routed.
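
One way to picture "produced by construction" is a review step that cannot complete without writing its own record. The sketch below is a minimal illustration, not SortSpoke's implementation. Every name in it (`ROUTING_RULES`, `AuditEntry`, `record_review`, `inventory_from_log`), along with the roles and thresholds, is hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional, Tuple

# Hypothetical routing rules: the owning role and the confidence floor
# for each document type. Real values would come from your own policy.
ROUTING_RULES = {
    "application": {"owner_role": "underwriter", "min_confidence": 0.95},
    "loss_run": {"owner_role": "underwriting_assistant", "min_confidence": 0.90},
}
ALLOWED_ACTIONS = {"confirm", "correct", "escalate"}


@dataclass(frozen=True)
class AuditEntry:
    document_id: str
    document_type: str
    model_id: str
    suggested_value: str
    confidence: float
    owner_role: str            # role authorized to finalize this document type
    below_threshold: bool      # True when confidence is under the floor
    reviewer: str
    action: str                # "confirm", "correct", or "escalate"
    final_value: Optional[str]
    reviewed_at: str


def record_review(log: List[AuditEntry], *, document_id: str, document_type: str,
                  model_id: str, suggested_value: str, confidence: float,
                  reviewer: str, action: str,
                  corrected_value: Optional[str] = None) -> AuditEntry:
    """Apply a reviewer action and append the audit entry in the same step."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action}")
    if action == "correct" and corrected_value is None:
        raise ValueError("a correction needs a corrected_value")
    rule = ROUTING_RULES[document_type]
    final_value = {
        "confirm": suggested_value,
        "correct": corrected_value,
        "escalate": None,  # no final value until the escalation is resolved
    }[action]
    entry = AuditEntry(
        document_id=document_id, document_type=document_type, model_id=model_id,
        suggested_value=suggested_value, confidence=confidence,
        owner_role=rule["owner_role"],
        below_threshold=confidence < rule["min_confidence"],
        reviewer=reviewer, action=action, final_value=final_value,
        reviewed_at=datetime.now(timezone.utc).isoformat(),
    )
    log.append(entry)
    return entry


def inventory_from_log(log: List[AuditEntry]) -> List[Tuple[str, str]]:
    """List which model touched which document type, derived from the log."""
    return sorted({(e.model_id, e.document_type) for e in log})


audit_log: List[AuditEntry] = []
entry = record_review(
    audit_log, document_id="DOC-1", document_type="application",
    model_id="extractor-v1", suggested_value="Acme Ltd", confidence=0.91,
    reviewer="reviewer_42", action="correct", corrected_value="Acme Limited",
)
print(entry.owner_role, entry.below_threshold, entry.final_value)
print(inventory_from_log(audit_log))
```

The design choice worth noticing is that the reviewer's action and the log write happen in one function, so there is no path that finalizes a value without leaving a record. This sketch derives the inventory only by model and document type; covering populations would need additional fields.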

Figure: Diagram showing how a human-in-the-loop workflow produces a regulatory audit trail at each step.

This is why human-in-the-loop is better understood as a compliance architecture than a product feature. It's also the throughline behind why so many AI initiatives stall: when 78% of insurance AI pilots never scale, the blocker is rarely accuracy. It's that nobody trusts an output they can't see, explain, or stand behind. The same visibility that builds internal trust is what produces the regulatory record. SortSpoke keeps humans in the driver's seat for exactly this reason: extraction is automated, while the decision and the accountability stay with your team.

It's the same logic behind a defensible security posture. A SOC 2 Type 2 compliance posture matters for the same reason an audit trail matters to an examiner: it's continuous evidence that the controls actually operate — not a claim that they exist.

The Governance Gap — Why Most Carriers Aren't Ready

The uncomfortable data point: most carriers know there's a problem and still aren't positioned to pass. Grant Thornton's 2026 AI Impact Survey found that 44% of insurance executives say governance or compliance challenges have contributed to an AI project failing or underperforming — and only 24% are very confident they could pass an independent AI governance review within 90 days (Grant Thornton, April 2026).

The gap is usually not policy — most boards have adopted AI governance policies. The gap is evidence. The controls exist on paper, but the proof is fragmented across teams and tools, which is exactly what an examiner's structured questionnaire is built to expose. Closing that gap is a 12-month problem, not a 12-week one — it involves architecture decisions, data lineage, and workflow design you cannot sprint through the week before an exam. This is why it pays to pressure-test vendors early; our list of 9 questions before buying underwriting AI is built around the evidence a system can produce, not the demo it can perform. The same discipline applies upstream, where underwriting submission triage increasingly runs on AI and needs the same audit trail as everything downstream.

A Concrete Readiness Checklist

If you want a fast read on where you stand, work through these five questions. Each maps directly to something an examiner can ask for.

The 90-Day Readiness Test
Inventory: Can you produce a complete AI model inventory in 24 hours?
Audit trail: Can you trace any AI-influenced decision back to its inputs, its reasoning, and the human who authorized it?
Bias testing: Do you have documented, quantitative testing for disparate impact on protected classes?
Human accountability: Are escalation and override pathways defined by document type and confidence threshold?
Third-party models: Are your vendors positioned for the NAIC third-party model registration regime now being drafted?

If any of those answers is "not quickly" or "not for every decision," that's the work — and it's worth starting before the deadline picks the timeline for you.

Key Takeaways

Regulators audit explanation, not accuracy. Every framework converges on three demands — inventory, audit trail, and human accountability — and a model-performance score satisfies none of them.

The law is technology-agnostic. Existing unfair-discrimination and consumer-protection statutes already apply to AI outputs — as a $2.5M AI-underwriting settlement showed before any new AI law was on the books.

Human-in-the-loop is the architecture that produces the required evidence by default — and agentic AI, by removing the human, removes the proof. That's an architecture choice with a compliance price tag.

Regulators aren't asking whether your AI is accurate. They're asking whether you can explain how it reached a decision and prove a human had authority over it. That's an architecture problem, not a compliance-team problem. Book a 15-minute demo to see how SortSpoke's human-in-the-loop architecture produces the audit trail and human-accountability record by default, not as a bolt-on.
