Whitepaper · 2026-06-11 · 13 min read

Compliance-grade RAG: Retrieval systems a regulator can audit

Why retrieval demos collapse under inspection, and the five engineering disciplines that hold up when an auditor asks how a verdict was reached

At a glance
  • Retrieval reduces hallucination but does not remove it: Stanford's RegLab evaluation found that leading commercial RAG tools produced false or misgrounded answers on 17 to 33 percent of legal queries, against vendor claims of hallucination-free citations.
  • Grounding has a floor: on Vectara's hallucination leaderboard, the best-scoring model still adds unsupported content to 1.8 percent of summaries of a document it was handed, and the weakest evaluated model exceeds 24 percent. At compliance volumes, even the floor compounds into more than a hundred unreliable verdicts.
  • Hallucination is structural, not a bug awaiting a patch: OpenAI researchers show that standard training and evaluation reward a confident guess over an admitted uncertainty, so the fix must live in the system architecture, not the model release cycle.
  • The audit requirements are already written: EU AI Act Article 12 mandates automatic event logging over a high-risk system's lifetime, Article 14 mandates designed-in human oversight, and FDA's January 2025 draft guidance requires documented credibility evidence per context of use.
  • Five disciplines convert a probabilistic generator into an auditable system: citations pinned to source spans, versioned rule sets, string-verified evidence quotes, deterministic verdict records, and instrumented human review queues. We run all five in production for a European pharmaceutical regulator, at roughly 2 minutes per asset against 2 to 3 hours of manual review.

Retrieval-augmented generation looks finished in a demo and falls apart in an audit. Independent evaluations find that commercial retrieval-based systems still produce false or misgrounded answers on 17 to 33 percent of legal research queries, while the EU AI Act and FDA draft guidance now specify logging, oversight, and credibility evidence that consumer-grade architectures cannot produce. We set out the five engineering disciplines that make a retrieval system auditable, and we show how they run in production inside the compliance scanner we built for a European pharmaceutical regulator.

The demo works; the audit fails

Every head of regulatory affairs has seen the demo by now. A vendor connects a language model to a document store. Someone asks a question about a labeling rule. The system answers in seconds and shows a link to a source document. The room is impressed. The pilot gets approved.

Then the quality team asks the questions it asks of any system of record. Which version of the rule did the system apply? Which sentence in the source supports this verdict? If we run the same asset through next month, do we get the same answer? Who reviewed this output before anyone acted on it? The demo has no answers. It was never designed to have them.

That gap has a name. Consumer-grade RAG optimizes for a plausible answer. Compliance-grade RAG optimizes for a defensible record. The two differ in architecture, not in polish, and the difference cannot be retrofitted after the pilot succeeds.

Consumer-grade RAG produces answers. Compliance-grade RAG produces evidence. An answer with a link is an opinion. An answer with a versioned rule, a verbatim quote, and a span reference is evidence.

We have built both kinds. QueryNow has shipped more than 200 production deployments since 2014, and the lessons in this paper come from one of them: a retrieval system we built for a European pharmaceutical regulator that needed verdicts it could stand behind in an inspection. The published research explains why the changes we made were not optional.

Retrieval narrows the hallucination problem; it does not close it

The standard defense of RAG is that grounding a model in retrieved documents eliminates hallucination. The empirical record says it narrows it. Stanford's RegLab ran the first preregistered evaluation of commercial retrieval-based legal research tools and found they produced false or misgrounded answers on 17 to 33 percent of queries, despite marketing that promised hallucination-free citations. The tools did beat a general-purpose model without retrieval. None came close to the reliability their buyers had assumed, and the authors called the providers' claims overstated.

The failure mode that matters most in a regulated setting is not the invented fact. It is the misgrounded citation: an answer that cites a real document which does not support the claim. The Stanford team flagged exactly this pattern. A reviewer who checks only that the cited document exists will pass it. Only a reviewer who reads the cited span will catch it, which is why span-level citation is the first discipline in this paper.

Even when a model is handed the exact source text, ungrounded content persists. Vectara's hallucination leaderboard measures how often models add unsupported statements when summarizing a document they were given, scored by its HHEM-2.3 evaluation model. As of its May 2026 update, the best model scored 1.8 percent, and rates across evaluated models ran as high as 24.2 percent (Exhibit 1).

Exhibit 1: Grounding narrows the hallucination gap but never closes it

Hallucination rates by system class, drawn from published evaluations. Commercial RAG legal research tools: 17 to 33 percent of queries answered falsely or with misgrounded citations (Stanford RegLab, preregistered evaluation). Evaluated models summarizing a document supplied to them: 1.8 percent at best, 24.2 percent at worst (Vectara hallucination leaderboard, HHEM-2.3, May 2026). The floor is nonzero in every published measurement, which is why compliance-grade systems verify outputs instead of trusting them.

A nonzero floor matters at compliance volume. A scanner that applies 11 rules to 620 assets renders more than 6,800 verdicts. At even a 2 percent ungrounded rate, more than 130 of those verdicts assert something their cited source does not say. An audit only needs to find one.
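
The arithmetic is short enough to check in a few lines. The sketch below uses the rule and asset counts from this paper; the 2 percent rate is the illustrative figure from the paragraph above, not a measured rate for any system, and the function name is ours.

```python
def expected_ungrounded(rules: int, assets: int, ungrounded_rate: float):
    """Return (total verdicts, expected number of ungrounded verdicts)."""
    verdicts = rules * assets
    return verdicts, verdicts * ungrounded_rate


verdicts, ungrounded = expected_ungrounded(rules=11, assets=620, ungrounded_rate=0.02)
print(verdicts)           # 6820
print(round(ungrounded))  # 136
```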

There is also a structural reason to stop waiting for the model that fixes this. Researchers from OpenAI and Georgia Tech argue that hallucination is a statistical consequence of how models are trained and evaluated: mainstream benchmarks score accuracy in a way that rewards a confident guess over an admitted uncertainty, so models learn to guess. Their proposed remedy is to change how evaluations are scored across the industry. A compliance buyer cannot wait for that. The practical conclusion is to choose an architecture that verifies every output, because the generator will keep guessing.

The audit requirements are already written

Buyers sometimes treat auditability as a preference. In the European Union it is now a design obligation. Article 12 of the EU AI Act requires high-risk AI systems to automatically record events over the lifetime of the system, with logs sufficient to trace operation, detect risk situations, and support post-market monitoring. Article 14 requires human oversight to be designed into the system itself: the people responsible must be able to interpret the output, stay alert to automation bias, override or reverse a result, and stop the system. These obligations apply to high-risk systems from August 2026. They describe engineering features, not policy binders.

United States guidance points the same direction. FDA's January 2025 draft guidance on AI used to support regulatory decision-making for drugs and biologics sets out a risk-based credibility assessment framework: sponsors define a context of use, assess model risk by its influence on the decision and the consequence of error, then collect and document credibility evidence in a form the agency can review, with a seven-step process and lifecycle maintenance expectations. EMA's reflection paper on AI in the medicinal product lifecycle, adopted in September 2024, takes a human-centric line across the same ground.

Read together, these texts converge on a short list of questions an inspector will ask your retrieval system to answer. None of them can be answered by a system that stores its outputs as chat transcripts.

Four regulatory texts ask the same five questions of your retrieval system

| Inspector's question | Regulatory anchor | Engineering requirement |
| --- | --- | --- |
| What did the system do, and when? | EU AI Act Article 12: automatic event logging over the system lifetime | Append-only verdict records with timestamps and input hashes |
| Which rule, in which version, was applied? | FDA draft guidance: credibility evidence per defined context of use | Rule sets under version control, version stamped on every verdict |
| What in the source supports the output? | EMA reflection paper: transparency and interpretability expectations | Verbatim evidence quotes pinned to source spans |
| Can a person catch and correct an error? | EU AI Act Article 14: designed-in human oversight and override | Review queues with recorded overrides |
| Does the same input give the same answer? | FDA draft guidance: model risk assessed by influence and consequence | Pinned model, prompt, and rule versions; reproducible runs |
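
One way to make "append-only with timestamps and input hashes" concrete is a log in which each entry also stores the hash of the entry before it, so any later alteration is detectable. The sketch below is an illustration under our own assumptions: hash chaining is not prescribed by the regulatory texts above, the class and field names are hypothetical, and a real system would persist entries to durable storage rather than a Python list.

```python
import hashlib
import json
from datetime import datetime, timezone


def _digest(body: dict) -> str:
    """Stable SHA-256 over a JSON rendering of the entry body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


class VerdictLog:
    """In-memory append-only log; each entry is chained to the previous one by hash."""

    def __init__(self) -> None:
        self._entries = []

    def append(self, asset_bytes: bytes, rule_id: str, rule_version: str, verdict: str) -> dict:
        prev_hash = self._entries[-1]["entry_hash"] if self._entries else "0" * 64
        body = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "input_hash": hashlib.sha256(asset_bytes).hexdigest(),
            "rule_id": rule_id,
            "rule_version": rule_version,
            "verdict": verdict,
            "prev_hash": prev_hash,
        }
        body["entry_hash"] = _digest(body)  # digest is taken before this key is added
        self._entries.append(body)
        return dict(body)  # hand back a copy so callers cannot edit the stored entry

    def verify(self) -> bool:
        """Recompute every hash; False if any entry was altered or reordered."""
        prev_hash = "0" * 64
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "entry_hash"}
            if entry["prev_hash"] != prev_hash or _digest(body) != entry["entry_hash"]:
                return False
            prev_hash = entry["entry_hash"]
        return True


log = VerdictLog()
log.append(b"asset text, first upload", "RULE-07", "2.1.0", "violation")
log.append(b"asset text, revised upload", "RULE-07", "2.1.0", "compliant")
print(log.verify())  # True while no stored entry has been altered
```

Editing any stored field after the fact changes the recomputed digest, so `verify()` returns False and the tampering is visible without consulting the model.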

Five disciplines separate compliance-grade from consumer-grade

The architecture that answers those five questions is not exotic. It is a set of disciplines applied at specific layers of the retrieval pipeline. Each one removes a class of audit failure.

  • Citations to source spans. A link to a document is not a citation. A compliance-grade citation names the document, its version, the section, and the character range of the supporting passage. The reviewer clicks and lands on a highlighted sentence, not at the top of a 40-page PDF. Span-level citation is what makes the misgrounded-citation failure mode visible at review time instead of at inspection time.
  • Versioned rule sets. Compliance rules change, and verdicts rendered under the old rule must stay interpretable after the change. Every rule carries an identifier and a version. Every verdict records the version it was scored against. When a regulator updates a requirement, you re-run affected assets against the new version and the audit trail shows both runs.
  • Evidence quotes, verified by string match. The system must quote the exact source language that triggered each verdict, and the pipeline must then verify that the quoted string actually occurs in the source document. This is a mechanical check, not a model judgment. It converts the worst failure mode, a confident verdict resting on text that does not exist, into a detectable error that never reaches a reviewer as fact.
  • Deterministic verdict records. The output of a scan is a record, not a message. Each record carries the asset identifier and content hash, the rule identifier and version, the model and prompt versions, the retrieved span references, the verified quote, the verdict, the confidence score, the timestamp, and the reviewer action. Records are append-only. Re-running an asset produces a comparable record, so drift between runs becomes a measurable quantity instead of an anecdote.
  • Human review queues, instrumented. Verdicts route to people by confidence and by rule criticality. The reviewer's decision is recorded against the machine's, which does two jobs at once: it satisfies the oversight obligation in EU AI Act Article 14, and it accumulates the labeled dataset that tells you precision per rule. The queue is the override mechanism the law asks for, with instrumentation attached.
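
The first and third disciplines reduce to a data structure and a string comparison. The sketch below is a minimal illustration, not the production implementation: the `SpanCitation` fields follow the description above (document, version, section, character range), while the function names, document identifier, and sample text are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpanCitation:
    document_id: str
    document_version: str
    section: str
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive


def locate_quote(source_text: str, quote: str, document_id: str,
                 document_version: str, section: str) -> Optional[SpanCitation]:
    """Return a span citation if the quote occurs verbatim in the source, else None."""
    start = source_text.find(quote) if quote else -1
    if start == -1:
        return None
    return SpanCitation(document_id, document_version, section, start, start + len(quote))


def verify_quote(source_text: str, quote: str, citation: SpanCitation) -> bool:
    """Mechanical check: the quote must be exactly the text at the cited span."""
    if not quote or not (0 <= citation.start < citation.end <= len(source_text)):
        return False
    return source_text[citation.start:citation.end] == quote


source = "Section 4.2. Promotional material must state the approved indication."
quote = "Promotional material must state the approved indication."
paraphrase = "Promotional material should mention the approved indication."

citation = locate_quote(source, quote, "labeling-code", "2025-03", "4.2")
print(verify_quote(source, quote, citation))                               # True
print(locate_quote(source, paraphrase, "labeling-code", "2025-03", "4.2"))  # None
```

An exact match is deliberately strict: a paraphrase fails, which is the behavior the discipline calls for. Whether to normalize whitespace or typographic quotes before comparing is a design choice to settle and document per deployment.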

Consumer-grade and compliance-grade RAG differ at every layer of the stack

| Layer | Consumer-grade default | Compliance-grade requirement |
| --- | --- | --- |
| Citation | Link to a source document | Span reference with highlighted passage and document version |
| Rules | Instructions embedded in a prompt | Versioned rule set, version stamped on every verdict |
| Evidence | Model asserts that it used the source | Verbatim quote, string-verified against the source text |
| Output | Chat transcript | Structured, append-only verdict record with input hash |
| Oversight | User reads the answer if they choose to | Routed review queue, overrides recorded against machine verdicts |
| Reproducibility | Best effort, silent model updates | Pinned model, prompt, and rule versions; comparable re-runs |

Exhibit 2: An auditable verdict carries its evidence with it

Anatomy of a single verdict record from a compliance scan. The record holds the asset content hash, the rule identifier and version, the retrieved span references, the verbatim evidence quote with its string-match verification result, the verdict and confidence score, the timestamp, and the reviewer action if the verdict was routed for human review. Each field answers one of the five inspector questions from the regulatory mapping table. Nothing in the record depends on re-asking the model what it meant.
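
As a rough sketch of that anatomy, the record can be modeled as an immutable structure whose fields mirror the list above. The field names, the sample values, and the `decision_fingerprint` helper are our assumptions for illustration; the helper hashes every field except the timestamp and reviewer action, so two runs of the same input can be compared mechanically.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class VerdictRecord:
    asset_id: str
    asset_sha256: str
    rule_id: str
    rule_version: str
    model_version: str
    prompt_version: str
    span_refs: tuple        # e.g. (("doc-id", "doc-version", start, end),)
    evidence_quote: str
    quote_verified: bool    # result of the string-match check
    verdict: str
    confidence: float
    timestamp: str
    reviewer_action: Optional[str] = None

    def decision_fingerprint(self) -> str:
        """Hash of the fields expected to be identical across re-runs of the same input."""
        fields = asdict(self)
        for volatile in ("timestamp", "reviewer_action"):
            fields.pop(volatile)
        return hashlib.sha256(json.dumps(fields, sort_keys=True).encode("utf-8")).hexdigest()


first = VerdictRecord(
    asset_id="asset-0001",
    asset_sha256=hashlib.sha256(b"asset body").hexdigest(),
    rule_id="R-03",
    rule_version="2.0",
    model_version="model-2026-01",   # hypothetical pinned identifiers
    prompt_version="prompt-14",
    span_refs=(("asset-0001", "v1", 120, 145),),
    evidence_quote="clinically proven to cure",
    quote_verified=True,
    verdict="violation",
    confidence=0.93,
    timestamp="2026-02-01T09:00:00Z",
)
rerun = replace(first, timestamp="2026-03-01T09:00:00Z")
print(first.decision_fingerprint() == rerun.decision_fingerprint())  # True
```

Including the confidence score in the fingerprint makes the comparison strict; a team that expects small numeric variation between runs might exclude it or compare it within a tolerance instead.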

None of these disciplines requires research-grade novelty. All of them require the build team to treat the verdict record, not the chat answer, as the product. That decision has to be made before the first sprint, because every layer of the pipeline depends on it.

A European regulator runs this architecture in production

A European pharmaceutical regulator runs our AI compliance scanner over marketing assets. The system has scanned more than 620 assets. Each scan applies 11 rules. A scan takes roughly 2 minutes per asset, against the 2 to 3 hours a manual review of the same asset used to take. We do not name the client, which is itself a compliance position: the work is referenceable in process detail, not in logo.

Each of the five disciplines is load-bearing in that deployment. The 11 rules live in a versioned rule set, so a verdict from an early scan remains interpretable after a rule is tightened. Every verdict carries a verbatim quote from the asset, and the pipeline rejects any quote that fails the string match before a reviewer ever sees it. Verdicts land in append-only records with the asset hash and rule version attached. Reviewers work a queue rather than a chat window, and their confirmations and overrides are recorded against the machine's verdicts.
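
To illustrate why an early verdict stays interpretable after a rule is tightened, here is a minimal sketch of a rule set that keeps every published version and never overwrites one. It is not the production code; the class names, rule identifier, and rule wording are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    rule_id: str
    version: str
    text: str


class RuleSet:
    """Keeps every published version of every rule; nothing is overwritten."""

    def __init__(self) -> None:
        self._versions = {}  # (rule_id, version) -> Rule
        self._latest = {}    # rule_id -> most recently published version

    def publish(self, rule: Rule) -> None:
        key = (rule.rule_id, rule.version)
        if key in self._versions:
            raise ValueError(f"{rule.rule_id} version {rule.version} is already published")
        self._versions[key] = rule
        self._latest[rule.rule_id] = rule.version

    def get(self, rule_id: str, version: Optional[str] = None) -> Rule:
        """Fetch a specific version, or the latest one if no version is given."""
        version = version or self._latest[rule_id]
        return self._versions[(rule_id, version)]


rules = RuleSet()
rules.publish(Rule("R-03", "1.0", "Efficacy claims must cite an approved source."))
rules.publish(Rule("R-03", "2.0", "Efficacy claims must cite an approved source and its date."))

print(rules.get("R-03").version)      # 2.0
print(rules.get("R-03", "1.0").text)  # Efficacy claims must cite an approved source.
```

A verdict stamped with `R-03` version `1.0` can still be read against the exact wording it was scored under, while new scans pick up version `2.0`.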

The throughput number is real, but it is not the point. The point is that the speed is reviewable. A reviewer confirms a verdict in seconds because the rule, its version, and the highlighted evidence are on one screen. The audit discipline did not slow the system down. It is the reason the regulator can use the speed at all.

The regulator did not buy a faster reviewer. It bought a system whose every verdict can be replayed, with the rule version and the evidence quote attached.

The procurement model matters as much as the architecture, because compliance-grade claims are testable claims. Our standard engagement scopes one workflow, signs executable acceptance criteria on day one, and builds in the client's environment over two weeks. The client pays $10,000 only after every criterion passes. For compliance-grade RAG, the criteria are mechanical: quote-verification pass rates, reproducibility across re-runs, correct queue routing at defined confidence thresholds. Larger programs run as repeated two-week sprints on the same terms. We build to SOC 2, HIPAA, and GDPR standards, and every implementation is aligned with the EU AI Act.
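
"Mechanical" here means a criterion can be computed from verdict records and compared to a number. The sketch below shows the shape of two such checks. The threshold values are hypothetical placeholders chosen for the example, not figures from any engagement, and the record layout is an assumption.

```python
def quote_verification_pass_rate(records):
    """Share of verdict records whose evidence quote passed the string match."""
    if not records:
        raise ValueError("no records to evaluate")
    return sum(1 for r in records if r["quote_verified"]) / len(records)


def reproducibility_rate(first_run, second_run):
    """Share of (asset, rule) pairs present in both runs with identical verdicts."""
    shared = first_run.keys() & second_run.keys()
    if not shared:
        raise ValueError("runs share no (asset, rule) pairs")
    return sum(1 for key in shared if first_run[key] == second_run[key]) / len(shared)


# Hypothetical thresholds; real values would be agreed in the signed criteria.
CRITERIA = {"quote_pass_rate": 1.0, "reproducibility": 0.98}


def acceptance_report(records, first_run, second_run):
    """Map each criterion to (measured value, passed?)."""
    measured = {
        "quote_pass_rate": quote_verification_pass_rate(records),
        "reproducibility": reproducibility_rate(first_run, second_run),
    }
    return {name: (measured[name], measured[name] >= CRITERIA[name]) for name in CRITERIA}


records = [{"quote_verified": True} for _ in range(4)]
first_run = {("a1", "R-01"): "violation", ("a1", "R-02"): "compliant",
             ("a2", "R-01"): "compliant", ("a2", "R-02"): "compliant"}
second_run = dict(first_run)
second_run[("a2", "R-02")] = "violation"  # one verdict drifted between runs

report = acceptance_report(records, first_run, second_run)
print(report["quote_pass_rate"])  # (1.0, True)
print(report["reproducibility"])  # (0.75, False)
```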

That model exists because of the gap this paper describes. A vendor who sells answers can hide behind a demo. A vendor who sells verdict records has to pass an executable test in your environment before being paid. Heads of regulatory affairs and quality should hold every retrieval system, internal or purchased, to the second standard.

What to do with this on Monday morning

  1. List every workflow where a generative system already produces text a regulator could later read. Ask the five inspector questions from this paper of each one. Where the system has no answer, you have found your exposure.
  2. Rewrite your RAG procurement criteria around evidence. Require span-level citations, string-verified quotes, and reproducible verdict records in the RFP, and score the vendor demo on an audit scenario rather than a question-and-answer scenario.
  3. Put your compliance rules under version control this month, before any AI work begins. If the rules live in PDFs and institutional memory, no system can stamp a verdict with the rule version it applied.
  4. Stand up a human review queue before you scale any pilot. Route verdicts by confidence and rule criticality, record every override, and use the queue data to measure precision per rule each quarter.
  5. Map your current logging against EU AI Act Articles 12 and 14 now; the high-risk obligations apply from August 2026. Treat the gap analysis as an engineering ticket list, not a legal memo.
  6. Scope one workflow and sign executable acceptance criteria before funding a build. If a vendor will not commit to pass-fail criteria covering quote verification and reproducibility, the system is consumer-grade regardless of the pitch.
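
For step 4, routing and per-rule precision need very little code once reviewer decisions are recorded next to machine verdicts. The sketch below is illustrative only: the 0.90 threshold, the queue names, and the verdict labels are assumptions, and whether high-confidence verdicts on non-critical rules still receive human review is a policy decision this sketch does not make for you.

```python
from collections import defaultdict


def route(confidence: float, rule_critical: bool, threshold: float = 0.90) -> str:
    """Critical rules and low-confidence verdicts go to the priority queue."""
    if rule_critical or confidence < threshold:
        return "priority_queue"
    return "standard_queue"


def precision_per_rule(reviews):
    """reviews: iterable of (rule_id, machine_verdict, reviewer_verdict).

    Precision is the share of machine 'violation' verdicts the reviewer confirmed.
    Rules with no machine 'violation' verdicts are omitted from the result.
    """
    flagged = defaultdict(int)
    confirmed = defaultdict(int)
    for rule_id, machine, reviewer in reviews:
        if machine == "violation":
            flagged[rule_id] += 1
            if reviewer == "violation":
                confirmed[rule_id] += 1
    return {rule_id: confirmed[rule_id] / flagged[rule_id] for rule_id in flagged}


reviews = [
    ("R-01", "violation", "violation"),
    ("R-01", "violation", "compliant"),   # reviewer override
    ("R-02", "violation", "violation"),
    ("R-02", "compliant", "compliant"),   # not a flagged verdict, ignored for precision
]
precision = precision_per_rule(reviews)
print(route(0.95, rule_critical=False))  # standard_queue
print(precision)                         # {'R-01': 0.5, 'R-02': 1.0}
```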

Sources
  1. Stanford RegLab, Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools (2025)
  2. Kalai, Nachum, Vempala et al. (OpenAI), Why Language Models Hallucinate (2025)
  3. Vectara, Hallucination Leaderboard, HHEM-2.3 (updated May 2026)
  4. EU Artificial Intelligence Act, Article 12: Record-Keeping (2024)
  5. EU Artificial Intelligence Act, Article 14: Human Oversight (2024)
  6. U.S. FDA, Considerations for the Use of Artificial Intelligence To Support Regulatory Decision-Making for Drug and Biological Products, draft guidance (January 2025)
  7. European Medicines Agency, Reflection paper on the use of AI in the medicinal product lifecycle (2024)
