Retrieval-augmented generation (RAG) looks finished in a demo and falls apart in an audit. Independent evaluations find that commercial retrieval-based systems still produce false or misgrounded answers on 17 to 33 percent of hard queries, while the EU AI Act and FDA draft guidance now specify logging, oversight, and credibility evidence that consumer-grade architectures cannot produce. We set out the five engineering disciplines that make a retrieval system auditable, and we show how they run in production inside the compliance scanner we built for a European pharmaceutical regulator.
The demo works; the audit fails
Every head of regulatory affairs has seen the demo by now. A vendor connects a language model to a document store. Someone asks a question about a labeling rule. The system answers in seconds and shows a link to a source document. The room is impressed. The pilot gets approved.
Then the quality team asks the questions it asks of any system of record. Which version of the rule did the system apply? Which sentence in the source supports this verdict? If we run the same asset through next month, do we get the same answer? Who reviewed this output before anyone acted on it? The demo has no answers. It was never designed to have them.
That gap has a name. Consumer-grade RAG optimizes for a plausible answer. Compliance-grade RAG optimizes for a defensible record. The two differ in architecture, not in polish, and the difference cannot be retrofitted after the pilot succeeds.
Consumer-grade RAG produces answers. Compliance-grade RAG produces evidence. An answer with a link is an opinion. An answer with a versioned rule, a verbatim quote, and a span reference is evidence.
We have built both kinds. QueryNow has shipped more than 200 production deployments since 2014, and the lessons in this paper come from one of them: a retrieval system we built for a European pharmaceutical regulator that needed verdicts it could stand behind in an inspection. The published research explains why the changes we made were not optional.
Retrieval narrows the hallucination problem; it does not close it
The standard defense of RAG is that grounding a model in retrieved documents eliminates hallucination. The empirical record says grounding only narrows the problem. Stanford's RegLab ran the first preregistered evaluation of commercial retrieval-based legal research tools and found they produced false or misgrounded answers on 17 to 33 percent of queries, despite marketing that promised hallucination-free citations. The tools did beat a general-purpose model without retrieval. None came close to the reliability their buyers had assumed, and the authors called the providers' claims overstated.
The failure mode that matters most in a regulated setting is not the invented fact. It is the misgrounded citation: an answer that cites a real document which does not support the claim. The Stanford team flagged exactly this pattern. A reviewer who checks only that the cited document exists will pass it. Only a reviewer who reads the cited span will catch it, which is why span-level citation is the first discipline in this paper.
Even when a model is handed the exact source text, ungrounded content persists. Vectara's hallucination leaderboard measures how often models add unsupported statements when summarizing a document they were given, scored by its HHEM-2.3 evaluation model. As of its May 2026 update, the best model scored 1.8 percent, and rates across evaluated models ran as high as 24.2 percent (Exhibit 1).
Exhibit 1. Hallucination rates by system class, drawn from published evaluations. Commercial RAG legal research tools: 17 to 33 percent of queries answered falsely or with misgrounded citations (Stanford RegLab, preregistered evaluation). Frontier models summarizing a document supplied to them: 1.8 percent at best, 24.2 percent at worst (Vectara hallucination leaderboard, HHEM-2.3, May 2026). The floor is nonzero in every published measurement, which is why compliance-grade systems verify outputs instead of trusting them.
A nonzero floor matters at compliance volume. A scanner that applies 11 rules to 620 assets renders more than 6,800 verdicts. At even a 2 percent ungrounded rate, more than 130 of those verdicts assert something their cited source does not say. An audit only needs to find one.
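The arithmetic behind that paragraph is short enough to check directly. The sketch below uses only the figures stated in the text; the constant names are ours, and the 2 percent rate is the illustrative rate from the paragraph, not a measured property of any system.

```python
# Figures from the text: 11 rules applied to each of 620 assets.
RULES_PER_SCAN = 11
ASSETS_SCANNED = 620
UNGROUNDED_RATE = 0.02  # illustrative 2 percent rate, not a measurement

# One verdict per rule per asset.
total_verdicts = RULES_PER_SCAN * ASSETS_SCANNED

# Expected number of verdicts whose cited source does not support them.
expected_ungrounded = total_verdicts * UNGROUNDED_RATE

print(total_verdicts)              # 6820
print(round(expected_ungrounded))  # 136
```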
There is also a structural reason to stop waiting for the model that fixes this. Researchers from OpenAI and Georgia Tech argue that hallucination is a statistical consequence of how models are trained and evaluated: mainstream benchmarks score accuracy in a way that rewards a confident guess over an admitted uncertainty, so models learn to guess. Their proposed remedy is to change how evaluations are scored across the industry. A compliance buyer cannot wait for that. The practical conclusion is to choose an architecture that verifies every output, because the generator will keep guessing.
The audit requirements are already written
Buyers sometimes treat auditability as a preference. In the European Union it is now a design obligation. Article 12 of the EU AI Act requires high-risk AI systems to automatically record events over the lifetime of the system, with logs sufficient to trace operation, detect risk situations, and support post-market monitoring. Article 14 requires human oversight to be designed into the system itself: the people responsible must be able to interpret the output, stay alert to automation bias, override or reverse a result, and stop the system. These obligations apply to high-risk systems from August 2026. They describe engineering features, not policy binders.
United States guidance points in the same direction. FDA's January 2025 draft guidance on AI used to support regulatory decision-making for drugs and biologics sets out a risk-based credibility assessment framework. Sponsors define a context of use, assess model risk by its influence on the decision and the consequence of error, then collect and document credibility evidence in a form the agency can review. The framework runs as a seven-step process and carries lifecycle maintenance expectations. EMA's reflection paper on AI in the medicinal product lifecycle, adopted in September 2024, takes a human-centric line across the same ground.
Read together, these texts converge on a short list of questions an inspector will ask your retrieval system to answer. None of them can be answered by a system that stores its outputs as chat transcripts.
| Inspector's question | Regulatory anchor | Engineering requirement |
|---|---|---|
| What did the system do, and when? | EU AI Act Article 12: automatic event logging over the system lifetime | Append-only verdict records with timestamps and input hashes |
| Which rule, in which version, was applied? | FDA draft guidance: credibility evidence per defined context of use | Rule sets under version control, version stamped on every verdict |
| What in the source supports the output? | EMA reflection paper: transparency and interpretability expectations | Verbatim evidence quotes pinned to source spans |
| Can a person catch and correct an error? | EU AI Act Article 14: designed-in human oversight and override | Review queues with recorded overrides |
| Does the same input give the same answer? | FDA draft guidance: model risk assessed by influence and consequence | Pinned model, prompt, and rule versions; reproducible runs |
Five disciplines separate compliance-grade from consumer-grade
The architecture that answers those five questions is not exotic. It is a set of disciplines applied at specific layers of the retrieval pipeline. Each one removes a class of audit failure.
- Citations to source spans. A link to a document is not a citation. A compliance-grade citation names the document, its version, the section, and the character range of the supporting passage. The reviewer clicks and lands on a highlighted sentence, not at the top of a 40-page PDF. Span-level citation is what makes the misgrounded-citation failure mode visible at review time instead of at inspection time.
- Versioned rule sets. Compliance rules change, and verdicts rendered under the old rule must stay interpretable after the change. Every rule carries an identifier and a version. Every verdict records the version it was scored against. When a regulator updates a requirement, you re-run affected assets against the new version and the audit trail shows both runs.
- Evidence quotes, verified by string match. The system must quote the exact source language that triggered each verdict, and the pipeline must then verify that the quoted string actually occurs in the source document. This is a mechanical check, not a model judgment. It converts the worst failure mode, a confident verdict resting on text that does not exist, into a detectable error that never reaches a reviewer as fact.
- Deterministic verdict records. The output of a scan is a record, not a message. Each record carries the asset identifier and content hash, the rule identifier and version, the model and prompt versions, the retrieved span references, the verified quote, the verdict, the confidence score, the timestamp, and the reviewer action. Records are append-only. Re-running an asset produces a comparable record, so drift between runs becomes a measurable quantity instead of an anecdote.
- Human review queues, instrumented. Verdicts route to people by confidence and by rule criticality. The reviewer's decision is recorded against the machine's, which does two jobs at once: it satisfies the oversight obligation in EU AI Act Article 14, and it accumulates the labeled dataset that tells you precision per rule. The queue is the override mechanism the law asks for, with instrumentation attached.
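The first and third disciplines meet in one mechanical step: check that the quoted string occurs in the source, and if it does, record where. The sketch below is a minimal illustration, not the production implementation. `SpanRef`, `verify_quote`, and the sample text are hypothetical names and data, and the match is an exact, case-sensitive substring check with no normalization. A real pipeline would have to decide how to treat whitespace, hyphenation, and encoding differences.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpanRef:
    """Hypothetical span reference: document, version, and character range."""
    document_id: str
    document_version: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset


def verify_quote(source_text: str, quote: str, document_id: str,
                 document_version: str) -> Optional[SpanRef]:
    """Return a SpanRef if the quote occurs verbatim in the source, else None.

    Exact, case-sensitive substring match; only the first occurrence is used.
    """
    if not quote:
        return None
    start = source_text.find(quote)
    if start == -1:
        return None  # the quote does not exist in the source: reject it
    return SpanRef(document_id, document_version, start, start + len(quote))


# Hypothetical source text and quotes, for illustration only.
source = ("Promotional material must include the approved indication. "
          "It must not overstate efficacy.")
good = verify_quote(source, "must not overstate efficacy", "asset-001", "v1")
bad = verify_quote(source, "must never overstate efficacy", "asset-001", "v1")

print(good is not None)  # True: the quote is in the source
print(bad is None)       # True: the altered quote is rejected
```

Because the check returns offsets rather than a boolean, the same call that rejects an invented quote also produces the character range a reviewer needs to land on the highlighted passage.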
| Layer | Consumer-grade default | Compliance-grade requirement |
|---|---|---|
| Citation | Link to a source document | Span reference with highlighted passage and document version |
| Rules | Instructions embedded in a prompt | Versioned rule set, version stamped on every verdict |
| Evidence | Model asserts that it used the source | Verbatim quote, string-verified against the source text |
| Output | Chat transcript | Structured, append-only verdict record with input hash |
| Oversight | User reads the answer if they choose to | Routed review queue, overrides recorded against machine verdicts |
| Reproducibility | Best effort, silent model updates | Pinned model, prompt, and rule versions; comparable re-runs |
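The "Rules" row is the easiest one to make concrete. The sketch below shows one way a rule set could keep every published version addressable, so that a verdict stamped with a rule identifier and version can always be read against the text it was scored on. `Rule`, `RuleSet`, the rule identifier, and the rule wording are all hypothetical. In practice the versions would more likely live in a version-controlled repository than in memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    rule_id: str
    version: int
    text: str


class RuleSet:
    """Hypothetical in-memory registry; old versions are never overwritten."""

    def __init__(self) -> None:
        self._rules: dict[tuple[str, int], Rule] = {}

    def latest_version(self, rule_id: str) -> int:
        """Return the highest published version, or 0 if none exists."""
        versions = [v for (rid, v) in self._rules if rid == rule_id]
        return max(versions, default=0)

    def publish(self, rule_id: str, text: str) -> Rule:
        """Store the text as the next version of the rule and return it."""
        rule = Rule(rule_id, self.latest_version(rule_id) + 1, text)
        self._rules[(rule.rule_id, rule.version)] = rule
        return rule

    def get(self, rule_id: str, version: int) -> Rule:
        """Look up the exact version a past verdict was scored against."""
        return self._rules[(rule_id, version)]


# Hypothetical rule wording, for illustration only.
rules = RuleSet()
v1 = rules.publish("R-07", "Claims must cite the approved label.")
v2 = rules.publish("R-07", "Claims must cite the approved label and section.")

# A verdict stamped ("R-07", 1) still resolves to the original wording.
print(rules.get("R-07", 1).text)
print(rules.latest_version("R-07"))  # 2
```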
Anatomy of a single verdict record from a compliance scan. The record holds the asset content hash, the rule identifier and version, the retrieved span references, the verbatim evidence quote with its string-match verification result, the verdict and confidence score, the timestamp, and the reviewer action if the verdict was routed for human review. Each field answers one of the five inspector questions from the regulatory mapping table. Nothing in the record depends on re-asking the model what it meant.
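A record with these fields can be sketched as an immutable data class written to an append-only log. The field names below follow the description in the text, but the class names, field types, verdict labels, and version strings are our assumptions for illustration, and the in-memory list stands in for whatever durable store a real system would use. One design choice is worth noting: a reviewer action is appended as a new record, so the machine's original verdict is never edited.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace
from datetime import datetime, timezone
from typing import Optional


def content_hash(data: bytes) -> str:
    """SHA-256 of the asset bytes, used to pin the exact input."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class VerdictRecord:
    asset_id: str
    asset_sha256: str
    rule_id: str
    rule_version: int
    model_version: str
    prompt_version: str
    span_refs: tuple[tuple[int, int], ...]  # (start, end) character offsets
    evidence_quote: str
    quote_verified: bool
    verdict: str
    confidence: float
    timestamp: str
    reviewer_action: Optional[str] = None


class VerdictLog:
    """Hypothetical append-only log: records can be added, never changed."""

    def __init__(self) -> None:
        self._lines: list[str] = []

    def append(self, record: VerdictRecord) -> None:
        self._lines.append(json.dumps(asdict(record), sort_keys=True))

    def records(self) -> list[dict]:
        return [json.loads(line) for line in self._lines]


# Hypothetical asset and values, for illustration only.
asset_bytes = b"Example marketing asset text."
record = VerdictRecord(
    asset_id="asset-001",
    asset_sha256=content_hash(asset_bytes),
    rule_id="R-07",
    rule_version=1,
    model_version="model-pinned-1",
    prompt_version="prompt-3",
    span_refs=((0, 23),),
    evidence_quote="Example marketing asset",
    quote_verified=True,
    verdict="pass",
    confidence=0.93,
    timestamp=datetime.now(timezone.utc).isoformat(),
)

log = VerdictLog()
log.append(record)
# The review is a second record; the first stays exactly as written.
log.append(replace(record, reviewer_action="confirmed"))
print(len(log.records()))  # 2
```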
None of these disciplines requires research-grade novelty. All of them require the build team to treat the verdict record, not the chat answer, as the product. That decision has to be made before the first sprint, because every layer of the pipeline depends on it.
A European regulator runs this architecture in production
A European pharmaceutical regulator runs our AI compliance scanner over marketing assets. The system has scanned more than 620 assets. Each scan applies 11 rules. A scan takes roughly 2 minutes per asset, against the 2 to 3 hours a manual review of the same asset used to take. We do not name the client, which is itself a compliance position: we can describe the work in process detail, but we do not attach a logo to it.
Each of the five disciplines is load-bearing in that deployment. The 11 rules live in a versioned rule set, so a verdict from an early scan remains interpretable after a rule is tightened. Every verdict carries a verbatim quote from the asset, and the pipeline rejects any quote that fails the string match before a reviewer ever sees it. Verdicts land in append-only records with the asset hash and rule version attached. Reviewers work a queue rather than a chat window, and their confirmations and overrides are recorded against the machine's verdicts.
The throughput number is real, but it is not the point. The point is that the speed is reviewable. A reviewer confirms a verdict in seconds because the rule, its version, and the highlighted evidence are on one screen. The audit discipline did not slow the system down. It is the reason the regulator can use the speed at all.
The regulator did not buy a faster reviewer. It bought a system whose every verdict can be replayed, with the rule version and the evidence quote attached.
The procurement model matters as much as the architecture, because compliance-grade claims are testable claims. Our standard engagement scopes one workflow, signs executable acceptance criteria on day one, and builds in the client's environment over two weeks. The client pays $10,000 only after every criterion passes. For compliance-grade RAG, the criteria are mechanical: quote-verification pass rates, reproducibility across re-runs, correct queue routing at defined confidence thresholds. Larger programs run as repeated two-week sprints on the same terms. We build to SOC 2, HIPAA, and GDPR standards, and every implementation is aligned with the EU AI Act.
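To show what "mechanical" means here, the sketch below expresses the three criteria named above as small functions with pass-fail checks. Everything in it is hypothetical: the function names, the queue names, the 0.90 threshold, and the sample data are ours, and real acceptance criteria would be agreed per workflow. We also assume every verdict reaches a person, with routing deciding only which queue it lands in.

```python
def route(confidence: float, rule_is_critical: bool,
          threshold: float = 0.90) -> str:
    """Send critical-rule or low-confidence verdicts to the priority queue."""
    if rule_is_critical or confidence < threshold:
        return "priority_queue"
    return "standard_queue"


def quote_pass_rate(verified: list[bool]) -> float:
    """Share of verdicts whose evidence quote passed the string match."""
    if not verified:
        raise ValueError("no verdicts to score")
    return sum(verified) / len(verified)


def rerun_disagreement_rate(run_a: dict[str, str],
                            run_b: dict[str, str]) -> float:
    """Share of asset/rule pairs whose verdict changed between two runs."""
    if not run_a or run_a.keys() != run_b.keys():
        raise ValueError("runs must cover the same non-empty set of pairs")
    differing = sum(1 for key in run_a if run_a[key] != run_b[key])
    return differing / len(run_a)


# Hypothetical sample data: keys are "asset:rule" pairs.
first_run = {"asset-001:R-01": "pass", "asset-001:R-02": "fail"}
second_run = {"asset-001:R-01": "pass", "asset-001:R-02": "fail"}

# Pass-fail checks in the spirit of executable acceptance criteria.
assert route(0.95, rule_is_critical=False) == "standard_queue"
assert route(0.95, rule_is_critical=True) == "priority_queue"
assert route(0.60, rule_is_critical=False) == "priority_queue"
assert quote_pass_rate([True, True, True, True]) == 1.0
assert rerun_disagreement_rate(first_run, second_run) == 0.0
```

Returning a disagreement rate instead of a yes-or-no answer keeps drift between runs measurable, which lets a criterion state a tolerance instead of demanding identical output.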
That model exists because of the gap this paper describes. A vendor who sells answers can hide behind a demo. A vendor who sells verdict records has to pass an executable test in your environment before being paid. Heads of regulatory affairs and quality should hold every retrieval system, internal or purchased, to the second standard.
What to do with this on Monday morning
- List every workflow where a generative system already produces text a regulator could later read. Ask the five inspector questions from this paper of each one. Where the system has no answer, you have found your exposure.
- Rewrite your RAG procurement criteria around evidence. Require span-level citations, string-verified quotes, and reproducible verdict records in the RFP, and score the vendor demo on an audit scenario rather than a question-and-answer scenario.
- Put your compliance rules under version control this month, before any AI work begins. If the rules live in PDFs and institutional memory, no system can stamp a verdict with the rule version it applied.
- Stand up a human review queue before you scale any pilot. Route verdicts by confidence and rule criticality, record every override, and use the queue data to measure precision per rule each quarter.
- Map your current logging against EU AI Act Articles 12 and 14 now; the high-risk obligations apply from August 2026. Treat the gap analysis as an engineering ticket list, not a legal memo.
- Scope one workflow and sign executable acceptance criteria before funding a build. If a vendor will not commit to pass-fail criteria covering quote verification and reproducibility, the system is consumer-grade regardless of the pitch.
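The quarterly precision measurement in the review-queue step can be computed from nothing more than the recorded pairs of machine verdict and reviewer decision. The sketch below is one way to do it, under stated assumptions: the function name, the tuple layout, the "violation" and "pass" labels, and the sample reviews are all hypothetical. Precision here means the share of machine-flagged violations that a reviewer confirmed.

```python
from collections import defaultdict


def precision_per_rule(
    reviews: list[tuple[str, str, str]],
) -> dict[str, float]:
    """Compute per-rule precision from (rule_id, machine, reviewer) tuples.

    Only verdicts the machine flagged as "violation" count toward precision.
    Rules with no flagged verdicts are omitted from the result.
    """
    flagged: dict[str, int] = defaultdict(int)
    confirmed: dict[str, int] = defaultdict(int)
    for rule_id, machine_verdict, reviewer_verdict in reviews:
        if machine_verdict != "violation":
            continue
        flagged[rule_id] += 1
        if reviewer_verdict == "violation":
            confirmed[rule_id] += 1
    return {rule_id: confirmed[rule_id] / flagged[rule_id]
            for rule_id in flagged}


# Hypothetical queue data, for illustration only.
sample_reviews = [
    ("R-01", "violation", "violation"),
    ("R-01", "violation", "pass"),       # reviewer overrode the machine
    ("R-01", "violation", "violation"),
    ("R-02", "violation", "violation"),
    ("R-02", "pass", "pass"),            # not flagged, so not counted
]

for rule_id, precision in sorted(precision_per_rule(sample_reviews).items()):
    print(rule_id, round(precision, 2))
```

This measures only what the machine flagged. Violations the machine missed need a separate sample of passed verdicts, reviewed by hand, before any recall figure can be claimed.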
References
- Stanford RegLab, Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools (2025)
- Kalai, Nachum, Vempala, and Zhang (OpenAI and Georgia Tech), Why Language Models Hallucinate (2025)
- Vectara, Hallucination Leaderboard, HHEM-2.3 (updated May 2026)
- EU Artificial Intelligence Act, Article 12: Record-Keeping (2024)
- EU Artificial Intelligence Act, Article 14: Human Oversight (2024)
- U.S. FDA, Considerations for the Use of Artificial Intelligence To Support Regulatory Decision-Making for Drug and Biological Products, draft guidance (January 2025)
- European Medicines Agency, Reflection paper on the use of AI in the medicinal product lifecycle (2024)