Case study · 8 min read

Building a document extraction pipeline that passes audit

In the first quarter of 2026 we shipped a document extraction pipeline for a mid-cap pharmaceutical company's regulatory affairs team. 90 days end-to-end. 73% reviewer time saved. Zero findings on the external regulatory review. Here's how that happened — not the marketing version, the actual sequence of decisions.

The problem, stated plainly

Every clinical study report coming back from a CRO was being read by a regulatory associate, re-read by a reviewer, and spot-checked by the VP. The associate extracted endpoint data into a structured template. The reviewer checked it. The VP sampled. Total cycle time: 4.5 hours per report across three people, and at 240 reports a year that is roughly 1,080 hours of skilled regulatory time.

The work that wasn't model work

We spent the first two weeks not touching a model. Interviews with 9 people. Shadowed 6 reviews. Read the CSRs themselves. Mapped the data fields — there are 41 — against the extraction template. Identified which fields had schema constraints that could be validated mechanically and which required interpretive judgment.
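That mechanical-versus-interpretive split is worth making explicit in code. Here is a minimal sketch, with invented field names and constraints rather than the client's actual template: each field declares whether it has a constraint a machine can check, and anything without one is routed to human judgment.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class FieldSpec:
    name: str
    # None means the field needs interpretive review, not a mechanical check.
    mechanical_check: Optional[Callable[[str], bool]]

# Illustrative entries only; the real template has 41 fields.
FIELD_MAP = [
    FieldSpec("study_id", lambda v: v.startswith("CSR-")),
    FieldSpec("n_enrolled", lambda v: v.isdigit() and int(v) > 0),
    FieldSpec("primary_endpoint_result", None),  # requires judgment
]

def mechanical_failures(record: dict) -> list[str]:
    """Return the names of fields whose schema constraint fails outright."""
    return [
        f.name for f in FIELD_MAP
        if f.mechanical_check and not f.mechanical_check(record.get(f.name, ""))
    ]
```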

Then we built the evaluation set. The client's scientists labeled 120 reports against the template. We versioned the eval set in Git alongside the eventual code. This is the unglamorous work that makes the pilot credible later.
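For illustration, a field-accuracy scorer over an eval set like this fits in a few lines. The file layout is an assumption (one JSON file of gold labels per report, with model predictions stored under the same filenames), not the client's actual structure:

```python
import json
from pathlib import Path

def field_accuracy(pred_dir: Path, gold_dir: Path) -> float:
    """Fraction of labeled fields where prediction matches the gold label."""
    correct = total = 0
    for gold_file in sorted(gold_dir.glob("*.json")):
        gold = json.loads(gold_file.read_text())["fields"]
        pred = json.loads((pred_dir / gold_file.name).read_text())["fields"]
        for name, gold_value in gold.items():
            total += 1
            correct += pred.get(name) == gold_value
    return correct / total if total else 0.0
```

Because the labels live in Git next to the code, every prompt or model change gets scored against the same 120 reports, and regressions show up as a diff in a number rather than an argument in a meeting.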

The model selection, compressed

We ran three variants: a single-shot prompt, a two-step extract-then-review prompt, and the same two-step prompt with schema-constrained structured outputs. The two-step approach with structured outputs won: 99.1% field accuracy against the eval set, with a predictable failure mode (it over-quotes from the source, which is easy to review) rather than an unpredictable one (fabricated values).
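The shape of the winning variant is simple enough to sketch. This is a paraphrase, not the production code: call_model stands in for whichever provider client is wired in, and the prompts are illustrative.

```python
import json

def extract_then_review(call_model, csr_text: str, schema: dict) -> dict:
    # Step 1: extract endpoint fields, constrained to the template's JSON schema.
    draft = call_model(
        prompt="Extract the endpoint fields from this CSR as JSON matching "
               "the schema. Quote values verbatim from the source.\n\n" + csr_text,
        json_schema=schema,
    )
    # Step 2: a review pass checks the draft against the source and nulls out
    # anything it cannot ground in the text. Over-quoting survives this pass;
    # fabricated values do not.
    reviewed = call_model(
        prompt="Check each extracted value against the source. Replace any "
               "value not supported verbatim by the source with null.\n\n"
               "Source:\n" + csr_text + "\n\nDraft:\n" + json.dumps(draft),
        json_schema=schema,
    )
    return reviewed
```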

The production system

Shipped inside the client's Azure tenant. Auth via their existing Entra SSO. Audit trail forwarded to their Sentinel SIEM. Review UI for the RA team — a single page per report showing the source PDF, the extracted structured output, and a one-click accept/edit/reject workflow. Cost caps at the workflow level. A weekly digest to the VP showing throughput, accuracy on sampled re-reviews, and cost per report.
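Of those pieces, the workflow-level cost cap is the easiest to show in miniature. A sketch with invented numbers and in-memory state; the production version persists spend per workflow inside the client's tenant:

```python
class CostCapExceeded(Exception):
    pass

class WorkflowBudget:
    """Tracks spend for one workflow and refuses calls past the cap."""

    def __init__(self, monthly_cap_usd: float):
        self.cap = monthly_cap_usd
        self.spent = 0.0

    def charge(self, usd: float) -> None:
        if self.spent + usd > self.cap:
            raise CostCapExceeded(
                f"cap {self.cap:.2f} USD would be exceeded; pausing workflow"
            )
        self.spent += usd
```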

The system is model-provider-agnostic: the prompt orchestration sits in the client's code, and the model call is a swappable layer. Today it's Claude 3.5 Sonnet; a year from now it'll likely be something else, and the eval set will tell them whether to switch.
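One way to get that separation (a sketch; the interface names are ours, not the client's) is to have the orchestration depend only on a narrow protocol, so switching providers means writing one adapter and re-running the eval set:

```python
from typing import Protocol

class ModelClient(Protocol):
    """The only model surface the orchestration code sees."""
    def complete(self, prompt: str, json_schema: dict) -> dict: ...

class CurrentProviderClient:
    """Adapter around today's provider SDK. Swapping models means writing
    another class with the same complete() signature."""
    def complete(self, prompt: str, json_schema: dict) -> dict:
        raise NotImplementedError("provider SDK call goes here")
```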

What we'd do differently

We underinvested in the first two weeks of reviewer training. Getting the RA team comfortable with the "accept/edit/reject" pattern took longer than building the model layer. Next engagement, that's a dedicated workstream.


Written by Aiveris

Let's talk

Start with a 30-minute call.

We'll walk through one workflow you've been trying to improve, and leave you with a concrete view of what a pilot would look like — whether you hire us or not.
