Why we sign before we build
Every QueryNow engagement starts the same way. We scope one workflow with you, write the acceptance criteria as executable pass-fail checks, and both sides sign the sheet on day one. We then build in your environment for two weeks. You pay $10,000 only after every criterion passes. Larger programs run as repeated two-week sprints on the same terms.
We have shipped more than 200 production deployments since 2014, and the sheet has earned its place in every recent one. It removes two failure modes that kill AI projects: the vendor who declares success on a demo, and the client who keeps moving the goal. Once the sheet is signed, success is not an opinion. It is a test run.
If a criterion cannot fail, it is not a criterion. It is a hope. Every line on this sheet must be a check that a script or a named reviewer can run and that returns exactly one of two results: pass or fail.
This template is the client-facing version of our internal sheet. Use it with us or with anyone else. A vendor who refuses to sign pass-fail criteria before building is telling you something useful.
Define the workflow first
Acceptance criteria attach to a workflow, not to a technology. Before you write a single criterion, define the workflow in three parts: input, output, trigger. One sentence each. If any of the three needs a paragraph, you are looking at two workflows. Split them and scope the first one.
| Element | What to write | Example: marketing asset compliance scan |
|---|---|---|
| Input | The artifacts the system receives, with formats, sources, and size limits named | A marketing PDF or image up to 50 pages, uploaded by a reviewer from the review portal |
| Output | The artifact the system produces, with its consumer and required fields named | A verdict record listing every rule checked, pass or flag per rule, each flag citing the exact passage |
| Trigger | The event that starts a run, stated as manual or automatic | Reviewer submits the asset for scan in the portal |
The definition does real work. Every criterion below must trace back to the input, the output, or the trigger. Anything that does not trace back is scope creep wearing a requirements costume.
How to write an executable criterion
Every criterion has three parts, and they map to the three columns of the table in the next section. First, the behavior: what the system must do, stated as a single observable fact. Second, the verification method: the exact procedure that checks the behavior, including what test data is used and who runs it. Third, the pass condition: the number or binary outcome that decides the result.
- The behavior describes the system, not the project. "The team delivers a working integration" is a project status. "Every record created in the CRM appears in the billing system within five minutes" is a behavior.
- The verification method must be runnable without the vendor in the room. If only the builder can execute the check, you are accepting a demo, not a system.
- The test data is fixed before the build starts. Agree the evaluation set, the sample documents, and the adversarial cases on day one and freeze them. A gold set assembled after the build measures the build's flattery, not its quality.
- The pass condition is a number or a binary outcome. "Fast" is an adjective. "Median scan time under three minutes across the 50-asset test set" is a pass condition.
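The three parts can be sketched as a small data structure. This is a minimal illustration, not the format of our internal sheet: the names `Criterion`, `verified_by`, and `check` are hypothetical, and the timings are invented sample values standing in for real logs.

```python
from dataclasses import dataclass
from statistics import median
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    """One row of the sheet: behavior, verification method, pass condition."""
    behavior: str               # a single observable fact about the system
    verified_by: str            # the script or named reviewer role that runs the check
    pass_condition: str         # the threshold as written on the signed sheet
    check: Callable[[], bool]   # executable form of the pass condition

    def run(self) -> str:
        # Exactly two results are possible: pass or fail.
        return "pass" if self.check() else "fail"


# Hypothetical measurements in seconds per asset (illustrative only).
SCAN_SECONDS = [95, 110, 120, 130, 150]

scan_speed = Criterion(
    behavior="Scanner returns a verdict for every asset in the test set",
    verified_by="timing script run by the client team",
    pass_condition="Median scan time under 3 minutes",
    check=lambda: median(SCAN_SECONDS) < 180,
)

print(scan_speed.run())
```

Keeping the written pass condition next to its executable form makes drift visible: if the two ever disagree, the signed text wins and the check is the thing to fix.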
Example criteria: ten rows you can adapt
The rows below span three common build types: a compliance scanner, a knowledge copilot, and a system integration. They are modeled on checks we have signed in production work. The compliance scanner we operate for a European pharmaceutical regulator runs 11 rules per scan across more than 620 marketing assets. It takes roughly 2 minutes per asset, against the 2 to 3 hours the same review took manually, and it passed criteria shaped exactly like these. Treat every number as a placeholder to negotiate, not a benchmark to copy.
| Criterion | How it is verified | Pass condition |
|---|---|---|
| Scanner evaluates the full rule set on every asset | Run the frozen 50-asset test set through the scanner and export the verdict records | Every verdict record contains a result for all 11 rules, with no null or skipped entries |
| Every flagged violation cites the exact passage | Client reviewer opens 20 randomly selected flags from the test run | 20 of 20 flags open the source asset at the highlighted passage that triggered the flag |
| Scan throughput meets review-queue needs | Timestamped logs from the full test-set run, measured end to end per asset | Median scan time under 3 minutes per asset; no asset exceeds 10 minutes |
| Copilot answers carry a working citation | Score the frozen 100-question evaluation set; a named client reviewer checks each cited link | At least 95 of 100 answers cite a source document that opens and supports the answer |
| Copilot refuses what it cannot ground | Submit the 25 agreed adversarial questions that have no answer in the corpus | 25 of 25 produce an explicit refusal or escalation, zero invented answers |
| Copilot is responsive under real load | Load test at the agreed concurrency of 20 simultaneous users | 95th percentile time to first token under 5 seconds |
| Integration syncs every record, field-correct | Create the 200 agreed test records in the source system; reconcile both systems by script | 200 of 200 records appear in the target with every mapped field matching exactly |
| Sync failures alert a named owner | Inject the 5 agreed failure cases: bad credential, malformed record, target outage, duplicate key, timeout | 5 of 5 failures raise an alert to the named owner within 10 minutes, each identifying the failed record |
| No secrets in code, config, or logs | Run the agreed secret scanner over the repository and review logs from the full test run | Zero credentials found; all authentication flows through the client's vault or managed identity |
| Access is restricted to the agreed group | A test user outside the agreed directory group attempts to reach the system | Access is denied and the attempt appears in the audit log with user and timestamp |
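Two of the rows above reduce to short scripts your team could run without the builder present. The sketch below assumes scan times have already been extracted from the logs as seconds per asset, and that records from both systems have been exported as dictionaries keyed by record id. The function names, field names, and sample data are hypothetical, and the thresholds are the placeholder numbers from the table.

```python
from statistics import median


def check_scan_times(seconds_per_asset, median_limit=180, max_limit=600):
    """Pass if the median is under median_limit and no asset exceeds max_limit."""
    if not seconds_per_asset:
        return False  # an empty run proves nothing, so it fails
    return (median(seconds_per_asset) < median_limit
            and max(seconds_per_asset) <= max_limit)


def reconcile(source, target, fields):
    """Return ids of source records that are missing or mismatched in target."""
    failures = []
    for record_id, src in source.items():
        tgt = target.get(record_id)
        if tgt is None or any(src.get(f) != tgt.get(f) for f in fields):
            failures.append(record_id)
    return failures


# Illustrative data only.
source = {
    "r1": {"email": "a@example.com", "plan": "pro"},
    "r2": {"email": "b@example.com", "plan": "pro"},
    "r3": {"email": "c@example.com", "plan": "free"},
}
target = {
    "r1": {"email": "a@example.com", "plan": "pro"},
    "r2": {"email": "b@example.com", "plan": "free"},  # field mismatch
}

failed_ids = reconcile(source, target, ["email", "plan"])
print("sync criterion:", "pass" if not failed_ids else "fail", failed_ids)
print("speed criterion:", "pass" if check_scan_times([100, 120, 140, 590]) else "fail")
```

Returning the failing record ids, not just a boolean, gives a fail the evidence it needs: the result points at specific records and not at an argument.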
Eight to twelve rows is the healthy range for a two-week build. Fewer than six and the sheet is probably vague. More than fifteen and the workflow is probably two workflows.
What makes a criterion testable
Before signing, run every row through these checks. A single failing row weakens the whole sheet, because the vague row is where the dispute will live.
- Binary outcome. The verification ends in pass or fail. No "substantially complete," no percentage of done.
- One behavior per row. A criterion that tests accuracy and speed and security is three criteria. Split it so a failure points at one thing.
- Named verification. The row says who or what runs the check: a script, a load test, a named reviewer role. "Will be validated" names nobody.
- Frozen test data. The evaluation set exists, is agreed, and is versioned before the build begins. Both sides keep a copy.
- Client-executable. Your team can run the verification alone, in your environment, after the vendor leaves.
- Quantified threshold. Every pass condition contains a number or an exact expected state. Adjectives are negotiation debt.
- Bounded time. The verification itself completes inside an agreed window, typically one day, so acceptance cannot drift.
- Failure is defined as precisely as success. The row implies exactly what evidence demonstrates a fail, so a fail is a fact, not an argument.
The strongest test of the sheet: could a third party with no stake in the project run every verification and announce the results, and would both sides accept them without a meeting? If yes, you have acceptance criteria. If no, you have a slide.
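Some of these checks can be partly automated before the sheet reaches a signer. The sketch below is a rough first-pass filter, not a substitute for reading each row: the `Row` fields and the vague-word list are assumptions made for illustration, and a pass condition without a digit is only queried, since an exact expected state such as "access is denied" is legitimate.

```python
import re
from dataclasses import dataclass

# Hypothetical starter list; extend it with the adjectives your own drafts attract.
VAGUE_WORDS = {"fast", "robust", "substantially", "reasonable", "seamless"}


@dataclass
class Row:
    criterion: str
    verified_by: str      # who or what runs the check; empty if unnamed
    pass_condition: str


def lint_row(row: Row) -> list[str]:
    """Return a list of problems found in one row (empty list means none found)."""
    problems = []
    if not row.verified_by.strip():
        problems.append("no named verification")
    if not re.search(r"\d", row.pass_condition):
        problems.append("no number in pass condition; confirm it states an exact expected state")
    words = set(re.findall(r"[a-z-]+", row.pass_condition.lower()))
    vague = sorted(words & VAGUE_WORDS)
    if vague:
        problems.append("vague wording: " + ", ".join(vague))
    return problems


def lint_sheet(rows: list[Row]) -> list[str]:
    """Apply the row-count guidance and the per-row checks to a whole sheet."""
    problems = []
    if len(rows) < 6:
        problems.append("fewer than six rows: the sheet is probably vague")
    elif len(rows) > 15:
        problems.append("more than fifteen rows: probably two workflows")
    for i, row in enumerate(rows, start=1):
        problems += [f"row {i}: {p}" for p in lint_row(row)]
    return problems


draft = Row("Scanner is quick", "", "Scans are fast")
for problem in lint_row(draft):
    print(problem)
```

A clean lint result means only that the obvious gaps are closed. Whether a third party could run each verification unaided still needs a human read.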
The sign-off block
The last page of the sheet is the sign-off block. It exists so that the people with actual authority commit before the build, not after. It carries three signatures, dated, on day one, followed by two standing terms: the client obligations list and change control.
- Workflow owner. The business lead accountable for the workflow today. Their signature says the criteria describe success for the business, completely.
- Technical owner. The client-side IT or security lead. Their signature says the environment access, test data, and security checks listed are real and available.
- Delivery lead. The builder. Their signature says every criterion will pass within the sprint, in the client environment, or the fail terms apply.
- Client obligations, listed and dated. Environment access, service accounts, test data, and reviewer availability, each with a name and a date. A build blocked by a missing credential is a different conversation from a failed criterion, so the sheet separates them.
- Change control. After signing, criteria change only by written agreement of all three signers. A changed criterion restarts the two-week clock for the affected work. This protects both sides equally.
What happens on fail
The terms we sign are simple, and we recommend them to anyone buying an AI build. Acceptance runs at the end of the sprint, in the client environment, against the frozen test data. If every criterion passes, the invoice is due: $10,000 for the sprint. If any criterion fails, the builder may fix and re-run within the agreed acceptance window. If it still fails, you pay nothing. Not a reduced fee. Nothing.
Partial passes do not trigger partial payment, because a workflow that almost works still requires the manual process it was meant to replace. The only carve-out is the client obligations list: a criterion that could not be verified because an agreed access or data item was never provided is paused, not failed, and the clock extends by the days lost.
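The payment rule is small enough to write down as code. The sketch below assumes it runs once the acceptance window, including any fix-and-re-run, has closed. The names `Result` and `settle` are hypothetical, and treating a standing failure as decisive even when another criterion is paused is our reading for this example, not a term stated above.

```python
from enum import Enum


class Result(Enum):
    PASS = "pass"
    FAIL = "fail"
    PAUSED = "paused"  # could not be verified: an agreed client obligation was not met


SPRINT_FEE = 10_000  # USD per sprint, as stated in the terms


def settle(results: list[Result]) -> tuple[str, int]:
    """Return (status, amount_due) for one sprint after the acceptance window."""
    if not results:
        raise ValueError("a sheet with no criteria cannot be settled")
    if any(r is Result.FAIL for r in results):
        return ("failed", 0)         # any standing failure: nothing is owed
    if any(r is Result.PAUSED for r in results):
        return ("extended", 0)       # clock extends; nothing is due until verified
    return ("accepted", SPRINT_FEE)  # every criterion passed


print(settle([Result.PASS] * 10))
print(settle([Result.PASS] * 9 + [Result.FAIL]))
print(settle([Result.PASS] * 9 + [Result.PAUSED]))
```

There is no branch that returns a partial amount, which is the point: the function has three outcomes and only one of them produces an invoice.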
The sheet is the deal. Every criterion passes, you pay. Any criterion fails and stays failed, you pay nothing. Programs bigger than one workflow run the same sheet again, sprint after sprint, and either side can stop after any sprint.