Haim Ari

Building a Multi-Agent Verification Pipeline

Tags: verification · tutorial · multi-agent

Why Verification Matters for AI-Generated Code

AI agents are remarkably good at generating code that looks correct. The syntax is valid, the logic reads well, and the structure follows established patterns. But looking correct and being correct are different things.

An agent refactoring an authentication module might rename a function from validate_token to verify_token in the implementation file — but miss the three call sites in other modules. An agent adding a new API endpoint might forget to register the route in the router. An agent writing a utility function might introduce a type mismatch that the language server would catch instantly but the agent never consulted.

These aren't rare edge cases. In our testing, roughly 15–20% of AI-generated changesets that pass a surface-level review contain issues that automated verification catches. The code compiles in isolation but fails when integrated. It passes the happy path but breaks on edge cases the agent didn't consider.

This is why dkod doesn't treat verification as an optional add-on. It's a core part of the agent protocol — every changeset must pass through the verification pipeline before it can be merged.

The Verification Flow

Here's the complete lifecycle of a changeset, from submission to merge:

[Diagram: the changeset lifecycle, from submission through verification to merge]

The flow is straightforward: the agent calls dk commit to submit its changes, then dk check to trigger verification. The platform copies the full repository into an isolated temp directory, overlays the agent's changed files on top, and runs the pipeline. Results come back as structured data — not raw log output — so the agent can parse failures and fix them programmatically.
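The copy-and-overlay step can be sketched as follows. This is a simplified illustration of the idea, not dkod's actual implementation; the stand-in command at the end represents the three gates:

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

def verify_changeset(repo: Path, overlay: dict[str, str]) -> subprocess.CompletedProcess:
    """Copy the repo to a temp dir, apply the agent's changed files, run checks."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp) / "worktree"
        # 1. Full copy of the repository, so verification never touches the original.
        shutil.copytree(repo, workdir)
        # 2. Overlay the agent's changed files on top of the copy.
        for rel_path, contents in overlay.items():
            target = workdir / rel_path
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(contents)
        # 3. Run the pipeline inside the isolated copy (here: a no-op stand-in command).
        return subprocess.run(["true"], cwd=workdir, capture_output=True)
```

Because the overlay is applied to a throwaway copy, a failed run leaves both the repository and the agent's working state untouched.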

The Three Gates

dkod's verification pipeline runs three categories of checks, in order. Each gate must pass before the next one runs.

Gate 1: Lint

The lint gate catches style violations, unused imports, dead code, and common anti-patterns. It runs fast — typically under 10 seconds — and catches the low-hanging issues before more expensive checks run.

For a Rust project, this means cargo clippy. For TypeScript, eslint or biome. For Python, ruff or flake8. dkod auto-detects the project type and runs the appropriate linter, or you can configure a custom command in .dkod/pipeline.yaml.

Why lint first? Because agents frequently introduce lint issues that would cascade into confusing test failures. Catching them early keeps the feedback loop tight.

Gate 2: Type-Check

The type-check gate validates that the changeset is structurally sound — types align, function signatures match their call sites, imports resolve correctly. This is the gate that catches the renamed-function-but-missed-call-sites problem.

For Rust: cargo check. For TypeScript: tsc --noEmit. For Go: go build ./.... For Python with type hints: mypy or pyright.

Type-checking is the single most valuable gate for AI-generated code. Agents operate on partial context — they see the files they're working on but often miss the ripple effects of their changes across the codebase. The type checker sees everything.
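The renamed-function failure mode looks like this in miniature. A contrived Python example: a type checker such as mypy or pyright flags the stale call site statically, before any code runs (the `try/except` exists only to make the broken path observable at runtime):

```python
# auth.py (after the agent's rename): the implementation was updated...
def verify_token(token: str) -> bool:
    """Renamed from validate_token; returns whether the token is well-formed."""
    return token.startswith("tok_")

# handlers.py: ...but this call site still uses the old name. A type checker
# reports that validate_token is not defined, without executing anything.
def handle_request(token: str) -> bool:
    try:
        return validate_token(token)  # stale name: caught statically
    except NameError:
        return False
```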

Gate 3: Test

The test gate runs the project's test suite — but not the full suite. dkod uses changeset-aware test execution: it analyzes which files changed, determines which test files are affected, and runs only those tests.

When changeset_aware is enabled in the pipeline config, dkod scopes the test command to affected packages:

  • Rust: cargo test -p affected-crate-1 -p affected-crate-2
  • TypeScript (Bun): bun test src/affected-dir
  • Python: pytest affected_package1 affected_package2
  • Go: go test ./affected/pkg/...

This keeps verification fast — typically under 30 seconds even on large codebases — while still catching regressions in the code the agent actually touched.
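A minimal version of changeset-aware scoping maps each changed file to its enclosing package directory and deduplicates. This sketch only illustrates the idea; dkod's actual analysis of affected tests is more involved:

```python
from pathlib import PurePosixPath

def affected_packages(changed_files: list[str], depth: int = 1) -> list[str]:
    """Map changed file paths to their top-level package directories, deduplicated."""
    packages: list[str] = []
    for path in changed_files:
        parts = PurePosixPath(path).parts
        if len(parts) <= depth:
            continue  # file at the repo root: no package to scope to
        package = "/".join(parts[:depth])
        if package not in packages:
            packages.append(package)
    return packages

def scoped_test_command(changed_files: list[str]) -> list[str]:
    """Build a pytest invocation limited to the affected packages."""
    return ["pytest", *affected_packages(changed_files)]
```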

Structured Feedback and the Retry Loop

When a gate fails, dkod doesn't just return "verification failed." It returns structured data that tells the agent exactly what went wrong:

  • Which gate failed (lint, type-check, or test)
  • Which files have issues
  • The specific errors — lint rule violations, type mismatches, test assertion failures
  • Exit codes and relevant output from each step
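A failure report might look something like this. The shape below is hypothetical, for illustration only; see the Verification Pipeline docs for dkod's actual schema:

```json
{
  "status": "failed",
  "gate": "type-check",
  "exit_code": 1,
  "failures": [
    {
      "file": "src/handlers.rs",
      "line": 42,
      "message": "cannot find function `validate_token` in this scope"
    }
  ]
}
```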

This structured feedback is what makes the retry loop work. The agent reads the failure data, understands what broke, fixes the issues in its overlay, and resubmits. No human intervention required.
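The retry loop itself can be sketched like this, with hypothetical `submit`, `check`, and `fix` hooks standing in for the agent's dk calls and its editing logic:

```python
def verify_with_retries(submit, check, fix, max_attempts: int = 3) -> bool:
    """Submit a changeset, then fix-and-resubmit until the pipeline passes."""
    submit()
    for _ in range(max_attempts):
        report = check()          # structured result, e.g. {"status": ..., "failures": [...]}
        if report["status"] == "passed":
            return True
        fix(report["failures"])   # agent patches its overlay based on the failure data
        submit()                  # resubmit the corrected changeset
    return False
```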

In practice, most agents fix lint and type-check failures on the first retry. Test failures sometimes take two or three attempts — but even then, the total time from first submission to successful merge is usually under two minutes. Compare that to a traditional workflow where a CI failure means waiting 10-30 minutes for the full pipeline, reading logs, making a fix, pushing, and waiting again.

Configuring Your Pipeline

dkod resolves the verification pipeline in priority order:

  1. .dkod/pipeline.yaml in your repository (highest priority)
  2. Database pipeline configured via the API
  3. Auto-detected from project files (lowest priority)

For most projects, auto-detection works well — dkod looks at your project files (Cargo.toml, package.json, pyproject.toml, go.mod) and picks sensible defaults. But for fine-grained control, you can define a custom pipeline:

```yaml
pipeline:
  name: my-project
  timeout: 5m
  stages:
    - name: checks
      parallel: true
      steps:
        - name: lint
          run: cargo clippy
          timeout: 30s
          changeset_aware: true
        - name: typecheck
          run: cargo check
          timeout: 60s
          changeset_aware: true
    - name: test
      steps:
        - name: test
          run: cargo test
          timeout: 3m
          changeset_aware: true
```

The parallel flag lets you run lint and type-check simultaneously, cutting wall-clock time. The changeset_aware flag scopes each command to only the affected packages.
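Parallel stage execution is ordinary fan-out/fan-in. A thread-based sketch of the idea (the real runner executes subprocesses, not Python callables):

```python
from concurrent.futures import ThreadPoolExecutor

def run_stage(steps: dict, parallel: bool = False) -> dict[str, bool]:
    """Run each named step; with parallel=True, overlap them and collect all results."""
    if not parallel:
        return {name: step() for name, step in steps.items()}
    with ThreadPoolExecutor(max_workers=len(steps)) as pool:
        # Fan out: start every step at once. Fan in: wait for each result.
        futures = {name: pool.submit(step) for name, step in steps.items()}
        return {name: fut.result() for name, fut in futures.items()}
```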

Why Not Just Use CI?

Traditional CI (GitHub Actions, GitLab CI, Jenkins) runs after code is pushed. dkod's verification runs before code is merged — it's part of the agent's workflow, not an external check that runs minutes later.

The differences compound:

  • Speed: dkod verification takes < 30 seconds (changeset-aware, targeted tests). CI typically takes 5-30 minutes (full suite, cold builds).
  • Feedback format: dkod returns structured data agents can parse. CI returns logs meant for humans to read.
  • Retry cost: dkod retries are instant — the agent fixes the issue and resubmits within the same session. CI retries mean a new push, a new pipeline, and another 5-30 minute wait.
  • Isolation: dkod verification runs in an isolated copy with the agent's overlay applied. CI runs on whatever the branch looks like after a push — which might include other changes that interfere.

This doesn't mean you should remove CI. dkod's verification pipeline is the fast inner loop for agents. CI remains the final safety net that runs the full test suite, integration tests, and deployment checks. The two are complementary.

Putting It All Together

The verification pipeline is the last piece of the agent-native workflow:

  1. dk init — agent opens a session on the codebase
  2. Agent writes code — reading context with dk cat and dk search, making changes with dk add
  3. dk commit — agent submits its changeset with an intent description
  4. dk check — pipeline runs lint, type-check, and test gates
  5. Agent fixes issues — if verification fails, the agent reads structured feedback and retries
  6. dk push — once verification passes, the changeset merges into a Git commit

Every step is fast. Every step returns structured data. And every step is designed for machines, not humans. The result: AI agents that ship verified, tested code — without a human babysitting every change.


This is the fifth post in our series on agent-native development. Previously: Introducing dkod. For full pipeline configuration reference, see the Verification Pipeline docs.
