Part of the DEPLOY Method — Engineer phase
Almost nobody has shipped a multi-agent system at production scale. The distance between a LangGraph demo that works in a notebook and a system that runs for paying customers is where most teams stall, and they stall for reasons that are not obvious until you've built one. This engagement covers the ENGINEER and PILOT phases of the DEPLOY Method, compressed into twelve embedded weeks for teams that already have an agent prototype with real users and need to industrialize it. I architected Auralink: 1.7 million lines of production code, roughly 20 autonomous agents resolving 78% of incidents without human intervention, documented in a peer-reviewed paper available on arXiv. Few comparable multi-agent systems exist in production today. The work I will do with your team is the same work I did with mine, adapted to your codebase, your agents, and your operational constraints. I've shipped eight AI ventures to production. I know which decisions can be deferred and which ones will bite you six weeks after launch if you skip them now.
Every agent demo works in a notebook and falls apart under concurrent production traffic. The tutorial uses synchronous calls, a single happy-path trajectory, and mocked tools. Production runs dozens of agent sessions in parallel, each making real tool calls with real failure modes, and the naive orchestration pattern that looked clean in the demo becomes a thundering herd of retries, deadlocks, and half-committed state. Your team knows this is a problem but lacks the reference architecture for solving it.
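Two primitives go a long way toward preventing the thundering herd described above: a cap on concurrent in-flight tool calls, and jittered exponential backoff so failed calls don't retry in lockstep. A minimal sketch, assuming asyncio-based tools; the function and constant names here are illustrative, not from any particular framework:

```python
import asyncio
import random

# Illustrative cap on parallel tool calls across all agent sessions.
MAX_CONCURRENT_TOOL_CALLS = 8
_semaphore = asyncio.Semaphore(MAX_CONCURRENT_TOOL_CALLS)

async def call_tool_with_backoff(tool, payload, retries=4, base_delay=0.5):
    """Call an async tool with bounded concurrency and jittered backoff."""
    async with _semaphore:  # never exceed the concurrency cap
        for attempt in range(retries):
            try:
                return await tool(payload)
            except Exception:
                if attempt == retries - 1:
                    raise  # out of retries: surface the failure
                # Full jitter spreads retries out so concurrent failures
                # don't stampede the same downstream dependency together.
                delay = random.uniform(0, base_delay * 2 ** attempt)
                await asyncio.sleep(delay)
```

The jitter matters more than the backoff: without it, a dependency outage synchronizes every session's retry clock, and recovery traffic arrives as one spike.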
Your eval strategy for single-turn LLM calls does not extend to multi-step agent trajectories. You can evaluate a prompt. You cannot yet evaluate a 14-step plan where the fifth step chose the wrong tool, the ninth step passed the wrong argument, and the final answer was still technically correct. Failure modes in agent trajectories compound across steps, and evaluation methodology built for single-turn work produces misleading scores. Without trajectory-level evaluation you cannot tell whether a model update improved or regressed the system, and you cannot ship with confidence.
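Catching the wrong-tool-at-step-five and wrong-argument-at-step-nine failures means scoring each step against a recorded ground-truth trajectory rather than grading only the final answer. A minimal sketch of that idea, assuming each step is a dict with a tool name and its arguments; the function name and record shape are hypothetical:

```python
def score_trajectory(actual, expected):
    """Score an agent trajectory step by step against a ground truth.

    Each step in `actual` and `expected` is {'tool': str, 'args': dict}.
    Returns per-step verdicts plus an overall step accuracy, so a report
    can say 'step 5 chose the wrong tool' instead of 'the answer was bad'.
    """
    results = []
    for i, exp in enumerate(expected):
        if i >= len(actual):
            # Trajectory ended early: every missing step counts as failed.
            results.append({"step": i, "tool_ok": False, "args_ok": False})
            continue
        act = actual[i]
        results.append({
            "step": i,
            "tool_ok": act["tool"] == exp["tool"],   # right tool chosen?
            "args_ok": act["args"] == exp["args"],   # right arguments passed?
        })
    passed = sum(r["tool_ok"] and r["args_ok"] for r in results)
    return {"steps": results, "step_accuracy": passed / len(expected)}
```

Exact-match on arguments is deliberately strict; in practice you would relax it per tool (e.g. semantic match on free-text arguments), but the per-step structure is the point.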
Cost-per-task explodes unpredictably because each agent step multiplies token burn. A single user request triggers a plan, which triggers tool calls, which trigger sub-agents, which trigger more tool calls. Your per-session token count is now 40x that of a regular LLM call, and your CFO wants a model that explains why one power user cost €18 in tokens last Tuesday. You have no instrumentation to answer that: no per-step token accounting, no routing logic that picks cheaper models for easier steps, no budget caps that fail gracefully when a session runs away.
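The instrumentation gap is concrete: every step needs to record its token usage against the session, and the session needs a hard budget that trips before a runaway trajectory burns the month's allocation. A minimal sketch of such a ledger; the class names (`StepUsage`, `SessionBudget`, `BudgetExceeded`) are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class StepUsage:
    """Token usage for one agent step, attributed to a named component."""
    step: str
    model: str
    prompt_tokens: int
    completion_tokens: int

class BudgetExceeded(Exception):
    """Raised when a session blows its cap; the caller degrades gracefully."""

@dataclass
class SessionBudget:
    max_tokens: int
    steps: list = field(default_factory=list)

    def record(self, usage: StepUsage):
        self.steps.append(usage)
        if self.total() > self.max_tokens:
            raise BudgetExceeded(
                f"session used {self.total()} of {self.max_tokens} tokens")

    def total(self) -> int:
        return sum(u.prompt_tokens + u.completion_tokens for u in self.steps)

    def by_step(self) -> dict:
        # Cost attribution: answers "which step burned the tokens?"
        out = {}
        for u in self.steps:
            out[u.step] = out.get(u.step, 0) + u.prompt_tokens + u.completion_tokens
        return out
```

The `by_step` breakdown is what turns the CFO's €18 question from a shrug into a one-line query, and the same per-step records feed a router that sends cheap steps to cheap models.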
When an agent does something wrong in production you have no observability stack that tells you which step, which prompt, which tool call caused it. The user complains that 'the agent gave a weird answer.' Your logs show the final response and nothing else. You cannot reproduce the trajectory because the agent is non-deterministic. You cannot tell whether the bug is in the planner, the tool router, the retrieval layer, or a specific prompt template. Every incident becomes a multi-day forensics exercise and your team loses confidence in the system faster than users do.
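The fix for multi-day forensics is a structured trace event per step, keyed by session, capturing the prompt, the completion, and any tool call, so the whole trajectory can be replayed after the fact. A minimal sketch of that record shape; the field names are illustrative, and in production the sink would be a tracing backend rather than a file:

```python
import json
import time
import uuid

def make_trace_event(session_id, step_index, component, prompt, completion,
                     tool_name=None, tool_input=None, tool_output=None,
                     tokens=0, latency_ms=0):
    """Build one structured trace record for a single agent step."""
    return {
        "event_id": str(uuid.uuid4()),
        "session_id": session_id,       # links the step to the user session
        "step": step_index,
        "component": component,         # planner / router / retrieval / tool
        "prompt": prompt,
        "completion": completion,
        "tool": {"name": tool_name, "input": tool_input, "output": tool_output},
        "tokens": tokens,
        "latency_ms": latency_ms,
        "ts": time.time(),
    }

def emit(event, sink):
    # Append-only JSONL; swap for an OpenTelemetry exporter in production.
    sink.write(json.dumps(event) + "\n")
```

With this in place, "the agent gave a weird answer" becomes a session ID, and the session ID becomes an ordered list of every prompt, completion, and tool call that produced the answer.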
The engagement runs in four phases over twelve weeks. I work embedded with your engineering team — your engineers build, I bring the topology decisions, the eval methodology, and the observability patterns from Auralink. No work happens on a consulting slide. By the end of week twelve your team operates the system without me.
I go deep on your current prototype — the agent graph, the tool inventory, the state management, the failure modes you've already hit. I produce a written topology design: which agents, which responsibilities, which communication patterns, which state boundaries, which failure-isolation zones. The design is specific to your domain and your codebase, not a reference architecture copy-pasted from a blog post. By end of week three your team has a blueprint they can defend to a senior reviewer and a migration path from the current prototype that does not require a rewrite.
Your engineers implement the topology. I work alongside them on the harder calls — the orchestration primitives, the concurrency strategy, the state machine for long-running sessions, the retry and compensation logic for tool failures. We ship incrementally against real traffic from week five onwards, not a big-bang cutover in week seven. By end of week seven the new topology is serving production traffic and the old prototype has been decommissioned.
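The compensation logic mentioned above usually takes a saga-like shape: each completed tool call registers an undo action, and when a later step fails, the completed steps are rolled back in reverse order. A minimal sketch under that assumption; the class name `ToolSaga` and the example actions are hypothetical:

```python
class ToolSaga:
    """Track compensations for completed tool calls; undo on abort."""

    def __init__(self):
        self._compensations = []

    def run(self, action, compensation):
        """Execute a tool action; if it succeeds, remember how to undo it."""
        result = action()
        self._compensations.append(compensation)
        return result

    def abort(self):
        # Roll back completed steps, most recent first, so later steps
        # are undone before the earlier steps they depended on.
        while self._compensations:
            self._compensations.pop()()
```

Real compensations are rarely perfect inverses (you can cancel a page, but you can only annotate a sent email), so part of the topology work is deciding which tool calls are compensable and isolating the ones that are not.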
The third phase builds trajectory-level evaluation on the patterns I developed for Auralink: per-step evaluation, ground-truth trajectories for regression testing, LLM-as-judge with calibrated prompts, and the statistical methodology that lets you say 'this model update improved the system by 4.2% with p < 0.01' rather than 'the new version feels better.' It also adds per-step token accounting and cost-per-task dashboards so your CFO can answer the questions that will come. Your team runs the eval on every change from week nine onwards.
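The statistical claim above needs a significance test over per-task scores, not a comparison of two averages. One simple, dependency-free way to get there is a paired sign-flip permutation test: under the null hypothesis that the model update changed nothing, the sign of each per-task score difference is arbitrary. A minimal sketch of that idea (this is one standard choice, not necessarily the exact methodology used on Auralink):

```python
import random

def paired_permutation_test(old_scores, new_scores, n_resamples=10_000, seed=0):
    """Paired sign-flip permutation test on per-task score differences.

    Returns (observed mean improvement, two-sided p-value).
    """
    rng = random.Random(seed)
    diffs = [n - o for o, n in zip(old_scores, new_scores)]
    observed = sum(diffs) / len(diffs)
    extreme = 0
    for _ in range(n_resamples):
        # Under H0, flipping the sign of any per-task diff is equally likely.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= abs(observed):
            extreme += 1
    return observed, extreme / n_resamples
```

Because the test is paired on tasks, it controls for the fact that some tasks are simply harder than others, which is exactly what a naive comparison of two run averages fails to do.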
The final phase delivers the observability stack your on-call engineer will use when the pager goes off at 3am: trajectory traces linked to user sessions, per-step prompts and completions, tool-call inputs and outputs, token accounting, latency breakdowns, and cost attribution. It also delivers runbooks for the top-10 incident types your system will produce, and working sessions with your SRE team so they own the alerting thresholds, the dashboards, and the incident response playbooks. When I leave, your team operates the system. No retainer, no ongoing dependency.
Enterprise technology organizations and Series-B+ startups with an agent prototype that has real users, budget for a 12-week embedded engagement, and an engineering team with the capacity to own the system after handoff. Product teams where the CTO or VP Engineering has already hit the wall between 'agent demo works' and 'agent system operates' and knows the gap is a topology problem, an eval problem, and an observability problem — not a prompt-engineering problem. This is not for teams without LLM production experience — they need the Readiness Audit or the Strategy Sprint first. It is also not for teams without an existing codebase; the engagement assumes a prototype to industrialize, not a greenfield build.
Which orchestration framework you use matters less than you might expect. The framework is a vehicle; the decisions that matter are the topology, the state management, the eval methodology, and the observability. I've worked across the major frameworks and across custom orchestration code. In week one I assess whether your current framework is the right vehicle for where you're going; sometimes the answer is yes and we build on it, sometimes the answer is that a specific bottleneck argues for a migration. I make that call with evidence, not based on which framework has the best marketing.
A senior AI engineer you hire in 2026 has probably not shipped a production multi-agent system because almost nobody has. I've done it once, at 1.7 million lines of code and 78% autonomous resolution. The pattern recognition is not available on the contractor market yet. Your engineers do the implementation; I bring the topology decisions, the eval methodology, and the observability patterns that would otherwise take them three iterations and twelve months to learn. When I leave, your team owns everything and does not need me again.
The twelve weeks cannot be compressed. Agent topology, the eval harness, and observability are each a three-week problem done well and a one-week problem done badly. The compressed version produces a system that runs until it doesn't, and the debugging cost in month four exceeds the consulting savings in month one. If you do not have twelve weeks, the right engagement is the Pilot-to-Production Hardening service, which covers the production-readiness work without the full topology redesign. I will recommend that honestly if it's the right fit.
A full rewrite is almost never needed. In the engagements I've run, the topology design preserves 60-80% of the existing code and changes the orchestration layer, the state boundaries, and the failure-isolation patterns. The business logic your team wrote is usually fine; what needs to change is how the agents coordinate, how state is managed, and how failures are handled. Full rewrites are a sign of a consultant who does not want to read your code. I read your code.
The 78% is a measured number from Auralink's production system, reported in the arXiv paper. 78% of incidents assigned to the agent pool are resolved without a human in the loop — which includes the cases where an agent correctly escalates, not just the cases where it solves end-to-end. The methodology for measuring it is part of what I bring to your engagement. Every team I've worked with ends up with a different number because their task profile is different; the point is not to replicate 78%, it's to build the measurement infrastructure that tells you what your real number is.
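The definition above is worth pinning down in code, because the escalation clause changes the number materially: an incident counts as autonomously handled if the agent either resolved it end-to-end or escalated correctly, and a wrong escalation or a failed fix counts against the agent. A minimal sketch of that metric under the stated definition; the field names are illustrative:

```python
def autonomous_resolution_rate(incidents):
    """Fraction of incidents handled without a human in the loop.

    An incident counts if the agent resolved it end-to-end, or if it
    escalated and the escalation was judged correct. Wrong escalations
    and failed fixes count against the rate.
    """
    handled = sum(
        1 for i in incidents
        if i["outcome"] == "resolved"
        or (i["outcome"] == "escalated" and i.get("escalation_correct", False))
    )
    return handled / len(incidents)
```

Making "correct escalation" a first-class outcome is the key measurement decision: it rewards the agent for knowing its limits instead of penalizing every handoff as a failure.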
Let's discuss how this service addresses your specific challenges and delivers real results.