From POC to scale: long-running AI workflows
In one of the most recent client projects I worked on, what I was building wasn’t an ordinary chatbot. It wasn’t even a single agent. It was a process. A sequence of decisions and actions a team of humans used to do over hours or days, with branches, reviews, and pauses. The interesting questions stopped being about prompts and tools a while back. They’re about how you run something that takes a long time, fails halfway through, and has to be safe to retry.
I keep reaching for the same mental model. The workflow is a graph. Nodes are typed. The runtime walks the graph. Everything else, from how state is stored to how concurrency is handled, falls out of that one decision.
This is the version of the post I wish someone had handed me a year ago. What I mean by graph. What works in a POC. Where the POC breaks. What production actually looks like once a client is depending on it.
Workflows as graphs
A workflow is a directed graph the team draws once. Nodes are units of work with a typed input and a typed output. Edges are transitions, optionally guarded by a condition on the previous node’s output. I use four node kinds: det (deterministic code), llm (a single model call with a schema in and a schema out), human (parks the run and waits for a review), and effect (does something the outside world can see).
The graph is data, not code. Here is a case-triage workflow as a definition file:
id: case-triage
version: 3
nodes:
  - id: classify
    kind: det
    out: ClassifyResult
    retry: { max: 3, backoff: exponential }
  - id: enrich
    kind: llm
    model: opus-4-7
    out: EnrichResult
    retry: { max: 4, backoff: jitter }
  - id: fetch_refs
    kind: det
    out: RefBundle
  - id: decide
    kind: llm
    model: sonnet-4-6
    out: Decision
  - id: review
    kind: human
    queue: ops-l2
    sla_hours: 48
    out: ReviewOutcome
  - id: handoff
    kind: effect
    target: crm.create_case
    compensate: crm.cancel_case
edges:
  - { from: classify, to: enrich, when: "category == 'incident'" }
  - { from: classify, to: fetch_refs }
  - { from: [enrich, fetch_refs], to: decide }
  - { from: decide, to: review, when: "confidence < 0.8" }
  - { from: decide, to: handoff, when: "confidence >= 0.8" }
  - { from: review, to: handoff, when: "approved" }
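Because the definition is data, it can be checked before any run starts. A minimal sketch of such a check, assuming the file has already been parsed into a plain dict; `validate`, `Edge`, and `NODE_KINDS` are names I made up for illustration, not part of any runtime:

```python
from dataclasses import dataclass
from typing import Optional

NODE_KINDS = {"det", "llm", "human", "effect"}


@dataclass(frozen=True)
class Edge:
    sources: tuple              # one node id, or several for a join
    target: str
    when: Optional[str] = None  # guard expression, kept as text here


def validate(definition: dict) -> list:
    """Check node ids, node kinds, and edge endpoints; return normalised edges."""
    ids = [n["id"] for n in definition["nodes"]]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate node id")
    for n in definition["nodes"]:
        if n["kind"] not in NODE_KINDS:
            raise ValueError(f"unknown kind {n['kind']!r} on node {n['id']!r}")

    known = set(ids)
    edges = []
    for e in definition["edges"]:
        src = e["from"]
        # `from` may be a single id or a list of ids (a join).
        sources = tuple(src) if isinstance(src, list) else (src,)
        for name in (*sources, e["to"]):
            if name not in known:
                raise ValueError(f"edge references unknown node {name!r}")
        edges.append(Edge(sources, e["to"], e.get("when")))
    return edges


# A trimmed version of the case-triage definition, already parsed into a dict.
definition = {
    "id": "case-triage",
    "version": 3,
    "nodes": [
        {"id": "classify", "kind": "det"},
        {"id": "enrich", "kind": "llm"},
        {"id": "fetch_refs", "kind": "det"},
        {"id": "decide", "kind": "llm"},
    ],
    "edges": [
        {"from": "classify", "to": "enrich", "when": "category == 'incident'"},
        {"from": "classify", "to": "fetch_refs"},
        {"from": ["enrich", "fetch_refs"], "to": "decide"},
    ],
}

edges = validate(definition)
```

The guard expressions stay as text in this sketch. Evaluating them safely is a separate problem and depends on the expression language the runtime picks.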
The runtime that walks this graph is a separate concern. When the runtime gets harder, the definition doesn’t move. Every node implements the same small contract:
from typing import Protocol, runtime_checkable

from pydantic import BaseModel


class Context(BaseModel):
    run_id: str
    node_id: str
    attempt: int
    trace_id: str


@runtime_checkable
class Node(Protocol):
    """Every workflow node implements this.

    `compensate` is optional and only meaningful for effect nodes."""

    kind: str  # "det" | "llm" | "human" | "effect"

    def idempotency_key(self, inp: BaseModel, ctx: Context) -> str: ...

    async def run(self, inp: BaseModel, ctx: Context) -> BaseModel: ...
idempotency_key surfaces a question that gets murkier the longer you ignore it: what does it mean to run this node twice? For det and llm nodes the key is usually a hash of the input. For effect nodes the key is the bridge to the outside world. We’ll come back to it.
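One way the "hash of the input" key could look, as a sketch rather than the implementation. It assumes the input has already been dumped to a JSON-serialisable dict (with pydantic v2 that would be something like `model_dump(mode="json")`); the function name `input_hash_key` and the 16-character truncation are my own choices:

```python
import hashlib
import json


def input_hash_key(node_id: str, inp: dict) -> str:
    """Derive a stable key from a node id and a JSON-serialisable input."""
    # sort_keys and fixed separators give a canonical encoding, so inputs
    # that are logically equal hash to the same key regardless of key order.
    canonical = json.dumps(inp, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{node_id}:{digest[:16]}"


a = input_hash_key("classify", {"subject": "db down", "priority": 2})
b = input_hash_key("classify", {"priority": 2, "subject": "db down"})
c = input_hash_key("classify", {"subject": "db down", "priority": 3})
```

Here `a` and `b` are the same key and `c` is a different one. The canonical encoding is the part that matters: without it, two workers that serialise the same input in a different field order would disagree about whether the work was already done.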
This is the part people skip when they jump straight to a free-form ReAct loop. ReAct treats the trajectory as something the model invents at runtime. That’s fine for a chatbot. For work where the right answer is “do these eight things in this order, branch if the third one comes back ambiguous, and pause for a human if the confidence is below 0.8”, you want that shape pinned down before the model runs.
The POC pattern
The first version of every one of these systems looks the same. A single process holds the graph in memory. It walks the nodes in order, calls the model inline for llm nodes, runs the code inline for det nodes, holds the partial state in a dictionary. human nodes are a TODO. effect nodes are mocked.
You demo it and it works, because the constraints you’ve quietly chosen are doing all the work: small graph, short horizon, your own input, the model behaves. The trap is reading the success of the POC as evidence that the architecture is sound. It isn’t. It’s reasonable for the size of the problem. Each constraint becomes a hole the moment the workflow is real.
Where the POC breaks
Per-node success rates aren’t 100%. Call them 99% for a tuned llm node and higher for det. A 20-node workflow at 99% per node lands at 0.99^20 ≈ 0.82 end to end. A three-nines per-node target gets you to 0.999^20 ≈ 0.98. The headline isn’t the number. It’s that POC code which assumes everything succeeds is reasoning about a regime that doesn’t exist.
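The arithmetic is short enough to keep around as a helper. A sketch that assumes node failures are independent, which real systems only approximate; `with_retries` is an extra I added to show why per-node retries move the end-to-end number so much:

```python
def end_to_end(per_node: float, nodes: int) -> float:
    """Probability that every node succeeds, assuming independent failures."""
    return per_node ** nodes


def with_retries(per_node: float, attempts: int) -> float:
    """Probability a node succeeds within `attempts` independent tries."""
    return 1 - (1 - per_node) ** attempts


print(round(end_to_end(0.99, 20), 2))    # 0.82
print(round(end_to_end(0.999, 20), 2))   # 0.98
# Three attempts per node at 99% each lifts the 20-node figure above 0.9999.
print(end_to_end(with_retries(0.99, 3), 20) > 0.9999)  # True
```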
The same arithmetic recurs for every failure surface the POC fused into one process: latency, memory, concurrency, idempotency, observability, and human review. Each one is a separate fix.
| Failure mode | POC behaviour | Production fix |
|---|---|---|
| The run outlives the process | a crash or deploy drops in-flight work | runtime is durable; state is written between every node |
| Failures compound across nodes | one transient error fails the run | typed retries per node, with per-node SLOs |
| State lives only in process memory | restart equals restart from zero | every node’s input and output is persisted alongside the run state |
| Concurrency contends with the model | parallel runs hit the same rate limit at the same time | one model gateway holds rate limit and budget for all callers |
| Effects can’t be safely re-run | retry sends a second email | idempotency key on (run_id, node_id, attempt), compensations for non-idempotent effects |
| Humans don’t reply on function-call clocks | the run blocks for hours or days | review is a node type; the runtime parks the run and resumes it on an event |
The pattern across the rows is the same. The POC stored state in the wrong place, treated failure as exceptional, and confused “the orchestrator is running” with “the workflow is alive”.
What production looks like
The fix isn’t bigger boxes. It’s separating two things the POC fused: the workflow definition and the workflow runtime.
Definitions are versioned data. The graph for a workflow lives in a store with a version. A run is bound to a version forever. A deploy that changes the graph doesn’t change the meaning of an in-flight run.
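A small sketch of what "bound to a version forever" can mean in code. The store here is an in-memory dict standing in for whatever database holds the definitions, and `DefinitionStore`, `Run`, and their methods are illustrative names, not an existing API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    run_id: str
    workflow_id: str
    version: int  # pinned when the run starts, never updated


class DefinitionStore:
    """In-memory stand-in for a versioned definition store."""

    def __init__(self):
        self._defs = {}  # (workflow_id, version) -> graph

    def publish(self, workflow_id, version, graph):
        key = (workflow_id, version)
        if key in self._defs:
            # Published versions are immutable; a change means a new version.
            raise ValueError(f"{workflow_id} v{version} already published")
        self._defs[key] = graph

    def latest(self, workflow_id):
        return max(v for (wid, v) in self._defs if wid == workflow_id)

    def for_run(self, run):
        # Always resolve through the run's pinned version, not the latest.
        return self._defs[(run.workflow_id, run.version)]


store = DefinitionStore()
store.publish("case-triage", 3, {"nodes": ["classify", "decide"]})
run = Run("run-1", "case-triage", store.latest("case-triage"))

# A deploy publishes a new graph while run-1 is still in flight.
store.publish("case-triage", 4, {"nodes": ["classify", "enrich", "decide"]})
```

After the second publish, `store.for_run(run)` still returns the version 3 graph, and only new runs pick up version 4.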
State is durable, written after every node transition. The run is identified by an opaque id. Every node’s input and output is persisted alongside the state, so you can reconstruct exactly what happened.
Node executions are messages on a queue. The runtime doesn’t loop through the graph in process. It enqueues “execute node n of run r at attempt k” as a task. A pool of stateless workers picks tasks up. Crashes become boring. The task is redelivered, another worker handles it.
Workers are stateless. The contract is the same everywhere. Derive the idempotency key, check the state store, do the work only if there is no recorded outcome.
async def execute(task: Task) -> None:
    key = f"{task.run_id}:{task.node_id}:{task.attempt}"
    node = registry.get(task.node_id)

    # Already finished this exact attempt? Return the recorded outcome.
    if outcome := await state.outcome(key):
        await queue.ack(task, outcome)
        return

    # Record the intent before we touch the outside world.
    await state.write_intent(key, inp=task.input)

    try:
        out = await node.run(task.input, ctx=task.ctx)
    except TransientError:
        delay = backoff(task.attempt)  # exponential, capped, jittered
        await queue.retry(task, delay=delay)
        return

    await state.write_outcome(key, out=out)
    await queue.ack(task, out)
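The worker leaves `backoff` undefined. One possible shape for "exponential, capped, jittered" is the full-jitter variant sketched below. The defaults (`base` of one second, `cap` of sixty) and the 1-based attempt numbering are assumptions of mine, not values from the system described here:

```python
import random
from typing import Optional


def backoff(attempt: int, base: float = 1.0, cap: float = 60.0,
            rng: Optional[random.Random] = None) -> float:
    """Delay in seconds before retry number `attempt` (assumed 1-based)."""
    if rng is None:
        rng = random.Random()
    # Clamp the exponent so a very large attempt number cannot overflow.
    exponent = min(max(attempt - 1, 0), 32)
    # Exponential growth: base, 2*base, 4*base, ... capped at `cap`.
    ceiling = min(cap, base * 2 ** exponent)
    # Full jitter: pick uniformly in [0, ceiling] so retries spread out
    # instead of arriving at the provider in synchronised waves.
    return rng.uniform(0.0, ceiling)


delays = [backoff(n, rng=random.Random(n)) for n in range(1, 11)]
```

Passing a seeded `rng` keeps the delays reproducible in tests. In a worker you would normally leave it unset.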
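The intent record is what turns the worker loop from “retry safe in the common case” into “retry safe under crashes”. Consider an effect node that crashes after the side effect but before the outcome is written. The task is redelivered, and the next worker finds an intent with no outcome. That combination means the effect may already have happened, so the worker can check with the target system, using the idempotency key, before acting again.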
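A minimal sketch of that crash-and-redeliver sequence, under stated assumptions. `FakeCRM` and `State` are in-memory stand-ins, `find_case` is a hypothetical lookup by request id, and the fake CRM deliberately does not dedupe on its own, so the intent check is the only thing preventing a second case:

```python
import asyncio


class FakeCRM:
    """Stand-in target system. It does NOT dedupe on its own."""

    def __init__(self):
        self.cases = []  # (request_id, case_id) pairs

    async def create_case(self, payload, request_id):
        case_id = f"case-{len(self.cases) + 1}"
        self.cases.append((request_id, case_id))
        return case_id

    async def find_case(self, request_id):
        # Hypothetical lookup by the request id we sent earlier.
        return next((cid for rid, cid in self.cases if rid == request_id), None)


class State:
    """In-memory stand-in for the durable state store."""

    def __init__(self):
        self.intents = set()
        self.outcomes = {}


async def execute_effect(key, payload, state, crm, crash_after_effect=False):
    if key in state.outcomes:
        return state.outcomes[key]          # finished on an earlier delivery
    if key in state.intents:
        # Intent without outcome: a previous worker may have died mid-flight.
        existing = await crm.find_case(key)
        if existing is not None:
            state.outcomes[key] = existing  # adopt the effect that already landed
            return existing
    state.intents.add(key)
    case_id = await crm.create_case(payload, request_id=key)
    if crash_after_effect:
        raise RuntimeError("worker died before writing the outcome")
    state.outcomes[key] = case_id
    return case_id


async def demo():
    state, crm = State(), FakeCRM()
    key = "run-1:handoff:1"
    try:
        await execute_effect(key, {"title": "db down"}, state, crm,
                             crash_after_effect=True)
    except RuntimeError:
        pass  # the queue would redeliver the task here
    case_id = await execute_effect(key, {"title": "db down"}, state, crm)
    return case_id, len(crm.cases)


result = asyncio.run(demo())
```

`result` comes back as `("case-1", 1)`: the second delivery adopts the case the first one created instead of creating another.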
For effects that can’t be queried after the fact (a payment, a ticket, an outbound webhook), the graph declares a compensating action that runs if the workflow later fails or is cancelled. Forward and reverse live on the same node:
class HandoffNode:
    kind = "effect"

    def idempotency_key(self, inp: HandoffInput, ctx: Context) -> str:
        return f"handoff:{ctx.run_id}:{ctx.node_id}:{ctx.attempt}"

    async def run(self, inp: HandoffInput, ctx: Context) -> HandoffOutput:
        case = await crm.create_case(inp.payload, request_id=self.idempotency_key(inp, ctx))
        return HandoffOutput(case_id=case.id)

    async def compensate(self, inp: HandoffInput, out: HandoffOutput, ctx: Context) -> None:
        await crm.cancel_case(out.case_id, reason="workflow rolled back")
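How the runtime invokes those compensations isn't shown here. A plausible sketch, assuming the runtime keeps an ordered record of completed effect nodes with their inputs and outputs, and unwinds it newest first. `roll_back`, `RecordingEffect`, and `NoUndo` are toy names for illustration:

```python
import asyncio


class RecordingEffect:
    """Toy effect node whose compensation just writes to a shared log."""
    kind = "effect"

    def __init__(self, name, log):
        self.name, self.log = name, log

    async def compensate(self, inp, out, ctx):
        self.log.append(f"undo {self.name} ({out})")


class NoUndo:
    """An effect node that declares no reverse action."""
    kind = "effect"


async def roll_back(completed, ctx):
    """Compensate completed effects, newest first; skip nodes with no reverse."""
    for node, inp, out in reversed(completed):
        compensate = getattr(node, "compensate", None)
        if compensate is not None:
            await compensate(inp, out, ctx)


log = []
completed = [
    (RecordingEffect("create_case", log), {"title": "db down"}, "case-1"),
    (NoUndo(), {}, None),
    (RecordingEffect("notify", log), {"to": "ops"}, "msg-7"),
]
asyncio.run(roll_back(completed, ctx={"run_id": "run-1"}))
```

The log ends up as `["undo notify (msg-7)", "undo create_case (case-1)"]`. Reverse order matters because later effects often depend on earlier ones. A real runtime would also need to persist rollback progress and retry failed compensations, which this sketch leaves out.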
The rest of the architecture is what supports this. A model gateway owns rate limits, budgets, and prompt caching. All llm nodes call through it. When the provider degrades, exactly one place knows.
Human review is a node type, not a side effect. A review node pushes a task on a review queue, records the requirement on the run’s state, and parks the run. When the review is acknowledged elsewhere, an event wakes the run via the scheduler.
Every run produces a trace whose structure mirrors the graph. The workflow definition is the schema for the trace. Replays are first-class: pick a run, pick a node, replay from there with the recorded inputs.
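To make the gateway idea concrete, here is a deliberately small sketch: one object that caps concurrent model calls and reserves token budget up front. It is an assumption-heavy toy. `ModelGateway`, `BudgetExceeded`, and `fake_model` are invented for illustration, the budget is a plain counter, and prompt caching is not modelled at all:

```python
import asyncio


class BudgetExceeded(Exception):
    pass


class ModelGateway:
    """One shared entry point: caps concurrency and reserves token budget."""

    def __init__(self, call_model, max_concurrent, token_budget):
        self._call_model = call_model
        self._slots = asyncio.Semaphore(max_concurrent)
        self.remaining = token_budget

    async def complete(self, prompt, max_tokens):
        # Reserve before waiting for a slot. There is no await between the
        # check and the decrement, so concurrent callers cannot overspend.
        if max_tokens > self.remaining:
            raise BudgetExceeded(f"need {max_tokens}, have {self.remaining}")
        self.remaining -= max_tokens
        async with self._slots:
            return await self._call_model(prompt, max_tokens)


async def demo():
    in_flight = 0
    peak = 0

    async def fake_model(prompt, max_tokens):
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # pretend to wait on the provider
        in_flight -= 1
        return f"echo:{prompt}"

    gateway = ModelGateway(fake_model, max_concurrent=2, token_budget=100)
    results = await asyncio.gather(
        *(gateway.complete(f"p{i}", max_tokens=30) for i in range(5)),
        return_exceptions=True,
    )
    return results, peak, gateway.remaining


results, peak, remaining = asyncio.run(demo())
```

With a budget of 100 and five calls of 30 tokens each, three calls go through and two are rejected, and no more than two are ever in flight at once. The sketch never refunds a reservation when a call fails. A real gateway would have to decide whether to.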
None of these moves are novel on their own. The same shape is recognisable from Temporal, Step Functions, Airflow, and durable execution systems generally. The novelty is in deciding that long-running LLM workflows belong in that category, rather than treating them as a thing that lives inside a chatbot framework.
What this buys you
Runs become independent of any single process. Deploys don’t drop runs. Crashes don’t drop runs. A run can fan out into thousands of parallel sub-runs, each persisted, each retryable. Replays are a regular operation, not an incident.
The quieter change is what it does to the way you build. Once the graph is the artefact, the model call stops being the centre of gravity. You spend more time on the shape of the work and less on the prompt. That turns out to be where most of the leverage was anyway.
What’s next
A second post needs to follow this one. How you evaluate a workflow like this end to end. Per-node evals are necessary, but they don’t tell you whether the workflow as a whole is doing the right thing. I’ll write that one next.