From POC to scale: long-running AI workflows

#ai-systems #workflows #production #multi-agent

In one of the most recent client projects I worked on, what I was building wasn’t an ordinary chatbot. It wasn’t even a single agent. It was a process. A sequence of decisions and actions a team of humans used to do over hours or days, with branches, reviews, and pauses. The interesting questions stopped being about prompts and tools a while back. They’re about how you run something that takes a long time, fails halfway through, and has to be safe to retry.

I keep reaching for the same mental model. The workflow is a graph. Nodes are typed. The runtime walks the graph. Everything else, from how state is stored to how concurrency is handled, falls out of that one decision.

This is the version of the post I wish someone had handed me a year ago. What I mean by graph. What works in a POC. Where the POC breaks. What production actually looks like once a client is depending on it.

Workflows as graphs

A workflow is a directed graph the team draws once. Nodes are units of work with a typed input and a typed output. Edges are transitions, optionally guarded by a condition on the previous node’s output. I use four node kinds: det (deterministic code), llm (a single model call with a schema in and a schema out), human (parks the run and waits for a review), and effect (does something the outside world can see).

The graph is data, not code. Here is an example workflow as a definition file:

id: case-triage
version: 3
nodes:
  - id: classify
    kind: det
    out: ClassifyResult
    retry: { max: 3, backoff: exponential }
  - id: enrich
    kind: llm
    model: opus-4-7
    out: EnrichResult
    retry: { max: 4, backoff: jitter }
  - id: fetch_refs
    kind: det
    out: RefBundle
  - id: decide
    kind: llm
    model: sonnet-4-6
    out: Decision
  - id: review
    kind: human
    queue: ops-l2
    sla_hours: 48
    out: ReviewOutcome
  - id: handoff
    kind: effect
    target: crm.create_case
    compensate: crm.cancel_case
edges:
  - { from: classify,                to: enrich,     when: "category == 'incident'" }
  - { from: classify,                to: fetch_refs }
  - { from: [enrich, fetch_refs],    to: decide }
  - { from: decide,                  to: review,     when: "confidence < 0.8" }
  - { from: decide,                  to: handoff,    when: "confidence >= 0.8" }
  - { from: review,                  to: handoff,    when: "approved" }
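
Because the definition is plain data, it can be checked before any run starts. The sketch below shows one way to do that, assuming the file has already been parsed into a dictionary. `validate_definition`, `ALLOWED_KINDS`, and the error strings are illustrative names rather than part of any particular runtime, and `CASE_TRIAGE` is a trimmed copy of the graph above.

```python
from typing import Any

ALLOWED_KINDS = {"det", "llm", "human", "effect"}


def validate_definition(defn: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the graph is well-formed."""
    errors: list[str] = []
    node_ids = [n["id"] for n in defn.get("nodes", [])]
    known = set(node_ids)

    if len(node_ids) != len(known):
        errors.append("duplicate node ids")

    for node in defn.get("nodes", []):
        if node.get("kind") not in ALLOWED_KINDS:
            errors.append(f"node {node['id']}: unknown kind {node.get('kind')!r}")

    for edge in defn.get("edges", []):
        # `from` may be a single id or a list of ids (a join).
        sources = edge["from"] if isinstance(edge["from"], list) else [edge["from"]]
        for endpoint in [*sources, edge["to"]]:
            if endpoint not in known:
                errors.append(f"edge references unknown node {endpoint!r}")

    return errors


# Trimmed copy of the case-triage definition, already parsed into a dict.
CASE_TRIAGE = {
    "id": "case-triage",
    "version": 3,
    "nodes": [
        {"id": "classify", "kind": "det"},
        {"id": "enrich", "kind": "llm"},
        {"id": "fetch_refs", "kind": "det"},
        {"id": "decide", "kind": "llm"},
        {"id": "review", "kind": "human"},
        {"id": "handoff", "kind": "effect"},
    ],
    "edges": [
        {"from": "classify", "to": "enrich", "when": "category == 'incident'"},
        {"from": "classify", "to": "fetch_refs"},
        {"from": ["enrich", "fetch_refs"], "to": "decide"},
        {"from": "decide", "to": "review", "when": "confidence < 0.8"},
        {"from": "decide", "to": "handoff", "when": "confidence >= 0.8"},
        {"from": "review", "to": "handoff", "when": "approved"},
    ],
}
```

A check like this would typically run when a new version is published, so a typo in an edge never reaches a live run.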

The runtime that walks this graph is a separate concern. When the runtime gets harder, the definition doesn’t move. Every node implements the same small contract:

from typing import Protocol, runtime_checkable
from pydantic import BaseModel

class Context(BaseModel):
    run_id: str
    node_id: str
    attempt: int
    trace_id: str

@runtime_checkable
class Node(Protocol):
    """Every workflow node implements this.
    `compensate` is optional and only meaningful for effect nodes."""
    kind: str  # "det" | "llm" | "human" | "effect"

    def idempotency_key(self, inp: BaseModel, ctx: Context) -> str: ...
    async def run(self, inp: BaseModel, ctx: Context) -> BaseModel: ...

idempotency_key surfaces a question that gets murkier the longer you ignore it: what does it mean to run this node twice? For det and llm nodes the key is usually a hash of the input. For effect nodes the key is the bridge to the outside world. We’ll come back to it.
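
As a minimal sketch of the "hash of the input" case, assuming the node input can be turned into a JSON-serialisable dictionary: `input_hash_key` is a hypothetical helper, not part of the contract above. With pydantic v2 models, `inp.model_dump(mode="json")` would be one way to obtain such a dictionary.

```python
import hashlib
import json
from typing import Any


def input_hash_key(node_id: str, payload: dict[str, Any]) -> str:
    """Derive a stable key from a node id and a JSON-serialisable input."""
    # sort_keys and fixed separators make the encoding canonical, so two
    # dicts with the same content always hash to the same key.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{node_id}:{digest}"
```

Key order in the payload doesn't matter, but any change to a value produces a different key, which is what lets a recorded outcome be reused only for an identical input.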

This is the part people skip when they jump straight to a free-form ReAct loop. ReAct treats the trajectory as something the model invents at runtime. That’s fine for a chatbot. For work where the right answer is “do these eight things in this order, branch if the third one comes back ambiguous, and pause for a human if the confidence is below 0.8”, you want that shape pinned down before the model runs.

[Figure: an example workflow as a directed graph: trigger, classify (det), branch, parallel enrich (llm) and fetch refs (det), join, review (human, dashed border), handoff (effect, accent stroke).]
Workflow as a typed graph. Solid borders are automated nodes; the dashed border parks the run for a human; the accent node is an irreversible side effect.

The POC pattern

[Figure: the POC architecture. A trigger calls a single orchestrator process that holds the graph definition and run state in memory, walks the nodes, makes synchronous model calls inline through the model gateway, fires side effects directly (fire and forget), and retries ad hoc.]
The POC. One process owns the graph, the state, and the call stack. Sync model calls, direct effects, retries ad hoc.

The first version of every one of these systems looks the same. A single process holds the graph in memory. It walks the nodes in order, calls the model inline for llm nodes, runs the code inline for det nodes, holds the partial state in a dictionary. human nodes are a TODO. effect nodes are mocked.

You demo it and it works, because the constraints you’ve quietly chosen are doing all the work: small graph, short horizon, your own input, the model behaves. The trap is reading the success of the POC as evidence that the architecture is sound. It isn’t. It’s reasonable for the size of the problem. Each constraint becomes a hole the moment the workflow is real.

Where the POC breaks

Per-node success rates aren’t 100%. Call them 99% for a tuned llm node and higher for det. A 20-node workflow at 99% per node lands at 0.99^20 ≈ 0.82 end to end. A three-nines per-node target gets you to 0.999^20 ≈ 0.98. The headline isn’t the number. It’s that POC code which assumes everything succeeds is reasoning about a regime that doesn’t exist.
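
The arithmetic is easy to reproduce. The sketch below also shows what bounded retries do to a per-node rate, under the assumption (not made in the text above) that attempts fail independently; real failures are often correlated, so treat the retry figure as an upper bound. Both function names are illustrative.

```python
def end_to_end_success(per_node: float, nodes: int) -> float:
    """Probability that every node in a linear workflow succeeds."""
    return per_node ** nodes


def with_retries(per_attempt: float, max_attempts: int) -> float:
    """Per-node success rate if up to max_attempts independent tries are allowed."""
    return 1 - (1 - per_attempt) ** max_attempts


print(round(end_to_end_success(0.99, 20), 2))   # 0.82
print(round(end_to_end_success(0.999, 20), 2))  # 0.98
```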

The same arithmetic recurs for every failure surface the POC fused into one process: latency, memory, concurrency, idempotency, observability, and human review. Each one is a separate fix.

| Failure mode | POC behaviour | Production fix |
| --- | --- | --- |
| The run outlives the process | a crash or deploy drops in-flight work | runtime is durable; state is written between every node |
| Failures compound across nodes | one transient error fails the run | typed retries per node, with per-node SLOs |
| State lives only in process memory | restart equals restart from zero | every node’s input and output is persisted alongside the run state |
| Concurrency contends with the model | parallel runs hit the same rate limit at the same time | one model gateway holds rate limit and budget for all callers |
| Effects can’t be safely re-run | retry sends a second email | idempotency key on (run_id, node_id, attempt), compensations for non-idempotent effects |
| Humans don’t reply on function-call clocks | the run blocks for hours or days | review is a node type; the runtime parks the run and resumes it on an event |

The pattern across the rows is the same. The POC stored state in the wrong place, treated failure as exceptional, and confused “the orchestrator is running” with “the workflow is alive”.

What production looks like

The fix isn’t bigger boxes. It’s separating two things the POC fused: the workflow definition and the workflow runtime.

[Figure: the production architecture as three swim lanes. Control plane: scheduler, graph definitions (versioned), durable state (run, node, attempt), trace sink. Data plane: task queue (at least once), worker pool (stateless, idempotent, key (run_id, node_id, attempt)). Externals: model gateway (rate + budget + cache), human review queue (parks the run), side effects (+ compensations). Triggers feed the scheduler above the bands. The scheduler writes state; workers read and write state. Solid lines are control flow; dashed lines are state reads and writes. The trace sink receives spans from every component, and review completion wakes the scheduler.]
The production layout. The scheduler is the only piece that knows the graph. Workers know how to run a single node. Everything else is durable storage or an external dependency.

Definitions are versioned data. The graph for a workflow lives in a store with a version. A run is bound to a version forever. A deploy that changes the graph doesn’t change the meaning of an in-flight run.
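
A minimal in-memory sketch of that binding, assuming definitions are dictionaries with `id` and `version` fields as in the example file. `DefinitionStore` and `start_run` are hypothetical names; a real store would sit on a database.

```python
import copy


class DefinitionStore:
    """Holds immutable (workflow_id, version) -> definition entries."""

    def __init__(self) -> None:
        self._defs: dict[tuple[str, int], dict] = {}

    def publish(self, defn: dict) -> None:
        key = (defn["id"], defn["version"])
        if key in self._defs:
            # Published versions never change; a new graph needs a new version.
            raise ValueError(f"{key} is already published; bump the version instead")
        self._defs[key] = copy.deepcopy(defn)

    def latest_version(self, workflow_id: str) -> int:
        return max(v for (wid, v) in self._defs if wid == workflow_id)

    def get(self, workflow_id: str, version: int) -> dict:
        return self._defs[(workflow_id, version)]


def start_run(store: DefinitionStore, workflow_id: str, run_id: str) -> dict:
    """Pin a new run to whatever version is latest at start time."""
    return {
        "run_id": run_id,
        "workflow_id": workflow_id,
        "version": store.latest_version(workflow_id),
    }
```

The run record carries the version, so every later lookup goes through `get(workflow_id, version)` and never through "latest".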

State is durable, written after every node transition. The run is identified by an opaque id. Every node’s input and output is persisted alongside the state, so you can reconstruct exactly what happened.
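
To make "persisted alongside the state" concrete, here is a small sketch using SQLite from the standard library. The table layout and the `StateStore` class are assumptions for illustration; the point is only that input and output are written together in one transaction, and that a second write for the same attempt is a no-op.

```python
import json
import sqlite3


class StateStore:
    """Minimal outcome store keyed by (run_id, node_id, attempt)."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            """CREATE TABLE IF NOT EXISTS outcomes (
                   run_id      TEXT    NOT NULL,
                   node_id     TEXT    NOT NULL,
                   attempt     INTEGER NOT NULL,
                   input_json  TEXT    NOT NULL,
                   output_json TEXT    NOT NULL,
                   PRIMARY KEY (run_id, node_id, attempt)
               )"""
        )

    def write_outcome(self, run_id: str, node_id: str, attempt: int,
                      inp: dict, out: dict) -> bool:
        """Persist input and output together. Returns False if already recorded."""
        with self.db:  # commits on success, rolls back on error
            cur = self.db.execute(
                "INSERT OR IGNORE INTO outcomes VALUES (?, ?, ?, ?, ?)",
                (run_id, node_id, attempt, json.dumps(inp), json.dumps(out)),
            )
        return cur.rowcount == 1

    def history(self, run_id: str) -> list[tuple[str, int, dict, dict]]:
        """Everything recorded for a run, in insertion order."""
        rows = self.db.execute(
            "SELECT node_id, attempt, input_json, output_json "
            "FROM outcomes WHERE run_id = ? ORDER BY rowid",
            (run_id,),
        ).fetchall()
        return [(n, a, json.loads(i), json.loads(o)) for n, a, i, o in rows]
```

`history` is what reconstruction and replay would read from: each row has exactly what the node saw and what it produced.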

Node executions are messages on a queue. The runtime doesn’t loop through the graph in process. It enqueues “execute node n of run r at attempt k” as a task. A pool of stateless workers picks tasks up. Crashes become boring. The task is redelivered, another worker handles it.

Workers are stateless. The contract is the same everywhere. Derive the idempotency key, check the state store, do the work only if there is no recorded outcome.

async def execute(task: Task) -> None:
    key = f"{task.run_id}:{task.node_id}:{task.attempt}"
    node = registry.get(task.node_id)

    # Already finished this exact attempt? Return the recorded outcome.
    if outcome := await state.outcome(key):
        await queue.ack(task, outcome)
        return

    # Record the intent before we touch the outside world.
    await state.write_intent(key, inp=task.input)

    try:
        out = await node.run(task.input, ctx=task.ctx)
    except TransientError:
        delay = backoff(task.attempt)  # exponential, capped, jittered
        await queue.retry(task, delay=delay)
        return

    await state.write_outcome(key, out=out)
    await queue.ack(task, out)
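
The `backoff` helper called in the worker isn't shown. One common shape for "exponential, capped, jittered" is full jitter, sketched below. The `base` and `cap` defaults are arbitrary, and the sketch assumes attempts are numbered from 1.

```python
import random


def backoff(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Return a retry delay in seconds for a 1-indexed attempt number."""
    # Clamp the exponent so very large attempt numbers can't overflow.
    exponent = min(max(attempt - 1, 0), 30)
    # Exponential growth, capped so late attempts don't wait unboundedly.
    ceiling = min(cap, base * (2 ** exponent))
    # Full jitter: a uniform draw in [0, ceiling] spreads retries out so
    # workers that failed together don't all come back at the same moment.
    return random.uniform(0.0, ceiling)
```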

The intent record is what turns the worker loop from “retry safe in the common case” into “retry safe under crashes”. Watch what happens when an effect node crashes after the side effect but before the outcome is written:

[Figure: sequence diagram between worker, state store, and effects. The worker writes intent (run, node, attempt=1) and sends the email with request_id = key; the email is delivered. The process dies before write_outcome. The task is redelivered to a new worker as attempt 2, which reads the intent (found, no outcome), queries the outcome by request_id, learns it was already delivered, and writes the outcome idempotently.]
The crash is invisible to the outside world. The intent record makes the retry idempotent because the effect knows how to look itself up by the request id we passed through.
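
The recovery path in the diagram can be sketched with in-memory stand-ins. This is one reading of the diagram, not the worker loop shown earlier: it assumes intents from earlier attempts are visible to later ones and that the effect can be looked up by request id. `FakeMailer`, `IntentLog`, and `run_effect` are all hypothetical.

```python
import asyncio
from typing import Optional


class FakeMailer:
    """Stand-in for an effect that can be queried by the request id it was given."""

    def __init__(self) -> None:
        self.sent: dict[str, str] = {}  # request_id -> delivery status

    async def send(self, request_id: str, body: str) -> str:
        self.sent.setdefault(request_id, "delivered")
        return self.sent[request_id]

    async def lookup(self, request_id: str) -> Optional[str]:
        return self.sent.get(request_id)


class IntentLog:
    """Stand-in for the state store: intents and outcomes per (run, node)."""

    def __init__(self) -> None:
        self.intents: dict[tuple[str, str], list[str]] = {}
        self.outcomes: dict[tuple[str, str], str] = {}


async def run_effect(log: IntentLog, mailer: FakeMailer, run_id: str, node_id: str,
                     attempt: int, body: str, crash_before_outcome: bool = False) -> str:
    step = (run_id, node_id)
    if step in log.outcomes:
        return log.outcomes[step]

    # An earlier attempt may have reached the effect and died before
    # recording it: ask the effect about every request id we intended to use.
    for request_id in log.intents.get(step, []):
        status = await mailer.lookup(request_id)
        if status is not None:
            log.outcomes[step] = status
            return status

    # No trace of a previous delivery: record intent first, then act.
    request_id = f"{run_id}:{node_id}:{attempt}"
    log.intents.setdefault(step, []).append(request_id)
    status = await mailer.send(request_id, body)
    if crash_before_outcome:
        raise RuntimeError("simulated crash after the effect, before write_outcome")
    log.outcomes[step] = status
    return status
```

Running attempt 1 with `crash_before_outcome=True` and then attempt 2 leaves exactly one entry in `mailer.sent`: the second attempt finds the first attempt's intent, confirms delivery, and records the outcome without sending again.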

For effects that can’t be queried after the fact (a payment, a ticket, an outbound webhook), the graph declares a compensating action that runs if the workflow later fails or is cancelled. Forward and reverse live on the same node:

class HandoffNode:
    kind = "effect"

    def idempotency_key(self, inp: HandoffInput, ctx: Context) -> str:
        return f"handoff:{ctx.run_id}:{ctx.node_id}:{ctx.attempt}"

    async def run(self, inp: HandoffInput, ctx: Context) -> HandoffOutput:
        case = await crm.create_case(inp.payload, request_id=self.idempotency_key(inp, ctx))
        return HandoffOutput(case_id=case.id)

    async def compensate(self, inp: HandoffInput, out: HandoffOutput, ctx: Context) -> None:
        await crm.cancel_case(out.case_id, reason="workflow rolled back")
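
How the runtime invokes compensations isn't shown above. A common approach, borrowed from the saga pattern, is to undo completed effect nodes in reverse order and keep going if one of them fails. The sketch below assumes each completed effect has been bound into a zero-argument coroutine function; `roll_back` and the `Compensation` alias are illustrative names.

```python
import asyncio
from typing import Awaitable, Callable

Compensation = Callable[[], Awaitable[None]]


async def roll_back(completed: list[tuple[str, Compensation]]) -> list[str]:
    """Run compensations for completed effect nodes, most recent first.

    Returns the node ids whose compensation failed, so they can be escalated.
    """
    failed: list[str] = []
    for node_id, compensate in reversed(completed):
        try:
            await compensate()
        except Exception:
            # Keep going: one failed compensation shouldn't block the others.
            failed.append(node_id)
    return failed
```

Anything returned in `failed` needs a human or a retry policy of its own, since a compensation that can't run leaves the outside world in a state the workflow no longer describes.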

The rest of the architecture is what supports this. A model gateway owns rate limits, budgets, and prompt caching. All llm nodes call through it. When the provider degrades, exactly one place knows.

Human review is a node type, not a side effect. A review node pushes a task on a review queue, records the requirement on the run’s state, and parks the run. When the review is acknowledged elsewhere, an event wakes the run via the scheduler.

Every run produces a trace whose structure mirrors the graph. The workflow definition is the schema for the trace. Replays are first-class: pick a run, pick a node, replay from there with the recorded inputs.
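
The park-and-resume behaviour of a review node can be sketched in a few lines. Everything here is an in-memory stand-in with hypothetical names (`Run`, `Scheduler`, `park_for_review`, `on_review_completed`); a real scheduler would persist each of these transitions in the state store.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Run:
    run_id: str
    status: str = "running"            # "running" | "parked"
    waiting_on: Optional[str] = None   # node id the run is parked on
    outputs: dict = field(default_factory=dict)


class Scheduler:
    """In-memory stand-in for the scheduler plus the review queue."""

    def __init__(self) -> None:
        self.runs: dict[str, Run] = {}
        self.review_queue: list[tuple[str, str]] = []  # (run_id, node_id)
        self.ready: list[str] = []                     # runs to continue walking

    def park_for_review(self, run: Run, node_id: str) -> None:
        # Record the requirement on the run, push the review task, and stop.
        # No worker or process is held while the human takes their time.
        run.status, run.waiting_on = "parked", node_id
        self.runs[run.run_id] = run
        self.review_queue.append((run.run_id, node_id))

    def on_review_completed(self, run_id: str, node_id: str, outcome: dict) -> bool:
        """Handle a review event. Returns True if the run was resumed."""
        run = self.runs.get(run_id)
        # Events can arrive twice or for a stale node; only a matching
        # parked run is resumed.
        if run is None or run.status != "parked" or run.waiting_on != node_id:
            return False
        run.outputs[node_id] = outcome
        run.status, run.waiting_on = "running", None
        self.ready.append(run_id)
        return True
```

The duplicate-event check matters for the same reason the worker's idempotency check does: the review system is just another at-least-once sender.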

None of these moves are novel on their own. The same shape is recognisable from Temporal, Step Functions, Airflow, and durable execution systems generally. The novelty is in deciding that long-running LLM workflows belong in that category, rather than treating them as a thing that lives inside a chatbot framework.

What this buys you

Runs become independent of any single process. Deploys don’t drop runs. Crashes don’t drop runs. A run can fan out into thousands of parallel sub-runs, each persisted, each retryable. Replays are a regular operation, not an incident.

The quieter change is what it does to the way you build. Once the graph is the artefact, the model call stops being the centre of gravity. You spend more time on the shape of the work and less on the prompt. That turns out to be where most of the leverage was anyway.

What’s next

A second post needs to follow this one. How you evaluate a workflow like this end to end. Per-node evals are necessary, but they don’t tell you whether the workflow as a whole is doing the right thing. I’ll write that one next.