From POC to scale: long-running AI workflows

2026-06-07 · #ai-systems #workflows #production #multi-agent

On a recent client project, the thing we were building was a process: a sequence of decisions and actions that a team of humans used to work through over hours or days, with branches, reviews, and pauses. Nothing about it looked like a chatbot, and a single agent was never going to cover it.

The interesting questions stopped being about prompts and tools a while back. They are about how you run something that takes a long time, fails halfway through, and has to be safe to retry.

I keep coming back to the same mental model: the workflow is a graph. Typed nodes. A runtime that walks it.

Most other decisions - where state lives, how concurrency works, what a retry means - follow from that one.

This is the post I wish someone had handed me a year ago.

Workflows as graphs

A workflow is a directed graph the team draws once.

A node is a unit of work with typed input and output. An edge is a transition, optionally guarded by a condition on the previous node’s output.

Four node kinds have covered everything I needed so far:

det - deterministic code
llm - one model call, schema in, schema out
human - parks the run and waits for a review
effect - does something the outside world can see

The graph is data. Here is a case-triage workflow as a definition file:

id: case-triage
version: 3
nodes:
  - id: classify
    kind: det
    out: ClassifyResult
    retry: { max: 3, backoff: exponential }
  - id: enrich
    kind: llm
    model: opus-4-7
    out: EnrichResult
    retry: { max: 4, backoff: jitter }
  - id: fetch_refs
    kind: det
    out: RefBundle
  - id: decide
    kind: llm
    model: sonnet-4-6
    out: Decision
  - id: review
    kind: human
    queue: ops-l2
    sla_hours: 48
    out: ReviewOutcome
  - id: handoff
    kind: effect
    target: crm.create_case
    compensate: crm.cancel_case
edges:
  - { from: classify,                to: enrich,     when: "category == 'incident'" }
  - { from: classify,                to: fetch_refs }
  - { from: [enrich, fetch_refs],    to: decide }
  - { from: decide,                  to: review,     when: "confidence < 0.8" }
  - { from: decide,                  to: handoff,    when: "confidence >= 0.8" }
  - { from: review,                  to: handoff,    when: "approved" }

The runtime that walks this graph is a separate concern. When the runtime gets harder later, this file does not move.
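
To make "the graph is data" concrete, here is a minimal sketch of how a runtime might decide which nodes are ready to run. Everything in it is an assumption for illustration: `Edge` and `ready_nodes` are hypothetical names, guards are written as Python callables instead of the string expressions in the file, and a multi-source edge is treated as a join that waits for every source.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

Guard = Callable[[dict[str, Any]], bool]


@dataclass(frozen=True)
class Edge:
    sources: tuple[str, ...]      # one source, or several for a join
    target: str
    when: Optional[Guard] = None  # guard evaluated on the first source's output


def ready_nodes(edges: list[Edge], outputs: dict[str, dict[str, Any]]) -> list[str]:
    """Return targets whose incoming edge is satisfied and that have not run yet."""
    ready: list[str] = []
    for edge in edges:
        if edge.target in outputs or edge.target in ready:
            continue  # already executed, or already scheduled in this pass
        if not all(src in outputs for src in edge.sources):
            continue  # join: every source must have a recorded output
        if edge.when is not None and not edge.when(outputs[edge.sources[0]]):
            continue  # guard rejected this transition
        ready.append(edge.target)
    return ready


# The case-triage edges, with guards as callables (hypothetical encoding).
EDGES = [
    Edge(("classify",), "enrich", lambda o: o["category"] == "incident"),
    Edge(("classify",), "fetch_refs"),
    Edge(("enrich", "fetch_refs"), "decide"),
    Edge(("decide",), "review", lambda o: o["confidence"] < 0.8),
    Edge(("decide",), "handoff", lambda o: o["confidence"] >= 0.8),
    Edge(("review",), "handoff", lambda o: o["approved"]),
]
```

With only `classify` recorded as an incident, both `enrich` and `fetch_refs` become ready; `decide` waits until both have outputs. How a join should behave when a guarded branch is skipped is a design decision this sketch does not settle.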

Every node implements the same small contract:

from typing import Protocol, runtime_checkable
from pydantic import BaseModel

class Context(BaseModel):
    run_id: str
    node_id: str
    attempt: int
    trace_id: str

@runtime_checkable
class Node(Protocol):
    """Every workflow node implements this.
    `compensate` is optional and only meaningful for effect nodes."""
    kind: str  # "det" | "llm" | "human" | "effect"

    def idempotency_key(self, inp: BaseModel, ctx: Context) -> str: ...
    async def run(self, inp: BaseModel, ctx: Context) -> BaseModel: ...

idempotency_key forces an early answer to an awkward question: what does it mean to run this node twice?

For det and llm nodes the key is usually a hash of the input. For effect nodes it is the bridge to the outside world. More on that below.
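
A minimal sketch of the "hash of the input" case, assuming the input can be serialised to canonical JSON. The `ClassifyInput` fields and the `input_hash_key` helper are hypothetical, and a plain dataclass stands in for the pydantic models so the example needs only the standard library.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ClassifyInput:
    # Hypothetical fields; the real input schema is not shown in this post.
    case_id: str
    text: str


def input_hash_key(node_id: str, inp: ClassifyInput) -> str:
    """Derive a key from the node id and a canonical JSON form of the input."""
    # sort_keys and fixed separators make the serialisation deterministic.
    canonical = json.dumps(asdict(inp), sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{node_id}:{digest}"
```

Two runs that feed the same input to the same node get the same key, which is what makes a cached or recorded result safe to reuse. Whether that reuse is desirable for an llm node depends on whether you want repeated calls to be able to differ.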

This is the part I find easy to skip when reaching for a free-form ReAct loop. ReAct treats the trajectory as something the model invents at runtime. For a chatbot, mostly fine.

For work where the answer is “do these eight things in this order, branch if the third one comes back ambiguous, pause for a human if confidence is below 0.8”, I want that shape pinned down before the model runs.

Workflow as a typed graph. Solid borders are automated nodes; the dashed border parks the run for a human; the accent node is an irreversible side effect.

The POC pattern

The POC. One process owns the graph, the state, and the call stack. Sync model calls, direct effects, retries ad hoc.

The first version of every system I have built like this looks roughly the same.

One process holds the graph in memory. It walks the nodes in order, calls the model inline, runs the deterministic code inline, keeps partial state in a dictionary. human nodes are a TODO. effect nodes are mocked.

You demo it and it works. Mostly because of the constraints you quietly chose: small graph, short horizon, your own inputs, a model that behaves.

The mistake I keep making is reading POC success as evidence the architecture is sound, when mostly it is evidence the problem was small. Each constraint becomes a hole the moment the workflow is real.

Where the POC breaks

Per-node success rates are not 100%. Call them 99% for a tuned llm node and higher for det.

A 20-node workflow at 99% per node is 0.99^20 ≈ 0.82 end to end. Three nines per node gets you to 0.999^20 ≈ 0.98.

POC code that assumes everything succeeds is reasoning about a regime that does not exist.
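
The arithmetic above is easy to reproduce. The sketch below assumes node failures are independent, which real workflows only approximate; `with_retries` is an extra helper, not something from the original system, that shows why per-node retries move the numbers so much.

```python
def end_to_end(p_node: float, n_nodes: int) -> float:
    """Probability that all n nodes succeed, assuming independent failures."""
    return p_node ** n_nodes


def with_retries(p_attempt: float, max_attempts: int) -> float:
    """Probability that at least one of max_attempts independent attempts succeeds."""
    return 1.0 - (1.0 - p_attempt) ** max_attempts


print(round(end_to_end(0.99, 20), 2))   # 0.82
print(round(end_to_end(0.999, 20), 2))  # 0.98
```

Under the same independence assumption, three attempts at 99% each lift a single node to roughly 0.999999, which is why retry policy matters more than squeezing the last fraction of a percent out of any one prompt.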

The same thing repeats for every failure surface the POC fused into one process: latency, memory, concurrency, idempotency, observability, human review. Each one is a separate fix.

Failure mode	POC behaviour	Production fix
The run outlives the process	a crash or deploy drops in-flight work	runtime is durable; state is written between every node
Failures compound across nodes	one transient error fails the run	typed retries per node, with per-node SLOs
State lives only in process memory	restart equals restart from zero	every node’s input and output is persisted alongside the run state
Concurrency contends with the model	parallel runs hit the same rate limit at the same time	one model gateway holds rate limit and budget for all callers
Effects can’t be safely re-run	retry sends a second email	idempotency key on `(run_id, node_id, attempt)`, compensations for non-idempotent effects
Humans don’t reply on function-call clocks	the run blocks for hours or days	`review` is a node type; the runtime parks the run and resumes it on an event

The pattern I keep seeing across these rows is the same. The POC stored state in the wrong place, treated failure as exceptional, and confused “the orchestrator is running” with “the workflow is alive”.

What production looks like

The fix was separating the two things the POC fused: the workflow definition and the workflow runtime.

The production layout. The scheduler is the only piece that knows the graph. Workers know how to run a single node. Everything else is durable storage or an external dependency.

Definitions become versioned data. The graph lives in a store with a version. A run is bound to its version forever, so a deploy cannot change the meaning of an in-flight run.
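
A rough sketch of version pinning, using an in-memory store. `Definition`, `DefinitionStore`, `Run`, and `start_run` are hypothetical names; a real store would be a database table or object store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Definition:
    workflow_id: str
    version: int
    node_ids: tuple[str, ...]


class DefinitionStore:
    """In-memory stand-in for a durable, append-only definition store."""

    def __init__(self) -> None:
        self._defs: dict[tuple[str, int], Definition] = {}

    def publish(self, definition: Definition) -> None:
        key = (definition.workflow_id, definition.version)
        if key in self._defs:
            # Published versions are immutable; a change needs a new version.
            raise ValueError(f"{key} is already published")
        self._defs[key] = definition

    def latest(self, workflow_id: str) -> Definition:
        versions = [d for (wid, _), d in self._defs.items() if wid == workflow_id]
        return max(versions, key=lambda d: d.version)

    def get(self, workflow_id: str, version: int) -> Definition:
        return self._defs[(workflow_id, version)]


@dataclass(frozen=True)
class Run:
    run_id: str
    workflow_id: str
    version: int  # pinned when the run starts, never updated


def start_run(store: DefinitionStore, run_id: str, workflow_id: str) -> Run:
    """Bind a new run to whatever version is latest at start time."""
    return Run(run_id, workflow_id, store.latest(workflow_id).version)
```

A run started under version 3 keeps resolving version 3 through `store.get(run.workflow_id, run.version)` even after version 4 is published; only new runs pick up the new graph.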

State is durable, written after every node transition. Every node’s input and output is persisted, so you can reconstruct exactly what happened.

Node executions become queue messages: “execute node n of run r at attempt k”. Stateless workers pick them up. A crash stops being an event - the task gets redelivered, another worker takes it.

The worker contract is the same everywhere: derive the idempotency key, check the state store, only do the work if there is no recorded outcome.

async def execute(task: Task) -> None:
    key = f"{task.run_id}:{task.node_id}:{task.attempt}"
    node = registry.get(task.node_id)

    # Already finished this exact attempt? Return the recorded outcome.
    if (outcome := await state.outcome(key)) is not None:
        await queue.ack(task, outcome)
        return

    # Record the intent before we touch the outside world.
    await state.write_intent(key, inp=task.input)

    try:
        out = await node.run(task.input, ctx=task.ctx)
    except TransientError:
        delay = backoff(task.attempt)  # exponential, capped, jittered
        await queue.retry(task, delay=delay)
        return

    await state.write_outcome(key, out=out)
    await queue.ack(task, out)
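
The `backoff` helper is not shown above. One plausible shape for "exponential, capped, jittered" is the full-jitter variant sketched here; the `base` and `cap` defaults are illustrative assumptions, not values from the original system.

```python
import random


def backoff(attempt: int, base: float = 0.5, cap: float = 60.0) -> float:
    """Full-jitter exponential backoff.

    Returns a delay in seconds drawn uniformly from
    [0, min(cap, base * 2**attempt)].
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, ceiling)
```

The jitter is there so that many runs failing at the same moment do not all retry at the same moment.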

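The intent record is what turns this from “retry safe in the common case” into “retry safe under crashes”. Consider an effect node whose worker crashes after the side effect lands but before the outcome is written. The queue redelivers the task, the next worker finds no recorded outcome, and it runs the node again with the same key.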

The crash is invisible to the outside world. The intent record makes the retry idempotent because the effect knows how to look itself up by the request id we passed through.
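
The scenario can be simulated in a few lines. This is a sketch under two loud assumptions: the external system dedupes on the request id it is given (`FakeCRM` is a hypothetical stand-in), and the queue redelivers the task with the same attempt number, so the key does not change.

```python
import asyncio


class FakeCRM:
    """Hypothetical external system that dedupes on request_id."""

    def __init__(self) -> None:
        self.cases: dict[str, str] = {}  # request_id -> case_id

    async def create_case(self, payload: dict, request_id: str) -> str:
        if request_id not in self.cases:
            self.cases[request_id] = f"case-{len(self.cases) + 1}"
        return self.cases[request_id]


class StateStore:
    """In-memory stand-in for the durable state store."""

    def __init__(self) -> None:
        self.intents: set[str] = set()
        self.outcomes: dict[str, str] = {}


class WorkerCrash(Exception):
    """Simulates the process dying mid-task."""


async def execute(key: str, crm: FakeCRM, state: StateStore,
                  crash_after_effect: bool = False) -> str:
    if key in state.outcomes:
        return state.outcomes[key]      # this attempt already finished
    state.intents.add(key)              # record intent before the effect
    case_id = await crm.create_case({"subject": "demo"}, request_id=key)
    if crash_after_effect:
        raise WorkerCrash(key)          # effect landed, outcome never written
    state.outcomes[key] = case_id
    return case_id


async def demo() -> tuple[int, str]:
    crm, state = FakeCRM(), StateStore()
    key = "run-1:handoff:1"
    try:
        await execute(key, crm, state, crash_after_effect=True)
    except WorkerCrash:
        pass                            # the queue redelivers the same task
    case_id = await execute(key, crm, state)
    return len(crm.cases), case_id
```

`asyncio.run(demo())` ends with exactly one case in the fake CRM. If a retry bumped the attempt number, the key would change and the dedupe would not apply, so that path needs its own answer for effect nodes.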

Some effects cannot be queried after the fact - a payment, a ticket, an outbound webhook. For those, the graph declares a compensating action that runs if the workflow later fails or is cancelled. Forward and reverse live on the same node:

class HandoffNode:
    kind = "effect"

    def idempotency_key(self, inp: HandoffInput, ctx: Context) -> str:
        return f"handoff:{ctx.run_id}:{ctx.node_id}:{ctx.attempt}"

    async def run(self, inp: HandoffInput, ctx: Context) -> HandoffOutput:
        case = await crm.create_case(inp.payload, request_id=self.idempotency_key(inp, ctx))
        return HandoffOutput(case_id=case.id)

    async def compensate(self, inp: HandoffInput, out: HandoffOutput, ctx: Context) -> None:
        await crm.cancel_case(out.case_id, reason="workflow rolled back")
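
How the runtime invokes compensations is not shown here. A common arrangement, sketched below as an assumption, is to record an undo callback for each completed effect and run them in reverse order on rollback. `CompensationLog` is a hypothetical name, and a production version would persist the log and make each compensation idempotent.

```python
import asyncio
from typing import Awaitable, Callable

Compensation = Callable[[], Awaitable[None]]


class CompensationLog:
    """In-memory record of undo actions for completed effect nodes."""

    def __init__(self) -> None:
        self._stack: list[tuple[str, Compensation]] = []

    def record(self, node_id: str, undo: Compensation) -> None:
        # Called after an effect node's outcome has been written.
        self._stack.append((node_id, undo))

    async def roll_back(self) -> list[str]:
        """Run compensations newest-first; return the node ids that were undone."""
        undone: list[str] = []
        while self._stack:
            node_id, undo = self._stack.pop()
            await undo()
            undone.append(node_id)
        return undone
```

Reverse order matters when later effects depend on earlier ones, for example a notification that references the case created before it.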

The rest of the architecture supports this loop.

A model gateway owns rate limits, budgets, and prompt caching. All llm nodes call through it. When the provider degrades, one place knows.
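
A minimal sketch of the gateway idea, with heavy simplifications: a concurrency cap stands in for a real rate limiter, the budget is a single token counter, and prompt caching is left out. `ModelGateway`, `BudgetExceeded`, and `est_tokens` are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable


class BudgetExceeded(Exception):
    """Raised when a call would exceed the shared token budget."""


class ModelGateway:
    """Single choke point for model calls from every worker."""

    def __init__(self, max_concurrent: int, token_budget: int) -> None:
        self._slots = asyncio.Semaphore(max_concurrent)
        self._remaining = token_budget

    async def call(self, model_call: Callable[[], Awaitable[str]],
                   est_tokens: int) -> str:
        # Check and reserve budget with no await in between, so the
        # reservation is atomic within one event loop.
        if est_tokens > self._remaining:
            raise BudgetExceeded(f"need {est_tokens}, have {self._remaining}")
        self._remaining -= est_tokens
        async with self._slots:  # wait for a free concurrency slot
            return await model_call()


async def demo() -> int:
    gateway = ModelGateway(max_concurrent=2, token_budget=1_000)
    in_flight = 0
    peak = 0

    async def fake_model_call() -> str:
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # pretend to wait on the provider
        in_flight -= 1
        return "ok"

    await asyncio.gather(
        *(gateway.call(fake_model_call, est_tokens=100) for _ in range(5))
    )
    return peak
```

Five concurrent callers never have more than two calls in flight. Across multiple worker processes the same idea needs a shared service or store behind it; an in-process semaphore only illustrates the shape.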

Human review is its own node type. A review node records the requirement, parks the run, and an event wakes it later. Nothing blocks while a human thinks.
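
Parking and resuming can be sketched with plain data. The names here (`RunState`, `Runtime`, `on_review_event`) are hypothetical, the stores are in-memory, and the routing to `handoff` is hard-coded where a real runtime would consult the graph's edges.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunState:
    run_id: str
    status: str = "running"           # "running" | "parked"
    waiting_on: Optional[str] = None  # node id the run is parked on
    outputs: dict = field(default_factory=dict)


class Runtime:
    def __init__(self) -> None:
        self.runs: dict[str, RunState] = {}
        self.queue: list[tuple[str, str]] = []  # (run_id, node_id) tasks

    def park_for_review(self, run_id: str, node_id: str) -> None:
        """Record that the run waits on a human; no worker stays blocked."""
        run = self.runs[run_id]
        run.status, run.waiting_on = "parked", node_id

    def on_review_event(self, run_id: str, node_id: str, outcome: dict) -> bool:
        """Resume a parked run; return False for duplicate or stale events."""
        run = self.runs[run_id]
        if run.status != "parked" or run.waiting_on != node_id:
            return False
        run.outputs[node_id] = outcome
        run.status, run.waiting_on = "running", None
        if outcome.get("approved"):
            self.queue.append((run_id, "handoff"))  # hard-coded edge for the demo
        return True
```

Between `park_for_review` and the event, the run is just a row in storage. The duplicate check matters because review tools tend to deliver webhooks more than once.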

Every run produces a trace shaped like the graph. The definition is effectively the schema for the trace. Replays are first-class: pick a run, pick a node, replay from there with the recorded inputs.
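
Replay from recorded inputs can be as small as the sketch below. The trace layout (`node_id -> {"input", "output"}`) and the `replay_node` helper are assumptions for illustration, not the actual trace schema.

```python
from typing import Any, Callable

# Hypothetical trace layout: node_id -> {"input": ..., "output": ...}
Trace = dict[str, dict[str, Any]]


def replay_node(trace: Trace, node_id: str,
                fn: Callable[[Any], Any]) -> dict[str, Any]:
    """Re-run one node against its recorded input and compare outputs."""
    recorded = trace[node_id]
    replayed = fn(recorded["input"])
    return {
        "recorded": recorded["output"],
        "replayed": replayed,
        "changed": replayed != recorded["output"],
    }
```

Pointing `fn` at a new version of a node's code, or a new prompt, turns any stored run into a regression check for that node.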

None of these moves are novel. Temporal, Step Functions, Airflow - durable execution systems generally - all have this shape. The decision that matters is putting long-running LLM workflows in that category instead of inside a chatbot framework.

What this buys you

Runs stop depending on any single process. Deploys do not drop runs. Crashes do not drop runs. A run can fan out into thousands of parallel sub-runs, each persisted, each retryable. Replay becomes a routine operation.

The quieter change is what it does to the way you build. Once the graph is the artefact, the model call stops being the centre of gravity. You spend more time on the shape of the work and less on the prompt. That turns out to be where most of the leverage was anyway.

What’s next

The follow-up exists now: evals for workflows, not just nodes. Per-node evals are necessary, but they do not tell you whether the workflow as a whole did the right thing.