Why We Ditched DAGs for State Machines in Agent Orchestration
We thought DAGs would give us deterministic execution. Instead, they gave us distributed deadlocks at 3 AM.

Our first version of agent-swarm used a DAG-based workflow engine. It was beautiful on paper: nodes for agents, edges for data flow, topological sorting guaranteeing execution order. We shipped it to production in June. By July, we had built a cron job that restarted stuck workflows at 2 AM.
The problem wasn’t the DAGs. It was that agents aren’t functions. They fail in ways that violate the DAG’s core assumption: that a failed node is either retryable or terminal. When Agent A produces technically valid but semantically ambiguous output, Agent B doesn’t fail—it proceeds confidently down the wrong path. Then Agent C builds on that. By the time you notice, you’re three hops deep in garbage.
We rebuilt around explicit state machines. Not because they’re theoretically cleaner. Because they let us express something DAGs can’t: the system knows it’s confused and needs to ask for help.
What Actually Breaks in Production DAGs
DAG-based orchestrators (Prefect, Airflow, and similar workflow engines) assume failure modes are known. A node throws an exception; you catch it and retry with exponential backoff. This works for deterministic compute. It fails for LLM agents because:
- Semantic drift: Output passes schema validation but shifts meaning. The “summary” node starts emitting bullet points. Downstream agents weren’t built for bullets.
- Negotiation deadlocks: Agent A needs clarification from Agent B, which needs context from Agent A. DAGs have no primitive for “pause and discuss.”
- Silent degradation: Confidence scores sink below quality targets but stay above the hard-failure threshold. The system gets worse without ever failing.
In our experience, these showed up as tasks stuck in “running” for 45 minutes, or workflows completing with outputs that didn’t match the input’s intent. The DAG was happy. The result was useless.
Our State Machine Design
We moved to a hierarchical state machine where each agent has explicit states: idle → preparing → executing → validating → committing | negotiating | failed. The key is that negotiating is a first-class state, not an error handler.
```typescript
// From /packages/core/src/state-machine/agent-state.ts
export type AgentState =
  | { type: 'idle'; context: AgentContext }
  | { type: 'preparing'; context: AgentContext; intent: TaskIntent }
  | { type: 'executing'; context: AgentContext; startTime: ISO8601 }
  | { type: 'validating'; context: AgentContext; candidate: CandidateOutput }
  | { type: 'negotiating'; context: AgentContext; dispute: DisputeRecord; partners: AgentId[] }
  | { type: 'committing'; context: AgentContext; output: CommittedOutput }
  | { type: 'failed'; context: AgentContext; reason: FailureReason; retryable: boolean };

// Guard functions enforce valid transitions
const validTransitions: Record<AgentState['type'], AgentState['type'][]> = {
  idle: ['preparing'],
  preparing: ['executing', 'failed'],
  executing: ['validating', 'failed'],
  validating: ['committing', 'negotiating', 'failed'],
  negotiating: ['executing', 'committing', 'failed'],
  committing: ['idle'], // completes the cycle
  failed: ['preparing', 'idle'], // retry or abort
};
```

The `negotiating` state is where agents resolve ambiguity. Instead of retrying silently, they enter explicit coordination. The state machine tracks which agents are involved, what the dispute is about (through `DisputeRecord`), and how long they’ve been stuck.
Why Does Explicit Negotiation Beat Implicit Retries?
The Pattern: Coordinated Validation with Dispute Resolution
Here’s a real workflow from our codebase: multi-agent document analysis where a retriever, synthesizer, and fact-checker must agree before committing. The DAG version would chain them linearly. The state machine version lets them iterate.
```typescript
// From /packages/workflows/src/document-analysis/coordination.ts
async function coordinatedAnalysis(
  stateMachine: StateMachineService,
  document: Document,
  agentPool: AgentPool
): Promise<AnalysisResult> {
  // All three agents start in parallel, same initial context
  const retriever = await agentPool.acquire('retriever');
  const synthesizer = await agentPool.acquire('synthesizer');
  const factChecker = await agentPool.acquire('fact-checker');
  const sharedContext = await buildSharedContext(document);

  // Each agent enters executing independently
  const retrieverPromise = stateMachine.transition(retriever.id, {
    from: 'idle',
    to: 'executing',
    context: sharedContext,
  });
  const synthesizerPromise = stateMachine.transition(synthesizer.id, {
    from: 'idle',
    to: 'executing',
    context: sharedContext,
  });

  // Fact checker waits for initial outputs, then validates
  const [retrieverOutput, synthesisOutput] = await Promise.all([
    retrieverPromise,
    synthesizerPromise,
  ]);

  // Explicit validation gate with possible negotiation
  const validation = await stateMachine.transition(factChecker.id, {
    from: 'idle',
    to: 'validating',
    context: {
      ...sharedContext,
      retrieverOutput,
      synthesisOutput,
    },
  });

  // If validation disputes, enter negotiation
  if (validation.type === 'validating' && validation.candidate.confidence < CONFIDENCE_THRESHOLD) {
    const dispute = await stateMachine.transition(factChecker.id, {
      from: 'validating',
      to: 'negotiating',
      dispute: {
        issue: 'confidence_too_low',
        details: validation.candidate.concerns,
        proposedResolution: null,
      },
      partners: [retriever.id, synthesizer.id],
    });

    // Negotiation protocol: agents propose resolutions, vote, retry
    const resolution = await runNegotiationProtocol(dispute, {
      maxRounds: 3,
      timeoutMs: 30000,
    });

    if (resolution.accepted) {
      return commitWithResolution(stateMachine, factChecker, resolution);
    }
    return escalateToHuman(stateMachine, factChecker, dispute);
  }

  // Clean commit path
  return commitOutput(stateMachine, factChecker, validation.candidate);
}
```

The critical difference: `runNegotiationProtocol` is a first-class citizen. It has timeouts, round limits, and explicit voting. In our DAG version, this was a retry loop with a comment saying “TODO: better coordination.” The retry loop would spin for 10 minutes before timing out. The negotiation protocol typically resolves in under 5 seconds or escalates immediately.
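The post describes `runNegotiationProtocol` (round cap, time cap, explicit voting) without showing it. Here is a minimal sketch of what such a budgeted loop can look like. All names in it (`Proposal`, `negotiate`, the callback shapes) are illustrative assumptions, not the actual agent-swarm API:

```typescript
// Hypothetical sketch of a round- and time-budgeted negotiation loop.
// Shapes are illustrative; they are not the real agent-swarm types.
interface Proposal {
  from: string;
  resolution: string;
}

interface NegotiationResult {
  accepted: boolean;
  rounds: number;
  resolution?: string;
}

async function negotiate(
  propose: (round: number) => Promise<Proposal[]>,
  vote: (p: Proposal) => Promise<boolean>,
  opts: { maxRounds: number; timeoutMs: number }
): Promise<NegotiationResult> {
  const deadline = Date.now() + opts.timeoutMs;
  for (let round = 1; round <= opts.maxRounds; round++) {
    if (Date.now() > deadline) break; // time budget exhausted
    const proposals = await propose(round);
    for (const p of proposals) {
      if (await vote(p)) {
        // Agreement reached inside the budget
        return { accepted: true, rounds: round, resolution: p.resolution };
      }
    }
  }
  // Budget exhausted without agreement; the caller escalates to a human
  return { accepted: false, rounds: opts.maxRounds };
}
```

The important property is that the loop can only end two ways, agreement or escalation, so a dispute can never spin indefinitely the way an unbounded retry loop can.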
What We Tried That Didn’t Work
Before settling on this design, we experimented with several approaches that seemed reasonable but collapsed under load.
Event sourcing with CQRS. We tried persisting every state change as events, with separate read models for querying. The write amplification was worse than expected—every LLM token generation produced state updates, and our event store grew 20GB daily. More critically, rebuilding state from events for debugging was too slow when operators needed answers in seconds, not minutes.
Persistent actor models (Akka-style). We liked the idea of agents as actors with mailboxes. But message passing between LLM agents is too coarse-grained. An agent needs structured access to another’s context, not just async messages. We ended up with actor wrappers around state machines, which added overhead without benefit.
Temporal-style durable execution. We actually wanted this to work. But Temporal’s determinism requires deterministic operations. LLM inference is fundamentally non-deterministic—same prompt, different outputs. Temporal’s replay-based recovery broke constantly. We would have needed to wrap every LLM call in deterministic stubs, which defeats the point of using agents instead of code.
Operational Wins: What Changed Day-To-Day
The state machine isn’t just cleaner code. It’s changed how we operate the system.
| Before (DAGs) | After (State Machines) |
|---|---|
| “Task running 45min”—why? Unknown. | “Agent in negotiating state, round 2 of 3, waiting for synthesizer response” |
| Kill stuck workflows, lose partial progress | Checkpoint at every state transition, resume from any point |
| Retry storms when downstream agents disagree | Explicit negotiation limits, human escalation after 3 rounds |
| Schema errors caught late | Validation gates at every transition, semantic and syntactic |
The observability change alone justified the migration. Our PagerDuty alerts shifted from “something is stuck somewhere” to “Agent X in Workflow Y needs human judgment on semantic conflict Z.” Mean time to resolution for stuck workflows dropped from 45 minutes to under 5 minutes.
Implementation Detail: Versioned Checkpoints
One pattern we learned from database systems: checkpoints must be versioned and immutable. We don’t update state in place. Every transition produces a new checkpoint, old ones are retained for replay and debugging.
```typescript
// From /packages/core/src/checkpoint/store.ts
interface Checkpoint {
  id: CheckpointId; // ULID, sortable
  agentId: AgentId;
  workflowId: WorkflowId;
  sequenceNumber: number; // strictly increasing
  state: AgentState; // full state at this point
  parentCheckpoint: CheckpointId | null;
  createdAt: ISO8601;
  // Critical: LLM outputs that led here, for reproducibility
  trace: {
    prompt: string;
    completion: string;
    modelVersion: string;
    temperature: number;
  }[];
}

async function transitionWithCheckpoint(
  store: CheckpointStore,
  agentId: AgentId,
  targetState: AgentState['type'],
  params: TransitionParams
): Promise<Checkpoint> {
  const current = await store.getLatest(agentId);

  // Optimistic concurrency — detect stale states
  if (current.state.type !== params.from) {
    throw new StaleStateError(
      `Expected state ${params.from}, found ${current.state.type}`
    );
  }

  // Execute transition, capture new state
  const newState = await applyGuard(current.state, targetState, params);

  // Persist before acknowledging
  const checkpoint = await store.append({
    agentId,
    state: newState,
    sequenceNumber: current.sequenceNumber + 1,
    parentCheckpoint: current.id,
    trace: captureTrace(),
  });

  // Publish event for observers (metrics, alerting)
  await publishTransitionEvent(checkpoint);
  return checkpoint;
}
```

The optimistic concurrency check caught race conditions we didn’t know existed. When two controllers tried to move the same agent (this happened during network partitions), we’d get explicit errors instead of silently corrupted state. The `StaleStateError` became our most informative exception—it told us exactly what the system expected vs. what it found.
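To see how the stale-state check behaves when two controllers race, here is a condensed in-memory demo. The store shape and this `StaleStateError` class are simplified stand-ins, not the production checkpoint store:

```typescript
// Simplified in-memory model of the optimistic-concurrency check.
// The real store is durable; this only demonstrates the race detection.
class StaleStateError extends Error {}

interface Rec {
  seq: number;
  state: string;
}

class InMemoryStore {
  private latest = new Map<string, Rec>();

  get(id: string): Rec {
    return this.latest.get(id) ?? { seq: 0, state: 'idle' };
  }

  // Append succeeds only if the caller's view of the state is current
  append(id: string, expectedFrom: string, next: string): Rec {
    const cur = this.get(id);
    if (cur.state !== expectedFrom) {
      throw new StaleStateError(
        `Expected state ${expectedFrom}, found ${cur.state}`
      );
    }
    const rec = { seq: cur.seq + 1, state: next };
    this.latest.set(id, rec);
    return rec;
  }
}
```

When two controllers both try to move the same agent from `idle`, the first wins and the second gets an explicit `StaleStateError` instead of overwriting the checkpoint.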
The Migration Path: Incremental, Not Revolutionary
We didn’t rebuild everything at once. The migration took six weeks with zero downtime. Here’s the sequence that worked:
- Week 1-2: Wrapped existing DAG tasks as state machine actions. The DAG still orchestrated, but individual “nodes” were now state machines with single states.
- Week 3-4: Migrated one hot path workflow (document analysis) to full state machine coordination. Ran shadow mode: state machine outputs vs. DAG outputs, compared results.
- Week 5: Fixed divergence issues. Mostly around timing—state machines were more patient, DAGs had aggressive timeouts.
- Week 6: Cut over remaining workflows. Kept DAG engine running for rollback, disabled after 48 hours stable.
The shadow mode comparison was essential. We found cases where the DAG was “correct” by accident—race conditions that happened to resolve favorably. The state machine exposed these as explicit negotiation rounds, which initially looked like regressions. They weren’t. They were bugs we’d been unaware of.
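The shadow-mode harness can be quite small. A minimal sketch, assuming both engines expose an async run function (the names and `Divergence` shape here are hypothetical, not from our codebase):

```typescript
// Hypothetical shadow-mode harness: run both paths, serve the old
// engine's result, and record any divergence for later review.
interface Divergence<T> {
  input: string;
  primary: T;
  shadow: T;
}

async function runWithShadow<T>(
  input: string,
  primary: (i: string) => Promise<T>, // existing DAG path
  shadow: (i: string) => Promise<T>, // new state-machine path
  record: (d: Divergence<T>) => void
): Promise<T> {
  const [p, s] = await Promise.all([primary(input), shadow(input)]);
  if (JSON.stringify(p) !== JSON.stringify(s)) {
    record({ input, primary: p, shadow: s });
  }
  // Callers keep seeing the primary output during the migration
  return p;
}
```

Callers never observe the shadow result, so a buggy new path costs only compute, never correctness, while every divergence is captured for triage.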
Recommendations We Stand Behind
After running this in production for several months, here are specific, opinionated recommendations:
- Never let agents retry without knowing why. Explicit states force you to categorize failure modes. If you can’t categorize it, you can’t handle it.
- Negotiation needs budgets. Unlimited negotiation is just a distributed deadlock. Cap rounds, cap time, escalate to humans. We use 3 rounds / 30 seconds.
- Checkpoints before external calls. Every LLM inference, every database write—checkpoint first. Recovery without data loss is possible.
- State machines compose, but shallowly. We tried 5-level hierarchies. Debugging them was worse than flat machines. Now we go 2-3 levels deep maximum.
- Keep the DAG for data flow, not agent coordination. We still use DAG-like structures for batch processing where agents don’t negotiate. Don’t throw away old tools—just use them where they fit.
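The "checkpoints before external calls" rule reduces to a small wrapper: persist intent durably, then perform the side effect. A sketch, where `saveCheckpoint` is a placeholder for whatever durable store you use:

```typescript
// Sketch of checkpoint-before-external-call ordering.
// `saveCheckpoint` is a placeholder for a durable write; if the process
// crashes mid-call, recovery replays from the persisted intent.
async function checkpointedCall<T>(
  saveCheckpoint: (label: string, payload: unknown) => Promise<void>,
  label: string,
  payload: unknown,
  call: () => Promise<T>
): Promise<T> {
  // Durable record of what we are about to do, written first
  await saveCheckpoint(label, payload);
  // Only then the non-replayable side effect (LLM call, DB write, ...)
  return call();
}
```

The ordering is the whole point: a crash between the two steps leaves a checkpoint describing the in-flight call, so recovery knows exactly what to retry.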
What You Can Build This Week
You don’t need our full framework to start. Here’s a minimal state machine you can add to an existing agent system:
```typescript
// Drop-in state machine for existing agents
// Minimal error type so the class is self-contained
export class AgentValidationError extends Error {
  constructor(reason: string | undefined, readonly output: unknown) {
    super(reason ?? 'validation failed');
  }
}

export class MinimalAgentStateMachine {
  private state: 'idle' | 'working' | 'validating' | 'stuck' = 'idle';
  private checkpoint: unknown = null;

  async execute<T>(
    agent: (input: unknown, checkpoint: unknown) => Promise<T>,
    input: unknown,
    validator: (output: T) => { valid: boolean; reason?: string }
  ): Promise<T> {
    // Transition: idle → working
    this.state = 'working';
    this.checkpoint = input;

    try {
      const output = await agent(input, this.checkpoint);

      // Transition: working → validating
      this.state = 'validating';
      const validation = validator(output);
      if (validation.valid) {
        this.state = 'idle';
        return output;
      }

      // Transition: validating → stuck (negotiation needed)
      this.state = 'stuck';
      throw new AgentValidationError(validation.reason, output);
    } catch (error) {
      // All failures go through stuck state
      this.state = 'stuck';
      throw error;
    }
  }

  getCurrentState() {
    return {
      state: this.state,
      hasCheckpoint: this.checkpoint !== null,
    };
  }

  // Recovery: resume from checkpoint with new parameters
  async resume(
    agent: (input: unknown, checkpoint: unknown) => Promise<unknown>,
    newInput: unknown
  ) {
    if (this.state !== 'stuck') {
      throw new Error('Can only resume from stuck state');
    }
    return this.execute(agent, newInput, (_x) => ({ valid: true }));
  }
}
```

This fits in one file. It gives you three things your DAG probably doesn’t: explicit validation gates, stuck-state detection, and resumability. Add metrics on state transitions and you’ll see your system’s behavior more clearly than any execution graph.
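The "metrics on state transitions" suggestion can start as nothing more than an append-only log of transitions. A sketch (the class and method names are illustrative, not part of the framework above):

```typescript
// Illustrative sketch: record every state transition so dashboards and
// alerts can read the history instead of guessing from task runtimes.
type State = 'idle' | 'working' | 'validating' | 'stuck';

class ObservedStateMachine {
  private state: State = 'idle';
  // Append-only transition log; feed this to your metrics pipeline
  readonly transitions: Array<{ from: State; to: State; at: number }> = [];

  moveTo(next: State): void {
    this.transitions.push({ from: this.state, to: next, at: Date.now() });
    this.state = next;
  }

  current(): State {
    return this.state;
  }
}
```

Even this crude log answers the questions a DAG can’t: which state an agent is in right now, how it got there, and how long each hop took.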
The state machine isn’t a theoretical improvement. It’s the difference between “the system is doing something weird” and “Agent 7 needs help with semantic validation of claim X.” That specificity translates directly to uptime, debuggability, and sleep.