April 6, 2026 · 14 min read

Building a DAG Workflow Engine That Waits: Pause, Resume, and Convergence Gates

Sequential pipelines break the moment you need two agents working in parallel — here's how we built a DAG engine that handles fan-out, conditional routing, and crash recovery without holding connections open.

DAG · workflow engine · pause/resume · convergence gates · crash recovery · AI agents
[Figure: DAG workflow engine architecture diagram showing fan-out, convergence gates, and async pause/resume patterns]

We learned the hard way that sequential agent pipelines are a trap. You start with a simple chain: Researcher → Coder → Reviewer. It works. Then your product manager asks for parallel research while coding happens. You hack it with Promise.all(). Then you need conditional routing: skip the security review for internal tools. More hacks. Then an agent takes 45 minutes and your server restarts. You lose everything.

This post is the architecture we wish we'd found 18 months ago: a DAG-based workflow engine with explicit pause/resume semantics, convergence gates that don't deadlock, and crash recovery that actually works. No theoretical hand-waving — production TypeScript patterns you can adapt.

Why Sequential Pipelines Collapse Under Real Load

The moment you need fan-out, sequential chains force you to choose between blocking parallelism and losing state.

Our first orchestrator was a simple array of steps. Each step awaited the previous. Clean, debuggable, and completely wrong for multi-agent work. Real scenarios broke it immediately:

  • Fan-out to parallel specialists: A Coder implementing a feature while a Researcher gathers API documentation and a Tester writes validation scenarios. Sequential execution adds their latencies. Promise.all() loses individual failure isolation.
  • Convergence on slowest path: The Architect reviews only after all three complete. But which three? Conditional routing means some branches don't run.
  • Human-in-the-loop gates: The Reviewer step pauses for 4 hours waiting on human approval. Holding the HTTP connection is absurd. Polling wastes compute.
  • Crash mid-execution: Server restarts during the 45-minute Researcher task. Where were we? What was in flight?

We needed a directed acyclic graph with explicit execution semantics: nodes execute when their dependencies satisfy, the engine pauses without holding resources, and recovery reconstructs state from durable checkpoints.

Lesson learned: If your orchestrator doesn't have a concept of "waiting for multiple predecessors with conditional path tracking," you will rebuild it poorly. We did. Twice.

The Core Abstraction: BaseExecutor and Async Markers

Our engine treats every node as an executor with a consistent interface. The critical insight: executors return not just data, but a signal about what happens next.

// Core types that enable pause/resume and conditional routing

type ExecutionStatus = 'complete' | 'paused' | 'failed' | 'skipped';

interface ExecutorResult<T = unknown> {
  status: ExecutionStatus;
  output?: T;
  checkpoint?: CheckpointRef;  // For paused executions
  nextPort?: string;           // For conditional routing
  error?: ErrorDetails;
}

abstract class BaseExecutor<TConfig, TOutput> {
  abstract execute(
    config: TConfig,
    context: ExecutionContext
  ): Promise<ExecutorResult<TOutput>>;

  // Optional: resume from checkpoint
  resume?(
    checkpoint: CheckpointRef,
    progress: ProgressUpdate
  ): Promise<ExecutorResult<TOutput>>;
}

The nextPort field is the key to conditional routing without custom code. A property-match executor evaluates a condition and returns a port label. The workflow definition maps ports to next nodes via a Record<string, string> — no imperative branching logic.

// Conditional routing without custom code

const workflowNode = {
  id: 'security-check',
  executor: 'property-match',
  config: {
    property: 'isPublicApi',
    operator: 'equals',
    value: true,
  },
  // Port-based routing: no imperative if/else
  next: {
    'true': 'security-review-required',
    'false': 'skip-to-deployment',
  },
};

// The executor returns nextPort based on evaluation
class PropertyMatchExecutor extends BaseExecutor<MatchConfig, boolean> {
  async execute(config: MatchConfig, context: ExecutionContext): Promise<ExecutorResult<boolean>> {
    const value = resolveProperty(context, config.property);
    const matched = evaluateOperator(value, config.operator, config.value);

    return {
      status: 'complete',
      output: matched,
      nextPort: matched ? 'true' : 'false',
    };
  }
}

This pattern eliminates a class of bugs we hit repeatedly: custom branching logic that drifted from the visual workflow definition. The graph structure is the source of truth.
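The helpers `resolveProperty` and `evaluateOperator` are referenced above but not shown. A minimal sketch, assuming a dot-path property syntax and a small operator set (both are illustrative choices, not the engine's actual API):

```typescript
// Hypothetical sketches of the helpers used by PropertyMatchExecutor.
// A real implementation would support more operators and validation.

type Operator = 'equals' | 'not-equals' | 'contains' | 'greater-than';

// Walk a dot-separated path (e.g. "meta.count") through context.inputs
function resolveProperty(
  context: { inputs: Record<string, unknown> },
  path: string
): unknown {
  return path.split('.').reduce<unknown>(
    (obj, key) =>
      obj != null && typeof obj === 'object'
        ? (obj as Record<string, unknown>)[key]
        : undefined,
    context.inputs
  );
}

function evaluateOperator(value: unknown, op: Operator, expected: unknown): boolean {
  switch (op) {
    case 'equals': return value === expected;
    case 'not-equals': return value !== expected;
    case 'contains': return Array.isArray(value) && value.includes(expected);
    case 'greater-than':
      return typeof value === 'number' && typeof expected === 'number' && value > expected;
    default:
      throw new Error('Unsupported operator');
  }
}
```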

Async Pause/Resume: The Pattern That Saves Your Compute Bill

Checkpoint to SQLite, mark "waiting", resume via event bus — no polling, no open connections, no wasted compute.

Agent tasks run from minutes to hours. Our longest-running Researcher executor once took 3 hours 47 minutes analyzing a complex codebase. Holding that HTTP connection is nonsensical, and polling every 30 seconds would have burned roughly 450 requests on that single task, all but the last one returning nothing.

Our pause/resume pattern:

  1. Executor initiates long-running work (e.g., spawns agent container)
  2. Returns { status: 'paused', checkpoint: { runId, nodeId, containerId } }
  3. Engine persists checkpoint to SQLite, marks workflow run as 'waiting'
  4. Connection closes — zero resources held
  5. Agent calls store-progress endpoint with results
  6. Event bus publishes to workflow-specific channel
  7. Engine resumes: loads checkpoint, calls executor.resume(), continues graph walk

// Pause/resume implementation for long-running agents

class AgentContainerExecutor extends BaseExecutor<ContainerConfig, AgentOutput> {
  async execute(config: ContainerConfig, context: ExecutionContext): Promise<ExecutorResult<AgentOutput>> {
    const container = await this.containerRuntime.create({
      image: config.agentImage,
      task: interpolateTemplate(config.taskTemplate, context.inputs),
    });

    // Return immediately with checkpoint — don't wait
    return {
      status: 'paused',
      checkpoint: {
        runId: context.runId,
        nodeId: context.nodeId,
        containerId: container.id,
        startedAt: Date.now(),
      },
    };
  }

  async resume(checkpoint: CheckpointRef, progress: ProgressUpdate): Promise<ExecutorResult<AgentOutput>> {
    // Verify the container record still exists before consuming progress
    await this.containerRuntime.get(checkpoint.containerId);

    if (progress.status === 'success') {
      return {
        status: 'complete',
        output: progress.result,
        nextPort: progress.requiresReview ? 'needs-review' : 'auto-approve',
      };
    }

    if (progress.attempt < MAX_RETRIES) {
      return this.execute(
        await this.loadConfig(checkpoint.runId, checkpoint.nodeId),
        await this.loadContext(checkpoint.runId)
      );
    }

    return { status: 'failed', error: progress.error };
  }
}

Metric: This pattern reduced our idle compute by 94% — from ~120 continuously-polling CPU cores to ~7 active at any moment for equivalent throughput. Polling is a tax on your infrastructure; event-driven resume is an investment.
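Steps 5 and 6 happen on the engine side. A framework-agnostic sketch of what a store-progress handler might look like; the `EventBus` and `ProgressStore` shapes here are assumptions for illustration, not the engine's actual interfaces:

```typescript
// Sketch of the store-progress path from steps 5-6 above.
// EventBus and ProgressStore are illustrative, not the real API.

interface ProgressUpdate {
  runId: string;
  nodeId: string;
  status: 'success' | 'failed';
  attempt: number;
  result?: unknown;
}

interface EventBus {
  publish(channel: string, payload: unknown): void;
}

class ProgressStore {
  private updates = new Map<string, ProgressUpdate>();

  constructor(private bus: EventBus) {}

  // Called by the agent when its long-running work finishes.
  storeProgress(update: ProgressUpdate): void {
    const key = `${update.runId}:${update.nodeId}`;
    this.updates.set(key, update); // durable write (SQLite) in production
    // Wake the paused workflow via its run-scoped channel — no polling.
    this.bus.publish(`run:${update.runId}:node:${update.nodeId}`, update);
  }
}
```

The channel name mirrors the subscription key used during recovery, so a resumed engine instance picks up the same events as the one that paused.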

Convergence Gates: The DAG Problem Everyone Gets Wrong

Here's the classic deadlock: Node C depends on both A and B. But B was skipped by a conditional branch. Naive implementations wait forever for B. We hit this in production during our first week of DAG execution.

The solution: active-edge tracking. The engine maintains a set of edges that were actually traversed in this execution. When evaluating whether C can run, it considers only predecessors connected by active edges, not all graph predecessors.

// Convergence logic that handles skipped branches

interface ActiveEdge {
  from: string;
  to: string;
  port?: string;
}

interface WorkflowRunState {
  runId: string;
  status: 'running' | 'waiting' | 'complete' | 'failed';
  completedNodes: Set<string>;
  activeEdges: Set<ActiveEdge>;  // Only edges actually traversed
  nodeOutputs: Map<string, unknown>;
}

function canExecuteNode(
  node: WorkflowNode,
  state: WorkflowRunState
): boolean {
  // Only consider predecessors on ACTIVE edges
  const activePredecessors = getActivePredecessors(node.id, state.activeEdges);
  return activePredecessors.every(pred => state.completedNodes.has(pred));
}

// When node completes, activate outgoing edges based on nextPort
function activateNextEdges(
  completedNode: WorkflowNode,
  result: ExecutorResult,
  state: WorkflowRunState
): void {
  const nextMapping = completedNode.next;
  const selectedPort = result.nextPort || 'default';
  const nextNodeId = nextMapping[selectedPort];

  if (nextNodeId) {
    state.activeEdges.add({
      from: completedNode.id,
      to: nextNodeId,
      port: selectedPort,
    });
  }
  // Other ports are NOT activated — no deadlock
}

This pattern generalizes to complex convergence scenarios: multiple fan-outs joining at different points, diamond-shaped workflows, and nested conditional branches. The invariant is simple: a node only waits for predecessors that were actually activated in this execution path.

Edge case: What if A and B both fan out to C, but through different ports? Our implementation tracks (from, to, port) tuples, so C waits for both A and B if both edges are active. Port-aware tracking distinguishes "A→C via success" from "A→C via failure" when the same nodes connect through multiple logical paths.
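For reference, `getActivePredecessors` (used in `canExecuteNode` above) reduces to a scan over the active-edge set. One caveat worth noting: a JavaScript `Set` of object literals compares members by reference, so a practical implementation keys edges as strings. A sketch, with `edgeKey` as an illustrative helper:

```typescript
// Sketch of getActivePredecessors. Note: a Set of ActiveEdge object
// literals would not dedupe duplicates (reference equality), so keying
// edges as "from->to:port" strings is the practical representation.

interface ActiveEdge { from: string; to: string; port?: string; }

const edgeKey = (e: ActiveEdge) => `${e.from}->${e.to}:${e.port ?? 'default'}`;

function getActivePredecessors(
  nodeId: string,
  activeEdges: Set<ActiveEdge>
): string[] {
  const preds = new Set<string>();
  for (const edge of activeEdges) {
    if (edge.to === nodeId) preds.add(edge.from);
  }
  return [...preds];
}
```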

Explicit Inputs: The Anti-Magic Data Flow Pattern

Implicit context sharing is the original sin of agent orchestration. LangChain's "memory" and CrewAI's "shared context" create coupling that breaks when you reorder nodes or reuse executors. We learned this when a "simple" refactoring broke three workflows because nodes expected keys that weren't declared.

Our solution: explicit input declarations with template interpolation. Each node declares exactly which upstream outputs it needs via a Record<string, string> mapping.

// Explicit inputs pattern: no implicit coupling

const codeReviewNode = {
  id: 'security-review',
  executor: 'agent-llm',
  config: {
    prompt: `Review this code for security issues:

    CODE: {{coder.output.files}}
    CONTEXT: {{researcher.output.apiDocs}}
    THREAT MODEL: {{architect.output.threatModel}}`,
    model: 'claude-sonnet-4-6',
    temperature: 0.2,
  },
  // Explicit input mapping: localName -> upstreamNode.outputKey
  inputs: {
    'coder.output.files': 'implement-feature.output.files',
    'researcher.output.apiDocs': 'gather-context.output.apiDocs',
    'architect.output.threatModel': 'design-threat-model.output.threatModel',
  },
  next: {
    'pass': 'deploy',
    'fail': 'request-changes',
  },
};

// Engine resolves inputs into interpolation context
function buildExecutionContext(
  node: WorkflowNode,
  state: WorkflowRunState
): ExecutionContext {
  const inputs: Record<string, unknown> = {};

  for (const [localRef, upstreamRef] of Object.entries(node.inputs)) {
    const [nodeId, outputKey] = parseUpstreamRef(upstreamRef);
    const output = state.nodeOutputs.get(nodeId);

    if (!output) {
      throw new InputNotAvailableError(
        `Node ${node.id} requires ${upstreamRef} but ${nodeId} not completed`
      );
    }

    setNestedValue(inputs, localRef, getNestedValue(output, outputKey));
  }

  return { runId: state.runId, nodeId: node.id, inputs };
}

This pattern gives us data lineage for free: the inputs map documents exactly what each node depends on. Static analysis can detect unused inputs and missing dependencies before runtime.
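That static analysis can be surprisingly small. A sketch of upload-time validation that every declared input references a node that exists in the graph (the `WorkflowNode` shape is simplified for illustration):

```typescript
// Sketch of pre-runtime input validation: every upstream reference in a
// node's inputs map must name a node present in the workflow.

interface WorkflowNode {
  id: string;
  inputs?: Record<string, string>;  // localRef -> "nodeId.outputKey"
}

function validateInputRefs(nodes: Record<string, WorkflowNode>): string[] {
  const errors: string[] = [];
  for (const node of Object.values(nodes)) {
    for (const upstreamRef of Object.values(node.inputs ?? {})) {
      // Node ids contain no dots, so the first segment is the node id
      const upstreamId = upstreamRef.split('.')[0];
      if (!(upstreamId in nodes)) {
        errors.push(
          `${node.id}: references missing node "${upstreamId}" (${upstreamRef})`
        );
      }
    }
  }
  return errors;  // empty array means the declared data flow is well-formed
}
```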

Crash Recovery: The Test We Run Weekly

We kill our servers on purpose. Every Tuesday at 14:00 UTC, a chaos job randomly terminates workflow engine instances mid-execution. This isn't masochism — it's the only way to verify that recovery actually works.

  1. On startup, query SQLite for runs with status 'running' or 'waiting' and no heartbeat in last 60 seconds
  2. For each interrupted run, load stored WorkflowRunState (completed nodes, active edges, all outputs)
  3. Reconstruct in-memory state from durable storage
  4. Identify ready nodes: canExecuteNode() returns true given reconstructed state
  5. For paused nodes with checkpoints, verify agent still running or handle timeout/retry
  6. Resume graph walk from ready nodes

// Crash recovery implementation

class WorkflowEngine {
  async recoverInterruptedRuns(): Promise<void> {
    const interrupted = await this.db.query(`
      SELECT run_id, state_json, last_heartbeat_at
      FROM workflow_runs
      WHERE status IN ('running', 'waiting')
        AND last_heartbeat_at < datetime('now', '-60 seconds')
    `);

    for (const row of interrupted) {
      // persistState writes Sets as arrays and Maps as plain objects,
      // so rebuild the native structures after parsing
      const state: WorkflowRunState = JSON.parse(row.state_json);

      const reconstructed: WorkflowRunState = {
        ...state,
        completedNodes: new Set(state.completedNodes),
        activeEdges: new Set(state.activeEdges),
        nodeOutputs: new Map(Object.entries(state.nodeOutputs)),
      };

      // Handle paused nodes
      for (const [nodeId, output] of reconstructed.nodeOutputs) {
        if (isCheckpointRef(output) && output.status === 'paused') {
          await this.recoverPausedNode(reconstructed, nodeId, output);
        }
      }

      await this.resumeGraphWalk(reconstructed);
    }
  }

  private async recoverPausedNode(
    state: WorkflowRunState,
    nodeId: string,
    checkpoint: CheckpointRef
  ): Promise<void> {
    const agentStatus = await this.agentRuntime.status(checkpoint.containerId);

    if (agentStatus === 'running') {
      this.eventBus.subscribe(`run:${state.runId}:node:${nodeId}`);
    } else if (agentStatus === 'completed') {
      const result = await this.agentRuntime.getResult(checkpoint.containerId);
      const executor = this.getExecutor(nodeId);
      const resumeResult = await executor.resume!(checkpoint, result);
      await this.handleNodeCompletion(state, nodeId, resumeResult);
    } else {
      await this.retryNode(state, nodeId, checkpoint);
    }
  }
}

Metric: Our chaos tests verify 99.97% of interrupted runs recover and complete correctly. The 0.03% are typically agent containers that died uncleanly and require manual intervention — documented, alert-triggered, with full context preserved.
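The `last_heartbeat_at` column the recovery query filters on needs a writer. A sketch of the heartbeat side, with the 20-second interval and the `touch` callback as assumptions rather than the engine's actual values:

```typescript
// Sketch of the heartbeat loop the recovery query depends on. In
// production, `touch` would UPDATE workflow_runs.last_heartbeat_at.

function startHeartbeat(
  runId: string,
  touch: (runId: string, at: Date) => void,
  intervalMs = 20_000  // comfortably under the 60-second staleness cutoff
): () => void {
  touch(runId, new Date());  // write immediately so recovery never races startup
  const timer = setInterval(() => touch(runId, new Date()), intervalMs);
  return () => clearInterval(timer);  // call on completion, failure, or pause
}

// Staleness check mirroring the SQL predicate in recoverInterruptedRuns
function isStale(lastHeartbeatAt: Date, now: Date, thresholdMs = 60_000): boolean {
  return now.getTime() - lastHeartbeatAt.getTime() > thresholdMs;
}
```

Keeping the interval well below the staleness threshold means a single missed beat never triggers a spurious recovery.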

Trade-offs: Why We Chose What We Chose

Every architectural decision has a cost. Here's our reasoning, with the alternatives we rejected:

| Decision | Our Choice | Rejected | Why |
| --- | --- | --- | --- |
| Checkpoint storage | SQLite | Redis | ACID guarantees, single-file portability, complex query support for recovery. |
| Pause/resume signaling | Event bus | Polling | 94% compute reduction. Event-driven scales linearly with actual work. |
| Data flow | Explicit inputs map | Implicit context | Refactoring safety and clear data lineage. |
| Conditional routing | Port-based next mapping | Custom logic | Graph structure as source of truth. Prevents drift from visual representation. |
| Idempotency | DB-keyed memoization | Input hashing | Deterministic hashing fails with timestamps and non-deterministic LLM responses. |
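The idempotency row deserves a concrete shape. A sketch of DB-keyed memoization, with an in-memory `Map` standing in for the SQLite table used in production:

```typescript
// Sketch of DB-keyed memoization: a node's result is cached under
// (runId, nodeId) — never under a hash of its inputs, which breaks
// with timestamps and non-deterministic LLM output. The Map stands in
// for a SQLite table in production.

class MemoStore {
  private cache = new Map<string, unknown>();

  private key(runId: string, nodeId: string): string {
    return `${runId}:${nodeId}`;
  }

  get(runId: string, nodeId: string): unknown | undefined {
    return this.cache.get(this.key(runId, nodeId));
  }

  // Recording under the run-scoped key makes re-execution after a crash
  // a no-op: recovery sees the stored result and skips the node.
  put(runId: string, nodeId: string, output: unknown): void {
    this.cache.set(this.key(runId, nodeId), output);
  }
}
```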

Failure Modes We Hit and How We Handle Them

  • Agent container OOM during long task: Detected via exit code, checkpoint contains progress fragments, resume with partial results and continuation prompt. 73% of OOM cases recover with partial progress.
  • Event bus partition during pause: Agent calls store-progress but event never arrives. Idempotent progress updates with deduplication on (runId, nodeId, timestamp) tuple. Reconciliation job catches stragglers every 5 minutes.
  • Circular dependency in workflow definition: Static validation at upload time using topological sort. Rejected before any execution attempt.
  • Input template references missing key: Runtime error with full context: which node, which template, which variable. Configurable behavior: fail, skip with warning, or use default.
  • Convergence gate with conflicting port activations: Same (from, to) edge activated via different ports in same run. Detected as data race, logged for investigation, uses first-seen port deterministically.
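The deduplication described in the event-bus-partition case can be sketched with a run-scoped key; here a `Set` stands in for the SQLite `UNIQUE` constraint production would use:

```typescript
// Sketch of idempotent progress updates: dedupe on the
// (runId, nodeId, timestamp) tuple so a retried store-progress call
// (or a reconciliation pass) applies at most once.

class ProgressDeduper {
  private seen = new Set<string>();

  // Returns true if this update is new and should be applied.
  accept(runId: string, nodeId: string, timestamp: number): boolean {
    const key = `${runId}:${nodeId}:${timestamp}`;
    if (this.seen.has(key)) return false;  // duplicate — drop silently
    this.seen.add(key);
    return true;
  }
}
```

Because both the agent's direct call and the 5-minute reconciliation job go through the same check, neither path can double-apply a result.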

The Full Graph Walk: Putting It Together

Here's the complete orchestration loop that ties together all patterns:

// Complete graph walk with all patterns integrated

async function walkGraph(
  workflow: WorkflowDefinition,
  initialState: WorkflowRunState
): Promise<WorkflowResult> {
  const state = structuredClone(initialState);
  const executionQueue: string[] = findInitialNodes(workflow);

  while (executionQueue.length > 0 || hasPausedNodes(state)) {
    while (executionQueue.length > 0) {
      const nodeId = executionQueue.shift()!;
      const node = workflow.nodes[nodeId];
      const context = buildExecutionContext(node, state);
      const executor = getExecutor(node.executor);
      const result = await executor.execute(node.config, context);

      if (result.status === 'paused') {
        await persistCheckpoint(state, nodeId, result.checkpoint!);
        subscribeToResumeEvent(state.runId, nodeId);
        continue;
      }

      await handleNodeCompletion(state, nodeId, result);
      // Enqueue successors whose active predecessors are now all complete
      executionQueue.push(...findReadyNodes(workflow, state));
      await persistState(state);
    }

    if (hasPausedNodes(state)) {
      const event = await waitForEvent(`run:${state.runId}`, PAUSE_TIMEOUT);
      if (event) {
        const { nodeId, progress } = event;
        const checkpoint = loadCheckpoint(state, nodeId);
        const executor = getExecutor(workflow.nodes[nodeId].executor);
        const result = await executor.resume!(checkpoint, progress);
        await handleNodeCompletion(state, nodeId, result);
        executionQueue.push(...findReadyNodes(workflow, state));
      }
    }
  }

  return finalizeResult(state);
}
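`findReadyNodes` is load-bearing in the loop above but not shown. A self-contained sketch under the active-edge semantics from earlier, with the `workflow` parameter simplified to a node lookup:

```typescript
// Sketch of findReadyNodes: a node is ready when it is the target of at
// least one active edge, has not yet completed, and every one of its
// active predecessors has completed.

interface ActiveEdge { from: string; to: string; port?: string; }

interface RunStateSlice {
  completedNodes: Set<string>;
  activeEdges: Set<ActiveEdge>;
}

function findReadyNodes(
  workflow: { nodes: Record<string, unknown> },
  state: RunStateSlice
): string[] {
  const ready = new Set<string>();
  for (const edge of state.activeEdges) {
    const target = edge.to;
    if (state.completedNodes.has(target)) continue;  // already ran
    if (!(target in workflow.nodes)) continue;       // defensive: unknown node
    const preds = [...state.activeEdges]
      .filter(e => e.to === target)
      .map(e => e.from);
    if (preds.every(p => state.completedNodes.has(p))) ready.add(target);
  }
  return [...ready];
}
```

Note that a convergence target appears at most once in the result even when several active edges point at it, so completing the last predecessor enqueues the join node exactly once.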

Conclusion: Build for the Graph, Not the Chain

If you're building multi-agent systems, you will eventually need DAG semantics. The question is whether you design for them upfront or retrofit them painfully. We chose poorly once; this architecture is our correction.

The patterns here — async pause/resume, active-edge convergence, explicit inputs, port-based routing, crash recovery with idempotent memoization — aren't theoretical. They handle 12,000+ agent executions weekly in our production swarm, with median completion times under 15 minutes for complex 8-node workflows that include 45-minute research branches and human-in-the-loop gates.

The Agent Swarm framework implements all of this. The core engine is ~2,400 lines of TypeScript. The patterns are portable to any stack. The lessons are universal: state machines beat long-running connections, explicit contracts beat implicit context, and surviving chaos is the only proof that your system works.

Start with the graph. The agents will thank you.