April 1, 2026·12 min read

The Task State Machine: 7-State Lifecycle That Recovers From Agent Crashes

Most agent orchestrators work perfectly until the first container restart. Here's how we built a state machine that survives the inevitable chaos of production — stalled sessions, orphaned work, and everything in between.

Tags: state machine, task lifecycle, resilience, distributed systems, AI agents

Figure: 7-state task lifecycle diagram showing transitions between unassigned, offered, pending, in_progress, and terminal states.

Your agents will crash. Not might — will. At 2 AM. During a critical data pipeline. With 47 tasks in flight and a CEO waiting for the morning report.

If your task model is a simple todo/doing/done trifecta, you're not building an orchestrator — you're building a task graveyard. We learned this the hard way running 500+ agents processing complex ETL workflows across spot instances. Simple state machines hide failure modes. They conflate "assigned" with "alive," "in_progress" with "making progress," and "failed" with "retry blindly until rate limits explode."

Our 7-state machine — unassigned → offered → pending → in_progress → [completed|failed|cancelled] — with intermediate stalled and paused substates, isn't academic overhead. It's survival gear. This post details the recovery patterns that make autonomous agent swarms resilient to the failures that inevitably happen at scale.
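To make the lifecycle concrete, here is a minimal sketch of a transition table that rejects illegal moves. The state names come from the lifecycle described above; the exact edge list, `ALLOWED`, `canTransition`, and `isTerminal` are illustrative assumptions rather than the framework's actual definitions.

```typescript
// All states named in the post: seven lifecycle states plus two substates.
type TaskState =
  | 'unassigned' | 'offered' | 'pending' | 'in_progress'
  | 'stalled' | 'paused'
  | 'completed' | 'failed' | 'cancelled';

// ASSUMPTION: this edge list is illustrative; the post does not enumerate
// every allowed transition.
const ALLOWED: Record<TaskState, readonly TaskState[]> = {
  unassigned: ['offered', 'cancelled'],
  offered: ['pending', 'unassigned', 'cancelled'],
  pending: ['in_progress', 'unassigned', 'cancelled'],
  in_progress: ['stalled', 'paused', 'completed', 'failed', 'cancelled'],
  stalled: ['unassigned', 'failed', 'cancelled'],
  paused: ['in_progress', 'cancelled'],
  completed: [],
  failed: [],
  cancelled: [],
};

// True when the state machine permits moving from `from` to `to`.
function canTransition(from: TaskState, to: TaskState): boolean {
  return ALLOWED[from].includes(to);
}

// Terminal states are the ones with no outgoing edges.
function isTerminal(state: TaskState): boolean {
  return ALLOWED[state].length === 0;
}
```

Keeping the edges in one table means every write path can call `canTransition` before touching the store, so a late message from a dead agent cannot drag a completed task back into in_progress.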

Why Does the Simple 'Todo/Doing/Done' Model Break?

The naive three-state model assumes synchronous reliability. It assumes that when you mark a task "doing," the assignee will either return with "done" or throw a catchable error. This assumption collapses under the reality of distributed AI agents:

Container OOM kills mid-execution: The Python process processing a 10MB JSON payload gets SIGKILL'd by the kernel. The task status remains "doing" indefinitely.
Network partitions: The agent loses connection to the control plane after receiving the task. It completes the work, but the result never arrives.
Context window exhaustion: Claude hits the 200k token limit and hangs — not crashes, just stops emitting tokens while holding the task lock.
Credential expiration: AWS credentials expire mid-flight. API calls freeze indefinitely waiting for IAM refresh that never comes.
Spot instance preemption: Google Cloud sends a preemption notice. The agent has roughly 30 seconds to checkpoint or die. Most don't checkpoint.

Agents crash, containers restart, and networks partition. Without intermediate states, you cannot distinguish "actively working" from "silently dead," leading to tasks stuck in limbo forever.

Without granular states, "doing" becomes a semantic graveyard. In production telemetry, we observed that 12% of tasks marked "in_progress" were actually orphaned — agents had died hours ago, but the state remained unchanged. This doesn't just waste compute; it blocks downstream dependencies and violates SLAs.

The Offer/Accept Protocol: Backpressure as a First-Class Citizen

Force-assignment is a liability. When the lead agent decrees, "You, worker-3, process this invoice," and worker-3 is experiencing GC thrashing or is mid-shutdown sequence, the task enters a void. The lead assumes acceptance; the worker never acknowledges. The task is lost.

We implemented an offer/accept pattern that treats agents as autonomous participants, not as passive workers that must take whatever they are handed:

Offer: Lead broadcasts task to qualified agents (matching capabilities, region, etc.)
Evaluate: Agents inspect current load (memory, queue depth, token utilization) and either accept or reject
Reserve: First acceptor wins; the task moves to pending under a 30-second lease
Confirm: Agent must call start_task() within the lease window, moving state to in_progress
Fallback: If lease expires, task returns to unassigned with exponential backoff

This prevents the catastrophic "overloaded agent drops everything" cascade. An agent with 95% memory utilization or a full context window can reject new offers, broadcasting its unavailability. The swarm self-heals by routing around congestion.
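The Evaluate step can be sketched from the agent's side as a pure function over a load snapshot. Only the 95% memory figure comes from the text above; `AgentLoad`, `LIMITS`, `evaluateOffer`, and the other thresholds are hypothetical names and values chosen for illustration.

```typescript
// Hypothetical load snapshot an agent samples before answering an offer.
interface AgentLoad {
  memoryUtilization: number;  // 0..1 fraction of available memory in use
  contextUtilization: number; // 0..1 fraction of the context window in use
  queueDepth: number;         // tasks already accepted and waiting
}

interface OfferDecision {
  accept: boolean;
  reason?: string; // set only when the offer is rejected
}

// ASSUMPTION: illustrative thresholds; only the memory limit is from the post.
const LIMITS = { memory: 0.95, context: 0.9, queueDepth: 4 };

// Reject as soon as any single resource is saturated.
function evaluateOffer(load: AgentLoad): OfferDecision {
  if (load.memoryUtilization >= LIMITS.memory) {
    return { accept: false, reason: 'memory_pressure' };
  }
  if (load.contextUtilization >= LIMITS.context) {
    return { accept: false, reason: 'context_window_full' };
  }
  if (load.queueDepth >= LIMITS.queueDepth) {
    return { accept: false, reason: 'queue_full' };
  }
  return { accept: true };
}
```

Returning a reason alongside the rejection lets the lead tell a saturated agent apart from a misconfigured one when it decides whom to offer to next.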

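interface TaskStateMachine {
  id: string;
  state: 'unassigned' | 'offered' | 'pending' | 'in_progress' |
         'stalled' | 'paused' | 'completed' | 'failed' | 'cancelled';
  version: number; // Optimistic locking: bumped on every write
  attempt: number; // Incremented each time the task is re-queued
  offeredAt?: Date;
  acceptedBy?: string;
  leaseExpiresAt?: Date;
  lastUpdatedAt: Date;   // Any state change or heartbeat ping
  lastProgressAt?: Date; // Last store_progress() call with real work-product
  checkpoint?: TaskCheckpoint;
}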

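class TaskDistributor {
  // Assumed to exist elsewhere: this.store, sendOffer(), transition(), timeout().

  async offerTask(task: Task, candidates: Agent[]): Promise<void> {
    // Enter 'offered' first so that handleAccept() can match on that state.
    await this.transition(task.id, 'offered', { offeredAt: new Date() });

    // Offer to at most three candidates. Assumption: sendOffer() resolves
    // when the agent accepts and rejects when it declines or errors.
    const offers = candidates.slice(0, 3).map(agent =>
      this.sendOffer(agent, task)
    );

    // Promise.any skips declines and settles on the first acceptance;
    // timeout() is assumed to resolve to null after the given delay.
    const winner = await Promise.race([
      Promise.any(offers).catch(() => null),
      this.timeout(30000)
    ]);

    if (winner) {
      // Route through handleAccept() so the versioned write below is the
      // only path into 'pending'. It is a no-op if the agent already called it.
      await this.handleAccept(winner.agentId, task.id);
    } else {
      // Nobody accepted in time: return the task to the pool.
      await this.transition(task.id, 'unassigned', {});
    }
  }

  async handleAccept(agentId: string, taskId: string): Promise<boolean> {
    const task = await this.store.get(taskId);
    if (task.state !== 'offered') return false;

    // Compare-and-set on version: only the first acceptor wins the lease.
    return this.store.updateIfVersion(
      taskId,
      task.version,
      {
        state: 'pending',
        acceptedBy: agentId,
        leaseExpiresAt: new Date(Date.now() + 30000)
      }
    );
  }
}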
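The Fallback step mentions exponential backoff when a lease expires. One common way to compute the delay is "full jitter", sketched below; `backoffDelayMs`, the one-second base, and the one-minute cap are assumptions for illustration, not values from the production system.

```typescript
// "Full jitter" backoff: pick a random delay in [0, min(cap, base * 2^attempt)).
// `random` is injectable so the function can be tested deterministically.
function backoffDelayMs(
  attempt: number,  // 0 for the first re-offer, 1 for the second, ...
  baseMs = 1000,    // ASSUMPTION: illustrative base delay
  capMs = 60000,    // ASSUMPTION: illustrative upper bound
  random: () => number = Math.random
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}
```

Randomizing across the whole window, rather than adding a small jitter to a fixed delay, spreads re-offers out so that many expired leases do not all retry at the same instant.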

How Do You Detect a Stalled Agent Without False Positives?

Once a task enters in_progress, how do you distinguish between "agent crunching through a 10-minute LLM inference" and "zombie container holding a lease on work it will never finish"? Timeout-based detection is too coarse — you'll kill legitimate long-running tasks or wait too long to detect failures.

We track two independent signals:

lastUpdatedAt: Timestamp of any state change or heartbeat ping
lastProgressAt: Timestamp of the last store_progress() call containing meaningful work-product

The distinction matters. An agent might heartbeat regularly (lastUpdatedAt recent) but be stuck in an infinite loop calling a failing tool. By requiring semantic progress milestones — stored via explicit checkpoint calls — we can detect livelock.

Track lastUpdatedAt timestamps and mandate store-progress calls at milestones. If either stalls beyond timeout, release the task back to the pool with its checkpoint intact.

In our production clusters, we observe agent container restarts every 4-6 hours due to memory pressure. Without heartbeat detection, these events orphaned 23% of in-flight tasks. With dual-track monitoring, we reduced orphaned tasks to 0.3%, with automatic recovery via checkpoint resume.

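class StallDetector {
  constructor(
    // Assumption: `db` is the transactional store used by releaseAndRequeue().
    private db: Database,
    private heartbeatTimeoutMs: number = 120000, // 2 minutes without a heartbeat
    private progressTimeoutMs: number = 300000   // 5 minutes without a milestone
  ) {}

  async evaluateTask(task: TaskStateMachine): Promise<Action> {
    // Only running tasks can stall; queued and terminal tasks are skipped.
    if (task.state !== 'in_progress') return { type: 'NOOP' };

    const now = Date.now();

    // Signal 1: the process stopped heartbeating (crash, OOM kill, partition).
    if (now - task.lastUpdatedAt.getTime() > this.heartbeatTimeoutMs) {
      return {
        type: 'MARK_STALLED',
        reason: 'heartbeat_lost',
        nextState: 'stalled'
      };
    }

    // Signal 2: the process is alive but has stored no milestone (livelock).
    // Assumption: start_task() stamps lastProgressAt. A task that somehow
    // lacks the stamp is treated as stalled rather than trusted.
    const lastProgressMs = task.lastProgressAt?.getTime() ?? 0;
    if (now - lastProgressMs > this.progressTimeoutMs) {
      return {
        type: 'MARK_STALLED',
        reason: 'progress_stalled',
        nextState: 'stalled',
        preserveProgress: true
      };
    }

    return { type: 'NOOP' };
  }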

  async releaseAndRequeue(taskId: string): Promise<void> {
    await this.db.transaction(async (trx) => {
      const task = await trx.tasks.get(taskId);
      if (task.checkpoint) {
        await trx.checkpoints.save(taskId, task.checkpoint);
      }
      await trx.tasks.update(taskId, {
        state: 'unassigned',
        acceptedBy: null,
        attempt: task.attempt + 1,
        lastFailureReason: 'stall_detected'
      });
    });
  }
}

The Progress-as-Checkpoint Pattern

Agents must externalize state at semantic milestones — not arbitrary time intervals. When processing a 10,000-row dataset, checkpoint every 500 rows. When refactoring code, checkpoint after each file modification. When negotiating with external APIs, checkpoint after each successful mutation.

This creates external state that survives session crashes. When task T1 stalls on agent A and moves to agent B, B reads the checkpoint and resumes at row 4,500, not row 0.

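The performance impact is dramatic: tasks recovered via checkpoint resume completed 3.2x faster than cold restarts, because they skip recomputing work that was already done. More importantly, deterministic checkpointing is what makes exactly-once behavior achievable for non-idempotent operations (like charging a credit card or sending an email). Checkpoints alone are not enough, since an agent can crash between performing a side effect and recording it, but paired with an idempotency key on each side effect they let a resumed agent detect which mutations already happened instead of repeating them.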

class CheckpointManager {
  async storeProgress(
    taskId: string,
    milestone: string,
    data: Record<string, any>
  ): Promise<void> {
    const checkpoint: TaskCheckpoint = {
      milestone,
      timestamp: new Date(),
      data,
      diff: this.computeDiff(taskId, data),
      checksum: this.hash(data)
    };

    await this.store.atomicWrite(taskId, checkpoint);
    await this.coordinator.ackProgress(taskId, milestone);
  }

  async resumeFromCheckpoint(taskId: string): Promise<ResumeContext> {
    // getLatest() is assumed to return null when the task has never checkpointed.
    const checkpoint = await this.store.getLatest(taskId);
    const task = await this.tasks.get(taskId);

    return {
      originalPrompt: task.description,
      priorContext: task.conversationHistory,
      // Undefined on a cold start, so callers can fall back to their defaults.
      currentState: checkpoint?.data,
      lastCompletedMilestone: checkpoint?.milestone,
      resumeNarrative: checkpoint
        ? this.generateResumeNarrative(checkpoint)
        : undefined
    };
  }
}

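// Agent usage during execution.
// BATCH_SIZE and CHECKPOINT_EVERY are assumed module constants, with
// CHECKPOINT_EVERY a multiple of BATCH_SIZE (for example 100 and 500).
async function processDataset(taskId: string, rows: Row[]) {
  const resume = await checkpointManager.resumeFromCheckpoint(taskId);
  // nextIndex is the first row that has NOT been processed yet, so a resumed
  // agent never repeats the batch that finished just before the checkpoint.
  const startIndex = resume.currentState?.nextIndex ?? 0;
  // Restore partial results so earlier batches still count after a resume.
  const accumulators = resume.currentState?.partialResults ?? {};

  for (let i = startIndex; i < rows.length; i += BATCH_SIZE) {
    const batch = rows.slice(i, i + BATCH_SIZE);
    // processBatch() is assumed to fold the batch into `accumulators`.
    await processBatch(batch, accumulators);

    const processed = Math.min(i + BATCH_SIZE, rows.length);
    if (processed % CHECKPOINT_EVERY === 0 || processed === rows.length) {
      await checkpointManager.storeProgress(taskId, `rows_${processed}`, {
        nextIndex: processed,
        partialResults: accumulators
      });
    }
  }
}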
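For the non-idempotent operations mentioned earlier (charging a card, sending an email), a checkpoint is usually paired with an idempotency key so that a resumed agent cannot repeat the side effect. The sketch below uses an in-memory map as a stand-in for a durable table; `EffectLedger`, `runOnce`, and `effectKey` are hypothetical names, and a real deployment would also need the downstream API to honor the same key, because a crash between the effect and the ledger write is still possible.

```typescript
// In-memory stand-in for a durable table keyed by idempotency key.
class EffectLedger {
  private results = new Map<string, unknown>();

  // Runs `effect` at most once per key; later calls return the recorded result.
  runOnce<T>(key: string, effect: () => T): T {
    if (this.results.has(key)) {
      return this.results.get(key) as T;
    }
    const result = effect();
    this.results.set(key, result);
    return result;
  }
}

// Derive the key from task and milestone so a retry after resume reuses it.
function effectKey(taskId: string, milestone: string): string {
  return `${taskId}:${milestone}`;
}
```

Because the key is derived from the task and milestone rather than from the agent, agent B resuming agent A's task computes the same key and gets A's recorded result back instead of charging the card a second time.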

Failure Categorization That Drives Retry Policy

Not all failures deserve the same response. A credential expiration shouldn't retry immediately. A schema mismatch shouldn't retry at all — it needs re-specification. Our state machine categorizes failures into routing decisions:

Failure Type	Example	State Machine Response
Transient	Rate limit, network timeout	Retry with exponential backoff (max 5x)
Auth	Expired token, IAM rejection	Park in 'paused', trigger rotation, resume
Schema	API contract mismatch	Escalate to lead for re-specification
Ambiguity	Unclear requirements	Transition to 'paused', request-human-input
Logic	Code bug, infinite loop	Route to self-healing pipeline (fix, test, retry)

This categorization happens at the error boundary. When an agent throws, we inspect the error signature — HTTP status codes, specific exception types, or LLM-generated failure classifications — to determine the next state. Genuine bugs get routed to a "repair swarm" that analyzes stack traces and generates patches, rather than hammering the same failing code path.
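A minimal sketch of that error boundary, assuming errors expose an HTTP status or a string code: `classify` maps an error to one of the five categories in the table and `nextAction` maps the category to a routing decision. The status-code mapping and the `SCHEMA_MISMATCH` and `AMBIGUOUS_REQUIREMENTS` codes are hypothetical; only the categories and responses come from the table above.

```typescript
type FailureCategory = 'transient' | 'auth' | 'schema' | 'ambiguity' | 'logic';

// Hypothetical error shape seen at the boundary.
interface AgentError {
  httpStatus?: number;
  code?: string;
}

type NextAction =
  | { kind: 'retry_with_backoff'; maxAttempts: number }
  | { kind: 'pause'; then: 'rotate_credentials' | 'request_human_input' }
  | { kind: 'escalate_to_lead' }
  | { kind: 'route_to_repair' };

// ASSUMPTION: this signature-to-category mapping is illustrative.
function classify(err: AgentError): FailureCategory {
  if (err.httpStatus === 429 || err.httpStatus === 503 || err.code === 'ETIMEDOUT') {
    return 'transient';
  }
  if (err.httpStatus === 401 || err.httpStatus === 403) return 'auth';
  if (err.httpStatus === 422 || err.code === 'SCHEMA_MISMATCH') return 'schema';
  if (err.code === 'AMBIGUOUS_REQUIREMENTS') return 'ambiguity';
  return 'logic'; // anything unrecognized is treated as a genuine bug
}

// Mirrors the "State Machine Response" column of the table.
function nextAction(category: FailureCategory): NextAction {
  switch (category) {
    case 'transient': return { kind: 'retry_with_backoff', maxAttempts: 5 };
    case 'auth': return { kind: 'pause', then: 'rotate_credentials' };
    case 'schema': return { kind: 'escalate_to_lead' };
    case 'ambiguity': return { kind: 'pause', then: 'request_human_input' };
    case 'logic': return { kind: 'route_to_repair' };
  }
}
```

Splitting classification from routing keeps the retry policy in one place: adding a new error signature only touches `classify`, while changing what happens to auth failures only touches `nextAction`.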

Cancellation That Actually Stops Work

Naive cancellation updates a database row from in_progress to cancelled and calls it a day. Meanwhile, the Claude process keeps burning tokens on abandoned work, or worse, commits side effects (charges the credit card, sends the email) 30 seconds after the "cancellation."

Real cancellation requires cooperation. We implement a cancellation token checked between LLM calls and tool invocations. When the coordinator receives a cancel request, it:

Sets the cancellation flag in Redis with the task ID
Sends SIGTERM to the agent process (if containerized)
Agent checks isCancelled() between tool calls
If cancelled, agent runs cleanup hooks (rollback transactions, release locks)
Agent exits, coordinator confirms state transition to cancelled
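The agent-side half of those steps can be sketched as a loop that checks a token before every step. In this sketch the flag lives in memory and the steps are synchronous for brevity; in the setup described above the flag would be read from Redis and the steps would be asynchronous tool or LLM calls. `CancellationToken`, `Step`, and `runSteps` are hypothetical names.

```typescript
// In-memory stand-in for the cancellation flag the coordinator sets.
class CancellationToken {
  private cancelled = false;
  cancel(): void { this.cancelled = true; }
  isCancelled(): boolean { return this.cancelled; }
}

interface Step {
  name: string;
  run: () => void;
}

interface RunResult {
  state: 'completed' | 'cancelled';
  executed: string[]; // names of the steps that actually ran
}

function runSteps(
  steps: Step[],
  token: CancellationToken,
  cleanup: () => void // e.g. roll back transactions, release locks
): RunResult {
  const executed: string[] = [];
  for (const step of steps) {
    // Check before each step so no new side effect starts after a cancel.
    if (token.isCancelled()) {
      cleanup();
      return { state: 'cancelled', executed };
    }
    step.run();
    executed.push(step.name);
  }
  return { state: 'completed', executed };
}
```

The check sits before each step rather than after it, which is what prevents the "email sent 30 seconds after cancellation" failure: a step already running finishes, but the next one never starts.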

For the pause/resume mechanism, used when a task needs human input mid-execution, we serialize the full agent context (conversation history, tool state, partial results, and the paths and offsets of open files, since live file handles cannot themselves be serialized) to cold storage. The task moves to paused. When human input arrives, we hydrate the context on a fresh agent instance and resume from the saved execution state.

Edge Cases and Battle-Scarred Lessons

Theory meets reality in the edge cases. Here are the failure modes that nearly took down our production swarm:

Split-brain during network partition: Coordinator thinks agent timed out; agent thinks it's still working. Mitigation: Versioned leases with fencing tokens — only the holder of the current lease version can commit.
Checkpoint bloat: Early implementations stored full context every 10 seconds, generating 50MB/minute per agent. Solution: Differential checkpoints and periodic garbage collection.
Zombie completion: Agent finishes task after timeout/transfer, causing duplicate work. Mitigation: Idempotency keys on all output commits; duplicate results from old agents are rejected.
Clock skew nightmares: VMs with drifted clocks (>30s difference) caused false stall detection. Solution: Hybrid logical clocks (physical time combined with a Lamport-style logical counter) for ordering, physical clocks only for approximate staleness.
The thundering herd: When a stalled task returns to unassigned, 100 agents rush to accept it. Solution: Jittered exponential backoff on the offer phase and agent-side circuit breakers.
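The fencing-token mitigation for split-brain can be illustrated with a small in-memory store: every lease grant bumps a version, and a commit is accepted only if it carries the current version. `FencedStore`, `grantLease`, and `commit` are hypothetical names, and a real store would perform the check and the write atomically in the database.

```typescript
class FencedStore {
  private leaseVersion = new Map<string, number>();
  private results = new Map<string, string>();

  // Each grant bumps the version, which invalidates every older holder.
  grantLease(taskId: string): number {
    const next = (this.leaseVersion.get(taskId) ?? 0) + 1;
    this.leaseVersion.set(taskId, next);
    return next;
  }

  // Accept the result only if the caller holds the current lease version.
  commit(taskId: string, token: number, result: string): boolean {
    if (this.leaseVersion.get(taskId) !== token) return false;
    this.results.set(taskId, result);
    return true;
  }

  get(taskId: string): string | undefined {
    return this.results.get(taskId);
  }
}
```

If agent A is partitioned away and the task is re-leased to agent B, A's eventual commit carries a stale token and is rejected, so the coordinator and the agent no longer need to agree on who timed out.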

The 3 AM Lesson: We once had an agent get stuck in an infinite tool loop calling a weather API for a "current temperature" that it already had. It heartbeated correctly every 30 seconds, so our old system thought it was healthy. It burned through $400 in API credits overnight. That's when we added the lastProgressAt semantic checkpoint requirement. Heartbeats prove the process is alive; checkpoints prove it's not insane.

Comparison: Simple vs. Resilient Task Models

Capability	Simple Todo/Doing/Done	7-State Resilient FSM
Failure Detection	Global timeout only	Heartbeat + Progress tracking
Recovery Granularity	Restart from scratch	Checkpoint resume (3.2x faster)
Load Handling	Force assignment (drops tasks)	Offer/accept with backpressure
Cancellation	DB flag (orphaned processes)	Cooperative termination with cleanup
Failure Response	Blind retry	Categorized routing (auth/schema/logic)

Conclusion: Design for the Death of Agents

Resilience isn't bolted on with a cron job that cleans up stuck tasks. It's designed into every state transition. Every arrow in your state diagram should answer: "What if the agent dies here? What if the network partitions here? What if the task takes 10x longer than expected?"

The 7-state lifecycle feels heavier than a boolean isComplete flag. It requires checkpoint storage, heartbeat infrastructure, and careful lease management. But when you're running 500 agents at 2 AM and a spot instance evaporates mid-task, that "overhead" becomes the difference between a system that heals itself and a pager that wakes you up.

Build your state machine assuming your agents are mortal. Because they are. Design for the crash, and your swarm will survive the night.

Ready to build your own agent swarm?

Agent Swarm is an open-source framework for orchestrating autonomous AI agents with built-in resilience patterns, checkpoint recovery, and intelligent failure handling.
