March 30, 2026 · 7 min read

The Architecture Behind Task Delegation: Pools, Routing, and Dependencies

Most multi-agent systems start simple: one agent gets a task, does the work. But what happens when you have 6 agents, 50 concurrent tasks, and dependencies between them? Here's how we built the delegation system behind Agent Swarm — and the hard lessons from 3,000+ completed tasks.

architecture · task delegation · AI agents · orchestration

If you're building a multi-agent system, task delegation is the first problem that becomes non-trivial. A single agent with a single task is straightforward. But the moment you have multiple agents with different capabilities, concurrent work, and tasks that depend on each other, you need an actual system.

We've been running Agent Swarm in production for over 90 days. Six agents. Over 3,000 completed tasks. The delegation architecture has been rewritten twice. Here's what we landed on and why.

The Task Lifecycle

Every piece of work in Agent Swarm flows through a state machine. This sounds obvious, but getting the states right took multiple iterations. Our first version had three states (pending, running, done). The current version has ten, but the core flow is the same: a task is created, assigned to an agent, picked up, executed, and finally completed or failed.

The key insight was separating assignment from execution. A task can be assigned to an agent (pending) but not yet started. This matters because agents run in Docker containers that poll for work — there is a real gap between "this task is yours" and "the agent has picked it up."

Lesson: If your agents are distributed processes, you need at least one state between "assigned" and "running." Without it, you can't distinguish between an agent that hasn't started yet and one that crashed silently.
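The shape of that state machine can be sketched as an explicit transition table. This is a minimal illustration, not the production schema: the post mentions ten states but only names a few, so the state names and transitions below are assumptions.

```typescript
// Illustrative subset of the task state machine. "assigned" sits between
// delegation and pickup, which is what lets us distinguish an agent that
// hasn't started yet from one that crashed silently.
type TaskState =
  | "unassigned" | "assigned" | "running"
  | "completed" | "failed" | "cancelled";

// Legal transitions out of each state (hypothetical, for illustration).
const TRANSITIONS: Record<TaskState, TaskState[]> = {
  unassigned: ["assigned"],
  assigned: ["running", "cancelled"],
  running: ["completed", "failed", "cancelled"],
  completed: [],
  failed: [],
  cancelled: [],
};

function canTransition(from: TaskState, to: TaskState): boolean {
  return TRANSITIONS[from].includes(to);
}
```

An explicit table like this also makes invalid transitions (e.g. a completed task going back to running) a checkable error rather than a silent data-corruption bug.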

Task Pools vs. Direct Assignment

We support two delegation models, and you need both. Direct assignment is when the Lead agent knows exactly who should do the work: "Picateclas, implement this PR." Task pools are when work is posted without a specific assignee and agents claim it.

Direct Assignment

The Lead creates a task and assigns it to a specific agent by ID. The target agent's runner picks it up on next poll. Best for specialized work where only one agent has the right context — like assigning a PR review to the Reviewer agent.

Task Pools

Tasks are created as "unassigned" with tags (e.g., implementation, research). Idle agents poll the pool, filter by their capabilities, and claim tasks. This is how we load-balance — if one coder agent is busy, another can pick up the work.
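The claim step above can be sketched as a filter over unassigned tasks. Field names (`tags`, `assignee`) are illustrative, not the actual Agent Swarm schema, and the real system would make the claim atomic:

```typescript
// Hypothetical pool-claim sketch: an idle agent filters unassigned
// tasks by its capability tags and claims the first match.
interface PoolTask {
  id: string;
  tags: string[];
  assignee?: string;
}

function claimFromPool(
  pool: PoolTask[],
  agentId: string,
  capabilities: string[],
): PoolTask | undefined {
  const task = pool.find(
    (t) => !t.assignee && t.tags.some((tag) => capabilities.includes(tag)),
  );
  if (task) task.assignee = agentId; // would be an atomic compare-and-set in production
  return task;
}
```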

Offer/Accept Pattern

A middle ground: the Lead offers a task to a specific agent, but the agent can reject it (e.g., if it lacks context or is overloaded). Rejected tasks go back to the pool or get re-offered. This prevents forcing work onto agents that would fail at it.
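The agent-side decision might look like the sketch below. The rejection heuristic (at capacity, or missing required context) is an assumption for illustration, not the production policy:

```typescript
// Offer/accept sketch: the Lead offers, the agent decides.
interface Offer { taskId: string; requiredContext: string[] }
interface AgentView {
  activeTasks: number;
  maxConcurrent: number;
  knownContext: string[];
}

function decideOffer(offer: Offer, agent: AgentView): "accept" | "reject" {
  // Overloaded agents reject rather than queue silently.
  if (agent.activeTasks >= agent.maxConcurrent) return "reject";
  // Agents without the needed context reject so the Lead can re-route.
  const hasContext = offer.requiredContext.every((c) =>
    agent.knownContext.includes(c),
  );
  return hasContext ? "accept" : "reject";
}
```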

Lesson: Start with direct assignment — it's simple and predictable. Add pools when you have more than 2-3 workers and want natural load balancing. Add offer/accept when you notice agents failing on tasks they shouldn't have been assigned.

How the Lead Routes Work

The Lead agent is itself an AI — it reads Slack messages, interprets intent, breaks down complex requests into sub-tasks, and decides who does what. This is both the most powerful and most fragile part of the system.

The Lead has access to every agent's profile: their name, role description, current status, and task history. When a new request comes in, it evaluates which agent is the best fit based on role specialization and current workload. A bug fix goes to the coder. A PR review goes to the reviewer. A research question goes to the researcher.

But the Lead also handles task decomposition. A Slack message like "add a new blog post to the landing page" becomes: (1) research the existing blog format, (2) write the post, (3) test the build, (4) create a PR. The Lead creates these as separate tasks with dependencies.
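As data, that decomposition is just a list of tasks wired together with dependsOn. The IDs and shape below are illustrative:

```typescript
// Hypothetical output of the Lead decomposing "add a new blog post":
// four sub-tasks, each gated on the previous one via dependsOn.
const subtasks = [
  { id: "research-format", dependsOn: [] as string[] },
  { id: "write-post", dependsOn: ["research-format"] },
  { id: "test-build", dependsOn: ["write-post"] },
  { id: "create-pr", dependsOn: ["test-build"] },
];
```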

Lesson: Your routing agent needs structured access to agent capabilities, not just names. We store each agent's role, specialization, and task history. Without this, the Lead agent guesses — and guesses wrong about 15% of the time.

Task Dependencies

Real work has order. You can't review a PR that doesn't exist yet. You can't deploy code that hasn't been tested. Agent Swarm supports dependsOn — a list of task IDs that must complete before a task becomes eligible for execution.

When a task's dependencies are all completed, it transitions from blocked to its normal lifecycle. If any dependency fails, the dependent task can be automatically cancelled or left for the Lead to decide.
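That gating rule can be expressed as a small readiness check. The task shape and the three-way result are illustrative assumptions:

```typescript
// Dependency gating sketch: a task leaves "blocked" only when every
// dependsOn task has completed; a failed dependency surfaces for the
// Lead to decide (cancel, retry, or reassign).
interface DepTask { id: string; dependsOn: string[] }
type Status = "completed" | "failed" | "running";

function dependencyState(
  task: DepTask,
  statuses: Map<string, Status>,
): "ready" | "blocked" | "needs-decision" {
  if (task.dependsOn.some((id) => statuses.get(id) === "failed")) {
    return "needs-decision"; // automatic cancel, or the Lead's call
  }
  return task.dependsOn.every((id) => statuses.get(id) === "completed")
    ? "ready"
    : "blocked";
}
```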

We also built Workflows on top of this — declarative DAGs of tasks with interpolation. A workflow definition says "run task A, pass its output to task B, then fan out to tasks C and D in parallel." Workflows are how we handle recurring multi-step processes like daily content generation or scheduled health checks.

A workflow node with cross-step data access looks like this:

```json
{
  "id": "write-post",
  "type": "agent-task",
  "inputs": { "research": "gather-data" },
  "config": {
    "template": "Write a post using: {{research.taskOutput}}"
  },
  "next": ["review-post"]
}
```
Lesson: Dependencies are essential, but keep them shallow. Deep dependency chains (A → B → C → D → E) are fragile — one failure cascades. We aim for wide, shallow DAGs: fan out early, join late. Our most reliable workflows have 2-3 levels of depth, not 5-6.

When Tasks Fail

With 3,000+ completed tasks, we've also seen hundreds of failures. The delegation system needs to handle failure as a first-class outcome, not an exception.

Structured failure reasons

Every failed task includes a failureReason field. This isn't just for humans — the Lead reads it to decide whether to retry, reassign, or escalate.
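A triage step driven by failureReason might look like the sketch below. The pattern matching and the retry budget are assumptions for illustration, not the actual Lead logic:

```typescript
// Failure triage sketch: map a structured failureReason plus attempt
// count to a next action.
type FailureAction = "retry" | "reassign" | "escalate";

function triageFailure(failureReason: string, attempts: number): FailureAction {
  if (attempts >= 3) return "escalate"; // stop burning cycles on repeats
  // Context-shaped failures suggest the wrong agent got the task.
  if (/missing context|not found/i.test(failureReason)) return "reassign";
  return "retry"; // treat the rest as transient by default
}
```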

Automatic memory indexing

When a task fails, the failure reason is indexed into the swarm's memory. Next time a similar task comes up, agents can search for past failures and avoid the same mistakes.

Concurrency limits

Each agent has a MAX_CONCURRENT_TASKS setting. Overloading an agent causes context window pressure, leading to degraded output. We run most agents at 1-2 concurrent tasks.

Pause and resume

Long-running tasks can be paused (freeing the agent for urgent work) and resumed later with full context. The session state is preserved so the agent picks up exactly where it left off.

Lesson: The biggest source of task failures isn't bugs — it's context. An agent assigned a task without enough background information will either produce wrong output or waste time researching. Providing structured context in the task description (repo, file paths, relevant PRs) cuts failure rates dramatically.
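Concretely, "structured context" might look like the object below. The field names are illustrative, chosen to show the kind of information meant, not the actual task schema:

```typescript
// A well-contextualized task description: where to work, where to
// look, what the intent is, and what the constraints are.
const taskDescription = {
  title: "Fix flaky retry test",
  repo: "acme/agent-swarm",            // assumed example repo
  files: ["src/runner/retry.ts"],      // where to look first
  relevantPRs: ["#142"],               // prior work to read
  intent: "Test fails intermittently under parallel execution",
  constraints: ["no new dependencies", "keep public API stable"],
};
```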

Key Takeaways for Builders

Model your task states explicitly

Don't collapse assignment and execution into one state. Distributed agents need at least: unassigned, assigned, running, completed, failed. Add more as you discover gaps.

Support both direct assignment and pools

Direct assignment for specialized work, pools for load balancing. The offer/accept pattern is the bridge between them — the Lead suggests, the agent decides.

Keep dependency DAGs wide and shallow

Fan out work early, join results late. Deep chains are fragile. Our most reliable workflows have a fan-out of 3-5 parallel tasks with 2-3 levels of depth.

Index failures into memory

Every failed task is a learning opportunity. If agents can search past failures before starting similar work, your retry success rate goes up significantly.

Context is everything

The single biggest lever for task success is the quality of the task description. Include the repo, the files, the intent, and any constraints. An agent with good context succeeds; one without it wastes cycles.

Task delegation is the backbone of any multi-agent system. Get it wrong and your agents step on each other, starve for work, or fail silently. Get it right and you have a system that scales — add more agents, handle more work, without changing the architecture.

Agent Swarm is open source. The full task lifecycle, pool implementation, and workflow engine are in the repo. If you're building your own agent orchestrator, start there.