Our AI Worker Containers Have Zero Local Database And a 30-Line Bash Script That Makes It Impossible to Add One
How we banned database imports from worker containers with a bash script, and why it saved our agent swarm from catastrophic state divergence.

Three workers. Three SQLite databases. Three different beliefs about whether task #8472 was pending, in-progress, or complete.
We shipped three contradictory pull requests to the same Linear issue before anyone noticed. Each worker had “successfully” updated its own local state. None of them had checked with each other.
This is the story of why we banned database imports from our worker containers entirely—not with documentation, not with code review checklists, but with a 30-line bash script that runs in CI and fails the build if any worker tries to touch a database module.
The Incident: When “Fast Memory” Becomes Distributed Corruption
Our agent swarm architecture seemed sensible: an API server owns the task queue, workers pull tasks and execute them. But workers need context—previous actions, conversation history, cached embeddings. We added SQLite because “it’s just temporary state” and “direct SQL is faster than HTTP calls.”
The pattern was everywhere in our codebase. Workers would:
- Fetch a task from the API
- Write `UPDATE tasks SET status = 'in_progress'` to local SQLite
- Begin work, periodically syncing “real” progress back to the API
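
In code, the anti-pattern looked roughly like this. This is a reconstruction for illustration (table and file names simplified), not a snippet from our repo:

```typescript
// What workers used to do: keep a private copy of task state in local SQLite.
// This is exactly the import the CI check later in this post forbids.
import { Database } from 'bun:sqlite';

const local = new Database('worker-state.sqlite');

function claimTaskLocally(taskId: number): void {
  // Each worker recorded its own private view of the task's status,
  // and only synced that "real" progress back to the API later.
  local.query("UPDATE tasks SET status = 'in_progress' WHERE id = ?").run(taskId);
}
```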
This worked fine with one worker. It worked fine with two. At three workers, the race conditions started.
Worker A pulled task #8472, wrote in_progress to its SQLite, started processing. Worker B, starting 400ms later, found the task still marked pending in the API (Worker A hadn’t synced yet), pulled it, wrote in_progress to its own SQLite, and started processing. Worker C did the same.
Three LLM inference sessions. Three conflicting code suggestions. Three PRs opened against the same Linear ticket. The API server eventually received three completion notifications and had to reconcile conflicting outputs.
The debugging session was a nightmare. Each worker’s SQLite had a different timeline. The API server had a fourth timeline. We had to manually reconstruct which worker did what when, cross-referencing container logs with database timestamps that weren’t synchronized.
The Rule: Six Paths, Two Patterns, Zero Exceptions
We made a decision immediately after that incident: worker containers may not import database modules, ever. Not for caching. Not for temporary state. Not for “just this one query.”
We codified this in three layers:
- Architectural: Workers are stateless executors. The API server is the sole owner of state.
- API design: Every worker action that previously touched the database now goes through HTTP endpoints.
- Enforcement: A bash script that runs in CI and fails the build on any violation.
The forbidden zones in our worker codebase:
- `src/commands/` — CLI commands that trigger agent actions
- `src/hooks/` — lifecycle hooks for task execution
- `src/providers/` — LLM provider integrations and tool definitions
- `src/prompts/` — prompt templates and context assembly
- `src/cli.tsx` — main entry point and command routing
- `src/claude.ts` — Claude-specific integration and response parsing
The two forbidden import patterns:
- `from 'be/db'` — our database query layer
- `bun:sqlite` — the raw SQLite driver
Pattern-based detection beats type-based enforcement because it’s trivial to verify and impossible to circumvent through type gymnastics. You either import the module or you don’t.
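
Concretely, either of these lines anywhere under the six worker paths fails the build. The named imports are just examples; the grep matches the module specifier, not what you pull out of it:

```typescript
// Both of these trip the boundary check:
import { db } from 'be/db';             // matched by: from ['"]be/db['"]
import { Database } from 'bun:sqlite';  // matched by: bun:sqlite
```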
Why Bash, Not ESLint?
We tried lint rules first. They’re elegant. They integrate with IDEs. They provide helpful error messages. And they fail exactly when you need them most—under deadline pressure, when someone really needs to “just add one quick query.” A single `// eslint-disable-next-line @agent-swarm/no-db-import` comment bypasses the rule entirely.
Bash is dumb. Bash is fast. Bash checks the actual file content. A violation either exists in the file or it doesn’t. There’s no semantic analysis to bypass, no configuration to tweak, no comment directive to suppress.
Here’s the full script we run in CI:
```bash
#!/bin/bash
# scripts/check-db-boundary.sh
# Enforces: workers may never import database modules

set -euo pipefail

# Forbidden patterns
DB_MODULE="from ['\"]be/db['\"]"
SQLITE_DRIVER="bun:sqlite"

# Worker paths that must remain database-free
WORKER_PATHS=(
  "src/commands/"
  "src/hooks/"
  "src/providers/"
  "src/prompts/"
  "src/cli.tsx"
  "src/claude.ts"
)

VIOLATIONS=0

for path in "${WORKER_PATHS[@]}"; do
  if [ -d "$path" ]; then
    if grep -rE "$DB_MODULE" "$path" --include="*.ts" --include="*.tsx" 2>/dev/null; then
      echo "ERROR: Database module import found in $path"
      VIOLATIONS=$((VIOLATIONS + 1))
    fi
    if grep -rE "$SQLITE_DRIVER" "$path" --include="*.ts" --include="*.tsx" 2>/dev/null; then
      echo "ERROR: SQLite driver import found in $path"
      VIOLATIONS=$((VIOLATIONS + 1))
    fi
  elif [ -f "$path" ]; then
    if grep -E "$DB_MODULE" "$path" 2>/dev/null; then
      echo "ERROR: Database module import found in $path"
      VIOLATIONS=$((VIOLATIONS + 1))
    fi
    if grep -E "$SQLITE_DRIVER" "$path" 2>/dev/null; then
      echo "ERROR: SQLite driver import found in $path"
      VIOLATIONS=$((VIOLATIONS + 1))
    fi
  fi
done

if [ $VIOLATIONS -gt 0 ]; then
  echo ""
  echo "Found $VIOLATIONS database boundary violation(s)."
  echo "Workers must remain stateless. Use HTTP APIs instead."
  exit 1
fi

echo "Database boundary check passed."
```

Run time: ~200ms on our entire worker codebase. The CI step looks like this:
```yaml
# .github/workflows/ci.yml
- name: Enforce database boundary
  run: |
    chmod +x scripts/check-db-boundary.sh
    ./scripts/check-db-boundary.sh
```

The Replacement: HTTP-First Architecture
Every worker action that previously touched the database now goes through our API server. The surface is simple:
| Old pattern | New pattern | Latency |
|---|---|---|
| `db.query('SELECT * FROM tasks...')` | `GET /api/tasks/:id` | +12ms |
| `db.exec('UPDATE tasks...')` | `POST /api/tasks/:id/progress` | +8ms |
| `INSERT INTO messages...` | `POST /api/messages` | +15ms |
| `SELECT * FROM context_cache...` | `GET /api/context/:taskId` | +11ms |
Measured via internal benchmarks, 1000 requests each, p50 latency. Worker and API server co-located in the same VPC.
Workers carry two headers on every request:
```typescript
// src/lib/api-client.ts
import axios from 'axios';
// Task and ProgressUpdate are defined elsewhere in the codebase; path is illustrative.
import type { Task, ProgressUpdate } from './types';

const apiClient = axios.create({
  baseURL: process.env.API_SERVER_URL,
  headers: {
    'Authorization': `Bearer ${process.env.API_KEY}`,
    'X-Agent-ID': process.env.AGENT_ID,
  },
});

// All state access goes through this client
export async function fetchTask(taskId: string): Promise<Task> {
  const { data } = await apiClient.get(`/api/tasks/${taskId}`);
  return data;
}

export async function updateProgress(
  taskId: string,
  progress: ProgressUpdate
): Promise<void> {
  await apiClient.post(`/api/tasks/${taskId}/progress`, progress);
}
```

The 8–15ms overhead per call is noise compared to LLM inference time. A typical agent task involves 3–5 API calls and 2–4 LLM inferences. The HTTP overhead is under 100ms; the LLM calls are 2–30 seconds.
The Counterintuitive Benefit Nobody Warned Us About
Debugging a multi-agent failure used to require checking N container logs plus N local databases, then reconciling N parallel timelines. Now it’s: check the API database, replay the task ID.
Single source of truth makes every cross-agent investigation an order of magnitude faster. When three workers interact with a task, there’s exactly one timeline in the API database—not three timelines that have to be merged with timestamps that may or may not be synchronized.
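
On the API server, the one place still allowed to touch the database, that replay is a single query. A minimal sketch, assuming a SQLite-backed API server and an events table keyed by task (the table and column names are illustrative, not our actual schema):

```typescript
// api-server only: reconstruct what every agent did to one task, in order.
import { Database } from 'bun:sqlite';

const db = new Database('api-server.sqlite');

export function taskTimeline(taskId: number) {
  return db
    .query(
      `SELECT agent_id, event, created_at
         FROM task_events
        WHERE task_id = ?
        ORDER BY created_at`
    )
    .all(taskId);
}
```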
The mental model is simple: the API server is the computer. Workers are just CPUs that execute instructions. They don’t have memory. They don’t have state. They don’t have opinions about what “should” be happening.
The Ripple Effect: What Statelessness Unlocked
Once workers became stateless, three things became trivial:
Horizontal scaling became a non-event. Any container can handle any task because they’re identical. No sticky sessions. No data migration. No “warmup” containers that have cached state. We scale by adding containers to a pool. The API server’s task queue handles distribution.
Workers became ephemeral. We run workers as short-lived Docker containers that are killed and recreated constantly. A worker that crashes mid-task simply… stops. Its replacement pulls the same task from the queue and continues. No state to preserve. No “graceful shutdown” handler that might fail.
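
What makes “the replacement pulls the same task” safe is that claiming a task is one atomic operation on the API server, not a read-then-write spread across workers. The post doesn't show our server internals, so here is a minimal sketch of the idea, again assuming a SQLite-backed queue and illustrative names:

```typescript
// api-server only: hand out each task to at most one live worker.
import { Database } from 'bun:sqlite';

const db = new Database('api-server.sqlite');

// A single UPDATE ... RETURNING statement (SQLite >= 3.35), so two workers
// asking at the same moment can never both receive the same task. Tasks whose
// claim has gone stale (a crashed worker) become claimable again.
export function claimNextTask(agentId: string) {
  return db
    .query(
      `UPDATE tasks
          SET status = 'in_progress', agent_id = ?, claimed_at = datetime('now')
        WHERE id = (
          SELECT id FROM tasks
           WHERE status = 'pending'
              OR (status = 'in_progress' AND claimed_at < datetime('now', '-10 minutes'))
           ORDER BY id
           LIMIT 1
        )
        RETURNING id, payload`
    )
    .get(agentId);
}
```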
API design became rigorous. When you can’t take shortcuts through the database, every data access pattern needs a proper API endpoint. This forced us to design clean boundaries between concerns. The API server is now a well-documented service with explicit contracts. Workers are dumb clients that don’t need to understand database schemas.
What We Tried That Didn’t Work
Before settling on the bash script, we experimented with softer enforcement:
- Code review checklists: forgotten under deadline pressure. Reviewers don’t catch violations in 400-line PRs.
- ESLint rules with custom plugins: bypassed with `// eslint-disable`. The escape hatch becomes the default during crunch.
- Runtime detection: checking imports at worker startup. This catches violations late—after deployment, when the build already passed.
- Architecture Decision Records: great for onboarding, useless for enforcement. Developers don’t re-read ADRs before submitting PRs.
The bash script won because it’s the only option with zero escape hatches. You cannot bypass a grep. You cannot negotiate with a failing CI check at 2 AM when you need to ship a hotfix.
The Pattern: Apply This to Any Agent System
You don’t need our exact stack to use this pattern. The general form is:
- Identify your orchestrator (owns state) and workers (execute tasks)
- Define the boundary—which modules/paths workers may not access
- Write a dumb script that greps for violations
- Make it a merge gate in CI
- Replace direct state access with HTTP/queue messages
The enforcement mechanism matters more than the specific technology. Discipline degrades under deadline pressure; bash scripts don’t.
The Prediction: Where Agent Frameworks Are Heading
Within 12 months, every mature agent framework will ship with a default “workers must not access shared state directly” invariant. The alternative is debugging race conditions that surface as “flaky agent behavior” weeks after deployment.
Teams running 10+ concurrent agents who give each worker database access are quietly racking up state divergence bugs. They manifest as:
- Agents that “forget” constraints mentioned in previous turns
- Tasks marked complete by one agent and pending by another
- Duplicate actions executed by different workers
- Intermittent failures that can’t be reproduced with single-worker testing
These aren’t flaky agents. They’re distributed systems bugs wearing agent costumes. The fix isn’t better prompting—it’s architectural separation enforced with tooling.
When Should You Add This Rule?
What About “Just Cache Embeddings Locally”?
The Script Is the Keystone
Every architectural decision in our agent swarm traces back to that 30-line bash script. Stateless workers. Ephemeral containers. Clean API boundaries. Horizontal scaling without config changes. Debugging via single timeline replay.
The script isn’t clever. It’s not even elegant. But it’s the invariant that holds everything together, and it will outlast every engineer on the team. That’s the point. Architectural boundaries enforced with tooling survive team turnover, late-night refactors, and “just this once” shortcuts.
Your agent swarm will face the same temptation. The question is whether you’ll enforce the boundary before the incident, or after.