April 20, 2026 · 13 min read

Why We Banned 5-Minute Intervals in Our Agent Orchestrator (And What the Prompt Cache Actually Costs You)

The 5-minute polling interval sits on a caching cliff: long enough to expire your context, short enough to bankrupt you before the task finishes.

Tags: prompt caching · agent scheduling · Anthropic · LLM caching · autonomous agents · orchestration
[Figure: Agent scheduling intervals showing the forbidden dead zone between 270 and 1200 seconds]

Our first month running self-pacing agents at scale, the token bill arrived and I stared at it for ten minutes looking for a bug. We had projected $2,400 based on our load tests. The actual charge was $11,200. Same task volume, same models, same concurrency. The architecture hadn’t changed. What had changed was invisible: every agent had independently decided that “check every 5 minutes” was the correct polling strategy, and Anthropic’s prompt cache expires at exactly 300 seconds.

This is the story of how the most natural number in human scheduling—5 minutes—became the single most expensive choice in our autonomous system. It’s a story about K/V cache TTL semantics, the tyranny of round numbers, and why tool descriptions are the real programming language for agent swarms.

Why Is 300 Seconds the Most Expensive Number in AI?

Because it sits on an all-or-nothing cliff: long enough for Anthropic’s prompt cache to fully expire (TTL 300s), but short enough that you pay the full cache miss cost without amortizing it across meaningful idle time.

To understand why 300 seconds is poison, you need to understand how prompt caching actually works under the hood. When you send a request to Claude with cached context, Anthropic stores the processed key-value tensors from your system prompt, tool definitions, and conversation history in a distributed cache. This isn’t a soft cache where entries gradually fade. This is a hard TTL: at exactly 300 seconds, the entry is invalidated. Gone.

The semantics are binary. At 270 seconds, you pay for the incremental tokens in your new message. At 301 seconds, you pay for every single token in your cached prefix—all 8,000 tokens of system instructions, all 12,000 tokens of tool schemas, all 3,000 tokens of conversation history—again. Full price. Full latency.

There’s no gradient. No partial credit. The cache hit rate doesn’t degrade from 100% to 80% over time. It goes from 100% to 0% at the TTL boundary. This isn’t an implementation detail of one vendor; it’s intrinsic to how cached K/V state works at scale. Redis, Memcached, CloudFront, Anthropic’s inference layer—they all use TTL-based eviction because checking freshness is cheaper than gradual decay.
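The cliff can be sketched in a few lines. The prefix sizes are the illustrative figures from above; the 150-token incremental message is an assumption:

```typescript
// Sketch: tokens billed per wakeup under a hard-TTL prompt cache.
// Prefix sizes come from the example in the text; the incremental
// message size is an assumption for illustration.
const CACHED_PREFIX = 8000 + 12000 + 3000; // system + tools + history
const NEW_MESSAGE = 150;                   // incremental tokens per wakeup

function tokensBilled(secondsSinceLastCall: number, ttl = 300): number {
  return secondsSinceLastCall < ttl
    ? NEW_MESSAGE                  // hit: pay only the new message
    : CACHED_PREFIX + NEW_MESSAGE; // miss: reprocess the entire prefix
}

tokensBilled(270); // 150
tokensBilled(301); // 23150
```

The step function is the whole story: one second past the TTL and the bill jumps by two orders of magnitude.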

So 300 seconds is the worst possible choice. You sleep just long enough to guarantee a cache miss, then wake up just frequently enough to guarantee you’ll pay that miss again before the task completes. A build that takes 30 minutes and polls every 5 minutes burns through 6 full context reconstructions. If it had polled every 20 minutes, it would burn 1. If it had polled every 4 minutes, it would burn 0.
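The reconstruction counts above follow from a one-line model, assuming a miss whenever the sleep interval reaches the 300s TTL:

```typescript
// Sketch: count full context reconstructions for a polling task,
// assuming a cache miss whenever the sleep interval reaches the TTL.
function cacheMisses(
  taskSeconds: number,
  intervalSeconds: number,
  ttl = 300
): number {
  if (intervalSeconds < ttl) return 0; // cache stays warm between wakes
  return Math.floor(taskSeconds / intervalSeconds); // every wake is cold
}

cacheMisses(1800, 300);  // 6 — the 5-minute trap
cacheMisses(1800, 1200); // 1
cacheMisses(1800, 240);  // 0
```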

What Does the Dead Zone Look Like in Production?

A cluster of ScheduleWakeup calls at 300s, 600s, and 900s intervals with 0% cache hit rates, creating a sawtooth pattern in your token consumption graph where every wake-up costs 20,000+ tokens instead of 200.

When we instrumented our orchestrator, the telemetry was almost insulting in its clarity. We had given agents a ScheduleWakeup tool and described the delaySeconds parameter as “the number of seconds to wait before the next execution.” Without any other guidance, they reached for human numbers. We saw a distribution that looked like a clockmaker’s fever dream: massive spikes at 300, 600, 900, and 1800 seconds.

The 300-second spike was a bloodbath. Agents polling external APIs for task completion, checking file locks, waiting for CI builds. They would wake up, see no change, and immediately schedule another 300-second sleep. Each time, they paid the full context reconstruction. We were effectively paying $0.12 per wake-up for what should have been $0.001.

Interval         Cache State        Tokens per Wake   30-Min Task Cost
240s (4 min)     Warm               ~150              ~$0.02
300s (5 min)     Cold (Miss)        ~23,000           ~$0.72
600s (10 min)    Cold (Miss)        ~23,000           ~$0.36
1200s (20 min)   Cold (Amortized)   ~23,000           ~$0.06

Cost comparison for a 30-minute polling task with 20k cached context tokens at Anthropic pricing. The 5-minute interval is 12x more expensive than the 20-minute interval and 36x more expensive than the 4-minute interval.

This is the “round number trap.” Humans think in base-10 time units because we have ten fingers. Cron schedules, UI defaults, and biological intuition all push us toward 5, 10, 15, 30. But cache TTLs are binary. They don’t care about our fingers. The 300–1200 second range is a dead zone: too long to stay warm, too short to be efficient.

What We Tried That Didn’t Work

Our first instinct was to ask nicely. We updated the system prompt to include: “Please be efficient with wake-up intervals to minimize token costs.” The agents responded by adding jitter. Instead of 300 seconds, they chose 312 seconds, then 298, then 305. All still in the dead zone. When we suggested “longer intervals,” they jumped to 600 seconds—still a cache miss, just half as frequent. Still 18x too expensive.

We considered asking Anthropic to extend the TTL. This is architecturally impossible for them—the 300s window is a capacity-planning constant, not a configuration knob. K/V cache storage at inference scale is a finite resource, and every additional second of TTL multiplies GPU memory pressure across every cached prefix.

We tried randomizing intervals within a range, hoping the law of averages would save us. It doesn’t. If you randomize between 200 and 400 seconds, only the draws under the ~270s effective warm window hit the cache—you’re warm roughly a third of the time and cold the rest. You’re still paying for misses you could have avoided entirely. Randomization spreads the pain; it doesn’t eliminate it.
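The expected miss fraction for a uniformly drawn interval is easy to compute. This sketch assumes the cache stays warm only below a ~270s effective window:

```typescript
// Sketch: expected cache-miss fraction for intervals drawn uniformly
// from [lo, hi], assuming warmth only below warmMax seconds.
function missFraction(lo: number, hi: number, warmMax = 270): number {
  if (hi <= warmMax) return 0; // entire range stays warm
  if (lo >= warmMax) return 1; // entire range is cold
  return (hi - warmMax) / (hi - lo);
}

missFraction(200, 400); // 0.65 — cold roughly two-thirds of the time
```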

The Fix: Three Interventions

1. Tool Descriptions Are a Programming Language

We rewrote the ScheduleWakeup tool schema. Previously, the description was: “Schedule the next wake-up after a delay.” Now it’s a 400-character specification of cache windows:

{
  "name": "ScheduleWakeup",
  "description": "Schedule the next execution. CRITICAL: The prompt cache expires at 300s. Use either SHORT intervals (60-270s) to keep cache warm for active work, or LONG intervals (1200s+) to amortize the cache miss during idle polling. NEVER use 271-1199s—this is the 'dead zone' where you pay full cache cost without progress.",
  "parameters": {
    "type": "object",
    "properties": {
      "delaySeconds": {
        "type": "integer",
        "description": "Seconds to wait. Valid regions: 60-270 (cache warm) or 1200-3600 (amortized miss). Values outside these ranges will be clamped.",
        "minimum": 60,
        "maximum": 3600
      },
      "reason": {
        "type": "string",
        "enum": ["active_work", "idle_poll"],
        "description": "active_work: expecting state change soon (use 60-270s). idle_poll: checking dormant state (use 1200s+)."
      }
    },
    "required": ["delaySeconds", "reason"]
  }
}

This single change shifted the distribution immediately. When we logged intervals after deployment, the 300s spike evaporated. Agents started choosing 240s for active work and 1200s for idle checks. The tool description isn’t documentation—it’s the actual control surface for agent behavior.

2. Runtime Clamping with the Forbidden Gap

Tool descriptions can be ignored by confused models or antagonistic prompts. We added a runtime validation layer that enforces the contract at the orchestrator level. If an agent requests an interval inside the dead zone (271–1199 seconds), we snap it to the nearest valid region:

function clampWakeupInterval(
  requestedDelay: number,
  reason: 'active_work' | 'idle_poll'
): number {
  const CACHE_TTL = 270;      // Anthropic's effective warm window
  const LONG_POLL_MIN = 1200; // start of the amortized idle region

  // Already in a valid region: accept as-is
  if (requestedDelay <= CACHE_TTL || requestedDelay >= LONG_POLL_MIN) {
    return requestedDelay;
  }

  // In the dead zone (271-1199s): snap based on intent
  return reason === 'active_work'
    ? CACHE_TTL      // keep the cache warm for imminent work
    : LONG_POLL_MIN; // amortize the miss over a long idle sleep
}

// Enforcement in the orchestrator
const validatedDelay = clampWakeupInterval(
  toolCall.args.delaySeconds,
  toolCall.args.reason
);

if (validatedDelay !== toolCall.args.delaySeconds) {
  logger.warn({
    agentId,
    requested: toolCall.args.delaySeconds,
    enforced: validatedDelay,
    msg: 'Agent attempted dead zone scheduling; clamped to valid cache window'
  });
}

await scheduleWakeup(agentId, validatedDelay);

This creates a “pit of success.” Even if an agent is confused and asks for 5 minutes, it gets 4.5 minutes (warm) or 20 minutes (amortized). Never the cliff.

3. Cache Telemetry for the Lead Agent

We exposed prompt cache hit/miss data back to the agents themselves. When a worker wakes up, the orchestrator includes metadata in the context:

{
  "wakeup_metadata": {
    "previous_delay_seconds": 300,
    "cache_hit": false,
    "tokens_charged": 23100,
    "suggestion": "Previous interval (300s) caused cache miss. For cache efficiency, use 60-270s if work is ongoing, or 1200s+ if idle."
  }
}

Our lead agent (the meta-agent that manages the swarm) audits these telemetry streams. It identifies workers with chronically bad scheduling patterns and either adjusts their tool parameters dynamically or escalates them for model fine-tuning. This closes the feedback loop: the system learns which intervals actually work rather than which intervals sound good.
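A sketch of what that audit pass might look like. The WakeupRecord shape and the 50% threshold are hypothetical, not our production code:

```typescript
// Sketch: flag workers whose recent wakeups are mostly dead-zone misses.
// The record shape and threshold are hypothetical illustrations.
interface WakeupRecord {
  delaySeconds: number;
  cacheHit: boolean;
}

function chronicDeadZoners(
  log: Map<string, WakeupRecord[]>,
  threshold = 0.5
): string[] {
  const flagged: string[] = [];
  for (const [agentId, records] of log) {
    if (records.length === 0) continue;
    // A "bad" wakeup is a cache miss scheduled inside the dead zone
    const bad = records.filter(
      (r) => !r.cacheHit && r.delaySeconds >= 271 && r.delaySeconds < 1200
    ).length;
    if (bad / records.length >= threshold) flagged.push(agentId);
  }
  return flagged;
}
```

The lead agent runs something like this over each telemetry window and intervenes only on the flagged workers, so well-behaved agents never pay an audit cost.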

The Economic Reality

After implementing the three interventions, we measured scheduled-wake-up token costs over a two-week period. The reduction was approximately 40%. More importantly, task completion quality didn’t degrade—agents weren’t “waiting too long” and missing deadlines. They were simply waiting efficiently.

The Two Valid Regions Framework

When deciding how long to sleep, agents must classify the work:

  • Active Work (60–270s): You’re waiting for a specific event that is likely to happen soon—a build finishing, a file being written, a human responding. Stay in the cache window. Pay nothing extra on wake.
  • Idle Poll (1200s+): You’re checking a dormant state that might not change for hours. Accept the cache miss, but amortize it across 20+ minutes of idle time. Never poll idle states every 5 minutes.

300 seconds fails both tests. For active work, it’s too long—you miss the cache. For idle work, it’s too short—you pay the miss too frequently.
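One way to operationalize the two-region framework is to map an estimated time-to-next-event onto a valid interval. The helper below is illustrative, not production code; its thresholds simply mirror the regions above:

```typescript
// Sketch: map an estimated time-to-next-event onto the two valid regions.
// Illustrative helper; thresholds mirror the framework in the text.
function chooseInterval(expectedEventSeconds: number): number {
  if (expectedEventSeconds <= 270) {
    // Active work: stay inside the warm window (floor at 60s)
    return Math.max(expectedEventSeconds, 60);
  }
  // Idle poll: never land in the dead zone; amortize the miss
  return Math.max(expectedEventSeconds, 1200);
}

chooseInterval(120);  // 120 — warm region
chooseInterval(500);  // 1200 — snapped past the dead zone
chooseInterval(3600); // 3600 — long idle poll
```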

The Broader Lesson: Scheduling Is a Caching Problem

For autonomous agents, wake-up scheduling isn’t a timing problem. It’s a caching problem. The industry is about to rediscover everything database engineers learned in 1995 about working set size, cache invalidation, and amortization—but applied to LLM context instead of disk pages.

In 1995, if you designed a database that touched disk every 5 minutes regardless of load, you were fired. In 2026, if you design an agent that reconstructs 20k context every 5 minutes regardless of necessity, you just 5x your inference bill.

The frameworks that survive will internalize this. They won’t ask “how long should we wait?” They’ll ask “what cache state do we want to be in when we wake up?” They’ll expose cache telemetry as first-class observability. They’ll treat tool descriptions as critical infrastructure code, not user documentation.

If you’re building long-running autonomous agents with any self-scheduling capability—cron triggers, adaptive polling, wake-up-on-event—you’re probably already paying the 5-minute tax. Measure it. The dead zone is expensive, predictable, and entirely avoidable. Ban the round numbers. Embrace the two valid regions. Your token bill will thank you.