April 29, 2026 · 13 min read

Why Our Agents Sleep for 4 Minutes 30 Seconds (And Yours Should Too)

Your agent's sleep(300) is silently bleeding money. Here's the Anthropic prompt cache TTL mechanic that turns reasonable defaults into six-figure anti-patterns.

Tags: Anthropic prompt cache · AI agent polling · LLM cost optimization · cache TTL · agent scheduling · agent-swarm
[Figure: cost curve showing the dramatic spike at 5-minute sleep intervals due to the Anthropic cache TTL cliff]

We were burning $847/month on a single agent before breakfast. The fix? Change one number from 300 to 270. Same wake-ups per hour, same tasks completed, same user experience. The difference was whether Anthropic’s prompt cache considered us “warm” or “cold.”

The Five-Minute Bug Everyone Has

Every agent that polls has a sleep interval. When we built our first long-running agent—the one that babysits PRs, checks CI status, and nudges stale reviews—we asked the obvious question: “How often should it check?” Five minutes felt reasonable. Long enough to not hammer APIs, short enough to feel responsive. We typed await sleep(300 * 1000) and moved on.
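
For reference, the loop was the textbook shape below. This is a minimal sketch of a TypeScript agent; loadAgentContext and checkOpenPRs are illustrative stand-ins for our real internals, not actual code from the agent.

// The naive polling loop: reload the full context, do the work, sleep five minutes.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function loadAgentContext(): Promise<object> {
  // Reload system prompt + conversation prefix -- the tokens the cache would have saved.
  return {};
}

async function checkOpenPRs(_ctx: object): Promise<void> {
  // The actual LLM call(s): check CI status, nudge stale reviews.
}

async function babysitPRs(): Promise<void> {
  while (true) {
    const ctx = await loadAgentContext();
    await checkOpenPRs(ctx);
    await sleep(300 * 1000); // the "reasonable" five minutes -- the bug
  }
}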

What nobody told us: Anthropic’s prompt cache holds your system prompt and conversation prefix for exactly five minutes from the last hit. Not “about five minutes.” Not “five minutes give or take.” Exactly five minutes, and then you fall off a cliff.

A cache hit on input tokens costs roughly 10% of an uncached read. For our babysit-prs agent, that meant the difference between $0.0045 and $0.045 per resume. With ~12 wake-ups per hour, that’s $0.054 versus $0.54. Per hour. For one agent.

The Cliff Diagram (Real Data)

We instrumented our 24-agent swarm and plotted cost-per-resume against sleep duration. The curve is brutal:

Sleep Duration | Cache State       | Cost Per Resume          | Hit Rate
60–270s        | Warm (guaranteed) | $0.0045                  | 94%
290–310s       | Roulette          | $0.025 (variable)        | ~40%
320–1200s      | Cold (guaranteed) | $0.045                   | 8%
1200s+         | Cold, amortized   | $0.045 (but fewer wakes) | 0% (by design)

The “dead zone” is 300–1200 seconds. You’ve paid the cache miss, but you haven’t slept long enough to amortize that cost across meaningful work. It’s the worst of both worlds: expensive AND frequent.
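
If you want the zones as code rather than a table, a small classifier makes the cliff explicit. The thresholds are the ones from the table above; the function is ours to name, not part of any SDK.

type CacheZone = "warm" | "roulette" | "dead-zone" | "amortized-cold";

// Classify a sleep interval against Anthropic's 5-minute ephemeral cache TTL.
function classifySleep(seconds: number): CacheZone {
  if (seconds <= 270) return "warm";       // guaranteed to wake before the TTL expires
  if (seconds <= 320) return "roulette";   // racing the TTL: the hit rate is a coin flip
  if (seconds <= 1200) return "dead-zone"; // guaranteed miss, but still waking frequently
  return "amortized-cold";                 // guaranteed miss, amortized over a long sleep
}

classifySleep(300); // the "reasonable" default: at best roulette, at worst a guaranteed miss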

The $847 Month vs. The $89 Month

Our babysit-prs agent ran for 30 days with sleep(300). The numbers:

  • ~12 wake-ups per hour × 24 hours × 30 days = ~8,640 resumes
  • Cache hit rate: ~8% (some got lucky, most didn’t)
  • Input token cost: $847

We changed one line: await sleep(270 * 1000). That’s it. Same agent, same context window, same wake-up cadence in human terms—roughly 13 vs 12 per hour. Next 30 days:

  • ~13 wake-ups per hour (slightly more frequent)
  • Cache hit rate: 94%
  • Input token cost: $89

We also tested the other valid strategy: sleep(1800) (30 minutes). Two wake-ups per hour, guaranteed cache miss each time, but so few resumes that total cost came in at $134. Both beat the “reasonable” five-minute default by 6–10x.

The ScheduleWakeup Tool: Codifying the Rule

We now ship a ScheduleWakeup tool that bakes this logic into the agent’s reasoning. The schema is simple, but the description is where the magic lives:

{
  "name": "ScheduleWakeup",
  "description": "Schedule the next time this agent should wake and check for work. CRITICAL: The Anthropic prompt cache TTL is 5 minutes (300s). Sleeping for 300-1200s puts you in the 'dead zone'—you pay full cache-miss cost but wake too frequently to amortize it. Choose either: (1) 60-270s to keep cache warm for active polling, or (2) 1200s+ to accept the cache miss and batch work. Never choose 300-1200s. The 'reasonable' 5-minute default is an anti-pattern.",
  "parameters": {
    "type": "object",
    "properties": {
      "seconds": {
        "type": "integer",
        "minimum": 60,
        "maximum": 3600,
        "description": "Seconds until next wake. Must be in [60,270] for hot-loop polling or [1200,3600] for idle waits. Values in (270,1200) are rejected."
      },
      "reason": {
        "type": "string",
        "description": "Why this interval was chosen: 'active-work-ongoing' for 60-270s, 'idle-waiting-for-external' for 1200s+"
      }
    },
    "required": ["seconds", "reason"]
  }
}
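
On our side, the tool handler enforces the same rule, so a dead-zone value comes back to the model as an error it can correct rather than a silent cache miss. A sketch of that handler in TypeScript; the function names and the setTimeout hookup are illustrative, not our production scheduler.

interface ScheduleWakeupInput {
  seconds: number;
  reason: "active-work-ongoing" | "idle-waiting-for-external";
}

// Validate the model's chosen interval before scheduling anything.
function handleScheduleWakeup(input: ScheduleWakeupInput): { ok: boolean; message: string } {
  const hotLoop = input.seconds >= 60 && input.seconds <= 270;
  const idle = input.seconds >= 1200 && input.seconds <= 3600;
  if (!hotLoop && !idle) {
    return {
      ok: false,
      message: `Rejected: ${input.seconds}s falls in the cache dead zone. Choose 60-270s or 1200-3600s.`,
    };
  }
  setTimeout(() => resumeAgent(input.reason), input.seconds * 1000); // stand-in for the real scheduler
  return { ok: true, message: `Sleeping ${input.seconds}s (${input.reason}).` };
}

function resumeAgent(_reason: string): void {
  // Illustrative: reload context and run the next iteration of the agent loop.
}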

The description field matters. Anthropic’s model re-reads this every time it considers calling the tool. It re-derives the cache-aware logic fresh each wake-up. We’ve watched traces where the agent explains: “CI is still running, I should hot-loop at 240s to stay cached” or “No open PRs need attention, I’ll idle at 1800s.”

Cache-Aware Pacing: A New Design Dimension

Once you internalize the cache cliff, polling stops being “how often should I check?” and becomes “when does my cache window naturally expire and what should I do then?”

Our agents now burst. When a build is running, they hot-loop at 240–270s: “Is it done? Is it done? Is it done?” The cache stays warm, they’re synchronized with the actual work. Once the build queues or completes, they drop immediately to 1800s or 3600s. Why pay to stay warm when there’s nothing warm to stay for?
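
In code, the pacing decision reduces to a two-state choice. A minimal sketch; the work-state fields are examples from our PR-babysitting agent, not a general schema.

// Cache-aware pacing: hot-loop while work is in flight, drop to idle otherwise.
const HOT_LOOP_SECONDS = 270; // just under the 5-minute TTL, so the cache stays warm
const IDLE_SECONDS = 1800;    // accept the miss and amortize it over a long sleep

interface WorkState {
  ciRunning: boolean;
  reviewsNeedingNudge: number;
}

function nextSleepSeconds(state: WorkState): number {
  const activeWork = state.ciRunning || state.reviewsNeedingNudge > 0;
  return activeWork ? HOT_LOOP_SECONDS : IDLE_SECONDS;
}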

This is genuinely new. Traditional cron jobs don’t have context reload costs. A bash script polling every 5 minutes pays nothing for the interval. AI agents pay for every context window they load. The scheduling primitive is different.

What Doesn’t Work (And Why)

We tried the obvious fixes that seem like they should work. They didn’t.

Jitter / randomized offsets: Adding ±30s random jitter to 300s sleep. Hypothesis: spread the herd, reduce peak load, maybe stay under the TTL. Reality: Anthropic’s cache TTL is deterministic per conversation thread, not global. Jitter just made us miss unpredictably. Cost variance went up, average stayed bad.

Adaptive backoff: Start at 60s, double until we hit “efficiency.” Problem: the first wake after a long sleep pays full cost. If you back off past 270s, that first expensive wake happens every cycle. You’re optimizing for the wrong metric (wakes per hour) instead of total cost.

Preemptive refresh: Wake at 240s, do nothing, sleep 240s again to “keep the cache alive.” This doubles your wake-ups for marginal hit rate improvement. The math: 2 × $0.0045 = $0.009, which beats 1 × $0.045, but loses to just doing real work at 270s and accepting occasional misses.
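
Spelled out with the per-resume figures from earlier, the comparison looks like this (a back-of-envelope, not measured data):

// Per-cycle input cost of each option, using the babysit-prs per-resume numbers above.
const CACHE_HIT = 0.0045;  // $ per resume with a warm cache
const CACHE_MISS = 0.045;  // $ per resume with a cold cache

const preemptiveRefresh = 2 * CACHE_HIT;  // $0.009: one no-op wake plus one real wake, both warm
const acceptTheMiss = 1 * CACHE_MISS;     // $0.045: one real wake, cold
const hotLoopAt270s = 1 * CACHE_HIT;      // $0.0045 (plus occasional misses): one real wake, no wasted no-op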

Swarm-Scale Numbers

Across 24 agents running ~1,400 wake-ups per day, switching from uniform 5-minute polling to cache-aware pacing:

  • Anthropic input-token bill: down 58%
  • Task throughput: unchanged
  • Latency to completion: unchanged (actually slightly better during active work)
  • Successful completion rate: unchanged

The savings showed up in week 1 and stuck. No regression in any operational metric. Just money we stopped lighting on fire.

Is This Anthropic-Specific?

No. The mechanic varies, but the cliff exists everywhere.

  • OpenAI: Prompt caching exists with a quieter eviction window. We’ve measured similar patterns but less dramatic cost deltas.
  • Gemini: Context caching is explicitly TTL-controlled (defaults to 1 hour, configurable). Same cliff, user-visible duration.
  • Every major provider: Sticky context features are table stakes now. They all have TTLs. They all have cliffs.

The agent-architecture rule generalizes: your scheduler must be aware of your provider’s cache TTL. The natural human intuitions about polling intervals—5 minutes, 15 minutes, 1 hour—are uniformly wrong because they were designed for systems where context is free to reload.
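
Concretely, that means the scheduler carries a per-provider cache policy instead of a single magic number. A sketch of the shape; the Anthropic row reflects the 5-minute TTL discussed in this post, while the OpenAI and Gemini rows are placeholders you should measure and adjust before trusting.

interface CachePolicy {
  ttlSeconds: number;        // how long the provider keeps the prefix warm
  hotLoopMaxSeconds: number; // stay safely under the TTL while actively polling
  idleMinSeconds: number;    // long enough to amortize a guaranteed miss
}

const CACHE_POLICIES: Record<string, CachePolicy> = {
  anthropic: { ttlSeconds: 300, hotLoopMaxSeconds: 270, idleMinSeconds: 1200 },
  openai: { ttlSeconds: 300, hotLoopMaxSeconds: 270, idleMinSeconds: 1200 },   // placeholder: measure it
  gemini: { ttlSeconds: 3600, hotLoopMaxSeconds: 3300, idleMinSeconds: 7200 }, // placeholder: TTL is configurable
};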

What We Think Happens Next

Within 12 months, every serious agent framework will ship with cache-aware scheduling primitives. LangGraph, CrewAI, AutoGen, Mastra—someone will crack this first, the rest will follow. sleep(5*60) will be flagged as an anti-pattern in linters, same as setTimeout(0).

Teams polling on round-number minute intervals are silently bleeding 5–10x their token bill. Most won’t notice until they cross-reference Anthropic’s cached-vs-uncached metrics in the dashboard—and that view is buried three clicks deep.

Audit Your Code Today

# Checklist for auditing your polling code:

[ ] grep -r "sleep.*300" --include="*.ts" --include="*.js" --include="*.py"
[ ] grep -r "setTimeout.*300" --include="*.ts" --include="*.js"
[ ] grep -r "asyncio.sleep.*300" --include="*.py"
[ ] grep -r "time.sleep.*300" --include="*.py"

[ ] For each hit: is this an agent that loads a system prompt?
    If yes: is the sleep in [60,270] or [1200,inf]?
    If in (270,1200): you have the bug.

[ ] Check your Anthropic dashboard:
    Usage → Select model → Cached tokens vs. Uncached tokens
    If uncached >> cached for long-running agents: you have the bug.

[ ] Calculate your own optimal interval (a runnable version follows this checklist):
    cache_hit_cost = (system_prompt_tokens + prefix_tokens) * 0.0000003
    cache_miss_cost = (system_prompt_tokens + prefix_tokens) * 0.000003

    hourly_wakes = 3600 / sleep_seconds
    hourly_cost = hourly_wakes * cache_miss_cost  # worst case

    Find the sleep_seconds where hourly_cost breaks your budget.
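
That last item, as runnable code. The per-token rates match the ones above (Sonnet-class pricing: $0.30/MTok cached reads, $3/MTok uncached input); substitute your model's rates and your measured hit rate.

// Hourly input-token cost for a polling agent, weighted by measured cache hit rate.
function hourlyInputCost(opts: {
  promptTokens: number;  // system prompt + conversation prefix
  sleepSeconds: number;
  cacheHitRate: number;  // 0..1, from your dashboard or per-response usage fields
}): number {
  const cacheHitCost = opts.promptTokens * 0.0000003;
  const cacheMissCost = opts.promptTokens * 0.000003;
  const costPerWake =
    opts.cacheHitRate * cacheHitCost + (1 - opts.cacheHitRate) * cacheMissCost;
  return (3600 / opts.sleepSeconds) * costPerWake;
}

hourlyInputCost({ promptTokens: 70_000, sleepSeconds: 270, cacheHitRate: 0.94 }); // hot loop
hourlyInputCost({ promptTokens: 70_000, sleepSeconds: 300, cacheHitRate: 0.08 }); // dead zone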

The Calculator

Here’s the back-of-envelope we use for new agents:

Cache-Aware vs. Naïve Polling

System prompt: 60K tokens · Context prefix: 10K tokens · Wake-ups: 12/hour

Strategy          | Sleep | Hit Rate | Monthly Cost
Dead zone (naïve) | 300s  | 8%       | $847
Hot loop          | 270s  | 94%      | $89
Amortized idle    | 1800s | 0%*      | $134

*By design: few enough wakes that cache miss cost is amortized.

The One-Line Fix

If you take one thing from this: find every sleep(300) in your agent codebase and ask whether it should be 270 or 1800. The “reasonable” default is a holdover from a world where reloading context cost nothing.

FAQ

Does this apply to short-lived agents too?

Cache-aware scheduling matters most for agents with wake-up loops. One-shot agents or those with human-in-the-loop triggers don’t hit this—their context loads fresh each time regardless. Focus on polling agents first.

What if my provider doesn’t document their cache TTL?

Measure it. Log the cached vs. uncached input-token counts your API returns (Anthropic reports cache_read_input_tokens and cache_creation_input_tokens in each response’s usage block) across different sleep intervals. The cliff will show up in your data as a cost spike. Most providers fall in the 5–10 minute range.
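
One way to take that measurement is against Anthropic’s Messages API, which reports cache_read_input_tokens and cache_creation_input_tokens on every response. This is a probe sketch, not production code; the model ID and prompt text are placeholders, and the cached prefix must exceed the model’s minimum cacheable length to register at all.

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

async function probeCache(systemPrompt: string, sleepSeconds: number): Promise<void> {
  const response = await client.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 64,
    system: [{ type: "text", text: systemPrompt, cache_control: { type: "ephemeral" } }],
    messages: [{ role: "user", content: "Anything new to do?" }],
  });
  const { input_tokens, cache_read_input_tokens, cache_creation_input_tokens } = response.usage;
  console.log(
    `sleep=${sleepSeconds}s uncached=${input_tokens} read=${cache_read_input_tokens} created=${cache_creation_input_tokens}`
  );
}

Sweep sleepSeconds across your candidate intervals and watch where cache_read_input_tokens drops to zero; that’s your provider’s cliff.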

Can I just use a persistent connection to keep the cache alive?

No. The Anthropic cache is server-side, keyed on your organization and the exact prompt prefix, not your HTTP connection. Keeping a socket open doesn’t extend the TTL. Only a real request that reuses the same cache_control-marked prefix refreshes it.

How do I handle jitter when APIs rate-limit?

Apply jitter to the work you do after waking, not to the sleep itself. Wake at 270s, check rate limit status, then add 0–30s random delay before calling the LLM. The cache TTL resets on the LLM call, not on wake.
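
A sketch of that ordering (callTheModel is an illustrative stand-in for the agent’s LLM call):

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function wakeAndWork(): Promise<void> {
  await sleep(270 * 1000);             // fixed, cache-aware wake interval
  await sleep(Math.random() * 30_000); // 0-30s jitter applied to the work, not the schedule
  await callTheModel();                // this call is what refreshes the cache TTL
}

async function callTheModel(): Promise<void> {
  // Illustrative stand-in for the agent's actual LLM request.
}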

Is there any case where 300s is actually correct?

Only if your agent wakes, does zero LLM work, and immediately sleeps again. In that case you’re not paying for context anyway. But that’s a strange agent architecture—most wake to reason, which loads context.