The real bottleneck for autonomous agents is usable context.
Model intelligence is fine. Execution speed is more than fine. As for “taste”, you can imbue the agent with your intent, tacit knowledge, and judgment through skills, code, context docs, and workflow design.
What actually caps autonomous execution is the usable context window. If the effective window is only 150k–200k tokens, you can’t trust agents to one-shot work that requires far more than that.
Opus 4.6 advertises 1M tokens but degrades well before that. By around 150k, the agent follows instructions less reliably, grows disoriented and equivocates, picks tools more sloppily, drops constraints and intent it was holding earlier in the run, and just goes for hacky solutions.
So the practical working window sits closer to 150k than 200k, and that budget has to cover everything the agent needs.
Loading enough tacit expertise to make an agent capable at a non-trivial task burns through most of that budget. Skills consume tokens. Context consumes tokens. Tool calls consume tokens. The examples needed for the agent to exercise judgment in ambiguous situations consume tokens. The agent’s own output in the chat consumes tokens. For complex workflows that require the agent to synthesize disparate inputs in concert, by the time the session is loaded with what the agent needs to know, the agent is left with virtually no residual capacity for reasoning and execution!
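To make that arithmetic concrete, here is a back-of-envelope sketch. Every line item and number below is an illustrative assumption, not a measurement; the point is only how quickly the budget disappears before the agent does any work:

```python
# Illustrative token-budget arithmetic for a single agent session.
# All numbers are assumptions for the sake of the example.

USABLE_WINDOW = 150_000  # practical window before degradation sets in

loaded_context = {
    "system prompt + skills": 20_000,
    "context docs / project conventions": 30_000,
    "examples for judgment in ambiguous cases": 25_000,
    "tool-call requests and results": 40_000,
    "agent's own chat output so far": 20_000,
}

spent = sum(loaded_context.values())
residual = USABLE_WINDOW - spent

print(f"spent: {spent:,} tokens")        # spent: 135,000 tokens
print(f"residual: {residual:,} tokens")  # residual: 15,000 tokens
```

Under these (invented) numbers, roughly 90% of the usable window is gone before reasoning begins.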
The same constraint also renders even ostensibly atomic tasks intractable for the agent. Changing a few lines in a mature codebase can require the agent to load enough surrounding code, type definitions, call sites, and project conventions that the context needed to even apprehend the problem exceeds the usable window before the agent writes a single token of output.
You then have two options. Decompose the task into smaller pieces the agent can handle in isolation, which itself requires human judgment about where the seams are. Or keep the task out of agent scope entirely. Both options push the cognitive burden back onto you. And this is hard! Often, before you even undertake a nontrivial refactoring, you have to spend a few sessions working with the agent just to figure out the right order of steps, along with clear session boundaries where you either compact the session or start afresh.
Now you may very well ask: why not just compact? Compaction is the obvious objection here, but auto-compaction triggered by the agent itself is an anti-pattern! The agent has no reliable sense of what to retain versus what to discard, so it drops details it needed, keeps working without them, and delivers the resulting bad output with unwarranted certitude and lots of gaslighting.

Compaction works only when you trigger it deliberately: first dump the relevant context into a working document so nothing important lives only in the volatile window, then pass the compact command an argument that specifies what to preserve. That argument is where the intelligence lives, because it encodes what’s relevant and what isn’t. Designing a workflow that supplies a good compaction argument at the right moments could be a real breakthrough for agent autonomy over longer, more complex workflows. But letting the agent decide when and how to compact itself remains, to my mind, a cardinal mistake.
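The shape of that deliberate workflow can be sketched in a few lines. The `Session` class below is a hypothetical stand-in for whatever agent harness you use, with invented method names; what matters is the two-step structure, with the preservation argument written by a human:

```python
# Sketch of a deliberate (human-triggered) compaction step.
# `Session` and its methods are hypothetical stand-ins, not a real API.

import tempfile
from pathlib import Path

class Session:
    """Hypothetical stand-in for an agent session."""
    def __init__(self, transcript: str):
        self.transcript = transcript
        self.compacted_with = None

    def compact(self, instructions: str) -> None:
        # A real harness would summarize the window guided by
        # `instructions`; here we just record the preservation argument.
        self.compacted_with = instructions

def deliberate_compact(session: Session, working_doc: Path, preserve: str) -> None:
    # Step 1: externalize state to a durable file, so nothing important
    # lives only in the volatile context window.
    working_doc.write_text(session.transcript)
    # Step 2: compact with an explicit argument naming what to retain.
    # This argument carries the human judgment about relevance.
    session.compact(instructions=preserve)

s = Session("full transcript of the session so far")
deliberate_compact(
    s,
    Path(tempfile.gettempdir()) / "refactor-notes.md",
    preserve="Keep the remaining call sites to migrate and the constraint "
             "that public APIs stay unchanged; discard exploratory dead ends.",
)
```

The design point is that both the trigger and the `preserve` string come from you, never from the agent deciding mid-run that its own window is full.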
A larger usable context window lets you supply many more examples directly, which spares you from trying to articulate the more ineffable dimensions of the work as a skill. And it fundamentally recalibrates the complexity of tasks an agent can one-shot with full autonomy.