Hierarchical Context Loading: Reducing Token Usage 50-70%
Why bigger context windows don't mean better results. A three-layer architecture that loads only what's needed, when it's needed.
My CLAUDE.md file was 800 lines. It contained everything: agent workflows, skill methodologies, tool catalogs, engagement templates, reporting standards. Every session loaded the entire system context.
Then I read Chroma's context rot research. The finding was uncomfortable: LLMs don't process context uniformly. Performance degrades as input length increases, even on simple tasks. My comprehensive configuration was working against me.
The Context Rot Problem
Chroma's July 2025 study tested 18 state-of-the-art models including GPT-4.1, Claude 4, and Gemini 2.5. The results challenged a core assumption:
"LLMs are typically presumed to process context uniformly—that is, the model should handle the 10,000th token just as reliably as the 100th. In practice, this assumption does not hold."
The symptoms are familiar to anyone who's worked with long system prompts:
- Instructions in the middle get overlooked ("lost-in-the-middle")
- Hallucinations increase as the model builds on flawed context
- Random mistakes and refusals appear with longer inputs
- Performance decays on every model tested; Claude degraded slowest, but no model is immune
The math makes it worse. Self-attention scales quadratically with sequence length, so doubling your context roughly quadruples the attention compute. And you're paying for capacity that doesn't translate to performance.
The Instruction Limit Reality
Here's the number that changed my architecture: frontier LLMs can follow approximately 150-200 instructions with reasonable consistency. Smaller models attend to fewer.
My 800-line CLAUDE.md wasn't a comprehensive configuration. It was noise competing for attention.
Progressive Disclosure as Architecture
Anthropic's context engineering guidance describes progressive disclosure:
"Agents incrementally discover relevant context through exploration. Each interaction yields context that informs the next decision. Agents assemble understanding layer by layer, maintaining only what's necessary in working memory."
This isn't just a RAG technique. It's an architectural principle for system configuration.
The Three-Layer Architecture
I rebuilt my framework with explicit size limits at each layer:
Layer 1: Navigation (Root CLAUDE.md)
Size limit: <250 lines
Contains only:
- System architecture overview
- Pointers to agents/ directory
- Pointers to skills/ directory
- Global stack preferences
- Critical rules (not all rules)
Never contains:
- Agent workflows (→ agents/*.md)
- Skill methodologies (→ skills/*/SKILL.md)
- Detailed tool lists (→ skill directories)
- Engagement data (→ output/)
The root file is a map, not a library.
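To make that concrete, here's a sketch of what a sub-250-line root file might open with. The agent names, stack preferences, and rules below are hypothetical, not my actual configuration:

```markdown
# System Navigation

This file is a map. Details live in agents/ and skills/.

## Agents
- agents/security.md: security testing engagements
- agents/writer.md: long-form writing and editing

## Skills
- skills/security-testing/SKILL.md
- skills/writer/SKILL.md
- skills/qa-review/SKILL.md

## Global Stack Preferences
- Python, pytest, ruff

## Critical Rules
- Load one SKILL.md per task; never preload others.
- Session state goes in sessions/, never in this file.
```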
Layer 2: Skill Context (SKILL.md files)
Size limit: <500 lines per file
Each skill gets a complete context file:
```
skills/security-testing/SKILL.md
skills/writer/SKILL.md
skills/qa-review/SKILL.md
```
SKILL.md acts as navigation for deeper content extracted to:
- methodologies/ - Process documentation
- reference/ - Standards, frameworks
- workflows/ - Step-by-step procedures
- templates/ - Template files
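Put together, one skill's directory under these conventions might look like:

```
skills/security-testing/
├── SKILL.md          # navigation, <500 lines
├── methodologies/    # process documentation
├── reference/        # standards, frameworks
├── workflows/        # step-by-step procedures
└── templates/        # template files
```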
Layer 3: Agent Prompts
Size limit: <150 lines per file
Compact agent identity files that specify:
- Role definition
- Which SKILL.md to load
- Context loading order
- Communication style
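For instance, a compact agents/security.md might read like this (contents hypothetical):

```markdown
# Agent: security

Role: offensive security tester for authorized engagements only.

Skill: load skills/security-testing/SKILL.md before task work.

Context loading order:
1. CLAUDE.md (root navigation)
2. skills/security-testing/SKILL.md
3. sessions/ state file, if one exists

Communication style: findings first, remediation second, no speculation.
```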
The Loading Flow
When a user invokes an agent:
```
1. Claude Code spawns agent
   → Load agents/security.md (150 lines)
2. Agent loads root context
   → Load CLAUDE.md (250 lines)
3. Agent detects skill needed
   → Load skills/security-testing/SKILL.md (500 lines)
4. Agent checks for session state
   → Load sessions/YYYY-MM-DD-project.md (if exists)
5. Execute with minimal context
```
What's NOT loaded: Every other skill, historical sessions, unused templates, methodologies for other workflows.
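In code terms, resolving the context amounts to assembling a short, ordered file list and ignoring everything else. A minimal Python sketch, assuming the directory layout above; the `detect_skill` keyword router is a toy stand-in, not how Claude Code actually routes:

```python
from pathlib import Path

def resolve_context(agent: str, task: str, session: str | None = None) -> list[Path]:
    """Assemble the ordered, minimal context for one agent invocation."""
    files = [
        Path(f"agents/{agent}.md"),  # Layer 3: agent identity (<150 lines)
        Path("CLAUDE.md"),           # Layer 1: root navigation (<250 lines)
    ]
    skill = detect_skill(task)       # e.g. "security-testing"
    if skill:
        files.append(Path(f"skills/{skill}/SKILL.md"))  # Layer 2 (<500 lines)
    if session and (state := Path(f"sessions/{session}.md")).exists():
        files.append(state)          # session state, only if present
    return files                     # every other file stays on disk, unloaded

def detect_skill(task: str) -> str | None:
    """Toy keyword router for illustration only."""
    keywords = {"pentest": "security-testing", "draft": "writer", "review": "qa-review"}
    return next((s for k, s in keywords.items() if k in task.lower()), None)
```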
Token Efficiency Results
The reduction depends on task complexity:
| Task Type | Before | After | Reduction |
|---|---|---|---|
| Simple navigation | 800 lines | 250 lines | 69% |
| Single skill work | 800 lines | 750 lines | 6% (but better organized) |
| Multi-skill work | 800 lines | ~1,000 lines | -25% (more total, but loaded sequentially) |
The 69% reduction for simple tasks matters most. Not every interaction needs full skill context. Quick questions, file navigation, git operations - these don't need security testing methodologies loaded.
Why This Mirrors RAG Economics
Research on retrieval pipelines suggests RAG queries can cost around 1,250x less than pure long-context approaches. The same economics apply to system prompts:
Long-context approach:
- Stuff everything into CLAUDE.md
- Pay to process 800+ lines every interaction
- Instructions compete for attention
Hierarchical approach:
- Store skill details in files (cheap)
- Load only current skill (efficient)
- Update one file when patterns change (maintainable)
You're not "losing" information by not loading it. You're preventing it from interfering with the current task.
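A rough back-of-envelope makes the asymmetry concrete. If you assume about 10 tokens per line (an illustrative figure; actual density varies), the monolithic file spends roughly 8,000 system-prompt tokens on every interaction, while a simple-navigation session under the hierarchy spends roughly 2,500. The unloaded skill files cost nothing until a task actually needs them.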
Implementation Constraints
I enforce these limits with pre-commit hooks:
# Agent files must be <150 lines
# SKILL.md files must be <500 lines
# Root CLAUDE.md must be <250 lines
If a file exceeds its limit, I extract content to subdirectories. The constraint forces progressive disclosure.
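The hook itself can be a short script. A minimal Python sketch, assuming the layer limits and directory layout above (my actual hook differs in detail):

```python
#!/usr/bin/env python3
"""Pre-commit hook: fail the commit when a context file reaches its layer's line limit."""
import sys
from pathlib import Path

# (glob pattern, line limit) pairs, one per layer; files must stay under the limit
LIMITS = [
    ("CLAUDE.md", 250),          # Layer 1: root navigation
    ("skills/*/SKILL.md", 500),  # Layer 2: skill context
    ("agents/*.md", 150),        # Layer 3: agent prompts
]

def main() -> int:
    over_limit = False
    for pattern, max_lines in LIMITS:
        for path in Path(".").glob(pattern):
            with path.open(encoding="utf-8") as f:
                lines = sum(1 for _ in f)
            if lines >= max_lines:
                print(f"{path}: {lines} lines (limit: <{max_lines})")
                over_limit = True
    return 1 if over_limit else 0

if __name__ == "__main__":
    sys.exit(main())
```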
The Routing Decision Tree
When modifying configuration, I ask:
```
What is being updated?
│
├─ Global preference → Root CLAUDE.md
├─ Agent behavior    → agents/[agent].md
├─ Skill workflow    → skills/[skill]/SKILL.md
├─ Detailed process  → skills/[skill]/methodologies/
└─ Reference data    → skills/[skill]/reference/
```
Context belongs at the most specific level possible. Root-level additions require justification.
What I Learned
Bigger windows don't mean better results. Context rot is real. More instructions don't mean more of them get followed.
Navigation beats libraries. Root configuration should point to context, not contain it.
Lazy loading for AI. The programming pattern applies: don't load what you don't need, unload what you're done with.
Measure by task type. Simple tasks benefit most from hierarchical loading. Complex tasks load more, but sequentially rather than all at once.
The goal isn't minimizing context. It's loading the right context for each interaction. Less loaded often means more followed.
Sources
Context Rot & Performance
- Context Rot: How Increasing Input Tokens Impacts LLM Performance - Chroma Research
- Context rot: the emerging challenge - Understanding AI
- Why larger LLM context windows are all the rage - IBM Research
Progressive Disclosure & Context Engineering
- Effective context engineering for AI agents - Anthropic
- Progressive Disclosure in Agent Skills - Martha Kelly
- Context Engineering Guide - Prompt Engineering Guide
CLAUDE.md Best Practices
- Claude Code: Best practices for agentic coding - Anthropic
- Writing a good CLAUDE.md - HumanLayer
- Using CLAUDE.MD files - Claude Blog
Token Optimization
- Token-Budget-Aware LLM Reasoning - ACL 2025
- Optimize LLM API Costs: Token Strategies for 2025 - Sparkco
RAG vs Long Context
- RAG vs Long Context Window Models - Legion Intel
- RAG vs long-context LLMs comparison - Meilisearch