Hierarchical Context Loading: Reducing Token Usage 50-70%

Why bigger context windows don't mean better results. A three-layer architecture that loads only what's needed, when it's needed.

My CLAUDE.md file was 800 lines. It contained everything: agent workflows, skill methodologies, tool catalogs, engagement templates, reporting standards. Every session loaded the entire system context.

Then I read Chroma's context rot research. The finding was uncomfortable: LLMs don't process context uniformly. Performance degrades as input length increases, even on simple tasks. My comprehensive configuration was working against me.

The Context Rot Problem

Chroma's July 2025 study tested 18 state-of-the-art models including GPT-4.1, Claude 4, and Gemini 2.5. The results challenged a core assumption:

"LLMs are typically presumed to process context uniformly—that is, the model should handle the 10,000th token just as reliably as the 100th. In practice, this assumption does not hold."

The symptoms are familiar to anyone who's worked with long system prompts:

  • Instructions in the middle get overlooked ("lost-in-the-middle")
  • Hallucinations increase as the model builds on flawed context
  • Random mistakes and refusals appear with longer inputs
  • Claude degrades the most slowly of the tested models, but no model is immune

The math makes it worse. Attention cost scales quadratically with context length: double the input and you roughly quadruple the compute. And you're paying for capacity that doesn't translate into performance.

The Instruction Limit Reality

Here's the number that changed my architecture: frontier LLMs can follow approximately 150-200 instructions with reasonable consistency. Smaller models attend to fewer.

My 800-line CLAUDE.md wasn't a comprehensive configuration. It was noise competing for attention.

Progressive Disclosure as Architecture

Anthropic's context engineering guidance describes progressive disclosure:

"Agents incrementally discover relevant context through exploration. Each interaction yields context that informs the next decision. Agents assemble understanding layer by layer, maintaining only what's necessary in working memory."

This isn't just a RAG technique. It's an architectural principle for system configuration.

The Three-Layer Architecture

I rebuilt my framework with explicit size limits at each layer:

Layer 1: Navigation (Root CLAUDE.md)

Size limit: <250 lines

Contains only:

  • System architecture overview
  • Pointers to agents/ directory
  • Pointers to skills/ directory
  • Global stack preferences
  • Critical rules (not all rules)

Never contains:

  • Agent workflows (→ agents/*.md)
  • Skill methodologies (→ skills/*/SKILL.md)
  • Detailed tool lists (→ skill directories)
  • Engagement data (→ output/)

The root file is a map, not a library.
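
For a concrete sense of the shape, here is a navigation-only root file in miniature. The section names and paths are an illustrative sketch, not lifted from my actual configuration:

# CLAUDE.md (root navigation layer)

## Architecture
Agents live in agents/, skills in skills/, engagement output in output/.

## Skills index
- Security testing → skills/security-testing/SKILL.md
- Writing → skills/writer/SKILL.md
- QA review → skills/qa-review/SKILL.md

## Global stack preferences
One line per preference (language, package manager, test runner).

## Critical rules
Only the rules that apply to every session; everything else moves down a layer.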

Layer 2: Skill Context (SKILL.md files)

Size limit: <500 lines per file

Each skill gets a complete context file:

skills/security-testing/SKILL.md
skills/writer/SKILL.md
skills/qa-review/SKILL.md

SKILL.md acts as navigation for deeper content extracted to:

  • methodologies/ - Process documentation
  • reference/ - Standards, frameworks
  • workflows/ - Step-by-step procedures
  • templates/ - Template files
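
Laid out on disk, a single skill might look like the tree below. The individual file names are hypothetical; only SKILL.md and the four subdirectory names come from the structure above:

skills/security-testing/
├── SKILL.md                    ← navigation + core workflow, <500 lines
├── methodologies/
│   └── web-app-testing.md
├── reference/
│   └── reporting-standards.md
├── workflows/
│   └── engagement-kickoff.md
└── templates/
    └── findings-report.md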

Layer 3: Agent Prompts

Size limit: <150 lines per file

Compact agent identity files that specify:

  • Role definition
  • Which SKILL.md to load
  • Context loading order
  • Communication style
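
A compact agent file covering those four points might read like this (the file name matches the loading flow below; the contents are an illustrative sketch, not my actual prompt):

# agents/security.md

Role: security testing specialist for authorized engagements.
Skill: load skills/security-testing/SKILL.md before starting any testing work.
Loading order: root CLAUDE.md, then SKILL.md, then session state if present.
Style: terse, findings-first, severity-tagged.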

The Loading Flow

When a user invokes an agent:

1. Claude Code spawns agent
   → Load agents/security.md (150 lines)

2. Agent loads root context
   → Load CLAUDE.md (250 lines)

3. Agent detects skill needed
   → Load skills/security-testing/SKILL.md (500 lines)

4. Agent checks for session state
   → Load sessions/YYYY-MM-DD-project.md (if exists)

5. Execute with minimal context

What's NOT loaded: every other skill, historical sessions, unused templates, and methodologies for other workflows.
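
The same flow in code, as a minimal Python sketch. The paths and load order mirror the five steps above; the helper itself is mine, not part of any Claude Code API:

from datetime import date
from pathlib import Path

def load_context(agent: str, skill: str, project: str) -> str:
    """Assemble only the layers the current task needs (illustrative helper)."""
    parts = []
    # 1. Agent identity (~150 lines)
    parts.append(Path(f"agents/{agent}.md").read_text())
    # 2. Root navigation (~250 lines)
    parts.append(Path("CLAUDE.md").read_text())
    # 3. The one skill this task needs (~500 lines); every other skill stays on disk
    parts.append(Path(f"skills/{skill}/SKILL.md").read_text())
    # 4. Session state, only if it exists
    session = Path(f"sessions/{date.today():%Y-%m-%d}-{project}.md")
    if session.exists():
        parts.append(session.read_text())
    # 5. Execute with minimal context: nothing else is ever read
    return "\n\n".join(parts)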

Token Efficiency Results

The reduction depends on task complexity:

Task Type         | Before    | After       | Reduction
------------------|-----------|-------------|-----------------------------
Simple navigation | 800 lines | 250 lines   | 69%
Single skill work | 800 lines | 750 lines   | 6% (but better organized)
Multi-skill work  | 800 lines | ~1000 lines | 0% (but loaded sequentially)

The 69% reduction for simple tasks matters most. Not every interaction needs full skill context. Quick questions, file navigation, git operations - these don't need security testing methodologies loaded.

Why This Mirrors RAG Economics

Research shows RAG queries cost 1250x less than pure long-context approaches. The principle applies to system prompts:

Long-context approach:

  • Stuff everything into CLAUDE.md
  • Pay to process 800+ lines every interaction
  • Instructions compete for attention

Hierarchical approach:

  • Store skill details in files (cheap)
  • Load only current skill (efficient)
  • Update one file when patterns change (maintainable)

You're not "losing" information by not loading it. You're preventing it from interfering with the current task.

Implementation Constraints

I enforce these limits with pre-commit hooks:

# Agent files must be <150 lines
# SKILL.md files must be <500 lines
# Root CLAUDE.md must be <250 lines

If a file exceeds its limit, I extract content to subdirectories. The constraint forces progressive disclosure.
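
One way to wire that up is a small script registered as a pre-commit hook. This is a sketch, not the hook I actually run, and it assumes the hook receives staged file paths as arguments (as the pre-commit framework passes them); the limits match the layer budgets above:

#!/usr/bin/env python3
"""Fail the commit when a context file exceeds its layer's line budget (illustrative)."""
import sys
from pathlib import Path

def limit_for(path: Path) -> int | None:
    if path.name == "CLAUDE.md" and len(path.parts) == 1:
        return 250                      # root navigation layer
    if path.parts and path.parts[0] == "agents" and path.suffix == ".md":
        return 150                      # agent identity files
    if path.name == "SKILL.md":
        return 500                      # per-skill context files
    return None                         # everything else is unconstrained

def main() -> int:
    failures = []
    for arg in sys.argv[1:]:            # staged file paths passed in by pre-commit
        path = Path(arg)
        limit = limit_for(path)
        if limit is None or not path.exists():
            continue
        lines = len(path.read_text().splitlines())
        if lines > limit:
            failures.append(f"{path}: {lines} lines (limit {limit})")
    if failures:
        print("\n".join(failures), file=sys.stderr)
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())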

The Routing Decision Tree

When modifying configuration, I ask:

What is being updated?
│
├─ Global preference → Root CLAUDE.md
├─ Agent behavior → agents/[agent].md
├─ Skill workflow → skills/[skill]/SKILL.md
├─ Detailed process → skills/[skill]/methodologies/
└─ Reference data → skills/[skill]/reference/

Context belongs at the most specific level possible. Root-level additions require justification.
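
The same routing, expressed as a lookup. The keys are my own shorthand for the branches above, nothing more:

def route(update: str, skill: str = "", agent: str = "") -> str:
    """Map a configuration change to the most specific file that should hold it."""
    return {
        "global_preference": "CLAUDE.md",
        "agent_behavior":    f"agents/{agent}.md",
        "skill_workflow":    f"skills/{skill}/SKILL.md",
        "detailed_process":  f"skills/{skill}/methodologies/",
        "reference_data":    f"skills/{skill}/reference/",
    }[update]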

What I Learned

Bigger windows don't mean better results. Context rot is real, and more instructions don't mean more of them get followed.

Navigation beats libraries. Root configuration should point to context, not contain it.

Lazy loading works for AI. The programming pattern applies: don't load what you don't need, and unload what you're done with.

Measure by task type. Simple tasks benefit most from hierarchical loading. Complex tasks load more, but sequentially rather than all at once.

The goal isn't minimizing context. It's loading the right context for each interaction. Less loaded often means more followed.


Sources

Context Rot & Performance

Progressive Disclosure & Context Engineering

CLAUDE.md Best Practices

Token Optimization

RAG vs Long Context