Hierarchical Context Loading: Reducing Token Usage 50-70%

Why bigger context windows don't mean better results. A three-layer architecture that loads only what's needed, when it's needed.

My CLAUDE.md file was 800 lines. It contained everything: agent workflows, skill methodologies, tool catalogs, engagement templates, reporting standards. Every session loaded the entire system context.

Then I read Chroma's context rot research. The finding was uncomfortable: LLMs don't process context uniformly. Performance degrades as input length increases, even on simple tasks. My comprehensive configuration was working against me.

The Context Rot Problem

Chroma's July 2025 study tested 18 state-of-the-art models including GPT-4.1, Claude 4, and Gemini 2.5. The results challenged a core assumption:

"LLMs are typically presumed to process context uniformly—that is, the model should handle the 10,000th token just as reliably as the 100th. In practice, this assumption does not hold."

The symptoms are familiar to anyone who's worked with long system prompts:

  • Instructions in the middle get overlooked ("lost-in-the-middle")
  • Hallucinations increase as the model builds on flawed context
  • Random mistakes and refusals appear with longer inputs
  • Claude degrades the most slowly of the tested models, but no model is immune

The math makes it worse. Attention cost scales quadratically with context length: double the input and you roughly quadruple the compute. And you're paying for capacity that doesn't translate into performance.

The Instruction Limit Reality

Here's the number that changed my architecture: frontier LLMs can follow approximately 150-200 instructions with reasonable consistency. Smaller models attend to fewer.

My 800-line CLAUDE.md wasn't a comprehensive configuration. It was noise competing for attention.

Progressive Disclosure as Architecture

Anthropic's context engineering guidance describes progressive disclosure:

"Agents incrementally discover relevant context through exploration. Each interaction yields context that informs the next decision. Agents assemble understanding layer by layer, maintaining only what's necessary in working memory."

This isn't just a RAG technique. It's an architectural principle for system configuration.

The Three-Layer Architecture

I rebuilt my framework with explicit size limits at each layer:

Layer 1: Navigation (Root CLAUDE.md)

Size limit: <250 lines

Contains only:

  • System architecture overview
  • Pointers to agents/ directory
  • Pointers to skills/ directory
  • Global stack preferences
  • Critical rules (not all rules)

Never contains:

  • Agent workflows (→ agents/*.md)
  • Skill methodologies (→ skills/*/SKILL.md)
  • Detailed tool lists (→ skill directories)
  • Engagement data (→ output/)

The root file is a map, not a library.
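
For a concrete sense of the shape, here is a navigation-only root file in miniature. The section names and paths are an illustrative sketch, not lifted from my actual configuration:

# CLAUDE.md (root navigation layer)

## Architecture
Agents live in agents/, skills in skills/, engagement output in output/.

## Skills index
- Security testing → skills/security-testing/SKILL.md
- Writing → skills/writer/SKILL.md
- QA review → skills/qa-review/SKILL.md

## Global stack preferences
One line per preference (language, package manager, test runner).

## Critical rules
Only the rules that apply to every session; everything else moves down a layer.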

Layer 2: Skill Context (SKILL.md files)

Size limit: <500 lines per file

Each skill gets a complete context file:

skills/security-testing/SKILL.md
skills/writer/SKILL.md
skills/qa-review/SKILL.md

SKILL.md acts as navigation for deeper content extracted to:

  • methodologies/ - Process documentation
  • reference/ - Standards, frameworks
  • workflows/ - Step-by-step procedures
  • templates/ - Template files
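
Laid out on disk, a single skill might look like the tree below. The individual file names are hypothetical; only SKILL.md and the four subdirectory names come from the structure above:

skills/security-testing/
├── SKILL.md                    ← navigation + core workflow, <500 lines
├── methodologies/
│   └── web-app-testing.md
├── reference/
│   └── reporting-standards.md
├── workflows/
│   └── engagement-kickoff.md
└── templates/
    └── findings-report.md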

Layer 3: Agent Prompts

Size limit: <150 lines per file

Compact agent identity files that specify:

  • Role definition
  • Which SKILL.md to load
  • Context loading order
  • Communication style
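
A compact agent file covering those four points might read like this (the file name matches the loading flow below; the contents are an illustrative sketch, not my actual prompt):

# agents/security.md

Role: security testing specialist for authorized engagements.
Skill: load skills/security-testing/SKILL.md before starting any testing work.
Loading order: root CLAUDE.md, then SKILL.md, then session state if present.
Style: terse, findings-first, severity-tagged.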

The Loading Flow

When a user invokes an agent:

1. Claude Code spawns agent
   → Load agents/security.md (150 lines)

2. Agent loads root context
   → Load CLAUDE.md (250 lines)

3. Agent detects skill needed
   → Load skills/security-testing/SKILL.md (500 lines)

4. Agent checks for session state
   → Load sessions/YYYY-MM-DD-project.md (if exists)

5. Execute with minimal context

What's NOT loaded: every other skill, historical sessions, unused templates, and methodologies for other workflows.
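
The same flow in code, as a minimal Python sketch. The paths and load order mirror the five steps above; the helper itself is mine, not part of any Claude Code API:

from datetime import date
from pathlib import Path

def load_context(agent: str, skill: str, project: str) -> str:
    """Assemble only the layers the current task needs (illustrative helper)."""
    parts = []
    # 1. Agent identity (~150 lines)
    parts.append(Path(f"agents/{agent}.md").read_text())
    # 2. Root navigation (~250 lines)
    parts.append(Path("CLAUDE.md").read_text())
    # 3. The one skill this task needs (~500 lines); every other skill stays on disk
    parts.append(Path(f"skills/{skill}/SKILL.md").read_text())
    # 4. Session state, only if it exists
    session = Path(f"sessions/{date.today():%Y-%m-%d}-{project}.md")
    if session.exists():
        parts.append(session.read_text())
    # 5. Execute with minimal context: nothing else is ever read
    return "\n\n".join(parts)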

Token Efficiency Results

The reduction depends on task complexity:

Task Type         | Before    | After       | Reduction
------------------|-----------|-------------|-----------------------------
Simple navigation | 800 lines | 250 lines   | 69%
Single skill work | 800 lines | 750 lines   | 6% (but better organized)
Multi-skill work  | 800 lines | ~1000 lines | 0% (but loaded sequentially)

The 69% reduction for simple tasks matters most. Not every interaction needs full skill context. Quick questions, file navigation, git operations - these don't need security testing methodologies loaded.

Why This Mirrors RAG Economics

Research shows RAG queries cost 1250x less than pure long-context approaches. The principle applies to system prompts:

Long-context approach:

  • Stuff everything into CLAUDE.md
  • Pay to process 800+ lines every interaction
  • Instructions compete for attention

Hierarchical approach:

  • Store skill details in files (cheap)
  • Load only current skill (efficient)
  • Update one file when patterns change (maintainable)

You're not "losing" information by not loading it. You're preventing it from interfering with the current task.

Implementation Constraints

I enforce these limits with pre-commit hooks:

# Agent files must be <150 lines
# SKILL.md files must be <500 lines
# Root CLAUDE.md must be <250 lines

If a file exceeds its limit, I extract content to subdirectories. The constraint forces progressive disclosure.
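
One way to wire that up is a small script registered as a pre-commit hook. This is a sketch, not the hook I actually run, and it assumes the hook receives staged file paths as arguments (as the pre-commit framework passes them); the limits match the layer budgets above:

#!/usr/bin/env python3
"""Fail the commit when a context file exceeds its layer's line budget (illustrative)."""
import sys
from pathlib import Path

def limit_for(path: Path) -> int | None:
    if path.name == "CLAUDE.md" and len(path.parts) == 1:
        return 250                      # root navigation layer
    if path.parts and path.parts[0] == "agents" and path.suffix == ".md":
        return 150                      # agent identity files
    if path.name == "SKILL.md":
        return 500                      # per-skill context files
    return None                         # everything else is unconstrained

def main() -> int:
    failures = []
    for arg in sys.argv[1:]:            # staged file paths passed in by pre-commit
        path = Path(arg)
        limit = limit_for(path)
        if limit is None or not path.exists():
            continue
        lines = len(path.read_text().splitlines())
        if lines > limit:
            failures.append(f"{path}: {lines} lines (limit {limit})")
    if failures:
        print("\n".join(failures), file=sys.stderr)
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())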

The Routing Decision Tree

When modifying configuration, I ask:

What is being updated?
│
├─ Global preference → Root CLAUDE.md
├─ Agent behavior → agents/[agent].md
├─ Skill workflow → skills/[skill]/SKILL.md
├─ Detailed process → skills/[skill]/methodologies/
└─ Reference data → skills/[skill]/reference/

Context belongs at the most specific level possible. Root-level additions require justification.
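
The same routing, expressed as a lookup. The keys are my own shorthand for the branches above, nothing more:

def route(update: str, skill: str = "", agent: str = "") -> str:
    """Map a configuration change to the most specific file that should hold it."""
    return {
        "global_preference": "CLAUDE.md",
        "agent_behavior":    f"agents/{agent}.md",
        "skill_workflow":    f"skills/{skill}/SKILL.md",
        "detailed_process":  f"skills/{skill}/methodologies/",
        "reference_data":    f"skills/{skill}/reference/",
    }[update]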

What I Learned

Bigger windows don't mean better results. Context rot is real, and more instructions don't mean more of them get followed.

Navigation beats libraries. Root configuration should point to context, not contain it.

Lazy loading works for AI. The programming pattern applies: don't load what you don't need, and unload what you're done with.

Measure by task type. Simple tasks benefit most from hierarchical loading. Complex tasks load more, but sequentially rather than all at once.

The goal isn't minimizing context. It's loading the right context for each interaction. Less loaded often means more followed.


Sources

Context Rot & Performance

Progressive Disclosure & Context Engineering

CLAUDE.md Best Practices

Token Optimization

RAG vs Long Context