Building a Privacy-First AI Model Ranker with OpenRouter ZDR
Automated model ranking system aggregating benchmarks from four sources with zero data retention filtering.
The Model Selection Problem Nobody Talks About
Which AI model should your security tool use right now?
Not which model topped a leaderboard six months ago. Not which one your favorite newsletter recommended. Right now, with current pricing, current performance, and a guarantee that your prompts -- containing exploit code, client vulnerability data, or attack payloads -- aren't being stored or used for training.
That question turns out to be surprisingly hard to answer. Model benchmarks are scattered across multiple services. LMArena tracks human preference Elo ratings. Artificial Analysis benchmarks coding quality and reasoning. OpenRouter provides live throughput data and tracks which providers actually honor zero data retention policies. Every service uses different model IDs for the same model. And privacy policies vary not just per model, but per provider endpoint.
I got tired of manually cross-referencing tabs. So I built a tool to do it.
What the ZDR Model Ranker Does
The ZDR Model Ranker is an infrastructure tool in the Intelligence Adjacent framework that aggregates benchmark data from four sources, cross-references model identifiers, applies privacy-first filtering, and produces a ranked list of models suitable for security work.
It does not run its own evaluations. It aggregates existing ones. The insight is that the data already exists -- it just lives in different places with different naming conventions and no unified interface.
refresh.ts --> sources.ts (4 parallel fetches)
                    |
               mapper.ts (cross-reference model IDs)
                    |
               scorer.ts (normalize, weight, rank)
                    |
               client.ts (cache read/write, public API)
The pipeline runs once, writes a local cache, and every consumer reads from cache. No surprise network calls during your security workflow.
How to Use the Model Ranker
The tool exposes a simple command-line interface with sensible defaults. Here's what you need to know:
Basic Usage
Default behavior -- refresh rankings and show top 15 models:
/model-ranker
This fetches fresh data from all sources, applies filters, ranks models using the default security-interactive profile, and displays the top 15 results. The rankings are cached locally for 7 days.
Available Flags
Show a specific number of models:
/model-ranker --top 5 # Show only top 5 models
/model-ranker --top 10 # Show only top 10 models
Check cache status without refreshing:
/model-ranker --status
This reads from the local cache without making any API calls. Useful for checking if your cache is still fresh or needs a refresh.
Use a different weight profile:
/model-ranker --profile batch-scanning # Cost-optimized ranking
/model-ranker --profile deep-analysis # Reasoning-first ranking
Weight profiles change the scoring algorithm to prioritize different dimensions. More on profiles below.
Quiet mode for cron jobs:
/model-ranker --quiet # Suppress verbose output, errors only
Designed for automated weekly refreshes. Add this to cron:
0 2 * * 0 cd /path/to/ia-framework && /model-ranker --quiet
What Happens Under the Hood
When you run /model-ranker without --status:
1. Fetches current data from OpenRouter ZDR endpoints, the model catalog, Artificial Analysis, and LMArena
2. Cross-references model IDs across all sources
3. Applies privacy filters (ZDR compliance, US-only policy, Claude exclusion)
4. Normalizes and scores all dimensions
5. Ranks models using the specified weight profile
6. Writes the cache to tools/model-ranker/cache/zdr-rankings.json
7. Displays the top N models
With --status, it skips steps 1-6 and just reads the existing cache.
Four Data Sources, Fetched in Parallel
The ranker pulls from four APIs simultaneously using Promise.all():
1. OpenRouter ZDR Endpoints -- The core data source. OpenRouter's ZDR endpoint returns every provider endpoint that guarantees zero data retention, including live performance data updated every 30 minutes: latency percentiles (p50/p75/p90/p99 in milliseconds) and throughput percentiles (p50/p75/p90/p99 in tokens/sec). This is real traffic data, not a synthetic benchmark.
2. OpenRouter Model Catalog -- Pricing (per-token prompt and completion costs), context window sizes, architecture details, and modality support for every model on the platform.
3. Artificial Analysis Benchmarks -- Independent quality scores including a coding index (programming problem solving), an intelligence index (composite of evaluations covering math, science, coding, reasoning), and a speed index (tokens per second).
4. LMArena Human Preference Ratings -- Crowdsourced Elo ratings from Arena (formerly LMSYS Chatbot Arena), backed by millions of user votes. The ranker pulls from nakasyou/lmarena-history, a community project that converts the daily Pickle snapshots into JSON. Both overall and coding-specific Elo ratings are extracted.
Each fetcher handles its own errors independently. If Artificial Analysis is down or you don't have an API key, the ranker continues with the remaining sources and notes the gap in the cache metadata:
export async function fetchAllSources(): Promise<AllSourceData> {
  // Each fetcher catches its own errors and resolves with partial or empty
  // data, so one failed source never rejects the combined Promise.all.
  const [zdr, models, aa, lmarena] = await Promise.all([
    fetchZDREndpoints(),
    fetchOpenRouterModels(),
    fetchArtificialAnalysis(),
    fetchLMArena(),
  ]);
  return { zdr, models, aa, lmarena };
}
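The independent error handling described above can be sketched as a small wrapper (a hypothetical helper, not the tool's actual code): each fetcher resolves to null on failure instead of rejecting, so one unavailable source degrades gracefully and the gap can be recorded in cache metadata.

```typescript
// Hypothetical sketch: wrap each source fetch so a failure resolves to
// null instead of rejecting the combined Promise.all.
async function safeFetch<T>(
  label: string,
  fn: () => Promise<T>
): Promise<T | null> {
  try {
    return await fn();
  } catch {
    console.warn(`[model-ranker] ${label} unavailable, continuing without it`);
    return null;
  }
}

// A failed Artificial Analysis fetch degrades to null rather than
// taking down the whole refresh.
const aa = await safeFetch('artificial-analysis', async () => {
  throw new Error('AA API key not configured');
});
// aa === null; the other sources still resolve normally.
```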
The Cross-Referencing Problem
Here is where it gets interesting. LMArena calls a model chatgpt-4o-latest. OpenRouter calls it openai/gpt-4o. Artificial Analysis might call it GPT-4o. Same model, three different names.
The mapper solves this with a three-phase approach:
Phase 1: Canonical Key Normalization. Strip the provider prefix, lowercase everything, normalize separators (dots, underscores, spaces become hyphens), and remove common suffixes like :free or date stamps.
export function toCanonicalKey(modelId: string): string {
  let key = modelId.toLowerCase();
  // Strip provider prefix
  const slashIndex = key.indexOf('/');
  if (slashIndex !== -1) key = key.substring(slashIndex + 1);
  // Strip suffixes, normalize separators
  key = key.replace(/:free$/, '').replace(/:beta$/, '');
  key = key.replace(/[\._\s]+/g, '-').replace(/-+/g, '-');
  return key.replace(/^-+|-+$/g, '');
}

// "anthropic/claude-sonnet-4.5" -> "claude-sonnet-4-5"
// "Google Gemini 2.0 Flash"     -> "google-gemini-2-0-flash"
// "gpt-4.1-mini:free"           -> "gpt-4-1-mini"
Phase 2: Alias Table. A static lookup for known mismatches that normalization alone can't resolve. Manually maintained -- when the fuzzy matcher logs a warning, I add the mapping here:
const ALIAS_TABLE: Record<string, string> = {
  'chatgpt-4o-latest': 'gpt-4o',
  'grok-3-preview': 'grok-3',
  'gemini-2-0-flash-exp': 'gemini-2-0-flash-001',
  'gemini-2-5-pro-preview': 'gemini-2-5-pro-preview-05-06',
  // ... more entries
};
Phase 3: Levenshtein Fuzzy Fallback. For keys longer than 12 characters, compute the Levenshtein distance against all known OpenRouter canonical keys. Accept matches with distance at most 2 and a length ratio above 0.7 (to prevent gemma-2-2b from matching gemma-2-27b). Fuzzy matches are logged as warnings for manual review:
Mapper fuzzy matches (review for alias table):
lmarena: "gemini-2-5-flash-preview-04-17" -> "google/gemini-2.5-flash-preview-05-20" (distance: 2)
This three-phase approach handles the vast majority of cross-referencing automatically while surfacing edge cases for human review.
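The Phase 3 matching rule can be sketched roughly as follows. The distance-2 threshold, the 12-character minimum, and the 0.7 length ratio come from the text; the function itself is illustrative, not the tool's actual implementation.

```typescript
// Classic dynamic-programming Levenshtein edit distance.
function levenshtein(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) =>
      i === 0 ? j : j === 0 ? i : 0
    )
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// Acceptance rule from the text: keys longer than 12 characters,
// distance at most 2, and a length ratio above 0.7.
function fuzzyMatch(key: string, candidates: string[]): string | null {
  if (key.length <= 12) return null;
  let best: string | null = null;
  let bestDist = 3; // anything with distance >= 3 is rejected
  for (const c of candidates) {
    const ratio =
      Math.min(key.length, c.length) / Math.max(key.length, c.length);
    if (ratio <= 0.7) continue;
    const d = levenshtein(key, c);
    if (d < bestDist) {
      bestDist = d;
      best = c;
    }
  }
  return best;
}
```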
Privacy-First Filter Pipeline
Before scoring, every model passes through three mandatory filters:
1. ZDR Compliance. The model must appear in OpenRouter's ZDR endpoint list. If a provider hasn't committed to zero data retention for that endpoint, the model is excluded. OpenRouter takes a conservative stance: if they can't establish a clear policy for a provider, they assume it retains and trains on data.
2. US-Only Provider Policy. For security work involving US-based client data, the ranker blocks providers with data handling jurisdictions that may conflict with client requirements: deepseek/, qwen/, mistralai/, 01-ai/, alibaba/. This is an opinionated filter. You could argue it's too aggressive. For handling client vulnerability data, I'd rather be too conservative.
3. Claude Exclusion. Claude models are excluded from the ranking because the IA framework uses the Anthropic native SDK for Claude, never routing through OpenRouter. Including Claude in the rankings would be misleading since the framework never actually uses it via OpenRouter.
The pipeline narrows the field significantly. From the actual cache data: the ZDR endpoint returned 568 provider endpoints covering 342 models. After deduplication and applying all three filters, 159 models remain eligible for ranking.
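The three filters reduce to a simple predicate. This is a minimal sketch assuming a simplified model shape -- the field names (`hasZDREndpoint`) and helper are illustrative, not the tool's actual code; the blocked prefixes are the ones listed above.

```typescript
interface CandidateModel {
  id: string; // e.g. "openai/gpt-4.1"
  hasZDREndpoint: boolean;
}

// Blocked provider prefixes from the US-only policy described above.
const BLOCKED_PREFIXES = ['deepseek/', 'qwen/', 'mistralai/', '01-ai/', 'alibaba/'];

function isEligible(model: CandidateModel): boolean {
  if (!model.hasZDREndpoint) return false; // filter 1: ZDR compliance
  if (BLOCKED_PREFIXES.some(p => model.id.startsWith(p))) return false; // filter 2: US-only
  if (model.id.startsWith('anthropic/')) return false; // filter 3: Claude exclusion
  return true;
}
```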
Scoring: Normalization, Imputation, and Weighted Composites
Each eligible model gets scored across five dimensions:
| Dimension | Primary Source | Fallback |
|---|---|---|
| Coding Quality | Artificial Analysis Coding Index | median imputation |
| Reasoning Quality | LMArena Elo | AA Intelligence Index |
| Cost Efficiency | OpenRouter pricing (prompt + completion average) | median imputation |
| Speed | ZDR endpoint live throughput (p50 tokens/sec) | AA Speed Index |
| Context Window | OpenRouter context_length | median imputation |
Why Median Imputation Matters
Not every model has data from every source. A model might appear on OpenRouter's ZDR list and have LMArena ratings but lack Artificial Analysis benchmarks. The naive approach -- using zero for missing data -- would systematically penalize these models. A model missing its coding index isn't necessarily bad at coding; it just hasn't been benchmarked there yet.
Instead, the ranker uses median imputation: missing values are replaced with the median of all eligible models for that dimension. This is a standard practice in machine learning preprocessing. The dataCompleteness field on each ranked model tracks what fraction of scoring dimensions had real data versus imputed values, so consumers can factor that into their decisions.
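Median imputation for one dimension looks roughly like this (an illustrative sketch, not the ranker's actual code): missing entries stay in place but receive the median of the observed values.

```typescript
// Replace missing (undefined) values in one scoring dimension with the
// median of the values that do exist, preserving original order.
function imputeMedian(values: (number | undefined)[]): number[] {
  const present = values
    .filter((v): v is number => v !== undefined)
    .sort((a, b) => a - b);
  if (present.length === 0) return values.map(() => 0); // no data at all
  const mid = present.length / 2;
  const median =
    present.length % 2 === 1
      ? present[Math.floor(mid)]
      : (present[mid - 1] + present[mid]) / 2;
  return values.map(v => v ?? median);
}
```

A model with no coding index thus lands in the middle of the pack for that dimension instead of at the bottom, which is exactly the behavior the text argues for.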
Min-Max Normalization
All dimensions are normalized to a 0-1 range using min-max scaling. Cost is inverted (lower cost = higher score). If all values in a dimension are equal, every model gets 0.5 to avoid division-by-zero edge cases.
function minMaxNormalize(values: number[]): number[] {
  const min = Math.min(...values);
  const max = Math.max(...values);
  const range = max - min;
  if (range === 0) return values.map(() => 0.5);
  return values.map(v => (v - min) / range);
}
Weight Profiles Encode Workflow Knowledge
The composite score is a weighted sum across all five dimensions. Three profiles are predefined, each encoding practical knowledge about what matters in different security workflows:
- security-interactive -- Coding 0.35, Reasoning 0.25, Cost 0.20, Speed 0.10, Context 0.10
- batch-scanning -- Coding 0.20, Reasoning 0.15, Cost 0.40, Speed 0.15, Context 0.10
- deep-analysis -- Coding 0.25, Reasoning 0.40, Cost 0.10, Speed 0.05, Context 0.20
security-interactive is the default. Coding quality leads because interactive security work -- writing exploits, analyzing code, reviewing configurations -- is fundamentally a coding task. Reasoning is second because you need the model to understand attack chains and security context. Cost matters but doesn't dominate because interactive sessions are low-volume.
batch-scanning flips the priority. When you're running automated scans across many targets, cost dominates. You might make thousands of API calls in a single engagement. Speed matters more here too -- you're waiting for batch results, not having a conversation.
deep-analysis prioritizes reasoning above all else. Complex vulnerability analysis, threat modeling, and report writing need a model that can hold long context and think through multi-step problems. Cost is deprioritized because these are high-value, low-frequency tasks.
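The composite score itself is just a dot product of the profile weights with the normalized dimension scores. A minimal sketch, using the weights listed above (the type names are illustrative, and cost is assumed to be pre-inverted so that higher means cheaper):

```typescript
type Dimension = 'coding' | 'reasoning' | 'cost' | 'speed' | 'context';
type Profile = Record<Dimension, number>;

// Weights from the text; each profile sums to 1.0.
const PROFILES: Record<string, Profile> = {
  'security-interactive': { coding: 0.35, reasoning: 0.25, cost: 0.20, speed: 0.10, context: 0.10 },
  'batch-scanning':       { coding: 0.20, reasoning: 0.15, cost: 0.40, speed: 0.15, context: 0.10 },
  'deep-analysis':        { coding: 0.25, reasoning: 0.40, cost: 0.10, speed: 0.05, context: 0.20 },
};

// Composite = weighted sum of normalized 0-1 dimension scores.
function compositeScore(
  scores: Record<Dimension, number>,
  profileName: string
): number {
  const weights = PROFILES[profileName];
  return (Object.keys(weights) as Dimension[]).reduce(
    (sum, dim) => sum + weights[dim] * scores[dim],
    0
  );
}
```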
Cache-First: No Surprise Latency
getTopModels() never makes a network call. This is a deliberate architectural decision.
export async function getTopModels(
  options: GetTopModelsOptions = {}
): Promise<RankedModel[]> {
  const { count = 15, profile = 'security-interactive' } = options;
  const cache = await readCache();
  if (!cache) throw new Error('Rankings cache not found. Run refresh.');
  if (!isCacheFresh(cache)) throw new Error('Rankings cache expired.');
  const rankings = cache.profileRankings[profile] || cache.rankings;
  return rankings.slice(0, count);
}
When a security workflow calls getTopModels() to decide which model to use for the next API call, the last thing you want is a cascading network dependency. The cache-first pattern means consumers always get a deterministic, fast response from local disk.
The cache is refreshed by running refreshRankings() separately -- as a weekly cron job, a manual CLI invocation, or a pre-engagement setup step. The cache file has a 7-day TTL (time to live). If it expires, getTopModels() throws rather than silently returning stale data.
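The freshness check is a straightforward timestamp comparison. A minimal sketch, assuming the cache records an epoch-millisecond `generatedAt` field (the field name is illustrative):

```typescript
const CACHE_TTL_MS = 7 * 24 * 60 * 60 * 1000; // 7-day TTL from the text

interface RankingsCacheMeta {
  generatedAt: number; // epoch ms; field name is an assumption
}

// Fresh iff the cache was written within the TTL window. `now` is
// injectable for testing.
function isCacheFresh(cache: RankingsCacheMeta, now: number = Date.now()): boolean {
  return now - cache.generatedAt < CACHE_TTL_MS;
}
```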
Enforcing ZDR in the OpenRouter Client
The ranker tells you which models are safe to use. But what about the actual API calls? I also added an enforceZDR option to the IA framework's OpenRouter client that automatically injects ZDR parameters into every request:
const client = new OpenRouterClient({
  enforceZDR: true, // All requests go through ZDR endpoints
});

// Under the hood, every request gets:
//   provider: { zdr: true, data_collection: "deny" }
When enforceZDR is enabled, the client adds provider.zdr: true and data_collection: "deny" to every outgoing request. Combined with OpenRouter's fail-safe behavior -- requests fail rather than routing to non-ZDR endpoints -- this creates a defense-in-depth approach. The ranker selects ZDR-compliant models, and the client ensures the routing layer enforces ZDR at the API level.
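The injection step can be sketched as a small request transformer. This is a hypothetical illustration of the mechanism, not the client's actual code; the `provider.zdr` and `data_collection` fields follow what the text describes.

```typescript
interface ChatRequest {
  model: string;
  messages: { role: string; content: string }[];
  provider?: Record<string, unknown>;
}

// Merge ZDR constraints into every outgoing request body, preserving
// any provider preferences the caller already set.
function withZDR(request: ChatRequest): ChatRequest {
  return {
    ...request,
    provider: {
      ...request.provider,
      zdr: true,               // route only to zero-data-retention endpoints
      data_collection: 'deny', // exclude providers that retain or train on data
    },
  };
}
```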
What the Output Looks Like
Here's actual output from the ranker using the security-interactive profile:
ZDR Model Rankings (security-interactive)
Sources: openrouter-zdr (568), openrouter-models (342), lmarena (243)
Filters: ZDR required, blocked prefixes applied, Claude excluded
Before filter: 342 models -> After filter: 159 models
Top 10:
1. google/gemini-2.5-pro-preview-05-06 score: 70.1% data: 60%
2. openai/gpt-4.1 score: 65.7% data: 80%
3. openai/gpt-4.1-mini score: 64.8% data: 80%
4. openai/gpt-4.1-nano score: 62.2% data: 80%
5. morph/morph-v3-fast score: 60.6% data: 60%
6. meta-llama/llama-4-maverick score: 60.4% data: 60%
7. google/gemini-2.5-flash-lite score: 60.3% data: 60%
8. openai/o4-mini score: 59.9% data: 80%
9. google/gemini-2.0-flash-001 score: 59.6% data: 80%
10. x-ai/grok-3 score: 58.8% data: 80%
These are recognizable, high-quality models that passed all privacy filters. GPT-4.1 and Gemini 2.5 Pro lead the pack, with strong data completeness from having LMArena Elo ratings. Models with only 60% data completeness are typically missing Artificial Analysis benchmarks (the optional API key wasn't configured for this run).
Programmatic Usage
The ranker exposes a clean TypeScript API for integration into other framework tools:
import { getTopModels, refreshRankings, getCacheStatus } from './client';

// Get top 5 ZDR-compliant models for interactive security work
const top5 = await getTopModels({ count: 5 });
console.log(top5[0].id); // "google/gemini-2.5-pro-preview-05-06"
console.log(top5[0].raw.zdrThroughputP50); // 54 tokens/sec (live)

// Switch profile for batch scanning
const batchModels = await getTopModels({
  count: 3,
  profile: 'batch-scanning',
});

// Check cache health
const status = await getCacheStatus();
// { exists: true, fresh: true, modelCount: 159 }
What I Learned Building This
The hardest part isn't scoring -- it's cross-referencing. I spent more time on the model ID mapper than on the entire scoring engine. Every benchmark service names models differently, and the naming conventions shift over time. The alias table is a living document that grows with each refresh cycle.
Live throughput data changes the game. Before building this, I would have relied entirely on Artificial Analysis speed benchmarks. But the ZDR endpoint's live throughput data -- measured from actual API traffic in the last 30 minutes -- is a fundamentally better signal for production model selection. A model might benchmark fast in a controlled environment but have congested endpoints in practice.
Opinionated filters save time. The US-only policy and Claude exclusion are opinions baked into the tool. They eliminate entire categories of models that practitioners would otherwise need to evaluate individually. When you're preparing for an engagement, you want to pick a model and start working -- not research data handling policies for every provider.
Try It Yourself
The ZDR Model Ranker is open source as part of the IA framework:
# Clone the framework
git clone https://github.com/intelligence-adjacent/ia-framework.git
# Set your API key
echo "OPENROUTER_API_KEY=your-key-here" >> .env
# Run your first ranking refresh
/model-ranker
# Check different profiles
/model-ranker --profile batch-scanning
/model-ranker --profile deep-analysis
# Verify your cache
/model-ranker --status
# Optional: add Artificial Analysis for richer benchmarks
echo "ARTIFICIAL_ANALYSIS_API_KEY=your-key-here" >> .env
The tool requires only an OpenRouter API key. Artificial Analysis is optional (the ranker degrades gracefully without it). No other dependencies beyond the Bun runtime.
Found This Helpful?
I write about building security tools, AI frameworks, and the engineering decisions behind them. Subscribe for free to get methodology guides. Paid members get implementation deep dives and early access to new tools.
Sources
OpenRouter Documentation
- Zero Data Retention (ZDR) - ZDR feature documentation and per-request enforcement
- ZDR Endpoints API Reference - Live endpoint performance data API
- Data Collection Policies - Provider data handling and conservative defaults
- Provider Routing - Multi-provider routing with privacy filtering
- Logging and Privacy - OpenRouter's own data retention policies
- Implicit Caching and ZDR - Ephemeral caching stance
Benchmark Sources
- LMArena (Arena) - Crowdsourced Elo ratings with millions of user votes
- Arena Leaderboard - Live leaderboard (rebranded January 2026)
- nakasyou/lmarena-history - Daily JSON snapshots of LMArena scores
- Artificial Analysis - Independent coding, intelligence, and speed benchmarks
- Artificial Analysis Methodology - Benchmarking methodology and confidence intervals
- Artificial Analysis Coding Index - Programming problem solving evaluation
Technical References
- Levenshtein Distance - Edit distance metric used in fuzzy matching
- scikit-learn Imputation - Median imputation for missing data
- Normalization in Machine Learning - Min-max scaling methodology
- API Caching Architecture - Cache-first design patterns
Industry Context
- State of AI Agent Security 2026 - Adoption outpacing security controls
- OWASP AI Testing Guide - AI trustworthiness testing methodology