Model Intelligence (MI) Metric

Calibrated from MRCR v2 8-needle long-context retrieval benchmark data

Overview

Model Intelligence (MI) estimates how well an LLM is likely to perform at its current context fill level. The benchmark data in the next section shows retrieval accuracy falling as context grows, at a rate that differs by model family. MI condenses this into a continuous score in [0, 1] that tells users when to start a new session.

Benchmark Evidence

The MRCR v2 8-needle benchmark measures retrieval accuracy across context lengths:

Model	256K accuracy	1M accuracy	Relative drop
Opus 4.6	91.9%	78.3%	~14.8%
Sonnet 4.6	90.6%	65.1%	~28.1%
Haiku 4.5	(estimated)	(estimated)	~57%

Key insight: Even the best model loses accuracy with context length, but the rate varies dramatically per model family.

Formula

MI(u) = max(0, 1 - u^β)

Where:

u = current_used_tokens / context_window_size (utilization ratio, 0 to 1)
β (beta) = curve shape, controls where degradation steepens (model-specific)
All models drop from 1.0 to 0.0 — beta controls when the drop happens
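
A minimal Python sketch of the formula above. The function name `mi_score` and the clamping of `u` into [0, 1] are assumptions made for illustration, not taken from the project's code.

```python
def mi_score(u: float, beta: float) -> float:
    """Return MI(u) = max(0, 1 - u**beta) for a utilization ratio u."""
    # Assumption: clamp u into [0, 1] so out-of-range input stays well defined.
    u = min(max(u, 0.0), 1.0)
    return max(0.0, 1.0 - u ** beta)


print(round(mi_score(0.50, 1.5), 3))  # beta=1.5 at 50% utilization -> 0.646
```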

Per-Model Profiles

Calibrated from MRCR v2 benchmark data:

Model Family	β (beta)	MI at 25%	MI at 50%	MI at 75%
opus	1.8	0.918	0.713	0.404
sonnet	1.5	0.875	0.646	0.350
haiku	1.2	0.811	0.565	0.292
default	1.5	0.875	0.646	0.350

Model matching: The model_id string is checked for "opus", "sonnet", or "haiku" (case-insensitive). Unknown models fall back to the default (sonnet) profile.
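
The matching rule might look roughly like the sketch below. `MODEL_PROFILES` and `get_model_profile` are names this document lists under Sync Points, but the dict layout, the plain-float return value, and the order in which families are checked are assumptions; the real profile object may carry more fields.

```python
# Beta values copied from the profile table; the dict layout is an assumption.
MODEL_PROFILES = {"opus": 1.8, "sonnet": 1.5, "haiku": 1.2}
DEFAULT_BETA = 1.5  # default profile, same beta as sonnet


def get_model_profile(model_id: str) -> float:
    """Return the beta of the first family name found in model_id."""
    lowered = model_id.lower()  # case-insensitive substring match
    for family, beta in MODEL_PROFILES.items():
        if family in lowered:
            return beta
    return DEFAULT_BETA  # unknown model: fall back to the default profile


print(get_model_profile("Claude-Opus-4.6"))  # 1.8
```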

Why β?

β > 1 makes the curve concave: quality stays high initially, then drops faster as context fills
Higher β = quality retained longer (Opus has β=1.8, Haiku has β=1.2)
All models reach MI=0.0 at full context, but Opus stays high longer before dropping
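
The shape claims above follow from the derivatives of the formula. A short derivation for 0 < u < 1 and β > 1:

```latex
\frac{d\,\mathrm{MI}}{du} = -\beta\,u^{\beta-1} < 0,
\qquad
\frac{d^{2}\,\mathrm{MI}}{du^{2}} = -\beta(\beta-1)\,u^{\beta-2} < 0
```

The slope is 0 at u = 0 and steepens to -β at u = 1, so the score is nearly flat early and falls fastest near full context. For any fixed u in (0, 1), u^β shrinks as β grows, so a larger β yields a higher MI at every utilization level.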

Example Calculations

Opus at 50% context (β=1.8):

MI = max(0, 1 - 0.50^1.8) = 1 - 0.287 = 0.713

Sonnet at 50% context (β=1.5):

MI = max(0, 1 - 0.50^1.5) = 1 - 0.354 = 0.646

Haiku at 50% context (β=1.2):

MI = max(0, 1 - 0.50^1.2) = 1 - 0.435 = 0.565
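
The same arithmetic can be checked for all three utilization levels in the profile table with a short script. The helper name `mi_table` is hypothetical; the beta values are the ones listed under Per-Model Profiles.

```python
BETAS = {"opus": 1.8, "sonnet": 1.5, "haiku": 1.2}


def mi_table(levels=(0.25, 0.50, 0.75)):
    """Return {family: [MI at each utilization level]}, rounded to 3 places."""
    return {
        family: [round(max(0.0, 1.0 - u ** beta), 3) for u in levels]
        for family, beta in BETAS.items()
    }


for family, row in mi_table().items():
    print(family, row)
```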

Color Thresholds

MI Range	Color	Label	Interpretation
> 0.70	Green	Operating well	Minimal degradation
0.40–0.70	Yellow	Degrading	Consider wrapping up
< 0.40	Red	Significant	Start a new session

Implication: Solving 1 - u^β = 0.70 for u gives u = 0.30^(1/β), so Opus enters yellow at about 51% utilization, Sonnet at about 45%, and Haiku at about 37%. MI values are displayed with 3 decimal places (e.g., MI:0.995) for precision at low utilization.
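
A sketch of the color mapping, plus the inverse of the formula for finding the utilization at which a profile crosses a threshold. `get_mi_color` is a name this document lists under Sync Points, but its body here is an assumption, including how scores of exactly 0.70 or 0.40 are handled; `utilization_at` is a hypothetical helper.

```python
def get_mi_color(mi: float) -> str:
    """Map an MI score to the color bands in the table above."""
    # Assumption: a score of exactly 0.70 or 0.40 falls in the yellow band.
    if mi > 0.70:
        return "green"
    if mi >= 0.40:
        return "yellow"
    return "red"


def utilization_at(mi: float, beta: float) -> float:
    """Invert MI = 1 - u**beta to get the utilization where MI equals mi."""
    return (1.0 - mi) ** (1.0 / beta)


for family, beta in {"opus": 1.8, "sonnet": 1.5, "haiku": 1.2}.items():
    yellow_at = round(utilization_at(0.70, beta), 2)
    red_at = round(utilization_at(0.40, beta), 2)
    print(family, yellow_at, red_at)
```

The loop prints, for each profile, the utilization at which the score leaves green and the utilization at which it enters red.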

Zone Indicators

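Zone indicators provide an at-a-glance signal for session state, displayed alongside the MI score. The zones use model-size-aware thresholds: models with a context window of 500k tokens or more ("1M models") use absolute token thresholds, while standard models use thresholds derived from the window size. For a standard 200k window, for example, the Plan zone ends at 40% - 30k = 50k tokens.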

Five States

Zone	Indicator	Color	Meaning	1M model (>= 500k ctx)	Standard model (< 500k ctx)
Planning	Plan	Green	Safe to plan and code	< 70k tokens used	< (40% - 30k tokens)
Code-only	Code	Yellow	Avoid starting new plans	70k–100k tokens	(40% - 30k) to 40%
Dump zone	Dump	Orange	Quality declining, finish up	100k–250k tokens	40%–70% utilization
Hard limit	ExDump	Dark red	Start a new session	250k–275k tokens	70%–75% utilization
Dead zone	Dead	Light gray	Nothing productive here	>= 275k tokens	>= 75% utilization
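
The table can be read as a chain of threshold checks. The sketch below is one possible reading, not the project's implementation: `get_context_zone` is a name this document lists under Sync Points, while the signature, the constant `ONE_M_MIN_WINDOW`, and the choice to include each zone's lower bound and exclude its upper bound are assumptions.

```python
ONE_M_MIN_WINDOW = 500_000  # windows at or above this use absolute thresholds


def get_context_zone(used_tokens: int, window: int) -> str:
    """Return the zone label for a token count, following the table above."""
    if window >= ONE_M_MIN_WINDOW:
        # Absolute token thresholds for 1M-class models.
        plan, code, dump, hard = 70_000, 100_000, 250_000, 275_000
    else:
        # Thresholds derived from the window size for standard models.
        plan = 0.40 * window - 30_000
        code, dump, hard = 0.40 * window, 0.70 * window, 0.75 * window
    # Assumption: each zone includes its lower bound and excludes its upper bound.
    if used_tokens < plan:
        return "Plan"
    if used_tokens < code:
        return "Code"
    if used_tokens < dump:
        return "Dump"
    if used_tokens < hard:
        return "ExDump"
    return "Dead"


print(get_context_zone(150_000, 1_000_000))  # Dump
print(get_context_zone(60_000, 200_000))     # Code
```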

Design Rationale

The dump zone is graduated, not a cliff. When users enter Dump (orange), model quality is declining but they can still finish up current work. ExDump (dark red) is the clear signal to start a new session. Dead (light gray) communicates "past the point of usefulness" without alarm.

The 100k dump zone limit for 1M models comes from Matt Pocock (@mattpocockuk). The 40% threshold for standard models was validated by Dex.

Why Model-Size-Aware Thresholds?

A single 40% threshold doesn't work for 1M context models: 40% of 1M is 400k tokens, but empirical evidence shows quality degrades much earlier. The absolute token thresholds (70k/100k/250k/275k) reflect real-world dump zone behavior observed in 1M context sessions.

Example Statusline Output

Claude Opus 4.6 | myproject | main | 850,000 (85.0%) | MI:0.713 Plan

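The zone indicator appears after the MI score, colored according to the zone. The values shown illustrate the layout only: MI:0.713 corresponds to Opus at 50% utilization, which lies beyond the Plan zone for either window class.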

Design Rationale

Why not CPS + ES + PS?

The previous MI formula was MI = 0.60×CPS + 0.25×ES + 0.15×PS, which produced zig-zag charts because:

ES (cache efficiency) fluctuated per turn based on API caching behavior
PS (productivity) swung wildly between planning responses (low) and code generation (high)
At low utilization, CPS barely moved (≈0.99), so ES/PS noise dominated the visual

These components measured model activity, not model intelligence. The new formula uses only context utilization, the one signal that the benchmark shows to correlate with quality degradation.

Key Properties

Guaranteed monotonic decrease — MI is a pure function of utilization
No per-turn noise — no zig-zag in charts
Model-aware — Opus degrades gently, Haiku more aggressively
No previous entry needed — reduces file I/O when delta display is disabled
Benchmark-grounded — calibrated from measured retrieval accuracy

Configuration

# Model Intelligence (MI) score display
show_mi=false

# Override model-specific beta with a custom value
# Set to 0 (default) to use the model's built-in profile
# mi_curve_beta=0

The mi_curve_beta setting overrides the model profile's beta. Set it to a positive value to use one custom curve shape for all models; leave it at 0 to keep the per-model profiles.
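
The override rule can be sketched as a single selection step. The helper name `effective_beta` is hypothetical, and treating negative values the same as 0 is an assumption; the document only specifies 0 and positive values.

```python
def effective_beta(mi_curve_beta: float, profile_beta: float) -> float:
    """Pick the configured beta when positive, else the model profile's beta."""
    # 0 is the documented default meaning "no override".
    # Assumption: negative values are also treated as "no override".
    return mi_curve_beta if mi_curve_beta > 0 else profile_beta


print(effective_beta(0, 1.8))    # 1.8, the profile value is kept
print(effective_beta(2.0, 1.8))  # 2.0, the custom value wins
```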

Guard Clause

If context_window_size == 0 (malformed data), MI returns 1.0 with utilization 0.0.
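
A sketch of how the guard might sit in front of the formula. `compute_mi` is the name this document gives for the standalone script's MI function, but the signature, the `(mi, utilization)` tuple shape, and the clamping of utilization are assumptions.

```python
def compute_mi(used_tokens: int, context_window_size: int, beta: float):
    """Return (mi, utilization); the tuple shape is an assumption."""
    if context_window_size == 0:
        # Malformed data: report a fresh session instead of dividing by zero.
        return 1.0, 0.0
    u = used_tokens / context_window_size
    # Assumption: clamp into [0, 1]; a negative base with a fractional
    # exponent would otherwise produce a complex number in Python.
    u = min(max(u, 0.0), 1.0)
    return max(0.0, 1.0 - u ** beta), u


print(compute_mi(100, 0, 1.5))  # (1.0, 0.0)
```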

Sync Points: Package vs Standalone Script

The MI formula and zone logic are duplicated between the package and the standalone script and must be kept in sync:

Logic	Package (`src/`)	Standalone Python (`scripts/statusline.py`)
MODEL_PROFILES	`intelligence.py`	`statusline.py`
get_model_profile	`intelligence.py`	`statusline.py`
MI formula	`calculate_context_pressure()`	`compute_mi()`
Color thresholds	`get_mi_color()`	`get_mi_color()`
Zone indicator	`get_context_zone()`	`get_context_zone()`
Zone constants	`ZONE_1M_`, `ZONE_STD_`	`ZONE_1M_`, `ZONE_STD_`
