Find the right LLM models for your AI agents.
A simple model swap can cut your agent's costs by 10–100x without sacrificing performance.
AgentOpt is supported by DAPLab at Columbia University.
[2026/04] Version 0.1.0 released.
Framework-agnostic by construction. AgentOpt intercepts LLM calls at the one place every SDK eventually goes through — the outbound HTTP request — so it works the same with anything that ships an LLM call over the wire. No framework adapters, no plugin per provider, no wrapping your client. In-process Python frameworks (LangChain, LangGraph, CrewAI, LlamaIndex, AG2, OpenAI Agents SDK, plain openai/anthropic) attach through an httpx patch; subprocess and CLI agents (Claude Code, Gemini CLI, OpenHarness, Terminal Bench, OpenClaw) attach through HTTPS_PROXY with a local CA. The same code works on both — and on anything custom you write tomorrow.
On top of that one primitive, AgentOpt gives you three capabilities that share the same proxy, the same record schema, and the same cache:
- Selection — search a combinatorial model space to find the best fixed combination for an agent.
- Routing — swap models per call at runtime based on prompt, history, or any policy you write.
- Tracking — just record token usage, latency, and per-query cost across an agent run.
The combinatorial search problem is real: 8 candidate models at each of 3 steps gives 8^3 = 512 combinations to evaluate. AgentOpt's selection algorithms (arm elimination, LUCB, Bayesian) home in on the best combination at a fraction of the brute-force cost, and the routing API lets you keep refining at runtime once you've shipped.
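To make the count concrete: the search space is the Cartesian product of the per-step candidate lists. A minimal sketch, using placeholder step and model names that are assumptions for illustration and not part of AgentOpt:

```python
from itertools import product

# Placeholder names for illustration only.
steps = ["plan", "retrieve", "answer"]
candidates = [f"model-{i}" for i in range(8)]

# One combination assigns exactly one candidate model to every step.
combos = [dict(zip(steps, choice)) for choice in product(candidates, repeat=len(steps))]

print(len(combos))  # 512, i.e. 8 ** 3
```

Adding a fourth step with the same eight candidates would multiply the count by eight again, which is why exhaustive evaluation stops being practical quickly.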
Comparable accuracy at 21–118x lower cost, just by picking the right model combination:
| Benchmark | Expensive Combo | Acc | Cost | Budget Combo | Acc | Cost | Savings |
|---|---|---|---|---|---|---|---|
| BFCL | Opus | 72% | $60.78 | Qwen3 Next | 71% | $1.87 | 32x |
| HotpotQA | Opus + Opus | ~73% | $2.71 | Qwen3 Next + gpt-oss-120b | 71.3% | $0.13 | 21x |
| MathQA | Opus + Opus | ~98.5% | $5.89 | Ministral + C3 Haiku | 94.0% | $0.05 | 118x |
Run it once against a small evaluation dataset; ship the winner. Read more in our blog post.
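The Savings column in the table above is simply the expensive cost divided by the budget cost. A small sketch that recomputes it from the table's own figures; because those figures are already rounded, the ratios may differ from the published column by about one unit:

```python
# (expensive cost, budget cost) in USD, copied from the table above.
costs = {
    "BFCL": (60.78, 1.87),
    "HotpotQA": (2.71, 0.13),
    "MathQA": (5.89, 0.05),
}

# Savings factor = expensive cost / budget cost.
savings = {name: expensive / budget for name, (expensive, budget) in costs.items()}

for name, ratio in savings.items():
    print(f"{name}: {ratio:.1f}x")
```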
For workloads where one fixed combination isn't optimal — easy prompts shouldn't pay GPT-4o prices, hard ones shouldn't suffer on Haiku — a Router decides at every LLM call which model to use, based on the prompt, prior calls in the session, or any feature you can compute. Common policies:
- Length/complexity-based — short prompts → small model, long context or tool-call-heavy → big model.
- First-call-big — a strong model for the planning hop, cheap models for the follow-ups.
- Bandit / learned routing — feed selection results back into a contextual bandit so routing decisions improve with traffic.
- Provider failover & A/B — route a fraction of traffic to a candidate model for live comparison without redeploying.
The routing API runs the same in-process or through the agentopt serve daemon, so you can prototype locally and switch a single env var to share the policy across many clients.
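The A/B policy in the list above needs a stable way to send a fixed fraction of traffic to a candidate model. One common approach, sketched here independently of AgentOpt's API, is to hash a session identifier into [0, 1) and compare it with the target fraction. The function names and the notion of a `session_id` are assumptions made for this example:

```python
import hashlib

def ab_bucket(session_id: str) -> float:
    """Map a session id to a stable value in [0, 1)."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # 48 bits fit exactly in a float, so the result is always below 1.0.
    return int.from_bytes(digest[:6], "big") / 2**48

def pick_model(session_id: str, baseline: str, candidate: str, fraction: float) -> str:
    """Send roughly `fraction` of sessions to `candidate`, the rest to `baseline`."""
    return candidate if ab_bucket(session_id) < fraction else baseline

sessions = [f"session-{i}" for i in range(1000)]
chosen = [pick_model(s, "gpt-4o", "gpt-4o-mini", 0.1) for s in sessions]
print(chosen.count("gpt-4o-mini"))  # roughly 100, and the same on every run
```

Hashing rather than calling a random number generator keeps each session pinned to one arm of the experiment, so a multi-call agent run is not split across models.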
pip install agentopt-py

Two axes determine which entry point you reach for:
- Selection vs routing — find one fixed model combination offline (selection), or pick a model per LLM call at runtime (routing).
- In-process vs subprocess — does your agent live in the same Python process as your script (LangChain, OpenAI SDK, …), or run as an external CLI / Docker container (Gemini CLI, Terminal Bench, Claude Code, …)?
A third deployment axis — local proxy vs agentopt serve daemon — is just an env-var flip; the code below is byte-identical between modes. The four canonical setups follow.
The base case. Your agent uses an LLM SDK directly; you search a {planner, solver} combination space to pick the cheapest combo that hits the accuracy band.
from openai import OpenAI
from agentopt import ModelSelector

class MyAgent:
    def __init__(self, models):
        self.client = OpenAI()
        self.planner_model = models["planner"]
        self.solver_model = models["solver"]

    def run(self, input_data):
        plan = self.client.chat.completions.create(
            model=self.planner_model,
            messages=[{"role": "user", "content": f"Plan: {input_data}"}],
        ).choices[0].message.content
        return self.client.chat.completions.create(
            model=self.solver_model,
            messages=[
                {"role": "system", "content": f"Follow this plan:\n{plan}"},
                {"role": "user", "content": input_data},
            ],
        ).choices[0].message.content

dataset = [
    ("What is the capital of France?", "Paris"),
    ("What is 2 + 2?", "4"),
    ("What color is the sky?", "blue"),
    # 100+ samples recommended for production; 10–20 surfaces clear winners.
]

def eval_fn(expected, actual):
    return 1.0 if expected.lower() in str(actual).lower() else 0.0

selector = ModelSelector(
    agent=MyAgent,
    models={
        "planner": ["gpt-4o", "gpt-4o-mini", "gpt-4.1-nano"],
        "solver": ["gpt-4o", "gpt-4o-mini", "gpt-4.1-nano"],
    },  # → 3 × 3 = 9 combinations
    eval_fn=eval_fn,
    dataset=dataset,
    method="auto",  # arm_elimination: smart + cheap
)
results = selector.select_best(parallel=True, max_concurrent=50)
results.print_summary()

Output:
Model Selection Results
----------------------------------------------------------------------------
Rank Model Accuracy Latency Price
----------------------------------------------------------------------------
>>> 1 planner=gpt-4.1-nano + solver=gpt-4.1-nano 100.00% 0.85s $0.000420
2 planner=gpt-4o-mini + solver=gpt-4o-mini 100.00% 1.20s $0.002372
3 planner=gpt-4o + solver=gpt-4o 100.00% 2.70s $0.014355
...
With method="auto" AgentOpt eliminates clearly worse combinations after a few datapoints; LLM-as-judge is supported — just call your judge LLM inside eval_fn.
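As a rough illustration of the LLM-as-judge option, the sketch below builds an `eval_fn(expected, actual) -> float` around a judge callable. `make_judge_eval_fn` and `ask_judge` are hypothetical names introduced for this example; in practice `ask_judge` would wrap a call to whichever LLM client you use. A stand-in judge keeps the sketch runnable offline:

```python
from typing import Callable

def make_judge_eval_fn(ask_judge: Callable[[str], str]) -> Callable[[str, str], float]:
    """Build an eval_fn that defers the correctness decision to a judge model.

    `ask_judge` is a hypothetical hook: it takes a prompt and returns the
    judge's raw text reply.
    """
    def eval_fn(expected, actual):
        prompt = (
            "Does the ANSWER convey the same fact as the REFERENCE? "
            "Reply with exactly YES or NO.\n"
            f"REFERENCE: {expected}\nANSWER: {actual}"
        )
        verdict = ask_judge(prompt).strip().upper()
        return 1.0 if verdict.startswith("YES") else 0.0
    return eval_fn

# Stand-in judge so the sketch runs offline: a plain substring check.
def fake_judge(prompt: str) -> str:
    reference = prompt.split("REFERENCE: ")[1].split("\n")[0]
    answer = prompt.split("ANSWER: ")[1]
    return "YES" if reference.lower() in answer.lower() else "NO"

eval_fn = make_judge_eval_fn(fake_judge)
print(eval_fn("Paris", "The capital of France is Paris."))  # 1.0
print(eval_fn("Paris", "Lyon"))                             # 0.0
```

Forcing the judge to answer with a single token such as YES or NO keeps parsing trivial; a graded rubric would return a float between 0 and 1 instead.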
When the agent is an external CLI (Gemini CLI, Claude Code, Terminal Bench, OpenHarness, …), run() shells out via subprocess.run. You write zero env-var plumbing — while a selection is in flight, AgentOpt patches subprocess.Popen to inject HTTPS_PROXY + the CA bundle into every child, so the CLI's LLM calls are intercepted and tracked the same way as in-process ones.
import subprocess
from agentopt import ModelSelector

class GeminiCLIAgent:
    def __init__(self, models):
        self.model = models["agent"]

    def run(self, prompt):
        # No agentopt imports here. The subprocess patch handles routing.
        return subprocess.run(
            ["gemini", "-m", self.model, "-p", prompt],
            capture_output=True, text=True,
        ).stdout

selector = ModelSelector(
    agent=GeminiCLIAgent,
    models={"agent": ["gemini-2.5-flash", "gemini-2.5-pro"]},
    eval_fn=eval_fn,
    dataset=dataset,
    method="brute_force",
)
selector.select_best(parallel=False).print_summary()

For agents that ignore HTTPS_PROXY and need the proxy URL / CA cert injected into a config file (OpenClaw is the canonical case), agentopt.get_current_session_proxy() is the escape hatch; see the OpenClawAgent wrapper for the pattern. Working examples per CLI: examples/selection/local/.
A Router decides at every LLM call which model to use — no models= search space, no eval dataset. The same MyAgent from §1 is reused; only the harness around it changes.
from agentopt import LLMTracker, RandomRouter

agent = MyAgent({"planner": "gpt-4o-mini", "solver": "gpt-4o-mini"})
router = RandomRouter(candidates=["gpt-4o-mini", "gpt-4.1-nano"], seed=0)

questions = [
    "What is the capital of France?",
    "What is 2 + 2?",
    "What color is the sky?",
]

with LLMTracker(router=router) as tracker:
    for i, q in enumerate(questions, 1):
        with tracker.track(data_id=f"q{i}"):
            print(agent.run(q))

tracker.print_summary()

Output:
Paris
4
Blue.
============================================================
Routing summary
============================================================
Model usage by datapoint:
[q1] 2 call(s), 4.11s
gpt-4.1-nano 2.06s
gpt-4.1-nano 2.06s
[q2] 2 call(s), 2.22s
gpt-4.1-nano 1.11s
gpt-4.1-nano 1.11s
[q3] 2 call(s), 6.03s
gpt-4o-mini 3.01s
gpt-4o-mini 3.01s
Tokens per model:
gpt-4.1-nano prompt= 19268 completion= 8 total= 19276
gpt-4o-mini prompt= 9638 completion= 6 total= 9644
Total latency: 12.37s across 6 call(s)
RandomRouter is the simplest built-in policy. Write your own by subclassing Router and implementing route(ctx) -> Optional[str] — return a model name to swap, or None to keep the client's choice. The same code works for subprocess agents too (CLIs are routed at the mitmproxy hop). See the router docs and examples/routing/local/.
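To give a feel for the shape of a custom policy, here is a sketch of a length-based router. It does not import AgentOpt: `Router` below is a stand-in base class, and `RouteContext` with its `model` and `prompt_chars` fields is a hypothetical stand-in for the real `ctx` object, whose fields may differ. Only the `route(ctx) -> Optional[str]` contract comes from the description above:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RouteContext:
    """Hypothetical stand-in for the ctx object; the real fields may differ."""
    model: str         # model the client asked for
    prompt_chars: int  # total characters across the request messages

class Router:
    """Stand-in base class so the sketch runs without agentopt installed."""
    def route(self, ctx: RouteContext) -> Optional[str]:
        raise NotImplementedError

class LengthRouter(Router):
    """Short prompts go to a small model; long prompts keep the client's choice."""
    def __init__(self, small_model: str, max_chars: int = 2000):
        self.small_model = small_model
        self.max_chars = max_chars

    def route(self, ctx: RouteContext) -> Optional[str]:
        if ctx.prompt_chars <= self.max_chars:
            return self.small_model  # swap to the cheaper model
        return None                  # keep whatever the client requested

router = LengthRouter(small_model="gpt-4.1-nano")
print(router.route(RouteContext(model="gpt-4o", prompt_chars=120)))     # gpt-4.1-nano
print(router.route(RouteContext(model="gpt-4o", prompt_chars=50_000)))  # None
```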
Same Python code as §3 — but the proxy state (cache, records, mitmproxy masters) lives in a long-lived agentopt serve daemon instead of this process. One gateway can serve many clients in any language, and routing policies preloaded on the daemon apply across all of them.
Start the daemon (in its own terminal):
# Plain daemon — clients bring their own routers.
agentopt serve --port 9000 --cache-dir ./shared_cache
# Or set a daemon-wide default router (per-session overrides still allowed).
agentopt serve --port 9000 \
--routing-policy random --candidate-models gpt-4o,gpt-4o-mini --seed 42
# Or preload custom Router subclasses for clients to push per-session.
agentopt serve --port 9000 --policy-module ./my_policies.py

Then run the §3 script against it; only the env var changes:

AGENTOPT_GATEWAY_URL=http://127.0.0.1:9000 python my_routing_script.py

LLMTracker detects AGENTOPT_GATEWAY_URL in __init__ and routes through RemoteBackend; in-process httpx calls forward through the daemon's per-session proxy port, and subprocess agents get that same port injected into HTTPS_PROXY. See examples/routing/daemon/ and examples/selection/daemon/.
All four setups share the same agent contract:
- MyAgent.__init__(self, models): receive a dict like {"planner": "gpt-4o", "solver": "gpt-4o-mini"} and build your agent. For routing, the dict is the initial model assignment; the router overrides per call.
- MyAgent.run(self, input_data): run on a single datapoint and return the output.
Selection additionally needs a dataset of (input, expected) pairs and an eval_fn(expected, actual) -> float; neither is required for routing.
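The contract above can be written down as a structural type, which is handy for checking a new agent class before handing it to a selector. This is an illustrative sketch only: `AgentLike` and `EchoAgent` are names invented here, and AgentOpt itself may not define or require such a protocol.

```python
from typing import Any, Dict, Protocol, runtime_checkable

@runtime_checkable
class AgentLike(Protocol):
    """Structural view of the contract: anything with a run(input_data) method."""
    def run(self, input_data: Any) -> Any: ...

class EchoAgent:
    """Toy agent with no LLM calls, used only to exercise the contract."""
    def __init__(self, models: Dict[str, str]):
        self.models = models

    def run(self, input_data: Any) -> str:
        return f"[{self.models['solver']}] {input_data}"

def eval_fn(expected: str, actual: Any) -> float:
    return 1.0 if expected.lower() in str(actual).lower() else 0.0

agent = EchoAgent({"planner": "gpt-4o", "solver": "gpt-4o-mini"})
print(isinstance(agent, AgentLike))           # True
print(eval_fn("paris", agent.run("Paris?")))  # 1.0
```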
Working examples for the frameworks and CLI agents named above. Examples are organised into four quadrants under examples/: {selection, routing} × {local, daemon}.
| Framework | Type | Selection | Routing |
|---|---|---|---|
| OpenAI Agents SDK | in-process | openai_sdk.py | openai_sdk.py |
| LangChain | in-process | langchain.py | langchain.py |
| LangGraph | in-process | langgraph.py | langgraph.py |
| CrewAI | in-process | crewai.py | crewai.py |
| LlamaIndex | in-process | llamaindex.py | llamaindex.py |
| AG2 | in-process | ag2.py | ag2.py |
| OpenAI-Compatible API | in-process | custom_agent.py | custom_agent.py |
| Gemini CLI | subprocess | gemini_cli.py | gemini_cli.py |
| OpenHarness | subprocess | openharness.py | openharness.py |
| Terminal Bench | subprocess (Docker) | terminal_bench.py | terminal_bench.py |
| OpenClaw | subprocess | openclaw.py | openclaw.py |
AgentOpt includes a rich set of selection algorithms. Advanced users may get significant speedups by choosing the right method for their use case. See the documentation and advanced_algorithms.py for details.
If you do not need the strict best model combination and want lower search cost, epsilon_lucb is often a good choice: it stops once an ε-optimal arm is found (tune epsilon to trade off how close to optimal you need to be versus how many runs you spend).
| method= | Best for | How it works |
|---|---|---|
| "auto" (default) | General use | Automatically finds the best combination (wired to arm_elimination: strong best-arm identification with lower search cost than brute_force) |
| "brute_force" | Small search spaces | Evaluates all combinations |
| "random" | Quick exploration | Samples a random fraction |
| "hill_climbing" | Topology-aware search | Greedy search using model quality/speed rankings |
| "arm_elimination" | Best-arm identification | Bandit; eliminates statistically dominated combinations |
| "epsilon_lucb" | Extra search cost savings when ε-optimal is enough | Bandit; stops when an epsilon-optimal best arm is identified |
| "threshold" | Thresholding objectives | Bandit; determines whether each combination is above/below a user-defined threshold on the performance metric (e.g., mean accuracy) |
| "lm_proposal" | LLM-guided search | Uses a proposer LLM to shortlist promising combinations |
| "bayesian" | Expensive evaluations | GP-based Bayesian optimization over categorical model choices; uses correlation between combinations (requires pip install "agentopt-py[bayesian]") |
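To illustrate the stopping rule behind the "epsilon_lucb" row, here is a textbook-style LUCB sketch with Hoeffding confidence bounds. It is not AgentOpt's implementation: the function signature, the confidence radius (the form used in the LUCB1 analysis), and the `max_rounds` guard are all assumptions made for this example. Each arm stands for one model combination, and `pull(i)` stands for scoring that combination on one datapoint.

```python
import math
import random
from typing import Callable

def epsilon_lucb(pull: Callable[[int], float], n_arms: int,
                 epsilon: float, delta: float = 0.1, max_rounds: int = 10_000) -> int:
    """Return the arm the LUCB rule settles on; pull(i) must return a reward in [0, 1]."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms

    def sample(i: int) -> None:
        sums[i] += pull(i)
        counts[i] += 1

    for i in range(n_arms):  # initialise with one pull per arm
        sample(i)

    for t in range(1, max_rounds + 1):
        means = [s / c for s, c in zip(sums, counts)]
        # Hoeffding-style confidence radius; shrinks as an arm is pulled more often.
        radius = [math.sqrt(math.log(1.25 * n_arms * t**4 / delta) / (2 * c)) for c in counts]
        best = max(range(n_arms), key=lambda i: means[i])
        challenger = max((i for i in range(n_arms) if i != best),
                         key=lambda i: means[i] + radius[i])
        # Stop once the challenger's upper bound is within epsilon of the leader's lower bound.
        if (means[challenger] + radius[challenger]) - (means[best] - radius[best]) < epsilon:
            return best
        sample(best)
        sample(challenger)
    # Budget exhausted: fall back to the empirically best arm.
    return max(range(n_arms), key=lambda i: sums[i] / counts[i])

rng = random.Random(0)
true_accuracy = [0.9, 0.5, 0.3]  # made-up per-combination accuracies
winner = epsilon_lucb(lambda i: float(rng.random() < true_accuracy[i]), 3, epsilon=0.1)
print(winner)
```

A larger `epsilon` lets the loop stop earlier because the leader only has to be shown to be near-optimal, which is the trade-off the `epsilon` parameter controls.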
selector = ModelSelector(
    agent=MyAgent, models=models, eval_fn=eval_fn, dataset=dataset,
    method="epsilon_lucb",
    epsilon=0.01,
)
results = selector.select_best(parallel=True)

Everything in AgentOpt builds on a single primitive: intercept every outbound LLM HTTP call. Selection, routing, tracking, and caching all hang off that one seam.
| Agent shape | Where we intercept | What we patch |
|---|---|---|
| In-process Python (LangChain, OpenAI SDK, …) | The HTTP library, before encryption | httpx.Client.send |
| Subprocess / CLI / Docker (Gemini CLI, Claude Code, Terminal Bench, …) | The network, via a local mitmproxy on a per-session port | subprocess.Popen.__init__ (to inject HTTPS_PROXY + CA bundle) |
in-process:                                  subprocess:
agent.run(input)                             agent.run(input)
 └── SDK (langchain/openai/…)                 └── subprocess.run([...])  ← Popen patch
      └── httpx.Client.send()  ← patched           └── child process inherits HTTPS_PROXY
           └── LLM API                                  └── mitmproxy on session port
                                                             ├── TLS-terminates with our CA
                                                             └── forwards to LLM API
The mitmproxy CA is generated once and merged with certifi's root CA bundle into a single bundle at ~/.mitmproxy/agentopt-bundle.pem; the subprocess patch sets SSL_CERT_FILE / REQUESTS_CA_BUNDLE / NODE_EXTRA_CA_CERTS so the child trusts both. No agent code changes either way: the patches install when LLMTracker.start() is called (or the with LLMTracker: context is entered) and uninstall on exit (refcounted so concurrent trackers don't interfere).
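For readers who want to see what that environment looks like from the child's side, the sketch below builds it by hand and passes it to a subprocess explicitly. AgentOpt does this through its Popen patch, so you would not normally write this yourself; `build_child_env` is a hypothetical helper and port 8899 is an arbitrary placeholder.

```python
import os
import subprocess
import sys

def build_child_env(proxy_port: int, ca_bundle: str) -> dict:
    """Return a copy of the current environment with proxy and CA settings added."""
    env = dict(os.environ)
    env["HTTPS_PROXY"] = f"http://127.0.0.1:{proxy_port}"
    # Different runtimes read different variables to locate trusted CAs.
    for var in ("SSL_CERT_FILE", "REQUESTS_CA_BUNDLE", "NODE_EXTRA_CA_CERTS"):
        env[var] = ca_bundle
    return env

# Port 8899 is a placeholder; the bundle path is the one named above.
env = build_child_env(8899, os.path.expanduser("~/.mitmproxy/agentopt-bundle.pem"))

# The child only prints what it inherited; no network traffic is involved.
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['HTTPS_PROXY'])"],
    env=env, capture_output=True, text=True, check=True,
)
print(child.stdout.strip())  # http://127.0.0.1:8899
```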
| | What runs at the intercept | What it produces |
|---|---|---|
| Tracking | Record provider, model, tokens, latency, cache hit/miss | A CallRecord per LLM call |
| Caching | Hash request body → look up SQLite/in-memory cache → short-circuit on hit | Replays of cached responses, with original latency preserved |
| Routing | Run the active Router.route(ctx) to swap body["model"] before forwarding | Per-call model overrides |
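The caching row can be pictured as a hash-keyed lookup in front of the upstream call. The following is a simplified sketch, not AgentOpt's cache: the table layout, the class and function names, and the choice of SHA-256 over canonical JSON are assumptions, and `fake_llm` stands in for the real provider.

```python
import hashlib
import json
import sqlite3

def request_key(body: dict) -> str:
    """Hash a canonical JSON form of the request so key order does not matter."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

class ResponseCache:
    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, response TEXT)")

    def get(self, body: dict):
        row = self.db.execute(
            "SELECT response FROM cache WHERE key = ?", (request_key(body),)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def put(self, body: dict, response: dict) -> None:
        self.db.execute("INSERT OR REPLACE INTO cache VALUES (?, ?)",
                        (request_key(body), json.dumps(response)))
        self.db.commit()

upstream_calls = []  # records every request that actually reaches the "provider"

def fake_llm(body: dict) -> dict:
    upstream_calls.append(body)
    return {"content": f"echo from {body['model']}"}

def cached_call(cache: ResponseCache, body: dict) -> dict:
    hit = cache.get(body)
    if hit is not None:
        return hit               # short-circuit: no upstream call
    response = fake_llm(body)
    cache.put(body, response)
    return response

cache = ResponseCache()
first = cached_call(cache, {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "hi"}]})
second = cached_call(cache, {"messages": [{"role": "user", "content": "hi"}], "model": "gpt-4o-mini"})
print(first == second, len(upstream_calls))  # True 1
```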
Selection orchestrates the same primitive: for each combination it instantiates MyAgent(combo), runs run() over the dataset inside a tracking session, and ranks by aggregated accuracy / latency / cost. Smart algorithms (auto = arm-elimination by default) drop dominated combinations early so the cost scales sublinearly with the search space.
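The orchestration described above reduces, in its brute-force form, to a loop over combinations. A minimal sketch under stated assumptions: `brute_force_select` and `ToyAgent` are invented for this example, only accuracy is aggregated (no latency or cost), and nothing here is AgentOpt's actual code.

```python
from itertools import product

def brute_force_select(agent_cls, models, dataset, eval_fn):
    """Score every combination and return (combo, accuracy) pairs, best first."""
    roles = list(models)
    results = []
    for choice in product(*(models[role] for role in roles)):
        combo = dict(zip(roles, choice))
        agent = agent_cls(combo)  # same contract as MyAgent(models)
        scores = [eval_fn(expected, agent.run(x)) for x, expected in dataset]
        results.append((combo, sum(scores) / len(scores)))
    return sorted(results, key=lambda item: item[1], reverse=True)

class ToyAgent:
    """Offline stand-in: a 'large' solver handles every input, 'small' only short ones."""
    def __init__(self, models):
        self.solver = models["solver"]

    def run(self, x):
        if self.solver == "large" or len(x) <= 3:
            return x.upper()
        return "?"

def exact_match(expected, actual):
    return 1.0 if expected == actual else 0.0

toy_dataset = [("abc", "ABC"), ("abcdef", "ABCDEF")]
ranking = brute_force_select(
    ToyAgent,
    {"planner": ["small", "large"], "solver": ["small", "large"]},
    toy_dataset,
    exact_match,
)
for combo, accuracy in ranking:
    print(combo, accuracy)
```

The bandit methods differ from this loop in that they stop spending datapoints on a combination once it is clearly dominated, instead of scoring every combination on the full dataset.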
Where does the mitmproxy state (cache, records, masters) live? You choose at runtime by setting one env var.
| Mode | Selected when | Proxy state lives in | Multi-language / multi-process clients |
|---|---|---|---|
| Local | AGENTOPT_GATEWAY_URL unset | The Python process running LLMTracker | Subprocess agents only |
| Daemon | AGENTOPT_GATEWAY_URL=http://host:port set | The agentopt serve process | First-class: any client that respects HTTPS_PROXY |
The user-facing API is byte-identical: ModelSelector(...).select_best(), tracker.track(), tracker.get_records(). In daemon mode the in-process httpx patch forwards through the daemon's per-session port instead of recording locally, and subprocess agents get the daemon's port injected into HTTPS_PROXY — the daemon does the cache + record on both paths.
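The mode switch described in the table amounts to a check on a single environment variable. A hedged sketch of that decision follows; `resolve_mode` is a hypothetical helper, and the URL validation is an assumption added for the example, not documented AgentOpt behaviour.

```python
import os
from urllib.parse import urlparse

def resolve_mode(environ=os.environ):
    """Return ("daemon", url) if AGENTOPT_GATEWAY_URL is set, else ("local", None)."""
    url = environ.get("AGENTOPT_GATEWAY_URL", "").strip()
    if not url:
        return ("local", None)
    parsed = urlparse(url)
    # Assumption: reject values that are not plain http(s) URLs with a host.
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError(f"AGENTOPT_GATEWAY_URL is not a valid http(s) URL: {url!r}")
    return ("daemon", url)

print(resolve_mode({}))                                                  # ('local', None)
print(resolve_mode({"AGENTOPT_GATEWAY_URL": "http://127.0.0.1:9000"}))  # ('daemon', 'http://127.0.0.1:9000')
```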
- Framework-agnostic. Anything that ships an LLM call over the wire works — no plugin per framework, no adapter per provider.
- Subprocess agents are first-class. Claude Code, Gemini CLI, Terminal Bench, Docker-bound agents: all intercepted with no env-var plumbing in the agent's run().
- Caching saves real money during iteration. Identical request bodies are deduplicated across runs.
- State outlives a single experiment (daemon mode). Cache, providers, and (optionally) a default routing policy survive across runs and clients.
- Routing and selection compose. Today selection picks a winning combination; tomorrow routing decides per call. Future versions can feed selection results into a learned router.
For the full architecture — _active_session_var ContextVar attribution, per-session masters, CA bundle plumbing, daemon control plane — see docs/api/proxy.md.
results = selector.select_best()
results.print_summary() # formatted table
best = results.get_best() # ModelResult with highest accuracy
combo = results.get_best_combo() # {"planner": "gpt-4o", "solver": "gpt-4o-mini"}
results.to_csv("results.csv") # export all results
results.export_config("config.yaml")  # export best combo as YAML

Custom model pricing. Define pricing for self-hosted or custom models:
selector = ModelSelector(
    ...,
    model_prices={
        "my-custom-model": {"input_price": 2.50, "output_price": 10.00},
    },
)

Custom cache directory. LLM response caching is enabled by default (.agentopt_cache/). To customize:
from agentopt import LLMTracker
tracker = LLMTracker(cache_dir="./my_cache")
selector = ModelSelector(..., tracker=tracker)
results = selector.select_best()  # cache flushed automatically

Using prebuilt LLM instances. Pass framework-specific LLM objects instead of model name strings:
from langchain_openai import ChatOpenAI

selector = ModelSelector(
    agent=MyAgent,
    models={
        "planner": [ChatOpenAI(model="gpt-4o"), ChatOpenAI(model="gpt-4o-mini")],
        "solver": [ChatOpenAI(model="gpt-4o"), ChatOpenAI(model="gpt-4o-mini")],
    },
    eval_fn=eval_fn,
    dataset=dataset,
)

Full documentation at agentoptimizer.github.io/agentopt, including the Selectors, Router, Tracker, and Results API references, plus guides on how it works and response caching.
Apache 2.0
