GitHub - AgentOptimizer/agentopt: AgentOpt automatically finds the best LLM model combination for each step of your agent — optimizing for accuracy, cost, and latency.

Find the right LLM models for your AI agents.

A simple model swap can cut your agent's costs by 10–100x without sacrificing performance.

AgentOpt is supported by DAPLab at Columbia University.

News

[2026/04] Version 0.1.0 released.

Why AgentOpt

Framework-agnostic by construction. AgentOpt intercepts LLM calls at the one place every SDK eventually goes through — the outbound HTTP request — so it works the same with anything that ships an LLM call over the wire. No framework adapters, no plugin per provider, no wrapping your client. In-process Python frameworks (LangChain, LangGraph, CrewAI, LlamaIndex, AG2, OpenAI Agents SDK, plain openai/anthropic) attach through an httpx patch; subprocess and CLI agents (Claude Code, Gemini CLI, OpenHarness, Terminal Bench, OpenClaw) attach through HTTPS_PROXY with a local CA. The same code works on both — and on anything custom you write tomorrow.

On top of that one primitive, AgentOpt gives you three capabilities that share the same proxy, the same record schema, and the same cache:

Selection — search a combinatorial model space to find the best fixed combination for an agent.
Routing — swap models per call at runtime based on prompt, history, or any policy you write.
Tracking — just record token usage, latency, and per-query cost across an agent run.

The combinatorial search problem is real: 3 steps with 8 candidate models each gives 8³ = 512 combinations to evaluate. AgentOpt's selection algorithms (arm elimination, LUCB, Bayesian) home in on the best combination at a fraction of the brute-force cost, and the routing API lets you keep refining at runtime once you've shipped.
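The size of the search space is the product of the candidate counts per step. A minimal sketch of that count, assuming three hypothetical step names and eight placeholder model names (none of these come from AgentOpt itself):

```python
from itertools import product

# Hypothetical step names and placeholder model names, for illustration only.
steps = ["planner", "retriever", "solver"]
candidates = [f"model-{i}" for i in range(8)]

# One combination assigns exactly one model to every step.
combos = [
    dict(zip(steps, choice))
    for choice in product(candidates, repeat=len(steps))
]

print(len(combos))  # number of combinations a brute-force search must evaluate
print(combos[0])
```

Adding a fourth step with the same eight candidates would multiply the count by eight again, which is why exhaustive evaluation stops being practical quickly.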

Use Cases

Offline model selection — find the best fixed combination

Comparable accuracy at a 21–118x cost difference, just by picking the right model combination:

Benchmark	Expensive Combo	Acc	Cost	Budget Combo	Acc	Cost	Savings
BFCL	Opus	72%	$60.78	Qwen3 Next	71%	$1.87	32x
HotpotQA	Opus + Opus	~73%	$2.71	Qwen3 Next + gpt-oss-120b	71.3%	$0.13	21x
MathQA	Opus + Opus	~98.5%	$5.89	Ministral + C3 Haiku	94.0%	$0.05	118x
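The Savings column is the ratio of the two Cost columns. The sketch below recomputes it from the rounded dollar figures in the table, so the results may differ from the published column by a rounding step:

```python
# (expensive combo cost, budget combo cost) in USD, copied from the table above.
costs = {
    "BFCL": (60.78, 1.87),
    "HotpotQA": (2.71, 0.13),
    "MathQA": (5.89, 0.05),
}

# Savings factor = expensive cost / budget cost.
savings = {name: expensive / budget for name, (expensive, budget) in costs.items()}

for name, ratio in savings.items():
    print(f"{name}: {ratio:.1f}x")
```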

Run it once against a small evaluation dataset; ship the winner. Read more in our blog post.

Online model routing — pick a different model per call

For workloads where one fixed combination isn't optimal — easy prompts shouldn't pay GPT-4o prices, hard ones shouldn't suffer on Haiku — a Router decides at every LLM call which model to use, based on the prompt, prior calls in the session, or any feature you can compute. Common policies:

Length/complexity-based — short prompts → small model, long context or tool-call-heavy → big model.
First-call-big — a strong model for the planning hop, cheap models for the follow-ups.
Bandit / learned routing — feed selection results back into a contextual bandit so routing decisions improve with traffic.
Provider failover & A/B — route a fraction of traffic to a candidate model for live comparison without redeploying.
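As an illustration of the A/B policy, one common way to send a fixed fraction of traffic to a candidate model is to hash a stable identifier into a bucket. This is a generic sketch, not AgentOpt code; `ab_bucket`, `pick_model`, and the session-id format are hypothetical names chosen for the example:

```python
import hashlib


def ab_bucket(session_id: str, candidate_fraction: float) -> bool:
    """Return True if this session falls into the candidate bucket."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a roughly uniform point in [0, 1].
    point = int.from_bytes(digest[:8], "big") / 2**64
    return point < candidate_fraction


def pick_model(session_id, baseline, candidate, candidate_fraction=0.1):
    """Same session id always gets the same model, so comparisons stay clean."""
    return candidate if ab_bucket(session_id, candidate_fraction) else baseline


sessions = [f"session-{i}" for i in range(1000)]
routed = [pick_model(s, "gpt-4o", "gpt-4o-mini") for s in sessions]
share = routed.count("gpt-4o-mini") / len(sessions)
print(f"candidate share: {share:.1%}")
```

Hashing instead of calling a random number generator keeps the assignment deterministic across processes and restarts, which matters when several clients share one policy.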

The routing API runs the same in-process or through the agentopt serve daemon, so you can prototype locally and switch a single env var to share the policy across many clients.

Installation

pip install agentopt-py

Quick Start

Two axes determine which entry point you reach for:

Selection vs routing — find one fixed model combination offline (selection), or pick a model per LLM call at runtime (routing).
In-process vs subprocess — does your agent live in the same Python process as your script (LangChain, OpenAI SDK, …), or run as an external CLI / Docker container (Gemini CLI, Terminal Bench, Claude Code, …)?

A third deployment axis — local proxy vs agentopt serve daemon — is just an env-var flip; the code below is byte-identical between modes. The four canonical setups follow.

1. In-process agent + offline model selection

The base case. Your agent uses an LLM SDK directly; you search a {planner, solver} combination space to pick the cheapest combo that hits the accuracy band.

from openai import OpenAI
from agentopt import ModelSelector


class MyAgent:
    def __init__(self, models):
        self.client = OpenAI()
        self.planner_model = models["planner"]
        self.solver_model = models["solver"]

    def run(self, input_data):
        plan = self.client.chat.completions.create(
            model=self.planner_model,
            messages=[{"role": "user", "content": f"Plan: {input_data}"}],
        ).choices[0].message.content
        return self.client.chat.completions.create(
            model=self.solver_model,
            messages=[
                {"role": "system", "content": f"Follow this plan:\n{plan}"},
                {"role": "user", "content": input_data},
            ],
        ).choices[0].message.content


dataset = [
    ("What is the capital of France?", "Paris"),
    ("What is 2 + 2?", "4"),
    ("What color is the sky?", "blue"),
    # 100+ samples recommended for production; 10–20 usually surface clear winners.
]

def eval_fn(expected, actual):
    return 1.0 if expected.lower() in str(actual).lower() else 0.0

selector = ModelSelector(
    agent=MyAgent,
    models={
        "planner": ["gpt-4o", "gpt-4o-mini", "gpt-4.1-nano"],
        "solver":  ["gpt-4o", "gpt-4o-mini", "gpt-4.1-nano"],
    },                                  # → 3 × 3 = 9 combinations
    eval_fn=eval_fn,
    dataset=dataset,
    method="auto",                      # arm_elimination — smart + cheap
)
results = selector.select_best(parallel=True, max_concurrent=50)
results.print_summary()

Output:

    Model Selection Results
    ----------------------------------------------------------------------------
    Rank  Model                                     Accuracy  Latency      Price
    ----------------------------------------------------------------------------
>>>    1  planner=gpt-4.1-nano + solver=gpt-4.1-nano 100.00%    0.85s  $0.000420
       2  planner=gpt-4o-mini + solver=gpt-4o-mini   100.00%    1.20s  $0.002372
       3  planner=gpt-4o + solver=gpt-4o              100.00%    2.70s  $0.014355
    ...

With method="auto", AgentOpt eliminates clearly worse combinations after a few datapoints. LLM-as-judge evaluation is also supported: call your judge LLM inside eval_fn.
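An LLM-as-judge eval_fn could look roughly like the sketch below. The only part taken from this README is the eval_fn(expected, actual) -> float signature; `make_judge_eval_fn`, `ask_judge`, and the prompt wording are hypothetical, and a stand-in judge replaces the real LLM call so the example runs offline:

```python
from typing import Callable


def make_judge_eval_fn(ask_judge: Callable[[str], str]) -> Callable[[str, str], float]:
    """Build an eval_fn(expected, actual) that defers grading to a judge.

    `ask_judge` is a hypothetical callable: it takes a prompt string and
    returns the judge model's text reply (for example, a thin wrapper
    around an OpenAI chat completion call).
    """
    def eval_fn(expected, actual):
        prompt = (
            "Does the answer match the reference? Reply YES or NO.\n"
            f"Reference: {expected}\n"
            f"Answer: {actual}"
        )
        verdict = ask_judge(prompt).strip().upper()
        return 1.0 if verdict.startswith("YES") else 0.0

    return eval_fn


def fake_judge(prompt: str) -> str:
    """Stand-in judge so the sketch runs without network access."""
    reference = prompt.split("Reference: ")[1].split("\n")[0]
    answer = prompt.split("Answer: ")[1]
    return "YES" if reference.lower() in answer.lower() else "NO"


eval_fn = make_judge_eval_fn(fake_judge)
print(eval_fn("Paris", "The capital is Paris."))
print(eval_fn("Paris", "Lyon"))
```

Keeping the judge behind a plain callable makes it easy to swap in a real model later.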

2. Subprocess agent + offline model selection

When the agent is an external CLI (Gemini CLI, Claude Code, Terminal Bench, OpenHarness, …), run() shells out via subprocess.run. You write zero env-var plumbing — while a selection is in flight, AgentOpt patches subprocess.Popen to inject HTTPS_PROXY + the CA bundle into every child, so the CLI's LLM calls are intercepted and tracked the same way as in-process ones.

import subprocess
from agentopt import ModelSelector


class GeminiCLIAgent:
    def __init__(self, models):
        self.model = models["agent"]

    def run(self, prompt):
        # No agentopt imports here. The subprocess patch handles routing.
        return subprocess.run(
            ["gemini", "-m", self.model, "-p", prompt],
            capture_output=True, text=True,
        ).stdout


# eval_fn and dataset are the ones defined in §1.
selector = ModelSelector(
    agent=GeminiCLIAgent,
    models={"agent": ["gemini-2.5-flash", "gemini-2.5-pro"]},
    eval_fn=eval_fn,
    dataset=dataset,
    method="brute_force",
)
selector.select_best(parallel=False).print_summary()

For agents that ignore HTTPS_PROXY and need the proxy URL / CA cert injected into a config file (OpenClaw is the canonical case), agentopt.get_current_session_proxy() is the escape hatch — see the OpenClawAgent wrapper for the pattern. Working examples per CLI: examples/selection/local/.

3. Local backend + online model routing

A Router decides at every LLM call which model to use — no models= search space, no eval dataset. The same MyAgent from §1 is reused; only the harness around it changes.

from agentopt import LLMTracker, RandomRouter

agent = MyAgent({"planner": "gpt-4o-mini", "solver": "gpt-4o-mini"})

router = RandomRouter(candidates=["gpt-4o-mini", "gpt-4.1-nano"], seed=0)
questions = [
    "What is the capital of France?",
    "What is 2 + 2?",
    "What color is the sky?",
]

with LLMTracker(router=router) as tracker:
    for i, q in enumerate(questions, 1):
        with tracker.track(data_id=f"q{i}"):
            print(agent.run(q))
tracker.print_summary()

Output:

Paris
4
Blue.
============================================================
Routing summary
============================================================

Model usage by datapoint:
  [q1]  2 call(s), 4.11s
      gpt-4.1-nano                     2.06s
      gpt-4.1-nano                     2.06s
  [q2]  2 call(s), 2.22s
      gpt-4.1-nano                     1.11s
      gpt-4.1-nano                     1.11s
  [q3]  2 call(s), 6.03s
      gpt-4o-mini                      3.01s
      gpt-4o-mini                      3.01s

Tokens per model:
  gpt-4.1-nano   prompt= 19268   completion=     8   total= 19276
  gpt-4o-mini    prompt=  9638   completion=     6   total=  9644

Total latency: 12.37s across 6 call(s)

RandomRouter is the simplest built-in policy. Write your own by subclassing Router and implementing route(ctx) -> Optional[str] — return a model name to swap, or None to keep the client's choice. The same code works for subprocess agents too (CLIs are routed at the mitmproxy hop). See the router docs and examples/routing/local/.
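A custom policy might be sketched as follows. To keep the example runnable without agentopt installed, it defines a stand-in `Router` base class and a `FakeContext`; with the library you would subclass its own Router instead. The `prompt` and `model` fields on the context are assumptions, since this README does not list what ctx exposes (check the router docs for the real attribute names):

```python
from dataclasses import dataclass
from typing import Optional


class Router:
    """Stand-in for the library's Router base class (assumption)."""

    def route(self, ctx) -> Optional[str]:
        raise NotImplementedError


@dataclass
class FakeContext:
    prompt: str  # hypothetical field: text of the outgoing request
    model: str   # hypothetical field: model the client asked for


class LengthRouter(Router):
    """Send short prompts to a small model, leave long prompts alone."""

    def __init__(self, small_model: str, max_chars: int = 200):
        self.small_model = small_model
        self.max_chars = max_chars

    def route(self, ctx) -> Optional[str]:
        if len(ctx.prompt) <= self.max_chars:
            return self.small_model  # swap to the cheaper model
        return None                  # keep the client's choice


router = LengthRouter(small_model="gpt-4.1-nano")
print(router.route(FakeContext(prompt="What is 2 + 2?", model="gpt-4o")))
print(router.route(FakeContext(prompt="x" * 5000, model="gpt-4o")))
```

Returning None for the long prompt follows the contract described above: the request goes out with whatever model the client originally chose.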

4. Daemon backend + online model routing

Same Python code as §3 — but the proxy state (cache, records, mitmproxy masters) lives in a long-lived agentopt serve daemon instead of this process. One gateway can serve many clients in any language, and routing policies preloaded on the daemon apply across all of them.

Start the daemon (in its own terminal):

# Plain daemon — clients bring their own routers.
agentopt serve --port 9000 --cache-dir ./shared_cache

# Or set a daemon-wide default router (per-session overrides still allowed).
agentopt serve --port 9000 \
    --routing-policy random --candidate-models gpt-4o,gpt-4o-mini --seed 42

# Or preload custom Router subclasses for clients to push per-session.
agentopt serve --port 9000 --policy-module ./my_policies.py

Then run the §3 script against it — only the env var changes:

AGENTOPT_GATEWAY_URL=http://127.0.0.1:9000 python my_routing_script.py

LLMTracker detects AGENTOPT_GATEWAY_URL in __init__ and routes through RemoteBackend; in-process httpx calls forward through the daemon's per-session proxy port, and subprocess agents get that same port injected into HTTPS_PROXY. See examples/routing/daemon/ and examples/selection/daemon/.
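The mode switch described here comes down to one environment lookup. The sketch below shows the general shape of such a check; `gateway_url` and `backend_mode` are hypothetical helpers written for this example, not functions exported by agentopt:

```python
import os
from typing import Mapping, Optional


def gateway_url(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the daemon URL if one is configured, else None (local mode)."""
    env = os.environ if env is None else env
    url = env.get("AGENTOPT_GATEWAY_URL", "").strip()
    return url or None


def backend_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Hypothetical helper: name the backend that would be selected."""
    return "daemon" if gateway_url(env) else "local"


print(backend_mode({}))
print(backend_mode({"AGENTOPT_GATEWAY_URL": "http://127.0.0.1:9000"}))
```

Passing the environment as an argument keeps the check easy to test without mutating os.environ.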

What you provide

All four setups share the same agent contract:

MyAgent.__init__(self, models) — receive a dict like {"planner": "gpt-4o", "solver": "gpt-4o-mini"} and build your agent. For routing, the dict is the initial model assignment; the router overrides per call.
MyAgent.run(self, input_data) — run on a single datapoint and return the output.

Selection additionally needs a dataset of (input, expected) pairs and an eval_fn(expected, actual) -> float; neither is required for routing.

Framework Compatibility

Working examples for the frameworks and CLI agents named above. Examples are organised into four quadrants under examples/: {selection, routing} × {local, daemon}.

Framework	Type	Selection	Routing
OpenAI Agents SDK	in-process	openai_sdk.py	openai_sdk.py
LangChain	in-process	langchain.py	langchain.py
LangGraph	in-process	langgraph.py	langgraph.py
CrewAI	in-process	crewai.py	crewai.py
LlamaIndex	in-process	llamaindex.py	llamaindex.py
AG2	in-process	ag2.py	ag2.py
OpenAI-Compatible API	in-process	custom_agent.py	custom_agent.py
Gemini CLI	subprocess	gemini_cli.py	gemini_cli.py
OpenHarness	subprocess	openharness.py	openharness.py
Terminal Bench	subprocess (Docker)	terminal_bench.py	terminal_bench.py
OpenClaw	subprocess	openclaw.py	openclaw.py

Selection Algorithms

AgentOpt includes a rich set of selection algorithms. Advanced users may get significant speedups by choosing the right method for their use case. See the documentation and advanced_algorithms.py for details.

If you do not need the strict best model combination and want lower search cost, epsilon_lucb is often a good choice: it stops once an ε-optimal arm is found (tune epsilon to trade off how close to optimal you need to be versus how many runs you spend).

`method=`	Best for	How it works
`"auto"` (default)	General use	Automatically finds the best combination (wired to `arm_elimination` — strong best-arm identification with lower search cost than `brute_force`)
`"brute_force"`	Small search spaces	Evaluates all combinations
`"random"`	Quick exploration	Samples a random fraction
`"hill_climbing"`	Topology-aware search	Greedy search using model quality/speed rankings
`"arm_elimination"`	Best-arm identification	Bandit; eliminates statistically dominated combinations
`"epsilon_lucb"`	Extra search cost savings when ε-optimal is enough	Bandit; stops when an epsilon-optimal best arm is identified
`"threshold"`	Thresholding objectives	Bandit; determines whether each combination is above/below a user-defined `threshold` on the performance metric (e.g., mean accuracy)
`"lm_proposal"`	LLM-guided search	Uses a proposer LLM to shortlist promising combinations
`"bayesian"`	Expensive evaluations	GP-based Bayesian optimization over categorical model choices; uses correlation between combinations (requires `pip install "agentopt-py[bayesian]"`)

# models is the {step: [candidates]} dict; eval_fn and dataset as in Quick Start §1.
selector = ModelSelector(
    agent=MyAgent, models=models, eval_fn=eval_fn, dataset=dataset,
    method="epsilon_lucb",
    epsilon=0.01,
)
results = selector.select_best(parallel=True)
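To give a feel for why bandit methods need fewer evaluations than brute force, here is a textbook successive-elimination loop. It is a generic sketch and not AgentOpt's implementation: the confidence radius is a plain Hoeffding bound, and the per-datapoint scores are precomputed toy values standing in for real agent runs.

```python
import math


def successive_elimination(scores, delta=0.05):
    """scores[arm][i] is the eval score in [0, 1] of combination `arm` on datapoint i.

    Returns (surviving arm indices, number of evaluations actually used).
    """
    n_arms = len(scores)
    n_max = len(scores[0])
    active = list(range(n_arms))
    totals = [0.0] * n_arms
    pulls = 0
    for n in range(1, n_max + 1):
        # Evaluate every surviving combination on the next datapoint.
        for arm in active:
            totals[arm] += scores[arm][n - 1]
            pulls += 1
        # Hoeffding radius with a union bound over arms and rounds.
        radius = math.sqrt(math.log(2 * n_arms * n_max / delta) / (2 * n))
        best_lower = max(totals[a] / n - radius for a in active)
        # Drop arms whose upper bound falls below the best lower bound.
        active = [a for a in active if totals[a] / n + radius >= best_lower]
        if len(active) == 1:
            break
    return active, pulls


scores = [
    [1.0] * 200,       # strong combination: always correct
    [1.0, 0.0] * 100,  # mediocre combination: correct half the time
    [0.0] * 200,       # weak combination: never correct
]
survivors, pulls = successive_elimination(scores)
print(survivors, pulls, "evaluations instead of", 3 * 200)
```

The weak arm is dropped first and the mediocre one later, so most of the budget is spent distinguishing the closest competitors.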

How It Works

Everything in AgentOpt builds on a single primitive: intercept every outbound LLM HTTP call. Selection, routing, tracking, and caching all hang off that one seam.

One primitive, two interception sites

Agent shape	Where we intercept	What we patch
In-process Python (LangChain, OpenAI SDK, …)	The HTTP library, before encryption	`httpx.Client.send`
Subprocess / CLI / Docker (Gemini CLI, Claude Code, Terminal Bench, …)	The network, via a local mitmproxy on a per-session port	`subprocess.Popen.__init__` (to inject `HTTPS_PROXY` + CA bundle)

in-process:                                subprocess:

  agent.run(input)                           agent.run(input)
   └── SDK (langchain/openai/…)               └── subprocess.run([...])  ← Popen patch
       └── httpx.Client.send()  ← patched          └── child process inherits HTTPS_PROXY
           └── LLM API                                └── mitmproxy on session port
                                                          ├── TLS-terminates with our CA
                                                          └── forwards to LLM API

The mitmproxy CA is generated once and merged with certifi's system CAs into a bundle at ~/.mitmproxy/agentopt-bundle.pem; the subprocess patch sets SSL_CERT_FILE / REQUESTS_CA_BUNDLE / NODE_EXTRA_CA_CERTS so the child trusts both. No agent code changes either way — the patches install when LLMTracker.start() (or the with LLMTracker: context) is entered, and uninstall on exit (refcounted so concurrent trackers don't interfere).
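The environment a child process ends up with can be pictured with the sketch below. `build_child_env` is a hypothetical helper, and the port number and the http://127.0.0.1 proxy URL form are assumptions; only the variable names and the bundle path come from the paragraph above.

```python
import os
from typing import Dict, Mapping, Optional


def build_child_env(proxy_port: int, ca_bundle: str,
                    base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with proxy and CA settings added."""
    env = dict(os.environ if base is None else base)
    env["HTTPS_PROXY"] = f"http://127.0.0.1:{proxy_port}"
    # Different runtimes read different variables to locate trusted CAs.
    for var in ("SSL_CERT_FILE", "REQUESTS_CA_BUNDLE", "NODE_EXTRA_CA_CERTS"):
        env[var] = ca_bundle
    return env


bundle = os.path.expanduser("~/.mitmproxy/agentopt-bundle.pem")
# Port 8899 is an arbitrary placeholder for the per-session proxy port.
env = build_child_env(proxy_port=8899, ca_bundle=bundle, base={"PATH": "/usr/bin"})
for key in sorted(env):
    print(f"{key}={env[key]}")
```

A dict like this could be passed as `env=` to `subprocess.run`; the Popen patch described above exists so that agent code does not have to do this by hand.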

Three capabilities on top of the seam

	What runs at the intercept	What it produces
Tracking	Record provider, model, tokens, latency, cache hit/miss	A `CallRecord` per LLM call
Caching	Hash request body → look up SQLite/in-memory cache → short-circuit on hit	Replays of cached responses, with original latency preserved
Routing	Run the active `Router.route(ctx)` to swap `body["model"]` before forwarding	Per-call model overrides
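The caching row can be illustrated with a small in-memory version of "hash the request body, look it up, short-circuit on a hit". This is a sketch under assumptions: `cache_key` and `ResponseCache` are hypothetical names, SHA-256 over canonical JSON is one reasonable hashing choice rather than a documented one, and a dict stands in for the SQLite store.

```python
import hashlib
import json


def cache_key(body: dict) -> str:
    """Hash a request body; sort_keys makes the key independent of dict order."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ResponseCache:
    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_call(self, body: dict, send):
        key = cache_key(body)
        if key in self._store:
            self.hits += 1
            return self._store[key]  # short-circuit: no upstream call
        self.misses += 1
        response = send(body)
        self._store[key] = response
        return response


def fake_send(body):
    """Stand-in for the upstream LLM API call."""
    return {"echo": body["messages"][-1]["content"]}


cache = ResponseCache()
request = {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "hi"}]}
cache.get_or_call(request, fake_send)
# Same body with keys in a different order still hits the cache.
cache.get_or_call(dict(reversed(list(request.items()))), fake_send)
print(cache.hits, cache.misses)
```

Because the model name is part of the body, routing a request to a different model naturally produces a different cache entry.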

Selection orchestrates the same primitive: for each combination it instantiates MyAgent(combo), runs run() over the dataset inside a tracking session, and ranks by aggregated accuracy / latency / cost. Smart algorithms (auto = arm-elimination by default) drop dominated combinations early so the cost scales sublinearly with the search space.
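The orchestration loop described above can be sketched in its brute-force form. This is an illustration, not AgentOpt's code: `EchoAgent` is a toy agent that follows the __init__(models) / run(input_data) contract, the model names "small" and "large" are placeholders, and the ranking uses accuracy only.

```python
from itertools import product
from statistics import mean


class EchoAgent:
    """Toy agent following the __init__(models) / run(input_data) contract."""

    def __init__(self, models):
        self.models = models

    def run(self, input_data):
        # Pretend only the "large" solver knows the answers.
        answers = {"capital of France?": "Paris", "2 + 2?": "4"}
        return answers[input_data] if self.models["solver"] == "large" else "unknown"


def eval_fn(expected, actual):
    return 1.0 if expected.lower() in str(actual).lower() else 0.0


def brute_force(agent_cls, models, dataset):
    steps = list(models)
    results = []
    for choice in product(*(models[s] for s in steps)):
        combo = dict(zip(steps, choice))
        agent = agent_cls(combo)  # one fresh agent per combination
        accuracy = mean(eval_fn(expected, agent.run(x)) for x, expected in dataset)
        results.append((accuracy, combo))
    # Highest accuracy first; a fuller ranking would also weigh latency and cost.
    return sorted(results, key=lambda r: r[0], reverse=True)


dataset = [("capital of France?", "Paris"), ("2 + 2?", "4")]
search_space = {"planner": ["small", "large"], "solver": ["small", "large"]}
ranking = brute_force(EchoAgent, search_space, dataset)
for accuracy, combo in ranking:
    print(f"{accuracy:.0%}  {combo}")
```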

Two backends, one API

Where does the mitmproxy state (cache, records, masters) live? You choose at runtime by setting one env var.

Mode	Selected when	Proxy state lives in	Multi-language / multi-process clients
Local	`AGENTOPT_GATEWAY_URL` unset	The Python process running `LLMTracker`	Subprocess agents only
Daemon	`AGENTOPT_GATEWAY_URL=http://host:port` set	The `agentopt serve` process	First-class — any client that respects `HTTPS_PROXY`

The user-facing API is byte-identical: ModelSelector(...).select_best(), tracker.track(), tracker.get_records(). In daemon mode the in-process httpx patch forwards through the daemon's per-session port instead of recording locally, and subprocess agents get the daemon's port injected into HTTPS_PROXY — the daemon does the cache + record on both paths.

What this buys you

Framework-agnostic. Anything that ships an LLM call over the wire works — no plugin per framework, no adapter per provider.
Subprocess agents are first-class. Claude Code, Gemini CLI, Terminal Bench, Docker-bound agents — all intercepted with no env-var plumbing in the agent's run().
Caching saves real money during iteration. Identical request bodies are deduplicated across runs.
State outlives a single experiment (daemon mode). Cache, providers, and (optionally) a default routing policy survive across runs and clients.
Routing and selection compose. Today selection picks a winning combination; tomorrow routing decides per call. Future versions can feed selection results into a learned router.

For the full architecture — _active_session_var ContextVar attribution, per-session masters, CA bundle plumbing, daemon control plane — see docs/api/proxy.md.

Results API

results = selector.select_best()

results.print_summary()               # formatted table
best = results.get_best()             # ModelResult with highest accuracy
combo = results.get_best_combo()      # {"planner": "gpt-4o", "solver": "gpt-4o-mini"}
results.to_csv("results.csv")         # export all results
results.export_config("config.yaml")  # export best combo as YAML

Advanced Usage

Custom model pricing — define pricing for self-hosted or custom models:

selector = ModelSelector(
    ...,
    model_prices={
        "my-custom-model": {"input_price": 2.50, "output_price": 10.00},
    },
)
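How such prices turn into a per-call cost depends on the unit, which this README does not state. The sketch below ASSUMES prices are in USD per 1,000,000 tokens, a common convention among providers; `call_cost` is a hypothetical helper, so check the documentation for the unit AgentOpt actually expects.

```python
def call_cost(prompt_tokens, completion_tokens, price, tokens_per_unit=1_000_000):
    """Cost of one call. ASSUMPTION: prices are USD per `tokens_per_unit` tokens."""
    return (
        prompt_tokens * price["input_price"]
        + completion_tokens * price["output_price"]
    ) / tokens_per_unit


price = {"input_price": 2.50, "output_price": 10.00}
print(call_cost(prompt_tokens=1_200, completion_tokens=300, price=price))
```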

Custom cache directory — LLM response caching is enabled by default (.agentopt_cache/). To customize:

from agentopt import LLMTracker

tracker = LLMTracker(cache_dir="./my_cache")
selector = ModelSelector(..., tracker=tracker)
results = selector.select_best()  # cache flushed automatically

Using prebuilt LLM instances — pass framework-specific LLM objects instead of model name strings:

from langchain_openai import ChatOpenAI

selector = ModelSelector(
    agent=MyAgent,
    models={
        "planner": [ChatOpenAI(model="gpt-4o"), ChatOpenAI(model="gpt-4o-mini")],
        "solver":  [ChatOpenAI(model="gpt-4o"), ChatOpenAI(model="gpt-4o-mini")],
    },
    eval_fn=eval_fn,
    dataset=dataset,
)

Documentation

Full documentation at agentoptimizer.github.io/agentopt — including the Selectors, Router, Tracker, and Results API references, plus guides on how it works and response caching.

License

Apache 2.0
