This repository contains the experimental harness we use to study goal drift in large language models. The code simulates a months-long investment management scenario where a model starts with one objective (e.g., maximize profit) and is later pressured toward a conflicting target (e.g., minimize carbon emissions). By replaying and branching these interactive episodes we can measure how different prompting conditions, distractions, or instrumental goals influence whether the model sticks to its stated goal.
- `run_simulation.py` - command-line entry point for running one or more simulation episodes and writing checkpoints/results.
- `manage_simulation.py` - orchestrates interaction with the model API, including logging, branching, and checkpoint recovery.
- `simulation/` - core simulation logic shared across experiments. `default_simulation.py` defines the hedge-fund environment and tool API, while `strategies/` encodes goal-pair-specific dynamics.
- `data/` - scripted market data, emails, and distractions surfaced to the model during each quarter.
- `utils/` - helper utilities (checkpoint loading, token accounting, tool definitions for OpenAI and Anthropic models, etc.).
- `plots*.py`, `plot_*` scripts, and the `plots/` and `results/` directories - utilities for analysing saved checkpoints and visualising metrics after runs.
- `run_simulations_*.sh` - example batch scripts for sweeping over multiple instrumental-step counts and models.
- Python 3.12 (matching the version used to lock `requirements.txt`).
- API keys for the models you plan to evaluate:
  - `OPENAI_API_KEY` for GPT-family models.
  - `ANTHROPIC_API_KEY` for Claude-family models.
- Sufficient credits/quotas on the chosen API providers; simulations can involve long prompts and many tool calls.
Export the keys before running experiments, for example:

```shell
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
```

- (Recommended) Create a virtual environment:

  ```shell
  python3.12 -m venv .venv
  source .venv/bin/activate
  ```

- Install dependencies:

  ```shell
  pip install --upgrade pip
  pip install -r requirements.txt
  ```
The primary workflow is driven by `run_simulation.py`.
```shell
python run_simulation.py \
  --model gpt-4o-mini \
  --runs 1 2 3 \
  --source profit \
  --target env \
  --num_steps 30
```

This launches runs 1-3, in which the model begins with a profit goal and is later pressured toward an environmental goal across 30 quarterly steps.
If you want an easy setup, you can use:
```shell
python run_simulation.py \
  --model gpt-4o-mini \
  --runs 1 \
  --source env \
  --target profit \
  --num_steps 30 \
  --easy_setup \
  --easy_name 4omini \
  --baseline
```

If this is a baseline run, a folder named `{source_goal}_{instr_goal}_{model_name}_{num_step}_steps_baseline` is created under `checkpoints/results`. For example, if the system goal is emissions minimization (`env`), the instrumental (target) goal is profit maximization (`profit`), the model is gpt-4o-mini, 30 steps are taken, and it is a baseline run, the folder created will be named `env_profit_4omini_30_steps_baseline`.

If it is an experimental run, a folder named `{source_goal}_{instr_goal}_{num_step}_steps_{model_name}_on_{model_name}` is created under `checkpoints/results` instead.
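The naming scheme above can be captured in a small helper when pointing analysis scripts at the right folder. This is a hypothetical sketch (the repository constructs these paths internally); `results_folder` is not part of the codebase:

```python
import os

def results_folder(source, instr, model_name, num_steps, baseline):
    """Reconstruct the results folder path described above (hypothetical helper)."""
    if baseline:
        name = f"{source}_{instr}_{model_name}_{num_steps}_steps_baseline"
    else:
        name = f"{source}_{instr}_{num_steps}_steps_{model_name}_on_{model_name}"
    return os.path.join("checkpoints", "results", name)
```

For instance, `results_folder("env", "profit", "4omini", 30, baseline=True)` reproduces the `env_profit_4omini_30_steps_baseline` example above.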
To analyze the results, use `load_experiments(path_to_result_folder)` from `plots.py`.
Key options:
- `--runs n1 n2 ...` or `--run_range start end` - choose specific run IDs to execute.
- `--model` - any OpenAI or Anthropic model string recognised by `config.get_api_client` (e.g., `claude-3-5-sonnet-latest`).
- `--source` / `--target` - goal pairing to study (`profit`, `env`, `health`, `privacy`). Source and target must differ.
- `--num_steps` - maximum quarters to simulate.
- `--num_instrumental_steps` - number of steps the agent is forced to pursue an instrumental (often conflicting) objective before reverting.
- `--use_soft_system_prompt` - swap in a less explicit system prompt describing the goal.
- `--distractions`, `--ood`, `--tool_call_malfunction`, `--empty_portfolio` - enable scenario perturbations (extra help requests, scrambled histories, failing tools, or portfolio resets).
- `--interrogate` / `--remind_model_about_goal` - inject reflective questions or reminders mid-run.
- `--baseline` - mark a run as a baseline configuration to compare with drifted variants.
- `--parallel` - fan runs out across processes (one per CPU core by default).
- `--easy_setup` - create a results directory with runs inside a correctly named folder.
- `--easy_name` - short model name label: `4omini`, `sonnet`, `haiku`, or `4o`.
- Checkpoints default to `checkpoints/checkpoint_run{n}_{step}.pkl`; they store the message transcript and simulation state.
- Resume a partially completed run with `--resume`, or branch from a prior trajectory with `--branch_from RUN TIMESTEP`.
- Use `--extract_checkpoint` alongside `--branch_from` to dump the saved state and transcript without running a new simulation.
- Per-run transcripts are mirrored to `logs_<model>/task1_output_run{n}.txt`; aggregate experiment results land in `results.json` (or the file passed via `--results_file`).
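Given the `checkpoint_run{n}_{step}.pkl` naming above, a helper like the following can locate the newest checkpoint for a run before resuming or branching. This is a hypothetical sketch, not a function shipped with the repository:

```python
import glob
import os
import re

def latest_checkpoint(checkpoint_dir, run):
    """Return the highest-step checkpoint path for a run, or None if absent."""
    paths = glob.glob(os.path.join(checkpoint_dir, f"checkpoint_run{run}_*.pkl"))

    def step(path):
        # Extract the {step} suffix from checkpoint_run{n}_{step}.pkl.
        match = re.search(r"_(\d+)\.pkl$", path)
        return int(match.group(1)) if match else -1

    return max(paths, key=step, default=None)
```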
The `run_simulations_*.sh` scripts show how we sweep across instrumental-step counts for specific models. Update the variables at the top of the script (model name, goal pair, step counts) and execute it once your environment variables and virtualenv are configured.
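If you prefer Python over shell for such sweeps, the same idea can be sketched as a loop that assembles one `run_simulation.py` command line per instrumental-step count. The flag names mirror the options above; actually launching the runs (e.g., via `subprocess.run`) is left commented out:

```python
def sweep_commands(model, source, target, instrumental_steps, num_steps=30, runs=(1, 2, 3)):
    """Build one run_simulation.py command per instrumental-step count (sketch)."""
    commands = []
    for instr in instrumental_steps:
        cmd = ["python", "run_simulation.py",
               "--model", model,
               "--runs", *map(str, runs),
               "--source", source,
               "--target", target,
               "--num_steps", str(num_steps),
               "--num_instrumental_steps", str(instr)]
        commands.append(cmd)
        # subprocess.run(cmd, check=True)  # uncomment to launch for real
    return commands
```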
Several helper scripts operate on saved checkpoints:
- `stated_goal_drift_experiment.py` quantifies drift alignment (DA) and instrumental alignment (DI) scores from checkpoint folders.
- `process_checkpoints.py` and `generate_logs.py` assist with extracting transcripts and metrics.
- `plots.py`, `plots_no_conditioning.py`, and `stated_goal_drift_plots.py` generate Matplotlib visualisations of aggregated scores. Figures are written under `plots/`.
Adjust these scripts' arguments at the top of each file or via argparse flags, ensuring `--checkpoint_dir` matches where you saved simulation state.
- Long simulations can incur substantial API latency and cost; start with a small `--num_steps` and a single run to validate your setup.
- When experimenting with Anthropic models, consider toggling `--condition_claude_on_gpt` or `--condition_gpt_on_claude` to reuse message histories across providers.
- Keep an eye on `logs_<model>/task1_output_run*.txt` for warnings about tool failures or budget-limit errors, which often explain drift metrics.
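To triage a batch of runs, a quick scan of the mirrored transcripts for such warnings might look like the following sketch. The keyword list is an assumption; adjust it to whatever your logs actually report:

```python
import glob
import os

def scan_logs(log_dir, keywords=("tool", "budget", "error")):
    """Return {filename: [matching lines]} for transcript files in log_dir."""
    hits = {}
    for path in sorted(glob.glob(os.path.join(log_dir, "task1_output_run*.txt"))):
        with open(path, encoding="utf-8") as handle:
            matches = [line.rstrip() for line in handle
                       if any(k in line.lower() for k in keywords)]
        if matches:
            hits[os.path.basename(path)] = matches
    return hits
```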
With the environment prepared and the CLI understood, you can iterate on prompting strategies, collect checkpoints, and feed the resulting logs through the analysis scripts to study how your models handle conflicting objectives.