This repository contains the experimental harness we use to study goal drift in large language models. The code simulates a months-long investment management scenario where a model starts with one objective (e.g., maximize profit) and is later pressured toward a conflicting target (e.g., minimize carbon emissions). By replaying and branching these interactive episodes we can measure how different prompting conditions, distractions, or instrumental goals influence whether the model sticks to its stated goal.
- `run_simulation.py` - command-line entry point for running one or more simulation episodes and writing checkpoints/results.
- `manage_simulation.py` - orchestrates interaction with the model API, including logging, branching, and checkpoint recovery.
- `simulation/` - core simulation logic shared across experiments. `default_simulation.py` defines the hedge-fund environment and tool API, while `strategies/` encodes goal-pair-specific dynamics.
- `data/` - scripted market data, emails, and distractions surfaced to the model during each quarter.
- `utils/` - helper utilities (checkpoint loading, token accounting, tool definitions for OpenAI and Anthropic models, etc.).
- `plots*.py`, `plot_*` scripts, and the `plots/` and `results/` directories - utilities for analysing saved checkpoints and visualising metrics after runs.
- `run_simulations_*.sh` - example batch scripts for sweeping over multiple instrumental-step counts and models.
- Python 3.12 (matching the version used to lock `requirements.txt`).
- API keys for the models you plan to evaluate:
  - `OPENAI_API_KEY` for GPT-family models.
  - `ANTHROPIC_API_KEY` for Claude-family models.
- Sufficient credits/quotas on the chosen API providers; simulations can involve long prompts and many tool calls.
Export the keys before running experiments, for example:

```shell
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
```

- (Recommended) Create a virtual environment:

  ```shell
  python3.12 -m venv .venv
  source .venv/bin/activate
  ```

- Install dependencies:

  ```shell
  pip install --upgrade pip
  pip install -r requirements.txt
  ```
The primary workflow is driven by `run_simulation.py`.
```shell
python run_simulation.py \
  --model gpt-4o-mini \
  --runs 1 2 3 \
  --source profit \
  --target env \
  --num_steps 30
```

This launches runs 1-3, in which the model begins with a profit goal and is later pressured toward an environmental goal across 30 quarterly steps.
If you want an easy setup, you can use:
```shell
python run_simulation.py \
  --model gpt-4o-mini \
  --runs 1 \
  --source env \
  --target profit \
  --num_steps 30 \
  --easy_setup \
  --easy_name 4omini \
  --baseline
```

If this is a baseline run, a folder named `{source_goal}_{instr_goal}_{model_name}_{num_step}_steps_baseline` is created under `checkpoints/results`. For example, if the system goal is emissions minimization (`env`), the instrumental (target) goal is profit maximization (`profit`), the model is gpt-4o-mini, 30 steps are taken, and it is a baseline run, the folder created will be named `env_profit_4omini_30_steps_baseline`.

If it is an experimental run, a folder named `{source_goal}_{instr_goal}_{num_step}_steps_{model_name}_on_{model_name}` is created under `checkpoints/results` instead.
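The naming scheme above can be captured in a small helper when pointing analysis scripts at the right folder. This is a hypothetical sketch (the repository constructs these paths internally); `results_folder` is not part of the codebase:

```python
import os

def results_folder(source, instr, model_name, num_steps, baseline):
    """Reconstruct the results folder path described above (hypothetical helper)."""
    if baseline:
        name = f"{source}_{instr}_{model_name}_{num_steps}_steps_baseline"
    else:
        name = f"{source}_{instr}_{num_steps}_steps_{model_name}_on_{model_name}"
    return os.path.join("checkpoints", "results", name)
```

For instance, `results_folder("env", "profit", "4omini", 30, baseline=True)` reproduces the `env_profit_4omini_30_steps_baseline` example above.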
To analyze the results, use `load_experiments(path_to_result_folder)` from `plots.py`.
Key options:
- `--runs n1 n2 ...` or `--run_range start end` - choose specific run IDs to execute.
- `--model` - any OpenAI or Anthropic model string recognised by `config.get_api_client` (e.g., `claude-3-5-sonnet-latest`).
- `--source` / `--target` - goal pairing to study (`profit`, `env`, `health`, `privacy`). Source and target must differ.
- `--num_steps` - maximum quarters to simulate.
- `--num_instrumental_steps` - number of steps the agent is forced to pursue an instrumental (often conflicting) objective before reverting.
- `--use_soft_system_prompt` - swap in a less explicit system prompt describing the goal.
- `--distractions`, `--ood`, `--tool_call_malfunction`, `--empty_portfolio` - enable scenario perturbations (extra help requests, scrambled histories, failing tools, or portfolio resets).
- `--interrogate` / `--remind_model_about_goal` - inject reflective questions or reminders mid-run.
- `--baseline` - mark a run as a baseline configuration to compare with drifted variants.
- `--parallel` - fan runs out across processes (one per CPU core by default).
- `--easy_setup` - create a results directory with runs inside a correctly named folder.
- `--easy_name` - short model name label: `4omini`, `sonnet`, `haiku`, or `4o`.
- Checkpoints default to `checkpoints/checkpoint_run{n}_{step}.pkl`; they store the message transcript and simulation state.
- Resume a partially completed run with `--resume`, or branch from a prior trajectory with `--branch_from RUN TIMESTEP`.
- Use `--extract_checkpoint` alongside `--branch_from` to dump the saved state and transcript without running a new simulation.
- Per-run transcripts are mirrored to `logs_<model>/task1_output_run{n}.txt`; aggregate experiment results land in `results.json` (or the file passed via `--results_file`).
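Given the `checkpoint_run{n}_{step}.pkl` naming above, a helper like the following can locate the newest checkpoint for a run before resuming or branching. This is a hypothetical sketch, not a function shipped with the repository:

```python
import glob
import os
import re

def latest_checkpoint(checkpoint_dir, run):
    """Return the highest-step checkpoint path for a run, or None if absent."""
    paths = glob.glob(os.path.join(checkpoint_dir, f"checkpoint_run{run}_*.pkl"))

    def step(path):
        # Extract the {step} suffix from checkpoint_run{n}_{step}.pkl.
        match = re.search(r"_(\d+)\.pkl$", path)
        return int(match.group(1)) if match else -1

    return max(paths, key=step, default=None)
```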
The `run_simulations_*.sh` scripts show how we sweep across instrumental-step counts for specific models. Update the variables at the top of the script (model name, goal pair, step counts) and execute it once your environment variables and virtualenv are configured.
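If you prefer Python over shell for such sweeps, the same idea can be sketched as a loop that assembles one `run_simulation.py` command line per instrumental-step count. The flag names mirror the options above; actually launching the runs (e.g., via `subprocess.run`) is left commented out:

```python
def sweep_commands(model, source, target, instrumental_steps, num_steps=30, runs=(1, 2, 3)):
    """Build one run_simulation.py command per instrumental-step count (sketch)."""
    commands = []
    for instr in instrumental_steps:
        cmd = ["python", "run_simulation.py",
               "--model", model,
               "--runs", *map(str, runs),
               "--source", source,
               "--target", target,
               "--num_steps", str(num_steps),
               "--num_instrumental_steps", str(instr)]
        commands.append(cmd)
        # subprocess.run(cmd, check=True)  # uncomment to launch for real
    return commands
```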
Several helper scripts operate on saved checkpoints:
- `stated_goal_drift_experiment.py` quantifies drift alignment (DA) and instrumental alignment (DI) scores from checkpoint folders.
- `process_checkpoints.py` and `generate_logs.py` assist with extracting transcripts and metrics.
- `plots.py`, `plots_no_conditioning.py`, and `stated_goal_drift_plots.py` generate Matplotlib visualisations of aggregated scores. Figures are written under `plots/`.
Adjust these scripts' arguments at the top of each file or via argparse flags, ensuring `--checkpoint_dir` matches where you saved simulation state.
- Long simulations can incur substantial API latency and cost; start with a small `--num_steps` and a single run to validate your setup.
- When experimenting with Anthropic models, consider toggling `--condition_claude_on_gpt` or `--condition_gpt_on_claude` to reuse message histories across providers.
- Keep an eye on `logs_<model>/task1_output_run*.txt` for warnings about tool failures or budget-limit errors, which often explain drift metrics.
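To triage a batch of runs, a quick scan of the mirrored transcripts for such warnings might look like the following sketch. The keyword list is an assumption; adjust it to whatever your logs actually report:

```python
import glob
import os

def scan_logs(log_dir, keywords=("tool", "budget", "error")):
    """Return {filename: [matching lines]} for transcript files in log_dir."""
    hits = {}
    for path in sorted(glob.glob(os.path.join(log_dir, "task1_output_run*.txt"))):
        with open(path, encoding="utf-8") as handle:
            matches = [line.rstrip() for line in handle
                       if any(k in line.lower() for k in keywords)]
        if matches:
            hits[os.path.basename(path)] = matches
    return hits
```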
With the environment prepared and the CLI understood, you can iterate on prompting strategies, collect checkpoints, and feed the resulting logs through the analysis scripts to study how your models handle conflicting objectives.