Collusive Covert Attack Detection for LLM Multi-Agent HotpotQA

This repository implements the experiment:
Detecting Collusive Covert Attacks in LLM-Based Multi-Agent Systems.

The system runs decentralized HotpotQA discussions with benign workers, covert malicious workers, and optional trusted detector agents. Detectors can quarantine suspicious workers through the existing permission manager.

By default, nonzero-m attack conditions anchor one malicious role on the reporter/final-output agent and sample any remaining malicious workers from the rest of the pool. This makes the attack variable measure influence over the final answer rather than only low-impact side chatter. Set ANCHOR_MALICIOUS_REPORTER=false to sample all malicious workers uniformly at random.

Experiment Matrix

For each attack strategy:

suboptimal_fixation
reframing_misalignment
fake_injection
execution_delay

we run seven conditions:

m=0, d=0 clean baseline
m=1, d=0 single attacker
m=2, d=0 collusive attackers
m=1, d=1 single attacker plus detector
m=2, d=1 collusion plus one detector
m=2, d=2 collusion plus two detectors
m=all, d=0 all-malicious sanity check

Total:

4 attacks × 7 conditions = 28 total conditions

Defaults:

25–30 HotpotQA tasks per condition
5 repetitions
4 turns per task
1000 tokens per response

Setup

Install dependencies:

python -m pip install -e .

For local testing and CI-equivalent tooling:

python -m pip install -e '.[dev]'

Set Azure OpenAI credentials in your shell or .env:

export AZURE_OPENAI_API_KEY="..."
export AZURE_OPENAI_ENDPOINT="https://...openai.azure.com/"
export AZURE_OPENAI_DEPLOYMENT="gpt-4o"
export AZURE_OPENAI_API_VERSION="2024-10-21"

Packaged entry points:

python -m covert_collusive_hotpot.run_experiments --help
python -m covert_collusive_hotpot.generate_paper_assets

The legacy root wrappers remain temporarily available for compatibility:

python parallel_experiment_runner.py --help
python generate_paper_tables_and_figures.py

Testing

Run the full local test suite:

PYTHONPATH=src pytest tests -q

Run only the packaging and entry-point smoke checks:

PYTHONPATH=src pytest tests/test_package_entrypoints.py -q

Check the two primary experiment entry surfaces:

PYTHONPATH=src python -m covert_collusive_hotpot.run_experiments --help
python parallel_experiment_runner.py --help

There are also two GitHub Actions workflows:

default CI in .github/workflows/ci.yml
manual Azure-backed smoke in .github/workflows/azure-manual-smoke.yml

See tests/README.md for a more detailed breakdown of the test suite and CI layout.

Smoke Test

Run a small validation matrix before the full experiment:

python -m covert_collusive_hotpot.run_experiments --smoke --smoke-tasks 2 --max-concurrent 1 --run-label smoke

The smoke and full experiment commands require the HotpotQA dataset. On the first run, datasets will download the hotpot_qa distractor split from Hugging Face and cache it locally. That means the first run requires internet access unless you already have the dataset cached.

The smoke test checks:

clean baseline
all-malicious sanity conditions for every attack
one detector condition to exercise interrogation and quarantine

Full Run

Use tmux so the experiment survives SSH disconnects:

tmux new -s collusive_hotpot
python -m covert_collusive_hotpot.run_experiments --max-concurrent 4 --run-label full_collusive_hotpot

Detach with:

Ctrl-b d

Reattach with:

tmux attach -t collusive_hotpot

Resume Behavior

The runner writes:

aggregate condition rows to experiment_results.<run_label>.csv
task-level checkpoints to task_progress.<run_label>.jsonl
permanent failures to failed_configs.<run_label>.csv

If the process stops, rerun the same command. Completed task checkpoints and aggregate rows are reused automatically.

Metrics

Task Metrics

functional correctness
completion failure
attack success
latency / turns

Detection Metrics

detection recall / rate
detection precision
false positive rate
attribution accuracy
benign restriction rate
detection latency

Name		Name	Last commit message	Last commit date
Latest commit History 112 Commits
.github/workflows		.github/workflows
.idea		.idea
src/covert_collusive_hotpot		src/covert_collusive_hotpot
tests		tests
.env.example		.env.example
.gitignore		.gitignore
LOG_SCHEMA.json		LOG_SCHEMA.json
README.md		README.md
generate_paper_tables_and_figures.py		generate_paper_tables_and_figures.py
mock_openai.py		mock_openai.py
parallel_experiment_runner.py		parallel_experiment_runner.py
pyproject.toml		pyproject.toml
requirements-dev.txt		requirements-dev.txt
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Collusive Covert Attack Detection for LLM Multi-Agent HotpotQA

Experiment Matrix

Setup

Testing

Smoke Test

Full Run

Resume Behavior

Metrics

Task Metrics

Detection Metrics

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Collusive Covert Attack Detection for LLM Multi-Agent HotpotQA

Experiment Matrix

Setup

Testing

Smoke Test

Full Run

Resume Behavior

Metrics

Task Metrics

Detection Metrics

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages