PILLAR Go Evaluation Framework

This repository contains the evaluation framework for assessing the performance of PILLAR, in particular its LINDDUN Go functionality.

It comprises scripts for running threat elicitation evaluations using various LLMs, compiling results, evaluating reasoning quality, and visualizing performance metrics, as well as an ablation study investigating the effect of multi-agent discussion rounds.

The framework relies on the core PILLAR logic and prompts, ensuring consistency with the main system.

Table of Contents

  • Installation
  • Environment Variables
  • Project Structure
  • Main Scripts
  • Visualization Tools
  • Ablation Study
  • Benchmark Structure
  • License

Installation

# Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Environment Variables

Create a .env file in the project root with your API keys:

OPENAI_API_KEY=your_openai_api_key
ANTHROPIC_API_KEY=your_anthropic_api_key
GEMINI_API_KEY=your_gemini_api_key  # Optional, for Gemini-based evaluation

For local models via Ollama, ensure Ollama is running locally at http://localhost:11434.
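The evaluation scripts are assumed to read these keys from the environment; a minimal sketch, assuming python-dotenv is used to load the .env file (an assumption, not confirmed by the repository):

import os
from dotenv import load_dotenv  # assumption: python-dotenv is available

load_dotenv()  # reads .env from the project root
openai_key = os.getenv("OPENAI_API_KEY")
anthropic_key = os.getenv("ANTHROPIC_API_KEY")
gemini_key = os.getenv("GEMINI_API_KEY")  # may be None if not set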


Project Structure

├── main.py                      # Main evaluation runner
├── results.py                   # Results compilation and metrics calculation
├── results_viewer.py            # Interactive visualization dashboard
├── boxplot_viewer.py            # Boxplot visualization for reason scores
├── table.py                     # LaTeX table generator for metrics
├── calculate_spearman.py        # Spearman correlation between judges
├── reason_evaluation.py         # GPT-4o-based reason quality evaluation
├── reason_evaluation_gemini.py  # Gemini-based reason quality evaluation
├── requirements.txt             # Python dependencies
├── llms/
│   ├── linddun_go.py            # PILLAR inner logic
│   └── prompts.py               # System and user prompts from PILLAR
├── misc/
│   ├── deck.json                # LINDDUN Go card deck definition
│   └── utils.py                 # Utility functions
├── benchmarks/                  # Benchmark scenarios and results
│   ├── benchmark1/
│   │   ├── benchmark.json       # Ground truth threats
│   │   ├── description.json     # Application description
│   │   ├── database.csv         # Database schema
│   │   └── results*/            # Model evaluation results
│   ├── benchmark2/
│   └── benchmark3/
└── ablation-study/              # Ablation study for multi-agent rounds
    ├── run_ablation_study.py    # Run ablation experiments
    ├── analyze_ablation_results.py  # Analyze ablation results
    └── benchmarks/              # Ablation-specific benchmarks

Main Scripts

main.py - Main Evaluation Runner

Runs the LINDDUN Go threat elicitation evaluation for specified models.

# Single-agent evaluation
python main.py

# Multi-agent evaluation with 3 discussion rounds
python main.py --multiagent 3

Configuration (edit in main.py):

  • models_to_test: List of tuples (model_name, provider) where provider is "openai", "anthropic", or "ollama"
  • card_range: Range of LINDDUN Go cards to evaluate (default: 0-33)
  • temperature: LLM temperature setting (default: 0.7)
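For illustration, the corresponding configuration in main.py might look like the sketch below; the model names are hypothetical examples, not necessarily the defaults shipped with the framework:

models_to_test = [
    ("gpt-4o", "openai"),                   # hypothetical example entry
    ("claude-3-5-sonnet-latest", "anthropic"),
    ("llama3", "ollama"),                   # served locally via Ollama
]
card_range = range(0, 34)  # LINDDUN Go cards 0-33
temperature = 0.7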

Output: Results are saved in benchmarks/<benchmark>/results<model_name>/:

  • found_threats.json: Raw model responses
  • metrics_results.csv: Performance metrics

results.py - Results Compilation

Compiles evaluation results by comparing found threats against ground truth and calculating metrics.

Metrics calculated:

  • True/False Positives and Negatives
  • Precision, Recall, Specificity, Accuracy, F1 Score
  • Optional: Reason score statistics
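The metrics follow the standard confusion-matrix definitions; a minimal sketch of how they can be computed from the raw counts (illustrative, not necessarily the exact code in results.py):

def compute_metrics(tp, fp, tn, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "specificity": specificity,
            "accuracy": accuracy, "f1": f1}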

reason_evaluation.py - DeepEval GPT-4o Reason Evaluation

Evaluates the quality of reasoning provided by models using GPT-4o as a judge via the DeepEval framework.

python reason_evaluation.py

Output: deepeval.json files in each results directory containing:

  • Relevance scores for each threat's reasoning
  • Concordance information (whether model agreed with ground truth)
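A minimal sketch of a DeepEval G-Eval judgment with GPT-4o as the judge; the criteria text and test-case fields below are placeholders, not the exact ones used by reason_evaluation.py:

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

relevance = GEval(
    name="Relevance",
    criteria="Assess whether the reasoning justifies the threat decision.",  # placeholder criteria
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model="gpt-4o",
)
test_case = LLMTestCase(
    input="<card question and application description>",
    actual_output="<model's stated reason for flagging or not flagging the threat>",
)
relevance.measure(test_case)
print(relevance.score, relevance.reason)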

reason_evaluation_gemini.py - DeepEval Gemini Reason Evaluation

Alternative reason evaluation using Google's Gemini model as the judge.

python reason_evaluation_gemini.py

Output: deepeval-gemini.json files with a structure similar to the GPT-4o evaluation output.


calculate_spearman.py - Judge Correlation Analysis

Calculates Spearman rank correlations between GPT-4o and Gemini judges to assess inter-rater reliability.

python calculate_spearman.py

Contains pre-computed scores for both judges and calculates correlations for:

  • Overall mean scores
  • Concordant cases (model agreed with ground truth)
  • Discordant cases (model disagreed with ground truth)
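A minimal sketch of the underlying computation using SciPy; the score lists below are placeholders for the pre-computed judge scores embedded in the script:

from scipy.stats import spearmanr

gpt4o_scores = [0.8, 0.6, 0.9]   # placeholder values
gemini_scores = [0.7, 0.5, 0.9]  # placeholder values

rho, p_value = spearmanr(gpt4o_scores, gemini_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")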

table.py - LaTeX Table Generator

Generates LaTeX tables summarizing overall performance metrics across all benchmarks.

python table.py

Output:

  • Prints LaTeX table to console
  • Saves to metrics_table.tex

The table pairs base models with their multi-agent variants for easy comparison.


Visualization Tools

results_viewer.py - Interactive Dashboard

A matplotlib-based interactive dashboard for exploring evaluation results.

python results_viewer.py

Features:

  • Switch between benchmarks and overall aggregated view
  • View different metrics (Precision, Recall, F1, Accuracy, etc.)
  • Filter by concordance (all/concordant/discordant cases)
  • Sort models by performance or alphabetically
  • Bar charts with model grouping (base vs multi-agent)

boxplot_viewer.py - Reason Score Distribution

Visualizes the distribution of reason quality scores across models using boxplots.

python boxplot_viewer.py

Features:

  • Aggregates reason scores from deepeval.json files
  • Filter by concordance status
  • Compare score distributions across all models

Ablation Study

The ablation-study/ directory contains scripts for investigating how the number of multi-agent discussion rounds affects performance.

run_ablation_study.py

Runs experiments with varying numbers of deliberation rounds.

cd ablation-study
python run_ablation_study.py

Configuration (edit in script):

  • ABLATION_MODELS: List of models to test
  • ABLATION_ROUNDS: Number of rounds to test (default: [1, 2, 3])
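An illustrative sketch of these settings; the entry format and model names are assumptions and may differ from the actual script:

ABLATION_MODELS = [
    ("gpt-4o-mini", "openai"),   # hypothetical example entry
]
ABLATION_ROUNDS = [1, 2, 3]      # numbers of deliberation rounds to compare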

analyze_ablation_results.py

Analyzes convergence patterns and performance trends from ablation experiments.

cd ablation-study
python analyze_ablation_results.py

Analyzes:

  • Convergence patterns across rounds
  • Vote trajectory and stability
  • Performance metrics by round count

Benchmark Structure

Each benchmark folder contains:

  • description.json: Application description, data types, data flow diagram (DFD), policies
  • database.csv: Database schema with data types and properties
  • benchmark.json: Ground truth threat responses (generated or curated)
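A minimal sketch of loading one benchmark folder; the field names inside the JSON files are not shown here and depend on the actual schema:

import csv
import json
from pathlib import Path

bench = Path("benchmarks/benchmark1")
description = json.loads((bench / "description.json").read_text())  # application description, DFD, policies
ground_truth = json.loads((bench / "benchmark.json").read_text())   # ground truth threats
with open(bench / "database.csv", newline="") as f:
    schema = list(csv.DictReader(f))                                 # database schema rows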

License

This project is licensed under the Apache License, Version 2.0 - see the LICENSE file for details.
