PILLAR Go Evaluation Framework

This repository contains the evaluation framework for assessing the performance of PILLAR, in particular its LINDDUN Go functionality.

It comprises scripts for running threat elicitation evaluations using various LLMs, compiling results, evaluating reasoning quality, and visualizing performance metrics, as well as an ablation study investigating the effect of multi-agent discussion rounds.

The framework relies on the core PILLAR logic and prompts, ensuring consistency with the main system.

Table of Contents

  • Installation
  • Environment Variables
  • Project Structure
  • Main Scripts
  • Visualization Tools
  • Ablation Study
  • Benchmark Structure
  • License

Installation

# Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Environment Variables

Create a .env file in the project root with your API keys:

OPENAI_API_KEY=your_openai_api_key
ANTHROPIC_API_KEY=your_anthropic_api_key
GEMINI_API_KEY=your_gemini_api_key  # Optional, for Gemini-based evaluation

For local models via Ollama, ensure Ollama is running locally at http://localhost:11434.
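The evaluation scripts are assumed to read these keys from the environment; a minimal sketch, assuming python-dotenv is used to load the .env file (an assumption, not confirmed by the repository):

import os
from dotenv import load_dotenv  # assumption: python-dotenv is available

load_dotenv()  # reads .env from the project root
openai_key = os.getenv("OPENAI_API_KEY")
anthropic_key = os.getenv("ANTHROPIC_API_KEY")
gemini_key = os.getenv("GEMINI_API_KEY")  # may be None if not set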


Project Structure

├── main.py                      # Main evaluation runner
├── results.py                   # Results compilation and metrics calculation
├── results_viewer.py            # Interactive visualization dashboard
├── boxplot_viewer.py            # Boxplot visualization for reason scores
├── table.py                     # LaTeX table generator for metrics
├── calculate_spearman.py        # Spearman correlation between judges
├── reason_evaluation.py         # GPT-4o-based reason quality evaluation
├── reason_evaluation_gemini.py  # Gemini-based reason quality evaluation
├── requirements.txt             # Python dependencies
├── llms/
│   ├── linddun_go.py            # PILLAR inner logic
│   └── prompts.py               # System and user prompts from PILLAR
├── misc/
│   ├── deck.json                # LINDDUN Go card deck definition
│   └── utils.py                 # Utility functions
├── benchmarks/                  # Benchmark scenarios and results
│   ├── benchmark1/
│   │   ├── benchmark.json       # Ground truth threats
│   │   ├── description.json     # Application description
│   │   ├── database.csv         # Database schema
│   │   └── results*/            # Model evaluation results
│   ├── benchmark2/
│   └── benchmark3/
└── ablation-study/              # Ablation study for multi-agent rounds
    ├── run_ablation_study.py    # Run ablation experiments
    ├── analyze_ablation_results.py  # Analyze ablation results
    └── benchmarks/              # Ablation-specific benchmarks

Main Scripts

main.py - Main Evaluation Runner

Runs the LINDDUN Go threat elicitation evaluation for specified models.

# Single-agent evaluation
python main.py

# Multi-agent evaluation with 3 discussion rounds
python main.py --multiagent 3

Configuration (edit in main.py):

  • models_to_test: List of tuples (model_name, provider) where provider is "openai", "anthropic", or "ollama"
  • card_range: Range of LINDDUN Go cards to evaluate (default: 0-33)
  • temperature: LLM temperature setting (default: 0.7)
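For illustration, the corresponding configuration in main.py might look like the sketch below; the model names are hypothetical examples, not necessarily the defaults shipped with the framework:

models_to_test = [
    ("gpt-4o", "openai"),                   # hypothetical example entry
    ("claude-3-5-sonnet-latest", "anthropic"),
    ("llama3", "ollama"),                   # served locally via Ollama
]
card_range = range(0, 34)  # LINDDUN Go cards 0-33
temperature = 0.7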

Output: Results are saved in benchmarks/<benchmark>/results<model_name>/:

  • found_threats.json: Raw model responses
  • metrics_results.csv: Performance metrics

results.py - Results Compilation

Compiles evaluation results by comparing found threats against ground truth and calculating metrics.

Metrics calculated:

  • True/False Positives and Negatives
  • Precision, Recall, Specificity, Accuracy, F1 Score
  • Optional: Reason score statistics
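The metrics follow the standard confusion-matrix definitions; a minimal sketch of how they can be computed from the raw counts (illustrative, not necessarily the exact code in results.py):

def compute_metrics(tp, fp, tn, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "specificity": specificity,
            "accuracy": accuracy, "f1": f1}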

reason_evaluation.py - DeepEval GPT-4o Reason Evaluation

Evaluates the quality of reasoning provided by models using GPT-4o as a judge via the DeepEval framework.

python reason_evaluation.py

Output: deepeval.json files in each results directory containing:

  • Relevance scores for each threat's reasoning
  • Concordance information (whether model agreed with ground truth)
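A minimal sketch of a DeepEval G-Eval judgment with GPT-4o as the judge; the criteria text and test-case fields below are placeholders, not the exact ones used by reason_evaluation.py:

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

relevance = GEval(
    name="Relevance",
    criteria="Assess whether the reasoning justifies the threat decision.",  # placeholder criteria
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model="gpt-4o",
)
test_case = LLMTestCase(
    input="<card question and application description>",
    actual_output="<model's stated reason for flagging or not flagging the threat>",
)
relevance.measure(test_case)
print(relevance.score, relevance.reason)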

reason_evaluation_gemini.py - DeepEval Gemini Reason Evaluation

Alternative reason evaluation using Google's Gemini model as the judge.

python reason_evaluation_gemini.py

Output: deepeval-gemini.json files with a structure similar to the GPT-4o evaluation output.


calculate_spearman.py - Judge Correlation Analysis

Calculates Spearman rank correlations between GPT-4o and Gemini judges to assess inter-rater reliability.

python calculate_spearman.py

Contains pre-computed scores for both judges and calculates correlations for:

  • Overall mean scores
  • Concordant cases (model agreed with ground truth)
  • Discordant cases (model disagreed with ground truth)
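A minimal sketch of the underlying computation using SciPy; the score lists below are placeholders for the pre-computed judge scores embedded in the script:

from scipy.stats import spearmanr

gpt4o_scores = [0.8, 0.6, 0.9]   # placeholder values
gemini_scores = [0.7, 0.5, 0.9]  # placeholder values

rho, p_value = spearmanr(gpt4o_scores, gemini_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")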

table.py - LaTeX Table Generator

Generates LaTeX tables summarizing overall performance metrics across all benchmarks.

python table.py

Output:

  • Prints LaTeX table to console
  • Saves to metrics_table.tex

The table pairs base models with their multi-agent variants for easy comparison.


Visualization Tools

results_viewer.py - Interactive Dashboard

A matplotlib-based interactive dashboard for exploring evaluation results.

python results_viewer.py

Features:

  • Switch between benchmarks and overall aggregated view
  • View different metrics (Precision, Recall, F1, Accuracy, etc.)
  • Filter by concordance (all/concordant/discordant cases)
  • Sort models by performance or alphabetically
  • Bar charts with model grouping (base vs multi-agent)

boxplot_viewer.py - Reason Score Distribution

Visualizes the distribution of reason quality scores across models using boxplots.

python boxplot_viewer.py

Features:

  • Aggregates reason scores from deepeval.json files
  • Filter by concordance status
  • Compare score distributions across all models

Ablation Study

The ablation-study/ directory contains scripts for investigating how the number of multi-agent discussion rounds affects performance.

run_ablation_study.py

Runs experiments with varying numbers of deliberation rounds.

cd ablation-study
python run_ablation_study.py

Configuration (edit in script):

  • ABLATION_MODELS: List of models to test
  • ABLATION_ROUNDS: Number of rounds to test (default: [1, 2, 3])
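An illustrative sketch of these settings; the entry format and model names are assumptions and may differ from the actual script:

ABLATION_MODELS = [
    ("gpt-4o-mini", "openai"),   # hypothetical example entry
]
ABLATION_ROUNDS = [1, 2, 3]      # numbers of deliberation rounds to compare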

analyze_ablation_results.py

Analyzes convergence patterns and performance trends from ablation experiments.

cd ablation-study
python analyze_ablation_results.py

Analyzes:

  • Convergence patterns across rounds
  • Vote trajectory and stability
  • Performance metrics by round count

Benchmark Structure

Each benchmark folder contains:

  • description.json: Application description, data types, data flow diagram (DFD), policies
  • database.csv: Database schema with data types and properties
  • benchmark.json: Ground truth threat responses (generated or curated)
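A minimal sketch of loading one benchmark folder; the field names inside the JSON files are not shown here and depend on the actual schema:

import csv
import json
from pathlib import Path

bench = Path("benchmarks/benchmark1")
description = json.loads((bench / "description.json").read_text())  # application description, DFD, policies
ground_truth = json.loads((bench / "benchmark.json").read_text())   # ground truth threats
with open(bench / "database.csv", newline="") as f:
    schema = list(csv.DictReader(f))                                 # database schema rows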

License

This project is licensed under the Apache License, Version 2.0 - see the LICENSE file for details.
