
Neural Vulnerability Scanner

A deep learning system for detecting security vulnerabilities in C/C++ source code, built by fine-tuning CodeBERT (125M params) with LoRA (Low-Rank Adaptation) on the BigVul dataset.

Key Results

| Method | Precision | Recall | F1 Score | Accuracy |
|---|---|---|---|---|
| CodeBERT + LoRA (default threshold) | 74.10% | 70.07% | 72.03% | 72.79% |
| CodeBERT + LoRA (tuned threshold) | 64.35% | 87.94% | 74.32% | 69.61% |
| AST Rule-Based Scorer | 64.09% | 57.37% | 60.54% | 62.62% |
| Ensemble (CodeBERT + AST) | 62.71% | 89.50% | 73.75% | 68.14% |

Best F1: 74.32% on the held-out test set (2,172 samples)
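
The tuned-threshold row comes from sweeping the decision threshold over validation probabilities and keeping the value that maximizes F1. A minimal sketch of that sweep (the helper name and the synthetic data below are illustrative, not taken from the repo):

```python
import numpy as np
from sklearn.metrics import f1_score

def tune_threshold(probs, labels, grid=None):
    """Pick the decision threshold that maximizes F1 on validation data."""
    grid = np.linspace(0.05, 0.95, 91) if grid is None else grid
    scores = [f1_score(labels, (probs >= t).astype(int)) for t in grid]
    best = int(np.argmax(scores))
    return grid[best], scores[best]

# Synthetic stand-in for validation data: probabilities that loosely
# track the true labels, clipped to [0, 1]
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=500)
probs = np.clip(labels * 0.4 + rng.normal(0.3, 0.2, size=500), 0, 1)
t, f1 = tune_threshold(probs, labels)
```

Lowering the threshold below 0.5 is what trades precision for recall in the table above (recall jumps to ~88% while precision drops).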

Architecture

```
Input C/C++ Code
       │
       ▼
┌───────────────┐     ┌───────────────────┐
│  Preprocess   │     │  AST Risk Scorer  │
│  (clean code) │     │  (12+ CWE rules)  │
└──────┬────────┘     └─────────┬─────────┘
       │                        │
       ▼                        │
┌───────────────┐               │
│   CodeBERT    │               │
│  + LoRA r=64  │               │
│ (125M params) │               │
└──────┬────────┘               │
       │                        │
       ▼                        ▼
┌─────────────────────────────────┐
│     Ensemble (α-weighted)       │
│   score = α·CB + (1-α)·AST      │
└─────────────────────────────────┘
       │
       ▼
   SAFE / VULNERABLE
```
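
The final stage blends the two detectors with a convex combination. A minimal sketch of the α-weighted ensemble (the `alpha` and `threshold` defaults below are placeholders, not the repo's tuned values):

```python
def ensemble_score(cb_prob, ast_score, alpha=0.7):
    """Alpha-weighted blend of the CodeBERT vulnerability probability and the
    AST risk score. Both inputs are assumed normalized to [0, 1]."""
    return alpha * cb_prob + (1 - alpha) * ast_score

def classify(cb_prob, ast_score, alpha=0.7, threshold=0.5):
    """Final SAFE / VULNERABLE decision on the blended score."""
    s = ensemble_score(cb_prob, ast_score, alpha)
    return "VULNERABLE" if s >= threshold else "SAFE"
```

With α near 1 the ensemble defers to CodeBERT; with α near 0 it defers to the rule-based scorer.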

Training Strategy

Two-Stage Fine-Tuning with LoRA

The model was trained using a two-stage approach that resolves the gradient conflict between LoRA adapters and the classifier head:

Stage 1 — Classifier Warmup (LoRA frozen)

  • Only the classifier head is trainable (~1.2M params)
  • Learning rate: 1e-3 (high — small MLP needs aggressive optimization)
  • 2 epochs with cosine scheduler
  • Result: F1 = 66.70%

Stage 2 — Joint Fine-Tuning (LoRA unfrozen)

  • LoRA adapters + classifier jointly trained (~4.7M params)
  • Learning rate: 5e-5 (lower for stability with 2× adapter params)
  • 5 epochs with cosine scheduler
  • Result: F1 = 68.41% (validation)
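
The freeze/unfreeze mechanics behind the two stages can be sketched with `requires_grad` toggles. Real code would operate on a PEFT-wrapped CodeBERT, where adapter parameter names contain `lora_`; the `TinyScanner` module below is a stand-in, not the repo's model:

```python
import torch.nn as nn

class TinyScanner(nn.Module):
    """Stand-in for CodeBERT + LoRA with the same naming convention."""
    def __init__(self):
        super().__init__()
        self.lora_A = nn.Linear(8, 8)      # plays the role of the LoRA adapters
        self.classifier = nn.Linear(8, 2)  # classifier head

def set_stage(model, stage):
    """Stage 1: train only the classifier head; Stage 2: train adapters too."""
    for name, param in model.named_parameters():
        if "classifier" in name:
            param.requires_grad = True
        else:  # adapter parameters stay frozen during the warmup stage
            param.requires_grad = (stage == 2)

model = TinyScanner()
set_stage(model, 1)  # warmup: only the classifier head is trainable
```

Running the warmup first gives the classifier head a sensible initialization, so the adapters do not fight random classifier gradients when Stage 2 unfreezes them.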

Version History (10 iterations; key milestones shown)

| Version | F1 | Key Change |
|---|---|---|
| v1 | 37.7% | Classifier head never trained |
| v2 | 60.2% | Basic LoRA |
| v4 | 64.1% | Focal loss (overfit) |
| v7 | 62.6% | LoRA + classifier gradient conflict identified |
| v9 | 69.8% | Two-stage training + 21K data |
| v10 | 74.3% | LoRA r=64 (doubled capacity) + threshold tuning |

Dataset

  • Source: BigVul — real-world C/C++ vulnerabilities from CVE/NVD
  • Size: 21,720 balanced samples (10,860 vulnerable + 10,860 safe)
  • Split: 80% train (17,376) / 10% validation (2,172) / 10% test (2,172)
  • Top CWEs: CWE-119 (buffer overflow), CWE-20 (input validation), CWE-399 (resource management), CWE-125 (OOB read), CWE-200 (info leak), CWE-416 (use-after-free)
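
An 80/10/10 stratified split like the one above can be produced with two calls to scikit-learn's `train_test_split` (a sketch; the seed and toy data are illustrative):

```python
from sklearn.model_selection import train_test_split

def split_80_10_10(samples, labels, seed=42):
    """Stratified 80/10/10 train/validation/test split."""
    X_tr, X_tmp, y_tr, y_tmp = train_test_split(
        samples, labels, test_size=0.2, stratify=labels, random_state=seed)
    # Split the remaining 20% in half: 10% validation, 10% test
    X_val, X_te, y_val, y_te = train_test_split(
        X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=seed)
    return (X_tr, y_tr), (X_val, y_val), (X_te, y_te)

# Toy balanced dataset: 100 samples, 50 per class
data = list(range(100))
labels = [i % 2 for i in range(100)]
(train, ytr), (val, yval), (test, yte) = split_80_10_10(data, labels)
```

Stratifying keeps the 50/50 vulnerable/safe balance identical across all three splits.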

Model Details

| Parameter | Value |
|---|---|
| Base Model | microsoft/codebert-base (RoBERTa architecture) |
| Total Parameters | 128.7M |
| Trainable (Stage 2) | 4.7M (3.7% of total) |
| LoRA Rank (r) | 64 |
| LoRA Alpha | 128 |
| LoRA Dropout | 0.15 |
| Target Modules | query, key, value |
| Max Sequence Length | 512 tokens |
| Training Precision | FP16 |
| Hardware | NVIDIA RTX 3050 (4GB VRAM) |

AST Rule-Based Scorer

A hand-crafted vulnerability scorer covering 12+ CWE categories:

  • CWE-119/125/787: Buffer overflow, OOB read/write (strcpy, sprintf, memcpy patterns)
  • CWE-20: Input validation (copy_from_user, unchecked return values)
  • CWE-416: Use-after-free detection
  • CWE-415: Double free detection
  • CWE-190: Integer overflow (arithmetic without bounds)
  • CWE-362: Race conditions (lock/unlock mismatch)
  • CWE-476: NULL pointer dereference
  • CWE-200: Information exposure (uninitialized struct copy)
  • CWE-264/284: Permission/access control issues
  • Command injection, format string vulnerabilities
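
The scorer itself works on an AST, but the pattern-matching idea behind a couple of those rules can be sketched with regexes (the rules and weights below are illustrative, not the repo's):

```python
import re

# Illustrative subset: (pattern, risk weight, description). The real scorer
# analyzes the AST and covers 12+ CWE categories.
RULES = [
    (re.compile(r"\bstrcpy\s*\("), 0.8, "CWE-119: unbounded copy"),
    (re.compile(r"\bsprintf\s*\("), 0.7, "CWE-119: unbounded format"),
    (re.compile(r"\bfree\s*\(.*\bfree\s*\(", re.S), 0.9, "CWE-415: double free"),
]

def ast_risk_score(code):
    """Return (risk score in [0, 1], list of triggered rule descriptions)."""
    hits = [(w, msg) for rx, w, msg in RULES if rx.search(code)]
    score = max((w for w, _ in hits), default=0.0)
    return score, [msg for _, msg in hits]

score, findings = ast_risk_score(
    "void f(char *in) { char buf[8]; strcpy(buf, in); }")
```

The returned score plugs directly into the `(1-α)·AST` term of the ensemble.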

Project Structure

```
neural-vulnerability-scanner/
├── 00_eda_bigvul.ipynb              # Exploratory data analysis
├── 01b_dataset_50k.ipynb            # Dataset preparation (21K balanced)
├── 02_ast_pipeline.ipynb            # AST rule-based analyzer
├── 04_optimized_training.ipynb      # Model training + evaluation + ensemble
├── ast_analyzer.py                  # Standalone AST vulnerability scorer
├── vulnerability-scanner-lora-v10/  # Final trained model weights
│   ├── adapter_config.json
│   ├── adapter_model.safetensors    # LoRA adapter weights (~16MB)
│   ├── tokenizer.json
│   └── ...
├── CWE_Cheat_Sheet.md               # CWE reference guide
├── QUICKSTART.md                    # Quick setup instructions
├── SETUP.bat                        # Windows setup script
└── README.md
```

Quick Start

Prerequisites

  • Python 3.10+
  • NVIDIA GPU with CUDA support (tested on RTX 3050 4GB)
  • Conda (recommended)

Installation

```shell
# Clone the repo
git clone https://github.com/AmitAK1/Neural-Vulnerability-Scanner.git
cd Neural-Vulnerability-Scanner

# Create conda environment
conda create -n vuln-scanner python=3.10 -y
conda activate vuln-scanner

# Install dependencies
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers peft datasets scikit-learn numpy tqdm
```

Inference (using saved model)

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel
import torch

# Load the tokenizer, the base model, and the LoRA adapter on top of it
tokenizer = AutoTokenizer.from_pretrained("./vulnerability-scanner-lora-v10")
base_model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2
)
model = PeftModel.from_pretrained(base_model, "./vulnerability-scanner-lora-v10")
model.eval()

# Predict
code = "void f(char *input) { char buf[64]; strcpy(buf, input); }"
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    print(f"Vulnerable: {probs[1].item():.3f}, Safe: {probs[0].item():.3f}")
```

Training from Scratch

Run the notebooks in order:

  1. 01b_dataset_50k.ipynb — Prepare dataset (requires BigVul CSVs from Kaggle)
  2. 04_optimized_training.ipynb — Train model (uses a reload cell to skip retraining)

Tech Stack

  • PyTorch — Deep learning framework
  • HuggingFace Transformers — CodeBERT model
  • PEFT/LoRA — Parameter-efficient fine-tuning
  • Scikit-learn — Evaluation metrics
  • NumPy/Pandas — Data processing

Hardware Requirements

| Component | Minimum | Recommended |
|---|---|---|
| GPU VRAM | 4GB | 8GB+ |
| RAM | 16GB | 32GB |
| Training Time | ~3 hours (RTX 3050) | ~1.5 hours (RTX 3060+) |

License

MIT License

Author

Amit Kamble — B.Tech CSE (AI), IIITDM Kancheepuram
GitHub · LinkedIn
