
Neural Vulnerability Scanner

A deep learning system for detecting security vulnerabilities in C/C++ source code, built by fine-tuning CodeBERT (125M params) with LoRA (Low-Rank Adaptation) on the BigVul dataset.

Key Results

| Method | Precision | Recall | F1 Score | Accuracy |
|---|---|---|---|---|
| CodeBERT + LoRA (default threshold) | 74.10% | 70.07% | 72.03% | 72.79% |
| CodeBERT + LoRA (tuned threshold) | 64.35% | 87.94% | 74.32% | 69.61% |
| AST Rule-Based Scorer | 64.09% | 57.37% | 60.54% | 62.62% |
| Ensemble (CodeBERT + AST) | 62.71% | 89.50% | 73.75% | 68.14% |

Best F1: 74.32% on the held-out test set (2,172 samples)
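
The tuned-threshold row comes from sweeping the decision threshold over validation probabilities and keeping the value that maximizes F1. A minimal sketch of that sweep (the helper name and the synthetic data below are illustrative, not taken from the repo):

```python
import numpy as np
from sklearn.metrics import f1_score

def tune_threshold(probs, labels, grid=None):
    """Pick the decision threshold that maximizes F1 on validation data."""
    grid = np.linspace(0.05, 0.95, 91) if grid is None else grid
    scores = [f1_score(labels, (probs >= t).astype(int)) for t in grid]
    best = int(np.argmax(scores))
    return grid[best], scores[best]

# Synthetic stand-in for validation data: probabilities that loosely
# track the true labels, clipped to [0, 1]
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=500)
probs = np.clip(labels * 0.4 + rng.normal(0.3, 0.2, size=500), 0, 1)
t, f1 = tune_threshold(probs, labels)
```

Lowering the threshold below 0.5 is what trades precision for recall in the table above (recall jumps to ~88% while precision drops).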

Architecture

```
Input C/C++ Code
       │
       ▼
┌───────────────┐     ┌───────────────────┐
│  Preprocess   │     │  AST Risk Scorer  │
│  (clean code) │     │  (12+ CWE rules)  │
└──────┬────────┘     └─────────┬─────────┘
       │                        │
       ▼                        │
┌───────────────┐               │
│   CodeBERT    │               │
│  + LoRA r=64  │               │
│ (125M params) │               │
└──────┬────────┘               │
       │                        │
       ▼                        ▼
┌─────────────────────────────────┐
│     Ensemble (α-weighted)       │
│   score = α·CB + (1-α)·AST      │
└─────────────────────────────────┘
       │
       ▼
   SAFE / VULNERABLE
```
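
The final stage blends the two detectors with a convex combination. A minimal sketch of the α-weighted ensemble (the `alpha` and `threshold` defaults below are placeholders, not the repo's tuned values):

```python
def ensemble_score(cb_prob, ast_score, alpha=0.7):
    """Alpha-weighted blend of the CodeBERT vulnerability probability and the
    AST risk score. Both inputs are assumed normalized to [0, 1]."""
    return alpha * cb_prob + (1 - alpha) * ast_score

def classify(cb_prob, ast_score, alpha=0.7, threshold=0.5):
    """Final SAFE / VULNERABLE decision on the blended score."""
    s = ensemble_score(cb_prob, ast_score, alpha)
    return "VULNERABLE" if s >= threshold else "SAFE"
```

With α near 1 the ensemble defers to CodeBERT; with α near 0 it defers to the rule-based scorer.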

Training Strategy

Two-Stage Fine-Tuning with LoRA

The model was trained using a two-stage approach that resolves the gradient conflict between LoRA adapters and the classifier head:

Stage 1 — Classifier Warmup (LoRA frozen)

  • Only the classifier head is trainable (~1.2M params)
  • Learning rate: 1e-3 (high — small MLP needs aggressive optimization)
  • 2 epochs with cosine scheduler
  • Result: F1 = 66.70%

Stage 2 — Joint Fine-Tuning (LoRA unfrozen)

  • LoRA adapters + classifier jointly trained (~4.7M params)
  • Learning rate: 5e-5 (lower for stability with 2× adapter params)
  • 5 epochs with cosine scheduler
  • Result: F1 = 68.41% (validation)
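
The freeze/unfreeze mechanics behind the two stages can be sketched with `requires_grad` toggles. Real code would operate on a PEFT-wrapped CodeBERT, where adapter parameter names contain `lora_`; the `TinyScanner` module below is a stand-in, not the repo's model:

```python
import torch.nn as nn

class TinyScanner(nn.Module):
    """Stand-in for CodeBERT + LoRA with the same naming convention."""
    def __init__(self):
        super().__init__()
        self.lora_A = nn.Linear(8, 8)      # plays the role of the LoRA adapters
        self.classifier = nn.Linear(8, 2)  # classifier head

def set_stage(model, stage):
    """Stage 1: train only the classifier head; Stage 2: train adapters too."""
    for name, param in model.named_parameters():
        if "classifier" in name:
            param.requires_grad = True
        else:  # adapter parameters stay frozen during the warmup stage
            param.requires_grad = (stage == 2)

model = TinyScanner()
set_stage(model, 1)  # warmup: only the classifier head is trainable
```

Running the warmup first gives the classifier head a sensible initialization, so the adapters do not fight random classifier gradients when Stage 2 unfreezes them.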

Version History (10 iterations; key milestones shown)

| Version | F1 | Key Change |
|---|---|---|
| v1 | 37.7% | Classifier head never trained |
| v2 | 60.2% | Basic LoRA |
| v4 | 64.1% | Focal loss (overfit) |
| v7 | 62.6% | LoRA + classifier gradient conflict identified |
| v9 | 69.8% | Two-stage training + 21K data |
| v10 | 74.3% | LoRA r=64 (doubled capacity) + threshold tuning |

Dataset

  • Source: BigVul — real-world C/C++ vulnerabilities from CVE/NVD
  • Size: 21,720 balanced samples (10,860 vulnerable + 10,860 safe)
  • Split: 80% train (17,376) / 10% validation (2,172) / 10% test (2,172)
  • Top CWEs: CWE-119 (buffer overflow), CWE-20 (input validation), CWE-399 (resource management), CWE-125 (OOB read), CWE-200 (info leak), CWE-416 (use-after-free)
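
An 80/10/10 stratified split like the one above can be produced with two calls to scikit-learn's `train_test_split` (a sketch; the seed and toy data are illustrative):

```python
from sklearn.model_selection import train_test_split

def split_80_10_10(samples, labels, seed=42):
    """Stratified 80/10/10 train/validation/test split."""
    X_tr, X_tmp, y_tr, y_tmp = train_test_split(
        samples, labels, test_size=0.2, stratify=labels, random_state=seed)
    # Split the remaining 20% in half: 10% validation, 10% test
    X_val, X_te, y_val, y_te = train_test_split(
        X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=seed)
    return (X_tr, y_tr), (X_val, y_val), (X_te, y_te)

# Toy balanced dataset: 100 samples, 50 per class
data = list(range(100))
labels = [i % 2 for i in range(100)]
(train, ytr), (val, yval), (test, yte) = split_80_10_10(data, labels)
```

Stratifying keeps the 50/50 vulnerable/safe balance identical across all three splits.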

Model Details

| Parameter | Value |
|---|---|
| Base Model | microsoft/codebert-base (RoBERTa architecture) |
| Total Parameters | 128.7M |
| Trainable (Stage 2) | 4.7M (3.7% of total) |
| LoRA Rank (r) | 64 |
| LoRA Alpha | 128 |
| LoRA Dropout | 0.15 |
| Target Modules | query, key, value |
| Max Sequence Length | 512 tokens |
| Training Precision | FP16 |
| Hardware | NVIDIA RTX 3050 (4GB VRAM) |

AST Rule-Based Scorer

A hand-crafted vulnerability scorer covering 12+ CWE categories:

  • CWE-119/125/787: Buffer overflow, OOB read/write (strcpy, sprintf, memcpy patterns)
  • CWE-20: Input validation (copy_from_user, unchecked return values)
  • CWE-416: Use-after-free detection
  • CWE-415: Double free detection
  • CWE-190: Integer overflow (arithmetic without bounds)
  • CWE-362: Race conditions (lock/unlock mismatch)
  • CWE-476: NULL pointer dereference
  • CWE-200: Information exposure (uninitialized struct copy)
  • CWE-264/284: Permission/access control issues
  • Command injection, format string vulnerabilities
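
The scorer itself works on an AST, but the pattern-matching idea behind a couple of those rules can be sketched with regexes (the rules and weights below are illustrative, not the repo's):

```python
import re

# Illustrative subset: (pattern, risk weight, description). The real scorer
# analyzes the AST and covers 12+ CWE categories.
RULES = [
    (re.compile(r"\bstrcpy\s*\("), 0.8, "CWE-119: unbounded copy"),
    (re.compile(r"\bsprintf\s*\("), 0.7, "CWE-119: unbounded format"),
    (re.compile(r"\bfree\s*\(.*\bfree\s*\(", re.S), 0.9, "CWE-415: double free"),
]

def ast_risk_score(code):
    """Return (risk score in [0, 1], list of triggered rule descriptions)."""
    hits = [(w, msg) for rx, w, msg in RULES if rx.search(code)]
    score = max((w for w, _ in hits), default=0.0)
    return score, [msg for _, msg in hits]

score, findings = ast_risk_score(
    "void f(char *in) { char buf[8]; strcpy(buf, in); }")
```

The returned score plugs directly into the `(1-α)·AST` term of the ensemble.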

Project Structure

```
neural-vulnerability-scanner/
├── 00_eda_bigvul.ipynb              # Exploratory data analysis
├── 01b_dataset_50k.ipynb            # Dataset preparation (21K balanced)
├── 02_ast_pipeline.ipynb            # AST rule-based analyzer
├── 04_optimized_training.ipynb      # Model training + evaluation + ensemble
├── ast_analyzer.py                  # Standalone AST vulnerability scorer
├── vulnerability-scanner-lora-v10/  # Final trained model weights
│   ├── adapter_config.json
│   ├── adapter_model.safetensors    # LoRA adapter weights (~16MB)
│   ├── tokenizer.json
│   └── ...
├── CWE_Cheat_Sheet.md               # CWE reference guide
├── QUICKSTART.md                    # Quick setup instructions
├── SETUP.bat                        # Windows setup script
└── README.md
```

Quick Start

Prerequisites

  • Python 3.10+
  • NVIDIA GPU with CUDA support (tested on RTX 3050 4GB)
  • Conda (recommended)

Installation

```shell
# Clone the repo
git clone https://github.com/AmitAK1/Neural-Vulnerability-Scanner.git
cd Neural-Vulnerability-Scanner

# Create conda environment
conda create -n vuln-scanner python=3.10 -y
conda activate vuln-scanner

# Install dependencies
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers peft datasets scikit-learn numpy tqdm
```

Inference (using saved model)

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel
import torch

# Load the tokenizer, the base model, and the LoRA adapter on top of it
tokenizer = AutoTokenizer.from_pretrained("./vulnerability-scanner-lora-v10")
base_model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2
)
model = PeftModel.from_pretrained(base_model, "./vulnerability-scanner-lora-v10")
model.eval()

# Predict
code = "void f(char *input) { char buf[64]; strcpy(buf, input); }"
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    print(f"Vulnerable: {probs[1].item():.3f}, Safe: {probs[0].item():.3f}")
```

Training from Scratch

Run the notebooks in order:

  1. 01b_dataset_50k.ipynb — Prepare dataset (requires BigVul CSVs from Kaggle)
  2. 04_optimized_training.ipynb — Train model (uses a reload cell to skip retraining)

Tech Stack

  • PyTorch — Deep learning framework
  • HuggingFace Transformers — CodeBERT model
  • PEFT/LoRA — Parameter-efficient fine-tuning
  • Scikit-learn — Evaluation metrics
  • NumPy/Pandas — Data processing

Hardware Requirements

| Component | Minimum | Recommended |
|---|---|---|
| GPU VRAM | 4GB | 8GB+ |
| RAM | 16GB | 32GB |
| Training Time | ~3 hours (RTX 3050) | ~1.5 hours (RTX 3060+) |

License

MIT License

Author

Amit Kamble — B.Tech CSE (AI), IIITDM Kancheepuram
GitHub · LinkedIn
