A deep learning system for detecting security vulnerabilities in C/C++ source code, built by fine-tuning CodeBERT (125M params) with LoRA (Low-Rank Adaptation) on the BigVul dataset.
| Method | Precision | Recall | F1 Score | Accuracy |
|---|---|---|---|---|
| CodeBERT + LoRA (default threshold) | 74.10% | 70.07% | 72.03% | 72.79% |
| CodeBERT + LoRA (tuned threshold) | 64.35% | 87.94% | 74.32% | 69.61% |
| AST Rule-Based Scorer | 64.09% | 57.37% | 60.54% | 62.62% |
| Ensemble (CodeBERT + AST) | 62.71% | 89.50% | 73.75% | 68.14% |
Best F1: 74.32% on held-out test set (2,172 samples)
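The gap between the default-threshold and tuned-threshold rows comes from sweeping the decision threshold on validation probabilities. A minimal sketch of that sweep; the grid and F1-based selection here are illustrative, not the project's exact tuning cell:

```python
import numpy as np

def tune_threshold(probs, labels, grid=np.linspace(0.05, 0.95, 91)):
    """Return the threshold in `grid` that maximizes F1 on validation data."""
    def f1(t):
        pred = (probs >= t).astype(int)
        tp = int(((pred == 1) & (labels == 1)).sum())
        fp = int(((pred == 1) & (labels == 0)).sum())
        fn = int(((pred == 0) & (labels == 1)).sum())
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return max(grid, key=f1)
```

Lowering the threshold trades precision for recall, which is how the tuned row reaches 87.94% recall at the cost of precision.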
```
        Input C/C++ Code
               │
               ▼
┌──────────────┐      ┌──────────────────┐
│  Preprocess  │      │  AST Risk Scorer │
│ (clean code) │      │  (12+ CWE rules) │
└──────┬───────┘      └────────┬─────────┘
       │                       │
       ▼                       │
┌──────────────┐               │
│   CodeBERT   │               │
│  + LoRA r=64 │               │
│ (125M params)│               │
└──────┬───────┘               │
       │                       │
       ▼                       ▼
┌─────────────────────────────────┐
│      Ensemble (α-weighted)      │
│    score = α·CB + (1−α)·AST     │
└─────────────────────────────────┘
               │
               ▼
       SAFE / VULNERABLE
```
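The final ensemble stage is just a convex combination of the two model scores. A minimal sketch, with illustrative values for α and the decision threshold (both are tuned on validation data, not fixed constants):

```python
def ensemble_score(cb_prob: float, ast_score: float,
                   alpha: float = 0.7, threshold: float = 0.5) -> str:
    """Combine the CodeBERT probability and AST risk score into a verdict.

    alpha weights the neural model; (1 - alpha) weights the rule-based scorer.
    """
    score = alpha * cb_prob + (1 - alpha) * ast_score
    return "VULNERABLE" if score >= threshold else "SAFE"
```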
The model was trained using a two-stage approach that resolves the gradient conflict between LoRA adapters and the classifier head:
Stage 1 — Classifier Warmup (LoRA frozen)
- Only the classifier head is trainable (~1.2M params)
- Learning rate: 1e-3 (high — small MLP needs aggressive optimization)
- 2 epochs with cosine scheduler
- Result: F1 = 66.70%
Stage 2 — Joint Fine-Tuning (LoRA unfrozen)
- LoRA adapters + classifier jointly trained (~4.7M params)
- Learning rate: 5e-5 (lower for stability with 2× adapter params)
- 5 epochs with cosine scheduler
- Result: F1 = 68.41% (validation)
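The two stages above amount to toggling `requires_grad` on different parameter groups. A sketch assuming the PEFT naming convention, where LoRA parameter names contain `lora_` and the classifier head's contain `classifier`:

```python
import torch.nn as nn

def set_stage(model: nn.Module, stage: int) -> int:
    """Configure trainable parameters for the two-stage schedule.

    Stage 1: classifier head only (warmup).
    Stage 2: LoRA adapters + classifier head jointly.
    Returns the number of trainable parameters.
    """
    trainable = 0
    for name, p in model.named_parameters():
        if "classifier" in name:
            p.requires_grad = True           # trained in both stages
        elif "lora_" in name:
            p.requires_grad = (stage == 2)   # unfrozen only in stage 2
        else:
            p.requires_grad = False          # frozen base model weights
        trainable += p.numel() if p.requires_grad else 0
    return trainable
```

Running the warmup stage first lets the randomly initialized head converge before the adapters start moving, avoiding the gradient conflict described above.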
| Version | F1 | Key Change |
|---|---|---|
| v1 | 37.7% | Classifier head never trained |
| v2 | 60.2% | Basic LoRA |
| v4 | 64.1% | Focal loss (overfit) |
| v7 | 62.6% | LoRA+classifier gradient conflict identified |
| v9 | 69.8% | Two-stage training + 21K data |
| v10 | 74.3% | LoRA r=64 (doubled capacity) + threshold tuning |
- Source: BigVul — real-world C/C++ vulnerabilities from CVE/NVD
- Size: 21,720 balanced samples (10,860 vulnerable + 10,860 safe)
- Split: 80% train (17,376) / 10% validation (2,172) / 10% test (2,172)
- Top CWEs: CWE-119 (buffer overflow), CWE-20 (input validation), CWE-399 (resource management), CWE-125 (OOB read), CWE-200 (info leak), CWE-416 (use-after-free)
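The 80/10/10 split can be reproduced with two stratified calls to scikit-learn's `train_test_split`; the seed and function name here are illustrative:

```python
from sklearn.model_selection import train_test_split

def split_80_10_10(samples, labels, seed=42):
    """Stratified 80/10/10 train/validation/test split."""
    train_x, rest_x, train_y, rest_y = train_test_split(
        samples, labels, test_size=0.2, stratify=labels, random_state=seed)
    # Split the remaining 20% evenly into validation and test.
    val_x, test_x, val_y, test_y = train_test_split(
        rest_x, rest_y, test_size=0.5, stratify=rest_y, random_state=seed)
    return (train_x, train_y), (val_x, val_y), (test_x, test_y)
```

Stratifying both calls keeps the 50/50 vulnerable/safe balance in every split.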
| Parameter | Value |
|---|---|
| Base Model | microsoft/codebert-base (RoBERTa architecture) |
| Total Parameters | 128.7M |
| Trainable (Stage 2) | 4.7M (3.7% of total) |
| LoRA Rank (r) | 64 |
| LoRA Alpha | 128 |
| LoRA Dropout | 0.15 |
| Target Modules | query, key, value |
| Max Sequence Length | 512 tokens |
| Training Precision | FP16 |
| Hardware | NVIDIA RTX 3050 (4GB VRAM) |
A hand-crafted vulnerability scorer covering 12+ CWE categories:
- CWE-119/125/787: Buffer overflow, OOB read/write (strcpy, sprintf, memcpy patterns)
- CWE-20: Input validation (copy_from_user, unchecked return values)
- CWE-416: Use-after-free detection
- CWE-415: Double free detection
- CWE-190: Integer overflow (arithmetic without bounds)
- CWE-362: Race conditions (lock/unlock mismatch)
- CWE-476: NULL pointer dereference
- CWE-200: Information exposure (uninitialized struct copy)
- CWE-264/284: Permission/access control issues
- Command injection, format string vulnerabilities
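A scorer of this kind can be as simple as weighted regex patterns over the source text. This hypothetical mini-version covers only a couple of the categories above (the real `ast_analyzer.py` implements 12+), and the weights are made up for illustration:

```python
import re

# Each rule: (CWE tag, compiled pattern, weight). Purely illustrative.
RULES = [
    ("CWE-119", re.compile(r"\b(strcpy|strcat|sprintf|gets)\s*\("), 0.4),
    ("CWE-190", re.compile(r"\bmalloc\s*\(\s*\w+\s*\*\s*\w+\s*\)"), 0.3),
    ("CWE-134", re.compile(r"\bprintf\s*\(\s*\w+\s*\)"), 0.3),
]

def ast_risk_score(code: str) -> float:
    """Return a risk score in [0, 1] by summing weights of matched rules."""
    score = sum(w for _, pat, w in RULES if pat.search(code))
    return min(score, 1.0)
```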
```
neural-vulnerability-scanner/
├── 00_eda_bigvul.ipynb              # Exploratory data analysis
├── 01b_dataset_50k.ipynb            # Dataset preparation (21K balanced)
├── 02_ast_pipeline.ipynb            # AST rule-based analyzer
├── 04_optimized_training.ipynb      # Model training + evaluation + ensemble
├── ast_analyzer.py                  # Standalone AST vulnerability scorer
├── vulnerability-scanner-lora-v10/  # Final trained model weights
│   ├── adapter_config.json
│   ├── adapter_model.safetensors    # LoRA adapter weights (~16MB)
│   ├── tokenizer.json
│   └── ...
├── CWE_Cheat_Sheet.md               # CWE reference guide
├── QUICKSTART.md                    # Quick setup instructions
├── SETUP.bat                        # Windows setup script
└── README.md
```
- Python 3.10+
- NVIDIA GPU with CUDA support (tested on RTX 3050 4GB)
- Conda (recommended)
```bash
# Clone the repo
git clone https://github.com/AmitAK1/Neural-Vulnerability-Scanner.git
cd Neural-Vulnerability-Scanner

# Create conda environment
conda create -n vuln-scanner python=3.10 -y
conda activate vuln-scanner

# Install dependencies
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers peft datasets scikit-learn numpy tqdm
```

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel
import torch

# Load model
tokenizer = AutoTokenizer.from_pretrained("./vulnerability-scanner-lora-v10")
base_model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2
)
model = PeftModel.from_pretrained(base_model, "./vulnerability-scanner-lora-v10")
model.eval()

# Predict
code = "void f(char *input) { char buf[64]; strcpy(buf, input); }"
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
print(f"Vulnerable: {probs[1]:.3f}, Safe: {probs[0]:.3f}")
```

Run the notebooks in order:

1. `01b_dataset_50k.ipynb` — Prepare dataset (requires BigVul CSVs from Kaggle)
2. `04_optimized_training.ipynb` — Train model (uses a reload cell to skip retraining)
- PyTorch — Deep learning framework
- HuggingFace Transformers — CodeBERT model
- PEFT/LoRA — Parameter-efficient fine-tuning
- Scikit-learn — Evaluation metrics
- NumPy/Pandas — Data processing
| Component | Minimum | Recommended |
|---|---|---|
| GPU VRAM | 4GB | 8GB+ |
| RAM | 16GB | 32GB |
| Training Time | ~3 hours (RTX 3050) | ~1.5 hours (RTX 3060+) |
MIT License
Amit Kamble — B.Tech CSE (AI), IIITDM Kancheepuram
GitHub · LinkedIn