Train Less, Infer Faster: Efficient Model Finetuning and Compression via Structured Sparsity (AISTATS, 2026)
FineGates is a structured sparsification framework for efficient adaptation and compression of Large Language Models (LLMs).
Instead of adding new weight matrices (as LoRA does), FineGates learns stochastic binary row and column gates that directly sparsify the base model weights.
This enables:
- ✔️ Minimal trainable parameters
- ✔️ Real inference-time speedup
- ✔️ Up to 40% structured pruning
- ✔️ Theoretical convergence guarantees
FineGates replaces low-rank adaptation with trainable structured gates:
- Gates are trained using stochastic relaxation
- They converge to binary values (0/1)
- Entire rows and columns are removed
- Inference cost is reduced directly (no post-pruning step required)

Accuracy and parameter efficiency:
- Matches or outperforms full finetuning
- Competitive with or superior to LoRA
- Requires up to 10× fewer trainable parameters
- Compresses the base model by 10–40%

Inference:
- Up to 25% reduction in CPU inference time
- No quantization required
- No post-training pruning required

Training:
- Structured pruning during training
- Up to 44% parameter removal
- Faster convergence with only ~0.09% additional parameters

Theory:
- FineGates satisfies the Polyak–Łojasiewicz (PL) condition
- Simpler and better-conditioned optimization landscape than LoRA
- Convergence guarantees under stochastic gating
- Avoids the bilinear degeneracy present in low-rank parameterizations
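
The gate-training bullets above (stochastic relaxation, convergence to 0/1) can be illustrated with a minimal NumPy sketch. It assumes a clipped-Gaussian relaxation, which is one common choice for stochastic gates; the relaxation used in the paper may differ. The names `mu`, `sigma`, `sample_gates`, and `hard_gates` are illustrative and not taken from the repository.

```python
import numpy as np

def sample_gates(mu, sigma=0.5, rng=None):
    """Training-time gates: clip(mu + sigma * eps, 0, 1), with eps ~ N(0, 1).

    Assumption: a clipped-Gaussian relaxation, not necessarily the paper's.
    """
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal(mu.shape)
    return np.clip(mu + sigma * eps, 0.0, 1.0)

def hard_gates(mu):
    """Inference-time gates: drop the noise and clip mu to [0, 1]."""
    return np.clip(mu, 0.0, 1.0)

# Hypothetical gate parameters after training has pushed them outside (0, 1).
mu = np.array([-1.2, 1.3, 1.5, 2.0, -0.4])

rng = np.random.default_rng(0)
z_train = sample_gates(mu, rng=rng)  # noisy values in [0, 1] during training
z_eval = hard_gates(mu)              # exactly binary once mu has converged
keep = z_eval > 0.0                  # rows/columns that survive pruning

print(z_eval)  # [0. 1. 1. 1. 0.]
```

Once every entry of `mu` lies at or below 0 or at or above 1, the deterministic gates are exactly binary, and `keep` marks which rows or columns remain in the model.
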

| Script | Description |
|---|---|
| `sparse_posttrain_llama.py` | Structured sparsification without a target task |
| `finetune_glue.py` | Finetune on downstream tasks with sparsification |
| `sparse_pretrain_llama.py` | Pretraining with structured sparsity |

`python sparse_posttrain_llama.py`

`python finetune_glue.py`

`python sparse_pretrain_llama.py`

| Method | Adds Trainable Parameters | Reduces Inference Cost | Structured Pruning | Pruning During Training | Post-hoc Search |
|---|---|---|---|---|---|
| LoRA | Yes (Low-rank matrices) | ❌ No | ❌ No | ❌ No | ❌ No |
| MaskLLM | Yes (Mask per weight) | ✔️ Yes (Semi-structured) | ❌ Semi-structured | ✔️ Yes | ❌ No |
| FineGates | Minimal (Gate vectors) | ✔️ Yes | ✔️ Yes | ✔️ Yes | ❌ No |
- LoRA → Parameter-efficient adaptation (no compression)
- MaskLLM → Hardware-aware semi-structured pruning
- FineGates → Structured sparsification as an adaptation mechanism during pretraining, post-training, or finetuning
FineGates unifies adaptation and compression in a single training process, without adding heavy parameter matrices or requiring post-training pruning.
- LoRA adds low-rank matrices but keeps the base model fully dense.
- Inference cost remains unchanged.
- Optimization involves bilinear parameterization with potential degeneracy.
- No structured pruning is performed.
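
The LoRA points above can be checked numerically. The sketch below uses random matrices with hypothetical sizes to show that the LoRA-adapted weight `W + B @ A` keeps the dense shape of `W`, and that rescaling the factors as `(c * B, A / c)` leaves the adapted weight unchanged. This rescaling symmetry is one well-known degeneracy of bilinear parameterizations; whether it is the exact degeneracy analyzed in the paper is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 4, 2                 # hypothetical layer sizes and LoRA rank

W = rng.standard_normal((d_out, d_in))   # frozen base weight
B = rng.standard_normal((d_out, r))      # LoRA factor B
A = rng.standard_normal((r, d_in))       # LoRA factor A

W_eff = W + B @ A                        # adapted weight used at inference

# Rescale the two factors in opposite directions: different parameters,
# identical adapted weight.
c = 10.0
W_eff_rescaled = W + (c * B) @ (A / c)

same_shape = W_eff.shape == W.shape      # the matrix stays fully dense
same_function = np.allclose(W_eff, W_eff_rescaled)
print(same_shape, same_function)         # True True
```

Because infinitely many `(B, A)` pairs give the same adapted weight, the loss surface over the factors contains flat directions, and the dense shape means inference cost is unchanged.
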
FineGates, instead:
- Learns binary row/column gates.
- Directly removes entire dimensions from weight matrices.
- Reduces both trainable and inference-time parameters.
- Provides theoretical convergence guarantees.
- MaskLLM learns semi-structured sparsity masks (e.g., 2:4 sparsity).
- Requires training a mask tensor aligned with weight dimensions.
- Designed primarily for GPU-efficient sparsity patterns.
- Retains dense matrix dimensions (semi-structured rather than full row/column removal).
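
To make the 2:4 pattern concrete, the sketch below builds a mask that keeps two entries in every group of four along each row. It selects entries by magnitude purely for illustration; MaskLLM learns its masks, so this is a stand-in for the pattern and not its training procedure. The function name `two_four_mask` is hypothetical.

```python
import numpy as np

def two_four_mask(W):
    """Boolean mask keeping the 2 largest-magnitude entries per group of 4.

    Illustrative magnitude-based selection, not MaskLLM's learned masks.
    """
    rows, cols = W.shape
    if cols % 4 != 0:
        raise ValueError("number of columns must be divisible by 4")
    groups = np.abs(W).reshape(rows, cols // 4, 4)
    top2 = np.argsort(groups, axis=-1)[..., 2:]   # indices of the 2 largest
    mask = np.zeros(groups.shape, dtype=bool)
    np.put_along_axis(mask, top2, True, axis=-1)
    return mask.reshape(rows, cols)

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))
mask = two_four_mask(W)
W_sparse = W * mask

# One mask entry per weight, and the matrix keeps its original shape.
print(mask.size == W.size, W_sparse.shape == W.shape)  # True True
```

Half of the entries are zeroed, but the matrix dimensions are untouched, so any speedup depends on kernels or hardware that exploit the 2:4 layout.
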
FineGates, instead:
- Learns lightweight gate vectors (not full masks).
- Removes entire rows and columns (true structured pruning).
- Naturally reduces matrix dimensions.
- Beneficial for both CPU and GPU inference.
- Requires significantly fewer trainable parameters.
- Does not rely on specialized hardware sparsity patterns.
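
The row/column removal described above can be sketched as follows. With binary gates, multiplying by the gates and physically slicing the weight matrix give the same outputs on the kept rows, so the smaller matrix can replace the dense one at inference. The sizes and gate values are hypothetical, and the variable names are illustrative and not from the repository.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 6, 8
W = rng.standard_normal((d_out, d_in))
x = rng.standard_normal(d_in)

# Hypothetical converged binary gates: one per output row, one per input column.
row_gate = np.array([1, 0, 1, 1, 0, 1], dtype=bool)
col_gate = np.array([1, 1, 0, 1, 1, 0, 1, 1], dtype=bool)

# Gated dense computation: diag(row_gate) @ W @ diag(col_gate) @ x
y_gated = row_gate * (W @ (col_gate * x))

# Physically pruned computation on a smaller matrix.
W_small = W[row_gate][:, col_gate]
y_small = W_small @ x[col_gate]

outputs_match = np.allclose(y_gated[row_gate], y_small)
print(W.shape, "->", W_small.shape, outputs_match)  # (6, 8) -> (4, 6) True

# Trainable parameters: two gate vectors versus one mask entry per weight.
gate_params = d_out + d_in   # 14
mask_params = d_out * d_in   # 48
```

The pruned layer is an ordinary dense matrix of smaller size, which is why the saving applies on both CPU and GPU without special sparse kernels. In a full model, the neighboring layers would need matching slices so that the dimensions still line up.
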
FineGates enables structured compression with minimal performance degradation:
- Up to 20% parameter removal with negligible accuracy drop
- Up to 40% structured sparsity with <4% accuracy loss
- Up to 470M parameters removed in LLaMA-1B with ~6% drop
This demonstrates that task-specific subnetworks exist within pretrained models and can be efficiently identified through gating.
The implementation is compatible with:
- RoBERTa (Base / Large)
- LLaMA (1B / 7B)
- Other Transformer-based architectures
If you use FineGates in your research, please cite:
@inproceedings{svirsky2026train,
  title={Train Less, Infer Faster: Efficient Model Finetuning and Compression via Structured Sparsity},
  author={Jonathan Svirsky and Yehonathan Refael and Ofir Lindenbaum},
  booktitle={The 29th International Conference on Artificial Intelligence and Statistics},
  year={2026},
  url={https://openreview.net/forum?id=jU4ERfrjpH}
}