A clean, minimal 2-D particle simulator written in Python, showcasing GPU acceleration with Numba CUDA alongside a CPU reference implementation.
Built as a portfolio project demonstrating:
- Scientific computing fundamentals (numerical integration, force fields)
- GPU programming with Numba CUDA (
@cuda.jit,cuda.grid,cuda.synchronize) - Velocity-Verlet integration (symplectic, second-order accurate)
- CPU vs GPU benchmarking methodology
- Modular, well-commented Python project structure
Each particle pair closer than diameter rβ feels a harmonic repulsion:
U(r) = A Β· (1 β r/rβ)Β² for r < rβ, else 0
This potential is always finite, never diverges, and stays numerically stable with any reasonable timestep. It models a colloidal suspension or soft-sphere gas.
The classic MD pair potential:
U(r) = 4Ξ΅ [ (Ο/r)ΒΉΒ² β (Ο/r)βΆ ]
The rβ»ΒΉΒ² term is short-range Pauli repulsion; rβ»βΆ is van der Waals attraction.
Requires a small timestep (dt β€ 0.001) and runs in NVT (thermostat always on).
Both potentials use Velocity-Verlet integration:
x(t+Ξt) = x(t) + v(t)Β·Ξt + Β½a(t)Β·ΞtΒ²
v(t+Ξt) = v(t) + Β½[a(t) + a(t+Ξt)]Β·Ξt
Time-reversible and second-order accurate β the standard choice for MD.
md_sim/
βββ __init__.py # package metadata
βββ cpu_forces.py # CPU reference: soft + LJ potentials (NumPy)
βββ gpu_forces.py # GPU kernel: soft potential (Numba CUDA)
βββ integrator.py # Velocity-Verlet, force-agnostic
βββ simulation.py # Main driver: init β equilibrate β run
βββ benchmark.py # CPU vs GPU timing comparison
βββ utils.py # Particle init, kinetic energy, temperature
run_example.py # Minimal working example (run this!)
README.md
pip install numpy numbaFor GPU support, also install the NVIDIA CUDA Toolkit: https://developer.nvidia.com/cuda-downloads
No GPU? The simulator runs fully in CPU-only mode (the default).
# Soft repulsion, 256 particles, CPU
python run_example.py
# Custom particle count, 1000 steps
python run_example.py --n 512 --steps 1000
# GPU acceleration (requires CUDA)
python run_example.py --gpu --n 1024
# Lennard-Jones potential (CPU, NVT)
python run_example.py --lj --n 64 --steps 500
# CPU vs GPU benchmark table
python run_example.py --benchmarkfrom md_sim.simulation import Simulation
# Soft repulsion (stable, default)
sim = Simulation(
n_particles = 256,
box_size = 25.0,
dt = 0.01,
use_gpu = False, # set True if CUDA is available
potential_type = "soft",
target_temp = 1.0,
)
sim.equilibrate(steps=300)
history = sim.run(steps=500, log_every=50)
# Access results
for state in history:
print(state.step, state.ke, state.temp)Simulation ready: 256 particles | box=25.0 | dt=0.01 | potential=soft | CPU
Equilibrating 300 steps (thermostat ON) β¦
Done. T = 1.0000 KE = 255.0000
Step Time KE PE Temp
--------------------------------------------------
301 3.010 252.5852 1351.7369 0.9905
360 3.600 255.9235 1348.3744 1.0036
420 4.200 250.2586 1354.0553 0.9814
...
600 6.000 272.0598 1332.2603 1.0669
Done. 300 steps in 8.7 s (34.6 steps/s)
KE fluctuates naturally around N Γ T = 256 β the system is in thermal equilibrium.
@cuda.jit
def _soft_forces_kernel(pos, box, r0sq, A, forces_out):
i = cuda.grid(1) # global thread index β particle i
n = pos.shape[0]
if i >= n: # guard: extra threads do nothing
return
fx = numba.float32(0.0)
fy = numba.float32(0.0)
for j in range(n): # thread i loops over all j
if i == j:
continue
# ... compute force from j on i ...
fx += ...
fy += ...
forces_out[i, 0] = fx # one write to global memory per thread
forces_out[i, 1] = fy| CUDA concept | What it means here |
|---|---|
@cuda.jit |
Compile Python β PTX GPU machine code |
cuda.grid(1) |
Global thread index across all blocks β particle index |
threads_per_block = 128 |
Block size (warp-aligned; tune per GPU) |
blocks_per_grid = βN/128β |
Enough blocks to cover all N particles |
cuda.synchronize() |
Block host until all GPU threads finish (required for timing) |
The CPU executes each particle's force computation sequentially.
The GPU executes all N simultaneously across thousands of CUDA cores.
For N = 1024 particles:
- CPU: ~1M operations, done one at a time
- GPU: ~1M operations, done in parallel across 1024 threads
| N | CPU (ms) | GPU (ms) | Speedup |
|---|---|---|---|
| 64 | ~5 | ~1.5 | ~3Γ |
| 128 | ~20 | ~1.7 | ~12Γ |
| 256 | ~80 | ~2.2 | ~36Γ |
| 512 | ~320 | ~3.5 | ~91Γ |
| 1024 | ~1280 | ~8.0 | ~160Γ |
GPU timings include hostβdevice memory transfer.
| N | Regime |
|---|---|
| < 128 | GPU overhead (memory transfer, kernel launch) may dominate |
| 256β512 | GPU starts to pull ahead |
| β₯ 1024 | GPU wins clearly (10β200Γ depending on hardware) |
The CPU kernel is pure Python β intentionally readable, not optimised. A NumPy-vectorised CPU would be much faster but less instructive.
| Feature | Difficulty | Impact |
|---|---|---|
| NumPy-vectorised CPU kernel | Easy | 10β100Γ faster CPU |
| Cell / neighbour lists | Medium | O(NΒ²) β O(N) force calculation |
| CUDA shared memory tiling | Medium | Reduces global memory reads |
| 3-D simulation | Easy | Add z-component throughout |
| Velocity-rescaling thermostat (Berendsen) | Easy | Smoother temperature control |
| Matplotlib / live animation | Easy | Visual output |
| Multiple particle species | Medium | Mixtures, alloys |
| NPT ensemble (pressure control) | Medium | Variable box size |
- Allen & Tildesley, Computer Simulation of Liquids (2017) β the MD bible
- Frenkel & Smit, Understanding Molecular Simulation (2002)
- Numba CUDA documentation
- NVIDIA CUDA C Programming Guide
MIT β free to use, study, and modify.