Add opt-in torch GPU solver for invert_network #1490
s-sasaki-earthsea-wizard wants to merge 3 commits
Add a CUDA-accelerated path for the per-pixel weighted least-squares
inversion in `ifgram_inversion.py`, batched as normal-equations + Cholesky
on a single CUDA device via PyTorch. The solver is opt-in and the default
(`mintpy.networkInversion.solver = auto`) resolves to `cpu`, so existing
setups are unaffected and the CPU path is byte-for-byte unchanged.
Surface
- cfg keys: `mintpy.networkInversion.solver = cpu|torch` (default `auto`),
`mintpy.networkInversion.gpuChunkSize = <int>` (default 0 = auto-size).
- CLI flags: `--solver {cpu,torch}` and `--gpu-chunk-size N` on
`ifgram_inversion.py`.
- New module `src/mintpy/ifgram_inversion_gpu.py` holds the torch path;
`ifgram_inversion.py` dispatches to it only when `solver=torch` is
explicitly requested.
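A minimal sketch of the resolution rule this commit message describes: `auto` resolves to `cpu`, and `torch` is used only when explicitly requested and fails hard without a visible CUDA device. The names `resolve_solver` and `KNOWN_SOLVERS` are hypothetical and are not taken from the PR. The `cuda_available` flag stands in for a call such as `torch.cuda.is_available()` so the sketch runs without PyTorch installed.

```python
KNOWN_SOLVERS = ("cpu", "torch")  # hypothetical constant mirroring the `cpu|torch` cfg values


def resolve_solver(requested: str = "auto", cuda_available: bool = False) -> str:
    """Map a requested solver name to the one that would actually run.

    Illustration only; not the dispatch function used in the PR.
    """
    name = requested.strip().lower()
    if name == "auto":
        # The default resolves to the CPU path, so existing setups are unaffected.
        return "cpu"
    if name not in KNOWN_SOLVERS:
        raise ValueError(f"unknown solver {requested!r}; expected 'auto' or one of {KNOWN_SOLVERS}")
    if name == "torch" and not cuda_available:
        # Hard-fail instead of silently falling back to the CPU.
        raise RuntimeError("solver=torch requested but no CUDA device is visible")
    return name
```

Raising rather than falling back keeps the choice explicit: a run that asked for the GPU never silently completes at CPU speed.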
Behavior
- VRAM auto-sizing probes free GPU memory and chooses a per-chunk pixel
count with a fixed headroom factor; `gpuChunkSize > 0` overrides.
- Rank-deficient pixels are detected via `torch.linalg.cholesky_ex` info
codes and zeroed so NaN/Inf cannot propagate downstream.
- Per-pixel NaN observations are handled by zeroing the corresponding row
weight, which is mathematically equivalent to dropping that row from the
WLS system.
- Selecting `solver=torch` on a host without a visible CUDA device raises
immediately rather than silently falling back to CPU, keeping any
performance regression visible.
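The zero-weight claim above can be checked numerically. The sketch below uses NumPy on the CPU rather than the torch path, with made-up data and a hypothetical helper `wls_normal_eq`. Here `w` holds weights, not square-root weights. One assumption is worth flagging: the sketch also replaces the NaN value itself with 0, because `0 * NaN` is still NaN in IEEE arithmetic; how the PR handles this detail is not shown in the source.

```python
import numpy as np


def wls_normal_eq(G, y, w):
    """Solve min_x sum_i w_i * (G_i x - y_i)^2 through the normal equations."""
    N = G.T @ (w[:, None] * G)  # G^T W G
    r = G.T @ (w * y)           # G^T W y
    return np.linalg.solve(N, r)


rng = np.random.default_rng(0)
G = rng.normal(size=(6, 3))         # 6 observations, 3 unknowns
y = rng.normal(size=6)
w = rng.uniform(0.5, 2.0, size=6)   # per-row weights

y[2] = np.nan                       # one missing observation
valid = ~np.isnan(y)

# Option A: drop the NaN row from the system.
x_drop = wls_normal_eq(G[valid], y[valid], w[valid])

# Option B: keep the row, but zero its weight (and its value, see the note above).
x_zero = wls_normal_eq(G, np.where(valid, y, 0.0), np.where(valid, w, 0.0))

print(np.allclose(x_drop, x_zero))
```

Option B keeps every pixel's system the same shape, which is what makes batching across pixels straightforward.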
Packaging
- Adds `[gpu]` extras in `pyproject.toml`, sourced from
`requirements-gpu.txt`. The PyTorch CUDA wheels live on a separate index;
`installation.md` documents the install command in a follow-up commit.
The opt-in GPU solver in `ifgram_inversion_gpu.py` is implemented entirely on top of `torch.linalg.cholesky_ex`, with no cupy entry point. Listing `cupy-cuda12x` in `requirements-gpu.txt` therefore pulls a multi-hundred-MB runtime that no code path imports. Drop it. Pin `torch>=2.11` to match the version exercised in the bench matrix used during development (Blackwell sm_120 wheel from the cu128 index). Earlier torch releases have not been validated against this code path.
Document the new opt-in `torch` GPU solver added in the previous commits:
- `docs/gpu.md`: setup, CLI / template surface, behavior notes (VRAM auto-sizing, rank-deficient pixel handling, NaN observations, hard-fail on missing CUDA), and indicative performance numbers.
- `docs/installation.md` §2.4: install the `[gpu]` extras together with the matching PyTorch CUDA wheel index.
- `docs/README.md` and `docs/dask.md`: add cross-links so readers can reach the GPU page from the documentation root and from the Dask page (since the two parallelism paths are orthogonal and a run must pick one or the other).
Performance numbers in `gpu.md` §4 are stated inline without any external repository links so the page stays self-contained.
Reviewer's Guide
Adds an opt-in CUDA-accelerated PyTorch solver for the invert_network step, with CLI/template wiring, configuration defaults, docs, tests, and packaging extras, while keeping the existing CPU path as the default and behaviorally unchanged.
Sequence diagram for GPU-backed invert_network execution
sequenceDiagram
actor User
participant CLI as IfgramInversionCLI
participant Runner as run_ifgram_inversion
participant Patch as run_ifgram_inversion_patch
participant GPU as IfgramInversionGPUModule
participant Torch as TorchLibrary
participant CUDA as CUDADevice
User->>CLI: run ifgram_inversion.py\n--solver torch --gpu-chunk-size 0
CLI->>Runner: inps with solver=torch,\n gpuChunkSize=0
Runner->>Patch: run_ifgram_inversion_patch(...,\n solver=torch, gpu_chunk_size=0)
Patch->>Patch: build design matrices A, B\nselect pixels idx_pixel2inv
Patch->>GPU: estimate_timeseries_batch(A, B, y, tbase_diff,\n weight_sqrt, min_norm_velocity, rcond,\n min_redundancy, inv_quality_name,\n chunk_size=gpu_chunk_size, solver=torch)
GPU->>Torch: torch.cuda.is_available()
Torch-->>GPU: True (or raises error if False)
GPU->>Torch: cuda.mem_get_info() (when chunk_size<=0)
Torch-->>GPU: free_bytes, total_bytes
GPU->>GPU: _auto_chunk_size()\nchoose chunk_size from free VRAM
loop for each pixel chunk
GPU->>Torch: as_tensor(G, tbase_diff, y_chunk, w_chunk)
Torch->>CUDA: launch kernels for Gw, yw, N, r
CUDA-->>Torch: Gw, yw, N, r on device
GPU->>Torch: linalg.cholesky_ex(N)
Torch->>CUDA: batched Cholesky
CUDA-->>Torch: L, info
GPU->>GPU: zero-out rank-deficient pixels\n(info != 0)
GPU->>Torch: cholesky_solve(r, L)
Torch->>CUDA: solve for X_batch
CUDA-->>Torch: X_batch
GPU->>GPU: build ts_chunk, inv_quality_chunk,\n num_inv_obs_chunk
GPU-->>Patch: partial ts, quality, counts\nfor this chunk
Patch->>Patch: write into global\narrays at indices
end
GPU-->>Patch: ts, inv_quality, num_inv_obs
Patch->>Patch: assign ts[:, idx_pixel2inv]\ninv_quality[idx_pixel2inv]\nnum_inv_obs[idx_pixel2inv]
Patch-->>Runner: inversion results
Runner-->>User: timeseries/velocity outputs\n(with GPU-accelerated invert_network)
Class diagram for the new GPU-batched inversion solver and integration
classDiagram
class IfgramInversionPatch {
+run_ifgram_inversion_patch(ifgram_file, box, ref_phase, obs_ds_name, weight_func, water_mask_file, min_norm_velocity, mask_ds_name, mask_threshold, min_redundancy, calc_cov, solver, gpu_chunk_size)
}
class IfgramInversionCLI {
+create_parser(subparsers)
+--solver: str (cpu|torch)
+--gpu-chunk-size: int
+read_template2inps(template_file, inps)
}
class SmallbaselineAppDefaults {
+mintpy.networkInversion.solver: str
+mintpy.networkInversion.gpuChunkSize: int
}
class IfgramInversionGPUModule {
+SUPPORTED_SOLVERS: tuple
+DEFAULT_CHUNK_SIZE: int
+VRAM_SAFETY: float
+is_solver_available(solver)
+_get_torch_device(solver)
+_auto_chunk_size(num_pair, num_unknown, dtype_bytes)
+_solve_cholesky(G_dev, w_dev, y_dev)
+estimate_timeseries_batch(A, B, y, tbase_diff, weight_sqrt, min_norm_velocity, rcond, min_redundancy, inv_quality_name, chunk_size, solver, print_msg)
}
class TorchLibrary {
+cuda.is_available()
+cuda.mem_get_info()
+as_tensor(data, dtype, device)
+linalg.cholesky_ex(N)
+cholesky_solve(r, L)
}
class CUDADevice {
<<hardware>>
}
IfgramInversionCLI --> IfgramInversionPatch : passes solver\n& gpuChunkSize
SmallbaselineAppDefaults --> IfgramInversionCLI : provides default\nconfig values
IfgramInversionPatch ..> IfgramInversionGPUModule : imports\nestimate_timeseries_batch
IfgramInversionGPUModule ..> TorchLibrary : uses
TorchLibrary ..> CUDADevice : executes kernels on
IfgramInversionPatch --> IfgramInversionGPUModule : uses when\nsolver != cpu
Sharing additional benchmark data for the GPU torch solver. The PR description's Performance section reports per-step
Configuration: warm SSD, NVIDIA RTX 5080 (16 GiB), float32,
The 5 to 94× range is structurally driven by per-pixel solve cost (∝ K · D²): Kuju (K=167, D=24) sits at the floor, SanFranBay (K=1297, D=333) at the ceiling. The Fernandina and Galapagos figures above are consistent with the
Numerical agreement: the float32 round-off gate (rms / |cpu|.max < 1e-5) is met for the user-visible final products (
Full per-step wall breakdown, fixture parity verification (cpu and torch fixtures verified byte-identical except for the 2 →
Dataset records (Zenodo):
Description of proposed changes
This PR adds an opt-in CUDA-accelerated path for the per-pixel weighted
least-squares inversion in the `invert_network` step (`ifgram_inversion.py`).
The fork has been running this code on tutorial-scale and large-scale scenes
for several weeks; this submission consolidates the implementation as it
currently stands.
The default path is unchanged. `mintpy.networkInversion.solver = auto`
resolves to `cpu`, and the existing CPU code path is byte-for-byte identical
to upstream; every other step in `smallbaselineApp.py` continues to run on
the CPU regardless of this setting.
The aim is to contribute faster InSAR time-series processing for NVIDIA GPU
users, since `invert_network` is the dominant CPU bottleneck on typical
workflows and the gap widens with scene size.
Closes #1489 (RFC).
Implementation summary
- `src/mintpy/ifgram_inversion_gpu.py` batches the per-pixel WLS systems on a single CUDA device. The solver is normal-equations + Cholesky via `torch.linalg.cholesky_ex`, which (a) is significantly faster than `torch.linalg.lstsq` on the matrix shapes encountered here and (b) lets us detect rank-deficient pixels through the returned `info` codes rather than via post-hoc residual checks.
- `ifgram_inversion.py` dispatches to the GPU module only when `solver = torch` is explicitly requested; the CPU loop is untouched.
- `[gpu]` extras in `pyproject.toml`, sourced from `requirements-gpu.txt` (just `torch>=2.11`). Install requires the PyTorch CUDA wheel index: `pip install -e ".[gpu]" --extra-index-url https://download.pytorch.org/whl/cu128` (documented in `docs/installation.md` §2.4).
- `tests/test_ifgram_inversion_gpu.py` covers the dispatch logic and the GPU fast paths with synthetic NaN / rank-deficient fixtures.
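To make the normal-equations + Cholesky approach concrete, here is a CPU sketch in NumPy. It is a stand-in for illustration, not the PR's torch code: `solve_wls_batch` is a hypothetical name, weights are passed as plain weights rather than square-root weights, and a Python `try/except` per pixel plays the role that the `info` codes from `torch.linalg.cholesky_ex` play in the batched GPU path.

```python
import numpy as np


def solve_wls_batch(G, y, w):
    """Batched WLS via normal equations + Cholesky (NumPy stand-in for the torch path).

    G: (K, D) design matrix shared by all pixels.
    y: (P, K) observations; w: (P, K) non-negative weights.
    Returns x with shape (P, D) and a boolean mask of pixels that factorized.
    """
    # N_p = G^T diag(w_p) G and r_p = G^T diag(w_p) y_p for every pixel p.
    N = np.einsum("kd,pk,ke->pde", G, w, G)
    r = np.einsum("kd,pk->pd", G, w * y)

    x = np.zeros_like(r)
    ok = np.zeros(len(r), dtype=bool)
    for p in range(len(r)):
        try:
            L = np.linalg.cholesky(N[p])  # raises unless N_p is positive definite
        except np.linalg.LinAlgError:
            continue                      # failed factorization: leave zeros for this pixel
        z = np.linalg.solve(L, r[p])      # forward solve: L z = r
        x[p] = np.linalg.solve(L.T, z)    # back solve: L^T x = z
        ok[p] = True
    return x, ok
```

A pixel whose weights are all zero (for example, every observation masked out) produces a zero normal matrix, fails the factorization, and comes back as zeros instead of NaN/Inf.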
Behavior notes
- `gpuChunkSize = 0` (the default) probes free GPU memory at runtime and chooses a per-chunk pixel count with a fixed headroom factor; passing a positive integer overrides this for reproducible chunking across hosts with different VRAM.
- Rank-deficient pixels are detected via `cholesky_ex` info codes and zeroed so NaN/Inf cannot propagate downstream; a warning line reports the count per chunk.
- Per-pixel NaN observations are handled by zeroing the corresponding row weight, which is mathematically equivalent to dropping that row from the WLS system.
- Selecting `solver = torch` on a host without a visible CUDA device raises immediately rather than silently falling back to CPU; this keeps any performance regression visible.
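The chunk auto-sizing rule can be sketched as a small pure function. Everything specific here is an assumption made for illustration: the name `estimate_chunk_size`, the 0.5 headroom factor, and the per-pixel memory model (which tensors are counted) are not taken from the PR. In a real run the `free_bytes` input would come from a probe such as `torch.cuda.mem_get_info()`, which returns free and total bytes.

```python
def estimate_chunk_size(free_bytes, num_pair, num_unknown,
                        dtype_bytes=4, safety=0.5, override=0):
    """Pick how many pixels to solve per chunk from the free device memory.

    Hypothetical helper: the memory model below is a rough guess, not the PR's.
    """
    if override > 0:
        # A positive user value wins, giving reproducible chunking across hosts.
        return int(override)

    # Assumed per-pixel footprint, counted in array elements.
    elems_per_pixel = (
        2 * num_pair                # observation row and weight row
        + num_pair * num_unknown    # weighted copy of the design matrix
        + 2 * num_unknown ** 2      # normal matrix and its Cholesky factor
        + 2 * num_unknown           # right-hand side and solution
    )
    budget = int(free_bytes * safety)  # keep a fixed share of memory as headroom
    return max(1, budget // (elems_per_pixel * dtype_bytes))
```

With 1 GiB free, 100 pairs, 20 unknowns and float32, this model allows 44150 pixels per chunk; passing `override=5000` returns 5000 regardless of free memory.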
Design pivot vs the original RFC
The RFC (#1489) originally described `torch.linalg.lstsq` as the GPU solver.
During development the path was switched to normal-equations + Cholesky
after a side-by-side benchmark showed it preserves output equivalence to
float32 round-off (RMS ~1e-5) while running ~16× faster than `lstsq` on the
same matrix shapes (tutorial dataset: FernandinaSenDT128). An RMS difference
on the order of `1e-5` in the displacement field is well below the typical
InSAR noise floor, sub-millimeter on a per-pixel basis, so the two solvers
are operationally equivalent for the geophysical use case. The `lstsq` path
was removed before this submission so there is only one supported GPU code
path to reason about.
Performance
Indicative numbers measured on an NVIDIA RTX 5080 (Blackwell sm_120, CUDA
12.8, PyTorch 2.11). Speedup will vary with scene size, GPU class, and
chunk-size tuning.
Large-scene absolute timings: CPU 6189 s → torch 170 s on the same machine (about 36×).
Numerical equivalence between the `cpu` and `torch` solvers holds to
float32 round-off in both cases (RMS on the order of `1e-5`; absolute RMS
max ~16 µm on the large-scene case).
Reproduction artifacts (harness scripts, raw logs, full reports) live in a
separate repository. Links below are pinned to a single sibling commit so
the data does not move during review:
- `cpu` vs `torch` end-to-end on Fernandina: https://github.com/s-sasaki-earthsea-wizard/mintpy-benchmark/blob/c20ca8bb/reports/report_torch.md
- `lstsq` vs Cholesky equivalence + per-step speedup: https://github.com/s-sasaki-earthsea-wizard/mintpy-benchmark/blob/c20ca8bb/reports/report_solver_comparison.md
- https://github.com/s-sasaki-earthsea-wizard/mintpy-benchmark/blob/c20ca8bb/reports/report_chunk_sweep.md
- `torch.profiler` GPU kernel breakdown: https://github.com/s-sasaki-earthsea-wizard/mintpy-benchmark/blob/c20ca8bb/reports/report_profile.md
- https://github.com/s-sasaki-earthsea-wizard/mintpy-benchmark/blob/c20ca8bb/reports/report_large_scene.md
Numbers are from a single development machine; absolute timings will vary
across hardware, but the qualitative findings (Cholesky > lstsq; GPU > CPU
at this matrix scale; speedup grows with scene size) should hold for any
recent NVIDIA CUDA-class device. Harness scripts and raw logs in the
mintpy-benchmark repository above let other GPU users reproduce on their
own data.
Local validation
Run on the PR branch (`upstream/main` + the three commits in this PR), against FernandinaSenDT128:
- `pre-commit run --all-files` exits clean (13 hooks pass + 1 skip on json, per the upstream `.pre-commit-config.yaml`).
- `smallbaselineApp.py` end-to-end with default settings (`solver = auto` resolves to `cpu`): all 18 steps, `Normal end of smallbaselineApp processing!`, total wall time 2 h 2 m (the `correct_troposphere` step's CDS download dominated; the actual computation portion is small). All standard output products generated (`timeseries.h5`, `timeseries_ERA5.h5`, `timeseries_ERA5_ramp.h5`, `timeseries_ERA5_ramp_demErr.h5`, `velocity.h5`, `velocityERA5.h5`, `geo/`, etc.).
- `smallbaselineApp.py` end-to-end with `solver = torch` (same dataset, ERA5 grib + tropo product reused via symlink so this run only re-exercises `invert_network` and the post-tropo steps): all 18 steps, `Normal end of smallbaselineApp processing!`, total wall time 6 m 57 s. Log confirms the GPU path was actually entered:
  Same set of standard output products as the CPU run.
Disclosure
This work was developed with the assistance of Claude Opus 4.7 (Anthropic's coding
assistant). All design decisions, benchmark execution, and review of the
generated code were performed by me. Per project convention the
`Assisted-by: Claude Opus 4.7` trailers used during fork development have been
stripped from this branch's commit history; this paragraph is the canonical
disclosure.
If the AI-assisted aspect raises review or maintenance concerns for the
project, I'm happy to discuss — including whether to keep the GPU module
opt-in / under a feature flag.
Reminders
`.pre-commit-config.yaml`; CI to confirm.
Summary by Sourcery
Introduce an opt-in GPU-accelerated solver for the invert_network step using a PyTorch CUDA backend, while preserving the existing CPU behavior as default and documenting configuration and usage.