Add opt-in torch GPU solver for invert_network #1490

Open
s-sasaki-earthsea-wizard wants to merge 3 commits into insarlab:main from s-sasaki-earthsea-wizard:gpu_torch_solver

Conversation

@s-sasaki-earthsea-wizard s-sasaki-earthsea-wizard commented May 6, 2026

Description of proposed changes

This PR adds an opt-in CUDA-accelerated path for the per-pixel weighted
least-squares inversion in the invert_network step (ifgram_inversion.py).
The fork has been running this code on tutorial-scale and large-scale scenes
for several weeks; this submission consolidates the implementation as it
currently stands.

The default path is unchanged. mintpy.networkInversion.solver = auto
resolves to cpu, and the existing CPU code path is byte-for-byte identical
to upstream — every other step in smallbaselineApp.py continues to run on
the CPU regardless of this setting.
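
The solver resolution described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `resolve_solver`, its `cuda_available` argument, and the `RuntimeError` type are assumptions made for the example (the PR only states that `auto` resolves to `cpu`, that unsupported names raise `ValueError`, and that `torch` without a visible CUDA device raises immediately).

```python
SUPPORTED_SOLVERS = ("cpu", "torch")  # assumed contents; mirrors the choices named in this PR


def resolve_solver(requested: str, cuda_available: bool) -> str:
    """Resolve a template/CLI solver value to a concrete backend name.

    Hypothetical helper: name, signature and exception types are illustrative.
    """
    value = requested.strip().lower()
    if value == "auto":
        # default behaviour: `auto` resolves to the unchanged CPU path
        return "cpu"
    if value not in SUPPORTED_SOLVERS:
        raise ValueError(
            f"unsupported solver {requested!r}; expected one of {SUPPORTED_SOLVERS}"
        )
    if value == "torch" and not cuda_available:
        # fail fast rather than silently falling back to the CPU
        raise RuntimeError("solver = torch requested but no CUDA device is visible")
    return value


print(resolve_solver("auto", cuda_available=False))   # cpu
print(resolve_solver("torch", cuda_available=True))   # torch
```

Passing CUDA availability in as a plain boolean keeps the sketch runnable on a machine without PyTorch; real code would query the device instead.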

The aim is to contribute faster InSAR time-series processing for NVIDIA GPU
users, since invert_network is the dominant CPU bottleneck on typical
workflows and the gap widens with scene size.

Closes #1489 (RFC).

Implementation summary

  • New module src/mintpy/ifgram_inversion_gpu.py batches the per-pixel WLS
    systems on a single CUDA device. The solver is normal-equations +
    Cholesky via torch.linalg.cholesky_ex, which (a) is significantly
    faster than torch.linalg.lstsq on the matrix shapes encountered here and
    (b) lets us detect rank-deficient pixels through the returned info codes
    rather than via post-hoc residual checks.
  • ifgram_inversion.py dispatches to the GPU module only when
    solver = torch is explicitly requested; the CPU loop is untouched.
  • New [gpu] extras in pyproject.toml, sourced from requirements-gpu.txt
    (just torch>=2.11). Install requires the PyTorch CUDA wheel index:
    pip install -e ".[gpu]" --extra-index-url https://download.pytorch.org/whl/cu128
    (documented in docs/installation.md §2.4).
  • Tests in tests/test_ifgram_inversion_gpu.py cover the dispatch logic and
    the GPU fast paths with synthetic NaN / rank-deficient fixtures.
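
For readers unfamiliar with the normal-equations + Cholesky approach, here is a minimal CPU sketch in NumPy. It is a stand-in, not the PR's implementation: the function name, the argument layout, and the use of weights (rather than their square roots) are assumptions. The `try/except` around `np.linalg.cholesky` plays the role that the `info` codes of `torch.linalg.cholesky_ex` play on the GPU.

```python
import numpy as np


def solve_wls_cholesky(G, w, y):
    """Solve per-pixel WLS systems via normal equations + Cholesky.

    G: (K, D) design matrix shared by all pixels
    w: (P, K) non-negative per-pixel row weights
    y: (P, K) per-pixel observations
    Returns X with shape (P, D) and a boolean mask `ok` with shape (P,).
    """
    # N_p = G^T diag(w_p) G and r_p = G^T diag(w_p) y_p for every pixel p
    N = np.einsum("ki,pk,kj->pij", G, w, G)
    r = np.einsum("ki,pk,pk->pi", G, w, y)

    X = np.zeros((w.shape[0], G.shape[1]))
    ok = np.zeros(w.shape[0], dtype=bool)
    for p in range(w.shape[0]):
        try:
            L = np.linalg.cholesky(N[p])      # N_p = L L^T
        except np.linalg.LinAlgError:
            continue                          # not positive definite: pixel stays zeroed
        z = np.linalg.solve(L, r[p])          # L z = r_p
        X[p] = np.linalg.solve(L.T, z)        # L^T x = z
        ok[p] = True
    return X, ok


G = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]])
x_true = np.array([2.0, -3.0])
y = np.tile(G @ x_true, (2, 1))
w = np.array([[1.0, 1.0, 1.0, 1.0],   # well-posed pixel
              [0.0, 0.0, 0.0, 0.0]])  # no usable observations -> singular N
X, ok = solve_wls_cholesky(G, w, y)
print(X, ok)
```

The Python loop exists only because `np.linalg.cholesky` raises for the whole batch on a single failure; a batched routine that returns per-matrix status codes avoids it.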

Behavior notes

  • VRAM auto-sizing: gpuChunkSize = 0 (the default) probes free GPU
    memory at runtime and chooses a per-chunk pixel count with a fixed
    headroom factor; passing a positive integer overrides this for
    reproducible chunking across hosts with different VRAM.
  • Rank-deficient pixels are detected via cholesky_ex info codes and
    zeroed so NaN/Inf cannot propagate downstream; a warning line reports
    the count per chunk.
  • Per-pixel NaN observations are handled by zeroing the corresponding
    row weight, which is mathematically equivalent to dropping that row from
    the WLS system.
  • No silent CPU fallback — selecting solver = torch on a host without
    a visible CUDA device raises immediately rather than silently falling
    back to CPU; this keeps any performance regression visible.
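
The NaN-handling claim above (zero weight is equivalent to dropping the row) can be checked numerically. The sketch below uses NumPy on one synthetic pixel; all names and sizes are illustrative. One detail is an assumption about how such a scheme must be implemented: the NaN value itself is replaced by 0 before weighting, since `0 * NaN` is still NaN.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.normal(size=(6, 3))           # 6 observations, 3 unknowns
y = rng.normal(size=6)
w = rng.uniform(0.5, 2.0, size=6)     # WLS weights
y[2] = np.nan                         # one missing observation

valid = ~np.isnan(y)

# (a) keep all rows, but zero the weight (and the value) of the NaN row
w0 = np.where(valid, w, 0.0)
y0 = np.where(valid, y, 0.0)
N = G.T @ (w0[:, None] * G)           # G^T W G
r = G.T @ (w0 * y0)                   # G^T W y
x_zero_weight = np.linalg.solve(N, r)

# (b) drop the NaN row from the system entirely
Gv, yv, wv = G[valid], y[valid], w[valid]
x_dropped = np.linalg.solve(Gv.T @ (wv[:, None] * Gv), Gv.T @ (wv * yv))

print(np.allclose(x_zero_weight, x_dropped))
```

Variant (a) keeps every pixel's system the same shape, which is what makes batching across pixels with different NaN patterns possible.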

Design pivot vs the original RFC

The RFC (#1489) originally described torch.linalg.lstsq as the GPU solver.
During development the path was switched to normal-equations + Cholesky
after a side-by-side benchmark showed it preserves output equivalence to
float32 round-off (RMS ~1e-5) while running ~16× faster than lstsq on the
same matrix shapes (tutorial dataset: FernandinaSenDT128). An RMS difference
on the order of 1e-5 in the displacement field is well below the typical
InSAR noise floor — sub-millimeter on a per-pixel basis — so the two solvers
are operationally equivalent for the geophysical use case. The lstsq path
was removed before this submission so there is only one supported GPU code
path to reason about.

Performance

Indicative numbers measured on an NVIDIA RTX 5080 (Blackwell sm_120, CUDA
12.8, PyTorch 2.11). Speedup will vary with scene size, GPU class, and
chunk-size tuning.

| Scene | Pixels | ifgs | invert_network internal | step wall |
|---|---|---|---|---|
| FernandinaSenDT128 (tutorial) | 270k | 288 | ~16× faster | ~4.5× faster |
| GalapagosSenDT128 (large) | 3.4M | 475 | ~44× faster | ~36× faster |

Large-scene absolute timings: CPU 6189 s → torch 170 s on the same machine.
Numerical equivalence between the cpu and torch solvers holds to
float32 round-off in both cases (RMS on the order of 1e-5; absolute RMS
max ~16 µm on the large-scene case).

Reproduction artifacts (harness scripts, raw logs, full reports) live in a
separate repository. Links below are pinned to a single sibling commit so
the data does not move during review:

Numbers are from a single development machine; absolute timings will vary
across hardware, but the qualitative findings (Cholesky > lstsq; GPU > CPU
at this matrix scale; speedup grows with scene size) should hold for any
recent NVIDIA CUDA-class device. Harness scripts and raw logs in the
mintpy-benchmark repository above let other GPU users reproduce on their
own data.

Local validation

Run on the PR branch (upstream/main + the three commits in this PR), against
FernandinaSenDT128:

  • pre-commit run --all-files exits clean (13 hooks pass + 1 skip on json,
    per the upstream .pre-commit-config.yaml).

  • smallbaselineApp.py end-to-end with default settings (solver = auto
    resolves to cpu): all 18 steps, Normal end of smallbaselineApp processing!,
    total wall time 2 h 2 m (the correct_troposphere step's CDS download
    dominated; the actual computation portion is small). All standard output
    products generated (timeseries.h5, timeseries_ERA5.h5,
    timeseries_ERA5_ramp.h5, timeseries_ERA5_ramp_demErr.h5, velocity.h5,
    velocityERA5.h5, geo/, etc.).

  • smallbaselineApp.py end-to-end with solver = torch (same dataset,
    ERA5 grib + tropo product reused via symlink so this run only re-exercises
    invert_network and the post-tropo steps): all 18 steps,
    Normal end of smallbaselineApp processing!, total wall time 6 m 57 s.
    Log confirms the GPU path was actually entered:

    mintpy.networkInversion.solver: auto --> torch
    estimating time-series via torch solver (batched, GPU)
    GPU auto chunk_size = 19403 pixels (free VRAM 15.1 GiB)
    estimating time-series via torch batched WLS in 14 chunk(s) of up to 19403 pixels ...
    

    Same set of standard output products as the CPU run.
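
The chunk count in the log above follows from simple arithmetic. As a rough check, the sketch below splits an approximate pixel count into chunks of the logged size. `chunk_bounds` is a hypothetical helper, and 270,000 is only the rounded "270k" from the Performance table; the number of pixels actually inverted may differ after masking.

```python
import math


def chunk_bounds(num_pixel: int, chunk_size: int):
    """Yield (start, stop) index pairs covering num_pixel in chunks of at most chunk_size."""
    for start in range(0, num_pixel, chunk_size):
        yield start, min(start + chunk_size, num_pixel)


num_pixel = 270_000    # approximate pixel count ("270k"), an assumption for this example
chunk_size = 19_403    # auto chunk size reported in the log above

bounds = list(chunk_bounds(num_pixel, chunk_size))
print(len(bounds), math.ceil(num_pixel / chunk_size))  # 14 14
print(bounds[-1])                                      # (252239, 270000)
```

Under that assumption the result agrees with the "14 chunk(s)" reported in the log.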

Disclosure

This work was developed with the assistance of Claude Opus 4.7 (Anthropic's coding
assistant). All design decisions, benchmark execution, and review of the
generated code were performed by me. Per project convention the
Assisted-by: Claude Opus 4.7 trailers used during fork development have been stripped
from this branch's commit history; this paragraph is the canonical
disclosure.

If the AI-assisted aspect raises review or maintenance concerns for the
project, I'm happy to discuss — including whether to keep the GPU module
opt-in / under a feature flag.

Reminders

  • Fix RFC: opt-in GPU backend for invert_network (torch.linalg.lstsq, CUDA) #1489
  • Pass Pre-commit check (green) — verified locally with the upstream .pre-commit-config.yaml; CI to confirm.
  • Pass Codacy code review (green)
  • Pass Circle CI test (green)
  • Make sure that your code follows our style. Use the other functions/files as a basis.
  • If modifying functionality, describe changes to function behavior and arguments in a comment below the function declaration.
  • If adding new functionality, add a detailed description to the documentation and/or an example.

Summary by Sourcery

Introduce an opt-in GPU-accelerated solver for the invert_network step using a PyTorch CUDA backend, while preserving the existing CPU behavior as default and documenting configuration and usage.

New Features:

  • Add a torch-based CUDA-batched weighted least-squares solver for invert_network via the new ifgram_inversion_gpu module.
  • Expose CLI and template options to select the WLS solver (cpu or torch) and configure GPU chunk size for network inversion.
  • Provide dedicated documentation describing GPU acceleration for invert_network, including setup, configuration, and performance expectations.

Enhancements:

  • Extend ifgram_inversion to dispatch to the GPU-batched solver when explicitly requested, without altering the default CPU code path.
  • Add optional [gpu] extras and GPU-specific requirements to the build configuration for installing CUDA-enabled PyTorch.
  • Update default smallbaselineApp configuration files to include explicit defaults for the network inversion solver and GPU chunk size.

Tests:

  • Add numerical-equivalence and behavior tests for the GPU-batched solver, including CUDA-availability gating and solver selection validation.

Add a CUDA-accelerated path for the per-pixel weighted least-squares
inversion in `ifgram_inversion.py`, batched as normal-equations + Cholesky
on a single CUDA device via PyTorch. The solver is opt-in and the default
(`mintpy.networkInversion.solver = auto`) resolves to `cpu`, so existing
setups are unaffected and the CPU path is byte-for-byte unchanged.

Surface
- cfg keys: `mintpy.networkInversion.solver = cpu|torch` (default `auto`),
  `mintpy.networkInversion.gpuChunkSize = <int>` (default 0 = auto-size).
- CLI flags: `--solver {cpu,torch}` and `--gpu-chunk-size N` on
  `ifgram_inversion.py`.
- New module `src/mintpy/ifgram_inversion_gpu.py` holds the torch path;
  `ifgram_inversion.py` dispatches to it only when `solver=torch` is
  explicitly requested.

Behavior
- VRAM auto-sizing probes free GPU memory and chooses a per-chunk pixel
  count with a fixed headroom factor; `gpuChunkSize > 0` overrides.
- Rank-deficient pixels are detected via `torch.linalg.cholesky_ex` info
  codes and zeroed so NaN/Inf cannot propagate downstream.
- Per-pixel NaN observations are handled by zeroing the corresponding row
  weight, which is mathematically equivalent to dropping that row from the
  WLS system.
- Selecting `solver=torch` on a host without a visible CUDA device raises
  immediately rather than silently falling back to CPU, keeping any
  performance regression visible.

Packaging
- Adds `[gpu]` extras in `pyproject.toml`, sourced from
  `requirements-gpu.txt`. The PyTorch CUDA wheels live on a separate index;
  `installation.md` documents the install command in a follow-up commit.

The opt-in GPU solver in `ifgram_inversion_gpu.py` is implemented entirely
on top of `torch.linalg.cholesky_ex`, with no cupy entry point. Listing
`cupy-cuda12x` in `requirements-gpu.txt` therefore pulls a multi-hundred-MB
runtime that no code path imports. Drop it.

Pin `torch>=2.11` to match the version exercised in the bench matrix used
during development (Blackwell sm_120 wheel from the cu128 index). Earlier
torch releases have not been validated against this code path.

Document the new opt-in `torch` GPU solver added in the previous commits:

- `docs/gpu.md` — setup, CLI / template surface, behavior notes (VRAM
  auto-sizing, rank-deficient pixel handling, NaN observations, hard-fail
  on missing CUDA), and indicative performance numbers.
- `docs/installation.md` §2.4 — install the `[gpu]` extras together with
  the matching PyTorch CUDA wheel index.
- `docs/README.md` and `docs/dask.md`: add cross-links so readers can
  reach the GPU page from the documentation root and from the Dask page
  (since the two parallelism paths are separate and a run must pick one
  or the other).

Performance numbers in `gpu.md` §4 are stated inline without any external
repository links so the page stays self-contained.

sourcery-ai Bot commented May 6, 2026

Reviewer's Guide

Adds an opt-in CUDA-accelerated PyTorch solver for the invert_network step, with CLI/template wiring, configuration defaults, docs, tests, and packaging extras, while keeping the existing CPU path as the default and behaviorally unchanged.

Sequence diagram for GPU-backed invert_network execution

sequenceDiagram
    actor User
    participant CLI as IfgramInversionCLI
    participant Runner as run_ifgram_inversion
    participant Patch as run_ifgram_inversion_patch
    participant GPU as IfgramInversionGPUModule
    participant Torch as TorchLibrary
    participant CUDA as CUDADevice

    User->>CLI: run ifgram_inversion.py\n--solver torch --gpu-chunk-size 0
    CLI->>Runner: inps with solver=torch,\n gpuChunkSize=0
    Runner->>Patch: run_ifgram_inversion_patch(...,\n solver=torch, gpu_chunk_size=0)

    Patch->>Patch: build design matrices A, B\nselect pixels idx_pixel2inv
    Patch->>GPU: estimate_timeseries_batch(A, B, y, tbase_diff,\n weight_sqrt, min_norm_velocity, rcond,\n min_redundancy, inv_quality_name,\n chunk_size=gpu_chunk_size, solver=torch)

    GPU->>Torch: torch.cuda.is_available()
    Torch-->>GPU: True (or raises error if False)

    GPU->>Torch: cuda.mem_get_info() (when chunk_size<=0)
    Torch-->>GPU: free_bytes, total_bytes
    GPU->>GPU: _auto_chunk_size()\nchoose chunk_size from free VRAM

    loop for each pixel chunk
        GPU->>Torch: as_tensor(G, tbase_diff, y_chunk, w_chunk)
        Torch->>CUDA: launch kernels for Gw, yw, N, r
        CUDA-->>Torch: Gw, yw, N, r on device

        GPU->>Torch: linalg.cholesky_ex(N)
        Torch->>CUDA: batched Cholesky
        CUDA-->>Torch: L, info
        GPU->>GPU: zero-out rank-deficient pixels\n(info != 0)

        GPU->>Torch: cholesky_solve(r, L)
        Torch->>CUDA: solve for X_batch
        CUDA-->>Torch: X_batch

        GPU->>GPU: build ts_chunk, inv_quality_chunk,\n num_inv_obs_chunk
        GPU-->>Patch: partial ts, quality, counts\nfor this chunk
        Patch->>Patch: write into global\narrays at indices
    end

    GPU-->>Patch: ts, inv_quality, num_inv_obs

    Patch->>Patch: assign ts[:, idx_pixel2inv]\ninv_quality[idx_pixel2inv]\nnum_inv_obs[idx_pixel2inv]
    Patch-->>Runner: inversion results
    Runner-->>User: timeseries/velocity outputs\n(with GPU-accelerated invert_network)

Class diagram for the new GPU-batched inversion solver and integration

classDiagram
    class IfgramInversionPatch {
        +run_ifgram_inversion_patch(ifgram_file, box, ref_phase, obs_ds_name, weight_func, water_mask_file, min_norm_velocity, mask_ds_name, mask_threshold, min_redundancy, calc_cov, solver, gpu_chunk_size)
    }

    class IfgramInversionCLI {
        +create_parser(subparsers)
        +--solver: str  (cpu|torch)
        +--gpu-chunk-size: int
        +read_template2inps(template_file, inps)
    }

    class SmallbaselineAppDefaults {
        +mintpy.networkInversion.solver: str
        +mintpy.networkInversion.gpuChunkSize: int
    }

    class IfgramInversionGPUModule {
        +SUPPORTED_SOLVERS: tuple
        +DEFAULT_CHUNK_SIZE: int
        +VRAM_SAFETY: float
        +is_solver_available(solver)
        +_get_torch_device(solver)
        +_auto_chunk_size(num_pair, num_unknown, dtype_bytes)
        +_solve_cholesky(G_dev, w_dev, y_dev)
        +estimate_timeseries_batch(A, B, y, tbase_diff, weight_sqrt, min_norm_velocity, rcond, min_redundancy, inv_quality_name, chunk_size, solver, print_msg)
    }

    class TorchLibrary {
        +cuda.is_available()
        +cuda.mem_get_info()
        +as_tensor(data, dtype, device)
        +linalg.cholesky_ex(N)
        +cholesky_solve(r, L)
    }

    class CUDADevice {
        <<hardware>>
    }

    IfgramInversionCLI --> IfgramInversionPatch : passes solver\n& gpuChunkSize
    SmallbaselineAppDefaults --> IfgramInversionCLI : provides default\nconfig values

    IfgramInversionPatch ..> IfgramInversionGPUModule : imports\nestimate_timeseries_batch
    IfgramInversionGPUModule ..> TorchLibrary : uses
    TorchLibrary ..> CUDADevice : executes kernels on

    IfgramInversionPatch --> IfgramInversionGPUModule : uses when\nsolver != cpu

File-Level Changes

Change Details Files
Introduce a GPU-batched weighted least-squares solver for invert_network using PyTorch/CUDA and integrate it as an alternative solver path.
  • Add new module implementing batched normal-equations + Cholesky WLS solver on CUDA with VRAM-aware chunking, NaN handling, and rank-deficiency detection via torch.linalg.cholesky_ex.
  • Provide estimate_timeseries_batch API that mirrors the CPU estimate_timeseries signature and returns timeseries, inversion quality, and observation counts.
  • Implement solver availability checks, CUDA device probing, and automatic chunk-size selection based on free VRAM with configurable overrides.
src/mintpy/ifgram_inversion_gpu.py
Wire the GPU solver into ifgram inversion while preserving the existing CPU behavior as the default.
  • Extend run_ifgram_inversion_patch to accept solver and gpu_chunk_size parameters, dispatching to the GPU batch solver when solver!='cpu' and otherwise leaving the CPU code path intact.
  • Pass solver and gpuChunkSize from CLI/template inputs into run_ifgram_inversion_patch via the options dictionary used for block-wise inversion.
  • Ensure GPU path handles both weighted and unweighted inversions in one call and writes results back into the existing timeseries, inversion quality, and observation count arrays.
src/mintpy/ifgram_inversion.py
Expose configuration for choosing the WLS solver backend and GPU chunk size via CLI and templates, including defaults for smallbaselineApp.
  • Add --solver and --gpu-chunk-size options to the ifgram_inversion CLI, with choices {'cpu','torch'} and descriptive help text including requirements and behavior.
  • Teach template reader to parse mintpy.networkInversion.solver and mintpy.networkInversion.gpuChunkSize keys and map them to inps.solver and inps.gpuChunkSize.
  • Set explicit defaults in smallbaselineApp_auto.cfg and document the auto/CPU vs torch options and gpuChunkSize behavior in smallbaselineApp.cfg comments.
src/mintpy/cli/ifgram_inversion.py
src/mintpy/defaults/smallbaselineApp.cfg
src/mintpy/defaults/smallbaselineApp_auto.cfg
Add packaging and dependency wiring for an optional GPU extras group that pulls CUDA-enabled PyTorch.
  • Define a gpu optional-dependency group in pyproject.toml, sourcing from a new requirements-gpu.txt file.
  • Introduce an empty requirements-gpu.txt placeholder to be populated with CUDA-enabled torch>=2.11, referenced by docs and install commands.
pyproject.toml
requirements-gpu.txt
Document installation, configuration, and performance characteristics of the optional GPU solver and link it into existing docs.
  • Extend installation docs with a new section describing GPU prerequisites, [gpu] extras installation via PyTorch CUDA wheel indices, uv-specific notes, verification, and enabling the solver from CLI or template.
  • Add a dedicated gpu.md page covering configuration, behavior notes (VRAM auto-sizing, rank-deficient handling, NaN treatment, no CPU fallback), and performance benchmarks for tutorial and large scenes.
  • Link gpu.md from the main docs README and note its relation to Dask in the dask.md documentation.
docs/installation.md
docs/gpu.md
docs/README.md
docs/dask.md
Add GPU-path tests that validate numerical equivalence with the CPU solver, chunk-size invariance, and error handling.
  • Create synthetic SBAS-like network generators and observation simulators to drive both CPU and GPU solvers on controlled fixtures, including optional NaN masking and weighting.
  • Compare GPU estimate_timeseries_batch outputs against the per-pixel CPU estimate_timeseries reference over multiple scenarios (WLS/OLS, min_norm_velocity True/False) with tight float32 RMS tolerances, and assert identical observation counts.
  • Test invariance across different GPU chunk sizes and verify that unsupported solver names raise ValueError; mark all tests to skip automatically when CUDA is unavailable.
tests/test_ifgram_inversion_gpu.py

Assessment against linked issues

Issue Objective Addressed Explanation
#1489 Add an opt-in CUDA-only PyTorch-based GPU backend for the invert_network step that accelerates per-pixel WLS/OLS inversion, while preserving existing CPU behavior as the default and keeping the original CPU code path unchanged.
#1489 Provide configuration and packaging hooks for the GPU backend, including template and CLI options to select the solver and GPU chunk size, an optional [gpu] extras dependency group for installing CUDA-enabled PyTorch, and fail-fast behavior when CUDA/PyTorch are unavailable (no silent CPU fallback).
#1489 Document the optional GPU backend (installation, enabling, tuning, behavior/performance notes) and add tests to validate numerical equivalence and behavior of the GPU path against the existing CPU implementation.

@s-sasaki-earthsea-wizard
Author

Sharing additional benchmark data for the GPU torch solver. The PR description's Performance section reports per-step invert_network numbers on FernandinaSenDT128 and GalapagosSenDT128 (both ISCE2 / Sentinel-1 C-band). The bench has since been extended to 5 scenes spanning 4 InSAR processors (ISCE2, GMTSAR, ARIA, ROI_PAC) and 2 sensors / wavelengths (Sentinel-1 C-band, ALOS-1 L-band):

| Scene | Processor / Sensor | Pixels | Ifgs (K) | Dates (D) | invert_network cpu wall | torch wall | speedup |
|---|---|---|---|---|---|---|---|
| FernandinaSenDT128 | ISCE2 / S1 (C) | 270 k | 288 | 98 | 645.12 s | 6.88 s | 93.77× |
| GalapagosSenDT128 | ISCE2 / S1 (C) | 3.40 M | 490 | 98 | 2976.72 s | 79.40 s | 37.49× |
| SanFranBaySenD42 | GMTSAR / S1 (C) | 326 k | 1297 | 333 | 1080.38 s | 17.42 s | 62.02× |
| KujuAlosAT422F650 | ROI_PAC / ALOS-1 (L) | 226 k | 167 | 24 | 31.01 s | 4.53 s | 6.85× |
| SanFranSenDT42 | ARIA / S1 (C) | 1.04 M | 505 | 114 | 58.85 s | 11.07 s | 5.32× |

Configuration: warm SSD, NVIDIA RTX 5080 (16 GiB), float32, mintpy.networkInversion.solver = torch. Each run was end-to-end smallbaselineApp.py (full 18-step pipeline), not a direct call to the solver — wall numbers above are extracted from /usr/bin/time -v capture of the invert_network segment. The cpu-only steps in the same pipeline (load_data, modify_network, correct_SET, correct_troposphere, deramp, save_hdfeos5) stay within ±5 % between the cpu and torch runs of each scene, serving as an I/O / cache control.

The 5× to 94× range is structurally driven by per-pixel solve cost (∝ K · D²): Kuju (K=167, D=24) sits at the floor, SanFranBay (K=1297, D=333) at the ceiling. The Fernandina and Galapagos figures above are consistent with the ~16× internal / ~4.5× step-wall (Fernandina) and ~44× internal / ~36× step-wall (Galapagos) numbers in the original PR description; the larger headline here reflects the warm-SSD scene root and the per-step wall extracted from a fresh end-to-end run with both cpu and torch using identical fixtures.
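
To make the K · D² proxy concrete, the short script below evaluates it for the five scenes, using the K and D values from the benchmark table in this comment. It only computes the proxy; it does not model the measured wall times.

```python
# (K interferograms, D dates) per scene, copied from the benchmark table in this comment
scenes = {
    "FernandinaSenDT128": (288, 98),
    "GalapagosSenDT128": (490, 98),
    "SanFranBaySenD42": (1297, 333),
    "KujuAlosAT422F650": (167, 24),
    "SanFranSenDT42": (505, 114),
}

# per-pixel solve cost proxy: K * D^2
cost = {name: k * d**2 for name, (k, d) in scenes.items()}

for name, c in sorted(cost.items(), key=lambda item: item[1]):
    print(f"{name:20s} K*D^2 = {c:>11,d}")
```

By this proxy Kuju is the cheapest scene (96,192) and SanFranBay the most expensive (143,823,033), which matches the floor and ceiling named in the text.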

Numerical agreement: the float32 round-off gate (rms / |cpu|.max < 1e-5) is met for the user-visible final products (velocity.h5, geocoded outputs) in all 5 scenes. Two scenes (Kuju, SanFranSF) show divergence on radar-coordinate intermediate products at rms/scale of 1 to 7 %, but Kuju's geocoded velocity (filtered through maskTempCoh.h5) passes at 1.38e-7, consistent with the divergence being confined to pixels that the downstream maskTempCoh.h5 mask drops anyway. The report diagnoses this as the cpu scipy.linalg.lstsq min-norm fill versus the torch cholesky_ex fill for near-rank-deficient masked-out pixels; making the radar-coord diff tool mask-aware is queued as a sibling-repo follow-up.
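
The gate quoted above (rms / |cpu|.max < 1e-5) can be written as a small NumPy function. This is a sketch of the metric as stated, not the benchmark harness itself: the function name and the choice to ignore pixels that are non-finite in either input are assumptions.

```python
import numpy as np


def relative_rms(cpu: np.ndarray, gpu: np.ndarray) -> float:
    """RMS of (gpu - cpu) divided by max |cpu|, over pixels finite in both inputs."""
    valid = np.isfinite(cpu) & np.isfinite(gpu)      # assumed masking rule
    diff = gpu[valid].astype(np.float64) - cpu[valid].astype(np.float64)
    rms = np.sqrt(np.mean(diff**2))
    scale = np.max(np.abs(cpu[valid]))
    return float(rms / scale)


cpu = np.array([1.0, 2.0, 4.0, np.nan])
gpu = cpu + 4e-6                                     # uniform small offset
score = relative_rms(cpu, gpu)
print(score < 1e-5)
```

A mask-aware variant would additionally restrict `valid` to the pixels kept by the temporal-coherence mask, which is the follow-up mentioned in the paragraph above.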

Full per-step wall breakdown, fixture parity verification (cpu and torch fixtures verified byte-identical except for the 2 mintpy.*.solver = torch lines), and the numerical comparison methodology are in the report:

reports/report_end_to_end_bench.md @ 0fbf71b

Dataset records (Zenodo):

@yunjunz yunjunz requested a review from huchangyang May 27, 2026 08:00