@paulhendricks paulhendricks commented Nov 24, 2025

Summary

  • Add a runnable synthetic diagnostics demo that trains a small MLP, forces a learning-rate plateau, and emits loss, curvature, confusion-matrix, and degree-bucket plots under artifacts/.
  • Introduce reusable helpers for degree decile evaluation, overall confusion matrix plotting, and Hessian top-eigenvalue estimation/visualization to probe training curvature.

@paulhendricks paulhendricks requested a review from a team as a code owner November 24, 2025 02:02
copy-pr-bot bot commented Nov 24, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps bot commented Nov 24, 2025

Greptile Overview

Greptile Summary

This PR adds a GNN diagnostics example directory with utilities for inspecting training behavior and model performance. It contains scripts for environment verification and synthetic dataset generation, plus diagnostic visualizations: Hessian top-eigenvalue tracking, confusion matrices, and degree-based performance analysis.

Key additions:

  • Environment verification script (verify_cugraph_gnn.py) to validate torch/PyG/cuGraph setup
  • Synthetic GNN training scripts for smoke testing and controlled diagnostic demonstrations
  • Hessian eigenvalue estimation via power iteration for loss curvature analysis
  • Degree-based performance slicing to identify model behavior on high/low-degree nodes
  • Confusion matrix visualization utilities
  • End-to-end demo script that integrates all diagnostic tools and forces a learning-rate plateau
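The Hessian bullet above can be illustrated with a minimal power-iteration sketch. This is not the code from hessian_top_eigen.py: the summary says the PR uses a vector-Hessian product (VHP), whereas this sketch swaps in a central-difference Hessian-vector product on NumPy arrays so that it runs without torch. The function names, the toy quadratic loss, and all default values are assumptions made for illustration.

```python
import numpy as np


def top_eigenvalue_power_iteration(hvp, dim, num_iters=200, tol=1e-9, seed=0):
    """Estimate the dominant (largest-magnitude) Hessian eigenvalue.

    `hvp` is any callable that maps a vector v to the product H @ v,
    so the Hessian itself is never materialized.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    eigenvalue = 0.0
    for _ in range(num_iters):
        hv = hvp(v)
        new_eigenvalue = float(v @ hv)  # Rayleigh quotient for unit-norm v
        norm = np.linalg.norm(hv)
        if norm == 0.0:  # v is in the null space of H; stop early
            return 0.0
        v = hv / norm  # re-normalize so the iterate does not blow up
        converged = abs(new_eigenvalue - eigenvalue) < tol
        eigenvalue = new_eigenvalue
        if converged:
            break
    return eigenvalue


def finite_difference_hvp(grad_fn, w, eps=1e-5):
    """Build an HVP callable from a gradient function via central differences.

    Assumption: this stands in for an autograd-based VHP.
    """
    def hvp(v):
        return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2.0 * eps)
    return hvp


# Toy loss f(w) = 0.5 * w^T A w, whose Hessian is exactly A.
A = np.diag([4.0, 1.0, 0.25])


def toy_grad(w):
    return A @ w


w0 = np.array([1.0, -2.0, 0.5])
top_eig = top_eigenvalue_power_iteration(finite_difference_hvp(toy_grad, w0), dim=3)
```

Power iteration converges to the eigenvalue of largest magnitude, so on a loss surface with a strongly negative direction the estimate may be negative rather than the largest positive curvature.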

Previous review issues addressed:

  • Fixed import ordering in synthetic_diagnostics_demo.py
  • Corrected label text from "LR zeroed" to "LR reduced" to match actual behavior
  • Replaced pd.qcut with percentile-based binning using np.percentile and np.digitize to avoid bucket mismatch errors
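The percentile-based binning mentioned in the last item can be sketched as follows. This is a hedged illustration, not the code in degree_decile_performance.py: the helper names `assign_degree_buckets` and `accuracy_by_bucket` are hypothetical, and the choice of interior percentiles as cut points is an assumption. The sketch also assumes the original mismatch came from heavily tied degree values, which is a common way for quantile binning to yield fewer distinct buckets than requested.

```python
import numpy as np


def assign_degree_buckets(degrees, num_buckets=10):
    """Map each node degree to a bucket id in [0, num_buckets - 1]."""
    degrees = np.asarray(degrees, dtype=float)
    # Interior cut points only, e.g. the 10th, 20th, ..., 90th percentiles.
    quantiles = np.linspace(0, 100, num_buckets + 1)[1:-1]
    edges = np.percentile(degrees, quantiles)
    # np.digitize accepts repeated edges; tied values simply share a bucket
    # and some buckets may stay empty.
    return np.digitize(degrees, edges, right=False)


def accuracy_by_bucket(y_true, y_pred, bucket_ids, num_buckets=10):
    """Return {bucket_id: accuracy}, with NaN for empty buckets."""
    correct = np.asarray(y_true) == np.asarray(y_pred)
    results = {}
    for bucket in range(num_buckets):
        mask = bucket_ids == bucket
        results[bucket] = float(correct[mask].mean()) if mask.any() else float("nan")
    return results


# Heavily tied degrees: six of the ten nodes have degree 1.
degrees = np.array([1, 1, 1, 1, 1, 1, 2, 3, 5, 8])
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0, 0, 1])

bucket_ids = assign_degree_buckets(degrees)
per_bucket_accuracy = accuracy_by_bucket(y_true, y_pred, bucket_ids)
```

Because bucket ids are computed directly from the cut points, the number of possible ids is fixed at `num_buckets` regardless of ties, and empty buckets are reported explicitly instead of shifting the labels.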

The implementation is well-documented with clear docstrings, follows Python best practices, and provides practical diagnostic tools for GNN model analysis.

Confidence Score: 5/5

  • This PR is safe to merge with no blocking issues
  • All previously identified issues have been properly addressed. The code is well-structured with comprehensive documentation, proper error handling, and clear separation of concerns. The diagnostic utilities are self-contained examples that don't modify core library code, reducing risk of regression.
  • No files require special attention

Important Files Changed

File Analysis

| Filename | Score | Overview |
|---|---|---|
| python/cugraph-pyg/cugraph_pyg/examples/gnn_diagnostics/train_synthetic_gnn.py | 5/5 | Clean smoke-test script for validating torch/PyG installation with synthetic data |
| python/cugraph-pyg/cugraph_pyg/examples/gnn_diagnostics/hessian_top_eigen.py | 5/5 | Well-documented Hessian eigenvalue estimation using power iteration and VHP |
| python/cugraph-pyg/cugraph_pyg/examples/gnn_diagnostics/degree_decile_performance.py | 5/5 | Degree-based performance analysis using percentile binning, fixed from previous qcut approach |
| python/cugraph-pyg/cugraph_pyg/examples/gnn_diagnostics/synthetic_diagnostics_demo.py | 5/5 | Comprehensive demo script integrating all diagnostic tools with controlled training trajectory |

Sequence Diagram

sequenceDiagram
    participant User
    participant Demo as synthetic_diagnostics_demo.py
    participant Data as make_synthetic()
    participant Model as MLP
    participant Hessian as hessian_top_eigen.py
    participant Degree as degree_decile_performance.py
    participant Confusion as overall_confusion_matrix.py
    participant Artifacts as artifacts/

    User->>Demo: Run with CLI args
    Demo->>Data: Generate synthetic dataset
    Data-->>Demo: x, y, degrees (with degree-dependent labels)
    
    Demo->>Model: Initialize MLP & optimizer
    
    loop Training epochs
        Demo->>Model: Forward pass
        Model-->>Demo: Logits & loss
        
        alt Step == plateau_step
            Demo->>Demo: Reduce learning rate by plateau_lr_scale
        end
        
        Demo->>Model: Backward & optimizer step
        
        alt Step % hessian_sample_every == 0
            Demo->>Hessian: estimate_top_eigenvalue_vhp()
            Hessian->>Model: Power iteration via VHP
            Hessian-->>Demo: Top eigenvalue
            Demo->>Demo: Store (step, eigenvalue)
        end
    end
    
    Demo->>Model: Full inference on dataset
    Model-->>Demo: Predictions
    
    Demo->>Confusion: plot_overall_confusion_matrix()
    Confusion->>Artifacts: Save confusion_matrix.png
    
    Demo->>Degree: evaluate_by_degree_bucket()
    Degree->>Degree: Compute percentile bins
    Degree->>Degree: Calculate acc/F1 per bucket
    Degree-->>Demo: results_df, confusions
    
    Demo->>Degree: plot_performance()
    Degree->>Artifacts: Save degree_performance.png
    
    Demo->>Hessian: plot_curvature()
    Hessian->>Artifacts: Save hessian_curve.png
    
    Demo->>Artifacts: Save loss_curve.png
    
    Demo-->>User: All diagnostics complete
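
The control flow of the training loop in the diagram can be sketched in plain Python. The parameter names `plateau_step`, `plateau_lr_scale`, and `hessian_sample_every` come from the diagram; everything else is an assumption for illustration, including the 1-D quadratic loss standing in for the MLP, the default values, and the reading of "reduce by plateau_lr_scale" as a multiplication.

```python
def run_toy_training(num_steps=40, lr=0.1, plateau_step=20,
                     plateau_lr_scale=0.01, hessian_sample_every=10):
    """Mimic the demo's loop order on the toy loss 0.5 * c * w**2."""
    curvature = 2.0  # constant second derivative of the toy loss
    w = 5.0
    losses, lrs, curvature_samples = [], [], []
    for step in range(num_steps):
        loss = 0.5 * curvature * w * w  # stands in for the forward pass
        losses.append(loss)

        if step == plateau_step:
            # Assumption: the LR is multiplied by the scale, not set to zero.
            lr *= plateau_lr_scale
        lrs.append(lr)

        grad = curvature * w  # stands in for the backward pass
        w -= lr * grad        # plain gradient-descent optimizer step

        if step % hessian_sample_every == 0:
            # The real demo would call the top-eigenvalue estimator here.
            curvature_samples.append((step, curvature))
    return losses, lrs, curvature_samples


losses, lrs, curvature_samples = run_toy_training()
```

With a small `plateau_lr_scale`, the loss curve flattens after `plateau_step` while the sampled curvature stays unchanged, which is the kind of contrast the loss and curvature plots are meant to make visible.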

@greptile-apps greptile-apps bot left a comment

9 files reviewed, 1 comment

@greptile-apps greptile-apps bot left a comment

9 files reviewed, 3 comments
