Update jobapi pt example by ZiyueXu77 · Pull Request #4112 · NVIDIA/NVFlare

ZiyueXu77 · 2026-02-03T19:10:38Z

Fixes # .

Description

Correct the CIFAR10 download issue and update the example from the learner to the Client API.

Types of changes

Non-breaking change (fix or new feature that would not break existing functionality).
Breaking change (fix or new feature that would cause existing functionality to change).
New tests added to cover the changes.
Quick tests passed locally by running ./runtest.sh.
In-line docstrings updated.
Documentation updated.

greptile-apps · 2026-02-03T19:14:28Z

Greptile Overview

Greptile Summary

This PR updates the PyTorch Job API examples to fix CIFAR10 download issues and migrate from ModelLearner to the Client API approach. Key improvements include:

Fixed race condition: Added FileLock when downloading CIFAR10 dataset to prevent concurrent download conflicts across multiple sites
Migration to Client API: Replaced ModelLearner-based examples with Client API (nvflare.client) for better consistency
Enhanced examples: Created new cifar10_fl_partitioned.py with support for data partitioning and multi-task handling (train/evaluate/submit_model)
Cross-site validation: Added comprehensive cross-site evaluation example with Dirichlet data partitioning (fedavg_script_runner_xsite_val_cifar10.py)
Error handling: Added division-by-zero protection when test loader is empty, following custom instruction 783565ac-d530-4d49-a8bc-55877cb0a0cd
Analytics tracking: Added TBAnalyticsReceiver across examples for TensorBoard integration
Code cleanup: Removed deprecated files (cse_script_runner_cifar10.py, fedavg_model_learner_xsite_val_cifar10.py) and improved loss tracking

All previously reported issues have been addressed including FileLock additions, task name alignment (validate task), and accuracy calculation precision.
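The download locking described above can be sketched with the standard library alone. This is a minimal sketch, not the code in the PR: it uses POSIX `fcntl.flock` as a stand-in for the `FileLock` class from the `filelock` package, and `exclusive_lock`, `ensure_dataset`, the `fetch` callback, and the marker file name are all hypothetical names introduced for illustration.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def exclusive_lock(lock_path):
    """Hold an exclusive advisory lock on lock_path for the duration of the block."""
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks while another process holds the lock
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def ensure_dataset(root, fetch):
    """Call fetch(root) at most once, even if several processes start together.

    fetch is a hypothetical stand-in for the real dataset download call.
    """
    os.makedirs(root, exist_ok=True)
    marker = os.path.join(root, ".download_complete")
    with exclusive_lock(os.path.join(root, ".download.lock")):
        if os.path.exists(marker):
            return False  # an earlier caller already finished the download
        fetch(root)
        with open(marker, "w") as f:
            f.write("ok")
        return True


# Two calls against the same directory: only the first one fetches.
calls = []
demo_root = tempfile.mkdtemp()
first = ensure_dataset(demo_root, calls.append)
second = ensure_dataset(demo_root, calls.append)
```

The second caller waits for the lock, sees the marker, and skips the download, which is the behavior the lock is meant to guarantee when multiple sites start on one machine.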

Confidence Score: 5/5

This PR is safe to merge with high confidence
All critical issues from previous reviews have been properly addressed: FileLock added for dataset downloads, division by zero checks implemented per custom instructions, task names correctly aligned, and script paths updated. The changes are well-structured, follow best practices, and improve code quality.
No files require special attention

Important Files Changed

Filename	Overview
examples/advanced/job_api/pt/src/cifar10_fl.py	Fixed CIFAR10 download race condition with FileLock, improved loss tracking, added division by zero check for empty test loader
examples/advanced/job_api/pt/src/cifar10_fl_partitioned.py	New file implementing partitioned FL with train/evaluate/submit_model tasks, includes FileLock and division by zero protection
examples/advanced/job_api/pt/src/cifar10_fl_lightning.py	New Lightning implementation with proper FileLock for dataset download race condition prevention
examples/advanced/job_api/pt/fedavg_script_runner_xsite_val_cifar10.py	New cross-site validation example with Dirichlet data partitioning, FileLock in data loading, proper task name mapping
examples/advanced/job_api/pt/swarm_script_runner_cifar10.py	Updated to use cifar10_fl_partitioned.py script, added TBAnalyticsReceiver for tensorboard tracking

Sequence Diagram

sequenceDiagram
    participant Server
    participant Client1 as Client (site-1)
    participant Client2 as Client (site-2)
    participant CIFAR10 as CIFAR10 Dataset
    
    Note over Server,Client2: Initialization Phase
    Server->>Client1: Initialize with global model
    Server->>Client2: Initialize with global model
    
    Client1->>CIFAR10: Download dataset (with FileLock)
    Client2->>CIFAR10: Download dataset (with FileLock)
    Note over Client1,Client2: FileLock prevents race conditions
    
    loop Training Rounds (num_rounds)
        Note over Server,Client2: Training Phase
        Server->>Client1: Send global model (train task)
        Server->>Client2: Send global model (train task)
        
        Client1->>Client1: Load partitioned data subset
        Client2->>Client2: Load partitioned data subset
        
        Client1->>Client1: Train local model (local_epochs)
        Client1->>Client1: Evaluate local model
        Client1->>Client1: Evaluate global model for selection
        Client1->>Server: Send updated weights + accuracy metric
        
        Client2->>Client2: Train local model (local_epochs)
        Client2->>Client2: Evaluate local model
        Client2->>Client2: Evaluate global model for selection
        Client2->>Server: Send updated weights + accuracy metric
        
        Server->>Server: Aggregate weights (FedAvg)
        Server->>Server: Select best model (IntimeModelSelector)
        
        opt Cross-Site Validation
            Note over Server,Client2: Cross-Site Evaluation Phase
            Server->>Client1: Request best local model (submit_model task)
            Server->>Client2: Request best local model (submit_model task)
            
            Client1->>Server: Send best local model
            Client2->>Server: Send best local model
            
            Server->>Client1: Send Client2's model (validate task)
            Server->>Client2: Send Client1's model (validate task)
            
            Client1->>Client1: Evaluate Client2's model
            Client2->>Client2: Evaluate Client1's model
            
            Client1->>Server: Send cross-site accuracy
            Client2->>Server: Send cross-site accuracy
            
            Server->>Server: Generate validation JSON results
        end
    end
    
    Note over Server,Client2: Training Complete

greptile-apps

_{3 files reviewed, 3 comments}


greptile-apps · 2026-02-03T19:14:39Z

Additional Comments (1)

examples/advanced/job_api/pt/src/cifar10_fl_train_eval_submit.py
Integer division used for accuracy calculation can lose precision

        return 100 * correct / total

greptile-apps

_{5 files reviewed, 4 comments}


greptile-apps · 2026-02-03T19:22:58Z

Additional Comments (3)

examples/advanced/job_api/pt/src/cifar10_fl.py
Integer division loses precision in accuracy calculation

            print(f"Accuracy of the network on the 10000 test images: {100 * correct / total} %")
            return 100 * correct / total


examples/advanced/job_api/pt/src/cifar10_fl_train_eval_submit.py
Integer division loses precision

        return 100 * correct / total

examples/advanced/job_api/pt/cse_script_runner_cifar10.py
Task name mismatch: controller sets validation_task_name="validate" but the script checks flare.is_evaluate() which expects task name evaluate. This will cause the validation phase to fail.

        validation_task_name="evaluate",

greptile-apps

_{3 files reviewed, 1 comment}


greptile-apps · 2026-02-03T19:29:15Z

Additional Comments (1)

examples/advanced/job_api/pt/src/cifar10_fl_train_eval_submit.py
Integer division loses precision in accuracy calculation

        return 100 * correct / total

greptile-apps

_{3 files reviewed, 1 comment}


greptile-apps · 2026-02-03T19:35:06Z

Additional Comments (1)

examples/advanced/job_api/pt/src/cifar10_fl_train_eval_submit.py
Integer division loses precision in accuracy calculation

        return 100.0 * correct / total

ZiyueXu77 · 2026-02-03T19:37:27Z

Additional Comments (1)
examples/advanced/job_api/pt/src/cifar10_fl_train_eval_submit.py Integer division loses precision in accuracy calculation
        return 100.0 * correct / total

In Python 3, / already does floating-point division. To make the float calculation explicit, the float literal form 100.0 * correct / total is the simplest and clearest.
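A short check of the point above, assuming only Python 3 semantics. The values 3 and 8 are arbitrary; `//` is included only for contrast, since it is the operator that actually truncates.

```python
correct, total = 3, 8

true_div = 100 * correct / total         # `/` is true division in Python 3
float_literal = 100.0 * correct / total  # same value, float intent made explicit
floor_div = 100 * correct // total       # `//` is the operator that truncates

print(true_div, float_literal, floor_div)
```

Both `/` forms give the same float, so the `100.0` literal changes readability rather than the result.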

greptile-apps

_{4 files reviewed, 3 comments}


ZiyueXu77 · 2026-02-03T19:45:51Z

Additional Comments (3)
examples/advanced/job_api/pt/src/cifar10_fl.py Integer division loses precision in accuracy calculation
            return 100.0 * correct / total
examples/advanced/job_api/pt/src/cifar10_fl.py Integer division loses precision in displayed accuracy
            print(f"Accuracy of the network on the 10000 test images: {100.0 * correct / total} %")
examples/advanced/job_api/pt/src/cifar10_fl_train_eval_submit.py Integer division loses precision in accuracy calculation
        return 100.0 * correct / total

@greptileai, it will not lose precision; test it yourself:

python -c "correct=3; total=8; print(100*correct/total, 100.0*correct/total, 100*float(correct)/total)"

greptile-apps

_{4 files reviewed, 3 comments}


Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

greptile-apps

_{4 files reviewed, no comments}


ZiyueXu77 · 2026-02-03T20:08:13Z

/build

ZiyueXu77 · 2026-02-03T21:35:04Z

/build

greptile-apps

_{4 files reviewed, no comments}


Copilot

Pull request overview

This PR updates the CIFAR-10 PyTorch Job API examples to use a consistent dataset location with safe concurrent downloads, and replaces the older model-learner-based FedAvg cross-site validation example with a Client API/script-runner–based example that supports heterogeneous data partitioning.

Changes:

Ensure all CIFAR-10 example scripts use /tmp/nvflare/data/cifar10 and add filelock-based locking around dataset downloads to avoid race conditions across multiple sites.
Refine client-side evaluation logic to use float accuracies with explicit checks for empty test loaders, and improve logging for train/evaluate/submit tasks.
Replace the fedavg_model_learner_xsite_val_cifar10.py example with a new fedavg_script_runner_xsite_val_cifar10.py + cifar10_fl_partitioned.py pipeline that partitions CIFAR-10 non-iid across sites and integrates Cross-Site Evaluation (CSE); align swarm and CSE script runners, requirements, and README with these changes.
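The evaluation change in the list above (float accuracy with an explicit empty-loader check) might look roughly like the following. This is a sketch under assumptions: `accuracy_percent` is a hypothetical helper, and returning `0.0` for an empty test set is one possible policy; the actual scripts may log a warning or handle the case differently.

```python
def accuracy_percent(correct, total):
    """Return accuracy as a float percentage, or 0.0 when nothing was evaluated."""
    if total == 0:
        # Assumption: an empty test loader is reported as 0.0 rather than raising.
        return 0.0
    return 100.0 * correct / total


normal_case = accuracy_percent(3, 8)
empty_case = accuracy_percent(0, 0)
```

The guard matters mainly with partitioned data, where a site could in principle end up with no evaluation samples.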

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated no comments.


File	Description
`examples/advanced/job_api/pt/swarm_script_runner_cifar10.py`	Points the swarm job to the new `cifar10_fl_train_eval_submit.py` training script to keep the swarm example aligned with the updated Client API workflow.
`examples/advanced/job_api/pt/src/cifar10_lightning_fl.py`	Uses `/tmp/nvflare/data/cifar10` and a `FileLock`-guarded `prepare_data` to prevent concurrent CIFAR-10 downloads across Lightning clients.
`examples/advanced/job_api/pt/src/cifar10_fl_train_eval_submit.py`	Adds locked dataset download, converts accuracy computation to safe float division with a guard for empty test loaders, and improves task-specific logging.
`examples/advanced/job_api/pt/src/cifar10_fl_partitioned.py`	New Client API training script that loads optional per-site index splits, partitions CIFAR-10 accordingly, and supports train/evaluate/submit_model with best-model saving and robust accuracy computation.
`examples/advanced/job_api/pt/src/cifar10_fl.py`	Aligns dataset root with the new CIFAR-10 path, wraps downloads in a `FileLock`, and updates evaluation to use float accuracies with an explicit empty-loader check.
`examples/advanced/job_api/pt/requirements.txt`	Adds `filelock>=3.12.0` to support the new locking behavior in all CIFAR-10 examples that download data.
`examples/advanced/job_api/pt/fedavg_script_runner_xsite_val_cifar10.py`	New FedAvg Job API script that partitions CIFAR-10 with a Dirichlet sampler, creates and saves per-site splits, configures Scatter-and-Gather training plus CrossSiteModelEval, and wires ScriptRunner clients with the partitioned training script.
`examples/advanced/job_api/pt/fedavg_model_learner_xsite_val_cifar10.py`	Removes the older model-learner–based FedAvg x-site validation example in favor of the newer Client API/script-runner–based example.
`examples/advanced/job_api/pt/cse_script_runner_cifar10.py`	Switches to the `cifar10_fl_train_eval_submit.py` script, uses `DataKind.WEIGHTS` for the aggregator, aligns CSE `validation_task_name="evaluate"`, and normalizes client IDs to `site-{i}`.
`examples/advanced/job_api/pt/README.md`	Updates the fifth example to reference the new FedAvg script runner with cross-site validation and heterogeneous data partitioning, adjusting the command and description accordingly.
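For readers unfamiliar with the Dirichlet partitioning mentioned in the table, here is a minimal standard-library sketch of the general technique. It is not the sampler used in the PR: `dirichlet_sample` and `partition_by_dirichlet` are hypothetical names, the toy labels stand in for CIFAR-10 targets, and `alpha=0.5` is an arbitrary choice. Smaller `alpha` values give more skewed per-site class mixes.

```python
import random
from collections import defaultdict


def dirichlet_sample(rng, alpha, k):
    """Draw one sample from a symmetric Dirichlet(alpha) over k outcomes."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def partition_by_dirichlet(labels, num_sites, alpha, seed=0):
    """Split sample indices across sites using per-class Dirichlet proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    splits = {f"site-{i + 1}": [] for i in range(num_sites)}
    site_names = list(splits)
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        proportions = dirichlet_sample(rng, alpha, num_sites)
        start, cumulative = 0, 0.0
        for site, share in zip(site_names, proportions):
            cumulative += share
            # The last site takes whatever remains so no index is dropped.
            if site == site_names[-1]:
                end = len(indices)
            else:
                end = int(round(cumulative * len(indices)))
            splits[site].extend(indices[start:end])
            start = end
    return splits


# Toy data: 1000 samples over 10 classes, split across two sites.
labels = [i % 10 for i in range(1000)]
splits = partition_by_dirichlet(labels, num_sites=2, alpha=0.5)
```

Every index lands on exactly one site, so the per-site lists can be saved and later used to select each client's data subset.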


greptile-apps

_{3 files reviewed, 1 comment}


greptile-apps

_{6 files reviewed, 4 comments}


greptile-apps · 2026-02-04T18:48:00Z

Additional Comments (2)

examples/advanced/job_api/pt/cse_script_runner_cifar10.py
Script file src/cifar10_fl_train_eval_submit.py does not exist. Should be src/cifar10_fl_partitioned.py.

    train_script = "src/cifar10_fl_partitioned.py"

examples/advanced/job_api/pt/cse_script_runner_cifar10.py
Site naming inconsistency: uses 0-indexed site-{i} (site-0, site-1), but other examples and data split files use 1-indexed site-{i + 1} (site-1, site-2). This will cause data split loading to fail if partitioned data is used.

        job.to(executor, f"site-{i + 1}")
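The indexing mismatch described in this comment can be illustrated without any NVFlare code. In this sketch the split file names are hypothetical; the only point is that names generated with `site-{i}` miss entries keyed by `site-{i + 1}`.

```python
num_clients = 2

# Hypothetical split files keyed by 1-indexed site names.
split_files = {f"site-{i + 1}": f"site-{i + 1}.npy" for i in range(num_clients)}

zero_indexed = [f"site-{i}" for i in range(num_clients)]     # site-0, site-1
one_indexed = [f"site-{i + 1}" for i in range(num_clients)]  # site-1, site-2

missing_zero = [name for name in zero_indexed if name not in split_files]
missing_one = [name for name in one_indexed if name not in split_files]
```

With 0-indexed names, `site-0` has no split file, while the 1-indexed names all resolve.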

greptile-apps

_{4 files reviewed, no comments}


ZiyueXu77 · 2026-02-04T19:27:42Z

/build

ZiyueXu77 · 2026-02-04T20:21:06Z

/build

greptile-apps

_{5 files reviewed, no comments}


ZiyueXu77 · 2026-02-04T20:54:14Z

/build

greptile-apps

_{5 files reviewed, no comments}


Fixes # .

### Description

cherry pick #4112

### Types of changes

- [x] Non-breaking change (fix or new feature that would not break existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

ZiyueXu77 added 2 commits February 3, 2026 14:09

update jobapi pt example

eb27bfa

Merge branch 'main' into jobapi

73fb622