Update jobapi pt example #4112

Merged
ZiyueXu77 merged 24 commits into NVIDIA:main from ZiyueXu77:jobapi
Feb 4, 2026

@ZiyueXu77
Collaborator

Fixes # .

Description

Correct the CIFAR10 download issue and update the learner to the Client API.

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally by running ./runtest.sh.
  • In-line docstrings updated.
  • Documentation updated.

@greptile-apps
Contributor

greptile-apps Bot commented Feb 3, 2026

Greptile Overview

Greptile Summary

This PR updates the PyTorch Job API examples to fix CIFAR10 download issues and migrate from ModelLearner to the Client API approach. Key improvements include:

  • Fixed race condition: Added FileLock when downloading CIFAR10 dataset to prevent concurrent download conflicts across multiple sites
  • Migration to Client API: Replaced ModelLearner-based examples with Client API (nvflare.client) for better consistency
  • Enhanced examples: Created new cifar10_fl_partitioned.py with support for data partitioning and multi-task handling (train/evaluate/submit_model)
  • Cross-site validation: Added comprehensive cross-site evaluation example with Dirichlet data partitioning (fedavg_script_runner_xsite_val_cifar10.py)
  • Error handling: Added division-by-zero protection when test loader is empty, following custom instruction 783565ac-d530-4d49-a8bc-55877cb0a0cd
  • Analytics tracking: Added TBAnalyticsReceiver across examples for TensorBoard integration
  • Code cleanup: Removed deprecated files (cse_script_runner_cifar10.py, fedavg_model_learner_xsite_val_cifar10.py) and improved loss tracking

All previously reported issues have been addressed including FileLock additions, task name alignment (validate task), and accuracy calculation precision.
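The download race described in the summary can be illustrated with a minimal sketch. The PR itself uses `FileLock` from the `filelock` package; to stay dependency-free, this sketch swaps in `fcntl.flock` from the standard library (Unix only). The helper names, the `download_fn` callable, and the `.download_complete` marker file are assumptions made for illustration, not code from the PR.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def exclusive_lock(lock_path):
    """Hold an exclusive advisory lock on lock_path while the block runs."""
    with open(lock_path, "w") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)  # blocks until other holders release
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def ensure_dataset(root, download_fn):
    """Call download_fn(root) at most once, even with concurrent callers."""
    os.makedirs(root, exist_ok=True)
    marker = os.path.join(root, ".download_complete")  # hypothetical marker file
    with exclusive_lock(os.path.join(root, "download.lock")):
        if not os.path.exists(marker):
            download_fn(root)
            with open(marker, "w") as f:
                f.write("ok")
    return root
```

In the setting of these examples, `download_fn` would presumably wrap the torchvision CIFAR10 download, so that only the first site to take the lock fetches the data and the others find it already in place.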

Confidence Score: 5/5

  • This PR is safe to merge with high confidence
  • All critical issues from previous reviews have been properly addressed: FileLock added for dataset downloads, division by zero checks implemented per custom instructions, task names correctly aligned, and script paths updated. The changes are well-structured, follow best practices, and improve code quality.
  • No files require special attention

Important Files Changed

| Filename | Overview |
| --- | --- |
| examples/advanced/job_api/pt/src/cifar10_fl.py | Fixed CIFAR10 download race condition with FileLock, improved loss tracking, added division by zero check for empty test loader |
| examples/advanced/job_api/pt/src/cifar10_fl_partitioned.py | New file implementing partitioned FL with train/evaluate/submit_model tasks, includes FileLock and division by zero protection |
| examples/advanced/job_api/pt/src/cifar10_fl_lightning.py | New Lightning implementation with proper FileLock for dataset download race condition prevention |
| examples/advanced/job_api/pt/fedavg_script_runner_xsite_val_cifar10.py | New cross-site validation example with Dirichlet data partitioning, FileLock in data loading, proper task name mapping |
| examples/advanced/job_api/pt/swarm_script_runner_cifar10.py | Updated to use cifar10_fl_partitioned.py script, added TBAnalyticsReceiver for tensorboard tracking |

Sequence Diagram

sequenceDiagram
    participant Server
    participant Client1 as Client (site-1)
    participant Client2 as Client (site-2)
    participant CIFAR10 as CIFAR10 Dataset
    
    Note over Server,Client2: Initialization Phase
    Server->>Client1: Initialize with global model
    Server->>Client2: Initialize with global model
    
    Client1->>CIFAR10: Download dataset (with FileLock)
    Client2->>CIFAR10: Download dataset (with FileLock)
    Note over Client1,Client2: FileLock prevents race conditions
    
    loop Training Rounds (num_rounds)
        Note over Server,Client2: Training Phase
        Server->>Client1: Send global model (train task)
        Server->>Client2: Send global model (train task)
        
        Client1->>Client1: Load partitioned data subset
        Client2->>Client2: Load partitioned data subset
        
        Client1->>Client1: Train local model (local_epochs)
        Client1->>Client1: Evaluate local model
        Client1->>Client1: Evaluate global model for selection
        Client1->>Server: Send updated weights + accuracy metric
        
        Client2->>Client2: Train local model (local_epochs)
        Client2->>Client2: Evaluate local model
        Client2->>Client2: Evaluate global model for selection
        Client2->>Server: Send updated weights + accuracy metric
        
        Server->>Server: Aggregate weights (FedAvg)
        Server->>Server: Select best model (IntimeModelSelector)
        
        opt Cross-Site Validation
            Note over Server,Client2: Cross-Site Evaluation Phase
            Server->>Client1: Request best local model (submit_model task)
            Server->>Client2: Request best local model (submit_model task)
            
            Client1->>Server: Send best local model
            Client2->>Server: Send best local model
            
            Server->>Client1: Send Client2's model (validate task)
            Server->>Client2: Send Client1's model (validate task)
            
            Client1->>Client1: Evaluate Client2's model
            Client2->>Client2: Evaluate Client1's model
            
            Client1->>Server: Send cross-site accuracy
            Client2->>Server: Send cross-site accuracy
            
            Server->>Server: Generate validation JSON results
        end
    end
    
    Note over Server,Client2: Training Complete

Contributor

@greptile-apps greptile-apps Bot left a comment

3 files reviewed, 3 comments

Comment thread examples/advanced/job_api/pt/src/cifar10_fl_partitioned.py Outdated
Comment thread examples/advanced/job_api/pt/fedavg_script_runner_xsite_val_cifar10.py Outdated
@greptile-apps
Contributor

greptile-apps Bot commented Feb 3, 2026

Additional Comments (1)

examples/advanced/job_api/pt/src/cifar10_fl_train_eval_submit.py
Integer division used for accuracy calculation can lose precision

        return 100 * correct / total

Contributor

@greptile-apps greptile-apps Bot left a comment

5 files reviewed, 4 comments

@greptile-apps
Contributor

greptile-apps Bot commented Feb 3, 2026

Additional Comments (3)

examples/advanced/job_api/pt/src/cifar10_fl.py
Integer division loses precision in accuracy calculation

            print(f"Accuracy of the network on the 10000 test images: {100 * correct / total} %")
            return 100 * correct / total

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!


examples/advanced/job_api/pt/src/cifar10_fl_train_eval_submit.py
Integer division loses precision

        return 100 * correct / total

examples/advanced/job_api/pt/cse_script_runner_cifar10.py
Task name mismatch: controller sets validation_task_name="validate" but the script checks flare.is_evaluate() which expects task name evaluate. This will cause the validation phase to fail.

        validation_task_name="evaluate",

Contributor

@greptile-apps greptile-apps Bot left a comment

3 files reviewed, 1 comment

@greptile-apps
Contributor

greptile-apps Bot commented Feb 3, 2026

Additional Comments (1)

examples/advanced/job_api/pt/src/cifar10_fl_train_eval_submit.py
Integer division loses precision in accuracy calculation

        return 100 * correct / total

Contributor

@greptile-apps greptile-apps Bot left a comment

3 files reviewed, 1 comment

@greptile-apps
Contributor

greptile-apps Bot commented Feb 3, 2026

Additional Comments (1)

examples/advanced/job_api/pt/src/cifar10_fl_train_eval_submit.py
Integer division loses precision in accuracy calculation

        return 100.0 * correct / total

@ZiyueXu77
Collaborator Author

Additional Comments (1)
examples/advanced/job_api/pt/src/cifar10_fl_train_eval_submit.py Integer division loses precision in accuracy calculation

        return 100.0 * correct / total

In Python 3, / already performs floating-point (true) division. To make the float calculation explicit, the float-literal form 100.0 * correct / total is the simplest and clearest.

Contributor

@greptile-apps greptile-apps Bot left a comment

4 files reviewed, 3 comments

@greptile-apps

This comment was marked as resolved.

@ZiyueXu77
Collaborator Author

Additional Comments (3)
examples/advanced/job_api/pt/src/cifar10_fl.py Integer division loses precision in accuracy calculation

            return 100.0 * correct / total

examples/advanced/job_api/pt/src/cifar10_fl.py Integer division loses precision in displayed accuracy

            print(f"Accuracy of the network on the 10000 test images: {100.0 * correct / total} %")

examples/advanced/job_api/pt/src/cifar10_fl_train_eval_submit.py Integer division loses precision in accuracy calculation

        return 100.0 * correct / total

@greptileai, it will not lose precision; test it yourself:

python -c "correct=3; total=8; print(100*correct/total, 100.0*correct/total, 100*float(correct)/total)"
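
The same check can be written as a short script. It is a sketch using the example values from the comment above, and it adds the floor-division operator `//` for contrast, since that is the operator that actually truncates in Python 3.

```python
correct, total = 3, 8

true_div = 100 * correct / total         # "/" is true division in Python 3
float_literal = 100.0 * correct / total  # explicit float literal, same value
floor_div = 100 * correct // total       # "//" is the operator that truncates

print(true_div, float_literal, floor_div)  # 37.5 37.5 37
```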

@greptile-apps

This comment was marked as resolved.

Contributor

@greptile-apps greptile-apps Bot left a comment

4 files reviewed, 3 comments

Comment thread examples/advanced/job_api/pt/src/cifar10_fl_partitioned.py
Comment thread examples/advanced/job_api/pt/src/cifar10_fl_train_eval_submit.py Outdated
Comment thread examples/advanced/job_api/pt/src/cifar10_fl.py
ZiyueXu77 and others added 3 commits February 3, 2026 15:03
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Contributor

@greptile-apps greptile-apps Bot left a comment

4 files reviewed, no comments

@ZiyueXu77
Collaborator Author

/build

@ZiyueXu77 ZiyueXu77 mentioned this pull request Feb 3, 2026
6 tasks
@ZiyueXu77
Collaborator Author

/build

Contributor

@greptile-apps greptile-apps Bot left a comment

4 files reviewed, no comments

Contributor

Copilot AI left a comment

Pull request overview

This PR updates the CIFAR-10 PyTorch Job API examples to use a consistent dataset location with safe concurrent downloads, and replaces the older model-learner-based FedAvg cross-site validation example with a Client API/script-runner–based example that supports heterogeneous data partitioning.

Changes:

  • Ensure all CIFAR-10 example scripts use /tmp/nvflare/data/cifar10 and add filelock-based locking around dataset downloads to avoid race conditions across multiple sites.
  • Refine client-side evaluation logic to use float accuracies with explicit checks for empty test loaders, and improve logging for train/evaluate/submit tasks.
  • Replace the fedavg_model_learner_xsite_val_cifar10.py example with a new fedavg_script_runner_xsite_val_cifar10.py + cifar10_fl_partitioned.py pipeline that partitions CIFAR-10 non-iid across sites and integrates Cross-Site Evaluation (CSE); align swarm and CSE script runners, requirements, and README with these changes.
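
The empty-loader guard mentioned in the changes can be sketched as a small helper. The function name and the choice to return 0.0 for an empty loader are assumptions for illustration; the PR's scripts may handle that case differently.

```python
def accuracy_percent(correct, total):
    """Return accuracy as a float percentage, or 0.0 when nothing was evaluated."""
    if total == 0:
        # Empty test loader: avoid ZeroDivisionError and report 0.0 instead.
        return 0.0
    return 100.0 * correct / total
```

A caller would accumulate `correct` and `total` over the test loader and pass both in, so the guard lives in one place instead of at every return site.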

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| examples/advanced/job_api/pt/swarm_script_runner_cifar10.py | Points the swarm job to the new cifar10_fl_train_eval_submit.py training script to keep the swarm example aligned with the updated Client API workflow. |
| examples/advanced/job_api/pt/src/cifar10_lightning_fl.py | Uses /tmp/nvflare/data/cifar10 and a FileLock-guarded prepare_data to prevent concurrent CIFAR-10 downloads across Lightning clients. |
| examples/advanced/job_api/pt/src/cifar10_fl_train_eval_submit.py | Adds locked dataset download, converts accuracy computation to safe float division with a guard for empty test loaders, and improves task-specific logging. |
| examples/advanced/job_api/pt/src/cifar10_fl_partitioned.py | New Client API training script that loads optional per-site index splits, partitions CIFAR-10 accordingly, and supports train/evaluate/submit_model with best-model saving and robust accuracy computation. |
| examples/advanced/job_api/pt/src/cifar10_fl.py | Aligns dataset root with the new CIFAR-10 path, wraps downloads in a FileLock, and updates evaluation to use float accuracies with an explicit empty-loader check. |
| examples/advanced/job_api/pt/requirements.txt | Adds filelock>=3.12.0 to support the new locking behavior in all CIFAR-10 examples that download data. |
| examples/advanced/job_api/pt/fedavg_script_runner_xsite_val_cifar10.py | New FedAvg Job API script that partitions CIFAR-10 with a Dirichlet sampler, creates and saves per-site splits, configures Scatter-and-Gather training plus CrossSiteModelEval, and wires ScriptRunner clients with the partitioned training script. |
| examples/advanced/job_api/pt/fedavg_model_learner_xsite_val_cifar10.py | Removes the older model-learner–based FedAvg x-site validation example in favor of the newer Client API/script-runner–based example. |
| examples/advanced/job_api/pt/cse_script_runner_cifar10.py | Switches to the cifar10_fl_train_eval_submit.py script, uses DataKind.WEIGHTS for the aggregator, aligns CSE validation_task_name="evaluate", and normalizes client IDs to site-{i}. |
| examples/advanced/job_api/pt/README.md | Updates the fifth example to reference the new FedAvg script runner with cross-site validation and heterogeneous data partitioning, adjusting the command and description accordingly. |
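
The Dirichlet-based partitioning referenced above can be sketched roughly as follows. This is not the PR's implementation: the function name, the `alpha` and `seed` parameters, and the standard-library Gamma sampling (used in place of a NumPy Dirichlet sampler) are all assumptions made for illustration. For each class, a Dirichlet draw decides what share of that class's samples goes to each site. Smaller `alpha` values give more skewed, non-IID splits.

```python
import random


def dirichlet_partition(labels, num_sites, alpha, seed=0):
    """Split sample indices across sites using per-class Dirichlet(alpha) shares."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    site_indices = {f"site-{i + 1}": [] for i in range(num_sites)}
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # A Dirichlet sample is a vector of Gamma(alpha, 1) draws, normalized.
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_sites)]
        total = sum(draws)
        # Turn cumulative shares into cut points over this class's indices.
        cuts, running = [], 0.0
        for draw in draws[:-1]:
            running += draw / total
            cuts.append(min(int(running * len(indices)), len(indices)))
        bounds = [0] + cuts + [len(indices)]
        for site, start, end in zip(site_indices, bounds, bounds[1:]):
            site_indices[site].extend(indices[start:end])
    return site_indices
```

Each index is assigned to exactly one site, and the 1-indexed `site-{i + 1}` keys follow the naming used by the other examples in this PR.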

Contributor

@greptile-apps greptile-apps Bot left a comment

3 files reviewed, 1 comment

Comment thread examples/advanced/job_api/pt/cse_script_runner_cifar10.py Outdated
Contributor

@greptile-apps greptile-apps Bot left a comment

6 files reviewed, 4 comments

Comment thread examples/advanced/job_api/pt/swarm_script_runner_cifar10.py
Comment thread examples/advanced/job_api/pt/cyclic_cc_script_runner_cifar10.py
@greptile-apps
Contributor

greptile-apps Bot commented Feb 4, 2026

Additional Comments (2)

examples/advanced/job_api/pt/cse_script_runner_cifar10.py
Script file src/cifar10_fl_train_eval_submit.py does not exist. Should be src/cifar10_fl_partitioned.py.

    train_script = "src/cifar10_fl_partitioned.py"

examples/advanced/job_api/pt/cse_script_runner_cifar10.py
Site naming inconsistency: uses 0-indexed site-{i} (site-0, site-1), but other examples and data split files use 1-indexed site-{i + 1} (site-1, site-2). This will cause data split loading to fail if partitioned data is used.

        job.to(executor, f"site-{i + 1}")
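
The naming mismatch can be shown in isolation. In this sketch, `split_keys` is a hypothetical stand-in for the names under which per-site split files are stored; only the `site-{i}` and `site-{i + 1}` patterns come from the comment above.

```python
n_clients = 2

# Hypothetical: per-site data splits are stored under 1-indexed site names.
split_keys = {f"site-{i + 1}" for i in range(n_clients)}

zero_indexed = [f"site-{i}" for i in range(n_clients)]      # site-0, site-1
one_indexed = [f"site-{i + 1}" for i in range(n_clients)]   # site-1, site-2

# Any client name without a matching split would fail to load its partition.
missing_zero = [name for name in zero_indexed if name not in split_keys]
missing_one = [name for name in one_indexed if name not in split_keys]

print(missing_zero)  # ['site-0']
print(missing_one)   # []
```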

Contributor

@greptile-apps greptile-apps Bot left a comment

4 files reviewed, no comments

@ZiyueXu77
Collaborator Author

/build

@ZiyueXu77 ZiyueXu77 enabled auto-merge (squash) February 4, 2026 19:27
@ZiyueXu77
Collaborator Author

/build

Contributor

@greptile-apps greptile-apps Bot left a comment

5 files reviewed, no comments

@ZiyueXu77
Collaborator Author

/build

Contributor

@greptile-apps greptile-apps Bot left a comment

5 files reviewed, no comments

@ZiyueXu77 ZiyueXu77 merged commit d80fbea into NVIDIA:main Feb 4, 2026
19 checks passed
@ZiyueXu77 ZiyueXu77 deleted the jobapi branch February 4, 2026 21:44
ZiyueXu77 added a commit that referenced this pull request Feb 5, 2026
Fixes # .

### Description

cherry pick #4112

### Types of changes
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>