Skip to content

Bit reproducible checkpoint/restart of stochastic physics SPT and SKEB schemes#383

Open
Steve Mullerworth (stevemullerworth) wants to merge 21 commits intoMetOffice:mainfrom
stevemullerworth:stoch_vn3.1
Open

Bit reproducible checkpoint/restart of stochastic physics SPT and SKEB schemes#383
Steve Mullerworth (stevemullerworth) wants to merge 21 commits intoMetOffice:mainfrom
stevemullerworth:stoch_vn3.1

Conversation

@stevemullerworth
Copy link
Copy Markdown
Collaborator

@stevemullerworth Steve Mullerworth (stevemullerworth) commented Mar 18, 2026

PR Summary

Sci/Tech Reviewer: is Shusuke Nishimoto (@mo-snishimoto)
Code Reviewer: Lottie Turner (@mo-lottieturner)

The SKEB and SPT stochastic physics schemes create some arrays of numbers at the start of a run. The arrays are initialised at the start and then evolve throughout the run. Therefore, the arrays need to be included in the checkpoint dump.

  1. The method for including them in the checkpoint dump follows the method used for checkpointing the random seed, where values are stored in an io_value_type that is included in modeldb. A copy of the each value is extracted prior to the stochastic physics call, then an updated value is put back into modeldb after the call.
  2. The previous code initialised the array during the first call to the stochastic physics schemes: a saved logical that is initialised as true then set to false on the first call ensures initialisation occurs once. Now, the initialisation code is moved to a separate subroutine and called from the driver layer only if a checkpoint file is not being read.
  3. A short NRUN/CRUN test of the clim_gal9 configuration has been added to test equivalence of the NRUN/CRUN with an equal-length NRUN. Note, this runs at 64-bit because the existing checkpoint of the random seed does not work at 32-bit. See Issue Checkpointing of random seeds in lfric_atm does not work at 32-bit #389 for suggested fixes.
  4. The CRUN of the existing clim-gal9 tests have changed as expected, as this run uses SKEB and SPT. I checked that the NRUN clim_gal9 produced the same results as 3.1 stable.
  5. See lfric_atm clim_gal9 does not always bit compare across nrun/crun boundary #412 for another issue that was discovered during nrun/crun of this change, leading to the nrun/crun test for this job being restarted after timestep 4 because restarting after timestep 3 diverged, most likely, due to a separate issue.
  6. Debug log output formatting of the random seed did not give enough space to print a large 32-bit integer, so this was increased (and tested).

Code Quality Checklist

  • I have performed a self-review of my own code
  • My code follows the project's style guidelines
  • Comments have been included that aid understanding and enhance the readability of the code
  • My changes generate no new warnings
  • All automated checks in the CI pipeline have completed successfully

Testing

  • I have tested this change locally, using the LFRic Apps rose-stem suite
  • If any tests fail (rose-stem or CI) the reason is understood and acceptable (e.g. kgo changes)
  • I have added tests to cover new functionality as appropriate (e.g. system tests, unit tests, etc.)
  • Any new tests have been assigned an appropriate amount of compute resource and have been allocated to an appropriate testing group (i.e. the developer tests are for jobs which use a small amount of compute resource and complete in a matter of minutes)

trac.log

There is one failure caused by running out of time. This job regularly times out and previously passed in a test with the same codebase.

Test Suite Results - lfric_apps - PR383_final2/run1

Suite Information

Item Value
Suite Name PR383_final2/run1
Suite User steve.mullerworth
Workflow Start 2026-03-31T16:03:43
Groups Run all
Dependency Reference Main Like
casim MetOffice/casim@2026.03.1 True
jules MetOffice/jules@2026.03.1 True
lfric_apps stevemullerworth/lfric_apps@stoch_vn3.1 False
lfric_core MetOffice/lfric_core@2026.03.1 True
moci MetOffice/moci@2026.03.1 True
SimSys_Scripts MetOffice/SimSys_Scripts@2026.03.1 True
socrates MetOffice/socrates@2026.03.1 True
socrates-spectral MetOffice/socrates-spectral@2026.03.1 True
ukca MetOffice/ukca@2026.03.1 True

Task Information

❌ failed tasks - 1
Task State
run_gungho_model_robert-moist-lam-BiP100x8-10x10_azspice_gnu_fast-debug-64bit failed
✅ succeeded tasks - 1516

Security Considerations

  • I have reviewed my changes for potential security issues
  • Sensitive data is properly handled (if applicable)
  • Authentication and authorisation are properly implemented (if applicable)

Performance Impact

  • Performance of the code has been considered and, if applicable, suitable performance measurements have been conducted

AI Assistance and Attribution

  • Some of the content of this change has been produced with the assistance of Generative AI tool name (e.g., Met Office Github Copilot Enterprise, Github Copilot Personal, ChatGPT GPT-4, etc) and I have followed the Simulation Systems AI policy (including attribution labels)

Documentation

  • Where appropriate I have updated documentation related to this change and confirmed that it builds correctly

PSyclone Approval

  • If you have edited any PSyclone-related code (e.g. PSyKAl-lite, Kernel interface, optimisation scripts, LFRic data structure code) then please contact the TCD Team

Sci/Tech Review

  • I understand this area of code and the changes being added
  • The proposed changes correspond to the pull request description
  • Documentation is sufficient (do documentation papers need updating)
  • Sufficient testing has been completed

(Please alert the code reviewer via a tag when you have approved the SR)

Code Review

  • All dependencies have been resolved
  • Related Issues have been properly linked and addressed
  • CLA compliance has been confirmed
  • Code quality standards have been met
  • Tests are adequate and have passed
  • Documentation is complete and accurate
  • Security considerations have been addressed
  • Performance impact is acceptable

Copy link
Copy Markdown
Contributor

@iboutle iboutle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

I'm slightly worried about the comment that "existing checkpointing of the random seed doesn't work at single precision" - is there an issue or PR to address this that can be linked?

@stevemullerworth
Copy link
Copy Markdown
Collaborator Author

Looks good to me.

I'm slightly worried about the comment that "existing checkpointing of the random seed doesn't work at single precision" - is there an issue or PR to address this that can be linked?

#389 describes the issue

@iboutle
Copy link
Copy Markdown
Contributor

iboutle commented Mar 19, 2026

Looks good to me.
I'm slightly worried about the comment that "existing checkpointing of the random seed doesn't work at single precision" - is there an issue or PR to address this that can be linked?

#389 describes the issue

Thanks, that makes sense - I've tagged it as being an issue relevant to GC6 suites, which are using 32bit, and so presumably still won't see nrun-crun reproducability even with this PR because of it.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy as a code owner with the implementation in the driver source, it is done as currently expected.

Co-authored-by: iboutle <135141261+iboutle@users.noreply.github.com>
@github-actions github-actions bot added the cla-modified The CLA has been modified as part of this PR - added by GA label Mar 19, 2026
Copy link
Copy Markdown
Contributor

@thomasmelvin thomasmelvin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All looks fine to me

@mo-snishimoto
Copy link
Copy Markdown

Shusuke Nishimoto (mo-snishimoto) commented Mar 27, 2026

Hi, Steve Mullerworth (@stevemullerworth) . I am happy to approve your every change of source code, but I would like to point out followings:

  1. Checksum files below were added (changed) but are not commited in your branch:

ex1a/checksum_lfric_atm_clim_gal9_short-C12_ex1a_cce_fast-debug-64bit.txt (not added to your branch)
azspice/checksum_lfric_atm_clim_gal9_chem-C12_azspice_gnu_fast-debug-32bit.txt (change is not reflected)
ex1a/checksum_lfric_atm_clim_gal9_chem-C12_ex1a_cce_fast-debug-32bit.txt (change is not reflected)

  1. The tests you added (run_lfric_atm_clim_gal9_short-C12_*-64bit*) don't use SKEB or SPT scheme (stochastic physics scheme is turned off). I don't think it is what you meant to do. I think you should probably add "climate" and "lowres_stp" at https://github.com/stevemullerworth/lfric_apps/blob/stoch_vn3.1/rose-stem/site/common/lfric_atm/tasks_lfric_atm.cylc#L459 .

Could you also resolve conflicts because I cannot push approve button (due to system specifications..?), please? I'm sorry, that was a misunderstanding.

Best wishes, Shusuke

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry, I made a mistake in the operation. My comment is as stated above.

@github-actions
Copy link
Copy Markdown

⚠️ Hello Steve Mullerworth (@stevemullerworth)!

Your CLA signature was found on the base branch, but you appear to have modified the CONTRIBUTORS.md file in this PR.

Please do not edit the CONTRIBUTORS.md file. If you have already signed the CLA, revert changes to the file and your signature will be picked up.

@stevemullerworth
Copy link
Copy Markdown
Collaborator Author

Shusuke Nishimoto (@mo-snishimoto), thanks for the careful review. Apologies for failing to commit the KGO changes, and for copying the wrong test configuration for the clim_gal9 short nrun/crun test.

After doing this, I found the nrun/crun test diverges if checkpointing occurs after timestep 3 whereas it was fine at the timesteps I tested. I think this is an unrelated issue, and have opened #412 to investigate and set the current test to restart after timestep 4.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Steve Mullerworth (@stevemullerworth) , Thank you for quick response. I confirmed you properly fixed what I pointed out (I'm sorry that I mistakenly pointed 2 checksum files, which you had already commited correctly. I misunderstood something.).

I feel nrun/crun test diverges at timestep 3 is an annoying problem... I also ran some tests and confirmed it will be reproducible in the following conditions:

  • case 1
    • crun0 : timestep_start=1, timestep_end=2
    • crun1 : timestep_start=3, timestep_end=6
    • nrun : timestep_start=1, timestep_end=6
  • case 2
    • crun0 : timestep_start=1, timestep_end=4
    • crun1 : timestep_start=5, timestep_end=6
    • nrun : timestep_start=1, timestep_end=6

In any case, I'm happy to approve your change and pass this PR to Code Review, Lottie Turner (@mo-lottieturner) .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla-modified The CLA has been modified as part of this PR - added by GA

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants