Bit reproducible checkpoint/restart of stochastic physics SPT and SKEB schemes by stevemullerworth · Pull Request #383 · MetOffice/lfric_apps

Steve Mullerworth (stevemullerworth) · 2026-03-18T14:25:09Z

PR Summary

Sci/Tech Reviewer: Shusuke Nishimoto (@mo-snishimoto)
Code Reviewer: Lottie Turner (@mo-lottieturner)

The SKEB and SPT stochastic physics schemes create some arrays of numbers at the start of a run. The arrays are initialised at the start and then evolve throughout the run. Therefore, the arrays need to be included in the checkpoint dump.

The method for including them in the checkpoint dump follows the method used for checkpointing the random seed, where values are stored in an io_value_type that is included in modeldb. A copy of each value is extracted prior to the stochastic physics call, then an updated value is put back into modeldb after the call.
The previous code initialised the array during the first call to the stochastic physics schemes: a saved logical that is initialised as true then set to false on the first call ensures initialisation occurs once. Now, the initialisation code is moved to a separate subroutine and called from the driver layer only if a checkpoint file is not being read.
A short NRUN/CRUN test of the clim_gal9 configuration has been added to check that an NRUN followed by a CRUN is equivalent to a single NRUN of the same total length. Note that this runs at 64-bit because the existing checkpointing of the random seed does not work at 32-bit. See issue #389, "Checkpointing of random seeds in lfric_atm does not work at 32-bit", for suggested fixes.
The CRUN results of the existing clim_gal9 tests have changed as expected, because these runs use SKEB and SPT. I checked that the NRUN clim_gal9 produced the same results as 3.1 stable.
See issue #412, "lfric_atm clim_gal9 does not always bit compare across nrun/crun boundary", for another problem discovered while running the nrun/crun test of this change. Restarting after timestep 3 diverged, most likely due to a separate issue, so the nrun/crun test for this job now restarts after timestep 4.
Debug log output formatting of the random seed did not give enough space to print a large 32-bit integer, so this was increased (and tested).
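
For readers unfamiliar with the pattern described above, here is a minimal Python sketch of it. It is not the LFRic code: a plain dict stands in for modeldb, and `lcg`, `init_stochastic_state`, `stochastic_step`, `run` and `driver` are hypothetical names invented for illustration. It only shows the two ideas from the summary: state is copied out before the stochastic call and written back after it, and initialisation is done by the driver only when no checkpoint is being read.

```python
import copy


def lcg(seed):
    """Tiny linear congruential generator: returns (new_seed, value in [0, 1))."""
    new_seed = (1103515245 * seed + 12345) % 2**31
    return new_seed, new_seed / 2**31


def init_stochastic_state(store, n=4):
    """Initialise the evolving state. Called by the driver only on a fresh start."""
    store["seed"] = 12345
    store["coeffs"] = [0.0] * n


def stochastic_step(coeffs, seed):
    """Toy AR(1) update standing in for a stochastic physics call."""
    updated = []
    for c in coeffs:
        seed, r = lcg(seed)
        updated.append(0.9 * c + 0.1 * (r - 0.5))
    return updated, seed


def run(store, nsteps):
    for _ in range(nsteps):
        # Extract copies of the stored values before the call ...
        coeffs = list(store["coeffs"])
        seed = store["seed"]
        coeffs, seed = stochastic_step(coeffs, seed)
        # ... and put the updated values back afterwards.
        store["coeffs"] = coeffs
        store["seed"] = seed


def driver(nsteps, checkpoint=None):
    """Initialise only when no checkpoint is supplied; return the final store."""
    if checkpoint is None:
        store = {}
        init_stochastic_state(store)
    else:
        store = copy.deepcopy(checkpoint)
    run(store, nsteps)
    return store
```

In this toy, `driver(4, checkpoint=driver(4))` gives the same store as `driver(8)`, because everything that evolves between calls lives in the store rather than in a saved local variable.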

Code Quality Checklist

I have performed a self-review of my own code
My code follows the project's style guidelines
Comments have been included that aid understanding and enhance the readability of the code
My changes generate no new warnings
All automated checks in the CI pipeline have completed successfully

Testing

I have tested this change locally, using the LFRic Apps rose-stem suite
If any tests fail (rose-stem or CI) the reason is understood and acceptable (e.g. kgo changes)
I have added tests to cover new functionality as appropriate (e.g. system tests, unit tests, etc.)
Any new tests have been assigned an appropriate amount of compute resource and have been allocated to an appropriate testing group (i.e. the developer tests are for jobs which use a small amount of compute resource and complete in a matter of minutes)

trac.log

There is one failure caused by running out of time. This job regularly times out and previously passed in a test with the same codebase.

Test Suite Results - lfric_apps - PR383_final2/run1

Suite Information

| Item | Value |
| --- | --- |
| Suite Name | PR383_final2/run1 |
| Suite User | steve.mullerworth |
| Workflow Start | 2026-03-31T16:03:43 |
| Groups Run | all |

| Dependency | Reference | Main Like |
| --- | --- | --- |
| casim | MetOffice/casim@2026.03.1 | True |
| jules | MetOffice/jules@2026.03.1 | True |
| lfric_apps | stevemullerworth/lfric_apps@stoch_vn3.1 | False |
| lfric_core | MetOffice/lfric_core@2026.03.1 | True |
| moci | MetOffice/moci@2026.03.1 | True |
| SimSys_Scripts | MetOffice/SimSys_Scripts@2026.03.1 | True |
| socrates | MetOffice/socrates@2026.03.1 | True |
| socrates-spectral | MetOffice/socrates-spectral@2026.03.1 | True |
| ukca | MetOffice/ukca@2026.03.1 | True |

Task Information

❌ failed tasks - 1

| Task | State |
| --- | --- |
| run_gungho_model_robert-moist-lam-BiP100x8-10x10_azspice_gnu_fast-debug-64bit | failed |

✅ succeeded tasks - 1516

Security Considerations

I have reviewed my changes for potential security issues
Sensitive data is properly handled (if applicable)
Authentication and authorisation are properly implemented (if applicable)

Performance Impact

Performance of the code has been considered and, if applicable, suitable performance measurements have been conducted

AI Assistance and Attribution

Some of the content of this change has been produced with the assistance of Generative AI tool name (e.g., Met Office Github Copilot Enterprise, Github Copilot Personal, ChatGPT GPT-4, etc) and I have followed the Simulation Systems AI policy (including attribution labels)

Documentation

Where appropriate I have updated documentation related to this change and confirmed that it builds correctly

PSyclone Approval

If you have edited any PSyclone-related code (e.g. PSyKAl-lite, Kernel interface, optimisation scripts, LFRic data structure code) then please contact the TCD Team

Sci/Tech Review

I understand this area of code and the changes being added
The proposed changes correspond to the pull request description
Documentation is sufficient (do documentation papers need updating)
Sufficient testing has been completed

(Please alert the code reviewer via a tag when you have approved the SR)

Code Review

All dependencies have been resolved
Related Issues have been properly linked and addressed
CLA compliance has been confirmed
Code quality standards have been met
Tests are adequate and have passed
Documentation is complete and accurate
Security considerations have been addressed
Performance impact is acceptable

rose-stem/site/meto/groups/groups_lfric_atm.cylc

iboutle

Looks good to me.

I'm slightly worried about the comment that "existing checkpointing of the random seed doesn't work at single precision" - is there an issue or PR to address this that can be linked?

Steve Mullerworth (stevemullerworth) · 2026-03-19T15:46:22Z

> Looks good to me.
>
> I'm slightly worried about the comment that "existing checkpointing of the random seed doesn't work at single precision" - is there an issue or PR to address this that can be linked?

#389 describes the issue

iboutle · 2026-03-19T15:51:28Z

> > Looks good to me.
> > I'm slightly worried about the comment that "existing checkpointing of the random seed doesn't work at single precision" - is there an issue or PR to address this that can be linked?
>
> #389 describes the issue

Thanks, that makes sense - I've tagged it as being an issue relevant to GC6 suites, which are using 32-bit, and so presumably still won't see nrun-crun reproducibility even with this PR because of it.

Andrew Coughtrie (andrewcoughtrie)

I'm happy as a code owner with the implementation in the driver source, it is done as currently expected.

thomasmelvin

All looks fine to me

Shusuke Nishimoto (mo-snishimoto) · 2026-03-27T19:32:14Z

Hi, Steve Mullerworth (@stevemullerworth). I am happy to approve all of your source code changes, but I would like to point out the following:

The checksum files below were added (or changed) but are not committed in your branch:

ex1a/checksum_lfric_atm_clim_gal9_short-C12_ex1a_cce_fast-debug-64bit.txt (not added to your branch)
~~azspice/checksum_lfric_atm_clim_gal9_chem-C12_azspice_gnu_fast-debug-32bit.txt (change is not reflected)~~
~~ex1a/checksum_lfric_atm_clim_gal9_chem-C12_ex1a_cce_fast-debug-32bit.txt (change is not reflected)~~

The tests you added (run_lfric_atm_clim_gal9_short-C12_*-64bit*) don't use the SKEB or SPT schemes (stochastic physics is turned off). I don't think that is what you meant to do. You should probably add "climate" and "lowres_stp" at https://github.com/stevemullerworth/lfric_apps/blob/stoch_vn3.1/rose-stem/site/common/lfric_atm/tasks_lfric_atm.cylc#L459 .

~~Could you also resolve the conflicts, please? I cannot push the approve button (due to system specifications..?).~~ I'm sorry, that was a misunderstanding.

Best wishes, Shusuke

Shusuke Nishimoto (mo-snishimoto)

I'm sorry, I made a mistake in the operation. My comment is as stated above.

github-actions · 2026-03-31T16:03:05Z

⚠️ Hello Steve Mullerworth (@stevemullerworth)!

Your CLA signature was found on the base branch, but you appear to have modified the CONTRIBUTORS.md file in this PR.

Please do not edit the CONTRIBUTORS.md file. If you have already signed the CLA, revert changes to the file and your signature will be picked up.

Steve Mullerworth (stevemullerworth) · 2026-04-01T13:57:35Z

Shusuke Nishimoto (@mo-snishimoto), thanks for the careful review. Apologies for failing to commit the KGO changes, and for copying the wrong test configuration for the clim_gal9 short nrun/crun test.

After doing this, I found that the nrun/crun test diverges if checkpointing occurs after timestep 3, whereas it was fine at the timesteps I had tested. I think this is an unrelated issue, so I have opened #412 to investigate it and set the current test to restart after timestep 4.

Shusuke Nishimoto (mo-snishimoto)

Steve Mullerworth (@stevemullerworth), thank you for the quick response. I confirmed that you properly fixed what I pointed out. (I'm sorry that I mistakenly flagged 2 checksum files that you had already committed correctly; I misunderstood something.)

The nrun/crun divergence at timestep 3 is an annoying problem... I also ran some tests and confirmed that the results are reproducible under the following conditions:

case 1
- crun0 : timestep_start=1, timestep_end=2
- crun1 : timestep_start=3, timestep_end=6
- nrun : timestep_start=1, timestep_end=6
case 2
- crun0 : timestep_start=1, timestep_end=4
- crun1 : timestep_start=5, timestep_end=6
- nrun : timestep_start=1, timestep_end=6
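
The shape of the comparison in the two cases above can be sketched as follows. This is only an illustration in Python with a toy deterministic timestep, not the rose-stem test itself; `step`, `run`, `checksum` and `nrun_crun_match` are hypothetical names, and the SHA-256 hash is an assumed stand-in for the checksum files.

```python
import hashlib
import struct


def step(state):
    """Toy deterministic timestep; state is (seed, value)."""
    seed, value = state
    seed = (1103515245 * seed + 12345) % 2**31
    return seed, 0.9 * value + 0.1 * (seed / 2**31 - 0.5)


def run(state, timestep_start, timestep_end):
    """Advance the state over timesteps timestep_start..timestep_end inclusive."""
    for _ in range(timestep_start, timestep_end + 1):
        state = step(state)
    return state


def checksum(state):
    """Hash the exact bit pattern, so any bit-level difference changes the result."""
    seed, value = state
    return hashlib.sha256(struct.pack("<qd", seed, value)).hexdigest()


def nrun_crun_match(initial, split, timestep_end):
    """Compare one long run against two runs joined at a checkpoint after `split`."""
    nrun = run(initial, 1, timestep_end)
    crun0 = run(initial, 1, split)            # final state plays the checkpoint
    crun1 = run(crun0, split + 1, timestep_end)
    return checksum(nrun) == checksum(crun1)


# Case 1 and case 2 from the list above: split after timestep 2 or 4, end at 6.
case1_ok = nrun_crun_match((12345, 0.0), split=2, timestep_end=6)
case2_ok = nrun_crun_match((12345, 0.0), split=4, timestep_end=6)
```

In this toy every split point matches by construction, since the whole state is carried across the boundary. It does not reproduce the divergence seen when restarting after timestep 3, which is tracked separately in #412.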

In any case, I'm happy to approve your change and pass this PR to Code Review, Lottie Turner (@mo-lottieturner) .

Steve Mullerworth (stevemullerworth) added 14 commits March 5, 2026 17:26

Start adding stochastic physics variables to setup

e351136

Fix basic compile errors

9c6fc56

Set up stochastic physics arrays and pass them down - not working yet

dd5ca37

Minor bug fix to use separate array for spt and skeb initialisation

14a3330

Bit compares. But checkpoint not working

4a07bd5

Refactor code that accesses modeldb to modularise, and to avoid temporary array creation

ac58176

Fixes to enable checkpoint dump read and write

bdbf40d

Change how initialisation of stochastic physics is done

3afc3aa

Fix ifdef UMPHYSICS issues found when building gungho_mod

3e1b0e0

Remove debug statements and add comments

9f37e6d

Add an nrun-crun test for 64-bit clim-gal9 with stochastic physics

124de39

Correct formatting for writing random seeds (tested)

8c8288f

Improve long names

dde9cc0

Add comment

d8801eb

github-actions bot assigned Steve Mullerworth (stevemullerworth) Mar 18, 2026

Steve Mullerworth (stevemullerworth) added 2 commits March 18, 2026 16:05

Checksum changes and addition

1bc380b

Checksum changes for clim_gal_chem in extra group

0bb7554

Steve Mullerworth (stevemullerworth) marked this pull request as ready for review March 19, 2026 14:22

Steve Mullerworth (stevemullerworth) requested review from a team, iboutle and thomasmelvin as code owners March 19, 2026 14:22

Steve Mullerworth (stevemullerworth) requested review from Andrew Coughtrie (andrewcoughtrie) and removed request for a team March 19, 2026 14:22

iboutle reviewed Mar 19, 2026

iboutle approved these changes Mar 19, 2026

Andrew Coughtrie (andrewcoughtrie) approved these changes Mar 19, 2026

Correct platform and compiler for test in ex1a group

3bd136f

Co-authored-by: iboutle <135141261+iboutle@users.noreply.github.com>

github-actions bot added the cla-modified The CLA has been modified as part of this PR - added by GA label Mar 19, 2026

Merge remote-tracking branch 'origin/stoch_vn3.1' into stoch_vn3.1

3e342b4

thomasmelvin approved these changes Mar 24, 2026

Steve Mullerworth (stevemullerworth) requested a review from Shusuke Nishimoto (mo-snishimoto) March 24, 2026 13:58

Shusuke Nishimoto (mo-snishimoto) requested changes Mar 27, 2026

Steve Mullerworth (stevemullerworth) added 3 commits March 31, 2026 15:57

Correct clim_gal9_short config and increase nrun/crun test to 4+4ts as divergence occurs with checkpointing at 3ts

ed96de8

Raise memory for clim_gal9_short to match existing clim_gal9 task

86ef630

Update KGOs now NRUN/CRUN test runs longer

e22cf24

Steve Mullerworth (stevemullerworth) requested a review from Shusuke Nishimoto (mo-snishimoto) April 1, 2026 13:57

Shusuke Nishimoto (mo-snishimoto) approved these changes Apr 2, 2026

Steve Mullerworth (stevemullerworth) wants to merge 21 commits into MetOffice:main from stevemullerworth:stoch_vn3.1

6 participants
