Bit reproducible checkpoint/restart of stochastic physics SPT and SKEB schemes #383
Steve Mullerworth (stevemullerworth) wants to merge 21 commits into MetOffice:main
…rary array creation
iboutle
left a comment
Looks good to me.
I'm slightly worried about the comment that "existing checkpointing of the random seed doesn't work at single precision" - is there an issue or PR to address this that can be linked?
#389 describes the issue.
Thanks, that makes sense. I've tagged it as an issue relevant to GC6 suites, which run at 32-bit and so presumably still won't see nrun-crun reproducibility even with this PR.
Andrew Coughtrie (andrewcoughtrie)
left a comment
As a code owner, I'm happy with the implementation in the driver source; it is done as currently expected.
Co-authored-by: iboutle <135141261+iboutle@users.noreply.github.com>
thomasmelvin
left a comment
All looks fine to me
Hi, Steve Mullerworth (@stevemullerworth). I am happy to approve all of your source code changes, but I would like to point out the following:
Best wishes, Shusuke
Shusuke Nishimoto (mo-snishimoto)
left a comment
I'm sorry, I made a mistake in the operation. My comment is as stated above.
…s divergence occurs with checkpointing at 3ts
Your CLA signature was found on the base branch, but you appear to have modified the CONTRIBUTORS.md file in this PR. Please do not edit the CONTRIBUTORS.md file. If you have already signed the CLA, revert changes to the file and your signature will be picked up.
Shusuke Nishimoto (@mo-snishimoto), thanks for the careful review. Apologies for failing to commit the KGO changes, and for copying the wrong test configuration for the clim_gal9 short nrun/crun test. After fixing these, I found that the nrun/crun test diverges if checkpointing occurs after timestep 3, whereas it was fine at the timesteps I had tested. I think this is an unrelated issue, so I have opened #412 to investigate it and have set the current test to restart after timestep 4.
Shusuke Nishimoto (mo-snishimoto)
left a comment
Steve Mullerworth (@stevemullerworth), thank you for the quick response. I confirmed that you have fixed what I pointed out. (I'm sorry that I mistakenly flagged 2 checksum files, which you had already committed correctly; I misunderstood something.)
The nrun/crun divergence at timestep 3 is an annoying problem. I also ran some tests and confirmed that the run is reproducible under the following conditions:
- case 1
  - crun0: timestep_start=1, timestep_end=2
  - crun1: timestep_start=3, timestep_end=6
  - nrun: timestep_start=1, timestep_end=6
- case 2
  - crun0: timestep_start=1, timestep_end=4
  - crun1: timestep_start=5, timestep_end=6
  - nrun: timestep_start=1, timestep_end=6
In any case, I'm happy to approve your change and pass this PR to code review, Lottie Turner (@mo-lottieturner).
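For readers unfamiliar with the nrun/crun comparison discussed above, the following is a minimal sketch of the idea and is not taken from the lfric_apps test suite. The toy model, the function names (`advance`, `checksum`, `nrun`, `crun`) and the seed are all hypothetical. It compares the checksum of one uninterrupted run over timesteps 1 to 6 against runs that checkpoint after timestep 2, 3 or 4 and then restart.

```python
import hashlib
import random
import struct


def advance(state, rng, first_ts, last_ts):
    """Advance the toy state over timesteps first_ts..last_ts inclusive."""
    for _ in range(first_ts, last_ts + 1):
        state = [0.9 * x + 0.1 * rng.gauss(0.0, 1.0) for x in state]
    return state


def checksum(state):
    # Hash the raw IEEE-754 bytes so the comparison is bit-for-bit.
    return hashlib.sha256(struct.pack(f"<{len(state)}d", *state)).hexdigest()


def nrun(seed, last_ts):
    """Single uninterrupted run from timestep 1 to last_ts."""
    rng = random.Random(seed)
    return checksum(advance([0.0] * 4, rng, 1, last_ts))


def crun(seed, split_ts, last_ts):
    """Run to split_ts (crun0), checkpoint, then restart to last_ts (crun1)."""
    rng = random.Random(seed)
    state = advance([0.0] * 4, rng, 1, split_ts)
    dump = (list(state), rng.getstate())  # toy checkpoint dump
    restarted_rng = random.Random()
    restarted_rng.setstate(dump[1])
    return checksum(advance(dump[0], restarted_rng, split_ts + 1, last_ts))


reference = nrun(seed=42, last_ts=6)
results = {split: crun(42, split, 6) == reference for split in (2, 3, 4)}
print(results)
```

In this toy model every split point reproduces the reference checksum, because the dump holds all of the evolving state. Splits after timesteps 2 and 4 correspond to case 1 and case 2 above; the split after timestep 3 is the one reported to diverge in the real test, which #412 tracks.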
PR Summary
Sci/Tech Reviewer: Shusuke Nishimoto (@mo-snishimoto)
Code Reviewer: Lottie Turner (@mo-lottieturner)
The SKEB and SPT stochastic physics schemes create arrays of numbers that are initialised at the start of a run and then evolve throughout it. Because the arrays carry state forward from one timestep to the next, they need to be included in the checkpoint dump for a restart to be bit reproducible.
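The reasoning in the summary can be illustrated with a small sketch. This is not the LFRic implementation: `ToyPattern`, its AR(1)-style update and the dump layout are hypothetical stand-ins, assumed only to show why an evolving array has to be written to the dump alongside the random number generator state.

```python
import random


class ToyPattern:
    """Hypothetical stand-in for an evolving stochastic array (not LFRic code)."""

    def __init__(self, seed, size=4):
        self.rng = random.Random(seed)
        # Array initialised once at the start of the run.
        self.coeffs = [self.rng.gauss(0.0, 1.0) for _ in range(size)]

    def step(self):
        # Each new value depends on the previous one, so the array cannot
        # simply be regenerated from the seed on restart.
        self.coeffs = [0.9 * c + 0.1 * self.rng.gauss(0.0, 1.0)
                       for c in self.coeffs]

    def checkpoint(self, include_array=True):
        dump = {"rng_state": self.rng.getstate()}
        if include_array:
            dump["coeffs"] = list(self.coeffs)
        return dump

    @classmethod
    def restart(cls, dump, seed, size=4):
        obj = cls(seed, size)  # re-initialises the array from scratch
        obj.rng.setstate(dump["rng_state"])
        if "coeffs" in dump:
            obj.coeffs = list(dump["coeffs"])
        return obj


def run(seed, n_steps, checkpoint_at=None, include_array=True):
    model = ToyPattern(seed)
    for ts in range(1, n_steps + 1):
        model.step()
        if ts == checkpoint_at:
            # Simulate stopping and restarting from the dump.
            model = ToyPattern.restart(model.checkpoint(include_array), seed)
    return model.coeffs


nrun = run(seed=1, n_steps=6)
crun_full = run(seed=1, n_steps=6, checkpoint_at=3)
crun_rng_only = run(seed=1, n_steps=6, checkpoint_at=3, include_array=False)
print(nrun == crun_full, nrun == crun_rng_only)
```

When only the generator state is saved, the restarted run falls back to the freshly initialised array and drifts away from the uninterrupted run; saving the array as well restores bit-for-bit agreement in this toy setting.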
Code Quality Checklist
Testing
trac.log
There is one failure caused by running out of time. This job regularly times out and previously passed in a test with the same codebase.
Test Suite Results - lfric_apps - PR383_final2/run1
Suite Information
Task Information
❌ failed tasks - 1
Security Considerations
Performance Impact
AI Assistance and Attribution
Documentation
PSyclone Approval
Sci/Tech Review
(Please alert the code reviewer via a tag when you have approved the SR)
Code Review