Hi folks,
I am a developer on https://github.com/anza-xyz/agave/ and we recently released v2.1. This version upgrades the tar crate from 0.4.41 to 0.4.42.
We've been getting reports of these errors when our snapshot process runs, which brings down the process:
lseek(SEEK_DATA) did not advance. Did the file change while appending?
I see that this was added in #375. I haven't gone through the code enough yet to understand it, but I wanted to open this issue early to see if there are any known issues or debugging help available.
The agave code that is seeing the errors is here:
https://github.com/anza-xyz/agave/blob/085c6055c4ba6b2593927cef4392f32e3b43a888/runtime/src/snapshot_utils.rs#L1044-L1046
This code did not change between v2.0 and v2.1, which is why I was surprised to start hearing these reports. I posted a bit more information over in our discord here too: https://discord.com/channels/428295358100013066/838890116386521088/1352668469015089152
Some additional info that may be useful:
- The files being archived have not changed. Agave uses a strict write-all and write-once approach, and does not append/modify the files being snapshotted.
- These files are written in the same thread that does the archiving, so there aren't concurrency concerns, such as files still being written while archiving happens.
- All of the reports of this error (that I've seen) have been on XFS, except for one that was using tmpfs.
- The errors are non-deterministic and sporadic.
- All the machines I have access to are ext4, and have not had this error.
- I'm going to try to get a machine with XFS to reproduce this.
- I'm going to ask the people who saw this error to turn off sparse files and see if that helps. Unfortunately the absence of an error doesn't prove anything definitively, esp. given the (in)frequency of the error.
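For anyone who wants to compare what the filesystem reports on an affected XFS or tmpfs machine versus an ext4 one, here is a minimal diagnostic sketch that walks a file's data segments with lseek(SEEK_DATA) and lseek(SEEK_HOLE). It is not the tar crate's implementation. The function name `data_segments`, the hand-declared `lseek` binding, and the constant values are assumptions that only hold on 64-bit Linux, and `unsafe extern` needs a reasonably recent Rust toolchain.

```rust
use std::fs::File;
use std::io;
use std::os::raw::c_int;
use std::os::unix::io::AsRawFd;

// Assumption: 64-bit Linux, where off_t is i64 and these values apply.
const SEEK_DATA: c_int = 3;
const SEEK_HOLE: c_int = 4;
const ENXIO: i32 = 6;

unsafe extern "C" {
    // Declared by hand to avoid a dependency on the libc crate.
    fn lseek(fd: c_int, offset: i64, whence: c_int) -> i64;
}

/// Hypothetical helper: returns the (start, end) byte ranges that the
/// filesystem reports as data. Note that lseek moves the file's offset.
pub fn data_segments(file: &File) -> io::Result<Vec<(u64, u64)>> {
    let fd = file.as_raw_fd();
    let len = file.metadata()?.len() as i64;
    let mut segments = Vec::new();
    let mut pos: i64 = 0;

    while pos < len {
        // Find the start of the next data region at or after `pos`.
        let data = unsafe { lseek(fd, pos, SEEK_DATA) };
        if data < 0 {
            let err = io::Error::last_os_error();
            if err.raw_os_error() == Some(ENXIO) {
                break; // Only a trailing hole remains.
            }
            return Err(err);
        }

        // Find where that data region ends (EOF counts as a hole).
        let hole = unsafe { lseek(fd, data, SEEK_HOLE) };
        if hole < 0 {
            return Err(io::Error::last_os_error());
        }
        if hole <= data {
            // Guard against looping forever if the offsets do not advance.
            return Err(io::Error::new(
                io::ErrorKind::Other,
                "SEEK_HOLE did not advance past SEEK_DATA",
            ));
        }

        segments.push((data as u64, hole as u64));
        pos = hole;
    }
    Ok(segments)
}
```

Calling `data_segments(&File::open(path)?)` on a snapshot file right after it is written, and again a little later, would show whether the reported layout changes even though the file contents do not.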
Please do let me know if there's any additional information that would be useful, and thanks in advance for any help/suggestions!