@aawsome Thoughts?
Hi @philipmw, sounds good in general. However, I'd like to question whether a fixed period of, e.g., 5 minutes is a good threshold for a warning.
Posted a PR: rustic-rs/rustic_core#524
I outline two pain points with today’s cold storage restoration, and propose a solution that addresses them both.
background
rustic has support for backups to cold storage. Cold storage requires warmup before retrieving packs. Warmup is specific to each cold storage provider, so rustic delegates warmup to a separate program. rustic determines the list of needed packs, then spawns a user-configured warmup program as a separate unix process, passing it a list of packs to warm up. Rustic blocks on this program, waiting until it succeeds or fails.
problem 1: inaccurate progress indication
While the warmup program is running, rustic blocks on its completion, displaying a progress bar. The progress bar is accurate only when rustic invokes the warmup program serially per pack (using the `%id` argument). When the customer configures rustic to warm up a batch of files at once, it becomes the warmup program’s responsibility to track state. Rustic does not know how far along the warmup program is; all it knows is that it hasn’t exited. The likeliest case is that rustic asks to warm up all packs in one big batch; then, rustic’s progress bar shows 0% the whole time, until suddenly everything’s done. (I am not sure whether rustic ever shows 100%.) In this likeliest case, rustic also cannot estimate time to completion.
I called out this limitation when proposing cold storage batch warmup support: #1430 (“progress bar” section), so now I am here to address it.
problem 2: rustic cannot detect when the warmup process freezes
In the warmup program for S3 Glacier, there is a known issue where the program can hang after an OS suspend and resume. When this happens, rustic is blocked indefinitely. Neither program is aware that they are blocked indefinitely. Rustic continues showing a progress bar with a countup timer. The customer may not realize this problem until up to two full days later, when the warmup SLA is exceeded. This is a terrible customer experience.
I am confident (as the program’s author, hehe) that this issue will get resolved, but this category of bug cannot be eliminated in general. The warmup program may hang, and rustic should protect itself from this.
proposed solution
A protocol that allows—and requires—the warmup program to emit status updates on stdout, consumed by rustic. These updates serve both to indicate progress and as a heartbeat.
The protocol is JSON Lines, with each object having at least a `type` key. I propose only one type for now:
`{ "type": "pack-progress", "warm": integer }`
`warm` is a number that indicates how many packs the warmup program expects to be warm now, regardless of whether this process was the one that warmed a pack or the pack was already warm. Dividing that by the total number of packs requested gives the fraction of completion. Rustic uses that for the progress bar during warmup.
The `pack-progress` type is also defined to serve as the heartbeat indicator. Rustic remembers when the last such message was received. Rustic can display a warning or abandon the warmup when too much time elapses since the last `pack-progress` message.
This implies that the warmup program will emit multiple `pack-progress` events with the same value for `warm`, not just when the value changes. The `warm` value can also decrease, such as if it takes longer to warm up the whole batch than the requested lifetime of the earliest pack. Then the progress bar would move backward.
backward compatibility
Today the stdout of warmup programs is unspecified. To be backward compatible, rustic will try to parse each line of warmup stdout as a JSON Lines message. If a line does not parse, rustic will ignore it.
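A minimal sketch of that tolerant parsing rule, written in Python purely for illustration (it is not rustic's implementation). The function names `parse_pack_progress` and `completion_fraction` are hypothetical, and the choice to reject negative or non-integer `warm` values is an assumption that the proposal does not spell out.

```python
import json
from typing import Optional


def parse_pack_progress(line: str) -> Optional[int]:
    """Return the `warm` count from a pack-progress line, or None to ignore the line."""
    try:
        msg = json.loads(line)
    except ValueError:
        # Legacy free-form output is not JSON, so it is skipped.
        return None
    if not isinstance(msg, dict) or msg.get("type") != "pack-progress":
        # Valid JSON, but not a message type this sketch handles.
        return None
    warm = msg.get("warm")
    # Assumption: `warm` must be a non-negative integer.
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(warm, bool) or not isinstance(warm, int) or warm < 0:
        return None
    return warm


def completion_fraction(warm: int, total_packs: int) -> float:
    """Fraction of requested packs that are warm, clamped to at most 1.0."""
    if total_packs <= 0:
        # Assumption: nothing requested means nothing to wait for.
        return 1.0
    return min(warm / total_packs, 1.0)
```

Because unparseable lines simply yield `None`, a warmup program that predates the protocol keeps working; it just never moves the progress bar.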
If rustic does not receive a heartbeat for 5+ minutes, it will only print a warning. In the future, once this protocol is entrenched, rustic can abort the warmup instead of merely warning.
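The heartbeat bookkeeping could look roughly like the following Python sketch (illustrative only, not rustic's code). The class name `HeartbeatMonitor` and its methods are hypothetical. Starting the timer when the monitor is created, and making the threshold a parameter rather than a fixed 5 minutes, are both assumptions.

```python
import time
from typing import Callable

DEFAULT_WARN_AFTER_SECONDS = 5 * 60  # the 5-minute threshold from the proposal


class HeartbeatMonitor:
    """Tracks how long ago the last pack-progress message arrived."""

    def __init__(
        self,
        warn_after_seconds: float = DEFAULT_WARN_AFTER_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._warn_after = warn_after_seconds
        self._clock = clock
        # Assumption: the timer starts at creation (roughly spawn time),
        # so a program that never emits anything is also noticed.
        self._last_seen = clock()

    def record_heartbeat(self) -> None:
        """Call once for every pack-progress message received."""
        self._last_seen = self._clock()

    def seconds_silent(self) -> float:
        """Seconds elapsed since the last heartbeat (or since creation)."""
        return self._clock() - self._last_seen

    def is_overdue(self) -> bool:
        """True once the silence has reached the warning threshold."""
        return self.seconds_silent() >= self._warn_after
```

The caller would invoke `record_heartbeat()` for each parsed `pack-progress` message and poll `is_overdue()` periodically to decide whether to print the warning. A monotonic clock is used so that wall-clock adjustments do not trigger false warnings.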
work proposed
In rustic:
1. read the stdout of whatever warmup-program it invokes;
2. parse it as JSON Lines;
3. handle the `pack-progress` type to update its own progress bar and ETA;
4. keep track of time elapsed since the last status update, and print a warning if no status has been received in 5+ minutes;
5. ignore other types, or lines that don’t parse as JSON.
In the warmup program: emit the `pack-progress` type to stdout.
appendix
I propose using `pack-progress` as the heartbeat, rather than a dedicated heartbeat message type, for both an architectural and a pragmatic reason.
Architecturally, I want rustic to be the sole decider of what constitutes aliveness of the warmup program. By creating a dedicated heartbeat type, I think I would be conceding power to the warmup program to claim a heartbeat even if it’s unable to make pack progress.
Pragmatically, the S3 Glacier warmup application is a long-lived loop on a single thread that polls a queue, with a 20-second cycle time. Every 20 seconds, the warmup program updates its own state of what packs are warm. At that point, the warmup program can naturally emit a `pack-progress` message. Having a separate heartbeat message would mean that the warmup program needs to either always emit both messages, or keep track of state changes to know when is the right time to emit a `pack-progress` versus just a heartbeat. This adds complexity over just using `pack-progress` as the heartbeat.
In the future, if we add other message types, rustic will be free to update its own definition of heartbeat to use a mix of messages, or to switch to the other message type(s) entirely.
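Returning to the pragmatic point, here is a rough Python sketch of how a polling loop could emit one `pack-progress` line per cycle. It is illustrative only and is not the actual S3 Glacier warmup program. The names `emit_pack_progress`, `run_cycles`, and `count_warm_packs` are hypothetical, and the injectable `sleep` exists only so the loop can be exercised without waiting.

```python
import json
import sys
import time
from typing import Callable, TextIO


def emit_pack_progress(warm: int, out: TextIO = sys.stdout) -> None:
    """Write one pack-progress message as a single JSON line and flush it."""
    out.write(json.dumps({"type": "pack-progress", "warm": warm}) + "\n")
    # Flush so the line is not held back in a buffer when stdout is a pipe.
    out.flush()


def run_cycles(
    count_warm_packs: Callable[[], int],
    cycles: int,
    sleep: Callable[[float], None] = time.sleep,
    out: TextIO = sys.stdout,
    cycle_seconds: float = 20.0,
) -> None:
    """Emit the current warm count once per cycle, whether or not it changed."""
    for _ in range(cycles):
        emit_pack_progress(count_warm_packs(), out)
        sleep(cycle_seconds)
```

Emitting on every cycle, even when `warm` is unchanged, is what lets the same message double as the heartbeat without any extra state tracking.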