feat(ntx-builder): deactivate accounts which crash repeatedly#1712
Conversation
| tracing::warn!( | ||
| %account_id, | ||
| crash_count = count, | ||
| "Account blacklisted due to repeated crashes, skipping actor spawn" |
There was a problem hiding this comment.
Q: should we make this more explicit? tracing::error!, since it signifies a bug in our impl or add "BUG" deliberately to the message?
There was a problem hiding this comment.
We should already be logging the crash itself, so this is probably okay at warn since it will repeat often for each account.
| return; | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
Q: should we first check for running actors and terminate them?
| assert_eq!(inactive_targets[0], inactive_id); | ||
| } | ||
|
|
||
| // BLACKLIST TESTS |
There was a problem hiding this comment.
nit: we might want to find a different term to "blacklist", i.e. repeated_failure_block_list to avoid any confusion / fud
There was a problem hiding this comment.
I'd prefer something much shorter -- we're already overly verbose and stuttery throughout the codebase imo. Maybe deactivated_accounts
Mirko-von-Leipzig
left a comment
There was a problem hiding this comment.
Just naming nits :) I guess blacklist was the wrong term to use -- let's try deactivated instead 🙃
| /// Maximum number of crashes before an account actor is blacklisted. | ||
| /// | ||
| /// Once an actor for a given account exceeds this crash count, no new actor will be | ||
| /// spawned for that account. |
There was a problem hiding this comment.
| /// Maximum number of crashes before an account actor is blacklisted. | |
| /// | |
| /// Once an actor for a given account exceeds this crash count, no new actor will be | |
| /// spawned for that account. | |
| /// Maximum number of crashes before an account deactivated. | |
| /// | |
| /// Once this limit is reached, no new transactions will be created for this account. |
| #[arg(long = "ntx-builder.max-actor-crashes", default_value_t = 10, value_name = "NUM")] | ||
| pub max_actor_crashes: usize, |
There was a problem hiding this comment.
I would avoid mentioning actor in the public docs -- its more an implementation detail.
| #[arg(long = "ntx-builder.max-actor-crashes", default_value_t = 10, value_name = "NUM")] | |
| pub max_actor_crashes: usize, | |
| #[arg(long = "ntx-builder.max-account-crashes", default_value_t = 10, value_name = "NUM")] | |
| pub max_account_crashes: usize, |
| tracing::warn!( | ||
| %account_id, | ||
| crash_count = count, | ||
| "Account blacklisted due to repeated crashes, skipping actor spawn" |
There was a problem hiding this comment.
We should already be logging the crash itself, so this is probably okay at warn since it will repeat often for each account.
| if let Some(&count) = self.crash_counts.get(&account_id) { | ||
| if count >= self.max_actor_crashes { | ||
| tracing::warn!( | ||
| %account_id, |
There was a problem hiding this comment.
| %account_id, | |
| account.id = %account_id, |
| - Replaced NTX Builder's in-memory state management with SQLite-backed persistence; account states, notes, and transaction effects are now stored in the database and inflight state is purged on startup ([#1662](https://github.com/0xMiden/miden-node/pull/1662)). | ||
| - [BREAKING] Reworked `miden-remote-prover`, removing the `worker`/`proxy` distinction and simplifying to a `worker` with a request queue ([#1688](https://github.com/0xMiden/miden-node/pull/1688)). | ||
| - NTX Builder actors now deactivate after being idle for a configurable idle timeout (`--ntx-builder.idle-timeout`, default 5 min) and are re-activated when new notes target their account ([#1705](https://github.com/0xMiden/miden-node/pull/1705)). | ||
| - NTX Builder now blacklists network accounts whose actors crash repeatedly (configurable via `--ntx-builder.max-actor-crashes`, default 10) ([#1712](https://github.com/0xMiden/miden-node/pull/1712)). |
There was a problem hiding this comment.
| - NTX Builder now blacklists network accounts whose actors crash repeatedly (configurable via `--ntx-builder.max-actor-crashes`, default 10) ([#1712](https://github.com/0xMiden/miden-node/pull/1712)). | |
| - NTX Builder now deactivates network accounts which crash repeatedly (configurable via `--ntx-builder.max-actor-crashes`, default 10) ([#1712](https://github.com/0xMiden/miden-node/pull/1712)). |
| assert_eq!(inactive_targets[0], inactive_id); | ||
| } | ||
|
|
||
| // BLACKLIST TESTS |
There was a problem hiding this comment.
I'd prefer something much shorter -- we're already overly verbose and stuttery throughout the codebase imo. Maybe deactivated_accounts
| /// | ||
| /// The URLs are fake and actors spawned with this context will fail on their first gRPC call, | ||
| /// but this is sufficient for testing coordinator logic (registry, blacklisting, etc.). | ||
| fn test_actor_context(db: &Db) -> AccountActorContext { |
There was a problem hiding this comment.
Maybe have this be a method on AccountActorContext::test?
| async fn test_coordinator() -> (Coordinator, tempfile::TempDir) { | ||
| let (db, dir) = Db::test_setup().await; | ||
| (Coordinator::new(4, db), dir) | ||
| (Coordinator::new(4, 10, db), dir) |
There was a problem hiding this comment.
This could also be a method on Coordinator
| let count = self.crash_counts.entry(account_id).or_insert(0); | ||
| *count += 1; | ||
| tracing::error!( | ||
| %account_id, |
There was a problem hiding this comment.
| %account_id, | |
| account.id = %account_id, |
Implements account blacklisting for the NTX Builder (4th task from #1694).
Track crash counts per account in the coordinator. When an actor shuts down due to a
DbError, its crash count is incremented. Once it reaches a configurable threshold, the account is blacklisted andspawn_actorskips it.Only DbError shutdowns count as crashes because other shutdown reasons (
Cancelled,IdleTimeout,SemaphoreFailed) are either intentional or system-wide and not indicative of a per-account bug.