Summary
Instance managers can fail to complete deletion operations due to a race condition between the provider worker and instance manager goroutines. This leaves instances permanently stuck in pending_delete state, requiring manual intervention (pod restart) to recover.
Note: this bug report was drafted with Augment; however, I can vouch for it, having personally combed through the code and confirmed that this is a very likely explanation for the leaked runners in pending_delete that we see build up over time.
Affected Versions
Likely affects all versions after the sync.Map refactoring (commits 928bcae and b959c7f, March 20, 2026)
Note: All line numbers in this report reference origin/main (commit fe10bdb as of April 2026)
Impact
- Runner instances accumulate in pending_delete state over time
- Instances remain stuck even though the VMs were successfully deleted in the underlying provider
- Growth rate: 2-3 stuck instances per day (out of roughly 60 runners typically active at any given time)
- A pod restart is required to trigger cleanup
Root Cause
The issue stems from incomplete error handling introduced by the sync.Map refactoring. While an instance manager is processing a deletion, the provider worker can remove it from the p.runners map mid-flight, causing the manager's subsequent SetInstanceStatus() calls to fail.
Detailed Sequence
- The instance manager processes pending_delete (instance_manager.go:328-352)
- It sets the status to deleting (line 329): SetInstanceStatus(..., InstanceDeleting, ...)
- It successfully deletes the VM from the provider (line 336): handleDeleteInstanceInProvider()
- Meanwhile, the database watcher sends an update event with status deleted
- The provider worker receives the event (provider.go:268-320)
- The worker calls stopAndDeleteInstance() (line 307), which:
  - stops the instance manager (line 262)
  - removes it from p.runners (line 264): p.runners.Delete(instance.Name)
- The instance manager tries to finalize (instance_manager.go:347):
  - it calls SetInstanceStatus(..., InstanceDeleted, ...)
  - this fails with ErrNotFound because the manager was removed from the map
  - the error is ignored (line 348) and the manager exits
- The database still shows deleting, not deleted
- No manager exists to retry, so the instance is orphaned
Alternative Failure Path
If deletion fails and the manager tries to revert to pending_delete (line 340):
if err := i.helper.SetInstanceStatus(i.instance.Name, commonParams.InstancePendingDelete, []byte(err.Error()), true); err != nil {
return fmt.Errorf("setting instance status to error: %w", err)
}
This call can also fail with ErrNotFound, and that case is not handled: the function simply returns the error and the manager exits, leaving the instance stuck.
Code Locations
provider_helper.go (lines 59-81)
func (p *Provider) SetInstanceStatus(instanceName string, status commonParams.InstanceStatus, providerFault []byte, force bool) error {
if _, ok := p.runners.Load(instanceName); !ok { // ← Fails if manager was removed
return errors.ErrNotFound
}
// ...
}
provider.go (lines 252-266)
func (p *Provider) stopAndDeleteInstance(instance params.Instance) error {
if instance.Status != commonParams.InstanceDeleted {
return nil
}
val, ok := p.runners.Load(instance.Name)
if !ok {
return nil
}
existingInstance := val.(*instanceManager)
if err := existingInstance.Stop(); err != nil {
return fmt.Errorf("failed to stop instance manager: %w", err)
}
p.runners.Delete(instance.Name) // ← Removes manager from map
return nil
}
instance_manager.go (lines 328-352)
case commonParams.InstancePendingDelete, commonParams.InstancePendingForceDelete:
prevStatus := i.instance.Status
if err := i.helper.SetInstanceStatus(i.instance.Name, commonParams.InstanceDeleting, nil, true); err != nil {
if errors.Is(err, runnerErrors.ErrNotFound) {
return nil // Manager was removed - silently exit
}
return fmt.Errorf("setting instance status to deleting: %w", err)
}
if err := i.handleDeleteInstanceInProvider(i.instance); err != nil {
slog.ErrorContext(i.ctx, "deleting instance in provider", "error", err)
if prevStatus == commonParams.InstancePendingDelete {
i.incrementBackOff()
// ↓ This can fail with ErrNotFound, no handling!
if err := i.helper.SetInstanceStatus(i.instance.Name, commonParams.InstancePendingDelete, []byte(err.Error()), true); err != nil {
return fmt.Errorf("setting instance status to error: %w", err)
}
return fmt.Errorf("error removing instance. Will retry: %w", err)
}
}
// ↓ This can fail with ErrNotFound, error is ignored
if err := i.helper.SetInstanceStatus(i.instance.Name, commonParams.InstanceDeleted, nil, false); err != nil {
if !errors.Is(err, runnerErrors.ErrNotFound) {
return fmt.Errorf("setting instance status to deleted: %w", err)
}
}
return ErrInstanceDeleted
Why Pod Restart Fixes It
When GARM restarts:
- loadAllRunners() (provider.go:85-137) is called
- It loads all instances from the database, skipping those in deleting or deleted state
- It creates fresh instance managers for instances in pending_delete
- These new managers successfully complete the deletion
Proposed Fixes
Option 1: Don't Remove Manager Until Database Confirms Deletion
Modify stopAndDeleteInstance() to only stop the manager, not remove it:
func (p *Provider) stopAndDeleteInstance(instance params.Instance) error {
if instance.Status != commonParams.InstanceDeleted {
return nil
}
val, ok := p.runners.Load(instance.Name)
if ok {
existingInstance := val.(*instanceManager)
if err := existingInstance.Stop(); err != nil {
return fmt.Errorf("failed to stop instance manager: %w", err)
}
// Don't delete from map yet - let it happen naturally when manager exits
}
return nil
}
Option 2: Use Database Directly for Status Updates During Deletion
Modify the instance manager to bypass the helper and write directly to the database during the critical deletion path:
if err := i.store.UpdateInstance(i.ctx, i.instance.Name, params.UpdateInstanceParams{
Status: commonParams.InstanceDeleted,
}); err != nil {
return fmt.Errorf("setting instance status to deleted: %w", err)
}
Option 3: Implement Proper Retry Logic
Add retry logic when SetInstanceStatus() fails with ErrNotFound:
const maxRetries = 3
for retry := 0; retry < maxRetries; retry++ {
if err := i.helper.SetInstanceStatus(...); err != nil {
if errors.Is(err, runnerErrors.ErrNotFound) {
time.Sleep(100 * time.Millisecond)
continue
}
return err
}
break
}
Reproduction
This is a timing-dependent race condition that occurs more frequently under:
- High instance churn (many creates/deletes)
- Database watcher latency causing events to arrive out of order
- Provider operations completing very quickly
Workaround
If instances get stuck in pending_delete:
- Restart the GARM pod to reload instance managers
- Fresh managers will complete the deletions