admin: fix capacity leak in maintenance system by preserving Task IDs (#8214)

* admin: fix capacity leak in maintenance system by preserving Task IDs

Preserve the original TaskID generated during detection and sync task
states (Assign/Complete/Retry) with ActiveTopology. This ensures that
capacity reserved during task assignment is properly released when a
task completes or fails, preventing 'need 9, have 0' capacity exhaustion.

Fixes https://github.com/seaweedfs/seaweedfs/issues/8202

* Update weed/admin/maintenance/maintenance_queue.go

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Update weed/admin/maintenance/maintenance_queue.go

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* test: rename ActiveTopologySync to TaskIDPreservation

Rename the test case to more accurately reflect its scope, as suggested
by the code review bot.

* Add TestMaintenanceQueue_ActiveTopologySync to verify task state synchronization and capacity management

* Implement task assignment rollback and add verification test

* Enhance ActiveTopology.CompleteTask to support pending tasks

* Populate storage impact in MaintenanceIntegration.SyncTask

* Release capacity in RemoveStaleWorkers when worker becomes unavailable

* Release capacity in MaintenanceManager.CancelTask when pending task is cancelled

* Sync reloaded tasks with ActiveTopology in LoadTasksFromPersistence

* Add verification tests for consistent capacity management lifecycle

* Add TestMaintenanceQueue_RetryCapacitySync to verify capacity tracking during retries

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
This commit is contained in:
Chris Lu
2026-02-04 20:39:34 -08:00
committed by GitHub
parent 2ecbae3611
commit 19c18d827a
6 changed files with 696 additions and 7 deletions

View File

@@ -603,6 +603,13 @@ func (mm *MaintenanceManager) CancelTask(taskID string) error {
}
}
// Notify ActiveTopology to release capacity
if mm.scanner != nil && mm.scanner.integration != nil {
if at := mm.scanner.integration.GetActiveTopology(); at != nil {
_ = at.CompleteTask(taskID)
}
}
glog.V(2).Infof("Cancelled task %s", taskID)
return nil
}