* fix: maintenance task topology lookup, retry, and stale task cleanup
1. Strip gRPC port from ServerAddress in SyncTask using ToHttpAddress()
so task targets match topology disk keys (NodeId format).
2. Skip capacity check when topology has no disks yet (startup race
where tasks are loaded from persistence before first topology update).
3. Don't retry permanent errors like "volume not found" - these will
never succeed on retry.
4. Cancel all pending tasks for each task type before re-detection,
ensuring stale proposals from previous cycles are cleaned up.
This prevents stale tasks from blocking new detection and from
repeatedly failing.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* logs
Co-Authored-By: Copilot <223556219+Copilot@users.noreply.github.com>
* reduce lock scope
Co-Authored-By: Copilot <223556219+Copilot@users.noreply.github.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* admin: fix capacity leak in maintenance system by preserving Task IDs
Preserve the original TaskID generated during detection and sync task
states (Assign/Complete/Retry) with ActiveTopology. This ensures that
capacity reserved during task assignment is properly released when a
task completes or fails, preventing 'need 9, have 0' capacity exhaustion.
Fixes https://github.com/seaweedfs/seaweedfs/issues/8202
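
To illustrate why the Task ID must be preserved, here is a simplified sketch of capacity bookkeeping keyed by task ID. `capacityLedger` and its methods are hypothetical stand-ins, not the ActiveTopology API; the point is only that a release under a different ID frees nothing.

```go
package main

import "fmt"

// capacityLedger is a hypothetical stand-in for per-disk capacity tracking.
type capacityLedger struct {
	free     int
	reserved map[string]int // task ID -> reserved slots
}

func newLedger(free int) *capacityLedger {
	return &capacityLedger{free: free, reserved: make(map[string]int)}
}

// Assign reserves slots under taskID, or fails if not enough are free.
func (l *capacityLedger) Assign(taskID string, slots int) error {
	if slots > l.free {
		return fmt.Errorf("need %d, have %d", slots, l.free)
	}
	l.free -= slots
	l.reserved[taskID] += slots
	return nil
}

// Complete releases whatever was reserved under taskID.
// An unknown ID releases nothing, which is how a regenerated ID leaks capacity.
func (l *capacityLedger) Complete(taskID string) {
	l.free += l.reserved[taskID]
	delete(l.reserved, taskID)
}

func main() {
	l := newLedger(9)
	_ = l.Assign("detected-id", 9)
	l.Complete("regenerated-id")          // wrong ID: capacity stays reserved
	fmt.Println(l.Assign("next-task", 9)) // fails: capacity leaked
	l.Complete("detected-id")             // original ID: capacity released
	fmt.Println(l.Assign("next-task", 9)) // succeeds
}
```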
* Update weed/admin/maintenance/maintenance_queue.go
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Update weed/admin/maintenance/maintenance_queue.go
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* test: rename ActiveTopologySync to TaskIDPreservation
Rename the test case to more accurately reflect its scope, as suggested
by the code review bot.
* Add TestMaintenanceQueue_ActiveTopologySync to verify task state synchronization and capacity management
* Implement task assignment rollback and add verification test
* Enhance ActiveTopology.CompleteTask to support pending tasks
* Populate storage impact in MaintenanceIntegration.SyncTask
* Release capacity in RemoveStaleWorkers when worker becomes unavailable
* Release capacity in MaintenanceManager.CancelTask when pending task is cancelled
* Sync reloaded tasks with ActiveTopology in LoadTasksFromPersistence
* Add verification tests for consistent capacity management lifecycle
* Add TestMaintenanceQueue_RetryCapacitySync to verify capacity tracking during retries
---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Fix nil pointer panic in maintenance worker when receiving empty task assignment
When a worker requests a task and none are available, the admin server
sends an empty TaskAssignment message. The worker was attempting to log
the task details without checking if the TaskId was empty, causing a
nil pointer dereference when accessing taskAssign.Params.VolumeId.
This fix adds a check for empty TaskId before processing the assignment,
preventing worker crashes and improving stability in production environments.
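
The guard can be sketched as follows. `TaskAssignment` and `TaskParams` here are simplified stand-ins for the generated protobuf messages, and `describeAssignment` is a hypothetical helper; the extra nil check on `Params` is an assumption added for safety.

```go
package main

import "fmt"

// TaskParams and TaskAssignment are simplified stand-ins for the
// generated worker protobuf messages.
type TaskParams struct {
	VolumeId uint32
}

type TaskAssignment struct {
	TaskId string
	Params *TaskParams
}

// describeAssignment returns a log line and whether there is a task to run.
// It checks for an empty TaskId before touching Params.
func describeAssignment(a *TaskAssignment) (string, bool) {
	if a == nil || a.TaskId == "" {
		return "no task available", false
	}
	if a.Params == nil {
		return fmt.Sprintf("task %s has no params", a.TaskId), false
	}
	return fmt.Sprintf("task %s for volume %d", a.TaskId, a.Params.VolumeId), true
}

func main() {
	msg, ok := describeAssignment(&TaskAssignment{}) // empty assignment from admin
	fmt.Println(msg, ok)
	msg, ok = describeAssignment(&TaskAssignment{TaskId: "t1", Params: &TaskParams{VolumeId: 5}})
	fmt.Println(msg, ok)
}
```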
* Add EC integration test for admin-worker maintenance system
Adds comprehensive integration test that verifies the end-to-end flow
of erasure coding maintenance tasks:
- Admin server detects volumes needing EC encoding
- Workers register and receive task assignments
- EC encoding is executed and verified in master topology
- File read-back validation confirms data integrity
The test uses unique absolute working directories for each worker to
prevent ID conflicts and ensure stable worker registration. Includes
proper cleanup and process management for reliable test execution.
* Improve maintenance system stability and task deduplication
- Add cross-type task deduplication to prevent concurrent maintenance
operations on the same volume (EC, balance, vacuum)
- Implement HasAnyTask check in ActiveTopology for better coordination
- Increase RequestTask timeout from 5s to 30s to prevent unnecessary
worker reconnections
- Add TaskTypeNone sentinel for generic task checks
- Update all task detectors to use HasAnyTask for conflict prevention
- Improve config persistence and schema handling
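
A rough sketch of the cross-type check, assuming tasks are indexed by volume ID. The `taskIndex` type, the constant values, and the method signatures are assumptions for illustration; the actual HasAnyTask in ActiveTopology may differ.

```go
package main

import "fmt"

// TaskType values below are assumed for illustration.
type TaskType string

const (
	TaskTypeNone          TaskType = "" // sentinel: matches any task type
	TaskTypeErasureCoding TaskType = "erasure_coding"
	TaskTypeBalance       TaskType = "balance"
	TaskTypeVacuum        TaskType = "vacuum"
)

// taskIndex is a hypothetical index of pending or active tasks per volume.
type taskIndex struct {
	byVolume map[uint32][]TaskType
}

func newTaskIndex() *taskIndex {
	return &taskIndex{byVolume: make(map[uint32][]TaskType)}
}

func (t *taskIndex) Add(volumeID uint32, tt TaskType) {
	t.byVolume[volumeID] = append(t.byVolume[volumeID], tt)
}

// HasTask reports whether the volume has a task of the given type.
// Passing TaskTypeNone matches a task of any type.
func (t *taskIndex) HasTask(volumeID uint32, tt TaskType) bool {
	for _, existing := range t.byVolume[volumeID] {
		if tt == TaskTypeNone || existing == tt {
			return true
		}
	}
	return false
}

// HasAnyTask is the cross-type check a detector runs before proposing work.
func (t *taskIndex) HasAnyTask(volumeID uint32) bool {
	return t.HasTask(volumeID, TaskTypeNone)
}

func main() {
	idx := newTaskIndex()
	idx.Add(42, TaskTypeBalance)
	// An EC detector should skip volume 42 while a balance task exists.
	fmt.Println(idx.HasAnyTask(42), idx.HasTask(42, TaskTypeErasureCoding))
}
```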
* Add GitHub Actions workflow for EC integration tests
Adds CI workflow that runs EC integration tests on push and pull requests
to master branch. The workflow:
- Triggers on changes to admin, worker, or test files
- Builds the weed binary
- Runs the EC integration test suite
- Uploads test logs as artifacts on failure for debugging
This ensures the maintenance system remains stable and worker-admin
integration is validated in CI.
* go version 1.24
* address comments
* Update maintenance_integration.go
* support seconds
* prioritize ec over balancing in tests
* refactoring
* add ec shard size
* address comments
* passing task id
There seems to be a disconnect between the pending tasks created in ActiveTopology and the TaskDetectionResult returned by this function. A taskID is generated locally and used to create pending tasks via AddPendingECShardTask, but this taskID is not stored in the TaskDetectionResult or passed along in any way.
This makes it impossible for the worker that eventually executes the task to know which pending task in ActiveTopology it corresponds to. Without the correct taskID, the worker cannot call AssignTask or CompleteTask on the master, breaking the entire task lifecycle and capacity management feature.
A potential solution is to add a TaskID field to TaskDetectionResult and worker_pb.TaskParams, ensuring the ID is propagated from detection to execution.
* 1 source multiple destinations
* task supports multi source and destination
* ec needs to clean up previous shards
* use erasure coding constants
* getPlanningCapacityUnsafe and getEffectiveAvailableCapacityUnsafe should return StorageSlotChange for calculation
* use CanAccommodate to calculate
* remove dead code
* address comments
* fix mutex copying in protobuf structs
* use constants
* fix estimatedSize
The calculation for estimatedSize only considers source.EstimatedSize and dest.StorageChange, but omits dest.EstimatedSize. The TaskDestination struct has an EstimatedSize field, which seems to be ignored here. This could lead to an incorrect estimation of the total size of data involved in tasks on a disk. The loop should probably also include estimatedSize += dest.EstimatedSize.
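
The suggested loop might look like the sketch below. The struct definitions are reduced to the one field discussed here, `totalEstimatedSize` is a hypothetical helper name, and the dest.StorageChange contribution is left out because its shape is not described in this message.

```go
package main

import "fmt"

// TaskSource and TaskDestination are reduced to the field under discussion.
type TaskSource struct {
	EstimatedSize int64
}

type TaskDestination struct {
	EstimatedSize int64
}

// totalEstimatedSize sums the estimated size over sources and destinations.
// The dest.StorageChange term from the original calculation is omitted here.
func totalEstimatedSize(sources []TaskSource, dests []TaskDestination) int64 {
	var estimatedSize int64
	for _, source := range sources {
		estimatedSize += source.EstimatedSize
	}
	for _, dest := range dests {
		estimatedSize += dest.EstimatedSize // the term the review says was missing
	}
	return estimatedSize
}

func main() {
	sources := []TaskSource{{EstimatedSize: 100}}
	dests := []TaskDestination{{EstimatedSize: 30}, {EstimatedSize: 20}}
	fmt.Println(totalEstimatedSize(sources, dests))
}
```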
* at.assignTaskToDisk(task)
* refactoring
* Update weed/admin/topology/internal.go
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* fail fast
* fix compilation
* Update weed/worker/tasks/erasure_coding/detection.go
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* indexes for volume and shard locations
* dedup with ToVolumeSlots
* return an additional boolean to indicate success, or an error
* Update abstract_sql_store.go
* fix
* Update weed/worker/tasks/erasure_coding/detection.go
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Update weed/admin/topology/task_management.go
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* faster findVolumeDisk
* Update weed/worker/tasks/erasure_coding/detection.go
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update weed/admin/topology/storage_slot_test.go
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* refactor
* simplify
* remove unused GetDiskStorageImpact function
* refactor
* add comments
* Update weed/admin/topology/storage_impact.go
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Update weed/admin/topology/storage_slot_test.go
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Update storage_impact.go
* AddPendingTask
AddPendingTask is now the single entry point for task creation. It consolidates the previously separate functions without changing behavior.
---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>