Files
seaweedFS/weed/worker/tasks/balance/detection.go
Chris Lu 5f85bf5e8a Batch volume balance: run multiple moves per job (#8561)
* proto: add BalanceMoveSpec and batch fields to BalanceTaskParams

Add BalanceMoveSpec message for encoding individual volume moves,
and max_concurrent_moves + repeated moves fields to BalanceTaskParams
to support batching multiple volume moves in a single job.

* balance handler: add batch execution with concurrent volume moves

Refactor Execute() into executeSingleMove() (backward compatible) and
executeBatchMoves() which runs multiple volume moves concurrently using
a semaphore-bounded goroutine pool. When BalanceTaskParams.Moves is
populated, the batch path is taken; otherwise the single-move path.

Includes aggregate progress reporting across concurrent moves,
per-move error collection, and partial failure support.

* balance handler: add batch config fields to Descriptor and worker config

Add max_concurrent_moves and batch_size fields to the worker config
form and deriveBalanceWorkerConfig(). These control how many volume
moves run concurrently within a batch job and the maximum batch size.

* balance handler: group detection proposals into batch jobs

When batch_size > 1, the Detect method groups detection results into
batch proposals where each proposal encodes multiple BalanceMoveSpec
entries in BalanceTaskParams.Moves. Single-result batches fall back
to the existing single-move proposal format for backward compatibility.

* admin UI: add volume balance execution plan and batch badge

Add renderBalanceExecutionPlan() for rich rendering of volume balance
jobs in the job detail modal. Single-move jobs show source/target/volume
info; batch jobs show a moves table with all volume moves.

Add batch badge (e.g., "5 moves") next to job type in the execution
jobs table when the job has batch=true label.

* Update plugin_templ.go

* fix: detection algorithm uses greedy target instead of divergent topology scores

The detection loop tracked effective volume counts via an adjustments map,
but createBalanceTask independently called planBalanceDestination which used
the topology's LoadCount — a separate, unadjusted source of truth. This
divergence caused multiple moves to pile onto the same server.

Changes:
- Add resolveBalanceDestination to resolve the detection loop's greedy
  target (minServer) rather than independently picking a destination
- Add oscillation guard: stop when max-min <= 1 since no single move
  can improve the balance beyond that point
- Track unseeded destinations: if a target server wasn't in the initial
  serverVolumeCounts, add it so subsequent iterations include it
- Add TestDetection_UnseededDestinationDoesNotOverload

* fix: handler force_move propagation, partial failure, deterministic dedupe

- Propagate ForceMove from outer BalanceTaskParams to individual move
  TaskParams so batch moves respect the force_move flag
- Fix partial failure: mark job successful if at least one move
  succeeded (succeeded > 0 || failed == 0) to avoid re-running
  already-completed moves on retry
- Use SHA-256 hash for deterministic dedupe key fallback instead of
  time.Now().UnixNano() which is non-deterministic
- Remove unused successDetails variable
- Extract maxProposalStringLength constant to replace magic number 200
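The deterministic-key idea can be sketched like this (a hypothetical helper; the real code hashes the proposal's fields, which are elided here):

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"strings"
)

// dedupeKey derives a deterministic fallback key from the batch's move
// specs: re-detecting the same batch yields the same key, unlike a
// time.Now().UnixNano() suffix which differs on every run.
func dedupeKey(moves []string) string {
	sum := sha256.Sum256([]byte(strings.Join(moves, "|")))
	return fmt.Sprintf("balance_batch_%x", sum[:8])
}

func main() {
	k1 := dedupeKey([]string{"42:srcA->dstB", "43:srcA->dstC"})
	k2 := dedupeKey([]string{"42:srcA->dstB", "43:srcA->dstC"})
	fmt.Println(k1 == k2)
}
```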

* admin UI: use template literals in balance execution plan rendering

* fix: integration test handles batch proposals from batched detection

With batch_size=20, all moves are grouped into a single proposal
containing BalanceParams.Moves instead of top-level Sources/Targets.
Update assertions to handle both batch and single-move proposal formats.

* fix: verify volume size on target before deleting source during balance

Add a pre-delete safety check that reads the volume file status on both
source and target, then compares .dat file size and file count. If they
don't match, the move is aborted — leaving the source intact rather than
risking irreversible data loss.

Also removes the redundant mountVolume call since VolumeCopy already
mounts the volume on the target server.
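The comparison logic amounts to the following (an illustrative struct and helper; the real fields come from the ReadVolumeFileStatus RPC response):

```go
package main

import "fmt"

// volumeFileStatus mirrors the fields compared by the pre-delete check.
type volumeFileStatus struct {
	DatFileSize uint64
	FileCount   uint64
}

// safeToDeleteSource returns an error unless the target's .dat size and
// file count match the source exactly, so the caller leaves the source
// intact instead of risking irreversible data loss.
func safeToDeleteSource(src, dst volumeFileStatus) error {
	if src.DatFileSize != dst.DatFileSize {
		return fmt.Errorf("dat size mismatch: source=%d target=%d", src.DatFileSize, dst.DatFileSize)
	}
	if src.FileCount != dst.FileCount {
		return fmt.Errorf("file count mismatch: source=%d target=%d", src.FileCount, dst.FileCount)
	}
	return nil
}

func main() {
	src := volumeFileStatus{DatFileSize: 1 << 20, FileCount: 128}
	fmt.Println(safeToDeleteSource(src, src) == nil)
	fmt.Println(safeToDeleteSource(src, volumeFileStatus{DatFileSize: 1 << 19, FileCount: 128}) != nil)
}
```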

* fix: clamp maxConcurrent, serialize progress sends, validate config as int64

- Clamp maxConcurrentMoves to defaultMaxConcurrentMoves before creating
  the semaphore so a stale or malicious job cannot request unbounded
  concurrent volume moves
- Extend progressMu to cover sender.SendProgress calls since the
  underlying gRPC stream is not safe for concurrent writes
- Perform bounds checks on max_concurrent_moves and batch_size in int64
  space before casting to int, avoiding potential overflow on 32-bit

* fix: check disk capacity in resolveBalanceDestination

Skip disks where VolumeCount >= MaxVolumeCount so the detection loop
does not propose moves to a full disk that would fail at execution time.

* test: rename unseeded destination test to match actual behavior

The test exercises a server with 0 volumes that IS seeded from topology
(matching disk type), not an unseeded destination. Rename to
TestDetection_ZeroVolumeServerIncludedInBalance and fix comments.

* test: tighten integration test to assert exactly one batch proposal

With default batch_size=20, all moves should be grouped into a single
batch proposal. Assert len(proposals)==1 and require BalanceParams with
Moves, removing the legacy single-move else branch.

* fix: propagate ctx to RPCs and restore source writability on abort

- All helper methods (markVolumeReadonly, copyVolume, tailVolume,
  readVolumeFileStatus, deleteVolume) now accept a context parameter
  instead of using context.Background(), so Execute's ctx propagates
  cancellation and timeouts into every volume server RPC
- Add deferred cleanup that restores the source volume to writable if
  any step after markVolumeReadonly fails, preventing the source from
  being left permanently readonly on abort
- Add markVolumeWritable helper using VolumeMarkWritableRequest

* fix: deep-copy protobuf messages in test recording sender

Use proto.Clone in recordingExecutionSender to store immutable snapshots
of JobProgressUpdate and JobCompleted, preventing assertions from
observing mutations if the handler reuses message pointers.

* fix: add VolumeMarkWritable and ReadVolumeFileStatus to fake volume server

The balance task now calls ReadVolumeFileStatus for pre-delete
verification and VolumeMarkWritable to restore writability on abort.
Add both RPCs to the test fake, and drop the mountCalls assertion since
BalanceTask no longer calls VolumeMount directly (VolumeCopy handles it).

* fix: use maxConcurrentMovesLimit (50) for clamp, not defaultMaxConcurrentMoves

defaultMaxConcurrentMoves (5) is the fallback when the field is unset,
not an upper bound. Clamping to it silently overrides valid config
values like 10/20/50. Introduce maxConcurrentMovesLimit (50) matching
the descriptor's MaxValue and clamp to that instead.
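The distinction between the fallback and the ceiling boils down to (a minimal sketch; the function name is hypothetical):

```go
package main

import "fmt"

const (
	defaultMaxConcurrentMoves = 5  // fallback when the field is unset
	maxConcurrentMovesLimit   = 50 // hard ceiling matching the descriptor's MaxValue
)

// effectiveConcurrency distinguishes "unset" (use the default) from
// "too large" (clamp to the limit), so configured values like 10/20/50
// pass through untouched.
func effectiveConcurrency(configured int) int {
	if configured <= 0 {
		return defaultMaxConcurrentMoves
	}
	if configured > maxConcurrentMovesLimit {
		return maxConcurrentMovesLimit
	}
	return configured
}

func main() {
	fmt.Println(effectiveConcurrency(0), effectiveConcurrency(20), effectiveConcurrency(999))
}
```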

* fix: cancel batch moves on progress stream failure

Derive a cancellable batchCtx from the caller's ctx. If
sender.SendProgress returns an error (client disconnect, context
cancelled), capture it, skip further sends, and cancel batchCtx so
in-flight moves abort via their propagated context rather than running
blind to completion.

* fix: bound cleanup timeout and validate batch move fields

- Use a 30-second timeout for the deferred markVolumeWritable cleanup
  instead of context.Background() which can block indefinitely if the
  volume server is unreachable
- Validate required fields (VolumeID, SourceNode, TargetNode) before
  appending moves to a batch proposal, skipping invalid entries
- Fall back to a single-move proposal when filtering leaves only one
  valid move in a batch

* fix: cancel task execution on SendProgress stream failure

All handler progress callbacks previously ignored SendProgress errors,
allowing tasks to continue executing after the client disconnected.
Now each handler creates a derived cancellable context and cancels it
on the first SendProgress error, stopping the in-flight task promptly.

Handlers fixed: erasure_coding, vacuum, volume_balance (single-move),
and admin_script (breaks command loop on send failure).

* fix: validate batch moves before scheduling in executeBatchMoves

Reject empty batches, enforce a hard upper bound (100 moves), and
filter out nil or incomplete move specs (missing source/target/volume)
before allocating progress tracking and launching goroutines.
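The validation steps can be sketched as (an illustrative struct mirroring the required BalanceMoveSpec fields; names and the bound are as described above):

```go
package main

import "fmt"

// moveSpec mirrors the required fields of a batch move entry.
type moveSpec struct {
	VolumeID   uint32
	SourceNode string
	TargetNode string
}

const maxBatchMoves = 100 // hard upper bound on moves per batch

// validateBatch rejects empty or oversized batches and filters out nil
// or incomplete specs before any goroutines are launched.
func validateBatch(moves []*moveSpec) ([]*moveSpec, error) {
	if len(moves) == 0 {
		return nil, fmt.Errorf("batch contains no moves")
	}
	if len(moves) > maxBatchMoves {
		return nil, fmt.Errorf("batch has %d moves, max is %d", len(moves), maxBatchMoves)
	}
	var valid []*moveSpec
	for _, m := range moves {
		if m == nil || m.VolumeID == 0 || m.SourceNode == "" || m.TargetNode == "" {
			continue // skip incomplete entries
		}
		valid = append(valid, m)
	}
	if len(valid) == 0 {
		return nil, fmt.Errorf("no valid moves after filtering")
	}
	return valid, nil
}

func main() {
	valid, err := validateBatch([]*moveSpec{
		{VolumeID: 1, SourceNode: "a", TargetNode: "b"},
		nil,
		{VolumeID: 0, SourceNode: "a", TargetNode: "b"}, // missing volume id
	})
	fmt.Println(len(valid), err)
}
```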

* test: add batch balance execution integration test

Tests the batch move path with 3 volumes, max concurrency 2, using
fake volume servers. Verifies all moves complete with correct readonly,
copy, tail, and delete RPC counts.

* test: add MarkWritableCount and ReadFileStatusCount accessors

Expose the markWritableCalls and readFileStatusCalls counters on the
fake volume server, following the existing MarkReadonlyCount pattern.

* fix: oscillation guard uses global effective counts for heterogeneous capacity

The oscillation guard (max-min <= 1) previously used maxServer/minServer
which are determined by utilization ratio. With heterogeneous capacity,
maxServer by utilization can have fewer raw volumes than minServer,
producing a negative diff and incorrectly triggering the guard.

Now scans all servers' effective counts to find the true global max/min
volume counts, so the guard works correctly regardless of whether
utilization-based or raw-count balancing is used.
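A small worked example of why the global scan is needed (counts and capacities are illustrative):

```go
package main

import (
	"fmt"
	"math"
)

// guardDiff returns the spread between the global max and min effective
// volume counts. The oscillation guard fires when this is <= 1; scanning
// all servers avoids the negative diff you get from comparing the
// utilization-based maxServer and minServer under heterogeneous capacity.
func guardDiff(effectiveCounts map[string]int) int {
	globalMax, globalMin := 0, math.MaxInt
	for _, c := range effectiveCounts {
		if c > globalMax {
			globalMax = c
		}
		if c < globalMin {
			globalMin = c
		}
	}
	return globalMax - globalMin
}

func main() {
	// "small" (9/10 = 90% utilized) is maxServer by utilization, "big"
	// (40/100 = 40%) is minServer, so counts[maxServer]-counts[minServer]
	// = 9-40 = -31 would wrongly trigger the guard. The global scan over
	// raw effective counts gives the true spread.
	counts := map[string]int{"big": 40, "small": 9}
	fmt.Println(guardDiff(counts)) // 31: plenty of room to keep balancing
}
```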

* fix: admin script handler breaks outer loop on SendProgress failure

The break on SendProgress error inside the shell.Commands scan only
exited the inner loop, letting the outer command loop continue
executing commands on a broken stream. Use a sendBroken flag to
propagate the break to the outer execCommands loop.
2026-03-09 19:30:08 -07:00

package balance
import (
"fmt"
"math"
"sort"
"time"
"github.com/seaweedfs/seaweedfs/weed/admin/topology"
"github.com/seaweedfs/seaweedfs/weed/glog"
"github.com/seaweedfs/seaweedfs/weed/pb/worker_pb"
"github.com/seaweedfs/seaweedfs/weed/worker/tasks/base"
"github.com/seaweedfs/seaweedfs/weed/worker/tasks/util"
"github.com/seaweedfs/seaweedfs/weed/worker/types"
)
// Detection implements the detection logic for balance tasks.
// maxResults limits how many balance operations are returned per invocation.
// A non-positive maxResults means no explicit limit (uses a large default).
// The returned truncated flag is true when detection stopped because it hit
// maxResults rather than running out of work.
func Detection(metrics []*types.VolumeHealthMetrics, clusterInfo *types.ClusterInfo, config base.TaskConfig, maxResults int) ([]*types.TaskDetectionResult, bool, error) {
if !config.IsEnabled() {
return nil, false, nil
}
if clusterInfo == nil {
return nil, false, nil
}
balanceConfig := config.(*Config)
if maxResults <= 0 {
maxResults = math.MaxInt32
}
// Group volumes by disk type to ensure we compare apples to apples
volumesByDiskType := make(map[string][]*types.VolumeHealthMetrics)
for _, metric := range metrics {
volumesByDiskType[metric.DiskType] = append(volumesByDiskType[metric.DiskType], metric)
}
// Sort disk types for deterministic iteration order when maxResults
// spans multiple disk types.
diskTypes := make([]string, 0, len(volumesByDiskType))
for dt := range volumesByDiskType {
diskTypes = append(diskTypes, dt)
}
sort.Strings(diskTypes)
var allParams []*types.TaskDetectionResult
truncated := false
for _, diskType := range diskTypes {
remaining := maxResults - len(allParams)
if remaining <= 0 {
truncated = true
break
}
tasks, diskTruncated := detectForDiskType(diskType, volumesByDiskType[diskType], balanceConfig, clusterInfo, remaining)
allParams = append(allParams, tasks...)
if diskTruncated {
truncated = true
}
}
return allParams, truncated, nil
}
// detectForDiskType performs balance detection for a specific disk type,
// returning up to maxResults balance tasks and whether it was truncated by the limit.
func detectForDiskType(diskType string, diskMetrics []*types.VolumeHealthMetrics, balanceConfig *Config, clusterInfo *types.ClusterInfo, maxResults int) ([]*types.TaskDetectionResult, bool) {
// Skip if cluster segment is too small
minVolumeCount := 2 // More reasonable for small clusters
if len(diskMetrics) < minVolumeCount {
// Only log at verbose level to avoid spamming for small/empty disk types
glog.V(1).Infof("BALANCE [%s]: No tasks created - cluster too small (%d volumes, need ≥%d)", diskType, len(diskMetrics), minVolumeCount)
return nil, false
}
// Analyze volume distribution across servers.
// Seed from ActiveTopology so servers with matching disk type but zero
// volumes are included in the count and imbalance calculation.
// Also collect MaxVolumeCount per server to compute utilization ratios.
serverVolumeCounts := make(map[string]int)
serverMaxVolumes := make(map[string]int64)
if clusterInfo.ActiveTopology != nil {
topologyInfo := clusterInfo.ActiveTopology.GetTopologyInfo()
if topologyInfo != nil {
for _, dc := range topologyInfo.DataCenterInfos {
for _, rack := range dc.RackInfos {
for _, node := range rack.DataNodeInfos {
for diskTypeName, diskInfo := range node.DiskInfos {
if diskTypeName == diskType {
serverVolumeCounts[node.Id] = 0
serverMaxVolumes[node.Id] += diskInfo.MaxVolumeCount
}
}
}
}
}
}
}
for _, metric := range diskMetrics {
serverVolumeCounts[metric.Server]++
}
if len(serverVolumeCounts) < balanceConfig.MinServerCount {
glog.V(1).Infof("BALANCE [%s]: No tasks created - too few servers (%d servers, need ≥%d)", diskType, len(serverVolumeCounts), balanceConfig.MinServerCount)
return nil, false
}
// Seed adjustments from existing pending/assigned balance tasks so that
// effectiveCounts reflects in-flight moves and prevents over-scheduling.
var adjustments map[string]int
if clusterInfo.ActiveTopology != nil {
adjustments = clusterInfo.ActiveTopology.GetTaskServerAdjustments(topology.TaskTypeBalance)
}
if adjustments == nil {
adjustments = make(map[string]int)
}
// Servers where we can no longer find eligible volumes or plan destinations
exhaustedServers := make(map[string]bool)
// Sort servers for deterministic iteration and tie-breaking
sortedServers := make([]string, 0, len(serverVolumeCounts))
for server := range serverVolumeCounts {
sortedServers = append(sortedServers, server)
}
sort.Strings(sortedServers)
// Pre-index volumes by server with cursors to avoid O(maxResults * volumes) scanning.
// Sort each server's volumes by VolumeID for deterministic selection.
volumesByServer := make(map[string][]*types.VolumeHealthMetrics, len(serverVolumeCounts))
for _, metric := range diskMetrics {
volumesByServer[metric.Server] = append(volumesByServer[metric.Server], metric)
}
for _, vols := range volumesByServer {
sort.Slice(vols, func(i, j int) bool {
return vols[i].VolumeID < vols[j].VolumeID
})
}
serverCursors := make(map[string]int, len(serverVolumeCounts))
var results []*types.TaskDetectionResult
balanced := false
// Decide upfront whether all servers have MaxVolumeCount info.
// If any server is missing it, fall back to raw counts for ALL servers
// to avoid mixing utilization ratios (0.0 to 1.0) with raw counts.
allServersHaveMaxInfo := true
for _, server := range sortedServers {
if maxVol, ok := serverMaxVolumes[server]; !ok || maxVol <= 0 {
allServersHaveMaxInfo = false
glog.V(1).Infof("BALANCE [%s]: Server %s is missing MaxVolumeCount info, falling back to raw volume counts for balancing", diskType, server)
break
}
}
var serverUtilization func(server string, effectiveCount int) float64
if allServersHaveMaxInfo {
serverUtilization = func(server string, effectiveCount int) float64 {
return float64(effectiveCount) / float64(serverMaxVolumes[server])
}
} else {
serverUtilization = func(_ string, effectiveCount int) float64 {
return float64(effectiveCount)
}
}
for len(results) < maxResults {
// Compute effective volume counts with adjustments from planned moves
effectiveCounts := make(map[string]int, len(serverVolumeCounts))
for server, count := range serverVolumeCounts {
effective := count + adjustments[server]
if effective < 0 {
effective = 0
}
effectiveCounts[server] = effective
}
// Find the most and least utilized servers using utilization ratio
// (volumes / maxVolumes) so that servers with higher capacity are
// expected to hold proportionally more volumes.
maxUtilization := -1.0
minUtilization := math.Inf(1)
maxServer := ""
minServer := ""
for _, server := range sortedServers {
count := effectiveCounts[server]
util := serverUtilization(server, count)
// Min is calculated across all servers for an accurate imbalance ratio
if util < minUtilization {
minUtilization = util
minServer = server
}
// Max is only among non-exhausted servers since we can only move from them
if exhaustedServers[server] {
continue
}
if util > maxUtilization {
maxUtilization = util
maxServer = server
}
}
if maxServer == "" {
// All servers exhausted
glog.V(1).Infof("BALANCE [%s]: All overloaded servers exhausted after %d task(s)", diskType, len(results))
break
}
// Check if utilization imbalance exceeds threshold.
// imbalanceRatio is the difference between the most and least utilized
// servers, expressed as a fraction of the average of the max and min utilizations.
avgUtilization := (maxUtilization + minUtilization) / 2.0
var imbalanceRatio float64
if avgUtilization > 0 {
imbalanceRatio = (maxUtilization - minUtilization) / avgUtilization
}
if imbalanceRatio <= balanceConfig.ImbalanceThreshold {
if len(results) == 0 {
glog.Infof("BALANCE [%s]: No tasks created - cluster well balanced. Imbalance=%.1f%% (threshold=%.1f%%). MaxUtil=%.1f%% on %s, MinUtil=%.1f%% on %s",
diskType, imbalanceRatio*100, balanceConfig.ImbalanceThreshold*100, maxUtilization*100, maxServer, minUtilization*100, minServer)
} else {
glog.Infof("BALANCE [%s]: Created %d task(s), cluster now balanced. Imbalance=%.1f%% (threshold=%.1f%%)",
diskType, len(results), imbalanceRatio*100, balanceConfig.ImbalanceThreshold*100)
}
balanced = true
break
}
// When the global max and min effective counts differ by at most 1,
// no single move can improve balance — it would just swap which server
// is min vs max. Stop here to avoid infinite oscillation when the
// threshold is unachievable (e.g., 11 vols across 4 servers: best is
// 3/3/3/2, imbalance=(3-2)/2.5=40%). We scan ALL servers' effective counts so the
// check works regardless of whether utilization or raw counts are used.
globalMaxCount, globalMinCount := 0, math.MaxInt
for _, c := range effectiveCounts {
if c > globalMaxCount {
globalMaxCount = c
}
if c < globalMinCount {
globalMinCount = c
}
}
if globalMaxCount-globalMinCount <= 1 {
if len(results) == 0 {
glog.Infof("BALANCE [%s]: No tasks created - cluster as balanced as possible. Imbalance=%.1f%% (threshold=%.1f%%), but max-min diff is %d",
diskType, imbalanceRatio*100, balanceConfig.ImbalanceThreshold*100, globalMaxCount-globalMinCount)
} else {
glog.Infof("BALANCE [%s]: Created %d task(s), cluster as balanced as possible. Imbalance=%.1f%% (threshold=%.1f%%), max-min diff=%d",
diskType, len(results), imbalanceRatio*100, balanceConfig.ImbalanceThreshold*100, globalMaxCount-globalMinCount)
}
balanced = true
break
}
// Select a volume from the overloaded server using per-server cursor
var selectedVolume *types.VolumeHealthMetrics
serverVols := volumesByServer[maxServer]
cursor := serverCursors[maxServer]
for cursor < len(serverVols) {
metric := serverVols[cursor]
cursor++
// Skip volumes that already have a task in ActiveTopology
if clusterInfo.ActiveTopology != nil && clusterInfo.ActiveTopology.HasAnyTask(metric.VolumeID) {
continue
}
selectedVolume = metric
break
}
serverCursors[maxServer] = cursor
if selectedVolume == nil {
glog.V(1).Infof("BALANCE [%s]: No more eligible volumes on overloaded server %s, trying other servers", diskType, maxServer)
exhaustedServers[maxServer] = true
continue
}
// Create task targeting minServer — the greedy algorithm's natural choice.
// Using minServer instead of letting planBalanceDestination independently
// pick a destination ensures that the detection loop's effective counts
// and the destination selection stay in sync. Without this, the topology's
// LoadCount-based scoring can diverge from the adjustment-based effective
// counts, causing moves to pile onto one server or oscillate (A→B, B→A).
task, destServerID := createBalanceTask(diskType, selectedVolume, clusterInfo, minServer)
if task == nil {
glog.V(1).Infof("BALANCE [%s]: Cannot plan task for volume %d on server %s, trying next volume", diskType, selectedVolume.VolumeID, maxServer)
continue
}
results = append(results, task)
// Adjust effective counts for the next iteration.
adjustments[maxServer]--
if destServerID != "" {
adjustments[destServerID]++
// If the destination server wasn't in serverVolumeCounts (e.g., a
// server with 0 volumes not seeded from topology), add it so
// subsequent iterations include it in effective/average/min/max.
if _, exists := serverVolumeCounts[destServerID]; !exists {
serverVolumeCounts[destServerID] = 0
sortedServers = append(sortedServers, destServerID)
sort.Strings(sortedServers)
}
}
}
// Truncated only if we hit maxResults and detection didn't naturally finish
truncated := len(results) >= maxResults && !balanced
return results, truncated
}
// createBalanceTask creates a single balance task for the selected volume.
// targetServer is the server ID chosen by the detection loop's greedy algorithm.
// Returns (nil, "") if destination planning fails.
// On success, returns the task result and the canonical destination server ID.
func createBalanceTask(diskType string, selectedVolume *types.VolumeHealthMetrics, clusterInfo *types.ClusterInfo, targetServer string) (*types.TaskDetectionResult, string) {
taskID := fmt.Sprintf("balance_vol_%d_%d", selectedVolume.VolumeID, time.Now().UnixNano())
task := &types.TaskDetectionResult{
TaskID: taskID,
TaskType: types.TaskTypeBalance,
VolumeID: selectedVolume.VolumeID,
Server: selectedVolume.Server,
Collection: selectedVolume.Collection,
Priority: types.TaskPriorityNormal,
Reason: fmt.Sprintf("Cluster imbalance detected for %s disk type",
diskType),
ScheduleAt: time.Now(),
}
// Plan destination if ActiveTopology is available
if clusterInfo.ActiveTopology == nil {
glog.Warningf("No ActiveTopology available for destination planning in balance detection")
return nil, ""
}
// Resolve the target server chosen by the detection loop's effective counts.
// This keeps destination selection in sync with the greedy algorithm rather
// than relying on topology LoadCount which can diverge across iterations.
destinationPlan, err := resolveBalanceDestination(clusterInfo.ActiveTopology, selectedVolume, targetServer)
if err != nil {
// Fall back to score-based planning if the preferred target can't be resolved
glog.V(1).Infof("BALANCE [%s]: Cannot resolve target %s for volume %d, falling back to score-based planning: %v",
diskType, targetServer, selectedVolume.VolumeID, err)
destinationPlan, err = planBalanceDestination(clusterInfo.ActiveTopology, selectedVolume)
if err != nil {
glog.Warningf("Failed to plan balance destination for volume %d: %v", selectedVolume.VolumeID, err)
return nil, ""
}
}
// Find the actual disk containing the volume on the source server
sourceDisk, found := base.FindVolumeDisk(clusterInfo.ActiveTopology, selectedVolume.VolumeID, selectedVolume.Collection, selectedVolume.Server)
if !found {
glog.Warningf("BALANCE [%s]: Could not find volume %d (collection: %s) on source server %s - unable to create balance task",
diskType, selectedVolume.VolumeID, selectedVolume.Collection, selectedVolume.Server)
return nil, ""
}
// Update reason with full details now that we have destination info
task.Reason = fmt.Sprintf("Cluster imbalance detected for %s: move volume %d from %s to %s",
diskType, selectedVolume.VolumeID, selectedVolume.Server, destinationPlan.TargetNode)
// Create typed parameters with unified source and target information
task.TypedParams = &worker_pb.TaskParams{
TaskId: taskID,
VolumeId: selectedVolume.VolumeID,
Collection: selectedVolume.Collection,
VolumeSize: selectedVolume.Size,
Sources: []*worker_pb.TaskSource{
{
Node: selectedVolume.ServerAddress,
DiskId: sourceDisk,
VolumeId: selectedVolume.VolumeID,
EstimatedSize: selectedVolume.Size,
DataCenter: selectedVolume.DataCenter,
Rack: selectedVolume.Rack,
},
},
Targets: []*worker_pb.TaskTarget{
{
Node: destinationPlan.TargetAddress,
DiskId: destinationPlan.TargetDisk,
VolumeId: selectedVolume.VolumeID,
EstimatedSize: destinationPlan.ExpectedSize,
DataCenter: destinationPlan.TargetDC,
Rack: destinationPlan.TargetRack,
},
},
TaskParams: &worker_pb.TaskParams_BalanceParams{
BalanceParams: &worker_pb.BalanceTaskParams{
ForceMove: false,
TimeoutSeconds: 600, // 10 minutes default
},
},
}
glog.V(1).Infof("Planned balance destination for volume %d: %s -> %s",
selectedVolume.VolumeID, selectedVolume.Server, destinationPlan.TargetNode)
// Add pending balance task to ActiveTopology for capacity management
targetDisk := destinationPlan.TargetDisk
err = clusterInfo.ActiveTopology.AddPendingTask(topology.TaskSpec{
TaskID: taskID,
TaskType: topology.TaskTypeBalance,
VolumeID: selectedVolume.VolumeID,
VolumeSize: int64(selectedVolume.Size),
Sources: []topology.TaskSourceSpec{
{ServerID: selectedVolume.Server, DiskID: sourceDisk},
},
Destinations: []topology.TaskDestinationSpec{
{ServerID: destinationPlan.TargetNode, DiskID: targetDisk},
},
})
if err != nil {
glog.Warningf("BALANCE [%s]: Failed to add pending task for volume %d: %v", diskType, selectedVolume.VolumeID, err)
return nil, ""
}
glog.V(2).Infof("Added pending balance task %s to ActiveTopology for volume %d: %s:%d -> %s:%d",
taskID, selectedVolume.VolumeID, selectedVolume.Server, sourceDisk, destinationPlan.TargetNode, targetDisk)
return task, destinationPlan.TargetNode
}
// resolveBalanceDestination resolves the destination for a balance operation
// when the target server is already known (chosen by the detection loop's
// effective volume counts). It finds the appropriate disk and address for the
// target server in the topology.
func resolveBalanceDestination(activeTopology *topology.ActiveTopology, selectedVolume *types.VolumeHealthMetrics, targetServer string) (*topology.DestinationPlan, error) {
topologyInfo := activeTopology.GetTopologyInfo()
if topologyInfo == nil {
return nil, fmt.Errorf("no topology info available")
}
// Find the target node in the topology and get its disk info
for _, dc := range topologyInfo.DataCenterInfos {
for _, rack := range dc.RackInfos {
for _, node := range rack.DataNodeInfos {
if node.Id != targetServer {
continue
}
// Find an available disk matching the volume's disk type
for diskTypeName, diskInfo := range node.DiskInfos {
if diskTypeName != selectedVolume.DiskType {
continue
}
if diskInfo.MaxVolumeCount > 0 && diskInfo.VolumeCount >= diskInfo.MaxVolumeCount {
continue // disk is full
}
targetAddress, err := util.ResolveServerAddress(node.Id, activeTopology)
if err != nil {
return nil, fmt.Errorf("failed to resolve address for target server %s: %v", node.Id, err)
}
return &topology.DestinationPlan{
TargetNode: node.Id,
TargetAddress: targetAddress,
TargetDisk: diskInfo.DiskId,
TargetRack: rack.Id,
TargetDC: dc.Id,
ExpectedSize: selectedVolume.Size,
}, nil
}
return nil, fmt.Errorf("target server %s has no available disk of type %s", targetServer, selectedVolume.DiskType)
}
}
}
return nil, fmt.Errorf("target server %s not found in topology", targetServer)
}
// planBalanceDestination plans the destination for a balance operation using
// score-based selection. Used as a fallback when the preferred target cannot
// be resolved, and for single-move scenarios outside the detection loop.
func planBalanceDestination(activeTopology *topology.ActiveTopology, selectedVolume *types.VolumeHealthMetrics) (*topology.DestinationPlan, error) {
// Get source node information from topology
var sourceRack, sourceDC string
// Extract rack and DC from topology info
topologyInfo := activeTopology.GetTopologyInfo()
if topologyInfo != nil {
for _, dc := range topologyInfo.DataCenterInfos {
for _, rack := range dc.RackInfos {
for _, dataNodeInfo := range rack.DataNodeInfos {
if dataNodeInfo.Id == selectedVolume.Server {
sourceDC = dc.Id
sourceRack = rack.Id
break
}
}
if sourceRack != "" {
break
}
}
if sourceDC != "" {
break
}
}
}
// Get available disks, excluding the source node
availableDisks := activeTopology.GetAvailableDisks(topology.TaskTypeBalance, selectedVolume.Server)
if len(availableDisks) == 0 {
return nil, fmt.Errorf("no available disks for balance operation")
}
// Sort available disks by NodeID then DiskID for deterministic tie-breaking
sort.Slice(availableDisks, func(i, j int) bool {
if availableDisks[i].NodeID != availableDisks[j].NodeID {
return availableDisks[i].NodeID < availableDisks[j].NodeID
}
return availableDisks[i].DiskID < availableDisks[j].DiskID
})
// Find the best destination disk based on balance criteria
var bestDisk *topology.DiskInfo
bestScore := math.Inf(-1)
for _, disk := range availableDisks {
// Ensure disk type matches
if disk.DiskType != selectedVolume.DiskType {
continue
}
score := calculateBalanceScore(disk, sourceRack, sourceDC, selectedVolume.Size)
if score > bestScore {
bestScore = score
bestDisk = disk
}
}
if bestDisk == nil {
return nil, fmt.Errorf("no suitable destination found for balance operation")
}
// Get the target server address
targetAddress, err := util.ResolveServerAddress(bestDisk.NodeID, activeTopology)
if err != nil {
return nil, fmt.Errorf("failed to resolve address for target server %s: %v", bestDisk.NodeID, err)
}
return &topology.DestinationPlan{
TargetNode: bestDisk.NodeID,
TargetAddress: targetAddress,
TargetDisk: bestDisk.DiskID,
TargetRack: bestDisk.Rack,
TargetDC: bestDisk.DataCenter,
ExpectedSize: selectedVolume.Size,
PlacementScore: bestScore,
}, nil
}
// calculateBalanceScore calculates placement score for balance operations.
// LoadCount reflects pending+assigned tasks on the disk, so we factor it into
// the utilization estimate to avoid stacking multiple moves onto the same target.
func calculateBalanceScore(disk *topology.DiskInfo, sourceRack, sourceDC string, volumeSize uint64) float64 {
if disk.DiskInfo == nil {
return 0.0
}
score := 0.0
// Prefer disks with lower effective volume count (current + pending moves).
// LoadCount is included so that disks already targeted by planned moves
// appear more utilized, naturally spreading work across targets.
if disk.DiskInfo.MaxVolumeCount > 0 {
effectiveVolumeCount := float64(disk.DiskInfo.VolumeCount) + float64(disk.LoadCount)
utilization := effectiveVolumeCount / float64(disk.DiskInfo.MaxVolumeCount)
score += (1.0 - utilization) * 50.0 // Up to 50 points for low utilization
}
// Prefer different racks for better distribution
if disk.Rack != sourceRack {
score += 30.0
}
// Prefer different data centers for better distribution
if disk.DataCenter != sourceDC {
score += 20.0
}
return score
}