Batch volume balance: run multiple moves per job (#8561)
* proto: add BalanceMoveSpec and batch fields to BalanceTaskParams

  Add BalanceMoveSpec message for encoding individual volume moves, and max_concurrent_moves + repeated moves fields to BalanceTaskParams to support batching multiple volume moves in a single job.

* balance handler: add batch execution with concurrent volume moves

  Refactor Execute() into executeSingleMove() (backward compatible) and executeBatchMoves(), which runs multiple volume moves concurrently using a semaphore-bounded goroutine pool (see the sketch after this list). When BalanceTaskParams.Moves is populated, the batch path is taken; otherwise the single-move path. Includes aggregate progress reporting across concurrent moves, per-move error collection, and partial failure support.

* balance handler: add batch config fields to Descriptor and worker config

  Add max_concurrent_moves and batch_size fields to the worker config form and deriveBalanceWorkerConfig(). These control how many volume moves run concurrently within a batch job and the maximum batch size.

* balance handler: group detection proposals into batch jobs

  When batch_size > 1, the Detect method groups detection results into batch proposals where each proposal encodes multiple BalanceMoveSpec entries in BalanceTaskParams.Moves. Single-result batches fall back to the existing single-move proposal format for backward compatibility.

* admin UI: add volume balance execution plan and batch badge

  Add renderBalanceExecutionPlan() for rich rendering of volume balance jobs in the job detail modal. Single-move jobs show source/target/volume info; batch jobs show a moves table with all volume moves. Add a batch badge (e.g., "5 moves") next to the job type in the execution jobs table when the job has the batch=true label.

* Update plugin_templ.go

* fix: detection algorithm uses greedy target instead of divergent topology scores

  The detection loop tracked effective volume counts via an adjustments map, but createBalanceTask independently called planBalanceDestination which used the topology's LoadCount — a separate, unadjusted source of truth. This divergence caused multiple moves to pile onto the same server. Changes:

  - Add resolveBalanceDestination to resolve the detection loop's greedy target (minServer) rather than independently picking a destination
  - Add oscillation guard: stop when max-min <= 1 since no single move can improve the balance beyond that point
  - Track unseeded destinations: if a target server wasn't in the initial serverVolumeCounts, add it so subsequent iterations include it
  - Add TestDetection_UnseededDestinationDoesNotOverload

* fix: handler force_move propagation, partial failure, deterministic dedupe

  - Propagate ForceMove from outer BalanceTaskParams to individual move TaskParams so batch moves respect the force_move flag
  - Fix partial failure: mark job successful if at least one move succeeded (succeeded > 0 || failed == 0) to avoid re-running already-completed moves on retry
  - Use SHA-256 hash for deterministic dedupe key fallback instead of time.Now().UnixNano() which is non-deterministic
  - Remove unused successDetails variable
  - Extract maxProposalStringLength constant to replace magic number 200

* admin UI: use template literals in balance execution plan rendering

* fix: integration test handles batch proposals from batched detection

  With batch_size=20, all moves are grouped into a single proposal containing BalanceParams.Moves instead of top-level Sources/Targets. Update assertions to handle both batch and single-move proposal formats.
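The semaphore-bounded goroutine pool referenced above follows a common Go pattern. The sketch below is illustrative only, assuming simplified stand-ins (moveSpec, runMove) rather than the real BalanceMoveSpec proto and move logic; it shows bounded concurrency, per-move error collection, and the partial-failure rule (succeeded > 0 || failed == 0):

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// moveSpec is a simplified stand-in for the BalanceMoveSpec proto message.
type moveSpec struct {
	volumeID   uint32
	sourceNode string
	targetNode string
}

// executeBatch runs moves concurrently, bounded by maxConcurrent, collecting
// one error per move. Per the commit's partial-failure rule, the batch is
// considered successful if at least one move succeeded.
func executeBatch(ctx context.Context, moves []moveSpec, maxConcurrent int,
	runMove func(context.Context, moveSpec) error) error {

	sem := make(chan struct{}, maxConcurrent) // semaphore bounding in-flight moves
	errs := make([]error, len(moves))         // per-move error collection
	var wg sync.WaitGroup

	for i, m := range moves {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot (blocks once maxConcurrent are in flight)
		go func(i int, m moveSpec) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			errs[i] = runMove(ctx, m)
		}(i, m)
	}
	wg.Wait()

	succeeded, failed := 0, 0
	for _, err := range errs {
		if err != nil {
			failed++
		} else {
			succeeded++
		}
	}
	if succeeded > 0 || failed == 0 { // partial failure still marks the job successful
		return nil
	}
	return fmt.Errorf("all %d moves failed, first error: %v", failed, errs[0])
}

func main() {
	moves := []moveSpec{
		{volumeID: 1, sourceNode: "srv-a", targetNode: "srv-b"},
		{volumeID: 2, sourceNode: "srv-a", targetNode: "srv-c"},
		{volumeID: 3, sourceNode: "srv-b", targetNode: "srv-c"},
	}
	err := executeBatch(context.Background(), moves, 2, func(ctx context.Context, m moveSpec) error {
		fmt.Printf("moving volume %d: %s -> %s\n", m.volumeID, m.sourceNode, m.targetNode)
		return nil
	})
	fmt.Println("batch result:", err)
}
```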
* fix: verify volume size on target before deleting source during balance

  Add a pre-delete safety check that reads the volume file status on both source and target, then compares .dat file size and file count. If they don't match, the move is aborted — leaving the source intact rather than risking irreversible data loss. Also removes the redundant mountVolume call since VolumeCopy already mounts the volume on the target server.

* fix: clamp maxConcurrent, serialize progress sends, validate config as int64

  - Clamp maxConcurrentMoves to defaultMaxConcurrentMoves before creating the semaphore so a stale or malicious job cannot request unbounded concurrent volume moves
  - Extend progressMu to cover sender.SendProgress calls since the underlying gRPC stream is not safe for concurrent writes (see the sketch after this list)
  - Perform bounds checks on max_concurrent_moves and batch_size in int64 space before casting to int, avoiding potential overflow on 32-bit

* fix: check disk capacity in resolveBalanceDestination

  Skip disks where VolumeCount >= MaxVolumeCount so the detection loop does not propose moves to a full disk that would fail at execution time.

* test: rename unseeded destination test to match actual behavior

  The test exercises a server with 0 volumes that IS seeded from topology (matching disk type), not an unseeded destination. Rename to TestDetection_ZeroVolumeServerIncludedInBalance and fix comments.

* test: tighten integration test to assert exactly one batch proposal

  With default batch_size=20, all moves should be grouped into a single batch proposal. Assert len(proposals)==1 and require BalanceParams with Moves, removing the legacy single-move else branch.

* fix: propagate ctx to RPCs and restore source writability on abort

  - All helper methods (markVolumeReadonly, copyVolume, tailVolume, readVolumeFileStatus, deleteVolume) now accept a context parameter instead of using context.Background(), so Execute's ctx propagates cancellation and timeouts into every volume server RPC
  - Add deferred cleanup that restores the source volume to writable if any step after markVolumeReadonly fails, preventing the source from being left permanently readonly on abort
  - Add markVolumeWritable helper using VolumeMarkWritableRequest

* fix: deep-copy protobuf messages in test recording sender

  Use proto.Clone in recordingExecutionSender to store immutable snapshots of JobProgressUpdate and JobCompleted, preventing assertions from observing mutations if the handler reuses message pointers.

* fix: add VolumeMarkWritable and ReadVolumeFileStatus to fake volume server

  The balance task now calls ReadVolumeFileStatus for pre-delete verification and VolumeMarkWritable to restore writability on abort. Add both RPCs to the test fake, and drop the mountCalls assertion since BalanceTask no longer calls VolumeMount directly (VolumeCopy handles it).

* fix: use maxConcurrentMovesLimit (50) for clamp, not defaultMaxConcurrentMoves

  defaultMaxConcurrentMoves (5) is the fallback when the field is unset, not an upper bound. Clamping to it silently overrides valid config values like 10/20/50. Introduce maxConcurrentMovesLimit (50) matching the descriptor's MaxValue and clamp to that instead.

* fix: cancel batch moves on progress stream failure

  Derive a cancellable batchCtx from the caller's ctx. If sender.SendProgress returns an error (client disconnect, context cancelled), capture it, skip further sends, and cancel batchCtx so in-flight moves abort via their propagated context rather than running blind to completion.
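The progress-send fixes above combine two mechanisms: a mutex that covers the SendProgress call itself, and a derived cancellable context that aborts in-flight moves on the first send failure. A minimal sketch, assuming a hypothetical progressSender interface in place of the real execution sender:

```go
package balance

import (
	"context"
	"sync"
)

// progressSender stands in for the real execution sender; only the error
// return of SendProgress matters here.
type progressSender interface {
	SendProgress(percent float64) error
}

// batchProgress serializes progress sends (the underlying gRPC stream is not
// safe for concurrent writes) and cancels the batch on the first send failure.
type batchProgress struct {
	mu         sync.Mutex
	sendFailed bool
	cancel     context.CancelFunc
}

// newBatchProgress derives a cancellable batchCtx from the caller's ctx;
// every move goroutine should receive batchCtx so cancellation propagates.
func newBatchProgress(ctx context.Context) (context.Context, *batchProgress) {
	batchCtx, cancel := context.WithCancel(ctx)
	return batchCtx, &batchProgress{cancel: cancel}
}

// report is called from every move goroutine; the mutex covers the
// SendProgress call itself, not just the bookkeeping around it.
func (p *batchProgress) report(sender progressSender, percent float64) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.sendFailed {
		return // skip further sends once the stream is known broken
	}
	if err := sender.SendProgress(percent); err != nil {
		p.sendFailed = true
		p.cancel() // in-flight moves abort via the propagated batchCtx
	}
}
```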
* fix: bound cleanup timeout and validate batch move fields

  - Use a 30-second timeout for the deferred markVolumeWritable cleanup instead of context.Background() which can block indefinitely if the volume server is unreachable
  - Validate required fields (VolumeID, SourceNode, TargetNode) before appending moves to a batch proposal, skipping invalid entries
  - Fall back to a single-move proposal when filtering leaves only one valid move in a batch

* fix: cancel task execution on SendProgress stream failure

  All handler progress callbacks previously ignored SendProgress errors, allowing tasks to continue executing after the client disconnected. Now each handler creates a derived cancellable context and cancels it on the first SendProgress error, stopping the in-flight task promptly. Handlers fixed: erasure_coding, vacuum, volume_balance (single-move), and admin_script (breaks command loop on send failure).

* fix: validate batch moves before scheduling in executeBatchMoves

  Reject empty batches, enforce a hard upper bound (100 moves), and filter out nil or incomplete move specs (missing source/target/volume) before allocating progress tracking and launching goroutines (see the sketch after this list).

* test: add batch balance execution integration test

  Tests the batch move path with 3 volumes, max concurrency 2, using fake volume servers. Verifies all moves complete with correct readonly, copy, tail, and delete RPC counts.

* test: add MarkWritableCount and ReadFileStatusCount accessors

  Expose the markWritableCalls and readFileStatusCalls counters on the fake volume server, following the existing MarkReadonlyCount pattern.

* fix: oscillation guard uses global effective counts for heterogeneous capacity

  The oscillation guard (max-min <= 1) previously used maxServer/minServer, which are determined by utilization ratio. With heterogeneous capacity, maxServer by utilization can have fewer raw volumes than minServer, producing a negative diff and incorrectly triggering the guard. Now scans all servers' effective counts to find the true global max/min volume counts, so the guard works correctly regardless of whether utilization-based or raw-count balancing is used.

* fix: admin script handler breaks outer loop on SendProgress failure

  The break on SendProgress error inside the shell.Commands scan only exited the inner loop, letting the outer command loop continue executing commands on a broken stream. Use a sendBroken flag to propagate the break to the outer execCommands loop.
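A sketch of the pre-scheduling validation described above, assuming a simplified BalanceMoveSpec shape; the 100-move cap mirrors the hard upper bound named in the commit message:

```go
package balance

import "fmt"

const maxBatchMoves = 100 // hard upper bound per batch job

// BalanceMoveSpec is a simplified shape of the proto message; only the
// fields that the validation checks are shown here.
type BalanceMoveSpec struct {
	VolumeId   uint32
	SourceNode string
	TargetNode string
}

// validateBatchMoves rejects empty or oversized batches and filters out nil
// or incomplete specs before any progress tracking is allocated or any
// goroutines are launched.
func validateBatchMoves(moves []*BalanceMoveSpec) ([]*BalanceMoveSpec, error) {
	if len(moves) == 0 {
		return nil, fmt.Errorf("batch contains no moves")
	}
	if len(moves) > maxBatchMoves {
		return nil, fmt.Errorf("batch has %d moves, exceeding the %d-move limit", len(moves), maxBatchMoves)
	}
	valid := make([]*BalanceMoveSpec, 0, len(moves))
	for _, m := range moves {
		if m == nil || m.VolumeId == 0 || m.SourceNode == "" || m.TargetNode == "" {
			continue // skip nil or incomplete specs
		}
		valid = append(valid, m)
	}
	if len(valid) == 0 {
		return nil, fmt.Errorf("no valid moves after filtering")
	}
	return valid, nil
}
```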
@@ -82,38 +82,70 @@ func (t *BalanceTask) Execute(ctx context.Context, params *worker_pb.TaskParams)
 	// Step 1: Mark volume readonly
 	t.ReportProgress(10.0)
 	t.GetLogger().Info("Marking volume readonly for move")
-	if err := t.markVolumeReadonly(sourceServer, volumeId); err != nil {
+	if err := t.markVolumeReadonly(ctx, sourceServer, volumeId); err != nil {
 		return fmt.Errorf("failed to mark volume readonly: %v", err)
 	}
+	// Restore source writability if any subsequent step fails, so the
+	// source volume is not left permanently readonly on abort.
+	sourceMarkedReadonly := true
+	defer func() {
+		if sourceMarkedReadonly {
+			cleanupCtx, cleanupCancel := context.WithTimeout(context.Background(), 30*time.Second)
+			defer cleanupCancel()
+			if wErr := t.markVolumeWritable(cleanupCtx, sourceServer, volumeId); wErr != nil {
+				glog.Warningf("failed to restore volume %d writability on %s: %v", volumeId, sourceServer, wErr)
+			}
+		}
+	}()

-	// Step 2: Copy volume to destination
-	t.ReportProgress(20.0)
-	t.GetLogger().Info("Copying volume to destination")
-	lastAppendAtNs, err := t.copyVolume(sourceServer, targetServer, volumeId)
+	// Step 2: Read source volume size before copy (for post-copy verification)
+	t.ReportProgress(15.0)
+	sourceStatus, err := t.readVolumeFileStatus(ctx, sourceServer, volumeId)
 	if err != nil {
-		return fmt.Errorf("failed to copy volume: %v", err)
+		return fmt.Errorf("failed to read source volume status: %v", err)
 	}

-	// Step 3: Mount volume on target and mark it readonly
-	t.ReportProgress(60.0)
-	t.GetLogger().Info("Mounting volume on target server")
-	if err := t.mountVolume(targetServer, volumeId); err != nil {
-		return fmt.Errorf("failed to mount volume on target: %v", err)
+	// Step 3: Copy volume to destination (VolumeCopy also mounts the volume)
+	t.ReportProgress(20.0)
+	t.GetLogger().Info("Copying volume to destination")
+	lastAppendAtNs, err := t.copyVolume(ctx, sourceServer, targetServer, volumeId)
+	if err != nil {
+		return fmt.Errorf("failed to copy volume: %v", err)
 	}

 	// Step 4: Tail for updates
 	t.ReportProgress(70.0)
 	t.GetLogger().Info("Syncing final updates")
-	if err := t.tailVolume(sourceServer, targetServer, volumeId, lastAppendAtNs); err != nil {
+	if err := t.tailVolume(ctx, sourceServer, targetServer, volumeId, lastAppendAtNs); err != nil {
 		glog.Warningf("Tail operation failed (may be normal): %v", err)
 	}

-	// Step 5: Delete from source
+	// Step 5: Verify the volume on target before deleting source.
+	// This is a critical safety check — once the source is deleted, data loss
+	// is irreversible. We verify the target has the volume with matching size.
+	t.ReportProgress(85.0)
+	t.GetLogger().Info("Verifying volume on target before deleting source")
+	targetStatus, err := t.readVolumeFileStatus(ctx, targetServer, volumeId)
+	if err != nil {
+		return fmt.Errorf("aborting: cannot verify volume %d on target %s before deleting source: %v", volumeId, targetServer, err)
+	}
+	if targetStatus.DatFileSize != sourceStatus.DatFileSize {
+		return fmt.Errorf("aborting: volume %d .dat size mismatch — source %d bytes, target %d bytes",
+			volumeId, sourceStatus.DatFileSize, targetStatus.DatFileSize)
+	}
+	if targetStatus.FileCount != sourceStatus.FileCount {
+		return fmt.Errorf("aborting: volume %d file count mismatch — source %d, target %d",
+			volumeId, sourceStatus.FileCount, targetStatus.FileCount)
+	}
+
+	// Step 6: Delete from source — after this, the move is committed.
+	// Clear the readonly flag so the defer doesn't try to restore writability.
 	t.ReportProgress(90.0)
 	t.GetLogger().Info("Deleting volume from source server")
-	if err := t.deleteVolume(sourceServer, volumeId); err != nil {
+	if err := t.deleteVolume(ctx, sourceServer, volumeId); err != nil {
 		return fmt.Errorf("failed to delete volume from source: %v", err)
 	}
+	sourceMarkedReadonly = false

 	t.ReportProgress(100.0)
 	glog.Infof("Balance task completed successfully: volume %d moved from %s to %s",
@@ -164,24 +196,35 @@ func (t *BalanceTask) GetProgress() float64 {

 // Helper methods for real balance operations

-// markVolumeReadonly marks the volume readonly
-func (t *BalanceTask) markVolumeReadonly(server pb.ServerAddress, volumeId needle.VolumeId) error {
+// markVolumeReadonly marks the volume readonly on the source server.
+func (t *BalanceTask) markVolumeReadonly(ctx context.Context, server pb.ServerAddress, volumeId needle.VolumeId) error {
 	return operation.WithVolumeServerClient(false, server, t.grpcDialOption,
 		func(client volume_server_pb.VolumeServerClient) error {
-			_, err := client.VolumeMarkReadonly(context.Background(), &volume_server_pb.VolumeMarkReadonlyRequest{
+			_, err := client.VolumeMarkReadonly(ctx, &volume_server_pb.VolumeMarkReadonlyRequest{
 				VolumeId: uint32(volumeId),
 			})
 			return err
 		})
 }

-// copyVolume copies volume from source to target server
-func (t *BalanceTask) copyVolume(sourceServer, targetServer pb.ServerAddress, volumeId needle.VolumeId) (uint64, error) {
+// markVolumeWritable restores the volume to writable on the source server.
+func (t *BalanceTask) markVolumeWritable(ctx context.Context, server pb.ServerAddress, volumeId needle.VolumeId) error {
+	return operation.WithVolumeServerClient(false, server, t.grpcDialOption,
+		func(client volume_server_pb.VolumeServerClient) error {
+			_, err := client.VolumeMarkWritable(ctx, &volume_server_pb.VolumeMarkWritableRequest{
+				VolumeId: uint32(volumeId),
+			})
+			return err
+		})
+}
+
+// copyVolume copies volume from source to target server.
+func (t *BalanceTask) copyVolume(ctx context.Context, sourceServer, targetServer pb.ServerAddress, volumeId needle.VolumeId) (uint64, error) {
 	var lastAppendAtNs uint64

 	err := operation.WithVolumeServerClient(true, targetServer, t.grpcDialOption,
 		func(client volume_server_pb.VolumeServerClient) error {
-			stream, err := client.VolumeCopy(context.Background(), &volume_server_pb.VolumeCopyRequest{
+			stream, err := client.VolumeCopy(ctx, &volume_server_pb.VolumeCopyRequest{
 				VolumeId:       uint32(volumeId),
 				SourceDataNode: string(sourceServer),
 			})
@@ -213,22 +256,11 @@ func (t *BalanceTask) copyVolume(sourceServer, targetServer pb.ServerAddress, vo
 	return lastAppendAtNs, err
 }

-// mountVolume mounts the volume on the target server
-func (t *BalanceTask) mountVolume(server pb.ServerAddress, volumeId needle.VolumeId) error {
-	return operation.WithVolumeServerClient(false, server, t.grpcDialOption,
-		func(client volume_server_pb.VolumeServerClient) error {
-			_, err := client.VolumeMount(context.Background(), &volume_server_pb.VolumeMountRequest{
-				VolumeId: uint32(volumeId),
-			})
-			return err
-		})
-}
-
-// tailVolume syncs remaining updates from source to target
-func (t *BalanceTask) tailVolume(sourceServer, targetServer pb.ServerAddress, volumeId needle.VolumeId, sinceNs uint64) error {
+// tailVolume syncs remaining updates from source to target.
+func (t *BalanceTask) tailVolume(ctx context.Context, sourceServer, targetServer pb.ServerAddress, volumeId needle.VolumeId, sinceNs uint64) error {
 	return operation.WithVolumeServerClient(true, targetServer, t.grpcDialOption,
 		func(client volume_server_pb.VolumeServerClient) error {
-			_, err := client.VolumeTailReceiver(context.Background(), &volume_server_pb.VolumeTailReceiverRequest{
+			_, err := client.VolumeTailReceiver(ctx, &volume_server_pb.VolumeTailReceiverRequest{
 				VolumeId:           uint32(volumeId),
 				SinceNs:            sinceNs,
 				IdleTimeoutSeconds: 60, // 1 minute timeout
@@ -238,11 +270,26 @@ func (t *BalanceTask) tailVolume(sourceServer, targetServer pb.ServerAddress, vo
 		})
 }

-// deleteVolume deletes the volume from the server
-func (t *BalanceTask) deleteVolume(server pb.ServerAddress, volumeId needle.VolumeId) error {
+// readVolumeFileStatus reads the volume's file status (sizes, file count) from a server.
+func (t *BalanceTask) readVolumeFileStatus(ctx context.Context, server pb.ServerAddress, volumeId needle.VolumeId) (*volume_server_pb.ReadVolumeFileStatusResponse, error) {
+	var resp *volume_server_pb.ReadVolumeFileStatusResponse
+	err := operation.WithVolumeServerClient(false, server, t.grpcDialOption,
+		func(client volume_server_pb.VolumeServerClient) error {
+			var err error
+			resp, err = client.ReadVolumeFileStatus(ctx,
+				&volume_server_pb.ReadVolumeFileStatusRequest{
+					VolumeId: uint32(volumeId),
+				})
+			return err
+		})
+	return resp, err
+}
+
+// deleteVolume deletes the volume from the server.
+func (t *BalanceTask) deleteVolume(ctx context.Context, server pb.ServerAddress, volumeId needle.VolumeId) error {
 	return operation.WithVolumeServerClient(false, server, t.grpcDialOption,
 		func(client volume_server_pb.VolumeServerClient) error {
-			_, err := client.VolumeDelete(context.Background(), &volume_server_pb.VolumeDeleteRequest{
+			_, err := client.VolumeDelete(ctx, &volume_server_pb.VolumeDeleteRequest{
 				VolumeId:  uint32(volumeId),
 				OnlyEmpty: false,
 			})
@@ -230,6 +230,33 @@ func detectForDiskType(diskType string, diskMetrics []*types.VolumeHealthMetrics
 			break
 		}

+		// When the global max and min effective counts differ by at most 1,
+		// no single move can improve balance — it would just swap which server
+		// is min vs max. Stop here to avoid infinite oscillation when the
+		// threshold is unachievable (e.g., 11 vols across 4 servers: best is
+		// 3/3/3/2, imbalance=36%). We scan ALL servers' effective counts so the
+		// check works regardless of whether utilization or raw counts are used.
+		globalMaxCount, globalMinCount := 0, math.MaxInt
+		for _, c := range effectiveCounts {
+			if c > globalMaxCount {
+				globalMaxCount = c
+			}
+			if c < globalMinCount {
+				globalMinCount = c
+			}
+		}
+		if globalMaxCount-globalMinCount <= 1 {
+			if len(results) == 0 {
+				glog.Infof("BALANCE [%s]: No tasks created - cluster as balanced as possible. Imbalance=%.1f%% (threshold=%.1f%%), but max-min diff is %d",
+					diskType, imbalanceRatio*100, balanceConfig.ImbalanceThreshold*100, globalMaxCount-globalMinCount)
+			} else {
+				glog.Infof("BALANCE [%s]: Created %d task(s), cluster as balanced as possible. Imbalance=%.1f%% (threshold=%.1f%%), max-min diff=%d",
+					diskType, len(results), imbalanceRatio*100, balanceConfig.ImbalanceThreshold*100, globalMaxCount-globalMinCount)
+			}
+			balanced = true
+			break
+		}
+
 		// Select a volume from the overloaded server using per-server cursor
 		var selectedVolume *types.VolumeHealthMetrics
 		serverVols := volumesByServer[maxServer]
@@ -252,11 +279,13 @@ func detectForDiskType(diskType string, diskMetrics []*types.VolumeHealthMetrics
 			continue
 		}

-		// Plan destination and create task.
-		// On failure, continue to the next volume on the same server rather
-		// than exhausting the entire server — the failure may be per-volume
-		// (e.g., volume not found in topology, AddPendingTask failed).
-		task, destServerID := createBalanceTask(diskType, selectedVolume, clusterInfo)
+		// Create task targeting minServer — the greedy algorithm's natural choice.
+		// Using minServer instead of letting planBalanceDestination independently
+		// pick a destination ensures that the detection loop's effective counts
+		// and the destination selection stay in sync. Without this, the topology's
+		// LoadCount-based scoring can diverge from the adjustment-based effective
+		// counts, causing moves to pile onto one server or oscillate (A→B, B→A).
+		task, destServerID := createBalanceTask(diskType, selectedVolume, clusterInfo, minServer)
 		if task == nil {
 			glog.V(1).Infof("BALANCE [%s]: Cannot plan task for volume %d on server %s, trying next volume", diskType, selectedVolume.VolumeID, maxServer)
 			continue
@@ -264,10 +293,18 @@ func detectForDiskType(diskType string, diskMetrics []*types.VolumeHealthMetrics

 		results = append(results, task)

-		// Adjust effective counts for the next iteration
+		// Adjust effective counts for the next iteration.
 		adjustments[maxServer]--
 		if destServerID != "" {
 			adjustments[destServerID]++
+			// If the destination server wasn't in serverVolumeCounts (e.g., a
+			// server with 0 volumes not seeded from topology), add it so
+			// subsequent iterations include it in effective/average/min/max.
+			if _, exists := serverVolumeCounts[destServerID]; !exists {
+				serverVolumeCounts[destServerID] = 0
+				sortedServers = append(sortedServers, destServerID)
+				sort.Strings(sortedServers)
+			}
 		}
 	}
@@ -277,9 +314,10 @@ func detectForDiskType(diskType string, diskMetrics []*types.VolumeHealthMetrics
 }

 // createBalanceTask creates a single balance task for the selected volume.
+// targetServer is the server ID chosen by the detection loop's greedy algorithm.
 // Returns (nil, "") if destination planning fails.
 // On success, returns the task result and the canonical destination server ID.
-func createBalanceTask(diskType string, selectedVolume *types.VolumeHealthMetrics, clusterInfo *types.ClusterInfo) (*types.TaskDetectionResult, string) {
+func createBalanceTask(diskType string, selectedVolume *types.VolumeHealthMetrics, clusterInfo *types.ClusterInfo, targetServer string) (*types.TaskDetectionResult, string) {
 	taskID := fmt.Sprintf("balance_vol_%d_%d", selectedVolume.VolumeID, time.Now().UnixNano())

 	task := &types.TaskDetectionResult{
@@ -300,10 +338,19 @@ func createBalanceTask(diskType string, selectedVolume *types.VolumeHealthMetric
 		return nil, ""
 	}

-	destinationPlan, err := planBalanceDestination(clusterInfo.ActiveTopology, selectedVolume)
+	// Resolve the target server chosen by the detection loop's effective counts.
+	// This keeps destination selection in sync with the greedy algorithm rather
+	// than relying on topology LoadCount which can diverge across iterations.
+	destinationPlan, err := resolveBalanceDestination(clusterInfo.ActiveTopology, selectedVolume, targetServer)
 	if err != nil {
-		glog.Warningf("Failed to plan balance destination for volume %d: %v", selectedVolume.VolumeID, err)
-		return nil, ""
+		// Fall back to score-based planning if the preferred target can't be resolved
+		glog.V(1).Infof("BALANCE [%s]: Cannot resolve target %s for volume %d, falling back to score-based planning: %v",
+			diskType, targetServer, selectedVolume.VolumeID, err)
+		destinationPlan, err = planBalanceDestination(clusterInfo.ActiveTopology, selectedVolume)
+		if err != nil {
+			glog.Warningf("Failed to plan balance destination for volume %d: %v", selectedVolume.VolumeID, err)
+			return nil, ""
+		}
 	}

 	// Find the actual disk containing the volume on the source server
@@ -383,8 +430,54 @@ func createBalanceTask(diskType string, selectedVolume *types.VolumeHealthMetric
 	return task, destinationPlan.TargetNode
 }

-// planBalanceDestination plans the destination for a balance operation
-// This function implements destination planning logic directly in the detection phase
+// resolveBalanceDestination resolves the destination for a balance operation
+// when the target server is already known (chosen by the detection loop's
+// effective volume counts). It finds the appropriate disk and address for the
+// target server in the topology.
+func resolveBalanceDestination(activeTopology *topology.ActiveTopology, selectedVolume *types.VolumeHealthMetrics, targetServer string) (*topology.DestinationPlan, error) {
+	topologyInfo := activeTopology.GetTopologyInfo()
+	if topologyInfo == nil {
+		return nil, fmt.Errorf("no topology info available")
+	}
+
+	// Find the target node in the topology and get its disk info
+	for _, dc := range topologyInfo.DataCenterInfos {
+		for _, rack := range dc.RackInfos {
+			for _, node := range rack.DataNodeInfos {
+				if node.Id != targetServer {
+					continue
+				}
+				// Find an available disk matching the volume's disk type
+				for diskTypeName, diskInfo := range node.DiskInfos {
+					if diskTypeName != selectedVolume.DiskType {
+						continue
+					}
+					if diskInfo.MaxVolumeCount > 0 && diskInfo.VolumeCount >= diskInfo.MaxVolumeCount {
+						continue // disk is full
+					}
+					targetAddress, err := util.ResolveServerAddress(node.Id, activeTopology)
+					if err != nil {
+						return nil, fmt.Errorf("failed to resolve address for target server %s: %v", node.Id, err)
+					}
+					return &topology.DestinationPlan{
+						TargetNode:    node.Id,
+						TargetAddress: targetAddress,
+						TargetDisk:    diskInfo.DiskId,
+						TargetRack:    rack.Id,
+						TargetDC:      dc.Id,
+						ExpectedSize:  selectedVolume.Size,
+					}, nil
+				}
+				return nil, fmt.Errorf("target server %s has no available disk of type %s", targetServer, selectedVolume.DiskType)
+			}
+		}
+	}
+	return nil, fmt.Errorf("target server %s not found in topology", targetServer)
+}
+
+// planBalanceDestination plans the destination for a balance operation using
+// score-based selection. Used as a fallback when the preferred target cannot
+// be resolved, and for single-move scenarios outside the detection loop.
 func planBalanceDestination(activeTopology *topology.ActiveTopology, selectedVolume *types.VolumeHealthMetrics) (*topology.DestinationPlan, error) {
 	// Get source node information from topology
 	var sourceRack, sourceDC string
@@ -917,3 +917,86 @@ func TestDetection_HeterogeneousCapacity(t *testing.T) {
 	}
 	t.Logf("First balance task: move from %s (correct: highest utilization)", firstSource)
 }
+
+// TestDetection_ZeroVolumeServerIncludedInBalance verifies that a server
+// with zero volumes (seeded from topology with a matching disk type) is
+// correctly included in the balance calculation and receives moves to
+// equalize the distribution.
+func TestDetection_ZeroVolumeServerIncludedInBalance(t *testing.T) {
+	// 4 servers total, but only 3 have volumes.
+	// node-d has a disk of the same type but zero volumes, so it appears in the
+	// topology and is seeded into serverVolumeCounts with count=0.
+	servers := []serverSpec{
+		{id: "node-a", diskType: "hdd", diskID: 1, dc: "dc1", rack: "rack1", maxVolumes: 20},
+		{id: "node-b", diskType: "hdd", diskID: 2, dc: "dc1", rack: "rack1", maxVolumes: 20},
+		{id: "node-c", diskType: "hdd", diskID: 3, dc: "dc1", rack: "rack1", maxVolumes: 20},
+		{id: "node-d", diskType: "hdd", diskID: 4, dc: "dc1", rack: "rack1", maxVolumes: 20},
+	}
+
+	// node-a: 8 volumes, node-b: 2, node-c: 1, node-d: 0
+	var metrics []*types.VolumeHealthMetrics
+	metrics = append(metrics, makeVolumes("node-a", "hdd", "dc1", "rack1", "", 1, 8)...)
+	metrics = append(metrics, makeVolumes("node-b", "hdd", "dc1", "rack1", "", 20, 2)...)
+	metrics = append(metrics, makeVolumes("node-c", "hdd", "dc1", "rack1", "", 30, 1)...)
+	// node-d has 0 volumes — no metrics
+
+	at := buildTopology(servers, metrics)
+	clusterInfo := &types.ClusterInfo{ActiveTopology: at}
+
+	tasks, _, err := Detection(metrics, clusterInfo, defaultConf(), 100)
+	if err != nil {
+		t.Fatalf("Detection failed: %v", err)
+	}
+
+	if len(tasks) == 0 {
+		t.Fatal("Expected balance tasks for 8/2/1/0 distribution, got 0")
+	}
+
+	assertNoDuplicateVolumes(t, tasks)
+
+	// With 11 volumes across 4 servers, the best achievable is 3/3/3/2
+	// (imbalance=36.4%), which exceeds the 20% threshold. The algorithm should
+	// stop when max-min<=1 rather than oscillating endlessly.
+	effective := computeEffectiveCounts(servers, metrics, tasks)
+	total := 0
+	maxC, minC := 0, len(metrics)
+	for _, c := range effective {
+		total += c
+		if c > maxC {
+			maxC = c
+		}
+		if c < minC {
+			minC = c
+		}
+	}
+
+	// The diff between max and min should be at most 1 (as balanced as possible)
+	if maxC-minC > 1 {
+		t.Errorf("After %d moves, distribution not optimally balanced: effective=%v, max-min=%d (want ≤1)",
+			len(tasks), effective, maxC-minC)
+	}
+
+	// Count destinations — moves should spread, not pile onto one server
+	destCounts := make(map[string]int)
+	for _, task := range tasks {
+		if task.TypedParams != nil && len(task.TypedParams.Targets) > 0 {
+			destCounts[task.TypedParams.Targets[0].Node]++
+		}
+	}
+
+	// Moves should go to at least 2 different destinations
+	if len(destCounts) < 2 {
+		t.Errorf("Expected moves to spread across destinations, but got: %v", destCounts)
+	}
+
+	// Should need only ~5 moves for 8/2/1/0 → 3/3/3/2, not 8+ (oscillation)
+	if len(tasks) > 8 {
+		t.Errorf("Too many moves (%d) — likely oscillating; expected ≤8 for this distribution", len(tasks))
+	}
+
+	avg := float64(total) / float64(len(effective))
+	imbalance := float64(maxC-minC) / avg
+	t.Logf("Distribution 8/2/1/0 → %v after %d moves (imbalance=%.1f%%)",
+		effective, len(tasks), imbalance*100)
+}