fix: EC rebalance fails with replica placement 000 (#7812)

* fix: EC rebalance fails with replica placement 000

This PR fixes several issues with EC shard distribution:

1. Pre-flight check before EC encoding
   - Verify target disk type has capacity before encoding starts
   - Prevents encoding shards only to fail during rebalance
   - Shows helpful error when wrong diskType is specified (e.g., ssd when volumes are on hdd)

2. Fix EC rebalance with replica placement 000
   - When DiffRackCount=0, shards should be distributed freely across racks
   - The '000' placement means 'no volume replication needed' because EC provides redundancy
   - Previously all racks were skipped with error 'shards X > replica placement limit (0)'

3. Add unit tests for EC rebalance slot calculation
   - TestECRebalanceWithLimitedSlots: documents the limited slots scenario
   - TestECRebalanceZeroFreeSlots: reproduces the 0 free slots error

4. Add Makefile for manual EC testing
   - make setup: start cluster and populate data
   - make shell: open weed shell for EC commands
   - make clean: stop cluster and cleanup

* fix: default -rebalance to true for ec.encode

The -rebalance flag was defaulting to false, which meant ec.encode would only print shard moves but not actually execute them. This is a poor default since the whole point of EC encoding is to distribute shards across servers for fault tolerance.

Now -rebalance defaults to true, so shards are actually distributed after encoding. Users can use -rebalance=false if they only want to see what would happen without making changes.

* test/erasure_coding: improve Makefile safety and docs

- Narrow pkill pattern for volume servers to use TEST_DIR instead of port pattern, avoiding accidental kills of unrelated SeaweedFS processes
- Document external dependencies (curl, jq) in header comments

* shell: refactor buildRackWithEcShards to reuse buildEcShards

Extract common shard bit construction logic to avoid duplication between buildEcShards and buildRackWithEcShards helper functions.

* shell: update test for EC replication 000 behavior

When DiffRackCount=0 (replication "000"), EC shards should be distributed freely across racks since erasure coding provides its own redundancy. Update test expectation to reflect this behavior.

* erasure_coding: add distribution package for proportional EC shard placement

Add a new reusable package for EC shard distribution that:
- Supports configurable EC ratios (not hard-coded 10+4)
- Distributes shards proportionally based on replication policy
- Provides fault tolerance analysis
- Prefers moving parity shards to keep data shards spread out

Key components:
- ECConfig: Configurable data/parity shard counts
- ReplicationConfig: Parsed XYZ replication policy
- ECDistribution: Target shard counts per DC/rack/node
- Rebalancer: Plans shard moves with parity-first strategy

This enables seaweed-enterprise custom EC ratios and weed worker integration while maintaining a clean, testable architecture.

* shell: integrate distribution package for EC rebalancing

Add shell wrappers around the distribution package:
- ProportionalECRebalancer: Plans moves using distribution.Rebalancer
- NewProportionalECRebalancerWithConfig: Supports custom EC configs
- GetDistributionSummary/GetFaultToleranceAnalysis: Helper functions

The shell layer converts between EcNode types and the generic TopologyNode types used by the distribution package.

* test setup

* ec: improve data and parity shard distribution across racks

- Add shardsByTypePerRack helper to track data vs parity shards
- Rewrite doBalanceEcShardsAcrossRacks for two-pass balancing:
  1. Balance data shards (0-9) evenly, max ceil(10/6)=2 per rack
  2. Balance parity shards (10-13) evenly, max ceil(4/6)=1 per rack
- Add balanceShardTypeAcrossRacks for generic shard type balancing
- Add pickRackForShardType to select destination with room for type
- Add unit tests for even data/parity distribution verification

This ensures even read load during normal operation by spreading both data and parity shards across all available racks.

* ec: make data/parity shard counts configurable in ecBalancer

- Add dataShardCount and parityShardCount fields to ecBalancer struct
- Add getDataShardCount() and getParityShardCount() methods with defaults
- Replace direct constant usage with configurable methods
- Fix unused variable warning for parityPerRack

This allows seaweed-enterprise to use custom EC ratios while defaulting to standard 10+4 scheme.

* Address PR 7812 review comments

Makefile improvements:
- Save PIDs for each volume server for precise termination
- Use PID-based killing in stop target with pkill fallback
- Use more specific pkill patterns with TEST_DIR paths

Documentation:
- Document jq dependency in README.md

Rebalancer fix:
- Fix duplicate shard count updates in applyMovesToAnalysis
- All planners (DC/rack/node) update counts inline during planning
- Remove duplicate updates from applyMovesToAnalysis to avoid double-counting

* test/erasure_coding: use mktemp for test file template

Use mktemp instead of hardcoded /tmp/testfile_template.bin path to provide better isolation for concurrent test runs.
2025-12-19 13:29:12 -08:00
parent 77a56c2857
commit 4aa50bfa6a
16 changed files with 3127 additions and 28 deletions
--- a/weed/storage/erasure_coding/distribution/rebalancer.go
+++ b/weed/storage/erasure_coding/distribution/rebalancer.go
@@ -0,0 +1,378 @@
+package distribution
+
+import (
+	"fmt"
+	"slices"
+)
+
+// ShardMove represents a planned shard move
+type ShardMove struct {
+	ShardID    int
+	SourceNode *TopologyNode
+	DestNode   *TopologyNode
+	Reason     string
+}
+
+// String returns a human-readable description of the move
+func (m ShardMove) String() string {
+	return fmt.Sprintf("shard %d: %s -> %s (%s)",
+		m.ShardID, m.SourceNode.NodeID, m.DestNode.NodeID, m.Reason)
+}
+
+// RebalancePlan contains the complete plan for rebalancing EC shards
+type RebalancePlan struct {
+	Moves        []ShardMove
+	Distribution *ECDistribution
+	Analysis     *TopologyAnalysis
+
+	// Statistics
+	TotalMoves      int
+	MovesAcrossDC   int
+	MovesAcrossRack int
+	MovesWithinRack int
+}
+
+// String returns a summary of the plan
+func (p *RebalancePlan) String() string {
+	return fmt.Sprintf("RebalancePlan{moves:%d, acrossDC:%d, acrossRack:%d, withinRack:%d}",
+		p.TotalMoves, p.MovesAcrossDC, p.MovesAcrossRack, p.MovesWithinRack)
+}
+
+// DetailedString returns a detailed multi-line summary
+func (p *RebalancePlan) DetailedString() string {
+	s := "Rebalance Plan:\n"
+	s += fmt.Sprintf("  Total Moves: %d\n", p.TotalMoves)
+	s += fmt.Sprintf("  Across DC: %d\n", p.MovesAcrossDC)
+	s += fmt.Sprintf("  Across Rack: %d\n", p.MovesAcrossRack)
+	s += fmt.Sprintf("  Within Rack: %d\n", p.MovesWithinRack)
+	s += "\nMoves:\n"
+	for i, move := range p.Moves {
+		s += fmt.Sprintf("  %d. %s\n", i+1, move.String())
+	}
+	return s
+}
+
+// Rebalancer plans shard moves to achieve proportional distribution
+type Rebalancer struct {
+	ecConfig  ECConfig
+	repConfig ReplicationConfig
+}
+
+// NewRebalancer creates a new rebalancer with the given configuration
+func NewRebalancer(ec ECConfig, rep ReplicationConfig) *Rebalancer {
+	return &Rebalancer{
+		ecConfig:  ec,
+		repConfig: rep,
+	}
+}
+
+// PlanRebalance creates a rebalancing plan from the current topology analysis; it mutates analysis in place while planning.
+func (r *Rebalancer) PlanRebalance(analysis *TopologyAnalysis) (*RebalancePlan, error) {
+	dist := CalculateDistribution(r.ecConfig, r.repConfig)
+
+	plan := &RebalancePlan{
+		Distribution: dist,
+		Analysis:     analysis,
+	}
+
+	// Step 1: Balance across data centers
+	dcMoves := r.planDCMoves(analysis, dist)
+	for _, move := range dcMoves {
+		plan.Moves = append(plan.Moves, move)
+		plan.MovesAcrossDC++
+	}
+
+	// Update analysis after DC moves (for planning purposes)
+	r.applyMovesToAnalysis(analysis, dcMoves)
+
+	// Step 2: Balance across racks within each DC
+	rackMoves := r.planRackMoves(analysis, dist)
+	for _, move := range rackMoves {
+		plan.Moves = append(plan.Moves, move)
+		plan.MovesAcrossRack++
+	}
+
+	// Update analysis after rack moves
+	r.applyMovesToAnalysis(analysis, rackMoves)
+
+	// Step 3: Balance across nodes within each rack
+	nodeMoves := r.planNodeMoves(analysis, dist)
+	for _, move := range nodeMoves {
+		plan.Moves = append(plan.Moves, move)
+		plan.MovesWithinRack++
+	}
+
+	plan.TotalMoves = len(plan.Moves)
+
+	return plan, nil
+}
+
+// planDCMoves plans moves to balance shards across data centers
+func (r *Rebalancer) planDCMoves(analysis *TopologyAnalysis, dist *ECDistribution) []ShardMove {
+	var moves []ShardMove
+
+	overDCs := CalculateDCExcess(analysis, dist)
+	underDCs := CalculateUnderservedDCs(analysis, dist)
+
+	underIdx := 0
+	for _, over := range overDCs {
+		for over.Excess > 0 && underIdx < len(underDCs) {
+			destDC := underDCs[underIdx]
+
+			// Find a shard and source node
+			shardID, srcNode := r.pickShardToMove(analysis, over.Nodes)
+			if srcNode == nil {
+				break
+			}
+
+			// Find destination node in target DC
+			destNode := r.pickBestDestination(analysis, destDC, "", dist)
+			if destNode == nil {
+				underIdx++
+				continue
+			}
+
+			moves = append(moves, ShardMove{
+				ShardID:    shardID,
+				SourceNode: srcNode,
+				DestNode:   destNode,
+				Reason:     fmt.Sprintf("balance DC: %s -> %s", srcNode.DataCenter, destDC),
+			})
+
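+			over.Excess--
+			analysis.ShardsByDC[srcNode.DataCenter]--
+			analysis.ShardsByDC[destDC]++
+
+			// Keep per-rack and per-node state in sync so that later picks see
+			// this move and the same shard is not selected twice.
+			analysis.ShardsByRack[srcNode.Rack]--
+			analysis.ShardsByRack[destNode.Rack]++
+			analysis.ShardsByNode[srcNode.NodeID]--
+			analysis.ShardsByNode[destNode.NodeID]++
+			if i := slices.Index(analysis.NodeToShards[srcNode.NodeID], shardID); i >= 0 {
+				analysis.NodeToShards[srcNode.NodeID] = slices.Delete(analysis.NodeToShards[srcNode.NodeID], i, i+1)
+			}
+			analysis.NodeToShards[destNode.NodeID] = append(analysis.NodeToShards[destNode.NodeID], shardID)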
+
+			// Check if destDC reached target
+			if analysis.ShardsByDC[destDC] >= dist.TargetShardsPerDC {
+				underIdx++
+			}
+		}
+	}
+
+	return moves
+}
+
+// planRackMoves plans moves to balance shards across racks within each DC
+func (r *Rebalancer) planRackMoves(analysis *TopologyAnalysis, dist *ECDistribution) []ShardMove {
+	var moves []ShardMove
+
+	for dc := range analysis.DCToRacks {
+		dcShards := analysis.ShardsByDC[dc]
+		numRacks := len(analysis.DCToRacks[dc])
+		if numRacks == 0 {
+			continue
+		}
+
+		targetPerRack := ceilDivide(dcShards, max(numRacks, dist.ReplicationConfig.MinRacksPerDC))
+
+		overRacks := CalculateRackExcess(analysis, dc, targetPerRack)
+		underRacks := CalculateUnderservedRacks(analysis, dc, targetPerRack)
+
+		underIdx := 0
+		for _, over := range overRacks {
+			for over.Excess > 0 && underIdx < len(underRacks) {
+				destRack := underRacks[underIdx]
+
+				// Find shard and source node
+				shardID, srcNode := r.pickShardToMove(analysis, over.Nodes)
+				if srcNode == nil {
+					break
+				}
+
+				// Find destination node in target rack
+				destNode := r.pickBestDestination(analysis, dc, destRack, dist)
+				if destNode == nil {
+					underIdx++
+					continue
+				}
+
+				moves = append(moves, ShardMove{
+					ShardID:    shardID,
+					SourceNode: srcNode,
+					DestNode:   destNode,
+					Reason:     fmt.Sprintf("balance rack: %s -> %s", srcNode.Rack, destRack),
+				})
+
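+				over.Excess--
+				analysis.ShardsByRack[srcNode.Rack]--
+				analysis.ShardsByRack[destRack]++
+
+				// Keep per-node state in sync so that later picks see this move
+				// and the same shard is not selected twice.
+				analysis.ShardsByNode[srcNode.NodeID]--
+				analysis.ShardsByNode[destNode.NodeID]++
+				if i := slices.Index(analysis.NodeToShards[srcNode.NodeID], shardID); i >= 0 {
+					analysis.NodeToShards[srcNode.NodeID] = slices.Delete(analysis.NodeToShards[srcNode.NodeID], i, i+1)
+				}
+				analysis.NodeToShards[destNode.NodeID] = append(analysis.NodeToShards[destNode.NodeID], shardID)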
+
+				if analysis.ShardsByRack[destRack] >= targetPerRack {
+					underIdx++
+				}
+			}
+		}
+	}
+
+	return moves
+}
+
+// planNodeMoves plans moves to balance shards across nodes within each rack
+func (r *Rebalancer) planNodeMoves(analysis *TopologyAnalysis, dist *ECDistribution) []ShardMove {
+	var moves []ShardMove
+
+	for rack, nodes := range analysis.RackToNodes {
+		if len(nodes) <= 1 {
+			continue
+		}
+
+		rackShards := analysis.ShardsByRack[rack]
+		targetPerNode := ceilDivide(rackShards, max(len(nodes), dist.ReplicationConfig.MinNodesPerRack))
+
+		// Find over and under nodes
+		var overNodes []*TopologyNode
+		var underNodes []*TopologyNode
+
+		for _, node := range nodes {
+			count := analysis.ShardsByNode[node.NodeID]
+			if count > targetPerNode {
+				overNodes = append(overNodes, node)
+			} else if count < targetPerNode {
+				underNodes = append(underNodes, node)
+			}
+		}
+
+		// Sort over-target nodes by shard count, most loaded first
+		slices.SortFunc(overNodes, func(a, b *TopologyNode) int {
+			return analysis.ShardsByNode[b.NodeID] - analysis.ShardsByNode[a.NodeID]
+		})
+
+		underIdx := 0
+		for _, srcNode := range overNodes {
+			excess := analysis.ShardsByNode[srcNode.NodeID] - targetPerNode
+
+			for excess > 0 && underIdx < len(underNodes) {
+				destNode := underNodes[underIdx]
+
+				// Pick a shard from this node, preferring parity shards
+				shards := analysis.NodeToShards[srcNode.NodeID]
+				if len(shards) == 0 {
+					break
+				}
+
+				// Find a parity shard first, fallback to data shard
+				shardID := -1
+				shardIdx := -1
+				for i, s := range shards {
+					if r.ecConfig.IsParityShard(s) {
+						shardID = s
+						shardIdx = i
+						break
+					}
+				}
+				if shardID == -1 {
+					shardID = shards[0]
+					shardIdx = 0
+				}
+
+				moves = append(moves, ShardMove{
+					ShardID:    shardID,
+					SourceNode: srcNode,
+					DestNode:   destNode,
+					Reason:     fmt.Sprintf("balance node: %s -> %s", srcNode.NodeID, destNode.NodeID),
+				})
+
+				excess--
+				analysis.ShardsByNode[srcNode.NodeID]--
+				analysis.ShardsByNode[destNode.NodeID]++
+
+				// Update shard lists - remove the specific shard we picked
+				analysis.NodeToShards[srcNode.NodeID] = append(
+					shards[:shardIdx], shards[shardIdx+1:]...)
+				analysis.NodeToShards[destNode.NodeID] = append(
+					analysis.NodeToShards[destNode.NodeID], shardID)
+
+				if analysis.ShardsByNode[destNode.NodeID] >= targetPerNode {
+					underIdx++
+				}
+			}
+		}
+	}
+
+	return moves
+}
+
+// pickShardToMove selects a shard and its node from the given nodes.
+// It prefers to move parity shards first, keeping data shards spread out
+// since data shards serve read requests while parity shards are only for reconstruction.
+func (r *Rebalancer) pickShardToMove(analysis *TopologyAnalysis, nodes []*TopologyNode) (int, *TopologyNode) {
+	// Sort by shard count (most shards first)
+	slices.SortFunc(nodes, func(a, b *TopologyNode) int {
+		return analysis.ShardsByNode[b.NodeID] - analysis.ShardsByNode[a.NodeID]
+	})
+
+	// First pass: try to find a parity shard to move (prefer moving parity)
+	for _, node := range nodes {
+		shards := analysis.NodeToShards[node.NodeID]
+		for _, shardID := range shards {
+			if r.ecConfig.IsParityShard(shardID) {
+				return shardID, node
+			}
+		}
+	}
+
+	// Second pass: if no parity shards, move a data shard
+	for _, node := range nodes {
+		shards := analysis.NodeToShards[node.NodeID]
+		if len(shards) > 0 {
+			return shards[0], node
+		}
+	}
+
+	return -1, nil
+}
+
+// pickBestDestination selects the best destination node
+func (r *Rebalancer) pickBestDestination(analysis *TopologyAnalysis, targetDC, targetRack string, dist *ECDistribution) *TopologyNode {
+	var candidates []*TopologyNode
+
+	// Collect candidates
+	for _, node := range analysis.AllNodes {
+		// Filter by DC if specified
+		if targetDC != "" && node.DataCenter != targetDC {
+			continue
+		}
+		// Filter by rack if specified
+		if targetRack != "" && node.Rack != targetRack {
+			continue
+		}
+		// Check capacity
+		if node.FreeSlots <= 0 {
+			continue
+		}
+		// Check max shards limit
+		if analysis.ShardsByNode[node.NodeID] >= dist.MaxShardsPerNode {
+			continue
+		}
+
+		candidates = append(candidates, node)
+	}
+
+	if len(candidates) == 0 {
+		return nil
+	}
+
+	// Sort by: 1) fewer shards, 2) more free slots
+	slices.SortFunc(candidates, func(a, b *TopologyNode) int {
+		aShards := analysis.ShardsByNode[a.NodeID]
+		bShards := analysis.ShardsByNode[b.NodeID]
+		if aShards != bShards {
+			return aShards - bShards
+		}
+		return b.FreeSlots - a.FreeSlots
+	})
+
+	return candidates[0]
+}
+
+// applyMovesToAnalysis is a no-op placeholder for potential future use.
+// Note: All planners (planDCMoves, planRackMoves, planNodeMoves) update
+// their respective counts (ShardsByDC, ShardsByRack, ShardsByNode) and
+// shard lists (NodeToShards) inline during planning. This avoids duplicate
+// updates that would occur if we also updated counts here.
+func (r *Rebalancer) applyMovesToAnalysis(analysis *TopologyAnalysis, moves []ShardMove) {
+	// Counts are already updated by the individual planners.
+	// This function is kept for API compatibility and potential future use.
+}
+