fix: EC rebalance fails with replica placement 000 (#7812)
* fix: EC rebalance fails with replica placement 000

  This PR fixes several issues with EC shard distribution:

  1. Pre-flight check before EC encoding
     - Verify the target disk type has capacity before encoding starts
     - Prevents encoding shards only to fail during rebalance
     - Shows a helpful error when the wrong diskType is specified (e.g., ssd when volumes are on hdd)

  2. Fix EC rebalance with replica placement 000
     - When DiffRackCount=0, shards should be distributed freely across racks
     - The "000" placement means "no volume replication needed" because EC provides redundancy
     - Previously all racks were skipped with the error "shards X > replica placement limit (0)"

  3. Add unit tests for EC rebalance slot calculation
     - TestECRebalanceWithLimitedSlots: documents the limited-slots scenario
     - TestECRebalanceZeroFreeSlots: reproduces the "0 free slots" error

  4. Add Makefile for manual EC testing
     - make setup: start cluster and populate data
     - make shell: open weed shell for EC commands
     - make clean: stop cluster and clean up

* fix: default -rebalance to true for ec.encode

  The -rebalance flag defaulted to false, which meant ec.encode would only print shard moves but not actually execute them. This is a poor default, since the whole point of EC encoding is to distribute shards across servers for fault tolerance. Now -rebalance defaults to true, so shards are actually distributed after encoding. Users can pass -rebalance=false to see what would happen without making changes.

* test/erasure_coding: improve Makefile safety and docs

  - Narrow the pkill pattern for volume servers to use TEST_DIR instead of a port pattern, avoiding accidental kills of unrelated SeaweedFS processes
  - Document external dependencies (curl, jq) in the header comments

* shell: refactor buildRackWithEcShards to reuse buildEcShards

  Extract the common shard-bit construction logic to avoid duplication between the buildEcShards and buildRackWithEcShards helper functions.

* shell: update test for EC replication 000 behavior

  When DiffRackCount=0 (replication "000"), EC shards should be distributed freely across racks, since erasure coding provides its own redundancy. Update the test expectation to reflect this behavior.

* erasure_coding: add distribution package for proportional EC shard placement

  Add a new reusable package for EC shard distribution that:
  - Supports configurable EC ratios (not hard-coded 10+4)
  - Distributes shards proportionally based on the replication policy
  - Provides fault-tolerance analysis
  - Prefers moving parity shards, to keep data shards spread out

  Key components:
  - ECConfig: configurable data/parity shard counts
  - ReplicationConfig: parsed XYZ replication policy
  - ECDistribution: target shard counts per DC/rack/node
  - Rebalancer: plans shard moves with a parity-first strategy

  This enables seaweed-enterprise custom EC ratios and weed worker integration while maintaining a clean, testable architecture.

* shell: integrate distribution package for EC rebalancing

  Add shell wrappers around the distribution package:
  - ProportionalECRebalancer: plans moves using distribution.Rebalancer
  - NewProportionalECRebalancerWithConfig: supports custom EC configs
  - GetDistributionSummary/GetFaultToleranceAnalysis: helper functions

  The shell layer converts between EcNode types and the generic TopologyNode types used by the distribution package.

* test setup

* ec: improve data and parity shard distribution across racks

  - Add a shardsByTypePerRack helper to track data vs. parity shards
  - Rewrite doBalanceEcShardsAcrossRacks as a two-pass balancer:
    1. Balance data shards (0-9) evenly, max ceil(10/6)=2 per rack
    2. Balance parity shards (10-13) evenly, max ceil(4/6)=1 per rack
  - Add balanceShardTypeAcrossRacks for generic shard-type balancing
  - Add pickRackForShardType to select a destination with room for the type
  - Add unit tests verifying even data/parity distribution

  This ensures an even read load during normal operation by spreading both data and parity shards across all available racks.

* ec: make data/parity shard counts configurable in ecBalancer

  - Add dataShardCount and parityShardCount fields to the ecBalancer struct
  - Add getDataShardCount() and getParityShardCount() methods with defaults
  - Replace direct constant usage with the configurable methods
  - Fix an unused-variable warning for parityPerRack

  This allows seaweed-enterprise to use custom EC ratios while defaulting to the standard 10+4 scheme.

* Address PR 7812 review comments

  Makefile improvements:
  - Save PIDs for each volume server for precise termination
  - Use PID-based killing in the stop target, with a pkill fallback
  - Use more specific pkill patterns with TEST_DIR paths

  Documentation:
  - Document the jq dependency in README.md

  Rebalancer fix:
  - Fix duplicate shard count updates in applyMovesToAnalysis
  - All planners (DC/rack/node) update counts inline during planning
  - Remove the duplicate updates from applyMovesToAnalysis to avoid double-counting

* test/erasure_coding: use mktemp for test file template

  Use mktemp instead of the hardcoded /tmp/testfile_template.bin path to provide better isolation for concurrent test runs.
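The per-rack caps and the corrected "000" gate described above can be sketched as a small standalone Go example. This is a minimal illustration, not the actual shell package code: ceilDivide mirrors the balancer helper of the same name, while rackAllowed is a hypothetical condensation of the DiffRackCount check.

```go
package main

import "fmt"

// ceilDivide mirrors the balancer helper: integer ceil(a/b) for positive b.
func ceilDivide(a, b int) int {
	return (a + b - 1) / b
}

// rackAllowed condenses the corrected placement gate: the DiffRackCount limit
// only applies when it is positive, so replica placement "000"
// (DiffRackCount=0) never rejects a rack as a shard destination.
func rackAllowed(shardsOnRack, diffRackCount int) bool {
	return !(diffRackCount > 0 && shardsOnRack > diffRackCount)
}

func main() {
	dataShards, parityShards, numRacks := 10, 4, 6
	fmt.Printf("max data shards per rack:   %d\n", ceilDivide(dataShards, numRacks))   // ceil(10/6) = 2
	fmt.Printf("max parity shards per rack: %d\n", ceilDivide(parityShards, numRacks)) // ceil(4/6) = 1

	// Before the fix, any rack already holding a shard was skipped under "000"
	// ("shards 1 > replica placement limit (0)"); now it stays eligible.
	fmt.Println(rackAllowed(1, 0))
}
```

With 6 racks, the two-pass balancer caps each rack at 2 data shards and 1 parity shard, and the gate only rejects racks when a positive DiffRackCount is actually exceeded.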
@@ -679,6 +679,25 @@ type ecBalancer struct {
 	applyBalancing     bool
 	maxParallelization int
 	diskType           types.DiskType // target disk type for EC shards (default: HardDriveType)
+
+	// EC configuration for shard distribution (defaults to 10+4)
+	dataShardCount   int
+	parityShardCount int
 }
 
+// getDataShardCount returns the configured data shard count, defaulting to standard 10
+func (ecb *ecBalancer) getDataShardCount() int {
+	if ecb.dataShardCount > 0 {
+		return ecb.dataShardCount
+	}
+	return erasure_coding.DataShardsCount
+}
+
+// getParityShardCount returns the configured parity shard count, defaulting to standard 4
+func (ecb *ecBalancer) getParityShardCount() int {
+	if ecb.parityShardCount > 0 {
+		return ecb.parityShardCount
+	}
+	return erasure_coding.ParityShardsCount
+}
+
 func (ecb *ecBalancer) errorWaitGroup() *ErrorWaitGroup {
@@ -785,59 +804,176 @@ func countShardsByRack(vid needle.VolumeId, locations []*EcNode, diskType types.
 	})
 }
 
+// shardsByTypePerRack counts data shards (< dataShards) and parity shards (>= dataShards) per rack
+func shardsByTypePerRack(vid needle.VolumeId, locations []*EcNode, diskType types.DiskType, dataShards int) (dataPerRack, parityPerRack map[string][]erasure_coding.ShardId) {
+	dataPerRack = make(map[string][]erasure_coding.ShardId)
+	parityPerRack = make(map[string][]erasure_coding.ShardId)
+	for _, ecNode := range locations {
+		shardBits := findEcVolumeShards(ecNode, vid, diskType)
+		rackId := string(ecNode.rack)
+		for _, shardId := range shardBits.ShardIds() {
+			if int(shardId) < dataShards {
+				dataPerRack[rackId] = append(dataPerRack[rackId], shardId)
+			} else {
+				parityPerRack[rackId] = append(parityPerRack[rackId], shardId)
+			}
+		}
+	}
+	return
+}
+
 func (ecb *ecBalancer) doBalanceEcShardsAcrossRacks(collection string, vid needle.VolumeId, locations []*EcNode) error {
 	racks := ecb.racks()
+	numRacks := len(racks)
 
-	// see the volume's shards are in how many racks, and how many in each rack
-	rackToShardCount := countShardsByRack(vid, locations, ecb.diskType)
+	// Use configured EC scheme for shard type classification (defaults to 10+4)
+	dataShardCount := ecb.getDataShardCount()
+	parityShardCount := ecb.getParityShardCount()
 
-	// Calculate actual total shards for this volume (not hardcoded default)
-	var totalShardsForVolume int
-	for _, count := range rackToShardCount {
-		totalShardsForVolume += count
-	}
-	// calculate average number of shards an ec rack should have for one volume
-	averageShardsPerEcRack := ceilDivide(totalShardsForVolume, len(racks))
+	// Get current distribution of data shards per rack (parity computed after data balancing)
+	dataPerRack, _ := shardsByTypePerRack(vid, locations, ecb.diskType, dataShardCount)
+
+	// Calculate max shards per rack for each type to ensure even spread
+	// Data: 10 shards / 6 racks = max 2 per rack
+	// Parity: 4 shards / 6 racks = max 1 per rack (with 2 racks having 0)
+	maxDataPerRack := ceilDivide(dataShardCount, numRacks)
+	maxParityPerRack := ceilDivide(parityShardCount, numRacks)
+
 	rackEcNodesWithVid := groupBy(locations, func(ecNode *EcNode) string {
 		return string(ecNode.rack)
 	})
 
-	// ecShardsToMove = select overflown ec shards from racks with ec shard counts > averageShardsPerEcRack
-	ecShardsToMove := make(map[erasure_coding.ShardId]*EcNode)
-	for rackId, count := range rackToShardCount {
-		if count <= averageShardsPerEcRack {
-			continue
-		}
-		possibleEcNodes := rackEcNodesWithVid[rackId]
-		for shardId, ecNode := range pickNEcShardsToMoveFrom(possibleEcNodes, vid, count-averageShardsPerEcRack, ecb.diskType) {
-			ecShardsToMove[shardId] = ecNode
-		}
-	}
+	// Track total shard count per rack for slot management
+	rackToShardCount := countShardsByRack(vid, locations, ecb.diskType)
 
-	for shardId, ecNode := range ecShardsToMove {
-		rackId, err := ecb.pickRackToBalanceShardsInto(racks, rackToShardCount)
-		if err != nil {
-			fmt.Printf("ec shard %d.%d at %s can not find a destination rack:\n%s\n", vid, shardId, ecNode.info.Id, err.Error())
-			continue
-		}
+	// First pass: Balance data shards across racks
+	if err := ecb.balanceShardTypeAcrossRacks(collection, vid, racks, rackEcNodesWithVid, dataPerRack, rackToShardCount, maxDataPerRack, "data"); err != nil {
+		return err
+	}
 
-		var possibleDestinationEcNodes []*EcNode
-		for _, n := range racks[rackId].ecNodes {
-			possibleDestinationEcNodes = append(possibleDestinationEcNodes, n)
-		}
-		err = ecb.pickOneEcNodeAndMoveOneShard(ecNode, collection, vid, shardId, possibleDestinationEcNodes)
-		if err != nil {
-			return err
-		}
-		rackToShardCount[string(rackId)] += 1
-		rackToShardCount[string(ecNode.rack)] -= 1
-		racks[rackId].freeEcSlot -= 1
-		racks[ecNode.rack].freeEcSlot += 1
-	}
+	// Refresh locations after data shard moves and get parity distribution
+	locations = ecb.collectVolumeIdToEcNodes(collection)[vid]
+	_, parityPerRack := shardsByTypePerRack(vid, locations, ecb.diskType, dataShardCount)
+	rackEcNodesWithVid = groupBy(locations, func(ecNode *EcNode) string {
+		return string(ecNode.rack)
+	})
+	rackToShardCount = countShardsByRack(vid, locations, ecb.diskType)
+
+	// Second pass: Balance parity shards across racks
+	if err := ecb.balanceShardTypeAcrossRacks(collection, vid, racks, rackEcNodesWithVid, parityPerRack, rackToShardCount, maxParityPerRack, "parity"); err != nil {
+		return err
+	}
 
 	return nil
 }
 
+// balanceShardTypeAcrossRacks spreads shards of a specific type (data or parity) evenly across racks
+func (ecb *ecBalancer) balanceShardTypeAcrossRacks(
+	collection string,
+	vid needle.VolumeId,
+	racks map[RackId]*EcRack,
+	rackEcNodesWithVid map[string][]*EcNode,
+	shardsPerRack map[string][]erasure_coding.ShardId,
+	rackToShardCount map[string]int,
+	maxPerRack int,
+	shardType string,
+) error {
+	// Find racks with too many shards of this type
+	shardsToMove := make(map[erasure_coding.ShardId]*EcNode)
+	for rackId, shards := range shardsPerRack {
+		if len(shards) <= maxPerRack {
+			continue
+		}
+		// Pick excess shards to move
+		excess := len(shards) - maxPerRack
+		ecNodes := rackEcNodesWithVid[rackId]
+		for i := 0; i < excess && i < len(shards); i++ {
+			shardId := shards[i]
+			// Find which node has this shard
+			for _, ecNode := range ecNodes {
+				shardBits := findEcVolumeShards(ecNode, vid, ecb.diskType)
+				if shardBits.HasShardId(shardId) {
+					shardsToMove[shardId] = ecNode
+					break
+				}
+			}
+		}
+	}
+
+	// Move shards to racks that have fewer than maxPerRack of this type
+	for shardId, ecNode := range shardsToMove {
+		// Find destination rack with room for this shard type
+		destRackId, err := ecb.pickRackForShardType(racks, shardsPerRack, maxPerRack, rackToShardCount)
+		if err != nil {
+			fmt.Printf("ec %s shard %d.%d at %s can not find a destination rack:\n%s\n", shardType, vid, shardId, ecNode.info.Id, err.Error())
+			continue
+		}
+
+		var possibleDestinationEcNodes []*EcNode
+		for _, n := range racks[destRackId].ecNodes {
+			possibleDestinationEcNodes = append(possibleDestinationEcNodes, n)
+		}
+		err = ecb.pickOneEcNodeAndMoveOneShard(ecNode, collection, vid, shardId, possibleDestinationEcNodes)
+		if err != nil {
+			return err
+		}
+
+		// Update tracking
+		shardsPerRack[string(destRackId)] = append(shardsPerRack[string(destRackId)], shardId)
+		// Remove from source rack
+		srcRack := string(ecNode.rack)
+		for i, s := range shardsPerRack[srcRack] {
+			if s == shardId {
+				shardsPerRack[srcRack] = append(shardsPerRack[srcRack][:i], shardsPerRack[srcRack][i+1:]...)
+				break
+			}
+		}
+		rackToShardCount[string(destRackId)] += 1
+		rackToShardCount[srcRack] -= 1
+		racks[destRackId].freeEcSlot -= 1
+		racks[RackId(srcRack)].freeEcSlot += 1
+	}
+
+	return nil
+}
+
+// pickRackForShardType selects a rack that has room for more shards of a specific type
+func (ecb *ecBalancer) pickRackForShardType(
+	rackToEcNodes map[RackId]*EcRack,
+	shardsPerRack map[string][]erasure_coding.ShardId,
+	maxPerRack int,
+	rackToShardCount map[string]int,
+) (RackId, error) {
+	var candidates []RackId
+	minShards := maxPerRack + 1
+
+	for rackId, rack := range rackToEcNodes {
+		if rack.freeEcSlot <= 0 {
+			continue
+		}
+		currentCount := len(shardsPerRack[string(rackId)])
+		if currentCount >= maxPerRack {
+			continue
+		}
+		// For EC shards, replica placement constraint only applies when DiffRackCount > 0.
+		if ecb.replicaPlacement != nil && ecb.replicaPlacement.DiffRackCount > 0 && rackToShardCount[string(rackId)] >= ecb.replicaPlacement.DiffRackCount {
+			continue
+		}
+		if currentCount < minShards {
+			candidates = nil
+			minShards = currentCount
+		}
+		if currentCount == minShards {
+			candidates = append(candidates, rackId)
+		}
+	}
+
+	if len(candidates) == 0 {
+		return "", errors.New("no rack available for shard type balancing")
+	}
+	return candidates[rand.IntN(len(candidates))], nil
+}
+
 func (ecb *ecBalancer) pickRackToBalanceShardsInto(rackToEcNodes map[RackId]*EcRack, rackToShardCount map[string]int) (RackId, error) {
 	targets := []RackId{}
 	targetShards := -1
@@ -855,7 +991,11 @@ func (ecb *ecBalancer) pickRackToBalanceShardsInto(rackToEcNodes map[RackId]*EcR
 			details += fmt.Sprintf(" Skipped %s because it has no free slots\n", rackId)
 			continue
 		}
-		if ecb.replicaPlacement != nil && shards > ecb.replicaPlacement.DiffRackCount {
+		// For EC shards, replica placement constraint only applies when DiffRackCount > 0.
+		// When DiffRackCount = 0 (e.g., replica placement "000"), EC shards should be
+		// distributed freely across racks for fault tolerance - the "000" means
+		// "no volume replication needed" because erasure coding provides redundancy.
+		if ecb.replicaPlacement != nil && ecb.replicaPlacement.DiffRackCount > 0 && shards > ecb.replicaPlacement.DiffRackCount {
 			details += fmt.Sprintf(" Skipped %s because shards %d > replica placement limit for other racks (%d)\n", rackId, shards, ecb.replicaPlacement.DiffRackCount)
 			continue
 		}
@@ -1056,7 +1196,11 @@ func (ecb *ecBalancer) pickEcNodeToBalanceShardsInto(vid needle.VolumeId, existi
 		}
 
 		shards := nodeShards[node]
-		if ecb.replicaPlacement != nil && shards > ecb.replicaPlacement.SameRackCount+1 {
+		// For EC shards, replica placement constraint only applies when SameRackCount > 0.
+		// When SameRackCount = 0 (e.g., replica placement "000"), EC shards should be
+		// distributed freely within racks - the "000" means "no volume replication needed"
+		// because erasure coding provides redundancy.
+		if ecb.replicaPlacement != nil && ecb.replicaPlacement.SameRackCount > 0 && shards > ecb.replicaPlacement.SameRackCount+1 {
 			details += fmt.Sprintf(" Skipped %s because shards %d > replica placement limit for the rack (%d + 1)\n", node.info.Id, shards, ecb.replicaPlacement.SameRackCount)
 			continue
 		}