fix: EC rebalance fails with replica placement 000 (#7812)

* fix: EC rebalance fails with replica placement 000 This PR fixes several issues with EC shard distribution: 1. Pre-flight check before EC encoding - Verify target disk type has capacity before encoding starts - Prevents encoding shards only to fail during rebalance - Shows helpful error when wrong diskType is specified (e.g., ssd when volumes are on hdd) 2. Fix EC rebalance with replica placement 000 - When DiffRackCount=0, shards should be distributed freely across racks - The '000' placement means 'no volume replication needed' because EC provides redundancy - Previously all racks were skipped with error 'shards X > replica placement limit (0)' 3. Add unit tests for EC rebalance slot calculation - TestECRebalanceWithLimitedSlots: documents the limited slots scenario - TestECRebalanceZeroFreeSlots: reproduces the 0 free slots error 4. Add Makefile for manual EC testing - make setup: start cluster and populate data - make shell: open weed shell for EC commands - make clean: stop cluster and cleanup * fix: default -rebalance to true for ec.encode The -rebalance flag was defaulting to false, which meant ec.encode would only print shard moves but not actually execute them. This is a poor default since the whole point of EC encoding is to distribute shards across servers for fault tolerance. Now -rebalance defaults to true, so shards are actually distributed after encoding. Users can use -rebalance=false if they only want to see what would happen without making changes. * test/erasure_coding: improve Makefile safety and docs - Narrow pkill pattern for volume servers to use TEST_DIR instead of port pattern, avoiding accidental kills of unrelated SeaweedFS processes - Document external dependencies (curl, jq) in header comments * shell: refactor buildRackWithEcShards to reuse buildEcShards Extract common shard bit construction logic to avoid duplication between buildEcShards and buildRackWithEcShards helper functions. * shell: update test for EC replication 000 behavior When DiffRackCount=0 (replication "000"), EC shards should be distributed freely across racks since erasure coding provides its own redundancy. Update test expectation to reflect this behavior. * erasure_coding: add distribution package for proportional EC shard placement Add a new reusable package for EC shard distribution that: - Supports configurable EC ratios (not hard-coded 10+4) - Distributes shards proportionally based on replication policy - Provides fault tolerance analysis - Prefers moving parity shards to keep data shards spread out Key components: - ECConfig: Configurable data/parity shard counts - ReplicationConfig: Parsed XYZ replication policy - ECDistribution: Target shard counts per DC/rack/node - Rebalancer: Plans shard moves with parity-first strategy This enables seaweed-enterprise custom EC ratios and weed worker integration while maintaining a clean, testable architecture. * shell: integrate distribution package for EC rebalancing Add shell wrappers around the distribution package: - ProportionalECRebalancer: Plans moves using distribution.Rebalancer - NewProportionalECRebalancerWithConfig: Supports custom EC configs - GetDistributionSummary/GetFaultToleranceAnalysis: Helper functions The shell layer converts between EcNode types and the generic TopologyNode types used by the distribution package. * test setup * ec: improve data and parity shard distribution across racks - Add shardsByTypePerRack helper to track data vs parity shards - Rewrite doBalanceEcShardsAcrossRacks for two-pass balancing: 1. Balance data shards (0-9) evenly, max ceil(10/6)=2 per rack 2. Balance parity shards (10-13) evenly, max ceil(4/6)=1 per rack - Add balanceShardTypeAcrossRacks for generic shard type balancing - Add pickRackForShardType to select destination with room for type - Add unit tests for even data/parity distribution verification This ensures even read load during normal operation by spreading both data and parity shards across all available racks. * ec: make data/parity shard counts configurable in ecBalancer - Add dataShardCount and parityShardCount fields to ecBalancer struct - Add getDataShardCount() and getParityShardCount() methods with defaults - Replace direct constant usage with configurable methods - Fix unused variable warning for parityPerRack This allows seaweed-enterprise to use custom EC ratios while defaulting to standard 10+4 scheme. * Address PR 7812 review comments Makefile improvements: - Save PIDs for each volume server for precise termination - Use PID-based killing in stop target with pkill fallback - Use more specific pkill patterns with TEST_DIR paths Documentation: - Document jq dependency in README.md Rebalancer fix: - Fix duplicate shard count updates in applyMovesToAnalysis - All planners (DC/rack/node) update counts inline during planning - Remove duplicate updates from applyMovesToAnalysis to avoid double-counting * test/erasure_coding: use mktemp for test file template Use mktemp instead of hardcoded /tmp/testfile_template.bin path to provide better isolation for concurrent test runs.
2025-12-19 13:29:12 -08:00
parent 77a56c2857
commit 4aa50bfa6a
16 changed files with 3127 additions and 28 deletions
--- a/weed/shell/ec_rebalance_slots_test.go
+++ b/weed/shell/ec_rebalance_slots_test.go
@@ -0,0 +1,293 @@
+package shell
+
+import (
+	"testing"
+
+	"github.com/seaweedfs/seaweedfs/weed/pb/master_pb"
+	"github.com/seaweedfs/seaweedfs/weed/storage/erasure_coding"
+	"github.com/seaweedfs/seaweedfs/weed/storage/types"
+)
+
+// TestECRebalanceWithLimitedSlots tests that EC rebalance handles the scenario
+// where there are limited free slots on volume servers.
+//
+// This is a regression test for the error:
+//
+//	"no free ec shard slots. only 0 left"
+//
+// Scenario (from real usage):
+// - 6 volume servers in 6 racks
+// - Each server has max=10 volume slots
+// - 7 volumes were EC encoded (7 × 14 = 98 EC shards)
+// - All 14 shards per volume are on the original server (not yet distributed)
+//
+// Expected behavior:
+// - The rebalance algorithm should distribute shards across servers
+// - Even if perfect distribution isn't possible, it should do best-effort
+// - Currently fails with "no free ec shard slots" because freeSlots calculation
+//
+//	doesn't account for shards being moved (freed slots on source, used on target)
+func TestECRebalanceWithLimitedSlots(t *testing.T) {
+	// Build a topology matching the problematic scenario:
+	// 6 servers, each with 2+ volumes worth of EC shards (all 14 shards per volume on same server)
+	topology := buildLimitedSlotsTopology()
+
+	// Collect EC nodes from the topology
+	ecNodes, totalFreeEcSlots := collectEcVolumeServersByDc(topology, "", types.HardDriveType)
+
+	t.Logf("Topology summary:")
+	t.Logf("  Number of EC nodes: %d", len(ecNodes))
+	t.Logf("  Total free EC slots: %d", totalFreeEcSlots)
+
+	// Log per-node details
+	for _, node := range ecNodes {
+		shardCount := 0
+		for _, diskInfo := range node.info.DiskInfos {
+			for _, ecShard := range diskInfo.EcShardInfos {
+				shardCount += erasure_coding.ShardBits(ecShard.EcIndexBits).ShardIdCount()
+			}
+		}
+		t.Logf("  Node %s (rack %s): %d shards, %d free slots",
+			node.info.Id, node.rack, shardCount, node.freeEcSlot)
+	}
+
+	// Calculate total EC shards
+	totalEcShards := 0
+	for _, node := range ecNodes {
+		for _, diskInfo := range node.info.DiskInfos {
+			for _, ecShard := range diskInfo.EcShardInfos {
+				totalEcShards += erasure_coding.ShardBits(ecShard.EcIndexBits).ShardIdCount()
+			}
+		}
+	}
+	t.Logf("  Total EC shards: %d", totalEcShards)
+
+	// Document the issue:
+	// With 98 EC shards (7 volumes × 14 shards) on 6 servers with max=10 each,
+	// total capacity is 60 slots. But shards already occupy slots on their current servers.
+	//
+	// The current algorithm calculates free slots as:
+	//   freeSlots = maxVolumeCount - volumeCount - ecShardCount
+	//
+	// If all shards are on their original servers:
+	// - Server A has 28 shards (2 volumes × 14) → may have negative free slots
+	// - This causes totalFreeEcSlots to be 0 or negative
+	//
+	// The EXPECTED improvement:
+	// - Rebalance should recognize that moving a shard FREES a slot on the source
+	// - The algorithm should work iteratively, moving shards one at a time
+	// - Even if starting with 0 free slots, moving one shard opens a slot
+
+	if totalFreeEcSlots < 1 {
+		// This is the current (buggy) behavior we're documenting
+		t.Logf("")
+		t.Logf("KNOWN ISSUE: totalFreeEcSlots = %d (< 1)", totalFreeEcSlots)
+		t.Logf("")
+		t.Logf("This triggers the error: 'no free ec shard slots. only %d left'", totalFreeEcSlots)
+		t.Logf("")
+		t.Logf("Analysis:")
+		t.Logf("  - %d EC shards across %d servers", totalEcShards, len(ecNodes))
+		t.Logf("  - Shards are concentrated on original servers (not distributed)")
+		t.Logf("  - Current slot calculation doesn't account for slots freed by moving shards")
+		t.Logf("")
+		t.Logf("Expected fix:")
+		t.Logf("  1. Rebalance should work iteratively, moving one shard at a time")
+		t.Logf("  2. Moving a shard from A to B: frees 1 slot on A, uses 1 slot on B")
+		t.Logf("  3. The 'free slots' check should be per-move, not global")
+		t.Logf("  4. Or: calculate 'redistributable slots' = total capacity - shards that must stay")
+
+		// For now, document this is a known issue - don't fail the test
+		// When the fix is implemented, this test should be updated to verify the fix works
+		return
+	}
+
+	// If we get here, the issue might have been fixed
+	t.Logf("totalFreeEcSlots = %d, rebalance should be possible", totalFreeEcSlots)
+}
+
+// TestECRebalanceZeroFreeSlots tests the specific scenario where
+// the topology appears to have free slots but rebalance fails.
+//
+// This can happen when the VolumeCount in the topology includes the original
+// volumes that were EC-encoded, making the free slot calculation incorrect.
+func TestECRebalanceZeroFreeSlots(t *testing.T) {
+	// Build a topology where volumes were NOT deleted after EC encoding
+	// (VolumeCount still reflects the original volumes)
+	topology := buildZeroFreeSlotTopology()
+
+	ecNodes, totalFreeEcSlots := collectEcVolumeServersByDc(topology, "", types.HardDriveType)
+
+	t.Logf("Zero free slots scenario:")
+	for _, node := range ecNodes {
+		shardCount := 0
+		for _, diskInfo := range node.info.DiskInfos {
+			for _, ecShard := range diskInfo.EcShardInfos {
+				shardCount += erasure_coding.ShardBits(ecShard.EcIndexBits).ShardIdCount()
+			}
+		}
+		t.Logf("  Node %s: %d shards, %d free slots, volumeCount=%d, max=%d",
+			node.info.Id, shardCount, node.freeEcSlot,
+			node.info.DiskInfos[string(types.HardDriveType)].VolumeCount,
+			node.info.DiskInfos[string(types.HardDriveType)].MaxVolumeCount)
+	}
+	t.Logf("  Total free slots: %d", totalFreeEcSlots)
+
+	if totalFreeEcSlots == 0 {
+		t.Logf("")
+		t.Logf("SCENARIO REPRODUCED: totalFreeEcSlots = 0")
+		t.Logf("This would trigger: 'no free ec shard slots. only 0 left'")
+	}
+}
+
+// buildZeroFreeSlotTopology creates a topology where rebalance will fail
+// because servers are at capacity (volumeCount equals maxVolumeCount)
+func buildZeroFreeSlotTopology() *master_pb.TopologyInfo {
+	diskTypeKey := string(types.HardDriveType)
+
+	// Each server has max=10, volumeCount=10 (full capacity)
+	// Free capacity = (10-10) * 10 = 0 per server
+	// This will trigger "no free ec shard slots" error
+	return &master_pb.TopologyInfo{
+		Id: "test_zero_free_slots",
+		DataCenterInfos: []*master_pb.DataCenterInfo{
+			{
+				Id: "dc1",
+				RackInfos: []*master_pb.RackInfo{
+					{
+						Id: "rack0",
+						DataNodeInfos: []*master_pb.DataNodeInfo{
+							{
+								Id: "127.0.0.1:8080",
+								DiskInfos: map[string]*master_pb.DiskInfo{
+									diskTypeKey: {
+										Type:           diskTypeKey,
+										MaxVolumeCount: 10,
+										VolumeCount:    10, // At full capacity
+										EcShardInfos:   buildEcShards([]uint32{3, 4}),
+									},
+								},
+							},
+						},
+					},
+					{
+						Id: "rack1",
+						DataNodeInfos: []*master_pb.DataNodeInfo{
+							{
+								Id: "127.0.0.1:8081",
+								DiskInfos: map[string]*master_pb.DiskInfo{
+									diskTypeKey: {
+										Type:           diskTypeKey,
+										MaxVolumeCount: 10,
+										VolumeCount:    10,
+										EcShardInfos:   buildEcShards([]uint32{1, 7}),
+									},
+								},
+							},
+						},
+					},
+					{
+						Id: "rack2",
+						DataNodeInfos: []*master_pb.DataNodeInfo{
+							{
+								Id: "127.0.0.1:8082",
+								DiskInfos: map[string]*master_pb.DiskInfo{
+									diskTypeKey: {
+										Type:           diskTypeKey,
+										MaxVolumeCount: 10,
+										VolumeCount:    10,
+										EcShardInfos:   buildEcShards([]uint32{2}),
+									},
+								},
+							},
+						},
+					},
+					{
+						Id: "rack3",
+						DataNodeInfos: []*master_pb.DataNodeInfo{
+							{
+								Id: "127.0.0.1:8083",
+								DiskInfos: map[string]*master_pb.DiskInfo{
+									diskTypeKey: {
+										Type:           diskTypeKey,
+										MaxVolumeCount: 10,
+										VolumeCount:    10,
+										EcShardInfos:   buildEcShards([]uint32{5, 6}),
+									},
+								},
+							},
+						},
+					},
+				},
+			},
+		},
+	}
+}
+
+func buildEcShards(volumeIds []uint32) []*master_pb.VolumeEcShardInformationMessage {
+	var shards []*master_pb.VolumeEcShardInformationMessage
+	for _, vid := range volumeIds {
+		allShardBits := erasure_coding.ShardBits(0)
+		for i := 0; i < erasure_coding.TotalShardsCount; i++ {
+			allShardBits = allShardBits.AddShardId(erasure_coding.ShardId(i))
+		}
+		shards = append(shards, &master_pb.VolumeEcShardInformationMessage{
+			Id:          vid,
+			Collection:  "ectest",
+			EcIndexBits: uint32(allShardBits),
+		})
+	}
+	return shards
+}
+
+// buildLimitedSlotsTopology creates a topology matching the problematic scenario:
+// - 6 servers in 6 racks
+// - Each server has max=10 volume slots
+// - 7 volumes were EC encoded, shards distributed as follows:
+//   - rack0 (8080): volumes 3,4 → 28 shards
+//   - rack1 (8081): volumes 1,7 → 28 shards
+//   - rack2 (8082): volume 2 → 14 shards
+//   - rack3 (8083): volumes 5,6 → 28 shards
+//   - rack4 (8084): (no volumes originally)
+//   - rack5 (8085): (no volumes originally)
+func buildLimitedSlotsTopology() *master_pb.TopologyInfo {
+	return &master_pb.TopologyInfo{
+		Id: "test_limited_slots",
+		DataCenterInfos: []*master_pb.DataCenterInfo{
+			{
+				Id: "dc1",
+				RackInfos: []*master_pb.RackInfo{
+					buildRackWithEcShards("rack0", "127.0.0.1:8080", 10, []uint32{3, 4}),
+					buildRackWithEcShards("rack1", "127.0.0.1:8081", 10, []uint32{1, 7}),
+					buildRackWithEcShards("rack2", "127.0.0.1:8082", 10, []uint32{2}),
+					buildRackWithEcShards("rack3", "127.0.0.1:8083", 10, []uint32{5, 6}),
+					buildRackWithEcShards("rack4", "127.0.0.1:8084", 10, []uint32{}),
+					buildRackWithEcShards("rack5", "127.0.0.1:8085", 10, []uint32{}),
+				},
+			},
+		},
+	}
+}
+
+// buildRackWithEcShards creates a rack with one data node containing EC shards
+// for the specified volume IDs (all 14 shards per volume)
+func buildRackWithEcShards(rackId, nodeId string, maxVolumes int64, volumeIds []uint32) *master_pb.RackInfo {
+	// Note: types.HardDriveType is "" (empty string), so we use "" as the key
+	diskTypeKey := string(types.HardDriveType)
+
+	return &master_pb.RackInfo{
+		Id: rackId,
+		DataNodeInfos: []*master_pb.DataNodeInfo{
+			{
+				Id: nodeId,
+				DiskInfos: map[string]*master_pb.DiskInfo{
+					diskTypeKey: {
+						Type:           diskTypeKey,
+						MaxVolumeCount: maxVolumes,
+						VolumeCount:    int64(len(volumeIds)), // Original volumes still counted
+						EcShardInfos:   buildEcShards(volumeIds),
+					},
+				},
+			},
+		},
+	}
+}