fix: EC rebalance fails with replica placement 000 (#7812)
* fix: EC rebalance fails with replica placement 000

  This PR fixes several issues with EC shard distribution:

  1. Pre-flight check before EC encoding
     - Verify the target disk type has capacity before encoding starts
     - Prevents encoding shards only to fail during rebalance
     - Shows a helpful error when the wrong diskType is specified (e.g., ssd when volumes are on hdd)

  2. Fix EC rebalance with replica placement 000
     - When DiffRackCount=0, shards should be distributed freely across racks
     - The '000' placement means 'no volume replication needed' because EC provides redundancy
     - Previously all racks were skipped with the error 'shards X > replica placement limit (0)'

  3. Add unit tests for EC rebalance slot calculation
     - TestECRebalanceWithLimitedSlots: documents the limited-slots scenario
     - TestECRebalanceZeroFreeSlots: reproduces the 0 free slots error

  4. Add a Makefile for manual EC testing
     - make setup: start cluster and populate data
     - make shell: open weed shell for EC commands
     - make clean: stop cluster and clean up

* fix: default -rebalance to true for ec.encode

  The -rebalance flag was defaulting to false, which meant ec.encode would only print shard moves but not actually execute them. This is a poor default, since the whole point of EC encoding is to distribute shards across servers for fault tolerance.

  Now -rebalance defaults to true, so shards are actually distributed after encoding. Users can pass -rebalance=false if they only want to see what would happen without making changes.

* test/erasure_coding: improve Makefile safety and docs

  - Narrow the pkill pattern for volume servers to use TEST_DIR instead of a port pattern, avoiding accidental kills of unrelated SeaweedFS processes
  - Document external dependencies (curl, jq) in header comments

* shell: refactor buildRackWithEcShards to reuse buildEcShards

  Extract the common shard bit construction logic to avoid duplication between the buildEcShards and buildRackWithEcShards helper functions.
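The replica placement 000 fix described above can be sketched as follows. Note that `rackShardLimit` is a hypothetical helper name chosen for illustration, not the actual shell code; it only demonstrates the corrected decision logic.

```go
package main

import "fmt"

// rackShardLimit sketches the corrected per-rack limit logic: with
// replication "000" (diffRackCount == 0) there is no volume-replication
// constraint, because erasure coding itself provides redundancy, so shards
// are spread freely, capped only by an even split across racks. The old
// behavior effectively used diffRackCount as the limit, so every rack was
// rejected with "shards X > replica placement limit (0)".
func rackShardLimit(diffRackCount, totalShards, rackCount int) int {
	if diffRackCount == 0 {
		// Free distribution: ceil(totalShards / rackCount) per rack.
		return (totalShards + rackCount - 1) / rackCount
	}
	// Replication-constrained placement (illustrative, not the real formula).
	return diffRackCount + 1
}

func main() {
	// 14 EC shards (10+4) across 6 racks with replication "000":
	fmt.Println(rackShardLimit(0, 14, 6)) // ceil(14/6) = 3, instead of the old limit of 0
}
```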
* shell: update test for EC replication 000 behavior

  When DiffRackCount=0 (replication "000"), EC shards should be distributed freely across racks, since erasure coding provides its own redundancy. Update the test expectation to reflect this behavior.

* erasure_coding: add distribution package for proportional EC shard placement

  Add a new reusable package for EC shard distribution that:
  - Supports configurable EC ratios (not hard-coded 10+4)
  - Distributes shards proportionally based on the replication policy
  - Provides fault tolerance analysis
  - Prefers moving parity shards to keep data shards spread out

  Key components:
  - ECConfig: Configurable data/parity shard counts
  - ReplicationConfig: Parsed XYZ replication policy
  - ECDistribution: Target shard counts per DC/rack/node
  - Rebalancer: Plans shard moves with a parity-first strategy

  This enables seaweed-enterprise custom EC ratios and weed worker integration while maintaining a clean, testable architecture.

* shell: integrate distribution package for EC rebalancing

  Add shell wrappers around the distribution package:
  - ProportionalECRebalancer: Plans moves using distribution.Rebalancer
  - NewProportionalECRebalancerWithConfig: Supports custom EC configs
  - GetDistributionSummary/GetFaultToleranceAnalysis: Helper functions

  The shell layer converts between EcNode types and the generic TopologyNode types used by the distribution package.

* test setup

* ec: improve data and parity shard distribution across racks

  - Add a shardsByTypePerRack helper to track data vs parity shards
  - Rewrite doBalanceEcShardsAcrossRacks for two-pass balancing:
    1. Balance data shards (0-9) evenly, max ceil(10/6)=2 per rack
    2. Balance parity shards (10-13) evenly, max ceil(4/6)=1 per rack
  - Add balanceShardTypeAcrossRacks for generic shard type balancing
  - Add pickRackForShardType to select a destination with room for the type
  - Add unit tests for even data/parity distribution verification

  This ensures an even read load during normal operation by spreading both data and parity shards across all available racks.

* ec: make data/parity shard counts configurable in ecBalancer

  - Add dataShardCount and parityShardCount fields to the ecBalancer struct
  - Add getDataShardCount() and getParityShardCount() methods with defaults
  - Replace direct constant usage with the configurable methods
  - Fix an unused variable warning for parityPerRack

  This allows seaweed-enterprise to use custom EC ratios while defaulting to the standard 10+4 scheme.

* Address PR 7812 review comments

  Makefile improvements:
  - Save PIDs for each volume server for precise termination
  - Use PID-based killing in the stop target, with a pkill fallback
  - Use more specific pkill patterns with TEST_DIR paths

  Documentation:
  - Document the jq dependency in README.md

  Rebalancer fix:
  - Fix duplicate shard count updates in applyMovesToAnalysis
  - All planners (DC/rack/node) update counts inline during planning
  - Remove the duplicate updates from applyMovesToAnalysis to avoid double-counting

* test/erasure_coding: use mktemp for test file template

  Use mktemp instead of a hardcoded /tmp/testfile_template.bin path to provide better isolation for concurrent test runs.
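The two-pass per-rack caps quoted above are plain ceiling arithmetic. The sketch below checks them for the standard 10+4 scheme over 6 racks; `ceilDiv` and `perRackCaps` are illustrative names, not the actual balancer code.

```go
package main

import "fmt"

// ceilDiv computes ceil(a/b) for positive integers.
func ceilDiv(a, b int) int { return (a + b - 1) / b }

// perRackCaps sketches the two-pass cap computation: data and parity shards
// are balanced separately, each capped at ceil(count/racks) per rack, so
// that both shard types spread evenly across all available racks.
func perRackCaps(dataShards, parityShards, racks int) (dataCap, parityCap int) {
	return ceilDiv(dataShards, racks), ceilDiv(parityShards, racks)
}

func main() {
	d, p := perRackCaps(10, 4, 6)
	fmt.Println(d, p) // ceil(10/6)=2 data and ceil(4/6)=1 parity per rack
}
```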
weed/storage/erasure_coding/distribution/config.go (new file, 171 lines)
@@ -0,0 +1,171 @@
// Package distribution provides EC shard distribution algorithms with configurable EC ratios.
package distribution

import (
	"fmt"

	"github.com/seaweedfs/seaweedfs/weed/storage/super_block"
)

// ECConfig holds erasure coding configuration parameters.
// This replaces hard-coded constants like DataShardsCount=10, ParityShardsCount=4.
type ECConfig struct {
	DataShards   int // Number of data shards (e.g., 10)
	ParityShards int // Number of parity shards (e.g., 4)
}

// DefaultECConfig returns the standard 10+4 EC configuration
func DefaultECConfig() ECConfig {
	return ECConfig{
		DataShards:   10,
		ParityShards: 4,
	}
}

// NewECConfig creates a new EC configuration with validation
func NewECConfig(dataShards, parityShards int) (ECConfig, error) {
	if dataShards <= 0 {
		return ECConfig{}, fmt.Errorf("dataShards must be positive, got %d", dataShards)
	}
	if parityShards <= 0 {
		return ECConfig{}, fmt.Errorf("parityShards must be positive, got %d", parityShards)
	}
	if dataShards+parityShards > 32 {
		return ECConfig{}, fmt.Errorf("total shards (%d+%d=%d) exceeds maximum of 32",
			dataShards, parityShards, dataShards+parityShards)
	}
	return ECConfig{
		DataShards:   dataShards,
		ParityShards: parityShards,
	}, nil
}

// TotalShards returns the total number of shards (data + parity)
func (c ECConfig) TotalShards() int {
	return c.DataShards + c.ParityShards
}

// MaxTolerableLoss returns the maximum number of shards that can be lost
// while still being able to reconstruct the data
func (c ECConfig) MaxTolerableLoss() int {
	return c.ParityShards
}

// MinShardsForReconstruction returns the minimum number of shards needed
// to reconstruct the original data
func (c ECConfig) MinShardsForReconstruction() int {
	return c.DataShards
}

// String returns a human-readable representation
func (c ECConfig) String() string {
	return fmt.Sprintf("%d+%d (total: %d, can lose: %d)",
		c.DataShards, c.ParityShards, c.TotalShards(), c.MaxTolerableLoss())
}

// IsDataShard returns true if the shard ID is a data shard (0 to DataShards-1)
func (c ECConfig) IsDataShard(shardID int) bool {
	return shardID >= 0 && shardID < c.DataShards
}

// IsParityShard returns true if the shard ID is a parity shard (DataShards to TotalShards-1)
func (c ECConfig) IsParityShard(shardID int) bool {
	return shardID >= c.DataShards && shardID < c.TotalShards()
}

// SortShardsDataFirst returns a copy of shards sorted with data shards first.
// This is useful for initial placement where data shards should be spread out first.
func (c ECConfig) SortShardsDataFirst(shards []int) []int {
	result := make([]int, len(shards))
	copy(result, shards)

	// Partition: data shards first, then parity shards
	dataIdx := 0
	parityIdx := len(result) - 1

	sorted := make([]int, len(result))
	for _, s := range result {
		if c.IsDataShard(s) {
			sorted[dataIdx] = s
			dataIdx++
		} else {
			sorted[parityIdx] = s
			parityIdx--
		}
	}

	return sorted
}

// SortShardsParityFirst returns a copy of shards sorted with parity shards first.
// This is useful for rebalancing where we prefer to move parity shards.
func (c ECConfig) SortShardsParityFirst(shards []int) []int {
	result := make([]int, len(shards))
	copy(result, shards)

	// Partition: parity shards first, then data shards
	parityIdx := 0
	dataIdx := len(result) - 1

	sorted := make([]int, len(result))
	for _, s := range result {
		if c.IsParityShard(s) {
			sorted[parityIdx] = s
			parityIdx++
		} else {
			sorted[dataIdx] = s
			dataIdx--
		}
	}

	return sorted
}

// ReplicationConfig holds the parsed replication policy
type ReplicationConfig struct {
	MinDataCenters  int // X+1 from XYZ replication (minimum DCs to use)
	MinRacksPerDC   int // Y+1 from XYZ replication (minimum racks per DC)
	MinNodesPerRack int // Z+1 from XYZ replication (minimum nodes per rack)

	// Original replication string (for logging/debugging)
	Original string
}

// NewReplicationConfig creates a ReplicationConfig from a ReplicaPlacement
func NewReplicationConfig(rp *super_block.ReplicaPlacement) ReplicationConfig {
	if rp == nil {
		return ReplicationConfig{
			MinDataCenters:  1,
			MinRacksPerDC:   1,
			MinNodesPerRack: 1,
			Original:        "000",
		}
	}
	return ReplicationConfig{
		MinDataCenters:  rp.DiffDataCenterCount + 1,
		MinRacksPerDC:   rp.DiffRackCount + 1,
		MinNodesPerRack: rp.SameRackCount + 1,
		Original:        rp.String(),
	}
}

// NewReplicationConfigFromString creates a ReplicationConfig from a replication string
func NewReplicationConfigFromString(replication string) (ReplicationConfig, error) {
	rp, err := super_block.NewReplicaPlacementFromString(replication)
	if err != nil {
		return ReplicationConfig{}, err
	}
	return NewReplicationConfig(rp), nil
}

// TotalPlacementSlots returns the minimum number of unique placement locations
// based on the replication policy
func (r ReplicationConfig) TotalPlacementSlots() int {
	return r.MinDataCenters * r.MinRacksPerDC * r.MinNodesPerRack
}

// String returns a human-readable representation
func (r ReplicationConfig) String() string {
	return fmt.Sprintf("replication=%s (DCs:%d, Racks/DC:%d, Nodes/Rack:%d)",
		r.Original, r.MinDataCenters, r.MinRacksPerDC, r.MinNodesPerRack)
}