* fix: EC rebalance fails with replica placement 000

  This PR fixes several issues with EC shard distribution:

  1. Pre-flight check before EC encoding
     - Verify that the target disk type has capacity before encoding starts
     - Prevents encoding shards only to fail during rebalance
     - Shows a helpful error when the wrong diskType is specified (e.g., ssd when volumes are on hdd)

  2. Fix EC rebalance with replica placement 000
     - When DiffRackCount=0, shards should be distributed freely across racks
     - The '000' placement means 'no volume replication needed', because EC provides its own redundancy
     - Previously all racks were skipped with the error 'shards X > replica placement limit (0)'

  3. Add unit tests for EC rebalance slot calculation
     - TestECRebalanceWithLimitedSlots: documents the limited-slots scenario
     - TestECRebalanceZeroFreeSlots: reproduces the 0-free-slots error

  4. Add Makefile for manual EC testing
     - make setup: start cluster and populate data
     - make shell: open weed shell for EC commands
     - make clean: stop cluster and clean up

* fix: default -rebalance to true for ec.encode

  The -rebalance flag defaulted to false, so ec.encode would only print shard moves without executing them. That is a poor default, since the whole point of EC encoding is to distribute shards across servers for fault tolerance. Now -rebalance defaults to true and shards are actually distributed after encoding. Users can pass -rebalance=false to preview the moves without making changes.

* test/erasure_coding: improve Makefile safety and docs

  - Narrow the pkill pattern for volume servers to use TEST_DIR instead of a port pattern, avoiding accidental kills of unrelated SeaweedFS processes
  - Document external dependencies (curl, jq) in the header comments

* shell: refactor buildRackWithEcShards to reuse buildEcShards

  Extract the common shard-bit construction logic to avoid duplication between the buildEcShards and buildRackWithEcShards helper functions.

* shell: update test for EC replication 000 behavior

  When DiffRackCount=0 (replication "000"), EC shards should be distributed freely across racks, since erasure coding provides its own redundancy. Update the test expectation to reflect this behavior.

* erasure_coding: add distribution package for proportional EC shard placement

  Add a new reusable package for EC shard distribution that:
  - Supports configurable EC ratios (not hard-coded 10+4)
  - Distributes shards proportionally based on the replication policy
  - Provides fault tolerance analysis
  - Prefers moving parity shards, to keep data shards spread out

  Key components:
  - ECConfig: configurable data/parity shard counts
  - ReplicationConfig: parsed XYZ replication policy
  - ECDistribution: target shard counts per DC/rack/node
  - Rebalancer: plans shard moves with a parity-first strategy

  This enables seaweed-enterprise custom EC ratios and weed worker integration while maintaining a clean, testable architecture.

* shell: integrate distribution package for EC rebalancing

  Add shell wrappers around the distribution package:
  - ProportionalECRebalancer: plans moves using distribution.Rebalancer
  - NewProportionalECRebalancerWithConfig: supports custom EC configs
  - GetDistributionSummary/GetFaultToleranceAnalysis: helper functions

  The shell layer converts between EcNode types and the generic TopologyNode types used by the distribution package.

* test setup

* ec: improve data and parity shard distribution across racks

  - Add a shardsByTypePerRack helper to track data vs. parity shards
  - Rewrite doBalanceEcShardsAcrossRacks for two-pass balancing (sketched in Go after this changelog):
    1. Balance data shards (0-9) evenly, max ceil(10/6)=2 per rack
    2. Balance parity shards (10-13) evenly, max ceil(4/6)=1 per rack
  - Add balanceShardTypeAcrossRacks for generic shard-type balancing
  - Add pickRackForShardType to select a destination with room for the shard type
  - Add unit tests verifying even data/parity distribution

  This ensures even read load during normal operation by spreading both data and parity shards across all available racks.

* ec: make data/parity shard counts configurable in ecBalancer

  - Add dataShardCount and parityShardCount fields to the ecBalancer struct
  - Add getDataShardCount() and getParityShardCount() methods with defaults
  - Replace direct constant usage with the configurable methods
  - Fix an unused-variable warning for parityPerRack

  This allows seaweed-enterprise to use custom EC ratios while defaulting to the standard 10+4 scheme.

* Address PR 7812 review comments

  Makefile improvements:
  - Save the PID of each volume server for precise termination
  - Use PID-based killing in the stop target, with a pkill fallback
  - Use more specific pkill patterns with TEST_DIR paths

  Documentation:
  - Document the jq dependency in README.md

  Rebalancer fix:
  - Fix duplicate shard-count updates in applyMovesToAnalysis
  - All planners (DC/rack/node) update counts inline during planning
  - Remove the duplicate updates from applyMovesToAnalysis to avoid double-counting

* test/erasure_coding: use mktemp for test file template

  Use mktemp instead of the hardcoded /tmp/testfile_template.bin path to provide better isolation for concurrent test runs.
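The per-rack caps in the two-pass rack balancing are plain ceiling division over the rack count. A minimal sketch of that arithmetic, assuming the standard 10+4 scheme across 6 racks (perRackCap is an illustrative name, not one of the PR's actual helpers):

```go
package main

import "fmt"

// perRackCap returns ceil(shardsOfType/rackCount): the most shards of one
// type (data or parity) any single rack should hold after balancing.
func perRackCap(shardsOfType, rackCount int) int {
	if rackCount <= 0 {
		return shardsOfType
	}
	return (shardsOfType + rackCount - 1) / rackCount
}

func main() {
	const dataShards, parityShards, racks = 10, 4, 6
	// Pass 1: data shards 0-9, at most ceil(10/6) = 2 per rack.
	fmt.Println("data cap per rack:", perRackCap(dataShards, racks))
	// Pass 2: parity shards 10-13, at most ceil(4/6) = 1 per rack.
	fmt.Println("parity cap per rack:", perRackCap(parityShards, racks))
}
```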
package distribution

import (
	"fmt"
)

// ECDistribution represents the target distribution of EC shards
// based on EC configuration and replication policy.
type ECDistribution struct {
	// EC configuration
	ECConfig ECConfig

	// Replication configuration
	ReplicationConfig ReplicationConfig

	// Target shard counts per topology level (balanced distribution)
	TargetShardsPerDC   int
	TargetShardsPerRack int
	TargetShardsPerNode int

	// Maximum shard counts per topology level (fault tolerance limits).
	// These prevent any single failure domain from holding too many shards.
	MaxShardsPerDC   int
	MaxShardsPerRack int
	MaxShardsPerNode int
}

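// NOTE: ECConfig and ReplicationConfig are defined elsewhere in this
// package. Judging only from the calls made in this file, they have
// roughly the following shape (a sketch, not the actual definitions):
//
//	type ECConfig struct{ DataShards, ParityShards int }
//	// plus TotalShards(), String(), MinShardsForReconstruction(),
//	// and MaxTolerableLoss() methods
//
//	type ReplicationConfig struct{ MinDataCenters, MinRacksPerDC, MinNodesPerRack int }
//	// plus a String() method
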
// CalculateDistribution computes the target EC shard distribution based on
// EC configuration and replication policy.
//
// The algorithm:
//  1. Uses the replication policy to determine the minimum topology spread
//  2. Calculates target shards per level (evenly distributed)
//  3. Calculates max shards per level (for fault tolerance)
func CalculateDistribution(ec ECConfig, rep ReplicationConfig) *ECDistribution {
	totalShards := ec.TotalShards()

	// Target distribution (balanced, rounded up so every shard is placed)
	targetShardsPerDC := ceilDivide(totalShards, rep.MinDataCenters)
	targetShardsPerRack := ceilDivide(targetShardsPerDC, rep.MinRacksPerDC)
	targetShardsPerNode := ceilDivide(targetShardsPerRack, rep.MinNodesPerRack)

	// Maximum limits for fault tolerance.
	// Strict survivability would cap each failure domain at ParityShards
	// shards, so that losing one domain never removes more than the parity
	// budget. That is often impossible on small topologies, so instead we
	// use a looser hard ceiling of totalShards - parityShards + 1 and allow
	// a small +2 slack over the balanced target at each level to tolerate
	// imbalanced topologies.
	faultToleranceLimit := totalShards - ec.ParityShards + 1

	maxShardsPerDC := min(faultToleranceLimit, targetShardsPerDC+2)
	maxShardsPerRack := min(faultToleranceLimit, targetShardsPerRack+2)
	maxShardsPerNode := min(faultToleranceLimit, targetShardsPerNode+2)

	return &ECDistribution{
		ECConfig:            ec,
		ReplicationConfig:   rep,
		TargetShardsPerDC:   targetShardsPerDC,
		TargetShardsPerRack: targetShardsPerRack,
		TargetShardsPerNode: targetShardsPerNode,
		MaxShardsPerDC:      maxShardsPerDC,
		MaxShardsPerRack:    maxShardsPerRack,
		MaxShardsPerNode:    maxShardsPerNode,
	}
}

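// Worked example for CalculateDistribution, assuming the standard 10+4
// scheme (totalShards = 14) with MinDataCenters=1, MinRacksPerDC=2,
// MinNodesPerRack=3:
//
//	targetShardsPerDC   = ceil(14/1) = 14
//	targetShardsPerRack = ceil(14/2) = 7
//	targetShardsPerNode = ceil(7/3)  = 3
//	faultToleranceLimit = 14 - 4 + 1 = 11
//	maxShardsPerDC   = min(11, 14+2) = 11
//	maxShardsPerRack = min(11, 7+2)  = 9
//	maxShardsPerNode = min(11, 3+2)  = 5
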
// String returns a human-readable description of the distribution
func (d *ECDistribution) String() string {
	return fmt.Sprintf(
		"ECDistribution{EC:%s, DCs:%d (target:%d/max:%d), Racks/DC:%d (target:%d/max:%d), Nodes/Rack:%d (target:%d/max:%d)}",
		d.ECConfig.String(),
		d.ReplicationConfig.MinDataCenters, d.TargetShardsPerDC, d.MaxShardsPerDC,
		d.ReplicationConfig.MinRacksPerDC, d.TargetShardsPerRack, d.MaxShardsPerRack,
		d.ReplicationConfig.MinNodesPerRack, d.TargetShardsPerNode, d.MaxShardsPerNode,
	)
}

// Summary returns a multi-line summary of the distribution plan
func (d *ECDistribution) Summary() string {
	summary := fmt.Sprintf("EC Configuration: %s\n", d.ECConfig.String())
	summary += fmt.Sprintf("Replication: %s\n", d.ReplicationConfig.String())
	summary += "Distribution Plan:\n"
	summary += fmt.Sprintf("  Data Centers: %d (target %d shards each, max %d)\n",
		d.ReplicationConfig.MinDataCenters, d.TargetShardsPerDC, d.MaxShardsPerDC)
	summary += fmt.Sprintf("  Racks per DC: %d (target %d shards each, max %d)\n",
		d.ReplicationConfig.MinRacksPerDC, d.TargetShardsPerRack, d.MaxShardsPerRack)
	summary += fmt.Sprintf("  Nodes per Rack: %d (target %d shards each, max %d)\n",
		d.ReplicationConfig.MinNodesPerRack, d.TargetShardsPerNode, d.MaxShardsPerNode)
	return summary
}

// CanSurviveDCFailure returns true if the distribution can survive
// the complete loss of one data center
func (d *ECDistribution) CanSurviveDCFailure() bool {
	// Assuming balanced placement, losing one DC loses about
	// TargetShardsPerDC shards; check whether the remainder can still
	// reconstruct the volume.
	remainingAfterDCLoss := d.ECConfig.TotalShards() - d.TargetShardsPerDC
	return remainingAfterDCLoss >= d.ECConfig.MinShardsForReconstruction()
}

// CanSurviveRackFailure returns true if the distribution can survive
// the complete loss of one rack
func (d *ECDistribution) CanSurviveRackFailure() bool {
	remainingAfterRackLoss := d.ECConfig.TotalShards() - d.TargetShardsPerRack
	return remainingAfterRackLoss >= d.ECConfig.MinShardsForReconstruction()
}

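// Worked example for 10+4 (14 shards total), assuming
// MinShardsForReconstruction() is the data shard count (10): with two DCs,
// TargetShardsPerDC = ceil(14/2) = 7, so a DC loss leaves 14-7 = 7 shards,
// fewer than 10, and the distribution is not survivable. With four DCs,
// TargetShardsPerDC = ceil(14/4) = 4, leaving exactly 10, which survives.
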
// MinDCsForDCFaultTolerance calculates the minimum number of DCs needed
// to survive a complete DC failure with this EC configuration
func (d *ECDistribution) MinDCsForDCFaultTolerance() int {
	// To survive a DC failure, no DC may hold more than MaxTolerableLoss()
	// shards (the parity budget).
	maxShardsPerDC := d.ECConfig.MaxTolerableLoss()
	if maxShardsPerDC == 0 {
		return d.ECConfig.TotalShards() // would need one DC per shard
	}
	return ceilDivide(d.ECConfig.TotalShards(), maxShardsPerDC)
}

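// For the standard 10+4 scheme, assuming MaxTolerableLoss() returns the
// parity shard count (4): MinDCsForDCFaultTolerance = ceil(14/4) = 4 DCs.
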
// FaultToleranceAnalysis returns a detailed analysis of fault tolerance
func (d *ECDistribution) FaultToleranceAnalysis() string {
	analysis := fmt.Sprintf("Fault Tolerance Analysis for %s:\n", d.ECConfig.String())

	// DC failure
	dcSurvive := d.CanSurviveDCFailure()
	shardsAfterDC := d.ECConfig.TotalShards() - d.TargetShardsPerDC
	analysis += fmt.Sprintf("  DC Failure: %s\n", boolToResult(dcSurvive))
	analysis += fmt.Sprintf("    - Losing one DC loses ~%d shards\n", d.TargetShardsPerDC)
	analysis += fmt.Sprintf("    - Remaining: %d shards (need %d)\n", shardsAfterDC, d.ECConfig.DataShards)
	if !dcSurvive {
		analysis += fmt.Sprintf("    - Need at least %d DCs for DC fault tolerance\n", d.MinDCsForDCFaultTolerance())
	}

	// Rack failure
	rackSurvive := d.CanSurviveRackFailure()
	shardsAfterRack := d.ECConfig.TotalShards() - d.TargetShardsPerRack
	analysis += fmt.Sprintf("  Rack Failure: %s\n", boolToResult(rackSurvive))
	analysis += fmt.Sprintf("    - Losing one rack loses ~%d shards\n", d.TargetShardsPerRack)
	analysis += fmt.Sprintf("    - Remaining: %d shards (need %d)\n", shardsAfterRack, d.ECConfig.DataShards)

	// Node failure (usually survivable)
	shardsAfterNode := d.ECConfig.TotalShards() - d.TargetShardsPerNode
	nodeSurvive := shardsAfterNode >= d.ECConfig.DataShards
	analysis += fmt.Sprintf("  Node Failure: %s\n", boolToResult(nodeSurvive))
	analysis += fmt.Sprintf("    - Losing one node loses ~%d shards\n", d.TargetShardsPerNode)
	analysis += fmt.Sprintf("    - Remaining: %d shards (need %d)\n", shardsAfterNode, d.ECConfig.DataShards)

	return analysis
}

func boolToResult(b bool) string {
	if b {
		return "SURVIVABLE ✓"
	}
	return "NOT SURVIVABLE ✗"
}

// ceilDivide performs ceiling division, returning a unchanged for a
// non-positive divisor rather than panicking
func ceilDivide(a, b int) int {
	if b <= 0 {
		return a
	}
	return (a + b - 1) / b
}

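// ExampleCalculateDistribution is a usage sketch; it would normally live in
// a _test.go file. The struct literals assume the exported fields referenced
// above and may not match the package's real constructors.
func ExampleCalculateDistribution() {
	ec := ECConfig{DataShards: 10, ParityShards: 4}
	rep := ReplicationConfig{MinDataCenters: 1, MinRacksPerDC: 2, MinNodesPerRack: 3}

	dist := CalculateDistribution(ec, rep)
	fmt.Print(dist.Summary())
	fmt.Print(dist.FaultToleranceAnalysis())
}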