* fix: EC rebalance fails with replica placement 000

This PR fixes several issues with EC shard distribution:

1. Pre-flight check before EC encoding
   - Verify target disk type has capacity before encoding starts
   - Prevents encoding shards only to fail during rebalance
   - Shows helpful error when wrong diskType is specified (e.g., ssd when volumes are on hdd)

2. Fix EC rebalance with replica placement 000
   - When DiffRackCount=0, shards should be distributed freely across racks
   - The '000' placement means 'no volume replication needed' because EC provides redundancy
   - Previously all racks were skipped with error 'shards X > replica placement limit (0)'

3. Add unit tests for EC rebalance slot calculation
   - TestECRebalanceWithLimitedSlots: documents the limited slots scenario
   - TestECRebalanceZeroFreeSlots: reproduces the 0 free slots error

4. Add Makefile for manual EC testing
   - make setup: start cluster and populate data
   - make shell: open weed shell for EC commands
   - make clean: stop cluster and cleanup

* fix: default -rebalance to true for ec.encode

The -rebalance flag was defaulting to false, which meant ec.encode would only print shard moves but not actually execute them. This is a poor default since the whole point of EC encoding is to distribute shards across servers for fault tolerance.

Now -rebalance defaults to true, so shards are actually distributed after encoding. Users can use -rebalance=false if they only want to see what would happen without making changes.

* test/erasure_coding: improve Makefile safety and docs

- Narrow pkill pattern for volume servers to use TEST_DIR instead of port pattern, avoiding accidental kills of unrelated SeaweedFS processes
- Document external dependencies (curl, jq) in header comments

* shell: refactor buildRackWithEcShards to reuse buildEcShards

Extract common shard bit construction logic to avoid duplication between buildEcShards and buildRackWithEcShards helper functions.
* shell: update test for EC replication 000 behavior

When DiffRackCount=0 (replication "000"), EC shards should be distributed freely across racks since erasure coding provides its own redundancy. Update test expectation to reflect this behavior.

* erasure_coding: add distribution package for proportional EC shard placement

Add a new reusable package for EC shard distribution that:
- Supports configurable EC ratios (not hard-coded 10+4)
- Distributes shards proportionally based on replication policy
- Provides fault tolerance analysis
- Prefers moving parity shards to keep data shards spread out

Key components:
- ECConfig: Configurable data/parity shard counts
- ReplicationConfig: Parsed XYZ replication policy
- ECDistribution: Target shard counts per DC/rack/node
- Rebalancer: Plans shard moves with parity-first strategy

This enables seaweed-enterprise custom EC ratios and weed worker integration while maintaining a clean, testable architecture.

* shell: integrate distribution package for EC rebalancing

Add shell wrappers around the distribution package:
- ProportionalECRebalancer: Plans moves using distribution.Rebalancer
- NewProportionalECRebalancerWithConfig: Supports custom EC configs
- GetDistributionSummary/GetFaultToleranceAnalysis: Helper functions

The shell layer converts between EcNode types and the generic TopologyNode types used by the distribution package.

* test setup

* ec: improve data and parity shard distribution across racks

- Add shardsByTypePerRack helper to track data vs parity shards
- Rewrite doBalanceEcShardsAcrossRacks for two-pass balancing:
  1. Balance data shards (0-9) evenly, max ceil(10/6)=2 per rack
  2. Balance parity shards (10-13) evenly, max ceil(4/6)=1 per rack
- Add balanceShardTypeAcrossRacks for generic shard type balancing
- Add pickRackForShardType to select destination with room for type
- Add unit tests for even data/parity distribution verification

This ensures even read load during normal operation by spreading both data and parity shards across all available racks.

* ec: make data/parity shard counts configurable in ecBalancer

- Add dataShardCount and parityShardCount fields to ecBalancer struct
- Add getDataShardCount() and getParityShardCount() methods with defaults
- Replace direct constant usage with configurable methods
- Fix unused variable warning for parityPerRack

This allows seaweed-enterprise to use custom EC ratios while defaulting to standard 10+4 scheme.

* Address PR 7812 review comments

Makefile improvements:
- Save PIDs for each volume server for precise termination
- Use PID-based killing in stop target with pkill fallback
- Use more specific pkill patterns with TEST_DIR paths

Documentation:
- Document jq dependency in README.md

Rebalancer fix:
- Fix duplicate shard count updates in applyMovesToAnalysis
- All planners (DC/rack/node) update counts inline during planning
- Remove duplicate updates from applyMovesToAnalysis to avoid double-counting

* test/erasure_coding: use mktemp for test file template

Use mktemp instead of hardcoded /tmp/testfile_template.bin path to provide better isolation for concurrent test runs.
379 lines
10 KiB
Go
package distribution

import (
	"fmt"
	"slices"
)

// ShardMove represents a planned shard move
type ShardMove struct {
	ShardID    int
	SourceNode *TopologyNode
	DestNode   *TopologyNode
	Reason     string
}

// String returns a human-readable description of the move
func (m ShardMove) String() string {
	return fmt.Sprintf("shard %d: %s -> %s (%s)",
		m.ShardID, m.SourceNode.NodeID, m.DestNode.NodeID, m.Reason)
}

// RebalancePlan contains the complete plan for rebalancing EC shards
type RebalancePlan struct {
	Moves        []ShardMove
	Distribution *ECDistribution
	Analysis     *TopologyAnalysis

	// Statistics
	TotalMoves      int
	MovesAcrossDC   int
	MovesAcrossRack int
	MovesWithinRack int
}

// String returns a summary of the plan
func (p *RebalancePlan) String() string {
	return fmt.Sprintf("RebalancePlan{moves:%d, acrossDC:%d, acrossRack:%d, withinRack:%d}",
		p.TotalMoves, p.MovesAcrossDC, p.MovesAcrossRack, p.MovesWithinRack)
}

// DetailedString returns a detailed multi-line summary
func (p *RebalancePlan) DetailedString() string {
	s := "Rebalance Plan:\n"
	s += fmt.Sprintf("  Total Moves: %d\n", p.TotalMoves)
	s += fmt.Sprintf("  Across DC: %d\n", p.MovesAcrossDC)
	s += fmt.Sprintf("  Across Rack: %d\n", p.MovesAcrossRack)
	s += fmt.Sprintf("  Within Rack: %d\n", p.MovesWithinRack)
	s += "\nMoves:\n"
	for i, move := range p.Moves {
		s += fmt.Sprintf("  %d. %s\n", i+1, move.String())
	}
	return s
}

// Rebalancer plans shard moves to achieve proportional distribution
type Rebalancer struct {
	ecConfig  ECConfig
	repConfig ReplicationConfig
}

// NewRebalancer creates a new rebalancer with the given configuration
func NewRebalancer(ec ECConfig, rep ReplicationConfig) *Rebalancer {
	return &Rebalancer{
		ecConfig:  ec,
		repConfig: rep,
	}
}

// PlanRebalance creates a rebalancing plan based on current topology analysis
func (r *Rebalancer) PlanRebalance(analysis *TopologyAnalysis) (*RebalancePlan, error) {
	dist := CalculateDistribution(r.ecConfig, r.repConfig)

	plan := &RebalancePlan{
		Distribution: dist,
		Analysis:     analysis,
	}

	// Step 1: Balance across data centers
	dcMoves := r.planDCMoves(analysis, dist)
	for _, move := range dcMoves {
		plan.Moves = append(plan.Moves, move)
		plan.MovesAcrossDC++
	}

	// Update analysis after DC moves (for planning purposes)
	r.applyMovesToAnalysis(analysis, dcMoves)

	// Step 2: Balance across racks within each DC
	rackMoves := r.planRackMoves(analysis, dist)
	for _, move := range rackMoves {
		plan.Moves = append(plan.Moves, move)
		plan.MovesAcrossRack++
	}

	// Update analysis after rack moves
	r.applyMovesToAnalysis(analysis, rackMoves)

	// Step 3: Balance across nodes within each rack
	nodeMoves := r.planNodeMoves(analysis, dist)
	for _, move := range nodeMoves {
		plan.Moves = append(plan.Moves, move)
		plan.MovesWithinRack++
	}

	plan.TotalMoves = len(plan.Moves)

	return plan, nil
}

// planDCMoves plans moves to balance shards across data centers
func (r *Rebalancer) planDCMoves(analysis *TopologyAnalysis, dist *ECDistribution) []ShardMove {
	var moves []ShardMove

	overDCs := CalculateDCExcess(analysis, dist)
	underDCs := CalculateUnderservedDCs(analysis, dist)

	underIdx := 0
	for _, over := range overDCs {
		for over.Excess > 0 && underIdx < len(underDCs) {
			destDC := underDCs[underIdx]

			// Find a shard and source node
			shardID, srcNode := r.pickShardToMove(analysis, over.Nodes)
			if srcNode == nil {
				break
			}

			// Find destination node in target DC
			destNode := r.pickBestDestination(analysis, destDC, "", dist)
			if destNode == nil {
				underIdx++
				continue
			}

			moves = append(moves, ShardMove{
				ShardID:    shardID,
				SourceNode: srcNode,
				DestNode:   destNode,
				Reason:     fmt.Sprintf("balance DC: %s -> %s", srcNode.DataCenter, destDC),
			})

			over.Excess--
			analysis.ShardsByDC[srcNode.DataCenter]--
			analysis.ShardsByDC[destDC]++

			// Check if destDC reached target
			if analysis.ShardsByDC[destDC] >= dist.TargetShardsPerDC {
				underIdx++
			}
		}
	}

	return moves
}

// planRackMoves plans moves to balance shards across racks within each DC
func (r *Rebalancer) planRackMoves(analysis *TopologyAnalysis, dist *ECDistribution) []ShardMove {
	var moves []ShardMove

	for dc := range analysis.DCToRacks {
		dcShards := analysis.ShardsByDC[dc]
		numRacks := len(analysis.DCToRacks[dc])
		if numRacks == 0 {
			continue
		}

		targetPerRack := ceilDivide(dcShards, max(numRacks, dist.ReplicationConfig.MinRacksPerDC))

		overRacks := CalculateRackExcess(analysis, dc, targetPerRack)
		underRacks := CalculateUnderservedRacks(analysis, dc, targetPerRack)

		underIdx := 0
		for _, over := range overRacks {
			for over.Excess > 0 && underIdx < len(underRacks) {
				destRack := underRacks[underIdx]

				// Find shard and source node
				shardID, srcNode := r.pickShardToMove(analysis, over.Nodes)
				if srcNode == nil {
					break
				}

				// Find destination node in target rack
				destNode := r.pickBestDestination(analysis, dc, destRack, dist)
				if destNode == nil {
					underIdx++
					continue
				}

				moves = append(moves, ShardMove{
					ShardID:    shardID,
					SourceNode: srcNode,
					DestNode:   destNode,
					Reason:     fmt.Sprintf("balance rack: %s -> %s", srcNode.Rack, destRack),
				})

				over.Excess--
				analysis.ShardsByRack[srcNode.Rack]--
				analysis.ShardsByRack[destRack]++

				if analysis.ShardsByRack[destRack] >= targetPerRack {
					underIdx++
				}
			}
		}
	}

	return moves
}

// planNodeMoves plans moves to balance shards across nodes within each rack
func (r *Rebalancer) planNodeMoves(analysis *TopologyAnalysis, dist *ECDistribution) []ShardMove {
	var moves []ShardMove

	for rack, nodes := range analysis.RackToNodes {
		if len(nodes) <= 1 {
			continue
		}

		rackShards := analysis.ShardsByRack[rack]
		targetPerNode := ceilDivide(rackShards, max(len(nodes), dist.ReplicationConfig.MinNodesPerRack))

		// Find over and under nodes
		var overNodes []*TopologyNode
		var underNodes []*TopologyNode

		for _, node := range nodes {
			count := analysis.ShardsByNode[node.NodeID]
			if count > targetPerNode {
				overNodes = append(overNodes, node)
			} else if count < targetPerNode {
				underNodes = append(underNodes, node)
			}
		}

		// Sort over-nodes by excess (most shards first)
		slices.SortFunc(overNodes, func(a, b *TopologyNode) int {
			return analysis.ShardsByNode[b.NodeID] - analysis.ShardsByNode[a.NodeID]
		})

		underIdx := 0
		for _, srcNode := range overNodes {
			excess := analysis.ShardsByNode[srcNode.NodeID] - targetPerNode

			for excess > 0 && underIdx < len(underNodes) {
				destNode := underNodes[underIdx]

				// Pick a shard from this node, preferring parity shards
				shards := analysis.NodeToShards[srcNode.NodeID]
				if len(shards) == 0 {
					break
				}

				// Find a parity shard first, fall back to a data shard
				shardID := -1
				shardIdx := -1
				for i, s := range shards {
					if r.ecConfig.IsParityShard(s) {
						shardID = s
						shardIdx = i
						break
					}
				}
				if shardID == -1 {
					shardID = shards[0]
					shardIdx = 0
				}

				moves = append(moves, ShardMove{
					ShardID:    shardID,
					SourceNode: srcNode,
					DestNode:   destNode,
					Reason:     fmt.Sprintf("balance node: %s -> %s", srcNode.NodeID, destNode.NodeID),
				})

				excess--
				analysis.ShardsByNode[srcNode.NodeID]--
				analysis.ShardsByNode[destNode.NodeID]++

				// Update shard lists - remove the specific shard we picked
				analysis.NodeToShards[srcNode.NodeID] = append(
					shards[:shardIdx], shards[shardIdx+1:]...)
				analysis.NodeToShards[destNode.NodeID] = append(
					analysis.NodeToShards[destNode.NodeID], shardID)

				if analysis.ShardsByNode[destNode.NodeID] >= targetPerNode {
					underIdx++
				}
			}
		}
	}

	return moves
}

// pickShardToMove selects a shard and its node from the given nodes.
// It prefers to move parity shards first, keeping data shards spread out
// since data shards serve read requests while parity shards are only for reconstruction.
func (r *Rebalancer) pickShardToMove(analysis *TopologyAnalysis, nodes []*TopologyNode) (int, *TopologyNode) {
	// Sort by shard count (most shards first)
	slices.SortFunc(nodes, func(a, b *TopologyNode) int {
		return analysis.ShardsByNode[b.NodeID] - analysis.ShardsByNode[a.NodeID]
	})

	// First pass: try to find a parity shard to move (prefer moving parity)
	for _, node := range nodes {
		shards := analysis.NodeToShards[node.NodeID]
		for _, shardID := range shards {
			if r.ecConfig.IsParityShard(shardID) {
				return shardID, node
			}
		}
	}

	// Second pass: if no parity shards, move a data shard
	for _, node := range nodes {
		shards := analysis.NodeToShards[node.NodeID]
		if len(shards) > 0 {
			return shards[0], node
		}
	}

	return -1, nil
}

// pickBestDestination selects the best destination node
func (r *Rebalancer) pickBestDestination(analysis *TopologyAnalysis, targetDC, targetRack string, dist *ECDistribution) *TopologyNode {
	var candidates []*TopologyNode

	// Collect candidates
	for _, node := range analysis.AllNodes {
		// Filter by DC if specified
		if targetDC != "" && node.DataCenter != targetDC {
			continue
		}
		// Filter by rack if specified
		if targetRack != "" && node.Rack != targetRack {
			continue
		}
		// Check capacity
		if node.FreeSlots <= 0 {
			continue
		}
		// Check max shards limit
		if analysis.ShardsByNode[node.NodeID] >= dist.MaxShardsPerNode {
			continue
		}

		candidates = append(candidates, node)
	}

	if len(candidates) == 0 {
		return nil
	}

	// Sort by: 1) fewer shards, 2) more free slots
	slices.SortFunc(candidates, func(a, b *TopologyNode) int {
		aShards := analysis.ShardsByNode[a.NodeID]
		bShards := analysis.ShardsByNode[b.NodeID]
		if aShards != bShards {
			return aShards - bShards
		}
		return b.FreeSlots - a.FreeSlots
	})

	return candidates[0]
}

// applyMovesToAnalysis is a no-op placeholder for potential future use.
// Note: All planners (planDCMoves, planRackMoves, planNodeMoves) update
// their respective counts (ShardsByDC, ShardsByRack, ShardsByNode) and
// shard lists (NodeToShards) inline during planning. This avoids duplicate
// updates that would occur if we also updated counts here.
func (r *Rebalancer) applyMovesToAnalysis(analysis *TopologyAnalysis, moves []ShardMove) {
	// Counts are already updated by the individual planners.
	// This function is kept for API compatibility and potential future use.
}