fix: EC rebalance fails with replica placement 000 (#7812)
* fix: EC rebalance fails with replica placement 000

  This PR fixes several issues with EC shard distribution:

  1. Pre-flight check before EC encoding
     - Verify the target disk type has capacity before encoding starts
     - Prevents encoding shards only to fail during rebalance
     - Shows a helpful error when the wrong diskType is specified (e.g., ssd when volumes are on hdd)

  2. Fix EC rebalance with replica placement 000
     - When DiffRackCount=0, shards should be distributed freely across racks
     - The "000" placement means "no volume replication needed" because EC provides redundancy
     - Previously all racks were skipped with the error "shards X > replica placement limit (0)"

  3. Add unit tests for EC rebalance slot calculation
     - TestECRebalanceWithLimitedSlots: documents the limited-slots scenario
     - TestECRebalanceZeroFreeSlots: reproduces the "0 free slots" error

  4. Add Makefile for manual EC testing
     - make setup: start cluster and populate data
     - make shell: open weed shell for EC commands
     - make clean: stop cluster and clean up

* fix: default -rebalance to true for ec.encode

  The -rebalance flag defaulted to false, which meant ec.encode would only print shard moves but not actually execute them. This is a poor default, since the whole point of EC encoding is to distribute shards across servers for fault tolerance. Now -rebalance defaults to true, so shards are actually distributed after encoding. Users can pass -rebalance=false to see what would happen without making changes.

* test/erasure_coding: improve Makefile safety and docs

  - Narrow the pkill pattern for volume servers to use TEST_DIR instead of a port pattern, avoiding accidental kills of unrelated SeaweedFS processes
  - Document external dependencies (curl, jq) in the header comments

* shell: refactor buildRackWithEcShards to reuse buildEcShards

  Extract the common shard-bit construction logic to avoid duplication between the buildEcShards and buildRackWithEcShards helper functions.

* shell: update test for EC replication 000 behavior

  When DiffRackCount=0 (replication "000"), EC shards should be distributed freely across racks, since erasure coding provides its own redundancy. Update the test expectation to reflect this behavior.

* erasure_coding: add distribution package for proportional EC shard placement

  Add a new reusable package for EC shard distribution that:
  - Supports configurable EC ratios (not hard-coded 10+4)
  - Distributes shards proportionally based on the replication policy
  - Provides fault-tolerance analysis
  - Prefers moving parity shards, to keep data shards spread out

  Key components:
  - ECConfig: configurable data/parity shard counts
  - ReplicationConfig: parsed XYZ replication policy
  - ECDistribution: target shard counts per DC/rack/node
  - Rebalancer: plans shard moves with a parity-first strategy

  This enables seaweed-enterprise custom EC ratios and weed worker integration while maintaining a clean, testable architecture.

* shell: integrate distribution package for EC rebalancing

  Add shell wrappers around the distribution package:
  - ProportionalECRebalancer: plans moves using distribution.Rebalancer
  - NewProportionalECRebalancerWithConfig: supports custom EC configs
  - GetDistributionSummary/GetFaultToleranceAnalysis: helper functions

  The shell layer converts between EcNode types and the generic TopologyNode types used by the distribution package.

* test setup

* ec: improve data and parity shard distribution across racks

  - Add a shardsByTypePerRack helper to track data vs. parity shards
  - Rewrite doBalanceEcShardsAcrossRacks as a two-pass balancer:
    1. Balance data shards (0-9) evenly, max ceil(10/6)=2 per rack
    2. Balance parity shards (10-13) evenly, max ceil(4/6)=1 per rack
  - Add balanceShardTypeAcrossRacks for generic shard-type balancing
  - Add pickRackForShardType to select a destination with room for the type
  - Add unit tests verifying even data/parity distribution

  This ensures an even read load during normal operation by spreading both data and parity shards across all available racks.

* ec: make data/parity shard counts configurable in ecBalancer

  - Add dataShardCount and parityShardCount fields to the ecBalancer struct
  - Add getDataShardCount() and getParityShardCount() methods with defaults
  - Replace direct constant usage with the configurable methods
  - Fix an unused-variable warning for parityPerRack

  This allows seaweed-enterprise to use custom EC ratios while defaulting to the standard 10+4 scheme.

* Address PR 7812 review comments

  Makefile improvements:
  - Save PIDs for each volume server for precise termination
  - Use PID-based killing in the stop target, with a pkill fallback
  - Use more specific pkill patterns with TEST_DIR paths

  Documentation:
  - Document the jq dependency in README.md

  Rebalancer fix:
  - Fix duplicate shard count updates in applyMovesToAnalysis
  - All planners (DC/rack/node) update counts inline during planning
  - Remove the duplicate updates from applyMovesToAnalysis to avoid double-counting

* test/erasure_coding: use mktemp for test file template

  Use mktemp instead of the hardcoded /tmp/testfile_template.bin path to provide better isolation for concurrent test runs.
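The per-rack caps and the corrected "000" gate described above can be sketched as a small standalone Go example. This is a minimal illustration, not the actual shell package code: ceilDivide mirrors the balancer helper of the same name, while rackAllowed is a hypothetical condensation of the DiffRackCount check.

```go
package main

import "fmt"

// ceilDivide mirrors the balancer helper: integer ceil(a/b) for positive b.
func ceilDivide(a, b int) int {
	return (a + b - 1) / b
}

// rackAllowed condenses the corrected placement gate: the DiffRackCount limit
// only applies when it is positive, so replica placement "000"
// (DiffRackCount=0) never rejects a rack as a shard destination.
func rackAllowed(shardsOnRack, diffRackCount int) bool {
	return !(diffRackCount > 0 && shardsOnRack > diffRackCount)
}

func main() {
	dataShards, parityShards, numRacks := 10, 4, 6
	fmt.Printf("max data shards per rack:   %d\n", ceilDivide(dataShards, numRacks))   // ceil(10/6) = 2
	fmt.Printf("max parity shards per rack: %d\n", ceilDivide(parityShards, numRacks)) // ceil(4/6) = 1

	// Before the fix, any rack already holding a shard was skipped under "000"
	// ("shards 1 > replica placement limit (0)"); now it stays eligible.
	fmt.Println(rackAllowed(1, 0))
}
```

With 6 racks, the two-pass balancer caps each rack at 2 data shards and 1 parity shard, and the gate only rejects racks when a positive DiffRackCount is actually exceeded.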
@@ -679,6 +679,25 @@ type ecBalancer struct {
 	applyBalancing     bool
 	maxParallelization int
 	diskType           types.DiskType // target disk type for EC shards (default: HardDriveType)
+
+	// EC configuration for shard distribution (defaults to 10+4)
+	dataShardCount   int
+	parityShardCount int
 }
 
+// getDataShardCount returns the configured data shard count, defaulting to standard 10
+func (ecb *ecBalancer) getDataShardCount() int {
+	if ecb.dataShardCount > 0 {
+		return ecb.dataShardCount
+	}
+	return erasure_coding.DataShardsCount
+}
+
+// getParityShardCount returns the configured parity shard count, defaulting to standard 4
+func (ecb *ecBalancer) getParityShardCount() int {
+	if ecb.parityShardCount > 0 {
+		return ecb.parityShardCount
+	}
+	return erasure_coding.ParityShardsCount
+}
+
 func (ecb *ecBalancer) errorWaitGroup() *ErrorWaitGroup {
@@ -785,59 +804,176 @@ func countShardsByRack(vid needle.VolumeId, locations []*EcNode, diskType types.
 	})
 }
 
+// shardsByTypePerRack counts data shards (< dataShards) and parity shards (>= dataShards) per rack
+func shardsByTypePerRack(vid needle.VolumeId, locations []*EcNode, diskType types.DiskType, dataShards int) (dataPerRack, parityPerRack map[string][]erasure_coding.ShardId) {
+	dataPerRack = make(map[string][]erasure_coding.ShardId)
+	parityPerRack = make(map[string][]erasure_coding.ShardId)
+	for _, ecNode := range locations {
+		shardBits := findEcVolumeShards(ecNode, vid, diskType)
+		rackId := string(ecNode.rack)
+		for _, shardId := range shardBits.ShardIds() {
+			if int(shardId) < dataShards {
+				dataPerRack[rackId] = append(dataPerRack[rackId], shardId)
+			} else {
+				parityPerRack[rackId] = append(parityPerRack[rackId], shardId)
+			}
+		}
+	}
+	return
+}
+
 func (ecb *ecBalancer) doBalanceEcShardsAcrossRacks(collection string, vid needle.VolumeId, locations []*EcNode) error {
 	racks := ecb.racks()
+	numRacks := len(racks)
 
-	// see the volume's shards are in how many racks, and how many in each rack
-	rackToShardCount := countShardsByRack(vid, locations, ecb.diskType)
+	// Use configured EC scheme for shard type classification (defaults to 10+4)
+	dataShardCount := ecb.getDataShardCount()
+	parityShardCount := ecb.getParityShardCount()
 
-	// Calculate actual total shards for this volume (not hardcoded default)
-	var totalShardsForVolume int
-	for _, count := range rackToShardCount {
-		totalShardsForVolume += count
-	}
-	// calculate average number of shards an ec rack should have for one volume
-	averageShardsPerEcRack := ceilDivide(totalShardsForVolume, len(racks))
+	// Get current distribution of data shards per rack (parity computed after data balancing)
+	dataPerRack, _ := shardsByTypePerRack(vid, locations, ecb.diskType, dataShardCount)
+
+	// Calculate max shards per rack for each type to ensure even spread
+	// Data: 10 shards / 6 racks = max 2 per rack
+	// Parity: 4 shards / 6 racks = max 1 per rack (with 2 racks having 0)
+	maxDataPerRack := ceilDivide(dataShardCount, numRacks)
+	maxParityPerRack := ceilDivide(parityShardCount, numRacks)
+
 	rackEcNodesWithVid := groupBy(locations, func(ecNode *EcNode) string {
 		return string(ecNode.rack)
 	})
 
-	// ecShardsToMove = select overflown ec shards from racks with ec shard counts > averageShardsPerEcRack
-	ecShardsToMove := make(map[erasure_coding.ShardId]*EcNode)
-	for rackId, count := range rackToShardCount {
-		if count <= averageShardsPerEcRack {
-			continue
-		}
-		possibleEcNodes := rackEcNodesWithVid[rackId]
-		for shardId, ecNode := range pickNEcShardsToMoveFrom(possibleEcNodes, vid, count-averageShardsPerEcRack, ecb.diskType) {
-			ecShardsToMove[shardId] = ecNode
-		}
-	}
+	// Track total shard count per rack for slot management
+	rackToShardCount := countShardsByRack(vid, locations, ecb.diskType)
 
-	for shardId, ecNode := range ecShardsToMove {
-		rackId, err := ecb.pickRackToBalanceShardsInto(racks, rackToShardCount)
-		if err != nil {
-			fmt.Printf("ec shard %d.%d at %s can not find a destination rack:\n%s\n", vid, shardId, ecNode.info.Id, err.Error())
-			continue
-		}
+	// First pass: Balance data shards across racks
+	if err := ecb.balanceShardTypeAcrossRacks(collection, vid, racks, rackEcNodesWithVid, dataPerRack, rackToShardCount, maxDataPerRack, "data"); err != nil {
+		return err
+	}
 
-		var possibleDestinationEcNodes []*EcNode
-		for _, n := range racks[rackId].ecNodes {
-			possibleDestinationEcNodes = append(possibleDestinationEcNodes, n)
-		}
-		err = ecb.pickOneEcNodeAndMoveOneShard(ecNode, collection, vid, shardId, possibleDestinationEcNodes)
-		if err != nil {
-			return err
-		}
-		rackToShardCount[string(rackId)] += 1
-		rackToShardCount[string(ecNode.rack)] -= 1
-		racks[rackId].freeEcSlot -= 1
-		racks[ecNode.rack].freeEcSlot += 1
-	}
+	// Refresh locations after data shard moves and get parity distribution
+	locations = ecb.collectVolumeIdToEcNodes(collection)[vid]
+	_, parityPerRack := shardsByTypePerRack(vid, locations, ecb.diskType, dataShardCount)
+	rackEcNodesWithVid = groupBy(locations, func(ecNode *EcNode) string {
+		return string(ecNode.rack)
+	})
+	rackToShardCount = countShardsByRack(vid, locations, ecb.diskType)
+
+	// Second pass: Balance parity shards across racks
+	if err := ecb.balanceShardTypeAcrossRacks(collection, vid, racks, rackEcNodesWithVid, parityPerRack, rackToShardCount, maxParityPerRack, "parity"); err != nil {
+		return err
+	}
 
 	return nil
 }
 
+// balanceShardTypeAcrossRacks spreads shards of a specific type (data or parity) evenly across racks
+func (ecb *ecBalancer) balanceShardTypeAcrossRacks(
+	collection string,
+	vid needle.VolumeId,
+	racks map[RackId]*EcRack,
+	rackEcNodesWithVid map[string][]*EcNode,
+	shardsPerRack map[string][]erasure_coding.ShardId,
+	rackToShardCount map[string]int,
+	maxPerRack int,
+	shardType string,
+) error {
+	// Find racks with too many shards of this type
+	shardsToMove := make(map[erasure_coding.ShardId]*EcNode)
+	for rackId, shards := range shardsPerRack {
+		if len(shards) <= maxPerRack {
+			continue
+		}
+		// Pick excess shards to move
+		excess := len(shards) - maxPerRack
+		ecNodes := rackEcNodesWithVid[rackId]
+		for i := 0; i < excess && i < len(shards); i++ {
+			shardId := shards[i]
+			// Find which node has this shard
+			for _, ecNode := range ecNodes {
+				shardBits := findEcVolumeShards(ecNode, vid, ecb.diskType)
+				if shardBits.HasShardId(shardId) {
+					shardsToMove[shardId] = ecNode
+					break
+				}
+			}
+		}
+	}
+
+	// Move shards to racks that have fewer than maxPerRack of this type
+	for shardId, ecNode := range shardsToMove {
+		// Find destination rack with room for this shard type
+		destRackId, err := ecb.pickRackForShardType(racks, shardsPerRack, maxPerRack, rackToShardCount)
+		if err != nil {
+			fmt.Printf("ec %s shard %d.%d at %s can not find a destination rack:\n%s\n", shardType, vid, shardId, ecNode.info.Id, err.Error())
+			continue
+		}
+
+		var possibleDestinationEcNodes []*EcNode
+		for _, n := range racks[destRackId].ecNodes {
+			possibleDestinationEcNodes = append(possibleDestinationEcNodes, n)
+		}
+		err = ecb.pickOneEcNodeAndMoveOneShard(ecNode, collection, vid, shardId, possibleDestinationEcNodes)
+		if err != nil {
+			return err
+		}
+
+		// Update tracking
+		shardsPerRack[string(destRackId)] = append(shardsPerRack[string(destRackId)], shardId)
+		// Remove from source rack
+		srcRack := string(ecNode.rack)
+		for i, s := range shardsPerRack[srcRack] {
+			if s == shardId {
+				shardsPerRack[srcRack] = append(shardsPerRack[srcRack][:i], shardsPerRack[srcRack][i+1:]...)
+				break
+			}
+		}
+		rackToShardCount[string(destRackId)] += 1
+		rackToShardCount[srcRack] -= 1
+		racks[destRackId].freeEcSlot -= 1
+		racks[RackId(srcRack)].freeEcSlot += 1
+	}
+
+	return nil
+}
+
+// pickRackForShardType selects a rack that has room for more shards of a specific type
+func (ecb *ecBalancer) pickRackForShardType(
+	rackToEcNodes map[RackId]*EcRack,
+	shardsPerRack map[string][]erasure_coding.ShardId,
+	maxPerRack int,
+	rackToShardCount map[string]int,
+) (RackId, error) {
+	var candidates []RackId
+	minShards := maxPerRack + 1
+
+	for rackId, rack := range rackToEcNodes {
+		if rack.freeEcSlot <= 0 {
+			continue
+		}
+		currentCount := len(shardsPerRack[string(rackId)])
+		if currentCount >= maxPerRack {
+			continue
+		}
+		// For EC shards, replica placement constraint only applies when DiffRackCount > 0.
+		if ecb.replicaPlacement != nil && ecb.replicaPlacement.DiffRackCount > 0 && rackToShardCount[string(rackId)] >= ecb.replicaPlacement.DiffRackCount {
+			continue
+		}
+		if currentCount < minShards {
+			candidates = nil
+			minShards = currentCount
+		}
+		if currentCount == minShards {
+			candidates = append(candidates, rackId)
+		}
+	}
+
+	if len(candidates) == 0 {
+		return "", errors.New("no rack available for shard type balancing")
+	}
+	return candidates[rand.IntN(len(candidates))], nil
+}
+
 func (ecb *ecBalancer) pickRackToBalanceShardsInto(rackToEcNodes map[RackId]*EcRack, rackToShardCount map[string]int) (RackId, error) {
 	targets := []RackId{}
 	targetShards := -1
@@ -855,7 +991,11 @@ func (ecb *ecBalancer) pickRackToBalanceShardsInto(rackToEcNodes map[RackId]*EcR
 			details += fmt.Sprintf(" Skipped %s because it has no free slots\n", rackId)
 			continue
 		}
-		if ecb.replicaPlacement != nil && shards > ecb.replicaPlacement.DiffRackCount {
+		// For EC shards, replica placement constraint only applies when DiffRackCount > 0.
+		// When DiffRackCount = 0 (e.g., replica placement "000"), EC shards should be
+		// distributed freely across racks for fault tolerance - the "000" means
+		// "no volume replication needed" because erasure coding provides redundancy.
+		if ecb.replicaPlacement != nil && ecb.replicaPlacement.DiffRackCount > 0 && shards > ecb.replicaPlacement.DiffRackCount {
 			details += fmt.Sprintf(" Skipped %s because shards %d > replica placement limit for other racks (%d)\n", rackId, shards, ecb.replicaPlacement.DiffRackCount)
 			continue
 		}
@@ -1056,7 +1196,11 @@ func (ecb *ecBalancer) pickEcNodeToBalanceShardsInto(vid needle.VolumeId, existi
 		}
 
 		shards := nodeShards[node]
-		if ecb.replicaPlacement != nil && shards > ecb.replicaPlacement.SameRackCount+1 {
+		// For EC shards, replica placement constraint only applies when SameRackCount > 0.
+		// When SameRackCount = 0 (e.g., replica placement "000"), EC shards should be
+		// distributed freely within racks - the "000" means "no volume replication needed"
+		// because erasure coding provides redundancy.
+		if ecb.replicaPlacement != nil && ecb.replicaPlacement.SameRackCount > 0 && shards > ecb.replicaPlacement.SameRackCount+1 {
 			details += fmt.Sprintf(" Skipped %s because shards %d > replica placement limit for the rack (%d + 1)\n", node.info.Id, shards, ecb.replicaPlacement.SameRackCount)
 			continue
 		}