fix: EC rebalance fails with replica placement 000 (#7812)

* fix: EC rebalance fails with replica placement 000

This PR fixes several issues with EC shard distribution:

1. Pre-flight check before EC encoding
   - Verify target disk type has capacity before encoding starts
   - Prevents encoding shards only to fail during rebalance
   - Shows helpful error when wrong diskType is specified (e.g., ssd when volumes are on hdd)

2. Fix EC rebalance with replica placement 000
   - When DiffRackCount=0, shards should be distributed freely across racks
   - The '000' placement means 'no volume replication needed' because EC provides redundancy
   - Previously all racks were skipped with error 'shards X > replica placement limit (0)'

3. Add unit tests for EC rebalance slot calculation
   - TestECRebalanceWithLimitedSlots: documents the limited slots scenario
   - TestECRebalanceZeroFreeSlots: reproduces the 0 free slots error

4. Add Makefile for manual EC testing
   - make setup: start cluster and populate data
   - make shell: open weed shell for EC commands
   - make clean: stop cluster and cleanup

* fix: default -rebalance to true for ec.encode

The -rebalance flag was defaulting to false, which meant ec.encode would
only print shard moves but not actually execute them. This is a poor
default since the whole point of EC encoding is to distribute shards
across servers for fault tolerance.

Now -rebalance defaults to true, so shards are actually distributed
after encoding. Users can use -rebalance=false if they only want to
see what would happen without making changes.

* test/erasure_coding: improve Makefile safety and docs

- Narrow pkill pattern for volume servers to use TEST_DIR instead of
  port pattern, avoiding accidental kills of unrelated SeaweedFS processes
- Document external dependencies (curl, jq) in header comments

* shell: refactor buildRackWithEcShards to reuse buildEcShards

Extract common shard bit construction logic to avoid duplication
between buildEcShards and buildRackWithEcShards helper functions.

* shell: update test for EC replication 000 behavior

When DiffRackCount=0 (replication "000"), EC shards should be
distributed freely across racks since erasure coding provides its
own redundancy. Update test expectation to reflect this behavior.

* erasure_coding: add distribution package for proportional EC shard placement

Add a new reusable package for EC shard distribution that:
- Supports configurable EC ratios (not hard-coded 10+4)
- Distributes shards proportionally based on replication policy
- Provides fault tolerance analysis
- Prefers moving parity shards to keep data shards spread out

Key components:
- ECConfig: Configurable data/parity shard counts
- ReplicationConfig: Parsed XYZ replication policy
- ECDistribution: Target shard counts per DC/rack/node
- Rebalancer: Plans shard moves with parity-first strategy

This enables seaweed-enterprise custom EC ratios and weed worker
integration while maintaining a clean, testable architecture.

* shell: integrate distribution package for EC rebalancing

Add shell wrappers around the distribution package:
- ProportionalECRebalancer: Plans moves using distribution.Rebalancer
- NewProportionalECRebalancerWithConfig: Supports custom EC configs
- GetDistributionSummary/GetFaultToleranceAnalysis: Helper functions

The shell layer converts between EcNode types and the generic
TopologyNode types used by the distribution package.

* test setup

* ec: improve data and parity shard distribution across racks

- Add shardsByTypePerRack helper to track data vs parity shards
- Rewrite doBalanceEcShardsAcrossRacks for two-pass balancing:
  1. Balance data shards (0-9) evenly, max ceil(10/6)=2 per rack
  2. Balance parity shards (10-13) evenly, max ceil(4/6)=1 per rack
- Add balanceShardTypeAcrossRacks for generic shard type balancing
- Add pickRackForShardType to select destination with room for type
- Add unit tests for even data/parity distribution verification

This ensures even read load during normal operation by spreading
both data and parity shards across all available racks.

* ec: make data/parity shard counts configurable in ecBalancer

- Add dataShardCount and parityShardCount fields to ecBalancer struct
- Add getDataShardCount() and getParityShardCount() methods with defaults
- Replace direct constant usage with configurable methods
- Fix unused variable warning for parityPerRack

This allows seaweed-enterprise to use custom EC ratios while
defaulting to standard 10+4 scheme.

* Address PR 7812 review comments

Makefile improvements:
- Save PIDs for each volume server for precise termination
- Use PID-based killing in stop target with pkill fallback
- Use more specific pkill patterns with TEST_DIR paths

Documentation:
- Document jq dependency in README.md

Rebalancer fix:
- Fix duplicate shard count updates in applyMovesToAnalysis
- All planners (DC/rack/node) update counts inline during planning
- Remove duplicate updates from applyMovesToAnalysis to avoid double-counting

* test/erasure_coding: use mktemp for test file template

Use mktemp instead of hardcoded /tmp/testfile_template.bin path
to provide better isolation for concurrent test runs.
This commit is contained in:
Chris Lu
2025-12-19 13:29:12 -08:00
committed by GitHub
parent 77a56c2857
commit 4aa50bfa6a
16 changed files with 3127 additions and 28 deletions

View File

@@ -679,6 +679,25 @@ type ecBalancer struct {
applyBalancing bool
maxParallelization int
diskType types.DiskType // target disk type for EC shards (default: HardDriveType)
// EC configuration for shard distribution (defaults to 10+4)
dataShardCount int
parityShardCount int
}
// getDataShardCount returns the configured data shard count, defaulting to standard 10
func (ecb *ecBalancer) getDataShardCount() int {
if ecb.dataShardCount > 0 {
return ecb.dataShardCount
}
return erasure_coding.DataShardsCount
}
// getParityShardCount returns the configured parity shard count, defaulting to standard 4
func (ecb *ecBalancer) getParityShardCount() int {
if ecb.parityShardCount > 0 {
return ecb.parityShardCount
}
return erasure_coding.ParityShardsCount
}
func (ecb *ecBalancer) errorWaitGroup() *ErrorWaitGroup {
@@ -785,59 +804,176 @@ func countShardsByRack(vid needle.VolumeId, locations []*EcNode, diskType types.
})
}
// shardsByTypePerRack counts data shards (< dataShards) and parity shards (>= dataShards) per rack
func shardsByTypePerRack(vid needle.VolumeId, locations []*EcNode, diskType types.DiskType, dataShards int) (dataPerRack, parityPerRack map[string][]erasure_coding.ShardId) {
dataPerRack = make(map[string][]erasure_coding.ShardId)
parityPerRack = make(map[string][]erasure_coding.ShardId)
for _, ecNode := range locations {
shardBits := findEcVolumeShards(ecNode, vid, diskType)
rackId := string(ecNode.rack)
for _, shardId := range shardBits.ShardIds() {
if int(shardId) < dataShards {
dataPerRack[rackId] = append(dataPerRack[rackId], shardId)
} else {
parityPerRack[rackId] = append(parityPerRack[rackId], shardId)
}
}
}
return
}
func (ecb *ecBalancer) doBalanceEcShardsAcrossRacks(collection string, vid needle.VolumeId, locations []*EcNode) error {
racks := ecb.racks()
numRacks := len(racks)
// see the volume's shards are in how many racks, and how many in each rack
rackToShardCount := countShardsByRack(vid, locations, ecb.diskType)
// Use configured EC scheme for shard type classification (defaults to 10+4)
dataShardCount := ecb.getDataShardCount()
parityShardCount := ecb.getParityShardCount()
// Get current distribution of data shards per rack (parity computed after data balancing)
dataPerRack, _ := shardsByTypePerRack(vid, locations, ecb.diskType, dataShardCount)
// Calculate max shards per rack for each type to ensure even spread
// Data: 10 shards / 6 racks = max 2 per rack
// Parity: 4 shards / 6 racks = max 1 per rack (with 2 racks having 0)
maxDataPerRack := ceilDivide(dataShardCount, numRacks)
maxParityPerRack := ceilDivide(parityShardCount, numRacks)
// Calculate actual total shards for this volume (not hardcoded default)
var totalShardsForVolume int
for _, count := range rackToShardCount {
totalShardsForVolume += count
}
// calculate average number of shards an ec rack should have for one volume
averageShardsPerEcRack := ceilDivide(totalShardsForVolume, len(racks))
rackEcNodesWithVid := groupBy(locations, func(ecNode *EcNode) string {
return string(ecNode.rack)
})
// ecShardsToMove = select overflown ec shards from racks with ec shard counts > averageShardsPerEcRack
ecShardsToMove := make(map[erasure_coding.ShardId]*EcNode)
for rackId, count := range rackToShardCount {
if count <= averageShardsPerEcRack {
// Track total shard count per rack for slot management
rackToShardCount := countShardsByRack(vid, locations, ecb.diskType)
// First pass: Balance data shards across racks
if err := ecb.balanceShardTypeAcrossRacks(collection, vid, racks, rackEcNodesWithVid, dataPerRack, rackToShardCount, maxDataPerRack, "data"); err != nil {
return err
}
// Refresh locations after data shard moves and get parity distribution
locations = ecb.collectVolumeIdToEcNodes(collection)[vid]
_, parityPerRack := shardsByTypePerRack(vid, locations, ecb.diskType, dataShardCount)
rackEcNodesWithVid = groupBy(locations, func(ecNode *EcNode) string {
return string(ecNode.rack)
})
rackToShardCount = countShardsByRack(vid, locations, ecb.diskType)
// Second pass: Balance parity shards across racks
if err := ecb.balanceShardTypeAcrossRacks(collection, vid, racks, rackEcNodesWithVid, parityPerRack, rackToShardCount, maxParityPerRack, "parity"); err != nil {
return err
}
return nil
}
// balanceShardTypeAcrossRacks spreads shards of a specific type (data or parity) evenly across racks
func (ecb *ecBalancer) balanceShardTypeAcrossRacks(
collection string,
vid needle.VolumeId,
racks map[RackId]*EcRack,
rackEcNodesWithVid map[string][]*EcNode,
shardsPerRack map[string][]erasure_coding.ShardId,
rackToShardCount map[string]int,
maxPerRack int,
shardType string,
) error {
// Find racks with too many shards of this type
shardsToMove := make(map[erasure_coding.ShardId]*EcNode)
for rackId, shards := range shardsPerRack {
if len(shards) <= maxPerRack {
continue
}
possibleEcNodes := rackEcNodesWithVid[rackId]
for shardId, ecNode := range pickNEcShardsToMoveFrom(possibleEcNodes, vid, count-averageShardsPerEcRack, ecb.diskType) {
ecShardsToMove[shardId] = ecNode
// Pick excess shards to move
excess := len(shards) - maxPerRack
ecNodes := rackEcNodesWithVid[rackId]
for i := 0; i < excess && i < len(shards); i++ {
shardId := shards[i]
// Find which node has this shard
for _, ecNode := range ecNodes {
shardBits := findEcVolumeShards(ecNode, vid, ecb.diskType)
if shardBits.HasShardId(shardId) {
shardsToMove[shardId] = ecNode
break
}
}
}
}
for shardId, ecNode := range ecShardsToMove {
rackId, err := ecb.pickRackToBalanceShardsInto(racks, rackToShardCount)
// Move shards to racks that have fewer than maxPerRack of this type
for shardId, ecNode := range shardsToMove {
// Find destination rack with room for this shard type
destRackId, err := ecb.pickRackForShardType(racks, shardsPerRack, maxPerRack, rackToShardCount)
if err != nil {
fmt.Printf("ec shard %d.%d at %s can not find a destination rack:\n%s\n", vid, shardId, ecNode.info.Id, err.Error())
fmt.Printf("ec %s shard %d.%d at %s can not find a destination rack:\n%s\n", shardType, vid, shardId, ecNode.info.Id, err.Error())
continue
}
var possibleDestinationEcNodes []*EcNode
for _, n := range racks[rackId].ecNodes {
for _, n := range racks[destRackId].ecNodes {
possibleDestinationEcNodes = append(possibleDestinationEcNodes, n)
}
err = ecb.pickOneEcNodeAndMoveOneShard(ecNode, collection, vid, shardId, possibleDestinationEcNodes)
if err != nil {
return err
}
rackToShardCount[string(rackId)] += 1
rackToShardCount[string(ecNode.rack)] -= 1
racks[rackId].freeEcSlot -= 1
racks[ecNode.rack].freeEcSlot += 1
// Update tracking
shardsPerRack[string(destRackId)] = append(shardsPerRack[string(destRackId)], shardId)
// Remove from source rack
srcRack := string(ecNode.rack)
for i, s := range shardsPerRack[srcRack] {
if s == shardId {
shardsPerRack[srcRack] = append(shardsPerRack[srcRack][:i], shardsPerRack[srcRack][i+1:]...)
break
}
}
rackToShardCount[string(destRackId)] += 1
rackToShardCount[srcRack] -= 1
racks[destRackId].freeEcSlot -= 1
racks[RackId(srcRack)].freeEcSlot += 1
}
return nil
}
// pickRackForShardType selects a rack that has room for more shards of a specific type
func (ecb *ecBalancer) pickRackForShardType(
rackToEcNodes map[RackId]*EcRack,
shardsPerRack map[string][]erasure_coding.ShardId,
maxPerRack int,
rackToShardCount map[string]int,
) (RackId, error) {
var candidates []RackId
minShards := maxPerRack + 1
for rackId, rack := range rackToEcNodes {
if rack.freeEcSlot <= 0 {
continue
}
currentCount := len(shardsPerRack[string(rackId)])
if currentCount >= maxPerRack {
continue
}
// For EC shards, replica placement constraint only applies when DiffRackCount > 0.
if ecb.replicaPlacement != nil && ecb.replicaPlacement.DiffRackCount > 0 && rackToShardCount[string(rackId)] >= ecb.replicaPlacement.DiffRackCount {
continue
}
if currentCount < minShards {
candidates = nil
minShards = currentCount
}
if currentCount == minShards {
candidates = append(candidates, rackId)
}
}
if len(candidates) == 0 {
return "", errors.New("no rack available for shard type balancing")
}
return candidates[rand.IntN(len(candidates))], nil
}
func (ecb *ecBalancer) pickRackToBalanceShardsInto(rackToEcNodes map[RackId]*EcRack, rackToShardCount map[string]int) (RackId, error) {
targets := []RackId{}
targetShards := -1
@@ -855,7 +991,11 @@ func (ecb *ecBalancer) pickRackToBalanceShardsInto(rackToEcNodes map[RackId]*EcR
details += fmt.Sprintf(" Skipped %s because it has no free slots\n", rackId)
continue
}
if ecb.replicaPlacement != nil && shards > ecb.replicaPlacement.DiffRackCount {
// For EC shards, replica placement constraint only applies when DiffRackCount > 0.
// When DiffRackCount = 0 (e.g., replica placement "000"), EC shards should be
// distributed freely across racks for fault tolerance - the "000" means
// "no volume replication needed" because erasure coding provides redundancy.
if ecb.replicaPlacement != nil && ecb.replicaPlacement.DiffRackCount > 0 && shards > ecb.replicaPlacement.DiffRackCount {
details += fmt.Sprintf(" Skipped %s because shards %d > replica placement limit for other racks (%d)\n", rackId, shards, ecb.replicaPlacement.DiffRackCount)
continue
}
@@ -1056,7 +1196,11 @@ func (ecb *ecBalancer) pickEcNodeToBalanceShardsInto(vid needle.VolumeId, existi
}
shards := nodeShards[node]
if ecb.replicaPlacement != nil && shards > ecb.replicaPlacement.SameRackCount+1 {
// For EC shards, replica placement constraint only applies when SameRackCount > 0.
// When SameRackCount = 0 (e.g., replica placement "000"), EC shards should be
// distributed freely within racks - the "000" means "no volume replication needed"
// because erasure coding provides redundancy.
if ecb.replicaPlacement != nil && ecb.replicaPlacement.SameRackCount > 0 && shards > ecb.replicaPlacement.SameRackCount+1 {
details += fmt.Sprintf(" Skipped %s because shards %d > replica placement limit for the rack (%d + 1)\n", node.info.Id, shards, ecb.replicaPlacement.SameRackCount)
continue
}