fix: EC rebalance fails with replica placement 000 (#7812)
* fix: EC rebalance fails with replica placement 000

  This PR fixes several issues with EC shard distribution:

  1. Pre-flight check before EC encoding
     - Verify the target disk type has capacity before encoding starts
     - Prevents encoding shards only to fail during rebalance
     - Shows a helpful error when the wrong diskType is specified (e.g., ssd when volumes are on hdd)

  2. Fix EC rebalance with replica placement 000
     - When DiffRackCount=0, shards should be distributed freely across racks
     - The '000' placement means 'no volume replication needed' because EC provides redundancy
     - Previously all racks were skipped with the error 'shards X > replica placement limit (0)'

  3. Add unit tests for EC rebalance slot calculation
     - TestECRebalanceWithLimitedSlots: documents the limited-slots scenario
     - TestECRebalanceZeroFreeSlots: reproduces the 0-free-slots error

  4. Add Makefile for manual EC testing
     - make setup: start cluster and populate data
     - make shell: open weed shell for EC commands
     - make clean: stop cluster and clean up

* fix: default -rebalance to true for ec.encode

  The -rebalance flag was defaulting to false, which meant ec.encode would only print shard moves but not actually execute them. This is a poor default, since the whole point of EC encoding is to distribute shards across servers for fault tolerance.

  Now -rebalance defaults to true, so shards are actually distributed after encoding. Users can pass -rebalance=false if they only want to see what would happen without making changes.

* test/erasure_coding: improve Makefile safety and docs

  - Narrow the pkill pattern for volume servers to use TEST_DIR instead of a port pattern, avoiding accidental kills of unrelated SeaweedFS processes
  - Document external dependencies (curl, jq) in header comments

* shell: refactor buildRackWithEcShards to reuse buildEcShards

  Extract the common shard bit construction logic to avoid duplication between the buildEcShards and buildRackWithEcShards helper functions.
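The "000" fix described above amounts to applying the rack-level limit only when DiffRackCount is positive. A minimal standalone sketch of that guard, where `canPlaceInRack` and its parameters are illustrative stand-ins rather than the actual ecBalancer code:

```go
package main

import "fmt"

// canPlaceInRack mirrors the fixed check: a rack is rejected only when a
// rack-level replica limit exists (diffRackCount > 0) AND the rack already
// holds more shards than that limit. With replication "000"
// (diffRackCount == 0) every rack stays eligible, because erasure coding
// itself supplies the redundancy.
func canPlaceInRack(diffRackCount, shardsInRack int) bool {
	if diffRackCount > 0 && shardsInRack > diffRackCount {
		return false
	}
	return true
}

func main() {
	fmt.Println(canPlaceInRack(0, 3)) // replication "000": always eligible
	fmt.Println(canPlaceInRack(1, 2)) // limit 1, rack already holds 2: skipped
}
```

Before the fix, the equivalent check omitted the `diffRackCount > 0` clause, so with "000" every rack failed `shards > 0` and rebalancing had nowhere to place shards.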
* shell: update test for EC replication 000 behavior

  When DiffRackCount=0 (replication "000"), EC shards should be distributed freely across racks, since erasure coding provides its own redundancy. Update the test expectation to reflect this behavior.

* erasure_coding: add distribution package for proportional EC shard placement

  Add a new reusable package for EC shard distribution that:
  - Supports configurable EC ratios (not hard-coded 10+4)
  - Distributes shards proportionally based on the replication policy
  - Provides fault tolerance analysis
  - Prefers moving parity shards to keep data shards spread out

  Key components:
  - ECConfig: configurable data/parity shard counts
  - ReplicationConfig: parsed XYZ replication policy
  - ECDistribution: target shard counts per DC/rack/node
  - Rebalancer: plans shard moves with a parity-first strategy

  This enables seaweed-enterprise custom EC ratios and weed worker integration while maintaining a clean, testable architecture.

* shell: integrate distribution package for EC rebalancing

  Add shell wrappers around the distribution package:
  - ProportionalECRebalancer: plans moves using distribution.Rebalancer
  - NewProportionalECRebalancerWithConfig: supports custom EC configs
  - GetDistributionSummary/GetFaultToleranceAnalysis: helper functions

  The shell layer converts between EcNode types and the generic TopologyNode types used by the distribution package.

* test setup

* ec: improve data and parity shard distribution across racks

  - Add a shardsByTypePerRack helper to track data vs parity shards
  - Rewrite doBalanceEcShardsAcrossRacks for two-pass balancing:
    1. Balance data shards (0-9) evenly, max ceil(10/6)=2 per rack
    2. Balance parity shards (10-13) evenly, max ceil(4/6)=1 per rack
  - Add balanceShardTypeAcrossRacks for generic shard type balancing
  - Add pickRackForShardType to select a destination with room for the type
  - Add unit tests verifying even data/parity distribution

  This ensures an even read load during normal operation by spreading both data and parity shards across all available racks.

* ec: make data/parity shard counts configurable in ecBalancer

  - Add dataShardCount and parityShardCount fields to the ecBalancer struct
  - Add getDataShardCount() and getParityShardCount() methods with defaults
  - Replace direct constant usage with the configurable methods
  - Fix an unused variable warning for parityPerRack

  This allows seaweed-enterprise to use custom EC ratios while defaulting to the standard 10+4 scheme.

* Address PR 7812 review comments

  Makefile improvements:
  - Save PIDs for each volume server for precise termination
  - Use PID-based killing in the stop target, with a pkill fallback
  - Use more specific pkill patterns with TEST_DIR paths

  Documentation:
  - Document the jq dependency in README.md

  Rebalancer fix:
  - Fix duplicate shard count updates in applyMovesToAnalysis
  - All planners (DC/rack/node) update counts inline during planning
  - Remove the duplicate updates from applyMovesToAnalysis to avoid double-counting

* test/erasure_coding: use mktemp for test file template

  Use mktemp instead of the hardcoded /tmp/testfile_template.bin path to provide better isolation for concurrent test runs.
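The per-rack caps in the two-pass scheme above come from a ceiling division over the rack count, and a shard ID below the data-shard count is a data shard. A small sketch of both calculations; `ceilDivide` matches the helper named in the diff, while `shardType` is an illustrative stand-in for the classification done in shardsByTypePerRack:

```go
package main

import "fmt"

// ceilDivide returns the ceiling of a/b, as used to cap shards per rack.
func ceilDivide(a, b int) int {
	return (a + b - 1) / b
}

// shardType classifies a shard ID under a configurable EC scheme:
// IDs below dataShards are data shards, the rest are parity shards.
func shardType(shardID, dataShards int) string {
	if shardID < dataShards {
		return "data"
	}
	return "parity"
}

func main() {
	dataShards, parityShards, racks := 10, 4, 6
	fmt.Println(ceilDivide(dataShards, racks))   // 2: max data shards per rack
	fmt.Println(ceilDivide(parityShards, racks)) // 1: max parity shard per rack
	fmt.Println(shardType(9, dataShards))        // data
	fmt.Println(shardType(10, dataShards))       // parity
}
```

With the standard 10+4 scheme on 6 racks this yields at most 2 data shards and 1 parity shard per rack, which is why at least 5 racks carry data and 4 racks carry parity after balancing.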
@@ -679,6 +679,25 @@ type ecBalancer struct {
 	applyBalancing     bool
 	maxParallelization int
 	diskType           types.DiskType // target disk type for EC shards (default: HardDriveType)
+	// EC configuration for shard distribution (defaults to 10+4)
+	dataShardCount   int
+	parityShardCount int
 }
+
+// getDataShardCount returns the configured data shard count, defaulting to standard 10
+func (ecb *ecBalancer) getDataShardCount() int {
+	if ecb.dataShardCount > 0 {
+		return ecb.dataShardCount
+	}
+	return erasure_coding.DataShardsCount
+}
+
+// getParityShardCount returns the configured parity shard count, defaulting to standard 4
+func (ecb *ecBalancer) getParityShardCount() int {
+	if ecb.parityShardCount > 0 {
+		return ecb.parityShardCount
+	}
+	return erasure_coding.ParityShardsCount
+}
 
 func (ecb *ecBalancer) errorWaitGroup() *ErrorWaitGroup {
@@ -785,59 +804,176 @@ func countShardsByRack(vid needle.VolumeId, locations []*EcNode, diskType types.
 	})
 }
 
+// shardsByTypePerRack counts data shards (< dataShards) and parity shards (>= dataShards) per rack
+func shardsByTypePerRack(vid needle.VolumeId, locations []*EcNode, diskType types.DiskType, dataShards int) (dataPerRack, parityPerRack map[string][]erasure_coding.ShardId) {
+	dataPerRack = make(map[string][]erasure_coding.ShardId)
+	parityPerRack = make(map[string][]erasure_coding.ShardId)
+	for _, ecNode := range locations {
+		shardBits := findEcVolumeShards(ecNode, vid, diskType)
+		rackId := string(ecNode.rack)
+		for _, shardId := range shardBits.ShardIds() {
+			if int(shardId) < dataShards {
+				dataPerRack[rackId] = append(dataPerRack[rackId], shardId)
+			} else {
+				parityPerRack[rackId] = append(parityPerRack[rackId], shardId)
+			}
+		}
+	}
+	return
+}
 
 func (ecb *ecBalancer) doBalanceEcShardsAcrossRacks(collection string, vid needle.VolumeId, locations []*EcNode) error {
 	racks := ecb.racks()
+	numRacks := len(racks)
 
-	// see the volume's shards are in how many racks, and how many in each rack
-	rackToShardCount := countShardsByRack(vid, locations, ecb.diskType)
+	// Use configured EC scheme for shard type classification (defaults to 10+4)
+	dataShardCount := ecb.getDataShardCount()
+	parityShardCount := ecb.getParityShardCount()
+
+	// Get current distribution of data shards per rack (parity computed after data balancing)
+	dataPerRack, _ := shardsByTypePerRack(vid, locations, ecb.diskType, dataShardCount)
+
+	// Calculate max shards per rack for each type to ensure even spread
+	// Data: 10 shards / 6 racks = max 2 per rack
+	// Parity: 4 shards / 6 racks = max 1 per rack (with 2 racks having 0)
+	maxDataPerRack := ceilDivide(dataShardCount, numRacks)
+	maxParityPerRack := ceilDivide(parityShardCount, numRacks)
 
-	// Calculate actual total shards for this volume (not hardcoded default)
-	var totalShardsForVolume int
-	for _, count := range rackToShardCount {
-		totalShardsForVolume += count
-	}
-	// calculate average number of shards an ec rack should have for one volume
-	averageShardsPerEcRack := ceilDivide(totalShardsForVolume, len(racks))
 	rackEcNodesWithVid := groupBy(locations, func(ecNode *EcNode) string {
 		return string(ecNode.rack)
 	})
 
-	// ecShardsToMove = select overflown ec shards from racks with ec shard counts > averageShardsPerEcRack
-	ecShardsToMove := make(map[erasure_coding.ShardId]*EcNode)
-	for rackId, count := range rackToShardCount {
-		if count <= averageShardsPerEcRack {
+	// Track total shard count per rack for slot management
+	rackToShardCount := countShardsByRack(vid, locations, ecb.diskType)
+
+	// First pass: Balance data shards across racks
+	if err := ecb.balanceShardTypeAcrossRacks(collection, vid, racks, rackEcNodesWithVid, dataPerRack, rackToShardCount, maxDataPerRack, "data"); err != nil {
+		return err
+	}
+
+	// Refresh locations after data shard moves and get parity distribution
+	locations = ecb.collectVolumeIdToEcNodes(collection)[vid]
+	_, parityPerRack := shardsByTypePerRack(vid, locations, ecb.diskType, dataShardCount)
+	rackEcNodesWithVid = groupBy(locations, func(ecNode *EcNode) string {
+		return string(ecNode.rack)
+	})
+	rackToShardCount = countShardsByRack(vid, locations, ecb.diskType)
+
+	// Second pass: Balance parity shards across racks
+	if err := ecb.balanceShardTypeAcrossRacks(collection, vid, racks, rackEcNodesWithVid, parityPerRack, rackToShardCount, maxParityPerRack, "parity"); err != nil {
+		return err
+	}
+
+	return nil
+}
 
+// balanceShardTypeAcrossRacks spreads shards of a specific type (data or parity) evenly across racks
+func (ecb *ecBalancer) balanceShardTypeAcrossRacks(
+	collection string,
+	vid needle.VolumeId,
+	racks map[RackId]*EcRack,
+	rackEcNodesWithVid map[string][]*EcNode,
+	shardsPerRack map[string][]erasure_coding.ShardId,
+	rackToShardCount map[string]int,
+	maxPerRack int,
+	shardType string,
+) error {
+	// Find racks with too many shards of this type
+	shardsToMove := make(map[erasure_coding.ShardId]*EcNode)
+	for rackId, shards := range shardsPerRack {
+		if len(shards) <= maxPerRack {
+			continue
+		}
-		possibleEcNodes := rackEcNodesWithVid[rackId]
-		for shardId, ecNode := range pickNEcShardsToMoveFrom(possibleEcNodes, vid, count-averageShardsPerEcRack, ecb.diskType) {
-			ecShardsToMove[shardId] = ecNode
+		// Pick excess shards to move
+		excess := len(shards) - maxPerRack
+		ecNodes := rackEcNodesWithVid[rackId]
+		for i := 0; i < excess && i < len(shards); i++ {
+			shardId := shards[i]
+			// Find which node has this shard
+			for _, ecNode := range ecNodes {
+				shardBits := findEcVolumeShards(ecNode, vid, ecb.diskType)
+				if shardBits.HasShardId(shardId) {
+					shardsToMove[shardId] = ecNode
+					break
+				}
+			}
+		}
+	}
 
-	for shardId, ecNode := range ecShardsToMove {
-		rackId, err := ecb.pickRackToBalanceShardsInto(racks, rackToShardCount)
+	// Move shards to racks that have fewer than maxPerRack of this type
+	for shardId, ecNode := range shardsToMove {
+		// Find destination rack with room for this shard type
+		destRackId, err := ecb.pickRackForShardType(racks, shardsPerRack, maxPerRack, rackToShardCount)
 		if err != nil {
-			fmt.Printf("ec shard %d.%d at %s can not find a destination rack:\n%s\n", vid, shardId, ecNode.info.Id, err.Error())
+			fmt.Printf("ec %s shard %d.%d at %s can not find a destination rack:\n%s\n", shardType, vid, shardId, ecNode.info.Id, err.Error())
 			continue
 		}
 
 		var possibleDestinationEcNodes []*EcNode
-		for _, n := range racks[rackId].ecNodes {
+		for _, n := range racks[destRackId].ecNodes {
 			possibleDestinationEcNodes = append(possibleDestinationEcNodes, n)
 		}
 		err = ecb.pickOneEcNodeAndMoveOneShard(ecNode, collection, vid, shardId, possibleDestinationEcNodes)
 		if err != nil {
 			return err
 		}
-		rackToShardCount[string(rackId)] += 1
-		rackToShardCount[string(ecNode.rack)] -= 1
-		racks[rackId].freeEcSlot -= 1
-		racks[ecNode.rack].freeEcSlot += 1
+
+		// Update tracking
+		shardsPerRack[string(destRackId)] = append(shardsPerRack[string(destRackId)], shardId)
+		// Remove from source rack
+		srcRack := string(ecNode.rack)
+		for i, s := range shardsPerRack[srcRack] {
+			if s == shardId {
+				shardsPerRack[srcRack] = append(shardsPerRack[srcRack][:i], shardsPerRack[srcRack][i+1:]...)
+				break
+			}
+		}
+		rackToShardCount[string(destRackId)] += 1
+		rackToShardCount[srcRack] -= 1
+		racks[destRackId].freeEcSlot -= 1
+		racks[RackId(srcRack)].freeEcSlot += 1
 	}
 
 	return nil
 }
 
+// pickRackForShardType selects a rack that has room for more shards of a specific type
+func (ecb *ecBalancer) pickRackForShardType(
+	rackToEcNodes map[RackId]*EcRack,
+	shardsPerRack map[string][]erasure_coding.ShardId,
+	maxPerRack int,
+	rackToShardCount map[string]int,
+) (RackId, error) {
+	var candidates []RackId
+	minShards := maxPerRack + 1
+
+	for rackId, rack := range rackToEcNodes {
+		if rack.freeEcSlot <= 0 {
+			continue
+		}
+		currentCount := len(shardsPerRack[string(rackId)])
+		if currentCount >= maxPerRack {
+			continue
+		}
+		// For EC shards, replica placement constraint only applies when DiffRackCount > 0.
+		if ecb.replicaPlacement != nil && ecb.replicaPlacement.DiffRackCount > 0 && rackToShardCount[string(rackId)] >= ecb.replicaPlacement.DiffRackCount {
+			continue
+		}
+		if currentCount < minShards {
+			candidates = nil
+			minShards = currentCount
+		}
+		if currentCount == minShards {
+			candidates = append(candidates, rackId)
+		}
+	}
+
+	if len(candidates) == 0 {
+		return "", errors.New("no rack available for shard type balancing")
+	}
+	return candidates[rand.IntN(len(candidates))], nil
+}
 
 func (ecb *ecBalancer) pickRackToBalanceShardsInto(rackToEcNodes map[RackId]*EcRack, rackToShardCount map[string]int) (RackId, error) {
 	targets := []RackId{}
 	targetShards := -1
@@ -855,7 +991,11 @@ func (ecb *ecBalancer) pickRackToBalanceShardsInto(rackToEcNodes map[RackId]*EcR
 			details += fmt.Sprintf("  Skipped %s because it has no free slots\n", rackId)
 			continue
 		}
-		if ecb.replicaPlacement != nil && shards > ecb.replicaPlacement.DiffRackCount {
+		// For EC shards, replica placement constraint only applies when DiffRackCount > 0.
+		// When DiffRackCount = 0 (e.g., replica placement "000"), EC shards should be
+		// distributed freely across racks for fault tolerance - the "000" means
+		// "no volume replication needed" because erasure coding provides redundancy.
+		if ecb.replicaPlacement != nil && ecb.replicaPlacement.DiffRackCount > 0 && shards > ecb.replicaPlacement.DiffRackCount {
 			details += fmt.Sprintf("  Skipped %s because shards %d > replica placement limit for other racks (%d)\n", rackId, shards, ecb.replicaPlacement.DiffRackCount)
 			continue
 		}
@@ -1056,7 +1196,11 @@ func (ecb *ecBalancer) pickEcNodeToBalanceShardsInto(vid needle.VolumeId, existi
 		}
 
 		shards := nodeShards[node]
-		if ecb.replicaPlacement != nil && shards > ecb.replicaPlacement.SameRackCount+1 {
+		// For EC shards, replica placement constraint only applies when SameRackCount > 0.
+		// When SameRackCount = 0 (e.g., replica placement "000"), EC shards should be
+		// distributed freely within racks - the "000" means "no volume replication needed"
+		// because erasure coding provides redundancy.
+		if ecb.replicaPlacement != nil && ecb.replicaPlacement.SameRackCount > 0 && shards > ecb.replicaPlacement.SameRackCount+1 {
 			details += fmt.Sprintf("  Skipped %s because shards %d > replica placement limit for the rack (%d + 1)\n", node.info.Id, shards, ecb.replicaPlacement.SameRackCount)
 			continue
 		}
 
@@ -133,7 +133,9 @@ func TestPickRackToBalanceShardsInto(t *testing.T) {
 		{testTopologyEc, "6241", "123", []string{"rack1", "rack2", "rack3", "rack4", "rack5", "rack6"}, ""},
 		{testTopologyEc, "6242", "123", []string{"rack1", "rack2", "rack3", "rack4", "rack5", "rack6"}, ""},
 		// EC volumes.
-		{testTopologyEc, "9577", "", nil, "shards 1 > replica placement limit for other racks (0)"},
+		// With replication "000" (DiffRackCount=0), EC shards should be distributed freely
+		// because erasure coding provides its own redundancy. No replica placement error.
+		{testTopologyEc, "9577", "", []string{"rack1", "rack2", "rack3"}, ""},
 		{testTopologyEc, "9577", "111", []string{"rack1", "rack2", "rack3"}, ""},
 		{testTopologyEc, "9577", "222", []string{"rack1", "rack2", "rack3"}, ""},
 		{testTopologyEc, "10457", "222", []string{"rack1"}, ""},
 
@@ -94,7 +94,7 @@ func (c *commandEcEncode) Do(args []string, commandEnv *CommandEnv, writer io.Wr
 	shardReplicaPlacement := encodeCommand.String("shardReplicaPlacement", "", "replica placement for EC shards, or master default if empty")
 	sourceDiskTypeStr := encodeCommand.String("sourceDiskType", "", "filter source volumes by disk type (hdd, ssd, or empty for all)")
 	diskTypeStr := encodeCommand.String("diskType", "", "target disk type for EC shards (hdd, ssd, or empty for default hdd)")
-	applyBalancing := encodeCommand.Bool("rebalance", false, "re-balance EC shards after creation")
+	applyBalancing := encodeCommand.Bool("rebalance", true, "re-balance EC shards after creation (default: true)")
 	verbose := encodeCommand.Bool("verbose", false, "show detailed reasons why volumes are not selected for encoding")
 
 	if err = encodeCommand.Parse(args); err != nil {
@@ -164,6 +164,32 @@ func (c *commandEcEncode) Do(args []string, commandEnv *CommandEnv, writer io.Wr
 		return fmt.Errorf("failed to collect volume locations before EC encoding: %w", err)
 	}
 
+	// Pre-flight check: verify the target disk type has capacity for EC shards
+	// This prevents encoding shards only to fail during rebalance
+	_, totalFreeEcSlots, err := collectEcNodesForDC(commandEnv, "", diskType)
+	if err != nil {
+		return fmt.Errorf("failed to check EC shard capacity: %w", err)
+	}
+
+	// Calculate required slots: each volume needs TotalShardsCount (14) shards distributed
+	requiredSlots := len(volumeIds) * erasure_coding.TotalShardsCount
+	if totalFreeEcSlots < 1 {
+		// No capacity at all on the target disk type
+		if diskType != types.HardDriveType {
+			return fmt.Errorf("no free ec shard slots on disk type '%s'. The target disk type has no capacity.\n"+
+				"Your volumes are likely on a different disk type. Try:\n"+
+				"  ec.encode -collection=%s -diskType=hdd\n"+
+				"Or omit -diskType to use the default (hdd)", diskType, *collection)
+		}
+		return fmt.Errorf("no free ec shard slots. only %d left on disk type '%s'", totalFreeEcSlots, diskType)
+	}
+
+	if totalFreeEcSlots < requiredSlots {
+		fmt.Printf("Warning: limited EC shard capacity. Need %d slots for %d volumes, but only %d slots available on disk type '%s'.\n",
+			requiredSlots, len(volumeIds), totalFreeEcSlots, diskType)
+		fmt.Printf("Rebalancing may not achieve optimal distribution.\n")
+	}
+
 	// encode all requested volumes...
 	if err = doEcEncode(commandEnv, writer, volumeIdToCollection, volumeIds, *maxParallelization); err != nil {
 		return fmt.Errorf("ec encode for volumes %v: %w", volumeIds, err)
@@ -4,6 +4,7 @@ import (
 	"testing"
 
 	"github.com/seaweedfs/seaweedfs/weed/pb/master_pb"
+	"github.com/seaweedfs/seaweedfs/weed/storage/erasure_coding"
 	"github.com/seaweedfs/seaweedfs/weed/storage/needle"
 	"github.com/seaweedfs/seaweedfs/weed/storage/types"
 )
@@ -136,3 +137,151 @@ func newEcNode(dc string, rack string, dataNodeId string, freeEcSlot int) *EcNod
 func (ecNode *EcNode) addEcVolumeAndShardsForTest(vid uint32, collection string, shardIds []uint32) *EcNode {
 	return ecNode.addEcVolumeShards(needle.VolumeId(vid), collection, shardIds, types.HardDriveType)
 }
+
+// TestCommandEcBalanceEvenDataAndParityDistribution verifies that after balancing:
+// 1. Data shards (0-9) are evenly distributed across racks (max 2 per rack for 6 racks)
+// 2. Parity shards (10-13) are evenly distributed across racks (max 1 per rack for 6 racks)
+func TestCommandEcBalanceEvenDataAndParityDistribution(t *testing.T) {
+	// Setup: All 14 shards start on rack1 (simulating fresh EC encode)
+	ecb := &ecBalancer{
+		ecNodes: []*EcNode{
+			// All shards initially on rack1/dn1
+			newEcNode("dc1", "rack1", "dn1", 100).addEcVolumeAndShardsForTest(1, "c1", []uint32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13}),
+			// Empty nodes on other racks
+			newEcNode("dc1", "rack2", "dn2", 100),
+			newEcNode("dc1", "rack3", "dn3", 100),
+			newEcNode("dc1", "rack4", "dn4", 100),
+			newEcNode("dc1", "rack5", "dn5", 100),
+			newEcNode("dc1", "rack6", "dn6", 100),
+		},
+		applyBalancing: false, // Dry-run mode (simulates moves by updating internal state)
+		diskType:       types.HardDriveType,
+	}
+
+	ecb.balanceEcVolumes("c1")
+
+	// After balancing (dry-run), verify the PLANNED distribution by checking what moves were proposed
+	// The ecb.ecNodes state is updated during dry-run to track planned moves
+	vid := needle.VolumeId(1)
+	dataShardCount := erasure_coding.DataShardsCount     // 10
+	parityShardCount := erasure_coding.ParityShardsCount // 4
+
+	// Count data and parity shards per rack based on current (updated) state
+	dataPerRack, parityPerRack := countDataAndParityShardsPerRack(ecb.ecNodes, vid, dataShardCount)
+
+	// With 6 racks:
+	// - Data shards (10): max 2 per rack (ceil(10/6) = 2)
+	// - Parity shards (4): max 1 per rack (ceil(4/6) = 1)
+	maxDataPerRack := ceilDivide(dataShardCount, 6)     // 2
+	maxParityPerRack := ceilDivide(parityShardCount, 6) // 1
+
+	// Verify no rack has more than max data shards
+	for rackId, count := range dataPerRack {
+		if count > maxDataPerRack {
+			t.Errorf("rack %s has %d data shards, expected max %d", rackId, count, maxDataPerRack)
+		}
+	}
+
+	// Verify no rack has more than max parity shards
+	for rackId, count := range parityPerRack {
+		if count > maxParityPerRack {
+			t.Errorf("rack %s has %d parity shards, expected max %d", rackId, count, maxParityPerRack)
+		}
+	}
+
+	// Verify all shards are distributed (total counts)
+	totalData := 0
+	totalParity := 0
+	for _, count := range dataPerRack {
+		totalData += count
+	}
+	for _, count := range parityPerRack {
+		totalParity += count
+	}
+	if totalData != dataShardCount {
+		t.Errorf("total data shards = %d, expected %d", totalData, dataShardCount)
+	}
+	if totalParity != parityShardCount {
+		t.Errorf("total parity shards = %d, expected %d", totalParity, parityShardCount)
+	}
+
+	// Verify data shards are spread across at least 5 racks (10 shards / 2 max per rack)
+	racksWithData := len(dataPerRack)
+	minRacksForData := dataShardCount / maxDataPerRack // At least 5 racks needed for 10 data shards
+	if racksWithData < minRacksForData {
+		t.Errorf("data shards spread across only %d racks, expected at least %d", racksWithData, minRacksForData)
+	}
+
+	// Verify parity shards are spread across at least 4 racks (4 shards / 1 max per rack)
+	racksWithParity := len(parityPerRack)
+	if racksWithParity < parityShardCount {
+		t.Errorf("parity shards spread across only %d racks, expected at least %d", racksWithParity, parityShardCount)
+	}
+
+	t.Logf("Distribution after balancing:")
+	t.Logf("  Data shards per rack: %v (max allowed: %d)", dataPerRack, maxDataPerRack)
+	t.Logf("  Parity shards per rack: %v (max allowed: %d)", parityPerRack, maxParityPerRack)
+}
+
+// countDataAndParityShardsPerRack counts data and parity shards per rack
+func countDataAndParityShardsPerRack(ecNodes []*EcNode, vid needle.VolumeId, dataShardCount int) (dataPerRack, parityPerRack map[string]int) {
+	dataPerRack = make(map[string]int)
+	parityPerRack = make(map[string]int)
+
+	for _, ecNode := range ecNodes {
+		shardBits := findEcVolumeShards(ecNode, vid, types.HardDriveType)
+		for _, shardId := range shardBits.ShardIds() {
+			rackId := string(ecNode.rack)
+			if int(shardId) < dataShardCount {
+				dataPerRack[rackId]++
+			} else {
+				parityPerRack[rackId]++
+			}
+		}
+	}
+	return
+}
+
+// TestCommandEcBalanceMultipleVolumesEvenDistribution tests that multiple volumes
+// each get their data and parity shards evenly distributed
+func TestCommandEcBalanceMultipleVolumesEvenDistribution(t *testing.T) {
+	// Setup: Two volumes, each with all 14 shards on different starting racks
+	ecb := &ecBalancer{
+		ecNodes: []*EcNode{
+			// Volume 1: all shards on rack1
+			newEcNode("dc1", "rack1", "dn1", 100).addEcVolumeAndShardsForTest(1, "c1", []uint32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13}),
+			// Volume 2: all shards on rack2
+			newEcNode("dc1", "rack2", "dn2", 100).addEcVolumeAndShardsForTest(2, "c1", []uint32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13}),
+			// Empty nodes on other racks
+			newEcNode("dc1", "rack3", "dn3", 100),
+			newEcNode("dc1", "rack4", "dn4", 100),
+			newEcNode("dc1", "rack5", "dn5", 100),
+			newEcNode("dc1", "rack6", "dn6", 100),
+		},
+		applyBalancing: false, // Dry-run mode
+		diskType:       types.HardDriveType,
+	}
+
+	ecb.balanceEcVolumes("c1")
+
+	// Check both volumes
+	for _, vid := range []needle.VolumeId{1, 2} {
+		dataPerRack, parityPerRack := countDataAndParityShardsPerRack(ecb.ecNodes, vid, erasure_coding.DataShardsCount)
+
+		maxDataPerRack := ceilDivide(erasure_coding.DataShardsCount, 6)
+		maxParityPerRack := ceilDivide(erasure_coding.ParityShardsCount, 6)
+
+		for rackId, count := range dataPerRack {
+			if count > maxDataPerRack {
+				t.Errorf("volume %d: rack %s has %d data shards, expected max %d", vid, rackId, count, maxDataPerRack)
+			}
+		}
+		for rackId, count := range parityPerRack {
+			if count > maxParityPerRack {
+				t.Errorf("volume %d: rack %s has %d parity shards, expected max %d", vid, rackId, count, maxParityPerRack)
+			}
+		}
+
+		t.Logf("Volume %d - Data: %v, Parity: %v", vid, dataPerRack, parityPerRack)
+	}
+}
weed/shell/ec_proportional_rebalance.go — new file (284 lines)
@@ -0,0 +1,284 @@
|
||||
package shell
|
||||
|
||||
import (
|
||||
"fmt"
|
||||
|
||||
"github.com/seaweedfs/seaweedfs/weed/storage/erasure_coding"
|
||||
"github.com/seaweedfs/seaweedfs/weed/storage/erasure_coding/distribution"
|
||||
"github.com/seaweedfs/seaweedfs/weed/storage/needle"
|
||||
"github.com/seaweedfs/seaweedfs/weed/storage/super_block"
|
||||
"github.com/seaweedfs/seaweedfs/weed/storage/types"
|
||||
)
|
||||
|
||||
// ECDistribution is an alias to the distribution package type for backward compatibility
|
||||
type ECDistribution = distribution.ECDistribution
|
||||
|
||||
// CalculateECDistribution computes the target EC shard distribution based on replication policy.
|
||||
// This is a convenience wrapper that uses the default 10+4 EC configuration.
|
||||
// For custom EC ratios, use the distribution package directly.
|
||||
func CalculateECDistribution(totalShards, parityShards int, rp *super_block.ReplicaPlacement) *ECDistribution {
|
||||
ec := distribution.ECConfig{
|
||||
DataShards: totalShards - parityShards,
|
||||
ParityShards: parityShards,
|
||||
}
|
||||
rep := distribution.NewReplicationConfig(rp)
|
||||
return distribution.CalculateDistribution(ec, rep)
|
||||
}
|
||||
|
||||
// TopologyDistributionAnalysis holds the current shard distribution analysis
|
||||
// This wraps the distribution package's TopologyAnalysis with shell-specific EcNode handling
|
||||
type TopologyDistributionAnalysis struct {
|
||||
inner *distribution.TopologyAnalysis
|
||||
|
||||
// Shell-specific mappings
|
||||
nodeMap map[string]*EcNode // nodeID -> EcNode
|
||||
}
|
||||
|
||||
// NewTopologyDistributionAnalysis creates a new analysis structure
|
||||
func NewTopologyDistributionAnalysis() *TopologyDistributionAnalysis {
|
||||
return &TopologyDistributionAnalysis{
|
||||
inner: distribution.NewTopologyAnalysis(),
|
||||
nodeMap: make(map[string]*EcNode),
|
||||
}
|
||||
}
|
||||
|
||||
// AddNode adds a node and its shards to the analysis
|
||||
func (a *TopologyDistributionAnalysis) AddNode(node *EcNode, shardBits erasure_coding.ShardBits) {
|
||||
nodeId := node.info.Id
|
||||
|
||||
// Create distribution.TopologyNode from EcNode
|
||||
topoNode := &distribution.TopologyNode{
|
||||
NodeID: nodeId,
|
||||
DataCenter: string(node.dc),
|
||||
Rack: string(node.rack),
|
||||
FreeSlots: node.freeEcSlot,
|
||||
TotalShards: shardBits.ShardIdCount(),
|
||||
}
|
||||
|
||||
for _, shardId := range shardBits.ShardIds() {
|
||||
topoNode.ShardIDs = append(topoNode.ShardIDs, int(shardId))
|
||||
}
|
||||
|
||||
a.inner.AddNode(topoNode)
|
||||
a.nodeMap[nodeId] = node
|
||||
|
||||
// Add shard locations
|
||||
for _, shardId := range shardBits.ShardIds() {
|
||||
a.inner.AddShardLocation(distribution.ShardLocation{
|
||||
ShardID: int(shardId),
|
||||
NodeID: nodeId,
|
||||
DataCenter: string(node.dc),
|
||||
Rack: string(node.rack),
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
// Finalize completes the analysis
|
||||
func (a *TopologyDistributionAnalysis) Finalize() {
|
||||
a.inner.Finalize()
|
||||
}
|
||||
|
||||
// String returns a summary
|
||||
func (a *TopologyDistributionAnalysis) String() string {
|
||||
return a.inner.String()
|
||||
}
|
||||
|
||||
// DetailedString returns detailed analysis
|
||||
func (a *TopologyDistributionAnalysis) DetailedString() string {
|
||||
return a.inner.DetailedString()
|
||||
}
|
||||
|
||||
// GetShardsByDC returns shard counts keyed by data center.
func (a *TopologyDistributionAnalysis) GetShardsByDC() map[DataCenterId]int {
	result := make(map[DataCenterId]int)
	for dc, count := range a.inner.ShardsByDC {
		result[DataCenterId(dc)] = count
	}
	return result
}

// GetShardsByRack returns shard counts keyed by rack.
func (a *TopologyDistributionAnalysis) GetShardsByRack() map[RackId]int {
	result := make(map[RackId]int)
	for rack, count := range a.inner.ShardsByRack {
		result[RackId(rack)] = count
	}
	return result
}

// GetShardsByNode returns shard counts keyed by node.
func (a *TopologyDistributionAnalysis) GetShardsByNode() map[EcNodeId]int {
	result := make(map[EcNodeId]int)
	for nodeId, count := range a.inner.ShardsByNode {
		result[EcNodeId(nodeId)] = count
	}
	return result
}

// AnalyzeVolumeDistribution creates an analysis of the current shard distribution for a volume.
func AnalyzeVolumeDistribution(volumeId needle.VolumeId, locations []*EcNode, diskType types.DiskType) *TopologyDistributionAnalysis {
	analysis := NewTopologyDistributionAnalysis()

	for _, node := range locations {
		shardBits := findEcVolumeShards(node, volumeId, diskType)
		if shardBits.ShardIdCount() > 0 {
			analysis.AddNode(node, shardBits)
		}
	}

	analysis.Finalize()
	return analysis
}

// ECShardMove represents a planned shard move (shell-specific, with EcNode references).
type ECShardMove struct {
	VolumeId   needle.VolumeId
	ShardId    erasure_coding.ShardId
	SourceNode *EcNode
	DestNode   *EcNode
	Reason     string
}

// String returns a human-readable description of the move.
func (m ECShardMove) String() string {
	return fmt.Sprintf("volume %d shard %d: %s -> %s (%s)",
		m.VolumeId, m.ShardId, m.SourceNode.info.Id, m.DestNode.info.Id, m.Reason)
}

// ProportionalECRebalancer implements proportional shard distribution for shell commands.
type ProportionalECRebalancer struct {
	ecNodes          []*EcNode
	replicaPlacement *super_block.ReplicaPlacement
	diskType         types.DiskType
	ecConfig         distribution.ECConfig
}

// NewProportionalECRebalancer creates a proportional rebalancer with the default EC config.
func NewProportionalECRebalancer(
	ecNodes []*EcNode,
	rp *super_block.ReplicaPlacement,
	diskType types.DiskType,
) *ProportionalECRebalancer {
	return NewProportionalECRebalancerWithConfig(
		ecNodes,
		rp,
		diskType,
		distribution.DefaultECConfig(),
	)
}

// NewProportionalECRebalancerWithConfig creates a rebalancer with a custom EC configuration.
func NewProportionalECRebalancerWithConfig(
	ecNodes []*EcNode,
	rp *super_block.ReplicaPlacement,
	diskType types.DiskType,
	ecConfig distribution.ECConfig,
) *ProportionalECRebalancer {
	return &ProportionalECRebalancer{
		ecNodes:          ecNodes,
		replicaPlacement: rp,
		diskType:         diskType,
		ecConfig:         ecConfig,
	}
}

// PlanMoves generates a plan for moving shards to achieve proportional distribution.
func (r *ProportionalECRebalancer) PlanMoves(
	volumeId needle.VolumeId,
	locations []*EcNode,
) ([]ECShardMove, error) {
	// Build the topology analysis.
	analysis := distribution.NewTopologyAnalysis()
	nodeMap := make(map[string]*EcNode)

	// Add all EC nodes to the analysis (even those without shards).
	for _, node := range r.ecNodes {
		nodeId := node.info.Id
		topoNode := &distribution.TopologyNode{
			NodeID:     nodeId,
			DataCenter: string(node.dc),
			Rack:       string(node.rack),
			FreeSlots:  node.freeEcSlot,
		}
		analysis.AddNode(topoNode)
		nodeMap[nodeId] = node
	}

	// Add shard locations from nodes that have shards.
	for _, node := range locations {
		nodeId := node.info.Id
		shardBits := findEcVolumeShards(node, volumeId, r.diskType)
		for _, shardId := range shardBits.ShardIds() {
			analysis.AddShardLocation(distribution.ShardLocation{
				ShardID:    int(shardId),
				NodeID:     nodeId,
				DataCenter: string(node.dc),
				Rack:       string(node.rack),
			})
		}
		if _, exists := nodeMap[nodeId]; !exists {
			nodeMap[nodeId] = node
		}
	}

	analysis.Finalize()

	// Create the rebalancer and plan the moves.
	rep := distribution.NewReplicationConfig(r.replicaPlacement)
	rebalancer := distribution.NewRebalancer(r.ecConfig, rep)

	plan, err := rebalancer.PlanRebalance(analysis)
	if err != nil {
		return nil, err
	}

	// Convert distribution moves to shell moves.
	var moves []ECShardMove
	for _, move := range plan.Moves {
		srcNode := nodeMap[move.SourceNode.NodeID]
		destNode := nodeMap[move.DestNode.NodeID]
		if srcNode == nil || destNode == nil {
			continue
		}

		moves = append(moves, ECShardMove{
			VolumeId:   volumeId,
			ShardId:    erasure_coding.ShardId(move.ShardID),
			SourceNode: srcNode,
			DestNode:   destNode,
			Reason:     move.Reason,
		})
	}

	return moves, nil
}

// GetDistributionSummary returns a summary of the planned distribution.
func GetDistributionSummary(rp *super_block.ReplicaPlacement) string {
	ec := distribution.DefaultECConfig()
	rep := distribution.NewReplicationConfig(rp)
	dist := distribution.CalculateDistribution(ec, rep)
	return dist.Summary()
}

// GetDistributionSummaryWithConfig returns a summary with a custom EC configuration.
func GetDistributionSummaryWithConfig(rp *super_block.ReplicaPlacement, ecConfig distribution.ECConfig) string {
	rep := distribution.NewReplicationConfig(rp)
	dist := distribution.CalculateDistribution(ecConfig, rep)
	return dist.Summary()
}

// GetFaultToleranceAnalysis returns a fault tolerance analysis for the given configuration.
func GetFaultToleranceAnalysis(rp *super_block.ReplicaPlacement) string {
	ec := distribution.DefaultECConfig()
	rep := distribution.NewReplicationConfig(rp)
	dist := distribution.CalculateDistribution(ec, rep)
	return dist.FaultToleranceAnalysis()
}

// GetFaultToleranceAnalysisWithConfig returns a fault tolerance analysis with a custom EC configuration.
func GetFaultToleranceAnalysisWithConfig(rp *super_block.ReplicaPlacement, ecConfig distribution.ECConfig) string {
	rep := distribution.NewReplicationConfig(rp)
	dist := distribution.CalculateDistribution(ecConfig, rep)
	return dist.FaultToleranceAnalysis()
}
weed/shell/ec_proportional_rebalance_test.go (new file, 251 lines)
@@ -0,0 +1,251 @@
package shell

import (
	"testing"

	"github.com/seaweedfs/seaweedfs/weed/pb/master_pb"
	"github.com/seaweedfs/seaweedfs/weed/storage/erasure_coding"
	"github.com/seaweedfs/seaweedfs/weed/storage/erasure_coding/distribution"
	"github.com/seaweedfs/seaweedfs/weed/storage/needle"
	"github.com/seaweedfs/seaweedfs/weed/storage/super_block"
	"github.com/seaweedfs/seaweedfs/weed/storage/types"
)

func TestCalculateECDistributionShell(t *testing.T) {
	// Test the shell wrapper function.
	rp, _ := super_block.NewReplicaPlacementFromString("110")

	dist := CalculateECDistribution(
		erasure_coding.TotalShardsCount,
		erasure_coding.ParityShardsCount,
		rp,
	)

	if dist.ReplicationConfig.MinDataCenters != 2 {
		t.Errorf("Expected 2 DCs, got %d", dist.ReplicationConfig.MinDataCenters)
	}
	if dist.TargetShardsPerDC != 7 {
		t.Errorf("Expected 7 shards per DC, got %d", dist.TargetShardsPerDC)
	}

	t.Log(dist.Summary())
}

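The test above expects `MinDataCenters == 2` and 7 shards per DC for the placement string "110". In SeaweedFS, the three digits of a replica placement string are DiffDataCenterCount, DiffRackCount, and SameRackCount, so "110" asks for one additional data center and one additional rack; with the original copy's DC that makes two DCs, and 14 shards split evenly is 7 per DC. A standalone sketch of that decoding — `parsePlacement` is a hypothetical stand-in for `super_block.NewReplicaPlacementFromString`, shown only to make the digit semantics concrete:

```go
package main

import "fmt"

// placement mirrors the three digits of a SeaweedFS replica placement
// string: XYZ = diff-DC copies, diff-rack copies, same-rack copies.
type placement struct {
	DiffDataCenterCount int
	DiffRackCount       int
	SameRackCount       int
}

// parsePlacement is a hypothetical illustration, not the super_block parser.
func parsePlacement(s string) (placement, error) {
	if len(s) != 3 {
		return placement{}, fmt.Errorf("placement must be 3 digits, got %q", s)
	}
	return placement{
		DiffDataCenterCount: int(s[0] - '0'),
		DiffRackCount:       int(s[1] - '0'),
		SameRackCount:       int(s[2] - '0'),
	}, nil
}

func main() {
	p, _ := parsePlacement("110")
	minDCs := p.DiffDataCenterCount + 1 // the original copy's DC plus one more
	fmt.Println(minDCs)     // 2
	fmt.Println(14 / minDCs) // 7 shards per DC for a 10+4 volume
}
```

Under this reading, "000" means no volume replication at all, which is why the PR treats it as "distribute EC shards freely" rather than "limit shards per rack to 0".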
func TestAnalyzeVolumeDistributionShell(t *testing.T) {
	diskType := types.HardDriveType
	diskTypeKey := string(diskType)

	// Build a topology with an unbalanced distribution.
	node1 := &EcNode{
		info: &master_pb.DataNodeInfo{
			Id: "127.0.0.1:8080",
			DiskInfos: map[string]*master_pb.DiskInfo{
				diskTypeKey: {
					Type:           diskTypeKey,
					MaxVolumeCount: 10,
					EcShardInfos: []*master_pb.VolumeEcShardInformationMessage{
						{
							Id:          1,
							Collection:  "test",
							EcIndexBits: 0x3FFF, // all 14 shards
						},
					},
				},
			},
		},
		dc:         "dc1",
		rack:       "rack1",
		freeEcSlot: 5,
	}

	node2 := &EcNode{
		info: &master_pb.DataNodeInfo{
			Id: "127.0.0.1:8081",
			DiskInfos: map[string]*master_pb.DiskInfo{
				diskTypeKey: {
					Type:           diskTypeKey,
					MaxVolumeCount: 10,
					EcShardInfos:   []*master_pb.VolumeEcShardInformationMessage{},
				},
			},
		},
		dc:         "dc2",
		rack:       "rack2",
		freeEcSlot: 10,
	}

	locations := []*EcNode{node1, node2}
	volumeId := needle.VolumeId(1)

	analysis := AnalyzeVolumeDistribution(volumeId, locations, diskType)

	shardsByDC := analysis.GetShardsByDC()
	if shardsByDC["dc1"] != 14 {
		t.Errorf("Expected 14 shards in dc1, got %d", shardsByDC["dc1"])
	}

	t.Log(analysis.DetailedString())
}

func TestProportionalRebalancerShell(t *testing.T) {
	diskType := types.HardDriveType
	diskTypeKey := string(diskType)

	// Build topology: 2 DCs with 2 racks each, all shards on one node.
	nodes := []*EcNode{
		{
			info: &master_pb.DataNodeInfo{
				Id: "dc1-rack1-node1",
				DiskInfos: map[string]*master_pb.DiskInfo{
					diskTypeKey: {
						Type:           diskTypeKey,
						MaxVolumeCount: 10,
						EcShardInfos: []*master_pb.VolumeEcShardInformationMessage{
							{Id: 1, Collection: "test", EcIndexBits: 0x3FFF},
						},
					},
				},
			},
			dc: "dc1", rack: "dc1-rack1", freeEcSlot: 0,
		},
		{
			info: &master_pb.DataNodeInfo{
				Id: "dc1-rack2-node1",
				DiskInfos: map[string]*master_pb.DiskInfo{
					diskTypeKey: {Type: diskTypeKey, MaxVolumeCount: 10},
				},
			},
			dc: "dc1", rack: "dc1-rack2", freeEcSlot: 10,
		},
		{
			info: &master_pb.DataNodeInfo{
				Id: "dc2-rack1-node1",
				DiskInfos: map[string]*master_pb.DiskInfo{
					diskTypeKey: {Type: diskTypeKey, MaxVolumeCount: 10},
				},
			},
			dc: "dc2", rack: "dc2-rack1", freeEcSlot: 10,
		},
		{
			info: &master_pb.DataNodeInfo{
				Id: "dc2-rack2-node1",
				DiskInfos: map[string]*master_pb.DiskInfo{
					diskTypeKey: {Type: diskTypeKey, MaxVolumeCount: 10},
				},
			},
			dc: "dc2", rack: "dc2-rack2", freeEcSlot: 10,
		},
	}

	rp, _ := super_block.NewReplicaPlacementFromString("110")
	rebalancer := NewProportionalECRebalancer(nodes, rp, diskType)

	volumeId := needle.VolumeId(1)
	moves, err := rebalancer.PlanMoves(volumeId, []*EcNode{nodes[0]})
	if err != nil {
		t.Fatalf("PlanMoves failed: %v", err)
	}

	t.Logf("Planned %d moves", len(moves))
	for i, move := range moves {
		t.Logf("  %d. %s", i+1, move.String())
	}

	// Verify that some shards move to dc2.
	movedToDC2 := 0
	for _, move := range moves {
		if move.DestNode.dc == "dc2" {
			movedToDC2++
		}
	}

	if movedToDC2 == 0 {
		t.Error("Expected some moves to dc2")
	}
}

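The test above expects moves into dc2 because a proportional split of 14 shards across two DCs targets 7 per DC, and within a DC across two racks roughly 4 and 3. The split arithmetic can be sketched in isolation — illustrative math only, not the `distribution` package's actual algorithm:

```go
package main

import "fmt"

// proportionalSplit divides n shards across k groups as evenly as
// possible, front-loading the remainder (illustrative only).
func proportionalSplit(n, k int) []int {
	out := make([]int, k)
	for i := range out {
		out[i] = n / k
		if i < n%k {
			out[i]++
		}
	}
	return out
}

func main() {
	fmt.Println(proportionalSplit(14, 2)) // [7 7] shards per DC
	fmt.Println(proportionalSplit(7, 2))  // [4 3] shards per rack within a DC
}
```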
func TestCustomECConfigRebalancer(t *testing.T) {
	diskType := types.HardDriveType
	diskTypeKey := string(diskType)

	// Test with a custom 8+4 EC configuration.
	ecConfig, err := distribution.NewECConfig(8, 4)
	if err != nil {
		t.Fatalf("Failed to create EC config: %v", err)
	}

	// Build a topology for 12 shards (8+4).
	nodes := []*EcNode{
		{
			info: &master_pb.DataNodeInfo{
				Id: "dc1-node1",
				DiskInfos: map[string]*master_pb.DiskInfo{
					diskTypeKey: {
						Type:           diskTypeKey,
						MaxVolumeCount: 10,
						EcShardInfos: []*master_pb.VolumeEcShardInformationMessage{
							{Id: 1, Collection: "test", EcIndexBits: 0x0FFF}, // 12 shards (bits 0-11)
						},
					},
				},
			},
			dc: "dc1", rack: "dc1-rack1", freeEcSlot: 0,
		},
		{
			info: &master_pb.DataNodeInfo{
				Id: "dc2-node1",
				DiskInfos: map[string]*master_pb.DiskInfo{
					diskTypeKey: {Type: diskTypeKey, MaxVolumeCount: 10},
				},
			},
			dc: "dc2", rack: "dc2-rack1", freeEcSlot: 10,
		},
		{
			info: &master_pb.DataNodeInfo{
				Id: "dc3-node1",
				DiskInfos: map[string]*master_pb.DiskInfo{
					diskTypeKey: {Type: diskTypeKey, MaxVolumeCount: 10},
				},
			},
			dc: "dc3", rack: "dc3-rack1", freeEcSlot: 10,
		},
	}

	rp, _ := super_block.NewReplicaPlacementFromString("200") // 3 DCs
	rebalancer := NewProportionalECRebalancerWithConfig(nodes, rp, diskType, ecConfig)

	volumeId := needle.VolumeId(1)
	moves, err := rebalancer.PlanMoves(volumeId, []*EcNode{nodes[0]})
	if err != nil {
		t.Fatalf("PlanMoves failed: %v", err)
	}

	t.Logf("Custom 8+4 EC with 200 replication: planned %d moves", len(moves))

	// Get the distribution summary.
	summary := GetDistributionSummaryWithConfig(rp, ecConfig)
	t.Log(summary)

	analysis := GetFaultToleranceAnalysisWithConfig(rp, ecConfig)
	t.Log(analysis)
}

func TestGetDistributionSummaryShell(t *testing.T) {
	rp, _ := super_block.NewReplicaPlacementFromString("110")

	summary := GetDistributionSummary(rp)
	t.Log(summary)
	if len(summary) == 0 {
		t.Error("Summary should not be empty")
	}

	analysis := GetFaultToleranceAnalysis(rp)
	t.Log(analysis)
	if len(analysis) == 0 {
		t.Error("Analysis should not be empty")
	}
}
weed/shell/ec_rebalance_slots_test.go (new file, 293 lines)
@@ -0,0 +1,293 @@
package shell

import (
	"testing"

	"github.com/seaweedfs/seaweedfs/weed/pb/master_pb"
	"github.com/seaweedfs/seaweedfs/weed/storage/erasure_coding"
	"github.com/seaweedfs/seaweedfs/weed/storage/types"
)

// TestECRebalanceWithLimitedSlots tests that EC rebalance handles the scenario
// where there are limited free slots on volume servers.
//
// This is a regression test for the error:
//
//	"no free ec shard slots. only 0 left"
//
// Scenario (from real usage):
//   - 6 volume servers in 6 racks
//   - Each server has max=10 volume slots
//   - 7 volumes were EC encoded (7 × 14 = 98 EC shards)
//   - All 14 shards per volume are on the original server (not yet distributed)
//
// Expected behavior:
//   - The rebalance algorithm should distribute shards across servers
//   - Even if perfect distribution isn't possible, it should do best-effort
//   - Currently fails with "no free ec shard slots" because the freeSlots
//     calculation doesn't account for shards being moved (freed slots on the
//     source, used slots on the target)
func TestECRebalanceWithLimitedSlots(t *testing.T) {
	// Build a topology matching the problematic scenario:
	// 6 servers, each with 2+ volumes' worth of EC shards
	// (all 14 shards per volume on the same server).
	topology := buildLimitedSlotsTopology()

	// Collect EC nodes from the topology.
	ecNodes, totalFreeEcSlots := collectEcVolumeServersByDc(topology, "", types.HardDriveType)

	t.Logf("Topology summary:")
	t.Logf("  Number of EC nodes: %d", len(ecNodes))
	t.Logf("  Total free EC slots: %d", totalFreeEcSlots)

	// Log per-node details.
	for _, node := range ecNodes {
		shardCount := 0
		for _, diskInfo := range node.info.DiskInfos {
			for _, ecShard := range diskInfo.EcShardInfos {
				shardCount += erasure_coding.ShardBits(ecShard.EcIndexBits).ShardIdCount()
			}
		}
		t.Logf("  Node %s (rack %s): %d shards, %d free slots",
			node.info.Id, node.rack, shardCount, node.freeEcSlot)
	}

	// Calculate the total number of EC shards.
	totalEcShards := 0
	for _, node := range ecNodes {
		for _, diskInfo := range node.info.DiskInfos {
			for _, ecShard := range diskInfo.EcShardInfos {
				totalEcShards += erasure_coding.ShardBits(ecShard.EcIndexBits).ShardIdCount()
			}
		}
	}
	t.Logf("  Total EC shards: %d", totalEcShards)

	// Document the issue:
	// With 98 EC shards (7 volumes × 14 shards) on 6 servers with max=10 each,
	// total capacity is 60 slots. But shards already occupy slots on their current servers.
	//
	// The current algorithm calculates free slots as:
	//	freeSlots = maxVolumeCount - volumeCount - ecShardCount
	//
	// If all shards are on their original servers:
	//   - Server A has 28 shards (2 volumes × 14) → may have negative free slots
	//   - This causes totalFreeEcSlots to be 0 or negative
	//
	// The EXPECTED improvement:
	//   - Rebalance should recognize that moving a shard FREES a slot on the source
	//   - The algorithm should work iteratively, moving shards one at a time
	//   - Even if starting with 0 free slots, moving one shard opens a slot

	if totalFreeEcSlots < 1 {
		// This is the current (buggy) behavior we're documenting.
		t.Logf("")
		t.Logf("KNOWN ISSUE: totalFreeEcSlots = %d (< 1)", totalFreeEcSlots)
		t.Logf("")
		t.Logf("This triggers the error: 'no free ec shard slots. only %d left'", totalFreeEcSlots)
		t.Logf("")
		t.Logf("Analysis:")
		t.Logf("  - %d EC shards across %d servers", totalEcShards, len(ecNodes))
		t.Logf("  - Shards are concentrated on original servers (not distributed)")
		t.Logf("  - Current slot calculation doesn't account for slots freed by moving shards")
		t.Logf("")
		t.Logf("Expected fix:")
		t.Logf("  1. Rebalance should work iteratively, moving one shard at a time")
		t.Logf("  2. Moving a shard from A to B: frees 1 slot on A, uses 1 slot on B")
		t.Logf("  3. The 'free slots' check should be per-move, not global")
		t.Logf("  4. Or: calculate 'redistributable slots' = total capacity - shards that must stay")

		// For now, document this as a known issue - don't fail the test.
		// When the fix is implemented, update this test to verify the fix works.
		return
	}

	// If we get here, the issue might have been fixed.
	t.Logf("totalFreeEcSlots = %d, rebalance should be possible", totalFreeEcSlots)
}
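The slot arithmetic this test documents can be worked through numerically. Assuming the simplified formula from the test's commentary, freeSlots = maxVolumeCount - volumeCount - ecShardCount (a sketch of the described calculation, not the exact production code), a server still holding two un-distributed EC volumes goes negative:

```go
package main

import "fmt"

// freeEcSlots applies the simplified free-slot formula described in the
// test commentary: capacity minus regular volumes minus EC shards held.
func freeEcSlots(maxVolumeCount, volumeCount, ecShardCount int) int {
	return maxVolumeCount - volumeCount - ecShardCount
}

func main() {
	// rack0 scenario: max=10 slots, 2 original volumes still counted,
	// and 2 volumes × 14 = 28 EC shards all on the same server.
	fmt.Println(freeEcSlots(10, 2, 28)) // -20: contributes no free slots

	// Cluster-wide: 6 servers × 10 slots = 60 total capacity versus
	// 7 volumes × 14 = 98 shards, hence "no free ec shard slots".
	fmt.Println(6*10 - 7*14) // -38
}
```

This makes the proposed fix concrete: a per-move check would see that moving one of the 28 shards off rack0 frees a slot there, even though the global total is negative.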

// TestECRebalanceZeroFreeSlots tests the specific scenario where
// the topology appears to have free slots but rebalance fails.
//
// This can happen when the VolumeCount in the topology includes the original
// volumes that were EC-encoded, making the free slot calculation incorrect.
func TestECRebalanceZeroFreeSlots(t *testing.T) {
	// Build a topology where volumes were NOT deleted after EC encoding
	// (VolumeCount still reflects the original volumes).
	topology := buildZeroFreeSlotTopology()

	ecNodes, totalFreeEcSlots := collectEcVolumeServersByDc(topology, "", types.HardDriveType)

	t.Logf("Zero free slots scenario:")
	for _, node := range ecNodes {
		shardCount := 0
		for _, diskInfo := range node.info.DiskInfos {
			for _, ecShard := range diskInfo.EcShardInfos {
				shardCount += erasure_coding.ShardBits(ecShard.EcIndexBits).ShardIdCount()
			}
		}
		t.Logf("  Node %s: %d shards, %d free slots, volumeCount=%d, max=%d",
			node.info.Id, shardCount, node.freeEcSlot,
			node.info.DiskInfos[string(types.HardDriveType)].VolumeCount,
			node.info.DiskInfos[string(types.HardDriveType)].MaxVolumeCount)
	}
	t.Logf("  Total free slots: %d", totalFreeEcSlots)

	if totalFreeEcSlots == 0 {
		t.Logf("")
		t.Logf("SCENARIO REPRODUCED: totalFreeEcSlots = 0")
		t.Logf("This would trigger: 'no free ec shard slots. only 0 left'")
	}
}

// buildZeroFreeSlotTopology creates a topology where rebalance will fail
// because the servers are at capacity (volumeCount equals maxVolumeCount).
func buildZeroFreeSlotTopology() *master_pb.TopologyInfo {
	diskTypeKey := string(types.HardDriveType)

	// Each server has max=10, volumeCount=10 (full capacity).
	// Free capacity = (10-10) * 10 = 0 per server.
	// This will trigger the "no free ec shard slots" error.
	return &master_pb.TopologyInfo{
		Id: "test_zero_free_slots",
		DataCenterInfos: []*master_pb.DataCenterInfo{
			{
				Id: "dc1",
				RackInfos: []*master_pb.RackInfo{
					{
						Id: "rack0",
						DataNodeInfos: []*master_pb.DataNodeInfo{
							{
								Id: "127.0.0.1:8080",
								DiskInfos: map[string]*master_pb.DiskInfo{
									diskTypeKey: {
										Type:           diskTypeKey,
										MaxVolumeCount: 10,
										VolumeCount:    10, // at full capacity
										EcShardInfos:   buildEcShards([]uint32{3, 4}),
									},
								},
							},
						},
					},
					{
						Id: "rack1",
						DataNodeInfos: []*master_pb.DataNodeInfo{
							{
								Id: "127.0.0.1:8081",
								DiskInfos: map[string]*master_pb.DiskInfo{
									diskTypeKey: {
										Type:           diskTypeKey,
										MaxVolumeCount: 10,
										VolumeCount:    10,
										EcShardInfos:   buildEcShards([]uint32{1, 7}),
									},
								},
							},
						},
					},
					{
						Id: "rack2",
						DataNodeInfos: []*master_pb.DataNodeInfo{
							{
								Id: "127.0.0.1:8082",
								DiskInfos: map[string]*master_pb.DiskInfo{
									diskTypeKey: {
										Type:           diskTypeKey,
										MaxVolumeCount: 10,
										VolumeCount:    10,
										EcShardInfos:   buildEcShards([]uint32{2}),
									},
								},
							},
						},
					},
					{
						Id: "rack3",
						DataNodeInfos: []*master_pb.DataNodeInfo{
							{
								Id: "127.0.0.1:8083",
								DiskInfos: map[string]*master_pb.DiskInfo{
									diskTypeKey: {
										Type:           diskTypeKey,
										MaxVolumeCount: 10,
										VolumeCount:    10,
										EcShardInfos:   buildEcShards([]uint32{5, 6}),
									},
								},
							},
						},
					},
				},
			},
		},
	}
}

func buildEcShards(volumeIds []uint32) []*master_pb.VolumeEcShardInformationMessage {
	var shards []*master_pb.VolumeEcShardInformationMessage
	for _, vid := range volumeIds {
		allShardBits := erasure_coding.ShardBits(0)
		for i := 0; i < erasure_coding.TotalShardsCount; i++ {
			allShardBits = allShardBits.AddShardId(erasure_coding.ShardId(i))
		}
		shards = append(shards, &master_pb.VolumeEcShardInformationMessage{
			Id:          vid,
			Collection:  "ectest",
			EcIndexBits: uint32(allShardBits),
		})
	}
	return shards
}

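The AddShardId loop in buildEcShards builds the full bitmask one bit at a time; for a 14-shard volume the result is equivalent to (1<<14)-1 = 0x3FFF, the same constant the other test file writes literally. A quick standalone check of that equivalence (plain bit operations, not the ShardBits type):

```go
package main

import "fmt"

const totalShardsCount = 14 // 10 data + 4 parity, as in the default EC config

// allShardsMask builds the full bitmask one bit at a time, with the same
// effect as repeatedly calling ShardBits.AddShardId in buildEcShards.
func allShardsMask(n int) uint32 {
	var mask uint32
	for i := 0; i < n; i++ {
		mask |= 1 << i
	}
	return mask
}

func main() {
	mask := allShardsMask(totalShardsCount)
	fmt.Printf("0x%X\n", mask)                   // 0x3FFF
	fmt.Println(mask == (1<<totalShardsCount)-1) // true
}
```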
// buildLimitedSlotsTopology creates a topology matching the problematic scenario:
//   - 6 servers in 6 racks
//   - Each server has max=10 volume slots
//   - 7 volumes were EC encoded, with shards placed as follows:
//   - rack0 (8080): volumes 3,4 → 28 shards
//   - rack1 (8081): volumes 1,7 → 28 shards
//   - rack2 (8082): volume 2 → 14 shards
//   - rack3 (8083): volumes 5,6 → 28 shards
//   - rack4 (8084): (no volumes originally)
//   - rack5 (8085): (no volumes originally)
func buildLimitedSlotsTopology() *master_pb.TopologyInfo {
	return &master_pb.TopologyInfo{
		Id: "test_limited_slots",
		DataCenterInfos: []*master_pb.DataCenterInfo{
			{
				Id: "dc1",
				RackInfos: []*master_pb.RackInfo{
					buildRackWithEcShards("rack0", "127.0.0.1:8080", 10, []uint32{3, 4}),
					buildRackWithEcShards("rack1", "127.0.0.1:8081", 10, []uint32{1, 7}),
					buildRackWithEcShards("rack2", "127.0.0.1:8082", 10, []uint32{2}),
					buildRackWithEcShards("rack3", "127.0.0.1:8083", 10, []uint32{5, 6}),
					buildRackWithEcShards("rack4", "127.0.0.1:8084", 10, []uint32{}),
					buildRackWithEcShards("rack5", "127.0.0.1:8085", 10, []uint32{}),
				},
			},
		},
	}
}

// buildRackWithEcShards creates a rack with one data node containing EC shards
// for the specified volume IDs (all 14 shards per volume).
func buildRackWithEcShards(rackId, nodeId string, maxVolumes int64, volumeIds []uint32) *master_pb.RackInfo {
	// Note: types.HardDriveType is "" (the empty string), so "" is used as the key.
	diskTypeKey := string(types.HardDriveType)

	return &master_pb.RackInfo{
		Id: rackId,
		DataNodeInfos: []*master_pb.DataNodeInfo{
			{
				Id: nodeId,
				DiskInfos: map[string]*master_pb.DiskInfo{
					diskTypeKey: {
						Type:           diskTypeKey,
						MaxVolumeCount: maxVolumes,
						VolumeCount:    int64(len(volumeIds)), // original volumes still counted
						EcShardInfos:   buildEcShards(volumeIds),
					},
				},
			},
		},
	}
}