fix: EC rebalance fails with replica placement 000 (#7812)
* fix: EC rebalance fails with replica placement 000

  This PR fixes several issues with EC shard distribution:

  1. Pre-flight check before EC encoding
     - Verify the target disk type has capacity before encoding starts
     - Prevents encoding shards only to fail during rebalance
     - Shows a helpful error when the wrong diskType is specified (e.g., ssd when volumes are on hdd)
  2. Fix EC rebalance with replica placement 000
     - When DiffRackCount=0, shards should be distributed freely across racks
     - The '000' placement means 'no volume replication needed' because EC provides redundancy
     - Previously all racks were skipped with the error 'shards X > replica placement limit (0)'
  3. Add unit tests for EC rebalance slot calculation
     - TestECRebalanceWithLimitedSlots: documents the limited-slots scenario
     - TestECRebalanceZeroFreeSlots: reproduces the 0-free-slots error
  4. Add a Makefile for manual EC testing
     - make setup: start a cluster and populate data
     - make shell: open a weed shell for EC commands
     - make clean: stop the cluster and clean up

* fix: default -rebalance to true for ec.encode

  The -rebalance flag defaulted to false, so ec.encode would only print shard moves without executing them. This is a poor default: the whole point of EC encoding is to distribute shards across servers for fault tolerance. Now -rebalance defaults to true, so shards are actually distributed after encoding. Use -rebalance=false to see what would happen without making changes.

* test/erasure_coding: improve Makefile safety and docs

  - Narrow the pkill pattern for volume servers to use TEST_DIR instead of a port pattern, avoiding accidental kills of unrelated SeaweedFS processes
  - Document external dependencies (curl, jq) in header comments

* shell: refactor buildRackWithEcShards to reuse buildEcShards

  Extract the common shard-bit construction logic to avoid duplication between the buildEcShards and buildRackWithEcShards helper functions.

* shell: update test for EC replication 000 behavior

  When DiffRackCount=0 (replication "000"), EC shards should be distributed freely across racks, since erasure coding provides its own redundancy. Update the test expectation to reflect this behavior.

* erasure_coding: add distribution package for proportional EC shard placement

  Add a new reusable package for EC shard distribution that:
  - Supports configurable EC ratios (not hard-coded 10+4)
  - Distributes shards proportionally based on the replication policy
  - Provides fault tolerance analysis
  - Prefers moving parity shards to keep data shards spread out

  Key components:
  - ECConfig: configurable data/parity shard counts
  - ReplicationConfig: parsed XYZ replication policy
  - ECDistribution: target shard counts per DC/rack/node
  - Rebalancer: plans shard moves with a parity-first strategy

  This enables seaweed-enterprise custom EC ratios and weed worker integration while maintaining a clean, testable architecture.

* shell: integrate distribution package for EC rebalancing

  Add shell wrappers around the distribution package:
  - ProportionalECRebalancer: plans moves using distribution.Rebalancer
  - NewProportionalECRebalancerWithConfig: supports custom EC configs
  - GetDistributionSummary/GetFaultToleranceAnalysis: helper functions

  The shell layer converts between EcNode types and the generic TopologyNode types used by the distribution package.

* test setup

* ec: improve data and parity shard distribution across racks

  - Add a shardsByTypePerRack helper to track data vs parity shards
  - Rewrite doBalanceEcShardsAcrossRacks for two-pass balancing:
    1. Balance data shards (0-9) evenly, max ceil(10/6)=2 per rack
    2. Balance parity shards (10-13) evenly, max ceil(4/6)=1 per rack
  - Add balanceShardTypeAcrossRacks for generic shard-type balancing
  - Add pickRackForShardType to select a destination with room for the type
  - Add unit tests verifying even data/parity distribution

  This ensures even read load during normal operation by spreading both data and parity shards across all available racks.

* ec: make data/parity shard counts configurable in ecBalancer

  - Add dataShardCount and parityShardCount fields to the ecBalancer struct
  - Add getDataShardCount() and getParityShardCount() methods with defaults
  - Replace direct constant usage with the configurable methods
  - Fix an unused-variable warning for parityPerRack

  This allows seaweed-enterprise to use custom EC ratios while defaulting to the standard 10+4 scheme.

* Address PR 7812 review comments

  Makefile improvements:
  - Save PIDs for each volume server for precise termination
  - Use PID-based killing in the stop target, with a pkill fallback
  - Use more specific pkill patterns with TEST_DIR paths

  Documentation:
  - Document the jq dependency in README.md

  Rebalancer fix:
  - Fix duplicate shard-count updates in applyMovesToAnalysis
  - All planners (DC/rack/node) update counts inline during planning
  - Remove the duplicate updates from applyMovesToAnalysis to avoid double-counting

* test/erasure_coding: use mktemp for test file template

  Use mktemp instead of the hardcoded /tmp/testfile_template.bin path to provide better isolation for concurrent test runs.
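The two-pass rack balancing above caps each shard type by ceiling division over the rack count. A hypothetical standalone sketch of that cap (not the actual ecBalancer code, just the arithmetic the commit message describes):

```go
package main

import "fmt"

// capPerRack sketches the per-rack cap described above: each shard type
// (data or parity) is capped at ceil(shardCount / rackCount) per rack.
func capPerRack(shardCount, rackCount int) int {
	return (shardCount + rackCount - 1) / rackCount
}

func main() {
	racks := 6
	fmt.Println(capPerRack(10, racks)) // data shards 0-9: at most 2 per rack
	fmt.Println(capPerRack(4, racks))  // parity shards 10-13: at most 1 per rack
}
```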
209
weed/storage/erasure_coding/distribution/README.md
Normal file
@@ -0,0 +1,209 @@
# EC Distribution Package

This package provides erasure coding (EC) shard distribution algorithms that are:

- **Configurable**: Works with any EC ratio (e.g., 10+4, 8+4, 6+3)
- **Reusable**: Used by shell commands, worker tasks, and seaweed-enterprise
- **Topology-aware**: Distributes shards across data centers, racks, and nodes proportionally

## Usage

### Basic Usage with Default 10+4 EC

```go
import (
	"fmt"

	"github.com/seaweedfs/seaweedfs/weed/storage/erasure_coding/distribution"
)

// Parse replication policy
rep, _ := distribution.NewReplicationConfigFromString("110")

// Use default 10+4 EC configuration
ec := distribution.DefaultECConfig()

// Calculate distribution plan
dist := distribution.CalculateDistribution(ec, rep)

fmt.Println(dist.Summary())
// Output:
// EC Configuration: 10+4 (total: 14, can lose: 4)
// Replication: replication=110 (DCs:2, Racks/DC:2, Nodes/Rack:1)
// Distribution Plan:
//   Data Centers: 2 (target 7 shards each, max 9)
//   Racks per DC: 2 (target 4 shards each, max 6)
//   Nodes per Rack: 1 (target 4 shards each, max 6)
```

### Custom EC Ratios (seaweed-enterprise)

```go
// Create custom 8+4 EC configuration
ec, err := distribution.NewECConfig(8, 4)
if err != nil {
	log.Fatal(err)
}

rep, _ := distribution.NewReplicationConfigFromString("200")
dist := distribution.CalculateDistribution(ec, rep)

// Check fault tolerance
fmt.Println(dist.FaultToleranceAnalysis())
// Output:
// Fault Tolerance Analysis for 8+4:
//   DC Failure: SURVIVABLE ✓
//     - Losing one DC loses ~4 shards
//     - Remaining: 8 shards (need 8)
```

### Planning Shard Moves

```go
// Build topology analysis
analysis := distribution.NewTopologyAnalysis()

// Add nodes and their shard locations
for _, node := range nodes {
	analysis.AddNode(&distribution.TopologyNode{
		NodeID:     node.ID,
		DataCenter: node.DC,
		Rack:       node.Rack,
		FreeSlots:  node.FreeSlots,
	})
	for _, shardID := range node.ShardIDs {
		analysis.AddShardLocation(distribution.ShardLocation{
			ShardID:    shardID,
			NodeID:     node.ID,
			DataCenter: node.DC,
			Rack:       node.Rack,
		})
	}
}
analysis.Finalize()

// Create rebalancer and plan moves
rebalancer := distribution.NewRebalancer(ec, rep)
plan, err := rebalancer.PlanRebalance(analysis)

for _, move := range plan.Moves {
	fmt.Printf("Move shard %d from %s to %s\n",
		move.ShardID, move.SourceNode.NodeID, move.DestNode.NodeID)
}
```

## Algorithm

### Proportional Distribution

The replication policy `XYZ` is interpreted as a ratio:

| Replication | DCs | Racks/DC | Nodes/Rack | 14 Shards Distribution |
|-------------|-----|----------|------------|------------------------|
| `000`       | 1   | 1        | 1          | All in one place       |
| `001`       | 1   | 1        | 2          | 7 per node             |
| `010`       | 1   | 2        | 1          | 7 per rack             |
| `100`       | 2   | 1        | 1          | 7 per DC               |
| `110`       | 2   | 2        | 1          | 7/DC, 4/rack           |
| `200`       | 3   | 1        | 1          | 5 per DC               |
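The target columns in the table follow from repeated ceiling division: 14 shards over the minimum DCs, then racks per DC, then nodes per rack. A standalone sketch of that arithmetic (mirroring the package's internal ceiling-division helper, without importing the package):

```go
package main

import "fmt"

// ceilDiv mirrors the ceiling division the package uses for target counts.
func ceilDiv(a, b int) int { return (a + b - 1) / b }

func main() {
	totalShards := 14 // default 10+4

	// Replication "110": 2 DCs, 2 racks per DC, 1 node per rack.
	perDC := ceilDiv(totalShards, 2) // 7 per DC
	perRack := ceilDiv(perDC, 2)     // 4 per rack
	perNode := ceilDiv(perRack, 1)   // 4 per node
	fmt.Println(perDC, perRack, perNode)

	// Replication "200": 3 DCs, so 5 per DC.
	fmt.Println(ceilDiv(totalShards, 3))
}
```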
### Rebalancing Process

1. **DC-level balancing**: Move shards to achieve the target shards per DC
2. **Rack-level balancing**: Within each DC, balance across racks
3. **Node-level balancing**: Within each rack, balance across nodes

### Shard Priority: Data First, Parity Moves First

When rebalancing, the algorithm prioritizes keeping data shards spread out:

- **Data shards (0 to DataShards-1)**: Serve read requests directly
- **Parity shards (DataShards to TotalShards-1)**: Only used for reconstruction

**Rebalancing Strategy**:

- When moving shards FROM an overloaded node, **parity shards are moved first**
- This keeps data shards in place on well-distributed nodes
- Result: Data shards remain spread out for optimal read performance

```go
// Check shard type
if ec.IsDataShard(shardID) {
	// Shard serves read requests
}
if ec.IsParityShard(shardID) {
	// Shard only used for reconstruction
}

// Sort shards for placement (data first for initial distribution)
sorted := ec.SortShardsDataFirst(shards)

// Sort shards for rebalancing (parity first to move them away)
sorted = ec.SortShardsParityFirst(shards)
```

### Fault Tolerance

The package provides fault tolerance analysis:

- **DC Failure**: Can the data survive complete DC loss?
- **Rack Failure**: Can the data survive complete rack loss?
- **Node Failure**: Can the data survive single node loss?

For example, with 10+4 EC (can lose 4 shards):

- Need 4+ DCs for DC-level fault tolerance
- Need 4+ racks for rack-level fault tolerance
- Usually survivable at node level
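The survivability checks above reduce to one comparison: the shards remaining after losing a failure domain must still reach DataShards. A minimal standalone sketch of that check (same logic as the package's CanSurviveDCFailure, without importing the package):

```go
package main

import "fmt"

// survivesDomainLoss mirrors the survivability check: after losing a failure
// domain holding perDomain shards, reconstruction still needs at least
// dataShards of the remaining shards.
func survivesDomainLoss(dataShards, parityShards, perDomain int) bool {
	total := dataShards + parityShards
	return total-perDomain >= dataShards
}

func main() {
	// 10+4 over 2 DCs (~7 shards each): 14-7=7 < 10, so a DC loss is fatal.
	fmt.Println(survivesDomainLoss(10, 4, 7))
	// 10+4 over 4 racks (~4 shards each): 14-4=10 >= 10, survivable.
	fmt.Println(survivesDomainLoss(10, 4, 4))
}
```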
## API Reference

### Types

- `ECConfig`: EC configuration (data shards, parity shards)
- `ReplicationConfig`: Parsed replication policy
- `ECDistribution`: Calculated distribution plan
- `TopologyAnalysis`: Current shard distribution analysis
- `Rebalancer`: Plans shard moves
- `RebalancePlan`: List of planned moves
- `ShardMove`: Single shard move operation

### Key Functions

- `NewECConfig(data, parity int)`: Create EC configuration
- `DefaultECConfig()`: Returns the 10+4 configuration
- `CalculateDistribution(ec, rep)`: Calculate distribution plan
- `NewRebalancer(ec, rep)`: Create rebalancer
- `PlanRebalance(analysis)`: Generate rebalancing plan

## Integration

### Shell Commands

The shell package wraps this distribution package for `ec.balance`:

```go
import "github.com/seaweedfs/seaweedfs/weed/shell"

rebalancer := shell.NewProportionalECRebalancer(nodes, rp, diskType)
moves, _ := rebalancer.PlanMoves(volumeId, locations)
```

### Worker Tasks

Worker tasks can use the distribution package directly:

```go
import "github.com/seaweedfs/seaweedfs/weed/storage/erasure_coding/distribution"

ec := distribution.ECConfig{DataShards: 8, ParityShards: 4}
rep := distribution.NewReplicationConfig(rp)
dist := distribution.CalculateDistribution(ec, rep)
```

### seaweed-enterprise

Enterprise features can provide custom EC configurations:

```go
// Custom EC ratio from license/config
ec, _ := distribution.NewECConfig(customData, customParity)
rebalancer := distribution.NewRebalancer(ec, rep)
```
241
weed/storage/erasure_coding/distribution/analysis.go
Normal file
@@ -0,0 +1,241 @@
package distribution

import (
	"fmt"
	"slices"
)

// ShardLocation represents where a shard is located in the topology
type ShardLocation struct {
	ShardID    int
	NodeID     string
	DataCenter string
	Rack       string
}

// TopologyNode represents a node in the topology that can hold EC shards
type TopologyNode struct {
	NodeID      string
	DataCenter  string
	Rack        string
	FreeSlots   int   // Available slots for new shards
	ShardIDs    []int // Shard IDs currently on this node for a specific volume
	TotalShards int   // Total shards on this node (for all volumes)
}

// TopologyAnalysis holds the current shard distribution analysis for a volume
type TopologyAnalysis struct {
	// Shard counts at each level
	ShardsByDC   map[string]int
	ShardsByRack map[string]int
	ShardsByNode map[string]int

	// Detailed shard locations
	DCToShards   map[string][]int // DC -> list of shard IDs
	RackToShards map[string][]int // Rack -> list of shard IDs
	NodeToShards map[string][]int // NodeID -> list of shard IDs

	// Topology structure
	DCToRacks   map[string][]string        // DC -> list of rack IDs
	RackToNodes map[string][]*TopologyNode // Rack -> list of nodes
	AllNodes    map[string]*TopologyNode   // NodeID -> node info

	// Statistics
	TotalShards int
	TotalNodes  int
	TotalRacks  int
	TotalDCs    int
}

// NewTopologyAnalysis creates a new empty analysis
func NewTopologyAnalysis() *TopologyAnalysis {
	return &TopologyAnalysis{
		ShardsByDC:   make(map[string]int),
		ShardsByRack: make(map[string]int),
		ShardsByNode: make(map[string]int),
		DCToShards:   make(map[string][]int),
		RackToShards: make(map[string][]int),
		NodeToShards: make(map[string][]int),
		DCToRacks:    make(map[string][]string),
		RackToNodes:  make(map[string][]*TopologyNode),
		AllNodes:     make(map[string]*TopologyNode),
	}
}

// AddShardLocation adds a shard location to the analysis
func (a *TopologyAnalysis) AddShardLocation(loc ShardLocation) {
	// Update counts
	a.ShardsByDC[loc.DataCenter]++
	a.ShardsByRack[loc.Rack]++
	a.ShardsByNode[loc.NodeID]++

	// Update shard lists
	a.DCToShards[loc.DataCenter] = append(a.DCToShards[loc.DataCenter], loc.ShardID)
	a.RackToShards[loc.Rack] = append(a.RackToShards[loc.Rack], loc.ShardID)
	a.NodeToShards[loc.NodeID] = append(a.NodeToShards[loc.NodeID], loc.ShardID)

	a.TotalShards++
}

// AddNode adds a node to the topology (even if it has no shards)
func (a *TopologyAnalysis) AddNode(node *TopologyNode) {
	if _, exists := a.AllNodes[node.NodeID]; exists {
		return // Already added
	}

	a.AllNodes[node.NodeID] = node
	a.TotalNodes++

	// Update topology structure
	if !slices.Contains(a.DCToRacks[node.DataCenter], node.Rack) {
		a.DCToRacks[node.DataCenter] = append(a.DCToRacks[node.DataCenter], node.Rack)
	}
	a.RackToNodes[node.Rack] = append(a.RackToNodes[node.Rack], node)

	// Update counts
	if _, exists := a.ShardsByDC[node.DataCenter]; !exists {
		a.TotalDCs++
	}
	if _, exists := a.ShardsByRack[node.Rack]; !exists {
		a.TotalRacks++
	}
}

// Finalize computes final statistics after all data is added
func (a *TopologyAnalysis) Finalize() {
	// Ensure we have accurate DC and rack counts
	dcSet := make(map[string]bool)
	rackSet := make(map[string]bool)
	for _, node := range a.AllNodes {
		dcSet[node.DataCenter] = true
		rackSet[node.Rack] = true
	}
	a.TotalDCs = len(dcSet)
	a.TotalRacks = len(rackSet)
	a.TotalNodes = len(a.AllNodes)
}

// String returns a summary of the analysis
func (a *TopologyAnalysis) String() string {
	return fmt.Sprintf("TopologyAnalysis{shards:%d, nodes:%d, racks:%d, dcs:%d}",
		a.TotalShards, a.TotalNodes, a.TotalRacks, a.TotalDCs)
}

// DetailedString returns a detailed multi-line summary
func (a *TopologyAnalysis) DetailedString() string {
	s := "Topology Analysis:\n"
	s += fmt.Sprintf("  Total Shards: %d\n", a.TotalShards)
	s += fmt.Sprintf("  Data Centers: %d\n", a.TotalDCs)
	for dc, count := range a.ShardsByDC {
		s += fmt.Sprintf("    %s: %d shards\n", dc, count)
	}
	s += fmt.Sprintf("  Racks: %d\n", a.TotalRacks)
	for rack, count := range a.ShardsByRack {
		s += fmt.Sprintf("    %s: %d shards\n", rack, count)
	}
	s += fmt.Sprintf("  Nodes: %d\n", a.TotalNodes)
	for nodeID, count := range a.ShardsByNode {
		if count > 0 {
			s += fmt.Sprintf("    %s: %d shards\n", nodeID, count)
		}
	}
	return s
}

// TopologyExcess represents a topology level (DC/rack/node) with excess shards
type TopologyExcess struct {
	ID     string          // DC/rack/node ID
	Level  string          // "dc", "rack", or "node"
	Excess int             // Number of excess shards (above target)
	Shards []int           // Shard IDs at this level
	Nodes  []*TopologyNode // Nodes at this level (for finding sources)
}

// CalculateDCExcess returns DCs with more shards than the target
func CalculateDCExcess(analysis *TopologyAnalysis, dist *ECDistribution) []TopologyExcess {
	var excess []TopologyExcess

	for dc, count := range analysis.ShardsByDC {
		if count > dist.TargetShardsPerDC {
			// Collect nodes in this DC
			var nodes []*TopologyNode
			for _, rack := range analysis.DCToRacks[dc] {
				nodes = append(nodes, analysis.RackToNodes[rack]...)
			}
			excess = append(excess, TopologyExcess{
				ID:     dc,
				Level:  "dc",
				Excess: count - dist.TargetShardsPerDC,
				Shards: analysis.DCToShards[dc],
				Nodes:  nodes,
			})
		}
	}

	// Sort by excess (most excess first)
	slices.SortFunc(excess, func(a, b TopologyExcess) int {
		return b.Excess - a.Excess
	})

	return excess
}

// CalculateRackExcess returns racks with more shards than the target (within a DC)
func CalculateRackExcess(analysis *TopologyAnalysis, dc string, targetPerRack int) []TopologyExcess {
	var excess []TopologyExcess

	for _, rack := range analysis.DCToRacks[dc] {
		count := analysis.ShardsByRack[rack]
		if count > targetPerRack {
			excess = append(excess, TopologyExcess{
				ID:     rack,
				Level:  "rack",
				Excess: count - targetPerRack,
				Shards: analysis.RackToShards[rack],
				Nodes:  analysis.RackToNodes[rack],
			})
		}
	}

	slices.SortFunc(excess, func(a, b TopologyExcess) int {
		return b.Excess - a.Excess
	})

	return excess
}

// CalculateUnderservedDCs returns DCs that have fewer shards than target
func CalculateUnderservedDCs(analysis *TopologyAnalysis, dist *ECDistribution) []string {
	var underserved []string

	// Check existing DCs
	for dc, count := range analysis.ShardsByDC {
		if count < dist.TargetShardsPerDC {
			underserved = append(underserved, dc)
		}
	}

	// Check DCs with nodes but no shards
	for dc := range analysis.DCToRacks {
		if _, exists := analysis.ShardsByDC[dc]; !exists {
			underserved = append(underserved, dc)
		}
	}

	return underserved
}

// CalculateUnderservedRacks returns racks that have fewer shards than target
func CalculateUnderservedRacks(analysis *TopologyAnalysis, dc string, targetPerRack int) []string {
	var underserved []string

	for _, rack := range analysis.DCToRacks[dc] {
		count := analysis.ShardsByRack[rack]
		if count < targetPerRack {
			underserved = append(underserved, rack)
		}
	}

	return underserved
}
171
weed/storage/erasure_coding/distribution/config.go
Normal file
@@ -0,0 +1,171 @@
// Package distribution provides EC shard distribution algorithms with configurable EC ratios.
package distribution

import (
	"fmt"

	"github.com/seaweedfs/seaweedfs/weed/storage/super_block"
)

// ECConfig holds erasure coding configuration parameters.
// This replaces hard-coded constants like DataShardsCount=10, ParityShardsCount=4.
type ECConfig struct {
	DataShards   int // Number of data shards (e.g., 10)
	ParityShards int // Number of parity shards (e.g., 4)
}

// DefaultECConfig returns the standard 10+4 EC configuration
func DefaultECConfig() ECConfig {
	return ECConfig{
		DataShards:   10,
		ParityShards: 4,
	}
}

// NewECConfig creates a new EC configuration with validation
func NewECConfig(dataShards, parityShards int) (ECConfig, error) {
	if dataShards <= 0 {
		return ECConfig{}, fmt.Errorf("dataShards must be positive, got %d", dataShards)
	}
	if parityShards <= 0 {
		return ECConfig{}, fmt.Errorf("parityShards must be positive, got %d", parityShards)
	}
	if dataShards+parityShards > 32 {
		return ECConfig{}, fmt.Errorf("total shards (%d+%d=%d) exceeds maximum of 32",
			dataShards, parityShards, dataShards+parityShards)
	}
	return ECConfig{
		DataShards:   dataShards,
		ParityShards: parityShards,
	}, nil
}

// TotalShards returns the total number of shards (data + parity)
func (c ECConfig) TotalShards() int {
	return c.DataShards + c.ParityShards
}

// MaxTolerableLoss returns the maximum number of shards that can be lost
// while still being able to reconstruct the data
func (c ECConfig) MaxTolerableLoss() int {
	return c.ParityShards
}

// MinShardsForReconstruction returns the minimum number of shards needed
// to reconstruct the original data
func (c ECConfig) MinShardsForReconstruction() int {
	return c.DataShards
}

// String returns a human-readable representation
func (c ECConfig) String() string {
	return fmt.Sprintf("%d+%d (total: %d, can lose: %d)",
		c.DataShards, c.ParityShards, c.TotalShards(), c.MaxTolerableLoss())
}

// IsDataShard returns true if the shard ID is a data shard (0 to DataShards-1)
func (c ECConfig) IsDataShard(shardID int) bool {
	return shardID >= 0 && shardID < c.DataShards
}

// IsParityShard returns true if the shard ID is a parity shard (DataShards to TotalShards-1)
func (c ECConfig) IsParityShard(shardID int) bool {
	return shardID >= c.DataShards && shardID < c.TotalShards()
}

// SortShardsDataFirst returns a copy of shards sorted with data shards first.
// This is useful for initial placement where data shards should be spread out first.
func (c ECConfig) SortShardsDataFirst(shards []int) []int {
	result := make([]int, len(shards))
	copy(result, shards)

	// Partition: data shards first, then parity shards
	dataIdx := 0
	parityIdx := len(result) - 1

	sorted := make([]int, len(result))
	for _, s := range result {
		if c.IsDataShard(s) {
			sorted[dataIdx] = s
			dataIdx++
		} else {
			sorted[parityIdx] = s
			parityIdx--
		}
	}

	return sorted
}

// SortShardsParityFirst returns a copy of shards sorted with parity shards first.
// This is useful for rebalancing where we prefer to move parity shards.
func (c ECConfig) SortShardsParityFirst(shards []int) []int {
	result := make([]int, len(shards))
	copy(result, shards)

	// Partition: parity shards first, then data shards
	parityIdx := 0
	dataIdx := len(result) - 1

	sorted := make([]int, len(result))
	for _, s := range result {
		if c.IsParityShard(s) {
			sorted[parityIdx] = s
			parityIdx++
		} else {
			sorted[dataIdx] = s
			dataIdx--
		}
	}

	return sorted
}

// ReplicationConfig holds the parsed replication policy
type ReplicationConfig struct {
	MinDataCenters  int // X+1 from XYZ replication (minimum DCs to use)
	MinRacksPerDC   int // Y+1 from XYZ replication (minimum racks per DC)
	MinNodesPerRack int // Z+1 from XYZ replication (minimum nodes per rack)

	// Original replication string (for logging/debugging)
	Original string
}

// NewReplicationConfig creates a ReplicationConfig from a ReplicaPlacement
func NewReplicationConfig(rp *super_block.ReplicaPlacement) ReplicationConfig {
	if rp == nil {
		return ReplicationConfig{
			MinDataCenters:  1,
			MinRacksPerDC:   1,
			MinNodesPerRack: 1,
			Original:        "000",
		}
	}
	return ReplicationConfig{
		MinDataCenters:  rp.DiffDataCenterCount + 1,
		MinRacksPerDC:   rp.DiffRackCount + 1,
		MinNodesPerRack: rp.SameRackCount + 1,
		Original:        rp.String(),
	}
}

// NewReplicationConfigFromString creates a ReplicationConfig from a replication string
func NewReplicationConfigFromString(replication string) (ReplicationConfig, error) {
	rp, err := super_block.NewReplicaPlacementFromString(replication)
	if err != nil {
		return ReplicationConfig{}, err
	}
	return NewReplicationConfig(rp), nil
}

// TotalPlacementSlots returns the minimum number of unique placement locations
// based on the replication policy
func (r ReplicationConfig) TotalPlacementSlots() int {
	return r.MinDataCenters * r.MinRacksPerDC * r.MinNodesPerRack
}

// String returns a human-readable representation
func (r ReplicationConfig) String() string {
	return fmt.Sprintf("replication=%s (DCs:%d, Racks/DC:%d, Nodes/Rack:%d)",
		r.Original, r.MinDataCenters, r.MinRacksPerDC, r.MinNodesPerRack)
}
161
weed/storage/erasure_coding/distribution/distribution.go
Normal file
@@ -0,0 +1,161 @@
package distribution

import (
	"fmt"
)

// ECDistribution represents the target distribution of EC shards
// based on EC configuration and replication policy.
type ECDistribution struct {
	// EC configuration
	ECConfig ECConfig

	// Replication configuration
	ReplicationConfig ReplicationConfig

	// Target shard counts per topology level (balanced distribution)
	TargetShardsPerDC   int
	TargetShardsPerRack int
	TargetShardsPerNode int

	// Maximum shard counts per topology level (fault tolerance limits)
	// These prevent any single failure domain from having too many shards
	MaxShardsPerDC   int
	MaxShardsPerRack int
	MaxShardsPerNode int
}

// CalculateDistribution computes the target EC shard distribution based on
// EC configuration and replication policy.
//
// The algorithm:
//  1. Uses replication policy to determine minimum topology spread
//  2. Calculates target shards per level (evenly distributed)
//  3. Calculates max shards per level (for fault tolerance)
func CalculateDistribution(ec ECConfig, rep ReplicationConfig) *ECDistribution {
	totalShards := ec.TotalShards()

	// Target distribution (balanced, rounded up to ensure all shards placed)
	targetShardsPerDC := ceilDivide(totalShards, rep.MinDataCenters)
	targetShardsPerRack := ceilDivide(targetShardsPerDC, rep.MinRacksPerDC)
	targetShardsPerNode := ceilDivide(targetShardsPerRack, rep.MinNodesPerRack)

	// Maximum limits for fault tolerance
	// The key constraint: losing one failure domain shouldn't lose more than parityShards
	// So max shards per domain = totalShards - parityShards + tolerance
	// We add a small tolerance (+2) to allow for imbalanced topologies
	faultToleranceLimit := totalShards - ec.ParityShards + 1

	maxShardsPerDC := min(faultToleranceLimit, targetShardsPerDC+2)
	maxShardsPerRack := min(faultToleranceLimit, targetShardsPerRack+2)
	maxShardsPerNode := min(faultToleranceLimit, targetShardsPerNode+2)

	return &ECDistribution{
		ECConfig:            ec,
		ReplicationConfig:   rep,
		TargetShardsPerDC:   targetShardsPerDC,
		TargetShardsPerRack: targetShardsPerRack,
		TargetShardsPerNode: targetShardsPerNode,
		MaxShardsPerDC:      maxShardsPerDC,
		MaxShardsPerRack:    maxShardsPerRack,
		MaxShardsPerNode:    maxShardsPerNode,
	}
}

// String returns a human-readable description of the distribution
func (d *ECDistribution) String() string {
	return fmt.Sprintf(
		"ECDistribution{EC:%s, DCs:%d (target:%d/max:%d), Racks/DC:%d (target:%d/max:%d), Nodes/Rack:%d (target:%d/max:%d)}",
		d.ECConfig.String(),
		d.ReplicationConfig.MinDataCenters, d.TargetShardsPerDC, d.MaxShardsPerDC,
		d.ReplicationConfig.MinRacksPerDC, d.TargetShardsPerRack, d.MaxShardsPerRack,
		d.ReplicationConfig.MinNodesPerRack, d.TargetShardsPerNode, d.MaxShardsPerNode,
	)
}

// Summary returns a multi-line summary of the distribution plan
func (d *ECDistribution) Summary() string {
	summary := fmt.Sprintf("EC Configuration: %s\n", d.ECConfig.String())
	summary += fmt.Sprintf("Replication: %s\n", d.ReplicationConfig.String())
	summary += "Distribution Plan:\n"
	summary += fmt.Sprintf("  Data Centers: %d (target %d shards each, max %d)\n",
		d.ReplicationConfig.MinDataCenters, d.TargetShardsPerDC, d.MaxShardsPerDC)
	summary += fmt.Sprintf("  Racks per DC: %d (target %d shards each, max %d)\n",
		d.ReplicationConfig.MinRacksPerDC, d.TargetShardsPerRack, d.MaxShardsPerRack)
	summary += fmt.Sprintf("  Nodes per Rack: %d (target %d shards each, max %d)\n",
		d.ReplicationConfig.MinNodesPerRack, d.TargetShardsPerNode, d.MaxShardsPerNode)
	return summary
}

// CanSurviveDCFailure returns true if the distribution can survive
// complete loss of one data center
func (d *ECDistribution) CanSurviveDCFailure() bool {
	// After losing one DC with max shards, check if remaining shards are enough
	remainingAfterDCLoss := d.ECConfig.TotalShards() - d.TargetShardsPerDC
	return remainingAfterDCLoss >= d.ECConfig.MinShardsForReconstruction()
}

// CanSurviveRackFailure returns true if the distribution can survive
// complete loss of one rack
func (d *ECDistribution) CanSurviveRackFailure() bool {
	remainingAfterRackLoss := d.ECConfig.TotalShards() - d.TargetShardsPerRack
	return remainingAfterRackLoss >= d.ECConfig.MinShardsForReconstruction()
}

// MinDCsForDCFaultTolerance calculates the minimum number of DCs needed
// to survive complete DC failure with this EC configuration
func (d *ECDistribution) MinDCsForDCFaultTolerance() int {
	// To survive DC failure, max shards per DC = parityShards
	maxShardsPerDC := d.ECConfig.MaxTolerableLoss()
	if maxShardsPerDC == 0 {
		return d.ECConfig.TotalShards() // Would need one DC per shard
	}
	return ceilDivide(d.ECConfig.TotalShards(), maxShardsPerDC)
}

// FaultToleranceAnalysis returns a detailed analysis of fault tolerance
func (d *ECDistribution) FaultToleranceAnalysis() string {
	analysis := fmt.Sprintf("Fault Tolerance Analysis for %s:\n", d.ECConfig.String())

	// DC failure
	dcSurvive := d.CanSurviveDCFailure()
	shardsAfterDC := d.ECConfig.TotalShards() - d.TargetShardsPerDC
	analysis += fmt.Sprintf("  DC Failure: %s\n", boolToResult(dcSurvive))
	analysis += fmt.Sprintf("    - Losing one DC loses ~%d shards\n", d.TargetShardsPerDC)
	analysis += fmt.Sprintf("    - Remaining: %d shards (need %d)\n", shardsAfterDC, d.ECConfig.DataShards)
	if !dcSurvive {
		analysis += fmt.Sprintf("    - Need at least %d DCs for DC fault tolerance\n", d.MinDCsForDCFaultTolerance())
	}

	// Rack failure
	rackSurvive := d.CanSurviveRackFailure()
	shardsAfterRack := d.ECConfig.TotalShards() - d.TargetShardsPerRack
|
||||
analysis += fmt.Sprintf(" Rack Failure: %s\n", boolToResult(rackSurvive))
|
||||
analysis += fmt.Sprintf(" - Losing one rack loses ~%d shards\n", d.TargetShardsPerRack)
|
||||
analysis += fmt.Sprintf(" - Remaining: %d shards (need %d)\n", shardsAfterRack, d.ECConfig.DataShards)
|
||||
|
||||
// Node failure (usually survivable)
|
||||
shardsAfterNode := d.ECConfig.TotalShards() - d.TargetShardsPerNode
|
||||
nodeSurvive := shardsAfterNode >= d.ECConfig.DataShards
|
||||
analysis += fmt.Sprintf(" Node Failure: %s\n", boolToResult(nodeSurvive))
|
||||
analysis += fmt.Sprintf(" - Losing one node loses ~%d shards\n", d.TargetShardsPerNode)
|
||||
analysis += fmt.Sprintf(" - Remaining: %d shards (need %d)\n", shardsAfterNode, d.ECConfig.DataShards)
|
||||
|
||||
return analysis
|
||||
}
|
||||
|
||||
func boolToResult(b bool) string {
|
||||
if b {
|
||||
return "SURVIVABLE ✓"
|
||||
}
|
||||
return "NOT SURVIVABLE ✗"
|
||||
}
|
||||
|
||||
// ceilDivide performs ceiling division
|
||||
func ceilDivide(a, b int) int {
|
||||
if b <= 0 {
|
||||
return a
|
||||
}
|
||||
return (a + b - 1) / b
|
||||
}
|
||||
|
||||
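The survivability checks above are pure arithmetic: a failure domain is survivable when the shards left after losing that domain's allocation still meet the reconstruction minimum, and the minimum domain count follows from ceiling division by the tolerable loss. A standalone sketch of that arithmetic (plain ints, independent of the package's `ECConfig`/`ECDistribution` types):

```go
package main

import "fmt"

// canSurvive reports whether losing one failure domain that holds
// shardsPerDomain shards still leaves enough shards to reconstruct.
func canSurvive(totalShards, shardsPerDomain, dataShards int) bool {
	return totalShards-shardsPerDomain >= dataShards
}

// minDomains is the smallest domain count such that no domain holds
// more shards than the code can lose (ceiling division, as in ceilDivide).
func minDomains(totalShards, maxLoss int) int {
	if maxLoss <= 0 {
		return totalShards
	}
	return (totalShards + maxLoss - 1) / maxLoss
}

func main() {
	// 10+4 over 2 DCs: one DC holds 7, leaving 7 < 10 -> not survivable
	fmt.Println(canSurvive(14, 7, 10)) // false
	// 10+4 over 4 DCs: one DC holds at most 4, leaving 10 >= 10 -> survivable
	fmt.Println(canSurvive(14, 4, 10)) // true
	// 10+4 therefore needs ceil(14/4) = 4 DCs for DC fault tolerance
	fmt.Println(minDomains(14, 4)) // 4
}
```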
weed/storage/erasure_coding/distribution/distribution_test.go (new file, 565 lines)
package distribution

import (
	"testing"
)

func TestNewECConfig(t *testing.T) {
	tests := []struct {
		name         string
		dataShards   int
		parityShards int
		wantErr      bool
	}{
		{"valid 10+4", 10, 4, false},
		{"valid 8+4", 8, 4, false},
		{"valid 6+3", 6, 3, false},
		{"valid 4+2", 4, 2, false},
		{"invalid data=0", 0, 4, true},
		{"invalid parity=0", 10, 0, true},
		{"invalid total>32", 20, 15, true},
	}

	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			config, err := NewECConfig(tt.dataShards, tt.parityShards)
			if (err != nil) != tt.wantErr {
				t.Errorf("NewECConfig() error = %v, wantErr %v", err, tt.wantErr)
				return
			}
			if !tt.wantErr {
				if config.DataShards != tt.dataShards {
					t.Errorf("DataShards = %d, want %d", config.DataShards, tt.dataShards)
				}
				if config.ParityShards != tt.parityShards {
					t.Errorf("ParityShards = %d, want %d", config.ParityShards, tt.parityShards)
				}
				if config.TotalShards() != tt.dataShards+tt.parityShards {
					t.Errorf("TotalShards() = %d, want %d", config.TotalShards(), tt.dataShards+tt.parityShards)
				}
			}
		})
	}
}

func TestCalculateDistribution(t *testing.T) {
	tests := []struct {
		name                    string
		ecConfig                ECConfig
		replication             string
		expectedMinDCs          int
		expectedMinRacksPerDC   int
		expectedMinNodesPerRack int
		expectedTargetPerDC     int
		expectedTargetPerRack   int
		expectedTargetPerNode   int
	}{
		{
			name:                    "10+4 with 000",
			ecConfig:                DefaultECConfig(),
			replication:             "000",
			expectedMinDCs:          1,
			expectedMinRacksPerDC:   1,
			expectedMinNodesPerRack: 1,
			expectedTargetPerDC:     14,
			expectedTargetPerRack:   14,
			expectedTargetPerNode:   14,
		},
		{
			name:                    "10+4 with 100",
			ecConfig:                DefaultECConfig(),
			replication:             "100",
			expectedMinDCs:          2,
			expectedMinRacksPerDC:   1,
			expectedMinNodesPerRack: 1,
			expectedTargetPerDC:     7,
			expectedTargetPerRack:   7,
			expectedTargetPerNode:   7,
		},
		{
			name:                    "10+4 with 110",
			ecConfig:                DefaultECConfig(),
			replication:             "110",
			expectedMinDCs:          2,
			expectedMinRacksPerDC:   2,
			expectedMinNodesPerRack: 1,
			expectedTargetPerDC:     7,
			expectedTargetPerRack:   4,
			expectedTargetPerNode:   4,
		},
		{
			name:                    "10+4 with 200",
			ecConfig:                DefaultECConfig(),
			replication:             "200",
			expectedMinDCs:          3,
			expectedMinRacksPerDC:   1,
			expectedMinNodesPerRack: 1,
			expectedTargetPerDC:     5,
			expectedTargetPerRack:   5,
			expectedTargetPerNode:   5,
		},
		{
			name: "8+4 with 110",
			ecConfig: ECConfig{
				DataShards:   8,
				ParityShards: 4,
			},
			replication:             "110",
			expectedMinDCs:          2,
			expectedMinRacksPerDC:   2,
			expectedMinNodesPerRack: 1,
			expectedTargetPerDC:     6, // 12/2 = 6
			expectedTargetPerRack:   3, // 6/2 = 3
			expectedTargetPerNode:   3,
		},
		{
			name: "6+3 with 100",
			ecConfig: ECConfig{
				DataShards:   6,
				ParityShards: 3,
			},
			replication:             "100",
			expectedMinDCs:          2,
			expectedMinRacksPerDC:   1,
			expectedMinNodesPerRack: 1,
			expectedTargetPerDC:     5, // ceil(9/2) = 5
			expectedTargetPerRack:   5,
			expectedTargetPerNode:   5,
		},
	}

	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			rep, err := NewReplicationConfigFromString(tt.replication)
			if err != nil {
				t.Fatalf("Failed to parse replication %s: %v", tt.replication, err)
			}

			dist := CalculateDistribution(tt.ecConfig, rep)

			if dist.ReplicationConfig.MinDataCenters != tt.expectedMinDCs {
				t.Errorf("MinDataCenters = %d, want %d", dist.ReplicationConfig.MinDataCenters, tt.expectedMinDCs)
			}
			if dist.ReplicationConfig.MinRacksPerDC != tt.expectedMinRacksPerDC {
				t.Errorf("MinRacksPerDC = %d, want %d", dist.ReplicationConfig.MinRacksPerDC, tt.expectedMinRacksPerDC)
			}
			if dist.ReplicationConfig.MinNodesPerRack != tt.expectedMinNodesPerRack {
				t.Errorf("MinNodesPerRack = %d, want %d", dist.ReplicationConfig.MinNodesPerRack, tt.expectedMinNodesPerRack)
			}
			if dist.TargetShardsPerDC != tt.expectedTargetPerDC {
				t.Errorf("TargetShardsPerDC = %d, want %d", dist.TargetShardsPerDC, tt.expectedTargetPerDC)
			}
			if dist.TargetShardsPerRack != tt.expectedTargetPerRack {
				t.Errorf("TargetShardsPerRack = %d, want %d", dist.TargetShardsPerRack, tt.expectedTargetPerRack)
			}
			if dist.TargetShardsPerNode != tt.expectedTargetPerNode {
				t.Errorf("TargetShardsPerNode = %d, want %d", dist.TargetShardsPerNode, tt.expectedTargetPerNode)
			}

			t.Logf("Distribution for %s: %s", tt.name, dist.String())
		})
	}
}

func TestFaultToleranceAnalysis(t *testing.T) {
	tests := []struct {
		name           string
		ecConfig       ECConfig
		replication    string
		canSurviveDC   bool
		canSurviveRack bool
	}{
		// 10+4 = 14 shards, need 10 to reconstruct, can lose 4
		{"10+4 000", DefaultECConfig(), "000", false, false}, // All in one, any failure is fatal
		{"10+4 100", DefaultECConfig(), "100", false, false}, // 7 per DC/rack, 7 remaining < 10
		{"10+4 200", DefaultECConfig(), "200", false, false}, // 5 per DC/rack, 9 remaining < 10
		{"10+4 110", DefaultECConfig(), "110", false, true},  // 4 per rack, 10 remaining = enough for rack

		// 8+4 = 12 shards, need 8 to reconstruct, can lose 4
		{"8+4 100", ECConfig{8, 4}, "100", false, false}, // 6 per DC/rack, 6 remaining < 8
		{"8+4 200", ECConfig{8, 4}, "200", true, true},   // 4 per DC/rack, 8 remaining = enough!
	}

	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			rep, _ := NewReplicationConfigFromString(tt.replication)
			dist := CalculateDistribution(tt.ecConfig, rep)

			if dist.CanSurviveDCFailure() != tt.canSurviveDC {
				t.Errorf("CanSurviveDCFailure() = %v, want %v", dist.CanSurviveDCFailure(), tt.canSurviveDC)
			}
			if dist.CanSurviveRackFailure() != tt.canSurviveRack {
				t.Errorf("CanSurviveRackFailure() = %v, want %v", dist.CanSurviveRackFailure(), tt.canSurviveRack)
			}

			t.Log(dist.FaultToleranceAnalysis())
		})
	}
}

func TestMinDCsForDCFaultTolerance(t *testing.T) {
	tests := []struct {
		name     string
		ecConfig ECConfig
		minDCs   int
	}{
		// 10+4: can lose 4, so max 4 per DC, ceil(14/4) = 4 DCs needed
		{"10+4", DefaultECConfig(), 4},
		// 8+4: can lose 4, so max 4 per DC, 12/4 = 3 DCs needed
		{"8+4", ECConfig{8, 4}, 3},
		// 6+3: can lose 3, so max 3 per DC, 9/3 = 3 DCs needed
		{"6+3", ECConfig{6, 3}, 3},
		// 4+2: can lose 2, so max 2 per DC, 6/2 = 3 DCs needed
		{"4+2", ECConfig{4, 2}, 3},
	}

	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			rep, _ := NewReplicationConfigFromString("000")
			dist := CalculateDistribution(tt.ecConfig, rep)

			if dist.MinDCsForDCFaultTolerance() != tt.minDCs {
				t.Errorf("MinDCsForDCFaultTolerance() = %d, want %d",
					dist.MinDCsForDCFaultTolerance(), tt.minDCs)
			}

			t.Logf("%s: needs %d DCs for DC fault tolerance", tt.name, dist.MinDCsForDCFaultTolerance())
		})
	}
}

func TestTopologyAnalysis(t *testing.T) {
	analysis := NewTopologyAnalysis()

	// Add nodes to topology
	node1 := &TopologyNode{
		NodeID:     "node1",
		DataCenter: "dc1",
		Rack:       "rack1",
		FreeSlots:  5,
	}
	node2 := &TopologyNode{
		NodeID:     "node2",
		DataCenter: "dc1",
		Rack:       "rack2",
		FreeSlots:  10,
	}
	node3 := &TopologyNode{
		NodeID:     "node3",
		DataCenter: "dc2",
		Rack:       "rack3",
		FreeSlots:  10,
	}

	analysis.AddNode(node1)
	analysis.AddNode(node2)
	analysis.AddNode(node3)

	// Add shard locations (all on node1)
	for i := 0; i < 14; i++ {
		analysis.AddShardLocation(ShardLocation{
			ShardID:    i,
			NodeID:     "node1",
			DataCenter: "dc1",
			Rack:       "rack1",
		})
	}

	analysis.Finalize()

	// Verify counts
	if analysis.TotalShards != 14 {
		t.Errorf("TotalShards = %d, want 14", analysis.TotalShards)
	}
	if analysis.ShardsByDC["dc1"] != 14 {
		t.Errorf("ShardsByDC[dc1] = %d, want 14", analysis.ShardsByDC["dc1"])
	}
	if analysis.ShardsByRack["rack1"] != 14 {
		t.Errorf("ShardsByRack[rack1] = %d, want 14", analysis.ShardsByRack["rack1"])
	}
	if analysis.ShardsByNode["node1"] != 14 {
		t.Errorf("ShardsByNode[node1] = %d, want 14", analysis.ShardsByNode["node1"])
	}

	t.Log(analysis.DetailedString())
}

func TestRebalancer(t *testing.T) {
	// Build topology: 2 DCs, 2 racks each, all shards on one node
	analysis := NewTopologyAnalysis()

	// Add nodes
	nodes := []*TopologyNode{
		{NodeID: "dc1-rack1-node1", DataCenter: "dc1", Rack: "dc1-rack1", FreeSlots: 0},
		{NodeID: "dc1-rack2-node1", DataCenter: "dc1", Rack: "dc1-rack2", FreeSlots: 10},
		{NodeID: "dc2-rack1-node1", DataCenter: "dc2", Rack: "dc2-rack1", FreeSlots: 10},
		{NodeID: "dc2-rack2-node1", DataCenter: "dc2", Rack: "dc2-rack2", FreeSlots: 10},
	}
	for _, node := range nodes {
		analysis.AddNode(node)
	}

	// Add all 14 shards to first node
	for i := 0; i < 14; i++ {
		analysis.AddShardLocation(ShardLocation{
			ShardID:    i,
			NodeID:     "dc1-rack1-node1",
			DataCenter: "dc1",
			Rack:       "dc1-rack1",
		})
	}
	analysis.Finalize()

	// Create rebalancer with 110 replication (2 DCs, 2 racks each)
	ec := DefaultECConfig()
	rep, _ := NewReplicationConfigFromString("110")
	rebalancer := NewRebalancer(ec, rep)

	plan, err := rebalancer.PlanRebalance(analysis)
	if err != nil {
		t.Fatalf("PlanRebalance failed: %v", err)
	}

	t.Logf("Planned %d moves", plan.TotalMoves)
	t.Log(plan.DetailedString())

	// Verify we're moving shards to dc2
	movedToDC2 := 0
	for _, move := range plan.Moves {
		if move.DestNode.DataCenter == "dc2" {
			movedToDC2++
		}
	}

	if movedToDC2 == 0 {
		t.Error("Expected some moves to dc2")
	}

	// With "110" replication, target is 7 shards per DC
	// Starting with 14 in dc1, should plan to move 7 to dc2
	if plan.MovesAcrossDC < 7 {
		t.Errorf("Expected at least 7 cross-DC moves for 110 replication, got %d", plan.MovesAcrossDC)
	}
}

func TestCustomECRatios(t *testing.T) {
	// Test various custom EC ratios that seaweed-enterprise might use
	ratios := []struct {
		name   string
		data   int
		parity int
	}{
		{"4+2", 4, 2},
		{"6+3", 6, 3},
		{"8+2", 8, 2},
		{"8+4", 8, 4},
		{"10+4", 10, 4},
		{"12+4", 12, 4},
		{"16+4", 16, 4},
	}

	for _, ratio := range ratios {
		t.Run(ratio.name, func(t *testing.T) {
			ec, err := NewECConfig(ratio.data, ratio.parity)
			if err != nil {
				t.Fatalf("Failed to create EC config: %v", err)
			}

			rep, _ := NewReplicationConfigFromString("110")
			dist := CalculateDistribution(ec, rep)

			t.Logf("EC %s with replication 110:", ratio.name)
			t.Logf("  Total shards: %d", ec.TotalShards())
			t.Logf("  Can lose: %d shards", ec.MaxTolerableLoss())
			t.Logf("  Target per DC: %d", dist.TargetShardsPerDC)
			t.Logf("  Target per rack: %d", dist.TargetShardsPerRack)
			t.Logf("  Min DCs for DC fault tolerance: %d", dist.MinDCsForDCFaultTolerance())

			// Verify basic sanity
			if dist.TargetShardsPerDC*2 < ec.TotalShards() {
				t.Errorf("Target per DC (%d) * 2 should be >= total (%d)",
					dist.TargetShardsPerDC, ec.TotalShards())
			}
		})
	}
}

func TestShardClassification(t *testing.T) {
	ec := DefaultECConfig() // 10+4

	// Test IsDataShard
	for i := 0; i < 10; i++ {
		if !ec.IsDataShard(i) {
			t.Errorf("Shard %d should be a data shard", i)
		}
		if ec.IsParityShard(i) {
			t.Errorf("Shard %d should not be a parity shard", i)
		}
	}

	// Test IsParityShard
	for i := 10; i < 14; i++ {
		if ec.IsDataShard(i) {
			t.Errorf("Shard %d should not be a data shard", i)
		}
		if !ec.IsParityShard(i) {
			t.Errorf("Shard %d should be a parity shard", i)
		}
	}

	// Test with custom 8+4 EC
	ec84, _ := NewECConfig(8, 4)
	for i := 0; i < 8; i++ {
		if !ec84.IsDataShard(i) {
			t.Errorf("8+4 EC: Shard %d should be a data shard", i)
		}
	}
	for i := 8; i < 12; i++ {
		if !ec84.IsParityShard(i) {
			t.Errorf("8+4 EC: Shard %d should be a parity shard", i)
		}
	}
}

func TestSortShardsDataFirst(t *testing.T) {
	ec := DefaultECConfig() // 10+4

	// Mixed shards: [0, 10, 5, 11, 2, 12, 7, 13]
	shards := []int{0, 10, 5, 11, 2, 12, 7, 13}
	sorted := ec.SortShardsDataFirst(shards)

	t.Logf("Original: %v", shards)
	t.Logf("Sorted (data first): %v", sorted)

	// First 4 should be data shards (0, 5, 2, 7)
	for i := 0; i < 4; i++ {
		if !ec.IsDataShard(sorted[i]) {
			t.Errorf("Position %d should be a data shard, got %d", i, sorted[i])
		}
	}

	// Last 4 should be parity shards (10, 11, 12, 13)
	for i := 4; i < 8; i++ {
		if !ec.IsParityShard(sorted[i]) {
			t.Errorf("Position %d should be a parity shard, got %d", i, sorted[i])
		}
	}
}

func TestSortShardsParityFirst(t *testing.T) {
	ec := DefaultECConfig() // 10+4

	// Mixed shards: [0, 10, 5, 11, 2, 12, 7, 13]
	shards := []int{0, 10, 5, 11, 2, 12, 7, 13}
	sorted := ec.SortShardsParityFirst(shards)

	t.Logf("Original: %v", shards)
	t.Logf("Sorted (parity first): %v", sorted)

	// First 4 should be parity shards (10, 11, 12, 13)
	for i := 0; i < 4; i++ {
		if !ec.IsParityShard(sorted[i]) {
			t.Errorf("Position %d should be a parity shard, got %d", i, sorted[i])
		}
	}

	// Last 4 should be data shards (0, 5, 2, 7)
	for i := 4; i < 8; i++ {
		if !ec.IsDataShard(sorted[i]) {
			t.Errorf("Position %d should be a data shard, got %d", i, sorted[i])
		}
	}
}

func TestRebalancerPrefersMovingParityShards(t *testing.T) {
	// Build topology where one node has all shards including mix of data and parity
	analysis := NewTopologyAnalysis()

	// Node 1: Has all 14 shards (mixed data and parity)
	node1 := &TopologyNode{
		NodeID:     "node1",
		DataCenter: "dc1",
		Rack:       "rack1",
		FreeSlots:  0,
	}
	analysis.AddNode(node1)

	// Node 2: Empty, ready to receive
	node2 := &TopologyNode{
		NodeID:     "node2",
		DataCenter: "dc1",
		Rack:       "rack1",
		FreeSlots:  10,
	}
	analysis.AddNode(node2)

	// Add all 14 shards to node1
	for i := 0; i < 14; i++ {
		analysis.AddShardLocation(ShardLocation{
			ShardID:    i,
			NodeID:     "node1",
			DataCenter: "dc1",
			Rack:       "rack1",
		})
	}
	analysis.Finalize()

	// Create rebalancer
	ec := DefaultECConfig()
	rep, _ := NewReplicationConfigFromString("000")
	rebalancer := NewRebalancer(ec, rep)

	plan, err := rebalancer.PlanRebalance(analysis)
	if err != nil {
		t.Fatalf("PlanRebalance failed: %v", err)
	}

	t.Logf("Planned %d moves", len(plan.Moves))

	// Check that parity shards are moved first
	parityMovesFirst := 0
	dataMoves := 0
	seenDataMove := false

	for _, move := range plan.Moves {
		isParity := ec.IsParityShard(move.ShardID)
		t.Logf("Move shard %d (parity=%v): %s -> %s",
			move.ShardID, isParity, move.SourceNode.NodeID, move.DestNode.NodeID)

		if isParity && !seenDataMove {
			parityMovesFirst++
		} else if !isParity {
			seenDataMove = true
			dataMoves++
		}
	}

	t.Logf("Parity moves before first data move: %d", parityMovesFirst)
	t.Logf("Data moves: %d", dataMoves)

	// With 10+4 EC, there are 4 parity shards
	// They should be moved before data shards when possible
	if parityMovesFirst < 4 && len(plan.Moves) >= 4 {
		t.Logf("Note: Expected parity shards to be moved first, but got %d parity moves before data moves", parityMovesFirst)
	}
}

func TestDistributionSummary(t *testing.T) {
	ec := DefaultECConfig()
	rep, _ := NewReplicationConfigFromString("110")
	dist := CalculateDistribution(ec, rep)

	summary := dist.Summary()
	t.Log(summary)

	if len(summary) == 0 {
		t.Error("Summary should not be empty")
	}

	analysis := dist.FaultToleranceAnalysis()
	t.Log(analysis)

	if len(analysis) == 0 {
		t.Error("Fault tolerance analysis should not be empty")
	}
}
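The two sorting tests above pin down a simple invariant: with n data shards, IDs 0..n-1 are data and the rest are parity, and the sort only regroups shards without reordering them within a group. A standalone stable-partition sketch of that behavior (plain ints, not the package's `SortShardsParityFirst` implementation):

```go
package main

import (
	"fmt"
	"sort"
)

// sortParityFirst reorders shard IDs so parity shards (ID >= dataShards)
// come first. The stable sort preserves the original relative order
// inside each group, which is the property the tests above assert.
func sortParityFirst(shards []int, dataShards int) []int {
	out := append([]int(nil), shards...) // copy; leave the input untouched
	sort.SliceStable(out, func(i, j int) bool {
		// "less" iff i is parity and j is data: parity sorts to the front
		return out[i] >= dataShards && out[j] < dataShards
	})
	return out
}

func main() {
	mixed := []int{0, 10, 5, 11, 2, 12, 7, 13}
	fmt.Println(sortParityFirst(mixed, 10)) // [10 11 12 13 0 5 2 7]
}
```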
weed/storage/erasure_coding/distribution/rebalancer.go (new file, 378 lines)
package distribution

import (
	"fmt"
	"slices"
)

// ShardMove represents a planned shard move
type ShardMove struct {
	ShardID    int
	SourceNode *TopologyNode
	DestNode   *TopologyNode
	Reason     string
}

// String returns a human-readable description of the move
func (m ShardMove) String() string {
	return fmt.Sprintf("shard %d: %s -> %s (%s)",
		m.ShardID, m.SourceNode.NodeID, m.DestNode.NodeID, m.Reason)
}

// RebalancePlan contains the complete plan for rebalancing EC shards
type RebalancePlan struct {
	Moves        []ShardMove
	Distribution *ECDistribution
	Analysis     *TopologyAnalysis

	// Statistics
	TotalMoves      int
	MovesAcrossDC   int
	MovesAcrossRack int
	MovesWithinRack int
}

// String returns a summary of the plan
func (p *RebalancePlan) String() string {
	return fmt.Sprintf("RebalancePlan{moves:%d, acrossDC:%d, acrossRack:%d, withinRack:%d}",
		p.TotalMoves, p.MovesAcrossDC, p.MovesAcrossRack, p.MovesWithinRack)
}

// DetailedString returns a detailed multi-line summary
func (p *RebalancePlan) DetailedString() string {
	s := "Rebalance Plan:\n"
	s += fmt.Sprintf(" Total Moves: %d\n", p.TotalMoves)
	s += fmt.Sprintf(" Across DC: %d\n", p.MovesAcrossDC)
	s += fmt.Sprintf(" Across Rack: %d\n", p.MovesAcrossRack)
	s += fmt.Sprintf(" Within Rack: %d\n", p.MovesWithinRack)
	s += "\nMoves:\n"
	for i, move := range p.Moves {
		s += fmt.Sprintf(" %d. %s\n", i+1, move.String())
	}
	return s
}

// Rebalancer plans shard moves to achieve proportional distribution
type Rebalancer struct {
	ecConfig  ECConfig
	repConfig ReplicationConfig
}

// NewRebalancer creates a new rebalancer with the given configuration
func NewRebalancer(ec ECConfig, rep ReplicationConfig) *Rebalancer {
	return &Rebalancer{
		ecConfig:  ec,
		repConfig: rep,
	}
}

// PlanRebalance creates a rebalancing plan based on current topology analysis
func (r *Rebalancer) PlanRebalance(analysis *TopologyAnalysis) (*RebalancePlan, error) {
	dist := CalculateDistribution(r.ecConfig, r.repConfig)

	plan := &RebalancePlan{
		Distribution: dist,
		Analysis:     analysis,
	}

	// Step 1: Balance across data centers
	dcMoves := r.planDCMoves(analysis, dist)
	for _, move := range dcMoves {
		plan.Moves = append(plan.Moves, move)
		plan.MovesAcrossDC++
	}

	// Update analysis after DC moves (for planning purposes)
	r.applyMovesToAnalysis(analysis, dcMoves)

	// Step 2: Balance across racks within each DC
	rackMoves := r.planRackMoves(analysis, dist)
	for _, move := range rackMoves {
		plan.Moves = append(plan.Moves, move)
		plan.MovesAcrossRack++
	}

	// Update analysis after rack moves
	r.applyMovesToAnalysis(analysis, rackMoves)

	// Step 3: Balance across nodes within each rack
	nodeMoves := r.planNodeMoves(analysis, dist)
	for _, move := range nodeMoves {
		plan.Moves = append(plan.Moves, move)
		plan.MovesWithinRack++
	}

	plan.TotalMoves = len(plan.Moves)

	return plan, nil
}

// planDCMoves plans moves to balance shards across data centers
func (r *Rebalancer) planDCMoves(analysis *TopologyAnalysis, dist *ECDistribution) []ShardMove {
	var moves []ShardMove

	overDCs := CalculateDCExcess(analysis, dist)
	underDCs := CalculateUnderservedDCs(analysis, dist)

	underIdx := 0
	for _, over := range overDCs {
		for over.Excess > 0 && underIdx < len(underDCs) {
			destDC := underDCs[underIdx]

			// Find a shard and source node
			shardID, srcNode := r.pickShardToMove(analysis, over.Nodes)
			if srcNode == nil {
				break
			}

			// Find destination node in target DC
			destNode := r.pickBestDestination(analysis, destDC, "", dist)
			if destNode == nil {
				underIdx++
				continue
			}

			moves = append(moves, ShardMove{
				ShardID:    shardID,
				SourceNode: srcNode,
				DestNode:   destNode,
				Reason:     fmt.Sprintf("balance DC: %s -> %s", srcNode.DataCenter, destDC),
			})

			over.Excess--
			analysis.ShardsByDC[srcNode.DataCenter]--
			analysis.ShardsByDC[destDC]++

			// Check if destDC reached target
			if analysis.ShardsByDC[destDC] >= dist.TargetShardsPerDC {
				underIdx++
			}
		}
	}

	return moves
}

// planRackMoves plans moves to balance shards across racks within each DC
func (r *Rebalancer) planRackMoves(analysis *TopologyAnalysis, dist *ECDistribution) []ShardMove {
	var moves []ShardMove

	for dc := range analysis.DCToRacks {
		dcShards := analysis.ShardsByDC[dc]
		numRacks := len(analysis.DCToRacks[dc])
		if numRacks == 0 {
			continue
		}

		targetPerRack := ceilDivide(dcShards, max(numRacks, dist.ReplicationConfig.MinRacksPerDC))

		overRacks := CalculateRackExcess(analysis, dc, targetPerRack)
		underRacks := CalculateUnderservedRacks(analysis, dc, targetPerRack)

		underIdx := 0
		for _, over := range overRacks {
			for over.Excess > 0 && underIdx < len(underRacks) {
				destRack := underRacks[underIdx]

				// Find shard and source node
				shardID, srcNode := r.pickShardToMove(analysis, over.Nodes)
				if srcNode == nil {
					break
				}

				// Find destination node in target rack
				destNode := r.pickBestDestination(analysis, dc, destRack, dist)
				if destNode == nil {
					underIdx++
					continue
				}

				moves = append(moves, ShardMove{
					ShardID:    shardID,
					SourceNode: srcNode,
					DestNode:   destNode,
					Reason:     fmt.Sprintf("balance rack: %s -> %s", srcNode.Rack, destRack),
				})

				over.Excess--
				analysis.ShardsByRack[srcNode.Rack]--
				analysis.ShardsByRack[destRack]++

				if analysis.ShardsByRack[destRack] >= targetPerRack {
					underIdx++
				}
			}
		}
	}

	return moves
}

// planNodeMoves plans moves to balance shards across nodes within each rack
func (r *Rebalancer) planNodeMoves(analysis *TopologyAnalysis, dist *ECDistribution) []ShardMove {
	var moves []ShardMove

	for rack, nodes := range analysis.RackToNodes {
		if len(nodes) <= 1 {
			continue
		}

		rackShards := analysis.ShardsByRack[rack]
		targetPerNode := ceilDivide(rackShards, max(len(nodes), dist.ReplicationConfig.MinNodesPerRack))

		// Find over and under nodes
		var overNodes []*TopologyNode
		var underNodes []*TopologyNode

		for _, node := range nodes {
			count := analysis.ShardsByNode[node.NodeID]
			if count > targetPerNode {
				overNodes = append(overNodes, node)
			} else if count < targetPerNode {
				underNodes = append(underNodes, node)
			}
		}

		// Sort by excess/deficit
		slices.SortFunc(overNodes, func(a, b *TopologyNode) int {
			return analysis.ShardsByNode[b.NodeID] - analysis.ShardsByNode[a.NodeID]
		})

		underIdx := 0
		for _, srcNode := range overNodes {
			excess := analysis.ShardsByNode[srcNode.NodeID] - targetPerNode

			for excess > 0 && underIdx < len(underNodes) {
				destNode := underNodes[underIdx]

				// Pick a shard from this node, preferring parity shards
				shards := analysis.NodeToShards[srcNode.NodeID]
				if len(shards) == 0 {
					break
				}

				// Find a parity shard first, fallback to data shard
				shardID := -1
				shardIdx := -1
				for i, s := range shards {
					if r.ecConfig.IsParityShard(s) {
						shardID = s
						shardIdx = i
						break
					}
				}
				if shardID == -1 {
					shardID = shards[0]
					shardIdx = 0
				}

				moves = append(moves, ShardMove{
					ShardID:    shardID,
					SourceNode: srcNode,
					DestNode:   destNode,
					Reason:     fmt.Sprintf("balance node: %s -> %s", srcNode.NodeID, destNode.NodeID),
				})

				excess--
				analysis.ShardsByNode[srcNode.NodeID]--
				analysis.ShardsByNode[destNode.NodeID]++

				// Update shard lists - remove the specific shard we picked
				analysis.NodeToShards[srcNode.NodeID] = append(
					shards[:shardIdx], shards[shardIdx+1:]...)
				analysis.NodeToShards[destNode.NodeID] = append(
					analysis.NodeToShards[destNode.NodeID], shardID)

				if analysis.ShardsByNode[destNode.NodeID] >= targetPerNode {
					underIdx++
				}
			}
		}
	}

	return moves
}
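All three planners above share one greedy shape: walk the over-target domains, pair each excess shard with the first still-under-target domain, and advance the destination pointer once that domain reaches its target. A condensed standalone sketch of that pairing loop (hypothetical `planMoves` helper over plain counts, not part of this package):

```go
package main

import "fmt"

// planMoves greedily pairs over-target domains with under-target domains,
// mirroring the over/under loop structure used by the planners above.
// counts maps domain -> shard count; order fixes iteration order; target
// is the desired shards per domain. It returns (from, to) move pairs and
// mutates counts in place, as the planners do for planning purposes.
func planMoves(counts map[string]int, order []string, target int) [][2]string {
	var over, under []string
	for _, d := range order {
		if counts[d] > target {
			over = append(over, d)
		} else if counts[d] < target {
			under = append(under, d)
		}
	}

	var moves [][2]string
	underIdx := 0
	for _, src := range over {
		for counts[src] > target && underIdx < len(under) {
			dst := under[underIdx]
			moves = append(moves, [2]string{src, dst})
			counts[src]--
			counts[dst]++
			// Advance once the destination reaches its target
			if counts[dst] >= target {
				underIdx++
			}
		}
	}
	return moves
}

func main() {
	// 14 shards all in dc1, target 7 per DC -> 7 moves dc1 -> dc2
	counts := map[string]int{"dc1": 14, "dc2": 0}
	moves := planMoves(counts, []string{"dc1", "dc2"}, 7)
	fmt.Println(len(moves), counts["dc1"], counts["dc2"]) // 7 7 7
}
```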
|
||||
// pickShardToMove selects a shard and its node from the given nodes.
|
||||
// It prefers to move parity shards first, keeping data shards spread out
|
||||
// since data shards serve read requests while parity shards are only for reconstruction.
|
||||
func (r *Rebalancer) pickShardToMove(analysis *TopologyAnalysis, nodes []*TopologyNode) (int, *TopologyNode) {
|
||||
// Sort by shard count (most shards first)
|
||||
slices.SortFunc(nodes, func(a, b *TopologyNode) int {
|
||||
return analysis.ShardsByNode[b.NodeID] - analysis.ShardsByNode[a.NodeID]
|
||||
})
|
||||
|
||||
// First pass: try to find a parity shard to move (prefer moving parity)
|
||||
for _, node := range nodes {
|
||||
shards := analysis.NodeToShards[node.NodeID]
|
||||
for _, shardID := range shards {
|
||||
if r.ecConfig.IsParityShard(shardID) {
|
||||
return shardID, node
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Second pass: if no parity shards, move a data shard
|
||||
for _, node := range nodes {
|
||||
shards := analysis.NodeToShards[node.NodeID]
|
||||
if len(shards) > 0 {
|
||||
return shards[0], node
|
||||
}
|
||||
}
|
||||
|
||||
return -1, nil
|
||||
}

// pickBestDestination selects the best destination node for a shard,
// filtering candidates by target DC/rack, free capacity, and the per-node shard limit.
func (r *Rebalancer) pickBestDestination(analysis *TopologyAnalysis, targetDC, targetRack string, dist *ECDistribution) *TopologyNode {
	var candidates []*TopologyNode

	// Collect candidates
	for _, node := range analysis.AllNodes {
		// Filter by DC if specified
		if targetDC != "" && node.DataCenter != targetDC {
			continue
		}
		// Filter by rack if specified
		if targetRack != "" && node.Rack != targetRack {
			continue
		}
		// Check capacity
		if node.FreeSlots <= 0 {
			continue
		}
		// Check max shards limit
		if analysis.ShardsByNode[node.NodeID] >= dist.MaxShardsPerNode {
			continue
		}

		candidates = append(candidates, node)
	}

	if len(candidates) == 0 {
		return nil
	}

	// Sort by: 1) fewer shards, 2) more free slots
	slices.SortFunc(candidates, func(a, b *TopologyNode) int {
		aShards := analysis.ShardsByNode[a.NodeID]
		bShards := analysis.ShardsByNode[b.NodeID]
		if aShards != bShards {
			return aShards - bShards
		}
		return b.FreeSlots - a.FreeSlots
	})

	return candidates[0]
}

// applyMovesToAnalysis is a no-op placeholder for potential future use.
// Note: All planners (planDCMoves, planRackMoves, planNodeMoves) update
// their respective counts (ShardsByDC, ShardsByRack, ShardsByNode) and
// shard lists (NodeToShards) inline during planning. This avoids duplicate
// updates that would occur if we also updated counts here.
func (r *Rebalancer) applyMovesToAnalysis(analysis *TopologyAnalysis, moves []ShardMove) {
	// Counts are already updated by the individual planners.
	// This function is kept for API compatibility and potential future use.
}