* fix: EC rebalance fails with replica placement 000

  This PR fixes several issues with EC shard distribution:

  1. Pre-flight check before EC encoding
     - Verify target disk type has capacity before encoding starts
     - Prevents encoding shards only to fail during rebalance
     - Shows helpful error when wrong diskType is specified (e.g., ssd when volumes are on hdd)

  2. Fix EC rebalance with replica placement 000
     - When DiffRackCount=0, shards should be distributed freely across racks
     - The '000' placement means 'no volume replication needed' because EC provides redundancy
     - Previously all racks were skipped with error 'shards X > replica placement limit (0)'

  3. Add unit tests for EC rebalance slot calculation
     - TestECRebalanceWithLimitedSlots: documents the limited slots scenario
     - TestECRebalanceZeroFreeSlots: reproduces the 0 free slots error

  4. Add Makefile for manual EC testing
     - make setup: start cluster and populate data
     - make shell: open weed shell for EC commands
     - make clean: stop cluster and cleanup

* fix: default -rebalance to true for ec.encode

  The -rebalance flag was defaulting to false, which meant ec.encode would only print shard moves but not actually execute them. This is a poor default since the whole point of EC encoding is to distribute shards across servers for fault tolerance.

  Now -rebalance defaults to true, so shards are actually distributed after encoding. Users can pass -rebalance=false if they only want to see what would happen without making changes.

* test/erasure_coding: improve Makefile safety and docs

  - Narrow pkill pattern for volume servers to use TEST_DIR instead of a port pattern, avoiding accidental kills of unrelated SeaweedFS processes
  - Document external dependencies (curl, jq) in header comments

* shell: refactor buildRackWithEcShards to reuse buildEcShards

  Extract common shard bit construction logic to avoid duplication between the buildEcShards and buildRackWithEcShards helper functions.

* shell: update test for EC replication 000 behavior

  When DiffRackCount=0 (replication "000"), EC shards should be distributed freely across racks since erasure coding provides its own redundancy. Update test expectation to reflect this behavior.

* erasure_coding: add distribution package for proportional EC shard placement

  Add a new reusable package for EC shard distribution that:
  - Supports configurable EC ratios (not hard-coded 10+4)
  - Distributes shards proportionally based on replication policy
  - Provides fault tolerance analysis
  - Prefers moving parity shards to keep data shards spread out

  Key components:
  - ECConfig: Configurable data/parity shard counts
  - ReplicationConfig: Parsed XYZ replication policy
  - ECDistribution: Target shard counts per DC/rack/node
  - Rebalancer: Plans shard moves with parity-first strategy

  This enables seaweed-enterprise custom EC ratios and weed worker integration while maintaining a clean, testable architecture.

* shell: integrate distribution package for EC rebalancing

  Add shell wrappers around the distribution package:
  - ProportionalECRebalancer: Plans moves using distribution.Rebalancer
  - NewProportionalECRebalancerWithConfig: Supports custom EC configs
  - GetDistributionSummary/GetFaultToleranceAnalysis: Helper functions

  The shell layer converts between EcNode types and the generic TopologyNode types used by the distribution package.

* test setup

* ec: improve data and parity shard distribution across racks

  - Add shardsByTypePerRack helper to track data vs parity shards
  - Rewrite doBalanceEcShardsAcrossRacks for two-pass balancing:
    1. Balance data shards (0-9) evenly, max ceil(10/6)=2 per rack
    2. Balance parity shards (10-13) evenly, max ceil(4/6)=1 per rack
  - Add balanceShardTypeAcrossRacks for generic shard type balancing
  - Add pickRackForShardType to select a destination with room for the type
  - Add unit tests for even data/parity distribution verification

  This ensures even read load during normal operation by spreading both data and parity shards across all available racks.

* ec: make data/parity shard counts configurable in ecBalancer

  - Add dataShardCount and parityShardCount fields to ecBalancer struct
  - Add getDataShardCount() and getParityShardCount() methods with defaults
  - Replace direct constant usage with configurable methods
  - Fix unused variable warning for parityPerRack

  This allows seaweed-enterprise to use custom EC ratios while defaulting to the standard 10+4 scheme.

* Address PR 7812 review comments

  Makefile improvements:
  - Save PIDs for each volume server for precise termination
  - Use PID-based killing in stop target with pkill fallback
  - Use more specific pkill patterns with TEST_DIR paths

  Documentation:
  - Document jq dependency in README.md

  Rebalancer fix:
  - Fix duplicate shard count updates in applyMovesToAnalysis
  - All planners (DC/rack/node) update counts inline during planning
  - Remove duplicate updates from applyMovesToAnalysis to avoid double-counting

* test/erasure_coding: use mktemp for test file template

  Use mktemp instead of the hardcoded /tmp/testfile_template.bin path to provide better isolation for concurrent test runs.
# EC Distribution Package

This package provides erasure coding (EC) shard distribution algorithms that are:

- **Configurable**: Works with any EC ratio (e.g., 10+4, 8+4, 6+3)
- **Reusable**: Used by shell commands, worker tasks, and seaweed-enterprise
- **Topology-aware**: Distributes shards across data centers, racks, and nodes proportionally

## Usage

### Basic Usage with Default 10+4 EC
```go
import (
	"fmt"

	"github.com/seaweedfs/seaweedfs/weed/storage/erasure_coding/distribution"
)

// Parse replication policy
rep, _ := distribution.NewReplicationConfigFromString("110")

// Use default 10+4 EC configuration
ec := distribution.DefaultECConfig()

// Calculate distribution plan
dist := distribution.CalculateDistribution(ec, rep)

fmt.Println(dist.Summary())
// Output:
// EC Configuration: 10+4 (total: 14, can lose: 4)
// Replication: replication=110 (DCs:2, Racks/DC:2, Nodes/Rack:1)
// Distribution Plan:
//   Data Centers: 2 (target 7 shards each, max 9)
//   Racks per DC: 2 (target 4 shards each, max 6)
//   Nodes per Rack: 1 (target 4 shards each, max 6)
```
### Custom EC Ratios (seaweed-enterprise)

```go
// Create custom 8+4 EC configuration
ec, err := distribution.NewECConfig(8, 4)
if err != nil {
	log.Fatal(err)
}

rep, _ := distribution.NewReplicationConfigFromString("200")
dist := distribution.CalculateDistribution(ec, rep)

// Check fault tolerance
fmt.Println(dist.FaultToleranceAnalysis())
// Output:
// Fault Tolerance Analysis for 8+4:
//   DC Failure: SURVIVABLE ✓
//     - Losing one DC loses ~4 shards
//     - Remaining: 8 shards (need 8)
```
### Planning Shard Moves

```go
// Build topology analysis
analysis := distribution.NewTopologyAnalysis()

// Add nodes and their shard locations
for _, node := range nodes {
	analysis.AddNode(&distribution.TopologyNode{
		NodeID:     node.ID,
		DataCenter: node.DC,
		Rack:       node.Rack,
		FreeSlots:  node.FreeSlots,
	})
	for _, shardID := range node.ShardIDs {
		analysis.AddShardLocation(distribution.ShardLocation{
			ShardID:    shardID,
			NodeID:     node.ID,
			DataCenter: node.DC,
			Rack:       node.Rack,
		})
	}
}
analysis.Finalize()

// Create rebalancer and plan moves
rebalancer := distribution.NewRebalancer(ec, rep)
plan, err := rebalancer.PlanRebalance(analysis)
if err != nil {
	log.Fatal(err)
}

for _, move := range plan.Moves {
	fmt.Printf("Move shard %d from %s to %s\n",
		move.ShardID, move.SourceNode.NodeID, move.DestNode.NodeID)
}
```
## Algorithm

### Proportional Distribution

The replication policy `XYZ` is interpreted as a ratio:

| Replication | DCs | Racks/DC | Nodes/Rack | 14 Shards Distribution |
|-------------|-----|----------|------------|------------------------|
| `000`       | 1   | 1        | 1          | All in one place       |
| `001`       | 1   | 1        | 2          | 7 per node             |
| `010`       | 1   | 2        | 1          | 7 per rack             |
| `100`       | 2   | 1        | 1          | 7 per DC               |
| `110`       | 2   | 2        | 1          | 7/DC, 4/rack           |
| `200`       | 3   | 1        | 1          | 5 per DC               |
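This mapping can be reproduced with a small ceiling-division sketch. Here `targets` is a hypothetical helper for illustration (each digit `d` yields `d+1` groups at that level), not part of the package API:

```go
package main

import "fmt"

// targets is an illustrative helper (not the package's API): each digit
// d in the XYZ replication policy yields d+1 groups at that level, and
// shards are split with ceiling division so no group exceeds the target.
func targets(shards int, digit byte) (groups, perGroup int) {
	groups = int(digit-'0') + 1
	perGroup = (shards + groups - 1) / groups // ceil(shards/groups)
	return
}

func main() {
	const total = 14 // default 10+4 EC
	for _, rep := range []string{"000", "110", "200"} {
		dcs, perDC := targets(total, rep[0])
		racks, perRack := targets(perDC, rep[1])
		fmt.Printf("%s: %d DC(s) x %d shards/DC, %d rack(s)/DC x %d shards/rack\n",
			rep, dcs, perDC, racks, perRack)
	}
}
```

For `110` this yields 2 DCs at 7 shards each and 2 racks per DC at 4 shards each, matching the table row.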
### Rebalancing Process

1. **DC-level balancing**: Move shards to achieve target shards per DC
2. **Rack-level balancing**: Within each DC, balance across racks
3. **Node-level balancing**: Within each rack, balance across nodes
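Each level applies the same idea: groups above the target give shards to groups below it. A minimal self-contained sketch of one such pass (`balancePass` is an assumed name for illustration, not the package's code):

```go
package main

import "fmt"

// balancePass is an illustrative single-level pass (not the package's
// code): while some group holds more than target shards and another
// holds fewer, move one shard from the fullest-named over-target group
// to an under-target group. Lexicographic tie-breaks keep it deterministic.
func balancePass(counts map[string]int, target int) []string {
	var moves []string
	for {
		var over, under string
		for g, n := range counts {
			if n > target && (over == "" || g < over) {
				over = g
			}
			if n < target && (under == "" || g < under) {
				under = g
			}
		}
		if over == "" || under == "" {
			return moves // balanced as far as possible
		}
		counts[over]--
		counts[under]++
		moves = append(moves, fmt.Sprintf("%s -> %s", over, under))
	}
}

func main() {
	// Three racks holding 14 shards unevenly; target ceil(14/3) = 5.
	counts := map[string]int{"rackA": 9, "rackB": 4, "rackC": 1}
	for _, m := range balancePass(counts, 5) {
		fmt.Println(m)
	}
	fmt.Println(counts) // ends at rackA:5 rackB:5 rackC:4
}
```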
### Shard Priority: Data First, Parity Moves First

When rebalancing, the algorithm prioritizes keeping data shards spread out:

- **Data shards (0 to DataShards-1)**: Serve read requests directly
- **Parity shards (DataShards to TotalShards-1)**: Only used for reconstruction

**Rebalancing Strategy**:

- When moving shards FROM an overloaded node, **parity shards are moved first**
- This keeps data shards in place on well-distributed nodes
- Result: Data shards remain spread out for optimal read performance
```go
// Check shard type
if ec.IsDataShard(shardID) {
	// Shard serves read requests
}
if ec.IsParityShard(shardID) {
	// Shard only used for reconstruction
}

// Sort shards for placement (data first for initial distribution)
dataFirst := ec.SortShardsDataFirst(shards)

// Sort shards for rebalancing (parity first to move them away)
parityFirst := ec.SortShardsParityFirst(shards)
```
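The parity-first ordering amounts to a stable partition of shard IDs. A self-contained sketch under the default 10+4 layout (the `isDataShard`/`parityFirst` helpers here are assumed stand-ins mirroring the package's `IsDataShard`/`SortShardsParityFirst`, not its actual code):

```go
package main

import (
	"fmt"
	"sort"
)

// Assumed constant mirroring the default 10+4 scheme.
const dataShards = 10

// isDataShard mirrors IsDataShard: IDs below dataShards hold data.
func isDataShard(id int) bool { return id < dataShards }

// parityFirst returns a copy of shards with parity shard IDs ahead of
// data shard IDs, preserving relative order within each class.
func parityFirst(shards []int) []int {
	out := append([]int(nil), shards...)
	sort.SliceStable(out, func(i, j int) bool {
		return !isDataShard(out[i]) && isDataShard(out[j])
	})
	return out
}

func main() {
	fmt.Println(parityFirst([]int{0, 11, 3, 12, 7})) // [11 12 0 3 7]
}
```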
### Fault Tolerance

The package provides fault tolerance analysis:

- **DC Failure**: Can the data survive complete DC loss?
- **Rack Failure**: Can the data survive complete rack loss?
- **Node Failure**: Can the data survive single node loss?

For example, with 10+4 EC (can lose 4 shards):

- Need 4+ DCs for DC-level fault tolerance
- Need 4+ racks for rack-level fault tolerance
- Usually survivable at node level
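The arithmetic behind these thresholds: with an even spread, losing one of `n` groups loses `ceil(total/n)` shards, and the volume survives as long as at least `DataShards` shards remain. A sketch (`survivesGroupLoss` is a hypothetical helper, not the package API):

```go
package main

import "fmt"

// survivesGroupLoss is an illustrative helper (not the package's API):
// assuming shards are spread evenly over n groups, losing one group
// loses ceil(total/n) shards; the volume survives if the remaining
// shard count still covers the data shard count.
func survivesGroupLoss(dataShards, parityShards, groups int) bool {
	total := dataShards + parityShards
	lost := (total + groups - 1) / groups // ceil(total/groups)
	return total-lost >= dataShards
}

func main() {
	// 10+4 needs 4+ DCs: with 3 DCs a failure loses 5 shards, leaving 9 < 10.
	fmt.Println(survivesGroupLoss(10, 4, 3)) // false
	fmt.Println(survivesGroupLoss(10, 4, 4)) // true
	// 8+4 across 3 DCs (replication "200") loses 4, leaving exactly 8.
	fmt.Println(survivesGroupLoss(8, 4, 3)) // true
}
```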
## API Reference

### Types

- `ECConfig`: EC configuration (data shards, parity shards)
- `ReplicationConfig`: Parsed replication policy
- `ECDistribution`: Calculated distribution plan
- `TopologyAnalysis`: Current shard distribution analysis
- `Rebalancer`: Plans shard moves
- `RebalancePlan`: List of planned moves
- `ShardMove`: Single shard move operation

### Key Functions

- `NewECConfig(data, parity int)`: Create EC configuration
- `DefaultECConfig()`: Returns the 10+4 configuration
- `CalculateDistribution(ec, rep)`: Calculate distribution plan
- `NewRebalancer(ec, rep)`: Create rebalancer
- `PlanRebalance(analysis)`: Generate rebalancing plan
## Integration

### Shell Commands

The shell package wraps this distribution package for `ec.balance`:

```go
import "github.com/seaweedfs/seaweedfs/weed/shell"

rebalancer := shell.NewProportionalECRebalancer(nodes, rp, diskType)
moves, _ := rebalancer.PlanMoves(volumeId, locations)
```

### Worker Tasks

Worker tasks can use the distribution package directly:

```go
import "github.com/seaweedfs/seaweedfs/weed/storage/erasure_coding/distribution"

ec := distribution.ECConfig{DataShards: 8, ParityShards: 4}
rep := distribution.NewReplicationConfig(rp)
dist := distribution.CalculateDistribution(ec, rep)
```

### seaweed-enterprise

Enterprise features can provide custom EC configurations:

```go
// Custom EC ratio from license/config
ec, _ := distribution.NewECConfig(customData, customParity)
rebalancer := distribution.NewRebalancer(ec, rep)
```