seaweedFS/weed/storage/erasure_coding/distribution/config.go
Chris Lu 4aa50bfa6a fix: EC rebalance fails with replica placement 000 (#7812)
* fix: EC rebalance fails with replica placement 000

This PR fixes several issues with EC shard distribution:

1. Pre-flight check before EC encoding
   - Verify target disk type has capacity before encoding starts
   - Prevents encoding shards only to fail during rebalance
   - Shows helpful error when wrong diskType is specified (e.g., ssd when volumes are on hdd)

2. Fix EC rebalance with replica placement 000
   - When DiffRackCount=0, shards should be distributed freely across racks
   - The '000' placement means 'no volume replication needed' because EC provides redundancy
   - Previously all racks were skipped with error 'shards X > replica placement limit (0)'

3. Add unit tests for EC rebalance slot calculation
   - TestECRebalanceWithLimitedSlots: documents the limited slots scenario
   - TestECRebalanceZeroFreeSlots: reproduces the 0 free slots error

4. Add Makefile for manual EC testing
   - make setup: start cluster and populate data
   - make shell: open weed shell for EC commands
   - make clean: stop cluster and cleanup
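
The pre-flight check in item 1 can be sketched roughly as follows. This is only an illustration: `preflightCheck` and the `freeSlots` map (free EC shard slots keyed by disk type) are hypothetical names, not taken from the SeaweedFS source.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// preflightCheck is a hypothetical helper. It returns an error before any
// encoding work starts if the requested disk type cannot hold `needed` shards.
// freeSlots maps a disk type (e.g. "hdd", "ssd") to its free shard slots.
func preflightCheck(freeSlots map[string]int, diskType string, needed int) error {
	if freeSlots[diskType] >= needed {
		return nil
	}
	// Collect disk types that do have room, so the error hints at a likely
	// wrong diskType argument (e.g. ssd requested while volumes live on hdd).
	var alternatives []string
	for dt, free := range freeSlots {
		if dt != diskType && free >= needed {
			alternatives = append(alternatives, dt)
		}
	}
	sort.Strings(alternatives)
	return fmt.Errorf("disk type %q has %d free slots, need %d (disk types with capacity: [%s])",
		diskType, freeSlots[diskType], needed, strings.Join(alternatives, ", "))
}

func main() {
	freeSlots := map[string]int{"hdd": 40, "ssd": 0}
	fmt.Println(preflightCheck(freeSlots, "hdd", 14))
	fmt.Println(preflightCheck(freeSlots, "ssd", 14))
}
```

Failing early this way avoids writing shards that the later rebalance step would have no room to place.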

* fix: default -rebalance to true for ec.encode

The -rebalance flag was defaulting to false, which meant ec.encode would
only print shard moves but not actually execute them. This is a poor
default since the whole point of EC encoding is to distribute shards
across servers for fault tolerance.

Now -rebalance defaults to true, so shards are actually distributed
after encoding. Users can use -rebalance=false if they only want to
see what would happen without making changes.
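
A minimal sketch of a boolean flag that defaults to true, using Go's standard `flag` package. The helper name `parseRebalance`, the flag set name, and the help text are assumptions for illustration; the actual ec.encode command wiring may differ.

```go
package main

import (
	"flag"
	"fmt"
)

// parseRebalance parses args with a -rebalance flag that defaults to true.
// It is a hypothetical helper, not the real ec.encode option parsing.
func parseRebalance(args []string) (bool, error) {
	fs := flag.NewFlagSet("ec.encode", flag.ContinueOnError)
	rebalance := fs.Bool("rebalance", true, "apply shard moves after encoding")
	if err := fs.Parse(args); err != nil {
		return false, err
	}
	return *rebalance, nil
}

func main() {
	applied, _ := parseRebalance(nil)                         // flag omitted
	dryRun, _ := parseRebalance([]string{"-rebalance=false"}) // explicit opt-out
	fmt.Println(applied, dryRun)                              // prints "true false"
}
```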

* test/erasure_coding: improve Makefile safety and docs

- Narrow pkill pattern for volume servers to use TEST_DIR instead of
  port pattern, avoiding accidental kills of unrelated SeaweedFS processes
- Document external dependencies (curl, jq) in header comments

* shell: refactor buildRackWithEcShards to reuse buildEcShards

Extract common shard bit construction logic to avoid duplication
between buildEcShards and buildRackWithEcShards helper functions.

* shell: update test for EC replication 000 behavior

When DiffRackCount=0 (replication "000"), EC shards should be
distributed freely across racks since erasure coding provides its
own redundancy. Update test expectation to reflect this behavior.
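
The rule described above can be sketched as a small predicate. The name `skipRack` is hypothetical, and the direction of the comparison is inferred from the quoted error text ('shards X > replica placement limit (0)'), so treat it as an assumption rather than the exact shell code.

```go
package main

import "fmt"

// skipRack reports whether a rack should be skipped as a destination.
// shardsInRack is the number of shards of the volume already in the rack;
// diffRackCount is the Y digit of the XYZ replica placement.
func skipRack(shardsInRack, diffRackCount int) bool {
	// With DiffRackCount == 0 there is no per-rack limit: erasure coding
	// provides the redundancy, so shards may be spread freely across racks.
	if diffRackCount == 0 {
		return false
	}
	return shardsInRack > diffRackCount
}

func main() {
	fmt.Println(skipRack(3, 0)) // false: "000" never blocks a rack
	fmt.Println(skipRack(3, 2)) // true: over the limit
	fmt.Println(skipRack(2, 2)) // false: at the limit
}
```

Without the zero check, every rack with at least one shard would compare as over a limit of 0, which matches the failure the commit describes.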

* erasure_coding: add distribution package for proportional EC shard placement

Add a new reusable package for EC shard distribution that:
- Supports configurable EC ratios (not hard-coded 10+4)
- Distributes shards proportionally based on replication policy
- Provides fault tolerance analysis
- Prefers moving parity shards to keep data shards spread out

Key components:
- ECConfig: Configurable data/parity shard counts
- ReplicationConfig: Parsed XYZ replication policy
- ECDistribution: Target shard counts per DC/rack/node
- Rebalancer: Plans shard moves with parity-first strategy

This enables custom EC ratios for seaweed-enterprise and integration
with weed worker, while maintaining a clean, testable architecture.

* shell: integrate distribution package for EC rebalancing

Add shell wrappers around the distribution package:
- ProportionalECRebalancer: Plans moves using distribution.Rebalancer
- NewProportionalECRebalancerWithConfig: Supports custom EC configs
- GetDistributionSummary/GetFaultToleranceAnalysis: Helper functions

The shell layer converts between EcNode types and the generic
TopologyNode types used by the distribution package.

* test setup

* ec: improve data and parity shard distribution across racks

- Add shardsByTypePerRack helper to track data vs parity shards
- Rewrite doBalanceEcShardsAcrossRacks for two-pass balancing:
  1. Balance data shards (0-9) evenly, max ceil(10/6)=2 per rack
  2. Balance parity shards (10-13) evenly, max ceil(4/6)=1 per rack
- Add balanceShardTypeAcrossRacks for generic shard type balancing
- Add pickRackForShardType to select destination with room for type
- Add unit tests for even data/parity distribution verification

This ensures even read load during normal operation by spreading
both data and parity shards across all available racks.
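
The per-rack caps quoted above are ceiling divisions. A small sketch, assuming the 6-rack topology used in the commit's figures (the helper name `maxShardsPerRack` is illustrative):

```go
package main

import "fmt"

// maxShardsPerRack returns ceil(shardCount / rackCount) using integer math.
// It returns 0 when there are no racks to place shards on.
func maxShardsPerRack(shardCount, rackCount int) int {
	if rackCount <= 0 {
		return 0
	}
	return (shardCount + rackCount - 1) / rackCount
}

func main() {
	const racks = 6
	fmt.Println(maxShardsPerRack(10, racks)) // data shards: ceil(10/6) = 2
	fmt.Println(maxShardsPerRack(4, racks))  // parity shards: ceil(4/6) = 1
}
```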

* ec: make data/parity shard counts configurable in ecBalancer

- Add dataShardCount and parityShardCount fields to ecBalancer struct
- Add getDataShardCount() and getParityShardCount() methods with defaults
- Replace direct constant usage with configurable methods
- Fix unused variable warning for parityPerRack

This allows seaweed-enterprise to use custom EC ratios while
defaulting to the standard 10+4 scheme.
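
One way such getters with defaults could look is sketched below. The `balancer` type is a hypothetical stand-in for ecBalancer, and treating a zero field as "use the default" is an assumption about how the defaults are applied.

```go
package main

import "fmt"

const (
	defaultDataShards   = 10
	defaultParityShards = 4
)

// balancer is a hypothetical stand-in for the ecBalancer struct.
type balancer struct {
	dataShardCount   int // 0 means "use the default"
	parityShardCount int // 0 means "use the default"
}

// getDataShardCount returns the configured data shard count or the default.
func (b *balancer) getDataShardCount() int {
	if b.dataShardCount > 0 {
		return b.dataShardCount
	}
	return defaultDataShards
}

// getParityShardCount returns the configured parity shard count or the default.
func (b *balancer) getParityShardCount() int {
	if b.parityShardCount > 0 {
		return b.parityShardCount
	}
	return defaultParityShards
}

func main() {
	standard := &balancer{}
	custom := &balancer{dataShardCount: 6, parityShardCount: 3}
	fmt.Println(standard.getDataShardCount(), standard.getParityShardCount()) // 10 4
	fmt.Println(custom.getDataShardCount(), custom.getParityShardCount())     // 6 3
}
```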

* Address PR 7812 review comments

Makefile improvements:
- Save PIDs for each volume server for precise termination
- Use PID-based killing in stop target with pkill fallback
- Use more specific pkill patterns with TEST_DIR paths

Documentation:
- Document jq dependency in README.md

Rebalancer fix:
- Fix duplicate shard count updates in applyMovesToAnalysis
- All planners (DC/rack/node) update counts inline during planning
- Remove duplicate updates from applyMovesToAnalysis to avoid double-counting

* test/erasure_coding: use mktemp for test file template

Use mktemp instead of hardcoded /tmp/testfile_template.bin path
to provide better isolation for concurrent test runs.
2025-12-19 13:29:12 -08:00


// Package distribution provides EC shard distribution algorithms with configurable EC ratios.
package distribution

import (
	"fmt"

	"github.com/seaweedfs/seaweedfs/weed/storage/super_block"
)

// ECConfig holds erasure coding configuration parameters.
// This replaces hard-coded constants like DataShardsCount=10, ParityShardsCount=4.
type ECConfig struct {
	DataShards   int // Number of data shards (e.g., 10)
	ParityShards int // Number of parity shards (e.g., 4)
}

// DefaultECConfig returns the standard 10+4 EC configuration
func DefaultECConfig() ECConfig {
	return ECConfig{
		DataShards:   10,
		ParityShards: 4,
	}
}

// NewECConfig creates a new EC configuration with validation
func NewECConfig(dataShards, parityShards int) (ECConfig, error) {
	if dataShards <= 0 {
		return ECConfig{}, fmt.Errorf("dataShards must be positive, got %d", dataShards)
	}
	if parityShards <= 0 {
		return ECConfig{}, fmt.Errorf("parityShards must be positive, got %d", parityShards)
	}
	if dataShards+parityShards > 32 {
		return ECConfig{}, fmt.Errorf("total shards (%d+%d=%d) exceeds maximum of 32",
			dataShards, parityShards, dataShards+parityShards)
	}
	return ECConfig{
		DataShards:   dataShards,
		ParityShards: parityShards,
	}, nil
}

// TotalShards returns the total number of shards (data + parity)
func (c ECConfig) TotalShards() int {
	return c.DataShards + c.ParityShards
}

// MaxTolerableLoss returns the maximum number of shards that can be lost
// while still being able to reconstruct the data
func (c ECConfig) MaxTolerableLoss() int {
	return c.ParityShards
}

// MinShardsForReconstruction returns the minimum number of shards needed
// to reconstruct the original data
func (c ECConfig) MinShardsForReconstruction() int {
	return c.DataShards
}

// String returns a human-readable representation
func (c ECConfig) String() string {
	return fmt.Sprintf("%d+%d (total: %d, can lose: %d)",
		c.DataShards, c.ParityShards, c.TotalShards(), c.MaxTolerableLoss())
}

// IsDataShard returns true if the shard ID is a data shard (0 to DataShards-1)
func (c ECConfig) IsDataShard(shardID int) bool {
	return shardID >= 0 && shardID < c.DataShards
}

// IsParityShard returns true if the shard ID is a parity shard (DataShards to TotalShards-1)
func (c ECConfig) IsParityShard(shardID int) bool {
	return shardID >= c.DataShards && shardID < c.TotalShards()
}

// SortShardsDataFirst returns a new slice with data shards first, followed by
// all other shards. It partitions rather than sorts: data shards keep their
// input order, while the remaining shards are filled in from the end of the
// slice and therefore appear in reverse input order. The input is not modified.
// This is useful for initial placement where data shards should be spread out first.
func (c ECConfig) SortShardsDataFirst(shards []int) []int {
	sorted := make([]int, len(shards))

	// Partition: data shards fill from the front, everything else from the back.
	dataIdx := 0
	parityIdx := len(sorted) - 1
	for _, s := range shards {
		if c.IsDataShard(s) {
			sorted[dataIdx] = s
			dataIdx++
		} else {
			sorted[parityIdx] = s
			parityIdx--
		}
	}

	return sorted
}

// SortShardsParityFirst returns a new slice with parity shards first, followed
// by all other shards. It partitions rather than sorts: parity shards keep
// their input order, while the remaining shards are filled in from the end of
// the slice and therefore appear in reverse input order. The input is not modified.
// This is useful for rebalancing where we prefer to move parity shards.
func (c ECConfig) SortShardsParityFirst(shards []int) []int {
	sorted := make([]int, len(shards))

	// Partition: parity shards fill from the front, everything else from the back.
	parityIdx := 0
	dataIdx := len(sorted) - 1
	for _, s := range shards {
		if c.IsParityShard(s) {
			sorted[parityIdx] = s
			parityIdx++
		} else {
			sorted[dataIdx] = s
			dataIdx--
		}
	}

	return sorted
}

// ReplicationConfig holds the parsed replication policy
type ReplicationConfig struct {
	MinDataCenters  int // X+1 from XYZ replication (minimum DCs to use)
	MinRacksPerDC   int // Y+1 from XYZ replication (minimum racks per DC)
	MinNodesPerRack int // Z+1 from XYZ replication (minimum nodes per rack)

	// Original replication string (for logging/debugging)
	Original string
}

// NewReplicationConfig creates a ReplicationConfig from a ReplicaPlacement
func NewReplicationConfig(rp *super_block.ReplicaPlacement) ReplicationConfig {
	if rp == nil {
		return ReplicationConfig{
			MinDataCenters:  1,
			MinRacksPerDC:   1,
			MinNodesPerRack: 1,
			Original:        "000",
		}
	}
	return ReplicationConfig{
		MinDataCenters:  rp.DiffDataCenterCount + 1,
		MinRacksPerDC:   rp.DiffRackCount + 1,
		MinNodesPerRack: rp.SameRackCount + 1,
		Original:        rp.String(),
	}
}

// NewReplicationConfigFromString creates a ReplicationConfig from a replication string
func NewReplicationConfigFromString(replication string) (ReplicationConfig, error) {
	rp, err := super_block.NewReplicaPlacementFromString(replication)
	if err != nil {
		return ReplicationConfig{}, err
	}
	return NewReplicationConfig(rp), nil
}

// TotalPlacementSlots returns the minimum number of unique placement locations
// based on the replication policy
func (r ReplicationConfig) TotalPlacementSlots() int {
	return r.MinDataCenters * r.MinRacksPerDC * r.MinNodesPerRack
}

// String returns a human-readable representation
func (r ReplicationConfig) String() string {
	return fmt.Sprintf("replication=%s (DCs:%d, Racks/DC:%d, Nodes/Rack:%d)",
		r.Original, r.MinDataCenters, r.MinRacksPerDC, r.MinNodesPerRack)
}