commit 4aa50bfa6a by Chris Lu, 2025-12-19: fix: EC rebalance fails with replica placement 000 (#7812)
* fix: EC rebalance fails with replica placement 000

This PR fixes several issues with EC shard distribution:

1. Pre-flight check before EC encoding
   - Verify target disk type has capacity before encoding starts
   - Prevents encoding shards only to fail during rebalance
   - Shows helpful error when wrong diskType is specified (e.g., ssd when volumes are on hdd)

2. Fix EC rebalance with replica placement 000
   - When DiffRackCount=0, shards should be distributed freely across racks
   - The '000' placement means 'no volume replication needed' because EC provides redundancy
   - Previously all racks were skipped with error 'shards X > replica placement limit (0)'

3. Add unit tests for EC rebalance slot calculation
   - TestECRebalanceWithLimitedSlots: documents the limited slots scenario
   - TestECRebalanceZeroFreeSlots: reproduces the 0 free slots error

4. Add Makefile for manual EC testing
   - make setup: start cluster and populate data
   - make shell: open weed shell for EC commands
   - make clean: stop cluster and cleanup

* fix: default -rebalance to true for ec.encode

The -rebalance flag was defaulting to false, which meant ec.encode would
only print shard moves but not actually execute them. This is a poor
default since the whole point of EC encoding is to distribute shards
across servers for fault tolerance.

Now -rebalance defaults to true, so shards are actually distributed
after encoding. Users can use -rebalance=false if they only want to
see what would happen without making changes.

* test/erasure_coding: improve Makefile safety and docs

- Narrow pkill pattern for volume servers to use TEST_DIR instead of
  port pattern, avoiding accidental kills of unrelated SeaweedFS processes
- Document external dependencies (curl, jq) in header comments

* shell: refactor buildRackWithEcShards to reuse buildEcShards

Extract common shard bit construction logic to avoid duplication
between buildEcShards and buildRackWithEcShards helper functions.

* shell: update test for EC replication 000 behavior

When DiffRackCount=0 (replication "000"), EC shards should be
distributed freely across racks since erasure coding provides its
own redundancy. Update test expectation to reflect this behavior.

* erasure_coding: add distribution package for proportional EC shard placement

Add a new reusable package for EC shard distribution that:
- Supports configurable EC ratios (not hard-coded 10+4)
- Distributes shards proportionally based on replication policy
- Provides fault tolerance analysis
- Prefers moving parity shards to keep data shards spread out

Key components:
- ECConfig: Configurable data/parity shard counts
- ReplicationConfig: Parsed XYZ replication policy
- ECDistribution: Target shard counts per DC/rack/node
- Rebalancer: Plans shard moves with parity-first strategy

This enables seaweed-enterprise custom EC ratios and weed worker
integration while maintaining a clean, testable architecture.

* shell: integrate distribution package for EC rebalancing

Add shell wrappers around the distribution package:
- ProportionalECRebalancer: Plans moves using distribution.Rebalancer
- NewProportionalECRebalancerWithConfig: Supports custom EC configs
- GetDistributionSummary/GetFaultToleranceAnalysis: Helper functions

The shell layer converts between EcNode types and the generic
TopologyNode types used by the distribution package.

* test setup

* ec: improve data and parity shard distribution across racks

- Add shardsByTypePerRack helper to track data vs parity shards
- Rewrite doBalanceEcShardsAcrossRacks for two-pass balancing:
  1. Balance data shards (0-9) evenly, max ceil(10/6)=2 per rack
  2. Balance parity shards (10-13) evenly, max ceil(4/6)=1 per rack
- Add balanceShardTypeAcrossRacks for generic shard type balancing
- Add pickRackForShardType to select destination with room for type
- Add unit tests for even data/parity distribution verification

This ensures even read load during normal operation by spreading
both data and parity shards across all available racks.

* ec: make data/parity shard counts configurable in ecBalancer

- Add dataShardCount and parityShardCount fields to ecBalancer struct
- Add getDataShardCount() and getParityShardCount() methods with defaults
- Replace direct constant usage with configurable methods
- Fix unused variable warning for parityPerRack

This allows seaweed-enterprise to use custom EC ratios while
defaulting to standard 10+4 scheme.

* Address PR 7812 review comments

Makefile improvements:
- Save PIDs for each volume server for precise termination
- Use PID-based killing in stop target with pkill fallback
- Use more specific pkill patterns with TEST_DIR paths

Documentation:
- Document jq dependency in README.md

Rebalancer fix:
- Fix duplicate shard count updates in applyMovesToAnalysis
- All planners (DC/rack/node) update counts inline during planning
- Remove duplicate updates from applyMovesToAnalysis to avoid double-counting

* test/erasure_coding: use mktemp for test file template

Use mktemp instead of hardcoded /tmp/testfile_template.bin path
to provide better isolation for concurrent test runs.

EC Distribution Package

This package provides erasure coding (EC) shard distribution algorithms that are:

  • Configurable: Works with any EC ratio (e.g., 10+4, 8+4, 6+3)
  • Reusable: Used by shell commands, worker tasks, and seaweed-enterprise
  • Topology-aware: Distributes shards across data centers, racks, and nodes proportionally

Usage

Basic Usage with Default 10+4 EC

import (
    "fmt"

    "github.com/seaweedfs/seaweedfs/weed/storage/erasure_coding/distribution"
)

// Parse replication policy
rep, _ := distribution.NewReplicationConfigFromString("110")

// Use default 10+4 EC configuration
ec := distribution.DefaultECConfig()

// Calculate distribution plan
dist := distribution.CalculateDistribution(ec, rep)

fmt.Println(dist.Summary())
// Output:
// EC Configuration: 10+4 (total: 14, can lose: 4)
// Replication: replication=110 (DCs:2, Racks/DC:2, Nodes/Rack:1)
// Distribution Plan:
//   Data Centers: 2 (target 7 shards each, max 9)
//   Racks per DC: 2 (target 4 shards each, max 6)
//   Nodes per Rack: 1 (target 4 shards each, max 6)

Custom EC Ratios (seaweed-enterprise)

// Create custom 8+4 EC configuration
ec, err := distribution.NewECConfig(8, 4)
if err != nil {
    log.Fatal(err)
}

rep, _ := distribution.NewReplicationConfigFromString("200")
dist := distribution.CalculateDistribution(ec, rep)

// Check fault tolerance
fmt.Println(dist.FaultToleranceAnalysis())
// Output:
// Fault Tolerance Analysis for 8+4:
//   DC Failure: SURVIVABLE ✓
//     - Losing one DC loses ~4 shards
//     - Remaining: 8 shards (need 8)

Planning Shard Moves

// Build topology analysis
analysis := distribution.NewTopologyAnalysis()

// Add nodes and their shard locations
for _, node := range nodes {
    analysis.AddNode(&distribution.TopologyNode{
        NodeID:     node.ID,
        DataCenter: node.DC,
        Rack:       node.Rack,
        FreeSlots:  node.FreeSlots,
    })
    for _, shardID := range node.ShardIDs {
        analysis.AddShardLocation(distribution.ShardLocation{
            ShardID:    shardID,
            NodeID:     node.ID,
            DataCenter: node.DC,
            Rack:       node.Rack,
        })
    }
}
analysis.Finalize()

// Create rebalancer and plan moves
rebalancer := distribution.NewRebalancer(ec, rep)
plan, err := rebalancer.PlanRebalance(analysis)
if err != nil {
    log.Fatal(err)
}

for _, move := range plan.Moves {
    fmt.Printf("Move shard %d from %s to %s\n", 
        move.ShardID, move.SourceNode.NodeID, move.DestNode.NodeID)
}

Algorithm

Proportional Distribution

The replication policy XYZ is interpreted as a placement ratio: X+1 data centers, Y+1 racks per DC, and Z+1 nodes per rack, with the shards split proportionally at each level:

Replication   DCs   Racks/DC   Nodes/Rack   14-Shard Distribution
000           1     1          1            All in one place
001           1     1          2            7 per node
010           1     2          1            7 per rack
100           2     1          1            7 per DC
110           2     2          1            7 per DC, 4 per rack
200           3     1          1            5 per DC
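
The targets in the table follow from repeated ceiling division. As a minimal sketch (ceilDiv and proportionalTargets are illustrative helpers, not part of the package API):

// Illustrative only: reproduce the table's arithmetic from the parsed
// X, Y, Z digits of the replication policy.
func ceilDiv(a, b int) int { return (a + b - 1) / b }

func proportionalTargets(totalShards, x, y, z int) (perDC, perRack, perNode int) {
    perDC = ceilDiv(totalShards, x+1) // shards per data center
    perRack = ceilDiv(perDC, y+1)     // shards per rack within one DC
    perNode = ceilDiv(perRack, z+1)   // shards per node within one rack
    return
}

// proportionalTargets(14, 1, 1, 0) // "110" -> 7 per DC, 4 per rack, 4 per node
// proportionalTargets(14, 2, 0, 0) // "200" -> 5 per DC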

Rebalancing Process

  1. DC-level balancing: Move shards to achieve target shards per DC
  2. Rack-level balancing: Within each DC, balance across racks
  3. Node-level balancing: Within each rack, balance across nodes
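
Each pass works the same way: shards move from members above their target count to members below it. The sketch below shows a single rack-level pass under simplified assumptions; balanceRacks, rackBelowTarget, and the move struct are hypothetical names for illustration, and the real planning happens inside Rebalancer.PlanRebalance:

// Hypothetical single-pass sketch: move shards from racks above the target
// count to racks below it until no rack exceeds the target.
type move struct {
    shardID  int
    from, to string
}

// rackBelowTarget returns any rack still holding fewer than target shards, or "".
func rackBelowTarget(shardsPerRack map[string][]int, target int) string {
    for rack, shards := range shardsPerRack {
        if len(shards) < target {
            return rack
        }
    }
    return ""
}

func balanceRacks(shardsPerRack map[string][]int, target int) []move {
    var moves []move
    for srcRack := range shardsPerRack {
        for len(shardsPerRack[srcRack]) > target {
            dstRack := rackBelowTarget(shardsPerRack, target)
            if dstRack == "" {
                return moves // no rack has room below the target
            }
            n := len(shardsPerRack[srcRack])
            shard := shardsPerRack[srcRack][n-1]
            shardsPerRack[srcRack] = shardsPerRack[srcRack][:n-1]
            shardsPerRack[dstRack] = append(shardsPerRack[dstRack], shard)
            moves = append(moves, move{shardID: shard, from: srcRack, to: dstRack})
        }
    }
    return moves
}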

Shard Priority: Data First, Parity Moves First

When rebalancing, the algorithm prioritizes keeping data shards spread out:

  • Data shards (0 to DataShards-1): Serve read requests directly
  • Parity shards (DataShards to TotalShards-1): Only used for reconstruction

Rebalancing Strategy:

  • When moving shards FROM an overloaded node, parity shards are moved first
  • This keeps data shards in place on well-distributed nodes
  • Result: Data shards remain spread out for optimal read performance

Helpers on the EC configuration distinguish shard types and sort them for placement or rebalancing:

// Check shard type
if ec.IsDataShard(shardID) {
    // Shard serves read requests
}
if ec.IsParityShard(shardID) {
    // Shard only used for reconstruction
}

// Sort shards for placement (data first for initial distribution)
sorted := ec.SortShardsDataFirst(shards)

// Sort shards for rebalancing (parity first to move them away)
sorted = ec.SortShardsParityFirst(shards)

Fault Tolerance

The package provides fault tolerance analysis:

  • DC Failure: Can the data survive complete DC loss?
  • Rack Failure: Can the data survive complete rack loss?
  • Node Failure: Can the data survive single node loss?

For example, with 10+4 EC (can lose 4 shards):

  • Need 4+ DCs for DC-level fault tolerance
  • Need 4+ racks for rack-level fault tolerance
  • Usually survivable at node level
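
The package reports this via FaultToleranceAnalysis(); the check behind it can be sketched as follows, where survivesDCLoss is a hypothetical helper (the worst case is losing whichever DC holds the most shards):

// Hypothetical helper: a DC failure is survivable if, after losing the DC
// holding the most shards, at least DataShards shards remain for reconstruction.
func survivesDCLoss(shardsPerDC map[string]int, dataShards int) bool {
    total, largest := 0, 0
    for _, n := range shardsPerDC {
        total += n
        if n > largest {
            largest = n
        }
    }
    return total-largest >= dataShards
}

// 10+4 over 4 DCs (4,4,3,3): 14-4 = 10 >= 10 -> survivable
// 10+4 over 3 DCs (5,5,4):   14-5 =  9 <  10 -> not survivable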

API Reference

Types

  • ECConfig: EC configuration (data shards, parity shards)
  • ReplicationConfig: Parsed replication policy
  • ECDistribution: Calculated distribution plan
  • TopologyAnalysis: Current shard distribution analysis
  • Rebalancer: Plans shard moves
  • RebalancePlan: List of planned moves
  • ShardMove: Single shard move operation

Key Functions

  • NewECConfig(data, parity int): Create EC configuration
  • DefaultECConfig(): Returns 10+4 configuration
  • CalculateDistribution(ec, rep): Calculate distribution plan
  • NewRebalancer(ec, rep): Create rebalancer
  • PlanRebalance(analysis): Generate rebalancing plan

Integration

Shell Commands

The shell package wraps this distribution package for ec.balance:

import "github.com/seaweedfs/seaweedfs/weed/shell"

rebalancer := shell.NewProportionalECRebalancer(nodes, rp, diskType)
moves, _ := rebalancer.PlanMoves(volumeId, locations)

Worker Tasks

Worker tasks can use the distribution package directly:

import "github.com/seaweedfs/seaweedfs/weed/storage/erasure_coding/distribution"

ec := distribution.ECConfig{DataShards: 8, ParityShards: 4}
rep := distribution.NewReplicationConfig(rp)
dist := distribution.CalculateDistribution(ec, rep)

seaweed-enterprise

Enterprise features can provide custom EC configurations:

// Custom EC ratio from license/config
ec, _ := distribution.NewECConfig(customData, customParity)
rebalancer := distribution.NewRebalancer(ec, rep)