Files
seaweedFS/weed/pb/worker.proto
Chris Lu 5f85bf5e8a Batch volume balance: run multiple moves per job (#8561)
* proto: add BalanceMoveSpec and batch fields to BalanceTaskParams

Add BalanceMoveSpec message for encoding individual volume moves,
and max_concurrent_moves + repeated moves fields to BalanceTaskParams
to support batching multiple volume moves in a single job.

* balance handler: add batch execution with concurrent volume moves

Refactor Execute() into executeSingleMove() (backward compatible) and
executeBatchMoves() which runs multiple volume moves concurrently using
a semaphore-bounded goroutine pool. When BalanceTaskParams.Moves is
populated, the batch path is taken; otherwise the single-move path.

Includes aggregate progress reporting across concurrent moves,
per-move error collection, and partial failure support.

* balance handler: add batch config fields to Descriptor and worker config

Add max_concurrent_moves and batch_size fields to the worker config
form and deriveBalanceWorkerConfig(). These control how many volume
moves run concurrently within a batch job and the maximum batch size.

* balance handler: group detection proposals into batch jobs

When batch_size > 1, the Detect method groups detection results into
batch proposals where each proposal encodes multiple BalanceMoveSpec
entries in BalanceTaskParams.Moves. Single-result batches fall back
to the existing single-move proposal format for backward compatibility.

* admin UI: add volume balance execution plan and batch badge

Add renderBalanceExecutionPlan() for rich rendering of volume balance
jobs in the job detail modal. Single-move jobs show source/target/volume
info; batch jobs show a moves table with all volume moves.

Add batch badge (e.g., "5 moves") next to job type in the execution
jobs table when the job has batch=true label.

* Update plugin_templ.go

* fix: detection algorithm uses greedy target instead of divergent topology scores

The detection loop tracked effective volume counts via an adjustments map,
but createBalanceTask independently called planBalanceDestination which used
the topology's LoadCount — a separate, unadjusted source of truth. This
divergence caused multiple moves to pile onto the same server.

Changes:
- Add resolveBalanceDestination to resolve the detection loop's greedy
  target (minServer) rather than independently picking a destination
- Add oscillation guard: stop when max-min <= 1 since no single move
  can improve the balance beyond that point
- Track unseeded destinations: if a target server wasn't in the initial
  serverVolumeCounts, add it so subsequent iterations include it
- Add TestDetection_UnseededDestinationDoesNotOverload

* fix: handler force_move propagation, partial failure, deterministic dedupe

- Propagate ForceMove from outer BalanceTaskParams to individual move
  TaskParams so batch moves respect the force_move flag
- Fix partial failure: mark job successful if at least one move
  succeeded (succeeded > 0 || failed == 0) to avoid re-running
  already-completed moves on retry
- Use SHA-256 hash for deterministic dedupe key fallback instead of
  time.Now().UnixNano() which is non-deterministic
- Remove unused successDetails variable
- Extract maxProposalStringLength constant to replace magic number 200

* admin UI: use template literals in balance execution plan rendering

* fix: integration test handles batch proposals from batched detection

With batch_size=20, all moves are grouped into a single proposal
containing BalanceParams.Moves instead of top-level Sources/Targets.
Update assertions to handle both batch and single-move proposal formats.

* fix: verify volume size on target before deleting source during balance

Add a pre-delete safety check that reads the volume file status on both
source and target, then compares .dat file size and file count. If they
don't match, the move is aborted — leaving the source intact rather than
risking irreversible data loss.

Also removes the redundant mountVolume call since VolumeCopy already
mounts the volume on the target server.

* fix: clamp maxConcurrent, serialize progress sends, validate config as int64

- Clamp maxConcurrentMoves to defaultMaxConcurrentMoves before creating
  the semaphore so a stale or malicious job cannot request unbounded
  concurrent volume moves
- Extend progressMu to cover sender.SendProgress calls since the
  underlying gRPC stream is not safe for concurrent writes
- Perform bounds checks on max_concurrent_moves and batch_size in int64
  space before casting to int, avoiding potential overflow on 32-bit

* fix: check disk capacity in resolveBalanceDestination

Skip disks where VolumeCount >= MaxVolumeCount so the detection loop
does not propose moves to a full disk that would fail at execution time.

* test: rename unseeded destination test to match actual behavior

The test exercises a server with 0 volumes that IS seeded from topology
(matching disk type), not an unseeded destination. Rename to
TestDetection_ZeroVolumeServerIncludedInBalance and fix comments.

* test: tighten integration test to assert exactly one batch proposal

With default batch_size=20, all moves should be grouped into a single
batch proposal. Assert len(proposals)==1 and require BalanceParams with
Moves, removing the legacy single-move else branch.

* fix: propagate ctx to RPCs and restore source writability on abort

- All helper methods (markVolumeReadonly, copyVolume, tailVolume,
  readVolumeFileStatus, deleteVolume) now accept a context parameter
  instead of using context.Background(), so Execute's ctx propagates
  cancellation and timeouts into every volume server RPC
- Add deferred cleanup that restores the source volume to writable if
  any step after markVolumeReadonly fails, preventing the source from
  being left permanently readonly on abort
- Add markVolumeWritable helper using VolumeMarkWritableRequest

* fix: deep-copy protobuf messages in test recording sender

Use proto.Clone in recordingExecutionSender to store immutable snapshots
of JobProgressUpdate and JobCompleted, preventing assertions from
observing mutations if the handler reuses message pointers.

* fix: add VolumeMarkWritable and ReadVolumeFileStatus to fake volume server

The balance task now calls ReadVolumeFileStatus for pre-delete
verification and VolumeMarkWritable to restore writability on abort.
Add both RPCs to the test fake, and drop the mountCalls assertion since
BalanceTask no longer calls VolumeMount directly (VolumeCopy handles it).

* fix: use maxConcurrentMovesLimit (50) for clamp, not defaultMaxConcurrentMoves

defaultMaxConcurrentMoves (5) is the fallback when the field is unset,
not an upper bound. Clamping to it silently overrides valid config
values like 10/20/50. Introduce maxConcurrentMovesLimit (50) matching
the descriptor's MaxValue and clamp to that instead.

* fix: cancel batch moves on progress stream failure

Derive a cancellable batchCtx from the caller's ctx. If
sender.SendProgress returns an error (client disconnect, context
cancelled), capture it, skip further sends, and cancel batchCtx so
in-flight moves abort via their propagated context rather than running
blind to completion.

* fix: bound cleanup timeout and validate batch move fields

- Use a 30-second timeout for the deferred markVolumeWritable cleanup
  instead of context.Background() which can block indefinitely if the
  volume server is unreachable
- Validate required fields (VolumeID, SourceNode, TargetNode) before
  appending moves to a batch proposal, skipping invalid entries
- Fall back to a single-move proposal when filtering leaves only one
  valid move in a batch

* fix: cancel task execution on SendProgress stream failure

All handler progress callbacks previously ignored SendProgress errors,
allowing tasks to continue executing after the client disconnected.
Now each handler creates a derived cancellable context and cancels it
on the first SendProgress error, stopping the in-flight task promptly.

Handlers fixed: erasure_coding, vacuum, volume_balance (single-move),
and admin_script (breaks command loop on send failure).

* fix: validate batch moves before scheduling in executeBatchMoves

Reject empty batches, enforce a hard upper bound (100 moves), and
filter out nil or incomplete move specs (missing source/target/volume)
before allocating progress tracking and launching goroutines.

* test: add batch balance execution integration test

Tests the batch move path with 3 volumes, max concurrency 2, using
fake volume servers. Verifies all moves complete with correct readonly,
copy, tail, and delete RPC counts.

* test: add MarkWritableCount and ReadFileStatusCount accessors

Expose the markWritableCalls and readFileStatusCalls counters on the
fake volume server, following the existing MarkReadonlyCount pattern.

* fix: oscillation guard uses global effective counts for heterogeneous capacity

The oscillation guard (max-min <= 1) previously used maxServer/minServer
which are determined by utilization ratio. With heterogeneous capacity,
maxServer by utilization can have fewer raw volumes than minServer,
producing a negative diff and incorrectly triggering the guard.

Now scans all servers' effective counts to find the true global max/min
volume counts, so the guard works correctly regardless of whether
utilization-based or raw-count balancing is used.

* fix: admin script handler breaks outer loop on SendProgress failure

The break on SendProgress error inside the shell.Commands scan only
exited the inner loop, letting the outer command loop continue
executing commands on a broken stream. Use a sendBroken flag to
propagate the break to the outer execCommands loop.
2026-03-09 19:30:08 -07:00

411 lines
14 KiB
Protocol Buffer

syntax = "proto3";
package worker_pb;
option go_package = "github.com/seaweedfs/seaweedfs/weed/pb/worker_pb";
// WorkerService provides bidirectional communication between admin and worker
service WorkerService {
// WorkerStream maintains a bidirectional stream for worker communication
rpc WorkerStream(stream WorkerMessage) returns (stream AdminMessage);
}
// WorkerMessage represents messages from worker to admin
message WorkerMessage {
string worker_id = 1;
int64 timestamp = 2;
oneof message {
WorkerRegistration registration = 3;
WorkerHeartbeat heartbeat = 4;
TaskRequest task_request = 5;
TaskUpdate task_update = 6;
TaskComplete task_complete = 7;
WorkerShutdown shutdown = 8;
TaskLogResponse task_log_response = 9;
}
}
// AdminMessage represents messages from admin to worker
message AdminMessage {
string admin_id = 1;
int64 timestamp = 2;
oneof message {
RegistrationResponse registration_response = 3;
HeartbeatResponse heartbeat_response = 4;
TaskAssignment task_assignment = 5;
TaskCancellation task_cancellation = 6;
AdminShutdown admin_shutdown = 7;
TaskLogRequest task_log_request = 8;
}
}
// WorkerRegistration message when worker connects
message WorkerRegistration {
string worker_id = 1;
string address = 2;
repeated string capabilities = 3;
int32 max_concurrent = 4;
map<string, string> metadata = 5;
}
// RegistrationResponse confirms worker registration
message RegistrationResponse {
bool success = 1;
string message = 2;
string assigned_worker_id = 3;
}
// WorkerHeartbeat sent periodically by worker
message WorkerHeartbeat {
string worker_id = 1;
string status = 2;
int32 current_load = 3;
int32 max_concurrent = 4;
repeated string current_task_ids = 5;
int32 tasks_completed = 6;
int32 tasks_failed = 7;
int64 uptime_seconds = 8;
}
// HeartbeatResponse acknowledges heartbeat
message HeartbeatResponse {
bool success = 1;
string message = 2;
}
// TaskRequest from worker asking for new tasks
message TaskRequest {
string worker_id = 1;
repeated string capabilities = 2;
int32 available_slots = 3;
}
// TaskAssignment from admin to worker
message TaskAssignment {
string task_id = 1;
string task_type = 2;
TaskParams params = 3;
int32 priority = 4;
int64 created_time = 5;
map<string, string> metadata = 6;
}
// TaskParams contains task-specific parameters with typed variants
message TaskParams {
string task_id = 1; // ActiveTopology task ID for lifecycle management
uint32 volume_id = 2; // Primary volume ID for the task
string collection = 3; // Collection name
string data_center = 4; // Primary data center
string rack = 5; // Primary rack
uint64 volume_size = 6; // Original volume size in bytes for tracking size changes
// Unified source and target arrays for all task types
repeated TaskSource sources = 7; // Source locations (volume replicas, EC shards, etc.)
repeated TaskTarget targets = 8; // Target locations (destinations, new replicas, etc.)
// Typed task parameters
oneof task_params {
VacuumTaskParams vacuum_params = 9;
ErasureCodingTaskParams erasure_coding_params = 10;
BalanceTaskParams balance_params = 11;
ReplicationTaskParams replication_params = 12;
}
}
// VacuumTaskParams for vacuum operations
message VacuumTaskParams {
double garbage_threshold = 1; // Minimum garbage ratio to trigger vacuum
bool force_vacuum = 2; // Force vacuum even if below threshold
int32 batch_size = 3; // Number of files to process per batch
string working_dir = 4; // Working directory for temporary files
bool verify_checksum = 5; // Verify file checksums during vacuum
}
// ErasureCodingTaskParams for EC encoding operations
message ErasureCodingTaskParams {
uint64 estimated_shard_size = 1; // Estimated size per shard
int32 data_shards = 2; // Number of data shards (default: 10)
int32 parity_shards = 3; // Number of parity shards (default: 4)
string working_dir = 4; // Working directory for EC processing
string master_client = 5; // Master server address
bool cleanup_source = 6; // Whether to cleanup source volume after EC
}
// TaskSource represents a unified source location for any task type
message TaskSource {
string node = 1; // Source server address
uint32 disk_id = 2; // Source disk ID
string rack = 3; // Source rack for tracking
string data_center = 4; // Source data center for tracking
uint32 volume_id = 5; // Volume ID (for volume operations)
repeated uint32 shard_ids = 6; // Shard IDs (for EC shard operations)
uint64 estimated_size = 7; // Estimated size to be processed
}
// TaskTarget represents a unified target location for any task type
message TaskTarget {
string node = 1; // Target server address
uint32 disk_id = 2; // Target disk ID
string rack = 3; // Target rack for tracking
string data_center = 4; // Target data center for tracking
uint32 volume_id = 5; // Volume ID (for volume operations)
repeated uint32 shard_ids = 6; // Shard IDs (for EC shard operations)
uint64 estimated_size = 7; // Estimated size to be created
}
// BalanceMoveSpec describes a single volume move within a batch balance job
message BalanceMoveSpec {
uint32 volume_id = 1; // Volume to move
string source_node = 2; // Source server address (host:port)
string target_node = 3; // Destination server address (host:port)
string collection = 4; // Collection name
uint64 volume_size = 5; // Volume size in bytes (informational)
}
// BalanceTaskParams for volume balancing operations
message BalanceTaskParams {
bool force_move = 1; // Force move even with conflicts
int32 timeout_seconds = 2; // Operation timeout
int32 max_concurrent_moves = 3; // Max concurrent moves in a batch job (0 = default 5)
repeated BalanceMoveSpec moves = 4; // Batch: multiple volume moves in one job
}
// ReplicationTaskParams for adding replicas
message ReplicationTaskParams {
int32 replica_count = 1; // Target replica count
bool verify_consistency = 2; // Verify replica consistency after creation
}
// TaskUpdate reports task progress
message TaskUpdate {
string task_id = 1;
string worker_id = 2;
string status = 3;
float progress = 4;
string message = 5;
map<string, string> metadata = 6;
}
// TaskComplete reports task completion
message TaskComplete {
string task_id = 1;
string worker_id = 2;
bool success = 3;
string error_message = 4;
int64 completion_time = 5;
map<string, string> result_metadata = 6;
}
// TaskCancellation from admin to cancel a task
message TaskCancellation {
string task_id = 1;
string reason = 2;
bool force = 3;
}
// WorkerShutdown notifies admin that worker is shutting down
message WorkerShutdown {
string worker_id = 1;
string reason = 2;
repeated string pending_task_ids = 3;
}
// AdminShutdown notifies worker that admin is shutting down
message AdminShutdown {
string reason = 1;
int32 graceful_shutdown_seconds = 2;
}
// ========== Task Log Messages ==========
// TaskLogRequest requests logs for a specific task
message TaskLogRequest {
string task_id = 1;
string worker_id = 2;
bool include_metadata = 3; // Include task metadata
int32 max_entries = 4; // Maximum number of log entries (0 = all)
string log_level = 5; // Filter by log level (INFO, WARNING, ERROR, DEBUG)
int64 start_time = 6; // Unix timestamp for start time filter
int64 end_time = 7; // Unix timestamp for end time filter
}
// TaskLogResponse returns task logs and metadata
message TaskLogResponse {
string task_id = 1;
string worker_id = 2;
bool success = 3;
string error_message = 4;
TaskLogMetadata metadata = 5;
repeated TaskLogEntry log_entries = 6;
}
// TaskLogMetadata contains metadata about task execution
message TaskLogMetadata {
string task_id = 1;
string task_type = 2;
string worker_id = 3;
int64 start_time = 4;
int64 end_time = 5;
int64 duration_ms = 6;
string status = 7;
float progress = 8;
uint32 volume_id = 9;
string server = 10;
string collection = 11;
string log_file_path = 12;
int64 created_at = 13;
map<string, string> custom_data = 14;
}
// TaskLogEntry represents a single log entry
message TaskLogEntry {
int64 timestamp = 1;
string level = 2;
string message = 3;
map<string, string> fields = 4;
float progress = 5;
string status = 6;
}
// ========== Maintenance Configuration Messages ==========
// MaintenanceConfig holds configuration for the maintenance system
message MaintenanceConfig {
bool enabled = 1;
int32 scan_interval_seconds = 2; // How often to scan for maintenance needs
int32 worker_timeout_seconds = 3; // Worker heartbeat timeout
int32 task_timeout_seconds = 4; // Individual task timeout
int32 retry_delay_seconds = 5; // Delay between retries
int32 max_retries = 6; // Default max retries for tasks
int32 cleanup_interval_seconds = 7; // How often to clean up old tasks
int32 task_retention_seconds = 8; // How long to keep completed/failed tasks
MaintenancePolicy policy = 9;
}
// MaintenancePolicy defines policies for maintenance operations
message MaintenancePolicy {
map<string, TaskPolicy> task_policies = 1; // Task type -> policy mapping
int32 global_max_concurrent = 2; // Overall limit across all task types
int32 default_repeat_interval_seconds = 3; // Default seconds if task doesn't specify
int32 default_check_interval_seconds = 4; // Default seconds for periodic checks
}
// TaskPolicy represents configuration for a specific task type
message TaskPolicy {
bool enabled = 1;
int32 max_concurrent = 2;
int32 repeat_interval_seconds = 3; // Seconds to wait before repeating
int32 check_interval_seconds = 4; // Seconds between checks
// Typed task-specific configuration (replaces generic map)
oneof task_config {
VacuumTaskConfig vacuum_config = 5;
ErasureCodingTaskConfig erasure_coding_config = 6;
BalanceTaskConfig balance_config = 7;
ReplicationTaskConfig replication_config = 8;
}
}
// Task-specific configuration messages
// VacuumTaskConfig contains vacuum-specific configuration
message VacuumTaskConfig {
double garbage_threshold = 1; // Minimum garbage ratio to trigger vacuum (0.0-1.0)
int32 min_volume_age_hours = 2; // Minimum age before vacuum is considered
int32 min_interval_seconds = 3; // Minimum time between vacuum operations on the same volume
}
// ErasureCodingTaskConfig contains EC-specific configuration
message ErasureCodingTaskConfig {
double fullness_ratio = 1; // Minimum fullness ratio to trigger EC (0.0-1.0)
int32 quiet_for_seconds = 2; // Minimum quiet time before EC
int32 min_volume_size_mb = 3; // Minimum volume size for EC
string collection_filter = 4; // Only process volumes from specific collections
repeated string preferred_tags = 5; // Disk tags to prioritize for EC shard placement
}
// BalanceTaskConfig contains balance-specific configuration
message BalanceTaskConfig {
double imbalance_threshold = 1; // Threshold for triggering rebalancing (0.0-1.0)
int32 min_server_count = 2; // Minimum number of servers required for balancing
}
// ReplicationTaskConfig contains replication-specific configuration
message ReplicationTaskConfig {
int32 target_replica_count = 1; // Target number of replicas
}
// ========== Task Persistence Messages ==========
// MaintenanceTaskData represents complete task state for persistence
message MaintenanceTaskData {
string id = 1;
string type = 2;
string priority = 3;
string status = 4;
uint32 volume_id = 5;
string server = 6;
string collection = 7;
TaskParams typed_params = 8;
string reason = 9;
int64 created_at = 10;
int64 scheduled_at = 11;
int64 started_at = 12;
int64 completed_at = 13;
string worker_id = 14;
string error = 15;
double progress = 16;
int32 retry_count = 17;
int32 max_retries = 18;
// Enhanced fields for detailed task tracking
string created_by = 19;
string creation_context = 20;
repeated TaskAssignmentRecord assignment_history = 21;
string detailed_reason = 22;
map<string, string> tags = 23;
TaskCreationMetrics creation_metrics = 24;
}
// TaskAssignmentRecord tracks worker assignments for a task
message TaskAssignmentRecord {
string worker_id = 1;
string worker_address = 2;
int64 assigned_at = 3;
int64 unassigned_at = 4; // Optional: when worker was unassigned
string reason = 5; // Reason for assignment/unassignment
}
// TaskCreationMetrics tracks why and how a task was created
message TaskCreationMetrics {
string trigger_metric = 1; // Name of metric that triggered creation
double metric_value = 2; // Value that triggered creation
double threshold = 3; // Threshold that was exceeded
VolumeHealthMetrics volume_metrics = 4; // Volume health at creation time
map<string, string> additional_data = 5; // Additional context data
}
// VolumeHealthMetrics captures volume state at task creation
message VolumeHealthMetrics {
uint64 total_size = 1;
uint64 used_size = 2;
uint64 garbage_size = 3;
double garbage_ratio = 4;
int32 file_count = 5;
int32 deleted_file_count = 6;
int64 last_modified = 7;
int32 replica_count = 8;
bool is_ec_volume = 9;
string collection = 10;
}
// TaskStateFile wraps task data with metadata for persistence
message TaskStateFile {
MaintenanceTaskData task = 1;
int64 last_updated = 2;
string admin_version = 3;
}