* add dynamic timeouts to plugin worker vacuum gRPC calls
All vacuum gRPC calls used context.Background() with no deadline,
so the plugin scheduler's execution timeout could kill a job while
a large volume compact was still in progress. Use volume-size-scaled
timeouts matching the topology vacuum approach: 3 min/GB for compact,
1 min/GB for check, commit, and cleanup.
Fixes #8591
* scale scheduler execution timeout by volume size
The scheduler's per-job execution timeout (default 240s) would kill
vacuum jobs on large volumes before they finish. Three changes:
1. Vacuum detection now includes estimated_runtime_seconds in job
proposals, computed as 5 min/GB of volume size.
2. The scheduler checks for estimated_runtime_seconds in job
parameters and uses it as the execution timeout when larger than
the default — a generic mechanism any handler can use.
3. Vacuum task gRPC calls now use the passed-in ctx as parent
instead of context.Background(), so scheduler cancellation
propagates to in-flight RPCs.
* extend job type runtime when proposals need more time
The JobTypeMaxRuntime (default 30 min) wraps both detection and
execution. Its context is the parent of all per-job execution
contexts, so even with per-job estimated_runtime_seconds, jobCtx
would cancel everything when it expires.
After detection, scan proposals for the maximum
estimated_runtime_seconds. If any proposal needs more time than
the remaining JobTypeMaxRuntime, create a new execution context
with enough headroom. This lets large vacuum jobs complete without
being killed by the job type deadline while still respecting the
configured limit for normal-sized jobs.
* log missing volume size metric, remove dead minimum runtime guard
Add a debug log in vacuumTimeout when t.volumeSize is 0 so
operators can investigate why metrics are missing for a volume.
Remove the unreachable estimatedRuntimeSeconds < 180 check in
buildVacuumProposal — volumeSizeGB always >= 1 (due to +1 floor),
so estimatedRuntimeSeconds is always >= 300.
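The arithmetic behind the dead-guard removal, with names mirroring the commit message (a sketch, not the actual code):

```go
package main

import "fmt"

// estimatedRuntimeSeconds shows why the `< 180` guard was unreachable:
// the +1 floor makes volumeSizeGB at least 1, and 5 min/GB = 300 s/GB,
// so the smallest possible estimate is 300 s.
func estimatedRuntimeSeconds(volumeSizeBytes uint64) uint64 {
	volumeSizeGB := volumeSizeBytes/(1<<30) + 1 // +1 floor: always >= 1
	return volumeSizeGB * 300                   // 5 min/GB
}

func main() {
	fmt.Println(estimatedRuntimeSeconds(0)) // prints 300, the minimum
}
```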
* cap estimated runtime and fix status check context
- Cap maxEstimatedRuntime and per-job timeout overrides to 8 hours
to prevent unbounded timeouts from bad metrics.
- Check execCtx.Err() instead of jobCtx.Err() for status reporting,
since dispatch runs under execCtx which may have a longer deadline.
A successful dispatch under execCtx was misreported as "timeout"
when jobCtx had expired.
* Add support for TLS in gRPC communication between worker and volume server
* address comments
* worker: capture shared grpc.DialOption in BalanceTask registration closure
* worker: capture shared grpc.DialOption in ErasureCodingTask registration closure
* worker: capture shared grpc.DialOption in VacuumTask registration closure
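The three capture commits above follow the same pattern; here is a self-contained sketch of it, with a local `dialOption` type standing in for `grpc.DialOption` so no gRPC dependency is needed (task and function names are illustrative):

```go
package main

// dialOption stands in for grpc.DialOption in this sketch.
type dialOption string

type vacuumTask struct{ opt dialOption }

// registerVacuumTask captures `shared` in the factory closure, so the
// security configuration is resolved once at registration time and
// every task constructed later reuses the same option.
func registerVacuumTask(register func(factory func() *vacuumTask), shared dialOption) {
	register(func() *vacuumTask { return &vacuumTask{opt: shared} })
}

func main() {
	var factory func() *vacuumTask
	registerVacuumTask(func(f func() *vacuumTask) { factory = f }, dialOption("grpc.worker tls"))
	_ = factory()
}
```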
* worker: use grpc.worker security configuration section for tasks
* plugin/worker: fix compilation errors by passing grpc.DialOption to task constructors
* plugin/worker: prevent double-counting in EC skip counters
---------
Co-authored-by: Chris Lu <chris.lu@gmail.com>
* refactor planning into task detection
* refactoring worker tasks
* refactor
* compiles, but only balance task is registered
* compiles, but panics on a nil pointer
* avoid nil logger
* add back ec task
* setting ec log directory
* implement balance and vacuum tasks
* EC tasks will no longer fail with "file not found" errors
* Use ReceiveFile API to send locally generated shards
* distribute shard files and the .ecx, .ecj, and .vif files
* generate .ecx files correctly
* do not mount all possible EC shards (0-13) on every destination
* use constants
* delete all replicas
* rename files
* pass in volume size to tasks