Refactor plugin system and migrate worker runtime (#8369)
* admin: add plugin runtime UI page and route wiring
* pb: add plugin gRPC contract and generated bindings
* admin/plugin: implement worker registry, runtime, monitoring, and config store
* admin/dash: wire plugin runtime and expose plugin workflow APIs
* command: add flags to enable plugin runtime
* admin: rename remaining plugin v2 wording to plugin
* admin/plugin: add detectable job type registry helper
* admin/plugin: add scheduled detection and dispatch orchestration
* admin/plugin: prefetch job type descriptors when workers connect
* admin/plugin: add known job type discovery API and UI
* admin/plugin: refresh design doc to match current implementation
* admin/plugin: enforce per-worker scheduler concurrency limits
* admin/plugin: use descriptor runtime defaults for scheduler policy
* admin/ui: auto-load first known plugin job type on page open
* admin/plugin: bootstrap persisted config from descriptor defaults
* admin/plugin: dedupe scheduled proposals by dedupe key
* admin/ui: add job type and state filters for plugin monitoring
* admin/ui: add per-job-type plugin activity summary
* admin/plugin: split descriptor read API from schema refresh
* admin/ui: keep plugin summary metrics global while tables are filtered
* admin/plugin: retry executor reservation before timing out
* admin/plugin: expose scheduler states for monitoring
* admin/ui: show per-job-type scheduler states in plugin monitor
* pb/plugin: rename protobuf package to plugin
* admin/plugin: rename pluginRuntime wiring to plugin
* admin/plugin: remove runtime naming from plugin APIs and UI
* admin/plugin: rename runtime files to plugin naming
* admin/plugin: persist jobs and activities for monitor recovery
* admin/plugin: lease one detector worker per job type
* admin/ui: show worker load from plugin heartbeats
* admin/plugin: skip stale workers for detector and executor picks
* plugin/worker: add plugin worker command and stream runtime scaffold
* plugin/worker: implement vacuum detect and execute handlers
* admin/plugin: document external vacuum plugin worker starter
* command: update plugin.worker help to reflect implemented flow
* command/admin: drop legacy Plugin V2 label
* plugin/worker: validate vacuum job type and respect min interval
* plugin/worker: test no-op detect when min interval not elapsed
* command/admin: document plugin.worker external process
* plugin/worker: advertise configured concurrency in hello
* command/plugin.worker: add jobType handler selection
* command/plugin.worker: test handler selection by job type
* command/plugin.worker: persist worker id in workingDir
* admin/plugin: document plugin.worker jobType and workingDir flags
* plugin/worker: support cancel request for in-flight work
* plugin/worker: test cancel request acknowledgements
* command/plugin.worker: document workingDir and jobType behavior
* plugin/worker: emit executor activity events for monitor
* plugin/worker: test executor activity builder
* admin/plugin: send last successful run in detection request
* admin/plugin: send cancel request when detect or execute context ends
* admin/plugin: document worker cancel request responsibility
* admin/handlers: expose plugin scheduler states API in no-auth mode
* admin/handlers: test plugin scheduler states route registration
* admin/plugin: keep worker id on worker-generated activity records
* admin/plugin: test worker id propagation in monitor activities
* admin/dash: always initialize plugin service
* command/admin: remove plugin enable flags and default to enabled
* admin/dash: drop pluginEnabled constructor parameter
* admin/plugin UI: stop checking plugin enabled state
* admin/plugin: remove docs for plugin enable flags
* admin/dash: remove unused plugin enabled check method
* admin/dash: fallback to in-memory plugin init when dataDir fails
* admin/plugin API: expose worker gRPC port in status
* command/plugin.worker: resolve admin gRPC port via plugin status
* split plugin UI into overview/configuration/monitoring pages
* Update layout_templ.go
* add volume_balance plugin worker handler
* wire plugin.worker CLI for volume_balance job type
* add erasure_coding plugin worker handler
* wire plugin.worker CLI for erasure_coding job type
* support multi-job handlers in plugin worker runtime
* allow plugin.worker jobType as comma-separated list
* admin/plugin UI: rename to Workers and simplify config view
* plugin worker: queue detection requests instead of capacity reject
* Update plugin_worker.go
* plugin volume_balance: remove force_move/timeout from worker config UI
* plugin erasure_coding: enforce local working dir and cleanup
* admin/plugin UI: rename admin settings to job scheduling
* admin/plugin UI: persist and robustly render detection results
* admin/plugin: record and return detection trace metadata
* admin/plugin UI: show detection process and decision trace
* plugin: surface detector decision trace as activities
* mini: start a plugin worker by default
* admin/plugin UI: split monitoring into detection and execution tabs
* plugin worker: emit detection decision trace for EC and balance
* admin workers UI: split monitoring into detection and execution pages
* plugin scheduler: skip proposals for active assigned/running jobs
* admin workers UI: add job queue tab
* plugin worker: add dummy stress detector and executor job type
* admin workers UI: reorder tabs to detection queue execution
* admin workers UI: regenerate plugin template
* plugin defaults: include dummy stress and add stress tests
* plugin dummy stress: rotate detection selections across runs
* plugin scheduler: remove cross-run proposal dedupe
* plugin queue: track pending scheduled jobs
* plugin scheduler: wait for executor capacity before dispatch
* plugin scheduler: skip detection when waiting backlog is high
* plugin: add disk-backed job detail API and persistence
* admin ui: show plugin job detail modal from job id links
* plugin: generate unique job ids instead of reusing proposal ids
* plugin worker: emit heartbeats on work state changes
* plugin registry: round-robin tied executor and detector picks
* add temporary EC overnight stress runner
* plugin job details: persist and render EC execution plans
* ec volume details: color data and parity shard badges
* shard labels: keep parity ids numeric and color-only distinction
* admin: remove legacy maintenance UI routes and templates
* admin: remove dead maintenance endpoint helpers
* Update layout_templ.go
* remove dummy_stress worker and command support
* refactor plugin UI to job-type top tabs and sub-tabs
* migrate weed worker command to plugin runtime
* remove plugin.worker command and keep worker runtime with metrics
* update helm worker args for jobType and execution flags
* set plugin scheduling defaults to global 16 and per-worker 4
* stress: fix RPC context reuse and remove redundant variables in ec_stress_runner
* admin/plugin: fix lifecycle races, safe channel operations, and terminal state constants
* admin/dash: randomize job IDs and fix priority zero-value overwrite in plugin API
* admin/handlers: implement buffered rendering to prevent response corruption
* admin/plugin: implement debounced persistence flusher and optimize BuildJobDetail memory lookups
* admin/plugin: fix priority overwrite and implement bounded wait in scheduler reserve
* admin/plugin: implement atomic file writes and fix run record side effects
* admin/plugin: use P prefix for parity shard labels in execution plans
* admin/plugin: enable parallel execution for cancellation tests
* admin: refactor time.Time fields to pointers for better JSON omitempty support
* admin/plugin: implement pointer-safe time assignments and comparisons in plugin core
* admin/plugin: fix time assignment and sorting logic in plugin monitor after pointer refactor
* admin/plugin: update scheduler activity tracking to use time pointers
* admin/plugin: fix time-based run history trimming after pointer refactor
* admin/dash: fix JobSpec struct literal in plugin API after pointer refactor
* admin/view: add D/P prefixes to EC shard badges for UI consistency
* admin/plugin: use lifecycle-aware context for schema prefetching
* Update ec_volume_details_templ.go
* admin/stress: fix proposal sorting and log volume cleanup errors
* stress: refine ec stress runner with math/rand and collection name
  - Added Collection field to VolumeEcShardsDeleteRequest for correct filename construction.
  - Replaced crypto/rand with seeded math/rand PRNG for bulk payloads.
  - Added documentation for EcMinAge zero-value behavior.
  - Added logging for ignored errors in volume/shard deletion.
* admin: return internal server error for plugin store failures
  Changed error status code from 400 Bad Request to 500 Internal Server Error for failures in GetPluginJobDetail to correctly reflect server-side errors.
* admin: implement safe channel sends and graceful shutdown sync
  - Added sync.WaitGroup to Plugin struct to manage background goroutines.
  - Implemented safeSendCh helper using recover() to prevent panics on closed channels.
  - Ensured Shutdown() waits for all background operations to complete.
* admin: robustify plugin monitor with nil-safe time and record init
  - Standardized nil-safe assignment for *time.Time pointers (CreatedAt, UpdatedAt, CompletedAt).
  - Ensured persistJobDetailSnapshot initializes new records correctly if they don't exist on disk.
  - Fixed debounced persistence to trigger immediate write on job completion.
* admin: improve scheduler shutdown behavior and logic guards
  - Replaced brittle error string matching with explicit r.shutdownCh selection for shutdown detection.
  - Removed redundant nil guard in buildScheduledJobSpec.
  - Standardized WaitGroup usage for schedulerLoop.
* admin: implement deep copy for job parameters and atomic write fixes
  - Implemented deepCopyGenericValue and used it in cloneTrackedJob to prevent shared state.
  - Ensured atomicWriteFile creates parent directories before writing.
* admin: remove unreachable branch in shard classification
  Removed an unreachable 'totalShards <= 0' check in classifyShardID as dataShards and parityShards are already guarded.
* admin: secure UI links and use canonical shard constants
  - Added rel="noopener noreferrer" to external links for security.
  - Replaced magic number 14 with erasure_coding.TotalShardsCount.
  - Used renderEcShardBadge for missing shard list consistency.
* admin: stabilize plugin tests and fix regressions
  - Composed a robust plugin_monitor_test.go to handle asynchronous persistence.
  - Updated all time.Time literals to use timeToPtr helper.
  - Added explicit Shutdown() calls in tests to synchronize with debounced writes.
  - Fixed syntax errors and orphaned struct literals in tests.
* Potential fix for code scanning alert no. 278: Slice memory allocation with excessive size value
  Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
* Potential fix for code scanning alert no. 283: Uncontrolled data used in path expression
  Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
* admin: finalize refinements for error handling, scheduler, and race fixes
  - Standardized HTTP 500 status codes for store failures in plugin_api.go.
  - Tracked scheduled detection goroutines with sync.WaitGroup for safe shutdown.
  - Fixed race condition in safeSendDetectionComplete by extracting channel under lock.
  - Implemented deep copy for JobActivity details.
  - Used defaultDirPerm constant in atomicWriteFile.
* test(ec): migrate admin dockertest to plugin APIs
* admin/plugin_api: fix RunPluginJobTypeAPI to return 500 for server-side detection/filter errors
* admin/plugin_api: fix ExecutePluginJobAPI to return 500 for job execution failures
* admin/plugin_api: limit parseProtoJSONBody request body to 1MB to prevent unbounded memory usage
* admin/plugin: consolidate regex to package-level validJobTypePattern; add char validation to sanitizeJobID
* admin/plugin: fix racy Shutdown channel close with sync.Once
* admin/plugin: track sendLoop and recv goroutines in WorkerStream with r.wg
* admin/plugin: document writeProtoFiles atomicity — .pb is source of truth, .json is human-readable only
* admin/plugin: extract activityLess helper to deduplicate nil-safe OccurredAt sort comparators
* test/ec: check http.NewRequest errors to prevent nil req panics
* test/ec: replace deprecated ioutil/math/rand, fix stale step comment 5.1→3.1
* plugin(ec): raise default detection and scheduling throughput limits
* topology: include empty disks in volume list and EC capacity fallback
* topology: remove hard 10-task cap for detection planning
* Update ec_volume_details_templ.go
* adjust default
* fix tests

---------

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
205
weed/admin/plugin/DESIGN.md
Normal file
@@ -0,0 +1,205 @@
# Admin Worker Plugin System (Design)

This document describes the plugin system for admin-managed workers, implemented in parallel with the current maintenance/worker mechanism.

## Scope

- Add a new plugin protocol and runtime model for multi-language workers.
- Keep all current admin + worker code paths untouched.
- Use gRPC for all admin-worker communication.
- Let workers describe job configuration UI declaratively via protobuf.
- Persist all job type configuration under the admin server data directory.
- Support detector workers and executor workers per job type.
- Add end-to-end workflow observability (activities, active jobs, progress).

## New Contract

- Proto file: `weed/pb/plugin.proto`
- gRPC service: `PluginControlService.WorkerStream`
- Connection model: worker-initiated long-lived bidirectional stream.

Why this model:

- Works for workers in any language with gRPC support.
- Avoids admin dialing constraints in NAT/private networks.
- Allows command/response, progress streaming, and heartbeat over one channel.

## Core Runtime Components (Admin Side)

1. `PluginRegistry`
   - Tracks connected workers and their per-job-type capabilities.
   - Maintains liveness via heartbeat timeout.
2. `SchemaCoordinator`
   - For each job type, asks one capable worker for `JobTypeDescriptor`.
   - Caches descriptor version and refresh timestamp.
3. `ConfigStore`
   - Persists descriptor + saved config values in `dataDir`.
   - Stores both:
     - admin-owned runtime config (detection interval, dispatch concurrency, retry),
     - worker-owned config values (plugin-specific detection/execution knobs).
4. `DetectorScheduler`
   - Per job type, chooses one detector worker (`can_detect=true`).
   - Sends `RunDetectionRequest` with saved configs + cluster context.
   - Accepts `DetectionProposals`, dedupes by `dedupe_key`, inserts jobs.
5. `JobDispatcher`
   - Chooses an executor worker (`can_execute=true`) for each pending job.
   - Sends `ExecuteJobRequest`.
   - Consumes `JobProgressUpdate` and `JobCompleted`.
6. `WorkflowMonitor`
   - Builds live counters and a timeline from events:
     - activities per job type,
     - active jobs,
     - per-job progress/state,
     - worker health/load.
## Worker Responsibilities

1. Register capabilities on connect (`WorkerHello`).
2. Expose a job type descriptor (`ConfigSchemaResponse`) including UI schemas:
   - admin config form,
   - worker config form,
   - defaults.
3. Run detection on demand (`RunDetectionRequest`) and return proposals.
4. Execute assigned jobs (`ExecuteJobRequest`) and stream progress.
5. Heartbeat regularly with slot usage and running work.
6. Handle cancellation requests (`CancelRequest`) for in-flight detection/execution.

## Declarative UI Model

UI is fully derived from the protobuf schema:

- `ConfigForm`
- `ConfigSection`
- `ConfigField`
- `ConfigOption`
- `ValidationRule`
- `ConfigValue` (typed scalar/list/map/object value container)

Result:

- Admin can render forms without hardcoded task structs.
- New job types can ship UI schema from the worker binary alone.
- Worker language is irrelevant as long as it can emit protobuf messages.
## Detection and Dispatch Flow

1. Worker connects and registers capabilities.
2. Admin requests a descriptor per job type.
3. Admin persists the descriptor and editable config values.
4. On the detection interval (an admin-owned setting):
   - Admin chooses one detector worker for that job type.
   - Sends `RunDetectionRequest` with:
     - `AdminRuntimeConfig`,
     - `admin_config_values`,
     - `worker_config_values`,
     - `ClusterContext` (master/filer/volume gRPC locations, metadata).
5. Detector emits `DetectionProposals` and `DetectionComplete`.
6. Admin dedupes and enqueues jobs.
7. Dispatcher assigns jobs to any eligible executor worker.
8. Executor emits `JobProgressUpdate` and `JobCompleted`.
9. Monitor updates the workflow UI in near-real-time.
## Persistence Layout (Admin Data Dir)

Current layout under `<admin-data-dir>/plugin/`:

- `job_types/<job_type>/descriptor.pb`
- `job_types/<job_type>/descriptor.json`
- `job_types/<job_type>/config.pb`
- `job_types/<job_type>/config.json`
- `job_types/<job_type>/runs.json`
- `jobs/tracked_jobs.json`
- `activities/activities.json`

`config.pb` should use `PersistedJobTypeConfig` from `plugin.proto`.
## Admin UI

- Route: `/plugin`
- Includes:
  - runtime status,
  - workers/capabilities,
  - declarative descriptor-driven config forms,
  - run history (last 10 successes + last 10 errors),
  - tracked jobs and activity stream,
  - manual actions for schema refresh, detection, and the detect+execute workflow.

## Scheduling Policy (Initial)

Detector selection per job type:

- only workers with `can_detect=true`,
- prefer the healthy worker with the most free detection slots,
- the lease ends on heartbeat timeout or stream drop.

Execution dispatch:

- only workers with `can_execute=true`,
- select by available execution slots and least active jobs,
- retry on failure using the admin runtime retry config.
## Safety and Reliability

- Idempotency: dedupe proposals by (`job_type`, `dedupe_key`).
- Backpressure: enforce max jobs per detection run.
- Timeouts: detection and execution timeouts come from admin runtime config.
- Replay-safe persistence: write job state changes before emitting UI events.
- Heartbeat-based failover for detector/executor reassignment.

## Backward Compatibility

- Legacy `worker.proto` runtime remains internally available where still referenced.
- The external CLI worker path is moved to plugin runtime behavior.
- The runtime is enabled by default on the admin worker gRPC server.

## Incremental Rollout Plan

Phase 1

- Introduce protocol and storage models only.

Phase 2

- Build admin registry/scheduler/dispatcher behind a feature flag.

Phase 3

- Add dedicated plugin UI pages and metrics.

Phase 4

- Port one existing job type (e.g. vacuum) as an external worker plugin.

Phase 4 status (starter)

- Added the `weed worker` command as an external `plugin.proto` worker process.
- The initial handler implements the `vacuum` job type with:
  - declarative descriptor/config form response (`ConfigSchemaResponse`),
  - detection via master topology scan (`RunDetectionRequest`),
  - execution via existing vacuum task logic (`ExecuteJobRequest`),
  - heartbeat/load reporting for the monitor UI.
- The legacy maintenance-worker-specific CLI path is removed.

Run example:

- Start admin: `weed admin -master=localhost:9333`
- Start worker: `weed worker -admin=localhost:23646`
- Optional explicit job type: `weed worker -admin=localhost:23646 -jobType=vacuum`
- Optional stable worker ID persistence: `weed worker -admin=localhost:23646 -workingDir=/var/lib/seaweedfs-plugin`

Phase 5

- Migrate remaining job types and deprecate the old mechanism.

## Agreed Defaults

1. Detector multiplicity
   - Exactly one detector worker per job type at a time. Admin selects one worker and runs detection there.
2. Secret handling
   - No encryption at rest required for plugin config in this phase.
3. Schema compatibility
   - No migration policy required yet; this is a new system.
4. Execution ownership
   - The same worker is allowed to do both detection and execution.
5. Retention
   - Keep the last 10 successful runs and the last 10 error runs per job type.
739
weed/admin/plugin/config_store.go
Normal file
@@ -0,0 +1,739 @@
package plugin

import (
	"encoding/json"
	"fmt"
	"net/url"
	"os"
	"path/filepath"
	"regexp"
	"sort"
	"strings"
	"sync"
	"time"

	"github.com/seaweedfs/seaweedfs/weed/pb/plugin_pb"
	"google.golang.org/protobuf/encoding/protojson"
	"google.golang.org/protobuf/proto"
)

const (
	pluginDirName           = "plugin"
	jobTypesDirName         = "job_types"
	jobsDirName             = "jobs"
	jobDetailsDirName       = "job_details"
	activitiesDirName       = "activities"
	descriptorPBFileName    = "descriptor.pb"
	descriptorJSONFileName  = "descriptor.json"
	configPBFileName        = "config.pb"
	configJSONFileName      = "config.json"
	runsJSONFileName        = "runs.json"
	trackedJobsJSONFileName = "tracked_jobs.json"
	activitiesJSONFileName  = "activities.json"

	defaultDirPerm  = 0o755
	defaultFilePerm = 0o644
)

// validJobTypePattern is the canonical pattern for safe job type names.
// Only letters, digits, underscore, dash, and dot are allowed, which prevents
// path traversal because '/', '\\', and whitespace are rejected.
var validJobTypePattern = regexp.MustCompile(`^[A-Za-z0-9_.-]+$`)

// ConfigStore persists plugin configuration and bounded run history.
// If admin data dir is empty, it transparently falls back to in-memory mode.
type ConfigStore struct {
	configured bool
	baseDir    string

	mu sync.RWMutex

	memDescriptors map[string]*plugin_pb.JobTypeDescriptor
	memConfigs     map[string]*plugin_pb.PersistedJobTypeConfig
	memRunHistory  map[string]*JobTypeRunHistory
	memTrackedJobs []TrackedJob
	memActivities  []JobActivity
	memJobDetails  map[string]TrackedJob
}

func NewConfigStore(adminDataDir string) (*ConfigStore, error) {
	store := &ConfigStore{
		configured:     adminDataDir != "",
		memDescriptors: make(map[string]*plugin_pb.JobTypeDescriptor),
		memConfigs:     make(map[string]*plugin_pb.PersistedJobTypeConfig),
		memRunHistory:  make(map[string]*JobTypeRunHistory),
		memJobDetails:  make(map[string]TrackedJob),
	}

	if adminDataDir == "" {
		return store, nil
	}

	store.baseDir = filepath.Join(adminDataDir, pluginDirName)
	if err := os.MkdirAll(filepath.Join(store.baseDir, jobTypesDirName), defaultDirPerm); err != nil {
		return nil, fmt.Errorf("create plugin job_types dir: %w", err)
	}
	if err := os.MkdirAll(filepath.Join(store.baseDir, jobsDirName), defaultDirPerm); err != nil {
		return nil, fmt.Errorf("create plugin jobs dir: %w", err)
	}
	if err := os.MkdirAll(filepath.Join(store.baseDir, jobsDirName, jobDetailsDirName), defaultDirPerm); err != nil {
		return nil, fmt.Errorf("create plugin job_details dir: %w", err)
	}
	if err := os.MkdirAll(filepath.Join(store.baseDir, activitiesDirName), defaultDirPerm); err != nil {
		return nil, fmt.Errorf("create plugin activities dir: %w", err)
	}

	return store, nil
}

func (s *ConfigStore) IsConfigured() bool {
	return s.configured
}

func (s *ConfigStore) BaseDir() string {
	return s.baseDir
}

func (s *ConfigStore) SaveDescriptor(jobType string, descriptor *plugin_pb.JobTypeDescriptor) error {
	if descriptor == nil {
		return fmt.Errorf("descriptor is nil")
	}
	if _, err := sanitizeJobType(jobType); err != nil {
		return err
	}

	clone := proto.Clone(descriptor).(*plugin_pb.JobTypeDescriptor)
	if clone.JobType == "" {
		clone.JobType = jobType
	}

	s.mu.Lock()
	defer s.mu.Unlock()

	if !s.configured {
		s.memDescriptors[jobType] = clone
		return nil
	}

	jobTypeDir, err := s.ensureJobTypeDir(jobType)
	if err != nil {
		return err
	}

	pbPath := filepath.Join(jobTypeDir, descriptorPBFileName)
	jsonPath := filepath.Join(jobTypeDir, descriptorJSONFileName)

	if err := writeProtoFiles(clone, pbPath, jsonPath); err != nil {
		return fmt.Errorf("save descriptor for %s: %w", jobType, err)
	}

	return nil
}

func (s *ConfigStore) LoadDescriptor(jobType string) (*plugin_pb.JobTypeDescriptor, error) {
	if _, err := sanitizeJobType(jobType); err != nil {
		return nil, err
	}

	s.mu.RLock()
	if !s.configured {
		d := s.memDescriptors[jobType]
		s.mu.RUnlock()
		if d == nil {
			return nil, nil
		}
		return proto.Clone(d).(*plugin_pb.JobTypeDescriptor), nil
	}
	s.mu.RUnlock()

	pbPath := filepath.Join(s.baseDir, jobTypesDirName, jobType, descriptorPBFileName)
	data, err := os.ReadFile(pbPath)
	if err != nil {
		if os.IsNotExist(err) {
			return nil, nil
		}
		return nil, fmt.Errorf("read descriptor for %s: %w", jobType, err)
	}

	var descriptor plugin_pb.JobTypeDescriptor
	if err := proto.Unmarshal(data, &descriptor); err != nil {
		return nil, fmt.Errorf("unmarshal descriptor for %s: %w", jobType, err)
	}
	return &descriptor, nil
}

func (s *ConfigStore) SaveJobTypeConfig(config *plugin_pb.PersistedJobTypeConfig) error {
	if config == nil {
		return fmt.Errorf("job type config is nil")
	}
	if config.JobType == "" {
		return fmt.Errorf("job type config has empty job_type")
	}
	sanitizedJobType, err := sanitizeJobType(config.JobType)
	if err != nil {
		return err
	}
	// Use the sanitized job type going forward to ensure it is safe for filesystem paths.
	config.JobType = sanitizedJobType

	clone := proto.Clone(config).(*plugin_pb.PersistedJobTypeConfig)

	s.mu.Lock()
	defer s.mu.Unlock()

	if !s.configured {
		s.memConfigs[config.JobType] = clone
		return nil
	}

	jobTypeDir, err := s.ensureJobTypeDir(config.JobType)
	if err != nil {
		return err
	}

	pbPath := filepath.Join(jobTypeDir, configPBFileName)
	jsonPath := filepath.Join(jobTypeDir, configJSONFileName)

	if err := writeProtoFiles(clone, pbPath, jsonPath); err != nil {
		return fmt.Errorf("save job type config for %s: %w", config.JobType, err)
	}

	return nil
}

func (s *ConfigStore) LoadJobTypeConfig(jobType string) (*plugin_pb.PersistedJobTypeConfig, error) {
	if _, err := sanitizeJobType(jobType); err != nil {
		return nil, err
	}

	s.mu.RLock()
	if !s.configured {
		cfg := s.memConfigs[jobType]
		s.mu.RUnlock()
		if cfg == nil {
			return nil, nil
		}
		return proto.Clone(cfg).(*plugin_pb.PersistedJobTypeConfig), nil
	}
	s.mu.RUnlock()

	pbPath := filepath.Join(s.baseDir, jobTypesDirName, jobType, configPBFileName)
	data, err := os.ReadFile(pbPath)
	if err != nil {
		if os.IsNotExist(err) {
			return nil, nil
		}
		return nil, fmt.Errorf("read job type config for %s: %w", jobType, err)
	}

	var config plugin_pb.PersistedJobTypeConfig
	if err := proto.Unmarshal(data, &config); err != nil {
		return nil, fmt.Errorf("unmarshal job type config for %s: %w", jobType, err)
	}

	return &config, nil
}

func (s *ConfigStore) AppendRunRecord(jobType string, record *JobRunRecord) error {
	if record == nil {
		return fmt.Errorf("run record is nil")
	}
	if _, err := sanitizeJobType(jobType); err != nil {
		return err
	}

	safeRecord := *record
	if safeRecord.JobType == "" {
		safeRecord.JobType = jobType
	}
	if safeRecord.CompletedAt == nil || safeRecord.CompletedAt.IsZero() {
		safeRecord.CompletedAt = timeToPtr(time.Now().UTC())
	}

	s.mu.Lock()
	defer s.mu.Unlock()

	history, err := s.loadRunHistoryLocked(jobType)
	if err != nil {
		return err
	}

	if safeRecord.Outcome == RunOutcomeSuccess {
		history.SuccessfulRuns = append(history.SuccessfulRuns, safeRecord)
	} else {
		safeRecord.Outcome = RunOutcomeError
		history.ErrorRuns = append(history.ErrorRuns, safeRecord)
	}

	history.SuccessfulRuns = trimRuns(history.SuccessfulRuns, MaxSuccessfulRunHistory)
	history.ErrorRuns = trimRuns(history.ErrorRuns, MaxErrorRunHistory)
	history.LastUpdatedTime = timeToPtr(time.Now().UTC())

	return s.saveRunHistoryLocked(jobType, history)
}

func (s *ConfigStore) LoadRunHistory(jobType string) (*JobTypeRunHistory, error) {
	if _, err := sanitizeJobType(jobType); err != nil {
		return nil, err
	}

	s.mu.Lock()
	defer s.mu.Unlock()

	history, err := s.loadRunHistoryLocked(jobType)
	if err != nil {
		return nil, err
	}
	return cloneRunHistory(history), nil
}

func (s *ConfigStore) SaveTrackedJobs(jobs []TrackedJob) error {
	s.mu.Lock()
	defer s.mu.Unlock()

	clone := cloneTrackedJobs(jobs)

	if !s.configured {
		s.memTrackedJobs = clone
		return nil
	}

	encoded, err := json.MarshalIndent(clone, "", " ")
	if err != nil {
		return fmt.Errorf("encode tracked jobs: %w", err)
	}

	path := filepath.Join(s.baseDir, jobsDirName, trackedJobsJSONFileName)
	if err := atomicWriteFile(path, encoded, defaultFilePerm); err != nil {
		return fmt.Errorf("write tracked jobs: %w", err)
	}
	return nil
}

func (s *ConfigStore) LoadTrackedJobs() ([]TrackedJob, error) {
	s.mu.RLock()
	if !s.configured {
		out := cloneTrackedJobs(s.memTrackedJobs)
		s.mu.RUnlock()
		return out, nil
	}
	s.mu.RUnlock()

	path := filepath.Join(s.baseDir, jobsDirName, trackedJobsJSONFileName)
	data, err := os.ReadFile(path)
	if err != nil {
		if os.IsNotExist(err) {
			return nil, nil
		}
		return nil, fmt.Errorf("read tracked jobs: %w", err)
	}

	var jobs []TrackedJob
	if err := json.Unmarshal(data, &jobs); err != nil {
		return nil, fmt.Errorf("parse tracked jobs: %w", err)
	}
	return cloneTrackedJobs(jobs), nil
}

func (s *ConfigStore) SaveJobDetail(job TrackedJob) error {
	jobID, err := sanitizeJobID(job.JobID)
	if err != nil {
		return err
	}

	s.mu.Lock()
	defer s.mu.Unlock()

	clone := cloneTrackedJob(job)
	clone.JobID = jobID

	if !s.configured {
		s.memJobDetails[jobID] = clone
		return nil
	}

	encoded, err := json.MarshalIndent(clone, "", " ")
	if err != nil {
		return fmt.Errorf("encode job detail: %w", err)
	}

	path := filepath.Join(s.baseDir, jobsDirName, jobDetailsDirName, jobDetailFileName(jobID))
	if err := atomicWriteFile(path, encoded, defaultFilePerm); err != nil {
		return fmt.Errorf("write job detail: %w", err)
	}

	return nil
}

func (s *ConfigStore) LoadJobDetail(jobID string) (*TrackedJob, error) {
	jobID, err := sanitizeJobID(jobID)
	if err != nil {
		return nil, err
	}

	s.mu.RLock()
	if !s.configured {
		job, ok := s.memJobDetails[jobID]
		s.mu.RUnlock()
		if !ok {
			return nil, nil
		}
		clone := cloneTrackedJob(job)
		return &clone, nil
	}
	s.mu.RUnlock()

	path := filepath.Join(s.baseDir, jobsDirName, jobDetailsDirName, jobDetailFileName(jobID))
	data, err := os.ReadFile(path)
	if err != nil {
		if os.IsNotExist(err) {
			return nil, nil
		}
		return nil, fmt.Errorf("read job detail: %w", err)
|
||||
}
|
||||
|
||||
var job TrackedJob
|
||||
if err := json.Unmarshal(data, &job); err != nil {
|
||||
return nil, fmt.Errorf("parse job detail: %w", err)
|
||||
}
|
||||
clone := cloneTrackedJob(job)
|
||||
return &clone, nil
|
||||
}
|
||||
|
||||
func (s *ConfigStore) SaveActivities(activities []JobActivity) error {
|
||||
s.mu.Lock()
|
||||
defer s.mu.Unlock()
|
||||
|
||||
clone := cloneActivities(activities)
|
||||
|
||||
if !s.configured {
|
||||
s.memActivities = clone
|
||||
return nil
|
||||
}
|
||||
|
||||
encoded, err := json.MarshalIndent(clone, "", " ")
|
||||
if err != nil {
|
||||
return fmt.Errorf("encode activities: %w", err)
|
||||
}
|
||||
|
||||
path := filepath.Join(s.baseDir, activitiesDirName, activitiesJSONFileName)
|
||||
if err := atomicWriteFile(path, encoded, defaultFilePerm); err != nil {
|
||||
return fmt.Errorf("write activities: %w", err)
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
func (s *ConfigStore) LoadActivities() ([]JobActivity, error) {
|
||||
s.mu.RLock()
|
||||
if !s.configured {
|
||||
out := cloneActivities(s.memActivities)
|
||||
s.mu.RUnlock()
|
||||
return out, nil
|
||||
}
|
||||
s.mu.RUnlock()
|
||||
|
||||
path := filepath.Join(s.baseDir, activitiesDirName, activitiesJSONFileName)
|
||||
data, err := os.ReadFile(path)
|
||||
if err != nil {
|
||||
if os.IsNotExist(err) {
|
||||
return nil, nil
|
||||
}
|
||||
return nil, fmt.Errorf("read activities: %w", err)
|
||||
}
|
||||
|
||||
var activities []JobActivity
|
||||
if err := json.Unmarshal(data, &activities); err != nil {
|
||||
return nil, fmt.Errorf("parse activities: %w", err)
|
||||
}
|
||||
return cloneActivities(activities), nil
|
||||
}
|
||||
|
||||
func (s *ConfigStore) ListJobTypes() ([]string, error) {
|
||||
s.mu.RLock()
|
||||
defer s.mu.RUnlock()
|
||||
|
||||
jobTypeSet := make(map[string]struct{})
|
||||
|
||||
if !s.configured {
|
||||
for jobType := range s.memDescriptors {
|
||||
jobTypeSet[jobType] = struct{}{}
|
||||
}
|
||||
for jobType := range s.memConfigs {
|
||||
jobTypeSet[jobType] = struct{}{}
|
||||
}
|
||||
for jobType := range s.memRunHistory {
|
||||
jobTypeSet[jobType] = struct{}{}
|
||||
}
|
||||
} else {
|
||||
jobTypesPath := filepath.Join(s.baseDir, jobTypesDirName)
|
||||
entries, err := os.ReadDir(jobTypesPath)
|
||||
if err != nil {
|
||||
if os.IsNotExist(err) {
|
||||
return []string{}, nil
|
||||
}
|
||||
return nil, fmt.Errorf("list job types: %w", err)
|
||||
}
|
||||
for _, entry := range entries {
|
||||
if !entry.IsDir() {
|
||||
continue
|
||||
}
|
||||
jobType := strings.TrimSpace(entry.Name())
|
||||
if _, err := sanitizeJobType(jobType); err != nil {
|
||||
continue
|
||||
}
|
||||
jobTypeSet[jobType] = struct{}{}
|
||||
}
|
||||
}
|
||||
|
||||
jobTypes := make([]string, 0, len(jobTypeSet))
|
||||
for jobType := range jobTypeSet {
|
||||
jobTypes = append(jobTypes, jobType)
|
||||
}
|
||||
sort.Strings(jobTypes)
|
||||
return jobTypes, nil
|
||||
}
|
||||
|
||||
func (s *ConfigStore) loadRunHistoryLocked(jobType string) (*JobTypeRunHistory, error) {
|
||||
if !s.configured {
|
||||
history, ok := s.memRunHistory[jobType]
|
||||
if !ok {
|
||||
history = &JobTypeRunHistory{JobType: jobType}
|
||||
s.memRunHistory[jobType] = history
|
||||
}
|
||||
return cloneRunHistory(history), nil
|
||||
}
|
||||
|
||||
runsPath := filepath.Join(s.baseDir, jobTypesDirName, jobType, runsJSONFileName)
|
||||
data, err := os.ReadFile(runsPath)
|
||||
if err != nil {
|
||||
if os.IsNotExist(err) {
|
||||
return &JobTypeRunHistory{JobType: jobType}, nil
|
||||
}
|
||||
return nil, fmt.Errorf("read run history for %s: %w", jobType, err)
|
||||
}
|
||||
|
||||
var history JobTypeRunHistory
|
||||
if err := json.Unmarshal(data, &history); err != nil {
|
||||
return nil, fmt.Errorf("parse run history for %s: %w", jobType, err)
|
||||
}
|
||||
if history.JobType == "" {
|
||||
history.JobType = jobType
|
||||
}
|
||||
return &history, nil
|
||||
}
|
||||
|
||||
func (s *ConfigStore) saveRunHistoryLocked(jobType string, history *JobTypeRunHistory) error {
|
||||
if !s.configured {
|
||||
s.memRunHistory[jobType] = cloneRunHistory(history)
|
||||
return nil
|
||||
}
|
||||
|
||||
jobTypeDir, err := s.ensureJobTypeDir(jobType)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
encoded, err := json.MarshalIndent(history, "", " ")
|
||||
if err != nil {
|
||||
return fmt.Errorf("encode run history for %s: %w", jobType, err)
|
||||
}
|
||||
|
||||
runsPath := filepath.Join(jobTypeDir, runsJSONFileName)
|
||||
if err := atomicWriteFile(runsPath, encoded, defaultFilePerm); err != nil {
|
||||
return fmt.Errorf("write run history for %s: %w", jobType, err)
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
func (s *ConfigStore) ensureJobTypeDir(jobType string) (string, error) {
|
||||
if !s.configured {
|
||||
return "", nil
|
||||
}
|
||||
jobTypeDir := filepath.Join(s.baseDir, jobTypesDirName, jobType)
|
||||
if err := os.MkdirAll(jobTypeDir, defaultDirPerm); err != nil {
|
||||
return "", fmt.Errorf("create job type dir for %s: %w", jobType, err)
|
||||
}
|
||||
return jobTypeDir, nil
|
||||
}
|
||||
|
||||
func sanitizeJobType(jobType string) (string, error) {
|
||||
jobType = strings.TrimSpace(jobType)
|
||||
if jobType == "" {
|
||||
return "", fmt.Errorf("job type is empty")
|
||||
}
|
||||
// Enforce a strict, path-safe pattern for job types: only letters, digits, underscore, dash and dot.
|
||||
// This prevents path traversal because '/', '\\' and whitespace are rejected.
|
||||
if !validJobTypePattern.MatchString(jobType) {
|
||||
return "", fmt.Errorf("invalid job type %q: must match %s", jobType, validJobTypePattern.String())
|
||||
}
|
||||
return jobType, nil
|
||||
}

// validJobIDPattern allows letters, digits, dash, underscore, and dot.
// url.PathEscape in jobDetailFileName provides a second layer of defense.
var validJobIDPattern = regexp.MustCompile(`^[A-Za-z0-9_.-]+$`)

func sanitizeJobID(jobID string) (string, error) {
	jobID = strings.TrimSpace(jobID)
	if jobID == "" {
		return "", fmt.Errorf("job id is empty")
	}
	if !validJobIDPattern.MatchString(jobID) {
		return "", fmt.Errorf("invalid job id %q: must match %s", jobID, validJobIDPattern.String())
	}
	return jobID, nil
}

func jobDetailFileName(jobID string) string {
	return url.PathEscape(jobID) + ".json"
}

func trimRuns(runs []JobRunRecord, maxKeep int) []JobRunRecord {
	if len(runs) == 0 {
		return runs
	}
	sort.Slice(runs, func(i, j int) bool {
		ti := time.Time{}
		if runs[i].CompletedAt != nil {
			ti = *runs[i].CompletedAt
		}
		tj := time.Time{}
		if runs[j].CompletedAt != nil {
			tj = *runs[j].CompletedAt
		}
		return ti.After(tj)
	})
	if len(runs) > maxKeep {
		runs = runs[:maxKeep]
	}
	return runs
}

func cloneRunHistory(in *JobTypeRunHistory) *JobTypeRunHistory {
	if in == nil {
		return nil
	}
	out := *in
	if in.SuccessfulRuns != nil {
		out.SuccessfulRuns = append([]JobRunRecord(nil), in.SuccessfulRuns...)
	}
	if in.ErrorRuns != nil {
		out.ErrorRuns = append([]JobRunRecord(nil), in.ErrorRuns...)
	}
	return &out
}

func cloneTrackedJobs(in []TrackedJob) []TrackedJob {
	if len(in) == 0 {
		return nil
	}

	out := make([]TrackedJob, len(in))
	for i := range in {
		out[i] = cloneTrackedJob(in[i])
	}
	return out
}

func cloneTrackedJob(in TrackedJob) TrackedJob {
	out := in
	if in.Parameters != nil {
		out.Parameters = make(map[string]interface{}, len(in.Parameters))
		for key, value := range in.Parameters {
			out.Parameters[key] = deepCopyGenericValue(value)
		}
	}
	if in.Labels != nil {
		out.Labels = make(map[string]string, len(in.Labels))
		for key, value := range in.Labels {
			out.Labels[key] = value
		}
	}
	if in.ResultOutputValues != nil {
		out.ResultOutputValues = make(map[string]interface{}, len(in.ResultOutputValues))
		for key, value := range in.ResultOutputValues {
			out.ResultOutputValues[key] = deepCopyGenericValue(value)
		}
	}
	return out
}

func deepCopyGenericValue(val interface{}) interface{} {
	switch v := val.(type) {
	case map[string]interface{}:
		res := make(map[string]interface{}, len(v))
		for k, val := range v {
			res[k] = deepCopyGenericValue(val)
		}
		return res
	case []interface{}:
		res := make([]interface{}, len(v))
		for i, val := range v {
			res[i] = deepCopyGenericValue(val)
		}
		return res
	default:
		return v
	}
}

func cloneActivities(in []JobActivity) []JobActivity {
	if len(in) == 0 {
		return nil
	}

	out := make([]JobActivity, len(in))
	for i := range in {
		out[i] = in[i]
		if in[i].Details != nil {
			out[i].Details = make(map[string]interface{}, len(in[i].Details))
			for key, value := range in[i].Details {
				out[i].Details[key] = deepCopyGenericValue(value)
			}
		}
	}
	return out
}

// writeProtoFiles writes message to both a binary protobuf file (pbPath) and a
// human-readable JSON file (jsonPath) using atomicWriteFile for each.
// The .pb file is the authoritative source of truth: all reads use proto.Unmarshal
// on the .pb file. The .json file is for human inspection only, so a partial
// failure where .pb succeeds but .json fails leaves the store in a consistent state.
func writeProtoFiles(message proto.Message, pbPath string, jsonPath string) error {
	pbData, err := proto.Marshal(message)
	if err != nil {
		return fmt.Errorf("marshal protobuf: %w", err)
	}
	if err := atomicWriteFile(pbPath, pbData, defaultFilePerm); err != nil {
		return fmt.Errorf("write protobuf file: %w", err)
	}

	jsonData, err := protojson.MarshalOptions{
		Multiline:       true,
		Indent:          " ",
		EmitUnpopulated: true,
	}.Marshal(message)
	if err != nil {
		return fmt.Errorf("marshal json: %w", err)
	}
	if err := atomicWriteFile(jsonPath, jsonData, defaultFilePerm); err != nil {
		return fmt.Errorf("write json file: %w", err)
	}

	return nil
}

func atomicWriteFile(filename string, data []byte, perm os.FileMode) error {
	dir := filepath.Dir(filename)
	if err := os.MkdirAll(dir, defaultDirPerm); err != nil {
		return fmt.Errorf("create directory %s: %w", dir, err)
	}
	tmpFile := filename + ".tmp"
	if err := os.WriteFile(tmpFile, data, perm); err != nil {
		return err
	}
	if err := os.Rename(tmpFile, filename); err != nil {
		_ = os.Remove(tmpFile)
		return err
	}
	return nil
}

weed/admin/plugin/config_store_test.go (new file, 257 lines)
@@ -0,0 +1,257 @@
package plugin

import (
	"reflect"
	"testing"
	"time"

	"github.com/seaweedfs/seaweedfs/weed/pb/plugin_pb"
)

func TestConfigStoreDescriptorRoundTrip(t *testing.T) {
	t.Parallel()

	tempDir := t.TempDir()
	store, err := NewConfigStore(tempDir)
	if err != nil {
		t.Fatalf("NewConfigStore: %v", err)
	}

	descriptor := &plugin_pb.JobTypeDescriptor{
		JobType:           "vacuum",
		DisplayName:       "Vacuum",
		Description:       "Vacuum volumes",
		DescriptorVersion: 1,
	}

	if err := store.SaveDescriptor("vacuum", descriptor); err != nil {
		t.Fatalf("SaveDescriptor: %v", err)
	}

	got, err := store.LoadDescriptor("vacuum")
	if err != nil {
		t.Fatalf("LoadDescriptor: %v", err)
	}
	if got == nil {
		t.Fatalf("LoadDescriptor: nil descriptor")
	}
	if got.DisplayName != descriptor.DisplayName {
		t.Fatalf("unexpected display name: got %q want %q", got.DisplayName, descriptor.DisplayName)
	}
}

func TestConfigStoreRunHistoryRetention(t *testing.T) {
	t.Parallel()

	store, err := NewConfigStore(t.TempDir())
	if err != nil {
		t.Fatalf("NewConfigStore: %v", err)
	}

	base := time.Now().UTC().Add(-24 * time.Hour)
	for i := 0; i < 15; i++ {
		err := store.AppendRunRecord("balance", &JobRunRecord{
			RunID:       "s" + time.Duration(i).String(),
			JobID:       "job-success",
			JobType:     "balance",
			WorkerID:    "worker-a",
			Outcome:     RunOutcomeSuccess,
			CompletedAt: timeToPtr(base.Add(time.Duration(i) * time.Minute)),
		})
		if err != nil {
			t.Fatalf("AppendRunRecord success[%d]: %v", i, err)
		}
	}

	for i := 0; i < 12; i++ {
		err := store.AppendRunRecord("balance", &JobRunRecord{
			RunID:       "e" + time.Duration(i).String(),
			JobID:       "job-error",
			JobType:     "balance",
			WorkerID:    "worker-b",
			Outcome:     RunOutcomeError,
			CompletedAt: timeToPtr(base.Add(time.Duration(i) * time.Minute)),
		})
		if err != nil {
			t.Fatalf("AppendRunRecord error[%d]: %v", i, err)
		}
	}

	history, err := store.LoadRunHistory("balance")
	if err != nil {
		t.Fatalf("LoadRunHistory: %v", err)
	}
	if len(history.SuccessfulRuns) != MaxSuccessfulRunHistory {
		t.Fatalf("successful retention mismatch: got %d want %d", len(history.SuccessfulRuns), MaxSuccessfulRunHistory)
	}
	if len(history.ErrorRuns) != MaxErrorRunHistory {
		t.Fatalf("error retention mismatch: got %d want %d", len(history.ErrorRuns), MaxErrorRunHistory)
	}

	for i := 1; i < len(history.SuccessfulRuns); i++ {
		t1 := time.Time{}
		if history.SuccessfulRuns[i-1].CompletedAt != nil {
			t1 = *history.SuccessfulRuns[i-1].CompletedAt
		}
		t2 := time.Time{}
		if history.SuccessfulRuns[i].CompletedAt != nil {
			t2 = *history.SuccessfulRuns[i].CompletedAt
		}
		if t1.Before(t2) {
			t.Fatalf("successful run order not descending at %d", i)
		}
	}
	for i := 1; i < len(history.ErrorRuns); i++ {
		t1 := time.Time{}
		if history.ErrorRuns[i-1].CompletedAt != nil {
			t1 = *history.ErrorRuns[i-1].CompletedAt
		}
		t2 := time.Time{}
		if history.ErrorRuns[i].CompletedAt != nil {
			t2 = *history.ErrorRuns[i].CompletedAt
		}
		if t1.Before(t2) {
			t.Fatalf("error run order not descending at %d", i)
		}
	}
}

func TestConfigStoreListJobTypes(t *testing.T) {
	t.Parallel()

	store, err := NewConfigStore("")
	if err != nil {
		t.Fatalf("NewConfigStore: %v", err)
	}

	if err := store.SaveDescriptor("vacuum", &plugin_pb.JobTypeDescriptor{JobType: "vacuum"}); err != nil {
		t.Fatalf("SaveDescriptor: %v", err)
	}
	if err := store.SaveJobTypeConfig(&plugin_pb.PersistedJobTypeConfig{
		JobType:      "balance",
		AdminRuntime: &plugin_pb.AdminRuntimeConfig{Enabled: true},
	}); err != nil {
		t.Fatalf("SaveJobTypeConfig: %v", err)
	}
	if err := store.AppendRunRecord("ec", &JobRunRecord{Outcome: RunOutcomeSuccess, CompletedAt: timeToPtr(time.Now().UTC())}); err != nil {
		t.Fatalf("AppendRunRecord: %v", err)
	}

	got, err := store.ListJobTypes()
	if err != nil {
		t.Fatalf("ListJobTypes: %v", err)
	}
	want := []string{"balance", "ec", "vacuum"}
	if !reflect.DeepEqual(got, want) {
		t.Fatalf("unexpected job types: got=%v want=%v", got, want)
	}
}

func TestConfigStoreMonitorStateRoundTrip(t *testing.T) {
	t.Parallel()

	store, err := NewConfigStore(t.TempDir())
	if err != nil {
		t.Fatalf("NewConfigStore: %v", err)
	}

	tracked := []TrackedJob{
		{
			JobID:     "job-1",
			JobType:   "vacuum",
			State:     "running",
			Progress:  55,
			WorkerID:  "worker-a",
			CreatedAt: timeToPtr(time.Now().UTC().Add(-2 * time.Minute)),
			UpdatedAt: timeToPtr(time.Now().UTC().Add(-1 * time.Minute)),
		},
	}
	activities := []JobActivity{
		{
			JobID:      "job-1",
			JobType:    "vacuum",
			Source:     "worker_progress",
			Message:    "processing",
			Stage:      "running",
			OccurredAt: timeToPtr(time.Now().UTC()),
			Details: map[string]interface{}{
				"step": "scan",
			},
		},
	}

	if err := store.SaveTrackedJobs(tracked); err != nil {
		t.Fatalf("SaveTrackedJobs: %v", err)
	}
	if err := store.SaveActivities(activities); err != nil {
		t.Fatalf("SaveActivities: %v", err)
	}

	gotTracked, err := store.LoadTrackedJobs()
	if err != nil {
		t.Fatalf("LoadTrackedJobs: %v", err)
	}
	if len(gotTracked) != 1 || gotTracked[0].JobID != tracked[0].JobID {
		t.Fatalf("unexpected tracked jobs: %+v", gotTracked)
	}

	gotActivities, err := store.LoadActivities()
	if err != nil {
		t.Fatalf("LoadActivities: %v", err)
	}
	if len(gotActivities) != 1 || gotActivities[0].Message != activities[0].Message {
		t.Fatalf("unexpected activities: %+v", gotActivities)
	}
	if gotActivities[0].Details["step"] != "scan" {
		t.Fatalf("unexpected activity details: %+v", gotActivities[0].Details)
	}
}

func TestConfigStoreJobDetailRoundTrip(t *testing.T) {
	t.Parallel()

	store, err := NewConfigStore(t.TempDir())
	if err != nil {
		t.Fatalf("NewConfigStore: %v", err)
	}

	input := TrackedJob{
		JobID:     "job-detail-1",
		JobType:   "vacuum",
		Summary:   "detail summary",
		Detail:    "detail payload",
		CreatedAt: timeToPtr(time.Now().UTC().Add(-2 * time.Minute)),
		UpdatedAt: timeToPtr(time.Now().UTC()),
		Parameters: map[string]interface{}{
			"volume_id": map[string]interface{}{"int64_value": "3"},
		},
		Labels: map[string]string{
			"source": "detector",
		},
		ResultOutputValues: map[string]interface{}{
			"moved": map[string]interface{}{"bool_value": true},
		},
	}

	if err := store.SaveJobDetail(input); err != nil {
		t.Fatalf("SaveJobDetail: %v", err)
	}

	got, err := store.LoadJobDetail(input.JobID)
	if err != nil {
		t.Fatalf("LoadJobDetail: %v", err)
	}
	if got == nil {
		t.Fatalf("LoadJobDetail returned nil")
	}
	if got.Detail != input.Detail {
		t.Fatalf("unexpected detail: got=%q want=%q", got.Detail, input.Detail)
	}
	if got.Labels["source"] != "detector" {
		t.Fatalf("unexpected labels: %+v", got.Labels)
	}
	if got.ResultOutputValues == nil {
		t.Fatalf("expected result output values")
	}
}
weed/admin/plugin/job_execution_plan.go (new file, 231 lines)
@@ -0,0 +1,231 @@
package plugin

import (
	"encoding/base64"
	"sort"
	"strconv"
	"strings"

	"github.com/seaweedfs/seaweedfs/weed/pb/worker_pb"
	"github.com/seaweedfs/seaweedfs/weed/storage/erasure_coding"
	"google.golang.org/protobuf/proto"
)

func enrichTrackedJobParameters(jobType string, parameters map[string]interface{}) map[string]interface{} {
	if len(parameters) == 0 {
		return parameters
	}
	if _, exists := parameters["execution_plan"]; exists {
		return parameters
	}

	taskParams, ok := decodeTaskParamsFromPlainParameters(parameters)
	if !ok || taskParams == nil {
		return parameters
	}

	plan := buildExecutionPlan(strings.TrimSpace(jobType), taskParams)
	if plan == nil {
		return parameters
	}

	enriched := make(map[string]interface{}, len(parameters)+1)
	for key, value := range parameters {
		enriched[key] = value
	}
	enriched["execution_plan"] = plan
	return enriched
}

func decodeTaskParamsFromPlainParameters(parameters map[string]interface{}) (*worker_pb.TaskParams, bool) {
	rawField, ok := parameters["task_params_pb"]
	if !ok || rawField == nil {
		return nil, false
	}

	fieldMap, ok := rawField.(map[string]interface{})
	if !ok {
		return nil, false
	}

	bytesValue, _ := fieldMap["bytes_value"].(string)
	bytesValue = strings.TrimSpace(bytesValue)
	if bytesValue == "" {
		return nil, false
	}

	payload, err := base64.StdEncoding.DecodeString(bytesValue)
	if err != nil {
		return nil, false
	}

	params := &worker_pb.TaskParams{}
	if err := proto.Unmarshal(payload, params); err != nil {
		return nil, false
	}

	return params, true
}

func buildExecutionPlan(jobType string, params *worker_pb.TaskParams) map[string]interface{} {
	if params == nil {
		return nil
	}

	normalizedJobType := strings.TrimSpace(jobType)
	if normalizedJobType == "" && params.GetErasureCodingParams() != nil {
		normalizedJobType = "erasure_coding"
	}

	switch normalizedJobType {
	case "erasure_coding":
		return buildErasureCodingExecutionPlan(params)
	default:
		return nil
	}
}

func buildErasureCodingExecutionPlan(params *worker_pb.TaskParams) map[string]interface{} {
	if params == nil {
		return nil
	}

	ecParams := params.GetErasureCodingParams()
	if ecParams == nil {
		return nil
	}

	dataShards := int(ecParams.DataShards)
	if dataShards <= 0 {
		dataShards = int(erasure_coding.DataShardsCount)
	}
	parityShards := int(ecParams.ParityShards)
	if parityShards <= 0 {
		parityShards = int(erasure_coding.ParityShardsCount)
	}
	totalShards := dataShards + parityShards

	sources := make([]map[string]interface{}, 0, len(params.Sources))
	for _, source := range params.Sources {
		if source == nil {
			continue
		}
		sources = append(sources, buildExecutionEndpoint(
			source.Node,
			source.DataCenter,
			source.Rack,
			source.VolumeId,
			source.ShardIds,
			dataShards,
		))
	}

	targets := make([]map[string]interface{}, 0, len(params.Targets))
	shardAssignments := make([]map[string]interface{}, 0, totalShards)
	for targetIndex, target := range params.Targets {
		if target == nil {
			continue
		}

		targets = append(targets, buildExecutionEndpoint(
			target.Node,
			target.DataCenter,
			target.Rack,
			target.VolumeId,
			target.ShardIds,
			dataShards,
		))

		for _, shardID := range normalizeShardIDs(target.ShardIds) {
			kind, label := classifyShardID(shardID, dataShards)
			shardAssignments = append(shardAssignments, map[string]interface{}{
				"shard_id":           shardID,
				"kind":               kind,
				"label":              label,
				"target_index":       targetIndex + 1,
				"target_node":        strings.TrimSpace(target.Node),
				"target_data_center": strings.TrimSpace(target.DataCenter),
				"target_rack":        strings.TrimSpace(target.Rack),
				"target_volume_id":   int(target.VolumeId),
			})
		}
	}
	sort.Slice(shardAssignments, func(i, j int) bool {
		left, _ := shardAssignments[i]["shard_id"].(int)
		right, _ := shardAssignments[j]["shard_id"].(int)
		return left < right
	})

	plan := map[string]interface{}{
		"job_type":      "erasure_coding",
		"task_id":       strings.TrimSpace(params.TaskId),
		"volume_id":     int(params.VolumeId),
		"collection":    strings.TrimSpace(params.Collection),
		"data_shards":   dataShards,
		"parity_shards": parityShards,
		"total_shards":  totalShards,
		"sources":       sources,
		"targets":       targets,
		"source_count":  len(sources),
		"target_count":  len(targets),
	}

	if len(shardAssignments) > 0 {
		plan["shard_assignments"] = shardAssignments
	}

	return plan
}

func buildExecutionEndpoint(
	node string,
	dataCenter string,
	rack string,
	volumeID uint32,
	shardIDs []uint32,
	dataShardCount int,
) map[string]interface{} {
	allShards := normalizeShardIDs(shardIDs)
	dataShards := make([]int, 0, len(allShards))
	parityShards := make([]int, 0, len(allShards))
	for _, shardID := range allShards {
		if shardID < dataShardCount {
			dataShards = append(dataShards, shardID)
		} else {
			parityShards = append(parityShards, shardID)
		}
	}

	return map[string]interface{}{
		"node":             strings.TrimSpace(node),
		"data_center":      strings.TrimSpace(dataCenter),
		"rack":             strings.TrimSpace(rack),
		"volume_id":        int(volumeID),
		"shard_ids":        allShards,
		"data_shard_ids":   dataShards,
		"parity_shard_ids": parityShards,
	}
}

func normalizeShardIDs(shardIDs []uint32) []int {
	if len(shardIDs) == 0 {
		return nil
	}

	out := make([]int, 0, len(shardIDs))
	for _, shardID := range shardIDs {
		out = append(out, int(shardID))
	}
	sort.Ints(out)
	return out
}

func classifyShardID(shardID int, dataShardCount int) (kind string, label string) {
	if dataShardCount <= 0 {
		dataShardCount = int(erasure_coding.DataShardsCount)
	}
	if shardID < dataShardCount {
		return "data", "D" + strconv.Itoa(shardID)
	}
	return "parity", "P" + strconv.Itoa(shardID)
}

weed/admin/plugin/plugin.go (new file, 1243 lines)
File diff suppressed because it is too large
weed/admin/plugin/plugin_cancel_test.go (new file, 112 lines)
@@ -0,0 +1,112 @@
|
||||
package plugin
|
||||
|
||||
import (
|
||||
"context"
|
||||
"errors"
|
||||
"testing"
|
||||
|
||||
"github.com/seaweedfs/seaweedfs/weed/pb/plugin_pb"
|
||||
)
|
||||
|
||||
func TestRunDetectionSendsCancelOnContextDone(t *testing.T) {
|
||||
t.Parallel()
|
||||
pluginSvc, err := New(Options{})
|
||||
if err != nil {
|
||||
t.Fatalf("New plugin error: %v", err)
|
||||
}
|
||||
defer pluginSvc.Shutdown()
|
||||
|
||||
const workerID = "worker-detect"
|
||||
const jobType = "vacuum"
|
||||
pluginSvc.registry.UpsertFromHello(&plugin_pb.WorkerHello{
|
||||
WorkerId: workerID,
|
||||
Capabilities: []*plugin_pb.JobTypeCapability{
|
||||
{JobType: jobType, CanDetect: true, MaxDetectionConcurrency: 1},
|
||||
},
|
||||
})
|
||||
session := &streamSession{workerID: workerID, outgoing: make(chan *plugin_pb.AdminToWorkerMessage, 4)}
|
||||
pluginSvc.putSession(session)
|
||||
|
||||
ctx, cancel := context.WithCancel(context.Background())
|
||||
errCh := make(chan error, 1)
|
||||
go func() {
|
||||
_, runErr := pluginSvc.RunDetection(ctx, jobType, &plugin_pb.ClusterContext{}, 10)
|
||||
		errCh <- runErr
	}()

	first := <-session.outgoing
	if first.GetRunDetectionRequest() == nil {
		t.Fatalf("expected first message to be run_detection_request")
	}

	cancel()

	second := <-session.outgoing
	cancelReq := second.GetCancelRequest()
	if cancelReq == nil {
		t.Fatalf("expected second message to be cancel_request")
	}
	if cancelReq.TargetId != first.RequestId {
		t.Fatalf("unexpected cancel target id: got=%s want=%s", cancelReq.TargetId, first.RequestId)
	}
	if cancelReq.TargetKind != plugin_pb.WorkKind_WORK_KIND_DETECTION {
		t.Fatalf("unexpected cancel target kind: %v", cancelReq.TargetKind)
	}

	runErr := <-errCh
	if !errors.Is(runErr, context.Canceled) {
		t.Fatalf("expected context canceled error, got %v", runErr)
	}
}

func TestExecuteJobSendsCancelOnContextDone(t *testing.T) {
	t.Parallel()
	pluginSvc, err := New(Options{})
	if err != nil {
		t.Fatalf("New plugin error: %v", err)
	}
	defer pluginSvc.Shutdown()

	const workerID = "worker-exec"
	const jobType = "vacuum"
	pluginSvc.registry.UpsertFromHello(&plugin_pb.WorkerHello{
		WorkerId: workerID,
		Capabilities: []*plugin_pb.JobTypeCapability{
			{JobType: jobType, CanExecute: true, MaxExecutionConcurrency: 1},
		},
	})
	session := &streamSession{workerID: workerID, outgoing: make(chan *plugin_pb.AdminToWorkerMessage, 4)}
	pluginSvc.putSession(session)

	job := &plugin_pb.JobSpec{JobId: "job-1", JobType: jobType}
	ctx, cancel := context.WithCancel(context.Background())
	errCh := make(chan error, 1)
	go func() {
		_, runErr := pluginSvc.ExecuteJob(ctx, job, &plugin_pb.ClusterContext{}, 1)
		errCh <- runErr
	}()

	first := <-session.outgoing
	if first.GetExecuteJobRequest() == nil {
		t.Fatalf("expected first message to be execute_job_request")
	}

	cancel()

	second := <-session.outgoing
	cancelReq := second.GetCancelRequest()
	if cancelReq == nil {
		t.Fatalf("expected second message to be cancel_request")
	}
	if cancelReq.TargetId != first.RequestId {
		t.Fatalf("unexpected cancel target id: got=%s want=%s", cancelReq.TargetId, first.RequestId)
	}
	if cancelReq.TargetKind != plugin_pb.WorkKind_WORK_KIND_EXECUTION {
		t.Fatalf("unexpected cancel target kind: %v", cancelReq.TargetKind)
	}

	runErr := <-errCh
	if !errors.Is(runErr, context.Canceled) {
		t.Fatalf("expected context canceled error, got %v", runErr)
	}
}
weed/admin/plugin/plugin_config_bootstrap_test.go (new file, 125 lines)
@@ -0,0 +1,125 @@
package plugin

import (
	"testing"

	"github.com/seaweedfs/seaweedfs/weed/pb/plugin_pb"
)

func TestEnsureJobTypeConfigFromDescriptorBootstrapsDefaults(t *testing.T) {
	t.Parallel()

	pluginSvc, err := New(Options{})
	if err != nil {
		t.Fatalf("New: %v", err)
	}
	defer pluginSvc.Shutdown()

	descriptor := &plugin_pb.JobTypeDescriptor{
		JobType:           "vacuum",
		DescriptorVersion: 3,
		AdminConfigForm: &plugin_pb.ConfigForm{
			DefaultValues: map[string]*plugin_pb.ConfigValue{
				"scan_scope": {Kind: &plugin_pb.ConfigValue_StringValue{StringValue: "all"}},
			},
		},
		WorkerConfigForm: &plugin_pb.ConfigForm{
			DefaultValues: map[string]*plugin_pb.ConfigValue{
				"threshold": {Kind: &plugin_pb.ConfigValue_DoubleValue{DoubleValue: 0.3}},
			},
		},
		AdminRuntimeDefaults: &plugin_pb.AdminRuntimeDefaults{
			Enabled:                       true,
			DetectionIntervalSeconds:      60,
			DetectionTimeoutSeconds:       20,
			MaxJobsPerDetection:           30,
			GlobalExecutionConcurrency:    4,
			PerWorkerExecutionConcurrency: 2,
			RetryLimit:                    3,
			RetryBackoffSeconds:           5,
		},
	}

	if err := pluginSvc.ensureJobTypeConfigFromDescriptor("vacuum", descriptor); err != nil {
		t.Fatalf("ensureJobTypeConfigFromDescriptor: %v", err)
	}

	cfg, err := pluginSvc.LoadJobTypeConfig("vacuum")
	if err != nil {
		t.Fatalf("LoadJobTypeConfig: %v", err)
	}
	if cfg == nil {
		t.Fatalf("expected non-nil config")
	}
	if cfg.DescriptorVersion != 3 {
		t.Fatalf("unexpected descriptor version: got=%d", cfg.DescriptorVersion)
	}
	if cfg.AdminRuntime == nil || !cfg.AdminRuntime.Enabled {
		t.Fatalf("expected enabled admin settings")
	}
	if cfg.AdminRuntime.GlobalExecutionConcurrency != 4 {
		t.Fatalf("unexpected global execution concurrency: %d", cfg.AdminRuntime.GlobalExecutionConcurrency)
	}
	if _, ok := cfg.AdminConfigValues["scan_scope"]; !ok {
		t.Fatalf("missing admin default value")
	}
	if _, ok := cfg.WorkerConfigValues["threshold"]; !ok {
		t.Fatalf("missing worker default value")
	}
}

func TestEnsureJobTypeConfigFromDescriptorDoesNotOverwriteExisting(t *testing.T) {
	t.Parallel()

	pluginSvc, err := New(Options{})
	if err != nil {
		t.Fatalf("New: %v", err)
	}
	defer pluginSvc.Shutdown()

	if err := pluginSvc.SaveJobTypeConfig(&plugin_pb.PersistedJobTypeConfig{
		JobType: "balance",
		AdminRuntime: &plugin_pb.AdminRuntimeConfig{
			Enabled:                    true,
			GlobalExecutionConcurrency: 9,
		},
		AdminConfigValues: map[string]*plugin_pb.ConfigValue{
			"custom": {Kind: &plugin_pb.ConfigValue_StringValue{StringValue: "keep"}},
		},
	}); err != nil {
		t.Fatalf("SaveJobTypeConfig: %v", err)
	}

	descriptor := &plugin_pb.JobTypeDescriptor{
		JobType:           "balance",
		DescriptorVersion: 7,
		AdminConfigForm: &plugin_pb.ConfigForm{
			DefaultValues: map[string]*plugin_pb.ConfigValue{
				"custom": {Kind: &plugin_pb.ConfigValue_StringValue{StringValue: "overwrite"}},
			},
		},
		AdminRuntimeDefaults: &plugin_pb.AdminRuntimeDefaults{
			Enabled:                    true,
			GlobalExecutionConcurrency: 1,
		},
	}

	if err := pluginSvc.ensureJobTypeConfigFromDescriptor("balance", descriptor); err != nil {
		t.Fatalf("ensureJobTypeConfigFromDescriptor: %v", err)
	}

	cfg, err := pluginSvc.LoadJobTypeConfig("balance")
	if err != nil {
		t.Fatalf("LoadJobTypeConfig: %v", err)
	}
	if cfg == nil {
		t.Fatalf("expected config")
	}
	if cfg.AdminRuntime == nil || cfg.AdminRuntime.GlobalExecutionConcurrency != 9 {
		t.Fatalf("existing admin settings should be preserved, got=%v", cfg.AdminRuntime)
	}
	custom := cfg.AdminConfigValues["custom"]
	if custom == nil || custom.GetStringValue() != "keep" {
		t.Fatalf("existing admin config should be preserved")
	}
}
weed/admin/plugin/plugin_detection_test.go (new file, 197 lines)
@@ -0,0 +1,197 @@
package plugin

import (
	"context"
	"testing"
	"time"

	"github.com/seaweedfs/seaweedfs/weed/pb/plugin_pb"
)

func TestRunDetectionIncludesLatestSuccessfulRun(t *testing.T) {
	pluginSvc, err := New(Options{})
	if err != nil {
		t.Fatalf("New plugin error: %v", err)
	}
	defer pluginSvc.Shutdown()

	jobType := "vacuum"
	pluginSvc.registry.UpsertFromHello(&plugin_pb.WorkerHello{
		WorkerId: "worker-a",
		Capabilities: []*plugin_pb.JobTypeCapability{
			{JobType: jobType, CanDetect: true, MaxDetectionConcurrency: 1},
		},
	})
	session := &streamSession{workerID: "worker-a", outgoing: make(chan *plugin_pb.AdminToWorkerMessage, 1)}
	pluginSvc.putSession(session)

	oldSuccess := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
	latestSuccess := time.Date(2026, 2, 1, 0, 0, 0, 0, time.UTC)
	if err := pluginSvc.store.AppendRunRecord(jobType, &JobRunRecord{Outcome: RunOutcomeSuccess, CompletedAt: timeToPtr(oldSuccess)}); err != nil {
		t.Fatalf("AppendRunRecord old success: %v", err)
	}
	if err := pluginSvc.store.AppendRunRecord(jobType, &JobRunRecord{Outcome: RunOutcomeError, CompletedAt: timeToPtr(latestSuccess.Add(2 * time.Hour))}); err != nil {
		t.Fatalf("AppendRunRecord error run: %v", err)
	}
	if err := pluginSvc.store.AppendRunRecord(jobType, &JobRunRecord{Outcome: RunOutcomeSuccess, CompletedAt: timeToPtr(latestSuccess)}); err != nil {
		t.Fatalf("AppendRunRecord latest success: %v", err)
	}

	resultCh := make(chan error, 1)
	go func() {
		_, runErr := pluginSvc.RunDetection(context.Background(), jobType, &plugin_pb.ClusterContext{}, 10)
		resultCh <- runErr
	}()

	message := <-session.outgoing
	detectRequest := message.GetRunDetectionRequest()
	if detectRequest == nil {
		t.Fatalf("expected run detection request message")
	}
	if detectRequest.LastSuccessfulRun == nil {
		t.Fatalf("expected last_successful_run to be set")
	}
	if got := detectRequest.LastSuccessfulRun.AsTime().UTC(); !got.Equal(latestSuccess) {
		t.Fatalf("unexpected last_successful_run, got=%s want=%s", got, latestSuccess)
	}

	pluginSvc.handleDetectionComplete("worker-a", &plugin_pb.DetectionComplete{
		RequestId: message.RequestId,
		JobType:   jobType,
		Success:   true,
	})

	if runErr := <-resultCh; runErr != nil {
		t.Fatalf("RunDetection error: %v", runErr)
	}
}

func TestRunDetectionOmitsLastSuccessfulRunWhenNoSuccessHistory(t *testing.T) {
	pluginSvc, err := New(Options{})
	if err != nil {
		t.Fatalf("New plugin error: %v", err)
	}
	defer pluginSvc.Shutdown()

	jobType := "vacuum"
	pluginSvc.registry.UpsertFromHello(&plugin_pb.WorkerHello{
		WorkerId: "worker-a",
		Capabilities: []*plugin_pb.JobTypeCapability{
			{JobType: jobType, CanDetect: true, MaxDetectionConcurrency: 1},
		},
	})
	session := &streamSession{workerID: "worker-a", outgoing: make(chan *plugin_pb.AdminToWorkerMessage, 1)}
	pluginSvc.putSession(session)

	if err := pluginSvc.store.AppendRunRecord(jobType, &JobRunRecord{
		Outcome:     RunOutcomeError,
		CompletedAt: timeToPtr(time.Date(2026, 2, 10, 0, 0, 0, 0, time.UTC)),
	}); err != nil {
		t.Fatalf("AppendRunRecord error run: %v", err)
	}

	resultCh := make(chan error, 1)
	go func() {
		_, runErr := pluginSvc.RunDetection(context.Background(), jobType, &plugin_pb.ClusterContext{}, 10)
		resultCh <- runErr
	}()

	message := <-session.outgoing
	detectRequest := message.GetRunDetectionRequest()
	if detectRequest == nil {
		t.Fatalf("expected run detection request message")
	}
	if detectRequest.LastSuccessfulRun != nil {
		t.Fatalf("expected last_successful_run to be nil when no success history")
	}

	pluginSvc.handleDetectionComplete("worker-a", &plugin_pb.DetectionComplete{
		RequestId: message.RequestId,
		JobType:   jobType,
		Success:   true,
	})

	if runErr := <-resultCh; runErr != nil {
		t.Fatalf("RunDetection error: %v", runErr)
	}
}

func TestRunDetectionWithReportCapturesDetectionActivities(t *testing.T) {
	pluginSvc, err := New(Options{})
	if err != nil {
		t.Fatalf("New plugin error: %v", err)
	}
	defer pluginSvc.Shutdown()

	jobType := "vacuum"
	pluginSvc.registry.UpsertFromHello(&plugin_pb.WorkerHello{
		WorkerId: "worker-a",
		Capabilities: []*plugin_pb.JobTypeCapability{
			{JobType: jobType, CanDetect: true, MaxDetectionConcurrency: 1},
		},
	})
	session := &streamSession{workerID: "worker-a", outgoing: make(chan *plugin_pb.AdminToWorkerMessage, 1)}
	pluginSvc.putSession(session)

	reportCh := make(chan *DetectionReport, 1)
	errCh := make(chan error, 1)
	go func() {
		report, runErr := pluginSvc.RunDetectionWithReport(context.Background(), jobType, &plugin_pb.ClusterContext{}, 10)
		reportCh <- report
		errCh <- runErr
	}()

	message := <-session.outgoing
	requestID := message.GetRequestId()
	if requestID == "" {
		t.Fatalf("expected request id in detection request")
	}

	pluginSvc.handleDetectionProposals("worker-a", &plugin_pb.DetectionProposals{
		RequestId: requestID,
		JobType:   jobType,
		Proposals: []*plugin_pb.JobProposal{
			{
				ProposalId: "proposal-1",
				JobType:    jobType,
				Summary:    "vacuum proposal",
				Detail:     "based on garbage ratio",
			},
		},
	})
	pluginSvc.handleDetectionComplete("worker-a", &plugin_pb.DetectionComplete{
		RequestId:      requestID,
		JobType:        jobType,
		Success:        true,
		TotalProposals: 1,
	})

	report := <-reportCh
	if report == nil {
		t.Fatalf("expected detection report")
	}
	if report.RequestID == "" {
		t.Fatalf("expected detection report request id")
	}
	if report.WorkerID != "worker-a" {
		t.Fatalf("expected worker-a, got %q", report.WorkerID)
	}
	if len(report.Proposals) != 1 {
		t.Fatalf("expected one proposal in report, got %d", len(report.Proposals))
	}
	if runErr := <-errCh; runErr != nil {
		t.Fatalf("RunDetectionWithReport error: %v", runErr)
	}

	activities := pluginSvc.ListActivities(jobType, 0)
	stages := map[string]bool{}
	for _, activity := range activities {
		if activity.RequestID != report.RequestID {
			continue
		}
		stages[activity.Stage] = true
	}
	if !stages["requested"] || !stages["proposal"] || !stages["completed"] {
		t.Fatalf("expected requested/proposal/completed activities, got stages=%v", stages)
	}
}
weed/admin/plugin/plugin_monitor.go (new file, 896 lines)
@@ -0,0 +1,896 @@
package plugin

import (
	"encoding/json"
	"sort"
	"strings"
	"time"

	"github.com/seaweedfs/seaweedfs/weed/glog"
	"github.com/seaweedfs/seaweedfs/weed/pb/plugin_pb"
	"google.golang.org/protobuf/encoding/protojson"
)

const (
	maxTrackedJobsTotal = 1000
	maxActivityRecords  = 4000
	maxRelatedJobs      = 100
)

var (
	StateSucceeded = strings.ToLower(plugin_pb.JobState_JOB_STATE_SUCCEEDED.String())
	StateFailed    = strings.ToLower(plugin_pb.JobState_JOB_STATE_FAILED.String())
	StateCanceled  = strings.ToLower(plugin_pb.JobState_JOB_STATE_CANCELED.String())
)

// activityLess reports whether activity a occurred after activity b (newest-first order).
// A nil OccurredAt is treated as the zero time.
func activityLess(a, b JobActivity) bool {
	ta := time.Time{}
	if a.OccurredAt != nil {
		ta = *a.OccurredAt
	}
	tb := time.Time{}
	if b.OccurredAt != nil {
		tb = *b.OccurredAt
	}
	return ta.After(tb)
}
func (r *Plugin) loadPersistedMonitorState() error {
	trackedJobs, err := r.store.LoadTrackedJobs()
	if err != nil {
		return err
	}
	activities, err := r.store.LoadActivities()
	if err != nil {
		return err
	}

	if len(trackedJobs) > 0 {
		r.jobsMu.Lock()
		for i := range trackedJobs {
			job := trackedJobs[i]
			if strings.TrimSpace(job.JobID) == "" {
				continue
			}
			// Backward compatibility: migrate older inline detail payloads
			// out of tracked_jobs.json into dedicated per-job detail files.
			if hasTrackedJobRichDetails(job) {
				if err := r.store.SaveJobDetail(job); err != nil {
					glog.Warningf("Plugin failed to migrate detail snapshot for job %s: %v", job.JobID, err)
				}
			}
			stripTrackedJobDetailFields(&job)
			jobCopy := job
			r.jobs[job.JobID] = &jobCopy
		}
		r.pruneTrackedJobsLocked()
		r.jobsMu.Unlock()
	}

	if len(activities) > maxActivityRecords {
		activities = activities[len(activities)-maxActivityRecords:]
	}
	if len(activities) > 0 {
		r.activitiesMu.Lock()
		r.activities = append([]JobActivity(nil), activities...)
		r.activitiesMu.Unlock()
	}

	return nil
}

func (r *Plugin) ListTrackedJobs(jobType string, state string, limit int) []TrackedJob {
	r.jobsMu.RLock()
	defer r.jobsMu.RUnlock()

	normalizedJobType := strings.TrimSpace(jobType)
	normalizedState := strings.TrimSpace(strings.ToLower(state))

	items := make([]TrackedJob, 0, len(r.jobs))
	for _, job := range r.jobs {
		if job == nil {
			continue
		}
		if normalizedJobType != "" && job.JobType != normalizedJobType {
			continue
		}
		if normalizedState != "" && strings.ToLower(job.State) != normalizedState {
			continue
		}
		items = append(items, cloneTrackedJob(*job))
	}

	sort.Slice(items, func(i, j int) bool {
		ti := time.Time{}
		if items[i].UpdatedAt != nil {
			ti = *items[i].UpdatedAt
		}
		tj := time.Time{}
		if items[j].UpdatedAt != nil {
			tj = *items[j].UpdatedAt
		}
		if !ti.Equal(tj) {
			return ti.After(tj)
		}
		return items[i].JobID < items[j].JobID
	})

	if limit > 0 && len(items) > limit {
		items = items[:limit]
	}
	return items
}

func (r *Plugin) GetTrackedJob(jobID string) (*TrackedJob, bool) {
	r.jobsMu.RLock()
	defer r.jobsMu.RUnlock()

	job, ok := r.jobs[jobID]
	if !ok || job == nil {
		return nil, false
	}
	clone := cloneTrackedJob(*job)
	return &clone, true
}

func (r *Plugin) ListActivities(jobType string, limit int) []JobActivity {
	r.activitiesMu.RLock()
	defer r.activitiesMu.RUnlock()

	normalized := strings.TrimSpace(jobType)
	activities := make([]JobActivity, 0, len(r.activities))
	for _, activity := range r.activities {
		if normalized != "" && activity.JobType != normalized {
			continue
		}
		activities = append(activities, activity)
	}

	sort.Slice(activities, func(i, j int) bool {
		return activityLess(activities[i], activities[j])
	})
	if limit > 0 && len(activities) > limit {
		activities = activities[:limit]
	}
	return activities
}

func (r *Plugin) ListJobActivities(jobID string, limit int) []JobActivity {
	normalizedJobID := strings.TrimSpace(jobID)
	if normalizedJobID == "" {
		return nil
	}

	r.activitiesMu.RLock()
	activities := make([]JobActivity, 0, len(r.activities))
	for _, activity := range r.activities {
		if strings.TrimSpace(activity.JobID) != normalizedJobID {
			continue
		}
		activities = append(activities, activity)
	}
	r.activitiesMu.RUnlock()

	sort.Slice(activities, func(i, j int) bool {
		return !activityLess(activities[i], activities[j]) // oldest-first for job timeline
	})
	if limit > 0 && len(activities) > limit {
		activities = activities[len(activities)-limit:]
	}
	return activities
}

func (r *Plugin) BuildJobDetail(jobID string, activityLimit int, relatedLimit int) (*JobDetail, bool, error) {
	normalizedJobID := strings.TrimSpace(jobID)
	if normalizedJobID == "" {
		return nil, false, nil
	}

	// Clamp relatedLimit to a safe range to avoid excessive memory allocation from untrusted input.
	if relatedLimit <= 0 {
		relatedLimit = 0
	} else if relatedLimit > maxRelatedJobs {
		relatedLimit = maxRelatedJobs
	}

	r.jobsMu.RLock()
	trackedSnapshot, ok := r.jobs[normalizedJobID]
	if ok && trackedSnapshot != nil {
		candidate := cloneTrackedJob(*trackedSnapshot)
		stripTrackedJobDetailFields(&candidate)
		trackedSnapshot = &candidate
	} else {
		trackedSnapshot = nil
	}
	r.jobsMu.RUnlock()

	detailJob, err := r.store.LoadJobDetail(normalizedJobID)
	if err != nil {
		return nil, false, err
	}

	if trackedSnapshot == nil && detailJob == nil {
		return nil, false, nil
	}
	if detailJob == nil && trackedSnapshot != nil {
		clone := cloneTrackedJob(*trackedSnapshot)
		detailJob = &clone
	}
	if detailJob == nil {
		return nil, false, nil
	}
	if trackedSnapshot != nil {
		mergeTrackedStatusIntoDetail(detailJob, trackedSnapshot)
	}
	detailJob.Parameters = enrichTrackedJobParameters(detailJob.JobType, detailJob.Parameters)

	r.activitiesMu.RLock()
	activities := append([]JobActivity(nil), r.activities...)
	r.activitiesMu.RUnlock()

	detail := &JobDetail{
		Job:         detailJob,
		Activities:  filterJobActivitiesFromSlice(activities, normalizedJobID, activityLimit),
		LastUpdated: timeToPtr(time.Now().UTC()),
	}

	if history, err := r.store.LoadRunHistory(detailJob.JobType); err != nil {
		return nil, true, err
	} else if history != nil {
		for i := range history.SuccessfulRuns {
			record := history.SuccessfulRuns[i]
			if strings.TrimSpace(record.JobID) == normalizedJobID {
				recordCopy := record
				detail.RunRecord = &recordCopy
				break
			}
		}
		if detail.RunRecord == nil {
			for i := range history.ErrorRuns {
				record := history.ErrorRuns[i]
				if strings.TrimSpace(record.JobID) == normalizedJobID {
					recordCopy := record
					detail.RunRecord = &recordCopy
					break
				}
			}
		}
	}

	if relatedLimit > 0 {
		related := make([]TrackedJob, 0, relatedLimit)
		r.jobsMu.RLock()
		for _, candidate := range r.jobs {
			if strings.TrimSpace(candidate.JobType) != strings.TrimSpace(detailJob.JobType) {
				continue
			}
			if strings.TrimSpace(candidate.JobID) == normalizedJobID {
				continue
			}
			cloned := cloneTrackedJob(*candidate)
			stripTrackedJobDetailFields(&cloned)
			related = append(related, cloned)
			if len(related) >= relatedLimit {
				break
			}
		}
		r.jobsMu.RUnlock()
		detail.RelatedJobs = related
	}

	return detail, true, nil
}

func filterJobActivitiesFromSlice(all []JobActivity, jobID string, limit int) []JobActivity {
	if strings.TrimSpace(jobID) == "" || len(all) == 0 {
		return nil
	}

	activities := make([]JobActivity, 0, len(all))
	for _, activity := range all {
		if strings.TrimSpace(activity.JobID) != jobID {
			continue
		}
		activities = append(activities, activity)
	}

	sort.Slice(activities, func(i, j int) bool {
		return !activityLess(activities[i], activities[j]) // oldest-first for job timeline
	})
	if limit > 0 && len(activities) > limit {
		activities = activities[len(activities)-limit:]
	}
	return activities
}

func stripTrackedJobDetailFields(job *TrackedJob) {
	if job == nil {
		return
	}
	job.Detail = ""
	job.Parameters = nil
	job.Labels = nil
	job.ResultOutputValues = nil
}

func hasTrackedJobRichDetails(job TrackedJob) bool {
	return strings.TrimSpace(job.Detail) != "" ||
		len(job.Parameters) > 0 ||
		len(job.Labels) > 0 ||
		len(job.ResultOutputValues) > 0
}

func mergeTrackedStatusIntoDetail(detail *TrackedJob, tracked *TrackedJob) {
	if detail == nil || tracked == nil {
		return
	}

	if detail.JobType == "" {
		detail.JobType = tracked.JobType
	}
	if detail.RequestID == "" {
		detail.RequestID = tracked.RequestID
	}
	if detail.WorkerID == "" {
		detail.WorkerID = tracked.WorkerID
	}
	if detail.DedupeKey == "" {
		detail.DedupeKey = tracked.DedupeKey
	}
	if detail.Summary == "" {
		detail.Summary = tracked.Summary
	}
	if detail.State == "" {
		detail.State = tracked.State
	}
	if detail.Progress == 0 {
		detail.Progress = tracked.Progress
	}
	if detail.Stage == "" {
		detail.Stage = tracked.Stage
	}
	if detail.Message == "" {
		detail.Message = tracked.Message
	}
	if detail.Attempt == 0 {
		detail.Attempt = tracked.Attempt
	}
	if detail.CreatedAt == nil || detail.CreatedAt.IsZero() {
		detail.CreatedAt = tracked.CreatedAt
	}
	if detail.UpdatedAt == nil || detail.UpdatedAt.IsZero() {
		detail.UpdatedAt = tracked.UpdatedAt
	}
	if detail.CompletedAt == nil || detail.CompletedAt.IsZero() {
		detail.CompletedAt = tracked.CompletedAt
	}
	if detail.ErrorMessage == "" {
		detail.ErrorMessage = tracked.ErrorMessage
	}
	if detail.ResultSummary == "" {
		detail.ResultSummary = tracked.ResultSummary
	}
}

func (r *Plugin) handleJobProgressUpdate(workerID string, update *plugin_pb.JobProgressUpdate) {
	if update == nil {
		return
	}

	now := time.Now().UTC()
	resolvedWorkerID := strings.TrimSpace(workerID)

	if strings.TrimSpace(update.JobId) != "" {
		r.jobsMu.Lock()
		job := r.jobs[update.JobId]
		if job == nil {
			job = &TrackedJob{
				JobID:     update.JobId,
				JobType:   update.JobType,
				RequestID: update.RequestId,
				WorkerID:  resolvedWorkerID,
				CreatedAt: timeToPtr(now),
			}
			r.jobs[update.JobId] = job
		}

		if update.JobType != "" {
			job.JobType = update.JobType
		}
		if update.RequestId != "" {
			job.RequestID = update.RequestId
		}
		if job.WorkerID != "" {
			resolvedWorkerID = job.WorkerID
		} else if resolvedWorkerID != "" {
			job.WorkerID = resolvedWorkerID
		}
		job.State = strings.ToLower(update.State.String())
		job.Progress = update.ProgressPercent
		job.Stage = update.Stage
		job.Message = update.Message
		job.UpdatedAt = timeToPtr(now)
		r.pruneTrackedJobsLocked()
		r.dirtyJobs = true
		r.jobsMu.Unlock()
	}

	r.trackWorkerActivities(update.JobType, update.JobId, update.RequestId, resolvedWorkerID, update.Activities)
	if update.Message != "" || update.Stage != "" {
		source := "worker_progress"
		if strings.TrimSpace(update.JobId) == "" {
			source = "worker_detection"
		}
		r.appendActivity(JobActivity{
			JobID:      update.JobId,
			JobType:    update.JobType,
			RequestID:  update.RequestId,
			WorkerID:   resolvedWorkerID,
			Source:     source,
			Message:    update.Message,
			Stage:      update.Stage,
			OccurredAt: timeToPtr(now),
		})
	}
}

func (r *Plugin) trackExecutionStart(requestID, workerID string, job *plugin_pb.JobSpec, attempt int32) {
	if job == nil || strings.TrimSpace(job.JobId) == "" {
		return
	}

	now := time.Now().UTC()

	r.jobsMu.Lock()
	tracked := r.jobs[job.JobId]
	if tracked == nil {
		tracked = &TrackedJob{
			JobID:     job.JobId,
			CreatedAt: timeToPtr(now),
		}
		r.jobs[job.JobId] = tracked
	}

	tracked.JobType = job.JobType
	tracked.RequestID = requestID
	tracked.WorkerID = workerID
	tracked.DedupeKey = job.DedupeKey
	tracked.Summary = job.Summary
	tracked.State = strings.ToLower(plugin_pb.JobState_JOB_STATE_ASSIGNED.String())
	tracked.Progress = 0
	tracked.Stage = "assigned"
	tracked.Message = "job assigned to worker"
	tracked.Attempt = attempt
	if tracked.CreatedAt == nil || tracked.CreatedAt.IsZero() {
		tracked.CreatedAt = timeToPtr(now)
	}
	tracked.UpdatedAt = timeToPtr(now)
	trackedSnapshot := cloneTrackedJob(*tracked)
	r.pruneTrackedJobsLocked()
	r.dirtyJobs = true
	r.jobsMu.Unlock()
	r.persistJobDetailSnapshot(job.JobId, func(detail *TrackedJob) {
		detail.JobID = job.JobId
		detail.JobType = job.JobType
		detail.RequestID = requestID
		detail.WorkerID = workerID
		detail.DedupeKey = job.DedupeKey
		detail.Summary = job.Summary
		detail.Detail = job.Detail
		detail.Parameters = enrichTrackedJobParameters(job.JobType, configValueMapToPlain(job.Parameters))
		if len(job.Labels) > 0 {
			labels := make(map[string]string, len(job.Labels))
			for key, value := range job.Labels {
				labels[key] = value
			}
			detail.Labels = labels
		} else {
			detail.Labels = nil
		}
		detail.State = trackedSnapshot.State
		detail.Progress = trackedSnapshot.Progress
		detail.Stage = trackedSnapshot.Stage
		detail.Message = trackedSnapshot.Message
		detail.Attempt = attempt
		if detail.CreatedAt == nil || detail.CreatedAt.IsZero() {
			detail.CreatedAt = trackedSnapshot.CreatedAt
		}
		detail.UpdatedAt = trackedSnapshot.UpdatedAt
	})

	r.appendActivity(JobActivity{
		JobID:      job.JobId,
		JobType:    job.JobType,
		RequestID:  requestID,
		WorkerID:   workerID,
		Source:     "admin_dispatch",
		Message:    "job assigned",
		Stage:      "assigned",
		OccurredAt: timeToPtr(now),
	})
}

func (r *Plugin) trackExecutionQueued(job *plugin_pb.JobSpec) {
	if job == nil || strings.TrimSpace(job.JobId) == "" {
		return
	}

	now := time.Now().UTC()

	r.jobsMu.Lock()
	tracked := r.jobs[job.JobId]
	if tracked == nil {
		tracked = &TrackedJob{
			JobID:     job.JobId,
			CreatedAt: timeToPtr(now),
		}
		r.jobs[job.JobId] = tracked
	}

	tracked.JobType = job.JobType
	tracked.DedupeKey = job.DedupeKey
	tracked.Summary = job.Summary
	tracked.State = strings.ToLower(plugin_pb.JobState_JOB_STATE_PENDING.String())
	tracked.Progress = 0
	tracked.Stage = "queued"
	tracked.Message = "waiting for available executor"
	if tracked.CreatedAt == nil || tracked.CreatedAt.IsZero() {
		tracked.CreatedAt = timeToPtr(now)
	}
	tracked.UpdatedAt = timeToPtr(now)
	trackedSnapshot := cloneTrackedJob(*tracked)
	r.pruneTrackedJobsLocked()
	r.dirtyJobs = true
	r.jobsMu.Unlock()
	r.persistJobDetailSnapshot(job.JobId, func(detail *TrackedJob) {
		detail.JobID = job.JobId
		detail.JobType = job.JobType
		detail.DedupeKey = job.DedupeKey
		detail.Summary = job.Summary
		detail.Detail = job.Detail
		detail.Parameters = enrichTrackedJobParameters(job.JobType, configValueMapToPlain(job.Parameters))
		if len(job.Labels) > 0 {
			labels := make(map[string]string, len(job.Labels))
			for key, value := range job.Labels {
				labels[key] = value
			}
			detail.Labels = labels
		} else {
			detail.Labels = nil
		}
		detail.State = trackedSnapshot.State
		detail.Progress = trackedSnapshot.Progress
		detail.Stage = trackedSnapshot.Stage
		detail.Message = trackedSnapshot.Message
		if detail.CreatedAt == nil || detail.CreatedAt.IsZero() {
			detail.CreatedAt = trackedSnapshot.CreatedAt
		}
		detail.UpdatedAt = trackedSnapshot.UpdatedAt
	})

	r.appendActivity(JobActivity{
		JobID:      job.JobId,
		JobType:    job.JobType,
		Source:     "admin_scheduler",
		Message:    "job queued for execution",
		Stage:      "queued",
		OccurredAt: timeToPtr(now),
	})
}

func (r *Plugin) trackExecutionCompletion(completed *plugin_pb.JobCompleted) *TrackedJob {
	if completed == nil || strings.TrimSpace(completed.JobId) == "" {
		return nil
	}

	now := time.Now().UTC()
	if completed.CompletedAt != nil {
		now = completed.CompletedAt.AsTime().UTC()
	}

	r.jobsMu.Lock()
	tracked := r.jobs[completed.JobId]
	if tracked == nil {
		tracked = &TrackedJob{
			JobID:     completed.JobId,
			CreatedAt: timeToPtr(now),
		}
		r.jobs[completed.JobId] = tracked
	}

	if completed.JobType != "" {
		tracked.JobType = completed.JobType
	}
	if completed.RequestId != "" {
		tracked.RequestID = completed.RequestId
	}
	if completed.Success {
		tracked.State = strings.ToLower(plugin_pb.JobState_JOB_STATE_SUCCEEDED.String())
		tracked.Progress = 100
		tracked.Stage = "completed"
		if completed.Result != nil {
			tracked.ResultSummary = completed.Result.Summary
		}
		tracked.Message = tracked.ResultSummary
		if tracked.Message == "" {
			tracked.Message = "completed"
		}
		tracked.ErrorMessage = ""
	} else {
		tracked.State = strings.ToLower(plugin_pb.JobState_JOB_STATE_FAILED.String())
		tracked.Stage = "failed"
		tracked.ErrorMessage = completed.ErrorMessage
		tracked.Message = completed.ErrorMessage
	}

	tracked.UpdatedAt = timeToPtr(now)
	tracked.CompletedAt = timeToPtr(now)
	r.pruneTrackedJobsLocked()
	clone := cloneTrackedJob(*tracked)
	r.dirtyJobs = true
	r.jobsMu.Unlock()
	r.persistJobDetailSnapshot(completed.JobId, func(detail *TrackedJob) {
		detail.JobID = completed.JobId
		if completed.JobType != "" {
			detail.JobType = completed.JobType
		}
		if completed.RequestId != "" {
			detail.RequestID = completed.RequestId
		}
		detail.State = clone.State
		detail.Progress = clone.Progress
		detail.Stage = clone.Stage
		detail.Message = clone.Message
		detail.ErrorMessage = clone.ErrorMessage
		detail.ResultSummary = clone.ResultSummary
		if completed.Success && completed.Result != nil {
			detail.ResultOutputValues = configValueMapToPlain(completed.Result.OutputValues)
		} else {
			detail.ResultOutputValues = nil
		}
		if detail.CreatedAt == nil || detail.CreatedAt.IsZero() {
			detail.CreatedAt = clone.CreatedAt
		}
		if detail.UpdatedAt == nil || detail.UpdatedAt.IsZero() {
			detail.UpdatedAt = clone.UpdatedAt
		}
		if detail.CompletedAt == nil || detail.CompletedAt.IsZero() {
			detail.CompletedAt = clone.CompletedAt
		}
	})

	r.appendActivity(JobActivity{
		JobID:     completed.JobId,
		JobType:   completed.JobType,
		RequestID: completed.RequestId,
		WorkerID:  clone.WorkerID,
		Source:    "worker_completion",
		Message:   clone.Message,
Stage: clone.Stage,
|
||||
OccurredAt: timeToPtr(now),
|
||||
})
|
||||
|
||||
return &clone
|
||||
}
|
||||
|
||||
func (r *Plugin) trackWorkerActivities(jobType, jobID, requestID, workerID string, events []*plugin_pb.ActivityEvent) {
	if len(events) == 0 {
		return
	}
	for _, event := range events {
		if event == nil {
			continue
		}
		timestamp := time.Now().UTC()
		if event.CreatedAt != nil {
			timestamp = event.CreatedAt.AsTime().UTC()
		}
		r.appendActivity(JobActivity{
			JobID:      jobID,
			JobType:    jobType,
			RequestID:  requestID,
			WorkerID:   workerID,
			Source:     strings.ToLower(event.Source.String()),
			Message:    event.Message,
			Stage:      event.Stage,
			Details:    configValueMapToPlain(event.Details),
			OccurredAt: timeToPtr(timestamp),
		})
	}
}

func (r *Plugin) appendActivity(activity JobActivity) {
	if activity.OccurredAt == nil || activity.OccurredAt.IsZero() {
		activity.OccurredAt = timeToPtr(time.Now().UTC())
	}

	r.activitiesMu.Lock()
	r.activities = append(r.activities, activity)
	if len(r.activities) > maxActivityRecords {
		r.activities = r.activities[len(r.activities)-maxActivityRecords:]
	}
	r.dirtyActivities = true
	r.activitiesMu.Unlock()
}

func (r *Plugin) pruneTrackedJobsLocked() {
	if len(r.jobs) <= maxTrackedJobsTotal {
		return
	}

	type sortableJob struct {
		jobID     string
		updatedAt time.Time
	}
	terminalJobs := make([]sortableJob, 0)
	for jobID, job := range r.jobs {
		if job.State == StateSucceeded ||
			job.State == StateFailed ||
			job.State == StateCanceled {
			updAt := time.Time{}
			if job.UpdatedAt != nil {
				updAt = *job.UpdatedAt
			}
			terminalJobs = append(terminalJobs, sortableJob{jobID, updAt})
		}
	}

	if len(terminalJobs) == 0 {
		return
	}

	sort.Slice(terminalJobs, func(i, j int) bool {
		return terminalJobs[i].updatedAt.Before(terminalJobs[j].updatedAt)
	})

	toDelete := len(r.jobs) - maxTrackedJobsTotal
	if toDelete <= 0 {
		return
	}
	if toDelete > len(terminalJobs) {
		toDelete = len(terminalJobs)
	}

	for i := 0; i < toDelete; i++ {
		delete(r.jobs, terminalJobs[i].jobID)
	}
}

func configValueMapToPlain(values map[string]*plugin_pb.ConfigValue) map[string]interface{} {
	if len(values) == 0 {
		return nil
	}

	payload, err := protojson.MarshalOptions{UseProtoNames: true}.Marshal(&plugin_pb.ValueMap{Fields: values})
	if err != nil {
		return nil
	}

	decoded := map[string]interface{}{}
	if err := json.Unmarshal(payload, &decoded); err != nil {
		return nil
	}

	fields, ok := decoded["fields"].(map[string]interface{})
	if !ok {
		return nil
	}
	return fields
}

func (r *Plugin) persistTrackedJobsSnapshot() {
	r.jobsMu.Lock()
	r.dirtyJobs = false
	jobs := make([]TrackedJob, 0, len(r.jobs))
	for _, job := range r.jobs {
		if job == nil || strings.TrimSpace(job.JobID) == "" {
			continue
		}
		clone := cloneTrackedJob(*job)
		stripTrackedJobDetailFields(&clone)
		jobs = append(jobs, clone)
	}
	r.jobsMu.Unlock()

	if len(jobs) == 0 {
		return
	}

	sort.Slice(jobs, func(i, j int) bool {
		ti := time.Time{}
		if jobs[i].UpdatedAt != nil {
			ti = *jobs[i].UpdatedAt
		}
		tj := time.Time{}
		if jobs[j].UpdatedAt != nil {
			tj = *jobs[j].UpdatedAt
		}
		if !ti.Equal(tj) {
			return ti.After(tj)
		}
		return jobs[i].JobID < jobs[j].JobID
	})
	if len(jobs) > maxTrackedJobsTotal {
		jobs = jobs[:maxTrackedJobsTotal]
	}

	if err := r.store.SaveTrackedJobs(jobs); err != nil {
		glog.Warningf("Plugin failed to persist tracked jobs: %v", err)
	}
}

func (r *Plugin) persistJobDetailSnapshot(jobID string, apply func(detail *TrackedJob)) {
	normalizedJobID, _ := sanitizeJobID(jobID)
	if normalizedJobID == "" {
		return
	}

	r.jobDetailsMu.Lock()
	defer r.jobDetailsMu.Unlock()

	detail, err := r.store.LoadJobDetail(normalizedJobID)
	if err != nil {
		glog.Warningf("Plugin failed to load job detail snapshot for %s: %v", normalizedJobID, err)
		return
	}
	if detail == nil {
		detail = &TrackedJob{
			JobID: normalizedJobID,
		}
	}

	if apply != nil {
		apply(detail)
	}

	if err := r.store.SaveJobDetail(*detail); err != nil {
		glog.Warningf("Plugin failed to persist job detail snapshot for %s: %v", normalizedJobID, err)
	}
}

func (r *Plugin) persistActivitiesSnapshot() {
	r.activitiesMu.Lock()
	r.dirtyActivities = false
	activities := append([]JobActivity(nil), r.activities...)
	r.activitiesMu.Unlock()

	if len(activities) == 0 {
		return
	}

	if len(activities) > maxActivityRecords {
		activities = activities[len(activities)-maxActivityRecords:]
	}

	if err := r.store.SaveActivities(activities); err != nil {
		glog.Warningf("Plugin failed to persist activities: %v", err)
	}
}

func (r *Plugin) persistenceLoop() {
	defer r.wg.Done()
	for {
		select {
		case <-r.shutdownCh:
			r.persistTrackedJobsSnapshot()
			r.persistActivitiesSnapshot()
			return
		case <-r.persistTicker.C:
			r.jobsMu.RLock()
			needsJobsFlush := r.dirtyJobs
			r.jobsMu.RUnlock()
			if needsJobsFlush {
				r.persistTrackedJobsSnapshot()
			}

			r.activitiesMu.RLock()
			needsActivitiesFlush := r.dirtyActivities
			r.activitiesMu.RUnlock()
			if needsActivitiesFlush {
				r.persistActivitiesSnapshot()
			}
		}
	}
}

weed/admin/plugin/plugin_monitor_test.go (new file, 600 lines)
@@ -0,0 +1,600 @@
package plugin

import (
	"testing"
	"time"

	"github.com/seaweedfs/seaweedfs/weed/pb/plugin_pb"
	"github.com/seaweedfs/seaweedfs/weed/pb/worker_pb"
	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/types/known/timestamppb"
)

func TestPluginLoadsPersistedMonitorStateOnStart(t *testing.T) {
	t.Parallel()

	dataDir := t.TempDir()
	store, err := NewConfigStore(dataDir)
	if err != nil {
		t.Fatalf("NewConfigStore: %v", err)
	}

	seedJobs := []TrackedJob{
		{
			JobID:     "job-seeded",
			JobType:   "vacuum",
			State:     "running",
			CreatedAt: timeToPtr(time.Now().UTC().Add(-2 * time.Minute)),
			UpdatedAt: timeToPtr(time.Now().UTC().Add(-1 * time.Minute)),
		},
	}
	seedActivities := []JobActivity{
		{
			JobID:      "job-seeded",
			JobType:    "vacuum",
			Source:     "worker_progress",
			Message:    "seeded",
			OccurredAt: timeToPtr(time.Now().UTC().Add(-30 * time.Second)),
		},
	}

	if err := store.SaveTrackedJobs(seedJobs); err != nil {
		t.Fatalf("SaveTrackedJobs: %v", err)
	}
	if err := store.SaveActivities(seedActivities); err != nil {
		t.Fatalf("SaveActivities: %v", err)
	}

	pluginSvc, err := New(Options{DataDir: dataDir})
	if err != nil {
		t.Fatalf("New: %v", err)
	}
	defer pluginSvc.Shutdown()

	gotJobs := pluginSvc.ListTrackedJobs("", "", 0)
	if len(gotJobs) != 1 || gotJobs[0].JobID != "job-seeded" {
		t.Fatalf("unexpected loaded jobs: %+v", gotJobs)
	}

	gotActivities := pluginSvc.ListActivities("", 0)
	if len(gotActivities) != 1 || gotActivities[0].Message != "seeded" {
		t.Fatalf("unexpected loaded activities: %+v", gotActivities)
	}
}

func TestPluginPersistsMonitorStateAfterJobUpdates(t *testing.T) {
	t.Parallel()

	dataDir := t.TempDir()
	pluginSvc, err := New(Options{DataDir: dataDir})
	if err != nil {
		t.Fatalf("New: %v", err)
	}
	defer pluginSvc.Shutdown()

	job := &plugin_pb.JobSpec{
		JobId:   "job-persist",
		JobType: "vacuum",
		Summary: "persist test",
	}
	pluginSvc.trackExecutionStart("req-persist", "worker-a", job, 1)

	pluginSvc.trackExecutionCompletion(&plugin_pb.JobCompleted{
		RequestId:   "req-persist",
		JobId:       "job-persist",
		JobType:     "vacuum",
		Success:     true,
		Result:      &plugin_pb.JobResult{Summary: "done"},
		CompletedAt: timestamppb.New(time.Now().UTC()),
	})
	pluginSvc.Shutdown()

	store, err := NewConfigStore(dataDir)
	if err != nil {
		t.Fatalf("NewConfigStore: %v", err)
	}

	trackedJobs, err := store.LoadTrackedJobs()
	if err != nil {
		t.Fatalf("LoadTrackedJobs: %v", err)
	}
	if len(trackedJobs) == 0 {
		t.Fatalf("expected persisted tracked jobs")
	}

	found := false
	for _, tracked := range trackedJobs {
		if tracked.JobID == "job-persist" {
			found = true
			if tracked.State == "" {
				t.Fatalf("persisted job state should not be empty")
			}
		}
	}
	if !found {
		t.Fatalf("persisted tracked jobs missing job-persist")
	}

	activities, err := store.LoadActivities()
	if err != nil {
		t.Fatalf("LoadActivities: %v", err)
	}
	if len(activities) == 0 {
		t.Fatalf("expected persisted activities")
	}
}

func TestTrackExecutionQueuedMarksPendingState(t *testing.T) {
	t.Parallel()

	pluginSvc, err := New(Options{})
	if err != nil {
		t.Fatalf("New: %v", err)
	}
	defer pluginSvc.Shutdown()

	pluginSvc.trackExecutionQueued(&plugin_pb.JobSpec{
		JobId:     "job-pending-1",
		JobType:   "vacuum",
		DedupeKey: "vacuum:1",
		Summary:   "pending queue item",
	})

	jobs := pluginSvc.ListTrackedJobs("vacuum", "", 10)
	if len(jobs) != 1 {
		t.Fatalf("expected one tracked pending job, got=%d", len(jobs))
	}
	job := jobs[0]
	if job.JobID != "job-pending-1" {
		t.Fatalf("unexpected pending job id: %s", job.JobID)
	}
	if job.State != "job_state_pending" {
		t.Fatalf("unexpected pending job state: %s", job.State)
	}
	if job.Stage != "queued" {
		t.Fatalf("unexpected pending job stage: %s", job.Stage)
	}

	activities := pluginSvc.ListActivities("vacuum", 50)
	found := false
	for _, activity := range activities {
		if activity.JobID == "job-pending-1" && activity.Stage == "queued" && activity.Source == "admin_scheduler" {
			found = true
			break
		}
	}
	if !found {
		t.Fatalf("expected queued activity for pending job")
	}
}

func TestHandleJobProgressUpdateCarriesWorkerIDInActivities(t *testing.T) {
	t.Parallel()

	pluginSvc, err := New(Options{})
	if err != nil {
		t.Fatalf("New: %v", err)
	}
	defer pluginSvc.Shutdown()

	job := &plugin_pb.JobSpec{
		JobId:   "job-progress-worker",
		JobType: "vacuum",
	}
	pluginSvc.trackExecutionStart("req-progress-worker", "worker-a", job, 1)

	pluginSvc.handleJobProgressUpdate("worker-a", &plugin_pb.JobProgressUpdate{
		RequestId:       "req-progress-worker",
		JobId:           "job-progress-worker",
		JobType:         "vacuum",
		State:           plugin_pb.JobState_JOB_STATE_RUNNING,
		ProgressPercent: 42.0,
		Stage:           "scan",
		Message:         "in progress",
		Activities: []*plugin_pb.ActivityEvent{
			{
				Source:  plugin_pb.ActivitySource_ACTIVITY_SOURCE_EXECUTOR,
				Message: "volume scanned",
				Stage:   "scan",
			},
		},
	})

	activities := pluginSvc.ListActivities("vacuum", 0)
	if len(activities) == 0 {
		t.Fatalf("expected activity entries")
	}

	foundProgress := false
	foundEvent := false
	for _, activity := range activities {
		if activity.Source == "worker_progress" && activity.Message == "in progress" {
			foundProgress = true
			if activity.WorkerID != "worker-a" {
				t.Fatalf("worker_progress activity worker mismatch: got=%q want=%q", activity.WorkerID, "worker-a")
			}
		}
		if activity.Message == "volume scanned" {
			foundEvent = true
			if activity.WorkerID != "worker-a" {
				t.Fatalf("worker event worker mismatch: got=%q want=%q", activity.WorkerID, "worker-a")
			}
		}
	}

	if !foundProgress {
		t.Fatalf("expected worker_progress activity")
	}
	if !foundEvent {
		t.Fatalf("expected worker activity event")
	}
}

func TestHandleJobProgressUpdateWithoutJobIDTracksDetectionActivities(t *testing.T) {
	t.Parallel()

	pluginSvc, err := New(Options{})
	if err != nil {
		t.Fatalf("New: %v", err)
	}
	defer pluginSvc.Shutdown()

	pluginSvc.handleJobProgressUpdate("worker-detector", &plugin_pb.JobProgressUpdate{
		RequestId: "detect-req-1",
		JobType:   "vacuum",
		State:     plugin_pb.JobState_JOB_STATE_RUNNING,
		Stage:     "decision_summary",
		Message:   "VACUUM: No tasks created for 3 volumes",
		Activities: []*plugin_pb.ActivityEvent{
			{
				Source:  plugin_pb.ActivitySource_ACTIVITY_SOURCE_DETECTOR,
				Stage:   "decision_summary",
				Message: "VACUUM: No tasks created for 3 volumes",
			},
		},
	})

	activities := pluginSvc.ListActivities("vacuum", 0)
	if len(activities) == 0 {
		t.Fatalf("expected activity entries")
	}

	foundDetectionProgress := false
	foundDetectorEvent := false
	for _, activity := range activities {
		if activity.RequestID != "detect-req-1" {
			continue
		}
		if activity.Source == "worker_detection" {
			foundDetectionProgress = true
			if activity.WorkerID != "worker-detector" {
				t.Fatalf("worker_detection worker mismatch: got=%q want=%q", activity.WorkerID, "worker-detector")
			}
		}
		if activity.Source == "activity_source_detector" {
			foundDetectorEvent = true
			if activity.WorkerID != "worker-detector" {
				t.Fatalf("detector event worker mismatch: got=%q want=%q", activity.WorkerID, "worker-detector")
			}
		}
	}

	if !foundDetectionProgress {
		t.Fatalf("expected worker_detection activity")
	}
	if !foundDetectorEvent {
		t.Fatalf("expected detector activity event")
	}
}

func TestHandleJobCompletedCarriesWorkerIDInActivitiesAndRunHistory(t *testing.T) {
	t.Parallel()

	pluginSvc, err := New(Options{})
	if err != nil {
		t.Fatalf("New: %v", err)
	}
	defer pluginSvc.Shutdown()

	job := &plugin_pb.JobSpec{
		JobId:   "job-complete-worker",
		JobType: "vacuum",
	}
	pluginSvc.trackExecutionStart("req-complete-worker", "worker-b", job, 1)

	pluginSvc.handleJobCompleted(&plugin_pb.JobCompleted{
		RequestId: "req-complete-worker",
		JobId:     "job-complete-worker",
		JobType:   "vacuum",
		Success:   true,
		Activities: []*plugin_pb.ActivityEvent{
			{
				Source:  plugin_pb.ActivitySource_ACTIVITY_SOURCE_EXECUTOR,
				Message: "finalizer done",
				Stage:   "finalize",
			},
		},
		CompletedAt: timestamppb.Now(),
	})
	pluginSvc.Shutdown()

	activities := pluginSvc.ListActivities("vacuum", 0)
	foundWorkerEvent := false
	for _, activity := range activities {
		if activity.Message == "finalizer done" {
			foundWorkerEvent = true
			if activity.WorkerID != "worker-b" {
				t.Fatalf("worker completion event worker mismatch: got=%q want=%q", activity.WorkerID, "worker-b")
			}
		}
	}
	if !foundWorkerEvent {
		t.Fatalf("expected completion worker event activity")
	}

	history, err := pluginSvc.LoadRunHistory("vacuum")
	if err != nil {
		t.Fatalf("LoadRunHistory: %v", err)
	}
	if history == nil || len(history.SuccessfulRuns) == 0 {
		t.Fatalf("expected successful run history entry")
	}
	if history.SuccessfulRuns[0].WorkerID != "worker-b" {
		t.Fatalf("run history worker mismatch: got=%q want=%q", history.SuccessfulRuns[0].WorkerID, "worker-b")
	}
}

func TestTrackExecutionStartStoresJobPayloadDetails(t *testing.T) {
	t.Parallel()

	pluginSvc, err := New(Options{DataDir: t.TempDir()})
	if err != nil {
		t.Fatalf("New: %v", err)
	}
	defer pluginSvc.Shutdown()

	pluginSvc.trackExecutionStart("req-payload", "worker-c", &plugin_pb.JobSpec{
		JobId:   "job-payload",
		JobType: "vacuum",
		Summary: "payload summary",
		Detail:  "payload detail",
		Parameters: map[string]*plugin_pb.ConfigValue{
			"volume_id": {
				Kind: &plugin_pb.ConfigValue_Int64Value{Int64Value: 9},
			},
		},
		Labels: map[string]string{
			"source": "detector",
		},
	}, 2)
	pluginSvc.Shutdown()

	job, found := pluginSvc.GetTrackedJob("job-payload")
	if !found || job == nil {
		t.Fatalf("expected tracked job")
	}
	if job.Detail != "" {
		t.Fatalf("expected in-memory tracked job detail to be stripped, got=%q", job.Detail)
	}
	if job.Attempt != 2 {
		t.Fatalf("unexpected attempt: %d", job.Attempt)
	}
	if len(job.Labels) != 0 {
		t.Fatalf("expected in-memory labels to be stripped, got=%+v", job.Labels)
	}
	if len(job.Parameters) != 0 {
		t.Fatalf("expected in-memory parameters to be stripped, got=%+v", job.Parameters)
	}

	detail, found, err := pluginSvc.BuildJobDetail("job-payload", 100, 0)
	if err != nil {
		t.Fatalf("BuildJobDetail: %v", err)
	}
	if !found || detail == nil || detail.Job == nil {
		t.Fatalf("expected disk-backed job detail")
	}
	if detail.Job.Detail != "payload detail" {
		t.Fatalf("unexpected disk-backed detail: %q", detail.Job.Detail)
	}
	if got := detail.Job.Labels["source"]; got != "detector" {
		t.Fatalf("unexpected disk-backed label source: %q", got)
	}
	if got, ok := detail.Job.Parameters["volume_id"].(map[string]interface{}); !ok || got["int64_value"] != "9" {
		t.Fatalf("unexpected disk-backed parameters payload: %#v", detail.Job.Parameters["volume_id"])
	}
}

func TestTrackExecutionStartStoresErasureCodingExecutionPlan(t *testing.T) {
	t.Parallel()

	pluginSvc, err := New(Options{DataDir: t.TempDir()})
	if err != nil {
		t.Fatalf("New: %v", err)
	}
	defer pluginSvc.Shutdown()

	taskParams := &worker_pb.TaskParams{
		TaskId:     "task-ec-1",
		VolumeId:   29,
		Collection: "photos",
		Sources: []*worker_pb.TaskSource{
			{
				Node:       "source-a:8080",
				DataCenter: "dc1",
				Rack:       "rack1",
				VolumeId:   29,
			},
		},
		Targets: []*worker_pb.TaskTarget{
			{
				Node:       "target-a:8080",
				DataCenter: "dc1",
				Rack:       "rack2",
				VolumeId:   29,
				ShardIds:   []uint32{0, 10},
			},
			{
				Node:       "target-b:8080",
				DataCenter: "dc2",
				Rack:       "rack3",
				VolumeId:   29,
				ShardIds:   []uint32{1, 11},
			},
		},
		TaskParams: &worker_pb.TaskParams_ErasureCodingParams{
			ErasureCodingParams: &worker_pb.ErasureCodingTaskParams{
				DataShards:   10,
				ParityShards: 4,
			},
		},
	}
	payload, err := proto.Marshal(taskParams)
	if err != nil {
		t.Fatalf("Marshal task params: %v", err)
	}

	pluginSvc.trackExecutionStart("req-ec-plan", "worker-ec", &plugin_pb.JobSpec{
		JobId:   "job-ec-plan",
		JobType: "erasure_coding",
		Parameters: map[string]*plugin_pb.ConfigValue{
			"task_params_pb": {
				Kind: &plugin_pb.ConfigValue_BytesValue{BytesValue: payload},
			},
		},
	}, 1)
	pluginSvc.Shutdown()

	detail, found, err := pluginSvc.BuildJobDetail("job-ec-plan", 100, 0)
	if err != nil {
		t.Fatalf("BuildJobDetail: %v", err)
	}
	if !found || detail == nil || detail.Job == nil {
		t.Fatalf("expected disk-backed detail")
	}

	rawPlan, ok := detail.Job.Parameters["execution_plan"]
	if !ok {
		t.Fatalf("expected execution_plan in parameters, got=%+v", detail.Job.Parameters)
	}
	plan, ok := rawPlan.(map[string]interface{})
	if !ok {
		t.Fatalf("unexpected execution_plan type: %T", rawPlan)
	}
	if plan["job_type"] != "erasure_coding" {
		t.Fatalf("unexpected execution plan job type: %+v", plan["job_type"])
	}
	if plan["volume_id"] != float64(29) {
		t.Fatalf("unexpected execution plan volume id: %+v", plan["volume_id"])
	}
	targets, ok := plan["targets"].([]interface{})
	if !ok || len(targets) != 2 {
		t.Fatalf("unexpected targets in execution plan: %+v", plan["targets"])
	}
	assignments, ok := plan["shard_assignments"].([]interface{})
	if !ok || len(assignments) != 4 {
		t.Fatalf("unexpected shard assignments in execution plan: %+v", plan["shard_assignments"])
	}
	firstAssignment, ok := assignments[0].(map[string]interface{})
	if !ok {
		t.Fatalf("unexpected first assignment payload: %+v", assignments[0])
	}
	if firstAssignment["shard_id"] != float64(0) || firstAssignment["kind"] != "data" {
		t.Fatalf("unexpected first assignment: %+v", firstAssignment)
	}
}

func TestBuildJobDetailIncludesActivitiesAndRunRecord(t *testing.T) {
	t.Parallel()

	pluginSvc, err := New(Options{DataDir: t.TempDir()})
	if err != nil {
		t.Fatalf("New: %v", err)
	}
	defer pluginSvc.Shutdown()

	pluginSvc.trackExecutionStart("req-detail", "worker-z", &plugin_pb.JobSpec{
		JobId:   "job-detail",
		JobType: "vacuum",
		Summary: "detail summary",
	}, 1)
	pluginSvc.handleJobProgressUpdate("worker-z", &plugin_pb.JobProgressUpdate{
		RequestId: "req-detail",
		JobId:     "job-detail",
		JobType:   "vacuum",
		State:     plugin_pb.JobState_JOB_STATE_RUNNING,
		Stage:     "scan",
		Message:   "scanning volume",
	})
	pluginSvc.handleJobCompleted(&plugin_pb.JobCompleted{
		RequestId: "req-detail",
		JobId:     "job-detail",
		JobType:   "vacuum",
		Success:   true,
		Result: &plugin_pb.JobResult{
			Summary: "done",
			OutputValues: map[string]*plugin_pb.ConfigValue{
				"affected": {
					Kind: &plugin_pb.ConfigValue_Int64Value{Int64Value: 1},
				},
			},
		},
		CompletedAt: timestamppb.Now(),
	})
	pluginSvc.Shutdown()

	detail, found, err := pluginSvc.BuildJobDetail("job-detail", 100, 5)
	if err != nil {
		t.Fatalf("BuildJobDetail error: %v", err)
	}
	if !found || detail == nil {
		t.Fatalf("expected job detail")
	}
	if detail.Job == nil || detail.Job.JobID != "job-detail" {
		t.Fatalf("unexpected job detail payload: %+v", detail.Job)
	}
	if detail.RunRecord == nil || detail.RunRecord.JobID != "job-detail" {
		t.Fatalf("expected run record for job-detail, got=%+v", detail.RunRecord)
	}
	if len(detail.Activities) == 0 {
		t.Fatalf("expected activity timeline entries")
	}
	if detail.Job.ResultOutputValues == nil {
		t.Fatalf("expected result output values")
	}
}

func TestBuildJobDetailLoadsFromDiskWhenMemoryCleared(t *testing.T) {
	t.Parallel()

	pluginSvc, err := New(Options{DataDir: t.TempDir()})
	if err != nil {
		t.Fatalf("New: %v", err)
	}
	defer pluginSvc.Shutdown()

	pluginSvc.trackExecutionStart("req-disk", "worker-d", &plugin_pb.JobSpec{
		JobId:   "job-disk",
		JobType: "vacuum",
		Summary: "disk summary",
		Detail:  "disk detail payload",
	}, 1)
	pluginSvc.Shutdown()

	pluginSvc.jobsMu.Lock()
	pluginSvc.jobs = map[string]*TrackedJob{}
	pluginSvc.jobsMu.Unlock()
	pluginSvc.activitiesMu.Lock()
	pluginSvc.activities = nil
	pluginSvc.activitiesMu.Unlock()

	detail, found, err := pluginSvc.BuildJobDetail("job-disk", 100, 0)
	if err != nil {
		t.Fatalf("BuildJobDetail: %v", err)
	}
	if !found || detail == nil || detail.Job == nil {
		t.Fatalf("expected detail from disk")
	}
	if detail.Job.Detail != "disk detail payload" {
		t.Fatalf("unexpected disk detail payload: %q", detail.Job.Detail)
	}
}

weed/admin/plugin/plugin_scheduler.go (new file, 945 lines)
@@ -0,0 +1,945 @@
package plugin
|
||||
|
||||
import (
|
||||
"context"
|
||||
"errors"
|
||||
"fmt"
|
||||
"strings"
|
||||
"sync"
|
||||
"time"
|
||||
|
||||
"github.com/seaweedfs/seaweedfs/weed/glog"
|
||||
"github.com/seaweedfs/seaweedfs/weed/pb/plugin_pb"
|
||||
"google.golang.org/protobuf/types/known/timestamppb"
|
||||
)
|
||||
|
||||
var errExecutorAtCapacity = errors.New("executor is at capacity")
|
||||
|
||||
const (
|
||||
defaultSchedulerTick = 5 * time.Second
|
||||
defaultScheduledDetectionInterval = 300 * time.Second
|
||||
defaultScheduledDetectionTimeout = 45 * time.Second
|
||||
defaultScheduledExecutionTimeout = 90 * time.Second
|
||||
defaultScheduledMaxResults int32 = 1000
|
||||
defaultScheduledExecutionConcurrency = 1
|
||||
defaultScheduledPerWorkerConcurrency = 1
|
||||
maxScheduledExecutionConcurrency = 128
|
||||
defaultScheduledRetryBackoff = 5 * time.Second
|
||||
defaultClusterContextTimeout = 10 * time.Second
|
||||
defaultWaitingBacklogFloor = 8
|
||||
defaultWaitingBacklogMultiplier = 4
|
||||
)
|
||||
|
||||
type schedulerPolicy struct {
|
||||
DetectionInterval time.Duration
|
||||
DetectionTimeout time.Duration
|
||||
ExecutionTimeout time.Duration
|
||||
RetryBackoff time.Duration
|
||||
MaxResults int32
|
||||
ExecutionConcurrency int
|
||||
PerWorkerConcurrency int
|
||||
RetryLimit int
|
||||
ExecutorReserveBackoff time.Duration
|
||||
}
|
||||
|
||||
func (r *Plugin) schedulerLoop() {
|
||||
defer r.wg.Done()
|
||||
ticker := time.NewTicker(r.schedulerTick)
|
||||
defer ticker.Stop()
|
||||
|
||||
// Try once immediately on startup.
|
||||
r.runSchedulerTick()
|
||||
|
||||
for {
|
||||
select {
|
||||
case <-r.shutdownCh:
|
||||
return
|
||||
case <-ticker.C:
|
||||
r.runSchedulerTick()
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func (r *Plugin) runSchedulerTick() {
|
||||
jobTypes := r.registry.DetectableJobTypes()
|
||||
if len(jobTypes) == 0 {
|
||||
return
|
||||
}
|
||||
|
||||
active := make(map[string]struct{}, len(jobTypes))
|
||||
for _, jobType := range jobTypes {
|
||||
active[jobType] = struct{}{}
|
||||
|
||||
policy, enabled, err := r.loadSchedulerPolicy(jobType)
|
||||
if err != nil {
|
||||
glog.Warningf("Plugin scheduler failed to load policy for %s: %v", jobType, err)
|
||||
continue
|
||||
}
|
||||
if !enabled {
|
||||
r.clearSchedulerJobType(jobType)
|
||||
continue
|
||||
}
|
||||
|
||||
if !r.markDetectionDue(jobType, policy.DetectionInterval) {
|
||||
continue
|
||||
}
|
||||
|
||||
r.wg.Add(1)
|
||||
go func(jt string, p schedulerPolicy) {
|
||||
defer r.wg.Done()
|
||||
r.runScheduledDetection(jt, p)
|
||||
}(jobType, policy)
|
||||
}
|
||||
|
||||
r.pruneSchedulerState(active)
|
||||
r.pruneDetectorLeases(active)
|
||||
}
|
func (r *Plugin) loadSchedulerPolicy(jobType string) (schedulerPolicy, bool, error) {
	cfg, err := r.store.LoadJobTypeConfig(jobType)
	if err != nil {
		return schedulerPolicy{}, false, err
	}
	descriptor, err := r.store.LoadDescriptor(jobType)
	if err != nil {
		return schedulerPolicy{}, false, err
	}

	adminRuntime := deriveSchedulerAdminRuntime(cfg, descriptor)
	if adminRuntime == nil {
		return schedulerPolicy{}, false, nil
	}
	if !adminRuntime.Enabled {
		return schedulerPolicy{}, false, nil
	}

	policy := schedulerPolicy{
		DetectionInterval:      durationFromSeconds(adminRuntime.DetectionIntervalSeconds, defaultScheduledDetectionInterval),
		DetectionTimeout:       durationFromSeconds(adminRuntime.DetectionTimeoutSeconds, defaultScheduledDetectionTimeout),
		ExecutionTimeout:       defaultScheduledExecutionTimeout,
		RetryBackoff:           durationFromSeconds(adminRuntime.RetryBackoffSeconds, defaultScheduledRetryBackoff),
		MaxResults:             adminRuntime.MaxJobsPerDetection,
		ExecutionConcurrency:   int(adminRuntime.GlobalExecutionConcurrency),
		PerWorkerConcurrency:   int(adminRuntime.PerWorkerExecutionConcurrency),
		RetryLimit:             int(adminRuntime.RetryLimit),
		ExecutorReserveBackoff: 200 * time.Millisecond,
	}

	if policy.DetectionInterval < r.schedulerTick {
		policy.DetectionInterval = r.schedulerTick
	}
	if policy.MaxResults <= 0 {
		policy.MaxResults = defaultScheduledMaxResults
	}
	if policy.ExecutionConcurrency <= 0 {
		policy.ExecutionConcurrency = defaultScheduledExecutionConcurrency
	}
	if policy.ExecutionConcurrency > maxScheduledExecutionConcurrency {
		policy.ExecutionConcurrency = maxScheduledExecutionConcurrency
	}
	if policy.PerWorkerConcurrency <= 0 {
		policy.PerWorkerConcurrency = defaultScheduledPerWorkerConcurrency
	}
	if policy.PerWorkerConcurrency > policy.ExecutionConcurrency {
		policy.PerWorkerConcurrency = policy.ExecutionConcurrency
	}
	if policy.RetryLimit < 0 {
		policy.RetryLimit = 0
	}

	// The plugin protocol currently has only the detection timeout in admin settings.
	execTimeout := time.Duration(adminRuntime.DetectionTimeoutSeconds*2) * time.Second
	if execTimeout < defaultScheduledExecutionTimeout {
		execTimeout = defaultScheduledExecutionTimeout
	}
	policy.ExecutionTimeout = execTimeout

	return policy, true, nil
}
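The clamping in `loadSchedulerPolicy` establishes an ordering invariant: per-worker concurrency never exceeds global concurrency, which in turn never exceeds the hard cap. A standalone sketch of that clamp chain follows; the constants and the `clampConcurrency` helper are illustrative stand-ins, not the package's actual identifiers.

```go
package main

import "fmt"

// Illustrative stand-ins for the package-level defaults.
const (
	defaultExecutionConcurrency = 2
	maxExecutionConcurrency     = 8
	defaultPerWorkerConcurrency = 1
)

// clampConcurrency applies the same ordering as loadSchedulerPolicy:
// global is defaulted then capped, and per-worker is defaulted then
// bounded by global, so one worker can never exceed the fleet limit.
func clampConcurrency(global, perWorker int) (int, int) {
	if global <= 0 {
		global = defaultExecutionConcurrency
	}
	if global > maxExecutionConcurrency {
		global = maxExecutionConcurrency
	}
	if perWorker <= 0 {
		perWorker = defaultPerWorkerConcurrency
	}
	if perWorker > global {
		perWorker = global
	}
	return global, perWorker
}

func main() {
	fmt.Println(clampConcurrency(0, 0))   // 2 1
	fmt.Println(clampConcurrency(100, 5)) // 8 5
	fmt.Println(clampConcurrency(3, 10))  // 3 3
}
```

The order matters: capping global before bounding per-worker is what guarantees the invariant holds even when an operator sets a per-worker value larger than the capped global.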
func (r *Plugin) ListSchedulerStates() ([]SchedulerJobTypeState, error) {
	jobTypes, err := r.ListKnownJobTypes()
	if err != nil {
		return nil, err
	}

	r.schedulerMu.Lock()
	nextDetectionAt := make(map[string]time.Time, len(r.nextDetectionAt))
	for jobType, nextRun := range r.nextDetectionAt {
		nextDetectionAt[jobType] = nextRun
	}
	detectionInFlight := make(map[string]bool, len(r.detectionInFlight))
	for jobType, inFlight := range r.detectionInFlight {
		detectionInFlight[jobType] = inFlight
	}
	r.schedulerMu.Unlock()

	states := make([]SchedulerJobTypeState, 0, len(jobTypes))
	for _, jobType := range jobTypes {
		state := SchedulerJobTypeState{
			JobType:           jobType,
			DetectionInFlight: detectionInFlight[jobType],
		}

		if nextRun, ok := nextDetectionAt[jobType]; ok && !nextRun.IsZero() {
			nextRunUTC := nextRun.UTC()
			state.NextDetectionAt = &nextRunUTC
		}

		policy, enabled, loadErr := r.loadSchedulerPolicy(jobType)
		if loadErr != nil {
			state.PolicyError = loadErr.Error()
		} else {
			state.Enabled = enabled
			if enabled {
				state.DetectionIntervalSeconds = secondsFromDuration(policy.DetectionInterval)
				state.DetectionTimeoutSeconds = secondsFromDuration(policy.DetectionTimeout)
				state.ExecutionTimeoutSeconds = secondsFromDuration(policy.ExecutionTimeout)
				state.MaxJobsPerDetection = policy.MaxResults
				state.GlobalExecutionConcurrency = policy.ExecutionConcurrency
				state.PerWorkerExecutionConcurrency = policy.PerWorkerConcurrency
				state.RetryLimit = policy.RetryLimit
				state.RetryBackoffSeconds = secondsFromDuration(policy.RetryBackoff)
			}
		}

		leasedWorkerID := r.getDetectorLease(jobType)
		if leasedWorkerID != "" {
			state.DetectorWorkerID = leasedWorkerID
			if worker, ok := r.registry.Get(leasedWorkerID); ok {
				if capability := worker.Capabilities[jobType]; capability != nil && capability.CanDetect {
					state.DetectorAvailable = true
				}
			}
		}
		if state.DetectorWorkerID == "" {
			detector, detectorErr := r.registry.PickDetector(jobType)
			if detectorErr == nil && detector != nil {
				state.DetectorAvailable = true
				state.DetectorWorkerID = detector.WorkerID
			}
		}

		executors, executorErr := r.registry.ListExecutors(jobType)
		if executorErr == nil {
			state.ExecutorWorkerCount = len(executors)
		}

		states = append(states, state)
	}

	return states, nil
}
func deriveSchedulerAdminRuntime(
	cfg *plugin_pb.PersistedJobTypeConfig,
	descriptor *plugin_pb.JobTypeDescriptor,
) *plugin_pb.AdminRuntimeConfig {
	if cfg != nil && cfg.AdminRuntime != nil {
		adminConfig := *cfg.AdminRuntime
		return &adminConfig
	}

	if descriptor == nil || descriptor.AdminRuntimeDefaults == nil {
		return nil
	}

	defaults := descriptor.AdminRuntimeDefaults
	return &plugin_pb.AdminRuntimeConfig{
		Enabled:                       defaults.Enabled,
		DetectionIntervalSeconds:      defaults.DetectionIntervalSeconds,
		DetectionTimeoutSeconds:       defaults.DetectionTimeoutSeconds,
		MaxJobsPerDetection:           defaults.MaxJobsPerDetection,
		GlobalExecutionConcurrency:    defaults.GlobalExecutionConcurrency,
		PerWorkerExecutionConcurrency: defaults.PerWorkerExecutionConcurrency,
		RetryLimit:                    defaults.RetryLimit,
		RetryBackoffSeconds:           defaults.RetryBackoffSeconds,
	}
}
func (r *Plugin) markDetectionDue(jobType string, interval time.Duration) bool {
	now := time.Now().UTC()

	r.schedulerMu.Lock()
	defer r.schedulerMu.Unlock()

	if r.detectionInFlight[jobType] {
		return false
	}

	nextRun, exists := r.nextDetectionAt[jobType]
	if exists && now.Before(nextRun) {
		return false
	}

	r.nextDetectionAt[jobType] = now.Add(interval)
	r.detectionInFlight[jobType] = true
	return true
}

func (r *Plugin) finishDetection(jobType string) {
	r.schedulerMu.Lock()
	delete(r.detectionInFlight, jobType)
	r.schedulerMu.Unlock()
}
func (r *Plugin) pruneSchedulerState(activeJobTypes map[string]struct{}) {
	r.schedulerMu.Lock()
	defer r.schedulerMu.Unlock()

	for jobType := range r.nextDetectionAt {
		if _, ok := activeJobTypes[jobType]; !ok {
			delete(r.nextDetectionAt, jobType)
			delete(r.detectionInFlight, jobType)
		}
	}
}

func (r *Plugin) clearSchedulerJobType(jobType string) {
	r.schedulerMu.Lock()
	delete(r.nextDetectionAt, jobType)
	delete(r.detectionInFlight, jobType)
	r.schedulerMu.Unlock()
	r.clearDetectorLease(jobType, "")
}

func (r *Plugin) pruneDetectorLeases(activeJobTypes map[string]struct{}) {
	r.detectorLeaseMu.Lock()
	defer r.detectorLeaseMu.Unlock()

	for jobType := range r.detectorLeases {
		if _, ok := activeJobTypes[jobType]; !ok {
			delete(r.detectorLeases, jobType)
		}
	}
}
func (r *Plugin) runScheduledDetection(jobType string, policy schedulerPolicy) {
	defer r.finishDetection(jobType)

	start := time.Now().UTC()
	r.appendActivity(JobActivity{
		JobType:    jobType,
		Source:     "admin_scheduler",
		Message:    "scheduled detection started",
		Stage:      "detecting",
		OccurredAt: timeToPtr(start),
	})

	if skip, waitingCount, waitingThreshold := r.shouldSkipDetectionForWaitingJobs(jobType, policy); skip {
		r.appendActivity(JobActivity{
			JobType:    jobType,
			Source:     "admin_scheduler",
			Message:    fmt.Sprintf("scheduled detection skipped: waiting backlog %d reached threshold %d", waitingCount, waitingThreshold),
			Stage:      "skipped_waiting_backlog",
			OccurredAt: timeToPtr(time.Now().UTC()),
		})
		return
	}

	clusterContext, err := r.loadSchedulerClusterContext()
	if err != nil {
		r.appendActivity(JobActivity{
			JobType:    jobType,
			Source:     "admin_scheduler",
			Message:    fmt.Sprintf("scheduled detection aborted: %v", err),
			Stage:      "failed",
			OccurredAt: timeToPtr(time.Now().UTC()),
		})
		return
	}

	ctx, cancel := context.WithTimeout(context.Background(), policy.DetectionTimeout)
	proposals, err := r.RunDetection(ctx, jobType, clusterContext, policy.MaxResults)
	cancel()
	if err != nil {
		r.appendActivity(JobActivity{
			JobType:    jobType,
			Source:     "admin_scheduler",
			Message:    fmt.Sprintf("scheduled detection failed: %v", err),
			Stage:      "failed",
			OccurredAt: timeToPtr(time.Now().UTC()),
		})
		return
	}

	r.appendActivity(JobActivity{
		JobType:    jobType,
		Source:     "admin_scheduler",
		Message:    fmt.Sprintf("scheduled detection completed: %d proposal(s)", len(proposals)),
		Stage:      "detected",
		OccurredAt: timeToPtr(time.Now().UTC()),
	})

	filteredByActive, skippedActive := r.filterProposalsWithActiveJobs(jobType, proposals)
	if skippedActive > 0 {
		r.appendActivity(JobActivity{
			JobType:    jobType,
			Source:     "admin_scheduler",
			Message:    fmt.Sprintf("scheduled detection skipped %d proposal(s) due to active assigned/running jobs", skippedActive),
			Stage:      "deduped_active_jobs",
			OccurredAt: timeToPtr(time.Now().UTC()),
		})
	}

	if len(filteredByActive) == 0 {
		return
	}

	filtered := r.filterScheduledProposals(filteredByActive)
	if len(filtered) != len(filteredByActive) {
		r.appendActivity(JobActivity{
			JobType:    jobType,
			Source:     "admin_scheduler",
			Message:    fmt.Sprintf("scheduled detection deduped %d proposal(s) within this run", len(filteredByActive)-len(filtered)),
			Stage:      "deduped",
			OccurredAt: timeToPtr(time.Now().UTC()),
		})
	}

	if len(filtered) == 0 {
		return
	}

	r.dispatchScheduledProposals(jobType, filtered, clusterContext, policy)
}
func (r *Plugin) loadSchedulerClusterContext() (*plugin_pb.ClusterContext, error) {
	if r.clusterContextProvider == nil {
		return nil, fmt.Errorf("cluster context provider is not configured")
	}

	ctx, cancel := context.WithTimeout(context.Background(), defaultClusterContextTimeout)
	defer cancel()

	clusterContext, err := r.clusterContextProvider(ctx)
	if err != nil {
		return nil, err
	}
	if clusterContext == nil {
		return nil, fmt.Errorf("cluster context provider returned nil")
	}
	return clusterContext, nil
}
func (r *Plugin) dispatchScheduledProposals(
	jobType string,
	proposals []*plugin_pb.JobProposal,
	clusterContext *plugin_pb.ClusterContext,
	policy schedulerPolicy,
) {
	jobQueue := make(chan *plugin_pb.JobSpec, len(proposals))
	for index, proposal := range proposals {
		job := buildScheduledJobSpec(jobType, proposal, index)
		r.trackExecutionQueued(job)
		select {
		case <-r.shutdownCh:
			close(jobQueue)
			return
		default:
			jobQueue <- job
		}
	}
	close(jobQueue)

	var wg sync.WaitGroup
	var statsMu sync.Mutex
	successCount := 0
	errorCount := 0

	workerCount := policy.ExecutionConcurrency
	if workerCount < 1 {
		workerCount = 1
	}

	for i := 0; i < workerCount; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()

			for job := range jobQueue {
				select {
				case <-r.shutdownCh:
					return
				default:
				}

				for {
					select {
					case <-r.shutdownCh:
						return
					default:
					}

					executor, release, reserveErr := r.reserveScheduledExecutor(jobType, policy)
					if reserveErr != nil {
						select {
						case <-r.shutdownCh:
							return
						default:
						}
						statsMu.Lock()
						errorCount++
						statsMu.Unlock()
						r.appendActivity(JobActivity{
							JobType:    jobType,
							Source:     "admin_scheduler",
							Message:    fmt.Sprintf("scheduled execution reservation failed: %v", reserveErr),
							Stage:      "failed",
							OccurredAt: timeToPtr(time.Now().UTC()),
						})
						break
					}

					err := r.executeScheduledJobWithExecutor(executor, job, clusterContext, policy)
					release()
					if errors.Is(err, errExecutorAtCapacity) {
						r.trackExecutionQueued(job)
						if !waitForShutdownOrTimer(r.shutdownCh, policy.ExecutorReserveBackoff) {
							return
						}
						continue
					}
					if err != nil {
						statsMu.Lock()
						errorCount++
						statsMu.Unlock()
						r.appendActivity(JobActivity{
							JobID:      job.JobId,
							JobType:    job.JobType,
							Source:     "admin_scheduler",
							Message:    fmt.Sprintf("scheduled execution failed: %v", err),
							Stage:      "failed",
							OccurredAt: timeToPtr(time.Now().UTC()),
						})
						break
					}

					statsMu.Lock()
					successCount++
					statsMu.Unlock()
					break
				}
			}
		}()
	}

	wg.Wait()

	r.appendActivity(JobActivity{
		JobType:    jobType,
		Source:     "admin_scheduler",
		Message:    fmt.Sprintf("scheduled execution finished: success=%d error=%d", successCount, errorCount),
		Stage:      "executed",
		OccurredAt: timeToPtr(time.Now().UTC()),
	})
}
func (r *Plugin) reserveScheduledExecutor(
	jobType string,
	policy schedulerPolicy,
) (*WorkerSession, func(), error) {
	deadline := time.Now().Add(policy.ExecutionTimeout)
	if policy.ExecutionTimeout <= 0 {
		deadline = time.Now().Add(10 * time.Minute) // Default cap
	}

	for {
		select {
		case <-r.shutdownCh:
			return nil, nil, fmt.Errorf("plugin is shutting down")
		default:
		}

		if time.Now().After(deadline) {
			return nil, nil, fmt.Errorf("timed out waiting for executor capacity for %s", jobType)
		}

		executors, err := r.registry.ListExecutors(jobType)
		if err != nil {
			if !waitForShutdownOrTimer(r.shutdownCh, policy.ExecutorReserveBackoff) {
				return nil, nil, fmt.Errorf("plugin is shutting down")
			}
			continue
		}

		for _, executor := range executors {
			release, ok := r.tryReserveExecutorCapacity(executor, jobType, policy)
			if !ok {
				continue
			}
			return executor, release, nil
		}

		if !waitForShutdownOrTimer(r.shutdownCh, policy.ExecutorReserveBackoff) {
			return nil, nil, fmt.Errorf("plugin is shutting down")
		}
	}
}
func (r *Plugin) tryReserveExecutorCapacity(
	executor *WorkerSession,
	jobType string,
	policy schedulerPolicy,
) (func(), bool) {
	if executor == nil || strings.TrimSpace(executor.WorkerID) == "" {
		return nil, false
	}

	limit := schedulerWorkerExecutionLimit(executor, jobType, policy)
	if limit <= 0 {
		return nil, false
	}
	heartbeatUsed := 0
	if executor.Heartbeat != nil && executor.Heartbeat.ExecutionSlotsUsed > 0 {
		heartbeatUsed = int(executor.Heartbeat.ExecutionSlotsUsed)
	}

	workerID := strings.TrimSpace(executor.WorkerID)

	r.schedulerExecMu.Lock()
	reserved := r.schedulerExecReservations[workerID]
	if heartbeatUsed+reserved >= limit {
		r.schedulerExecMu.Unlock()
		return nil, false
	}
	r.schedulerExecReservations[workerID] = reserved + 1
	r.schedulerExecMu.Unlock()

	release := func() {
		r.releaseExecutorCapacity(workerID)
	}
	return release, true
}
func (r *Plugin) releaseExecutorCapacity(workerID string) {
	workerID = strings.TrimSpace(workerID)
	if workerID == "" {
		return
	}

	r.schedulerExecMu.Lock()
	defer r.schedulerExecMu.Unlock()

	current := r.schedulerExecReservations[workerID]
	if current <= 1 {
		delete(r.schedulerExecReservations, workerID)
		return
	}
	r.schedulerExecReservations[workerID] = current - 1
}
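The reserve/release pair above is a per-worker slot counter guarded by a mutex: reservation admits a slot only while heartbeat-reported usage plus locally reserved slots stay below the limit, and release deletes the map entry when the count drops to zero so idle workers are not retained. A minimal standalone sketch of the same bookkeeping (the `reservations` type and its method names are illustrative, not the plugin's actual identifiers):

```go
package main

import (
	"fmt"
	"sync"
)

// reservations mirrors the schedulerExecReservations map guarded by
// schedulerExecMu: in-flight slot counts keyed by worker ID.
type reservations struct {
	mu    sync.Mutex
	slots map[string]int
}

// tryReserve admits a slot only while used+reserved stays below the
// limit, returning a release func on success.
func (r *reservations) tryReserve(workerID string, used, limit int) (func(), bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if used+r.slots[workerID] >= limit {
		return nil, false
	}
	r.slots[workerID]++
	return func() { r.release(workerID) }, true
}

// release decrements the count, deleting the entry at zero so the map
// does not accumulate idle workers.
func (r *reservations) release(workerID string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.slots[workerID] <= 1 {
		delete(r.slots, workerID)
		return
	}
	r.slots[workerID]--
}

func main() {
	res := &reservations{slots: map[string]int{}}
	rel1, ok1 := res.tryReserve("worker-a", 0, 2)
	_, ok2 := res.tryReserve("worker-a", 0, 2)
	_, ok3 := res.tryReserve("worker-a", 0, 2) // at limit now
	fmt.Println(ok1, ok2, ok3)                 // true true false
	rel1()
	_, ok4 := res.tryReserve("worker-a", 0, 2)
	fmt.Println(ok4) // true
}
```

Folding the heartbeat's `used` figure into the admission check is what lets the admin-side counter stay conservative even when workers also run jobs the scheduler did not dispatch.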
func schedulerWorkerExecutionLimit(executor *WorkerSession, jobType string, policy schedulerPolicy) int {
	limit := policy.PerWorkerConcurrency
	if limit <= 0 {
		limit = defaultScheduledPerWorkerConcurrency
	}

	if capability := executor.Capabilities[jobType]; capability != nil && capability.MaxExecutionConcurrency > 0 {
		capLimit := int(capability.MaxExecutionConcurrency)
		if capLimit < limit {
			limit = capLimit
		}
	}

	if executor.Heartbeat != nil && executor.Heartbeat.ExecutionSlotsTotal > 0 {
		heartbeatLimit := int(executor.Heartbeat.ExecutionSlotsTotal)
		if heartbeatLimit < limit {
			limit = heartbeatLimit
		}
	}

	if limit < 0 {
		return 0
	}
	return limit
}
func (r *Plugin) executeScheduledJobWithExecutor(
	executor *WorkerSession,
	job *plugin_pb.JobSpec,
	clusterContext *plugin_pb.ClusterContext,
	policy schedulerPolicy,
) error {
	maxAttempts := policy.RetryLimit + 1
	if maxAttempts < 1 {
		maxAttempts = 1
	}

	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		select {
		case <-r.shutdownCh:
			return fmt.Errorf("plugin is shutting down")
		default:
		}

		execCtx, cancel := context.WithTimeout(context.Background(), policy.ExecutionTimeout)
		_, err := r.executeJobWithExecutor(execCtx, executor, job, clusterContext, int32(attempt))
		cancel()
		if err == nil {
			return nil
		}
		if isExecutorAtCapacityError(err) {
			return errExecutorAtCapacity
		}
		lastErr = err

		if attempt < maxAttempts {
			r.appendActivity(JobActivity{
				JobID:      job.JobId,
				JobType:    job.JobType,
				Source:     "admin_scheduler",
				Message:    fmt.Sprintf("retrying job attempt %d/%d after error: %v", attempt, maxAttempts, err),
				Stage:      "retry",
				OccurredAt: timeToPtr(time.Now().UTC()),
			})
			if !waitForShutdownOrTimer(r.shutdownCh, policy.RetryBackoff) {
				return fmt.Errorf("plugin is shutting down")
			}
		}
	}

	if lastErr == nil {
		lastErr = fmt.Errorf("execution failed without an explicit error")
	}
	return lastErr
}
func (r *Plugin) shouldSkipDetectionForWaitingJobs(jobType string, policy schedulerPolicy) (bool, int, int) {
	waitingCount := r.countWaitingTrackedJobs(jobType)
	threshold := waitingBacklogThreshold(policy)
	if threshold <= 0 {
		return false, waitingCount, threshold
	}
	return waitingCount >= threshold, waitingCount, threshold
}

func (r *Plugin) countWaitingTrackedJobs(jobType string) int {
	normalizedJobType := strings.TrimSpace(jobType)
	if normalizedJobType == "" {
		return 0
	}

	waiting := 0
	r.jobsMu.RLock()
	for _, job := range r.jobs {
		if job == nil {
			continue
		}
		if strings.TrimSpace(job.JobType) != normalizedJobType {
			continue
		}
		if !isWaitingTrackedJobState(job.State) {
			continue
		}
		waiting++
	}
	r.jobsMu.RUnlock()

	return waiting
}
func waitingBacklogThreshold(policy schedulerPolicy) int {
	concurrency := policy.ExecutionConcurrency
	if concurrency <= 0 {
		concurrency = defaultScheduledExecutionConcurrency
	}
	threshold := concurrency * defaultWaitingBacklogMultiplier
	if threshold < defaultWaitingBacklogFloor {
		threshold = defaultWaitingBacklogFloor
	}
	if policy.MaxResults > 0 && threshold > int(policy.MaxResults) {
		threshold = int(policy.MaxResults)
	}
	return threshold
}

func isExecutorAtCapacityError(err error) bool {
	if err == nil {
		return false
	}
	if errors.Is(err, errExecutorAtCapacity) {
		return true
	}
	return strings.Contains(strings.ToLower(err.Error()), "executor is at capacity")
}
func buildScheduledJobSpec(jobType string, proposal *plugin_pb.JobProposal, index int) *plugin_pb.JobSpec {
	now := timestamppb.Now()

	jobID := fmt.Sprintf("%s-scheduled-%d-%d", jobType, now.AsTime().UnixNano(), index)

	job := &plugin_pb.JobSpec{
		JobId:       jobID,
		JobType:     jobType,
		Priority:    plugin_pb.JobPriority_JOB_PRIORITY_NORMAL,
		Parameters:  map[string]*plugin_pb.ConfigValue{},
		Labels:      map[string]string{},
		CreatedAt:   now,
		ScheduledAt: now,
	}

	if proposal == nil {
		return job
	}

	if proposal.JobType != "" {
		job.JobType = proposal.JobType
	}
	job.Summary = proposal.Summary
	job.Detail = proposal.Detail
	if proposal.Priority != plugin_pb.JobPriority_JOB_PRIORITY_UNSPECIFIED {
		job.Priority = proposal.Priority
	}
	job.DedupeKey = proposal.DedupeKey
	job.Parameters = CloneConfigValueMap(proposal.Parameters)
	if proposal.Labels != nil {
		job.Labels = make(map[string]string, len(proposal.Labels))
		for k, v := range proposal.Labels {
			job.Labels[k] = v
		}
	}
	if proposal.NotBefore != nil {
		job.ScheduledAt = proposal.NotBefore
	}

	return job
}
func durationFromSeconds(seconds int32, defaultValue time.Duration) time.Duration {
	if seconds <= 0 {
		return defaultValue
	}
	return time.Duration(seconds) * time.Second
}

func secondsFromDuration(duration time.Duration) int32 {
	if duration <= 0 {
		return 0
	}
	return int32(duration / time.Second)
}
func waitForShutdownOrTimer(shutdown <-chan struct{}, duration time.Duration) bool {
	if duration <= 0 {
		return true
	}

	timer := time.NewTimer(duration)
	defer timer.Stop()

	select {
	case <-shutdown:
		return false
	case <-timer.C:
		return true
	}
}
func (r *Plugin) filterProposalsWithActiveJobs(jobType string, proposals []*plugin_pb.JobProposal) ([]*plugin_pb.JobProposal, int) {
	if len(proposals) == 0 {
		return proposals, 0
	}

	activeKeys := make(map[string]struct{})
	r.jobsMu.RLock()
	for _, job := range r.jobs {
		if job == nil {
			continue
		}
		if strings.TrimSpace(job.JobType) != strings.TrimSpace(jobType) {
			continue
		}
		if !isActiveTrackedJobState(job.State) {
			continue
		}

		key := strings.TrimSpace(job.DedupeKey)
		if key == "" {
			key = strings.TrimSpace(job.JobID)
		}
		if key == "" {
			continue
		}
		activeKeys[key] = struct{}{}
	}
	r.jobsMu.RUnlock()

	if len(activeKeys) == 0 {
		return proposals, 0
	}

	filtered := make([]*plugin_pb.JobProposal, 0, len(proposals))
	skipped := 0
	for _, proposal := range proposals {
		if proposal == nil {
			continue
		}
		key := proposalExecutionKey(proposal)
		if key != "" {
			if _, exists := activeKeys[key]; exists {
				skipped++
				continue
			}
		}
		filtered = append(filtered, proposal)
	}

	return filtered, skipped
}
func proposalExecutionKey(proposal *plugin_pb.JobProposal) string {
	if proposal == nil {
		return ""
	}
	key := strings.TrimSpace(proposal.DedupeKey)
	if key != "" {
		return key
	}
	return strings.TrimSpace(proposal.ProposalId)
}

func isActiveTrackedJobState(state string) bool {
	normalized := strings.ToLower(strings.TrimSpace(state))
	switch normalized {
	case "pending", "assigned", "running", "in_progress", "job_state_pending", "job_state_assigned", "job_state_running":
		return true
	default:
		return false
	}
}

func isWaitingTrackedJobState(state string) bool {
	normalized := strings.ToLower(strings.TrimSpace(state))
	return normalized == "pending" || normalized == "job_state_pending"
}
func (r *Plugin) filterScheduledProposals(proposals []*plugin_pb.JobProposal) []*plugin_pb.JobProposal {
	filtered := make([]*plugin_pb.JobProposal, 0, len(proposals))
	seenInRun := make(map[string]struct{}, len(proposals))

	for _, proposal := range proposals {
		if proposal == nil {
			continue
		}

		key := proposal.DedupeKey
		if key == "" {
			key = proposal.ProposalId
		}
		if key == "" {
			filtered = append(filtered, proposal)
			continue
		}

		if _, exists := seenInRun[key]; exists {
			continue
		}

		seenInRun[key] = struct{}{}
		filtered = append(filtered, proposal)
	}

	return filtered
}
weed/admin/plugin/plugin_scheduler_test.go (new file, 583 lines)
@@ -0,0 +1,583 @@
|
||||
package plugin
|
||||
|
||||
import (
|
||||
"fmt"
|
||||
"testing"
|
||||
"time"
|
||||
|
||||
"github.com/seaweedfs/seaweedfs/weed/pb/plugin_pb"
|
||||
)
|
||||
|
||||
func TestLoadSchedulerPolicyUsesAdminConfig(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
pluginSvc, err := New(Options{})
|
||||
if err != nil {
|
||||
t.Fatalf("New: %v", err)
|
||||
}
|
||||
defer pluginSvc.Shutdown()
|
||||
|
||||
err = pluginSvc.SaveJobTypeConfig(&plugin_pb.PersistedJobTypeConfig{
|
||||
JobType: "vacuum",
|
||||
AdminRuntime: &plugin_pb.AdminRuntimeConfig{
|
||||
Enabled: true,
|
||||
DetectionIntervalSeconds: 30,
|
||||
DetectionTimeoutSeconds: 20,
|
||||
MaxJobsPerDetection: 123,
|
||||
GlobalExecutionConcurrency: 5,
|
||||
PerWorkerExecutionConcurrency: 2,
|
||||
RetryLimit: 4,
|
||||
RetryBackoffSeconds: 7,
|
||||
},
|
||||
})
|
||||
if err != nil {
|
||||
t.Fatalf("SaveJobTypeConfig: %v", err)
|
||||
}
|
||||
|
||||
policy, enabled, err := pluginSvc.loadSchedulerPolicy("vacuum")
|
||||
if err != nil {
|
||||
t.Fatalf("loadSchedulerPolicy: %v", err)
|
||||
}
|
||||
if !enabled {
|
||||
t.Fatalf("expected enabled policy")
|
||||
}
|
||||
if policy.MaxResults != 123 {
|
||||
t.Fatalf("unexpected max results: got=%d", policy.MaxResults)
|
||||
}
|
||||
if policy.ExecutionConcurrency != 5 {
|
||||
t.Fatalf("unexpected global concurrency: got=%d", policy.ExecutionConcurrency)
|
||||
}
|
||||
if policy.PerWorkerConcurrency != 2 {
|
||||
t.Fatalf("unexpected per-worker concurrency: got=%d", policy.PerWorkerConcurrency)
|
||||
}
|
||||
if policy.RetryLimit != 4 {
|
||||
t.Fatalf("unexpected retry limit: got=%d", policy.RetryLimit)
|
||||
}
|
||||
}
|
||||
|
||||
func TestLoadSchedulerPolicyUsesDescriptorDefaultsWhenConfigMissing(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
pluginSvc, err := New(Options{})
|
||||
if err != nil {
|
||||
t.Fatalf("New: %v", err)
|
||||
}
|
||||
defer pluginSvc.Shutdown()
|
||||
|
||||
err = pluginSvc.store.SaveDescriptor("ec", &plugin_pb.JobTypeDescriptor{
|
||||
JobType: "ec",
|
||||
AdminRuntimeDefaults: &plugin_pb.AdminRuntimeDefaults{
|
||||
Enabled: true,
|
||||
DetectionIntervalSeconds: 60,
|
||||
DetectionTimeoutSeconds: 25,
|
||||
MaxJobsPerDetection: 30,
|
||||
GlobalExecutionConcurrency: 4,
|
||||
PerWorkerExecutionConcurrency: 2,
|
||||
RetryLimit: 3,
|
||||
RetryBackoffSeconds: 6,
|
||||
},
|
||||
})
|
||||
if err != nil {
|
||||
t.Fatalf("SaveDescriptor: %v", err)
|
||||
}
|
||||
|
||||
policy, enabled, err := pluginSvc.loadSchedulerPolicy("ec")
|
||||
if err != nil {
|
||||
t.Fatalf("loadSchedulerPolicy: %v", err)
|
||||
}
|
||||
if !enabled {
|
||||
t.Fatalf("expected enabled policy from descriptor defaults")
|
||||
}
|
||||
if policy.MaxResults != 30 {
|
||||
t.Fatalf("unexpected max results: got=%d", policy.MaxResults)
|
||||
}
|
||||
if policy.ExecutionConcurrency != 4 {
|
||||
t.Fatalf("unexpected global concurrency: got=%d", policy.ExecutionConcurrency)
|
||||
}
|
||||
if policy.PerWorkerConcurrency != 2 {
|
||||
t.Fatalf("unexpected per-worker concurrency: got=%d", policy.PerWorkerConcurrency)
|
||||
}
|
||||
}
|
||||
|
||||
func TestReserveScheduledExecutorRespectsPerWorkerLimit(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
pluginSvc, err := New(Options{})
|
||||
if err != nil {
|
||||
t.Fatalf("New: %v", err)
|
||||
}
|
||||
defer pluginSvc.Shutdown()
|
||||
|
||||
pluginSvc.registry.UpsertFromHello(&plugin_pb.WorkerHello{
|
||||
WorkerId: "worker-a",
|
||||
Capabilities: []*plugin_pb.JobTypeCapability{
|
||||
{JobType: "balance", CanExecute: true, MaxExecutionConcurrency: 4},
|
||||
},
|
||||
})
|
||||
pluginSvc.registry.UpsertFromHello(&plugin_pb.WorkerHello{
|
||||
WorkerId: "worker-b",
|
||||
Capabilities: []*plugin_pb.JobTypeCapability{
|
||||
{JobType: "balance", CanExecute: true, MaxExecutionConcurrency: 2},
|
||||
},
|
||||
})
|
||||
|
||||
policy := schedulerPolicy{
|
||||
PerWorkerConcurrency: 1,
|
||||
ExecutorReserveBackoff: time.Millisecond,
|
||||
}
|
||||
|
||||
executor1, release1, err := pluginSvc.reserveScheduledExecutor("balance", policy)
|
||||
if err != nil {
|
||||
t.Fatalf("reserve executor 1: %v", err)
|
||||
}
|
||||
defer release1()
|
||||
|
||||
executor2, release2, err := pluginSvc.reserveScheduledExecutor("balance", policy)
|
||||
if err != nil {
|
||||
t.Fatalf("reserve executor 2: %v", err)
|
||||
}
|
||||
defer release2()
|
||||
|
||||
if executor1.WorkerID == executor2.WorkerID {
|
||||
t.Fatalf("expected different executors due per-worker limit, got same worker %s", executor1.WorkerID)
|
||||
}
|
||||
}
|
||||
|
||||
func TestFilterScheduledProposalsDedupe(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
pluginSvc, err := New(Options{})
|
||||
if err != nil {
|
||||
t.Fatalf("New: %v", err)
|
||||
}
|
||||
defer pluginSvc.Shutdown()
|
||||
|
||||
proposals := []*plugin_pb.JobProposal{
|
||||
{ProposalId: "p1", DedupeKey: "d1"},
|
||||
{ProposalId: "p2", DedupeKey: "d1"}, // same dedupe key
|
||||
{ProposalId: "p3", DedupeKey: "d3"},
|
||||
{ProposalId: "p3"}, // fallback dedupe by proposal id
|
||||
{ProposalId: "p4"},
|
||||
{ProposalId: "p4"}, // same proposal id, no dedupe key
|
||||
}
|
||||
|
||||
filtered := pluginSvc.filterScheduledProposals(proposals)
|
||||
if len(filtered) != 4 {
|
||||
t.Fatalf("unexpected filtered size: got=%d want=4", len(filtered))
|
||||
}
|
||||
|
||||
filtered2 := pluginSvc.filterScheduledProposals(proposals)
|
||||
if len(filtered2) != 4 {
|
||||
t.Fatalf("expected second run dedupe to be per-run only, got=%d", len(filtered2))
|
||||
}
|
||||
}
|
||||
|
||||
func TestBuildScheduledJobSpecDoesNotReuseProposalID(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
proposal := &plugin_pb.JobProposal{
|
||||
ProposalId: "vacuum-2",
|
||||
DedupeKey: "vacuum:2",
|
||||
JobType: "vacuum",
|
||||
}
|
||||
|
||||
jobA := buildScheduledJobSpec("vacuum", proposal, 0)
|
||||
jobB := buildScheduledJobSpec("vacuum", proposal, 1)
|
||||
|
||||
if jobA.JobId == proposal.ProposalId {
|
||||
t.Fatalf("scheduled job id must not reuse proposal id: %s", jobA.JobId)
|
||||
}
|
||||
if jobB.JobId == proposal.ProposalId {
|
||||
t.Fatalf("scheduled job id must not reuse proposal id: %s", jobB.JobId)
|
||||
}
|
||||
if jobA.JobId == jobB.JobId {
|
||||
t.Fatalf("scheduled job ids must be unique across jobs: %s", jobA.JobId)
|
||||
}
|
||||
}
|
||||
|
||||
func TestFilterProposalsWithActiveJobs(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
pluginSvc, err := New(Options{})
|
||||
if err != nil {
|
||||
t.Fatalf("New: %v", err)
|
||||
}
|
||||
defer pluginSvc.Shutdown()
|
||||
|
||||
pluginSvc.trackExecutionStart("req-1", "worker-a", &plugin_pb.JobSpec{
|
||||
JobId: "job-1",
|
||||
JobType: "vacuum",
|
||||
DedupeKey: "vacuum:k1",
|
||||
}, 1)
|
||||
pluginSvc.trackExecutionStart("req-2", "worker-b", &plugin_pb.JobSpec{
|
||||
JobId: "job-2",
|
||||
JobType: "vacuum",
|
||||
}, 1)
|
||||
pluginSvc.trackExecutionQueued(&plugin_pb.JobSpec{
|
||||
JobId: "job-3",
|
||||
JobType: "vacuum",
|
||||
DedupeKey: "vacuum:k4",
|
||||
})
|
||||
|
||||
filtered, skipped := pluginSvc.filterProposalsWithActiveJobs("vacuum", []*plugin_pb.JobProposal{
|
||||
{ProposalId: "proposal-1", JobType: "vacuum", DedupeKey: "vacuum:k1"},
|
||||
{ProposalId: "job-2", JobType: "vacuum"},
|
||||
{ProposalId: "proposal-2b", JobType: "vacuum", DedupeKey: "vacuum:k4"},
|
||||
{ProposalId: "proposal-3", JobType: "vacuum", DedupeKey: "vacuum:k3"},
|
||||
{ProposalId: "proposal-4", JobType: "balance", DedupeKey: "balance:k1"},
|
||||
})
|
||||
if skipped != 3 {
|
||||
t.Fatalf("unexpected skipped count: got=%d want=3", skipped)
|
||||
}
|
||||
if len(filtered) != 2 {
|
||||
t.Fatalf("unexpected filtered size: got=%d want=2", len(filtered))
|
||||
}
|
||||
if filtered[0].ProposalId != "proposal-3" || filtered[1].ProposalId != "proposal-4" {
|
||||
t.Fatalf("unexpected filtered proposals: got=%s,%s", filtered[0].ProposalId, filtered[1].ProposalId)
|
||||
}
|
||||
}
|
||||
|
||||
func TestReserveScheduledExecutorTimesOutWhenNoExecutor(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
pluginSvc, err := New(Options{})
|
||||
if err != nil {
|
||||
t.Fatalf("New: %v", err)
|
||||
}
|
||||
defer pluginSvc.Shutdown()
|
||||
|
||||
policy := schedulerPolicy{
|
||||
ExecutionTimeout: 30 * time.Millisecond,
|
||||
ExecutorReserveBackoff: 5 * time.Millisecond,
|
||||
PerWorkerConcurrency: 1,
|
||||
}
|
||||
|
||||
start := time.Now()
|
||||
pluginSvc.Shutdown()
|
||||
_, _, err = pluginSvc.reserveScheduledExecutor("missing-job-type", policy)
|
||||
if err == nil {
|
||||
t.Fatalf("expected reservation shutdown error")
|
||||
}
|
||||
if time.Since(start) > 50*time.Millisecond {
|
||||
t.Fatalf("reservation returned too late after shutdown: duration=%v", time.Since(start))
|
||||
}
|
||||
}
|
||||
|
||||
func TestReserveScheduledExecutorWaitsForWorkerCapacity(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
pluginSvc, err := New(Options{})
|
||||
if err != nil {
|
||||
t.Fatalf("New: %v", err)
|
||||
}
|
||||
defer pluginSvc.Shutdown()
|
||||
|
||||
pluginSvc.registry.UpsertFromHello(&plugin_pb.WorkerHello{
|
||||
WorkerId: "worker-a",
|
||||
Capabilities: []*plugin_pb.JobTypeCapability{
|
||||
{JobType: "balance", CanExecute: true, MaxExecutionConcurrency: 1},
|
||||
},
|
||||
})
|
||||
|
||||
policy := schedulerPolicy{
|
||||
ExecutionTimeout: time.Second,
|
||||
PerWorkerConcurrency: 8,
|
||||
ExecutorReserveBackoff: 5 * time.Millisecond,
|
||||
}
|
||||
|
||||
_, release1, err := pluginSvc.reserveScheduledExecutor("balance", policy)
|
||||
if err != nil {
|
||||
t.Fatalf("reserve executor 1: %v", err)
|
||||
}
|
||||
defer release1()
|
||||
|
||||
type reserveResult struct {
|
||||
err error
|
||||
}
|
||||
secondReserveCh := make(chan reserveResult, 1)
|
||||
go func() {
|
||||
_, release2, reserveErr := pluginSvc.reserveScheduledExecutor("balance", policy)
|
||||
if release2 != nil {
|
||||
release2()
|
||||
}
|
||||
secondReserveCh <- reserveResult{err: reserveErr}
|
||||
}()
|
||||
|
||||
select {
|
||||
case result := <-secondReserveCh:
|
||||
t.Fatalf("expected second reservation to wait for capacity, got=%v", result.err)
|
||||
case <-time.After(25 * time.Millisecond):
|
||||
// Expected: still waiting.
|
||||
}
|
||||
|
||||
release1()
|
||||
|
||||
select {
|
||||
case result := <-secondReserveCh:
|
||||
if result.err != nil {
|
||||
t.Fatalf("second reservation error: %v", result.err)
|
||||
}
|
||||
case <-time.After(200 * time.Millisecond):
|
||||
t.Fatalf("second reservation did not acquire after capacity release")
|
||||
}
|
||||
}
|
||||
|
||||
func TestShouldSkipDetectionForWaitingJobs(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
pluginSvc, err := New(Options{})
|
||||
if err != nil {
|
||||
t.Fatalf("New: %v", err)
|
||||
}
|
||||
defer pluginSvc.Shutdown()
|
||||
|
||||
policy := schedulerPolicy{
|
||||
ExecutionConcurrency: 2,
|
||||
MaxResults: 100,
|
||||
}
|
||||
threshold := waitingBacklogThreshold(policy)
|
||||
if threshold <= 0 {
|
||||
t.Fatalf("expected positive waiting threshold")
|
||||
}
|
||||
|
||||
for i := 0; i < threshold; i++ {
|
||||
pluginSvc.trackExecutionQueued(&plugin_pb.JobSpec{
|
||||
JobId: fmt.Sprintf("job-waiting-%d", i),
|
||||
JobType: "vacuum",
|
||||
DedupeKey: fmt.Sprintf("vacuum:%d", i),
|
||||
})
|
||||
}
|
||||
|
||||
skip, waitingCount, waitingThreshold := pluginSvc.shouldSkipDetectionForWaitingJobs("vacuum", policy)
|
||||
if !skip {
|
||||
t.Fatalf("expected detection to skip when waiting backlog reaches threshold")
|
||||
}
|
||||
if waitingCount != threshold {
|
||||
t.Fatalf("unexpected waiting count: got=%d want=%d", waitingCount, threshold)
|
||||
}
|
||||
if waitingThreshold != threshold {
|
||||
t.Fatalf("unexpected waiting threshold: got=%d want=%d", waitingThreshold, threshold)
|
||||
}
|
||||
}
|
||||
|
||||
func TestWaitingBacklogThresholdHonorsMaxResultsCap(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
policy := schedulerPolicy{
|
||||
ExecutionConcurrency: 8,
|
||||
MaxResults: 6,
|
||||
}
|
||||
threshold := waitingBacklogThreshold(policy)
|
||||
if threshold != 6 {
|
||||
t.Fatalf("expected threshold to be capped by max results, got=%d", threshold)
|
||||
}
|
||||
}
|
||||
|
||||
func TestListSchedulerStatesIncludesPolicyAndState(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
pluginSvc, err := New(Options{})
|
||||
if err != nil {
|
||||
t.Fatalf("New: %v", err)
|
||||
}
|
||||
defer pluginSvc.Shutdown()
|
||||
|
||||
const jobType = "vacuum"
|
||||
err = pluginSvc.SaveJobTypeConfig(&plugin_pb.PersistedJobTypeConfig{
|
||||
JobType: jobType,
|
||||
AdminRuntime: &plugin_pb.AdminRuntimeConfig{
|
||||
Enabled: true,
|
||||
DetectionIntervalSeconds: 45,
|
||||
DetectionTimeoutSeconds: 30,
|
||||
MaxJobsPerDetection: 80,
|
||||
GlobalExecutionConcurrency: 3,
|
||||
PerWorkerExecutionConcurrency: 2,
|
||||
RetryLimit: 1,
|
||||
RetryBackoffSeconds: 9,
|
||||
},
|
||||
})
|
||||
if err != nil {
|
||||
t.Fatalf("SaveJobTypeConfig: %v", err)
|
||||
}
|
||||
|
||||
pluginSvc.registry.UpsertFromHello(&plugin_pb.WorkerHello{
|
||||
WorkerId: "worker-a",
|
||||
Capabilities: []*plugin_pb.JobTypeCapability{
|
||||
{JobType: jobType, CanDetect: true, CanExecute: true},
|
||||
},
|
||||
})
|
||||
|
||||
nextDetectionAt := time.Now().UTC().Add(2 * time.Minute).Round(time.Second)
|
||||
pluginSvc.schedulerMu.Lock()
|
||||
pluginSvc.nextDetectionAt[jobType] = nextDetectionAt
|
||||
pluginSvc.detectionInFlight[jobType] = true
|
||||
pluginSvc.schedulerMu.Unlock()
|
||||
|
||||
states, err := pluginSvc.ListSchedulerStates()
|
||||
if err != nil {
|
||||
t.Fatalf("ListSchedulerStates: %v", err)
|
||||
}
|
||||
|
||||
state := findSchedulerState(states, jobType)
|
||||
if state == nil {
|
||||
t.Fatalf("missing scheduler state for %s", jobType)
|
||||
}
|
||||
if !state.Enabled {
|
||||
t.Fatalf("expected enabled scheduler state")
|
||||
}
|
||||
if state.PolicyError != "" {
|
||||
t.Fatalf("unexpected policy error: %s", state.PolicyError)
|
||||
}
|
||||
if !state.DetectionInFlight {
|
||||
t.Fatalf("expected detection in flight")
|
||||
}
|
||||
if state.NextDetectionAt == nil {
|
||||
t.Fatalf("expected next detection time")
|
||||
}
|
||||
if state.NextDetectionAt.Unix() != nextDetectionAt.Unix() {
|
||||
t.Fatalf("unexpected next detection time: got=%v want=%v", state.NextDetectionAt, nextDetectionAt)
|
||||
}
|
||||
if state.DetectionIntervalSeconds != 45 {
|
||||
t.Fatalf("unexpected detection interval: got=%d", state.DetectionIntervalSeconds)
|
||||
}
|
||||
if state.DetectionTimeoutSeconds != 30 {
|
||||
t.Fatalf("unexpected detection timeout: got=%d", state.DetectionTimeoutSeconds)
|
||||
}
|
||||
if state.ExecutionTimeoutSeconds != 90 {
|
||||
t.Fatalf("unexpected execution timeout: got=%d", state.ExecutionTimeoutSeconds)
|
||||
}
|
||||
if state.MaxJobsPerDetection != 80 {
|
||||
t.Fatalf("unexpected max jobs per detection: got=%d", state.MaxJobsPerDetection)
|
||||
}
|
||||
if state.GlobalExecutionConcurrency != 3 {
|
||||
t.Fatalf("unexpected global execution concurrency: got=%d", state.GlobalExecutionConcurrency)
|
||||
}
|
||||
if state.PerWorkerExecutionConcurrency != 2 {
|
||||
t.Fatalf("unexpected per worker execution concurrency: got=%d", state.PerWorkerExecutionConcurrency)
|
||||
}
|
||||
if state.RetryLimit != 1 {
|
||||
t.Fatalf("unexpected retry limit: got=%d", state.RetryLimit)
|
||||
}
|
||||
if state.RetryBackoffSeconds != 9 {
|
||||
t.Fatalf("unexpected retry backoff: got=%d", state.RetryBackoffSeconds)
|
||||
}
|
||||
if !state.DetectorAvailable || state.DetectorWorkerID != "worker-a" {
|
||||
t.Fatalf("unexpected detector assignment: available=%v worker=%s", state.DetectorAvailable, state.DetectorWorkerID)
|
||||
}
|
||||
if state.ExecutorWorkerCount != 1 {
|
||||
t.Fatalf("unexpected executor worker count: got=%d", state.ExecutorWorkerCount)
|
||||
}
|
||||
}
|
||||
|
||||
func TestListSchedulerStatesShowsDisabledWhenNoPolicy(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
pluginSvc, err := New(Options{})
|
||||
if err != nil {
|
||||
t.Fatalf("New: %v", err)
|
||||
}
|
||||
defer pluginSvc.Shutdown()
|
||||
|
||||
const jobType = "balance"
|
||||
pluginSvc.registry.UpsertFromHello(&plugin_pb.WorkerHello{
|
||||
WorkerId: "worker-b",
|
||||
Capabilities: []*plugin_pb.JobTypeCapability{
|
||||
{JobType: jobType, CanDetect: true, CanExecute: true},
|
||||
},
|
||||
})
|
||||
|
||||
states, err := pluginSvc.ListSchedulerStates()
|
||||
if err != nil {
|
||||
t.Fatalf("ListSchedulerStates: %v", err)
|
||||
}
|
||||
|
||||
state := findSchedulerState(states, jobType)
|
||||
if state == nil {
|
||||
t.Fatalf("missing scheduler state for %s", jobType)
|
||||
}
|
||||
if state.Enabled {
|
||||
t.Fatalf("expected disabled scheduler state")
|
||||
}
|
||||
if state.PolicyError != "" {
|
||||
t.Fatalf("unexpected policy error: %s", state.PolicyError)
|
||||
}
|
||||
if !state.DetectorAvailable || state.DetectorWorkerID != "worker-b" {
|
||||
t.Fatalf("unexpected detector details: available=%v worker=%s", state.DetectorAvailable, state.DetectorWorkerID)
|
||||
}
|
||||
if state.ExecutorWorkerCount != 1 {
|
||||
t.Fatalf("unexpected executor worker count: got=%d", state.ExecutorWorkerCount)
|
||||
}
|
||||
}
|
||||
|
||||
func findSchedulerState(states []SchedulerJobTypeState, jobType string) *SchedulerJobTypeState {
|
||||
for i := range states {
|
||||
if states[i].JobType == jobType {
|
||||
return &states[i]
|
||||
}
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
func TestPickDetectorPrefersLeasedWorker(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
pluginSvc, err := New(Options{})
|
||||
if err != nil {
|
||||
t.Fatalf("New: %v", err)
|
||||
}
|
||||
defer pluginSvc.Shutdown()
|
||||
|
||||
pluginSvc.registry.UpsertFromHello(&plugin_pb.WorkerHello{
|
||||
WorkerId: "worker-a",
|
||||
Capabilities: []*plugin_pb.JobTypeCapability{
|
||||
{JobType: "vacuum", CanDetect: true},
|
||||
},
|
||||
})
|
||||
pluginSvc.registry.UpsertFromHello(&plugin_pb.WorkerHello{
|
||||
WorkerId: "worker-b",
|
||||
Capabilities: []*plugin_pb.JobTypeCapability{
|
||||
{JobType: "vacuum", CanDetect: true},
|
||||
},
|
||||
})
|
||||
|
||||
pluginSvc.setDetectorLease("vacuum", "worker-b")
|
||||
|
||||
detector, err := pluginSvc.pickDetector("vacuum")
|
||||
if err != nil {
|
||||
t.Fatalf("pickDetector: %v", err)
|
||||
}
|
||||
if detector.WorkerID != "worker-b" {
|
||||
t.Fatalf("expected leased detector worker-b, got=%s", detector.WorkerID)
|
||||
}
|
||||
}
|
||||
|
||||
func TestPickDetectorReassignsWhenLeaseIsStale(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
pluginSvc, err := New(Options{})
|
||||
if err != nil {
|
||||
t.Fatalf("New: %v", err)
|
||||
}
|
||||
defer pluginSvc.Shutdown()
|
||||
|
||||
pluginSvc.registry.UpsertFromHello(&plugin_pb.WorkerHello{
|
||||
WorkerId: "worker-a",
|
||||
Capabilities: []*plugin_pb.JobTypeCapability{
|
||||
{JobType: "vacuum", CanDetect: true},
|
||||
},
|
||||
})
|
||||
pluginSvc.setDetectorLease("vacuum", "worker-stale")
|
||||
|
||||
detector, err := pluginSvc.pickDetector("vacuum")
|
||||
if err != nil {
|
||||
t.Fatalf("pickDetector: %v", err)
|
||||
}
|
||||
if detector.WorkerID != "worker-a" {
|
||||
t.Fatalf("expected reassigned detector worker-a, got=%s", detector.WorkerID)
|
||||
}
|
||||
|
||||
lease := pluginSvc.getDetectorLease("vacuum")
|
||||
if lease != "worker-a" {
|
||||
t.Fatalf("expected detector lease to be updated to worker-a, got=%s", lease)
|
||||
}
|
||||
}
|
||||
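The dedupe semantics exercised by `TestFilterScheduledProposalsDedupe` above (prefer `DedupeKey`, fall back to `ProposalId`, reset between runs) can be sketched in isolation. The names below are illustrative, not the actual `filterScheduledProposals` implementation:

```go
package main

import "fmt"

// proposal mirrors the two fields the dedupe logic cares about.
type proposal struct {
	ID        string
	DedupeKey string
}

// dedupeProposals keeps the first proposal per dedupe key, falling back
// to the proposal ID when no key is set. The seen-set is local, so each
// call dedupes independently (matching the per-run behavior tested above).
func dedupeProposals(in []proposal) []proposal {
	seen := make(map[string]struct{}, len(in))
	out := make([]proposal, 0, len(in))
	for _, p := range in {
		key := p.DedupeKey
		if key == "" {
			key = p.ID
		}
		if _, ok := seen[key]; ok {
			continue
		}
		seen[key] = struct{}{}
		out = append(out, p)
	}
	return out
}

func main() {
	in := []proposal{
		{ID: "p1", DedupeKey: "d1"},
		{ID: "p2", DedupeKey: "d1"}, // dropped: same key
		{ID: "p3", DedupeKey: "d3"},
		{ID: "p3"}, // kept: falls back to ID, distinct from "d3"
		{ID: "p4"},
		{ID: "p4"}, // dropped: same ID fallback
	}
	fmt.Println(len(dedupeProposals(in))) // 4, as in the test above
}
```

Because the fallback key is the proposal ID, a keyless proposal never collides with a keyed one unless the ID happens to equal another proposal's dedupe key.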
weed/admin/plugin/plugin_schema_prefetch.go (new file, 66 lines)
@@ -0,0 +1,66 @@
package plugin

import (
	"context"
	"sort"
	"time"

	"github.com/seaweedfs/seaweedfs/weed/glog"
	"github.com/seaweedfs/seaweedfs/weed/pb/plugin_pb"
)

const descriptorPrefetchTimeout = 20 * time.Second

func (r *Plugin) prefetchDescriptorsFromHello(hello *plugin_pb.WorkerHello) {
	if hello == nil || len(hello.Capabilities) == 0 {
		return
	}

	jobTypeSet := make(map[string]struct{})
	for _, capability := range hello.Capabilities {
		if capability == nil || capability.JobType == "" {
			continue
		}
		if !capability.CanDetect && !capability.CanExecute {
			continue
		}
		jobTypeSet[capability.JobType] = struct{}{}
	}

	if len(jobTypeSet) == 0 {
		return
	}

	jobTypes := make([]string, 0, len(jobTypeSet))
	for jobType := range jobTypeSet {
		jobTypes = append(jobTypes, jobType)
	}
	sort.Strings(jobTypes)

	for _, jobType := range jobTypes {
		select {
		case <-r.shutdownCh:
			return
		default:
		}

		descriptor, err := r.store.LoadDescriptor(jobType)
		if err != nil {
			glog.Warningf("Plugin descriptor prefetch check failed for %s: %v", jobType, err)
			continue
		}
		if descriptor != nil {
			continue
		}

		ctx, cancel := context.WithTimeout(r.ctx, descriptorPrefetchTimeout)
		_, err = r.RequestConfigSchema(ctx, jobType, false)
		cancel()
		if err != nil {
			glog.V(1).Infof("Plugin descriptor prefetch skipped for %s: %v", jobType, err)
			continue
		}

		glog.V(1).Infof("Plugin descriptor prefetched for job_type=%s", jobType)
	}
}
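The first phase of `prefetchDescriptorsFromHello` above (collect the unique job types with at least one detect- or execute-capable entry, sorted for deterministic iteration) can be sketched standalone. The types here are simplified stand-ins, not the protobuf messages:

```go
package main

import (
	"fmt"
	"sort"
)

// capability mirrors just the fields the prefetch pass inspects.
type capability struct {
	JobType    string
	CanDetect  bool
	CanExecute bool
}

// relevantJobTypes deduplicates job types that have at least one
// detect- or execute-capable entry and returns them sorted, mirroring
// the set-then-sort collection phase of the prefetch loop.
func relevantJobTypes(caps []capability) []string {
	set := make(map[string]struct{})
	for _, c := range caps {
		if c.JobType == "" || (!c.CanDetect && !c.CanExecute) {
			continue
		}
		set[c.JobType] = struct{}{}
	}
	out := make([]string, 0, len(set))
	for jobType := range set {
		out = append(out, jobType)
	}
	sort.Strings(out)
	return out
}

func main() {
	fmt.Println(relevantJobTypes([]capability{
		{JobType: "vacuum", CanDetect: true},
		{JobType: "vacuum", CanExecute: true}, // duplicate job type
		{JobType: "balance"},                  // neither capability: skipped
		{JobType: "ec", CanExecute: true},
	})) // [ec vacuum]
}
```

Sorting before the network loop makes the prefetch order stable across restarts, which keeps log output and retry behavior reproducible.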
weed/admin/plugin/registry.go (new file, 465 lines)
@@ -0,0 +1,465 @@
package plugin

import (
	"fmt"
	"sort"
	"strings"
	"sync"
	"time"

	"github.com/seaweedfs/seaweedfs/weed/pb/plugin_pb"
)

const defaultWorkerStaleTimeout = 2 * time.Minute

// WorkerSession contains tracked worker metadata and plugin status.
type WorkerSession struct {
	WorkerID        string
	WorkerInstance  string
	Address         string
	WorkerVersion   string
	ProtocolVersion string
	ConnectedAt     time.Time
	LastSeenAt      time.Time
	Capabilities    map[string]*plugin_pb.JobTypeCapability
	Heartbeat       *plugin_pb.WorkerHeartbeat
}

// Registry tracks connected plugin workers and capability-based selection.
type Registry struct {
	mu             sync.RWMutex
	sessions       map[string]*WorkerSession
	staleAfter     time.Duration
	detectorCursor map[string]int
	executorCursor map[string]int
}

func NewRegistry() *Registry {
	return &Registry{
		sessions:       make(map[string]*WorkerSession),
		staleAfter:     defaultWorkerStaleTimeout,
		detectorCursor: make(map[string]int),
		executorCursor: make(map[string]int),
	}
}

func (r *Registry) UpsertFromHello(hello *plugin_pb.WorkerHello) *WorkerSession {
	now := time.Now()
	caps := make(map[string]*plugin_pb.JobTypeCapability, len(hello.Capabilities))
	for _, c := range hello.Capabilities {
		if c == nil || c.JobType == "" {
			continue
		}
		caps[c.JobType] = cloneJobTypeCapability(c)
	}

	r.mu.Lock()
	defer r.mu.Unlock()

	session, ok := r.sessions[hello.WorkerId]
	if !ok {
		session = &WorkerSession{
			WorkerID:    hello.WorkerId,
			ConnectedAt: now,
		}
		r.sessions[hello.WorkerId] = session
	}

	session.WorkerInstance = hello.WorkerInstanceId
	session.Address = hello.Address
	session.WorkerVersion = hello.WorkerVersion
	session.ProtocolVersion = hello.ProtocolVersion
	session.LastSeenAt = now
	session.Capabilities = caps

	return cloneWorkerSession(session)
}

func (r *Registry) Remove(workerID string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.sessions, workerID)
}

func (r *Registry) UpdateHeartbeat(workerID string, heartbeat *plugin_pb.WorkerHeartbeat) {
	r.mu.Lock()
	defer r.mu.Unlock()

	session, ok := r.sessions[workerID]
	if !ok {
		return
	}
	session.Heartbeat = cloneWorkerHeartbeat(heartbeat)
	session.LastSeenAt = time.Now()
}

func (r *Registry) Get(workerID string) (*WorkerSession, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	session, ok := r.sessions[workerID]
	if !ok || r.isSessionStaleLocked(session, time.Now()) {
		return nil, false
	}
	return cloneWorkerSession(session), true
}

func (r *Registry) List() []*WorkerSession {
	r.mu.RLock()
	defer r.mu.RUnlock()
	out := make([]*WorkerSession, 0, len(r.sessions))
	now := time.Now()
	for _, s := range r.sessions {
		if r.isSessionStaleLocked(s, now) {
			continue
		}
		out = append(out, cloneWorkerSession(s))
	}
	sort.Slice(out, func(i, j int) bool {
		return out[i].WorkerID < out[j].WorkerID
	})
	return out
}

// DetectableJobTypes returns sorted job types that currently have at least one detect-capable worker.
func (r *Registry) DetectableJobTypes() []string {
	r.mu.RLock()
	defer r.mu.RUnlock()

	jobTypes := make(map[string]struct{})
	now := time.Now()
	for _, session := range r.sessions {
		if r.isSessionStaleLocked(session, now) {
			continue
		}
		for jobType, capability := range session.Capabilities {
			if capability == nil || !capability.CanDetect {
				continue
			}
			jobTypes[jobType] = struct{}{}
		}
	}

	out := make([]string, 0, len(jobTypes))
	for jobType := range jobTypes {
		out = append(out, jobType)
	}
	sort.Strings(out)
	return out
}

// JobTypes returns sorted job types known by connected workers regardless of capability kind.
func (r *Registry) JobTypes() []string {
	r.mu.RLock()
	defer r.mu.RUnlock()

	jobTypes := make(map[string]struct{})
	now := time.Now()
	for _, session := range r.sessions {
		if r.isSessionStaleLocked(session, now) {
			continue
		}
		for jobType := range session.Capabilities {
			if jobType == "" {
				continue
			}
			jobTypes[jobType] = struct{}{}
		}
	}

	out := make([]string, 0, len(jobTypes))
	for jobType := range jobTypes {
		out = append(out, jobType)
	}
	sort.Strings(out)
	return out
}

// PickSchemaProvider picks one worker for schema requests.
// Preference order:
//  1) workers that can detect this job type
//  2) workers that can execute this job type
// tie-break: more free slots, then lexical worker ID.
func (r *Registry) PickSchemaProvider(jobType string) (*WorkerSession, error) {
	r.mu.RLock()
	defer r.mu.RUnlock()

	var candidates []*WorkerSession
	now := time.Now()
	for _, s := range r.sessions {
		if r.isSessionStaleLocked(s, now) {
			continue
		}
		capability := s.Capabilities[jobType]
		if capability == nil {
			continue
		}
		if capability.CanDetect || capability.CanExecute {
			candidates = append(candidates, s)
		}
	}

	if len(candidates) == 0 {
		return nil, fmt.Errorf("no worker available for schema job_type=%s", jobType)
	}

	sort.Slice(candidates, func(i, j int) bool {
		a := candidates[i]
		b := candidates[j]
		ac := a.Capabilities[jobType]
		bc := b.Capabilities[jobType]

		// Prefer detect-capable providers first.
		if ac.CanDetect != bc.CanDetect {
			return ac.CanDetect
		}

		aSlots := availableDetectionSlots(a, ac) + availableExecutionSlots(a, ac)
		bSlots := availableDetectionSlots(b, bc) + availableExecutionSlots(b, bc)
		if aSlots != bSlots {
			return aSlots > bSlots
		}
		return a.WorkerID < b.WorkerID
	})

	return cloneWorkerSession(candidates[0]), nil
}

// PickDetector picks one detector worker for a job type.
func (r *Registry) PickDetector(jobType string) (*WorkerSession, error) {
	return r.pickByKind(jobType, true)
}

// PickExecutor picks one executor worker for a job type.
func (r *Registry) PickExecutor(jobType string) (*WorkerSession, error) {
	return r.pickByKind(jobType, false)
}

// ListExecutors returns sorted executor candidates for one job type.
// Ordering is by most available execution slots, then lexical worker ID.
// The top tie group is rotated round-robin to prevent sticky assignment.
func (r *Registry) ListExecutors(jobType string) ([]*WorkerSession, error) {
	r.mu.Lock()
	defer r.mu.Unlock()

	candidates := r.collectByKindLocked(jobType, false, time.Now())
	if len(candidates) == 0 {
		return nil, fmt.Errorf("no executor worker available for job_type=%s", jobType)
	}

	sortByKind(candidates, jobType, false)
	r.rotateTopCandidatesLocked(candidates, jobType, false)

	out := make([]*WorkerSession, 0, len(candidates))
	for _, candidate := range candidates {
		out = append(out, cloneWorkerSession(candidate))
	}
	return out, nil
}

func (r *Registry) pickByKind(jobType string, detect bool) (*WorkerSession, error) {
	r.mu.Lock()
	defer r.mu.Unlock()

	candidates := r.collectByKindLocked(jobType, detect, time.Now())

	if len(candidates) == 0 {
		kind := "executor"
		if detect {
			kind = "detector"
		}
		return nil, fmt.Errorf("no %s worker available for job_type=%s", kind, jobType)
	}

	sortByKind(candidates, jobType, detect)
	r.rotateTopCandidatesLocked(candidates, jobType, detect)

	return cloneWorkerSession(candidates[0]), nil
}

func (r *Registry) collectByKindLocked(jobType string, detect bool, now time.Time) []*WorkerSession {
	var candidates []*WorkerSession
	for _, session := range r.sessions {
		if r.isSessionStaleLocked(session, now) {
			continue
		}
		capability := session.Capabilities[jobType]
		if capability == nil {
			continue
		}
		if detect && capability.CanDetect {
			candidates = append(candidates, session)
		}
		if !detect && capability.CanExecute {
			candidates = append(candidates, session)
		}
	}
	return candidates
}

func (r *Registry) isSessionStaleLocked(session *WorkerSession, now time.Time) bool {
	if session == nil {
		return true
	}
	if r.staleAfter <= 0 {
		return false
	}

	lastSeen := session.LastSeenAt
	if lastSeen.IsZero() {
		lastSeen = session.ConnectedAt
	}
	if lastSeen.IsZero() {
		return false
	}
	return now.Sub(lastSeen) > r.staleAfter
}

func sortByKind(candidates []*WorkerSession, jobType string, detect bool) {
	sort.Slice(candidates, func(i, j int) bool {
		a := candidates[i]
		b := candidates[j]
		ac := a.Capabilities[jobType]
		bc := b.Capabilities[jobType]

		aSlots := availableSlotsByKind(a, ac, detect)
		bSlots := availableSlotsByKind(b, bc, detect)

		if aSlots != bSlots {
			return aSlots > bSlots
		}
		return a.WorkerID < b.WorkerID
	})
}

func (r *Registry) rotateTopCandidatesLocked(candidates []*WorkerSession, jobType string, detect bool) {
	if len(candidates) < 2 {
		return
	}

	capability := candidates[0].Capabilities[jobType]
	topSlots := availableSlotsByKind(candidates[0], capability, detect)
	tieEnd := 1
	for tieEnd < len(candidates) {
		nextCapability := candidates[tieEnd].Capabilities[jobType]
		if availableSlotsByKind(candidates[tieEnd], nextCapability, detect) != topSlots {
			break
		}
		tieEnd++
	}
	if tieEnd <= 1 {
		return
	}

	cursorKey := strings.TrimSpace(jobType)
	if cursorKey == "" {
		cursorKey = "*"
	}

	var offset int
	if detect {
		offset = r.detectorCursor[cursorKey] % tieEnd
		r.detectorCursor[cursorKey] = (offset + 1) % tieEnd
	} else {
		offset = r.executorCursor[cursorKey] % tieEnd
		r.executorCursor[cursorKey] = (offset + 1) % tieEnd
	}

	if offset == 0 {
		return
	}

	prefix := append([]*WorkerSession(nil), candidates[:tieEnd]...)
	for i := 0; i < tieEnd; i++ {
		candidates[i] = prefix[(i+offset)%tieEnd]
	}
}

func availableSlotsByKind(
	session *WorkerSession,
	capability *plugin_pb.JobTypeCapability,
	detect bool,
) int {
	if detect {
		return availableDetectionSlots(session, capability)
	}
	return availableExecutionSlots(session, capability)
}

func availableDetectionSlots(session *WorkerSession, capability *plugin_pb.JobTypeCapability) int {
	if session.Heartbeat != nil && session.Heartbeat.DetectionSlotsTotal > 0 {
		free := int(session.Heartbeat.DetectionSlotsTotal - session.Heartbeat.DetectionSlotsUsed)
		if free < 0 {
			return 0
		}
		return free
	}
	if capability.MaxDetectionConcurrency > 0 {
		return int(capability.MaxDetectionConcurrency)
	}
	return 1
}

func availableExecutionSlots(session *WorkerSession, capability *plugin_pb.JobTypeCapability) int {
	if session.Heartbeat != nil && session.Heartbeat.ExecutionSlotsTotal > 0 {
		free := int(session.Heartbeat.ExecutionSlotsTotal - session.Heartbeat.ExecutionSlotsUsed)
		if free < 0 {
			return 0
		}
		return free
	}
	if capability.MaxExecutionConcurrency > 0 {
		return int(capability.MaxExecutionConcurrency)
	}
	return 1
}

func cloneWorkerSession(in *WorkerSession) *WorkerSession {
	if in == nil {
		return nil
	}
	out := *in
	out.Capabilities = make(map[string]*plugin_pb.JobTypeCapability, len(in.Capabilities))
	for jobType, capability := range in.Capabilities {
		out.Capabilities[jobType] = cloneJobTypeCapability(capability)
	}
	out.Heartbeat = cloneWorkerHeartbeat(in.Heartbeat)
	return &out
}

func cloneJobTypeCapability(in *plugin_pb.JobTypeCapability) *plugin_pb.JobTypeCapability {
	if in == nil {
		return nil
	}
	out := *in
	return &out
}

func cloneWorkerHeartbeat(in *plugin_pb.WorkerHeartbeat) *plugin_pb.WorkerHeartbeat {
	if in == nil {
		return nil
	}
	out := *in
	if in.RunningWork != nil {
		out.RunningWork = make([]*plugin_pb.RunningWork, 0, len(in.RunningWork))
		for _, rw := range in.RunningWork {
			if rw == nil {
				continue
			}
			clone := *rw
			out.RunningWork = append(out.RunningWork, &clone)
		}
	}
	if in.QueuedJobsByType != nil {
		out.QueuedJobsByType = make(map[string]int32, len(in.QueuedJobsByType))
		for k, v := range in.QueuedJobsByType {
			out.QueuedJobsByType[k] = v
		}
	}
	if in.Metadata != nil {
		out.Metadata = make(map[string]string, len(in.Metadata))
		for k, v := range in.Metadata {
			out.Metadata[k] = v
		}
	}
	return &out
}
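The slot accounting that ranks candidates in `availableDetectionSlots`/`availableExecutionSlots` follows a three-step fallback: live heartbeat counters win, then the advertised capability limit, then a default of one slot. A standalone sketch with simplified types (not the production structs):

```go
package main

import "fmt"

// heartbeat and capability carry just the fields the slot math uses.
type heartbeat struct{ total, used int }
type capability struct{ maxConcurrency int }

// freeSlots mirrors the fallback chain used when ranking workers:
// heartbeat counters when reported, else the capability maximum,
// else a single default slot; negative free counts clamp to zero.
func freeSlots(hb *heartbeat, c capability) int {
	if hb != nil && hb.total > 0 {
		if free := hb.total - hb.used; free > 0 {
			return free
		}
		return 0
	}
	if c.maxConcurrency > 0 {
		return c.maxConcurrency
	}
	return 1
}

func main() {
	fmt.Println(freeSlots(&heartbeat{total: 4, used: 1}, capability{maxConcurrency: 8})) // heartbeat wins: 3
	fmt.Println(freeSlots(nil, capability{maxConcurrency: 2}))                           // capability fallback: 2
	fmt.Println(freeSlots(nil, capability{}))                                            // default: 1
}
```

The clamp matters because a worker can report more used slots than its total during a capability downgrade; treating that as zero free keeps it from being preferred.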
weed/admin/plugin/registry_test.go (new file, 321 lines)
@@ -0,0 +1,321 @@
|
||||
package plugin
|
||||
|
||||
import (
|
||||
"reflect"
|
||||
"testing"
|
||||
"time"
|
||||
|
||||
"github.com/seaweedfs/seaweedfs/weed/pb/plugin_pb"
|
||||
)
|
||||
|
||||
func TestRegistryPickDetectorPrefersMoreFreeSlots(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
r := NewRegistry()
|
||||
|
||||
r.UpsertFromHello(&plugin_pb.WorkerHello{
|
||||
WorkerId: "worker-a",
|
||||
Capabilities: []*plugin_pb.JobTypeCapability{
|
||||
{JobType: "vacuum", CanDetect: true, CanExecute: true, MaxDetectionConcurrency: 2, MaxExecutionConcurrency: 2},
|
||||
},
|
||||
})
|
||||
r.UpsertFromHello(&plugin_pb.WorkerHello{
|
||||
WorkerId: "worker-b",
|
||||
Capabilities: []*plugin_pb.JobTypeCapability{
|
||||
{JobType: "vacuum", CanDetect: true, CanExecute: true, MaxDetectionConcurrency: 4, MaxExecutionConcurrency: 4},
|
||||
},
|
||||
})
|
||||
|
||||
r.UpdateHeartbeat("worker-a", &plugin_pb.WorkerHeartbeat{
|
||||
WorkerId: "worker-a",
|
||||
DetectionSlotsUsed: 1,
|
||||
DetectionSlotsTotal: 2,
|
||||
})
|
||||
r.UpdateHeartbeat("worker-b", &plugin_pb.WorkerHeartbeat{
|
||||
WorkerId: "worker-b",
|
||||
DetectionSlotsUsed: 1,
|
||||
DetectionSlotsTotal: 4,
|
||||
})
|
||||
|
||||
picked, err := r.PickDetector("vacuum")
|
||||
if err != nil {
|
||||
t.Fatalf("PickDetector: %v", err)
|
||||
}
|
||||
if picked.WorkerID != "worker-b" {
|
||||
t.Fatalf("unexpected detector picked: got %s want worker-b", picked.WorkerID)
|
||||
}
|
||||
}
|
||||
|
||||
func TestRegistryPickExecutorAllowsSameWorker(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
r := NewRegistry()
|
||||
r.UpsertFromHello(&plugin_pb.WorkerHello{
|
||||
WorkerId: "worker-x",
|
||||
Capabilities: []*plugin_pb.JobTypeCapability{
|
||||
{JobType: "balance", CanDetect: true, CanExecute: true, MaxDetectionConcurrency: 1, MaxExecutionConcurrency: 1},
|
||||
},
|
||||
})
|
||||
|
||||
detector, err := r.PickDetector("balance")
|
||||
if err != nil {
|
||||
t.Fatalf("PickDetector: %v", err)
|
||||
}
|
||||
executor, err := r.PickExecutor("balance")
|
||||
if err != nil {
|
||||
t.Fatalf("PickExecutor: %v", err)
|
||||
}
|
||||
|
||||
if detector.WorkerID != "worker-x" || executor.WorkerID != "worker-x" {
|
||||
t.Fatalf("expected same worker for detect/execute, got detector=%s executor=%s", detector.WorkerID, executor.WorkerID)
|
||||
}
|
||||
}
|
||||
|
||||
func TestRegistryDetectableJobTypes(t *testing.T) {
	t.Parallel()

	r := NewRegistry()
	r.UpsertFromHello(&plugin_pb.WorkerHello{
		WorkerId: "worker-a",
		Capabilities: []*plugin_pb.JobTypeCapability{
			{JobType: "vacuum", CanDetect: true, CanExecute: true},
			{JobType: "balance", CanDetect: false, CanExecute: true},
		},
	})
	r.UpsertFromHello(&plugin_pb.WorkerHello{
		WorkerId: "worker-b",
		Capabilities: []*plugin_pb.JobTypeCapability{
			{JobType: "ec", CanDetect: true, CanExecute: false},
			{JobType: "vacuum", CanDetect: true, CanExecute: false},
		},
	})

	got := r.DetectableJobTypes()
	want := []string{"ec", "vacuum"}
	if !reflect.DeepEqual(got, want) {
		t.Fatalf("unexpected detectable job types: got=%v want=%v", got, want)
	}
}

func TestRegistryJobTypes(t *testing.T) {
	t.Parallel()

	r := NewRegistry()
	r.UpsertFromHello(&plugin_pb.WorkerHello{
		WorkerId: "worker-a",
		Capabilities: []*plugin_pb.JobTypeCapability{
			{JobType: "vacuum", CanDetect: true},
			{JobType: "balance", CanExecute: true},
		},
	})
	r.UpsertFromHello(&plugin_pb.WorkerHello{
		WorkerId: "worker-b",
		Capabilities: []*plugin_pb.JobTypeCapability{
			{JobType: "ec", CanDetect: true},
		},
	})

	got := r.JobTypes()
	want := []string{"balance", "ec", "vacuum"}
	if !reflect.DeepEqual(got, want) {
		t.Fatalf("unexpected job types: got=%v want=%v", got, want)
	}
}

func TestRegistryListExecutorsSortedBySlots(t *testing.T) {
	t.Parallel()

	r := NewRegistry()
	r.UpsertFromHello(&plugin_pb.WorkerHello{
		WorkerId: "worker-a",
		Capabilities: []*plugin_pb.JobTypeCapability{
			{JobType: "balance", CanExecute: true, MaxExecutionConcurrency: 2},
		},
	})
	r.UpsertFromHello(&plugin_pb.WorkerHello{
		WorkerId: "worker-b",
		Capabilities: []*plugin_pb.JobTypeCapability{
			{JobType: "balance", CanExecute: true, MaxExecutionConcurrency: 4},
		},
	})

	r.UpdateHeartbeat("worker-a", &plugin_pb.WorkerHeartbeat{
		WorkerId:            "worker-a",
		ExecutionSlotsUsed:  1,
		ExecutionSlotsTotal: 2,
	})
	r.UpdateHeartbeat("worker-b", &plugin_pb.WorkerHeartbeat{
		WorkerId:            "worker-b",
		ExecutionSlotsUsed:  1,
		ExecutionSlotsTotal: 4,
	})

	executors, err := r.ListExecutors("balance")
	if err != nil {
		t.Fatalf("ListExecutors: %v", err)
	}
	if len(executors) != 2 {
		t.Fatalf("unexpected candidate count: got=%d", len(executors))
	}
	if executors[0].WorkerID != "worker-b" || executors[1].WorkerID != "worker-a" {
		t.Fatalf("unexpected executor order: got=%s,%s", executors[0].WorkerID, executors[1].WorkerID)
	}
}

func TestRegistryPickExecutorRoundRobinForTopTie(t *testing.T) {
	t.Parallel()

	r := NewRegistry()
	for _, workerID := range []string{"worker-a", "worker-b", "worker-c"} {
		r.UpsertFromHello(&plugin_pb.WorkerHello{
			WorkerId: workerID,
			Capabilities: []*plugin_pb.JobTypeCapability{
				{JobType: "balance", CanExecute: true, MaxExecutionConcurrency: 1},
			},
		})
	}

	got := make([]string, 0, 6)
	for i := 0; i < 6; i++ {
		executor, err := r.PickExecutor("balance")
		if err != nil {
			t.Fatalf("PickExecutor: %v", err)
		}
		got = append(got, executor.WorkerID)
	}

	want := []string{"worker-a", "worker-b", "worker-c", "worker-a", "worker-b", "worker-c"}
	if !reflect.DeepEqual(got, want) {
		t.Fatalf("unexpected pick order: got=%v want=%v", got, want)
	}
}

func TestRegistryListExecutorsRoundRobinForTopTie(t *testing.T) {
	t.Parallel()

	r := NewRegistry()
	r.UpsertFromHello(&plugin_pb.WorkerHello{
		WorkerId: "worker-a",
		Capabilities: []*plugin_pb.JobTypeCapability{
			{JobType: "balance", CanExecute: true, MaxExecutionConcurrency: 2},
		},
	})
	r.UpsertFromHello(&plugin_pb.WorkerHello{
		WorkerId: "worker-b",
		Capabilities: []*plugin_pb.JobTypeCapability{
			{JobType: "balance", CanExecute: true, MaxExecutionConcurrency: 2},
		},
	})
	r.UpsertFromHello(&plugin_pb.WorkerHello{
		WorkerId: "worker-c",
		Capabilities: []*plugin_pb.JobTypeCapability{
			{JobType: "balance", CanExecute: true, MaxExecutionConcurrency: 1},
		},
	})

	r.UpdateHeartbeat("worker-a", &plugin_pb.WorkerHeartbeat{
		WorkerId:            "worker-a",
		ExecutionSlotsUsed:  0,
		ExecutionSlotsTotal: 2,
	})
	r.UpdateHeartbeat("worker-b", &plugin_pb.WorkerHeartbeat{
		WorkerId:            "worker-b",
		ExecutionSlotsUsed:  0,
		ExecutionSlotsTotal: 2,
	})
	r.UpdateHeartbeat("worker-c", &plugin_pb.WorkerHeartbeat{
		WorkerId:            "worker-c",
		ExecutionSlotsUsed:  0,
		ExecutionSlotsTotal: 1,
	})

	firstCall, err := r.ListExecutors("balance")
	if err != nil {
		t.Fatalf("ListExecutors first call: %v", err)
	}
	secondCall, err := r.ListExecutors("balance")
	if err != nil {
		t.Fatalf("ListExecutors second call: %v", err)
	}
	thirdCall, err := r.ListExecutors("balance")
	if err != nil {
		t.Fatalf("ListExecutors third call: %v", err)
	}

	if firstCall[0].WorkerID != "worker-a" || firstCall[1].WorkerID != "worker-b" || firstCall[2].WorkerID != "worker-c" {
		t.Fatalf("unexpected first executor order: got=%s,%s,%s", firstCall[0].WorkerID, firstCall[1].WorkerID, firstCall[2].WorkerID)
	}
	if secondCall[0].WorkerID != "worker-b" || secondCall[1].WorkerID != "worker-a" || secondCall[2].WorkerID != "worker-c" {
		t.Fatalf("unexpected second executor order: got=%s,%s,%s", secondCall[0].WorkerID, secondCall[1].WorkerID, secondCall[2].WorkerID)
	}
	if thirdCall[0].WorkerID != "worker-a" || thirdCall[1].WorkerID != "worker-b" || thirdCall[2].WorkerID != "worker-c" {
		t.Fatalf("unexpected third executor order: got=%s,%s,%s", thirdCall[0].WorkerID, thirdCall[1].WorkerID, thirdCall[2].WorkerID)
	}
}

func TestRegistrySkipsStaleWorkersForSelectionAndListing(t *testing.T) {
	t.Parallel()

	r := NewRegistry()
	r.staleAfter = 2 * time.Second

	r.UpsertFromHello(&plugin_pb.WorkerHello{
		WorkerId: "worker-stale",
		Capabilities: []*plugin_pb.JobTypeCapability{
			{JobType: "vacuum", CanDetect: true, CanExecute: true},
		},
	})
	r.UpsertFromHello(&plugin_pb.WorkerHello{
		WorkerId: "worker-fresh",
		Capabilities: []*plugin_pb.JobTypeCapability{
			{JobType: "vacuum", CanDetect: true, CanExecute: true},
		},
	})

	r.mu.Lock()
	r.sessions["worker-stale"].LastSeenAt = time.Now().Add(-10 * time.Second)
	r.sessions["worker-fresh"].LastSeenAt = time.Now()
	r.mu.Unlock()

	picked, err := r.PickDetector("vacuum")
	if err != nil {
		t.Fatalf("PickDetector: %v", err)
	}
	if picked.WorkerID != "worker-fresh" {
		t.Fatalf("unexpected detector: got=%s want=worker-fresh", picked.WorkerID)
	}

	if _, ok := r.Get("worker-stale"); ok {
		t.Fatalf("expected stale worker to be hidden from Get")
	}
	if _, ok := r.Get("worker-fresh"); !ok {
		t.Fatalf("expected fresh worker from Get")
	}

	listed := r.List()
	if len(listed) != 1 || listed[0].WorkerID != "worker-fresh" {
		t.Fatalf("unexpected listed workers: %+v", listed)
	}
}

func TestRegistryReturnsNoDetectorWhenAllWorkersStale(t *testing.T) {
	t.Parallel()

	r := NewRegistry()
	r.staleAfter = 2 * time.Second

	r.UpsertFromHello(&plugin_pb.WorkerHello{
		WorkerId: "worker-a",
		Capabilities: []*plugin_pb.JobTypeCapability{
			{JobType: "vacuum", CanDetect: true},
		},
	})

	r.mu.Lock()
	r.sessions["worker-a"].LastSeenAt = time.Now().Add(-10 * time.Second)
	r.mu.Unlock()

	if _, err := r.PickDetector("vacuum"); err == nil {
		t.Fatalf("expected no detector when all workers are stale")
	}
}

103 weed/admin/plugin/types.go (new file)
@@ -0,0 +1,103 @@
package plugin

import "time"

const (
	// Keep exactly the last 10 successful and last 10 error runs per job type.
	MaxSuccessfulRunHistory = 10
	MaxErrorRunHistory      = 10
)

type RunOutcome string

const (
	RunOutcomeSuccess RunOutcome = "success"
	RunOutcomeError   RunOutcome = "error"
)

type JobRunRecord struct {
	RunID       string     `json:"run_id"`
	JobID       string     `json:"job_id"`
	JobType     string     `json:"job_type"`
	WorkerID    string     `json:"worker_id"`
	Outcome     RunOutcome `json:"outcome"`
	Message     string     `json:"message,omitempty"`
	DurationMs  int64      `json:"duration_ms,omitempty"`
	CompletedAt *time.Time `json:"completed_at,omitempty"`
}

type JobTypeRunHistory struct {
	JobType         string         `json:"job_type"`
	SuccessfulRuns  []JobRunRecord `json:"successful_runs"`
	ErrorRuns       []JobRunRecord `json:"error_runs"`
	LastUpdatedTime *time.Time     `json:"last_updated_time,omitempty"`
}

type TrackedJob struct {
	JobID              string                 `json:"job_id"`
	JobType            string                 `json:"job_type"`
	RequestID          string                 `json:"request_id"`
	WorkerID           string                 `json:"worker_id"`
	DedupeKey          string                 `json:"dedupe_key,omitempty"`
	Summary            string                 `json:"summary,omitempty"`
	Detail             string                 `json:"detail,omitempty"`
	Parameters         map[string]interface{} `json:"parameters,omitempty"`
	Labels             map[string]string      `json:"labels,omitempty"`
	State              string                 `json:"state"`
	Progress           float64                `json:"progress"`
	Stage              string                 `json:"stage,omitempty"`
	Message            string                 `json:"message,omitempty"`
	Attempt            int32                  `json:"attempt,omitempty"`
	CreatedAt          *time.Time             `json:"created_at,omitempty"`
	UpdatedAt          *time.Time             `json:"updated_at,omitempty"`
	CompletedAt        *time.Time             `json:"completed_at,omitempty"`
	ErrorMessage       string                 `json:"error_message,omitempty"`
	ResultSummary      string                 `json:"result_summary,omitempty"`
	ResultOutputValues map[string]interface{} `json:"result_output_values,omitempty"`
}

type JobActivity struct {
	JobID      string                 `json:"job_id"`
	JobType    string                 `json:"job_type"`
	RequestID  string                 `json:"request_id,omitempty"`
	WorkerID   string                 `json:"worker_id,omitempty"`
	Source     string                 `json:"source"`
	Message    string                 `json:"message"`
	Stage      string                 `json:"stage,omitempty"`
	Details    map[string]interface{} `json:"details,omitempty"`
	OccurredAt *time.Time             `json:"occurred_at,omitempty"`
}

type JobDetail struct {
	Job         *TrackedJob   `json:"job"`
	RunRecord   *JobRunRecord `json:"run_record,omitempty"`
	Activities  []JobActivity `json:"activities"`
	RelatedJobs []TrackedJob  `json:"related_jobs,omitempty"`
	LastUpdated *time.Time    `json:"last_updated,omitempty"`
}

type SchedulerJobTypeState struct {
	JobType                       string     `json:"job_type"`
	Enabled                       bool       `json:"enabled"`
	PolicyError                   string     `json:"policy_error,omitempty"`
	DetectionInFlight             bool       `json:"detection_in_flight"`
	NextDetectionAt               *time.Time `json:"next_detection_at,omitempty"`
	DetectionIntervalSeconds      int32      `json:"detection_interval_seconds,omitempty"`
	DetectionTimeoutSeconds       int32      `json:"detection_timeout_seconds,omitempty"`
	ExecutionTimeoutSeconds       int32      `json:"execution_timeout_seconds,omitempty"`
	MaxJobsPerDetection           int32      `json:"max_jobs_per_detection,omitempty"`
	GlobalExecutionConcurrency    int        `json:"global_execution_concurrency,omitempty"`
	PerWorkerExecutionConcurrency int        `json:"per_worker_execution_concurrency,omitempty"`
	RetryLimit                    int        `json:"retry_limit,omitempty"`
	RetryBackoffSeconds           int32      `json:"retry_backoff_seconds,omitempty"`
	DetectorAvailable             bool       `json:"detector_available"`
	DetectorWorkerID              string     `json:"detector_worker_id,omitempty"`
	ExecutorWorkerCount           int        `json:"executor_worker_count"`
}

func timeToPtr(t time.Time) *time.Time {
	if t.IsZero() {
		return nil
	}
	return &t
}