Refactor plugin system and migrate worker runtime (#8369)

* admin: add plugin runtime UI page and route wiring

* pb: add plugin gRPC contract and generated bindings

* admin/plugin: implement worker registry, runtime, monitoring, and config store

* admin/dash: wire plugin runtime and expose plugin workflow APIs

* command: add flags to enable plugin runtime

* admin: rename remaining plugin v2 wording to plugin

* admin/plugin: add detectable job type registry helper

* admin/plugin: add scheduled detection and dispatch orchestration

* admin/plugin: prefetch job type descriptors when workers connect

* admin/plugin: add known job type discovery API and UI

* admin/plugin: refresh design doc to match current implementation

* admin/plugin: enforce per-worker scheduler concurrency limits

* admin/plugin: use descriptor runtime defaults for scheduler policy

* admin/ui: auto-load first known plugin job type on page open

* admin/plugin: bootstrap persisted config from descriptor defaults

* admin/plugin: dedupe scheduled proposals by dedupe key

* admin/ui: add job type and state filters for plugin monitoring

* admin/ui: add per-job-type plugin activity summary

* admin/plugin: split descriptor read API from schema refresh

* admin/ui: keep plugin summary metrics global while tables are filtered

* admin/plugin: retry executor reservation before timing out

* admin/plugin: expose scheduler states for monitoring

* admin/ui: show per-job-type scheduler states in plugin monitor

* pb/plugin: rename protobuf package to plugin

* admin/plugin: rename pluginRuntime wiring to plugin

* admin/plugin: remove runtime naming from plugin APIs and UI

* admin/plugin: rename runtime files to plugin naming

* admin/plugin: persist jobs and activities for monitor recovery

* admin/plugin: lease one detector worker per job type

* admin/ui: show worker load from plugin heartbeats

* admin/plugin: skip stale workers for detector and executor picks

* plugin/worker: add plugin worker command and stream runtime scaffold

* plugin/worker: implement vacuum detect and execute handlers

* admin/plugin: document external vacuum plugin worker starter

* command: update plugin.worker help to reflect implemented flow

* command/admin: drop legacy Plugin V2 label

* plugin/worker: validate vacuum job type and respect min interval

* plugin/worker: test no-op detect when min interval not elapsed

* command/admin: document plugin.worker external process

* plugin/worker: advertise configured concurrency in hello

* command/plugin.worker: add jobType handler selection

* command/plugin.worker: test handler selection by job type

* command/plugin.worker: persist worker id in workingDir

* admin/plugin: document plugin.worker jobType and workingDir flags

* plugin/worker: support cancel request for in-flight work

* plugin/worker: test cancel request acknowledgements

* command/plugin.worker: document workingDir and jobType behavior

* plugin/worker: emit executor activity events for monitor

* plugin/worker: test executor activity builder

* admin/plugin: send last successful run in detection request

* admin/plugin: send cancel request when detect or execute context ends

* admin/plugin: document worker cancel request responsibility

* admin/handlers: expose plugin scheduler states API in no-auth mode

* admin/handlers: test plugin scheduler states route registration

* admin/plugin: keep worker id on worker-generated activity records

* admin/plugin: test worker id propagation in monitor activities

* admin/dash: always initialize plugin service

* command/admin: remove plugin enable flags and default to enabled

* admin/dash: drop pluginEnabled constructor parameter

* admin/plugin UI: stop checking plugin enabled state

* admin/plugin: remove docs for plugin enable flags

* admin/dash: remove unused plugin enabled check method

* admin/dash: fallback to in-memory plugin init when dataDir fails

* admin/plugin API: expose worker gRPC port in status

* command/plugin.worker: resolve admin gRPC port via plugin status

* split plugin UI into overview/configuration/monitoring pages

* Update layout_templ.go

* add volume_balance plugin worker handler

* wire plugin.worker CLI for volume_balance job type

* add erasure_coding plugin worker handler

* wire plugin.worker CLI for erasure_coding job type

* support multi-job handlers in plugin worker runtime

* allow plugin.worker jobType as comma-separated list

* admin/plugin UI: rename to Workers and simplify config view

* plugin worker: queue detection requests instead of capacity reject

* Update plugin_worker.go

* plugin volume_balance: remove force_move/timeout from worker config UI

* plugin erasure_coding: enforce local working dir and cleanup

* admin/plugin UI: rename admin settings to job scheduling

* admin/plugin UI: persist and robustly render detection results

* admin/plugin: record and return detection trace metadata

* admin/plugin UI: show detection process and decision trace

* plugin: surface detector decision trace as activities

* mini: start a plugin worker by default

* admin/plugin UI: split monitoring into detection and execution tabs

* plugin worker: emit detection decision trace for EC and balance

* admin workers UI: split monitoring into detection and execution pages

* plugin scheduler: skip proposals for active assigned/running jobs

* admin workers UI: add job queue tab

* plugin worker: add dummy stress detector and executor job type

* admin workers UI: reorder tabs to detection queue execution

* admin workers UI: regenerate plugin template

* plugin defaults: include dummy stress and add stress tests

* plugin dummy stress: rotate detection selections across runs

* plugin scheduler: remove cross-run proposal dedupe

* plugin queue: track pending scheduled jobs

* plugin scheduler: wait for executor capacity before dispatch

* plugin scheduler: skip detection when waiting backlog is high

* plugin: add disk-backed job detail API and persistence

* admin ui: show plugin job detail modal from job id links

* plugin: generate unique job ids instead of reusing proposal ids

* plugin worker: emit heartbeats on work state changes

* plugin registry: round-robin tied executor and detector picks

* add temporary EC overnight stress runner

* plugin job details: persist and render EC execution plans

* ec volume details: color data and parity shard badges

* shard labels: keep parity ids numeric and color-only distinction

* admin: remove legacy maintenance UI routes and templates

* admin: remove dead maintenance endpoint helpers

* Update layout_templ.go

* remove dummy_stress worker and command support

* refactor plugin UI to job-type top tabs and sub-tabs

* migrate weed worker command to plugin runtime

* remove plugin.worker command and keep worker runtime with metrics

* update helm worker args for jobType and execution flags

* set plugin scheduling defaults to global 16 and per-worker 4

* stress: fix RPC context reuse and remove redundant variables in ec_stress_runner

* admin/plugin: fix lifecycle races, safe channel operations, and terminal state constants

* admin/dash: randomize job IDs and fix priority zero-value overwrite in plugin API

* admin/handlers: implement buffered rendering to prevent response corruption

* admin/plugin: implement debounced persistence flusher and optimize BuildJobDetail memory lookups

* admin/plugin: fix priority overwrite and implement bounded wait in scheduler reserve

* admin/plugin: implement atomic file writes and fix run record side effects

* admin/plugin: use P prefix for parity shard labels in execution plans

* admin/plugin: enable parallel execution for cancellation tests

* admin: refactor time.Time fields to pointers for better JSON omitempty support

* admin/plugin: implement pointer-safe time assignments and comparisons in plugin core

* admin/plugin: fix time assignment and sorting logic in plugin monitor after pointer refactor

* admin/plugin: update scheduler activity tracking to use time pointers

* admin/plugin: fix time-based run history trimming after pointer refactor

* admin/dash: fix JobSpec struct literal in plugin API after pointer refactor

* admin/view: add D/P prefixes to EC shard badges for UI consistency

* admin/plugin: use lifecycle-aware context for schema prefetching

* Update ec_volume_details_templ.go

* admin/stress: fix proposal sorting and log volume cleanup errors

* stress: refine ec stress runner with math/rand and collection name

- Added Collection field to VolumeEcShardsDeleteRequest for correct filename construction.
- Replaced crypto/rand with seeded math/rand PRNG for bulk payloads.
- Added documentation for EcMinAge zero-value behavior.
- Added logging for ignored errors in volume/shard deletion.

* admin: return internal server error for plugin store failures

Changed error status code from 400 Bad Request to 500 Internal Server Error for failures in GetPluginJobDetail to correctly reflect server-side errors.

* admin: implement safe channel sends and graceful shutdown sync

- Added sync.WaitGroup to Plugin struct to manage background goroutines.
- Implemented safeSendCh helper using recover() to prevent panics on closed channels (sketched below).
- Ensured Shutdown() waits for all background operations to complete.
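
A minimal sketch of the safeSendCh idea, assuming a plain blocking send guarded by recover(); the actual helper may differ in signature and buffering behavior:

```go
// Hypothetical shape of safeSendCh: recover() converts the panic from
// sending on an already-closed channel into a false return. Blocking on a
// full channel is an assumption of this sketch.
func safeSendCh[T any](ch chan<- T, value T) (sent bool) {
	defer func() {
		if recover() != nil {
			sent = false
		}
	}()
	ch <- value
	return true
}
```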

* admin: robustify plugin monitor with nil-safe time and record init

- Standardized nil-safe assignment for *time.Time pointers (CreatedAt, UpdatedAt, CompletedAt).
- Ensured persistJobDetailSnapshot initializes new records correctly if they don't exist on disk.
- Fixed debounced persistence to trigger immediate write on job completion.

* admin: improve scheduler shutdown behavior and logic guards

- Replaced brittle error string matching with explicit r.shutdownCh selection for shutdown detection.
- Removed redundant nil guard in buildScheduledJobSpec.
- Standardized WaitGroup usage for schedulerLoop.

* admin: implement deep copy for job parameters and atomic write fixes

- Implemented deepCopyGenericValue and used it in cloneTrackedJob to prevent shared state.
- Ensured atomicWriteFile creates parent directories before writing.

* admin: remove unreachable branch in shard classification

Removed an unreachable 'totalShards <= 0' check in classifyShardID as dataShards and parityShards are already guarded.

* admin: secure UI links and use canonical shard constants

- Added rel="noopener noreferrer" to external links for security.
- Replaced magic number 14 with erasure_coding.TotalShardsCount.
- Used renderEcShardBadge for missing shard list consistency.

* admin: stabilize plugin tests and fix regressions

- Reworked plugin_monitor_test.go to robustly handle asynchronous persistence.
- Updated all time.Time literals to use the timeToPtr helper (presumed shape below).
- Added explicit Shutdown() calls in tests to synchronize with debounced writes.
- Fixed syntax errors and orphaned struct literals in tests.
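
The timeToPtr helper referenced above is defined in the plugin package outside this excerpt; it is presumably the usual one-liner:

```go
// Presumed shape of timeToPtr.
func timeToPtr(t time.Time) *time.Time { return &t }
```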

* Potential fix for code scanning alert no. 278: Slice memory allocation with excessive size value

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

* Potential fix for code scanning alert no. 283: Uncontrolled data used in path expression

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

* admin: finalize refinements for error handling, scheduler, and race fixes

- Standardized HTTP 500 status codes for store failures in plugin_api.go.
- Tracked scheduled detection goroutines with sync.WaitGroup for safe shutdown.
- Fixed race condition in safeSendDetectionComplete by extracting the channel under the lock (illustrated below).
- Implemented deep copy for JobActivity details.
- Used defaultDirPerm constant in atomicWriteFile.
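
A hedged illustration of the extract-the-channel-under-the-lock fix named above; the type and field names here are illustrative, not the actual struct layout:

```go
// Illustrative only: copy the channel reference while holding the mutex,
// then send outside the critical section so a concurrent close or replace
// cannot race with the send.
type pluginCore struct {
	mu              sync.Mutex
	detectionDoneCh chan *plugin_pb.DetectionComplete
}

func (p *pluginCore) safeSendDetectionComplete(msg *plugin_pb.DetectionComplete) {
	p.mu.Lock()
	ch := p.detectionDoneCh
	p.mu.Unlock()
	if ch != nil {
		safeSendCh(ch, msg)
	}
}
```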

* test(ec): migrate admin dockertest to plugin APIs

* admin/plugin_api: fix RunPluginJobTypeAPI to return 500 for server-side detection/filter errors

* admin/plugin_api: fix ExecutePluginJobAPI to return 500 for job execution failures

* admin/plugin_api: limit parseProtoJSONBody request body to 1MB to prevent unbounded memory usage
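
One standard way to enforce such a cap is http.MaxBytesReader; this is a sketch, not necessarily how parseProtoJSONBody implements it:

```go
// Sketch: cap the request body at 1 MiB before reading. io.ReadAll then
// fails once the limit is exceeded instead of buffering unbounded input.
func readLimitedBody(w http.ResponseWriter, r *http.Request) ([]byte, error) {
	r.Body = http.MaxBytesReader(w, r.Body, 1<<20)
	return io.ReadAll(r.Body)
}
```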

* admin/plugin: consolidate regex to package-level validJobTypePattern; add char validation to sanitizeJobID

* admin/plugin: fix racy Shutdown channel close with sync.Once
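
The sync.Once pattern named here typically looks like the following; the type and field names are assumed, not the Plugin struct's actual layout:

```go
// Sketch: closing shutdownCh through sync.Once makes Shutdown idempotent
// and race-free when called from multiple goroutines.
type pluginShutdown struct {
	shutdownCh   chan struct{}
	shutdownOnce sync.Once
}

func (p *pluginShutdown) Shutdown() {
	p.shutdownOnce.Do(func() { close(p.shutdownCh) })
}
```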

* admin/plugin: track sendLoop and recv goroutines in WorkerStream with r.wg

* admin/plugin: document writeProtoFiles atomicity — .pb is source of truth, .json is human-readable only

* admin/plugin: extract activityLess helper to deduplicate nil-safe OccurredAt sort comparators

* test/ec: check http.NewRequest errors to prevent nil req panics

* test/ec: replace deprecated ioutil/math/rand, fix stale step comment 5.1→3.1

* plugin(ec): raise default detection and scheduling throughput limits

* topology: include empty disks in volume list and EC capacity fallback

* topology: remove hard 10-task cap for detection planning

* Update ec_volume_details_templ.go

* adjust default

* fix tests

---------

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Chris Lu, committed by GitHub
2026-02-18 13:42:41 -08:00
commit 8ec9ff4a12
parent 5463038760
82 changed files with 23419 additions and 11389 deletions

weed/admin/plugin/DESIGN.md

@@ -0,0 +1,205 @@
# Admin Worker Plugin System (Design)
This document describes the plugin system for admin-managed workers, implemented in parallel with the current maintenance/worker mechanism.
## Scope
- Add a new plugin protocol and runtime model for multi-language workers.
- Keep all current admin + worker code paths untouched.
- Use gRPC for all admin-worker communication.
- Let workers describe job configuration UI declaratively via protobuf.
- Persist all job type configuration under admin server data directory.
- Support detector workers and executor workers per job type.
- Add end-to-end workflow observability (activities, active jobs, progress).
## New Contract
- Proto file: `weed/pb/plugin.proto`
- gRPC service: `PluginControlService.WorkerStream`
- Connection model: worker-initiated long-lived bidirectional stream (a connection sketch follows the rationale below).
Why this model:
- Works for workers in any language with gRPC support.
- Avoids admin dialing constraints in NAT/private networks.
- Allows command/response, progress streaming, and heartbeat over one channel.
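
A minimal Go sketch of this connection model, assuming standard protoc-generated bindings in `plugin_pb`. `WorkerToAdminMessage` and its `Hello` oneof wrapper are assumed names in this sketch; consult `weed/pb/plugin.proto` for the actual envelope.

```go
package main

import (
	"context"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	"github.com/seaweedfs/seaweedfs/weed/pb/plugin_pb"
)

func main() {
	conn, err := grpc.Dial("localhost:23646",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Worker-initiated, long-lived bidirectional stream.
	stream, err := plugin_pb.NewPluginControlServiceClient(conn).WorkerStream(context.Background())
	if err != nil {
		log.Fatal(err)
	}

	// First message: identity plus per-job-type capabilities.
	hello := &plugin_pb.WorkerHello{
		WorkerId: "worker-1",
		Capabilities: []*plugin_pb.JobTypeCapability{
			{JobType: "vacuum", CanDetect: true, CanExecute: true},
		},
	}
	// The envelope type and oneof wrapper below are assumptions of this sketch.
	if err := stream.Send(&plugin_pb.WorkerToAdminMessage{
		Body: &plugin_pb.WorkerToAdminMessage_Hello{Hello: hello},
	}); err != nil {
		log.Fatal(err)
	}

	for {
		msg, err := stream.Recv()
		if err != nil {
			log.Fatal(err)
		}
		_ = msg // dispatch RunDetectionRequest / ExecuteJobRequest / CancelRequest
	}
}
```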
## Core Runtime Components (Admin Side)
1. `PluginRegistry`
- Tracks connected workers and their per-job-type capabilities.
- Maintains liveness via heartbeat timeout.
2. `SchemaCoordinator`
- For each job type, asks one capable worker for `JobTypeDescriptor`.
- Caches descriptor version and refresh timestamp.
3. `ConfigStore`
- Persists descriptor + saved config values in `dataDir`.
- Stores both:
- Admin-owned runtime config (detection interval, dispatch concurrency, retry).
- Worker-owned config values (plugin-specific detection/execution knobs).
4. `DetectorScheduler`
- Per job type, chooses one detector worker (`can_detect=true`).
- Sends `RunDetectionRequest` with saved configs + cluster context.
- Accepts `DetectionProposals`, dedupes by `dedupe_key`, inserts jobs.
5. `JobDispatcher`
- Chooses executor worker (`can_execute=true`) for each pending job.
- Sends `ExecuteJobRequest`.
- Consumes `JobProgressUpdate` and `JobCompleted`.
6. `WorkflowMonitor`
- Builds live counters and timeline from events:
- activities per job type,
- active jobs,
- per-job progress/state,
- worker health/load.
## Worker Responsibilities
1. Register capabilities on connect (`WorkerHello`).
2. Expose job type descriptor (`ConfigSchemaResponse`) including UI schemas:
- admin config form,
- worker config form,
- defaults.
3. Run detection on demand (`RunDetectionRequest`) and return proposals.
4. Execute assigned jobs (`ExecuteJobRequest`) and stream progress.
5. Heartbeat regularly with slot usage and running work.
6. Handle cancellation requests (`CancelRequest`) for in-flight detection/execution.
## Declarative UI Model
UI is fully derived from protobuf schema:
- `ConfigForm`
- `ConfigSection`
- `ConfigField`
- `ConfigOption`
- `ValidationRule`
- `ConfigValue` (typed scalar/list/map/object value container)
Result:
- Admin can render forms without hardcoded task structs.
- New job types can ship UI schema from worker binary alone.
- Worker language is irrelevant as long as it can emit protobuf messages; a small construction example follows.
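
For example, a worker can assemble a descriptor with typed defaults from these messages alone. The names below appear in this commit's tests; the full section/field structure lives in `plugin.proto`.

```go
// buildVacuumDescriptor sketches what a worker returns in its
// ConfigSchemaResponse; only names confirmed by this commit's tests are used.
func buildVacuumDescriptor() *plugin_pb.JobTypeDescriptor {
	return &plugin_pb.JobTypeDescriptor{
		JobType:     "vacuum",
		DisplayName: "Vacuum",
		WorkerConfigForm: &plugin_pb.ConfigForm{
			// Typed defaults travel as ConfigValue oneofs, so the admin UI
			// can render and validate them without a hardcoded task struct.
			DefaultValues: map[string]*plugin_pb.ConfigValue{
				"threshold":  {Kind: &plugin_pb.ConfigValue_DoubleValue{DoubleValue: 0.3}},
				"scan_scope": {Kind: &plugin_pb.ConfigValue_StringValue{StringValue: "all"}},
			},
		},
	}
}
```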
## Detection and Dispatch Flow
1. Worker connects and registers capabilities.
2. Admin requests descriptor per job type.
3. Admin persists descriptor and editable config values.
4. On detection interval (admin-owned setting):
- Admin chooses one detector worker for that job type.
- Sends `RunDetectionRequest` with:
- `AdminRuntimeConfig`,
- `admin_config_values`,
- `worker_config_values`,
- `ClusterContext` (master/filer/volume grpc locations, metadata).
5. Detector emits `DetectionProposals` and `DetectionComplete`.
6. Admin dedupes and enqueues jobs.
7. Dispatcher assigns jobs to any eligible executor worker.
8. Executor emits `JobProgressUpdate` and `JobCompleted`.
9. Monitor updates workflow UI in near-real-time.
## Persistence Layout (Admin Data Dir)
Current layout under `<admin-data-dir>/plugin/`:
- `job_types/<job_type>/descriptor.pb`
- `job_types/<job_type>/descriptor.json`
- `job_types/<job_type>/config.pb`
- `job_types/<job_type>/config.json`
- `job_types/<job_type>/runs.json`
- `jobs/tracked_jobs.json`
- `jobs/job_details/<job_id>.json`
- `activities/activities.json`
`config.pb` should use `PersistedJobTypeConfig` from `plugin.proto`.
## Admin UI
- Route: `/plugin`
- Includes:
- runtime status,
- workers/capabilities,
- declarative descriptor-driven config forms,
- run history (last 10 successful + last 10 error runs),
- tracked jobs and activity stream,
- manual actions for schema refresh, detection, and detect+execute workflow.
## Scheduling Policy (Initial)
Detector selection per job type:
- only workers with `can_detect=true`.
- prefer healthy worker with highest free detection slots (sketched after this list).
- lease ends when heartbeat timeout or stream drop.
Execution dispatch:
- only workers with `can_execute=true`.
- select by available execution slots and least active jobs.
- retry on failure using admin runtime retry config.
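A hypothetical sketch of the detector-selection rule; the registry's real types are richer, and a later commit in this PR adds round-robin tie-breaking:

```go
// pickDetector returns the healthy candidate with the most free detection
// slots; ties keep the earlier candidate in this simplified version.
type detectorCandidate struct {
	WorkerID           string
	Healthy            bool
	FreeDetectionSlots int
}

func pickDetector(candidates []detectorCandidate) (detectorCandidate, bool) {
	var best detectorCandidate
	found := false
	for _, c := range candidates {
		if !c.Healthy {
			continue
		}
		if !found || c.FreeDetectionSlots > best.FreeDetectionSlots {
			best, found = c, true
		}
	}
	return best, found
}
```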
## Safety and Reliability
- Idempotency: dedupe proposals by (`job_type`, `dedupe_key`), as sketched after this list.
- Backpressure: enforce max jobs per detection run.
- Timeouts: detection and execution timeout from admin runtime config.
- Replay-safe persistence: write job state changes before emitting UI events.
- Heartbeat-based failover for detector/executor reassignment.
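A minimal sketch of the per-run proposal dedupe, with an illustrative Proposal type standing in for the protobuf message:

```go
// Proposal is illustrative; the real message comes from plugin.proto.
type Proposal struct{ DedupeKey string }

// dedupeProposals collapses proposals that share a (job_type, dedupe_key)
// pair within one detection run, keeping the first occurrence.
func dedupeProposals(jobType string, proposals []Proposal) []Proposal {
	seen := make(map[string]struct{}, len(proposals))
	out := make([]Proposal, 0, len(proposals))
	for _, p := range proposals {
		key := jobType + "\x00" + p.DedupeKey
		if _, dup := seen[key]; dup {
			continue
		}
		seen[key] = struct{}{}
		out = append(out, p)
	}
	return out
}
```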
## Backward Compatibility
- Legacy `worker.proto` runtime remains internally available where still referenced.
- External CLI worker path is moved to plugin runtime behavior.
- Runtime is enabled by default on the admin's worker-facing gRPC server.
## Incremental Rollout Plan
Phase 1
- Introduce protocol and storage models only.
Phase 2
- Build admin registry/scheduler/dispatcher behind feature flag.
Phase 3
- Add dedicated plugin UI pages and metrics.
Phase 4
- Port one existing job type (e.g. vacuum) as external worker plugin.
Phase 4 status (starter)
- Added `weed worker` command as an external `plugin.proto` worker process.
- Initial handler implements `vacuum` job type with:
- declarative descriptor/config form response (`ConfigSchemaResponse`),
- detection via master topology scan (`RunDetectionRequest`),
- execution via existing vacuum task logic (`ExecuteJobRequest`),
- heartbeat/load reporting for monitor UI.
- Legacy maintenance-worker-specific CLI path is removed.
Run example:
- Start admin: `weed admin -master=localhost:9333`
- Start worker: `weed worker -admin=localhost:23646`
- Optional explicit job type: `weed worker -admin=localhost:23646 -jobType=vacuum`
- Optional stable worker ID persistence: `weed worker -admin=localhost:23646 -workingDir=/var/lib/seaweedfs-plugin`
Phase 5
- Migrate remaining job types and deprecate old mechanism.
## Agreed Defaults
1. Detector multiplicity
- Exactly one detector worker per job type at a time. Admin selects one worker and runs detection there.
2. Secret handling
- No encryption at rest required for plugin config in this phase.
3. Schema compatibility
- No migration policy required yet; this is a new system.
4. Execution ownership
- Same worker is allowed to do both detection and execution.
5. Retention
- Keep last 10 successful runs and last 10 error runs per job type.


@@ -0,0 +1,739 @@
package plugin
import (
"encoding/json"
"fmt"
"net/url"
"os"
"path/filepath"
"regexp"
"sort"
"strings"
"sync"
"time"
"github.com/seaweedfs/seaweedfs/weed/pb/plugin_pb"
"google.golang.org/protobuf/encoding/protojson"
"google.golang.org/protobuf/proto"
)
const (
pluginDirName = "plugin"
jobTypesDirName = "job_types"
jobsDirName = "jobs"
jobDetailsDirName = "job_details"
activitiesDirName = "activities"
descriptorPBFileName = "descriptor.pb"
descriptorJSONFileName = "descriptor.json"
configPBFileName = "config.pb"
configJSONFileName = "config.json"
runsJSONFileName = "runs.json"
trackedJobsJSONFileName = "tracked_jobs.json"
activitiesJSONFileName = "activities.json"
defaultDirPerm = 0o755
defaultFilePerm = 0o644
)
// validJobTypePattern is the canonical pattern for safe job type names.
// Only letters, digits, underscore, dash, and dot are allowed, which prevents
// path traversal because '/', '\\', and whitespace are rejected.
var validJobTypePattern = regexp.MustCompile(`^[A-Za-z0-9_.-]+$`)
// ConfigStore persists plugin configuration and bounded run history.
// If admin data dir is empty, it transparently falls back to in-memory mode.
type ConfigStore struct {
configured bool
baseDir string
mu sync.RWMutex
memDescriptors map[string]*plugin_pb.JobTypeDescriptor
memConfigs map[string]*plugin_pb.PersistedJobTypeConfig
memRunHistory map[string]*JobTypeRunHistory
memTrackedJobs []TrackedJob
memActivities []JobActivity
memJobDetails map[string]TrackedJob
}
func NewConfigStore(adminDataDir string) (*ConfigStore, error) {
store := &ConfigStore{
configured: adminDataDir != "",
memDescriptors: make(map[string]*plugin_pb.JobTypeDescriptor),
memConfigs: make(map[string]*plugin_pb.PersistedJobTypeConfig),
memRunHistory: make(map[string]*JobTypeRunHistory),
memJobDetails: make(map[string]TrackedJob),
}
if adminDataDir == "" {
return store, nil
}
store.baseDir = filepath.Join(adminDataDir, pluginDirName)
if err := os.MkdirAll(filepath.Join(store.baseDir, jobTypesDirName), defaultDirPerm); err != nil {
return nil, fmt.Errorf("create plugin job_types dir: %w", err)
}
if err := os.MkdirAll(filepath.Join(store.baseDir, jobsDirName), defaultDirPerm); err != nil {
return nil, fmt.Errorf("create plugin jobs dir: %w", err)
}
if err := os.MkdirAll(filepath.Join(store.baseDir, jobsDirName, jobDetailsDirName), defaultDirPerm); err != nil {
return nil, fmt.Errorf("create plugin job_details dir: %w", err)
}
if err := os.MkdirAll(filepath.Join(store.baseDir, activitiesDirName), defaultDirPerm); err != nil {
return nil, fmt.Errorf("create plugin activities dir: %w", err)
}
return store, nil
}
func (s *ConfigStore) IsConfigured() bool {
return s.configured
}
func (s *ConfigStore) BaseDir() string {
return s.baseDir
}
func (s *ConfigStore) SaveDescriptor(jobType string, descriptor *plugin_pb.JobTypeDescriptor) error {
if descriptor == nil {
return fmt.Errorf("descriptor is nil")
}
if _, err := sanitizeJobType(jobType); err != nil {
return err
}
clone := proto.Clone(descriptor).(*plugin_pb.JobTypeDescriptor)
if clone.JobType == "" {
clone.JobType = jobType
}
s.mu.Lock()
defer s.mu.Unlock()
if !s.configured {
s.memDescriptors[jobType] = clone
return nil
}
jobTypeDir, err := s.ensureJobTypeDir(jobType)
if err != nil {
return err
}
pbPath := filepath.Join(jobTypeDir, descriptorPBFileName)
jsonPath := filepath.Join(jobTypeDir, descriptorJSONFileName)
if err := writeProtoFiles(clone, pbPath, jsonPath); err != nil {
return fmt.Errorf("save descriptor for %s: %w", jobType, err)
}
return nil
}
func (s *ConfigStore) LoadDescriptor(jobType string) (*plugin_pb.JobTypeDescriptor, error) {
if _, err := sanitizeJobType(jobType); err != nil {
return nil, err
}
s.mu.RLock()
if !s.configured {
d := s.memDescriptors[jobType]
s.mu.RUnlock()
if d == nil {
return nil, nil
}
return proto.Clone(d).(*plugin_pb.JobTypeDescriptor), nil
}
s.mu.RUnlock()
pbPath := filepath.Join(s.baseDir, jobTypesDirName, jobType, descriptorPBFileName)
data, err := os.ReadFile(pbPath)
if err != nil {
if os.IsNotExist(err) {
return nil, nil
}
return nil, fmt.Errorf("read descriptor for %s: %w", jobType, err)
}
var descriptor plugin_pb.JobTypeDescriptor
if err := proto.Unmarshal(data, &descriptor); err != nil {
return nil, fmt.Errorf("unmarshal descriptor for %s: %w", jobType, err)
}
return &descriptor, nil
}
func (s *ConfigStore) SaveJobTypeConfig(config *plugin_pb.PersistedJobTypeConfig) error {
if config == nil {
return fmt.Errorf("job type config is nil")
}
if config.JobType == "" {
return fmt.Errorf("job type config has empty job_type")
}
sanitizedJobType, err := sanitizeJobType(config.JobType)
if err != nil {
return err
}
// Use the sanitized job type going forward to ensure it is safe for filesystem paths.
config.JobType = sanitizedJobType
clone := proto.Clone(config).(*plugin_pb.PersistedJobTypeConfig)
s.mu.Lock()
defer s.mu.Unlock()
if !s.configured {
s.memConfigs[config.JobType] = clone
return nil
}
jobTypeDir, err := s.ensureJobTypeDir(config.JobType)
if err != nil {
return err
}
pbPath := filepath.Join(jobTypeDir, configPBFileName)
jsonPath := filepath.Join(jobTypeDir, configJSONFileName)
if err := writeProtoFiles(clone, pbPath, jsonPath); err != nil {
return fmt.Errorf("save job type config for %s: %w", config.JobType, err)
}
return nil
}
func (s *ConfigStore) LoadJobTypeConfig(jobType string) (*plugin_pb.PersistedJobTypeConfig, error) {
if _, err := sanitizeJobType(jobType); err != nil {
return nil, err
}
s.mu.RLock()
if !s.configured {
cfg := s.memConfigs[jobType]
s.mu.RUnlock()
if cfg == nil {
return nil, nil
}
return proto.Clone(cfg).(*plugin_pb.PersistedJobTypeConfig), nil
}
s.mu.RUnlock()
pbPath := filepath.Join(s.baseDir, jobTypesDirName, jobType, configPBFileName)
data, err := os.ReadFile(pbPath)
if err != nil {
if os.IsNotExist(err) {
return nil, nil
}
return nil, fmt.Errorf("read job type config for %s: %w", jobType, err)
}
var config plugin_pb.PersistedJobTypeConfig
if err := proto.Unmarshal(data, &config); err != nil {
return nil, fmt.Errorf("unmarshal job type config for %s: %w", jobType, err)
}
return &config, nil
}
func (s *ConfigStore) AppendRunRecord(jobType string, record *JobRunRecord) error {
if record == nil {
return fmt.Errorf("run record is nil")
}
if _, err := sanitizeJobType(jobType); err != nil {
return err
}
safeRecord := *record
if safeRecord.JobType == "" {
safeRecord.JobType = jobType
}
if safeRecord.CompletedAt == nil || safeRecord.CompletedAt.IsZero() {
safeRecord.CompletedAt = timeToPtr(time.Now().UTC())
}
s.mu.Lock()
defer s.mu.Unlock()
history, err := s.loadRunHistoryLocked(jobType)
if err != nil {
return err
}
if safeRecord.Outcome == RunOutcomeSuccess {
history.SuccessfulRuns = append(history.SuccessfulRuns, safeRecord)
} else {
safeRecord.Outcome = RunOutcomeError
history.ErrorRuns = append(history.ErrorRuns, safeRecord)
}
history.SuccessfulRuns = trimRuns(history.SuccessfulRuns, MaxSuccessfulRunHistory)
history.ErrorRuns = trimRuns(history.ErrorRuns, MaxErrorRunHistory)
history.LastUpdatedTime = timeToPtr(time.Now().UTC())
return s.saveRunHistoryLocked(jobType, history)
}
func (s *ConfigStore) LoadRunHistory(jobType string) (*JobTypeRunHistory, error) {
if _, err := sanitizeJobType(jobType); err != nil {
return nil, err
}
s.mu.Lock()
defer s.mu.Unlock()
history, err := s.loadRunHistoryLocked(jobType)
if err != nil {
return nil, err
}
return cloneRunHistory(history), nil
}
func (s *ConfigStore) SaveTrackedJobs(jobs []TrackedJob) error {
s.mu.Lock()
defer s.mu.Unlock()
clone := cloneTrackedJobs(jobs)
if !s.configured {
s.memTrackedJobs = clone
return nil
}
encoded, err := json.MarshalIndent(clone, "", " ")
if err != nil {
return fmt.Errorf("encode tracked jobs: %w", err)
}
path := filepath.Join(s.baseDir, jobsDirName, trackedJobsJSONFileName)
if err := atomicWriteFile(path, encoded, defaultFilePerm); err != nil {
return fmt.Errorf("write tracked jobs: %w", err)
}
return nil
}
func (s *ConfigStore) LoadTrackedJobs() ([]TrackedJob, error) {
s.mu.RLock()
if !s.configured {
out := cloneTrackedJobs(s.memTrackedJobs)
s.mu.RUnlock()
return out, nil
}
s.mu.RUnlock()
path := filepath.Join(s.baseDir, jobsDirName, trackedJobsJSONFileName)
data, err := os.ReadFile(path)
if err != nil {
if os.IsNotExist(err) {
return nil, nil
}
return nil, fmt.Errorf("read tracked jobs: %w", err)
}
var jobs []TrackedJob
if err := json.Unmarshal(data, &jobs); err != nil {
return nil, fmt.Errorf("parse tracked jobs: %w", err)
}
return cloneTrackedJobs(jobs), nil
}
func (s *ConfigStore) SaveJobDetail(job TrackedJob) error {
jobID, err := sanitizeJobID(job.JobID)
if err != nil {
return err
}
s.mu.Lock()
defer s.mu.Unlock()
clone := cloneTrackedJob(job)
clone.JobID = jobID
if !s.configured {
s.memJobDetails[jobID] = clone
return nil
}
encoded, err := json.MarshalIndent(clone, "", " ")
if err != nil {
return fmt.Errorf("encode job detail: %w", err)
}
path := filepath.Join(s.baseDir, jobsDirName, jobDetailsDirName, jobDetailFileName(jobID))
if err := atomicWriteFile(path, encoded, defaultFilePerm); err != nil {
return fmt.Errorf("write job detail: %w", err)
}
return nil
}
func (s *ConfigStore) LoadJobDetail(jobID string) (*TrackedJob, error) {
jobID, err := sanitizeJobID(jobID)
if err != nil {
return nil, err
}
s.mu.RLock()
if !s.configured {
job, ok := s.memJobDetails[jobID]
s.mu.RUnlock()
if !ok {
return nil, nil
}
clone := cloneTrackedJob(job)
return &clone, nil
}
s.mu.RUnlock()
path := filepath.Join(s.baseDir, jobsDirName, jobDetailsDirName, jobDetailFileName(jobID))
data, err := os.ReadFile(path)
if err != nil {
if os.IsNotExist(err) {
return nil, nil
}
return nil, fmt.Errorf("read job detail: %w", err)
}
var job TrackedJob
if err := json.Unmarshal(data, &job); err != nil {
return nil, fmt.Errorf("parse job detail: %w", err)
}
clone := cloneTrackedJob(job)
return &clone, nil
}
func (s *ConfigStore) SaveActivities(activities []JobActivity) error {
s.mu.Lock()
defer s.mu.Unlock()
clone := cloneActivities(activities)
if !s.configured {
s.memActivities = clone
return nil
}
encoded, err := json.MarshalIndent(clone, "", " ")
if err != nil {
return fmt.Errorf("encode activities: %w", err)
}
path := filepath.Join(s.baseDir, activitiesDirName, activitiesJSONFileName)
if err := atomicWriteFile(path, encoded, defaultFilePerm); err != nil {
return fmt.Errorf("write activities: %w", err)
}
return nil
}
func (s *ConfigStore) LoadActivities() ([]JobActivity, error) {
s.mu.RLock()
if !s.configured {
out := cloneActivities(s.memActivities)
s.mu.RUnlock()
return out, nil
}
s.mu.RUnlock()
path := filepath.Join(s.baseDir, activitiesDirName, activitiesJSONFileName)
data, err := os.ReadFile(path)
if err != nil {
if os.IsNotExist(err) {
return nil, nil
}
return nil, fmt.Errorf("read activities: %w", err)
}
var activities []JobActivity
if err := json.Unmarshal(data, &activities); err != nil {
return nil, fmt.Errorf("parse activities: %w", err)
}
return cloneActivities(activities), nil
}
func (s *ConfigStore) ListJobTypes() ([]string, error) {
s.mu.RLock()
defer s.mu.RUnlock()
jobTypeSet := make(map[string]struct{})
if !s.configured {
for jobType := range s.memDescriptors {
jobTypeSet[jobType] = struct{}{}
}
for jobType := range s.memConfigs {
jobTypeSet[jobType] = struct{}{}
}
for jobType := range s.memRunHistory {
jobTypeSet[jobType] = struct{}{}
}
} else {
jobTypesPath := filepath.Join(s.baseDir, jobTypesDirName)
entries, err := os.ReadDir(jobTypesPath)
if err != nil {
if os.IsNotExist(err) {
return []string{}, nil
}
return nil, fmt.Errorf("list job types: %w", err)
}
for _, entry := range entries {
if !entry.IsDir() {
continue
}
jobType := strings.TrimSpace(entry.Name())
if _, err := sanitizeJobType(jobType); err != nil {
continue
}
jobTypeSet[jobType] = struct{}{}
}
}
jobTypes := make([]string, 0, len(jobTypeSet))
for jobType := range jobTypeSet {
jobTypes = append(jobTypes, jobType)
}
sort.Strings(jobTypes)
return jobTypes, nil
}
func (s *ConfigStore) loadRunHistoryLocked(jobType string) (*JobTypeRunHistory, error) {
if !s.configured {
history, ok := s.memRunHistory[jobType]
if !ok {
history = &JobTypeRunHistory{JobType: jobType}
s.memRunHistory[jobType] = history
}
return cloneRunHistory(history), nil
}
runsPath := filepath.Join(s.baseDir, jobTypesDirName, jobType, runsJSONFileName)
data, err := os.ReadFile(runsPath)
if err != nil {
if os.IsNotExist(err) {
return &JobTypeRunHistory{JobType: jobType}, nil
}
return nil, fmt.Errorf("read run history for %s: %w", jobType, err)
}
var history JobTypeRunHistory
if err := json.Unmarshal(data, &history); err != nil {
return nil, fmt.Errorf("parse run history for %s: %w", jobType, err)
}
if history.JobType == "" {
history.JobType = jobType
}
return &history, nil
}
func (s *ConfigStore) saveRunHistoryLocked(jobType string, history *JobTypeRunHistory) error {
if !s.configured {
s.memRunHistory[jobType] = cloneRunHistory(history)
return nil
}
jobTypeDir, err := s.ensureJobTypeDir(jobType)
if err != nil {
return err
}
encoded, err := json.MarshalIndent(history, "", " ")
if err != nil {
return fmt.Errorf("encode run history for %s: %w", jobType, err)
}
runsPath := filepath.Join(jobTypeDir, runsJSONFileName)
if err := atomicWriteFile(runsPath, encoded, defaultFilePerm); err != nil {
return fmt.Errorf("write run history for %s: %w", jobType, err)
}
return nil
}
func (s *ConfigStore) ensureJobTypeDir(jobType string) (string, error) {
if !s.configured {
return "", nil
}
jobTypeDir := filepath.Join(s.baseDir, jobTypesDirName, jobType)
if err := os.MkdirAll(jobTypeDir, defaultDirPerm); err != nil {
return "", fmt.Errorf("create job type dir for %s: %w", jobType, err)
}
return jobTypeDir, nil
}
func sanitizeJobType(jobType string) (string, error) {
jobType = strings.TrimSpace(jobType)
if jobType == "" {
return "", fmt.Errorf("job type is empty")
}
// Enforce a strict, path-safe pattern for job types: only letters, digits, underscore, dash and dot.
// This prevents path traversal because '/', '\\' and whitespace are rejected.
	if !validJobTypePattern.MatchString(jobType) {
		return "", fmt.Errorf("invalid job type %q: must match %s", jobType, validJobTypePattern.String())
	}
	// The pattern still admits the dot-only names "." and "..", which are
	// path segments; reject them so a job type can never resolve to the
	// current or parent directory.
	if jobType == "." || jobType == ".." {
		return "", fmt.Errorf("invalid job type %q: must not be a dot segment", jobType)
	}
	return jobType, nil
}
// validJobIDPattern allows letters, digits, dash, underscore, and dot.
// url.PathEscape in jobDetailFileName provides a second layer of defense.
var validJobIDPattern = regexp.MustCompile(`^[A-Za-z0-9_.-]+$`)
func sanitizeJobID(jobID string) (string, error) {
jobID = strings.TrimSpace(jobID)
if jobID == "" {
return "", fmt.Errorf("job id is empty")
}
if !validJobIDPattern.MatchString(jobID) {
return "", fmt.Errorf("invalid job id %q: must match %s", jobID, validJobIDPattern.String())
}
return jobID, nil
}
func jobDetailFileName(jobID string) string {
return url.PathEscape(jobID) + ".json"
}
func trimRuns(runs []JobRunRecord, maxKeep int) []JobRunRecord {
if len(runs) == 0 {
return runs
}
sort.Slice(runs, func(i, j int) bool {
ti := time.Time{}
if runs[i].CompletedAt != nil {
ti = *runs[i].CompletedAt
}
tj := time.Time{}
if runs[j].CompletedAt != nil {
tj = *runs[j].CompletedAt
}
return ti.After(tj)
})
if len(runs) > maxKeep {
runs = runs[:maxKeep]
}
return runs
}
func cloneRunHistory(in *JobTypeRunHistory) *JobTypeRunHistory {
if in == nil {
return nil
}
out := *in
if in.SuccessfulRuns != nil {
out.SuccessfulRuns = append([]JobRunRecord(nil), in.SuccessfulRuns...)
}
if in.ErrorRuns != nil {
out.ErrorRuns = append([]JobRunRecord(nil), in.ErrorRuns...)
}
return &out
}
func cloneTrackedJobs(in []TrackedJob) []TrackedJob {
if len(in) == 0 {
return nil
}
out := make([]TrackedJob, len(in))
for i := range in {
out[i] = cloneTrackedJob(in[i])
}
return out
}
func cloneTrackedJob(in TrackedJob) TrackedJob {
out := in
if in.Parameters != nil {
out.Parameters = make(map[string]interface{}, len(in.Parameters))
for key, value := range in.Parameters {
out.Parameters[key] = deepCopyGenericValue(value)
}
}
if in.Labels != nil {
out.Labels = make(map[string]string, len(in.Labels))
for key, value := range in.Labels {
out.Labels[key] = value
}
}
if in.ResultOutputValues != nil {
out.ResultOutputValues = make(map[string]interface{}, len(in.ResultOutputValues))
for key, value := range in.ResultOutputValues {
out.ResultOutputValues[key] = deepCopyGenericValue(value)
}
}
return out
}
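// deepCopyGenericValue recursively copies JSON-style maps and slices so that
// callers cannot mutate shared nested state; scalar values are returned as-is.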
func deepCopyGenericValue(val interface{}) interface{} {
switch v := val.(type) {
case map[string]interface{}:
res := make(map[string]interface{}, len(v))
for k, val := range v {
res[k] = deepCopyGenericValue(val)
}
return res
case []interface{}:
res := make([]interface{}, len(v))
for i, val := range v {
res[i] = deepCopyGenericValue(val)
}
return res
default:
return v
}
}
func cloneActivities(in []JobActivity) []JobActivity {
if len(in) == 0 {
return nil
}
out := make([]JobActivity, len(in))
for i := range in {
out[i] = in[i]
if in[i].Details != nil {
out[i].Details = make(map[string]interface{}, len(in[i].Details))
for key, value := range in[i].Details {
out[i].Details[key] = deepCopyGenericValue(value)
}
}
}
return out
}
// writeProtoFiles writes message to both a binary protobuf file (pbPath) and a
// human-readable JSON file (jsonPath) using atomicWriteFile for each.
// The .pb file is the authoritative source of truth: all reads use proto.Unmarshal
// on the .pb file. The .json file is for human inspection only, so a partial
// failure where .pb succeeds but .json fails leaves the store in a consistent state.
func writeProtoFiles(message proto.Message, pbPath string, jsonPath string) error {
pbData, err := proto.Marshal(message)
if err != nil {
return fmt.Errorf("marshal protobuf: %w", err)
}
if err := atomicWriteFile(pbPath, pbData, defaultFilePerm); err != nil {
return fmt.Errorf("write protobuf file: %w", err)
}
jsonData, err := protojson.MarshalOptions{
Multiline: true,
Indent: " ",
EmitUnpopulated: true,
}.Marshal(message)
if err != nil {
return fmt.Errorf("marshal json: %w", err)
}
if err := atomicWriteFile(jsonPath, jsonData, defaultFilePerm); err != nil {
return fmt.Errorf("write json file: %w", err)
}
return nil
}
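// atomicWriteFile writes data to a temporary sibling file and renames it into
// place, so readers never observe a partially written file; the rename is
// atomic when source and destination live on the same filesystem.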
func atomicWriteFile(filename string, data []byte, perm os.FileMode) error {
dir := filepath.Dir(filename)
if err := os.MkdirAll(dir, defaultDirPerm); err != nil {
return fmt.Errorf("create directory %s: %w", dir, err)
}
tmpFile := filename + ".tmp"
if err := os.WriteFile(tmpFile, data, perm); err != nil {
return err
}
if err := os.Rename(tmpFile, filename); err != nil {
_ = os.Remove(tmpFile)
return err
}
return nil
}


@@ -0,0 +1,257 @@
package plugin
import (
"reflect"
"testing"
"time"
"github.com/seaweedfs/seaweedfs/weed/pb/plugin_pb"
)
func TestConfigStoreDescriptorRoundTrip(t *testing.T) {
t.Parallel()
tempDir := t.TempDir()
store, err := NewConfigStore(tempDir)
if err != nil {
t.Fatalf("NewConfigStore: %v", err)
}
descriptor := &plugin_pb.JobTypeDescriptor{
JobType: "vacuum",
DisplayName: "Vacuum",
Description: "Vacuum volumes",
DescriptorVersion: 1,
}
if err := store.SaveDescriptor("vacuum", descriptor); err != nil {
t.Fatalf("SaveDescriptor: %v", err)
}
got, err := store.LoadDescriptor("vacuum")
if err != nil {
t.Fatalf("LoadDescriptor: %v", err)
}
if got == nil {
t.Fatalf("LoadDescriptor: nil descriptor")
}
if got.DisplayName != descriptor.DisplayName {
t.Fatalf("unexpected display name: got %q want %q", got.DisplayName, descriptor.DisplayName)
}
}
func TestConfigStoreRunHistoryRetention(t *testing.T) {
t.Parallel()
store, err := NewConfigStore(t.TempDir())
if err != nil {
t.Fatalf("NewConfigStore: %v", err)
}
base := time.Now().UTC().Add(-24 * time.Hour)
for i := 0; i < 15; i++ {
err := store.AppendRunRecord("balance", &JobRunRecord{
RunID: "s" + time.Duration(i).String(),
JobID: "job-success",
JobType: "balance",
WorkerID: "worker-a",
Outcome: RunOutcomeSuccess,
CompletedAt: timeToPtr(base.Add(time.Duration(i) * time.Minute)),
})
if err != nil {
t.Fatalf("AppendRunRecord success[%d]: %v", i, err)
}
}
for i := 0; i < 12; i++ {
err := store.AppendRunRecord("balance", &JobRunRecord{
RunID: "e" + time.Duration(i).String(),
JobID: "job-error",
JobType: "balance",
WorkerID: "worker-b",
Outcome: RunOutcomeError,
CompletedAt: timeToPtr(base.Add(time.Duration(i) * time.Minute)),
})
if err != nil {
t.Fatalf("AppendRunRecord error[%d]: %v", i, err)
}
}
history, err := store.LoadRunHistory("balance")
if err != nil {
t.Fatalf("LoadRunHistory: %v", err)
}
if len(history.SuccessfulRuns) != MaxSuccessfulRunHistory {
t.Fatalf("successful retention mismatch: got %d want %d", len(history.SuccessfulRuns), MaxSuccessfulRunHistory)
}
if len(history.ErrorRuns) != MaxErrorRunHistory {
t.Fatalf("error retention mismatch: got %d want %d", len(history.ErrorRuns), MaxErrorRunHistory)
}
for i := 1; i < len(history.SuccessfulRuns); i++ {
t1 := time.Time{}
if history.SuccessfulRuns[i-1].CompletedAt != nil {
t1 = *history.SuccessfulRuns[i-1].CompletedAt
}
t2 := time.Time{}
if history.SuccessfulRuns[i].CompletedAt != nil {
t2 = *history.SuccessfulRuns[i].CompletedAt
}
if t1.Before(t2) {
t.Fatalf("successful run order not descending at %d", i)
}
}
for i := 1; i < len(history.ErrorRuns); i++ {
t1 := time.Time{}
if history.ErrorRuns[i-1].CompletedAt != nil {
t1 = *history.ErrorRuns[i-1].CompletedAt
}
t2 := time.Time{}
if history.ErrorRuns[i].CompletedAt != nil {
t2 = *history.ErrorRuns[i].CompletedAt
}
if t1.Before(t2) {
t.Fatalf("error run order not descending at %d", i)
}
}
}
func TestConfigStoreListJobTypes(t *testing.T) {
t.Parallel()
store, err := NewConfigStore("")
if err != nil {
t.Fatalf("NewConfigStore: %v", err)
}
if err := store.SaveDescriptor("vacuum", &plugin_pb.JobTypeDescriptor{JobType: "vacuum"}); err != nil {
t.Fatalf("SaveDescriptor: %v", err)
}
if err := store.SaveJobTypeConfig(&plugin_pb.PersistedJobTypeConfig{
JobType: "balance",
AdminRuntime: &plugin_pb.AdminRuntimeConfig{Enabled: true},
}); err != nil {
t.Fatalf("SaveJobTypeConfig: %v", err)
}
if err := store.AppendRunRecord("ec", &JobRunRecord{Outcome: RunOutcomeSuccess, CompletedAt: timeToPtr(time.Now().UTC())}); err != nil {
t.Fatalf("AppendRunRecord: %v", err)
}
got, err := store.ListJobTypes()
if err != nil {
t.Fatalf("ListJobTypes: %v", err)
}
want := []string{"balance", "ec", "vacuum"}
if !reflect.DeepEqual(got, want) {
t.Fatalf("unexpected job types: got=%v want=%v", got, want)
}
}
func TestConfigStoreMonitorStateRoundTrip(t *testing.T) {
t.Parallel()
store, err := NewConfigStore(t.TempDir())
if err != nil {
t.Fatalf("NewConfigStore: %v", err)
}
tracked := []TrackedJob{
{
JobID: "job-1",
JobType: "vacuum",
State: "running",
Progress: 55,
WorkerID: "worker-a",
CreatedAt: timeToPtr(time.Now().UTC().Add(-2 * time.Minute)),
UpdatedAt: timeToPtr(time.Now().UTC().Add(-1 * time.Minute)),
},
}
activities := []JobActivity{
{
JobID: "job-1",
JobType: "vacuum",
Source: "worker_progress",
Message: "processing",
Stage: "running",
OccurredAt: timeToPtr(time.Now().UTC()),
Details: map[string]interface{}{
"step": "scan",
},
},
}
if err := store.SaveTrackedJobs(tracked); err != nil {
t.Fatalf("SaveTrackedJobs: %v", err)
}
if err := store.SaveActivities(activities); err != nil {
t.Fatalf("SaveActivities: %v", err)
}
gotTracked, err := store.LoadTrackedJobs()
if err != nil {
t.Fatalf("LoadTrackedJobs: %v", err)
}
if len(gotTracked) != 1 || gotTracked[0].JobID != tracked[0].JobID {
t.Fatalf("unexpected tracked jobs: %+v", gotTracked)
}
gotActivities, err := store.LoadActivities()
if err != nil {
t.Fatalf("LoadActivities: %v", err)
}
if len(gotActivities) != 1 || gotActivities[0].Message != activities[0].Message {
t.Fatalf("unexpected activities: %+v", gotActivities)
}
if gotActivities[0].Details["step"] != "scan" {
t.Fatalf("unexpected activity details: %+v", gotActivities[0].Details)
}
}
func TestConfigStoreJobDetailRoundTrip(t *testing.T) {
t.Parallel()
store, err := NewConfigStore(t.TempDir())
if err != nil {
t.Fatalf("NewConfigStore: %v", err)
}
input := TrackedJob{
JobID: "job-detail-1",
JobType: "vacuum",
Summary: "detail summary",
Detail: "detail payload",
CreatedAt: timeToPtr(time.Now().UTC().Add(-2 * time.Minute)),
UpdatedAt: timeToPtr(time.Now().UTC()),
Parameters: map[string]interface{}{
"volume_id": map[string]interface{}{"int64_value": "3"},
},
Labels: map[string]string{
"source": "detector",
},
ResultOutputValues: map[string]interface{}{
"moved": map[string]interface{}{"bool_value": true},
},
}
if err := store.SaveJobDetail(input); err != nil {
t.Fatalf("SaveJobDetail: %v", err)
}
got, err := store.LoadJobDetail(input.JobID)
if err != nil {
t.Fatalf("LoadJobDetail: %v", err)
}
if got == nil {
t.Fatalf("LoadJobDetail returned nil")
}
if got.Detail != input.Detail {
t.Fatalf("unexpected detail: got=%q want=%q", got.Detail, input.Detail)
}
if got.Labels["source"] != "detector" {
t.Fatalf("unexpected labels: %+v", got.Labels)
}
if got.ResultOutputValues == nil {
t.Fatalf("expected result output values")
}
}


@@ -0,0 +1,231 @@
package plugin
import (
"encoding/base64"
"sort"
"strconv"
"strings"
"github.com/seaweedfs/seaweedfs/weed/pb/worker_pb"
"github.com/seaweedfs/seaweedfs/weed/storage/erasure_coding"
"google.golang.org/protobuf/proto"
)
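// enrichTrackedJobParameters derives an "execution_plan" entry from encoded
// worker_pb.TaskParams embedded in a job's parameters, returning the map
// unchanged when a plan already exists or the parameters cannot be decoded.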
func enrichTrackedJobParameters(jobType string, parameters map[string]interface{}) map[string]interface{} {
if len(parameters) == 0 {
return parameters
}
if _, exists := parameters["execution_plan"]; exists {
return parameters
}
taskParams, ok := decodeTaskParamsFromPlainParameters(parameters)
if !ok || taskParams == nil {
return parameters
}
plan := buildExecutionPlan(strings.TrimSpace(jobType), taskParams)
if plan == nil {
return parameters
}
enriched := make(map[string]interface{}, len(parameters)+1)
for key, value := range parameters {
enriched[key] = value
}
enriched["execution_plan"] = plan
return enriched
}
func decodeTaskParamsFromPlainParameters(parameters map[string]interface{}) (*worker_pb.TaskParams, bool) {
rawField, ok := parameters["task_params_pb"]
if !ok || rawField == nil {
return nil, false
}
fieldMap, ok := rawField.(map[string]interface{})
if !ok {
return nil, false
}
bytesValue, _ := fieldMap["bytes_value"].(string)
bytesValue = strings.TrimSpace(bytesValue)
if bytesValue == "" {
return nil, false
}
payload, err := base64.StdEncoding.DecodeString(bytesValue)
if err != nil {
return nil, false
}
params := &worker_pb.TaskParams{}
if err := proto.Unmarshal(payload, params); err != nil {
return nil, false
}
return params, true
}
func buildExecutionPlan(jobType string, params *worker_pb.TaskParams) map[string]interface{} {
if params == nil {
return nil
}
normalizedJobType := strings.TrimSpace(jobType)
if normalizedJobType == "" && params.GetErasureCodingParams() != nil {
normalizedJobType = "erasure_coding"
}
switch normalizedJobType {
case "erasure_coding":
return buildErasureCodingExecutionPlan(params)
default:
return nil
}
}
func buildErasureCodingExecutionPlan(params *worker_pb.TaskParams) map[string]interface{} {
if params == nil {
return nil
}
ecParams := params.GetErasureCodingParams()
if ecParams == nil {
return nil
}
dataShards := int(ecParams.DataShards)
if dataShards <= 0 {
dataShards = int(erasure_coding.DataShardsCount)
}
parityShards := int(ecParams.ParityShards)
if parityShards <= 0 {
parityShards = int(erasure_coding.ParityShardsCount)
}
totalShards := dataShards + parityShards
sources := make([]map[string]interface{}, 0, len(params.Sources))
for _, source := range params.Sources {
if source == nil {
continue
}
sources = append(sources, buildExecutionEndpoint(
source.Node,
source.DataCenter,
source.Rack,
source.VolumeId,
source.ShardIds,
dataShards,
))
}
targets := make([]map[string]interface{}, 0, len(params.Targets))
shardAssignments := make([]map[string]interface{}, 0, totalShards)
for targetIndex, target := range params.Targets {
if target == nil {
continue
}
targets = append(targets, buildExecutionEndpoint(
target.Node,
target.DataCenter,
target.Rack,
target.VolumeId,
target.ShardIds,
dataShards,
))
for _, shardID := range normalizeShardIDs(target.ShardIds) {
kind, label := classifyShardID(shardID, dataShards)
shardAssignments = append(shardAssignments, map[string]interface{}{
"shard_id": shardID,
"kind": kind,
"label": label,
"target_index": targetIndex + 1,
"target_node": strings.TrimSpace(target.Node),
"target_data_center": strings.TrimSpace(target.DataCenter),
"target_rack": strings.TrimSpace(target.Rack),
"target_volume_id": int(target.VolumeId),
})
}
}
sort.Slice(shardAssignments, func(i, j int) bool {
left, _ := shardAssignments[i]["shard_id"].(int)
right, _ := shardAssignments[j]["shard_id"].(int)
return left < right
})
plan := map[string]interface{}{
"job_type": "erasure_coding",
"task_id": strings.TrimSpace(params.TaskId),
"volume_id": int(params.VolumeId),
"collection": strings.TrimSpace(params.Collection),
"data_shards": dataShards,
"parity_shards": parityShards,
"total_shards": totalShards,
"sources": sources,
"targets": targets,
"source_count": len(sources),
"target_count": len(targets),
}
if len(shardAssignments) > 0 {
plan["shard_assignments"] = shardAssignments
}
return plan
}
func buildExecutionEndpoint(
node string,
dataCenter string,
rack string,
volumeID uint32,
shardIDs []uint32,
dataShardCount int,
) map[string]interface{} {
allShards := normalizeShardIDs(shardIDs)
dataShards := make([]int, 0, len(allShards))
parityShards := make([]int, 0, len(allShards))
for _, shardID := range allShards {
if shardID < dataShardCount {
dataShards = append(dataShards, shardID)
} else {
parityShards = append(parityShards, shardID)
}
}
return map[string]interface{}{
"node": strings.TrimSpace(node),
"data_center": strings.TrimSpace(dataCenter),
"rack": strings.TrimSpace(rack),
"volume_id": int(volumeID),
"shard_ids": allShards,
"data_shard_ids": dataShards,
"parity_shard_ids": parityShards,
}
}
func normalizeShardIDs(shardIDs []uint32) []int {
if len(shardIDs) == 0 {
return nil
}
out := make([]int, 0, len(shardIDs))
for _, shardID := range shardIDs {
out = append(out, int(shardID))
}
sort.Ints(out)
return out
}
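// classifyShardID buckets a shard index into data vs parity and builds the
// D/P badge label used by the UI. With the default 10 data + 4 parity layout,
// shard 3 yields ("data", "D3") and shard 12 yields ("parity", "P12").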
func classifyShardID(shardID int, dataShardCount int) (kind string, label string) {
if dataShardCount <= 0 {
dataShardCount = int(erasure_coding.DataShardsCount)
}
if shardID < dataShardCount {
return "data", "D" + strconv.Itoa(shardID)
}
return "parity", "P" + strconv.Itoa(shardID)
}

weed/admin/plugin/plugin.go

File diff suppressed because it is too large (1243 lines added).


@@ -0,0 +1,112 @@
package plugin
import (
"context"
"errors"
"testing"
"github.com/seaweedfs/seaweedfs/weed/pb/plugin_pb"
)
func TestRunDetectionSendsCancelOnContextDone(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New plugin error: %v", err)
}
defer pluginSvc.Shutdown()
const workerID = "worker-detect"
const jobType = "vacuum"
pluginSvc.registry.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: workerID,
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: jobType, CanDetect: true, MaxDetectionConcurrency: 1},
},
})
session := &streamSession{workerID: workerID, outgoing: make(chan *plugin_pb.AdminToWorkerMessage, 4)}
pluginSvc.putSession(session)
ctx, cancel := context.WithCancel(context.Background())
errCh := make(chan error, 1)
go func() {
_, runErr := pluginSvc.RunDetection(ctx, jobType, &plugin_pb.ClusterContext{}, 10)
errCh <- runErr
}()
first := <-session.outgoing
if first.GetRunDetectionRequest() == nil {
t.Fatalf("expected first message to be run_detection_request")
}
cancel()
second := <-session.outgoing
cancelReq := second.GetCancelRequest()
if cancelReq == nil {
t.Fatalf("expected second message to be cancel_request")
}
if cancelReq.TargetId != first.RequestId {
t.Fatalf("unexpected cancel target id: got=%s want=%s", cancelReq.TargetId, first.RequestId)
}
if cancelReq.TargetKind != plugin_pb.WorkKind_WORK_KIND_DETECTION {
t.Fatalf("unexpected cancel target kind: %v", cancelReq.TargetKind)
}
runErr := <-errCh
if !errors.Is(runErr, context.Canceled) {
t.Fatalf("expected context canceled error, got %v", runErr)
}
}
func TestExecuteJobSendsCancelOnContextDone(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New plugin error: %v", err)
}
defer pluginSvc.Shutdown()
const workerID = "worker-exec"
const jobType = "vacuum"
pluginSvc.registry.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: workerID,
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: jobType, CanExecute: true, MaxExecutionConcurrency: 1},
},
})
session := &streamSession{workerID: workerID, outgoing: make(chan *plugin_pb.AdminToWorkerMessage, 4)}
pluginSvc.putSession(session)
job := &plugin_pb.JobSpec{JobId: "job-1", JobType: jobType}
ctx, cancel := context.WithCancel(context.Background())
errCh := make(chan error, 1)
go func() {
_, runErr := pluginSvc.ExecuteJob(ctx, job, &plugin_pb.ClusterContext{}, 1)
errCh <- runErr
}()
first := <-session.outgoing
if first.GetExecuteJobRequest() == nil {
t.Fatalf("expected first message to be execute_job_request")
}
cancel()
second := <-session.outgoing
cancelReq := second.GetCancelRequest()
if cancelReq == nil {
t.Fatalf("expected second message to be cancel_request")
}
if cancelReq.TargetId != first.RequestId {
t.Fatalf("unexpected cancel target id: got=%s want=%s", cancelReq.TargetId, first.RequestId)
}
if cancelReq.TargetKind != plugin_pb.WorkKind_WORK_KIND_EXECUTION {
t.Fatalf("unexpected cancel target kind: %v", cancelReq.TargetKind)
}
runErr := <-errCh
if !errors.Is(runErr, context.Canceled) {
t.Fatalf("expected context canceled error, got %v", runErr)
}
}


@@ -0,0 +1,125 @@
package plugin
import (
"testing"
"github.com/seaweedfs/seaweedfs/weed/pb/plugin_pb"
)
func TestEnsureJobTypeConfigFromDescriptorBootstrapsDefaults(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
descriptor := &plugin_pb.JobTypeDescriptor{
JobType: "vacuum",
DescriptorVersion: 3,
AdminConfigForm: &plugin_pb.ConfigForm{
DefaultValues: map[string]*plugin_pb.ConfigValue{
"scan_scope": {Kind: &plugin_pb.ConfigValue_StringValue{StringValue: "all"}},
},
},
WorkerConfigForm: &plugin_pb.ConfigForm{
DefaultValues: map[string]*plugin_pb.ConfigValue{
"threshold": {Kind: &plugin_pb.ConfigValue_DoubleValue{DoubleValue: 0.3}},
},
},
AdminRuntimeDefaults: &plugin_pb.AdminRuntimeDefaults{
Enabled: true,
DetectionIntervalSeconds: 60,
DetectionTimeoutSeconds: 20,
MaxJobsPerDetection: 30,
GlobalExecutionConcurrency: 4,
PerWorkerExecutionConcurrency: 2,
RetryLimit: 3,
RetryBackoffSeconds: 5,
},
}
if err := pluginSvc.ensureJobTypeConfigFromDescriptor("vacuum", descriptor); err != nil {
t.Fatalf("ensureJobTypeConfigFromDescriptor: %v", err)
}
cfg, err := pluginSvc.LoadJobTypeConfig("vacuum")
if err != nil {
t.Fatalf("LoadJobTypeConfig: %v", err)
}
if cfg == nil {
t.Fatalf("expected non-nil config")
}
if cfg.DescriptorVersion != 3 {
t.Fatalf("unexpected descriptor version: got=%d", cfg.DescriptorVersion)
}
if cfg.AdminRuntime == nil || !cfg.AdminRuntime.Enabled {
t.Fatalf("expected enabled admin settings")
}
if cfg.AdminRuntime.GlobalExecutionConcurrency != 4 {
t.Fatalf("unexpected global execution concurrency: %d", cfg.AdminRuntime.GlobalExecutionConcurrency)
}
if _, ok := cfg.AdminConfigValues["scan_scope"]; !ok {
t.Fatalf("missing admin default value")
}
if _, ok := cfg.WorkerConfigValues["threshold"]; !ok {
t.Fatalf("missing worker default value")
}
}
func TestEnsureJobTypeConfigFromDescriptorDoesNotOverwriteExisting(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
if err := pluginSvc.SaveJobTypeConfig(&plugin_pb.PersistedJobTypeConfig{
JobType: "balance",
AdminRuntime: &plugin_pb.AdminRuntimeConfig{
Enabled: true,
GlobalExecutionConcurrency: 9,
},
AdminConfigValues: map[string]*plugin_pb.ConfigValue{
"custom": {Kind: &plugin_pb.ConfigValue_StringValue{StringValue: "keep"}},
},
}); err != nil {
t.Fatalf("SaveJobTypeConfig: %v", err)
}
descriptor := &plugin_pb.JobTypeDescriptor{
JobType: "balance",
DescriptorVersion: 7,
AdminConfigForm: &plugin_pb.ConfigForm{
DefaultValues: map[string]*plugin_pb.ConfigValue{
"custom": {Kind: &plugin_pb.ConfigValue_StringValue{StringValue: "overwrite"}},
},
},
AdminRuntimeDefaults: &plugin_pb.AdminRuntimeDefaults{
Enabled: true,
GlobalExecutionConcurrency: 1,
},
}
if err := pluginSvc.ensureJobTypeConfigFromDescriptor("balance", descriptor); err != nil {
t.Fatalf("ensureJobTypeConfigFromDescriptor: %v", err)
}
cfg, err := pluginSvc.LoadJobTypeConfig("balance")
if err != nil {
t.Fatalf("LoadJobTypeConfig: %v", err)
}
if cfg == nil {
t.Fatalf("expected config")
}
if cfg.AdminRuntime == nil || cfg.AdminRuntime.GlobalExecutionConcurrency != 9 {
t.Fatalf("existing admin settings should be preserved, got=%v", cfg.AdminRuntime)
}
custom := cfg.AdminConfigValues["custom"]
if custom == nil || custom.GetStringValue() != "keep" {
t.Fatalf("existing admin config should be preserved")
}
}


@@ -0,0 +1,197 @@
package plugin
import (
"context"
"testing"
"time"
"github.com/seaweedfs/seaweedfs/weed/pb/plugin_pb"
)
func TestRunDetectionIncludesLatestSuccessfulRun(t *testing.T) {
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New plugin error: %v", err)
}
defer pluginSvc.Shutdown()
jobType := "vacuum"
pluginSvc.registry.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-a",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: jobType, CanDetect: true, MaxDetectionConcurrency: 1},
},
})
session := &streamSession{workerID: "worker-a", outgoing: make(chan *plugin_pb.AdminToWorkerMessage, 1)}
pluginSvc.putSession(session)
oldSuccess := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
latestSuccess := time.Date(2026, 2, 1, 0, 0, 0, 0, time.UTC)
if err := pluginSvc.store.AppendRunRecord(jobType, &JobRunRecord{Outcome: RunOutcomeSuccess, CompletedAt: timeToPtr(oldSuccess)}); err != nil {
t.Fatalf("AppendRunRecord old success: %v", err)
}
if err := pluginSvc.store.AppendRunRecord(jobType, &JobRunRecord{Outcome: RunOutcomeError, CompletedAt: timeToPtr(latestSuccess.Add(2 * time.Hour))}); err != nil {
t.Fatalf("AppendRunRecord error run: %v", err)
}
if err := pluginSvc.store.AppendRunRecord(jobType, &JobRunRecord{Outcome: RunOutcomeSuccess, CompletedAt: timeToPtr(latestSuccess)}); err != nil {
t.Fatalf("AppendRunRecord latest success: %v", err)
}
resultCh := make(chan error, 1)
go func() {
_, runErr := pluginSvc.RunDetection(context.Background(), jobType, &plugin_pb.ClusterContext{}, 10)
resultCh <- runErr
}()
message := <-session.outgoing
detectRequest := message.GetRunDetectionRequest()
if detectRequest == nil {
t.Fatalf("expected run detection request message")
}
if detectRequest.LastSuccessfulRun == nil {
t.Fatalf("expected last_successful_run to be set")
}
if got := detectRequest.LastSuccessfulRun.AsTime().UTC(); !got.Equal(latestSuccess) {
t.Fatalf("unexpected last_successful_run, got=%s want=%s", got, latestSuccess)
}
pluginSvc.handleDetectionComplete("worker-a", &plugin_pb.DetectionComplete{
RequestId: message.RequestId,
JobType: jobType,
Success: true,
})
if runErr := <-resultCh; runErr != nil {
t.Fatalf("RunDetection error: %v", runErr)
}
}
func TestRunDetectionOmitsLastSuccessfulRunWhenNoSuccessHistory(t *testing.T) {
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New plugin error: %v", err)
}
defer pluginSvc.Shutdown()
jobType := "vacuum"
pluginSvc.registry.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-a",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: jobType, CanDetect: true, MaxDetectionConcurrency: 1},
},
})
session := &streamSession{workerID: "worker-a", outgoing: make(chan *plugin_pb.AdminToWorkerMessage, 1)}
pluginSvc.putSession(session)
if err := pluginSvc.store.AppendRunRecord(jobType, &JobRunRecord{
Outcome: RunOutcomeError,
CompletedAt: timeToPtr(time.Date(2026, 2, 10, 0, 0, 0, 0, time.UTC)),
}); err != nil {
t.Fatalf("AppendRunRecord error run: %v", err)
}
resultCh := make(chan error, 1)
go func() {
_, runErr := pluginSvc.RunDetection(context.Background(), jobType, &plugin_pb.ClusterContext{}, 10)
resultCh <- runErr
}()
message := <-session.outgoing
detectRequest := message.GetRunDetectionRequest()
if detectRequest == nil {
t.Fatalf("expected run detection request message")
}
if detectRequest.LastSuccessfulRun != nil {
t.Fatalf("expected last_successful_run to be nil when no success history")
}
pluginSvc.handleDetectionComplete("worker-a", &plugin_pb.DetectionComplete{
RequestId: message.RequestId,
JobType: jobType,
Success: true,
})
if runErr := <-resultCh; runErr != nil {
t.Fatalf("RunDetection error: %v", runErr)
}
}
func TestRunDetectionWithReportCapturesDetectionActivities(t *testing.T) {
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New plugin error: %v", err)
}
defer pluginSvc.Shutdown()
jobType := "vacuum"
pluginSvc.registry.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-a",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: jobType, CanDetect: true, MaxDetectionConcurrency: 1},
},
})
session := &streamSession{workerID: "worker-a", outgoing: make(chan *plugin_pb.AdminToWorkerMessage, 1)}
pluginSvc.putSession(session)
reportCh := make(chan *DetectionReport, 1)
errCh := make(chan error, 1)
go func() {
report, runErr := pluginSvc.RunDetectionWithReport(context.Background(), jobType, &plugin_pb.ClusterContext{}, 10)
reportCh <- report
errCh <- runErr
}()
message := <-session.outgoing
requestID := message.GetRequestId()
if requestID == "" {
t.Fatalf("expected request id in detection request")
}
pluginSvc.handleDetectionProposals("worker-a", &plugin_pb.DetectionProposals{
RequestId: requestID,
JobType: jobType,
Proposals: []*plugin_pb.JobProposal{
{
ProposalId: "proposal-1",
JobType: jobType,
Summary: "vacuum proposal",
Detail: "based on garbage ratio",
},
},
})
pluginSvc.handleDetectionComplete("worker-a", &plugin_pb.DetectionComplete{
RequestId: requestID,
JobType: jobType,
Success: true,
TotalProposals: 1,
})
report := <-reportCh
if report == nil {
t.Fatalf("expected detection report")
}
if report.RequestID == "" {
t.Fatalf("expected detection report request id")
}
if report.WorkerID != "worker-a" {
t.Fatalf("expected worker-a, got %q", report.WorkerID)
}
if len(report.Proposals) != 1 {
t.Fatalf("expected one proposal in report, got %d", len(report.Proposals))
}
if runErr := <-errCh; runErr != nil {
t.Fatalf("RunDetectionWithReport error: %v", runErr)
}
activities := pluginSvc.ListActivities(jobType, 0)
stages := map[string]bool{}
for _, activity := range activities {
if activity.RequestID != report.RequestID {
continue
}
stages[activity.Stage] = true
}
if !stages["requested"] || !stages["proposal"] || !stages["completed"] {
t.Fatalf("expected requested/proposal/completed activities, got stages=%v", stages)
}
}


@@ -0,0 +1,896 @@
package plugin
import (
"encoding/json"
"sort"
"strings"
"time"
"github.com/seaweedfs/seaweedfs/weed/glog"
"github.com/seaweedfs/seaweedfs/weed/pb/plugin_pb"
"google.golang.org/protobuf/encoding/protojson"
)
const (
maxTrackedJobsTotal = 1000
maxActivityRecords = 4000
maxRelatedJobs = 100
)
var (
StateSucceeded = strings.ToLower(plugin_pb.JobState_JOB_STATE_SUCCEEDED.String())
StateFailed = strings.ToLower(plugin_pb.JobState_JOB_STATE_FAILED.String())
StateCanceled = strings.ToLower(plugin_pb.JobState_JOB_STATE_CANCELED.String())
)
// activityLess reports whether activity a occurred after activity b (newest-first order).
// A nil OccurredAt is treated as the zero time.
func activityLess(a, b JobActivity) bool {
ta := time.Time{}
if a.OccurredAt != nil {
ta = *a.OccurredAt
}
tb := time.Time{}
if b.OccurredAt != nil {
tb = *b.OccurredAt
}
return ta.After(tb)
}
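// loadPersistedMonitorState restores tracked jobs and activity records from
// the store so the monitor survives admin restarts. Jobs that still carry
// legacy inline detail payloads are migrated into per-job detail files, and
// only lightweight status fields are kept in memory.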
func (r *Plugin) loadPersistedMonitorState() error {
trackedJobs, err := r.store.LoadTrackedJobs()
if err != nil {
return err
}
activities, err := r.store.LoadActivities()
if err != nil {
return err
}
if len(trackedJobs) > 0 {
r.jobsMu.Lock()
for i := range trackedJobs {
job := trackedJobs[i]
if strings.TrimSpace(job.JobID) == "" {
continue
}
// Backward compatibility: migrate older inline detail payloads
// out of tracked_jobs.json into dedicated per-job detail files.
if hasTrackedJobRichDetails(job) {
if err := r.store.SaveJobDetail(job); err != nil {
glog.Warningf("Plugin failed to migrate detail snapshot for job %s: %v", job.JobID, err)
}
}
stripTrackedJobDetailFields(&job)
jobCopy := job
r.jobs[job.JobID] = &jobCopy
}
r.pruneTrackedJobsLocked()
r.jobsMu.Unlock()
}
if len(activities) > maxActivityRecords {
activities = activities[len(activities)-maxActivityRecords:]
}
if len(activities) > 0 {
r.activitiesMu.Lock()
r.activities = append([]JobActivity(nil), activities...)
r.activitiesMu.Unlock()
}
return nil
}
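// ListTrackedJobs returns tracked jobs filtered by job type and state,
// sorted newest-first by update time (job id as tiebreaker), optionally
// truncated to limit.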
func (r *Plugin) ListTrackedJobs(jobType string, state string, limit int) []TrackedJob {
r.jobsMu.RLock()
defer r.jobsMu.RUnlock()
normalizedJobType := strings.TrimSpace(jobType)
normalizedState := strings.TrimSpace(strings.ToLower(state))
items := make([]TrackedJob, 0, len(r.jobs))
for _, job := range r.jobs {
if job == nil {
continue
}
if normalizedJobType != "" && job.JobType != normalizedJobType {
continue
}
if normalizedState != "" && strings.ToLower(job.State) != normalizedState {
continue
}
items = append(items, cloneTrackedJob(*job))
}
sort.Slice(items, func(i, j int) bool {
ti := time.Time{}
if items[i].UpdatedAt != nil {
ti = *items[i].UpdatedAt
}
tj := time.Time{}
if items[j].UpdatedAt != nil {
tj = *items[j].UpdatedAt
}
if !ti.Equal(tj) {
return ti.After(tj)
}
return items[i].JobID < items[j].JobID
})
if limit > 0 && len(items) > limit {
items = items[:limit]
}
return items
}
func (r *Plugin) GetTrackedJob(jobID string) (*TrackedJob, bool) {
r.jobsMu.RLock()
defer r.jobsMu.RUnlock()
job, ok := r.jobs[jobID]
if !ok || job == nil {
return nil, false
}
clone := cloneTrackedJob(*job)
return &clone, true
}
func (r *Plugin) ListActivities(jobType string, limit int) []JobActivity {
r.activitiesMu.RLock()
defer r.activitiesMu.RUnlock()
normalized := strings.TrimSpace(jobType)
activities := make([]JobActivity, 0, len(r.activities))
for _, activity := range r.activities {
if normalized != "" && activity.JobType != normalized {
continue
}
activities = append(activities, activity)
}
sort.Slice(activities, func(i, j int) bool {
return activityLess(activities[i], activities[j])
})
if limit > 0 && len(activities) > limit {
activities = activities[:limit]
}
return activities
}
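// ListJobActivities returns the oldest-first activity timeline for a single
// job, keeping only the most recent entries when limit is positive.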
func (r *Plugin) ListJobActivities(jobID string, limit int) []JobActivity {
normalizedJobID := strings.TrimSpace(jobID)
if normalizedJobID == "" {
return nil
}
r.activitiesMu.RLock()
activities := make([]JobActivity, 0, len(r.activities))
for _, activity := range r.activities {
if strings.TrimSpace(activity.JobID) != normalizedJobID {
continue
}
activities = append(activities, activity)
}
r.activitiesMu.RUnlock()
sort.Slice(activities, func(i, j int) bool {
return !activityLess(activities[i], activities[j]) // oldest-first for job timeline
})
if limit > 0 && len(activities) > limit {
activities = activities[len(activities)-limit:]
}
return activities
}
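// BuildJobDetail assembles the full detail view for a job: the disk-backed
// detail snapshot merged with fresher in-memory status, the job's activity
// timeline, its run record from history, and optionally related jobs of the
// same type.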
func (r *Plugin) BuildJobDetail(jobID string, activityLimit int, relatedLimit int) (*JobDetail, bool, error) {
normalizedJobID := strings.TrimSpace(jobID)
if normalizedJobID == "" {
return nil, false, nil
}
// Clamp relatedLimit to a safe range to avoid excessive memory allocation from untrusted input.
if relatedLimit <= 0 {
relatedLimit = 0
} else if relatedLimit > maxRelatedJobs {
relatedLimit = maxRelatedJobs
}
r.jobsMu.RLock()
trackedSnapshot, ok := r.jobs[normalizedJobID]
if ok && trackedSnapshot != nil {
candidate := cloneTrackedJob(*trackedSnapshot)
stripTrackedJobDetailFields(&candidate)
trackedSnapshot = &candidate
} else {
trackedSnapshot = nil
}
r.jobsMu.RUnlock()
detailJob, err := r.store.LoadJobDetail(normalizedJobID)
if err != nil {
return nil, false, err
}
if trackedSnapshot == nil && detailJob == nil {
return nil, false, nil
}
if detailJob == nil && trackedSnapshot != nil {
clone := cloneTrackedJob(*trackedSnapshot)
detailJob = &clone
}
if detailJob == nil {
return nil, false, nil
}
if trackedSnapshot != nil {
mergeTrackedStatusIntoDetail(detailJob, trackedSnapshot)
}
detailJob.Parameters = enrichTrackedJobParameters(detailJob.JobType, detailJob.Parameters)
r.activitiesMu.RLock()
activities := append([]JobActivity(nil), r.activities...)
r.activitiesMu.RUnlock()
detail := &JobDetail{
Job: detailJob,
Activities: filterJobActivitiesFromSlice(activities, normalizedJobID, activityLimit),
LastUpdated: timeToPtr(time.Now().UTC()),
}
if history, err := r.store.LoadRunHistory(detailJob.JobType); err != nil {
return nil, true, err
} else if history != nil {
for i := range history.SuccessfulRuns {
record := history.SuccessfulRuns[i]
if strings.TrimSpace(record.JobID) == normalizedJobID {
recordCopy := record
detail.RunRecord = &recordCopy
break
}
}
if detail.RunRecord == nil {
for i := range history.ErrorRuns {
record := history.ErrorRuns[i]
if strings.TrimSpace(record.JobID) == normalizedJobID {
recordCopy := record
detail.RunRecord = &recordCopy
break
}
}
}
}
if relatedLimit > 0 {
related := make([]TrackedJob, 0, relatedLimit)
r.jobsMu.RLock()
for _, candidate := range r.jobs {
if candidate == nil {
continue
}
if strings.TrimSpace(candidate.JobType) != strings.TrimSpace(detailJob.JobType) {
continue
}
if strings.TrimSpace(candidate.JobID) == normalizedJobID {
continue
}
cloned := cloneTrackedJob(*candidate)
stripTrackedJobDetailFields(&cloned)
related = append(related, cloned)
if len(related) >= relatedLimit {
break
}
}
r.jobsMu.RUnlock()
detail.RelatedJobs = related
}
return detail, true, nil
}
func filterJobActivitiesFromSlice(all []JobActivity, jobID string, limit int) []JobActivity {
if strings.TrimSpace(jobID) == "" || len(all) == 0 {
return nil
}
activities := make([]JobActivity, 0, len(all))
for _, activity := range all {
if strings.TrimSpace(activity.JobID) != jobID {
continue
}
activities = append(activities, activity)
}
sort.Slice(activities, func(i, j int) bool {
return !activityLess(activities[i], activities[j]) // oldest-first for job timeline
})
if limit > 0 && len(activities) > limit {
activities = activities[len(activities)-limit:]
}
return activities
}
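// stripTrackedJobDetailFields clears the heavyweight fields that live in the
// per-job detail file, keeping the in-memory tracked job lightweight.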
func stripTrackedJobDetailFields(job *TrackedJob) {
if job == nil {
return
}
job.Detail = ""
job.Parameters = nil
job.Labels = nil
job.ResultOutputValues = nil
}
func hasTrackedJobRichDetails(job TrackedJob) bool {
return strings.TrimSpace(job.Detail) != "" ||
len(job.Parameters) > 0 ||
len(job.Labels) > 0 ||
len(job.ResultOutputValues) > 0
}
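// mergeTrackedStatusIntoDetail fills empty fields of the disk-backed detail
// from the in-memory status without overwriting values already present.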
func mergeTrackedStatusIntoDetail(detail *TrackedJob, tracked *TrackedJob) {
if detail == nil || tracked == nil {
return
}
if detail.JobType == "" {
detail.JobType = tracked.JobType
}
if detail.RequestID == "" {
detail.RequestID = tracked.RequestID
}
if detail.WorkerID == "" {
detail.WorkerID = tracked.WorkerID
}
if detail.DedupeKey == "" {
detail.DedupeKey = tracked.DedupeKey
}
if detail.Summary == "" {
detail.Summary = tracked.Summary
}
if detail.State == "" {
detail.State = tracked.State
}
if detail.Progress == 0 {
detail.Progress = tracked.Progress
}
if detail.Stage == "" {
detail.Stage = tracked.Stage
}
if detail.Message == "" {
detail.Message = tracked.Message
}
if detail.Attempt == 0 {
detail.Attempt = tracked.Attempt
}
if detail.CreatedAt == nil || detail.CreatedAt.IsZero() {
detail.CreatedAt = tracked.CreatedAt
}
if detail.UpdatedAt == nil || detail.UpdatedAt.IsZero() {
detail.UpdatedAt = tracked.UpdatedAt
}
if detail.CompletedAt == nil || detail.CompletedAt.IsZero() {
detail.CompletedAt = tracked.CompletedAt
}
if detail.ErrorMessage == "" {
detail.ErrorMessage = tracked.ErrorMessage
}
if detail.ResultSummary == "" {
detail.ResultSummary = tracked.ResultSummary
}
}
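// handleJobProgressUpdate applies a worker progress message to the tracked
// job (when a job id is present) and records the attached activity events.
// Detection-phase updates arrive without a job id and are recorded under the
// worker_detection source.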
func (r *Plugin) handleJobProgressUpdate(workerID string, update *plugin_pb.JobProgressUpdate) {
if update == nil {
return
}
now := time.Now().UTC()
resolvedWorkerID := strings.TrimSpace(workerID)
if strings.TrimSpace(update.JobId) != "" {
r.jobsMu.Lock()
job := r.jobs[update.JobId]
if job == nil {
job = &TrackedJob{
JobID: update.JobId,
JobType: update.JobType,
RequestID: update.RequestId,
WorkerID: resolvedWorkerID,
CreatedAt: timeToPtr(now),
}
r.jobs[update.JobId] = job
}
if update.JobType != "" {
job.JobType = update.JobType
}
if update.RequestId != "" {
job.RequestID = update.RequestId
}
if job.WorkerID != "" {
resolvedWorkerID = job.WorkerID
} else if resolvedWorkerID != "" {
job.WorkerID = resolvedWorkerID
}
job.State = strings.ToLower(update.State.String())
job.Progress = update.ProgressPercent
job.Stage = update.Stage
job.Message = update.Message
job.UpdatedAt = timeToPtr(now)
r.pruneTrackedJobsLocked()
r.dirtyJobs = true
r.jobsMu.Unlock()
}
r.trackWorkerActivities(update.JobType, update.JobId, update.RequestId, resolvedWorkerID, update.Activities)
if update.Message != "" || update.Stage != "" {
source := "worker_progress"
if strings.TrimSpace(update.JobId) == "" {
source = "worker_detection"
}
r.appendActivity(JobActivity{
JobID: update.JobId,
JobType: update.JobType,
RequestID: update.RequestId,
WorkerID: resolvedWorkerID,
Source: source,
Message: update.Message,
Stage: update.Stage,
OccurredAt: timeToPtr(now),
})
}
}
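// trackExecutionStart records a job assignment. The in-memory tracked job
// keeps only lightweight status fields; parameters, labels, and the detail
// text are persisted to the per-job detail snapshot on disk.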
func (r *Plugin) trackExecutionStart(requestID, workerID string, job *plugin_pb.JobSpec, attempt int32) {
if job == nil || strings.TrimSpace(job.JobId) == "" {
return
}
now := time.Now().UTC()
r.jobsMu.Lock()
tracked := r.jobs[job.JobId]
if tracked == nil {
tracked = &TrackedJob{
JobID: job.JobId,
CreatedAt: timeToPtr(now),
}
r.jobs[job.JobId] = tracked
}
tracked.JobType = job.JobType
tracked.RequestID = requestID
tracked.WorkerID = workerID
tracked.DedupeKey = job.DedupeKey
tracked.Summary = job.Summary
tracked.State = strings.ToLower(plugin_pb.JobState_JOB_STATE_ASSIGNED.String())
tracked.Progress = 0
tracked.Stage = "assigned"
tracked.Message = "job assigned to worker"
tracked.Attempt = attempt
if tracked.CreatedAt == nil || tracked.CreatedAt.IsZero() {
tracked.CreatedAt = timeToPtr(now)
}
tracked.UpdatedAt = timeToPtr(now)
trackedSnapshot := cloneTrackedJob(*tracked)
r.pruneTrackedJobsLocked()
r.dirtyJobs = true
r.jobsMu.Unlock()
r.persistJobDetailSnapshot(job.JobId, func(detail *TrackedJob) {
detail.JobID = job.JobId
detail.JobType = job.JobType
detail.RequestID = requestID
detail.WorkerID = workerID
detail.DedupeKey = job.DedupeKey
detail.Summary = job.Summary
detail.Detail = job.Detail
detail.Parameters = enrichTrackedJobParameters(job.JobType, configValueMapToPlain(job.Parameters))
if len(job.Labels) > 0 {
labels := make(map[string]string, len(job.Labels))
for key, value := range job.Labels {
labels[key] = value
}
detail.Labels = labels
} else {
detail.Labels = nil
}
detail.State = trackedSnapshot.State
detail.Progress = trackedSnapshot.Progress
detail.Stage = trackedSnapshot.Stage
detail.Message = trackedSnapshot.Message
detail.Attempt = attempt
if detail.CreatedAt == nil || detail.CreatedAt.IsZero() {
detail.CreatedAt = trackedSnapshot.CreatedAt
}
detail.UpdatedAt = trackedSnapshot.UpdatedAt
})
r.appendActivity(JobActivity{
JobID: job.JobId,
JobType: job.JobType,
RequestID: requestID,
WorkerID: workerID,
Source: "admin_dispatch",
Message: "job assigned",
Stage: "assigned",
OccurredAt: timeToPtr(now),
})
}
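// trackExecutionQueued marks a job as pending while it waits for an executor,
// mirroring trackExecutionStart's split between in-memory status and the
// on-disk detail snapshot.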
func (r *Plugin) trackExecutionQueued(job *plugin_pb.JobSpec) {
if job == nil || strings.TrimSpace(job.JobId) == "" {
return
}
now := time.Now().UTC()
r.jobsMu.Lock()
tracked := r.jobs[job.JobId]
if tracked == nil {
tracked = &TrackedJob{
JobID: job.JobId,
CreatedAt: timeToPtr(now),
}
r.jobs[job.JobId] = tracked
}
tracked.JobType = job.JobType
tracked.DedupeKey = job.DedupeKey
tracked.Summary = job.Summary
tracked.State = strings.ToLower(plugin_pb.JobState_JOB_STATE_PENDING.String())
tracked.Progress = 0
tracked.Stage = "queued"
tracked.Message = "waiting for available executor"
if tracked.CreatedAt == nil || tracked.CreatedAt.IsZero() {
tracked.CreatedAt = timeToPtr(now)
}
tracked.UpdatedAt = timeToPtr(now)
trackedSnapshot := cloneTrackedJob(*tracked)
r.pruneTrackedJobsLocked()
r.dirtyJobs = true
r.jobsMu.Unlock()
r.persistJobDetailSnapshot(job.JobId, func(detail *TrackedJob) {
detail.JobID = job.JobId
detail.JobType = job.JobType
detail.DedupeKey = job.DedupeKey
detail.Summary = job.Summary
detail.Detail = job.Detail
detail.Parameters = enrichTrackedJobParameters(job.JobType, configValueMapToPlain(job.Parameters))
if len(job.Labels) > 0 {
labels := make(map[string]string, len(job.Labels))
for key, value := range job.Labels {
labels[key] = value
}
detail.Labels = labels
} else {
detail.Labels = nil
}
detail.State = trackedSnapshot.State
detail.Progress = trackedSnapshot.Progress
detail.Stage = trackedSnapshot.Stage
detail.Message = trackedSnapshot.Message
if detail.CreatedAt == nil || detail.CreatedAt.IsZero() {
detail.CreatedAt = trackedSnapshot.CreatedAt
}
detail.UpdatedAt = trackedSnapshot.UpdatedAt
})
r.appendActivity(JobActivity{
JobID: job.JobId,
JobType: job.JobType,
Source: "admin_scheduler",
Message: "job queued for execution",
Stage: "queued",
OccurredAt: timeToPtr(now),
})
}
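// trackExecutionCompletion applies a JobCompleted message to the tracked job
// and its detail snapshot, and returns a clone of the final tracked state.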
func (r *Plugin) trackExecutionCompletion(completed *plugin_pb.JobCompleted) *TrackedJob {
if completed == nil || strings.TrimSpace(completed.JobId) == "" {
return nil
}
now := time.Now().UTC()
if completed.CompletedAt != nil {
now = completed.CompletedAt.AsTime().UTC()
}
r.jobsMu.Lock()
tracked := r.jobs[completed.JobId]
if tracked == nil {
tracked = &TrackedJob{
JobID: completed.JobId,
CreatedAt: timeToPtr(now),
}
r.jobs[completed.JobId] = tracked
}
if completed.JobType != "" {
tracked.JobType = completed.JobType
}
if completed.RequestId != "" {
tracked.RequestID = completed.RequestId
}
if completed.Success {
tracked.State = strings.ToLower(plugin_pb.JobState_JOB_STATE_SUCCEEDED.String())
tracked.Progress = 100
tracked.Stage = "completed"
if completed.Result != nil {
tracked.ResultSummary = completed.Result.Summary
}
tracked.Message = tracked.ResultSummary
if tracked.Message == "" {
tracked.Message = "completed"
}
tracked.ErrorMessage = ""
} else {
tracked.State = strings.ToLower(plugin_pb.JobState_JOB_STATE_FAILED.String())
tracked.Stage = "failed"
tracked.ErrorMessage = completed.ErrorMessage
tracked.Message = completed.ErrorMessage
}
tracked.UpdatedAt = timeToPtr(now)
tracked.CompletedAt = timeToPtr(now)
r.pruneTrackedJobsLocked()
clone := cloneTrackedJob(*tracked)
r.dirtyJobs = true
r.jobsMu.Unlock()
r.persistJobDetailSnapshot(completed.JobId, func(detail *TrackedJob) {
detail.JobID = completed.JobId
if completed.JobType != "" {
detail.JobType = completed.JobType
}
if completed.RequestId != "" {
detail.RequestID = completed.RequestId
}
detail.State = clone.State
detail.Progress = clone.Progress
detail.Stage = clone.Stage
detail.Message = clone.Message
detail.ErrorMessage = clone.ErrorMessage
detail.ResultSummary = clone.ResultSummary
if completed.Success && completed.Result != nil {
detail.ResultOutputValues = configValueMapToPlain(completed.Result.OutputValues)
} else {
detail.ResultOutputValues = nil
}
if detail.CreatedAt == nil || detail.CreatedAt.IsZero() {
detail.CreatedAt = clone.CreatedAt
}
if detail.UpdatedAt == nil || detail.UpdatedAt.IsZero() {
detail.UpdatedAt = clone.UpdatedAt
}
if detail.CompletedAt == nil || detail.CompletedAt.IsZero() {
detail.CompletedAt = clone.CompletedAt
}
})
r.appendActivity(JobActivity{
JobID: completed.JobId,
JobType: completed.JobType,
RequestID: completed.RequestId,
WorkerID: clone.WorkerID,
Source: "worker_completion",
Message: clone.Message,
Stage: clone.Stage,
OccurredAt: timeToPtr(now),
})
return &clone
}
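// trackWorkerActivities records worker-emitted activity events, preferring
// each event's own timestamp when one is present.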
func (r *Plugin) trackWorkerActivities(jobType, jobID, requestID, workerID string, events []*plugin_pb.ActivityEvent) {
if len(events) == 0 {
return
}
for _, event := range events {
if event == nil {
continue
}
timestamp := time.Now().UTC()
if event.CreatedAt != nil {
timestamp = event.CreatedAt.AsTime().UTC()
}
r.appendActivity(JobActivity{
JobID: jobID,
JobType: jobType,
RequestID: requestID,
WorkerID: workerID,
Source: strings.ToLower(event.Source.String()),
Message: event.Message,
Stage: event.Stage,
Details: configValueMapToPlain(event.Details),
OccurredAt: timeToPtr(timestamp),
})
}
}
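// appendActivity records an activity, stamping a missing timestamp with the
// current time and trimming the buffer to maxActivityRecords.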
func (r *Plugin) appendActivity(activity JobActivity) {
if activity.OccurredAt == nil || activity.OccurredAt.IsZero() {
activity.OccurredAt = timeToPtr(time.Now().UTC())
}
r.activitiesMu.Lock()
r.activities = append(r.activities, activity)
if len(r.activities) > maxActivityRecords {
r.activities = r.activities[len(r.activities)-maxActivityRecords:]
}
r.dirtyActivities = true
r.activitiesMu.Unlock()
}
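// pruneTrackedJobsLocked evicts the oldest terminal (succeeded, failed, or
// canceled) jobs once the map exceeds maxTrackedJobsTotal. Active jobs are
// never evicted, so the map can temporarily exceed the cap. Callers must
// hold jobsMu.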
func (r *Plugin) pruneTrackedJobsLocked() {
if len(r.jobs) <= maxTrackedJobsTotal {
return
}
type sortableJob struct {
jobID string
updatedAt time.Time
}
terminalJobs := make([]sortableJob, 0)
for jobID, job := range r.jobs {
if job == nil {
continue
}
if job.State == StateSucceeded ||
job.State == StateFailed ||
job.State == StateCanceled {
updAt := time.Time{}
if job.UpdatedAt != nil {
updAt = *job.UpdatedAt
}
terminalJobs = append(terminalJobs, sortableJob{jobID, updAt})
}
}
if len(terminalJobs) == 0 {
return
}
sort.Slice(terminalJobs, func(i, j int) bool {
return terminalJobs[i].updatedAt.Before(terminalJobs[j].updatedAt)
})
toDelete := len(r.jobs) - maxTrackedJobsTotal
if toDelete <= 0 {
return
}
if toDelete > len(terminalJobs) {
toDelete = len(terminalJobs)
}
for i := 0; i < toDelete; i++ {
delete(r.jobs, terminalJobs[i].jobID)
}
}
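// configValueMapToPlain converts a ConfigValue map into plain JSON-compatible
// values by round-tripping through protojson; it returns nil on any failure.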
func configValueMapToPlain(values map[string]*plugin_pb.ConfigValue) map[string]interface{} {
if len(values) == 0 {
return nil
}
payload, err := protojson.MarshalOptions{UseProtoNames: true}.Marshal(&plugin_pb.ValueMap{Fields: values})
if err != nil {
return nil
}
decoded := map[string]interface{}{}
if err := json.Unmarshal(payload, &decoded); err != nil {
return nil
}
fields, ok := decoded["fields"].(map[string]interface{})
if !ok {
return nil
}
return fields
}
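// persistTrackedJobsSnapshot writes a newest-first snapshot of lightweight
// tracked jobs to the store, capped at maxTrackedJobsTotal.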
func (r *Plugin) persistTrackedJobsSnapshot() {
r.jobsMu.Lock()
r.dirtyJobs = false
jobs := make([]TrackedJob, 0, len(r.jobs))
for _, job := range r.jobs {
if job == nil || strings.TrimSpace(job.JobID) == "" {
continue
}
clone := cloneTrackedJob(*job)
stripTrackedJobDetailFields(&clone)
jobs = append(jobs, clone)
}
r.jobsMu.Unlock()
if len(jobs) == 0 {
return
}
sort.Slice(jobs, func(i, j int) bool {
ti := time.Time{}
if jobs[i].UpdatedAt != nil {
ti = *jobs[i].UpdatedAt
}
tj := time.Time{}
if jobs[j].UpdatedAt != nil {
tj = *jobs[j].UpdatedAt
}
if !ti.Equal(tj) {
return ti.After(tj)
}
return jobs[i].JobID < jobs[j].JobID
})
if len(jobs) > maxTrackedJobsTotal {
jobs = jobs[:maxTrackedJobsTotal]
}
if err := r.store.SaveTrackedJobs(jobs); err != nil {
glog.Warningf("Plugin failed to persist tracked jobs: %v", err)
}
}
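// persistJobDetailSnapshot loads (or creates) the on-disk detail record for a
// job, applies the mutation, and saves it back, serialized by jobDetailsMu.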
func (r *Plugin) persistJobDetailSnapshot(jobID string, apply func(detail *TrackedJob)) {
normalizedJobID, _ := sanitizeJobID(jobID)
if normalizedJobID == "" {
return
}
r.jobDetailsMu.Lock()
defer r.jobDetailsMu.Unlock()
detail, err := r.store.LoadJobDetail(normalizedJobID)
if err != nil {
glog.Warningf("Plugin failed to load job detail snapshot for %s: %v", normalizedJobID, err)
return
}
if detail == nil {
detail = &TrackedJob{
JobID: normalizedJobID,
}
}
if apply != nil {
apply(detail)
}
if err := r.store.SaveJobDetail(*detail); err != nil {
glog.Warningf("Plugin failed to persist job detail snapshot for %s: %v", normalizedJobID, err)
}
}
func (r *Plugin) persistActivitiesSnapshot() {
r.activitiesMu.Lock()
r.dirtyActivities = false
activities := append([]JobActivity(nil), r.activities...)
r.activitiesMu.Unlock()
if len(activities) == 0 {
return
}
if len(activities) > maxActivityRecords {
activities = activities[len(activities)-maxActivityRecords:]
}
if err := r.store.SaveActivities(activities); err != nil {
glog.Warningf("Plugin failed to persist activities: %v", err)
}
}
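// persistenceLoop periodically flushes dirty tracked jobs and activities to
// the store and performs a final flush on shutdown.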
func (r *Plugin) persistenceLoop() {
defer r.wg.Done()
for {
select {
case <-r.shutdownCh:
r.persistTrackedJobsSnapshot()
r.persistActivitiesSnapshot()
return
case <-r.persistTicker.C:
r.jobsMu.RLock()
needsJobsFlush := r.dirtyJobs
r.jobsMu.RUnlock()
if needsJobsFlush {
r.persistTrackedJobsSnapshot()
}
r.activitiesMu.RLock()
needsActivitiesFlush := r.dirtyActivities
r.activitiesMu.RUnlock()
if needsActivitiesFlush {
r.persistActivitiesSnapshot()
}
}
}
}


@@ -0,0 +1,600 @@
package plugin
import (
"testing"
"time"
"github.com/seaweedfs/seaweedfs/weed/pb/plugin_pb"
"github.com/seaweedfs/seaweedfs/weed/pb/worker_pb"
"google.golang.org/protobuf/proto"
"google.golang.org/protobuf/types/known/timestamppb"
)
func TestPluginLoadsPersistedMonitorStateOnStart(t *testing.T) {
t.Parallel()
dataDir := t.TempDir()
store, err := NewConfigStore(dataDir)
if err != nil {
t.Fatalf("NewConfigStore: %v", err)
}
seedJobs := []TrackedJob{
{
JobID: "job-seeded",
JobType: "vacuum",
State: "running",
CreatedAt: timeToPtr(time.Now().UTC().Add(-2 * time.Minute)),
UpdatedAt: timeToPtr(time.Now().UTC().Add(-1 * time.Minute)),
},
}
seedActivities := []JobActivity{
{
JobID: "job-seeded",
JobType: "vacuum",
Source: "worker_progress",
Message: "seeded",
OccurredAt: timeToPtr(time.Now().UTC().Add(-30 * time.Second)),
},
}
if err := store.SaveTrackedJobs(seedJobs); err != nil {
t.Fatalf("SaveTrackedJobs: %v", err)
}
if err := store.SaveActivities(seedActivities); err != nil {
t.Fatalf("SaveActivities: %v", err)
}
pluginSvc, err := New(Options{DataDir: dataDir})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
gotJobs := pluginSvc.ListTrackedJobs("", "", 0)
if len(gotJobs) != 1 || gotJobs[0].JobID != "job-seeded" {
t.Fatalf("unexpected loaded jobs: %+v", gotJobs)
}
gotActivities := pluginSvc.ListActivities("", 0)
if len(gotActivities) != 1 || gotActivities[0].Message != "seeded" {
t.Fatalf("unexpected loaded activities: %+v", gotActivities)
}
}
func TestPluginPersistsMonitorStateAfterJobUpdates(t *testing.T) {
t.Parallel()
dataDir := t.TempDir()
pluginSvc, err := New(Options{DataDir: dataDir})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
job := &plugin_pb.JobSpec{
JobId: "job-persist",
JobType: "vacuum",
Summary: "persist test",
}
pluginSvc.trackExecutionStart("req-persist", "worker-a", job, 1)
pluginSvc.trackExecutionCompletion(&plugin_pb.JobCompleted{
RequestId: "req-persist",
JobId: "job-persist",
JobType: "vacuum",
Success: true,
Result: &plugin_pb.JobResult{Summary: "done"},
CompletedAt: timestamppb.New(time.Now().UTC()),
})
pluginSvc.Shutdown()
store, err := NewConfigStore(dataDir)
if err != nil {
t.Fatalf("NewConfigStore: %v", err)
}
trackedJobs, err := store.LoadTrackedJobs()
if err != nil {
t.Fatalf("LoadTrackedJobs: %v", err)
}
if len(trackedJobs) == 0 {
t.Fatalf("expected persisted tracked jobs")
}
found := false
for _, tracked := range trackedJobs {
if tracked.JobID == "job-persist" {
found = true
if tracked.State == "" {
t.Fatalf("persisted job state should not be empty")
}
}
}
if !found {
t.Fatalf("persisted tracked jobs missing job-persist")
}
activities, err := store.LoadActivities()
if err != nil {
t.Fatalf("LoadActivities: %v", err)
}
if len(activities) == 0 {
t.Fatalf("expected persisted activities")
}
}
func TestTrackExecutionQueuedMarksPendingState(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
pluginSvc.trackExecutionQueued(&plugin_pb.JobSpec{
JobId: "job-pending-1",
JobType: "vacuum",
DedupeKey: "vacuum:1",
Summary: "pending queue item",
})
jobs := pluginSvc.ListTrackedJobs("vacuum", "", 10)
if len(jobs) != 1 {
t.Fatalf("expected one tracked pending job, got=%d", len(jobs))
}
job := jobs[0]
if job.JobID != "job-pending-1" {
t.Fatalf("unexpected pending job id: %s", job.JobID)
}
if job.State != "job_state_pending" {
t.Fatalf("unexpected pending job state: %s", job.State)
}
if job.Stage != "queued" {
t.Fatalf("unexpected pending job stage: %s", job.Stage)
}
activities := pluginSvc.ListActivities("vacuum", 50)
found := false
for _, activity := range activities {
if activity.JobID == "job-pending-1" && activity.Stage == "queued" && activity.Source == "admin_scheduler" {
found = true
break
}
}
if !found {
t.Fatalf("expected queued activity for pending job")
}
}
func TestHandleJobProgressUpdateCarriesWorkerIDInActivities(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
job := &plugin_pb.JobSpec{
JobId: "job-progress-worker",
JobType: "vacuum",
}
pluginSvc.trackExecutionStart("req-progress-worker", "worker-a", job, 1)
pluginSvc.handleJobProgressUpdate("worker-a", &plugin_pb.JobProgressUpdate{
RequestId: "req-progress-worker",
JobId: "job-progress-worker",
JobType: "vacuum",
State: plugin_pb.JobState_JOB_STATE_RUNNING,
ProgressPercent: 42.0,
Stage: "scan",
Message: "in progress",
Activities: []*plugin_pb.ActivityEvent{
{
Source: plugin_pb.ActivitySource_ACTIVITY_SOURCE_EXECUTOR,
Message: "volume scanned",
Stage: "scan",
},
},
})
activities := pluginSvc.ListActivities("vacuum", 0)
if len(activities) == 0 {
t.Fatalf("expected activity entries")
}
foundProgress := false
foundEvent := false
for _, activity := range activities {
if activity.Source == "worker_progress" && activity.Message == "in progress" {
foundProgress = true
if activity.WorkerID != "worker-a" {
t.Fatalf("worker_progress activity worker mismatch: got=%q want=%q", activity.WorkerID, "worker-a")
}
}
if activity.Message == "volume scanned" {
foundEvent = true
if activity.WorkerID != "worker-a" {
t.Fatalf("worker event worker mismatch: got=%q want=%q", activity.WorkerID, "worker-a")
}
}
}
if !foundProgress {
t.Fatalf("expected worker_progress activity")
}
if !foundEvent {
t.Fatalf("expected worker activity event")
}
}
func TestHandleJobProgressUpdateWithoutJobIDTracksDetectionActivities(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
pluginSvc.handleJobProgressUpdate("worker-detector", &plugin_pb.JobProgressUpdate{
RequestId: "detect-req-1",
JobType: "vacuum",
State: plugin_pb.JobState_JOB_STATE_RUNNING,
Stage: "decision_summary",
Message: "VACUUM: No tasks created for 3 volumes",
Activities: []*plugin_pb.ActivityEvent{
{
Source: plugin_pb.ActivitySource_ACTIVITY_SOURCE_DETECTOR,
Stage: "decision_summary",
Message: "VACUUM: No tasks created for 3 volumes",
},
},
})
activities := pluginSvc.ListActivities("vacuum", 0)
if len(activities) == 0 {
t.Fatalf("expected activity entries")
}
foundDetectionProgress := false
foundDetectorEvent := false
for _, activity := range activities {
if activity.RequestID != "detect-req-1" {
continue
}
if activity.Source == "worker_detection" {
foundDetectionProgress = true
if activity.WorkerID != "worker-detector" {
t.Fatalf("worker_detection worker mismatch: got=%q want=%q", activity.WorkerID, "worker-detector")
}
}
if activity.Source == "activity_source_detector" {
foundDetectorEvent = true
if activity.WorkerID != "worker-detector" {
t.Fatalf("detector event worker mismatch: got=%q want=%q", activity.WorkerID, "worker-detector")
}
}
}
if !foundDetectionProgress {
t.Fatalf("expected worker_detection activity")
}
if !foundDetectorEvent {
t.Fatalf("expected detector activity event")
}
}
func TestHandleJobCompletedCarriesWorkerIDInActivitiesAndRunHistory(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
job := &plugin_pb.JobSpec{
JobId: "job-complete-worker",
JobType: "vacuum",
}
pluginSvc.trackExecutionStart("req-complete-worker", "worker-b", job, 1)
pluginSvc.handleJobCompleted(&plugin_pb.JobCompleted{
RequestId: "req-complete-worker",
JobId: "job-complete-worker",
JobType: "vacuum",
Success: true,
Activities: []*plugin_pb.ActivityEvent{
{
Source: plugin_pb.ActivitySource_ACTIVITY_SOURCE_EXECUTOR,
Message: "finalizer done",
Stage: "finalize",
},
},
CompletedAt: timestamppb.Now(),
})
pluginSvc.Shutdown()
activities := pluginSvc.ListActivities("vacuum", 0)
foundWorkerEvent := false
for _, activity := range activities {
if activity.Message == "finalizer done" {
foundWorkerEvent = true
if activity.WorkerID != "worker-b" {
t.Fatalf("worker completion event worker mismatch: got=%q want=%q", activity.WorkerID, "worker-b")
}
}
}
if !foundWorkerEvent {
t.Fatalf("expected completion worker event activity")
}
history, err := pluginSvc.LoadRunHistory("vacuum")
if err != nil {
t.Fatalf("LoadRunHistory: %v", err)
}
if history == nil || len(history.SuccessfulRuns) == 0 {
t.Fatalf("expected successful run history entry")
}
if history.SuccessfulRuns[0].WorkerID != "worker-b" {
t.Fatalf("run history worker mismatch: got=%q want=%q", history.SuccessfulRuns[0].WorkerID, "worker-b")
}
}
func TestTrackExecutionStartStoresJobPayloadDetails(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{DataDir: t.TempDir()})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
pluginSvc.trackExecutionStart("req-payload", "worker-c", &plugin_pb.JobSpec{
JobId: "job-payload",
JobType: "vacuum",
Summary: "payload summary",
Detail: "payload detail",
Parameters: map[string]*plugin_pb.ConfigValue{
"volume_id": {
Kind: &plugin_pb.ConfigValue_Int64Value{Int64Value: 9},
},
},
Labels: map[string]string{
"source": "detector",
},
}, 2)
pluginSvc.Shutdown()
job, found := pluginSvc.GetTrackedJob("job-payload")
if !found || job == nil {
t.Fatalf("expected tracked job")
}
if job.Detail != "" {
t.Fatalf("expected in-memory tracked job detail to be stripped, got=%q", job.Detail)
}
if job.Attempt != 2 {
t.Fatalf("unexpected attempt: %d", job.Attempt)
}
if len(job.Labels) != 0 {
t.Fatalf("expected in-memory labels to be stripped, got=%+v", job.Labels)
}
if len(job.Parameters) != 0 {
t.Fatalf("expected in-memory parameters to be stripped, got=%+v", job.Parameters)
}
detail, found, err := pluginSvc.BuildJobDetail("job-payload", 100, 0)
if err != nil {
t.Fatalf("BuildJobDetail: %v", err)
}
if !found || detail == nil || detail.Job == nil {
t.Fatalf("expected disk-backed job detail")
}
if detail.Job.Detail != "payload detail" {
t.Fatalf("unexpected disk-backed detail: %q", detail.Job.Detail)
}
if got := detail.Job.Labels["source"]; got != "detector" {
t.Fatalf("unexpected disk-backed label source: %q", got)
}
if got, ok := detail.Job.Parameters["volume_id"].(map[string]interface{}); !ok || got["int64_value"] != "9" {
t.Fatalf("unexpected disk-backed parameters payload: %#v", detail.Job.Parameters["volume_id"])
}
}
func TestTrackExecutionStartStoresErasureCodingExecutionPlan(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{DataDir: t.TempDir()})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
taskParams := &worker_pb.TaskParams{
TaskId: "task-ec-1",
VolumeId: 29,
Collection: "photos",
Sources: []*worker_pb.TaskSource{
{
Node: "source-a:8080",
DataCenter: "dc1",
Rack: "rack1",
VolumeId: 29,
},
},
Targets: []*worker_pb.TaskTarget{
{
Node: "target-a:8080",
DataCenter: "dc1",
Rack: "rack2",
VolumeId: 29,
ShardIds: []uint32{0, 10},
},
{
Node: "target-b:8080",
DataCenter: "dc2",
Rack: "rack3",
VolumeId: 29,
ShardIds: []uint32{1, 11},
},
},
TaskParams: &worker_pb.TaskParams_ErasureCodingParams{
ErasureCodingParams: &worker_pb.ErasureCodingTaskParams{
DataShards: 10,
ParityShards: 4,
},
},
}
payload, err := proto.Marshal(taskParams)
if err != nil {
t.Fatalf("Marshal task params: %v", err)
}
pluginSvc.trackExecutionStart("req-ec-plan", "worker-ec", &plugin_pb.JobSpec{
JobId: "job-ec-plan",
JobType: "erasure_coding",
Parameters: map[string]*plugin_pb.ConfigValue{
"task_params_pb": {
Kind: &plugin_pb.ConfigValue_BytesValue{BytesValue: payload},
},
},
}, 1)
pluginSvc.Shutdown()
detail, found, err := pluginSvc.BuildJobDetail("job-ec-plan", 100, 0)
if err != nil {
t.Fatalf("BuildJobDetail: %v", err)
}
if !found || detail == nil || detail.Job == nil {
t.Fatalf("expected disk-backed detail")
}
rawPlan, ok := detail.Job.Parameters["execution_plan"]
if !ok {
t.Fatalf("expected execution_plan in parameters, got=%+v", detail.Job.Parameters)
}
plan, ok := rawPlan.(map[string]interface{})
if !ok {
t.Fatalf("unexpected execution_plan type: %T", rawPlan)
}
if plan["job_type"] != "erasure_coding" {
t.Fatalf("unexpected execution plan job type: %+v", plan["job_type"])
}
if plan["volume_id"] != float64(29) {
t.Fatalf("unexpected execution plan volume id: %+v", plan["volume_id"])
}
targets, ok := plan["targets"].([]interface{})
if !ok || len(targets) != 2 {
t.Fatalf("unexpected targets in execution plan: %+v", plan["targets"])
}
assignments, ok := plan["shard_assignments"].([]interface{})
if !ok || len(assignments) != 4 {
t.Fatalf("unexpected shard assignments in execution plan: %+v", plan["shard_assignments"])
}
firstAssignment, ok := assignments[0].(map[string]interface{})
if !ok {
t.Fatalf("unexpected first assignment payload: %+v", assignments[0])
}
if firstAssignment["shard_id"] != float64(0) || firstAssignment["kind"] != "data" {
t.Fatalf("unexpected first assignment: %+v", firstAssignment)
}
}
func TestBuildJobDetailIncludesActivitiesAndRunRecord(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{DataDir: t.TempDir()})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
pluginSvc.trackExecutionStart("req-detail", "worker-z", &plugin_pb.JobSpec{
JobId: "job-detail",
JobType: "vacuum",
Summary: "detail summary",
}, 1)
pluginSvc.handleJobProgressUpdate("worker-z", &plugin_pb.JobProgressUpdate{
RequestId: "req-detail",
JobId: "job-detail",
JobType: "vacuum",
State: plugin_pb.JobState_JOB_STATE_RUNNING,
Stage: "scan",
Message: "scanning volume",
})
pluginSvc.handleJobCompleted(&plugin_pb.JobCompleted{
RequestId: "req-detail",
JobId: "job-detail",
JobType: "vacuum",
Success: true,
Result: &plugin_pb.JobResult{
Summary: "done",
OutputValues: map[string]*plugin_pb.ConfigValue{
"affected": {
Kind: &plugin_pb.ConfigValue_Int64Value{Int64Value: 1},
},
},
},
CompletedAt: timestamppb.Now(),
})
pluginSvc.Shutdown()
detail, found, err := pluginSvc.BuildJobDetail("job-detail", 100, 5)
if err != nil {
t.Fatalf("BuildJobDetail error: %v", err)
}
if !found || detail == nil {
t.Fatalf("expected job detail")
}
if detail.Job == nil || detail.Job.JobID != "job-detail" {
t.Fatalf("unexpected job detail payload: %+v", detail.Job)
}
if detail.RunRecord == nil || detail.RunRecord.JobID != "job-detail" {
t.Fatalf("expected run record for job-detail, got=%+v", detail.RunRecord)
}
if len(detail.Activities) == 0 {
t.Fatalf("expected activity timeline entries")
}
if detail.Job.ResultOutputValues == nil {
t.Fatalf("expected result output values")
}
}
func TestBuildJobDetailLoadsFromDiskWhenMemoryCleared(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{DataDir: t.TempDir()})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
pluginSvc.trackExecutionStart("req-disk", "worker-d", &plugin_pb.JobSpec{
JobId: "job-disk",
JobType: "vacuum",
Summary: "disk summary",
Detail: "disk detail payload",
}, 1)
pluginSvc.Shutdown()
pluginSvc.jobsMu.Lock()
pluginSvc.jobs = map[string]*TrackedJob{}
pluginSvc.jobsMu.Unlock()
pluginSvc.activitiesMu.Lock()
pluginSvc.activities = nil
pluginSvc.activitiesMu.Unlock()
detail, found, err := pluginSvc.BuildJobDetail("job-disk", 100, 0)
if err != nil {
t.Fatalf("BuildJobDetail: %v", err)
}
if !found || detail == nil || detail.Job == nil {
t.Fatalf("expected detail from disk")
}
if detail.Job.Detail != "disk detail payload" {
t.Fatalf("unexpected disk detail payload: %q", detail.Job.Detail)
}
}


@@ -0,0 +1,945 @@
package plugin
import (
"context"
"errors"
"fmt"
"strings"
"sync"
"time"
"github.com/seaweedfs/seaweedfs/weed/glog"
"github.com/seaweedfs/seaweedfs/weed/pb/plugin_pb"
"google.golang.org/protobuf/types/known/timestamppb"
)
var errExecutorAtCapacity = errors.New("executor is at capacity")
const (
defaultSchedulerTick = 5 * time.Second
defaultScheduledDetectionInterval = 300 * time.Second
defaultScheduledDetectionTimeout = 45 * time.Second
defaultScheduledExecutionTimeout = 90 * time.Second
defaultScheduledMaxResults int32 = 1000
defaultScheduledExecutionConcurrency = 1
defaultScheduledPerWorkerConcurrency = 1
maxScheduledExecutionConcurrency = 128
defaultScheduledRetryBackoff = 5 * time.Second
defaultClusterContextTimeout = 10 * time.Second
defaultWaitingBacklogFloor = 8
defaultWaitingBacklogMultiplier = 4
)
type schedulerPolicy struct {
DetectionInterval time.Duration
DetectionTimeout time.Duration
ExecutionTimeout time.Duration
RetryBackoff time.Duration
MaxResults int32
ExecutionConcurrency int
PerWorkerConcurrency int
RetryLimit int
ExecutorReserveBackoff time.Duration
}
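// schedulerLoop drives scheduled detection on a fixed tick until shutdown.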
func (r *Plugin) schedulerLoop() {
defer r.wg.Done()
ticker := time.NewTicker(r.schedulerTick)
defer ticker.Stop()
// Try once immediately on startup.
r.runSchedulerTick()
for {
select {
case <-r.shutdownCh:
return
case <-ticker.C:
r.runSchedulerTick()
}
}
}
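// runSchedulerTick evaluates every detectable job type: it loads the job
// type's policy, launches a detection run if one is due, and prunes scheduler
// state and detector leases for job types no worker advertises anymore.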
func (r *Plugin) runSchedulerTick() {
jobTypes := r.registry.DetectableJobTypes()
if len(jobTypes) == 0 {
return
}
active := make(map[string]struct{}, len(jobTypes))
for _, jobType := range jobTypes {
active[jobType] = struct{}{}
policy, enabled, err := r.loadSchedulerPolicy(jobType)
if err != nil {
glog.Warningf("Plugin scheduler failed to load policy for %s: %v", jobType, err)
continue
}
if !enabled {
r.clearSchedulerJobType(jobType)
continue
}
if !r.markDetectionDue(jobType, policy.DetectionInterval) {
continue
}
r.wg.Add(1)
go func(jt string, p schedulerPolicy) {
defer r.wg.Done()
r.runScheduledDetection(jt, p)
}(jobType, policy)
}
r.pruneSchedulerState(active)
r.pruneDetectorLeases(active)
}
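// loadSchedulerPolicy builds the scheduler policy for a job type from its
// persisted config, falling back to the descriptor's runtime defaults, and
// clamps all values into safe ranges. The boolean result reports whether
// scheduling is enabled for the job type.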
func (r *Plugin) loadSchedulerPolicy(jobType string) (schedulerPolicy, bool, error) {
cfg, err := r.store.LoadJobTypeConfig(jobType)
if err != nil {
return schedulerPolicy{}, false, err
}
descriptor, err := r.store.LoadDescriptor(jobType)
if err != nil {
return schedulerPolicy{}, false, err
}
adminRuntime := deriveSchedulerAdminRuntime(cfg, descriptor)
if adminRuntime == nil {
return schedulerPolicy{}, false, nil
}
if !adminRuntime.Enabled {
return schedulerPolicy{}, false, nil
}
policy := schedulerPolicy{
DetectionInterval: durationFromSeconds(adminRuntime.DetectionIntervalSeconds, defaultScheduledDetectionInterval),
DetectionTimeout: durationFromSeconds(adminRuntime.DetectionTimeoutSeconds, defaultScheduledDetectionTimeout),
ExecutionTimeout: defaultScheduledExecutionTimeout,
RetryBackoff: durationFromSeconds(adminRuntime.RetryBackoffSeconds, defaultScheduledRetryBackoff),
MaxResults: adminRuntime.MaxJobsPerDetection,
ExecutionConcurrency: int(adminRuntime.GlobalExecutionConcurrency),
PerWorkerConcurrency: int(adminRuntime.PerWorkerExecutionConcurrency),
RetryLimit: int(adminRuntime.RetryLimit),
ExecutorReserveBackoff: 200 * time.Millisecond,
}
if policy.DetectionInterval < r.schedulerTick {
policy.DetectionInterval = r.schedulerTick
}
if policy.MaxResults <= 0 {
policy.MaxResults = defaultScheduledMaxResults
}
if policy.ExecutionConcurrency <= 0 {
policy.ExecutionConcurrency = defaultScheduledExecutionConcurrency
}
if policy.ExecutionConcurrency > maxScheduledExecutionConcurrency {
policy.ExecutionConcurrency = maxScheduledExecutionConcurrency
}
if policy.PerWorkerConcurrency <= 0 {
policy.PerWorkerConcurrency = defaultScheduledPerWorkerConcurrency
}
if policy.PerWorkerConcurrency > policy.ExecutionConcurrency {
policy.PerWorkerConcurrency = policy.ExecutionConcurrency
}
if policy.RetryLimit < 0 {
policy.RetryLimit = 0
}
// The plugin protocol currently exposes only a detection timeout in the
// admin settings, so derive the execution timeout as twice the detection
// timeout, floored at the scheduler default below.
execTimeout := time.Duration(adminRuntime.DetectionTimeoutSeconds*2) * time.Second
if execTimeout < defaultScheduledExecutionTimeout {
execTimeout = defaultScheduledExecutionTimeout
}
policy.ExecutionTimeout = execTimeout
return policy, true, nil
}
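// ListSchedulerStates reports per-job-type scheduler status for monitoring:
// enablement, effective policy values, next detection time, the leased or
// available detector, and executor availability.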
func (r *Plugin) ListSchedulerStates() ([]SchedulerJobTypeState, error) {
jobTypes, err := r.ListKnownJobTypes()
if err != nil {
return nil, err
}
r.schedulerMu.Lock()
nextDetectionAt := make(map[string]time.Time, len(r.nextDetectionAt))
for jobType, nextRun := range r.nextDetectionAt {
nextDetectionAt[jobType] = nextRun
}
detectionInFlight := make(map[string]bool, len(r.detectionInFlight))
for jobType, inFlight := range r.detectionInFlight {
detectionInFlight[jobType] = inFlight
}
r.schedulerMu.Unlock()
states := make([]SchedulerJobTypeState, 0, len(jobTypes))
for _, jobType := range jobTypes {
state := SchedulerJobTypeState{
JobType: jobType,
DetectionInFlight: detectionInFlight[jobType],
}
if nextRun, ok := nextDetectionAt[jobType]; ok && !nextRun.IsZero() {
nextRunUTC := nextRun.UTC()
state.NextDetectionAt = &nextRunUTC
}
policy, enabled, loadErr := r.loadSchedulerPolicy(jobType)
if loadErr != nil {
state.PolicyError = loadErr.Error()
} else {
state.Enabled = enabled
if enabled {
state.DetectionIntervalSeconds = secondsFromDuration(policy.DetectionInterval)
state.DetectionTimeoutSeconds = secondsFromDuration(policy.DetectionTimeout)
state.ExecutionTimeoutSeconds = secondsFromDuration(policy.ExecutionTimeout)
state.MaxJobsPerDetection = policy.MaxResults
state.GlobalExecutionConcurrency = policy.ExecutionConcurrency
state.PerWorkerExecutionConcurrency = policy.PerWorkerConcurrency
state.RetryLimit = policy.RetryLimit
state.RetryBackoffSeconds = secondsFromDuration(policy.RetryBackoff)
}
}
leasedWorkerID := r.getDetectorLease(jobType)
if leasedWorkerID != "" {
state.DetectorWorkerID = leasedWorkerID
if worker, ok := r.registry.Get(leasedWorkerID); ok {
if capability := worker.Capabilities[jobType]; capability != nil && capability.CanDetect {
state.DetectorAvailable = true
}
}
}
if state.DetectorWorkerID == "" {
detector, detectorErr := r.registry.PickDetector(jobType)
if detectorErr == nil && detector != nil {
state.DetectorAvailable = true
state.DetectorWorkerID = detector.WorkerID
}
}
executors, executorErr := r.registry.ListExecutors(jobType)
if executorErr == nil {
state.ExecutorWorkerCount = len(executors)
}
states = append(states, state)
}
return states, nil
}
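// deriveSchedulerAdminRuntime prefers the persisted admin runtime config and
// falls back to the descriptor's runtime defaults; it returns nil when
// neither is available.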
func deriveSchedulerAdminRuntime(
cfg *plugin_pb.PersistedJobTypeConfig,
descriptor *plugin_pb.JobTypeDescriptor,
) *plugin_pb.AdminRuntimeConfig {
if cfg != nil && cfg.AdminRuntime != nil {
adminConfig := *cfg.AdminRuntime
return &adminConfig
}
if descriptor == nil || descriptor.AdminRuntimeDefaults == nil {
return nil
}
defaults := descriptor.AdminRuntimeDefaults
return &plugin_pb.AdminRuntimeConfig{
Enabled: defaults.Enabled,
DetectionIntervalSeconds: defaults.DetectionIntervalSeconds,
DetectionTimeoutSeconds: defaults.DetectionTimeoutSeconds,
MaxJobsPerDetection: defaults.MaxJobsPerDetection,
GlobalExecutionConcurrency: defaults.GlobalExecutionConcurrency,
PerWorkerExecutionConcurrency: defaults.PerWorkerExecutionConcurrency,
RetryLimit: defaults.RetryLimit,
RetryBackoffSeconds: defaults.RetryBackoffSeconds,
}
}
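// markDetectionDue reports whether a detection run is due for the job type
// and, if so, claims it by recording the next run time and marking the
// detection in flight.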
func (r *Plugin) markDetectionDue(jobType string, interval time.Duration) bool {
now := time.Now().UTC()
r.schedulerMu.Lock()
defer r.schedulerMu.Unlock()
if r.detectionInFlight[jobType] {
return false
}
nextRun, exists := r.nextDetectionAt[jobType]
if exists && now.Before(nextRun) {
return false
}
r.nextDetectionAt[jobType] = now.Add(interval)
r.detectionInFlight[jobType] = true
return true
}
func (r *Plugin) finishDetection(jobType string) {
r.schedulerMu.Lock()
delete(r.detectionInFlight, jobType)
r.schedulerMu.Unlock()
}
func (r *Plugin) pruneSchedulerState(activeJobTypes map[string]struct{}) {
r.schedulerMu.Lock()
defer r.schedulerMu.Unlock()
for jobType := range r.nextDetectionAt {
if _, ok := activeJobTypes[jobType]; !ok {
delete(r.nextDetectionAt, jobType)
delete(r.detectionInFlight, jobType)
}
}
}
func (r *Plugin) clearSchedulerJobType(jobType string) {
r.schedulerMu.Lock()
delete(r.nextDetectionAt, jobType)
delete(r.detectionInFlight, jobType)
r.schedulerMu.Unlock()
r.clearDetectorLease(jobType, "")
}
func (r *Plugin) pruneDetectorLeases(activeJobTypes map[string]struct{}) {
r.detectorLeaseMu.Lock()
defer r.detectorLeaseMu.Unlock()
for jobType := range r.detectorLeases {
if _, ok := activeJobTypes[jobType]; !ok {
delete(r.detectorLeases, jobType)
}
}
}
func (r *Plugin) runScheduledDetection(jobType string, policy schedulerPolicy) {
defer r.finishDetection(jobType)
start := time.Now().UTC()
r.appendActivity(JobActivity{
JobType: jobType,
Source: "admin_scheduler",
Message: "scheduled detection started",
Stage: "detecting",
OccurredAt: timeToPtr(start),
})
if skip, waitingCount, waitingThreshold := r.shouldSkipDetectionForWaitingJobs(jobType, policy); skip {
r.appendActivity(JobActivity{
JobType: jobType,
Source: "admin_scheduler",
Message: fmt.Sprintf("scheduled detection skipped: waiting backlog %d reached threshold %d", waitingCount, waitingThreshold),
Stage: "skipped_waiting_backlog",
OccurredAt: timeToPtr(time.Now().UTC()),
})
return
}
clusterContext, err := r.loadSchedulerClusterContext()
if err != nil {
r.appendActivity(JobActivity{
JobType: jobType,
Source: "admin_scheduler",
Message: fmt.Sprintf("scheduled detection aborted: %v", err),
Stage: "failed",
OccurredAt: timeToPtr(time.Now().UTC()),
})
return
}
ctx, cancel := context.WithTimeout(context.Background(), policy.DetectionTimeout)
proposals, err := r.RunDetection(ctx, jobType, clusterContext, policy.MaxResults)
cancel()
if err != nil {
r.appendActivity(JobActivity{
JobType: jobType,
Source: "admin_scheduler",
Message: fmt.Sprintf("scheduled detection failed: %v", err),
Stage: "failed",
OccurredAt: timeToPtr(time.Now().UTC()),
})
return
}
r.appendActivity(JobActivity{
JobType: jobType,
Source: "admin_scheduler",
Message: fmt.Sprintf("scheduled detection completed: %d proposal(s)", len(proposals)),
Stage: "detected",
OccurredAt: timeToPtr(time.Now().UTC()),
})
filteredByActive, skippedActive := r.filterProposalsWithActiveJobs(jobType, proposals)
if skippedActive > 0 {
r.appendActivity(JobActivity{
JobType: jobType,
Source: "admin_scheduler",
Message: fmt.Sprintf("scheduled detection skipped %d proposal(s) due to active assigned/running jobs", skippedActive),
Stage: "deduped_active_jobs",
OccurredAt: timeToPtr(time.Now().UTC()),
})
}
if len(filteredByActive) == 0 {
return
}
filtered := r.filterScheduledProposals(filteredByActive)
if len(filtered) != len(filteredByActive) {
r.appendActivity(JobActivity{
JobType: jobType,
Source: "admin_scheduler",
Message: fmt.Sprintf("scheduled detection deduped %d proposal(s) within this run", len(filteredByActive)-len(filtered)),
Stage: "deduped",
OccurredAt: timeToPtr(time.Now().UTC()),
})
}
if len(filtered) == 0 {
return
}
r.dispatchScheduledProposals(jobType, filtered, clusterContext, policy)
}
func (r *Plugin) loadSchedulerClusterContext() (*plugin_pb.ClusterContext, error) {
if r.clusterContextProvider == nil {
return nil, fmt.Errorf("cluster context provider is not configured")
}
ctx, cancel := context.WithTimeout(context.Background(), defaultClusterContextTimeout)
defer cancel()
clusterContext, err := r.clusterContextProvider(ctx)
if err != nil {
return nil, err
}
if clusterContext == nil {
return nil, fmt.Errorf("cluster context provider returned nil")
}
return clusterContext, nil
}
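// dispatchScheduledProposals converts proposals into job specs, queues them,
// and executes them with a pool of ExecutionConcurrency dispatch workers.
// A job bounced by an at-capacity executor is marked queued again and retried
// after a short backoff.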
func (r *Plugin) dispatchScheduledProposals(
jobType string,
proposals []*plugin_pb.JobProposal,
clusterContext *plugin_pb.ClusterContext,
policy schedulerPolicy,
) {
jobQueue := make(chan *plugin_pb.JobSpec, len(proposals))
for index, proposal := range proposals {
job := buildScheduledJobSpec(jobType, proposal, index)
r.trackExecutionQueued(job)
select {
case <-r.shutdownCh:
close(jobQueue)
return
default:
jobQueue <- job
}
}
close(jobQueue)
var wg sync.WaitGroup
var statsMu sync.Mutex
successCount := 0
errorCount := 0
workerCount := policy.ExecutionConcurrency
if workerCount < 1 {
workerCount = 1
}
for i := 0; i < workerCount; i++ {
wg.Add(1)
go func() {
defer wg.Done()
for job := range jobQueue {
select {
case <-r.shutdownCh:
return
default:
}
for {
select {
case <-r.shutdownCh:
return
default:
}
executor, release, reserveErr := r.reserveScheduledExecutor(jobType, policy)
if reserveErr != nil {
select {
case <-r.shutdownCh:
return
default:
}
statsMu.Lock()
errorCount++
statsMu.Unlock()
r.appendActivity(JobActivity{
JobType: jobType,
Source: "admin_scheduler",
Message: fmt.Sprintf("scheduled execution reservation failed: %v", reserveErr),
Stage: "failed",
OccurredAt: timeToPtr(time.Now().UTC()),
})
break
}
err := r.executeScheduledJobWithExecutor(executor, job, clusterContext, policy)
release()
if errors.Is(err, errExecutorAtCapacity) {
r.trackExecutionQueued(job)
if !waitForShutdownOrTimer(r.shutdownCh, policy.ExecutorReserveBackoff) {
return
}
continue
}
if err != nil {
statsMu.Lock()
errorCount++
statsMu.Unlock()
r.appendActivity(JobActivity{
JobID: job.JobId,
JobType: job.JobType,
Source: "admin_scheduler",
Message: fmt.Sprintf("scheduled execution failed: %v", err),
Stage: "failed",
OccurredAt: timeToPtr(time.Now().UTC()),
})
break
}
statsMu.Lock()
successCount++
statsMu.Unlock()
break
}
}
}()
}
wg.Wait()
r.appendActivity(JobActivity{
JobType: jobType,
Source: "admin_scheduler",
Message: fmt.Sprintf("scheduled execution finished: success=%d error=%d", successCount, errorCount),
Stage: "executed",
OccurredAt: timeToPtr(time.Now().UTC()),
})
}
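// reserveScheduledExecutor polls the registry until an executor with spare
// capacity is found, the execution timeout elapses, or the plugin shuts down.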
func (r *Plugin) reserveScheduledExecutor(
jobType string,
policy schedulerPolicy,
) (*WorkerSession, func(), error) {
deadline := time.Now().Add(policy.ExecutionTimeout)
if policy.ExecutionTimeout <= 0 {
// No execution timeout configured: fall back to a 10-minute cap.
deadline = time.Now().Add(10 * time.Minute)
}
for {
select {
case <-r.shutdownCh:
return nil, nil, fmt.Errorf("plugin is shutting down")
default:
}
if time.Now().After(deadline) {
return nil, nil, fmt.Errorf("timed out waiting for executor capacity for %s", jobType)
}
executors, err := r.registry.ListExecutors(jobType)
if err != nil {
if !waitForShutdownOrTimer(r.shutdownCh, policy.ExecutorReserveBackoff) {
return nil, nil, fmt.Errorf("plugin is shutting down")
}
continue
}
for _, executor := range executors {
release, ok := r.tryReserveExecutorCapacity(executor, jobType, policy)
if !ok {
continue
}
return executor, release, nil
}
if !waitForShutdownOrTimer(r.shutdownCh, policy.ExecutorReserveBackoff) {
return nil, nil, fmt.Errorf("plugin is shutting down")
}
}
}
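// tryReserveExecutorCapacity reserves one execution slot on the worker when
// heartbeat-reported usage plus in-flight scheduler reservations stay below
// the effective limit. The returned release func must be called exactly once
// to free the slot.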
func (r *Plugin) tryReserveExecutorCapacity(
executor *WorkerSession,
jobType string,
policy schedulerPolicy,
) (func(), bool) {
if executor == nil || strings.TrimSpace(executor.WorkerID) == "" {
return nil, false
}
limit := schedulerWorkerExecutionLimit(executor, jobType, policy)
if limit <= 0 {
return nil, false
}
heartbeatUsed := 0
if executor.Heartbeat != nil && executor.Heartbeat.ExecutionSlotsUsed > 0 {
heartbeatUsed = int(executor.Heartbeat.ExecutionSlotsUsed)
}
workerID := strings.TrimSpace(executor.WorkerID)
r.schedulerExecMu.Lock()
reserved := r.schedulerExecReservations[workerID]
if heartbeatUsed+reserved >= limit {
r.schedulerExecMu.Unlock()
return nil, false
}
r.schedulerExecReservations[workerID] = reserved + 1
r.schedulerExecMu.Unlock()
release := func() {
r.releaseExecutorCapacity(workerID)
}
return release, true
}
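// releaseExecutorCapacity returns one reserved execution slot for the worker,
// deleting the reservation entry once the count reaches zero.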
func (r *Plugin) releaseExecutorCapacity(workerID string) {
workerID = strings.TrimSpace(workerID)
if workerID == "" {
return
}
r.schedulerExecMu.Lock()
defer r.schedulerExecMu.Unlock()
current := r.schedulerExecReservations[workerID]
if current <= 1 {
delete(r.schedulerExecReservations, workerID)
return
}
r.schedulerExecReservations[workerID] = current - 1
}
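// schedulerWorkerExecutionLimit computes the effective per-worker execution
// limit: the policy value (or the scheduler default), capped by the worker's
// advertised MaxExecutionConcurrency and its heartbeat ExecutionSlotsTotal.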
func schedulerWorkerExecutionLimit(executor *WorkerSession, jobType string, policy schedulerPolicy) int {
limit := policy.PerWorkerConcurrency
if limit <= 0 {
limit = defaultScheduledPerWorkerConcurrency
}
if capability := executor.Capabilities[jobType]; capability != nil && capability.MaxExecutionConcurrency > 0 {
capLimit := int(capability.MaxExecutionConcurrency)
if capLimit < limit {
limit = capLimit
}
}
if executor.Heartbeat != nil && executor.Heartbeat.ExecutionSlotsTotal > 0 {
heartbeatLimit := int(executor.Heartbeat.ExecutionSlotsTotal)
if heartbeatLimit < limit {
limit = heartbeatLimit
}
}
if limit < 0 {
return 0
}
return limit
}
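// executeScheduledJobWithExecutor runs the job for up to policy.RetryLimit+1
// attempts, waiting policy.RetryBackoff between attempts. Capacity rejections
// surface immediately as errExecutorAtCapacity so the caller can requeue
// instead of burning retries.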
func (r *Plugin) executeScheduledJobWithExecutor(
executor *WorkerSession,
job *plugin_pb.JobSpec,
clusterContext *plugin_pb.ClusterContext,
policy schedulerPolicy,
) error {
maxAttempts := policy.RetryLimit + 1
if maxAttempts < 1 {
maxAttempts = 1
}
var lastErr error
for attempt := 1; attempt <= maxAttempts; attempt++ {
select {
case <-r.shutdownCh:
return fmt.Errorf("plugin is shutting down")
default:
}
execTimeout := policy.ExecutionTimeout
if execTimeout <= 0 {
execTimeout = 10 * time.Minute // Match the reservation default so an unset timeout does not yield an already-expired context.
}
execCtx, cancel := context.WithTimeout(context.Background(), execTimeout)
_, err := r.executeJobWithExecutor(execCtx, executor, job, clusterContext, int32(attempt))
cancel()
if err == nil {
return nil
}
if isExecutorAtCapacityError(err) {
return errExecutorAtCapacity
}
lastErr = err
if attempt < maxAttempts {
r.appendActivity(JobActivity{
JobID: job.JobId,
JobType: job.JobType,
Source: "admin_scheduler",
Message: fmt.Sprintf("retrying job attempt %d/%d after error: %v", attempt, maxAttempts, err),
Stage: "retry",
OccurredAt: timeToPtr(time.Now().UTC()),
})
if !waitForShutdownOrTimer(r.shutdownCh, policy.RetryBackoff) {
return fmt.Errorf("plugin is shutting down")
}
}
}
if lastErr == nil {
lastErr = fmt.Errorf("execution failed without an explicit error")
}
return lastErr
}
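// shouldSkipDetectionForWaitingJobs reports whether the pending backlog for
// the job type has reached the waiting threshold, returning the observed
// count and threshold for logging.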
func (r *Plugin) shouldSkipDetectionForWaitingJobs(jobType string, policy schedulerPolicy) (bool, int, int) {
waitingCount := r.countWaitingTrackedJobs(jobType)
threshold := waitingBacklogThreshold(policy)
if threshold <= 0 {
return false, waitingCount, threshold
}
return waitingCount >= threshold, waitingCount, threshold
}
func (r *Plugin) countWaitingTrackedJobs(jobType string) int {
normalizedJobType := strings.TrimSpace(jobType)
if normalizedJobType == "" {
return 0
}
waiting := 0
r.jobsMu.RLock()
for _, job := range r.jobs {
if job == nil {
continue
}
if strings.TrimSpace(job.JobType) != normalizedJobType {
continue
}
if !isWaitingTrackedJobState(job.State) {
continue
}
waiting++
}
r.jobsMu.RUnlock()
return waiting
}
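// waitingBacklogThreshold derives the backlog cutoff as execution concurrency
// times defaultWaitingBacklogMultiplier, raised to defaultWaitingBacklogFloor
// and capped by policy.MaxResults when set.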
func waitingBacklogThreshold(policy schedulerPolicy) int {
concurrency := policy.ExecutionConcurrency
if concurrency <= 0 {
concurrency = defaultScheduledExecutionConcurrency
}
threshold := concurrency * defaultWaitingBacklogMultiplier
if threshold < defaultWaitingBacklogFloor {
threshold = defaultWaitingBacklogFloor
}
if policy.MaxResults > 0 && threshold > int(policy.MaxResults) {
threshold = int(policy.MaxResults)
}
return threshold
}
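// isExecutorAtCapacityError matches the errExecutorAtCapacity sentinel or,
// as a fallback, the capacity message when the error crossed the wire as text.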
func isExecutorAtCapacityError(err error) bool {
if err == nil {
return false
}
if errors.Is(err, errExecutorAtCapacity) {
return true
}
return strings.Contains(strings.ToLower(err.Error()), "executor is at capacity")
}
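// buildScheduledJobSpec materializes a JobSpec from a detection proposal,
// minting a unique job id from the job type, current timestamp, and batch
// index; proposal fields override the defaults when present.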
func buildScheduledJobSpec(jobType string, proposal *plugin_pb.JobProposal, index int) *plugin_pb.JobSpec {
now := timestamppb.Now()
jobID := fmt.Sprintf("%s-scheduled-%d-%d", jobType, now.AsTime().UnixNano(), index)
job := &plugin_pb.JobSpec{
JobId: jobID,
JobType: jobType,
Priority: plugin_pb.JobPriority_JOB_PRIORITY_NORMAL,
Parameters: map[string]*plugin_pb.ConfigValue{},
Labels: map[string]string{},
CreatedAt: now,
ScheduledAt: now,
}
if proposal == nil {
return job
}
if proposal.JobType != "" {
job.JobType = proposal.JobType
}
job.Summary = proposal.Summary
job.Detail = proposal.Detail
if proposal.Priority != plugin_pb.JobPriority_JOB_PRIORITY_UNSPECIFIED {
job.Priority = proposal.Priority
}
job.DedupeKey = proposal.DedupeKey
job.Parameters = CloneConfigValueMap(proposal.Parameters)
if proposal.Labels != nil {
job.Labels = make(map[string]string, len(proposal.Labels))
for k, v := range proposal.Labels {
job.Labels[k] = v
}
}
if proposal.NotBefore != nil {
job.ScheduledAt = proposal.NotBefore
}
return job
}
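// durationFromSeconds converts a configured second count to a duration,
// substituting defaultValue for non-positive input.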
func durationFromSeconds(seconds int32, defaultValue time.Duration) time.Duration {
if seconds <= 0 {
return defaultValue
}
return time.Duration(seconds) * time.Second
}
func secondsFromDuration(duration time.Duration) int32 {
if duration <= 0 {
return 0
}
return int32(duration / time.Second)
}
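// waitForShutdownOrTimer sleeps for the given duration and reports false if
// shutdown fires first; a non-positive duration returns true immediately.
// Callers use it to back off without missing shutdown:
//
//	if !waitForShutdownOrTimer(r.shutdownCh, policy.RetryBackoff) {
//		return fmt.Errorf("plugin is shutting down")
//	}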
func waitForShutdownOrTimer(shutdown <-chan struct{}, duration time.Duration) bool {
if duration <= 0 {
return true
}
timer := time.NewTimer(duration)
defer timer.Stop()
select {
case <-shutdown:
return false
case <-timer.C:
return true
}
}
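// filterProposalsWithActiveJobs drops proposals whose execution key collides
// with a tracked job of the same type that is still pending, assigned, or
// running, returning the survivors and the number skipped.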
func (r *Plugin) filterProposalsWithActiveJobs(jobType string, proposals []*plugin_pb.JobProposal) ([]*plugin_pb.JobProposal, int) {
if len(proposals) == 0 {
return proposals, 0
}
normalizedJobType := strings.TrimSpace(jobType)
activeKeys := make(map[string]struct{})
r.jobsMu.RLock()
for _, job := range r.jobs {
if job == nil {
continue
}
if strings.TrimSpace(job.JobType) != normalizedJobType {
continue
}
if !isActiveTrackedJobState(job.State) {
continue
}
key := strings.TrimSpace(job.DedupeKey)
if key == "" {
key = strings.TrimSpace(job.JobID)
}
if key == "" {
continue
}
activeKeys[key] = struct{}{}
}
r.jobsMu.RUnlock()
if len(activeKeys) == 0 {
return proposals, 0
}
filtered := make([]*plugin_pb.JobProposal, 0, len(proposals))
skipped := 0
for _, proposal := range proposals {
if proposal == nil {
continue
}
key := proposalExecutionKey(proposal)
if key != "" {
if _, exists := activeKeys[key]; exists {
skipped++
continue
}
}
filtered = append(filtered, proposal)
}
return filtered, skipped
}
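// proposalExecutionKey returns the trimmed dedupe key when set, falling back
// to the trimmed proposal id.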
func proposalExecutionKey(proposal *plugin_pb.JobProposal) string {
if proposal == nil {
return ""
}
key := strings.TrimSpace(proposal.DedupeKey)
if key != "" {
return key
}
return strings.TrimSpace(proposal.ProposalId)
}
func isActiveTrackedJobState(state string) bool {
normalized := strings.ToLower(strings.TrimSpace(state))
switch normalized {
case "pending", "assigned", "running", "in_progress", "job_state_pending", "job_state_assigned", "job_state_running":
return true
default:
return false
}
}
func isWaitingTrackedJobState(state string) bool {
normalized := strings.ToLower(strings.TrimSpace(state))
return normalized == "pending" || normalized == "job_state_pending"
}
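// filterScheduledProposals removes duplicates within a single detection run,
// keyed by dedupe key (else proposal id); keyless proposals always pass.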
func (r *Plugin) filterScheduledProposals(proposals []*plugin_pb.JobProposal) []*plugin_pb.JobProposal {
filtered := make([]*plugin_pb.JobProposal, 0, len(proposals))
seenInRun := make(map[string]struct{}, len(proposals))
for _, proposal := range proposals {
if proposal == nil {
continue
}
key := proposalExecutionKey(proposal)
if key == "" {
filtered = append(filtered, proposal)
continue
}
if _, exists := seenInRun[key]; exists {
continue
}
seenInRun[key] = struct{}{}
filtered = append(filtered, proposal)
}
return filtered
}


@@ -0,0 +1,583 @@
package plugin
import (
"fmt"
"testing"
"time"
"github.com/seaweedfs/seaweedfs/weed/pb/plugin_pb"
)
func TestLoadSchedulerPolicyUsesAdminConfig(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
err = pluginSvc.SaveJobTypeConfig(&plugin_pb.PersistedJobTypeConfig{
JobType: "vacuum",
AdminRuntime: &plugin_pb.AdminRuntimeConfig{
Enabled: true,
DetectionIntervalSeconds: 30,
DetectionTimeoutSeconds: 20,
MaxJobsPerDetection: 123,
GlobalExecutionConcurrency: 5,
PerWorkerExecutionConcurrency: 2,
RetryLimit: 4,
RetryBackoffSeconds: 7,
},
})
if err != nil {
t.Fatalf("SaveJobTypeConfig: %v", err)
}
policy, enabled, err := pluginSvc.loadSchedulerPolicy("vacuum")
if err != nil {
t.Fatalf("loadSchedulerPolicy: %v", err)
}
if !enabled {
t.Fatalf("expected enabled policy")
}
if policy.MaxResults != 123 {
t.Fatalf("unexpected max results: got=%d", policy.MaxResults)
}
if policy.ExecutionConcurrency != 5 {
t.Fatalf("unexpected global concurrency: got=%d", policy.ExecutionConcurrency)
}
if policy.PerWorkerConcurrency != 2 {
t.Fatalf("unexpected per-worker concurrency: got=%d", policy.PerWorkerConcurrency)
}
if policy.RetryLimit != 4 {
t.Fatalf("unexpected retry limit: got=%d", policy.RetryLimit)
}
}
func TestLoadSchedulerPolicyUsesDescriptorDefaultsWhenConfigMissing(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
err = pluginSvc.store.SaveDescriptor("ec", &plugin_pb.JobTypeDescriptor{
JobType: "ec",
AdminRuntimeDefaults: &plugin_pb.AdminRuntimeDefaults{
Enabled: true,
DetectionIntervalSeconds: 60,
DetectionTimeoutSeconds: 25,
MaxJobsPerDetection: 30,
GlobalExecutionConcurrency: 4,
PerWorkerExecutionConcurrency: 2,
RetryLimit: 3,
RetryBackoffSeconds: 6,
},
})
if err != nil {
t.Fatalf("SaveDescriptor: %v", err)
}
policy, enabled, err := pluginSvc.loadSchedulerPolicy("ec")
if err != nil {
t.Fatalf("loadSchedulerPolicy: %v", err)
}
if !enabled {
t.Fatalf("expected enabled policy from descriptor defaults")
}
if policy.MaxResults != 30 {
t.Fatalf("unexpected max results: got=%d", policy.MaxResults)
}
if policy.ExecutionConcurrency != 4 {
t.Fatalf("unexpected global concurrency: got=%d", policy.ExecutionConcurrency)
}
if policy.PerWorkerConcurrency != 2 {
t.Fatalf("unexpected per-worker concurrency: got=%d", policy.PerWorkerConcurrency)
}
}
func TestReserveScheduledExecutorRespectsPerWorkerLimit(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
pluginSvc.registry.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-a",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "balance", CanExecute: true, MaxExecutionConcurrency: 4},
},
})
pluginSvc.registry.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-b",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "balance", CanExecute: true, MaxExecutionConcurrency: 2},
},
})
policy := schedulerPolicy{
PerWorkerConcurrency: 1,
ExecutorReserveBackoff: time.Millisecond,
}
executor1, release1, err := pluginSvc.reserveScheduledExecutor("balance", policy)
if err != nil {
t.Fatalf("reserve executor 1: %v", err)
}
defer release1()
executor2, release2, err := pluginSvc.reserveScheduledExecutor("balance", policy)
if err != nil {
t.Fatalf("reserve executor 2: %v", err)
}
defer release2()
if executor1.WorkerID == executor2.WorkerID {
t.Fatalf("expected different executors due per-worker limit, got same worker %s", executor1.WorkerID)
}
}
func TestFilterScheduledProposalsDedupe(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
proposals := []*plugin_pb.JobProposal{
{ProposalId: "p1", DedupeKey: "d1"},
{ProposalId: "p2", DedupeKey: "d1"}, // same dedupe key
{ProposalId: "p3", DedupeKey: "d3"},
{ProposalId: "p3"}, // fallback dedupe by proposal id
{ProposalId: "p4"},
{ProposalId: "p4"}, // same proposal id, no dedupe key
}
filtered := pluginSvc.filterScheduledProposals(proposals)
if len(filtered) != 4 {
t.Fatalf("unexpected filtered size: got=%d want=4", len(filtered))
}
filtered2 := pluginSvc.filterScheduledProposals(proposals)
if len(filtered2) != 4 {
t.Fatalf("expected second run dedupe to be per-run only, got=%d", len(filtered2))
}
}
func TestBuildScheduledJobSpecDoesNotReuseProposalID(t *testing.T) {
t.Parallel()
proposal := &plugin_pb.JobProposal{
ProposalId: "vacuum-2",
DedupeKey: "vacuum:2",
JobType: "vacuum",
}
jobA := buildScheduledJobSpec("vacuum", proposal, 0)
jobB := buildScheduledJobSpec("vacuum", proposal, 1)
if jobA.JobId == proposal.ProposalId {
t.Fatalf("scheduled job id must not reuse proposal id: %s", jobA.JobId)
}
if jobB.JobId == proposal.ProposalId {
t.Fatalf("scheduled job id must not reuse proposal id: %s", jobB.JobId)
}
if jobA.JobId == jobB.JobId {
t.Fatalf("scheduled job ids must be unique across jobs: %s", jobA.JobId)
}
}
func TestFilterProposalsWithActiveJobs(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
pluginSvc.trackExecutionStart("req-1", "worker-a", &plugin_pb.JobSpec{
JobId: "job-1",
JobType: "vacuum",
DedupeKey: "vacuum:k1",
}, 1)
pluginSvc.trackExecutionStart("req-2", "worker-b", &plugin_pb.JobSpec{
JobId: "job-2",
JobType: "vacuum",
}, 1)
pluginSvc.trackExecutionQueued(&plugin_pb.JobSpec{
JobId: "job-3",
JobType: "vacuum",
DedupeKey: "vacuum:k4",
})
filtered, skipped := pluginSvc.filterProposalsWithActiveJobs("vacuum", []*plugin_pb.JobProposal{
{ProposalId: "proposal-1", JobType: "vacuum", DedupeKey: "vacuum:k1"},
{ProposalId: "job-2", JobType: "vacuum"},
{ProposalId: "proposal-2b", JobType: "vacuum", DedupeKey: "vacuum:k4"},
{ProposalId: "proposal-3", JobType: "vacuum", DedupeKey: "vacuum:k3"},
{ProposalId: "proposal-4", JobType: "balance", DedupeKey: "balance:k1"},
})
if skipped != 3 {
t.Fatalf("unexpected skipped count: got=%d want=3", skipped)
}
if len(filtered) != 2 {
t.Fatalf("unexpected filtered size: got=%d want=2", len(filtered))
}
if filtered[0].ProposalId != "proposal-3" || filtered[1].ProposalId != "proposal-4" {
t.Fatalf("unexpected filtered proposals: got=%s,%s", filtered[0].ProposalId, filtered[1].ProposalId)
}
}
func TestReserveScheduledExecutorReturnsPromptlyOnShutdown(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
policy := schedulerPolicy{
ExecutionTimeout: 30 * time.Millisecond,
ExecutorReserveBackoff: 5 * time.Millisecond,
PerWorkerConcurrency: 1,
}
start := time.Now()
pluginSvc.Shutdown()
_, _, err = pluginSvc.reserveScheduledExecutor("missing-job-type", policy)
if err == nil {
t.Fatalf("expected reservation shutdown error")
}
if time.Since(start) > 50*time.Millisecond {
t.Fatalf("reservation returned too late after shutdown: duration=%v", time.Since(start))
}
}
func TestReserveScheduledExecutorWaitsForWorkerCapacity(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
pluginSvc.registry.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-a",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "balance", CanExecute: true, MaxExecutionConcurrency: 1},
},
})
policy := schedulerPolicy{
ExecutionTimeout: time.Second,
PerWorkerConcurrency: 8,
ExecutorReserveBackoff: 5 * time.Millisecond,
}
_, release1, err := pluginSvc.reserveScheduledExecutor("balance", policy)
if err != nil {
t.Fatalf("reserve executor 1: %v", err)
}
defer release1()
type reserveResult struct {
err error
}
secondReserveCh := make(chan reserveResult, 1)
go func() {
_, release2, reserveErr := pluginSvc.reserveScheduledExecutor("balance", policy)
if release2 != nil {
release2()
}
secondReserveCh <- reserveResult{err: reserveErr}
}()
select {
case result := <-secondReserveCh:
t.Fatalf("expected second reservation to wait for capacity, got=%v", result.err)
case <-time.After(25 * time.Millisecond):
// Expected: still waiting.
}
release1()
select {
case result := <-secondReserveCh:
if result.err != nil {
t.Fatalf("second reservation error: %v", result.err)
}
case <-time.After(200 * time.Millisecond):
t.Fatalf("second reservation did not acquire after capacity release")
}
}
func TestShouldSkipDetectionForWaitingJobs(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
policy := schedulerPolicy{
ExecutionConcurrency: 2,
MaxResults: 100,
}
threshold := waitingBacklogThreshold(policy)
if threshold <= 0 {
t.Fatalf("expected positive waiting threshold")
}
for i := 0; i < threshold; i++ {
pluginSvc.trackExecutionQueued(&plugin_pb.JobSpec{
JobId: fmt.Sprintf("job-waiting-%d", i),
JobType: "vacuum",
DedupeKey: fmt.Sprintf("vacuum:%d", i),
})
}
skip, waitingCount, waitingThreshold := pluginSvc.shouldSkipDetectionForWaitingJobs("vacuum", policy)
if !skip {
t.Fatalf("expected detection to skip when waiting backlog reaches threshold")
}
if waitingCount != threshold {
t.Fatalf("unexpected waiting count: got=%d want=%d", waitingCount, threshold)
}
if waitingThreshold != threshold {
t.Fatalf("unexpected waiting threshold: got=%d want=%d", waitingThreshold, threshold)
}
}
func TestWaitingBacklogThresholdHonorsMaxResultsCap(t *testing.T) {
t.Parallel()
policy := schedulerPolicy{
ExecutionConcurrency: 8,
MaxResults: 6,
}
threshold := waitingBacklogThreshold(policy)
if threshold != 6 {
t.Fatalf("expected threshold to be capped by max results, got=%d", threshold)
}
}
func TestListSchedulerStatesIncludesPolicyAndState(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
const jobType = "vacuum"
err = pluginSvc.SaveJobTypeConfig(&plugin_pb.PersistedJobTypeConfig{
JobType: jobType,
AdminRuntime: &plugin_pb.AdminRuntimeConfig{
Enabled: true,
DetectionIntervalSeconds: 45,
DetectionTimeoutSeconds: 30,
MaxJobsPerDetection: 80,
GlobalExecutionConcurrency: 3,
PerWorkerExecutionConcurrency: 2,
RetryLimit: 1,
RetryBackoffSeconds: 9,
},
})
if err != nil {
t.Fatalf("SaveJobTypeConfig: %v", err)
}
pluginSvc.registry.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-a",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: jobType, CanDetect: true, CanExecute: true},
},
})
nextDetectionAt := time.Now().UTC().Add(2 * time.Minute).Round(time.Second)
pluginSvc.schedulerMu.Lock()
pluginSvc.nextDetectionAt[jobType] = nextDetectionAt
pluginSvc.detectionInFlight[jobType] = true
pluginSvc.schedulerMu.Unlock()
states, err := pluginSvc.ListSchedulerStates()
if err != nil {
t.Fatalf("ListSchedulerStates: %v", err)
}
state := findSchedulerState(states, jobType)
if state == nil {
t.Fatalf("missing scheduler state for %s", jobType)
}
if !state.Enabled {
t.Fatalf("expected enabled scheduler state")
}
if state.PolicyError != "" {
t.Fatalf("unexpected policy error: %s", state.PolicyError)
}
if !state.DetectionInFlight {
t.Fatalf("expected detection in flight")
}
if state.NextDetectionAt == nil {
t.Fatalf("expected next detection time")
}
if state.NextDetectionAt.Unix() != nextDetectionAt.Unix() {
t.Fatalf("unexpected next detection time: got=%v want=%v", state.NextDetectionAt, nextDetectionAt)
}
if state.DetectionIntervalSeconds != 45 {
t.Fatalf("unexpected detection interval: got=%d", state.DetectionIntervalSeconds)
}
if state.DetectionTimeoutSeconds != 30 {
t.Fatalf("unexpected detection timeout: got=%d", state.DetectionTimeoutSeconds)
}
if state.ExecutionTimeoutSeconds != 90 {
t.Fatalf("unexpected execution timeout: got=%d", state.ExecutionTimeoutSeconds)
}
if state.MaxJobsPerDetection != 80 {
t.Fatalf("unexpected max jobs per detection: got=%d", state.MaxJobsPerDetection)
}
if state.GlobalExecutionConcurrency != 3 {
t.Fatalf("unexpected global execution concurrency: got=%d", state.GlobalExecutionConcurrency)
}
if state.PerWorkerExecutionConcurrency != 2 {
t.Fatalf("unexpected per worker execution concurrency: got=%d", state.PerWorkerExecutionConcurrency)
}
if state.RetryLimit != 1 {
t.Fatalf("unexpected retry limit: got=%d", state.RetryLimit)
}
if state.RetryBackoffSeconds != 9 {
t.Fatalf("unexpected retry backoff: got=%d", state.RetryBackoffSeconds)
}
if !state.DetectorAvailable || state.DetectorWorkerID != "worker-a" {
t.Fatalf("unexpected detector assignment: available=%v worker=%s", state.DetectorAvailable, state.DetectorWorkerID)
}
if state.ExecutorWorkerCount != 1 {
t.Fatalf("unexpected executor worker count: got=%d", state.ExecutorWorkerCount)
}
}
func TestListSchedulerStatesShowsDisabledWhenNoPolicy(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
const jobType = "balance"
pluginSvc.registry.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-b",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: jobType, CanDetect: true, CanExecute: true},
},
})
states, err := pluginSvc.ListSchedulerStates()
if err != nil {
t.Fatalf("ListSchedulerStates: %v", err)
}
state := findSchedulerState(states, jobType)
if state == nil {
t.Fatalf("missing scheduler state for %s", jobType)
}
if state.Enabled {
t.Fatalf("expected disabled scheduler state")
}
if state.PolicyError != "" {
t.Fatalf("unexpected policy error: %s", state.PolicyError)
}
if !state.DetectorAvailable || state.DetectorWorkerID != "worker-b" {
t.Fatalf("unexpected detector details: available=%v worker=%s", state.DetectorAvailable, state.DetectorWorkerID)
}
if state.ExecutorWorkerCount != 1 {
t.Fatalf("unexpected executor worker count: got=%d", state.ExecutorWorkerCount)
}
}
func findSchedulerState(states []SchedulerJobTypeState, jobType string) *SchedulerJobTypeState {
for i := range states {
if states[i].JobType == jobType {
return &states[i]
}
}
return nil
}
func TestPickDetectorPrefersLeasedWorker(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
pluginSvc.registry.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-a",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "vacuum", CanDetect: true},
},
})
pluginSvc.registry.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-b",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "vacuum", CanDetect: true},
},
})
pluginSvc.setDetectorLease("vacuum", "worker-b")
detector, err := pluginSvc.pickDetector("vacuum")
if err != nil {
t.Fatalf("pickDetector: %v", err)
}
if detector.WorkerID != "worker-b" {
t.Fatalf("expected leased detector worker-b, got=%s", detector.WorkerID)
}
}
func TestPickDetectorReassignsWhenLeaseIsStale(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
pluginSvc.registry.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-a",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "vacuum", CanDetect: true},
},
})
pluginSvc.setDetectorLease("vacuum", "worker-stale")
detector, err := pluginSvc.pickDetector("vacuum")
if err != nil {
t.Fatalf("pickDetector: %v", err)
}
if detector.WorkerID != "worker-a" {
t.Fatalf("expected reassigned detector worker-a, got=%s", detector.WorkerID)
}
lease := pluginSvc.getDetectorLease("vacuum")
if lease != "worker-a" {
t.Fatalf("expected detector lease to be updated to worker-a, got=%s", lease)
}
}


@@ -0,0 +1,66 @@
package plugin
import (
"context"
"sort"
"time"
"github.com/seaweedfs/seaweedfs/weed/glog"
"github.com/seaweedfs/seaweedfs/weed/pb/plugin_pb"
)
const descriptorPrefetchTimeout = 20 * time.Second
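// prefetchDescriptorsFromHello requests config schemas for job types the
// worker advertises with detect or execute capability and that have no
// stored descriptor yet; failures are logged and skipped.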
func (r *Plugin) prefetchDescriptorsFromHello(hello *plugin_pb.WorkerHello) {
if hello == nil || len(hello.Capabilities) == 0 {
return
}
jobTypeSet := make(map[string]struct{})
for _, capability := range hello.Capabilities {
if capability == nil || capability.JobType == "" {
continue
}
if !capability.CanDetect && !capability.CanExecute {
continue
}
jobTypeSet[capability.JobType] = struct{}{}
}
if len(jobTypeSet) == 0 {
return
}
jobTypes := make([]string, 0, len(jobTypeSet))
for jobType := range jobTypeSet {
jobTypes = append(jobTypes, jobType)
}
sort.Strings(jobTypes)
for _, jobType := range jobTypes {
select {
case <-r.shutdownCh:
return
default:
}
descriptor, err := r.store.LoadDescriptor(jobType)
if err != nil {
glog.Warningf("Plugin descriptor prefetch check failed for %s: %v", jobType, err)
continue
}
if descriptor != nil {
continue
}
ctx, cancel := context.WithTimeout(r.ctx, descriptorPrefetchTimeout)
_, err = r.RequestConfigSchema(ctx, jobType, false)
cancel()
if err != nil {
glog.V(1).Infof("Plugin descriptor prefetch skipped for %s: %v", jobType, err)
continue
}
glog.V(1).Infof("Plugin descriptor prefetched for job_type=%s", jobType)
}
}


@@ -0,0 +1,465 @@
package plugin
import (
"fmt"
"sort"
"strings"
"sync"
"time"
"github.com/seaweedfs/seaweedfs/weed/pb/plugin_pb"
)
const defaultWorkerStaleTimeout = 2 * time.Minute
// WorkerSession contains tracked worker metadata and plugin status.
type WorkerSession struct {
WorkerID string
WorkerInstance string
Address string
WorkerVersion string
ProtocolVersion string
ConnectedAt time.Time
LastSeenAt time.Time
Capabilities map[string]*plugin_pb.JobTypeCapability
Heartbeat *plugin_pb.WorkerHeartbeat
}
// Registry tracks connected plugin workers and capability-based selection.
type Registry struct {
mu sync.RWMutex
sessions map[string]*WorkerSession
staleAfter time.Duration
detectorCursor map[string]int
executorCursor map[string]int
}
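// NewRegistry returns an empty worker registry using the default stale timeout.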
func NewRegistry() *Registry {
return &Registry{
sessions: make(map[string]*WorkerSession),
staleAfter: defaultWorkerStaleTimeout,
detectorCursor: make(map[string]int),
executorCursor: make(map[string]int),
}
}
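// UpsertFromHello creates or refreshes the session for hello.WorkerId,
// replacing its capability set, and returns a defensive copy.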
func (r *Registry) UpsertFromHello(hello *plugin_pb.WorkerHello) *WorkerSession {
now := time.Now()
caps := make(map[string]*plugin_pb.JobTypeCapability, len(hello.Capabilities))
for _, c := range hello.Capabilities {
if c == nil || c.JobType == "" {
continue
}
caps[c.JobType] = cloneJobTypeCapability(c)
}
r.mu.Lock()
defer r.mu.Unlock()
session, ok := r.sessions[hello.WorkerId]
if !ok {
session = &WorkerSession{
WorkerID: hello.WorkerId,
ConnectedAt: now,
}
r.sessions[hello.WorkerId] = session
}
session.WorkerInstance = hello.WorkerInstanceId
session.Address = hello.Address
session.WorkerVersion = hello.WorkerVersion
session.ProtocolVersion = hello.ProtocolVersion
session.LastSeenAt = now
session.Capabilities = caps
return cloneWorkerSession(session)
}
func (r *Registry) Remove(workerID string) {
r.mu.Lock()
defer r.mu.Unlock()
delete(r.sessions, workerID)
}
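// UpdateHeartbeat stores a copy of the heartbeat on the session and refreshes
// its last-seen time; heartbeats from unknown workers are ignored.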
func (r *Registry) UpdateHeartbeat(workerID string, heartbeat *plugin_pb.WorkerHeartbeat) {
r.mu.Lock()
defer r.mu.Unlock()
session, ok := r.sessions[workerID]
if !ok {
return
}
session.Heartbeat = cloneWorkerHeartbeat(heartbeat)
session.LastSeenAt = time.Now()
}
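// Get returns a copy of the worker's session, or false when the worker is
// unknown or stale.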
func (r *Registry) Get(workerID string) (*WorkerSession, bool) {
r.mu.RLock()
defer r.mu.RUnlock()
session, ok := r.sessions[workerID]
if !ok || r.isSessionStaleLocked(session, time.Now()) {
return nil, false
}
return cloneWorkerSession(session), true
}
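// List returns copies of all non-stale sessions, sorted by worker ID.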
func (r *Registry) List() []*WorkerSession {
r.mu.RLock()
defer r.mu.RUnlock()
out := make([]*WorkerSession, 0, len(r.sessions))
now := time.Now()
for _, s := range r.sessions {
if r.isSessionStaleLocked(s, now) {
continue
}
out = append(out, cloneWorkerSession(s))
}
sort.Slice(out, func(i, j int) bool {
return out[i].WorkerID < out[j].WorkerID
})
return out
}
// DetectableJobTypes returns sorted job types that currently have at least one detect-capable worker.
func (r *Registry) DetectableJobTypes() []string {
r.mu.RLock()
defer r.mu.RUnlock()
jobTypes := make(map[string]struct{})
now := time.Now()
for _, session := range r.sessions {
if r.isSessionStaleLocked(session, now) {
continue
}
for jobType, capability := range session.Capabilities {
if capability == nil || !capability.CanDetect {
continue
}
jobTypes[jobType] = struct{}{}
}
}
out := make([]string, 0, len(jobTypes))
for jobType := range jobTypes {
out = append(out, jobType)
}
sort.Strings(out)
return out
}
// JobTypes returns sorted job types known by connected workers regardless of capability kind.
func (r *Registry) JobTypes() []string {
r.mu.RLock()
defer r.mu.RUnlock()
jobTypes := make(map[string]struct{})
now := time.Now()
for _, session := range r.sessions {
if r.isSessionStaleLocked(session, now) {
continue
}
for jobType := range session.Capabilities {
if jobType == "" {
continue
}
jobTypes[jobType] = struct{}{}
}
}
out := make([]string, 0, len(jobTypes))
for jobType := range jobTypes {
out = append(out, jobType)
}
sort.Strings(out)
return out
}
// PickSchemaProvider picks one worker for schema requests.
// Preference order:
// 1) workers that can detect this job type
// 2) workers that can execute this job type
// tie-break: more free slots, then lexical worker ID.
func (r *Registry) PickSchemaProvider(jobType string) (*WorkerSession, error) {
r.mu.RLock()
defer r.mu.RUnlock()
var candidates []*WorkerSession
now := time.Now()
for _, s := range r.sessions {
if r.isSessionStaleLocked(s, now) {
continue
}
capability := s.Capabilities[jobType]
if capability == nil {
continue
}
if capability.CanDetect || capability.CanExecute {
candidates = append(candidates, s)
}
}
if len(candidates) == 0 {
return nil, fmt.Errorf("no worker available for schema job_type=%s", jobType)
}
sort.Slice(candidates, func(i, j int) bool {
a := candidates[i]
b := candidates[j]
ac := a.Capabilities[jobType]
bc := b.Capabilities[jobType]
// Prefer detect-capable providers first.
if ac.CanDetect != bc.CanDetect {
return ac.CanDetect
}
aSlots := availableDetectionSlots(a, ac) + availableExecutionSlots(a, ac)
bSlots := availableDetectionSlots(b, bc) + availableExecutionSlots(b, bc)
if aSlots != bSlots {
return aSlots > bSlots
}
return a.WorkerID < b.WorkerID
})
return cloneWorkerSession(candidates[0]), nil
}
// PickDetector picks one detector worker for a job type.
func (r *Registry) PickDetector(jobType string) (*WorkerSession, error) {
return r.pickByKind(jobType, true)
}
// PickExecutor picks one executor worker for a job type.
func (r *Registry) PickExecutor(jobType string) (*WorkerSession, error) {
return r.pickByKind(jobType, false)
}
// ListExecutors returns sorted executor candidates for one job type.
// Ordering is by most available execution slots, then lexical worker ID.
// The top tie group is rotated round-robin to prevent sticky assignment.
func (r *Registry) ListExecutors(jobType string) ([]*WorkerSession, error) {
r.mu.Lock()
defer r.mu.Unlock()
candidates := r.collectByKindLocked(jobType, false, time.Now())
if len(candidates) == 0 {
return nil, fmt.Errorf("no executor worker available for job_type=%s", jobType)
}
sortByKind(candidates, jobType, false)
r.rotateTopCandidatesLocked(candidates, jobType, false)
out := make([]*WorkerSession, 0, len(candidates))
for _, candidate := range candidates {
out = append(out, cloneWorkerSession(candidate))
}
return out, nil
}
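// pickByKind selects one detector or executor candidate after sorting by
// available slots and rotating the top tie group for round-robin fairness.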
func (r *Registry) pickByKind(jobType string, detect bool) (*WorkerSession, error) {
r.mu.Lock()
defer r.mu.Unlock()
candidates := r.collectByKindLocked(jobType, detect, time.Now())
if len(candidates) == 0 {
kind := "executor"
if detect {
kind = "detector"
}
return nil, fmt.Errorf("no %s worker available for job_type=%s", kind, jobType)
}
sortByKind(candidates, jobType, detect)
r.rotateTopCandidatesLocked(candidates, jobType, detect)
return cloneWorkerSession(candidates[0]), nil
}
func (r *Registry) collectByKindLocked(jobType string, detect bool, now time.Time) []*WorkerSession {
var candidates []*WorkerSession
for _, session := range r.sessions {
if r.isSessionStaleLocked(session, now) {
continue
}
capability := session.Capabilities[jobType]
if capability == nil {
continue
}
if detect && capability.CanDetect {
candidates = append(candidates, session)
}
if !detect && capability.CanExecute {
candidates = append(candidates, session)
}
}
return candidates
}
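// isSessionStaleLocked reports whether the session has gone unseen for longer
// than staleAfter; callers must hold the registry lock.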
func (r *Registry) isSessionStaleLocked(session *WorkerSession, now time.Time) bool {
if session == nil {
return true
}
if r.staleAfter <= 0 {
return false
}
lastSeen := session.LastSeenAt
if lastSeen.IsZero() {
lastSeen = session.ConnectedAt
}
if lastSeen.IsZero() {
return false
}
return now.Sub(lastSeen) > r.staleAfter
}
func sortByKind(candidates []*WorkerSession, jobType string, detect bool) {
sort.Slice(candidates, func(i, j int) bool {
a := candidates[i]
b := candidates[j]
ac := a.Capabilities[jobType]
bc := b.Capabilities[jobType]
aSlots := availableSlotsByKind(a, ac, detect)
bSlots := availableSlotsByKind(b, bc, detect)
if aSlots != bSlots {
return aSlots > bSlots
}
return a.WorkerID < b.WorkerID
})
}
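// rotateTopCandidatesLocked rotates the leading group of slot-tied candidates
// by a per-job-type cursor so repeated picks round-robin across equally
// loaded workers; callers must hold the registry lock.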
func (r *Registry) rotateTopCandidatesLocked(candidates []*WorkerSession, jobType string, detect bool) {
if len(candidates) < 2 {
return
}
capability := candidates[0].Capabilities[jobType]
topSlots := availableSlotsByKind(candidates[0], capability, detect)
tieEnd := 1
for tieEnd < len(candidates) {
nextCapability := candidates[tieEnd].Capabilities[jobType]
if availableSlotsByKind(candidates[tieEnd], nextCapability, detect) != topSlots {
break
}
tieEnd++
}
if tieEnd <= 1 {
return
}
cursorKey := strings.TrimSpace(jobType)
if cursorKey == "" {
cursorKey = "*"
}
var offset int
if detect {
offset = r.detectorCursor[cursorKey] % tieEnd
r.detectorCursor[cursorKey] = (offset + 1) % tieEnd
} else {
offset = r.executorCursor[cursorKey] % tieEnd
r.executorCursor[cursorKey] = (offset + 1) % tieEnd
}
if offset == 0 {
return
}
prefix := append([]*WorkerSession(nil), candidates[:tieEnd]...)
for i := 0; i < tieEnd; i++ {
candidates[i] = prefix[(i+offset)%tieEnd]
}
}
func availableSlotsByKind(
session *WorkerSession,
capability *plugin_pb.JobTypeCapability,
detect bool,
) int {
if detect {
return availableDetectionSlots(session, capability)
}
return availableExecutionSlots(session, capability)
}
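// availableDetectionSlots prefers heartbeat-reported free detection slots,
// falling back to the capability maximum, then to one.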
func availableDetectionSlots(session *WorkerSession, capability *plugin_pb.JobTypeCapability) int {
if session.Heartbeat != nil && session.Heartbeat.DetectionSlotsTotal > 0 {
free := int(session.Heartbeat.DetectionSlotsTotal - session.Heartbeat.DetectionSlotsUsed)
if free < 0 {
return 0
}
return free
}
if capability.MaxDetectionConcurrency > 0 {
return int(capability.MaxDetectionConcurrency)
}
return 1
}
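// availableExecutionSlots prefers heartbeat-reported free execution slots,
// falling back to the capability maximum, then to one.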
func availableExecutionSlots(session *WorkerSession, capability *plugin_pb.JobTypeCapability) int {
if session.Heartbeat != nil && session.Heartbeat.ExecutionSlotsTotal > 0 {
free := int(session.Heartbeat.ExecutionSlotsTotal - session.Heartbeat.ExecutionSlotsUsed)
if free < 0 {
return 0
}
return free
}
if capability.MaxExecutionConcurrency > 0 {
return int(capability.MaxExecutionConcurrency)
}
return 1
}
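// cloneWorkerSession deep-copies a session so callers cannot mutate registry
// state through returned values.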
func cloneWorkerSession(in *WorkerSession) *WorkerSession {
if in == nil {
return nil
}
out := *in
out.Capabilities = make(map[string]*plugin_pb.JobTypeCapability, len(in.Capabilities))
for jobType, capability := range in.Capabilities {
out.Capabilities[jobType] = cloneJobTypeCapability(capability)
}
out.Heartbeat = cloneWorkerHeartbeat(in.Heartbeat)
return &out
}
func cloneJobTypeCapability(in *plugin_pb.JobTypeCapability) *plugin_pb.JobTypeCapability {
if in == nil {
return nil
}
out := *in
return &out
}
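// cloneWorkerHeartbeat deep-copies a heartbeat, including running work
// entries, queued-job counts, and metadata.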
func cloneWorkerHeartbeat(in *plugin_pb.WorkerHeartbeat) *plugin_pb.WorkerHeartbeat {
if in == nil {
return nil
}
out := *in
if in.RunningWork != nil {
out.RunningWork = make([]*plugin_pb.RunningWork, 0, len(in.RunningWork))
for _, rw := range in.RunningWork {
if rw == nil {
continue
}
clone := *rw
out.RunningWork = append(out.RunningWork, &clone)
}
}
if in.QueuedJobsByType != nil {
out.QueuedJobsByType = make(map[string]int32, len(in.QueuedJobsByType))
for k, v := range in.QueuedJobsByType {
out.QueuedJobsByType[k] = v
}
}
if in.Metadata != nil {
out.Metadata = make(map[string]string, len(in.Metadata))
for k, v := range in.Metadata {
out.Metadata[k] = v
}
}
return &out
}


@@ -0,0 +1,321 @@
package plugin
import (
"reflect"
"testing"
"time"
"github.com/seaweedfs/seaweedfs/weed/pb/plugin_pb"
)
func TestRegistryPickDetectorPrefersMoreFreeSlots(t *testing.T) {
t.Parallel()
r := NewRegistry()
r.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-a",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "vacuum", CanDetect: true, CanExecute: true, MaxDetectionConcurrency: 2, MaxExecutionConcurrency: 2},
},
})
r.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-b",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "vacuum", CanDetect: true, CanExecute: true, MaxDetectionConcurrency: 4, MaxExecutionConcurrency: 4},
},
})
r.UpdateHeartbeat("worker-a", &plugin_pb.WorkerHeartbeat{
WorkerId: "worker-a",
DetectionSlotsUsed: 1,
DetectionSlotsTotal: 2,
})
r.UpdateHeartbeat("worker-b", &plugin_pb.WorkerHeartbeat{
WorkerId: "worker-b",
DetectionSlotsUsed: 1,
DetectionSlotsTotal: 4,
})
picked, err := r.PickDetector("vacuum")
if err != nil {
t.Fatalf("PickDetector: %v", err)
}
if picked.WorkerID != "worker-b" {
t.Fatalf("unexpected detector picked: got %s want worker-b", picked.WorkerID)
}
}
func TestRegistryPickExecutorAllowsSameWorker(t *testing.T) {
t.Parallel()
r := NewRegistry()
r.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-x",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "balance", CanDetect: true, CanExecute: true, MaxDetectionConcurrency: 1, MaxExecutionConcurrency: 1},
},
})
detector, err := r.PickDetector("balance")
if err != nil {
t.Fatalf("PickDetector: %v", err)
}
executor, err := r.PickExecutor("balance")
if err != nil {
t.Fatalf("PickExecutor: %v", err)
}
if detector.WorkerID != "worker-x" || executor.WorkerID != "worker-x" {
t.Fatalf("expected same worker for detect/execute, got detector=%s executor=%s", detector.WorkerID, executor.WorkerID)
}
}
func TestRegistryDetectableJobTypes(t *testing.T) {
t.Parallel()
r := NewRegistry()
r.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-a",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "vacuum", CanDetect: true, CanExecute: true},
{JobType: "balance", CanDetect: false, CanExecute: true},
},
})
r.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-b",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "ec", CanDetect: true, CanExecute: false},
{JobType: "vacuum", CanDetect: true, CanExecute: false},
},
})
got := r.DetectableJobTypes()
want := []string{"ec", "vacuum"}
if !reflect.DeepEqual(got, want) {
t.Fatalf("unexpected detectable job types: got=%v want=%v", got, want)
}
}
func TestRegistryJobTypes(t *testing.T) {
t.Parallel()
r := NewRegistry()
r.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-a",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "vacuum", CanDetect: true},
{JobType: "balance", CanExecute: true},
},
})
r.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-b",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "ec", CanDetect: true},
},
})
got := r.JobTypes()
want := []string{"balance", "ec", "vacuum"}
if !reflect.DeepEqual(got, want) {
t.Fatalf("unexpected job types: got=%v want=%v", got, want)
}
}
func TestRegistryListExecutorsSortedBySlots(t *testing.T) {
t.Parallel()
r := NewRegistry()
r.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-a",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "balance", CanExecute: true, MaxExecutionConcurrency: 2},
},
})
r.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-b",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "balance", CanExecute: true, MaxExecutionConcurrency: 4},
},
})
r.UpdateHeartbeat("worker-a", &plugin_pb.WorkerHeartbeat{
WorkerId: "worker-a",
ExecutionSlotsUsed: 1,
ExecutionSlotsTotal: 2,
})
r.UpdateHeartbeat("worker-b", &plugin_pb.WorkerHeartbeat{
WorkerId: "worker-b",
ExecutionSlotsUsed: 1,
ExecutionSlotsTotal: 4,
})
executors, err := r.ListExecutors("balance")
if err != nil {
t.Fatalf("ListExecutors: %v", err)
}
if len(executors) != 2 {
t.Fatalf("unexpected candidate count: got=%d", len(executors))
}
if executors[0].WorkerID != "worker-b" || executors[1].WorkerID != "worker-a" {
t.Fatalf("unexpected executor order: got=%s,%s", executors[0].WorkerID, executors[1].WorkerID)
}
}
func TestRegistryPickExecutorRoundRobinForTopTie(t *testing.T) {
t.Parallel()
r := NewRegistry()
for _, workerID := range []string{"worker-a", "worker-b", "worker-c"} {
r.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: workerID,
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "balance", CanExecute: true, MaxExecutionConcurrency: 1},
},
})
}
got := make([]string, 0, 6)
for i := 0; i < 6; i++ {
executor, err := r.PickExecutor("balance")
if err != nil {
t.Fatalf("PickExecutor: %v", err)
}
got = append(got, executor.WorkerID)
}
want := []string{"worker-a", "worker-b", "worker-c", "worker-a", "worker-b", "worker-c"}
if !reflect.DeepEqual(got, want) {
t.Fatalf("unexpected pick order: got=%v want=%v", got, want)
}
}
func TestRegistryListExecutorsRoundRobinForTopTie(t *testing.T) {
t.Parallel()
r := NewRegistry()
r.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-a",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "balance", CanExecute: true, MaxExecutionConcurrency: 2},
},
})
r.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-b",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "balance", CanExecute: true, MaxExecutionConcurrency: 2},
},
})
r.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-c",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "balance", CanExecute: true, MaxExecutionConcurrency: 1},
},
})
r.UpdateHeartbeat("worker-a", &plugin_pb.WorkerHeartbeat{
WorkerId: "worker-a",
ExecutionSlotsUsed: 0,
ExecutionSlotsTotal: 2,
})
r.UpdateHeartbeat("worker-b", &plugin_pb.WorkerHeartbeat{
WorkerId: "worker-b",
ExecutionSlotsUsed: 0,
ExecutionSlotsTotal: 2,
})
r.UpdateHeartbeat("worker-c", &plugin_pb.WorkerHeartbeat{
WorkerId: "worker-c",
ExecutionSlotsUsed: 0,
ExecutionSlotsTotal: 1,
})
firstCall, err := r.ListExecutors("balance")
if err != nil {
t.Fatalf("ListExecutors first call: %v", err)
}
secondCall, err := r.ListExecutors("balance")
if err != nil {
t.Fatalf("ListExecutors second call: %v", err)
}
thirdCall, err := r.ListExecutors("balance")
if err != nil {
t.Fatalf("ListExecutors third call: %v", err)
}
if firstCall[0].WorkerID != "worker-a" || firstCall[1].WorkerID != "worker-b" || firstCall[2].WorkerID != "worker-c" {
t.Fatalf("unexpected first executor order: got=%s,%s,%s", firstCall[0].WorkerID, firstCall[1].WorkerID, firstCall[2].WorkerID)
}
if secondCall[0].WorkerID != "worker-b" || secondCall[1].WorkerID != "worker-a" || secondCall[2].WorkerID != "worker-c" {
t.Fatalf("unexpected second executor order: got=%s,%s,%s", secondCall[0].WorkerID, secondCall[1].WorkerID, secondCall[2].WorkerID)
}
if thirdCall[0].WorkerID != "worker-a" || thirdCall[1].WorkerID != "worker-b" || thirdCall[2].WorkerID != "worker-c" {
t.Fatalf("unexpected third executor order: got=%s,%s,%s", thirdCall[0].WorkerID, thirdCall[1].WorkerID, thirdCall[2].WorkerID)
}
}
func TestRegistrySkipsStaleWorkersForSelectionAndListing(t *testing.T) {
t.Parallel()
r := NewRegistry()
r.staleAfter = 2 * time.Second
r.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-stale",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "vacuum", CanDetect: true, CanExecute: true},
},
})
r.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-fresh",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "vacuum", CanDetect: true, CanExecute: true},
},
})
r.mu.Lock()
r.sessions["worker-stale"].LastSeenAt = time.Now().Add(-10 * time.Second)
r.sessions["worker-fresh"].LastSeenAt = time.Now()
r.mu.Unlock()
picked, err := r.PickDetector("vacuum")
if err != nil {
t.Fatalf("PickDetector: %v", err)
}
if picked.WorkerID != "worker-fresh" {
t.Fatalf("unexpected detector: got=%s want=worker-fresh", picked.WorkerID)
}
if _, ok := r.Get("worker-stale"); ok {
t.Fatalf("expected stale worker to be hidden from Get")
}
if _, ok := r.Get("worker-fresh"); !ok {
t.Fatalf("expected fresh worker from Get")
}
listed := r.List()
if len(listed) != 1 || listed[0].WorkerID != "worker-fresh" {
t.Fatalf("unexpected listed workers: %+v", listed)
}
}
func TestRegistryReturnsNoDetectorWhenAllWorkersStale(t *testing.T) {
t.Parallel()
r := NewRegistry()
r.staleAfter = 2 * time.Second
r.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-a",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "vacuum", CanDetect: true},
},
})
r.mu.Lock()
r.sessions["worker-a"].LastSeenAt = time.Now().Add(-10 * time.Second)
r.mu.Unlock()
if _, err := r.PickDetector("vacuum"); err == nil {
t.Fatalf("expected no detector when all workers are stale")
}
}

weed/admin/plugin/types.go

@@ -0,0 +1,103 @@
package plugin
import "time"
const (
// Keep exactly the last 10 successful and last 10 error runs per job type.
MaxSuccessfulRunHistory = 10
MaxErrorRunHistory = 10
)
type RunOutcome string
const (
RunOutcomeSuccess RunOutcome = "success"
RunOutcomeError RunOutcome = "error"
)
type JobRunRecord struct {
RunID string `json:"run_id"`
JobID string `json:"job_id"`
JobType string `json:"job_type"`
WorkerID string `json:"worker_id"`
Outcome RunOutcome `json:"outcome"`
Message string `json:"message,omitempty"`
DurationMs int64 `json:"duration_ms,omitempty"`
CompletedAt *time.Time `json:"completed_at,omitempty"`
}
type JobTypeRunHistory struct {
JobType string `json:"job_type"`
SuccessfulRuns []JobRunRecord `json:"successful_runs"`
ErrorRuns []JobRunRecord `json:"error_runs"`
LastUpdatedTime *time.Time `json:"last_updated_time,omitempty"`
}
type TrackedJob struct {
JobID string `json:"job_id"`
JobType string `json:"job_type"`
RequestID string `json:"request_id"`
WorkerID string `json:"worker_id"`
DedupeKey string `json:"dedupe_key,omitempty"`
Summary string `json:"summary,omitempty"`
Detail string `json:"detail,omitempty"`
Parameters map[string]interface{} `json:"parameters,omitempty"`
Labels map[string]string `json:"labels,omitempty"`
State string `json:"state"`
Progress float64 `json:"progress"`
Stage string `json:"stage,omitempty"`
Message string `json:"message,omitempty"`
Attempt int32 `json:"attempt,omitempty"`
CreatedAt *time.Time `json:"created_at,omitempty"`
UpdatedAt *time.Time `json:"updated_at,omitempty"`
CompletedAt *time.Time `json:"completed_at,omitempty"`
ErrorMessage string `json:"error_message,omitempty"`
ResultSummary string `json:"result_summary,omitempty"`
ResultOutputValues map[string]interface{} `json:"result_output_values,omitempty"`
}
type JobActivity struct {
JobID string `json:"job_id"`
JobType string `json:"job_type"`
RequestID string `json:"request_id,omitempty"`
WorkerID string `json:"worker_id,omitempty"`
Source string `json:"source"`
Message string `json:"message"`
Stage string `json:"stage,omitempty"`
Details map[string]interface{} `json:"details,omitempty"`
OccurredAt *time.Time `json:"occurred_at,omitempty"`
}
type JobDetail struct {
Job *TrackedJob `json:"job"`
RunRecord *JobRunRecord `json:"run_record,omitempty"`
Activities []JobActivity `json:"activities"`
RelatedJobs []TrackedJob `json:"related_jobs,omitempty"`
LastUpdated *time.Time `json:"last_updated,omitempty"`
}
type SchedulerJobTypeState struct {
JobType string `json:"job_type"`
Enabled bool `json:"enabled"`
PolicyError string `json:"policy_error,omitempty"`
DetectionInFlight bool `json:"detection_in_flight"`
NextDetectionAt *time.Time `json:"next_detection_at,omitempty"`
DetectionIntervalSeconds int32 `json:"detection_interval_seconds,omitempty"`
DetectionTimeoutSeconds int32 `json:"detection_timeout_seconds,omitempty"`
ExecutionTimeoutSeconds int32 `json:"execution_timeout_seconds,omitempty"`
MaxJobsPerDetection int32 `json:"max_jobs_per_detection,omitempty"`
GlobalExecutionConcurrency int `json:"global_execution_concurrency,omitempty"`
PerWorkerExecutionConcurrency int `json:"per_worker_execution_concurrency,omitempty"`
RetryLimit int `json:"retry_limit,omitempty"`
RetryBackoffSeconds int32 `json:"retry_backoff_seconds,omitempty"`
DetectorAvailable bool `json:"detector_available"`
DetectorWorkerID string `json:"detector_worker_id,omitempty"`
ExecutorWorkerCount int `json:"executor_worker_count"`
}
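// timeToPtr returns a pointer to t, or nil when t is the zero time.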
func timeToPtr(t time.Time) *time.Time {
if t.IsZero() {
return nil
}
return &t
}