Refactor plugin system and migrate worker runtime (#8369)

* admin: add plugin runtime UI page and route wiring

* pb: add plugin gRPC contract and generated bindings

* admin/plugin: implement worker registry, runtime, monitoring, and config store

* admin/dash: wire plugin runtime and expose plugin workflow APIs

* command: add flags to enable plugin runtime

* admin: rename remaining plugin v2 wording to plugin

* admin/plugin: add detectable job type registry helper

* admin/plugin: add scheduled detection and dispatch orchestration

* admin/plugin: prefetch job type descriptors when workers connect

* admin/plugin: add known job type discovery API and UI

* admin/plugin: refresh design doc to match current implementation

* admin/plugin: enforce per-worker scheduler concurrency limits

* admin/plugin: use descriptor runtime defaults for scheduler policy

* admin/ui: auto-load first known plugin job type on page open

* admin/plugin: bootstrap persisted config from descriptor defaults

* admin/plugin: dedupe scheduled proposals by dedupe key

* admin/ui: add job type and state filters for plugin monitoring

* admin/ui: add per-job-type plugin activity summary

* admin/plugin: split descriptor read API from schema refresh

* admin/ui: keep plugin summary metrics global while tables are filtered

* admin/plugin: retry executor reservation before timing out

* admin/plugin: expose scheduler states for monitoring

* admin/ui: show per-job-type scheduler states in plugin monitor

* pb/plugin: rename protobuf package to plugin

* admin/plugin: rename pluginRuntime wiring to plugin

* admin/plugin: remove runtime naming from plugin APIs and UI

* admin/plugin: rename runtime files to plugin naming

* admin/plugin: persist jobs and activities for monitor recovery

* admin/plugin: lease one detector worker per job type

* admin/ui: show worker load from plugin heartbeats

* admin/plugin: skip stale workers for detector and executor picks

* plugin/worker: add plugin worker command and stream runtime scaffold

* plugin/worker: implement vacuum detect and execute handlers

* admin/plugin: document external vacuum plugin worker starter

* command: update plugin.worker help to reflect implemented flow

* command/admin: drop legacy Plugin V2 label

* plugin/worker: validate vacuum job type and respect min interval

* plugin/worker: test no-op detect when min interval not elapsed

* command/admin: document plugin.worker external process

* plugin/worker: advertise configured concurrency in hello

* command/plugin.worker: add jobType handler selection

* command/plugin.worker: test handler selection by job type

* command/plugin.worker: persist worker id in workingDir

* admin/plugin: document plugin.worker jobType and workingDir flags

* plugin/worker: support cancel request for in-flight work

* plugin/worker: test cancel request acknowledgements

* command/plugin.worker: document workingDir and jobType behavior

* plugin/worker: emit executor activity events for monitor

* plugin/worker: test executor activity builder

* admin/plugin: send last successful run in detection request

* admin/plugin: send cancel request when detect or execute context ends

* admin/plugin: document worker cancel request responsibility

* admin/handlers: expose plugin scheduler states API in no-auth mode

* admin/handlers: test plugin scheduler states route registration

* admin/plugin: keep worker id on worker-generated activity records

* admin/plugin: test worker id propagation in monitor activities

* admin/dash: always initialize plugin service

* command/admin: remove plugin enable flags and default to enabled

* admin/dash: drop pluginEnabled constructor parameter

* admin/plugin UI: stop checking plugin enabled state

* admin/plugin: remove docs for plugin enable flags

* admin/dash: remove unused plugin enabled check method

* admin/dash: fallback to in-memory plugin init when dataDir fails

* admin/plugin API: expose worker gRPC port in status

* command/plugin.worker: resolve admin gRPC port via plugin status

* split plugin UI into overview/configuration/monitoring pages

* Update layout_templ.go

* add volume_balance plugin worker handler

* wire plugin.worker CLI for volume_balance job type

* add erasure_coding plugin worker handler

* wire plugin.worker CLI for erasure_coding job type

* support multi-job handlers in plugin worker runtime

* allow plugin.worker jobType as comma-separated list

* admin/plugin UI: rename to Workers and simplify config view

* plugin worker: queue detection requests instead of capacity reject

* Update plugin_worker.go

* plugin volume_balance: remove force_move/timeout from worker config UI

* plugin erasure_coding: enforce local working dir and cleanup

* admin/plugin UI: rename admin settings to job scheduling

* admin/plugin UI: persist and robustly render detection results

* admin/plugin: record and return detection trace metadata

* admin/plugin UI: show detection process and decision trace

* plugin: surface detector decision trace as activities

* mini: start a plugin worker by default

* admin/plugin UI: split monitoring into detection and execution tabs

* plugin worker: emit detection decision trace for EC and balance

* admin workers UI: split monitoring into detection and execution pages

* plugin scheduler: skip proposals for active assigned/running jobs

* admin workers UI: add job queue tab

* plugin worker: add dummy stress detector and executor job type

* admin workers UI: reorder tabs to detection queue execution

* admin workers UI: regenerate plugin template

* plugin defaults: include dummy stress and add stress tests

* plugin dummy stress: rotate detection selections across runs

* plugin scheduler: remove cross-run proposal dedupe

* plugin queue: track pending scheduled jobs

* plugin scheduler: wait for executor capacity before dispatch

* plugin scheduler: skip detection when waiting backlog is high

* plugin: add disk-backed job detail API and persistence

* admin ui: show plugin job detail modal from job id links

* plugin: generate unique job ids instead of reusing proposal ids

* plugin worker: emit heartbeats on work state changes

* plugin registry: round-robin tied executor and detector picks

* add temporary EC overnight stress runner

* plugin job details: persist and render EC execution plans

* ec volume details: color data and parity shard badges

* shard labels: keep parity ids numeric and color-only distinction

* admin: remove legacy maintenance UI routes and templates

* admin: remove dead maintenance endpoint helpers

* Update layout_templ.go

* remove dummy_stress worker and command support

* refactor plugin UI to job-type top tabs and sub-tabs

* migrate weed worker command to plugin runtime

* remove plugin.worker command and keep worker runtime with metrics

* update helm worker args for jobType and execution flags

* set plugin scheduling defaults to global 16 and per-worker 4

* stress: fix RPC context reuse and remove redundant variables in ec_stress_runner

* admin/plugin: fix lifecycle races, safe channel operations, and terminal state constants

* admin/dash: randomize job IDs and fix priority zero-value overwrite in plugin API

* admin/handlers: implement buffered rendering to prevent response corruption

* admin/plugin: implement debounced persistence flusher and optimize BuildJobDetail memory lookups

* admin/plugin: fix priority overwrite and implement bounded wait in scheduler reserve

* admin/plugin: implement atomic file writes and fix run record side effects

* admin/plugin: use P prefix for parity shard labels in execution plans

* admin/plugin: enable parallel execution for cancellation tests

* admin: refactor time.Time fields to pointers for better JSON omitempty support

* admin/plugin: implement pointer-safe time assignments and comparisons in plugin core

* admin/plugin: fix time assignment and sorting logic in plugin monitor after pointer refactor

* admin/plugin: update scheduler activity tracking to use time pointers

* admin/plugin: fix time-based run history trimming after pointer refactor

* admin/dash: fix JobSpec struct literal in plugin API after pointer refactor

* admin/view: add D/P prefixes to EC shard badges for UI consistency

* admin/plugin: use lifecycle-aware context for schema prefetching

* Update ec_volume_details_templ.go

* admin/stress: fix proposal sorting and log volume cleanup errors

* stress: refine ec stress runner with math/rand and collection name

- Added Collection field to VolumeEcShardsDeleteRequest for correct filename construction.
- Replaced crypto/rand with seeded math/rand PRNG for bulk payloads.
- Added documentation for EcMinAge zero-value behavior.
- Added logging for ignored errors in volume/shard deletion.

* admin: return internal server error for plugin store failures

Changed error status code from 400 Bad Request to 500 Internal Server Error for failures in GetPluginJobDetail to correctly reflect server-side errors.

* admin: implement safe channel sends and graceful shutdown sync

- Added sync.WaitGroup to Plugin struct to manage background goroutines.
- Implemented safeSendCh helper using recover() to prevent panics on closed channels (sketched below).
- Ensured Shutdown() waits for all background operations to complete.
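
A minimal sketch of the safeSendCh idea, assuming a plain blocking send guarded by recover(); the actual helper may differ in signature and buffering behavior:

```go
// Hypothetical shape of safeSendCh: recover() converts the panic from
// sending on an already-closed channel into a false return. Blocking on a
// full channel is an assumption of this sketch.
func safeSendCh[T any](ch chan<- T, value T) (sent bool) {
	defer func() {
		if recover() != nil {
			sent = false
		}
	}()
	ch <- value
	return true
}
```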

* admin: robustify plugin monitor with nil-safe time and record init

- Standardized nil-safe assignment for *time.Time pointers (CreatedAt, UpdatedAt, CompletedAt).
- Ensured persistJobDetailSnapshot initializes new records correctly if they don't exist on disk.
- Fixed debounced persistence to trigger immediate write on job completion.

* admin: improve scheduler shutdown behavior and logic guards

- Replaced brittle error string matching with explicit r.shutdownCh selection for shutdown detection.
- Removed redundant nil guard in buildScheduledJobSpec.
- Standardized WaitGroup usage for schedulerLoop.

* admin: implement deep copy for job parameters and atomic write fixes

- Implemented deepCopyGenericValue and used it in cloneTrackedJob to prevent shared state.
- Ensured atomicWriteFile creates parent directories before writing.

* admin: remove unreachable branch in shard classification

Removed an unreachable 'totalShards <= 0' check in classifyShardID as dataShards and parityShards are already guarded.

* admin: secure UI links and use canonical shard constants

- Added rel="noopener noreferrer" to external links for security.
- Replaced magic number 14 with erasure_coding.TotalShardsCount.
- Used renderEcShardBadge for missing shard list consistency.

* admin: stabilize plugin tests and fix regressions

- Reworked plugin_monitor_test.go to robustly handle asynchronous persistence.
- Updated all time.Time literals to use the timeToPtr helper (presumed shape below).
- Added explicit Shutdown() calls in tests to synchronize with debounced writes.
- Fixed syntax errors and orphaned struct literals in tests.
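
The timeToPtr helper referenced above is defined in the plugin package outside this excerpt; it is presumably the usual one-liner:

```go
// Presumed shape of timeToPtr.
func timeToPtr(t time.Time) *time.Time { return &t }
```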

* Potential fix for code scanning alert no. 278: Slice memory allocation with excessive size value

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

* Potential fix for code scanning alert no. 283: Uncontrolled data used in path expression

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

* admin: finalize refinements for error handling, scheduler, and race fixes

- Standardized HTTP 500 status codes for store failures in plugin_api.go.
- Tracked scheduled detection goroutines with sync.WaitGroup for safe shutdown.
- Fixed race condition in safeSendDetectionComplete by extracting the channel under the lock (illustrated below).
- Implemented deep copy for JobActivity details.
- Used defaultDirPerm constant in atomicWriteFile.
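
A hedged illustration of the extract-the-channel-under-the-lock fix named above; the type and field names here are illustrative, not the actual struct layout:

```go
// Illustrative only: copy the channel reference while holding the mutex,
// then send outside the critical section so a concurrent close or replace
// cannot race with the send.
type pluginCore struct {
	mu              sync.Mutex
	detectionDoneCh chan *plugin_pb.DetectionComplete
}

func (p *pluginCore) safeSendDetectionComplete(msg *plugin_pb.DetectionComplete) {
	p.mu.Lock()
	ch := p.detectionDoneCh
	p.mu.Unlock()
	if ch != nil {
		safeSendCh(ch, msg)
	}
}
```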

* test(ec): migrate admin dockertest to plugin APIs

* admin/plugin_api: fix RunPluginJobTypeAPI to return 500 for server-side detection/filter errors

* admin/plugin_api: fix ExecutePluginJobAPI to return 500 for job execution failures

* admin/plugin_api: limit parseProtoJSONBody request body to 1MB to prevent unbounded memory usage
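
One standard way to enforce such a cap is http.MaxBytesReader; this is a sketch, not necessarily how parseProtoJSONBody implements it:

```go
// Sketch: cap the request body at 1 MiB before reading. io.ReadAll then
// fails once the limit is exceeded instead of buffering unbounded input.
func readLimitedBody(w http.ResponseWriter, r *http.Request) ([]byte, error) {
	r.Body = http.MaxBytesReader(w, r.Body, 1<<20)
	return io.ReadAll(r.Body)
}
```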

* admin/plugin: consolidate regex to package-level validJobTypePattern; add char validation to sanitizeJobID

* admin/plugin: fix racy Shutdown channel close with sync.Once
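
The sync.Once pattern named here typically looks like the following; the type and field names are assumed, not the Plugin struct's actual layout:

```go
// Sketch: closing shutdownCh through sync.Once makes Shutdown idempotent
// and race-free when called from multiple goroutines.
type pluginShutdown struct {
	shutdownCh   chan struct{}
	shutdownOnce sync.Once
}

func (p *pluginShutdown) Shutdown() {
	p.shutdownOnce.Do(func() { close(p.shutdownCh) })
}
```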

* admin/plugin: track sendLoop and recv goroutines in WorkerStream with r.wg

* admin/plugin: document writeProtoFiles atomicity — .pb is source of truth, .json is human-readable only

* admin/plugin: extract activityLess helper to deduplicate nil-safe OccurredAt sort comparators

* test/ec: check http.NewRequest errors to prevent nil req panics

* test/ec: replace deprecated ioutil/math/rand, fix stale step comment 5.1→3.1

* plugin(ec): raise default detection and scheduling throughput limits

* topology: include empty disks in volume list and EC capacity fallback

* topology: remove hard 10-task cap for detection planning

* Update ec_volume_details_templ.go

* adjust default

* fix tests

---------

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Chris Lu, committed by GitHub
2026-02-18 13:42:41 -08:00
commit 8ec9ff4a12
parent 5463038760
82 changed files with 23419 additions and 11389 deletions

weed/admin/plugin/DESIGN.md

@@ -0,0 +1,205 @@
# Admin Worker Plugin System (Design)
This document describes the plugin system for admin-managed workers, implemented in parallel with the current maintenance/worker mechanism.
## Scope
- Add a new plugin protocol and runtime model for multi-language workers.
- Keep all current admin + worker code paths untouched.
- Use gRPC for all admin-worker communication.
- Let workers describe job configuration UI declaratively via protobuf.
- Persist all job type configuration under admin server data directory.
- Support detector workers and executor workers per job type.
- Add end-to-end workflow observability (activities, active jobs, progress).
## New Contract
- Proto file: `weed/pb/plugin.proto`
- gRPC service: `PluginControlService.WorkerStream`
- Connection model: worker-initiated long-lived bidirectional stream (a connection sketch follows the rationale below).
Why this model:
- Works for workers in any language with gRPC support.
- Avoids admin dialing constraints in NAT/private networks.
- Allows command/response, progress streaming, and heartbeat over one channel.
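
A minimal Go sketch of this connection model, assuming standard protoc-generated bindings in `plugin_pb`. `WorkerToAdminMessage` and its `Hello` oneof wrapper are assumed names in this sketch; consult `weed/pb/plugin.proto` for the actual envelope.

```go
package main

import (
	"context"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	"github.com/seaweedfs/seaweedfs/weed/pb/plugin_pb"
)

func main() {
	conn, err := grpc.Dial("localhost:23646",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Worker-initiated, long-lived bidirectional stream.
	stream, err := plugin_pb.NewPluginControlServiceClient(conn).WorkerStream(context.Background())
	if err != nil {
		log.Fatal(err)
	}

	// First message: identity plus per-job-type capabilities.
	hello := &plugin_pb.WorkerHello{
		WorkerId: "worker-1",
		Capabilities: []*plugin_pb.JobTypeCapability{
			{JobType: "vacuum", CanDetect: true, CanExecute: true},
		},
	}
	// The envelope type and oneof wrapper below are assumptions of this sketch.
	if err := stream.Send(&plugin_pb.WorkerToAdminMessage{
		Body: &plugin_pb.WorkerToAdminMessage_Hello{Hello: hello},
	}); err != nil {
		log.Fatal(err)
	}

	for {
		msg, err := stream.Recv()
		if err != nil {
			log.Fatal(err)
		}
		_ = msg // dispatch RunDetectionRequest / ExecuteJobRequest / CancelRequest
	}
}
```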
## Core Runtime Components (Admin Side)
1. `PluginRegistry`
- Tracks connected workers and their per-job-type capabilities.
- Maintains liveness via heartbeat timeout.
2. `SchemaCoordinator`
- For each job type, asks one capable worker for `JobTypeDescriptor`.
- Caches descriptor version and refresh timestamp.
3. `ConfigStore`
- Persists descriptor + saved config values in `dataDir`.
- Stores both:
- Admin-owned runtime config (detection interval, dispatch concurrency, retry).
- Worker-owned config values (plugin-specific detection/execution knobs).
4. `DetectorScheduler`
- Per job type, chooses one detector worker (`can_detect=true`).
- Sends `RunDetectionRequest` with saved configs + cluster context.
- Accepts `DetectionProposals`, dedupes by `dedupe_key`, inserts jobs.
5. `JobDispatcher`
- Chooses executor worker (`can_execute=true`) for each pending job.
- Sends `ExecuteJobRequest`.
- Consumes `JobProgressUpdate` and `JobCompleted`.
6. `WorkflowMonitor`
- Builds live counters and timeline from events:
- activities per job type,
- active jobs,
- per-job progress/state,
- worker health/load.
## Worker Responsibilities
1. Register capabilities on connect (`WorkerHello`).
2. Expose job type descriptor (`ConfigSchemaResponse`) including UI schemas:
- admin config form,
- worker config form,
- defaults.
3. Run detection on demand (`RunDetectionRequest`) and return proposals.
4. Execute assigned jobs (`ExecuteJobRequest`) and stream progress.
5. Heartbeat regularly with slot usage and running work.
6. Handle cancellation requests (`CancelRequest`) for in-flight detection/execution.
## Declarative UI Model
UI is fully derived from protobuf schema:
- `ConfigForm`
- `ConfigSection`
- `ConfigField`
- `ConfigOption`
- `ValidationRule`
- `ConfigValue` (typed scalar/list/map/object value container)
Result:
- Admin can render forms without hardcoded task structs.
- New job types can ship UI schema from worker binary alone.
- Worker language is irrelevant as long as it can emit protobuf messages; a small construction example follows.
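
For example, a worker can assemble a descriptor with typed defaults from these messages alone. The names below appear in this commit's tests; the full section/field structure lives in `plugin.proto`.

```go
// buildVacuumDescriptor sketches what a worker returns in its
// ConfigSchemaResponse; only names confirmed by this commit's tests are used.
func buildVacuumDescriptor() *plugin_pb.JobTypeDescriptor {
	return &plugin_pb.JobTypeDescriptor{
		JobType:     "vacuum",
		DisplayName: "Vacuum",
		WorkerConfigForm: &plugin_pb.ConfigForm{
			// Typed defaults travel as ConfigValue oneofs, so the admin UI
			// can render and validate them without a hardcoded task struct.
			DefaultValues: map[string]*plugin_pb.ConfigValue{
				"threshold":  {Kind: &plugin_pb.ConfigValue_DoubleValue{DoubleValue: 0.3}},
				"scan_scope": {Kind: &plugin_pb.ConfigValue_StringValue{StringValue: "all"}},
			},
		},
	}
}
```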
## Detection and Dispatch Flow
1. Worker connects and registers capabilities.
2. Admin requests descriptor per job type.
3. Admin persists descriptor and editable config values.
4. On detection interval (admin-owned setting):
- Admin chooses one detector worker for that job type.
- Sends `RunDetectionRequest` with:
- `AdminRuntimeConfig`,
- `admin_config_values`,
- `worker_config_values`,
- `ClusterContext` (master/filer/volume grpc locations, metadata).
5. Detector emits `DetectionProposals` and `DetectionComplete`.
6. Admin dedupes and enqueues jobs.
7. Dispatcher assigns jobs to any eligible executor worker.
8. Executor emits `JobProgressUpdate` and `JobCompleted`.
9. Monitor updates workflow UI in near-real-time.
## Persistence Layout (Admin Data Dir)
Current layout under `<admin-data-dir>/plugin/`:
- `job_types/<job_type>/descriptor.pb`
- `job_types/<job_type>/descriptor.json`
- `job_types/<job_type>/config.pb`
- `job_types/<job_type>/config.json`
- `job_types/<job_type>/runs.json`
- `jobs/tracked_jobs.json`
- `jobs/job_details/<job_id>.json`
- `activities/activities.json`
`config.pb` should use `PersistedJobTypeConfig` from `plugin.proto`.
## Admin UI
- Route: `/plugin`
- Includes:
- runtime status,
- workers/capabilities,
- declarative descriptor-driven config forms,
- run history (last 10 successful + last 10 error runs),
- tracked jobs and activity stream,
- manual actions for schema refresh, detection, and detect+execute workflow.
## Scheduling Policy (Initial)
Detector selection per job type:
- only workers with `can_detect=true`.
- prefer healthy worker with highest free detection slots (sketched after this list).
- lease ends when heartbeat timeout or stream drop.
Execution dispatch:
- only workers with `can_execute=true`.
- select by available execution slots and least active jobs.
- retry on failure using admin runtime retry config.
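A hypothetical sketch of the detector-selection rule; the registry's real types are richer, and a later commit in this PR adds round-robin tie-breaking:

```go
// pickDetector returns the healthy candidate with the most free detection
// slots; ties keep the earlier candidate in this simplified version.
type detectorCandidate struct {
	WorkerID           string
	Healthy            bool
	FreeDetectionSlots int
}

func pickDetector(candidates []detectorCandidate) (detectorCandidate, bool) {
	var best detectorCandidate
	found := false
	for _, c := range candidates {
		if !c.Healthy {
			continue
		}
		if !found || c.FreeDetectionSlots > best.FreeDetectionSlots {
			best, found = c, true
		}
	}
	return best, found
}
```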
## Safety and Reliability
- Idempotency: dedupe proposals by (`job_type`, `dedupe_key`), as sketched after this list.
- Backpressure: enforce max jobs per detection run.
- Timeouts: detection and execution timeout from admin runtime config.
- Replay-safe persistence: write job state changes before emitting UI events.
- Heartbeat-based failover for detector/executor reassignment.
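A minimal sketch of the per-run proposal dedupe, with an illustrative Proposal type standing in for the protobuf message:

```go
// Proposal is illustrative; the real message comes from plugin.proto.
type Proposal struct{ DedupeKey string }

// dedupeProposals collapses proposals that share a (job_type, dedupe_key)
// pair within one detection run, keeping the first occurrence.
func dedupeProposals(jobType string, proposals []Proposal) []Proposal {
	seen := make(map[string]struct{}, len(proposals))
	out := make([]Proposal, 0, len(proposals))
	for _, p := range proposals {
		key := jobType + "\x00" + p.DedupeKey
		if _, dup := seen[key]; dup {
			continue
		}
		seen[key] = struct{}{}
		out = append(out, p)
	}
	return out
}
```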
## Backward Compatibility
- Legacy `worker.proto` runtime remains internally available where still referenced.
- External CLI worker path is moved to plugin runtime behavior.
- Runtime is enabled by default on the admin's worker-facing gRPC server.
## Incremental Rollout Plan
Phase 1
- Introduce protocol and storage models only.
Phase 2
- Build admin registry/scheduler/dispatcher behind feature flag.
Phase 3
- Add dedicated plugin UI pages and metrics.
Phase 4
- Port one existing job type (e.g. vacuum) as external worker plugin.
Phase 4 status (starter)
- Added `weed worker` command as an external `plugin.proto` worker process.
- Initial handler implements `vacuum` job type with:
- declarative descriptor/config form response (`ConfigSchemaResponse`),
- detection via master topology scan (`RunDetectionRequest`),
- execution via existing vacuum task logic (`ExecuteJobRequest`),
- heartbeat/load reporting for monitor UI.
- Legacy maintenance-worker-specific CLI path is removed.
Run example:
- Start admin: `weed admin -master=localhost:9333`
- Start worker: `weed worker -admin=localhost:23646`
- Optional explicit job type: `weed worker -admin=localhost:23646 -jobType=vacuum`
- Optional stable worker ID persistence: `weed worker -admin=localhost:23646 -workingDir=/var/lib/seaweedfs-plugin`
Phase 5
- Migrate remaining job types and deprecate old mechanism.
## Agreed Defaults
1. Detector multiplicity
- Exactly one detector worker per job type at a time. Admin selects one worker and runs detection there.
2. Secret handling
- No encryption at rest required for plugin config in this phase.
3. Schema compatibility
- No migration policy required yet; this is a new system.
4. Execution ownership
- Same worker is allowed to do both detection and execution.
5. Retention
- Keep last 10 successful runs and last 10 error runs per job type.


@@ -0,0 +1,739 @@
package plugin
import (
"encoding/json"
"fmt"
"net/url"
"os"
"path/filepath"
"regexp"
"sort"
"strings"
"sync"
"time"
"github.com/seaweedfs/seaweedfs/weed/pb/plugin_pb"
"google.golang.org/protobuf/encoding/protojson"
"google.golang.org/protobuf/proto"
)
const (
pluginDirName = "plugin"
jobTypesDirName = "job_types"
jobsDirName = "jobs"
jobDetailsDirName = "job_details"
activitiesDirName = "activities"
descriptorPBFileName = "descriptor.pb"
descriptorJSONFileName = "descriptor.json"
configPBFileName = "config.pb"
configJSONFileName = "config.json"
runsJSONFileName = "runs.json"
trackedJobsJSONFileName = "tracked_jobs.json"
activitiesJSONFileName = "activities.json"
defaultDirPerm = 0o755
defaultFilePerm = 0o644
)
// validJobTypePattern is the canonical pattern for safe job type names.
// Only letters, digits, underscore, dash, and dot are allowed, which prevents
// path traversal because '/', '\\', and whitespace are rejected.
var validJobTypePattern = regexp.MustCompile(`^[A-Za-z0-9_.-]+$`)
// ConfigStore persists plugin configuration and bounded run history.
// If admin data dir is empty, it transparently falls back to in-memory mode.
type ConfigStore struct {
configured bool
baseDir string
mu sync.RWMutex
memDescriptors map[string]*plugin_pb.JobTypeDescriptor
memConfigs map[string]*plugin_pb.PersistedJobTypeConfig
memRunHistory map[string]*JobTypeRunHistory
memTrackedJobs []TrackedJob
memActivities []JobActivity
memJobDetails map[string]TrackedJob
}
func NewConfigStore(adminDataDir string) (*ConfigStore, error) {
store := &ConfigStore{
configured: adminDataDir != "",
memDescriptors: make(map[string]*plugin_pb.JobTypeDescriptor),
memConfigs: make(map[string]*plugin_pb.PersistedJobTypeConfig),
memRunHistory: make(map[string]*JobTypeRunHistory),
memJobDetails: make(map[string]TrackedJob),
}
if adminDataDir == "" {
return store, nil
}
store.baseDir = filepath.Join(adminDataDir, pluginDirName)
if err := os.MkdirAll(filepath.Join(store.baseDir, jobTypesDirName), defaultDirPerm); err != nil {
return nil, fmt.Errorf("create plugin job_types dir: %w", err)
}
if err := os.MkdirAll(filepath.Join(store.baseDir, jobsDirName), defaultDirPerm); err != nil {
return nil, fmt.Errorf("create plugin jobs dir: %w", err)
}
if err := os.MkdirAll(filepath.Join(store.baseDir, jobsDirName, jobDetailsDirName), defaultDirPerm); err != nil {
return nil, fmt.Errorf("create plugin job_details dir: %w", err)
}
if err := os.MkdirAll(filepath.Join(store.baseDir, activitiesDirName), defaultDirPerm); err != nil {
return nil, fmt.Errorf("create plugin activities dir: %w", err)
}
return store, nil
}
func (s *ConfigStore) IsConfigured() bool {
return s.configured
}
func (s *ConfigStore) BaseDir() string {
return s.baseDir
}
func (s *ConfigStore) SaveDescriptor(jobType string, descriptor *plugin_pb.JobTypeDescriptor) error {
if descriptor == nil {
return fmt.Errorf("descriptor is nil")
}
if _, err := sanitizeJobType(jobType); err != nil {
return err
}
clone := proto.Clone(descriptor).(*plugin_pb.JobTypeDescriptor)
if clone.JobType == "" {
clone.JobType = jobType
}
s.mu.Lock()
defer s.mu.Unlock()
if !s.configured {
s.memDescriptors[jobType] = clone
return nil
}
jobTypeDir, err := s.ensureJobTypeDir(jobType)
if err != nil {
return err
}
pbPath := filepath.Join(jobTypeDir, descriptorPBFileName)
jsonPath := filepath.Join(jobTypeDir, descriptorJSONFileName)
if err := writeProtoFiles(clone, pbPath, jsonPath); err != nil {
return fmt.Errorf("save descriptor for %s: %w", jobType, err)
}
return nil
}
func (s *ConfigStore) LoadDescriptor(jobType string) (*plugin_pb.JobTypeDescriptor, error) {
if _, err := sanitizeJobType(jobType); err != nil {
return nil, err
}
s.mu.RLock()
if !s.configured {
d := s.memDescriptors[jobType]
s.mu.RUnlock()
if d == nil {
return nil, nil
}
return proto.Clone(d).(*plugin_pb.JobTypeDescriptor), nil
}
s.mu.RUnlock()
pbPath := filepath.Join(s.baseDir, jobTypesDirName, jobType, descriptorPBFileName)
data, err := os.ReadFile(pbPath)
if err != nil {
if os.IsNotExist(err) {
return nil, nil
}
return nil, fmt.Errorf("read descriptor for %s: %w", jobType, err)
}
var descriptor plugin_pb.JobTypeDescriptor
if err := proto.Unmarshal(data, &descriptor); err != nil {
return nil, fmt.Errorf("unmarshal descriptor for %s: %w", jobType, err)
}
return &descriptor, nil
}
func (s *ConfigStore) SaveJobTypeConfig(config *plugin_pb.PersistedJobTypeConfig) error {
if config == nil {
return fmt.Errorf("job type config is nil")
}
if config.JobType == "" {
return fmt.Errorf("job type config has empty job_type")
}
sanitizedJobType, err := sanitizeJobType(config.JobType)
if err != nil {
return err
}
// Use the sanitized job type going forward to ensure it is safe for filesystem paths.
config.JobType = sanitizedJobType
clone := proto.Clone(config).(*plugin_pb.PersistedJobTypeConfig)
s.mu.Lock()
defer s.mu.Unlock()
if !s.configured {
s.memConfigs[config.JobType] = clone
return nil
}
jobTypeDir, err := s.ensureJobTypeDir(config.JobType)
if err != nil {
return err
}
pbPath := filepath.Join(jobTypeDir, configPBFileName)
jsonPath := filepath.Join(jobTypeDir, configJSONFileName)
if err := writeProtoFiles(clone, pbPath, jsonPath); err != nil {
return fmt.Errorf("save job type config for %s: %w", config.JobType, err)
}
return nil
}
func (s *ConfigStore) LoadJobTypeConfig(jobType string) (*plugin_pb.PersistedJobTypeConfig, error) {
if _, err := sanitizeJobType(jobType); err != nil {
return nil, err
}
s.mu.RLock()
if !s.configured {
cfg := s.memConfigs[jobType]
s.mu.RUnlock()
if cfg == nil {
return nil, nil
}
return proto.Clone(cfg).(*plugin_pb.PersistedJobTypeConfig), nil
}
s.mu.RUnlock()
pbPath := filepath.Join(s.baseDir, jobTypesDirName, jobType, configPBFileName)
data, err := os.ReadFile(pbPath)
if err != nil {
if os.IsNotExist(err) {
return nil, nil
}
return nil, fmt.Errorf("read job type config for %s: %w", jobType, err)
}
var config plugin_pb.PersistedJobTypeConfig
if err := proto.Unmarshal(data, &config); err != nil {
return nil, fmt.Errorf("unmarshal job type config for %s: %w", jobType, err)
}
return &config, nil
}
func (s *ConfigStore) AppendRunRecord(jobType string, record *JobRunRecord) error {
if record == nil {
return fmt.Errorf("run record is nil")
}
if _, err := sanitizeJobType(jobType); err != nil {
return err
}
safeRecord := *record
if safeRecord.JobType == "" {
safeRecord.JobType = jobType
}
if safeRecord.CompletedAt == nil || safeRecord.CompletedAt.IsZero() {
safeRecord.CompletedAt = timeToPtr(time.Now().UTC())
}
s.mu.Lock()
defer s.mu.Unlock()
history, err := s.loadRunHistoryLocked(jobType)
if err != nil {
return err
}
if safeRecord.Outcome == RunOutcomeSuccess {
history.SuccessfulRuns = append(history.SuccessfulRuns, safeRecord)
} else {
safeRecord.Outcome = RunOutcomeError
history.ErrorRuns = append(history.ErrorRuns, safeRecord)
}
history.SuccessfulRuns = trimRuns(history.SuccessfulRuns, MaxSuccessfulRunHistory)
history.ErrorRuns = trimRuns(history.ErrorRuns, MaxErrorRunHistory)
history.LastUpdatedTime = timeToPtr(time.Now().UTC())
return s.saveRunHistoryLocked(jobType, history)
}
func (s *ConfigStore) LoadRunHistory(jobType string) (*JobTypeRunHistory, error) {
if _, err := sanitizeJobType(jobType); err != nil {
return nil, err
}
s.mu.Lock()
defer s.mu.Unlock()
history, err := s.loadRunHistoryLocked(jobType)
if err != nil {
return nil, err
}
return cloneRunHistory(history), nil
}
func (s *ConfigStore) SaveTrackedJobs(jobs []TrackedJob) error {
s.mu.Lock()
defer s.mu.Unlock()
clone := cloneTrackedJobs(jobs)
if !s.configured {
s.memTrackedJobs = clone
return nil
}
encoded, err := json.MarshalIndent(clone, "", " ")
if err != nil {
return fmt.Errorf("encode tracked jobs: %w", err)
}
path := filepath.Join(s.baseDir, jobsDirName, trackedJobsJSONFileName)
if err := atomicWriteFile(path, encoded, defaultFilePerm); err != nil {
return fmt.Errorf("write tracked jobs: %w", err)
}
return nil
}
func (s *ConfigStore) LoadTrackedJobs() ([]TrackedJob, error) {
s.mu.RLock()
if !s.configured {
out := cloneTrackedJobs(s.memTrackedJobs)
s.mu.RUnlock()
return out, nil
}
s.mu.RUnlock()
path := filepath.Join(s.baseDir, jobsDirName, trackedJobsJSONFileName)
data, err := os.ReadFile(path)
if err != nil {
if os.IsNotExist(err) {
return nil, nil
}
return nil, fmt.Errorf("read tracked jobs: %w", err)
}
var jobs []TrackedJob
if err := json.Unmarshal(data, &jobs); err != nil {
return nil, fmt.Errorf("parse tracked jobs: %w", err)
}
return cloneTrackedJobs(jobs), nil
}
func (s *ConfigStore) SaveJobDetail(job TrackedJob) error {
jobID, err := sanitizeJobID(job.JobID)
if err != nil {
return err
}
s.mu.Lock()
defer s.mu.Unlock()
clone := cloneTrackedJob(job)
clone.JobID = jobID
if !s.configured {
s.memJobDetails[jobID] = clone
return nil
}
encoded, err := json.MarshalIndent(clone, "", " ")
if err != nil {
return fmt.Errorf("encode job detail: %w", err)
}
path := filepath.Join(s.baseDir, jobsDirName, jobDetailsDirName, jobDetailFileName(jobID))
if err := atomicWriteFile(path, encoded, defaultFilePerm); err != nil {
return fmt.Errorf("write job detail: %w", err)
}
return nil
}
func (s *ConfigStore) LoadJobDetail(jobID string) (*TrackedJob, error) {
jobID, err := sanitizeJobID(jobID)
if err != nil {
return nil, err
}
s.mu.RLock()
if !s.configured {
job, ok := s.memJobDetails[jobID]
s.mu.RUnlock()
if !ok {
return nil, nil
}
clone := cloneTrackedJob(job)
return &clone, nil
}
s.mu.RUnlock()
path := filepath.Join(s.baseDir, jobsDirName, jobDetailsDirName, jobDetailFileName(jobID))
data, err := os.ReadFile(path)
if err != nil {
if os.IsNotExist(err) {
return nil, nil
}
return nil, fmt.Errorf("read job detail: %w", err)
}
var job TrackedJob
if err := json.Unmarshal(data, &job); err != nil {
return nil, fmt.Errorf("parse job detail: %w", err)
}
clone := cloneTrackedJob(job)
return &clone, nil
}
func (s *ConfigStore) SaveActivities(activities []JobActivity) error {
s.mu.Lock()
defer s.mu.Unlock()
clone := cloneActivities(activities)
if !s.configured {
s.memActivities = clone
return nil
}
encoded, err := json.MarshalIndent(clone, "", " ")
if err != nil {
return fmt.Errorf("encode activities: %w", err)
}
path := filepath.Join(s.baseDir, activitiesDirName, activitiesJSONFileName)
if err := atomicWriteFile(path, encoded, defaultFilePerm); err != nil {
return fmt.Errorf("write activities: %w", err)
}
return nil
}
func (s *ConfigStore) LoadActivities() ([]JobActivity, error) {
s.mu.RLock()
if !s.configured {
out := cloneActivities(s.memActivities)
s.mu.RUnlock()
return out, nil
}
s.mu.RUnlock()
path := filepath.Join(s.baseDir, activitiesDirName, activitiesJSONFileName)
data, err := os.ReadFile(path)
if err != nil {
if os.IsNotExist(err) {
return nil, nil
}
return nil, fmt.Errorf("read activities: %w", err)
}
var activities []JobActivity
if err := json.Unmarshal(data, &activities); err != nil {
return nil, fmt.Errorf("parse activities: %w", err)
}
return cloneActivities(activities), nil
}
func (s *ConfigStore) ListJobTypes() ([]string, error) {
s.mu.RLock()
defer s.mu.RUnlock()
jobTypeSet := make(map[string]struct{})
if !s.configured {
for jobType := range s.memDescriptors {
jobTypeSet[jobType] = struct{}{}
}
for jobType := range s.memConfigs {
jobTypeSet[jobType] = struct{}{}
}
for jobType := range s.memRunHistory {
jobTypeSet[jobType] = struct{}{}
}
} else {
jobTypesPath := filepath.Join(s.baseDir, jobTypesDirName)
entries, err := os.ReadDir(jobTypesPath)
if err != nil {
if os.IsNotExist(err) {
return []string{}, nil
}
return nil, fmt.Errorf("list job types: %w", err)
}
for _, entry := range entries {
if !entry.IsDir() {
continue
}
jobType := strings.TrimSpace(entry.Name())
if _, err := sanitizeJobType(jobType); err != nil {
continue
}
jobTypeSet[jobType] = struct{}{}
}
}
jobTypes := make([]string, 0, len(jobTypeSet))
for jobType := range jobTypeSet {
jobTypes = append(jobTypes, jobType)
}
sort.Strings(jobTypes)
return jobTypes, nil
}
func (s *ConfigStore) loadRunHistoryLocked(jobType string) (*JobTypeRunHistory, error) {
if !s.configured {
history, ok := s.memRunHistory[jobType]
if !ok {
history = &JobTypeRunHistory{JobType: jobType}
s.memRunHistory[jobType] = history
}
return cloneRunHistory(history), nil
}
runsPath := filepath.Join(s.baseDir, jobTypesDirName, jobType, runsJSONFileName)
data, err := os.ReadFile(runsPath)
if err != nil {
if os.IsNotExist(err) {
return &JobTypeRunHistory{JobType: jobType}, nil
}
return nil, fmt.Errorf("read run history for %s: %w", jobType, err)
}
var history JobTypeRunHistory
if err := json.Unmarshal(data, &history); err != nil {
return nil, fmt.Errorf("parse run history for %s: %w", jobType, err)
}
if history.JobType == "" {
history.JobType = jobType
}
return &history, nil
}
func (s *ConfigStore) saveRunHistoryLocked(jobType string, history *JobTypeRunHistory) error {
if !s.configured {
s.memRunHistory[jobType] = cloneRunHistory(history)
return nil
}
jobTypeDir, err := s.ensureJobTypeDir(jobType)
if err != nil {
return err
}
encoded, err := json.MarshalIndent(history, "", " ")
if err != nil {
return fmt.Errorf("encode run history for %s: %w", jobType, err)
}
runsPath := filepath.Join(jobTypeDir, runsJSONFileName)
if err := atomicWriteFile(runsPath, encoded, defaultFilePerm); err != nil {
return fmt.Errorf("write run history for %s: %w", jobType, err)
}
return nil
}
func (s *ConfigStore) ensureJobTypeDir(jobType string) (string, error) {
if !s.configured {
return "", nil
}
jobTypeDir := filepath.Join(s.baseDir, jobTypesDirName, jobType)
if err := os.MkdirAll(jobTypeDir, defaultDirPerm); err != nil {
return "", fmt.Errorf("create job type dir for %s: %w", jobType, err)
}
return jobTypeDir, nil
}
func sanitizeJobType(jobType string) (string, error) {
jobType = strings.TrimSpace(jobType)
if jobType == "" {
return "", fmt.Errorf("job type is empty")
}
// Enforce a strict, path-safe pattern for job types: only letters, digits, underscore, dash and dot.
// This prevents path traversal because '/', '\\' and whitespace are rejected.
	if !validJobTypePattern.MatchString(jobType) {
		return "", fmt.Errorf("invalid job type %q: must match %s", jobType, validJobTypePattern.String())
	}
	// The pattern still admits the dot-only names "." and "..", which are
	// path segments; reject them so a job type can never resolve to the
	// current or parent directory.
	if jobType == "." || jobType == ".." {
		return "", fmt.Errorf("invalid job type %q: must not be a dot segment", jobType)
	}
	return jobType, nil
}
// validJobIDPattern allows letters, digits, dash, underscore, and dot.
// url.PathEscape in jobDetailFileName provides a second layer of defense.
var validJobIDPattern = regexp.MustCompile(`^[A-Za-z0-9_.-]+$`)
func sanitizeJobID(jobID string) (string, error) {
jobID = strings.TrimSpace(jobID)
if jobID == "" {
return "", fmt.Errorf("job id is empty")
}
if !validJobIDPattern.MatchString(jobID) {
return "", fmt.Errorf("invalid job id %q: must match %s", jobID, validJobIDPattern.String())
}
return jobID, nil
}
func jobDetailFileName(jobID string) string {
return url.PathEscape(jobID) + ".json"
}
func trimRuns(runs []JobRunRecord, maxKeep int) []JobRunRecord {
if len(runs) == 0 {
return runs
}
sort.Slice(runs, func(i, j int) bool {
ti := time.Time{}
if runs[i].CompletedAt != nil {
ti = *runs[i].CompletedAt
}
tj := time.Time{}
if runs[j].CompletedAt != nil {
tj = *runs[j].CompletedAt
}
return ti.After(tj)
})
if len(runs) > maxKeep {
runs = runs[:maxKeep]
}
return runs
}
func cloneRunHistory(in *JobTypeRunHistory) *JobTypeRunHistory {
if in == nil {
return nil
}
out := *in
if in.SuccessfulRuns != nil {
out.SuccessfulRuns = append([]JobRunRecord(nil), in.SuccessfulRuns...)
}
if in.ErrorRuns != nil {
out.ErrorRuns = append([]JobRunRecord(nil), in.ErrorRuns...)
}
return &out
}
func cloneTrackedJobs(in []TrackedJob) []TrackedJob {
if len(in) == 0 {
return nil
}
out := make([]TrackedJob, len(in))
for i := range in {
out[i] = cloneTrackedJob(in[i])
}
return out
}
func cloneTrackedJob(in TrackedJob) TrackedJob {
out := in
if in.Parameters != nil {
out.Parameters = make(map[string]interface{}, len(in.Parameters))
for key, value := range in.Parameters {
out.Parameters[key] = deepCopyGenericValue(value)
}
}
if in.Labels != nil {
out.Labels = make(map[string]string, len(in.Labels))
for key, value := range in.Labels {
out.Labels[key] = value
}
}
if in.ResultOutputValues != nil {
out.ResultOutputValues = make(map[string]interface{}, len(in.ResultOutputValues))
for key, value := range in.ResultOutputValues {
out.ResultOutputValues[key] = deepCopyGenericValue(value)
}
}
return out
}
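// deepCopyGenericValue recursively copies JSON-style maps and slices so that
// callers cannot mutate shared nested state; scalar values are returned as-is.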
func deepCopyGenericValue(val interface{}) interface{} {
switch v := val.(type) {
case map[string]interface{}:
res := make(map[string]interface{}, len(v))
for k, val := range v {
res[k] = deepCopyGenericValue(val)
}
return res
case []interface{}:
res := make([]interface{}, len(v))
for i, val := range v {
res[i] = deepCopyGenericValue(val)
}
return res
default:
return v
}
}
func cloneActivities(in []JobActivity) []JobActivity {
if len(in) == 0 {
return nil
}
out := make([]JobActivity, len(in))
for i := range in {
out[i] = in[i]
if in[i].Details != nil {
out[i].Details = make(map[string]interface{}, len(in[i].Details))
for key, value := range in[i].Details {
out[i].Details[key] = deepCopyGenericValue(value)
}
}
}
return out
}
// writeProtoFiles writes message to both a binary protobuf file (pbPath) and a
// human-readable JSON file (jsonPath) using atomicWriteFile for each.
// The .pb file is the authoritative source of truth: all reads use proto.Unmarshal
// on the .pb file. The .json file is for human inspection only, so a partial
// failure where .pb succeeds but .json fails leaves the store in a consistent state.
func writeProtoFiles(message proto.Message, pbPath string, jsonPath string) error {
pbData, err := proto.Marshal(message)
if err != nil {
return fmt.Errorf("marshal protobuf: %w", err)
}
if err := atomicWriteFile(pbPath, pbData, defaultFilePerm); err != nil {
return fmt.Errorf("write protobuf file: %w", err)
}
jsonData, err := protojson.MarshalOptions{
Multiline: true,
Indent: " ",
EmitUnpopulated: true,
}.Marshal(message)
if err != nil {
return fmt.Errorf("marshal json: %w", err)
}
if err := atomicWriteFile(jsonPath, jsonData, defaultFilePerm); err != nil {
return fmt.Errorf("write json file: %w", err)
}
return nil
}
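// atomicWriteFile writes data to a temporary sibling file and renames it into
// place, so readers never observe a partially written file; the rename is
// atomic when source and destination live on the same filesystem.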
func atomicWriteFile(filename string, data []byte, perm os.FileMode) error {
dir := filepath.Dir(filename)
if err := os.MkdirAll(dir, defaultDirPerm); err != nil {
return fmt.Errorf("create directory %s: %w", dir, err)
}
tmpFile := filename + ".tmp"
if err := os.WriteFile(tmpFile, data, perm); err != nil {
return err
}
if err := os.Rename(tmpFile, filename); err != nil {
_ = os.Remove(tmpFile)
return err
}
return nil
}


@@ -0,0 +1,257 @@
package plugin
import (
"reflect"
"testing"
"time"
"github.com/seaweedfs/seaweedfs/weed/pb/plugin_pb"
)
func TestConfigStoreDescriptorRoundTrip(t *testing.T) {
t.Parallel()
tempDir := t.TempDir()
store, err := NewConfigStore(tempDir)
if err != nil {
t.Fatalf("NewConfigStore: %v", err)
}
descriptor := &plugin_pb.JobTypeDescriptor{
JobType: "vacuum",
DisplayName: "Vacuum",
Description: "Vacuum volumes",
DescriptorVersion: 1,
}
if err := store.SaveDescriptor("vacuum", descriptor); err != nil {
t.Fatalf("SaveDescriptor: %v", err)
}
got, err := store.LoadDescriptor("vacuum")
if err != nil {
t.Fatalf("LoadDescriptor: %v", err)
}
if got == nil {
t.Fatalf("LoadDescriptor: nil descriptor")
}
if got.DisplayName != descriptor.DisplayName {
t.Fatalf("unexpected display name: got %q want %q", got.DisplayName, descriptor.DisplayName)
}
}
func TestConfigStoreRunHistoryRetention(t *testing.T) {
t.Parallel()
store, err := NewConfigStore(t.TempDir())
if err != nil {
t.Fatalf("NewConfigStore: %v", err)
}
base := time.Now().UTC().Add(-24 * time.Hour)
for i := 0; i < 15; i++ {
err := store.AppendRunRecord("balance", &JobRunRecord{
RunID: "s" + time.Duration(i).String(),
JobID: "job-success",
JobType: "balance",
WorkerID: "worker-a",
Outcome: RunOutcomeSuccess,
CompletedAt: timeToPtr(base.Add(time.Duration(i) * time.Minute)),
})
if err != nil {
t.Fatalf("AppendRunRecord success[%d]: %v", i, err)
}
}
for i := 0; i < 12; i++ {
err := store.AppendRunRecord("balance", &JobRunRecord{
RunID: "e" + time.Duration(i).String(),
JobID: "job-error",
JobType: "balance",
WorkerID: "worker-b",
Outcome: RunOutcomeError,
CompletedAt: timeToPtr(base.Add(time.Duration(i) * time.Minute)),
})
if err != nil {
t.Fatalf("AppendRunRecord error[%d]: %v", i, err)
}
}
history, err := store.LoadRunHistory("balance")
if err != nil {
t.Fatalf("LoadRunHistory: %v", err)
}
if len(history.SuccessfulRuns) != MaxSuccessfulRunHistory {
t.Fatalf("successful retention mismatch: got %d want %d", len(history.SuccessfulRuns), MaxSuccessfulRunHistory)
}
if len(history.ErrorRuns) != MaxErrorRunHistory {
t.Fatalf("error retention mismatch: got %d want %d", len(history.ErrorRuns), MaxErrorRunHistory)
}
for i := 1; i < len(history.SuccessfulRuns); i++ {
t1 := time.Time{}
if history.SuccessfulRuns[i-1].CompletedAt != nil {
t1 = *history.SuccessfulRuns[i-1].CompletedAt
}
t2 := time.Time{}
if history.SuccessfulRuns[i].CompletedAt != nil {
t2 = *history.SuccessfulRuns[i].CompletedAt
}
if t1.Before(t2) {
t.Fatalf("successful run order not descending at %d", i)
}
}
for i := 1; i < len(history.ErrorRuns); i++ {
t1 := time.Time{}
if history.ErrorRuns[i-1].CompletedAt != nil {
t1 = *history.ErrorRuns[i-1].CompletedAt
}
t2 := time.Time{}
if history.ErrorRuns[i].CompletedAt != nil {
t2 = *history.ErrorRuns[i].CompletedAt
}
if t1.Before(t2) {
t.Fatalf("error run order not descending at %d", i)
}
}
}
func TestConfigStoreListJobTypes(t *testing.T) {
t.Parallel()
store, err := NewConfigStore("")
if err != nil {
t.Fatalf("NewConfigStore: %v", err)
}
if err := store.SaveDescriptor("vacuum", &plugin_pb.JobTypeDescriptor{JobType: "vacuum"}); err != nil {
t.Fatalf("SaveDescriptor: %v", err)
}
if err := store.SaveJobTypeConfig(&plugin_pb.PersistedJobTypeConfig{
JobType: "balance",
AdminRuntime: &plugin_pb.AdminRuntimeConfig{Enabled: true},
}); err != nil {
t.Fatalf("SaveJobTypeConfig: %v", err)
}
if err := store.AppendRunRecord("ec", &JobRunRecord{Outcome: RunOutcomeSuccess, CompletedAt: timeToPtr(time.Now().UTC())}); err != nil {
t.Fatalf("AppendRunRecord: %v", err)
}
got, err := store.ListJobTypes()
if err != nil {
t.Fatalf("ListJobTypes: %v", err)
}
want := []string{"balance", "ec", "vacuum"}
if !reflect.DeepEqual(got, want) {
t.Fatalf("unexpected job types: got=%v want=%v", got, want)
}
}
func TestConfigStoreMonitorStateRoundTrip(t *testing.T) {
t.Parallel()
store, err := NewConfigStore(t.TempDir())
if err != nil {
t.Fatalf("NewConfigStore: %v", err)
}
tracked := []TrackedJob{
{
JobID: "job-1",
JobType: "vacuum",
State: "running",
Progress: 55,
WorkerID: "worker-a",
CreatedAt: timeToPtr(time.Now().UTC().Add(-2 * time.Minute)),
UpdatedAt: timeToPtr(time.Now().UTC().Add(-1 * time.Minute)),
},
}
activities := []JobActivity{
{
JobID: "job-1",
JobType: "vacuum",
Source: "worker_progress",
Message: "processing",
Stage: "running",
OccurredAt: timeToPtr(time.Now().UTC()),
Details: map[string]interface{}{
"step": "scan",
},
},
}
if err := store.SaveTrackedJobs(tracked); err != nil {
t.Fatalf("SaveTrackedJobs: %v", err)
}
if err := store.SaveActivities(activities); err != nil {
t.Fatalf("SaveActivities: %v", err)
}
gotTracked, err := store.LoadTrackedJobs()
if err != nil {
t.Fatalf("LoadTrackedJobs: %v", err)
}
if len(gotTracked) != 1 || gotTracked[0].JobID != tracked[0].JobID {
t.Fatalf("unexpected tracked jobs: %+v", gotTracked)
}
gotActivities, err := store.LoadActivities()
if err != nil {
t.Fatalf("LoadActivities: %v", err)
}
if len(gotActivities) != 1 || gotActivities[0].Message != activities[0].Message {
t.Fatalf("unexpected activities: %+v", gotActivities)
}
if gotActivities[0].Details["step"] != "scan" {
t.Fatalf("unexpected activity details: %+v", gotActivities[0].Details)
}
}
func TestConfigStoreJobDetailRoundTrip(t *testing.T) {
t.Parallel()
store, err := NewConfigStore(t.TempDir())
if err != nil {
t.Fatalf("NewConfigStore: %v", err)
}
input := TrackedJob{
JobID: "job-detail-1",
JobType: "vacuum",
Summary: "detail summary",
Detail: "detail payload",
CreatedAt: timeToPtr(time.Now().UTC().Add(-2 * time.Minute)),
UpdatedAt: timeToPtr(time.Now().UTC()),
Parameters: map[string]interface{}{
"volume_id": map[string]interface{}{"int64_value": "3"},
},
Labels: map[string]string{
"source": "detector",
},
ResultOutputValues: map[string]interface{}{
"moved": map[string]interface{}{"bool_value": true},
},
}
if err := store.SaveJobDetail(input); err != nil {
t.Fatalf("SaveJobDetail: %v", err)
}
got, err := store.LoadJobDetail(input.JobID)
if err != nil {
t.Fatalf("LoadJobDetail: %v", err)
}
if got == nil {
t.Fatalf("LoadJobDetail returned nil")
}
if got.Detail != input.Detail {
t.Fatalf("unexpected detail: got=%q want=%q", got.Detail, input.Detail)
}
if got.Labels["source"] != "detector" {
t.Fatalf("unexpected labels: %+v", got.Labels)
}
if got.ResultOutputValues == nil {
t.Fatalf("expected result output values")
}
}


@@ -0,0 +1,231 @@
package plugin
import (
"encoding/base64"
"sort"
"strconv"
"strings"
"github.com/seaweedfs/seaweedfs/weed/pb/worker_pb"
"github.com/seaweedfs/seaweedfs/weed/storage/erasure_coding"
"google.golang.org/protobuf/proto"
)
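// enrichTrackedJobParameters derives an "execution_plan" entry from encoded
// worker_pb.TaskParams embedded in a job's parameters, returning the map
// unchanged when a plan already exists or the parameters cannot be decoded.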
func enrichTrackedJobParameters(jobType string, parameters map[string]interface{}) map[string]interface{} {
if len(parameters) == 0 {
return parameters
}
if _, exists := parameters["execution_plan"]; exists {
return parameters
}
taskParams, ok := decodeTaskParamsFromPlainParameters(parameters)
if !ok || taskParams == nil {
return parameters
}
plan := buildExecutionPlan(strings.TrimSpace(jobType), taskParams)
if plan == nil {
return parameters
}
enriched := make(map[string]interface{}, len(parameters)+1)
for key, value := range parameters {
enriched[key] = value
}
enriched["execution_plan"] = plan
return enriched
}
func decodeTaskParamsFromPlainParameters(parameters map[string]interface{}) (*worker_pb.TaskParams, bool) {
rawField, ok := parameters["task_params_pb"]
if !ok || rawField == nil {
return nil, false
}
fieldMap, ok := rawField.(map[string]interface{})
if !ok {
return nil, false
}
bytesValue, _ := fieldMap["bytes_value"].(string)
bytesValue = strings.TrimSpace(bytesValue)
if bytesValue == "" {
return nil, false
}
payload, err := base64.StdEncoding.DecodeString(bytesValue)
if err != nil {
return nil, false
}
params := &worker_pb.TaskParams{}
if err := proto.Unmarshal(payload, params); err != nil {
return nil, false
}
return params, true
}
func buildExecutionPlan(jobType string, params *worker_pb.TaskParams) map[string]interface{} {
if params == nil {
return nil
}
normalizedJobType := strings.TrimSpace(jobType)
if normalizedJobType == "" && params.GetErasureCodingParams() != nil {
normalizedJobType = "erasure_coding"
}
switch normalizedJobType {
case "erasure_coding":
return buildErasureCodingExecutionPlan(params)
default:
return nil
}
}
func buildErasureCodingExecutionPlan(params *worker_pb.TaskParams) map[string]interface{} {
if params == nil {
return nil
}
ecParams := params.GetErasureCodingParams()
if ecParams == nil {
return nil
}
dataShards := int(ecParams.DataShards)
if dataShards <= 0 {
dataShards = int(erasure_coding.DataShardsCount)
}
parityShards := int(ecParams.ParityShards)
if parityShards <= 0 {
parityShards = int(erasure_coding.ParityShardsCount)
}
totalShards := dataShards + parityShards
sources := make([]map[string]interface{}, 0, len(params.Sources))
for _, source := range params.Sources {
if source == nil {
continue
}
sources = append(sources, buildExecutionEndpoint(
source.Node,
source.DataCenter,
source.Rack,
source.VolumeId,
source.ShardIds,
dataShards,
))
}
targets := make([]map[string]interface{}, 0, len(params.Targets))
shardAssignments := make([]map[string]interface{}, 0, totalShards)
for targetIndex, target := range params.Targets {
if target == nil {
continue
}
targets = append(targets, buildExecutionEndpoint(
target.Node,
target.DataCenter,
target.Rack,
target.VolumeId,
target.ShardIds,
dataShards,
))
for _, shardID := range normalizeShardIDs(target.ShardIds) {
kind, label := classifyShardID(shardID, dataShards)
shardAssignments = append(shardAssignments, map[string]interface{}{
"shard_id": shardID,
"kind": kind,
"label": label,
"target_index": targetIndex + 1,
"target_node": strings.TrimSpace(target.Node),
"target_data_center": strings.TrimSpace(target.DataCenter),
"target_rack": strings.TrimSpace(target.Rack),
"target_volume_id": int(target.VolumeId),
})
}
}
sort.Slice(shardAssignments, func(i, j int) bool {
left, _ := shardAssignments[i]["shard_id"].(int)
right, _ := shardAssignments[j]["shard_id"].(int)
return left < right
})
plan := map[string]interface{}{
"job_type": "erasure_coding",
"task_id": strings.TrimSpace(params.TaskId),
"volume_id": int(params.VolumeId),
"collection": strings.TrimSpace(params.Collection),
"data_shards": dataShards,
"parity_shards": parityShards,
"total_shards": totalShards,
"sources": sources,
"targets": targets,
"source_count": len(sources),
"target_count": len(targets),
}
if len(shardAssignments) > 0 {
plan["shard_assignments"] = shardAssignments
}
return plan
}
func buildExecutionEndpoint(
node string,
dataCenter string,
rack string,
volumeID uint32,
shardIDs []uint32,
dataShardCount int,
) map[string]interface{} {
allShards := normalizeShardIDs(shardIDs)
dataShards := make([]int, 0, len(allShards))
parityShards := make([]int, 0, len(allShards))
for _, shardID := range allShards {
if shardID < dataShardCount {
dataShards = append(dataShards, shardID)
} else {
parityShards = append(parityShards, shardID)
}
}
return map[string]interface{}{
"node": strings.TrimSpace(node),
"data_center": strings.TrimSpace(dataCenter),
"rack": strings.TrimSpace(rack),
"volume_id": int(volumeID),
"shard_ids": allShards,
"data_shard_ids": dataShards,
"parity_shard_ids": parityShards,
}
}
func normalizeShardIDs(shardIDs []uint32) []int {
if len(shardIDs) == 0 {
return nil
}
out := make([]int, 0, len(shardIDs))
for _, shardID := range shardIDs {
out = append(out, int(shardID))
}
sort.Ints(out)
return out
}
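// classifyShardID buckets a shard index into data vs parity and builds the
// D/P badge label used by the UI. With the default 10 data + 4 parity layout,
// shard 3 yields ("data", "D3") and shard 12 yields ("parity", "P12").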
func classifyShardID(shardID int, dataShardCount int) (kind string, label string) {
if dataShardCount <= 0 {
dataShardCount = int(erasure_coding.DataShardsCount)
}
if shardID < dataShardCount {
return "data", "D" + strconv.Itoa(shardID)
}
return "parity", "P" + strconv.Itoa(shardID)
}

weed/admin/plugin/plugin.go

File diff suppressed because it is too large (1243 lines added).


@@ -0,0 +1,112 @@
package plugin
import (
"context"
"errors"
"testing"
"github.com/seaweedfs/seaweedfs/weed/pb/plugin_pb"
)
func TestRunDetectionSendsCancelOnContextDone(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New plugin error: %v", err)
}
defer pluginSvc.Shutdown()
const workerID = "worker-detect"
const jobType = "vacuum"
pluginSvc.registry.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: workerID,
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: jobType, CanDetect: true, MaxDetectionConcurrency: 1},
},
})
session := &streamSession{workerID: workerID, outgoing: make(chan *plugin_pb.AdminToWorkerMessage, 4)}
pluginSvc.putSession(session)
ctx, cancel := context.WithCancel(context.Background())
errCh := make(chan error, 1)
go func() {
_, runErr := pluginSvc.RunDetection(ctx, jobType, &plugin_pb.ClusterContext{}, 10)
errCh <- runErr
}()
first := <-session.outgoing
if first.GetRunDetectionRequest() == nil {
t.Fatalf("expected first message to be run_detection_request")
}
cancel()
second := <-session.outgoing
cancelReq := second.GetCancelRequest()
if cancelReq == nil {
t.Fatalf("expected second message to be cancel_request")
}
if cancelReq.TargetId != first.RequestId {
t.Fatalf("unexpected cancel target id: got=%s want=%s", cancelReq.TargetId, first.RequestId)
}
if cancelReq.TargetKind != plugin_pb.WorkKind_WORK_KIND_DETECTION {
t.Fatalf("unexpected cancel target kind: %v", cancelReq.TargetKind)
}
runErr := <-errCh
if !errors.Is(runErr, context.Canceled) {
t.Fatalf("expected context canceled error, got %v", runErr)
}
}
func TestExecuteJobSendsCancelOnContextDone(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New plugin error: %v", err)
}
defer pluginSvc.Shutdown()
const workerID = "worker-exec"
const jobType = "vacuum"
pluginSvc.registry.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: workerID,
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: jobType, CanExecute: true, MaxExecutionConcurrency: 1},
},
})
session := &streamSession{workerID: workerID, outgoing: make(chan *plugin_pb.AdminToWorkerMessage, 4)}
pluginSvc.putSession(session)
job := &plugin_pb.JobSpec{JobId: "job-1", JobType: jobType}
ctx, cancel := context.WithCancel(context.Background())
errCh := make(chan error, 1)
go func() {
_, runErr := pluginSvc.ExecuteJob(ctx, job, &plugin_pb.ClusterContext{}, 1)
errCh <- runErr
}()
first := <-session.outgoing
if first.GetExecuteJobRequest() == nil {
t.Fatalf("expected first message to be execute_job_request")
}
cancel()
second := <-session.outgoing
cancelReq := second.GetCancelRequest()
if cancelReq == nil {
t.Fatalf("expected second message to be cancel_request")
}
if cancelReq.TargetId != first.RequestId {
t.Fatalf("unexpected cancel target id: got=%s want=%s", cancelReq.TargetId, first.RequestId)
}
if cancelReq.TargetKind != plugin_pb.WorkKind_WORK_KIND_EXECUTION {
t.Fatalf("unexpected cancel target kind: %v", cancelReq.TargetKind)
}
runErr := <-errCh
if !errors.Is(runErr, context.Canceled) {
t.Fatalf("expected context canceled error, got %v", runErr)
}
}


@@ -0,0 +1,125 @@
package plugin
import (
"testing"
"github.com/seaweedfs/seaweedfs/weed/pb/plugin_pb"
)
func TestEnsureJobTypeConfigFromDescriptorBootstrapsDefaults(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
descriptor := &plugin_pb.JobTypeDescriptor{
JobType: "vacuum",
DescriptorVersion: 3,
AdminConfigForm: &plugin_pb.ConfigForm{
DefaultValues: map[string]*plugin_pb.ConfigValue{
"scan_scope": {Kind: &plugin_pb.ConfigValue_StringValue{StringValue: "all"}},
},
},
WorkerConfigForm: &plugin_pb.ConfigForm{
DefaultValues: map[string]*plugin_pb.ConfigValue{
"threshold": {Kind: &plugin_pb.ConfigValue_DoubleValue{DoubleValue: 0.3}},
},
},
AdminRuntimeDefaults: &plugin_pb.AdminRuntimeDefaults{
Enabled: true,
DetectionIntervalSeconds: 60,
DetectionTimeoutSeconds: 20,
MaxJobsPerDetection: 30,
GlobalExecutionConcurrency: 4,
PerWorkerExecutionConcurrency: 2,
RetryLimit: 3,
RetryBackoffSeconds: 5,
},
}
if err := pluginSvc.ensureJobTypeConfigFromDescriptor("vacuum", descriptor); err != nil {
t.Fatalf("ensureJobTypeConfigFromDescriptor: %v", err)
}
cfg, err := pluginSvc.LoadJobTypeConfig("vacuum")
if err != nil {
t.Fatalf("LoadJobTypeConfig: %v", err)
}
if cfg == nil {
t.Fatalf("expected non-nil config")
}
if cfg.DescriptorVersion != 3 {
t.Fatalf("unexpected descriptor version: got=%d", cfg.DescriptorVersion)
}
if cfg.AdminRuntime == nil || !cfg.AdminRuntime.Enabled {
t.Fatalf("expected enabled admin settings")
}
if cfg.AdminRuntime.GlobalExecutionConcurrency != 4 {
t.Fatalf("unexpected global execution concurrency: %d", cfg.AdminRuntime.GlobalExecutionConcurrency)
}
if _, ok := cfg.AdminConfigValues["scan_scope"]; !ok {
t.Fatalf("missing admin default value")
}
if _, ok := cfg.WorkerConfigValues["threshold"]; !ok {
t.Fatalf("missing worker default value")
}
}
func TestEnsureJobTypeConfigFromDescriptorDoesNotOverwriteExisting(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
if err := pluginSvc.SaveJobTypeConfig(&plugin_pb.PersistedJobTypeConfig{
JobType: "balance",
AdminRuntime: &plugin_pb.AdminRuntimeConfig{
Enabled: true,
GlobalExecutionConcurrency: 9,
},
AdminConfigValues: map[string]*plugin_pb.ConfigValue{
"custom": {Kind: &plugin_pb.ConfigValue_StringValue{StringValue: "keep"}},
},
}); err != nil {
t.Fatalf("SaveJobTypeConfig: %v", err)
}
descriptor := &plugin_pb.JobTypeDescriptor{
JobType: "balance",
DescriptorVersion: 7,
AdminConfigForm: &plugin_pb.ConfigForm{
DefaultValues: map[string]*plugin_pb.ConfigValue{
"custom": {Kind: &plugin_pb.ConfigValue_StringValue{StringValue: "overwrite"}},
},
},
AdminRuntimeDefaults: &plugin_pb.AdminRuntimeDefaults{
Enabled: true,
GlobalExecutionConcurrency: 1,
},
}
if err := pluginSvc.ensureJobTypeConfigFromDescriptor("balance", descriptor); err != nil {
t.Fatalf("ensureJobTypeConfigFromDescriptor: %v", err)
}
cfg, err := pluginSvc.LoadJobTypeConfig("balance")
if err != nil {
t.Fatalf("LoadJobTypeConfig: %v", err)
}
if cfg == nil {
t.Fatalf("expected config")
}
if cfg.AdminRuntime == nil || cfg.AdminRuntime.GlobalExecutionConcurrency != 9 {
t.Fatalf("existing admin settings should be preserved, got=%v", cfg.AdminRuntime)
}
custom := cfg.AdminConfigValues["custom"]
if custom == nil || custom.GetStringValue() != "keep" {
t.Fatalf("existing admin config should be preserved")
}
}


@@ -0,0 +1,197 @@
package plugin
import (
"context"
"testing"
"time"
"github.com/seaweedfs/seaweedfs/weed/pb/plugin_pb"
)
func TestRunDetectionIncludesLatestSuccessfulRun(t *testing.T) {
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New plugin error: %v", err)
}
defer pluginSvc.Shutdown()
jobType := "vacuum"
pluginSvc.registry.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-a",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: jobType, CanDetect: true, MaxDetectionConcurrency: 1},
},
})
session := &streamSession{workerID: "worker-a", outgoing: make(chan *plugin_pb.AdminToWorkerMessage, 1)}
pluginSvc.putSession(session)
oldSuccess := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
latestSuccess := time.Date(2026, 2, 1, 0, 0, 0, 0, time.UTC)
if err := pluginSvc.store.AppendRunRecord(jobType, &JobRunRecord{Outcome: RunOutcomeSuccess, CompletedAt: timeToPtr(oldSuccess)}); err != nil {
t.Fatalf("AppendRunRecord old success: %v", err)
}
if err := pluginSvc.store.AppendRunRecord(jobType, &JobRunRecord{Outcome: RunOutcomeError, CompletedAt: timeToPtr(latestSuccess.Add(2 * time.Hour))}); err != nil {
t.Fatalf("AppendRunRecord error run: %v", err)
}
if err := pluginSvc.store.AppendRunRecord(jobType, &JobRunRecord{Outcome: RunOutcomeSuccess, CompletedAt: timeToPtr(latestSuccess)}); err != nil {
t.Fatalf("AppendRunRecord latest success: %v", err)
}
resultCh := make(chan error, 1)
go func() {
_, runErr := pluginSvc.RunDetection(context.Background(), jobType, &plugin_pb.ClusterContext{}, 10)
resultCh <- runErr
}()
message := <-session.outgoing
detectRequest := message.GetRunDetectionRequest()
if detectRequest == nil {
t.Fatalf("expected run detection request message")
}
if detectRequest.LastSuccessfulRun == nil {
t.Fatalf("expected last_successful_run to be set")
}
if got := detectRequest.LastSuccessfulRun.AsTime().UTC(); !got.Equal(latestSuccess) {
t.Fatalf("unexpected last_successful_run, got=%s want=%s", got, latestSuccess)
}
pluginSvc.handleDetectionComplete("worker-a", &plugin_pb.DetectionComplete{
RequestId: message.RequestId,
JobType: jobType,
Success: true,
})
if runErr := <-resultCh; runErr != nil {
t.Fatalf("RunDetection error: %v", runErr)
}
}
func TestRunDetectionOmitsLastSuccessfulRunWhenNoSuccessHistory(t *testing.T) {
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New plugin error: %v", err)
}
defer pluginSvc.Shutdown()
jobType := "vacuum"
pluginSvc.registry.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-a",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: jobType, CanDetect: true, MaxDetectionConcurrency: 1},
},
})
session := &streamSession{workerID: "worker-a", outgoing: make(chan *plugin_pb.AdminToWorkerMessage, 1)}
pluginSvc.putSession(session)
if err := pluginSvc.store.AppendRunRecord(jobType, &JobRunRecord{
Outcome: RunOutcomeError,
CompletedAt: timeToPtr(time.Date(2026, 2, 10, 0, 0, 0, 0, time.UTC)),
}); err != nil {
t.Fatalf("AppendRunRecord error run: %v", err)
}
resultCh := make(chan error, 1)
go func() {
_, runErr := pluginSvc.RunDetection(context.Background(), jobType, &plugin_pb.ClusterContext{}, 10)
resultCh <- runErr
}()
message := <-session.outgoing
detectRequest := message.GetRunDetectionRequest()
if detectRequest == nil {
t.Fatalf("expected run detection request message")
}
if detectRequest.LastSuccessfulRun != nil {
t.Fatalf("expected last_successful_run to be nil when no success history")
}
pluginSvc.handleDetectionComplete("worker-a", &plugin_pb.DetectionComplete{
RequestId: message.RequestId,
JobType: jobType,
Success: true,
})
if runErr := <-resultCh; runErr != nil {
t.Fatalf("RunDetection error: %v", runErr)
}
}
func TestRunDetectionWithReportCapturesDetectionActivities(t *testing.T) {
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New plugin error: %v", err)
}
defer pluginSvc.Shutdown()
jobType := "vacuum"
pluginSvc.registry.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-a",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: jobType, CanDetect: true, MaxDetectionConcurrency: 1},
},
})
session := &streamSession{workerID: "worker-a", outgoing: make(chan *plugin_pb.AdminToWorkerMessage, 1)}
pluginSvc.putSession(session)
reportCh := make(chan *DetectionReport, 1)
errCh := make(chan error, 1)
go func() {
report, runErr := pluginSvc.RunDetectionWithReport(context.Background(), jobType, &plugin_pb.ClusterContext{}, 10)
reportCh <- report
errCh <- runErr
}()
message := <-session.outgoing
requestID := message.GetRequestId()
if requestID == "" {
t.Fatalf("expected request id in detection request")
}
pluginSvc.handleDetectionProposals("worker-a", &plugin_pb.DetectionProposals{
RequestId: requestID,
JobType: jobType,
Proposals: []*plugin_pb.JobProposal{
{
ProposalId: "proposal-1",
JobType: jobType,
Summary: "vacuum proposal",
Detail: "based on garbage ratio",
},
},
})
pluginSvc.handleDetectionComplete("worker-a", &plugin_pb.DetectionComplete{
RequestId: requestID,
JobType: jobType,
Success: true,
TotalProposals: 1,
})
report := <-reportCh
if report == nil {
t.Fatalf("expected detection report")
}
if report.RequestID == "" {
t.Fatalf("expected detection report request id")
}
if report.WorkerID != "worker-a" {
t.Fatalf("expected worker-a, got %q", report.WorkerID)
}
if len(report.Proposals) != 1 {
t.Fatalf("expected one proposal in report, got %d", len(report.Proposals))
}
if runErr := <-errCh; runErr != nil {
t.Fatalf("RunDetectionWithReport error: %v", runErr)
}
activities := pluginSvc.ListActivities(jobType, 0)
stages := map[string]bool{}
for _, activity := range activities {
if activity.RequestID != report.RequestID {
continue
}
stages[activity.Stage] = true
}
if !stages["requested"] || !stages["proposal"] || !stages["completed"] {
t.Fatalf("expected requested/proposal/completed activities, got stages=%v", stages)
}
}


@@ -0,0 +1,896 @@
package plugin
import (
"encoding/json"
"sort"
"strings"
"time"
"github.com/seaweedfs/seaweedfs/weed/glog"
"github.com/seaweedfs/seaweedfs/weed/pb/plugin_pb"
"google.golang.org/protobuf/encoding/protojson"
)
const (
maxTrackedJobsTotal = 1000
maxActivityRecords = 4000
maxRelatedJobs = 100
)
var (
StateSucceeded = strings.ToLower(plugin_pb.JobState_JOB_STATE_SUCCEEDED.String())
StateFailed = strings.ToLower(plugin_pb.JobState_JOB_STATE_FAILED.String())
StateCanceled = strings.ToLower(plugin_pb.JobState_JOB_STATE_CANCELED.String())
)
// activityLess reports whether activity a occurred after activity b (newest-first order).
// A nil OccurredAt is treated as the zero time.
func activityLess(a, b JobActivity) bool {
ta := time.Time{}
if a.OccurredAt != nil {
ta = *a.OccurredAt
}
tb := time.Time{}
if b.OccurredAt != nil {
tb = *b.OccurredAt
}
return ta.After(tb)
}
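// loadPersistedMonitorState restores tracked jobs and activity records from
// the store so the monitor survives admin restarts. Jobs that still carry
// legacy inline detail payloads are migrated into per-job detail files, and
// only lightweight status fields are kept in memory.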
func (r *Plugin) loadPersistedMonitorState() error {
trackedJobs, err := r.store.LoadTrackedJobs()
if err != nil {
return err
}
activities, err := r.store.LoadActivities()
if err != nil {
return err
}
if len(trackedJobs) > 0 {
r.jobsMu.Lock()
for i := range trackedJobs {
job := trackedJobs[i]
if strings.TrimSpace(job.JobID) == "" {
continue
}
// Backward compatibility: migrate older inline detail payloads
// out of tracked_jobs.json into dedicated per-job detail files.
if hasTrackedJobRichDetails(job) {
if err := r.store.SaveJobDetail(job); err != nil {
glog.Warningf("Plugin failed to migrate detail snapshot for job %s: %v", job.JobID, err)
}
}
stripTrackedJobDetailFields(&job)
jobCopy := job
r.jobs[job.JobID] = &jobCopy
}
r.pruneTrackedJobsLocked()
r.jobsMu.Unlock()
}
if len(activities) > maxActivityRecords {
activities = activities[len(activities)-maxActivityRecords:]
}
if len(activities) > 0 {
r.activitiesMu.Lock()
r.activities = append([]JobActivity(nil), activities...)
r.activitiesMu.Unlock()
}
return nil
}
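// ListTrackedJobs returns tracked jobs filtered by job type and state,
// sorted newest-first by update time (job id as tiebreaker), optionally
// truncated to limit.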
func (r *Plugin) ListTrackedJobs(jobType string, state string, limit int) []TrackedJob {
r.jobsMu.RLock()
defer r.jobsMu.RUnlock()
normalizedJobType := strings.TrimSpace(jobType)
normalizedState := strings.TrimSpace(strings.ToLower(state))
items := make([]TrackedJob, 0, len(r.jobs))
for _, job := range r.jobs {
if job == nil {
continue
}
if normalizedJobType != "" && job.JobType != normalizedJobType {
continue
}
if normalizedState != "" && strings.ToLower(job.State) != normalizedState {
continue
}
items = append(items, cloneTrackedJob(*job))
}
sort.Slice(items, func(i, j int) bool {
ti := time.Time{}
if items[i].UpdatedAt != nil {
ti = *items[i].UpdatedAt
}
tj := time.Time{}
if items[j].UpdatedAt != nil {
tj = *items[j].UpdatedAt
}
if !ti.Equal(tj) {
return ti.After(tj)
}
return items[i].JobID < items[j].JobID
})
if limit > 0 && len(items) > limit {
items = items[:limit]
}
return items
}
func (r *Plugin) GetTrackedJob(jobID string) (*TrackedJob, bool) {
r.jobsMu.RLock()
defer r.jobsMu.RUnlock()
job, ok := r.jobs[jobID]
if !ok || job == nil {
return nil, false
}
clone := cloneTrackedJob(*job)
return &clone, true
}
func (r *Plugin) ListActivities(jobType string, limit int) []JobActivity {
r.activitiesMu.RLock()
defer r.activitiesMu.RUnlock()
normalized := strings.TrimSpace(jobType)
activities := make([]JobActivity, 0, len(r.activities))
for _, activity := range r.activities {
if normalized != "" && activity.JobType != normalized {
continue
}
activities = append(activities, activity)
}
sort.Slice(activities, func(i, j int) bool {
return activityLess(activities[i], activities[j])
})
if limit > 0 && len(activities) > limit {
activities = activities[:limit]
}
return activities
}
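// ListJobActivities returns the oldest-first activity timeline for a single
// job, keeping only the most recent entries when limit is positive.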
func (r *Plugin) ListJobActivities(jobID string, limit int) []JobActivity {
normalizedJobID := strings.TrimSpace(jobID)
if normalizedJobID == "" {
return nil
}
r.activitiesMu.RLock()
activities := make([]JobActivity, 0, len(r.activities))
for _, activity := range r.activities {
if strings.TrimSpace(activity.JobID) != normalizedJobID {
continue
}
activities = append(activities, activity)
}
r.activitiesMu.RUnlock()
sort.Slice(activities, func(i, j int) bool {
return !activityLess(activities[i], activities[j]) // oldest-first for job timeline
})
if limit > 0 && len(activities) > limit {
activities = activities[len(activities)-limit:]
}
return activities
}
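// BuildJobDetail assembles the full detail view for a job: the disk-backed
// detail snapshot merged with fresher in-memory status, the job's activity
// timeline, its run record from history, and optionally related jobs of the
// same type.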
func (r *Plugin) BuildJobDetail(jobID string, activityLimit int, relatedLimit int) (*JobDetail, bool, error) {
normalizedJobID := strings.TrimSpace(jobID)
if normalizedJobID == "" {
return nil, false, nil
}
// Clamp relatedLimit to a safe range to avoid excessive memory allocation from untrusted input.
if relatedLimit <= 0 {
relatedLimit = 0
} else if relatedLimit > maxRelatedJobs {
relatedLimit = maxRelatedJobs
}
r.jobsMu.RLock()
trackedSnapshot, ok := r.jobs[normalizedJobID]
if ok && trackedSnapshot != nil {
candidate := cloneTrackedJob(*trackedSnapshot)
stripTrackedJobDetailFields(&candidate)
trackedSnapshot = &candidate
} else {
trackedSnapshot = nil
}
r.jobsMu.RUnlock()
detailJob, err := r.store.LoadJobDetail(normalizedJobID)
if err != nil {
return nil, false, err
}
if trackedSnapshot == nil && detailJob == nil {
return nil, false, nil
}
if detailJob == nil && trackedSnapshot != nil {
clone := cloneTrackedJob(*trackedSnapshot)
detailJob = &clone
}
if detailJob == nil {
return nil, false, nil
}
if trackedSnapshot != nil {
mergeTrackedStatusIntoDetail(detailJob, trackedSnapshot)
}
detailJob.Parameters = enrichTrackedJobParameters(detailJob.JobType, detailJob.Parameters)
r.activitiesMu.RLock()
activities := append([]JobActivity(nil), r.activities...)
r.activitiesMu.RUnlock()
detail := &JobDetail{
Job: detailJob,
Activities: filterJobActivitiesFromSlice(activities, normalizedJobID, activityLimit),
LastUpdated: timeToPtr(time.Now().UTC()),
}
if history, err := r.store.LoadRunHistory(detailJob.JobType); err != nil {
return nil, true, err
} else if history != nil {
for i := range history.SuccessfulRuns {
record := history.SuccessfulRuns[i]
if strings.TrimSpace(record.JobID) == normalizedJobID {
recordCopy := record
detail.RunRecord = &recordCopy
break
}
}
if detail.RunRecord == nil {
for i := range history.ErrorRuns {
record := history.ErrorRuns[i]
if strings.TrimSpace(record.JobID) == normalizedJobID {
recordCopy := record
detail.RunRecord = &recordCopy
break
}
}
}
}
if relatedLimit > 0 {
related := make([]TrackedJob, 0, relatedLimit)
r.jobsMu.RLock()
for _, candidate := range r.jobs {
if candidate == nil {
continue
}
if strings.TrimSpace(candidate.JobType) != strings.TrimSpace(detailJob.JobType) {
continue
}
if strings.TrimSpace(candidate.JobID) == normalizedJobID {
continue
}
cloned := cloneTrackedJob(*candidate)
stripTrackedJobDetailFields(&cloned)
related = append(related, cloned)
if len(related) >= relatedLimit {
break
}
}
r.jobsMu.RUnlock()
detail.RelatedJobs = related
}
return detail, true, nil
}
func filterJobActivitiesFromSlice(all []JobActivity, jobID string, limit int) []JobActivity {
if strings.TrimSpace(jobID) == "" || len(all) == 0 {
return nil
}
activities := make([]JobActivity, 0, len(all))
for _, activity := range all {
if strings.TrimSpace(activity.JobID) != jobID {
continue
}
activities = append(activities, activity)
}
sort.Slice(activities, func(i, j int) bool {
return !activityLess(activities[i], activities[j]) // oldest-first for job timeline
})
if limit > 0 && len(activities) > limit {
activities = activities[len(activities)-limit:]
}
return activities
}
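// stripTrackedJobDetailFields clears the heavyweight fields that live in the
// per-job detail file, keeping the in-memory tracked job lightweight.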
func stripTrackedJobDetailFields(job *TrackedJob) {
if job == nil {
return
}
job.Detail = ""
job.Parameters = nil
job.Labels = nil
job.ResultOutputValues = nil
}
func hasTrackedJobRichDetails(job TrackedJob) bool {
return strings.TrimSpace(job.Detail) != "" ||
len(job.Parameters) > 0 ||
len(job.Labels) > 0 ||
len(job.ResultOutputValues) > 0
}
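// mergeTrackedStatusIntoDetail fills empty fields of the disk-backed detail
// from the in-memory status without overwriting values already present.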
func mergeTrackedStatusIntoDetail(detail *TrackedJob, tracked *TrackedJob) {
if detail == nil || tracked == nil {
return
}
if detail.JobType == "" {
detail.JobType = tracked.JobType
}
if detail.RequestID == "" {
detail.RequestID = tracked.RequestID
}
if detail.WorkerID == "" {
detail.WorkerID = tracked.WorkerID
}
if detail.DedupeKey == "" {
detail.DedupeKey = tracked.DedupeKey
}
if detail.Summary == "" {
detail.Summary = tracked.Summary
}
if detail.State == "" {
detail.State = tracked.State
}
if detail.Progress == 0 {
detail.Progress = tracked.Progress
}
if detail.Stage == "" {
detail.Stage = tracked.Stage
}
if detail.Message == "" {
detail.Message = tracked.Message
}
if detail.Attempt == 0 {
detail.Attempt = tracked.Attempt
}
if detail.CreatedAt == nil || detail.CreatedAt.IsZero() {
detail.CreatedAt = tracked.CreatedAt
}
if detail.UpdatedAt == nil || detail.UpdatedAt.IsZero() {
detail.UpdatedAt = tracked.UpdatedAt
}
if detail.CompletedAt == nil || detail.CompletedAt.IsZero() {
detail.CompletedAt = tracked.CompletedAt
}
if detail.ErrorMessage == "" {
detail.ErrorMessage = tracked.ErrorMessage
}
if detail.ResultSummary == "" {
detail.ResultSummary = tracked.ResultSummary
}
}
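// handleJobProgressUpdate applies a worker progress message to the tracked
// job (when a job id is present) and records the attached activity events.
// Detection-phase updates arrive without a job id and are recorded under the
// worker_detection source.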
func (r *Plugin) handleJobProgressUpdate(workerID string, update *plugin_pb.JobProgressUpdate) {
if update == nil {
return
}
now := time.Now().UTC()
resolvedWorkerID := strings.TrimSpace(workerID)
if strings.TrimSpace(update.JobId) != "" {
r.jobsMu.Lock()
job := r.jobs[update.JobId]
if job == nil {
job = &TrackedJob{
JobID: update.JobId,
JobType: update.JobType,
RequestID: update.RequestId,
WorkerID: resolvedWorkerID,
CreatedAt: timeToPtr(now),
}
r.jobs[update.JobId] = job
}
if update.JobType != "" {
job.JobType = update.JobType
}
if update.RequestId != "" {
job.RequestID = update.RequestId
}
if job.WorkerID != "" {
resolvedWorkerID = job.WorkerID
} else if resolvedWorkerID != "" {
job.WorkerID = resolvedWorkerID
}
job.State = strings.ToLower(update.State.String())
job.Progress = update.ProgressPercent
job.Stage = update.Stage
job.Message = update.Message
job.UpdatedAt = timeToPtr(now)
r.pruneTrackedJobsLocked()
r.dirtyJobs = true
r.jobsMu.Unlock()
}
r.trackWorkerActivities(update.JobType, update.JobId, update.RequestId, resolvedWorkerID, update.Activities)
if update.Message != "" || update.Stage != "" {
source := "worker_progress"
if strings.TrimSpace(update.JobId) == "" {
source = "worker_detection"
}
r.appendActivity(JobActivity{
JobID: update.JobId,
JobType: update.JobType,
RequestID: update.RequestId,
WorkerID: resolvedWorkerID,
Source: source,
Message: update.Message,
Stage: update.Stage,
OccurredAt: timeToPtr(now),
})
}
}
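// trackExecutionStart records a job assignment. The in-memory tracked job
// keeps only lightweight status fields; parameters, labels, and the detail
// text are persisted to the per-job detail snapshot on disk.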
func (r *Plugin) trackExecutionStart(requestID, workerID string, job *plugin_pb.JobSpec, attempt int32) {
if job == nil || strings.TrimSpace(job.JobId) == "" {
return
}
now := time.Now().UTC()
r.jobsMu.Lock()
tracked := r.jobs[job.JobId]
if tracked == nil {
tracked = &TrackedJob{
JobID: job.JobId,
CreatedAt: timeToPtr(now),
}
r.jobs[job.JobId] = tracked
}
tracked.JobType = job.JobType
tracked.RequestID = requestID
tracked.WorkerID = workerID
tracked.DedupeKey = job.DedupeKey
tracked.Summary = job.Summary
tracked.State = strings.ToLower(plugin_pb.JobState_JOB_STATE_ASSIGNED.String())
tracked.Progress = 0
tracked.Stage = "assigned"
tracked.Message = "job assigned to worker"
tracked.Attempt = attempt
if tracked.CreatedAt == nil || tracked.CreatedAt.IsZero() {
tracked.CreatedAt = timeToPtr(now)
}
tracked.UpdatedAt = timeToPtr(now)
trackedSnapshot := cloneTrackedJob(*tracked)
r.pruneTrackedJobsLocked()
r.dirtyJobs = true
r.jobsMu.Unlock()
r.persistJobDetailSnapshot(job.JobId, func(detail *TrackedJob) {
detail.JobID = job.JobId
detail.JobType = job.JobType
detail.RequestID = requestID
detail.WorkerID = workerID
detail.DedupeKey = job.DedupeKey
detail.Summary = job.Summary
detail.Detail = job.Detail
detail.Parameters = enrichTrackedJobParameters(job.JobType, configValueMapToPlain(job.Parameters))
if len(job.Labels) > 0 {
labels := make(map[string]string, len(job.Labels))
for key, value := range job.Labels {
labels[key] = value
}
detail.Labels = labels
} else {
detail.Labels = nil
}
detail.State = trackedSnapshot.State
detail.Progress = trackedSnapshot.Progress
detail.Stage = trackedSnapshot.Stage
detail.Message = trackedSnapshot.Message
detail.Attempt = attempt
if detail.CreatedAt == nil || detail.CreatedAt.IsZero() {
detail.CreatedAt = trackedSnapshot.CreatedAt
}
detail.UpdatedAt = trackedSnapshot.UpdatedAt
})
r.appendActivity(JobActivity{
JobID: job.JobId,
JobType: job.JobType,
RequestID: requestID,
WorkerID: workerID,
Source: "admin_dispatch",
Message: "job assigned",
Stage: "assigned",
OccurredAt: timeToPtr(now),
})
}
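// trackExecutionQueued marks a job as pending while it waits for an executor,
// mirroring trackExecutionStart's split between in-memory status and the
// on-disk detail snapshot.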
func (r *Plugin) trackExecutionQueued(job *plugin_pb.JobSpec) {
if job == nil || strings.TrimSpace(job.JobId) == "" {
return
}
now := time.Now().UTC()
r.jobsMu.Lock()
tracked := r.jobs[job.JobId]
if tracked == nil {
tracked = &TrackedJob{
JobID: job.JobId,
CreatedAt: timeToPtr(now),
}
r.jobs[job.JobId] = tracked
}
tracked.JobType = job.JobType
tracked.DedupeKey = job.DedupeKey
tracked.Summary = job.Summary
tracked.State = strings.ToLower(plugin_pb.JobState_JOB_STATE_PENDING.String())
tracked.Progress = 0
tracked.Stage = "queued"
tracked.Message = "waiting for available executor"
if tracked.CreatedAt == nil || tracked.CreatedAt.IsZero() {
tracked.CreatedAt = timeToPtr(now)
}
tracked.UpdatedAt = timeToPtr(now)
trackedSnapshot := cloneTrackedJob(*tracked)
r.pruneTrackedJobsLocked()
r.dirtyJobs = true
r.jobsMu.Unlock()
r.persistJobDetailSnapshot(job.JobId, func(detail *TrackedJob) {
detail.JobID = job.JobId
detail.JobType = job.JobType
detail.DedupeKey = job.DedupeKey
detail.Summary = job.Summary
detail.Detail = job.Detail
detail.Parameters = enrichTrackedJobParameters(job.JobType, configValueMapToPlain(job.Parameters))
if len(job.Labels) > 0 {
labels := make(map[string]string, len(job.Labels))
for key, value := range job.Labels {
labels[key] = value
}
detail.Labels = labels
} else {
detail.Labels = nil
}
detail.State = trackedSnapshot.State
detail.Progress = trackedSnapshot.Progress
detail.Stage = trackedSnapshot.Stage
detail.Message = trackedSnapshot.Message
if detail.CreatedAt == nil || detail.CreatedAt.IsZero() {
detail.CreatedAt = trackedSnapshot.CreatedAt
}
detail.UpdatedAt = trackedSnapshot.UpdatedAt
})
r.appendActivity(JobActivity{
JobID: job.JobId,
JobType: job.JobType,
Source: "admin_scheduler",
Message: "job queued for execution",
Stage: "queued",
OccurredAt: timeToPtr(now),
})
}
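// trackExecutionCompletion applies a JobCompleted message to the tracked job
// and its detail snapshot, and returns a clone of the final tracked state.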
func (r *Plugin) trackExecutionCompletion(completed *plugin_pb.JobCompleted) *TrackedJob {
if completed == nil || strings.TrimSpace(completed.JobId) == "" {
return nil
}
now := time.Now().UTC()
if completed.CompletedAt != nil {
now = completed.CompletedAt.AsTime().UTC()
}
r.jobsMu.Lock()
tracked := r.jobs[completed.JobId]
if tracked == nil {
tracked = &TrackedJob{
JobID: completed.JobId,
CreatedAt: timeToPtr(now),
}
r.jobs[completed.JobId] = tracked
}
if completed.JobType != "" {
tracked.JobType = completed.JobType
}
if completed.RequestId != "" {
tracked.RequestID = completed.RequestId
}
if completed.Success {
tracked.State = strings.ToLower(plugin_pb.JobState_JOB_STATE_SUCCEEDED.String())
tracked.Progress = 100
tracked.Stage = "completed"
if completed.Result != nil {
tracked.ResultSummary = completed.Result.Summary
}
tracked.Message = tracked.ResultSummary
if tracked.Message == "" {
tracked.Message = "completed"
}
tracked.ErrorMessage = ""
} else {
tracked.State = strings.ToLower(plugin_pb.JobState_JOB_STATE_FAILED.String())
tracked.Stage = "failed"
tracked.ErrorMessage = completed.ErrorMessage
tracked.Message = completed.ErrorMessage
}
tracked.UpdatedAt = timeToPtr(now)
tracked.CompletedAt = timeToPtr(now)
r.pruneTrackedJobsLocked()
clone := cloneTrackedJob(*tracked)
r.dirtyJobs = true
r.jobsMu.Unlock()
r.persistJobDetailSnapshot(completed.JobId, func(detail *TrackedJob) {
detail.JobID = completed.JobId
if completed.JobType != "" {
detail.JobType = completed.JobType
}
if completed.RequestId != "" {
detail.RequestID = completed.RequestId
}
detail.State = clone.State
detail.Progress = clone.Progress
detail.Stage = clone.Stage
detail.Message = clone.Message
detail.ErrorMessage = clone.ErrorMessage
detail.ResultSummary = clone.ResultSummary
if completed.Success && completed.Result != nil {
detail.ResultOutputValues = configValueMapToPlain(completed.Result.OutputValues)
} else {
detail.ResultOutputValues = nil
}
if detail.CreatedAt == nil || detail.CreatedAt.IsZero() {
detail.CreatedAt = clone.CreatedAt
}
if detail.UpdatedAt == nil || detail.UpdatedAt.IsZero() {
detail.UpdatedAt = clone.UpdatedAt
}
if detail.CompletedAt == nil || detail.CompletedAt.IsZero() {
detail.CompletedAt = clone.CompletedAt
}
})
r.appendActivity(JobActivity{
JobID: completed.JobId,
JobType: completed.JobType,
RequestID: completed.RequestId,
WorkerID: clone.WorkerID,
Source: "worker_completion",
Message: clone.Message,
Stage: clone.Stage,
OccurredAt: timeToPtr(now),
})
return &clone
}
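// trackWorkerActivities records worker-emitted activity events, preferring
// each event's own timestamp when one is present.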
func (r *Plugin) trackWorkerActivities(jobType, jobID, requestID, workerID string, events []*plugin_pb.ActivityEvent) {
if len(events) == 0 {
return
}
for _, event := range events {
if event == nil {
continue
}
timestamp := time.Now().UTC()
if event.CreatedAt != nil {
timestamp = event.CreatedAt.AsTime().UTC()
}
r.appendActivity(JobActivity{
JobID: jobID,
JobType: jobType,
RequestID: requestID,
WorkerID: workerID,
Source: strings.ToLower(event.Source.String()),
Message: event.Message,
Stage: event.Stage,
Details: configValueMapToPlain(event.Details),
OccurredAt: timeToPtr(timestamp),
})
}
}
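// appendActivity records an activity, stamping a missing timestamp with the
// current time and trimming the buffer to maxActivityRecords.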
func (r *Plugin) appendActivity(activity JobActivity) {
if activity.OccurredAt == nil || activity.OccurredAt.IsZero() {
activity.OccurredAt = timeToPtr(time.Now().UTC())
}
r.activitiesMu.Lock()
r.activities = append(r.activities, activity)
if len(r.activities) > maxActivityRecords {
r.activities = r.activities[len(r.activities)-maxActivityRecords:]
}
r.dirtyActivities = true
r.activitiesMu.Unlock()
}
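// pruneTrackedJobsLocked evicts the oldest terminal (succeeded, failed, or
// canceled) jobs once the map exceeds maxTrackedJobsTotal. Active jobs are
// never evicted, so the map can temporarily exceed the cap. Callers must
// hold jobsMu.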
func (r *Plugin) pruneTrackedJobsLocked() {
if len(r.jobs) <= maxTrackedJobsTotal {
return
}
type sortableJob struct {
jobID string
updatedAt time.Time
}
terminalJobs := make([]sortableJob, 0)
for jobID, job := range r.jobs {
if job == nil {
continue
}
if job.State == StateSucceeded ||
job.State == StateFailed ||
job.State == StateCanceled {
updAt := time.Time{}
if job.UpdatedAt != nil {
updAt = *job.UpdatedAt
}
terminalJobs = append(terminalJobs, sortableJob{jobID, updAt})
}
}
if len(terminalJobs) == 0 {
return
}
sort.Slice(terminalJobs, func(i, j int) bool {
return terminalJobs[i].updatedAt.Before(terminalJobs[j].updatedAt)
})
toDelete := len(r.jobs) - maxTrackedJobsTotal
if toDelete <= 0 {
return
}
if toDelete > len(terminalJobs) {
toDelete = len(terminalJobs)
}
for i := 0; i < toDelete; i++ {
delete(r.jobs, terminalJobs[i].jobID)
}
}
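// configValueMapToPlain converts a ConfigValue map into plain JSON-compatible
// values by round-tripping through protojson; it returns nil on any failure.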
func configValueMapToPlain(values map[string]*plugin_pb.ConfigValue) map[string]interface{} {
if len(values) == 0 {
return nil
}
payload, err := protojson.MarshalOptions{UseProtoNames: true}.Marshal(&plugin_pb.ValueMap{Fields: values})
if err != nil {
return nil
}
decoded := map[string]interface{}{}
if err := json.Unmarshal(payload, &decoded); err != nil {
return nil
}
fields, ok := decoded["fields"].(map[string]interface{})
if !ok {
return nil
}
return fields
}
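// persistTrackedJobsSnapshot writes a newest-first snapshot of lightweight
// tracked jobs to the store, capped at maxTrackedJobsTotal.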
func (r *Plugin) persistTrackedJobsSnapshot() {
r.jobsMu.Lock()
r.dirtyJobs = false
jobs := make([]TrackedJob, 0, len(r.jobs))
for _, job := range r.jobs {
if job == nil || strings.TrimSpace(job.JobID) == "" {
continue
}
clone := cloneTrackedJob(*job)
stripTrackedJobDetailFields(&clone)
jobs = append(jobs, clone)
}
r.jobsMu.Unlock()
if len(jobs) == 0 {
return
}
sort.Slice(jobs, func(i, j int) bool {
ti := time.Time{}
if jobs[i].UpdatedAt != nil {
ti = *jobs[i].UpdatedAt
}
tj := time.Time{}
if jobs[j].UpdatedAt != nil {
tj = *jobs[j].UpdatedAt
}
if !ti.Equal(tj) {
return ti.After(tj)
}
return jobs[i].JobID < jobs[j].JobID
})
if len(jobs) > maxTrackedJobsTotal {
jobs = jobs[:maxTrackedJobsTotal]
}
if err := r.store.SaveTrackedJobs(jobs); err != nil {
glog.Warningf("Plugin failed to persist tracked jobs: %v", err)
}
}
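// persistJobDetailSnapshot loads (or creates) the on-disk detail record for a
// job, applies the mutation, and saves it back, serialized by jobDetailsMu.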
func (r *Plugin) persistJobDetailSnapshot(jobID string, apply func(detail *TrackedJob)) {
normalizedJobID, _ := sanitizeJobID(jobID)
if normalizedJobID == "" {
return
}
r.jobDetailsMu.Lock()
defer r.jobDetailsMu.Unlock()
detail, err := r.store.LoadJobDetail(normalizedJobID)
if err != nil {
glog.Warningf("Plugin failed to load job detail snapshot for %s: %v", normalizedJobID, err)
return
}
if detail == nil {
detail = &TrackedJob{
JobID: normalizedJobID,
}
}
if apply != nil {
apply(detail)
}
if err := r.store.SaveJobDetail(*detail); err != nil {
glog.Warningf("Plugin failed to persist job detail snapshot for %s: %v", normalizedJobID, err)
}
}
func (r *Plugin) persistActivitiesSnapshot() {
r.activitiesMu.Lock()
r.dirtyActivities = false
activities := append([]JobActivity(nil), r.activities...)
r.activitiesMu.Unlock()
if len(activities) == 0 {
return
}
if len(activities) > maxActivityRecords {
activities = activities[len(activities)-maxActivityRecords:]
}
if err := r.store.SaveActivities(activities); err != nil {
glog.Warningf("Plugin failed to persist activities: %v", err)
}
}
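// persistenceLoop periodically flushes dirty tracked jobs and activities to
// the store and performs a final flush on shutdown.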
func (r *Plugin) persistenceLoop() {
defer r.wg.Done()
for {
select {
case <-r.shutdownCh:
r.persistTrackedJobsSnapshot()
r.persistActivitiesSnapshot()
return
case <-r.persistTicker.C:
r.jobsMu.RLock()
needsJobsFlush := r.dirtyJobs
r.jobsMu.RUnlock()
if needsJobsFlush {
r.persistTrackedJobsSnapshot()
}
r.activitiesMu.RLock()
needsActivitiesFlush := r.dirtyActivities
r.activitiesMu.RUnlock()
if needsActivitiesFlush {
r.persistActivitiesSnapshot()
}
}
}
}


@@ -0,0 +1,600 @@
package plugin
import (
"testing"
"time"
"github.com/seaweedfs/seaweedfs/weed/pb/plugin_pb"
"github.com/seaweedfs/seaweedfs/weed/pb/worker_pb"
"google.golang.org/protobuf/proto"
"google.golang.org/protobuf/types/known/timestamppb"
)
func TestPluginLoadsPersistedMonitorStateOnStart(t *testing.T) {
t.Parallel()
dataDir := t.TempDir()
store, err := NewConfigStore(dataDir)
if err != nil {
t.Fatalf("NewConfigStore: %v", err)
}
seedJobs := []TrackedJob{
{
JobID: "job-seeded",
JobType: "vacuum",
State: "running",
CreatedAt: timeToPtr(time.Now().UTC().Add(-2 * time.Minute)),
UpdatedAt: timeToPtr(time.Now().UTC().Add(-1 * time.Minute)),
},
}
seedActivities := []JobActivity{
{
JobID: "job-seeded",
JobType: "vacuum",
Source: "worker_progress",
Message: "seeded",
OccurredAt: timeToPtr(time.Now().UTC().Add(-30 * time.Second)),
},
}
if err := store.SaveTrackedJobs(seedJobs); err != nil {
t.Fatalf("SaveTrackedJobs: %v", err)
}
if err := store.SaveActivities(seedActivities); err != nil {
t.Fatalf("SaveActivities: %v", err)
}
pluginSvc, err := New(Options{DataDir: dataDir})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
gotJobs := pluginSvc.ListTrackedJobs("", "", 0)
if len(gotJobs) != 1 || gotJobs[0].JobID != "job-seeded" {
t.Fatalf("unexpected loaded jobs: %+v", gotJobs)
}
gotActivities := pluginSvc.ListActivities("", 0)
if len(gotActivities) != 1 || gotActivities[0].Message != "seeded" {
t.Fatalf("unexpected loaded activities: %+v", gotActivities)
}
}
func TestPluginPersistsMonitorStateAfterJobUpdates(t *testing.T) {
t.Parallel()
dataDir := t.TempDir()
pluginSvc, err := New(Options{DataDir: dataDir})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
job := &plugin_pb.JobSpec{
JobId: "job-persist",
JobType: "vacuum",
Summary: "persist test",
}
pluginSvc.trackExecutionStart("req-persist", "worker-a", job, 1)
pluginSvc.trackExecutionCompletion(&plugin_pb.JobCompleted{
RequestId: "req-persist",
JobId: "job-persist",
JobType: "vacuum",
Success: true,
Result: &plugin_pb.JobResult{Summary: "done"},
CompletedAt: timestamppb.New(time.Now().UTC()),
})
pluginSvc.Shutdown()
store, err := NewConfigStore(dataDir)
if err != nil {
t.Fatalf("NewConfigStore: %v", err)
}
trackedJobs, err := store.LoadTrackedJobs()
if err != nil {
t.Fatalf("LoadTrackedJobs: %v", err)
}
if len(trackedJobs) == 0 {
t.Fatalf("expected persisted tracked jobs")
}
found := false
for _, tracked := range trackedJobs {
if tracked.JobID == "job-persist" {
found = true
if tracked.State == "" {
t.Fatalf("persisted job state should not be empty")
}
}
}
if !found {
t.Fatalf("persisted tracked jobs missing job-persist")
}
activities, err := store.LoadActivities()
if err != nil {
t.Fatalf("LoadActivities: %v", err)
}
if len(activities) == 0 {
t.Fatalf("expected persisted activities")
}
}
func TestTrackExecutionQueuedMarksPendingState(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
pluginSvc.trackExecutionQueued(&plugin_pb.JobSpec{
JobId: "job-pending-1",
JobType: "vacuum",
DedupeKey: "vacuum:1",
Summary: "pending queue item",
})
jobs := pluginSvc.ListTrackedJobs("vacuum", "", 10)
if len(jobs) != 1 {
t.Fatalf("expected one tracked pending job, got=%d", len(jobs))
}
job := jobs[0]
if job.JobID != "job-pending-1" {
t.Fatalf("unexpected pending job id: %s", job.JobID)
}
if job.State != "job_state_pending" {
t.Fatalf("unexpected pending job state: %s", job.State)
}
if job.Stage != "queued" {
t.Fatalf("unexpected pending job stage: %s", job.Stage)
}
activities := pluginSvc.ListActivities("vacuum", 50)
found := false
for _, activity := range activities {
if activity.JobID == "job-pending-1" && activity.Stage == "queued" && activity.Source == "admin_scheduler" {
found = true
break
}
}
if !found {
t.Fatalf("expected queued activity for pending job")
}
}
func TestHandleJobProgressUpdateCarriesWorkerIDInActivities(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
job := &plugin_pb.JobSpec{
JobId: "job-progress-worker",
JobType: "vacuum",
}
pluginSvc.trackExecutionStart("req-progress-worker", "worker-a", job, 1)
pluginSvc.handleJobProgressUpdate("worker-a", &plugin_pb.JobProgressUpdate{
RequestId: "req-progress-worker",
JobId: "job-progress-worker",
JobType: "vacuum",
State: plugin_pb.JobState_JOB_STATE_RUNNING,
ProgressPercent: 42.0,
Stage: "scan",
Message: "in progress",
Activities: []*plugin_pb.ActivityEvent{
{
Source: plugin_pb.ActivitySource_ACTIVITY_SOURCE_EXECUTOR,
Message: "volume scanned",
Stage: "scan",
},
},
})
activities := pluginSvc.ListActivities("vacuum", 0)
if len(activities) == 0 {
t.Fatalf("expected activity entries")
}
foundProgress := false
foundEvent := false
for _, activity := range activities {
if activity.Source == "worker_progress" && activity.Message == "in progress" {
foundProgress = true
if activity.WorkerID != "worker-a" {
t.Fatalf("worker_progress activity worker mismatch: got=%q want=%q", activity.WorkerID, "worker-a")
}
}
if activity.Message == "volume scanned" {
foundEvent = true
if activity.WorkerID != "worker-a" {
t.Fatalf("worker event worker mismatch: got=%q want=%q", activity.WorkerID, "worker-a")
}
}
}
if !foundProgress {
t.Fatalf("expected worker_progress activity")
}
if !foundEvent {
t.Fatalf("expected worker activity event")
}
}
func TestHandleJobProgressUpdateWithoutJobIDTracksDetectionActivities(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
pluginSvc.handleJobProgressUpdate("worker-detector", &plugin_pb.JobProgressUpdate{
RequestId: "detect-req-1",
JobType: "vacuum",
State: plugin_pb.JobState_JOB_STATE_RUNNING,
Stage: "decision_summary",
Message: "VACUUM: No tasks created for 3 volumes",
Activities: []*plugin_pb.ActivityEvent{
{
Source: plugin_pb.ActivitySource_ACTIVITY_SOURCE_DETECTOR,
Stage: "decision_summary",
Message: "VACUUM: No tasks created for 3 volumes",
},
},
})
activities := pluginSvc.ListActivities("vacuum", 0)
if len(activities) == 0 {
t.Fatalf("expected activity entries")
}
foundDetectionProgress := false
foundDetectorEvent := false
for _, activity := range activities {
if activity.RequestID != "detect-req-1" {
continue
}
if activity.Source == "worker_detection" {
foundDetectionProgress = true
if activity.WorkerID != "worker-detector" {
t.Fatalf("worker_detection worker mismatch: got=%q want=%q", activity.WorkerID, "worker-detector")
}
}
if activity.Source == "activity_source_detector" {
foundDetectorEvent = true
if activity.WorkerID != "worker-detector" {
t.Fatalf("detector event worker mismatch: got=%q want=%q", activity.WorkerID, "worker-detector")
}
}
}
if !foundDetectionProgress {
t.Fatalf("expected worker_detection activity")
}
if !foundDetectorEvent {
t.Fatalf("expected detector activity event")
}
}
func TestHandleJobCompletedCarriesWorkerIDInActivitiesAndRunHistory(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
job := &plugin_pb.JobSpec{
JobId: "job-complete-worker",
JobType: "vacuum",
}
pluginSvc.trackExecutionStart("req-complete-worker", "worker-b", job, 1)
pluginSvc.handleJobCompleted(&plugin_pb.JobCompleted{
RequestId: "req-complete-worker",
JobId: "job-complete-worker",
JobType: "vacuum",
Success: true,
Activities: []*plugin_pb.ActivityEvent{
{
Source: plugin_pb.ActivitySource_ACTIVITY_SOURCE_EXECUTOR,
Message: "finalizer done",
Stage: "finalize",
},
},
CompletedAt: timestamppb.Now(),
})
pluginSvc.Shutdown()
activities := pluginSvc.ListActivities("vacuum", 0)
foundWorkerEvent := false
for _, activity := range activities {
if activity.Message == "finalizer done" {
foundWorkerEvent = true
if activity.WorkerID != "worker-b" {
t.Fatalf("worker completion event worker mismatch: got=%q want=%q", activity.WorkerID, "worker-b")
}
}
}
if !foundWorkerEvent {
t.Fatalf("expected completion worker event activity")
}
history, err := pluginSvc.LoadRunHistory("vacuum")
if err != nil {
t.Fatalf("LoadRunHistory: %v", err)
}
if history == nil || len(history.SuccessfulRuns) == 0 {
t.Fatalf("expected successful run history entry")
}
if history.SuccessfulRuns[0].WorkerID != "worker-b" {
t.Fatalf("run history worker mismatch: got=%q want=%q", history.SuccessfulRuns[0].WorkerID, "worker-b")
}
}
func TestTrackExecutionStartStoresJobPayloadDetails(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{DataDir: t.TempDir()})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
pluginSvc.trackExecutionStart("req-payload", "worker-c", &plugin_pb.JobSpec{
JobId: "job-payload",
JobType: "vacuum",
Summary: "payload summary",
Detail: "payload detail",
Parameters: map[string]*plugin_pb.ConfigValue{
"volume_id": {
Kind: &plugin_pb.ConfigValue_Int64Value{Int64Value: 9},
},
},
Labels: map[string]string{
"source": "detector",
},
}, 2)
pluginSvc.Shutdown()
job, found := pluginSvc.GetTrackedJob("job-payload")
if !found || job == nil {
t.Fatalf("expected tracked job")
}
if job.Detail != "" {
t.Fatalf("expected in-memory tracked job detail to be stripped, got=%q", job.Detail)
}
if job.Attempt != 2 {
t.Fatalf("unexpected attempt: %d", job.Attempt)
}
if len(job.Labels) != 0 {
t.Fatalf("expected in-memory labels to be stripped, got=%+v", job.Labels)
}
if len(job.Parameters) != 0 {
t.Fatalf("expected in-memory parameters to be stripped, got=%+v", job.Parameters)
}
detail, found, err := pluginSvc.BuildJobDetail("job-payload", 100, 0)
if err != nil {
t.Fatalf("BuildJobDetail: %v", err)
}
if !found || detail == nil || detail.Job == nil {
t.Fatalf("expected disk-backed job detail")
}
if detail.Job.Detail != "payload detail" {
t.Fatalf("unexpected disk-backed detail: %q", detail.Job.Detail)
}
if got := detail.Job.Labels["source"]; got != "detector" {
t.Fatalf("unexpected disk-backed label source: %q", got)
}
if got, ok := detail.Job.Parameters["volume_id"].(map[string]interface{}); !ok || got["int64_value"] != "9" {
t.Fatalf("unexpected disk-backed parameters payload: %#v", detail.Job.Parameters["volume_id"])
}
}
func TestTrackExecutionStartStoresErasureCodingExecutionPlan(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{DataDir: t.TempDir()})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
taskParams := &worker_pb.TaskParams{
TaskId: "task-ec-1",
VolumeId: 29,
Collection: "photos",
Sources: []*worker_pb.TaskSource{
{
Node: "source-a:8080",
DataCenter: "dc1",
Rack: "rack1",
VolumeId: 29,
},
},
Targets: []*worker_pb.TaskTarget{
{
Node: "target-a:8080",
DataCenter: "dc1",
Rack: "rack2",
VolumeId: 29,
ShardIds: []uint32{0, 10},
},
{
Node: "target-b:8080",
DataCenter: "dc2",
Rack: "rack3",
VolumeId: 29,
ShardIds: []uint32{1, 11},
},
},
TaskParams: &worker_pb.TaskParams_ErasureCodingParams{
ErasureCodingParams: &worker_pb.ErasureCodingTaskParams{
DataShards: 10,
ParityShards: 4,
},
},
}
payload, err := proto.Marshal(taskParams)
if err != nil {
t.Fatalf("Marshal task params: %v", err)
}
pluginSvc.trackExecutionStart("req-ec-plan", "worker-ec", &plugin_pb.JobSpec{
JobId: "job-ec-plan",
JobType: "erasure_coding",
Parameters: map[string]*plugin_pb.ConfigValue{
"task_params_pb": {
Kind: &plugin_pb.ConfigValue_BytesValue{BytesValue: payload},
},
},
}, 1)
pluginSvc.Shutdown()
detail, found, err := pluginSvc.BuildJobDetail("job-ec-plan", 100, 0)
if err != nil {
t.Fatalf("BuildJobDetail: %v", err)
}
if !found || detail == nil || detail.Job == nil {
t.Fatalf("expected disk-backed detail")
}
rawPlan, ok := detail.Job.Parameters["execution_plan"]
if !ok {
t.Fatalf("expected execution_plan in parameters, got=%+v", detail.Job.Parameters)
}
plan, ok := rawPlan.(map[string]interface{})
if !ok {
t.Fatalf("unexpected execution_plan type: %T", rawPlan)
}
if plan["job_type"] != "erasure_coding" {
t.Fatalf("unexpected execution plan job type: %+v", plan["job_type"])
}
if plan["volume_id"] != float64(29) {
t.Fatalf("unexpected execution plan volume id: %+v", plan["volume_id"])
}
targets, ok := plan["targets"].([]interface{})
if !ok || len(targets) != 2 {
t.Fatalf("unexpected targets in execution plan: %+v", plan["targets"])
}
assignments, ok := plan["shard_assignments"].([]interface{})
if !ok || len(assignments) != 4 {
t.Fatalf("unexpected shard assignments in execution plan: %+v", plan["shard_assignments"])
}
firstAssignment, ok := assignments[0].(map[string]interface{})
if !ok {
t.Fatalf("unexpected first assignment payload: %+v", assignments[0])
}
if firstAssignment["shard_id"] != float64(0) || firstAssignment["kind"] != "data" {
t.Fatalf("unexpected first assignment: %+v", firstAssignment)
}
}
func TestBuildJobDetailIncludesActivitiesAndRunRecord(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{DataDir: t.TempDir()})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
pluginSvc.trackExecutionStart("req-detail", "worker-z", &plugin_pb.JobSpec{
JobId: "job-detail",
JobType: "vacuum",
Summary: "detail summary",
}, 1)
pluginSvc.handleJobProgressUpdate("worker-z", &plugin_pb.JobProgressUpdate{
RequestId: "req-detail",
JobId: "job-detail",
JobType: "vacuum",
State: plugin_pb.JobState_JOB_STATE_RUNNING,
Stage: "scan",
Message: "scanning volume",
})
pluginSvc.handleJobCompleted(&plugin_pb.JobCompleted{
RequestId: "req-detail",
JobId: "job-detail",
JobType: "vacuum",
Success: true,
Result: &plugin_pb.JobResult{
Summary: "done",
OutputValues: map[string]*plugin_pb.ConfigValue{
"affected": {
Kind: &plugin_pb.ConfigValue_Int64Value{Int64Value: 1},
},
},
},
CompletedAt: timestamppb.Now(),
})
pluginSvc.Shutdown()
detail, found, err := pluginSvc.BuildJobDetail("job-detail", 100, 5)
if err != nil {
t.Fatalf("BuildJobDetail error: %v", err)
}
if !found || detail == nil {
t.Fatalf("expected job detail")
}
if detail.Job == nil || detail.Job.JobID != "job-detail" {
t.Fatalf("unexpected job detail payload: %+v", detail.Job)
}
if detail.RunRecord == nil || detail.RunRecord.JobID != "job-detail" {
t.Fatalf("expected run record for job-detail, got=%+v", detail.RunRecord)
}
if len(detail.Activities) == 0 {
t.Fatalf("expected activity timeline entries")
}
if detail.Job.ResultOutputValues == nil {
t.Fatalf("expected result output values")
}
}
func TestBuildJobDetailLoadsFromDiskWhenMemoryCleared(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{DataDir: t.TempDir()})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
pluginSvc.trackExecutionStart("req-disk", "worker-d", &plugin_pb.JobSpec{
JobId: "job-disk",
JobType: "vacuum",
Summary: "disk summary",
Detail: "disk detail payload",
}, 1)
pluginSvc.Shutdown()
pluginSvc.jobsMu.Lock()
pluginSvc.jobs = map[string]*TrackedJob{}
pluginSvc.jobsMu.Unlock()
pluginSvc.activitiesMu.Lock()
pluginSvc.activities = nil
pluginSvc.activitiesMu.Unlock()
detail, found, err := pluginSvc.BuildJobDetail("job-disk", 100, 0)
if err != nil {
t.Fatalf("BuildJobDetail: %v", err)
}
if !found || detail == nil || detail.Job == nil {
t.Fatalf("expected detail from disk")
}
if detail.Job.Detail != "disk detail payload" {
t.Fatalf("unexpected disk detail payload: %q", detail.Job.Detail)
}
}


@@ -0,0 +1,945 @@
package plugin
import (
"context"
"errors"
"fmt"
"strings"
"sync"
"time"
"github.com/seaweedfs/seaweedfs/weed/glog"
"github.com/seaweedfs/seaweedfs/weed/pb/plugin_pb"
"google.golang.org/protobuf/types/known/timestamppb"
)
var errExecutorAtCapacity = errors.New("executor is at capacity")
const (
defaultSchedulerTick = 5 * time.Second
defaultScheduledDetectionInterval = 300 * time.Second
defaultScheduledDetectionTimeout = 45 * time.Second
defaultScheduledExecutionTimeout = 90 * time.Second
defaultScheduledMaxResults int32 = 1000
defaultScheduledExecutionConcurrency = 1
defaultScheduledPerWorkerConcurrency = 1
maxScheduledExecutionConcurrency = 128
defaultScheduledRetryBackoff = 5 * time.Second
defaultClusterContextTimeout = 10 * time.Second
defaultWaitingBacklogFloor = 8
defaultWaitingBacklogMultiplier = 4
)
type schedulerPolicy struct {
DetectionInterval time.Duration
DetectionTimeout time.Duration
ExecutionTimeout time.Duration
RetryBackoff time.Duration
MaxResults int32
ExecutionConcurrency int
PerWorkerConcurrency int
RetryLimit int
ExecutorReserveBackoff time.Duration
}
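// schedulerLoop drives scheduled detection on a fixed tick until shutdown.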
func (r *Plugin) schedulerLoop() {
defer r.wg.Done()
ticker := time.NewTicker(r.schedulerTick)
defer ticker.Stop()
// Try once immediately on startup.
r.runSchedulerTick()
for {
select {
case <-r.shutdownCh:
return
case <-ticker.C:
r.runSchedulerTick()
}
}
}
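// runSchedulerTick evaluates every detectable job type: it loads the job
// type's policy, launches a detection run if one is due, and prunes scheduler
// state and detector leases for job types no worker advertises anymore.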
func (r *Plugin) runSchedulerTick() {
jobTypes := r.registry.DetectableJobTypes()
if len(jobTypes) == 0 {
return
}
active := make(map[string]struct{}, len(jobTypes))
for _, jobType := range jobTypes {
active[jobType] = struct{}{}
policy, enabled, err := r.loadSchedulerPolicy(jobType)
if err != nil {
glog.Warningf("Plugin scheduler failed to load policy for %s: %v", jobType, err)
continue
}
if !enabled {
r.clearSchedulerJobType(jobType)
continue
}
if !r.markDetectionDue(jobType, policy.DetectionInterval) {
continue
}
r.wg.Add(1)
go func(jt string, p schedulerPolicy) {
defer r.wg.Done()
r.runScheduledDetection(jt, p)
}(jobType, policy)
}
r.pruneSchedulerState(active)
r.pruneDetectorLeases(active)
}
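// loadSchedulerPolicy builds the scheduler policy for a job type from its
// persisted config, falling back to the descriptor's runtime defaults, and
// clamps all values into safe ranges. The boolean result reports whether
// scheduling is enabled for the job type.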
func (r *Plugin) loadSchedulerPolicy(jobType string) (schedulerPolicy, bool, error) {
cfg, err := r.store.LoadJobTypeConfig(jobType)
if err != nil {
return schedulerPolicy{}, false, err
}
descriptor, err := r.store.LoadDescriptor(jobType)
if err != nil {
return schedulerPolicy{}, false, err
}
adminRuntime := deriveSchedulerAdminRuntime(cfg, descriptor)
if adminRuntime == nil {
return schedulerPolicy{}, false, nil
}
if !adminRuntime.Enabled {
return schedulerPolicy{}, false, nil
}
policy := schedulerPolicy{
DetectionInterval: durationFromSeconds(adminRuntime.DetectionIntervalSeconds, defaultScheduledDetectionInterval),
DetectionTimeout: durationFromSeconds(adminRuntime.DetectionTimeoutSeconds, defaultScheduledDetectionTimeout),
ExecutionTimeout: defaultScheduledExecutionTimeout,
RetryBackoff: durationFromSeconds(adminRuntime.RetryBackoffSeconds, defaultScheduledRetryBackoff),
MaxResults: adminRuntime.MaxJobsPerDetection,
ExecutionConcurrency: int(adminRuntime.GlobalExecutionConcurrency),
PerWorkerConcurrency: int(adminRuntime.PerWorkerExecutionConcurrency),
RetryLimit: int(adminRuntime.RetryLimit),
ExecutorReserveBackoff: 200 * time.Millisecond,
}
if policy.DetectionInterval < r.schedulerTick {
policy.DetectionInterval = r.schedulerTick
}
if policy.MaxResults <= 0 {
policy.MaxResults = defaultScheduledMaxResults
}
if policy.ExecutionConcurrency <= 0 {
policy.ExecutionConcurrency = defaultScheduledExecutionConcurrency
}
if policy.ExecutionConcurrency > maxScheduledExecutionConcurrency {
policy.ExecutionConcurrency = maxScheduledExecutionConcurrency
}
if policy.PerWorkerConcurrency <= 0 {
policy.PerWorkerConcurrency = defaultScheduledPerWorkerConcurrency
}
if policy.PerWorkerConcurrency > policy.ExecutionConcurrency {
policy.PerWorkerConcurrency = policy.ExecutionConcurrency
}
if policy.RetryLimit < 0 {
policy.RetryLimit = 0
}
// The plugin protocol currently exposes only a detection timeout in the
// admin settings, so derive the execution timeout as twice the detection
// timeout, floored at the scheduler default below.
execTimeout := time.Duration(adminRuntime.DetectionTimeoutSeconds*2) * time.Second
if execTimeout < defaultScheduledExecutionTimeout {
execTimeout = defaultScheduledExecutionTimeout
}
policy.ExecutionTimeout = execTimeout
return policy, true, nil
}
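// ListSchedulerStates reports per-job-type scheduler status for monitoring:
// enablement, effective policy values, next detection time, the leased or
// available detector, and executor availability.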
func (r *Plugin) ListSchedulerStates() ([]SchedulerJobTypeState, error) {
jobTypes, err := r.ListKnownJobTypes()
if err != nil {
return nil, err
}
r.schedulerMu.Lock()
nextDetectionAt := make(map[string]time.Time, len(r.nextDetectionAt))
for jobType, nextRun := range r.nextDetectionAt {
nextDetectionAt[jobType] = nextRun
}
detectionInFlight := make(map[string]bool, len(r.detectionInFlight))
for jobType, inFlight := range r.detectionInFlight {
detectionInFlight[jobType] = inFlight
}
r.schedulerMu.Unlock()
states := make([]SchedulerJobTypeState, 0, len(jobTypes))
for _, jobType := range jobTypes {
state := SchedulerJobTypeState{
JobType: jobType,
DetectionInFlight: detectionInFlight[jobType],
}
if nextRun, ok := nextDetectionAt[jobType]; ok && !nextRun.IsZero() {
nextRunUTC := nextRun.UTC()
state.NextDetectionAt = &nextRunUTC
}
policy, enabled, loadErr := r.loadSchedulerPolicy(jobType)
if loadErr != nil {
state.PolicyError = loadErr.Error()
} else {
state.Enabled = enabled
if enabled {
state.DetectionIntervalSeconds = secondsFromDuration(policy.DetectionInterval)
state.DetectionTimeoutSeconds = secondsFromDuration(policy.DetectionTimeout)
state.ExecutionTimeoutSeconds = secondsFromDuration(policy.ExecutionTimeout)
state.MaxJobsPerDetection = policy.MaxResults
state.GlobalExecutionConcurrency = policy.ExecutionConcurrency
state.PerWorkerExecutionConcurrency = policy.PerWorkerConcurrency
state.RetryLimit = policy.RetryLimit
state.RetryBackoffSeconds = secondsFromDuration(policy.RetryBackoff)
}
}
leasedWorkerID := r.getDetectorLease(jobType)
if leasedWorkerID != "" {
state.DetectorWorkerID = leasedWorkerID
if worker, ok := r.registry.Get(leasedWorkerID); ok {
if capability := worker.Capabilities[jobType]; capability != nil && capability.CanDetect {
state.DetectorAvailable = true
}
}
}
if state.DetectorWorkerID == "" {
detector, detectorErr := r.registry.PickDetector(jobType)
if detectorErr == nil && detector != nil {
state.DetectorAvailable = true
state.DetectorWorkerID = detector.WorkerID
}
}
executors, executorErr := r.registry.ListExecutors(jobType)
if executorErr == nil {
state.ExecutorWorkerCount = len(executors)
}
states = append(states, state)
}
return states, nil
}
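// deriveSchedulerAdminRuntime prefers the persisted admin runtime config and
// falls back to the descriptor's runtime defaults; it returns nil when
// neither is available.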
func deriveSchedulerAdminRuntime(
cfg *plugin_pb.PersistedJobTypeConfig,
descriptor *plugin_pb.JobTypeDescriptor,
) *plugin_pb.AdminRuntimeConfig {
if cfg != nil && cfg.AdminRuntime != nil {
adminConfig := *cfg.AdminRuntime
return &adminConfig
}
if descriptor == nil || descriptor.AdminRuntimeDefaults == nil {
return nil
}
defaults := descriptor.AdminRuntimeDefaults
return &plugin_pb.AdminRuntimeConfig{
Enabled: defaults.Enabled,
DetectionIntervalSeconds: defaults.DetectionIntervalSeconds,
DetectionTimeoutSeconds: defaults.DetectionTimeoutSeconds,
MaxJobsPerDetection: defaults.MaxJobsPerDetection,
GlobalExecutionConcurrency: defaults.GlobalExecutionConcurrency,
PerWorkerExecutionConcurrency: defaults.PerWorkerExecutionConcurrency,
RetryLimit: defaults.RetryLimit,
RetryBackoffSeconds: defaults.RetryBackoffSeconds,
}
}
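// markDetectionDue reports whether a detection run is due for the job type
// and, if so, claims it by recording the next run time and marking the
// detection in flight.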
func (r *Plugin) markDetectionDue(jobType string, interval time.Duration) bool {
now := time.Now().UTC()
r.schedulerMu.Lock()
defer r.schedulerMu.Unlock()
if r.detectionInFlight[jobType] {
return false
}
nextRun, exists := r.nextDetectionAt[jobType]
if exists && now.Before(nextRun) {
return false
}
r.nextDetectionAt[jobType] = now.Add(interval)
r.detectionInFlight[jobType] = true
return true
}
func (r *Plugin) finishDetection(jobType string) {
r.schedulerMu.Lock()
delete(r.detectionInFlight, jobType)
r.schedulerMu.Unlock()
}
func (r *Plugin) pruneSchedulerState(activeJobTypes map[string]struct{}) {
r.schedulerMu.Lock()
defer r.schedulerMu.Unlock()
for jobType := range r.nextDetectionAt {
if _, ok := activeJobTypes[jobType]; !ok {
delete(r.nextDetectionAt, jobType)
delete(r.detectionInFlight, jobType)
}
}
}
func (r *Plugin) clearSchedulerJobType(jobType string) {
r.schedulerMu.Lock()
delete(r.nextDetectionAt, jobType)
delete(r.detectionInFlight, jobType)
r.schedulerMu.Unlock()
r.clearDetectorLease(jobType, "")
}
func (r *Plugin) pruneDetectorLeases(activeJobTypes map[string]struct{}) {
r.detectorLeaseMu.Lock()
defer r.detectorLeaseMu.Unlock()
for jobType := range r.detectorLeases {
if _, ok := activeJobTypes[jobType]; !ok {
delete(r.detectorLeases, jobType)
}
}
}
func (r *Plugin) runScheduledDetection(jobType string, policy schedulerPolicy) {
defer r.finishDetection(jobType)
start := time.Now().UTC()
r.appendActivity(JobActivity{
JobType: jobType,
Source: "admin_scheduler",
Message: "scheduled detection started",
Stage: "detecting",
OccurredAt: timeToPtr(start),
})
if skip, waitingCount, waitingThreshold := r.shouldSkipDetectionForWaitingJobs(jobType, policy); skip {
r.appendActivity(JobActivity{
JobType: jobType,
Source: "admin_scheduler",
Message: fmt.Sprintf("scheduled detection skipped: waiting backlog %d reached threshold %d", waitingCount, waitingThreshold),
Stage: "skipped_waiting_backlog",
OccurredAt: timeToPtr(time.Now().UTC()),
})
return
}
clusterContext, err := r.loadSchedulerClusterContext()
if err != nil {
r.appendActivity(JobActivity{
JobType: jobType,
Source: "admin_scheduler",
Message: fmt.Sprintf("scheduled detection aborted: %v", err),
Stage: "failed",
OccurredAt: timeToPtr(time.Now().UTC()),
})
return
}
ctx, cancel := context.WithTimeout(context.Background(), policy.DetectionTimeout)
proposals, err := r.RunDetection(ctx, jobType, clusterContext, policy.MaxResults)
cancel()
if err != nil {
r.appendActivity(JobActivity{
JobType: jobType,
Source: "admin_scheduler",
Message: fmt.Sprintf("scheduled detection failed: %v", err),
Stage: "failed",
OccurredAt: timeToPtr(time.Now().UTC()),
})
return
}
r.appendActivity(JobActivity{
JobType: jobType,
Source: "admin_scheduler",
Message: fmt.Sprintf("scheduled detection completed: %d proposal(s)", len(proposals)),
Stage: "detected",
OccurredAt: timeToPtr(time.Now().UTC()),
})
filteredByActive, skippedActive := r.filterProposalsWithActiveJobs(jobType, proposals)
if skippedActive > 0 {
r.appendActivity(JobActivity{
JobType: jobType,
Source: "admin_scheduler",
Message: fmt.Sprintf("scheduled detection skipped %d proposal(s) due to active assigned/running jobs", skippedActive),
Stage: "deduped_active_jobs",
OccurredAt: timeToPtr(time.Now().UTC()),
})
}
if len(filteredByActive) == 0 {
return
}
filtered := r.filterScheduledProposals(filteredByActive)
if len(filtered) != len(filteredByActive) {
r.appendActivity(JobActivity{
JobType: jobType,
Source: "admin_scheduler",
Message: fmt.Sprintf("scheduled detection deduped %d proposal(s) within this run", len(filteredByActive)-len(filtered)),
Stage: "deduped",
OccurredAt: timeToPtr(time.Now().UTC()),
})
}
if len(filtered) == 0 {
return
}
r.dispatchScheduledProposals(jobType, filtered, clusterContext, policy)
}
func (r *Plugin) loadSchedulerClusterContext() (*plugin_pb.ClusterContext, error) {
if r.clusterContextProvider == nil {
return nil, fmt.Errorf("cluster context provider is not configured")
}
ctx, cancel := context.WithTimeout(context.Background(), defaultClusterContextTimeout)
defer cancel()
clusterContext, err := r.clusterContextProvider(ctx)
if err != nil {
return nil, err
}
if clusterContext == nil {
return nil, fmt.Errorf("cluster context provider returned nil")
}
return clusterContext, nil
}
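// dispatchScheduledProposals converts proposals into job specs, queues them,
// and executes them with a pool of ExecutionConcurrency dispatch workers.
// A job bounced by an at-capacity executor is marked queued again and retried
// after a short backoff.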
func (r *Plugin) dispatchScheduledProposals(
jobType string,
proposals []*plugin_pb.JobProposal,
clusterContext *plugin_pb.ClusterContext,
policy schedulerPolicy,
) {
jobQueue := make(chan *plugin_pb.JobSpec, len(proposals))
for index, proposal := range proposals {
job := buildScheduledJobSpec(jobType, proposal, index)
r.trackExecutionQueued(job)
select {
case <-r.shutdownCh:
close(jobQueue)
return
default:
jobQueue <- job
}
}
close(jobQueue)
var wg sync.WaitGroup
var statsMu sync.Mutex
successCount := 0
errorCount := 0
workerCount := policy.ExecutionConcurrency
if workerCount < 1 {
workerCount = 1
}
for i := 0; i < workerCount; i++ {
wg.Add(1)
go func() {
defer wg.Done()
for job := range jobQueue {
select {
case <-r.shutdownCh:
return
default:
}
for {
select {
case <-r.shutdownCh:
return
default:
}
executor, release, reserveErr := r.reserveScheduledExecutor(jobType, policy)
if reserveErr != nil {
select {
case <-r.shutdownCh:
return
default:
}
statsMu.Lock()
errorCount++
statsMu.Unlock()
r.appendActivity(JobActivity{
JobType: jobType,
Source: "admin_scheduler",
Message: fmt.Sprintf("scheduled execution reservation failed: %v", reserveErr),
Stage: "failed",
OccurredAt: timeToPtr(time.Now().UTC()),
})
break
}
err := r.executeScheduledJobWithExecutor(executor, job, clusterContext, policy)
release()
if errors.Is(err, errExecutorAtCapacity) {
r.trackExecutionQueued(job)
if !waitForShutdownOrTimer(r.shutdownCh, policy.ExecutorReserveBackoff) {
return
}
continue
}
if err != nil {
statsMu.Lock()
errorCount++
statsMu.Unlock()
r.appendActivity(JobActivity{
JobID: job.JobId,
JobType: job.JobType,
Source: "admin_scheduler",
Message: fmt.Sprintf("scheduled execution failed: %v", err),
Stage: "failed",
OccurredAt: timeToPtr(time.Now().UTC()),
})
break
}
statsMu.Lock()
successCount++
statsMu.Unlock()
break
}
}
}()
}
wg.Wait()
r.appendActivity(JobActivity{
JobType: jobType,
Source: "admin_scheduler",
Message: fmt.Sprintf("scheduled execution finished: success=%d error=%d", successCount, errorCount),
Stage: "executed",
OccurredAt: timeToPtr(time.Now().UTC()),
})
}
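// reserveScheduledExecutor polls the registry until an executor with spare
// capacity is found, the execution timeout elapses, or the plugin shuts down.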
func (r *Plugin) reserveScheduledExecutor(
jobType string,
policy schedulerPolicy,
) (*WorkerSession, func(), error) {
deadline := time.Now().Add(policy.ExecutionTimeout)
if policy.ExecutionTimeout <= 0 {
// No execution timeout configured: fall back to a 10-minute cap.
deadline = time.Now().Add(10 * time.Minute)
}
for {
select {
case <-r.shutdownCh:
return nil, nil, fmt.Errorf("plugin is shutting down")
default:
}
if time.Now().After(deadline) {
return nil, nil, fmt.Errorf("timed out waiting for executor capacity for %s", jobType)
}
executors, err := r.registry.ListExecutors(jobType)
if err != nil {
if !waitForShutdownOrTimer(r.shutdownCh, policy.ExecutorReserveBackoff) {
return nil, nil, fmt.Errorf("plugin is shutting down")
}
continue
}
for _, executor := range executors {
release, ok := r.tryReserveExecutorCapacity(executor, jobType, policy)
if !ok {
continue
}
return executor, release, nil
}
if !waitForShutdownOrTimer(r.shutdownCh, policy.ExecutorReserveBackoff) {
return nil, nil, fmt.Errorf("plugin is shutting down")
}
}
}
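// tryReserveExecutorCapacity reserves one execution slot on the worker when
// heartbeat-reported usage plus in-flight scheduler reservations stay below
// the effective limit. The returned release func must be called exactly once
// to free the slot.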
func (r *Plugin) tryReserveExecutorCapacity(
executor *WorkerSession,
jobType string,
policy schedulerPolicy,
) (func(), bool) {
if executor == nil || strings.TrimSpace(executor.WorkerID) == "" {
return nil, false
}
limit := schedulerWorkerExecutionLimit(executor, jobType, policy)
if limit <= 0 {
return nil, false
}
heartbeatUsed := 0
if executor.Heartbeat != nil && executor.Heartbeat.ExecutionSlotsUsed > 0 {
heartbeatUsed = int(executor.Heartbeat.ExecutionSlotsUsed)
}
workerID := strings.TrimSpace(executor.WorkerID)
r.schedulerExecMu.Lock()
reserved := r.schedulerExecReservations[workerID]
if heartbeatUsed+reserved >= limit {
r.schedulerExecMu.Unlock()
return nil, false
}
r.schedulerExecReservations[workerID] = reserved + 1
r.schedulerExecMu.Unlock()
release := func() {
r.releaseExecutorCapacity(workerID)
}
return release, true
}
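// releaseExecutorCapacity returns one reserved execution slot for the worker,
// deleting the reservation entry once the count reaches zero.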
func (r *Plugin) releaseExecutorCapacity(workerID string) {
workerID = strings.TrimSpace(workerID)
if workerID == "" {
return
}
r.schedulerExecMu.Lock()
defer r.schedulerExecMu.Unlock()
current := r.schedulerExecReservations[workerID]
if current <= 1 {
delete(r.schedulerExecReservations, workerID)
return
}
r.schedulerExecReservations[workerID] = current - 1
}
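// schedulerWorkerExecutionLimit computes the effective per-worker execution
// limit: the policy value (or the scheduler default), capped by the worker's
// advertised MaxExecutionConcurrency and its heartbeat ExecutionSlotsTotal.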
func schedulerWorkerExecutionLimit(executor *WorkerSession, jobType string, policy schedulerPolicy) int {
limit := policy.PerWorkerConcurrency
if limit <= 0 {
limit = defaultScheduledPerWorkerConcurrency
}
if capability := executor.Capabilities[jobType]; capability != nil && capability.MaxExecutionConcurrency > 0 {
capLimit := int(capability.MaxExecutionConcurrency)
if capLimit < limit {
limit = capLimit
}
}
if executor.Heartbeat != nil && executor.Heartbeat.ExecutionSlotsTotal > 0 {
heartbeatLimit := int(executor.Heartbeat.ExecutionSlotsTotal)
if heartbeatLimit < limit {
limit = heartbeatLimit
}
}
if limit < 0 {
return 0
}
return limit
}
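// executeScheduledJobWithExecutor runs the job for up to policy.RetryLimit+1
// attempts, waiting policy.RetryBackoff between attempts. Capacity rejections
// surface immediately as errExecutorAtCapacity so the caller can requeue
// instead of burning retries.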
func (r *Plugin) executeScheduledJobWithExecutor(
executor *WorkerSession,
job *plugin_pb.JobSpec,
clusterContext *plugin_pb.ClusterContext,
policy schedulerPolicy,
) error {
maxAttempts := policy.RetryLimit + 1
if maxAttempts < 1 {
maxAttempts = 1
}
var lastErr error
for attempt := 1; attempt <= maxAttempts; attempt++ {
select {
case <-r.shutdownCh:
return fmt.Errorf("plugin is shutting down")
default:
}
execTimeout := policy.ExecutionTimeout
if execTimeout <= 0 {
execTimeout = 10 * time.Minute // Match the reservation default so an unset timeout does not yield an already-expired context.
}
execCtx, cancel := context.WithTimeout(context.Background(), execTimeout)
_, err := r.executeJobWithExecutor(execCtx, executor, job, clusterContext, int32(attempt))
cancel()
if err == nil {
return nil
}
if isExecutorAtCapacityError(err) {
return errExecutorAtCapacity
}
lastErr = err
if attempt < maxAttempts {
r.appendActivity(JobActivity{
JobID: job.JobId,
JobType: job.JobType,
Source: "admin_scheduler",
Message: fmt.Sprintf("retrying job attempt %d/%d after error: %v", attempt, maxAttempts, err),
Stage: "retry",
OccurredAt: timeToPtr(time.Now().UTC()),
})
if !waitForShutdownOrTimer(r.shutdownCh, policy.RetryBackoff) {
return fmt.Errorf("plugin is shutting down")
}
}
}
if lastErr == nil {
lastErr = fmt.Errorf("execution failed without an explicit error")
}
return lastErr
}
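// shouldSkipDetectionForWaitingJobs reports whether the pending backlog for
// the job type has reached the waiting threshold, returning the observed
// count and threshold for logging.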
func (r *Plugin) shouldSkipDetectionForWaitingJobs(jobType string, policy schedulerPolicy) (bool, int, int) {
waitingCount := r.countWaitingTrackedJobs(jobType)
threshold := waitingBacklogThreshold(policy)
if threshold <= 0 {
return false, waitingCount, threshold
}
return waitingCount >= threshold, waitingCount, threshold
}
func (r *Plugin) countWaitingTrackedJobs(jobType string) int {
normalizedJobType := strings.TrimSpace(jobType)
if normalizedJobType == "" {
return 0
}
waiting := 0
r.jobsMu.RLock()
for _, job := range r.jobs {
if job == nil {
continue
}
if strings.TrimSpace(job.JobType) != normalizedJobType {
continue
}
if !isWaitingTrackedJobState(job.State) {
continue
}
waiting++
}
r.jobsMu.RUnlock()
return waiting
}
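// waitingBacklogThreshold derives the backlog cutoff as execution concurrency
// times defaultWaitingBacklogMultiplier, raised to defaultWaitingBacklogFloor
// and capped by policy.MaxResults when set.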
func waitingBacklogThreshold(policy schedulerPolicy) int {
concurrency := policy.ExecutionConcurrency
if concurrency <= 0 {
concurrency = defaultScheduledExecutionConcurrency
}
threshold := concurrency * defaultWaitingBacklogMultiplier
if threshold < defaultWaitingBacklogFloor {
threshold = defaultWaitingBacklogFloor
}
if policy.MaxResults > 0 && threshold > int(policy.MaxResults) {
threshold = int(policy.MaxResults)
}
return threshold
}
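// isExecutorAtCapacityError matches the errExecutorAtCapacity sentinel or,
// as a fallback, the capacity message when the error crossed the wire as text.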
func isExecutorAtCapacityError(err error) bool {
if err == nil {
return false
}
if errors.Is(err, errExecutorAtCapacity) {
return true
}
return strings.Contains(strings.ToLower(err.Error()), "executor is at capacity")
}
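// buildScheduledJobSpec materializes a JobSpec from a detection proposal,
// minting a unique job id from the job type, current timestamp, and batch
// index; proposal fields override the defaults when present.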
func buildScheduledJobSpec(jobType string, proposal *plugin_pb.JobProposal, index int) *plugin_pb.JobSpec {
now := timestamppb.Now()
jobID := fmt.Sprintf("%s-scheduled-%d-%d", jobType, now.AsTime().UnixNano(), index)
job := &plugin_pb.JobSpec{
JobId: jobID,
JobType: jobType,
Priority: plugin_pb.JobPriority_JOB_PRIORITY_NORMAL,
Parameters: map[string]*plugin_pb.ConfigValue{},
Labels: map[string]string{},
CreatedAt: now,
ScheduledAt: now,
}
if proposal == nil {
return job
}
if proposal.JobType != "" {
job.JobType = proposal.JobType
}
job.Summary = proposal.Summary
job.Detail = proposal.Detail
if proposal.Priority != plugin_pb.JobPriority_JOB_PRIORITY_UNSPECIFIED {
job.Priority = proposal.Priority
}
job.DedupeKey = proposal.DedupeKey
job.Parameters = CloneConfigValueMap(proposal.Parameters)
if proposal.Labels != nil {
job.Labels = make(map[string]string, len(proposal.Labels))
for k, v := range proposal.Labels {
job.Labels[k] = v
}
}
if proposal.NotBefore != nil {
job.ScheduledAt = proposal.NotBefore
}
return job
}
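// durationFromSeconds converts a configured second count to a duration,
// substituting defaultValue for non-positive input.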
func durationFromSeconds(seconds int32, defaultValue time.Duration) time.Duration {
if seconds <= 0 {
return defaultValue
}
return time.Duration(seconds) * time.Second
}
func secondsFromDuration(duration time.Duration) int32 {
if duration <= 0 {
return 0
}
return int32(duration / time.Second)
}
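// waitForShutdownOrTimer sleeps for the given duration and reports false if
// shutdown fires first; a non-positive duration returns true immediately.
// Callers use it to back off without missing shutdown:
//
//	if !waitForShutdownOrTimer(r.shutdownCh, policy.RetryBackoff) {
//		return fmt.Errorf("plugin is shutting down")
//	}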
func waitForShutdownOrTimer(shutdown <-chan struct{}, duration time.Duration) bool {
if duration <= 0 {
return true
}
timer := time.NewTimer(duration)
defer timer.Stop()
select {
case <-shutdown:
return false
case <-timer.C:
return true
}
}
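// filterProposalsWithActiveJobs drops proposals whose execution key collides
// with a tracked job of the same type that is still pending, assigned, or
// running, returning the survivors and the number skipped.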
func (r *Plugin) filterProposalsWithActiveJobs(jobType string, proposals []*plugin_pb.JobProposal) ([]*plugin_pb.JobProposal, int) {
if len(proposals) == 0 {
return proposals, 0
}
normalizedJobType := strings.TrimSpace(jobType)
activeKeys := make(map[string]struct{})
r.jobsMu.RLock()
for _, job := range r.jobs {
if job == nil {
continue
}
if strings.TrimSpace(job.JobType) != normalizedJobType {
continue
}
if !isActiveTrackedJobState(job.State) {
continue
}
key := strings.TrimSpace(job.DedupeKey)
if key == "" {
key = strings.TrimSpace(job.JobID)
}
if key == "" {
continue
}
activeKeys[key] = struct{}{}
}
r.jobsMu.RUnlock()
if len(activeKeys) == 0 {
return proposals, 0
}
filtered := make([]*plugin_pb.JobProposal, 0, len(proposals))
skipped := 0
for _, proposal := range proposals {
if proposal == nil {
continue
}
key := proposalExecutionKey(proposal)
if key != "" {
if _, exists := activeKeys[key]; exists {
skipped++
continue
}
}
filtered = append(filtered, proposal)
}
return filtered, skipped
}
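// proposalExecutionKey returns the trimmed dedupe key when set, falling back
// to the trimmed proposal id.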
func proposalExecutionKey(proposal *plugin_pb.JobProposal) string {
if proposal == nil {
return ""
}
key := strings.TrimSpace(proposal.DedupeKey)
if key != "" {
return key
}
return strings.TrimSpace(proposal.ProposalId)
}
func isActiveTrackedJobState(state string) bool {
normalized := strings.ToLower(strings.TrimSpace(state))
switch normalized {
case "pending", "assigned", "running", "in_progress", "job_state_pending", "job_state_assigned", "job_state_running":
return true
default:
return false
}
}
func isWaitingTrackedJobState(state string) bool {
normalized := strings.ToLower(strings.TrimSpace(state))
return normalized == "pending" || normalized == "job_state_pending"
}
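// filterScheduledProposals removes duplicates within a single detection run,
// keyed by dedupe key (else proposal id); keyless proposals always pass.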
func (r *Plugin) filterScheduledProposals(proposals []*plugin_pb.JobProposal) []*plugin_pb.JobProposal {
filtered := make([]*plugin_pb.JobProposal, 0, len(proposals))
seenInRun := make(map[string]struct{}, len(proposals))
for _, proposal := range proposals {
if proposal == nil {
continue
}
key := proposalExecutionKey(proposal)
if key == "" {
filtered = append(filtered, proposal)
continue
}
if _, exists := seenInRun[key]; exists {
continue
}
seenInRun[key] = struct{}{}
filtered = append(filtered, proposal)
}
return filtered
}


@@ -0,0 +1,583 @@
package plugin
import (
"fmt"
"testing"
"time"
"github.com/seaweedfs/seaweedfs/weed/pb/plugin_pb"
)
func TestLoadSchedulerPolicyUsesAdminConfig(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
err = pluginSvc.SaveJobTypeConfig(&plugin_pb.PersistedJobTypeConfig{
JobType: "vacuum",
AdminRuntime: &plugin_pb.AdminRuntimeConfig{
Enabled: true,
DetectionIntervalSeconds: 30,
DetectionTimeoutSeconds: 20,
MaxJobsPerDetection: 123,
GlobalExecutionConcurrency: 5,
PerWorkerExecutionConcurrency: 2,
RetryLimit: 4,
RetryBackoffSeconds: 7,
},
})
if err != nil {
t.Fatalf("SaveJobTypeConfig: %v", err)
}
policy, enabled, err := pluginSvc.loadSchedulerPolicy("vacuum")
if err != nil {
t.Fatalf("loadSchedulerPolicy: %v", err)
}
if !enabled {
t.Fatalf("expected enabled policy")
}
if policy.MaxResults != 123 {
t.Fatalf("unexpected max results: got=%d", policy.MaxResults)
}
if policy.ExecutionConcurrency != 5 {
t.Fatalf("unexpected global concurrency: got=%d", policy.ExecutionConcurrency)
}
if policy.PerWorkerConcurrency != 2 {
t.Fatalf("unexpected per-worker concurrency: got=%d", policy.PerWorkerConcurrency)
}
if policy.RetryLimit != 4 {
t.Fatalf("unexpected retry limit: got=%d", policy.RetryLimit)
}
}
func TestLoadSchedulerPolicyUsesDescriptorDefaultsWhenConfigMissing(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
err = pluginSvc.store.SaveDescriptor("ec", &plugin_pb.JobTypeDescriptor{
JobType: "ec",
AdminRuntimeDefaults: &plugin_pb.AdminRuntimeDefaults{
Enabled: true,
DetectionIntervalSeconds: 60,
DetectionTimeoutSeconds: 25,
MaxJobsPerDetection: 30,
GlobalExecutionConcurrency: 4,
PerWorkerExecutionConcurrency: 2,
RetryLimit: 3,
RetryBackoffSeconds: 6,
},
})
if err != nil {
t.Fatalf("SaveDescriptor: %v", err)
}
policy, enabled, err := pluginSvc.loadSchedulerPolicy("ec")
if err != nil {
t.Fatalf("loadSchedulerPolicy: %v", err)
}
if !enabled {
t.Fatalf("expected enabled policy from descriptor defaults")
}
if policy.MaxResults != 30 {
t.Fatalf("unexpected max results: got=%d", policy.MaxResults)
}
if policy.ExecutionConcurrency != 4 {
t.Fatalf("unexpected global concurrency: got=%d", policy.ExecutionConcurrency)
}
if policy.PerWorkerConcurrency != 2 {
t.Fatalf("unexpected per-worker concurrency: got=%d", policy.PerWorkerConcurrency)
}
}
func TestReserveScheduledExecutorRespectsPerWorkerLimit(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
pluginSvc.registry.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-a",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "balance", CanExecute: true, MaxExecutionConcurrency: 4},
},
})
pluginSvc.registry.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-b",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "balance", CanExecute: true, MaxExecutionConcurrency: 2},
},
})
policy := schedulerPolicy{
PerWorkerConcurrency: 1,
ExecutorReserveBackoff: time.Millisecond,
}
executor1, release1, err := pluginSvc.reserveScheduledExecutor("balance", policy)
if err != nil {
t.Fatalf("reserve executor 1: %v", err)
}
defer release1()
executor2, release2, err := pluginSvc.reserveScheduledExecutor("balance", policy)
if err != nil {
t.Fatalf("reserve executor 2: %v", err)
}
defer release2()
if executor1.WorkerID == executor2.WorkerID {
t.Fatalf("expected different executors due per-worker limit, got same worker %s", executor1.WorkerID)
}
}
func TestFilterScheduledProposalsDedupe(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
proposals := []*plugin_pb.JobProposal{
{ProposalId: "p1", DedupeKey: "d1"},
{ProposalId: "p2", DedupeKey: "d1"}, // same dedupe key
{ProposalId: "p3", DedupeKey: "d3"},
{ProposalId: "p3"}, // fallback dedupe by proposal id
{ProposalId: "p4"},
{ProposalId: "p4"}, // same proposal id, no dedupe key
}
filtered := pluginSvc.filterScheduledProposals(proposals)
if len(filtered) != 4 {
t.Fatalf("unexpected filtered size: got=%d want=4", len(filtered))
}
filtered2 := pluginSvc.filterScheduledProposals(proposals)
if len(filtered2) != 4 {
t.Fatalf("expected second run dedupe to be per-run only, got=%d", len(filtered2))
}
}
func TestBuildScheduledJobSpecDoesNotReuseProposalID(t *testing.T) {
t.Parallel()
proposal := &plugin_pb.JobProposal{
ProposalId: "vacuum-2",
DedupeKey: "vacuum:2",
JobType: "vacuum",
}
jobA := buildScheduledJobSpec("vacuum", proposal, 0)
jobB := buildScheduledJobSpec("vacuum", proposal, 1)
if jobA.JobId == proposal.ProposalId {
t.Fatalf("scheduled job id must not reuse proposal id: %s", jobA.JobId)
}
if jobB.JobId == proposal.ProposalId {
t.Fatalf("scheduled job id must not reuse proposal id: %s", jobB.JobId)
}
if jobA.JobId == jobB.JobId {
t.Fatalf("scheduled job ids must be unique across jobs: %s", jobA.JobId)
}
}
func TestFilterProposalsWithActiveJobs(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
pluginSvc.trackExecutionStart("req-1", "worker-a", &plugin_pb.JobSpec{
JobId: "job-1",
JobType: "vacuum",
DedupeKey: "vacuum:k1",
}, 1)
pluginSvc.trackExecutionStart("req-2", "worker-b", &plugin_pb.JobSpec{
JobId: "job-2",
JobType: "vacuum",
}, 1)
pluginSvc.trackExecutionQueued(&plugin_pb.JobSpec{
JobId: "job-3",
JobType: "vacuum",
DedupeKey: "vacuum:k4",
})
filtered, skipped := pluginSvc.filterProposalsWithActiveJobs("vacuum", []*plugin_pb.JobProposal{
{ProposalId: "proposal-1", JobType: "vacuum", DedupeKey: "vacuum:k1"},
{ProposalId: "job-2", JobType: "vacuum"},
{ProposalId: "proposal-2b", JobType: "vacuum", DedupeKey: "vacuum:k4"},
{ProposalId: "proposal-3", JobType: "vacuum", DedupeKey: "vacuum:k3"},
{ProposalId: "proposal-4", JobType: "balance", DedupeKey: "balance:k1"},
})
if skipped != 3 {
t.Fatalf("unexpected skipped count: got=%d want=3", skipped)
}
if len(filtered) != 2 {
t.Fatalf("unexpected filtered size: got=%d want=2", len(filtered))
}
if filtered[0].ProposalId != "proposal-3" || filtered[1].ProposalId != "proposal-4" {
t.Fatalf("unexpected filtered proposals: got=%s,%s", filtered[0].ProposalId, filtered[1].ProposalId)
}
}
func TestReserveScheduledExecutorReturnsPromptlyOnShutdown(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
policy := schedulerPolicy{
ExecutionTimeout: 30 * time.Millisecond,
ExecutorReserveBackoff: 5 * time.Millisecond,
PerWorkerConcurrency: 1,
}
start := time.Now()
pluginSvc.Shutdown()
_, _, err = pluginSvc.reserveScheduledExecutor("missing-job-type", policy)
if err == nil {
t.Fatalf("expected reservation shutdown error")
}
if time.Since(start) > 50*time.Millisecond {
t.Fatalf("reservation returned too late after shutdown: duration=%v", time.Since(start))
}
}
func TestReserveScheduledExecutorWaitsForWorkerCapacity(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
pluginSvc.registry.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-a",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "balance", CanExecute: true, MaxExecutionConcurrency: 1},
},
})
policy := schedulerPolicy{
ExecutionTimeout: time.Second,
PerWorkerConcurrency: 8,
ExecutorReserveBackoff: 5 * time.Millisecond,
}
_, release1, err := pluginSvc.reserveScheduledExecutor("balance", policy)
if err != nil {
t.Fatalf("reserve executor 1: %v", err)
}
defer release1()
type reserveResult struct {
err error
}
secondReserveCh := make(chan reserveResult, 1)
go func() {
_, release2, reserveErr := pluginSvc.reserveScheduledExecutor("balance", policy)
if release2 != nil {
release2()
}
secondReserveCh <- reserveResult{err: reserveErr}
}()
select {
case result := <-secondReserveCh:
t.Fatalf("expected second reservation to wait for capacity, got=%v", result.err)
case <-time.After(25 * time.Millisecond):
// Expected: still waiting.
}
release1()
select {
case result := <-secondReserveCh:
if result.err != nil {
t.Fatalf("second reservation error: %v", result.err)
}
case <-time.After(200 * time.Millisecond):
t.Fatalf("second reservation did not acquire after capacity release")
}
}
func TestShouldSkipDetectionForWaitingJobs(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
policy := schedulerPolicy{
ExecutionConcurrency: 2,
MaxResults: 100,
}
threshold := waitingBacklogThreshold(policy)
if threshold <= 0 {
t.Fatalf("expected positive waiting threshold")
}
for i := 0; i < threshold; i++ {
pluginSvc.trackExecutionQueued(&plugin_pb.JobSpec{
JobId: fmt.Sprintf("job-waiting-%d", i),
JobType: "vacuum",
DedupeKey: fmt.Sprintf("vacuum:%d", i),
})
}
skip, waitingCount, waitingThreshold := pluginSvc.shouldSkipDetectionForWaitingJobs("vacuum", policy)
if !skip {
t.Fatalf("expected detection to skip when waiting backlog reaches threshold")
}
if waitingCount != threshold {
t.Fatalf("unexpected waiting count: got=%d want=%d", waitingCount, threshold)
}
if waitingThreshold != threshold {
t.Fatalf("unexpected waiting threshold: got=%d want=%d", waitingThreshold, threshold)
}
}
func TestWaitingBacklogThresholdHonorsMaxResultsCap(t *testing.T) {
t.Parallel()
policy := schedulerPolicy{
ExecutionConcurrency: 8,
MaxResults: 6,
}
threshold := waitingBacklogThreshold(policy)
if threshold != 6 {
t.Fatalf("expected threshold to be capped by max results, got=%d", threshold)
}
}
func TestListSchedulerStatesIncludesPolicyAndState(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
const jobType = "vacuum"
err = pluginSvc.SaveJobTypeConfig(&plugin_pb.PersistedJobTypeConfig{
JobType: jobType,
AdminRuntime: &plugin_pb.AdminRuntimeConfig{
Enabled: true,
DetectionIntervalSeconds: 45,
DetectionTimeoutSeconds: 30,
MaxJobsPerDetection: 80,
GlobalExecutionConcurrency: 3,
PerWorkerExecutionConcurrency: 2,
RetryLimit: 1,
RetryBackoffSeconds: 9,
},
})
if err != nil {
t.Fatalf("SaveJobTypeConfig: %v", err)
}
pluginSvc.registry.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-a",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: jobType, CanDetect: true, CanExecute: true},
},
})
nextDetectionAt := time.Now().UTC().Add(2 * time.Minute).Round(time.Second)
pluginSvc.schedulerMu.Lock()
pluginSvc.nextDetectionAt[jobType] = nextDetectionAt
pluginSvc.detectionInFlight[jobType] = true
pluginSvc.schedulerMu.Unlock()
states, err := pluginSvc.ListSchedulerStates()
if err != nil {
t.Fatalf("ListSchedulerStates: %v", err)
}
state := findSchedulerState(states, jobType)
if state == nil {
t.Fatalf("missing scheduler state for %s", jobType)
}
if !state.Enabled {
t.Fatalf("expected enabled scheduler state")
}
if state.PolicyError != "" {
t.Fatalf("unexpected policy error: %s", state.PolicyError)
}
if !state.DetectionInFlight {
t.Fatalf("expected detection in flight")
}
if state.NextDetectionAt == nil {
t.Fatalf("expected next detection time")
}
if state.NextDetectionAt.Unix() != nextDetectionAt.Unix() {
t.Fatalf("unexpected next detection time: got=%v want=%v", state.NextDetectionAt, nextDetectionAt)
}
if state.DetectionIntervalSeconds != 45 {
t.Fatalf("unexpected detection interval: got=%d", state.DetectionIntervalSeconds)
}
if state.DetectionTimeoutSeconds != 30 {
t.Fatalf("unexpected detection timeout: got=%d", state.DetectionTimeoutSeconds)
}
if state.ExecutionTimeoutSeconds != 90 {
t.Fatalf("unexpected execution timeout: got=%d", state.ExecutionTimeoutSeconds)
}
if state.MaxJobsPerDetection != 80 {
t.Fatalf("unexpected max jobs per detection: got=%d", state.MaxJobsPerDetection)
}
if state.GlobalExecutionConcurrency != 3 {
t.Fatalf("unexpected global execution concurrency: got=%d", state.GlobalExecutionConcurrency)
}
if state.PerWorkerExecutionConcurrency != 2 {
t.Fatalf("unexpected per worker execution concurrency: got=%d", state.PerWorkerExecutionConcurrency)
}
if state.RetryLimit != 1 {
t.Fatalf("unexpected retry limit: got=%d", state.RetryLimit)
}
if state.RetryBackoffSeconds != 9 {
t.Fatalf("unexpected retry backoff: got=%d", state.RetryBackoffSeconds)
}
if !state.DetectorAvailable || state.DetectorWorkerID != "worker-a" {
t.Fatalf("unexpected detector assignment: available=%v worker=%s", state.DetectorAvailable, state.DetectorWorkerID)
}
if state.ExecutorWorkerCount != 1 {
t.Fatalf("unexpected executor worker count: got=%d", state.ExecutorWorkerCount)
}
}
func TestListSchedulerStatesShowsDisabledWhenNoPolicy(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
const jobType = "balance"
pluginSvc.registry.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-b",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: jobType, CanDetect: true, CanExecute: true},
},
})
states, err := pluginSvc.ListSchedulerStates()
if err != nil {
t.Fatalf("ListSchedulerStates: %v", err)
}
state := findSchedulerState(states, jobType)
if state == nil {
t.Fatalf("missing scheduler state for %s", jobType)
}
if state.Enabled {
t.Fatalf("expected disabled scheduler state")
}
if state.PolicyError != "" {
t.Fatalf("unexpected policy error: %s", state.PolicyError)
}
if !state.DetectorAvailable || state.DetectorWorkerID != "worker-b" {
t.Fatalf("unexpected detector details: available=%v worker=%s", state.DetectorAvailable, state.DetectorWorkerID)
}
if state.ExecutorWorkerCount != 1 {
t.Fatalf("unexpected executor worker count: got=%d", state.ExecutorWorkerCount)
}
}
func findSchedulerState(states []SchedulerJobTypeState, jobType string) *SchedulerJobTypeState {
for i := range states {
if states[i].JobType == jobType {
return &states[i]
}
}
return nil
}
func TestPickDetectorPrefersLeasedWorker(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
pluginSvc.registry.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-a",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "vacuum", CanDetect: true},
},
})
pluginSvc.registry.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-b",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "vacuum", CanDetect: true},
},
})
pluginSvc.setDetectorLease("vacuum", "worker-b")
detector, err := pluginSvc.pickDetector("vacuum")
if err != nil {
t.Fatalf("pickDetector: %v", err)
}
if detector.WorkerID != "worker-b" {
t.Fatalf("expected leased detector worker-b, got=%s", detector.WorkerID)
}
}
func TestPickDetectorReassignsWhenLeaseIsStale(t *testing.T) {
t.Parallel()
pluginSvc, err := New(Options{})
if err != nil {
t.Fatalf("New: %v", err)
}
defer pluginSvc.Shutdown()
pluginSvc.registry.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-a",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "vacuum", CanDetect: true},
},
})
pluginSvc.setDetectorLease("vacuum", "worker-stale")
detector, err := pluginSvc.pickDetector("vacuum")
if err != nil {
t.Fatalf("pickDetector: %v", err)
}
if detector.WorkerID != "worker-a" {
t.Fatalf("expected reassigned detector worker-a, got=%s", detector.WorkerID)
}
lease := pluginSvc.getDetectorLease("vacuum")
if lease != "worker-a" {
t.Fatalf("expected detector lease to be updated to worker-a, got=%s", lease)
}
}


@@ -0,0 +1,66 @@
package plugin
import (
"context"
"sort"
"time"
"github.com/seaweedfs/seaweedfs/weed/glog"
"github.com/seaweedfs/seaweedfs/weed/pb/plugin_pb"
)
const descriptorPrefetchTimeout = 20 * time.Second
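// prefetchDescriptorsFromHello requests config schemas for job types the
// worker advertises with detect or execute capability and that have no
// stored descriptor yet; failures are logged and skipped.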
func (r *Plugin) prefetchDescriptorsFromHello(hello *plugin_pb.WorkerHello) {
if hello == nil || len(hello.Capabilities) == 0 {
return
}
jobTypeSet := make(map[string]struct{})
for _, capability := range hello.Capabilities {
if capability == nil || capability.JobType == "" {
continue
}
if !capability.CanDetect && !capability.CanExecute {
continue
}
jobTypeSet[capability.JobType] = struct{}{}
}
if len(jobTypeSet) == 0 {
return
}
jobTypes := make([]string, 0, len(jobTypeSet))
for jobType := range jobTypeSet {
jobTypes = append(jobTypes, jobType)
}
sort.Strings(jobTypes)
for _, jobType := range jobTypes {
select {
case <-r.shutdownCh:
return
default:
}
descriptor, err := r.store.LoadDescriptor(jobType)
if err != nil {
glog.Warningf("Plugin descriptor prefetch check failed for %s: %v", jobType, err)
continue
}
if descriptor != nil {
continue
}
ctx, cancel := context.WithTimeout(r.ctx, descriptorPrefetchTimeout)
_, err = r.RequestConfigSchema(ctx, jobType, false)
cancel()
if err != nil {
glog.V(1).Infof("Plugin descriptor prefetch skipped for %s: %v", jobType, err)
continue
}
glog.V(1).Infof("Plugin descriptor prefetched for job_type=%s", jobType)
}
}


@@ -0,0 +1,465 @@
package plugin
import (
"fmt"
"sort"
"strings"
"sync"
"time"
"github.com/seaweedfs/seaweedfs/weed/pb/plugin_pb"
)
const defaultWorkerStaleTimeout = 2 * time.Minute
// WorkerSession contains tracked worker metadata and plugin status.
type WorkerSession struct {
WorkerID string
WorkerInstance string
Address string
WorkerVersion string
ProtocolVersion string
ConnectedAt time.Time
LastSeenAt time.Time
Capabilities map[string]*plugin_pb.JobTypeCapability
Heartbeat *plugin_pb.WorkerHeartbeat
}
// Registry tracks connected plugin workers and capability-based selection.
type Registry struct {
mu sync.RWMutex
sessions map[string]*WorkerSession
staleAfter time.Duration
detectorCursor map[string]int
executorCursor map[string]int
}
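// NewRegistry returns an empty worker registry using the default stale timeout.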
func NewRegistry() *Registry {
return &Registry{
sessions: make(map[string]*WorkerSession),
staleAfter: defaultWorkerStaleTimeout,
detectorCursor: make(map[string]int),
executorCursor: make(map[string]int),
}
}
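// UpsertFromHello creates or refreshes the session for hello.WorkerId,
// replacing its capability set, and returns a defensive copy.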
func (r *Registry) UpsertFromHello(hello *plugin_pb.WorkerHello) *WorkerSession {
now := time.Now()
caps := make(map[string]*plugin_pb.JobTypeCapability, len(hello.Capabilities))
for _, c := range hello.Capabilities {
if c == nil || c.JobType == "" {
continue
}
caps[c.JobType] = cloneJobTypeCapability(c)
}
r.mu.Lock()
defer r.mu.Unlock()
session, ok := r.sessions[hello.WorkerId]
if !ok {
session = &WorkerSession{
WorkerID: hello.WorkerId,
ConnectedAt: now,
}
r.sessions[hello.WorkerId] = session
}
session.WorkerInstance = hello.WorkerInstanceId
session.Address = hello.Address
session.WorkerVersion = hello.WorkerVersion
session.ProtocolVersion = hello.ProtocolVersion
session.LastSeenAt = now
session.Capabilities = caps
return cloneWorkerSession(session)
}
func (r *Registry) Remove(workerID string) {
r.mu.Lock()
defer r.mu.Unlock()
delete(r.sessions, workerID)
}
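// UpdateHeartbeat stores a copy of the heartbeat on the session and refreshes
// its last-seen time; heartbeats from unknown workers are ignored.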
func (r *Registry) UpdateHeartbeat(workerID string, heartbeat *plugin_pb.WorkerHeartbeat) {
r.mu.Lock()
defer r.mu.Unlock()
session, ok := r.sessions[workerID]
if !ok {
return
}
session.Heartbeat = cloneWorkerHeartbeat(heartbeat)
session.LastSeenAt = time.Now()
}
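// Get returns a copy of the worker's session, or false when the worker is
// unknown or stale.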
func (r *Registry) Get(workerID string) (*WorkerSession, bool) {
r.mu.RLock()
defer r.mu.RUnlock()
session, ok := r.sessions[workerID]
if !ok || r.isSessionStaleLocked(session, time.Now()) {
return nil, false
}
return cloneWorkerSession(session), true
}
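// List returns copies of all non-stale sessions, sorted by worker ID.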
func (r *Registry) List() []*WorkerSession {
r.mu.RLock()
defer r.mu.RUnlock()
out := make([]*WorkerSession, 0, len(r.sessions))
now := time.Now()
for _, s := range r.sessions {
if r.isSessionStaleLocked(s, now) {
continue
}
out = append(out, cloneWorkerSession(s))
}
sort.Slice(out, func(i, j int) bool {
return out[i].WorkerID < out[j].WorkerID
})
return out
}
// DetectableJobTypes returns sorted job types that currently have at least one detect-capable worker.
func (r *Registry) DetectableJobTypes() []string {
r.mu.RLock()
defer r.mu.RUnlock()
jobTypes := make(map[string]struct{})
now := time.Now()
for _, session := range r.sessions {
if r.isSessionStaleLocked(session, now) {
continue
}
for jobType, capability := range session.Capabilities {
if capability == nil || !capability.CanDetect {
continue
}
jobTypes[jobType] = struct{}{}
}
}
out := make([]string, 0, len(jobTypes))
for jobType := range jobTypes {
out = append(out, jobType)
}
sort.Strings(out)
return out
}
// JobTypes returns sorted job types known by connected workers regardless of capability kind.
func (r *Registry) JobTypes() []string {
r.mu.RLock()
defer r.mu.RUnlock()
jobTypes := make(map[string]struct{})
now := time.Now()
for _, session := range r.sessions {
if r.isSessionStaleLocked(session, now) {
continue
}
for jobType := range session.Capabilities {
if jobType == "" {
continue
}
jobTypes[jobType] = struct{}{}
}
}
out := make([]string, 0, len(jobTypes))
for jobType := range jobTypes {
out = append(out, jobType)
}
sort.Strings(out)
return out
}
// PickSchemaProvider picks one worker for schema requests.
// Preference order:
// 1) workers that can detect this job type
// 2) workers that can execute this job type
// tie-break: more free slots, then lexical worker ID.
func (r *Registry) PickSchemaProvider(jobType string) (*WorkerSession, error) {
r.mu.RLock()
defer r.mu.RUnlock()
var candidates []*WorkerSession
now := time.Now()
for _, s := range r.sessions {
if r.isSessionStaleLocked(s, now) {
continue
}
capability := s.Capabilities[jobType]
if capability == nil {
continue
}
if capability.CanDetect || capability.CanExecute {
candidates = append(candidates, s)
}
}
if len(candidates) == 0 {
return nil, fmt.Errorf("no worker available for schema job_type=%s", jobType)
}
sort.Slice(candidates, func(i, j int) bool {
a := candidates[i]
b := candidates[j]
ac := a.Capabilities[jobType]
bc := b.Capabilities[jobType]
// Prefer detect-capable providers first.
if ac.CanDetect != bc.CanDetect {
return ac.CanDetect
}
aSlots := availableDetectionSlots(a, ac) + availableExecutionSlots(a, ac)
bSlots := availableDetectionSlots(b, bc) + availableExecutionSlots(b, bc)
if aSlots != bSlots {
return aSlots > bSlots
}
return a.WorkerID < b.WorkerID
})
return cloneWorkerSession(candidates[0]), nil
}
// PickDetector picks one detector worker for a job type.
func (r *Registry) PickDetector(jobType string) (*WorkerSession, error) {
return r.pickByKind(jobType, true)
}
// PickExecutor picks one executor worker for a job type.
func (r *Registry) PickExecutor(jobType string) (*WorkerSession, error) {
return r.pickByKind(jobType, false)
}
// ListExecutors returns sorted executor candidates for one job type.
// Ordering is by most available execution slots, then lexical worker ID.
// The top tie group is rotated round-robin to prevent sticky assignment.
func (r *Registry) ListExecutors(jobType string) ([]*WorkerSession, error) {
r.mu.Lock()
defer r.mu.Unlock()
candidates := r.collectByKindLocked(jobType, false, time.Now())
if len(candidates) == 0 {
return nil, fmt.Errorf("no executor worker available for job_type=%s", jobType)
}
sortByKind(candidates, jobType, false)
r.rotateTopCandidatesLocked(candidates, jobType, false)
out := make([]*WorkerSession, 0, len(candidates))
for _, candidate := range candidates {
out = append(out, cloneWorkerSession(candidate))
}
return out, nil
}
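// pickByKind selects one detector or executor candidate after sorting by
// available slots and rotating the top tie group for round-robin fairness.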
func (r *Registry) pickByKind(jobType string, detect bool) (*WorkerSession, error) {
r.mu.Lock()
defer r.mu.Unlock()
candidates := r.collectByKindLocked(jobType, detect, time.Now())
if len(candidates) == 0 {
kind := "executor"
if detect {
kind = "detector"
}
return nil, fmt.Errorf("no %s worker available for job_type=%s", kind, jobType)
}
sortByKind(candidates, jobType, detect)
r.rotateTopCandidatesLocked(candidates, jobType, detect)
return cloneWorkerSession(candidates[0]), nil
}
func (r *Registry) collectByKindLocked(jobType string, detect bool, now time.Time) []*WorkerSession {
var candidates []*WorkerSession
for _, session := range r.sessions {
if r.isSessionStaleLocked(session, now) {
continue
}
capability := session.Capabilities[jobType]
if capability == nil {
continue
}
if detect && capability.CanDetect {
candidates = append(candidates, session)
}
if !detect && capability.CanExecute {
candidates = append(candidates, session)
}
}
return candidates
}
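// isSessionStaleLocked reports whether the session has gone unseen for longer
// than staleAfter; callers must hold the registry lock.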
func (r *Registry) isSessionStaleLocked(session *WorkerSession, now time.Time) bool {
if session == nil {
return true
}
if r.staleAfter <= 0 {
return false
}
lastSeen := session.LastSeenAt
if lastSeen.IsZero() {
lastSeen = session.ConnectedAt
}
if lastSeen.IsZero() {
return false
}
return now.Sub(lastSeen) > r.staleAfter
}
func sortByKind(candidates []*WorkerSession, jobType string, detect bool) {
sort.Slice(candidates, func(i, j int) bool {
a := candidates[i]
b := candidates[j]
ac := a.Capabilities[jobType]
bc := b.Capabilities[jobType]
aSlots := availableSlotsByKind(a, ac, detect)
bSlots := availableSlotsByKind(b, bc, detect)
if aSlots != bSlots {
return aSlots > bSlots
}
return a.WorkerID < b.WorkerID
})
}
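// rotateTopCandidatesLocked rotates the leading group of slot-tied candidates
// by a per-job-type cursor so repeated picks round-robin across equally
// loaded workers; callers must hold the registry lock.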
func (r *Registry) rotateTopCandidatesLocked(candidates []*WorkerSession, jobType string, detect bool) {
if len(candidates) < 2 {
return
}
capability := candidates[0].Capabilities[jobType]
topSlots := availableSlotsByKind(candidates[0], capability, detect)
tieEnd := 1
for tieEnd < len(candidates) {
nextCapability := candidates[tieEnd].Capabilities[jobType]
if availableSlotsByKind(candidates[tieEnd], nextCapability, detect) != topSlots {
break
}
tieEnd++
}
if tieEnd <= 1 {
return
}
cursorKey := strings.TrimSpace(jobType)
if cursorKey == "" {
cursorKey = "*"
}
var offset int
if detect {
offset = r.detectorCursor[cursorKey] % tieEnd
r.detectorCursor[cursorKey] = (offset + 1) % tieEnd
} else {
offset = r.executorCursor[cursorKey] % tieEnd
r.executorCursor[cursorKey] = (offset + 1) % tieEnd
}
if offset == 0 {
return
}
prefix := append([]*WorkerSession(nil), candidates[:tieEnd]...)
for i := 0; i < tieEnd; i++ {
candidates[i] = prefix[(i+offset)%tieEnd]
}
}
func availableSlotsByKind(
session *WorkerSession,
capability *plugin_pb.JobTypeCapability,
detect bool,
) int {
if detect {
return availableDetectionSlots(session, capability)
}
return availableExecutionSlots(session, capability)
}
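// availableDetectionSlots prefers heartbeat-reported free detection slots,
// falling back to the capability maximum, then to one.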
func availableDetectionSlots(session *WorkerSession, capability *plugin_pb.JobTypeCapability) int {
if session.Heartbeat != nil && session.Heartbeat.DetectionSlotsTotal > 0 {
free := int(session.Heartbeat.DetectionSlotsTotal - session.Heartbeat.DetectionSlotsUsed)
if free < 0 {
return 0
}
return free
}
if capability.MaxDetectionConcurrency > 0 {
return int(capability.MaxDetectionConcurrency)
}
return 1
}
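// availableExecutionSlots prefers heartbeat-reported free execution slots,
// falling back to the capability maximum, then to one.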
func availableExecutionSlots(session *WorkerSession, capability *plugin_pb.JobTypeCapability) int {
if session.Heartbeat != nil && session.Heartbeat.ExecutionSlotsTotal > 0 {
free := int(session.Heartbeat.ExecutionSlotsTotal - session.Heartbeat.ExecutionSlotsUsed)
if free < 0 {
return 0
}
return free
}
if capability.MaxExecutionConcurrency > 0 {
return int(capability.MaxExecutionConcurrency)
}
return 1
}
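// cloneWorkerSession deep-copies a session so callers cannot mutate registry
// state through returned values.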
func cloneWorkerSession(in *WorkerSession) *WorkerSession {
if in == nil {
return nil
}
out := *in
out.Capabilities = make(map[string]*plugin_pb.JobTypeCapability, len(in.Capabilities))
for jobType, capability := range in.Capabilities {
out.Capabilities[jobType] = cloneJobTypeCapability(capability)
}
out.Heartbeat = cloneWorkerHeartbeat(in.Heartbeat)
return &out
}
func cloneJobTypeCapability(in *plugin_pb.JobTypeCapability) *plugin_pb.JobTypeCapability {
if in == nil {
return nil
}
out := *in
return &out
}
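// cloneWorkerHeartbeat deep-copies a heartbeat, including running work
// entries, queued-job counts, and metadata.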
func cloneWorkerHeartbeat(in *plugin_pb.WorkerHeartbeat) *plugin_pb.WorkerHeartbeat {
if in == nil {
return nil
}
out := *in
if in.RunningWork != nil {
out.RunningWork = make([]*plugin_pb.RunningWork, 0, len(in.RunningWork))
for _, rw := range in.RunningWork {
if rw == nil {
continue
}
clone := *rw
out.RunningWork = append(out.RunningWork, &clone)
}
}
if in.QueuedJobsByType != nil {
out.QueuedJobsByType = make(map[string]int32, len(in.QueuedJobsByType))
for k, v := range in.QueuedJobsByType {
out.QueuedJobsByType[k] = v
}
}
if in.Metadata != nil {
out.Metadata = make(map[string]string, len(in.Metadata))
for k, v := range in.Metadata {
out.Metadata[k] = v
}
}
return &out
}


@@ -0,0 +1,321 @@
package plugin
import (
"reflect"
"testing"
"time"
"github.com/seaweedfs/seaweedfs/weed/pb/plugin_pb"
)
func TestRegistryPickDetectorPrefersMoreFreeSlots(t *testing.T) {
t.Parallel()
r := NewRegistry()
r.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-a",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "vacuum", CanDetect: true, CanExecute: true, MaxDetectionConcurrency: 2, MaxExecutionConcurrency: 2},
},
})
r.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-b",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "vacuum", CanDetect: true, CanExecute: true, MaxDetectionConcurrency: 4, MaxExecutionConcurrency: 4},
},
})
r.UpdateHeartbeat("worker-a", &plugin_pb.WorkerHeartbeat{
WorkerId: "worker-a",
DetectionSlotsUsed: 1,
DetectionSlotsTotal: 2,
})
r.UpdateHeartbeat("worker-b", &plugin_pb.WorkerHeartbeat{
WorkerId: "worker-b",
DetectionSlotsUsed: 1,
DetectionSlotsTotal: 4,
})
picked, err := r.PickDetector("vacuum")
if err != nil {
t.Fatalf("PickDetector: %v", err)
}
if picked.WorkerID != "worker-b" {
t.Fatalf("unexpected detector picked: got %s want worker-b", picked.WorkerID)
}
}
func TestRegistryPickExecutorAllowsSameWorker(t *testing.T) {
t.Parallel()
r := NewRegistry()
r.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-x",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "balance", CanDetect: true, CanExecute: true, MaxDetectionConcurrency: 1, MaxExecutionConcurrency: 1},
},
})
detector, err := r.PickDetector("balance")
if err != nil {
t.Fatalf("PickDetector: %v", err)
}
executor, err := r.PickExecutor("balance")
if err != nil {
t.Fatalf("PickExecutor: %v", err)
}
if detector.WorkerID != "worker-x" || executor.WorkerID != "worker-x" {
t.Fatalf("expected same worker for detect/execute, got detector=%s executor=%s", detector.WorkerID, executor.WorkerID)
}
}
func TestRegistryDetectableJobTypes(t *testing.T) {
t.Parallel()
r := NewRegistry()
r.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-a",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "vacuum", CanDetect: true, CanExecute: true},
{JobType: "balance", CanDetect: false, CanExecute: true},
},
})
r.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-b",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "ec", CanDetect: true, CanExecute: false},
{JobType: "vacuum", CanDetect: true, CanExecute: false},
},
})
got := r.DetectableJobTypes()
want := []string{"ec", "vacuum"}
if !reflect.DeepEqual(got, want) {
t.Fatalf("unexpected detectable job types: got=%v want=%v", got, want)
}
}
func TestRegistryJobTypes(t *testing.T) {
t.Parallel()
r := NewRegistry()
r.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-a",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "vacuum", CanDetect: true},
{JobType: "balance", CanExecute: true},
},
})
r.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-b",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "ec", CanDetect: true},
},
})
got := r.JobTypes()
want := []string{"balance", "ec", "vacuum"}
if !reflect.DeepEqual(got, want) {
t.Fatalf("unexpected job types: got=%v want=%v", got, want)
}
}
func TestRegistryListExecutorsSortedBySlots(t *testing.T) {
t.Parallel()
r := NewRegistry()
r.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-a",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "balance", CanExecute: true, MaxExecutionConcurrency: 2},
},
})
r.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-b",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "balance", CanExecute: true, MaxExecutionConcurrency: 4},
},
})
r.UpdateHeartbeat("worker-a", &plugin_pb.WorkerHeartbeat{
WorkerId: "worker-a",
ExecutionSlotsUsed: 1,
ExecutionSlotsTotal: 2,
})
r.UpdateHeartbeat("worker-b", &plugin_pb.WorkerHeartbeat{
WorkerId: "worker-b",
ExecutionSlotsUsed: 1,
ExecutionSlotsTotal: 4,
})
executors, err := r.ListExecutors("balance")
if err != nil {
t.Fatalf("ListExecutors: %v", err)
}
if len(executors) != 2 {
t.Fatalf("unexpected candidate count: got=%d", len(executors))
}
if executors[0].WorkerID != "worker-b" || executors[1].WorkerID != "worker-a" {
t.Fatalf("unexpected executor order: got=%s,%s", executors[0].WorkerID, executors[1].WorkerID)
}
}
func TestRegistryPickExecutorRoundRobinForTopTie(t *testing.T) {
t.Parallel()
r := NewRegistry()
for _, workerID := range []string{"worker-a", "worker-b", "worker-c"} {
r.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: workerID,
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "balance", CanExecute: true, MaxExecutionConcurrency: 1},
},
})
}
got := make([]string, 0, 6)
for i := 0; i < 6; i++ {
executor, err := r.PickExecutor("balance")
if err != nil {
t.Fatalf("PickExecutor: %v", err)
}
got = append(got, executor.WorkerID)
}
want := []string{"worker-a", "worker-b", "worker-c", "worker-a", "worker-b", "worker-c"}
if !reflect.DeepEqual(got, want) {
t.Fatalf("unexpected pick order: got=%v want=%v", got, want)
}
}
func TestRegistryListExecutorsRoundRobinForTopTie(t *testing.T) {
t.Parallel()
r := NewRegistry()
r.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-a",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "balance", CanExecute: true, MaxExecutionConcurrency: 2},
},
})
r.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-b",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "balance", CanExecute: true, MaxExecutionConcurrency: 2},
},
})
r.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-c",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "balance", CanExecute: true, MaxExecutionConcurrency: 1},
},
})
r.UpdateHeartbeat("worker-a", &plugin_pb.WorkerHeartbeat{
WorkerId: "worker-a",
ExecutionSlotsUsed: 0,
ExecutionSlotsTotal: 2,
})
r.UpdateHeartbeat("worker-b", &plugin_pb.WorkerHeartbeat{
WorkerId: "worker-b",
ExecutionSlotsUsed: 0,
ExecutionSlotsTotal: 2,
})
r.UpdateHeartbeat("worker-c", &plugin_pb.WorkerHeartbeat{
WorkerId: "worker-c",
ExecutionSlotsUsed: 0,
ExecutionSlotsTotal: 1,
})
firstCall, err := r.ListExecutors("balance")
if err != nil {
t.Fatalf("ListExecutors first call: %v", err)
}
secondCall, err := r.ListExecutors("balance")
if err != nil {
t.Fatalf("ListExecutors second call: %v", err)
}
thirdCall, err := r.ListExecutors("balance")
if err != nil {
t.Fatalf("ListExecutors third call: %v", err)
}
if firstCall[0].WorkerID != "worker-a" || firstCall[1].WorkerID != "worker-b" || firstCall[2].WorkerID != "worker-c" {
t.Fatalf("unexpected first executor order: got=%s,%s,%s", firstCall[0].WorkerID, firstCall[1].WorkerID, firstCall[2].WorkerID)
}
if secondCall[0].WorkerID != "worker-b" || secondCall[1].WorkerID != "worker-a" || secondCall[2].WorkerID != "worker-c" {
t.Fatalf("unexpected second executor order: got=%s,%s,%s", secondCall[0].WorkerID, secondCall[1].WorkerID, secondCall[2].WorkerID)
}
if thirdCall[0].WorkerID != "worker-a" || thirdCall[1].WorkerID != "worker-b" || thirdCall[2].WorkerID != "worker-c" {
t.Fatalf("unexpected third executor order: got=%s,%s,%s", thirdCall[0].WorkerID, thirdCall[1].WorkerID, thirdCall[2].WorkerID)
}
}
func TestRegistrySkipsStaleWorkersForSelectionAndListing(t *testing.T) {
t.Parallel()
r := NewRegistry()
r.staleAfter = 2 * time.Second
r.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-stale",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "vacuum", CanDetect: true, CanExecute: true},
},
})
r.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-fresh",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "vacuum", CanDetect: true, CanExecute: true},
},
})
r.mu.Lock()
r.sessions["worker-stale"].LastSeenAt = time.Now().Add(-10 * time.Second)
r.sessions["worker-fresh"].LastSeenAt = time.Now()
r.mu.Unlock()
picked, err := r.PickDetector("vacuum")
if err != nil {
t.Fatalf("PickDetector: %v", err)
}
if picked.WorkerID != "worker-fresh" {
t.Fatalf("unexpected detector: got=%s want=worker-fresh", picked.WorkerID)
}
if _, ok := r.Get("worker-stale"); ok {
t.Fatalf("expected stale worker to be hidden from Get")
}
if _, ok := r.Get("worker-fresh"); !ok {
t.Fatalf("expected fresh worker from Get")
}
listed := r.List()
if len(listed) != 1 || listed[0].WorkerID != "worker-fresh" {
t.Fatalf("unexpected listed workers: %+v", listed)
}
}
func TestRegistryReturnsNoDetectorWhenAllWorkersStale(t *testing.T) {
t.Parallel()
r := NewRegistry()
r.staleAfter = 2 * time.Second
r.UpsertFromHello(&plugin_pb.WorkerHello{
WorkerId: "worker-a",
Capabilities: []*plugin_pb.JobTypeCapability{
{JobType: "vacuum", CanDetect: true},
},
})
r.mu.Lock()
r.sessions["worker-a"].LastSeenAt = time.Now().Add(-10 * time.Second)
r.mu.Unlock()
if _, err := r.PickDetector("vacuum"); err == nil {
t.Fatalf("expected no detector when all workers are stale")
}
}

weed/admin/plugin/types.go

@@ -0,0 +1,103 @@
package plugin
import "time"
const (
// Keep exactly the last 10 successful and last 10 error runs per job type.
MaxSuccessfulRunHistory = 10
MaxErrorRunHistory = 10
)
type RunOutcome string
const (
RunOutcomeSuccess RunOutcome = "success"
RunOutcomeError RunOutcome = "error"
)
type JobRunRecord struct {
RunID string `json:"run_id"`
JobID string `json:"job_id"`
JobType string `json:"job_type"`
WorkerID string `json:"worker_id"`
Outcome RunOutcome `json:"outcome"`
Message string `json:"message,omitempty"`
DurationMs int64 `json:"duration_ms,omitempty"`
CompletedAt *time.Time `json:"completed_at,omitempty"`
}
type JobTypeRunHistory struct {
JobType string `json:"job_type"`
SuccessfulRuns []JobRunRecord `json:"successful_runs"`
ErrorRuns []JobRunRecord `json:"error_runs"`
LastUpdatedTime *time.Time `json:"last_updated_time,omitempty"`
}
type TrackedJob struct {
JobID string `json:"job_id"`
JobType string `json:"job_type"`
RequestID string `json:"request_id"`
WorkerID string `json:"worker_id"`
DedupeKey string `json:"dedupe_key,omitempty"`
Summary string `json:"summary,omitempty"`
Detail string `json:"detail,omitempty"`
Parameters map[string]interface{} `json:"parameters,omitempty"`
Labels map[string]string `json:"labels,omitempty"`
State string `json:"state"`
Progress float64 `json:"progress"`
Stage string `json:"stage,omitempty"`
Message string `json:"message,omitempty"`
Attempt int32 `json:"attempt,omitempty"`
CreatedAt *time.Time `json:"created_at,omitempty"`
UpdatedAt *time.Time `json:"updated_at,omitempty"`
CompletedAt *time.Time `json:"completed_at,omitempty"`
ErrorMessage string `json:"error_message,omitempty"`
ResultSummary string `json:"result_summary,omitempty"`
ResultOutputValues map[string]interface{} `json:"result_output_values,omitempty"`
}
type JobActivity struct {
JobID string `json:"job_id"`
JobType string `json:"job_type"`
RequestID string `json:"request_id,omitempty"`
WorkerID string `json:"worker_id,omitempty"`
Source string `json:"source"`
Message string `json:"message"`
Stage string `json:"stage,omitempty"`
Details map[string]interface{} `json:"details,omitempty"`
OccurredAt *time.Time `json:"occurred_at,omitempty"`
}
type JobDetail struct {
Job *TrackedJob `json:"job"`
RunRecord *JobRunRecord `json:"run_record,omitempty"`
Activities []JobActivity `json:"activities"`
RelatedJobs []TrackedJob `json:"related_jobs,omitempty"`
LastUpdated *time.Time `json:"last_updated,omitempty"`
}
type SchedulerJobTypeState struct {
JobType string `json:"job_type"`
Enabled bool `json:"enabled"`
PolicyError string `json:"policy_error,omitempty"`
DetectionInFlight bool `json:"detection_in_flight"`
NextDetectionAt *time.Time `json:"next_detection_at,omitempty"`
DetectionIntervalSeconds int32 `json:"detection_interval_seconds,omitempty"`
DetectionTimeoutSeconds int32 `json:"detection_timeout_seconds,omitempty"`
ExecutionTimeoutSeconds int32 `json:"execution_timeout_seconds,omitempty"`
MaxJobsPerDetection int32 `json:"max_jobs_per_detection,omitempty"`
GlobalExecutionConcurrency int `json:"global_execution_concurrency,omitempty"`
PerWorkerExecutionConcurrency int `json:"per_worker_execution_concurrency,omitempty"`
RetryLimit int `json:"retry_limit,omitempty"`
RetryBackoffSeconds int32 `json:"retry_backoff_seconds,omitempty"`
DetectorAvailable bool `json:"detector_available"`
DetectorWorkerID string `json:"detector_worker_id,omitempty"`
ExecutorWorkerCount int `json:"executor_worker_count"`
}
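// timeToPtr returns a pointer to t, or nil when t is the zero time.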
func timeToPtr(t time.Time) *time.Time {
if t.IsZero() {
return nil
}
return &t
}